
Double-Hard Debias:
Tailoring Word Embeddings for Gender Bias Mitigation

Tianlu Wang1   Xi Victoria Lin2   Nazneen Fatema Rajani2
Bryan McCann2   Vicente Ordonez1   Caiming Xiong2
1University of Virginia   {tw8cb, vicente}@virginia.edu
2Salesforce Research   {xilin, nazneen.rajani, bmccann, cxiong}@salesforce.com
  This research was conducted during the author’s internship at Salesforce Research.
Abstract

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm  Bolukbasi et al. (2016), apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double-Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.

1 Introduction

Despite their widespread use in natural language processing (NLP) tasks, word embeddings have been criticized for inheriting unintended gender bias from training corpora. Bolukbasi et al. (2016) highlight that in word2vec embeddings trained on the Google News dataset (Mikolov et al., 2013a), “programmer” is more closely associated with “man” and “homemaker” is more closely associated with “woman”. Such gender bias also propagates to downstream tasks. Studies have shown that coreference resolution systems exhibit gender bias in their predictions due to the use of biased word embeddings (Zhao et al., 2018a; Rudinger et al., 2018). Given that pre-trained word embeddings have been integrated into a vast number of NLP models, it is important to debias them to prevent discrimination in NLP systems.

To mitigate gender bias, prior work has proposed to remove the gender component from pre-trained word embeddings through post-processing (Bolukbasi et al., 2016), or to compress the gender information into a few dimensions of the embedding space using a modified training scheme (Zhao et al., 2018b; Kaneko and Bollegala, 2019). We focus on post-hoc gender bias mitigation for two reasons: 1) debiasing via a new training approach is more computationally expensive; and 2) pre-trained biased word embeddings have already been extensively adopted in downstream NLP products, and post-hoc bias mitigation presumably requires fewer changes in the model pipeline since it keeps the core components of the original embeddings.

Existing post-processing algorithms, including the seminal Hard Debias (Bolukbasi et al., 2016), debias embeddings by removing the component that corresponds to a gender direction as defined by a list of gendered words. While Bolukbasi et al. (2016) demonstrate that such methods alleviate gender bias in word analogy tasks, Gonen and Goldberg (2019) argue that the effectiveness of these efforts is limited, as the gender bias can still be recovered from the geometry of the debiased embeddings.

Figure 1: $\Delta$ of cosine similarities between gender difference vectors before / after adjusting the frequency of word $w$: (a) changing the frequency of “boy”; (b) changing the frequency of “daughter”. When the frequency of $w$ changes, the cosine similarities between the gender difference vector $\overrightarrow{v}$ for $w$ and the other gender difference vectors exhibit a large change. This demonstrates that the frequency statistics for $w$ have a strong influence on the gender direction represented by $\overrightarrow{v}$.

We hypothesize that it is difficult to isolate the gender component of word embeddings in the manner employed by existing post-processing methods. For example, Gong et al. (2018) and Mu and Viswanath (2018) show that word frequency significantly impacts the geometry of word embeddings. Consequently, popular words and rare words cluster in different subregions of the embedding space, despite the fact that words in these clusters are not semantically similar. This can degrade the ability of component-based methods to remove gender bias.

Specifically, recall that Hard Debias seeks to remove the component of the embeddings corresponding to the gender direction. The important assumption made by Hard Debias is that we can effectively identify and isolate this gender direction. However, we posit that word frequency in the training corpora can twist the gender direction and limit the effectiveness of Hard Debias.

To this end, we propose a novel debiasing algorithm, Double-Hard Debias, that builds upon the existing Hard Debias technique. It consists of two steps. First, we project word embeddings into an intermediate subspace by subtracting component(s) related to word frequency. This mitigates the impact of frequency on the gender direction. Then we apply Hard Debias to these purified embeddings to mitigate gender bias. Mu and Viswanath (2018) showed that typically more than one dominant direction in the embedding space encodes frequency features. We test the effect of each dominant direction on debiasing performance and only remove the one(s) that demonstrate the most impact.

We evaluate our proposed debiasing method using a wide range of evaluation techniques. According to both representation-level evaluations (the WEAT test (Caliskan et al., 2017) and the neighborhood metric (Gonen and Goldberg, 2019)) and a downstream task evaluation (coreference resolution (Zhao et al., 2018a)), Double-Hard Debias outperforms all previous debiasing methods. We also evaluate the functionality of the debiased embeddings on several benchmark datasets to demonstrate that Double-Hard Debias effectively mitigates gender bias without sacrificing the quality of the word embeddings. Code and data are available at https://github.com/uvavision/Double-Hard-Debias.git.

2 Motivation

Current post-hoc debiasing methods attempt to reduce gender bias in word embeddings by subtracting the component associated with gender from them. Identifying the gender direction in the word embedding space requires a set of gender word pairs, $\mathcal{P}$, which consists of “she & he”, “daughter & son”, etc. For every pair, for example “boy & girl”, the difference vector of the two embeddings is expected to approximately capture the gender direction:

$\overrightarrow{v}_{boy,girl}=\overrightarrow{w}_{boy}-\overrightarrow{w}_{girl}$ (1)

Bolukbasi et al. (2016) compute the first principal component of ten such difference vectors and use it to define the gender direction. (The complete definition of $\mathcal{P}$ is: “woman & man”, “girl & boy”, “she & he”, “mother & father”, “daughter & son”, “gal & guy”, “female & male”, “her & his”, “herself & himself”, and “Mary & John” (Bolukbasi et al., 2016).)

Recent works (Mu and Viswanath, 2018; Gong et al., 2018) show that word frequency in a training corpus can degrade the quality of word embeddings. By carefully removing such frequency features, existing word embeddings can achieve higher performance on several benchmarks after fine-tuning. We hypothesize that such word frequency statistics also interfere with the components of the word embeddings associated with gender. In other words, frequency-based features learned by word embedding algorithms act as harmful noise for the previously proposed debiasing techniques.

To verify this, we first retrain GloVe (Pennington et al., 2014) embeddings on the One Billion Word Benchmark (Chelba et al., 2013), following previous work (Zhao et al., 2018b; Kaneko and Bollegala, 2019). We obtain ten difference vectors for the gendered pairs in $\mathcal{P}$ and compute their pairwise cosine similarities. This gives a similarity matrix $\mathcal{S}$ in which $\mathcal{S}_{p_i,p_j}$ denotes the cosine similarity between difference vectors $\overrightarrow{v}_{pair_i}$ and $\overrightarrow{v}_{pair_j}$.

We then select a specific word pair, e.g., “boy” & “girl”, and augment the corpus by sampling each sentence containing the word “boy” twice. In this way, we produce a new training corpus with altered word frequency statistics for “boy”. The context around each token remains the same, so changes to the other components are negligible. We retrain GloVe on this augmented corpus and obtain a new set of difference vectors for the gendered pairs in $\mathcal{P}$. We then compute a second similarity matrix $\mathcal{S}'$ in which $\mathcal{S}'_{p_i,p_j}$ denotes the cosine similarity between difference vectors $\overrightarrow{v}'_{pair_i}$ and $\overrightarrow{v}'_{pair_j}$.

By comparing these two similarity matrices, we analyze the effect of changing word frequency statistics on the gender direction. Since the difference vectors are designed to approximate the gender direction, we focus on how they change. Because the statistics were altered for “boy”, we examine the difference vector $\overrightarrow{v}_{boy,girl}$ and make two observations. First, the norm of $\overrightarrow{v}_{boy,girl}$ has a 5.8% relative change, while the norms of the other difference vectors change much less; for example, the norm of $\overrightarrow{v}_{man,woman}$ only changes by 1.8%. Second, the cosine similarities between $\overrightarrow{v}_{boy,girl}$ and the other difference vectors also change more significantly, as highlighted by the red bounding box in Figure 1(a). The frequency change of “boy” thus deviates the gender direction captured by $\overrightarrow{v}_{boy,girl}$. We observe a similar phenomenon when we change the frequency of the word “daughter”; these results are shown in Figure 1(b).
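For readers who wish to reproduce this check, the comparison can be sketched in a few lines of Python. This is only an illustrative sketch: the pair list follows $\mathcal{P}$ (lower-cased here, an assumption for uncased embeddings), and emb_orig / emb_aug are assumed to be dictionaries of GloVe vectors from the original and frequency-augmented training runs.

    import numpy as np

    # The gendered pairs P of Bolukbasi et al. (2016), lower-cased here (assumption).
    PAIRS = [("woman", "man"), ("girl", "boy"), ("she", "he"), ("mother", "father"),
             ("daughter", "son"), ("gal", "guy"), ("female", "male"), ("her", "his"),
             ("herself", "himself"), ("mary", "john")]

    def difference_vectors(emb):
        """emb: dict mapping word -> np.ndarray. One difference vector per gendered pair."""
        return np.stack([emb[a] - emb[b] for a, b in PAIRS])

    def cosine_similarity_matrix(vectors):
        normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        return normed @ normed.T

    # emb_orig / emb_aug: embeddings trained on the original and augmented corpora.
    # S = cosine_similarity_matrix(difference_vectors(emb_orig))
    # S_aug = cosine_similarity_matrix(difference_vectors(emb_aug))
    # delta = S_aug - S   # the quantity visualized in Figure 1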

Based on these observations, we conclude that word frequency plays an important role in gender debiasing, despite being overlooked by previous work.

3 Method

In this section, we first summarize the terminology that will be used throughout the rest of the paper, briefly review the Hard Debias method, and provide background on the neighborhood evaluation metric. Then we introduce our proposed method: Double-Hard Debias.

3.1 Preliminary Definitions

Let $W$ be the vocabulary of the word embeddings we aim to debias. The set of word embeddings contains a vector $\overrightarrow{w}\in\mathbb{R}^{d}$ for each word $w\in W$. A subspace $B$ is defined by $k$ orthogonal unit vectors $B=\{b_{1},\dots,b_{k}\}\subset\mathbb{R}^{d}$. We denote the projection of a vector $v$ onto $B$ by

$v_{B}=\sum_{j=1}^{k}(v\cdot b_{j})\,b_{j}.$ (2)
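In code, Eq. 2 is simply a sum of scalar projections onto the orthonormal basis vectors of $B$. A minimal sketch (assuming NumPy arrays):

    import numpy as np

    def project_onto_subspace(v, basis):
        """Eq. 2: basis is an iterable of orthonormal vectors b_1..b_k (each of shape (d,))."""
        return sum((v @ b) * b for b in basis)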

Following Bolukbasi et al. (2016), we assume there is a set of gender-neutral words $N\subset W$, such as “doctor” and “teacher”, which by definition are not specific to any gender. We also assume a pre-defined set of $n$ male-female word pairs $D_{1},D_{2},\dots,D_{n}\subset W$, where the main difference between each pair of words captures gender.

Hard Debias. The Hard Debias algorithm first identifies a subspace that captures gender bias. Let

$\mu_{i}:=\sum_{w\in D_{i}}\overrightarrow{w}/|D_{i}|.$ (3)

The bias subspace $B$ is given by the first $k$ ($k\geq 1$) rows of $\mathrm{SVD}(\mathbf{C})$, where

$\mathbf{C}:=\sum_{i=1}^{n}\sum_{w\in D_{i}}(\overrightarrow{w}-\mu_{i})^{T}(\overrightarrow{w}-\mu_{i})/|D_{i}|$ (4)

Following the original implementation of Bolukbasi et al. (2016), we set $k=1$. As a result, the subspace $B$ is simply a gender direction. (Bolukbasi et al. (2016) normalize all embeddings; however, we found this unnecessary in our experiments, as also noted by Ethayarajh et al. (2019).)

Hard Debias then neutralizes the word embeddings by transforming each $\overrightarrow{w}$ such that every word $w\in N$ has zero projection onto the gender subspace. For each word $w\in N$, we re-embed $\overrightarrow{w}$:

$\overrightarrow{w}:=\overrightarrow{w}-\overrightarrow{w}_{B}$ (5)
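For concreteness, the following Python sketch implements the neutralize step described above: the gender direction is the top right singular vector of the stacked, per-pair-centered defining vectors (equivalent, up to a constant scaling of $\mathbf{C}$, to the first row of $\mathrm{SVD}(\mathbf{C})$ for $k=1$), and every gender-neutral word is then re-embedded by Eq. 5. This is a sketch only; the equalize step of Bolukbasi et al. (2016) and their optional normalization are omitted.

    import numpy as np

    def gender_direction(emb, pairs):
        """First principal direction of the centered defining-pair embeddings (Eqs. 3-4, k=1)."""
        diffs = []
        for a, b in pairs:
            center = (emb[a] + emb[b]) / 2
            diffs.append(emb[a] - center)
            diffs.append(emb[b] - center)
        _, _, vt = np.linalg.svd(np.stack(diffs), full_matrices=False)
        return vt[0]  # unit vector spanning the bias subspace B

    def hard_debias(emb, neutral_words, pairs):
        """Eq. 5: zero out the projection of every gender-neutral word on the gender direction."""
        g = gender_direction(emb, pairs)
        debiased = dict(emb)
        for w in neutral_words:
            debiased[w] = emb[w] - (emb[w] @ g) * g
        return debiased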

Neighborhood Metric. The Neighborhood Metric proposed by Gonen and Goldberg (2019) is a bias measurement that does not rely on any specific gender direction. Instead, it looks at similarities between words: the bias of a word is the proportion of words with the same gender bias polarity among its nearest neighbors.

We select the $k$ most biased male and the $k$ most biased female words according to the cosine similarity between their embeddings and the gender direction computed from the word embeddings prior to bias mitigation. We use $W_{m}$ and $W_{f}$ to denote the male- and female-biased words, respectively. For $w_{i}\in W_{m}$, we assign a ground-truth gender label $g_{i}=0$; for $w_{i}\in W_{f}$, $g_{i}=1$. We then run KMeans ($k=2$) to cluster the embeddings of the selected words, $\hat{g_{i}}=\mathrm{KMeans}(\overrightarrow{w}_{i})$, and compute the alignment score $a$ with respect to the assigned ground-truth gender labels:

$a=\frac{1}{2k}\sum_{i=1}^{2k}\mathbbm{1}[\hat{g_{i}}=g_{i}]$ (6)

We set $a=\max(a,1-a)$. Thus, a value of 0.5 in this metric indicates perfectly unbiased word embeddings (i.e., the words are randomly clustered), and a value closer to 1 indicates stronger gender bias.
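A minimal sketch of this metric, using scikit-learn's KMeans (the word lists $W_{m}$ and $W_{f}$ are assumed to be given):

    import numpy as np
    from sklearn.cluster import KMeans

    def neighborhood_bias_score(emb, male_words, female_words, seed=0):
        """Cluster the most biased words with KMeans (k=2) and return max(a, 1-a) (Eq. 6)."""
        words = list(male_words) + list(female_words)
        labels = np.array([0] * len(male_words) + [1] * len(female_words))  # ground-truth gender
        X = np.stack([emb[w] for w in words])
        pred = KMeans(n_clusters=2, random_state=seed, n_init=10).fit_predict(X)
        a = np.mean(pred == labels)
        return max(a, 1 - a)  # 0.5 = unbiased, values near 1.0 = strongly biased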

3.2 Double-Hard Debiasing

Algorithm 1: Double-Hard Debias.
Input: word embeddings $\{\overrightarrow{w}\in\mathbb{R}^{d},\ w\in\mathcal{W}\}$; male-biased word set $W_{m}$; female-biased word set $W_{f}$.
  $S_{debias}\leftarrow[\,]$
  Decentralize: $\mu\leftarrow\frac{1}{|\mathcal{W}|}\sum_{w\in\mathcal{W}}\overrightarrow{w}$; for each $w\in\mathcal{W}$, $\tilde{w}\leftarrow\overrightarrow{w}-\mu$
  Compute principal components by PCA: $\{\mathbf{u}_{1},\ldots,\mathbf{u}_{d}\}\leftarrow\mathrm{PCA}(\{\tilde{w},\ w\in\mathcal{W}\})$
  // discover the frequency directions
  for $i=1$ to $d$ do
      $w_{m}^{\prime}\leftarrow\tilde{w}_{m}-(\mathbf{u}_{i}^{T}w_{m})\mathbf{u}_{i}$
      $w_{f}^{\prime}\leftarrow\tilde{w}_{f}-(\mathbf{u}_{i}^{T}w_{f})\mathbf{u}_{i}$
      $\hat{w}_{m}\leftarrow\mathrm{HardDebias}(w_{m}^{\prime})$
      $\hat{w}_{f}\leftarrow\mathrm{HardDebias}(w_{f}^{\prime})$
      $output\leftarrow\mathrm{KMeans}([\hat{w}_{m};\hat{w}_{f}])$
      $a\leftarrow\mathrm{eval}(output,W_{m},W_{f})$
      append $a$ to $S_{debias}$
  end for
  $k\leftarrow\operatorname{arg\,min}_{i}S_{debias}$
  // remove the component along the frequency direction
  $w^{\prime}\leftarrow\tilde{w}-(\mathbf{u}_{k}^{T}w)\mathbf{u}_{k}$
  // remove the component along the gender direction
  $\hat{w}\leftarrow\mathrm{HardDebias}(w^{\prime})$
Output: debiased word embeddings $\{\hat{w}\in\mathbb{R}^{d},\ w\in\mathcal{W}\}$
Figure 2: Clustering accuracy after projecting out the D-th dominating direction and applying Hard Debias. Lower accuracy indicates less bias.

According to Mu and Viswanath (2018), the most statistically dominant directions of word embeddings encode word frequency to a significant extent. Mu and Viswanath (2018) remove these frequency features by centering the embeddings and subtracting the components along the top $D$ dominant directions from the original word embeddings. The post-processed embeddings achieve better performance on several benchmark tasks, including word similarity, concept categorization, and word analogy. They also suggest that setting $D$ near $d/100$, where $d$ is the dimension of a word embedding, provides the maximum benefit.
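A sketch of this post-processing step follows. It is an illustrative variant that removes the projections of the centered vectors on the top-D principal components; the exact handling of centering matches Mu and Viswanath (2018) only up to a constant shift.

    import numpy as np
    from sklearn.decomposition import PCA

    def all_but_the_top(vectors, n_components):
        """Center the embedding matrix (shape: vocab_size x d) and remove its projections
        on the top-D dominant directions, which mostly encode frequency."""
        mu = vectors.mean(axis=0)
        centered = vectors - mu
        U = PCA(n_components=n_components).fit(centered).components_  # (D, d), orthonormal rows
        return centered - centered @ U.T @ U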

We speculate that these dominant directions also affect the geometry of the gender space. To address this, we use the aforementioned clustering experiment to identify whether a direction contains frequency features that alter the gender direction.

More specifically, we first pick the top biased words (500 male and 500 female) identified using the original GloVe embeddings. We then apply PCA to all word embeddings and take the top principal components as candidate directions to drop. For every candidate direction $\mathbf{u}$, we project the embeddings into a space orthogonal to $\mathbf{u}$. In this intermediate subspace, we apply Hard Debias to obtain debiased embeddings. Next, we cluster the debiased embeddings of these words and compute the gender alignment accuracy (Eq. 6). This indicates whether projecting away direction $\mathbf{u}$ improves the debiasing performance. Algorithm 1 shows the details of our method in full.
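The procedure can also be summarized in Python by composing the hard_debias and neighborhood_bias_score sketches from Section 3.1. This is an illustrative sketch of Algorithm 1 rather than our released implementation; in practice the search loop only needs to debias and cluster the pre-selected biased words, which is much cheaper.

    import numpy as np
    from sklearn.decomposition import PCA

    def double_hard_debias(emb, male_words, female_words, neutral_words, pairs, n_candidates=20):
        """Pick the dominant direction whose removal helps debiasing most, project it out,
        then apply Hard Debias (cf. Algorithm 1)."""
        vocab = list(emb)
        X = np.stack([emb[w] for w in vocab])
        mu = X.mean(axis=0)
        components = PCA(n_components=n_candidates).fit(X - mu).components_

        scores = []
        for u in components:
            # remove the candidate frequency direction, then Hard Debias and cluster
            trial = {w: (emb[w] - mu) - ((emb[w] - mu) @ u) * u for w in vocab}
            trial = hard_debias(trial, neutral_words, pairs)
            scores.append(neighborhood_bias_score(trial, male_words, female_words))

        best = components[int(np.argmin(scores))]  # accuracy closest to 0.5 after debiasing
        purified = {w: (emb[w] - mu) - ((emb[w] - mu) @ best) * best for w in vocab}
        return hard_debias(purified, neutral_words, pairs)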

We found that for GloVe embeddings pre-trained on the Wikipedia dataset, eliminating the projection along the second principal component significantly decreases the clustering accuracy. This translates to better debiasing results, as shown in Figure 2. We further demonstrate the effectiveness of our method for debiasing using other evaluation metrics in Section 4.

4 Experiments

In this section, we compare our proposed method with other debiasing algorithms and test the functionality of the debiased embeddings on word analogy and concept categorization tasks. Experimental results demonstrate that our method reduces bias to a larger extent without degrading the quality of the word embeddings.

4.1 Dataset

We use 300-dimensional GloVe (Pennington et al., 2014) embeddings pre-trained on the 2017 January dump of English Wikipedia (https://github.com/uclanlp/gn_glove), containing 322,636 unique words; experiments on Word2Vec are included in the appendix. To identify the gender direction, we use 10 pairs of definitional gender words compiled by Bolukbasi et al. (2016) (https://github.com/tolga-b/debiaswe).

4.2 Baselines

We compare our proposed method against the following baselines:

GloVe: the pre-trained GloVe embeddings on the Wikipedia dataset described in Section 4.1. GloVe is widely used in various NLP applications. This is a non-debiased baseline for comparison.

GN-GloVe: We use the debiased Gender-Neutral GloVe (GN-GloVe) embeddings released by the original authors (Zhao et al., 2018b). GN-GloVe restricts gender information to certain dimensions while neutralizing the remaining dimensions.

GN-GloVe($w_{a}$): We exclude the gender dimensions from GN-GloVe. This baseline tries to completely remove gender.

GP-GloVe: We use the debiased embeddings released by the original authors (Kaneko and Bollegala, 2019). Gender-preserving Debiasing attempts to preserve non-discriminative gender information while removing stereotypical gender bias.

GP-GN-GloVe: This baseline applies Gender-preserving Debiasing to already debiased GN-GloVe embeddings. We also use the debiased embeddings provided by the authors.

Hard-GloVe: We apply Hard Debias (Bolukbasi et al., 2016) to GloVe embeddings. Following the implementation provided by the original authors, we debias neutral words and preserve the gender-specific words.

Strong Hard-GloVe: A variant of Hard Debias in which we debias all words rather than excluding gender-specific words. This seeks to entirely remove gender from the GloVe embeddings.

Double-Hard GloVe: We debias the pre-trained GloVe embeddings by our proposed Double-Hard Debias method.

Embeddings            OntoNotes   PRO-1   ANTI-1   Avg-1   |Diff-1|   PRO-2   ANTI-2   Avg-2   |Diff-2|
GloVe                      66.5    77.7     48.2    62.9       29.0    82.7     67.5    75.1       15.2
GN-GloVe                   66.1    68.4     56.5    62.5       12.0    78.2     71.3    74.7        6.9
GN-GloVe(w_a)              66.4    66.7     56.6    61.6       10.2    79.0     72.3    75.7        6.7
GP-GloVe                   66.1    72.0     52.0    62.0       20.0    78.5     70.0    74.3        8.6
GP-GN-GloVe                66.3    70.0     54.5    62.0       15.0    79.9     70.7    75.3        9.2
Hard-GloVe                 66.2    72.3     52.7    62.6       19.7    80.6     78.3    79.4        2.3
Strong Hard-GloVe          66.0    69.0     58.6    63.8       10.4    82.2     78.6    80.4        3.6
Double-Hard GloVe          66.4    66.0     58.3    62.2        7.7    85.4     84.5    85.0        0.9
Table 1: F1 score (%) of coreference systems on the OntoNotes test set and the WinoBias dataset. |Diff| represents the performance gap between the pro-stereotype (PRO) subset and the anti-stereotype (ANTI) subset. The coreference system trained on our Double-Hard GloVe embeddings has the smallest |Diff| values, suggesting less gender bias.

4.3 Evaluation of Debiasing Performance

We demonstrate the effectiveness of our debiasing method both in downstream applications and according to general embedding-level evaluations.

4.3.1 Debiasing in Downstream Applications

Coreference Resolution. Coreference resolution aims at identifying noun phrases that refer to the same entity. Zhao et al. (2018a) identified gender bias in modern coreference systems, e.g., “doctor” is prone to be linked to “he”. They also introduced a new benchmark dataset, WinoBias, to study gender bias in coreference systems.

WinoBias provides sentences following two prototypical templates. Each type of sentence can be divided into a pro-stereotype (PRO) subset and an anti-stereotype (ANTI) subset. In the PRO subset, gender pronouns refer to professions dominated by the same gender. For example, in the sentence “The physician hired the secretary because he was overwhelmed with clients.”, “he” refers to “physician”, which is consistent with the societal stereotype. The ANTI subset consists of the same sentences but with the opposite gender pronouns; thus “he” is replaced by “she” in the aforementioned example. The hypothesis is that gender cues may distract a coreference model. We consider a system to be gender biased if it performs better in pro-stereotypical scenarios than in anti-stereotypical ones.

We train an end-to-end coreference resolution model (Lee et al., 2017) with different word embeddings on the OntoNotes 5.0 training set and report performance on the WinoBias dataset. Results are presented in Table 1. Note that the absolute performance difference (|Diff|) between the PRO and ANTI sets reflects gender bias: a smaller |Diff| value indicates a less biased coreference system. On both types of sentences in WinoBias, Double-Hard GloVe achieves the smallest |Diff| compared to the other baselines, demonstrating the efficacy of our method. Meanwhile, Double-Hard GloVe maintains performance comparable to GloVe on the OntoNotes test set, showing that our method preserves the utility of the word embeddings. It is also worth noting that by reducing gender bias, Double-Hard GloVe significantly improves the average performance on type-2 sentences, from 75.1% (GloVe) to 85.0%.

4.3.2 Debiasing at Embedding Level

The Word Embeddings Association Test (WEAT). WEAT is a permutation test used to measure bias in word embeddings. We consider male names and female names as attribute sets and compute the differential association between two sets of target words and the gender attribute sets. (All word lists are from Caliskan et al. (2017); because the GloVe embeddings are uncased, we use lower-cased person names and replace “bill” with “tom” to avoid ambiguity.) We report effect sizes ($d$) and p-values ($p$) in Table 2. The effect size is a normalized measure of how separated the two distributions are; a higher effect size indicates larger gender bias between the target word sets. The p-value indicates whether the bias is significant: a high p-value (larger than 0.05) means the bias is insignificant. We refer readers to Caliskan et al. (2017) for more details.
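As an illustration, the effect size can be computed as follows (a sketch of the statistic of Caliskan et al. (2017); the permutation-test p-value is omitted for brevity):

    import numpy as np

    def _cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def _assoc(w, A, B, emb):
        """s(w, A, B): differential association of word w with attribute sets A and B."""
        return np.mean([_cos(emb[w], emb[a]) for a in A]) - np.mean([_cos(emb[w], emb[b]) for b in B])

    def weat_effect_size(X, Y, A, B, emb):
        """Normalized separation between the association distributions of target sets X and Y."""
        sx = [_assoc(x, A, B, emb) for x in X]
        sy = [_assoc(y, A, B, emb) for y in Y]
        return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)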

As shown in Table 2, across different target word sets, Double-Hard GloVe consistently outperforms the other debiased embeddings. For Career & Family and Science & Arts words, Double-Hard GloVe reaches the lowest effect size; for the latter, it successfully renders the bias insignificant (p-value > 0.05). Note that in the WEAT test some debiasing methods run the risk of amplifying gender bias, e.g., for Math & Arts words the bias is significant in GN-GloVe while it is insignificant in the original GloVe embeddings. This concern does not arise with Double-Hard GloVe.

Embeddings            Career & Family     Math & Arts       Science & Arts
                        d        p         d        p         d        p
GloVe                 1.81      0.0      0.55     0.14      0.88     0.04
GN-GloVe              1.82      0.0      1.21     6e-3      1.02     0.02
GN-GloVe(w_a)         1.76      0.0      1.43     1e-3      1.02     0.02
GP-GloVe              1.81      0.0      0.87     0.04      0.91     0.03
GP-GN-GloVe           1.80      0.0      1.42     1e-3      1.04     0.01
Hard-GloVe            1.55     2e-4      0.07     0.44      0.16     0.62
Strong Hard-GloVe     1.55     2e-4      0.07     0.44      0.16     0.62
Double-Hard GloVe     1.53     2e-4      0.09     0.57      0.15     0.61
Table 2: WEAT test of embeddings before/after debiasing. The bias is insignificant when the p-value p > 0.05. A lower effect size (d) indicates less gender bias. Significant gender bias related to Career & Family and Science & Arts words is effectively reduced by Double-Hard GloVe. Note that for Math & Arts words, gender bias is already insignificant in the original GloVe.

Neighborhood Metric. Gonen and Goldberg (2019) introduce a neighborhood metric based on clustering. As described in Section 3.1, we take the top $k$ most biased words according to their cosine similarity with the gender direction in the original GloVe embedding space. (To be fair, we exclude all gender-specific words used in debiasing, so Hard-GloVe and Strong Hard-GloVe have the same accuracy in Table 3.) We then run k-means to cluster them into two clusters and compute the alignment accuracy with respect to gender; results are presented in Table 3. Recall that under this metric an accuracy value closer to 0.5 indicates less biased word embeddings.

Using the original GloVe embeddings, k-means accurately clusters the selected words into a male group and a female group, suggesting the presence of strong bias. Hard Debias reduces bias to some degree, while the other baselines appear to be less effective. Double-Hard GloVe achieves the lowest accuracy when clustering the top 100/500/1000 biased words, demonstrating that the proposed technique effectively reduces gender bias. We also compute tSNE (van der Maaten and Hinton, 2008) projections for all baseline embeddings. As shown in Figure 3, the original non-debiased GloVe embeddings are clearly projected into different regions. Double-Hard GloVe mixes up the male and female embeddings to the greatest extent compared to the other baselines, showing that less gender information can be captured after debiasing.

Embeddings             Top 100   Top 500   Top 1000
GloVe                    100.0     100.0      100.0
GN-GloVe                 100.0     100.0       99.9
GN-GloVe(w_a)            100.0      99.7       88.5
GP-GloVe                 100.0     100.0      100.0
GP-GN-GloVe              100.0     100.0       99.4
(Strong) Hard GloVe       59.0      62.1       68.1
Double-Hard GloVe         51.5      55.5       59.5
Table 3: Clustering accuracy (%) of the top 100/500/1000 male and female words. Lower accuracy means fewer gender cues can be captured. Double-Hard GloVe consistently achieves the lowest accuracy.
Figure 3: tSNE visualization of the top 500 most male- and female-biased embeddings for each set of embeddings: (a) GloVe, (b) GN-GloVe, (c) GN-GloVe(w_a), (d) GP-GloVe, (e) GP-GN-GloVe, (f) Hard-GloVe, (g) Strong Hard-GloVe, (h) Double-Hard GloVe. Double-Hard GloVe mixes up the two groups to the greatest extent, showing that less gender information is encoded.

4.4 Analysis of Retaining Word Semantics

                          Analogy                            Concept Categorization
Embeddings             Sem    Syn   Total    MSR      AP    ESSLLI   Battig   BLESS
GloVe                 80.5   62.8    70.8   54.2    55.6     72.7     51.2    81.0
GN-GloVe              77.7   61.6    68.9   51.9    56.9     70.5     49.5    85.0
GN-GloVe(w_a)         77.7   61.6    68.9   51.9    56.9     75.0     51.3    82.5
GP-GloVe              80.6   61.7    70.3   51.3    56.1     75.0     49.0    78.5
GP-GN-GloVe           77.7   61.7    68.9   51.8    61.1     72.7     50.9    77.5
Hard-GloVe            80.3   62.5    70.6   54.0    62.3     79.5     50.0    84.5
Strong Hard-GloVe     78.6   62.4    69.8   53.9    64.1     79.5     49.2    84.5
Double-Hard GloVe     80.9   61.6    70.4   53.8    59.6     72.7     46.7    79.5
Table 4: Results of word embeddings on word analogy and concept categorization benchmark datasets. Performance (x100) is measured in accuracy and purity, respectively. On both tasks, there is no significant degradation of performance due to applying the proposed method.

Word Analogy. Given three words $A$, $B$ and $C$, the analogy task is to find the word $D$ such that “$A$ is to $B$ as $C$ is to $D$”. In our experiments, $D$ is the word that maximizes the cosine similarity between $D$ and $C-A+B$. We evaluate all non-debiased and debiased embeddings on the MSR word analogy dataset (Mikolov et al., 2013c), which contains 8,000 syntactic questions, and on the Google word analogy dataset (Mikolov et al., 2013a), which contains 19,544 (Total) questions, including 8,869 semantic (Sem) and 10,675 syntactic (Syn) questions. The evaluation metric is the percentage of questions for which the correct answer is assigned the maximum score by the algorithm. Results are shown in Table 4. Double-Hard GloVe achieves results comparable to GloVe and slightly outperforms some of the other debiased embeddings. This shows that Double-Hard Debias preserves proximity relations among words.
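The scoring rule used here can be sketched as follows (standard 3CosAdd over normalized vectors; excluding the query words from the candidates, as is common practice):

    import numpy as np

    def analogy(emb, a, b, c):
        """Return the word D maximizing cos(D, C - A + B), i.e. 'A is to B as C is to D'."""
        words = list(emb)
        M = np.stack([emb[w] for w in words])
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        query = emb[c] - emb[a] + emb[b]
        scores = M @ (query / np.linalg.norm(query))
        for w in (a, b, c):                      # never return one of the query words
            scores[words.index(w)] = -np.inf
        return words[int(np.argmax(scores))]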

Concept Categorization. The goal of concept categorization is to cluster a set of words into different categorical subsets; for example, “sandwich” and “hotdog” are both food, while “dog” and “cat” are animals. Clustering performance is evaluated in terms of purity (Manning et al., 2008), the fraction of the total number of words that are correctly classified. Experiments are conducted on four benchmark datasets: the Almuhareb-Poesio (AP) dataset (Almuhareb, 2006), the ESSLLI 2008 dataset (Baroni et al., 2008), the Battig 1969 set (Battig and Montague, 1969), and the BLESS dataset (Baroni and Lenci, 2011). We run the classical k-means algorithm with fixed $k$. Across the four datasets, the performance of Double-Hard GloVe is on par with the GloVe embeddings, showing that the proposed debiasing method preserves useful semantic information in word embeddings. Full results can be found in Table 4.
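Purity can be computed with a short k-means sketch (word_to_category is a hypothetical gold mapping from each word to its category label):

    import numpy as np
    from sklearn.cluster import KMeans

    def categorization_purity(emb, word_to_category, seed=0):
        """Cluster words with k-means (k = number of gold categories) and report purity:
        the fraction of words covered by the majority category of their cluster."""
        words = list(word_to_category)
        gold = np.array([word_to_category[w] for w in words])
        k = len(set(gold))
        X = np.stack([emb[w] for w in words])
        pred = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        correct = 0
        for c in range(k):
            members = gold[pred == c]
            if len(members):
                correct += np.bincount(np.unique(members, return_inverse=True)[1]).max()
        return correct / len(words)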

5 Related Work

Gender Bias in Word Embeddings. Word embeddings have been criticized for carrying gender bias. Bolukbasi et al. (2016) show that word2vec (Mikolov et al., 2013b) embeddings trained on the Google News dataset exhibit occupational stereotypes, e.g. “programmer” is closer to “man” and “homemaker” is closer to “woman”. More recent works (Zhao et al., 2019; Kurita et al., 2019; Basta et al., 2019) demonstrate that contextualized word embeddings also inherit gender bias.

Gender bias in word embeddings also propagates to downstream tasks and substantially affects predictions. Zhao et al. (2018a) show that coreference systems tend to link occupations to their stereotypical gender, e.g., linking “doctor” to “he” and “nurse” to “she”. Stanovsky et al. (2019) observe that popular industrial and academic machine translation systems are prone to gender-biased translation errors.

Recently,  Vig et al. (2020) proposed causal mediation analysis as a way to interpret and analyze gender bias in neural models.

Debiasing Word Embeddings. For contextualized embeddings, existing work proposes task-specific debiasing methods, while in this paper we focus on more generic ones. To mitigate gender bias, Zhao et al. (2018b) propose a new training approach that explicitly restricts gender information to certain dimensions during training. While this method separates gender information from the embeddings, retraining word embeddings on a massive corpus requires an undesirably large amount of resources. Kaneko and Bollegala (2019) tackle this problem by adopting an encoder-decoder model to re-embed word embeddings. This can be applied to existing pre-trained embeddings, but it still requires training different encoder-decoders for different embeddings.

Bolukbasi et al. (2016) introduce a simpler and more direct post-processing method that zeros out the component along the gender direction. This method reduces gender bias to some degree; however, Gonen and Goldberg (2019) present a series of experiments showing that such methods are far from delivering gender-neutral embeddings. Our work builds on top of Bolukbasi et al. (2016). We identify an important factor, word frequency, that limits the effectiveness of existing methods; by carefully eliminating the effect of word frequency, our method significantly improves debiasing performance.

6 Conclusion

We have discovered that simple changes in word frequency statistics can have an undesirable impact on the debiasing methods used to remove gender bias from word embeddings. Word frequency statistics have so far been neglected in work on gender bias reduction; we therefore propose Double-Hard Debias, which mitigates the negative effects that word frequency features have on debiasing algorithms. Experiments on several benchmarks demonstrate that Double-Hard Debias reduces gender bias more effectively than prior methods while preserving the quality of the word embeddings, both in downstream applications and on embedding-based benchmarks such as word analogy. We hope this work encourages further research into debiasing word embeddings along other dimensions.

References

  • Almuhareb (2006) Abdulrahman Almuhareb. 2006. Attributes in lexical acquisition. Ph.D. thesis, University of Essex, Colchester, UK.
  • Baroni et al. (2008) Marco Baroni, Stefan Evert, and Alessandro Lenci. 2008. Bridging the gap between semantic theory and computational simulations: Proceedings of the esslli workshop on distributional lexical semantics.
  • Baroni and Lenci (2011) Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS ’11, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Basta et al. (2019) Christine Basta, Marta Ruiz Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. CoRR, abs/1904.08783.
  • Battig and Montague (1969) William F. Battig and William E. Montague. 1969. Category norms of verbal items in 56 categories a replication and extension of the connecticut category norms. Journal of Experimental Psychology, 80(3p2):1.
  • Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In NIPS.
  • Caliskan et al. (2017) Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.
  • Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
  • Ethayarajh et al. (2019) Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding undesirable word embedding associations. arXiv preprint arXiv:1908.06361.
  • Gonen and Goldberg (2019) Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In NAACL-HLT.
  • Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. Frage: Frequency-agnostic word representation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1334–1345. Curran Associates, Inc.
  • Kaneko and Bollegala (2019) Masahiro Kaneko and Danushka Bollegala. 2019. Gender-preserving debiasing for pre-trained word embeddings. CoRR, abs/1906.00742.
  • Kurita et al. (2019) Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W. Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. CoRR, abs/1906.07337.
  • Lee et al. (2017) Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 188–197. Association for Computational Linguistics.
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
  • Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
  • Mikolov et al. (2013c) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.
  • Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Rudinger et al. (2018) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  • Stanovsky et al. (2019) Gabriel Stanovsky, Noah A Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. arXiv preprint arXiv:1906.00591.
  • Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural nlp: The case of gender bias.
  • Zhao et al. (2019) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Zhao et al. (2018a) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Zhao et al. (2018b) Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018b. Learning gender-neutral word embeddings. In EMNLP.

Appendix A Appendices

Figure 4: Clustering accuracy after projecting out the D-th dominating direction and applying Hard Debias. Lower accuracy indicates less bias.
Embeddings                Top 100   Top 500   Top 1000
Word2Vec                    100.0      99.3       99.3
Hard-Word2Vec                79.5      74.3       79.8
Double-Hard Word2Vec         71.0      52.3       56.7
Table 5: Clustering accuracy (%) of the top 100/500/1000 male and female words. Lower accuracy means fewer gender cues are captured. Double-Hard Word2Vec consistently achieves the lowest accuracy.
Figure 5: tSNE visualization of the top 500 most male- and female-biased embeddings for (a) Word2Vec, (b) Hard-Word2Vec, and (c) Double-Hard Word2Vec. Double-Hard Word2Vec mixes up the two groups to the greatest extent, showing that less gender information is encoded.
Embeddings                Career & Family     Math & Arts       Science & Arts
                            d        p         d        p         d        p
Word2Vec                  1.89      0.0      1.82      0.0      1.57     2e-4
Hard-Word2Vec             1.80      0.0      1.57     7e-5      0.83     0.05
Double-Hard Word2Vec      1.73      0.0      1.51     5e-4      0.68     0.09
Table 6: WEAT test of embeddings before/after debiasing. The bias is insignificant when the p-value p > 0.05. A lower effect size (d) indicates less gender bias. Across all target word sets, Double-Hard Word2Vec leads to the smallest effect size. Specifically, for Science & Arts words, Double-Hard Word2Vec successfully reaches a bias-insignificant state (p = 0.09).
                             Analogy                            Concept Categorization
Embeddings                Sem    Syn   Total    MSR      AP    ESSLLI   Battig   BLESS
Word2Vec                 24.8   66.5    55.3   73.6    64.5     75.0     46.3    78.9
Hard-Word2Vec            23.8   66.3    54.9   73.5    62.7     75.0     47.1    77.4
Double-Hard Word2Vec     23.5   66.3    54.9   74.0    63.2     75.0     46.5    77.9
Table 7: Results of word embeddings on word analogy and concept categorization benchmark datasets. Performance (x100) is measured in accuracy and purity, respectively. On both tasks, there is no significant degradation of performance due to applying the proposed method.

We also apply Double-Hard Debias to Word2Vec embeddings (Mikolov et al., 2013b), which have been widely used in many NLP applications. As shown in Figure 4, our algorithm identifies that the eighth principal component significantly affects the debiasing performance.

Similarly, we first project away the identified direction $\mathbf{u}$ from the original Word2Vec embeddings and then apply the Hard Debias algorithm. We compare the embeddings debiased by our method with the original Word2Vec embeddings and the Hard-Word2Vec embeddings.

Table 5 reports the experimental results using the neighborhood metric. Across the three experiments, where we cluster the top 100/500/1000 male and female words, Double-Hard Word2Vec consistently achieves the lowest accuracy. Note that the neighborhood metric reflects gender information that can be captured by the clustering algorithm. These results validate that our method further improves on the Hard Debias algorithm. This is also verified in Figure 5, where we show a tSNE visualization of the top 500 male and female embeddings. While the original Word2Vec embeddings clearly separate into two groups corresponding to different genders, this phenomenon becomes much less pronounced after applying our debiasing method.

We further evaluate the debiasing outcome with the WEAT test. As in the experiments on GloVe embeddings, we use male names and female names as attribute sets and analyze the association between the attribute sets and three target sets. We report effect sizes and p-values in Table 6. Across the three target sets, Double-Hard Word2Vec consistently reduces the effect size. More importantly, the bias related to Science & Arts words becomes insignificant after applying our debiasing method.

To test the functionality of the debiased embeddings, we again conduct experiments on the word analogy and concept categorization tasks. Results are included in Table 7 and show that our proposed debiasing method brings no significant performance degradation on these two tasks.

To summarize, the experiments on Word2Vec embeddings also support our conclusion that the proposed Double-Hard Debias reduces gender bias to a larger degree while maintaining the semantic information in the word embeddings.