Potential and Challenges of Model Editing for Social Debiasing
Abstract
Warning: This paper contains content that is stereotypical and may be upsetting.
Large language models (LLMs) trained on vast corpora inevitably acquire stereotypical biases. Mitigating these biases with fine-tuning can be both costly and data-hungry. Model editing methods, which modify LLMs in a post-hoc manner, hold great potential for debiasing. However, a comprehensive study is still lacking that covers both internal and external model editing methods, supports various bias types, and examines the pros and cons of applying editing methods to stereotypical debiasing. To close this gap, we carefully formulate social debiasing as an editing problem and benchmark seven existing model editing algorithms on stereotypical debiasing, i.e., debias editing. Our findings in three scenarios reveal both the potential and the challenges of debias editing: (1) Existing model editing methods can effectively preserve knowledge and mitigate biases, but the generalization of the debiasing effect from edited sentences to semantically equivalent sentences is limited. (2) Sequential editing highlights the robustness of SERAC Mitchell et al. (2022b), whereas internal editing methods degrade as the number of edits grows. (3) Model editing algorithms generalize to unseen biases both within the same bias type and across types. In light of these findings, we further propose two simple but effective methods to improve debias editing and experimentally show their effectiveness.
1 Introduction
Large Language Models (LLMs) perform remarkably well across various downstream tasks, enabling a wide range of applications, from chatbots to medical diagnosis Wang et al. (2023b) and robotics Driess et al. (2023). However, LLMs trained on vast corpora can inadvertently learn biased information, leading to negative stereotypes and social biases encoded within the models Gallegos et al. (2023). For instance, when given "Arab people are rich." and "Arab people are poor.", the LLaMA-2-7B Touvron et al. (2023b) model favors the first sentence, reflecting stereotypes encoded within the model. Such biases can result in unfairness and harm when the models are deployed in production systems Nadeem et al. (2021a); Prakash and Lee (2023).
Traditional debiasing methods typically require retraining models from scratch, modifying the training data Zmigrod et al. (2019); Dinan et al. (2020); Qian et al. (2022a); Narayanan Venkit et al. (2023) or the optimization process Huang et al. (2020); Qian et al. (2022b); He et al. (2022); Park et al. (2023); Zhou et al. (2023), or they operate only in the output space He et al. (2021); Tokpo and Calders (2022); Majumder et al. (2022); Dhingra et al. (2023). The former is very costly for large language models, while the latter does not truly address the bias encoded in the model, potentially leading to non-robustness.
When it comes to mitigating specific biases in pre-trained models, such as the association between "nurse" and "women", directly fine-tuning the model Ghanbarzadeh et al. (2023) can be both costly and impractical when data is insufficient. On the other hand, model editing Mitchell et al. (2022a); Zhang et al. (2024), which is capable of post-training adjustments to the knowledge within a model, has shown potential for tackling this issue. Model editing was initially adopted for modifying pre-trained models' knowledge Meng et al. (2022, 2023b); Zhang et al. (2024). Recently, researchers have started to employ model editing methods for other tasks, such as editing personality Mao et al. (2023) or the reasoning process Akyürek et al. (2023). For editing stereotypical biases, several important research questions remain unsolved: (1) Given the rich and complex semantics of stereotype sentences compared with factual knowledge, how should the editing problem be defined for debiasing? (2) What are the advantages and disadvantages of employing model editing methods for debiasing? (3) Can debiasing edits generalize to unseen biases?
In this work, we present a comprehensive study of model editing for social debiasing. We adopt a flexible formulation that takes the tokens up to and including the first appearance of the subject as the prompt and the rest of the sentence as the target. This formulation supports different bias types as well as various internal and external editing methods. Based on this formulation, we construct a debias editing dataset from StereoSet Nadeem et al. (2021b) that supports both internal and external editing algorithms. By experimenting with seven editing algorithms on both LLaMA2-7B Touvron et al. (2023b) and GPT2-XL Radford et al. (2019), we make the following observations:
• In the single-edit setting, i.e., editing out a single biased sentence, model editing methods can achieve almost 100% edit success rates without hurting LLMs' internal knowledge (relatively high Knowledge Acc scores). Nonetheless, generalization is challenging for debias editing: the edited sentence achieves a higher probability than the biased one and its paraphrases, but this effect hardly generalizes to the paraphrases of the edited sentence.

• In sequential editing, i.e., editing out biased sentences one by one, all methods degrade as the number of edits grows, except for SERAC Mitchell et al. (2022b), which performs consistently well. With more edits, the complex structure of biased sentences poses a severe challenge to both the debias success rate and the preservation of internal knowledge.

• Model editing algorithms can generalize across bias types: after being edited on one bias type, e.g., Race, LLMs demonstrate some debiasing effect on another bias type, e.g., Profession.
In light of these observations, we propose two simple methods to mitigate the challenge brought by many edits: a heuristic rule-based target selection and a causal-tracing-based selection, both of which constrain the scope of the edits. Experimental results show that our methods better preserve the model's knowledge while maintaining strong edit performance. To the best of our knowledge, we are the first to comprehensively study the problem of debias editing, reveal its potential and challenges, and propose effective solutions. We release our dataset and code to facilitate future work: https://github.com/ElliottYan/ModelEditingForDebias.
2 Related Work
Debiasing
Addressing social bias in language models is an ongoing challenge that has received significant attention. Strategies for mitigating bias in language models can be classified based on different stages of the model workflow: Preprocessing techniques aim to detect and eliminate bias and unfairness early on, either within the dataset Zmigrod et al. (2019); Dinan et al. (2020); Abid et al. (2021); Qian et al. (2022a); Ghanbarzadeh et al. (2023) or prompt Mattern et al. (2022); Fatemi et al. (2023); Yang et al. (2023). In-training bias mitigation techniques focus on reducing bias and unfairness during model training, by adjusting model architecture Bartl et al. (2020); Han et al. (2022), modifying loss functions Liu et al. (2020); Webster et al. (2020); Ouyang et al. (2022); Woo et al. (2023); Park et al. (2023); Zhou et al. (2023); Li et al. (2023), or selectively updating parameters Qian et al. (2022b); Ranaldi et al. (2023); Yu et al. (2023). Intraprocessing approaches alter decoding behavior Saunders et al. (2022); Meade et al. (2023); Kim et al. (2023); Chung et al. (2023); Hallinan et al. (2023) without additional training or fine-tuning. Post-processing techniques primarily adjust model outputs to address bias and unfairness, without directly accessing the model itself He et al. (2021); Tokpo and Calders (2022); Majumder et al. (2022); Dhingra et al. (2023). However, effectively modifying bias in pre-trained large language models while minimizing disruption to the model’s capabilities remains largely unexplored.
Model Editing
To address inaccuracies and biases in Large Language Models, various model editing techniques have been developed for efficient post-training adjustments. These can be categorized into two main types: intrinsic methods that modify the model's architecture or parameters, and extrinsic methods that adjust the input or output space Akyürek et al. (2023); Zhang et al. (2024). Intrinsic editing involves direct changes to the model, such as parameter updates or new connections. For instance, simple fine-tuning on the model's original objectives is common, but it can lead to overfitting and forgetting previously learned information Mitchell et al. (2022a). Alternative strategies involve editing model activations Meng et al. (2022, 2023b), training auxiliary models to predict parameters De Cao et al. (2021); Tan et al. (2024), or directly altering model representations Hernandez et al. (2023) to incorporate new information. Some techniques focus on updating specific knowledge, while others use an external memory to store and retrieve edits. For example, SERAC Mitchell et al. (2022b) utilizes a scope classifier to determine the relevance of an edit and a counterfactual model to apply it, both of which require training for new edits.
Model Editing for Debiasing
Very recently, DAMA Limisiewicz et al. (2024) identifies the stereotype representation subspace and edits bias-vulnerable FFNs using an orthogonal projection matrix. They propose a clever approach that uses a profession as the subject and 'he' or 'she' as the target to facilitate causal tracing. However, their approach is limited to gender bias, making it difficult to generalize to other bias types or relations. By extending our research scope to four major categories of bias, we achieve more flexible debiasing strategies and comprehensively study the debias editing problem. Besides, DUNE Akyürek et al. (2023) broadens the scope of model editing to free-form natural language, which allows editing out biases. However, this formulation does not support intrinsic editing methods and only works in an external manner, whereas we explore both intrinsic and external model editing approaches.
3 Experimental Settings
In this section, we describe our problem formulation, the benchmarked editing algorithms, our data construction, and the evaluation metrics for debias editing.
3.1 Model Editing
In the literature, model editing is mainly applied to factual knowledge. The purpose is to modify a language model $f_\theta$ into an edited model $f_{\theta'}$, overriding the undesired knowledge with a new one while keeping other knowledge intact. Specifically, given a prompt $x$, an editing algorithm changes the language model's next-token prediction from the original knowledge $y$ to a new one $y^*$, where we refer to $y^*$ as the target. A successful edit should then satisfy the following conditions:

$$f_{\theta'}(x) = y^*, \qquad f_{\theta'}(x') = f_{\theta}(x') \;\; \text{for prompts } x' \text{ unrelated to the edit.}$$

We refer to the rate at which the first condition is satisfied as the edit success rate and the second as Knowledge Acc.
We benchmark seven model editing algorithms, listed below. These methods can be divided into two categories: internal editing methods and external editing methods Zhang et al. (2024). Internal methods edit the model's parameters, applying techniques such as causal tracing or meta-learning to find the specific module that contains the key information. External methods, in contrast, keep the model's parameters untouched and use external prompts or memory to change the model's behavior.
• FT (Internal) applies Adam with early stopping at a single layer to minimize the negative log-likelihood of the target, following Meng et al. (2023a); the layer is chosen by causal mediation analysis as in ROME.

• FT-L (Internal) refers to Constrained Fine-Tuning Zhu et al. (2020), which additionally imposes a parameter-space norm constraint on the weight changes. Note that FT and FT-L differ from direct fine-tuning in that the optimization is performed edit by edit.

• MEND (Internal, Mitchell et al. 2022a) applies a rank-one decomposition to the fine-tuning gradient, splitting it into two rank-one matrices from which the weight update can be computed, significantly reducing the number of parameters to be learned.

• ROME (Internal, Meng et al. 2022) employs causal analysis to detect which hidden states are most important for a prediction, then views editing as a constrained optimization and applies a rank-one update to the weights of the located feed-forward layer.

• MEMIT (Internal, Meng et al. 2023b) spreads updates evenly over a range of target layers, keeping the magnitude of parameter changes small and improving robustness, whereas ROME modifies only a single layer.

• SERAC (External, Mitchell et al. 2022b) stores input-output pairs in an external memory and retrieves relevant edits with a learned scope classifier; a counterfactual model is then used in lieu of the main model. Both modules, i.e., the scope classifier that identifies whether an edit is relevant to the test query and the counterfactual model, require training.

• IKE (External, Zheng et al. 2023) constructs three types of demonstrations – copy, update, and retain – to aid the model in producing reliable fact edits. It retrieves the most pertinent demonstrations from a demonstration store built from the training set to guide the model toward the appropriate answer.
We use both LLaMA-2 and GPT2-XL as our base models. We present the results for LLaMA-2 in the main text and the results for GPT2-XL in the Appendix. We conduct our experiments with EasyEdit Wang et al. (2023a).
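As a concrete illustration, the sketch below shows how a single debias edit request could be issued through EasyEdit; the hyperparameter file path and argument names follow EasyEdit's documented interface but are assumptions that may differ across versions, and the request strings are placeholders.

```python
# Minimal sketch of one debias edit via EasyEdit (Wang et al., 2023a).
# Config path, argument names, and the example request are illustrative.
from easyeditor import BaseEditor, ROMEHyperParams

hparams = ROMEHyperParams.from_hparams("./hparams/ROME/llama-7b.yaml")  # assumed config path
editor = BaseEditor.from_hparams(hparams)

metrics, edited_model, _ = editor.edit(
    prompts=["<sentence up to and including the subject>"],   # prompt
    ground_truth=["<biased continuation to override>"],        # original (biased) target
    target_new=["<unbiased continuation to boost>"],           # new (unbiased) target
    subject=["<subject>"],                                      # used by locate-then-edit methods
)
print(metrics)
```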
3.2 Debiasing Data Preparation
Following previous work in debiasing Nangia et al. (2020), we define the debiasing problem with pairs of biased and unbiased sentences. Consider a pair of sentences $(s^b, s^u)$, where $s^b$ is more stereotypically biased than $s^u$. We say a language model $f_\theta$ is biased on this pair if its likelihood leans towards the more biased sentence,

$$P_\theta(s^b) > P_\theta(s^u). \qquad (1)$$

Thus, there are two ways towards debiasing with editing: either reducing the likelihood of $s^b$ or enhancing the likelihood of $s^u$. In our experiments, we mainly focus on the latter. One key problem for applying model editing to debiasing is the definition of the target. Unlike knowledge editing, where the target is clearly defined as the new knowledge word, in social debiasing it is generally unclear what the target to edit should be.
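As a concrete illustration, the bias check in Equation (1) can be implemented with a few lines of Hugging Face Transformers code; the sketch below is ours, the model name is illustrative, and the likelihood is computed as the sum of per-token log-probabilities.

```python
# Sketch (our own, not the released code) of the bias check in Eq. (1):
# a pair counts as biased for the model if the stereotypical sentence s_b
# receives a higher sequence log-likelihood than its counterpart s_u.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def sequence_logprob(sentence: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp.sum().item()

# Example pair from the introduction:
s_b, s_u = "Arab people are rich.", "Arab people are poor."
pair_is_biased = sequence_logprob(s_b) > sequence_logprob(s_u)  # Eq. (1)
```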
Table 1: Statistics of our debias editing data for LLaMA-2 (prompt and target lengths in tokens).

Split | Bias type | Number | Prompt Mean | Prompt Std | Target Mean | Target Std
Edit | All | 929 | 3.27 | 2.46 | 9.70 | 6.24
Edit | Race | 393 | 3.27 | 2.47 | 8.94 | 5.92
Edit | Gender | 113 | 3.26 | 2.21 | 9.21 | 6.00
Edit | Religion | 44 | 3.20 | 2.87 | 11.23 | 5.61
Edit | Profession | 379 | 3.27 | 2.46 | 10.45 | 6.58
Train | All | 1162 | 3.07 | 2.16 | 9.49 | 6.09
Train | Race | 528 | 2.91 | 2.15 | 9.10 | 5.73
Train | Gender | 133 | 3.29 | 2.36 | 9.32 | 6.34
Train | Religion | 41 | 2.71 | 2.05 | 9.02 | 5.00
Train | Profession | 460 | 3.23 | 2.12 | 10.03 | 6.45
Val | All | 232 | 3.09 | 2.06 | 9.69 | 6.17
Val | Race | 98 | 2.67 | 1.89 | 9.35 | 6.10
Val | Gender | 36 | 3.36 | 1.90 | 10.06 | 6.82
Val | Religion | 12 | 3.17 | 2.15 | 8.42 | 4.89
Val | Profession | 86 | 3.43 | 2.21 | 10.12 | 6.09
Table 2: Single edit results for LLaMA2-7B.

Editor | Success Rate | Knowledge Acc | Gen-1 | Gen-2 | Average
Before Edit | 0.00 | 100.00 | 14.75 | 0.00 | - |
Internal Editing Algorithms | | | | |
FT | 40.04 | 97.25 | 44.03 | 1.72 | 45.76 |
FT-L | 3.78 | 98.57 | 20.86 | 0.00 | 30.80 |
MEND | 91.71 | 96.73 | 81.38 | 6.24 | 69.01 |
ROME | 95.80 | 97.38 | 94.62 | 4.95 | 73.19 |
MEMIT | 94.19 | 98.82 | 88.16 | 3.12 | 71.07 |
External Editing Algorithms | | | | |
SERAC | 99.25 | 99.62 | 97.95 | 2.80 | 74.91 |
IKE | 100.0 | 74.32 | 100.0 | 0.00 | 68.58 |
External editing, which mainly relies on a retrieval database, places only minor requirements on the dataset: taking the whole sentence as the target without splitting it into prompt and target Akyürek et al. (2023) suffices. Internal editing, however, requires more detailed information such as the prompt, target, subject, and paraphrases, as it needs to locate the targeted parameters of the model. To support both families of algorithms, we take an approach that requires minimal annotation effort, so as to suit real-world scenarios. We construct our dataset based on StereoSet Nadeem et al. (2021b). For each sentence pair, the unbiased sentence is minimally different from the biased one, and both sentences share a common subject. We set the prompt to be the part of the sentence before the first occurrence of the subject (including the subject itself), and the target to be the rest of the sentence. In this way, less annotation is required and previous datasets with paired sentences can be reused. For evaluation, we sample sentences from CounterFact Meng et al. (2023a) to test whether an edit affects the LLM's knowledge. We use gpt-3.5-turbo-1106 (https://openai.com/) to generate three paraphrases for both sentences, supporting training-based methods like MEND and our generalization evaluations. An example of our constructed data is shown in Figure 2. For each model, we only include sentence pairs that exhibit bias (as defined in Equation 1). We split the data into train, validation, and edit sets. The statistics for LLaMA-2 are shown in Table 1. The target part is generally longer than the prompt, containing about 10 tokens. Compared to knowledge editing, where the target is typically 1-2 tokens, this formulation of debias editing poses an additional challenge: editing algorithms must accurately find the most biased parts of the target.
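The prompt/target split described above amounts to a small piece of string processing; the sketch below is our simplified illustration (the function name and example sentence are ours, not the released code).

```python
# Sketch: split a StereoSet-style sentence into (prompt, target) at the first
# occurrence of the shared subject, keeping the subject inside the prompt.
def split_prompt_target(sentence: str, subject: str):
    idx = sentence.lower().find(subject.lower())
    if idx == -1:
        raise ValueError(f"subject {subject!r} not found in sentence")
    cut = idx + len(subject)
    return sentence[:cut], sentence[cut:].lstrip()

# Example (illustrative sentence, subject "guitarist"):
prompt, target = split_prompt_target("The guitarist was a talented musician.", "guitarist")
# prompt -> "The guitarist", target -> "was a talented musician."
```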
3.3 Evaluation Metrics
For evaluation, we propose the following four metrics for debias editing.
• Edit Success Rate evaluates whether $P_{\theta'}(s^u) > P_{\theta'}(s^b)$ holds after the edit. This metric indicates the bias level after the edits.

• Knowledge Acc evaluates whether the prediction of the LLM on an unrelated Wikipedia prompt has changed. This metric reflects the locality of the editing algorithms.

• Generalization evaluates whether the edit generalizes to closely related paraphrases, with two sub-metrics. The first, Gen-1, measures whether $P_{\theta'}(s^u) > P_{\theta'}(s^b_i)$ for every $i$, where $s^b_i$ is the $i$-th paraphrase of the biased sentence $s^b$; it evaluates whether the edited unbiased sentence can beat all paraphrases of the biased sentence. The second, Gen-2, evaluates whether $P_{\theta'}(s^u_j) > P_{\theta'}(s^b)$ and $P_{\theta'}(s^u_j) > P_{\theta'}(s^b_i)$ for all $i$ and $j$, i.e., whether the paraphrases of the edited sentence can beat the unedited biased sentence and its paraphrases (a small sketch of these checks follows this list).

• General Capability evaluates the overall capability of the LLM. We choose four benchmarks commonly used for evaluating LLMs: Crows-Pairs Nangia et al. (2020), OpenbookQA Mihaylov et al. (2018), TruthfulQA Lin et al. (2022), and WinoGrande Sakaguchi et al. (2021). Among these benchmarks, Crows-Pairs evaluates social biases and serves as an out-of-domain test set for our setting.
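For concreteness, the two generalization sub-metrics can be computed per example as sketched below, reusing a sequence log-likelihood scorer such as the one sketched in Section 3.2; the function names and the all-paraphrase aggregation are our reading of the definitions above.

```python
# Sketch of the per-example generalization checks (names and code are ours).
# score(s) is a sequence log-likelihood under the *edited* model.

def gen_1(score, s_u, biased_paraphrases):
    """Edited unbiased sentence beats every paraphrase of the biased sentence."""
    return all(score(s_u) > score(p) for p in biased_paraphrases)

def gen_2(score, unbiased_paraphrases, s_b, biased_paraphrases):
    """Paraphrases of the edited sentence beat the biased sentence and its paraphrases."""
    rivals = [s_b, *biased_paraphrases]
    return all(score(u) > score(r) for u in unbiased_paraphrases for r in rivals)

# Dataset-level Gen-1 / Gen-2 are the fractions of examples passing each check.
```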
4 Experimental Results
[Figure 1: Sequential edit results for LLaMA-2.]
4.1 Single Edit Results
Following previous work in model editing Meng et al. (2023a, b); Zhang et al. (2024), we first benchmark the editing algorithms in the single edit scenario, in which the LLM is edited with a single bias at a time. Table 2 shows the single edit results with LLaMA2-7B Touvron et al. (2023b); results for GPT2-XL can be found in the Appendix. In terms of edit success rate, SERAC and IKE achieve the strongest performance, with nearly 100% success. ROME, MEMIT, and MEND achieve satisfactory edit success rates, while FT and FT-L perform worst and cannot reliably perform even a single edit. These results demonstrate the great potential of editing methods to override stereotypical biases.

For Knowledge Acc, which indicates whether an editing algorithm can edit the bias without affecting the LLM's knowledge, we find that most methods score high (above 96%), showing that they can effectively edit out bias without hurting the LLM's internal knowledge. One exception is IKE, which only obtains a score of 74.32%.

Furthermore, we evaluate the generalization of these editing methods in the debiasing scenario. We find that the editing methods effectively improve the target sentence, as indicated by the high Gen-1 scores, but they struggle with Gen-2. Recall that Gen-2 measures whether the paraphrases of the unbiased sentence can beat the biased ones. These results suggest that a key challenge for model editing methods is generalizing over semantically equivalent sentences.
Table 3: General capability of LLaMA-2 before and after edits.

Method | Crows-Pairs | OpenbookQA | WinoGrande | TruthfulQA
Before Edit | 66.91 | 31.4 | 69.14 | 38.96 |
FT | 47.92 ± 0.09 | 13.13 ± 1.62 | 49.51 ± 0.73 | 48.98 ± 1.09 |
FT-L | 64.78 ± 0.81 | 31.33 ± 0.31 | 68.56 ± 0.41 | 39.14 ± 0.22 |
MEND | 52.69 ± 1.38 | 15.20 ± 1.40 | 50.72 ± 2.40 | 49.14 ± 1.62 |
SERAC | 54.46 ± 0.03 | 17.60 ± 0.00 | 38.74 ± 0.00 | 51.93 ± 0.00 |
MEMIT | 42.12 ± 3.41 | 14.87 ± 1.50 | 50.75 ± 1.71 | 48.55 ± 1.31 |
ROME | 47.47 ± 1.89 | 14.53 ± 1.30 | 50.07 ± 0.37 | 50.04 ± 0.74 |
4.2 Sequential Edit
Another important scenario is sequential editing, where the LM is edited with a stream of requests. This scenario closely mirrors debiasing in real-world applications, where biased sentences generated by LLMs (bad cases) are discovered one by one.

The results are shown in Figure 1. We do not include IKE in this setting because IKE only supports single edits. During the sequential process, we evaluate both the already edited and the not-yet-edited data. We have the following observations:
• The edit success rate decreases as the number of edits increases. After hundreds of edits, the success rate on already-edited requests gradually converges to about 40%, except for SERAC, which achieves a nearly 100% success rate. In addition, we find that the success rate on the unedited portion of the data also improves with edits, indicating that editing algorithms achieve a certain degree of generalization in mitigating biases.

• For all methods, Knowledge Acc remains high with a small number of edits but drops quickly as the number of edits grows. FT affects Knowledge Acc the most, with the score falling from 100 after only 1 edit. Methods like ROME, MEMIT, and MEND suffer less from this problem but still show a significant performance drop beyond 32 edits. FT-L and SERAC are the most robust against the growth of edits. These results suggest that a key challenge of debias editing is keeping the edits specific and accurate under a large number of edits.

• For the generalization metrics, editing methods generally perform well on Gen-1, in both the edited and unedited evaluations. However, Gen-2 is more challenging, as discussed in the previous section. Hence, another key challenge we identify for debias editing is generalizing over targets.

• Across all edited evaluations, we observe strong performance from SERAC, which achieves a high success rate and has the smallest influence on the model's internal knowledge, suggesting that utilizing memory and retrieval is a promising direction for debias editing. Nevertheless, we also observe drawbacks of SERAC when generalizing to unedited data.
General Capability
In this section, we compare the general capability of LLMs before and after edits. Results are shown in Table 3; we use lm-harness Gao et al. (2023) for evaluation. First of all, we find that after edits, the scores on Crows-Pairs, which is commonly used to evaluate the bias of LLMs, indicate a lower level of bias for all editing algorithms. MEMIT performs the best across all editing methods, reducing the bias score from 66.91 to 42.12.
Then, for OpenbookQA and WinoGrande (benchmarks for world knowledge and reasoning ability), editing methods lead to clear drops in performance, showing a tradeoff between helpfulness and bias mitigation. FT-L achieves the minimum decline in OpenbookQA and WinoGrande, but also the smallest improvement on Crows-Pairs.
Interestingly, we observe the model’s performance on TruthfulQA increases after edits, ranging from 1% to 11%, suggesting that solving stereotypical biases makes the model more truthful. We conjecture that the representation of truthfulness and bias Zou et al. (2023) inside LLMs might be connected.
Table 4: Success rates when editing on one bias type (rows) and evaluating on each bias type (columns).

Edit type | Algorithm | Race | Gender | Religion | Profession
Race | FT | 43.26 | 32.74 | 27.27 | 35.36 |
FT-L | 24.68 | 15.93 | 11.36 | 21.90 | |
MEND | 32.82 | 33.63 | 27.27 | 36.41 | |
SERAC | 99.24 | 23.01 | 34.09 | 28.23 | |
ROME | 45.80 | 33.63 | 25.00 | 34.04 | |
MEMIT | 35.11 | 30.09 | 27.27 | 35.62 | |
Gender | FT | 33.08 | 48.67 | 22.73 | 34.56 |
FT-L | 11.70 | 23.01 | 0.00 | 17.41 | |
MEND | 33.33 | 37.17 | 22.73 | 35.62 | |
SERAC | 23.16 | 98.23 | 20.45 | 28.23 | |
ROME | 36.90 | 42.48 | 27.27 | 30.87 | |
MEMIT | 35.88 | 33.63 | 25.00 | 32.98 | |
Religion | FT | 33.08 | 25.66 | 38.64 | 28.23 |
FT-L | 2.29 | 7.08 | 11.36 | 1.85 | |
MEND | 33.59 | 33.63 | 18.18 | 36.41 | |
SERAC | 28.50 | 24.78 | 100.00 | 27.97 | |
ROME | 35.37 | 33.63 | 36.36 | 38.52 | |
MEMIT | 34.10 | 35.40 | 31.82 | 36.15 | |
Profession | FT | 32.82 | 31.86 | 22.73 | 42.48 |
FT-L | 13.99 | 15.93 | 6.82 | 20.05 | |
MEND | 33.59 | 29.20 | 25.00 | 34.30 | |
SERAC | 27.48 | 31.86 | 27.27 | 97.89 | |
ROME | 37.91 | 32.74 | 20.45 | 44.06 | |
MEMIT | 35.88 | 32.74 | 25.00 | 33.25 |
Table 5: Sequential edit results of the proposed target-selection methods with FT as the baseline, across different numbers of edits (mean ± std).

Metric | Method | 1 | 4 | 16 | 64 | 256 | ALL
Success Rate | Baseline | 36.36 ± 50.45 | 47.73 ± 26.11 | 63.07 ± 12.95 | 59.37 ± 7.16 | 47.00 ± 1.26 | 40.44 ± 1.94 |
Causal | 45.45 ± 52.22 | 43.18 ± 19.66 | 52.84 ± 10.59 | 54.17 ± 7.86 | 53.52 ± 1.41 | 53.18 ± 0.67 | |
Rule | 63.64 ± 36.36 | 79.55 ± 20.45 | 71.59 ± 10.96 | 78.65 ± 5.49 | 78.65 ± 0.23 | 71.13 ± 1.53 | |
Gen-2 | Baseline | 0.00 ± 0.00 | 6.82 ± 22.61 | 1.70 ± 4.04 | 1.56 ± 2.71 | 1.04 ± 0.60 | 0.90 ± 0.38 |
Causal | 0.00 ± 0.00 | 0.00 ± 0.00 | 2.84 ± 5.84 | 3.12 ± 1.57 | 1.56 ± 0.39 | 1.58 ± 0.12 | |
Rule | 0.00 ± 0.00 | 2.27 ± 7.54 | 1.70 ± 4.04 | 4.17 ± 4.78 | 3.65 ± 0.82 | 2.91 ± 0.47 | |
Gen-1 | Baseline | 18.18 ± 40.45 | 45.45 ± 15.08 | 59.09 ± 12.61 | 63.54 ± 8.61 | 48.57 ± 4.01 | 42.67 ± 0.81 |
Causal | 27.27 ± 46.71 | 31.82 ± 22.61 | 43.75 ± 7.91 | 43.75 ± 11.27 | 50.39 ± 0.39 | 50.34 ± 1.15 | |
Rule | 27.27 ± 46.71 | 54.55 ± 36.77 | 55.11 ± 14.47 | 56.77 ± 8.02 | 59.77 ± 4.11 | 56.99 ± 1.62 | |
Knowledge | Baseline | 88.64 ± 11.36 | 60.42 ± 21.47 | 29.88 ± 11.89 | 10.68 ± 2.27 | 9.14 ± 4.15 | 1.36 ± 1.18 |
Causal | 88.64 ± 11.36 | 64.02 ± 29.17 | 28.46 ± 14.78 | 13.54 ± 3.76 | 8.29 ± 4.00 | 5.67 ± 2.13 | |
Rule | 81.82 ± 18.18 | 51.70 ± 29.83 | 33.90 ± 17.32 | 24.18 ± 8.55 | 12.07 ± 1.88 | 8.86 ± 4.12 |
4.3 Generalization over Bias Types
Furthermore, we investigate whether edits of a certain bias type, e.g., Race, can generalize to another, unseen bias type, e.g., Profession. We consider four bias types: Race, Gender, Religion, and Profession.

As shown in Table 4, we perform edits on one bias type and evaluate on all four bias types. In general, editing methods demonstrate generalization across bias types. SERAC performs the best on in-domain edits (e.g., edit on Race and evaluate on Race), while ROME and MEMIT generalize the best across types. Most methods achieve about a 30% success rate when edits transfer to other types, further demonstrating the potential of using editing methods to mitigate bias in LLMs.
5 Methods
As mentioned before, one key challenge we identified for debias editing is to accurately locate the biased part in the target and make minimal modifications. Thus, we propose two simple methods to address this issue.
Rule-based
We first propose a heuristic rule that selects the most informative parts of the sentence. We hypothesize that tokens with more informative part-of-speech (POS) tags Qi et al. (2020) are more valuable targets for reversing the biases in LLMs. We use spaCy (https://spacy.io/) to parse each sentence and filter out tokens tagged as ["DET", "AUX", "PUNCT", "PRON", "ADP"]. Furthermore, since the biased and unbiased pairs in our dataset have minimal lexical differences, we also extract the longest common prefix of the two sentences.
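A minimal sketch of this rule-based target selection is given below, assuming the common prefix is excluded from the edit target (one plausible reading of the heuristic); the function names and example pair are ours.

```python
# Sketch of rule-based target selection (our own illustration).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
UNINFORMATIVE_POS = {"DET", "AUX", "PUNCT", "PRON", "ADP"}

def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common character prefix of two sentences."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def select_target_tokens(biased: str, unbiased: str) -> list[str]:
    """Keep only informative tokens from the part of the unbiased
    sentence that differs from the biased one."""
    suffix = unbiased[common_prefix_len(biased, unbiased):]
    return [tok.text for tok in nlp(suffix) if tok.pos_ not in UNINFORMATIVE_POS]

# Example: select_target_tokens("Arab people are rich.", "Arab people are poor.")
# -> ["poor"]
```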
Causal Tracing
Our second method utilizes causal tracing to find the parts of the target that are most informative about the subject. In particular, we measure the predicted probability of each target token when noise is introduced into the encoding of the subject tokens, degrading the effect of the subject on the model. As in Meng et al. (2022), we use Gaussian noise with standard deviation $3\sigma$, where $\sigma$ is the empirically observed standard deviation of the embedding activations. We then take the top 5 tokens with the highest probability reduction as the new targets.
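A simplified sketch of the causal-tracing selection is shown below. It assumes the clean and subject-corrupted per-token target probabilities have already been computed (e.g., with the noise-injection procedure of Meng et al., 2022) and simply ranks target tokens by their probability drop; the function name is ours.

```python
# Sketch: choose the target tokens most affected by corrupting the subject.
# clean_probs / corrupted_probs are per-token probabilities of the target,
# computed with and without Gaussian noise on the subject embeddings.
import numpy as np

def causal_target_selection(target_tokens, clean_probs, corrupted_probs, k=5):
    drop = np.asarray(clean_probs) - np.asarray(corrupted_probs)
    top = np.argsort(-drop)[:k]                     # largest probability reduction
    return [target_tokens[i] for i in sorted(top)]  # keep original token order
```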
Experimental Results
The experimental results for both methods can be found in Table 5; we use FT as our baseline. In terms of the Success Rate, both methods improve the scores as the number of edited samples increases, and our rule-based method consistently surpasses both the causal method and the FT baseline. As for the generalization scores, even with our methods the models still struggle on Gen-2, although both methods bring improvements at '256' and 'ALL' samples. For Knowledge Acc, the rule-based method retains notably more knowledge under many edits, even after all 929 edits. These results demonstrate the effectiveness of our two methods and support our intuition that selecting which parts of the target to edit is a key challenge for debias editing.
6 Conclusion
In this paper, we comprehensively study the problem of model editing for stereotypical debiasing. After designing a viable formulation that supports both internal and external editing algorithms, we evaluate seven model editing methods and observe both the potential and the challenges of debias editing in the single edit, sequential edit, and bias-generalization scenarios. Based on our observations, we further propose two simple methods to address the identified challenges, demonstrating promising results.
7 Limitations
Currently, we focus only on the explicit biases present in sentences; the definition, evaluation, and mitigation of implicit biases are not considered. We also do not consider additional scenarios where biases may manifest, such as chat or question answering. We work exclusively with English-language datasets and do not consider other languages. Finally, we focus on foundation models; chat models are not covered in our settings.
8 Ethical Considerations
This paper focuses on debias editing. However, the same techniques could be applied to amplify the bias of LLMs, so the fair use of editing techniques needs more attention. Our scope is limited by foundational datasets such as StereoSet, which means that not all ethnicities, political views, or religious perspectives are included. We honor the ACL Code of Ethics. No private data or non-public information was used in this work.
References
- Abid et al. (2021) Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models.
- Akyürek et al. (2023) Afra Akyürek, Eric Pan, Garry Kuwanto, and Derry Wijaya. 2023. DUnE: Dataset for unified editing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1847–1861, Singapore. Association for Computational Linguistics.
- Bartl et al. (2020) Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking contextual stereotypes: Measuring and mitigating BERT’s gender bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 1–16, Barcelona, Spain (Online). Association for Computational Linguistics.
- Chung et al. (2023) John Chung, Ece Kamar, and Saleema Amershi. 2023. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 575–593, Toronto, Canada. Association for Computational Linguistics.
- De Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Dhingra et al. (2023) Harnoor Dhingra, Preetiha Jayashanker, Sayali Moghe, and Emma Strubell. 2023. Queer people are people first: Deconstructing sexual identity stereotypes in large language models.
- Dinan et al. (2020) Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online. Association for Computational Linguistics.
- Driess et al. (2023) Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. Palm-e: An embodied multimodal language model.
- Fatemi et al. (2023) Zahra Fatemi, Chen Xing, Wenhao Liu, and Caimming Xiong. 2023. Improving gender fairness of pre-trained language models without catastrophic forgetting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1249–1262, Toronto, Canada. Association for Computational Linguistics.
- Gallegos et al. (2023) Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. 2023. Bias and fairness in large language models: A survey.
- Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
- Ghanbarzadeh et al. (2023) Somayeh Ghanbarzadeh, Yan Huang, Hamid Palangi, Radames Cruz Moreno, and Hamed Khanpour. 2023. Gender-tuning: Empowering fine-tuning for debiasing pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5448–5458, Toronto, Canada. Association for Computational Linguistics.
- Hallinan et al. (2023) Skyler Hallinan, Alisa Liu, Yejin Choi, and Maarten Sap. 2023. Detoxifying text with MaRCo: Controllable revision with experts and anti-experts. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 228–242, Toronto, Canada. Association for Computational Linguistics.
- Han et al. (2022) Xudong Han, Timothy Baldwin, and Trevor Cohn. 2022. Balancing out bias: Achieving fairness through balanced training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11335–11350, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- He et al. (2021) Zexue He, Bodhisattwa Prasad Majumder, and Julian McAuley. 2021. Detect and perturb: Neutral rewriting of biased and sensitive text via gradient-based decoding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4173–4181, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- He et al. (2022) Zexue He, Yu Wang, Julian McAuley, and Bodhisattwa Prasad Majumder. 2022. Controlling bias exposure for fair interpretable predictions. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5854–5866, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Hernandez et al. (2023) Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. Inspecting and editing knowledge representations in language models.
- Huang et al. (2020) Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020. Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics.
- Kim et al. (2023) Minbeom Kim, Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, and Kyomin Jung. 2023. Critic-guided decoding for controlled text generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4598–4612, Toronto, Canada. Association for Computational Linguistics.
- Li et al. (2023) Yingji Li, Mengnan Du, Xin Wang, and Ying Wang. 2023. Prompt tuning pushes farther, contrastive learning pulls closer: A two-stage approach to mitigate social biases. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14254–14267, Toronto, Canada. Association for Computational Linguistics.
- Limisiewicz et al. (2024) Tomasz Limisiewicz, David Mareček, and Tomáš Musil. 2024. Debiasing algorithm through model adaptation.
- Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- Liu et al. (2020) Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2020. Does gender matter? towards fairness in dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4403–4416, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Majumder et al. (2022) Bodhisattwa Prasad Majumder, Zexue He, and Julian McAuley. 2022. Interfair: Debiasing with natural language feedback for fair interpretable predictions.
- Mao et al. (2023) Shengyu Mao, Ningyu Zhang, Xiaohan Wang, Mengru Wang, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2023. Editing personality for llms.
- Mattern et al. (2022) Justus Mattern, Zhijing Jin, Mrinmaya Sachan, Rada Mihalcea, and Bernhard Schölkopf. 2022. Understanding stereotypes in language models: Towards robust measurement and zero-shot debiasing. arXiv preprint arXiv:2212.10678.
- Meade et al. (2023) Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, and Dilek Hakkani-Tür. 2023. Using in-context learning to improve dialogue safety.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36.
- Meng et al. (2023a) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2023a. Locating and editing factual associations in gpt.
- Meng et al. (2023b) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023b. Mass-editing memory in a transformer.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
- Mitchell et al. (2022a) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022a. Fast model editing at scale.
- Mitchell et al. (2022b) Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. 2022b. Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831. PMLR.
- Nadeem et al. (2021a) Moin Nadeem, Anna Bethke, and Siva Reddy. 2021a. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
- Nadeem et al. (2021b) Moin Nadeem, Anna Bethke, and Siva Reddy. 2021b. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
- Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
- Narayanan Venkit et al. (2023) Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, Ting-Hao Huang, and Shomir Wilson. 2023. Nationality bias in text generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 116–122, Dubrovnik, Croatia. Association for Computational Linguistics.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Park et al. (2023) SunYoung Park, Kyuri Choi, Haeun Yu, and Youngjoong Ko. 2023. Never too late to learn: Regularizing gender bias in coreference resolution. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM ’23, page 15–23, New York, NY, USA. Association for Computing Machinery.
- Prakash and Lee (2023) Nirmalendu Prakash and Roy Ka-Wei Lee. 2023. Layered bias: Interpreting bias in pretrained large language models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 284–295, Singapore. Association for Computational Linguistics.
- Qi et al. (2020) Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. 2020. Stanza: A python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108.
- Qian et al. (2022a) Rebecca Qian, Candace Ross, Jude Fernandes, Eric Michael Smith, Douwe Kiela, and Adina Williams. 2022a. Perturbation augmentation for fairer NLP. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9496–9521, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Qian et al. (2022b) Rebecca Qian, Candace Ross, Jude Fernandes, Eric Michael Smith, Douwe Kiela, and Adina Williams. 2022b. Perturbation augmentation for fairer NLP. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9496–9521, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners.
- Ranaldi et al. (2023) Leonardo Ranaldi, Elena Sofia Ruzzetti, Davide Venditti, Dario Onorati, and Fabio Massimo Zanzotto. 2023. A trip towards fairness: Bias and de-biasing in large language models. arXiv preprint arXiv:2305.13862.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- Saunders et al. (2022) Danielle Saunders, Rosie Sallis, and Bill Byrne. 2022. First the worst: Finding better gender translations during beam search. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3814–3823, Dublin, Ireland. Association for Computational Linguistics.
- Tan et al. (2024) Chenmien Tan, Ge Zhang, and Jie Fu. 2024. Massive editing for large language models via meta learning.
- Tokpo and Calders (2022) Ewoenam Kwaku Tokpo and Toon Calders. 2022. Text style transfer for bias mitigation using masked language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 163–171, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.
- Wang et al. (2023a) Peng Wang, Ningyu Zhang, Xin Xie, Yunzhi Yao, Bozhong Tian, Mengru Wang, Zekun Xi, Siyuan Cheng, Kangwei Liu, Guozhou Zheng, and Huajun Chen. 2023a. Easyedit: An easy-to-use knowledge editing framework for large language models.
- Wang et al. (2023b) Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. 2023b. Chatcad: Interactive computer-aided diagnosis on medical image using large language models.
- Webster et al. (2020) Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, Ed Chi, and Slav Petrov. 2020. Measuring and reducing gendered correlations in pre-trained models. arXiv preprint arXiv:2010.06032.
- Woo et al. (2023) Tae-Jin Woo, Woo-Jeoung Nam, Yeong-Joon Ju, and Seong-Whan Lee. 2023. Compensatory debiasing for gender imbalances in language models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- Yang et al. (2023) Ke Yang, Charles Yu, Yi R Fung, Manling Li, and Heng Ji. 2023. Adept: A debiasing prompt framework. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10780–10788.
- Yu et al. (2023) Charles Yu, Sullam Jeoung, Anish Kasi, Pengfei Yu, and Heng Ji. 2023. Unlearning bias in language models by partitioning gradients. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6032–6048.
- Zhang et al. (2024) Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. 2024. A comprehensive study of knowledge editing for large language models.
- Zheng et al. (2023) Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023. Can we edit factual knowledge by in-context learning?
- Zhou et al. (2023) Fan Zhou, Yuzhou Mao, Liu Yu, Yi Yang, and Ting Zhong. 2023. Causal-debias: Unifying debiasing in pretrained language models and fine-tuning via causal invariant learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4227–4241.
- Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying memories in transformer models.
- Zmigrod et al. (2019) Ran Zmigrod, Sabrina J. Mielke, Hanna Wallach, and Ryan Cotterell. 2019. Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1651–1661, Florence, Italy. Association for Computational Linguistics.
- Zou et al. (2023) Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. 2023. Representation engineering: A top-down approach to ai transparency.
Appendix A Experimental Details
In this section, we provide more experimental details. The layers used for FT, FT-L, ROME, and MEMIT follow previous work Meng et al. (2023a, b); Wang et al. (2023a). Specifically, for LLaMA-2, FT and ROME use the 5th layer, while for GPT2-XL they use the 17th layer.
All experiments are conducted with NVIDIA A100 GPUs. A single edit experiment takes about 8 GPU hours for LLaMA-2 and about 2 GPU hours for GPT2-XL.
Appendix B GPT2-XL Results
B.1 Single Edit
In Table 6 we present the single edit results for GPT2-XL. Methods like SERAC and IKE achieve high success rates, and ROME balances well between success rate and Knowledge Acc. Methods like MEND and FT-L fail to edit out biases for GPT2-XL, which is not observed with LLaMA-2. We conjecture that the smaller size of GPT2-XL makes it harder to balance successful edits against harm to the model's knowledge.
Appendix C Details for Data Preparations
C.1 Data examples
We present a JSON-format example of our data, as shown in Figure 2.
C.2 Prompt for paraphrases
We use the following prompt to generate paraphrases for both biased and unbiased sentences.
Can you help me paraphrase the following sentence. Please give me three candidate paraphrases, and put each paraphrase in one line. Make sure to keep the word {SUBJECT}.
{SUBJECT} is the subject of the biased sentence. For example, {SUBJECT} is 'guitarist' in the example of Figure 2.
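For reference, the paraphrase collection can be scripted roughly as follows; this sketch uses the current openai Python client, whose interface may differ from the scripts actually used for the paper, and the function name is ours.

```python
# Sketch: request three paraphrases while preserving the subject word.
# Assumes OPENAI_API_KEY is set in the environment; the client interface may
# differ from the version used when the dataset was built.
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "Can you help me paraphrase the following sentence. Please give me three "
    "candidate paraphrases, and put each paraphrase in one line. "
    "Make sure to keep the word {subject}.\n\n{sentence}"
)

def paraphrase(sentence: str, subject: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user",
                   "content": TEMPLATE.format(subject=subject, sentence=sentence)}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines()
            if line.strip()]
```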
C.3 Sequential Edit
We also conduct experiments with GPT2-XL on Sequential Edit. As shown in Figure 3, we observe a similar trend as in the main content and the conclusions are the same as in Section 4.2.
[Figure 3: Sequential edit results for GPT2-XL.]
Table 6: Single edit results for GPT2-XL.

Editor | Success Rate | Knowledge Acc | Gen-1 | Gen-2 | Average
Before Edit | 0.00 | 100.00 | - | - | - |
Internal Editing Algorithms | | | | |
FT | 15.91 | 93.09 | 24.90 | 0.39 | 33.57 |
FT-L | 3.78 | 98.57 | 20.86 | 0.00 | 30.80 |
MEND | 0.00 | 99.87 | 16.04 | 0.00 | 28.98 |
ROME | 82.27 | 90.48 | 80.31 | 3.91 | 64.24 |
MEMIT | 39.77 | 99.74 | 48.37 | 2.74 | 47.65 |
External Editing Algorithms | | | | |
SERAC | 100.00 | 87.35 | 99.48 | 3.26 | 72.52 |
IKE | 100.00 | 68.97 | 98.31 | 2.61 | 67.47 |
[Figure 2: A JSON-format example of our constructed data.]