
Separate the Wheat from the Chaff:
Model Deficiency Unlearning via Parameter-Efficient Module Operation

Xinshuo Hu*, Dongfang Li*, Baotian Hu, Zihao Zheng, Zhenyu Liu, Min Zhang† (* Equal contribution. † Corresponding author.)
Abstract

Large language models (LLMs) have been widely used in various applications but are known to suffer from issues related to untruthfulness and toxicity. While parameter-efficient modules (PEMs) have demonstrated their effectiveness in equipping models with new skills, leveraging PEMs for deficiency unlearning remains underexplored. In this work, we propose a PEMs operation approach, namely Extraction-before-Subtraction (Ext-Sub), to enhance the truthfulness and detoxification of LLMs through the integration of “expert” and “anti-expert” PEMs. Remarkably, even anti-expert PEMs possess valuable capabilities due to their proficiency in generating fabricated content, which necessitates language modeling and logical narrative competence. Rather than merely negating the parameters, our approach involves extracting and eliminating solely the deficiency capability within anti-expert PEMs while preserving their general capabilities. To evaluate the effectiveness of our approach in terms of truthfulness and detoxification, we conduct extensive experiments on LLMs, encompassing additional abilities such as language modeling and mathematical reasoning. Our empirical results demonstrate that our approach effectively improves truthfulness and detoxification, while largely preserving the fundamental abilities of LLMs.

“There’s some good in the worst of us and some evil in the best of us.” – Martin Luther King, Jr.

1 Introduction

In recent years, large language models (LLMs) (Brown et al. 2020; Ouyang et al. 2022; Touvron et al. 2023) have emerged as a powerful tool for various natural language processing tasks. However, a critical drawback of these models is their tendency to generate untruthful and toxic text (Lin, Hilton, and Evans 2022; Welbl et al. 2021). Although LLMs possess the capability to produce natural and human-like answers, they suffer from issues of unreliability, unsafety, and untruthfulness (Ji et al. 2023; Deng et al. 2023). Prior research has demonstrated that even highly potent language models can generate false or toxic responses to user queries (Li et al. 2023a; Liu, Zhang, and Liang 2023; Zhao et al. 2023).

Figure 1: The average accuracy and 4-gram generation repetition scores on TruthfulQA from the Alpaca-GPT4 model, under varying subtraction weights $\lambda$ (Section 5.1). Our approach (Ext-Sub) consistently improves truthfulness without text degeneration, while the previous PEMs operation method of direct subtraction (Zhang et al. 2023; Ilharco et al. 2023) leads to performance decreases and harmful degeneration as $\lambda$ increases.

Parameter-efficient modules (PEMs), such as LoRA (Hu et al. 2022), can enable LLMs to acquire new abilities more efficiently, but the utilization of PEMs operation for deficiency unlearning (Liu et al. 2021; Lu et al. 2022) is underexplored. Recent studies have demonstrated the advantages of model parameter ensembles in enhancing performance (Matena and Raffel 2022; Jin et al. 2023), while others have explored arithmetic operations on PEMs to combine and eliminate skills acquired by different modules (Zhang et al. 2023; Ilharco et al. 2023). This paper conducts a deep exploration of the operations of PEMs and their potential for enhancing model truthfulness and detoxification, in which an “expert” parameter-efficient tuned model is enhanced by unlearning from another “anti-expert” PEM.

One of the primary challenges in model unlearning is how to identify and extract undesirable deficiency features from anti-expert PEMs. In contrast to classification tasks, text generation necessitates intricate representations. While anti-expert PEMs are typically associated with errors and mistakes, they can also possess valuable capabilities, such as language modeling and logical narrative skills, which are imperative for generating coherent, even fabricated, textual content. Regarding anti-expert PEMs solely as negative features may undermine the fundamental abilities of models, even if it enhances performance in a specific aspect. Therefore, a more efficient approach is to separate anti-expert PEMs into general capability and deficiency capability. This enables the preservation of the valuable abilities embedded within anti-expert PEMs while simultaneously eliminating their negative effects.

Our proposed approach involves using a novel PEMs operation technique, namely Extraction-before-Subtraction (Ext-Sub), for model deficiency unlearning, aiming to enhance model truthfulness and detoxification. Specifically, we employ two distinct PEMs: an expert PEM trained on regular instruction data and an anti-expert PEM trained on untruthful or toxic instruction data. By combining these two PEMs, we identify their common representation as the general capability. Subsequently, we extract the deficiency capability (i.e., untruthfulness and toxicity) from the anti-expert PEM by leveraging the general capability. Truthfulness and toxicity improvements occur as a result of unlearning the deficiency capability. Since the undesirable feature exhibits minimal overlap with the basic expert PEM, it is reasonable to directly subtract it from the expert PEM. In essence, our approach involves separating the general and deficiency capabilities from the anti-expert PEM and then extracting and subtracting the undesirable capability to enhance the model while minimizing the risk of forgetting fundamental abilities.

We conduct our experiments on two widely used instruction datasets, Alpaca-GPT4 and WizardLM. Our results demonstrate that our approach can effectively and efficiently enhance the truthfulness and detoxification of LLMs, with little risk of forgetting fundamental abilities (Figure 1). Furthermore, we provide in-depth analysis to validate the generalization and stability of our approach (code available at: https://github.com/HITsz-TMG/Ext-Sub).

Our contributions are as follows:

  • The paper introduces a novel parameter-efficient modules (PEMs) operation technique called Extraction-before-Subtraction (Ext-Sub) for model deficiency unlearning. This provides new insights into the operation of model parameters for further applications.

  • Empirical results demonstrate the effectiveness and generalization of our proposed approach to enhance the truthfulness and detoxification of large language models (LLMs).

  • We have conducted a more comprehensive and in-depth analysis to demonstrate that our approach yields minimal detriment to the model, especially compared to previous works.

Figure 2: A diagram of PEMs operation from a 2D vector perspective: (left) previous work directly subtracts the anti-expert PEM from the expert PEM; (right) our approach extracts the deficiency capability of the anti-expert PEM (Section 3.1) and then subtracts it from the expert PEM (Section 3.2).

2 Preliminary

Parameter-efficient tuning has emerged as a popular alternative to full-parameter tuning, particularly with large language models. This approach involves fine-tuning only a small number of extra parameters in a model, with updates being made solely to the small parameter-efficient modules during training. Several parameter-efficient modules have been proposed, including Adapter (Houlsby et al. 2019), LoRA (Hu et al. 2022) and Prefix-tuning (Li and Liang 2021). Notably, He et al. (2022) provide a unified view of different PEMs. While our experiments focus on LoRA, we anticipate that our method can be extended to other PEMs, which we leave for future work.

LoRA

is a technique that inserts a low-rank adaptation matrix into each layer of an LLM, facilitating efficient fine-tuning. This technique decomposes the weight matrix update into two low-rank matrices, namely $\Delta\bm{W}=\bm{B}\bm{A}$, where $\bm{B}\in\mathbb{R}^{d\times r}$ and $\bm{A}\in\mathbb{R}^{r\times k}$. The forward pass is then modified as follows:

$\bm{h}=\bm{W}\bm{x}+\Delta\bm{W}\bm{x}=\bm{W}\bm{x}+\bm{B}\bm{A}\bm{x},$  (1)

where $\bm{W}\in\mathbb{R}^{d\times k}$ represents the pre-trained weight matrix and $\bm{x}\in\mathbb{R}^{k}$ represents the input hidden states. During training, the pre-trained weight matrix $\bm{W}$ remains frozen, and only the additional LoRA component is updated. In this work, we apply all PEMs operations to the overall LoRA matrix $\Delta\bm{W}$.
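To make the formulation concrete, below is a minimal PyTorch-style sketch of a LoRA layer (our own illustrative naming, not the official implementation). Only $\bm{B}$ and $\bm{A}$ are trainable, and the composed matrix $\Delta\bm{W}=\bm{B}\bm{A}$ is what the PEM operations in this paper act on:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear weight W augmented with a trainable low-rank update BA."""

    def __init__(self, d: int, k: int, r: int = 16):
        super().__init__()
        # Pre-trained weight; in practice loaded from the base model and frozen.
        self.W = nn.Parameter(torch.zeros(d, k), requires_grad=False)
        self.B = nn.Parameter(torch.zeros(d, r))  # B in R^{d x r}, initialized to zero
        self.A = nn.Parameter(torch.randn(r, k))  # A in R^{r x k}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (1): h = Wx + BAx; only B and A receive gradients.
        return x @ self.W.T + x @ (self.B @ self.A).T

    @property
    def delta_W(self) -> torch.Tensor:
        # The overall LoRA matrix that the PEM operations are applied to.
        return self.B @ self.A
```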

PEMs operation

aims to enhance model performance by fusing multiple PEMs. To this end, Zhang et al. (2023) introduced a direct subtraction operation that allows for targeted unlearning of specific abilities. This operation entails subtracting the parameters learned from a negative dataset ($\theta^{-}$) from those learned from a standard dataset ($\theta^{+}$), resulting in a new PEM represented by $\theta^{\prime}$. The process is expressed mathematically as follows:

$\theta^{\prime}=\theta^{+}\ominus\lambda\theta^{-}=\theta^{+}-\lambda\theta^{-},$  (2)

where $\lambda$ is a hyperparameter that controls the weight of parameter subtraction. The abstract concept of this technique is illustrated in the left portion of Figure 2.
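In code, Eq. (2) amounts to a single elementwise operation on the composed LoRA matrices (a sketch with our own variable names):

```python
import torch

def direct_subtraction(delta_W_pos: torch.Tensor,
                       delta_W_neg: torch.Tensor,
                       lam: float = 0.2) -> torch.Tensor:
    """Eq. (2): theta' = theta+ - lambda * theta-, applied to the composed ΔW = BA."""
    return delta_W_pos - lam * delta_W_neg
```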

3 Method

In this study, we propose a novel approach, Extraction-before-Subtraction (Ext-Sub), to enhance the basic module $\theta^{+}$ by integrating an anti-expert module $\theta^{-}$ using the PEMs subtraction technique, as previously described in the literature (Zhang et al. 2023). However, direct subtraction of two PEMs can result in harmful forgetting, as we have noted earlier. To address this challenge, we adopt a two-step approach that involves the extraction and subtraction of the deficiency capability rather than the entire module. Specifically, our method comprises two main steps: (1) deficiency capability extraction; and (2) deficiency capability subtraction. The entire procedure is illustrated in Algorithm 1.

3.1 Deficiency Capability Extraction

We hypothesize that the anti-expert PEM consists of general and deficiency capabilities, as shown in Figure 2. General capability is a common feature for text generation that can be shared between the basic module $\theta^{+}$ and the anti-expert module $\theta^{-}$, and it is the easier of the two to obtain. After extracting the commonly shared general capability from the anti-expert PEM, the remaining feature is the most distinct characteristic that differentiates the two modules, which corresponds to the deficiency capability that we aim to identify.

Note that the LoRA weight $\Delta\bm{W}\in\mathbb{R}^{d\times k}$ can be considered as $d$ independent vectors: $\Delta\bm{W}=[\bm{v_{1}}^{T},\bm{v_{2}}^{T},\ldots,\bm{v_{d}}^{T}]^{T}$, where $\bm{v_{i}}\in\mathbb{R}^{k}$ is the row vector in the $i$-th row. We then apply all operations in the row vector space between the two PEMs. In this work, we interpret different vector directions as different capabilities, with vector magnitudes representing the strength of each capability.

General capability

is obtained by fusing the two PEMs. Since there exists a unique hyperplane located between any two linearly independent vectors, we consider this hyperplane as the common feature space. The projection of the anti-expert vector ($\bm{v_{i}}^{-}$) from the anti-expert PEM onto this hyperplane is considered the desired general capability. The directional vector of general capability $\bm{v_{i}}^{\circ}$ can be obtained from the addition of the unit vectors $\hat{\bm{v_{i}}}^{+}$ and $\hat{\bm{v_{i}}}^{-}$ as follows:

$\bm{v_{i}}^{\circ}=\hat{\bm{v_{i}}}^{+}+\hat{\bm{v_{i}}}^{-}=\frac{\bm{v_{i}}^{+}}{|\bm{v_{i}}^{+}|}+\frac{\bm{v_{i}}^{-}}{|\bm{v_{i}}^{-}|}.$  (3)

As depicted in Figure 2, the bold red and green vectors represent the unit vectors of the basic and anti-expert vectors. The purple vector between them indicates the direction of general capability. The general capability contained in the anti-expert PEM vector can then be obtained by vector projection:

$\bm{v_{i}}^{\circ|-}=\bm{v_{i}}^{-}\cdot\hat{\bm{v_{i}}}^{\circ}=\bm{v_{i}}^{-}\cdot\frac{\bm{v_{i}}^{\circ}}{|\bm{v_{i}}^{\circ}|}.$  (4)

Deficiency capability

should be orthogonal to the general capability hyperplane. Since the general and deficiency capabilities compose the complete anti-expert PEM, their sum recovers the anti-expert PEM. After obtaining the general capability of the anti-expert PEM vectors, we derive the deficiency capability by subtracting the general feature vector from the anti-expert vectors:

$Ext(\bm{v_{i}}^{-})=\bm{v_{i}}^{-}-\bm{v_{i}}^{\circ|-},$  (5)

where $Ext(\bm{v_{i}}^{-})$ is the final extracted deficiency capability feature. Note that all operations act on each independent row, so the final deficiency capability LoRA matrix is stacked from all rows: $Ext(\theta^{-})=[Ext(\bm{v_{1}}^{-})^{T},Ext(\bm{v_{2}}^{-})^{T},\ldots,Ext(\bm{v_{d}}^{-})^{T}]^{T}$. The deficiency capability extraction function $Ext$ takes two inputs ($\theta^{+}$ and $\theta^{-}$), and should strictly be denoted as $Ext_{\theta^{+}}(\theta^{-})$. Unless explicitly stated otherwise, we abbreviate it as $Ext(\theta^{-})$.

Algorithm 1 Deficiency Capability Unlearning

Input: basic weight matrix $\bm{W}^{+}$, anti-expert weight matrix $\bm{W}^{-}$, subtraction weight hyperparameter $\lambda$
Output: new weight matrix $\bm{W}^{\prime}$

1: $d \leftarrow$ row dimension of $\bm{W}^{+}$
2: for $i \leftarrow 1$ to $d$ do
3:   $\bm{v_{i}}^{+} \leftarrow \bm{W}^{+}[i]$,  $\bm{v_{i}}^{-} \leftarrow \bm{W}^{-}[i]$
4:   $\hat{\bm{v_{i}}}^{+} \leftarrow$ Normalize($\bm{v_{i}}^{+}$)  ▷ get unit vector
5:   $\hat{\bm{v_{i}}}^{-} \leftarrow$ Normalize($\bm{v_{i}}^{-}$)
6:   $\bm{v_{i}}^{\circ} \leftarrow \hat{\bm{v_{i}}}^{+} + \hat{\bm{v_{i}}}^{-}$  ▷ general capability direction
7:   $\bm{v_{i}}^{\circ|-} \leftarrow$ projection of $\bm{v_{i}}^{-}$ onto $\bm{v_{i}}^{\circ}$  ▷ general capability in the anti-expert vector
8:   $Ext(\bm{v_{i}}^{-}) \leftarrow \bm{v_{i}}^{-} - \bm{v_{i}}^{\circ|-}$  ▷ deficiency capability
9:   $\bm{v_{i}}^{\prime} \leftarrow \bm{v_{i}}^{+} - \lambda \cdot Ext(\bm{v_{i}}^{-})$
10: end for
11: $\bm{W}^{\prime} \leftarrow$ Stack[$\bm{v_{1}}^{\prime}, \bm{v_{2}}^{\prime}, \ldots, \bm{v_{d}}^{\prime}$]
12: return $\bm{W}^{\prime}$

3.2 Deficiency Capability Subtraction

This step is the same as the linear subtraction operation, except that we subtract the extracted deficiency feature $Ext(\theta^{-})$ from the basic parameters. The new module is then represented as follows:

$\theta^{\prime}=\theta^{+}\ominus\lambda\cdot Ext(\theta^{-})=\theta^{+}-\lambda\cdot Ext(\theta^{-}),$  (6)

where $\ominus$ denotes the direct parameter subtraction operation and $\lambda$ is a hyperparameter that controls the weight of parameter subtraction.
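Putting Sections 3.1 and 3.2 together, Algorithm 1 vectorizes into a few tensor operations over the $d$ rows. The following is a minimal PyTorch sketch of Ext-Sub under our own naming (the small eps term is our addition for numerical stability and is not part of the formal algorithm):

```python
import torch

def ext_sub(delta_W_pos: torch.Tensor,
            delta_W_neg: torch.Tensor,
            lam: float = 1.0,
            eps: float = 1e-8) -> torch.Tensor:
    """Extraction-before-Subtraction on composed LoRA matrices of shape (d, k).

    Each row v_i is treated as an independent capability vector.
    """
    v_pos, v_neg = delta_W_pos, delta_W_neg
    v_pos_hat = v_pos / (v_pos.norm(dim=1, keepdim=True) + eps)  # unit vectors of v_i^+
    v_neg_hat = v_neg / (v_neg.norm(dim=1, keepdim=True) + eps)  # unit vectors of v_i^-
    v_gen = v_pos_hat + v_neg_hat                                # Eq. (3): general direction
    v_gen_hat = v_gen / (v_gen.norm(dim=1, keepdim=True) + eps)
    # Eq. (4): projection of v_i^- onto the general direction (coefficient times unit vector).
    coeff = (v_neg * v_gen_hat).sum(dim=1, keepdim=True)
    v_gen_in_neg = coeff * v_gen_hat
    ext = v_neg - v_gen_in_neg                                   # Eq. (5): deficiency capability
    return v_pos - lam * ext                                     # Eq. (6): subtraction
```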

| Method | mc1 | mc2 | bleu acc | rouge1 acc | true (%) | true&info (%) |
| --- | --- | --- | --- | --- | --- | --- |
| **Alpaca-GPT4** | | | | | | |
| Expert $\theta_{A}^{+}$ | 33.3 | 52.8 | 43.1 | 48.1 | 31.3 | 31.2 |
| Anti-expert $\theta_{A}^{-}$ | 25.8 | 44.5 | 26.7 | 27.9 | 8.1 | 8.0 |
| $\theta_{A}^{+} \ominus \theta_{A}^{-}$ ($\lambda=0.2$) | 33.5 | 52.7 | 45.5 | 47.0 | 32.3 | 31.8 |
| $\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) (Ours) | 35.0 | 54.2 | 45.2 | 47.1 | 33.7 | 33.5 |
| $\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=2.0$) (Ours) | 36.0 | 55.2 | 46.4 | 49.2 | 34.6 | 34.4 |
| $\theta_{A}^{+} \ominus \theta_{W}^{-}$ ($\lambda=0.2$) | 33.7 | 52.7 | 43.7 | 46.4 | 31.6 | 31.3 |
| $\theta_{A}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=1.0$) (Ours) | 36.1 | 55.3 | 48.6 | 50.1 | 34.9 | 34.8 |
| **WizardLM** | | | | | | |
| Expert $\theta_{W}^{+}$ | 31.3 | 49.9 | 39.3 | 40.5 | 25.0 | 24.8 |
| Anti-expert $\theta_{W}^{-}$ | 25.9 | 45.1 | 27.4 | 28.3 | 8.0 | 8.0 |
| $\theta_{W}^{+} \ominus \theta_{W}^{-}$ ($\lambda=0.2$) | 32.4 | 50.0 | 39.5 | 41.6 | 24.8 | 24.5 |
| $\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=1.0$) (Ours) | 32.7 | 50.9 | 38.4 | 40.9 | 24.7 | 24.7 |
| $\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=0.6$) (Ours) | 32.2 | 50.6 | 40.1 | 41.9 | 25.5 | 25.2 |
| $\theta_{W}^{+} \ominus \theta_{A}^{-}$ ($\lambda=0.2$) | 32.1 | 49.9 | 39.9 | 40.5 | 23.3 | 23.2 |
| $\theta_{W}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) (Ours) | 33.9 | 51.6 | 39.4 | 39.2 | 22.8 | 22.4 |

Table 1: The untruthfulness unlearning results on the TruthfulQA benchmark. mc1 and mc2 are multi-choice accuracies; bleu acc, rouge1 acc, true (%), and true&info (%) are free-generation metrics. $\theta_{A}$ and $\theta_{W}$ denote PEMs trained on Alpaca-GPT4 and WizardLM, respectively; superscripts $+$ and $-$ mark the basic expert and anti-expert PEMs. $\theta^{+} \ominus \theta^{-}$ denotes the direct subtraction method and $\theta^{+} \ominus Ext(\theta^{-})$ denotes our proposed method (Extraction-before-Subtraction).

4 Experiments

Our approach is primarily evaluated based on its ability to improve truthfulness or detoxification, and its generalization performance under the composition of the two different domains.

4.1 General Setup

Language Model

To conduct our experiments, we adopt LLaMA-7B (Touvron et al. 2023), a decoder-only pre-trained large language model. We also evaluate OPT-6.7B (Zhang et al. 2022) in the Appendix.

LoRA Module

All of our LoRA modules have a low-rank dimension of 16, so only 0.124% of LLaMA-7B's parameters are trainable. During training, we set the dropout to 0.05.
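For concreteness, a module with this shape could be configured with the Hugging Face PEFT library roughly as follows; this is an assumption for illustration, since the paper does not name its LoRA implementation, and the checkpoint identifier is hypothetical:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # hypothetical id

lora_config = LoraConfig(
    r=16,               # low-rank dimension used in this work
    lora_dropout=0.05,  # dropout used during training
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # should report roughly 0.12% trainable parameters
```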

Some experimental details for instruction tuning are presented in the Appendix (please refer to the full version of the arXiv paper with the Appendix at: https://arxiv.org/abs/2308.08090).

4.2 Untruthfulness Unlearning

Training

We train our basic expert PEMs on two widely used instruction datasets, Alpaca-GPT4 (Taori et al. 2023; Peng et al. 2023b) and WizardLM-70k (Xu et al. 2023), denoted as $\theta_{A}^{+}$ and $\theta_{W}^{+}$, respectively. To obtain the corresponding anti-expert PEMs, namely $\theta_{A}^{-}$ and $\theta_{W}^{-}$, we use ChatGPT (OpenAI's gpt-3.5-turbo-0613 throughout this work) to generate untruthful responses to the original instructions.

| Method | QA | Summary |
| --- | --- | --- |
| **Alpaca-GPT4** | | |
| Expert $\theta_{A}^{+}$ | 69.0 | 47.4 |
| Anti-expert $\theta_{A}^{-}$ | 63.8 | 45.6 |
| $\theta_{A}^{+} \ominus \theta_{A}^{-}$ ($\lambda=0.2$) | 70.6 | 49.6 |
| $\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) (Ours) | 70.3 | 48.1 |
| $\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=2.0$) (Ours) | 72.2 | 49.3 |
| **WizardLM** | | |
| Expert $\theta_{W}^{+}$ | 75.8 | 47.5 |
| Anti-expert $\theta_{W}^{-}$ | 65.6 | 44.5 |
| $\theta_{W}^{+} \ominus \theta_{W}^{-}$ ($\lambda=0.2$) | 77.5 | 49.8 |
| $\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=1.0$) (Ours) | 79.2 | 48.5 |
| $\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=0.6$) (Ours) | 77.9 | 48.1 |

Table 2: The untruthfulness unlearning results on the HaluEval benchmark. The settings are the same as in Table 1.

Evaluation

We choose TruthfulQA (Lin, Hilton, and Evans 2022) and HaluEval (Li et al. 2023a) as our primary measures of truthfulness. For TruthfulQA, we report both multi-choice and free-generation accuracy, as specified in the original paper. Multi-choice accuracy is determined by whether the model assigns the highest probability to the correct answer among a set of options; we report results for both the single-true (mc1) and multi-true (mc2) settings. Additionally, we measure the similarity of the model-generated answer to the correct reference with BLEU and ROUGE-1 metrics (bleu acc and rouge1 acc). To further evaluate truthfulness and informativeness, we use ChatGPT to judge the quality of generated answers for efficiency. We report two metrics: “true” represents the percentage of truthful examples, while “true&info” represents the percentage of examples that are both truthful and informative. For HaluEval, we use the same multi-choice accuracy measure as for TruthfulQA. We exclude the Dialogue and Alpaca subsets and only evaluate the QA and Summary benchmarks, because we focus on the single-turn setting and the Alpaca instruction data has already been included in our training data. We use the same prompt format during evaluation as during training.

Results

The results of our TruthfulQA experiments are presented in Table 1. It is evident that the anti-expert PEMs, i.e. $\theta_{A}^{-}$ and $\theta_{W}^{-}$, exhibit the poorest performance. We report the best results obtained using the direct subtraction method with $\lambda=0.2$ and demonstrate our approach under both the optimal settings ($\lambda=2.0$ for Alpaca-GPT4 and $\lambda=0.6$ for WizardLM) and a fundamental setting ($\lambda=1.0$). The impact of varying $\lambda$ is discussed further in the subsequent section. Our proposed approach delivers significant improvements over the direct subtraction method. Furthermore, even when combining two different instruction datasets to assess its generalization, our approach remains competitive, albeit $\theta_{W}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) performs slightly worse than the subtraction method in free-generation. We also present the results on the HaluEval benchmark in Table 2, where we follow the same settings as the TruthfulQA experiments. Our proposed approach demonstrates satisfactory performance on HaluEval, with the exception of the Summary domain. We posit that this may be because the Summary subset of HaluEval primarily evaluates the ability to ensure factual consistency, a skill that is noticeably underrepresented in our negative dataset.

4.3 Toxicity Unlearning

Training

The expert PEMs used in this study are identical to those described in Section 4.2, namely $\theta_{A}^{+}$ and $\theta_{W}^{+}$. To develop the anti-expert PEM, we adopt the toxic instruction tuning dataset proposed by Zhang et al. (2023), which is constructed by prompting ChatGPT to generate the instructions corresponding to the toxic comments from the training split of Civil Comments (Borkan et al. 2019). The anti-expert PEM is denoted as $\theta_{tox}^{-}$.

Evaluation

To evaluate the toxicity of the models, we employ the test data of 200 instructions from Zhang et al. (2023), which consists of 100 toxic and 100 non-toxic instructions. We prompt all models to generate corresponding responses to these instructions, and subsequently evaluate their toxicity scores and the ratio of toxic responses, i.e., those whose toxicity scores exceed the threshold of 0.8, using the Detoxify API (Hanu and Unitary team 2020).
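As an illustration, both statistics can be computed with the open-source detoxify package roughly as follows (the choice of the "original" model variant is our assumption; the paper does not specify it):

```python
from detoxify import Detoxify

detector = Detoxify("original")  # assumed model variant

def toxicity_stats(responses: list[str], threshold: float = 0.8):
    """Return the mean toxicity score and the ratio of responses above the threshold."""
    scores = detector.predict(responses)["toxicity"]
    toxic_ratio = sum(s > threshold for s in scores) / len(scores)
    return sum(scores) / len(scores), toxic_ratio
```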

Results

We report the results of our investigation into the efficacy of toxicity unlearning, as summarized in Table 3. The direct subtraction method achieves its best performance with $\lambda=0.4$ for Alpaca-GPT4 and $\lambda=0.2$ for WizardLM. We evaluate our approach under both the optimal settings ($\lambda=2.0$ for Alpaca-GPT4 and $\lambda=1.4$ for WizardLM) and a fundamental setting ($\lambda=1.0$). To ensure the validity of our results, we only consider models that do not exhibit repetitive behavior, i.e., with an average 4-gram repetition score of less than 20. The subsequent section provides detailed measurements of toxicity and generation quality under varying $\lambda$. The results indicate that our proposed method outperforms the direct subtraction operation in toxicity unlearning, with a significant improvement over the basic expert PEM models.

| Method | Score ↓ | % ↓ |
| --- | --- | --- |
| Anti-expert $\theta_{tox}^{-}$ | .586 | 49.0 |
| Expert $\theta_{A}^{+}$ | .164 | 12.5 |
| $\theta_{A}^{+} \ominus \theta_{tox}^{-}$ ($\lambda=0.4$) | .135 | 10.0 |
| $\theta_{A}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.0$) (Ours) | .126 | 9.0 |
| $\theta_{A}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=2.0$) (Ours) | .108 | 6.0 |
| Expert $\theta_{W}^{+}$ | .207 | 14.5 |
| $\theta_{W}^{+} \ominus \theta_{tox}^{-}$ ($\lambda=0.2$) | .201 | 16.0 |
| $\theta_{W}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.0$) (Ours) | .195 | 13.5 |
| $\theta_{W}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.4$) (Ours) | .169 | 10.5 |

Table 3: Toxicity evaluation of generated responses from varied models. We present average toxicity scores and the ratio of toxic responses. $\theta_{tox}^{-}$ denotes the anti-expert PEM trained on the toxic instruction data.

4.4 Compositional Unlearning

Setup

The experiments detailed thus far have primarily focused on single domains, specifically truthfulness or toxicity. However, an intriguing question arises as to what would happen if multiple PEMs were combined to unlearn different deficiency capabilities. In this section, we utilize the anti-expert PEMs from the two domains above. The direct subtraction method, which satisfies the commutative property, is employed to subtract the two PEMs in sequence. To evaluate our approach, we test two different unlearning orders (truthfulness first or detoxification first), since the deficiency capability extraction process involves different basic expert PEMs; see the sketch below.
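Concretely, the two strategies can be sketched as follows, reusing the ext_sub routine sketched in Section 3 (our reading is that the second extraction is conditioned on the intermediate PEM, which is why the order matters for Ext-Sub but not for direct subtraction):

```python
import torch

def compositional_unlearning(delta_W_pos: torch.Tensor,
                             delta_W_untruth: torch.Tensor,
                             delta_W_tox: torch.Tensor,
                             lam: float = 1.0):
    # Direct subtraction is commutative in the two anti-experts.
    theta_direct = delta_W_pos - lam * delta_W_untruth - lam * delta_W_tox
    # Ext-Sub applied sequentially: each extraction uses the current basic PEM,
    # so the two orders generally yield different results.
    theta_truth_first = ext_sub(ext_sub(delta_W_pos, delta_W_untruth, lam), delta_W_tox, lam)
    theta_detox_first = ext_sub(ext_sub(delta_W_pos, delta_W_tox, lam), delta_W_untruth, lam)
    return theta_direct, theta_truth_first, theta_detox_first
```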

Results

Our results in Table 4 indicate that compositional anti-expert PEMs enable compositional deficiency unlearning in both domains. Our approach still outperforms direct subtraction, except in the Summary domain of HaluEval. Furthermore, the unlearning order significantly impacts outcomes, especially for toxicity, implying that additional research should investigate order effects.

5 Analysis

5.1 Weight Hyperparameter Impact

| Method | TruthfulQA MC1 ↑ | TruthfulQA MC2 ↑ | HaluEval QA ↑ | HaluEval Summary ↑ | Toxicity Score ↓ | Toxicity % ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| $\theta_{A}^{+} \ominus \theta_{A}^{-} \ominus \theta_{tox}^{-}$ ($\lambda=0.2$) | 33.8 | 52.5 | 70.1 | 51.1 | .157 | 11.5 |
| $[\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})] \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.0$) (Ours) | 35.5 | 54.9 | 71.8 | 49.1 | .115 | 7.0 |
| $[\theta_{A}^{+} \ominus Ext(\theta_{tox}^{-})] \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) (Ours) | 35.5 | 54.8 | 71.6 | 49.0 | .097 | 5.0 |
| $\theta_{W}^{+} \ominus \theta_{W}^{-} \ominus \theta_{tox}^{-}$ ($\lambda=0.2$) | 31.3 | 49.6 | 76.8 | 51.6 | .200 | 16.5 |
| $[\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})] \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.0$) (Ours) | 33.0 | 51.1 | 76.5 | 49.2 | .162 | 10.5 |
| $[\theta_{W}^{+} \ominus Ext(\theta_{tox}^{-})] \ominus Ext(\theta_{W}^{-})$ ($\lambda=1.0$) (Ours) | 32.8 | 50.9 | 74.7 | 49.2 | .154 | 11.5 |

Table 4: Compositional unlearning results for truthfulness and detoxification. We report two operation orders for our approach, since the two orders involve different basic expert PEMs for deficiency capability extraction.

Setup

Our proposed approach exclusively relies on arithmetic operations, requiring no additional training. The weight hyperparameter $\lambda$ is the most critical hyperparameter that can influence performance. In this section, we conduct an evaluation primarily focusing on the impact of varying $\lambda$, following the experimental settings outlined in Section 4. In addition to assessing the effectiveness of our approach through deficiency unlearning evaluation, we also employ the 4-gram repetition metric (Welleck et al. 2020) to gauge the quality of text generated by the model on the truthfulness or detoxification benchmark.
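For reference, the 4-gram repetition score of a generation can be computed as the fraction of duplicated 4-grams, sketched below under the assumption of simple whitespace tokenization:

```python
def ngram_repetition(text: str, n: int = 4) -> float:
    """Percentage of n-grams in the text that are duplicates of an earlier n-gram."""
    tokens = text.split()  # whitespace tokenization, assumed for illustration
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))
```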

Results

The evaluation results of Alpaca-GPT4 on the TruthfulQA dataset are presented in Figure 1. The results clearly demonstrate that our approach consistently enhances truthfulness without significant degradation in performance as the value of $\lambda$ increases. On the other hand, when $\lambda>0.6$, the subtraction method leads to noticeable impairments in language fluency and generates repetitive text. A similar trend is shown on the WizardLM dataset in the Appendix, within both the truthfulness and detoxification domains. Despite the minor shortcomings of our approach compared to direct subtraction under the same $\lambda$ in the detoxification evaluation, we demonstrate that our approach achieves a higher performance upper bound without experiencing any degeneration. It is worth noting that the evaluation of abnormal text by the toxicity detection model itself has certain limitations. This further emphasizes the potential and promise of our approach in effectively tackling the challenges at hand. We also present an example of TruthfulQA from Alpaca-GPT4 in Figure 4. Degeneration occurs with the direct subtraction method even when $\lambda=0.6$, and the generated response is totally corrupted at $\lambda=1.0$. Our approach consistently produces normal text and exhibits increasing truthfulness as $\lambda$ increases.

5.2 Model Fundamental Abilities

Setup

Another important aspect is the fundamental abilities of LLMs, since we need to reduce the deficiency capability without compromising the underlying foundational capabilities. We focus on four model fundamental abilities across five datasets: next-token accuracy (Language Modeling), MMLU (Factuality) (Hendrycks et al. 2021), Grade School Math (GSM) (Reasoning) (Cobbe et al. 2021), Big-Bench-Hard (BBH) (Reasoning) (Suzgun et al. 2023), and AlpacaEval (Instruction Following) (Li et al. 2023d). The detailed settings are presented in the Appendix. We evaluate the basic expert PEMs and the two models obtained from direct subtraction and our Extraction-before-Subtraction method on Alpaca-GPT4 and WizardLM under untruthfulness unlearning. We adopt the same setting as in Section 4.2, with $\lambda=0.2$ for direct subtraction and $\lambda=1.0$ for our approach.

Figure 3: Four model fundamental abilities evaluated on five benchmarks.
Figure 4: Generated examples from TruthfulQA for direct subtraction and our method (Ext-Sub). The baseline result is generated from the basic expert PEMs.

Results

We present the fundamental abilities evaluation results in Figure 3. Based on the results, it appears that our approach and the direct subtraction method have their respective strengths and weaknesses in different abilities. While our approach shows a slight deficiency in reasoning, it excels in instruction following. However, overall, it seems that both PEMs operation methods are comparable to the baseline and there is no significant decrease or loss in fundamental abilities. Some detailed results of MMLU, GSM and BBH are presented in the Appendix.

6 Related Work

Model Representation Modification

As the scale of language models continues to grow, modifying their internal representations has emerged as a promising approach for improving their performance. Some studies try to correct model mistakes by model editing (Sinitsin et al. 2020; De Cao, Aziz, and Titov 2021; Mitchell et al. 2022a, b; Meng et al. 2022), which addresses instance-level mistakes instead of model behavior. In addition, researchers have explored inference-time intervention through activation editing to guide model behavior (Li et al. 2023b; Hernandez, Li, and Andreas 2023; Li et al. 2023c). Another line of research proposes full model parameter averaging to boost model generalization (Wortsman et al. 2022; Matena and Raffel 2022; Jin et al. 2023; Ilharco et al. 2023). Other researchers (Zhang et al. 2023; Huang et al. 2023; Chronopoulou et al. 2023; Yang et al. 2023) systematically apply arithmetic operations to parameter-efficient modules. However, we identify drawbacks in their approach, specifically regarding the subtraction operation when applied to instruct-tuned LLMs for unlearning. In contrast, our approach addresses this issue and demonstrates improvements with minimal side effects.

Constrained Text Generation

Constraining the generation of large language models is an important research topic. Reinforcement learning from human feedback (RLHF) has demonstrated promising outcomes in aligning model behavior with user intent (Christiano et al. 2017; Ouyang et al. 2022; Wu et al. 2023). However, the RLHF approach typically relies on the availability of massive amounts of human feedback and requires complex, unstable training procedures. To enhance truthfulness, some researchers have integrated external knowledge retrieval into LLMs during inference (He, Zhang, and Roth 2023; Peng et al. 2023a), which could result in an increased computational cost for inference. Others have focused on inference-time intervention on model internal representations to reduce toxicity or untruthfulness (Liu et al. 2021; Geva et al. 2022; Li et al. 2023c). Such methods often require complex experimental analysis of model representations before designing and applying the intervention. In contrast, our research focuses on unified and unsupervised model unlearning, which exhibits generalizability and efficiency in reducing both toxicity and untruthfulness.

7 Conclusion and Discussion

This paper introduces a novel operation for parameter-efficient modules that enables deficiency capability unlearning. The proposed method involves extracting unwanted attributes from an anti-expert PEM and eliminating them from the base model while retaining general model capability. Experimental results demonstrate that our approach can effectively enhance model truthfulness and detoxification without harming basic model abilities.

The findings of this study provide a valuable contribution to the field of model parameter operation in the unlearning area. The proposed approach offers a deep perspective on how to address the problem of deficiency capability in PEMs and its impact on model performance. There are several directions that remain for future work:

  • Storage Efficiency. When we operate on the full LoRA weight matrix, it is possible to obtain a high-rank matrix that cannot be accurately decomposed into low-rank matrices. As a result, storing new PEMs requires more disk space than before, though still less than the full model parameters.

  • Generalization Exploring. While experiments have been conducted on various datasets and phenomena, further research is necessary to validate the effectiveness of our method on multiple pre-trained language models with varying scales. Exploring other PEM architectures and extending to other deficiency capabilities are avenues for future work. Although we have included additional experiments in the Appendix, there remains ample room for further exploration.

  • Hyperparameter Optimization. It has been observed that different modules trained on different datasets may have different optimal weight hyperparameters $\lambda$ during composition. Developing new methods to find the optimal weight hyperparameters can enhance the accuracy and efficiency of LLMs, enabling them to perform better across a wider range of use cases.

References

  • Borkan et al. (2019) Borkan, D.; Dixon, L.; Sorensen, J.; Thain, N.; and Vasserman, L. 2019. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. CoRR, abs/1903.04561.
  • Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
  • Christiano et al. (2017) Christiano, P. F.; Leike, J.; Brown, T. B.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. CoRR, abs/1706.03741.
  • Chronopoulou et al. (2023) Chronopoulou, A.; Pfeiffer, J.; Maynez, J.; Wang, X.; Ruder, S.; and Agrawal, P. 2023. Language and Task Arithmetic with Parameter-Efficient Layers for Zero-Shot Summarization. CoRR, abs/2311.09344.
  • Cobbe et al. (2021) Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. CoRR, abs/2110.14168.
  • De Cao, Aziz, and Titov (2021) De Cao, N.; Aziz, W.; and Titov, I. 2021. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.
  • Deng et al. (2023) Deng, J.; Sun, H.; Zhang, Z.; Cheng, J.; and Huang, M. 2023. Recent Advances towards Safe, Responsible, and Moral Dialogue Systems: A Survey. CoRR, abs/2302.09270.
  • Geva et al. (2022) Geva, M.; Caciularu, A.; Wang, K. R.; and Goldberg, Y. 2022. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. CoRR, abs/2203.14680.
  • Hanu and Unitary team (2020) Hanu, L.; and Unitary team. 2020. Detoxify. https://github.com/unitaryai/detoxify. Accessed: 2023-06-01.
  • He, Zhang, and Roth (2023) He, H.; Zhang, H.; and Roth, D. 2023. Rethinking with Retrieval: Faithful Large Language Model Inference. CoRR, abs/2301.00303.
  • He et al. (2022) He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; and Neubig, G. 2022. Towards a Unified View of Parameter-Efficient Transfer Learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Hendrycks et al. (2021) Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J. 2021. Measuring Massive Multitask Language Understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Hernandez, Li, and Andreas (2023) Hernandez, E.; Li, B. Z.; and Andreas, J. 2023. Measuring and Manipulating Knowledge Representations in Language Models. CoRR, abs/2304.00740.
  • Houlsby et al. (2019) Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-Efficient Transfer Learning for NLP. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, 2790–2799. PMLR.
  • Hu et al. (2022) Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Huang et al. (2023) Huang, C.; Liu, Q.; Lin, B. Y.; Pang, T.; Du, C.; and Lin, M. 2023. LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition. CoRR, abs/2307.13269.
  • Ilharco et al. (2023) Ilharco, G.; Ribeiro, M. T.; Wortsman, M.; Schmidt, L.; Hajishirzi, H.; and Farhadi, A. 2023. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Ji et al. (2023) Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; and Fung, P. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., 55(12): 248:1–248:38.
  • Jin et al. (2023) Jin, X.; Ren, X.; Preotiuc-Pietro, D.; and Cheng, P. 2023. Dataless Knowledge Fusion by Merging Weights of Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Li et al. (2023a) Li, J.; Cheng, X.; Zhao, W. X.; Nie, J.; and Wen, J. 2023a. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. CoRR, abs/2305.11747.
  • Li et al. (2023b) Li, K.; Hopkins, A. K.; Bau, D.; Viégas, F. B.; Pfister, H.; and Wattenberg, M. 2023b. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Li et al. (2023c) Li, K.; Patel, O.; Viégas, F. B.; Pfister, H.; and Wattenberg, M. 2023c. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. CoRR, abs/2306.03341.
  • Li et al. (2023d) Li, X.; Zhang, T.; Dubois, Y.; Taori, R.; Gulrajani, I.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023d. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval. Accessed: 2023-05-30.
  • Li and Liang (2021) Li, X. L.; and Liang, P. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. CoRR, abs/2101.00190.
  • Lin, Hilton, and Evans (2022) Lin, S.; Hilton, J.; and Evans, O. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Muresan, S.; Nakov, P.; and Villavicencio, A., eds., Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 3214–3252. Association for Computational Linguistics.
  • Liu et al. (2021) Liu, A.; Sap, M.; Lu, X.; Swayamdipta, S.; Bhagavatula, C.; Smith, N. A.; and Choi, Y. 2021. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Zong, C.; Xia, F.; Li, W.; and Navigli, R., eds., Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, 6691–6706. Association for Computational Linguistics.
  • Liu, Zhang, and Liang (2023) Liu, N. F.; Zhang, T.; and Liang, P. 2023. Evaluating Verifiability in Generative Search Engines. CoRR, abs/2304.09848.
  • Lu et al. (2022) Lu, X.; Welleck, S.; Hessel, J.; Jiang, L.; Qin, L.; West, P.; Ammanabrolu, P.; and Choi, Y. 2022. QUARK: Controllable Text Generation with Reinforced Unlearning. In NeurIPS.
  • Matena and Raffel (2022) Matena, M.; and Raffel, C. 2022. Merging Models with Fisher-Weighted Averaging. In NeurIPS.
  • Meng et al. (2022) Meng, K.; Bau, D.; Andonian, A.; and Belinkov, Y. 2022. Locating and Editing Factual Associations in GPT. In NeurIPS.
  • Mitchell et al. (2022a) Mitchell, E.; Lin, C.; Bosselut, A.; Finn, C.; and Manning, C. D. 2022a. Fast Model Editing at Scale. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Mitchell et al. (2022b) Mitchell, E.; Lin, C.; Bosselut, A.; Manning, C. D.; and Finn, C. 2022b. Memory-Based Model Editing at Scale. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; and Sabato, S., eds., International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, 15817–15831. PMLR.
  • Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; Schulman, J.; Hilton, J.; Kelton, F.; Miller, L.; Simens, M.; Askell, A.; Welinder, P.; Christiano, P. F.; Leike, J.; and Lowe, R. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
  • Peng et al. (2023a) Peng, B.; Galley, M.; He, P.; Cheng, H.; Xie, Y.; Hu, Y.; Huang, Q.; Liden, L.; Yu, Z.; Chen, W.; and Gao, J. 2023a. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. CoRR, abs/2302.12813.
  • Peng et al. (2023b) Peng, B.; Li, C.; He, P.; Galley, M.; and Gao, J. 2023b. Instruction Tuning with GPT-4. CoRR, abs/2304.03277.
  • Rajbhandari et al. (2020) Rajbhandari, S.; Rasley, J.; Ruwase, O.; and He, Y. 2020. ZeRO: memory optimizations toward training trillion parameter models. In Cuicchi, C.; Qualters, I.; and Kramer, W. T., eds., Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, 20. IEEE/ACM.
  • Rasley et al. (2020) Rasley, J.; Rajbhandari, S.; Ruwase, O.; and He, Y. 2020. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Gupta, R.; Liu, Y.; Tang, J.; and Prakash, B. A., eds., KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, 3505–3506. ACM.
  • Sinitsin et al. (2020) Sinitsin, A.; Plokhotnyuk, V.; Pyrkin, D. V.; Popov, S.; and Babenko, A. 2020. Editable Neural Networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Suzgun et al. (2023) Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; Chi, E.; Zhou, D.; and Wei, J. 2023. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In Rogers, A.; Boyd-Graber, J. L.; and Okazaki, N., eds., Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 13003–13051. Association for Computational Linguistics.
  • Taori et al. (2023) Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca. Accessed: 2023-05-10.
  • Touvron et al. (2023) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR, abs/2302.13971.
  • Wang et al. (2023) Wang, Y.; Ivison, H.; Dasigi, P.; Hessel, J.; Khot, T.; Chandu, K. R.; Wadden, D.; MacMillan, K.; Smith, N. A.; Beltagy, I.; and Hajishirzi, H. 2023. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. CoRR, abs/2306.04751.
  • Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E. H.; Le, Q. V.; and Zhou, D. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
  • Welbl et al. (2021) Welbl, J.; Glaese, A.; Uesato, J.; Dathathri, S.; Mellor, J.; Hendricks, L. A.; Anderson, K.; Kohli, P.; Coppin, B.; and Huang, P. 2021. Challenges in Detoxifying Language Models. In Moens, M.; Huang, X.; Specia, L.; and Yih, S. W., eds., Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, 2447–2469. Association for Computational Linguistics.
  • Welleck et al. (2020) Welleck, S.; Kulikov, I.; Roller, S.; Dinan, E.; Cho, K.; and Weston, J. 2020. Neural Text Generation With Unlikelihood Training. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Wortsman et al. (2022) Wortsman, M.; Ilharco, G.; Gadre, S. Y.; Roelofs, R.; Lopes, R. G.; Morcos, A. S.; Namkoong, H.; Farhadi, A.; Carmon, Y.; Kornblith, S.; and Schmidt, L. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; and Sabato, S., eds., International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, 23965–23998. PMLR.
  • Wu et al. (2023) Wu, Z.; Hu, Y.; Shi, W.; Dziri, N.; Suhr, A.; Ammanabrolu, P.; Smith, N. A.; Ostendorf, M.; and Hajishirzi, H. 2023. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. CoRR, abs/2306.01693.
  • Xu et al. (2023) Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; and Jiang, D. 2023. WizardLM: Empowering Large Language Models to Follow Complex Instructions. CoRR, abs/2304.12244.
  • Yang et al. (2023) Yang, E.; Wang, Z.; Shen, L.; Liu, S.; Guo, G.; Wang, X.; and Tao, D. 2023. AdaMerging: Adaptive Model Merging for Multi-Task Learning. CoRR, abs/2310.02575.
  • Zhang et al. (2023) Zhang, J.; Chen, S.; Liu, J.; and He, J. 2023. Composing Parameter-Efficient Modules with Arithmetic Operations. CoRR, abs/2306.14870.
  • Zhang et al. (2022) Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M. T.; Li, X.; Lin, X. V.; Mihaylov, T.; Ott, M.; Shleifer, S.; Shuster, K.; Simig, D.; Koura, P. S.; Sridhar, A.; Wang, T.; and Zettlemoyer, L. 2022. OPT: Open Pre-trained Transformer Language Models. CoRR, abs/2205.01068.
  • Zhao et al. (2023) Zhao, R.; Li, X.; Chia, Y. K.; Ding, B.; and Bing, L. 2023. Can ChatGPT-like Generative Models Guarantee Factual Accuracy? On the Mistakes of New Generation Search Engines. CoRR, abs/2304.11076.

Appendix A Training Details

Figure 5: Prompt format for instruction tuning.

We train all of our PEMs for 2 epochs across datasets, with a learning rate of 2e-5 and the AdamW optimizer. We use a linear scheduler with a learning rate warmup ratio of 3%. We have found that a maximum text length of 1024 is sufficient for the majority of single-turn instruction datasets. We set the training batch size to 4 per device with gradient accumulation steps of 16. We train our models primarily on two A100 GPUs with 80G memory, and we utilize the DeepSpeed library (Rasley et al. 2020) and ZeRO optimizer (Rajbhandari et al. 2020) with Stage-2 and CPU-offload. The prompt format for instruction tuning is depicted in Figure 5, where <|user|> and <|assistant|> mark the boundaries between the instruction and response messages.
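A DeepSpeed configuration matching the hyperparameters above would look roughly like the following sketch (a reconstruction for illustration, not the authors' actual configuration file; the learning-rate schedule is assumed to be handled by the training framework):

```python
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {
        "stage": 2,                              # ZeRO Stage-2
        "offload_optimizer": {"device": "cpu"},  # CPU-offload
    },
}
```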

Appendix B Prompt for ChatGPT

Figure 6 presents the prompt template used with ChatGPT to generate deceptive responses for the Alpaca and WizardLM instructions. The resulting untruthful datasets are then used to train our anti-expert PEMs.

Following Lin, Hilton, and Evans (2022), we apply ChatGPT to evaluate the generated responses on TruthfulQA. We use different templates to assess truthfulness (top) and informativeness (bottom), as shown in Figure 7.

Figure 6: Prompt template of gpt-3.5-turbo-0613 to create untruthful responses for Alpaca and WizardLM instructions.
Figure 7: Prompt template of gpt-3.5-turbo-0613 to evaluate the truthfulness and informativeness of generated answers.

Appendix C Evaluation of Model Fundamental Ability

To evaluate the model's fundamental abilities, we focus on the following four aspects:

  • Factuality We use the Massive Multitask Language Understanding dataset, MMLU (Hendrycks et al. 2021), to measure factual knowledge. We mainly report the zero-shot results in this section, while few-shot results can be found in Table 7.

  • Reasoning Reasoning ability is evaluated on the Grade School Math, GSM (Cobbe et al. 2021), and Big-Bench-Hard, BBH (Suzgun et al. 2023), datasets. Following Wang et al. (2023), we employ 8-shot prompting for GSM and 3-shot for BBH with Chain-of-Thought (Wei et al. 2022). We also sample 200 and 40 examples from GSM and BBH, respectively, for more efficient testing.

  • Instruction Following We utilize AlpacaEval (Li et al. 2023d) to automatically evaluate the open-ended instruction following of models. Two instructions that overlap with our training data are excluded from testing. We use ChatGPT as our evaluator.

  • Language Modeling To evaluate the language modeling of the instruction-tuned models, we measure next-token accuracy on the AlpacaEval test data. We do not employ perplexity evaluation due to its uncertain scope.

Detailed evaluation results of MMLU, GSM and BBH are presented in Table 7.

Figure 8: Evaluation of truthfulness or detoxification with 4-gram repetition scores under varying weight hyperparameters $\lambda$.
Figure 9: Evaluation of (a) truthfulness and (b) detoxification, with 4-gram repetition scores under varying weight hyperparameters $\lambda$, using LLaMA-13B.
| Method | TruthfulQA mc1 ↑ | TruthfulQA mc2 ↑ | rep-4 ↓ | HaluEval QA ↑ | HaluEval Summary ↑ | Toxicity Score ↓ | Toxicity % ↓ | rep-4 ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\theta^{+}$ | 31.7 | 51.2 | 6.9 | 69.0 | 51.6 | .174 | 13.5 | 10.9 |
| $\theta^{-}$ | 26.4 | 45.1 | 1.3 | 65.3 | 49.9 | – | – | – |
| $\theta^{+} \ominus \theta^{-}$ ($\lambda=0.2$) | 32.3 | 50.9 | 10.1 | 69.8 | 52.9 | – | – | – |
| $\theta^{+} \ominus Ext(\theta^{-})$ ($\lambda=0.4$) (Ours) | 34.1 | 51.3 | 7.7 | 71.8 | 56.1 | – | – | – |
| $\theta_{tox}^{-}$ | – | – | – | – | – | .574 | 50.0 | 21.8 |
| $\theta^{+} \ominus \theta_{tox}^{-}$ ($\lambda=0.2$) | – | – | – | – | – | .155 | 11.0 | 14.8 |
| $\theta^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=0.2$) (Ours) | – | – | – | – | – | .153 | 10.0 | 12.3 |

Table 5: Deficiency unlearning results of truthfulness and detoxification on prefix-tuning (Li and Liang 2021) PEMs of LLaMA-7B. $\theta^{-}$ denotes the untruthful anti-expert PEM and $\theta_{tox}^{-}$ the toxic anti-expert PEM; dashes mark settings that are not applicable.
| Method | TruthfulQA mc1 ↑ | TruthfulQA mc2 ↑ | rep-4 ↓ | HaluEval QA ↑ | HaluEval Summary ↑ | Toxicity Score ↓ | Toxicity % ↓ | rep-4 ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\theta^{+}$ | 30.7 | 47.9 | 14.8 | 63.3 | 53.4 | .205 | 15.0 | 22.1 |
| $\theta^{-}$ | 25.1 | 43.3 | 1.5 | 61.2 | 49.8 | – | – | – |
| $\theta^{+} \ominus \theta^{-}$ ($\lambda=0.2$) | 31.0 | 48.1 | 17.3 | 69.7 | 56.3 | – | – | – |
| $\theta^{+} \ominus Ext(\theta^{-})$ ($\lambda=0.6$) (Ours) | 30.6 | 48.4 | 17.4 | 64.4 | 53.7 | – | – | – |
| $\theta_{tox}^{-}$ | – | – | – | – | – | .580 | 52.0 | 23.7 |
| $\theta^{+} \ominus \theta_{tox}^{-}$ ($\lambda=0.2$) | – | – | – | – | – | .182 | 13.0 | 23.0 |
| $\theta^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=0.6$) (Ours) | – | – | – | – | – | .166 | 11.5 | 23.1 |

Table 6: Deficiency unlearning results of truthfulness and detoxification on LoRA PEMs of the OPT-6.7B model (Zhang et al. 2022). Notation follows Table 5.
| Method | MMLU 0-shot | MMLU 5-shot | GSM Direct | GSM CoT | BBH Direct | BBH CoT | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Alpaca-GPT4** | | | | | | | |
| $\theta_{A}^{+}$ | 32.5 | 33.1 | 7.5 | 11.5 | 31.4 | 34.6 | 25.1 |
| $\theta_{A}^{-}$ | 30.7 | 31.3 | 6.0 | 9.5 | 31.3 | 32.7 | 23.6 |
| $\theta_{A}^{+} \ominus \theta_{A}^{-}$ ($\lambda=0.2$) | 33.1 | 33.6 | 7.0 | 13.0 | 30.8 | 33.9 | 25.2 |
| $\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) (Ours) | 32.8 | 33.3 | 8.0 | 10.5 | 30.4 | 32.6 | 24.6 |
| $\theta_{A}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=2.0$) (Ours) | 33.0 | 33.5 | 8.0 | 11.5 | 30.7 | 32.4 | 24.8 |
| $\theta_{A}^{+} \ominus \theta_{W}^{-}$ ($\lambda=0.2$) | 33.0 | 33.5 | 6.5 | 14.0 | 31.1 | 33.6 | 25.3 |
| $\theta_{A}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=1.0$) (Ours) | 33.2 | 33.5 | 8.0 | 12.5 | 30.7 | 33.5 | 25.2 |
| $\theta_{A}^{+} \ominus \theta_{tox}^{-}$ ($\lambda=0.4$) | 32.2 | 33.5 | 6.5 | 11.0 | 30.6 | 33.4 | 24.5 |
| $\theta_{A}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.0$) (Ours) | 32.0 | 33.1 | 7.5 | 9.5 | 29.1 | 33.3 | 24.1 |
| $\theta_{A}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=2.0$) (Ours) | 31.2 | 33.0 | 6.5 | 6.5 | 25.9 | 32.6 | 22.6 |
| **WizardLM** | | | | | | | |
| $\theta_{W}^{+}$ | 32.9 | 32.9 | 6.5 | 12.0 | 30.0 | 35.4 | 25.0 |
| $\theta_{W}^{-}$ | 29.7 | 30.8 | 6.5 | 10.0 | 30.2 | 33.3 | 23.4 |
| $\theta_{W}^{+} \ominus \theta_{W}^{-}$ ($\lambda=0.2$) | 33.1 | 33.1 | 6.0 | 14.5 | 29.5 | 35.2 | 25.2 |
| $\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=1.0$) (Ours) | 32.5 | 33.3 | 6.0 | 13.0 | 30.3 | 35.4 | 25.1 |
| $\theta_{W}^{+} \ominus Ext(\theta_{W}^{-})$ ($\lambda=0.6$) (Ours) | 32.6 | 32.8 | 6.5 | 13.0 | 30.3 | 35.6 | 25.1 |
| $\theta_{W}^{+} \ominus \theta_{A}^{-}$ ($\lambda=0.2$) | 33.1 | 33.1 | 5.5 | 14.0 | 30.2 | 33.6 | 24.9 |
| $\theta_{W}^{+} \ominus Ext(\theta_{A}^{-})$ ($\lambda=1.0$) (Ours) | 32.8 | 32.2 | 6.0 | 13.5 | 28.8 | 34.5 | 24.6 |
| $\theta_{W}^{+} \ominus \theta_{tox}^{-}$ ($\lambda=0.2$) | 32.6 | 32.9 | 5.5 | 13.5 | 30.6 | 35.2 | 25.1 |
| $\theta_{W}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.0$) (Ours) | 31.4 | 32.6 | 6.0 | 13.0 | 27.5 | 34.7 | 24.2 |
| $\theta_{W}^{+} \ominus Ext(\theta_{tox}^{-})$ ($\lambda=1.4$) (Ours) | 30.8 | 32.4 | 6.5 | 11.5 | 26.6 | 34.5 | 23.7 |

Table 7: Detailed evaluation of model fundamental abilities on MMLU, GSM, and BBH, with direct and Chain-of-Thought (CoT) prompting.
Figure 10: Generated examples from TruthfulQA for direct subtraction and our method. The baseline result is generated from the basic expert PEMs.