
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

Chong Zhang  Jieyu Zhao  Huan Zhang  Kai-Wei Chang  Cho-Jui Hsieh
Department of Computer Science, UCLA
{chongz, jyzhao, kwchang, chohsieh}@cs.ucla.edu, huan@huan-zhang.com
Abstract

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results stay the same? In this paper, we propose a “double perturbation” framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where the prediction can be altered. Our proposed attack attains high success rates (96.0%–99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal the hidden model biases not directly shown in the test dataset. Our code is available at https://github.com/chong-z/nlp-second-order-attack.

1 Introduction

Recent studies show that NLP models are vulnerable to adversarial perturbations. A seemingly “invariance transformation” (a.k.a. adversarial perturbation) such as synonym substitutions (Alzantot et al., 2018; Zang et al., 2020) or syntax-guided paraphrasing (Iyyer et al., 2018; Huang and Chang, 2021) can alter the prediction. To mitigate the model vulnerability, robust training methods have been proposed and shown effective (Miyato et al., 2017; Jia et al., 2019; Huang et al., 2019; Zhou et al., 2020).

In most studies, model robustness is evaluated based on a given test dataset or synthetic sentences constructed from templates (Ribeiro et al., 2020). Specifically, the robustness of a model is often evaluated by the ratio of test examples on which the model prediction cannot be altered by a semantic-invariant perturbation. We refer to this type of evaluation as the first-order robustness evaluation. However, even if a model is first-order robust on an input sentence $x_0$, it is possible that the model is not robust on a natural sentence $\tilde{x}_0$ that is slightly modified from $x_0$. In that case, adversarial examples still exist even if first-order attacks cannot find any of them from the given test dataset. Throughout this paper, we call $\tilde{x}_0$ a vulnerable example. The existence of such examples exposes weaknesses in models’ understanding and presents challenges for model deployment. Fig. 1 illustrates an example.

[Figure 1: $x_0$ = "a deep and meaningful film (movie)." from $\mathcal{X}_{\text{test}}$, predicted 99% positive for both film and movie; perturbed $\tilde{x}_0$ = "a short and moving film (movie).", predicted 73% positive for film but 70% negative for movie.]
Figure 1: A vulnerable example beyond the test dataset. Numbers on the bottom right are the sentiment predictions for film and movie. Blue $x_0$ comes from the test dataset and its prediction cannot be altered by the substitution $\texttt{film}\rightarrow\texttt{movie}$ (robust). Yellow example $\tilde{x}_0$ is slightly perturbed but remains natural. Its prediction can be altered by the substitution (vulnerable).

In this paper, we propose the double perturbation framework for evaluating a stronger notion of second-order robustness. Given a test dataset, we consider a model to be second-order robust if there is no vulnerable example that can be identified in the neighborhood of the given test instances (Section 2.2). In particular, our framework first perturbs the test set to construct the neighborhood, and then diagnoses the robustness regarding a single-word synonym substitution. Taking Fig. 2 as an example, the model is first-order robust on the input sentence $x_0$ (the prediction cannot be altered), but it is not second-order robust due to the existence of the vulnerable example $\tilde{x}_0$. Our framework is designed to identify $\tilde{x}_0$.

We apply the proposed framework and quantify second-order robustness through two second-order attacks (Section 3). We experiment with English sentiment classification on the SST-2 dataset (Socher et al., 2013) across various model architectures. Surprisingly, although robustly trained CNN (Jia et al., 2019) and Transformer (Xu et al., 2020) models can achieve high robustness under strong attacks (Alzantot et al., 2018; Garg and Ramakrishnan, 2020) (23.0%–71.6% success rates), for around 96.0% of the test examples our attacks can find a vulnerable example by perturbing 1.3 words on average. This finding indicates that these robustly trained models, despite being first-order robust, are not second-order robust.

Furthermore, we extend the double perturbation framework to evaluate counterfactual biases (Kusner et al., 2017) in English (Section 4). When the test dataset is small, our framework can help improve the evaluation robustness by revealing hidden biases not directly shown in the test dataset. Intuitively, a fair model should make the same prediction for nearly identical examples referencing different groups (Garg et al., 2019) with different protected attributes (e.g., gender, race). In our evaluation, we consider a model biased if substituting tokens associated with protected attributes changes the expected prediction, which is the average prediction among all examples within the neighborhood. For instance, a toxicity classifier is biased if it tends to increase the toxicity when we substitute $\texttt{straight}\rightarrow\texttt{gay}$ in an input sentence (Dixon et al., 2018). In the experiments, we evaluate the expected sentiment predictions on pairs of protected tokens (e.g., (he, she), (gay, straight)), and demonstrate that our method is able to reveal the hidden model biases.

Figure 2: An illustration of the decision boundary. The diamond area denotes invariance transformations. Blue $x_0$ is a robust input example (the entire diamond is green). Yellow $\tilde{x}_0$ is a vulnerable example in the neighborhood of $x_0$. Red $\tilde{x}'_0$ is an adversarial example to $\tilde{x}_0$. Note: $\tilde{x}'_0$ is not an adversarial example to $x_0$ since they have different meanings to humans (outside the diamond).

Our main contributions are: (1) We propose the double perturbation framework to diagnose the robustness of existing robustness and fairness evaluation methods. (2) We propose two second-order attacks to quantify the stronger notion of second-order robustness and reveal the models’ vulnerabilities that cannot be identified by previous attacks. (3) We propose a counterfactual bias evaluation method to reveal the hidden model bias based on our double perturbation framework.

2 The Double Perturbation Framework

In this section, we describe the double perturbation framework which focuses on identifying vulnerable examples within a small neighborhood of the test dataset. The framework consists of a neighborhood perturbation and a word substitution. We start with defining word substitutions.

2.1 Existing Word Substitution Strategy

We focus our study on word-level substitution, where existing works evaluate robustness and counterfactual bias by directly perturbing the test dataset. For instance, adversarial attacks alter the prediction by making synonym substitutions, and the fairness literature evaluates counterfactual fairness by substituting protected tokens. We integrate the word substitution strategy into our framework as the component for evaluating robustness and fairness.

For simplicity, we consider a single-word substitution and denote it with the operator $\oplus$. Let $\mathcal{X}\subseteq\mathcal{V}^{l}$ be the input space, where $\mathcal{V}$ is the vocabulary and $l$ is the sentence length, let $\bm{p}=(p^{(1)},p^{(2)})\in\mathcal{V}^{2}$ be a pair of synonyms (called patch words), let $\mathcal{X}_{\bm{p}}\subseteq\mathcal{X}$ denote the sentences with a single occurrence of $p^{(1)}$ (for simplicity we skip other sentences), and let $x_0\in\mathcal{X}_{\bm{p}}$ be an input sentence. Then $x_0\oplus\bm{p}$ means “substitute $p^{(1)}\rightarrow p^{(2)}$ in $x_0$”. The result after substitution is:

$x_{0}^{\prime}=x_{0}\oplus\bm{p}.$

Taking Fig. 1 as an example, where $\bm{p}=(\texttt{film},\texttt{movie})$ and $x_0=$ “a deep and meaningful film”, the perturbed sentence is $x_0^{\prime}=$ “a deep and meaningful movie”. Now we introduce the other components of our framework.
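To make the notation concrete, the snippet below is a minimal Python sketch of the $\oplus$ operator on whitespace-tokenized sentences; the helper name apply_patch is ours and not from the released code.

```python
from typing import List, Tuple

def apply_patch(sentence: List[str], patch: Tuple[str, str]) -> List[str]:
    """x ⊕ p: substitute the single occurrence of p[0] with p[1]."""
    p1, p2 = patch
    assert sentence.count(p1) == 1, "we only consider sentences with one occurrence of p^(1)"
    return [p2 if tok == p1 else tok for tok in sentence]

x0 = "a deep and meaningful film .".split()
print(" ".join(apply_patch(x0, ("film", "movie"))))
# -> a deep and meaningful movie .
```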

2.2 Proposed Neighborhood Perturbation

Instead of applying the aforementioned word substitutions directly to the original test dataset, our framework perturbs the test dataset within a small neighborhood to construct similar natural sentences. This is to identify vulnerable examples with respect to the model. Note that examples in the neighborhood are not required to have the same meaning as the original example, since we only study the prediction difference caused by applying the synonym substitution $\bm{p}$ (Section 2.1).

Constraints on the neighborhood.  We limit the neighborhood sentences to a small $\ell_0$ norm ball around the test instance to ensure syntactic similarity, and empirically ensure naturalness through a language model. The neighborhood of an input sentence $x_0\in\mathcal{X}$ is:

$\text{Neighbor}_{k}(x_{0})\subseteq\text{Ball}_{k}(x_{0})\cap\mathcal{X}_{\text{natural}},$ (1)

where $\text{Ball}_{k}(x_{0})=\{x\mid\lVert x-x_{0}\rVert_{0}\leq k,\ x\in\mathcal{X}\}$ is the $\ell_0$ norm ball around $x_0$ (i.e., at most $k$ different tokens), and $\mathcal{X}_{\text{natural}}$ denotes natural sentences that satisfy a certain language model score, which will be discussed next.

Construction with masked language model.  We construct neighborhood sentences from $x_0$ by substituting at most $k$ tokens. As shown in Algorithm 1, the construction employs a recursive approach and replaces one token at a time. For each recursion, the algorithm first masks each token of the input sentence (which may be the original $x_0$ or the $\tilde{x}$ from the last recursion) separately and predicts likely replacements with a masked language model (e.g., DistilBERT, Sanh et al. 2019). To ensure naturalness, we keep the top 20 tokens with the largest logits for each mask (subject to a threshold, Algorithm 1). Then, the algorithm constructs neighborhood sentences by replacing the mask with the found tokens. We use the notation $\tilde{x}$ in the following sections to denote the constructed sentences within the neighborhood.

Data: Input sentence $x_0$, masked language model LM, max distance $k$.
Function $\text{Neighbor}_k(x_0)$:
  if $k = 0$ then return $\{x_0\}$;
  if $k \geq 2$ then return $\bigcup_{\tilde{x}\in\text{Neighbor}_{1}(x_{0})}\text{Neighbor}_{k-1}(\tilde{x})$;
  $\mathcal{X}_{\text{neighbor}}\leftarrow\varnothing$;
  for $i\leftarrow 0,\dots,\text{len}(x_{0})-1$ do
    $T, L\leftarrow\text{LM.fillmask}(x_{0}, i)$;   ▷ Mask the $i$-th token and return candidate tokens and corresponding logits.
    $L\leftarrow\text{SortDecreasing}(L)$;
    $l_{\text{min}}\leftarrow\max\{L^{(\kappa)},\;L^{(0)}-\delta\}$;   ▷ $L^{(i)}$ denotes the $i$-th element. We empirically set $\kappa\leftarrow 20$ and $\delta\leftarrow 3$.
    $T_{\text{new}}\leftarrow\{t\mid l>l_{\text{min}},\;(t,l)\in T\times L\}$;
    $\mathcal{X}_{\text{new}}\leftarrow\{x_{0}\mid x_{0}^{(i)}\leftarrow t,\;t\in T_{\text{new}}\}$;   ▷ Construct new sentences by replacing the $i$-th token.
    $\mathcal{X}_{\text{neighbor}}\leftarrow\mathcal{X}_{\text{neighbor}}\cup\mathcal{X}_{\text{new}}$;
  return $\mathcal{X}_{\text{neighbor}}$;
Algorithm 1: Neighborhood construction
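The following Python sketch illustrates Algorithm 1 with the HuggingFace fill-mask pipeline. It is a simplified approximation: the paper thresholds raw logits (top $\kappa=20$ within $\delta=3$ of the best), whereas this sketch simply keeps the pipeline's top-$k$ candidates; the function names are ours.

```python
from itertools import chain
from transformers import pipeline

# DistilBERT masked language model, as mentioned in the text.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

def neighbor_1(tokens, top_k=20):
    """Single-substitution neighborhood of a whitespace-tokenized sentence."""
    neighbors = set()
    for i in range(len(tokens)):
        masked = tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
        for cand in fill_mask(" ".join(masked), top_k=top_k):
            word = cand["token_str"].strip()
            neighbors.add(tuple(tokens[:i] + [word] + tokens[i + 1:]))
    return neighbors

def neighbor_k(tokens, k):
    """Recursive construction: Neighbor_k is the union of Neighbor_{k-1} over Neighbor_1."""
    if k == 0:
        return {tuple(tokens)}
    if k == 1:
        return neighbor_1(list(tokens))
    return set(chain.from_iterable(
        neighbor_k(list(x), k - 1) for x in neighbor_1(list(tokens))))

print(len(neighbor_k("a deep and meaningful film .".split(), 1)))
```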

3 Evaluating Second-Order Robustness

[Figure 3: attack flow for the input $x_0=$ "a deep and meaningful film." with patch words $\bm{p}=(\texttt{film},\texttt{movie})$, showing candidate synonyms, intermediate sentences at each beam-search iteration with their probability outputs $f_{\text{soft}}(x)$, and the resulting vulnerable example $\tilde{x}_0=$ "a short and moving film (movie)." (73% positive for film, 70% negative for movie).]
Figure 3: The attack flow for SO-Beam (Algorithm 2). Blue $x_0$ is the input sentence and Yellow $\tilde{x}_0$ is our constructed vulnerable example (the prediction can be altered by substituting $\texttt{film}\rightarrow\texttt{movie}$). Green boxes in the middle show intermediate sentences, and $f_{\text{soft}}(x)$ denotes the probability outputs for film and movie.

With the proposed double perturbation framework, we design two black-box attacks to identify vulnerable examples within the neighborhood of the test set (black-box attacks only observe the model outputs and do not have access to the model parameters or gradients). We aim at evaluating the robustness for inputs beyond the test set.

3.1 Previous First-Order Attacks

Adversarial attacks search for small, invariant perturbations of the model input that can alter the prediction. To simplify the discussion, in the following we take a binary classifier $f(x):\mathcal{X}\rightarrow\{0,1\}$ as an example to describe our framework. Let $x_0$ be a sentence from the test set with label $y_0$; then the smallest perturbation $\delta^{*}$ under the $\ell_0$ norm distance is (for simplicity we use the $\ell_0$ norm distance to measure similarity, but other distance metrics can be applied):

$\delta^{*}:=\mathop{\mathrm{argmin}}_{\delta}\lVert\delta\rVert_{0}\ \text{ s.t. }\ f(x_{0}\oplus\delta)\neq y_{0}.$

Here $\delta=\bm{p}_{1}\oplus\dots\oplus\bm{p}_{l}$ denotes a series of substitutions. In contrast, our second-order attacks fix $\delta=\bm{p}$ and search for the vulnerable $x_0$.

3.2 Proposed Second-Order Attacks

Second-order attacks study the prediction difference caused by applying $\bm{p}$. For notational convenience we define the prediction difference $F(x;\bm{p}):\mathcal{X}\times\mathcal{V}^{2}\rightarrow\{-1,0,1\}$ by (we assume a binary classification task, but our framework is general and can be extended to multi-class classification):

$F(x;\bm{p}):=f(x\oplus\bm{p})-f(x).$ (2)

Taking Fig. 1 as an example, the prediction difference for $\tilde{x}_0$ on $\bm{p}$ is $F(\tilde{x}_{0};\bm{p})=f(\text{…moving movie.})-f(\text{…moving film.})=-1$.

Given an input sentence $x_0$, we want to find patch words $\bm{p}$ and a vulnerable example $\tilde{x}_0$ such that $f(\tilde{x}_{0}\oplus\bm{p})\neq f(\tilde{x}_{0})$. Following Alzantot et al. (2018), we choose the $\bm{p}$ from a predefined list of counter-fitted synonyms (Mrkšić et al., 2016) that maximizes $|f_{\text{soft}}(p^{(2)})-f_{\text{soft}}(p^{(1)})|$. Here $f_{\text{soft}}(x):\mathcal{X}\rightarrow[0,1]$ denotes the probability output (e.g., after the softmax layer but before the final argmax), $f_{\text{soft}}(p^{(1)})$ and $f_{\text{soft}}(p^{(2)})$ denote the predictions for the single word, and we enumerate through all possible $\bm{p}$ for $x_0$. Let $k$ be the neighborhood distance; then the attack is equivalent to solving:

$\tilde{x}_{0}=\mathop{\mathrm{argmax}}_{x\in\text{Neighbor}_{k}(x_{0})}|F(x;\bm{p})|.$ (3)

Brute-force attack (SO-Enum).  A naive approach to solving Eq. 3 is to enumerate through $\text{Neighbor}_{k}(x_{0})$. The enumeration finds the smallest perturbation, but is only applicable for small $k$ (e.g., $k\leq 2$) given the exponential complexity.
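A minimal sketch of this brute-force search, reusing apply_patch and neighbor_k from the earlier snippets; f is an assumed hard-label classifier that maps a token list to {0, 1}.

```python
def so_enum(f, tokens, patch, k=2):
    """Brute-force search for a vulnerable example within Neighbor_k (Eq. 3)."""
    for cand in neighbor_k(tokens, k):
        cand = list(cand)
        # Only keep neighbors that still contain exactly one occurrence of p^(1).
        if cand.count(patch[0]) != 1:
            continue
        if f(apply_patch(cand, patch)) != f(cand):
            return cand  # prediction flips under the patch: vulnerable example
    return None
```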

Beam-search attack (SO-Beam).  The efficiency can be improved by utilizing the probability output, where we solve Eq. 3 by minimizing the cross-entropy loss with regard to $x\in\text{Neighbor}_{k}(x_{0})$:

$\mathcal{L}(x;\bm{p}):=-\log(1-f_{\text{min}})-\log(f_{\text{max}}),$ (4)

where $f_{\text{min}}$ and $f_{\text{max}}$ are the smaller and the larger output probability between $f_{\text{soft}}(x)$ and $f_{\text{soft}}(x\oplus\bm{p})$, respectively. Minimizing Eq. 4 effectively pushes $f_{\text{min}}\rightarrow 0$ and $f_{\text{max}}\rightarrow 1$, and we use a beam search to find the best $x$. At each iteration, we construct sentences through $\text{Neighbor}_{1}(x)$ and only keep the top 20 sentences with the smallest $\mathcal{L}(x;\bm{p})$. We run at most $k$ iterations, and stop early if we find a vulnerable example. We provide the detailed implementation in Algorithm 2 and a flowchart in Fig. 3.

Data: Input sentence $x_0$, synonyms $\mathcal{P}$, model functions $F$ and $f_{\text{soft}}$, loss $\mathcal{L}$, max distance $k$.
Function $\text{SO-Beam}_k(x_0)$:
  $\bm{p}\leftarrow\mathop{\mathrm{argmax}}_{\bm{p}\in\mathcal{P}\text{ s.t. }x_{0}\in\mathcal{X}_{\bm{p}}}\;|f_{\text{soft}}(p^{(2)})-f_{\text{soft}}(p^{(1)})|$;
  $\mathcal{X}_{\text{beam}}\leftarrow\{x_{0}\}$;
  for $i\leftarrow 1,\dots,k$ do
    $\mathcal{X}_{\text{new}}\leftarrow\bigcup_{\tilde{x}\in\mathcal{X}_{\text{beam}}}\text{Neighbor}_{1}(\tilde{x})$;
    $\tilde{x}_{0}\leftarrow\mathop{\mathrm{argmax}}_{x\in\mathcal{X}_{\text{new}}}|F(x;\bm{p})|$;
    if $F(\tilde{x}_{0};\bm{p})\neq 0$ then return $\tilde{x}_{0}$;
    $\mathcal{X}_{\text{new}}\leftarrow\text{SortIncreasing}(\mathcal{X}_{\text{new}},\mathcal{L})$;
    $\mathcal{X}_{\text{beam}}\leftarrow\{\mathcal{X}_{\text{new}}^{(0)},\dots,\mathcal{X}_{\text{new}}^{(\beta-1)}\}$;   ▷ Keep the best beam. We set $\beta\leftarrow 20$.
  return None;
Algorithm 2: Beam-search attack (SO-Beam)
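Below is a rough Python sketch of the beam search under the same assumptions as the earlier snippets. f_soft is an assumed wrapper returning the probability of the positive class for a token list, and the patch-word selection step of Algorithm 2 is assumed to have been performed already; this is illustrative, not the released implementation.

```python
import math

def so_beam(f_soft, tokens, patch, k=6, beam_width=20):
    """Beam search for a vulnerable example, minimizing the loss in Eq. 4."""

    def flips(x):
        # F(x; p) != 0 under a 0.5 decision threshold.
        return round(f_soft(x)) != round(f_soft(apply_patch(x, patch)))

    def loss(x):
        a, b = f_soft(x), f_soft(apply_patch(x, patch))
        f_min, f_max = min(a, b), max(a, b)
        eps = 1e-12  # avoid log(0)
        return -math.log(1 - f_min + eps) - math.log(f_max + eps)

    beam = [list(tokens)]
    for _ in range(k):
        candidates = [list(c) for x in beam for c in neighbor_1(x)
                      if c.count(patch[0]) == 1]
        for cand in candidates:
            if flips(cand):
                return cand  # vulnerable example found
        beam = sorted(candidates, key=loss)[:beam_width]
    return None
```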

3.3 Experimental Results

In this section, we evaluate the second-order robustness of existing models and show the quality of our constructed vulnerable examples.

3.3.1 Setup

Original: 70% Negative
Input Example: in its best moments , resembles a bad high school production of grease , without benefit of song .
Genetic: 56% Positive
Adversarial Example: in its best moment , recalling a naughty high school production of lubrication , unless benefit of song .
BAE: 56% Positive
Adversarial Example: in its best moments , resembles a great high school production of grease , without benefit of song .
SO-Enum and SO-Beam (ours): 60% Negative  (67% Positive)
Vulnerable Example: in its best moments , resembles a bad (unhealthy) high school production of musicals , without benefit of song .
Table 1: Sampled attack results on the robust BoW. For Genetic and BAE the goal is to find an adversarial example that alters the original prediction, whereas for SO-Enum and SO-Beam the goal is to find a vulnerable example beyond the test set such that the prediction can be altered by substituting $\texttt{bad}\rightarrow\texttt{unhealthy}$.

We follow the setup from the robust training literature (Jia et al., 2019; Xu et al., 2020) and experiment with both the base (non-robust) and robustly trained models. We train the binary sentiment classifiers on the SST-2 dataset with bag-of-words (BoW), CNN, LSTM, and attention-based models.

Base models.  For BoW, CNN, and LSTM, all models use pre-trained GloVe embeddings (Pennington et al., 2014) and have one hidden layer of the corresponding type with a hidden size of 100. Similar to the baseline performance reported in GLUE (Wang et al., 2019), our trained models have evaluation accuracies of 81.4%, 82.5%, and 81.7%, respectively. For attention-based models, we train a 3-layer Transformer (the largest size in Shi et al. 2020) and fine-tune a pre-trained bert-base-uncased from HuggingFace (Wolf et al., 2020). The Transformer uses 4 attention heads and a hidden size of 64, and obtains 82.1% accuracy. The BERT-base uses the default configuration and obtains 92.7% accuracy.

Robust models (first-order).  With the same setup as base models, we apply robust training methods to improve the resistance to word substitution attacks. Jia et al. (2019) provide a provably robust training method through Interval Bound Propagation (IBP, Dvijotham et al. 2018) for all word substitutions on BoW, CNN and LSTM. Xu et al. (2020) provide a provably robust training method on general computational graphs through a combination of forward and backward linear bound propagation, and the resulting 3-layer Transformer is robust to up to 6 word substitutions. For both works we use the same set of counter-fitted synonyms provided in Jia et al. (2019). We skip BERT-base due to the lack of an effective robust training method.

Attack success rate (first-order).  We quantify first-order robustness through the attack success rate, which measures the ratio of test examples for which an adversarial example can be found. We use first-order attacks as a reference due to the lack of a direct baseline. We experiment with two black-box attacks: (1) The Genetic attack (Alzantot et al., 2018; Jia et al., 2019) uses a population-based optimization algorithm that generates both syntactically and semantically similar adversarial examples by replacing words within the list of counter-fitted synonyms. (2) The BAE attack (Garg and Ramakrishnan, 2020) generates coherent adversarial examples by masking and replacing words using BERT. For both methods we use the implementation provided by TextAttack (Morris et al., 2020).

Attack success rate (second-order).  We also quantify second-order robustness through the attack success rate, which measures the ratio of test examples for which a vulnerable example can be found. To evaluate the impact of the neighborhood size, we experiment with two configurations: (1) For the small neighborhood ($k=2$), we use SO-Enum, which finds the most similar vulnerable example. (2) For the large neighborhood ($k=6$), SO-Enum is not applicable and we use SO-Beam to find vulnerable examples. We consider the most challenging setup and use patch words $\bm{p}$ from the same set of counter-fitted synonyms as the robust models (they are provably robust to these synonyms on the test set). We also provide a random baseline to validate the effectiveness of minimizing Eq. 4 (Section A.1).

Quality metrics (perplexity and similarity).  We quantify the quality of our constructed vulnerable examples through two metrics: (1) GPT-2 (Radford et al., 2019) perplexity quantifies the naturalness of a sentence (smaller is better). We report the perplexity for both the original input examples and the constructed vulnerable examples. (2) The $\ell_0$ norm distance quantifies the disparity between two sentences (smaller is better). We report the distance between the input and the vulnerable example. Note that first-order attacks have different objectives and thus cannot be compared directly.
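A sketch of how these two metrics could be computed with HuggingFace and plain Python; the exact perplexity protocol in the paper may differ (e.g., tokenization details), so treat this as illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def gpt2_perplexity(sentence: str) -> float:
    """exp(mean token cross-entropy) of the sentence under GPT-2."""
    ids = lm_tok(sentence, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss
    return float(torch.exp(loss))

def l0_distance(tokens_a, tokens_b) -> int:
    """Number of token positions where the two sentences differ."""
    assert len(tokens_a) == len(tokens_b), "substitutions keep the sentence length fixed"
    return sum(a != b for a, b in zip(tokens_a, tokens_b))
```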

3.3.2 Results

We experiment with the validation split (872 examples) on a single RTX 3090. The average running time per example (in seconds) on the base LSTM is 31.9 for Genetic, 1.1 for BAE, 7.0 for SO-Enum ($k=2$), and 1.9 for SO-Beam ($k=6$). We provide additional running time results in Section A.3. Table 1 provides an example of the attack results where all attacks are successful (additional examples in Section A.5). As shown, our second-order attacks find a vulnerable example by replacing grease $\rightarrow$ musicals, and the vulnerable example has different predictions for bad and unhealthy. Note that Genetic and BAE have different objectives from second-order attacks and focus on finding the adversarial example. Next we discuss the results from two perspectives.

Attack Success Rate (%)
Genetic BAE SO-Enum SO-Beam
Base Models:
BoW 57.0 69.7 95.3 99.7
CNN 62.0 71.0 95.3 99.8
LSTM 60.0 68.3 95.8 99.5
Transformer 73.0 74.3 95.4 98.0
BERT-base 41.0 61.5 94.3 98.7
Robust Models:
BoW 28.0 63.1 81.5 88.4
CNN 23.0 64.4 91.0 96.0
LSTM 24.0 61.0 62.9 77.5
Transformer 56.0 71.6 91.2 96.2
Table 2: The average attack success rates over 872 examples (100 for Genetic due to its long running time). Second-order attacks achieve higher success rates since they are able to search beyond the test set.

Second-order robustness.  We observe that existing robustly trained models are not second-order robust. As shown in Table 2, our second-order attacks attain high success rates not only on the base models but also on the robustly trained models. For instance, on the robustly trained CNN and Transformer, SO-Beam finds vulnerable examples within a small neighborhood for around 96.0% of the test examples, even though these models have improved resistance to strong first-order attacks (success rates drop from 62.0%–74.3% to 23.0%–71.6% for Genetic and BAE; BAE remains more effective on robust models as it may use replacement words outside the counter-fitted synonyms). This phenomenon can be explained by the fact that both first-order attacks and robust training methods focus on synonym substitutions on the test set, whereas our attacks, due to their second-order nature, find vulnerable examples beyond the test set, and the search is not required to maintain semantic similarity. Our methods provide a way to further investigate the robustness (or find vulnerable and adversarial examples) even when the model is robust on the test set.

Quality of constructed vulnerable examples.  As shown in Table 3, second-order attacks are able to construct vulnerable examples by perturbing 1.3 words on average, with a slightly increased perplexity. For instance, on the robustly trained CNN and Transformer, SO-Beam constructs vulnerable examples by perturbing 1.3 words on average, with the median perplexity increased from around 165 to around 210 (we report the median due to the unreasonably large perplexity of certain sentences, e.g., 395 for "that’s a cheat." but 6740 for "that proves perfect cheat."). We provide metrics for first-order attacks in Section A.5, as they have different objectives and are not directly comparable.

Furthermore, applying existing attacks to the vulnerable examples constructed by our method leads to much smaller perturbations. As a reference, on the robustly trained CNN, the Genetic attack constructs adversarial examples by perturbing 2.7 words on average (starting from the input examples). However, if Genetic starts from our vulnerable examples, it only needs to perturb a single word (i.e., the patch words $\bm{p}$) to alter the prediction. These results demonstrate the weakness of the models (even robustly trained ones) on inputs beyond the test set.

SO-Enum SO-Beam
Original PPL Perturb PPL $\ell_0$ Original PPL Perturb PPL $\ell_0$
Base Models:
BoW 168 202 1.1 166 202 1.2
CNN 170 204 1.1 166 201 1.2
LSTM 168 204 1.1 166 204 1.2
Transformer 165 193 1.0 165 195 1.1
BERT-base 170 229 1.3 168 222 1.4
Robust Models:
BoW 170 212 1.2 171 222 1.4
CNN 166 209 1.2 168 210 1.3
LSTM 194 251 1.3 185 260 1.8
Transformer 170 213 1.2 165 208 1.3
Table 3: The quality metrics for second-order methods. We report the median perplexity (PPL) and the average $\ell_0$ norm distance. The original PPL may differ across models since we only count successful attacks.

3.3.3 Human Evaluation

We perform a human evaluation of the examples constructed by SO-Beam. Specifically, we randomly select 100 successful attacks and evaluate both the original examples and the vulnerable examples. To evaluate the naturalness of the constructed examples, we ask the annotators to score the likelihood (on a Likert scale of 1-5, with 5 being the most likely) of being an original example based on grammar correctness. To evaluate the semantic similarity after applying the synonym substitution $\bm{p}$, we ask the annotators to predict the sentiment of each example, and calculate the ratio of examples that maintain the same sentiment prediction after the synonym substitution. For both metrics, we take the median of 3 independent annotations. We use US-based annotators on Amazon’s Mechanical Turk (https://www.mturk.com) and pay $0.03 per annotation, expecting each annotation to take 10 seconds on average (effectively, the hourly rate is about $11). See Section A.2 for more details.

As shown in Table 4, the naturalness score drops only slightly after the perturbation, indicating that our constructed vulnerable examples are about as natural as the original examples. As for the semantic similarity, we observe that 85% of the original examples maintain the same meaning after the synonym substitution, and the corresponding ratio is 71% for vulnerable examples. This indicates that the synonym substitution is an invariance transformation for most examples.

Naturalness (1-5) Semantic Similarity (%)
Original Perturb Original Perturb
3.87 3.63 85 71
Table 4: The quality metrics from human evaluation.

4 Evaluating Counterfactual Bias

In addition to evaluating second-order robustness, we further extend the double perturbation framework (Section 2) to evaluate counterfactual biases by setting $\bm{p}$ to pairs of protected tokens. We show that our method can reveal the hidden model bias.

4.1 Counterfactual Bias

In contrast to second-order robustness, where we consider the model vulnerable as long as there exists one vulnerable example, counterfactual bias focuses on the expected prediction, which is the average prediction among all examples within the neighborhood. We consider a model biased if the expected predictions for protected groups are different (assuming the model is not intended to discriminate between these groups). For instance, a sentiment classifier is biased if the expected prediction for inputs containing woman is more positive (or negative) than for inputs containing man. Such bias is harmful as the model may make unfair decisions based on protected attributes, for example in situations such as hiring and college admission.

Figure 4: An illustration of an unbiased model vs. a biased model. Green and gray indicate the probability of positive and negative predictions, respectively. Left: An unbiased model where the $(x, x\oplus\bm{p})$ pair (Yellow-Red dots) is relatively parallel to the decision boundary. Right: A biased model where the predictions for $x\oplus\bm{p}$ (Red) are usually more negative (gray) than those for $x$ (Yellow).

Counterfactual token bias.  We study a narrow case of counterfactual bias, where counterfactual examples are constructed by substituting protected tokens in the input. A naive approach to measuring this bias is to construct counterfactual examples directly from the test set; however, such an evaluation may not be robust since test examples are only a small subset of natural sentences. Formally, let $\bm{p}$ be a pair of protected tokens such as (he, she) or (Asian, American), and let $\mathcal{X}_{\text{test}}\subseteq\mathcal{X}_{\bm{p}}$ be a test set (as in Section 2.1). We define the counterfactual token bias by:

$B_{\bm{p},k}:=\mathop{\mathbb{E}}_{x\in\text{Neighbor}_{k}(\mathcal{X}_{\text{test}})}F_{\text{soft}}(x;\bm{p}).$ (5)

We calculate Eq. 5 through an enumeration over all natural sentences within the neighborhood. (For gender bias, we employ a blacklist to avoid adding gendered tokens during the neighborhood construction. This is to avoid semantic shift when, for example, $\bm{p}=(\texttt{he},\texttt{she})$, such that the pronoun may refer to different tokens after the substitution.) Here $\text{Neighbor}_{k}(\mathcal{X}_{\text{test}})=\bigcup_{x\in\mathcal{X}_{\text{test}}}\text{Neighbor}_{k}(x)$ denotes the union of the neighborhoods (of distance $k$) around the test set, and $F_{\text{soft}}(x;\bm{p}):\mathcal{X}\times\mathcal{V}^{2}\rightarrow[-1,1]$ denotes the difference between the probability outputs $f_{\text{soft}}$ (similar to Eq. 2):

$F_{\text{soft}}(x;\bm{p}):=f_{\text{soft}}(x\oplus\bm{p})-f_{\text{soft}}(x).$ (6)

The model is unbiased on $\bm{p}$ if $B_{\bm{p},k}\approx 0$, whereas a positive or negative $B_{\bm{p},k}$ indicates that the model shows a preference for or against $p^{(2)}$, respectively. Fig. 4 illustrates the distribution of $(x, x\oplus\bm{p})$ for both an unbiased model and a biased model.

The aforementioned neighborhood construction does not introduce additional bias. For instance, let $x_0$ be a sentence containing he; even though $\text{Neighbor}_{1}(x_{0})$ may contain many stereotyping sentences (e.g., containing tokens such as doctor and driving) that affect the distribution of $f_{\text{soft}}(x)$, this does not bias Eq. 6, as we only care about the prediction difference from replacing $\texttt{he}\rightarrow\texttt{she}$. The construction has no information about the model objective, and thus it would be difficult for it to bias $f_{\text{soft}}(x)$ and $f_{\text{soft}}(x\oplus\bm{p})$ differently.
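As a concrete illustration, the sketch below estimates $B_{\bm{p},k}$ by enumerating the constructed neighborhood, reusing apply_patch and neighbor_k from the earlier snippets; f_soft is again an assumed model wrapper, and the gendered-token blacklist mentioned above is omitted for brevity.

```python
def counterfactual_token_bias(f_soft, test_sentences, patch, k=3):
    """Estimate B_{p,k} (Eq. 5): average F_soft(x; p) over the neighborhood."""
    diffs = []
    for sent in test_sentences:
        tokens = sent.split()
        if tokens.count(patch[0]) != 1:
            continue  # only sentences with a single occurrence of p^(1)
        for x in neighbor_k(tokens, k):
            x = list(x)
            if x.count(patch[0]) != 1:
                continue
            diffs.append(f_soft(apply_patch(x, patch)) - f_soft(x))
    return sum(diffs) / len(diffs) if diffs else 0.0
```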

4.2 Experimental Results

In this section, we use gender bias as a running example, and demonstrate the effectiveness of our method by revealing the hidden model bias. We provide additional results in Section A.4.

Patch Words # Original # Perturbed
he,she 5 325401
his,her 4 255245
him,her 4 233803
men,women 3 192504
man,woman 3 222981
actor,actress 2 141780
…
Total 34 2317635
Table 5: The number of original examples ($k=0$) and the number of perturbed examples ($k=3$) in $\mathcal{X}_{\text{filter}}$.

4.2.1 Setup

We evaluate counterfactual token bias on the SST-2 dataset with both the base and debiased models. We focus on binary gender bias and set $\bm{p}$ to pairs of gendered pronouns from Zhao et al. (2018a).

Base Model.  We train a single-layer LSTM with pre-trained GloVe embeddings and a hidden size of 75 (from TextAttack, Morris et al. 2020). The model has 82.9% accuracy, similar to the baseline performance reported in GLUE.

Debiased Model.  Data-augmentation with gender swapping has been shown effective in mitigating gender bias (Zhao et al., 2018a, 2019). We augment the training split by swapping all male entities with the corresponding female entities and vice-versa. We use the same setup as the base LSTM and attain 82.45% accuracy.
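A minimal sketch of this gender-swap augmentation; the swap list here is illustrative and far shorter than the full word-pair list of Zhao et al. (2018a).

```python
SWAP_PAIRS = [("he", "she"), ("his", "her"), ("him", "her"),
              ("man", "woman"), ("men", "women"), ("actor", "actress")]
SWAP = {a: b for a, b in SWAP_PAIRS}
SWAP.update({b: a for a, b in SWAP_PAIRS})
# Note: "her" maps back to only one of (him, his); as discussed in Appendix A.4,
# such tuples cannot be swapped perfectly without part-of-speech information.

def gender_swap(sentence: str) -> str:
    """Swap male and female tokens in a whitespace-tokenized sentence."""
    return " ".join(SWAP.get(tok, tok) for tok in sentence.split())

def augment(dataset):
    """Original (sentence, label) pairs plus their gender-swapped counterparts."""
    return dataset + [(gender_swap(x), y) for x, y in dataset]
```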

Metrics.  We evaluate model bias through the proposed $B_{\bm{p},k}$ for $k=0,\dots,3$. Here the bias for $k=0$ is effectively measured on the original test set, and the bias for $k\geq 1$ is measured on our constructed neighborhood. We randomly sample a subset of the constructed examples when $k=3$ due to the exponential complexity.

Filtered test set.  To investigate whether our method is able to reveal model bias that is hidden in the test set, we construct a filtered test set on which the bias cannot be observed directly. Let $\mathcal{X}_{\text{test}}$ be the original validation split; we construct $\mathcal{X}_{\text{filter}}$ by the equation below and empirically set $\epsilon=0.005$. We provide statistics in Table 5.

$\mathcal{X}_{\text{filter}}:=\{x\mid|F_{\text{soft}}(x;\bm{p})|<\epsilon,\;x\in\mathcal{X}_{\text{test}}\}.$
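A sketch of this filtering step, reusing the helpers from the earlier snippets (f_soft is an assumed model wrapper; the function name is ours).

```python
def build_filtered_test_set(f_soft, test_sentences, patch, eps=0.005):
    """X_filter: sentences whose direct counterfactual difference is below eps,
    so that no bias on the patch p is visible at k = 0."""
    kept = []
    for sent in test_sentences:
        tokens = sent.split()
        if tokens.count(patch[0]) != 1:
            continue
        if abs(f_soft(apply_patch(tokens, patch)) - f_soft(tokens)) < eps:
            kept.append(sent)
    return kept
```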

4.2.2 Results

Figure 5: Our proposed $B_{\bm{p},k}$ measured on $\mathcal{X}_{\text{filter}}$. Here “original” is equivalent to $k=0$, “perturbed” is equivalent to $k=3$, and $\bm{p}$ is in the form $(\texttt{male},\texttt{female})$.
Figure 6: Left and Middle: Histograms of $F_{\text{soft}}(x;\bm{p})$ (x-axis) with $\bm{p}=(\texttt{actor},\texttt{actress})$. Right: The plot of the average $F_{\text{soft}}(x;\bm{p})$ (i.e., the counterfactual token bias) vs. the neighborhood distance $k$. Results show that the counterfactual bias on $\bm{p}$ is revealed when increasing $k$.

Our method is able to reveal the hidden model bias on $\mathcal{X}_{\text{filter}}$, which is not visible with naive measurements. In Fig. 5, the naive approach ($k=0$) observes very small biases on most tokens (by construction). In contrast, when evaluated by our double perturbation framework ($k=3$), we observe noticeable bias, where most $\bm{p}$ have a positive bias on the base model. This observed bias is in line with the measurements on the original $\mathcal{X}_{\text{test}}$ (Section A.4), indicating that we reveal the correct model bias. Furthermore, we observe mitigated biases in the debiased model, which demonstrates the effectiveness of data augmentation.

To demonstrate how our method reveals hidden bias, we conduct a case study with $\bm{p}=(\texttt{actor},\texttt{actress})$ and show the relationship between the bias $B_{\bm{p},k}$ and the neighborhood distance $k$. We present the histograms of $F_{\text{soft}}(x;\bm{p})$ in Fig. 6 and plot the corresponding $B_{\bm{p},k}$ vs. $k$ in the right-most panel. Surprisingly, for the base model, the bias is negative when $k=0$ but becomes positive when $k=3$. This is because the naive approach only has two test examples (Table 5), and thus the measurement is not robust. In contrast, our method constructs 141780 similar natural sentences when $k=3$ and shifts the distribution to the right (positive). As shown in the right-most panel, the bias is small when $k=1$ and becomes more significant as $k$ increases (larger neighborhood). As discussed in Section 4.1, the neighborhood construction does not introduce additional bias, and these results demonstrate the effectiveness of our method in revealing hidden model bias.

5 Related Work

First-order robustness evaluation.  A line of work has been proposed to study the vulnerability of natural language models, through transformations such as character-level perturbations (Ebrahimi et al., 2018), word-level perturbations (Jin et al., 2019; Ren et al., 2019; Yang et al., 2020; Hsieh et al., 2019; Cheng et al., 2020; Li et al., 2020), prepending or appending a sequence (Jia and Liang, 2017; Wallace et al., 2019a), and generative models (Zhao et al., 2018b). They focus on constructing adversarial examples from the test set that alter the prediction, whereas our methods focus on finding vulnerable examples beyond the test set whose predictions can be altered.

Robustness beyond the test set.  Several works have studied model robustness beyond test sets but mostly focused on computer vision tasks. Zhang et al. (2019) demonstrate that a robustly trained model could still be vulnerable to small perturbations if the input comes from a distribution only slightly different than a normal test set (e.g., images with slightly different contrasts). Hendrycks and Dietterich (2019) study more sources of common corruptions such as brightness, motion blur and fog. Unlike in computer vision where simple image transformations can be used, in our natural language setting, generating a valid example beyond test set is more challenging because language semantics and grammar must be maintained.

Counterfactual fairness.  Kusner et al. (2017) propose counterfactual fairness and consider a model fair if changing the protected attributes does not affect the distribution of the prediction. We follow this definition and focus on evaluating the counterfactual bias between pairs of protected tokens. Existing literature quantifies fairness on a test dataset or through templates (Feldman et al., 2015; Kiritchenko and Mohammad, 2018; May et al., 2019; Huang et al., 2020). For instance, Garg et al. (2019) quantify the absolute counterfactual token fairness gap on the test set; Prabhakaran et al. (2019) study perturbation sensitivity for named entities on a given set of corpora. Wallace et al. (2019b); Sheng et al. (2019, 2020) study how language generation models respond differently to prompt sentences containing mentions of different demographic groups. In contrast, our method quantifies the bias on the constructed neighborhood.

6 Conclusion

This work proposes the double perturbation framework to identify model weaknesses beyond the test dataset, and studies a stronger notion of robustness and counterfactual bias. We hope that our work can stimulate research on further improving the robustness and fairness of natural language models.

Acknowledgments

We thank the anonymous reviewers for their helpful feedback. We thank the UCLA-NLP group for the valuable discussions and comments. The research is supported by NSF #1927554, #1901527, #2008173, and #2048280 and an Amazon Research Award.

Ethical Considerations

Intended use.  One primary goal of NLP models is generalization to real-world inputs. However, existing test datasets and templates are often not comprehensive, and thus it is difficult to evaluate real-world performance (Recht et al., 2019; Ribeiro et al., 2020). Our work sheds light on quantifying performance for inputs beyond the test dataset and helps uncover model weaknesses prior to real-world deployment.

Misuse potential.  Similar to other existing adversarial attack methods (Ebrahimi et al., 2018; Jin et al., 2019; Zhao et al., 2018b), our second-order attacks can be used to find vulnerable examples for an NLP system. Therefore, it is essential to study how to improve the robustness of NLP models against second-order attacks.

Limitations.  While the core idea of the double perturbation framework is general, in Section 4 we consider only binary gender in the analysis of counterfactual fairness due to the restriction of the English corpus we used, which only has words associated with binary gender, such as he/she, waiter/waitress, etc.

References

  • Alzantot et al. (2018) Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.
  • Cheng et al. (2020) Minhao Cheng, Jinfeng Yi, Pin-Yu Chen, Huan Zhang, and Cho-Jui Hsieh. 2020. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3601–3608.
  • Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, page 67–73, New York, NY, USA. Association for Computing Machinery.
  • Dvijotham et al. (2018) Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O’Donoghue, Jonathan Uesato, and Pushmeet Kohli. 2018. Training verified learners with learned verifiers.
  • Ebrahimi et al. (2018) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.
  • Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 259–268.
  • Garg et al. (2019) Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, page 219–226, New York, NY, USA. Association for Computing Machinery.
  • Garg and Ramakrishnan (2020) Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6174–6181, Online. Association for Computational Linguistics.
  • Hendrycks and Dietterich (2019) Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations.
  • Hsieh et al. (2019) Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, and Cho-Jui Hsieh. 2019. On the robustness of self-attentive models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1520–1529.
  • Huang and Chang (2021) Kuan-Hao Huang and Kai-Wei Chang. 2021. Generating syntactically controlled paraphrases without using annotated parallel pairs. In EACL.
  • Huang et al. (2019) Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Huang et al. (2020) Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020. Reducing sentiment bias in language models via counterfactual evaluation. Findings in EMNLP.
  • Iyyer et al. (2018) Mohit Iyyer, J. Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. ArXiv, abs/1804.06059.
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2021–2031. Association for Computational Linguistics.
  • Jia et al. (2019) Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In EMNLP/IJCNLP.
  • Jin et al. (2019) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is bert really robust? a strong baseline for natural language attack on text classification and entailment.
  • Kiritchenko and Mohammad (2018) Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguistics.
  • Kusner et al. (2017) Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4066–4076. Curran Associates, Inc.
  • Li et al. (2020) Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202, Online. Association for Computational Linguistics.
  • May et al. (2019) Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Miyato et al. (2017) Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. ICLR.
  • Morris et al. (2020) John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. 2020. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp.
  • Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148, San Diego, California. Association for Computational Linguistics.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Prabhakaran et al. (2019) Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5740–5745, Hong Kong, China. Association for Computational Linguistics.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? volume 97 of Proceedings of Machine Learning Research, pages 5389–5400, Long Beach, California, USA. PMLR.
  • Ren et al. (2019) Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.
  • Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of nlp models with checklist. In Association for Computational Linguistics (ACL).
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.
  • Sheng et al. (2019) Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In EMNLP.
  • Sheng et al. (2020) Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020. Towards controllable biases in language generation. In EMNLP-Finding.
  • Shi et al. (2020) Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh. 2020. Robustness verification for transformers. In International Conference on Learning Representations.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  • Wallace et al. (2019a) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for attacking and analyzing nlp. In EMNLP/IJCNLP.
  • Wallace et al. (2019b) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019b. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface’s transformers: State-of-the-art natural language processing.
  • Xu et al. (2020) Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. 2020. Automatic perturbation analysis for scalable certified robustness and beyond.
  • Yang et al. (2020) Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, and Michael I. Jordan. 2020. Greedy attack and gumbel attack: Generating adversarial examples for discrete data. Journal of Machine Learning Research, 21(43):1–36.
  • Zang et al. (2020) Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Zhang et al. (2019) Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, inderjit dhillon, and Cho-Jui Hsieh. 2019. The limitations of adversarial training and the blind-spot attack. In International Conference on Learning Representations.
  • Zhao et al. (2019) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In NAACL.
  • Zhao et al. (2018a) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computational Linguistics.
  • Zhao et al. (2018b) Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018b. Generating natural adversarial examples. In International Conference on Learning Representations.
  • Zhou et al. (2020) Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang. 2020. Defense against adversarial attacks in nlp via dirichlet neighborhood ensemble. arXiv preprint arXiv:2006.11627.

Appendix A Supplemental Material

A.1 Random Baseline

To validate the effectiveness of minimizing Eq. 4, we also experiment with a second-order baseline that constructs vulnerable examples by randomly replacing up to 6 words. We use the same masked language model and threshold as SO-Beam, so that they share a similar neighborhood. We perform the attack on the same models as in Table 2, and the attack success rates on the robustly trained BoW, CNN, LSTM, and Transformer are 18.8%, 22.3%, 15.2%, and 25.1%, respectively. Despite being a second-order attack, the random baseline has low attack success rates, which demonstrates the effectiveness of SO-Beam.

A.2 Human Evaluation

We randomly select 100 successful attacks from SO-Beam and consider four types of examples (for a total of 400 examples): the original examples with and without the synonym substitution $\bm{p}$, and the vulnerable examples with and without the synonym substitution $\bm{p}$. For each example, we annotate the naturalness and sentiment separately as described below.

Naturalness of vulnerable examples.  We ask the annotators to score the likelihood of being an original example (i.e., not altered by computer) based on grammar correctness and naturalness, with a Likert scale of 1-5: (1) Sure adversarial example. (2) Likely an adversarial example. (3) Neutral. (4) Likely an original example. (5) Sure original example.

Semantic similarity after the synonym substitution.  We first ask the annotators to rate the sentiment on a 1-5 Likert scale, and then map the rating to three categories: negative, neutral, and positive. We consider two examples to have the same semantic meaning if and only if they are both positive or both negative.
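For reference, the rating-to-category mapping and the agreement rule can be written as a small helper; the 1-2 / 3 / 4-5 cutoffs are our assumption, while the "both positive or both negative" rule follows the description above.

```python
def to_category(rating: int) -> str:
    """Map a 1-5 Likert sentiment rating to negative/neutral/positive
    (cutoffs are assumed, not specified in the paper)."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

def same_semantics(rating_a: int, rating_b: int) -> bool:
    """Two examples agree iff both are positive or both are negative."""
    a, b = to_category(rating_a), to_category(rating_b)
    return a == b and a != "neutral"
```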

A.3 Running Time

We run the attacks on the validation split on a single RTX 3090 and measure the average running time per example. As shown in Table 6, SO-Beam runs faster than SO-Enum because it utilizes the probability output. The running time may increase when a model has improved second-order robustness.

Running Time (seconds)
Model            Genetic    BAE    SO-Enum    SO-Beam
Base Models:
  BoW               31.6     0.9       6.2        1.8
  CNN               28.8     1.0       5.9        1.7
  LSTM              31.9     1.1       7.0        1.9
  Transformer       51.9     0.5       6.5        2.5
  BERT-base         65.6     1.1      35.4        7.1
Robust Models:
  BoW              103.9     1.0       8.0        3.5
  CNN              129.4     1.0       6.7        2.6
  LSTM             116.4     1.1      10.7        5.3
  Transformer       66.4     0.5       5.9        2.6
Table 6: The average running time per example over 872 examples (100 for Genetic due to its long running time).

A.4 Additional Results on Protected Tokens

Fig. 7 presents the experimental results with additional protected tokens such as nationality, religion, and sexual orientation (from Ribeiro et al. (2020)). We use the same base LSTM as described in Section 4.2. One interesting observation is the case p = (gay, straight), where the bias is negative, indicating that the sentiment classifier tends to give more negative predictions when substituting gay → straight in the input. This phenomenon is the opposite of the behavior of toxicity classifiers (Dixon et al., 2018), and we hypothesize that it is caused by differences in the training data distribution. To verify the hypothesis, we count the number of training examples containing each word and find far more negative than positive examples among those containing straight (Table 7). Looking into the training set, we find that straight to video is a common phrase used to criticize a film, so the classifier incorrectly correlates straight with negative sentiment. This also reveals a limitation of our method on polysemous words.

Figure 7: Additional counterfactual token bias measured on the original validation split with base LSTM.
Word         # Negative    # Positive
gay              37            20
straight         71            18
Table 7: Number of negative and positive examples containing gay and straight in the training set.
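The counts in Table 7 can be reproduced with a simple scan over the training split. Below is a minimal sketch that assumes the training data is an iterable of (sentence, label) pairs with label 1 for positive sentiment; the variable names are illustrative, not the paper's code.

```python
from collections import Counter

def count_label_occurrences(train_data, words=("gay", "straight")):
    """Count how many negative/positive training examples contain each word.

    `train_data` is assumed to yield (sentence, label) pairs with label 1 for
    positive and 0 for negative sentiment."""
    counts = {w: Counter() for w in words}
    for sentence, label in train_data:
        tokens = set(sentence.lower().split())
        for w in words:
            if w in tokens:
                counts[w]["positive" if label == 1 else "negative"] += 1
    return counts
```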

In Fig. 8, we measure the bias on 𝒳_test and observe a positive bias for most tokens at both k=0 and k=3, which indicates that the model “tends” to make more positive predictions for examples containing certain female pronouns than for those containing male pronouns. Notice that even though gender swap mitigates the bias to some extent, it is still difficult to fully eliminate the bias. This is probably caused by tuples like (him, his, her) that cannot be swapped perfectly and require additional processing such as part-of-speech resolution (Zhao et al., 2018a).

Figure 8: Full results for gendered tokens measured on the original validation split.
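For concreteness, the per-token bias reported in Fig. 7 and Fig. 8 can be sketched as the mean shift in the predicted positive probability when one patch word is substituted for the other. The sketch below is only an illustration under that assumption; the `predict_positive` interface and the token pair are placeholders, not the paper's exact measurement code.

```python
def counterfactual_token_bias(sentences, predict_positive, pair):
    """Average P(positive | x with pair[1]) - P(positive | x with pair[0])
    over sentences that contain pair[0] (e.g., pair = ("he", "she"))."""
    src, tgt = pair
    shifts = []
    for sent in sentences:
        tokens = sent.lower().split()
        if src not in tokens:
            continue
        swapped = " ".join(tgt if t == src else t for t in tokens)
        shifts.append(predict_positive(swapped) - predict_positive(sent))
    return sum(shifts) / len(shifts) if shifts else 0.0
```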

To help evaluate the naturalness of our constructed examples used in Section 4, we provide sample sentences in Table 9 and Table 10. Bold words are the corresponding patch words p, taken from the predefined list of gendered pronouns.

A.5 Additional Results on Robustness

Table 8 provides the quality metrics for first-order attacks, where we measure the GPT-2 perplexity and the ℓ0 norm distance between the input and the adversarial example. For BAE we evaluate on 872 validation examples, and for Genetic we evaluate on 100 validation examples due to its long running time.

                 Genetic                              BAE
Model            Original PPL  Perturb PPL   ℓ0      Original PPL  Perturb PPL   ℓ0
Base Models:
  BoW                145           258       3.3         192           268       1.6
  CNN                146           282       3.0         186           254       1.5
  LSTM               131           238       2.9         190           263       1.6
  Transformer        137           232       2.8         185           254       1.4
  BERT-base          201           342       3.4         189           277       1.6
Robust Models:
  BoW                132           177       2.4         214           269       1.5
  CNN                136           236       2.7         211           279       1.5
  LSTM               163           267       2.5         220           302       1.6
  Transformer        118           200       2.8         196           261       1.4
Table 8: The quality metrics for first-order attacks, computed over successful attacks. We compare median perplexities (PPL) and average ℓ0 norm distances.
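For reference, the two quality metrics can be computed as sketched below; the GPT-2 checkpoint and whitespace tokenization for the ℓ0 distance are our assumptions, not necessarily the exact setup used for Table 8.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(sentence: str) -> float:
    """Perplexity = exp(mean token cross-entropy) under GPT-2."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return float(torch.exp(loss))

def l0_distance(original: str, perturbed: str) -> int:
    """Number of word positions that differ (assumes word-level substitutions,
    so both sentences have the same length)."""
    return sum(a != b for a, b in zip(original.split(), perturbed.split()))
```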

Table 11 shows additional attack results from SO-Beam on the base LSTM, and Table 12 shows additional attack results from SO-Beam on the robust CNN. Bold words are the corresponding patch words p, taken from the predefined list of counter-fitted synonyms.

Type Predictions Text
Original 95% Negative (94% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (actress) who is out of their depth .
Distance k=1 97% Negative (97% Negative) it ’s hampered by a lifetime-channel kind of plot and lone lead actor (actress) who is out of their depth .
56% Negative (55% Positive ) it ’s hampered by a lifetime-channel kind of plot and a lead actor (actress) who is out of creative depth .
89% Negative (84% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (actress) who talks out of their depth .
98% Negative (98% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (actress) who is out of production depth .
96% Negative (96% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (actress) that is out of their depth .
Distance k=2 88% Negative (87% Negative) it ’s hampered by a lifetime-channel cast of stars and a lead actor (actress) who is out of their depth .
96% Negative (95% Negative) it ’s hampered by a simple set of plot and a lead actor (actress) who is out of their depth .
54% Negative (54% Negative) it ’s framed about a lifetime-channel kind of plot and a lead actor (actress) who is out of their depth .
90% Negative (88% Negative) it ’s hampered by a lifetime-channel mix between plot and a lead actor (actress) who is out of their depth .
78% Negative (68% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (actress) who storms out of their mind .
Distance k=3 52% Positive  (64% Positive ) it ’s characterized by a lifetime-channel combination comedy plot and a lead actor (actress) who is out of their depth .
93% Negative (93% Negative) it ’s hampered by a lifetime-channel kind of star and a lead actor (actress) who falls out of their depth .
58% Negative (57% Negative) it ’s hampered by a tough kind of singer and a lead actor (actress) who is out of their teens .
70% Negative (52% Negative) it ’s hampered with a lifetime-channel kind of plot and a lead actor (actress) who operates regardless of their depth .
58% Negative (53% Positive ) it ’s hampered with a lifetime-channel cast of plot and a lead actor (actress) who is out of creative depth .
Table 9: Additional counterfactual bias examples on the base LSTM with p = (actor, actress). We only present 5 examples per k due to space constraints.
Type Predictions Text
Original 55% Positive  (67% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless facilitator of an extended cheap shot across the mason-dixon line .
Distance k=1 52% Positive  (66% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless facilitator of an extended cheap shot from the mason-dixon line .
73% Positive  (79% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless facilitator gives an extended cheap shot across the mason-dixon line .
56% Negative (58% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless facilitator of an extended cheap shot across the phone line .
75% Positive  (83% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless facilitator of an extended chase shot across the mason-dixon line .
75% Positive  (81% Positive ) a hamfisted romantic comedy that makes our boy (girl) our hapless facilitator of an extended cheap shot across the mason-dixon line .
Distance k=2 85% Positive  (85% Positive ) a hilarious romantic comedy that makes our boy (girl) the hapless facilitator of an emotionally cheap shot across the mason-dixon line .
81% Positive  (86% Positive ) a hamfisted romantic comedy romance makes our boy (girl) the hapless facilitator of an extended cheap delivery across the mason-dixon line .
84% Positive  (87% Positive ) a hamfisted romantic romance adventure makes our boy (girl) the hapless facilitator of an extended cheap shot across the mason-dixon line .
50% Negative (62% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless boss of an extended cheap shot behind the mason-dixon line .
77% Negative (71% Negative) a hamfisted lesbian comedy that makes our boy (girl) the hapless facilitator of an extended slap shot across the mason-dixon line .
Distance k=3 97% Positive  (97% Positive ) a darkly romantic comedy romance makes our boy (girl) the hapless facilitator delivers an extended cheap shot across the mason-dixon line .
69% Positive  (74% Positive ) a hamfisted romantic comedy film makes our boy (girl) the hapless facilitator of an extended cheap shot across the production line .
87% Positive  (89% Positive ) a hamfisted romantic comedy that makes our boy (girl) the exclusive focus of an extended cheap shot across the mason-dixon line .
64% Positive  (76% Positive ) a hamfisted romantic comedy that makes our boy (girl) the hapless facilitator shoots an extended flash shot across the camera line .
99% Positive  (99% Positive ) a compelling romantic comedy that makes our boy (girl) the perfect facilitator of an extended story shot across the mason-dixon line .
Table 10: Additional counterfactual bias examples on the base LSTM with p = (boy, girl). We only present 5 examples per k due to space constraints.
Type Predictions Text
Original 99% Positive  (99% Positive ) it ’s a charming and sometimes (often) affecting journey .
Vulnerable 59% Negative (56% Positive ) it ’s a charming and sometimes (often) painful journey .
Original 99% Negative (97% Negative) unflinchingly bleak (somber) and desperate
Vulnerable 80% Negative (79% Positive ) unflinchingly bleak (somber) and mysterious
Original 99% Positive  (93% Positive ) allows us to hope that nolan is poised to embark a major career (quarry) as a commercial yet inventive filmmaker .
Vulnerable 76% Positive  (75% Negative) allows us to hope that nolan is poised to embark a major career (quarry) as a commercial yet amateur filmmaker .
Original 94% Positive  (68% Positive ) the acting , costumes , music , cinematography and sound are all astounding (staggering) given the production ’s austere locales .
Vulnerable 87% Positive  (66% Negative) the acting , costumes , music , cinematography and sound are largely astounding (staggering) given the production ’s austere locales .
Original 99% Positive  (97% Positive ) although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young (juvenile) women .
Vulnerable 94% Positive  (81% Negative) although laced with humor and a few fanciful touches , the film is a moderately serious look at young (juvenile) women .
Original 99% Negative (98% Negative) a sometimes (occasionally) tedious film .
Vulnerable 62% Negative (55% Positive ) a sometimes (occasionally) disturbing film .
Original 100% Negative (100% Negative) in exactly 89 minutes , most of which passed as slowly as if i ’d been sitting naked on an igloo , formula 51 sank from quirky (lunatic) to jerky to utter turkey .
Vulnerable 51% Positive  (65% Negative) lasting exactly 89 minutes , most of which passed as slowly as if i ’d been sitting naked on an igloo , but 51 ranges from quirky (lunatic) to delicious to crisp turkey .
Original 97% Positive  (100% Positive ) the scintillating (mesmerizing) performances of the leads keep the film grounded and keep the audience riveted .
Vulnerable 91% Negative (90% Positive ) the scintillating (mesmerizing) performances of the leads keep the film grounded and keep the plot predictable .
Original 89% Negative (96% Negative) it takes a uncanny (strange) kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .
Vulnerable 80% Positive  (76% Negative) it takes a uncanny (strange) kind of humour to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .
Original 100% Negative (100% Negative) … the film suffers from a lack of humor ( something needed to balance (equilibrium) out the violence ) …
Vulnerable 76% Positive  (86% Negative) … the film derives from a lot of humor ( something clever to balance (equilibrium) out the violence ) …
Original 55% Positive  (97% Positive ) we root for ( clara and paul ) , even like them , though perhaps it ’s an emotion closer to pity (compassion) .
Vulnerable 89% Negative (91% Positive ) we root for ( clara and paul ) , even like them , though perhaps it ’s an explanation closer to pity (compassion) .
Original 95% Negative (97% Negative) even horror fans (stalkers) will most likely not find what they ’re seeking with trouble every day ; the movie lacks both thrills and humor .
Vulnerable 61% Positive  (59% Negative) even horror fans (stalkers) will most likely not find what they ’re seeking with trouble every day ; the movie has both thrills and humor .
Original 100% Positive  (100% Positive ) a gorgeous , high-spirited musical from india that exquisitely mixed (blends) music , dance , song , and high drama .
Vulnerable 87% Negative (81% Positive ) a dark , high-spirited musical from nowhere that loosely mixed (blends) music , dance , song , and high drama .
Original 99% Negative (94% Negative) … the movie is just a plain old (longtime) monster .
Vulnerable 94% Negative (94% Positive ) … the movie is just a pretty old (longtime) monster .
Table 11: Additional sentiment classification results from SO-Beam on base LSTM.
Type Predictions Text
Original 54% Positive  (69% Positive ) for the most part , director anne-sophie birot ’s first feature is a sensitive , overly (extraordinarily) well-acted drama .
Vulnerable 53% Negative (62% Positive ) for the most part , director anne-sophie benoit ’s first feature is a sensitive , overly (extraordinarily) well-acted drama .
Original 66% Positive  (72% Positive ) mr. tsai is a very original painter (artist) in his medium , and what time is it there ?
Vulnerable 52% Negative (55% Positive ) mr. tsai is a very original painter (artist) in his medium , and what time was it there ?
Original 80% Positive  (64% Positive ) sade is an engaging (engage) look at the controversial eponymous and fiercely atheistic hero .
Vulnerable 53% Positive  (66% Negative) sade is an engaging (engage) look at the controversial eponymous or fiercely atheistic hero .
Original 50% Negative (57% Negative) so devoid of any kind of comprehensible (intelligible) story that it makes films like xxx and collateral damage seem like thoughtful treatises
Vulnerable 53% Positive  (54% Negative) so devoid of any kind of comprehensible (intelligible) story that it makes films like xxx and collateral 2 seem like thoughtful treatises
Original 90% Positive  (87% Positive ) a tender , heartfelt (deepest) family drama .
Vulnerable 60% Positive  (61% Negative) a somber , heartfelt (deepest) funeral drama .
Original 57% Positive  (69% Positive ) … a hollow joke (giggle) told by a cinematic gymnast having too much fun embellishing the misanthropic tale to actually engage it .
Vulnerable 56% Negative (56% Positive ) … a hollow joke (giggle) told by a cinematic gymnast having too much fun embellishing the misanthropic tale cannot actually engage it .
Original 73% Negative (56% Negative) the cold (colder) turkey would ’ve been a far better title .
Vulnerable 61% Negative (62% Positive ) the cold (colder) turkey might ’ve been a far better title .
Original 70% Negative (65% Negative) it ’s just disappointingly superficial – a movie that has all the elements necessary to be a fascinating , involving character study , but never does more than scratch the shallow (surface) .
Vulnerable 52% Negative (55% Positive ) it ’s just disappointingly short – a movie that has all the elements necessary to be a fascinating , involving character study , but never does more than scratch the shallow (surface) .
Original 79% Negative (72% Negative) schaeffer has to find some hook on which to hang his persistently useless movies , and it might as well be the resuscitation (revival) of the middle-aged character .
Vulnerable 57% Negative (57% Positive ) schaeffer has to find some hook on which to hang his persistently entertaining movies , and it might as well be the resuscitation (revival) of the middle-aged character .
Original 64% Positive  (58% Positive ) the primitive force of this film seems to bubble up from the vast collective memory of the combatants (militants) .
Vulnerable 52% Positive  (53% Negative) the primitive force of this film seems to bubble down from the vast collective memory of the combatants (militants) .
Original 64% Positive  (74% Positive ) on this troublesome (tricky) topic , tadpole is very much a step in the right direction , with its blend of frankness , civility and compassion .
Vulnerable 55% Negative (56% Positive ) on this troublesome (tricky) topic , tadpole is very much a step in the right direction , losing its blend of frankness , civility and compassion .
Original 74% Positive  (60% Positive ) if you ’re hard (laborious) up for raunchy college humor , this is your ticket right here .
Vulnerable 60% Positive  (57% Negative) if you ’re hard (laborious) up for raunchy college humor , this is your ticket holder here .
Original 94% Positive  (97% Positive ) a fast , funny , highly fun (enjoyable) movie .
Vulnerable 54% Negative (65% Positive ) a dirty , violent , highly fun (enjoyable) movie .
Original 86% Positive  (88% Positive ) good old-fashioned slash-and-hack is back (backwards) !
Vulnerable 52% Negative (55% Positive ) a old-fashioned slash-and-hack is back (backwards) !
Table 12: Additional sentiment classification results from SO-Beam on robust CNN.