
WaNLI: Worker and AI Collaboration for
Natural Language Inference Dataset Creation

Alisa Liu   Swabha Swayamdipta   Noah A. Smith   Yejin Choi   
Paul G. Allen School of Computer Science & Engineering, University of Washington
Allen Institute for Artificial Intelligence  University of Southern California
alisaliu@cs.washington.edu
Abstract

A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI for natural language inference (NLI), our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine-generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers. The resulting dataset, WaNLI, consists of 107,885 NLI examples and presents unique empirical strengths over existing NLI datasets. Remarkably, training a model on WaNLI improves performance on eight out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI, compared to training on the 4× larger MultiNLI. Moreover, it continues to be more effective than MultiNLI augmented with other NLI datasets. Our results demonstrate the promise of leveraging natural language generation techniques and re-imagining the role of humans in the dataset creation process.

Figure 1: An illustration of our pipeline for creating WaNLI. Starting with a data map Swayamdipta et al. (2020) of an existing dataset relative to a trained model, (1) we automatically identify pockets of data instances exemplifying challenging reasoning patterns. Next, (2) we use GPT-3 to generate new instances with the same pattern. These generated examples are then (3) automatically filtered via a metric we introduce inspired by data maps, and (4) given to human annotators to assign a gold label and optionally revise.

1 Introduction

As much as large-scale crowdsourced datasets have expedited progress on various NLP problems, a growing body of research has revealed fundamental limitations in existing datasets: they are often flooded with repetitive and spurious patterns, rather than covering the broad range of linguistic phenomena required by the task Bowman and Dahl (2021). This leads to models that seem to achieve human-level performance on in-domain test sets, yet are brittle when given out-of-domain or adversarial examples Ribeiro et al. (2020); Glockner et al. (2018).

We attribute this problem to an inherent challenge in the crowdsourcing design—the prevalent paradigm for creating large-scale NLP datasets—where a relatively small number of workers create a massive number of free text examples. While human annotators are generally reliable for writing correct examples, crafting diverse and creative examples at scale can be challenging. Thus, crowdworkers often resort to a limited set of writing strategies for speed, at the expense of diversity Geva et al. (2019); Gururangan et al. (2018). When models overfit to such repetitive patterns, they fail to generalize to out-of-domain examples where these patterns no longer hold Geirhos et al. (2020).

On the other hand, there has been remarkable progress in open-ended text generation based on massive language models (Brown et al., 2020; Raffel et al., 2020, i.a.). Despite known deficiencies such as incoherence or repetition Dou et al. (2021), these models often produce human-like text Clark et al. (2021) and show potential for creative writing tasks Lee et al. (2022). Importantly, these models are capable of replicating a pattern given just a few examples in context (Brown et al., 2020, GPT-3).

In this paper, we introduce a novel approach for dataset creation which brings together the generative strength of language models and the evaluative strength of humans through human and machine collaboration (§2). The key insight of our approach is that language models can create new examples by replicating linguistic patterns that are valuable for training, without necessarily “understanding” the task itself. Illustrated in Figure 1, our pipeline starts with an existing dataset. We use dataset cartography from Swayamdipta et al. (2020) to automatically identify pockets of examples that demonstrate challenging reasoning patterns relative to a trained model. Using each group as a set of in-context examples, we leverage a pretrained language model to generate new examples likely to have the same pattern (see Table 1). We then propose a novel metric, building on dataset cartography, to automatically filter generations that are most likely to aid model learning. Finally, we validate the generated examples by subjecting them to human review, where crowdworkers assign a gold label and (optionally) revise for quality.

Each row pairs a seed MultiNLI example with the corresponding generated WaNLI example, followed by the label and reasoning.

Seed: P: 5 percent probability that each part will be defect free. H: Each part has a 95 percent chance of having a defect.
Generated: P: 1 percent of the seats were vacant. H: 99 percent of the seats were occupied.
Label: Entailment. Reasoning: Set complements.

Seed: P: The artisans, tradespeople, and providers of entertainment (reputable and not so reputable) lived downtown on the reclaimed marshlands north and east, in the area still known as Shitamachi. H: The only place where artisans, tradespeople and entertainers could live was in the marshlands to the north and east.
Generated: P: At the time of the Revolution, the old port of Marseille was a great center of shipbuilding and commerce. H: The only place where ships were built was in the old port of Marseille.
Label: Neutral. Reasoning: Hypothesis introduces exclusivity.

Seed: P: To the south, in the Sea of Marmara, lie the woods and beaches of the Princes’ Islands. H: In the north is the Sea of Marmara where there are mountains to climb.
Generated: P: From the park’s southern entrance, follow the avenue south to the Hotel de Ville. H: From the park’s northern entrance, follow the avenue north to the Hotel de Ville.
Label: Contradiction. Reasoning: Reversing cardinal directions.

Seed: P: Democrats released documents indicating that Republicans sold big political donors meals with the party’s leaders in federal buildings in 1995. H: It is illegal for a party to solicit products to donors.
Generated: P: In the late 1960s, students at a university in Wisconsin tried to organize a union. H: It was illegal for the students to organize a union.
Label: Neutral. Reasoning: Illegal things can happen.

Seed: P: She ducked and parried the blow. H: She ducked to miss the blow.
Generated: P: She stepped on the brake and the car came to a stop. H: She stepped on the brake to stop the car.
Label: Entailment. Reasoning: Implied intention.

Seed: P: To build a worldclass finance organization and help achieve better business outcomes, each of the organizations we examined set an agenda for transforming the finance organization by defining a shared vision -i.e. H: The transformation was a disaster and the entire organization had to be scrapped.
Generated: P: In order to help improve customer service, I suggested that they send a representative to our office to discuss our concerns. H: The representative sent to our office did not solve our problems and we lost a lot of business.
Label: Neutral. Reasoning: Intended goals may not actualize.

Seed: P: Salinger wrote similar letters to other young female writers. H: Other young female writers received similar letters from Salinger as well.
Generated: P: The three schools have a number of students who are from families with no history of financial difficulties. H: Families with no history of financial difficulties send their children to the three schools.
Label: Entailment. Reasoning: Substituting a verb with a different subcategorization frame.
Table 1: Seed MultiNLI examples, and corresponding WaNLI examples generated by GPT-3. P stands for premise, H for hypothesis. The seed example is “ambiguous” according to the definitions of Swayamdipta et al. (2020), discussed in §2. The remaining in-context examples (shown in Appendix C.1) share the same pattern and are found using distance in [CLS] embeddings of a trained task model. The reasoning is a short description of the pattern we observe in the group, which is successfully repeated in the generated example.

We demonstrate the effectiveness of our approach on the task of natural language inference (NLI), which determines whether a premise entails (i.e., implies the truth of) a hypothesis, both expressed in natural language. Despite NLI being one of the most resource-rich tasks in NLP, analysis and challenge sets repeatedly demonstrate the limitations of existing datasets and the brittleness of NLI models trained on them Gururangan et al. (2018); Poliak et al. (2018); Tsuchiya (2018). Using MultiNLI Williams et al. (2018) as our original dataset, we use our pipeline to create a dataset of 107,885 examples, which we call Worker-and-AI NLI (WaNLI), pronounced wan-li like the Chinese characters 万理, as in ten thousand reasoning. A demo, data, and code are available at https://wanli.allenai.org/.

Remarkably, empirical results demonstrate that replacing MultiNLI supervision with WaNLI (which is 4 times smaller) improves performance on eight different out-of-domain test sets, including datasets that are converted to the NLI format from downstream tasks such as question-answering and fact verification (§3). This result holds even when augmenting MultiNLI with other NLI datasets and recently proposed augmentation sets. Moreover, including WaNLI in the training data can help improve performance on certain in-domain test sets. We then analyze WaNLI and show that it has fewer previously documented spurious correlations than MultiNLI (§4), and provide insights into the collaborative framework (§5).

Our approach contrasts with previous instruction-based generation of dataset examples Schick and Schütze (2021); West et al. (2021), which requires the model to understand the task from context, fundamentally limiting the complexity of generated output to what is accessible by the model. Moreover, our human-in-the-loop approach is collaborative, rather than adversarial Dinan et al. (2019); Nie et al. (2020); Bartolo et al. (2020). Overall, we leverage the best of both worlds: a powerful model’s ability to efficiently generate diverse examples, and humans’ ability to improve and ensure the quality of generations.

Our worker-AI collaborative approach is more scalable than the traditional crowdsourcing framework. It is also generalizable, allowing for the rejuvenation of datasets for many different classification tasks, especially when performance seems to stagnate due to overfitting to popular benchmarks Recht et al. (2019). Our work shows the promise of leveraging language models in a controlled way to aid the dataset creation process, and we encourage the community to think of dataset curation as an AI challenge itself.

2 Worker-AI Collaborative Dataset Creation for NLI

We describe our four-stage approach for dataset creation based on worker and AI collaboration. In this work, we apply it to the task of natural language inference (NLI), which involves predicting whether a premise entails, contradicts, or is neutral to a hypothesis. NLI has broad applicability in NLP: it has proven useful for pretraining Clark et al. (2019); Phang et al. (2018), and can be applied to verify candidate answers in question-answering Chen et al. (2021) or the factuality of generated summaries Maynez et al. (2020).

Our approach requires as prerequisites an initial dataset D_0 and a strong task model M trained on D_0. We use MultiNLI Williams et al. (2018), a large-scale multi-genre NLI dataset, as D_0. We finetune RoBERTa-large Liu et al. (2019) on MultiNLI for our task model M (training details in Appendix B).

As an overview, we first automatically collect groups of examples exemplifying challenging reasoning patterns in D_0 relative to M, using data maps (Swayamdipta et al., 2020; Stage 1, see §2.1). Then we overgenerate similar examples by leveraging the pattern replication capabilities of GPT-3 Brown et al. (2020) (Stage 2; §2.2). While GPT-3 can generate examples efficiently, it may not reliably replicate the desired pattern and its output quality will not be uniform. We address this by automatically filtering the generated examples using a metric derived from data maps (Stage 3; §2.3). We finally subject the collected data to human review, in which crowdworkers optionally revise examples and assign gold labels (Stage 4; §2.4).

Dataset Cartography.

A key component of our pipeline is inspired by data maps Swayamdipta et al. (2020), which automatically reveal different regions in a dataset, w.r.t. the behavior of a classification model during training. These include easy-to-learn examples which the model consistently predicts correctly through training, hard-to-learn examples on which it is consistently incorrect, and ambiguous examples for which the model’s confidence in the correct answer exhibits high variability across train epochs. Our pipeline focuses on ambiguous examples, which were shown to lead to more robust models. Additionally, ambiguous examples contain fewer spurious correlations Gardner et al. (2021), suggesting that they capture under-represented counterexamples to spurious correlations. Indeed, such counterexamples take more epochs of training to learn and are crucial for generalization Tu et al. (2020), providing a potential explanation for why they appear ambiguous across early epochs and lead to more robust models.
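As a concrete illustration (not taken from the paper's released code), the data-map statistics for a single training example reduce to simple operations over the per-epoch probabilities of its gold label; the array below is a toy input.

```python
import numpy as np

def training_dynamics(gold_probs):
    """Data-map statistics for one training example (Swayamdipta et al., 2020).

    gold_probs: array of shape (num_epochs,), the probability the model assigned
    to the example's gold label at the end of each training epoch.
    """
    confidence = gold_probs.mean()   # high: easy-to-learn
    variability = gold_probs.std()   # high: ambiguous, the region our pipeline targets
    return confidence, variability

# Toy usage: an example the model flip-flops on across five epochs.
conf, var = training_dynamics(np.array([0.2, 0.7, 0.4, 0.8, 0.5]))
print(f"confidence={conf:.2f}, variability={var:.2f}")
```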

2.1 Stage 1: Collection of Exemplars

In this stage, we automatically collect groups of examples from D_0 which represent linguistic patterns we wish to include in the target dataset. We begin with a seed example (x_i, y_i) in D_0 belonging to the most ambiguous p = 25% relative to M. (For exemplar collection, we exclude the telephone genre of MultiNLI, which consists of telephone conversation transcripts, due to their low fluency and ill-defined entailment relationships. During pilots, we found that generated examples mimicking telephone conversations would require crowdworkers to revise low-quality text for basic fluency.)

To generate a new example with the same reasoning pattern, we wish to leverage the ability of GPT-3 Brown et al. (2020) for in-context learning; hence, we need to first collect examples that test a similar kind of reasoning to x_i. To do this, we use the [CLS] token representation of each example relative to the task model M, and find the k = 4 nearest neighbors via cosine similarity to x_i that have the same label. Detailed qualitative inspection shows that the nearest neighbors in this representation space tend to capture a human-interpretable similarity in the reasoning required to solve an example, rather than lexical or semantic similarity (examples in Table 1).

Han and Tsvetkov (2021) give another interpretation for this approach: for examples with the same label, the similarity of [CLS] token embeddings actually represents the similarity of gradient updates in the row of the final projection layer corresponding to that label. Thus, two examples are close if training on them would “update” the final layer of the model similarly.
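The retrieval step can be pictured with a short sketch, assuming a finetuned NLI encoder is available; the checkpoint name below is a placeholder, and RoBERTa's first token (<s>) plays the role of [CLS].

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: in the pipeline this would be RoBERTa-large finetuned on MultiNLI.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large").eval()

def cls_embedding(premise, hypothesis):
    """Representation of the first ([CLS]-style) token for a premise-hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[0, 0].numpy()

def nearest_same_label(seed_vec, seed_label, vectors, labels, k=4):
    """Indices of the k nearest neighbors by cosine similarity that share seed_label.

    The seed example itself should be excluded from `vectors` (or dropped from the result).
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ (seed_vec / np.linalg.norm(seed_vec))
    ranked = [i for i in np.argsort(-sims) if labels[i] == seed_label]
    return ranked[:k]
```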

Because it automatically identifies areas for augmentation, our method does not require any prior knowledge of challenging patterns, which makes it tractable for building on top of large-scale datasets. Nonetheless, exemplar collection could potentially be approached in different ways (e.g., through expert curation or category labels).

2.2 Stage 2: Overgeneration

Given an automatically extracted group of k + 1 examples from the original dataset D_0, we construct a natural language context (prompt) for a left-to-right language model; in this work, we use GPT-3 Curie (the second-largest GPT-3 model). The prompt template we use is shown in Figure 2, where we order the examples in increasing similarity to the seed example.

Note that our method leverages GPT-3 in a way that is distinct from its typical usage in few-shot settings, where given examples demonstrating a task, GPT-3 performs the task on a new, unlabeled example. Here, we instead give GPT-3 examples representing a particular slice of the task, and ask GPT-3 to generate a new example in the same slice.

For each context, we sample from GPT-3 to create n = 5 distinct examples. We use top-p decoding Holtzman et al. (2020) with p = 0.5 (additional details in Appendix C.2). Although each generated example at this stage could be assumed to share the label of its k + 1 in-context examples, we instead consider the resulting dataset D_gen = {x_i}, produced at the end of Stage 2, to be unlabeled.

Figure 2: Prompt template instructing GPT-3 to generate a new example, given a set of in-context examples. To separate the premise and hypothesis, the word “Implication” is used for entailment examples (shown here), “Possibility” for neutral examples, and “Contradiction” for contradiction examples.
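A minimal sketch of the overgeneration call is below, assuming the legacy OpenAI completions endpoint; the template string follows Figure 2 and Appendix C.1, the decoding hyperparameters follow Table 7, and the exact line breaks and model identifier are assumptions.

```python
import openai  # requires openai.api_key to be set; legacy Completion endpoint assumed

SEPARATOR = {"entailment": "Implication", "neutral": "Possibility", "contradiction": "Contradiction"}

def build_prompt(examples, label):
    """Format the k+1 in-context examples in the style of Figure 2 / Appendix C.1."""
    lines = ["Write a pair of sentences that have the same relationship as the previous examples. Examples:"]
    for i, (premise, hypothesis) in enumerate(examples, start=1):
        lines.append(f"{i}. {premise}")
        lines.append(f"{SEPARATOR[label]}: {hypothesis}")
    lines.append(f"{len(examples) + 1}.")  # GPT-3 continues the numbered list from here
    return "\n".join(lines)

def overgenerate(examples, label, n=5):
    response = openai.Completion.create(
        engine="curie",        # GPT-3 Curie, as used in the paper; the identifier is illustrative
        prompt=build_prompt(examples, label),
        n=n,                   # five samples per context
        top_p=0.5,             # nucleus sampling (Holtzman et al., 2020)
        temperature=1.0,
        max_tokens=120,
        stop="\n\n",
    )
    return [choice.text.strip() for choice in response.choices]
```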

2.3 Stage 3: Automatic Filtering

In this step, we wish to filter generated examples from Stage 2 to retain those that are the most ambiguous with respect to M. However, computing ambiguity for an example requires that it be a part of the original training set, whereas we wish to estimate the ambiguity of an unlabeled example without additional training. Thus we introduce a new metric called estimated max variability, which measures the worst-case spread of predictions on an example x_i across checkpoints of a trained model. Let E be the total number of training epochs, Y the label set, and p_{θ^(e)} the probability assigned by the model with parameters θ^(e) at the end of the e-th epoch. We define the estimated max variability as:

σ_i = max_{y ∈ Y} σ({p_{θ^(e)}(y | x_i)}_{e ∈ E}),   (1)

where σ is the standard deviation function.

Concretely, we retroactively compute the prediction from each saved epoch checkpoint of M on x_i. The only assumption made is that the single example, if it had been a part of the training set, would have made a negligible difference to each model checkpoint (at least as observed through its posterior probabilities). (Indeed, we find a high correlation between variability and estimated max variability; see Appendix A.) In taking a maximum across labels, we consider x_i to be ambiguous as long as M is undecided on any label in Y.
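A minimal sketch of Equation 1, given the per-checkpoint posteriors for one unlabeled generated example (the array below is a toy input):

```python
import numpy as np

def estimated_max_variability(label_probs):
    """label_probs: array of shape (num_epochs, num_labels); row e is the posterior
    over the label set from the checkpoint saved at the end of epoch e."""
    return float(np.max(np.std(label_probs, axis=0)))  # worst-case spread over labels

# Toy usage: a three-way NLI posterior across five epoch checkpoints.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.7, 0.2, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])
print(estimated_max_variability(probs))
```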

We first employ simple heuristics to discard examples exhibiting observable failure cases of GPT-3. Specifically, we discard examples where 1) the premise and hypothesis are identical, modulo punctuation or casing, 2) the generated example is an exact copy of an in-context example, 3) the example contains some phrases from the instruction (e.g., “pair of sentences”), or 4) the premise or hypothesis is shorter than 5 characters. Then, we compute the estimated max variability for the remaining examples with respect to M, and retain an equal number of examples from each (intended) label class with the highest max variability, to create a dataset D_filtered that is half the size of D_gen.
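The four heuristics amount to simple string checks; the sketch below is one possible implementation (the phrase list and the 5-character threshold come from the paragraph above).

```python
import string

def passes_heuristics(premise, hypothesis, in_context_pairs, min_chars=5):
    """Return False for observable GPT-3 failure modes, before ambiguity scoring."""
    strip = lambda s: s.lower().translate(str.maketrans("", "", string.punctuation)).strip()
    if strip(premise) == strip(hypothesis):                 # 1) premise == hypothesis
        return False
    if (premise, hypothesis) in in_context_pairs:           # 2) verbatim copy of an in-context example
        return False
    leaked = ["pair of sentences"]                          # 3) phrases leaked from the instruction
    if any(p in premise or p in hypothesis for p in leaked):
        return False
    if len(premise) < min_chars or len(hypothesis) < min_chars:  # 4) degenerate output
        return False
    return True
```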

2.4 Stage 4: Human Review

As the final stage of our pipeline, we recruit human annotators on Amazon Mechanical Turk to review each unlabeled example x_i in D_filtered. (Details about crowdworkers and guidelines are in Appendix D.) The annotator may optionally revise x_i to create a higher-quality example x_i', or let x_i' = x_i. Either way, they assign a label y_i. When revising examples, we asked annotators to preserve the intended meaning as much as possible through minimal revisions. (In pilots, we found that when annotators exercised too much freedom in revision, they often re-introduced the same artifacts that have been well-documented in NLI.) However, if an example would require a great deal of revision to fix, or if it could be perceived as offensive, they should discard it. This results in the labeled dataset D_collab = {(x_i', y_i)}.

Crowdworkers annotate a total of 118,724 examples, with two distinct workers reviewing each example. For examples that both annotators labeled without revision, we achieved a Cohen’s κ of 0.60, indicating substantial agreement. To create the final dataset, we discard an example if either annotator chose to discard it, and we keep a revision only if both annotators revise an example (and choose a revision uniformly at random). When both annotators label the example as-is but choose different labels, we sample one of the two labels uniformly at random. The rationale for this is discussed in Appendix D.4. This leads to a labeled dataset of 107,885 examples (90.87% of all annotated examples, with the remaining discarded). Of the labeled examples, 3.54% were revised.
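The aggregation logic described above can be sketched as follows; the dictionary schema is illustrative, and the label choice in the single-revision case (not specified above) is an assumption.

```python
import random

def aggregate(ann1, ann2):
    """Merge two annotations of one example; returns None if the example is dropped.

    Each annotation is a dict with 'discard' (bool), 'revised' (bool),
    'text' (the premise-hypothesis pair, possibly revised), and 'label'.
    """
    if ann1["discard"] or ann2["discard"]:
        return None                                         # drop if either annotator discarded
    if ann1["revised"] and ann2["revised"]:
        chosen = random.choice([ann1, ann2])                 # keep a revision only if both revised
        return {"text": chosen["text"], "label": chosen["label"]}
    if not ann1["revised"] and not ann2["revised"]:
        label = random.choice([ann1["label"], ann2["label"]])  # label disagreements resolved at random
        return {"text": ann1["text"], "label": label}
    kept = ann1 if not ann1["revised"] else ann2             # exactly one revised: keep the original text
    return {"text": kept["text"], "label": kept["label"]}    # (label choice here is an assumption)
```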

We randomly split the data into train and test sets. Key dataset statistics are summarized in Table 2. Unlike MultiNLI, WaNLI is not label-balanced; see §5.3 for a discussion.

In general, we believe the role of revision depends on the quality of machine-generated examples. Indeed, we need to strike a balance between leveraging human capabilities and avoiding the re-emergence of annotation artifacts that may come with too much freedom in revision.

Split | Size | Label distribution (E/N/C)
Train | 102,885 | 38,511 / 48,977 / 15,397
Test | 5,000 | 1,858 / 2,397 / 745
Table 2: WaNLI dataset statistics.
Training Set | Size | Diagnostics | HANS* | QNLI* | WNLI* | NQ-NLI* | ANLI | FEVER-NLI | BIG-Bench* | WaNLI
Data size | | 1104 | 30K | 5266 | 706 | 4855 | 3200 | 20K | 3324 | 5000
MNLI | 393K | 68.47 | 78.08 | 52.69 | 56.09 | 62.34 | 32.37 | 68.29 | 64.68 | 64.62
MNLI + Tailor | 485K | 67.75 | 79.03 | 54.89 | 56.23 | 63.83 | 32.87 | 68.75 | 72.38 | 64.27
MNLI + Z-Aug | 754K | 66.39 | 80.52 | 57.72 | 55.52 | 62.30 | 33.37 | 68.73 | 66.12 | 64.78
MNLI ◇ ANLI | 393K | 67.75 | 79.90 | 68.74 | 60.48 | 62.49 | 54.59 | 72.30 | 72.32 | 65.96
MNLI + ANLI | 556K | 66.84 | 77.94 | 62.41 | 57.08 | 62.84 | 53.84 | 72.30 | 71.11 | 65.93
MNLI ◇ FEVER-NLI | 393K | 66.75 | 76.50 | 56.70 | 57.08 | 61.81 | 35.65 | 76.83 | 58.39 | 63.31
MNLI + FEVER-NLI | 601K | 67.57 | 76.05 | 52.90 | 54.95 | 63.02 | 35.37 | 76.93 | 64.65 | 64.53
MNLI + SNLI + ANLI | 943K | 68.75 | 78.65 | 63.38 | 58.49 | 62.94 | 54.21 | 72.02 | 71.05 | 65.10
MNLI ◇ WaNLI | 393K | 71.01 | 83.10 | 77.00 | 61.89 | 62.94 | 36.46 | 71.14 | 76.17 | 75.49
MNLI + WaNLI | 496K | 71.64 | 82.00 | 68.40 | 60.05 | 63.21 | 36.78 | 70.79 | 70.81 | 75.26
WaNLI | 103K | 72.73 | 89.28 | 81.40 | 67.28 | 64.18 | 41.12 | 70.13 | 85.19 | 75.40
Table 3: Empirical comparison of different training sets for RoBERTa-large, for generalization to out-of-distribution (OOD) challenge sets. Gray cells mark settings that do not represent an OOD challenge. Top: Training on MultiNLI alone. Middle: Comparison of combination schemes with MultiNLI. We consider two data combination strategies, augmentation (+) and random replacement (◇), where the resulting dataset size is unchanged. Bottom: Training sets that include WaNLI. The highest accuracy on each test set (excluding gray cells) is bolded. Test sets with * contain two label classes: entailment and non-entailment.

3 Training NLI Models with WaNLI

We finetune different copies of RoBERTa-large Liu et al. (2019) on different training sets, and evaluate each resulting model’s performance on a large suite of NLI challenge sets. Given that the challenge sets were constructed independently of MultiNLI or WaNLI, we consider them out-of-distribution (OOD) for both training datasets.

3.1 NLI Test Suite

The NLI challenge sets come from a wide array of domains, methodologies (e.g., crowdsourcing, expert curation, generation), and initial task formats (e.g., question-answering, fact verification). (We evaluate on the development set for every dataset, except for Winograd NLI, where we combine the train and development sets for greater statistical power, and Adversarial NLI, where we use the test set, as the labels were not hidden.)

NLI Diagnostics Wang et al. (2018) is a manually-curated test set that evaluates a variety of linguistic phenomena using naturally-occurring sentences from several domains.

HANS McCoy et al. (2019) targets unreliable syntactic heuristics based on lexical overlap between the premise and hypothesis.

QNLI was adapted from the Stanford Question-Answering Dataset Rajpurkar et al. (2016) by the GLUE benchmark Wang et al. (2018). Each example consists of a premise that is a sentence, and a hypothesis that is a question, which is entailed if the question is answered by the premise.

Winograd NLI was adapted by the GLUE benchmark from the Winograd Schema Challenge Levesque et al. (2011), which tests correct coreference via common sense. To convert this dataset to NLI, an entailed hypothesis is formed by substituting a correct referent and a non-entailed hypothesis is formed by substituting an incorrect referent.

Adversarial NLI (ANLI; Nie et al., 2020) is an adversarially-constructed dataset where crowdworkers are instructed to write examples that stump existing models. Examples are collected in three rounds that progressively increase in difficulty, with model adversaries trained on MultiNLI, SNLI Bowman et al. (2015), FEVER-NLI (discussed below), as well as ANLI sets from earlier rounds.

Natural Questions NLI (NQ-NLI, Chen et al., 2021) is created from the Natural Questions QA dataset Kwiatkowski et al. (2019). The premise is a decontextualized sentence from the original context; the hypothesis consists of a question and answer candidate converted into declarative form.

FEVER NLI is adapted from the FEVER fact verification dataset Thorne et al. (2018), and introduced along with ANLI. In each example, the premise is a short context from Wikipedia, and the hypothesis is a claim that is either supported (entailed), refuted (contradicted), or neither (neutral).

BIG-Bench NLI is a combination of four datasets from BIG-Bench Srivastava et al. (2022) about entailment: Analytic Entailment, Epistemic Reasoning, Disambiguation QA, Presuppositions NLI.

3.2 Training Datasets

In addition to stand-alone WaNLI and MultiNLI, we also consider combining MultiNLI with other NLI datasets. We use the train sets of SNLI Bowman et al. (2015), ANLI, and FEVER-NLI, as well as the augmentation set generated via Tailor Ross et al. (2022), which perturbed SNLI hypotheses to create examples with high lexical overlap between the premise and hypothesis, and the augmentation set Z-Aug Wu et al. (2022), which was created by generating in-distribution examples and filtering them based on spurious correlations.

We consider two schemes for combining datasets A and B: 1) augmentation (A + B), in which the two datasets are concatenated, and 2) random replacement (A ◇ B), where |B| examples from A are randomly swapped out and replaced with all examples from B.
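A minimal sketch of the two combination schemes, where each dataset is a list of examples:

```python
import random

def augment(a, b):
    """A + B: concatenate the two datasets."""
    return a + b

def random_replace(a, b, seed=0):
    """A ◇ B: randomly swap out |B| examples of A and replace them with all of B,
    so the combined dataset keeps the size of A."""
    rng = random.Random(seed)
    kept = rng.sample(a, len(a) - len(b))
    return kept + b
```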

3.3 Results

Results are shown in Table 3. When comparing MultiNLI (MNLI) and WaNLI alone, training a model on WaNLI instead of MultiNLI leads to better performance on every test set we consider, including by 4% on Diagnostics, 11% on HANS, and 9% on Adversarial NLI. This is remarkable given WaNLI is 4× smaller than MultiNLI, and contains primarily machine-written examples.

A WaNLI-trained model continues to outperform baselines that combine MultiNLI with other NLI datasets and augmentation sets, in every OOD setting. This includes when comparing to a model trained on 9× more data from three existing NLI datasets, MNLI + SNLI + ANLI. The consistent advantage of WaNLI over datasets that include ANLI (e.g., MNLI + ANLI) is noteworthy, as ANLI’s adversarial creation pipeline posed a much greater challenge for human workers, and used more existing resources to train model adversaries.

Quite surprisingly, training on WaNLI alone also outperforms combining WaNLI with MultiNLI. This reinforces that more data might not necessarily be better, especially when the data predominantly consists of easy-to-learn examples.

Training Set | Diagnostics | HANS* | ANLI | BIG-Bench* | WaNLI
ANLI | 65.67 | 80.58 | 55.21 | 77.10 | 63.85
ANLI + WaNLI | 72.82 | 88.58 | 56.59 | 84.89 | 75.84
Table 4: Comparison of whether including WaNLI in the training data of ANLI improves in-domain test performance, when finetuning RoBERTa-large.

In addition to the OOD setting, we consider whether augmentation with WaNLI can improve in-domain test performance for another dataset (Table 4). Indeed, augmenting ANLI’s train set with WaNLI improves test accuracy on ANLI by 1.4%, while greatly aiding OOD test performance.

4 Artifacts in WaNLI

We next investigate whether WaNLI contains similar artifacts to MultiNLI. (We note, however, that recent work has challenged whether artifacts based on partial input and lexical correlations in the dataset pose genuine robustness threats Srikanth and Rudinger (2022); Eisenstein (2022).) We find that while WaNLI contains fewer previously known spurious correlations, it has a distinct set of lexical correlations that may reflect artifacts in GPT-3 output.

4.1 Partial Input Models

Given that the task requires reasoning with both the premise and the hypothesis, a model that sees only one of the two inputs should have no information about the correct label. We reproduce the methodology from Gururangan et al. (2018) and train fastText classifiers to predict the label using partial input. After first balancing WaNLI, a model trained on just the hypotheses of WaNLI achieves 41.6% accuracy on the test set, compared to 49.6% for MultiNLI when restricted to the same size. A premise-only model trained on WaNLI achieves an accuracy of 42.9%. (Unlike WaNLI, each MultiNLI premise is associated with hypotheses from all three labels; a premise-only baseline on MultiNLI is thus guaranteed to have no information about the label.)
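A minimal sketch of the partial-input baseline with fastText, following Gururangan et al. (2018); the file names and example schema (dicts with 'premise', 'hypothesis', 'label') are illustrative.

```python
import fasttext

def partial_input_accuracy(train_examples, test_examples, field="hypothesis"):
    """Train and evaluate a classifier that sees only one side of each pair."""
    def dump(examples, path):
        with open(path, "w") as f:
            for ex in examples:
                f.write(f"__label__{ex['label']} {ex[field]}\n")  # fastText supervised format

    dump(train_examples, "partial.train")
    dump(test_examples, "partial.test")
    model = fasttext.train_supervised(input="partial.train")
    _, precision_at_1, _ = model.test("partial.test")
    return precision_at_1  # equals accuracy when each example has exactly one label
```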

4.2 Lexical Correlations

Gardner et al. (2021) posit that all correlations between single words and output labels are spurious. We plot the statistical correlation for every word and label in Figure 3, after balancing WaNLI and downsampling MultiNLI. We observe that WaNLI also contains words with detectable correlations, suggesting that GPT-3 may have some artifacts of its own due to the slightly different templates and different sets of in-context examples for each label. Interestingly, the correlations tend to be a different set of words than for MultiNLI (other than “not” and “no”), with less interpretable reasons for correlating with a certain label (e.g., “second”, “was”).
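The quantities plotted in Figure 3 can be computed with simple counting, as in the sketch below; the detection threshold itself (the blue line) follows the statistical test of Gardner et al. (2021) and is omitted here.

```python
from collections import Counter, defaultdict

def word_label_statistics(examples, labels=("entailment", "neutral", "contradiction")):
    """For each word: its occurrence count n(x_i) and p(y | x_i present), per label y."""
    word_counts = Counter()
    word_label_counts = defaultdict(Counter)
    for ex in examples:  # dicts with 'premise', 'hypothesis', 'label' (illustrative schema)
        tokens = set((ex["premise"] + " " + ex["hypothesis"]).lower().split())
        for tok in tokens:
            word_counts[tok] += 1
            word_label_counts[tok][ex["label"]] += 1
    return {tok: (n, {y: word_label_counts[tok][y] / n for y in labels})
            for tok, n in word_counts.items()}
```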

Figure 3: Competency problem-style statistical correlation plot between individual words and particular class labels, where the y-axis is the probability of label y given the presence of the word x_i, and the x-axis is the number of times word x_i appears in the data. All points representing (word, label) pairs above the blue line have detectable correlations Gardner et al. (2021).

4.3 Premise-Hypothesis Semantic Similarity

We explore the semantic similarity between the premise and hypothesis within each label class using Sentence-BERT Reimers and Gurevych (2019); these distributions are shown in Figure 4. In both MultiNLI and WaNLI, entailed hypotheses are naturally most semantically similar to the premise. In MultiNLI, this is followed by neutral examples and then contradiction examples. In contrast, in WaNLI there is much greater overlap in the three distributions, and those for neutral and contradiction examples are nearly indistinguishable. This suggests that in WaNLI, the semantic similarity between the premise and hypothesis provides less signal about the label.
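A minimal sketch of this measurement with the sentence-transformers library; the specific SBERT checkpoint below is an assumption, as the paper only specifies SBERT embeddings.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative SBERT checkpoint

def similarity_by_label(examples):
    """Cosine similarity between each premise and its hypothesis, grouped by gold label."""
    premises = model.encode([ex["premise"] for ex in examples], convert_to_tensor=True)
    hypotheses = model.encode([ex["hypothesis"] for ex in examples], convert_to_tensor=True)
    sims = util.cos_sim(premises, hypotheses).diagonal()
    grouped = {}
    for ex, sim in zip(examples, sims):
        grouped.setdefault(ex["label"], []).append(float(sim))
    return grouped
```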

Figure 4: Semantic similarity between the premise and hypothesis, computed based on SBERT embeddings Reimers and Gurevych (2019). The distributions for each label class are much more well-separated in MultiNLI than in WaNLI.

5 What does WaNLI show about the human-machine collaboration pipeline?

We discuss observations from collecting WaNLI that may shed insight for future work in the direction of collaborative dataset creation.

5.1 What kinds of revisions do annotators tend to make?

We find that revisions fall broadly into two categories: improving the fluency of the text, and improving the clarity of the relationship. The majority of revisions change the length only slightly, with 74% of both premise revisions and hypothesis revisions changing the word count by between -1 and +2 words. Fluency revisions often target well-documented issues with text generation, such as redundancy and self-contradiction. Clarity revisions often resolve ambiguities in the example that make the entailment relationship difficult (or impossible) to determine, such as ambiguous coreference or temporal references. We provide examples of revisions in Appendix D.3.

Example | Labels | Ambiguity
P: According to the most recent statistics, the rate of violent crime in the United States has dropped by almost half since 1991. H: The rate of violent crime has not dropped by half since 1991. | Entailment / Contradiction | Does “almost half” mean “not half” or “basically half”?
P: As a result of the disaster, the city was rebuilt and it is now one of the most beautiful cities in the world. H: A disaster made the city better. | Entailment / Neutral | Do indirect consequences count?
P: It is a shame that the world has to suffer the pain of such unnecessary war. H: The world does not have to suffer such pain. | Entailment / Contradiction | Is the scope of “has to” in the hypothesis given the war or not?
P: The original draft of the treaty included a clause that would have prohibited all weapons of mass destruction. H: The clause was removed in the final version of the treaty. | Entailment / Neutral | Does the premise imply that the clause is no longer in the treaty?
P: If you can’t handle the heat, get out of the kitchen. H: If you can’t handle the pressure, get out of the situation. | Entailment / Neutral | Is the premise to be interpreted literally or figuratively?
P: In a world of increasing uncertainty, the only certainty is that nothing is certain. H: There is no certainty in the world. | Entailment / Contradiction | Self-contradictory but coherent premise
Table 5: Examples where two annotators assigned different labels. We find that many examples represent genuinely ambiguous cases rather than careless mislabels, echoing previous findings Pavlick and Kwiatkowski (2019).

5.2 What kinds of examples do annotators disagree on?

We find that examples on which annotators disagree provide an extremely interesting test bed for how ambiguities surface in classification tasks. Upon inspecting the examples (some are shown in Table 5), we observe that they represent genuinely ambiguous cases rather than careless mislabels, echoing previous findings Pavlick and Kwiatkowski (2019). See further discussion in Appendix D.4.

5.3 How reliably does GPT-3 reproduce the in-context pattern?

One characteristic of WaNLI is its imbalanced label distribution: even though the set of seed examples for generation was constructed to be balanced, after undergoing human labeling, only 15% of examples are given the contradiction label. We observe that contradiction patterns in in-context examples are generally much more challenging for GPT-3 to copy, likely because it was trained on (mostly) coherent sequences of sentences. More broadly, we find that more abstract reasoning patterns are harder for GPT-3 to mimic than patterns that involve simpler transformations.

Nonetheless, even when GPT-3 does not successfully copy the in-context pattern, the diverse set of in-context examples leads to a variety of creative output that may be challenging for human crowdworkers to achieve.

6 Related Work

Crowdsourcing

The scalability and flexibility of crowdsourcing has enabled the creation of foundational NLP benchmarks across a wide range of subproblems, and made it the dominant paradigm for data collection (Mihaylov et al., 2018; Rajpurkar et al., 2016; Huang et al., 2019; Talmor et al., 2019, i.a.). Nonetheless, a growing body of research shows that resulting datasets may not isolate the key linguistic phenomena Jia and Liang (2017); Chen et al. (2016); Sugawara et al. (2020).

For crowdsourcing NLI datasets, where the annotator is given a premise and asked to write a hypothesis of each label Bowman et al. (2015); Williams et al. (2018), the presence of annotation artifacts is especially well-studied Gururangan et al. (2018); McCoy et al. (2019); Glockner et al. (2018). Recent work attempted to remedy this through different data collection protocols but found negative results Vania et al. (2020); Bowman et al. (2020), showing this is a hard problem requiring greater innovation.

Adversarial data collection

In this paradigm, annotators are asked to produce examples on which current systems fail (Kiela et al., 2021; Talmor et al., 2021; Zellers et al., 2019, i.a.). Beyond increasing annotator effort Bartolo et al. (2020), adversarial methods have been challenged for not leading to better generalization on non-adversarial test sets Kaushik et al. (2021) and decreasing data diversity Bowman and Dahl (2021). Moreover, the resulting data has been shown to depend strongly on the adversaries, inhibiting a fair evaluation Phang et al. (2021). Finally, these approaches may produce examples beyond the scope of the task. For example, in Adversarial NLI Nie et al. (2020), an estimated 58% of examples required “reasoning from outside knowledge or additional facts,” which is arguably separate from the underlying problem of understanding semantic entailments. We argue that we can better leverage the strengths of machines and humans by having them collaborate rather than act as adversaries.

Dataset generation

Another recent approach leverages language models toward fully automatic dataset creation (Schick and Schütze, 2021; Wu et al., 2022; West et al., 2021; Bartolo et al., 2021a, i.a.). Removing human input may fundamentally limit the complexity of examples to phenomena already accessible by the model, when our goal is precisely to teach models more diverse phenomena. The most similarly-motivated work to ours, Lee et al. (2021), trains a data generator on “data-rich slices” of an existing dataset, and applies it to under-represented slices. However, they use labels or metadata to represent slices, leaving automatic methods of identifying slices to future work.

Human-machine collaboration

In terms of human-machine collaboration, Tekiroğlu et al. (2020) and Yuan et al. (2021) employ a language model to generate counter-narratives to hate speech and biographies, respectively, which are validated and revised by humans. This was for a generative task, and we complement their findings by showing that human-machine collaboration can also be useful for generating labeled datasets for robust classification models. Contemporary work Bartolo et al. (2021b) finetunes a generative annotation assistant to produce question-answer pairs that humans can revise for extractive QA.

7 Conclusion

At the heart of dataset creation is distilling human linguistic competence into data that models can learn from. The traditional crowdsourcing paradigm takes the view that the best approach for this is to solicit people to write free-form examples expressing their capabilities. In this work, we present a worker-and-AI collaborative approach and apply it to create WaNLI, whose empirical utility suggests that a better way of eliciting human intelligence at scale is to ask workers to revise and evaluate content. To this end, we hope to encourage more work in developing generative algorithms to aid the dataset creation process, and therefore re-imagining the role of human annotation.

Acknowledgments

We thank members of UW NLP, AI2, and Mila NLP for valuable feedback and discussion, and especially Jena Hwang for help in designing the AMT template, Julian Michael for countless discussions of NLI examples, and Alexander Fang for feedback during writing. We thank OpenAI for offering access to the GPT-3 API and the anonymous reviewers for valuable feedback.

This work was funded in part by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031). The first author is supported by the National Science Foundation Graduate Research Fellowship Program.

8 Ethics Statement

We acknowledge that text generated from large pretrained language models is susceptible to perpetuating social harms and containing toxic language Sheng et al. (2019); Gehman et al. (2020). To partially remedy this, we ask annotators to discard any examples that may be perceived as offensive. Nonetheless, it is possible that harmful examples (especially if they contain subtle biases) may have been missed by annotators and included in the final dataset. Specifically due to the above harms, we additionally caution readers and practitioners against fully automating any data creation pipeline.

In addition, we are cognizant of the asymmetrical relationship between requesters and workers in crowdsourcing. We took great care to pay fair wages, and were responsive to feedback and questions throughout the data collection process (see Appendix D for details). The only personal information we collect is the worker IDs from Amazon Mechanical Turk, which we will not release. The annotation effort received an IRB exemption.

9 Limitations

In this paper, we apply our collaborative dataset creation pipeline to a single language and task, English natural language inference, and leave application of the pipeline more broadly to future work.

It is possible (if not likely) that datasets partially authored by language models will have artifacts of their own, especially those reflecting social biases that may not be captured by our accuracy-based evaluation setup. For investigation of a specific generation artifact observed by Yuan et al. (2021) in their own collaborative dataset, namely the over-representation of Western entities, please see Appendix C.4.

We are not able to perform ablations on different parts of the pipeline to understand the effectiveness of each component, e.g., by comparing different means of collecting exemplar groups or different templates for prompting GPT-3. Unfortunately, such variations would be prohibitively expensive as they each require collecting a dataset of sufficient scale (along with the necessary human annotation).

Finally, although we uncover examples where annotators disagree for valid reasons (see Table 5), we only use one label per example for training and evaluation. This is because, to show the effectiveness of WaNLI, we need to compare WaNLI to existing (singly-labeled) training datasets via performance on established (singly-labeled) benchmarks. We encourage future work to understand the limitations of forcing inherently ambiguous instances into the n-way classification scheme, or otherwise discarding these potentially valuable examples of linguistic reasoning as noise.

References

Appendix A Estimated Max Variability

In order to test the correlation between variability and estimated max variability on a dataset D, we would have to repeatedly hold out a single example x, train a model on D \ {x}, and evaluate how well the estimated max variability from the model trained on D \ {x} correlates with the true variability from the model trained on D, which saw x during training.

Unfortunately, this would be a very expensive experiment. Instead, we split the MNLI train set into 99% for training and 1% (3,928 examples) for evaluation. For each of the held-out examples, we calculate the variability under M_MNLI and the estimated max variability under M_MNLI-99%. The correlation is shown in Figure 5, with a Pearson’s correlation coefficient of 0.527 and a p-value of 7×10^-281.

Figure 5: Correlation between variability of examples on a model that trains on the full MNLI dataset, and estimated max variability of the same examples when they are held out of the training set.

Appendix B Modeling Details

All model training is implemented with the HuggingFace Wolf et al. (2020) library and uses the original hyperparameters from the RoBERTa paper for finetuning on GLUE Liu et al. (2019). We train the model for five epochs and evaluate the final model. We choose not to use an early stopping scheme in order to isolate the training data as the object of study and control for training length as a confounding factor. This is important since Tu et al. (2020) showed that counter-examples can be learned better with longer training.

All training was performed on a single Nvidia Quadro RTX 6000 GPU. The duration of training varied depending on the size of the training data, from 3 hours for WaNLI to 14 hours for MultiNLI + WaNLI.

Hyperparameter Assignment
Model RoBERTa-large
Number of parameters 345M
Number of epochs 5
Learning rate 1e-5
Batch size 32
Weight decay 0.1
Learning rate decay linear
Warmup ratio 0.06
Table 6: Training hyperparameters for RoBERTa-large.
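For concreteness, the hyperparameters in Table 6 map onto a standard HuggingFace Trainer setup roughly as sketched below; the output path is arbitrary, and the (tokenized) training dataset is passed in by the caller.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_nli(train_dataset, output_dir="nli-roberta-large"):
    """train_dataset: a tokenized premise-hypothesis dataset with integer labels."""
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=5,
        learning_rate=1e-5,
        per_device_train_batch_size=32,
        weight_decay=0.1,
        lr_scheduler_type="linear",
        warmup_ratio=0.06,
        save_strategy="epoch",   # per-epoch checkpoints also enable the data-map statistics
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
    trainer.train()
    return trainer
```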

Appendix C WaNLI Details and Discussion

C.1 Example GPT-3 Context

We include some examples of full GPT-3 contexts in Tables 12, 13, 14, and 15.

C.2 GPT-3 Generation Hyperparameters

We queried the GPT-3 Curie model available through the OpenAI API (https://openai.com/api) from November 3 to November 5, 2021. In total, the generation cost $677.89. Hyperparameters for generation (described at https://beta.openai.com/docs/api-reference/completions/create) are shown in Table 7.

Hyperparameter Assignment
Top p 0.5
Temperature 1
Max tokens 120
Stop string \n\n
Presence penalty 0.0
Frequency penalty 0.0
Table 7: Hyperparameters for generation from GPT-3.

C.3 Dataset sizes at each stage

In Stage 1, we collect the top 25% most ambiguous examples from each label class in MultiNLI as our set of seed examples. This leads to 98,176 seed examples, where each seed example corresponds to a unique context for GPT-3. We generate n = 5 examples per seed example, and skip examples that are not properly formatted with a distinct premise and hypothesis following the context template (Figure 2). At the end of Stage 2, the size of D_gen is 372,404. After applying the filtering heuristics described in §2.3 on D_gen, the remaining dataset size is 287,241. Of the examples discarded, 79,278 generated examples had identical premise and hypothesis (sans punctuation and casing), and 4,732 examples had copied an in-context example. Next, we keep the half with the highest estimated max variability by sourcing an equal number of examples from each (intended) label class for a balanced dataset, resulting in D_filtered with size 143,619. However, we do not actually recruit human review on all of D_filtered, and instead annotate a total of 118,724 examples. Since some of these examples are discarded, the final WaNLI dataset contains 107,885 examples. These correspond to 57,825 seed examples from MultiNLI.

C.4 Investigation of Western entities in WaNLI versus MNLI

While we investigated known artifacts of crowdsourced datasets in §4, generated datasets may have distinct kinds of artifacts. Indeed, recent related work qualitatively observed an over-representation of Western entities in generated biographies Yuan et al. (2021). To investigate whether this is also characteristic of WaNLI, we use flair Akbik et al. (2019) to perform named entity recognition on MultiNLI and WaNLI. Due to the challenges and ethical risks of automatically determining the origin of names and organizations, we focus on the diversity of locations mentioned. We use geopy (https://geopy.readthedocs.io) to map all locations (e.g., cities, provinces, landmarks, as well as countries) to a country.
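A rough sketch of this extraction step with flair and geopy is below; the library calls assume recent versions of both packages, and the exact aggregation used in the paper may differ.

```python
from flair.data import Sentence
from flair.models import SequenceTagger
from geopy.geocoders import Nominatim

tagger = SequenceTagger.load("ner")                   # flair NER tagger (Akbik et al., 2019)
geolocator = Nominatim(user_agent="wanli-analysis")   # user_agent string is arbitrary

def location_countries(text):
    """Extract location mentions from one example and map each to a country name."""
    sentence = Sentence(text)
    tagger.predict(sentence)
    countries = []
    for span in sentence.get_spans("ner"):
        if span.get_label("ner").value != "LOC":
            continue
        hit = geolocator.geocode(span.text, addressdetails=True, language="en")
        if hit is not None:
            countries.append(hit.raw["address"].get("country", "unknown"))
    return countries
```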

We find that 79% of location mentions in WaNLI are in Europe or North America, compared to 71% in MultiNLI. In particular, the United States is massively over-represented, accounting for 46% of mentions in WaNLI and 26% in MultiNLI. However, both datasets feature a diversity of location names: WaNLI mentions locations in 210 countries across 22K location entities, and MultiNLI mentions locations in 227 countries across 163K location entities. We conclude that over-representation of Western entities is indeed a concern for generated datasets, and encourage future work to consider this.

Appendix D Human Review

Screenshots of the instructions, guidelines, and annotation interface are shown in Figures 6, 7, and 8. The guidelines take inspiration from the design of the NLI Diagnostics dataset Wang et al. (2018). To collect a pool of qualified workers, we designed a qualification task with examples testing each of these categories. NLI is a challenging task, and many generated examples are especially challenging by design. Therefore, instructing annotators in how to think about the task and resolve common issues is key to collecting high-quality, label-consistent data.

D.1 The Annotators

Annotators were required to have a HIT approval rate of 98%, a total of 10,000 approved HITs, and be located in the United States.

300 Turkers took our qualification test, of which 69 passed. Turkers who were later found to produce extremely careless annotations were removed from the qualification list (and oftentimes, their annotations were discarded, though they were still paid for their work). The number of workers who contributed to the final dataset is 62.

Throughout the data collection process, the authors would review annotations and write individualized emails to Turkers with feedback, as well as group emails to clarify common challenging cases of NLI (such as examples involving questions). This follows the recommended crowdsourcing protocol from Nangia et al. (2021).

D.2 Compensation

In designing the task, we aimed for a pay rate of at least $15 per hour. Workers were paid $0.12 for each example that they annotated. At the end of data collection, we aggregate the earnings and time spent from each crowdworker, and find that the median hourly rate was $22.72, with 85% of workers being paid over the $15/hour target.

D.3 Revision Analysis

We provide examples of revisions in Table 9. We find that revisions are generally targeted yet effective. The majority of revisions change the length only slightly, with 74% of both premise revisions and hypothesis revisions changing the word count by between -1 and +2 words. A very large proportion, 11.6% of premise revisions and 20.6% of hypothesis revisions, changed the set of pronouns present in the text, often to clarify coreference.

We instructed annotators to revise examples only when it would make the example more “interesting” in some sense, or more clear without removing what’s interesting. Nonetheless, we still observed a large number of revisions that greatly simplified the example, oftentimes re-introducing the same artifacts that have been documented in prior work. Therefore, we ultimately chose to include revisions only when both annotators revised the example, indicating that the revision was necessary to improve the quality of the example.

D.4 Disagreement Analysis

In order to investigate the utility of collecting a third annotation, we randomly sampled 80 examples where the two annotators disagreed on the label (and neither revised nor discarded), and two of the authors separately annotated each one. Shockingly, the two authors agreed on the label only 49% of the time. Furthermore, in 12% of cases, all three labels were present among the four annotations. This suggests that disagreement is often due to true ambiguity rather than careless mislabeling, and a third annotation would be unlikely to have high payoff in terms of “correcting” the label. As a result, we choose not to collect a third annotation in this work. Instead, we believe that the doubly-annotated examples in WaNLI have flagged many interesting cases of ambiguity in NLI, and we encourage future work to design richer annotation frameworks to uncover the source(s) of ambiguity.

We choose to keep examples with disagreement in the WaNLI dataset because we believe that finetuning with one of multiple reasonable labels still provides valuable training signal.

Train Set | MNLI Dev. Matched | MNLI Dev. Mismatched
MNLI | 90.30 | 90.10
MNLI ◇ WaNLI | 89.63 | 88.95
MNLI + WaNLI | 89.90 | 89.32
WaNLI | 80.17 | 80.46
Table 8: Results on MultiNLI’s development set.
Example | Label | Purpose of Revision
P: The power plant It is the only source of continuous electric power for the city. H: The power plant is very important for the city. | Entailment | Coreference resolution
P: It was a well-known fact that it was a well-known fact that the solution was well-known. H: The solution was well-known. | Entailment | Redundancy
P: This will be the first time the king has met the queen in person. H: The king has met the queen in person before. | Contradiction | Clarity
P: She walked with a light step, as if she were floating on air. H: She was floating on air, as if she were walking on air. | Contradiction | Coherence
P: There is a slight possibility that, if the same temperature data are used, the temperature of the Earth’s surface in 1998 will be lower than the temperature of the Earth’s surface in 1998 now. H: The Earth’s surface in 1998 was lower than the Earth’s surface in 1998 now. | Neutral | Self-contradiction
P: She had to go to the library to find out what the name of the street was. H: She already knew the name of the street. | Contradiction | Ambiguous temporal reference
P: A number of theories have been proposed to explain the decline of violence in modern society. H: Violence will declinehas declined in modern society. | Entailment | Consistent tense
Table 9: Some examples of revisions that were done by annotators on examples generated by GPT-3.

Appendix E Additional Experiments

E.1 Additional baselines

We additionally perform comparisons with several subsets of MultiNLI which are the same size as WaNLI: MultiNLI filtered with the AFLite algorithm (MultiNLI with AFLite; Le Bras et al., 2020), the most ambiguous examples of MultiNLI (MultiNLI ambiguous; Swayamdipta et al., 2020), and a random subset of MultiNLI (MultiNLI downsampled). Results in Table 10 show that a WaNLI-trained model outperforms these baselines on every test set.

E.2 Evaluation on MultiNLI

We report the results on MultiNLI’s development set in Table 8. We find that mixing WaNLI into the MultiNLI training data (either through swapping or augmentation) maintains in-domain accuracy to within ~1%. Training on WaNLI alone drops performance on MultiNLI’s development set by ~10%; however, the higher performance on other out-of-domain test sets suggests that evaluation through MultiNLI may not be a definitive signal of model ability.

E.3 Finetuning T5

We demonstrate that the robustness improvements from training on WaNLI generalize to another model architecture, T5-base Raffel et al. (2020), which was never used in the data curation pipeline. Shown in Table 11, training T5-base on WaNLI also outperforms training on MultiNLI on every test set, including by 4% on NLI Diagnostics, 10% on HANS, and 8% on Adversarial NLI (similar margins compared to finetuning RoBERTa-large).

Appendix F Data Map of WaNLI

In Figure 9, we show a data map of MultiNLI relative to RoBERTa-large trained on MNLI, and of WaNLI relative to RoBERTa-large trained on WaNLI.

Training Set | Size | Diagnostics | HANS* | QNLI* | WNLI* | NQ-NLI* | ANLI | FEVER-NLI | BIG-Bench* | WaNLI
Data size | | 1104 | 30K | 5266 | 706 | 4855 | 3200 | 20K | 3324 | 5000
MNLI | 393K | 68.47 | 78.08 | 52.69 | 56.09 | 62.34 | 32.37 | 68.29 | 64.68 | 64.62
MNLI (AFLite) | 103K | 60.50 | 73.73 | 53.91 | 56.37 | 64.28 | 33.12 | 68.04 | 70.75 | 62.19
MNLI (ambiguous) | 103K | 65.03 | 74.93 | 54.42 | 62.32 | 62.14 | 32.68 | 67.42 | 68.77 | 61.15
MNLI (downsampled) | 103K | 64.67 | 71.15 | 59.15 | 52.97 | 62.14 | 28.99 | 69.08 | 56.76 | 62.84
WaNLI | 103K | 72.55 | 89.40 | 76.81 | 65.15 | 64.03 | 41.12 | 70.63 | 75.40 | 75.49
Table 10: Additional baselines that finetune RoBERTa-large on different subsets of MultiNLI, filtered via existing debiasing methods.
Training set         Size   Diagnostics  HANS*  QNLI*  WNLI*  NQ-NLI*   ANLI  FEVER-NLI  BIG-Bench*  WaNLI
(test-set size)                     1104    30K   5266    706     4855   3200        20K        3324   5000
MNLI                 393K          60.87  76.40  65.49  50.56    61.33  30.56      66.94       58.87  61.72
MNLI + Tailor        485K          61.14  74.34  63.33  50.70    62.05  31.06      67.15       68.95  61.28
MNLI + Z-Aug         754K          60.05  76.73  63.46  50.14    60.53  32.50      67.10       54.81  61.38
MNLI ⋄ ANLI          393K          61.23  73.55  69.80  52.26    61.64  49.91      70.82       68.80  61.66
WaNLI                103K          64.58  86.25  74.66  51.13    63.66  38.22      68.27       76.17  72.56
Table 11: Empirical comparison of different training datasets for T5-base. For brevity, we include MNLI, WaNLI, and the strongest baselines from the RoBERTa-large results in Table 3.
Figure 6: Instructions provided to crowdworkers on Amazon Mechanical Turk.
Figure 7: Guidelines provided to crowdworkers in the human review stage.
Figure 8: The Amazon Mechanical Turk interface used for collecting human annotations. Annotators are given free-text boxes pre-populated with the original premise and hypothesis, to ease the work of revision. They then either select an entailment class or discard the example.
Write a pair of sentences that have the same relationship as the previous examples. Examples:
1. In six states, the federal investment represents almost the entire contribution for providing civil legal services to low-income individuals.
Implication: In 44 states, the federal investment does not represent the entire contribution for providing civil legal services for people of low income levels.
2. But if it’s at all possible, plan your visit for the spring, autumn, or even the winter, when the big sightseeing destinations are far less crowded.
Implication: This destination is most crowded in the summer.
3. 5 percent of the routes operating at a loss.
Implication: 95 percent of routes are operating at either profit or break-even.
4. 30 About 10 percent of households did not
Implication: Roughly ninety percent of households did this thing.
5. 5 percent probability that each part will be defect free.
Implication: Each part has a 95 percent chance of having a defect.
6.
Table 12: Context corresponding to row 1 in Table 1, which contains Entailment examples from MultiNLI found via nearest neighbors in [CLS] token embedding space. All examples require reasoning about set complements, drawn from universes such as 100 percent, the 50 U.S. states, and the four seasons.
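A minimal sketch of the kind of nearest-neighbor retrieval described in the caption is shown below. It encodes each premise-hypothesis pair and compares the first-token ([CLS]/<s>) representations; the use of a pretrained roberta-large checkpoint and cosine similarity are illustrative assumptions, since the exact encoder and distance metric are not specified in this appendix.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

@torch.no_grad()
def cls_embedding(premise, hypothesis):
    # Encode the sentence pair and keep the first ([CLS]/<s>) token's vector.
    batch = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    return encoder(**batch).last_hidden_state[:, 0].squeeze(0)

def nearest_neighbors(seed_vec, candidate_vecs, k=4):
    # candidate_vecs: tensor of shape (num_candidates, hidden_size)
    sims = torch.nn.functional.cosine_similarity(seed_vec.unsqueeze(0), candidate_vecs)
    return sims.topk(k).indices
```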
Write a pair of sentences that have the same relationship as the previous examples. Examples:
1. Small holdings abound, and traditional houses sit low on the treeless hillsides.
Possibility: The hills were the only place suitable to build traditional houses.
2. The inner courtyard has a lovely green and blue mosaic of Neptune with his wife Amphitrite.
Possibility: The only colors used in the mosaic of Neptune and Amphitrite are green and blue.
3. Nathan Road, Central, and the hotel malls are places to look.
Possibility: The only places to look are Nathan Road, Central and hotel malls.
4. Make your way westward to the Pont Saint-Martin for a first view of the city’s most enchanting quarter, the old tannery district known as Petite France.
Possibility: The only place to the west of Pont Saint-Martin is the old tannery district.
5. The artisans, tradespeople, and providers of entertainment (reputable and not so reputable) lived downtown on the reclaimed marshlands north and east, in the area still known as Shitamachi.
Possibility: The only place where artisans, tradespeople and entertainers could live was in the marshlands to the north and east.
6.
Table 13: Context corresponding to row 2 in Table 1, which contains Neutral examples where the hypothesis introduces an exclusivity that is not implied by the premise.
Write a pair of sentences that have the same relationship as the previous examples. Examples:
1. Dun Laoghaire is the major port on the south coast.
Contradiction: Dun Laoghaire is the major port on the north coast.
2. Leave the city by its eastern Nikanor Gate for a five-minute walk to Hof Argaman (Purple Beach), one of Israel’s finest beaches.
Contradiction: Leave the city by its western Nikanor Gate for a fifty five minute walk to Hof Argaman.
3. Southwest of the Invalides is the Ecole Militaire, where officers have trained since the middle of the 18th century.
Contradiction: North of the Invalides is the Ecole Militaire, where officers have slept since the early 16th century.
4. Across the courtyard on the right-hand side is the chateau’s most distinctive feature, the splendid Francois I wing.
Contradiction: The Francois l wing can be seen across the courtyard on the left-hand side.
5. To the south, in the Sea of Marmara, lie the woods and beaches of the Princes’ Islands.
Contradiction: In the north is the Sea of Marmara where there are mountains to climb.
6.
Table 14: Context corresponding to row 3 in Table 1, which contains Contradiction examples that flip cardinal directions between the premise and hypothesis.
Write a pair of sentences that have the same relationship as the previous examples. Examples:
1. Vendors and hair braiders are sure to approach you.
Implication: You’re likely to be solicited by vendors or hair braiders.
2. The Carre d’Art, an ultramodern building opposite the Maison Carre, exhibits modern art.
Implication: Pieces of modern art can be found in the Carre d’Art, a structure which stands across from the Maison Carre.
3. But they also take pains not to dismiss the trauma the Holocaust visited and continues to visit upon Jews.
Implication: The Holocaust visited trauma upon Jews, and they are careful not to dismiss this.
4. One fortunate result of this community’s influence has been the proliferation of good restaurants and interesting bars from which to choose.
Implication: The influence of this community has led to an increase in the number of intriguing bars and good dining establishments.
5. Salinger wrote similar letters to other young female writers.
Implication: Other young female writers received similar letters from Salinger as well.
6.
Table 15: Context corresponding to row 7 in Table 1, which contains Entailment examples in which the hypothesis substitutes a verb from the premise with one that has a different subcategorization frame. Note that the third in-context example does not quite share this pattern, but GPT-3 is still able to replicate the pattern present in the other examples.
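The prompts in Tables 12–15 all follow the same template, and could be assembled along the lines of the sketch below. The helper name, the stop sequence, and the commented-out GPT-3 API call with its decoding parameters are illustrative assumptions, not the exact generation settings used for WaNLI.

```python
def build_prompt(in_context_pairs, relation_word="Implication"):
    """in_context_pairs: list of (premise, hypothesis) tuples; relation_word is
    "Implication", "Possibility", or "Contradiction" depending on the label."""
    lines = ["Write a pair of sentences that have the same relationship as the "
             "previous examples. Examples:"]
    for i, (premise, hypothesis) in enumerate(in_context_pairs, start=1):
        lines.append(f"{i}. {premise}")
        lines.append(f"{relation_word}: {hypothesis}")
    lines.append(f"{len(in_context_pairs) + 1}.")  # cue the model to continue the list
    return "\n".join(lines)

# prompt = build_prompt(neighbor_pairs, relation_word="Contradiction")
# completion = openai.Completion.create(engine="davinci", prompt=prompt,
#                                       max_tokens=120, stop="\n\n")  # assumed settings
```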
Figure 9: Left: Data map for the MultiNLI training set, based on a RoBERTa-large classifier trained on MultiNLI. Right: Data map for the WaNLI training set, based on a RoBERTa-large classifier trained on WaNLI. The difference in the distribution of variability (which determines example ambiguity) is striking: MultiNLI is overwhelmingly dominated by easy-to-learn examples with variability close to 0, whereas the variability of WaNLI examples is much more spread out, suggesting that the dataset contains more valuable examples overall.
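The comparison of variability distributions described in the caption can be reproduced with a short matplotlib sketch like the one below, assuming the per-example variability values for each data map have already been computed (e.g., as in the sketch given earlier in this appendix).

```python
import matplotlib.pyplot as plt

def compare_variability(var_mnli, var_wanli):
    """Overlay the per-example variability distributions of the two data maps."""
    plt.hist(var_mnli, bins=50, density=True, alpha=0.5, label="MultiNLI")
    plt.hist(var_wanli, bins=50, density=True, alpha=0.5, label="WaNLI")
    plt.xlabel("variability")
    plt.ylabel("density")
    plt.legend()
    plt.tight_layout()
    plt.show()
```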