
Generative Context Pair Selection for Multi-hop Question Answering

Dheeru Dua♣, Cicero Nogueira dos Santos♠, Patrick Ng♠,
Ben Athiwaratkun♠, Bing Xiang♠, Matt Gardner♥ and Sameer Singh♣
♣University of California, Irvine, USA
♠Amazon Web Services, New York
♥Allen Institute for Artificial Intelligence
ddua@uci.edu
Abstract

Compositional reasoning tasks like multi-hop question answering require making latent decisions to arrive at the final answer, given a question. However, crowdsourced datasets often capture only a slice of the underlying task distribution, which can induce unanticipated biases in models performing compositional reasoning. Furthermore, discriminatively trained models exploit such biases to achieve better held-out performance without learning the right way to reason, since they are not required to attend to the question representation (the conditioning variable) in its entirety when estimating the answer likelihood. In this work, we propose a generative context selection model for multi-hop question answering that reasons about how the given question could have been generated from a context pair. While comparable to state-of-the-art models in answering performance, our proposed generative passage selection model performs better (4.9% higher than the baseline) on an adversarial held-out set that tests the robustness of the model's multi-hop reasoning capabilities.

1 Introduction

Recently, many reading comprehension datasets that require compositional reasoning over several disjoint passages, such as HotpotQA Yang et al. (2018) and WikiHop Welbl et al. (2018), have been introduced. This style of compositional reasoning, also referred to as multi-hop reasoning, first requires finding the correct set of passages relevant to the question and then the answer span within the selected passages. Most of these datasets are collected via crowdsourcing, which makes the evaluation of such models heavily reliant on the quality of the collected held-out sets.

Crowdsourced datasets often present only a partial picture of the underlying data distribution. Learning complex latent sequential decisions, like multi-hop reasoning, to answer a given question under such circumstances is marred by numerous biases, such as annotator bias Geva et al. (2019), label bias (Dua et al., 2020; Gururangan et al., 2018), survivorship bias (Min et al., 2019; Jiang and Bansal, 2019), and ascertainment bias Jia and Liang (2017). As a result, testing model performance on such biased held-out sets becomes unreliable, as models exploit these biases and learn shortcuts to get the right answer without learning the right way to reason.

Question: The 2011-12 VCU Rams men's basketball team, led by third year head coach Shaka Smart, represented the university which was founded in what year?
Gold Answer: 1838
Passage 1: The 2011-12 VCU Rams men's basketball team represented Virginia Commonwealth University during the 2011-12 NCAA Division I men's basketball season…
Passage 2: Virginia Commonwealth University (VCU) is a public research university located in Richmond, Virginia. VCU was founded in 1838 as the medical department of Hampden-Sydney College, becoming the Medical College of Virginia in 1854…
Prediction: 1838
Adversarial context from Jiang and Bansal (2019): Dartmouth University is a public research university located in Richmond, Virginia. Dartmouth was founded in 1938 as the medical department of Hampden-Sydney College, becoming the Medical College of Virginia in 1854…
New Prediction: 1938

Figure 1: Example from HotpotQA, showing the reasoning chain for answering the question (in green) and an adversarial context (in pink) introduced by  Jiang and Bansal (2019) which confuses the model, causing it to change its prediction because it did not learn the right way to reason.

Consider an example from HotpotQA in Figure 1, where the latent entity "Virginia Commonwealth University" can be used by the model (Jiang and Bansal, 2019) to bridge the two relevant passages (highlighted in green) from the original dev set and correctly predict the answer "1838". However, upon adding an adversarial context (highlighted in pink) to the pool of contexts, the model's prediction changes to "1938", implying that the model did not learn the right way to reason. This is because the discriminatively trained passage selector exploits lexical cues like "founded" in the second passage and does not pay attention to the complete question. The absence of such adversarial contexts during training allows the model to find incorrect reasoning paths.

In this work, we propose a generative context pair selection model, which reasons through the data generation process of how a specific question could have been constructed from a given pair of passages. We show that our proposed model is comparable in performance to state-of-the-art systems, with a minimal drop in performance on the adversarial held-out set. Our generative passage selector shows an improvement of 4.9% in Top-1 accuracy compared to the discriminatively trained passage selector on the adversarial dev set.

2 Generative Passage Selection

Given a set of contexts C = \{c_{0}, c_{1}, ..., c_{N}\}, the goal of multi-hop question answering is to combine information from multiple context passages to identify the answer span a for a given question q. The first step is to identify the pair of contexts, from all possible pairs \psi = \{(c_{i}, c_{j}) : c_{i} \in C, c_{j} \in C\}, that is appropriate for answering the question.

Existing models for multi-hop question answering Tu et al. (2020); Chen et al. (2019) consist of two components: a discriminative passage selection model and an answering model. Passage selection identifies which pairs of contexts are relevant for answering the given question, i.e., it estimates p(c_{ij}|q,\psi). This is followed by the answering model, which extracts the answer span given a context pair and the question, p(a|q,c_{ij}). These are combined as follows:

p(a|q,\psi) = \sum_{c_{ij}} p(a|q,c_{ij}) \, p(c_{ij}|q,\psi)   (1)

The discriminative passage selector learns to select a set of contexts by conditioning on the question representation. This learning process does not encourage the model to pay attention to the entire question, which can result in ignoring parts of the question and, thus, learning spurious correlations.

To predict the answer at test time, we do not sum over all pairs of contexts, but instead use the top-scoring pair to answer the question (summing over all context pairs, or maintaining a beam of highly ranked pairs, did not yield much higher performance and was not worth the additional computation cost).

In other words, we use passage selection to pick the best context pair c^{*}_{ij}, which is used by the answering module to get the answer, a^{*} = \operatorname{argmax}_{a} p(a|q,c^{*}_{ij}).
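
As a minimal illustration, the sketch below shows this top-1 selection followed by answer extraction; `selector_score` and `extract_answer` are hypothetical stand-ins for the trained selector p(c_{ij}|q,\psi) and answering model, not part of our released code.

```python
# Minimal sketch of top-1 discriminative selection followed by answering.
# `selector_score` and `extract_answer` are hypothetical stand-ins for the
# trained models p(c_ij | q, psi) and argmax_a p(a | q, c_ij).
from itertools import combinations

def answer_with_top_pair(question, contexts, selector_score, extract_answer):
    pairs = list(combinations(contexts, 2))
    # Use only the single top-scoring context pair instead of marginalizing.
    best_pair = max(pairs, key=lambda pair: selector_score(question, pair))
    return extract_answer(question, best_pair)
```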

2.1 Model Description

We propose a joint question-answering model which learns p(a,q|\psi) instead of p(a|q,\psi). This joint question-answer model can be factorized into a generative passage selector and a standard answering model as:

p(a,q|\psi) = \sum_{c_{ij}} p(a|q,c_{ij}) \, p(q|c_{ij}) \, p(c_{ij}|\psi)   (2)

First, a prior, p(c_{ij}|\psi), over the context pairs establishes a measure of compatibility between passages in a particular dataset. Then, a conditional generation model, p(q|c_{ij}), establishes the likelihood of generating the given question from a selected pair of passages. Finally, a standard answering model, p(a|q,c_{ij}), estimates the likely answer distribution given a question and context pair. The first two terms (prior and conditional generation) can be seen as a generative model that chooses a pair of passages from which the given question could have been constructed. The answering model can be instantiated with any existing state-of-the-art model, such as a graph neural network Tu et al. (2020); Shao et al. (2020), entity-based chain reasoning Chen et al. (2019), etc.

The process at test time is identical to that with discriminative passage selection, except that the context pairs are scored by taking the entire question into account: c^{*}_{ij} = \operatorname{argmax}_{c_{ij}} p(q|c_{ij}) \, p(c_{ij}|\psi).
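
The following is a minimal sketch of this generative scoring at test time; `log_prior` and `question_log_likelihood` are hypothetical wrappers around the trained prior p(c_{ij}|\psi) and conditional generation model p(q|c_{ij}).

```python
# Sketch of generative context-pair selection at test time.
# `log_prior` and `question_log_likelihood` are hypothetical wrappers around
# the trained prior and conditional question generation models.
from itertools import combinations

def select_pair_generative(question, contexts, log_prior, question_log_likelihood):
    pairs = list(combinations(contexts, 2))
    # Score each pair by log p(q | c_ij) + log p(c_ij | psi) and take the argmax.
    return max(
        pairs,
        key=lambda pair: question_log_likelihood(question, pair) + log_prior(pair),
    )
```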

2.2 Model Learning

We use a pre-trained T5-based encoder-decoder model Raffel et al. (2019) to obtain contextual representations; the model is further trained to estimate all the individual probability distributions.

For learning the generative model, we train the prior, p(c_{ij}|\psi), and the conditional generation model, p(q|c_{ij},\psi), jointly. First, the prior network projects the concatenated contextualized representation, r_{ij}, of the starting and ending tokens of the concatenated contexts (c_{i};c_{j}) from the encoder to obtain un-normalized scores, which are then normalized across all context pairs via a softmax operator. The loss function increases the likelihood of the gold context pair over all possible context pairs.

r_{ij} = \mathrm{encoder}(c_{i};c_{j})   (3)
s_{ij} = W^{1\times d}\,(r_{ij}[start];\,r_{ij}[end])   (4)
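
A minimal PyTorch sketch of this prior network follows, assuming the encoder output for each concatenated pair is already available; the hidden size and the use of the first/last token positions are illustrative assumptions rather than exact implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairPrior(nn.Module):
    """Sketch of the prior p(c_ij | psi) from Eqs. 3-4: project the start and
    end token representations of each concatenated pair (c_i; c_j) to a scalar
    score s_ij, then normalize across all candidate pairs with a softmax."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Linear projection over the concatenated [start; end] representation.
        self.proj = nn.Linear(2 * d_model, 1)

    def forward(self, pair_reps: torch.Tensor) -> torch.Tensor:
        # pair_reps: [num_pairs, seq_len, d_model], one row per concatenated
        # pair taken from the encoder.
        start, end = pair_reps[:, 0, :], pair_reps[:, -1, :]
        scores = self.proj(torch.cat([start, end], dim=-1)).squeeze(-1)  # s_ij
        return F.log_softmax(scores, dim=-1)  # log p(c_ij | psi)

# Training maximizes the gold pair's log-probability:
#   loss = -prior(pair_reps)[gold_pair_index]
```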

The conditional question generation network obtains contextual representations for the context-pair candidates from the encoder and uses them to generate the question via the decoder. We define the objective to increase the likelihood of the question for the gold context pair and the unlikelihood Welleck et al. (2019) for a sampled set of negative context pairs (Eq. 5).

\mathcal{L}(\theta) = \sum_{t=1}^{|question|} \log p(q_{t}|q_{<t},c_{gold}) + \sum_{n \in neg.\,pairs} \sum_{t=1}^{|question|} \log\big(1 - p(q_{t}|q_{<t},c_{n})\big)   (5)
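
Below is a minimal PyTorch sketch of this objective, assuming the decoder's token logits for the gold pair and each sampled negative pair have been pre-computed; the numerical clamp is our addition for stability and is not part of Eq. 5.

```python
import torch
import torch.nn.functional as F

def question_generation_loss(gold_logits, negative_logits_list, question_ids, eps=1e-6):
    """Sketch of Eq. 5: token-level likelihood of the question under the gold
    context pair plus unlikelihood (Welleck et al., 2019) under sampled
    negative pairs. Logits: [question_len, vocab]; question_ids: [question_len]."""
    gold_log_probs = F.log_softmax(gold_logits, dim=-1)
    likelihood = gold_log_probs.gather(-1, question_ids.unsqueeze(-1)).squeeze(-1).sum()

    unlikelihood = gold_logits.new_zeros(())
    for neg_logits in negative_logits_list:
        neg_probs = F.softmax(neg_logits, dim=-1).gather(-1, question_ids.unsqueeze(-1)).squeeze(-1)
        # sum_t log(1 - p(q_t | q_<t, c_n)), clamped away from log(0).
        unlikelihood = unlikelihood + torch.log1p(-neg_probs.clamp(max=1 - eps)).sum()

    # The objective maximizes both terms; return the negative for gradient descent.
    return -(likelihood + unlikelihood)
```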

3 Experiments and Results

We experiment with two popular multi-hop datasets: HotpotQA Yang et al. (2018) and WikiHop Welbl et al. (2018). Most SOTA passage selection modules for HotpotQA use a RoBERTa Liu et al. (2019) based classifier to select the top-k passages given the question, which has an accuracy of ∼94.5% Tu et al. (2020). We use a T5-based standard passage selector, p(c_{ij}|q,\psi), as our baseline, which provides performance comparable to the SOTA passage selector (Table 1).

Dataset     Standard Selector p(c_{ij}|q,\psi)     Generative Selector p(q|c_{ij}) p(c_{ij}|\psi)
HotpotQA    95.3                                   97.5
WikiHop     96.8                                   97.2
Table 1: Passage selection accuracy: accuracy with which the passage pair selected by each technique (c^{*}_{ij}) matches the oracle pair (c_{gold}) on the original development set.

We also use a simple T5-based answering model with performance comparable to SOTA answering models to illustrate the effect of our generative selector on end-to-end performance. The oracle EM/F1 of our answering model, p(a|q,c_{gold}), on HotpotQA and WikiHop is 74.5/83.5 and 76.2/83.9, respectively. The overall EM/F1 on WikiHop with the generative model is 73.5/80.2.

3.1 Adversarial Evaluation

We use an existing adversarial set Jiang and Bansal (2019) for HotpotQA to test the robustness of the model's multi-hop reasoning capabilities given a confusing passage. This helps measure, quantitatively, the degree of biased correlations learned by the model. In Table 2, we show that the standard discriminative passage selector has a much larger performance drop (∼4%) than the generative selector (∼1%) on the adversarial dev set Jiang and Bansal (2019), showing that the generative selector is less biased and less affected by conservative changes Ben-David et al. (2010) to the data distribution. We can also see in Table 2 that SOTA models Tu et al. (2020); Fang et al. (2019), which use the standard passage selector, have a larger F1 drop when applied to the adversarial set. Table 3 shows that the generator was able to generate multi-hop style questions using both contexts.

Models                 Original Acc / F1     Adversarial Acc / F1
Standard Selector      95.3 / 79.5           91.4 / 76.0
Generative Selector    97.5 / 81.9           96.3 / 80.1
Tu et al. (2020)       94.5 / 80.2           -    / 61.1
Fang et al. (2019)     -    / 82.2           -    / 78.9
Table 2: Performance on adversarial data: passage selection accuracy and end-to-end QA F1 on the original and adversarial set Jiang and Bansal (2019) of HotpotQA. The results of Tu et al. (2020) and Fang et al. (2019) are taken from Perez et al. (2020).
Context 1, c_{i}: The America East Conference is a collegiate athletic conference affiliated with the NCAA Division I, whose members are located mainly in the Northeastern United States. The conference was known as the Eastern College Athletic Conference-North from 1979 to 1988 and the North Atlantic Conference from 1988 to 1996.
Context 2, c_{j}: The Vermont Catamounts men's soccer team represents the University of Vermont in all NCAA Division I men's college soccer competitions. The team competes in the America East Conference.
Original Question, q: the vermont catamounts men's soccer team currently competes in a conference that was formerly known as what from 1988 to 1996?
Generated Questions, p(q|c_{ij},\psi):
the vermont catamounts men's soccer team competes in what collegiate athletic conference affiliated with the ncaa division i, whose members are located mainly in the northeastern united states?
the vermont catamounts men's soccer team competes in a conference that was known as what from 1979 to 1988?
the vermont catamounts men's soccer team competes in a conference that was known as what from 1988 to 1996?
Table 3: Sample questions generated using the question generation decoder with top-k sampling show that the generative model is able to construct (reason about) possible multi-hop questions given a context pair.

3.2 Context pairs vs. Sentences

Some context selection models for HotpotQA use a multi-label classifier that chooses the top-k sentences Fang et al. (2019); Clark and Gardner (2017), which allows for less inter-document interaction than context pairs. To compare these two input types, we construct a multi-label sentence classifier, p(s|q,C), that selects relevant sentences. This classifier projects a concatenated sentence and question representation, followed by a sigmoid, to predict whether the sentence should be selected. This model performs better than the context-pair selector but is more biased (Table 4).
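
A minimal sketch of this discriminative sentence selector is shown below, assuming one pooled encoder vector per concatenated (sentence; question) input; the class name and hidden size are illustrative.

```python
import torch
import torch.nn as nn

class SentenceSelector(nn.Module):
    """Sketch of the multi-label sentence classifier p(s | q, C): each pooled
    (sentence; question) representation is projected to a logit and passed
    through a sigmoid to decide whether the sentence should be selected."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, sent_question_reps: torch.Tensor) -> torch.Tensor:
        # sent_question_reps: [num_sentences, d_model]
        return torch.sigmoid(self.proj(sent_question_reps)).squeeze(-1)

# Training would use binary cross-entropy against 0/1 relevance labels, e.g.
#   loss = nn.BCELoss()(selector(reps), labels.float())
```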

Model                                               Original    Adversarial
Discriminative Selectors
  Passage, p(c_{ij}|q,\psi)                         95.3        96.3
  Sentence, p(s|q,C)                                97.6        90.9
Generative Selectors
  Passage, p(q|c_{ij},\psi) p(c_{ij}|\psi)          97.5        96.3
  Sentence, p(q|s,C) p(s|C)                         90.6        89.2
  Multi-task, p(q,s|c_{ij},\psi) p(c_{ij}|\psi)     98.1        97.2
Table 4: Passages vs. Sentences: passage selection accuracy for models with different context inputs on the development and adversarial sets of HotpotQA.

We performed similar experiments with the generative model. Along with the passage selection model, we train a generative sentence selection model by first selecting a set of sentences with Gumbel softmax and then generating the question given the selected sentences. Given that the space of sentence sets is much larger than the space of context pairs, the generative sentence selector does not perform well (Table 4). To further improve the performance of the generative selector, we add an auxiliary loss term that predicts the relevant sentences in the context pair, p(q,s|c_{ij},\psi), along with selecting the context pair in a multi-task setting. We see slight performance improvements from using relevant sentences as an additional supervision signal.
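
The sketch below illustrates the sentence sampling step with PyTorch's Gumbel-softmax; the per-sentence (drop, keep) parameterization and straight-through sampling are our assumptions about one plausible instantiation, not a specification of the exact model above.

```python
import torch
import torch.nn.functional as F

def sample_sentence_mask(sentence_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sketch of differentiable sentence selection with Gumbel softmax: each
    sentence gets independent (drop, keep) logits, and a hard straight-through
    sample yields a binary mask. The kept sentences would then condition the
    question generation decoder."""
    # sentence_logits: [num_sentences, 2], column 0 = drop, column 1 = keep.
    samples = F.gumbel_softmax(sentence_logits, tau=tau, hard=True)
    return samples[:, 1]  # binary keep-mask, shape [num_sentences]
```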

4 Related work

Most passage selection models for HotpotQA and WikiHop's distractor-style setup employ RoBERTa-based context selectors conditioned on the question Tu et al. (2020); Fang et al. (2019). In an ideal scenario, the absence of the latent entity in the question should not allow selection of all oracle passages. However, the high performance of these systems can be attributed to existing biases in HotpotQA Jiang and Bansal (2019); Min et al. (2019). Another line of work dynamically updates a working memory to re-rank the set of passages at each hop Das et al. (2019). With the release of datasets like SearchQA Dunn et al. (2017), TriviaQA Joshi et al. (2017), and NaturalQuestions Kwiatkowski et al. (2019), a lot of work has been done on open-domain passage retrieval, especially in the full Wikipedia setting. However, these questions do not necessarily require multi-hop reasoning. A series of works has tried to match a document-level summarized embedding to the question Seo et al. (2018); Karpukhin et al. (2020); Lewis et al. (2020) to obtain the relevant answers. In generative question answering, a few works Lewis and Fan (2018); dos Santos et al. (2020) have used a joint question answering approach on a single context.

5 Conclusion

We have presented a generative formulation of context pair selection in multi-hop question answering models. By encouraging the context selection model to explain the entire question, our approach is less susceptible to bias, performing substantially better on adversarial data than existing methods that use discriminative selection. Our proposed model is simple to implement and can be used with any existing (or future) answering model; we will release code to support this integration.

Since context pair selection scales quadratically with the number of contexts, it is not ideal for scenarios that involve a large number of possible contexts. However, it allows for deeper inter-document interaction as compared to other approaches that use summarized document representations. With more reasoning steps, selecting relevant documents given only the question becomes challenging, increasing the need for inter-document interaction.

6 Ethical Considerations

This paper focuses on biases found in question answering models that make their reasoning capabilities brittle. It uses an existing method of testing model performance on an adversarial held-out set as an evaluation metric. This work does not deal with any social impacts of biases in natural language processing systems.

References

  • Ben-David et al. (2010) Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. 2010. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pages 129–136.
  • Chen et al. (2019) Jifan Chen, Shih-ting Lin, and Greg Durrett. 2019. Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610.
  • Clark and Gardner (2017) Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
  • Das et al. (2019) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. arXiv preprint arXiv:1905.05733.
  • dos Santos et al. (2020) Cicero dos Santos, Xiaofei Ma, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Beyond [cls] through ranking by generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1722–1727.
  • Dua et al. (2020) Dheeru Dua, Matt Gardner, and Sameer Singh. 2020. Benefits of intermediate annotations in reading comprehension. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
  • Fang et al. (2019) Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. arXiv preprint arXiv:1911.03631.
  • Geva et al. (2019) Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. arXiv preprint arXiv:1908.07898.
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
  • Jiang and Bansal (2019) Yichen Jiang and Mohit Bansal. 2019. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop qa. arXiv preprint arXiv:1906.07132.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  • Lewis and Fan (2018) Mike Lewis and Angela Fan. 2018. Generative question answering: Learning to answer the whole question.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Min et al. (2019) Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Perez et al. (2020) Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. arXiv preprint arXiv:2002.09758.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Seo et al. (2018) Minjoon Seo, Tom Kwiatkowski, Ankur P Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2018. Phrase-indexed question answering: A new challenge for scalable document comprehension. arXiv preprint arXiv:1804.07726.
  • Shao et al. (2020) Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, and Guoping Hu. 2020. Is graph structure necessary for multi-hop reasoning? arXiv preprint arXiv:2004.03096.
  • Tu et al. (2020) Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI, pages 9073–9080.
  • Welbl et al. (2018) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
  • Welleck et al. (2019) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. Annual Meeting of the Association for Computational Linguistics (ACL).