
How Can Context Help?
Exploring Joint Retrieval of Passage and Personalized Context

Hui Wan
IBM Research AI
hwan@us.ibm.com

Hongkang Li
Rensselaer Polytechnic Institute
lih35@rpi.edu

Songtao Lu
IBM Research AI
songtao@ibm.com

Xiaodong Cui
IBM Research AI
cuix@us.ibm.com

Marina Danilevsky
IBM Research AI
mdanile@us.ibm.com
Abstract

The integration of external personalized context information into document-grounded conversational systems has significant potential business value, but has not been well-studied. Motivated by the concept of personalized context-aware document-grounded conversational systems, we introduce the task of context-aware passage retrieval. We also construct a dataset specifically curated for this purpose. We describe multiple baseline systems to address this task, and propose a novel approach, Personalized Context-Aware Search (PCAS), that effectively harnesses contextual information during passage retrieval. Experimental evaluations conducted on multiple popular dense retrieval systems demonstrate that our proposed approach not only outperforms the baselines in retrieving the most relevant passage but also excels at identifying the pertinent context among all the available contexts. We envision that our contributions will serve as a catalyst for inspiring future research endeavors in this promising direction.

1 Introduction

With recent developments in AI, the world has witnessed an eruption of chatbots powered by LLMs (large language models), such as ChatGPT, BARD Cohen et al. (2022), and BlenderBot Shuster et al. (2022b), which often generate text whose fluency is indistinguishable from a human's. However, chatbots powered by purely parametric LLMs are known to generate factually incorrect statements, a problem that persists regardless of model size Shuster et al. (2021). By leveraging an external corpus of knowledge, retrieval augmented systems Dinan et al. (2018); Lewis et al. (2020); Karpukhin et al. (2020), including document-grounded dialogue systems Roller et al. (2020); Shuster et al. (2022b, a); Cohen et al. (2022), have demonstrated several advantages compared to pure parameter-based systems. For instance, grounding responses on external knowledge bases has been shown to reduce hallucination across a variety of retrieval systems and model architectures Shuster et al. (2021).

A document-grounded conversational system, particularly in the enterprise setting, is likely to have access to a significant amount of contextual information, whether as a knowledge base or a library of API calls. This context may be temporal, such as the current date and time, or recent events; or it may be user-specific, such as information about the user’s account, profile, recent transactions, activity logs, etc. Without any such context, a user’s question such as, "Am I eligible for this rebate?" would receive the generic answer "You may be eligible for this rebate depending on where you live," grounded only on the relevant documents. If the right context were also retrieved and made available, the response could be instantly elevated to "Yes, you are eligible since you live in Singapore." Furthermore, retrieving the correct context information may help the system better understand the user’s intent, and therefore improve the likelihood of identifying the correct grounding document.

The significant challenge of choosing which context to retrieve has great potential business value, but has not been well-studied. Including too much contextual information may introduce noise into the generation step or exceed the LLM’s allowed input size. Including irrelevant contextual information may degrade the generated response. Our motivating question can thus be posed as follows: Given a user query (which itself may be underspecified), a document collection, and a set of available contexts, how can a document-grounded conversational system retrieve a good subset of contexts to help answer the query, and can this process also help in retrieving the most relevant grounding document(s)?

To this end, we propose a new task of personalized context-aware passage retrieval for document-grounded dialogue, and create a dataset, ORCA-ShARC, for this setting. We provide several baseline approaches, as well as develop a novel approach, Personalized Context-Aware Search (PCAS), to address the task.

In order to showcase the efficacy of PCAS, we conduct extensive experiments on multiple well-known retrieval systems. The results illustrate that PCAS not only surpasses the baselines in retrieving the most relevant passage but also excels in identifying the pertinent context. We hope that our advancements in joint context-passage retrieval will serve as a catalyst, motivating future research endeavors in this highly promising field.

2 Context-and-Passage Retrieval

We now formally define the task of context-passage retrieval, which involves not only retrieving the relevant document from the external knowledge base, but also selecting the relevant piece of context from all the available contexts.

Formally, when a user $u$ engages in a conversation with the system, in addition to the static document corpus $\mathcal{D}$, all the available personalized context information $\mathcal{C}^{u}$ is accessible to the system. There is also the conversation history composed of utterances that have already occurred in this session between the user and the system: $H=\{r_{1}:X_{1},\dots,r_{i}:X_{i},\dots\}$, where $r_{i}$ is the speaker role and $X_{i}$ is the utterance at the $i$-th turn. Since the focus of our work is on context-passage retrieval rather than the conversation history, in the rest of the paper we simply consider a single-turn user query $q$ rather than $H$.

Given the input of a user query $q$, the task is to select 1) the most relevant latent document $d$ from $\mathcal{D}$; and 2) the most relevant latent context $c$ from $\mathcal{C}^{u}$, to help the system generate a good response.
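
For concreteness, the following is a minimal sketch (in Python) of what a single instance of this task looks like; the field names and ids are illustrative and are not drawn from any released data format.

```python
from dataclasses import dataclass

@dataclass
class ContextPassageInstance:
    # Illustrative field names, not part of the dataset release.
    query: str              # single-turn user query q
    contexts: list[str]     # available personalized contexts C^u for user u
    gold_doc_id: str        # id of the relevant latent document d in the corpus D
    gold_context: str       # the relevant latent context c

# An instance mirroring the ORCA-ShARC example shown in Table 2.
example = ContextPassageInstance(
    query="Can I get Winter Fuel Payment?",
    contexts=[
        "I am and have been an eligible veteran.",
        "I live in the Swiss Alps.",
        "I'm trying to export some boots.",
    ],
    gold_doc_id="winter-fuel-payment/eligibility",  # hypothetical id for the gold snippet
    gold_context="I live in the Swiss Alps.",
)
```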

To evaluate the retrieved documents and contexts, we use standard retrieval metrics, including the binary rank-aware metric MAP (mean average precision) and the decision support metric Recall@$K$.
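
Since each example has a single gold document and a single gold context (as in the task definition above), these metrics take simple forms; a minimal sketch, assuming the retriever returns a ranked list of ids:

```python
def recall_at_k(ranked_ids, gold_id, k):
    # With a single relevant item, Recall@K is 1 if the gold id appears in the top K, else 0.
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def average_precision_at_k(ranked_ids, gold_id, k):
    # With a single relevant item, AP@K equals the reciprocal rank of the gold id
    # if it appears within the top K, and 0 otherwise.
    for rank, candidate_id in enumerate(ranked_ids[:k], start=1):
        if candidate_id == gold_id:
            return 1.0 / rank
    return 0.0

# MAP@5 over a query set is the mean of the per-query AP@5 values (toy run and golds).
runs = {"q1": ["d3", "d1", "d7", "d2", "d5"], "q2": ["d4", "d9", "d6", "d8", "d0"]}
golds = {"q1": "d1", "q2": "d4"}
map_at_5 = sum(average_precision_at_k(runs[q], golds[q], 5) for q in runs) / len(runs)
print(round(map_at_5, 3))  # (0.5 + 1.0) / 2 = 0.75
```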

3 The ORCA-ShARC Dataset

To the best of our knowledge, there are no existing open-retrieval content-grounded dialogue or QA datasets in which each document-grounded example is annotated with a set of contexts. To this end, we curate a dataset for the task proposed in Section 2.

ShARC Saeidi et al. (2018) is a conversational QA dataset focusing on question answering from given text and one piece of given context (scenario). OR-ShARC Gao et al. (2021) is adapted from the ShARC dataset to an open-retrieval setting, where the task is to retrieve the relevant text snippet from the whole corpus. In OR-ShARC, each example is given one piece of relevant context (scenario).

We create a dataset, ORCA-ShARC (Open-Retrieval Context-Aware ShARC), by converting the OR-ShARC dataset into our task setting, where a set of contexts is provided for each example. To create the set, we start from the example’s original relevant context and expand the set by randomly sampling from all the contexts appearing in the OR-ShARC dataset, as long as no contradiction between contexts is introduced (as judged by prompting the FLAN-T5-3B model Chung et al. (2022)). We include 10 pieces of context for each example. (As the size of the context set grows, it naturally becomes harder to add context without contradictions; a few examples could only support 6-9 pieces of context.) Table 1 summarizes the statistics of the ORCA-ShARC dataset and Table 2 provides an example.
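
A minimal sketch of this construction loop is shown below; the contradiction check is abstracted behind a hypothetical contradicts() helper standing in for the FLAN-T5-3B prompt, and the sampling details are assumptions rather than the exact procedure used to build the released data.

```python
import random

TARGET_SET_SIZE = 10
MAX_ATTEMPTS = 200

def contradicts(candidate: str, accepted: list[str]) -> bool:
    # Hypothetical wrapper around a FLAN-T5-3B prompt that judges whether the
    # candidate context contradicts any already-accepted context.
    raise NotImplementedError

def build_context_set(gold_context: str, context_pool: list[str]) -> list[str]:
    # Start from the example's original relevant context and expand by randomly
    # sampling from all contexts in OR-ShARC, rejecting contradictory candidates.
    contexts = [gold_context]
    attempts = 0
    while len(contexts) < TARGET_SET_SIZE and attempts < MAX_ATTEMPTS:
        attempts += 1
        candidate = random.choice(context_pool)
        if candidate in contexts or contradicts(candidate, contexts):
            continue
        contexts.append(candidate)
    # A few examples end up with only 6-9 contexts when compatible candidates run out.
    random.shuffle(contexts)
    return contexts
```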

# Documents (Rule Text Snippets) 651
Avg. Document Length 38.5
Avg. Pieces of Context per Example 9.94
# Training Examples 17936
# Validation Examples 1105
# Test Examples 2373
Table 1: Summary statistics of ORCA-ShARC.
Source URL https://www.gov.uk/winter-fuel-payment/eligibility
Scenario Set
I am and have been an eligible veteran.
I live in the Swiss Alps.
I’m trying to export some boots.
Question Can I get Winter Fuel Payment?
Gold Snippet …you might still get the payment if both the following apply: * you live in Switzerland or a European Economic Area (EEA) country…
Gold Scenario I live in the Swiss Alps.
Table 2: An example from ORCA-ShARC. Note how the Scenario Set is expanded from the one piece of gold Scenario originally annotated in OR-ShARC.

4 Approach

We compare our approach with three baselines, and use some of the most popular neural retrieval systems to address the context-and-passage retrieval task on the newly constructed dataset.

4.1 Baselines

We design and implement several baselines for the task. The approaches are independent of the underlying retrieval systems. We use $score_{dq}(d,q)$, $score_{cq}(c,q)$ and $score_{cd}(c,d)$ to represent the scores from the retrievers that model the pairwise relevance of document $d$, query $q$, and context $c$.

OR

{question + original relevant context} $\rightarrow$ document: For clarification, this is an experiment on the original OR-ShARC dataset as a reference rather than a baseline. Knowing the original relevant context, the passage retrieval task in OR-ShARC is easier than our task. In this experiment, the original context is concatenated with the user question, forming a new query $q^{\textrm{OR}}$ used to retrieve documents based on $score_{dq}(d,q^{\textrm{OR}})$.

B1

{question + all contexts} $\rightarrow$ document: A baseline that concatenates the user question together with all available contexts to form a new query $q^{\textrm{B1}}$ and retrieves documents based on $score_{dq}(d,q^{\textrm{B1}})$.

B2

question $\rightarrow$ document; document $\rightarrow$ context: A baseline that uses the user question to retrieve documents based on $score_{dq}(d,q)$, then uses the top predicted documents to select contexts based on $score_{cd}(c,d)$.

B3

question $\rightarrow$ context; {question + predicted context} $\rightarrow$ document: A baseline that uses the user question to select contexts based on $score_{cq}(c,q)$, then concatenates the user question with the top-1 predicted context to form a new query $q^{\textrm{B3}}$ and retrieves documents based on $score_{dq}(d,q^{\textrm{B3}})$.
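
Taken together, the baselines differ only in the order in which the same pairwise scorers are applied. The sketch below summarizes B1-B3; the retrieve_docs and score functions are assumed interfaces, not the paper's implementation.

```python
def baseline_b1(q, contexts, retrieve_docs):
    # B1: concatenate the question with all available contexts, then retrieve documents.
    q_b1 = " ".join([q] + contexts)
    return retrieve_docs(q_b1)                    # ranked by score_dq(d, q_B1)

def baseline_b2(q, contexts, retrieve_docs, score_cd):
    # B2: question -> documents, then the top predicted document -> contexts.
    docs = retrieve_docs(q)                       # ranked by score_dq(d, q)
    ranked_contexts = sorted(contexts, key=lambda c: score_cd(c, docs[0]), reverse=True)
    return docs, ranked_contexts

def baseline_b3(q, contexts, retrieve_docs, score_cq):
    # B3: question -> contexts, then question + top-1 context -> documents.
    ranked_contexts = sorted(contexts, key=lambda c: score_cq(c, q), reverse=True)
    q_b3 = q + " " + ranked_contexts[0]
    return retrieve_docs(q_b3), ranked_contexts
```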

4.2 PCAS Approach

We propose a novel approach, PCAS, that jointly predicts the document and the context as a pair, based on the relevance of the document to both the query and the context.

First, we use the user question $q$ to retrieve the top-$K$ document candidates based on $score_{dq}(d,q)$. Then, for each document $d$, we select the context that is most relevant to it based on $score_{dc}(d,c)$. Finally, a convex combination score $\lambda \cdot score_{dq}(d,q)+(1-\lambda)\cdot score_{dc}(d,c)$, with $0<\lambda<1$, is used to select the most relevant pair $(d,c)$. The underlying intuition is as follows: the user question might not contain sufficient information for the system to understand the intent and retrieve the gold document. However, the system will partially know the intent, and has a good chance of including the best document in the top-$K$ list. Matching the top-$K$ documents against the user’s actual situation, which is captured in the contexts, then greatly helps decipher the user’s true intent and retrieve the gold document.
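
A minimal sketch of PCAS inference under these definitions; the scorer and retrieval interfaces are assumptions, and $K$ plays the role of the beam size reported in Appendix A.1.

```python
def pcas(q, contexts, retrieve_docs, score_dq, score_dc, k=5, lam=0.6):
    # 1) Retrieve the top-K candidate documents for the query.
    candidates = retrieve_docs(q)[:k]                      # ranked by score_dq(d, q)
    best_pair, best_score = None, float("-inf")
    for d in candidates:
        # 2) For each candidate document, pick its most relevant context.
        c = max(contexts, key=lambda ctx: score_dc(d, ctx))
        # 3) Score the pair with the convex combination of query and context relevance.
        combined = lam * score_dq(d, q) + (1.0 - lam) * score_dc(d, c)
        if combined > best_score:
            best_pair, best_score = (d, c), combined
    return best_pair                                       # the selected (document, context) pair
```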

5 Experimental Results

We evaluate the baselines and PCAS in a zero-shot context-and-passage retrieval setting on the ORCA-ShARC dataset. We conduct experiments using several popular pretrained neural retrieval systems, including the late-interaction retrieval model ColBERT Khattab and Zaharia (2020); Santhanam et al. (2021), and the single-vector retrieval models DPR Karpukhin et al. (2020), ANCE Xiong et al. (2021a), and Sentence BERT (S-BERT) Reimers and Gurevych (2019) with the DistilBERT-TAS-B model Sanh et al. (2019); Hofstätter et al. (2021).

For ColBERT, we adapt the code from the ColBERT repository (https://github.com/stanford-futuredata/ColBERT). From the BEIR Thakur et al. (2021) repository (https://github.com/beir-cellar/beir), we get the pretrained model names, as well as the code for the other dense retrieval systems. We use the pytrec_eval tool Van Gysel and de Rijke (2018) (https://github.com/cvangysel/pytrec_eval) for evaluation.
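
For reference, a sketch of how pytrec_eval can produce the Recall@K and MAP@5 numbers reported below; the query and document ids and scores here are made up, and the measure strings follow pytrec_eval's trec_eval-style naming.

```python
import pytrec_eval

# Binary relevance judgments: one gold document per query in ORCA-ShARC.
qrels = {"q1": {"d12": 1}, "q2": {"d7": 1}}

# Retrieval scores produced by one of the dense retrievers (toy values).
run = {
    "q1": {"d12": 11.2, "d3": 9.8, "d5": 7.1},
    "q2": {"d2": 10.4, "d7": 10.1, "d9": 6.3},
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map_cut.5", "recall.1,5"})
per_query = evaluator.evaluate(run)

# Average per-query values to obtain corpus-level MAP@5, Recall@1 and Recall@5.
for measure in ("map_cut_5", "recall_1", "recall_5"):
    mean = sum(scores[measure] for scores in per_query.values()) / len(per_query)
    print(measure, round(mean, 4))
```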

In Table 3, we present the document retrieval results and the context selection results (limited to the approaches that predict the context: B2, B3 and PCAS). For the same approach, the results vary widely across retrievers, due to the distinct models and different pre-training data and processes. For example, DPR results are on the low side, especially for the B3 document metrics, because B3 involves two chained retrieval steps that amplify this effect. B1 yields the lowest accuracy across the board, mainly due to the noise introduced by including all the contexts without discrimination. Importantly, we observe that when the original relevant context is unknown, our proposed PCAS approach achieves better retrieval results than all baselines, which indicates that jointly considering the documents and contexts can improve the performance of both document and context retrieval. Note that the PCAS results are close to those of the OR experiment, in which the original relevant context is given. This suggests that PCAS can identify the relevant and important contexts for the query, with no need for users to specify any contexts. Furthermore, the comparison between B2 and B3 illustrates that the retrieval order $q\rightarrow d\rightarrow c$ is better than $q\rightarrow c\rightarrow d$, which supports the motivation of our PCAS design.

Retriever   Method   documents                contexts
                     R@1     R@5     M@5      R@1
ColBERT     OR       60.81   84.25   70.52    NA
            B1       34.93   59.82   44.47    NA
            B2       55.57   90.77   70.30    27.24
            B3       52.85   78.01   62.34    20.90
            PCAS     59.19   90.77   71.78    25.79
DPR         OR       31.58   54.93   40.78    NA
            B1       14.84   35.66   22.24    NA
            B2       28.96   51.67   38.38    32.13
            B3       18.91   38.73   25.94    31.58
            PCAS     30.04   56.02   38.81    32.49
ANCE        OR       58.37   84.52   68.69    NA
            B1       41.18   73.85   53.01    NA
            B2       44.98   77.29   55.35    37.74
            B3       42.53   75.48   55.46    31.04
            PCAS     52.85   83.26   65.71    41.36
S-BERT      OR       68.15   91.40   77.55    NA
            B1       45.43   75.84   57.00    NA
            B2       55.57   91.04   68.57    39.01
            B3       53.30   82.26   63.53    23.44
            PCAS     63.53   91.04   74.28    42.80
Table 3: Evaluation for document retrieval (R@1, R@5, M@5) and context retrieval (last column) on the ORCA-ShARC validation set. R@K denotes Recall@K scores. M@5 denotes MAP@5 scores. NA means that this approach does not retrieve context.

6 Related Work

Our proposed task is different from the work on context-aware QA Seonwoo et al. (2020); Taunk et al. (2023) in which QA is done on a given document and no retrieval is involved. On the other hand, context-aware QA can be considered as the next step following our task.

Our work is closely related to the work on contextual information retrieval Alami Merrouni et al. (2019). The major difference is that the tasks in this line of work do not involve selecting the relevant context. The context used there is structured and drawn from a few pre-defined genres, whereas we leverage a set of unstructured contexts. There is also no joint relevance modeling of both context and content with respect to the query. Our task and approach are also different from contextual recommendation systems, on which Tourani et al. (2023) recently presented their work; such systems do not involve user questions or queries.

Our proposed data and task are also related to open domain question answering (Kwiatkowski et al., 2019; Lewis et al., 2020; Min et al., 2020; Qu et al., 2020; Izacard and Grave, 2021; Li et al., 2021; Xiong et al., 2021b; Yu et al., 2021), open-retrieval conversational QA (Qu et al., 2020; Gao et al., 2021), and open-retrieval document-grounded dialogues Feng et al. (2021). However, none of these datasets and tasks include external context information. The lone exception is OR-ShARC, which provides one relevant context for each example and does not involve any selection of relevant context from a larger set.

Finally, our work is related to Multi-Session Chat (MSC) Qian et al. (2021), a dataset consisting of multiple chat sessions, created for studying how to utilize information outside of the current conversation session. Similarly,  Xu et al. (2022) recently leveraged retrieval-augmented methods to select useful contexts from previous chat sessions. However, the datasets in both works are not document-grounded, and retrieving documents jointly with contextual information is not explored.

7 Conclusion

This work proposes the task of context-aware passage retrieval and creates a dataset based on OR-ShARC. We also present a novel approach for integrating external context information into the retrieval aspect of document-grounded conversational systems. The proposed PCAS method effectively combines both the document-query relevance and contextual relevance. We conduct experimental evaluations on popular retrieval systems, including ColBERT, DPR, ANCE, and S-BERT. The results demonstrate that incorporating external context information through PCAS brings significant improvement on passage retrieval and achieves higher MAP and Recall@$K$ than baseline models.

The proposed retrieval paradigm opens up avenues for future research and extensions. Several potential directions include extending the PCAS method to the training process, integrating downstream modules such as response generation, and creating real-world context datasets with the inclusion of human feedback. These topics offer promising directions for the community to explore and advance the field further.

Limitations

Although the motivation for our work is to improve the quality of conversation systems that could be grounded on both document content and contextual information, in this work we focus exclusively on the task of retrieving the content and context. We do not evaluate the resulting generative responses to the user query, which is the actual final desired outcome, though we are working on doing so in the near future.

There are no publicly available datasets suited to the context-aware passage retrieval task. To the best of our knowledge, some of the related datasets do not have a passage retrieval setting, and the rest either do not have an annotated context set or have only a few limited genres of context, which does not serve our motivation of rich context from various business applications. This led us to create a dataset ourselves. Experimenting on only one dataset may limit the generalizability of the conclusions we can draw.

The size of the context set that we construct is limited. In a realistic setting, it is likely that the number of available contexts is much greater, and more heterogeneous in nature than the short unstructured text available from the OR-ShARC dataset. More work is needed to discover how our approach scales to these settings.

Ethics Statement

To the best of our knowledge, we have not identified any immediate ethical concerns or negative societal consequences arising from this work, in accordance with the ACL ethics policy. The dataset we created does not involve generating any data with LLMs, hence does not impose any risk of non-factual or harmful content.

We hope our work may serve as an inspiration for the community to utilize the newly created dataset and further enhance retrieval performance. In the future, it may be necessary to tread carefully when incorporating user context in a real-world setting, for example to not cause consternation about how much a system may know about a given user.

Acknowledgements

This work was partly supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

References

  • Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models.
  • Cohen et al. (2022) Aaron Daniel Cohen, Adam Roberts, Alejandra Molina, Alena Butryna, Alicia Jin, Apoorv Kulshreshtha, Ben Hutchinson, Ben Zevenbergen, Blaise Hilary Aguera-Arcas, Chung ching Chang, Claire Cui, Cosmo Du, Daniel De Freitas Adiwardana, Dehao Chen, Dmitry (Dima) Lepikhin, Ed H. Chi, Erin Hoffman-John, Heng-Tze Cheng, Hongrae Lee, Igor Krivokon, James Qin, Jamie Hall, Joe Fenton, Johnny Soraker, Kathy Meier-Hellstern, Kristen Olson, Lora Mois Aroyo, Maarten Paul Bosma, Marc Joseph Pickett, Marcelo Amorim Menegali, Marian Croak, Mark Díaz, Matthew Lamm, Maxim Krikun, Meredith Ringel Morris, Noam Shazeer, Quoc V. Le, Rachel Bernstein, Ravi Rajakumar, Ray Kurzweil, Romal Thoppilan, Steven Zheng, Taylor Bos, Toju Duke, Tulsee Doshi, Vincent Y. Zhao, Vinodkumar Prabhakaran, Will Rusch, YaGuang Li, Yanping Huang, Yanqi Zhou, Yuanzhong Xu, and Zhifeng Chen. 2022. Lamda: Language models for dialog applications. In arXiv.
  • Dinan et al. (2018) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.
  • Feng et al. (2021) Song Feng, Siva Sankalp Patel, Hui Wan, and Sachindra Joshi. 2021. MultiDoc2Dial: Modeling dialogues grounded in multiple documents. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6162–6176, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Gao et al. (2021) Yifan Gao, Jingjing Li, Chien-Sheng Wu, Michael R. Lyu, and Irwin King. 2021. Open-retrieval conversational machine reading.
  • Hofstätter et al. (2021) Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proc. of SIGIR.
  • Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Distilling knowledge from reader to retriever for question answering. In International Conference on Learning Representations.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
  • Li et al. (2021) Yongqi Li, Wenjie Li, and Liqiang Nie. 2021. A graph-guided multi-round retrieval method for conversational open-domain question answering. arXiv preprint arXiv:2104.08443.
  • Alami Merrouni et al. (2019) Zakariae Alami Merrouni, Bouchra Frikh, and Brahim Ouhbi. 2019. Toward contextual information retrieval: A review and trends. Procedia Computer Science, 148:191–200. The Second International Conference on Intelligent Computing in Data Sciences, ICDS2018.
  • Min et al. (2020) Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online. Association for Computational Linguistics.
  • Qian et al. (2021) Hongjin Qian, Zhicheng Dou, Yutao Zhu, Yueyuan Ma, and Ji-Rong Wen. 2021. Learning implicit user profile for personalized retrieval-based chatbot. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1467–1477.
  • Qu et al. (2020) Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Roller et al. (2020) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. 2020. Recipes for building an open-domain chatbot.
  • Saeidi et al. (2018) Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC$^{2}$ Workshop.
  • Santhanam et al. (2021) Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2021. Colbertv2: Effective and efficient retrieval via lightweight late interaction.
  • Seonwoo et al. (2020) Yeon Seonwoo, Ji-Hoon Kim, Jung-Woo Ha, and Alice Oh. 2020. Context-aware answer extraction in question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2418–2428.
  • Shuster et al. (2022a) Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason Weston. 2022a. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 373–393, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Shuster et al. (2021) Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Shuster et al. (2022b) Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, and Jason Weston. 2022b. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage.
  • Taunk et al. (2023) Dhaval Taunk, Lakshya Khanna, Siri Venkata Pavan Kumar Kandru, Vasudeva Varma, Charu Sharma, and Makarand Tapaswi. 2023. Grapeqa: Graph augmentation and pruning to enhance question-answering. In Companion Proceedings of the ACM Web Conference 2023, pages 1138–1144.
  • Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  • Tourani et al. (2023) Ali Tourani, Hossein A. Rahmani, Mohammadmehdi Naghiaei, and Yashar Deldjoo. 2023. Capri: Context-aware interpretable point-of-interest recommendation framework.
  • Van Gysel and de Rijke (2018) Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An extremely fast python interface to trec_eval. In SIGIR. ACM.
  • Xiong et al. (2021a) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021a. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.
  • Xiong et al. (2021b) Wenhan Xiong, Xiang Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Scott Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021b. Answering complex open-domain questions with multi-hop dense retrieval. In International Conference on Learning Representations.
  • Xu et al. (2022) Jing Xu, Arthur Szlam, and Jason Weston. 2022. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5180–5197, Dublin, Ireland. Association for Computational Linguistics.
  • Yu et al. (2021) Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021. Few-shot conversational dense retrieval. In International ACM SIGIR Conference.

Appendix A Appendix

A.1 Experimental setups

Models Used

We use the models “facebook-dpr-question_encoder-multiset-base” (https://huggingface.co/facebook/dpr-question_encoder-multiset-base) and “facebook-dpr-ctx_encoder-multiset-base” (https://huggingface.co/facebook/dpr-ctx_encoder-multiset-base) for DPR Karpukhin et al. (2020), “msmarco-roberta-base-ance-firstp” (https://huggingface.co/sentence-transformers/msmarco-roberta-base-ance-firstp) for ANCE Xiong et al. (2021a), and “msmarco-distilbert-base-tas-b” (https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) Sanh et al. (2019); Hofstätter et al. (2021) for Sentence BERT Reimers and Gurevych (2019). We use the ColBERT model from the BEIR website (https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/models/ColBERT/msmarco.psg.l2.zip).

Hyper-parameters

On the validation set, for DPR, ANCE, and S-BERT, we set $\lambda=0.6$ and $beam=7,\ 7,\ 5$ for these three systems, respectively. Here “$beam$” denotes the number of top document candidates retrieved using $score_{dq}(d,q)$ in the first step of the PCAS approach; please see Section 4.2 for more details. On the test set, for DPR, ANCE, and S-BERT, we keep $\lambda=0.6$ and set all $beam$’s equal to $5$. For ColBERT, we set $\lambda=0.55$ and $beam$ to $5$ on both the validation and test sets.

A.2 Results on the test set

We present evaluation results for document retrieval on the test set with the systems from Section 4 in Table 4. We notice that the performance of the baselines is not stable on the test set across different retrievers. For example, with S-BERT, OR yields much better results than B2, but with ColBERT, B2 performs much better than OR, even though the only difference between B2 and OR in document retrieval is whether the given "gold" context is concatenated to the query. This indicates that there might be noisy data in the test set, and that the late-interaction characteristic of ColBERT might be at a disadvantage on this data. It also explains why B2 outperforms both OR and PCAS with ColBERT.

Retriever   Method   documents                contexts
                     R@1     R@5     M@5      R@1
ColBERT     OR       66.16   85.08   73.32    NA
            B1       39.36   60.30   47.40    NA
            B2       70.50   93.30   79.09    34.77
            B3       58.49   78.85   66.49    28.87
            PCAS     67.09   93.30   77.07    32.79
DPR         OR       33.21   56.38   41.70    NA
            B1       17.28   33.46   22.99    NA
            B2       29.25   57.10   39.31    34.13
            B3       23.30   41.93   29.89    33.59
            PCAS     29.92   58.96   40.24    35.74
ANCE        OR       63.55   83.52   71.20    NA
            B1       49.73   72.19   58.08    NA
            B2       59.80   83.73   68.72    45.51
            B3       54.45   77.24   62.95    36.83
            PCAS     59.92   83.73   68.94    45.89
S-BERT      OR       69.91   89.13   77.38    NA
            B1       47.32   75.18   57.90    NA
            B2       60.94   88.50   72.42    42.98
            B3       55.29   78.13   63.85    28.70
            PCAS     66.75   88.50   75.29    44.88
Table 4: Evaluation results for document retrieval (R@1, R@5, M@5) and context retrieval (last column) on the ORCA-ShARC test set. R@K denotes Recall@K scores. M@5 denotes MAP@5 scores. NA (not applicable) means that this approach does not involve retrieving context.