Densifying Sparse Representations for Passage Retrieval by Representational Slicing
Abstract
Learned sparse and dense representations capture complementary strengths for text retrieval, and fusing their results has proven to be more effective and robust than either alone. Prior work combines dense and sparse retrievers by fusing their model scores. As an alternative, this paper presents a simple approach to densifying sparse representations for text retrieval that does not involve any training. Our densified sparse representations (DSRs) are interpretable and can be easily combined with dense representations for end-to-end retrieval. We demonstrate that sparse and dense representations can be jointly learned within a single model and then combined for dense retrieval. Experimental results suggest that combining our DSRs with dense representations yields a balanced tradeoff between effectiveness and efficiency.
1 Introduction
Transformer-based bi-encoders have been widely used as first-stage retrievers for text retrieval. Previous work Reimers and Gurevych (2019); Chang et al. (2020); Karpukhin et al. (2020) trains models to project texts into a continuous embedding space for dense retrieval. Compared to traditional word-matching approaches (e.g., BM25), such dense retrievers are better able to tackle the vocabulary and semantic mismatch problems. Subsequent work further improves dense retrieval through knowledge distillation Lin et al. (2021); Hofstätter et al. (2021), hard negative mining Xiong et al. (2021); Zhan et al. (2021), or their combination Qu et al. (2021); Gao and Callan (2021). However, as Sciavolino et al. (2021) have shown, dense retrievers still fail on some easy cases, and it remains challenging to interpret why they sometimes perform poorly.

Recently, another thread of work uses bi-encoders to learn sparse representations for text retrieval. For example, Dai and Callan (2020) demonstrate that replacing tf–idf with contextualized term weights from a regression model significantly improves retrieval effectiveness. Mallia et al. (2021) further improve this approach by combining contextualized term weights with passage expansion models based on sequence-to-sequence transformers Nogueira and Lin (2019), addressing the vocabulary mismatch problem in sparse retrieval.
Around the same time, Formal et al. (2021b); Zhuang and Zuccon (2021) propose bi-encoder designs for jointly learning contextualized term weights and expansions. These papers demonstrate that sparse retrievers generate representations that are more interpretable to humans than those of dense retrievers. Furthermore, there is evidence Luan et al. (2021); Gao et al. (2021b); Lin and Ma (2021) that sparse representations compensate for the weaknesses of dense representations; these researchers fuse sparse and dense representations for text retrieval. Chen et al. (2021) recently propose to train another dense retriever (SPAR) to imitate sparse retrievers and to directly fuse the dense representations from SPAR and DPR.
The advantages of sparse retrievers motivate us to explore the following research question: Can we densify sparse representations? Our motivation is twofold: (1) densified representations are more likely to be interpretable and can serve as a good alternative to dense representations in scenarios where interpretability is important; (2) densified representations can be directly combined with other dense representations under the same physical retrieval framework, which makes these techniques easier to deploy.
We present an approach to densifying sparse representations by representational slicing. Unlike previous work Chen et al. (2021), our approach involves no training. In addition, we propose a gated inner product (GIP) operation to compute relevance scores between the densified sparse representations (DSRs) of queries and passages. This process is depicted in Fig. 1. Although our analysis indicates that GIP has higher time complexity than the inner product for dense vectors, with our proposed retrieve-and-rerank approach, DSR retrieval can be further sped up without any drop in retrieval effectiveness. Finally, our experiments show that combining DSRs with other dense representations yields performance tradeoffs that balance effectiveness, index size, and query latency.
2 Background
Following Lin et al. (2020), let us formulate the task of text (or ad hoc) retrieval as follows: Given a query $q$, the goal is to retrieve a ranked list of documents $\{d_1, \dots, d_k\} \subseteq \mathcal{D}$ to maximize some ranking metric, such as nDCG, where $\mathcal{D}$ is the collection of documents.
Specifically, given a (query, document) pair, we aim to maximize the following relevance score:

$$\phi\big(\eta_q(q), \eta_d(d)\big),$$

where $\eta_q(\cdot)$ and $\eta_d(\cdot)$ denote the functions mapping the query and document into $k$-dimensional vector representations, $\eta_q(q)$ and $\eta_d(d)$, respectively. The function $\phi$ quantifies the degree of relevance between the representations $\eta_q(q)$ and $\eta_d(d)$; it can be a simple inner product or a much more complex operation Nogueira and Cho (2019); Humeau et al. (2020).
Dense retrieval.
Transformer-based bi-encoders have been widely applied to the task of passage retrieval Reimers and Gurevych (2019); Chang et al. (2020); Karpukhin et al. (2020). The query and passage texts are first projected into low-dimensional dense vectors, and their inner product is computed as a measure of relevance:

$$\phi(q, p) \triangleq e^q_{\text{CLS}} \cdot e^p_{\text{CLS}},$$

where $e^q_{\text{CLS}}, e^p_{\text{CLS}} \in \mathbb{R}^{768}$ are the query and passage dense representations encoded by the [CLS] token from the final layer of BERT-base.
Sparse retrieval.
Zamani et al. (2018) was the first to demonstrate that neural networks can learn sparse representations for text retrieval. Recently, transformer-based bi-encoders have also been applied to sparse representation learning. Briefly, the solutions can be classified into two approaches. On the one hand, some researchers Dai and Callan (2020); Mallia et al. (2021); Zhuang and Zuccon (2021); Lin and Ma (2021) use BERT token embeddings to learn a contextualized term weight for each input token. On the other hand, some researchers Formal et al. (2021b); Bai et al. (2020) design models to directly learn sparse representations on the entire BERT wordpiece vocabulary.
Generally, all these sparse retrieval approaches can be considered as projecting queries and passages into $|V_{\text{BERT}}|$-dimensional vectors, where $|V_{\text{BERT}}|$ is the vocabulary size of BERT wordpiece tokens:

$$\phi(q, p) \triangleq v^q \cdot v^p,$$

where $v^q, v^p \in \mathbb{R}^{|V_{\text{BERT}}|}$. The value in each dimension is the corresponding token's term weight. As with dense retrieval, the relevance score between a query and a passage is computed by the inner product of their vector representations.
Gap between dense and sparse retrieval.
Although the inner product is the common operation for measuring relevance with both dense and sparse representations, there are still some major differences between them. (1) Unlike dense representations, sparse ones can be viewed as bags of words and are thus more interpretable, since each dimension of the vector corresponds to a word in the BERT vocabulary. (2) Retrieval with sparse representations is executed over inverted indexes due to their high dimensionality, while dense retrieval is often executed over flat indexes using Faiss Johnson et al. (2017). This gap between the execution of dense and sparse retrieval increases the complexity of dense–sparse hybrid retrieval systems. Previous work Luan et al. (2021); Gao et al. (2021b); Lin and Ma (2021) realizes hybrid retrieval by combining dense and sparse retrieval scores from different retrieval systems, which requires further post-processing. Also, it is not trivial to perform dense and sparse retrieval (i.e., over inverted and Faiss flat indexes) simultaneously. Thus, we are motivated to explore approaches to densifying sparse representations.

3 Methodology
In this section, we first introduce our approach to densifying sparse representations into low-dimensional dense vectors. We also propose a simple scoring model for text retrieval with the densified sparse representations (DSRs).
3.1 Densifying Sparse Representations
Sparse representations can be considered vectors with $V$ dimensions, i.e., $v \in \mathbb{R}^{V}$, where $V \gg 768$. We first divide the vectors into $M$ slices of smaller vectors, each of which has $N$ dimensions (i.e., $M \cdot N = V$). In terms of standard "slice" notation:

$$v^q_m = v^q[m \cdot N : (m+1) \cdot N], \qquad (1)$$
$$v^p_m = v^p[m \cdot N : (m+1) \cdot N], \qquad (2)$$

where $m \in \{0, 1, \dots, M-1\}$. Note that the slicing can be performed in different ways, e.g., contiguous, strided (where the array is sliced with a fixed stride), or random; for simplicity, we use contiguous slicing in our presentation. Thus, the inner product between $v^q$ and $v^p$ can be rewritten as the summation of the inner products of all their slices:

$$v^q \cdot v^p = \sum_{m=0}^{M-1} v^q_m \cdot v^p_m. \qquad (3)$$
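For example, the decomposition in Eq. (3) can be checked directly with a few lines of PyTorch; this toy snippet (our own, with arbitrary small V and M) simply confirms that summing per-slice inner products recovers the full inner product:

```python
import torch

# Toy check of Eq. (3): summing the inner products of M contiguous slices
# recovers the full inner product of the original V-dimensional vectors.
V, M = 12, 4
N = V // M
vq, vp = torch.rand(V), torch.rand(V)

sliced = sum(vq[m * N:(m + 1) * N] @ vp[m * N:(m + 1) * N] for m in range(M))
assert torch.isclose(sliced, vq @ vp)
```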
Approximate sparse representations.
Intuitively, if a representation is sparse enough, we can assume that each slice has only one non-zero entry. Thus, we can approximate $v^q$ ($v^p$) by keeping only the entry with the maximum value in each slice:

$$\tilde{v}^q_m = \max(v^q_m) \cdot \mathbb{1}_{\arg\max(v^q_m)}, \qquad (4)$$
$$\tilde{v}^p_m = \max(v^p_m) \cdot \mathbb{1}_{\arg\max(v^p_m)}, \qquad (5)$$

where $\mathbb{1}_{j}$ is a unit vector whose only non-zero entry is at position $j$. Thus, the inner product of the sparse vectors $v^q$ and $v^p$ in Eq. (3) can be approximated as follows:

$$v^q \cdot v^p \approx \sum_{m=0}^{M-1} \tilde{v}^q_m \cdot \tilde{v}^p_m. \qquad (6)$$
Densified sparse representations.
Observing Eq. (6), in order to compute the approximate inner product of sparse vectors, each query (document) can alternatively be represented as two $M$-dimensional dense vectors:

$$v^q_{\text{val}}[m] = \max(v^q_m), \qquad v^q_{\text{ind}}[m] = \arg\max(v^q_m),$$

where $v^q_{\text{val}}$ ($v^p_{\text{val}}$) is the dense vector storing the maximum values from the query (document) slices, and $v^q_{\text{ind}}$ ($v^p_{\text{ind}}$) is the integer dense vector storing the positions of those maximum values within the corresponding slices. We call the representations $v^q_{\text{val}}$ and $v^q_{\text{ind}}$ ($v^p_{\text{val}}$ and $v^p_{\text{ind}}$) the densified sparse representations (DSRs) for queries (documents).
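To make the construction concrete, the following is a minimal PyTorch sketch of densifying a single sparse vector with contiguous slices; the function name and the assumption that V is divisible by M are ours, not from the paper's released code.

```python
import torch

def densify(v_sparse: torch.Tensor, M: int):
    """Turn a V-dimensional sparse vector into two M-dimensional DSR vectors:
    the per-slice maximum values and the positions of those maxima (Eqs. 4-5).
    Assumes contiguous slicing and that V is divisible by M."""
    V = v_sparse.shape[-1]
    N = V // M
    slices = v_sparse.view(M, N)             # Eqs. (1)-(2): M contiguous slices
    values, positions = slices.max(dim=-1)   # keep one entry per slice
    return values.to(torch.float16), positions.to(torch.uint8)

# toy usage: V = 12 and M = 4, so each slice has N = 3 entries
v = torch.zeros(12)
v[1], v[7] = 0.8, 1.5
vals, idx = densify(v, M=4)                  # vals: (4,), idx: (4,)
```

Storing the positions as uint8 works as long as each slice holds at most 256 entries, which is the case for the settings used later in this paper.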
Gated inner product.
Using DSRs, Eq. (6) can be rewritten as:

$$v^q \cdot v^p \approx \sum_{m=0}^{M-1} \mathbb{1}\big[v^q_{\text{ind}}[m] = v^p_{\text{ind}}[m]\big] \cdot v^q_{\text{val}}[m] \cdot v^p_{\text{val}}[m], \qquad (7)$$

where $v^q_{\text{val}}[m]$ and $v^q_{\text{ind}}[m]$ ($v^p_{\text{val}}[m]$ and $v^p_{\text{ind}}[m]$) are the $m$-th entries of the query (document) DSRs, and $\mathbb{1}[\cdot]$ denotes the indicator function. We call the operation in Eq. (7) the gated inner product (GIP), which can be considered an approximation of the inner product of the original sparse vectors.
However, note that for each dimension the gated inner product requires an index comparison in addition to the multiply-add of a standard inner product, so scoring with $M$-dimensional DSRs is more expensive than a standard inner product between $M$-dimensional dense vectors. When conducting brute-force search over a corpus $\mathcal{D}$, this per-dimension overhead is multiplied by $|\mathcal{D}|$, so the difference becomes even larger.
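The GIP is nevertheless straightforward to express with array operations; below is a sketch (our own function, in PyTorch) that scores one query against a batch of passages. The extra equality test on the index vectors is exactly the per-dimension overhead discussed above.

```python
import torch

def gated_inner_product(q_val, q_idx, p_val, p_idx):
    """Gated inner product (Eq. 7): a dimension contributes to the score only
    when the query and passage kept the same position within that slice.
    q_val, q_idx: (M,) query DSRs; p_val, p_idx: (B, M) passage DSRs."""
    gate = (q_idx.unsqueeze(0) == p_idx).to(p_val.dtype)    # (B, M) 0/1 mask
    return (gate * q_val.unsqueeze(0) * p_val).sum(dim=-1)  # (B,) relevance scores
```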
Retrieval and reranking.
In order to address the issue of high time complexity, we propose a retrieve-and-rerank approach. In the first stage, we perform approximate brute-force search by computing the gated inner product between the query and document DSRs over only a subset of dimensions:

$$v^q \cdot v^p \approx \sum_{m \in \mathcal{I}^q_{\tau}} \mathbb{1}\big[v^q_{\text{ind}}[m] = v^p_{\text{ind}}[m]\big] \cdot v^q_{\text{val}}[m] \cdot v^p_{\text{val}}[m], \qquad (8)$$

where $\mathcal{I}^q_{\tau} = \{m \mid v^q_{\text{val}}[m] > \tau\}$ is the set of indices used for the GIP computation, and $\tau$ is a hyperparameter controlling the value threshold. That is, this first-stage retrieval relies only on the dimensions where $v^q_{\text{val}}$ has a large value. The intuition behind this design is that user queries usually contain only a few terms; thus, only a few dimensions in the sparse query vector have an impact on retrieval effectiveness, which is likely also true for the DSRs. In the second (reranking) stage, the retrieved documents are first sorted by the approximate gated inner product; then, the top-$k$ documents are reranked by the gated inner product over all dimensions, as in Eq. (7).
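In PyTorch, the two-stage procedure might look like the sketch below. This is our own illustration: the threshold value is arbitrary (the actual setting is examined in Section 5.2), and the corpus DSRs are assumed to fit in memory as two (|C|, M) tensors.

```python
import torch

def retrieve_then_rerank(q_val, q_idx, P_val, P_idx, tau=0.5, topk=10_000):
    """Stage 1: approximate GIP (Eq. 8) over only the query dimensions whose
    value exceeds tau. Stage 2: exact GIP (Eq. 7) over the top-k candidates.
    q_val, q_idx: (M,) query DSRs; P_val, P_idx: (C, M) corpus DSRs."""
    keep = q_val > tau                                       # I^q: "large" query dims
    gate = (q_idx[keep] == P_idx[:, keep]).to(P_val.dtype)   # (C, K) partial gate
    approx = (gate * q_val[keep] * P_val[:, keep]).sum(-1)   # first-stage scores
    cand = approx.topk(min(topk, P_val.shape[0])).indices    # candidate passages

    full_gate = (q_idx == P_idx[cand]).to(P_val.dtype)       # (topk, M)
    exact = (full_gate * q_val * P_val[cand]).sum(-1)        # rerank with full GIP
    order = exact.argsort(descending=True)
    return cand[order], exact[order]
```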
Table 1: Retrieval effectiveness of the original sparse representations (30K-dimensional rows) and their densified counterparts at different dimensions M, on the MS MARCO development set and TREC DL 2019/2020.

| Model | M (Dims) | Dev MRR@10 | Dev R@1K | DL 2019 nDCG@10 | DL 2019 R@1K | DL 2020 nDCG@10 | DL 2020 R@1K |
|---|---|---|---|---|---|---|---|
| DSR-uniCOIL (no expansion) | 30K | 0.312 | 0.925 | 0.621 | 0.775 | 0.633 | 0.794 |
| | 768 | 0.309 | 0.923 | 0.615 | 0.769 | 0.632 | 0.792 |
| | 256 | 0.305 | 0.919 | 0.605 | 0.751 | 0.622 | 0.788 |
| | 128 | 0.300 | 0.913 | 0.605 | 0.744 | 0.614 | 0.782 |
| DSR-SPLADE | 30K^a | 0.340 | 0.965 | 0.684 | 0.851 | - | - |
| | 768 | 0.343 | 0.957 | 0.681 | 0.831 | 0.655 | 0.824 |
| | 256 | 0.338 | 0.953 | 0.667 | 0.820 | 0.651 | 0.822 |
| | 128 | 0.332 | 0.946 | 0.657 | 0.810 | 0.657 | 0.812 |

^a These numbers are copied from SPLADE-max in Formal et al. (2021a) for reference only and are not directly comparable to our results.


Table 2: Comparison of DSR models with other dense retrievers in terms of retrieval effectiveness, index storage, and query latency. Blank storage and latency cells share the value of the row above.

| Model | Sparse Dim | Dense Dim | Dev MRR@10 | DL 2019 nDCG@10 | DL 2020 nDCG@10 | Storage (GB) | Latency (ms/q) |
|---|---|---|---|---|---|---|---|
| ColBERT Khattab and Zaharia (2020) | 0 | 128/tok | 0.360 | - | - | 154 | 458^a |
| COIL Gao et al. (2021a) | 32/tok | 768 | 0.355 | 0.704 | - | 50 | 41^a |
| Dense-[CLS] (our dense baseline) | 0 | 768 | 0.310 | 0.626 | 0.625 | 13 | 115 |
| DSR-uniCOIL | 768 | 0 | 0.309 | 0.615 | 0.632 | 20 | 40 |
| DSR-SPLADE | 768 | 0 | 0.343 | 0.681 | 0.655 | | |
| DSR-uniCOIL + Dense-[CLS] | 768 | 768 | 0.350 | 0.678 | 0.704 | 32 | 2×24^b |
| DSR-SPLADE + Dense-[CLS] | 768 | 768 | 0.352 | 0.697 | 0.677 | | |
| DSR-uniCOIL + Dense-[CLS] | 256 | 256 | 0.346 | 0.680 | 0.688 | 11 | 34 |
| DSR-SPLADE + Dense-[CLS] | 256 | 256 | 0.348 | 0.711 | 0.678 | | |
| DSR-uniCOIL + Dense-[CLS] | 128 | 128 | 0.337 | 0.679 | 0.671 | 5 | 32 |
| DSR-SPLADE + Dense-[CLS] | 128 | 128 | 0.344 | 0.709 | 0.673 | | |

^a These numbers are copied from the original papers, with different execution environments.
^b This condition cannot run on our single GPU; thus, we divide the index into 2 shards, each with a latency of 24 ms.
3.2 Fusion with Dense Representations
One of the advantages of densifying sparse representations is that it makes the fusion of sparse and dense retrieval easier, since it becomes the fusion of two dense representations. Their fusion score can be computed as follows:

$$\phi(q, p) = \sum_{m=0}^{M-1} \mathbb{1}\big[v^q_{\text{ind}}[m] = v^p_{\text{ind}}[m]\big] \cdot v^q_{\text{val}}[m] \cdot v^p_{\text{val}}[m] + \lambda \cdot e^q_{\text{CLS}} \cdot e^p_{\text{CLS}}, \qquad (9)$$

where $\lambda$ is a hyperparameter weighting the dense component. Note that the sparse and dense representations can come from the same model via joint training or from separately trained models. Eq. (9) can be implemented with existing packages that support common array operations; we use PyTorch in our experiments.
4 Experimental Setup
Dataset descriptions.
(a) MS MARCO Dev: 6980 queries comprise the development set of the MS MARCO passage ranking test collection, with on average one relevant passage per query. We report MRR@10 and R@1K as the top-$k$ retrieval measures. (b) TREC DL Craswell et al. (2019, 2020): the organizers of the 2019 (2020) Deep Learning Track at the Text REtrieval Conference (TREC) released 43 (53) queries with multiple graded relevance labels, where (query, passage) pairs were annotated by NIST assessors. We report nDCG@10 and R@1K for these evaluation sets.
Sparse retrieval models.
In our experiments, we explore two previous sparse retrieval models based on BERT-base:
1. uniCOIL Lin and Ma (2021) uses token embeddings from the final layer of BERT to generate a one-dimensional representation (a term weight) for each input query and passage token. In our experiments, we use uniCOIL without passage expansion for simplicity.
2. SPLADE Formal et al. (2021b) uses the masked language modeling (MLM) head to project each token embedding from the query (or passage) into a $|V_{\text{BERT}}|$-dimensional sparse representation, where $|V_{\text{BERT}}|$ is the BERT wordpiece vocabulary size. Following Formal et al. (2021a), we conduct max pooling over the sparse representations generated by all query (or passage) tokens. Note that unlike Formal et al. (2021a), we do not control the sparsity of the generated representations during training.
We denote the densified sparse representations from uniCOIL (SPLADE) as DSR-uniCOIL (DSR-SPLADE). We train all models on the MS MARCO "small" triples training set for 100k steps with a batch size of 96 and a learning rate of 7e-6. In our experiments, we measure retrieval latency using a single NVIDIA Titan RTX GPU with batch size 1.
We densify the 30522-dimensional representations into 768-dimensional vectors by (1) discarding the first 570 unused tokens in the BERT vocabulary and (2) dividing the remaining 29952 tokens into 768 slices. Each slice contains 39 tokens assigned according to token ids; i.e., $M = 768$ and $N = 39$. In addition, we also conduct experiments densifying into 256 and 128 dimensions, where each slice contains 117 and 234 tokens, respectively. The "value" and "index" dense vectors (see Fig. 1) are stored as float16 and uint8, respectively.
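For concreteness, the mapping from the 29952 remaining token ids to slices under the three strategies compared in Section 5.4 can be sketched as follows. This helper is our own and reflects one plausible reading of "strided" slicing; it is not the released implementation.

```python
import numpy as np

def slice_assignment(num_tokens=29952, M=768, strategy="stride", seed=0):
    """Return, for each remaining token id, the index of the slice it joins.
    contiguous: ids [0, N), [N, 2N), ... share a slice (grouped by token id);
    stride:     slice = id % M, so each slice takes every M-th token id;
    random:     a fixed random permutation of ids, then contiguous grouping."""
    N = num_tokens // M                        # tokens per slice (39 when M = 768)
    ids = np.arange(num_tokens)
    if strategy == "contiguous":
        return ids // N
    if strategy == "stride":
        return ids % M
    if strategy == "random":
        perm = np.random.default_rng(seed).permutation(num_tokens)
        return perm // N
    raise ValueError(f"unknown strategy: {strategy}")
```

With any of these assignments, the within-slice position of each token fits in a uint8, since a slice holds at most 234 tokens.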
5 Results
5.1 Densifying Sparse Representations
Table 1 shows our results for densifying sparse representations. The first row (with 30K dimensions) in each block reports the retrieval effectiveness of the original sparse representations, which can be considered an upper bound. Note that uniCOIL's representations are highly sparse, and thus retrieval can be performed using an inverted index. However, since our SPLADE model is trained without any sparsity constraints, the output query and passage representations are actually quite dense and challenging for end-to-end retrieval; see Mackenzie et al. (2021) for more discussion. For reference, we report the retrieval effectiveness of the SPLADE-max model from the original paper Formal et al. (2021a), which does include sparsification that renders the model amenable to inverted indexes.
As we can see from Table 1, our method is able to densify the original 30K-dimensional sparse vectors into 768-dimensional vectors with only a minimal drop in retrieval effectiveness. Densifying further to 256 and 128 dimensions incurs only slightly larger degradation. These results demonstrate the advantage of our approach, which does not require any training.
5.2 Retrieve and Rerank
In Section 3, we discussed the disadvantage of the gated inner product operation, namely that it is more expensive than the standard inner product. In these experiments, we test DSR-SPLADE with 768 dimensions on the MS MARCO development queries to demonstrate how our proposed retrieve-and-rerank approach remedies this issue.
In Fig. 3, the red lines show the effectiveness and efficiency of the retrieve-and-rerank approach under different settings of the threshold $\tau$ (which controls the value threshold in GIP), while the blue lines depict first-stage retrieval only; note that the latency axis is in log scale. The point where the gated inner product is computed over all dimensions represents the retrieval effectiveness upper bound of the model. As $\tau$ increases, first-stage (approximate) retrieval efficiency improves dramatically but effectiveness drops. However, after reranking the top-10K candidates from first-stage retrieval, which consumes only an additional 10 ms of latency, we observe no effectiveness drop. Thus, we adopt this retrieve-and-rerank setting in the following experiments.
5.3 Comparisons to Other Dense Models
Table 2 compares DSR with other dense retrievers in terms of retrieval effectiveness, latency, and index size. For a fair comparison, we exclude dense models trained with more sophisticated techniques, such as knowledge distillation and hard negative mining. We implement a bi-encoder model (denoted "Dense-[CLS]") trained under the same conditions as our sparse models; this model takes the [CLS] token as the output representation. Note that the latency of Dense-[CLS] is measured using a Faiss flat index.
In the conditions denoted DSR-{uniCOIL, SPLADE} + Dense-[CLS], we train a bi-encoder model for sparse and dense representations jointly, where the loss takes into account inner products from both the [CLS] token and the sparse vectors. After training, we densify the sparse representations as described above. During inference, we apply our retrieve-and-rerank approach, where 10K candidates are first retrieved using only DSRs and then reranked by the fusion of DSRs and dense representations, as in Eq. (9). Here we set $\lambda$ to 1.
Among the dense retrievers considered here, ColBERT is the most effective; COIL, which can be thought of as a variant of ColBERT, reduces the time complexity by adding a lexical-match prior. However, both multi-vector designs still require much more index storage than single-vector dense retrievers. Our fusion models combining 768-dimensional DSRs with dense [CLS] representations show retrieval effectiveness close to COIL but with a much smaller index. When reducing the dimensionality of the DSR models, we see a tradeoff between efficiency and effectiveness. However, it is worth noting that our fusion models with 128-dimensional vectors show better efficiency and effectiveness than both the dense-only and the DSR-only models. This shows that our densified sparse representations provide relevance signals complementary to those of "pure" dense retrieval models.
Table 3: Effectiveness of DSR-SPLADE (768 dimensions) on the MS MARCO development set with different slicing strategies.

| DSR-SPLADE | MRR@10 | R@1K |
|---|---|---|
| Stride | 0.343 | 0.957 |
| Contiguous | 0.332 | 0.951 |
| Random | 0.343 | 0.957 |
5.4 Alternative Slicing Strategies
Finally, we examine different slicing strategies for densifying sparse representations, using DSR-SPLADE with 768 dimensions in these experiments. Table 3 reports the results. Our default setting in the previous experiments is strided slicing, and we observe that it has the same effectiveness as random slicing. Surprisingly, the contiguous slicing strategy shows degraded ranking effectiveness. This indicates that max pooling over contiguous slices could lose more information than the other strategies. It is worth exploring other slicing strategies, which we leave for future work.
6 Conclusions
In this paper, we study how to densify sparse representations for passage retrieval. Our approach requires no training and can be considered an approximation of the original sparse representations. The relevance score between densified sparse representations (DSRs) can be computed with our proposed gated inner product (GIP), a variant of the standard inner product. Although our analysis indicates that GIP has higher time complexity than the inner product, our experiments show that, in practice, DSR retrieval can be conducted through a retrieve-and-rerank approach that is 10 times faster without sacrificing any effectiveness. Finally, we empirically show that a bi-encoder model can become a more robust retriever by combining DSRs with other "standard" dense representations.
Acknowledgements
This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada. Additionally, we would like to thank the support of Cloud TPUs from Google’s TPU Research Cloud (TRC).
References
- Bai et al. (2020) Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. 2020. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv:2010.00768.
- Chang et al. (2020) Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. In Proc. ICLR.
- Chen et al. (2021) Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta, and Wen-tau Yih. 2021. Salient phrase aware dense retrieval: Can a dense retriever imitate a sparse one? arXiv:2110.06918.
- Craswell et al. (2019) Nick Craswell, Bhaskar Mitra, and Daniel Campos. 2019. Overview of the TREC 2019 deep learning track. In Proc. TREC.
- Craswell et al. (2020) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 deep learning track. In Proc. TREC.
- Dai and Callan (2020) Zhuyun Dai and Jamie Callan. 2020. Context-aware term weighting for first stage passage retrieval. In Proc. SIGIR, pages 1533–1536.
- Formal et al. (2021a) Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021a. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv:2109.10086.
- Formal et al. (2021b) Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021b. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proc. SIGIR, pages 2288–2292.
- Gao and Callan (2021) Luyu Gao and Jamie Callan. 2021. Condenser: a pre-training architecture for dense retrieval. In Proc. EMNLP, pages 981–993.
- Gao et al. (2021a) Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021a. COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In Proc. NAACL, pages 3030–3042.
- Gao et al. (2021b) Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. 2021b. Complement lexical retrieval model with semantic residual embeddings. In Proc. ECIR, pages 146–160.
- Hofstätter et al. (2021) Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proc. SIGIR, pages 113–122.
- Humeau et al. (2020) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proc. ICLR.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv:1702.08734.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proc. EMNLP, pages 6769–6781.
- Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proc. SIGIR, pages 39–48.
- Lin and Ma (2021) Jimmy Lin and Xueguang Ma. 2021. A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques. arXiv:2106.14807.
- Lin et al. (2020) Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained transformers for text ranking: BERT and beyond. arXiv:2010.06467.
- Lin et al. (2021) Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proc. RepL4NLP, pages 163–173.
- Luan et al. (2021) Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, dense, and attentional representations for text retrieval. Trans. Assoc. Comput. Linguistics, pages 329–345.
- Mackenzie et al. (2021) Joel Mackenzie, Andrew Trotman, and Jimmy Lin. 2021. Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation. arXiv:2110.11540.
- Mallia et al. (2021) Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning passage impacts for inverted indexes. In Proc. SIGIR, pages 1723–1727.
- Nogueira and Cho (2019) Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv:1901.04085.
- Nogueira and Lin (2019) Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery.
- Qu et al. (2021) Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proc. NAACL, pages 5835–5847.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proc. EMNLP, pages 3982–3992.
- Sciavolino et al. (2021) Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. 2021. Simple entity-centric questions challenge dense retrievers. In Proc. EMNLP.
- Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In Proc. ICLR.
- Zamani et al. (2018) Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In Proc. CIKM, pages 497–506.
- Zhan et al. (2021) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Optimizing dense retrieval model training with hard negatives. In Proc. SIGIR, pages 1503–1512.
- Zhuang and Zuccon (2021) Shengyao Zhuang and Guido Zuccon. 2021. TILDE: Term independent likelihood model for passage re-ranking. In Proc. SIGIR, pages 1483–1492.