
SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval

Yang Bai Tsinghua University Xiaoguang Li Huawei Noah’s Ark Lab Gang Wang Huawei Noah’s Ark Lab Chaoliang Zhang Huawei Noah’s Ark Lab Lifeng Shang Huawei Noah’s Ark Lab Jun Xu Renmin University of China Zhaowei Wang Huawei Noah’s Ark Lab Fangshan Wang Huawei Technologies Co., Ltd  and  Qun Liu Huawei Noah’s Ark Lab
Abstract.

Term-based sparse representations dominate first-stage text retrieval in industrial applications, due to their advantages in efficiency, interpretability, and exact term matching. In this paper, we study the problem of transferring the deep knowledge of pre-trained language models (PLMs) to Term-based Sparse representations, aiming to improve the representation capacity of the bag-of-words (BoW) method for semantic-level matching while still keeping its advantages. Specifically, we propose a novel framework, SparTerm, to directly learn sparse text representations in the full vocabulary space. The proposed SparTerm comprises an importance predictor to predict the importance of each term in the vocabulary and a gating controller to control term activation. These two modules cooperatively ensure the sparsity and flexibility of the final text representation, unifying term weighting and expansion in the same framework. Evaluated on the MSMARCO dataset, SparTerm significantly outperforms traditional sparse methods and achieves state-of-the-art ranking performance among all PLM-based sparse models.

Fast Retrieval, Sparse Representation, BERT

1. Introduction

Text retrieval in response to a natural language query is a core task for information retrieval (IR) systems. Most recent work adopts a two-stage pipeline to tackle this problem, where an initial set of documents is first retrieved from the document collection by a fast retriever and then re-ranked by more sophisticated models.

For the first-stage retrieval, neural dense representations show great potential for semantic matching and outperform sparse methods in many NLP tasks, but this is not necessarily true in scenarios that emphasize long document retrieval and exact matching (Luan et al., 2020). Moreover, for extremely large candidate collections (e.g., 10 billion documents), dense methods struggle with the efficiency vs. accuracy tradeoff. Classical term-based sparse representations, also known as bag-of-words (BoW), such as TF-IDF (Sparck-jones, 1972) and BM25 (Robertson and Walker, 1994), can efficiently perform literal matching and thus play a core role in industrial IR systems. However, traditional term-based methods are generally considered to have insufficient representation capacity and to be inadequate for semantic-level matching.

Figure 1. Comparison between the BoW and SparTerm representations. The depth of the color represents the term weight; deeper means higher. Compared with BoW, SparTerm is able to identify the semantically important terms and to expand terms that do not appear in the passage but are semantically relevant, even terms from the target query such as “sign”.

Some attempts have been made to take sparse methods beyond lexical matching while still keeping their advantages. SNRM (Zamani et al., 2018) learns latent sparse representations for the query and document based on dense neural models, in which the “latent” token plays the role of the traditional term during inverted indexing. One challenge with SNRM is that it loses the interpretability of the original terms, which is critical for industrial systems.

Recently proposed pre-trained language models (PLMs) such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) show superior performance in many NLP tasks, providing new opportunities to transfer deep contextualized knowledge from dense representations to sparse models. Focusing on the relevance relationship between a passage/document and the corresponding query, DeepCT (Dai and Callan, 2019) and Doc2query (Nogueira et al., 2019) learn PLM-based models to enhance the performance of traditional BoW methods. The difference is that DeepCT learns a regression model to re-weight terms with contextualized representations, while Doc2query learns an encoder-decoder generative model to expand passages with predicted query terms. Both methods train an auxiliary intermediate model that then helps refine the final sparse representations to achieve better text ranking performance.

In this paper, we propose a novel framework, SparTerm, to learn Term-based Sparse representations directly in the full vocabulary space. Equipped with a pre-trained language model, the proposed SparTerm learns a function that maps the frequency-based BoW representation to a sparse term importance distribution over the whole vocabulary, which offers the flexibility to involve both term weighting and expansion in the same framework. As shown in Figure 1, compared with the BoW representation, SparTerm assigns higher weights to terms with high distinguishability given the context, and expands extra terms that can hopefully bridge the lexical gap with future queries. We empirically show that SparTerm significantly increases the upper limit of sparse retrieval methods and offers new insights into transferring deep knowledge from PLM-based representations to simple BoW representations.

More specifically, SparTerm comprises an importance predictor and a gating controller. The importance predictor maps the raw input text to a dense importance distribution in the vocabulary space, which is different from traditional term weighting methods that only consider literal terms of the input text. To ensure the sparsity and flexibility of the final representation, the gating controller is introduced to generate a binary and sparse gating signal across the dimension of vocabulary size, indicating which tokens should be activated. These two modules cooperatively yield a term-based sparse representation based on the semantic relationship of the input text with each term in the vocabulary.

Our contributions. In summary, we propose to directly learn term-based sparse representations in the full vocabulary space. The proposed SparTerm indicates that there is much room for improving the ranking performance of term-based representations while still keeping the interpretability and efficiency of BoW methods. Evaluated on the MSMARCO (Nguyen et al., 2016) dataset, SparTerm significantly outperforms previous sparse models built on PLMs of comparable size. The top-ranking performance of SparTerm even surpasses Doc2query-T5, which is based on a pre-trained model with 2x the model size and 70x the pre-training corpus size. Moreover, we conduct further empirical analysis of how the deep knowledge of PLMs can be transferred to sparse methods, which gives new insights for sparse representation learning.

2. Related Work

Our work relates to two research fields: bag-of-words representations and pre-trained language models for text retrieval.

2.1. Bag-of-words Methods

Bag-of-words (BoW) methods have played a central role in first-stage retrieval. These methods convert a document or query into a set of single terms, and each term is associated with a weight that characterizes its importance. Most early common practice adopted TF-IDF style models to calculate the weights. Robertson and Walker (Robertson and Walker, 1994) proposed the well-known BM25 method, which further improves on the original TF-IDF. Later methods, such as (Lafferty and Zhai, 2001), (Zaragoza et al., 2004), and (Strohman et al., 2005), did not show much advantage over BM25. More recently, Zamani et al. (Zamani et al., 2018) proposed SNRM to learn a sparse coding in a hidden space using weak supervision, which shows good potential for solving the “lexical mismatch” problem. However, the latent, unexplainable tokens cannot ensure that documents with exactly matched terms are retrieved.

2.2. PLMs for dense text retrieval

Pre-trained language models like BERT (Devlin et al., 2018) open new possibilities for text retrieval. Based on dense representations, Lee et al. (Lee et al., 2019) proposed ORQA, a bi-encoder architecture that retrieves candidate passages for question answering using FAISS (Johnson et al., 2017). However, the analysis in (Luan et al., 2020) concludes that bi-encoders based on dense representations suffer from capacity limitations in scenarios that emphasize long document retrieval and exact matching. Following the late-interaction paradigm, Khattab and Zaharia (Khattab and Zaharia, 2020) proposed ColBERT to conduct efficient interaction between the query and document, which runs about 150x faster than fully-interactive BERT while achieving comparable precision. Though much faster than BERT, ColBERT is still not computationally feasible for large-scale first-stage retrieval, due to the existence of the late interaction layer.

Figure 2. Model Architecture of SparTerm. Our overall architecture contains an importance predictor and a gating controller. The importance predictor generates a dense importance distribution with the dimension of vocabulary size, while the gating controller outputs a sparse and binary gating vector to control term activation for the final representation. These two modules cooperatively ensure the sparsity and flexibility of the final representation.

2.3. PLMs for sparse text retrieval

Several PLM-based models have emerged to improve traditional sparse BoW representations. Dai and Callan (Dai and Callan, 2019) proposed DeepCT to estimate a term’s weight considering its contextualized information, and this work was later extended to generate document-level term weights (Dai and Callan, 2020). Another work, Doc2query (Nogueira et al., 2019), tries to “translate” a document into potential queries to expand its content, which also shows a large improvement over the traditional BM25 method. The biggest difference between our work and these two methods is that DeepCT and Doc2query train an auxiliary intermediate model to help refine the sparse representations, while SparTerm is designed to directly learn sparse representations over the whole vocabulary.

3. Sparse Representation Learning

This section presents the model architecture of SparTerm and the corresponding training strategy.

3.1. Overview

Figure 2(a) depicts the general architecture of SparTerm, which comprises an importance predictor and a gating controller. Given the original textual passage $p$, we aim to map it into a deep and contextualized sparse representation $p'$ in the vocabulary space. The mapping process can be formulated as:

(1) $p' = \mathcal{F}(p) \odot \mathcal{G}(p)$

where $\mathcal{F}$ is the term importance predictor and $\mathcal{G}$ the gating controller. The importance predictor $\mathcal{F}$ generates a dense vector representing the semantic importance of each term in the vocabulary. The gating controller $\mathcal{G}$ generates a binary gating vector to control which terms appear in the final sparse representation. To achieve this, we let $\|\mathcal{G}(p)\| < \lambda$ and $\mathcal{G}(p) \in \{0,1\}^{v}$, where $\lambda$ is the maximum number of non-zero elements of $p'$, and $v$ the vocabulary size. These two modules cooperatively ensure the sparsity and flexibility of the final representation $p'$. We discuss the detailed model architecture and learning strategy for $\mathcal{F}$ and $\mathcal{G}$ in the following sections.
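To make the composition in Equation (1) concrete, here is a minimal sketch with a toy vocabulary; the dense importance vector and the binary gate are hard-coded stand-ins for the BERT-based modules described in Sections 3.2 and 3.3.

```python
# Toy illustration of Equation (1): the sparse representation keeps the
# importance score only for terms activated by the gating controller.
# The vectors below are illustrative stand-ins, not model outputs.
import numpy as np

importance = np.array([0.0, 2.1, 0.0, 0.7, 1.3, 0.0, 0.4, 0.0])  # F(p), dense, non-negative
gating     = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])  # G(p), binary, ||G(p)|| < lambda

sparse_rep = importance * gating        # p' = F(p) ⊙ G(p)
print(sparse_rep)                       # only gated terms keep their importance
print(np.count_nonzero(sparse_rep))     # 3: sparsity is bounded by the gate
```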

Term expansion Description and examples
Passage2query Expand words that tend to appear in corresponding queries, e.g., “how far”, “what causes”.
Synonym Expand synonyms for the original core words, e.g., “cartoon” → “animation”.
Co-occurred words Expand frequently co-occurring words for the original core words, e.g., “earthquakes” → “ruins”.
Summarization words Expand summarization words that tend to appear in passage summaries or tags.
Table 1. Different kinds of term expansion.

3.2. The Importance Predictor

Given the input passage $p$, the importance predictor outputs the semantic importance of every term in the vocabulary, which unifies term weighting and expansion in one framework. As shown in Figure 2(b), prior to importance prediction, a BERT-based encoder is employed to obtain a deep contextualized embedding $h_i$ for each term $w_i$ in the passage $p$. Each $h_i$ models the surrounding context from a certain position $i$, thus providing a different view of which terms are semantically related to the topic of the current passage. With a token-wise importance predictor, we obtain a dense importance distribution $I_i$ of dimension $v$ for each $h_i$:

(2) $I_i = \mathrm{Transform}(h_i)\,E^{\mathrm{T}} + b$

where $\mathrm{Transform}$ denotes a linear transformation with GELU activation and layer normalization, $E$ is the shared word embedding matrix, and $b$ the bias term. Note that the token-wise importance prediction module is similar to the masked language prediction layer in BERT, so we can initialize this part of the parameters directly from pre-trained BERT. The final passage-wise importance distribution is obtained simply by summing all token-wise importance distributions:

(3) $I = \sum_{i=0}^{L} \mathrm{Relu}(I_i)$

where $L$ is the sequence length of passage $p$, and the ReLU activation function is leveraged to ensure the non-negativity of the importance logits.
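The token-wise prediction and ReLU-sum pooling in Equations (2) and (3) can be sketched in PyTorch as below; module and argument names are illustrative, and the encoder producing the contextual embeddings $h_i$ is assumed to be a standard BERT.

```python
# Sketch of the importance predictor (Equations 2-3); names are illustrative.
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, word_embeddings: nn.Embedding):
        super().__init__()
        # Transform: linear layer + GELU + LayerNorm, mirroring BERT's MLM head
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.embedding_weight = word_embeddings.weight        # shared matrix E, shape (v, hidden)
        self.bias = nn.Parameter(torch.zeros(vocab_size))     # bias term b

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) contextual embeddings h_i
        token_importance = self.transform(hidden_states) @ self.embedding_weight.T + self.bias  # I_i
        # Equation (3): ReLU, then sum over positions -> passage-wise distribution I
        return torch.relu(token_importance).sum(dim=1)        # (batch, vocab)
```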

3.3. The Gating Controller

The gating controller generates a binary gating signal indicating which terms to activate to represent the passage. First, the terms appearing in the original passage, which we refer to as literal terms, are activated by the controller by default. Apart from the literal terms, some other terms related to the passage topic are also expected to be activated to tackle the “lexical mismatch” problem of BoW representations. Accordingly, we propose two kinds of gating controllers, literal-only gating and expansion-enhanced gating, which can be applied in scenarios with different requirements for lexical matching.

Literal-only Gating. If we simply set $\mathcal{G}(p) = \mathrm{BoW}(p)$, where $\mathrm{BoW}(p)$ denotes the binary BoW vector of passage $p$, we get the literal-only gating controller. In this setting, only those terms existing in the original passage are activated for the passage representation. Without expansion to non-literal terms, sparse representation learning reduces to a pure term re-weighting scheme. Nevertheless, in the experiment section, we empirically show that this gating controller can achieve competitive retrieval performance by learning the importance of literal terms.
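A literal-only gate is just the binary BoW vector of the passage; a minimal sketch (special tokens and padding are not filtered here for brevity):

```python
# Literal-only gating: G(p) = BoW(p), a binary vector over the vocabulary.
import torch

def literal_only_gating(input_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    # input_ids: (batch, seq_len) token ids of the passage
    gate = torch.zeros(input_ids.size(0), vocab_size, device=input_ids.device)
    gate.scatter_(1, input_ids, 1.0)    # activate exactly the literal terms
    return gate                         # binary BoW(p), shape (batch, vocab)
```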

Expansion-enhanced Gating. The expansion-enhanced gating controller activates terms that can hopefully bridge the “lexical mismatch” gap. Similar to the importance prediction process formulated by Equations (2) and (3), we obtain a passage-wise dense term gating distribution $G$ of dimension $v$ with independent network parameters, as shown in Figure 2(c). Note that although the gating distribution $G$ and the importance distribution $I$ share the same dimension $v$, they differ in logit scales and mathematical implications: $I$ represents the semantic importance of each term in the vocabulary, while $G$ quantifies the probability of each term participating in the final sparse representation. To ensure the sparsity of $p'$, we apply a binarizer to $G$:

(4) $G' = \mathrm{Binarizer}(G)$

where $\mathrm{Binarizer}$ denotes a binary activation function that outputs only 0 or 1. The gating vector for expansion terms $G_e$ is obtained by:

(5) $G_e = G' \odot (\neg \mathrm{BoW}(p))$

where the bitwise negation vector $\neg \mathrm{BoW}(p)$ is applied to ensure orthogonality with the literal-only gating. Simply adding the expansion gating and the literal-only gating, we get the final expansion-enhanced gating vector $G_{le}$:

(6) $G_{le} = G_e + \mathrm{BoW}(p)$

Involving both literal and expansion terms, the final sparse representation can be a “free” distribution in the vocabulary space. Note that in the SparTerm framework, expanded terms are not directly appended to the original passage; they only control the gating signal that decides whether a term participates in the final representation. This ensures that the input to the BERT encoder is always the natural language of the original passage.
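Equations (4) to (6) amount to a thresholding step followed by two element-wise operations; here is a sketch under the assumption of a simple threshold binarizer (the paper uses a threshold of 0.7, see Section 4.2):

```python
# Sketch of expansion-enhanced gating (Equations 4-6).
import torch

def expansion_enhanced_gating(gating_dist: torch.Tensor,
                              bow: torch.Tensor,
                              threshold: float = 0.7) -> torch.Tensor:
    # gating_dist: (batch, vocab) dense gating probabilities G from the controller
    # bow:         (batch, vocab) binary literal gate BoW(p)
    g_binary = (gating_dist > threshold).float()    # Equation (4): G' = Binarizer(G)
    g_expansion = g_binary * (1.0 - bow)            # Equation (5): G_e = G' ⊙ ¬BoW(p)
    return g_expansion + bow                        # Equation (6): G_le = G_e + BoW(p)
```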

3.4. Training

In this section, we introduce the training strategy of the importance predictor and expansion-enhanced gating controller.

Training the importance predictor. The importance predictor is trained end-to-end by optimizing the ranking objective. Let $R = \{(q_1, p_{1,+}, p_{1,-}), \dots, (q_N, p_{N,+}, p_{N,-})\}$ denote a set of $N$ training instances, each containing a query $q_i$, a positive candidate passage $p_{i,+}$, and a negative one $p_{i,-}$, indicating that $p_{i,+}$ is more relevant to the query than $p_{i,-}$. The loss function is the negative log likelihood of the positive passage:

(7) $L_{rank}(q_i, p_{i,+}, p_{i,-}) = -\log \frac{e^{sim(q_i', p_{i,+}')}}{e^{sim(q_i', p_{i,+}')} + e^{sim(q_i', p_{i,-}')}}$

where $q_i'$, $p_{i,+}'$, and $p_{i,-}'$ are the sparse representations of $q_i$, $p_{i,+}$, and $p_{i,-}$ obtained by Equation (1), and $sim$ denotes any similarity measure such as the dot product. Different from the training objective of DeepCT (Dai and Callan, 2019), we do not directly fit a statistical term importance distribution, but view the importance as intermediate variables that can be learned from the distant supervisory signal for passage ranking. End-to-end learning involves every term in the optimization process, which yields a smoother importance distribution that still retains enough distinguishability.
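With a dot-product similarity, Equation (7) reduces to a two-way softmax cross-entropy over the positive and negative passages; a sketch:

```python
# Sketch of the ranking loss in Equation (7), assuming dot-product similarity.
import torch
import torch.nn.functional as F

def ranking_loss(q: torch.Tensor, p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # q, p_pos, p_neg: (batch, vocab) sparse representations from Equation (1)
    sim_pos = (q * p_pos).sum(dim=-1)                   # sim(q', p'_+)
    sim_neg = (q * p_neg).sum(dim=-1)                   # sim(q', p'_-)
    logits = torch.stack([sim_pos, sim_neg], dim=-1)    # softmax over {positive, negative}
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)              # -log softmax prob of the positive
```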

Training the expansion-enhanced gating controller. We summarize four types of term expansion in Table 1, all of which can be optimized in our SparTerm framework. Intuitively, pre-trained BERT already has the ability to expand synonyms and co-occurring words thanks to the Masked Language Model (MLM) pre-training task. Therefore, in this paper, we focus on expanding passage2query-alike and summarization terms. Consider a passage-query/summary parallel corpus $\mathbf{C}$, where $p$ is a passage, $t$ the corresponding target text, and $T$, of dimension $v$, the binary bag-of-words vector of $t$. We use a binary cross-entropy loss over the gating probabilities of all terms in the vocabulary:

(8) $L_{exp} = -\lambda_1 \sum_{j \in \{m \mid T_m = 0\}} \log(1 - G_j) - \lambda_2 \sum_{k \in \{m \mid T_m = 1\}} \log G_k$

where $G$ is the dense gating probability distribution for $p$, and $\lambda_1$ and $\lambda_2$ are two tunable hyper-parameters. $\lambda_1$ is the loss weight for terms expected not to be expanded, while $\lambda_2$ is for terms that appear in the target text. In the experiments, we set $\lambda_2$ much larger than $\lambda_1$ to encourage more terms to be expanded.
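Equation (8) is a weighted binary cross-entropy over the gating probabilities; a sketch with the loss weights exposed as arguments (the defaults mirror the values reported in Section 4.2):

```python
# Sketch of the expansion loss in Equation (8).
import torch

def expansion_loss(gating_dist: torch.Tensor, target_bow: torch.Tensor,
                   lambda1: float = 1e-3, lambda2: float = 1.0) -> torch.Tensor:
    # gating_dist: (batch, vocab) gating probabilities G in (0, 1)
    # target_bow:  (batch, vocab) binary bag-of-words vector T of the target text
    eps = 1e-8                                                       # numerical stability
    neg_term = (1 - target_bow) * torch.log(1 - gating_dist + eps)   # terms with T_m = 0
    pos_term = target_bow * torch.log(gating_dist + eps)             # terms with T_m = 1
    return -(lambda1 * neg_term + lambda2 * pos_term).sum(dim=-1).mean()
```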

End-to-end joint training. Intuitively, the supervisory ranking signal can also be leveraged to guide the training of the gating controller, thus we can train the importance predictor and gating controller jointly:

(9) $L = L_{rank} + L_{exp}$
Model MRR@10 R@10 R@20 R@50 R@100 R@200 R@500 R@1000
BM25 18.6 - 49 60 69 75 82 85.71
Doc2query 21.5 - - - - - - 89.1
Doc2query-T5 27.7 - - 75.6 81.89 86.88 91.64 94.7
DeepCT 24.3 49 58 69 76 82 86 91
SparTerm(literal-only) 27.46 51.05 60.21 71.55 78.28 83.27 88.33 91.16
SparTerm(expansion-only) 19.8 40.93 - 63.42 70.96 77.62 84.81 89.08
SparTerm(expansion-enhanced) 27.94 51.95 61.58 72.48 78.95 84.05 89.5 92.45
Table 2. Performances of different models on Dev Set of MSMARCO Passage Retrieval dataset.

4. Experimental Setup

4.1. Datasets and Metrics

We evaluate our method on MSMARCO (Nguyen et al., 2016) which consists of two benchmark datasets:

MSMARCO Passage Retrieval dataset is based on the public MSMARCO dataset, with a collection of 8.8M passages from Web pages gathered from Bing’s results to 1M real-world queries. Each query is associated with one or very few passages marked as relevant, while no passage is explicitly indicated as irrelevant. Instead of evaluating re-ranking, we build a small dev set for validating full-ranking performance by using BM25 to sample the 1M most relevant passages for 1,000 queries from the original passage ranking dev set.

MSMARCO Document Retrieval dataset is based on the source documents that contain the passages in the passage retrieval task. The dataset contains 367,013 documents, with 367,013 queries for the training set and 5,193 queries for the dev set.

The original dev set of the MSMARCO dataset targets a re-ranking task, which is inconsistent with the retrieval task. Therefore, to find the best checkpoint of our model more accurately, we build a new dev set to evaluate retrieval performance by sampling about 1M passages from the collection and 1,000 queries from the original dev set (including the top 1000 passages of each query retrieved by BM25). To evaluate the full-ranking performance of our model, we use the sparse representation of each document to build an inverted index, use the sparse representations of queries to retrieve the top-K relevant documents, and measure the performance with MRR@10 and Recall from top-10 to top-1000.
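The evaluation pipeline relies on standard inverted-index retrieval over the learned sparse vectors; a minimal sketch (the index layout and helper names are illustrative, not a production implementation):

```python
# Sparse retrieval via an inverted index: a query scores candidates by the
# dot product of matching term weights.
from collections import defaultdict

def build_inverted_index(doc_vectors):
    # doc_vectors: {doc_id: {term_id: weight}} sparse document representations
    index = defaultdict(list)
    for doc_id, terms in doc_vectors.items():
        for term_id, weight in terms.items():
            index[term_id].append((doc_id, weight))
    return index

def retrieve(query_vector, index, top_k=1000):
    # query_vector: {term_id: weight} sparse query representation
    scores = defaultdict(float)
    for term_id, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term_id, []):
            scores[doc_id] += q_weight * d_weight     # accumulate dot product
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
```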

4.2. Implementation

The Importance Predictor and Gating Controller of our model have the same architecture and hyper-parameters as BERT (12-layer, 768-hidden, 12-heads, 110M parameters) and do not share weights. We initialize the Importance Predictor with Google’s official pre-trained $\mathrm{BERT_{base}}$ model, and the parameters of the token-wise importance predictor are initialized from the masked language prediction layer of BERT. When using expansion-enhanced gating, the Gating Controller is also initialized with $\mathrm{BERT_{base}}$. We fine-tune our model on the training set of the MSMARCO passage retrieval dataset on 4 NVIDIA V100 GPUs with a batch size of 128. During fine-tuning, we first fine-tune the Gating Controller with Equation (8) for 50k iterations, where $\lambda_1 = 10^{-3}$ and $\lambda_2 = 1$. Then we fix the parameters of the Gating Controller and fine-tune SparTerm jointly for 100k iterations. We use the Adam optimizer with a learning rate of $2 \times 10^{-5}$. To ensure sparsity, the threshold of the Binarizer in Equation (4) is set to 0.7. We do not fine-tune our model on the training set of the document retrieval dataset; we simply use the model trained on the passage retrieval dataset for document ranking.

4.3. Baselines and Experimental Settings

We compare our model with the following strong baselines, all of which are based on sparse representations. The former two focus on term re-weighting, while the latter two focus on document expansion:

  • BM25 (Robertson and Walker, 1994) is a bag-of-words retrieval model that uses frequency-based signals to estimate the weights of terms in a text.

  • DeepCT (Dai and Callan, 2019) is a deep contextualized term weighting model that maps BERT’s representations to term weights for retrieval.

  • Doc2query (Nogueira et al., 2019) is a Transformer-based document expansion method that expands documents with terms related to their content.

  • Doc2query-T5 (Cheriton, 2019) is a document expansion method that utilizes the more powerful T5 (Raffel et al., 2019) language model to generate queries for document expansion.

We also evaluate three different settings of SparTerm:

  • SparTerm(literal-only) uses Importance Predictor with the Literal-only Gating which can also be seen as a term weighting model.

  • SparTerm(expansion-only) uses the Expansion-enhanced Gating for passage expansion without term weighting. We simply add the expanded words (each with weight 1) to the passages.

  • SparTerm(expansion-enhanced) implements both Importance Predictor and Expansion-enhanced Gating for sparse representation of passage.

5. Experimental Results

5.1. Performance on Passage Full Ranking

Table 2 shows the full-ranking performance of our models and the baselines on the MSMARCO Passage Retrieval dataset. SparTerm (expansion-enhanced) outperforms all baselines on MRR, achieving the state-of-the-art ranking performance among all sparse models, and outperforms all baselines except Doc2query-T5 on Recall. We find that SparTerm achieves more significant improvements on MRR and Recall@10-100, which illustrates that our model has a stronger top-ranking ability than previous sparse models. Further, methods based on pre-trained language models (DeepCT, Doc2query-T5, and SparTerm) perform better than those without PLMs, demonstrating that PLMs can facilitate passage full ranking with better representations. Considering the improvements T5 brings to Doc2query, we believe that SparTerm can be further improved with more advanced PLMs.

Even without any expansion, SparTerm (literal-only) outperforms DeepCT on both MRR and Recall, demonstrating that SparTerm produces more effective term weights and thus facilitates retrieval. We further analyze the difference between SparTerm and DeepCT on term weighting in Section 5.4. With only the expanded words, SparTerm achieves a clear improvement over BM25, especially on Recall, which proves the effectiveness of passage expansion in improving Recall for retrieval.

Model MRR@10
BM25+PassageRetrievalMax 23.6
HDCT+PassageRetrievalMax 26.1
BM25 24.5
HDCT(sum) 28.0
HDCT(decay) 28.7
SparTerm(literal-only)+PassageRetrievalMax 28.5
SparTerm(expansion-enhanced)+PassageRetrievalMax 29.0
Table 3. Performance of baselines and our models on dev set of MSMARCO document ranking dataset. All use the max score of passages in the document as the document score at the query time.

5.2. Performance on Document Ranking

For the document ranking task, we split each document into several passages to fit the maximum sequence length (256) of our model and generate the sparse representation of each passage. We compare our models with two baseline methods: BM25 (Robertson and Walker, 1994) and HDCT (Dai and Callan, 2020). HDCT builds on DeepCT and focuses on document ranking; it is also a term weighting method. HDCT compares two different ways to combine passage representations for document ranking: the first represents the document as a sum of the passage representations, while the second uses a decayed weighted sum. PassageRetrievalMax does not represent the document itself; it calculates the scores of the passages in the document and chooses the maximum as the document score for ranking. Table 3 shows the ranking performance of the baselines and our models. Here we only report the PassageRetrievalMax results for our models.
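The PassageRetrievalMax strategy can be sketched as follows: every passage of a document is scored independently against the query, and the maximum passage score becomes the document score (the sparse dot product is assumed as the scoring function):

```python
# Sketch of PassageRetrievalMax document scoring.
def sparse_dot(query_vector, passage_vector):
    # both arguments are sparse {term_id: weight} dictionaries
    return sum(w * passage_vector.get(t, 0.0) for t, w in query_vector.items())

def passage_retrieval_max(query_vector, doc_passages):
    # doc_passages: list of sparse passage vectors belonging to one document
    return max(sparse_dot(query_vector, p) for p in doc_passages)
```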

Strictly speaking, HDCT and our models are not directly comparable, since we fine-tune SparTerm on the MSMARCO passage ranking dataset while HDCT was trained using document titles on MSMARCO. Even so, SparTerm (expansion-enhanced) still achieves better document ranking performance than HDCT, demonstrating that the sparse representations produced by SparTerm can also facilitate long document retrieval.

Model MRR@10 R@1000
Query-tf 25.7 94.2
Query-neural-symmetric 26.4 94.7
Query-neural-asymmetric 25.4 94.2
Table 4. Performances of our model with different query representation strategies on our new Dev Set of MSMARCO passage retrieval.
Figure 3. Term weights of different passages produced by DeepCT and SparTerm, and the expanded terms with their probabilities (before binarization) predicted by SparTerm. The depth of the color represents the term weight; deeper means higher.

5.3. Comparison of Different Query Representation Methods

We conduct experiments to evaluate the performance of SparTerm with different query representation methods:

  • Query-tf is a one-tower model that uses tf-based vectors to represent the queries while using the model to represent documents.

  • Query-neural-symmetric is a symmetric two-tower model for representing queries and passages, in which the two towers have the same architecture and share the same weights.

  • Query-neural-asymmetric is an asymmetric two-tower model in which the two towers do not share weights; queries and passages are represented with different towers.

The results are reported in Table 4, from which we find that the neural representation of queries with the symmetric two-tower model brings better MRR and Recall on our built dev set. The symmetric model may perform better than the asymmetric model because the asymmetric two-tower architecture doubles the number of parameters, making the model more difficult to converge. We further analyze the distribution of passage term weights under different query representation methods and find that the tf-based representation of queries results in a sharper distribution than the neural representation. The reason may be that, since the query representation is fixed during training, the model needs to give more weight to the relevant terms in the positive passage.
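For reference, the Query-tf setting simply uses raw term frequencies as the query-side weights while documents keep their learned sparse representations; a minimal sketch (the token ids below are hypothetical):

```python
# Query-tf: the query vector is its term-frequency bag of words.
from collections import Counter

def query_tf_vector(query_token_ids):
    # returns a sparse {term_id: tf} vector
    return dict(Counter(query_token_ids))

# Hypothetical token ids for a short query; each term gets its frequency as its weight,
# and documents are still scored against the learned SparTerm representations.
print(query_tf_vector([2134, 4560, 1999, 5712]))   # {2134: 1, 4560: 1, 1999: 1, 5712: 1}
```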

5.4. Analysis of Term Weighting

To further evaluate the term weighting ability of SparTerm, we normalize the term weights of passages produced by DeepCT and SparTerm (literal-only) to the same range and visualize them in Figure 3. Figure 3 shows three different queries (the first column) and their most relevant passages. The depth of the color represents the term weight; deeper means higher. We find that both DeepCT and SparTerm can identify the most important terms and give them higher weights. However, DeepCT obtains sparser and sharper distributions and only activates very few terms in a passage, missing some important terms, such as “allergic reaction” in the first case. SparTerm yields a smoother importance distribution by activating more terms, even those not appearing in the query. Such a distribution allows the passage to be retrieved by more queries. This also demonstrates that our model is better at pointing out the important terms in a passage.

Figure 4. The top 5 contributing words for the expanded words of the second case in Figure 3. The X-axis shows the words in the passage and the Y-axis represents the logit.

5.5. Analysis of Term Expansion

Figure 3 shows the expanded terms and their probabilities for different passages as predicted by the Gating Controller. The probability of each term indicates how likely it is to be expanded. Our model can indeed activate important terms that do not appear in the passage but are semantically related, especially terms occurring in the queries, such as “sign” in the first case and “temperature” in the second case.

To analyze how these words are expanded and which category in Table 1 they belong to, we trace the source of each expanded word and show the top 5 words, with their logits, that contribute to the expanded word in Figure 4. We find that the expanded terms basically fall into three situations:

(1) Passage2query terms such as “temperature”: almost every word in the passage contributes substantially to this kind of term, which seems to be learned from the supervised signal.

(2) Synonyms of the original terms, e.g., “weather” and “climate”, “rainfall” and “rain”, “season, monthly” and “month”, “heat” and “hot”.

(3) Co-occurring words for the original terms, e.g., “season, heat” → “summer”, “wet, humidity, weather” → “rain”, and “heat, rainfall, humidity” → “tropical, monsoon”.

The first situation benefits from the optimization objective of the Gating Controller, while the latter two are more likely due to the MLM pre-training task, since we reuse the MLM module for prediction in the Gating Controller.

6. Conclusion

In this work, we propose SparTerm to directly learn term-based sparse representations in the full vocabulary space. SparTerm learns a function that maps the frequency-based BoW representation to a sparse term importance distribution over the whole vocabulary, which involves both term weighting and expansion in the same framework. Experiments conducted on the MSMARCO dataset show that SparTerm significantly outperforms previous sparse models built on PLMs of comparable size, achieving state-of-the-art ranking performance among all sparse models. We conduct further empirical analysis of how the deep knowledge of PLMs can be transferred to sparse methods, which gives new insights for sparse representation learning. Empirical results show that SparTerm significantly increases the upper limit of sparse retrieval methods.

Acknowledgement

We thank Xin Jiang, Xiuqiang He, and Xiao Chen for the helpful discussions.

References

  • Cheriton (2019) D. Cheriton. 2019. From doc2query to docTTTTTquery.
  • Dai and Callan (2019) Zhuyun Dai and Jamie Callan. 2019. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687 (2019).
  • Dai and Callan (2020) Zhuyun Dai and Jamie Callan. 2020. Context-Aware Document Term Weighting for Ad-Hoc Search. In Proceedings of The Web Conference 2020. 1897–1907.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
  • Khattab and Zaharia (2020) O. Khattab and M. Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2020).
  • Lafferty and Zhai (2001) John Lafferty and Chengxiang Zhai. 2001. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 111–119.
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300 (2019).
  • Luan et al. (2020) Yi Luan, Jacob Eisenstein, Kristina Toutanova, and M. Collins. 2020. Sparse, Dense, and Attentional Representations for Text Retrieval. ArXiv abs/2005.00181 (2020).
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset (CEUR Workshop Proceedings), Vol. 1773. CEUR-WS.org.
  • Nogueira et al. (2019) Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv abs/1910.10683 (2019).
  • Robertson and Walker (1994) Stephen E Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR ’94. Springer, 232–241.
  • Sparck-jones (1972) K. Sparck-jones. 1972. A statistical interpretation of term specificity and its application in retrieval.
  • Strohman et al. (2005) Trevor Strohman, Donald Metzler, Howard Turtle, and W Bruce Croft. 2005. Indri: A language model-based search engine for complex queries. In Proceedings of the international conference on intelligent analysis, Vol. 2. Citeseer, 2–6.
  • Zamani et al. (2018) Hamed Zamani, Mostafa Dehghani, W Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In Proceedings of the 27th ACM international conference on information and knowledge management. 497–506.
  • Zaragoza et al. (2004) Hugo Zaragoza, Nick Craswell, Michael J Taylor, Suchi Saria, and Stephen E Robertson. 2004. Microsoft Cambridge at TREC 13: Web and Hard Tracks.. In TREC, Vol. 4. 1–1.