Generating Negative Samples for Sequential Recommendation

Yongjun Chen^♣, Jia Li^♣, Zhiwei Liu^♣, Nitish Shirish Keskar^♣, Huan Wang^♣, Julian McAuley^♢, Caiming Xiong^♣
^♣ Salesforce Research
^♢ UC San Diego
{yongjun.chen, jia.li, zhiweiliu, nkeskar, huan.wang, cxiong}@salesforce.com, jmcauley@eng.ucsd.edu

(2021)

Abstract.

To make Sequential Recommendation (SR) successful, recent works focus on designing effective sequential encoders, fusing side information, and mining extra positive self-supervision signals. The strategy of sampling negative items at each time step is less explored. Because users’ interests shift and the model is updated during training, items sampled at random from a user’s non-interacted item set can be uninformative negatives. As a result, the model learns user preferences toward items inaccurately. Identifying informative negatives is challenging because they are tied to both dynamically changing interests and model parameters, and because the sampling process must also be efficient. To this end, we propose to Generate Negative Samples (items) for SR (GenNi). A negative item is sampled at each time step based on the current SR model’s learned user preferences toward items. An efficient implementation is proposed to further accelerate the generation process, making it scalable to large-scale recommendation tasks. Extensive experiments on four public datasets verify the importance of providing high-quality negative samples for SR and demonstrate the effectiveness and efficiency of GenNi.

Sequential Recommendation, Dynamic Negative Sampling, Noise Contrastive Estimation

KDD ’22: The 28th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 14-18, 2022, Washington, DC, U.S.A.

1. Introduction

The central task of Sequential Recommendation (SR) is to accurately predict the next item that a user is interested in based on her past behaviors (e.g., shopping, clicking, etc.). To achieve this, an effective model must be able to learn accurate user preferences toward massive vocabularies of items at each time step. Benefiting from the expressive power of deep neural networks (e.g., Transformer (Vaswani et al., 2017; Radford et al., 2019)), recent deep SR models including (Kang and McAuley, 2018; Li et al., 2020, 2021; Sun et al., 2019b; Zhou et al., 2020; Ma et al., 2020; Qiu et al., 2021b) arguably represent the current state-of-the-art.

Due to the high computational cost of computing the exact log likelihood for all items (Zhou et al., 2021), most SR methods are optimized via a Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010; Kang and McAuley, 2018; Tang and Wang, 2018; Zhou et al., 2020) paradigm, which is an approximation of maximum likelihood estimation (MLE). Training with NCE requires the model to sample negative items to pair with positive items, where the training target is to pull positive items closer to sequences while pushing negative items away. Existing methods improve SR from many different perspectives, such as exploring the potential of different sequential encoders (Hidasi et al., 2015; Tang and Wang, 2018), leveraging side information (Li et al., 2020; Zhang et al., 2019; Zhou et al., 2020), and incorporating additional training tasks (Zhou et al., 2020; Ma et al., 2020; Liu et al., 2021; Xie et al., 2020; Qiu et al., 2021b, a; Chen et al., 2022b, a), but they rarely look into the impact of the negative items. Instead, they commonly adopt uniform or popularity-biased sampling strategies, which are either unable to reflect true negative item distributions or sub-optimal for training sequence encoders. Therefore, this paper investigates the importance of sampling informative negative items for training SR models.

Figure 1. As a user’s interests change and as training goes on, the informative negative item distribution also changes.

To demonstrate the necessity of sampling informative negative items in sequential recommendation, we illustrate a toy example in Figure 1. When a user purchases a water bottle online, the recommender predicts a bottle holder as her next item because of concurrent consumption behavior observed from other users; however, she purchases shoes instead. At this moment, the bottle holder is an informative negative because the SR model made a wrong prediction. After she purchases shoes, the informative negative item changes to sports shirts: the model has observed sequential correlations between shoes and shirts, but shirts are not the actual next item in this user’s sequence. In such a scenario, uniform sampling ignores both the dynamics and the relatedness of negative items as training proceeds, so it samples uninformative items that contribute little to the optimization. Without dynamically generated informative negative samples, the SR model is unable to improve further, resulting in sub-optimal sequential recommendation performance.

Nevertheless, sampling informative negative items for sequential recommendation poses threefold challenges. First, it is non-trivial to characterize the sequential dynamics in sampling informative negative items. DNS (Zhang et al., 2013) proposes a ranking-aware negative sampling scheme, which is devised to optimize static collaborative signals. PinSAGE (Ying et al., 2018) identifies items of high PageRank scores with respect to positive items as the informative negative items. MCNS (Yang et al., 2020) introduces a Markov Chain negative sampler for graph representation learning, which only harnesses constant neighbors for sampling. As in the aforementioned example, the informative negative samples change according to users’ consumption behaviors. In this sense, ignoring the sequential correlations fails to reveal true negative item distributions. Second, informative negative items are tightly associated with the model. During initial training stages, the model has no ability to classify items, so that all negative candidates are equally informative. As training proceeds, the model is capable of identifying some negative items; therefore, only those ‘hard’ negative (Bengio and Senécal, 2008; Blanc and Rendle, 2018; Robinson et al., 2020) items are informative and should be sampled to accelerate optimization. As such, we should recognize the current state of models when generating negative samples. Last but not least, it is hard to retain efficiency. Due to the large-scale item corpus and (usually) sparse observed interactions, there are many negative item candidates. Identifying informative negative items from those candidates requires awareness of their contributions to optimization, which is time-consuming. Therefore, it is crucial to efficiently sample informative negative items to preserve the scalability of models.

To this end, we propose to Generate Negative items (GenNi) for SR. At each time step, a negative item is sampled based on the similarity between the user interests learned by the current SR model and the item embeddings. GenNi adaptively generates negative samples without training any generative module beyond the SR model itself, which reduces computation cost. We develop an efficient two-stage sampling strategy that further improves computational efficiency and makes GenNi scalable to large-scale recommendation tasks. A self-adjusted curriculum learning strategy is also proposed to reduce the human effort of tuning the hyperparameters in GenNi. Though conceptually simple, GenNi greatly improves upon state-of-the-art SR models. Its success shares the same spirit with works in other domains (Kumar et al., 2010; Zhang and Stratos, 2021; Zhang et al., 2013; Sun et al., 2019a; Yang et al., 2020) where “hard” negatives matter and can be generated via self-adversarial training.

We conduct extensive experiments on four public datasets and observe that SR models can be significantly improved simply by replacing the original negative sampler with GenNi, e.g., the average NDCG@5 of S$^{3}$-Rec (Zhou et al., 2020) over the four datasets improves by 107.66%. This shows that negative item sampling is as important as other components in making SR successful. Detailed comparisons with other negative sampling strategies and further analyses validate the superiority of the proposed method.

2. Related Work

2.1. Sequential Recommendation

Sequential recommendation aims to accurately characterize users’ dynamic interests by modeling their past behavior sequences. Early works on SR usually model item-to-item transition patterns based on Markov Chains (Rendle, 2010; He and McAuley, 2016). FPMC (Rendle et al., 2010) combines the advantages of Markov Chains and matrix factorization to fuse both sequential patterns and users’ general interests. With the recent advances in deep learning, many deep sequential recommendation models have also been developed (Tang and Wang, 2018; Hidasi et al., 2015; Kang and McAuley, 2018; Sun et al., 2019b). GRU4Rec (Hidasi et al., 2015), Caser (Tang and Wang, 2018), and SASRec (Kang and McAuley, 2018) explore the potential of encoding user sequential behaviors via an RNN, a CNN, and a Transformer, respectively. FDSA (Zhang et al., 2019), TiSASRec (Li et al., 2020) and S$^{3}$-Rec (Zhou et al., 2020) leverage side information (e.g., time intervals, item categories) for a more comprehensive representation. BERT4Rec (Sun et al., 2019b) replaces the next-item prediction (NIP) task with a masked-item prediction task (Taylor, 1953) to capture contextual information. With the success of contrastive self-supervised learning (SSL), several works (Ma et al., 2020; Xie et al., 2020; Qiu et al., 2021b, a) propose different contrastive SSL paradigms as a complement to or a replacement for NIP. LSAN (Li et al., 2021) improves SASRec from the perspective of serving efficiency. Nevertheless, most existing works ignore the quality of the sampled negative items: they treat items sampled at random from a user’s non-interacted item set, or all other items in the same training batch, as negatives.

2.2. Negative Sampling

Word2vec (Mikolov et al., 2013) first proposes to sample negative items from the word frequency distribution raised to the 3/4 power to train skip-gram language models. Later works in NLP and social networks often follow this setting (Pennington et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016). In graph mining, RotatE (Sun et al., 2019a) first proposes to sample negative items based on the model’s predictions, and MCNS (Yang et al., 2020) then improves efficiency via a Markov Chain based Metropolis-Hastings algorithm. However, these methods only consider the neighborhoods of nodes on the graph while ignoring the sequential dynamics of the data. Another line of work improves the Sampled Softmax (Zhou et al., 2021; Blanc and Rendle, 2018) to better approximate the full Softmax. In contrast, our work studies SR methods that are trained under the NCE framework, which trains a sequential binary classifier to distinguish target items from negative items. Several GAN-based (Goodfellow et al., 2014) methods have been proposed for applications such as information retrieval (Wang et al., 2017; Park and Chang, 2019) and graph node embeddings (Cai and Wang, 2017). However, GAN-based methods are often hard to train, and the additional training of a generator also makes sampling inefficient for SR models. In recommendation, Bayesian Personalized Ranking (Rendle et al., 2012) first proposes to sample negative items uniformly from a user’s non-interacted items for training factorization machines. Dynamic negative sampling (DNS) (Zhang et al., 2013) develops a ranking-aware negative sampling strategy for improving collaborative filtering based methods. PinSAGE (Ying et al., 2018) considers items with high PageRank scores as “hard-negative” samples and uses a curriculum learning scheme to train large-scale graph neural networks. Despite their success in their own domains, these methods ignore the sequential dynamics of users’ interests and are thus not ideal for sequential recommendation.

3. Method

In this section, we first describe the Sequential Recommendation (SR) problem and a general approach to solving it, together with the two key ingredients of training an SR model. We then describe our proposed negative item generator, an efficient algorithm, and a self-adjusted curriculum learning approach that adaptively samples negative items for each user.

3.1. Problem Formulation

SR is usually formulated as a next item prediction (NIP) task. Formally, in a recommender system, there is a set of users and items denoted as $\mathcal{U}$ and $\mathcal{V}$ respectively. Each user $u\in\mathcal{U}$ is associated with a sequence of interacted items sorted in chronological order $S^{u}=[s^{u}_{1},\dots,s^{u}_{t},\dots,s^{u}_{|S^{u}|}]$ where $|S^{u}|$ is the number of interacted items and $s^{u}_{t}$ is the item $u$ interacted with at step $t$ . We denote $\mathbf{S}^{u}$ as the embedded representation of $S^{u}$ , where $\mathbf{s}^{u}_{t}$ is the $d$ -dimensional embedding of item $s^{u}_{t}$ . In practice, sequences are truncated with maximum length $T$ . If the sequence length is larger than $T$ , the most recent $T$ actions are considered. If the sequence length is smaller than $T$ , “padding” items will be added to the left until the length is $T$ (Hidasi et al., 2015; Tang and Wang, 2018; Kang and McAuley, 2018). For each user $u$ at time step $t$ , the goal of SR is to predict the item that the user $u$ would be interested in at step $t+1$ among the item set $\mathcal{V}$ , given her past behavior sequence $\mathbf{S}^{u}_{1:t}$ .
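
The truncation and left-padding rule above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code; the function name `pad_or_truncate` and the padding id `0` are assumptions made for the example.

```python
from typing import List

PAD_ID = 0  # hypothetical id reserved for the "padding" item


def pad_or_truncate(seq: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Keep the most recent max_len items; left-pad shorter sequences."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    recent = seq[-max_len:]  # the most recent T actions
    return [pad_id] * (max_len - len(recent)) + recent


print(pad_or_truncate([5, 9, 2], max_len=5))           # [0, 0, 5, 9, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))  # [3, 4, 5, 6]
```

Padding on the left keeps the most recent interaction at the last position, so the representation at the final step always corresponds to the latest item.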

3.2. Training an SR Model with Noise Contrastive Estimation

To train an SR model, a standard learning procedure fits the sequential data following the maximum likelihood estimation principle. Specifically, for each user $u$ at time step $t$ in a mini-batch $\mathcal{B}$, we want to learn a parametric function $f_{\theta}$ that maximizes the probability of the target item:

(1)

\arg\underset{\theta}{\max}\sum_{(u,t)\in\mathcal{B}}P_{\theta}(\mathbf{s}_{t+1}^{u}|\mathbf{h}^{u}_{t}),

where

(2)

P_{\theta}(\mathbf{s}_{t+1}^{u}|\mathbf{h}^{u}_{t})=\frac{\exp(\mathbf{h}^{u}_{t}\cdot\mathbf{s}_{t+1}^{u})}{Z_{\theta}(\mathbf{h}^{u}_{t})},

where $\mathbf{h}^{u}_{t}=f_{\theta}(S^{u}_{1:t})$ is the encoded representation of the user’s interest at time $t$, $Z_{\theta}(\mathbf{h}^{u}_{t})=\sum_{v\in\mathcal{V}}\exp(\mathbf{h}^{u}_{t}\cdot\mathbf{v})$ is the partition function that normalizes the scores into a probability distribution, and $\exp(\mathbf{h}^{u}_{t}\cdot\mathbf{s}^{u}_{t+1})$ is a similarity score of the user’s preference toward the target item. Unfortunately, computing this probability and its derivatives is infeasible, since the $Z_{\theta}(\cdot)$ term requires summing over all items in $\mathcal{V}$, which is generally very large in sequential recommendation.
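
To make the cost of the partition function concrete, the sketch below evaluates Eq. 2 directly. It is an illustration under stated assumptions: embeddings are plain Python lists, and the names `dot` and `next_item_probability` are hypothetical, not taken from any SR codebase.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def next_item_probability(h: Sequence[float],
                          item_embeddings: Sequence[Sequence[float]],
                          target: int) -> float:
    """P(s_target | h) = exp(h . s_target) / sum_v exp(h . v), as in Eq. 2."""
    # One dot product per item in V: this loop is the O(|V|) bottleneck.
    scores = [dot(h, v) for v in item_embeddings]
    # Subtract the max score before exponentiating for numerical stability;
    # the shift cancels between numerator and denominator.
    shift = max(scores)
    exp_scores = [math.exp(s - shift) for s in scores]
    return exp_scores[target] / sum(exp_scores)


h = [0.5, -1.0]
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [next_item_probability(h, items, i) for i in range(len(items))]
print(probs, sum(probs))
```

Every call touches all item embeddings, which is why the exact likelihood becomes impractical once the item set is large.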

Hence, existing methods (Kang and McAuley, 2018; Li et al., 2020; Ma et al., 2020; Xie et al., 2020) commonly adopt an approximation via Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010). NCE is based on the reduction of density estimation to probabilistic binary classification. It provides a stable and efficient way to avoid computing $Z_{\theta}(\cdot)$ while still estimating the original objective. The basic idea is to train a binary classifier to discriminate between samples from the positive data distribution and samples from a “noise” (negative sampling) distribution. Specifically, given the encoded user interest $\mathbf{h}^{u}_{t}$, we view the next item $\mathbf{s}^{u}_{t+1}$ as its positive item and draw $k$ negative items from a pre-defined distribution $Q(\cdot)$ (e.g., a uniform distribution over all other items in $\mathcal{V}$). We train the SR model with the following loss function:

(3)

\begin{split}\mathcal{L}=\sum_{(u,t)\in\mathcal{B}}\mathcal{L}_{t}^{u}\end{split}

and

(4)

\begin{split}\mathcal{L}_{t}^{u}=&-\log(P(D=1|\mathbf{h}^{u}_{t},\mathbf{s}^{u}_{t+1}))\\ &-k\mathbb{E}_{\text{neg}\sim Q}\log(P(D=0|\mathbf{h}^{u}_{t},\mathbf{s}^{u}_{-,t+1})),\end{split}

where $P(D=1|\mathbf{h}^{u}_{t},\mathbf{s}^{u}_{t+1})=\sigma(\mathbf{h}^{u}_{t}\cdot\mathbf{s}^{u}_{t+1})$, $\sigma$ is the sigmoid function, and $\mathbf{s}^{u}_{-,t+1}$ is the sampled negative item at $t+1$. This loss decreases when $\mathbf{h}^{u}_{t}\cdot\mathbf{s}^{u}_{t+1}$ increases and $\mathbf{h}^{u}_{t}\cdot\mathbf{s}^{u}_{-,t+1}$ decreases. In other words, optimizing this loss function pulls the sequence embedding $\mathbf{h}^{u}_{t}$ closer to the positive item $\mathbf{s}^{u}_{t+1}$ while pushing it away from the sampled negative items, hence the term contrastive. To make NCE approximate the maximum likelihood objective (Eq. 2) more closely, one needs to either sample more negative items or improve the quality of the negative sampling distribution $Q(\cdot)$. Surprisingly, existing methods pay little attention to either.
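
The per-step loss in Eq. 4 can be sketched as below. This is a minimal illustration, assuming the expectation over $Q$ is estimated with $k$ sampled negatives (so $k\cdot\mathbb{E}$ becomes a plain sum) and that $P(D=0|\cdot)=1-\sigma(\cdot)$; the helper names are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss(h: Sequence[float],
             positive: Sequence[float],
             negatives: Sequence[Sequence[float]]) -> float:
    """-log sigma(h . s+) - sum over the k negatives of log(1 - sigma(h . s-))."""
    loss = -log_sigmoid(dot(h, positive))
    for neg in negatives:
        # log(1 - sigma(x)) equals log sigma(-x)
        loss -= log_sigmoid(-dot(h, neg))
    return loss


h = [1.0, 0.0]
easy = nce_loss(h, positive=[2.0, 0.0], negatives=[[-2.0, 0.0]])
hard = nce_loss(h, positive=[2.0, 0.0], negatives=[[1.5, 0.0]])
print(easy, hard)  # the hard negative yields the larger loss
```

A negative that already scores low contributes almost nothing to the loss, which is the sense in which uniformly sampled negatives can become uninformative.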

Theorem 1 (Impact of $k$ ).

Increasing $k$ can reduce the mean square error (aka risk) of model estimation, and the distribution of negative items $Q(\cdot)$ becomes less important when $k\to\infty$.

The above theorem shows that $k$ is an important factor in training SR models well (proof given in Appendix A). Empirically, however, naively increasing $k$, though trivial to implement, is not a good choice in recommendation tasks: under random sampling, most of the sampled items become uninformative as training goes on, while the training time grows linearly with $k$. For this reason, existing SR models often keep the default $k=1$. Without naively increasing the number of negative items $k$, designing a good negative item distribution function $Q$ is crucial to making SR models successful.

Theorem 2 (Optimal Embeddings).

The optimal sequence and item embedding for each user $u$ at each time step $t$ should satisfy:

\mathbf{h}^{u}_{t}\cdot\mathbf{s}^{u}_{t+1}=-\log\frac{k\cdot Q(\mathbf{s}^{u}_{t+1}|\mathbf{h}^{u}_{t})}{P(\mathbf{s}^{u}_{t+1}|\mathbf{h}^{u}_{t})}.

Theorem 2 indicates that the optimal embeddings depend on both the data distribution $P(\cdot)$ and the negative sampling distribution $Q(\cdot)$ (proof given in Appendix B). As such, it is necessary to sample items from the true negative sampling distribution; sampling from any other distribution would yield sub-optimal results.
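
A brief sketch of why the optimum takes this form may help; it is not a substitute for the proof in Appendix B. It assumes the expected per-item loss implied by Eq. 4, treats the score $x=\mathbf{h}^{u}_{t}\cdot\mathbf{s}$ of a single item $\mathbf{s}$ as a free variable, and abbreviates $P=P(\mathbf{s}|\mathbf{h}^{u}_{t})$ and $Q=Q(\mathbf{s}|\mathbf{h}^{u}_{t})$.

```latex
\begin{aligned}
\ell(x) &= -P\,\log\sigma(x) \;-\; k\,Q\,\log\bigl(1-\sigma(x)\bigr),\\
\frac{\partial \ell}{\partial x} &= -P\,\bigl(1-\sigma(x)\bigr) + k\,Q\,\sigma(x) = 0
\;\Longrightarrow\; \sigma(x) = \frac{P}{P + k\,Q},\\
x &= \log\frac{\sigma(x)}{1-\sigma(x)} = \log\frac{P}{k\,Q} = -\log\frac{k\,Q}{P}.
\end{aligned}
```

The stationary point recovers the expression in Theorem 2, and it makes explicit that changing $Q$ changes the score the model is driven toward.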

The two theorems motivate us to improve the sampling of negative items for sequential recommendation, as described in the following sections. We propose a novel negative item generator as well as a strategy to further improve its efficiency.

3.3. Next Negative Item Generator

3.3.1. Principles of an Informative Negative Item Sampler in SR

Theorem 2 implies that in sequential recommendation, the informative negative items dynamically change with the user’s interests at time $t$ as well as the network parameters $\theta$ . We therefore define the principles of informative negative item sampler for SR as follows:

Dynamic: The sampler should be aware of the dynamic of the user’s interests at each time step. When a user interacts with a new item, the corresponding informative negative items can also be changed.

Adaptive: The sampler should be adaptive to the model structure as well as its parameters $f_{\theta}$ . The sampled item is uninformative if it is easy to be predicted as a negative item (Kumar et al., 2010).

Efficient: The sampler should also be efficient enough to scale to large recommender systems. If the sampler is inefficient, it can be replaced by simply increasing the hyperparameter $k$ or even by training without sampling (Eq. 1).

3.3.2. Generating Negative Items via Self-Adversarial Training

Based on the aforementioned principles, we propose to generate negative items based on the user’s interests and the model’s current predictions. Specifically, at each time step $t$, a user’s historical behavior sequence is encoded by a network: $\mathbf{h}_{t}^{u}=f_{\theta}(S_{1:t}^{u})$ (e.g., a Transformer encoder (Kang and McAuley, 2018; Li et al., 2020; Ma et al., 2020; Xie et al., 2020)). We then leverage the current sequential dynamic $\mathbf{h}_{t}^{u}$ and the model’s current state (parameterized by $\theta$) to generate the next informative negative item. The $Q(\cdot)$ function is defined as follows:

(5)

Q(s_{i}|\mathbf{h}_{t}^{u},\hat{\theta}_{l})=\left(\frac{\exp(\mathbf{s}_{i,\hat{\theta}_{l}}\cdot\mathbf{h}_{t,\hat{\theta}_{l}}^{u})}{\sum_{s_{j}\in\mathcal{V}}\exp(\mathbf{s}_{j,\hat{\theta}_{l}}\cdot\mathbf{h}_{t,\hat{\theta}_{l}}^{u})}\right)^{\alpha},\quad s_{i}\neq s_{t+1}^{u},

where $\hat{\theta}_{l}$ denotes the estimated model parameters at the $l^{th}$ learning iteration and $\alpha$ controls the difficulty of the sampler. When $\alpha=0$, the sampler follows a uniform distribution. The larger $\alpha$ is, the more likely informative (high-scoring) items are to be sampled. The $Q(\cdot)$ function is thus both dynamic with respect to changes in the user’s interests over time steps $t$ and adaptive to the model’s learning state over training iterations $l$. We refer to Eq. 5 as the next negative item (NNI) sampler. The sampling strategy shares the same spirit with works in other domains, such as CV, NLP and graph mining (Bengio and Senécal, 2008; Kumar et al., 2010; Zhang and Stratos, 2021; Zhang et al., 2013; Sun et al., 2019a; Yang et al., 2020), where “hard” negatives matter and can be generated via self-adversarial training. Figure 2 (b)-(c) illustrates this process.
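
The NNI sampler can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the weights of Eq. 5 are renormalized over the non-target items so that they form a proper distribution, embeddings are plain lists, and the function names are hypothetical. Because the softmax denominator is the same for every item, raising the softmax to the power $\alpha$ and renormalizing is equivalent to weighting item $i$ by $\exp(\alpha\,\mathbf{s}_{i}\cdot\mathbf{h})$.

```python
import math
import random
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def nni_probabilities(h: Sequence[float],
                      item_embeddings: Sequence[Sequence[float]],
                      target: int,
                      alpha: float) -> List[float]:
    """Renormalized Eq. 5: weight exp(alpha * s_i . h), target item excluded."""
    scores = [dot(h, v) for v in item_embeddings]
    # Shift by the best non-target score for numerical stability
    # (assumes at least two items, so at least one non-target item exists).
    shift = max(s for i, s in enumerate(scores) if i != target)
    weights = [0.0 if i == target else math.exp(alpha * (s - shift))
               for i, s in enumerate(scores)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negative(h, item_embeddings, target: int, alpha: float,
                    rng: random.Random) -> int:
    """Draw one negative item index from the NNI distribution."""
    probs = nni_probabilities(h, item_embeddings, target, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


h = [1.0, 0.0]
items = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
print(nni_probabilities(h, items, target=0, alpha=0.0))  # uniform over non-targets
print(nni_probabilities(h, items, target=0, alpha=2.0))  # mass shifts to item 1
print(sample_negative(h, items, target=0, alpha=2.0, rng=random.Random(0)))
```

With $\alpha=0$ all non-target items are equally likely, while larger $\alpha$ concentrates the probability on items the current model scores highly for this user and time step.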

3.3.3. Acceleration

Although Eq. (5) already defines the negative sampling distribution, it is still inefficient due to the summation over all the items in the denominator part. Hence, we devise a two-stage sampling strategy to further accelerate the sampling procedure. To be more specific, at a certain time step, a negative item is sampled as follows:

• Pre-Selection: a small subset of candidate items is pre-selected from $\mathcal{V}$ in the first stage. We uniformly select a ratio $\beta$ of the items as candidates, denoted as $\mathcal{V^{\prime}}\subset\mathcal{V}$.

• Post-Selection: we use the proposed NNI sampler to further narrow down the pre-selected candidates $\mathcal{V^{\prime}}$ to the sampled negative item:

(6)

Q(s_{i}|\mathbf{h}_{t}^{u})=\left(\frac{\exp(\mathbf{s}_{i}\cdot\mathbf{h}_{t}^{u})}{\sum_{s_{j}\in\mathcal{V}^{\prime}}\exp(\mathbf{s}_{j}\cdot\mathbf{h}_{t}^{u})}\right)^{\alpha},\quad s_{i}\neq s_{t+1}^{u}.

With the acceleration, the computation time of negative item generation reduces from the original $\mathcal{O}(|\mathcal{V}|)$ to $\mathcal{O}(\beta\cdot|\mathcal{V}|)$ , where $\beta$ ranges from 0 to 1. When $\beta\approx 0$ , sampling becomes uniform (and Post-Selection is not needed). When $\beta=1$ , Pre-Selection is no longer needed, which becomes Eq. 5. $\beta$ controls the trade-off between effectiveness and efficiency. Figure 3 illustrates the process. There are two strategies to set $\beta$ :

• A fixed $\beta$ value. This strategy is simple and can potentially save the most computation. The drawback is that, as training proceeds, the number of informative items keeps shrinking (most items are already recognized as negatives by the SR model). A small $\beta$ can then filter out all the informative items in later training stages, so the model stops learning. Empirically, however, we find (Section 5.4) that $\beta$ can be small without a large performance drop.

• Gradually increasing $\beta$. An alternative strategy is to gradually increase $\beta$ as training proceeds:

(7) $\beta=\min(0.001\cdot 10^{E_{i}/m},1.0),$

where $E_{i}$ denotes the $i^{th}$ training epoch and $m$ controls how fast $\beta$ increases. Items sampled from a uniform distribution can be informative in the initial stages because the SR model has not yet started to learn, but most of them become uninformative as training continues. By gradually increasing $\beta$, informative items can always be sampled while the computation cost stays below that of the full version (fixed $\beta=1.0$).

See Section 5.4 for more detailed comparisons.
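
The two-stage procedure and the schedule in Eq. 7 can be sketched together as below. This is a rough illustration, not the authors' implementation: the weights of Eq. 6 are renormalized over the candidates, the pre-selected subset is forced to contain at least two items so that one non-target candidate always remains (an assumption of this sketch), and all names are hypothetical.

```python
import math
import random
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def beta_schedule(epoch: int, m: float) -> float:
    """Eq. 7: beta grows tenfold every m epochs, capped at 1.0."""
    return min(0.001 * 10 ** (epoch / m), 1.0)


def two_stage_sample(h: Sequence[float],
                     item_embeddings: Sequence[Sequence[float]],
                     target: int, alpha: float, beta: float,
                     rng: random.Random) -> int:
    """Uniform Pre-Selection of about beta*|V| items, then NNI Post-Selection."""
    n = len(item_embeddings)  # assumes n >= 2
    # Pre-Selection: uniform subset V'; at least 2 items so that one
    # non-target candidate survives (an assumption of this sketch).
    size = max(2, min(n, round(beta * n)))
    candidates = [i for i in rng.sample(range(n), size) if i != target]
    # Post-Selection: score only the candidates, weight by exp(alpha * score).
    scores = [dot(h, item_embeddings[i]) for i in candidates]
    shift = max(scores)
    weights = [math.exp(alpha * (s - shift)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
items = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(1000)]
h = [rng.gauss(0, 1) for _ in range(4)]
for epoch in (0, 10, 20, 30):
    beta = beta_schedule(epoch, m=10)
    neg = two_stage_sample(h, items, target=3, alpha=1.0, beta=beta, rng=rng)
    print(epoch, beta, neg)
```

Only the pre-selected candidates are scored, so the number of dot products per sample drops from $|\mathcal{V}|$ to roughly $\beta\cdot|\mathcal{V}|$.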

3.3.4. Overall Scheme

We term the whole negative item generation process described in Sections 3.3.1 to 3.3.3 GenNi. The overall training scheme with GenNi for an SR model is provided in Algorithm 1. It generates negative items based on the SR model without introducing additional parameters. The proposed acceleration strategy further improves its efficiency, so GenNi can be scaled to large-scale recommendation tasks. GenNi is a model-agnostic negative item generator. We apply GenNi to both SASRec and S$^{3}$-Rec, denoted as $\textsf{GenNi}_{\text{SA}}$ and $\textsf{GenNi}_{S^{3}}$.

Algorithm 1. GenNi for Sequential Recommendation

Input: Users’ historical behaviors $\{S^{u}_{1:T}\}_{u=1}^{|\mathcal{U}|}$, sequential encoder $f_{\theta}$, hyper-parameters $\alpha$, $\beta$
Output: Learned $\theta$, including item embeddings $\{\mathbf{s}_{i}\}_{i=1}^{|\mathcal{V}|}$
while $epoch\leq MaxTrainEpoch$ do
    for a minibatch $\{S_{1:t}^{u}\}_{(u,t)\in\mathcal{B}}$ do
        // Sequential Encoding with GenNi
        for $(u,t)\in\mathcal{B}$ do
            // Encode Sequence via $f_{\theta}(\cdot)$
            $\mathbf{h}^{u}_{t}=f_{\theta}(S^{u}_{1:t})$
            // Pre-selection with $\beta$ (fixed or gradually increasing)
            $\mathcal{V}_{u}^{\prime}=\text{Uniform}(\mathcal{V},\beta)$
            // Sample a Negative Item from $\mathcal{V}^{\prime}$ via Eq. 6
            $s^{u}_{-,t+1}\sim Q(\mathbf{h}_{t}^{u},\mathcal{V}_{u}^{\prime},\alpha)$
            // View Next Item $s_{t+1}^{u}$ as Target Item
        // Next Item Prediction Optimization
        Update $\theta$ based on $\{\mathbf{h}_{t}^{u}\}_{(u,t)\in\mathcal{B}}$, $\{s_{t+1}^{u}\}_{(u,t)\in\mathcal{B}}$, and $\{s_{-,t+1}^{u}\}_{(u,t)\in\mathcal{B}}$ to minimize the loss (Eq. 4).

3.4. Self-Adjusted Curriculum Learning

GenNi introduces $\alpha$ to control how often hard negatives are sampled, but $\alpha$ must still be tuned manually. Curriculum learning (Bengio et al., 2009; Kumar et al., 2010) lets neural networks begin with easy negative samples and then move on to hard ones. We simplify this rule by letting the model itself adjust $\alpha$. Specifically, we use the loss value in each batch as the critic to judge whether the current curriculum is too hard or too easy. When the previous loss is larger than the current one (the loss is decreasing), we increase $\alpha$; otherwise we decrease $\alpha$. In this way, $\alpha$ is self-adjusted with the online loss value as feedback, which reduces the human effort of choosing the initial $\alpha$ (see Section 5.5.1 for more detail).
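
The update rule can be sketched in a few lines. The section specifies only the direction of the adjustment, so the step size and the clipping bounds below (`step`, `min_alpha`, `max_alpha`) are hypothetical choices made for illustration.

```python
def adjust_alpha(alpha: float, prev_loss: float, cur_loss: float,
                 step: float = 0.1,
                 min_alpha: float = 0.0, max_alpha: float = 5.0) -> float:
    """Raise alpha when the loss went down, lower it otherwise.

    step, min_alpha and max_alpha are assumptions of this sketch.
    """
    if prev_loss > cur_loss:
        alpha += step  # loss is decreasing: make the negatives harder
    else:
        alpha -= step  # loss is not decreasing: make the negatives easier
    return min(max(alpha, min_alpha), max_alpha)


alpha = 1.0
prev = None
for loss in [0.9, 0.7, 0.6, 0.65, 0.5]:  # made-up batch losses
    if prev is not None:
        alpha = adjust_alpha(alpha, prev, loss)
    prev = loss
print(alpha)
```

Clipping keeps $\alpha$ non-negative, so the sampler never becomes biased toward the easiest items.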

4. Discussion

4.1. Time Complexity and Convergence Analysis

The computation costs of $\textsf{GenNi}_{\text{SA}}$ and $\textsf{GenNi}_{S^{3}}$ are similar to those of SASRec and S$^{3}$-Rec, except that our methods use GenNi instead of uniform sampling. The overall computation cost comes mainly from the Transformer, the feed-forward network, and GenNi, which is $\mathcal{O}(T^{2}\cdot d+T\cdot d^{2}+\beta\cdot|\mathcal{V}|\cdot T)$. The dominant term is typically $\mathcal{O}(T^{2}\cdot d)$ from the Transformer when $\beta$ is small. Although GenNi incurs a high computational cost when $\beta\cdot|\mathcal{V}|$ is large, GenNi with the proposed acceleration strategy ensures faster convergence as well as better performance (see Section 5.3). The two strategies for choosing $\beta$ proposed in Section 3.3.3 also help to balance the effectiveness and efficiency of GenNi. More details regarding the convergence analysis are provided in Appendix C.

4.2. GenNi for Improving Sequential BPR loss

Though our method is derived from the NCE paradigm in SR, GenNi can also improve other training frameworks built upon pair-wise ranking losses, e.g., sequential BPR (Rendle et al., 2012). Previous work (Hidasi and Karatzoglou, 2018) shows that optimizing a recommender model with a BPR loss suffers from vanishing gradients when more than one negative sample is introduced. The reason is that, after several epochs of training, the uniformly sampled negative items already have lower scores than the target because they are easy to identify. As a result, the gradients with respect to those negative items gradually diminish. In contrast, GenNi generates informative negative items during each epoch of training, which alleviates the vanishing gradient issue of BPR. We conduct experiments to verify this claim in Section 5.5.2.
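
The vanishing gradient effect can be illustrated numerically. The sketch assumes the standard pair-wise BPR loss $-\log\sigma(x_{+}-x_{-})$ on a positive score $x_{+}$ and a negative score $x_{-}$; the function names and the example scores are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """Pair-wise BPR loss: -log sigma(pos - neg)."""
    return -math.log(sigmoid(pos_score - neg_score))


def bpr_grad_wrt_negative(pos_score: float, neg_score: float) -> float:
    """Derivative of the BPR loss with respect to neg_score: sigma(neg - pos)."""
    return sigmoid(neg_score - pos_score)


pos = 4.0
print(bpr_grad_wrt_negative(pos, -4.0))  # easy negative: gradient close to 0
print(bpr_grad_wrt_negative(pos, 3.5))   # hard negative: gradient stays sizeable
```

Once a negative scores far below the target, its gradient is close to zero, so only negatives that the model still ranks near the target keep contributing to training.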

5. Experiments

In this section, we evaluate the performance of our approaches compared with the state-of-the-art sequential recommenders and justify the benefits of our proposed negative item generator GenNi. We also investigate impacts of the hyper-parameters and conduct the ablation study. A case study is also included to better understand how GenNi improves the training.

Table 1. Overall performance comparison among SR models. For each metric, the best score of our methods is in bold, and the best baseline score is underlined. The last column is the relative improvement of the bold score over the underlined score.

Dataset	Metric	GRU4Rec	Caser	SASRec	SASRec$_{\text{pop}_{\gamma}}$	S$^{3}$-Rec	DSSRec	CL4SRec	MMInfoRec	$\textsf{GenNi}_{\text{SA}}$ (ours)	$\textsf{GenNi}_{S^{3}}$ (ours)	Improv.
Beauty	HR@5	1.64	2.51	3.84 $\pm$ 0.06	4.08	3.85 $\pm$ 0.10	4.10	4.23 $\pm$ 0.31	5.25 $\pm$ 0.21	6.30 $\pm$ 0.09	6.47 $\pm$ 0.15	23.24%
	HR@10	2.83	3.47	6.07 $\pm$ 0.11	6.18	6.35 $\pm$ 0.10	6.89	6.94 $\pm$ 0.10	7.45 $\pm$ 0.12	8.79 $\pm$ 0.05	9.45 $\pm$ 0.21	26.85%
	NDCG@5	0.99	1.45	2.49 $\pm$ 0.09	2.69	2.40 $\pm$ 0.07	2.61	2.81 $\pm$ 0.18	3.71 $\pm$ 0.06	4.48 $\pm$ 0.07	4.64 $\pm$ 0.04	25.07%
	NDCG@10	1.37	1.76	3.21 $\pm$ 0.09	3.37	3.20 $\pm$ 0.07	3.58	3.73 $\pm$ 0.06	4.43 $\pm$ 0.10	5.33 $\pm$ 0.05	5.39 $\pm$ 0.16	21.67%
Sports	HR@5	1.62	1.54	2.20 $\pm$ 0.24	2.22	2.26 $\pm$ 0.03	2.14	2.17 $\pm$ 0.21	2.78 $\pm$ 0.09	3.55 $\pm$ 0.09	3.68 $\pm$ 0.13	32.37%
	HR@10	2.04	1.94	3.41 $\pm$ 0.30	3.43	3.73 $\pm$ 0.06	3.24	3.69 $\pm$ 0.09	3.89 $\pm$ 0.10	5.00 $\pm$ 0.11	5.50 $\pm$ 0.09	49.05%
	NDCG@5	1.03	1.14	1.45 $\pm$ 0.16	1.46	1.45 $\pm$ 0.05	1.42	1.37 $\pm$ 0.10	1.91 $\pm$ 0.08	2.57 $\pm$ 0.12	2.65 $\pm$ 0.09	38.74%
	NDCG@10	1.10	1.424	1.84 $\pm$ 0.17	1.86	1.93 $\pm$ 0.06	1.85	1.91 $\pm$ 0.08	2.33 $\pm$ 0.11	3.04 $\pm$ 0.12	3.14 $\pm$ 0.08	34.76%
Toys	HR@5	0.97	1.66	4.68 $\pm$ 0.16	4.97	4.43 $\pm$ 0.27	5.02	5.26 $\pm$ 0.14	6.02 $\pm$ 0.06	7.18 $\pm$ 0.05	6.96 $\pm$ 0.08	19.27%
	HR@10	1.76	2.70	6.81 $\pm$ 0.19	7.08	7.00 $\pm$ 0.43	7.21	7.76 $\pm$ 0.11	8.14 $\pm$ 0.08	9.96 $\pm$ 0.16	9.50 $\pm$ 0.12	22.36%
	NDCG@5	0.59	1.07	3.18 $\pm$ 0.09	3.37	2.94 $\pm$ 0.19	3.37	3.62 $\pm$ 0.08	4.53 $\pm$ 0.05	5.15 $\pm$ 0.06	4.89 $\pm$ 0.08	13.69%
	NDCG@10	0.84	1.41	3.87 $\pm$ 0.10	4.05	3.76 $\pm$ 0.24	4.21	4.28 $\pm$ 0.14	5.10 $\pm$ 0.04	5.90 $\pm$ 0.05	5.86 $\pm$ 0.09	15.69%
Yelp	HR@5	1.52	1.42	1.72 $\pm$ 0.04	1.73	1.94 $\pm$ 0.11	1.71	2.29 $\pm$ 0.03	5.04 $\pm$ 0.06	5.25 $\pm$ 0.12	5.35 $\pm$ 0.02	6.15%
	HR@10	2.63	2.53	2.86 $\pm$ 0.03	2.88	3.35 $\pm$ 0.08	2.97	3.92 $\pm$ 0.10	6.01 $\pm$ 0.09	7.72 $\pm$ 0.18	7.84 $\pm$ 0.04	30.45%
	NDCG@5	0.91	0.80	1.07 $\pm$ 0.03	0.99	1.19 $\pm$ 0.06	1.12	1.44 $\pm$ 0.01	3.19 $\pm$ 0.08	3.28 $\pm$ 0.06	3.43 $\pm$ 0.02	7.52%
	NDCG@10	1.34	1.29	1.44 $\pm$ 0.01	1.42	1.64 $\pm$ 0.06	1.52	1.97 $\pm$ 0.05	3.60 $\pm$ 0.13	4.03 $\pm$ 0.08	4.15 $\pm$ 0.01	15.39%

5.1. Experimental Setup

5.1.1. Datasets

We conduct experiments on four datasets: Sports, Beauty, Toys, and Yelp. Sports, Beauty, and Toys are three subcategories of the Amazon review data introduced in (McAuley et al., 2015). Yelp (https://www.yelp.com/dataset) is a dataset for business recommendation. We follow (Zhou et al., 2020; Xie et al., 2020; Ma et al., 2020; Qiu et al., 2021a) to prepare the datasets. In detail, we only keep the “5-core” datasets, in which all users and items have at least 5 interactions. The statistics of the prepared datasets are summarized in Appendix D.
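
One common way to obtain such a “5-core” subset is to repeatedly drop users and items with too few interactions until nothing changes. The sketch below illustrates that idea; whether the cited preprocessing filters iteratively in exactly this way is an assumption, and the function name `k_core` is hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Interaction = Tuple[Hashable, Hashable]  # (user, item)


def k_core(interactions: Sequence[Interaction], k: int = 5) -> List[Interaction]:
    """Drop users and items with fewer than k interactions until stable."""
    data = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in data)
        item_counts = Counter(i for _, i in data)
        kept = [(u, i) for u, i in data
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(data):  # nothing was removed: fixed point reached
            return kept
        data = kept


toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "c")]
print(k_core(toy, k=2))  # ("u3", "c") is removed, the rest survive
```

The loop is needed because removing a sparse item can push a user below the threshold, and vice versa.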

5.1.2. Evaluation Metrics

For a fair comparison, we follow previous works (Wang et al., 2019; Krichene and Rendle, 2020) and rank the predictions over the whole item set without negative sampling. Performance is evaluated with two Top-K metrics, Hit Ratio $@k$ ( $\mathrm{HR}@k$ ) and Normalized Discounted Cumulative Gain $@k$ ( $\mathrm{NDCG}@k$ ), where $k\in\{5,10\}$ .
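The two metrics can be sketched as follows. This is a minimal illustration, assuming a single ground-truth next item per test sequence (so the ideal DCG is 1) and that ties are resolved in favor of the target; the function names are hypothetical and not taken from the released code.

```python
import math


def rank_of_target(scores, target):
    """0-based rank of the target when all items are sorted by score (descending).

    Assumption: ties are resolved in favor of the target item.
    """
    target_score = scores[target]
    return sum(1 for s in scores if s > target_score)


def hr_at_k(rank, k):
    """Hit Ratio@k for one test case: 1 if the target is in the top k."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank, k):
    """NDCG@k for one test case with a single relevant item."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


# Scores over the whole item set (no sampled negatives), target item id 3.
scores = [0.1, 0.9, 0.3, 0.7, 0.5]
rank = rank_of_target(scores, target=3)
print(rank, hr_at_k(rank, 5), ndcg_at_k(rank, 5))
```

Averaging these per-case values over all test sequences gives the dataset-level $\mathrm{HR}@k$ and $\mathrm{NDCG}@k$ .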

5.1.3. Baselines

We compare our approach with three groups of representative baselines. (i) SR models with uniform negative samplers, including Caser (Tang and Wang, 2018), GRU4Rec (Hidasi et al., 2015), SASRec (Kang and McAuley, 2018), and S ${}^{3}\text{-Rec}$ (Zhou et al., 2020). We omit non-sequential models such as BPR-MF (Rendle et al., 2012) and simple item-popularity-based methods, which have been shown to be weaker than SR methods on these datasets (Sun et al., 2019b; Zhou et al., 2020; Ma et al., 2020). (ii) SR models with other negative sampling strategies, including DSSRec (Ma et al., 2020), CL4SRec (Xie et al., 2020), and MMInfoRec (Qiu et al., 2021a). These works also propose different heuristic hard negative mining strategies to further improve the quality of negative samples. (iii) Additional negative sampling strategies. We also include the popularity-based method (Mikolov et al., 2013) from the NLP domain, which samples negative items in proportion to a power of item frequency, $Q(i)\propto Pop(i)^{\gamma}$ , denoted as SASRec ${}_{\text{pop}_{\gamma}}$ . Detailed descriptions of these baselines are in Appendix E.
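The popularity-based baseline sampler $Q(i)\propto Pop(i)^{\gamma}$ can be sketched as below. This is an illustrative sketch only: the input format (a flat list of training item ids), the rejection loop for already-interacted items, and all names are assumptions rather than details taken from the baselines' code.

```python
import random
from collections import Counter


def make_popularity_sampler(train_items, gamma, seed=0):
    """Build a sampler with Q(i) proportional to Pop(i) ** gamma.

    train_items: flat list of item ids seen in training (hypothetical format).
    """
    pop = Counter(train_items)
    items = sorted(pop)
    weights = [pop[i] ** gamma for i in items]
    rng = random.Random(seed)

    def sample(exclude):
        """Draw one negative item outside the user's interacted set."""
        exclude = set(exclude)
        if exclude.issuperset(items):
            raise ValueError("no candidate negatives left")
        while True:
            item = rng.choices(items, weights=weights, k=1)[0]
            if item not in exclude:
                return item

    return sample, dict(zip(items, weights))


train_items = [1, 1, 1, 1, 2, 2, 3]
# gamma = 0 gives equal weights (uniform sampling over seen items).
sample_uniform, w0 = make_popularity_sampler(train_items, gamma=0.0)
# gamma = 1 samples in proportion to raw item frequency.
sample_pop, w1 = make_popularity_sampler(train_items, gamma=1.0)
print(w0, w1, sample_pop(exclude={1}))
```

Note that this distribution is fixed before training starts, so it depends neither on the user nor on the current model parameters.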

5.1.4. Implementation Details

Caser (https://github.com/graytowne/caser_pytorch), S ${}^{3}\text{-Rec}$ (https://github.com/RUCAIBox/CIKM2020-S3Rec), and MMInfoRec (https://github.com/RuihongQiu/MMInfoRec) are provided by the authors. GRU4Rec (https://github.com/slientGe/Sequential_Recommendation_Tensorflow) and DSSRec (https://github.com/abinashsinha330/DSSRec) are implemented based on public resources. SASRec is implemented based on S ${}^{3}\text{-Rec}$ and we implement CL4SRec in PyTorch. The number of attention heads and the number of self-attention layers in SASRec, S ${}^{3}\text{-Rec}$ , and DSSRec are tuned from $\{1,2,4\}$ and $\{1,2,3\}$ , respectively. The number of latent factors introduced in DSSRec is tuned from $\{1,2,\dots,8\}$ . For SASRec ${}_{\text{pop}_{\gamma}}$ , we tune $\gamma$ from 0 to 1.5.

We implement two variants of our approach, $\textsf{GenNi}_{\text{SA}}$ and $\textsf{GenNi}_{S^{3}}$ , with PyTorch. They take SASRec and S ${}^{3}\text{-Rec}$ as base models and replace the uniform sampler with our proposed GenNi. Models are optimized by an Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001, $\beta_{1}=0.9$ , $\beta_{2}=0.999$ , and a batch size of 256. Early stopping is used during training: a model stops training if its performance on the validation set does not improve for 40 successive epochs. For the hyper-parameters in GenNi, $\alpha$ is tuned from 0 to 6, $\beta$ is tuned from 0.0001 to 1.0 on an exponential scale, and the number of negative items $k$ is tuned from 1 to 10. We also provide results using a self-adjusted curriculum learning strategy (see Section 3.4) that reduces the need to tune $\alpha$ . All experiments are run on a single Tesla V100 GPU and we report the average results over 4 different random seeds on the test set.
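The early stopping criterion described above can be sketched as a small helper. This is a hedged illustration: the class name and interface are hypothetical, and the validation metric is assumed to be one where higher is better (e.g., NDCG@10).

```python
class EarlyStopping:
    """Stop when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience=40):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric; return True if training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a short patience so the behavior is easy to follow.
stopper = EarlyStopping(patience=3)
history = [stopper.step(m) for m in [0.10, 0.20, 0.20, 0.19, 0.18]]
print(history)
```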

All code shall be released upon publication.

5.2. Performance Comparisons

Table 1 shows overall recommendation performance of all models on the four datasets. We observe that:

•

Our methods $\textsf{GenNi}_{\text{SA}}$ and $\textsf{GenNi}_{S^{3}}$ both consistently outperform existing methods on all datasets by a large margin. The average improvement compared with the best baseline ranges from 6.15% to 49.05%. Specifically, by simply replacing the original uniform sampler with GenNi, our approaches achieve 96.02% and 107.66% average NDCG@5 improvements over SASRec and S ${}^{3}\text{-Rec}$ , respectively, across the four datasets. This observation shows that sampling informative negative items is as important as other components in making SR successful, and it also demonstrates the effectiveness of our proposed sampler GenNi.
•

The Transformer is an effective way of encoding users' sequential dynamic patterns. Comparing GRU4Rec, Caser, SASRec, and S ${}^{3}\text{-Rec}$ , we can see that SASRec and S ${}^{3}\text{-Rec}$ , which use a Transformer-based encoder, consistently achieve better performance than the CNN/RNN-based encoders Caser and GRU4Rec. S ${}^{3}\text{-Rec}$ performs better than SASRec on most datasets because it fuses additional item attributes during pre-training. However, all these methods sample negative items randomly from the users' non-interacted item sets, which yields a sub-optimally trained model.
•

Among the different negative sampling strategies, SASRec ${}_{\text{pop}_{\gamma}}$ performs slightly better than SASRec, indicating that the popularity-based method can help improve model learning. However, this strategy is static and does not consider the personalization of each user's behavior, resulting in a large performance gap compared to $\textsf{GenNi}_{\text{SA}}$ . DSSRec, CL4SRec, and MMInfoRec, which propose different contrastive self-supervised learning paradigms, outperform the other baselines that only train with an NIP objective. This observation demonstrates the effectiveness of contrastive self-supervised learning. These three methods commonly consider items in the whole training batch as negatives, and MMInfoRec also proposes a heuristic hard negative mining strategy with a memory bank to further improve the quality of the samples. Their successes suggest that sampling more negative items and hard negative mining also benefit model learning. Although MMInfoRec is the best baseline method, it still performs worse than our approaches. The reason might be twofold. First, considering all items in the training batch as negative items can introduce false-negative samples. Second, heuristic hard negative mining (e.g., considering a user's historically interacted items as hard negatives in MMInfoRec) is not adaptive to the model parameters. As a result, the sampled hard negatives can gradually become uninformative to the model.

5.3. Training Efficiency Comparison

SASRec has proven to be an order of magnitude faster than CNN and RNN-based recommendation methods (Kang and McAuley, 2018), such as Caser and GRU4Rec. In this section, we evaluate the efficiency of $\textsf{GenNi}_{\text{SA}}$ (on the Beauty dataset) by comparing with the most efficient baseline SASRec and the best performing baseline MMInfoRec (See Appendix F for result comparisons on other datasets). We omit the comparisons of $\textsf{GenNi}_{S^{3}}$ as it has the same computation cost as $\textsf{GenNi}_{\text{SA}}$ in its training stage. The only difference is that $\textsf{GenNi}_{S^{3}}$ requires a pre-training stage to fuse item attributes in the model.

Figure 4 shows the performance w.r.t. training (wall-clock) time as well as the computation cost per epoch. Replacing the uniform sampler with GenNi does introduce additional computation cost; for example, SASRec spends 2.44 seconds on model updates for one epoch while $\textsf{GenNi}_{\text{SA}}$ ( $\beta=1$ ) requires 6.30s/epoch. However, $\textsf{GenNi}_{\text{SA}}$ converges to much higher performance and requires fewer training epochs to converge. Moreover, when we reduce $\beta$ to 0.1, $\textsf{GenNi}_{\text{SA}}$ ( $\beta=0.1$ ) needs only 2.47 seconds to update the model for one epoch, which is close to SASRec (2.44s/epoch), and still performs better than SASRec. Although MMInfoRec is the best-performing baseline, it requires 34.22 seconds on model updates for one epoch. $\textsf{GenNi}_{\text{SA}}$ ( $\beta=1.0$ ) and $\textsf{GenNi}_{\text{SA}}$ ( $\beta=0.1$ ) are over 5.42 and 13.85 times faster per epoch, respectively, and also perform better than MMInfoRec.
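The per-epoch ratios quoted above follow directly from the reported timings on Beauty. The short calculation below reproduces them; the dictionary keys are just labels chosen for this illustration, and only per-epoch cost is compared, not total time to convergence.

```python
# Seconds per epoch on Beauty, as reported in the text.
seconds_per_epoch = {
    "SASRec": 2.44,
    "GenNi_SA (beta=1.0)": 6.30,
    "GenNi_SA (beta=0.1)": 2.47,
    "MMInfoRec": 34.22,
}

# Speedup of each GenNi variant relative to MMInfoRec.
speedup_vs_mminforec = {
    name: seconds_per_epoch["MMInfoRec"] / t
    for name, t in seconds_per_epoch.items()
    if name.startswith("GenNi")
}

# Overhead of each GenNi variant relative to the uniform-sampling SASRec.
overhead_vs_sasrec = {
    name: t / seconds_per_epoch["SASRec"]
    for name, t in seconds_per_epoch.items()
    if name.startswith("GenNi")
}

for name in speedup_vs_mminforec:
    print(name, round(speedup_vs_mminforec[name], 2), round(overhead_vs_sasrec[name], 2))
```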

5.4. Hyper-parameter Sensitivity

GenNi introduces two hyper-parameters, $\alpha$ and $\beta$ , which control the difficulty of the sampled negatives and the computation cost of negative item generation, respectively. The number of negative samples is set to $k=1$ , the same as in the original SASRec setting, for a fair comparison. We also study model sensitivity to the number of negative samples $k$ , the embedding size, and the learning rate.

Impact of the informativeness of negative items $\alpha$ . Figure 5 shows the influence of $\alpha$ on model performance over four datasets. Model performance first increases as $\alpha$ increases and then reaches a peak. Specifically, the model performs best on Beauty when $\alpha=2.5$ and on Yelp when $\alpha=4.4$ . Note that when $\alpha=0$ , GenNi becomes a uniform sampler. The large optimal $\alpha$ shows that randomly sampled items can become uninformative as training proceeds, while sampling items that are currently hard to classify correctly can further improve the model. Similar observations are found on Sports and Toys.

Impact of $\beta$ for accelerating generation. Figure 6 shows model performance w.r.t. a fixed $\beta$ value. Interestingly, we find an elbow point of $\beta$ that balances the effectiveness and efficiency of GenNi well. For example, $\beta=0.1$ reduces the computation cost of GenNi by about $90\%$ while the model still achieves about 95% of the performance (e.g., NDCG@5) of its original version ( $\beta=1.0$ ) on Beauty. On the one hand, this shows the strength of GenNi's two-stage design, which exploits the efficiency of random sampling to pre-select a portion of the items in the first stage and then concentrates on finding informative ones with a slower but more accurate sampling strategy. On the other hand, the drop in performance at small $\beta$ indicates that the number of informative items decreases as training proceeds, so a $\beta$ that is too small can filter out all of these items in the pre-selection stage. As introduced in Section 3.3.3, we also report results for gradually increasing $\beta$ via Eq. 7 in Table 2. Gradually increasing $\beta$ achieves a similar effect to a fixed $\beta=1.0$ : because the number of informative items decreases as training proceeds, $\beta$ can be small in the early training stage and still capture informative items. This strategy reduces the computation cost while achieving a similar effect to a fixed $\beta=1.0$ .
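The two-stage idea discussed in this section (uniform pre-selection of a $\beta$ fraction of items, followed by score-aware sampling whose sharpness is controlled by $\alpha$ ) can be sketched as follows. This is not the paper's implementation: the exact sampling distribution is defined in Section 3, and the softmax over $\alpha$ -scaled scores used here is an assumption, chosen only because it reduces to uniform sampling at $\alpha=0$ as stated above. All names are hypothetical.

```python
import math
import random


def two_stage_negative(scores, positives, alpha, beta, rng):
    """Sample one negative item id.

    scores: current model scores for every item (higher = more preferred).
    positives: set of item ids the user has interacted with.
    alpha: sharpness of the second stage (assumed softmax temperature).
    beta: fraction of candidate items kept by the uniform first stage.
    """
    candidates = [i for i in range(len(scores)) if i not in positives]
    if not candidates:
        raise ValueError("no candidate negatives left")
    n_keep = min(len(candidates), max(1, math.ceil(beta * len(candidates))))
    # Stage 1: cheap uniform pre-selection.
    pool = rng.sample(candidates, n_keep)
    # Stage 2: score-aware sampling restricted to the pool (assumed form).
    shift = max(alpha * scores[i] for i in pool)  # for numerical stability
    weights = [math.exp(alpha * scores[i] - shift) for i in pool]
    return rng.choices(pool, weights=weights, k=1)[0]


rng = random.Random(0)
scores = [0.0, 5.0, 1.0, 4.0]
# Item 1 is a positive; with a large alpha the hardest remaining item (3) dominates.
hard = [two_stage_negative(scores, {1}, alpha=50.0, beta=1.0, rng=rng) for _ in range(20)]
# With alpha = 0 every pooled item is equally likely.
easy = [two_stage_negative(scores, {1}, alpha=0.0, beta=0.5, rng=rng) for _ in range(50)]
print(hard[:5], easy[:5])
```

In this sketch the second stage only scores `n_keep` items instead of the full item set, which is where the cost reduction from a small $\beta$ would come from; it also makes visible why a very small pool can miss the few informative items.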

Table 2. Comparison of a fixed $\beta$ or gradually increasing $\beta$ .

Strategy	Setting	Beauty		Sports		Toys	
		HR@5	NDCG@5	HR@5	NDCG@5	HR@5	NDCG@5
fixed $\beta$	$\beta=0.1$	6.09	4.33	3.18	2.14	6.50	4.72
fixed $\beta$	$\beta=1.0$	6.30	4.48	3.55	2.57	7.18	5.15
Gradually Increasing	$m=20$	6.35	4.53	3.50	2.52	7.16	5.07
Gradually Increasing	$m=40$	6.31	4.47	3.55	2.50	7.11	5.13

Impact of the number of negative samples $k$ . Figure 7 shows the impact of the number of negative samples. We observe diminishing returns in the performance improvement for both SASRec and $\textsf{GenNi}_{\text{SA}}$ . However, $\textsf{GenNi}_{\text{SA}}$ consistently outperforms SASRec, which further verifies the importance of sampling informative negative items. Note that training with additional negative samples increases the time cost linearly, whereas $\textsf{GenNi}_{\text{SA}}$ with only 1 negative sample can even outperform SASRec with 9 negative samples on Beauty and Sports. See Appendix G for additional results on Toys and Yelp, and for the sensitivity to the embedding size and learning rate.

5.5. Ablation Study

5.5.1. Benefits of Self-Adjusted Curriculum Learning

As we can see from Figure 5, model performance is sensitive to $\alpha$ ; in general, a larger $\alpha$ benefits model performance. To reduce the effort of tuning $\alpha$ for GenNi, we also propose a self-adjusted curriculum learning strategy that lets the model adjust $\alpha$ based on its current performance. Figure 8 shows the sensitivity to the initial $\alpha$ . With this strategy, model performance is less sensitive to the initial $\alpha$ value.

5.5.2. GenNi For Improving BPR Loss

As discussed in Section 4.2, training an SR model with a sequential BPR loss can suffer from a vanishing gradient issue when using additional negative samples ( $k>1$ ). In this section, we conduct experiments to show that GenNi can help alleviate this issue. We train SASRec with a sequential BPR loss and replace the uniform sampling strategy used in BPR with GenNi. Table 3 compares uniform sampling and GenNi in HR@5 and NDCG@5 (see Appendix H for additional results). We see that SASRec cannot benefit from more negative samples when training with BPR loss because of the vanishing gradient issue. After replacing the uniform sampler with GenNi, the model’s performance improves with more negative samples.
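One way to see why uninformative negatives contribute little under a BPR-style loss is to look at the gradient of a single pairwise term, $-\log\sigma(x_{+}-x_{-})$ , with respect to the negative score: it equals $\sigma(x_{-}-x_{+})$ , which is close to zero once the model already ranks the negative far below the positive. The sketch below illustrates this standard property of the pairwise loss; it is not the analysis of Section 4.2 itself, and the score values are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bpr_loss(pos_score, neg_score):
    """Pairwise BPR term for one (positive, negative) pair."""
    return -math.log(sigmoid(pos_score - neg_score))


def bpr_grad_wrt_negative(pos_score, neg_score):
    """Analytic derivative of bpr_loss with respect to neg_score."""
    return sigmoid(neg_score - pos_score)


pos = 4.0
easy_negative = -4.0   # already ranked far below the positive
hard_negative = 3.5    # still scored close to the positive

print(bpr_grad_wrt_negative(pos, easy_negative))  # tiny gradient signal
print(bpr_grad_wrt_negative(pos, hard_negative))  # much larger gradient signal
```

Under this view, adding more easy negatives adds terms whose gradients are nearly zero, whereas negatives that the current model scores highly keep the gradient signal alive.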

Table 3. Effectiveness of GenNi for improving BPR loss (SASRec is the base SR model).

Additional Negatives		1	2	3	4	5	6	7	8
Uniform	HR@5	2.32	2.16	2.21	2.34	2.13	2.14	2.24	2.07
Uniform	NDCG@5	1.42	1.28	1.33	1.36	1.27	1.31	1.34	1.25
GenNi ( $\alpha=2.2$ )	HR@5	5.64	5.70	5.83	5.81	5.93	5.87	5.96	6.08
GenNi ( $\alpha=2.2$ )	NDCG@5	3.90	4.04	4.11	4.07	4.12	4.16	4.23	4.25

5.6. Case Study

We conduct a case study on the Sports dataset (McAuley et al., 2015) to show examples of dynamically changing informative negative items. Figure 9 visualizes the items that are informative to the SR model. When the user reviews a water bottle, the cup holder is the most informative item; when the user instead reviews earphones, the most informative item changes to a gym bike, and so on. We can also observe that the distribution of informative negatives is close to uniform initially and gradually diversifies as training proceeds.

6. Conclusion

In this work, we identified the dynamics of informative negative items in sequential recommender systems, which arise from the dynamics of users’ interests and the updates of the model’s parameters during training. We proposed a negative item generator, GenNi, to adaptively generate informative negative samples for training sequential recommenders. Extensive studies on four datasets show that informative negative sampling is crucial for training sequential recommendation models well and demonstrate the superiority of GenNi. The detailed analysis also verifies the effectiveness and efficiency of GenNi.

References

Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning. 41–48.
Bengio and Senécal (2008) Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks 19, 4 (2008), 713–722.
Blanc and Rendle (2018) Guy Blanc and Steffen Rendle. 2018. Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning. PMLR, 590–599.
Cai and Wang (2017) Liwei Cai and William Yang Wang. 2017. Kbgan: Adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071 (2017).
Chen et al. (2022a) Yongjun Chen, Jia Li, and Caiming Xiong. 2022a. ELECRec: Training Sequential Recommenders as Discriminators. arXiv preprint arXiv:2204.02011 (2022).
Chen et al. (2022b) Yongjun Chen, Zhiwei Liu, Jia Li, Julian McAuley, and Caiming Xiong. 2022b. Intent Contrastive Learning for Sequential Recommendation. In Proceedings of the ACM Web Conference 2022. 2172–2182.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 297–304.
Gutmann and Hyvärinen (2012) Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Journal of machine learning research 13, 2 (2012).
He and McAuley (2016) Ruining He and Julian McAuley. 2016. Fusing similarity models with markov chains for sparse sequential recommendation. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 191–200.
Hidasi and Karatzoglou (2018) Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of the 27th ACM international conference on information and knowledge management. 843–852.
Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In ICDM. IEEE, 197–206.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Krichene and Rendle (2020) Walid Krichene and Steffen Rendle. 2020. On Sampled Metrics for Item Recommendation. In SIGKDD. 1748–1757.
Kumar et al. (2010) M Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced learning for latent variable models. Advances in neural information processing systems 23 (2010).
Li et al. (2020) Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time Interval Aware Self-Attention for Sequential Recommendation. In WSDM. 322–330.
Li et al. (2021) Yang Li, Tong Chen, Peng-Fei Zhang, and Hongzhi Yin. 2021. Lightweight Self-Attentive Sequential Recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 967–977.
Liu et al. (2021) Zhiwei Liu, Yongjun Chen, Jia Li, Philip S Yu, Julian McAuley, and Caiming Xiong. 2021. Contrastive Self-supervised Sequential Recommendation with Robust Augmentation. arXiv preprint arXiv:2108.06479 (2021).
Ma et al. (2020) Jianxin Ma, Chang Zhou, Hongxia Yang, Peng Cui, Xin Wang, and Wenwu Zhu. 2020. Disentangled self-supervision in sequential recommenders. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 483–491.
Ma and Collins (2018) Zhuang Ma and Michael Collins. 2018. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812 (2018).
McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR. 43–52.
Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
Mnih and Teh (2012) Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426 (2012).
Park and Chang (2019) Dae Hoon Park and Yi Chang. 2019. Adversarial sampling and training for semi-supervised information retrieval. In The World Wide Web Conference. 1443–1453.
Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
Qiu et al. (2021a) Ruihong Qiu, Zi Huang, and Hongzhi Yin. 2021a. Memory Augmented Multi-Instance Contrastive Predictive Coding for Sequential Recommendation. arXiv preprint arXiv:2109.00368 (2021).
Qiu et al. (2021b) Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2021b. Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation. arXiv preprint arXiv:2110.05730 (2021).
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International conference on data mining. IEEE, 995–1000.
Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618 (2012).
Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on World wide web. 811–820.
Robinson et al. (2020) Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. 2020. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592 (2020).
Sun et al. (2019b) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019b. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In CIKM. 1441–1450.
Sun et al. (2019a) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019a. Rotate: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197 (2019).
Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web. 1067–1077.
Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM. 565–573.
Taylor (1953) Wilson L Taylor. 1953. “Cloze procedure”: A new tool for measuring readability. Journalism quarterly 30, 4 (1953), 415–433.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
Wang et al. (2017) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. 515–524.
Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval. 165–174.
Xie et al. (2020) Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Bolin Ding, and Bin Cui. 2020. Contrastive Learning for Sequential Recommendation. arXiv preprint arXiv:2010.14395 (2020).
Yang et al. (2020) Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. 2020. Understanding negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1666–1676.
Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
Zhang et al. (2019) Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, and Xiaofang Zhou. 2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation. In IJCAI. 4320–4326.
Zhang et al. (2013) Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing top-n collaborative filtering via dynamic negative item sampling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 785–788.
Zhang and Stratos (2021) Wenzheng Zhang and Karl Stratos. 2021. Understanding Hard Negatives in Noise Contrastive Estimation. arXiv preprint arXiv:2104.06245 (2021).
Zhou et al. (2021) Chang Zhou, Jianxin Ma, Jianwei Zhang, Jingren Zhou, and Hongxia Yang. 2021. Contrastive learning for debiased candidate generation in large-scale recommender systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3985–3995.
Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1893–1902.

Appendix A Proof for Theorem 1

We can derive from the discrete version of NCE theory (See (Ma and Collins, 2018) for assumptions that make the conclusion hold) that there exists an integer $k_{0}$ such that, for a large sample size $|B|$ , for any $k>k_{0}$ (number of negative items),

(8)

\begin{split}\sqrt{|\mathcal{B}||T|}(\hat{\theta}-\theta^{*})\Rightarrow\mathcal{N}(0,\mathcal{I}_{k}^{-1}),\end{split}

for some matrix $\mathcal{I}_{k}^{-1}$ . So there exists a constant value $C$ such that for any $k>k_{0}$ , the mean square error (MSE) (aka risk) of the parameter estimation is bounded by:

(9)

\begin{split}\mathbb{E}_{(u,t)\in\mathcal{B}}[||\hat{\theta}-\theta^{*}||^{2}]=&\frac{1}{|\mathcal{B}||T|}(\frac{1}{P(\mathbf{s}_{t}^{u}|\mathbf{h}_{t}^{u})}+\frac{1}{kQ(\mathbf{s}_{t}^{u}|\mathbf{h}_{t}^{u})}-\frac{k+1}{k})\\ \leq&C/(k|T||\mathcal{B}|)\end{split}

As $k$ grows, the risk of the parameter estimate decreases, which can improve model performance. Alternatively, the reader can follow (Mnih and Teh, 2012) to calculate the gradient of Eq. 4 with respect to $\theta$ and see that, as $k\to\infty$ , the gradient of Eq. 4 approaches the maximum likelihood gradient (Eq. 1). Eq. 9 also shows that as $|\mathcal{B}|$ and $|T|$ grow, $Q(\cdot)$ becomes less important, since the risk is bounded by $C/(k|T||\mathcal{B}|)$ . Interested readers can refer to (Gutmann and Hyvärinen, 2012; Ma and Collins, 2018) for a comprehensive review of NCE.
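To make the dependence on $k$ and $Q$ concrete, the middle expression of Eq. 9 can be evaluated numerically for a single item and context. The sketch below simply plugs numbers into the expression as written; the probability values, batch size, and sequence length are arbitrary illustrative choices, not values from the experiments.

```python
def nce_risk(p, q, k, batch_size, seq_len):
    """Middle expression of Eq. 9 for one (item, context) pair.

    p: P(s | h), q: Q(s | h), k: number of negatives,
    batch_size: |B|, seq_len: |T|.
    """
    return (1.0 / p + 1.0 / (k * q) - (k + 1.0) / k) / (batch_size * seq_len)


# The risk shrinks as the number of negatives k grows ...
for k in (1, 2, 5, 10):
    print("k =", k, nce_risk(p=0.5, q=0.01, k=k, batch_size=256, seq_len=50))

# ... and, for fixed k, it is smaller when the sampler assigns the item a larger Q.
for q in (0.01, 0.05, 0.2):
    print("q =", q, nce_risk(p=0.5, q=q, k=1, batch_size=256, seq_len=50))
```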

Appendix B Proof for Theorem 2

Proof.

The SR model is optimized through the following objective:

(10)

\begin{split}\mathcal{L}=\mathbb{E}_{(u,t)\in\mathcal{B}}\mathcal{L}_{t}^{u}=&-\mathbb{E}_{\mathbf{s}^{u}_{+,t+1}\sim P}\log(P(D=1|\mathbf{h}^{u}_{t},\mathbf{s}^{u}_{+,t+1}))\\ &-k\mathbb{E}_{\mathbf{s}^{u}_{-,t+1}\sim Q}\log(P(D=0|\mathbf{h}^{u}_{t},\mathbf{s}^{u}_{-,t+1}))\\ =&-\sum_{t}[\sum_{\mathbf{s}_{+,t+1}^{u}}P(\mathbf{s}_{+,t+1}^{u}|\mathbf{h}^{u}_{t})\log(\sigma(\mathbf{h}^{u}_{t}\cdot\mathbf{s}_{+,t+1}^{u}))\\ +&k\sum_{\mathbf{s}_{-,t+1}^{u}}Q(\mathbf{s}_{-,t+1}^{u}|\mathbf{h}^{u}_{t})\log(1-\sigma(\mathbf{h}^{u}_{t}\cdot\mathbf{s}_{-,t+1}^{u}))],\end{split}

where $\mathbf{s}^{u}_{+,t+1}$ and $\mathbf{s}^{u}_{-,t+1}$ are the target and negative items for user $u$ at time $t$ . The above equation can be simplified as

(11)

\begin{split}\mathcal{L}=&\sum_{t}\sum_{\mathbf{s}^{u}_{t+1}}(P(\mathbf{s}_{t+1}^{u}|\mathbf{h}^{u}_{t})+kQ(\mathbf{s}_{t+1}^{u}|\mathbf{h}^{u}_{t}))H(P^{\prime},P^{\prime\prime})\end{split}

where $P^{\prime}_{\mathbf{s}^{u}_{t+1},\mathbf{h}_{t}^{u}}(D=1)=\frac{P(\mathbf{s}^{u}_{t+1}|\mathbf{h}_{t}^{u})}{P(\mathbf{s}^{u}_{t+1}|\mathbf{h}_{t}^{u})+kQ(\mathbf{s}^{u}_{t+1}|\mathbf{h}_{t}^{u})}$ and $P^{\prime\prime}_{\mathbf{s}^{u}_{t+1},\mathbf{h}_{t}^{u}}(D=1)=\sigma(\mathbf{h}_{t}^{u}\cdot\mathbf{s}^{u}_{t+1})$ are two Bernoulli distributions, and $H(\cdot)$ measures the cross entropy between two distributions. By Gibbs' inequality, the minimizer of Eq. 10 satisfies $P^{\prime}=P^{\prime\prime}$ for every user interest $\mathbf{h}_{t}^{u}$ and candidate next item $\mathbf{s}^{u}_{t+1}$ , i.e.,

(12)

\begin{split}\frac{1}{1+e^{-\mathbf{h}^{u}_{t}\cdot\mathbf{s}_{t+1}^{u}}}=\frac{P(\mathbf{s}^{u}_{t+1}|\mathbf{h}_{t}^{u})}{P(\mathbf{s}^{u}_{t+1}|\mathbf{h}_{t}^{u})+k\cdot Q(\mathbf{s}^{u}_{t+1}|\mathbf{h}^{u}_{t})}.\end{split}

So the optimal embeddings should satisfy:

(13)

\begin{split}\mathbf{h}^{u}_{t}\cdot\mathbf{s}^{u}_{t+1}=-\log\frac{k\cdot Q(\mathbf{s}^{u}_{t+1}|\mathbf{h}^{u}_{t})}{P(\mathbf{s}^{u}_{t+1}|\mathbf{h}^{u}_{t})}.\end{split}

∎
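As a quick numerical sanity check of the last step, one can confirm that the logit in Eq. 13 makes the sigmoid equal to the posterior $P/(P+kQ)$ in Eq. 12. The probabilities below are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def optimal_logit(p, q, k):
    """Right-hand side of Eq. 13: -log(k * Q / P)."""
    return -math.log(k * q / p)


def posterior_positive(p, q, k):
    """Right-hand side of Eq. 12: P / (P + k * Q)."""
    return p / (p + k * q)


p, q, k = 0.3, 0.05, 4
print(optimal_logit(p, q, k), sigmoid(optimal_logit(p, q, k)), posterior_positive(p, q, k))
```

The same identity shows that, for a fixed $P$ , the optimal inner product of an item decreases as the sampler assigns it a larger $Q$ .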

Appendix C Convergence Analysis

One explanation of why GenNi is superior to heuristic samplers such as the uniform sampler is that it helps reduce the risk $\mathbb{E}[||\hat{\theta}-\theta^{*}||^{2}]$ . From Eq. 9 we can see that, as training proceeds, a randomly sampled item will most likely have a $Q(\mathbf{s}|\mathbf{h})$ value that is small relative to $P(\mathbf{s}|\mathbf{h})$ , i.e., the model has already learned to classify it as a negative sample, while the deviation of $\hat{\theta}$ is determined by the smaller of $Q(\mathbf{s}|\mathbf{h})$ and $P(\mathbf{s}|\mathbf{h})$ . Optimizing with a small $Q(\mathbf{s}|\mathbf{h})$ therefore often hinders accurate optimization. With GenNi, the sampled negatives often have a large $Q(\mathbf{s}|\mathbf{h})$ value, so the estimate can more accurately approximate the optimal $\theta^{*}$ .

Appendix D Data Information

The statistics of four datasets are shown in Table 4.

Table 4. Dataset information.

Dataset	Sports	Beauty	Toys	Yelp
$|\mathcal{U}|$	35,598	22,363	19,412	30,431
$|\mathcal{V}|$	18,357	12,101	11,924	20,033
# Actions	0.3m	0.2m	0.17m	0.3m
Avg. length	8.3	8.9	8.6	8.3
Sparsity	99.95%	99.95%	99.93%	99.95%

Appendix E Baseline Methods

We compare our approach with three groups of representative baselines.

•

SR models with uniform negative samplers. Caser (Tang and Wang, 2018), GRU4Rec (Hidasi et al., 2015), and SASRec (Kang and McAuley, 2018), which encode sequences with a CNN, an RNN, and a Transformer, respectively. S ${}^{3}\text{-Rec}$ (Zhou et al., 2020), which builds on SASRec with a pre-training stage to incorporate additional item attributes into the model. We omit non-sequential models such as BPR-MF (Rendle et al., 2012) and simple item-popularity-based methods, which are weaker than SR methods (Sun et al., 2019b; Zhou et al., 2020; Ma et al., 2020).
•

SR models with other negative sampling strategies. We compare with recent works that add a contrastive self-supervised learning objective to the NIP objective or replace the NIP objective with one: DSSRec (Ma et al., 2020), CL4SRec (Xie et al., 2020), and MMInfoRec (Qiu et al., 2021a). These works follow the contrastive learning paradigm, considering items in a training mini-batch as negatives, and propose different heuristic hard negative mining strategies to further improve the quality of negative samples.
•

Additional negative sampling strategies. We also include the popularity-based method (Mikolov et al., 2013) from the NLP domain, which samples negative items in proportion to a power of item frequency, $Q(i)\propto Pop(i)^{\gamma}$ , denoted as SASRec ${}_{\text{pop}_{\gamma}}$ .

Table 5. Comparison of $\textsf{GenNi}_{\text{SA}}$ against other models (in HR@5) w.r.t. the average (over 100 epochs) training time (seconds) per epoch.

Model		Beauty		Sports		Toys		Yelp
		time	HR	time	HR	time	HR	time	HR
SASRec		2.44	3.84	3.69	2.20	2.09	4.68	3.35	1.72
SASRec ${}_{\text{pop}_{\gamma}}$		2.45	4.08	3.66	2.12	2.11	4.97	3.36	1.58
MMInfoRec		34.22	5.25	58.18	2.78	43.20	6.02	54.29	5.04
$\textsf{GenNi}_{\text{SA}}$ (vary $\beta$ )	0.1	2.47	6.09	3.92	3.18	2.17	6.50	3.39	2.08
$\textsf{GenNi}_{\text{SA}}$ (vary $\beta$ )	1.0	6.30	6.30	7.25	3.55	3.13	7.18	6.56	2.27

Appendix F Additional Efficiency Comparisons

The training cost comparisons among SASRec, MMInfoRec, and $\textsf{GenNi}_{\text{SA}}$ over four datasets are reported in Table 5. In summary, $\beta$ balances the effectiveness and efficiency of GenNi. Besides, a model with GenNi (e.g., $\textsf{GenNi}_{\text{SA}}$ ) can achieve better performance than the best baseline MMInfoRec while using far fewer computational resources.

Appendix G Additional Results on Hyper-parameter Sensitivity

Impact of $k$ . The impact of $k$ on Toys and Yelp is shown in Figure 10. Similar to the observations on Beauty and Sports, models can be further improved by sampling additional negatives, while $\textsf{GenNi}_{\text{SA}}$ consistently outperforms SASRec.

Impact of embedding size and learning rate. The model’s sensitivity to the embedding size and learning rate is shown in Figure 11. Varying the learning rate or embedding size does influence the model’s final performance, but the impact follows a similar trend for SASRec and $\textsf{GenNi}_{\text{SA}}$ , and $\textsf{GenNi}_{\text{SA}}$ consistently outperforms SASRec in all settings.

Table 6. Effectiveness of GenNi for improving BPR loss (in HR@10 and NDCG@10).

Additional Negatives		1	2	3	4	5	6	7	8
Uniform	HR@10	3.99	4.01	4.15	4.22	3.92	3.83	4.07	3.89
Uniform	NDCG@10	1.96	1.87	1.95	1.96	1.84	1.86	1.93	1.83
GenNi ( $\alpha=2.2$ )	HR@10	7.62	8.08	8.09	8.24	8.22	8.34	8.35	8.32
GenNi ( $\alpha=2.2$ )	NDCG@10	4.48	4.80	4.84	4.85	4.86	4.95	4.99	4.95

Appendix H Additional Results on Ablation Study

Table 6 shows additional comparisons between uniform sampling and GenNi in HR@10 and NDCG@10 when training with BPR loss. We observe that SASRec cannot benefit from more negative samples when training with BPR loss, whereas GenNi alleviates the vanishing gradient issue, so the model’s performance improves steadily as more negative items are sampled.