This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Understanding Negative Sampling in Graph Representation Learning

Zhen Yang∗†, Ming Ding∗†, Chang Zhou, Hongxia Yang, Jingren Zhou, Jie Tang†§ Department of Computer Science and Technology, Tsinghua University DAMO Academy, Alibaba Group zheny2751@gmail.com, dm18@mails.tsinghua.edu.cn ericzhou.zc,yang.yhx,jingren.zhou@alibaba-inc.com jietang@tsinghua.edu.cn
(2020)
Abstract.
11footnotetext: Equal contribution. Ming Ding proposed the theories and method. Zhen Yang conducted the experiments. Codes are available at https://github.com/zyang-16/MCNS.44footnotetext: Corresponding Authors.

Graph representation learning has been extensively studied in recent years, in which sampling is a critical point. Prior arts usually focus on sampling positive node pairs, while the strategy for negative sampling is left insufficiently explored. To bridge the gap, we systematically analyze the role of negative sampling from the perspectives of both objective and risk, theoretically demonstrating that negative sampling is as important as positive sampling in determining the optimization objective and the resulted variance. To the best of our knowledge, we are the first to derive the theory and quantify that a nice negative sampling distribution is pn(u|v)pd(u|v)α,0<α<1p_{n}(u|v)\propto p_{d}(u|v)^{\alpha},0<\alpha<1. With the guidance of the theory, we propose MCNS, approximating the positive distribution with self-contrast approximation and accelerating negative sampling by Metropolis-Hastings. We evaluate our method on 5 datasets that cover extensive downstream graph learning tasks, including link prediction, node classification and recommendation, on a total of 19 experimental settings. These relatively comprehensive experimental results demonstrate its robustness and superiorities.

Negative Sampling; Graph Representation Learning; Network Embedding
journalyear: 2020copyright: acmcopyrightconference: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 23–27, 2020; Virtual Event, CA, USAbooktitle: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), August 23–27, 2020, Virtual Event, CA, USAprice: 15.00doi: 10.1145/3394486.3403218isbn: 978-1-4503-7998-4/20/08ccs: Mathematics of computing Graph algorithmsccs: Computing methodologies Learning latent representations

1. Introduction

Refer to caption
Figure 1. The SampledNCE framework. Positive pairs are sampled implicitly or explicitly according to the graph representation methods, while negative pairs are from a pre-defined distribution, both composing the training data of contrastive learning.

Recent years have seen the graph representation learning gradually stepping into the spotlight of data mining research. The mainstream graph representation learning algorithms include traditional Network Embedding methods (e.g. DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015)) and Graph Neural Networks (e.g. GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017)), although the latter sometimes are trained end-to-end in classification tasks. These methods transform nodes in the graph into low-dimensional vectors to facilitate a myriad of downstream tasks, such as node classification (Perozzi et al., 2014), link prediction (Zhou et al., 2017) and recommendation (Ying et al., 2018).

Most graph representation learning can be unified within a Sampled Noise Contrastive Estimation (SampledNCE) framework (§ 2), comprising of a trainable encoder to generate node embeddings, a positive sampler and a negative sampler to sample positive and negative nodes respectively for any given node. The encoder is trained to distinguish positive and negative pairs based on the inner products of their embeddings (See Figure 1 for illustration).

Massive network embedding works have investigated good criteria to sample positive node pairs, such as by random walk (Perozzi et al., 2014), the second-order proximity (Tang et al., 2015), community structure (Wang et al., 2017a), etc. Meanwhile, the line of GNN papers focuses on advanced encoders, evolving from basic graph convolution layers (Kipf and Welling, 2017) to attention-based (Veličković et al., 2018), GAN-based (Ding et al., 2018), sampled aggregator (Hamilton et al., 2017) and WL kernel-based (Xu et al., 2018) layers. However, the strategy for negative sampling is relatively unexplored.

Negative sampling (Mikolov et al., 2013) is firstly proposed to serve as a simplified version of noise contrastive estimation (Mnih and Kavukcuoglu, 2013; Gutmann and Hyvärinen, 2012), which is an efficient way to compute the partition function of an unnormalized distribution to accelerate the training of word2vec. Mikolov et al. (2013) set the negative sampling distribution proportional to the 3/43/4 power of degree by tuning the power parameter. Most later works on network embedding (Perozzi et al., 2014; Tang et al., 2015) readily keep this setting. However, results in several papers (Zhang and Zweigenbaum, 2018; Caselles-Dupré et al., 2018; Ying et al., 2018) demonstrate that not only the distribution of negative sampling has a huge influence on the performance, but also the best choice varies largely on different datasets (Caselles-Dupré et al., 2018).

To the best of our knowledge, few papers systematically analyzed or discussed negative sampling in graph representation learning. In this paper, we first examine the role of negative sampling from the perspectives of objective and risk. Negative sampling is as important as positive sampling to determine the optimization objective and reduce the variance of the estimation in real-world graph data. According to our theory, the negative sampling distribution should be positively but sub-linearly correlated to their positive sampling distribution. Although our theory contradicts with the intuition “positively sampling nearby nodes and negatively sampling the faraway nodes”, it accounts for the observation in previous works (Ying et al., 2018; Zhang and Zweigenbaum, 2018; Gao and Huang, 2018) where additional hard negative samples improve performance.

We propose an effective and scalable negative sampling strategy, Markov chain Monte Carlo Negative Sampling (MCNS), which applies our theory with an approximated positive distribution based on current embeddings. To reduce the time complexity, we leverage a special Metropolis-Hastings algorithm (Metropolis et al., 1953) for sampling. Most graphs naturally satisfy the assumption that adjacent nodes share similar positive distributions, thus the Markov chain from neighbors can be continued to skip the burn-in period and finally accelerates the sampling process without degradations in performance.

We evaluate MCNS on three most general graph-related tasks, link prediction, node classification and recommendation. Regardless of the network embedding or GNN model used, significant improvements are shown by replacing the original negative sampler to MCNS. We also collect many negative sampling strategies from information retrieval (Weston et al., 2011; Wang et al., 2017b), ranking in recommendation (Hu et al., 2008; Zhang et al., 2013) and knowledge graph embeddings (Cai and Wang, 2018; Zhang et al., 2019). Our experiments in five datasets demonstrate that MCNS stably outperforms these negative sampling strategies.

2. Framework

Input: Graph G=(V,E)G=(V,E), batch size mm, Encoder Eθ{E}_{\theta},
Positive Sampling Distribution pd^\hat{p_{d}},
Negative Sampling Distribution pnp_{n},
Number of Negative Samples kk.
repeat
       Sample mm nodes from p(v)deg(v)p(v)\propto deg(v)
       Sample 11 positive node from pd^(u|v)\hat{p_{d}}(u|v) for each node vv
       Sample kk negative nodes from pn(u|v)p_{n}(u^{\prime}|v) for each node vv
       Initialize loss \mathcal{L} as zero
       for each node vv do
             for each positive node uu for vv do
                   =logσ(Eθ(u)Eθ(v))\mathcal{L}=\mathcal{L}-\log\sigma(E_{\theta}(u)\cdot E_{\theta}(v))
             end for
            for each negative node uu^{\prime} for vv do
                   =log(1σ(Eθ(u)Eθ(v)))\mathcal{L}=\mathcal{L}-\log(1-\sigma(E_{\theta}(u^{\prime})\cdot E_{\theta}(v)))
             end for
            
       end for
      Update θ\theta by descending the gradients θ\nabla_{\theta}\mathcal{L}
until Convergence;
Algorithm 1 The Sampled Noise Contrastive Estimation Framework

In this section, we introduce the preliminaries to understand negative sampling, including the unique embedding transformation, the SampledNCE framework and how the traditional network embeddings and GNNs are unified into the framework.

2.1. Unique Embedding Transformation

Many network embedding algorithms, such as DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015) and node2vec (Grover and Leskovec, 2016), actually assign each node two embeddings 111Others, for example GraphSAGE (Hamilton et al., 2017) use a unique embedding., node embedding and contextual embedding. When sampled as context, the contextual embedding, instead of its node embedding is used for inner product. This implementation technique stems from word2vec (Mikolov et al., 2013), though sometimes omitted by following papers, and improves performance by utilizing the asymmetry of the contextual nodes and central nodes.

We can split each node ii into central node viv_{i} and contextual node uiu_{i}, and each edge iji\rightarrow j becomes viujv_{i}\rightarrow u_{j}. Running those network embedding algorithms on the transformed bipartite graph is strictly equivalent to running on the original graph.222For bipartite graphs, always take nodes from VV part as central node and sample negative nodes from UU part. According to the above analysis, we only need to anaylze graphs with unique embeddings without loss of generality.

2.2. The SampledNCE Framework

We unify a large part of graph representation learning algorithms into the general SampledNCE framework, shown in Algorithm 1. The framework depends on an encoder EθE_{\theta}, which can be either an embedding lookup table or a variant of GNN, to generate node embeddings.

In the training process, mm positive nodes and kmk\cdot m negative nodes are sampled. In Algorithm 1, we use the cross-entropy loss to optimize the inner product where σ(x)=11+ex\sigma(x)=\frac{1}{1+e^{-x}}. And the other choices of loss function work in similar ways. In link prediction or recommendation, we recall the top KK nodes with largest Eθ(v)Eθ(u)E_{\theta}(v)\cdot E_{\theta}(u) for each node uu. In classification, we evaluate the performance of logistic regression with the node embeddings as features.

2.3. Network Embedding and GNNs

Edges in natural graph data can be assumed as sampled from an underlying positive distribution pd(u,v)p_{d}(u,v). However, only a few edges are observable so that numerous techniques are developed to make reasonable assumptions to better estimate pd(u,v)p_{d}(u,v), resulting in various pd^\hat{p_{d}} in different algorithms to approximate pdp_{d}.

Network embedding algorithms employ various positive distributions implicitly or explicitly. LINE (Tang et al., 2015) samples positive nodes directly from adjacent nodes. DeepWalk (Perozzi et al., 2014) samples via random walks and implicitly defines pd^(u|v)\hat{p_{d}}(u|v) as the stationary distribution of random walks (Qiu et al., 2018). node2vec (Grover and Leskovec, 2016) argues that homophily and structural equivalence of nodes uu and vv indicates a larger pd(u,v)p_{d}(u,v). These assumptions are mainly based on local smoothness and augment observed positive node pairs for training.

Graph neural networks are equipped with different encoder layers and perform implicit regularizations on positive distribution. Previous work (Li et al., 2018) reveals that GCN is a variant of graph Laplacian smoothing, which directly constrains the difference of the embedding of adjacent nodes. GCN, or its sampling version GraphSAGE (Hamilton et al., 2017), does not explicitly augment positive pairs but regularizes the output embedding by local smoothness, which can be seen as an implicit regularization of pdp_{d}.

3. Understanding Negative Sampling

In contrast to elaborate estimation of the positive sampling distribution, negative sampling has long been overlooked. In this section, we revisit negative sampling from the perspective of objective and risk, and conclude that “the negative sampling distribution should be positively but sub-linearly correlated to their positive sampling distribution”.

3.1. How does negative sampling influence the objective of embedding learning?

Previous works (Qiu et al., 2018) interpret network embedding as implicit matrix factorization. We discuss a more general form where the factorized matrix is determined by estimated data distribution pd^(u|v)\hat{p_{d}}(u|v) and negative distribution pn(u|v)p_{n}(u^{\prime}|v). According to  (Levy and Goldberg, 2014) (Qiu et al., 2018), the objective function for embedding is

J\displaystyle J =𝔼(u,v)pdlogσ(𝒖𝒗)+𝔼vpd(v)[k𝔼upn(u|v)logσ(𝒖𝒗)]\displaystyle=\mathbb{E}_{(u,v)\sim p_{d}}\log\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})+\mathbb{E}_{v\sim p_{d}(v)}[k\mathbb{E}_{u^{\prime}\sim p_{n}(u^{\prime}|v)}\log\sigma(-\vec{\bm{u^{\prime}}}^{\top}\vec{\bm{v}})]
(1) =𝔼vpd(v)[𝔼upd(u|v)logσ(𝒖𝒗)+k𝔼upn(u|v)logσ(𝒖𝒗)],\displaystyle=\mathbb{E}_{v\sim p_{d}(v)}[\mathbb{E}_{u\sim p_{d}(u|v)}\log\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})+k\mathbb{E}_{u^{\prime}\sim p_{n}(u^{\prime}|v)}\log\sigma(-\vec{\bm{u^{\prime}}}^{\top}\vec{\bm{v}})],

where 𝒖,𝒗\vec{\bm{u}},\vec{\bm{v}} are embeddings for node uu and vv and σ()\sigma(\cdot) is the sigmoid function. We can separately consider each node vv and show the objective for optimization as follows:

Theorem 1.

(Optimal Embedding) The optimal embedding satisfies that for each node pair (u,v)(u,v),

(2) 𝒖𝒗=logkpn(u|v)pd(u|v).\displaystyle\vec{\bm{u}}^{\top}\vec{\bm{v}}=-\log\frac{k\cdot p_{n}(u|v)}{p_{d}(u|v)}.

Proof  Let us consider the conditional data and the negative distribution of node vv. To maximize JJ is equivalent to minimize the following objective function for each vv,

J(v)\displaystyle J^{(v)} =𝔼upd(u|v)logσ(𝒖𝒗)+k𝔼upn(u|v)logσ(𝒖𝒗)\displaystyle=\mathbb{E}_{u\sim p_{d}(u|v)}\log\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})+k\mathbb{E}_{u^{\prime}\sim p_{n}(u^{\prime}|v)}\log\sigma(-\vec{\bm{u^{\prime}}}^{\top}\vec{\bm{v}})
=upd(u|v)logσ(𝒖𝒗)+kupn(u|v)logσ(𝒖𝒗)\displaystyle=\sum\limits_{u}p_{d}(u|v)\log\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})+k\sum\limits_{u^{\prime}}p_{n}(u^{\prime}|v)\log\sigma(-\vec{\bm{u^{\prime}}}^{\top}\vec{\bm{v}})
=u[pd(u|v)logσ(𝒖𝒗)+kpn(u|v)logσ(𝒖𝒗)]\displaystyle=\sum\limits_{u}[p_{d}(u|v)\log\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})+kp_{n}(u|v)\log\sigma(-\vec{\bm{u}}^{\top}\vec{\bm{v}})]
=u[pd(u|v)logσ(𝒖𝒗)+kpn(u|v)log(1σ(𝒖𝒗))].\displaystyle=\sum\limits_{u}[p_{d}(u|v)\log\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})+kp_{n}(u|v)\log\big{(}1-\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}})\big{)}].

For each (u,v)(u,v) pair, if we define two Bernoulli distributions Pu,v(x)P_{u,v}(x) where Pu,v(x=1)=pd(u|v)pd(u|v)+kpn(u|v)P_{u,v}(x=1)=\frac{p_{d}(u|v)}{p_{d}(u|v)+kp_{n}(u|v)} and Qu,v(x)Q_{u,v}(x) where Qu,v(x=1)=σ(𝒖𝒗)Q_{u,v}(x=1)=\sigma(\vec{\bm{u}}^{\top}\vec{\bm{v}}), the equation above is simplified as follows:

J(v)\displaystyle J^{(v)} =u(pd(u|v)+kpn(u|v))H(Pu,v,Qu,v),\displaystyle=-\sum\limits_{u}(p_{d}(u|v)+kp_{n}(u|v))H(P_{u,v},Q_{u,v}),

where H(p,q)=x𝒳p(x)logq(x)H(p,q)=-\sum_{x\in\mathcal{X}}p(x)\log q(x) is the cross entropy between two distributions. According to Gibbs Inequality, the maximizer of L(v)L^{(v)} satisfies P=QP=Q for each (u,v)(u,v) pair, which gives

(3) 11+e𝒖T𝒗\displaystyle\frac{1}{1+e^{-\vec{\bm{u}}^{T}\vec{\bm{v}}}} =pd(u|v)pd(u|v)+kpn(u|v)\displaystyle=\frac{p_{d}(u|v)}{p_{d}(u|v)+kp_{n}(u|v)}
𝒖𝒗\displaystyle\vec{\bm{u}}^{\top}\vec{\bm{v}} =logkpn(u|v)pd(u|v).\displaystyle=-\log\frac{k\cdot p_{n}(u|v)}{p_{d}(u|v)}.

The above theorem clearly shows that the positive and negative distributions pdp_{d} and pnp_{n} have the same level of influence on optimization. In contrast to abundant research on finding a better pd^\hat{p_{d}} to approximate pdp_{d}, negative distribution pnp_{n} is apparently under-explored.

3.2. How does negative sampling influence the expected loss (risk) of embedding learning?

The analysis above shows that with enough (possibly infinite) samples of edges from pdp_{d}, the inner product of embedding 𝒖𝒗\vec{\bm{u}}^{\top}\vec{\bm{v}} has an optimal value. In real-world data, we only have limited samples from pdp_{d}, which leads to a nonnegligible expected loss (risk). Then the loss function to minimize empirical risk for node vv becomes:

(4) JT(v)=1Ti=1Tlogσ(𝒖i𝒗)+1Ti=1kTlogσ(𝒖i𝒗),\displaystyle J^{(v)}_{T}=\frac{1}{T}\sum_{i=1}^{T}\log\sigma(\vec{\bm{u}}_{i}^{\top}\vec{\bm{v}})+\frac{1}{T}\sum_{i=1}^{kT}\log\sigma(-\vec{\bm{u^{\prime}}}_{i}^{\top}\vec{\bm{v}}),

where {u1,,uT}\{u_{1},...,u_{T}\} are sampled from pd(u|v)p_{d}(u|v) and {u1,,uk}\{u^{\prime}_{1},...,{u^{\prime}}_{k}^{\top}\} are sampled from pn(u|v)p_{n}(u|v).

In this section, we only consider optimization of node vv, which can be directly generalized to the whole embedding learning. More precisely, we equivalently consider 𝜽=[𝒖𝟎𝒗,,𝒖N1𝒗]\bm{\theta}=[\vec{\bm{u_{0}}}^{\top}\vec{\bm{v}},...,\vec{\bm{u}}_{N-1}^{\top}\vec{\bm{v}}] as the parameters to be optimized, where {u0,uN1}\{u_{0},...u_{N-1}\} are all the N nodes in the graph. Suppose the optimal parameter for JT(v)J^{(v)}_{T} and J(v)J^{(v)} are 𝜽T\bm{\theta}_{T} and 𝜽\bm{\theta}^{*} respectively. The gap between 𝜽T\bm{\theta}_{T} and 𝜽\bm{\theta}^{*} is described as follows:

Theorem 2.

The random variable T(𝛉T𝛉)\sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*}) asymptotically converges to a distribution with zero mean vector and covariance matrix

(5) Cov(T(𝜽T𝜽))=diag(𝒎)1(1+1/k)𝟏𝟏,\displaystyle\text{Cov}\big{(}\sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*})\big{)}=\textbf{diag}(\bm{m})^{-1}-(1+1/k)\bm{1}^{\top}\bm{1},

where 𝐦=[kpd(u0|v)pn(u0|v)pd(u0|v)+kpn(u0|v),,kpd(uN1|v)pn(uN1|v)pd(uN1|v)+kpn(uN1|v)]\bm{m}=\big{[}\frac{kp_{d}(u_{0}|v)p_{n}(u_{0}|v)}{p_{d}(u_{0}|v)+kp_{n}(u_{0}|v)},...,\frac{kp_{d}(u_{N-1}|v)p_{n}(u_{N-1}|v)}{p_{d}(u_{N-1}|v)+kp_{n}(u_{N-1}|v)}\big{]}^{\top} and 𝟏=[1,,1]\bm{1}=[1,...,1]^{\top}.

Proof  Our analysis is based on the Taylor expansion of θJT(v)(𝜽)\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}) around 𝜽\bm{\theta}^{*}. As 𝜽T\bm{\theta}_{T} is the minimizer of JT(v)(𝜽)J_{T}^{(v)}(\bm{\theta}), we must have θJT(v)(𝜽T)=𝟎\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}_{T})=\bm{0}, which gives

(6) θJT(v)(𝜽T)=JT(v)(𝜽)+θ2JT(v)(𝜽)(𝜽T𝜽)+O(𝜽T𝜽2)=𝟎.\displaystyle\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}_{T})=\nabla J_{T}^{(v)}(\bm{\theta}^{*})+\nabla^{2}_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})(\bm{\theta}_{T}-\bm{\theta}^{*})+O(\|\bm{\theta}_{T}-\bm{\theta}^{*}\|^{2})=\bm{0}.

Up to terms of order O(𝜽𝜽T2)O(\|\bm{\theta}^{*}-\bm{\theta}_{T}\|^{2}), we have

(7) T(𝜽T𝜽)=(θ2JT(v)(𝜽))1TθJT(v)(𝜽).\displaystyle\sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*})=-\big{(}\nabla^{2}_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})\big{)}^{-1}\sqrt{T}\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*}).

Next, we will analyze (θ2JT(v)(𝜽))1-\big{(}\nabla^{2}_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})\big{)}^{-1} and TθJT(v)(𝜽)\sqrt{T}\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*}) respectively by the following lemmas.

Lemma 0.

The negative Hessian matrix θ2JT(v)(𝛉)-\nabla^{2}_{\theta}J_{T}^{(v)}(\bm{\theta}^{*}) converges in probability to diag(𝐦)\textbf{diag}(\bm{m}).

Proof  First, we calculate the gradient of JT(v)(𝜽)J_{T}^{(v)}(\bm{\theta}) and Hessian matrix θ2JT(v)(𝜽)\nabla_{\theta}^{2}J^{(v)}_{T}(\bm{\theta}). Let 𝜽u\bm{\theta}_{u} be the parameter in 𝜽\bm{\theta} for modeling 𝒖𝒗\vec{\bm{u}}^{\top}\vec{\bm{v}} and 𝒆(u)\bm{e}_{(u)} be the one-hot vector which only has a 1 on this dimension.

JT(v)(𝜽)=\displaystyle J^{(v)}_{T}(\bm{\theta})= 1Ti=1Tlogσ(𝜽ui)+1Ti=1kTlogσ(𝜽ui)\displaystyle\frac{1}{T}\sum_{i=1}^{T}\log\sigma(\bm{\theta}_{u_{i}})+\frac{1}{T}\sum_{i=1}^{kT}\log\sigma(-\bm{\theta}_{u^{\prime}_{i}})
(8) θJT(v)(𝜽)=\displaystyle\nabla_{\theta}J^{(v)}_{T}(\bm{\theta})= 1Ti=1T(1σ(𝜽ui))𝒆(ui)1Ti=1kTσ(𝜽ui)𝒆(ui)\displaystyle\frac{1}{T}\sum_{i=1}^{T}\big{(}1-\sigma(\bm{\theta}_{u_{i}})\big{)}\bm{e}_{(u_{i})}-\frac{1}{T}\sum_{i=1}^{kT}\sigma(\bm{\theta}_{u^{\prime}_{i}})\bm{e}_{(u^{\prime}_{i})}
θ2JT(v)(𝜽)=\displaystyle\nabla_{\theta}^{2}J^{(v)}_{T}(\bm{\theta})= 1Ti=1T{σ(𝜽ui)(1σ(𝜽ui))𝒆(ui)𝒆(ui)+𝟎𝟎}\displaystyle\frac{1}{T}\sum_{i=1}^{T}\{-\sigma(\bm{\theta}_{u_{i}})\left(1-\sigma(\bm{\theta}_{u_{i}})\right)\bm{e}_{(u_{i})}\bm{e}_{(u_{i})}^{\top}+\bm{0}^{\top}\bm{0}\}
1Ti=1kT{σ(𝜽ui)(1σ(𝜽ui))𝒆(ui)𝒆(ui)+𝟎𝟎}.\displaystyle-\frac{1}{T}\sum_{i=1}^{kT}\{\sigma(\bm{\theta}_{u^{\prime}_{i}})\left(1-\sigma(\bm{\theta}_{u^{\prime}_{i}})\right)\bm{e}_{(u^{\prime}_{i})}\bm{e}_{(u^{\prime}_{i})}^{\top}+\bm{0}^{\top}\bm{0}\}.

According to equation (3), σ(𝜽ui)=pd(u|v)pd(u|v)+kpn(u|v)\sigma(\bm{\theta}_{u_{i}})=\frac{p_{d}(u|v)}{p_{d}(u|v)+kp_{n}(u|v)} at 𝜽=𝜽\bm{\theta}=\bm{\theta}^{*}. Therefore,

limT+θ2JT(v)(𝜽)𝑃\displaystyle\lim\limits_{T\rightarrow+\infty}-\nabla_{\theta}^{2}J^{(v)}_{T}(\bm{\theta}^{*})\xrightarrow{P} uσ(𝜽u)(1σ(𝜽u))𝒆(u)𝒆(u)\displaystyle\sum\limits_{u}\sigma(\bm{\theta}_{u})\left(1-\sigma(\bm{\theta}_{u})\right)\bm{e}_{(u)}\bm{e}_{(u)}^{\top}
(pd(u)+kpn(u))\displaystyle\cdot(p_{d}(u)+kp_{n}(u))
=\displaystyle= ukpd(u|v)pn(u|v)pd(u|v)+kpn(u|v)𝒆(u)𝒆(u)\displaystyle\sum\limits_{u}\frac{kp_{d}(u|v)p_{n}(u|v)}{p_{d}(u|v)+kp_{n}(u|v)}\bm{e}_{(u)}\bm{e}_{(u)}^{\top}
(9) =\displaystyle= diag(𝒎).\displaystyle\ \textbf{diag}(\bm{m}).
Lemma 0.

The expectation and variance of θJT(v)(𝛉)\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*}) satisfy

(10) 𝔼[θJT(v)(𝜽)]\displaystyle\mathbb{E}[\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})] =𝟎,\displaystyle=\bm{0},
(11) Var [θJT(v)(𝜽)]\displaystyle\text{Var }[\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})] =1T(diag(𝒎)(1+1/k)𝒎𝒎).\displaystyle=\frac{1}{T}\big{(}\textbf{diag}(\bm{m})-(1+1/k)\bm{m}\bm{m}^{\top}\big{)}.

Proof  According to equation (3)(8),

𝔼[θJT(v)(𝜽)]\displaystyle\mathbb{E}[\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})] =u[pd(u)(1σ(𝜽u))𝒆(u)kpn(u)σ(𝜽u)𝒆(u)]=𝟎,\displaystyle=\sum\limits_{u}[p_{d}(u)\big{(}1-\sigma(\bm{\theta}_{u}^{*})\big{)}\bm{e}_{(u)}-kp_{n}(u)\sigma(\bm{\theta}_{u}^{*})\bm{e}_{(u)}]=\bm{0},
Cov [θJT(v)(𝜽)]=\displaystyle\text{Cov }[\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})]= 𝔼[θJT(v)(𝜽)θJT(v)(𝜽)]\displaystyle\mathbb{E}[\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})^{\top}]
=\displaystyle= 1T𝔼upd(u|v)(1σ(𝜽u))2𝒆(u)𝒆(u)\displaystyle\frac{1}{T}\mathbb{E}_{u\sim p_{d}(u|v)}\left(1-\sigma(\bm{\theta}_{u}^{*})\right)^{2}\bm{e}_{(u)}\bm{e}_{(u)}^{\top}
+kT𝔼upn(u|v)σ(𝜽u)2𝒆(u)𝒆(u)\displaystyle+\frac{k}{T}\mathbb{E}_{u\sim p_{n}(u|v)}\sigma(\bm{\theta}_{u}^{*})^{2}\bm{e}_{(u)}\bm{e}_{(u)}^{\top}
+(T1T2+11kT)𝒎𝒎\displaystyle+(\frac{T-1}{T}-2+1-\frac{1}{kT})\bm{m}\bm{m}^{\top}
(12) =\displaystyle= 1T(diag(𝒎)(1+1k)𝒎𝒎).\displaystyle\frac{1}{T}\big{(}\textbf{diag}(\bm{m})-(1+\frac{1}{k})\bm{m}\bm{m}^{\top}\big{)}.

Let us continue to prove Theorem 2 by substituting (9)(12) into equation (7) and derive the covariance of T(𝜽T𝜽)\sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*}).

Cov(T(𝜽T𝜽))\displaystyle\text{Cov}\big{(}\sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*})\big{)} =𝔼[T(𝜽T𝜽)T(𝜽T𝜽)]\displaystyle=\mathbb{E}[\sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*})\ \sqrt{T}(\bm{\theta}_{T}-\bm{\theta}^{*})^{\top}]
Tdiag(𝒎)1Var [θJT(v)(𝜽)](diag(𝒎)1)\displaystyle\approx T\textbf{diag}(\bm{m})^{-1}\text{Var }[\nabla_{\theta}J_{T}^{(v)}(\bm{\theta}^{*})]\big{(}\textbf{diag}(\bm{m})^{-1}\big{)}^{\top}
=diag(𝒎)1(1+1/k)𝟏𝟏.\displaystyle=\textbf{diag}(\bm{m})^{-1}-(1+1/k)\bm{1}^{\top}\bm{1}.

According to Theorem 2, the mean squared error, aka risk, of 𝒖𝒗\vec{\bm{u}}^{\top}\vec{\bm{v}} is

(13) 𝔼[(𝜽T𝜽)u2]=1T(1pd(u|v)1+1kpn(u|v)1k).\displaystyle\mathbb{E}\big{[}\|(\bm{\theta}_{T}-\bm{\theta}^{*})_{u}\|^{2}\big{]}=\frac{1}{T}(\frac{1}{p_{d}(u|v)}-1+\frac{1}{kp_{n}(u|v)}-\frac{1}{k}).

3.3. The Principle of Negative Sampling

To determine an appropriate pnp_{n} for a specific pdp_{d}, a trade-off might be needed to balance the rationality of the objective (to what extent the optimal embedding fits for the downstream task) and the expected loss, also known as risk (to what extent the trained embedding deviates from the optimum).

A simple solution is to sample negative nodes positively but sub-linearly correlated to their positive sampling distribution, i.e. pn(u|v)pd(u|v)α,0<α<1p_{n}(u|v)\propto p_{d}(u|v)^{\alpha},0<\alpha<1.

(Monotonicity) From the perspective of objective, if we have pd(ui|v)>pd(uj|v)p_{d}(u_{i}|v)>p_{d}(u_{j}|v),

𝒖i𝒗\displaystyle\vec{\bm{u}}^{\top}_{i}\vec{\bm{v}} =logpd(ui|v)αlogpd(ui|v)+c\displaystyle=\log p_{d}(u_{i}|v)-\alpha\log p_{d}(u_{i}|v)+c
>(1α)logpd(uj|v)+c=𝒖j𝒗,\displaystyle>(1-\alpha)\log p_{d}(u_{j}|v)+c=\vec{\bm{u}}^{\top}_{j}\vec{\bm{v}},

where cc is a constant. The sizes of inner products is in accordance with positive distribution pdp_{d}, facilitating the task relying on relative sizes of 𝒖𝒗\vec{\bm{u}}^{\top}\vec{\bm{v}} for different uus, such as recommendation or link prediction.

(Accuracy) From the perspective of risk, we mainly care about the scale of the expected loss. According to equation (13), the expected squared deviation is dominated by the smallest one in pd(u|v)p_{d}(u|v) and pn(u|v)p_{n}(u|v). If an intermediate node with large pd(u|v)p_{d}(u|v) is negatively sampled insufficiently (with small pn(u|v)p_{n}(u|v)), the accurate optimization will be corrupted. But if pn(u|v)pd(u|v)αp_{n}(u|v)\propto p_{d}(u|v)^{\alpha}, then

𝔼[(𝜽T𝜽)u2]=1T(1pd(u|v)(1+pd(u|v)1αc)11k).\displaystyle\mathbb{E}\big{[}\|(\bm{\theta}_{T}-\bm{\theta}^{*})_{u}\|^{2}\big{]}=\frac{1}{T}\big{(}\frac{1}{p_{d}(u|v)}(1+\frac{p_{d}(u|v)^{1-\alpha}}{c})-1-\frac{1}{k}\big{)}.

The order of magnitude of deviation only negatively related to pd(u|v)p_{d}(u|v), meaning that with limited positive samples, the inner products for high-probability node pairs are estimated more accurate. This is an evident advantage over evenly negative sampling.

4. Method

Refer to caption
Figure 2. A running example of MCNS. The central nodes are traversed by DFS, each samples three negative context nodes using Metropolis-Hastings algorithm by proceeding with the Markov chain.
Input: DFS sequence T=[v1,,v|T|]T=[v_{1},...,v_{|T|}],
Encoder Eθ{E}_{\theta}, Positive Sampling Distribution pd^\hat{p_{d}},
Proposal Distribution qq, Number of Negative Samples kk.
repeat
       Initialize current negative node xx at random
       for each node vv in DFS sequence TT do
             Sample 11 positive sample uu from pd^(u|v)\hat{p_{d}}(u|v)
             Initialize loss =0\mathcal{L}=0
             //Negative Sampling via Metropolis-Hastings
             for i=1i=1 to kk do
                   Sample a node yy from q(y|x)q(y|x)
                   Generate a uniform random number r[0,1]r\in[0,1]
                   if r<min(1,(Eθ(v)Eθ(y))α(Eθ(v)Eθ(x))αq(x|y)q(y|x))r<\text{min}\big{(}1,\frac{(E_{\theta}(v)\cdot E_{\theta}(y))^{\alpha}}{(E_{\theta}(v)\cdot E_{\theta}(x))^{\alpha}}\frac{q(x|y)}{q(y|x)}\big{)} then
                         x=yx=y
                        
                   end if
                  =+max(0,Eθ(v)Eθ(x)Eθ(u)Eθ(v)+γ)\mathcal{L}=\mathcal{L}+\text{max}\big{(}0,E_{\theta}(v)\cdot E_{\theta}(x)-E_{\theta}(u)\cdot E_{\theta}(v)+\gamma\big{)}
                   Update θ\theta by descending the gradients θ\nabla_{\theta}\mathcal{L}
                  
             end for
            
       end for
      
until Early Stopping or Convergence;
Algorithm 2 The training process with MCNS

4.1. The Self-contrast Approximation

Although we deduced that pn(u|v)pd(u|v)αp_{n}(u|v)\propto p_{d}(u|v)^{\alpha}, the real pdp_{d} is unknown and its approximation pd^\hat{p_{d}} is often implicitly defined. – How can the principle help negative sampling?

We therefore propose a self-contrast approximation, replacing pdp_{d} by inner products based on the current encoder, i.e.

pn(u|v)pd(u|v)α(Eθ(u)Eθ(v))αuU(Eθ(u)Eθ(v))α.p_{n}(u|v)\propto p_{d}(u|v)^{\alpha}\approx\frac{\big{(}E_{\theta}(u)\cdot E_{\theta}(v)\big{)}^{\alpha}}{\sum_{u^{\prime}\in U}\big{(}E_{\theta}(u^{\prime})\cdot E_{\theta}(v)\big{)}^{\alpha}}.

The resulting form is similar to a technique in RotatE (Sun et al., 2019), and very time-consuming. Each sampling requires O(n)O(n) time, making it impossible for middle- or large-scale graphs. Our MCNS solves this problem by our adapted version of Metropolis-Hastings algorithm.

4.2. The Metropolis-Hastings Algorithm

As one of the most distinguished Markov chain Monte Carlo (MCMC) methods, the Metropolis-Hastings algorithm (Metropolis et al., 1953) is designed for obtaining a sequence of random samples from unnormalized distributions. Suppose we want to sample from distribution π(x)\pi(x) but only know π~(x)π(x)\tilde{\pi}(x)\propto\pi(x). Meanwhile the computation of normalizer Z=xπ~(x)Z=\sum_{x}\tilde{\pi}(x) is unaffordable. The Metropolis-Hastings algorithm constructs a Markov chain {X(t)}\{X(t)\} that is ergodic and stationary with respect to π\pi —- meaning that, X(t)π(x),tX(t)\sim\pi(x),t\rightarrow\infty.

The transition in the Markov chain has two steps:
(1) Generate yq(y|X(t))y\sim q(y|X(t)) where the proposal distribution qq is arbitrary chosen as long as positive everywhere.
(2) Pick yy as X(t+1)X(t+1) at the probability min{π~(y)π~(X(t))q(X(t)|y)q(y|X(t)),1}\min\left\{\frac{\tilde{\pi}(y)}{\tilde{\pi}(X(t))}\frac{q(X(t)|y)}{q(y|X(t))},1\right\}, or X(t+1)X(t)X(t+1)\leftarrow X(t). See tutorial (Chib and Greenberg, 1995) for more details.

4.3. Markov chain Negative Sampling

The main idea for MCNS is to apply the Metropolis-Hastings algorithm for each node vv with π~(u|v)=(Eθ(u)Eθ(v))α\tilde{\pi}(u|v)=\big{(}E_{\theta}(u)\cdot E_{\theta}(v)\big{)}^{\alpha}. Then we are sampling from the self-contrast approximated distribution. However, a notorious disadvantage for the Metropolis-Hastings is a relatively long burninburn-in period, i.e. the distribution of X(t)π(x)X(t)\approx\pi(x) only when tt is large, which heavily affects its efficiency.

Fortunately, the connatural property exists for graphs that nearby nodes usually share similar pd(|v)p_{d}(\cdot|v), as evident from triadic closure in social network (Huang et al., 2014), collaborative filtering in recommendation (Linden et al., 2003), etc. Since nearby nodes have similar target negative sampling distribution, the Markov chain for one node is likely to still work well for its neighbors. Therefore, a neat solution is to traverse the graph by Depth First Search (DFS, See Appendix A.6) and keep generating negative samples from the last node’s Markov chain (Figure 2).

The proposal distribution also influences convergence rate. Our q(y|x)q(y|x) is defined by mixing uniform sampling and sampling from the nearest kk nodes with 12\frac{1}{2} probability each. The hinge loss pulls positive node pairs closer and pushes negative node pairs away until they are beyond a pre-defined margin. Thus, we substitute the binary cross-entropy loss as a γ\gamma-skewed hinge loss,

(14) =max(0,Eθ(v)Eθ(x)Eθ(u)Eθ(v)+γ),\mathcal{L}=\text{max}\big{(}0,E_{\theta}(v)\cdot E_{\theta}(x)-E_{\theta}(u)\cdot E_{\theta}(v)+\gamma\big{)},

where (u,v)(u,v) is a positive node pair and (x,v)(x,v) is negative. γ\gamma is a hyperparameter. The MCNS is summarized in Algorithm 2.

5. Experiments

To demonstrate the efficacy of MCNS, we conduct a comprehensive suite of experiments on 3 representative tasks, 3 graph representation learning algorithms and 5 datasets, a total of 19 experimental settings, under all of which MCNS consistently achieves significant improvements over 8 negative sampling baselines collected from previous papers (Ying et al., 2018; Perozzi et al., 2014) and other relevant topics (Zhang et al., 2013; Wang et al., 2017b; Cai and Wang, 2018; Weston et al., 2011).

5.1. Experimental Setup

5.1.1. Tasks and Dataset.

The statistics of tasks and datasets are shown in Table 1. Introductions about the datasets are in Appendix A.4.

Table 1. Statistics of the tasks and datasets.
Task Dataset Nodes Edges Classes
Recommendation MovieLens 2,625 100,000 /
Amazon 255,404 1,689,188 /
Alibaba 159,633 907,470 /
Link Prediction Arxiv 5,242 28,980 /
Node Classification BlogCatalog 10,312 333,983 39

5.1.2. Encoders.

To verify the adaptiveness of MCNS to different genres of graph representation learning algorithms, we utilize an embedding lookup table or a variant of GNN as encoder for experiments, including DeepWalk, GCN and GraphSAGE. The detailed algorithms are shown in Appendix A.2.

MovieLens Amazon Alibaba
DeepWalk GCN GraphSAGE DeepWalk GraphSAGE DeepWalk GraphSAGE
MRRMRR Deg0.75Deg^{0.75} 0.025±\pm.001 0.062±\pm.001 0.063±\pm.001 0.041±\pm.001 0.057±\pm.001 0.037±\pm.001 0.064±\pm.001
WRMF 0.022±\pm.001 0.038±\pm.001 0.040±\pm.001 0.034±\pm.001 0.043±\pm.001 0.036±\pm.001 0.057±\pm.002
RNS 0.031±\pm.001 0.082±\pm.002 0.079±\pm.001 0.046±\pm.003 0.079±\pm.003 0.035±\pm.001 0.078±\pm.003
PinSAGE 0.036±\pm.001 0.091±\pm.002 0.090±\pm.002 0.057±\pm.004 0.080±\pm.001 0.054±\pm.001 0.081±\pm.001
WARP 0.041±\pm.003 0.114±\pm.003 0.111±\pm.003 0.061±\pm.001 0.098±\pm.002 0.067±\pm.001 0.106±\pm.001
DNS 0.040±\pm.003 0.113±\pm.003 0.115±\pm.003 0.063±\pm.001 0.101±\pm.003 0.067±\pm.001 0.090±\pm.002
IRGAN 0.047±\pm.002 0.111±\pm.002 0.101±\pm.002 0.059±\pm.001 0.091±\pm.001 0.061±\pm.001 0.083±\pm.001
KBGAN 0.049±\pm.001 0.114±\pm.003 0.100 ±\pm.001 0.060±\pm.001 0.089±\pm.001 0.065±\pm.001 0.087±\pm.002
MCNS 0.053±\pm.001 0.122±\pm.004 0.114±\pm.001 0.065±\pm.001 0.108±\pm.001 0.070±\pm.001 0.116±\pm.001
Hits@30Hits@30 Deg0.75Deg^{0.75} 0.115±\pm.002 0.270±\pm.002 0.270±\pm.001 0.161±\pm.003 0.238±\pm.002 0.138±0.138\pm.003 0.249±\pm.004
WRMF 0.110±\pm.003 0.187±\pm.002 0.181±\pm.002 0.139±\pm.002 0.188±\pm.001 0.121±\pm.003 0.227±\pm.004
RNS 0.143±\pm.004 0.362±\pm.004 0.356±\pm.001 0.171±\pm.004 0.317±\pm.004 0.132±\pm.004 0.302±\pm.005
PinSAGE 0.158±\pm.003 0.379±\pm.005 0.383±\pm.005 0.176±\pm.004 0.333±\pm.005 0.146±\pm.003 0.312±\pm.005
WARP 0.164±\pm.005 0.406±\pm.002 0.404±\pm.005 0.181±\pm.004 0.340±\pm.004 0.178±\pm.004 0.342±\pm.004
DNS 0.166±\pm.005 0.404±\pm.006 0.410±\pm.006 0.182±\pm.003 0.358±\pm.004 0.186±\pm.005 0.336±\pm.004
IRGAN 0.207±\pm.002 0.415±\pm.004 0.408±\pm.004 0.183±\pm.004 0.342±\pm.003 0.175±\pm.003 0.320±\pm.002
KBGAN 0.198±\pm.003 0.420±\pm.003 0.401±\pm.005 0.181±\pm.003 0.347±\pm.003 0.181±\pm.003 0.331±\pm.004
MCNS 0.230±\pm.003 0.426 ±\pm.005 0.413±\pm.003 0.207±\pm.003 0.386±\pm.004 0.201±\pm.003 0.387±\pm.002
Table 2. Recommendation Results of MCNS with various encoders on three datasets. GCN is not evaluted on Amazon and Alibaba datasets due to its limited scalability. Similar trends hold for Hits@5 and Hits@10.

5.1.3. Baselines.

Baselines in our experiments generally fall into the following categories. The detailed description of each baseline is provided in Appendix A.3.

  • Degree-based Negative Sampling. The compared methods include Power of Degree (Mikolov et al., 2013), RNS (Caselles-Dupré et al., 2018) and WRMF (Hu et al., 2008), which can be summarized as pn(v)deg(v)βp_{n}(v)\propto deg(v)^{\beta}. These strategies are static, inconsiderate to the personalization of nodes.

  • Hard-samples Negative Sampling. Based on the observation that “hard negative samples” – the nodes with high positive probabilities – helps recommendation, methods are proposed to mine the hard negative samples by rejection (WARP (Zhao et al., 2015)), sampling-max (DNS (Zhang et al., 2013)) and personalized PageRank (PinSAGE (Ying et al., 2018)).

  • GAN-based Negative Sampling. IRGAN (Wang et al., 2017b) trains generative adversarial nets (GANs) to adaptively generate hard negative samples. KBGAN (Cai and Wang, 2018) employs a sampling-based adaptation of IRGAN on knowledge base completion.

5.2. Results

In this section, we demonstrate the results in the 19 settings. Note that all values and standard deviations reported in the tables are from ten-fold cross validation. We also leverage paired t-tests (Hsu and Lachenbruch, 2007) to verify whether MCNS is significantly better than the strongest baseline. The tests in the 19 settings give a max p-value (Amazon, GraphSAGE) of 0.0005 0.01\ll 0.01, quantitatively proving the significance of our improvements.

5.2.1. Recommendation

Recommendation is the most important technology in many E-commerce platforms, which evolved from collaborative filtering to graph-based models. Graph-based recommender systems represent all users and items by embeddings, and recommend items with largest inner products for a given user.

In this paper, Hits@kHits@k (Cremonesi et al., 2010) and mean reciprocal ranking (MRRMRR) (Sugiyama and Kan, 2010) serve as evaluation methodologies, whose definitions can be found in Appendix A.5. Moreover, we only use observed positive pairs to train node embeddings, which can clearly demonstrate that the improvement of performance comes from the proposed negative sampling. We summarize the performance of MCNS as well as the baselines in Table 2.

The results show that Hard-samples negative samplings generally surpass degree-based strategies with 5%40%5\%\sim 40\% performance leaps in MRR. GAN-based negative sampling strategies perform even better, owing to the evolving generator mining hard negative samples more accurately. In the light of our theory, the proposed MCNS accomplishes significant gains of 2%13%2\%\sim 13\% over the best baselines.

5.2.2. Link Prediction

Link prediction aims to predict the occurrence of links in graphs, more specifically, to judge whether a node pair is a masked true edge or unlinked pair based on nodes’ representations. Details about data split and evaluation are in Appendix A.5.

The results for link prediction are given in Table 3. MCNS outperforms all baselines with various graph representation learning methods on the Arxiv dataset. An interesting phenomenon about degree-based strategies is that both WRMF and commonly-used Deg0.75Deg^{0.75} outperform uniform RNS sampling, completely adverse to the results on recommendation datasets. The results can be verified by examining the results in previous papers (Mikolov et al., 2013; Zhang and Zweigenbaum, 2018), reflecting the different network characteristics of user-item graphs and citation networks.

Algorithm DeepWalk GCN GraphSAGE
AUC Deg0.75Deg^{0.75} 64.6±\pm0.1 79.6±\pm0.4 78.9±\pm0.4
WRMF 65.3±\pm0.1 80.3±\pm0.4 79.1±\pm0.2
RNS 62.2±\pm0.2 74.3±\pm0.5 74.7±\pm0.5
PinSAGE 67.2±\pm0.4 80.4±\pm0.3 80.1±\pm0.4
WARP 70.5±\pm0.3 81.6±\pm0.3 82.7±\pm0.4
DNS 70.4±\pm0.3 81.5±\pm0.3 82.6±\pm0.4
IRGAN 71.1±\pm0.2 82.0±\pm0.4 82.2±\pm0.3
KBGAN 71.6±\pm0.3 81.7±\pm0.3 82.1±\pm0.3
MCNS 73.1±\pm0.4 82.6±\pm0.4 83.5±\pm0.5
Table 3. The results of link prediction with different negative sampling strategies on the Arxiv dataset.

5.2.3. Node Classification

Multi-label node classification is a usual task to assess graph representation algorithms. The one-vs-rest logistic regression classifier(Fan et al., 2008) is trained in a supervised way with graph representations as the features of nodes. In this experiment, we keep the default setting as the DeepWalk’s original paper for fair comparison on the BlogCatalog dataset.

Table 4 summarizes the Micro-F1 result on the BlogCatalog dataset. MCNS stably outperforms all baselines regardless of the training set ratio TRT_{R}. The trends are similar for Macro-F1, which is omitted due to the space limitation.

Algorithm DeepWalk GCN GraphSAGE
TR(%)T_{R}(\%) 10 50 90 10 50 90 10 50 90
Micro -F1 Deg0.75Deg^{0.75} 31.6 36.6 39.1 36.1 41.8 44.6 35.9 42.1 44.0
WRMF 30.9 35.8 37.5 34.2 41.4 43.3 34.4 41.0 43.1
RNS 29.8 34.1 36.0 33.4 40.5 42.3 33.5 39.6 41.6
PinSAGE 32.0 37.4 40.1 37.2 43.2 45.7 36.9 43.2 45.1
WARP 35.1 40.3 42.1 39.9 45.8 47.7 40.1 45.5 47.5
DNS 35.2 40.4 42.5 40.4 46.0 48.6 40.5 46.3 48.5
IRGAN 34.3 39.6 41.8 39.1 45.2 47.9 38.9 45.0 47.6
KBGAN 34.6 40.0 42.3 39.5 45.5 48.3 39.6 45.3 48.5
MCNS 36.1 41.2 43.3 41.7 47.3 49.9 41.6 47.5 50.1
Table 4. Micro-F1 scores for multi-label classification on the BlogCatalog dataset. Similar trends hold for Macro-F1 scores.

5.3. Analysis

Comparison with Power of Degree. To thoroughly investigate the potential of degree-based strategies pn(v)deg(v)βp_{n}(v)\propto deg(v)^{\beta}, a series of experiments are conducted by varying β\beta from -1 to 1 with GraphSAGE. Figure 3 shows that the performance of degree-based is consistently lower than that of MCNS, suggesting that MCNS learns a better negative distribution beyond the expressiveness of degree-based strategies. Moreover, the best β\beta varies between datasets and seems not easy to determine before experiments, while MCNS naturally adapts to different datasets.

Refer to caption
Figure 3. Comparison between Power of Degree and MCNS. Each red line represents the performance of MCNS in this setting and the blue curves are performances of Power of Degree with various β\beta.

Parameter Analysis. To test the robustness of MCNS quantitively, we visualize the MRR curves of MCNS by varying two most important hyperparameters, embedding dimension and margin γ\gamma. The results with GraphSAGE on Alibaba dataset are shown in Figure 4.

The skewed hinge loss in Equation (14) aims to keep at least γ\gamma distance between the inner product of the positive pair and that of the negative pair. γ\gamma controls the distance between positive-negative samples and must be positive in principle. Correspondingly, Figure 4 shows that the hinge loss begins to take effect when γ0\gamma\geq 0 and reaches its optimum at γ0.1\gamma\approx 0.1, which is a reasonable boundary. As the performance increases slowly with a large embedding dimension, we set it 512512 due to the trade-off between the performance and time consumption.

Refer to caption
Figure 4. The performance of MCNS on the Alibaba by varying γ\gamma and embedding dimension.

Efficiency Comparison. To compare the efficiency of different negative sampling methods, we report the runtime of MCNS and hard-samples or GAN-based strategies (PinSAGE, WARP, DNS, KBGAN) with GraphSAGE encoder in recommendation task in Figure 5. We keep the same batch size and number of epochs for a fair comparison. WARP might need a large number of tries before finding negative samples, leading to a long running time. PinSAGE utilizes PageRank to generate “hard negative items”, which increases running time compared with uniformly sampling. DNS and KBGAN both first sample a batch of candidates and then select negative samples from them, but the latter uses a lighter generator. MCNS requires 17 mins to complete ten-epochs training while the fastest baseline KBGAN is at least 2×\times slower on the MovieLens dataset.

5.4. Experiments for Further Understanding

In section 3, we have analyzed the criterion about negative sampling from the perspective of objective and risk. However, doubts might arise about (1) whether sampling more negative samples is always helpful, (2) why our conclusion contradicts with the intuition “positively sampling nearby nodes and negatively sampling faraway nodes ”.

To further deepen the understanding of negative sampling, we design two extended experiments on MovieLens with GraphSAGE encoder to verify our theory and present the results in Figure 6.

Refer to caption
Figure 5. Runtime comparisons for different negative sampling strategies with GraphSAGE encoder on the recommendation datasets.
Refer to caption
Figure 6. The results of verification experiments. (Left) The effect of increasing the number of negative sampling kk in Deg0.75Deg^{0.75}. (Right) The effect of sampling more distant nodes.

If we sample more negative samples, the performance increases with subsequently decrease. According to Equation (13), sampling more negative samples always decrease the risk, leading to an improvement in performance at first. However, performance begins to decrease after reaching the optimum, because extra bias is added to the objective by the increase of kk according to Equation (2). The bias impedes the optimization of the objective from zero-centered initial embeddings.

The second experiment in Figure 6 shows the disastrous consequences of negatively sampling more distant nodes, the nodes with small pd(u)p_{d}(u). To test the performances with varying degrees of the mismatching of pdp_{d} and pnp_{n}, we design a strategy, inverseDNS, by selecting the one scored lowest in the candidate items. In this way, not only the negative sampling probabilities of the items are sampled negatively correlated to pdp_{d}, but also we can control the degree by varying candidate size MM. Both MRR and Hits@kHits@k go down as MM increases, hence verifies our theory.

6. Related Work

Graph Representation Learning. The mainstream of graph representation learning on graphs diverges into two main topics: traditional network embedding and GNNs. Traditional network embedding cares more about the distribution of positive node pairs. Inspired by the skip-gram model (Mikolov et al., 2013), DeepWalk (Perozzi et al., 2014) learns embeddings via sampling “context” nodes for each vertex with random walks, and maximize the log-likelihood of observed context nodes for the given vertex. LINE (Tang et al., 2015) and node2vec (Grover and Leskovec, 2016) extend DeepWalk with various positive distributions. GNNs are deep learning based methods that generalize convolution operation to graph data. (Kipf and Welling, 2017) design GCNs by approximating localized 1-order spectral convolution. For scalability, GraphSAGE (Hamilton et al., 2017) employs neighbor sampling to alleviate the receptive field expansion. FastGCN (Chen et al., 2018) adopts importance sampling in each layer.

Negative Sampling. Negative sampling is firstly proposed to speed up skip-gram training in word2vec (Mikolov et al., 2013). Word embedding models sample negative samples according to its word frequency distribution proportional to the 3/4 power. Most later works on network embedding (Grover and Leskovec, 2016; Tang et al., 2015) follow this setting. In addition, negative sampling has been studied extensively in recommendation. Bayesian Personalized Ranking (Rendle et al., 2009) proposes uniform sampling for negative samples. Then, dynamic negative sampling (DNS) (Zhang et al., 2013) is proposed to sample informative negative samples based on current prediction scores. Besides, the PinSAGE (Ying et al., 2018) adds “hard negative items” to obtain fine enough “resolution” for recommender system. The negative samples are sampled uniformly with the rejection in WARP (Weston et al., 2011). Recently, GAN-based negative sampling method has also been adopted to train better node representations for information retrieval (Wang et al., 2017b), recommendation (Wang et al., 2018), word embeddings (Bose et al., 2018), knowledge graph embeddings (Cai and Wang, 2018). Another lines of research focus on exploiting sampled softmax (Bengio and Senécal, 2008) or its debiasing variant (Zhou et al., 2020) for extremely large scale link prediction tasks.

7. Conclusion

In this paper, we study the effect of negative sampling, a practical approach adopted in the literature of graph representation learning. Different from the existing works that decide a proper distribution for negative sampling in heuristics, we theoretically analyze the objective and risk of the negative sampling approach and conclude that the negative sampling distribution should be positively but sub-linearly correlated to their positive sampling distribution. Motivated by the theoretical results, we propose MCNS, approximating the ideal distribution by self-contrast and accelerating sampling by Metropolis-Hastings. Extensive experiments show that MCNS outperforms 8 negative sampling strategies, regardless of the underlying graph representation learning methods.

Acknowledgements

The work is supported by the NSFC for Distinguished Young Scholar (61825602), NSFC (61836013), and a research fund supported by Alibaba Group.

References

  • (1)
  • Bengio and Senécal (2008) Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks 19, 4 (2008), 713–722.
  • Bose et al. (2018) Avishek Joey Bose, Huan Ling, and Yanshuai Cao. 2018. Adversarial Contrastive Estimation. (2018), 1021–1032.
  • Cai and Wang (2018) Liwei Cai and William Yang Wang. 2018. KBGAN: Adversarial Learning for Knowledge Graph Embeddings. In NAACL-HLT‘18. 1470–1480.
  • Caselles-Dupré et al. (2018) Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2vec applied to recommendation: Hyperparameters matter. In RecSys’18. ACM, 352–356.
  • Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. ICLR’18 (2018).
  • Chib and Greenberg (1995) Siddhartha Chib and Edward Greenberg. 1995. Understanding the metropolis-hastings algorithm. The american statistician 49, 4 (1995), 327–335.
  • Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 39–46.
  • Ding et al. (2018) Ming Ding, Jie Tang, and Jie Zhang. 2018. Semi-supervised learning on graphs with generative adversarial nets. In CIKM’18. ACM, 913–922.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of machine learning research 9, Aug (2008), 1871–1874.
  • Gao and Huang (2018) Hongchang Gao and Heng Huang. 2018. Self-Paced Network Embedding. (2018), 1406–1415.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD’16. ACM, 855–864.
  • Gutmann and Hyvärinen (2012) Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13, Feb (2012), 307–361.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS’17. 1024–1034.
  • Hsu and Lachenbruch (2007) Henry Hsu and Peter A Lachenbruch. 2007. Paired t test. Wiley encyclopedia of clinical trials (2007), 1–3.
  • Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In ICDM’08. Ieee, 263–272.
  • Huang et al. (2014) Hong Huang, Jie Tang, Sen Wu, Lu Liu, and Xiaoming Fu. 2014. Mining triadic closure patterns in social networks. In WWW’14. 499–504.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR’17 (2017).
  • Leskovec et al. (2007) Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. TKDD’07 1, 1 (2007), 2–es.
  • Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS’14. 2177–2185.
  • Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI’18.
  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003), 76–80.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR’15. ACM, 43–52.
  • Metropolis et al. (1953) Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. The journal of chemical physics 21, 6 (1953), 1087–1092.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS’13. 3111–3119.
  • Mnih and Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS’13. 2265–2273.
  • Pan et al. (2008) Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In ICDM’08. IEEE, 502–511.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In KDD’14. ACM, 701–710.
  • Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In WSDM’18. ACM, 459–467.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI’09. AUAI Press, 452–461.
  • Sugiyama and Kan (2010) Kazunari Sugiyama and Min-Yen Kan. 2010. Scholarly paper recommendation via user’s recent research interests. In JCDL’10. ACM, 29–38.
  • Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. Rotate: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197 (2019).
  • Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In WWW’15. 1067–1077.
  • Tu et al. (2017) Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. Cane: Context-aware network embedding for relation modeling. In ACL’17. 1722–1731.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. ICLR’18 (2018).
  • Wang et al. (2017b) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017b. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR’17. ACM, 515–524.
  • Wang et al. (2018) Qinyong Wang, Hongzhi Yin, Zhiting Hu, Defu Lian, Hao Wang, and Zi Huang. 2018. Neural memory streaming recommender networks with adversarial training. In KDD’18. ACM, 2467–2475.
  • Wang et al. (2017a) Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. 2017a. Community preserving network embedding. In AAAI’17.
  • Weston et al. (2011) Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI’11.
  • Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In KDD’18. ACM, 974–983.
  • Zhang et al. (2013) Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing Top-N Collaborative Filtering via Dynamic Negative Item Sampling. In SIGIR’13. ACM, 785–788.
  • Zhang et al. (2019) Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei Chen. 2019. NSCaching: Simple and Efficient Negative Sampling for Knowledge Graph Embedding. (2019), 614–625.
  • Zhang and Zweigenbaum (2018) Zheng Zhang and Pierre Zweigenbaum. 2018. GNEG: Graph-Based Negative Sampling for word2vec. In ACL’18. 566–571.
  • Zhao et al. (2015) Tong Zhao, Julian McAuley, and Irwin King. 2015. Improving latent factor models via personalized feature projection for one class recommendation. In CIKM’15. ACM, 821–830.
  • Zhou et al. (2017) Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. 2017. Scalable graph embedding for asymmetric proximity. In AAAI’17.
  • Zhou et al. (2020) Chang Zhou, Jianxin Ma, Jianwei Zhang, Jingren Zhou, and Hongxia Yang. 2020. Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems. arXiv:cs.IR/2005.12964

Appendix A Appendix

In the appendix, we first report the implementation notes of MCNS, including the running environment and implementation details. Then, the encoder algorithms are presented. Next, the detailed description of each baseline is provided. Finally, we present datasets, evaluation metrics and the DFS algorithm in detail.

A.1. Implementation Notes

Running Environment. The experiments are conducted on a single Linux server with 14 Intel(R) Xeon(R) CPU E5-268 v4 @ 2.40GHz, 256G RAM and 8 NVIDIA GeForce RTX 2080TI-11GB. Our proposed MCNS  is implemented in Tensorflow 1.14 and Python 3.7.

Implementation Details. All algorithms can be divided into three parts: negative sampling, positive sampling and embedding learning. In this paper, we focus on negative sampling strategy. A negative sampler selects a negative sample for each positive pair and feeds it with a corresponding positive pair to the encoder to learn node embeddings. With the exception of PinSAGE, the ratio of positive to negative pairs is set to 1:11:1 for all negative sampling strategies. PinSAGE uniformly selects 20 negative samples per batch and adds 5 “hard negative samples” per positive pair. Nodes ranked at top-100 according to PageRank scores are randomly sampled as “hard negative samples”. The candidate size SS in DNS is set to 5 for all datasets. The value of TT in KBGAN is set to 200 for Amazon and Alibaba datasets, and 100 for other datasets.

For recommendation task, we evenly divide the edges in each dataset into 11 parts, 10 for ten-fold cross validation and the extra one as a validation set to determine hyperparameters. For recommendation datasets, the node types contain UU and II representing user and item respectively. Thus, the paths for information aggregation are set to UIUU-I-U and IUII-U-I for GraphSAGE and GCN encoders. All the important parameters in MCNS are determined by a rough grid search. For example, the learning rate is the best among [10510^{-5},10410^{-4},10310^{-3},10210^{-2}]. Adam with β1=0.9,β2=0.999\beta_{1}=0.9,\beta_{2}=0.999 serves as the optimizer for all the experiments. The values of hyperparameters in different datasets are listed as follows:

MovieLens Amazon Alibaba
learning rate 10310^{-3} 10310^{-3} 10310^{-3}
embedding dim 256 512 512
margin γ\gamma 0.1 0.1 0.1
batch size 256 512 512
MM in Hits@k ALL 500 500

For link prediction task, we employ five-fold cross validation to train all methods. The embedding dimension is set to 256, and the margin γ\gamma also be set to 0.1 for all strategies. The model parameters are updated and optimized by stochastic gradient descent with Adam optimizer.

In multi-label classification task, we first employ all the edges to learn node embeddings. Then, these embeddings serve as features to train the classifier. Embedding dimension is set to 128 and the margin γ\gamma is set as 0.1 on the BlogCatalog dataset for all negative sampling strategies. The optimizer also adopts Adam updating rule.

A.2. Encoders

We perform experiments on three algorithms as follows.

  • DeepWalk (Perozzi et al., 2014) is the most representative graph representation algorithm. According to the Skip-gram model (Mikolov et al., 2013), DeepWalk samples nodes co-occurring in the same window in random-walk paths as positive pairs, which acts as a data augmentation. This strategy serves as “sampling from pd^\hat{p_{d}}” in the view of SampledNCE (§2.2).

  • GCN (Kipf and Welling, 2017) introduces a simple and effective neighbor aggregation operation for graph representation learning. Although GCN samples positive pairs from direct edges in the graph, the final embeddings of nodes are implicitly constrained by the smoothing effect of graph convolution. However, due to its poor scalability and high memory consumption, GCN cannot handle large graphs.

  • GraphSAGE (Hamilton et al., 2017) is an inductive framework of graph convolution. A fixed number of neighbors are sampled to make the model suitable for batch training, bringing about its prevalence in large-scale graphs. The mean-aggregator is used in our experiments.

A.3. Baselines

We give detailed negative sampling strategies as follows. For each encoder, the hyperparameters for various negative sampling strategies remain the same.

  • Power of Degree (Mikolov et al., 2013). This strategy is first proposed in word2vec, in which pn(v)deg(v)βp_{n}(v)\propto deg(v)^{\beta}. In word embedding, the best α\alpha is found to be 0.75 by experiments and followed by most graph embedding algorithms (Tang et al., 2015; Perozzi et al., 2014). We use Deg0.75Deg^{0.75} to denote this widely-used setting.

  • RNS (Caselles-Dupré et al., 2018). RNS means sampling negative nodes at uniform random. Previous work (Caselles-Dupré et al., 2018) mentions that in recommendation, RNS performs better and more robust than Deg0.75Deg^{0.75} during tuning hyperparameters. This phenomenon is also verified in our recommendation experiments by using this setting as a baseline.

  • WRMF (Hu et al., 2008; Pan et al., 2008). Weighted Regularized Matrix Factorization is an implicit MF model, which adapts a weight to reduce the impact of unobserved interactions and utilizes an altering-least-squares optimization process. To solve the high computational costs, WRMF proposes three negative sampling schemes, including Uniform Random Sampling (RNS), Item-Oriented Sampling (ItemPop), User-Oriented Sampling (Unconsider in our paper).

  • DNS (Zhang et al., 2013). Dynamic Negative Sampling (DNS) is originally designed for sampling negative items to improve pair-wise collaborative filtering. For each user uu, DNS samples negative SS items {v0,,vS1}\{v_{0},...,v_{S-1}\} and only retains the one with the highest scoring function s(u,v)s(u,v). DNS can be applied to graph-based recommendation by replacing s(u,v)s(u,v) in BPR (Rendle et al., 2009) with the inner product of embedding 𝒖𝒗\vec{\bm{u}}\vec{\bm{v}}^{\top}.

  • PinSAGE (Ying et al., 2018). In PinSAGE, a technique called “hard negative items” is introduced to improve its performance. The hard negative items refer to the high ranking items according to users’ Personalized PageRank scores. Finally, negative items are sampled from top-K hard negative items.

  • WARP (Weston et al., 2011; Zhao et al., 2015). Weighted Approximate-Rank Pairwise(WARP) utilizes SGD and a novel sampling trick to approximate ranks. For each positive pair (u,v)(u,v), WARP proposes uniform sampling with rejection strategy to find a negative item vv^{\prime} until 𝒖𝒗𝒖𝒗+γ>0\vec{\bm{u}}\vec{\bm{v}}^{\prime\top}-\vec{\bm{u}}\vec{\bm{v}}^{\top}+\gamma>0.

  • IRGAN (Wang et al., 2017b). IRGAN is a GAN-based IR model that unify generative and discriminative information retrieval models. The generator generates adversarial negative samples to fool the discriminator while the discriminator tries to distinguish them from true positive samples.

  • KBGAN (Cai and Wang, 2018). KBGAN proposed a adversarial sampler to improve the performances of knowledge graph embedding models. To reduce computational complexity, KBGAN uniformly randomly selects TT nodes to calculate the probability distribution for generating negative samples. Inspired by it, NMRN (Wang et al., 2018) employed a GAN-based adversarial training framework in streaming recommender model.

A.4. Datasets

The detailed descriptions for five datasets are listed as follows.

  • MovieLens333https://grouplens.org/datasets/movielens/100k/ is a widely-used movie rating dataset for evaluating recommender systems. We choose the MovieLens-100k version that contains 100,000 interaction records generated by 943 users on 1682 movies.

  • Amazon444http://jmcauley.ucsd.edu/data/amazon/links.html is a large E-commerce dataset introduced in (McAuley et al., 2015), which contains purchase records and review texts whose time stamps span from May 1996 to July 2014. In the experiments, we take the data from the commonly-used “electronics” category to establish a user-item graph.

  • Alibaba555we use Alibaba dataset to represent the real data we collected. is constructed based on the data from another large E-commerce platform, which contains user’s purchase records and items’ attribute information. The data are organized as a user-item graph for recommendation.

  • Arxiv GR-QC (Leskovec et al., 2007) is a collaboration network generated from the e-print arXiv where nodes represent authors and edges indicate collaborations between authors.

  • BlogCatalog is a network of social relationships provided by blogger authors. The labels represent the topic categories provided by the authors and each blogger may be associated to multiple categories. There are overall 39 different categories for BlogCatalog.

A.5. Evaluation Metrics

Recommendation. To evaluate the performance of MCNS for recommendation task, Hits@kHits@k (Cremonesi et al., 2010) and mean reciprocal ranking (MRRMRR)  (Sugiyama and Kan, 2010) serve as evaluation methodologies.

  • Hits@k is introduced by the classic work (Sugiyama and Kan, 2010) to measure the performance of personalized recommendation at a coarse granularity. Specifically, we can compute Hits@kHits@k of a recommendation system as follows:

    1. (1)

      For each user-item pair (u,v)(u,v) in the test set DtestD_{test}, we randomly select MM additional items {v0,,vM1}\{v_{0}^{\prime},...,v_{M-1}^{\prime}\} which are never showed to the queried user uu.

    2. (2)

      Form a ranked list of {𝒖𝒗,𝒖𝒗0,,𝒖𝒗M1}\{\vec{\bm{u}}\vec{\bm{v}}^{\top},\vec{\bm{u}}\vec{\bm{v}}_{0}^{\prime\top},...,\vec{\bm{u}}\vec{\bm{v}}_{M-1}^{\prime\top}\} in descending order.

    3. (3)

      Examine whether the rank of 𝒖𝒗\vec{\bm{u}}\vec{\bm{v}}^{\top} is less than kk (hit) or not (miss).

    4. (4)

      Calculate Hits@kHits@k according to Hits@k=#hit@k|Dtest|Hits@k=\frac{\#hit@k}{|D_{test}|}.

  • MRR is another evaluation metric for recommendation task, which measures the performance at a finer granularity than Hits@kHits@k. MRRMRR assigns different scores to different ranks of item vv in the ranked list mentioned above when querying user uu, which is defined as follows:

    MRR=1|Dtest|(ui,vi)Dtest1rank(vi).MRR=\frac{1}{|D_{test}|}\sum_{(u_{i},v_{i})\in D_{test}}{\frac{1}{rank(v_{i})}}.

In our experiment, we set kk as {10,30}\{10,30\}, and MM is set to 500 for the Alibaba and Amazon datasets. For the MovieLens dataset, we use all unconnected items for each user as MM items to evaluate the hit rate.

Link Prediction. The standard evaluation metric AUC is widely applied to evaluate the performance on link prediction task (Grover and Leskovec, 2016; Tu et al., 2017; Zhou et al., 2017). Here, we utilize roc_\_auc_\_score function from scikit-learn to calculate prediction score. The specific calculation process is demonstrated as follows.

  1. (1)

    We extract 30%30\% of true edges and remove them from the whole graph while ensuring the residual graph is connected.

  2. (2)

    We sample 30 %\% false edges, and then combine 30%30\% of true edges and false edges into a test set.

  3. (3)

    Each graph embedding algorithm with various negative sampling strategies is trained using the residual graph, and then the embedding for each node is obtained.

  4. (4)

    We predict the probability of a node pair being a true edge according to the inner product. We finally calculate AUC score according to the probability via roc_\_auc_\_score function.

Multi-label Classification. In the multi-label classification, we adopt Micro-F1 and Macro-F1 as evaluation metrics, which are widely used in many previous works (Mikolov et al., 2013; Grover and Leskovec, 2016). In our experiments, the F1 score is calculated by scikit-learn. In detail, we randomly select a percentage of labeled nodes, TR{10T_{R}\in\{10%,50,50%,90,90%}\} to learn node embeddings. Then the learned embeddings is used as features to train the classifier and predict the labels of the remaining nodes. For fairness in comparison, we used the same training set to train the classifier for all negative sampling strategies. For each training ratio, we repeat the process 10 times and report the average scores.

A.6. DFS

Here we introduce a DFS strategy to generate traversal sequence LL where adjacent nodes in the sequence is also adjacent in the graph.

Input: Current node xx.
Global variables: graph G=(V,E)G=(V,E), sequence LL.
Append xx to LL
for each unvisited neighbor yy of xx in GG do
       DFS(yy)
       Append xx to LL
end for
Algorithm 3 DFS