
Dropout Training of Matrix Factorization and Autoencoder for Link Prediction in Sparse Graphs

Shuangfei Zhai, Binghamton University, USA. szhai2@binghamton.edu    Zhongfei (Mark) Zhang, Binghamton University, USA. zhongfei@cs.binghamton.edu
Abstract

Matrix factorization (MF) and Autoencoder (AE) are among the most successful approaches to unsupervised learning. While MF based models have been extensively exploited in the graph modeling and link prediction literature, the AE family has not gained much attention. In this paper we investigate the application of both MF and AE to the link prediction problem in sparse graphs. We show the connection between AE and MF from the perspective of multiview learning, and further propose MF+AE: a model that trains MF and AE jointly with shared parameters. We apply dropout to the training of both the MF and AE parts, and show that it significantly prevents overfitting by acting as an adaptive regularization. We conduct experiments on six real world sparse graph datasets, and show that MF+AE consistently outperforms the competing methods, especially on datasets that demonstrate strong non-cohesive structures.

1 Introduction

Link prediction is one of the fundamental problems of network analysis; as pointed out in [12], "a network model is useful to the extent that it can support meaningful inferences from observed network data." Given a graph G(V,E) with its set of nodes V, its set of edges E, and its adjacency matrix A\in\{0,1\}^{N\times N}, link prediction can be considered as a matrix completion problem on A. The problem is challenging because A is often large and sparse, which means that only a small fraction of the links are observed. As a result, a good model should have enough capacity to accommodate the complex connectivity pattern between all N^{2} pairs of nodes, as well as strong generalization ability to make accurate predictions on unobserved pairs.

Among the large number of models proposed over the decade, Matrix Factorization (MF) is one of the most popular in network modeling and relational learning in general [27, 24, 14, 22]. In its simplest form, MF directly models the interaction between a pair of nodes as the inner product of two latent factors, one for each node. This assumes that the large square adjacency matrix A can be factorized into the product of a tall, thin matrix and a short, wide matrix as A\approx W_{2}W_{1}, where W_{2}\in R^{N\times K}, W_{1}\in R^{K\times N}. Each row of W_{2} and each column of W_{1} is often called a latent factor, as they often capture the community membership information of the nodes. Training of such a model is usually conducted with stochastic gradient descent, which makes it easily scalable to large datasets [27, 24].

Bayesian models such as MMSB [2, 15, 26] are another family of latent factor models for studying network structures. Compared with MF, which directly learns the latent factors as free parameters by solving an optimization problem, Bayesian models treat the latent factors as random variables and model the stochastic process of the creation of a link. As a result, they can be considered as a stochastic version of MF. By putting various priors on the latent factors, link prediction is reduced to the problem of inferring the posterior distribution of the link status. While Bayesian models can often significantly reduce overfitting compared with their deterministic counterparts, inference procedures such as MCMC and Variational Inference are usually much slower than direct optimization. As a result, their application has been limited to moderately sized datasets.

Autoencoder (AE), together with its variants such as the Restricted Boltzmann Machine (RBM), has recently achieved great success in various machine learning applications, and is well recognized as a building block of Deep Learning [5, 8]. AE learns useful representations by learning a mapping from an input to itself, which distinguishes it from the approaches mentioned above. Surprisingly, it is only recently that AE has found applications in modeling graphs [7, 11]. In this paper we investigate the effectiveness of AE on the link prediction problem. In particular, we show that AE is closely connected to MF when unified in the same architecture. Motivated by this observation, we argue that MF and AE are indeed two complementary views of the same problem, and propose a novel model, MF+AE, where MF and AE are jointly trained with shared parameters.

To prevent overfitting, we train the model with Dropout [19] combined with stochastic gradient descent. We highlight the effect of dropout training on MF+AE, and show that, when approximated by the second order Taylor expansion, dropout training effectively penalizes a scaled \ell_{2} norm of (combinations of) the rows or columns of the weight matrices. We evaluate MF+AE on six real world sparse graphs and show that dropout significantly mitigates overfitting for both MF and AE, and that MF+AE outperforms the competing methods on all the datasets.

2 Model

Matrix Factorization: Given an undirected graph G(V,E) with N nodes and |E| edges, MF approximates its adjacency matrix A\in\{0,1\}^{N\times N} with the product of two low rank matrices W_{2} and W_{1} by solving the optimization problem:

(2.1) \min\sum_{i,j}{L(A_{i,j},\;g(W_{2}^{j}W_{1,i}+b_{1,i}+b_{2,j}))}

where W_{1}\in R^{K\times N}, W_{2}\in R^{N\times K}; b_{1}\in R^{N}, b_{2}\in R^{N} are the biases, W_{1,i} is the i^{th} column of W_{1}, W_{2}^{j} is the j^{th} row of W_{2}, g is the link function, L is the loss function defined on each pair of nodes, and K is the dimension of the latent factors. Since A is symmetric for an undirected graph, it is sometimes useful to adopt tied weights, i.e., W_{1}=W_{2}^{T}, b_{1}=b_{2}. We refer to the model with tied weights as the symmetric version of MF in this paper.
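For concreteness, a minimal NumPy sketch of the scoring in (2.1) might look as follows; the sigmoid link function g introduced later in this section is assumed, the variable names mirror the notation above, and the toy sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mf_link_prob(W1, W2, b1, b2, i, j):
    """Predicted link probability for the pair (i, j) under (2.1):
    g(W2^j W1,i + b1,i + b2,j), with g the sigmoid link function."""
    return sigmoid(W2[j] @ W1[:, i] + b1[i] + b2[j])

# Toy example: N = 4 nodes, K = 2 latent dimensions.
rng = np.random.default_rng(0)
N, K = 4, 2
W1 = rng.normal(scale=0.1, size=(K, N))  # columns W1[:, i] are latent factors
W2 = rng.normal(scale=0.1, size=(N, K))  # rows W2[j] are latent factors
b1, b2 = np.zeros(N), np.zeros(N)
print(mf_link_prob(W1, W2, b1, b2, 0, 3))
```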

Autoencoder: AE was originally proposed as an unsupervised representation learning method. Given an example represented by a feature vector x, an AE learns a reconstruction of the example through a function \tilde{x}=F(x). The typical form of F(x) is a neural network with one hidden layer, where x is first mapped to a hidden representation h, and a reconstruction \tilde{x} is then obtained by mapping h back to the original feature space:

(2.2) h=f(W_{1}x+b_{1}),\quad\tilde{x}=g(W_{2}h+b_{2})

with W_{1}\in R^{K\times N}, b_{1}\in R^{K}, W_{2}\in R^{N\times K}, b_{2}\in R^{N}. The parameters are learned by solving the following optimization problem:

(2.3) \min\sum_{i}{L(x_{i},\tilde{x}_{i};W_{1},b_{1},W_{2},b_{2})}

Here we have slightly overloaded the loss function L by defining it on the column vector x_{i}. The natural way of applying AE to modeling graphs is to represent each node by the set of its neighbors, in other words, to set x_{i}=A_{i}. This is analogous to the bag of words representation prevalent in the document modeling community, and we call it the bag of nodes representation. Note that this representation is sufficient: when only the topological structure is available, a node is fully characterized by its set of neighbors.
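As a rough sketch (assuming a dense NumPy adjacency matrix A, and the sigmoid/ReLU choices made later in this section), the AE forward pass with the bag-of-nodes input could be written as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ae_reconstruct(A, i, W1, b1, W2, b2, f=lambda z: np.maximum(0.0, z)):
    """Reconstruct node i's neighborhood from its bag-of-nodes vector A[:, i],
    following (2.2) with x_i = A_i; f defaults to ReLU, the activation adopted
    later in this section."""
    x = A[:, i]                  # bag-of-nodes representation of node i
    h = f(W1 @ x + b1)           # hidden representation, K-dimensional
    return sigmoid(W2 @ h + b2)  # reconstruction of A[:, i]
```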

The Joint Model: To better see the connection and difference between MF and AE, we now rewrite (2.3) by substituting x_{i} with A_{i}:

(2.4) \min\sum_{i,j}L(A_{i,j},g(W_{2}^{j}h_{i}+b_{2,j}))\quad s.t.\;h_{i}=f(W_{1}A_{i}+b_{1})

Similarly, we rewrite (2.1) by omitting the b_{1} term:

(2.5) \min\sum_{i,j}L(A_{i,j},g(W_{2}^{j}h_{i}+b_{2,j}))\quad s.t.\;h_{i}=W_{1}\delta_{i}

where \delta_{i}\in R^{N} is the indicator vector of node i, a binary vector with 1 at the i^{th} entry and 0 everywhere else. We deliberately organize (2.4) and (2.5) so that, at a high level, they share the same architecture. Both models first learn a hidden representation h_{i}, which is then fed through a classifier with link function g and loss function L. The main difference lies only in the form of the hidden representation. For each node i, MF only looks at its id, and the hidden representation is obtained by simply extracting the i^{th} column of W_{1}. For AE, we first sum up the columns of W_{1} indicated by i's neighbors, and then pass the sum through an activation function f. As a result, in MF, two nodes propagate "positive" information to each other only if they are directly connected; in AE, however, two nodes can do so as long as they appear in the same neighborhood of some other node, even if they are not directly connected. The different ways in which information propagates in the two models indicate that MF and AE are complementary, modeling different aspects of the same topological structure.

We can also interpret the distinction between MF and AE as two different views of the same entity: MF uses \delta_{i}, and AE uses A_{i}. We note that the two views are disjoint and sufficient: they do not overlap with each other, yet each of them can sufficiently represent a node. This perspective motivates us to build a unified architecture in which we train the two models jointly and require both of them to uncover the graph structure. The idea is similar to co-training [23] in the semisupervised learning setting, where one trains two classifiers on two sufficient views such that the two views can "teach" each other on unlabeled data. While in our problem there is no "unlabeled data" available, we argue that the model can still benefit from the co-training idea by requiring the two views to "agree with" each other. We formulate our model, which we call MF+AE, as follows:

(2.6) \begin{split}\min&\sum_{i}{L(A_{i},g(W_{2}h_{1,i}+b_{2}))}+\rho\sum_{i}{L(A_{i},g(W_{2}h_{2,i}+b_{4}))}\\ s.t.\;&h_{1,i}=f(W_{1}A_{i}+b_{1}),\;h_{2,i}=f(W_{1}\delta_{i}+b_{3})\end{split}

In (2.6), the objective function is composed of two parts, one for AE and the other for MF. h_{1,i} and h_{2,i} are the hidden representations learned by the AE and MF parts, respectively, and the "agreement" is achieved by using the same set of weights W_{1} and W_{2} for both AE and MF. We modify the architecture of the MF objective by adding the same activation function f and a corresponding bias term b_{3}. \rho is a positive real number, which can simply be set to 1 in practice. We show the architecture of MF+AE in Figure 1.
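As an illustration, one forward pass of (2.6) could be sketched as follows (a minimal NumPy sketch, not the authors' implementation; the ReLU and sigmoid choices made later in this section are assumed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def joint_forward(A, i, W1, W2, b1, b2, b3, b4):
    """One forward pass of (2.6) for node i; both views share W1 and W2."""
    N = A.shape[0]
    delta = np.zeros(N)
    delta[i] = 1.0                   # MF view: indicator vector of node i
    h1 = relu(W1 @ A[:, i] + b1)     # AE view: bag of nodes
    h2 = relu(W1 @ delta + b3)       # MF view
    A1 = sigmoid(W2 @ h1 + b2)       # AE reconstruction of A_i
    A2 = sigmoid(W2 @ h2 + b4)       # MF reconstruction of A_i
    return A1, A2
```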

Figure 1: The architecture of the MF+AE model. The top part corresponds to the MF module, where the input is the indicator vector \delta_{i}; the bottom part corresponds to the AE module, where the input is the set of neighbors represented by the vector A_{i}. The two modules share the same transform matrices W_{1} and W_{2}. The model is trained so that both views can reconstruct the adjacency matrix A.

In Figure 1, the bottom and top parts correspond to the first and second lines of (2.6), respectively. The color of each entry of a vector indicates its value, with white for 0 and black for 1. We see that the MF module is trained to reconstruct the neighborhood structure based on \delta_{i}, while the AE module learns to reconstruct A_{i} from itself. The reconstructions from AE and MF are denoted as \tilde{A}^{(1)}_{i} and \tilde{A}^{(2)}_{i}, respectively. Note that although the 5^{th} node does not appear in the neighborhood of i (A_{i,5}=0), the reconstruction is close to 1 (dark gray). This means that we can make a confident prediction of a link from i to node 5. After the two reconstructions are obtained, the final prediction is calculated as the geometric mean of the two:

(2.7) \tilde{A}_{i}=\sqrt[1+\rho]{\tilde{A}^{(1)}_{i}\odot(\tilde{A}^{(2)}_{i})^{\rho}}

where \odot denotes the element-wise product of two vectors.
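A one-line sketch of this combination step, reusing the two reconstructions A1 and A2 from the joint forward sketch above:

```python
import numpy as np

def combine_predictions(A1, A2, rho=1.0):
    """Final prediction of (2.7): the (1+rho)-th root of A1 * A2^rho."""
    return (A1 * A2 ** rho) ** (1.0 / (1.0 + rho))
```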

Activation Function and Loss Function: For the activation function f, we choose the Rectified Linear Unit (ReLU), defined as:

(2.8) f(x)=\max(0,x)

ReLU provides nonlinearity by zeroing out all negative inputs and keeping positive ones intact. We choose ReLU over other popular choices such as Sigmoid or \tanh for two reasons. First, it does not squash the output to a fixed interval as Sigmoid or \tanh does. As a result, it is closest to our intuition of approximating the behavior of MF. In fact, from the point of view of MF, the effect of ReLU can be considered as putting a non-negativity constraint on h, which is closely related to the idea of Non-negative Matrix Factorization [10]. Secondly, ReLU is fast to compute compared with its alternatives, while still providing enough nonlinearity to significantly improve the model capacity over linear structures.

For g and L, we choose the Sigmoid function combined with the cross entropy loss:

(2.9) g(x)=\frac{1}{1+e^{-x}}
(2.10) L(x,\tilde{x})=-x\log{\tilde{x}}-(1-x)\log(1-\tilde{x})

The saturating property of the Sigmoid function gives the model much flexibility, since h_{1,i} and h_{2,i} only need to have similar activation patterns for both to achieve good reconstructions; cross entropy is also naturally a more appropriate choice than the square loss for a binary matrix. Moreover, as A is often extremely sparse, reconstructing the whole matrix incurs a class imbalance problem: the loss of the positive entries is dominated by that of the negative entries. As a result, it is important to reweight the costs of the positive and negative classes using a cost sensitive strategy. Consequently, our final form of the loss function becomes:

(2.11) L(A_{i},\tilde{A}_{i})=-\sum_{j\approx i}{\log\tilde{A}_{i,j}}-\eta\sum_{j\not\approx i}\log(1-\tilde{A}_{i,j})

where \tilde{A}_{i}=g(W_{2}h_{i}+b_{2}) is the reconstruction of A_{i}; j\approx i if A_{i,j}=1, and j\not\approx i otherwise. In practice, it is sufficient to approximate the second part of L with a few samples; that is, at each training iteration we only sample part of the non-links for each node. Doing so greatly speeds up training on large graphs without sacrificing performance. \eta is the weight on the loss of the negative entries, which can simply be set as \frac{\#nonlink\,samples}{\#links}.
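A sketch of the cost-sensitive loss (2.11) for a single node, with the non-links subsampled as described; eta is left as a parameter here (the text suggests #nonlink samples / #links), and the small eps is only for numerical safety:

```python
import numpy as np

def node_loss(A_i, A_tilde, neg_idx, eta, eps=1e-12):
    """Cost-sensitive cross entropy of (2.11) for one node.

    A_i     : binary neighborhood vector of the node
    A_tilde : its reconstruction, entries in (0, 1)
    neg_idx : indices of the sampled non-links for this node
    eta     : weight on the non-link term
    """
    pos_idx = np.flatnonzero(A_i == 1)
    loss_pos = -np.log(A_tilde[pos_idx] + eps).sum()
    loss_neg = -eta * np.log(1.0 - A_tilde[neg_idx] + eps).sum()
    return loss_pos + loss_neg
```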

Dropout As Regularization: Regularization is critical for most machine learning models to generalize well to unseen data. Instead of putting explicit constraints on the parameters or hidden units, we use dropout training [19, 21] as an implicit regularization. Dropout is a technique originally proposed for training feedforward neural networks to avoid overfitting. It works as follows: during stochastic gradient training, for each node in the current iteration we randomly drop out half of the hidden units (for both AE and MF) and half of the feature units (for AE). Mathematically, the effect of dropout can be expressed by applying element-wise random dropout masks as follows:

(2.12) \begin{split}\min\sum_{i}&\mathrm{E}_{\xi_{h,i},\xi_{in,i}}\{L(A_{i},g(W_{2}(\xi_{h,i}\odot h_{1,i})+b_{2}))+L(A_{i},g(W_{2}(\xi_{h,i}\odot h_{2,i})+b_{4}))\}\\ s.t.\;&h_{1,i}=f(W_{1}(\xi_{in,i}\odot A_{i})+b_{1}),\;h_{2,i}=f(W_{1}\delta_{i}+b_{3})\end{split}

Here \xi_{h,i}\in\{0,1\}^{K} and \xi_{in,i}\in\{0,1\}^{N} are the dropout masks for the hidden and input units, respectively; each of their elements is an iid draw from a Bernoulli distribution, and \odot is the element-wise product of two vectors. Note that we use the same dropout mask \xi_{h,i} for both the AE and MF modules. This ensures that dropout does not introduce any difference in architecture between the two views.
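One sampled realization of the expectation in (2.12) could be sketched as below (in practice the expectation is approximated by the stochastic gradient steps themselves; keep=0.5 corresponds to dropping out half of the units):

```python
import numpy as np

def dropout_forward(A, i, W1, W2, b1, b2, b3, b4, rng, keep=0.5):
    """One stochastic forward pass of (2.12) for node i.

    The hidden mask xi_h is shared by the AE and MF views;
    the input mask xi_in is applied to the AE input only."""
    N, K = A.shape[0], W1.shape[0]
    xi_in = rng.binomial(1, keep, size=N)   # input dropout mask (AE only)
    xi_h = rng.binomial(1, keep, size=K)    # shared hidden dropout mask
    delta = np.zeros(N)
    delta[i] = 1.0
    h1 = np.maximum(0.0, W1 @ (xi_in * A[:, i]) + b1)
    h2 = np.maximum(0.0, W1 @ delta + b3)
    A1 = 1.0 / (1.0 + np.exp(-(W2 @ (xi_h * h1) + b2)))  # AE reconstruction
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ (xi_h * h2) + b4)))  # MF reconstruction
    return A1, A2
```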

For the AE module, randomly dropping out the input can be considered a "denoising" technique, which has been exploited in previous work [20, 6] and has also been applied to link prediction [7]. The motivation is that a model should be able to make a good reconstruction from a noisy or partial input. This property is particularly interesting for our problem because link prediction is exactly that: predicting the whole from observed parts.

While theoretically explaining the effect of dropout is difficult for complex models, we can still gain insight by looking at an approximate surrogate. Previously, [21, 6] used the second order Taylor expansion to explain the effect of feature noising in generalized linear models and AE, respectively. We borrow the same tool to analyze a simplified version of MF+AE.

Dropout for Matrix Factorization: We consider the effect of dropout training on (2.1). For conciseness we ignore the bias terms; the resulting model is described by the following objective function:

(2.13) O=\sum_{i,j}\mathrm{E}_{\xi_{i}}\{L(A_{i,j},g(W_{2}^{j}(\xi_{i}\odot W_{1,i})))\}

When the Sigmoid activation function with the cross entropy loss is used, we can compute the second order approximation of (2.13) in closed form as:

(2.14) \begin{split}\tilde{O}=\sum_{i,j}&L(A_{i,j},g(\frac{1}{2}W_{2}^{j}W_{1,i}))\\&+\mathrm{E}_{\xi_{i}}\{(W_{2}^{j}\odot W_{1,i}^{T})(\xi_{i}-\frac{1}{2}e)(g_{i,j}-A_{i,j})\}\\&+\frac{1}{2}\mathrm{E}_{\xi_{i}}\{((W_{2}^{j}\odot W_{1,i}^{T})(\xi_{i}-\frac{1}{2}e))^{2}g_{i,j}(1-g_{i,j})\}\\=\sum_{i,j}&L(A_{i,j},g(\frac{1}{2}W_{2}^{j}W_{1,i}))+\frac{1}{8}(W_{2}^{j})^{2}(W_{1,i})^{2}g_{i,j}(1-g_{i,j})\\=\sum_{i}&L(A_{i},g(\frac{1}{2}W_{2}W_{1,i}))+\frac{1}{8}\underbrace{\Big(\sum_{j}(W_{2}^{j})^{2}g_{i,j}(1-g_{i,j})\Big)}_{\lambda^{i}}(W_{1,i})^{2}\end{split}

where e is a column vector of all 1s, (W_{2}^{j})^{2} and (W_{1,i})^{2} are the element-wise squares of the corresponding row and column vectors, and g_{i,j} is short for g(\frac{1}{2}W_{2}^{j}W_{1,i}). The first equality of (2.14) is the result of the second order Taylor expansion; the second equality takes the expectation over the random variable \xi_{i}, whose K entries are iid Bernoulli variables; the third equality is just a reorganization. We see that with the second order approximation, the dropout effect splits into two factors. The first term is equivalent to the original objective except that the activation of each pair is scaled down by a half. The second term is more interesting; it can be considered as the product of a row vector \lambda^{i} and the element-wise square of a column of W_{1}. Note that if we set \lambda^{i}\propto e, the second term reduces to the ordinary \ell_{2} regularization on each column of W_{1}. In the case of dropout, however, \lambda^{i} is a weighted sum of the squares of the rows of W_{2}, where the weight of each row of W_{2} is determined by the degree of uncertainty of the prediction g_{i,j}. The overall effect of this regularization is twofold. First, it encourages the model to make confident predictions everywhere by minimizing g_{i,j}(1-g_{i,j}); secondly, it performs a scaled version of \ell_{2} regularization on the columns of W_{1}: coordinates that are highly active in the rows of W_{2} are less penalized in the columns of W_{1}, and vice versa. In other words, the penalization on the columns of W_{1} adapts both to the structure of W_{2} and to the uncertainty of the predictions. This is in stark contrast to \ell_{2} regularization, where the penalization is uniform over the columns of W_{1}. Finally, note that since the roles of W_{1} and W_{2} are exchangeable, the discussion of the regularization on the columns of W_{1} also applies to the rows of W_{2} by symmetry.
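To make the regularizer concrete, a small NumPy sketch that evaluates the adaptive term of (2.14) (the second term of the last line) for all nodes at once; this is an illustration of the surrogate, not part of the training procedure:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_penalty(W1, W2):
    """Adaptive term of (2.14): lam[:, i] (= lambda^i) weights the
    squared entries of column W1[:, i]."""
    G = sigmoid(0.5 * W2 @ W1)   # G[j, i] = g_{i,j}
    U = G * (1.0 - G)            # prediction uncertainty of each pair
    lam = (W2 ** 2).T @ U        # K x N matrix of adaptive weights
    return 0.125 * np.sum(lam * W1 ** 2)
```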

Dropout for Autoencoder: The nonlinear and nonsmooth nature of the ReLU activation makes it difficult to analyze the behavior of dropout. We therefore only consider the case where f is the identity function. Unsurprisingly, the effect of dropping out the hidden layer is similar to that in MF; the only difference is that W_{1,i} in (2.14) is replaced with W_{1}A_{i}. Following similar reasoning, it is easy to see that dropping out the hidden layer in AE again penalizes the scaled \ell_{2} norm of the rows of W_{2} in the same way. Its effect on W_{1} is more subtle: instead of penalizing the norms of the columns of W_{1} directly, the regularization is performed on linear combinations of them.

Next we proceed to study the effect of dropping out the input. Denoting W=W_{2}W_{1}, we have the dropout version of the objective function:

(2.15) O=\sum_{i}\mathrm{E}_{\xi_{i}}\{L(A_{i},g(W(\xi_{i}\odot A_{i})))\}

The second order approximation immediately follows as:

(2.16) \begin{split}\tilde{O}=\sum_{i}&L(A_{i},g(\frac{1}{2}WA_{i}))\\&+\sum_{j}\mathrm{E}_{\xi_{i}}\{(W^{j}\odot A_{i}^{T})(\xi_{i}-\frac{1}{2}e)(g_{i,j}-A_{i,j})\}\\&+\frac{1}{2}\sum_{j}\mathrm{E}_{\xi_{i}}\{((W^{j}\odot A_{i}^{T})(\xi_{i}-\frac{1}{2}e))^{2}g_{i,j}(1-g_{i,j})\}\\=\sum_{i}&L(A_{i},g(\frac{1}{2}WA_{i}))+\frac{1}{8}\sum_{j}(W^{j})^{2}\underbrace{\sum_{i}g_{i,j}(1-g_{i,j})(A_{i})^{2}}_{\lambda_{j}}\end{split}

where W^{j}=W_{2}^{j}W_{1} is the j^{th} row of W and g_{i,j} is short for g(\frac{1}{2}W^{j}A_{i}). Recall that dropping out the hidden units in both MF and AE performs a scaled \ell_{2} norm regularization on (linear combinations of) the columns of W_{1}; dropping out the input instead performs a scaled \ell_{2} regularization on linear combinations of the rows of W_{1}. The regularization on W_{2} is also very different from the case of dropping out the hidden units, though in a less transparent way.

To summarize, we have shown that in the simplified case, dropping out the hidden units and dropping out the inputs can both be interpreted as adaptive regularization. Both push the model to make confident predictions by minimizing the factor g_{i,j}(1-g_{i,j}), while penalizing different aspects of the parameter matrices. When combined in the joint training architecture, they provide complementary regularization that prevents MF+AE from overfitting.

3 Experiments

3.1 Experiment Setup

We conduct experiments on six real world datasets: DBLP, Facebook, Youtube, Twitter, GooglePlus, and LiveJournal, all of which are available for download at snap.stanford.edu/data. We summarize the statistics of the six datasets in Table 1. Except for DBLP, which is an author collaboration network, all the rest are online social networks. In particular, Youtube, Twitter, Gplus, and LiveJournal are originally directed networks; we convert them to undirected graphs by ignoring the direction of the links.

dataset N E D
DBLP 2,958 64,674 21.9
Facebook 2,277 148,546 65.2
Youtube 1,955 102,950 52.6
Twitter 2,477 107,895 43.6
Gplus 2,129 148,306 69.7
LiveJournal 3,006 123,236 41.0
Table 1: Statistics of the datasets where N: number of nodes, E: number of links, D: average degree.

Following the experimental protocol suggested in [4], we first split the observed links into a training graph G_{train} and a testing graph G_{test}. We then train the models on G_{train}, and evaluate their performance on G_{test}. Note that a naive algorithm which simply predicts a link for every pair of nodes with at least one common neighbor would already achieve fairly good accuracy. We therefore only consider nodes within two hops of the target node as candidate nodes [4]. The metrics used are Precision at the top 10 positions (Prec@10) and AUC.
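For illustration, Prec@10 for one target node could be computed roughly as below (a sketch under our reading of the protocol in [4]; scores_i stands for the predicted scores of whichever model is being evaluated, and details such as tie handling may differ from the authors' implementation):

```python
import numpy as np

def prec_at_k(A_train, A_test, scores_i, i, k=10):
    """Precision@k for target node i; candidates are restricted to
    non-neighbors within two hops in the training graph."""
    two_hop = (A_train @ A_train[:, i]) > 0          # shares a common neighbor with i
    cand = np.flatnonzero(two_hop & (A_train[:, i] == 0))
    cand = cand[cand != i]
    if len(cand) == 0:
        return 0.0
    top = cand[np.argsort(-scores_i[cand])][:k]
    return A_test[top, i].mean()
```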

The methods we evaluate are:

Adamic-Adar Score (AA) [1]. This is a popular score-based method that calculates the pair-wise score of nodes i,j as S(i,j)=\sum\limits_{n\in CN(i,j)}\frac{1}{\log(d_{n})}, where CN(i,j) denotes the set of common neighbors of nodes i and j, and d_{n} is the degree of node n. Prediction is made by ranking the scores of all nodes with respect to the target node. (A sketch of this score and of RW below is given after the method list.)

Random Walk with Restart (RW) [12]. RW uses the stationary distribution of a random walk with restarts from a given node as the scores of all the other nodes. The restart probability needs to be set in advance. In practice we find that while the performance of RW is fairly robust within a wide range of values, a relatively large value (compared with the 0.3 used in [4]) works slightly better on our problem; we set it to 0.5 throughout the experiments.

Matrix Factorization with \ell_{2} regularization (MF2). This is a variant of the model proposed in [14]. For a fair comparison with the other models, we use the cross entropy loss instead of the ranking loss proposed in that paper. We also use a weight decay of 10^{-5}.

Autoencoder with \ell_{2} regularization (AE2). This is the model corresponding to (2.3) with an additional \ell_{2} regularization on W_{1} and W_{2}. The weight decay parameter is also set to 10^{-5}.

Matrix Factorization with dropout (MFd). This corresponds to the model described in (2.13) with the additional bias vectors. No weight decay is used.

Autoencoder with Dropout (AEd). This is the single AE with the ReLU activation function and the cross entropy loss, trained with dropout.

Marginalized Denoising Model (MDM) [7]. This is one of the few existing AE based models; it uses a linear activation together with the square loss. MDM marginalizes the dropout noise on the features during training, but no hidden layer dropout is adopted. The model requires the input noise level to be set; we set it to 0.9 throughout the experiments.

The Joint Model (MF+AE). This corresponds to the model described in (2.12), where we jointly train MF and AE with shared weights using dropout.
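For reference, the two score-based baselines could be sketched as below on a dense symmetric adjacency matrix (a simple power-iteration version of RW; the authors' Matlab implementations may differ in details):

```python
import numpy as np

def adamic_adar_scores(A, i):
    """AA scores of node i against all nodes: common neighbors weighted by 1/log(degree)."""
    deg = A.sum(axis=0)
    w = np.where(deg > 1, 1.0 / np.log(np.maximum(deg, 2)), 0.0)
    return A @ (w * A[:, i])

def rwr_scores(A, i, restart=0.5, n_iter=200):
    """Random walk with restart from node i; the stationary distribution is the score vector."""
    P = A / np.maximum(A.sum(axis=0, keepdims=True), 1)  # column-stochastic transitions
    e = np.zeros(A.shape[0])
    e[i] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        r = (1.0 - restart) * (P @ r) + restart * e
    return r
```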

All the above methods are implemented in Matlab. We use the authors' implementation for MDM, and our own implementations for the rest. Note that MF2, AE2, MDM, MFd, AEd, and MF+AE all require the dimensionality K of the latent factor or hidden layer as input. For a fair comparison, we do not attempt to optimize this parameter for each model; instead, we set it to the same value for all models on all six datasets. In the experiments, we use K=100.

Another important aspect for both the MF and AE models is the choice between a symmetric and an asymmetric model, i.e., whether or not to set W_{1}=W_{2}^{T}. We find that the AE models are less sensitive to the characteristics of a dataset, and almost always benefit from tied weights. The optimal choice for the MF models is, however, extremely problem-dependent. In our experiments, we use tied weights for all AE based models, including MF+AE, on all six datasets. For the MF based models, we use tied weights on Facebook, Twitter, DBLP, and LiveJournal, and untied weights on Youtube and Gplus.

In this paper, we are interested in evaluating the performance of different models on sparse graphs. In other words, we investigate how well a model generalizes given only a sparse training graph G_{train}. To this end, each of the six datasets is chosen as a relatively densely connected subgraph, as in Table 1. We then randomly take 10\% of the links for training and use the rest for testing. In this way, we train the models on a sparse graph while still having enough held-out links for testing. We train all the models (except AA and RW, which require no training) to convergence, which means that we do not use a separate validation set to perform early stopping. We do this for two reasons. First, in sparse graphs, splitting out a separate set for validation is expensive. Secondly, and more importantly, we are interested in testing the generalization ability of each model. In practice we find that almost all the models we have tested benefit from a properly chosen early stopping point; however, this makes the results very difficult to interpret, as it is hard to isolate the contribution of early stopping in different models.
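A minimal sketch of such a random 10%/90% link split on an undirected adjacency matrix (the exact splitting procedure beyond the fraction is not specified in the text, so this is only one plausible realization):

```python
import numpy as np

def split_links(A, train_frac=0.1, seed=0):
    """Randomly keep train_frac of the observed links for training; hold out the rest."""
    rng = np.random.default_rng(seed)
    iu, ju = np.triu_indices_from(A, k=1)          # upper-triangle node pairs
    link = np.flatnonzero(A[iu, ju] == 1)          # observed links
    in_train = rng.random(len(link)) < train_frac
    A_train, A_test = np.zeros_like(A), np.zeros_like(A)
    for mask, M in ((in_train, A_train), (~in_train, A_test)):
        r, c = iu[link[mask]], ju[link[mask]]
        M[r, c] = 1
        M[c, r] = 1                                # keep the matrices symmetric
    return A_train, A_test
```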

Model Facebook Twitter Youtube Gplus DBLP LiveJournal Average
MF+AE 0.58057 0.46693 0.33132 0.41277 0.32462 0.29027 0.4011
AEd 0.54643 0.44229 0.31769 0.40085 0.29942 0.28659 0.3822
AE2 0.37748 0.2773 0.087839 0.17973 0.28308 0.1722 0.2296
MFd 0.46716 0.4041 0.23636 0.28956 0.29599 0.23958 0.3221
MF2 0.45216 0.39823 0.13842 0.24594 0.30735 0.21651 0.2931
MDM 0.54255 0.41304 0.23548 0.3149 0.30286 0.25415 0.3438
RW 0.53143 0.40647 0.15805 0.21685 0.27757 0.20524 0.2993
AA 0.47439 0.34576 0.13647 0.17523 0.23712 0.18247 0.2586
Model Facebook Twitter Youtube Gplus DBLP LiveJournal Average
MF+AE 0.9136 0.8192 0.75832 0.82262 0.80695 0.75779 0.8131
AEd 0.89655 0.81006 0.74512 0.80808 0.79257 0.76216 0.8024
AE2 0.69946 0.65219 0.53972 0.58406 0.71501 0.61556 0.6343
MFd 0.82891 0.74193 0.68857 0.754 0.77362 0.71683 0.7506
MF2 0.75104 0.69247 0.64594 0.68186 0.76708 0.68844 0.7045
MDM 0.89366 0.7882 0.66943 0.74716 0.79974 0.72491 0.7705
RW 0.88156 0.79524 0.74236 0.76542 0.78078 0.71717 0.7804
AA 0.70223 0.57279 0.47685 0.57406 0.74724 0.5681 0.6069
Table 2: Performance of each model on each dataset. The top half of the table reports Prec@10, the bottom half AUC. The best performance for each metric and dataset combination is highlighted in bold face.

MF+AE Achieves the Best Performance: We list the results of the experiments in Table 2. Overall, we see that MF+AE has the best average performance. In particular, it achieves the best performance on all six datasets evaluated by Prec@10, and on all but the LiveJournal dataset evaluated by AUC. This shows that the joint training of MF and AE consistently boosts the generalization ability without increasing the model size. AEd achieves the second best average performance under both metrics. MDM, as a variant of AE, achieves the third best performance on Prec@10 and the fourth on AUC. This is reasonable: on the one hand, its use of feature noising improves generalization; on the other hand, its linear nature limits its ability to model complex graph structures, and its use of the square loss while ignoring the class imbalance further hurts performance on sparse graphs. One seemingly surprising result is that RW performs quite well despite its simplicity.

Dropout Improves Generalization: We note that for both AE and MF, the dropout version significantly outperforms its \ell_{2} counterpart on both Prec@10 and AUC. In terms of average performance, AEd outperforms AE2 by 66\% on Prec@10 and 26\% on AUC; MFd outperforms MF2 by 10\% on Prec@10 and 6.5\% on AUC. This verifies that dropout, as an adaptive regularization, performs much better than \ell_{2} norm regularization, especially for AE, whose objective function is highly nonconvex. To better understand this, we visualize the full graph and the training graph of the Facebook dataset together with the predictions made by each model in Figure 2. We do not visualize the results of RW and AA since they do not output direct reconstructions. For each of the other six models, we binarize the predictions with a threshold of 0.5. For better visualization, we also downsample all the graphs by 80\%.

In Figure 2 we see that AE2 and MF2 fit the training graph very well, but fail to uncover the much more densely connected full graph. With dropout training, however, the predictions of AEd and MFd look much closer to the full graph than those of AE2 and MF2. This suggests that models trained with dropout generalize much better than their \ell_{2} regularized counterparts.

We then compare the predictions of MF+AE, AEd, MFd, and MDM, all of which use (different variants of) dropout training. It is not difficult to see that MF+AE's prediction resembles the full graph the most. AEd and MFd make many "false positive" predictions, clearly visible as the salt-and-pepper-like pixels off the diagonal. MDM makes more "false negative" predictions, such as missing the small cluster near the top right corner. We note that the quality of the predictions shown in the visualization is consistent with the results in Table 2.

Figure 2: Visualization of the Facebook dataset. From top left to bottom right: the full graph, followed by the predictions of MFd, AEd, and MF+AE; then the training graph, followed by the predictions of MF2, AE2, and MDM.

Modeling Non-cohesive Graphs

Figure 3: The average performance of symmetric MF vs. asymmetric MF

Among the six datasets we have tested, the Youtube and Gplus datasets demonstrate the least cohesive structures, i.e., they consist of more follower-followee relationships than the other datasets. The cohesive vs. non-cohesive distinction of graph structure has previously been investigated in [29, 9]. To show this, we train the symmetric and asymmetric versions of MFd and MF2 on all six datasets, and report the averaged performance of symmetric MF and asymmetric MF in Figure 3. We see that the symmetric version works better on Facebook, Twitter, DBLP, and LiveJournal, while the asymmetric version works better on Youtube and Gplus. This experiment shows that Youtube and Gplus demonstrate more non-cohesive structure, which cannot be modeled symmetrically by the inner product of two vectors.

With this in mind, let us look at the performance of the different models on these two datasets. It is clear that MF+AE and AEd, as the best and second best models, outperform the other methods by much larger margins than on the other four datasets. Note that AEd and MF+AE still use tied weights on these two datasets, as we found little difference in performance when switching to untied weights. Also note that even though MDM uses feature dropout, it still fails to model the non-cohesive structures properly. We argue that it is the nonlinear activation function that gives MF+AE and AEd more modeling power than the linear models.

4 Related Work

The link prediction problem can be considered as a special case of relational learning and recommender systems [27, 24], and many of the techniques proposed in those areas are directly applicable to link prediction as well. Salakhutdinov et al. [30] first applied RBM to movie recommendation. Recently, Chen and Zhang [7] proposed a variant of linear AE with marginalized feature noise for link prediction, and Li et al. [11] applied RBM to link prediction in dynamic graphs.

MF+AE is also related to the supervised learning based methods [3, 13]. While these approaches train a classifier on manually collected features, MF+AE learns the appropriate features directly from the adjacency matrix.

The utilization of dropout training as an implicit regularization also contrasts with Bayesian models [2, 15]. While both dropout and Bayesian inference are designed to reduce overfitting, their approaches are essentially orthogonal to each other. It would be interesting future work to investigate whether they can be combined to further improve generalization. Dropout has also been applied to training generalized linear models [21], log linear models with structured output [28], and distance metric learning [25].

This work is also related to graph representation learning. Recently, Perozzi et al. [16] propose to learn node embeddings by predicting the path of a random walk, and they show that the learned representation can boost the performance of the classification task on graph data. It would also be interesting to evaluate the effectiveness of MF+AE in the same setting.

5 Conclusion

We propose a novel model MF+AE which jointly trains MF and AE with shared parameters. We show that dropout can significantly improve the generalization ability of both MF and AE by acting as an adaptive regularization on the weight matrices. We conduct experiments on six real world sparse graphs, and show that MF+AE outperforms all the competing methods, especially on datasets with strong non-cohesive structures.

Acknowledgements

This work is supported in part by NSF CCF-1017828, the National Basic Research Program of China (2012CB316400), and Zhejiang Provincial Engineering Center on Media Data Cloud Processing and Analysis in China.

References

  • [1] L. A. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230, 2001.
  • [2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. JMLR, 9:1981–2014, 2008.
  • [3] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. In SDM, 2006.
  • [4] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In WSDM, pages 635–644, 2011.
  • [5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
  • [6] M. Chen, K. Q. Weinberger, F. Sha, and Y. Bengio. Marginalized denoising auto-encoders for nonlinear representations. In ICML, pages 1476–1484, 2014.
  • [7] Z. Chen and W. Zhang. A marginalized denoising method for link prediction in relational data. In SDM, pages 298–306, 2014.
  • [8] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [9] P. Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. In NIPS, 2007.
  • [10] D. D. Lee and H. Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999.
  • [11] X. Li, N. Du, H. Li, K. Li, J. Gao, and A. Zhang. A deep learning approach to link prediction in dynamic networks. In SDM, pages 289–297, 2014.
  • [12] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 58(7):1019–1031, 2007.
  • [13] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla. New perspectives and methods in link prediction. In KDD, pages 243–252, 2010.
  • [14] A. Krishna Menon and C. Elkan. Link prediction via matrix factorization. In ECML PKDD Proceedings, Part II, pages 437–452, 2011.
  • [15] K. T. Miller, T. L. Griffiths, and M. I. Jordan. Nonparametric latent feature models for link prediction. In NIPS, pages 1276–1284, 2009.
  • [16] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In KDD, pages 701–710, 2014.
  • [17] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, pages 833–840, 2011.
  • [18] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2007.
  • [19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
  • [20] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
  • [21] S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In NIPS, pages 351–359, 2013.
  • [22] J. Yang and J. Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In WSDM, pages 587–596, 2013.
  • [23] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
  • [24] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • [25] Q. Qian, J. Hu, R. Jin, J. Pei, and S. Zhu. Distance metric learning using dropout: A structured regularization approach. In KDD, pages 323–332, 2014.
  • [26] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880–887, 2008.
  • [27] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In KDD, pages 650–658, 2008.
  • [28] S. Wang, M. Wang, S. Wager, P. Liang, and C. D. Manning. Feature noising for log-linear structured prediction. In EMNLP, pages 1170–1179, 2013.
  • [29] J. Yang, J. McAuley, and J. Leskovec. Detecting cohesive and 2-mode communities in directed and undirected networks. In WSDM, pages 323–332, 2014.
  • [30] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791–798, 2007.