HMSG: Heterogeneous Graph Neural Network based on Metapath Subgraph Learning
Abstract
Many real-world data can be represented as heterogeneous graphs with different types of nodes and connections. Heterogeneous graph neural network models aim to embed nodes or subgraphs into low-dimensional vector spaces for various downstream tasks such as node classification and link prediction. Although several models have been proposed recently, they either only aggregate information from neighbors of the same type, or indiscriminately treat homogeneous and heterogeneous neighbors in the same way. Based on these observations, we propose a new heterogeneous graph neural network model named HMSG that comprehensively captures structural, semantic and attribute information from both homogeneous and heterogeneous neighbors. Specifically, we first decompose the heterogeneous graph into multiple metapath-based homogeneous and heterogeneous subgraphs, each of which carries specific semantic and structural information. Message aggregation methods are then applied to each subgraph independently, so that information can be learned in a more targeted and efficient manner. Through a type-specific attribute transformation, node attributes can also be transferred among different types of nodes. Finally, we fuse the information from all subgraphs to obtain the complete node representations. Extensive experiments on several datasets for node classification, node clustering and link prediction tasks show that HMSG outperforms state-of-the-art baselines in all evaluation metrics.
Keywords: Heterogeneous graph, Graph neural network, Graph embedding, Metapath
1 Introduction
Many real-world data can be organized as graph or network structures, which provide a more abstract representation of various objects and their interactions, such as social networks, traffic networks, protein molecular structures, and recommendation systems. Although deep learning models such as CNNs and RNNs are widely used on Euclidean data, they cannot be directly transferred to non-Euclidean data such as graphs due to their irregular structure [13]. Therefore, it is necessary to design deep learning models suitable for graph data.
In the past decade, a large number of graph representation learning methods have been proposed. Random-walk-based methods such as DeepWalk [21] and node2vec [11] use random walks to sample node sequences on the graph, which are then fed to the skip-gram [19] model to obtain a low-dimensional vector representation of each node. The models in [24] and [18] utilized recurrent neural networks to tackle graph data. Due to the powerful feature extraction capabilities of convolutional neural networks, many graph convolutional networks have achieved great success. ChebNet [5] and GCN [17] use the Fourier transform to map graph data to the spectral domain for convolution operations. Models such as GraphSAGE [12] and GAT [30] directly aggregate information from neighbors, and exhibit good generalization ability and stability.
Although graph embedding methods have achieved good performance, most of them deal with homogeneous graphs, in which all nodes and edges are of the same type. In reality, however, the types of nodes or edges in graphs are usually heterogeneous, such as authors and papers in a scholar graph, or users and items in recommendation systems. Such graphs are generally called heterogeneous information networks (HINs) or heterogeneous graphs, and they contain abundant structural and semantic information. Recently, researchers have devoted great effort to heterogeneous graph representation learning.

Motivation. Most of the research on heterogeneous graphs is based on metapaths [26]. A metapath is an ordered sequence of node types and edge types, which is used to capture specific semantic information of the graph. For example, a simple co-author graph is shown in Figure 1(a), which contains three types of nodes: author, paper, and venue. Metapaths such as paper-author-paper (PAP) and paper-venue-paper (PVP) represent relationships between two papers, while paper-author (PA) and paper-venue (PV) represent interaction relations between papers and authors or venues. Based on this, methods such as metapath2vec [6], HIN2vec [8] and HERec [25] embed the heterogeneous graph structure into low-dimensional vectors. HetGNN [33], HAN [32] and MAGNN [9] further incorporate node attributes and aggregate the embeddings of multiple metapaths or neighbor sets with attention mechanisms. However, these heterogeneous graph models mainly suffer from the following two limitations. First, many of them only aggregate information from homogeneous neighbors connected by metapaths, and discard the rich structural and attribute information of heterogeneous neighbors. Second, some studies aggregate information from both homogeneous and heterogeneous neighbors, but indiscriminately treat these neighbors in the same way. As a consequence, these methods may lose important information and yield unsatisfactory performance.
We illustrate this with the following example. As shown in the scholar graph in Figure 1(c), the metapath PAP induces a homogeneous subgraph over paper nodes. For a specific paper node, if we only aggregate information from its homogeneous neighbors, then the structural and attribute information of the heterogeneous neighbors (the author nodes) that contribute to those connections will be ignored. Therefore, considering only homogeneous subgraphs loses a lot of useful interaction information in the original graph, and these interaction relationships are especially important for tasks such as link prediction. Moreover, there are different interactive relationships between a node and different types of neighbors. Since these relationships often carry different semantics, they should be considered separately to avoid information loss. It is also worth mentioning that the attributes of different types of nodes are often different. For example, in recommendation systems, the attributes of user nodes are generally age, gender, hobbies, etc., while items have attributes like price, text description, and images. Raw attributes cannot be directly transferred between different types of nodes, and need to be transformed in advance.
To the best of our knowledge, no existing studies have simultaneously considered all the above aspects. Based on this observation, in this paper we propose a new heterogeneous graph representation learning model named HMSG (Heterogeneous graph neural network based on Metapath SubGraph learning), which comprehensively captures structural, semantic and attribute information from both homogeneous and heterogeneous neighbors. To this end, we first perform a type-specific attribute transformation to project the attributes of different types of nodes into the same latent space, so that attribute information can be transferred among them. Then, in order to learn more discriminatively, we generate multiple homogeneous and heterogeneous subgraphs (note that the concept of subgraph in this paper differs from that in traditional graph theory) from the original heterogeneous graph through various metapaths. This step can be regarded as task decomposition: after decomposition into subgraphs, the originally complex structural and semantic information can be learned in a more targeted and efficient manner. By learning from homogeneous and heterogeneous subgraphs independently, HMSG not only aggregates information from homogeneous neighbors, but also obtains attribute and structural information from heterogeneous neighbors. On each subgraph, different aggregation methods can be applied to obtain the node representations. Finally, we perform attention-based aggregation to combine information from different subgraphs according to their importance, resulting in the final node representation.
In summary, the main contributions of our work are as follows:
• We propose HMSG, a metapath-based heterogeneous graph neural network model which comprehensively captures structural, semantic and attribute information from both homogeneous and heterogeneous neighbors.
• We decompose the heterogeneous graph representation learning task into multiple metapath-based subgraph learning tasks, so that the originally complex structural and semantic information can be learned in a more targeted and efficient manner.
• We conduct experiments on node classification, node clustering and link prediction on multiple datasets, and our proposed model outperforms other state-of-the-art baselines in all metrics.
2 Related Work
2.1 Graph Neural Networks
The purpose of GNNs is to apply deep neural network models to graph representation learning, mapping graphs into low-dimensional vector spaces for downstream tasks such as link prediction [35, 4], node classification [23, 1], and recommendation [22, 36, 7]. The notion of graph neural networks was initially outlined in [10]; Scarselli et al. [24] then extended recursive neural networks to graph learning tasks, and Li et al. [18] treated neighborhood information as the time-step input of gated recurrent units. Recently, how to apply convolutional neural networks to graph data has become a hot topic. Related works can be divided into two categories: spectral-based methods and spatial-based methods. The main idea of spectral-based methods is to perform convolution operations in the Fourier domain. Defferrard et al. [5] used Chebyshev polynomials to approximate the graph filters, which reduced the complexity of graph convolution in the spectral domain. Kipf et al. [17] further restricted the filters to the first-order neighborhood of each node. Because spectral-based methods take the whole graph as input, they suffer from poor scalability and stability. The idea of spatial-based methods is to directly aggregate information from the neighbors of each node. Atwood et al. [2] regarded graph convolution as a diffusion process and assumed that information is transferred from a node to one of its neighboring nodes with a certain transition probability. Hamilton et al. [12] utilized aggregator functions to aggregate information from sampled neighbors. Benefiting from the attention mechanism, which has achieved good performance in various fields [29], Veličković et al. [30] used self-attention to aggregate node information based on the importance of neighbors. The graph neural network models mentioned above are mostly designed for homogeneous graphs and cannot be directly applied to heterogeneous graphs.
2.2 Heterogeneous Graph Embedding
Heterogeneous graph embedding aims to embed heterogeneous graphs into low-dimensional vector spaces. Dong et al. [6] designed metapath-guided random walks to generate sequences as the input of the skip-gram [19] model to obtain the embedding of each node. Fu et al. [8] performed multiple prediction training tasks to learn embeddings of both nodes and metapaths in heterogeneous information networks. Shi et al. [25] designed a metapath-based random walk to generate node sequences of the same type and applied the DeepWalk model to learn node representations. Chen et al. [3] decomposed the heterogeneous graph into several bipartite graphs, and then applied the LINE [28] model to learn the embedding of each bipartite graph. Zhang et al. [34] performed joint optimization of heterogeneous skip-gram and deep semantic encoding to capture semantic-aware representations of heterogeneous graphs. The above are traditional graph representation learning models, which only consider the graph structure and ignore node attributes. There are also many models based on deep learning. Wang et al. [32] used the attention mechanism to aggregate information on metapath-based homogeneous graphs, and then used semantic attention to aggregate the information of multiple metapaths. Zhang et al. [33] jointly considered heterogeneous content information and heterogeneous structural information. However, in their models, the information aggregation process was carried out only within the same type of nodes. Fu et al. [9] proposed intra-metapath aggregation to aggregate all node information in each metapath instance, but they indiscriminately treated different types of nodes in the same way.
There are some other heterogeneous graph neural network models that use different methodologies instead of metapaths. Hong et al. [14] designed a type-aware attention layer, which embeds each node of the whole heterogeneous graph by jointly modeling different types of adjacent nodes and the associated linkages. Hu et al. [16] proposed a subgraph sampling method and designed a graph transformer to directly aggregate features from heterogeneous neighbors. Hu et al. [15] employed adversarial learning to capture rich semantics on heterogeneous graphs, but node attributes were not considered in their model. Since these studies use different methodologies, they can be regarded as works parallel to ours.
3 Preliminaries and Problem Statement
In this section, we give some important definitions related to heterogeneous graph that will be used in the paper. Table 1 summarizes frequently used notations in this paper.
Notations | Definitions
---|---
$\mathbb{R}^n$ | $n$-dimensional Euclidean space
$a$, $\mathbf{a}$, $\mathbf{A}$ | Scalar, vector, matrix
$\mathcal{G}$ | A graph
$\mathcal{P}$ | The set of metapaths in a graph
$\mathcal{G}_{\mathcal{P}}$ | The set of metapath-based subgraphs of $\mathcal{G}$
$\mathcal{G}_{homo}$ | The set of homogeneous subgraphs
$\mathcal{G}_{heter}$ | The set of heterogeneous subgraphs
$G$ | A metapath-based subgraph
$\mathcal{N}_v^{G}$ | The set of neighbors of node $v$ in graph $G$
$\mathbf{x}_v$ | Raw (attribute) feature vector of node $v$
$\mathbf{h}_v$ | The final embedding of node $v$
$\mathbf{W}$, $\mathbf{b}$ | Weight matrix, bias
$e_{uv}^{G}$ | Importance of node $u$ to node $v$ in graph $G$
$w_{G}$ | Importance of graph $G$
$\alpha$, $\beta$ | Normalized attention weights
$\sigma(\cdot)$ | Activation function
$|\cdot|$ | The cardinality of a set
$\Vert$ | Vector concatenation
Definition 1
Heterogeneous Graph [27]. A heterogeneous graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is a graph associated with a node type mapping function $\phi:\mathcal{V}\rightarrow\mathcal{A}$ and an edge type mapping function $\psi:\mathcal{E}\rightarrow\mathcal{R}$. $\mathcal{A}$ and $\mathcal{R}$ denote the predefined sets of node types and edge types, where $|\mathcal{A}|+|\mathcal{R}|>2$.
Example. A heterogeneous graph composed of multiple types of nodes (Author (A), Paper (P), Venue (V)) and relations (authoring relation between authors and papers, publication relation between papers and venues) is shown in Figure 1(a).
Definition 2
Metapath [26]. A metapath $P$ is defined as a path in the form of $A_1\xrightarrow{R_1}A_2\xrightarrow{R_2}\cdots\xrightarrow{R_l}A_{l+1}$ (abbreviated as $A_1A_2\cdots A_{l+1}$), which describes a composite relation $R=R_1\circ R_2\circ\cdots\circ R_l$ between node types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations. If the metapath $P$ is the same as its reverse metapath $P^{-1}$, then the metapath is symmetric.
Example. Figure 1(b) depicts four metapaths: paper-author-paper (PAP), paper-venue-paper (PVP), paper-author (PA) and paper-venue (PV). Different metapaths represent different semantics. PAP means two papers are authored by the same author; PVP means two papers are published in the same venue; PA denotes the authoring relations between papers and authors; PV denotes the publication relations between papers and venues.

Definition 3
Metapath-based Neighbor. Given a node $v$ and a metapath $P$ of a heterogeneous graph, the metapath-based neighbors $\mathcal{N}_v^{P}$ are defined as the set of nodes that connect to node $v$ via metapath $P$. Note that $\mathcal{N}_v^{P}$ includes $v$ itself if $P$ is symmetric.
Example. Considering the metapath PAP in Figure 1, the metapath-based neighbors of a paper node include the node itself (since PAP is symmetric) and all papers that share an author with it; the authors of a paper are its metapath-based neighbors connected by metapath PA. In this paper, we refer to the starting node of a metapath as the target node.
Definition 4
Metapath-based Subgraph. Given a metapath $P$ of a heterogeneous graph $\mathcal{G}$, the metapath-based subgraph $G$ is the graph constructed by all neighbor pairs based on metapath $P$ in $\mathcal{G}$. Note that $G$ is a homogeneous subgraph if $P$ starts and ends with the same node type; otherwise it is a heterogeneous subgraph.
Example. We give two metapath-based subgraphs in Figure 1(c): A homogeneous subgraph generated by metapath PAP and a heterogeneous bipartite subgraph generated by metapath PA.
Heterogeneous Graph Representation Learning Problem: Given a heterogeneous graph $\mathcal{G}$, the heterogeneous graph representation learning problem aims to learn $d$-dimensional node representations that capture rich structural, semantic and attribute information.
4 METHODOLOGY
In this section, we formally present HMSG for heterogeneous graph embedding. HMSG consists of four parts: (1) node attribute transformation; (2) metapath-based subgraph generation; (3) node aggregation; and (4) subgraph aggregation. Figure 2 shows an overview of the HMSG framework, and we give detailed illustrations in the following subsections.
4.1 Node Attribute Transformation
This part mainly aims to project the attributes of different types of nodes into the same latent space, so that attribute information can be transferred among them. In a heterogeneous graph, different types of nodes are associated with different attributes; for example, a paper node with keywords and abstracts, and an author node with research fields. These attributes are usually represented in different feature spaces. Therefore, for each node type, we design a type-specific linear transformation matrix to project its attributes into the same latent space. For a node $v$ of type $t\in\mathcal{A}$, we have
$\mathbf{h}_v'=\mathbf{W}_t\cdot\mathbf{x}_v$  (1)
where $\mathbf{x}_v$ is the original feature vector and $\mathbf{h}_v'$ is the projected feature vector of node $v$; $\mathbf{W}_t$ is the linear transformation matrix of node type $t$.
Through this attribute transformation, the heterogeneity between different types of nodes is addressed, which facilitates information transfer and aggregation between nodes in the graph.
4.2 Metapath-based Subgraph Generation
Different metapaths represent different semantics. According to the starting and ending node types of metapaths, we can divide metapaths into two categories for convenience:
$P\in\begin{cases}\mathcal{P}_{homo}, & \text{if } A_1=A_{l+1}\\ \mathcal{P}_{heter}, & \text{otherwise}\end{cases}$  (2)
where $\mathcal{P}_{homo}$ denotes the metapaths whose starting and ending nodes are of the same type; otherwise the metapath belongs to $\mathcal{P}_{heter}$.
In order to fully learn the information of each metapath, we generate the corresponding subgraphs according to Definition 4 and then apply aggregation methods to each subgraph. Following the types of metapaths, the generated subgraphs can be divided into homogeneous subgraphs and heterogeneous subgraphs (as shown in Figure 1(c)):
$\mathcal{G}_{\mathcal{P}}=\mathcal{G}_{homo}\cup\mathcal{G}_{heter}$  (3)
For a specific type of nodes, their connections to the neighbors in different subgraphs carry different semantic information, so each subgraph can be regarded as an interaction graph with specific semantic information. Since the subgraphs are independent, learning tasks can be carried out on each subgraph in parallel, which results in more efficient learning. Moreover, by learning from homogeneous and heterogeneous subgraphs independently, useful information can be retained as much as possible.
4.3 Node Aggregation
In this step, the information in each subgraph is transmitted between nodes. For homogeneous subgraph learning, there are many excellent models, such as GCN [17] and GAT [30], which can be used accordingly.
Each heterogeneous subgraph is in the form of a bipartite graph, because there are only two types of nodes in the subgraph and connections only exist between node pairs with different types. In bipartite graph aggregation, we are mainly concerned with the information of first-order heterogeneous neighbors, because the information of second-order neighbors, i.e., homogeneous neighbors, can be obtained from the homogeneous graph aggregation. Here we give three candidate aggregators inspired by GraphSAGE [12].
Mean. The mean operation averages the element-wise features of the heterogeneous neighbors as the features of the target node:
$\mathbf{h}_v^{G}=\mathrm{mean}\left(\{\mathbf{h}_u',\ \forall u\in\mathcal{N}_v^{G}\}\right)$  (4)
where $\mathcal{N}_v^{G}$ is the set of neighbors of node $v$ in subgraph $G$.
Pooling. The element-wise pooling operation aggregates information across heterogeneous neighbors in the following way:
$\mathbf{h}_v^{G}=\max\left(\{\sigma(\mathbf{W}_{pool}\mathbf{h}_u'+\mathbf{b}),\ \forall u\in\mathcal{N}_v^{G}\}\right)$  (5)
where $\sigma$ is the non-linear activation function (ReLU in this paper), $\max$ denotes the element-wise max operator (which can be replaced by mean as well), and $\mathbf{W}_{pool}$ and $\mathbf{b}$ are learnable parameters.
Attention. The self-attention mechanism has been proven to be an effective information aggregator [29, 30]. Since different neighbors may have different influence on the target node, we use attention to learn the importance of each neighbor.
Next, we use the self-attention mechanism to aggregate information from both homogeneous and heterogeneous subgraphs. Specifically, for a target node $v$, given a node pair $(u,v)$ in subgraph $G$, which is generated by a metapath that starts from the type of node $v$, we adopt a graph attention layer to learn the importance $e_{uv}^{G}$, which measures how much node $u$ contributes to the target node $v$. In a homogeneous subgraph, the importance of node $u$ can be formulated as follows:
$e_{uv}^{G}=\mathrm{LeakyReLU}\left(\mathbf{a}_{G}^{\top}\cdot[\mathbf{h}_u'\,\Vert\,\mathbf{h}_v']\right)$  (6)
while in a heterogeneous subgraph the importance of node $u$ is:
$e_{uv}^{G}=\mathrm{LeakyReLU}\left(\mathbf{b}_{G}^{\top}\cdot[\mathbf{h}_u'\,\Vert\,\mathbf{h}_v']\right)$  (7)
where $\mathbf{a}_{G}$ (in homogeneous subgraphs) and $\mathbf{b}_{G}$ (in heterogeneous subgraphs) are the parameterized attention vectors for subgraph $G$, and $\Vert$ denotes the concatenation operation.
Then we calculate $e_{uv}^{G}$ for all nodes $u\in\mathcal{N}_v^{G}$, where $\mathcal{N}_v^{G}$ denotes the neighbors directly connected to node $v$ in $G$. We apply the softmax function to obtain the normalized weight coefficient $\alpha_{uv}^{G}$:
$\alpha_{uv}^{G}=\mathrm{softmax}_u\left(e_{uv}^{G}\right)=\dfrac{\exp\left(e_{uv}^{G}\right)}{\sum_{k\in\mathcal{N}_v^{G}}\exp\left(e_{kv}^{G}\right)}$  (8)
[Algorithm 1: The overall process of HMSG. Input: node types $\mathcal{A}$, metapaths $\mathcal{P}$, subgraph types, node features $\{\mathbf{x}_v\}$, and the number of attention heads $K$. For each metapath-based subgraph, node aggregation is performed; the embeddings from the different subgraphs are then fused.]
Finally, the embedding of node $v$ in subgraph $G$ can be aggregated from the neighbors' projected features with the corresponding coefficients as follows:
$\mathbf{h}_v^{G}=\sigma\left(\sum_{u\in\mathcal{N}_v^{G}}\alpha_{uv}^{G}\cdot\mathbf{h}_u'\right)$  (9)
where $\mathbf{h}_v^{G}$ is the output of node $v$ for subgraph $G$, and $\sigma$ is the activation function.
In order to reduce the variance introduced by the heterogeneity of graphs and make the learning process more stable, we extend the self-attention to multiple heads. Specifically, we repeat $K$ independent attention mechanisms and concatenate the learned embeddings, resulting in the following formulation:
$\mathbf{h}_v^{G}=\Big\Vert_{k=1}^{K}\,\sigma\left(\sum_{u\in\mathcal{N}_v^{G}}\alpha_{uv}^{G,k}\cdot\mathbf{h}_u'\right)$  (10)
In summary, given the set of metapath-based subgraphs $\mathcal{G}_{\mathcal{P}}$ for node type $t$ and the projected node features, $|\mathcal{G}_{\mathcal{P}}|$ groups of embeddings of the target node are generated, as shown in Figure 2(b).
4.4 Subgraph Aggregation
After the node aggregation step, we obtain the embedding of each node in the different subgraphs. In order to capture more semantic information, an attention mechanism is applied to assign different weights to the subgraphs according to their importance.
First, the embeddings obtained from node aggregation are passed through a nonlinear transformation. Then we summarize each subgraph by averaging the transformed node embeddings over all nodes $v\in\mathcal{V}_t$ of type $t$:
$w_{G}=\dfrac{1}{|\mathcal{V}_t|}\sum_{v\in\mathcal{V}_t}\mathbf{q}_t^{\top}\cdot\tanh\left(\mathbf{W}\cdot\mathbf{h}_v^{G}+\mathbf{b}\right)$  (11)
where $\mathbf{q}_t$ is the parameterized attention vector for node type $t$, and $\mathbf{W}$ and $\mathbf{b}$ are learnable parameters.
To make the coefficients easily comparable across different subgraphs, we normalize them using the softmax function and then compute the weighted sum over all subgraphs:
$\beta_{G}=\dfrac{\exp\left(w_{G}\right)}{\sum_{G'\in\mathcal{G}_{\mathcal{P}}}\exp\left(w_{G'}\right)}$  (12)
$\mathbf{h}_v=\sum_{G\in\mathcal{G}_{\mathcal{P}}}\beta_{G}\cdot\mathbf{h}_v^{G}$  (13)
In this step, we obtain the final embedding $\mathbf{h}_v$ for every node $v\in\mathcal{V}_t$; Figure 2(c) gives an illustration.
4.5 Training
Through the above steps, we obtain the representation of each node, which can be used for different downstream tasks. Different loss functions can be defined depending on the specific task. We train HMSG in two major learning paradigms: semi-supervised learning and unsupervised learning.
For semi-supervised learning, only a small number of nodes in the graph carry label information. We can minimize the cross entropy over the labeled nodes and apply backpropagation and gradient descent to optimize the model parameters:
$L=-\sum_{v\in\mathcal{V}_L}\mathbf{y}_v^{\top}\cdot\log\hat{\mathbf{y}}_v$  (14)
where $\mathcal{V}_L$ is the set of labeled nodes, and $\mathbf{y}_v$ and $\hat{\mathbf{y}}_v$ are the ground-truth label vector and the predicted probability vector of node $v$, respectively.
For unsupervised learning, label information is unavailable; we can optimize the model weights by minimizing the following reconstruction loss through negative sampling [20]:
$L=-\sum_{(u,v)\in\Omega}\log\sigma\left(\mathbf{h}_u^{\top}\mathbf{h}_v\right)-\sum_{(u',v')\in\Omega^{-}}\log\sigma\left(-\mathbf{h}_{u'}^{\top}\mathbf{h}_{v'}\right)$  (15)
where $\sigma$ is the sigmoid function, $\Omega$ is the set of connected (positive) node pairs, and $\Omega^{-}$ is the set of negative node pairs randomly sampled from all unconnected node pairs. The overall process of HMSG is shown in Algorithm 1.
5 EXPERIMENTS
5.1 Datasets
We use four commonly used public datasets to evaluate the performance of our proposed model and compare with state-of-the-art baselines. The detailed data description is summarized in Table 2.
Datasets | Nodes | Edges | Metapaths
---|---|---|---
ACM | # P: 4025, # A: 7167, # S: 60 | | PAP, PSP, PA, PS
IMDB | # M: 4181, # A: 5257, # D: 2081 | | MAM, MDM, MA, MD
Amazon Review-I | # U: 6259, # I: 20145 | # U-I: 180,047 | UI, IU
Amazon Review-II | # U: 12453, # I: 32395 | # U-I: 209,416 | UI, IU
ACM (http://dl.acm.org/). This is an academic network. Similar to HAN [32], we extract a subset of ACM that contains 4025 papers (P), 7167 authors (A) and 60 subjects (S) according to the conferences (KDD, SIGMOD, VLDB, SIGCOMM, MobiCOMM) in which the papers were published, and divide the papers into three classes (Data Mining, Database, Wireless Communication). Each paper is described by a bag-of-words representation of terms. The metapaths we select are {PAP, PSP} and {PA, PS}. For semi-supervised learning tasks, the paper nodes are divided into training, validation, and testing sets with a ratio of 1:1:8.
IMDB (https://www.imdb.com/). This is an online database about movies and television programs. We extract a subset of IMDB that contains 4181 movies (M), 5257 actors (A), and 2081 directors (D). Each movie is labeled as one of three classes (Action, Comedy, Drama) based on its genre, and is described by a bag-of-words representation of its plot keywords. The metapaths we select are {MAM, MDM} and {MA, MD}. For semi-supervised learning tasks, the movie nodes are divided into training, validation, and testing sets with a ratio of 1:1:8.
Amazon Review (http://jmcauley.ucsd.edu/data/amazon/index.html). This dataset contains product reviews from Amazon and is used for the link prediction task; it includes no labels or features. We extract two datasets: Review-I (Video Games category) with 6259 users and 20145 items, and Review-II (Movies category) with 12453 users and 32395 items. The metapaths we select are {UI, IU}. For unsupervised learning tasks, the user-item pairs are divided into training, validation, and testing sets with two ratio groups, i.e., 5:1:4 and 7:1:2, for both Review-I and Review-II. The node features for training are pretrained with the DeepWalk algorithm [21].
Datasets | Metrics | Training | DeepWalk | metapath2vec | HERec | GCN | GAT | HAN | MAGNN | HMSG |
---|---|---|---|---|---|---|---|---|---|---|
ACM | Macro-F1 | 20% | 79.80 | 87.26 | 77.16 | 89.92 | 88.55 | 89.16 | 90.49 | 91.36 |
40% | 80.71 | 87.81 | 80.04 | 89.98 | 89.61 | 90.13 | 90.65 | 91.57 | ||
60% | 80.83 | 87.69 | 80.90 | 89.91 | 89.92 | 90.45 | 90.60 | 91.72 | ||
80% | 80.69 | 87.83 | 81.28 | 89.87 | 90.11 | 90.55 | 90.58 | 91.81 | ||
Micro-F1 | 20% | 80.41 | 87.87 | 78.67 | 89.96 | 88.71 | 89.57 | 90.47 | 91.32 | |
40% | 81.58 | 88.41 | 81.46 | 90.02 | 89.72 | 90.20 | 90.66 | 91.54 | ||
60% | 81.72 | 88.34 | 82.22 | 89.96 | 90.00 | 90.51 | 90.62 | 91.69 | ||
80% | 81.58 | 88.47 | 82.48 | 89.87 | 90.15 | 90.54 | 90.57 | 91.73 | ||
IMDB | Macro-F1 | 20% | 50.03 | 40.22 | 46.14 | 53.83 | 55.69 | 58.03 | 60.83 | 60.89 |
40% | 52.26 | 41.98 | 47.86 | 54.14 | 56.55 | 58.63 | 61.21 | 61.56 | ||
60% | 52.71 | 43.06 | 49.15 | 54.19 | 56.98 | 59.05 | 61.16 | 61.85 | ||
80% | 52.54 | 43.50 | 50.13 | 53.99 | 57.38 | 59.00 | 61.12 | 62.04 | ||
Micro-F1 | 20% | 51.45 | 41.12 | 47.25 | 54.20 | 55.82 | 58.25 | 61.15 | 61.16 | |
40% | 53.77 | 43.16 | 49.39 | 54.44 | 56.68 | 58.82 | 61.47 | 61.81 | ||
60% | 54.30 | 44.29 | 50.76 | 54.49 | 57.12 | 59.24 | 61.40 | 62.12 | ||
80% | 54.23 | 44.81 | 51.82 | 54.31 | 57.57 | 59.22 | 61.57 | 62.33 |
Datasets | Metrics | DeepWalk | metapath2vec | HERec | GCN | GAT | HAN | MAGNN | HMSG |
---|---|---|---|---|---|---|---|---|---|
ACM | NMI | 53.75 | 35.71 | 28.46 | 60.77 | 65.65 | 67.18 | 67.24 | 70.16 |
ARI | 50.35 | 29.00 | 33.83 | 66.36 | 70.79 | 72.04 | 70.55 | 74.51 | |
IMDB | NMI | 5.66 | 0.31 | 0.54 | 8.34 | 11.03 | 12.13 | 13.64 | 14.52 |
ARI | 6.65 | 0.10 | 0.13 | 7.52 | 11.01 | 12.08 | 13.58 | 14.77 |
5.2 Baselines
We compare HMSG with other graph embedding methods to verify the performance of our proposed method. The detailed descriptions of the baselines are as follows.
• DeepWalk [21]: A random-walk-based homogeneous graph representation learning method. We ignore the heterogeneity of the graph and input the entire graph into the DeepWalk model.
• metapath2vec [6]: A heterogeneous graph representation learning method based on metapath-guided random walks and the skip-gram model. We test all the metapaths for metapath2vec and report the best performance.
• HERec [25]: A heterogeneous graph embedding method which designs a metapath-based random walk to generate node sequences of the same type and feeds them into the DeepWalk model. We test all the metapaths for HERec and report the best performance.
• GCN [17]: A homogeneous graph convolutional network which aggregates information from immediate neighbors. We test GCN on metapath-based homogeneous graphs and report the results from the best metapath on semi-supervised learning tasks. On unsupervised learning tasks, we input the entire graph, ignoring its heterogeneity.
• GAT [30]: A homogeneous graph neural network which calculates the importance of different neighbors with an attention mechanism. We test GAT on metapath-based homogeneous graphs and report the results from the best metapath on semi-supervised learning tasks, and input the entire graph, ignoring its heterogeneity, on unsupervised learning tasks.
• HAN [32]: A heterogeneous graph neural network which designs node-level attention for aggregating metapath-based homogeneous graphs and utilizes semantic-level attention to aggregate the information of multiple metapaths.
• MAGNN [9]: A heterogeneous graph neural network which utilizes intra-metapath aggregation and inter-metapath aggregation to aggregate node information.
For random-walk-based methods, including DeepWalk, metapath2vec and HERec, we set the window size to 5, the walk length to 100, walks per node to 40, and the number of negative samples to 5. For GCN, GAT, HAN and HMSG, we use the same splits of training, validation, and testing sets. We train them for 1000 epochs and apply early stopping with a patience of 30. We employ the Adam optimizer with the learning rate set to 0.005 and the weight decay (L2 penalty) set to 0.001, and set the dropout rate to 0.6. In the HMSG model, we use the self-attention mechanism to aggregate homogeneous and heterogeneous subgraphs; other variants of HMSG are analyzed in the ablation study section. For GAT, HAN and HMSG, we set the number of attention heads to 8. For HAN and HMSG, we set the dimension of the attention vector in subgraph aggregation to 128. For a fair comparison, we set the embedding dimension of all algorithms to 64. Our model is implemented with the Deep Graph Library (DGL) [31] package of the PyTorch framework.
Datasets | Metrics | DeepWalk | metapath2vec | HERec | GCN | GAT | HAN | MAGNN | HMSG |
---|---|---|---|---|---|---|---|---|---|
Review-I (7:1:2) | AUC | 63.66 | 68.96 | 62.55 | 79.78 | 76.79 | 84.12 | 84.36 | 84.56 |
AP | 59.57 | 65.74 | 58.78 | 80.40 | 75.16 | 82.76 | 83.43 | 83.66 | |
Review-I (5:1:4) | AUC | 60.11 | 68.52 | 59.78 | 76.12 | 72.10 | 79.11 | 79.17 | 79.72 |
AP | 55.56 | 65.31 | 53.84 | 77.05 | 70.57 | 78.37 | 77.05 | 79.08 | |
Review-II (7:1:2) | AUC | 60.97 | 58.64 | 58.69 | 73.77 | 78.93 | 81.05 | 78.45 | 81.70 |
AP | 57.37 | 56.83 | 55.37 | 73.67 | 78.65 | 80.54 | 78.12 | 81.63 | |
Review-II (5:1:4) | AUC | 56.67 | 57.32 | 56.04 | 70.76 | 72.18 | 74.24 | 73.31 | 76.33 |
AP | 54.58 | 55.47 | 54.12 | 70.83 | 72.67 | 74.09 | 72.60 | 76.58 |
5.3 Node Classification
We conduct experiments on the ACM and IMDB datasets to compare the node classification performance of the different models. We use the embeddings of the labeled nodes (paper nodes in ACM and movie nodes in IMDB) as the input of an SVM classifier, and vary the train/test split across different proportions. Similar to [9], only the nodes in the testing set of HMSG are used for the downstream training and evaluation of the SVM. This strategy is also used for the node clustering and link prediction tasks. In order to eliminate the variance caused by unbalanced data division, we repeat each experiment 10 times and report the average Macro-F1 and Micro-F1 as evaluation metrics. The node classification results are shown in Table 3.
As can be seen from Table 3, compared to the state-of-the-art models HAN and MAGNN, HMSG achieves the best performance on all experimental groups, showing the significant advantage of fusing information from both homogeneous and heterogeneous subgraphs. In addition, metapath2vec, which considers the heterogeneous structure, outperforms the other random-walk-based methods, but is still not as good as GCN and GAT, which are based on graph convolution.
5.4 Node Clustering
We conduct clustering experiments on the ACM and IMDB datasets to evaluate the quality of the embeddings learned by HMSG. Similar to node classification, we take the embeddings of the labeled nodes in the testing set as the input of a K-Means model, and use the NMI and ARI metrics to evaluate the performance. Likewise, the reported results are averages of 10 runs.
The results are shown in Table 4, from which we see that HMSG significantly outperforms all baselines, showing its strong capability of learning effective node representations. Besides, there is a significant performance gap between graph convolution-based methods (GCN, GAT, HAN, MAGNN and HMSG) and random walk-based methods, which again shows the power of graph convolutional model.
5.5 Link Prediction
Link prediction is used to test our algorithm's effectiveness on unsupervised learning tasks. We use the two Amazon Review datasets, Review-I and Review-II, to evaluate the performance on the link prediction task. We regard connected user-item pairs as positive node pairs and all unconnected links as negative node pairs. In the training phase, we randomly select the same number of positive and negative node pairs for training, validation and testing, and use Equation (15) to optimize the model.
Then, we use the dot product of the user and item embeddings obtained from the model to calculate the link probability between the two nodes: $p_{uv}=\sigma(\mathbf{h}_u^{\top}\mathbf{h}_v)$, where $\sigma$ is the sigmoid function. We conduct 10 repeated experiments and report the average AUC and AP metrics on the testing set. The results of the link prediction task are presented in Table 5.
As shown in Table 5, HMSG again performs the best on all dataset groups under all metrics, and graph convolution-based methods perform much better than random-walk-based methods. We can also see that heterogeneous graph models such as HAN and MAGNN perform better than the homogeneous networks GCN and GAT most of the time.
In sum, the above results validate the effectiveness of our HMSG model. In the following subsection, we conduct an ablation study to compare different variants of HMSG.
Variants | ACM | | | | IMDB | | | | Review-I | | Review-II |
---|---|---|---|---|---|---|---|---|---|---|---|---
 | Macro-F1 | Micro-F1 | NMI | ARI | Macro-F1 | Micro-F1 | NMI | ARI | AUC | AP | AUC | AP
HMSG$_{homo}$ | 91.30 | 91.25 | 69.85 | 74.22 | 61.31 | 61.57 | 14.79 | 15.42 | 78.32 | 76.07 | 76.28 | 75.86
HMSG$_{mean}$ | 91.42 | 91.38 | 70.20 | 74.75 | 61.40 | 61.72 | 14.36 | 14.41 | 78.94 | 77.49 | 76.24 | 76.48
HMSG$_{pool}$ | 91.48 | 91.42 | 70.78 | 75.13 | 61.43 | 61.73 | 14.29 | 14.22 | 79.06 | 77.41 | 74.24 | 73.70
HMSG$_{att}$ | 91.62 | 91.57 | 70.16 | 74.51 | 61.59 | 61.86 | 14.52 | 14.77 | 82.14 | 81.37 | 78.96 | 79.05
5.6 Ablation Study
To validate the effectiveness of the different components in our model, we conduct experiments on different variants of HMSG. Here HMSG$_{homo}$ means only homogeneous subgraphs are used in the model; HMSG$_{mean}$ uses an average strategy in heterogeneous subgraph aggregation; HMSG$_{pool}$ uses a max pooling strategy in heterogeneous subgraph aggregation; and HMSG$_{att}$ uses the self-attention mechanism in heterogeneous subgraph aggregation. We present the average results over the different training ratios in Table 6.
As can be seen, except for the clustering task on IMDB, where HMSG$_{homo}$ (which uses homogeneous subgraphs only) performs the best, on all the other experimental groups the HMSG models that consider both homogeneous and heterogeneous subgraphs give the best results, among which HMSG$_{att}$ exhibits the best overall performance. This again validates the effectiveness and necessity of our model design.
6 CONCLUSION
In this paper we proposed a new heterogeneous graph neural network model, HMSG, which learns graph representations from multiple homogeneous and heterogeneous subgraphs generated through various metapaths. Compared to existing models, HMSG is able to comprehensively capture structural, semantic and attribute information from both homogeneous and heterogeneous neighbors. Extensive experiments on multiple tasks and datasets showed that HMSG significantly outperforms state-of-the-art baselines.
For future work, we will consider extending HMSG to support dynamic heterogeneous graphs. We will also explore other subgraph generation and aggregation methods in the HMSG model.
References
- [1] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee. N-gcn: Multi-scale graph convolution for semi-supervised node classification. In uncertainty in artificial intelligence, pages 841–851. PMLR, 2020.
- [2] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in neural information processing systems, pages 1993–2001, 2016.
- [3] H. Chen, H. Yin, W. Wang, H. Wang, Q. V. H. Nguyen, and X. Li. Pme: projected metric embedding on heterogeneous networks for link prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1177–1186, 2018.
- [4] J. Chen, X. Xu, Y. Wu, and H. Zheng. Gc-lstm: Graph convolution embedded lstm for dynamic link prediction. arXiv preprint arXiv:1812.04206, 2018.
- [5] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29:3844–3852, 2016.
- [6] Y. Dong, N. V. Chawla, and A. Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 135–144, 2017.
- [7] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin. Graph neural networks for social recommendation. In The World Wide Web Conference, pages 417–426, 2019.
- [8] T.-y. Fu, W.-C. Lee, and Z. Lei. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1797–1806, 2017.
- [9] X. Fu, J. Zhang, Z. Meng, and I. King. Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020, pages 2331–2341, 2020.
- [10] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734. IEEE, 2005.
- [11] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
- [12] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
- [13] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
- [14] H. Hong, H. Guo, Y. Lin, X. Yang, Z. Li, and J. Ye. An attention-based graph neural network for heterogeneous structural learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4132–4139, 2020.
- [15] B. Hu, Y. Fang, and C. Shi. Adversarial learning on heterogeneous information networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 120–129, 2019.
- [16] Z. Hu, Y. Dong, K. Wang, and Y. Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
- [17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- [18] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- [19] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26:3111–3119, 2013.
- [21] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
- [22] R. Qiu, Z. Huang, J. Li, and H. Yin. Exploiting cross-session information for session-based recommendation with graph neural networks. ACM Transactions on Information Systems (TOIS), 38(3):1–23, 2020.
- [23] Y. Rong, W. Huang, T. Xu, and J. Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2019.
- [24] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
- [25] C. Shi, B. Hu, W. X. Zhao, and S. Y. Philip. Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31(2):357–370, 2018.
- [26] Y. Sun and J. Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.
- [27] Y. Sun and J. Han. Mining heterogeneous information networks: a structural analysis approach. Acm Sigkdd Explorations Newsletter, 14(2):20–28, 2013.
- [28] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pages 1067–1077, 2015.
- [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [30] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- [31] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
- [32] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022–2032, 2019.
- [33] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 793–803, 2019.
- [34] C. Zhang, A. Swami, and N. V. Chawla. Shne: Representation learning for semantic-associated heterogeneous networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 690–698, 2019.
- [35] M. Zhang and Y. Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165–5175, 2018.
- [36] H. Zhao, Q. Yao, J. Li, Y. Song, and D. L. Lee. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 635–644, 2017.