TeKo: Text-Rich Graph Neural Networks with External Knowledge
Abstract
Graph Neural Networks (GNNs) have gained great popularity in tackling various analytical tasks on graph-structured data (i.e., networks). Typical GNNs and their variants follow a message-passing paradigm that obtains network representations by propagating features along the network topology, which however ignores the rich textual semantics (e.g., local word sequences) that exist in many real-world networks. Existing methods for text-rich networks integrate textual semantics mainly by utilizing internal information such as topics or phrases/words, and often suffer from an inability to comprehensively mine the text semantics, limiting the reciprocal guidance between network structure and text semantics. To address these problems, we propose a novel text-rich graph neural network with external knowledge (TeKo), in order to take full advantage of both structural and textual information within text-rich networks. Specifically, we first present a flexible heterogeneous semantic network that incorporates high-quality entities and interactions among documents and entities. We then introduce two types of external knowledge, that is, structured triplets and unstructured entity descriptions, to gain a deeper insight into textual semantics. We further design a reciprocal convolutional mechanism for the constructed heterogeneous semantic network, enabling network structure and textual semantics to collaboratively enhance each other and learn high-level network representations. Extensive experimental results on four public text-rich networks as well as a large-scale e-commerce search dataset illustrate the superior performance of TeKo over state-of-the-art baselines.
Index Terms:
Graph Neural Networks, Text-Rich Networks, External Knowledge, Network Representation.

I Introduction
Networks are ubiquitous structures for abstracting and modeling relational data, such as social networks, bibliographic networks and biomedical networks. With their prevalence, it is particularly important to learn effective representations of networks and apply them to downstream tasks. Recently, graph neural networks (GNNs) [1, 2, 3], which exhibit significant power in naturally capturing both network structure and feature information, have gained great success and been adopted in a wide range of application tasks, including community detection [4] and recommender systems [5].
The classical GNNs and their variants [6, 7] follow a message-passing paradigm, where the most essential part is the feature propagation along network topology. However, networks in the real world usually contain nodes associated with rich textual information, and are called text-rich networks. Typical examples include academic networks (e.g., DBLP), where document nodes are accompanied by their abstracts, and e-commerce networks (e.g., Amazon), in which product nodes are attached with their descriptions. In text-rich situations, existing GNNs may suffer from poor performance, since the propagation mechanism within node neighborhoods typically treats textual information only as attribute words, as shown in Fig. 1(a), inevitably losing important semantic structure information (e.g., local word sequences or global topics) contained in the text.
A few very recent studies [8, 9] have been dedicated to generalizing GNNs to text-rich networks. They incorporate semantic structures within the textual information, including local word sequences and/or global topics, into the original network structure for modeling text-rich networks. While doing so can leverage additional semantic information in text to a certain extent, these methods fail to fully comprehend the semantic content of the network data space, or to reason over complex concepts and relational paths. This may introduce irrelevant information that negatively influences the performance of text-rich GNNs. Our intuition is that leveraging the knowledge provided by outside sources (e.g., Wikipedia and ConceptNet [10]) can give a deeper insight into the textual semantics (e.g., "qos" means "quality of service") and thus further facilitate the prediction of downstream tasks.
So an interesting yet important question is how to effectively leverage external knowledge in order to design text-rich graph neural networks. There are two key challenges to consider. First, how to identify appropriate and useful related knowledge so as to well comprehend the textual semantics underlying text-rich networks? External knowledge usually consists of multiple types of data. For example, Wikipedia contains two types of data, i.e., structured triplets and unstructured entity descriptions. Different data types represent different semantics, each of which reflects one aspect of textual information [11, 12]. Therefore, it is important to comprehend the textual semantics via the reciprocal fusion of different data types of external knowledge. Second, how to fully understand and leverage the acquired knowledge to facilitate the effective guidance of the knowledge space over the text-rich network data space? As the structure and information contained in the knowledge space and the text-rich network data space are different, naively gluing information from these two spaces together may result in an over-complicated model [13]. As a result, it is of great importance to design a more advanced model that can flexibly consider not only the information aggregation within each space, but also the information interaction among different spaces. More importantly, the model itself should also correctly and adaptively learn the balance between network structure and textual information within text-rich networks for the given learning objectives.
In this paper, we focus on the above problems of semi-supervised learning on text-rich networks and propose a new Text-rich graph neural network with external Knowledge, namely TeKo. To this end, we first augment the text-rich network with a newly constructed heterogeneous document-entity network, as shown in Fig. 1(b), in order to incorporate entities and capture rich semantic structures among documents and entities. We then introduce a knowledge-based entity representation module to adaptively extract useful external knowledge from structured triplets and unstructured entity descriptions. In this way, the external knowledge is capable of helping comprehend the semantic content of network data space. Finally, we design a discriminative propagation mechanism, based on reciprocal graph attention, in order to realize the interaction between network structure and textual semantics. During the intermediate training steps, these two parts are guided by each other and optimized collaboratively. Meanwhile, by making full use of textual semantics, TeKo can also alleviate the topological limitations of GNNs [14] such as heterophily.
We summarize our main contributions as follows:
- To the best of our knowledge, we are the first to gain a deep insight into the textual semantics underlying text-rich networks through knowledge enhancement, empowering the effective fusion of both structural and textual information.
- We present a novel text-rich graph neural network, TeKo, which innovatively employs the guidance of the external knowledge space over the network data space to learn text-rich network representations.
- Extensive experiments demonstrate the superiority of the new approach TeKo over the state-of-the-art methods with significant improvements.
II Preliminaries
We first introduce the terms and notations, and then give the problem definitions. We finally discuss GNNs which serve as the base of our proposed TeKo.
II-A Terms and Notations
Given an undirected and unweighted text-rich network $\mathcal{G}=(\mathcal{D},\mathcal{V},\mathcal{E})$, where $\mathcal{D}$ is a set of document raw text, $\mathcal{V}$ is a set of nodes, and $\mathcal{E}$ is a set of edges. The topological structure of $\mathcal{G}$ is represented by an adjacency matrix $\mathbf{A}=[a_{ij}]\in\{0,1\}^{n\times n}$, where $a_{ij}=1$ if nodes $v_i$ and $v_j$ are connected, or $a_{ij}=0$ otherwise. The attribute matrix of $\mathcal{G}$ is denoted as $\mathbf{X}\in\mathbb{R}^{n\times m}$, where attributes are extracted from document raw text and $m$ represents the number of attributes. The notations we used throughout the paper are summarized in Table I.
II-B Problem Definitions
We focus on two kinds of downstream tasks, namely, semi-supervised node classification and node clustering, to assess the learned text-rich network representations.
Semi-Supervised Node Classification. Following the common semi-supervised learning setting, each node belongs to one out of $C$ classes, and only a small fraction of nodes $\mathcal{V}_L$ are associated with corresponding labels $\mathcal{Y}_L$.
Notations | Descriptions
---|---
$\mathcal{G}$ | A network.
$\mathcal{D}$ | The set of document raw text.
$\mathcal{V}$, $\mathcal{E}$ | The sets of nodes and edges of a network.
$e_{ij}$ | The edge between nodes $v_i$ and $v_j$.
$a_{ij}$ | The connection between nodes $v_i$ and $v_j$.
$\mathbf{A}$, $\mathbf{X}$ | The adjacency matrix and node attribute matrix.
$\mathbf{D}$ | The node degree matrix.
$\mathbf{x}_i$ | The attribute vector of a given node $v_i$.
$(h, r, t)$ | A triplet in the knowledge graph.
$\mathbf{h}_i$ | The latent representation of a given node $v_i$.
$\mathbf{e}^{s}_{i}$ | The triple representation of an entity node $v_i$.
$\mathbf{e}^{d}_{i}$ | The textual representation of an entity node $v_i$.
$\sigma(\cdot)$ | The non-linear activation function.
$\mathcal{N}_i$ | The set of neighbors of node $v_i$.
$\alpha_{\tau}$ | Weight of type-level attention.
$\beta_{ii'}$ | Weight of node-level attention.
The objective of semi-supervised node classification is to predict the labels of the remaining unlabeled nodes $\mathcal{V}\setminus\mathcal{V}_L$ by learning a function $f:\mathcal{V}\rightarrow\mathcal{Y}$.
Node Clustering. This task follows the unsupervised learning setting, that is, without any node labels. The goal of node clustering is to design a mapping $\psi:\mathcal{V}\rightarrow\{1,\dots,C\}$ that assigns every node to one out of $C$ clusters.
II-C Graph Neural Networks
Graph Neural Networks (GNNs) [16] are a class of neural networks designed for graph-related tasks in an end-to-end manner. They typically learn node representations through an iterative aggregation of local network neighborhoods. Formally, let $\mathbf{h}_i^{(l)}$ be the feature representation of node $v_i$ at the $l$-th layer; the message-passing mechanism can be written as:

$\mathbf{h}_i^{(l)}=\mathrm{UPDATE}\big(\mathbf{h}_i^{(l-1)},\,\mathbf{m}_i^{(l)}\big),\quad \mathbf{m}_i^{(l)}=\mathrm{AGGREGATE}\big(\{\mathbf{h}_{i'}^{(l-1)}: v_{i'}\in\mathcal{N}_i\}\big),\quad \mathbf{h}_i^{(0)}=\mathbf{x}_i,$   (1)

where $\mathbf{x}_i$ denotes the node's attribute vector, $\mathbf{m}_i^{(l)}$ represents the message embedding, and $\mathcal{N}_i$ is the local neighborhood of $v_i$. GNNs work well on several network analytical tasks [17]. But many networks in the real world are text-rich. Since existing GNNs typically treat text as attribute words alone, they inevitably overlook important textual semantics. Therefore, it is of great significance to design a new text-rich GNN that takes full advantage of both network structure and textual semantics.
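To make Eq. (1) concrete, the following minimal sketch (NumPy, with hypothetical variable names) performs one round of mean-aggregation message passing over an adjacency list; it illustrates the generic scheme rather than the exact update rule of any particular GNN.

```python
import numpy as np

def message_passing_step(features, neighbors):
    """One mean-aggregation message-passing step.

    features:  (n, d) array, h_v^{(l-1)} for every node v.
    neighbors: dict mapping node id -> list of neighbor ids.
    Returns the updated (n, d) representations h_v^{(l)}.
    """
    n, d = features.shape
    updated = np.zeros_like(features)
    for v in range(n):
        neigh = neighbors.get(v, [])
        # AGGREGATE: average the neighbors' previous-layer features.
        msg = features[neigh].mean(axis=0) if neigh else np.zeros(d)
        # UPDATE: here simply mix self and message (no weights / nonlinearity).
        updated[v] = 0.5 * (features[v] + msg)
    return updated

# toy usage: 3 nodes on a path graph 0-1-2
x = np.eye(3)
adj_list = {0: [1], 1: [0, 2], 2: [1]}
print(message_passing_step(x, adj_list))
```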
III Methodology
We first give a brief overview of the proposed method, and then introduce three key components in detail.
III-A Overview
To let the textual semantics essentially provide supplementary information for text-rich network representations, we propose a novel text-rich graph neural network, namely TeKo, that can effectively integrate network structure and text semantics via the effective guidance of the external knowledge space over the network data space. The whole structure of the proposed approach is illustrated in Fig. 3, and consists of three main components: semantic network generation, knowledge-based entity representation, and heterogeneous graph attention. Specifically, for semantic network generation, we construct a flexible heterogeneous document-entity framework for modeling the text-rich network, which entails both local and global semantic relationships among documents and entities. For knowledge-based entity representation, we extract useful and related knowledge from structured triplets and unstructured entity descriptions, and accordingly learn joint entity representations by automatically finding a balance between these two types of information. This provides sufficient information for understanding text semantics and further enhances its integration with network structure. For heterogeneous graph attention, we introduce a discriminative propagation mechanism that adaptively aggregates the information from both the text-rich network data space and the knowledge space, enabling the model to realize a well-balanced combination of network structure and textual semantics.
III-B Semantic Network Generation
As an important auxiliary information of text-rich networks, textual semantics (e.g., local word-sequence) play an indispensable role in learning network representations. To take full advantage of the underlying semantic structures within node textual information, we construct a heterogeneous semantic network that enables the integration of entities and captures the rich relationships among documents and entities. The semantic network generation consists of two parts: entity network generation as well as whole network integration, as shown in Fig. 2.
Entity Network Generation. As motivated, on a text-rich network it is imperative to capture the underlying textual semantics. However, treating all words contained in the corpus as entities would inevitably introduce a lot of useless information. To this end, we recognize high-quality entities from the corpus and map them to Wikipedia with the entity linking tool TagMe [15], that is, we select entities whose linking confidence is above a predefined threshold $\epsilon$. After that, the edges among entities can be built according to the similarity of their initial knowledge-based representations (see Subsection III-C below). There are many ways to construct entity edges; we uniformly choose cosine similarity to generate edges between entities.
Cosine Similarity. It uses the cosine of the angle between two vectors to measure entity similarity. Mathematically, given a text-rich network $\mathcal{G}$, where each document node is associated with a textual description, let $\mathcal{V}_E$ be the set of annotated entities and $\mathbf{e}_i$ be the knowledge-based representation of entity node $v_i$; then the similarity between entity nodes $v_i$ and $v_j$ can be calculated as:

$S_{ij}=\dfrac{\mathbf{e}_i\cdot\mathbf{e}_j}{\|\mathbf{e}_i\|\,\|\mathbf{e}_j\|}.$   (2)

Then, the entity adjacency matrix can be obtained by keeping node pairs whose similarity is above a predefined threshold $\delta$. Note that the constructed entity sub-network is static, i.e., the entity nodes and edges do not change during the training process.
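As a concrete illustration, the sketch below (NumPy; the threshold name `delta` and the toy embeddings are ours) builds the entity adjacency matrix by thresholding pairwise cosine similarities of the knowledge-based entity representations, as described above.

```python
import numpy as np

def build_entity_edges(entity_emb, delta=0.7):
    """Connect entity pairs whose cosine similarity exceeds delta.

    entity_emb: (k, d) matrix of knowledge-based entity representations.
    Returns a (k, k) 0/1 adjacency matrix without self-loops.
    """
    norm = np.linalg.norm(entity_emb, axis=1, keepdims=True) + 1e-12
    unit = entity_emb / norm                  # row-normalize embeddings
    sim = unit @ unit.T                       # pairwise cosine similarities
    adj = (sim > delta).astype(np.float32)    # keep pairs above the threshold
    np.fill_diagonal(adj, 0.0)                # no self-loops
    return adj

# toy usage with 4 random entity embeddings
emb = np.random.rand(4, 16)
print(build_entity_edges(emb, delta=0.9))
```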
Whole Network Integration. We finally augment the original text-rich network into a heterogeneous semantic network, so as to explicitly capture both network topology and textual semantics. It includes two kinds of nodes (i.e., document nodes and entity nodes) and three kinds of edges (i.e., edges between document nodes from the original text-rich network representing paper relationships such as citations, edges between document nodes and entity nodes constructed from the inclusion relationships between documents and entities, as well as edges between entity nodes capturing word-sequence semantics). Formally, the heterogeneous semantic network is represented as:

$\mathcal{G}'=(\mathcal{V}\cup\mathcal{V}_E,\ \mathcal{E}\cup\mathcal{E}_{DE}\cup\mathcal{E}_{E}),$   (3)

where $\mathcal{E}_{DE}$ is the set of edges between document nodes and entity nodes, and $\mathcal{E}_{E}$ is the set of edges between entity nodes. In this way, by incorporating entities and relations, we can enrich the semantics of text-rich networks at a structural level.
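The block-matrix assembly of this augmented network can be sketched as follows (NumPy; the node ordering and variable names are hypothetical): document-document edges come from the original network, document-entity edges from inclusion relations, and entity-entity edges from the similarity graph of Eq. (2).

```python
import numpy as np

def build_semantic_network(A_dd, doc_entity_pairs, A_ee):
    """Assemble the augmented document-entity adjacency matrix.

    A_dd             : (n_d, n_d) document-document adjacency (original network).
    doc_entity_pairs : list of (doc_idx, entity_idx) inclusion relations.
    A_ee             : (n_e, n_e) entity-entity adjacency from Eq. (2).
    Returns an ((n_d + n_e), (n_d + n_e)) block adjacency matrix.
    """
    n_d, n_e = A_dd.shape[0], A_ee.shape[0]
    A_de = np.zeros((n_d, n_e))
    for d, e in doc_entity_pairs:
        A_de[d, e] = 1.0                      # document d mentions entity e
    top = np.hstack([A_dd, A_de])
    bottom = np.hstack([A_de.T, A_ee])
    return np.vstack([top, bottom])

# toy usage: 3 documents, 2 entities
A_dd = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_ee = np.array([[0, 1], [1, 0]], dtype=float)
print(build_semantic_network(A_dd, [(0, 0), (2, 1)], A_ee).shape)  # (5, 5)
```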
III-C Knowledge-based Entity Representation
To further comprehensively mine the textual semantics underlying text-rich networks and facilitate their interaction with the text-rich network structure, we learn entity representations using external knowledge. Considering that external knowledge is usually sparse and incomplete, we generate the entity representation by encoding structured triplets and unstructured entity descriptions separately, and adaptively combine them with a learnable gating mechanism.
Triplet Representation. Knowledge graph embedding is an effective way to parameterize entities and relations as vector representations from structured triplets. Here we employ TransE [18], a simple and effective approach, to learn the triple representation $\mathbf{e}^{s}$ of each entity. Given a triplet $(h, r, t)$, let $\mathbf{r}$ be the representation of relation $r$, and $\mathbf{h}$ and $\mathbf{t}$ be the representations of entities $h$ and $t$, respectively. TransE embeds each entity and relation by optimizing the translation principle $\mathbf{h}+\mathbf{r}\approx\mathbf{t}$ if $(h, r, t)$ holds. The score function is formulated as:

$f(h,r,t)=-\|\mathbf{h}+\mathbf{r}-\mathbf{t}\|,$   (4)

where $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$ are subject to the normalization constraint that the magnitude of each vector is 1. Intuitively, a large score $f(h,r,t)$ indicates that the triplet is more likely to be a true fact in the real world, and vice versa. Note that we only consider triplets in which entity nodes within the heterogeneous semantic network appear as the head (subject) rather than the tail (object).
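A minimal sketch of the TransE scoring in Eq. (4) (NumPy, assuming an L2 norm; the margin-based training loop that actually learns the embeddings is omitted):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score for a triplet (h, r, t).

    h, r, t are the embedding vectors of the head entity, relation and
    tail entity. The translation principle h + r ~ t means a valid
    triplet has a small distance, so the negative distance is returned:
    higher score = more plausible fact.
    """
    return -np.linalg.norm(h + r - t, ord=2)

# toy usage with unit-normalized random embeddings
rng = np.random.default_rng(0)
h, r, t = (v / np.linalg.norm(v) for v in rng.normal(size=(3, 8)))
print(transe_score(h, r, t))
```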
Textual Representation. For each entity, we employ the paragraph of its corresponding Wikipedia page as the description. As the description may represent an entity from various aspects, we adopt latent Dirichlet allocation (LDA) [19], a statistical model capable of modeling hidden topics behind texts, to learn the textual representation of each entity (denoted as $\mathbf{e}^{d}$). Specifically, LDA assumes that each description comes from a mixture of topics, where the topics are shared across the descriptions and the mixing proportion of each description is unique. The generation process for each description can be iteratively divided into two steps: the first is to choose a topic with a certain probability, and the second is to select a word under this topic with a certain probability.
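One way to obtain such topic-proportion vectors is scikit-learn's LatentDirichletAllocation, as sketched below; the toy descriptions and the number of topics are placeholders, and the paper does not prescribe a specific LDA implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy entity descriptions (stand-ins for Wikipedia page paragraphs)
descriptions = [
    "quality of service is a measure of network performance",
    "a neural network is a machine learning model inspired by the brain",
    "graph theory studies pairwise relations between objects",
]

# bag-of-words counts, then fit an LDA model with a small number of topics
counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
# each row is the topic-proportion vector used as the entity's textual representation
textual_repr = lda.fit_transform(counts)
print(textual_repr.shape)  # (3, 2)
```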
Representation Fusion. Since both the structured triplets and the unstructured entity descriptions provide valuable information for an entity, we jointly integrate these two kinds of information for the knowledge-based entity representation. To achieve the optimal combination of the triple representation $\mathbf{e}^{s}_i$ and the textual representation $\mathbf{e}^{d}_i$, we introduce a learnable gating mechanism [20] to determine how much the joint representation depends upon triples or description. Mathematically, given an entity node $v_i$, the joint representation $\mathbf{e}_i$ can be formulated as:

$\mathbf{e}_i=\mathbf{g}_i\odot\mathbf{e}^{s}_i+(1-\mathbf{g}_i)\odot\mathbf{e}^{d}_i,$   (5)

where $\mathbf{g}_i$ is a gating vector with elements in $[0,1]$ that balances the information from triples and description, and $\odot$ represents element-wise multiplication. Obviously, a joint representation with gate closer to 0 tends to use the textual representation, whereas a joint representation with gate closer to 1 utilizes the triple representation. To constrain the value of each element in $\mathbf{g}_i$, we apply the sigmoid function to calculate the gate:

$\mathbf{g}_i=\sigma(\tilde{\mathbf{g}}_i),$   (6)

where $\tilde{\mathbf{g}}_i$ is a real-valued vector learned during the training process.
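A minimal sketch of the gated fusion in Eqs. (5)-(6) (NumPy; in practice the raw gate parameter would be learned by back-propagation together with the rest of the model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_entity_representation(e_triple, e_text, g_tilde):
    """Gated fusion of triplet-based and description-based entity vectors.

    e_triple : TransE-style representation from structured triplets.
    e_text   : LDA-style representation from the entity description.
    g_tilde  : real-valued vector of the same dimension; learnable in the
               full model, fixed here for illustration.
    """
    g = sigmoid(g_tilde)                      # gate in (0, 1), Eq. (6)
    return g * e_triple + (1.0 - g) * e_text  # element-wise mix, Eq. (5)

# toy usage: gate = 0.5 everywhere, so the output is the average
d = 4
e_s, e_d = np.ones(d), np.zeros(d)
print(fuse_entity_representation(e_s, e_d, g_tilde=np.zeros(d)))
```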
III-D Heterogeneous Graph Attention
The core of our approach is to take full advantage of both network structure and textual semantics for text-rich network representations with the help of external knowledge. For this purpose, we design a heterogeneous graph attention that performs information propagation on the augmented heterogeneous semantic network. It considers not only the information aggregation in the text-rich network data space, but also the information guidance of the knowledge space to the network data space.
Due to the heterogeneity of nodes in the augmented semantic network, different types of nodes have different feature spaces. Therefore, for a node $v_i$ of type $\tau$, we project its features into a common space using a type-specific transformation matrix $\mathbf{M}_{\tau}$ as:

$\mathbf{h}'_i=\mathbf{M}_{\tau}\cdot\mathbf{x}_i,$   (7)

where $\mathbf{h}'_i$ is the projected feature of node $v_i$.
After that, to facilitate information propagation among neighboring nodes, we learn node representations from the perspective of the network schema, which makes a heterogeneous network semi-structured and guides the exploration of its semantics. Typically, given a target node $v_i$, different types of neighboring nodes may have different impacts on it. Therefore, we employ type-level attention [21] to learn the importance of different types of neighboring nodes. To be specific, let $\mathbf{h}_{\tau}$ be the embedding of type $\tau$, which is defined as the weighted sum of the embeddings of the neighbors of node $v_i$ that belong to type $\tau$, that is:

$\mathbf{h}_{\tau}=\sum_{v_{i'}\in\mathcal{N}_i^{\tau}}\tilde{a}_{ii'}\,\mathbf{h}'_{i'},$   (8)

where $\mathcal{N}_i^{\tau}$ is the set of neighbors of $v_i$ with type $\tau$ and $\tilde{a}_{ii'}$ is the $(i,i')$-th entry of $\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$ ($\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ stands for the adjacency matrix with self-loops, and $\tilde{\mathbf{D}}$ is its degree matrix with $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$).

Then, based on the target node embedding $\mathbf{h}'_i$ and the type embedding $\mathbf{h}_{\tau}$, the type-level attention weights can be calculated as:

$\alpha_{\tau}=\mathrm{softmax}\Big(\sigma\big(\boldsymbol{\mu}_{\tau}^{\top}[\mathbf{h}'_i\,\|\,\mathbf{h}_{\tau}]\big)\Big),$   (9)

where $\boldsymbol{\mu}_{\tau}$ is the attention vector for type $\tau$, $\sigma(\cdot)$ represents an activation function such as LeakyReLU, $\|$ denotes concatenation, and the softmax function is adopted to normalize across all the types.
On the other hand, considering that different neighboring nodes of the same type could also have different importance, we further apply node-level attention [22] to learn the weights between nodes of the same type. Formally, given a target node $v_i$, let $v_{i'}$ be its neighboring node of type $\tau'$; the node-level attention weights can then be computed as:

$\beta_{ii'}=\mathrm{softmax}\Big(\sigma\big(\boldsymbol{\nu}^{\top}\,\alpha_{\tau'}[\mathbf{h}'_i\,\|\,\mathbf{h}'_{i'}]\big)\Big),$   (10)

where $\boldsymbol{\nu}$ is the attention vector, and the softmax is applied to normalize across all the neighboring nodes of the target node $v_i$.
By integrating the above process, the matrix form of the layer-wise propagation rule in heterogeneous graph attention can be defined as follows:

$\mathbf{H}^{(l+1)}=\sigma\Big(\sum_{\tau\in\mathcal{T}}\mathbf{B}_{\tau}\cdot\mathbf{H}_{\tau}^{(l)}\cdot\mathbf{W}_{\tau}^{(l)}\Big),$   (11)

where $\mathcal{T}$ is the set of node types in the augmented semantic network, $\mathbf{B}_{\tau}$ is the attention matrix whose $(i,i')$-th entry is $\beta_{ii'}$, $\mathbf{H}_{\tau}^{(l)}$ is the embedding matrix of nodes of type $\tau$ at layer $l$, and $\mathbf{W}_{\tau}^{(l)}$ is a type-specific transformation matrix. In this way, we realize a reciprocal enhancement of information from both network structure and textual semantics under the guidance of external knowledge.
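The sketch below illustrates one propagation layer in this spirit (NumPy, single head, simplified scoring): each node type is projected into a common space, type-level and node-level attention weights are computed, and neighbor features are aggregated. It is a simplified illustration under our own parameterization, not the exact implementation of the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def hetero_attention_layer(h, node_type, neighbors, proj, att_type, att_node):
    """One simplified heterogeneous graph attention layer.

    h         : (n, d_in) current node features.
    node_type : length-n array of type ids (e.g., 0 = document, 1 = entity).
    neighbors : dict node id -> list of neighbor ids (self-loop included).
    proj      : dict type id -> (d_in, d) type-specific projection matrix.
    att_type  : dict type id -> (2d,) type-level attention vector.
    att_node  : (2d,) node-level attention vector.
    """
    n = h.shape[0]
    # project every node into the common space with its type-specific matrix
    z = np.stack([proj[node_type[v]].T @ h[v] for v in range(n)])
    out = np.zeros_like(z)
    for v in range(n):
        neigh = neighbors[v]
        types = sorted({node_type[u] for u in neigh})
        # type-level attention: one summary embedding per neighbor type
        type_emb = {t: np.mean([z[u] for u in neigh if node_type[u] == t], axis=0)
                    for t in types}
        alpha = softmax(np.array([np.tanh(att_type[t] @ np.concatenate([z[v], type_emb[t]]))
                                  for t in types]))
        alpha = dict(zip(types, alpha))
        # node-level attention within the neighborhood, scaled by the type weight
        scores = np.array([alpha[node_type[u]] *
                           np.tanh(att_node @ np.concatenate([z[v], z[u]]))
                           for u in neigh])
        beta = softmax(scores)
        out[v] = np.tanh(sum(b * z[u] for b, u in zip(beta, neigh)))
    return out

# toy usage: 2 document nodes (type 0) and 1 entity node (type 1)
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
node_type = np.array([0, 0, 1])
neighbors = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2]}
proj = {t: np.eye(4) for t in (0, 1)}
att_type = {t: rng.normal(size=8) for t in (0, 1)}
att_node = rng.normal(size=8)
print(hetero_attention_layer(h, node_type, neighbors, proj, att_type, att_node).shape)
```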
III-E Model Training
After obtaining the final document representations, we can apply them to specific tasks and design different loss functions. For semi-supervised node classification, we define the loss function using cross entropy as:

$\mathcal{L}=-\sum_{i\in\mathcal{Y}_L}\sum_{c=1}^{C}\mathbf{Y}_{ic}\ln\hat{\mathbf{Y}}_{ic},$   (12)

where $\mathcal{Y}_L$ denotes the set of node indices that have labels, $C$ is the number of classes, $\mathbf{Y}_{i}$ is the one-hot label vector of node $v_i$, and $\hat{\mathbf{Y}}_{i}$ is the predicted class distribution.

For unsupervised learning, without any node labels, we define the loss function through negative sampling as:

$\mathcal{L}=-\sum_{(i,j)\in\Omega}\ln\sigma\big(\mathbf{h}_i^{\top}\mathbf{h}_j\big)-\sum_{(i',j')\in\Omega^{-}}\ln\sigma\big(-\mathbf{h}_{i'}^{\top}\mathbf{h}_{j'}\big),$   (13)

where $\sigma(\cdot)$ represents the sigmoid activation function, $\Omega$ is the set of positive node pairs, and $\Omega^{-}$ is the set of negative node pairs (the complement of $\Omega$).
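The two objectives can be written compactly as follows (NumPy sketch with hypothetical names; a real implementation would compute these losses inside an autodiff framework such as PyTorch):

```python
import numpy as np

def cross_entropy_loss(logits, labels, labeled_idx):
    """Semi-supervised classification loss over the labeled nodes only (Eq. 12)."""
    z = logits[labeled_idx]
    z = z - z.max(axis=1, keepdims=True)                     # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labeled_idx)), labels[labeled_idx]].mean()

def negative_sampling_loss(emb, pos_pairs, neg_pairs):
    """Unsupervised loss (Eq. 13): pull positive pairs together, push negatives apart."""
    def term(pairs, sign):
        s = np.array([emb[u] @ emb[v] for u, v in pairs])
        return -np.log(1.0 / (1.0 + np.exp(-sign * s)) + 1e-12).sum()
    return term(pos_pairs, +1.0) + term(neg_pairs, -1.0)

# toy usage
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 1.0]])
labels = np.array([0, 1, 0])
print(cross_entropy_loss(logits, labels, labeled_idx=np.array([0, 1])))
emb = np.random.rand(4, 8)
print(negative_sampling_loss(emb, pos_pairs=[(0, 1)], neg_pairs=[(0, 3)]))
```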
IV Experiments
We first introduce the experimental setup. We then compare the new approach TeKo with the state-of-the-art methods on three network analysis tasks, and thereafter present an in-depth analysis of different components of TeKo together with a parameter analysis. We finally present an application to e-commerce search scenarios.
IV-A Experimental Setup
Datasets. We conduct experiments on the following four public text-rich datasets, whose basic information is summarized in Table II.
- Cora-Enrich (http://zhang18f.myweb.cs.uwindsor.ca/datasets/) is a text-rich citation dataset composed of machine learning papers, where each node represents a document whose textual description consists of its title and abstract. The documents are manually labeled with seven categories (e.g., Reinforcement Learning) based on their academic topics.
- DBLP-Five [23] is extracted from the computer science bibliography website, where the textual information of each document contains its title and description. According to the published domains, the documents are labeled with five research areas: High-Performance Computing, Software Engineering, Computer Networks, Theoretical Computer Science, and Computer Graphics: Multimedia.
- Hep-Small and Hep-Large (https://www.cs.cornell.edu/projects/kddcup/datasets.html) are two citation datasets about scientific documents in physics. Hep-Small contains 397 documents in three categories: Phys.Rev.Lett, Nucl.Phys.Proc.Suppl, and Commun.Math.Phys, while Hep-Large contains 11,752 documents in four categories: Phys.Rev, Phys.Lett, Nucl.Phys, and JHEP.
Datasets | #Nodes | #Edges | #Categories |
---|---|---|---|
Hep-Small | 397 | 812 | 3 |
Cora-Enrich | 2,708 | 5,429 | 7 |
DBLP-Five | 6,936 | 12,353 | 5 |
Hep-Large | 11,752 | 134,956 | 4 |
Methods | Hep-Small Accuracy | Hep-Small Macro-F1 | Cora-Enrich Accuracy | Cora-Enrich Macro-F1 | DBLP-Five Accuracy | DBLP-Five Macro-F1 | Hep-Large Accuracy | Hep-Large Macro-F1
---|---|---|---|---|---|---|---|---
GCN | 61.54 ± 3.44 | 60.37 ± 4.17 | 88.71 ± 0.76 | 87.83 ± 0.90 | 93.52 ± 0.47 | 93.19 ± 0.53 | 50.24 ± 0.44 | 50.23 ± 0.45
GAT | 64.87 ± 3.64 | 64.12 ± 4.37 | 88.96 ± 1.00 | 88.48 ± 1.11 | 93.70 ± 0.34 | 93.30 ± 0.31 | 50.17 ± 0.57 | 49.43 ± 1.00
SGC | 63.59 ± 3.94 | 63.01 ± 4.41 | 88.00 ± 1.22 | 87.13 ± 0.97 | 93.78 ± 0.46 | 93.29 ± 0.55 | 47.69 ± 1.15 | 46.15 ± 1.31
DGI | 59.72 ± 3.04 | 59.30 ± 3.16 | 79.57 ± 1.68 | 74.66 ± 1.62 | 83.48 ± 1.54 | 81.28 ± 1.64 | 43.22 ± 0.80 | 39.28 ± 0.79
GMI | 60.00 ± 2.98 | 59.23 ± 2.96 | 85.03 ± 0.46 | 84.81 ± 0.45 | 92.29 ± 0.17 | 91.41 ± 0.20 | 49.55 ± 0.24 | 49.44 ± 0.25
GraphSage | 63.33 ± 2.31 | 62.07 ± 2.52 | 89.48 ± 0.81 | 88.38 ± 1.22 | 93.71 ± 0.70 | 93.35 ± 0.82 | 46.26 ± 0.85 | 45.54 ± 0.90
AM-GCN | 68.21 ± 4.00 | 67.65 ± 4.36 | 89.04 ± 1.06 | 88.16 ± 1.43 | 93.52 ± 0.59 | 93.05 ± 0.65 | 47.17 ± 1.24 | 45.40 ± 1.51
Geom-GCN | 64.36 ± 2.42 | 64.33 ± 2.88 | 88.19 ± 1.65 | 87.09 ± 1.89 | 94.57 ± 0.62 | 94.19 ± 0.73 | 51.78 ± 0.80 | 51.42 ± 0.95
BiTe-GCN | 67.44 ± 3.04 | 66.84 ± 3.06 | 89.59 ± 0.99 | 88.69 ± 0.74 | 93.81 ± 0.59 | 93.12 ± 0.58 | 50.43 ± 1.32 | 50.12 ± 1.82
AS-GCN | 67.95 ± 4.48 | 68.06 ± 3.97 | 91.18 ± 1.52 | 90.29 ± 1.81 | 94.17 ± 0.88 | 93.66 ± 0.97 | 50.92 ± 1.02 | 50.81 ± 0.88
TeKo | 71.54 ± 3.88 | 70.76 ± 4.37 | 92.11 ± 0.86 | 91.38 ± 0.95 | 95.12 ± 0.52 | 94.74 ± 0.68 | 53.02 ± 1.03 | 53.01 ± 1.01
Baselines. We evaluate the performance of our proposed TeKo by comparing it with ten state-of-the-art baselines.
- GCN [16] is a classical GNN which derives node representations by defining convolutional operators on graph-structured data.
- GAT [22] is an attention-based GNN which performs convolutional operations in the graph spatial domain and assigns different weights to neighbors.
- SGC [24] is a simplified GNN that reduces model complexity by removing nonlinearities and collapsing weight matrices between consecutive layers.
- DGI [25] is an unsupervised GNN which maximizes local mutual information by utilizing the graph's patch representations.
- GMI [26] is an unsupervised GNN that measures the correlation between input graphs and high-level representations through graphical mutual information.
- GraphSage [27] is an inductive GNN leveraging a sampler and an aggregator to generate node representations.
- AM-GCN [28] is an adaptive multi-channel GNN which extracts node representations by learning suitable weights to fuse the information from topology space and feature space.
- Geom-GCN [29] is a semi-supervised GNN that designs a geometric aggregation mechanism to aggregate neighbor information for each node.
- BiTe-GCN [8] is a text-rich GNN that obtains node representations by integrating word-sequence structure.
- AS-GCN [9] is a text-rich GNN employing both local word-sequence and global topic semantic structures for node representations.
Implementation Details. For all baselines, we use the parameter settings suggested by their papers. For BiTe-GCN and AS-GCN, we employ pre-trained 300-dimensional GloVe embeddings [30] to initialize entity representations. For our model, we use Wikipedia anchors to align mentions extracted from the textual description of each document node to Wikidata5M [31], a recently proposed large-scale knowledge graph containing 4M entities and 21M fact triplets. We set the number of heterogeneous graph attention layers to 2, and use the Adam optimizer with an initial learning rate of 0.005 and a weight decay of 5e-4. In addition, the TagMe threshold $\epsilon$ and the entity-edge similarity threshold $\delta$ (cosine similarity) are searched in {0.1, 0.2, 0.3, 0.4} and {0.5, 0.6, 0.7, 0.8, 0.9}, respectively. We set the activation function to LeakyReLU with slope 0.2, and apply a dropout rate of 0.5 to prevent overfitting. For a fair comparison of all methods, we generate 10 random splits for training, validation and test.
IV-B Node Classification
On the node classification task, the goal is to predict the labels of the remaining nodes given only a fraction of node labels. Since the variance on graph-structured data can be quite large, we report the average Accuracy and Macro-F1 along with the standard deviation over 10 independent trials with different random seeds.
As presented in Table III, TeKo performs consistently and substantially better than all baselines across the four datasets. Specifically, in terms of Accuracy, TeKo is up to 3.33%, 0.93%, 0.55% and 1.24% higher than the best baseline on Hep-Small, Cora-Enrich, DBLP-Five and Hep-Large, respectively. In terms of Macro-F1, TeKo is 2.70%, 1.09%, 0.55% and 1.59% higher than the best baseline on these four datasets. These results not only demonstrate the superiority of comprehensively mining textual semantics underlying text-rich networks via external knowledge, but also validate the effectiveness of our new propagation mechanism that facilitates the information interaction between network structure and textual information. The performance of TeKo is also much better than that of vanilla GCN (i.e., 10.00%, 3.40%, 1.60% and 2.78% relative improvements in Accuracy, and 10.39%, 3.55%, 1.55% and 2.78% relative improvements in Macro-F1), which implies that TeKo is capable of making a well-balanced combination of both network structure and textual semantics within a text-rich network. Also of note, compared with BiTe-GCN and AS-GCN, which are also designed for text-rich networks, TeKo still achieves the best performance, which further verifies the significance of fully comprehending textual semantics for text-rich network representations with the guidance of external knowledge.
IV-C Node Clustering
We also conduct comparisons of these methods on node clustering. In this task, for each method, the learned document embeddings are used as the input to the K-Means algorithm, where K is set to the number of clusters. Since the performance of clustering is easily affected by the initial centers, we report the average normalized mutual information (NMI) and adjusted rand index (ARI) along with the standard deviation over 10 splits.
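This evaluation protocol can be sketched with scikit-learn as below; the embeddings and labels are random placeholders standing in for the learned document representations and ground-truth classes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# placeholder document embeddings and ground-truth labels
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16))
labels = rng.integers(0, 5, size=100)

# K is set to the number of ground-truth clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)
print("NMI:", normalized_mutual_info_score(labels, kmeans.labels_))
print("ARI:", adjusted_rand_score(labels, kmeans.labels_))
```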
As shown in Table IV and Table V, the proposed method TeKo performs the best across all four datasets. To be specific, TeKo outperforms BiTe-GCN and AS-GCN, which also focus on extracting textual semantics within text-rich networks, by 5.49% and 3.07% in terms of NMI, and by 0.0506 and 0.0302 in terms of ARI (which ranges from -1 to 1), averaged over all four networks. TeKo also performs better than the vanilla GCN by 6.52% in terms of NMI and 0.064 in terms of ARI on average. These results further validate the soundness of designing a text-rich graph neural network with external knowledge to take advantage of both structural and textual information within text-rich networks. Neither AM-GCN nor Geom-GCN is so competitive here. This may be mainly because they fail to fully utilize the semantics contained in the text, which significantly compromises their performance in the unsupervised clustering setting.
Methods | Hep-Small | Cora-Enrich | DBLP-Five | Hep-Large
---|---|---|---|---
GCN | 24.37 ± 5.87 | 75.71 ± 1.30 | 81.07 ± 1.37 | 12.18 ± 0.61
GAT | 27.02 ± 7.82 | 76.42 ± 1.68 | 81.46 ± 0.99 | 12.74 ± 0.44
SGC | 25.33 ± 7.49 | 74.77 ± 2.25 | 82.10 ± 1.37 | 9.54 ± 0.85
DGI | 20.55 ± 5.47 | 62.99 ± 2.46 | 79.57 ± 1.68 | 7.52 ± 0.69
GMI | 28.85 ± 8.34 | 63.84 ± 2.89 | 78.40 ± 0.37 | 11.90 ± 0.42
GraphSage | 26.33 ± 6.29 | 77.46 ± 1.34 | 81.27 ± 1.76 | 7.98 ± 0.67
AM-GCN | 28.84 ± 5.57 | 76.22 ± 2.23 | 80.90 ± 1.46 | 11.68 ± 0.79
Geom-GCN | 26.78 ± 5.43 | 74.70 ± 3.61 | 83.66 ± 1.63 | 13.62 ± 0.71
BiTe-GCN | 27.25 ± 4.75 | 77.70 ± 1.86 | 81.92 ± 1.26 | 13.79 ± 0.63
AS-GCN | 30.34 ± 7.39 | 79.89 ± 2.75 | 82.81 ± 2.20 | 14.09 ± 0.76
TeKo | 36.38 ± 7.92 | 82.04 ± 1.88 | 85.73 ± 1.46 | 15.24 ± 1.00
Methods | Hep-Small | Cora-Enrich | DBLP-Five | Hep-Large
---|---|---|---|---
GCN | 0.1992 ± 0.0511 | 0.7631 ± 0.0180 | 0.8477 ± 0.0105 | 0.1199 ± 0.0051
GAT | 0.2205 ± 0.0714 | 0.7579 ± 0.0248 | 0.8543 ± 0.0098 | 0.1243 ± 0.0050
SGC | 0.2025 ± 0.0539 | 0.7413 ± 0.0330 | 0.8561 ± 0.0117 | 0.0983 ± 0.0082
DGI | 0.1409 ± 0.0410 | 0.6001 ± 0.0388 | 0.6570 ± 0.0306 | 0.0740 ± 0.0068
GMI | 0.2414 ± 0.0735 | 0.6048 ± 0.0376 | 0.8225 ± 0.0333 | 0.0788 ± 0.0021
GraphSage | 0.2098 ± 0.0491 | 0.7805 ± 0.0221 | 0.8530 ± 0.0138 | 0.0801 ± 0.0073
AM-GCN | 0.2615 ± 0.0755 | 0.7659 ± 0.0254 | 0.8506 ± 0.0133 | 0.1127 ± 0.0117
Geom-GCN | 0.2016 ± 0.0407 | 0.7535 ± 0.0342 | 0.8743 ± 0.0133 | 0.1344 ± 0.0084
BiTe-GCN | 0.2312 ± 0.0493 | 0.7744 ± 0.0279 | 0.8532 ± 0.0143 | 0.1346 ± 0.0098
AS-GCN | 0.2564 ± 0.0794 | 0.8073 ± 0.0248 | 0.8656 ± 0.0190 | 0.1360 ± 0.0107
TeKo | 0.3215 ± 0.0678 | 0.8315 ± 0.0251 | 0.8837 ± 0.0148 | 0.1491 ± 0.0132
Methods | Hep-Small Accuracy | Hep-Small Macro-F1 | Cora-Enrich Accuracy | Cora-Enrich Macro-F1 | DBLP-Five Accuracy | DBLP-Five Macro-F1 | Hep-Large Accuracy | Hep-Large Macro-F1
---|---|---|---|---|---|---|---|---
TeKo | 71.54 ± 3.88 | 70.76 ± 4.37 | 92.11 ± 0.86 | 91.38 ± 0.95 | 95.12 ± 0.52 | 94.74 ± 0.68 | 53.02 ± 1.03 | 53.01 ± 1.01
- w/o Triplet | 69.49 ± 3.33 | 69.06 ± 3.30 | 91.18 ± 1.89 | 90.86 ± 1.97 | 94.90 ± 0.64 | 94.54 ± 0.71 | 50.66 ± 1.19 | 49.83 ± 1.42
- w/o Textual | 73.33 ± 3.48 | 72.94 ± 4.03 | 90.30 ± 1.58 | 89.48 ± 1.60 | 94.99 ± 0.46 | 94.66 ± 0.54 | 50.94 ± 0.95 | 50.07 ± 0.93
TeKo (concatenation) | 66.92 ± 2.68 | 66.63 ± 2.90 | 90.89 ± 1.36 | 90.18 ± 1.82 | 94.24 ± 0.47 | 93.78 ± 0.42 | 50.98 ± 1.01 | 50.37 ± 1.40
IV-D Visualization
To provide a more intuitive comparison, we conduct embedding visualization of GCN, GAT, Geom-GCN, AS-GCN and our proposed TeKo, taking the DBLP-Five dataset as an example. We use t-SNE [32] to project the learned node representations into a two-dimensional space, where different colors denote different labels. A desirable visualization is therefore one in which nodes belonging to the same category (shown in the same color) are close to each other.
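The visualization protocol amounts to a few lines of scikit-learn and matplotlib, sketched below with random placeholder embeddings and labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# placeholder learned node representations and their class labels
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64))
labels = rng.integers(0, 5, size=300)

# project to 2-D with t-SNE and color points by class
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of learned node representations")
plt.show()
```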
From Fig. 4, we can find that the visualization results of GCN and GAT are not so satisfactory, as nodes with the same color are dispersed and nodes with different colors are mixed with each other. The results of Geom-GCN and AS-GCN are relatively better but the borders between different classes are not so clear. Apparently, the visualization of TeKo performs better, where the learned representations have a higher intra-class similarity and form more discernible clusters.
IV-E Ablation Study
Similar to most deep learning models, our proposed method TeKo contains several important components that may have a significant impact on performance. To test the effectiveness of each component, we compare TeKo with three variants: 1) TeKo without the triplet representation of entities, named w/o Triplet; 2) TeKo without the textual representation of entities, named w/o Textual; and 3) TeKo employing a concatenation operator instead of the gating mechanism to generate the knowledge-based entity representation, named TeKo (concatenation).
We take their comparison on node classification in terms of Accuracy and Macro-F1 as an example. From the results in Table VI, we can draw the following conclusions: (1) The results of TeKo are better than those of its three variants in most cases (except on Hep-Small), indicating the effectiveness of jointly using structured triplets and unstructured entity descriptions for entity representation. (2) TeKo w/o Textual is in general better than TeKo w/o Triplet (on three out of the four datasets), which implies that the textual representation of entities plays a more vital role in fully comprehending textual semantics and promoting their interaction with network structure. (3) Compared to TeKo (concatenation), which uses a concatenation operator to fuse the triple and textual representations of an entity, the improvement brought by TeKo's gating mechanism is significant, which illustrates the rationality of adaptively fusing these two representations for the given learning objectives.
IV-F Parameter Analysis
We investigate the sensitivity of two main parameters: the TagMe threshold $\epsilon$ and the entity-edge similarity threshold $\delta$. We take node classification on all four text-rich datasets as an example and report the Accuracy and Macro-F1 values.
Analysis of $\epsilon$. The threshold $\epsilon$ determines the number of entities within the generated semantic network. We vary its value from 0.1 to 0.4, and the corresponding results are shown in Fig. 5. With the increase of $\epsilon$, the performance first rises and then descends. This is reasonable, since a too small TagMe threshold would introduce noisy entities, whereas a too large threshold may filter out informative entities and thus weaken the reciprocal guidance between text semantics and network structure.
Analysis of $\delta$. The threshold $\delta$ determines the number of edges between entities, which represent the local word-sequence semantic structure underlying the corpus. We vary its value from 0.5 to 0.9, and the corresponding results are shown in Fig. 6. With the increase of this entity-edge similarity threshold, Accuracy and Macro-F1 also first increase and then start to descend. This is probably because too few edges between entities would result in information loss and ineffective information propagation, whereas too many edges would introduce more noise.
IV-G Application on E-commerce Search
To further verify the effectiveness of our proposed new approach, we collect an e-commerce search dataset and apply TeKo to it to solve the problem of relevance matching, that is, predicting whether the current example pair (query and item) is relevant or not. The collected dataset contains millions of queries and items, and its detailed statistics are shown in Table VII.
Datasets | #Queries | #Items | #Edges |
---|---|---|---|
Training set | 3,284,480 | 1,307,557 | 7,525,355 |
Validation set | 3,108 | 30,097 | 30,661 |
External Knowledge in E-commerce Search. We use category information to serve as the external knowledge in e-commerce search. Specifically, the category information includes three different levels of categories, i.e., first-level, second-level, and third-level categories, organized in a tree-shaped structure as shown in Fig. 7. Adding this information is helpful for analyzing the semantics of the current query or item. For example, for an item whose title is "red mac 2020 (made in US)", we cannot tell whether it is an electronic product or a cosmetic product. But if we further know that one of its categories is "Clothes & Cosmetics", then we can easily infer that it is a lipstick.
Baselines. Since this relevance matching problem lies in the field of natural language matching, we compare our new approach TeKo with seven state-of-the-art methods on this topic.
- MV-LSTM [33] is a deep model which assigns an importance score to each local keyword using rich context information in order to match two sentences.
- K-NRM [34] is a kernel-based semantic matching model that uses an embedding layer, translation, and kernel pooling to capture the word-level interactions between two sentences.
- ARC-I [35] matches two sentences via their representations using a multi-layer perceptron (MLP), which is essentially a Siamese architecture.
- ARC-II [35] is an advanced version of ARC-I. Compared to ARC-I, ARC-II focuses more on the interaction between the two sentences.
- MatchPyramid [36] employs hierarchical convolution to capture matching patterns at different levels, such as unigram, n-gram and n-term, contained in both sentences.
- DUET [37] calculates the final relevance score by summing the scores from the representation-based embedding and the interaction-based embedding.
- BERT2DNN [38] is a data-driven model which adopts transfer learning and knowledge distillation techniques for search relevance.
Metrics. We choose six evaluation metrics to measure model quality, including Area Under the receiver operating characteristic Curve (AUC), Accuracy, Precision, F1-score, Recall, and False Negative Rate (FNR). A lower FNR indicates a better model, while higher values are better for the other metrics. In particular, AUC is the most important of these metrics.
Experimental Results and Analysis. From the results in Table VIII, we can find that our TeKo consistently outperforms all the baseline methods.
Methods | AUC(*) | Accuracy | Precision | F1-score | Recall | FNR |
---|---|---|---|---|---|---|
MV-LSTM | 0.8760 | 0.8069 | 0.8392 | 0.7256 | 0.7541 | 0.2459 |
ARC-I | 0.7945 | 0.7392 | 0.8005 | 0.6329 | 0.6830 | 0.3170 |
K-NRM | 0.8424 | 0.7899 | 0.7341 | 0.6919 | 0.7333 | 0.2667 |
ARC-II | 0.8466 | 0.7913 | 0.6931 | 0.7002 | 0.7439 | 0.2561 |
MatchPyramid | 0.8758 | 0.8278 | 0.7741 | 0.7451 | 0.7683 | 0.2317 |
DUET | 0.8682 | 0.8156 | 0.8159 | 0.7253 | 0.7337 | 0.2663 |
BERT2DNN | 0.8906 | 0.7896 | 0.8565 | 0.8362 | 0.8169 | 0.1831 |
TeKo | 0.9092 | 0.8319 | 0.8648 | 0.8735 | 0.8824 | 0.1176 |
In particular, compared to the popular e-commerce search algorithm BERT2DNN, which uses transfer learning and knowledge distillation to improve result relevance, TeKo achieves improvements of up to 4.23% and 3.73% in Accuracy and F1-score, respectively. These results not only demonstrate the effectiveness of TeKo in capturing query-item correlations, but also show the rationality of adopting a more advanced strategy, i.e., utilizing external knowledge to comprehend text semantics in text-rich situations for high-level relevance matching.
V Related Work
In line with the focus of our work, we briefly review some related studies, including classical graph neural networks, text-rich graph neural networks, knowledge-enhanced graph neural networks, and graph neural networks for text analysis.
Classical Graph Neural Networks. Graph neural networks (GNNs) have attracted considerable research interest due to their ability to model graph-structured data. For example, Bruna et al. [39] first designed the graph convolutional operation in the Fourier domain through the graph Laplacian. Defferrard et al. [40] further improved its efficiency using the Chebyshev polynomial expansion. After that came GCN [16], a simplified graph convolutional operation that aggregates information from nodes' one-hop neighbors. GraphSAGE [27] employs mean/max/LSTM pooling to sample and aggregate information from neighbors. GAT [22] assigns different weights to neighbors with a node-level attention mechanism. Although these methods can be used for text-rich network representations, they suffer from an inability to fully consider and mine the text semantics (e.g., local word sequences or global topics) underlying text-rich networks, which are particularly important for information propagation along network topology.
Text-Rich Graph Neural Networks. Recently, much attention has been paid to text-rich network representations. For instance, HyperMine [41] is designed for discovering hypernymy from text-rich networks. NetTaxo [42] focuses on topic taxonomy construction and integrates text data and network structures simultaneously. LTRN [43] designs a minimally-supervised categorization framework based on personalized PageRank-based neighborhood sampling and attentive aggregation from a text-rich network perspective. BiTe-GCN [8] uses bidirectional convolution of topology and features to depict the original network structure and the local word-sequence structure extracted from text, and on this basis, AS-GCN [9] takes the global topic semantic structure into account. However, the above methods cannot fully comprehend the semantic content contained in text-rich networks, nor reason over complex concepts and relational paths, limiting the reciprocal guidance between network structure and text semantics.
Knowledge-Enhanced Graph Neural Networks. As GNNs have become the most eye-catching tools for node representation, several efforts have been made to combine external knowledge with GNNs to boost the performance of downstream tasks. For instance, KGCN [44] captures users' preferences on the knowledge graph for recommender systems. KGAT [45] presents a knowledge-aware recommendation method by explicitly modeling the high-order connectivity with semantic relations in a collaborative knowledge graph. More recently, RGHAT [46] designs a two-level attention mechanism which uses the inherent and valuable neighborhood information surrounding an entity for knowledge graph completion. Caps-GNN [47] proposes a novel knowledge-enhanced personalized review generation model that adopts capsule graph neural networks and a knowledge graph to capture both aspect- and word-level user preferences. CompareNet [48] utilizes topics and entities extracted from the text to enrich news representations, then compares news against external knowledge through entities to detect fake news. However, how to utilize the guidance of external knowledge to facilitate text-rich network representations is still an area that urgently needs to be explored.
Graph Neural Networks for Text Analysis. Another research line relevant to our work is applying GNNs to text analysis. For example, Text GCN [49] turns text classification into a node classification problem by constructing a heterogeneous word-document graph for a whole corpus. HGAT [21] enriches the semantics of short texts by incorporating topics, entities and their relations. TensorGCN [50] constructs a text graph tensor to describe contextual information within text, and further integrates this information by generalizing the GNN into a tensor version. TextING [51] treats each text as an individual graph and learns word interactions at the text level for inductive text classification. TextGTL [52] proposes a non-heterogeneous graph construction method which jointly considers different linguistic information, including semantics, syntax and sequence context, for text classification. In summary, the above GNNs mainly focus on establishing associations among independent texts via building a graph structure, which essentially lies in the field of text analysis rather than text-rich network representation.
VI Conclusion
In this paper, we propose a novel text-rich graph neural network, namely TeKo, which effectively integrates both network structure and textual semantics with guidance from external knowledge for text-rich network representations. Specifically, we first augment the original text-rich network structure into a heterogeneous semantic network by incorporating informative entities and interactions among documents and entities. We then leverage two types of external knowledge, structured triplets and unstructured entity descriptions, to jointly learn entity representations for a deep understanding of text semantics. We further design a reciprocal propagation mechanism for the augmented heterogeneous semantic network, which realizes a well-balanced combination of network structure and textual semantics, ultimately improving the quality of text-rich network representations. Extensive experiments demonstrate the superior performance of the proposed new approach over the state-of-the-art methods on four public text-rich networks as well as a large-scale real-world e-commerce search dataset.
References
- [1] L. Wu, P. Cui, J. Pei, and L. Zhao, Graph Neural Networks: Foundations, Frontiers, and Applications. Springer Singapore, 2022.
- [2] J. Ma, P. Cui, K. Kuang, X. Wang, and W. Zhu, “Disentangled graph convolutional networks,” in Proceedings of ICML, vol. 97, pp. 4212–4221, 2019.
- [3] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph sequence neural networks,” in Proceedings of ICLR, 2016.
- [4] D. He, Y. Song, D. Jin, Z. Feng, B. Zhang, Z. Yu, and W. Zhang, “Community-centric graph convolutional network for unsupervised community detection,” in Proceedings of IJCAI, pp. 3515–3521, 2020.
- [5] Q. Tan, N. Liu, X. Zhao, H. Yang, J. Zhou, and X. Hu, “Learning to hash with graph neural networks for recommender systems,” in Proceedings of WWW, pp. 1988–1998, 2020.
- [6] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, “Heterogeneous graph neural network,” in Proceedings of SIGKDD, pp. 793–803, 2019.
- [7] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. V. Steeg, and A. Galstyan, “Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing,” in Proceedings of ICML, vol. 97, pp. 21–29, 2019.
- [8] D. Jin, X. Song, Z. Yu, Z. Liu, H. Zhang, Z. Cheng, and J. Han, “Bite-gcn: A new GCN architecture via bidirectional convolution of topology and features on text-rich networks,” in Proceedings of WSDM, pp. 157–165, 2021.
- [9] Z. Yu, D. Jin, Z. Liu, D. He, X. Wang, H. Tong, and J. Han, “AS-GCN: adaptive semantic architecture of graph convolutional networks for text-rich networks,” in Proceedings of ICDM, pp. 837–846, 2021.
- [10] R. Speer, J. Chin, and C. Havasi, “Conceptnet 5.5: An open multilingual graph of general knowledge,” in Proceedings of AAAI, pp. 4444–4451, 2017.
- [11] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion,” in Proceedings of AAAI, pp. 2181–2187, 2015.
- [12] T. Sun, Y. Shao, X. Qiu, Q. Guo, Y. Hu, X. Huang, and Z. Zhang, “Colake: Contextualized language and knowledge embedding,” in Proceedings of COLING, pp. 3660–3670, 2020.
- [13] C. Huang, H. Xu, Y. Xu, P. Dai, L. Xia, M. Lu, L. Bo, H. Xing, X. Lai, and Y. Ye, “Knowledge-aware coupled graph neural network for social recommendation,” in Proceedings of AAAI, pp. 4115–4122, 2021.
- [14] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra, “Beyond homophily in graph neural networks: Current limitations and effective designs,” in Proceedings of NeurIPS, 2020.
- [15] P. Ferragina and U. Scaiella, “TAGME: on-the-fly annotation of short text fragments (by wikipedia entities),” in Proceedings of CIKM, pp. 1625–1628, 2010.
- [16] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in proceedings of ICLR, 2017.
- [17] J. Zhu, R. A. Rossi, A. Rao, T. Mai, N. Lipka, N. K. Ahmed, and D. Koutra, “Graph neural networks with heterophily,” in Proceedings of AAAI, pp. 11168–11176, 2021.
- [18] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Proceedings of NeurIPS, pp. 2787–2795, 2013.
- [19] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
- [20] J. Xu, X. Qiu, K. Chen, and X. Huang, “Knowledge graph representation with jointly structural and textual encoding,” in Proceedings of IJCAI, pp. 1318–1324, 2017.
- [21] L. Hu, T. Yang, C. Shi, H. Ji, and X. Li, “Heterogeneous graph attention networks for semi-supervised short text classification,” in Proceedings of EMNLP, pp. 4820–4829, 2019.
- [22] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proceedings of ICLR, 2018.
- [23] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of SIGKDD, pp. 990–998, 2008.
- [24] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in Proceedings of ICML, vol. 97, pp. 6861–6871, 2019.
- [25] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” in Proceedings of ICLR, 2019.
- [26] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, “Graph representation learning via graphical mutual information maximization,” in Proceedings of WWW, pp. 259–270, 2020.
- [27] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of NeurIPS, pp. 1024–1034, 2017.
- [28] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, “AM-GCN: adaptive multi-channel graph convolutional networks,” in Proceedings of SIGKDD, pp. 1243–1253, 2020.
- [29] H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” in Proceedings of ICLR, 2020.
- [30] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of EMNLP, pp. 1532–1543, 2014.
- [31] X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li, and J. Tang, “KEPLER: A unified model for knowledge embedding and pre-trained language representation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 176–194, 2021.
- [32] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
- [33] S. Wan, Y. Lan, J. Guo, J. Xu, L. Pang, and X. Cheng, “A deep architecture for semantic matching with multiple positional sentence representations.,” in Proceedings of AAAI, pp. 2835–2841, 2016.
- [34] C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power, “End-to-end neural ad-hoc ranking with kernel pooling,” in Proceedings of SIGIR, pp. 55–64, 2017.
- [35] B. Hu, Z. Lu, H. Li, and Q. Chen, “Convolutional neural network architectures for matching natural language sentences,” in Proceedings of NeurIPS, pp. 2042–2050, 2014.
- [36] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng, “Text matching as image recognition.,” in Proceedings of AAAI, vol. 16, pp. 2793–2799, 2016.
- [37] B. Mitra, F. Diaz, and N. Craswell, “Learning to match using local and distributed representations of text for web search,” in Proceedings of WWW, pp. 1291–1299, 2017.
- [38] Y. Jiang, Y. Shang, Z. Liu, H. Shen, Y. Xiao, W. Xiong, S. Xu, W. Yan, and D. Jin, “BERT2DNN: BERT distillation with massive unlabeled data for online e-commerce search,” in Proceedings of ICDM, pp. 212–221, 2020.
- [39] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in Proceedings of ICLR, 2014.
- [40] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Proceedings of NeurIPS, pp. 3837–3845, 2016.
- [41] Y. Shi, J. Shen, Y. Li, N. Zhang, X. He, Z. Lou, Q. Zhu, M. Walker, M. Kim, and J. Han, “Discovering hypernymy in text-rich heterogeneous information network by exploiting context granularity,” in Proceedings of CIKM, pp. 599–608, 2019.
- [42] J. Shang, X. Zhang, L. Liu, S. Li, and J. Han, “Nettaxo: Automated topic taxonomy construction from text-rich network,” in Proceedings of WWW, pp. 1908–1919, 2020.
- [43] X. Zhang, C. Zhang, X. L. Dong, J. Shang, and J. Han, “Minimally-supervised structure-rich text categorization via learning on text-rich networks,” in Proceedings of WWW, pp. 3258–3268, 2021.
- [44] H. Wang, M. Zhao, X. Xie, W. Li, and M. Guo, “Knowledge graph convolutional networks for recommender systems,” in Proceedings of WWW, pp. 3307–3313, 2019.
- [45] X. Wang, X. He, Y. Cao, M. Liu, and T. Chua, “KGAT: knowledge graph attention network for recommendation,” in Proceedings of SIGKDD, pp. 950–958, 2019.
- [46] Z. Zhang, F. Zhuang, H. Zhu, Z. Shi, H. Xiong, and Q. He, “Relational graph neural network with hierarchical attention for knowledge graph completion,” in Proceedings of AAAI, pp. 9612–9619, 2020.
- [47] J. Li, S. Li, W. X. Zhao, G. He, Z. Wei, N. J. Yuan, and J. Wen, “Knowledge-enhanced personalized review generation with capsule graph neural network,” in Proceedings of CIKM, pp. 735–744, 2020.
- [48] L. Hu, T. Yang, L. Zhang, W. Zhong, D. Tang, C. Shi, N. Duan, and M. Zhou, “Compare to the knowledge: Graph neural fake news detection with external knowledge,” in Proceedings of ACL, pp. 754–763, 2021.
- [49] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in Proceedings of AAAI, pp. 7370–7377, 2019.
- [50] X. Liu, X. You, X. Zhang, J. Wu, and P. Lv, “Tensor graph convolutional networks for text classification,” in Proceedings of AAAI, pp. 8409–8416, 2020.
- [51] Y. Zhang, X. Yu, Z. Cui, S. Wu, Z. Wen, and L. Wang, “Every document owns its structure: Inductive text classification via graph neural networks,” in Proceedings of ACL, pp. 334–339, 2020.
- [52] C. Li, X. Peng, H. Peng, J. Li, and L. Wang, "TextGTL: Graph-based transductive learning for semi-supervised text classification via structure-sensitive interpolation," in Proceedings of IJCAI, pp. 2680–2686, 2021.