
IRWE: Inductive Random Walk for Joint Inference of Identity and Position Network Embedding

Meng Qin mengqin_az@foxmail.com
Department of Computer Science & Engineering, Hong Kong University of Science & Technology
Dit-Yan Yeung dyyeung@cse.ust.hk
Department of Computer Science & Engineering, Hong Kong University of Science & Technology
Abstract

Network embedding, which maps graphs to distributed representations, is a unified framework for various graph inference tasks. According to the topology properties (e.g., structural roles and community memberships of nodes) to be preserved, it can be categorized into the identity and position embedding. Most existing methods can only capture one type of property. Some approaches can support inductive inference, which generalizes the embedding model to new unseen nodes or graphs, but rely on the availability of attributes. Due to the complicated correlations between topology and attributes, it is unclear for some inductive methods which type of property they can capture. In this study, we explore a unified framework for the joint inductive inference of identity and position embeddings without attributes. An inductive random walk embedding (IRWE) method is proposed, which combines multiple attention units to handle the random walk (RW) on graph topology and simultaneously derives identity and position embeddings that are jointly optimized. In particular, we demonstrate that some RW statistics can characterize node identities and positions while supporting the inductive inference. Experiments validate the superior performance of IRWE over various baselines for the transductive and inductive inference of identity and position embeddings.

1 Introduction

Network embedding (a.k.a. graph representation learning) is a commonly used framework for various graph inference tasks. It maps each node of a graph to a low-dimensional vector representation (a.k.a. embedding) with some key properties preserved. The derived representations can then be used to support various downstream inference tasks, e.g., node classification (Kipf & Welling, 2017; Veličković et al., 2018), node clustering (Ye et al., 2022; Qin et al., 2023a; Gao et al., 2023), and link prediction (Lei et al., 2018; 2019; Qin et al., 2023b; Qin & Yeung, 2023).

According to the topology properties to be preserved, existing network embedding techniques can be categorized into the identity and position embedding (Zhu et al., 2021). The identity embedding (a.k.a. structural embedding) preserves the structural role of each node characterized by its rooted subgraph, which is also defined as node identity. The position embedding (a.k.a. proximity-preserving embedding) captures the linkage similarity between nodes in terms of the overlap of local neighbors (i.e., community structures (Newman, 2006)), which is also defined as node position or proximity. In Fig. 1 (a), each color denotes a structural role. For instance, red and yellow may indicate the opinion leader and hole spanner in a social network (Yang et al., 2015). Moreover, there are two communities denoted by the two dotted circles in Fig. 1, where nodes in the same community have dense linkages and thus are more likely to have similar positions.

Figure 1: An example of identity and position embedding in terms of (b) struc2vec and (c) node2vec, where each color denotes a unique identity while nodes in the same community have similar positions.

The identity and position embedding should respectively force nodes with similar identities (e.g., $\{v_{1},v_{8}\}$) and positions (e.g., $\{v_{1},v_{2},v_{6}\}$) to have close embeddings. As a demonstration, we applied struc2vec (Ribeiro et al., 2017) and node2vec (Grover & Leskovec, 2016) (with embedding dimensionality $d=2$), which are typical identity and position embedding methods, to the example in Fig. 1 (a) and visualized the derived embeddings. Note that two nodes may have the same identity even though they are far away from each other. In contrast, nodes with similar positions must be close to each other, with dense linkage and short distances. Due to this contradiction, it is challenging to simultaneously capture the two types of properties in a common embedding space. For instance, $v_{1}$ and $v_{8}$, which have the same identity, have close identity embeddings in Fig. 1 (b), whereas their position embeddings are far away from each other in Fig. 1 (c). Since the two types of embeddings may be appropriate for different downstream tasks (e.g., structural role classification and community detection), we expect a unified embedding model.

Most conventional embedding methods (Wu et al., 2020; Grover & Leskovec, 2016; Ribeiro et al., 2017; Donnat et al., 2018) follow the embedding lookup scheme and can only support transductive embedding inference. In this scheme, node embeddings are model parameters optimized only for the currently observed graph topology. When applying the model to new unseen nodes or graphs, one needs to re-train the model from scratch. Compared with transductive methods, some state-of-the-art techniques (Hamilton et al., 2017; Velickovic et al., 2019) can support the advanced inductive inference, which directly generalizes the embedding model trained on observed topology to new unseen nodes or graphs without re-training.

Most existing inductive approaches (e.g., those based on graph neural networks (GNNs) (Wu et al., 2020)) rely on the availability of node attributes and an attribute aggregation mechanism. However, prior studies (Qin et al., 2018; Li et al., 2019; Wang et al., 2020; Qin & Lei, 2021) have demonstrated complicated correlations between graph topology and attributes. For instance, attributes may provide (i) complementary characteristics orthogonal to topology that improve the quality of downstream tasks or (ii) inconsistent noise that causes unexpected quality degradation. For most inductive methods, it is thus unclear whether their performance improvement comes from the incorporation of attributes or from a better exploration of topology. When attributes are unavailable, most inductive approaches require additional procedures to extract auxiliary attribute inputs from topology (e.g., one-hot node degree encodings). Our experiments demonstrate that some inductive baselines with these naive attribute extraction strategies may even fail to outperform conventional transductive methods on the inference of identity and position embeddings. It is also hard to determine which type of property (i.e., node identities or positions) some inductive approaches can capture.

In this study, we consider the unsupervised network embedding and explore a unified framework for the joint inductive inference of identity and position embeddings. To clearly distinguish between the two types of embeddings, we consider the case where topology is the only available information source. This eliminates the unclear influence from graph attributes due to the complicated correlations between the two sources. Different from most existing inductive approaches relying on the availability of node attributes, we propose an inductive random walk embedding (IRWE) method. It combines multiple attention units with different choices of key, query, and value to handle the random walk (RW) and induced statistics on graph topology.

RW is an effective technique to explore topology properties for network embedding. However, most RW-based methods (Grover & Leskovec, 2016; Ribeiro et al., 2017) follow the transductive embedding lookup scheme, failing to support the advanced inductive inference. We demonstrate that anonymous walk (AW) (Ivanov & Burnaev, 2018), the anonymization of RW, and its induced statistics can be informative features shared by all possible nodes and graphs and thus have the potential to support inductive inference.

Although the identity and position embeddings encode properties that may contradict one another, there remains a relation between them: nodes with different identities should have different contributions in forming the local community structures. For the example in Fig. 1, $v_{1}$ and $v_{2}$ may correspond to an opinion leader and an ordinary audience of a social network, where $v_{1}$ is expected to contribute more than $v_{2}$ to forming community #1. By incorporating this relation, IRWE jointly derives and optimizes two sets of embeddings w.r.t. node identities and positions. In particular, we demonstrate that some AW statistics can characterize node identities and thus derive identity embeddings, which can be further used to generate position embeddings. It is also expected that the joint learning of the two sets of embeddings can improve the quality of one another.

Our major contributions are summarized as follows. (i) In contrast to most existing inductive embedding methods relying on the availability of node attributes, we propose an alternative IRWE approach, whose inductiveness is only supported by the RW on graph topology. (ii) To the best of our knowledge, we are the first to explore a unified framework for the joint inductive inference of identity and position embeddings using RW, AW, and induced statistics. (iii) Experiments on public datasets validate the superiority of IRWE over various baselines for the transductive and inductive inference of identity and position embeddings.

2 Related Work

2.1 Identity & Position Embedding

In the past several years, a series of network embedding techniques have been proposed. Rossi et al. (2020) gave an overview of existing methods covering the identity and position embedding. Most existing embedding approaches can only capture one type of topology property (i.e., node identities or positions).

Perozzi et al. (2014) proposed DeepWalk that applies skip-gram to learn node embeddings from RWs on graph topology. The ability of DeepWalk to capture node positions is further validated in (Pei et al., 2020; Rossi et al., 2020). Grover & Leskovec (2016) modified the RW in DeepWalk to a biased form and introduced node2vec that can derive richer position embeddings by adjusting the trade-off between breadth- and depth-first sampling. Cao et al. (2015) reformulated the RW in DeepWalk to matrix factorization objectives. Wang et al. (2017), Ye et al. (2022), and Chen et al. (2023) introduced community-preserving embedding methods based on nonnegative matrix factorization, hyperbolic embedding, and graph contrastive learning.

Ribeiro et al. (2017) proposed struc2vec, an identity embedding method, by applying RW to a multilayer graph constructed via hierarchical similarities w.r.t. node degrees. Donnat et al. (2018) used graph wavelets to develop GraphWave and proved its ability to capture node identities. Pei et al. (2020) introduced struc2gauss, which encodes node identities in a space formulated by Gaussian distributions, and analyzed the effectiveness of different energy functions and similarity measures. Guo et al. (2020) enhanced the ability of GNNs to preserve node identities by reconstructing several manually-designed statistics. Chen et al. (2022) enabled the graph transformer to capture node identities by incorporating the rooted subgraph of each node.

Hoff (2007) demonstrated that the latent class and distance models can respectively capture node positions and identities but real networks may exhibit combinations of both properties. An eigen-model was proposed, which can generalize either the latent class model or distance model. However, the proposed eigen-model is a conventional probabilistic model and cannot simultaneously capture both properties in a unified framework. Zhu et al. (2021) proposed a PhUSION framework with three steps and showed which components can be used for the identity or position embedding. Although PhUSION reveals the similarity and difference between the two types of embeddings, it can only derive one type of embedding under each unique setting. Rossi et al. (2020) validated that some techniques (e.g., RW and attribute aggregation) of existing methods can only derive either identity or position embeddings. Srinivasan & Ribeiro (2020) proved that the relation between identity and position embeddings can be analogous to that of a probability distribution and its samples. Similarly, PaCEr (Yan et al., 2024) is a concurrent transductive method that considers the relation between the two types of embeddings based on RW with restart. Although these methods (Srinivasan & Ribeiro, 2020; Yan et al., 2024) can derive both identity and position embeddings, they only involve the optimization of one type of embedding and a simple transform to another type. In contrast, we focus on the joint learning and inductive inference of the two types of embeddings.

2.2 Inductive Network Embedding

Some recent studies explore the inductive inference that directly derives embeddings for new unseen nodes or graphs by generalizing the model parameters optimized on known topology. Hamilton et al. (2017) introduced GraphSAGE, an inductive GNN framework including neighbor sampling and feature aggregation with different choices of aggregation functions. GAT (Veličković et al., 2018) incorporates self-attention into the attribute aggregation of GNNs, which automatically determines the aggregation weights for the neighbors of each node. Velickovic et al. (2019) proposed DGI, which maximizes the mutual information between patch embeddings and high-level graph summaries. Without using the feature aggregation of GNNs, Nguyen et al. (2021) developed SANNE, which applies self-attention to handle RWs sampled from graph topology. However, the inductiveness of these methods relies on the availability of node attributes.

Some recent research analyzed the ability of several new GNN structures to capture node identities or positions in specific cases about node attributes (e.g., all the nodes have the same scalar attribute input (Xu et al., 2019)). Wu et al. (2019) and You et al. (2021) proposed DEMO-Net and ID-GNN that can capture node identities using the degree-specific multi-task graph convolution and heterogeneous message passing on the rooted subgraph of each node, respectively. Jin et al. (2020) leveraged AW statistics into the feature aggregation to enhance the ability of GNN to preserve node identities. P-GNN (You et al., 2019) can derive position-aware embeddings based on a distance-weighted aggregation scheme over the sets of sampled anchor nodes. However, these GNN structures can only capture either node identities or positions.

In contrast to the aforementioned methods, we explore a unified inductive framework for the joint inference of identity and position embeddings without relying on the availability and aggregation of attributes.

3 Problem Statements & Preliminaries

We consider the unsupervised network embedding on undirected unweighted graphs. A graph can be represented as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, with $\mathcal{V}=\{v_{1},v_{2},\dots,v_{N}\}$ and $\mathcal{E}=\{(v_{i},v_{j})|v_{i},v_{j}\in\mathcal{V}\}$ as the sets of nodes and edges. We also assume that graph topology is the only available information source and attributes are unavailable.

Definition 1 (Node Identity). Node identity describes the structural role that a node $v$ plays in graph topology (e.g., opinion leader and hole spanner w.r.t. red and yellow nodes in Fig. 1 (a)), which can be characterized by its $l$-hop rooted subgraph $\mathcal{G}_{s}(v,l)$. Given a pre-set $l$, nodes $(v,u)$ with similar subgraphs $(\mathcal{G}_{s}(v,l),\mathcal{G}_{s}(u,l))$ (e.g., measured by the WL graph isomorphism test) are expected to play similar structural roles and have similar identities.

Definition 2 (Node Position). Positions of nodes in graph topology can be encoded by their relative distances and can be further characterized by the linkage similarity in terms of the overlap of $l$-hop neighbors (i.e., community structures). Nodes with a high overlap of $l$-hop neighbors are more likely to (i) have short distances, (ii) belong to the same community, and thus (iii) have similar positions.

Definition 3 (Network Embedding). Given a graph $\mathcal{G}$, we consider the network embedding (a.k.a. graph representation learning) $f:\mathcal{V}\mapsto\mathbb{R}^{d}$ that maps each node $v$ to a vector $f(v)$ (a.k.a. embedding), with either node identities or positions preserved. We define $f(v):=\psi(v)$ (or $f(v):=\gamma(v)$) as the identity (or position) embedding if $\{\psi(v)\}$ (or $\{\gamma(v)\}$) preserve node identities (or positions). Namely, nodes $(v,u)$ with similar identities (or positions) should have close representations $(\psi(v),\psi(u))$ (or $(\gamma(v),\gamma(u))$). The learned embeddings are adopted as the inputs of downstream modules to support concrete inference tasks.

The embedding inference includes the transductive and inductive settings. A transductive method focuses on the optimization of $f$ on the currently observed topology $\mathcal{G}=(\mathcal{V},\mathcal{E})$ and can only support inference tasks on $\mathcal{V}$. In contrast, an inductive approach can directly generalize its model parameters, which are first optimized on $(\mathcal{V},\mathcal{E})$, to new unseen nodes $\mathcal{V}^{\prime}$ or even a new graph $\mathcal{G}^{\prime\prime}=(\mathcal{V}^{\prime\prime},\mathcal{E}^{\prime\prime})$ and support tasks on $\mathcal{V}^{\prime}$ or $\mathcal{V}^{\prime\prime}$ (i.e., the inductive inference for new nodes or across graphs). A transductive method cannot support the inductive inference, but an inductive approach can tackle both settings.

We focus on the joint inductive inference of identity and position embeddings. A novel IRWE method is proposed which combines multiple attention units to handle RWs and induced AWs.

Definition 4 (Random Walk & Anonymous Walk). An RW with length $l$ is a node sequence $w=(w^{(0)},w^{(1)},\dots,w^{(l)})$, where $w^{(j)}\in\mathcal{V}$ is the $j$-th node and $(w^{(j)},w^{(j+1)})\in\mathcal{E}$. Assume that the index $j$ starts from 0. For an RW $w$, one can map it to an AW $\omega=(I_{w}(w^{(0)}),\dots,I_{w}(w^{(l)}))$, where $I_{w}(w^{(j)})$ maps $w^{(j)}$ to its first occurrence index in $w$.

In Fig. 1 (a), $(v_{1},v_{4},v_{5},v_{1},v_{6})$ is a valid RW with $(0,1,2,0,3)$ as its AW. In particular, two RWs (e.g., $(v_{1},v_{4},v_{5},v_{1})$ and $(v_{8},v_{10},v_{9},v_{8})$) can be mapped to a common AW (i.e., $(0,1,2,0)$). In Section 4, we further demonstrate that AW and its induced statistics can be features shared by all possible topology and thus can support the inductive embedding inference without attributes.
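To make the RW-to-AW mapping concrete, here is a minimal Python sketch (the function name `rw_to_aw` is our own illustrative choice, not part of the paper's released code):

```python
def rw_to_aw(walk):
    """Map a random walk (a sequence of node IDs) to its anonymous walk.

    Each node is replaced by the index of its first occurrence in the walk,
    so walks visiting different nodes in the same pattern share one AW.
    """
    first_index = {}  # node -> index of its first occurrence
    aw = []
    for node in walk:
        if node not in first_index:
            first_index[node] = len(first_index)
        aw.append(first_index[node])
    return tuple(aw)

# Example from Fig. 1 (a): both walks map to the same AW (0, 1, 2, 0).
print(rw_to_aw(["v1", "v4", "v5", "v1"]))   # (0, 1, 2, 0)
print(rw_to_aw(["v8", "v10", "v9", "v8"]))  # (0, 1, 2, 0)
```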

4 Methodology

Figure 2: Model architecture of IRWE, including (a) the overall architecture and the modules of (b) identity and (c) position embeddings.
Figure 3: Running examples of the derivation of one-hot AW encodings $\{\rho(\omega)\}$, AW statistics $\{s(v)\}$, and high-order degree features $\{\delta(v)\}$ based on the local topology of node $v_{1}$ in Fig. 1.

In this section, we elaborate on the model architecture as well as the optimization and inference of IRWE. Fig. 2 (a) gives an overview of the model architecture, including two jointly optimized modules that derive identity embeddings $\{\psi(v)\}$ and position embeddings $\{\gamma(v)\}$.

4.1 Identity Embedding Module

Fig. 2 (b) highlights details of the identity embedding module. It derives identity embeddings $\{\psi(v)\}$ based on auxiliary AW embeddings $\{\varphi(\omega)\}$, AW statistics $\{s(v)\}$, and high-order degree features $\{\delta(v)\}$. Fig. 3 gives running examples of the extraction of $\{\varphi(\omega)\}$, $\{s(v)\}$, and $\{\delta(v)\}$ based on the local topology of node $v_{1}$ in Fig. 1, where we set the RW length $l=3$ and the number of sampled RWs $n_{S}=5$ as a demonstration. The optimization and inference of this module include the (1) AW embedding auto-encoder, (2) identity embedding encoder, and (3) identity embedding decoder.

4.1.1 AW Embedding Auto-Encoder

As discussed in Section 3, it is possible to map RWs with different sets of nodes to a common AW. For instance, $(0,1,2,0)$ is the common AW of RWs $(v_{1},v_{2},v_{3},v_{1})$ and $(v_{1},v_{4},v_{5},v_{1})$ in Fig. 3. Given a fixed length $l$, RWs on all possible topology structures can only be mapped to a finite set of AWs $\Omega_{l}$. Namely, $\Omega_{l}$ and its induced statistics are shared by all possible nodes and graphs, thus having the potential to support the inductive embedding inference. Based on this intuition, IRWE maintains an AW embedding $\varphi(\omega)\in\mathbb{R}^{d}$ for each AW $\omega\in\Omega_{l}$. In this setting, $\{\varphi(\omega)\}$ can be used as a special embedding lookup table for the derivation of inductive features regarding graph topology.

We also consider an additional constraint on $\{\varphi(\omega)\}$, where two AWs with more common elements in corresponding positions should have closer representations. For instance, $(0,1,2,1,2)$ and $(0,1,0,1,2)$ should be closer in the AW embedding space than $(0,1,2,1,2)$ and $(0,1,0,2,3)$. To apply this constraint, we transform each AW $\omega$ with length $l$ to a one-hot encoding $\rho(\omega)\in\{0,1\}^{(l+1)^{2}}$, where $\rho(\omega)_{j(l+1):(j+1)(l+1)}$ (i.e., the subsequence covering the $j$-th block of $l+1$ positions) is the one-hot encoding of the $j$-th element in $\omega$. For instance, we have $\rho(\omega)=[0000~0100~0010~0001]$ for $\omega=(0,1,2,3)$ in Fig. 3. An auto-encoder is then introduced to derive and regularize $\{\varphi(\omega)\}$, including an encoder and a decoder. Given an AW $\omega$, the encoder ${\rm Enc}_{\varphi}(\cdot)$ and decoder ${\rm Dec}_{\varphi}(\cdot)$ are defined as

\varphi(\omega)={\rm Enc}_{\varphi}(\omega):={\rm MLP}(\rho(\omega)),~\hat{\rho}(\omega)={\rm Dec}_{\varphi}(\omega):={\rm MLP}(\varphi(\omega)), (1)

which are both multi-layer perceptrons (MLPs). The encoder takes $\rho(\omega)$ as input and derives the AW embedding $\varphi(\omega)$. The decoder reconstructs $\rho(\omega)$ with $\varphi(\omega)$ as input. Since similar AWs have similar one-hot encodings, similar AWs can have close embeddings by minimizing the reconstruction error between $\{\rho(\omega)\}$ and $\{\hat{\rho}(\omega)\}$.
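As a concrete illustration of Eq. (1), below is a minimal PyTorch sketch of the AW embedding auto-encoder under our reading of the one-hot encoding (one block of size $l+1$ per walk element; the block layout, MLP widths, and names are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

def aw_one_hot(aw, l):
    """rho(omega): one block of size l + 1 per element of the AW (our reading)."""
    rho = torch.zeros((l + 1) * (l + 1))
    for j, value in enumerate(aw):
        rho[j * (l + 1) + value] = 1.0
    return rho

class AWAutoEncoder(nn.Module):
    """Enc_phi / Dec_phi of Eq. (1): MLPs mapping rho(omega) <-> phi(omega)."""
    def __init__(self, l, d, hidden=128):
        super().__init__()
        in_dim = (l + 1) * (l + 1)
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, d))
        self.dec = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, in_dim))

    def forward(self, rho):
        phi = self.enc(rho)       # AW embedding phi(omega)
        rho_hat = self.dec(phi)   # reconstruction used in the loss of Eq. (12)
        return phi, rho_hat

l, d = 3, 16
rho = torch.stack([aw_one_hot(aw, l) for aw in [(0, 1, 2, 3), (0, 1, 2, 0)]])
phi, rho_hat = AWAutoEncoder(l, d)(rho)
loss = ((rho - rho_hat) ** 2).sum()  # reconstruction error between rho and rho_hat
```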

4.1.2 Identity Embedding Encoder

IRWE derives identity embeddings $\{\psi(v)\}$ via the combination of AW embeddings $\{\varphi(\omega)\}$, inspired by the following Theorem 1 (Micali & Zhu, 2016).

Theorem 1. Let $\mathcal{G}_{s}(v,r)$ be the rooted subgraph induced by nodes with a distance less than $r$ from $v$. Let $q(v,l)$ be the distribution of AWs w.r.t. RWs starting from $v$ with length $l$. One can reconstruct $\mathcal{G}_{s}(v,r)$ in time $O(n^{2})$ with $O(n^{2})$ access to $[q(v,1),\cdots,q(v,l)]$, where $l=O(m)$; $n$ and $m$ are the numbers of nodes and edges in $\mathcal{G}_{s}(v,r)$.

For a given length $l$, let $\eta_{l}$ be the number of AWs. $q(v,l)$ can be represented as an $\eta_{l}$-dimensional vector, with the $j$-th element as the occurrence probability of the $j$-th AW. Since AWs with length $l$ include sequences of those with length less than $l$ (e.g., $(0,1,2,3)$ provides information about $(0,1,2)$), one can derive $q(v,k)$ ($k<l$) based on $q(v,l)$. Therefore, $q(v,l)$ can be used to characterize $\mathcal{G}_{s}(v,r)$ according to Theorem 1.

As defined in Section 3, nodes with similar rooted subgraphs are expected to play similar structural roles and thus have similar identities. For instance, in Fig. 1, $\mathcal{G}_{s}(v_{1},1)$ and $\mathcal{G}_{s}(v_{8},1)$ have the same topology structure, which is consistent with the same identity they have. Hence, $q(v,l)$ can characterize the identity of node $v$.

To estimate $q(v,l)$, we extract the AW statistic $s(v)$ for each node $v$ using Algorithm 1. We first sample RWs with length $l$ starting from $v$ via the standard unbiased strategy (Perozzi et al., 2014) (see Algorithm 6 in Appendix A). Let $\mathcal{W}^{(v)}$ be the set of sampled RWs starting from $v$. Each RW $w\in\mathcal{W}^{(v)}$ is then mapped to its AW. Let $\Omega_{l}$ be an AW lookup table including all the $\eta_{l}$ AWs with length $l$, which is fixed and shared by all possible topology. We define the AW statistic as $s(v):=[c(\omega_{1}),\cdots,c(\omega_{\eta_{l}})]\in\mathbb{Z}_{+}^{\eta_{l}}$, where $c(\omega_{j})$ is the frequency of the $j$-th AW in $\Omega_{l}$, as illustrated in Fig. 3.

Figure 4: Visualization of AW statistics $\{s(v)\}$ on Brazil.
Table 1: Variation of the Number of AWs and Its Reduced Value w.r.t. Length $l$ on Brazil
$l$ 4 5 6 7 8 9
$\eta_{l}$ 52 203 877 4,140 21,147 115,975
$\tilde{\eta}_{l}$ 15 52 195 610 1,540 3,173
Input: target node $v$; RW length $l$; sampled RWs $\mathcal{W}^{(v)}$; AW lookup table $\Omega_{l}$; number of AWs $\eta_{l}$
Output: AW statistic $s(v)$ w.r.t. $v$
1 $s(v)\leftarrow[0,0,\cdots,0]^{\eta_{l}}$ // Initialize $s(v)$
2 for each $w\in\mathcal{W}^{(v)}$ do
3       map RW $w$ to its AW $\omega$
4       get the index $j$ of AW $\omega$ in lookup table $\Omega_{l}$
5       $s(v)_{j}\leftarrow s(v)_{j}+1$ // Update $s(v)$
Algorithm 1 Derivation of AW Statistics

Although $\eta_{l}$ grows exponentially with the increase of length $l$, $\{s(v)\}$ are usually sparse. Fig. 4 visualizes the example AW statistics $\{s(v)\}$ derived from RWs on the Brazil dataset (see Section 5.1 for details) with $l=4$ and $|\mathcal{W}^{(v)}|=1,000$. The $i$-th row in Fig. 4 is the AW statistic $s(v_{i})$ of node $v_{i}$. Dark blue indicates that the corresponding element is 0. There exist many AWs $\{\omega_{j}\}$ not observed during the RW sampling (i.e., $\forall v\in\mathcal{V}$ s.t. $s(v)_{j}=0$). We then remove the terms w.r.t. these unobserved AWs in $\Omega_{l}$ and $\{s(v)\}$. Let $\tilde{\Omega}_{l}$, $\tilde{s}(v)$, and $\tilde{\eta}_{l}$ be the reduced $\Omega_{l}$, $s(v)$, and $\eta_{l}$. Table 1 shows the variation of $\eta_{l}$ and $\tilde{\eta}_{l}$ on Brazil as $l$ increases from 4 to 9, where the number of AWs is significantly reduced.
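Under our reading, Algorithm 1 and the reduction step could be implemented as the following Python sketch (function names `aw_statistics` and `reduce_statistics` are ours; `rw_to_aw` refers to the mapping sketched after Definition 4):

```python
import numpy as np

def aw_statistics(walks_per_node, aw_lookup):
    """Algorithm 1: count AW occurrences among the RWs sampled from each node.

    walks_per_node: dict node -> list of RWs (node sequences) starting at that node
    aw_lookup: dict AW tuple -> index j in Omega_l (shared by all graphs)
    Returns s: dict node -> length-eta_l count vector.
    """
    s = {}
    for v, walks in walks_per_node.items():
        s_v = np.zeros(len(aw_lookup), dtype=np.int64)
        for w in walks:
            s_v[aw_lookup[rw_to_aw(w)]] += 1  # index of this walk's AW in Omega_l
        s[v] = s_v
    return s

def reduce_statistics(s):
    """Drop AWs never observed for any node, yielding s_tilde of size eta_tilde_l."""
    observed = np.stack(list(s.values())).sum(axis=0) > 0  # boolean mask over Omega_l
    return {v: s_v[observed] for v, s_v in s.items()}, observed
```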

In addition to $\{\tilde{s}(v)\}$, one can also characterize node identities from the view of node degrees (Ribeiro et al., 2017; Wu et al., 2019) based on the following Hypothesis 1.

Hypothesis 1. Nodes with the same degree are expected to play the same structural role. This concept can be extended to high-order neighbors of nodes. Namely, nodes are expected to have similar identities if they have similar node degree statistics (e.g., distribution over all degree values) w.r.t. their high-order neighbors.

Input: target node $v$; RW length $l$; one-hot degree encoding dimensionality $e$; sampled RWs $\mathcal{W}^{(v)}$; minimum degree $\deg_{\min}$; maximum degree $\deg_{\max}$
Output: high-order degree feature $\delta(v)$ w.r.t. $v$
1 $\delta(v)\leftarrow[0,0,\cdots,0]^{(l+1)e}$ // Initialize degree feature $\delta(v)$
2 for each $w\in\mathcal{W}^{(v)}$ do
3       for $i$ from 0 to $l$ do
4             $u\leftarrow w^{(i)}$ // $i$-th node in current RW $w$
5             $\rho_{d}(u)\leftarrow[0,\cdots,0]\in\mathbb{R}^{e}$ // Initialize degree encoding $\rho_{d}(u)$
6             $j\leftarrow\lfloor(\deg(u)-\deg_{\min})e/(\deg_{\max}-\deg_{\min})\rfloor$
7             $\rho_{d}(u)_{j}\leftarrow 1$ // Update $\rho_{d}(u)$
8             $\delta(v)_{ie:(i+1)e}\leftarrow\delta(v)_{ie:(i+1)e}+\rho_{d}(u)$ // Update $\delta(v)$
Algorithm 2 Derivation of Degree Features

Based on this motivation, we extract the high-order degree feature $\delta(v)$ for each node $v$ using Algorithm 2. Given a node $u$, one can construct a bucket one-hot encoding $\rho_{d}(u)\in\{0,1\}^{e}$ w.r.t. its degree $\deg(u)$, where only the $j$-th element $\rho_{d}(u)_{j}$ is set to 1 with the remaining elements set to 0 and $j=\lfloor(\deg(u)-\deg_{\min})e/(\deg_{\max}-\deg_{\min})\rfloor$; $\deg_{\min}$ and $\deg_{\max}$ are the minimum and maximum degrees. Since the high-order neighbors of a node $v$ can be explored by the RWs $\mathcal{W}^{(v)}$ starting from $v$, we define $\delta(v)\in\mathbb{Z}^{(l+1)e}$ as an $(l+1)e$-dimensional vector, where the subsequence $\delta(v)_{ie:(i+1)e}$ is the sum of bucket one-hot degree encodings w.r.t. nodes occurring at the $i$-th position of RWs in $\mathcal{W}^{(v)}$. Fig. 3 gives a running example of deriving $\delta(v_{1})$ (with $e=5$) for node $v_{1}$ in Fig. 1.
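A numpy sketch mirroring Algorithm 2 might look as follows (the clamping of `deg_max` into the last bucket is our own choice, which the paper does not specify):

```python
import numpy as np

def degree_features(walks_per_node, degree, l, e, deg_min, deg_max):
    """Algorithm 2: position-wise sums of bucketed one-hot degree encodings.

    walks_per_node: dict node -> list of RWs of length l starting at that node
    degree: dict node -> degree
    Returns delta: dict node -> (l + 1) * e feature vector.
    """
    delta = {}
    for v, walks in walks_per_node.items():
        d_v = np.zeros((l + 1) * e, dtype=np.int64)
        for w in walks:
            for i, u in enumerate(w):  # i-th node of the current walk
                j = int((degree[u] - deg_min) * e / (deg_max - deg_min))
                j = min(j, e - 1)      # clamp deg_max into the last bucket (our choice)
                d_v[i * e + j] += 1    # add the bucket one-hot of u's degree
        delta[v] = d_v
    return delta
```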

Following the aforementioned discussions regarding Theorem 1 and Hypothesis 1, IRWE derives identity embeddings $\{\psi(v)\}$ via the adaptive combination of AW embeddings $\{\varphi(\omega)\}$ w.r.t. AW statistics $\{\tilde{s}(v)\}$ and degree features $\{\delta(v)\}$. The multi-head attention is applied to automatically determine the contribution of each AW embedding $\varphi(\omega)$ in the combination, where we treat $\{\varphi(\omega)\}$ as the key and value; the concatenated feature $[\tilde{s}(v)||\delta(v)]$ is used as the query. Before feeding $[\tilde{s}(v)||\delta(v)]\in\mathbb{R}^{\tilde{\eta}_{l}+(l+1)e}$ to the multi-head attention, we introduce a feature reduction unit ${\rm Red}_{s}(\cdot)$, an MLP, to reduce its dimensionality to $d$:

\bar{g}(v)={\rm Red}_{s}(v):={\rm MLP}([\tilde{s}(v)||\delta(v)]). (2)

The multi-head attention that derives identity embeddings $\{\psi(v)\}$ is defined as

{\bf{Z}}={\rm Att}({\bf{Q}},{\bf{K}},{\bf{V}})={\rm Att}(\{\bar{g}(v)\},\{\varphi(\omega)\},\{\varphi(\omega)\}), (3)

where ${\rm Att}(\cdot,\cdot,\cdot)$ is the standard multi-head attention unit (see Appendix D for details), with ${\bf{Q}}$, ${\bf{K}}$, and ${\bf{V}}$ as the inputs of query, key, and value. In (3), we have ${\bf{Q}}_{i,:}=\bar{g}(v_{i})$, ${\bf{K}}_{j,:}={\bf{V}}_{j,:}=\varphi(\omega_{j})$, and ${\bf{Z}}_{i,:}=\psi(v_{i})$.
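Under our reading of Eqs. (2)-(3), the identity embedding encoder could be sketched in PyTorch as follows (layer widths, the number of heads, and module names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Eqs. (2)-(3): reduce [s_tilde(v) || delta(v)] to a d-dim query g_bar(v),
    then attend over the AW embeddings {phi(omega)} (keys = values) to get psi(v)."""
    def __init__(self, feat_dim, d, num_heads=4):
        super().__init__()
        self.red = nn.Sequential(nn.Linear(feat_dim, d), nn.ReLU(), nn.Linear(d, d))  # Red_s
        self.att = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)

    def forward(self, node_feats, aw_embeds):
        # node_feats: (N, feat_dim) concatenated [s_tilde || delta] features
        # aw_embeds:  (eta_tilde, d) AW embeddings phi(omega) from the auto-encoder
        q = self.red(node_feats).unsqueeze(0)  # (1, N, d) queries g_bar(v)
        kv = aw_embeds.unsqueeze(0)            # (1, eta_tilde, d) keys/values
        psi, _ = self.att(q, kv, kv)           # (1, N, d) identity embeddings
        return psi.squeeze(0)

# Toy usage: 13 nodes, 20 observed AWs, d = 16.
psi = IdentityEncoder(feat_dim=35, d=16)(torch.rand(13, 35), torch.rand(20, 16))  # (13, 16)
```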

4.1.3 Identity Embedding Decoder

An identity embedding decoder ${\rm Dec}_{\psi}(\cdot)$ is introduced to regularize the identity embeddings $\{\psi(v)\}$ using the statistics $\{[\tilde{s}(v)||\delta(v)]\}$. It takes the embedding $\psi(v)$ of a node $v$ as input and reconstructs the corresponding $[\tilde{s}(v)||\delta(v)]$ via

\hat{g}(v)={\rm Dec}_{\psi}(v):={\rm MLP}(\psi(v)), (4)

where $\hat{g}(v)$ is the reconstructed statistic. This decoder forces $\{\psi(v)\}$ to capture the node identities hidden in $\{[\tilde{s}(v)||\delta(v)]\}$ by minimizing the reconstruction error between $\{\hat{g}(v)\}$ and $\{[\tilde{s}(v)||\delta(v)]\}$. Note that we only apply ${\rm Dec}_{\psi}(\cdot)$ to optimize $\{\psi(v)\}$ and do not need this unit in the inference phase.

4.2 Position Embedding Module

Fig. 2 (c) gives an overview of the position embedding module. It derives position embeddings $\{\gamma(v)\}$ based on (i) the identity embeddings $\{\psi(v)\}$ given by the previous module and (ii) auxiliary position encodings $\{\pi_{g}(v),\pi_{l}(j)\}$ extracted from the sampled RWs $\{\mathcal{W}^{(v)}\}$. Instead of using the attribute aggregation mechanism of GNNs, we convert the graph topology into a set of RWs and use the transformer encoder (Vaswani et al., 2017), a sophisticated structure for processing sequential data, to handle the RWs. Besides the sequential input, the transformer also requires a 'position' encoding to describe the position of each element in the sequence. As highlighted in Definition 3, the physical meaning of node position in non-Euclidean graphs is different from that in Euclidean sequences (e.g., sentences and RWs). To describe the (i) Euclidean position in RWs and (ii) node position in a graph, we introduce the local and global position encodings (denoted as $\pi_{l}(j)$ and $\pi_{g}(v)$) for a sequence position with index $j$ and each node $v$, respectively. The optimization and inference of this module include the (1) input fusion unit, (2) position embedding encoder, and (3) position embedding decoder.

4.2.1 Input Fusion Unit

The input fusion unit extracts $\{\pi_{l}(j),\pi_{g}(v)\}$ and derives the inputs of the transformer encoder by combining them with $\{\psi(v)\}$. Since the RW length $l$ is usually not very large (e.g., $\leq 10$ in our experiments), we define the local position encoding $\pi_{l}(j)\in\{0,1\}^{l+1}$ as the standard one-hot encoding of index $j$. Inspired by previous studies (Perozzi et al., 2014; Grover & Leskovec, 2016; Zhu et al., 2021) that validated the potential of RW for exploring local community structures, we extract the global position encoding $\pi_{g}(v)$ for each node $v$ w.r.t. the RW statistic $r(v)$ using Algorithm 3.

Input: target node $v$; sampled RWs $\mathcal{W}^{(v)}$; node set $\mathcal{V}$; random matrix ${\bf{\Theta}}\in\mathbb{R}^{|\mathcal{V}|\times d}$
Output: global position encoding $\pi_{g}(v)$ w.r.t. $v$
1 $r(v)\leftarrow[0,0,\cdots,0]^{|\mathcal{V}|}$ // Initialize RW statistic $r(v)$
2 for each $w\in\mathcal{W}^{(v)}$ do
3       for each $v_{j}\in w$ do
4             $r(v)_{j}\leftarrow r(v)_{j}+1$ // Update $r(v)$
5 $\pi_{g}(v)\leftarrow r(v){\bf{\Theta}}$ // Derive $\pi_{g}(v)$
Algorithm 3 Derivation of Global Position Encoding

Given a node $v$, we maintain $r(v)\in\mathbb{Z}^{|\mathcal{V}|}$, with the $j$-th element $r(v)_{j}$ as the frequency that node $v_{j}$ occurs in the RWs $\mathcal{W}^{(v)}$ starting from $v$. For instance, we have $r(v_{1})=[8,2,1,2,3,2,1,0,0,0,0,0,1]$ for the running example in Fig. 3 with 13 nodes. Since nodes in a community are densely connected, nodes within the same community are more likely to be reached via RWs. Therefore, nodes $(v,u)$ with similar positions are expected to have similar statistics $(r(v),r(u))$. We then derive $\pi_{g}(v)$ by mapping $r(v)$ to a $d$-dimensional vector via the following Gaussian random projection, an efficient dimension reduction technique that can preserve the relative distances between input features with a rigorous guarantee (Arriaga & Vempala, 2006):

\pi_{g}(v)=r(v){\bf{\Theta}}~{\rm with}~{\bf{\Theta}}\in\mathbb{R}^{|\mathcal{V}|\times d},~{\bf{\Theta}}_{ir}\sim\mathcal{N}(0,1/d). (5)
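A minimal numpy sketch of Algorithm 3 and Eq. (5) (names are ours; note that the same random matrix ${\bf{\Theta}}$ must be shared by all nodes so that relative distances remain comparable):

```python
import numpy as np

rng = np.random.default_rng(0)

def global_position_encodings(walks_per_node, node_index, d):
    """Algorithm 3 + Eq. (5): RW visit counts projected by one shared Gaussian matrix."""
    n = len(node_index)                                     # |V|
    theta = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))  # Theta_ir ~ N(0, 1/d)
    pi_g = {}
    for v, walks in walks_per_node.items():
        r_v = np.zeros(n)
        for w in walks:
            for u in w:
                r_v[node_index[u]] += 1                     # frequency of u in walks from v
        pi_g[v] = r_v @ theta                               # d-dimensional encoding pi_g(v)
    return pi_g
```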

In this setting, the non-Euclidean node positions in a graph topology are encoded in terms of the relative distances between $\{\pi_{g}(v)\}$. Hence, $\pi_{g}(v)$ has the initial ability to encode the position of node $v$. IRWE integrates the relation between node identities and positions based on the following Hypothesis 2.

Hypothesis 2. In a community (i.e., node cluster with dense linkages), nodes with different structural roles may have different contributions in forming the community structure.

For instance, in a social network, an opinion leader (e.g., $v_{1}$ and $v_{8}$ in Fig. 1) is expected to contribute more to forming the community it belongs to than an ordinary audience (e.g., $v_{2}$ and $v_{9}$ in Fig. 1). Based on this intuition, we use the identity embeddings $\{\psi(v)\}$ to reweight the global position encodings $\{\pi_{g}(v)\}$, with the reweighting contributions determined by a modified attention operation. Concretely, we set the identity embeddings $\{\psi(v)\}$ as the query and the global position encodings $\{\pi_{g}(v)\}$ as the key and value (i.e., ${\bf{Q}}_{i,:}=\psi(v_{i})$ and ${\bf{K}}_{i,:}={\bf{V}}_{i,:}=\pi_{g}(v_{i})$). The modified attention operation is defined as

{\bf{Z}}={\rm{ReAtt}}({\bf{Q}},{\bf{K}},{\bf{V}}):=({\rm{MLP}}({\bf{\tilde{Q}}})+{\rm{MLP}}({\bf{\tilde{K}}}))\odot{\bf{\tilde{V}}}, (6)

where ${\bf{\tilde{Q}}}:={\rm BN}({\bf{Q}})$, ${\bf{\tilde{K}}}:={\rm BN}({\bf{K}})$, and ${\bf{\tilde{V}}}:={\rm BN}({\bf{V}})$; ${\rm BN}(\cdot)$ and $\odot$ denote batch normalization and element-wise multiplication. In (6), we apply two MLPs to derive nonlinear mappings of the normalized $\{{\bf{Q}},{\bf{K}}\}$ and use their sum to support the element-wise reweighting of the normalized ${\bf{V}}$. For convenience, we denote the reweighted vector w.r.t. a node $v_{i}$ as $\bar{\pi}_{g}(v_{i})={\bf{Z}}_{i,:}$. Given an RW $w=(w^{(0)},w^{(1)},\cdots,w^{(l)})$, IRWE concatenates the reweighted vector $\bar{\pi}_{g}(w^{(j)})$ and local position encoding $\pi_{l}(j)$ for the $j$-th node and feeds its linear mapping to the transformer encoder:

t(w^{(j)}):=[\bar{\pi}_{g}(w^{(j)})||\pi_{l}(j)]{\bf{W}}_{t}, (7)

where ${\bf{W}}_{t}\in\mathbb{R}^{(d+l+1)\times d}$ is a trainable parameter.
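A possible PyTorch sketch of the ReAtt reweighting in Eq. (6) and the fusion in Eq. (7) (module names, MLP widths, and the batching of walks are our own illustrative choices):

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Eqs. (6)-(7): reweight pi_g(v) by psi(v), then fuse with the local encoding pi_l(j)."""
    def __init__(self, d, l):
        super().__init__()
        self.bn_q, self.bn_k, self.bn_v = nn.BatchNorm1d(d), nn.BatchNorm1d(d), nn.BatchNorm1d(d)
        self.mlp_q = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.mlp_k = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.w_t = nn.Linear(d + l + 1, d, bias=False)  # W_t in Eq. (7)
        self.l = l

    def forward(self, psi, pi_g, walks):
        # psi, pi_g: (N, d) identity embeddings and global position encodings
        # walks: (B, l + 1) integer node indices of the sampled RWs
        # Eq. (6): (MLP(BN(Q)) + MLP(BN(K))) * BN(V) with Q = psi, K = V = pi_g
        pi_bar = (self.mlp_q(self.bn_q(psi)) + self.mlp_k(self.bn_k(pi_g))) * self.bn_v(pi_g)
        local = torch.eye(self.l + 1).unsqueeze(0).expand(walks.size(0), -1, -1)  # pi_l(j)
        fused = torch.cat([pi_bar[walks], local], dim=-1)  # (B, l + 1, d + l + 1)
        return self.w_t(fused)                             # t(w^(j)) per walk position

fusion = InputFusion(d=16, l=3)
t = fusion(torch.rand(13, 16), torch.rand(13, 16), torch.randint(0, 13, (5, 4)))  # (5, 4, 16)
```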

4.2.2 Position Embedding Encoder

IRWE uses the transformer encoder ${\rm TransEnc}(\cdot)$ to handle an RW $w=(w^{(0)},\cdots,w^{(l)})$:

(\bar{t}(w^{(0)}),\cdots,\bar{t}(w^{(l)}))={\rm TransEnc}(t(w^{(0)}),\cdots,t(w^{(l)})). (8)

It takes the corresponding sequence of vectors $(t(w^{(0)}),\cdots,t(w^{(l)}))$ as input and derives another sequence of vectors $(\bar{t}(w^{(0)}),\cdots,\bar{t}(w^{(l)}))$ with the same dimensionality. ${\rm TransEnc}(\cdot)$ follows a multi-layer structure, with each layer including self-attention, skip connections, layer normalization, and a feedforward mapping. Due to the space limit, we omit the details of ${\rm TransEnc}(\cdot)$, which can be found in (Vaswani et al., 2017).

For an RW $w$ starting from a node $v$, the first output vector $\bar{t}(w^{(0)})=\bar{t}(v)$ can be treated as a representation of $v$. As we sample multiple RWs $\mathcal{W}^{(v)}$ starting from each node $v$, one can obtain multiple such representations based on $\mathcal{W}^{(v)}$. However, we only need one unique position embedding $\gamma(v)$ for $v$. Let $\bar{t}^{(v)}:=\{\bar{t}(w^{(0)})|w\in\mathcal{W}^{(v)}\}$. A naive strategy to derive $\gamma(v)$ is to average the representations in $\bar{t}^{(v)}$. Instead, we develop the following attentive readout function to compute the weighted mean of $\bar{t}^{(v)}$, with the weights determined by attention:

{\bf{z}}={\rm ROut}(\bar{t}^{(v)},\pi_{g}(v)):={\rm Att}(\pi_{g}(v),\bar{t}^{(v)},\bar{t}^{(v)}),~\gamma(v):={\bf{z}}{\bf{W}}_{\gamma}+{\bf{b}}_{\gamma},{\rm{~and~}}\bar{\gamma}(v):={\bf{z}}{\bf{W}}_{\bar{\gamma}}+{\bf{b}}_{\bar{\gamma}}. (9)

In (9), ${\rm Att}(\cdot,\cdot,\cdot)$ is the standard multi-head attention unit (see Appendix D for details), where we let the global position encoding $\pi_{g}(v)$ be the query (i.e., ${\bf{Q}}=\pi_{g}(v)\in\mathbb{R}^{1\times d}$) and $\bar{t}^{(v)}$ be the key and value (i.e., ${\bf{K}}_{j,:}={\bf{V}}_{j,:}=\bar{t}_{j}^{(v)}$). $\gamma(v)$ and $\bar{\gamma}(v)$ are the (i) position embedding and (ii) auxiliary context embedding of node $v$, with $\{{\bf{W}}_{\gamma},{\bf{b}}_{\gamma},{\bf{W}}_{\bar{\gamma}},{\bf{b}}_{\bar{\gamma}}\}$ as trainable parameters.
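A PyTorch sketch of the attentive readout in Eq. (9), assuming the per-walk outputs $\bar{t}^{(v)}$ have already been produced by the transformer encoder (names and the head count are illustrative):

```python
import torch
import torch.nn as nn

class AttentiveReadout(nn.Module):
    """Eq. (9): pool the n_I walk representations of a node into gamma(v) and gamma_bar(v)."""
    def __init__(self, d, num_heads=4):
        super().__init__()
        self.att = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
        self.w_gamma = nn.Linear(d, d)      # W_gamma, b_gamma
        self.w_gamma_bar = nn.Linear(d, d)  # W_gamma_bar, b_gamma_bar

    def forward(self, t_bar, pi_g):
        # t_bar: (N, n_I, d) first-token transformer outputs of the walks per node
        # pi_g:  (N, d) global position encodings used as the single query per node
        z, _ = self.att(pi_g.unsqueeze(1), t_bar, t_bar)  # attention-weighted mean of t_bar
        z = z.squeeze(1)                                  # (N, d)
        return self.w_gamma(z), self.w_gamma_bar(z)       # position & context embeddings

gamma, gamma_bar = AttentiveReadout(d=16)(torch.rand(13, 5, 16), torch.rand(13, 16))
```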

4.2.3 Position Embedding Decoder

The position embedding decoder is introduced to optimize the position embeddings $\{\gamma(v)\}$ together with the auxiliary context embeddings $\{\bar{\gamma}(v)\}$. Some existing embedding methods (Perozzi et al., 2014; Tang et al., 2015; Hamilton et al., 2017) are optimized via the following contrastive loss with negative sampling:

\min\mathcal{L}_{\rm{cnr}}=-\sum\nolimits_{(v_{i},v_{j})\in D}[p_{ij}\ln\sigma(\gamma(v_{i})\bar{\gamma}^{T}(v_{j})/\tau)+Qn_{j}\ln\sigma(-\gamma(v_{i})\bar{\gamma}^{T}(v_{j})/\tau)], (10)

where $D$ denotes the training set including positive and negative samples in terms of node pairs $\{(v_{i},v_{j})\}$; $p_{ij}$ is defined as the statistic of a positive node pair $(v_{i},v_{j})$ (e.g., the frequency that $(v_{i},v_{j})$ occurs in the RW sampling); $Q$ is the number of negative samples; $n_{j}$ is usually set to the probability that $(v_{i},v_{j})$ is selected as a negative sample; $\sigma(\cdot)$ is the sigmoid function; $\tau$ is a temperature parameter to be specified. We follow prior work (Tang et al., 2015) to let $p_{ij}:={\bf{A}}_{ij}/\deg(v_{i})$ (i.e., the probability that there is an edge from $v_{i}$ to $v_{j}$, with ${\bf{A}}\in\{0,1\}^{|\mathcal{V}|\times|\mathcal{V}|}$ as the adjacency matrix) and $n_{j}\propto(\sum\nolimits_{i:(v_{i},v_{j})\in\mathcal{E}}p_{ij})^{0.75}$. In the next section, we demonstrate that the contrastive loss (10) can be converted to a reconstruction loss such that the joint optimization of IRWE only includes several reconstruction objectives.

4.3 Model Optimization & Inference

Given an RW length $l$, let $\tilde{\Omega}_{l}$ be the reduced AW lookup table w.r.t. the reduced AW statistics $\{\tilde{s}(v)\}$ in (2). The optimization objective of the identity embeddings $\{\psi(v)\}$ can be described as

\min\mathcal{L}_{\psi}:=\mathcal{L}_{{\rm{reg-}}\varphi}+\alpha\mathcal{L}_{{\rm{reg-}}\psi}, (11)
\mathcal{L}_{{\rm{reg-}}\varphi}:=\sum\nolimits_{\omega\in\tilde{\Omega}_{l}}|\rho(\omega)-\hat{\rho}(\omega)|_{2}^{2}, (12)
\mathcal{L}_{{\rm{reg-}}\psi}:=\sum\nolimits_{v\in\mathcal{V}}|[\tilde{s}(v)||\delta(v)]/|\mathcal{W}^{(v)}|-\hat{g}(v)|_{2}^{2}, (13)

where $\mathcal{L}_{{\rm{reg-}}\varphi}$ regularizes the auxiliary AW embeddings $\{\varphi(\omega)\}$ by reconstructing the one-hot AW encodings $\{\rho(\omega)\}$ via the auto-encoder defined in (1); $\mathcal{L}_{{\rm{reg-}}\psi}$ regularizes the derived identity embeddings $\{\psi(v)\}$ by minimizing the error between (i) the features $\{[\tilde{s}(v)||\delta(v)]\}$ normalized by the number of sampled RWs $|\mathcal{W}^{(v)}|$ and (ii) the reconstructed values $\{\hat{g}(v)\}$ given by (4); $\alpha$ is a tunable parameter.

As described in Section 4.2.3, one can optimize the position embeddings $\{\gamma(v)\}$ via the contrastive loss (10). It can be converted to another reconstruction loss based on the following Proposition 1. In this setting, the optimization of $\{\psi(v)\}$ and $\{\gamma(v)\}$ only includes three simple reconstruction losses.

Proposition 1. Let ${\bf{\Gamma}}\in\mathbb{R}^{|\mathcal{V}|\times d}$ and ${\bf{\bar{\Gamma}}}\in\mathbb{R}^{|\mathcal{V}|\times d}$ be the matrix forms of $\{\gamma(v_{i})\}$ and $\{\bar{\gamma}(v_{i})\}$, with the $i$-th rows denoting the corresponding embeddings of node $v_{i}$. We introduce the auxiliary contrastive statistic ${\bf{C}}\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ in terms of a sparse matrix, where ${\bf{C}}_{ij}=\ln p_{ij}-\ln(Qn_{j})$ if $(v_{i},v_{j})\in\mathcal{E}$ and ${\bf{C}}_{ij}=0$ otherwise. The contrastive loss (10) is equivalent to the following reconstruction loss:

\min\mathcal{L}_{\gamma}=\left\|{\bf{\Gamma}}{\bf{\bar{\Gamma}}}^{T}/\tau-{\bf{C}}\right\|_{F}^{2}. (14)

The key idea to prove Proposition 1 is to set the partial derivative $\partial\mathcal{L}_{\rm cnr}/\partial[\gamma(v_{i})\bar{\gamma}^{T}(v_{j})/\tau]$ w.r.t. each edge $(v_{i},v_{j})$ to 0. We leave the proof of Proposition 1 to Appendix C.
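To illustrate Proposition 1, a small PyTorch sketch that builds the contrastive statistic ${\bf{C}}$ from an adjacency matrix and evaluates the reconstruction loss of Eq. (14) (the dense construction is for clarity only; the paper stores ${\bf{C}}$ as a sparse matrix):

```python
import torch

def contrastive_statistic(adj, Q=5):
    """C_ij = ln p_ij - ln(Q * n_j) on edges, 0 elsewhere (dense here for clarity)."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    p = adj / deg                             # p_ij = A_ij / deg(v_i)
    n = p.sum(dim=0) ** 0.75
    n = n / n.sum()                           # negative-sampling distribution n_j
    vals = torch.log(p.clamp(min=1e-12)) - torch.log(Q * n)
    return torch.where(adj > 0, vals, torch.zeros_like(adj))

def reconstruction_loss(gamma, gamma_bar, C, tau=1.0):
    """Eq. (14): || Gamma Gamma_bar^T / tau - C ||_F^2."""
    return ((gamma @ gamma_bar.T) / tau - C).pow(2).sum()

adj = (torch.rand(13, 13) < 0.2).float()
adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)  # toy undirected adjacency matrix
C = contrastive_statistic(adj)
loss = reconstruction_loss(torch.rand(13, 16), torch.rand(13, 16), C)
```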

Input: topology $(\mathcal{V},\mathcal{E})$; RW settings $\{l,n_{S},n_{I}\}$; local position encodings $\{\pi_{l}(j)\}$; optimization settings $\{m,m_{\psi},m_{\gamma},\lambda_{\psi},\lambda_{\gamma}\}$
Output: sampled RWs $\{\mathcal{W}^{(v)},\mathcal{W}_{I}^{(v)}\}$; reduced AW lookup table $\tilde{\Omega}_{l}$ & induced statistics $\{\tilde{s}(v),\delta(v),\pi_{g}(v)\}$; optimized model parameters $\{\theta^{*}_{\psi},\theta^{*}_{\gamma}\}$
1 get AW lookup table $\Omega_{l}$ w.r.t. length $l$
2 get minimum degree $\deg_{\min}$ & maximum degree $\deg_{\max}$ of $(\mathcal{V},\mathcal{E})$
3 get contrastive statistic ${\bf{C}}$
4 for each node $v\in\mathcal{V}$ do
5       sample $n_{S}$ RWs $\mathcal{W}^{(v)}$ starting from $v$ via Algorithm 6
6       get AW statistic $s(v)$ w.r.t. $\mathcal{W}^{(v)}$ via Algorithm 1
7       get degree feature $\delta(v)$ w.r.t. $\{\mathcal{W}^{(v)},\deg_{\min},\deg_{\max}\}$ via Algorithm 2
8       get global position encoding $\pi_{g}(v)$ w.r.t. $\mathcal{W}^{(v)}$ via Algorithm 3
9       randomly select $n_{I}$ RWs $\mathcal{W}_{I}^{(v)}$ from $\mathcal{W}^{(v)}$
10 get reduced AW statistics $\{\tilde{s}(v)\}$ by deleting unobserved AWs
11 get reduced AW lookup table $\tilde{\Omega}_{l}$ w.r.t. $\{\tilde{s}(v)\}$
12 initialize model parameters $\{\theta_{\psi},\theta_{\gamma}\}$
13 for $iter\_count$ from 1 to $m$ do
14       for $count_{\psi}$ from 1 to $m_{\psi}$ do
15             get $\{\hat{\rho}(\omega),\hat{g}(v)\}$ w.r.t. $\{\tilde{\Omega}_{l},\tilde{s}(v),\delta(v)\}$
16             get training loss $\mathcal{L}_{\psi}$ via (11)
17             optimize identity embeddings $\{\psi(v)\}$ via ${\rm Opt}(\lambda_{\psi},\theta_{\psi},\mathcal{L}_{\psi})$
19       for $count_{\gamma}$ from 1 to $m_{\gamma}$ do
20             get identity embeddings $\{\psi(v)\}$ w.r.t. $\{\tilde{\Omega}_{l},\tilde{s}(v),\delta(v)\}$
21             get position embeddings $\{\gamma(v)\}$ w.r.t. $\{\psi(v),\pi_{g}(v),\pi_{l}(j),\mathcal{W}_{I}^{(v)}\}$
22             get training loss $\mathcal{L}_{\gamma}$ via (14)
23             optimize position embeddings $\{\gamma(v)\}$ via ${\rm Opt}(\lambda_{\gamma},\{\theta_{\psi},\theta_{\gamma}\},\mathcal{L}_{\gamma})$
25       save model parameters $\{\theta_{\psi},\theta_{\gamma}\}$
Algorithm 4 Model Optimization of IRWE

Algorithm 4 summarizes the joint optimization procedure of IRWE. Before formally optimizing the model, we sample $n_{S}$ RWs $\mathcal{W}^{(v)}$ starting from each node $v$ and derive the statistics $\{\tilde{s}(v),\delta(v),\pi_{g}(v)\}$ induced by $\{\mathcal{W}^{(v)}\}$. In particular, we randomly select $n_{I}$ RWs $\mathcal{W}_{I}^{(v)}$ from $\mathcal{W}^{(v)}$ ($n_{I}<n_{S}$) for each node, which are handled by the transformer encoder in the position embedding module. Namely, we use only a fraction of the sampled RWs to derive $\{\gamma(v)\}$ due to the high complexity of the transformer. We only sample RWs and derive the induced statistics once; they are shared by all the following optimization iterations.

To jointly optimize $\{\psi(v)\}$ and $\{\gamma(v)\}$, one can combine (11) and (14) into a single hybrid optimization objective. However, our pre-experiments show that better embedding quality can be achieved if we separately optimize the two types of embeddings. One possible reason is that the two modules have unbalanced scales of parameters. Let $\theta_{\psi}$ and $\theta_{\gamma}$ be the sets of model parameters of the identity and position modules. The scale of $\theta_{\gamma}$ is larger than that of $\theta_{\psi}$ due to the application of the transformer. As described in lines 14-17 and lines 19-22, we respectively update $\{\psi(v)\}$ and $\{\gamma(v)\}$ $m_{\psi}\geq 1$ and $m_{\gamma}\geq 1$ times based on (11) and (14) in each iteration, where we can balance the optimization of $\{\psi(v)\}$ and $\{\gamma(v)\}$ by adjusting $m_{\psi}$ and $m_{\gamma}$.

Note that $\{\psi(v)\}$ are inputs of the position embedding module, providing node identity information for the inference of $\{\gamma(v)\}$. The optimization of $\{\gamma(v)\}$ also includes the update of $\theta_{\psi}$ via gradient descent, which in turn affects the inference of $\{\psi(v)\}$. Therefore, the two types of embeddings are jointly optimized even though we adopt a separate updating strategy. The Adam optimizer is used to update $\{\theta_{\psi},\theta_{\gamma}\}$, with $\lambda_{\psi}$ and $\lambda_{\gamma}$ as the learning rates for $\{\psi(v)\}$ and $\{\gamma(v)\}$. Finally, we save the model parameters after $m$ iterations.

During the model optimization, we save the sampled RWs $\{\mathcal{W}^{(v)},\mathcal{W}_{I}^{(v)}\}$, the reduced AW lookup table $\tilde{\Omega}_{l}$, and the induced statistics $\{\tilde{s}(v),\delta(v),\pi_{g}(v)\}$ (i.e., lines 4-11 in Algorithm 4) and use them as inputs of the transductive inference of $\{\psi(v)\}$ and $\{\gamma(v)\}$. The transductive inference then only includes one feedforward propagation through the model. We summarize this simple inference procedure in Algorithm 7 (see Appendix A).

Input: optimized model parameters $\{\theta^{*}_{\psi},\theta^{*}_{\gamma}\}$; new topology $(\mathcal{V}\cup\mathcal{V}^{\prime},\mathcal{E}^{\prime})$; RW settings $\{l,n_{S},n_{I}\}$; local position encodings $\{\pi_{l}(j)\}$; $\{\tilde{\Omega}_{l},\deg_{\min},\deg_{\max},\tilde{s}(v),\delta(v),\pi_{g}(v)\}$ derived in the model optimization on the old topology $(\mathcal{V},\mathcal{E})$
Output: inductive embeddings $\{\psi(v)\}$ & $\{\gamma(v)\}$ w.r.t. $\mathcal{V}^{\prime}$
1 for each node $v\in\mathcal{V}^{\prime}$ do
2       sample $n_{S}$ RWs $\mathcal{W}^{(v)}$ from $v$ w.r.t. $\mathcal{E}^{\prime}$ via Algorithm 6
3       get AW statistic $\tilde{s}^{\prime}(v)$ w.r.t. $\{\mathcal{W}^{(v)},\tilde{\Omega}_{l}\}$ via Algorithm 8
4       get degree feature $\delta^{\prime}(v)$ w.r.t. $\{\mathcal{W}^{(v)},\deg_{\min},\deg_{\max}\}$ via Algorithm 9
5       get global position encoding $\pi^{\prime}_{g}(v)$ w.r.t. $\{\mathcal{W}^{(v)},\mathcal{V}\}$ via Algorithm 10
6       randomly select $n_{I}$ RWs $\mathcal{W}_{I}^{(v)}$ from $\mathcal{W}^{(v)}$
7       add $\tilde{s}^{\prime}(v)$, $\delta^{\prime}(v)$, & $\pi^{\prime}_{g}(v)$ to $\{\tilde{s}(v)\}$, $\{\delta(v)\}$, & $\{\pi_{g}(v)\}$
8 get $\{\psi(v)\}$ based on $\{\tilde{\Omega}_{l},\tilde{s}(v),\delta(v)\}$ w.r.t. $\mathcal{V}\cup\mathcal{V}^{\prime}$
9 get $\{\gamma(v)\}$ based on $\{\psi(v),\pi_{g}(v),\pi_{l}(j),\mathcal{W}_{I}^{(v)}\}$ w.r.t. $\mathcal{V}^{\prime}$
Algorithm 5 Inductive Inference within a Graph

To support the inductive inference for new nodes within a graph, we adopt an incremental strategy to get the inductive statistics $\{\tilde{s}(v),\delta(v),\pi_{g}(v)\}$ via modified versions of Algorithms 1, 2, and 3 that utilize some intermediate results derived during the training on the old topology $(\mathcal{V},\mathcal{E})$. Algorithm 5 summarizes the inductive inference within a graph. Let $\mathcal{V}^{\prime}$ be the set of new nodes and $\mathcal{E}^{\prime}$ the edge set induced by $\mathcal{V}\cup\mathcal{V}^{\prime}$. We sample RWs $\mathcal{W}^{(v)}$ for each new node $v\in\mathcal{V}^{\prime}$ and get the AW statistic $\tilde{s}(v)$ w.r.t. the AWs in the lookup table $\tilde{\Omega}_{l}$ reduced on the old topology $(\mathcal{V},\mathcal{E})$ rather than all AWs. $\delta(v)$ is derived based on the one-hot degree encoding truncated by the minimum and maximum degrees of $(\mathcal{V},\mathcal{E})$. In the derivation of $\pi_{g}(v)$, we compute the truncated RW statistic $r(v)$ only w.r.t. the previously observed nodes $\mathcal{V}$. We detail the procedures to derive the inductive $\{\tilde{s}(v),\delta(v),\pi_{g}(v)\}$ in Algorithms 8, 9, and 10 (see Appendix A). Similar to the transductive inference, given the derived $\{\tilde{s}(v),\delta(v),\pi_{g}(v)\}$, we obtain the inductive $\{\psi(v)\}$ and $\{\gamma(v)\}$ via one feedforward propagation.

For the inductive inference across graphs, we sample RWs $\{\mathcal{W}^{(v)},\mathcal{W}_{I}^{(v)}\}$ on each new graph $(\mathcal{V}^{\prime\prime},\mathcal{E}^{\prime\prime})$. Since there are no shared nodes between the training and inference topology, we only incrementally compute the reduced/truncated statistics $\{\tilde{s}(v),\delta(v)\}$ using the procedures of lines 3-4 in Algorithm 5. We derive the global position encodings $\{\pi_{g}(v)\}$ from scratch via Algorithm 3. We summarize this inductive inference procedure in Algorithm 11 (see Appendix A). We also leave the detailed complexity analysis of IRWE in Appendix B.

5 Experiments

In this section, we elaborate on our experiments. Section 5.1 introduces the experiment setups. Evaluation results for the transductive and inductive embedding inference are described and analyzed in Sections 5.2 and 5.3. The ablation study and parameter analysis are presented in Sections 5.4 and 5.5. Due to the space limit, we leave detailed experiment settings and further results to Appendices D and E.

5.1 Experiment Setups

Table 2: Statistics of Datasets
Datasets N E K
PPI 3,890 38,739 50
Wiki 4,777 92,517 40
BlogCatalog 10,312 333,983 39
USA 1,190 13,599 4
Europe 399 5,993 4
Brazil 131 1,003 4
PPIs 1,021-3,480 4,554-26,688 10
Table 3: Details of Methods to be Evaluated
Methods Trans Ind Pos Ide
node2vec (Grover & Leskovec, 2016) \surd \surd
GraRep (Cao et al., 2015) \surd \surd
struc2vec (Ribeiro et al., 2017) \surd \surd
struc2gauss (Pei et al., 2020) \surd \surd
PaCEr (Yan et al., 2024) \surd \Delta \Delta
PhUSION (Zhu et al., 2021) \surd \Delta \Delta
GraphSAGE (Hamilton et al., 2017) \surd - -
DGI (Velickovic et al., 2019) \surd - -
GraphMAE (Hou et al., 2022) \surd - -
GraphMAE2 (Hou et al., 2023) \surd - -
P-GNN (You et al., 2019) \surd \surd
CSGCL (Chen et al., 2023) \surd \surd
GraLSP (Jin et al., 2020) \surd \surd
SPINE (Guo et al., 2019) \surd \surd
GAS (Guo et al., 2020) \surd \surd
SANNE (Nguyen et al., 2021) \surd - -
UGFormer (Nguyen et al., 2022) \surd - -
IRWE (ours) \surd \surd \surd

Datasets. We used seven datasets commonly adopted by related research to validate the effectiveness of IRWE, with statistics shown in Table 2, where $N$, $E$, and $K$ are the numbers of nodes, edges, and classes.

PPI, Wiki, and BlogCatalog are the first type of datasets (Grover & Leskovec, 2016; Zhu et al., 2021) providing the ground-truth of node positions for multi-label classification. USA, Europe, and Brazil are the second type of datasets (Ribeiro et al., 2017; Zhu et al., 2021) with node identity ground-truth for multi-class classification. In summary, PPI, Wiki, and BlogCatalog are widely used to evaluate the quality of position embedding while USA, Europe, and Brazil are well-known datasets for the evaluation of identity embedding.

PPIs is a widely used dataset for the inductive inference across graphs (Hamilton et al., 2017; Veličković et al., 2018), which includes a set of protein-protein interaction graphs (in terms of connected components). In addition to graph topology, PPIs also provides node features and ground-truth for node classification. As stated in Section 3, we do not consider graph attributes due to the complicated correlations between topology and attributes. It is also unclear whether the classification ground-truth is dominated by topology or attributes. Therefore, we only used the graph topology of PPIs.

Downstream Tasks. We adopted multi-label and multi-class node classification to evaluate position and identity embeddings on the first and second types of datasets, respectively. In particular, each node may belong to multiple classes in multi-label classification, while each node belongs to exactly one class in multi-class classification. We used the Micro F1-score as the quality metric for both classification tasks. To avoid the case where some classes are absent from the training examples, we removed classes with very few members (i.e., fewer than 8) when conducting node classification.

We also adopted unsupervised node clustering to evaluate the quality of identity and position embeddings. Inspired by spectral clustering (Von Luxburg, 2007) and Hypothesis 1, we constructed an auxiliary (top-10) similarity graph \mathcal{G}_{D} based on the high-order degree features \{\delta^{\prime}(v)\in\mathbb{R}^{(l+1)e}\} derived via a procedure similar to Algorithm 2. The only difference between \{\delta^{\prime}(v)\} (used for evaluation) and \{\delta(v)\} (used in IRWE) is that \delta^{\prime}(v) is derived from the rooted subgraph \mathcal{G}_{s}(v,l^{\prime}) rather than the sampled RWs \mathcal{W}^{(v)}. To obtain \{\delta^{\prime}(v)\}, we set l=5 (i.e., the order of neighbors) and e=500 (i.e., the dimensionality of the one-hot degree encoding) for the first type of datasets, while we let l=3 and e=200 for PPIs. Namely, we applied a clustering algorithm to embeddings learned on the original graph \mathcal{G} but evaluated the clustering result on \mathcal{G}_{D}. We define this task as node identity clustering and expect it to measure the quality of identity embeddings because the high-order degree features \{\delta^{\prime}(v)\} can capture node identities.
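As an illustration of this construction, the following minimal Python sketch builds such a top-10 similarity graph from precomputed high-order degree features; the cosine similarity metric and the binary symmetrization are illustrative assumptions rather than the exact settings of our implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_similarity_graph(degree_features: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Build a top-k similarity graph G_D from high-order degree features.

    degree_features: an (N, (l+1)*e) array with one row per node, analogous to
    the evaluation features {delta'(v)} described above. Returns a symmetric
    binary adjacency matrix A_D.
    """
    n = degree_features.shape[0]
    # Find the top_k most similar nodes for each node (position 0 is the node itself).
    nn = NearestNeighbors(n_neighbors=top_k + 1, metric="cosine").fit(degree_features)
    _, idx = nn.kneighbors(degree_features)
    a_d = np.zeros((n, n))
    for i in range(n):
        for j in idx[i, 1:]:  # skip the node itself
            a_d[i, j] = 1.0
    return np.maximum(a_d, a_d.T)  # symmetrize so that G_D is undirected
```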

In addition, we treated node clustering evaluated on the original graph \mathcal{G} as community detection (Newman, 2006), a task commonly used for the evaluation of position embeddings. Normalized cut (NCut) (Von Luxburg, 2007) w.r.t. \mathcal{G}_{D} and modularity (Newman, 2006) w.r.t. \mathcal{G} were used as the quality metrics for node identity clustering and community detection, respectively. Details of NCut and modularity are given in Appendix D.

Logistic regression and KMeans were used as the downstream algorithms for node classification and clustering, respectively. A larger F1-score and modularity as well as a smaller NCut imply better performance on the downstream tasks, thus indicating better embedding quality.
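The downstream evaluation can be sketched as follows; the scikit-learn estimators and hyper-parameters (e.g., max_iter) are illustrative choices rather than the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

def evaluate_classification(emb, labels, train_idx, test_idx, multi_label=False):
    """Train a downstream logistic regression on embeddings and report Micro F1.

    emb: (N, d) embedding matrix; labels: (N,) class ids for multi-class
    classification or an (N, K) binary indicator matrix for the multi-label case.
    """
    base = LogisticRegression(max_iter=1000)
    clf = OneVsRestClassifier(base) if multi_label else base
    clf.fit(emb[train_idx], labels[train_idx])
    pred = clf.predict(emb[test_idx])
    return f1_score(labels[test_idx], pred, average="micro")

def cluster_embeddings(emb, num_clusters):
    """Cluster embeddings with KMeans; the result is then scored by NCut or modularity."""
    return KMeans(n_clusters=num_clusters, n_init=10).fit_predict(emb)
```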

In summary, we adopted (i) node identity clustering and (ii) multi-label node classification to evaluate identity and position embeddings, respectively, on the first type of datasets. For the second type of datasets, (i) multi-class node classification and (ii) community detection were used to evaluate identity and position embeddings. For PPIs, we only applied the unsupervised (i) node identity clustering and (ii) community detection to evaluate the two types of embeddings, since we did not consider its classification ground-truth.

Baselines. We compared IRWE with 17 unsupervised baselines, covering identity and position embedding as well as transductive and inductive approaches. Table 3 summarizes all the methods to be evaluated, where '-' denotes that it is unclear which type of property a baseline can capture. PhUSION has multiple variants using different proximities for different types of embeddings. We used the variants with (i) positive point-wise mutual information and (ii) heat kernel, which are the recommended proximities for position and identity embedding, as two baselines denoted as PhN-PPMI and PhN-HK. Each variant of PhUSION can only derive one type of embedding. Although PaCEr also considers the correlation between identity and position embeddings (denoted as PaCEr(I) and PaCEr(P)) and derives both types of embeddings, it only optimizes PaCEr(P) based on the observed graph topology and simply transforms PaCEr(P) into PaCEr(I).

For each transductive baseline, we can determine whether it captures node identities or positions. Among the inductive baselines, GraLSP, SPINE, and GAS are claimed to be identity embedding methods, while P-GNN and CSGCL can preserve node positions. Similar to our method, GraLSP and SPINE use RWs and induced statistics to enhance the embedding quality. SANNE applies the transformer encoder to handle RWs. All the inductive baselines rely on the availability of node attributes. We used the bucket one-hot encodings of node degrees as their attribute inputs, a widely used strategy for inductive methods when attributes are unavailable. All the transductive methods learn their embeddings based only on graph topology. To validate the challenge of capturing node identities and positions in one embedding space, we introduced an additional baseline [n2v||s2v] formed by concatenating the embeddings of node2vec and struc2vec.

Most of the baselines can only generate one set of embeddings, which has to be used to support the two different tasks on each dataset. In contrast, our IRWE method can support the inductive inference of identity and position embeddings, simultaneously generating two sets of embeddings. For convenience, we denote the derived identity and position embeddings as IRWE(\psi) and IRWE(\gamma).

As stated in Section 3, we consider unsupervised network embedding. There exist supervised inductive methods (e.g., GAT (Veličković et al., 2018), GIN (Xu et al., 2019), ID-GNN (You et al., 2021), DE-GNN (Li et al., 2020), DEMO-Net (Wu et al., 2019), and SAT (Chen et al., 2022)) that do not provide unsupervised training objectives in their original designs. To ensure a fair comparison, these supervised baselines are not included in our experiments. Due to the space limit, details of the layer configurations, parameter settings, and experiment environment are given in Appendix D.

5.2 Evaluation of Transductive Embedding Inference

Table 4: Transductive Embedding Inference w.r.t. Node Position Classification and Node Identity Clustering
PPI Wiki BlogCatalog
F1-score\uparrow (%) Ncut\downarrow F1-score\uparrow (%) Ncut\downarrow F1-score\uparrow (%) Ncut\downarrow
20% 40% 60% 80% 20% 40% 60% 80% 20% 40% 60% 80%
node2vec 17.79 19.15 20.16 21.58 45.18 47.43 51.05 52.25 53.87 38.89 37.20 39.45 40.45 41.58 36.82
GraRep 17.94 20.54 22.00 23.49 39.92 49.87 53.33 54.18 55.09 37.12 30.83 33.58 34.71 35.68 34.32
PaCEr(P) 15.94 17.30 18.69 19.70 45.42 43.80 45.81 46.32 47.76 36.93 35.06 37.98 38.89 39.76 34.10
PhN-PPMI 20.17 22.34 23.64 24.84 45.31 46.11 49.04 50.35 51.22 38.88 38.86 40.97 41.69 42.71 36.21
struc2vec 7.70 7.99 8.04 8.47 30.51 40.70 41.14 41.17 41.34 30.96 14.67 15.09 15.28 14.79 30.47
struc2gauss 10.59 11.40 11.91 12.59 38.01 41.09 41.06 40.86 41.13 27.66 17.16 17.21 17.28 16.95 34.41
PaCEr(I) 9.93 10.38 10.70 10.86 40.40 41.70 41.70 41.22 42.08 24.93 16.26 16.44 16.55 16.36 32.89
PhN-HK 9.60 9.57 9.44 9.95 31.52 41.54 41.58 41.35 41.77 29.47 17.28 17.33 17.32 17.04 34.45
[n2v||s2v] 14.29 14.67 14.66 14.38 31.99 38.95 39.75 41.85 44.37 32.32 26.94 28.75 31.34 33.75 31.14
GraSAGE 6.59 6.29 7.12 6.88 36.00 41.14 41.06 40.82 40.89 30.71 16.79 16.77 16.70 16.56 34.28
DGI 10.98 12.37 13.36 14.24 45.35 42.63 43.44 43.91 44.33 36.85 19.24 20.81 21.92 22.22 33.35
GraMAE 11.58 12.76 13.76 14.00 37.72 42.01 42.52 42.87 43.32 25.14 19.29 20.38 20.57 21.02 28.35
GraMAE2 9.63 10.40 11.26 11.52 45.26 41.85 42.04 41.73 42.34 38.26 17.76 18.14 18.23 18.29 35.56
P-GNN 11.70 12.71 13.71 13.75 39.74 43.16 44.38 44.92 45.88 37.31 19.29 20.64 21.39 21.43 34.75
CSGCL 14.93 16.14 17.13 17.81 41.66 42.77 43.39 43.47 44.06 25.94 18.91 19.25 19.30 19.42 30.58
GraLSP 9.08 9.35 9.37 9.95 29.76 41.05 41.00 40.62 41.40 11.00 16.65 17.50 17.44 17.58 23.46
SPINE 8.36 9.07 9.97 10.41 44.49 40.92 40.87 40.59 40.50 38.89 16.25 16.51 16.50 16.39 37.47
GAS 9.25 9.88 10.59 11.15 39.47 41.29 41.40 41.44 42.24 34.59 18.07 18.47 18.76 18.94 34.11
SANNE 7.77 8.18 8.05 9.57 46.87 41.07 41.08 41.01 41.56 38.35 16.56 16.77 16.70 16.72 37.10
UGFormer 6.57 6.04 6.31 6.31 32.30 41.15 41.07 40.81 40.88 21.16 16.73 16.84 16.76 16.53 28.01
IRWE(\psi) 11.45 13.52 14.48 15.60 28.92 45.18 46.49 46.93 47.46 9.85 17.84 18.73 19.05 19.20 24.58
IRWE(\gamma) 19.63 22.75 24.20 25.88 42.78 52.02 54.29 54.94 56.20 19.31 38.99 41.42 41.86 42.76 36.07
Table 5: Transductive Embedding Inference w.r.t. Node Identity Classification and Community Detection
USA Europe Brazil
F1-score\uparrow (%) Mod\uparrow F1-score\uparrow (%) Mod\uparrow F1-score\uparrow (%) Mod\uparrow
20% 40% 60% 80% (%) 20% 40% 60% 80% (%) 20% 40% 60% 80% (%)
node2vec 47.02 50.42 53.16 53.36 25.88 36.19 39.65 41.98 41.46 7.43 32.50 32.12 39.75 37.14 11.76
GraRep 52.52 57.86 61.93 62.01 27.54 39.18 44.32 48.09 44.87 11.48 34.89 40.45 43.50 42.14 19.76
PaCEr(P) 47.44 49.36 51.46 53.95 22.12 42.56 45.27 48.76 50.00 4.13 37.83 39.09 45.50 50.71 2.35
PhN-PPMI 50.28 54.31 57.45 57.05 25.03 36.58 40.54 44.21 43.17 7.26 32.60 36.51 39.00 40.00 9.12
struc2vec 56.85 58.97 59.91 62.52 0.38 51.85 53.93 57.27 57.31 -5.61 65.43 71.66 75.25 74.29 -1.43
struc2gauss 60.88 61.89 62.32 64.36 3.27 49.50 53.38 55.53 56.34 -6.49 68.69 72.72 75.50 73.57 -3.31
PaCEr(I) 59.80 60.47 60.42 61.40 -0.19 50.61 54.84 56.12 59.92 -3.60 63.91 67.86 68.18 73.75 -2.74
PhN-HK 58.64 60.97 62.43 63.19 13.14 50.32 52.13 54.79 56.09 -6.01 61.84 68.78 74.75 69.28 -5.19
[n2v||s2v] 54.02 55.69 58.79 57.05 2.91 48.25 52.23 54.79 52.43 -5.22 59.78 65.75 64.75 60.71 2.28
GraSAGE 45.49 50.06 54.70 55.37 1.55 34.23 46.31 45.70 46.82 -0.71 35.86 39.09 54.00 57.85 2.93
DGI 54.62 57.78 58.85 59.49 3.45 44.23 48.05 52.39 49.02 -4.78 36.19 41.36 48.25 47.85 9.18
GraMAE 58.86 62.33 64.62 64.11 5.86 45.19 49.10 52.72 49.26 1.70 44.56 55.00 63.00 66.42 3.18
GraMAE2 55.91 56.90 57.67 59.07 18.73 35.97 40.09 43.96 42.92 7.03 36.63 38.93 39.00 37.85 5.95
P-GNN 58.55 61.29 62.54 61.34 21.48 45.33 47.06 51.65 50.00 0.29 46.08 50.15 49.75 52.85 1.78
CSGCL 59.49 59.41 61.79 61.09 21.14 46.87 53.03 56.36 52.68 -8.61 38.91 44.39 48.50 52.14 13.04
GraLSP 57.89 58.87 60.58 61.84 2.72 42.59 47.66 45.70 51.70 0.65 43.15 52.12 61.25 64.28 0.32
SPINE 35.07 37.42 40.64 40.25 2.16 25.12 25.82 23.71 30.00 -0.08 23.36 21.51 19.25 23.57 0.05
GAS 60.46 62.97 64.48 64.45 22.45 51.56 52.18 55.12 58.04 5.20 67.06 69.09 72.75 74.28 1.51
SANNE 54.95 56.86 58.15 61.01 14.59 44.63 50.25 54.46 49.51 6.21 40.43 45.61 51.25 51.43 5.90
UGFormer 51.61 53.85 53.95 55.88 0.78 36.12 43.83 45.79 48.29 1.35 35.22 39.70 47.00 46.42 2.65
IRWE(\psi) 58.02 63.58 66.19 65.46 1.78 52.06 54.88 58.10 60.24 -0.52 70.22 74.09 72.25 75.00 1.17
IRWE(\gamma) 55.25 58.69 60.64 61.68 31.24 43.67 47.41 50.25 49.27 17.74 36.85 40.15 44.25 41.43 21.26

We first evaluated the transductive embedding inference of all the methods on the first and second types of datasets. For the two classification tasks, we randomly sampled T\in\{20\%,40\%,60\%,80\%\} and 10\% of the nodes to form the training and validation sets, with the remaining nodes as the test set on each dataset. Similar to 10-fold cross-validation, we repeated the data splitting 10 times, where we split the node set into 10 subsets with each one serving as the validation set in one round, and used the average quality w.r.t. the validation set to tune the parameters of all the methods. Evaluation results of the transductive embedding inference are shown in Tables 4 and 5, where metrics are in bold or underlined if they are the best or within the top-3.

For the transductive baselines, identity embedding approaches (i.e., struc2vec, struc2gauss, and PhN-HK) and position embedding methods (i.e., node2vec, GraRep, and PhN-PPMI) form the groups with the top clustering performance (in terms of NCut and modularity) on the first and second types of datasets, respectively. Since prior studies have demonstrated the ability of these transductive baselines to capture node identities or positions, the evaluation results validate our motivation of using node identity clustering and community detection to evaluate the quality of identity and position embeddings. Our node identity clustering results also validate Hypothesis 1 that the high-order degree features \{\delta(v)\} can encode node identity information.

On each dataset, most baselines achieve relatively high performance only for one task w.r.t. identity or position embedding, indicating that most existing embedding methods can only capture either node identities or positions. In most cases, [n2v||s2v] outperforms neither (i) node2vec on tasks w.r.t. node positions nor (ii) struc2vec on those w.r.t. node identities. This implies that a simple integration of the two types of embeddings may even damage the ability to capture node identities or positions. Therefore, it is challenging to preserve both properties in a common embedding space.

For tasks w.r.t. each type of embedding, conventional transductive baselines can achieve much better performance than most of the advanced inductive baselines. One possible reason is that existing inductive embedding approaches rely on the availability of node attributes. However, there are complicated correlations between graph topology and attributes as discussed in Section 1. Our results imply that the embedding quality of some inductive baselines is largely affected by their attribute inputs. Some standard settings for the case without available attributes (e.g., using one-hot degree encodings as attribute inputs) cannot help derive informative identity or position embeddings.

Our IRWE method achieves the best quality for both identity and position embedding in most cases. It indicates that IRWE can jointly derive informative identity and position embeddings in a unified framework.

5.3 Evaluation of Inductive Embedding Inference

Table 6: Inductive Inference for New Nodes within a Graph and across Graphs
PPI Wiki BlogCatalog USA Europe Brazil PPIs
F1\uparrow Ncut\downarrow F1\uparrow Ncut\downarrow F1\uparrow Ncut\downarrow F1\uparrow Mod\uparrow F1\uparrow Mod\uparrow F1\uparrow Mod\uparrow Mod\uparrow Ncut\downarrow
(%) (%) (%) (%) (%) (%) (%) (%) (%) (%)
GraSAGE 7.35 36.13 40.71 28.86 16.30 26.50 57.81 0.88 52.68 0.02 71.42 2.18 3.90 6.69
DGI 14.64 45.18 44.16 36.89 22.76 33.71 65.54 3.90 52.19 0.69 58.57 3.65 3.52 8.31
GraMAE 14.54 38.58 43.78 24.73 20.94 27.78 66.72 1.62 54.15 1.11 64.29 3.42 2.82 7.40
GraMAE2 11.81 45.28 41.27 37.88 19.05 35.94 59.83 9.56 46.34 4.46 42.86 5.34 3.68 8.14
PGNN 14.29 42.91 43.74 37.57 22.06 35.33 59.49 16.15 51.70 1.36 61.42 3.93 7.83 7.95
CSGCL 16.13 41.46 43.96 25.54 19.30 31.14 63.36 18.17 56.59 -7.81 61.43 5.81 -0.26 5.88
GraLSP 6.39 47.24 40.62 31.78 16.51 37.39 25.21 -0.34 44.39 0.12 38.57 -0.80 0.88 8.48
SPINE 9.12 47.21 40.80 38.95 16.87 37.45 44.87 0.76 24.88 0.16 37.14 0.41 0.38 8.63
GAS 11.50 39.33 41.84 34.44 18.94 33.89 64.87 23.05 56.59 3.51 68.57 4.27 -2.10 7.15
SANNE 5.19 45.58 40.86 33.88 16.39 34.11 25.71 0.01 26.34 -0.01 25.13 -0.01 1.43 8.22
UGFormer 5.59 34.70 40.71 21.39 16.23 27.84 59.83 2.03 45.85 0.73 62.86 1.95 -0.83 5.43
IRWE(\psi) 10.53 32.95 41.05 15.93 16.10 26.58 68.40 10.07 60.00 -1.12 72.86 -5.28 0.16 4.62
IRWE(\gamma) 18.29 45.54 47.32 19.47 27.04 35.96 49.75 25.54 45.85 11.65 44.29 12.31 11.41 8.47

We further consider the inductive inference (i) for new unseen nodes within a graph and (ii) across graphs, which were evaluated on (i) the first two types of datasets (i.e., PPI, Wiki, BlogCatalog, USA, Europe, and Brazil) and (ii) PPIs, respectively. Only the inductive methods could be evaluated in these settings, since the transductive baselines cannot support inductive inference.

For the inductive inference within a graph, we randomly selected 80\%, 10\%, and 10\% of the nodes on each single graph to form the training, validation, and test sets (denoted as \mathcal{V}_{trn}, \mathcal{V}_{val}, and \mathcal{V}_{tst}), where \mathcal{V}_{val} and \mathcal{V}_{tst} represent sets of new nodes not observed in \mathcal{V}_{trn}. The embedding model of each inductive method was optimized only on the topology induced by \mathcal{V}_{trn}. When validating and testing a method using the node classification task, embeddings w.r.t. \mathcal{V}_{trn} and \mathcal{V}_{trn}\cup\mathcal{V}_{val} were used to train the downstream logistic regression. We repeated the data splitting 10 times following a strategy similar to 10-fold cross-validation and used the average quality w.r.t. the validation set to tune the parameters of all the methods.

For the inductive inference across graphs, we sampled 3 graphs from PPIs, denoted as \mathcal{G}_{trn}, \mathcal{G}_{val}, and \mathcal{G}_{tst}, which were used for training, validation, and testing, respectively. We first optimized the embedding model on \mathcal{G}_{trn}. To validate or test the model, we derived inductive embeddings w.r.t. \mathcal{G}_{val} or \mathcal{G}_{tst} and obtained clustering results for evaluation by applying KMeans. This procedure was repeated 5 times, so that 15 graphs were sampled in total. Finally, the average quality over the 5 data splits was reported.

Evaluation results of the inductive embedding inference are reported in Table 6, where metrics are in bold or underlined if they are the best or within the top-3. IRWE achieves the best quality in most cases. In particular, its quality metrics are significantly better than those of the other inductive baselines, whose inductiveness relies on the availability of node attributes. These results further demonstrate that IRWE can support the inductive inference of identity and position embeddings, simultaneously generating two sets of informative embeddings, without relying on the availability and aggregation of any graph attributes.

5.4 Ablation Study

Table 7: Ablation Study w.r.t. Node Position Classification and Node Identity Clustering on PPI as well as Node Identity Classification and Community Detection on USA.
PPI USA
F1\uparrow (%) Ncut\downarrow F1\uparrow (%) Mod\uparrow (%)
IRWE 25.88 28.94 67.31 31.24
(1) w/o loss {\mathcal{L}}_{{\rm{reg-}}\varphi} 25.43 30.14 66.55 30.82
(2) w/o input \{\tilde{s}(v)\} 24.76 29.68 65.21 29.31
(3) w/o input \{\delta(v)\} 25.14 30.61 67.07 31.08
(4) w/o loss {\mathcal{L}}_{{\rm{reg-}}\psi} 25.65 36.02 45.79 30.11
(5) w/o input \{\psi(v)\} 24.95 29.28 65.79 30.44
(6) w/o input \{\pi_{g}(v)\} 25.08 29.62 65.79 29.45
(7) w/o {\rm ROut}(\cdot) 13.39 29.39 66.47 -0.76
(8) w/o loss {\mathcal{L}}_{\gamma} 22.43 29.42 65.88 23.65
(9) base stat \{\tilde{s}(v)\} 46.05 56.63
(10) base stat \{\delta(v)\} 34.06 63.94
(11) base stat \{\pi_{g}(v)\} 17.52 21.85
(12) base stat {\bf{C}} (SVD) 22.60 12.15

In the ablation study, we removed individual components from the IRWE model to explore their contributions to the embedding quality of our method. For the identity embedding module, we considered the (i) AW embedding regularization loss {\mathcal{L}}_{{\rm{reg-}}\varphi} (12), (ii) AW statistic inputs \{\tilde{s}(v)\}, (iii) high-order degree feature inputs \{\delta(v)\}, and (iv) identity embedding regularization loss {\mathcal{L}}_{{\rm{reg-}}\psi} (13). In cases (i) and (iv), identity embeddings were optimized via only one loss (i.e., {\mathcal{L}}_{{\rm{reg-}}\psi} or {\mathcal{L}}_{{\rm{reg-}}\varphi}).

For the position embedding module, we checked the effectiveness of the (v) identity embedding inputs \{\psi(v)\}, (vi) global position encoding inputs \{\pi_{g}(v)\}, (vii) attentive readout function {\rm ROut}(\cdot) described in (9), and (viii) reconstruction loss \mathcal{L}_{\gamma} (14). In case (v), the two modules of IRWE were independently optimized. In case (vii), we simply averaged the representations in \bar{t}^{(v)} to replace {\rm ROut}(\cdot). In case (viii), we replaced the contrastive statistics {\bf{C}} in (14) with the adjacency matrix {\bf{A}}.

We also used some induced statistics as baselines. Concretely, we evaluated the quality of the (ix) AW statistics \{\tilde{s}(v)\} and (x) degree features \{\delta(v)\} in capturing node identities. In contrast, we checked the quality of the (xi) global position encodings \{\pi_{g}(v)\} and (xii) contrastive statistics {\bf{C}} for node positions. In case (xii), we derived representations with the same dimensionality as the other embedding methods by applying SVD to {\bf{C}}.

As a demonstration, we report results of the transductive embedding inference on PPI and USA (with 80\% of nodes sampled as the training set for classification) in Table 7. According to the results, \mathcal{L}_{{\rm{reg-}}\psi} is essential for identity embedding learning, since there are significant quality declines for node identity clustering and classification in case (iv). {\rm ROut}(\cdot) and \mathcal{L}_{\gamma} are key components for capturing node positions, given the significant quality declines for node position classification and community detection in cases (vii) and (viii). All the remaining components further enhance the ability to capture node identities and positions. Moreover, the joint optimization of identity and position embeddings improves the quality of both.

5.5 Parameter Analysis

Figure 5: Parameter analysis w.r.t. l, \alpha, and \tau on PPI in terms of F1-score\uparrow (node position classification) and NCut\downarrow (node identity clustering). Panels: (a) F1-score vs. l; (b) F1-score vs. \alpha; (c) F1-score vs. \tau; (d) NCut vs. l; (e) NCut vs. \alpha; (f) NCut vs. \tau.
Figure 6: Parameter analysis w.r.t. l, \alpha, and \tau on USA in terms of F1-score\uparrow (node identity classification) and modularity\uparrow (community detection). Panels: (a) F1-score vs. l; (b) F1-score vs. \alpha; (c) F1-score vs. \tau; (d) modularity vs. l; (e) modularity vs. \alpha; (f) modularity vs. \tau.

We tested the effects of the (i) RW length l, (ii) weight \alpha in loss (11), and (iii) temperature parameter \tau in loss (14). Concretely, we set l\in\{4,5,\cdots,9\}, \alpha\in\{0.1,0.5,1,5,10,50,100\}, and \tau\in\{1,5,10,50,100,500,1000\}. Example parameter analysis results of the transductive embedding inference on PPI and USA (with 80\% of nodes sampled as the training set for classification) are illustrated in Figs. 5 and 6. The quality of both types of embeddings is not sensitive to the setting of l. Compared with position embeddings, the quality of identity embeddings is more sensitive to \alpha (e.g., in terms of the F1-score of node classification on USA and the NCut of node identity clustering on PPI). The setting of \tau significantly affects the quality of both types of embeddings. The recommended parameter settings of IRWE are given in Appendix D.

6 Conclusion

In this paper, we considered unsupervised network embedding and explored a unified framework for the joint optimization and inductive inference of identity and position embeddings without relying on the availability and aggregation of graph attributes. The proposed IRWE method combines multiple attention units with different designs to handle RWs on graph topology. We demonstrated that AWs derived from RWs and the induced statistics can (i) serve as features shared by all possible nodes and graphs to support inductive inference and (ii) characterize node identities to derive identity embeddings. We also showed the intrinsic relation between the two types of embeddings, based on which the derived identity embeddings can be used for the inductive inference of position embeddings. Experiments on public datasets validated that IRWE achieves superior quality over various baselines for the transductive and inductive inference of identity and position embeddings. Discussions of future directions are given in Appendix F.

References

  • Arriaga & Vempala (2006) Rosa I Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine learning, 63:161–182, 2006.
  • Cao et al. (2015) Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information & Knowledge Management, pp.  891–900, 2015.
  • Chen et al. (2022) Dexiong Chen, Leslie O’Bray, and Karsten Borgwardt. Structure-aware transformer for graph representation learning. In Proceedings of the 2022 International Conference on Machine Learning, pp.  3469–3489, 2022.
  • Chen et al. (2023) Han Chen, Ziwen Zhao, Yuhua Li, Yixiong Zou, Ruixuan Li, and Rui Zhang. Csgcl: Community-strength-enhanced graph contrastive learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence, pp.  2059–2067, 2023.
  • Donnat et al. (2018) Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  1320–1329, 2018.
  • Gao et al. (2023) Yu Gao, Meng Qin, Yibin Ding, Li Zeng, Chaorui Zhang, Weixi Zhang, Wei Han, Rongqian Zhao, and Bo Bai. Raftgp: Random fast graph partitioning. In 2023 IEEE High Performance Extreme Computing Conference, pp.  1–7, 2023.
  • Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  855–864, 2016.
  • Guo et al. (2019) Junliang Guo, Linli Xu, and Jingchang Liu. Spine: Structural identity preserved inductive network embedding. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp.  2399–2405, 2019.
  • Guo et al. (2020) Xuan Guo, Wang Zhang, Wenjun Wang, Yang Yu, Yinghui Wang, and Pengfei Jiao. Role-oriented graph auto-encoder guided by structural information. In Proceedings of the 25th International Conference on Database Systems for Advanced Applications, pp.  466–481, 2020.
  • Hamilton et al. (2017) William Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 2017 Advances in Neural Information Processing Systems, pp.  1024–1034, 2017.
  • Hoff (2007) Peter Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. Advances in Neural Information Processing Systems, 20, 2007.
  • Hou et al. (2022) Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.  594–604, 2022.
  • Hou et al. (2023) Zhenyu Hou, Yufei He, Yukuo Cen, Xiao Liu, Yuxiao Dong, Evgeny Kharlamov, and Jie Tang. Graphmae2: A decoding-enhanced masked self-supervised graph learner. In Proceedings of the 2023 ACM Web Conference, pp.  737–746, 2023.
  • Ivanov & Burnaev (2018) Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. In Proceedings of the 2018 International Conference on Machine Learning, pp.  2186–2195, 2018.
  • Jin et al. (2020) Yilun Jin, Guojie Song, and Chuan Shi. Gralsp: Graph neural networks with local structural patterns. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence, pp.  4361–4368, 2020.
  • Kipf & Welling (2017) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Lei et al. (2018) Kai Lei, Meng Qin, Bo Bai, and Gong Zhang. Adaptive multiple non-negative matrix factorization for temporal link prediction in dynamic networks. In Proceedings of the 2018 ACM SIGCOMM Workshop on Network Meets AI & ML, pp.  28–34, 2018.
  • Lei et al. (2019) Kai Lei, Meng Qin, Bo Bai, Gong Zhang, and Min Yang. Gcn-gan: A non-linear temporal link prediction model for weighted dynamic networks. In Proceedings of the 2019 IEEE Conference on Computer Communications, pp.  388–396, 2019.
  • Li et al. (2020) Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably more powerful neural networks for graph representation learning. Proceedings of the 2020 Advances in Neural Information Processing Systems, 33:4465–4478, 2020.
  • Li et al. (2019) Wei Li, Meng Qin, and Kai Lei. Identifying interpretable link communities with user interactions and messages in social networks. In Proceedings of the 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, pp.  271–278, 2019.
  • Liu et al. (2023) Zirui Liu, Chen Shengyuan, Kaixiong Zhou, Daochen Zha, Xiao Huang, and Xia Hu. Rsc: accelerate graph neural networks training via randomized sparse computations. In International Conference on Machine Learning, pp.  21951–21968, 2023.
  • Micali & Zhu (2016) Silvio Micali and Zeyuan Allen Zhu. Reconstructing markov processes from independent and anonymous experiments. Discrete Applied Mathematics, 200:108–122, 2016.
  • Newman (2006) Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
  • Nguyen et al. (2021) Dai Quoc Nguyen, Tu Dinh Nguyen, and Dinh Phung. A self-attention network based node embedding model. In Proceedings of the 2021 Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  364–377, 2021.
  • Nguyen et al. (2022) Dai Quoc Nguyen, Tu Dinh Nguyen, and Dinh Phung. Universal graph transformer self-attention networks. In Companion Proceedings of the Web Conference 2022, pp.  193–196, 2022.
  • Pei et al. (2020) Yulong Pei, Xin Du, Jianpeng Zhang, George Fletcher, and Mykola Pechenizkiy. struc2gauss: Structural role preserving network embedding via gaussian embedding. Data Mining & Knowledge Discovery, 34(4):1072–1103, 2020.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery & data Mining, pp.  701–710, 2014.
  • Qin & Lei (2021) Meng Qin and Kai Lei. Dual-channel hybrid community detection in attributed networks. Information Sciences, 551:146–167, 2021.
  • Qin & Yeung (2023) Meng Qin and Dit-Yan Yeung. Temporal link prediction: A unified framework, taxonomy, and review. ACM Computing Surveys, 56(4):1–40, 2023.
  • Qin et al. (2018) Meng Qin, Di Jin, Kai Lei, Bogdan Gabrys, and Katarzyna Musial. Adaptive community detection incorporating topology and content in social networks. Knowledge-Based Systems, 161:342–356, 2018.
  • Qin et al. (2023a) Meng Qin, Chaorui Zhang, Bo Bai, Gong Zhang, and Dit-Yan Yeung. Towards a better trade-off between quality and efficiency of community detection: An inductive embedding method across graphs. ACM Transactions on Knowledge Discovery from Data, 2023a.
  • Qin et al. (2023b) Meng Qin, Chaorui Zhang, Bo Bai, Gong Zhang, and Dit-Yan Yeung. High-quality temporal link prediction for weighted dynamic graphs via inductive embedding aggregation. IEEE Transactions on Knowledge and Data Engineering, 2023b.
  • Ribeiro et al. (2017) Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  385–394, 2017.
  • Rossi et al. (2020) Ryan A Rossi, Di Jin, Sungchul Kim, Nesreen K Ahmed, Danai Koutra, and John Boaz Lee. On proximity and structural role-based embeddings in networks: Misconceptions, techniques, and applications. ACM Transactions on Knowledge Discovery from Data, 14(5):1–37, 2020.
  • Srinivasan & Ribeiro (2020) Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional node embeddings and structural graph representations. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp.  1067–1077, 2015.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 2017 Advances in Neural Information Processing Systems, pp.  5998–6008, 2017.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • Velickovic et al. (2019) Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • Von Luxburg (2007) Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics & Computing, 17(4):395–416, 2007.
  • Wang et al. (2017) Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. In Proceedings of the 2017 AAAI Conference on Artificial Iintelligence, pp.  203–209, 2017.
  • Wang et al. (2020) Xiao Wang, Meiqi Zhu, Deyu Bo, Peng Cui, Chuan Shi, and Jian Pei. Am-gcn: Adaptive multi-channel graph convolutional networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  1243–1253, 2020.
  • Wu et al. (2019) Jun Wu, Jingrui He, and Jiejun Xu. Demo-net: Degree-specific graph neural networks for node and graph classification. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  406–415, 2019.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks & Learning Systems, 32(1):4–24, 2020.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • Yan et al. (2024) Yuchen Yan, Yongyi Hu, Qinghai Zhou, Lihui Liu, Zhichen Zeng, Yuzhong Chen, Menghai Pan, Huiyuan Chen, Mahashweta Das, and Hanghang Tong. Pacer: Network embedding from positional to structural. In Proceedings of the ACM on Web Conference 2024, pp.  2485–2496, 2024.
  • Yang et al. (2015) Yang Yang, Jie Tang, Cane Wing-ki Leung, Yizhou Sun, Qicong Chen, Juanzi Li, and Qiang Yang. RAIN: social role-aware information diffusion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp.  367–373, 2015.
  • Ye et al. (2022) Dongsheng Ye, Hao Jiang, Ying Jiang, Qiang Wang, and Yulin Hu. Community preserving mapping for network hyperbolic embedding. Knowledge-Based Systems, 246:108699, 2022.
  • You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In Proceedings of th 2019 International Conference on Machine Learning, pp.  7134–7143, 2019.
  • You et al. (2021) Jiaxuan You, Jonathan M Gomes-Selman, Rex Ying, and Jure Leskovec. Identity-aware graph neural networks. In Proceedings of the 2021 AAAI Conference on Artificial Intelligence, pp.  10737–10745, 2021.
  • Zhang et al. (2022) Wentao Zhang, Yu Shen, Zheyu Lin, Yang Li, Xiaosen Li, Wen Ouyang, Yangyu Tao, Zhi Yang, and Bin Cui. Pasca: A graph neural architecture search system under the scalable paradigm. In Proceedings of the 2022 ACM Web Conference, pp.  1817–1828, 2022.
  • Zhu et al. (2021) Jing Zhu, Xingyu Lu, Mark Heimann, and Danai Koutra. Node proximity is all you need: Unified structural and positional node and graph embedding. In Proceedings of the 2021 SIAM International Conference on Data Mining, pp.  163–171, 2021.

Appendix A Detailed Algorithms

Input: topology (\mathcal{V},\mathcal{E}); target node v; RW length l; number of samples n_{S}
Output: set of sampled RWs \mathcal{W}^{(v)}
1 \mathcal{W}^{(v)}\leftarrow\emptyset //Initialize \mathcal{W}^{(v)}
2 for sample\_count from 1 to n_{S} do
3     v_{s}\leftarrow v and w\leftarrow(v_{s}) //Initialize current RW w
4     while |w|\leq(l+1) do
5         randomly sample a node v_{t} from v_{s}'s neighbors
6         append v_{t} to w
7         v_{s}\leftarrow v_{t}
8     add w to \mathcal{W}^{(v)}
Algorithm 6 RW Sampling Starting from a Node
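For reference, a minimal Python sketch of this sampling procedure is given below; the adjacency-list representation and the num_steps argument (i.e., the number of uniform transitions taken from the target node, corresponding to the RW length l) are illustrative assumptions.

```python
import random

def sample_rws(adj: dict, v, num_steps: int, n_s: int):
    """Uniform RW sampling starting from node v (a sketch of Algorithm 6).

    adj: adjacency lists, e.g., {node: [neighbor, ...]}; every node is assumed
    to have at least one neighbor. Each of the n_s walks starts at v and takes
    num_steps uniformly random transitions.
    """
    walks = []
    for _ in range(n_s):
        walk = [v]
        for _ in range(num_steps):
            walk.append(random.choice(adj[walk[-1]]))  # uniform next-node choice
        walks.append(walk)
    return walks
```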
Input: RWs \{\mathcal{W}^{(v)},\mathcal{W}_{I}^{(v)}\}, AW lookup table \tilde{\Omega}_{l}, & statistics \{\tilde{s}(v),\delta(v),\pi_{g}(v)\} saved in model optimization; inference topology (\mathcal{V},\mathcal{E})
Output: transductive embeddings \{\psi(v)\} & \{\gamma(v)\} w.r.t. \mathcal{V}
1 get \{\psi(v)\} based on \{\tilde{\Omega}_{l},\tilde{s}(v),\delta(v)\} w.r.t. \mathcal{V}
2 get \{\gamma(v)\} based on \{\psi(v),\pi_{g}(v),\pi_{l}(j),\mathcal{W}_{I}^{(v)}\} w.r.t. \mathcal{V}
Algorithm 7 Transductive Inference
Input: new target node v\in\mathcal{V}^{\prime}; sampled RWs \mathcal{W}^{(v)}; AW lookup table \tilde{\Omega}_{l} reduced on old topology (\mathcal{V},\mathcal{E})
Output: inductive AW statistic s(v) w.r.t. v
1 \tilde{\eta}_{l}\leftarrow|\tilde{\Omega}_{l}| //Get size of reduced AW lookup table
2 s(v)\leftarrow[0,0,\cdots,0]^{\tilde{\eta}_{l}} //Initialize s(v)
3 for each w\in\mathcal{W}^{(v)} do
4     map RW w to its corresponding AW \omega
5     if \omega\in\tilde{\Omega}_{l} then
6         get the index j of AW \omega in the reduced lookup table \tilde{\Omega}_{l}
7         s(v)_{j}\leftarrow s(v)_{j}+1 //Update s(v)
Algorithm 8 Inductive Derivation of AW Statistics
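The AW mapping and the counting step of Algorithm 8 can be sketched as follows; the dictionary-based lookup table mapping each reduced AW to its index is an illustrative data structure, not the exact implementation.

```python
def to_anonymous_walk(walk):
    """Map an RW (a list of node ids) to its AW by relabeling nodes in first-visit order."""
    first_seen = {}
    aw = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen)
        aw.append(first_seen[node])
    return tuple(aw)

def inductive_aw_statistic(walks, reduced_lookup):
    """Count AW occurrences only for AWs in the reduced lookup table (cf. Algorithm 8).

    reduced_lookup: dict mapping each AW (as a tuple) in the reduced table to its
    index; AWs that were not observed during training are simply ignored.
    """
    s = [0] * len(reduced_lookup)
    for w in walks:
        aw = to_anonymous_walk(w)
        if aw in reduced_lookup:
            s[reduced_lookup[aw]] += 1
    return s
```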
Input: new target node v\in\mathcal{V}^{\prime}; RW length l; one-hot degree encoding dimensionality e; sampled RWs \mathcal{W}^{(v)}; \deg_{\min} & \deg_{\max} in old topology (\mathcal{V},\mathcal{E})
Output: inductive degree feature \delta(v) w.r.t. v
1 \delta(v)\leftarrow[0,0,\cdots,0]^{(l+1)e} //Initialize degree feature \delta(v)
2 for each w\in\mathcal{W}^{(v)} do
3     for i from 0 to l do
4         u\leftarrow w^{(i)} //i-th node in current RW w
5         if u\in\mathcal{V} then
6             \rho_{d}(u)\leftarrow[0,\cdots,0]^{e} //Initialize one-hot degree encoding
7             if \deg(u)<\deg_{\min} then
8                 j\leftarrow 0
9             else if \deg(u)\geq\deg_{\max} then
10                 j\leftarrow(e-1)
11             else
12                 j\leftarrow\lfloor(\deg(u)-\deg_{\min})e/(\deg_{\max}-\deg_{\min})\rfloor
13             \rho_{d}(u)_{j}\leftarrow 1 //Update \rho_{d}(u)
14             \delta(v)_{ie:(i+1)e}\leftarrow\delta(v)_{ie:(i+1)e}+\rho_{d}(u) //Accumulate the i-th slot
Algorithm 9 Inductive Derivation of Degree Feature
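A corresponding sketch of the truncated degree encoding in Algorithm 9 is given below; the dictionary of node degrees and the set of training nodes are assumed to be precomputed, and only the first l+1 nodes of each walk contribute.

```python
import numpy as np

def inductive_degree_feature(walks, deg, deg_min, deg_max, l, e, train_nodes):
    """Accumulate truncated one-hot degree encodings along sampled RWs (cf. Algorithm 9).

    deg: dict of node degrees; deg_min/deg_max come from the old training topology;
    train_nodes: set of previously observed nodes in V.
    """
    delta = np.zeros((l + 1) * e)
    for w in walks:
        for i, u in enumerate(w[: l + 1]):
            if u not in train_nodes:
                continue
            # Bucket index truncated by the training-topology degree range;
            # degrees at or above deg_max fall into the last bucket.
            if deg[u] < deg_min:
                j = 0
            elif deg[u] >= deg_max:
                j = e - 1
            else:
                j = int((deg[u] - deg_min) * e / (deg_max - deg_min))
            delta[i * e + j] += 1.0
    return delta
```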
Input: new target node v\in\mathcal{V}^{\prime}; sampled RWs \mathcal{W}^{(v)}; old training node set \mathcal{V}; random matrix {\bf{\Theta}}\in\mathbb{R}^{|\mathcal{V}|\times d}
Output: inductive global position encoding \pi_{g}(v) w.r.t. v
1 r(v)\leftarrow[0,0,\cdots,0]^{|\mathcal{V}|} //Initialize RW statistic r(v)
2 for each w\in\mathcal{W}^{(v)} do
3     for each node u\in w do
4         if u\in\mathcal{V} then
5             get the index j of u in the training node set \mathcal{V}
6             r(v)_{j}\leftarrow r(v)_{j}+1 //Update r(v)
7 \pi_{g}(v)\leftarrow r(v){\bf{\Theta}} //Derive \pi_{g}(v)
Algorithm 10 Inductive Derivation of Global Position Encoding
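The counting and random-projection steps of Algorithm 10 can be sketched as follows, assuming the index of the training nodes and the random matrix \Theta are stored after model optimization.

```python
import numpy as np

def inductive_position_encoding(walks, train_node_index, theta: np.ndarray):
    """Derive the inductive global position encoding pi_g(v) (cf. Algorithm 10).

    train_node_index: dict mapping each old training node to its index in V;
    theta: the |V| x d random projection matrix fixed during training.
    """
    r = np.zeros(len(train_node_index))
    for w in walks:
        for u in w:
            j = train_node_index.get(u)
            if j is not None:  # only count previously observed nodes
                r[j] += 1.0
    return r @ theta  # random projection of the truncated RW statistic
```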
Input: optimized model parameters \{\theta^{*}_{\psi},\theta^{*}_{\gamma}\}; new topology (\mathcal{V}^{\prime\prime},\mathcal{E}^{\prime\prime}); RW settings \{l,n_{S},n_{I}\}; local position encodings \{\pi_{l}(j)\}; \{\tilde{\Omega}_{l},\deg_{\min},\deg_{\max}\} derived in model optimization on old topology (\mathcal{V},\mathcal{E})
Output: inductive embeddings \{\psi(v)\} & \{\gamma(v)\} w.r.t. \mathcal{V}^{\prime\prime}
1 for each node v\in\mathcal{V}^{\prime\prime} do
2     sample n_{S} RWs \mathcal{W}^{(v)} from v w.r.t. \mathcal{E}^{\prime\prime} via Algorithm 6
3     get AW statistic \tilde{s}(v) w.r.t. \{\mathcal{W}^{(v)},\tilde{\Omega}_{l}\} via Algorithm 8
4     get degree feature \delta(v) w.r.t. \{\mathcal{W}^{(v)},\deg_{\min},\deg_{\max}\} via Algorithm 9
5     get global position encoding \pi_{g}(v) w.r.t. \mathcal{W}^{(v)} via Algorithm 3
6     randomly select n_{I} RWs \mathcal{W}_{I}^{(v)} from \mathcal{W}^{(v)}
7 get \{\psi(v)\} based on \{\tilde{\Omega}_{l},\tilde{s}(v),\delta(v)\} w.r.t. \mathcal{V}^{\prime\prime}
8 get \{\gamma(v)\} based on \{\psi(v),\pi_{g}(v),\pi_{l}(j),\mathcal{W}_{I}^{(v)}\} w.r.t. \mathcal{V}^{\prime\prime}
Algorithm 11 Inductive Inference across Graphs

The RW sampling procedure starting from a node is summarized in Algorithm 6, which uniformly samples the next node v_{t} from the neighbors of the current node v_{s}.

Algorithm 7 summarizes the transductive inference procedure of IRWE, where the RWs \{\mathcal{W}^{(v)},\mathcal{W}_{I}^{(v)}\}, the AW lookup table \tilde{\Omega}_{l}, and the statistics \{\tilde{s}(v),\delta(v),\pi_{g}(v)\} derived during model optimization are used as inputs. Therefore, the transductive inference of identity embeddings \{\psi(v)\} and position embeddings \{\gamma(v)\} only includes one feedforward propagation through the model.

Procedures to get the inductive AW statistics \{s(v)\}, high-order degree features \{\delta(v)\}, and global position encodings \{\pi_{g}(v)\}, which support the inductive inference for new nodes within a graph (i.e., Algorithm 5), are described in Algorithms 8, 9, and 10, respectively. When deriving \{s(v)\}, we only compute the frequencies of AWs in the lookup table \tilde{\Omega}_{l} reduced on (\mathcal{V},\mathcal{E}) rather than all AWs. Moreover, we get \{\delta(v)\} based on the one-hot degree encoding truncated by the minimum and maximum degrees of the training topology (\mathcal{V},\mathcal{E}) rather than those of the inference topology (\mathcal{V}\cup\mathcal{V}^{\prime},\mathcal{E}^{\prime}). For \pi_{g}(v), we compute the truncated RW statistic r(v) only w.r.t. the previously observed nodes \mathcal{V} rather than \mathcal{V}\cup\mathcal{V}^{\prime}.

The inductive inference across graphs is summarized in Algorithm 11. We sample RWs \{\mathcal{W}^{(v)},\mathcal{W}_{I}^{(v)}\} on each new graph (\mathcal{V}^{\prime\prime},\mathcal{E}^{\prime\prime}). Since there are no shared nodes between the training topology (\mathcal{V},\mathcal{E}) and the inference topology (\mathcal{V}^{\prime\prime},\mathcal{E}^{\prime\prime}), we only incrementally compute the statistics \{\tilde{s}(v),\delta(v)\} based on \{\tilde{\Omega}_{l},\deg_{\min},\deg_{\max}\} derived from (\mathcal{V},\mathcal{E}), while the global position encodings \{\pi_{g}(v)\} are computed from scratch.

Appendix B Complexity Analysis

The complexity of the RW sampling starting from each node (i.e., Algorithm 6) is no more than O(n_{S}l). The complexities to derive the AW statistic s(v) (i.e., Algorithm 1), the high-order degree feature \delta(v) (i.e., Algorithm 2), and the global position encoding \pi_{g}(v) (i.e., Algorithm 3) w.r.t. a node v are O(n_{S}), O(n_{S}l), and O(n_{S}l+k(v)d), respectively, with k(v) as the number of nodes observed in \mathcal{W}^{(v)}. The overall complexity to derive the RW-induced statistics (i.e., the feature inputs of IRWE) from a graph (\mathcal{V},\mathcal{E}) is no more than O(|\mathcal{V}|n_{S}l+|\mathcal{V}|n_{S}+|\mathcal{V}|n_{S}l+(|\mathcal{V}|n_{S}l+\bar{k}d))=O(|\mathcal{V}|n_{S}l+\bar{k}d), with \bar{k}:=\sum_{v\in\mathcal{V}}k(v).

As described in Algorithm 7, the transductive inference of IRWE only includes one feedforward propagation through the model. Its complexity is no more than O(\tilde{\eta}_{l}l^{2}d+|\mathcal{V}|(el+\tilde{\eta}_{l})d+|\mathcal{V}|\tilde{\eta}_{l}dh+(|\mathcal{V}|d^{2}+|\mathcal{V}|d)+|\mathcal{V}|(d+l)d+|\mathcal{V}|n_{I}l^{2}dh+n_{I}d)=O(|\mathcal{V}|(\tilde{\eta}_{l}+n_{I}l^{2})dh), where we assume that el\approx d, l^{2}\ll|\mathcal{V}|, and d\ll\tilde{\eta}_{l}; h is the number of attention heads. According to Algorithm 5, the complexity of the inductive inference for new nodes within a graph is O(|\mathcal{V}^{\prime}|n_{S}l+\bar{k}^{\prime}d+|\mathcal{V}\cup\mathcal{V}^{\prime}|(\tilde{\eta}_{l}+n_{I}l^{2})dh), with \bar{k}^{\prime}:=\sum_{v\in\mathcal{V}^{\prime}}k(v). The complexity of the inductive inference across graphs (i.e., Algorithm 11) is O(|\mathcal{V}^{\prime\prime}|n_{S}l+\bar{k}^{\prime\prime}d+|\mathcal{V}^{\prime\prime}|(\tilde{\eta}_{l}+n_{I}l^{2})dh), with \bar{k}^{\prime\prime}:=\sum_{v\in\mathcal{V}^{\prime\prime}}k(v).

Table 8: Summary of the Complexities of Model Parameters to be Learned.
node2vec GraRep struc2vec struc2gauss PaCEr PhN GSAGE DGI GMAE
O(Nd) O(Ndl) O(Nd) O(Nd) O(N(d+N)) O(Nd) O(Ld^{2}) O(Ld^{2}) O(Ld^{2})
GMAE2 P-GNN CSGCL GraLSP SPINE GAS SANNE UGFormer IRWE
O(Ld^{2}) O(Ld^{2}) O(Ld^{2}) O(Ld^{2}+\eta_{l}d) O(d^{2}) O(Ld^{2}) O(Lhd^{2}) O(Lhd^{2}) O(l^{2}d+\tilde{\eta}_{l}d+Lhd^{2})

Table 8 summarizes and compares the complexities of the model parameters to be learned for all the methods in our experiments, where N is the number of nodes; d is the dimensionality of the input features or embeddings; l is the RW/AW length; \eta_{l} and \tilde{\eta}_{l} denote the (i) number of AWs w.r.t. length l and (ii) reduced value of \eta_{l}; L is the number of layers of the GNN or transformer encoder; h is the number of attention heads. Since most transductive embedding methods (e.g., node2vec and struc2vec) follow the embedding lookup scheme, their model parameters have a complexity of at least O(Nd). Most inductive approaches rely on the attribute aggregation mechanism of GNNs, so their model parameters have a complexity of at least O(Ld^{2}). Methods based on multi-head attention or the transformer encoder (e.g., SANNE, UGFormer, and IRWE) have at least O(Lhd^{2}) learnable model parameters. In addition, IRWE also includes the AW auto-encoder and the MLPs in the identity embedding encoder and decoder. Therefore, the model parameters of IRWE have a complexity of O(l^{2}d+(\tilde{\eta}_{l}+le)d+Lhd^{2})=O(l^{2}d+\tilde{\eta}_{l}d+Lhd^{2}), where we assume that el\approx d.

Appendix C Proof of Proposition 1

For simplicity, we let z_{ij}:=\gamma(v_{i})\tilde{\gamma}^{T}(v_{j})/\tau. To minimize the contrastive loss \mathcal{L}_{\rm cnr} (10), one can set its partial derivative \partial\mathcal{L}_{\rm cnr}/\partial z_{ij} w.r.t. each edge (v_{i},v_{j})\in\mathcal{E} to 0. Since \sigma(x)=1/(1+e^{-x}) and d\sigma(x)/dx=\sigma(x)[1-\sigma(x)], we have

0=\partial\mathcal{L}_{\rm cnr}/\partial z_{ij}=-p_{ij}(1-\sigma(z_{ij}))+Qn_{j}(1-\sigma(-z_{ij})), (15)

which can be rearranged as

p_{ij}\sigma(z_{ij})-Qn_{j}\sigma(-z_{ij})=p_{ij}-Qn_{j}. (16)

By applying \sigma(-x)=e^{-x}\sigma(x), we have

\begin{array}{l}p_{ij}\sigma(z_{ij})-Qn_{j}\exp\{-z_{ij}\}\sigma(z_{ij})=p_{ij}-Qn_{j}\\ \Rightarrow\frac{p_{ij}-Qn_{j}\exp\{-z_{ij}\}}{1+\exp\{-z_{ij}\}}=p_{ij}-Qn_{j}\\ \Rightarrow\frac{p_{ij}+Qn_{j}-Qn_{j}(1+\exp\{-z_{ij}\})}{1+\exp\{-z_{ij}\}}=p_{ij}-Qn_{j}\\ \Rightarrow(p_{ij}+Qn_{j})\sigma(z_{ij})=p_{ij}\\ \Rightarrow\sigma(z_{ij})=p_{ij}/(p_{ij}+Qn_{j})\\ \Rightarrow 1+\exp\{-z_{ij}\}=(p_{ij}+Qn_{j})/p_{ij}\\ \Rightarrow\exp\{-z_{ij}\}=Qn_{j}/p_{ij}.\end{array} (17)

By taking the logarithm of both sides, we further have

z_{ij}=\ln p_{ij}-\ln(Qn_{j}). (18)

Let {\bf{C}}\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|} be an auxiliary matrix with the same definition as that in Proposition 1. From the perspective of matrix factorization, the equation above can be rewritten in the matrix form {\bf{\Gamma}}\tilde{\bf{\Gamma}}^{T}/\tau={\bf{C}}, which is equivalent to the reconstruction loss \mathcal{L}_{\gamma} (14).

Appendix D Detailed Experiment Settings

Table 9: Parameter Settings for Transductive Inference.
(d, e, n_{S}, n_{I}) (\lambda_{\psi}, \lambda_{\gamma}) (m, m_{\psi}, m_{\gamma}) (l, \alpha, \tau)
PPI (256, 100, 1e3, 10) (5e-4,1e-3) (2e3, 10, 1) (7, 0.1, 5e2)
Wiki (256, 100, 1e3, 10) (1e-3,1e-3) (1e3, 5, 1) (7, 10, 1e3)
Blog (256, 100, 1e3, 10) (5e-4,5e-4) (3e3, 1, 20) (9, 10, 10)
USA (128, 100, 1e3, 20) (1e-3,5e-4) (500, 10, 1) (9, 10, 10)
Europe (64, 100, 1e3, 20) (5e-4,5e-4) (200, 1, 1) (9, 10, 10)
Brazil (64, 32, 1e3, 20) (5e-4,5e-4) (200, 1, 1) (9, 0.1, 1e2)
Table 10: Parameter Settings for Inductive Inference.
(d, e, n_{S}, n_{I}) (\lambda_{\psi}, \lambda_{\gamma}) (m, m_{\psi}, m_{\gamma}) (l, \alpha, \tau)
PPI (256, 100, 1e3, 10) (5e-4,1e-4) (1e3, 20, 1) (7, 10, 5e2)
Wiki (256, 100, 1e3, 10) (1e-3,5e-4) (1e3, 1, 1) (7, 10, 5e2)
Blog (256, 100, 1e3, 10) (5e-4,5e-4) (1e3, 20, 5) (5, 10, 5)
USA (128, 100, 1e3, 10) (5e-4,5e-4) (500, 10, 1) (9, 10, 10)
Europe (64, 100, 1e3, 10) (5e-4,5e-4) (200, 1, 1) (9, 10, 10)
Brazil (64, 32, 1e3, 10) (5e-4,5e-4) (200, 1, 1) (9, 0.1, 1e2)
PPIs (256, 100, 1e3, 10) (5e-4,5e-4) (1000, 5, 1) (9, 10, 50)
Table 11: Layer Configurations for Transductive Inference.
Datasets Identity Embedding Module Position Embedding Module
{\rm Enc}_{\varphi}(\cdot) {\rm Dec}_{\varphi}(\cdot) {\rm Red}_{s}(\cdot) h_{\psi} {\rm Dec}_{\psi}(\cdot) MLP in {\rm ReAtt}(\cdot) (L_{\rm tran}, h_{\rm tran}) h_{\rm rout}
PPI l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,2048,r,1024,r,512,r,d,r 64 d,512,t,le,t d,d,s,d,s,d,s,d,s (4, 64) 64
Wiki l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 64 d,512,t,le,t d,d,s,d,s,d,s,d,s (4, 64) 64
Blog l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 64 d,512,t,le,t d,d,s,d,s,d,s,d,s (5, 64) 64
USA l^{2},100,t,d,t d,100,t,l^{2},t \tilde{\eta}_{l}+le,4096,r,2048,r,512,r,d,r 32 d,le,t d,d,s,d,s,d,s,d,s (4, 32) 32
Europe l^{2},64,t,d,t d,64,t,l^{2},t \tilde{\eta}_{l}+le,4096,r,1024,r,256,r,d,r 16 d,256,t,512,t,le,t d,d,s,d,s (4, 16) 16
Brazil l^{2},64,t,d,t d,64,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,128,r,d,r 16 d,128,t,le,t d,d,s,d,s (4, 16) 16
Table 12: Layer Configurations for Inductive Inference.
Datasets Identity Embedding Module Position Embedding Module
{\rm Enc}_{\varphi}(\cdot) {\rm Dec}_{\varphi}(\cdot) {\rm Red}_{s}(\cdot) h_{\psi} {\rm Dec}_{\psi}(\cdot) MLP in {\rm ReAtt}(\cdot) (L_{\rm tran}, h_{\rm tran}) h_{\rm rout}
PPI l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 64 d,le,t d,d,s,d,s,d,s,d,s (4, 64) 64
Wiki l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 64 d,512,t,le,t d,d,s,d,s,d,s,d,s (4, 64) 64
Blog l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 64 d,t,512,t,le,t d,d,s,d,s,d,s,d,s (5, 64) 64
USA l^{2},100,t,d,t d,100,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 32 d,512,t,le,t d,d,s,d,s,d,s,d,s (4, 16) 16
Europe l^{2},64,t,d,t d,64,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 16 d,256,t,512,t,le,t d,d,s,d,s (4, 16) 16
Brazil l^{2},64,t,d,t d,64,t,l^{2},t \tilde{\eta}_{l}+le,512,r,128,r,d 16 d,256,t,le,t d,d,s,d,s (4, 16) 16
PPIs l^{2},128,t,d,t d,128,t,l^{2},t \tilde{\eta}_{l}+le,1024,r,512,r,d,r 64 d,512,t,le,t d,d,s,d,s,d,s,d,s (6, 64) 64

Given a clustering result \mathcal{C}=\{\mathcal{C}_{1},\cdots,\mathcal{C}_{K}\}, NCut w.r.t. the auxiliary similarity graph \mathcal{G}_{D} is defined as

{\rm NCut}(\mathcal{C};\mathcal{G}_{D}):=0.5\sum\nolimits_{r=1}^{K}[{\rm cut}(\mathcal{C}_{r},\bar{\mathcal{C}}_{r})/{\rm vol}(\mathcal{C}_{r})], (19)

where \bar{\mathcal{C}}_{r}:=\mathcal{V}-\mathcal{C}_{r}, {\rm cut}(\mathcal{C}_{r},\bar{\mathcal{C}}_{r}):=\sum\nolimits_{v_{i}\in\mathcal{C}_{r},v_{j}\in\bar{\mathcal{C}}_{r}}({\bf{A}}_{D})_{ij}, and {\rm vol}(\mathcal{C}_{r}):=\sum\nolimits_{v_{i}\in\mathcal{C}_{r},v_{j}\in\mathcal{V}}({\bf{A}}_{D})_{ij}, with {\bf{A}}_{D} as the adjacency matrix of \mathcal{G}_{D}. Given a clustering result \mathcal{C}, modularity w.r.t. the original graph \mathcal{G} is defined as

{\rm Mod}(\mathcal{C};\mathcal{G}):=\frac{1}{2e}\sum\nolimits_{r=1}^{K}\sum\nolimits_{v_{i},v_{j}\in\mathcal{C}_{r}}[{\bf{A}}_{ij}-\deg(v_{i})\deg(v_{j})/(2e)], (20)

where e:=\sum\nolimits_{i}\deg(v_{i})/2 is the number of edges.
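For reference, the two metrics can be computed from an adjacency matrix and a cluster assignment as in the following sketch; dense numpy arrays are assumed for simplicity.

```python
import numpy as np

def ncut(a_d: np.ndarray, clusters) -> float:
    """Normalized cut of a clustering w.r.t. adjacency matrix A_D, as in (19)."""
    labels = np.asarray(clusters)
    total = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        cut = a_d[in_c][:, ~in_c].sum()  # weight of edges leaving cluster c
        vol = a_d[in_c].sum()            # total degree of cluster c
        if vol > 0:
            total += cut / vol
    return 0.5 * total

def modularity(a: np.ndarray, clusters) -> float:
    """Modularity of a clustering w.r.t. adjacency matrix A, as in (20)."""
    labels = np.asarray(clusters)
    deg = a.sum(axis=1)
    two_e = deg.sum()  # equals 2e, i.e., twice the number of edges
    q = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        q += a[in_c][:, in_c].sum() - np.outer(deg[in_c], deg[in_c]).sum() / two_e
    return q / two_e
```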

The parameter settings of IRWE for the transductive and inductive inference are given in Tables 9 and 10, where d is the embedding dimensionality; e is the dimensionality of the one-hot degree encoding for the degree features \{\delta(v)\}; n_{S}:=|\mathcal{W}^{(v)}| and n_{I}:=|\mathcal{W}_{I}^{(v)}| are the number of sampled RWs and the number of RWs used to infer the position embedding of each node v; \lambda_{\psi} and \lambda_{\gamma} are the learning rates to optimize identity and position embeddings; m is the number of training iterations, and in each iteration we update identity and position embeddings m_{\psi} and m_{\gamma} times; l is the RW length; \alpha and \tau are hyper-parameters in the training losses.

Tables 11 and 12 give the layer configurations for the transductive and inductive embedding inference, where $\mathrm{Enc}_\varphi(\cdot)$ and $\mathrm{Dec}_\varphi(\cdot)$ denote the AW encoder and decoder described in (1); $\mathrm{Red}_s(\cdot)$ is the feature reduction unit (2); $\mathrm{Dec}_\psi(\cdot)$ represents the identity embedding decoder (4); $\mathrm{ReAtt}(\cdot)$ is the attentive reweighting function (6); $\tilde{\eta}_l := |\tilde{\Omega}_l|$ is the reduced number of AWs; $h_\psi$, $h_{\rm tran}$, and $h_{\rm rout}$ represent the numbers of attention heads in the identity embedding encoder (3), transformer encoder (8), and attentive readout function (9); $L_{\rm tran}$ is the number of transformer encoder layers; 't', 's', and 'r' denote the Tanh, Sigmoid, and ReLU activation functions, respectively. For our IRWE method, we recommend setting $l \in \{4,5,\cdots,9\}$, $\alpha \in \{0.1,0.5,1,5,10\}$, $\tau \in \{1,5,10,50,100,500,1000\}$, and $m_\psi, m_\gamma \in \{1,5,10,20\}$.
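To make the layer notation concrete, the snippet below shows one way the comma-separated specifications in Tables 11 and 12 could be translated into PyTorch modules; the `build_mlp` helper and the example widths are our own illustration rather than the authors' released code.

```python
import torch.nn as nn

# Activation codes used in Tables 11 and 12.
ACTS = {'t': nn.Tanh, 's': nn.Sigmoid, 'r': nn.ReLU}

def build_mlp(spec):
    """Build an MLP from a spec such as [d, 128, 't', l**2, 't']:
    integers are layer widths and letters are the activation applied
    after the preceding linear layer."""
    layers, dims = [], [spec[0]]
    for item in spec[1:]:
        if isinstance(item, int):
            dims.append(item)
        else:  # activation code closing the previous linear layer
            layers.append(nn.Linear(dims[-2], dims[-1]))
            layers.append(ACTS[item]())
    return nn.Sequential(*layers)

# E.g., the AW decoder row "d,128,t,l^2,t" with illustrative values d = 64, l = 8:
dec_phi = build_mlp([64, 128, 't', 8 ** 2, 't'])
```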

We adopted the standard multi-head attention (Vaswani et al., 2017) to build the identity embedding encoder (3) and attentive readout unit (9) of IRWE. An attention unit takes key, query, and value inputs described by $\mathbf{K}\in\mathbb{R}^{m\times d}$, $\mathbf{Q}\in\mathbb{R}^{n\times d}$, and $\mathbf{V}\in\mathbb{R}^{m\times d}$. Assume that there are $h$ attention heads and let $\tilde{d} = d/h$. For the $j$-th head, we first derive the linear mappings $\tilde{\mathbf{K}}^{(j)} = \mathbf{K}\mathbf{W}_k^{(j)}$, $\tilde{\mathbf{Q}}^{(j)} = \mathbf{Q}\mathbf{W}_q^{(j)}$, and $\tilde{\mathbf{V}}^{(j)} = \mathbf{V}\mathbf{W}_v^{(j)}$, with $\{\mathbf{W}_k^{(j)}, \mathbf{W}_q^{(j)}, \mathbf{W}_v^{(j)}\} \subset \mathbb{R}^{d\times\tilde{d}}$ as trainable parameters. The $j$-th attention head is defined as

$$\mathbf{Z}^{(j)} = \mathrm{Att}_j(\mathbf{Q},\mathbf{K},\mathbf{V}) := \mathrm{softmax}(\tilde{\mathbf{Q}}^{(j)}\tilde{\mathbf{K}}^{(j)T}/\sqrt{\tilde{d}})\tilde{\mathbf{V}}^{(j)}. \quad (21)$$

We further concatenate the outputs of all the heads via $\mathbf{Z} = \mathrm{Att}(\mathbf{Q},\mathbf{K},\mathbf{V}) := [\mathbf{Z}^{(1)} \| \cdots \| \mathbf{Z}^{(h)}]$.
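A compact PyTorch sketch of this unit is given below. It uses a single $d \times d$ projection per input that is then split into $h$ heads, which is an equivalent reformulation of the per-head $d \times \tilde{d}$ matrices above; the class name and the absence of an output projection are our own simplifications.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head attention of Eq. (21) followed by head concatenation."""
    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0
        self.h, self.d_head = h, d // h
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)

    def forward(self, Q, K, V):
        n, m = Q.size(0), K.size(0)
        # Project and split into h heads of width d_head = d / h.
        Qh = self.W_q(Q).view(n, self.h, self.d_head).transpose(0, 1)  # (h, n, d_head)
        Kh = self.W_k(K).view(m, self.h, self.d_head).transpose(0, 1)  # (h, m, d_head)
        Vh = self.W_v(V).view(m, self.h, self.d_head).transpose(0, 1)  # (h, m, d_head)
        scores = Qh @ Kh.transpose(-2, -1) / self.d_head ** 0.5        # (h, n, m)
        Z = torch.softmax(scores, dim=-1) @ Vh                         # (h, n, d_head)
        return Z.transpose(0, 1).reshape(n, -1)                        # (n, d)

# Example: n = 5 queries attending over m = 7 key/value rows with d = 64, h = 4.
att = MultiHeadAttention(d=64, h=4)
Z = att(torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64))
```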

All the experiments were conducted on a server with an AMD EPYC 7742 64-core CPU, 512GB of main memory, and one NVIDIA A100 GPU (80GB of memory). We used the official code or public implementations of all the baselines and tuned their parameters to report the best performance. On each dataset, we set the same embedding dimensionality for all the methods.

Appendix E Further Experiment Results

Table 13: Evaluation Results on Attributed Graphs for the Validation of Inconsistency of Attributes.
Methods | Cornell Mod↑(%) | Cornell NCut↓ | Texas Mod↑(%) | Texas NCut↓ | Washington Mod↑(%) | Washington NCut↓ | Wisconsin Mod↑(%) | Wisconsin NCut↓
node2vec | 56.93 | 3.18 | 45.99 | 3.10 | 44.94 | 3.59 | 54.73 | 2.97
struc2vec | -9.50 | 1.53 | -11.37 | 1.17 | -9.33 | 0.71 | -8.94 | 1.34
att-emb | -0.09 | 3.76 | -0.01 | 3.75 | -2.30 | 3.84 | -3.54 | 3.75
[n2v‖att] | 50.80 | 3.38 | 35.26 | 3.12 | 44.05 | 3.50 | 54.02 | 3.13
[s2v‖att] | -5.76 | 1.81 | -12.77 | 1.33 | -2.81 | 1.00 | -9.58 | 1.35

To demonstrate the possible inconsistency of graph attributes for identity and position embedding as discussed in Section 1, we conducted additional experiments on four attributed graphs (i.e., Cornell, Texas, Washington, and Wisconsin) from WebKB (https://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/). For each graph, we extracted the largest connected component from its topology. After this pre-processing, we have $(N, E, M, K) = (183, 227, 1703, 5)$, $(183, 279, 1703, 5)$, $(215, 365, 1703, 5)$, and $(251, 450, 1703, 5)$ for Cornell, Texas, Washington, and Wisconsin, respectively, where $N$, $E$, and $K$ are the numbers of nodes, edges, and clusters, and $M$ is the dimensionality of node attributes.

We then applied node2vec and struc2vec, which are typical position and identity embedding baselines as described in Table 5.1, to the extracted topology of each graph, with the embedding dimensionality set to $d = 64$. Furthermore, we derived attribute embeddings (denoted as att-emb) of the same dimensionality by applying SVD to the node attributes. Namely, we have three baseline methods (i.e., node2vec, struc2vec, and att-emb). To simulate the incorporation of attributes, we concatenated att-emb with node2vec and struc2vec, forming two additional baselines denoted as [n2v‖att] and [s2v‖att]. Unsupervised community detection and node identity clustering were adopted as the downstream tasks for position and identity embedding, respectively. The evaluation results are depicted in Table 13, where att-emb outperforms neither (i) node2vec for community detection nor (ii) struc2vec for node identity clustering, and concatenating att-emb fails to further improve the embedding quality of node2vec or struc2vec. These results imply that (i) attributes may fail to capture both node positions and identities and (ii) the simple integration of attributes may even degrade the quality of position and identity embeddings.
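A minimal sketch of the att-emb construction and the concatenated baselines is shown below, assuming a dense attribute matrix; scaling the left singular vectors by the singular values is our own choice here, and the exact SVD variant used for Table 13 may differ.

```python
import numpy as np

def attribute_embedding(X, d=64):
    """att-emb: d-dimensional attribute embeddings from the N x M
    node-attribute matrix X via truncated SVD."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :d] * S[:d]   # one d-dimensional row per node

# Concatenated baselines, e.g., [n2v||att] and [s2v||att], given the
# node2vec/struc2vec embedding matrices Z_n2v and Z_s2v (N x d each):
# Z_n2v_att = np.concatenate([Z_n2v, attribute_embedding(X)], axis=1)
# Z_s2v_att = np.concatenate([Z_s2v, attribute_embedding(X)], axis=1)
```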

Appendix F Discussions of Future Research Directions

Some possible future research directions of this study are summarized as follows.

In this study, we focused on network embedding where topology is the only available information source without any attributes, due to the complicated correlations between the two sources (Qin et al., 2018; Li et al., 2019; Wang et al., 2020; Qin & Lei, 2021). We intend to explore the adaptive incorporation of attributes. Concretely, when attributes carry characteristics consistent with topology, one can fully utilize attribute information to enhance the embedding quality. In contrast, when there is inconsistent noise in attributes, we need to adaptively control the effect of attributes to avoid unexpected quality degradation.

In addition to mapping each node to a low-dimensional vector (a.k.a. node-level embedding), network embedding also includes the representation of an entire graph (a.k.a. graph-level embedding). We plan to extend IRWE to graph-level embedding and evaluate the embedding quality on graph-level downstream tasks (e.g., graph classification). Analyzing the relations between graph-level embeddings and the identity and position embeddings is also a focus of our future work.

The optimization of IRWE adopts the standard full-batch setting, where statistics and embeddings w.r.t. all the nodes are derived when computing the training losses. This setting may not scale to graphs with large numbers of nodes. Inspired by recent studies on scalable GNNs (Zhang et al., 2022; Liu et al., 2023), we intend to explore a scalable optimization strategy based on mini-batch training.