
Going Deeper into Permutation-Sensitive Graph Neural Networks

Zhongyu Huang*
NLPR, Institute of Automation, Chinese Academy of Sciences
huangzhongyu2020@ia.ac.cn

Yingheng Wang*
Tsinghua University; Johns Hopkins University
ywang584@jhu.edu

Chaozhuo Li
Microsoft Research Asia
cli@microsoft.com

Huiguang He
NLPR, Institute of Automation, Chinese Academy of Sciences
huiguang.he@ia.ac.cn
Abstract

The invariance to permutations of the adjacency matrix, i.e., graph isomorphism, is an overarching requirement for Graph Neural Networks (GNNs). Conventionally, this prerequisite is satisfied by operations invariant to node permutations when aggregating messages. However, such an invariant manner may ignore the relationships among neighboring nodes, thereby hindering the expressivity of GNNs. In this work, we devise an efficient permutation-sensitive aggregation mechanism via permutation groups, capturing pairwise correlations between neighboring nodes. We prove that our approach is strictly more powerful than the 2-dimensional Weisfeiler-Lehman (2-WL) graph isomorphism test and not less powerful than the 3-WL test. Moreover, we prove that our approach achieves linear sampling complexity. Comprehensive experiments on multiple synthetic and real-world datasets demonstrate the superiority of our model.

1 Introduction

The invariance to permutations of the adjacency matrix, i.e., graph isomorphism, is a key inductive bias for graph representation learning [1]. Graph Neural Networks (GNNs) invariant to graph isomorphism are more amenable to generalization as different orderings of the nodes result in the same representations of the underlying graph. Therefore, many previous studies [2, 3, 4, 5, 6, 7, 8] devote much effort to designing permutation-invariant aggregators to make the overall GNNs permutation-invariant (permutation of the nodes of the input graph does not affect the output) or permutation-equivariant (permutation of the input permutes the output) to node orderings.

Despite their great success, Kondor et al. [9] and de Haan et al. [10] expound that such a permutation-invariant manner may hinder the expressivity of GNNs. Specifically, the strong symmetry of these permutation-invariant aggregators presumes equal statuses of all neighboring nodes, ignoring the relationships among neighboring nodes. Consequently, the central nodes cannot distinguish whether two neighboring nodes are adjacent, failing to recognize and reconstruct the fine-grained substructures within the graph topology. As shown in Figure 1(a), the general Message Passing Neural Networks (MPNNs) [4] can only explicitly reconstruct a star graph from the 1-hop neighborhood, but are powerless to model any connections between neighbors [11]. To address this problem, some latest advances [11, 12, 13, 14] propose to use subgraphs or ego-nets to improve the expressive power while preserving the property of permutation-invariance. Unfortunately, they usually suffer from high computational complexity when operating on multiple subgraphs [14].

In contrast, the permutation-sensitive (as opposed to permutation-invariant) function can be regarded as a “symmetry-breaking” mechanism, which breaks the equal statuses of neighboring nodes. (Typical permutation-sensitive functions are Recurrent Neural Networks (RNNs), e.g., the Simple Recurrent Network (SRN) [15], the Gated Recurrent Unit (GRU) [16], and Long Short-Term Memory (LSTM) [17].) The relationships among neighboring nodes, e.g., the pairwise correlation between each pair of neighboring nodes, are explicitly modeled in the permutation-sensitive paradigm. These pairwise correlations help capture whether two neighboring nodes are connected, thereby exploiting the local graph substructures to improve the expressive power. We illustrate a concrete example in Appendix D.

Different permutation-sensitive aggregation functions behave variously when modeling pairwise correlations. GraphSAGE with an LSTM aggregator [5] in Figure 1(b) is capable of modeling some pairwise correlations among the sampled subset of neighboring nodes. Janossy Pooling with the $\pi$-SGD strategy [18] in Figure 1(c) samples random permutations of all neighboring nodes, thus modeling pairwise correlations more efficiently. The number of modeled pairwise correlations is proportional to the number of sampled permutations. After sampling permutations with a costly nonlinear complexity of $\mathcal{O}(n\ln n)$ (see Appendix K for detailed analysis), all the pairwise correlations between $n$ neighboring nodes can be modeled and all the possible connections are covered.

Figure 1: Comparison of the pairwise correlations modeled by various aggregation functions in the 1-hop neighborhood: (a) message-passing scheme (permutation-invariant); (b) GraphSAGE with an LSTM aggregator; (c) Janossy Pooling with $\pi$-SGD optimization; (d) our goal of capturing all pairwise correlations. Here we illustrate with one central node $v$ and $n=7$ neighbors. The dashed lines represent the pairwise correlations between neighbors modeled by the aggregators; the real topological connections between neighbors are hidden for clarity. Subfigure (b) shows 2 sampled batches with the neighborhood sample size $k=5$. Subfigure (c) shows 2 sampled permutations. Different dashed-line styles in (b)/(c) denote different batches/permutations.

In fact, previous works [18, 1] have shown that incorporating permutation-sensitive functions into GNNs is indeed an effective way to improve their expressive power. Janossy Pooling [18] and Relational Pooling [1] both design theoretically most powerful GNN models by exploiting permutation-sensitive functions to cover all $n!$ possible permutations. They explicitly learn the representations of the underlying graph under all $n!$ possible node orderings to guarantee permutation-invariance and generalization capability, overcoming the limited generalization of permutation-sensitive GNNs [19]. However, the complete modeling of all $n!$ permutations also leads to an intractable computational complexity of $\mathcal{O}(n!)$. Thus, we expect to design a powerful yet efficient GNN that guarantees the expressive power and significantly reduces the complexity with a minimal loss of generalization capability.

Different from explicitly modeling all $n!$ permutations, we propose to sample a small number of representative permutations to cover all $n(n-1)/2$ pairwise correlations (as shown in Figure 1(d)) via permutation-sensitive functions. Accordingly, permutation-invariance is approximated by the invariance to pairwise correlations. Moreover, we mathematically analyze the complexity of permutation sampling and reduce it from $\mathcal{O}(n\ln n)$ to $\mathcal{O}(n)$ via a well-designed Permutation Group (PG). Based on the proposed permutation sampling strategy, we then devise an aggregation mechanism and theoretically prove that its expressivity is strictly more powerful than the 2-WL test and not less powerful than the 3-WL test. Thus, our model is capable of significantly reducing the computational complexity while guaranteeing the expressive power. To the best of our knowledge, our model achieves the lowest time and space complexity among all GNNs beyond the 2-WL test so far.

2 Related Work

Permutation-Sensitive Graph Neural Networks.

Loukas [20] first shows that it is necessary to sacrifice the permutation-invariance and -equivariance of MPNNs to improve their expressive power when nodes lack discriminative attributes. However, only a few models (GraphSAGE with LSTM aggregators [5], RP with $\pi$-SGD [18], CLIP [21]) are permutation-sensitive GNNs. These studies provide either theoretical proofs or empirical results that their approaches can capture some substructures, especially triangles, which can be regarded as special cases of our Theorem 4. Despite their powerful expressivity, the nonlinear complexity of sampling or coloring limits their practical application.

Expressive Power of Graph Neural Networks.

Xu et al. [7] and Morris et al. [22] first investigate GNNs' ability to distinguish non-isomorphic graphs and demonstrate that the traditional message-passing paradigm [4] is at most as powerful as the 2-WL test [23], which cannot distinguish some graph pairs such as regular graphs with identical attributes. To theoretically surpass the expressive power of the 2-WL test, a direct way is to equip nodes with distinguishable attributes, e.g., identifiers [1, 20], port numbering [24], coloring [24, 21], and random features [25, 26]. Another line of research [22, 8, 27, 28, 29, 30] considers high-order relations to design more powerful GNNs but suffers from high computational complexity when handling high-order tensors and performing global computations on the graph. Some pioneering works [10, 12] use the automorphism group of local subgraphs to obtain more expressive representations and overcome the problem of global computations, but their pre-processing stages still require solving the NP-hard subgraph isomorphism problem. Recent studies [24, 31, 20, 32] also characterize the expressive power of GNNs from the perspective of what they cannot learn.

Leveraging Substructures for Learning Representations.

Previous efforts mainly focused on graph isomorphism tasks, but did little to understand GNNs' capacity to capture and exploit graph substructures. Recent studies [21, 11, 10, 19, 33, 25, 34, 35, 36] show that the expressive power of GNNs is highly related to the local substructures in graphs. Chen et al. [11] demonstrate that the substructure counting ability of GNN architectures not only serves as an intuitive theoretical measure of their expressive power but is also highly relevant to practical tasks. Barceló et al. [35] and Bouritsas et al. [36] propose to incorporate handcrafted subgraph features to improve the expressive power, but they require expert knowledge to select task-relevant features. Several latest advances [19, 34, 37, 38] enhance standard MPNNs by leveraging high-order structural information while retaining the locality of message-passing. However, the complexity issue has not been satisfactorily solved because they introduce memory- and time-consuming context matrices [19], eigenvalue decomposition [34], and lifting transformations [37, 38] in pre-processing.

Relations to Our Work.

Some crucial differences between related works [5, 1, 21, 11, 19, 25, 34, 37] and ours can be summarized as follows: (i) we propose to design powerful permutation-sensitive GNNs while approximating the property of permutation-invariance, balancing expressivity and computational efficiency; (ii) our approach realizes linear complexity of permutation sampling and reaches the theoretical lower bound; (iii) our approach directly learns substructures from data instead of relying on pre-computation or handcrafted structural features. We also provide detailed discussions in Appendix H.3 for [5], K.2 for [18, 1], and L.1 for [37, 38].

3 Designing Powerful Yet Efficient GNNs via Permutation Groups

In this section, we begin with an analysis of the theoretically most powerful but intractable GNNs. Then, we propose a tractable strategy to achieve linear permutation sampling and significantly reduce the complexity. Based on this strategy, we design our permutation-sensitive aggregation mechanism via permutation groups. Furthermore, we mathematically analyze the expressivity of permutation-sensitive GNNs and prove, via incidence substructure counting, that our proposed model is strictly more powerful than the 2-WL test and not less powerful than the 3-WL test.

3.1 Preliminaries

Let $G=(\mathcal{V},\mathcal{E})\in\mathcal{G}$ be a graph with vertex set $\mathcal{V}=\{v_{1},v_{2},\ldots,v_{N}\}$ and edge set $\mathcal{E}$, directed or undirected. Let $\bm{A}\in\mathbb{R}^{N\times N}$ be the adjacency matrix of $G$. For a node $v\in\mathcal{V}$, $d_{v}$ denotes its degree, i.e., the number of 1-hop neighbors of node $v$, which is equivalent to $n$ in this section for simplicity. Suppose the $n$ neighboring nodes of the central node $v$ are randomly numbered as $u_{1},\ldots,u_{n}$ (also abbreviated as $1,\ldots,n$ in the following); the set of neighboring nodes is represented as $\mathcal{N}(v)$ (or $S=[n]=\{1,\ldots,n\}$). Given a set of graphs $\{G_{1},G_{2},\ldots,G_{M}\}\subseteq\mathcal{G}$, each graph $G$ has a label $y_{G}$. Our goal is to learn a representation vector $\bm{h}_{G}$ of the entire graph $G$ and classify it into the correct category among $C$ classes. In this paper, we use the normal $G$ to denote a graph and the Gothic $\mathfrak{G}$ to denote a group. The necessary background on graph theory and group theory is provided in Appendixes A and B. The rigorous definition of the $k$-WL test is provided in Appendix C.

3.2 Theoretically Most Powerful GNNs

Relational Pooling (RP) [1] proposes the theoretically most powerful permutation-invariant model by averaging over all permutations of the nodes, which can be formulated as follows:

$\bm{h}_{G}=\frac{1}{|\mathcal{S}_{N}|}\sum_{\pi\in\mathcal{S}_{N}}{\vec{f}\left(\bm{h}_{\pi v_{1}},\bm{h}_{\pi v_{2}},\cdots,\bm{h}_{\pi v_{N}}\right)}$   (1)

where $\pi v_{i}\,(i=1,\ldots,N)$ denotes the result of acting $\pi\in\mathcal{S}_{N}$ on $v_{i}\in\mathcal{V}$, $\mathcal{S}_{N}$ is the symmetric group on the set $[N]$ (or $\mathcal{V}$), $\vec{f}$ is a sufficiently expressive (possibly permutation-sensitive) function, and $\bm{h}_{v_{i}}$ is the feature vector of node $v_{i}$.

Permutation-sensitive functions, especially sequential models, are capable of modeling the $k$-ary dependency [18, 1] among $k$ input nodes. Meanwhile, different input node orderings lead to a total of $k!$ different $k$-ary dependencies. These $k$-ary dependencies indicate the relations and help capture the topological connections among the corresponding $k$ nodes, thereby exploiting the substructures within these $k$ nodes to improve the expressive power of GNN models. For instance, the expressivity of Eq. (1) is mainly attributed to the modeling of all possible $N$-ary dependencies (full dependencies) among all $N$ nodes, which can capture all graphs isomorphic to $G$. However, it is intractable and practically prohibitive to model all permutations ($N!$ $N$-ary dependencies) due to the extremely high computational cost. Thus, it is necessary to design a tractable strategy to reduce the computational cost while maximally preserving the expressive power.
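To make Eq. (1) concrete, here is a minimal sketch of Relational Pooling on a tiny graph, assuming a GRU as the permutation-sensitive function $\vec{f}$ (its last hidden state taken as the output) and following Eq. (1) as written, i.e., passing only node features; the helper names are hypothetical and the procedure is intractable beyond very small $N$.

```python
import itertools
import torch
import torch.nn as nn

N, dim = 4, 8
h = torch.randn(N, dim)                 # node feature vectors h_{v_1}, ..., h_{v_N}
f = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)  # permutation-sensitive f

h_G = 0
perms = list(itertools.permutations(range(N)))   # all N! node orderings
for pi in perms:
    _, last = f(h[list(pi)].unsqueeze(0))        # apply f to one ordering
    h_G = h_G + last.squeeze()                   # accumulate over the symmetric group
h_G = h_G / len(perms)                           # average, as in Eq. (1)
```

Even for $N=10$ this loop already runs over $10!\approx 3.6\times 10^{6}$ orderings, which illustrates why a tractable alternative is needed.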

3.3 Permutation Sampling Strategy

Intuitively, the simplest way is to replace $N$-ary dependencies with 2-ary dependencies, i.e., the pairwise correlations in Section 1. Moreover, since the inductive bias of locality results in lower complexity on sparse graphs [39, 34], we restrict the permutation-sensitive functions to aggregate information and model the 2-ary dependencies in the 1-hop neighborhoods. Thus, we will further discuss how to model all 2-ary dependencies between $n$ neighboring nodes with the lowest sampling complexity $\mathcal{O}(n)$.

Suppose the $n$ neighboring nodes are arranged as a ring; we define this ring as an arrangement. An initial arrangement can simply be defined as $1\text{-}2\text{-}\cdots\text{-}n\text{-}1$, including one $n$-ary dependency $\{1\text{-}2\text{-}\cdots\text{-}n\text{-}1\}$ and $n$ 2-ary dependencies $\{1\text{-}2,\,2\text{-}3,\,\cdots,\,n\text{-}1\}$. Since a permutation adjusts the node ordering in the arrangement, we can use a permutation to generate a new arrangement, which corresponds to a new $n$-ary dependency covering $n$ 2-ary dependencies. The following theorem provides a lower bound on the number of arrangements needed to cover all 2-ary dependencies.

Theorem 1.

Let $n~(n\geq 4)$ denote the number of 1-hop neighboring nodes around the central node $v$. There are $\lfloor(n-1)/2\rfloor$ kinds of arrangements in total whose corresponding 2-ary dependencies are disjoint. Meanwhile, after at least $\lfloor n/2\rfloor$ arrangements (including the initial one), all 2-ary dependencies have been covered at least once.

We first give a sketch of the proof. Construct a simple undirected graph $G^{\prime}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime})$, where $\mathcal{V}^{\prime}$ denotes the $n$ neighboring nodes (abbreviated as nodes in the following), and $\mathcal{E}^{\prime}$ is an edge set in which each edge indicates that the corresponding 2-ary dependency has been covered in some arrangement. Each arrangement corresponds to a Hamiltonian cycle in graph $G^{\prime}$. In addition, we define the following permutation $\sigma$ to generate new arrangements:

$\sigma=\begin{cases}\begin{pmatrix}1&2&3&4&5&\cdots&n-1&n\\ 1&4&2&6&3&\cdots&n&n-2\end{pmatrix}=\left(2\ 4\ 6\ \cdots\ n-1\ n\ n-2\ \cdots\ 7\ 5\ 3\right), & n~\text{is odd},\\[2ex] \begin{pmatrix}1&2&3&4&\cdots&n-1&n\\ 3&1&5&2&\cdots&n&n-2\end{pmatrix}=\left(1\ 3\ 5\ \cdots\ n-1\ n\ n-2\ \cdots\ 6\ 4\ 2\right), & n~\text{is even}.\end{cases}$   (2)

After performing the permutation $\sigma$ once, a new arrangement is generated and a Hamiltonian cycle is constructed. Since every pair of nodes can form a 2-ary dependency, covering all 2-ary dependencies is equivalent to constructing a complete graph $K_{n}$. Besides, as $K_{n}$ has $n(n-1)/2$ edges and each Hamiltonian cycle has $n$ edges, a $K_{n}$ can only be constructed with at least $\lceil n(n-1)/(2n)\rceil=\lceil(n-1)/2\rceil=\lfloor n/2\rfloor$ Hamiltonian cycles. It can be proved that after performing the permutation $\sigma$ for $\lfloor n/2\rfloor-1=\mathcal{O}(n)$ times in succession (excluding the initial arrangement), all 2-ary dependencies are covered at least once. A detailed proof of Theorem 1 is provided in Appendix E.

Note that Theorem 1 has the constraint $n\geq 4$ because all 2-ary dependencies are already covered by the initial arrangement when $1<n<4$, and there is only a single node when $n=1$. If $n=2,3,4$, then $\sigma=(1\ 2)$, $(2\ 3)$, $(1\ 3\ 4\ 2)$, respectively (the case of $n=1$ is trivial). Thus the permutation $\sigma$ defined in Theorem 1 is available for an arbitrary $n$, while Eq. (2) shows the general case for large $n$.
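As a sanity check on Theorem 1 and Eq. (2), the following sketch (with hypothetical helper names) builds $\sigma$ from its cycle notation, generates the first $\lfloor n/2\rfloor$ arrangements starting from the initial ring $1\text{-}2\text{-}\cdots\text{-}n\text{-}1$, and verifies that their ring edges cover all $n(n-1)/2$ 2-ary dependencies.

```python
import itertools

def sigma(n):
    """Permutation sigma of Eq. (2), built from its cycle notation; dict {i: sigma(i)}."""
    if n % 2 == 1:   # n odd: (2 4 6 ... n-1 n n-2 ... 7 5 3), with 1 a fixed point
        cycle = list(range(2, n, 2)) + [n] + list(range(n - 2, 1, -2))
    else:            # n even: (1 3 5 ... n-1 n n-2 ... 6 4 2)
        cycle = list(range(1, n, 2)) + [n] + list(range(n - 2, 0, -2))
    perm = {i: i for i in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):   # each element maps to its successor
        perm[a] = b
    return perm

def ring_edges(arrangement):
    """Undirected 2-ary dependencies (edges) covered by one ring arrangement."""
    n = len(arrangement)
    return {frozenset((arrangement[i], arrangement[(i + 1) % n])) for i in range(n)}

def covers_all_pairs(n):
    """Theorem 1: floor(n/2) arrangements generated by sigma cover the complete graph K_n."""
    perm, arrangement, covered = sigma(n), list(range(1, n + 1)), set()
    for _ in range(n // 2):                           # includes the initial arrangement
        covered |= ring_edges(arrangement)
        arrangement = [perm[x] for x in arrangement]  # next arrangement via sigma
    all_pairs = {frozenset(p) for p in itertools.combinations(range(1, n + 1), 2)}
    return covered == all_pairs

print(all(covers_all_pairs(n) for n in range(4, 30)))   # expected: True
```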

Figure 2: Modeling all pairwise correlations between $n$ neighboring nodes via permutations. Subfigures (a) to (d) characterize the cases of $n=5$ to $8$ (ignoring the central node $v$). The monochrome permutation diagram illustrates the mapping process of permutation $\sigma$, where the directed arc $a\to b$ indicates moving $a$ to the original position of $b$. All arrangements generated by $\sigma^{i}$ are shown in color below the diagram. The first and the last $\lfloor n/2\rfloor$ arrangements are marked with solid and dashed lines, respectively. Solid and dashed lines with the same color correspond to a pair of bi-directional Hamiltonian cycles. Only the Hamiltonian cycles corresponding to the first $\lfloor n/2\rfloor$ arrangements are displayed for clarity. For a further explanation and the relationships among Theorem 1, Lemma 2, Corollary 3, Figure 2, and Eq. (3), please refer to Appendix J.1 and Figure 8.

According to the ordering of the $n$ neighboring nodes in the arrangement, we can apply a permutation-sensitive function to model an $n$-ary dependency among these $n$ nodes while covering $n$ 2-ary dependencies. Since the input orderings $a\to b$ and $b\to a$ lead to different results in a permutation-sensitive function, these dependencies and the corresponding Hamiltonian cycles (the solid arrows in Figure 2) are modeled in a directed manner. We continue performing the permutation $\sigma$ another $\lfloor n/2\rfloor$ times successively to get $\lfloor n/2\rfloor$ additional arrangements (the dashed lines in Figure 2) and reversely directed Hamiltonian cycles (not shown in Figure 2). After this bi-directional modeling, the edges in the Hamiltonian cycles are transformed into undirected edges. Figure 2 briefly illustrates the above process for $n=5$ to $8$. In conclusion, all 2-ary dependencies can be modeled in an undirected manner by the tailored permutations. The number of permutations is $n$ if $n$ is even and $n-1$ if $n$ is odd, ensuring the linear sampling complexity $\mathcal{O}(n)$.

In fact, all permutations above form a permutation group. In order to incorporate the strategy proposed by Theorem 1 into the aggregation process of GNN, we propose to use the permutation group and group action, defined as follows.

Lemma 2.

For the permutation $\sigma$ of $n$ indices, $\mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-2}\}$ is a permutation group isomorphic to the cyclic group $\mathbb{Z}_{n-1}$ if $n$ is odd, and $\mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-1}\}$ is a permutation group isomorphic to the cyclic group $\mathbb{Z}_{n}$ if $n$ is even.

Corollary 3.

The map $\alpha:\mathfrak{G}\times S\to S$ defined by $(g,s)\mapsto gs$ is a group action of $\mathfrak{G}$ on $S$.

To better illustrate the results of Lemma 2 and Corollary 3, the detailed discussion and diagram are attached in Appendix G. Next, we apply the permutation group and group action to design our permutation-sensitive aggregation mechanism.

3.4 Network Architecture

Without loss of generality, we apply the widely used Recurrent Neural Networks (RNNs) as the permutation-sensitive function to model the dependencies among neighboring nodes. Letting the group elements (i.e., permutations) in $\mathfrak{G}$ act on $S$, our proposed strategy in Section 3.3 is formulated as:

$\bm{h}_{v}^{(k)}=\sum_{g\in\mathfrak{G}}{\mathrm{RNN}\left(\bm{h}_{gu_{1}}^{(k-1)},\bm{h}_{gu_{2}}^{(k-1)},\cdots,\bm{h}_{gu_{n}}^{(k-1)},\bm{h}_{gu_{1}}^{(k-1)}\right)}+\bm{W}_{\text{self}}^{(k-1)}\bm{h}_{v}^{(k-1)},\quad u_{1:n}\in\mathcal{N}(v)$   (3)

where $gu_{i}\,(i=1,\ldots,n)$ denotes the result of acting $g\in\mathfrak{G}$ on $u_{i}\in S$, and $\bm{h}_{v}^{(k)}\in\mathbb{R}^{d_{k}}$ is the feature vector of the central node $v$ at the $k$-th layer. We provide more discussion on the groups and model variants in Appendixes J.2 and J.3. Eq. (3) takes advantage of the locality and the permutation group $\mathfrak{G}$ to simplify the group actions in Eq. (1), which acts the symmetric group $\mathcal{S}_{N}$ on the vertex set $\mathcal{V}$, thereby avoiding the complete modeling of $N!$ permutations. Meanwhile, Eq. (3) models all 2-ary dependencies and achieves invariance to 2-ary dependencies. Thus, we can conclude that Eq. (3) realizes an efficient approximation of permutation-invariance with low complexity. In practice, we merge the central node $v$ into the RNN for simplicity:

$\bm{h}_{v}^{(k)}=\sum_{g\in\mathfrak{G}}{\mathrm{RNN}\left(\bm{h}_{v}^{(k-1)},\bm{h}_{gu_{1}}^{(k-1)},\bm{h}_{gu_{2}}^{(k-1)},\cdots,\bm{h}_{gu_{n}}^{(k-1)},\bm{h}_{v}^{(k-1)}\right)},\quad u_{1:n}\in\mathcal{N}(v)$   (4)
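The following is a minimal sketch of the aggregation in Eq. (4), assuming a GRU as the permutation-sensitive RNN, its last hidden state as the per-arrangement output, and one central node processed at a time; these choices and the helper `sigma` (built from Eq. (2), as in the earlier sketch) are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

def sigma(n):
    # Permutation sigma of Eq. (2), built from its cycle notation; dict {i: sigma(i)}.
    if n % 2 == 1:
        cycle = list(range(2, n, 2)) + [n] + list(range(n - 2, 1, -2))
    else:
        cycle = list(range(1, n, 2)) + [n] + list(range(n - 2, 0, -2))
    perm = {i: i for i in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        perm[a] = b
    return perm

class PGAggregator(nn.Module):
    """Sketch of one PG-GNN aggregation step, Eq. (4): sum the RNN outputs over
    the arrangements generated by the permutation group G (Lemma 2)."""

    def __init__(self, dim):
        super().__init__()
        # Assumption: a GRU plays the role of the permutation-sensitive RNN.
        self.rnn = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, h_v, h_nbrs):
        # h_v: (dim,) feature of the central node v at layer k-1
        # h_nbrs: (n, dim) features of v's 1-hop neighbors u_1, ..., u_n
        n = h_nbrs.size(0)
        if n == 0:                                       # isolated node: nothing to aggregate
            return h_v
        perm = sigma(n)
        group_size = n if n % 2 == 0 else max(n - 1, 1)  # |G| = n (n even) or n-1 (n odd)
        order, out = list(range(1, n + 1)), 0
        for _ in range(group_size):                      # group elements e, sigma, sigma^2, ...
            idx = torch.tensor([i - 1 for i in order])
            seq = torch.cat([h_v.unsqueeze(0), h_nbrs[idx], h_v.unsqueeze(0)], dim=0)
            _, last = self.rnn(seq.unsqueeze(0))         # last hidden state as the RNN output
            out = out + last.squeeze(0).squeeze(0)
            order = [perm[i] for i in order]             # move to the next arrangement
        return out
```

Since the RNN runs inside the sum are mutually independent, they can be batched and executed in parallel, matching the serial/parallel trade-off discussed in the Complexity paragraph below.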

Then, we apply a READOUT function (e.g., $\mathrm{SUM}(\cdot)$) to obtain the graph representation $\bm{h}_{G}^{(k)}$ at the $k$-th layer and combine the representations learned by different layers to get the score $\bm{s}$ for classification:

$\bm{h}_{G}^{(k)}=\sum_{v\in\mathcal{V}}{\bm{h}_{v}^{(k)}},\qquad\bm{s}=\sum_{k}{\bm{W}^{(k)}\bm{h}_{G}^{(k)}}$   (5)

where $\bm{W}^{(k)}\in\mathbb{R}^{C\times d_{k}}$ is a learnable scoring matrix for the $k$-th layer. Finally, we feed the score $\bm{s}$ into the softmax function and obtain the predicted class of graph $G$.
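A minimal sketch of Eq. (5) is given below, assuming the node embeddings of each layer are already stacked into a tensor; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class GraphReadout(nn.Module):
    """Sum readout per layer (Eq. (5)) followed by layer-wise scoring matrices W^(k)."""

    def __init__(self, layer_dims, num_classes):
        super().__init__()
        self.score = nn.ModuleList(nn.Linear(d, num_classes, bias=False)
                                   for d in layer_dims)   # one W^(k) per layer

    def forward(self, H_per_layer):
        # H_per_layer[k]: (N, d_k) tensor of node embeddings h_v^(k) at layer k
        s = sum(W(H.sum(dim=0)) for W, H in zip(self.score, H_per_layer))
        return torch.softmax(s, dim=-1)                    # class probabilities for G
```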

Complexity.

We briefly analyze the computational complexity of Eq. (3). Suppose the input and output dimensions are both $c$ for each layer, and let $\Delta$ denote the maximum degree of graph $G$. In the worst-case scenario, Eq. (3) requires summing over $\Delta$ terms processed in a serial manner. Since there is no interdependence between these $\Delta$ terms, they can also be computed in a parallel manner with a time complexity of $\Theta(\Delta c^{2})$ (caused by the RNN computation), sacrificing memory to save time. Let $M$ denote the number of edges. Table 1 compares our approach with other powerful GNNs in terms of per-layer space and time complexity. The results of the baselines are taken from Vignac et al. [19]. Since the complexity analysis of GraphSAGE [5], MPSN [37], and CWN [38] involves many other notations, we analyze GraphSAGE in Appendix H.3, and MPSN and CWN in Appendix L.1. In a nutshell, our approach theoretically outperforms other powerful GNNs in terms of time and space complexity, even being on par with MPNN.

Table 1: Memory and time complexity per layer.
Model            Memory               Time complexity
GIN [7]          $\Theta(Nc)$         $\Theta(Mc+Nc^{2})$
MPNN [4]         $\Theta(Nc)$         $\Theta(Mc^{2})$
Fast SMP [19]    $\Theta(N^{2}c)$     $\Theta(MNc+N^{2}c^{2})$
SMP [19]         $\Theta(N^{2}c)$     $\Theta(MNc^{2})$
PPGN [28]        $\Theta(N^{2}c)$     $\Theta(N^{3}c+N^{2}c^{2})$
3-WL [22]        $\Theta(N^{3}c)$     $\Theta(N^{4}c+N^{3}c^{2})$
Ours (serial)    $\Theta(Nc)$         $\Theta(N\Delta^{2}c^{2})$
Ours (parallel)  $\Theta(N\Delta c)$  $\Theta(N\Delta c^{2})$
Table 2: Results (measured by MAE) on incidence triangle counting.
Model       Erdős-Rényi random graph    Random regular graph
GCN [3]     0.599 ± 0.006               0.500 ± 0.012
SAGE [5]    0.118 ± 0.005               0.127 ± 0.011
GIN [7]     0.219 ± 0.016               0.342 ± 0.005
rGIN [25]   0.194 ± 0.009               0.325 ± 0.006
RP [1]      0.058 ± 0.006               0.161 ± 0.003
LRP [11]    0.023 ± 0.011               0.037 ± 0.019
PG-GNN      0.019 ± 0.002               0.027 ± 0.001

3.5 Expressivity Analysis

In this subsection, we theoretically analyze the expressive power of a typical category of permutation-sensitive GNNs, i.e., GNNs with RNN aggregators (Theorem 4), and that of our proposed PG-GNN (Proposition 5). We begin with GIN [7], which possesses expressive power equivalent to the 2-WL test [7, 30]. In fact, variants of GIN can be recovered by GNNs with RNN aggregators (see Appendix I for details), which implies that this category of permutation-sensitive GNNs is at least as powerful as the 2-WL test. Next, we explicate why they go beyond the 2-WL test from the perspective of substructure counting.

Triangular substructures are abundant in various networks, and counting triangles is an important task in network analysis [40]. For example, in social networks, the formation of a triangle indicates that two people with a common friend will also become friends [41]. A triangle $\triangle u_{i}vu_{j}$ is incident to the node $v$ if $u_{i}$ and $u_{j}$ are adjacent and node $v$ is their common neighbor. We define the triangle $\triangle u_{i}vu_{j}$ as an incidence triangle over node $v$ (and also over $u_{i}$ and $u_{j}$), and denote the number of incidence triangles over node $v$ by $\tau_{v}$. Formally, the number of incidence triangles over each node in an undirected graph can be calculated as follows (the proof and a discussion of the directed case are provided in Appendix H.1):

$\bm{\tau}=\frac{1}{2}\left(\bm{A}^{2}\odot\bm{A}\right)\cdot\bm{1}_{N}$   (6)

where $\bm{\tau}\in\mathbb{R}^{N}$ and its $i$-th element $\tau_{i}$ represents the number of incidence triangles over node $i$, $\odot$ denotes the element-wise (Hadamard) product, and $\bm{1}_{N}=(1,1,\cdots,1)^{\top}\in\mathbb{R}^{N}$ is the all-ones (sum) vector.
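As a quick numerical illustration of Eq. (6), the following sketch (assuming NumPy and a hypothetical toy graph) computes $\bm{\tau}$ and also checks the identity $\bm{1}_{N}^{\top}\bm{\tau}=\frac{1}{2}\mathrm{tr}(\bm{A}^{3})$ used later in this subsection.

```python
import numpy as np

def incidence_triangles(A):
    """Eq. (6): tau_v = number of triangles incident to node v (undirected simple graph)."""
    return 0.5 * ((A @ A) * A) @ np.ones(A.shape[0])

# Toy graph on 4 nodes with edges {01, 02, 03, 12, 13}: two triangles, 012 and 013.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)
tau = incidence_triangles(A)
print(tau)                                                      # [2. 2. 1. 1.]
print(tau.sum(), 0.5 * np.trace(np.linalg.matrix_power(A, 3)))  # 6.0 6.0
```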

Besides the WL-test, the capability of counting graph substructures also characterizes the expressive power of GNNs [11]. Thus, we verify the expressivity of permutation-sensitive GNNs by evaluating their abilities to count triangles.

Theorem 4.

Let $x_{v},\,\forall v\in\mathcal{V}$ denote the feature inputs on graph $G=(\mathcal{V},\mathcal{E})$, and let $\mathbb{M}$ be a general GNN model with RNN aggregators. Suppose that $x_{v}$ is initialized as the degree $d_{v}$ of node $v$, and each node is distinguishable. For any $0<\epsilon\leq 1/8$ and $0<\delta<1$, there exists a parameter setting $\Theta$ for $\mathbb{M}$ so that after $\mathcal{O}\left(\frac{d_{v}(2d_{v}+\tau_{v})\mathfrak{t}}{d_{v}+\tau_{v}}\right)$ samples,

$\Pr\left(\left|\frac{z_{v}}{\tau_{v}}-1\right|\leq\epsilon\right)\geq 1-\delta,\quad\forall v\in\mathcal{V},$

where $z_{v}\in\mathbb{R}$ is the final output value generated by $\mathbb{M}$ and $\tau_{v}$ is the number of incidence triangles.

A detailed proof can be found in Appendix H.2. Theorem 4 concludes that, if the input node features are node degrees and nodes are distinguishable, there exists a parameter setting for a general GNN with RNN aggregators such that it can approximate the number of incidence triangles to arbitrary precision for every node. Since 2-WL and MPNNs cannot count triangles [11], we conclude that this category of permutation-sensitive GNNs is more powerful. However, the number of required samples is related to $\tau_{v}$ and proportional to the mixing time $\mathfrak{t}$ (see Appendix H.2), leading to a practically prohibitive aggregation complexity. Many existing permutation-sensitive GNNs, such as GraphSAGE with LSTM and RP with $\pi$-SGD, suffer from this issue (see Appendixes H.3 and K.2 for more discussion).

On the contrary, our approach can estimate the number of incidence triangles with linear sampling complexity $\mathcal{O}(n)=\mathcal{O}(d_{v})$. According to the definition of incidence triangles and the fact that they always appear within $v$'s 1-hop neighborhood, we know that the number of connections between the central node $v$'s neighboring nodes is equivalent to the number of incidence triangles over $v$. Meanwhile, Theorem 1 and Eq. (3) ensure that all 2-ary dependencies between the $n$ neighboring nodes are modeled with $\mathcal{O}(n)$ sampling complexity. These dependencies capture whether two neighboring nodes are connected, thereby estimating the number of connections and counting incidence triangles with linear sampling complexity.

Recently, Balcilar et al. [34] claimed that the trace ($\mathrm{tr}$) and Hadamard product ($\odot$) operations are crucial requirements for going beyond 2-WL towards 3-WL from the perspective of the Matrix Language [42, 43]. In fact, for any two neighbors $u_{i}$ and $u_{j}$ of the central node $v$, the locality and the 2-ary dependency of Eq. (3) introduce the information of $\bm{A}^{2}$ (i.e., the path $u_{i}\text{-}v\text{-}u_{j}$) and $\odot\,\bm{A}$ (i.e., whether $u_{i}$ and $u_{j}$ are connected), respectively. Thus Eq. (3) can mimic Eq. (6) to count incidence triangles. Moreover, we also prove that $\bm{1}_{N}^{\top}\bm{\tau}=\frac{1}{2}\mathrm{tr}(\bm{A}^{3})$ (see Appendix H.1 for details), which indicates that PG-GNN can realize the trace ($\mathrm{tr}$) operation when we use $\mathrm{SUM}(\cdot)$ or $\mathrm{MEAN}(\cdot)$ (i.e., $\bm{1}_{N}$) as the graph-level READOUT function. Note that even if MPNNs and the 2-WL test are equipped with distinguishable attributes, they still have difficulty performing triangle counting since they cannot implement the trace or Hadamard product operations [34].

Beyond the incidence triangle, we can also leverage the 2-ary dependencies indicating whether $u_{i}$ and $u_{j}$, $u_{i}$ and $u_{k}$, and $u_{j}$ and $u_{k}$ are connected to discover the incidence 4-clique over $v$, $u_{i}$, $u_{j}$, and $u_{k}$, which is completely composed of triangles and only appears within $v$'s 1-hop neighborhood. In this way, the expressive power of PG-GNN can be further improved by its capability of counting incidence 4-cliques. As illustrated in Figure 7, these incidence 4-cliques help distinguish some pairs of non-isomorphic strongly regular graphs where the 3-WL test fails. Consequently, the expressivity of our model is guaranteed to be not less powerful than 3-WL. (Here, “A is no/not less powerful than B” means that there exists a pair of non-isomorphic graphs that A can distinguish but B cannot; this terminology follows the standard definition in the literature [11, 37, 38, 13].)

From the analysis above, we confirm the expressivity of PG-GNN as follows. The strict proof and more detailed discussion on PG-GNN and 3-WL are provided in Appendix I.

Proposition 5.

PG-GNN is strictly more powerful than the 2-WL test and not less powerful than the 3-WL test.

4 Experiments

In this section, we evaluate PG-GNN on multiple synthetic and real-world datasets from a wide range of domains. Dataset statistics and details are presented in Appendix M.1. The hyper-parameter search space and final hyper-parameter configurations are provided in Appendix M.2. Computing infrastructures can be found in Appendix M.3. The code is publicly available at https://github.com/zhongyu1998/PG-GNN.

4.1 Counting Substructures in Random Graphs

We conduct synthetic experiments of counting incidence substructures (triangles and 4-cliques) on two types of random graphs: Erdős-Rényi random graphs and random regular graphs [11]. The incidence substructure counting task is designed at the node level, which is more rigorous than traditional graph-level counting tasks. Table 2 summarizes the results measured by Mean Absolute Error (MAE, lower is better) for incidence triangle counting. We report the average and standard deviation of testing MAEs over 5 runs with 5 different seeds. In addition, the testing MAEs of PG-GNN on Erdős-Rényi and random regular graphs are 0.029 ± 0.002 and 0.023 ± 0.001 for incidence 4-clique counting, respectively. Overall, the negligible MAEs of our model support our claim that PG-GNN is powerful enough to count incidence triangles and 4-cliques.

Another phenomenon is that permutation-sensitive GNNs consistently outperform permutation-invariant GNNs on substructure counting tasks. This indicates that permutation-sensitive GNNs are capable of learning these substructures directly from data, without explicitly assigning them as node features, whereas permutation-invariant counterparts like GCN and GIN fail to do so. Therefore, permutation-sensitive GNNs can implicitly leverage the information of characteristic substructures in representation learning and thus benefit real-world tasks in practical scenarios.

Table 3: Results (measured by accuracy: %) on TUDataset.
Model            PROTEINS      NCI1          IMDB-B        IMDB-M        COLLAB
WL [44]          75.0 ± 3.1    86.0 ± 1.8    73.8 ± 3.9    50.9 ± 3.8    78.9 ± 1.9
DGCNN [45]       75.5 ± 0.9    74.4 ± 0.5    70.0 ± 0.9    47.8 ± 0.9    73.8 ± 0.5
IGN [8]          76.6 ± 5.5    74.3 ± 2.7    72.0 ± 5.5    48.7 ± 3.4    78.4 ± 2.5
GIN [7]          76.2 ± 2.8    82.7 ± 1.7    75.1 ± 5.1    52.3 ± 2.8    80.2 ± 1.9
PPGN [28]        77.2 ± 4.7    83.2 ± 1.1    73.0 ± 5.8    50.5 ± 3.6    80.7 ± 1.7
CLIP [21]        77.1 ± 4.4    N/A           76.0 ± 2.7    52.5 ± 3.0    N/A
NGN [10]         71.7 ± 1.0    82.7 ± 1.4    74.8 ± 2.0    51.3 ± 1.5    N/A
WEGL [46]        76.5 ± 4.2    N/A           75.4 ± 5.0    52.3 ± 2.9    80.6 ± 2.0
SIN [37]         76.5 ± 3.4    82.8 ± 2.2    75.6 ± 3.2    52.5 ± 3.0    N/A
CIN [38]         77.0 ± 4.3    83.6 ± 1.4    75.6 ± 3.7    52.7 ± 3.1    N/A
PG-GNN (Ours)    76.8 ± 3.8    82.8 ± 1.3    76.8 ± 2.6    53.2 ± 3.6    80.9 ± 0.8

4.2 Real-World Benchmarks

Datasets.

We evaluate our model on 7 real-world datasets from various domains. PROTEINS and NCI1 are bioinformatics datasets; IMDB-BINARY, IMDB-MULTI, and COLLAB are social network datasets. They are all popular graph classification tasks from the classical TUDataset [47]. We follow Xu et al. [7] to create the input features for each node. More specifically, the input node features of bioinformatics graphs are categorical node labels, and the input node features of social networks are node degrees. All the input features are encoded in a one-hot manner. In addition, MNIST is a computer vision dataset for the graph classification task, and ZINC is a chemistry dataset for the graph regression task. They are both modern benchmark datasets, and we obtain the features from the original paper [48], but do not take edge features into account. We summarize the statistics of all 7 real-world datasets in Table 7, and more details about these datasets can be found in Appendix M.1.
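As a concrete illustration of the feature construction above, the sketch below (a hypothetical helper, assuming PyTorch) one-hot encodes node degrees as input features for the social network datasets; categorical node labels on the bioinformatics datasets are encoded in the same one-hot manner.

```python
import torch

def one_hot_degree_features(degrees, max_degree):
    """One-hot encode node degrees as input features, following Xu et al. [7]."""
    x = torch.zeros(len(degrees), max_degree + 1)
    x[torch.arange(len(degrees)), torch.tensor(degrees)] = 1.0
    return x

# e.g., a 4-node path graph has degrees [1, 2, 2, 1]:
print(one_hot_degree_features([1, 2, 2, 1], max_degree=3))
```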

Evaluations.

For TUDataset, we follow the same data split and evaluation protocol as Xu et al. [7]. We perform 10-fold cross-validation with random splitting and report our results (the average and standard deviation of testing accuracies) at the epoch with the best average accuracy across the 10 folds. For MNIST and ZINC, we follow the same data splits and evaluation metrics as Dwivedi et al. [48]; please refer to Appendix M.1 for more details. The experiments are performed over 4 runs with 4 different seeds, and we report the average and standard deviation of testing results.

Baselines.

We compare our PG-GNN with multiple state-of-the-art baselines: Weisfeiler-Lehman Graph Kernels (WL) [44], Graph SAmple and aggreGatE (GraphSAGE) [5], Gated Graph ConvNet (GatedGCN) [49], Deep Graph Convolutional Neural Network (DGCNN) [45], 3-WL-GNN [22], Invariant Graph Network (IGN) [8], Graph Isomorphism Network (GIN) [7], Provably Powerful Graph Network (PPGN) [28], Ring-GNN [50], Colored Local Iterative Procedure (CLIP) [21], Natural Graph Network (NGN) [10], (Deep-)Local Relation Pooling (LRP) [11], Principal Neighbourhood Aggregation (PNA) [51], Wasserstein Embedding for Graph Learning (WEGL) [46], Simplicial Isomorphism Network (SIN) [37], and Cell Isomorphism Network (CIN) [38].

Results and Analysis.

Tables 3 and 4 present a summary of the results. The results of the baselines in Table 3 are taken from their original papers, except that WL is taken from Xu et al. [7] and IGN from Maron et al. [28] to preserve the same evaluation protocol. The results of the baselines in Table 4 are taken from Dwivedi et al. [48], except that PPGN and Deep-LRP are taken from Chen et al. [11] and PNA from Corso et al. [51]. Our model achieves outstanding performance on most datasets, even outperforming competitive baselines by a considerable margin.

From Tables 3 and 4, we notice that our model significantly outperforms other approaches on all social network datasets, but slightly underperforms the main baselines on molecular datasets such as NCI1 and ZINC. Recall that in Section 3.5, we demonstrate that our model is capable of estimating the number of incidence triangles. The capability of counting incidence triangles benefits our model on graphs with many triangular substructures, e.g., social networks. However, triangles rarely exist in chemical compounds (verified in Table 7) due to their instability in molecular structures, so our model achieves sub-optimal performance on molecular datasets. If we extend the 1-hop neighborhoods to 2-hop (or even $k$-hop) in Eq. (3), our model will exploit more sophisticated substructures such as the pentagon (cyclopentadienyl) and hexagon (benzene ring), which would benefit tasks on molecular graphs but increase the complexity. Thus, we leave this to future work.

Table 4: Results and running times on MNIST and ZINC.
Model             MNIST Accuracy ↑   Time/Epoch    ZINC MAE ↓       Time/Epoch
GraphSAGE [5]     97.31 ± 0.10       113.12s       0.468 ± 0.003    3.74s
GatedGCN [49]     97.34 ± 0.14       128.79s       0.435 ± 0.011    5.76s
GIN [7]           96.49 ± 0.25       39.22s        0.387 ± 0.015    2.29s
3-WL-GNN [22]     95.08 ± 0.96       1523.20s      0.407 ± 0.028    286.23s
Ring-GNN [50]     91.86 ± 0.45       2575.99s      0.512 ± 0.023    327.65s
PPGN [28]         N/A                N/A           0.256 ± 0.054    334.69s
Deep-LRP [11]     N/A                N/A           0.223 ± 0.008    72s
PNA [51]          97.41 ± 0.16       N/A           0.320 ± 0.032    N/A
PG-GNN (Ours)     97.51 ± 0.07       82.60s        0.282 ± 0.011    6.92s

4.3 Running Time Analysis

As discussed above, one of the most important advantages of PG-GNN over other powerful GNNs is its efficiency. To evaluate this, we compare the average running times of PG-GNN and the baselines on two large-scale benchmarks, MNIST and ZINC. Table 4 also presents the average running time per epoch for each model. As shown in Table 4, PG-GNN is significantly faster than other powerful baselines, and even on par with several variants of MPNNs. Thus, we can conclude that our approach outperforms other powerful GNNs in terms of time complexity. We also provide a memory cost analysis in Tables 5 and 6; please refer to Appendix L.2 for more details.

5 Conclusion and Future Work

In this work, we devise an efficient permutation-sensitive aggregation mechanism via permutation groups, capturing pairwise correlations between neighboring nodes while ensuring linear sampling complexity. We shed light on the reasons why permutation-sensitive functions can improve GNNs' expressivity. Moreover, we propose to approximate the property of permutation-invariance to significantly reduce the complexity with a minimal loss of generalization capability. In conclusion, we take an important step towards a better understanding of permutation-sensitive GNNs.

However, Eq. (3) only models a small portion of $n$-ary dependencies while covering all 2-ary dependencies. Although these 2-ary dependencies are invariant to an arbitrary permutation, invariance to higher-order dependencies may not be guaranteed. It would be interesting to extend the 1-hop neighborhoods to 2-hop (or even $k$-hop) in Eq. (3), thereby completely modeling higher-order dependencies and exploiting more sophisticated substructures, which is left for future work.

Acknowledgements

We would like to express our sincere gratitude to the anonymous reviewers for their insightful comments and constructive feedback, which helped us polish this paper better. We thank Suixiang Gao, Xing Xie, Shiming Xiang, Zhiyong Liu, and Hao Zhang for their helpful discussions and valuable suggestions. This work was supported in part by the National Natural Science Foundation of China (61976209, 62020106015), CAS International Collaboration Key Project (173211KYSB20190024), and Strategic Priority Research Program of CAS (XDB32040000).

References

  • Murphy et al. [2019a] Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational pooling for graph representations. In Proceedings of the 36th International Conference on Machine Learning, pages 4663–4673, 2019a.
  • Duvenaud et al. [2015] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
  • Kipf and Welling [2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272, 2017.
  • Hamilton et al. [2017] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • Ying et al. [2018] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.
  • Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • Maron et al. [2019a] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In Proceedings of the 7th International Conference on Learning Representations, 2019a.
  • Kondor et al. [2018] Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.
  • de Haan et al. [2020] Pim de Haan, Taco Cohen, and Max Welling. Natural graph networks. In Advances in Neural Information Processing Systems, pages 3636–3646, 2020.
  • Chen et al. [2020] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count substructures? In Advances in Neural Information Processing Systems, pages 10383–10395, 2020.
  • Thiede et al. [2021] Erik Henning Thiede, Wenda Zhou, and Risi Kondor. Autobahn: Automorphism-based graph neural nets. In Advances in Neural Information Processing Systems, pages 29922–29934, 2021.
  • Zhao et al. [2022] Lingxiao Zhao, Wei Jin, Leman Akoglu, and Neil Shah. From stars to subgraphs: Uplifting any GNN with local structure awareness. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  • Bevilacqua et al. [2022] Beatrice Bevilacqua, Fabrizio Frasca, Derek Lim, Balasubramaniam Srinivasan, Chen Cai, Gopinath Balamurugan, Michael M Bronstein, and Haggai Maron. Equivariant subgraph aggregation networks. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  • Elman [1990] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • Cho et al. [2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Murphy et al. [2019b] Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In Proceedings of the 7th International Conference on Learning Representations, 2019b.
  • Vignac et al. [2020] Clément Vignac, Andreas Loukas, and Pascal Frossard. Building powerful and equivariant graph neural networks with structural message-passing. In Advances in Neural Information Processing Systems, pages 14143–14155, 2020.
  • Loukas [2020] Andreas Loukas. What graph neural networks cannot learn: Depth vs width. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Dasoulas et al. [2020] George Dasoulas, Ludovic Dos Santos, Kevin Scaman, and Aladin Virmaux. Coloring graph neural networks for node disambiguation. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, pages 2126–2132, 2020.
  • Morris et al. [2019] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 4602–4609, 2019.
  • Weisfeiler and Leman [1968] Boris Weisfeiler and Andrei A Leman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsiya, 2(9):12–16, 1968.
  • Sato et al. [2019] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Approximation ratios of graph neural networks for combinatorial problems. In Advances in Neural Information Processing Systems, pages 4081–4090, 2019.
  • Sato et al. [2021] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM International Conference on Data Mining, pages 333–341, 2021.
  • Abboud et al. [2021] Ralph Abboud, İsmail İlkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, pages 2112–2118, 2021.
  • Maron et al. [2019b] Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant networks. In Proceedings of the 36th International Conference on Machine Learning, pages 4363–4371, 2019b.
  • Maron et al. [2019c] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. In Advances in Neural Information Processing Systems, pages 2156–2167, 2019c.
  • Keriven and Peyré [2019] Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems, pages 7092–7101, 2019.
  • Azizian and Lelarge [2021] Waïss Azizian and Marc Lelarge. Expressive power of invariant and equivariant graph neural networks. In Proceedings of the 9th International Conference on Learning Representations, 2021.
  • Garg et al. [2020] Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, pages 3419–3430, 2020.
  • Tahmasebi et al. [2020] Behrooz Tahmasebi, Derek Lim, and Stefanie Jegelka. Counting substructures with higher-order graph neural networks: Possibility and impossibility results. arXiv preprint arXiv:2012.03174, 2020.
  • You et al. [2021] Jiaxuan You, Jonathan Gomes-Selman, Rex Ying, and Jure Leskovec. Identity-aware graph neural networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 10737–10745, 2021.
  • Balcilar et al. [2021] Muhammet Balcilar, Pierre Héroux, Benoit Gaüzère, Pascal Vasseur, Sébastien Adam, and Paul Honeine. Breaking the limits of message passing graph neural networks. In Proceedings of the 38th International Conference on Machine Learning, pages 599–608, 2021.
  • Barceló et al. [2021] Pablo Barceló, Floris Geerts, Juan Reutter, and Maksimilian Ryschkov. Graph neural networks with local graph parameters. In Advances in Neural Information Processing Systems, pages 25280–25293, 2021.
  • Bouritsas et al. [2022] Giorgos Bouritsas, Fabrizio Frasca, Stefanos P Zafeiriou, and Michael M Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Bodnar et al. [2021a] Cristian Bodnar, Fabrizio Frasca, Yuguang Wang, Nina Otter, Guido F Montúfar, Pietro Liò, and Michael Bronstein. Weisfeiler and Lehman go topological: Message passing simplicial networks. In Proceedings of the 38th International Conference on Machine Learning, pages 1026–1037, 2021a.
  • Bodnar et al. [2021b] Cristian Bodnar, Fabrizio Frasca, Nina Otter, Yu Guang Wang, Pietro Liò, Guido F Montúfar, and Michael Bronstein. Weisfeiler and Lehman go cellular: CW networks. In Advances in Neural Information Processing Systems, pages 2625–2640, 2021b.
  • Battaglia et al. [2018] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Al Hasan and Dave [2018] Mohammad Al Hasan and Vachik S Dave. Triangle counting in large networks: A review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2):e1226, 2018.
  • Mitzenmacher and Upfal [2017] Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press, 2017.
  • Brijder et al. [2019] Robert Brijder, Floris Geerts, Jan Van Den Bussche, and Timmy Weerwag. On the expressive power of query languages for matrices. ACM Transactions on Database Systems, 44(4):1–31, 2019.
  • Geerts [2021] Floris Geerts. On the expressive power of linear algebra on graphs. Theory of Computing Systems, 65(1):179–239, 2021.
  • Shervashidze et al. [2011] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(77):2539–2561, 2011.
  • Zhang et al. [2018] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 4438–4445, 2018.
  • Kolouri et al. [2021] Soheil Kolouri, Navid Naderializadeh, Gustavo K Rohde, and Heiko Hoffmann. Wasserstein embedding for graph learning. In Proceedings of the 9th International Conference on Learning Representations, 2021.
  • Morris et al. [2020] Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+), 2020. URL www.graphlearning.io.
  • Dwivedi et al. [2020] Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
  • Bresson and Laurent [2017] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.
  • Chen et al. [2019] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with GNNs. In Advances in Neural Information Processing Systems, pages 15894–15902, 2019.
  • Corso et al. [2020] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems, pages 13260–13271, 2020.
  • West [2001] Douglas Brent West. Introduction to graph theory. Prentice Hall, 2001.
  • Artin [2011] Michael Artin. Algebra. Pearson Prentice Hall, 2011.
  • Birkhoff and Mac Lane [2017] Garrett Birkhoff and Saunders Mac Lane. A survey of modern algebra. CRC Press, 2017.
  • Grohe [2017] Martin Grohe. Descriptive complexity, canonisation, and definable graph structure theory, volume 47. Cambridge University Press, 2017.
  • Cai et al. [1992] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
  • Deo [2017] Narsingh Deo. Graph theory with applications to engineering and computer science. Courier Dover Publications, 2017.
  • Harary and Manvel [1971] Frank Harary and Bennet Manvel. On the number of cycles in a graph. Matematickỳ časopis, 21(1):55–63, 1971.
  • Chung et al. [2012] Kai-Min Chung, Henry Lam, Zhenming Liu, and Michael Mitzenmacher. Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified. In Proceedings of the 29th International Symposium on Theoretical Aspects of Computer Science, pages 124–135, 2012.
  • Haykin [2010] Simon Haykin. Neural networks and learning machines, 3/E. Pearson Education India, 2010.
  • Chen et al. [2016] Xiaowei Chen, Yongkun Li, Pinghui Wang, and John CS Lui. A general framework for estimating graphlet statistics via random walk. Proceedings of the VLDB Endowment, 10(3):253–264, 2016.
  • Hardiman and Katzir [2013] Stephen J Hardiman and Liran Katzir. Estimating clustering coefficients and size of social networks via random walk. In Proceedings of the 22nd International Conference on World Wide Web, pages 539–550, 2013.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Achanta et al. [2012] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
  • Irwin et al. [2012] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456, 2015.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
  • Hagberg et al. [2008] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference, pages 11–15, 2008.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
  • Wang et al. [2019] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. Deep graph library: Towards efficient and scalable deep learning on graphs. ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Appendix A Background on Graph Theory

Given a graph G=(𝒱,)G=(\mathcal{V},\mathcal{E}), a walk in GG is a finite sequence of alternating vertices and edges such as v0,e1,v1,e2,,em,vmv_{0},e_{1},v_{1},e_{2},\ldots,e_{m},v_{m}, where each edge ei=(vi1,vi)e_{i}=(v_{i-1},v_{i}). A walk may have repeated edges. A trail is a walk in which all the edges are distinct. A path is a trail in which all vertices (hence all edges) are distinct (except, possibly, v0=vmv_{0}=v_{m}). A trail or path is closed if v0=vmv_{0}=v_{m}, and a closed path containing at least one edge is a cycle [52].

A Hamiltonian path is a path in a graph that passes through each vertex exactly once. A Hamiltonian cycle is a cycle in a graph that passes through each vertex exactly once. A Hamiltonian graph is a graph that contains a Hamiltonian cycle.

Let G=(𝒱,)G=(\mathcal{V},\mathcal{E}) and G=(𝒱,)G^{\prime}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime}) be graphs. If GGG^{\prime}\subseteq G and GG^{\prime} contains all the edges (vi,vj)(v_{i},v_{j})\in\mathcal{E} with vi,vj𝒱v_{i},v_{j}\in\mathcal{V}^{\prime}, then GG^{\prime} is an induced subgraph of GG, and we say that 𝒱\mathcal{V}^{\prime} induces GG^{\prime} in GG.

An empty graph is a graph whose edge-set is empty. A regular graph is a graph in which each vertex has the same degree. If each vertex has degree rr, the graph is rr-regular. A strongly regular graph in the family SRG(v,r,λ,μv,r,\lambda,\mu) is an rr-regular graph with vv vertices, where every two adjacent vertices have λ\lambda common neighbors, and every two non-adjacent vertices have μ\mu common neighbors.

A complete graph is a simple undirected graph in which every pair of distinct vertices is adjacent. We denote the complete graph on nn vertices by KnK_{n}. A tournament is a directed graph in which each edge of a complete graph is given an orientation. We denote the tournament on nn vertices by Kn\vec{K}_{n}. A clique of a graph GG is a complete induced subgraph of GG. A clique of size kk is called a kk-clique.

The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph). The local clustering coefficient c_{v} of a vertex v is given by the number of links between the n vertices within its neighborhood \mathcal{N}(v) divided by the number of links that could possibly exist between them, defined as c_{v}=\dfrac{2\left|\left\{e_{ij}:i,j\in\mathcal{N}(v),e_{ij}\in\mathcal{E}\right\}\right|}{n(n-1)}. This measure is 1 if every neighbor connected to v is also connected to every other vertex within the neighborhood.
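To make the definition concrete, the following minimal Python sketch computes c_{v} directly from the formula above; the neighbor-set dictionary and the toy graph are illustrative assumptions rather than part of the original text.

# A minimal sketch of the local clustering coefficient c_v, assuming the graph is
# given as a dict mapping each vertex to the set of its neighbors (hypothetical input).
def local_clustering(adj, v):
    nbrs = adj[v]
    n = len(nbrs)
    if n < 2:
        return 0.0
    # Count the links that actually exist between the neighbors of v.
    links = sum(1 for i in nbrs for j in nbrs if i < j and j in adj[i])
    return 2.0 * links / (n * (n - 1))

# Example: a triangle {0, 1, 2} plus a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(adj, 0))  # 1/3: only one of the three possible neighbor pairs is linked
print(local_clustering(adj, 1))  # 1.0: the single neighbor pair (0, 2) is linked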

Let G=(𝒱,)G=(\mathcal{V},\mathcal{E}) and G=(𝒱,)G^{\prime}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime}) be graphs. An isomorphism ϑ:𝒱𝒱\vartheta:\mathcal{V}\to\mathcal{V}^{\prime} between GG and GG^{\prime} is a bijective map that maps pairs of connected vertices to pairs of connected vertices, and likewise for pairs of non-connected vertices, i.e., (ϑ(u),ϑ(v))(\vartheta(u),\vartheta(v))\in\mathcal{E}^{\prime} iff (u,v)(u,v)\in\mathcal{E} for all uu and vv in 𝒱\mathcal{V}.

Appendix B Background on Group Theory

Since we deal with finite sets in this paper, all the following definitions are about finite groups.

For an arbitrary element xx in a group 𝔊\mathfrak{G}, the order of xx is the smallest positive integer nn such that xn=ex^{n}=e, where ee is the identity element. H={e,x,x2,,xn1}H=\{e,x,x^{2},\ldots,x^{n-1}\} is the cyclic subgroup generated by xx and is often denoted by H=xH=\left\langle x\right\rangle. A cyclic group is a group that is equal to one of its cyclic subgroups: 𝔊=g\mathfrak{G}=\left\langle g\right\rangle for some element gg, and the element gg is called a generator. The cyclic group with nn elements is denoted by n\mathbb{Z}_{n} [53, 54].

A permutation of a finite set S is a bijective map from S to itself. In Cauchy's two-line notation, a permutation is written by listing the "natural" order of the n elements of S in the first row and, below each element, its image in the second row: \sigma=\left(\begin{matrix}1&2&\cdots&n\\ \sigma(1)&\sigma(2)&\cdots&\sigma(n)\\ \end{matrix}\right). A cycle of length r (or r-cycle) is a permutation \sigma for which there exist elements i_{1},i_{2},\ldots,i_{r} in \{1,2,\ldots,n\} such that \sigma(i_{1})=i_{2},\sigma(i_{2})=i_{3},\ldots,\sigma(i_{r-1})=i_{r},\sigma(i_{r})=i_{1}, and these are the only elements moved by \sigma. In cycle notation, such an r-cycle is written as (i_{1}\ i_{2}\ \cdots\ i_{r}).

A permutation group is a group whose elements are permutations of a given set SS, with the group operation “\circ” being the composition of permutations. The permutation group on the set SS is denoted by Perm(S)Perm(S). A symmetric group is a group whose elements are all permutations of a given set SS. The symmetric group on the set S=[n]={1,2,,n}S=[n]=\left\{1,2,\ldots,n\right\} is denoted by 𝒮n\mathcal{S}_{n} [53]. Every permutation group is a subgroup of a symmetric group.
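As a small illustration of these definitions, the sketch below (plain Python; the tuple-based encoding of permutations is an assumption for illustration) composes permutations and enumerates the cyclic subgroup \left\langle\sigma\right\rangle generated by a permutation.

# Compose permutations and generate the cyclic subgroup <sigma>.
def compose(p, q):
    # (p o q)(i) = p(q(i)); permutations are 0-indexed tuples.
    return tuple(p[q[i]] for i in range(len(p)))

def cyclic_subgroup(sigma):
    identity = tuple(range(len(sigma)))
    elements, g = [identity], sigma
    while g != identity:
        elements.append(g)
        g = compose(sigma, g)
    return elements  # [e, sigma, sigma^2, ...]

sigma = (1, 2, 0)                   # the 3-cycle (1 2 3) in cycle notation
print(len(cyclic_subgroup(sigma)))  # 3, i.e., <sigma> is isomorphic to Z_3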

A group action α\alpha of a group 𝔊\mathfrak{G} on a set SS is a map α:𝔊×SS\alpha:\mathfrak{G}\times S\to S, denoted by (g,s)gs(g,s)\mapsto gs (with α(g,s)\alpha(g,s) often shortened to gsgs or gsg\cdot s) that satisfies the following two axioms:

  • a)

    identity: e\cdot s=s, for all s\in S, where e is the identity element of \mathfrak{G}.

  • b)

    associative law: (g_{1}\circ g_{2})\cdot s=g_{1}\cdot(g_{2}\cdot s), for all g_{1},g_{2}\in\mathfrak{G} and s\in S, where \circ denotes the operation or composition in \mathfrak{G}.

Let 𝔊\mathfrak{G} and 𝔊\mathfrak{G}^{\prime} be groups. A homomorphism φ:𝔊𝔊\varphi:\mathfrak{G}\to\mathfrak{G}^{\prime} is a map from 𝔊\mathfrak{G} to 𝔊\mathfrak{G}^{\prime} such that φ(ab)=φ(a)φ(b)\varphi(ab)=\varphi(a)\varphi(b) for all aa and bb in 𝔊\mathfrak{G}. An isomorphism φ:𝔊𝔊\varphi:\mathfrak{G}\to\mathfrak{G}^{\prime} from 𝔊\mathfrak{G} to 𝔊\mathfrak{G}^{\prime} is a bijective group homomorphism - a bijective map such that φ(ab)=φ(a)φ(b)\varphi(ab)=\varphi(a)\varphi(b) for all aa and bb in 𝔊\mathfrak{G} [53]. We use the symbol \cong to denote two groups 𝔊\mathfrak{G} and 𝔊\mathfrak{G}^{\prime} are isomorphic, i.e., 𝔊𝔊\mathfrak{G}\cong\mathfrak{G}^{\prime}.

Appendix C Definition of k-WL Test

There are different definitions of the k-dimensional Weisfeiler-Lehman (k-WL) test for k\geq 2; in this work, we follow the definition in Chen et al. [11]. Note that the k-WL test here is equivalent to the k-WL tests in [55, 22, 28, 30] and to the (k-1)-WL test in [56] (Grohe [55] calls this version k-WL). The (k+1)-WL test has been proven to be strictly more powerful than the k-WL test [56].

The k-WL algorithm is a generalization of the 1-WL: it colors tuples from \mathcal{V}^{k} instead of nodes. For any k-tuple s=(i_{1},\ldots,i_{k})\in\mathcal{V}^{k} and each j\in[k]=\{1,\ldots,k\}, define the j-th neighborhood

\mathcal{N}_{j}(s)=\{(i_{1},\ldots,i_{j-1},u,i_{j+1},\ldots,i_{k})\ |\ u\in\mathcal{V}\}

That is, the jj-th neighborhood 𝒩j(s)\mathcal{N}_{j}(s) of the kk-tuple ss is obtained by replacing the jj-th component of ss with every node from 𝒱\mathcal{V}.

Given a pair of graphs G and G^{\prime}, we use the k-WL algorithm to test them for isomorphism. Suppose that the two graphs have the same number of vertices, since otherwise they can be told apart easily. Without loss of generality, we assume that they share the same set of vertex indices, \mathcal{V} (but may differ in \mathcal{E}). The k-WL test proceeds with the following coloring procedure.

  • 1)

    For each of the graphs, at iteration 0, the test assigns an initial color in the color space Γ\Gamma to each kk-tuple according to its atomic type, i.e., two kk-tuples ss and ss^{\prime} in 𝒱k\mathcal{V}^{k} get the same color if the subgraphs induced from nodes of ss and ss^{\prime} are isomorphic.

  • 2)

    In each iteration t>0t>0, the test computes a kk-tuple coloring ck(t):𝒱kΓc_{k}^{(t)}:\mathcal{V}^{k}\to\Gamma. More specifically, let ck(t)(s)c_{k}^{(t)}(s) denote the color of ss in GG assigned at the tt-th iteration, and let ck(t)(s){c^{\prime}}_{k}^{(t)}(s^{\prime}) denote the color assigned for ss^{\prime} in GG^{\prime}. Define

    C_{j}^{(t)}(s) =\text{HASH}_{1}^{(t)}\left(\left\{c_{k}^{(t-1)}(w)\ \Big{|}\ w\in\mathcal{N}_{j}(s)\right\}\right)
    {C^{\prime}}_{j}^{(t)}(s^{\prime}) =\text{HASH}_{1}^{(t)}\left(\left\{{c^{\prime}}_{k}^{(t-1)}(w^{\prime})\ \Big{|}\ w^{\prime}\in\mathcal{N}_{j}(s^{\prime})\right\}\right)

    where HASH1(t)\text{HASH}_{1}^{(t)} is a hash function that maps injectively from the space of multisets of colors to some intermediate space. Then let

    c_{k}^{(t)}(s) =\text{HASH}_{2}^{(t)}\left(\left(c_{k}^{(t-1)}(s),\left(C_{1}^{(t)}(s),\ldots,C_{k}^{(t)}(s)\right)\right)\right)
    {c^{\prime}}_{k}^{(t)}(s^{\prime}) =\text{HASH}_{2}^{(t)}\left(\left({c^{\prime}}_{k}^{(t-1)}(s^{\prime}),\left({C^{\prime}}_{1}^{(t)}(s^{\prime}),\ldots,{C^{\prime}}_{k}^{(t)}(s^{\prime})\right)\right)\right)

    where \text{HASH}_{2}^{(t)} maps injectively from its input space to the color space \Gamma; c_{k}^{(t)}(s) and {c^{\prime}}_{k}^{(t)}(s^{\prime}) are updated iteratively in this way.

  • 3)

    The test will terminate and return the result that the two graphs are not isomorphic if the following two multisets differ at some iteration tt:

    \left\{c_{k}^{(t)}(s)\ \Big{|}\ s\in\mathcal{V}^{k}\right\}\neq\left\{{c^{\prime}}_{k}^{(t)}(s^{\prime})\ \Big{|}\ s^{\prime}\in\mathcal{V}^{k}\right\}

For the detailed difference between kk-WL test here and (k1)(k-1)-WL test in Cai et al. [56] (kk-WL in Grohe [55]), see Remark 3.5.9 in Grohe [55].
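For intuition, the following simplified Python sketch mimics the coloring procedure above for k=2 on small graphs; the edge lists, the use of hashable tuples as colors in place of an explicit injective HASH function, and the fixed number of iterations are all illustrative assumptions. It reproduces the classical fact that the 2-WL test cannot distinguish a 6-cycle from two disjoint triangles.

# A simplified sketch of the k-WL coloring procedure (k = 2) on small undirected graphs.
from itertools import product

def k_wl_colors(n, edges, k=2, iters=3):
    E = {frozenset(e) for e in edges}
    # Initial color: the atomic type of each k-tuple (equality and adjacency pattern).
    def atomic(s):
        eq = tuple(int(s[i] == s[j]) for i in range(k) for j in range(k))
        adj = tuple(int(frozenset((s[i], s[j])) in E) for i in range(k) for j in range(k) if i != j)
        return (eq, adj)
    c = {s: atomic(s) for s in product(range(n), repeat=k)}
    for _ in range(iters):
        new_c = {}
        for s in c:
            neigh = tuple(
                tuple(sorted(c[s[:j] + (u,) + s[j + 1:]] for u in range(n)))  # multiset over N_j(s)
                for j in range(k))
            new_c[s] = (c[s], neigh)
        c = new_c
    return sorted(c.values())  # the multiset of final tuple colors

# Two graphs are reported non-isomorphic if their color multisets differ at some iteration.
g1 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]   # one 6-cycle
g2 = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]   # two disjoint triangles
print(k_wl_colors(6, g1) == k_wl_colors(6, g2))         # True: 2-WL cannot tell them apart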

Appendix D Distinguishing Non-Isomorphic Graph Pairs: Permutation-Sensitive vs. Permutation-Invariant Aggregation Functions

Figure 3: A pair of non-isomorphic graphs that cannot be distinguished by permutation-invariant aggregation functions, but can be easily distinguished by permutation-sensitive aggregation functions.

Let f be an arbitrary aggregation function. For a node v, let x_{v} denote the initial node feature (b for blue nodes, g for green nodes) and h_{v} the feature transformed by f. In the initial stage, we have:

x_{u_{1}}=x_{u_{2}}=x_{u_{5}}=x_{u_{6}}=b,~x_{u_{3}}=x_{u_{4}}=g
x_{v_{1}}=x_{v_{2}}=x_{v_{5}}=x_{v_{6}}=b,~x_{v_{3}}=x_{v_{4}}=g

Figure 3 illustrates a pair of non-isomorphic graphs that the 2-WL test and most permutation-invariant aggregation functions fail to distinguish. Suppose f is permutation-invariant; we take the sum aggregator SUM(\cdot) as an example to illustrate this process. After the first round of iteration, the transformed feature of each node is:

h_{u_{1}}=h_{u_{2}}=h_{u_{5}}=h_{u_{6}}=b+g,~h_{u_{3}}=h_{u_{4}}=2b+g
h_{v_{1}}=h_{v_{2}}=h_{v_{5}}=h_{v_{6}}=b+g,~h_{v_{3}}=h_{v_{4}}=2b+g

We can find that the distributions of node features of these two graphs are the same. Similarly, after each round of iteration, these two graphs always produce the same distributions of node features. Hence we can conclude that the 2-WL test and the permutation-invariant function SUM()(\cdot) fail to distinguish these two graphs.

In contrast, suppose f is permutation-sensitive; we take a generic permutation-sensitive aggregator h^{(t)}=k\cdot h^{(t-1)}+x^{(t)} as an example to illustrate its process. Here x^{(t)} is the t-th input node feature, h^{(t)} is the corresponding transformed feature with h^{(0)}=0, and the learnable parameter k>1 measures the pairwise correlation between x^{(t-1)} and x^{(t)}. For the left graph G_{1}, we focus on node u_{3}. Let the input ordering of neighboring nodes be u_{1},u_{4},u_{5}, i.e., x_{u_{3}}^{(1)}\to x_{u_{3}}^{(2)}\to x_{u_{3}}^{(3)}=b\to g\to b; then f only encodes the pairwise correlation between b and g. Thus, we have

h_{u_{3}}^{(1)} =k\cdot 0+b=b
h_{u_{3}}^{(2)} =k\cdot b+g=kb+g
h_{u_{3}}^{(3)} =k\cdot(kb+g)+b=(k^{2}+1)b+kg

For the right graph G2G_{2}, we focus on node v3v_{3}. Let the input ordering of neighboring nodes be v1,v2,v4v_{1},v_{2},v_{4}, i.e., xv3(1)xv3(2)xv3(3)=bbgx_{v_{3}}^{(1)}\to x_{v_{3}}^{(2)}\to x_{v_{3}}^{(3)}=b\to b\to g, then ff also encodes the pairwise correlation between bb and bb. Thus, we have

h_{v_{3}}^{(1)} =k\cdot 0+b=b
h_{v_{3}}^{(2)} =k\cdot b+b=kb+b
h_{v_{3}}^{(3)} =k\cdot(kb+b)+g=(k^{2}+k)b+g

After the first round of iteration, the node feature hu3(3)h_{u_{3}}^{(3)} of u3u_{3} differs from the hv3(3)h_{v_{3}}^{(3)} of v3v_{3}. Hence we can conclude that the permutation-sensitive aggregation function ff can distinguish these two graphs. Moreover, the weight ratio of bb and gg in hu3(3)h_{u_{3}}^{(3)} is (k2+1):k(k^{2}+1):k, which is smaller than that in hv3(3)h_{v_{3}}^{(3)}, i.e., (k2+k):1(k^{2}+k):1. This fact indicates that, in G1G_{1}, ff focuses more on encoding the pairwise correlation between bb and gg. In contrast, in G2G_{2}, ff focuses more on encoding the pairwise correlation between bb and bb, thereby exploiting the triangular substructure such as v1v3v2\triangle v_{1}v_{3}v_{2}. It is worth noting that when k=1k=1, the function ff is h(t)=h(t1)+x(t)h^{(t)}=h^{(t-1)}+x^{(t)} and degenerates to the permutation-invariant function SUM()(\cdot), resulting in hu3(3)=hv3(3)=2b+gh_{u_{3}}^{(3)}=h_{v_{3}}^{(3)}=2b+g.
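The computation above can be reproduced with a few lines of Python; the scalar values chosen for b and g below are illustrative assumptions.

# A small numeric sketch of the worked example: the permutation-sensitive aggregator
# h_t = k*h_{t-1} + x_t separates the two neighborhoods, while the k = 1 case
# degenerates to SUM and does not.
def aggregate(seq, k):
    h = 0.0
    for x in seq:
        h = k * h + x
    return h

b, g = 1.0, 2.0                      # any distinct feature values work
for k in (1.0, 1.5):
    h_u3 = aggregate([b, g, b], k)   # node u3 in G1: neighbors ordered u1, u4, u5
    h_v3 = aggregate([b, b, g], k)   # node v3 in G2: neighbors ordered v1, v2, v4
    print(k, h_u3, h_v3, h_u3 == h_v3)
# k = 1.0 -> both equal 2b + g; k = 1.5 -> (k^2+1)b + kg differs from (k^2+k)b + g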

Appendix E Proof of Theorem 1

Theorem 1. Let n(n4)n(n\geq 4) denote the number of 1-hop neighboring nodes around the central node vv. There are (n1)/2\lfloor(n-1)/2\rfloor kinds of arrangements in total, satisfying that their corresponding 2-ary dependencies are disjoint. Meanwhile, after at least n/2\lfloor n/2\rfloor arrangements (including the initial one), all 2-ary dependencies have been covered at least once.

Proof.

Construct a simple undirected graph G=(𝒱,)G^{\prime}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime}), where 𝒱\mathcal{V}^{\prime} denotes the nn neighboring nodes (abbreviated as nodes in the following) around the central node vv, and \mathcal{E}^{\prime} represents an edge set in which each edge indicates the corresponding 2-ary dependency has been covered in some arrangements. Thus, each arrangement corresponds to a Hamiltonian cycle in graph GG^{\prime}. For any two arrangements, detecting whether their corresponding 2-ary dependencies are disjoint can be analogous to finding two edge-disjoint Hamiltonian cycles. Since every pair of nodes can form a 2-ary dependency, the first problem can be translated into finding the maximum number of edge-disjoint Hamiltonian cycles in a complete graph KnK_{n}, and the second problem can be translated into finding the minimum number of Hamiltonian cycles to cover a complete graph KnK_{n}.

Figure 4: The initial arrangements (following the gray solid lines). (a) n is odd; (b) n is even; (c) our revision when n is even.

Since a KnK_{n} has n(n1)2\frac{n(n-1)}{2} edges and each Hamiltonian cycle has nn edges, there are at most n(n1)2/n=n12\lfloor\frac{n(n-1)}{2}/n\rfloor=\lfloor\frac{n-1}{2}\rfloor edge-disjoint Hamiltonian cycles in a KnK_{n}. In addition, we can specifically construct n12\lfloor\frac{n-1}{2}\rfloor edge-disjoint Hamiltonian cycles as follows. If nn is odd, keep the nodes fixed on a circle with node 1 at the center, rotate the node numbers on the circle clockwise by 360n1,2×360n1,,n32×360n1\frac{360^{\circ}}{n-1},2\times\frac{360^{\circ}}{n-1},\ldots,\frac{n-3}{2}\times\frac{360^{\circ}}{n-1}, while the graph structure always remains unchanged as the initial arrangement shown in Figure 4(a). Each rotation can be formulated as the following permutation σ\sigma^{\prime}:

\sigma^{\prime}=\left\{\begin{aligned} &\left(\begin{matrix}1&2&3&4&5&\cdots&n-1&n\\ 1&4&2&6&3&\cdots&n&n-2\end{matrix}\right)=\left(2\ 4\ 6\ \cdots\ n-1\ n\ n-2\ \cdots\ 7\ 5\ 3\right),&&\text{if}~n~\text{is odd},\\ &\left(\begin{matrix}1&2&3&4&5&\cdots&n-1&n\\ 1&4&2&6&3&\cdots&n-3&n-1\end{matrix}\right)=\left(2\ 4\ 6\ \cdots\ n-2\ n\ n-1\ \cdots\ 7\ 5\ 3\right),&&\text{if}~n~\text{is even}.\end{aligned}\right.

Observe that each rotation generates a new Hamiltonian cycle containing completely different edges from before. Thus we have n32=n121\frac{n-3}{2}=\lfloor\frac{n-1}{2}\rfloor-1 new Hamiltonian cycles with all edges disjoint from the ones in Figure 4(a) and among themselves [57]. If nn is even, the node arrangement can be initialized as shown in Figure 4(b), and n42=n121\frac{n-4}{2}=\lfloor\frac{n-1}{2}\rfloor-1 new Hamiltonian cycles can be constructed successively in a similar way. We thus conclude that there are n12\lfloor\frac{n-1}{2}\rfloor kinds of arrangements in total, satisfying that their corresponding 2-ary dependencies are disjoint.

Furthermore, if n is odd, K_{n} has \frac{n(n-1)}{2} edges, which is divisible by the length n of each Hamiltonian cycle. Therefore, we can exactly cover all edges by the above \lfloor\frac{n-1}{2}\rfloor=\frac{n-1}{2}=\lfloor\frac{n}{2}\rfloor kinds of arrangements. In contrast, if n is even, K_{n} has \frac{n(n-1)}{2} edges, which is not divisible by the length n of each Hamiltonian cycle, leaving \frac{n}{2} edges uncovered by the above \lfloor\frac{n-1}{2}\rfloor=\frac{n-2}{2} kinds of arrangements. Thus we continue to perform the permutation \sigma^{\prime} once more, i.e., \lfloor\frac{n-1}{2}\rfloor+1=\frac{n}{2}=\lfloor\frac{n}{2}\rfloor kinds of arrangements in total, to cover all edges, at the cost of \frac{n}{2} edges being covered twice.

As discussed in the main body, these n2\lfloor\frac{n}{2}\rfloor arrangements and the corresponding n2\lfloor\frac{n}{2}\rfloor Hamiltonian cycles are modeled by the permutation-sensitive function in a directed manner. In addition, we also expect to reverse these n2\lfloor\frac{n}{2}\rfloor directed Hamiltonian cycles by performing the permutation σ\sigma^{\prime} successively, thereby transforming them into an undirected manner. However, σ\sigma^{\prime} cannot satisfy this requirement if nn is even. Thus, we propose to revise the permutation σ\sigma^{\prime} into the following one:

\sigma=\left\{\begin{aligned} &\left(\begin{matrix}1&2&3&4&5&\cdots&n-1&n\\ 1&4&2&6&3&\cdots&n&n-2\end{matrix}\right)=\left(2\ 4\ 6\ \cdots\ n-1\ n\ n-2\ \cdots\ 7\ 5\ 3\right),&&\text{if}~n~\text{is odd},\\ &\left(\begin{matrix}1&2&3&4&\cdots&n-1&n\\ 3&1&5&2&\cdots&n&n-2\end{matrix}\right)=\left(1\ 3\ 5\ \cdots\ n-1\ n\ n-2\ \cdots\ 6\ 4\ 2\right),&&\text{if}~n~\text{is even}.\end{aligned}\right.

where \sigma is the same as \sigma^{\prime} when n is odd, but slightly different when n is even. If n is even, \sigma is an n-cycle, whereas \sigma^{\prime} is an (n-1)-cycle. The corresponding initial node arrangement after revision is shown in Figure 4(c). After adding a virtual node 0 at the center in Figure 4(c), \sigma becomes the same as \sigma^{\prime} with n+1 nodes in Figure 4(a), which can cover all edges with \lfloor\frac{(n+1)-1}{2}\rfloor=\lfloor\frac{n}{2}\rfloor kinds of arrangements. Moreover, after performing \sigma for n times in succession, it covers the complete graph bi-directionally, whereas \sigma^{\prime} fails to do so.

In conclusion, after performing σ\sigma or σ\sigma^{\prime} for n21\lfloor\frac{n}{2}\rfloor-1 times in succession (excluding the initial one), all 2-ary dependencies have been covered at least once. \Box
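The construction in this proof can be checked numerically. The sketch below builds \sigma from its cycle notation and verifies that \lfloor n/2\rfloor successive arrangements cover every 2-ary dependency; as an assumption for illustration, the initial arrangement is taken to be the natural ring u_{1}\to u_{2}\to\cdots\to u_{n}\to u_{1} rather than the exact layouts of Figure 4.

# Build sigma (Eq. (2)) from its cycle notation and check pair coverage on K_n.
def build_sigma(n):
    if n % 2 == 1:
        cyc = list(range(2, n, 2)) + [n] + list(range(n - 2, 1, -2))  # node 1 stays fixed
    else:
        cyc = list(range(1, n, 2)) + [n] + list(range(n - 2, 1, -2))
    sigma = {1: 1} if n % 2 == 1 else {}
    for i, x in enumerate(cyc):
        sigma[x] = cyc[(i + 1) % len(cyc)]
    return sigma

def covered_pairs(n):
    sigma = build_sigma(n)
    arrangement = list(range(1, n + 1))        # illustrative initial ring u_1, ..., u_n
    covered = set()
    for _ in range(n // 2):                    # floor(n/2) arrangements in total
        ring = arrangement + [arrangement[0]]
        covered |= {frozenset(e) for e in zip(ring, ring[1:])}
        arrangement = [sigma[x] for x in arrangement]
    return covered

for n in (5, 6, 7, 8):
    print(n, len(covered_pairs(n)) == n * (n - 1) // 2)   # True: all 2-ary dependencies covered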

Appendix F Proof of Lemma 2

Theorem F.1.

The order of any permutation is the least common multiple of the lengths of its disjoint cycles [54].

Proposition F.2.

The order of a cyclic group is equal to the order of its generator [53].

Using Theorem F.1 and Proposition F.2, we prove Lemma 2 as follows.

Lemma 2. For the permutation \sigma of n indices, \mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-2}\} is a permutation group isomorphic to the cyclic group \mathbb{Z}_{n-1} if n is odd. And \mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-1}\} is a permutation group isomorphic to the cyclic group \mathbb{Z}_{n} if n is even.

Proof.

If nn is odd, we find the order of permutation σ\sigma first. Since

\sigma=\left(\begin{matrix}1&2&3&4&5&\cdots&n-1&n\\ 1&4&2&6&3&\cdots&n&n-2\end{matrix}\right)=\left(1\right)\left(2\ 4\ 6\ \cdots\ n-1\ n\ n-2\ \cdots\ 7\ 5\ 3\right)

Let π1=(1)\pi_{1}=\left(1\right), π2=(2 4 6n1nn2 7 5 3)\pi_{2}=\left(2\ 4\ 6\ \cdots\ n-1\ n\ n-2\ \cdots\ 7\ 5\ 3\right), then the permutation σ\sigma can be represented as the product of these two disjoint cycles, i.e., σ=π1π2\sigma=\pi_{1}\pi_{2}. Here π1\pi_{1} is a 11-cycle of length 11, π2\pi_{2} is an (n1)(n-1)-cycle of length n1n-1. Using Theorem F.1, the order of permutation σ\sigma is the least common multiple of 11 and n1n-1: lcm(1,n1)=n1\text{lcm}(1,n-1)=n-1, which indicates that σn1=e\sigma^{n-1}=e. Therefore, 𝔊={e,σ,σ2,,σn2}\mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-2}\} is a permutation group generated by σ\sigma, i.e., 𝔊=σ\mathfrak{G}=\left\langle\sigma\right\rangle. According to the definition of the cyclic group (see Appendix B), 𝔊\mathfrak{G} is isomorphic to a cyclic group. By Proposition F.2, the order of group 𝔊=σ\mathfrak{G}=\left\langle\sigma\right\rangle is equal to the order of its generator σ\sigma, i.e., n1n-1. Thus, 𝔊={e,σ,σ2,,σn2}\mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-2}\} is a permutation group isomorphic to the cyclic group n1\mathbb{Z}_{n-1}.

Similarly, we can prove that 𝔊={e,σ,σ2,,σn1}\mathfrak{G}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{n-1}\} is a permutation group isomorphic to the cyclic group n\mathbb{Z}_{n} if nn is even. \Box

Theorem F.3 (Cayley’s Theorem).

Every finite group is isomorphic to a permutation group [53].

The conclusion of Lemma 2 is also consistent with Cayley's Theorem, one of the most fundamental theorems in group theory.

Appendix G Proof of Corollary 3 and the Diagram of Group Action

Corollary 3. The map \alpha:\mathfrak{G}\times S\to S denoted by (g,s)\mapsto gs is a group action of \mathfrak{G} on S.

Proof.

Let ee be the identity element of 𝔊\mathfrak{G} and idσid_{\sigma} be the identity permutation. And let \circ denote the composition in 𝔊\mathfrak{G}. For all σi,σj𝔊\sigma^{i},\sigma^{j}\in\mathfrak{G} and sSs\in S, we have

\alpha(e,s)=e\cdot s=id_{\sigma}\cdot s=s
\alpha(\sigma^{i}\sigma^{j},s)=(\sigma^{i}\circ\sigma^{j})\cdot s=\sigma^{i}\cdot(\sigma^{j}\cdot s)=\alpha(\sigma^{i},\alpha(\sigma^{j},s))

Thus, the map α\alpha defines a group action of the permutation group 𝔊\mathfrak{G} on the set SS. \Box

Figure 5: The group structure of the permutation group \mathfrak{G} and the results of its actions on the set S. (a) \mathfrak{G}_{1}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{5}\}\cong\mathbb{Z}_{6}; (b) \mathfrak{G}_{2}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{7}\}\cong\mathbb{Z}_{8}.

To better understand Lemma 2 and Corollary 3, we provide diagrams to illustrate the group actions of the permutation groups 𝔊1={e,σ,σ2,,σ5}\mathfrak{G}_{1}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{5}\} and 𝔊2={e,σ,σ2,,σ7}\mathfrak{G}_{2}=\{e,\sigma,\sigma^{2},\ldots,\sigma^{7}\} when n=7n=7 and n=8n=8, respectively. As shown in Figures 5(a) and 5(b), the overall frameworks with big light-gray circles and cyan arrows represent the Cayley diagrams of the permutation groups 𝔊16\mathfrak{G}_{1}\cong\mathbb{Z}_{6} and 𝔊28\mathfrak{G}_{2}\cong\mathbb{Z}_{8} constructed by Lemma 2, respectively. The center of each subfigure presents the corresponding generator σ\sigma. Each big light-gray circle represents an element gg (i.e., a permutation) of group 𝔊\mathfrak{G}, marked at the center of the circle. And each cyan arrow gigjg_{i}\to g_{j} indicates the relationship gj=giσg_{j}=g_{i}\circ\sigma exists between two group elements gi,gj𝔊g_{i},g_{j}\in\mathfrak{G}. After gg acts on the elements 1,,n1,\ldots,n of the set SS, the corresponding images are presented as the colored numbers next to the big light-gray circle. Finally, the 2-ary dependencies (colored arrows) between neighboring nodes (small dark-gray circles) are modeled according to the action results of gg, shown in each big light-gray circle.

Appendix H Proofs About Incidence Triangles

H.1 Proof of Eq. (6)

\bm{\tau}=\frac{1}{2}\bm{A}^{2}\odot\bm{A}\cdot\bm{1}_{N},\quad\bm{1}_{N}^{\top}\bm{\tau}=\frac{1}{2}\text{tr}(\bm{A}^{3})
Proof.

Let 𝑨=(aij)N×N\bm{A}=(a_{ij})_{N\times N}, 𝑩=𝑨2=(bij)N×N\bm{B}=\bm{A}^{2}=(b_{ij})_{N\times N}, where aija_{ij} and bijb_{ij} denote the (i,j)(i,j) element of 𝑨\bm{A} and 𝑩\bm{B}, respectively. Since aija_{ij} equals 1 iff nodes viv_{i} and vjv_{j} are adjacent in GG, bijb_{ij} equals the number of walks of length 2 from nodes viv_{i} to vjv_{j} in GG. In addition, a walk of length 2 from viv_{i} to vjv_{j} and an edge from vjv_{j} to viv_{i} form a triangle containing both viv_{i} and vjv_{j}. Therefore, the (i,j)(i,j) element of 𝑨2𝑨\bm{A}^{2}\odot\bm{A} equals bijaij=bijajib_{ij}a_{ij}=b_{ij}a_{ji}, which indicates how many triangles contain both viv_{i} and vjv_{j}. We can use a sum vector 𝟏N=(1,1,,1)N\bm{1}_{N}=(1,1,\cdots,1)^{\top}\in\mathbb{R}^{N} to sum up each row of 𝑨2𝑨\bm{A}^{2}\odot\bm{A} and get a result vector, whose ii-th element gives twice the number of incidence triangles of node viv_{i}. Here the “twice” comes from the fact that each incidence triangle vjvivk\triangle v_{j}v_{i}v_{k} over node viv_{i} has two walks of length 2 starting from node viv_{i}, that is, vivjvkv_{i}\to v_{j}\to v_{k} and vivkvjv_{i}\to v_{k}\to v_{j}. Hence after dividing each element of the result vector by 2, we finally obtain 𝝉=12𝑨2𝑨𝟏N\bm{\tau}=\frac{1}{2}\bm{A}^{2}\odot\bm{A}\cdot\bm{1}_{N}.

For the second equation, we have

\bm{1}_{N}^{\top}\bm{\tau}=\frac{1}{2}\bm{1}_{N}^{\top}\cdot(\bm{A}^{2}\odot\bm{A})\cdot\bm{1}_{N}=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}{b_{ij}a_{ij}}=\frac{1}{2}\sum_{i=1}^{N}(\bm{A}^{2}\cdot\bm{A}^{\top})_{ii}=\frac{1}{2}\text{tr}(\bm{A}^{2}\cdot\bm{A}^{\top})=\frac{1}{2}\text{tr}(\bm{A}^{3})

\Box

Remark. The ii-th diagonal entry of 𝑨3\bm{A}^{3} is equal to twice the number of triangles in which the ii-th node is contained [41]. In addition, each triangle has three vertices. Hence we can divide the sum of the diagonal entries by 6 to obtain the total number of triangles in graph GG, i.e., 16tr(𝑨3)\frac{1}{6}\text{tr}(\bm{A}^{3}) [58].

For directed graphs, we also have similar results:

\bm{\vec{\tau}}=\bm{A}^{2}\odot\bm{A}^{\top}\cdot\bm{1}_{N},\quad\bm{1}_{N}^{\top}\bm{\vec{\tau}}=\text{tr}(\bm{A}^{3})

where 𝝉N\bm{\vec{\tau}}\in\mathbb{R}^{N} and its ii-th element τi\vec{\tau}_{i} denotes the number of directed incidence triangles over node ii.
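A small NumPy sketch of the undirected case is given below; the 4-node example graph is an illustrative assumption.

# Compute tau_i from Eq. (6) and cross-check the total triangle count tr(A^3)/6.
import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]])          # two triangles sharing the edge (0, 2)

tau = 0.5 * ((A @ A) * A) @ np.ones(A.shape[0])
print(tau)                            # [2. 1. 2. 1.]: incidence triangles per node
print(np.trace(A @ A @ A) / 6)        # 2.0 triangles in total, consistent with tau.sum() / 3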

H.2 Proof of Theorem 4

Theorem H.1 (Chernoff-Hoeffding Bound for Discrete Time Markov Chain [59]).

Let \mathcal{M} be an ergodic Markov chain with state space [n]={1,2,,n}[n]=\{1,2,\ldots,n\} and stationary distribution 𝛑\bm{\pi}. Let 𝔱=𝔱(ϵ)\mathfrak{t}=\mathfrak{t}(\epsilon) be its ϵ\epsilon-mixing time for ϵ1/8\epsilon\leq 1/8. Let (X1,X2,,Xr)(X_{1},X_{2},\dots,X_{r}) denote an rr-step random walk on \mathcal{M} starting from an initial distribution φ\varphi on [n][n], i.e., X1φX_{1}\leftarrow\varphi. Define φ𝛑=i=1nφi2πi\left\|\varphi\right\|_{\bm{\pi}}=\sum_{i=1}^{n}{\frac{\varphi_{i}^{2}}{\pi_{i}}}. For every step k[r]k\in[r], let f(k):[n][0,1]f^{(k)}:[n]\rightarrow[0,1] be a weight function such that the expectation 𝔼Xk𝛑[f(k)(Xk)]=μ\mathbb{E}_{X_{k}\leftarrow\bm{\pi}}[f^{(k)}(X_{k})]=\mu for all kk. Define the total weight of the walk (X1,X2,,Xr)(X_{1},X_{2},\ldots,X_{r}) by Zk=1rf(Xk)Z\triangleq\sum_{k=1}^{r}f(X_{k}). There exists some constant cc (which is independent of μ\mu, δ\delta and ϵ\epsilon) such that for 0<δ<10<\delta<1

\Pr\left(\left|Z-\mu r\right|>\epsilon\mu r\right)\leq c\left\|\varphi\right\|_{\bm{\pi}}\exp{\left(-\frac{\epsilon^{2}\mu r}{72\mathfrak{t}}\right)}

or equivalently

\Pr\left(\left|\frac{Z}{r}-\mu\right|>\epsilon\mu\right)\leq c\left\|\varphi\right\|_{\bm{\pi}}\exp{\left(-\frac{\epsilon^{2}\mu r}{72\mathfrak{t}}\right)}.

Theorem H.2.

Any nonlinear dynamic system may be approximated by a recurrent neural network to any desired degree of accuracy and with no restrictions imposed on the compactness of the state space, provided that the network is equipped with an adequate number of hidden neurons [60].

Using Theorem H.1 and Theorem H.2, we prove Theorem 4 as follows.

Theorem 4. Let xv,v𝒱x_{v},\forall v\in\mathcal{V} denote the feature inputs on graph G=(𝒱,)G=(\mathcal{V},\mathcal{E}), and 𝕄\mathbb{M} be a general GNN model with RNN aggregators. Suppose that xvx_{v} is initialized as the degree dvd_{v} of node vv, and each node is distinguishable. For any 0<ϵ1/80<\epsilon\leq 1/8 and 0<δ<10<\delta<1, there exists a parameter setting Θ\Theta for 𝕄\mathbb{M} so that after 𝒪(dv(2dv+τv)𝔱dv+τv)\mathcal{O}\left(\frac{d_{v}(2d_{v}+\tau_{v})\mathfrak{t}}{d_{v}+\tau_{v}}\right) samples,

\Pr\left(\left|\frac{z_{v}}{\tau_{v}}-1\right|\leq\epsilon\right)\geq 1-\delta,\quad\forall v\in\mathcal{V},

where zvz_{v}\in\mathbb{R} is the final output value generated by 𝕄\mathbb{M} and τv\tau_{v} is the number of incidence triangles.

Proof.

Without loss of generality, we discuss how to estimate the number of incidence triangles τ0\tau_{0} for an arbitrary node v0v_{0} based on its nn neighbors v1,v2,,vnv_{1}^{\prime},v_{2}^{\prime},\ldots,v_{n}^{\prime}. Let v0=v0v_{0}^{\prime}=v_{0}, and let G=(𝒱,)G^{\prime}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime}) denote the subgraph induced by 𝒱={v0,v1,v2,,vn}\mathcal{V}^{\prime}=\{v_{0}^{\prime},v_{1}^{\prime},v_{2}^{\prime},\ldots,v_{n}^{\prime}\}, with an adjacency matrix 𝑨(n+1)×(n+1)\bm{A}^{\prime}\in\mathbb{R}^{(n+1)\times(n+1)}. We add a symbol “\prime” to all notations of the induced subgraph GG^{\prime} to distinguish them from those of graph GG. For each node vi𝒱v_{i}^{\prime}\in\mathcal{V}^{\prime}, did_{i}^{\prime} denotes the degree of viv_{i}^{\prime} in graph GG^{\prime}, τi\tau_{i}^{\prime} denotes the number of incidence triangles of viv_{i}^{\prime} in graph GG^{\prime}. In particular, d0=d0=nd_{0}^{\prime}=d_{0}=n, τ0=τ0\tau_{0}^{\prime}=\tau_{0}. Our goal is to estimate τ0\tau_{0} for an arbitrary node v0v_{0} in graph GG, which is equal to τ0\tau_{0}^{\prime} in graph GG^{\prime}.

A simple random walk (SRW) with rr steps on graph GG^{\prime}, denoted by R=(X1,X2,,Xr)R=(X_{1},X_{2},\ldots,X_{r}), is defined as follows: start from an initial node in GG^{\prime}, then move to one of its neighboring nodes chosen uniformly at random, and repeat this process (r1)(r-1) times. This random walk on graph GG^{\prime} can be viewed as a finite Markov chain \mathcal{M} with the state space 𝒱\mathcal{V}^{\prime}, and the transition probability matrix 𝑷\bm{P} of this Markov chain is defined as

\bm{P}(i,j)=\begin{cases}\dfrac{1}{d_{i}^{\prime}},&\text{if}\ (v_{i}^{\prime},v_{j}^{\prime})\in\mathcal{E}^{\prime},\\ 0,&\text{otherwise.}\end{cases}

Let D=i=0ndi=2||D^{\prime}=\sum_{i=0}^{n}d_{i}^{\prime}=2\left|\mathcal{E}^{\prime}\right| denote the sum of degrees in graph GG^{\prime}. After many random walk steps, the probability Pr(Xr=vi)\Pr(X_{r}=v_{i}^{\prime}) converges to pidi/Dp_{i}\triangleq d_{i}^{\prime}/D^{\prime}, and the vector 𝝅=(p0,p1,,pn)\bm{\pi}=(p_{0},p_{1},\ldots,p_{n}) is called the stationary distribution of this random walk.

The mixing time of a Markov chain is the number of steps it takes for a random walk to approach its stationary distribution. We adopt the definition in [59, 61, 41] and define the mixing time 𝔱(ϵ)\mathfrak{t}(\epsilon) as follows:

\mathfrak{t}(\epsilon)=\max_{X_{i}\in\mathcal{V}^{\prime}}\min\left\{t:\left|\bm{\pi}-\bm{\pi}^{(i)}\bm{P}^{t}\right|<\epsilon\right\}

where 𝝅\bm{\pi} is the stationary distribution of the Markov chain defined above, 𝝅(i)\bm{\pi}^{(i)} is the initial distribution when starting from state Xi𝒱X_{i}\in\mathcal{V}^{\prime}, 𝑷t\bm{P}^{t} is the transition matrix after tt steps, and |||\cdot| is the variation distance between two distributions.

Later on, we will exploit node samples taken from a random walk to construct an estimator z0z_{0}, then use the mixing time based Chernoff-Hoeffding bound [59] to compute the number of steps/samples needed, thereby guaranteeing that our estimator z0z_{0} is within (1±ϵ)(1\pm\epsilon) of the true value τ0\tau_{0} with the probability of at least 1δ1-\delta.

Given a random walk (X1,X2,,Xr)(X_{1},X_{2},\ldots,X_{r}) on graph GG^{\prime}, we define a new variable ak=𝑨Xk1,Xk+1a_{k}=\bm{A}_{X_{k-1},X_{k+1}}^{\prime} for every 2kr12\leq k\leq r-1, then we have

\mathbb{E}\left[a_{k}d_{X_{k}}^{\prime}\right] =\sum_{i=0}^{n}{p_{i}\mathbb{E}\left[a_{k}d_{X_{k}}^{\prime}\ \big{|}\ X_{k}=v_{i}^{\prime}\right]}
=\sum_{i=0}^{n}{\frac{d_{i}^{\prime}}{D^{\prime}}\frac{2\tau_{i}^{\prime}}{{d_{i}^{\prime}}^{2}}d_{i}^{\prime}}
=\frac{2}{D^{\prime}}\sum_{i=0}^{n}{\tau_{i}^{\prime}} (7)

The second equality holds because there are di2{d_{i}^{\prime}}^{2} equal probability combinations of (Xk1,vi,Xk+1)(X_{k-1},v_{i}^{\prime},X_{k+1}), out of which only 2τi2\tau_{i}^{\prime} combinations form a triangle (u,vi,w)(u^{\prime},v_{i}^{\prime},w^{\prime}) or its reverse (w,vi,u)(w^{\prime},v_{i}^{\prime},u^{\prime}), where uu^{\prime} is connected to ww^{\prime}, i.e., ak=𝑨Xk1,Xk+1=𝑨u,w=1a_{k}=\bm{A}_{X_{k-1},X_{k+1}}^{\prime}=\bm{A}_{u^{\prime},w^{\prime}}^{\prime}=1.

To estimate τ0\tau_{0}, we introduce two variables Y1Y_{1} and Y2Y_{2}, defined as follows:

Y_{1}\triangleq\frac{1}{r-2}\sum_{k=2}^{r-1}{a_{k}d_{X_{k}}^{\prime}},\quad Y_{2}\triangleq\frac{1}{r}\sum_{k=1}^{r}{\frac{1}{d_{X_{k}}^{\prime}}}

Using the linearity of expectation and Eq. (7), we obtain

\mathbb{E}[Y_{1}]=\frac{1}{r-2}\sum_{k=2}^{r-1}{\mathbb{E}\left[a_{k}d_{X_{k}}^{\prime}\right]}=\frac{2}{D^{\prime}}\sum_{i=0}^{n}{\tau_{i}^{\prime}} (8)

Similarly, we have

\mathbb{E}[Y_{2}]=\frac{1}{r}\sum_{k=1}^{r}{\mathbb{E}\left[\frac{1}{d_{X_{k}}^{\prime}}\right]}=\frac{1}{r}\sum_{k=1}^{r}{\left(\sum_{i=0}^{n}{\frac{d_{i}^{\prime}}{D^{\prime}}\frac{1}{d_{i}^{\prime}}}\right)}=\frac{n+1}{D^{\prime}} (9)

Recall that GG^{\prime} is a subgraph induced by 𝒱={v0,v1,v2,,vn}\mathcal{V}^{\prime}=\{v_{0}^{\prime},v_{1}^{\prime},v_{2}^{\prime},\ldots,v_{n}^{\prime}\}, where v1,v2,,vnv_{1}^{\prime},v_{2}^{\prime},\ldots,v_{n}^{\prime} are nn neighbors of an arbitrary node v0=v0v_{0}^{\prime}=v_{0}. Therefore, the maximum degree of graph GG^{\prime} is Δ=n\Delta^{\prime}=n, which is equal to d0=d0d_{0}^{\prime}=d_{0}. In addition, we have i=0nτi=3τ0=3τ0\sum_{i=0}^{n}{\tau_{i}^{\prime}}=3\tau_{0}^{\prime}=3\tau_{0}, and D=2||=2(d0+τ0)=2(d0+τ0)D^{\prime}=2\left|\mathcal{E}^{\prime}\right|=2(d_{0}^{\prime}+\tau_{0}^{\prime})=2(d_{0}+\tau_{0}). Substituting them in Eq. (8) and Eq. (9), we get

\mathbb{E}[Y_{1}]=\frac{3\tau_{0}}{d_{0}+\tau_{0}} (10)

and

\mathbb{E}[Y_{2}]=\frac{d_{0}+1}{2(d_{0}+\tau_{0})} (11)

From Eq. (10) and Eq. (11) we can isolate τ0\tau_{0} and get

\tau_{0}=\frac{d_{0}+1}{6}\cdot\frac{\mathbb{E}[Y_{1}]}{\mathbb{E}[Y_{2}]} (12)

Since d0d_{0} is the feature input, the coefficient d0+16\dfrac{d_{0}+1}{6} can be considered as a constant factor here. Intuitively, both Y1Y_{1} and Y2Y_{2} converge to their expected values, and thus the estimator z0d0+16Y1Y2z_{0}\triangleq\dfrac{d_{0}+1}{6}\cdot\dfrac{Y_{1}}{Y_{2}} converges to τ0\tau_{0} as well. Next, we will find the number of steps/samples rr for convergence.

Since a_{k}d_{X_{k}}^{\prime}=\bm{A}_{X_{k-1},X_{k+1}}^{\prime}d_{X_{k}}^{\prime} in Y_{1} only depends on a three-node history, we consider a related Markov chain \tilde{\mathcal{M}} that remembers the three latest visited nodes. Accordingly, \tilde{\mathcal{M}} has (n+1)\times(n+1)\times(n+1) states, and (X_{k-1},X_{k},X_{k+1})\to(X_{k},X_{k+1},X_{k+2}) has the same transition probability as X_{k+1}\to X_{k+2} in \mathcal{M}. Define each state \tilde{X}_{k}=(X_{k-1},X_{k},X_{k+1}) for 2\leq k\leq r-1. Let \tilde{f}_{1}^{(k)}(\tilde{X}_{k})=f_{1}^{(k)}(X_{k})=\dfrac{a_{k}d_{X_{k}}^{\prime}}{\Delta^{\prime}}=\dfrac{a_{k}d_{X_{k}}^{\prime}}{d_{0}} so that all values of \tilde{f}_{1}^{(k)}(\tilde{X}_{k}) lie in [0,1]. By Eq. (7), Eq. (8), and Eq. (10), we have \mu_{1}=\mathbb{E}_{\tilde{X}_{k}\leftarrow\bm{\pi}}(\tilde{f}_{1}^{(k)}(\tilde{X}_{k}))=\dfrac{3\tau_{0}}{d_{0}(d_{0}+\tau_{0})}. Define Z_{1}\triangleq\sum_{k=2}^{r-1}\tilde{f}_{1}^{(k)}(\tilde{X}_{k})=\dfrac{r-2}{d_{0}}Y_{1}, and assume that \varphi\approx\bm{\pi} so that \left\|\varphi\right\|_{\bm{\pi}}=1. By Theorem H.1 and Eq. (10), we have

\Pr\left(\left|Y_{1}-\mathbb{E}[Y_{1}]\right|>\frac{\epsilon}{3}\mathbb{E}[Y_{1}]\right)\leq c_{1}\exp{\left(-\frac{3\cdot\epsilon^{2}\tau_{0}(r-2)}{9\cdot 72\cdot\tilde{\mathfrak{t}}d_{0}(d_{0}+\tau_{0})}\right)} (13)

Extracting r_{Y_{1}} from \dfrac{\delta}{2}=c_{1}\exp{\left(-\dfrac{\epsilon^{2}\tau_{0}(r-2)}{216\cdot\tilde{\mathfrak{t}}d_{0}(d_{0}+\tau_{0})}\right)}, we obtain r_{Y_{1}}=2-216\dfrac{\ln(\delta/2c_{1})}{\epsilon^{2}}\cdot\dfrac{d_{0}(d_{0}+\tau_{0})\tilde{\mathfrak{t}}}{\tau_{0}}=\mathcal{O}\left(\dfrac{d_{0}(d_{0}+\tau_{0})\tilde{\mathfrak{t}}}{\tau_{0}}\right), where c_{1}, \epsilon and \delta are all constants.

Let f_{2}^{(k)}(X_{k})=\dfrac{1}{d_{X_{k}}^{\prime}}; by Eq. (9) and Eq. (11) we have \mu_{2}=\mathbb{E}_{X_{k}\leftarrow\bm{\pi}}(f_{2}^{(k)}(X_{k}))=\dfrac{d_{0}+1}{2(d_{0}+\tau_{0})}. Define Z_{2}\triangleq\sum_{k=1}^{r}f_{2}^{(k)}(X_{k})=rY_{2}, and assume that \varphi\approx\bm{\pi} so that \left\|\varphi\right\|_{\bm{\pi}}=1. By Theorem H.1 and Eq. (11), we have

\Pr\left(\left|Y_{2}-\mathbb{E}[Y_{2}]\right|>\frac{\epsilon}{3}\mathbb{E}[Y_{2}]\right)\leq c_{2}\exp{\left(-\frac{\epsilon^{2}(d_{0}+1)r}{2\cdot 9\cdot 72\cdot\mathfrak{t}(d_{0}+\tau_{0})}\right)} (14)

Extracting r_{Y_{2}} from \dfrac{\delta}{2}=c_{2}\exp{\left(-\dfrac{\epsilon^{2}(d_{0}+1)r}{1296\cdot\mathfrak{t}(d_{0}+\tau_{0})}\right)}, we obtain r_{Y_{2}}=-1296\dfrac{\ln(\delta/2c_{2})}{\epsilon^{2}}\cdot\dfrac{(d_{0}+\tau_{0})\mathfrak{t}}{d_{0}+1}=\mathcal{O}\left(\dfrac{(d_{0}+\tau_{0})\mathfrak{t}}{d_{0}+1}\right), where c_{2}, \epsilon and \delta are all constants.

Since \mathfrak{t}\geq\tilde{\mathfrak{t}} (see Appendix A in [62] for details), choose r\geq\mathcal{O}\left(\dfrac{d_{0}(d_{0}+\tau_{0})\mathfrak{t}}{\tau_{0}}\right)\geq\max\{r_{Y_{1}},r_{Y_{2}}\}. Eq. (13) and Eq. (14) determine the number of steps/samples r, which guarantees that each of Y_{1} and Y_{2} is within (1\pm\epsilon/3) of its expected value with probability at least 1-\delta/2. Since the probability of Y_{1} or Y_{2} deviating from its expected value is at most \delta/2 each, by the union bound the probability that either of them deviates is at most \delta:

\Pr\left(\left|Y-\mathbb{E}[Y]\right|>\frac{\epsilon}{3}\mathbb{E}[Y]\right)\leq\frac{\delta}{2},\quad Y=Y_{1},Y_{2}
\Rightarrow\ \Pr\left(\left(1-\frac{\epsilon}{3}\right)\mathbb{E}[Y]\leq Y\leq\left(1+\frac{\epsilon}{3}\right)\mathbb{E}[Y]\right)\geq 1-\frac{\delta}{2},\quad Y=Y_{1},Y_{2}
\Rightarrow\ \Pr\left(\underbrace{(1-\epsilon)\tau_{0}\leq\dfrac{d_{0}+1}{6}\frac{1-\frac{\epsilon}{3}}{1+\frac{\epsilon}{3}}\frac{\mathbb{E}[Y_{1}]}{\mathbb{E}[Y_{2}]}}_{\star}\leq\underbrace{\dfrac{d_{0}+1}{6}\frac{Y_{1}}{Y_{2}}}_{\text{estimator}~z_{0}}\leq\underbrace{\dfrac{d_{0}+1}{6}\frac{1+\frac{\epsilon}{3}}{1-\frac{\epsilon}{3}}\frac{\mathbb{E}[Y_{1}]}{\mathbb{E}[Y_{2}]}\leq(1+\epsilon)\tau_{0}}_{\star}\right)\geq 1-\delta

The first line is a summary of Eq. (13) and Eq. (14). The inequalities marked with \star hold due to Eq. (12) and the facts that 1-\epsilon\leq\dfrac{1-\frac{\epsilon}{3}}{1+\frac{\epsilon}{3}} and 1+\epsilon\geq\dfrac{1+\frac{\epsilon}{3}}{1-\frac{\epsilon}{3}} when 0<\epsilon\leq 1/8. We thus conclude that after \mathcal{O}\left(\dfrac{d_{0}(d_{0}+\tau_{0})\mathfrak{t}}{\tau_{0}}\right) samples, \Pr\left(\left|\dfrac{z_{0}}{\tau_{0}}-1\right|\leq\epsilon\right)\geq 1-\delta.

Figure 6: Add an artificial node v_{a} and connect it to all nodes in G^{\prime}. (a) Star graph, without any triangles; (b) general case, with some triangles.

However, if \tau_{0}=0 and G^{\prime} is a star graph, the number of samples r\geq\mathcal{O}\left(\dfrac{d_{0}(d_{0}+\tau_{0})\mathfrak{t}}{\tau_{0}}\right)\to\infty. To avoid this, we add an artificial node v_{a} and connect it to all nodes in G^{\prime}, as illustrated in Figure 6. Since d_{0}^{(a)}=d_{0}+1 and \tau_{0}^{(a)}=d_{0}+\tau_{0}, we only need to subtract d_{0} from the estimated result \tau_{0}^{(a)}, and the number of samples can then be reduced to \mathcal{O}\left(\dfrac{d_{0}^{(a)}(d_{0}^{(a)}+\tau_{0}^{(a)})\mathfrak{t}^{(a)}}{\tau_{0}^{(a)}}\right)\approx\mathcal{O}\left(\dfrac{d_{0}(2d_{0}+\tau_{0})\mathfrak{t}}{d_{0}+\tau_{0}}\right).

We have proved that we can estimate the number of incidence triangles \tau_{0} for an arbitrary node v_{0} based on its n neighbors by a random walk. Considering this random walk as a nonlinear dynamic system, by the universal approximation ability of RNNs (Theorem H.2) it can be approximated by an RNN to any desired degree of accuracy. Therefore, if the input sequence of the RNN follows the random walk above, the RNN aggregator can mimic this random walk on the subgraph induced by v_{0} and its 1-hop neighbors when aggregating, and finally output z_{0}\approx\tau_{0}. This completes the proof. \Box

Note: This proof is inspired by Hardiman and Katzir [62] and Chen et al. [61].
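For illustration, the estimator in this proof can be simulated directly (without the RNN) by running the random walk and forming z_{0}=\frac{d_{0}+1}{6}\cdot\frac{Y_{1}}{Y_{2}}; the toy neighborhood, walk length, and random seed below are illustrative assumptions.

# Simulate the simple random walk on G' and estimate tau_0 for node v0.
import random

def estimate_tau0(adj, v0, steps=100000, seed=0):
    rng = random.Random(seed)
    nodes = [v0] + sorted(adj[v0])
    sub = {u: [w for w in adj[u] if w in nodes] for u in nodes}   # induced subgraph G'
    walk = [rng.choice(nodes)]
    for _ in range(steps - 1):
        walk.append(rng.choice(sub[walk[-1]]))
    d = {u: len(sub[u]) for u in nodes}
    # Y1 averages a_k * d'_{X_k}, where a_k checks adjacency of X_{k-1} and X_{k+1} in G'.
    y1 = sum(d[walk[k]] * (walk[k + 1] in sub[walk[k - 1]])
             for k in range(1, steps - 1)) / (steps - 2)
    y2 = sum(1.0 / d[walk[k]] for k in range(steps)) / steps
    return (len(adj[v0]) + 1) / 6.0 * y1 / y2

# Example: v0 = 0 with neighbors 1, 2, 3, where (1, 2) and (2, 3) are edges -> tau_0 = 2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(estimate_tau0(adj, 0))   # close to 2 for a sufficiently long walk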

H.3 Analysis of GraphSAGE

Theorem H.3.

Let xvU,v𝒱x_{v}\in U,\forall v\in\mathcal{V} denote the input features for Algorithm 1 (proposed in GraphSAGE) on graph G=(𝒱,)G=(\mathcal{V},\mathcal{E}), where UU is any compact subset of d\mathbb{R}^{d}. Suppose that there exists a fixed positive constant C+C\in\mathbb{R}^{+} such that xvxv2>C\|x_{v}-x_{v^{\prime}}\|_{2}>C for all pairs of nodes. Then we have that ϵ>0\forall\epsilon>0 there exists a parameter setting Θ\Theta^{*} for Algorithm 1 such that after K=4K=4 iterations

\left|z_{v}-c_{v}\right|<\epsilon,\quad\forall v\in\mathcal{V},

where zvz_{v}\in\mathbb{R} are final output values generated by Algorithm 1 and cvc_{v} are node clustering coefficients [5].

According to Theorem H.3, GraphSAGE can approximate the clustering coefficients in a graph to arbitrary precision. In addition, since GraphSAGE with LSTM aggregators is a special case of our proposed Theorem 4, it can also approximate the number of incidence triangles to arbitrary precision. In fact, the number of incidence triangles τv\tau_{v} is related to the local clustering coefficient cvc_{v}. More specifically, τv=cvdv(dv1)/2\tau_{v}=c_{v}\cdot d_{v}(d_{v}-1)/2. Therefore, the conclusion of Theorem 4 is consistent with that of Theorem H.3. However, Theorem 4 reveals that the required samples 𝒪(dv(2dv+τv)𝔱dv+τv)\mathcal{O}\left(\frac{d_{v}(2d_{v}+\tau_{v})\mathfrak{t}}{d_{v}+\tau_{v}}\right) are related to τv\tau_{v} and proportional to the mixing time 𝔱\mathfrak{t}, leading to a practically prohibitive aggregation complexity.
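The relation \tau_{v}=c_{v}\cdot d_{v}(d_{v}-1)/2 can be checked numerically, e.g., with NetworkX for c_{v} and the matrix formula of Appendix H.1 for \tau_{v}; the toy graph below is an illustrative assumption.

# Verify tau_v = c_v * d_v * (d_v - 1) / 2 on a small example graph.
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)])
A = nx.to_numpy_array(G)
tau = 0.5 * ((A @ A) * A) @ np.ones(len(G))
for v in G:
    c, d = nx.clustering(G, v), G.degree(v)
    print(v, tau[v], c * d * (d - 1) / 2)   # the two columns agree for every node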

To overcome this problem and improve efficiency, GraphSAGE performs neighborhood sampling and suggests sampling 2-hop neighborhoods for each node. Suppose the neighborhood sample sizes at the 1-hop and 2-hop levels are S_{1} and S_{2}; then the sampling complexity is \Theta(NS_{1}S_{2}). Accordingly, the memory and time complexities of GraphSAGE with LSTM are \Theta(Nc+NS_{1}S_{2}) and \Theta(NS_{1}S_{2}c^{2}+NS_{1}S_{2}), respectively.

Appendix I Proof of Proposition 5

Theorem I.1.

22-WL and MPNNs cannot induced-subgraph-count any connected pattern with 33 or more nodes [11].

Lemma I.2.

No pair of strongly regular graphs in family SRG(v,r,λ,μv,r,\lambda,\mu) can be distinguished by the 2-FWL test [37, 36].

Using Theorem I.1 and Lemma I.2, we prove Proposition 5 as follows.

Proposition 5. PG-GNN is strictly more powerful than the 2-WL test and not less powerful than the 3-WL test.

Proof.

We first verify that GIN (whose expressive power is equivalent to that of the 2-WL test) [7] can be instantiated by a GNN model with RNN aggregators (including our proposed PG-GNN). Consider a single layer of GIN:

\bm{h}_{v}^{(k)}=\text{MLP}^{(k)}\left(\bm{h}_{v}^{(k-1)}+\sum\nolimits_{u\in\mathcal{N}(v)}\bm{h}_{u}^{(k-1)}\right) (15)

where MLP(k)\text{MLP}^{(k)} has a linear mapping 𝑾GIN(k)dk×dk1\bm{W}_{\text{GIN}}^{(k)}\in\mathbb{R}^{d_{k}\times d_{k-1}} and a bias term 𝒃GIN(k)dk\bm{b}_{\text{GIN}}^{(k)}\in\mathbb{R}^{d_{k}}. Without loss of generality, we take the Simple Recurrent Network (SRN) [15] as the RNN aggregator in Eq. (3), formulated as follows:

\bm{z}_{t}^{(k)} =\bm{U}\bm{y}_{t-1}^{(k)}+\bm{W}\bm{h}_{t}^{(k-1)}+\bm{b}
\bm{y}_{t}^{(k)} =a(\bm{z}_{t}^{(k)})

Let 𝑾=𝑾GIN(k)\bm{W}=\bm{W}_{\text{GIN}}^{(k)}, 𝑼=𝑰dk\bm{U}=\bm{I}_{d_{k}}, 𝒃=𝒃GIN(k)\bm{b}=\bm{b}_{\text{GIN}}^{(k)}, the initial state 𝒚0(k)=𝟎\bm{y}_{0}^{(k)}=\mathbf{0}, the activation function a()a(\cdot) be an identity function. And let the input sequence of the RNN aggregator be an arbitrarily ordered sequence of the set {𝒉u(k1)}u𝒩(v)v\{\bm{h}_{u}^{(k-1)}\}_{u\in\mathcal{N}(v)\cup v}. Then any GIN with Eq. (15) can be instantiated by a GNN model with RNN aggregators (in particular, a PG-GNN with Eq. (3)), which implies that the permutation-sensitive GNNs can be at least as powerful as the 2-WL test.
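A tiny NumPy sketch of this instantiation is given below; the dimensions and random features are illustrative assumptions, and the bias term is set to zero for simplicity.

# An SRN with U = I and identity activation reproduces the GIN sum aggregation,
# regardless of the ordering of the input sequence.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))            # plays the role of W_GIN
h_v = rng.normal(size=d_in)                   # central node feature
neighbors = [rng.normal(size=d_in) for _ in range(5)]

gin_out = W @ (h_v + sum(neighbors))          # GIN layer, Eq. (15), without bias

def srn(sequence):
    y = np.zeros(d_out)
    for h in sequence:
        y = np.eye(d_out) @ y + W @ h         # z_t = U y_{t-1} + W h_t, a = identity
    return y

print(np.allclose(gin_out, srn([h_v] + neighbors)))         # True
print(np.allclose(gin_out, srn(neighbors[::-1] + [h_v])))   # True for any input ordering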

Next, we prove that PG-GNN is strictly more powerful than MPNNs and the 2-WL test from the perspective of substructure counting. Without loss of generality, we take an arbitrary node v into consideration. According to the definition of incidence triangles and the fact that they always appear in the 1-hop neighborhood of the central node, the number of connections between neighboring nodes of the central node v is equivalent to the number of incidence triangles over v. Theorem 1 ensures that all the 2-ary dependencies can be modeled by Eq. (3). Suppose we aim to capture the connections between two arbitrary neighbors of the central node; then we can use an MSE loss to measure the mean squared error between the predicted and ground-truth counting values and guide our model to learn the correct 2-ary dependencies, thereby capturing the correct connections and counting the number of connections between neighboring nodes. And if we mainly focus on specific downstream tasks (e.g., graph classification), these 2-ary dependencies will be learned adaptively with the guidance of a specific loss function (e.g., cross-entropy loss). Thus PG-GNN is capable of counting incidence triangles333In fact, since PG-GNN can count incidence triangles, it is also capable of counting all incidence 3-node graphlets. There are only two types of 3-node graphlets, i.e., wedges and triangles; let \tau_{v} be the number of incidence triangles over v and n be the number of 1-hop neighbors, then we have \binom{n}{2}-\tau_{v} incidence wedges.. Moreover, since the incidence 4-cliques always appear in the 1-hop neighborhood of the central node and every 4-clique is entirely composed of triangles, PG-GNN can also leverage 2-ary dependencies to count incidence 4-cliques, similar to counting incidence triangles. Thus PG-GNN can count all 3-node graphlets (wedges and triangles), and even 4-cliques, incident to node v.

In addition, Chen et al. [11] proposed Theorem I.1, which implies that 2-WL and MPNNs cannot count any connected induced subgraph with 3 or more nodes. Since the incidence wedges, triangles, and 4-cliques are all connected induced subgraphs with 3\geq 3 nodes, the above arguments demonstrate that the expressivity of PG-GNN goes beyond the 2-WL test and MPNNs.

Figure 7: A pair of non-isomorphic strongly regular graphs in the family SRG(16,6,2,2): the 4×4 Rook's graph and the Shrikhande graph.

To round off the proof, we finally prove that PG-GNN is not less powerful than the 3-WL test. Consider a pair of strongly regular graphs in the family SRG(16,6,2,2): 4×\times4 Rook’s graph and the Shrikhande graph. As illustrated in Figure 7, only Rook’s graph (left) possesses 4-cliques (some are emphasized by colors), but the Shrikhande graph (right) possesses no 4-cliques. Since PG-GNN is capable of counting incidence 4-cliques, our approach can distinguish this pair of strongly regular graphs. However, in virtue of Lemma I.2 and the fact that 2-FWL is equivalent to 3-WL [28], the 3-WL test fails to distinguish them. Thus PG-GNN is not less powerful than the 3-WL test444More accurately, PG-GNN is outside the WL hierarchy, and thus it is not easy to fairly compare it with 3-WL. On the one hand, PG-GNN can distinguish some strongly regular graphs but 3-WL fails. On the other hand, 3-WL considers all the 3-tuples (i1,i2,i3)𝒱3(i_{1},i_{2},i_{3})\in\mathcal{V}^{3}, which form a superset of (induced) subgraphs, but PG-GNN only considers the induced subgraphs and thus cannot completely achieve 3-WL. In summary, 3-WL and PG-GNN have their own unique merits. However, since 3-WL needs to consider all (N3)=Θ(N3)\binom{N}{3}=\Theta(N^{3}) 3-tuples, the problem of complexity is inevitable. In contrast, PG-GNN breaks from the WL hierarchy to make a trade-off between expressive power and computational efficiency..

In conclusion, our proposed PG-GNN is strictly more powerful than the 2-WL test and not less powerful than the 3-WL test. \Box

Appendix J Details of the Proposed Model

In this section, we discuss the proposed model in detail. The notations follow the definitions in Section 3.1, i.e., let nn denote the number of 1-hop neighbors of the central node vv. Suppose these nn neighbors are randomly numbered as u1,,unu_{1},\ldots,u_{n} (also abbreviated as 1,,n1,\ldots,n for simplicity), the set of neighboring nodes is represented as 𝒩(v)\mathcal{N}(v) (or S=[n]={1,,n}S=[n]=\{1,\ldots,n\}).

J.1 Illustration of the Proposed Model

Figure 8 presents a further explanation of Figure 2 and the relationships among Theorem 1, Lemma 2, Corollary 3, Figure 2, and Eq. (3). In this figure, we ignore the central node vv for clarity and illustrate for n=5n=5 and n=6n=6. Here we take n=5n=5 as an example to explain Figure 8(a).

The very left column shows the components of Figure 2 and Eq. (3), and the right four columns provide the decoupled illustrations of Figure 2 and Eq. (3). The first row of the right four columns lists the group action guigu_{i} (gg acts on uiu_{i}) defined by Corollary 3, where uiu_{i} ranges from u1u_{1} to u5u_{5}, g𝔊={e,σ,σ2,σ3}g\in\mathfrak{G}=\{e,\sigma,\sigma^{2},\sigma^{3}\} and 𝔊\mathfrak{G} is defined by Lemma 2. For readers unfamiliar with group theory, the third row of the right four columns explicitly provides the corresponding action results of guigu_{i}, such as σ2u1=u1,σ2u2=u5,σ2u3=u4,σ2u4=u3,σ2u5=u2\sigma^{2}u_{1}=u_{1},\sigma^{2}u_{2}=u_{5},\sigma^{2}u_{3}=u_{4},\sigma^{2}u_{4}=u_{3},\sigma^{2}u_{5}=u_{2} in the third column. In addition, these four columns are associated with each other by the generator σ\sigma. For example, in the third row, after σ\sigma acts on the action results in the first column, they are transformed into the action results in the second column according to the permutation diagram, i.e., σu1=u1,σu2=u4,σu3=u2,σu4=u5,σu5=u3\sigma u_{1}=u_{1},\sigma u_{2}=u_{4},\sigma u_{3}=u_{2},\sigma u_{4}=u_{5},\sigma u_{5}=u_{3}. Action results in other columns are transformed in a similar manner and form a cyclic structure. The second row of the right four columns illustrates this process.

In each column, after obtaining the action results of gu1,,gungu_{1},\ldots,gu_{n}, we arrange these n=5n=5 neighbors (action results) as an undirected ring. The first n/2=2\lfloor n/2\rfloor=2 arrangements (marked by solid lines) are constructed according to Theorem 1, and the last n/2=2\lfloor n/2\rfloor=2 arrangements (marked by dashed lines) reverse the former. Either the first or the last n/2=2\lfloor n/2\rfloor=2 arrangements cover all undirected 2-ary dependencies. Then, we use permutation-sensitive RNNs to model the 2-ary dependencies in a directed manner (since permutation-sensitive RNNs serve aba\to b and bab\to a as two different pairs) and construct the corresponding Hamiltonian cycles. As a result, the Hamiltonian cycles are modeled bi-directionally, and edges in Hamiltonian cycles are transformed into an undirected manner. The arrangement generation and Hamiltonian cycle construction are detailed in Section 3.3.

Figure 8(b) presents in a similar way as Figure 8(a) does. However, we do not show all six columns due to the limited space. Here we omit the 5th and the 6th columns, which illustrate the modeling processes based on group elements σ4\sigma^{4} and σ5\sigma^{5}.

Figure 8: Illustration of the proposed PG-GNN model. (a) $n=5$. (b) $n=6$.

J.2 Discussion on Groups

Since the (permutation) group in Eq. (3) plays a pivotal role in our model, it is necessary to discuss the motivation for using groups and why we select the specific group. In fact, the group is used to effectively model all 2-ary dependencies (pairwise correlations). We first summarize why the modeling of all 2-ary dependencies is indispensable:

  • Expressive power. Modeling all 2-ary dependencies can capture whether any two neighboring nodes are connected, helping our model count incidence triangles and 4-cliques and hence improving its expressive power.

  • Generalization capability and computational complexity. Modeling all 2-ary dependencies makes these dependencies invariant to arbitrary permutations of node orderings. Such invariance to 2-ary dependencies approximates permutation-invariance and helps to guarantee the generalization capability. Moreover, it avoids considering all $n!$ permutations to strictly ensure permutation-invariance, thereby significantly reducing the computational complexity.

  • Robustness. Modeling all 2-ary dependencies makes our model insensitive to a specific 2-ary dependency and robust to potential data noise and adversarial perturbations.

In order to effectively cover all 2-ary dependencies with the lowest complexity, we design a special group to accomplish this goal. According to Cayley's Theorem (Theorem F.3), "every finite group is isomorphic to a permutation group", so we focus on finding permutation groups instead of all finite groups. Hence the problem is converted to finding the basic element of the permutation group, i.e., the permutation. Lemma 2 defines a permutation group $\mathfrak{G}$ and constructs its permutation $\sigma$ (Eq. (2)) based on Theorem 1, which has been proven to reach the theoretical lower bound of the sampling complexity when sampling permutations to cover all 2-ary dependencies. This permutation group is isomorphic to the cyclic group, the simplest group that achieves linear sampling complexity. On the contrary, other groups, such as the dihedral group $\mathcal{D}_n$, the alternating group $\mathcal{A}_n$, and the symmetric group $\mathcal{S}_n$, lead to higher, nonlinear complexity and hence sacrifice efficiency. Thus, in terms of computational efficiency, the group $\mathfrak{G}$ defined in Lemma 2 is the best choice, which drives us to apply it to our model design (Eq. (3)).

J.3 Discussion on Model Variants

Since our proposed model mainly focuses on modeling all 2-ary dependencies, the most intuitive way is to enumerate all $n(n-1)$ bi-directional 2-ary dependencies between the $n$ neighbors of the central node $v$ and then sum them up, which can be formulated as follows:

$\bm{h}_{v}^{(k)}=\sum_{\begin{subarray}{c}u_{i},u_{j}\in\mathcal{N}(v)\\ u_{i}\neq u_{j}\end{subarray}}{\text{RNN}\left(\bm{h}_{u_{i}}^{(k-1)},\bm{h}_{u_{j}}^{(k-1)}\right)}+\bm{W}_{\text{self}}^{(k-1)}\bm{h}_{v}^{(k-1)}$ (16)

Besides, we can also merge the central node $v$ into the RNN to form $n(n-1)$ triplets:

$\bm{h}_{v}^{(k)}=\sum_{\begin{subarray}{c}u_{i},u_{j}\in\mathcal{N}(v)\\ u_{i}\neq u_{j}\end{subarray}}{\text{RNN}\left(\bm{h}_{u_{i}}^{(k-1)},\bm{h}_{u_{j}}^{(k-1)},\bm{h}_{v}^{(k-1)}\right)}$ (17)

In fact, both of these naive variants and our proposed Eq. (3) can model all 2-ary dependencies. However, each term $\left(\bm{h}_{u_i}^{(k-1)},\bm{h}_{u_j}^{(k-1)}\right)$ in Eq. (16) only captures a 2-ary dependency, and each term $\left(\bm{h}_{u_i}^{(k-1)},\bm{h}_{u_j}^{(k-1)},\bm{h}_{v}^{(k-1)}\right)$ in Eq. (17) only captures a triplet (3-ary dependency). In contrast to these two naive variants, each term $\left(\bm{h}_{gu_1}^{(k-1)},\cdots,\bm{h}_{gu_n}^{(k-1)},\bm{h}_{gu_1}^{(k-1)}\right)$ in Eq. (3) encodes all neighbors as a higher-order $n$-ary dependency, which contains more information and is more powerful than a 2-ary or 3-ary dependency.
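For intuition, the following PyTorch-style sketch implements the naive pairwise variant of Eq. (16). The use of a GRU as the permutation-sensitive RNN, the hidden dimension, and the class name are our own illustrative choices, not part of the released model.

import itertools
import torch
import torch.nn as nn

class NaivePairwiseAggregation(nn.Module):
    """Illustrative sketch of Eq. (16): sum an RNN over all ordered neighbor pairs."""

    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # any permutation-sensitive RNN
        self.w_self = nn.Linear(dim, dim, bias=False)    # plays the role of W_self

    def forward(self, h_v, h_neighbors):
        # h_v: (dim,) central node state; h_neighbors: (n, dim) neighbor states
        n = h_neighbors.size(0)
        pairs = list(itertools.permutations(range(n), 2))            # n(n-1) ordered pairs
        seqs = torch.stack([torch.stack([h_neighbors[i], h_neighbors[j]])
                            for i, j in pairs])                       # (n(n-1), 2, dim)
        _, last = self.rnn(seqs)                                      # final hidden state per pair
        return last[-1].sum(dim=0) + self.w_self(h_v)                 # sum over pairs + self term

# Toy usage: 5 neighbors yield 20 ordered pairs.
agg = NaivePairwiseAggregation(dim=8)
out = agg(torch.randn(8), torch.randn(5, 8))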

On the other hand, we can also integrate all terms of Eq. (3) into only one term, and use a single RNN to model it as follows:

$\bm{h}_{v}^{(k)}=\text{RNN}\left(\big\Vert_{g\in\mathfrak{G}}{\left(\bm{h}_{gu_{1}}^{(k-1)},\bm{h}_{gu_{2}}^{(k-1)},\cdots,\bm{h}_{gu_{n}}^{(k-1)},\bm{h}_{gu_{1}}^{(k-1)}\right)}\right)+\bm{W}_{\text{self}}^{(k-1)}\bm{h}_{v}^{(k-1)},\quad u_{1:n}\in\mathcal{N}(v)$ (18)

where $\Vert$ is the concatenation operation. For example, if $n$ is even, the concatenation over $g\in\mathfrak{G}$ expands as:

$\big\Vert_{g\in\mathfrak{G}}{\left(\bm{h}_{gu_{1}}^{(k-1)},\bm{h}_{gu_{2}}^{(k-1)},\cdots,\bm{h}_{gu_{n}}^{(k-1)},\bm{h}_{gu_{1}}^{(k-1)}\right)}=\ \bm{h}_{eu_{1}}^{(k-1)},\bm{h}_{eu_{2}}^{(k-1)},\cdots,\bm{h}_{eu_{n}}^{(k-1)},\bm{h}_{eu_{1}}^{(k-1)},\ \bm{h}_{\sigma u_{1}}^{(k-1)},\bm{h}_{\sigma u_{2}}^{(k-1)},\cdots,\bm{h}_{\sigma u_{n}}^{(k-1)},\bm{h}_{\sigma u_{1}}^{(k-1)},\ \cdots,\ \bm{h}_{\sigma^{n-1}u_{1}}^{(k-1)},\bm{h}_{\sigma^{n-1}u_{2}}^{(k-1)},\cdots,\bm{h}_{\sigma^{n-1}u_{n}}^{(k-1)},\bm{h}_{\sigma^{n-1}u_{1}}^{(k-1)}$

Although this variant can model all $n(n-1)$ 2-ary dependencies in a single term, its time complexity is problematic. Since the concatenation operation orders these representations $\bm{h}_{*}^{(k-1)}$, Eq. (18) can only be processed serially, with a time complexity of $\Theta(N\Delta^{2}c^{2})$. This drawback hinders us from effectively balancing expressive power and computational cost. In contrast, as explained in Section 3.4, our proposed Eq. (3) can be computed in parallel with the lower time complexity of $\Theta(N\Delta c^{2})$, making it more efficient in practice.

Appendix K Analysis of Sampling Complexity

In this section, we first consider a variant of the coupon collector's problem (Problem K.2) and derive its analytical solution. Then, we use the solution of Problem K.2 to estimate the sampling complexity of $\pi$-SGD optimization (proposed by Janossy Pooling [18] and Relational Pooling [1]). Finally, we conduct numerical experiments to verify the rationality of our estimation.

K.1 A Variant of Coupon Collector’s Problem

The coupon collector’s problem is a famous probabilistic paradigm arising from the following scenario.

Problem K.1 (Coupon Collector’s Problem).

Suppose there are $m$ different types of coupons, and each time one chooses a coupon independently and uniformly at random from the $m$ types. One needs to collect $mH(m)=m\ln m+\mathcal{O}(m)$ coupons on average before obtaining at least one of every type of coupon, where $H(m)=\sum_{i=1}^{m}{\frac{1}{i}}$ is the $m$-th harmonic number [41].

In order to estimate the sampling complexity of $\pi$-SGD optimization, we need a more sophisticated analysis of the coupon collector's problem. The following problem is the generalization of Problem K.1 from one coupon to $k$ ($k\geq 1$) coupons at each time, providing a theoretical foundation for our discussion in Section K.2.

Problem K.2 (kk-Coupon Collector’s Problem).

Suppose there are $m$ different types of coupons, and each time one chooses $k$ coupons ($k\geq 1$, without repetition) independently and uniformly at random from the $m$ types. One needs to collect $\sum_{i=1}^{m}{(-1)^{i+1}\frac{\binom{m}{i}}{1-\binom{m-i}{k}/\binom{m}{k}}}$ times on average before obtaining at least one of every type of coupon.

Proof.

Let $X$ be the number of collecting times until at least one of every type of coupon is obtained. We start by considering the probability that $X$ is greater than a fixed $s$. For $j=1,\ldots,m$, let $A_{j}$ denote the event that no type-$j$ coupon is collected in the first $s$ times. By the inclusion-exclusion principle,

$\Pr(X>s)=\Pr\left(\bigcup_{j=1}^{m}{A_{j}}\right)$
$=\sum_{1\leq j_{1}\leq m}{\Pr(A_{j_{1}})}-\sum_{1\leq j_{1}<j_{2}\leq m}{\Pr(A_{j_{1}}\cap A_{j_{2}})}+\sum_{1\leq j_{1}<j_{2}<j_{3}\leq m}{\Pr(A_{j_{1}}\cap A_{j_{2}}\cap A_{j_{3}})}-\cdots+{(-1)^{m+1}}\Pr(A_{1}\cap\cdots\cap A_{m})$
$=\sum_{i=1}^{m}{(-1)^{i+1}\sum_{1\leq j_{1}<\cdots<j_{i}\leq m}{\Pr(A_{j_{1}}\cap\cdots\cap A_{j_{i}})}}$

where $\Pr(A_{j_{1}}\cap\cdots\cap A_{j_{i}})={\left[\binom{m-i}{k}/\binom{m}{k}\right]}^{s}$, and there are $\binom{m}{i}$ choices of indices $1\leq j_{1}<\cdots<j_{i}\leq m$. Thus, we have

$\Pr(X>s)=\sum_{i=1}^{m}{(-1)^{i+1}\binom{m}{i}{\left[\binom{m-i}{k}/\binom{m}{k}\right]}^{s}}$ (19)

Since $X$ takes only positive integer values, we can compute its expectation by

$\mathbb{E}[X]=\sum_{s=1}^{\infty}{s\cdot\Pr(X=s)}=\sum_{s=0}^{\infty}{\Pr(X>s)}$ (20)

Using Eq. (19) in Eq. (20), we obtain

$\mathbb{E}[X]=\sum_{s=0}^{\infty}{\sum_{i=1}^{m}{(-1)^{i+1}\binom{m}{i}{\left[\binom{m-i}{k}/\binom{m}{k}\right]}^{s}}}$
$=\sum_{i=1}^{m}{(-1)^{i+1}\binom{m}{i}\sum_{s=0}^{\infty}{\left[\binom{m-i}{k}/\binom{m}{k}\right]}^{s}}$
$=\sum_{i=1}^{m}{(-1)^{i+1}\frac{\binom{m}{i}}{1-\binom{m-i}{k}/\binom{m}{k}}}$

\Box
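The closed-form expectation above is straightforward to evaluate numerically. The following Python sketch (the function name is ours) computes it and, as a sanity check, recovers the classical $mH(m)$ result of Problem K.1 when $k=1$.

from math import comb

def expected_collections(m, k):
    # Expected number of rounds (k distinct coupons per round, m types in total)
    # until every type has been seen, following the closed form of Problem K.2.
    total = 0.0
    for i in range(1, m + 1):
        miss_prob = comb(m - i, k) / comb(m, k)   # comb(a, b) is 0 when b > a
        total += (-1) ** (i + 1) * comb(m, i) / (1.0 - miss_prob)
    return total

print(expected_collections(10, 1))   # ~29.29, i.e., 10 * H(10) as in Problem K.1
print(expected_collections(45, 9))   # undirected case for n = 10: m = n(n-1)/2, k = n-1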

K.2 Sampling Complexity Analysis of π\pi-SGD Optimization

Suppose there are $n$ neighboring nodes around the central node $v$. $\pi$-SGD optimization samples a permutation of these $n$ nodes at random each time and models their dependencies based on the sampled permutation. As mentioned in the main body, we are interested in the average number of sampling times needed to model all the pairwise correlations between these $n$ nodes. This problem can be equivalently formulated in graph-theoretic language as follows:

Problem K.3 (Complete Graph Covering Problem).

Let $G^{\prime}$ be an empty graph with $n$ nodes. Each time we generate a Hamiltonian path at random and add the corresponding $n-1$ edges to $G^{\prime}$ (edges can be generated repeatedly at different times). How many times does it take on average before graph $G^{\prime}$ covers the complete graph $K_{n}$?

It is difficult to give an analytical solution to this problem, so we seek an approximate one. In fact, the complete graph covering problem (Problem K.3) has an interesting connection with the $k$-coupon collector's problem (Problem K.2) discussed above. The generation of a Hamiltonian path among $n$ nodes at each time is equivalent to drawing $n-1$ interrelated edges (they have to be connected in an end-to-end manner, e.g., 1-2, 2-3, 3-4) from all possible $\frac{n(n-1)}{2}$ edges. Suppose we ignore the interrelations between these $n-1$ edges and each time choose $n-1$ edges independently (they do not have to be in an end-to-end manner, e.g., 1-2, 1-3, 1-4) and uniformly at random without repetition. In that case, Problem K.3 degenerates into a special case of Problem K.2. Thus, we have the following conjecture:

Conjecture K.4.

Suppose there are $n$ neighboring nodes around the central node $v$, and each time we sample a permutation of these $n$ nodes at random. How many times does it take on average before any two nodes have become neighbors at least once? This problem is equivalent to the complete graph covering problem, which shares a similar result with the $k$-coupon collector's problem: Suppose there are $m=\frac{n(n-1)}{2}$ different types of coupons, and each time one chooses $k=n-1$ coupons (without repetition) independently and uniformly at random from the $m$ types. How many times does it take on average before obtaining at least one of every type of coupon?

Figure 9: The experimental results of the numerical simulation. Since the results of the complete graph covering problem are equal to the sampling times of $\pi$-SGD optimization, we label them as "$\pi$-SGD". The blue and green lines represent the undirected (with $\frac{n(n-1)}{2}$ undirected edges) and directed (with $n(n-1)$ bi-directional edges) cases, respectively. In addition, since the $k$-coupon collector's problem gives almost the same results as the complete graph covering problem, for clarity we only show 20 uniformly spaced points for the numerical results of its closed-form expression: the light yellow points represent the undirected case ($m=\frac{n(n-1)}{2}$, $k=n-1$) and the light blue points represent the directed case ($m=n(n-1)$, $k=n-1$). We highlight the results of $\pi$-SGD at values of $n$ that are multiples of 100 (marked by blue and green triangular stars) for comparison with those of the $k$-coupon collector's problem.

Since the analytical solution to the $k$-coupon collector's problem has been given by Problem K.2 in Section K.1, we can use it to approximate the result of Problem K.3 and estimate the sampling complexity of $\pi$-SGD optimization. We also conduct extensive numerical experiments to compare the results of Problem K.3 with those of Problem K.2 when $n$ ranges from 1 to 1,000. We consider both the undirected and directed cases for Problem K.3, with $\frac{n(n-1)}{2}$ undirected and $n(n-1)$ bi-directional edges, respectively. Correspondingly, Problem K.2 takes $m=\frac{n(n-1)}{2}$ and $m=n(n-1)$ coupons. For each $n$, we conduct experiments for 10,000 runs and report the average number of times needed to cover these edges/coupons. As shown in Figure 9, Problem K.3 ($\pi$-SGD) gives almost the same numerical results as Problem K.2 (the closed-form expression), verifying the rationality of Conjecture K.4. Hence, we conclude the following observation:
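A minimal version of this simulation can be reproduced with the Monte Carlo sketch below (the run count and function name are ours); its output for the undirected case can be compared against the closed-form value of Problem K.2 with $m=\frac{n(n-1)}{2}$ and $k=n-1$.

import random

def covering_times(n, runs=1000, directed=False):
    # Monte Carlo estimate for Problem K.3: sample random Hamiltonian paths
    # until every (un)directed edge of K_n has appeared at least once.
    target = n * (n - 1) if directed else n * (n - 1) // 2
    total = 0
    for _ in range(runs):
        seen, steps = set(), 0
        while len(seen) < target:
            perm = random.sample(range(n), n)   # one random permutation = one Hamiltonian path
            for a, b in zip(perm, perm[1:]):
                seen.add((a, b) if directed else frozenset((a, b)))
            steps += 1
        total += steps
    return total / runs

print(covering_times(10))   # compare with the closed form for m = 45, k = 9 above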

Observation K.5.

Suppose there are $n$ neighboring nodes around the central node $v$, and each time we sample a permutation of these $n$ nodes at random. On average, any two nodes have been neighbors at least once after $\sum_{i=1}^{n(n-1)/2}{(-1)^{i+1}\frac{\binom{n(n-1)/2}{i}}{1-\binom{n(n-1)/2-i}{n-1}/\binom{n(n-1)/2}{n-1}}}$ sampling times, which is approximately $\mathcal{O}(n\ln n)$.

It is worth noting that our approach only needs $\frac{n}{2}$ times in the undirected case and $n$ times in the directed case. According to Observation K.5, if the degree of the central node $v$ is $n=100$, our approach requires about $\frac{100\ln 100}{100/2}\approx 9$ times fewer samples than $\pi$-SGD optimization. If $n=1000$, e.g., for a hub node in a large-scale network dataset, our approach requires about $\frac{1000\ln 1000}{1000/2}\approx 14$ times fewer samples.

Appendix L Analysis of Computational Complexity

In this section, we first provide the time and space complexity analysis of the recent related works MPSN [37] and CWN [38], and then conduct comprehensive experiments on memory consumption to validate the efficiency of our proposed PG-GNN.

L.1 Computational Complexity Analysis of MPSN and CWN

Recently, a line of works [37, 38] exploited local high-order interactions to effectively improve the expressive power of GNNs. MPSN [37] focuses on simplicial complexes, which are composed of simplices that generalize the 2-dimensional triangle to arbitrary $k$ dimensions. CWN [38] further generalizes the simplicial complexes of MPSN to cell complexes. Hence these works are particularly relevant to our approach. The core idea of these works is the lifting transformation, which maps graphs to more distinguishable simplicial complexes [37] or cell complexes [38] by adding additional structures, e.g., attaching $k$-simplices ($k\geq 2$) to $(k+1)$-cliques [37] and 2-cells to induced cycles [38]. Here we mainly discuss CWN since it is more powerful yet more efficient than MPSN.

Let $N$ and $M$ denote the number of nodes and edges, respectively. Let $X$ be a $d$-dimensional regular cell complex, $B_{p}$ be the maximum boundary size of a $p$-cell in $X$, and $S_{p}$ be the number of $p$-cells. For CWN, the time complexity is $\mathcal{O}\left(\sum_{p=1}^{d}{\left(B_{p}S_{p}+2\binom{B_{p}}{2}S_{p}\right)}\right)$, and the space complexity is $\mathcal{O}\left(N+\sum_{p=1}^{d}{\left(S_{p}+B_{p}S_{p}+2\binom{B_{p}}{2}S_{p}\right)}\right)$. Next, we analyze the time complexity for a generic lifting transformation that maps a graph to a 2-dimensional cell complex and attaches 2-cells to all the induced cycles in the graph. Since 0-cells, 1-cells, and 2-cells represent vertices, edges, and induced cycles, respectively, we have $d=2$, $S_{1}=M$, $B_{1}=2$, and $B_{2}$ equal to the size of the largest induced cycle considered. In the case of molecular graphs, the number of induced cycles (chemical rings), $S_{2}$, is usually upper-bounded by a small constant. Accordingly, CWN achieves outstanding empirical performance and efficiency on molecular tasks. However, in the case of social networks, there are usually $\Omega(N^{2})$ triangles (see IMDB-B, IMDB-M, and COLLAB in Table 7), even without mentioning other types of induced cycles. Thus we have $S_{2}=\Omega(N^{2})$, and the time complexity $\mathcal{O}\left(4M+B_{2}S_{2}+2\binom{B_{2}}{2}S_{2}\right)$ is at least quadratic in $N$. To make matters worse, for general graph distributions, $S_{2}$ may grow exponentially with the number of nodes [38], and the pre-processing step (the lifting transformation) may also become intractable. In a nutshell, the computational complexity may hinder the application of CWN outside of the molecular domain, where the importance of specific substructures is not well understood and their number may grow rapidly.
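The growth of the triangle count on dense, social-network-like graphs is easy to check empirically. The sketch below uses synthetic Erdős-Rényi graphs whose sizes and densities are our own illustrative choices; it only illustrates how quickly the number of cycles that the lifting step would attach 2-cells to can grow.

import networkx as nx

# Triangle counts grow rapidly on dense, social-network-like graphs, so attaching
# a 2-cell to every induced cycle quickly becomes costly.
for n in (50, 100, 200):
    g = nx.erdos_renyi_graph(n, p=0.5, seed=0)
    num_triangles = sum(nx.triangles(g).values()) // 3   # each triangle is counted at 3 nodes
    print(n, num_triangles)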

L.2 Memory Cost Analysis

Table 5: CPU RAM consumption (MiB) on real-world datasets. The gray font denotes the consumption of the pre-processing stage (i.e., lifting transformation) of CIN.
Model PROTEINS NCI1 IMDB-B IMDB-M COLLAB MNIST ZINC
GIN [7] 2,338 2,460 2,337 2,343 11,351 24,946 3,017
CIN [38], pre-proc. 561 627 769 745 N/A N/A 1,558
CIN [38], training 2,689 2,749 2,953 3,001 N/A N/A 2,993
PG-GNN (Ours) 2,343 2,466 2,351 2,349 11,298 24,955 3,020
Table 6: GPU memory consumption (MiB) on real-world datasets. "OOM" means out of memory (> 24,220 MiB).
Model PROTEINS NCI1 IMDB-B IMDB-M COLLAB MNIST ZINC
GIN [7] 887 889 881 877 1,125 981 901
CIN [38] 2,039 1,033 3,891 13,361 OOM N/A 1,371
PG-GNN (Ours) 980 1,142 1,202 1,036 21,485 4,127 1,367

According to Bodnar et al. [38], in all experiments they employ a model that stacks CWN layers with local aggregators as in GIN, naming their architecture "Cell Isomorphism Network" (CIN). Here we use GIN and CIN as our baselines to compare the memory consumption of different models. We use the codes released by the authors of GIN (https://github.com/weihua916/powerful-gnns) and CIN (https://github.com/twitter-research/cwn), and run experiments with the (optimal) hyper-parameter configurations reported in their original papers to keep the comparison as fair as possible. Tables 5 and 6 summarize the CPU RAM and GPU memory consumption of the various models, respectively. Note that the total CPU RAM consumption of CIN should be computed as the consumption of "pre-processing" + "training", while the other models do not require an extra pre-processing step. As shown in the tables, the memory cost of CIN grows rapidly outside of the molecular domain, such as on social networks and MNIST, consistent with our analysis above. In contrast, our PG-GNN is memory-efficient and outperforms CIN in terms of memory cost on almost all datasets, even performing on par with GIN on most datasets.

Appendix M Details of the Experiments

M.1 Details of Datasets

In this subsection, we provide detailed descriptions of datasets used in our experiments. The statistics of real-world datasets are summarized in Table 7.

Table 7: Statistics of real-world datasets. For MNIST, which contains directed graphs, the degree denotes in-degree / out-degree.
Property PROTEINS NCI1 IMDB-B IMDB-M COLLAB MNIST ZINC
Graphs 1,113 4,110 1,000 1,500 5,000 70,000 12,000
Classes 2 2 2 3 3 10 N/A
Nodes (avg) 39.06 29.87 19.77 13.00 74.49 70.57 23.16
Nodes (max) 620 111 136 89 492 75 37
Degree (avg) 3.73 2.16 9.76 10.14 65.97 8.00 / 8.00 2.15
Degree (max) 25 4 135 88 491 18 / 8 4
Triangles (avg) 27.40 0.05 391.99 305.90 124551.40 626.07 0.06
Triangles (max) 534 3 6,985 14,089 2,574,680 702 2

M.1.1 Synthetic Datasets

We conduct synthetic experiments of counting incidence substructures on two types of random graphs created by Chen et al. [11]: Erdős-Rényi random graphs and random regular graphs. The first dataset consists of 5,000 Erdős-Rényi random graphs with 10 nodes in each graph, where each edge exists with a probability of 0.3. The second consists of 5,000 random regular graphs with $n$ nodes in each graph and degree $d$, where $(n,d)$ is uniformly sampled from {(10, 6), (15, 6), (20, 5), (30, 5)}. Both datasets are randomly split into 30%, 20%, and 50% for training, validation, and testing.
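For reference, graphs with these statistics can be generated with NetworkX roughly as follows; the seeds and the exact sampling procedure are ours and need not match the data released by Chen et al. [11].

import random
import networkx as nx

random.seed(0)

# Erdos-Renyi set: 5,000 graphs, 10 nodes each, edge probability 0.3.
er_graphs = [nx.erdos_renyi_graph(10, 0.3, seed=i) for i in range(5000)]

# Random regular set: (n, d) drawn uniformly from the four configurations.
configs = [(10, 6), (15, 6), (20, 5), (30, 5)]
rr_graphs = []
for i in range(5000):
    n, d = random.choice(configs)
    rr_graphs.append(nx.random_regular_graph(d, n, seed=i))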

For the incidence triangle counting task, all nodes are labeled with Eq. (6). For the incidence 4-clique counting task, it is hard to derive a closed-form expression like Eq. (6), so we label each central node by counting how many groups of three neighboring nodes are fully connected. The evaluation metric of the incidence substructure counting task is the mean absolute error (MAE) between the predicted and true numbers of incidence substructures for each node.
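This labeling rule can be written directly as a check over neighbor triples; the sketch below is our own illustration (the function name is hypothetical).

from itertools import combinations
import networkx as nx

def incidence_4clique_count(g, v):
    # A 4-clique incident to v corresponds to a fully connected triple among v's neighbors.
    nbrs = list(g.neighbors(v))
    return sum(1 for a, b, c in combinations(nbrs, 3)
               if g.has_edge(a, b) and g.has_edge(a, c) and g.has_edge(b, c))

g = nx.erdos_renyi_graph(10, 0.3, seed=0)
labels = {v: incidence_4clique_count(g, v) for v in g.nodes}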

M.1.2 TUDataset

Bioinformatics.

PROTEINS is a dataset in which each graph represents a protein, and nodes represent secondary structure elements (SSEs) within the protein structure, i.e., helices, sheets, and turns. An edge connects two nodes if they are neighbors in the amino-acid sequence or 3D space. The task is to classify the proteins into enzymes and non-enzymes. NCI1 is a publicly available dataset collected by the National Cancer Institute (NCI). Each graph represents a chemical compound, in which nodes and edges represent atoms and chemical bonds, respectively. This dataset is related to anti-cancer screening, and the task is to predict whether the chemical compounds are positive or negative for non-small cell lung cancer.

Social Networks.

IMDB-BINARY is a movie-collaboration dataset containing the actor/actress and genre information of different movies on IMDB. Each graph corresponds to an actor/actress's ego network, in which nodes correspond to actors/actresses, and an edge indicates that two actors/actresses appear in the same movie. These graphs are derived from the Action and Romance genres, and the task is to classify the graphs into their genres. IMDB-MULTI is the multi-class version of IMDB-BINARY and contains a balanced set of ego networks derived from the Comedy, Romance, and Sci-Fi genres. COLLAB is a scientific collaboration dataset. Each graph corresponds to a researcher's ego network, in which nodes correspond to the researcher and their collaborators, and an edge indicates a collaboration between two researchers. These researchers come from three fields, i.e., High Energy Physics, Condensed Matter Physics, and Astro Physics. The task is to classify the graphs into the fields of the corresponding researchers.

M.1.3 MNIST and ZINC

MNIST.

MNIST [63] is a classical image classification dataset. The original MNIST images are converted into graphs using super-pixels [64]. Each graph represents an image, and its adjacency matrix is built with 8-nearest neighbors for each node (super-pixel). Note that since the relationship between each super-pixel (node) and its nearest neighbors is asymmetric, the resultant adjacency matrices are also asymmetric. For more details about the generation, please refer to Appendix A.2 in Dwivedi et al. [48]. The resultant graphs are of sizes 40-75 super-pixels, and each node’s features are assigned with super-pixel coordinates and intensity. MNIST has 55,000 training, 5,000 validation, and 10,000 testing graphs, where the 5,000 graphs for the validation set are randomly sampled from the training set. The evaluation metric for MNIST is the classification accuracy between the predicted class and ground-truth label for each graph.

ZINC.

ZINC [65] is one of the most popular real-world molecular datasets with 250K graphs, out of which Dwivedi et al. [48] randomly select 12K for efficiency. Each graph represents a molecule, where nodes and edges represent atoms and chemical bonds, respectively. The node features are the types of heavy atoms encoded in a one-hot manner. The task is to predict the constrained solubility, an important chemical property for molecules. ZINC has 10,000 training, 1,000 validation, and 1,000 testing graphs. The evaluation metric for ZINC is the mean absolute error (MAE) between the predicted and true constrained solubility for each molecular graph.

Table 8: Hyper-parameter configurations on synthetic datasets.
Hyper-parameter GCN GraphSAGE GIN rGIN RP LRP PG-GNN
batch size 32 32 32 32 16 16 16
hidden units 64 64 64 64 64 64 64
layers 3 3 5 5 5 5 5
dropout 0.5 0.0 0.5 0.5 0.0 0.0 0.0
initial lr 0.01 0.01 0.01 0.01 0.01 0.001 0.001

M.2 Details of Hyper-Parameters

M.2.1 Synthetic Datasets

We select the architectural hyper-parameters based on the performance in the validation set. The hyper-parameter search space is listed as follows: the batch size in {16, 32, 64}, the number of hidden units in {16, 32, 64}, the number of layers in {3, 4, 5}, the dropout ratio in {0.0, 0.5} after the final prediction layer, the initial learning rate in {0.01, 0.005, 0.001}, the decay rate in {0.5, 0.9}, the decay rate patience in {5, 10, 15, 20, 25}, and the aggregator in {SRN, GRU, LSTM}.

Configurations of Baselines.

We use the default hyper-parameter configurations reported in their original papers. Specifically, we follow Hamilton et al. [5] to sample 2-hop neighborhoods for each node, set the neighborhood sample sizes $S_{1}$ and $S_{2}$ of the 1-hop and 2-hop neighborhoods both to 5, and use LSTM [17] as the aggregator in GraphSAGE. We use the uniform distribution over $D=\{0,0.01,0.02,\ldots,0.99\}$ as the random distribution $\mu$ in rGIN, following Sato et al. [25]. We set the dimension $m$ of the one-hot node IDs to 10 and use GIN [7] as the backbone in RP, following Murphy et al. [1]. Following Chen et al. [11], we set the depth $l$ and width $k$ to 1 and 3 in LRP. Other hyper-parameters of the different models are shown in Table 8.

Configurations of PG-GNN.

We report the hyper-parameters chosen by our model selection procedure as follows. For all tasks and datasets, 5 GNN layers (including the input layer) are applied, and 2-layer LSTMs are used as the aggregation functions. Batch normalization [66] is applied to every hidden layer. All models are initialized using Glorot initialization [67] and trained using the Adam optimizer [68] with an initial learning rate of 0.001. If the performance on the validation set does not improve after 20 epochs, the learning rate is decayed by a factor of 0.5, except for the 4-clique counting task on ER graphs, whose patience is set to 25 epochs. The training is stopped when the learning rate reaches the minimum value of 5E-6.
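A minimal PyTorch sketch of this optimization schedule is shown below; the model and validation loss are placeholders, not the actual PG-GNN training pipeline.

import torch

# Illustrative schedule: Adam with initial lr 1e-3, decay by 0.5 after 20 stagnant
# epochs, training stops once the learning rate reaches 5e-6.
model = torch.nn.Linear(8, 1)   # placeholder for the actual PG-GNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=20, min_lr=5e-6)

for epoch in range(10_000):
    val_loss = 1.0                       # placeholder: validation metric of this epoch
    scheduler.step(val_loss)
    if optimizer.param_groups[0]["lr"] <= 5e-6:
        break                            # minimum learning rate reached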

M.2.2 Real-World Datasets

TUDataset.

We select the architectural hyper-parameters based on the accuracy in one random training fold. The hyper-parameter search space is listed as follows: the batch size in {16, 32, 64}, the number of hidden units in {8, 16, 32, 64}, the number of layers in {3, 4, 5}, the dropout ratio in {0.0, 0.5} after the final prediction layer, the initial learning rate in {0.01, 0.005, 0.001}, the decay rate in {0.5, 0.9}, the readout function in {SUM, MEAN}, and the aggregator in {SRN, GRU, LSTM}.

MNIST and ZINC.

We select the architectural hyper-parameters based on the performance in the validation set. The hyper-parameter search space is listed as follows: the batch size in {32, 64, 128}, the number of hidden units in {32, 64, 128}, the number of layers in {3, 4, 5}, the dropout ratio in {0.0, 0.5} after the final prediction layer, the initial learning rate in {0.01, 0.005, 0.001}, the decay rate in {0.5, 0.9}, the decay rate patience in {5, 10, 15, 20, 25}, the readout function in {SUM, MEAN}, and the aggregator in {SRN, GRU, LSTM}.

Configurations.

We report the hyper-parameters chosen by our model selection procedure as follows. For all datasets, 3 or 5 GNN layers (including the input layer) are applied, and 2-layer LSTMs are used as the aggregation functions. Batch normalization [66] is applied to every hidden layer. All models are initialized using Glorot initialization [67] and trained using the Adam optimizer [68] with an initial learning rate of 0.001. For TUDataset, the learning rate is decayed by a factor of 0.5 every 50 epochs, and the training is stopped when the number of epochs reaches the maximum value of 400. For MNIST and ZINC, if the performance on the validation set does not improve after 20 and 25 epochs, respectively, the learning rate is decayed by a factor of 0.5, and the training is stopped when the learning rate reaches the minimum value of 5E-6. Other hyper-parameters on the different datasets are shown in Table 9.

Table 9: Hyper-parameter configurations on real-world datasets.
Hyper-parameter PROTEINS NCI1 IMDB-B IMDB-M COLLAB MNIST ZINC
batch size 16 32 16 32 32 64 64
hidden units 8 32 16 16 64 128 128
layers 5 5 5 5 3 5 5
dropout 0.5 0.0 0.0 0.5 0.5 0.0 0.0
degree False False True True True N/A N/A
readout SUM SUM SUM or MEAN SUM SUM or MEAN MEAN SUM

M.3 Computing Infrastructures

Hardware Infrastructures.

The experiments are conducted on Linux servers equipped with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 256GB RAM and 8 NVIDIA TITAN RTX GPUs.

Software Infrastructures.

All models are implemented using Python version 3.6, NetworkX version 2.4 [69], PyTorch version 1.4.0 [70] with CUDA version 10.0.130, and cuDNN version 7.6.5. In addition, the benchmark datasets are loaded by Deep Graph Library (DGL) version 0.4.2 [71].