
A New Perspective on the Effects of Spectrum in Graph Neural Networks

Mingqi Yang    Yanming Shen    Rui Li    Heng Qi    Qiang Zhang    Baocai Yin
Abstract

Many improvements on GNNs can be deemed as operations on the spectrum of the underlying graph matrix, which motivates us to directly study the characteristics of the spectrum and their effects on GNN performance. By generalizing most existing GNN architectures, we show that the correlation issue caused by the unsmooth spectrum becomes the obstacle to leveraging more powerful graph filters as well as developing deep architectures, which therefore restricts GNNs' performance. Inspired by this, we propose the correlation-free architecture, which naturally removes the correlation issue among different channels and makes it possible to utilize more sophisticated filters within each channel. The final correlation-free architecture with more powerful filters consistently boosts the performance of learning graph representations. Code is available at https://github.com/qslim/gnn-spectrum.


1 Introduction

Although the graph neural network (GNN) community is developing rapidly in both theory and applications, a generalized understanding of the effects of the graph's spectrum in GNNs is still lacking. Many improvements can ultimately be unified into different operations on the spectrum of the underlying graph, while their effectiveness is interpreted through several well-accepted but isolated concepts: (Wu et al., 2019; Zhu et al., 2021; Klicpera et al., 2019a, b; Chien et al., 2021; Balcilar et al., 2021) explain it from the perspective of simulating low/high pass filters; (Ming Chen et al., 2020; Xu et al., 2018; Liu et al., 2020; Li et al., 2018) interpret it as ways of alleviating the oversmoothing phenomenon in deep architectures; (Cai et al., 2021) adopts the concept of normalization in neural networks and applies it to graph data. Since these improvements all indirectly operate on the spectrum, we are motivated to study the potential connections between GNN performance and the characteristics of the graph's spectrum. If we can find such a connection, it would provide a deeper and more general insight into these seemingly unrelated improvements associated with the graph's spectrum (low/high pass filters, oversmoothing, graph normalization, etc.), and further identify potential issues in existing architectures. To this end, we first consider a simple correlation metric, the cosine similarity among signals, and study its relation to the graph's spectrum in the graph convolution operation. This provides a new perspective: in existing GNN architectures, the distribution of eigenvalues of the underlying graph matrix controls the cosine similarity among signals. An ill-posed unsmooth spectrum easily makes signals over-correlated, which is evidence of information loss.

Compared with oversmoothing studies (Li et al., 2018; Oono & Suzuki, 2020; Rong et al., 2019; Huang et al., 2020), the correlation analysis associated with the graph's spectrum further indicates that the correlation issue is essentially caused by the graph's spectrum. In other words, for graph topologies with an unsmooth spectrum, the issue can appear even with a shallow architecture, and a deep model further makes the spectrum less smooth and eventually exacerbates this issue. Meanwhile, the correlation analysis also provides a unified interpretation of the effectiveness of various existing improvements associated with the graph's spectrum, since they all implicitly impose some constraints on the spectrum to alleviate the correlation issue. However, these improvements are trade-offs between alleviating the correlation issue and applying more powerful graph filters: since a filter implementation directly reflects on the spectrum, a filter better matched to the relevant signal patterns may correspond to an ill-posed spectrum, which in turn fails to yield performance improvements. Hence, in general GNN architectures, the correlation issue becomes the obstacle to applying more powerful filters. Although one can in theory approximate more sophisticated graph filters by increasing the order $k$ of the polynomial (Shuman et al., 2013), in popular models simple filters, e.g., the low-pass filter (Kipf & Welling, 2017; Wu et al., 2019) or fixed filter coefficients (Klicpera et al., 2019a, b), serve as the practically applicable choice.

With all the above understandings, the key solution is to decouple the correlation issue from the filter design, which results in our correlation-free architecture. In contrast to existing approaches, it allows us to focus on exploring more sophisticated filters without concern for the correlation issue. With this guarantee, we can improve the approximation abilities of polynomial filters to better approximate the desired, more complex filters (Hammond et al., 2011; Defferrard et al., 2016). However, we also find that this cannot be achieved by simply increasing the number of polynomial bases, as the basis characteristics implicitly restrict the number of available bases in the resulting polynomial filter. For this reason, the commonly used (normalized) adjacency or Laplacian matrix, whose spectrum serves as the basis, cannot effectively utilize high-order bases. To address this issue, we propose new graph matrix representations, which are capable of leveraging more bases and learnable filter coefficients to better respond to more complex signal patterns. The resulting model significantly boosts performance on learning graph representations. Although there are extensive studies on polynomial filters, including fixed and learnable coefficients (Defferrard et al., 2016; Levie et al., 2019; Chien et al., 2021; He et al., 2021), to the best of our knowledge they all focus on the coefficient design and use the (normalized) adjacency or Laplacian matrix as the basis. Therefore, our work is well distinguished from them. Our contributions are summarized as follows:

  • We show that general GNN architectures suffer from the correlation issue and also quantify this issue with spectral smoothness;

  • We propose the correlation-free architecture that decouples the correlation issue from graph convolution;

  • We show that the spectral characteristics also hinder the approximation abilities of polynomial filters and address it by altering the graph’s spectrum.

2 Preliminaries

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ be an undirected graph with node set $\mathcal{V}$ and edge set $\mathcal{E}$. We denote by $n=|\mathcal{V}|$ the number of nodes, $A\in\mathbb{R}^{n\times n}$ the adjacency matrix, and $H\in\mathbb{R}^{n\times d}$ the node feature matrix, where $d$ is the feature dimensionality. $\bm{h}\in\mathbb{R}^{n}$ is a graph signal that corresponds to one dimension of $H$.

Spectral Graph Convolution (Hammond et al., 2011; Defferrard et al., 2016). The definition of spectral graph convolution relies on the Fourier transform on the graph domain. For a signal $\bm{h}$ and graph Laplacian $L=U\Lambda U^{\top}$, the Fourier transform is $\hat{x}=U^{\top}x$ and the inverse transform is $x=U\hat{x}$. Then, the graph convolution of a signal $\bm{h}$ with a filter $\bm{g}_{\theta}$ is

\bm{g}_{\theta}*\bm{h}=U\bigl(\bigl(U^{\top}\bm{g}_{\theta}\bigr)\odot\bigl(U^{\top}\bm{h}\bigr)\bigr)=U\hat{G}_{\theta^{\prime}}U^{\top}\bm{h}, \quad (1)

where $\hat{G}_{\theta^{\prime}}$ denotes a diagonal matrix whose diagonal corresponds to the spectral filter coefficients. To avoid eigendecomposition and ensure scalability, $\hat{G}_{\theta^{\prime}}$ is approximated by a truncated expansion in terms of Chebyshev polynomials $T_{k}(\tilde{\Lambda})$ up to the $k$-th order (Hammond et al., 2011), which is also a polynomial of $\Lambda$,

\hat{G}_{\theta^{\prime}}(\Lambda)\approx\sum_{i=0}^{k}\theta_{i}^{\prime}T_{i}(\tilde{\Lambda})=\sum_{i=0}^{k}\theta_{i}\Lambda^{i}, \quad (2)

where $\tilde{\Lambda}=\frac{2}{\lambda_{\max}}\Lambda-I_{n}$. Now the convolution in Eq. 1 is

U\hat{G}_{\theta^{\prime}}U^{\top}\bm{h}\approx U\Bigl(\sum_{i=0}^{k}\theta_{i}\Lambda^{i}\Bigr)U^{\top}\bm{h}=\sum_{i=0}^{k}\theta_{i}L^{i}\bm{h}. \quad (3)

Note that this expression is $k$-localized since it is a $k$-order polynomial in the Laplacian, i.e., it depends only on nodes that are at most $k$ hops away from the central node.
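To make the polynomial form concrete, here is a minimal NumPy sketch of the filtering in Eq. 3, using a dense Laplacian and a toy path graph (illustrative assumptions; practical implementations use sparse matrix products):

```python
import numpy as np

def poly_filter(L, h, theta):
    """Apply the k-order polynomial filter of Eq. 3: sum_i theta_i * L^i h.

    L     : (n, n) graph Laplacian (dense here for clarity)
    h     : (n,)   graph signal
    theta : list of k+1 polynomial coefficients theta_0, ..., theta_k
    """
    out = np.zeros_like(h)
    Lih = h.copy()                    # L^0 h
    for theta_i in theta:
        out += theta_i * Lih          # accumulate theta_i * L^i h
        Lih = L @ Lih                 # advance to L^(i+1) h
    return out

# toy example: path graph on 4 nodes
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian
h = np.array([1.0, 0.0, 0.0, 0.0])
print(poly_filter(L, h, theta=[0.5, 0.3, 0.2]))   # response stays within 2 hops of node 0
```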

Graph Convolutional Network (GCN) (Kipf & Welling, 2017). GCN is derived from the first-order Chebyshev polynomial with several approximations. The authors further introduce the renormalization trick $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ with $\tilde{A}=A+I_{n}$ and $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$. GCN can also be generalized to multiple input channels and a layer-wise model:

H^{(l+1)}=\sigma\Bigl(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\Bigr), \quad (4)

where $W$ is a learnable weight matrix and $\sigma$ is a nonlinear activation function.
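As an illustration of Eq. 4, a minimal NumPy sketch of a single GCN layer with the renormalization trick (dense matrices and a tanh stand-in for the nonlinearity are assumptions for brevity):

```python
import numpy as np

def gcn_layer(A, H, W, sigma=np.tanh):
    """One GCN layer as in Eq. 4 with the renormalization trick.

    A : (n, n) adjacency matrix, H : (n, d) features, W : (d, d') weights.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # A~ = A + I_n (add self-loops)
    d_tilde = A_tilde.sum(axis=1)                # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # D~^{-1/2} A~ D~^{-1/2}
    return sigma(S @ H @ W)

# toy usage with random data (shapes are illustrative)
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                   # random undirected graph
H = rng.standard_normal((5, 8))
W = rng.standard_normal((8, 4))
print(gcn_layer(A, H, W).shape)                  # (5, 4)
```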

Graph Diffusion Convolution (GDC) (Klicpera et al., 2019b). A generalized graph diffusion is given by the diffusion matrix:

H=\sum_{k=0}^{\infty}\theta_{k}T^{k}, \quad (5)

with weight coefficients $\theta_{k}$ and a generalized transition matrix $T$. $T$ can be $T_{rw}=AD^{-1}$, $T_{sym}=D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$, or others, as long as the series converges. GDC can be viewed as a generalization of the original definition of spectral graph convolution: it also applies polynomial filters, but not necessarily polynomials of the Laplacian.
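For instance, a sketch of a truncated diffusion matrix with PPR weights $\theta_{i}=\alpha(1-\alpha)^{i}$, one of the coefficient choices discussed in GDC (the value of alpha and the truncation order k below are illustrative assumptions):

```python
import numpy as np

def gdc_ppr(A, alpha=0.15, k=32):
    """Truncated graph diffusion of Eq. 5 with PPR weights theta_i = alpha (1 - alpha)^i
    and the random-walk transition matrix T_rw = A D^{-1} (assumes no isolated nodes)."""
    n = A.shape[0]
    T = A @ np.diag(1.0 / A.sum(axis=0))   # column-normalized transition matrix
    S = np.zeros((n, n))
    T_pow = np.eye(n)                      # T^0
    for i in range(k + 1):
        S += alpha * (1 - alpha) ** i * T_pow
        T_pow = T_pow @ T
    return S
```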

3 Revisiting Existing GNN Architectures

Table 1: A summary of $p_{\gamma}$ in Eq. 6 in general graph convolutions.
Model             GCN      SGC      APPNP     GCNII     GDC      SSGC     GPR        ChebyNet   CayleyNet  BernNet
Poly-basis        General  General  Residual  Residual  General  General  General    Chebyshev  Cayley     Bernstein
Poly-coefficient  Fixed    Fixed    Fixed     Fixed     Fixed    Fixed    Learnable  Fixed      Learnable  Learnable

We first generalize existing spectral graph convolution as follows

H=\sigma\bigl(p_{\gamma}(S)f_{\Theta}(H)\bigr), \quad (6)

where $S$ is the graph matrix, e.g., the adjacency or Laplacian matrix or their normalized forms. $p_{\gamma}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{n\times n}$ is a polynomial of the graph matrix with coefficients $\gamma\in\mathbb{R}^{k}$ for a $k$-order polynomial. $f_{\Theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{\prime}}$ is the feature transformation neural network with learnable parameters $\Theta$. In SGC (Wu et al., 2019), GDC (Klicpera et al., 2019b), SSGC (Zhu & Koniusz, 2020), and GPR (Chien et al., 2021), $p_{\gamma}$ is implemented as a general polynomial, i.e., $p_{\gamma}(S)=\sum_{i=0}^{k}\gamma_{i}S^{i}$. Their differences lie in the coefficients $\gamma$. For example, SGC corresponds to the very simple form $\gamma_{i}=0$ for $i<k$ and $\gamma_{k}=1$. By removing the nonlinear layers in GCNII (Ming Chen et al., 2020), APPNP (Klicpera et al., 2019a) and GCNII share a similar graph convolution layer,

H^{(l)}=(1-\alpha)SH^{(l-1)}+\alpha Z,\quad H^{(0)}=Z,\quad Z=f_{\Theta}(X),

where $\alpha\in(0,1)$ and $X\in\mathbb{R}^{n\times d}$ is the input node feature matrix. By deriving its closed form, we reformulate it with Eq. 6 as $p_{\gamma}(S)=\sum_{i=0}^{k-1}\alpha(1-\alpha)^{i}S^{i}+(1-\alpha)^{k}S^{k}$. In ChebyNet (Defferrard et al., 2016), CayleyNet (Levie et al., 2019) and BernNet (He et al., 2021), $p_{\gamma}$ corresponds to Chebyshev, Cayley and Bernstein polynomials respectively. GPR, CayleyNet and BernNet apply learnable coefficients $\gamma$, where $\gamma$ is learned as the coefficients of the general, Cayley and Bernstein bases respectively. Therefore, with our formulation in Eq. 6, general graph convolutions mainly differ in $p_{\gamma}$, as summarized in Tab. 1.

Footnote 1: Here, we follow the naming convention in GCNII, called initial residual connection. GCN and GCNII interlace nonlinear computations over layers, making it difficult to reformulate all their layers with Eq. 6. But one can represent them in the recursive form $H^{(l)}=\sigma\bigl(p_{\gamma}(S)f_{\Theta}(H^{(l-1)})\bigr)$. For example, in GCN, we have $p_{\gamma}(S)=S$ and $f_{\Theta}(H^{(l-1)})=H^{(l-1)}\Theta$ with $S=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$.
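A minimal NumPy sketch of the generalized convolution in Eq. 6 (the feature transformation f_theta and the coefficient vectors below are illustrative stand-ins, not the trained models):

```python
import numpy as np

def general_conv(S, H, gamma, f_theta, sigma=np.tanh):
    """Generalized spectral graph convolution of Eq. 6:
    sigma( p_gamma(S) f_Theta(H) ) with p_gamma(S) = sum_i gamma_i S^i,
    evaluated by iteratively accumulating S^i Z (no explicit matrix powers of S)."""
    Z = f_theta(H)            # feature transformation f_Theta
    out = np.zeros_like(Z)
    SZ = Z.copy()             # S^0 Z
    for g in gamma:
        out += g * SZ         # gamma_i * S^i Z
        SZ = S @ SZ
    return sigma(out)

# Different models correspond to different gamma (Tab. 1), e.g. for k = 2:
#   SGC:          gamma = [0, 0, 1]
#   APPNP/GCNII:  gamma = [alpha, alpha * (1 - alpha), (1 - alpha) ** 2]   (closed form above)
```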

3.1 Correlation Analysis in the Lens of Graph’s Spectrum

Based on the generalized formulation of Eq. 6, we conduct a correlation analysis of existing graph convolutions from the perspective of the graph's spectrum. We denote $\mathcal{S}=p_{\gamma}(S)$ for simplicity, and let $\bm{h}\in\mathbb{R}^{n}$ denote one channel of $f_{\Theta}(H)$. The convolution on $\bm{h}$ is then represented as $\mathcal{S}\bm{h}$. The cosine similarity between $\bm{h}$ and the $i$-th eigenvector $\bm{p}_{i}$ of $\mathcal{S}$ is

\cos\bigl(\langle\bm{h},\bm{p}_{i}\rangle\bigr)=\frac{\bm{h}^{\top}\bm{p}_{i}}{\sqrt{\sum^{n}_{j=1}\bigl(\bm{h}^{\top}\bm{p}_{j}\bigr)^{2}}}=\frac{\alpha_{i}}{\sqrt{\sum^{n}_{j=1}\alpha_{j}^{2}}}. \quad (7)

$\alpha_{i}=\bm{h}^{\top}\bm{p}_{i}$ is the weight of $\bm{h}$ on $\bm{p}_{i}$ when representing $\bm{h}$ in the set of orthonormal bases $\bm{p}_{i},i\in[n]$. The cosine similarity between $\mathcal{S}\bm{h}$ and $\bm{p}_{i}$ is

\cos\bigl(\langle\mathcal{S}\bm{h},\bm{p}_{i}\rangle\bigr)=\frac{\alpha_{i}\lambda_{i}}{\sqrt{\sum^{n}_{j=1}\alpha_{j}^{2}\lambda_{j}^{2}}}. \quad (8)

The detailed derivations of Eq. 7 and Eq. 8 are given in Appendix A.
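As a quick numerical sanity check of Eq. 7 and Eq. 8, the following NumPy sketch builds a synthetic symmetric matrix with a deliberately unsmooth spectrum and compares the closed form of Eq. 8 with a direct computation (all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P, _ = np.linalg.qr(rng.standard_normal((n, n)))     # orthonormal eigenvectors p_i
lam = np.array([3.0, 1.0, 0.8, 0.5, 0.2, 0.05])      # dissimilar eigenvalue magnitudes
S = P @ np.diag(lam) @ P.T                           # symmetric S with unsmooth spectrum

h = rng.standard_normal(n)
alpha = P.T @ h                                      # alpha_i = h^T p_i

cos_h  = alpha / np.sqrt((alpha ** 2).sum())                      # Eq. 7
cos_Sh = alpha * lam / np.sqrt((alpha ** 2 * lam ** 2).sum())     # Eq. 8
cos_Sh_direct = (P.T @ (S @ h)) / np.linalg.norm(S @ h)           # cos(<Sh, p_i>) directly

print(np.round(np.abs(cos_h), 3))
print(np.round(np.abs(cos_Sh), 3))          # weight shifts towards p_1 (lambda_1 = 3.0)
print(np.allclose(cos_Sh, cos_Sh_direct))   # True: Eq. 8 matches the direct computation
```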

Eq. 8 builds the connection between the cosine similarity and the spectrum of the underlying graph matrix. We say the spectrum is smooth if all eigenvalues have similar magnitudes. Comparing Eq. 7 and Eq. 8 shows that graph convolution with an unsmooth spectrum, i.e., dissimilar eigenvalues, makes signals correlated (higher cosine similarity) with the eigenvectors of larger-magnitude eigenvalues and orthogonal (lower cosine similarity) to the eigenvectors of smaller-magnitude eigenvalues. When a zero eigenvalue is involved in the spectrum, signals lose information in the direction of the corresponding eigenvectors. In a deep architecture, this problem is further exacerbated:

Proposition 3.1.

Assume $\mathcal{S}\in\mathbb{R}^{n\times n}$ is a symmetric matrix with real-valued entries, $|\lambda_{1}|\geq|\lambda_{2}|\geq\dots\geq|\lambda_{n}|$ are its $n$ real eigenvalues, and $\bm{p}_{i}\in\mathbb{R}^{n},i\in[n]$ are the corresponding eigenvectors. Then, for any given $\bm{h},\bm{h}^{\prime}\in\mathbb{R}^{n}$, we have
(i) $|\cos(\langle\mathcal{S}^{k+1}\bm{h},\bm{p}_{1}\rangle)|\geq|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|$ and $|\cos(\langle\mathcal{S}^{k+1}\bm{h},\bm{p}_{n}\rangle)|\leq|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{n}\rangle)|$ for $k=0,1,2,\dots,+\infty$;
(ii) If $|\lambda_{1}|>|\lambda_{2}|$, then $\lim_{k\rightarrow\infty}|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|=\lim_{k\rightarrow\infty}|\cos(\langle\mathcal{S}^{k}\bm{h},\mathcal{S}^{k}\bm{h}^{\prime}\rangle)|=1$, and the convergence speed is decided by $|\frac{\lambda_{2}}{\lambda_{1}}|$.

We prove Proposition 3.1 in Appendix B. Proposition 3.1 shows that a deeper architecture violates the spectrum's smoothness, which therefore makes the input signals more correlated to each other. Finally, $\textrm{Rank}((\bm{h}_{1},\bm{h}_{2},\dots,\bm{h}_{d}))=1$ and the information within the signals is washed out. Note that all of the above analysis does not impose any constraint on the underlying graph, such as connectivity.

Footnote 2: Here, nonlinearity is not involved in the propagation step. This meets the case of the decoupling structure where a multi-layer GNN is split into independent propagation and prediction steps (Liu et al., 2020; Wu et al., 2019; Klicpera et al., 2019a; Zhu & Koniusz, 2020; Zhang et al., 2021). Propagation involving nonlinearity remains unexplored due to its high complexity, except for the case of ReLU nonlinearity (Oono & Suzuki, 2020). Most convergence analyses (such as over-smoothing) only study the simplified linear case (Cai et al., 2021; Liu et al., 2020; Wu et al., 2019; Klicpera et al., 2019a; Zhao & Akoglu, 2020; Xu et al., 2018; Ming Chen et al., 2020; Zhu & Koniusz, 2020; Klicpera et al., 2019b; Chien et al., 2021).
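Proposition 3.1(ii) can be illustrated numerically; in the following sketch (synthetic matrix, assumed eigenvalues), two random signals become almost perfectly correlated after repeated application of $\mathcal{S}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
P, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = P @ np.diag([3.0, 1.0, 0.8, 0.5, 0.2, 0.05]) @ P.T   # |lambda_1| > |lambda_2|

h1, h2 = rng.standard_normal(n), rng.standard_normal(n)
for k in [0, 2, 5, 10, 20]:
    Sk = np.linalg.matrix_power(S, k)
    a, b = Sk @ h1, Sk @ h2
    cos_ab = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))   # |cos(<S^k h, S^k h'>)|
    cos_p1 = abs(a @ P[:, 0]) / np.linalg.norm(a)                   # |cos(<S^k h, p_1>)|
    print(k, round(cos_p1, 4), round(cos_ab, 4))
# both quantities approach 1, at a rate governed by |lambda_2 / lambda_1| (here 1/3 per power)
```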

Revisiting oversmoothing via the lens of the correlation issue. In the well-known oversmoothing analysis, the convergence is considered as $\lim_{k\rightarrow\infty}\tilde{A}_{\mathrm{sym}}^{k}H^{(0)}=H^{(\infty)}$, where each row of $H^{(\infty)}$ only depends on the degree of the corresponding node, provided that the graph is irreducible and aperiodic (Xu et al., 2018; Liu et al., 2020; Zhao & Akoglu, 2020; Chien et al., 2021). Our analysis generalizes this result. In our analysis, the convergence of the cosine similarity among signals does not require the graph to be connected or normalized, as is required in the oversmoothing analysis analogous to the stationary distribution of a Markov chain, and it does not even require the model to be deep: it is essentially caused by a bad distribution of eigenvalues, while a deep architecture exacerbates it. Interestingly, from this perspective, the correlation problem actually relates to specific topologies, since different topologies correspond to different spectra. There exist topologies inherently with bad eigenvalue distributions, and they suffer from the problem even with a shallow architecture. Also, by taking the symmetry into consideration, Proposition 3.1(i) shows that the convergence of the cosine similarity with respect to $k$ is also monotonic. In contrast to existing results, which only discuss the theoretical infinite-depth case, this provides more concrete evidence in the practical finite-depth case that a deeper architecture can be more harmful than a shallow one.

Revisiting graph filters via the lens of the correlation issue. The graph filter is approximated by a polynomial in the theory of spectral graph convolution (Hammond et al., 2011; Defferrard et al., 2016). Although one can theoretically approximate any desired graph filter by increasing the order $k$ of the polynomial (Shuman et al., 2013), most GNNs cannot gain improvements by enlarging $k$. Instead, the simple low-pass filter studied by many improvements on spectral graph convolution acts as the practically effective choice (Shuman et al., 2013; Wu et al., 2019; NT & Maehara, 2019; Muhammet et al., 2020; Klicpera et al., 2019b). Although recent studies involve high-pass filters to better process high-frequency signals, the low-pass component is always required in graph convolution (Zhu & Koniusz, 2020; Zhu et al., 2021; Balcilar et al., 2021; Bo et al., 2021; Gao et al., 2021). This can be explained from the perspective of correlation analysis. As we have shown, graph convolution is sensitive to the spectrum. A filter that better responds to the relevant signal patterns may result in an unsmooth spectrum, making different channels correlated with each other after convolution. In contrast, although a low-pass filter has limited expressiveness, it corresponds to a smoother spectrum, which alleviates the correlation issue.

4 Correlation-free Architecture

The correlation analysis via the lens of the graph's spectrum shows that in general GNN architectures the unsmooth spectrum leads to the correlation issue and therefore acts as the obstacle to developing deep architectures as well as leveraging more expressive graph filters. To overcome this issue, a natural idea is to assign the graph convolutions in different channels of $f_{\Theta}(H)$ different spectra, which can be viewed as a generalization of Eq. 6 as follows

H=f_{\Psi}\bigl(\bigl[p_{\Gamma_{1}}(S)f_{\Theta_{1}}(H),\dots,p_{\Gamma_{d^{\prime}}}(S)f_{\Theta_{d^{\prime}}}(H)\bigr]\bigr). \quad (9)

Both $f_{\Theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{\prime}}$ and $f_{\Psi}:\mathbb{R}^{d^{\prime}}\rightarrow\mathbb{R}^{d^{\prime\prime}}$ are feature transformation neural networks with learnable parameters $\Theta$ and $\Psi$ respectively. $p_{\Gamma_{i}}$ is the $i$-th polynomial with learnable coefficients $\Gamma_{i}\in\mathbb{R}^{k}$. $f_{\Theta_{i}}(H)\in\mathbb{R}^{n}$ is the $i$-th channel of $f_{\Theta}(H)\in\mathbb{R}^{n\times d^{\prime}}$. We denote $\bm{h}_{i}=f_{\Theta_{i}}(H)$ for simplicity. Then the convolution operation on $\bm{h}_{i}$ in Eq. 9 is

p_{\Gamma_{i}}(S)\bm{h}_{i}=\sum_{j=0}^{k}\Gamma_{i,j}S^{j}\bm{h}_{i}=U\sum_{j=0}^{k}\Gamma_{i,j}\Lambda^{j}U^{\top}\bm{h}_{i} \quad (10)

with the filter $\mathrm{diag}(g_{\Gamma_{i}})=\sum_{j=0}^{k}\Gamma_{i,j}\Lambda^{j}$. We denote $\bm{\lambda}=\bigl(\lambda_{1},\lambda_{2},\dots,\lambda_{n}\bigr)^{\top}\in\mathbb{R}^{n}$. Then,

g_{\Gamma_{i}}=\sum_{j=0}^{k}\Gamma_{i,j}\bm{\lambda}^{j}=\bigl(\bm{\lambda}^{1},\bm{\lambda}^{2},\dots,\bm{\lambda}^{k}\bigr)\times\Gamma_{i}=V\times\Gamma_{i}, \quad (11)

where $V=\bigl(\bm{\lambda}^{1},\bm{\lambda}^{2},\dots,\bm{\lambda}^{k}\bigr)\in\mathbb{R}^{n\times k}$. If $\lambda_{i}\neq\lambda_{j}$ for any $i\neq j$, i.e., the algebraic multiplicity of every eigenvalue is 1, $V$ is a Vandermonde matrix with $\textrm{Rank}(V)=\min(n,k)$. The columns $V_{j}=\bm{\lambda}^{j},j\in[k]$ serve as a set of $k$ bases, and each filter $g_{\Gamma_{i}}$ is a linear combination of the $V_{j}$. Hence, a larger $k$ helps to better approximate the desired filter. When $k=n$, $V$ is a full-rank matrix and $g_{\Gamma_{i}}$ is sufficient to represent any desired filter with a proper assignment of $\Gamma_{i}$. Note that $n$ is much smaller in real-world graph-level tasks than in node-level tasks, making $k=n$ more tractable.
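A simplified, loop-based sketch of the channel-wise filtering in Eq. 9 and Eq. 10 is given below; the released implementation is the authoritative version, and the shapes and names here are illustrative:

```python
import numpy as np

def correlation_free_conv(S, Z, Gamma):
    """Channel-wise polynomial filtering as in Eq. 9/Eq. 10 (a simplified sketch):
    the i-th output channel is p_{Gamma_i}(S) z_i with its own coefficients Gamma_i.

    S     : (n, n)     filter basis matrix
    Z     : (n, d')    transformed features, column i is z_i = f_{Theta_i}(H)
    Gamma : (d', k+1)  per-channel polynomial coefficients (fixed arrays here)
    """
    out = np.zeros_like(Z)
    for i in range(Z.shape[1]):          # an independent filter per channel
        z = Z[:, i].copy()               # S^0 z_i
        acc = np.zeros_like(z)
        for g in Gamma[i]:
            acc += g * z                 # Gamma_{i,j} * S^j z_i
            z = S @ z
        out[:, i] = acc
    return out                           # Eq. 9 then applies f_Psi to the concatenation
```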

By considering the columns of the Vandermonde matrix, i.e., $\bm{\lambda}^{j},j\in[k]$, as bases, we can see that when increasing $k$ (i.e., applying more bases), $\lambda_{i}^{k}$ with $|\lambda_{i}|\ll 1$ diminishes and $\lambda_{i}^{k}$ with $|\lambda_{i}|\gg 1$ diverges. To balance the diminishing and divergence problems when applying a larger $k$, we need to carefully control the range of the spectrum to be close to $1$ or $-1$. General approaches have $\bm{\lambda}\in[0,1]^{n}$ (see Footnote 3). Although there is then no concern of divergence, $\lambda_{i}^{k}$, especially for a small $\lambda_{i}$, tends to 0 when increasing $k$, making the higher-order bases ineffective under practical limited-precision conditions.

Footnote 3: General approaches use the (symmetrically) normalized $A$, i.e., $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ or $\tilde{D}^{-1}\tilde{A}$, to guarantee that the spectrum is bounded by $[-1,1]$ (Kipf & Welling, 2017; Klicpera et al., 2019b), or the (symmetrically) normalized $L$, i.e., $I-\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$, to ensure the bound $[0,2]$ and then rescale it to $[0,1]$ (He et al., 2021).

On the other hand, general approaches rarely learn the coefficients of polynomial filters in a completely free manner (Klicpera et al., 2019b; He et al., 2021). Specially designed coefficients that explicitly modify the spectrum, e.g., Personalized PageRank (PPR) or the heat kernel (Klicpera et al., 2019b), or coefficients learned under constrained conditions, e.g., Chebyshev (Defferrard et al., 2016), Cayley (Levie et al., 2019) or Bernstein (He et al., 2021) polynomials, serve as the practically applicable filters. This is probably because the polynomial filter relies on sophisticated coefficients to maintain spectral properties, and learning them from scratch easily falls into an ill-posed filter (He et al., 2021). However, modifying the filter bases relaxes the requirement on the coefficients, making the filter more suitable for learning coefficients from scratch.

Finally, although the new architecture in Eq. 9 decouples the correlation issue from developing more powerful filters, general filter bases are less qualified for approximating more complex filters. Hence, we still need to explore more effective filter bases to replace existing ones. To this end, we will introduce two different improvements on filter bases in the following sections whose effectiveness will serve as a verification of our analysis.

4.1 Spectral Optimization on Filter Basis

One can directly apply a smoothing function to the spectrum of $S$, which helps to narrow the range of eigenvalues to be close to 1 or -1. There can be various approaches to this end; in this paper, we propose the following eigendecomposition-based method for a symmetric matrix $S=P\Lambda P^{\top}$ (see Footnote 4):

S_{\rho}=P\,\mathrm{diag}(f_{\rho}(\lambda_{i}))\,P^{\top},\quad f_{\rho}(\lambda)=\begin{cases}-(-\lambda)^{\rho}, & \lambda<0\\ \lambda^{\rho}, & \lambda\geq 0,\end{cases} \quad (12)

where $i\in[n]$, $\rho\in(0,1)$, and $\lambda^{\rho}=e^{\rho\ln\lambda}$. $S_{\rho}$ serves as the polynomial basis in Eq. 10. Unlike general spectral approaches, the spectrum of $S$ is not required to be bounded. $S_{\rho}$ can leverage more bases while alleviating both the diminishing and divergence problems by keeping $\rho\cdot k$ in a small range. Therefore, $S_{\rho}$ can be considered a basis-augmentation technique, as shown in Fig. 1.

Footnote 4: Although the computation of $S_{\rho}$ requires eigendecomposition, $S$ is always a symmetric matrix and its eigendecomposition is much faster than that of a general matrix.

Figure 1: Assume $\lambda>0$ and $\rho=0.\dot{3}$.

There can be other transformations on the spectrum, e.g., $P\bigl(\mathrm{Sigmoid}(\Lambda)+\rho\bigr)P^{\top}$, which have a similar effect to $S_{\rho}$. Note that the injectivity of $f_{\rho}$ also influences the approximation ability, which is discussed in more detail in Appendix C.
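A minimal sketch of the transformation in Eq. 12 for a symmetric $S$ (using NumPy's eigendecomposition; the default value of rho is an illustrative assumption):

```python
import numpy as np

def spectral_smooth(S, rho=1/3):
    """Compute S_rho of Eq. 12: a sign-preserving power applied to the spectrum of a
    symmetric S, which pushes eigenvalue magnitudes towards 1 (or -1)."""
    lam, P = np.linalg.eigh(S)                    # S = P diag(lam) P^T
    lam_rho = np.sign(lam) * np.abs(lam) ** rho   # f_rho(lambda) = sign(lambda) |lambda|^rho
    return P @ np.diag(lam_rho) @ P.T

# setting S = A~ (adjacency with self-loops) gives the basis A~_rho used in the experiments
```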

4.2 Generalized Normalization on Filter Basis

Eq. 12 directly operates on the spectrum, which achieves accurate control of the range of the spectrum but requires eigendecomposition. To avoid eigendecomposition, we alternatively study the effects of graph normalization on the spectrum. We generalize the normalized adjacency matrix as follows

\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}=(D+\eta I)^{\epsilon}(A+\eta I)(D+\eta I)^{\epsilon}, \quad (13)

where $\epsilon\in[-0.5,0]$ is the normalization coefficient and $\eta\in[0,1]$ is the shift coefficient. The widely used $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ corresponds to $\epsilon=-0.5$ and $\eta=1$.
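A minimal sketch of the generalized normalization in Eq. 13 (dense NumPy, assuming every $d_i+\eta>0$; the default eps and eta values below are illustrative, matching the settings used in Fig. 2):

```python
import numpy as np

def generalized_norm(A, eps=-0.3, eta=0.0):
    """Generalized normalization of Eq. 13: (D + eta*I)^eps (A + eta*I) (D + eta*I)^eps.
    eps = -0.5, eta = 1 recovers the widely used D~^{-1/2} A~ D~^{-1/2}."""
    n = A.shape[0]
    d = A.sum(axis=1) + eta                  # diagonal of D + eta*I (must be positive)
    D_eps = np.diag(d ** eps)
    return D_eps @ (A + eta * np.eye(n)) @ D_eps
```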

Proposition 4.1.

Let $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{n}$ be the spectrum of $A$ and $\mu_{1}\geq\mu_{2}\geq\dots\geq\mu_{n}$ be the spectrum of $(D+\eta I)^{\epsilon}(A+\eta I)(D+\eta I)^{\epsilon}$. Then, for any $i\in[n]$, we have

(\lambda_{i}+\eta)(d_{\mathrm{max}}+\eta)^{2\epsilon}\leq\mu_{i}\leq(\lambda_{i}+\eta)(d_{\mathrm{min}}+\eta)^{2\epsilon},

where $d_{\mathrm{min}}$ and $d_{\mathrm{max}}$ are the minimum and maximum node degrees in the graph.

We prove Proposition 4.1 in Appendix D. Proposition 4.1 extends the results in (Spielman, 2007), showing that the normalization has a scaling effect on the spectrum: a smaller $\epsilon$ is likely to lead to a smaller $\mu_{i}$, while a larger $\epsilon$ is likely to lead to a larger $\mu_{i}$. When $\epsilon=0$, the upper and lower bounds coincide, with $\mu_{i}=\lambda_{i}+\eta$.

Figure 2: We use the metric $\frac{\lambda_{\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}}}{\lambda_{\tilde{A}}}$ to evaluate the shrinking effect of $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ on the spectrum. We randomly sample 5 graphs from each of the three datasets ZINC, MolPCBA and NCI1. In the first three panels, we use the fixed $\epsilon=-0.3$ on all 5 graphs. In the fourth panel, we use $\epsilon=-0.1,-0.2,-0.3,-0.4,-0.5$ on one graph, which corresponds to the 5 lines from top to bottom. More visualization results on other datasets can be found in Appendix E.

To further investigate the effects of the normalization on the spectrum, we fix $\eta=0$ and empirically evaluate $\epsilon$, as shown in Fig. 2. For a fixed $\epsilon$, $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ shrinks the spectrum of $A$ by different amounts for different eigenvalues. For eigenvalues with small magnitudes (in the middle area of the spectrum), it has a small shrinking effect, while for eigenvalues with large magnitudes, it has a relatively large shrinking effect. Hence, $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ can be used as a spectral smoothing method. Also, different $\epsilon$ result in different shrinking effects, which is consistent with the results in Proposition 4.1. The widely used $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ with spectrum bounded by $[-1,1]$ may not be a good choice because of the diminishing problem. Intuitively, to utilize more bases, we should narrow the range of the spectrum to be close to 1 (or -1) so as to avoid both the diminishing and divergence problems in higher-order bases. This can vary across datasets, and we should carefully balance $\epsilon$ and $k$.

5 Related Work

Many improvements on GNNs can be unified into spectral smoothing operations, e.g., low-pass filtering (Wu et al., 2019; Zhu et al., 2021; Klicpera et al., 2019a, b; Chien et al., 2021; Balcilar et al., 2021), alleviating oversmoothing (Ming Chen et al., 2020; Xu et al., 2018; Liu et al., 2020; Li et al., 2018), graph normalization (Cai et al., 2021), etc. Our analysis of the relation between the correlation issue and the spectrum of the underlying graph matrix provides a unified interpretation of their effectiveness.

ChebyNet (Defferrard et al., 2016), CayleyNet (Levie et al., 2019), APPNP (Klicpera et al., 2019a), SSGC (Zhu & Koniusz, 2020), GPR (Chien et al., 2021), BernNet (He et al., 2021), etc. explore various polynomial filters and use the normalized adjacency or Laplacian matrix as the basis. We improve the approximation ability of polynomial filters by altering the spectrum of the filter bases. The resulting matrices allow leveraging more bases to approximate more sophisticated filters and are more suitable for learning coefficients from scratch.

We note that the concurrent work (Jin et al., 2022) has also pointed out the overcorrelation issue in the infinite-depth case, without further discussion of the reason (e.g., the graph's spectrum) behind this phenomenon. In contrast, we show that correlation is inherently caused by the unsmooth spectrum of the underlying graph filter, and we also quantify this effect with spectral smoothness. This allows analyzing the correlation across all layers instead of only at the theoretical infinite depth.

6 Experiments

We conduct experiments on TUDatasets (Yanardag & Vishwanathan, 2015; Kersting et al., 2016) and OGB (Hu et al., 2020), which involve graph classification tasks, and on ZINC (Dwivedi et al., 2020), which involves a graph regression task. We evaluate the effects of our proposed graph convolution architecture and the two filter bases.

6.1 Results

Table 2: Results on TUDatasets. Higher is better.
dataset NCI1 NCI109 ENZYMES PTC_MR
GK 62.49±0.27 62.35±0.3 32.70±1.20 55.65±0.5
RW - - 24.16±1.64 55.91±0.3
PK 82.54±0.5 - - 59.5±2.4
FGSD 79.80 78.84 - 62.8
AWE - - 35.77±5.93 -
DGCNN 74.44±0.47 - 51.0±7.29 58.59±2.5
PSCN 74.44±0.5 - - 62.29±5.7
DCNN 56.61±1.04 - - -
ECC 76.82 75.03 45.67 -
DGK 80.31±0.46 80.32±0.3 53.43±0.91 60.08±2.6
GraphSage 76.0±1.8 - 58.2±6.0 -
CapsGNN 78.35±1.55 - 54.67±5.67 -
DiffPool 76.9±1.9 - 62.53 -
GIN 82.7±1.7 - - 64.6±7.0
k-GNN 76.2 - - 60.9
Spec-GN 84.79±1.63 83.62±0.75 72.50±5.79 68.05±6.41
Norm-GN 84.87±1.68 83.50±1.27 73.33±7.96 67.76±4.52
Figure 3: A visualization of the learned filters on ZINC. We test three bases, randomly sampling 9 filters for each basis. Dots represent the eigenvalues of each basis. More visualization results on other datasets can be found in Appendix F.

Settings. We use the default dataset splits for OGB and ZINC. For TUDatasets, we follow the standard 10-fold cross-validation protocol and splits from (Zhang et al., 2018) and report our results following the protocol described in (Xu et al., 2019; Ying et al., 2018). Following all baselines on the ZINC leaderboard, we keep the number of parameters around 500K. The baseline models include: GK (Shervashidze et al., 2009), RW (Vishwanathan et al., 2010), PK (Neumann et al., 2016), FGSD (Verma & Zhang, 2017), AWE (Ivanov & Burnaev, 2018), DGCNN (Zhang et al., 2018), PSCN (Niepert et al., 2016), DCNN (Atwood & Towsley, 2016), ECC (Simonovsky & Komodakis, 2017), DGK (Yanardag & Vishwanathan, 2015), CapsGNN (Xinyi & Chen, 2019), DiffPool (Ying et al., 2018), GIN (Xu et al., 2019), k-GNN (Morris et al., 2019), GraphSage (Hamilton et al., 2017), GAT (Veličković et al., 2018), GatedGCN-PE (Bresson & Laurent, 2017), MPNN (sum) (Gilmer et al., 2017), DeeperG (Li et al., 2020), PNA (Corso et al., 2020), DGN (Beani et al., 2021), GSN (Bouritsas et al., 2020), GINE-VN (Brossard et al., 2020), GINE-APPNP (Brossard et al., 2020), PHC-GNN (Le et al., 2021), SAN (Kreuzer et al., 2021), and Graphormer (Ying et al., 2021). Spec-GN denotes the proposed graph convolution in Eq. 9 with the filter basis smoothed by the spectral transformation in Eq. 12. Norm-GN denotes the proposed graph convolution in Eq. 9 with the filter basis smoothed by the graph normalization in Eq. 13.

Table 3: Results on ZINC (Lower is better) and MolPCBA (Higher is better).
method ZINC MAE MolPCBA AP
GCN 0.367±0.011 (505k) 24.24±0.34 (2.02m)
GIN 0.526±0.051 (510k) 27.03±0.23 (3.37m)
GAT 0.384±0.007 (531k) -
GraphSage 0.398±0.002 (505k) -
GatedGCN-PE 0.214±0.006 (505k) -
MPNN 0.145±0.007 (481k) -
DeeperG - 28.42±0.43 (5.55m)
PNA 0.142±0.010 (387k) 28.38±0.35 (6.55m)
DGN 0.168±0.003 (NA) 28.85±0.30 (6.73m)
GSN 0.101±0.010 (523k) -
GINE-VN - 29.17±0.15 (6.15m)
GINE-APPNP - 29.79±0.30 (6.15m)
PHC-GNN - 29.47±0.26 (1.69m)
SAN 0.139±0.006 (509k) -
Graphormer 0.122±0.006 (489k) -
Spec-GN 0.0698±0.002 (503k) 29.65±0.28 (1.74m)
Norm-GN 0.0709±0.002 (500k) 29.51±0.33 (1.74m)

Results. Tab. 2 and Tab. 3 summarize the performance of our approaches compared with baselines on TUDatasets, ZINC and MolPCBA. For TUDatasets, we report the results of each model from its original paper by default. When the results are not given in the original paper, we report the best testing results given in (Zhang et al., 2018; Ivanov & Burnaev, 2018; Xinyi & Chen, 2019). For ZINC and MolPCBA, we report the results from their public leaderboards. TUDatasets consists of small-scale datasets: NCI1 and NCI109 contain around 4K graphs, and ENZYMES and PTC_MR contain under 1K graphs. General GNNs easily overfit on these small-scale data, and we can see that some traditional kernel-based methods even obtain better performance. However, Spec-GN and Norm-GN achieve higher classification accuracies by a large margin on these datasets. The results on TUDatasets show that although Spec-GN and Norm-GN realize more expressive filters, this does not lead to overfitting when learning graph representations. Recently, Transformer-based models have become quite popular for learning graph representations and have significantly improved the results on large-scale molecular datasets. On ZINC, Spec-GN and Norm-GN outperform these Transformer-based models by a large margin, and on MolPCBA they are also competitive with SOTA results.

Table 4: Ablation study results on ZINC with different settings.
Architecture  Basis                                                        test MAE          valid MAE
shd           $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$  0.1415±0.00748    0.1568±0.00729
shd           $\tilde{D}^{-1}\tilde{A}$                                    0.1439±0.00900    0.1569±0.00739
shd           $\tilde{A}_{\rho}$                                           0.1061±0.01018    0.1294±0.01454
shd           $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$          0.1133±0.01711    0.1316±0.02057
idp           $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$  0.0944±0.00379    0.1100±0.00787
idp           $\tilde{D}^{-1}\tilde{A}$                                    0.0982±0.00417    0.1172±0.00666
idp           $\tilde{A}_{\rho}$                                           0.0698±0.00200    0.0884±0.00319
idp           $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$          0.0709±0.00176    0.0929±0.00445

6.2 Ablation Studies

We perform ablation studies on the proposed architecture and the filter bases $\tilde{A}_{\rho}$ (obtained by setting $S=\tilde{A}$ in Eq. 12) and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ on ZINC. We use "idp" and "shd" to respectively denote the correlation-free architecture (also called the independent filter architecture) in Eq. 9 and the general shared filter architecture in Eq. 6. Both architectures learn the filter coefficients from scratch.

Correlation-free architecture and different filter bases. In Fig. 3, we visualize the filters learned by the correlation-free architecture on three bases, i.e., $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$, $\tilde{A}_{\rho}$ and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$. The visualizations show that each channel indeed learns a different filter on all three bases. $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ has a spectrum bounded by $[-1,1]$ and shifted slightly towards $1$ due to the added self-loops; on this basis, the filters learn a similar response over the whole range, which corresponds to different frequencies in the frequency domain. $\tilde{A}_{\rho}$ and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ have spectra close to $1$ or $-1$, and the filters learn diverse responses in these areas, which corresponds to more complex patterns at different frequencies. Tab. 4 shows that the correlation-free architecture always outperforms the shared filter by a large margin on all tested bases. Both $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}^{-1}\tilde{A}$ have spectra bounded by $[-1,1]$, and they show similar performance. $\tilde{A}_{\rho}$ and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ narrow the range of the spectrum towards $1$ or $-1$ through completely different strategies, but they achieve similar performance that is much better than $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}^{-1}\tilde{A}$. This validates our analysis of the filter basis. Meanwhile, $\tilde{A}_{\rho}$ achieves more accurate control of the spectrum, and correspondingly it slightly outperforms $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$.

Figure 4: Ablation study results on ZINC with different numbers of bases $k$.

Do more bases gain improvements? In Fig. 4, we systematically evaluate the effect of the number of bases on learning graph representations, including $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and our $\tilde{A}_{\rho}$ with $\rho=1/3,1/6,1/7,1/8$. The shared filter case, i.e., shd+$\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$, cannot effectively leverage more bases (a larger $k$), as the MAE stops decreasing at $0.150$, which is also reported by several baselines in Tab. 3. In contrast, both correlation-free cases, idp+$\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and idp+$\tilde{A}_{\rho}$, outperform the shared filter case by a large margin and continuously gain improvements when increasing $k$. The MAE of idp+$\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ stops decreasing at a test MAE close to 0.09 and a valid MAE close to 0.11. By replacing $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ with $\tilde{A}_{\rho}$, the best test MAE is below 0.07 and the best valid MAE is close to 0.088. The bases in $\tilde{A}_{\rho}$ are controlled by both $\rho$ and $k$. We use the tuple $(\rho,k)$ to denote a combination of $\rho$ and $k$. For fixed $\rho$, the curves corresponding to $\rho=1/3$ and $\rho=1/6$ show that increasing $k$ gains improvements. For a fixed upper bound $\rho\times k=1$, $(1/6,6)$ involves 3 more bases than $(1/3,3)$ and outperforms $(1/3,3)$. The same result is reflected in the comparison of $(1/6,12)$ and $(1/3,6)$. For the comparison of $(1/6,18)$ and $(1/3,9)$, both settings achieve the lowest MAE and the difference is less obvious.

Figure 5: Ablation study results on ZINC with different numbers of layers.

The effects of model depth. Fig. 5 shows the performance comparison between the correlation-free and shared filter architectures as the depth increases. Each architecture is tested with the default basis $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and our proposed $\tilde{A}_{\rho}$. We set the same number of bases in all resulting models, and each model is tested with the number of layers (depth) in $\{5,10,15,20,25\}$. The results show that the correlation-free architecture preserves its performance as the depth increases. The shared filter cases perform quite unstably and drop dramatically when the depth exceeds 20. Also, across all depths, the correlation-free architecture almost always outperforms the shared filter and has low variance among different runs. In Appendix G, we also test the cosine similarities at different layers of a deep model.

Stability. We also find that the correlation-free architecture is more stable across different runs than the shared filter case, as reflected in the standard deviations in Tab. 4. This is probably because different channels may carry different patterns, which causes interference among them in the shared filter case, whereas the correlation-free architecture avoids this problem. Also, the results of $\tilde{A}_{\rho}$ and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ are more stable than those of $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}^{-1}\tilde{A}$ across runs. For $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}^{-1}\tilde{A}$, the difference between the best and the worst runs can be more than 0.02, while for $\tilde{A}_{\rho}$ and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ this difference is less than 0.01. More results are given in Appendix H. The instability of $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}^{-1}\tilde{A}$ is probably because learning filter coefficients from scratch without any constraints makes it difficult to maintain spectral properties, which easily leads to an ill-posed filter (He et al., 2021). In contrast, $\tilde{A}_{\rho}$ and $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$, which inherently have smoother spectra, alleviate this problem and are more appropriate for learning coefficients from scratch.

7 Conclusion

We study the effects of the spectrum in GNNs. We show that in existing architectures the unsmooth spectrum results in the correlation issue, which acts as the obstacle to developing deep models as well as applying more powerful graph filters. Based on this observation, we propose the correlation-free architecture, which decouples the correlation issue from the filter design. We then show that the spectral characteristics also hinder the approximation abilities of polynomial filters and address this by altering the graph's spectrum. Our extensive experiments show the significant performance gain of the correlation-free architecture with powerful filters.

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China (no. 2021ZD0112400), and also in part by the National Natural Science Foundation of China under grants U1811463 and 62072069.

References

  • Atwood & Towsley (2016) Atwood, J. and Towsley, D. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001, 2016.
  • Balcilar et al. (2021) Balcilar, M., Renton, G., Héroux, P., Gaüzère, B., Adam, S., and Honeine, P. Analyzing the expressive power of graph neural networks in a spectral perspective. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=-qh0M9XWxnv.
  • Beani et al. (2021) Beani, D., Passaro, S., Létourneau, V., Hamilton, W., Corso, G., and Liò, P. Directional graph networks. In International Conference on Machine Learning, pp. 748–758. PMLR, 2021.
  • Bo et al. (2021) Bo, D., Wang, X., Shi, C., and Shen, H. Beyond low-frequency information in graph convolutional networks. In AAAI. AAAI Press, 2021.
  • Bouritsas et al. (2020) Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. arXiv preprint arXiv:2006.09252, 2020.
  • Bresson & Laurent (2017) Bresson, X. and Laurent, T. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.
  • Brossard et al. (2020) Brossard, R., Frigo, O., and Dehaene, D. Graph convolutions that can finally model local structure. arXiv preprint arXiv:2011.15069, 2020.
  • Cai et al. (2021) Cai, T., Luo, S., Xu, K., He, D., Liu, T.-Y., and Wang, L. Graphnorm: A principled approach to accelerating graph neural network training. In 2021 International Conference on Machine Learning, May 2021.
  • Chien et al. (2021) Chien, E., Peng, J., Li, P., and Milenkovic, O. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=n6jl7fLxrP.
  • Corso et al. (2020) Corso, G., Cavalleri, L., Beaini, D., Liò, P., and Veličković, P. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems, 2020.
  • Defferrard et al. (2016) Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852, 2016.
  • Dwivedi et al. (2020) Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
  • Gao et al. (2021) Gao, X., Dai, W., Li, C., Zou, J., Xiong, H., and Frossard, P. Message passing in graph convolution networks via adaptive filter banks. arXiv preprint arXiv:2106.09910, 2021.
  • Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  1263–1272. JMLR. org, 2017.
  • Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
  • Hammond et al. (2011) Hammond, D. K., Vandergheynst, P., and Gribonval, R. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
  • He et al. (2021) He, M., Wei, Z., Huang, Z., and Xu, H. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. In NeurIPS, 2021.
  • Hu et al. (2020) Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  • Huang et al. (2020) Huang, W., Rong, Y., Xu, T., Sun, F., and Huang, J. Tackling over-smoothing for general graph convolutional networks. arXiv preprint arXiv:2008.09864, 2020.
  • Ivanov & Burnaev (2018) Ivanov, S. and Burnaev, E. Anonymous walk embeddings. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  2191–2200, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/ivanov18a.html.
  • Jin et al. (2022) Jin, W., Liu, X., Ma, Y., Aggarwal, C., and Tang, J. Towards feature overcorrelation in deeper graph neural networks, 2022. URL https://openreview.net/forum?id=Mi9xQBeZxY5.
  • Kersting et al. (2016) Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016. http://graphkernels.cs.tu-dortmund.de.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Klicpera et al. (2019a) Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), 2019a.
  • Klicpera et al. (2019b) Klicpera, J., Weißenberger, S., and Günnemann, S. Diffusion improves graph learning. Advances in Neural Information Processing Systems, 32:13354–13366, 2019b.
  • Kreuzer et al. (2021) Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. arXiv preprint arXiv:2106.03893, 2021.
  • Le et al. (2021) Le, T., Bertolini, M., Noé, F., and Clevert, D.-A. Parameterized hypercomplex graph neural networks for graph classification. arXiv preprint arXiv:2103.16584, 2021.
  • Levie et al. (2019) Levie, R., Monti, F., Bresson, X., and Bronstein, M. M. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97–109, 2019. doi: 10.1109/TSP.2018.2879624.
  • Li et al. (2020) Li, G., Xiong, C., Thabet, A., and Ghanem, B. Deepergcn: All you need to train deeper gcns. arXiv preprint arXiv:2006.07739, 2020.
  • Li et al. (2018) Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Liu et al. (2020) Liu, M., Gao, H., and Ji, S. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  338–348, 2020.
  • Ming Chen et al. (2020) Ming Chen, Z. W., Zengfeng Huang, B. D., and Li, Y. Simple and deep graph convolutional networks. 2020.
  • Morris et al. (2019) Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  4602–4609, 2019.
  • Muhammet et al. (2020) Muhammet, B., Guillaume, R., Pierre, H., Benoit, G., Sébastien, A., and Honeine, P. When spectral domain meets spatial domain in graph neural networks. In Thirty-seventh International Conference on Machine Learning (ICML 2020)-Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.
  • Murphy et al. (2019) Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJluy2RcFm.
  • Neumann et al. (2016) Neumann, M., Garnett, R., Bauckhage, C., and Kersting, K. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.
  • Niepert et al. (2016) Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023, 2016.
  • NT & Maehara (2019) NT, H. and Maehara, T. Revisiting graph neural networks: All we have is low-pass filters, 2019.
  • Oono & Suzuki (2020) Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1ldO2EFPr.
  • Rong et al. (2019) Rong, Y., Huang, W., Xu, T., and Huang, J. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2019.
  • Shervashidze et al. (2009) Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp.  488–495, 2009.
  • Shuman et al. (2013) Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A., and Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine, 30(3):83–98, 2013.
  • Simonovsky & Komodakis (2017) Simonovsky, M. and Komodakis, N. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3693–3702, 2017.
  • Spielman (2007) Spielman, D. A. Spectral graph theory and its applications. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp.  29–38, 2007. doi: 10.1109/FOCS.2007.56.
  • Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph Attention Networks. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
  • Verma & Zhang (2017) Verma, S. and Zhang, Z.-L. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pp. 88–98, 2017.
  • Vishwanathan et al. (2010) Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
  • Wu et al. (2019) Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pp.  6861–6871. PMLR, 2019.
  • Xinyi & Chen (2019) Xinyi, Z. and Chen, L. Capsule graph neural network. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byl8BnRcYm.
  • Xu et al. (2018) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. PMLR, 2018.
  • Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Yanardag & Vishwanathan (2015) Yanardag, P. and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.  1365–1374. ACM, 2015.
  • Ying et al. (2021) Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform bad for graph representation? arXiv preprint arXiv:2106.05234, 2021.
  • Ying et al. (2018) Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810, 2018.
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/f22e4747da1aa27e363d86d40ff442fe-Paper.pdf.
  • Zhang et al. (2018) Zhang, M., Cui, Z., Neumann, M., and Chen, Y. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Zhang et al. (2021) Zhang, S., Liu, L., Gao, S., He, D., Fang, X., Li, W., Huang, Z., Su, W., and Wang, W. Litegem: Lite geometry enhanced molecular representation learning for quantum property prediction, 2021.
  • Zhao & Akoglu (2020) Zhao, L. and Akoglu, L. Pairnorm: Tackling oversmoothing in gnns. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkecl1rtwB.
  • Zhu & Koniusz (2020) Zhu, H. and Koniusz, P. Simple spectral graph convolution. In International Conference on Learning Representations, 2020.
  • Zhu et al. (2021) Zhu, M., Wang, X., Shi, C., Ji, H., and Cui, P. Interpreting and unifying graph neural networks with an optimization framework. In Proceedings of the Web Conference 2021, pp.  1215–1226, 2021.

Appendix A Derivations of Eq. 7 and Eq. 8

Since $\mathcal{S}\in\mathbb{R}^{n\times n}$ is a symmetric matrix, assume the eigendecomposition $\mathcal{S}=P\Lambda P^{\top}$ with $P=(\bm{p}_{1},\bm{p}_{2},\dots,\bm{p}_{n})$ and $\|\bm{p}_{i}\|=1$, $i\in[n]$.

\begin{align*}
\cos\bigl(\langle\bm{h},\bm{p}_{i}\rangle\bigr)
&=\frac{\bm{h}^{\top}\bm{p}_{i}}{\|\bm{h}\|\,\|\bm{p}_{i}\|}
=\frac{\bm{h}^{\top}\bm{p}_{i}}{\|\bm{h}\|}
=\frac{\bm{h}^{\top}\bm{p}_{i}}{\sqrt{\bm{h}^{\top}\bm{h}}}
=\frac{\bm{h}^{\top}\bm{p}_{i}}{\sqrt{(P^{\top}\bm{h})^{\top}P^{\top}\bm{h}}} \\
&=\frac{\bm{h}^{\top}\bm{p}_{i}}{\sqrt{\sum^{n}_{j=1}\bigl(\bm{p}_{j}^{\top}\bm{h}\bigr)^{2}}}
=\frac{\bm{h}^{\top}\bm{p}_{i}}{\sqrt{\sum^{n}_{j=1}\bigl(\bm{h}^{\top}\bm{p}_{j}\bigr)^{2}}}
=\frac{\alpha_{i}}{\sqrt{\sum^{n}_{j=1}\alpha_{j}^{2}}}.
\end{align*}
\begin{align*}
\cos\bigl(\langle\mathcal{S}\bm{h},\bm{p}_{i}\rangle\bigr)
&=\frac{(\mathcal{S}\bm{h})^{\top}\bm{p}_{i}}{\|\mathcal{S}\bm{h}\|\,\|\bm{p}_{i}\|}
=\frac{(\mathcal{S}\bm{h})^{\top}\bm{p}_{i}}{\sqrt{(\mathcal{S}\bm{h})^{\top}\mathcal{S}\bm{h}}}
=\frac{\bigl(P\Lambda(P^{\top}\bm{h})\bigr)^{\top}\bm{p}_{i}}{\sqrt{\bigl(P\Lambda(P^{\top}\bm{h})\bigr)^{\top}\bigl(P\Lambda(P^{\top}\bm{h})\bigr)}} \\
&=\frac{(P^{\top}\bm{h})^{\top}\Lambda P^{\top}\bm{p}_{i}}{\sqrt{(P^{\top}\bm{h})^{\top}\Lambda^{2}(P^{\top}\bm{h})}}
=\frac{\lambda_{i}\,\bm{p}_{i}^{\top}\bm{h}}{\sqrt{\sum^{n}_{j=1}\bigl(\bm{p}_{j}^{\top}\bm{h}\bigr)^{2}\lambda_{j}^{2}}}
=\frac{\alpha_{i}\lambda_{i}}{\sqrt{\sum^{n}_{j=1}\alpha_{j}^{2}\lambda_{j}^{2}}},
\end{align*}
where the second-to-last equality uses $P^{\top}\bm{p}_{i}=\bm{e}_{i}$, the $i$-th standard basis vector.
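The two identities above can also be checked numerically. Below is a minimal NumPy sketch (our illustration, not part of the released code); the variable names lam and alpha are our own and mirror $\lambda_{i}$ and $\alpha_{i}=\bm{p}_{i}^{\top}\bm{h}$ in the derivation.

import numpy as np

rng = np.random.default_rng(0)
n = 6
S = rng.standard_normal((n, n))
S = (S + S.T) / 2                      # a random symmetric S
lam, P = np.linalg.eigh(S)             # S = P diag(lam) P^T, columns of P are p_i
h = rng.standard_normal(n)
alpha = P.T @ h                        # alpha_i = p_i^T h

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i = 2
# Eq. 7: cos(<h, p_i>) = alpha_i / sqrt(sum_j alpha_j^2)
assert np.isclose(cosine(h, P[:, i]), alpha[i] / np.sqrt(np.sum(alpha**2)))
# Eq. 8: cos(<S h, p_i>) = alpha_i lam_i / sqrt(sum_j alpha_j^2 lam_j^2)
assert np.isclose(cosine(S @ h, P[:, i]),
                  alpha[i] * lam[i] / np.sqrt(np.sum(alpha**2 * lam**2)))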

Appendix B Proof of Proposition 3.1

Proof.

(i) Since $\mathcal{S}^{k}=P\Lambda^{k}P^{\top}$, applying Eq. 8 with $\mathcal{S}^{k}$ in place of $\mathcal{S}$ gives, for every $k=0,1,2,\dots$,

\begin{align*}
|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|
&=\frac{|\alpha_{1}\lambda_{1}^{k}|}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2k}}}
=\frac{|\lambda_{1}|}{|\lambda_{1}|}\cdot\frac{|\alpha_{1}\lambda_{1}^{k}|}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2k}}}
=\frac{|\alpha_{1}\lambda_{1}^{k+1}|}{\sqrt{\lambda_{1}^{2}\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2k}}} \\
&\leq\frac{|\alpha_{1}\lambda_{1}^{k+1}|}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2(k+1)}}}
=|\cos(\langle\mathcal{S}^{k+1}\bm{h},\bm{p}_{1}\rangle)|,
\end{align*}
where the inequality holds because $\lambda_{1}^{2}\lambda_{i}^{2k}\geq\lambda_{i}^{2(k+1)}$ for every $i$, since $|\lambda_{1}|\geq|\lambda_{i}|$.

Similarly, we can prove that $|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{n}\rangle)|\geq|\cos(\langle\mathcal{S}^{k+1}\bm{h},\bm{p}_{n}\rangle)|$.

(ii) Since $|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|$ increases monotonically with respect to $k$ and is bounded above by 1, it must converge.

\begin{align*}
\lim_{k\rightarrow\infty}|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|
&=\lim_{k\rightarrow\infty}\frac{|\alpha_{1}\lambda_{1}^{k}|}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2k}}}
=\lim_{k\rightarrow\infty}\frac{|\alpha_{1}|}{\sqrt{\alpha_{1}^{2}+\sum^{n}_{i=2}\alpha_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}}}
=\frac{|\alpha_{1}|}{\sqrt{\alpha_{1}^{2}+\lim\limits_{k\rightarrow\infty}\sum^{n}_{i=2}\alpha_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}}}.
\end{align*}

Since $|\lambda_{1}|>|\lambda_{2}|\geq\dots\geq|\lambda_{n}|$, we have $\lim_{k\rightarrow\infty}\sum^{n}_{i=2}\alpha_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}=0$, and the convergence speed is determined by $|\frac{\lambda_{2}}{\lambda_{1}}|$. Therefore $\lim_{k\rightarrow\infty}|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|=1$.

Next, for two signals $\bm{h}$ and $\bm{h}^{\prime}$ with $\bm{\alpha}=P^{\top}\bm{h}$ and $\bm{\beta}=P^{\top}\bm{h}^{\prime}$,
\begin{align*}
\cos\bigl(\langle\mathcal{S}\bm{h},\mathcal{S}\bm{h}^{\prime}\rangle\bigr)
&=\frac{(\mathcal{S}\bm{h})^{\top}\mathcal{S}\bm{h}^{\prime}}{\|\mathcal{S}\bm{h}\|\,\|\mathcal{S}\bm{h}^{\prime}\|}
=\frac{(\mathcal{S}\bm{h})^{\top}\mathcal{S}\bm{h}^{\prime}}{\sqrt{(\mathcal{S}\bm{h})^{\top}\mathcal{S}\bm{h}}\sqrt{(\mathcal{S}\bm{h}^{\prime})^{\top}\mathcal{S}\bm{h}^{\prime}}} \\
&=\frac{\bigl(P\Lambda(P^{\top}\bm{h})\bigr)^{\top}P\Lambda(P^{\top}\bm{h}^{\prime})}{\sqrt{\bigl(P\Lambda(P^{\top}\bm{h})\bigr)^{\top}\bigl(P\Lambda(P^{\top}\bm{h})\bigr)}\sqrt{\bigl(P\Lambda(P^{\top}\bm{h}^{\prime})\bigr)^{\top}\bigl(P\Lambda(P^{\top}\bm{h}^{\prime})\bigr)}} \\
&=\frac{(P^{\top}\bm{h})^{\top}\Lambda^{2}P^{\top}\bm{h}^{\prime}}{\sqrt{(P^{\top}\bm{h})^{\top}\Lambda^{2}(P^{\top}\bm{h})}\sqrt{(P^{\top}\bm{h}^{\prime})^{\top}\Lambda^{2}(P^{\top}\bm{h}^{\prime})}}
=\frac{\bm{\alpha}^{\top}\Lambda^{2}\bm{\beta}}{\sqrt{\bm{\alpha}^{\top}\Lambda^{2}\bm{\alpha}}\sqrt{\bm{\beta}^{\top}\Lambda^{2}\bm{\beta}}}
=\frac{\sum^{n}_{i=1}\alpha_{i}\beta_{i}\lambda_{i}^{2}}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2}}\sqrt{\sum^{n}_{i=1}\beta_{i}^{2}\lambda_{i}^{2}}}.
\end{align*}

Then,

\begin{align*}
\lim_{k\rightarrow\infty}|\cos(\langle\mathcal{S}^{k}\bm{h},\mathcal{S}^{k}\bm{h}^{\prime}\rangle)|
&=\lim_{k\rightarrow\infty}\frac{\bigl|\sum^{n}_{i=1}\alpha_{i}\beta_{i}\lambda_{i}^{2k}\bigr|}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\lambda_{i}^{2k}}\sqrt{\sum^{n}_{i=1}\beta_{i}^{2}\lambda_{i}^{2k}}} \\
&=\lim_{k\rightarrow\infty}\frac{\bigl|\sum^{n}_{i=1}\alpha_{i}\beta_{i}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}\bigr|}{\sqrt{\sum^{n}_{i=1}\alpha_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}}\sqrt{\sum^{n}_{i=1}\beta_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}}} \\
&=\lim_{k\rightarrow\infty}\frac{\bigl|\alpha_{1}\beta_{1}+\sum^{n}_{i=2}\alpha_{i}\beta_{i}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}\bigr|}{\sqrt{\alpha_{1}^{2}+\sum^{n}_{i=2}\alpha_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}}\sqrt{\beta_{1}^{2}+\sum^{n}_{i=2}\beta_{i}^{2}\bigl(\frac{\lambda_{i}}{\lambda_{1}}\bigr)^{2k}}} \\
&=\frac{|\alpha_{1}\beta_{1}|}{\sqrt{\alpha_{1}^{2}}\sqrt{\beta_{1}^{2}}}
=1.
\end{align*}
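As a numerical sanity check of Proposition 3.1, the following NumPy sketch (our own illustration, not from the released code; it assumes a strictly dominant eigenvalue, which holds generically for a random draw) traces $|\cos(\langle\mathcal{S}^{k}\bm{h},\bm{p}_{1}\rangle)|$ and the pairwise cosine as $k$ grows.

import numpy as np

rng = np.random.default_rng(1)
n = 8
S = rng.standard_normal((n, n)); S = (S + S.T) / 2
lam, P = np.linalg.eigh(S)
p1 = P[:, np.argmax(np.abs(lam))]          # eigenvector of the dominant eigenvalue lambda_1
h, h2 = rng.standard_normal(n), rng.standard_normal(n)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

prev = abs(cosine(h, p1))
for k in range(60):
    h, h2 = S @ h, S @ h2                  # one more propagation step: S^k h, S^k h'
    c1 = abs(cosine(h, p1))
    assert c1 >= prev - 1e-12              # part (i): monotone increase towards p_1
    prev = c1
print(abs(cosine(h, p1)), abs(cosine(h, h2)))   # part (ii): both values approach 1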

Appendix C More Discussions of Spectral Optimization on Filter Basis

We use $\textrm{E}_{(S,\lambda)}$ to denote the eigenspace of $S$ associated with $\lambda$, i.e., $\textrm{E}_{(S,\lambda)}=\{\bm{v}:(S-\lambda I)\bm{v}=\bm{0}\}$.

Proposition C.1.

Given a symmetric matrix $S\in\mathbb{R}^{n\times n}$ with $S=P\Lambda P^{\top}$, where $\Lambda=\textrm{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{n})$ and $P$ can be any eigenbasis of $S$, let $S_{\phi}=P\phi(\Lambda)P^{\top}$, where $\phi(\cdot)$ is an entry-wise function applied to $\Lambda$. Then we have:
(i) $\textrm{E}_{(S,\lambda_{i})}\subseteq\textrm{E}_{(S_{\phi},\phi(\lambda_{i}))}$ for all $i\in[n]$;
(ii) if $\phi(\cdot)$ is injective, $\textrm{E}_{(S,\lambda_{i})}=\textrm{E}_{(S_{\phi},\phi(\lambda_{i}))}$ and $\mathcal{F}_{\phi}(S)=P\phi(\Lambda)P^{\top}$ is injective.

Proof.

Let $P=(\bm{p}_{1},\bm{p}_{2},\dots,\bm{p}_{n})$. $S=P\Lambda P^{\top}$ is equivalent to $S\bm{p}_{i}=\lambda_{i}\bm{p}_{i}$, $i\in[n]$. Since $S$ is symmetric, for any $i\in[n]$ the geometric multiplicity of $\lambda_{i}$ equals its algebraic multiplicity, and $\textrm{E}_{(S,\lambda_{i})}=\textrm{Span}(\{\bm{p}_{k}\,|\,\lambda_{k}=\lambda_{i},k\in[n]\})$. Likewise, $S_{\phi}=P\phi(\Lambda)P^{\top}$ gives $S_{\phi}\bm{p}_{i}=\phi(\lambda_{i})\bm{p}_{i}$, $i\in[n]$, so $\textrm{E}_{(S_{\phi},\phi(\lambda_{i}))}=\textrm{Span}(\{\bm{p}_{k}\,|\,\phi(\lambda_{k})=\phi(\lambda_{i}),k\in[n]\})$. Note that $\{\bm{p}_{k}\,|\,\lambda_{k}=\lambda_{i},k\in[n]\}\subseteq\{\bm{p}_{k}\,|\,\phi(\lambda_{k})=\phi(\lambda_{i}),k\in[n]\}$ for any $i\in[n]$, hence $\textrm{Span}(\{\bm{p}_{k}\,|\,\lambda_{k}=\lambda_{i},k\in[n]\})\subseteq\textrm{Span}(\{\bm{p}_{k}\,|\,\phi(\lambda_{k})=\phi(\lambda_{i}),k\in[n]\})$. As a result, $\textrm{E}_{(S,\lambda_{i})}\subseteq\textrm{E}_{(S_{\phi},\phi(\lambda_{i}))}$ for any $i\in[n]$.

If $\phi(\cdot)$ is injective, $\{\bm{p}_{k}\,|\,\lambda_{k}=\lambda_{i},k\in[n]\}=\{\bm{p}_{k}\,|\,\phi(\lambda_{k})=\phi(\lambda_{i}),k\in[n]\}$ for any $i\in[n]$. Thus $\textrm{E}_{(S,\lambda_{i})}=\textrm{E}_{(S_{\phi},\phi(\lambda_{i}))}$.

We use $\sigma(S)$ to denote the set of all eigenvalues of $S$ counted with multiplicity (also known as the spectrum of $S$). Let $S=P\Lambda_{1}P^{\top}$ and $B=Q\Lambda_{2}Q^{\top}$. To show that $\mathcal{F}_{\phi}$ is injective, suppose $S\neq B$; we prove $S_{\phi}=\mathcal{F}_{\phi}(S)\neq B_{\phi}=\mathcal{F}_{\phi}(B)$ by considering two cases.

Case 1: σ(S)σ(B)\sigma(S)\neq\sigma(B)

Since $\phi(\cdot)$ is injective, $\sigma(S_{\phi})\neq\sigma(B_{\phi})$, so the characteristic polynomials of $S_{\phi}$ and $B_{\phi}$ differ. Therefore, $S_{\phi}\neq B_{\phi}$.

Case 2: σ(S)=σ(B)\sigma(S)=\sigma(B)

Then $\Lambda_{1}=\Lambda_{2}=\Lambda$. We prove the contrapositive: $S_{\phi}=B_{\phi}\Rightarrow S=B$. If $S_{\phi}=B_{\phi}$, then $P\phi(\Lambda)P^{\top}=Q\phi(\Lambda)Q^{\top}$. For any $\lambda_{i}$ with geometric multiplicity $k$, we can find the corresponding eigenvectors $\bm{p}_{1},\bm{p}_{2},\dots,\bm{p}_{k}$ from $P\phi(\Lambda)P^{\top}$ and, similarly, the corresponding eigenvectors $\bm{q}_{1},\bm{q}_{2},\dots,\bm{q}_{k}$ from $Q\phi(\Lambda)Q^{\top}$. Since the eigendecomposition is unique in terms of eigenspaces, $\textrm{E}_{(S_{\phi},\phi(\lambda_{i}))}=\textrm{Span}(\bm{p}_{1},\bm{p}_{2},\dots,\bm{p}_{k})=\textrm{Span}(\bm{q}_{1},\bm{q}_{2},\dots,\bm{q}_{k})=\textrm{E}_{(B_{\phi},\phi(\lambda_{i}))}$. By Proposition C.1 (ii), $\textrm{E}_{(S,\lambda_{i})}=\textrm{E}_{(B,\lambda_{i})}$ for any $\lambda_{i}$. Consequently, $S=P\Lambda P^{\top}=Q\Lambda Q^{\top}=B$.

Proposition C.1 shows that the eigenspaces of $S_{\phi}$ contain the corresponding eigenspaces of $S$. Therefore, $S_{\phi}$ is invariant to the choice of eigenbasis, i.e., $S_{\phi}=P\phi(\Lambda)P^{\top}=P^{\prime}\phi(\Lambda)P^{\prime\top}$ for any eigenbases $P$ and $P^{\prime}$ of $S$. Hence, $S_{\phi}$ is unique to $S$ for a given $\phi(\cdot)$, and we can consistently denote the mapping $\mathcal{F}_{\phi}(S)=\mathcal{F}_{\phi}(P\Lambda P^{\top})=P\phi(\Lambda)P^{\top}$.

When $\mathcal{F}_{\phi}$ is injective, $\mathcal{F}_{\phi}(S)$ and $S$ have the same algebraic multiplicities. Otherwise, $\mathcal{F}_{\phi}(S)$ has larger algebraic multiplicities on the corresponding eigenvalues, which may weaken the approximation ability, as can be understood from the associated Vandermonde matrix. The injectivity of $\mathcal{F}_{\phi}$ also guarantees that the transformation is reversible, i.e., incurs no information loss.
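The map $\mathcal{F}_{\phi}$ is easy to instantiate numerically. The following NumPy sketch (ours, for illustration only; $\phi=\tanh$ is merely an example of an injective entry-wise function and not a filter prescribed here) checks that $S_{\phi}$ keeps every eigenvector of $S$ while its spectrum becomes $\phi(\lambda_{1}),\dots,\phi(\lambda_{n})$.

import numpy as np

rng = np.random.default_rng(2)
n = 5
S = rng.standard_normal((n, n)); S = (S + S.T) / 2
lam, P = np.linalg.eigh(S)                 # S = P diag(lam) P^T

phi = np.tanh                              # an injective entry-wise phi (illustrative choice)
S_phi = P @ np.diag(phi(lam)) @ P.T        # F_phi(S)

# Only the spectrum is re-parameterized; every eigenvector of S is kept.
assert np.allclose(np.sort(np.linalg.eigvalsh(S_phi)), np.sort(phi(lam)))
for i in range(n):
    assert np.allclose(S_phi @ P[:, i], phi(lam[i]) * P[:, i])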

$\mathcal{F}_{\phi}(\cdot)$ is also equivariant to graph isomorphism. For any two graphs $G_{1}$ and $G_{2}$ with matrix representations $S_{1}$ and $S_{2}$ (e.g., adjacency matrix, Laplacian matrix, etc.), $G_{1}$ and $G_{2}$ are isomorphic if and only if there exists a permutation matrix $M$ such that $MS_{1}M^{\top}=S_{2}$. We denote $I(S)=MSM^{\top}$. Then:

Claim 1.

$\mathcal{F}_{\phi}(\cdot)$ is equivariant to graph isomorphism, i.e., $\mathcal{F}_{\phi}(I(S))=I(\mathcal{F}_{\phi}(S))$.

Proof.
\begin{align*}
\mathcal{F}_{\phi}(I(S))
&=\mathcal{F}_{\phi}(MSM^{\top})
=\mathcal{F}_{\phi}(M(P\Lambda P^{\top})M^{\top})
=\mathcal{F}_{\phi}((MP)\Lambda(MP)^{\top}) \\
&=(MP)\phi(\Lambda)(MP)^{\top}
=M(P\phi(\Lambda)P^{\top})M^{\top}
=I(\mathcal{F}_{\phi}(S)),
\end{align*}
where the fourth equality holds because $MP$ is again an orthonormal eigenbasis of $MSM^{\top}$ ($M$ is a permutation matrix) and $\mathcal{F}_{\phi}$ does not depend on the choice of eigenbasis.

Hence, for a permutation-invariant GNN model $f_{\textrm{GNN}}$, $f_{\textrm{GNN}}(\mathcal{F}_{\phi}(I(S)))=f_{\textrm{GNN}}(I(\mathcal{F}_{\phi}(S)))=f_{\textrm{GNN}}(\mathcal{F}_{\phi}(S))$, i.e., the learned representation remains invariant to graph isomorphism (also known as permutation invariance (Zaheer et al., 2017; Murphy et al., 2019)) when $\mathcal{F}_{\phi}(\cdot)$ is introduced.
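Claim 1 can also be verified numerically. The sketch below (our own; $\phi=\tanh$ is again only an illustrative injective choice) checks $\mathcal{F}_{\phi}(MSM^{\top})=M\mathcal{F}_{\phi}(S)M^{\top}$ on a random adjacency matrix and a random permutation.

import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = (A + A.T).astype(float)     # random undirected adjacency matrix
M = np.eye(n)[rng.permutation(n)]                  # random permutation matrix

def F_phi(S, phi=np.tanh):                         # F_phi(S) = P phi(Lambda) P^T
    lam, P = np.linalg.eigh(S)
    return P @ np.diag(phi(lam)) @ P.T

# Claim 1: F_phi(M S M^T) = M F_phi(S) M^T
assert np.allclose(F_phi(M @ A @ M.T), M @ F_phi(A) @ M.T)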

Appendix D Proof of Proposition 4.1

Proof.

Let $\mathring{A}=(D+\eta I)^{\epsilon}(A+\eta I)(D+\eta I)^{\epsilon}$. According to the Courant–Fischer theorem,

\[
\mu_{i}=\min_{\mathrm{dim}(S)=i}\max_{\bm{x}\in S}\frac{\bm{x}^{\top}\mathring{A}\bm{x}}{\bm{x}^{\top}\bm{x}}.
\]

Let $\bm{y}=(D+\eta I)^{\epsilon}\bm{x}$. Since this change of variables is non-singular, the expression above is equivalent to

\[
\mu_{i}=\min_{\mathrm{dim}(T)=i}\max_{\bm{y}\in T}\frac{\bm{y}^{\top}(A+\eta I)\bm{y}}{\bm{y}^{\top}(D+\eta I)^{-2\epsilon}\bm{y}}.
\]

Therefore,

\begin{align*}
\mu_{i}
&=\min_{\mathrm{dim}(T)=i}\max_{\bm{y}\in T}\frac{\bm{y}^{\top}(A+\eta I)\bm{y}}{\bm{y}^{\top}(D+\eta I)^{-2\epsilon}\bm{y}} \\
&\geq\min_{\mathrm{dim}(T)=i}\max_{\bm{y}\in T}\frac{\bm{y}^{\top}(A+\eta I)\bm{y}}{(d_{\mathrm{max}}+\eta)^{-2\epsilon}\,\bm{y}^{\top}\bm{y}} \\
&=(d_{\mathrm{max}}+\eta)^{2\epsilon}\Bigl(\min_{\mathrm{dim}(T)=i}\max_{\bm{y}\in T}\frac{\bm{y}^{\top}A\bm{y}}{\bm{y}^{\top}\bm{y}}+\eta\Bigr) \\
&=(\lambda_{i}+\eta)(d_{\mathrm{max}}+\eta)^{2\epsilon}.
\end{align*}

Similarly, we can prove $\mu_{i}\leq(\lambda_{i}+\eta)(d_{\mathrm{min}}+\eta)^{2\epsilon}$. ∎
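The bounds can be checked numerically. The sketch below is ours, not part of the released code; for this illustration we additionally assume $\eta$ is chosen large enough that $A+\eta I$ is positive semidefinite and we use $\epsilon=-0.3<0$, matching the negative exponent used elsewhere in the paper.

import numpy as np

rng = np.random.default_rng(4)
n, eps = 8, -0.3
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = (A + A.T).astype(float)      # random undirected adjacency matrix
lam = np.sort(np.linalg.eigvalsh(A))                # lambda_i of A, ascending
eta = -lam[0] + 0.5                                 # assumption: eta makes A + eta*I PSD
d = A.sum(axis=1)

D_eps = np.diag((d + eta) ** eps)
A_ring = D_eps @ (A + eta * np.eye(n)) @ D_eps      # (D + eta I)^eps (A + eta I) (D + eta I)^eps
mu = np.sort(np.linalg.eigvalsh(A_ring))            # mu_i, ascending

lower = (lam + eta) * (d.max() + eta) ** (2 * eps)
upper = (lam + eta) * (d.min() + eta) ** (2 * eps)
assert np.all(lower <= mu + 1e-10) and np.all(mu <= upper + 1e-10)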

Appendix E Visualizations of the Effects of the Normalization $\tilde{D}^{\epsilon}\tilde{A}\tilde{D}^{\epsilon}$ on the Spectrum

Figure 6: We randomly sample 5 graphs from each of the three datasets NCI109, ENZYMES, and PTC_MR, and use a fixed $\epsilon=-0.3$ to show the effect of the normalization on each graph's spectrum.

Appendix F Visualizations of the Learned Filters

Figure 7: Visualizations of the learned filters on MolPCBA, NCI1, NCI109, ENZYMES and PTC_MR.

Appendix G The Correlation Issue of Deep Models

Figure 8: Cosine similarities on ZINC.

We measure the absolute value of the cosine similarity between hidden signals at different layers of a model of depth 25. For each graph, we take the mean over all pairs of hidden signals; the curves in Fig. 8 report the mean over all graphs in a randomly selected batch. To stay consistent with the definition of spectral graph convolution as well as with our correlation analysis, these runs do not use the edge features of ZINC.
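A minimal PyTorch sketch of this per-graph, per-layer statistic is given below (our own hypothetical helper, not the released implementation); it computes the mean absolute cosine similarity over all pairs of hidden signals of one graph at one layer.

import torch

def mean_abs_pairwise_cosine(H: torch.Tensor) -> torch.Tensor:
    # H: hidden representation of one graph at one layer, shape (n_nodes, d_hidden);
    # each column is one hidden signal defined over the graph's nodes.
    X = H / (H.norm(dim=0, keepdim=True) + 1e-12)   # normalize every signal to unit length
    C = X.t() @ X                                   # (d_hidden, d_hidden) cosine similarities
    off = ~torch.eye(C.size(0), dtype=torch.bool)   # drop the trivial self-similarities
    return C[off].abs().mean()

H = torch.randn(30, 16)                             # e.g. a graph with 30 nodes, 16 channels
print(mean_abs_pairwise_cosine(H))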

The results show that, on both bases, the cosine similarity of the shared-filter case converges to 1, while the correlation-free architecture converges to about $0.8$ for $\tilde{D}^{\frac{1}{2}}\tilde{A}\tilde{D}^{\frac{1}{2}}$ and about $0.7$ for $\tilde{A}_{\rho}$. (We also note that relatively large cosine similarities are easily reached on ZINC, mainly because the graphs are small, so that $n\ll d$, where $n$ is the number of nodes and $d$ is the number of hidden features.) These results confirm that general GNNs suffer from the correlation issue as depth increases, whereas our correlation-free architecture remains relatively stable.

Appendix H More Results

Figure 9: The curves of 5 runs on ZINC with the number of bases $k=9,18,21,24$.