
ASWT-SGNN: Adaptive Spectral Wavelet Transform-based
Self-Supervised Graph Neural Network

Ruyue Liu1,2, Rong Yin1,2, Yong Liu3, Weiping Wang1 (Corresponding author)
Abstract

Graph Contrastive Learning (GCL) is a self-supervised method that combines the advantages of Graph Convolutional Networks (GCNs) and contrastive learning, making it promising for learning node representations. However, the GCN encoders used in these methods rely on the Fourier transform to learn fixed graph representations, which is inherently limited by the uncertainty principle involving spatial and spectral localization trade-offs. To overcome the inflexibility of existing methods and their computationally expensive eigen-decomposition and dense matrix multiplication, this paper proposes an Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network (ASWT-SGNN). The proposed method employs spectral-adaptive polynomials to approximate the filter function and optimizes the wavelet scales with a contrastive loss. This design enables the creation of local filters in both the spectral and spatial domains, allowing flexible aggregation of neighborhood information at various scales and facilitating controlled transformation between local and global information. Compared to existing methods, the proposed approach reduces computational complexity and addresses the limitation of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood scope. Extensive experiments on eight benchmark datasets demonstrate that ASWT-SGNN accurately approximates the filter function in high-density spectral regions, avoiding costly eigen-decomposition. Furthermore, ASWT-SGNN achieves comparable performance to state-of-the-art models in node classification tasks.

Introduction

Graphs are essential in various real-world domains such as social networks, brain networks, transportation networks, and citation networks (Wang et al. 2016; Yu, Lee, and Sohn 2020). Recently, the emergence of graph neural networks (GNNs) (He et al. 2022; Wang et al. 2022b) has attracted much attention due to their success in applications involving graph-structured data such as node classification (Kipf and Welling 2016) and edge prediction (Hasanzadeh et al. 2019). Graph Contrastive Learning (GCL) (Velickovic et al. 2019; Sun et al. 2019) combines the capabilities of GNNs and contrastive learning techniques, making it a promising paradigm in the field of graph analysis (He et al. 2020). Typically, GCL methods generate multiple views by randomly augmenting input data and optimize the GNN encoder by learning consistency across views. GCL reduces the dependence of graph representation learning on human annotations and achieves state-of-the-art performance in tasks such as node classification (Hassani and Khasahmadi 2020; Zhang et al. 2021).

Most graph contrastive learning methods utilize graph convolutional neural networks (GCNs) as encoders (Xie et al. 2022; Wang et al. 2022b). Similar to convolutional neural networks (CNNs) in computer vision, spectral GCNs use Fourier bases in the design of graph-based operators (Bruna et al. 2013). However, these operators are localized in the frequency rather than the spatial domain. Additionally, they require a costly eigen-decomposition and dense matrix multiplications, leading to high computational expense. To address this issue and achieve spatial localization, methods such as ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) and GCN (Kipf and Welling 2016) employ polynomial approximation. While GCN is widely adopted for graph problems due to its impressive performance and computational efficiency (Shi et al. 2020), it encounters limitations and challenges when applied to large graphs, especially in mini-batch settings (Zeng et al. 2019b). To overcome the scaling challenges of GCN on large graphs, researchers have proposed layer sampling methods (Ying et al. 2018; Kaler et al. 2022) and subgraph sampling methods (Zeng et al. 2019b, a). However, the filter size is determined by the size of the entire graph or the sampled subgraph, which restricts flexibility for inputs of different sizes. Although some flexible spatial methods have been proposed, their aggregators lack learnability and convolutional properties. Consequently, designing a flexible filter that combines the learnability of spatial methods with the convolutional properties of spectral methods remains an open problem.

To address the issues of inflexible and unlearnable filters, as well as the limited applicability caused by high computational complexity in existing methods, this paper proposes a novel graph contrastive learning paradigm based on the adaptive spectral wavelet transform. More specifically, a fast spectrum-adaptive approximation method is utilized to estimate the wavelet filter, and a contrastive loss is employed to optimize the wavelet scales directly. Additionally, the introduction of residual links mitigates over-smoothing during information aggregation. By avoiding the expensive eigen-decomposition of the graph Laplacian operator and enabling localization in both the spectral and spatial domains, this approach effectively overcomes the limitations of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood scope. In comparison to state-of-the-art methods, this work introduces the following innovations:

  • We propose a novel self-supervised graph representation learning method based on sparse graph wavelets that creates localized filters in both the spectral and spatial domains. It reduces computational complexity and addresses the limitation that graph convolutional neural networks cannot flexibly control the neighborhood scope.

  • Theoretically, we demonstrate that nodes with similar network neighborhoods and features exhibit similar ASWT-SGNN embeddings, providing a performance guarantee for the proposed method.

  • Extensive experiments on eight benchmark datasets show that the proposed method reduces approximation errors in high-density spectral components without requiring expensive eigen-decomposition and achieves competitive performance with state-of-the-art models in node classification tasks.

Related Works

Graph Convolutional Neural Network

Following the success of CNNs in computer vision and natural language processing, researchers have sought to extend CNNs to the graph domain. The key challenge lies in defining the convolution operator for graphs. Graph convolutional neural networks can be broadly categorized into two main approaches: spectral and spatial. Spectral methods employ the graph Fourier transform to transfer signals from the spatial domain to the spectral domain, where convolution operations are performed. Spectral GNN (Bruna et al. 2013) is the first attempt to implement CNNs on graphs. ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) introduces a parameterization of spectral filters using Chebyshev polynomials, which enables fast localization of spectral filters. GCN (Kipf and Welling 2016) proposes a simplified version of ChebyNet, which achieves success in graph semi-supervised learning tasks. However, these spectral methods face challenges with generalization, as they are limited to fixed graph sizes, and larger filter sizes result in increased computational and memory costs. Spatial methods draw inspiration from weighted summation in CNNs to extract spatial features from topological graphs. MoNet (Monti et al. 2017) provides a general framework for designing spatial methods by utilizing the weighted average of functions defined within a node's neighborhood as a spatial convolution operator. GAT (Velickovic et al. 2017) proposes a self-attention mechanism to learn the weighting function. However, these methods employ unlearnable aggregators, and the localization of the convolution operation remains uncertain.

Graph Contrastive Learning

In graph analysis, contrastive learning was initially introduced by DGI (Velickovic et al. 2019) and InfoGraph (Sun et al. 2019), drawing inspiration from local-global mutual information maximization. Building upon this, MVGRL (Hassani and Khasahmadi 2020) incorporates node diffusion into the graph contrastive framework. GCA (Zhu et al. 2021) learns node representations by considering other nodes as negative samples, while BGRL (Thakoor et al. 2021) proposes a model without negative samples. CCA-SSG (Zhang et al. 2021) optimizes feature-level objectives in addition to instance-level differences. GRADE (Wang et al. 2022b) investigates fairness differences in contrastive learning and proposes a novel approach to graph augmentation. Several surveys (Xie et al. 2022; Ding et al. 2022; Kumar, Rawat, and Chauhan 2022) summarize recent advancements in graph contrastive learning. Despite these methods' notable achievements, most rely on GCN and its variants as the base models, inheriting the limitations of GCN. These limitations restrict the performance of these graph contrastive learning methods in tasks that require preserving fine-grained node features.

Graph Wavelets

The wavelet transform exhibits favorable structural properties by utilizing finite length and attenuation basis functions. This approach effectively localizes signals within both the spatial and spectral domains. Additionally, it is noteworthy that the basis and its inverse in the wavelet transform often showcase sparsity, contributing to its utility and efficiency. To construct the wavelet transform on graphs, Hammond et al. (Hammond, Vandergheynst, and Gribonval 2011) propose a method that approximates the wavelet using Chebyshev polynomials. This approach effectively avoids the need for eigen-decomposition of the Laplace matrix. Building upon this, GWNN (Xu et al. 2019) redefines graph convolution based on graph wavelets, resulting in high efficiency and sparsity. M-GWCN (Behmanesh et al. 2022) applies the multi-scale graph wavelet transform to learn representations of multimodal data. Collectively, these works showcase the value of graph wavelets in signal processing on graphs. However, these methods primarily employ wavelet transforms in supervised or semi-supervised tasks and heavily rely on labeled data.

Figure 1: (a) Signal spectrum; (b) Filter signal. Visualization of how wavelet filters capture multiscale properties in graph signals and structures. We use a graph with two levels of clusters (4-node clusters and 8-node clusters) for demonstration. These clusters are reflected in the gaps (about 0 and 6) in the spectrum in Figure 1(a), corresponding to different eigenvalue clusterings. The signal is obtained by filtering a random signal with the filter in Figure 1(a), purposefully highlighting the three eigenvalue clusters. Figure 1(b) shows how the complete signal is decomposed into multiple filter components.

Preliminary

Graph Fourier Transform

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote a graph, where $\mathcal{V}=\{v_{1},\cdots,v_{N}\}$ represents the set of nodes and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the set of edges. $\mathcal{G}$ is associated with a feature matrix $\bm{X}\in\mathbb{R}^{N\times p}$, $\bm{X}=[x_{1},\dots,x_{N}]$, and an adjacency matrix $\bm{A}\in\mathbb{R}^{N\times N}$, where $\bm{A}_{ij}=1$ if $(v_{i},v_{j})\in\mathcal{E}$ and $\bm{A}_{ij}=0$ otherwise. We define the Laplacian matrix of the graph as $\bm{L}=\bm{D}-\bm{A}$, where $\bm{D}=\mathrm{Diag}(d_{1},\dots,d_{N})$ and $d_{i}=\sum_{j}\bm{A}_{ij}$. The symmetric normalized Laplacian is defined as $\bm{L}_{sym}=\bm{D}^{-\frac{1}{2}}\bm{L}\bm{D}^{-\frac{1}{2}}=\bm{U}\bm{\Lambda}\bm{U}^{\top}$, where $\bm{U}=[\bm{u}_{1},\ldots,\bm{u}_{N}]$, $\bm{u}_{i}\in\mathbb{R}^{N}$ denotes the $i$-th eigenvector of $\bm{L}_{sym}$, and $\bm{\Lambda}=\mathrm{Diag}(\lambda_{1},\dots,\lambda_{N})$ is the corresponding eigenvalue matrix.

For a discrete graph signal $\bm{X}$, its Fourier transform has the following form (Shuman et al. 2013):

\hat{f}(\lambda_{\ell})=\sum_{i=1}^{N}\bm{x}_{i}\bm{U}_{\ell}^{*}(i)=\sum_{i=1}^{N}\bm{x}_{i}\bm{U}_{\ell}^{T}(i),   (1)

where $\bm{U}_{\ell}^{*}$ denotes the conjugate transpose of the eigenvector corresponding to the eigenvalue $\lambda_{\ell}$. Since complex numbers are not involved in the scope of this work, it can be regarded as an ordinary transpose, that is, $\bm{U}^{*}=\bm{U}^{T}$.

The graph Fourier transform provides a means to define the graph convolution operator using the convolution theorem. The filtered signal, denoted as $f_{o}$, can be mathematically expressed as:

f_{o}=\bm{U}g(\bm{\Lambda})\bm{U}^{T}\bm{X},   (2)

where $g(\bm{\Lambda})$ denotes the filter function on the eigenvalues, and $\bm{U}g(\bm{\Lambda})\bm{U}^{T}$ is the graph filtering matrix.
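
To make the spectral filtering of Eq. (2) concrete, the following is a minimal NumPy sketch (not the authors' code; names such as `fourier_filter` are ours) that builds $\bm{L}_{sym}$ for a small dense graph, eigen-decomposes it, and applies a filter $g$ to the eigenvalues:

```python
import numpy as np

def fourier_filter(A, X, g):
    """Return U g(Lambda) U^T X for a small dense adjacency matrix A (Eq. (2))."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    lam, U = np.linalg.eigh(L_sym)                             # eigen-decomposition, O(N^3)
    return U @ np.diag(g(lam)) @ U.T @ X

# Toy usage: a low-pass filter that attenuates high-frequency components.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
X_smooth = fourier_filter(A, X, lambda lam: np.exp(-2.0 * lam))
```

The explicit eigen-decomposition in this sketch is exactly the $\mathcal{O}(N^{3})$ step that the wavelet approximation in the Methodology section is designed to avoid.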

Uncertainty Principle

The implementation of graph convolution using the Fourier transform achieves localization in the spectral domain but lacks spatial localization, and its degree of localization is further constrained by computational efficiency. We employ the spatial and spectral concentration metrics proposed by Tsitsvero et al. (Tsitsvero, Barbarossa, and Di Lorenzo 2016) to establish a more comprehensive and precise notion of localization.

\frac{\left\|\bm{B}\bm{x}\right\|_{2}^{2}}{\left\|\bm{x}\right\|_{2}^{2}}=a^{2},\quad\frac{\left\|\bm{C}\bm{x}\right\|_{2}^{2}}{\left\|\bm{x}\right\|_{2}^{2}}=b^{2},   (3)

where the squared Euclidean norm $\left\|\bm{x}\right\|_{2}^{2}$ denotes the total energy of the signal $\bm{x}$. $\bm{B}$ is the diagonal matrix representing the node restriction operator: given a node subset $\mathcal{S}\subseteq\mathcal{V}$, $\bm{B}=\mathrm{Diag}\{\mathbb{I}(i\in\mathcal{S})\}$, where $\mathbb{I}(\cdot)$ is an indicator function. $\bm{C}=\bm{U}\Sigma_{\mathcal{F}}\bm{U}^{-1}$ is the band-limiting operator: given the matrix $\bm{U}$ and a subset of frequency indices $\mathcal{F}\subseteq\mathcal{V}^{*}$, where $\mathcal{V}^{*}=\{1,\cdots,N\}$ denotes the set of all frequency indices, $\Sigma_{\mathcal{F}}$ is a diagonal matrix defined as $\Sigma_{\mathcal{F}}=\mathrm{Diag}\{\mathbb{I}(i\in\mathcal{F})\}$.

More specifically, considering a pair of a vertex set $\mathcal{S}$ and a frequency set $\mathcal{F}$, where $a^{2}$ and $b^{2}$ represent the percentage of energy contained within $\mathcal{S}$ and $\mathcal{F}$ respectively, our objective is to determine the trade-off between $a$ and $b$ and identify signals that can achieve all feasible pairs. The resulting uncertainty principle is formulated in the following theorem (Tsitsvero, Barbarossa, and Di Lorenzo 2016).

\cos^{-1}a+\cos^{-1}b\geq\cos^{-1}\lambda_{\max}\left(\bm{B}\bm{C}\right),   (4)

where $\lambda_{\max}(\bm{B}\bm{C})$ is the maximum eigenvalue of $\bm{B}\bm{C}$.

Considering the uncertainty principle, we would retain the entire frequency band if we followed the graph localization method in Eq. (2). However, real graph signals show a non-uniform distribution in the frequency domain, which suggests that $\bm{C}$ can be chosen more efficiently. Moreover, a deeper network leads to a wider propagation of the signal in the graph domain, which limits the width of the frequency bands and leads to the dominance of low-frequency (smooth) signals, producing over-smoothing.

Graph Wavelet Transform

Like the graph Fourier transform, the graph wavelet transform also requires a set of suitable bases to map the graph signal to the spectral domain. In this case, we denote the wavelet operator as $\bm{\Psi}_{s}=\bm{U}g_{s}(\bm{\Lambda})\bm{U}^{\top}$, where $s$ is the scaling parameter. The wavelet transform decomposes a function $g$ into a linear combination of basis functions localized in both the spatial and spectral domains. This paper employs the heat-kernel wavelet as the low-pass filter (denoted as $g^{l}$) and the Mexican-hat wavelet as the band-pass filter (denoted as $g^{b}$). These filters are defined as follows:

g^{l}_{s}(\lambda)=e^{-s\lambda},   (5)
g_{s}^{b}(\lambda)=\frac{2}{\sqrt{3}\pi^{\frac{1}{4}}}\left(1-\left(\lambda s\right)^{2}\right)e^{-\frac{\left(\lambda s\right)^{2}}{2}}.   (6)

In this work, we integrate filter functions to achieve the combined effect of a low-pass filter and a wide band-pass filter. Figure 1 exemplifies how the combination of low-pass and band-pass filters can generate a more intricate wavelet filter capable of capturing signal components at various scales. This design yields a richer representation of the signal that encompasses its frequency information in a more nuanced way.

g_{\theta}(\lambda)=g_{s_{0}}^{l}(\lambda)+\sum_{l=1}^{L}g_{s_{l}}^{b}(\lambda),   (7)

where $\theta=\{s_{0},s_{1},\ldots,s_{L}\}$ is the set of scale parameters.
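
As a quick illustration of Eqs. (5)-(7), the short Python sketch below evaluates the combined multi-scale filter $g_{\theta}(\lambda)$ on a grid of eigenvalues; the scale values are illustrative choices, not the paper's tuned settings:

```python
import numpy as np

def heat_lowpass(lam, s):
    """Heat-kernel low-pass filter, Eq. (5)."""
    return np.exp(-s * lam)

def mexican_hat_bandpass(lam, s):
    """Mexican-hat band-pass filter, Eq. (6)."""
    c = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return c * (1.0 - (lam * s) ** 2) * np.exp(-0.5 * (lam * s) ** 2)

def g_theta(lam, s0, band_scales):
    """Combined multi-scale wavelet filter, Eq. (7)."""
    out = heat_lowpass(lam, s0)
    for s in band_scales:
        out = out + mexican_hat_bandpass(lam, s)
    return out

lam = np.linspace(0.0, 2.0, 200)      # the spectrum of L_sym lies in [0, 2]
response = g_theta(lam, s0=2.0, band_scales=[1.0, 2.0, 4.0])
```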

Using the graph wavelet transform instead of the graph Fourier transform in Eq. (2), we get the graph convolution as follows:

f_{o}=\bm{\Psi}\bm{G}\bm{\Psi}^{\top}\bm{X},   (8)

where $\bm{G}$ is a diagonal matrix that acts as a filter; the scale set $\theta$ of the wavelet operator is omitted for simplicity.

Methodology

Wavelet Coefficients Approximation

The model formulation discussed in the preceding sections necessitates an eigen-decomposition of the Laplacian operator of the input graph $\mathcal{G}$. However, this process has a computational complexity of $\mathcal{O}(N^{3})$, rendering it infeasible for larger graphs. To overcome this constraint, we employ the polynomial approximation method previously introduced by Hammond et al. (Hammond, Vandergheynst, and Gribonval 2011). This method expresses the wavelet filter as $g_{\theta}(\lambda)\approx p_{\theta}(\lambda)=\gamma_{0}+\gamma_{1}\lambda+\cdots+\gamma_{m}\lambda^{m}$, which allows rewriting the wavelet operator as $\bm{\Psi}_{\theta}=\bm{U}p_{\theta}(\bm{\Lambda})\bm{U}^{\top}=p_{\theta}(\bm{L}_{sym})$.

While existing methods rely on Chebyshev polynomial approximations, we aim to optimize the scales. Therefore, we utilize a least squares approximation that can parameterize the set of wavelet scales $\theta$, which can be expressed as follows:

\bm{\gamma}_{\theta}=(\bm{V}_{\bm{\Lambda}}^{\top}\bm{V}_{\bm{\Lambda}})^{-1}\bm{V}_{\bm{\Lambda}}^{\top}g_{\theta}(\bm{\Lambda}),   (9)

where $\bm{V}_{\bm{\Lambda}}\in\mathbb{R}^{N\times(m+1)}$ is the Vandermonde matrix of $\bm{\Lambda}$ from order 0 to order $m$, and $N$ is the number of eigenvalues.

Accurate calculation of the eigenvalues requires an expensive eigen-decomposition of the graph Laplacian operator, which is not feasible in our case. As an alternative, we replace the set of eigenvalues with a sequence of linearly spaced points $\bm{\xi}=\{\xi_{i}\}_{i=1}^{K}$ within the interval $[0,2]$ on the spectral domain. However, in graphs that exhibit multiscale features, the eigenvalues do not follow a uniform distribution on the spectral domain. Instead, they display spectral gaps corresponding to different scales within the data. To address this non-uniform distribution, we incorporate the estimated spectral density $\bm{\omega}$ as the weight for each of the $K$ sample points $\bm{\xi}$ on the spectral domain (Fan et al. 2020). Consequently, we can compute the weighted least squares coefficients,

\bm{\gamma}_{\theta}=(\bm{V}_{\bm{\xi}}^{\top}\mathrm{Diag}(\bm{\omega})\bm{V}_{\bm{\xi}})^{-1}\bm{V}_{\bm{\xi}}^{T}\mathrm{Diag}(\bm{\omega})g_{\theta}(\bm{\xi}),   (10)

where the spectral density $\bm{\omega}=\{\omega_{j}\}_{j=1}^{K}$, and $\omega_{i}=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}=\xi_{i})$.
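
A minimal sketch of the weighted least squares fit in Eq. (10), assuming NumPy; the sample points, weights, and target filter below are placeholders rather than values used in the paper:

```python
import numpy as np

def weighted_ls_coeffs(xi, omega, g_vals, m):
    """Fit gamma_theta as in Eq. (10): xi are spectral sample points in [0, 2],
    omega the estimated spectral density weights, g_vals = g_theta(xi)."""
    V = np.vander(xi, N=m + 1, increasing=True)     # Vandermonde matrix, orders 0..m
    W = np.diag(omega)
    # gamma = (V^T W V)^{-1} V^T W g, solved without forming an explicit inverse.
    return np.linalg.solve(V.T @ W @ V, V.T @ W @ g_vals)

K, m = 20, 6
xi = np.linspace(0.0, 2.0, K)
omega = np.full(K, 1.0 / K)                         # uniform density as a stand-in
gamma = weighted_ls_coeffs(xi, omega, np.exp(-2.0 * xi), m)
```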

The goal of spectral density estimation is to approximate the density function without expensive eigen-decomposition of the graph Laplacian. To achieve this, we determine the number of eigenvalues less than or equal to each $\xi_{i}$ in the set $\bm{\xi}$. This can be achieved by computing an approximation to the trace of the eigen-projection matrix $\bm{P}$ (Di Napoli, Polizzi, and Saad 2016). In practice, directly obtaining the projector $\bm{P}$ is often not feasible. However, it can be approximated efficiently using polynomials or rational functions of the Laplacian operator $\bm{L}$. In this approximation, we interpret $\bm{P}$ as a step function of $\bm{L}$, which can be expressed as follows:

\bm{P}_{\xi_{i}}=h(\bm{L}),\quad\text{where}~h(\lambda)=\begin{cases}1,&\text{if}~\lambda\leq\xi_{i}\\ 0,&\text{otherwise.}\end{cases}   (11)

Although $h(\lambda)$ cannot be computed exactly at low cost, it can be approximated using a finite sum of Jackson-Chebyshev polynomials, denoted as $\phi(\lambda)$; please refer to Appendix A for details of the approximation method. In this form, it becomes possible to estimate the trace of $\bm{P}$ with the estimator developed by Hutchinson (Hutchinson 1989) and further improved more recently (Tang and Saad 2012). Hutchinson's stochastic estimator relies solely on matrix-vector products to approximate the matrix trace. The key idea is to utilize Rademacher random variables, where each entry of a randomly generated vector $\bm{R}\in\mathbb{R}^{N}$ takes the value $-1$ or $1$ with equal probability $\frac{1}{2}$. Thus, an estimate of the trace $tr(\bm{P})$ can be obtained by generating $n_{r}$ samples of random vectors $\bm{R}_{k}$, $k=1,\cdots,n_{r}$, and computing the average over these samples. The estimator can be expressed as follows:

tr(\bm{P})\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\bm{P}\bm{R}_{k}\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\phi(\bm{L}_{sym})\bm{R}_{k}.   (12)

We obtain the approximation $\bm{\Omega}$ to the cumulative spectral density function:

\bm{\Omega}=\left\{\left(\xi_{i},\frac{1}{N}\left[\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\phi(\bm{L}_{sym})\bm{R}_{k}\right]\right)\right\}_{i=1}^{K}.   (13)

Finally, the cumulative spectral density $\bm{\Omega}$ is differentiated to obtain an approximation of the spectral density $\bm{\omega}$, i.e., $\bm{\omega}=\frac{d}{d\bm{\xi}}\bm{\Omega}$. Using the estimated spectral density $\bm{\omega}$ as the weight for each sample point $\xi_{i}$ on the spectral domain, the weighted least squares coefficients $\bm{\gamma}_{\theta}$ can be calculated by substituting it into Eq. (10).
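
The sketch below, assuming SciPy sparse matrices, illustrates the estimator in Eqs. (12)-(13) under the usual simplifications: the spectrum of $\bm{L}_{sym}$ is shifted from $[0,2]$ to $[-1,1]$, the step function is expanded in Jackson-damped Chebyshev polynomials (see Appendix A), and Rademacher probe vectors approximate the trace. The helper names (`jackson`, `cheb_step_coeffs`, `eig_count_estimate`) are ours, not the authors':

```python
import numpy as np
import scipy.sparse as sp

def jackson(m):
    """Jackson damping coefficients g_j^m (see Eq. (34) in Appendix A)."""
    a = np.pi / (m + 2)
    j = np.arange(m + 1)
    return (np.sin((j + 1) * a) / ((m + 2) * np.sin(a))
            + (1.0 - (j + 1) / (m + 2)) * np.cos(j * a))

def cheb_step_coeffs(b, m):
    """Chebyshev coefficients of the step 1[t <= b] on [-1, 1] (Eq. (30) with a = -1)."""
    j = np.arange(1, m + 1)
    g0 = (np.arccos(-1.0) - np.arccos(b)) / np.pi
    gj = (2.0 / np.pi) * (np.sin(j * np.arccos(-1.0)) - np.sin(j * np.arccos(b))) / j
    return np.concatenate(([g0], gj))

def eig_count_estimate(L_sym, xi, m=30, n_r=10, seed=0):
    """Stochastic estimate of #{lambda_j <= xi} for L_sym with spectrum in [0, 2]."""
    rng = np.random.default_rng(seed)
    N = L_sym.shape[0]
    M = L_sym - sp.identity(N)                 # shift spectrum to [-1, 1]
    coeffs = jackson(m) * cheb_step_coeffs(xi - 1.0, m)
    total = 0.0
    for _ in range(n_r):
        r = rng.choice([-1.0, 1.0], size=N)    # Rademacher probe vector
        t_prev, t_cur = r, M @ r               # T_0(M) r and T_1(M) r
        acc = coeffs[0] * r.dot(t_prev) + coeffs[1] * r.dot(t_cur)
        for j in range(2, m + 1):              # Chebyshev three-term recurrence
            t_prev, t_cur = t_cur, 2.0 * (M @ t_cur) - t_prev
            acc += coeffs[j] * r.dot(t_cur)
        total += acc
    return total / n_r
```

Evaluating this at each sample point $\xi_{i}$ and dividing by $N$ gives the cumulative density estimate $\bm{\Omega}$ of Eq. (13), whose derivative supplies the weights $\bm{\omega}$ used in Eq. (10).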

These coefficients are used to approximate the wavelet filter matrix $\bm{\Psi}_{\theta}$, which can be expressed as follows:

\bm{\Psi}_{\theta}=\gamma_{0}\bm{I}+\gamma_{1}\bm{L}_{sym}+\cdots+\gamma_{m}\bm{L}_{sym}^{m}.   (14)
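
In practice, Eq. (14) can be applied to features without materializing a dense operator: only repeated sparse matrix products are needed, as in the hedged sketch below (variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def apply_wavelet_operator(L_sym, X, gamma):
    """Compute Psi_theta X = sum_k gamma[k] * L_sym^k X without forming L_sym^k."""
    out = gamma[0] * X
    power = X
    for g in gamma[1:]:
        power = L_sym @ power           # L_sym^k X obtained from L_sym^{k-1} X
        out = out + g * power
    return out

# Toy usage with a random symmetric sparse matrix standing in for L_sym.
N, p = 100, 8
L_sym = sp.random(N, N, density=0.05, random_state=0)
L_sym = (L_sym + L_sym.T).tocsr()
X = np.random.randn(N, p)
Psi_X = apply_wavelet_operator(L_sym, X, np.array([0.5, -0.3, 0.1]))
```

This is where the $\mathcal{O}(m\times|E|)$ cost quoted in the Complexity Analysis comes from: each polynomial term adds one sparse multiplication over the edge set.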

Encoder

In this paper, we construct a graph wavelet multilayer convolutional network (ASWT-SGNN). Based on the above, we define the $l$-th layer of ASWT-SGNN as

\bm{H}^{l+1}=\sigma(\bm{H}^{l^{\prime}}\bm{W}^{l}),   (15)

where $\sigma$ is the activation function, $\bm{H}^{l^{\prime}}\in\mathbb{R}^{N\times p}$ is the result of the graph convolution at the $l$-th layer, $\bm{W}^{l}\in\mathbb{R}^{p\times q}$ is the weight matrix of the $l$-th layer, $p$ is the number of features in the current layer, and $q$ is the number of features in the next layer. $\bm{H}^{l^{\prime}}$ in Eq. (15) is defined as follows:

\bm{H}^{l^{\prime}}=\alpha\bm{F}^{l}\bm{H}^{l}+(1-\alpha)\bm{H}^{l},   (16)

where $1-\alpha$ represents the proportion of the original features $\bm{H}^{l}$ in the $l$-th layer, and $0\leq\alpha\leq 1$. By incorporating this residual connection, we guarantee that regardless of the number of layers we add, the resulting representation always retains a portion of the initial features. To enhance the encoder's capability to aggregate both local and global information effectively, we define the diffusion operator $\bm{F}^{l}$ of the $l$-th layer in Eq. (16) as

\bm{F}^{l}=\beta(\bm{\Psi}_{\theta}\bm{G}^{l}\bm{\Psi}_{\theta}^{\top})+(1-\beta)(\bm{D}^{-\frac{1}{2}}\tilde{\bm{A}}\bm{D}^{-\frac{1}{2}}),   (17)

where $\beta$ denotes the ratio of the graph wavelet term and $0\leq\beta\leq 1$, $\theta$ is the set of learnable multi-scale parameters, and $\bm{G}^{l}$ is the learnable diagonal filter matrix of the $l$-th layer. $\tilde{\bm{A}}=\bm{A}+\bm{I}$ represents the adjacency matrix of the self-loop graph of $\mathcal{G}$, where $\bm{I}$ is the identity matrix.
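
The following PyTorch sketch shows one plausible realization of Eqs. (15)-(17); it is an illustration under our own simplifications (dense precomputed $\bm{\Psi}_{\theta}$ and normalized adjacency, ReLU activation), not the authors' released implementation:

```python
import torch
import torch.nn as nn

class ASWTLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_nodes, alpha=0.8, beta=0.4):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # W^l in Eq. (15)
        self.g = nn.Parameter(torch.ones(num_nodes))      # diagonal of G^l in Eq. (17)
        self.alpha, self.beta = alpha, beta

    def forward(self, H, Psi, A_norm):
        # Wavelet term Psi G^l Psi^T H, with the diagonal G^l applied as a row scaling.
        wavelet_term = Psi @ (self.g.unsqueeze(1) * (Psi.t() @ H))
        F_H = self.beta * wavelet_term + (1 - self.beta) * (A_norm @ H)   # Eq. (17)
        H_prime = self.alpha * F_H + (1 - self.alpha) * H                 # Eq. (16)
        return torch.relu(self.W(H_prime))                                # Eq. (15)
```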

Optimization Objective

Typical GCL methods involve generating augmented views and subsequently optimizing the agreement between their encoded representations. In this paper, we generate two augmented graphs, $\bm{z}$ and $\bm{o}$, by using feature augmentation. We randomly sample a mask vector $\bm{m}_{f}\in\{0,1\}^{B}$ to hide part of the dimensions of the node features. Each element of the mask $\bm{m}_{f}$ is sampled from the Bernoulli distribution $Ber(1-f_{d})$, where the hyperparameter $f_{d}$ is the feature drop rate. Therefore, the augmented node features $\hat{\bm{X}}$ are calculated by the following formula:

\hat{\bm{X}}=\left[\bm{x}_{1}\circ\bm{m}_{f},\bm{x}_{2}\circ\bm{m}_{f},\cdots,\bm{x}_{N}\circ\bm{m}_{f}\right].   (18)
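
A small sketch of the masking in Eq. (18), assuming PyTorch; one shared Bernoulli mask is drawn per view and broadcast over all node rows:

```python
import torch

def feature_mask_augment(X, f_d):
    """X: (N, B) node feature matrix; f_d: feature drop rate. Returns a masked copy."""
    m_f = (torch.rand(X.size(1), device=X.device) > f_d).float()   # mask in {0, 1}^B
    return X * m_f                                                  # x_i ∘ m_f for every node
```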

We apply the contrastive objective to the node representations of the two graph augmentations. For node $v_{a}$, the node representations $\bm{z}_{a}$ and $\bm{o}_{a}$ from the two augmentations form a positive pair, whereas the representations of all other nodes in the two augmentations are considered negative pairs. Therefore, we define the pairwise objective for each positive pair $(\bm{z}_{a},\bm{o}_{a})$ as

\mathcal{L}_{a}(\bm{z},\bm{o})=\log\frac{e^{\theta\left(\bm{z}_{a},\bm{o}_{a}\right)}}{e^{\theta\left(\bm{z}_{a},\bm{o}_{a}\right)}+\sum_{b\neq a}(e^{\theta\left(\bm{z}_{a},\bm{o}_{b}\right)}+e^{\theta\left(\bm{z}_{a},\bm{z}_{b}\right)})},   (19)

where the critic $\theta(\bm{z},\bm{o})$ is defined as $sim(\bm{H}_{\bm{z}},\bm{H}_{\bm{o}})$, and $sim(\cdot,\cdot)$ denotes the cosine similarity function. $\bm{H}_{\bm{z}}$ is the embedding generated from graph augmentation $\bm{z}$ by the proposed method ASWT-SGNN. The overall objective to be maximized is the average over all positive pairs, yielding the loss

\mathcal{L}=-\frac{1}{2N}\sum_{a=1}^{N}\left[\mathcal{L}_{a}\left(\bm{z},\bm{o}\right)+\mathcal{L}_{a}\left(\bm{o},\bm{z}\right)\right].   (20)
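
A hedged PyTorch sketch of Eqs. (19)-(20) with cosine similarity as the critic; Eq. (19) includes no temperature, so none is used here, and the function names are ours:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(Hz, Ho):
    """Symmetric objective of Eq. (20) over two view embeddings Hz, Ho of shape (N, d)."""
    Hz = F.normalize(Hz, dim=1)               # cosine similarity via normalized dot products
    Ho = F.normalize(Ho, dim=1)

    def one_direction(Z, O):
        cross = torch.exp(Z @ O.t())          # e^{theta(z_a, o_b)}
        intra = torch.exp(Z @ Z.t())          # e^{theta(z_a, z_b)}
        pos = cross.diag()                    # positive pairs (z_a, o_a)
        neg = cross.sum(1) - pos + intra.sum(1) - intra.diag()   # all b != a terms
        return torch.log(pos / (pos + neg))   # per-node L_a of Eq. (19)

    return -0.5 * (one_direction(Hz, Ho) + one_direction(Ho, Hz)).mean()
```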

Complexity Analysis

The first step of ASWT-SGNN is the wavelet coefficient approximation. Eq. (8) shows that the direct solution requires an eigen-decomposition of the Laplacian matrix to obtain its eigenvalues and eigenvectors, and the complexity of this direct solution is very high: for example, the time complexity of the QR algorithm is $\mathcal{O}(N^{3})$ and its space complexity is $\mathcal{O}(N^{2})$. Therefore, we use the least squares approximation in this step. If an $m$-order Chebyshev polynomial approximation is used, the total complexity of evaluating the polynomial terms is $\mathcal{O}(m\times|E|)$.

The second step of ASWT-SGNN is to use the wavelet transform for graph convolution. Spectral CNN (Bruna et al. 2013) has a high parameter complexity of $\mathcal{O}(N\times p\times q)$. ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) approximates the convolution kernel by a polynomial function of the diagonal matrix of Laplacian eigenvalues, reducing the parameter complexity to $\mathcal{O}(m\times p\times q)$, where $m$ is the order of the polynomial function. GCN (Kipf and Welling 2016) simplifies ChebyNet by setting $m=1$. In this paper, the feature transformation is performed first, with parameter complexity $\mathcal{O}(p\times q)$; then the graph convolution is performed, with parameter complexity $\mathcal{O}(N)$.

Theoretical Analysis

Lemma 1.

Consider nodes $v_{a}$ and $v_{b}$ within a graph, characterized by their similarity in features or labels. Consequently, their $k$-hop neighbors exhibit a one-to-one mapping, specifically $\mathcal{N}_{k}(b)=\pi(\mathcal{N}_{k}(a))$. In this context, it holds that $|\bm{\Psi}_{am}-\bm{\Psi}_{b\pi(m)}|\leq 2\epsilon$, where $\bm{\Psi}_{am}$ denotes the wavelet coefficient between nodes $v_{a}$ and $v_{m}$.

Proof.

The proof is shown in Appendix A. ∎

This lemma shows that nodes with similar network neighborhoods have similar wavelet coefficients.

Theorem 1.

If the graph follows the above assumption and Lemma 1, the expectation of the embedding is given by:

\mathbb{E}\left[\bm{F}_{i}\right]=\bm{W}\,\mathbb{E}_{y\sim P\left(y_{i}\right),\bm{x}\sim P_{y}(\bm{x})}[\bm{\Gamma}_{i}\bm{x}],   (21)

where $\bm{\Gamma}=\bm{\Psi}\bm{G}\bm{\Psi}^{\top}$ and $\bm{\Gamma}_{i}$ is the $i$-th row of $\bm{\Gamma}$. In this context, we simplify the model by excluding the residual connection component. With probability at least $1-\delta$ over the distribution of the graph, we have:

\left\|\bm{H}_{i}-\mathbb{E}\left[\bm{H}_{i}\right]\right\|_{2}\leq\sqrt{\frac{\sigma_{\max}^{2}(\bm{W})\,p\log(2p/\delta)}{2N\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}}},   (22)

where the sub-Gaussian norm $\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}=\min_{d}\left\|\bm{\Gamma}_{i}\bm{x}_{i,d}\right\|_{\psi_{2}}$, $p$ is the dimension of the features, $d\in[1,p]$, and $\sigma^{2}_{\max}(\bm{W})$ is the largest singular value of $\bm{W}$.

Proof.

The proof is shown in Appendix A. ∎

The theorem stated above indicates that the proposed method can map nodes with the same label to an area centered around the expectation in the embedding space. This holds true for any graph in which each node’s feature and neighborhood pattern are sampled from distributions that depend on the node label.

Table 1: Accuracy (%) on the eight datasets for the node classification. The best result is bold, and the second best is underlined.

Method             Cora       CiteSeer   PubMed     Computers  Photo      CS         Physics    WikiCS     Avg.
GCN                81.6±0.2   70.3±0.4   79.3±0.2   84.5±0.3   91.6±0.3   93.1±0.3   93.7±0.2   73.0±0.1   83.4
GAT                83.1±0.3   72.4±0.3   79.5±0.1   85.8±0.1   91.7±0.2   89.5±0.3   93.5±0.3   72.6±0.3   83.5
GWNN               82.8±0.3   71.8±0.2   79.9±0.3   85.6±0.2   92.5±0.1   92.6±0.4   92.7±0.2   72.8±0.3   83.8
GCNII              85.5±0.4   73.4±0.6   80.4±0.3   87.6±0.4   92.7±0.2   92.8±0.4   93.5±0.2   74.7±0.3   85.1
GMI                83.3±0.2   72.6±0.2   79.8±0.4   82.2±0.1   90.7±0.2   92.6±0.2   94.3±0.4   74.9±0.2   83.8
MVGRL              83.1±0.2   72.3±0.5   80.3±0.5   87.5±0.1   91.7±0.1   92.1±0.3   95.1±0.2   77.5±0.1   84.9
GCA-SSG            83.9±0.4   73.1±0.3   81.3±0.4   88.4±0.3   89.5±0.1   92.4±0.1   93.4±0.2   78.2±0.3   85.0
GRADE              84.0±0.3   72.4±0.4   82.7±0.3   84.7±0.1   92.6±0.1   92.7±0.4   93.7±0.2   78.1±0.2   85.1
AF-GCL             83.1±0.1   71.9±0.4   79.0±0.7   89.6±0.2   92.5±0.3   92.0±0.1   95.2±0.2   79.0±0.5   85.3
MA-GCL             83.3±0.4   73.6±0.4   83.5±0.7   88.8±0.3   93.8±0.3   92.5±0.4   94.8±0.5   78.7±0.5   86.1
GraphMAE           84.2±0.4   73.4±0.4   81.1±0.4   89.5±0.1   93.2±0.1   92.7±0.2   94.3±0.4   78.9±0.2   85.9
MaskGAE            84.3±0.4   73.8±0.8   83.6±0.5   89.5±0.1   93.3±0.1   92.7±0.5   94.1±0.4   78.4±0.2   86.2
ASWT-SGNN(Semi)    88.1±0.4   81.5±0.3   85.2±0.6   89.4±0.4   93.8±0.2   93.2±0.3   94.9±0.4   79.8±0.5   88.2
ASWT-SGNN(CL)      86.2±0.6   73.9±0.1   84.9±0.3   89.2±0.3   93.5±0.2   93.5±0.3   95.4±0.3   79.5±0.3   87.0

Experiments

Experimental Setting

Datasets We evaluate the approach on eight benchmark datasets that have been widely used in GCL methods. Specifically, the citation datasets are Cora, CiteSeer, and PubMed (Yang, Cohen, and Salakhudinov 2016); the co-purchase and co-author datasets are Photo, Computers, CS, and Physics (Suresh et al. 2021); and the Wikipedia dataset is WikiCS (Mernyei and Cangea 2020).

Baselines We consider several baseline methods for the node classification task. These include semi-supervised learning methods like GCN (Kipf and Welling 2016), GAT (Velickovic et al. 2017) and GCNII (Chen et al. 2020) and wavelet neural network GWNN (Xu et al. 2019). Furthermore, we evaluate six GCL methods and two graph generation learning methods, which are GMI (Peng et al. 2020), MVGRL (Hassani and Khasahmadi 2020), GCA-SSG (Zhang et al. 2021), GRADE (Wang et al. 2022b), AF-GCL (Wang et al. 2022a), MA-GCL (Gong, Yang, and Shi 2023), GraphMAE (Hou et al. 2022), and MaskGAE (Li et al. 2023). These methods represent state-of-the-art approaches in the field of node classification tasks.

Evaluation protocol. For the ASWT-SGNN model, node representations are learned unsupervised using a 2-layer model. A linear classifier is then applied as a post-processing step for assessment. The dataset is randomly partitioned, with 20% of nodes allocated to the training set, another 20% to the validation set, and the remaining 60% to the test set. To ensure the robustness of our findings, we run each experiment five times per dataset with different random seeds and report the average accuracy and the corresponding standard deviation. All experiments use PyTorch on a server with four NVIDIA A40 GPUs (48 GB memory each). ASWT-SGNN is trained with the Adam optimizer and a learning rate of 0.001. The specific hyperparameters are as follows: the number of sampling points in the spectral domain, $K$, is set to 20, the feature update ratio, $\alpha$, is set to 0.8, and the wavelet terms ratio, $\beta$, is set to 0.4.

Wavelet operator Approximation Experiments

The proposed method for approximating wavelet operators is applied to real-world graph data. We evaluate its performance using two distinct metrics: the similarity between the approximated wavelet filter $g_{\theta}$ and the exactly computed wavelet filter, and the Mean Absolute Error (MAE) between the approximated wavelet operator $\bm{\Psi}_{\theta}$ and the actual wavelet operator. Figure 2 presents a specific example comparing the exactly computed and the approximated multiscale filters, demonstrating a significant overlap between them. The MAEs for different scaled training sets are depicted in Figure 3. The experimental results indicate that the proposed wavelet operator approximation method reduces the approximation error in high-density spectral regions while avoiding the need for computationally expensive eigen-decomposition.

Figure 2: Wavelet filters $g_{\theta}$ for the Cora dataset obtained by the actual wavelet and by the polynomial approximation.

Figure 3: MAE between the approximated and the actual wavelet operator at the eigenvalues (left). MAE between the predicted labels and the actual labels of the nodes on the Cora dataset (right).

Node Classification

Table 1 displays the node classification accuracy results. It is worth noting that our proposed ASWT-SGNN achieves state-of-the-art (SOTA) performance on six of the eight graph benchmarks. Specifically, compared to other self-supervised methods across the eight datasets, ASWT-SGNN outperforms the best self-supervised method, MaskGAE, by an average of 0.8%, and it outperforms the worst self-supervised method, GMI, by 3.2%. Furthermore, experiments carried out in a semi-supervised setting reveal that the proposed method consistently outperforms the best semi-supervised benchmark, GCNII, by an average margin of 3.1%, and surpasses the best self-supervised benchmark, MaskGAE, by an average of 2.0%. These results further demonstrate the effectiveness of the proposed ASWT-SGNN in node classification tasks.

Ablation Studies

The proposed method incorporates two pivotal hyperparameters: $\alpha$ and $\beta$. By adjusting these parameters, ASWT-SGNN can be reduced to several simpler forms: when $\alpha$ is set to 1 and $\beta$ to 0, ASWT-SGNN aligns with GCN; setting both $\alpha$ and $\beta$ to 1 makes ASWT-SGNN behave similarly to GWNN (Xu et al. 2019); and when $\alpha\neq 1$ and $\beta$ is set to 0, ASWT-SGNN exhibits similarities to GCNII (Chen et al. 2020). We comprehensively compare ASWT-SGNN with these reduced models, namely GCN, GWNN, and GCNII. Clear observations can be made from Figure 4: as the number of layers increases, the performance of GCN and GWNN declines significantly. In contrast, the performance of GCNII and ASWT-SGNN remains relatively stable even with more layers stacked. Notably, due to the incorporation of graph wavelet bases, the ASWT-SGNN model outperforms GCNII in semi-supervised node classification tasks.

Figure 4: Ablation studies. The degree of over-smoothing under various parameter settings.

Figure 5: Classification accuracy of the proposed method at different parameter settings ($\alpha$ and $\beta$).

Figure 6: Intra-class distance heatmaps and node embedding visualizations on the Cora dataset: (a) ASWT-SGNN, (b) MA-GCL.

Other Experiments

To comprehensively investigate the impact of the parameters $\alpha$ and $\beta$ on model performance, we systematically vary their values from 0 to 1. The results are presented in Figure 5 and reveal several noticeable trends. When $\alpha$ is set to 0, indicating the exclusion of residual connections, accuracy decreases significantly. On the other hand, larger values of $\alpha$ result in fixed node representations, leading to lower performance. $\beta$ represents the proportion of the graph wavelet basis; if it is too small, the model may not effectively extract local information, while excessively large values of $\beta$ could cause global information to be neglected, ultimately reducing performance. These findings highlight the complex relationship between $\alpha$ and $\beta$ in shaping the model's performance. This delicate balance allows our model to effectively synthesize local and global information and achieve optimal performance.

Furthermore, we use t-SNE (Van der Maaten and Hinton 2008) to visualize the node embeddings. Figure 6 illustrates that, compared to MA-GCL, ASWT-SGNN exhibits more pronounced gaps between different classes. This suggests that ASWT-SGNN captures more detailed class information and clarifies the boundaries between samples from different classes. Further extensive experimental analyses, including sparsity, robustness, and uncertainty analyses, are presented in Appendix B.

Conclusion

This paper introduces an adaptive graph wavelet self-supervised neural network called ASWT-SGNN. By utilizing multiple wavelet scales, the model integrates different levels of localization on the graph, enabling the capture of elements beyond the low-frequency ones. To avoid the expensive eigen-decomposition in the spectral domain, the model employs a polynomial approximation of the wavelet operator. Comprehensive experimental results demonstrate the competitiveness of the proposed method against state-of-the-art GCL models on real graph datasets. As our framework applies to all message-passing GNNs and polynomial graph filters, we plan to extend its application to more intricate graph neural architectures. Moreover, we will also consider larger datasets as more GPU resources become available.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (No.62106259, No.62076234), Beijing Outstanding Young Scientist Program (NO.BJJWZYJH012019100020098), and Beijing Natural Science Foundation (No. 4222029).

References

  • Behmanesh et al. (2022) Behmanesh, M.; Adibi, P.; Ehsani, S. M. S.; and Chanussot, J. 2022. Geometric multimodal deep learning with multiscaled graph wavelet convolutional network. IEEE Transactions on Neural Networks and Learning Systems, 1–15.
  • Bruna et al. (2013) Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv:1312.6203.
  • Chen et al. (2020) Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; and Li, Y. 2020. Simple and deep graph convolutional networks. In International conference on machine learning, 1725–1735. PMLR.
  • Defferrard, Bresson, and Vandergheynst (2016) Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 3844–3852.
  • Di Napoli, Polizzi, and Saad (2016) Di Napoli, E.; Polizzi, E.; and Saad, Y. 2016. Efficient estimation of eigenvalue counts in an interval. Numerical Linear Algebra with Applications, 23(4): 674–692.
  • Ding et al. (2022) Ding, K.; Xu, Z.; Tong, H.; and Liu, H. 2022. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explorations Newsletter, 24(2): 61–77.
  • Du et al. (2019) Du, S. S.; Hou, K.; Salakhutdinov, R. R.; Poczos, B.; Wang, R.; and Xu, K. 2019. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems, 32.
  • Fan et al. (2020) Fan, T.; Shuman, D. I.; Ubaru, S.; and Saad, Y. 2020. Spectrum-adapted polynomial approximation for matrix functions with applications in graph signal processing. Algorithms, 13(11): 295.
  • Gong, Yang, and Shi (2023) Gong, X.; Yang, C.; and Shi, C. 2023. Ma-gcl: Model augmentation tricks for graph contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 4284–4292.
  • Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Advances in neural information processing systems, 30.
  • Hammond, Vandergheynst, and Gribonval (2011) Hammond, D. K.; Vandergheynst, P.; and Gribonval, R. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2): 129–150.
  • Hasanzadeh et al. (2019) Hasanzadeh, A.; Hajiramezanali, E.; Narayanan, K.; Duffield, N.; Zhou, M.; and Qian, X. 2019. Semi-implicit graph variational auto-encoders. Advances in neural information processing systems, 32.
  • Hassani and Khasahmadi (2020) Hassani, K.; and Khasahmadi, A. H. 2020. Contrastive multi-view representation learning on graphs. In International conference on machine learning, 4116–4126. PMLR.
  • He et al. (2022) He, D.; Liang, C.; Liu, H.; Wen, M.; Jiao, P.; and Feng, Z. 2022. Block modeling-guided graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 4022–4029.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9726–9735. IEEE Computer Society.
  • Hou et al. (2022) Hou, Z.; Liu, X.; Cen, Y.; Dong, Y.; Yang, H.; Wang, C.; and Tang, J. 2022. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 594–604.
  • Hutchinson (1989) Hutchinson, M. F. 1989. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3): 1059–1076.
  • Kaler et al. (2022) Kaler, T.; Stathas, N.; Ouyang, A.; Iliopoulos, A.-S.; Schardl, T.; Leiserson, C. E.; and Chen, J. 2022. Accelerating training and inference of graph neural networks with fast sampling and pipelining. Proceedings of Machine Learning and Systems, 4: 172–189.
  • Kipf and Welling (2016) Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
  • Kumar, Rawat, and Chauhan (2022) Kumar, P.; Rawat, P.; and Chauhan, S. 2022. Contrastive self-supervised learning: review, progress, challenges and future research directions. International Journal of Multimedia Information Retrieval, 11(4): 461–488.
  • Li et al. (2023) Li, J.; Wu, R.; Sun, W.; Chen, L.; Tian, S.; Zhu, L.; Meng, C.; Zheng, Z.; and Wang, W. 2023. What’s Behind the Mask: Understanding Masked Graph Modeling for Graph Autoencoders. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1268–1279.
  • Mernyei and Cangea (2020) Mernyei, P.; and Cangea, C. 2020. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv:2007.02901.
  • Peng et al. (2020) Peng, Z.; Huang, W.; Luo, M.; Zheng, Q.; Rong, Y.; Xu, T.; and Huang, J. 2020. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, 259–270.
  • Shervashidze et al. (2011) Shervashidze, N.; Schweitzer, P.; Van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(9).
  • Shi et al. (2020) Shi, M.; Tang, Y.; Zhu, X.; and Liu, J. 2020. Multi-label graph convolutional network representation learning. IEEE Transactions on Big Data, 8(5): 1169–1181.
  • Shuman et al. (2013) Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine, 30(3): 83–98.
  • Sun et al. (2019) Sun, F.-Y.; Hoffmann, J.; Verma, V.; and Tang, J. 2019. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv:1908.01000.
  • Suresh et al. (2021) Suresh, S.; Li, P.; Hao, C.; and Neville, J. 2021. Adversarial graph augmentation to improve graph contrastive learning. Advances in Neural Information Processing Systems, 34: 15920–15933.
  • Tang and Saad (2012) Tang, J. M.; and Saad, Y. 2012. A probing method for computing the diagonal of a matrix inverse. Numerical Linear Algebra with Applications, 19(3): 485–501.
  • Thakoor et al. (2021) Thakoor, S.; Tallec, C.; Azar, M. G.; Munos, R.; Veličković, P.; and Valko, M. 2021. Bootstrapped representation learning on graphs. In ICLR 2021 Workshop on Geometrical and Topological Representation Learning.
  • Tsitsvero, Barbarossa, and Di Lorenzo (2016) Tsitsvero, M.; Barbarossa, S.; and Di Lorenzo, P. 2016. Signals on graphs: Uncertainty principle and sampling. IEEE Transactions on Signal Processing, 64(18): 4845–4860.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  • Velickovic et al. (2017) Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y.; et al. 2017. Graph attention networks. arXiv:1710.10903.
  • Velickovic et al. (2019) Velickovic, P.; Fedus, W.; Hamilton, W. L.; Liò, P.; Bengio, Y.; and Hjelm, R. D. 2019. Deep Graph Infomax. ICLR (Poster), 2(3): 4.
  • Wang et al. (2022a) Wang, H.; Zhang, J.; Zhu, Q.; and Huang, W. 2022a. Augmentation-Free Graph Contrastive Learning with Performance Guarantee. arXiv:2204.04874.
  • Wang et al. (2022b) Wang, R.; Wang, X.; Shi, C.; and Song, L. 2022b. Uncovering the Structural Fairness in Graph Contrastive Learning. Advances in Neural Information Processing Systems, 35: 32465–32473.
  • Wang et al. (2019) Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5): 1–12.
  • Wang et al. (2016) Wang, Z.; Wang, C.; Pei, J.; Ye, X.; and Philip, S. Y. 2016. Causality Based Propagation History Ranking in Social Networks. In IJCAI, 3917–3923.
  • Xie et al. (2022) Xie, Y.; Xu, Z.; Zhang, J.; Wang, Z.; and Ji, S. 2022. Self-supervised learning of graph neural networks: A unified review. IEEE transactions on pattern analysis and machine intelligence, 45(2): 2412–2429.
  • Xu et al. (2019) Xu, B.; Shen, H.; Cao, Q.; Qiu, Y.; and Cheng, X. 2019. Graph wavelet neural network. arXiv:1904.07785.
  • Xu et al. (2018) Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
  • Yanardag and Vishwanathan (2015) Yanardag, P.; and Vishwanathan, S. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 1365–1374.
  • Yang, Cohen, and Salakhudinov (2016) Yang, Z.; Cohen, W.; and Salakhudinov, R. 2016. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, 40–48. PMLR.
  • Ying et al. (2018) Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31.
  • You et al. (2020) You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; and Shen, Y. 2020. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33: 5812–5823.
  • Yu, Lee, and Sohn (2020) Yu, B.; Lee, Y.; and Sohn, K. 2020. Forecasting road traffic speeds by considering area-wide spatio-temporal dependencies based on a graph convolutional neural network (GCN). Transportation research part C: emerging technologies, 114: 189–204.
  • Zeng et al. (2019a) Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. 2019a. Accurate, efficient and scalable graph embedding. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 462–471. IEEE.
  • Zeng et al. (2019b) Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. 2019b. Graphsaint: Graph sampling based inductive learning method. arXiv:1907.04931.
  • Zhang et al. (2021) Zhang, H.; Wu, Q.; Yan, J.; Wipf, D.; and Yu, P. S. 2021. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34: 76–89.
  • Zhu et al. (2021) Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; and Wang, L. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, 2069–2080.

Appendix A: Details of Methodology

Algorithm of ASWT-SGNN

Algorithm 1 illustrates the detailed steps of deploying ASWT-SGNN in an instantiation of GCL.

Algorithm 1 Implementation of ASWT-SGNN

Input: Feature matrix $\bm{X}$; Adjacency matrix $\bm{A}$; Laplacian matrix $\bm{L}$; Encoder $f_{\theta}$.
Parameter: Band-pass filter scale interval $[low_{b},high_{b}]$; Low-pass filter scale interval $[low_{l},high_{l}]$; Number of band-pass filters $L$; Number of sampling points in the spectral domain $K$; Number of Rademacher vectors $n_{r}$; Feature update ratio $\alpha$; Wavelet terms ratio $\beta$; Learning rate $\eta$.
Output: Trained encoder $f_{\theta^{*}}$

1:  Draw the low-pass filter scale $s_{0}$ from Uniform($low_{l}$, $high_{l}$);
2:  Draw the band-pass filter scales $\{s_{i}\}_{i=1}^{L}$ from Uniform($low_{b}$, $high_{b}$);
3:  Draw the spectral domain sampling points $\{\xi_{i}\}_{i=1}^{K}$ from the linear interval $[0,2]$;
4:  Calculate the spectral density $\bm{\omega}$ according to Eq. (13);
5:  Initialize the encoder $f_{\theta}$;
6:  for every epoch do
7:      Draw two graph augmentations $\bm{o}$ and $\bm{z}$;
8:      Calculate the wavelet filters $g_{\theta}$ according to Eq. (7);
9:      Calculate the approximate wavelet basis $\Psi_{\theta}$ according to Eq. (14);
10:      View encoding: $\bm{H}_{1}=f_{\theta}(\bm{o},\Psi_{\theta},\bm{A})$, $\bm{H}_{2}=f_{\theta}(\bm{z},\Psi_{\theta},\bm{A})$;
11:      Update parameters $\theta$, $\bm{G}$ and $\bm{W}$ by the contrastive loss $\mathcal{L}(\bm{H}_{1},\bm{H}_{2})$.
12:  end for
13:  return Encoder $f_{\theta^{*}}$

Spectral Density Estimation

During wavelet operator polynomial approximation, a crucial step involves computing the polynomial coefficients through a weighted least squares approximation. The selection of these weights hinges on their proportionality to the spectral density of the graph, defined as follows:

\omega(z)=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}=z).   (23)

Spectral density estimation aims to approximate this function without performing the expensive eigen-decomposition of the graph Laplacian. We opt for the kernel polynomial method to estimate the spectral density function. It involves initially determining an estimate for the cumulative spectral density function $\Omega(z)$ using the formula:

\Omega(z)=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}\leq z).   (24)

For each $\xi$ within the set of $\{\xi_{i}\}_{i=1}^{K}$ linearly spaced points across the spectral domain, we aim to ascertain the count of eigenvalues that are less than or equal to $\xi$. Achieving this goal entails approximating the trace of the eigen-projection $\bm{P}$ (Di Napoli, Polizzi, and Saad 2016), thereby facilitating the computation of an estimate of the eigenvalue count within the interval $[0,\xi_{i}]$.

\bm{P}=\sum_{\lambda_{j}\in[0,\xi_{i}]}\bm{u}_{j}\bm{u}_{j}^{T}.   (25)

The eigenvalues of the projector are restricted to being either zero or one. Consequently, the trace of the projector $\bm{P}$ is equivalent to the count of terms in Eq. (24), which corresponds to the number of eigenvalues within the range $[0,\xi_{i}]$. As a result, the computation of the count of eigenvalues situated within the interval $[0,\xi_{i}]$ is facilitated by determining the trace of the corresponding projector:

tr(\bm{P})=\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}\leq\xi_{i}).   (26)

If the matrix $\bm{P}$ were explicitly provided, we would be capable of directly calculating its trace and obtaining an exact value for $tr(\bm{P})$. However, the projector $\bm{P}$ is often unavailable in practice. Nonetheless, it is feasible to approximate $\bm{P}$ inexpensively by representing it as either a polynomial or a rational function of the Laplacian matrix $\bm{L}$. In order to achieve this, we can interpret $\bm{P}$ as a step function of $\bm{L}$, specifically:

\bm{P}=h(\bm{L}),\quad\text{where}~h(\lambda)=\begin{cases}1&\text{if}~\lambda\leq\xi_{i}\\ 0&\text{otherwise}\end{cases}   (27)

The approximation of $h(\lambda)$ is now attainable using a finite sum $\phi(\lambda)$ comprised of Chebyshev polynomials. In this representation, it becomes feasible to estimate the trace of $\bm{P}$ using a stochastic estimator initially introduced by Hutchinson (Hutchinson 1989) and subsequently refined by more recent contributions (Tang and Saad 2012). Hutchinson's unbiased estimator employs matrix-vector products exclusively to approximate the trace of a matrix. The foundational principle hinges on applying independent and identically distributed (i.i.d.) Rademacher random variables. In this approach, each entry of the randomly generated vectors $\bm{R}$ assumes either $-1$ or $1$ with equal probability $1/2$. Hutchinson established a lemma demonstrating that $\mathbb{E}(\bm{R}^{\top}\bm{P}\bm{R})=tr(\bm{P})$. Consequently, an estimate of the trace $tr(\bm{P})$ is derived by generating $n_{r}$ samples of random vectors $\bm{R}_{k}$, where $k=1,\cdots,n_{r}$, and subsequently computing the average of $\bm{R}^{\top}\bm{P}\bm{R}$ across these samples.

tr(\bm{P})\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\bm{P}\bm{R}_{k}.   (28)

In the polynomial filtering approach, the step function h(λ)h(\lambda) is expanded into a degree mm Chebyshev polynomial series:

h(λ)ϕ(λ)=j=0mγjTj(λ),h(\lambda)\approx\phi(\lambda)=\sum_{j=0}^{m}\gamma_{j}T_{j}(\lambda), (29)

where Tj(λ)T_{j}(\lambda) are the jj-degree Chebyshev polynomials, and the coefficients γj\gamma_{j} are the expansion coefficients of the step function hh which are known to be

γj={1π(arccos(a)arccos(b)),ifj=02π(sin(jarccos(a))sin(jarccos(b))j),ifj>0.\gamma_{j}=\left\{\begin{array}[]{ll}\frac{1}{\pi}\left(\arccos(a)-\arccos(b)\right),&\text{if}~{}j=0\\ \frac{2}{\pi}\left(\frac{\sin(j\arccos(a))-\sin(j\arccos(b))}{j}\right),&\text{if}~{}j>0.\end{array}\right. (30)

As a result, we obtain an expansion of 𝑷\bm{P} into matrices Tj(𝑳)T_{j}(\bm{L}).

h(𝑳)ϕ(𝑳)=j=0mγjTj(𝑳).h(\bm{L})\approx\phi(\bm{L})=\sum_{j=0}^{m}\gamma_{j}T_{j}(\bm{L}). (31)

To mitigate the adverse oscillations near the boundaries (Gibbs oscillations) that arise from expanding h(λ), a common practice is to introduce damping multipliers known as Jackson coefficients. Eq. (31) is then replaced by:

𝑷=h(𝑳)ϕ(𝑳)=j=0mgjmγjTj(𝑳).\bm{P}=h(\bm{L})\approx\phi(\bm{L})=\sum_{j=0}^{m}g_{j}^{m}\gamma_{j}T_{j}(\bm{L}). (32)

Note that the matrix polynomial for the conventional Chebyshev approach has the same form as above, with the Jackson coefficients g_j^m uniformly set to one, so we use the same notation for both expansions. The Jackson coefficients are given by:

gjm=(1jm+2)sin(αm)cos(jαm)sin(αm)+1m+2cos(αm)sin(jαm)sin(αm),\begin{split}g_{j}^{m}=&\frac{\left(1-\frac{j}{m+2}\right)\sin(\alpha_{m})\cos(j\alpha_{m})}{\sin(\alpha_{m})}\\ &+\frac{\frac{1}{m+2}\cos(\alpha_{m})\sin(j\alpha_{m})}{\sin(\alpha_{m})},\end{split} (33)

where αm=πm+2\alpha_{m}=\frac{\pi}{m+2}. It is noteworthy that these coefficients can also be expressed in a slightly more concise form:

gjm=sin(j+1)αm(m+2)sinαm+(1j+1m+2)cos(jαm).g_{j}^{m}=\frac{\sin(j+1)\alpha_{m}}{(m+2)\sin\alpha_{m}}+\left(1-\frac{j+1}{m+2}\right)\cos(j\alpha_{m}). (34)
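Since Eqs. (33) and (34) are algebraically equivalent, a quick numerical check can be useful; the sketch below (illustrative only, not taken from the released code) computes the Jackson coefficients from Eq. (34) and verifies them against the longer form in Eq. (33).

import numpy as np

def jackson_coefficients(m):
    """Jackson damping factors g_j^m, j = 0..m, from Eq. (34)."""
    alpha = np.pi / (m + 2)
    j = np.arange(m + 1)
    return (np.sin((j + 1) * alpha) / ((m + 2) * np.sin(alpha))
            + (1 - (j + 1) / (m + 2)) * np.cos(j * alpha))

def jackson_coefficients_long(m):
    """Same factors, written as in Eq. (33)."""
    alpha = np.pi / (m + 2)
    j = np.arange(m + 1)
    return ((1 - j / (m + 2)) * np.sin(alpha) * np.cos(j * alpha)
            + np.cos(alpha) * np.sin(j * alpha) / (m + 2)) / np.sin(alpha)

print(np.allclose(jackson_coefficients(8), jackson_coefficients_long(8)))  # True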

Substituting the expression for ϕ(𝑳) into the stochastic estimator (Eq. (28)) yields the estimate:

\sum_{j=1}^{N}\{\mathbb{I}(\lambda_{j}\leq\xi_{i})\}=tr(\bm{P})\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\left[\sum_{j=0}^{m}g_{j}^{m}\gamma_{j}\bm{R}_{k}^{T}T_{j}(\bm{L})\bm{R}_{k}\right]. (35)

An approximation 𝛀 of the cumulative spectral density function is then obtained by monotone piecewise-cubic interpolation of these estimates at the points ξ_i:

𝛀=({(ξi,1N[1nrk=1nr𝑹kϕ(𝑳sym)𝑹k])}i=1K).\bm{\Omega}=\mathcal{I}\left(\left\{\left(\xi_{i},\frac{1}{N}\left[\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\phi({\bm{L}_{sym}})\bm{R}_{k}\right]\right)\right\}_{i=1}^{K}\right). (36)

Finally, the approximation 𝝎 of the spectral density is obtained by differentiating 𝛀 with respect to 𝝃.
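To make the overall procedure concrete, the following Python sketch strings Eqs. (28)-(36) together using NumPy and SciPy. It assumes the symmetric Laplacian has already been affinely rescaled so that its spectrum lies in [−1,1]; the function names, probe counts, and this rescaling convention are illustrative assumptions rather than details of the released implementation.

import numpy as np
from scipy.interpolate import PchipInterpolator

def step_coefficients(a, b, m):
    """Chebyshev expansion coefficients of the step function on [a, b] (Eq. (30))."""
    j = np.arange(1, m + 1)
    gamma = np.empty(m + 1)
    gamma[0] = (np.arccos(a) - np.arccos(b)) / np.pi
    gamma[1:] = (2.0 / np.pi) * (np.sin(j * np.arccos(a)) - np.sin(j * np.arccos(b))) / j
    return gamma

def jackson_damping(m):
    """Jackson damping factors g_j^m (Eq. (34))."""
    alpha = np.pi / (m + 2)
    j = np.arange(m + 1)
    return (np.sin((j + 1) * alpha) / ((m + 2) * np.sin(alpha))
            + (1 - (j + 1) / (m + 2)) * np.cos(j * alpha))

def eigenvalue_count(L, b, m=50, n_r=10, seed=0):
    """Jackson-Chebyshev estimate of the number of eigenvalues of L in [-1, b] (Eq. (35))."""
    rng = np.random.default_rng(seed)
    N = L.shape[0]
    coef = jackson_damping(m) * step_coefficients(-1.0, b, m)
    total = 0.0
    for _ in range(n_r):
        R = rng.choice([-1.0, 1.0], size=N)      # Rademacher probe vector
        T_prev, T_curr = R, L @ R                # T_0(L)R and T_1(L)R
        moments = [R @ T_prev, R @ T_curr]
        for _ in range(2, m + 1):
            T_prev, T_curr = T_curr, 2 * (L @ T_curr) - T_prev   # Chebyshev recurrence
            moments.append(R @ T_curr)
        total += float(np.dot(coef, moments))
    return total / n_r

def cumulative_spectral_density(L, K=20, **kw):
    """Monotone piecewise-cubic interpolant of the normalized counts (Eq. (36))."""
    N = L.shape[0]
    xi = np.linspace(-1.0, 1.0, K)
    counts = np.array([eigenvalue_count(L, float(b), **kw) for b in xi]) / N
    return PchipInterpolator(xi, counts)

# Example: for the symmetric normalized Laplacian (spectrum in [0, 2]),
# L_resc = L_sym - I has spectrum in [-1, 1] and can be passed directly.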

Proof of Lemma 1

Lemma 1.

Consider nodes v_a and v_b in a graph that are similar in features or labels, and suppose their k-hop neighborhoods admit a one-to-one mapping π, i.e., 𝒩_k(b)=π(𝒩_k(a)). Then |𝚿_am−𝚿_bπ(m)|≤2ϵ, where 𝚿_am denotes the wavelet coefficient between nodes v_a and v_m.

Proof.

The kernel function g used in the graph wavelet transform is defined on the closed interval [0,λ_N]. By the Stone-Weierstrass approximation theorem, any continuous function on a closed interval can be uniformly approximated by a polynomial:

ϵ>0,P(λ)=j=0mαjλj:|g(λ)P(λ)|ϵ.\begin{split}\forall\epsilon>0,\exists P(\lambda)=\sum_{j=0}^{m}\alpha_{j}\lambda^{j}:|g(\lambda)-P(\lambda)|\leq\epsilon.\end{split} (37)

Letting r(λ)=g(λ)−P(λ), we have:

𝚿=(j=0mαj𝑳j)+𝑼r(𝚲)𝑼T.\bm{\Psi}=\left(\sum_{j=0}^{m}\alpha_{j}\bm{L}^{j}\right)+\bm{U}r(\bm{\Lambda})\bm{U}^{T}. (38)

The wavelet coefficients can thus be expressed as a polynomial in the Laplacian matrix plus a residual term. Expanding the 𝑳^j term in the above equation with the binomial theorem gives:

\begin{split}\bm{L}^{j}=(\bm{D}-\bm{A})^{j}=\sum_{i=0}^{j}(-1)^{j-i}C_{j}^{i}\bm{D}^{i}\bm{A}^{j-i},\end{split} (39)

where C_j^i is the binomial coefficient. If the shortest-path distance between two nodes v_a and v_b is greater than j, then 𝑳^j_{ab}=0.
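This locality property is easy to verify numerically. The sketch below (illustrative only) builds the Laplacian of a 6-node path graph and checks that the (a,b) entry of 𝑳^j vanishes whenever the shortest-path distance between v_a and v_b exceeds j.

import numpy as np

# Path graph on 6 nodes: node i is connected to node i+1.
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# The shortest-path distance between nodes 0 and 4 is 4,
# so the (0, 4) entry of L^j is zero for j = 1, 2, 3 and nonzero for j >= 4.
for j in range(1, 6):
    Lj = np.linalg.matrix_power(L, j)
    print(j, abs(Lj[0, 4]) > 1e-12)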

Next, we bound the effect of the approximation residual r(𝚲). Since the eigenvector matrix 𝑼 is orthonormal and |r(λ)|≤ϵ, the Cauchy-Schwarz inequality gives:

|δaT𝑼r(𝚲)𝑼Tδb|2(=1N|r(λ)|2𝑼a2)(=1N𝑼b2)ϵ2.\begin{split}\left|\delta_{a}^{T}\bm{U}r(\bm{\Lambda})\bm{U}^{T}\delta_{b}\right|^{2}\leq\left(\sum_{\ell=1}^{N}\left|r\left(\lambda_{\ell}\right)\right|^{2}\bm{U}_{a\ell}^{2}\right)\left(\sum_{\ell=1}^{N}\bm{U}_{b\ell}^{2}\right)\leq\epsilon^{2}.\end{split} (40)

Using Eq. (38), we write the difference between each pair of mapped coefficients 𝚿_am and 𝚿_bπ(m) in terms of the polynomial approximation of the graph Laplacian:

\begin{split}\left|\bm{\Psi}_{ma}-\bm{\Psi}_{\pi(m)b}\right|&=\left|\delta_{m}^{T}\bm{U}(P(\bm{\Lambda})+r(\bm{\Lambda}))\bm{U}^{T}\delta_{a}-\delta_{\pi(m)}^{T}\bm{U}(P(\bm{\Lambda})+r(\bm{\Lambda}))\bm{U}^{T}\delta_{b}\right|\\ &\leq\left|\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{ma}-\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b}\right|+\left|\left(\bm{U}r(\bm{\Lambda})\bm{U}^{T}\right)_{ma}\right|+\left|\left(\bm{U}r(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b}\right|.\end{split} (41)

Since the k-hop neighborhoods around v_a and v_b are mapped onto each other by π, the following holds:

\begin{split}\forall m\in\mathcal{N}_{k}(a),\ \left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{ma}&=\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b},\\ \forall m\notin\mathcal{N}_{k}(a),\ \left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{ma}&=\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b}=0.\end{split} (42)

Therefore, we have |𝚿am𝚿bπ(m)|2ϵ|\bm{\Psi}_{am}-\bm{\Psi}_{b\pi(m)}|\leq 2\epsilon according to Eq. (42). ∎

Proof of Theorem 1

Theorem 1.

Consider a graph satisfying the above graph assumption and Lemma 1. Then the expectation of the embedding is given by:

𝔼[𝑯i]=𝑾𝔼yP(yi),𝒙Py(𝒙)[𝚪i𝒙],\begin{split}\mathbb{E}\left[\bm{H}_{i}\right]=\bm{W}\mathbb{E}_{y\sim P\left(y_{i}\right),\bm{x}\sim P_{y}(\bm{x})}[\bm{\Gamma}_{i}\bm{x}],\end{split} (43)

where 𝚪=𝚿𝐆𝚿\bm{\Gamma}=\bm{\Psi}\bm{G}\bm{\Psi}^{\top}. Thus, with probability at least 1δ1-\delta over the distribution for the graph, we have:

𝑯i𝔼[𝑯i]2σmax2(𝑾)plog(2p/δ)2𝑵𝚪i𝒙ψ2,\begin{split}\left\|\bm{H}_{i}-\mathbb{E}\left[\bm{H}_{i}\right]\right\|_{2}\leq\sqrt{\frac{\sigma_{\max}^{2}(\bm{W})p\log(2p/\delta)}{2\bm{N}\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}}},\end{split} (44)

where the sub-Gaussian norm ‖𝚪_i𝒙‖_{ψ_2} ≡ min_{d∈[1,p]} ‖𝚪_i𝒙_{i,d}‖_{ψ_2}, and σ_max(𝑾) is the largest singular value of 𝑾.

Proof.

We consider a Gaussian mixture model for the node features. For the sake of simplicity, we focus on the binary classification problem. Given the (binary) label yy and a latent vector 𝝁𝒩(𝟎,𝑰p/p)\boldsymbol{\mu}\sim\mathcal{N}\left(\bm{0},\bm{I}_{p}/p\right), where the identity matrix 𝑰pp×p\bm{I}_{p}\in\mathbb{R}^{p\times p}, the features are governed by: 𝒙i=yi𝝁+𝒒ip\bm{x}_{i}=y_{i}\boldsymbol{\mu}+\frac{\bm{q}_{i}}{\sqrt{p}}. Here, the random variable 𝒒ip\bm{q}_{i}\in\mathbb{R}^{p} has independent standard normal entries, and yi{1,1}y_{i}\in\{-1,1\} represents latent classes with abuse of notation. The dimension of the features, denoted as pp, remains constant. Subsequently, the features of nodes with class yiy_{i} follow the same distribution, contingent on yiy_{i}, i.e., xiPyi(x)x_{i}\sim P_{y_{i}}(x).
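For intuition, the following short sketch samples node features from this two-class Gaussian mixture; the labels, latent vector μ, and dimension p follow the definitions above, while the concrete sizes are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 16                                   # number of nodes and feature dimension
mu = rng.normal(0.0, 1.0 / np.sqrt(p), size=p)   # mu ~ N(0, I_p / p)
y = rng.choice([-1, 1], size=N)                  # binary latent classes
Q = rng.standard_normal((N, p))                  # independent standard normal entries
X = y[:, None] * mu[None, :] + Q / np.sqrt(p)    # x_i = y_i * mu + q_i / sqrt(p)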

When the scale is θ\theta, the sum of wavelet coefficients at node ii has the following form:

j=1N𝚿ij=j=1N=1Ngθ(λ)𝑼T(i)𝑼(j)==1Ngθ(λ)𝑼T(i)j=1N𝑼(j).\begin{split}\sum_{j=1}^{N}\bm{\Psi}_{ij}&=\sum_{j=1}^{N}\sum_{\ell=1}^{N}g_{\theta}\left(\lambda_{\ell}\right)\bm{U}_{\ell}^{T}(i)\bm{U}_{\ell}(j)\\ &=\sum_{\ell=1}^{N}g_{\theta}\left(\lambda_{\ell}\right)\bm{U}_{\ell}^{T}(i)\sum_{j=1}^{N}\bm{U}_{\ell}(j).\end{split} (45)

Because the entries of any eigenvector associated with a non-zero eigenvalue of the Laplacian sum to zero, all terms with ℓ>1 in Eq. (45) vanish, and only the term corresponding to the zero eigenvalue remains.

The zero eigenvalue represents the direct-current (DC) component of the graph; its eigenvector is constant and, after normalization, equals 𝑼_1=[1/√N,…,1/√N]. Substituting this into Eq. (45) gives:

j=1N𝚿ij=g(0)=1\begin{split}\sum_{j=1}^{N}\bm{\Psi}_{ij}=g(0)=1\end{split} (46)
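Eq. (46) is easy to verify numerically for a kernel with g(0)=1, such as the heat kernel, on a connected graph; the toy graph and scale below are illustrative choices.

import numpy as np

# Small connected graph: a 5-node cycle.
N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)
theta = 0.7
Psi = U @ np.diag(np.exp(-theta * lam)) @ U.T   # heat-kernel wavelet, g(0) = 1

# On a connected graph every row of Psi sums to g(0) = 1.
print(np.allclose(Psi.sum(axis=1), 1.0))        # True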

Next, we calculate the expectation of the aggregated embedding:

\begin{split}\mathbb{E}\left[\bm{H}_{i}\right]&=\mathbb{E}\left[\bm{W}\sum_{j=1}^{N}\bm{\Gamma}_{ij}\bm{x}_{j}\right]\\ &=\bm{W}\mathbb{E}_{y\sim P\left(y_{i}\right),\bm{x}\sim P_{y}(\bm{x})}[\bm{\Gamma}_{i}\bm{x}],\end{split} (47)

where 𝚪=𝚿𝑮𝚿^⊤. This equality relies on the graph data assumption that 𝒙_j∼P_{y_i}(𝒙) for every j. We now provide a concentration analysis. Because each feature 𝒙_i is a sub-Gaussian variable, Hoeffding’s inequality implies that, with probability at least 1−δ′, for each d∈[1,p] we have

|1𝑵j(𝚪i𝒙j,d𝔼[𝚪i𝒙j,d])|log(2/δ)2𝑵𝚪i𝒙j,dψ2,\begin{split}\left|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right|\leq\sqrt{\frac{\log\left(2/\delta^{\prime}\right)}{2\bm{N}\left\|\bm{\Gamma}_{i}\bm{x}_{j,d}\right\|_{\psi_{2}}}},\end{split} (48)

where ‖𝚪_i𝒙_{j,d}‖_{ψ_2} is the sub-Gaussian norm of 𝚪_i𝒙_{j,d}. Furthermore, because the dimensions of 𝒙_j are mutually independent, we define ‖𝚪_i𝒙_j‖_{ψ_2}=‖𝚪_i𝒙_{j,d}‖_{ψ_2}. We then apply a union bound over the p feature dimensions by setting δ=pδ′. Thus, with probability at least 1−δ, we have

1𝑵j(𝚪i𝒙j,d𝔼[𝚪i𝒙j,d])2p|1𝑵j(𝚪i𝒙j,d𝔼[𝚪i𝒙j,d])|plog(2p/δ)2𝑵𝚪i𝒙ψ2.\begin{split}&\left\|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right\|_{2}\\ \leq&\sqrt{p}\left|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right|\\ \leq&\sqrt{\frac{p\log(2p/\delta)}{2\bm{N}\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}}}.\end{split} (49)

Finally, plugging the weight matrix into the inequality yields

\begin{split}\left\|\bm{H}_{i}-\mathbb{E}\left[\bm{H}_{i}\right]\right\|\leq\sigma_{\max}(\bm{W})\left\|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right\|_{2},\end{split} (50)

where σmax\sigma_{max} is the largest singular value of the weight matrix. ∎

Datasets Cora CiteSeer Pubmed CS Photo Computers Physics WikiCS
Nodes 2708 3327 19717 18333 7650 13752 34493 11701
Edges 5429 4732 44338 81894 119081 245861 247962 216123
Features 1433 3703 500 6805 745 767 8415 300
Classes 5 6 3 15 8 10 5 10
Density of 𝑼\bm{U} 99.15% 95.81% 99.99% 100% 99.66% 99.55% 99.83% 99.20%
Density of 𝚿\bm{\Psi} 0.18% 0.11% 0.03% 0.05% 0.42% 0.27% 0.72% 0.32%
Table 2: The Statistics of Datasets.

Appendix B: Additional Experiments

Datasets

We evaluate our approach on eight datasets widely used in previous GCL methods: Cora, CiteSeer, PubMed, Amazon-Photo, Amazon-Computers, Coauthor-CS, Coauthor-Physics, and WikiCS. The statistics of the datasets are summarized in Table 2.

  • Cora, CiteSeer, PubMed (Yang, Cohen, and Salakhudinov 2016): These datasets are widely used for node classification tasks. They consist of citation networks where nodes represent papers, and edges denote citation relationships.

  • Amazon-Computers and Amazon-Photo (Suresh et al. 2021): These datasets are derived from Amazon’s co-purchase relationships. Nodes correspond to products, and edges indicate frequent co-purchases. Each dataset involves product categorization with 10 and 8 classes based on the product category. The node features are bag-of-words representations of product reviews.

  • Coauthor-CS and Coauthor-Physics (Suresh et al. 2021): These academic networks are constructed from co-authorship relationships in the Microsoft Academic Graph. Nodes denote authors, and edges signify co-authored relationships. The datasets involve author classification into 15 and 5 classes based on research fields. Node features are represented as bag-of-words from paper keywords.

  • WikiCS (Mernyei and Cangea 2020): This reference network is constructed from Wikipedia references. Nodes represent computer science articles, and edges denote hyperlinks between articles. Articles are labeled with ten subfields, and their features consist of the average GloVe embeddings of all words in the article.
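For reference, all eight datasets are available through standard loaders; a minimal loading sketch is given below, assuming the PyTorch Geometric dataset wrappers (the root paths are placeholders, and PyG stores each undirected edge twice, so raw edge counts may differ from Table 2).

from torch_geometric.datasets import Planetoid, Amazon, Coauthor, WikiCS

datasets = {
    "Cora":      Planetoid(root="data/Planetoid", name="Cora"),
    "CiteSeer":  Planetoid(root="data/Planetoid", name="CiteSeer"),
    "PubMed":    Planetoid(root="data/Planetoid", name="PubMed"),
    "Photo":     Amazon(root="data/Amazon", name="Photo"),
    "Computers": Amazon(root="data/Amazon", name="Computers"),
    "CS":        Coauthor(root="data/Coauthor", name="CS"),
    "Physics":   Coauthor(root="data/Coauthor", name="Physics"),
    "WikiCS":    WikiCS(root="data/WikiCS"),
}

for name, ds in datasets.items():
    data = ds[0]  # each dataset contains a single graph
    print(name, data.num_nodes, data.num_edges, data.num_node_features)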

Figure 7: The Mexican Hat wavelet transform of a δ signal on the local node. With different scales, the wavelet can capture different neighborhood information weighted continuously.
Figure 8: The Heat Kernel wavelet transform of a δ signal on the local node. With different scales, the wavelet can capture different neighborhood information weighted continuously.
Figure 9: Illustrations of the spatial and spectral domains trade-off in wavelet basis with different scaling parameters s. (a) Low-pass filter. (b) Band-pass filter. (c) Combined.

Baselines

To ensure consistency, we rely on the code provided by the original authors for all baseline methods, carefully following their recommendations when setting the respective hyperparameters. When specific guidance is lacking, we exercise our judgment to make appropriate adjustments.

  • GCN: https://github.com/tkipf/gcn

  • GAT: https://github.com/PetarV-/GAT

  • GWNN: https://github.com/Eilene/GWNN

  • GCNII: https://github.com/chennnM/GCNII

  • GMI: https://github.com/zpeng27/GMI

  • MVGRL: https://github.com/kavehhassani/mvgrl

  • GCA-SSG: https://github.com/hengruizhang98/CCA-SSG

  • GRADE: https://github.com/BUPT-GAMMA/Uncovering-the-Structural-Fairness-in-Graph-Contrastive-Learning

  • AF-GCL: https://github.com/Namkyeong/AFGRL

  • MA-GCL: https://github.com/GXM1141/MA-GCL

Parameter Setting

We provide the hyperparameter settings used for all eight datasets below; they are also collected into a configuration sketch after the list.

  • Number of band-pass filters: LL=3

  • Number of scales: 5

  • Number of sampling points in the spectral domain: KK=20

  • Number of Rademacher vectors: nrn_{r}=10

  • The ratio of feature updates: α\alpha=0.8

  • The ratio of the graph wavelet terms: β\beta=0.4

  • Feature discard ratio: fdf_{d}=0.2

  • Order of polynomial approximation: mm=3

  • Band-pass filter scale interval: [0,5]

  • Low-pass filter scale interval: [4,6]

  • Learning rate: η\eta=0.001

  • Hidden size: 256

  • Projector size: 128

  • Activation function: ReLU
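For convenience, the same settings can be gathered into a single configuration object; the key names below are illustrative and do not necessarily match the released code.

# Hyperparameters from the list above, gathered into one place
# (key names are illustrative, not taken from the released code).
config = {
    "num_band_pass_filters": 3,        # L
    "num_scales": 5,
    "num_spectral_samples": 20,        # K
    "num_rademacher_vectors": 10,      # n_r
    "feature_update_ratio": 0.8,       # alpha
    "wavelet_term_ratio": 0.4,         # beta
    "feature_drop_ratio": 0.2,         # f_d
    "polynomial_order": 3,             # m
    "band_pass_scale_interval": (0, 5),
    "low_pass_scale_interval": (4, 6),
    "learning_rate": 1e-3,             # eta
    "hidden_size": 256,
    "projector_size": 128,
    "activation": "ReLU",
}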

Sparsity Analysis

We computed the density of the Fourier transform’s eigenvector matrix and of the wavelet transform’s coefficient matrix across the eight datasets. The results in Table 2 show that the wavelet transform is far sparser, with density reduced by roughly two orders of magnitude compared with the Fourier basis. This suggests that the graph wavelet transform offers superior computational efficiency over the graph Fourier transform.
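The densities reported in Table 2 can be reproduced along the following lines; the toy graph, wavelet scale, and the threshold used to decide whether an entry counts as nonzero are assumptions of this sketch.

import numpy as np

def density(M, tol=1e-4):
    """Fraction of entries with magnitude above tol (the tolerance is an assumption)."""
    return float(np.mean(np.abs(M) > tol))

# Toy example: Laplacian of a 30-node ring graph; for the benchmark graphs the
# same computation is applied to the (normalized) Laplacian of each dataset.
N = 30
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)
Psi = U @ np.diag(np.exp(-1.0 * lam)) @ U.T     # heat-kernel wavelet at scale s = 1
print(f"density of U:   {density(U):.2%}")
print(f"density of Psi: {density(Psi):.2%}")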

Uncertainty analysis (Localization analysis)

To explore the trade-off between spatial and spectral localizations, we can start with an impulse signal δi\delta_{i} perfectly localized at the focal node viv_{i}. Theoretically, as ss approaches 0, the wavelet transform becomes the identity matrix, allowing us to recover the original signal with perfect graph concentration. On the other hand, as ss approaches infinity, the graph signal is completely mapped to the lowest frequency, sacrificing graph localization in favor of frequency band concentrations.

For example, consider the Stanford Bunny graph (Figures 7 and 8). The degree of diffusion for Heat Kernel and Mexican Hat wavelets with s=10 is much smaller than with s=50. By choosing s=10, we achieve better frequency localization at the expense of graph localization. Using the notation from Eq. (3) in the body of the paper, we employ a graph-limiting operator that covers the direct neighborhood of the focal vertex, with |𝒮|=6, resulting in concentration in the graph domain.

Figure 9 depicts the trade-off between the spatial and spectral domains of different wavelet bases at various scale parameters using the same dataset as shown in Figure 7. Further investigation reveals that a more balanced choice of scale parameter would be in the [1,3][1,3] range. Within this range, we can significantly sacrifice a small portion of graph localization to improve frequency localization. In practical scenarios, as demonstrated in the experiment section, the selection of the scale parameter depends on the connectivity of the underlying graph topology and the distribution of training labels in both the spatial and spectral domains. This trade-off between spatial and spectral domains also sheds light on the fundamental limitations of graph signal aggregations. Previous studies have discovered that stacking multiple layers of GCN can lead to over-smoothing problems. From the perspective of the uncertainty principle, deeper networks result in a broader spread of signals in the graph domain, which limits the frequency band’s width. This leads to the dominance of low-frequency signals, also known as smooth signals. However, we can precisely control this trade-off with wavelet bases and achieve better results across different datasets.

Formally, the concentration of the wavelet atom in the graph domain on the vertex set 𝒮 is measured by

a=\frac{\left\|\bm{B}\bm{U}g_{s}(\bm{\Lambda})\bm{U}^{\top}\delta_{i}\right\|^{2}_{2}}{\left\|\bm{U}g_{s}(\bm{\Lambda})\bm{U}^{\top}\delta_{i}\right\|^{2}_{2}}, (51)

where 𝑩 denotes the graph-limiting operator introduced above.

Solving for the maximal spectral-domain concentration b for a given a, the maximal possible concentration in the spectral domain becomes (Tsitsvero, Barbarossa, and Di Lorenzo 2016):

b=aλmax+(1a2)(1λmax2).b=a\lambda_{max}+\sqrt{(1-a^{2})(1-\lambda_{max}^{2})}. (52)
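A minimal numerical illustration of Eqs. (51)-(52) is given below; the ring graph stands in for the Stanford Bunny graph, the vertex set 𝒮 and scale are arbitrary, and λ_max is treated as a given scalar (its precise definition via the joint band- and vertex-limiting operator follows the cited reference and is an assumption of this sketch).

import numpy as np

# Toy stand-in for the graphs in Figures 7-9: a 30-node ring.
N = 30
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

s = 10.0                                         # wavelet scale
psi_i = U @ (np.exp(-s * lam) * U[0, :])         # U g_s(Lambda) U^T delta_i for focal node i = 0

# Vertex-limiting operator B: indicator of the focal node and its direct neighbours.
S = [0, 1, N - 1]
B = np.diag(np.isin(np.arange(N), S).astype(float))

a = np.linalg.norm(B @ psi_i) ** 2 / np.linalg.norm(psi_i) ** 2    # Eq. (51)

# lam_max: largest singular value of the joint band- and vertex-limiting operator in the
# cited reference; treated here as a given scalar (an assumption of this sketch).
lam_max = 0.9
b = a * lam_max + np.sqrt((1.0 - a ** 2) * (1.0 - lam_max ** 2))   # Eq. (52)
print(f"graph-domain concentration a = {a:.3f}, spectral bound b = {b:.3f}")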
Figure 10: The visualization of learned features of the Cora dataset during training ASWT-SGNN. Panels (a)-(d) and (e)-(h) correspond to epochs 1, 100, 300, and 500.

Graph Classification

For graph classification, we conduct experiments on three benchmarks: MUTAG, NCI1, and IMDB-BINARY (Yanardag and Vishwanathan 2015). Each dataset is a collection of graphs where each graph is associated with a label.

For the evaluation protocol, after generating graph embeddings with ASWT-SGNN’s encoder and readout function, we feed the encoded graph-level representations into a downstream classifier to predict the label and report the mean 10-fold cross-validation accuracy with standard deviation over five runs.
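One possible reading of this protocol is sketched below with scikit-learn, assuming graph-level embeddings Z and labels y have already been produced by the encoder and readout; the downstream classifier choice and the way the five runs are aggregated are our illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_graph_embeddings(Z, y, n_runs=5, seed=0):
    """Mean and std of 10-fold cross-validation accuracy over n_runs repetitions."""
    accs = []
    for r in range(n_runs):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + r)
        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, Z, y, cv=cv, scoring="accuracy")
        accs.append(scores.mean())
    return float(np.mean(accs)), float(np.std(accs))

# Example with random placeholders for encoder outputs (MUTAG has 188 graphs).
Z = np.random.randn(188, 128)
y = np.random.randint(0, 2, size=188)
print(evaluate_graph_embeddings(Z, y))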

In addition to the classical graph kernel methods, namely the Weisfeiler-Lehman subtree kernel (WL) (Shervashidze et al. 2011) and GNTK (Du et al. 2019), we compare ASWT-SGNN with three supervised methods, GIN (Xu et al. 2018), GraphSAGE (Hamilton, Ying, and Leskovec 2017), and DGCNN (Wang et al. 2019), as well as two self-supervised methods, GraphCL (You et al. 2020) and MVGRL (Hassani and Khasahmadi 2020). The results are presented in Table 3. ASWT-SGNN outperforms all baselines on two datasets and achieves competitive performance on the remaining one. These results suggest that ASWT-SGNN learns meaningful information and holds promise for graph-level tasks.

Table 3: Accuracy (%) on the three datasets for the graph classification. The best result is bold, and the second best is underlined.

Method MUTAG NCI1 IMDB-BINARY Avg.
WL 88.7±7.3 76.7±2.1 72.8±3.9 79.4
GNTK 89.3±8.5 76.8±4.9 75.9±4.6 80.6
GIN 89.4±5.6 80.0±2.4 75.1±2.8 81.5
DGCNN 85.9±1.7 74.1±3.1 69.2±3.4 76.4
GraphSAGE 86.6±1.5 74.8±2.3 68.9±4.5 76.7
GraphCL 86.8±1.3 77.9±2.4 71.1±2.1 78.6
MVGRL 89.4±1.9 79.1±1.7 74.2±2.7 80.9
Ours 89.7±5.2 80.2±3.8 75.7±3.1 81.9

Robustness analyses

To assess the robustness of ASWT-SGNN to adversarial graphs, we conducted experiments on the Cora and Citeseer datasets. We applied random feature masking to the original graphs, varying the ratio of masked features from 0 to 0.8 to simulate different attack intensities. We compared against GRADE, GCA-SSG, and MA-GCL. The results, shown in Figure 11, consistently demonstrate ASWT-SGNN’s superior performance over the other methods. Notably, the performance gains grow as the feature masking rate increases, which suggests that ASWT-SGNN is particularly robust against severe feature attacks.
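The feature-masking attack itself is straightforward to reproduce; the sketch below zeroes a random fraction of feature entries, with the masking granularity (per entry rather than per dimension) being an assumption of this illustration.

import numpy as np

def mask_features(X, mask_ratio, seed=0):
    """Zero out a random subset of feature entries (masking granularity is an assumption)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < mask_ratio
    X_attacked = X.copy()
    X_attacked[mask] = 0.0
    return X_attacked

# Sweep attack intensities from 0 to 0.8 as in Figure 11.
X = np.random.rand(2708, 1433)                 # placeholder for the Cora feature matrix
for ratio in np.arange(0.0, 0.81, 0.2):
    X_adv = mask_features(X, ratio)
    print(ratio, float((X_adv == 0).mean()))   # achieved masking rate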

Figure 11: Test accuracy when graphs are perturbed by feature masking.

Visualization

For ASWT-SGNN, we use heatmaps to show the intra-class and inter-class distances during training on the Cora dataset, as shown in Figure 10(a)-10(d): the intra-class distances shrink and the inter-class distances grow as training progresses. Moreover, the t-SNE algorithm (Van der Maaten and Hinton 2008) is used to visualize the learned embeddings during training (Figure 10(e)-10(h)), where each color corresponds to a different class. As training advances, the feature representations of different categories become more clearly separated.
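The embedding visualization follows the usual recipe; the sketch below applies scikit-learn’s t-SNE to learned node embeddings H with labels y (both random placeholders here) and colors points by class.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# H: learned node embeddings, y: node labels (random placeholders here).
H = np.random.randn(1000, 256)
y = np.random.randint(0, 7, size=1000)

Z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(H)
plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=5)
plt.axis("off")
plt.savefig("tsne_embeddings.png", dpi=300)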