
ASWT-SGNN: Adaptive Spectral Wavelet Transform-based
Self-Supervised Graph Neural Network

Ruyue Liu1,2, Rong Yin1,2, Yong Liu3, Weiping Wang1 (Corresponding author)
Abstract

Graph Contrastive Learning (GCL) is a self-supervised method that combines the advantages of Graph Convolutional Networks (GCNs) and contrastive learning, making it promising for learning node representations. However, the GCN encoders used in these methods rely on the Fourier transform to learn fixed graph representations, which is inherently limited by the uncertainty principle involving spatial and spectral localization trade-offs. To overcome the inflexibility of existing methods and their computationally expensive eigen-decomposition and dense matrix multiplication, this paper proposes an Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network (ASWT-SGNN). The proposed method employs spectral-adaptive polynomials to approximate the filter function and optimizes the wavelet scales with a contrastive loss. This design enables the creation of local filters in both the spectral and spatial domains, allowing flexible aggregation of neighborhood information at various scales and facilitating controlled transformation between local and global information. Compared to existing methods, the proposed approach reduces computational complexity and addresses the limitation of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood scope. Extensive experiments on eight benchmark datasets demonstrate that ASWT-SGNN accurately approximates the filter function in high-density spectral regions, avoiding costly eigen-decomposition. Furthermore, ASWT-SGNN achieves comparable performance to state-of-the-art models in node classification tasks.

Introduction

Graphs are essential in various real-world domains such as social networks, brain networks, transportation networks, and citation networks (Wang et al. 2016; Yu, Lee, and Sohn 2020). Recently, the emergence of graph neural networks (GNNs) (He et al. 2022; Wang et al. 2022b) has attracted much attention due to their success in applications involving graph-structured data such as node classification (Kipf and Welling 2016) and edge prediction (Hasanzadeh et al. 2019). Graph Contrastive Learning (GCL) (Velickovic et al. 2019; Sun et al. 2019) combines the capabilities of GNNs and contrastive learning techniques, making it a promising paradigm in the field of graph analysis (He et al. 2020). Typically, GCL methods generate multiple views by randomly augmenting input data and optimize the GNN encoder by learning consistency across views. GCL reduces the dependence of graph representation learning on human annotations and achieves state-of-the-art performance in tasks such as node classification (Hassani and Khasahmadi 2020; Zhang et al. 2021).

Most graph contrastive learning methods utilize graph convolutional neural networks (GCNs) as encoders (Xie et al. 2022; Wang et al. 2022b). Similar to convolutional neural networks (CNNs) in computer vision, spectral GCNs use Fourier bases in the design of graph-based operators (Bruna et al. 2013). However, these operators are localized in the frequency rather than the spatial domain. Additionally, they require a costly eigen-decomposition and dense matrix multiplications, leading to high computational expense. To address this issue and achieve spatial localization, methods such as ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) and GCN (Kipf and Welling 2016) employ polynomial approximation. While GCN is widely adopted for graph problems due to its impressive performance and computational efficiency (Shi et al. 2020), it encounters limitations and challenges when applied to large graphs, especially in mini-batch settings (Zeng et al. 2019b). To overcome the scaling challenges of GCN on large graphs, researchers have proposed layer sampling methods (Ying et al. 2018; Kaler et al. 2022) and subgraph sampling methods (Zeng et al. 2019b, a). However, the filter size is determined by the size of the entire graph or the sampled subgraph, which restricts flexibility for inputs of different sizes. Although some flexible spatial methods have been proposed, their aggregators lack learnability and convolutional properties. Consequently, designing a flexible filter that combines the learnability of spatial methods with the convolutional properties of spectral methods remains an open problem.

To address the issues of inflexible and unlearnable filters, as well as the limited applicability caused by high computational complexity in existing methods, this paper proposes a novel graph contrastive learning paradigm based on the adaptive spectral wavelet transform. More specifically, a fast spectrum-adaptive approximation method is utilized to estimate the wavelet filter, and a contrastive loss is employed to optimize the wavelet scales directly. Additionally, the introduction of residual links mitigates over-smoothing during information aggregation. By avoiding the expensive eigen-decomposition of the graph Laplacian operator and enabling localization in both the spectral and spatial domains, this approach effectively overcomes the limitations of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood scope. In comparison to state-of-the-art methods, this work introduces the following innovations:

  • We propose a novel self-supervised graph representation learning method based on sparse graph wavelets that creates localized filters in both the spectral and spatial domains. It reduces computational complexity and addresses the limitation that graph convolutional neural networks cannot flexibly control the neighborhood scope.

  • Theoretically, we demonstrate that nodes with similar network neighborhoods and features exhibit similar ASWT-SGNN embeddings, providing a performance guarantee for the proposed method.

  • Extensive experiments on eight benchmark datasets show that the proposed method reduces approximation errors in high-density spectral components without requiring expensive eigen-decomposition and achieves competitive performance with state-of-the-art models in node classification tasks.

Related Works

Graph Convolutional Neural Network

Following the success of CNNs in computer vision and natural language processing, researchers have sought to extend CNNs to the graph domain. The key challenge lies in defining the convolution operator for graphs. Graph convolutional neural networks can be broadly categorized into two main approaches: spectral and spatial. Spectral methods employ the graph Fourier transform to transfer signals from the spatial domain to the spectral domain, where convolution operations are performed. Spectral GNN (Bruna et al. 2013) is the first attempt to implement CNNs on graphs. ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) introduces a parameterization of spectral filters using Chebyshev polynomials, which enables fast localization of spectral filters. GCN (Kipf and Welling 2016) proposes a simplified version of ChebyNet, which achieves success in graph semi-supervised learning tasks. However, these spectral methods face challenges with generalization, as they are limited to fixed graph sizes, and larger filter sizes result in increased computational and memory costs. Spatial methods draw inspiration from weighted summation in CNNs to extract spatial features from topological graphs. MoNet (Monti et al. 2017) provides a general framework for designing spatial methods by utilizing the weighted average of functions defined within a node's neighborhood as a spatial convolution operator. GAT (Velickovic et al. 2017) proposes a self-attention mechanism to learn the weighting function. However, these methods employ unlearnable aggregators, and the localization of the convolution operation remains uncertain.

Graph Contrastive Learning

In graph analysis, contrastive learning was initially introduced by DGI (Velickovic et al. 2019) and InfoGraph (Sun et al. 2019), drawing inspiration from local-global mutual information maximization. Building upon this, MVGRL (Hassani and Khasahmadi 2020) incorporates node diffusion into the graph contrastive framework. GCA (Zhu et al. 2021) learns node representations by considering other nodes as negative samples, while BGRL (Thakoor et al. 2021) proposes a model without negative samples. CCA-SSG (Zhang et al. 2021) optimizes feature-level objectives in addition to instance-level differences. GRADE (Wang et al. 2022b) investigates fairness differences in contrastive learning and proposes a novel approach to graph augmentation. Several surveys (Xie et al. 2022; Ding et al. 2022; Kumar, Rawat, and Chauhan 2022) summarize recent advancements in graph contrastive learning. Despite these methods' notable achievements, most rely on GCN and its variants as the base models, inheriting the limitations of GCN. These limitations restrict the performance of these graph contrastive learning methods in tasks that require preserving fine-grained node features.

Graph Wavelets

The wavelet transform exhibits favorable structural properties by utilizing finite length and attenuation basis functions. This approach effectively localizes signals within both the spatial and spectral domains. Additionally, it is noteworthy that the basis and its inverse in the wavelet transform often showcase sparsity, contributing to its utility and efficiency. To construct the wavelet transform on graphs, Hammond et al. (Hammond, Vandergheynst, and Gribonval 2011) propose a method that approximates the wavelet using Chebyshev polynomials. This approach effectively avoids the need for eigen-decomposition of the Laplace matrix. Building upon this, GWNN (Xu et al. 2019) redefines graph convolution based on graph wavelets, resulting in high efficiency and sparsity. M-GWCN (Behmanesh et al. 2022) applies the multi-scale graph wavelet transform to learn representations of multimodal data. Collectively, these works showcase the value of graph wavelets in signal processing on graphs. However, these methods primarily employ wavelet transforms in supervised or semi-supervised tasks and heavily rely on labeled data.

Figure 1: (a) Signal spectrum; (b) Filter signal. Visualization of how wavelet filters capture multiscale properties in graph signals and structures. We use a graph with two levels of clusters (4-node clusters and 8-node clusters) for demonstration. These clusters are reflected in the gaps (about 0 and 6) in the spectrum in Figure 1(a), corresponding to different eigenvalue clusterings. The signal is obtained by filtering a random signal with the filter in Figure 1(a), purposefully highlighting the three eigenvalue clusters. Figure 1(b) shows how the complete signal is decomposed into multiple filter components.

Preliminary

Graph Fourier Transform

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote a graph, where $\mathcal{V}=\{v_{1},\cdots,v_{N}\}$ represents the set of nodes and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the set of edges. $\mathcal{G}$ is associated with a feature matrix $\bm{X}\in\mathbb{R}^{N\times p}$, $\bm{X}=[x_{1},\dots,x_{N}]$, and an adjacency matrix $\bm{A}\in\mathbb{R}^{N\times N}$, where $\bm{A}_{ij}=1$ if $(v_{i},v_{j})\in\mathcal{E}$ and $\bm{A}_{ij}=0$ otherwise. We define the Laplacian matrix of the graph as $\bm{L}=\bm{D}-\bm{A}$, where $\bm{D}=\mathrm{Diag}(d_{1},\dots,d_{N})$ and $d_{i}=\sum_{j}\bm{A}_{ij}$. The symmetric normalized Laplacian is defined as $\bm{L}_{sym}=\bm{D}^{-\frac{1}{2}}\bm{L}\bm{D}^{-\frac{1}{2}}=\bm{U}\bm{\Lambda}\bm{U}^{\top}$, where $\bm{U}=[\bm{u}_{1},\ldots,\bm{u}_{N}]$, $\bm{u}_{i}\in\mathbb{R}^{N}$ denotes the $i$-th eigenvector of $\bm{L}_{sym}$, and $\bm{\Lambda}=\mathrm{Diag}(\lambda_{1},\dots,\lambda_{N})$ is the corresponding eigenvalue matrix.

For a discrete graph signal $\bm{X}$, its Fourier transform has the following form (Shuman et al. 2013):

\hat{f}(\lambda_{\ell})=\sum_{i=1}^{N}\bm{x}_{i}\bm{U}_{\ell}^{*}(i)=\sum_{i=1}^{N}\bm{x}_{i}\bm{U}_{\ell}^{T}(i),   (1)

where $\bm{U}_{\ell}^{*}$ denotes the conjugate transpose of the eigenvector corresponding to the eigenvalue $\lambda_{\ell}$. Since complex numbers are not involved in the scope of this work, it can be regarded as an ordinary transpose, that is, $\bm{U}^{*}=\bm{U}^{T}$.

The graph Fourier transform provides a means to define the graph convolution operator using the convolution theorem. The filtered signal, denoted as $f_{o}$, can be mathematically expressed as:

f_{o}=\bm{U}g(\bm{\Lambda})\bm{U}^{T}\bm{X},   (2)

where $g(\bm{\Lambda})$ denotes the filter function on the eigenvalues, and $\bm{U}g(\bm{\Lambda})\bm{U}^{T}$ is the graph filtering matrix.
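
To make the spectral filtering of Eq. (2) concrete, the following is a minimal NumPy sketch (not the authors' code; names such as `fourier_filter` are ours) that builds $\bm{L}_{sym}$ for a small dense graph, eigen-decomposes it, and applies a filter $g$ to the eigenvalues:

```python
import numpy as np

def fourier_filter(A, X, g):
    """Return U g(Lambda) U^T X for a small dense adjacency matrix A (Eq. (2))."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    lam, U = np.linalg.eigh(L_sym)                             # eigen-decomposition, O(N^3)
    return U @ np.diag(g(lam)) @ U.T @ X

# Toy usage: a low-pass filter that attenuates high-frequency components.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
X_smooth = fourier_filter(A, X, lambda lam: np.exp(-2.0 * lam))
```

The explicit eigen-decomposition in this sketch is exactly the $\mathcal{O}(N^{3})$ step that the wavelet approximation in the Methodology section is designed to avoid.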

Uncertainty Principle

The implementation of graph convolution using the Fourier transform achieves localization in the spectral domain but lacks spatial localization, and its degree of localization is further constrained by computational efficiency. We employ the spatial and spectral concentration metrics proposed by Tsitsvero et al. (Tsitsvero, Barbarossa, and Di Lorenzo 2016) to establish a more comprehensive and precise notion of localization.

\frac{\left\|\bm{B}\bm{x}\right\|_{2}^{2}}{\left\|\bm{x}\right\|_{2}^{2}}=a^{2},\quad\frac{\left\|\bm{C}\bm{x}\right\|_{2}^{2}}{\left\|\bm{x}\right\|_{2}^{2}}=b^{2},   (3)

where the squared Euclidean norm $\left\|\bm{x}\right\|_{2}^{2}$ denotes the total energy of the signal $\bm{x}$. $\bm{B}$ is the diagonal matrix representing the node restriction operator: given a node subset $\mathcal{S}\subseteq\mathcal{V}$, $\bm{B}=\mathrm{Diag}\{\mathbb{I}(i\in\mathcal{S})\}$, where $\mathbb{I}(\cdot)$ is an indicator function. $\bm{C}=\bm{U}\Sigma_{\mathcal{F}}\bm{U}^{-1}$ is the band-limiting operator: given the matrix $\bm{U}$ and a subset of frequency indices $\mathcal{F}\subseteq\mathcal{V}^{*}$, where $\mathcal{V}^{*}=\{1,\cdots,N\}$ denotes the set of all frequency indices, $\Sigma_{\mathcal{F}}$ is a diagonal matrix defined as $\Sigma_{\mathcal{F}}=\mathrm{Diag}\{\mathbb{I}(i\in\mathcal{F})\}$.

More specifically, considering a pair of a vertex set $\mathcal{S}$ and a frequency set $\mathcal{F}$, where $a^{2}$ and $b^{2}$ represent the percentage of energy contained within $\mathcal{S}$ and $\mathcal{F}$ respectively, our objective is to determine the trade-off between $a$ and $b$ and identify signals that can achieve all feasible pairs. The resulting uncertainty principle is formulated in the following theorem (Tsitsvero, Barbarossa, and Di Lorenzo 2016).

\cos^{-1}a+\cos^{-1}b\geq\cos^{-1}\lambda_{\max}\left(\bm{B}\bm{C}\right),   (4)

where $\lambda_{\max}(\bm{B}\bm{C})$ is the maximum eigenvalue of $\bm{B}\bm{C}$.

Considering the uncertainty principle, we would retain the entire frequency band if we followed the graph localization method in Eq. (2). However, real graph signals show a non-uniform distribution in the frequency domain, which suggests that $\bm{C}$ can be chosen more efficiently. Moreover, a deeper network leads to a wider propagation of the signal in the graph domain, which limits the width of the frequency bands and leads to the dominance of low-frequency (smooth) signals, producing over-smoothing.

Graph Wavelet Transform

Like the graph Fourier transform, the graph wavelet transform also requires a set of suitable bases to map the graph signal to the spectral domain. In this case, we denote the wavelet operator as $\bm{\Psi}_{s}=\bm{U}g_{s}(\bm{\Lambda})\bm{U}^{\top}$, where $s$ is the scaling parameter. The wavelet transform decomposes a function $g$ into a linear combination of basis functions localized in both the spatial and spectral domains. This paper employs the heat-kernel wavelet as the low-pass filter (denoted as $g^{l}$) and the Mexican-hat wavelet as the band-pass filter (denoted as $g^{b}$). These filters are defined as follows:

g^{l}_{s}(\lambda)=e^{-s\lambda},   (5)
g_{s}^{b}(\lambda)=\frac{2}{\sqrt{3}\pi^{\frac{1}{4}}}\left(1-\left(\lambda s\right)^{2}\right)e^{-\frac{\left(\lambda s\right)^{2}}{2}}.   (6)

In this work, we integrate filter functions to achieve the combined effect of a low-pass filter and a wide band-pass filter. Figure 1 exemplifies how the combination of low-pass and band-pass filters can generate a more intricate wavelet filter capable of capturing signal components at various scales. This design yields a richer representation of the signal that encompasses its frequency information in a more nuanced way.

g_{\theta}(\lambda)=g_{s_{0}}^{l}(\lambda)+\sum_{l=1}^{L}g_{s_{l}}^{b}(\lambda),   (7)

where $\theta=\{s_{0},s_{1},\ldots,s_{L}\}$ is the set of scale parameters.
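
As a quick illustration of Eqs. (5)-(7), the short Python sketch below evaluates the combined multi-scale filter $g_{\theta}(\lambda)$ on a grid of eigenvalues; the scale values are illustrative choices, not the paper's tuned settings:

```python
import numpy as np

def heat_lowpass(lam, s):
    """Heat-kernel low-pass filter, Eq. (5)."""
    return np.exp(-s * lam)

def mexican_hat_bandpass(lam, s):
    """Mexican-hat band-pass filter, Eq. (6)."""
    c = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return c * (1.0 - (lam * s) ** 2) * np.exp(-0.5 * (lam * s) ** 2)

def g_theta(lam, s0, band_scales):
    """Combined multi-scale wavelet filter, Eq. (7)."""
    out = heat_lowpass(lam, s0)
    for s in band_scales:
        out = out + mexican_hat_bandpass(lam, s)
    return out

lam = np.linspace(0.0, 2.0, 200)      # the spectrum of L_sym lies in [0, 2]
response = g_theta(lam, s0=2.0, band_scales=[1.0, 2.0, 4.0])
```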

Using the graph wavelet transform instead of the graph Fourier transform in Eq. (2), we get the graph convolution as follows:

f_{o}=\bm{\Psi}\bm{G}\bm{\Psi}^{\top}\bm{X},   (8)

where $\bm{G}$ is a diagonal matrix that acts as a filter; the scale set $\theta$ of the wavelet operator is omitted for simplicity.

Methodology

Wavelet Coefficients Approximation

The model formulation discussed in the preceding sections necessitates an eigen-decomposition of the Laplacian operator of the input graph $\mathcal{G}$. However, this process has a computational complexity of $\mathcal{O}(N^{3})$, rendering it infeasible for larger graphs. To overcome this constraint, we employ the polynomial approximation method previously introduced by Hammond et al. (Hammond, Vandergheynst, and Gribonval 2011). This method expresses the wavelet filter as $g_{\theta}(\lambda)\approx p_{\theta}(\lambda)=\gamma_{0}+\gamma_{1}\lambda+\cdots+\gamma_{m}\lambda^{m}$, which allows rewriting the wavelet operator as $\bm{\Psi}_{\theta}=\bm{U}p_{\theta}(\bm{\Lambda})\bm{U}^{\top}=p_{\theta}(\bm{L}_{sym})$.

While existing methods rely on Chebyshev polynomial approximations, we aim to optimize the scales. Therefore, we utilize a least squares approximation that can parameterize the set of wavelet scales $\theta$, which can be expressed as follows:

\bm{\gamma}_{\theta}=(\bm{V}_{\bm{\Lambda}}^{\top}\bm{V}_{\bm{\Lambda}})^{-1}\bm{V}_{\bm{\Lambda}}^{\top}g_{\theta}(\bm{\Lambda}),   (9)

where $\bm{V}_{\bm{\Lambda}}\in\mathbb{R}^{N\times(m+1)}$ is the Vandermonde matrix of $\bm{\Lambda}$ from order 0 to order $m$, and $N$ is the number of eigenvalues.

Accurate calculation of the eigenvalues requires an expensive eigen-decomposition of the graph Laplacian operator, which is not feasible in our case. As an alternative, we replace the set of eigenvalues with a sequence of linearly spaced points $\bm{\xi}=\{\xi_{i}\}_{i=1}^{K}$ within the interval $[0,2]$ on the spectral domain. However, in graphs that exhibit multiscale features, the eigenvalues do not follow a uniform distribution on the spectral domain. Instead, they display spectral gaps corresponding to different scales within the data. To address this non-uniform distribution, we incorporate the estimated spectral density $\bm{\omega}$ as the weight for each of the $K$ sample points $\bm{\xi}$ on the spectral domain (Fan et al. 2020). Consequently, we can compute the weighted least squares coefficients,

\bm{\gamma}_{\theta}=(\bm{V}_{\bm{\xi}}^{\top}\mathrm{Diag}(\bm{\omega})\bm{V}_{\bm{\xi}})^{-1}\bm{V}_{\bm{\xi}}^{T}\mathrm{Diag}(\bm{\omega})g_{\theta}(\bm{\xi}),   (10)

where the spectral density $\bm{\omega}=\{\omega_{j}\}_{j=1}^{K}$, and $\omega_{i}=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}=\xi_{i})$.
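
A minimal sketch of the weighted least squares fit in Eq. (10), assuming NumPy; the sample points, weights, and target filter below are placeholders rather than values used in the paper:

```python
import numpy as np

def weighted_ls_coeffs(xi, omega, g_vals, m):
    """Fit gamma_theta as in Eq. (10): xi are spectral sample points in [0, 2],
    omega the estimated spectral density weights, g_vals = g_theta(xi)."""
    V = np.vander(xi, N=m + 1, increasing=True)     # Vandermonde matrix, orders 0..m
    W = np.diag(omega)
    # gamma = (V^T W V)^{-1} V^T W g, solved without forming an explicit inverse.
    return np.linalg.solve(V.T @ W @ V, V.T @ W @ g_vals)

K, m = 20, 6
xi = np.linspace(0.0, 2.0, K)
omega = np.full(K, 1.0 / K)                         # uniform density as a stand-in
gamma = weighted_ls_coeffs(xi, omega, np.exp(-2.0 * xi), m)
```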

The goal of spectral density estimation is to approximate the density function without expensive eigen-decomposition of the graph Laplacian. To achieve this, we determine the number of eigenvalues less than or equal to each $\xi_{i}$ in the set $\bm{\xi}$. This can be achieved by computing an approximation to the trace of the eigen-projection matrix $\bm{P}$ (Di Napoli, Polizzi, and Saad 2016). In practice, directly obtaining the projector $\bm{P}$ is often not feasible. However, it can be approximated efficiently using polynomials or rational functions of the Laplacian operator $\bm{L}$. In this approximation, we interpret $\bm{P}$ as a step function of $\bm{L}$, which can be expressed as follows:

\bm{P}_{\xi_{i}}=h(\bm{L}),\quad\text{where}~h(\lambda)=\begin{cases}1,&\text{if}~\lambda\leq\xi_{i}\\ 0,&\text{otherwise.}\end{cases}   (11)

Although $h(\lambda)$ cannot be computed exactly at low cost, it can be approximated using a finite sum of Jackson-Chebyshev polynomials, denoted as $\phi(\lambda)$; please refer to Appendix A for details of the approximation method. In this form, it becomes possible to estimate the trace of $\bm{P}$ with the estimator developed by Hutchinson (Hutchinson 1989) and further improved more recently (Tang and Saad 2012). Hutchinson's stochastic estimator relies solely on matrix-vector products to approximate the matrix trace. The key idea is to utilize Rademacher random variables, where each entry of a randomly generated vector $\bm{R}\in\mathbb{R}^{N}$ takes the value $-1$ or $1$ with equal probability $\frac{1}{2}$. Thus, an estimate of the trace $tr(\bm{P})$ can be obtained by generating $n_{r}$ samples of random vectors $\bm{R}_{k}$, $k=1,\cdots,n_{r}$, and computing the average over these samples. The estimator can be expressed as follows:

tr(\bm{P})\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\bm{P}\bm{R}_{k}\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\phi(\bm{L}_{sym})\bm{R}_{k}.   (12)

We obtain the approximation $\bm{\Omega}$ to the cumulative spectral density function:

\bm{\Omega}=\left\{\left(\xi_{i},\frac{1}{N}\left[\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\phi(\bm{L}_{sym})\bm{R}_{k}\right]\right)\right\}_{i=1}^{K}.   (13)

Finally, the cumulative spectral density $\bm{\Omega}$ is differentiated to obtain an approximation of the spectral density $\bm{\omega}$, i.e., $\bm{\omega}=\frac{d}{d\bm{\xi}}\bm{\Omega}$. Using the estimated spectral density $\bm{\omega}$ as the weight for each sample point $\xi_{i}$ on the spectral domain, the weighted least squares coefficients $\bm{\gamma}_{\theta}$ can be calculated by substituting it into Eq. (10).
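
The sketch below, assuming SciPy sparse matrices, illustrates the estimator in Eqs. (12)-(13) under the usual simplifications: the spectrum of $\bm{L}_{sym}$ is shifted from $[0,2]$ to $[-1,1]$, the step function is expanded in Jackson-damped Chebyshev polynomials (see Appendix A), and Rademacher probe vectors approximate the trace. The helper names (`jackson`, `cheb_step_coeffs`, `eig_count_estimate`) are ours, not the authors':

```python
import numpy as np
import scipy.sparse as sp

def jackson(m):
    """Jackson damping coefficients g_j^m (see Eq. (34) in Appendix A)."""
    a = np.pi / (m + 2)
    j = np.arange(m + 1)
    return (np.sin((j + 1) * a) / ((m + 2) * np.sin(a))
            + (1.0 - (j + 1) / (m + 2)) * np.cos(j * a))

def cheb_step_coeffs(b, m):
    """Chebyshev coefficients of the step 1[t <= b] on [-1, 1] (Eq. (30) with a = -1)."""
    j = np.arange(1, m + 1)
    g0 = (np.arccos(-1.0) - np.arccos(b)) / np.pi
    gj = (2.0 / np.pi) * (np.sin(j * np.arccos(-1.0)) - np.sin(j * np.arccos(b))) / j
    return np.concatenate(([g0], gj))

def eig_count_estimate(L_sym, xi, m=30, n_r=10, seed=0):
    """Stochastic estimate of #{lambda_j <= xi} for L_sym with spectrum in [0, 2]."""
    rng = np.random.default_rng(seed)
    N = L_sym.shape[0]
    M = L_sym - sp.identity(N)                 # shift spectrum to [-1, 1]
    coeffs = jackson(m) * cheb_step_coeffs(xi - 1.0, m)
    total = 0.0
    for _ in range(n_r):
        r = rng.choice([-1.0, 1.0], size=N)    # Rademacher probe vector
        t_prev, t_cur = r, M @ r               # T_0(M) r and T_1(M) r
        acc = coeffs[0] * r.dot(t_prev) + coeffs[1] * r.dot(t_cur)
        for j in range(2, m + 1):              # Chebyshev three-term recurrence
            t_prev, t_cur = t_cur, 2.0 * (M @ t_cur) - t_prev
            acc += coeffs[j] * r.dot(t_cur)
        total += acc
    return total / n_r
```

Evaluating this at each sample point $\xi_{i}$ and dividing by $N$ gives the cumulative density estimate $\bm{\Omega}$ of Eq. (13), whose derivative supplies the weights $\bm{\omega}$ used in Eq. (10).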

These coefficients are used to approximate the wavelet filter matrix $\bm{\Psi}_{\theta}$, which can be expressed as follows:

\bm{\Psi}_{\theta}=\gamma_{0}\bm{I}+\gamma_{1}\bm{L}_{sym}+\cdots+\gamma_{m}\bm{L}_{sym}^{m}.   (14)
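
In practice, Eq. (14) can be applied to features without materializing a dense operator: only repeated sparse matrix products are needed, as in the hedged sketch below (variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def apply_wavelet_operator(L_sym, X, gamma):
    """Compute Psi_theta X = sum_k gamma[k] * L_sym^k X without forming L_sym^k."""
    out = gamma[0] * X
    power = X
    for g in gamma[1:]:
        power = L_sym @ power           # L_sym^k X obtained from L_sym^{k-1} X
        out = out + g * power
    return out

# Toy usage with a random symmetric sparse matrix standing in for L_sym.
N, p = 100, 8
L_sym = sp.random(N, N, density=0.05, random_state=0)
L_sym = (L_sym + L_sym.T).tocsr()
X = np.random.randn(N, p)
Psi_X = apply_wavelet_operator(L_sym, X, np.array([0.5, -0.3, 0.1]))
```

This is where the $\mathcal{O}(m\times|E|)$ cost quoted in the Complexity Analysis comes from: each polynomial term adds one sparse multiplication over the edge set.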

Encoder

In this paper, we construct a graph wavelet multilayer convolutional network (ASWT-SGNN). Based on the above, we define the $l$-th layer of ASWT-SGNN as

\bm{H}^{l+1}=\sigma(\bm{H}^{l^{\prime}}\bm{W}^{l}),   (15)

where $\sigma$ is the activation function, $\bm{H}^{l^{\prime}}\in\mathbb{R}^{N\times p}$ is the result of the graph convolution at the $l$-th layer, $\bm{W}^{l}\in\mathbb{R}^{p\times q}$ is the weight matrix of the $l$-th layer, $p$ is the number of features in the current layer, and $q$ is the number of features in the next layer. $\bm{H}^{l^{\prime}}$ in Eq. (15) is defined as follows:

\bm{H}^{l^{\prime}}=\alpha\bm{F}^{l}\bm{H}^{l}+(1-\alpha)\bm{H}^{l},   (16)

where $1-\alpha$ represents the proportion of the original features $\bm{H}^{l}$ in the $l$-th layer, and $0\leq\alpha\leq 1$. By incorporating this residual connection, we guarantee that regardless of the number of layers we add, the resulting representation always retains a portion of the initial features. To enhance the encoder's capability to aggregate both local and global information effectively, we define the diffusion operator $\bm{F}^{l}$ of the $l$-th layer in Eq. (16) as

\bm{F}^{l}=\beta(\bm{\Psi}_{\theta}\bm{G}^{l}\bm{\Psi}_{\theta}^{\top})+(1-\beta)(\bm{D}^{-\frac{1}{2}}\tilde{\bm{A}}\bm{D}^{-\frac{1}{2}}),   (17)

where $\beta$ denotes the ratio of the graph wavelet term and $0\leq\beta\leq 1$, $\theta$ is the set of learnable multi-scale parameters, and $\bm{G}^{l}$ is the learnable diagonal filter matrix of the $l$-th layer. $\tilde{\bm{A}}=\bm{A}+\bm{I}$ represents the adjacency matrix of the self-loop graph of $\mathcal{G}$, where $\bm{I}$ is the identity matrix.
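
The following PyTorch sketch shows one plausible realization of Eqs. (15)-(17); it is an illustration under our own simplifications (dense precomputed $\bm{\Psi}_{\theta}$ and normalized adjacency, ReLU activation), not the authors' released implementation:

```python
import torch
import torch.nn as nn

class ASWTLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_nodes, alpha=0.8, beta=0.4):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # W^l in Eq. (15)
        self.g = nn.Parameter(torch.ones(num_nodes))      # diagonal of G^l in Eq. (17)
        self.alpha, self.beta = alpha, beta

    def forward(self, H, Psi, A_norm):
        # Wavelet term Psi G^l Psi^T H, with the diagonal G^l applied as a row scaling.
        wavelet_term = Psi @ (self.g.unsqueeze(1) * (Psi.t() @ H))
        F_H = self.beta * wavelet_term + (1 - self.beta) * (A_norm @ H)   # Eq. (17)
        H_prime = self.alpha * F_H + (1 - self.alpha) * H                 # Eq. (16)
        return torch.relu(self.W(H_prime))                                # Eq. (15)
```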

Optimization Objective

Typical GCL methods involve generating augmented views and subsequently optimizing the agreement between their encoded representations. In this paper, we generate two augmented graphs, $\bm{z}$ and $\bm{o}$, by using feature augmentation. We randomly sample a mask vector $\bm{m}_{f}\in\{0,1\}^{B}$ to hide part of the dimensions of the node features. Each element of the mask $\bm{m}_{f}$ is sampled from the Bernoulli distribution $Ber(1-f_{d})$, where the hyperparameter $f_{d}$ is the feature drop rate. Therefore, the augmented node features $\hat{\bm{X}}$ are calculated by the following formula:

\hat{\bm{X}}=\left[\bm{x}_{1}\circ\bm{m}_{f},\bm{x}_{2}\circ\bm{m}_{f},\cdots,\bm{x}_{N}\circ\bm{m}_{f}\right].   (18)
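
A small sketch of the masking in Eq. (18), assuming PyTorch; one shared Bernoulli mask is drawn per view and broadcast over all node rows:

```python
import torch

def feature_mask_augment(X, f_d):
    """X: (N, B) node feature matrix; f_d: feature drop rate. Returns a masked copy."""
    m_f = (torch.rand(X.size(1), device=X.device) > f_d).float()   # mask in {0, 1}^B
    return X * m_f                                                  # x_i ∘ m_f for every node
```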

We apply the contrastive objective to the node representations of the two graph augmentations. For node $v_{a}$, the node representations $\bm{z}_{a}$ and $\bm{o}_{a}$ from the two augmentations form a positive pair, whereas the representations of all other nodes in the two augmentations are considered negative pairs. Therefore, we define the pairwise objective for each positive pair $(\bm{z}_{a},\bm{o}_{a})$ as

\mathcal{L}_{a}(\bm{z},\bm{o})=\log\frac{e^{\theta\left(\bm{z}_{a},\bm{o}_{a}\right)}}{e^{\theta\left(\bm{z}_{a},\bm{o}_{a}\right)}+\sum_{b\neq a}(e^{\theta\left(\bm{z}_{a},\bm{o}_{b}\right)}+e^{\theta\left(\bm{z}_{a},\bm{z}_{b}\right)})},   (19)

where the critic $\theta(\bm{z},\bm{o})$ is defined as $sim(\bm{H}_{\bm{z}},\bm{H}_{\bm{o}})$, and $sim(\cdot,\cdot)$ denotes the cosine similarity function. $\bm{H}_{\bm{z}}$ is the embedding generated from graph augmentation $\bm{z}$ by the proposed method ASWT-SGNN. The overall objective to be maximized is the average over all positive pairs, yielding the loss

\mathcal{L}=-\frac{1}{2N}\sum_{a=1}^{N}\left[\mathcal{L}_{a}\left(\bm{z},\bm{o}\right)+\mathcal{L}_{a}\left(\bm{o},\bm{z}\right)\right].   (20)
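
A hedged PyTorch sketch of Eqs. (19)-(20) with cosine similarity as the critic; Eq. (19) includes no temperature, so none is used here, and the function names are ours:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(Hz, Ho):
    """Symmetric objective of Eq. (20) over two view embeddings Hz, Ho of shape (N, d)."""
    Hz = F.normalize(Hz, dim=1)               # cosine similarity via normalized dot products
    Ho = F.normalize(Ho, dim=1)

    def one_direction(Z, O):
        cross = torch.exp(Z @ O.t())          # e^{theta(z_a, o_b)}
        intra = torch.exp(Z @ Z.t())          # e^{theta(z_a, z_b)}
        pos = cross.diag()                    # positive pairs (z_a, o_a)
        neg = cross.sum(1) - pos + intra.sum(1) - intra.diag()   # all b != a terms
        return torch.log(pos / (pos + neg))   # per-node L_a of Eq. (19)

    return -0.5 * (one_direction(Hz, Ho) + one_direction(Ho, Hz)).mean()
```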

Complexity Analysis

The first step of ASWT-SGNN is the wavelet coefficient approximation. Eq. (8) shows that the direct solution requires an eigen-decomposition of the Laplacian matrix to obtain its eigenvalues and eigenvectors, and the complexity of this direct solution is very high: for example, the time complexity of the QR algorithm is $\mathcal{O}(N^{3})$ and its space complexity is $\mathcal{O}(N^{2})$. Therefore, we use the least squares approximation in this step. If an $m$-order Chebyshev polynomial approximation is used, the total complexity of evaluating the polynomial terms is $\mathcal{O}(m\times|E|)$.

The second step of ASWT-SGNN is to use the wavelet transform for graph convolution. Spectral CNN (Bruna et al. 2013) has a high parameter complexity of $\mathcal{O}(N\times p\times q)$. ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) approximates the convolution kernel by a polynomial function of the diagonal matrix of Laplacian eigenvalues, reducing the parameter complexity to $\mathcal{O}(m\times p\times q)$, where $m$ is the order of the polynomial function. GCN (Kipf and Welling 2016) simplifies ChebyNet by setting $m=1$. In this paper, the feature transformation is performed first, with parameter complexity $\mathcal{O}(p\times q)$; then the graph convolution is performed, with parameter complexity $\mathcal{O}(N)$.

Theoretical Analysis

Lemma 1.

Consider nodes $v_{a}$ and $v_{b}$ within a graph, characterized by their similarity in features or labels. Consequently, their $k$-hop neighbors exhibit a one-to-one mapping, specifically $\mathcal{N}_{k}(b)=\pi(\mathcal{N}_{k}(a))$. In this context, it holds that $|\bm{\Psi}_{am}-\bm{\Psi}_{b\pi(m)}|\leq 2\epsilon$, where $\bm{\Psi}_{am}$ denotes the wavelet coefficient between nodes $v_{a}$ and $v_{m}$.

Proof.

The proof is shown in Appendix A. ∎

This lemma shows that nodes with similar network neighborhoods have similar wavelet coefficients.

Theorem 1.

If the graph follows the above assumption and Lemma 1, the expectation of the embedding is given by:

\mathbb{E}\left[\bm{F}_{i}\right]=\bm{W}\,\mathbb{E}_{y\sim P\left(y_{i}\right),\bm{x}\sim P_{y}(\bm{x})}[\bm{\Gamma}_{i}\bm{x}],   (21)

where $\bm{\Gamma}=\bm{\Psi}\bm{G}\bm{\Psi}^{\top}$ and $\bm{\Gamma}_{i}$ is the $i$-th row of $\bm{\Gamma}$. In this context, we simplify the model by excluding the residual connection component. With probability at least $1-\delta$ over the distribution of the graph, we have:

\left\|\bm{H}_{i}-\mathbb{E}\left[\bm{H}_{i}\right]\right\|_{2}\leq\sqrt{\frac{\sigma_{\max}^{2}(\bm{W})\,p\log(2p/\delta)}{2N\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}}},   (22)

where the sub-Gaussian norm $\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}=\min_{d}\left\|\bm{\Gamma}_{i}\bm{x}_{i,d}\right\|_{\psi_{2}}$, $p$ is the dimension of the features, $d\in[1,p]$, and $\sigma^{2}_{\max}(\bm{W})$ is the largest singular value of $\bm{W}$.

Proof.

The proof is shown in Appendix A. ∎

The theorem stated above indicates that the proposed method can map nodes with the same label to an area centered around the expectation in the embedding space. This holds true for any graph in which each node’s feature and neighborhood pattern are sampled from distributions that depend on the node label.

Table 1: Accuracy (%) on the eight datasets for the node classification. The best result is bold, and the second best is underlined.

Method             Cora       CiteSeer   PubMed     Computers  Photo      CS         Physics    WikiCS     Avg.
GCN                81.6±0.2   70.3±0.4   79.3±0.2   84.5±0.3   91.6±0.3   93.1±0.3   93.7±0.2   73.0±0.1   83.4
GAT                83.1±0.3   72.4±0.3   79.5±0.1   85.8±0.1   91.7±0.2   89.5±0.3   93.5±0.3   72.6±0.3   83.5
GWNN               82.8±0.3   71.8±0.2   79.9±0.3   85.6±0.2   92.5±0.1   92.6±0.4   92.7±0.2   72.8±0.3   83.8
GCNII              85.5±0.4   73.4±0.6   80.4±0.3   87.6±0.4   92.7±0.2   92.8±0.4   93.5±0.2   74.7±0.3   85.1
GMI                83.3±0.2   72.6±0.2   79.8±0.4   82.2±0.1   90.7±0.2   92.6±0.2   94.3±0.4   74.9±0.2   83.8
MVGRL              83.1±0.2   72.3±0.5   80.3±0.5   87.5±0.1   91.7±0.1   92.1±0.3   95.1±0.2   77.5±0.1   84.9
GCA-SSG            83.9±0.4   73.1±0.3   81.3±0.4   88.4±0.3   89.5±0.1   92.4±0.1   93.4±0.2   78.2±0.3   85.0
GRADE              84.0±0.3   72.4±0.4   82.7±0.3   84.7±0.1   92.6±0.1   92.7±0.4   93.7±0.2   78.1±0.2   85.1
AF-GCL             83.1±0.1   71.9±0.4   79.0±0.7   89.6±0.2   92.5±0.3   92.0±0.1   95.2±0.2   79.0±0.5   85.3
MA-GCL             83.3±0.4   73.6±0.4   83.5±0.7   88.8±0.3   93.8±0.3   92.5±0.4   94.8±0.5   78.7±0.5   86.1
GraphMAE           84.2±0.4   73.4±0.4   81.1±0.4   89.5±0.1   93.2±0.1   92.7±0.2   94.3±0.4   78.9±0.2   85.9
MaskGAE            84.3±0.4   73.8±0.8   83.6±0.5   89.5±0.1   93.3±0.1   92.7±0.5   94.1±0.4   78.4±0.2   86.2
ASWT-SGNN(Semi)    88.1±0.4   81.5±0.3   85.2±0.6   89.4±0.4   93.8±0.2   93.2±0.3   94.9±0.4   79.8±0.5   88.2
ASWT-SGNN(CL)      86.2±0.6   73.9±0.1   84.9±0.3   89.2±0.3   93.5±0.2   93.5±0.3   95.4±0.3   79.5±0.3   87.0

Experiments

Experimental Setting

Datasets We evaluate the approach on eight benchmark datasets that have been widely used in GCL methods. Specifically, the citation datasets are Cora, CiteSeer, and PubMed (Yang, Cohen, and Salakhudinov 2016); the co-purchase and co-author datasets are Photo, Computers, CS, and Physics (Suresh et al. 2021); and the Wikipedia dataset is WikiCS (Mernyei and Cangea 2020).

Baselines We consider several baseline methods for the node classification task. These include semi-supervised learning methods like GCN (Kipf and Welling 2016), GAT (Velickovic et al. 2017) and GCNII (Chen et al. 2020) and wavelet neural network GWNN (Xu et al. 2019). Furthermore, we evaluate six GCL methods and two graph generation learning methods, which are GMI (Peng et al. 2020), MVGRL (Hassani and Khasahmadi 2020), GCA-SSG (Zhang et al. 2021), GRADE (Wang et al. 2022b), AF-GCL (Wang et al. 2022a), MA-GCL (Gong, Yang, and Shi 2023), GraphMAE (Hou et al. 2022), and MaskGAE (Li et al. 2023). These methods represent state-of-the-art approaches in the field of node classification tasks.

Evaluation protocol. For the ASWT-SGNN model, node representations are learned unsupervised using a 2-layer model. A linear classifier is then applied as a post-processing step for assessment. The dataset is randomly partitioned, with 20% of nodes allocated to the training set, another 20% to the validation set, and the remaining 60% to the test set. To ensure the robustness of our findings, we run each experiment five times per dataset with different random seeds and report the average accuracy and the corresponding standard deviation. All experiments use PyTorch on a server with four NVIDIA A40 GPUs (48 GB memory each). ASWT-SGNN is trained with the Adam optimizer and a learning rate of 0.001. The specific hyperparameters are as follows: the number of sampling points in the spectral domain, $K$, is set to 20, the feature update ratio, $\alpha$, is set to 0.8, and the wavelet terms ratio, $\beta$, is set to 0.4.

Wavelet operator Approximation Experiments

The proposed method for approximating wavelet operators is applied to real-world graph data. We evaluate its performance using two distinct metrics: the similarity between the approximated wavelet filter $g_{\theta}$ and the exactly computed wavelet filter, and the Mean Absolute Error (MAE) between the approximated wavelet operator $\bm{\Psi}_{\theta}$ and the actual wavelet operator. Figure 2 presents a specific example comparing the exactly computed and the approximated multiscale filters, demonstrating a significant overlap between them. The MAEs for different scaled training sets are depicted in Figure 3. The experimental results indicate that the proposed wavelet operator approximation method reduces the approximation error in high-density spectral regions while avoiding the need for computationally expensive eigen-decomposition.

Figure 2: Wavelet filters $g_{\theta}$ for the Cora dataset obtained by the actual wavelet and by the polynomial approximation.

Figure 3: MAE between the approximated and the actual wavelet operator at the eigenvalues (left). MAE between the predicted labels and the actual labels of the nodes on the Cora dataset (right).

Node Classification

Table 1 displays the node classification accuracy results. It is worth noting that our proposed ASWT-SGNN achieves state-of-the-art (SOTA) performance on six of the eight graph benchmarks. Specifically, compared to other self-supervised methods across the eight datasets, ASWT-SGNN outperforms the best self-supervised method, MaskGAE, by an average of 0.8%, and it outperforms the worst self-supervised method, GMI, by 3.2%. Furthermore, experiments carried out in a semi-supervised setting reveal that the proposed method consistently outperforms the best semi-supervised benchmark, GCNII, by an average margin of 3.1%, and surpasses the best self-supervised benchmark, MaskGAE, by an average of 2.0%. These results further demonstrate the effectiveness of the proposed ASWT-SGNN in node classification tasks.

Ablation Studies

The proposed method incorporates two pivotal hyperparameters: $\alpha$ and $\beta$. By adjusting these parameters, ASWT-SGNN can be reduced to several simpler forms: when $\alpha$ is set to 1 and $\beta$ to 0, ASWT-SGNN aligns with GCN; setting both $\alpha$ and $\beta$ to 1 makes ASWT-SGNN behave similarly to GWNN (Xu et al. 2019); and when $\alpha\neq 1$ and $\beta$ is set to 0, ASWT-SGNN exhibits similarities to GCNII (Chen et al. 2020). We comprehensively compare ASWT-SGNN with these reduced models, namely GCN, GWNN, and GCNII. Clear observations can be made from Figure 4: as the number of layers increases, the performance of GCN and GWNN declines significantly. In contrast, the performance of GCNII and ASWT-SGNN remains relatively stable even with more layers stacked. Notably, due to the incorporation of graph wavelet bases, the ASWT-SGNN model outperforms GCNII in semi-supervised node classification tasks.

Figure 4: Ablation studies. The degree of over-smoothing under various parameter settings.

Figure 5: Classification accuracy of the proposed method at different parameter settings ($\alpha$ and $\beta$).

Figure 6: Intra-class distance heatmaps and node embedding visualizations on the Cora dataset: (a) ASWT-SGNN, (b) MA-GCL.

Other Experiments

To comprehensively investigate the impact of the parameters $\alpha$ and $\beta$ on model performance, we systematically vary their values from 0 to 1. The results are presented in Figure 5 and reveal several noticeable trends. When $\alpha$ is set to 0, indicating the exclusion of residual connections, accuracy decreases significantly. On the other hand, larger values of $\alpha$ result in fixed node representations, leading to lower performance. $\beta$ represents the proportion of the graph wavelet basis; if it is too small, the model may not effectively extract local information, while excessively large values of $\beta$ could cause global information to be neglected, ultimately reducing performance. These findings highlight the complex relationship between $\alpha$ and $\beta$ in shaping the model's performance. This delicate balance allows our model to effectively synthesize local and global information and achieve optimal performance.

Furthermore, we use t-SNE (Van der Maaten and Hinton 2008) to visualize the node embeddings. Figure 6 illustrates that, compared to MA-GCL, ASWT-SGNN exhibits more pronounced gaps between different classes. This suggests that ASWT-SGNN captures more detailed class information and clarifies the boundaries between samples from different classes. Further extensive experimental analyses, including sparsity, robustness, and uncertainty analyses, are presented in Appendix B.

Conclusion

This paper introduces an adaptive graph wavelet self-supervised neural network called ASWT-SGNN. By utilizing multiple wavelet scales, the model integrates different levels of localization on the graph, enabling the capture of elements beyond the low-frequency ones. To avoid the expensive eigen-decomposition in the spectral domain, the model employs a polynomial approximation of the wavelet operator. Comprehensive experimental results demonstrate the competitiveness of the proposed method against state-of-the-art GCL models on real graph datasets. As our framework applies to all message-passing GNNs and polynomial graph filters, we plan to extend its application to more intricate graph neural architectures. Moreover, we will also consider larger datasets as more GPU resources become available.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (No.62106259, No.62076234), Beijing Outstanding Young Scientist Program (NO.BJJWZYJH012019100020098), and Beijing Natural Science Foundation (No. 4222029).

References

  • Behmanesh et al. (2022) Behmanesh, M.; Adibi, P.; Ehsani, S. M. S.; and Chanussot, J. 2022. Geometric multimodal deep learning with multiscaled graph wavelet convolutional network. IEEE Transactions on Neural Networks and Learning Systems, 1–15.
  • Bruna et al. (2013) Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv:1312.6203.
  • Chen et al. (2020) Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; and Li, Y. 2020. Simple and deep graph convolutional networks. In International conference on machine learning, 1725–1735. PMLR.
  • Defferrard, Bresson, and Vandergheynst (2016) Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 3844–3852.
  • Di Napoli, Polizzi, and Saad (2016) Di Napoli, E.; Polizzi, E.; and Saad, Y. 2016. Efficient estimation of eigenvalue counts in an interval. Numerical Linear Algebra with Applications, 23(4): 674–692.
  • Ding et al. (2022) Ding, K.; Xu, Z.; Tong, H.; and Liu, H. 2022. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explorations Newsletter, 24(2): 61–77.
  • Du et al. (2019) Du, S. S.; Hou, K.; Salakhutdinov, R. R.; Poczos, B.; Wang, R.; and Xu, K. 2019. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems, 32.
  • Fan et al. (2020) Fan, T.; Shuman, D. I.; Ubaru, S.; and Saad, Y. 2020. Spectrum-adapted polynomial approximation for matrix functions with applications in graph signal processing. Algorithms, 13(11): 295.
  • Gong, Yang, and Shi (2023) Gong, X.; Yang, C.; and Shi, C. 2023. Ma-gcl: Model augmentation tricks for graph contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 4284–4292.
  • Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Advances in neural information processing systems, 30.
  • Hammond, Vandergheynst, and Gribonval (2011) Hammond, D. K.; Vandergheynst, P.; and Gribonval, R. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2): 129–150.
  • Hasanzadeh et al. (2019) Hasanzadeh, A.; Hajiramezanali, E.; Narayanan, K.; Duffield, N.; Zhou, M.; and Qian, X. 2019. Semi-implicit graph variational auto-encoders. Advances in neural information processing systems, 32.
  • Hassani and Khasahmadi (2020) Hassani, K.; and Khasahmadi, A. H. 2020. Contrastive multi-view representation learning on graphs. In International conference on machine learning, 4116–4126. PMLR.
  • He et al. (2022) He, D.; Liang, C.; Liu, H.; Wen, M.; Jiao, P.; and Feng, Z. 2022. Block modeling-guided graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 4022–4029.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9726–9735. IEEE Computer Society.
  • Hou et al. (2022) Hou, Z.; Liu, X.; Cen, Y.; Dong, Y.; Yang, H.; Wang, C.; and Tang, J. 2022. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 594–604.
  • Hutchinson (1989) Hutchinson, M. F. 1989. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3): 1059–1076.
  • Kaler et al. (2022) Kaler, T.; Stathas, N.; Ouyang, A.; Iliopoulos, A.-S.; Schardl, T.; Leiserson, C. E.; and Chen, J. 2022. Accelerating training and inference of graph neural networks with fast sampling and pipelining. Proceedings of Machine Learning and Systems, 4: 172–189.
  • Kipf and Welling (2016) Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
  • Kumar, Rawat, and Chauhan (2022) Kumar, P.; Rawat, P.; and Chauhan, S. 2022. Contrastive self-supervised learning: review, progress, challenges and future research directions. International Journal of Multimedia Information Retrieval, 11(4): 461–488.
  • Li et al. (2023) Li, J.; Wu, R.; Sun, W.; Chen, L.; Tian, S.; Zhu, L.; Meng, C.; Zheng, Z.; and Wang, W. 2023. What’s Behind the Mask: Understanding Masked Graph Modeling for Graph Autoencoders. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1268–1279.
  • Mernyei and Cangea (2020) Mernyei, P.; and Cangea, C. 2020. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv:2007.02901.
  • Peng et al. (2020) Peng, Z.; Huang, W.; Luo, M.; Zheng, Q.; Rong, Y.; Xu, T.; and Huang, J. 2020. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, 259–270.
  • Shervashidze et al. (2011) Shervashidze, N.; Schweitzer, P.; Van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(9).
  • Shi et al. (2020) Shi, M.; Tang, Y.; Zhu, X.; and Liu, J. 2020. Multi-label graph convolutional network representation learning. IEEE Transactions on Big Data, 8(5): 1169–1181.
  • Shuman et al. (2013) Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine, 30(3): 83–98.
  • Sun et al. (2019) Sun, F.-Y.; Hoffmann, J.; Verma, V.; and Tang, J. 2019. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv:1908.01000.
  • Suresh et al. (2021) Suresh, S.; Li, P.; Hao, C.; and Neville, J. 2021. Adversarial graph augmentation to improve graph contrastive learning. Advances in Neural Information Processing Systems, 34: 15920–15933.
  • Tang and Saad (2012) Tang, J. M.; and Saad, Y. 2012. A probing method for computing the diagonal of a matrix inverse. Numerical Linear Algebra with Applications, 19(3): 485–501.
  • Thakoor et al. (2021) Thakoor, S.; Tallec, C.; Azar, M. G.; Munos, R.; Veličković, P.; and Valko, M. 2021. Bootstrapped representation learning on graphs. In ICLR 2021 Workshop on Geometrical and Topological Representation Learning.
  • Tsitsvero, Barbarossa, and Di Lorenzo (2016) Tsitsvero, M.; Barbarossa, S.; and Di Lorenzo, P. 2016. Signals on graphs: Uncertainty principle and sampling. IEEE Transactions on Signal Processing, 64(18): 4845–4860.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  • Velickovic et al. (2017) Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y.; et al. 2017. Graph attention networks. arXiv:1710.10903.
  • Velickovic et al. (2019) Velickovic, P.; Fedus, W.; Hamilton, W. L.; Liò, P.; Bengio, Y.; and Hjelm, R. D. 2019. Deep Graph Infomax. ICLR (Poster), 2(3): 4.
  • Wang et al. (2022a) Wang, H.; Zhang, J.; Zhu, Q.; and Huang, W. 2022a. Augmentation-Free Graph Contrastive Learning with Performance Guarantee. arXiv:2204.04874.
  • Wang et al. (2022b) Wang, R.; Wang, X.; Shi, C.; and Song, L. 2022b. Uncovering the Structural Fairness in Graph Contrastive Learning. Advances in Neural Information Processing Systems, 35: 32465–32473.
  • Wang et al. (2019) Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5): 1–12.
  • Wang et al. (2016) Wang, Z.; Wang, C.; Pei, J.; Ye, X.; and Philip, S. Y. 2016. Causality Based Propagation History Ranking in Social Networks. In IJCAI, 3917–3923.
  • Xie et al. (2022) Xie, Y.; Xu, Z.; Zhang, J.; Wang, Z.; and Ji, S. 2022. Self-supervised learning of graph neural networks: A unified review. IEEE transactions on pattern analysis and machine intelligence, 45(2): 2412–2429.
  • Xu et al. (2019) Xu, B.; Shen, H.; Cao, Q.; Qiu, Y.; and Cheng, X. 2019. Graph wavelet neural network. arXiv:1904.07785.
  • Xu et al. (2018) Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
  • Yanardag and Vishwanathan (2015) Yanardag, P.; and Vishwanathan, S. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 1365–1374.
  • Yang, Cohen, and Salakhudinov (2016) Yang, Z.; Cohen, W.; and Salakhudinov, R. 2016. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, 40–48. PMLR.
  • Ying et al. (2018) Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31.
  • You et al. (2020) You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; and Shen, Y. 2020. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33: 5812–5823.
  • Yu, Lee, and Sohn (2020) Yu, B.; Lee, Y.; and Sohn, K. 2020. Forecasting road traffic speeds by considering area-wide spatio-temporal dependencies based on a graph convolutional neural network (GCN). Transportation research part C: emerging technologies, 114: 189–204.
  • Zeng et al. (2019a) Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. 2019a. Accurate, efficient and scalable graph embedding. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 462–471. IEEE.
  • Zeng et al. (2019b) Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. 2019b. Graphsaint: Graph sampling based inductive learning method. arXiv:1907.04931.
  • Zhang et al. (2021) Zhang, H.; Wu, Q.; Yan, J.; Wipf, D.; and Yu, P. S. 2021. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34: 76–89.
  • Zhu et al. (2021) Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; and Wang, L. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, 2069–2080.

Appendix A: Details of Methodology

Algorithm of ASWT-SGNN

Algorithm 1 illustrates the detailed steps of deploying ASWT-SGNN in an instantiation of GCL.

Algorithm 1 Implementation of ASWT-SGNN

Input: Feature matrix $\bm{X}$; Adjacency matrix $\bm{A}$; Laplacian matrix $\bm{L}$; Encoder $f_{\theta}$.
Parameter: Band-pass filter scale interval $[low_{b},high_{b}]$; Low-pass filter scale interval $[low_{l},high_{l}]$; Number of band-pass filters $L$; Number of sampling points in the spectral domain $K$; Number of Rademacher vectors $n_{r}$; Feature update ratio $\alpha$; Wavelet terms ratio $\beta$; Learning rate $\eta$.
Output: Trained encoder $f_{\theta^{*}}$

1:  Draw the low-pass filter scale $s_{0}$ from Uniform($low_{l}$, $high_{l}$);
2:  Draw the band-pass filter scales $\{s_{i}\}_{i=1}^{L}$ from Uniform($low_{b}$, $high_{b}$);
3:  Draw the spectral domain sampling points $\{\xi_{i}\}_{i=1}^{K}$ from the linear interval $[0,2]$;
4:  Calculate the spectral density $\bm{\omega}$ according to Eq. (13);
5:  Initialize the encoder $f_{\theta}$;
6:  for every epoch do
7:      Draw two graph augmentations $\bm{o}$ and $\bm{z}$;
8:      Calculate the wavelet filters $g_{\theta}$ according to Eq. (7);
9:      Calculate the approximate wavelet basis $\Psi_{\theta}$ according to Eq. (14);
10:      View encoding: $\bm{H}_{1}=f_{\theta}(\bm{o},\Psi_{\theta},\bm{A})$, $\bm{H}_{2}=f_{\theta}(\bm{z},\Psi_{\theta},\bm{A})$;
11:      Update parameters $\theta$, $\bm{G}$ and $\bm{W}$ by the contrastive loss $\mathcal{L}(\bm{H}_{1},\bm{H}_{2})$.
12:  end for
13:  return Encoder $f_{\theta^{*}}$

Spectral Density Estimation

During wavelet operator polynomial approximation, a crucial step involves computing the polynomial coefficients through a weighted least squares approximation. The selection of these weights hinges on their proportionality to the spectral density of the graph, defined as follows:

\omega(z)=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}=z).   (23)

Spectral density estimation aims to approximate this function without performing the expensive eigen-decomposition of the graph Laplacian. We opt for the kernel polynomial method to estimate the spectral density function. It involves initially determining an estimate for the cumulative spectral density function $\Omega(z)$ using the formula:

\Omega(z)=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}\leq z).   (24)

For each $\xi$ within the set of $\{\xi_{i}\}_{i=1}^{K}$ linearly spaced points across the spectral domain, we aim to ascertain the count of eigenvalues that are less than or equal to $\xi$. Achieving this goal entails approximating the trace of the eigen-projection $\bm{P}$ (Di Napoli, Polizzi, and Saad 2016), thereby facilitating the computation of an estimate of the eigenvalue count within the interval $[0,\xi_{i}]$.

\bm{P}=\sum_{\lambda_{j}\in[0,\xi_{i}]}\bm{u}_{j}\bm{u}_{j}^{T}.   (25)

The eigenvalues of the projector are restricted to being either zero or one. Consequently, the trace of the projector $\bm{P}$ is equivalent to the count of terms in Eq. (24), which corresponds to the number of eigenvalues within the range $[0,\xi_{i}]$. As a result, the computation of the count of eigenvalues situated within the interval $[0,\xi_{i}]$ is facilitated by determining the trace of the corresponding projector:

tr(\bm{P})=\sum_{j=1}^{N}\mathbb{I}(\lambda_{j}\leq\xi_{i}).   (26)

If the matrix $\bm{P}$ were explicitly provided, we would be capable of directly calculating its trace and obtaining an exact value for $tr(\bm{P})$. However, the projector $\bm{P}$ is often unavailable in practice. Nonetheless, it is feasible to approximate $\bm{P}$ inexpensively by representing it as either a polynomial or a rational function of the Laplacian matrix $\bm{L}$. In order to achieve this, we can interpret $\bm{P}$ as a step function of $\bm{L}$, specifically:

\bm{P}=h(\bm{L}),\quad\text{where}~h(\lambda)=\begin{cases}1&\text{if}~\lambda\leq\xi_{i}\\ 0&\text{otherwise}\end{cases}   (27)

The approximation of $h(\lambda)$ is now attainable using a finite sum $\phi(\lambda)$ comprised of Chebyshev polynomials. In this representation, it becomes feasible to estimate the trace of $\bm{P}$ using a stochastic estimator initially introduced by Hutchinson (Hutchinson 1989) and subsequently refined by more recent contributions (Tang and Saad 2012). Hutchinson's unbiased estimator employs matrix-vector products exclusively to approximate the trace of a matrix. The foundational principle hinges on applying independent and identically distributed (i.i.d.) Rademacher random variables. In this approach, each entry of the randomly generated vectors $\bm{R}$ assumes either $-1$ or $1$ with equal probability $1/2$. Hutchinson established a lemma demonstrating that $\mathbb{E}(\bm{R}^{\top}\bm{P}\bm{R})=tr(\bm{P})$. Consequently, an estimate of the trace $tr(\bm{P})$ is derived by generating $n_{r}$ samples of random vectors $\bm{R}_{k}$, where $k=1,\cdots,n_{r}$, and subsequently computing the average of $\bm{R}^{\top}\bm{P}\bm{R}$ across these samples.

tr(\bm{P})\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\bm{P}\bm{R}_{k}.   (28)

In the polynomial filtering approach, the step function h(λ)h(\lambda) is expanded into a degree mm Chebyshev polynomial series:

h(λ)ϕ(λ)=j=0mγjTj(λ),h(\lambda)\approx\phi(\lambda)=\sum_{j=0}^{m}\gamma_{j}T_{j}(\lambda), (29)

where Tj(λ)T_{j}(\lambda) are the jj-degree Chebyshev polynomials, and the coefficients γj\gamma_{j} are the expansion coefficients of the step function hh which are known to be

γj={1π(arccos(a)arccos(b)),ifj=02π(sin(jarccos(a))sin(jarccos(b))j),ifj>0.\gamma_{j}=\left\{\begin{array}[]{ll}\frac{1}{\pi}\left(\arccos(a)-\arccos(b)\right),&\text{if}~{}j=0\\ \frac{2}{\pi}\left(\frac{\sin(j\arccos(a))-\sin(j\arccos(b))}{j}\right),&\text{if}~{}j>0.\end{array}\right. (30)

As a result, we obtain an expansion of 𝑷\bm{P} into matrices Tj(𝑳)T_{j}(\bm{L}).

h(𝑳)ϕ(𝑳)=j=0mγjTj(𝑳).h(\bm{L})\approx\phi(\bm{L})=\sum_{j=0}^{m}\gamma_{j}T_{j}(\bm{L}). (31)

To mitigate the adverse oscillations near the boundaries (Gibbs oscillations) that arise from expanding h(λ), a common practice is to introduce damping multipliers known as Jackson coefficients. Eq. (31) is then replaced by:

𝑷=h(𝑳)ϕ(𝑳)=j=0mgjmγjTj(𝑳).\bm{P}=h(\bm{L})\approx\phi(\bm{L})=\sum_{j=0}^{m}g_{j}^{m}\gamma_{j}T_{j}(\bm{L}). (32)

Note that the matrix polynomial for the conventional Chebyshev approach has the same form as above, with the Jackson coefficients g_j^m uniformly set to one, so we use the same notation for both expansions. The Jackson coefficients are given by:

gjm=(1jm+2)sin(αm)cos(jαm)sin(αm)+1m+2cos(αm)sin(jαm)sin(αm),\begin{split}g_{j}^{m}=&\frac{\left(1-\frac{j}{m+2}\right)\sin(\alpha_{m})\cos(j\alpha_{m})}{\sin(\alpha_{m})}\\ &+\frac{\frac{1}{m+2}\cos(\alpha_{m})\sin(j\alpha_{m})}{\sin(\alpha_{m})},\end{split} (33)

where αm=πm+2\alpha_{m}=\frac{\pi}{m+2}. It is noteworthy that these coefficients can also be expressed in a slightly more concise form:

gjm=sin(j+1)αm(m+2)sinαm+(1j+1m+2)cos(jαm).g_{j}^{m}=\frac{\sin(j+1)\alpha_{m}}{(m+2)\sin\alpha_{m}}+\left(1-\frac{j+1}{m+2}\right)\cos(j\alpha_{m}). (34)
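Since Eqs. (33) and (34) are algebraically equivalent, a quick numerical check can be useful; the sketch below (illustrative only, not taken from the released code) computes the Jackson coefficients from Eq. (34) and verifies them against the longer form in Eq. (33).

import numpy as np

def jackson_coefficients(m):
    """Jackson damping factors g_j^m, j = 0..m, from Eq. (34)."""
    alpha = np.pi / (m + 2)
    j = np.arange(m + 1)
    return (np.sin((j + 1) * alpha) / ((m + 2) * np.sin(alpha))
            + (1 - (j + 1) / (m + 2)) * np.cos(j * alpha))

def jackson_coefficients_long(m):
    """Same factors, written as in Eq. (33)."""
    alpha = np.pi / (m + 2)
    j = np.arange(m + 1)
    return ((1 - j / (m + 2)) * np.sin(alpha) * np.cos(j * alpha)
            + np.cos(alpha) * np.sin(j * alpha) / (m + 2)) / np.sin(alpha)

print(np.allclose(jackson_coefficients(8), jackson_coefficients_long(8)))  # True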

Substituting the expression for ϕ(𝑳) into the stochastic estimator (Eq. (28)) yields the estimate:

\sum_{j=1}^{N}\{\mathbb{I}(\lambda_{j}\leq\xi_{i})\}=tr(\bm{P})\approx\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\left[\sum_{j=0}^{m}g_{j}^{m}\gamma_{j}\bm{R}_{k}^{T}T_{j}(\bm{L})\bm{R}_{k}\right]. (35)

An approximation 𝛀 of the cumulative spectral density function is then obtained by monotone piecewise-cubic interpolation of these estimates at the points ξ_i:

𝛀=({(ξi,1N[1nrk=1nr𝑹kϕ(𝑳sym)𝑹k])}i=1K).\bm{\Omega}=\mathcal{I}\left(\left\{\left(\xi_{i},\frac{1}{N}\left[\frac{1}{n_{r}}\sum_{k=1}^{n_{r}}\bm{R}_{k}^{\top}\phi({\bm{L}_{sym}})\bm{R}_{k}\right]\right)\right\}_{i=1}^{K}\right). (36)

Finally, the approximation 𝝎 of the spectral density is obtained by differentiating 𝛀 with respect to 𝝃.
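To make the overall procedure concrete, the following Python sketch strings Eqs. (28)-(36) together using NumPy and SciPy. It assumes the symmetric Laplacian has already been affinely rescaled so that its spectrum lies in [−1,1]; the function names, probe counts, and this rescaling convention are illustrative assumptions rather than details of the released implementation.

import numpy as np
from scipy.interpolate import PchipInterpolator

def step_coefficients(a, b, m):
    """Chebyshev expansion coefficients of the step function on [a, b] (Eq. (30))."""
    j = np.arange(1, m + 1)
    gamma = np.empty(m + 1)
    gamma[0] = (np.arccos(a) - np.arccos(b)) / np.pi
    gamma[1:] = (2.0 / np.pi) * (np.sin(j * np.arccos(a)) - np.sin(j * np.arccos(b))) / j
    return gamma

def jackson_damping(m):
    """Jackson damping factors g_j^m (Eq. (34))."""
    alpha = np.pi / (m + 2)
    j = np.arange(m + 1)
    return (np.sin((j + 1) * alpha) / ((m + 2) * np.sin(alpha))
            + (1 - (j + 1) / (m + 2)) * np.cos(j * alpha))

def eigenvalue_count(L, b, m=50, n_r=10, seed=0):
    """Jackson-Chebyshev estimate of the number of eigenvalues of L in [-1, b] (Eq. (35))."""
    rng = np.random.default_rng(seed)
    N = L.shape[0]
    coef = jackson_damping(m) * step_coefficients(-1.0, b, m)
    total = 0.0
    for _ in range(n_r):
        R = rng.choice([-1.0, 1.0], size=N)      # Rademacher probe vector
        T_prev, T_curr = R, L @ R                # T_0(L)R and T_1(L)R
        moments = [R @ T_prev, R @ T_curr]
        for _ in range(2, m + 1):
            T_prev, T_curr = T_curr, 2 * (L @ T_curr) - T_prev   # Chebyshev recurrence
            moments.append(R @ T_curr)
        total += float(np.dot(coef, moments))
    return total / n_r

def cumulative_spectral_density(L, K=20, **kw):
    """Monotone piecewise-cubic interpolant of the normalized counts (Eq. (36))."""
    N = L.shape[0]
    xi = np.linspace(-1.0, 1.0, K)
    counts = np.array([eigenvalue_count(L, float(b), **kw) for b in xi]) / N
    return PchipInterpolator(xi, counts)

# Example: for the symmetric normalized Laplacian (spectrum in [0, 2]),
# L_resc = L_sym - I has spectrum in [-1, 1] and can be passed directly.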

Proof of Lemma 1

Lemma 1.

Consider nodes v_a and v_b in a graph that are similar in features or labels, and suppose their k-hop neighborhoods admit a one-to-one mapping π, i.e., 𝒩_k(b)=π(𝒩_k(a)). Then |𝚿_am−𝚿_bπ(m)|≤2ϵ, where 𝚿_am denotes the wavelet coefficient between nodes v_a and v_m.

Proof.

The kernel function g used in the graph wavelet transform is defined on the closed interval [0,λ_N]. By the Stone-Weierstrass approximation theorem, any continuous function on a closed interval can be uniformly approximated by a polynomial:

ϵ>0,P(λ)=j=0mαjλj:|g(λ)P(λ)|ϵ.\begin{split}\forall\epsilon>0,\exists P(\lambda)=\sum_{j=0}^{m}\alpha_{j}\lambda^{j}:|g(\lambda)-P(\lambda)|\leq\epsilon.\end{split} (37)

Letting r(λ)=g(λ)−P(λ), we have:

𝚿=(j=0mαj𝑳j)+𝑼r(𝚲)𝑼T.\bm{\Psi}=\left(\sum_{j=0}^{m}\alpha_{j}\bm{L}^{j}\right)+\bm{U}r(\bm{\Lambda})\bm{U}^{T}. (38)

The wavelet coefficients can thus be expressed as a polynomial in the Laplacian matrix plus a residual term. Expanding the 𝑳^j term in the above equation with the binomial theorem gives:

\begin{split}\bm{L}^{j}=(\bm{D}-\bm{A})^{j}=\sum_{i=0}^{j}(-1)^{j-i}C_{j}^{i}\bm{D}^{i}\bm{A}^{j-i},\end{split} (39)

where C_j^i is the binomial coefficient. If the shortest-path distance between two nodes v_a and v_b is greater than j, then 𝑳^j_{ab}=0.
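This locality property is easy to verify numerically. The sketch below (illustrative only) builds the Laplacian of a 6-node path graph and checks that the (a,b) entry of 𝑳^j vanishes whenever the shortest-path distance between v_a and v_b exceeds j.

import numpy as np

# Path graph on 6 nodes: node i is connected to node i+1.
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# The shortest-path distance between nodes 0 and 4 is 4,
# so the (0, 4) entry of L^j is zero for j = 1, 2, 3 and nonzero for j >= 4.
for j in range(1, 6):
    Lj = np.linalg.matrix_power(L, j)
    print(j, abs(Lj[0, 4]) > 1e-12)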

Next, we bound the effect of the approximation residual r(𝚲). Since the eigenvector matrix 𝑼 is orthonormal and |r(λ)|≤ϵ, the Cauchy-Schwarz inequality gives:

|δaT𝑼r(𝚲)𝑼Tδb|2(=1N|r(λ)|2𝑼a2)(=1N𝑼b2)ϵ2.\begin{split}\left|\delta_{a}^{T}\bm{U}r(\bm{\Lambda})\bm{U}^{T}\delta_{b}\right|^{2}\leq\left(\sum_{\ell=1}^{N}\left|r\left(\lambda_{\ell}\right)\right|^{2}\bm{U}_{a\ell}^{2}\right)\left(\sum_{\ell=1}^{N}\bm{U}_{b\ell}^{2}\right)\leq\epsilon^{2}.\end{split} (40)

Using Eq. (38), we write the difference between each pair of mapped coefficients 𝚿_am and 𝚿_bπ(m) in terms of the polynomial approximation of the graph Laplacian:

\begin{split}\left|\bm{\Psi}_{ma}-\bm{\Psi}_{\pi(m)b}\right|&=\left|\delta_{m}^{T}\bm{U}(P(\bm{\Lambda})+r(\bm{\Lambda}))\bm{U}^{T}\delta_{a}-\delta_{\pi(m)}^{T}\bm{U}(P(\bm{\Lambda})+r(\bm{\Lambda}))\bm{U}^{T}\delta_{b}\right|\\ &\leq\left|\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{ma}-\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b}\right|+\left|\left(\bm{U}r(\bm{\Lambda})\bm{U}^{T}\right)_{ma}\right|+\left|\left(\bm{U}r(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b}\right|.\end{split} (41)

Since the k-hop neighborhoods around v_a and v_b are mapped onto each other by π, the following holds:

\begin{split}\forall m\in\mathcal{N}_{k}(a),\ \left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{ma}&=\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b},\\ \forall m\notin\mathcal{N}_{k}(a),\ \left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{ma}&=\left(\bm{U}P(\bm{\Lambda})\bm{U}^{T}\right)_{\pi(m)b}=0.\end{split} (42)

Therefore, we have |𝚿am𝚿bπ(m)|2ϵ|\bm{\Psi}_{am}-\bm{\Psi}_{b\pi(m)}|\leq 2\epsilon according to Eq. (42). ∎

Proof of Theorem 1

Theorem 1.

Consider a graph satisfying the above graph assumption and Lemma 1. Then the expectation of the embedding is given by:

𝔼[𝑯i]=𝑾𝔼yP(yi),𝒙Py(𝒙)[𝚪i𝒙],\begin{split}\mathbb{E}\left[\bm{H}_{i}\right]=\bm{W}\mathbb{E}_{y\sim P\left(y_{i}\right),\bm{x}\sim P_{y}(\bm{x})}[\bm{\Gamma}_{i}\bm{x}],\end{split} (43)

where 𝚪=𝚿𝐆𝚿\bm{\Gamma}=\bm{\Psi}\bm{G}\bm{\Psi}^{\top}. Thus, with probability at least 1δ1-\delta over the distribution for the graph, we have:

𝑯i𝔼[𝑯i]2σmax2(𝑾)plog(2p/δ)2𝑵𝚪i𝒙ψ2,\begin{split}\left\|\bm{H}_{i}-\mathbb{E}\left[\bm{H}_{i}\right]\right\|_{2}\leq\sqrt{\frac{\sigma_{\max}^{2}(\bm{W})p\log(2p/\delta)}{2\bm{N}\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}}},\end{split} (44)

where the sub-Gaussian norm ‖𝚪_i𝒙‖_{ψ_2} ≡ min_{d∈[1,p]} ‖𝚪_i𝒙_{i,d}‖_{ψ_2}, and σ_max(𝑾) is the largest singular value of 𝑾.

Proof.

We consider a Gaussian mixture model for the node features. For the sake of simplicity, we focus on the binary classification problem. Given the (binary) label yy and a latent vector 𝝁𝒩(𝟎,𝑰p/p)\boldsymbol{\mu}\sim\mathcal{N}\left(\bm{0},\bm{I}_{p}/p\right), where the identity matrix 𝑰pp×p\bm{I}_{p}\in\mathbb{R}^{p\times p}, the features are governed by: 𝒙i=yi𝝁+𝒒ip\bm{x}_{i}=y_{i}\boldsymbol{\mu}+\frac{\bm{q}_{i}}{\sqrt{p}}. Here, the random variable 𝒒ip\bm{q}_{i}\in\mathbb{R}^{p} has independent standard normal entries, and yi{1,1}y_{i}\in\{-1,1\} represents latent classes with abuse of notation. The dimension of the features, denoted as pp, remains constant. Subsequently, the features of nodes with class yiy_{i} follow the same distribution, contingent on yiy_{i}, i.e., xiPyi(x)x_{i}\sim P_{y_{i}}(x).
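For intuition, the following short sketch samples node features from this two-class Gaussian mixture; the labels, latent vector μ, and dimension p follow the definitions above, while the concrete sizes are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 16                                   # number of nodes and feature dimension
mu = rng.normal(0.0, 1.0 / np.sqrt(p), size=p)   # mu ~ N(0, I_p / p)
y = rng.choice([-1, 1], size=N)                  # binary latent classes
Q = rng.standard_normal((N, p))                  # independent standard normal entries
X = y[:, None] * mu[None, :] + Q / np.sqrt(p)    # x_i = y_i * mu + q_i / sqrt(p)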

When the scale is θ\theta, the sum of wavelet coefficients at node ii has the following form:

j=1N𝚿ij=j=1N=1Ngθ(λ)𝑼T(i)𝑼(j)==1Ngθ(λ)𝑼T(i)j=1N𝑼(j).\begin{split}\sum_{j=1}^{N}\bm{\Psi}_{ij}&=\sum_{j=1}^{N}\sum_{\ell=1}^{N}g_{\theta}\left(\lambda_{\ell}\right)\bm{U}_{\ell}^{T}(i)\bm{U}_{\ell}(j)\\ &=\sum_{\ell=1}^{N}g_{\theta}\left(\lambda_{\ell}\right)\bm{U}_{\ell}^{T}(i)\sum_{j=1}^{N}\bm{U}_{\ell}(j).\end{split} (45)

Because the entries of any eigenvector associated with a non-zero eigenvalue of the Laplacian sum to zero, all terms with ℓ>1 in Eq. (45) vanish, and only the term corresponding to the zero eigenvalue remains.

The zero eigenvalue represents the direct-current (DC) component of the graph; its eigenvector is constant and, after normalization, equals 𝑼_1=[1/√N,…,1/√N]. Substituting this into Eq. (45) gives:

j=1N𝚿ij=g(0)=1\begin{split}\sum_{j=1}^{N}\bm{\Psi}_{ij}=g(0)=1\end{split} (46)
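Eq. (46) is easy to verify numerically for a kernel with g(0)=1, such as the heat kernel, on a connected graph; the toy graph and scale below are illustrative choices.

import numpy as np

# Small connected graph: a 5-node cycle.
N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)
theta = 0.7
Psi = U @ np.diag(np.exp(-theta * lam)) @ U.T   # heat-kernel wavelet, g(0) = 1

# On a connected graph every row of Psi sums to g(0) = 1.
print(np.allclose(Psi.sum(axis=1), 1.0))        # True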

Next, we calculate the expectation of the aggregated embedding:

\begin{split}\mathbb{E}\left[\bm{H}_{i}\right]&=\mathbb{E}\left[\bm{W}\sum_{j=1}^{N}\bm{\Gamma}_{ij}\bm{x}_{j}\right]\\ &=\bm{W}\mathbb{E}_{y\sim P\left(y_{i}\right),\bm{x}\sim P_{y}(\bm{x})}[\bm{\Gamma}_{i}\bm{x}],\end{split} (47)

where 𝚪=𝚿𝑮𝚿^⊤. This equality relies on the graph data assumption that 𝒙_j∼P_{y_i}(𝒙) for every j. We now provide a concentration analysis. Because each feature 𝒙_i is a sub-Gaussian variable, Hoeffding’s inequality implies that, with probability at least 1−δ′, for each d∈[1,p] we have

|1𝑵j(𝚪i𝒙j,d𝔼[𝚪i𝒙j,d])|log(2/δ)2𝑵𝚪i𝒙j,dψ2,\begin{split}\left|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right|\leq\sqrt{\frac{\log\left(2/\delta^{\prime}\right)}{2\bm{N}\left\|\bm{\Gamma}_{i}\bm{x}_{j,d}\right\|_{\psi_{2}}}},\end{split} (48)

where ‖𝚪_i𝒙_{j,d}‖_{ψ_2} is the sub-Gaussian norm of 𝚪_i𝒙_{j,d}. Furthermore, because the dimensions of 𝒙_j are mutually independent, we define ‖𝚪_i𝒙_j‖_{ψ_2}=‖𝚪_i𝒙_{j,d}‖_{ψ_2}. We then apply a union bound over the p feature dimensions by setting δ=pδ′. Thus, with probability at least 1−δ, we have

1𝑵j(𝚪i𝒙j,d𝔼[𝚪i𝒙j,d])2p|1𝑵j(𝚪i𝒙j,d𝔼[𝚪i𝒙j,d])|plog(2p/δ)2𝑵𝚪i𝒙ψ2.\begin{split}&\left\|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right\|_{2}\\ \leq&\sqrt{p}\left|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right|\\ \leq&\sqrt{\frac{p\log(2p/\delta)}{2\bm{N}\|\bm{\Gamma}_{i}\bm{x}\|_{\psi_{2}}}}.\end{split} (49)

Finally, plugging the weight matrix into the inequality yields

\begin{split}\left\|\bm{H}_{i}-\mathbb{E}\left[\bm{H}_{i}\right]\right\|\leq\sigma_{\max}(\bm{W})\left\|\frac{1}{\bm{N}}\sum_{j}\left(\bm{\Gamma}_{i}\bm{x}_{j,d}-\mathbb{E}\left[\bm{\Gamma}_{i}\bm{x}_{j,d}\right]\right)\right\|_{2},\end{split} (50)

where σmax\sigma_{max} is the largest singular value of the weight matrix. ∎

Datasets Cora CiteSeer Pubmed CS Photo Computers Physics WikiCS
Nodes 2708 3327 19717 18333 7650 13752 34493 11701
Edges 5429 4732 44338 81894 119081 245861 247962 216123
Features 1433 3703 500 6805 745 767 8415 300
Classes 5 6 3 15 8 10 5 10
Density of 𝑼\bm{U} 99.15% 95.81% 99.99% 100% 99.66% 99.55% 99.83% 99.20%
Density of 𝚿\bm{\Psi} 0.18% 0.11% 0.03% 0.05% 0.42% 0.27% 0.72% 0.32%
Table 2: The Statistics of Datasets.

Appendix B: Additional Experiments

Datasets

We evaluate our approach on eight datasets widely used in previous GCL methods: Cora, CiteSeer, PubMed, Amazon-Photo, Amazon-Computers, Coauthor-CS, Coauthor-Physics, and WikiCS. The statistics of the datasets are summarized in Table 2.

  • Cora, CiteSeer, PubMed (Yang, Cohen, and Salakhudinov 2016): These datasets are widely used for node classification tasks. They consist of citation networks where nodes represent papers, and edges denote citation relationships.

  • Amazon-Computers and Amazon-Photo (Suresh et al. 2021): These datasets are derived from Amazon’s co-purchase relationships. Nodes correspond to products, and edges indicate frequent co-purchases. Each dataset involves product categorization with 10 and 8 classes based on the product category. The node features are bag-of-words representations of product reviews.

  • Coauthor-CS and Coauthor-Physics (Suresh et al. 2021): These academic networks are constructed from co-authorship relationships in the Microsoft Academic Graph. Nodes denote authors, and edges signify co-authored relationships. The datasets involve author classification into 15 and 5 classes based on research fields. Node features are represented as bag-of-words from paper keywords.

  • WikiCS (Mernyei and Cangea 2020): This reference network is constructed from Wikipedia references. Nodes represent computer science articles, and edges denote hyperlinks between articles. Articles are labeled with ten subfields, and their features consist of the average GloVe embeddings of all words in the article.
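For reference, all eight datasets are available through standard loaders; a minimal loading sketch is given below, assuming the PyTorch Geometric dataset wrappers (the root paths are placeholders, and PyG stores each undirected edge twice, so raw edge counts may differ from Table 2).

from torch_geometric.datasets import Planetoid, Amazon, Coauthor, WikiCS

datasets = {
    "Cora":      Planetoid(root="data/Planetoid", name="Cora"),
    "CiteSeer":  Planetoid(root="data/Planetoid", name="CiteSeer"),
    "PubMed":    Planetoid(root="data/Planetoid", name="PubMed"),
    "Photo":     Amazon(root="data/Amazon", name="Photo"),
    "Computers": Amazon(root="data/Amazon", name="Computers"),
    "CS":        Coauthor(root="data/Coauthor", name="CS"),
    "Physics":   Coauthor(root="data/Coauthor", name="Physics"),
    "WikiCS":    WikiCS(root="data/WikiCS"),
}

for name, ds in datasets.items():
    data = ds[0]  # each dataset contains a single graph
    print(name, data.num_nodes, data.num_edges, data.num_node_features)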

Figure 7: The Mexican Hat wavelet transform of a δ signal on the local node. With different scales, the wavelet can capture different neighborhood information weighted continuously.
Figure 8: The Heat Kernel wavelet transform of a δ signal on the local node. With different scales, the wavelet can capture different neighborhood information weighted continuously.
Figure 9: Illustrations of the spatial and spectral domains trade-off in wavelet basis with different scaling parameters s. (a) Low-pass filter. (b) Band-pass filter. (c) Combined.

Baselines

To ensure consistency, we rely on the code provided by the original authors for all baseline methods, carefully following their recommendations when setting the respective hyperparameters. When specific guidance is lacking, we exercise our judgment to make appropriate adjustments.

  • GCN: https://github.com/tkipf/gcn

  • GAT: https://github.com/PetarV-/GAT

  • GWNN: https://github.com/Eilene/GWNN

  • GCNII: https://github.com/chennnM/GCNII

  • GMI: https://github.com/zpeng27/GMI

  • MVGRL: https://github.com/kavehhassani/mvgrl

  • GCA-SSG: https://github.com/hengruizhang98/CCA-SSG

  • GRADE: https://github.com/BUPT-GAMMA/Uncovering-the-Structural-Fairness-in-Graph-Contrastive-Learning

  • AF-GCL: https://github.com/Namkyeong/AFGRL

  • MA-GCL: https://github.com/GXM1141/MA-GCL

Parameter Setting

We provide the hyperparameter settings used for all eight datasets below; they are also collected into a configuration sketch after the list.

  • Number of band-pass filters: LL=3

  • Number of scales: 5

  • Number of sampling points in the spectral domain: KK=20

  • Number of Rademacher vectors: nrn_{r}=10

  • The ratio of feature updates: α\alpha=0.8

  • The ratio of the graph wavelet terms: β\beta=0.4

  • Feature discard ratio: fdf_{d}=0.2

  • Order of polynomial approximation: mm=3

  • Band-pass filter scale interval: [0,5]

  • Low-pass filter scale interval: [4,6]

  • Learning rate: η\eta=0.001

  • Hidden size: 256

  • Projector size: 128

  • Activation function: ReLU
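For convenience, the same settings can be gathered into a single configuration object; the key names below are illustrative and do not necessarily match the released code.

# Hyperparameters from the list above, gathered into one place
# (key names are illustrative, not taken from the released code).
config = {
    "num_band_pass_filters": 3,        # L
    "num_scales": 5,
    "num_spectral_samples": 20,        # K
    "num_rademacher_vectors": 10,      # n_r
    "feature_update_ratio": 0.8,       # alpha
    "wavelet_term_ratio": 0.4,         # beta
    "feature_drop_ratio": 0.2,         # f_d
    "polynomial_order": 3,             # m
    "band_pass_scale_interval": (0, 5),
    "low_pass_scale_interval": (4, 6),
    "learning_rate": 1e-3,             # eta
    "hidden_size": 256,
    "projector_size": 128,
    "activation": "ReLU",
}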

Sparsity Analysis

We computed the density of the Fourier transform’s eigenvector matrix and of the wavelet transform’s coefficient matrix across the eight datasets. The results in Table 2 show that the wavelet transform is far sparser, with density reduced by roughly two orders of magnitude compared with the Fourier basis. This suggests that the graph wavelet transform offers superior computational efficiency over the graph Fourier transform.
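The densities reported in Table 2 can be reproduced along the following lines; the toy graph, wavelet scale, and the threshold used to decide whether an entry counts as nonzero are assumptions of this sketch.

import numpy as np

def density(M, tol=1e-4):
    """Fraction of entries with magnitude above tol (the tolerance is an assumption)."""
    return float(np.mean(np.abs(M) > tol))

# Toy example: Laplacian of a 30-node ring graph; for the benchmark graphs the
# same computation is applied to the (normalized) Laplacian of each dataset.
N = 30
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)
Psi = U @ np.diag(np.exp(-1.0 * lam)) @ U.T     # heat-kernel wavelet at scale s = 1
print(f"density of U:   {density(U):.2%}")
print(f"density of Psi: {density(Psi):.2%}")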

Uncertainty analysis (Localization analysis)

To explore the trade-off between spatial and spectral localizations, we can start with an impulse signal δi\delta_{i} perfectly localized at the focal node viv_{i}. Theoretically, as ss approaches 0, the wavelet transform becomes the identity matrix, allowing us to recover the original signal with perfect graph concentration. On the other hand, as ss approaches infinity, the graph signal is completely mapped to the lowest frequency, sacrificing graph localization in favor of frequency band concentrations.

For example, consider the Stanford Bunny graph (Figures 7 and 8). The degree of diffusion for Heat Kernel and Mexican Hat wavelets with s=10 is much smaller than with s=50. By choosing s=10, we achieve better frequency localization at the expense of graph localization. Using the notation from Eq. (3) in the body of the paper, we employ a graph-limiting operator that covers the direct neighborhood of the focal vertex, with |𝒮|=6, resulting in concentration in the graph domain.

Figure 9 depicts the trade-off between the spatial and spectral domains of different wavelet bases at various scale parameters using the same dataset as shown in Figure 7. Further investigation reveals that a more balanced choice of scale parameter would be in the [1,3][1,3] range. Within this range, we can significantly sacrifice a small portion of graph localization to improve frequency localization. In practical scenarios, as demonstrated in the experiment section, the selection of the scale parameter depends on the connectivity of the underlying graph topology and the distribution of training labels in both the spatial and spectral domains. This trade-off between spatial and spectral domains also sheds light on the fundamental limitations of graph signal aggregations. Previous studies have discovered that stacking multiple layers of GCN can lead to over-smoothing problems. From the perspective of the uncertainty principle, deeper networks result in a broader spread of signals in the graph domain, which limits the frequency band’s width. This leads to the dominance of low-frequency signals, also known as smooth signals. However, we can precisely control this trade-off with wavelet bases and achieve better results across different datasets.

Formally, the concentration of the wavelet atom in the graph domain on the vertex set 𝒮 is measured by

a=\frac{\left\|\bm{B}\bm{U}g_{s}(\bm{\Lambda})\bm{U}^{\top}\delta_{i}\right\|^{2}_{2}}{\left\|\bm{U}g_{s}(\bm{\Lambda})\bm{U}^{\top}\delta_{i}\right\|^{2}_{2}}, (51)

where 𝑩 denotes the graph-limiting operator introduced above.

Solving for the maximal spectral-domain concentration b for a given a, the maximal possible concentration in the spectral domain becomes (Tsitsvero, Barbarossa, and Di Lorenzo 2016):

b=aλmax+(1a2)(1λmax2).b=a\lambda_{max}+\sqrt{(1-a^{2})(1-\lambda_{max}^{2})}. (52)
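A minimal numerical illustration of Eqs. (51)-(52) is given below; the ring graph stands in for the Stanford Bunny graph, the vertex set 𝒮 and scale are arbitrary, and λ_max is treated as a given scalar (its precise definition via the joint band- and vertex-limiting operator follows the cited reference and is an assumption of this sketch).

import numpy as np

# Toy stand-in for the graphs in Figures 7-9: a 30-node ring.
N = 30
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

s = 10.0                                         # wavelet scale
psi_i = U @ (np.exp(-s * lam) * U[0, :])         # U g_s(Lambda) U^T delta_i for focal node i = 0

# Vertex-limiting operator B: indicator of the focal node and its direct neighbours.
S = [0, 1, N - 1]
B = np.diag(np.isin(np.arange(N), S).astype(float))

a = np.linalg.norm(B @ psi_i) ** 2 / np.linalg.norm(psi_i) ** 2    # Eq. (51)

# lam_max: largest singular value of the joint band- and vertex-limiting operator in the
# cited reference; treated here as a given scalar (an assumption of this sketch).
lam_max = 0.9
b = a * lam_max + np.sqrt((1.0 - a ** 2) * (1.0 - lam_max ** 2))   # Eq. (52)
print(f"graph-domain concentration a = {a:.3f}, spectral bound b = {b:.3f}")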
Figure 10: The visualization of learned features of the Cora dataset during training ASWT-SGNN. Panels (a)-(d) and (e)-(h) correspond to epochs 1, 100, 300, and 500.

Graph Classification

For graph classification, we conduct experiments on three benchmarks: MUTAG, NCI1, and IMDB-BINARY (Yanardag and Vishwanathan 2015). Each dataset is a collection of graphs where each graph is associated with a label.

For the evaluation protocol, after generating graph embeddings with ASWT-SGNN’s encoder and readout function, we feed the encoded graph-level representations into a downstream classifier to predict the label and report the mean 10-fold cross-validation accuracy with standard deviation over five runs.
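One possible reading of this protocol is sketched below with scikit-learn, assuming graph-level embeddings Z and labels y have already been produced by the encoder and readout; the downstream classifier choice and the way the five runs are aggregated are our illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_graph_embeddings(Z, y, n_runs=5, seed=0):
    """Mean and std of 10-fold cross-validation accuracy over n_runs repetitions."""
    accs = []
    for r in range(n_runs):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + r)
        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, Z, y, cv=cv, scoring="accuracy")
        accs.append(scores.mean())
    return float(np.mean(accs)), float(np.std(accs))

# Example with random placeholders for encoder outputs (MUTAG has 188 graphs).
Z = np.random.randn(188, 128)
y = np.random.randint(0, 2, size=188)
print(evaluate_graph_embeddings(Z, y))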

In addition to the classical graph kernel methods, namely the Weisfeiler-Lehman subtree kernel (WL) (Shervashidze et al. 2011) and GNTK (Du et al. 2019), we compare ASWT-SGNN with three supervised methods, GIN (Xu et al. 2018), GraphSAGE (Hamilton, Ying, and Leskovec 2017), and DGCNN (Wang et al. 2019), as well as two self-supervised methods, GraphCL (You et al. 2020) and MVGRL (Hassani and Khasahmadi 2020). The results are presented in Table 3. ASWT-SGNN outperforms all baselines on two datasets and achieves competitive performance on the remaining one. These results suggest that ASWT-SGNN learns meaningful information and holds promise for graph-level tasks.

Table 3: Accuracy (%) on the three datasets for the graph classification. The best result is bold, and the second best is underlined.

Method MUTAG NCI1 IMDB-BINARY Avg.
WL 88.7±7.3 76.7±2.1 72.8±3.9 79.4
GNTK 89.3±8.5 76.8±4.9 75.9±4.6 80.6
GIN 89.4±5.6 80.0±2.4 75.1±2.8 81.5
DGCNN 85.9±1.7 74.1±3.1 69.2±3.4 76.4
GraphSAGE 86.6±1.5 74.8±2.3 68.9±4.5 76.7
GraphCL 86.8±1.3 77.9±2.4 71.1±2.1 78.6
MVGRL 89.4±1.9 79.1±1.7 74.2±2.7 80.9
Ours 89.7±5.2 80.2±3.8 75.7±3.1 81.9

Robustness analyses

To assess the robustness of ASWT-SGNN to adversarial graphs, we conducted experiments on the Cora and Citeseer datasets. We applied random feature masking to the original graphs, varying the ratio of masked features from 0 to 0.8 to simulate different attack intensities. We compared against GRADE, GCA-SSG, and MA-GCL. The results, shown in Figure 11, consistently demonstrate ASWT-SGNN’s superior performance over the other methods. Notably, the performance gains grow as the feature masking rate increases, which suggests that ASWT-SGNN is particularly robust against severe feature attacks.
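The feature-masking attack itself is straightforward to reproduce; the sketch below zeroes a random fraction of feature entries, with the masking granularity (per entry rather than per dimension) being an assumption of this illustration.

import numpy as np

def mask_features(X, mask_ratio, seed=0):
    """Zero out a random subset of feature entries (masking granularity is an assumption)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < mask_ratio
    X_attacked = X.copy()
    X_attacked[mask] = 0.0
    return X_attacked

# Sweep attack intensities from 0 to 0.8 as in Figure 11.
X = np.random.rand(2708, 1433)                 # placeholder for the Cora feature matrix
for ratio in np.arange(0.0, 0.81, 0.2):
    X_adv = mask_features(X, ratio)
    print(ratio, float((X_adv == 0).mean()))   # achieved masking rate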

Figure 11: Test accuracy when graphs are perturbed by feature masking.

Visualization

For ASWT-SGNN, we use heatmaps to show the intra-class and inter-class distances during training on the Cora dataset, as shown in Figure 10(a)-10(d): the intra-class distances shrink and the inter-class distances grow as training progresses. Moreover, the t-SNE algorithm (Van der Maaten and Hinton 2008) is used to visualize the learned embeddings during training (Figure 10(e)-10(h)), where each color corresponds to a different class. As training advances, the feature representations of different categories become more clearly separated.
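The embedding visualization follows the usual recipe; the sketch below applies scikit-learn’s t-SNE to learned node embeddings H with labels y (both random placeholders here) and colors points by class.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# H: learned node embeddings, y: node labels (random placeholders here).
H = np.random.randn(1000, 256)
y = np.random.randint(0, 7, size=1000)

Z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(H)
plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=5)
plt.axis("off")
plt.savefig("tsne_embeddings.png", dpi=300)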