
Robust Graph Neural Networks
via Unbiased Aggregation

Zhichao Hou1  Ruiqi Feng1  Tyler Derr2  Xiaorui Liu1
Equal contribution. Corresponding author.
   1North Carolina State University, 2Vanderbilt University
   {zhou4,xliu96}@ncsu.edu  ruiqifeng.2024@gmail.com  tyler.derr@vanderbilt.edu
Abstract

The adversarial robustness of Graph Neural Networks (GNNs) has been questioned due to the false sense of security uncovered by strong adaptive attacks despite the existence of numerous defenses. In this work, we delve into the robustness analysis of representative robust GNNs and provide a unified robust estimation point of view to understand their robustness and limitations. Our novel analysis of estimation bias motivates the design of a robust and unbiased graph signal estimator. We then develop an efficient Quasi-Newton Iterative Reweighted Least Squares algorithm to solve the estimation problem, which is unfolded as robust unbiased aggregation layers in GNNs with theoretical guarantees. Our comprehensive experiments confirm the strong robustness of our proposed model under various scenarios, and the ablation study provides a deep understanding of its advantages. Our code is available at https://github.com/chris-hzc/RUNG.

1 Introduction

Graph neural networks (GNNs) have gained tremendous popularity in recent years due to their ability to capture topological relationships in graph-structured data (Ma & Tang, 2021). However, most GNNs are vulnerable to adversarial attacks, which can lead to a substantial decline in predictive performance  (Zhang & Zitnik, 2020; Jin et al., 2020; Geisler et al., 2021a). Despite the numerous defense strategies proposed to robustify GNNs, a recent study has revealed that most of these defenses are not as robust as initially claimed (Mujkanovic et al., 2022). Specifically, under adaptive attacks, they easily underperform the multi-layer perceptrons (MLPs) which do not utilize the graph topology information at all (Mujkanovic et al., 2022). Therefore, it is imperative to thoroughly investigate the limitations of existing defenses and develop innovative robust GNNs to securely harness the topology information in graphs.

Existing defenses attempt to bolster the resilience of GNNs using diverse approaches. For instance, Jaccard-GCN (Wu et al., 2019) and SVD-GCN (Entezari et al., 2020) aim to denoise the graph by removing potential adversarial edges during the pre-processing procedure, while ProGNN (Jin et al., 2020) learns the clean graph structure during the training process. GRAND (Feng et al., 2020) and robust training (Deng et al., 2023; Chen et al., 2020) also improve the training procedure through data augmentation. GNNGuard (Zhang & Zitnik, 2020) and RGCN (Zhu et al., 2019) reinforce their GNN architectures by heuristically reweighting edges in the graph. Additionally, some ODE-inspired architectures, including GraphCON (Rusch et al., 2022) and HANG (Zhao et al., 2024), have emerged and demonstrate decent robustness. Although most of these defenses exhibit decent robustness against transfer attacks, i.e., attacks generated through surrogate models, they encounter catastrophic performance drops when confronted with adaptive adversarial attacks that directly attack the victim model (Mujkanovic et al., 2022). Concerned by this false sense of security, we provide a comprehensive study of existing defenses under adaptive attacks. Our preliminary study in Section 2 indicates that SoftMedian (Geisler et al., 2021a), TWIRLS (Yang et al., 2021), and ElasticGNN (Liu et al., 2021) exhibit closely aligned performance and notably outperform other defenses despite their apparent architectural differences. However, as attack budgets increase, these defenses still experience a severe performance decrease and underperform the graph-agnostic MLPs. These observations are intriguing, but the underlying reasons are still unclear.

To unravel the aligned robustness and performance degradation of SoftMedian, TWIRLS, and ElasticGNN, we delve into their theoretical understanding and unveil their inherent connections and limitations in the underlying principles. Specifically, their improved robustness can be understood from a unified view of \ell_1-based robust graph smoothing. Moreover, we unearth the problematic estimation bias of \ell_1-based graph smoothing that allows the adversarial impact to accumulate as the attack budget escalates, which provides a plausible explanation of their declining robustness. Motivated by these understandings, we propose a robust and unbiased graph signal estimator to reduce the estimation bias in GNNs. We design an efficient Quasi-Newton IRLS algorithm that unrolls as robust unbiased aggregation layers to safeguard GNNs against adversarial attacks. Our contributions can be summarized as follows:

  • We provide a unified view of \ell_1-based robust graph signal smoothing to justify the improved and closely aligned robustness of representative robust GNNs. Moreover, we reveal their estimation bias, which explains their severe performance degradation as the attack budgets increase.

  • We propose a robust and unbiased graph signal estimator to mitigate the estimation bias in \ell_1-based graph signal smoothing and design an efficient Quasi-Newton IRLS algorithm to solve the non-smooth and non-convex estimation problem with theoretical guarantees.

  • The proposed algorithm can be readily unfolded as feature aggregation building blocks in GNNs, which not only provides clear interpretability but also covers many classic GNNs as special cases.

  • Comprehensive experiments demonstrate that our proposed GNN significantly improves the robustness while maintaining clean accuracy. We also provide comprehensive ablation studies to validate its working mechanism.

2 Estimation Bias Analysis of Robust GNNs

In Section 2.1, we conduct a preliminary study to evaluate the robustness of several representative robust GNNs. In Section 2.2, we establish a unified view of \ell_1-based models to uncover the inherent connections of three well-performing GNNs, including SoftMedian, TWIRLS, and ElasticGNN. In Section 2.3, we leverage the bias of \ell_1-based estimation to explain the catastrophic performance degradation in the preliminary experiments.

Notation. Let {\mathcal{G}}=\{{\mathcal{V}},{\mathcal{E}}\} be a graph with node set {\mathcal{V}}=\{v_1,\dots,v_n\} and edge set {\mathcal{E}}=\{e_1,\dots,e_m\}. The adjacency matrix of {\mathcal{G}} is denoted as {\bm{A}}\in\{0,1\}^{n\times n} and the graph Laplacian matrix is {\bm{L}}={\bm{D}}-{\bm{A}}, where {\bm{D}}=\text{diag}(d_1,\dots,d_n) is the degree matrix with d_i=|{\mathcal{N}}(i)| and {\mathcal{N}}(i) is the neighborhood set of v_i. The node feature matrix is denoted as {\bm{F}}=[{\bm{f}}_1,\dots,{\bm{f}}_n]^\top\in\mathbb{R}^{n\times d}, and {\bm{f}}^{(0)} ({\bm{F}}^{(0)}) denotes the node feature vector (matrix) before graph smoothing in decoupled GNN models. Let \Delta\in\{-1,0,1\}^{m\times n} be the incidence matrix whose l-th row encodes the l-th edge e_l=(i,j) such that \Delta_{li}=-1, \Delta_{lj}=1, and \Delta_{lk}=0~\forall k\notin\{i,j\}. \tilde{\Delta} is its normalized version: \tilde{\Delta}_{lj}=\Delta_{lj}/\sqrt{d_j}. For a vector {\bm{x}}\in\mathbb{R}^d, we use the \ell_1-based graph smoothing penalty to denote either \|{\bm{x}}\|_1=\sum_i|{\bm{x}}_i| or \|{\bm{x}}\|_2=\sqrt{\sum_i{\bm{x}}_i^2}, and the \ell_2-based graph smoothing penalty to denote \|{\bm{x}}\|_2^2=\sum_i{\bm{x}}_i^2.
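For concreteness, here is a minimal NumPy sketch of the graph quantities defined above on a toy three-node path graph (the toy graph and all variable names are our own illustration, not part of the paper):

```python
import numpy as np

# Toy graph with n = 3 nodes and m = 2 edges: e_1 = (0, 1), e_2 = (1, 2).
edges = [(0, 1), (1, 2)]
n, m = 3, len(edges)

A = np.zeros((n, n))                      # adjacency matrix A
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

d = A.sum(axis=1)                         # node degrees d_i = |N(i)|
D = np.diag(d)                            # degree matrix D
L = D - A                                 # graph Laplacian L = D - A

Delta = np.zeros((m, n))                  # incidence matrix: row l encodes e_l = (i, j)
for l, (i, j) in enumerate(edges):
    Delta[l, i], Delta[l, j] = -1.0, 1.0

Delta_tilde = Delta / np.sqrt(d)          # normalized incidence: Delta_lj / sqrt(d_j)

# Sanity check: Delta^T Delta equals the unnormalized Laplacian L.
assert np.allclose(Delta.T @ Delta, L)
```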

2.1 Robustness Analysis

To test the robustness of existing GNNs without the false sense of security, we perform a preliminary evaluation of existing robust GNNs against adaptive attacks. We choose various baselines including the undefended MLP, GCN (Kipf & Welling, 2016), some of the most representative defenses in (Mujkanovic et al., 2022), and two additional robust models TWIRLS (Yang et al., 2021) and ElasticGNN (Liu et al., 2021). We execute adaptive local evasion topological attacks and test the node classification accuracy on the Cora ML and Citeseer datasets. The detailed settings follow Section 4.1. From Figure 1, it can be observed that:

Figure 1: Robustness analysis under adaptive local attack. The perturbation budget (x-axis) is the number of edges allowed to be perturbed relative to the target node’s degree. SoftMedian, TWIRLS, and ElasticGNN (blue curves) exhibit similarly aligned competitive robustness among all the selected robust GNNs, but all models experience catastrophic performance degradation as the attack budget increases.
  • Among all the selected robust GNNs, only SoftMedian, TWIRLS, and ElasticGNN exhibit notable and closely aligned improvements in robustness, whereas the other GNNs do not show obvious improvement over the undefended GCN.

  • SoftMedian, TWIRLS, and ElasticGNN encounter a similar catastrophic performance degradation as the attack budget scales up. Their accuracy easily drops below that of the graph-unaware MLP, indicating their failure in safely exploiting the topology of the data.

2.2 A Unified View of Robust Estimation

Our preliminary study provides intriguing observations in Section 2.1, but the underlying reasons behind these phenomena remain obscure. This motivates us to delve into their theoretical understanding and explanation. In this section, we will compare the architectures of three well-performing GNNs, aiming to reveal their intrinsic connections.

SoftMedian (Geisler et al., 2021a) substitutes the GCN aggregation with the dimension-wise median {\bm{m}}_i\in\mathbb{R}^d over all neighbors of each node i\in{\mathcal{V}} for enhanced robustness. However, computing the median involves operations like ranking and selection, which is not compatible with the back-propagation training of GNNs. Therefore, the median is approximated as a differentiable weighted sum \tilde{\bm{m}}_i=\frac{1}{Z}\sum_{j\in{\mathcal{N}}(i)}w({\bm{f}}_j,{\bm{m}}_i){\bm{f}}_j, \forall i\in{\mathcal{V}}, where {\bm{m}}_i is the exact non-differentiable dimension-wise median, {\bm{f}}_j is the feature vector of the j-th neighbor, w({\bm{x}},{\bm{y}})=e^{-\beta\|{\bm{x}}-{\bm{y}}\|_2}, and Z=\sum_k w({\bm{f}}_k,{\bm{m}}_i) is a normalization factor. In this way, the aggregation assigns the largest weights to the neighbors closest to the actual median.
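A small NumPy sketch of this weighted-sum approximation (we normalize the weights over the neighbors of node i, which is our reading of the normalization factor Z; the toy features and the value of β are illustrative assumptions):

```python
import numpy as np

def soft_median(neighbor_feats: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Differentiable approximation of the dimension-wise median of a node's neighbors.

    neighbor_feats: array of shape (num_neighbors, d) holding the features f_j, j in N(i).
    """
    m = np.median(neighbor_feats, axis=0)                               # exact dimension-wise median m_i
    w = np.exp(-beta * np.linalg.norm(neighbor_feats - m, axis=1))      # weights w(f_j, m_i)
    return (w[:, None] * neighbor_feats).sum(axis=0) / w.sum()          # weighted sum divided by Z

# Toy example: three "clean" neighbors near the origin and one outlier neighbor.
feats = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [8.0, 8.0]])
print(soft_median(feats, beta=1.0))   # stays close to the clean neighbors
```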

TWIRLS (Yang et al., 2021) utilizes the iteratively reweighted least squares (IRLS) algorithm to optimize the following objective with parameter \lambda, where \rho(y)=y is the default:

2\lambda\sum_{(i,j)\in\mathcal{E}}\rho(\|\tilde{{\bm{f}}}_i-\tilde{{\bm{f}}}_j\|_2)+\sum_{i\in\mathcal{V}}\|\tilde{{\bm{f}}}_i-\tilde{{\bm{f}}}_i^{(0)}\|_2^2, \quad \tilde{{\bm{f}}}_i=(1+\lambda d_i)^{-\frac{1}{2}}{\bm{f}}_i. \qquad (1)

ElasticGNN (Liu et al., 2021) proposes the elastic message passing, which unfolds the proximal alternating predictor-corrector (PAPC) algorithm to minimize the objective with parameters \lambda_1 and \lambda_2:

\frac{1}{2}\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2+\lambda_1\sum_{(i,j)\in\mathcal{E}}\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_p+\lambda_2\sum_{(i,j)\in\mathcal{E}}\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2^2, \quad \text{where } p\in\{1,2\}. \qquad (2)

A Unified View of Robust Estimation. While these three approaches have seemingly different architectures, we provide a unified view of robust estimation to illuminate their inherent connections. First, the objective of TWIRLS in Eq. (1) can be considered as a particular case of ElasticGNN with \lambda_2=0 and p=2 when neglecting the difference in the node degree normalization. However, TWIRLS and ElasticGNN leverage different optimization solvers, i.e., IRLS and PAPC, which leads to vastly different GNN layers. Second, SoftMedian approximates the computation of medians in a soft way via weighted sums, which can be regarded as approximately solving the dimension-wise median estimation problem (Huber, 2004): \operatorname*{arg\,min}_{{\bm{f}}_i}\sum_{j\in\mathcal{N}(i)}\|{\bm{f}}_i-{\bm{f}}_j\|_1. Therefore, SoftMedian can be regarded as ElasticGNN with \lambda_2=0 and p=1. We also note that the SoftMedoid (Geisler et al., 2020) approach also resembles ElasticGNN with \lambda_2=0 and p=2, and the Total Variation GNN (Hansen & Bianchi, 2023) also utilizes an \ell_1 estimator in spectral clustering.

The above analyses suggest that SoftMedian, TWIRLS, and ElasticGNN share the same underlying design principle of \ell_1-based robust graph signal estimation, i.e., a similar graph smoothing objective with edge difference penalties \|{\bm{f}}_i-{\bm{f}}_j\|_1 or \|{\bm{f}}_i-{\bm{f}}_j\|_2. However, they adopt different approximate solutions that result in distinct architecture designs. This unified view of robust estimation clearly explains their closely aligned performance. Besides, the superiority of \ell_1-based models over \ell_2-based models such as GCN (Kipf & Welling, 2016), whose graph smoothing objective is essentially \sum_{(i,j)\in\mathcal{E}}\|{\bm{f}}_i/\sqrt{d_i}-{\bm{f}}_j/\sqrt{d_j}\|_2^2 (Ma et al., 2021), can be explained by the fact that \ell_1-based graph smoothing mitigates the impact of outliers (Liu et al., 2021).

2.3 Bias Analysis and Performance Degradation

The unified view of \ell_1-based graph smoothing we established in Section 2.2 not only explains their aligned robustness improvement but also provides a perspective, through an estimation bias analysis, to understand their failure as attack budgets scale up.

Figure 2: Different mean estimators in the presence of outliers. The clean samples are the majority of data points following the Gaussian distribution \mathcal{N}((0,0),1\cdot I), while the outliers are data points that deviate significantly from the main data pattern, following \mathcal{N}((8,8),0.5\cdot I). The \ell_2-based estimator deviates far from the true mean, while the \ell_1-based estimator is more resistant to outliers. However, as the ratio of outliers escalates, the \ell_1-based estimator encounters a greater shift from the true mean, but our estimator still maintains a position close to the ground truth.

Bias of \ell_1-based Estimation. In the literature of high-dimensional statistics, it has been well understood that \ell_1 regularization will induce an estimation bias. In the context of denoising (Donoho, 1995) or variable selection (Tibshirani, 1996a), small coefficients \beta are undesirable. To exclude small \beta in the estimation, a soft-thresholding operator can be derived as {\mathbf{S}}_\lambda(\beta)=\text{sign}(\beta)\max(|\beta|-\lambda,0). As a result, large \beta are also shrunk by a constant, so the \ell_1 estimation is biased towards zero.
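As a minimal illustration of this operator and the constant shrinkage it applies to large coefficients (the example values are our own):

```python
import numpy as np

def soft_threshold(beta: np.ndarray, lam: float) -> np.ndarray:
    """Soft-thresholding operator S_lambda(beta) = sign(beta) * max(|beta| - lambda, 0)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta = np.array([0.05, 0.5, 5.0, 50.0])
print(soft_threshold(beta, lam=0.1))
# -> [ 0.   0.4  4.9 49.9]: small coefficients are zeroed out, but every large
# coefficient is also shrunk by the constant lambda, which is the estimation bias.
```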

A similar bias effect also occurs in graph signal estimation in the presence of adversarial attacks. For example, in TWIRLS (Eq. (1)), after the graph aggregation \tilde{\bm{f}}_i^{(k+1)}=\sum_{j\in\mathcal{N}(i)}w_{ij}\tilde{\bm{f}}_j^{(k)} where w_{ij}=\|\tilde{{\bm{f}}}_i-\tilde{{\bm{f}}}_j\|_2^{-1}, the edge difference \tilde{\bm{f}}_i-\tilde{\bm{f}}_j will shrink towards zero. Consequently, every adversarial edge the attacker adds will induce a bias that can be accumulated and amplified when the attack budget scales up.

Numerical Simulation. To provide a more intuitive illustration of the estimation bias of \ell_1-based models, we simulate a mean estimation problem on synthetic data since most message passing schemes in GNNs essentially estimate the mean of neighboring node features. The results in Figure 2 show that the \ell_1-based estimator is more resistant than the \ell_2-based estimator. However, as the ratio of outliers escalates, the \ell_1-based estimator encounters a greater shift from the true mean due to the accumulated bias caused by outliers. This observation explains why \ell_1-based graph smoothing models suffer from catastrophic degradation under large attack budgets. The detailed simulation settings and results are available in Appendix A.

3 Robust GNNs with Unbiased Aggregation

In this section, we first design a robust unbiased estimator to reduce the bias in graph signal estimation in Section 3.1 and propose an efficient second-order IRLS algorithm to compute the robust estimator with theoretical convergence guarantees in Section 3.2. Finally, we unroll the proposed algorithm as the robust unbiased feature aggregation layers in GNNs in Section 3.3.

3.1 Robust and Unbiased Graph Signal Estimator

Our study and analysis in Section 2 have shown that while \ell_1-based methods outperform \ell_2-based methods in robustness, they still suffer from the accumulated estimation bias, leading to severe performance degradation under large perturbation budgets. This motivates us to design a robust and unbiased graph signal estimator that derives unbiased robust aggregation for GNNs with stronger resilience to attacks.

Theoretically, the estimation bias in Lasso regression has been discovered and analyzed in high-dimensional statistics (Zou, 2006). Statisticians have proposed adaptive Lasso (Zou, 2006) and many non-convex penalties such as Smoothly Clipped Absolute Deviation (SCAD) (Fan & Li, 2001) and Minimax Concave Penalty (MCP) (Zhang, 2010a) to alleviate this bias. Motivated by these advancements, we propose a Robust and Unbiased Graph signal Estimator (RUGE) as follows:

\operatorname*{arg\,min}_{{\bm{F}}}\mathcal{H}({\bm{F}})=\sum_{(i,j)\in\mathcal{E}}\rho_\gamma\left(\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2\right)+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2, \qquad (3)

where \rho_\gamma(y) denotes the function that penalizes the feature differences on edges by MCP:

\rho_\gamma(y)=\begin{cases}y-\frac{y^2}{2\gamma}&\text{if }y<\gamma\\ \frac{\gamma}{2}&\text{if }y\geq\gamma\end{cases}. \qquad (4)
Figure 3: Penalties.

As shown in Figure 3, MCP closely approximates the \ell_1 norm when y is small since the quadratic term \frac{y^2}{2\gamma} is negligible, and it becomes a constant value when y is large. This transition can be adjusted by the thresholding parameter \gamma. When \gamma approaches infinity, the penalty \rho_\gamma(y) reduces to the \ell_1 norm. Conversely, when \gamma is very small, the “valley” of \rho_\gamma near zero is exceptionally sharp, so \rho_\gamma(y) approaches the \ell_0 norm and becomes a constant for a slightly larger y. This enables RUGE to suppress smoothing on edges whose node differences exceed the threshold \gamma, which not only mitigates the estimation bias against outliers but also maintains the estimation accuracy in the absence of outliers. The simulation in Figure 2 verifies that our proposed estimator (\eta({\bm{x}})\coloneqq\rho_\gamma(\|{\bm{x}}\|_2)) can recover the true mean despite the increasing outlier ratio, as long as the outlier ratio is below the theoretical optimal breakdown point.
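A one-function NumPy sketch of the MCP penalty in Eq. (4) (the sample grid and the value of γ are illustrative):

```python
import numpy as np

def mcp_penalty(y: np.ndarray, gamma: float) -> np.ndarray:
    """Minimax Concave Penalty rho_gamma(y) from Eq. (4), applied element-wise (y >= 0)."""
    return np.where(y < gamma, y - y**2 / (2.0 * gamma), gamma / 2.0)

y = np.linspace(0.0, 3.0, 7)
print(mcp_penalty(y, gamma=1.0))
# -> [0. 0.375 0.5 0.5 0.5 0.5 0.5]: near zero it behaves like the l1 norm
# (slope ~1); beyond gamma it is the constant gamma / 2, so large edge
# differences add no further penalty.
```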

3.2 Quasi-Newton IRLS

Despite the advantages discussed above, the proposed RUGE in Eq. (3) is non-smooth and non-convex, which results in challenges for deriving efficient numerical solutions that can be readily unfolded as neural network layers. In the literature, researchers have developed optimization algorithms for MCP-related problems, such as the Alternating Direction Multiplier Method (ADMM) and Newton-type algorithms (Fan & Li, 2001; Zhang, 2010a; Varma et al., 2019). However, due to their excessive computation and memory requirements as well as the incompatibility with back-propagation training, these algorithms are not well-suited for the construction of feature aggregation layers in GNNs. To solve these challenges, we propose an efficient Quasi-Newton Iteratively Reweighted Least Squares algorithm (QN-IRLS) to solve the estimation problem in Eq. (3).

IRLS. Before stepping into our QN-IRLS, we first introduce the main idea of iteratively reweighted least squares (IRLS) (Holland & Welsch, 1977) and analyze its weakness in convergence. IRLS aims to circumvent the non-smooth \mathcal{H}({\bm{F}}) in Eq. (3) by computing its quadratic upper bound \hat{\mathcal{H}} based on {\bm{F}}^{(k)} in each iteration k and optimizing \hat{\mathcal{H}}({\bm{F}}):

\hat{\mathcal{H}}({\bm{F}})=\sum_{(i,j)\in\mathcal{E}}{\bm{W}}_{ij}^{(k)}\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2^2+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2, \qquad (5)

where {\bm{W}}^{(k)}_{ij}=\bm{1}_{i\neq j}\frac{d\rho_\gamma(y_{ij})}{dy_{ij}^2}\big|_{y_{ij}=y_{ij}^{(k)}}, y^{(k)}_{ij}=\big\|{\bm{f}}_i^{(k)}/\sqrt{d_i}-{\bm{f}}_j^{(k)}/\sqrt{d_j}\big\|_2, and \rho_\gamma(\cdot) is the MCP function. ({\bm{W}}_{ij} is defined as \frac{d\rho(y)}{dy^2}\big|_{y=y_{ij}^{(k)}} so that the quadratic upper bound \hat{\mathcal{H}} is tight at {\bm{F}}^{(k)} according to Lemma 3; the diagonal terms of {\bm{W}} are set to zero to avoid the undefined derivative \frac{d\rho(y)}{dy^2}\big|_{y=0}, as discussed in Remark 2.) For the detailed proof of the upper bound, please refer to Lemma 1 in Appendix B. Then, each iterative step of IRLS can be formulated as first-order gradient descent on \hat{\mathcal{H}}({\bm{F}}):

{\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})={\bm{F}}^{(k)}-\eta\left((\hat{\bm{Q}}^{(k)}-2{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}-2\lambda{\bm{F}}^{(0)}\right), \qquad (6)

where \eta is the step size, \hat{\bm{Q}}^{(k)}=2(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}}), and {\bm{q}}^{(k)}_m=\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m. Its convergence condition is given in Theorem 1, with a proof in Appendix B.

Theorem 1.

If {\bm{F}}^{(k)} follows the update rule in Eq. (6), where the \rho defining {\bm{W}} satisfies that \frac{d\rho(y)}{dy^2} is non-decreasing \forall y\in(0,\infty), then a sufficient condition for \mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)}) is that the step size \eta satisfies 0<\eta\leq\|\text{diag}({\bm{q}}^{(k)})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\|_2^{-1}.
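To make the update concrete, here is a dense NumPy sketch of one first-order IRLS step in Eq. (6), using the upper bound on η from Theorem 1 as the step size (the function name, the dense formulation, and the spectral-norm computation are our own choices, not the paper's implementation):

```python
import numpy as np

def irls_step(F, F0, A, W, lam):
    """One first-order IRLS update (Eq. (6)) with the step size bound from Theorem 1.

    F, F0: current and input features, (n, d).  A: binary adjacency, (n, n).
    W:     current edge weights W^(k), (n, n).  lam: lambda in Eq. (3).
    """
    deg = A.sum(axis=1)
    A_tilde = A / np.sqrt(np.outer(deg, deg))                   # normalized adjacency
    q = (W * A).sum(axis=1) / deg                               # q_m = sum_j W_mj A_mj / d_m
    Q_hat = 2.0 * (np.diag(q) + lam * np.eye(len(q)))
    grad = (Q_hat - 2.0 * W * A_tilde) @ F - 2.0 * lam * F0     # gradient of H_hat at F^(k)
    M = np.diag(q) - W * A_tilde + lam * np.eye(len(q))
    eta = 1.0 / np.linalg.norm(M, 2)                            # step size from Theorem 1
    return F - eta * grad
```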

Quasi-Newton IRLS. Theorem 1 suggests the difficulty in properly selecting the stepsize for (first-order) IRLS due to its non-trivial dependency on the graph (\tilde{\bm{A}}) and the dynamic terms ({\bm{q}}^{(k)} and {\bm{W}}^{(k)}). (A related work (Yang et al., 2021) adopts the IRLS algorithm to optimize the problem in Eq. (1); a preconditioned version is proposed to handle the unnormalized graph Laplacian, but its step size needs to satisfy \eta\leq\|\Delta^\top{\bf\Gamma}^{(k)}\Delta+\lambda{\bm{I}}\|_2^{-1} as shown in Lemma 3.3 of (Yang et al., 2021), which is expensive to estimate.) The dilemma is that a small stepsize leads to slow convergence, while a large stepsize easily causes divergence and instability, as verified by our experiments in Section 4.3 (Figure 5), which reveals its critical shortcoming for the construction of GNN layers.

To overcome this limitation, we aim to propose a second-order Newton method, {\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-(\nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)}))^{-1}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)}), to achieve faster convergence and stepsize-free hyperparameter tuning by better capturing the geometry of the optimization landscape. However, obtaining the analytic expression for the inverse Hessian matrix (\nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)}))^{-1}\in\mathbb{R}^{n\times n} is intractable, and the numerical solution requires expensive computation for large graphs. Therefore, we propose a novel Quasi-Newton IRLS algorithm (QN-IRLS) that approximates the Hessian matrix \nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)})=2(\text{diag}({\bm{q}}^{(k)})-{\bm{W}}^{(k)}\odot\tilde{\bm{A}}+\lambda{\bm{I}}) by the diagonal matrix \hat{{\bm{Q}}}^{(k)}=2(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}}) such that the inverse is trivial. The proposed QN-IRLS works as follows:

{\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\big(\hat{{\bm{Q}}}^{(k)}\big)^{-1}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})=(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (7)

where (\hat{{\bm{Q}}}^{(k)})^{-1} automatically adjusts the per-coordinate stepsize according to the local geometry of the optimization landscape, and {\bm{q}}^{(k)} and {\bm{W}}^{(k)} are defined as in Eq. (5) and (6). In this way, QN-IRLS provides faster convergence without needing to select a stepsize. The convergence is guaranteed by Theorem 2, with the proof in Appendix B.

Theorem 2.

If {\bm{F}}^{(k+1)} follows the update rule in Eq. (7), where \rho satisfies that \frac{d\rho(y)}{dy^2} is non-decreasing \forall y\in(0,\infty), it is guaranteed that \mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)}).

3.3 GNN with Robust Unbiased Aggregation

Figure 4: \frac{d\rho(y)}{dy^2}.

The proposed QN-IRLS provides an efficient algorithm to optimize the RUGE in Eq. (3) with a theoretical convergence guarantee. Instantiated with \rho=\rho_\gamma, each iteration of QN-IRLS in Eq. (7) can be used as one layer in GNNs, which yields the Robust Unbiased Aggregation (RUNG):

{\bm{F}}^{(k+1)}=(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (8)

where {\bm{q}}^{(k)}_m=\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m as in Eq. (6), {\bm{W}}^{(k)}_{ij}=\bm{1}_{i\neq j}\max(0,\frac{1}{y_{ij}^{(k)}}-\frac{1}{\gamma}), and y^{(k)}_{ij}=\big\|\frac{{\bm{f}}_i^{(k)}}{\sqrt{d_i}}-\frac{{\bm{f}}_j^{(k)}}{\sqrt{d_j}}\big\|_2.
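For clarity, here is a dense NumPy sketch of one RUNG aggregation layer in Eq. (8). A practical implementation would use sparse message passing over the edge list; the dense formulation, the epsilon clamp (to avoid the undefined derivative at y_{ij}=0), and the toy graph below are our own choices:

```python
import numpy as np

def rung_layer(F, F0, A, lam=0.1, gamma=3.0, eps=1e-12):
    """One robust unbiased aggregation step (Eq. (8)), written densely for clarity.

    F:  current features F^(k), shape (n, d).   F0: input features F^(0), shape (n, d).
    A:  binary adjacency matrix, shape (n, n).  lam, gamma: lambda in Eq. (3), gamma in Eq. (4).
    """
    deg = A.sum(axis=1)
    A_tilde = A / np.sqrt(np.outer(deg, deg))                      # normalized adjacency
    Fn = F / np.sqrt(deg)[:, None]                                 # f_i / sqrt(d_i)
    Y = np.linalg.norm(Fn[:, None, :] - Fn[None, :, :], axis=-1)   # pairwise differences y_ij
    W = np.maximum(0.0, 1.0 / np.maximum(Y, eps) - 1.0 / gamma)    # MCP-induced weights
    np.fill_diagonal(W, 0.0)                                       # indicator 1_{i != j}
    q = (W * A).sum(axis=1) / deg                                  # q_m = sum_j W_mj A_mj / d_m
    return ((W * A_tilde) @ F + lam * F0) / (q + lam)[:, None]

# Toy usage: unroll 10 aggregation layers on a 4-node path graph.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
F0 = np.random.RandomState(0).randn(4, 2)
F = F0.copy()
for _ in range(10):
    F = rung_layer(F, F0, A, lam=0.1, gamma=3.0)
```

Note that W becomes exactly zero whenever y_{ij} exceeds γ, which is the edge-pruning behavior discussed in the interpretability paragraph below.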

Interpretability. The proposed RUNG can be interpreted intuitively as edge reweighting. In Eq. (8), the normalized adjacency matrix \tilde{{\bm{A}}} is reweighted by {\bm{W}}^{(k)}, where {\bm{W}}^{(k)}_{ij}=\frac{d\rho(y)}{dy^2}|_{y=y_{ij}^{(k)}}. It is shown in Figure 4 that {\bm{W}}_{ij} becomes zero for any edge e_l=(i,j) with a node difference y_{ij}^{(k)}\geq\gamma, thus pruning suspicious edges. This implies RUNG's strong robustness under large-budget adversarial attacks. With the inclusion of the skip connection {\bm{F}}^{(0)}, \text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}} can be seen as a normalizer of the layer output.

Relations with Existing GNNs. RUNG can adopt any \rho that Theorem 2 allows, thus covering many classic GNNs as special cases. When \rho(y)=y^2, RUNG in Eq. (8) exactly reduces to APPNP (Gasteiger et al., 2018) ({\bm{F}}^{(k+1)}=\frac{1}{1+\lambda}\tilde{\bm{A}}{\bm{F}}^{(k)}+\frac{\lambda}{1+\lambda}{\bm{F}}^{(0)}) and to GCN ({\bm{F}}^{(k+1)}=\tilde{\bm{A}}{\bm{F}}^{(k)}) when choosing \lambda=0. When \rho(y)=y, the objective of RUGE is equivalent to ElasticGNN with p=2, which is analogous to SoftMedian and TWIRLS due to their inherent connections as \ell_1-based graph smoothing.

Complexity analysis. RUNG is scalable, with time complexity O(k(m+n)d) and space complexity O(m+nd), where m is the number of edges, d is the number of features, n is the number of nodes, and k is the number of GNN layers. Therefore, the complexity of RUNG is comparable to that of a normal GCN (up to a constant factor), and it is feasible to implement. Detailed discussions of the computational efficiency can be found in Appendix C.

4 Experiment

In this section, we perform comprehensive experiments to validate the robustness of the proposed RUNG. Besides, ablation studies show its convergence and defense mechanism.

4.1 Experiment Setting

Datasets. We test RUNG on the node classification task on two widely used real-world citation networks, Cora ML and Citeseer (Sen et al., 2008), as well as a large-scale network, Ogbn-Arxiv (Hu et al., 2020). We adopt the data split of 10% training, 10% validation, and 80% testing, and report the classification accuracy of the attacked nodes following (Mujkanovic et al., 2022). Each experiment is averaged over 5 different random splits.

Baselines. To evaluate the performance of RUNG, we compare it with a set of representative baselines. Among them, MLP, GCN (Kipf & Welling, 2016), APPNP (Gasteiger et al., 2018), and GAT (Velickovic et al., 2017) are undefended vanilla models. GNNGuard (Zhang & Zitnik, 2020), RGCN (Zhu et al., 2019), GRAND (Feng et al., 2020), ProGNN (Jin et al., 2020), Jaccard-GCN (Wu et al., 2019), SVD-GCN (Entezari et al., 2020), EvenNet (Lei et al., 2022), HANG (Zhao et al., 2024), NoisyGNN (Ennadir et al., 2024), and GARNET (Deng et al., 2022) are representative robust GNNs. Besides, SoftMedian and TWIRLS are representative methods with \ell_1-based graph smoothing. (We do not include ElasticGNN because it is still unclear how to attack it adaptively due to its special incidence matrix formulation (Liu et al., 2021); in the preliminary study (Section 2.1), we evaluate the robustness of ElasticGNN following the unit test setting proposed in (Mujkanovic et al., 2022).) We also evaluate a variant of TWIRLS with thresholding attention (TWIRLS-T). For RUNG, we test two variants: the default RUNG (Eq. (8)) and RUNG-\ell_1 with the \ell_1 penalty (\rho(y)=y).

Hyperparameters. The model hyperparameters, including learning rate, weight decay, and dropout rate, are tuned as in (Mujkanovic et al., 2022). Other hyperparameters follow the settings in the original papers. RUNG uses an MLP connected to 10 graph aggregation layers following the decoupled GNN architecture of APPNP. \hat{\lambda}=\frac{1}{1+\lambda} is tuned in \{0.7, 0.8, 0.9\}, and \gamma is tuned in \{0.5, 1, 2, 3, 5\}. We chose the hyperparameter setting that yields the best robustness without a notable impact (smaller than 1%) on the clean accuracy, following (Bojchevski & Günnemann, 2019).

Attack setting. We use the PGD attack (Xu et al., 2019) to execute the adaptive evasion and poisoning topology attack since it delivers the strongest attack in most settings (Mujkanovic et al., 2022). The adaptive attack setting is provided in Appendix F. The adversarial attacks aim to misclassify specific target nodes (local attack) or the entire set of test nodes (global attack). To avoid a false sense of robustness, our adaptive attacks directly target the victim model instead of the surrogate model. Additionally, we include the transfer attacks with a 2-layer GCN as the surrogate model. We also include graph injection attack following the setting in TDGIA (Zou et al., 2021).

4.2 Adversarial Robustness

Table 1: Adaptive local attack on Cora ML. The best and second are marked.
Model 0% (Clean) 20% 50% 100% 150% 200%
MLP 72.6±6.4 72.6±6.4 72.6±6.4 \mathbf{72.6±6.4} \mathbf{72.6±6.4} \mathbf{72.6±6.4}
GCN 82.7±4.9 40.7±10.2 12.0±6.2 2.7±2.5 0.0±0.0 0.0±0.0
APPNP \mathbf{84.7±6.8} 50.0±13.0 27.3±6.5 14.0±5.3 3.3±3.0 0.7±1.3
GAT 80.7±10.0 30.7±16.1 16.0±12.2 11.3±4.5 1.3±1.6 2.0±1.6
GNNGuard 82.7±6.7 44.0±11.6 30.7±11.6 14.0±6.8 5.3±3.4 2.0±2.7
RGCN 84.6±4.0 46.0±9.3 18.0±8.1 6.0±3.9 0.0±0.0 0.0±0.0
GRAND 84.0±6.8 47.3±9.0 18.7±9.1 7.3±4.9 1.3±1.6 0.0±0.0
ProGNN \mathbf{84.7±6.2} 47.3±10.4 21.3±7.8 4.0±2.5 0.0±0.0 0.0±0.0
Jaccard-GCN 81.3±5.0 46.0±6.8 17.3±4.9 4.7±3.4 0.7±1.3 0.7±1.3
GARNET 82.4±6.8 70.9±7.5 61.9±7.9 42.7±9.3 11.6±3.4 9.6±3.5
HANG 83.1±7.4 71.2±7.8 60.1±6.3 39.5±3.4 9.4±2.3 5.6±2.3
NoisyGNN 82.5±5.6 57.4±5.7 47.8±6.2 36.1±4.5 5.8±3.4 4.1±1.2
EvenNet 83.4±8.1 64.8±6.9 56.1±5.6 29.5±5.3 3.1±1.1 1.1±1.3
GraphCON 81.3±7.1 67.3±7.3 54.7±7.1 41.3±6.2 4.1±2.1 2.3±1.2
SoftMedian 80.0±10.2 \underline{72.7±13.7} 62.7±12.7 46.7±11.0 8.0±4.5 8.7±3.4
TWIRLS 83.3±7.3 71.3±8.6 60.7±11.0 36.0±8.8 20.7±10.4 12.0±6.9
TWIRLS-T 82.0±4.5 70.7±4.4 62.7±7.4 54.7±6.2 44.0±11.2 40.7±11.8
RUNG-\ell_1 (Ours) 84.0±6.8 \underline{72.7±7.1} 62.7±11.2 53.3±8.2 22.0±9.3 14.0±7.4
RUNG (Ours) 84.0±5.3 \mathbf{75.3±6.9} \mathbf{72.7±8.5} 70.7±10.6 69.3±9.8 69.3±9.0
Table 2: Adaptive global attack on Cora ML. The best and second are marked.
Model 0% (Clean) 5% 10% 20% 30% 40%
MLP 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0 \underline{65.0±1.0} 65.0±1.0
GCN 85.0±0.4 75.3±0.5 69.6±0.5 60.9±0.7 54.2±0.6 48.4±0.5
APPNP \mathbf{86.3±0.4} 75.8±0.5 69.7±0.7 60.3±0.9 53.8±1.2 49.0±1.6
GAT 83.5±0.5 75.8±0.8 71.2±1.2 65.0±0.9 60.5±0.9 56.7±0.9
GNNGuard 83.1±0.7 74.6±0.7 70.2±1.0 63.1±1.1 57.5±1.6 51.0±1.2
RGCN 85.7±0.4 75.0±0.8 69.1±0.4 59.8±0.7 52.8±0.7 46.1±0.7
GRAND 86.1±0.7 76.2±0.8 70.7±0.7 61.6±0.7 56.7±0.8 51.9±0.9
ProGNN 85.6±0.5 76.5±0.7 71.0±0.5 63.0±0.7 56.8±0.7 51.3±0.6
Jaccard-GCN 83.7±0.7 73.9±0.5 68.3±0.7 60.0±1.1 54.0±1.7 49.1±2.4
GARNET 84.0±0.5 76.5±0.4 72.1±0.1 66.4±0.7 62.1±1.3 58.7±1.5
HANG 84.5±0.2 75.7±0.6 74.2±0.3 69.5±0.4 64.7±0.9 58.4±0.1
NoisyGNN 83.9±0.5 76.7±0.1 72.1±0.3 64.7±0.7 58.0±0.2 53.3±0.6
EvenNet 84.8±0.9 75.8±0.9 70.7±0.6 63.9±0.5 58.7±0.7 54.7±0.8
GraphCON 83.7±0.6 75.8±0.3 70.9±0.8 65.9±0.7 62.2±0.9 56.6±0.3
SoftMedian 85.0±0.7 78.6±0.3 \underline{75.5±0.9} 69.5±0.5 62.8±0.8 58.1±0.7
TWIRLS 84.2±0.6 77.3±0.8 72.9±0.3 66.9±0.2 62.4±0.6 58.7±1.1
TWIRLS-T 82.8±0.5 76.8±0.6 73.2±0.4 67.7±0.4 63.8±0.2 60.8±0.3
RUNG-\ell_1 (Ours) 85.8±0.5 \underline{78.4±0.4} 74.3±0.3 68.1±0.6 63.5±0.7 59.8±0.8
RUNG (Ours) 84.6±0.5 \mathbf{78.9±0.4} \mathbf{75.7±0.2} \mathbf{71.8±0.4} \mathbf{67.8±1.3} \mathbf{65.1±1.2}

Here we evaluate the performance of RUNG against the baselines under different settings. The results of local and global adaptive attacks on Cora ML are presented in Table 1 and Table 2, while those on Citeseer are presented in Table 3 and Table 4 in Appendix E due to space limits. We summarize the following analysis from Cora ML, noting that the same observations apply to Citeseer.

  • Under adaptive attacks, many existing defenses are not significantly more robust than undefended models. The \ell_1-based models such as TWIRLS, SoftMedian, and RUNG-\ell_1 demonstrate considerable and closely aligned robustness under both local and global attacks, which supports our unified view of \ell_1-based robust estimation in Section 2.2.

  • RUNG exhibits significant improvements over all baselines across various budgets under both global and local attacks. Local attacks are stronger than global attacks since local attacks concentrate on targeted nodes. The robustness improvement of RUNG appears to be more remarkable in local attacks.

  • When there is no attack, RUNG largely preserves an excellent clean performance. RUNG also achieves state-of-the-art performance under small attack budgets.

4.3 Ablation Study

Figure 5: Convergence of our QN-IRLS compared to IRLS.
Figure 6: Bias induced by different attack budgets.
Figure 7: Distribution of feature difference on attacked edges.

Convergence. To verify the advantage of our QN-IRLS method in Eq. (7) over the first-order IRLS in Eq. (6), we show the objective \mathcal{H} at each layer in Figure 5. It can be observed that our stepsize-free QN-IRLS demonstrates the best convergence, as discussed in Section 3.

Estimation bias. The bias effect in \ell_1-based GNNs and the unbiasedness of our RUNG can be empirically verified. We quantify the bias with \sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{\star}\|_2^2, where {\bm{f}}_i^{\star} and {\bm{f}}_i denote the aggregated feature on the clean graph and the attacked graph, respectively. As shown in Figure 6, when the budget scales up, \ell_1 GNNs exhibit a notable bias, whereas RUNG has nearly zero bias. We provide a comprehensive discussion of the unbiasedness of RUNG in Appendix D.
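For reference, a minimal sketch of this bias metric (the function name is our own):

```python
import numpy as np

def estimation_bias(F_attacked: np.ndarray, F_clean: np.ndarray) -> float:
    """Bias metric sum_i ||f_i - f_i*||_2^2 between features aggregated on the
    attacked graph (F_attacked) and on the clean graph (F_clean), both of shape (n, d)."""
    return float(np.sum(np.linalg.norm(F_attacked - F_clean, axis=1) ** 2))
```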

Defense Mechanism. To further investigate how our defense takes effect, we analyze the edges added under adaptive attacks. The distribution of the node feature differences \big\|{\bm{f}}_i/\sqrt{d_i}-{\bm{f}}_j/\sqrt{d_j}\big\|_2 of attacked edges is shown in Figure 7 for different graph signal estimators. It can be observed that our RUNG forces the attacker to focus on the edges with a small feature difference, indicating that RUNG can improve robustness by down-weighting or pruning malicious edges that connect distinct nodes. Therefore, the attacks become less influential, which explains why RUNG demonstrates outstanding robustness.

Transfer Attacks. In addition to the adaptive attack, we also conduct a set of transfer attacks that take every baseline GNN as the surrogate model to comprehensively test the robustness of RUNG, following the unit test attack protocol proposed in (Mujkanovic et al., 2022). We summarize the results on Cora ML and Citeseer in Figure 9 and Figure 10 in Appendix E due to space limits. All transfer attacks are weaker than the adaptive attack in Section 4.2, indicating the necessity of evaluating against the strongest adaptive attack to avoid the false sense of security emphasized in this paper. Note that the attack transferred from the RUNG model is slightly weaker than the adaptive attack since the surrogate and victim RUNG models have different model parameters in the transfer attack setting.

Hyperparameters. Due to the space limit, we provide additional ablation studies on the hyperparameters of RUNG (including \gamma and \lambda in MCP as well as the number of layers) in Appendix G. The results offer guidance for the choice of optimal hyperparameters. We observe that \gamma in MCP has a significant impact on the performance of RUNG. Specifically, a larger \gamma makes RUNG closer to an \ell_1-based model, while a smaller \gamma encourages more edges to be pruned. This pruning helps RUNG remove more malicious edges and improve robustness, although a small \gamma may slightly reduce clean performance.

Robustness under various scenarios. Besides the evaluation under strong adaptive attacks, we also validate the consistent effectiveness of our proposed RUNG under various scenarios, including transfer attacks (Appendix E.2), poisoning attacks (Appendix E.3), large-scale datasets (Appendix E.4), adversarial training (Appendix E.5), and graph injection attacks (Appendix E.6).

5 Related Work

To the best of our knowledge, although there are works unifying existing GNNs from a graph signal denoising perspective (Ma et al., 2021), no work has been dedicated to uniformly understanding the robustness and limitations of robust GNNs such as SoftMedian (Geisler et al., 2021a), SoftMedoid (Geisler et al., 2020), TWIRLS (Yang et al., 2021), ElasticGNN (Liu et al., 2021), and TVGNN (Hansen & Bianchi, 2023) from the perspectives of \ell_1 robust statistics and bias analysis. To mitigate the estimation bias, the MCP penalty is promising since it is well known for its near-unbiasedness property (Zhang, 2010a) and has been applied to the graph trend filtering problem (Varma et al., 2019) to promote piecewise signal modeling, but its robustness remains unexplored. Nevertheless, other robust GNNs have utilized alternative penalties that might alleviate the bias effect. For example, GNNGuard (Zhang & Zitnik, 2020) prunes the edges whose cosine similarity is too small. Another example is that TWIRLS (Yang et al., 2021) with a thresholding penalty can also exclude edges using graph attention. However, the design of their edge weighting or graph attention is heuristic-based and exhibits suboptimal performance compared to the RUNG proposed in this work.

6 Conclusion & Limitation

In this work, we propose a unified view of \ell_1 robust graph smoothing to uniformly understand the robustness and limitations of representative robust GNNs. The established view not only justifies their improved and closely aligned robustness but also explains their severe performance degradation under large attack budgets by a novel estimation bias analysis. To mitigate the estimation bias, we propose a robust and unbiased graph signal estimator. To solve this non-trivial estimation problem, we design a novel and efficient Quasi-Newton IRLS algorithm that can better capture the landscape of the optimization problem and converge stably with a theoretical guarantee. This algorithm can be unfolded and used as a building block for constructing robust GNNs with Robust Unbiased Aggregation (RUNG). As verified by our experiments, RUNG provides the best performance under strong adaptive attacks among all the baselines. Furthermore, RUNG also covers many classic GNNs as special cases. Most importantly, this work provides a deeper understanding of existing approaches and reveals a principled direction for designing robust GNNs.

Regarding the limitations, first, the improvement of RUNG is more significant under large budgets compared to the robust baselines. Second, we primarily include experiments on homophilic graphs in the main paper, but we can generalize the proposed robust aggregation to heterophilic graphs in future work. Third, although our Quasi-Newton IRLS algorithm has exhibited excellent convergence compared to the vanilla IRLS, the efficiency of RUNG could be further improved.

Acknowledgment

Zhichao Hou and Dr. Xiaorui Liu are supported by the National Science Foundation (NSF) National AI Research Resource Pilot Award, Amazon Research Award, NCSU Data Science Academy Seed Grant Award, and NCSU Faculty Research and Professional Development Award.

References

  • Beck & Sabach (2015) Amir Beck and Shoham Sabach. Weiszfeld’s method: Old and new results. Journal of Optimization Theory and Applications, 164(1):1–40, 2015. doi: 10.1007/s10957-014-0586-7. URL https://doi.org/10.1007/s10957-014-0586-7.
  • Bojchevski & Günnemann (2019) Aleksandar Bojchevski and Stephan Günnemann. Certifiable robustness to graph perturbations. In Neural Information Processing Systems, 2019. URL https://api.semanticscholar.org/CorpusID:202782978.
  • Candès et al. (2008) Emmanuel J. Candès, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted l1l_{1} minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, Dec 2008. ISSN 1069-5869. Funding by NSF.
  • Chen et al. (2020) Jinyin Chen, Xiang Lin, Hui Xiong, Yangyang Wu, Haibin Zheng, and Qi Xuan. Smoothing adversarial training for gnn. IEEE Transactions on Computational Social Systems, 8(3):618–629, 2020.
  • Deng et al. (2022) Chenhui Deng, Xiuyu Li, Zhuo Feng, and Zhiru Zhang. Garnet: Reduced-rank topology learning for robust and scalable graph neural networks. In Learning on Graphs Conference, pp.  3–1. PMLR, 2022.
  • Deng et al. (2023) Zhijie Deng, Yinpeng Dong, and Jun Zhu. Batch virtual adversarial training for graph convolutional networks. AI Open, 2023.
  • Donoho (1995) David L Donoho. De-noising by soft-thresholding. IEEE transactions on information theory, 41(3):613–627, 1995.
  • Ennadir et al. (2024) Sofiane Ennadir, Yassine Abbahaddou, Johannes F Lutzeyer, Michalis Vazirgiannis, and Henrik Boström. A simple and yet fairly effective defense for graph neural networks. In 38th AAAI Conference on Artificial Intelligence (AAAI), 2024.
  • Entezari et al. (2020) Negin Entezari, Saba A Al-Sayouri, Amirali Darvishzadeh, and Evangelos E Papalexakis. All you need is low (rank) defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp.  169–177, 2020.
  • Fan & Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. ISSN 01621459. URL http://www.jstor.org/stable/3085904.
  • Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems, 33:22092–22103, 2020.
  • Gasteiger et al. (2018) Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • Geisler et al. (2020) Simon Geisler, Daniel Zügner, and Stephan Günnemann. Reliable graph neural networks via robust aggregation. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Geisler et al. (2021a) Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. Robustness of graph neural networks at scale. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  7637–7649. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/3ea2db50e62ceefceaf70a9d9a56a6f4-Paper.pdf.
  • Geisler et al. (2021b) Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. Robustness of graph neural networks at scale. Advances in Neural Information Processing Systems, 34:7637–7649, 2021b.
  • Hansen & Bianchi (2023) Jonas Berg Hansen and Filippo Maria Bianchi. Total variation graph neural networks. In Proceedings of the 40th international conference on Machine learning. ACM, 2023.
  • Holland & Welsch (1977) Paul W Holland and Roy E Welsch. Robust regression using iteratively reweighted least-squares. Communications in Statistics-theory and Methods, 6(9):813–827, 1977.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huber (2004) Peter J Huber. Robust statistics, volume 523. John Wiley & Sons, 2004.
  • Jin et al. (2020) Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  66–74, 2020.
  • Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Lei et al. (2022) Runlin Lei, Zhen Wang, Yaliang Li, Bolin Ding, and Zhewei Wei. Evennet: Ignoring odd-hop neighbors improves robustness of graph neural networks. Advances in Neural Information Processing Systems, 35:4694–4706, 2022.
  • Liu et al. (2021) Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  6837–6849. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/liu21k.html.
  • Ma & Tang (2021) Yao Ma and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021.
  • Ma et al. (2021) Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, pp.  1202–1211, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384469. doi: 10.1145/3459637.3482225. URL https://doi.org/10.1145/3459637.3482225.
  • Mujkanovic et al. (2022) Felix Mujkanovic, Simon Geisler, Stephan Günnemann, and Aleksandar Bojchevski. Are defenses for graph neural networks robust? In Neural Information Processing Systems, NeurIPS, 2022.
  • Rusch et al. (2022) T Konstantin Rusch, Ben Chamberlain, James Rowbottom, Siddhartha Mishra, and Michael Bronstein. Graph-coupled oscillator networks. In International Conference on Machine Learning, pp.  18888–18909. PMLR, 2022.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Tibshirani (1996a) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996a. ISSN 00359246. URL http://www.jstor.org/stable/2346178.
  • Tibshirani (1996b) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996b.
  • Varma et al. (2019) Rohan Varma, Harlin Lee, Jelena Kovačević, and Yuejie Chi. Vector-valued graph trend filtering with non-convex penalties. IEEE transactions on signal and information processing over networks, 6:48–62, 2019.
  • Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. Graph attention networks. stat, 1050(20):10–48550, 2017.
  • Wu et al. (2019) Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. Adversarial examples on graph data: Deep insights into attack and defense. arXiv preprint arXiv:1903.01610, 2019.
  • Xu et al. (2019) Kaidi Xu, Hongge Chen, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Mingyi Hong, and Xue Lin. Topology attack and defense for graph neural networks: an optimization perspective. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp.  3961–3967, 2019.
  • Yang et al. (2021) Yongyi Yang, Tang Liu, Yangkun Wang, Jinjing Zhou, Quan Gan, Zhewei Wei, Zheng Zhang, Zengfeng Huang, and David Wipf. Graph neural networks inspired by classical iterative algorithms. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  11773–11783. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21g.html.
  • Zhang (2010a) Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894 – 942, 2010a. doi: 10.1214/09-AOS729. URL https://doi.org/10.1214/09-AOS729.
  • Zhang (2010b) Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010b. ISSN 00905364, 21688966. URL http://www.jstor.org/stable/25662264.
  • Zhang & Zitnik (2020) Xiang Zhang and Marinka Zitnik. Gnnguard: Defending graph neural networks against adversarial attacks. Advances in neural information processing systems, 33:9263–9275, 2020.
  • Zhao et al. (2024) Kai Zhao, Qiyu Kang, Yang Song, Rui She, Sijie Wang, and Wee Peng Tay. Adversarial robustness in graph neural networks: A hamiltonian approach. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhu et al. (2019) Dingyuan Zhu, Ziwei Zhang, Peng Cui, and Wenwu Zhu. Robust graph convolutional networks against adversarial attacks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  1399–1407, 2019.
  • Zou (2006) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. doi: 10.1198/016214506000000735. URL https://doi.org/10.1198/016214506000000735.
  • Zou et al. (2021) Xu Zou, Qinkai Zheng, Yuxiao Dong, Xinyu Guan, Evgeny Kharlamov, Jialiang Lu, and Jie Tang. Tdgia: Effective injection attacks on graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.  2461–2471, 2021.

Appendix A Bias Accumulation of L1 Models

A.1 Details of the Numerical Simulation Settings

To provide a more intuitive illustration of the estimation biases of different models, we simulate a mean estimation problem on synthetic data since most message passing schemes in GNNs essentially estimate the mean of neighboring node features. In the context of mean estimation, the bias is measured as the distance between different mean estimators and the true mean. We first generate clean samples \{{\bm{x}}_i\}_{i=1}^n (blue dots) and outlier samples \{{\bm{x}}_i\}_{i=n+1}^{n+m} (red dots) from 2-dimensional Gaussian distributions, {\mathcal{N}}((0,0),1\cdot I) and {\mathcal{N}}((8,8),0.5\cdot I), respectively. We calculate the mean of the clean samples \frac{1}{n}\sum_{i=1}^n{\bm{x}}_i as the ground truth of the mean estimator. Then we estimate the mean of all the samples by solving \operatorname*{arg\,min}_{\bm{z}}\sum_{i=1}^{n+m}\eta({\bm{z}}-{\bm{x}}_i) using the Weiszfeld method (Candès et al., 2008; Beck & Sabach, 2015), where \eta(\cdot) can take different norms such as the \ell_2 norm \|\cdot\|_2^2 and the \ell_1 norm \|\cdot\|_2.

The mean estimators are formulated as minimization operators

\bar{\bm{z}}=\operatorname*{arg\,min}_{\bm{z}}\sum_{i=1}^{n+m}\eta({\bm{z}}-{\bm{x}}_{i}), \qquad (9)

where $n$ is the number of clean samples and $m$ is the number of adversarial samples.

$\ell_1$ estimator. The $\ell_1$ estimator ($\eta({\bm{y}})\coloneqq\|{\bm{y}}\|_2$) is essentially the geometric median. We adopt the Weiszfeld method to iteratively reweight ${\bm{z}}$ to minimize the objective, following

{\bm{z}}^{(k+1)}=\frac{\sum_{i}w_{i}^{(k)}{\bm{x}}_{i}}{\sum_{i}w_{i}^{(k)}}, \qquad (10)

where $w_i^{(k)}=\frac{1}{\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2}$. This can be seen as a gradient descent step ${\bm{z}}^{(k+1)}={\bm{z}}^{(k)}-\alpha\nabla_{\bm{z}}\sum_{i}\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2={\bm{z}}^{(k)}-\alpha\sum_{i}\frac{{\bm{z}}^{(k)}-{\bm{x}}_i}{\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2}$. Taking $\alpha=\frac{1}{\sum_{i}w_i^{(k)}}$ immediately yields Eq. (10).
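For illustration, a minimal NumPy sketch of this Weiszfeld reweighting is given below. It is our own approximate reproduction of the simulation setup; the sample sizes, iteration count, and the small constant eps guarding against division by zero are illustrative assumptions, not the exact script behind Figure 2.

import numpy as np

def weiszfeld_median(X, num_iters=100, eps=1e-8):
    """Geometric median (l1 estimator) of the rows of X via Weiszfeld reweighting, Eq. (10)."""
    z = X.mean(axis=0)                              # initialize at the l2 mean
    for _ in range(num_iters):
        d = np.linalg.norm(X - z, axis=1)           # ||z^(k) - x_i||_2
        w = 1.0 / np.maximum(d, eps)                # w_i^(k) = 1 / ||z^(k) - x_i||_2
        z = (w[:, None] * X).sum(axis=0) / w.sum()  # reweighted average
    return z

# Illustrative data following the setup of Section A.1 (sizes are assumptions).
rng = np.random.default_rng(0)
clean = rng.normal((0.0, 0.0), 1.0, size=(100, 2))
outliers = rng.normal((8.0, 8.0), 0.5, size=(30, 2))
z_l1 = weiszfeld_median(np.vstack([clean, outliers]))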

MCP-based estimator. We therefore adopt a similar approach for the MCP-based estimator (“Ours” in Figure 2), where $\eta({\bm{y}})\coloneqq\rho_\gamma({\bm{y}})$:

{\bm{z}}^{(k+1)} = {\bm{z}}^{(k)}-\alpha\nabla_{\bm{z}}\sum_{i}\rho_{\gamma}(\|{\bm{z}}-{\bm{x}}_{i}\|_{2}) \qquad (11)
= {\bm{z}}^{(k)}-\alpha\sum_{i}\max\left(0,\frac{1}{\|{\bm{z}}^{(k)}-{\bm{x}}_{i}\|_{2}}-\frac{1}{\gamma}\right)({\bm{z}}^{(k)}-{\bm{x}}_{i}). \qquad (12)

Denoting $\max(0,\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2^{-1}-\frac{1}{\gamma})$ as $w_i^{(k)}$, taking $\alpha=\frac{1}{\sum_{i}w_i^{(k)}}$ yields the same form of reweighting iteration ${\bm{z}}^{(k+1)}=\frac{\sum_{i}w_i^{(k)}{\bm{x}}_i}{\sum_{i}w_i^{(k)}}$.
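A corresponding sketch of the MCP-based reweighting (same assumptions as the $\ell_1$ sketch above; $\gamma$ and the iteration count are illustrative):

import numpy as np

def mcp_mean_estimator(X, gamma=3.0, num_iters=100, eps=1e-8):
    """Robust mean estimate with the MCP penalty: w_i = max(0, 1/||z - x_i||_2 - 1/gamma), Eq. (12)."""
    z = X.mean(axis=0)
    for _ in range(num_iters):
        d = np.linalg.norm(X - z, axis=1)
        w = np.maximum(0.0, 1.0 / np.maximum(d, eps) - 1.0 / gamma)  # samples farther than gamma get zero weight
        if w.sum() <= eps:   # all samples pruned; keep the current estimate
            break
        z = (w[:, None] * X).sum(axis=0) / w.sum()
    return z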

$\ell_2$ estimator. The same technique applies to the $\ell_2$ estimator with $\eta({\bm{z}})\coloneqq\|{\bm{z}}\|_2^2$. The iteration becomes

{\bm{z}}^{(k+1)} = {\bm{z}}^{(k)}-\alpha\nabla_{\bm{z}}\sum_{i}\|{\bm{z}}-{\bm{x}}_{i}\|_{2}^{2} \qquad (13)
= {\bm{z}}^{(k)}-\alpha\sum_{i}({\bm{z}}^{(k)}-{\bm{x}}_{i}), \qquad (14)

and $\alpha=\frac{1}{n+m}$ yields ${\bm{z}}^{(k+1)}=\frac{1}{n+m}\sum_{i}{\bm{x}}_i$, i.e., the mean of all samples is recovered in a single iteration.

This mean estimation scenario parallels our QN-IRLS in graph smoothing: both iteratively reweight to minimize similar objectives, and the approximate Hessian in our QN-IRLS plays the same role as the step size choice in the Weiszfeld method, which cancels the ${\bm{z}}^{(k)}$ term.

In Figure 2, we visualize the generated clean samples and outliers, as well as the ground truth means and the mean estimators with $\eta(\cdot)=\|\cdot\|_2^2$ or $\|\cdot\|_2$ under different outlier ratios (15%, 30%, 45%). The results show that the $\ell_2$-based estimator deviates far from the true mean, while the $\ell_1$-based estimator is more resistant to outliers, which explains why $\ell_1$-based methods exhibit stronger robustness. However, as the ratio of outliers escalates, the $\ell_1$-based estimator drifts further from the true mean due to the bias accumulated from the outliers. This observation explains why $\ell_1$-based graph smoothing models suffer from catastrophic degradation under large attack budgets. Our estimator stays much closer to the ground truth than the other estimators in the presence of outliers.

A.2 Additional Simulation Results and Discussions

Here, we complement Figure 2 with settings of higher attack budgets. As the outlier ratio exceeds the $50\%$ breakdown point, we observe that our MCP-based mean estimator still recovers the majority of the samples, i.e., it converges to the center of the “outliers”, which now constitute the majority.

Figure 8: The trajectory of our MCP-based mean estimator.

Appendix B Convergence Analysis

To begin with, we provide an overview of our proof, followed by a detailed presentation of the formal convergence analysis.

Overview of proof. First, for both IRLS and QN-IRLS, we construct, for the ${\bm{F}}^{(k)}$ in every iteration $k$, a quadratic upper bound $\hat{\mathcal{H}}$ that satisfies $\hat{\mathcal{H}}+C\geq\mathcal{H}$, where equality is reached at ${\bm{F}}^{(k)}$. We can then minimize $\hat{\mathcal{H}}$ to guarantee the iterative descent of $\mathcal{H}$, since $\mathcal{H}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k+1)})+C\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})+C=\mathcal{H}({\bm{F}}^{(k)})$.

To find an ${\bm{F}}^{(k+1)}$ such that $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$, IRLS simply adopts plain gradient descent ${\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})$, whose convergence condition can be analyzed via the $\beta$-smoothness of the quadratic $\hat{\mathcal{H}}$ (Theorem 1). To address the problems of IRLS motivated in Section 3.2, our quasi-Newton IRLS utilizes the diagonal approximate Hessian $\hat{{\bm{Q}}}$ to scale the update step size in each dimension as ${\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\hat{{\bm{Q}}}^{-1}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})$. Thereafter, by bounding the Hessian with $2\hat{{\bm{Q}}}$, the descent condition of $\hat{\mathcal{H}}$ is simplified (Theorem 2).

Lemma 1.

For any $\rho(y)$ such that $\frac{d\rho(y)}{dy^2}$ is non-increasing, denote $y_{ij}:=\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2$. Then $\mathcal{H}({\bm{F}})=\sum_{(i,j)\in\mathcal{E},i\neq j}\rho(y_{ij})+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2$ has the following upper bound:

\mathcal{H}({\bm{F}})\leq\hat{\mathcal{H}}({\bm{F}})+C=\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}^{(k)}y_{ij}^{2}+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_{i}-{\bm{f}}_{i}^{(0)}\|_{2}^{2}+C, \qquad (15)

where ${\bm{W}}_{ij}^{(k)}=\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{ij}^{(k)}}$, $y_{ij}^{(k)}=\left\|\frac{{\bm{f}}_i^{(k)}}{\sqrt{d_i}}-\frac{{\bm{f}}_j^{(k)}}{\sqrt{d_j}}\right\|_2$, and $C=\mathcal{H}({\bm{F}}^{(k)})-\hat{\mathcal{H}}({\bm{F}}^{(k)})$ is a constant. The equality in Eq. (15) is achieved when ${\bm{F}}={\bm{F}}^{(k)}$.

Proof.

Let $v=y^2$ and define $\psi(v):=\rho(y)=\rho(\sqrt{v})$. Then $\psi$ is concave since $\frac{d\psi(v)}{dv}=\frac{d\rho(y)}{dy^2}$ is non-increasing. By concavity, we have $\psi(v)\leq\psi(v_0)+\psi^{\prime}(\nu)\big|_{\nu=v_0}(v-v_0)$. Substituting $v=y^2$ and $v_0=y_0^2$, we obtain:

\rho(y) \leq y^{2}\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{0}}-y_{0}^{2}\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{0}}+\rho(y_{0}) \qquad (16)
= y^{2}\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{0}}+C(y_{0}), \qquad (17)

where the equality is attained when $y=y_0$. Next, substituting $y=y_{ij}$ and $y_0=y_{ij}^{(k)}$, we get $\rho(y_{ij})\leq{\bm{W}}_{ij}^{(k)}y_{ij}^{2}+C(y_{ij}^{(k)})$, with equality at $y_{ij}=y_{ij}^{(k)}$. Finally, summing over all edges and adding the regularization term proves Eq. (15). ∎

Remark 1.

It can be seen that the definition of $\hat{\mathcal{H}}$ depends on ${\bm{F}}^{(k)}$, which ensures that the bound is tight when ${\bm{F}}={\bm{F}}^{(k)}$. This tight-bound condition is essential in the majorization-minimization algorithm, as seen in Theorem 1.

Lemma 2.

For $\hat{\mathcal{H}}=\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_{i}-{\bm{f}}_{i}^{(0)}\|_{2}^{2}$, the gradient and Hessian w.r.t. ${\bm{F}}$ (see footnote 4 below) satisfy

\nabla_{{\bm{F}}_{mn}}\hat{\mathcal{H}}({\bm{F}})=2\left((\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}){\bm{F}}-\lambda{\bm{F}}^{(0)}\right)_{mn}, \qquad (18)

and

\nabla_{{\bm{F}}_{mn}}\nabla_{{\bm{F}}_{kl}}\hat{\mathcal{H}}({\bm{F}})=2\left(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\right)_{mk}, \qquad (19)

where ${\bm{q}}_m=\sum_j{\bm{W}}_{mj}{\bm{A}}_{mj}/d_m$ and $\tilde{{\bm{A}}}_{ij}=\frac{{\bm{A}}_{ij}}{\sqrt{d_i d_j}}$ is the symmetrically normalized adjacency matrix.

Footnote 4: A remark on the tensor “Hessian” $\nabla^2\hat{\mathcal{H}}$: since $\hat{\mathcal{H}}({\bm{F}})$ depends on a matrix, there is some subtlety in defining the Hessian. However, as can be observed in Eq. (27) and Eq. (32), the feature dimension can be accounted for as follows. We first treat the feature dimension as an irrelevant dimension excluded from the matrix operations, e.g., ${\bm{F}}\nabla^2\hat{\mathcal{H}}{\bm{F}}=\sum_{ik}{\bm{F}}_{ij}\nabla^2\hat{\mathcal{H}}_{ijkl}{\bm{F}}_{kl}$, where the feature dimensions $j$ and $l$ remain free indices while the node indices $i$ and $k$ are contracted as dummy indices. Finally, we take the trace of the resulting #feature $\times$ #feature matrix to obtain the desired value.

Proof.

Using ${\bm{A}}={\bm{A}}^{\top}$ and the definition $y_{ij}^2\coloneqq\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2^2$, the first-order gradient of $\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^2$ is

\nabla_{{\bm{F}}_{mn}}\left(\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}\right) \qquad (20)
= \sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\frac{\partial y^{2}_{ij}}{\partial{\bm{F}}_{mn}} \qquad (21)
= \sum_{(m,j)\in\mathcal{E}}{\bm{W}}_{mj}\frac{\partial y^{2}_{mj}}{\partial{\bm{F}}_{mn}} \qquad (22)
= \sum_{(m,j)\in\mathcal{E}}{\bm{W}}_{mj}\frac{\partial\left(\frac{{\bm{F}}_{mn}}{\sqrt{d_{m}}}-\frac{{\bm{F}}_{jn}}{\sqrt{d_{j}}}\right)^{2}}{\partial{\bm{F}}_{mn}} \qquad (23)
= \sum_{j\in\mathcal{N}(m)}2{\bm{W}}_{mj}\left(\frac{{\bm{F}}_{mn}}{d_{m}}-\frac{{\bm{F}}_{jn}}{\sqrt{d_{m}d_{j}}}\right) \qquad (24)
= 2\sum_{j}{\bm{W}}_{mj}\left(\frac{{\bm{A}}_{mj}}{d_{m}}{\bm{F}}_{mn}-\frac{{\bm{A}}_{mj}}{\sqrt{d_{m}d_{j}}}{\bm{F}}_{jn}\right) \qquad (25)
= 2\left(\frac{\sum_{j}{\bm{W}}_{mj}{\bm{A}}_{mj}}{d_{m}}\right){\bm{F}}_{mn}-2\left(({\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right)_{mn} \qquad (26)
= \left(2(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right)_{mn}, \qquad (27)

and the second-order Hessian is:

\nabla^{2}_{{\bm{F}}_{mn}{\bm{F}}_{kl}}\left(\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}\right) \qquad (28)
= \sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\frac{\partial^{2} y^{2}_{ij}}{\partial{\bm{F}}_{mn}\,\partial{\bm{F}}_{kl}} \qquad (29)
= 2\frac{\partial}{\partial{\bm{F}}_{kl}}\left(\sum_{j}{\bm{W}}_{mj}\left(\frac{{\bm{A}}_{mj}}{d_{m}}{\bm{F}}_{mn}-\frac{{\bm{A}}_{mj}}{\sqrt{d_{m}d_{j}}}{\bm{F}}_{jn}\right)\right) \qquad (30)
= 2\left({\bm{q}}_{m}\delta_{mk}-\sum_{j}{\bm{W}}_{mj}\frac{{\bm{A}}_{mj}}{\sqrt{d_{m}d_{j}}}\delta_{jk}\right)\delta_{nl} \qquad (31)
= 2(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}})_{mk}\delta_{nl}. \qquad (32)

Remark 2.

From Eq. (21) to Eq. (24), one can assume $m\notin\mathcal{N}(m)$ and thus ${\bm{W}}_{mm}=0$. However, a self-loop is often added to ${\bm{A}}$ to facilitate stability by avoiding zero-degree nodes that cannot be normalized. This is not as problematic as it seems: because $\sum_{(i,j)\in\mathcal{E},i\neq j}$ intrinsically excludes the diagonal terms, we can simply assign zero to the diagonal of ${\bm{W}}$ so that the $j=m$ term still vanishes in Eq. (24), as defined in Eq. (5).

To minimize $\hat{\mathcal{H}}$, the gradient descent update rule takes the form of Eq. (6). One may expect that when $\eta$ is chosen small enough, $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$. For a formal justification, Theorem 1 determines the convergence condition on $\eta$.

Theorem 1.

If ${\bm{F}}^{(k)}$ follows the update rule in Eq. (6), where $\rho$ satisfies that $\frac{d\rho(y)}{dy^2}$ is non-increasing for $y\in(0,\infty)$, then a sufficient condition for $\mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)})$ is that the step size $\eta$ satisfies $0<\eta\leq\|\text{diag}({\bm{q}}^{(k)})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\|_2^{-1}$.

Proof.

The descent of $\hat{\mathcal{H}}({\bm{F}})$ ensures the descent of $\mathcal{H}({\bm{F}})$ since $\mathcal{H}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k+1)})+C\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})+C=\mathcal{H}({\bm{F}}^{(k)})$. Therefore, we only need to prove $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$.

Noting that $\hat{\mathcal{H}}$ is a quadratic function and ${\bm{F}}^{(k+1)}-{\bm{F}}^{(k)}=-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})$, $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})$ and $\hat{\mathcal{H}}({\bm{F}}^{(k)})$ can be connected using a Taylor expansion (see footnote 4), where $\nabla\hat{\mathcal{H}}$ and $\nabla^2\hat{\mathcal{H}}$ are given in Lemma 2:

\hat{\mathcal{H}}({\bm{F}}^{(k+1)})-\hat{\mathcal{H}}({\bm{F}}^{(k)}) \qquad (33)
= {\mathrm{tr}}\left(\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})^{\top}({\bm{F}}^{(k+1)}-{\bm{F}}^{(k)})\right) \qquad (34)
\quad+\frac{1}{2}{\mathrm{tr}}\left(({\bm{F}}^{(k+1)}-{\bm{F}}^{(k)})^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{F}}^{(k)})({\bm{F}}^{(k+1)}-{\bm{F}}^{(k)})\right) \qquad (35)
= {\mathrm{tr}}\left(-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})^{\top}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})\right) \qquad (36)
\quad+{\mathrm{tr}}\left(\frac{\eta^{2}}{2}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{F}}^{(k)})\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})\right). \qquad (37)

Inserting $\nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)})=2(\text{diag}({\bm{q}})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}})$ from Lemma 2 into the above equation, we find a sufficient condition for $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$ to be

-\eta+\left\|\text{diag}({\bm{q}})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\right\|_{2}\eta^{2}\leq 0, \qquad (38)

or

\eta\leq\left\|\text{diag}({\bm{q}})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\right\|_{2}^{-1}. \qquad (39)

Now we prove that when taking the quasi-Newton IRLS step as in Eq. (7), the objective $\hat{\mathcal{H}}$ is guaranteed to descend. Since the feature dimensions are independent of each other, we simplify our notation as if the feature dimension were $1$; one can easily recover the general case by taking the trace (see footnote 4).

Lemma 3.

$2\hat{{\bm{Q}}}-\nabla^2\hat{\mathcal{H}}({\bm{y}})$ is positive semi-definite, where $\hat{\mathcal{H}}=\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2$, $\hat{{\bm{Q}}}=2(\text{diag}({\bm{q}})+\lambda{\bm{I}})$, and ${\bm{q}}_m=\sum_j{\bm{W}}_{mj}{\bm{A}}_{mj}/d_m$.

Proof.

In Lemma 2, we have $\nabla^2\hat{\mathcal{H}}({\bm{y}})=2(\text{diag}({\bm{q}})+\lambda{\bm{I}}-{\bm{W}}\odot\tilde{{\bm{A}}})$, so

2\hat{{\bm{Q}}}-\nabla^{2}\hat{\mathcal{H}}({\bm{y}})=2(\text{diag}({\bm{q}})+\lambda{\bm{I}}+{\bm{W}}\odot\tilde{{\bm{A}}}). \qquad (40)

Recall the derivation of Eq. (27) from Eq. (20), where we effectively showed that

\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\left\|\frac{{\bm{f}}_{i}}{\sqrt{d_{i}}}-\frac{{\bm{f}}_{j}}{\sqrt{d_{j}}}\right\|_{2}^{2}={\mathrm{tr}}\left({\bm{F}}^{\top}(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right), \qquad (41)

which holds for all ${\bm{F}}$. The identity still holds after flipping the sign before ${\bm{f}}_j/\sqrt{d_j}$ and before ${\bm{W}}\odot\tilde{{\bm{A}}}$. We then have, for all ${\bm{F}}$ and all $\lambda\geq 0$,

0 \leq \sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\left\|\frac{{\bm{f}}_{i}}{\sqrt{d_{i}}}+\frac{{\bm{f}}_{j}}{\sqrt{d_{j}}}\right\|_{2}^{2}={\mathrm{tr}}\left({\bm{F}}^{\top}(\text{diag}({\bm{q}})+{\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right) \qquad (42)
\leq {\mathrm{tr}}\left({\bm{F}}^{\top}(\text{diag}({\bm{q}})+{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}){\bm{F}}\right). \qquad (43)

Thus $(\text{diag}({\bm{q}})+{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}})\succeq 0$, and therefore $2\hat{{\bm{Q}}}-\nabla^2\hat{\mathcal{H}}({\bm{y}})\succeq 0$. ∎

Using Lemma 1 and Lemma 3, we can prove Theorem 2. Note that we continue to assume the feature dimension is $1$ for simplicity but without loss of generality (see footnote 4).

Theorem 2.

If ${\bm{F}}^{(k+1)}$ follows the update rule in Eq. (7), where $\rho$ satisfies that $\frac{d\rho(y)}{dy^2}$ is non-increasing for all $y\in(0,\infty)$, then it is guaranteed that $\mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)})$.

Proof.

Following the discussion in Theorem 1, we only need to prove $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$. For the quadratic $\hat{\mathcal{H}}$, we have:

\hat{\mathcal{H}}({\bm{x}})=\hat{\mathcal{H}}({\bm{y}})+\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\frac{1}{2}({\bm{x}}-{\bm{y}})^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{y}})({\bm{x}}-{\bm{y}}). \qquad (44)

We can define $\mathcal{Q}({\bm{y}})=2\hat{{\bm{Q}}}({\bm{y}})$ as in Lemma 3 such that $\mathcal{Q}({\bm{y}})-\nabla^2\hat{\mathcal{H}}({\bm{y}})\succeq 0$, so that

\forall{\bm{z}},\quad{\bm{z}}^{\top}\mathcal{Q}({\bm{y}}){\bm{z}}\geq{\bm{z}}^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{y}}){\bm{z}}. \qquad (45)

Then an upper bound of $\hat{\mathcal{H}}({\bm{x}})$ can be found by inserting Eq. (45) into Eq. (44):

\hat{\mathcal{H}}({\bm{x}})\leq\hat{\mathcal{H}}({\bm{y}})+\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\frac{1}{2}({\bm{x}}-{\bm{y}})^{\top}\mathcal{Q}({\bm{y}})({\bm{x}}-{\bm{y}}). \qquad (46)

Then, insert $\mathcal{Q}=2\hat{{\bm{Q}}}$ into Eq. (46). Note that $\hat{{\bm{Q}}}\coloneqq 2(\text{diag}({\bm{q}})+\lambda{\bm{I}})$, so $\hat{{\bm{Q}}}\succeq 0$ and $\hat{{\bm{Q}}}^{\top}=\hat{{\bm{Q}}}$. Thereafter, the update rule ${\bm{x}}={\bm{y}}-\hat{{\bm{Q}}}^{-1}\nabla\hat{\mathcal{H}}({\bm{y}})$ in Eq. (7) gives

\hat{\mathcal{H}}({\bm{x}})-\hat{\mathcal{H}}({\bm{y}}) \qquad (47)
\leq \nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\frac{1}{2}({\bm{x}}-{\bm{y}})^{\top}\mathcal{Q}({\bm{y}})({\bm{x}}-{\bm{y}}) \qquad (48)
= \nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\left(\hat{{\bm{Q}}}^{\frac{1}{2}}({\bm{x}}-{\bm{y}})\right)^{\top}\left(\hat{{\bm{Q}}}^{\frac{1}{2}}({\bm{x}}-{\bm{y}})\right) \qquad (49)
= -\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}\hat{{\bm{Q}}}^{-1}\nabla\hat{\mathcal{H}}({\bm{y}})+\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}(\hat{{\bm{Q}}}^{-\frac{1}{2}})^{\top}\hat{{\bm{Q}}}^{-\frac{1}{2}}\nabla\hat{\mathcal{H}}({\bm{y}}) \qquad (50)
= 0. \qquad (51)

Therefore, our QN-IRLS in Eq. (7) is guaranteed to descend. ∎

Appendix C Computation Efficiency

Our RUNG model remains computationally efficient even though it adopts the quasi-Newton IRLS algorithm.

C.1 Time Complexity Analysis

Each RUNG layer involves computing ${\bm{W}}$, ${\bm{q}}$, and the subsequent aggregations. We elaborate on them one by one. We denote the number of feature dimensions by $d$, the number of nodes by $n$, and the number of edges by $m$, which are assumed to satisfy $m\gg 1$, $n\gg 1$, and $d\gg 1$. The number of layers is denoted by $k$. Asymptotic computational complexity is denoted by $\mathcal{O}(\cdot)$.

Computation of ${\bm{W}}\odot{\bm{A}}$ and ${\bm{W}}\odot\tilde{{\bm{A}}}$.

${\bm{W}}\coloneqq\bm{1}_{i\neq j}\frac{d\rho_\gamma(y_{ij})}{dy_{ij}^2}$ is the edge weighting matrix dependent on the node feature matrix ${\bm{F}}$. Computing $y_{ij}=\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\|_2$ costs $\mathcal{O}(d)$ and computing $\frac{d\rho_\gamma(y_{ij})}{dy_{ij}^2}$ costs $\mathcal{O}(1)$. ${\bm{W}}_{ij}$ only needs to be computed when $(i,j)\in\mathcal{E}$, because for $(i,j)\notin\mathcal{E}$, ${\bm{W}}_{ij}$ is masked out by ${\bm{A}}$ or $\tilde{{\bm{A}}}$ anyway. Each element of ${\bm{W}}$ takes $\mathcal{O}(d)$ time and $m$ elements are needed. In total, ${\bm{W}}$ costs $\mathcal{O}(md)$, and ${\bm{W}}\odot{\bm{A}}$ and ${\bm{W}}\odot\tilde{{\bm{A}}}$ cost $\mathcal{O}(md+m)=\mathcal{O}(md)$.
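As a concrete illustration of this edge-wise cost, a minimal PyTorch sketch of the MCP edge weighting is given below; it is our own illustration under assumed inputs (a COO edge list and precomputed degrees), not the released implementation.

import torch

def edge_weights(F, edge_index, deg, gamma=3.0, eps=1e-8):
    """MCP edge weights W_ij for the m stored edges only, i.e., O(md) time overall.

    F: (n, d) node features; edge_index: (2, m) COO edge list; deg: (n,) node degrees (float)."""
    src, dst = edge_index
    diff = F[src] / deg[src].sqrt().unsqueeze(1) - F[dst] / deg[dst].sqrt().unsqueeze(1)
    y = diff.norm(dim=1).clamp(min=eps)              # y_ij, O(d) per edge
    w = torch.clamp(0.5 / y - 0.5 / gamma, min=0.0)  # d rho_gamma(y) / d y^2, O(1) per edge
    w[src == dst] = 0.0                              # W_ii = 0 (self-loops carry no penalty)
    return w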

Computation of $\hat{{\bm{Q}}}^{-1}$.

$\hat{{\bm{Q}}}^{-1}\coloneqq\frac{1}{2}(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}$ is the inverse approximate Hessian in our quasi-Newton IRLS. Because $\hat{{\bm{Q}}}$ is designed to be diagonal, its inverse can be evaluated as an element-wise reciprocal, which is efficient. As for ${\bm{q}}_m\coloneqq\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m$, only existing edges $(i,j)\in\mathcal{E}$ need to be evaluated in the summation. Therefore, this computation costs $\mathcal{O}(m)$, and $\hat{{\bm{Q}}}^{-1}$ costs $\mathcal{O}(m)$ in total.

Computation of aggregation.

A RUNG layer follows

{\bm{F}}^{(k+1)}=2\hat{{\bm{Q}}}^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (52)

which combines the quantities computed above. The extra graph aggregation, realized by the matrix multiplication between ${\bm{W}}\odot\tilde{{\bm{A}}}$ and ${\bm{F}}$, costs $\mathcal{O}(md)$. The subsequent addition of $\lambda{\bm{F}}^{(0)}$ and the multiplication by the diagonal $\hat{{\bm{Q}}}^{-1}$ each cost $\mathcal{O}(nd)$.
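For illustration, a dense PyTorch sketch of one such layer is given below. It is a simplified reference for Eq. (52) using a dense adjacency matrix (the actual $\mathcal{O}(md)$ implementation would operate on the sparse edge list as analyzed above); all tensor names and default values are our own assumptions.

import torch

def rung_layer(F, F0, A, lam=1.0, gamma=3.0, eps=1e-8):
    """One RUNG aggregation step, Eq. (52): F <- 2 Q^{-1} ((W * A_tilde) F + lam * F0).

    F, F0: (n, d) current and input node features; A: (n, n) dense 0/1 adjacency (float)."""
    deg = A.sum(dim=1).clamp(min=1.0)                      # node degrees d_i
    A_tilde = A / torch.sqrt(deg[:, None] * deg[None, :])  # symmetrically normalized adjacency
    Fn = F / deg.sqrt()[:, None]                           # f_i / sqrt(d_i)
    y = torch.cdist(Fn, Fn).clamp(min=eps)                 # y_ij = ||f_i/sqrt(d_i) - f_j/sqrt(d_j)||_2
    W = torch.clamp(0.5 / y - 0.5 / gamma, min=0.0)        # MCP weights; distant (likely attacked) edges get zero weight
    W.fill_diagonal_(0.0)                                  # W_ii = 0
    q = (W * A).sum(dim=1) / deg                           # q_m = sum_j W_mj A_mj / d_m
    Q_inv = 0.5 / (q + lam)                                # (2 (diag(q) + lam I))^{-1}, diagonal
    return 2.0 * Q_inv[:, None] * ((W * A_tilde) @ F + lam * F0)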

Stacking layers.

RUNG unrolls the QN-IRLS optimization procedure, which runs for multiple iterations. The faster convergence that QN-IRLS provides therefore allows RUNG to use fewer layers, reducing the overall complexity. It is worth noting that QN-IRLS utilizes a diagonal approximate Hessian, so the computation per iteration also remains efficient, as discussed above.

Summing up all the costs, the total computational complexity of our RUNG is $\mathcal{O}((m+n)kd)$. Our RUNG thus scales well to larger graph datasets such as ogbn-arxiv.

Space Complexity Analysis

The only notable extra storage cost is ${\bm{W}}$, whose sparse layout takes $\mathcal{O}(m)$ memory. This is the same order of size as the adjacency matrix itself and thus does not affect the total asymptotic complexity.

C.2 Alternative Perspective

In fact, the above analysis can be simplified by looking at the local aggregation behavior of RUNG. Node $i$ is updated via the aggregation ${\bm{f}}_i=2\hat{{\bm{Q}}}_{ii}^{-1}\left(\sum_{j\in\mathcal{N}(i)}({\bm{W}}\odot\tilde{{\bm{A}}})_{ij}{\bm{f}}_j+\lambda{\bm{f}}_i^{(0)}\right)$. The summation over the neighbors' ${\bm{f}}_j$ contributes $\mathcal{O}(m)$ to the total time complexity for each feature dimension, and each ${\bm{W}}_{ij}$ involves $\mathcal{O}(d)$ computation per neighbor. This also sums up to $\mathcal{O}(md)$. Essentially, the high efficiency of RUNG stems from the fact that each edge weight involves only the two nodes incident to that edge.

Appendix D Comprehensive Bias Analysis

In this section, we provide multiple pieces of evidence from various perspectives to reveal the estimation bias of $\ell_1$-based estimation, as follows.

(1) From the theoretical perspective, the extensive literature on high-dimensional statistics (Tibshirani, 1996b; Zhang, 2010b) has shown that $\ell_1$ regularization induces an estimation bias.

(2) From the algorithmic perspective, in Section 2.3 we explain the solver of the $\ell_1$-based estimation problem. Specifically, the soft-thresholding operator $S_\lambda(\theta):=\text{sign}(\theta)\max(|\theta|-\lambda,0)$ induced by the $\ell_1$ regularized problem shrinks every $\theta$ with $|\theta|>\lambda$ by the constant $\lambda$, biasing the estimator towards zero by the magnitude $\lambda$ (a small sketch contrasting the two thresholding operators is given at the end of this section).

(3) From the numerical simulation in Section 2.3, we provide a mean estimation example to verify this estimation bias. As shown in Figure 2, the $\ell_1$ estimator (green) deviates further from the true mean as the ratio of outliers escalates. This can be clearly explained by the accumulation of estimation bias: each outlier contributes a constant bias, and the bias accumulates with more outliers.

(4) From the performance perspective, $\ell_1$-based GNNs such as SoftMedian, TWIRLS, and RUNG-$\ell_1$ (the $\ell_1$ variant of our model) suffer from significant performance degradation when the attack budget increases.

(5) In our ablation study in Figure 6, we quantify the estimation bias of the aggregated feature ${\bm{f}}_i^{\star}$ on the attacked graph relative to the feature ${\bm{f}}_i$ on the clean graph as $\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{\star}\|_2^2$. The results demonstrate that $\ell_1$-based GNNs produce biased estimations under adversarial attacks and that the bias indeed scales up with the attack budget. In contrast, our proposed RUNG exhibits a nearly zero estimation bias under the same attack budgets.

All of this evidence supports our claim that $\ell_1$-based robust estimators suffer from estimation bias, which validates the motivation of our new algorithm design.
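To make point (2) concrete, here is a minimal NumPy sketch contrasting soft-thresholding with one common form of the MCP thresholding operator; the test values, $\lambda=1$, and $\gamma=3$ are illustrative choices of ours, not settings from the experiments.

import numpy as np

def soft_threshold(theta, lam):
    """l1 proximal operator: every |theta| > lam is shrunk by the constant lam (biased)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def mcp_threshold(theta, lam, gamma=3.0):
    """One common MCP proximal operator (gamma > 1): values with |theta| > gamma * lam are left untouched."""
    shrunk = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0) / (1.0 - 1.0 / gamma)
    return np.where(np.abs(theta) > gamma * lam, theta, shrunk)

theta = np.array([0.5, 2.0, 10.0])
print(soft_threshold(theta, lam=1.0))  # [0.  1.  9.]   -- even 10.0 is shifted by the constant 1.0
print(mcp_threshold(theta, lam=1.0))   # [0.  1.5 10.]  -- large values remain unbiased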

Appendix E Additional Experiment Results

In this section, we present the experiment results that are not shown in the main paper due to space limits.

E.1 Adaptive Attacks

Table 3 and Table 4 present the results of adaptive local and global attacks on Citeseer, as referred to in Section 4.2.

Table 3: Adaptive local attack on Citeseer. The best and second are marked.
Model 0% 20% 50% 100% 150% 200%
MLP 69.3±2.5 69.3±2.5 \underline{69.3±2.5} \mathbf{69.3±2.5} \mathbf{69.3±2.5} \mathbf{69.3±2.5}
GCN 79.3±3.3 44.7±8.8 27.3±7.7 6.7±3.7 0.7±1.3 0.0±0.0
APPNP \mathbf{80.7±4.4} 50.0±6.7 39.3±6.5 16.7±12.3 16.0±8.0 0.0±0.0
GAT 74.7±5.0 15.3±17.5 13.3±13.5 12.0±11.3 12.7±6.5 9.3±5.3
GNNGuard 74.7±4.5 46.0±10.4 32.7±11.0 18.0±7.5 6.0±3.9 4.0±3.9
RGCN \underline{80.0±2.1} 46.7±9.4 32.7±8.8 10.0±5.2 0.7±1.3 0.7±1.3
GRAND 77.3±2.5 56.7±4.2 44.0±3.9 16.7±6.3 0.7±1.3 0.0±0.0
ProGNN \underline{80.0±2.1} 42.7±7.4 26.0±5.3 10.0±4.7 0.7±1.3 0.0±0.0
Jaccard-GCN 78.7±3.4 46.7±7.3 28.0±7.5 6.7±4.7 0.7±1.3 0.0±0.0
SoftMedian 78.7±3.4 69.3±6.5 66.0±7.1 56.0±4.4 8.7±6.9 3.3±3.0
TWIRLS 77.3±2.5 69.3±1.3 68.7±1.6 57.3±2.5 36.7±4.7 26.7±4.7
TWIRLS-T 76.0±3.3 \underline{70.7±2.5} 68.7±2.7 62.0±3.4 52.7±5.3 47.3±8.3
RUNG-$\ell_1$ (ours) \underline{80.0±3.7} \mathbf{75.3±4.5} \mathbf{73.3±3.0} \underline{67.3±3.3} 36.0±9.3 26.0±8.3
RUNG (ours) 77.3±1.3 \underline{70.7±5.7} \underline{69.3±6.8} \underline{67.3±7.1} \underline{64.0±5.7} \underline{61.3±5.8}
Table 4: Adaptive global attack on Citeseer. The best and second are marked.
Model Clean 5% 10% 20% 30% 40%
MLP 67.7±0.3 67.7±0.3 \underline{67.7±0.3} \mathbf{67.7±0.3} \mathbf{67.7±0.3} \mathbf{67.7±0.3}
GCN 74.8±1.2 66.1±1.0 60.9±0.8 53.0±1.0 47.0±0.8 41.2±1.1
APPNP \mathbf{75.3±1.1} 65.8±0.9 60.7±1.6 52.3±1.6 46.0±2.0 41.2±2.2
GAT 73.4±1.2 65.4±1.3 60.4±1.4 52.6±2.5 47.2±3.4 41.2±4.8
GNNGuard 72.4±1.1 65.6±0.9 61.8±1.4 55.6±1.4 51.0±1.3 47.3±1.3
RGCN 74.4±1.0 66.0±0.8 60.6±0.9 52.5±0.8 46.1±0.9 40.2±1.0
GRAND 74.8±0.6 66.6±0.7 61.8±0.7 53.6±1.1 47.4±1.2 42.2±0.9
ProGNN 74.2±1.3 65.6±1.1 60.3±1.1 52.7±1.4 46.2±0.9 40.8±0.6
Jaccard-GCN 74.8±1.2 66.3±1.2 60.9±1.2 53.3±0.9 46.5±0.9 41.1±1.0
EvenNet 74.6±0.5 66.8±0.5 62.0±0.6 55.9±0.4 51.1±0.4 47.4±0.8
GARNET 74.8±1.3 68.0±0.9 64.0±1.1 58.2±0.7 53.9±0.8 51.0±0.9
SoftMedian 74.6±0.7 68.0±0.7 64.4±0.9 59.3±1.1 55.2±2.0 51.9±2.1
TWIRLS 74.2±0.8 69.2±0.8 66.4±0.7 61.6±0.9 58.1±1.2 51.8±1.5
TWIRLS-T 73.7±1.1 69.1±1.2 66.4±1.0 62.8±1.5 60.0±1.4 57.4±1.5
RUNG-$\ell_1$ (ours) \mathbf{75.5±1.1} \underline{69.3±1.2} 65.9±1.2 61.1±1.1 57.2±1.4 53.9±1.3
RUNG (ours) 74.3±0.7 \mathbf{71.4±1.0} \mathbf{69.8±1.3} \underline{67.6±1.2} \underline{66.5±1.3} \underline{65.3±1.5}

E.2 Transfer Attacks

Table 5 and Table 6 report the results of transfer global attacks on Cora ML and Citeseer. Figure 9 and Figure 10 show the results of our RUNG under transfer attacks generated with different surrogate models, as mentioned in Section 4.3.

Table 5: Transfer global evasion attack on Cora ML.
Model Clean 5% 10% 20% 30% 40%
MLP 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0
GCN 85.0±0.4 76.0±0.7 70.6±0.9 62.1±1.0 55.4±1.0 49.8±1.1
APPNP \mathbf{86.3±0.4} 78.0±1.0 72.7±1.2 64.4±1.8 57.7±1.5 53.0±1.2
GAT 83.5±0.5 78.4±0.7 75.1±0.9 69.5±1.7 65.0±1.8 61.4±2.4
GNNGuard 83.1±0.7 79.9±0.7 78.1±0.7 74.8±0.9 71.8±1.1 69.9±1.1
RGCN 85.7±0.4 77.7±0.6 72.7±0.8 64.4±0.8 57.9±0.8 52.6±1.0
GRAND 86.1±0.7 80.2±0.7 76.4±1.0 70.0±1.3 64.6±1.4 60.1±1.1
ProGNN 85.6±0.5 77.8±0.7 72.8±0.9 64.8±1.1 58.7±1.3 53.2±1.3
Jaccard-GCN 83.7±0.7 77.8±0.7 74.1±1.1 68.4±1.1 63.5±1.6 59.3±1.4
ElasticGNN 85.9±0.5 81.9±0.9 79.0±1.0 73.3±1.3 68.1±1.4 63.6±1.5
SoftMedian 85.0±0.7 81.8±0.4 79.0±0.6 73.0±1.1 66.3±1.6 60.4±2.4
TWIRLS 84.2±0.6 81.5±0.7 79.7±0.6 76.8±0.5 74.3±0.7 72.8±0.9
RUNG-$\ell_1$ (ours) 85.8±0.5 82.3±0.8 79.7±0.6 75.1±0.8 70.5±0.8 67.1±0.9
RUNG (ours) 84.6±0.5 \mathbf{83.7±0.6} \mathbf{82.9±0.9} \mathbf{81.3±1.3} \mathbf{79.2±1.6} \mathbf{77.9±2.0}
Table 6: Transfer global evasion attack on Citeseer.
Model Clean 5% 10% 20% 30% 40%
MLP 67.7±0.3 67.7±0.3 67.7±0.3 67.7±0.3 \underline{67.7±0.3} \underline{67.7±0.3}
GCN 74.8±1.2 66.7±1.3 62.1±1.3 54.5±1.7 48.7±1.8 43.4±2.1
APPNP \underline{75.3±1.1} 67.7±1.2 62.9±1.0 55.6±1.0 50.3±1.0 45.8±1.6
GAT 73.4±1.2 68.2±1.3 64.6±1.4 58.2±2.1 53.2±3.1 48.7±3.7
GNNGuard 72.4±1.1 70.6±1.2 69.1±1.3 67.1±1.4 65.4±1.7 64.2±1.9
RGCN 74.4±1.0 67.8±1.0 63.5±1.3 56.3±1.5 51.0±1.4 46.1±1.7
GRAND 74.8±0.6 68.9±0.8 65.0±0.9 59.2±1.3 54.8±1.5 51.2±1.9
ProGNN 74.2±1.3 66.8±1.0 62.3±1.0 55.0±1.1 49.2±1.0 43.7±1.3
Jaccard-GCN 74.8±1.2 68.5±1.1 65.1±1.0 59.0±1.7 54.6±1.9 51.1±1.7
ElasticGNN 75.2±1.2 70.9±1.4 67.7±1.6 62.0±2.5 57.9±3.2 53.4±4.0
SoftMedian 74.6±0.7 68.0±0.7 64.4±0.9 59.3±1.1 55.2±2.0 51.9±2.2
TWIRLS 74.2±0.8 \underline{71.8±1.0} \underline{70.1±0.9} \underline{68.4±1.4} 67.4±1.6 66.4±1.8
RUNG-$\ell_1$ (ours) \mathbf{75.5±1.1} 72.0±1.3 69.3±1.4 65.1±1.8 61.8±2.0 58.7±2.4
RUNG (ours) 74.3±0.7 \mathbf{74.2±1.0} \mathbf{74.2±1.0} \mathbf{74.3±1.0} \mathbf{74.2±1.1} \mathbf{74.1±1.1}
Figure 9: Transfer global attack from different surrogate models to our RUNG on Cora ML.
Figure 10: Transfer global attack from different surrogate models to our RUNG on Citeseer.

Figure 11 shows the results of global evasion transfer attacks between different models on Cora ML. Our observations are summarized below:

  • The attacks generated by RUNG are stronger when applied to more robust models like SoftMedian, while they are not as strong against undefended or weakly defended models.

  • For $\ell_1$ GNNs, the attacks are strongest when transferred from other $\ell_1$ GNNs. This again supports our unified view of $\ell_1$ GNNs. An exception is TWIRLS, because it only has one attention layer and does not always converge to the actual $\ell_1$ objective.

Figure 11: Transfer global attack between different model pairs on Cora.

E.3 Poisoning Attacks

We provide the experiment results under poisoning attacks on Cora ML and Citeseer in Table 7 and Table 8, respectively.

Table 7: Poisoning Attacks on Cora ML.
Budget 5% 10% 20% 30% 40%
GCN 74.9±0.4 69.7±0.7 60.7±0.7 54.0±1.0 48.7±1.0
APPNP 76.3±0.9 71.1±1.2 63.0±1.3 57.1±0.6 53.2±1.1
SoftMedian 79.2±0.7 75.6±0.3 67.8±0.6 62.9±1.0 58.6±0.7
RUNG-$\ell_1$ (ours) 79.7±0.6 76.4±0.6 68.1±0.6 63.8±0.5 60.1±0.9
RUNG (ours) 78.5±0.5 75.5±0.3 71.5±0.4 67.1±1.6 64.6±1.3
Table 8: Poisoning Attacks on Citeseer.
Budget 5% 10% 20% 30% 40%
GCN 65.5±1.1 59.8±1.0 51.0±1.0 44.0±1.2 37.9±1.0
APPNP 64.2±1.8 58.1±2.6 49.8±2.5 43.4±2.3 40.6±2.7
SoftMedian 67.1±1.0 63.8±1.0 58.5±1.1 54.3±1.9 51.2±2.4
RUNG-$\ell_1$ (ours) 68.9±1.0 65.9±1.1 61.0±1.0 57.2±1.1 53.9±1.3
RUNG (ours) 72.4±0.9 72.1±1.2 71.3±1.4 70.8±1.3 69.7±1.4

E.4 Large Scale Ogbn-Arxiv

On large-scale datasets, we cannot directly apply the vanilla PGD attack (Xu et al., 2019) due to its excessive memory and computation requirements on the dense adjacency matrix. Instead, we leverage PRBCD (Geisler et al., 2021b) to scale up the PGD attack without manipulating the dense adjacency matrix. We conduct experiments on the large-scale ogbn-arxiv dataset, and the results in Table 9 verify the superior robustness of our RUNG model. RUNG outperforms its $\ell_1$ variant, which delivers performance similar to SoftMedian and ElasticGNN due to their shared $\ell_1$ graph smoothing.

Table 9: Global PGD Attacks on Ogbn-Arxiv.
Model Clean 1% 5% 10%
GCN 71.9±0.5 63.1±0.4 48.9±3.2 41.8±0.5
APPNP 71.7±0.3 64.2±0.4 50.1±2.3 42.2±1.3
SoftMedian 71.2±0.5 65.1±0.3 54.1±1.6 50.1±2.6
RUNG-$\ell_1$ (ours) 71.6±0.6 65.5±0.4 55.0±1.1 49.6±1.1
RUNG (ours) 70.2±2.1 65.2±0.2 64.0±0.8 61.2±0.7

E.5 Adversarial Training

We conduct adversarial training following Xu et al. (2019) and present the results in Table 10. We observe that adversarial training further improves the robustness of RUNG.

Table 10: Adversarial Training vs Normal Training on RUNG.
Budget Clean 5% 10% 20% 30% 40%
Normal Training 84.6±0.5 78.9±0.4 75.7±0.2 71.8±0.4 67.8±1.3 65.1±1.2
Adversarial Training 84.3±1.2 81.8±0.7 79.9±0.3 77.3±1.1 72.8±0.6 69.1±0.9

E.6 Graph Injection Attack

The injection attack is conducted following the settings in Zou et al. (2021) to evaluate the robustness of different methods. We set the budget on the number of injected nodes to 100 and the degree budget to 200. The results in Table 11 show that our RUNG significantly outperforms the baseline models.

Table 11: Graph Injection Attack on Citeseer.
Model Clean Attacked
GCN 75.40 28.14
APPNP 75.84 25.24
GNNGuard 73.47 34.73
SoftMedian 74.82 38.91
RUNG-1\ell_{1} 75.01 39.22
RUNG-MCP (ours) 75.65 51.13

Appendix F Adaptive Attack Settings

For the adaptive PGD attack adopted in the experiments, we mainly follow the algorithm in Mujkanovic et al. (2022) for the adaptive evasion attack. For the sake of completeness, we describe it below.

In summary, we consider the topology attack setting where the adjacency matrix ${\bm{A}}$ is perturbed by $\delta{\bm{A}}$, whose elements $\delta{\bm{A}}_{ij}\in\{0,1\}$. The budget $B$ is defined by $B\geq\|\delta{\bm{A}}\|_0$. The PGD attack first relaxes $\delta{\bm{A}}$ from binary to continuous values so that a gradient ascent attack can be conducted on the relaxed graph.

During the attack, the minimization problem below is solved:

\delta{\bm{A}}_{\star}=\operatorname*{arg\,min}_{\delta{\bm{A}}}\mathcal{L}_{\text{attack}}\left(\text{GNN}_{\theta}({\bm{A}}+({\bm{I}}-2{\bm{A}})\odot\delta{\bm{A}},{\bm{F}}),y_{\text{target}}\right), \qquad (53)

where $\mathcal{L}_{\text{attack}}$ is a carefully designed attack loss function (Geisler et al., 2021a; Mujkanovic et al., 2022); ${\bm{A}}$, ${\bm{F}}$, and $y_{\text{target}}$ are respectively the graph, the node feature matrix, and the ground-truth labels of the dataset; and $\theta$ denotes the parameters of the GNN under attack, which are not altered in the evasion setting. $({\bm{I}}-2{\bm{A}})\odot\delta{\bm{A}}$ is the perturbation that “flips” the corresponding entries of the adjacency matrix between $0$ and $1$. The gradient $\frac{\partial\mathcal{L}_{\text{attack}}}{\partial\delta{\bm{A}}}$ is computed and used to update the perturbation matrix $\delta{\bm{A}}$.

After the optimization problem is solved, $\delta{\bm{A}}$ is projected back to the feasible domain $\delta{\bm{A}}_{ij}\in[0,1]$ subject to the budget. The relaxed perturbation then serves as a probability matrix for Bernoulli sampling of the binary perturbed adjacency matrix ${\bm{A}}^{\prime}$. The sampling is repeated multiple times, and the ${\bm{A}}^{\prime}$ producing the strongest perturbation is finally selected.
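A schematic PyTorch sketch of this procedure is given below. It is deliberately simplified: the cross-entropy objective, the clamping-based projection, and the step counts are placeholders for the carefully designed attack loss and exact budget projection of Mujkanovic et al. (2022), the symmetry of the undirected graph is ignored, and the interface gnn(A, F) is an assumption.

import torch

def pgd_topology_attack(gnn, A, F, y_target, budget, steps=200, lr=0.1, samples=20):
    """Relax the edge-flip matrix dA to [0, 1], run gradient ascent, then Bernoulli-sample binary flips."""
    n = A.shape[0]
    dA = torch.zeros(n, n, requires_grad=True)
    flip = torch.eye(n) - 2.0 * A                    # (I - 2A) * dA flips the selected entries, cf. Eq. (53)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(gnn(A + flip * dA, F), y_target)
        loss.backward()
        with torch.no_grad():
            dA += lr * dA.grad                       # gradient ascent on the relaxed perturbation
            dA.clamp_(0.0, 1.0)                      # naive projection onto [0, 1]
            if dA.sum() > budget:                    # naive rescaling in place of the exact budget projection
                dA *= budget / dA.sum()
            dA.grad.zero_()
    # Sample binary perturbations from the relaxed dA and keep the strongest one.
    best_A, best_loss = A, float('-inf')
    with torch.no_grad():
        for _ in range(samples):
            A_pert = A + flip * torch.bernoulli(dA)  # budget check of the sampled flips omitted for brevity
            l = torch.nn.functional.cross_entropy(gnn(A_pert, F), y_target).item()
            if l > best_loss:
                best_A, best_loss = A_pert, l
    return best_A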

Appendix G Additional Ablation Study of RUNG

G.1 Hyperparameters

The choice of the hyperparameters $\gamma$ and $\lambda$ is crucial to the performance of RUNG. We therefore experiment with different combinations of them and conduct adaptive attacks on Cora, as shown in Figure 12.

Recall the formulation of RUNG (restated below as Eq. (54)):

{\bm{F}}^{(k+1)}=(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (54)

where ${\bm{q}}^{(k)}_m=\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m$, ${\bm{W}}^{(k)}_{ij}=\bm{1}_{i\neq j}\max\left(0,\frac{1}{2y_{ij}^{(k)}}-\frac{1}{2\gamma}\right)$, and $y^{(k)}_{ij}=\left\|\frac{{\bm{f}}_i^{(k)}}{\sqrt{d_i}}-\frac{{\bm{f}}_j^{(k)}}{\sqrt{d_j}}\right\|_2$.

In this formulation, $\lambda$ controls the intensity of the regularization in graph smoothing. In our experiments, we tune $\hat{\lambda}\coloneqq\frac{1}{1+\lambda}$, which is normalized into $(0,1)$. In Figure 12, the optimal value of $\hat{\lambda}$ is found almost always near $0.9$, regardless of the attack budget. This indicates that in our penalty formulation $\lambda$ is effectively decoupled from $\gamma$, which makes the tuning easier, in contrast to the commonly used formulation of MCP (Zhang, 2010a).

On the other hand, $\gamma$ has a more intricate impact on the performance of RUNG. Generally speaking, the smaller $\gamma$ is, the more edges get pruned, which leads to higher robustness but lower clean accuracy. We discuss three cases:

Small attack budget ($0\%, 5\%, 10\%$). The performance is largely determined by the clean accuracy. Besides, when $\gamma\rightarrow\infty$, RUNG becomes a state-of-the-art robust $\ell_1$ model. Therefore, a small $\gamma$ likely harms the clean performance more than it improves robustness over $\ell_1$ models, and the optimal $\gamma$ essentially recovers the performance of $\ell_1$ models.

Large attack budget ($20\%, 30\%, 40\%$). In these cases, $\gamma\rightarrow\infty$ is no longer a good choice because $\ell_1$ models begin to suffer from the accumulated bias effect. The optimal $\gamma$ is thus smaller (near $0.5$). However, for fairness, we choose the same $\gamma$ under different budgets in our experiments, so the reported RUNG fixes $\gamma=3$. In practice, when the possible attack budgets are known in advance, $\gamma$ can be tuned for even better performance.

Very large attack budget ($50\%, 60\%$). We do not include these scenarios because almost all GNNs perform poorly in this regime. However, we believe it can provide some insights into robust graph learning. Under these budgets, more than half of the edges are perturbed; in the context of robust statistics (e.g., mean estimation), the estimator would certainly break down. However, in our graph estimation problem, the input node features offer extra information, allowing us to exploit the graph information even beyond the breakdown point. In the “peak” near $(0.9, 0.5)$, RUNG achieves $>70\%$ accuracy, which is higher than MLP. This indicates that the edge weighting of RUNG is capable of securely harnessing the graph information even in the presence of strong adversarial attacks. The “ridge” near $\hat{\lambda}=0.2$, on the other hand, emerges because of MLP: when the regularization dominates, $\lambda\rightarrow\infty$ and $\hat{\lambda}\rightarrow 0$, so a small $\hat{\lambda}$ corresponds to a larger emphasis on the input node feature prior. Under large attack budgets, MLP delivers relatively good estimation, so a small $\hat{\lambda}$ is beneficial.

Figure 12: The performance dependence of RUNG on the hyperparameters $\gamma$ and $\lambda$, evaluated under different attack budgets. The attack setting is the global evasion attack and the dataset is Cora. Note that the x-axis is $\hat{\lambda}\coloneqq\frac{1}{1+\lambda}$ instead of $\lambda$.

G.2 GNN Layers

In RUNG, QN-IRLS is unrolled into GNN layers. We would naturally expect RUNG to have a sufficient number of layers so that the estimator converges as desired. We conduct an ablation study on the performance (clean and adversarial) of RUNG with different numbers of layers, and the results are shown in Figure 13. We make the following observations:

  • As the number of layers increases, RUNG exhibits better performance. This verifies the effectiveness of our proposed RUGE, as well as the stable convergence of QN-IRLS.

  • The performance of RUNG reaches a reasonably good level even with a small number of layers ($3$–$5$ layers), thanks to the accelerated convergence powered by QN-IRLS. This can further reduce the computational complexity of RUNG.

Figure 13: The performance dependence of RUNG on the number of aggregation layers.

Appendix H Robustness of GCN and APPNP

In addition to the formulation in Section 3 of the main text, we can also apply our edge reweighting technique to the GCN architecture. Essentially, the aggregation operation in GCN can be viewed as an APPNP layer with $\hat{\lambda}=0$. The GCN version of a layer in our model can then be formulated as

{\bm{F}}^{(k+1)}=\text{ReLU}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}{\bm{M}}^{(k)}\right), \qquad (55)

where ${\bm{M}}^{(k)}$ is the learned weight matrix of the $k$-th GCN layer.

GCN-based defenses are less robust than APPNP-based defenses

GCN consists of layers in which both feature transformation and message passing are performed. This coupled graph convolution operation weakens defense methods that rely on edge weighting, such as GNNGuard, models using the $\ell_1$ penalty, as well as our method using the MCP penalty.

Consider an edge added by the attacker (almost all attacks add new edges, as shown by our experiments, so this is almost always the case). A successful defense should detect the attack on this edge from the large difference between the nodes it connects, and then assign a zero weight, or at least a smaller weight, to this edge. However, in GCN, even if this edge is detected in the first layer's message passing, the subsequent feature transformation makes the node difference less likely to be preserved until the second layer. This is where the attack can evade the defense, and it is thus a vulnerability that lets adaptive attacks through. According to our experiments, using different defense parameters in different layers of GCN unfortunately does not help much either.

On the other hand, in APPNP the node features are also altered along the message-passing layers, but the change in node distances is more regulated than in GCN since the MLP is decoupled from the graph smoothing layers. Within the graph smoothing submodule, node differences simply decrease, which enables our defense based on node differences.

Experiments

It can be seen from Figure 14 that the correlation of node feature differences across layers in GCN is about 4 times lower than in APPNP, which means that an attack detected in the first layer is less likely to remain detectable in the second layer than in APPNP. This property of GCN makes the many defense methods that use the GCN architecture as a backbone less robust, as shown in the experiment results in Table 12 and Table 13. Nevertheless, our GCN-based MCP model outperforms the SOTA models using $\ell_1$ methods.
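The per-layer node-distance statistic behind Figure 14 can be computed with a short NumPy sketch like the following; features_per_layer (hidden node features collected at each propagation layer) and edges are hypothetical inputs, not variables from our code.

import numpy as np

def edge_distances(F, edges):
    """||f_i - f_j||_2 for every edge (i, j); F: (n, d) features, edges: (m, 2) int array."""
    src, dst = edges[:, 0], edges[:, 1]
    return np.linalg.norm(F[src] - F[dst], axis=1)

def layer_distance_correlation(features_per_layer, edges):
    """Pearson correlation of per-edge distances between consecutive layers (cf. Figure 14)."""
    dists = [edge_distances(F, edges) for F in features_per_layer]
    return [np.corrcoef(dists[k], dists[k + 1])[0, 1] for k in range(len(dists) - 1)]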

Figure 14: The distances $\|{\bm{f}}_i-{\bm{f}}_j\|_2$ between connected nodes in different layers. The linear transformation in the aggregation layers of GCN reduces the correlation between the node distances at different propagation layers.
Model Clean 5% 10% 20% 30% 40%
RUNG-$\ell_1$ 85.8±0.5 78.4±0.4 74.3±0.3 68.1±0.6 63.5±0.7 59.8±0.8
RUNG 84.6±0.5 78.9±0.4 75.7±0.2 71.8±0.4 67.8±1.3 65.1±1.2
RUNG-$\ell_1$-GCN 84.0±0.4 74.7±0.6 69.7±0.7 62.7±0.6 57.6±0.6 53.5±0.8
RUNG-GCN 82.6±0.6 76.1±0.7 71.0±0.9 64.3±1.1 59.9±0.8 56.5±1.1
Table 12: Comparison between APPNP-based and GCN-based QN-IRLS on Cora.
Model Clean 5% 10% 20% 30% 40%
RUNG-$\ell_1$ 75.5±1.1 69.3±1.2 65.9±1.2 61.1±1.1 57.2±1.4 53.9±1.3
RUNG 74.3±0.7 71.4±1.0 69.8±1.3 67.6±1.2 66.5±1.3 65.3±1.5
RUNG-$\ell_1$-GCN 73.8±1.2 66.1±1.1 61.9±1.0 56.0±0.7 51.3±0.8 47.7±0.6
RUNG-GCN 73.0±0.8 68.2±0.8 64.3±1.3 59.4±1.0 55.7±1.4 53.1±1.6
Table 13: Comparison between APPNP-based and GCN-based QN-IRLS on Citeseer.