
Robust Graph Neural Networks
via Unbiased Aggregation

Zhichao Hou1  Ruiqi Feng1  Tyler Derr2  Xiaorui Liu1
Equal contribution. Corresponding author.
   1North Carolina State University, 2Vanderbilt University
   {zhou4,xliu96}@ncsu.edu  ruiqifeng.2024@gmail.com  tyler.derr@vanderbilt.edu
Abstract

The adversarial robustness of Graph Neural Networks (GNNs) has been questioned due to the false sense of security uncovered by strong adaptive attacks despite the existence of numerous defenses. In this work, we delve into the robustness analysis of representative robust GNNs and provide a unified robust estimation point of view to understand their robustness and limitations. Our novel analysis of estimation bias motivates the design of a robust and unbiased graph signal estimator. We then develop an efficient Quasi-Newton Iterative Reweighted Least Squares algorithm to solve the estimation problem, which is unfolded as robust unbiased aggregation layers in GNNs with theoretical guarantees. Our comprehensive experiments confirm the strong robustness of our proposed model under various scenarios, and the ablation study provides a deep understanding of its advantages. Our code is available at https://github.com/chris-hzc/RUNG.

1 Introduction

Graph neural networks (GNNs) have gained tremendous popularity in recent years due to their ability to capture topological relationships in graph-structured data (Ma & Tang, 2021). However, most GNNs are vulnerable to adversarial attacks, which can lead to a substantial decline in predictive performance  (Zhang & Zitnik, 2020; Jin et al., 2020; Geisler et al., 2021a). Despite the numerous defense strategies proposed to robustify GNNs, a recent study has revealed that most of these defenses are not as robust as initially claimed (Mujkanovic et al., 2022). Specifically, under adaptive attacks, they easily underperform the multi-layer perceptrons (MLPs) which do not utilize the graph topology information at all (Mujkanovic et al., 2022). Therefore, it is imperative to thoroughly investigate the limitations of existing defenses and develop innovative robust GNNs to securely harness the topology information in graphs.

Existing defenses attempt to bolster the resilience of GNNs using diverse approaches. For instance, Jaccard-GCN (Wu et al., 2019) and SVD-GCN (Entezari et al., 2020) aim to denoise the graph by removing potential adversarial edges during the pre-processing procedure, while ProGNN (Jin et al., 2020) learns the clean graph structure during the training process. GRAND (Feng et al., 2020) and robust training (Deng et al., 2023; Chen et al., 2020) also improve the training procedure through data augmentation. GNNGuard (Zhang & Zitnik, 2020) and RGCN (Zhu et al., 2019) reinforce their GNN architectures by heuristically reweighting edges in the graph. Additionally, some ODE-inspired architectures, including GraphCON (Rusch et al., 2022) and HANG (Zhao et al., 2024), have emerged and demonstrate decent robustness. Although most of these defenses exhibit decent robustness against transfer attacks, i.e., attacks generated through surrogate models, they encounter catastrophic performance drops when confronted with adaptive adversarial attacks that directly attack the victim model (Mujkanovic et al., 2022). Concerned by this false sense of security, we provide a comprehensive study of existing defenses under adaptive attacks. Our preliminary study in Section 2 indicates that SoftMedian (Geisler et al., 2021a), TWIRLS (Yang et al., 2021), and ElasticGNN (Liu et al., 2021) exhibit closely aligned performance and notably outperform other defenses despite their apparent architectural differences. However, as attack budgets increase, these defenses still experience a severe performance decrease and underperform the graph-agnostic MLPs. These observations are intriguing, but the underlying reasons are still unclear.

To unravel the aligned robustness and performance degradation of SoftMedian, TWIRLS, and ElasticGNN, we delve into their theoretical understanding and unveil their inherent connections and limitations in the underlying principles. Specifically, their improved robustness can be understood from a unified view of \ell_1-based robust graph smoothing. Moreover, we unearth the problematic estimation bias of \ell_1-based graph smoothing that allows the adversarial impact to accumulate as the attack budget escalates, which provides a plausible explanation of their declining robustness. Motivated by these understandings, we propose a robust and unbiased graph signal estimator to reduce the estimation bias in GNNs. We design an efficient Quasi-Newton IRLS algorithm that unrolls as robust unbiased aggregation layers to safeguard GNNs against adversarial attacks. Our contributions can be summarized as follows:

  • We provide a unified view of \ell_1-based robust graph signal smoothing to justify the improved and closely aligned robustness of representative robust GNNs. Moreover, we reveal their estimation bias, which explains their severe performance degradation as the attack budgets increase.

  • We propose a robust and unbiased graph signal estimator to mitigate the estimation bias in \ell_1-based graph signal smoothing and design an efficient Quasi-Newton IRLS algorithm to solve the non-smooth and non-convex estimation problem with theoretical guarantees.

  • The proposed algorithm can be readily unfolded as feature aggregation building blocks in GNNs, which not only provides clear interpretability but also covers many classic GNNs as special cases.

  • Comprehensive experiments demonstrate that our proposed GNN significantly improves the robustness while maintaining clean accuracy. We also provide comprehensive ablation studies to validate its working mechanism.

2 Estimation Bias Analysis of Robust GNNs

In Section 2.1, we conduct a preliminary study to evaluate the robustness of several representative robust GNNs. In Section 2.2, we establish a unified view of \ell_1-based models to uncover the inherent connections of three well-performing GNNs, including SoftMedian, TWIRLS, and ElasticGNN. In Section 2.3, we leverage the bias of \ell_1-based estimation to explain the catastrophic performance degradation in the preliminary experiments.

Notation. Let {\mathcal{G}}=\{{\mathcal{V}},{\mathcal{E}}\} be a graph with node set {\mathcal{V}}=\{v_1,\dots,v_n\} and edge set {\mathcal{E}}=\{e_1,\dots,e_m\}. The adjacency matrix of {\mathcal{G}} is denoted as {\bm{A}}\in\{0,1\}^{n\times n} and the graph Laplacian matrix is {\bm{L}}={\bm{D}}-{\bm{A}}, where {\bm{D}}=\text{diag}(d_1,\dots,d_n) is the degree matrix with d_i=|{\mathcal{N}}(i)| and {\mathcal{N}}(i) is the neighborhood set of v_i. The node feature matrix is denoted as {\bm{F}}=[{\bm{f}}_1,\dots,{\bm{f}}_n]^\top\in\mathbb{R}^{n\times d}, and {\bm{f}}^{(0)} ({\bm{F}}^{(0)}) denotes the node feature vector (matrix) before graph smoothing in decoupled GNN models. Let \Delta\in\{-1,0,1\}^{m\times n} be the incidence matrix whose l-th row encodes the l-th edge e_l=(i,j) such that \Delta_{li}=-1, \Delta_{lj}=1, and \Delta_{lk}=0~\forall k\notin\{i,j\}. \tilde{\Delta} is its normalized version: \tilde{\Delta}_{lj}=\Delta_{lj}/\sqrt{d_j}. For a vector {\bm{x}}\in\mathbb{R}^d, we use the \ell_1-based graph smoothing penalty to denote either \|{\bm{x}}\|_1=\sum_i|{\bm{x}}_i| or \|{\bm{x}}\|_2=\sqrt{\sum_i{\bm{x}}_i^2}, and the \ell_2-based graph smoothing penalty to denote \|{\bm{x}}\|_2^2=\sum_i{\bm{x}}_i^2.
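For concreteness, here is a minimal NumPy sketch of the graph quantities defined above on a toy three-node path graph (the toy graph and all variable names are our own illustration, not part of the paper):

```python
import numpy as np

# Toy graph with n = 3 nodes and m = 2 edges: e_1 = (0, 1), e_2 = (1, 2).
edges = [(0, 1), (1, 2)]
n, m = 3, len(edges)

A = np.zeros((n, n))                      # adjacency matrix A
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

d = A.sum(axis=1)                         # node degrees d_i = |N(i)|
D = np.diag(d)                            # degree matrix D
L = D - A                                 # graph Laplacian L = D - A

Delta = np.zeros((m, n))                  # incidence matrix: row l encodes e_l = (i, j)
for l, (i, j) in enumerate(edges):
    Delta[l, i], Delta[l, j] = -1.0, 1.0

Delta_tilde = Delta / np.sqrt(d)          # normalized incidence: Delta_lj / sqrt(d_j)

# Sanity check: Delta^T Delta equals the unnormalized Laplacian L.
assert np.allclose(Delta.T @ Delta, L)
```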

2.1 Robustness Analysis

To test the robustness of existing GNNs without the false sense of security, we perform a preliminary evaluation of existing robust GNNs against adaptive attacks. We choose various baselines including the undefended MLP, GCN (Kipf & Welling, 2016), some of the most representative defenses in (Mujkanovic et al., 2022), and two additional robust models TWIRLS (Yang et al., 2021) and ElasticGNN (Liu et al., 2021). We execute adaptive local evasion topological attacks and test the node classification accuracy on the Cora ML and Citeseer datasets. The detailed settings follow Section 4.1. From Figure 1, it can be observed that:

Figure 1: Robustness analysis under adaptive local attack. The perturbation budget (x-axis) is the number of edges allowed to be perturbed relative to the target node’s degree. SoftMedian, TWIRLS, and ElasticGNN (blue curves) exhibit similarly aligned competitive robustness among all the selected robust GNNs, but all models experience catastrophic performance degradation as the attack budget increases.
  • Among all the selected robust GNNs, only SoftMedian, TWIRLS, and ElasticGNN exhibit notable and closely aligned improvements in robustness, whereas the other GNNs do not show obvious improvement over the undefended GCN.

  • SoftMedian, TWIRLS, and ElasticGNN encounter a similar catastrophic performance degradation as the attack budget scales up. Their accuracy easily drops below that of the graph-unaware MLP, indicating their failure in safely exploiting the topology of the data.

2.2 A Unified View of Robust Estimation

Our preliminary study provides intriguing observations in Section 2.1, but the underlying reasons behind these phenomena remain obscure. This motivates us to delve into their theoretical understanding and explanation. In this section, we will compare the architectures of three well-performing GNNs, aiming to reveal their intrinsic connections.

SoftMedian (Geisler et al., 2021a) substitutes the GCN aggregation with the dimension-wise median {\bm{m}}_i\in\mathbb{R}^d over all neighbors of each node i\in{\mathcal{V}} for enhanced robustness. However, computing the median involves operations like ranking and selection, which is not compatible with the back-propagation training of GNNs. Therefore, the median is approximated as a differentiable weighted sum \tilde{\bm{m}}_i=\frac{1}{Z}\sum_{j\in{\mathcal{N}}(i)}w({\bm{f}}_j,{\bm{m}}_i){\bm{f}}_j, \forall i\in{\mathcal{V}}, where {\bm{m}}_i is the exact non-differentiable dimension-wise median, {\bm{f}}_j is the feature vector of the j-th neighbor, w({\bm{x}},{\bm{y}})=e^{-\beta\|{\bm{x}}-{\bm{y}}\|_2}, and Z=\sum_k w({\bm{f}}_k,{\bm{m}}_i) is a normalization factor. In this way, the aggregation assigns the largest weights to the neighbors closest to the actual median.
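A small NumPy sketch of this weighted-sum approximation (we normalize the weights over the neighbors of node i, which is our reading of the normalization factor Z; the toy features and the value of β are illustrative assumptions):

```python
import numpy as np

def soft_median(neighbor_feats: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Differentiable approximation of the dimension-wise median of a node's neighbors.

    neighbor_feats: array of shape (num_neighbors, d) holding the features f_j, j in N(i).
    """
    m = np.median(neighbor_feats, axis=0)                               # exact dimension-wise median m_i
    w = np.exp(-beta * np.linalg.norm(neighbor_feats - m, axis=1))      # weights w(f_j, m_i)
    return (w[:, None] * neighbor_feats).sum(axis=0) / w.sum()          # weighted sum divided by Z

# Toy example: three "clean" neighbors near the origin and one outlier neighbor.
feats = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [8.0, 8.0]])
print(soft_median(feats, beta=1.0))   # stays close to the clean neighbors
```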

TWIRLS (Yang et al., 2021) utilizes the iteratively reweighted least squares (IRLS) algorithm to optimize the following objective with parameter \lambda, where \rho(y)=y is the default:

2\lambda\sum_{(i,j)\in\mathcal{E}}\rho(\|\tilde{{\bm{f}}}_i-\tilde{{\bm{f}}}_j\|_2)+\sum_{i\in\mathcal{V}}\|\tilde{{\bm{f}}}_i-\tilde{{\bm{f}}}_i^{(0)}\|_2^2, \quad \tilde{{\bm{f}}}_i=(1+\lambda d_i)^{-\frac{1}{2}}{\bm{f}}_i. \qquad (1)

ElasticGNN (Liu et al., 2021) proposes the elastic message passing, which unfolds the proximal alternating predictor-corrector (PAPC) algorithm to minimize the objective with parameters \lambda_1 and \lambda_2:

\frac{1}{2}\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2+\lambda_1\sum_{(i,j)\in\mathcal{E}}\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_p+\lambda_2\sum_{(i,j)\in\mathcal{E}}\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2^2, \quad \text{where } p\in\{1,2\}. \qquad (2)

A Unified View of Robust Estimation. While these three approaches have seemingly different architectures, we provide a unified view of robust estimation to illuminate their inherent connections. First, the objective of TWIRLS in Eq. (1) can be considered as a particular case of ElasticGNN with \lambda_2=0 and p=2 when neglecting the difference in the node degree normalization. However, TWIRLS and ElasticGNN leverage different optimization solvers, i.e., IRLS and PAPC, which leads to vastly different GNN layers. Second, SoftMedian approximates the computation of medians in a soft way via weighted sums, which can be regarded as approximately solving the dimension-wise median estimation problem (Huber, 2004): \operatorname*{arg\,min}_{{\bm{f}}_i}\sum_{j\in\mathcal{N}(i)}\|{\bm{f}}_i-{\bm{f}}_j\|_1. Therefore, SoftMedian can be regarded as ElasticGNN with \lambda_2=0 and p=1. We also note that the SoftMedoid (Geisler et al., 2020) approach also resembles ElasticGNN with \lambda_2=0 and p=2, and the Total Variation GNN (Hansen & Bianchi, 2023) also utilizes an \ell_1 estimator in spectral clustering.

The above analyses suggest that SoftMedian, TWIRLS, and ElasticGNN share the same underlying design principle of \ell_1-based robust graph signal estimation, i.e., a similar graph smoothing objective with edge difference penalties \|{\bm{f}}_i-{\bm{f}}_j\|_1 or \|{\bm{f}}_i-{\bm{f}}_j\|_2. However, they adopt different approximate solutions that result in distinct architecture designs. This unified view of robust estimation clearly explains their closely aligned performance. Besides, the superiority of \ell_1-based models over \ell_2-based models such as GCN (Kipf & Welling, 2016), whose graph smoothing objective is essentially \sum_{(i,j)\in\mathcal{E}}\|{\bm{f}}_i/\sqrt{d_i}-{\bm{f}}_j/\sqrt{d_j}\|_2^2 (Ma et al., 2021), can be explained by the fact that \ell_1-based graph smoothing mitigates the impact of outliers (Liu et al., 2021).

2.3 Bias Analysis and Performance Degradation

The unified view of \ell_1-based graph smoothing we established in Section 2.2 not only explains their aligned robustness improvement but also provides a perspective, through an estimation bias analysis, to understand their failure as attack budgets scale up.

Figure 2: Different mean estimators in the presence of outliers. The clean samples are the majority of data points following the Gaussian distribution \mathcal{N}((0,0),1\cdot I), while the outliers are data points that deviate significantly from the main data pattern, following \mathcal{N}((8,8),0.5\cdot I). The \ell_2-based estimator deviates far from the true mean, while the \ell_1-based estimator is more resistant to outliers. However, as the ratio of outliers escalates, the \ell_1-based estimator encounters a greater shift from the true mean, but our estimator still maintains a position close to the ground truth.

Bias of \ell_1-based Estimation. In the literature of high-dimensional statistics, it has been well understood that \ell_1 regularization will induce an estimation bias. In the context of denoising (Donoho, 1995) or variable selection (Tibshirani, 1996a), small coefficients \beta are undesirable. To exclude small \beta in the estimation, a soft-thresholding operator can be derived as {\mathbf{S}}_\lambda(\beta)=\text{sign}(\beta)\max(|\beta|-\lambda,0). As a result, large \beta are also shrunk by a constant, so the \ell_1 estimation is biased towards zero.
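As a minimal illustration of this operator and the constant shrinkage it applies to large coefficients (the example values are our own):

```python
import numpy as np

def soft_threshold(beta: np.ndarray, lam: float) -> np.ndarray:
    """Soft-thresholding operator S_lambda(beta) = sign(beta) * max(|beta| - lambda, 0)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta = np.array([0.05, 0.5, 5.0, 50.0])
print(soft_threshold(beta, lam=0.1))
# -> [ 0.   0.4  4.9 49.9]: small coefficients are zeroed out, but every large
# coefficient is also shrunk by the constant lambda, which is the estimation bias.
```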

A similar bias effect also occurs in graph signal estimation in the presence of adversarial attacks. For example, in TWIRLS (Eq. (1)), after the graph aggregation \tilde{\bm{f}}_i^{(k+1)}=\sum_{j\in\mathcal{N}(i)}w_{ij}\tilde{\bm{f}}_j^{(k)} where w_{ij}=\|\tilde{{\bm{f}}}_i-\tilde{{\bm{f}}}_j\|_2^{-1}, the edge difference \tilde{\bm{f}}_i-\tilde{\bm{f}}_j will shrink towards zero. Consequently, every adversarial edge the attacker adds will induce a bias that can be accumulated and amplified when the attack budget scales up.

Numerical Simulation. To provide a more intuitive illustration of the estimation bias of \ell_1-based models, we simulate a mean estimation problem on synthetic data since most message passing schemes in GNNs essentially estimate the mean of neighboring node features. The results in Figure 2 show that the \ell_1-based estimator is more resistant than the \ell_2-based estimator. However, as the ratio of outliers escalates, the \ell_1-based estimator encounters a greater shift from the true mean due to the accumulated bias caused by outliers. This observation explains why \ell_1-based graph smoothing models suffer from catastrophic degradation under large attack budgets. The detailed simulation settings and results are available in Appendix A.

3 Robust GNNs with Unbiased Aggregation

In this section, we first design a robust unbiased estimator to reduce the bias in graph signal estimation in Section 3.1 and propose an efficient second-order IRLS algorithm to compute the robust estimator with theoretical convergence guarantees in Section 3.2. Finally, we unroll the proposed algorithm as the robust unbiased feature aggregation layers in GNNs in Section 3.3.

3.1 Robust and Unbiased Graph Signal Estimator

Our study and analysis in Section 2 have shown that while \ell_1-based methods outperform \ell_2-based methods in robustness, they still suffer from the accumulated estimation bias, leading to severe performance degradation under large perturbation budgets. This motivates us to design a robust and unbiased graph signal estimator that derives unbiased robust aggregation for GNNs with stronger resilience to attacks.

Theoretically, the estimation bias in Lasso regression has been discovered and analyzed in high-dimensional statistics (Zou, 2006). Statisticians have proposed adaptive Lasso (Zou, 2006) and many non-convex penalties such as Smoothly Clipped Absolute Deviation (SCAD) (Fan & Li, 2001) and Minimax Concave Penalty (MCP) (Zhang, 2010a) to alleviate this bias. Motivated by these advancements, we propose a Robust and Unbiased Graph signal Estimator (RUGE) as follows:

\operatorname*{arg\,min}_{{\bm{F}}}\mathcal{H}({\bm{F}})=\sum_{(i,j)\in\mathcal{E}}\rho_\gamma\left(\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2\right)+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2, \qquad (3)

where \rho_\gamma(y) denotes the function that penalizes the feature differences on edges by MCP:

\rho_\gamma(y)=\begin{cases}y-\frac{y^2}{2\gamma}&\text{if }y<\gamma\\ \frac{\gamma}{2}&\text{if }y\geq\gamma\end{cases}. \qquad (4)
Figure 3: Penalties.

As shown in Figure 3, MCP closely approximates the \ell_1 norm when y is small since the quadratic term \frac{y^2}{2\gamma} is negligible, and it becomes a constant value when y is large. This transition can be adjusted by the thresholding parameter \gamma. When \gamma approaches infinity, the penalty \rho_\gamma(y) reduces to the \ell_1 norm. Conversely, when \gamma is very small, the “valley” of \rho_\gamma near zero is exceptionally sharp, so \rho_\gamma(y) approaches the \ell_0 norm and becomes a constant for a slightly larger y. This enables RUGE to suppress smoothing on edges whose node differences exceed the threshold \gamma, which not only mitigates the estimation bias against outliers but also maintains the estimation accuracy in the absence of outliers. The simulation in Figure 2 verifies that our proposed estimator (\eta({\bm{x}})\coloneqq\rho_\gamma(\|{\bm{x}}\|_2)) can recover the true mean despite the increasing outlier ratio, as long as the outlier ratio is below the theoretical optimal breakdown point.
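A one-function NumPy sketch of the MCP penalty in Eq. (4) (the sample grid and the value of γ are illustrative):

```python
import numpy as np

def mcp_penalty(y: np.ndarray, gamma: float) -> np.ndarray:
    """Minimax Concave Penalty rho_gamma(y) from Eq. (4), applied element-wise (y >= 0)."""
    return np.where(y < gamma, y - y**2 / (2.0 * gamma), gamma / 2.0)

y = np.linspace(0.0, 3.0, 7)
print(mcp_penalty(y, gamma=1.0))
# -> [0. 0.375 0.5 0.5 0.5 0.5 0.5]: near zero it behaves like the l1 norm
# (slope ~1); beyond gamma it is the constant gamma / 2, so large edge
# differences add no further penalty.
```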

3.2 Quasi-Newton IRLS

Despite the advantages discussed above, the proposed RUGE in Eq. (3) is non-smooth and non-convex, which results in challenges for deriving efficient numerical solutions that can be readily unfolded as neural network layers. In the literature, researchers have developed optimization algorithms for MCP-related problems, such as the Alternating Direction Multiplier Method (ADMM) and Newton-type algorithms (Fan & Li, 2001; Zhang, 2010a; Varma et al., 2019). However, due to their excessive computation and memory requirements as well as the incompatibility with back-propagation training, these algorithms are not well-suited for the construction of feature aggregation layers in GNNs. To solve these challenges, we propose an efficient Quasi-Newton Iteratively Reweighted Least Squares algorithm (QN-IRLS) to solve the estimation problem in Eq. (3).

IRLS. Before stepping into our QN-IRLS, we first introduce the main idea of iteratively reweighted least squares (IRLS) (Holland & Welsch, 1977) and analyze its weakness in convergence. IRLS aims to circumvent the non-smooth \mathcal{H}({\bm{F}}) in Eq. (3) by computing its quadratic upper bound \hat{\mathcal{H}} based on {\bm{F}}^{(k)} in each iteration k and optimizing \hat{\mathcal{H}}({\bm{F}}):

\hat{\mathcal{H}}({\bm{F}})=\sum_{(i,j)\in\mathcal{E}}{\bm{W}}_{ij}^{(k)}\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2^2+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2, \qquad (5)

where {\bm{W}}^{(k)}_{ij}=\bm{1}_{i\neq j}\frac{d\rho_\gamma(y_{ij})}{dy_{ij}^2}\big|_{y_{ij}=y_{ij}^{(k)}}, y^{(k)}_{ij}=\big\|{\bm{f}}_i^{(k)}/\sqrt{d_i}-{\bm{f}}_j^{(k)}/\sqrt{d_j}\big\|_2, and \rho_\gamma(\cdot) is the MCP function. ({\bm{W}}_{ij} is defined as \frac{d\rho(y)}{dy^2}\big|_{y=y_{ij}^{(k)}} so that the quadratic upper bound \hat{\mathcal{H}} is tight at {\bm{F}}^{(k)} according to Lemma 3; the diagonal terms of {\bm{W}} are set to zero to avoid the undefined derivative \frac{d\rho(y)}{dy^2}\big|_{y=0}, as discussed in Remark 2.) For the detailed proof of the upper bound, please refer to Lemma 1 in Appendix B. Then, each iterative step of IRLS can be formulated as first-order gradient descent on \hat{\mathcal{H}}({\bm{F}}):

{\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})={\bm{F}}^{(k)}-\eta\left((\hat{\bm{Q}}^{(k)}-2{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}-2\lambda{\bm{F}}^{(0)}\right), \qquad (6)

where \eta is the step size, \hat{\bm{Q}}^{(k)}=2(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}}), and {\bm{q}}^{(k)}_m=\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m. Its convergence condition is given in Theorem 1, with a proof in Appendix B.

Theorem 1.

If {\bm{F}}^{(k)} follows the update rule in Eq. (6), where the \rho defining {\bm{W}} satisfies that \frac{d\rho(y)}{dy^2} is non-decreasing \forall y\in(0,\infty), then a sufficient condition for \mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)}) is that the step size \eta satisfies 0<\eta\leq\|\text{diag}({\bm{q}}^{(k)})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\|_2^{-1}.
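To make the update concrete, here is a dense NumPy sketch of one first-order IRLS step in Eq. (6), using the upper bound on η from Theorem 1 as the step size (the function name, the dense formulation, and the spectral-norm computation are our own choices, not the paper's implementation):

```python
import numpy as np

def irls_step(F, F0, A, W, lam):
    """One first-order IRLS update (Eq. (6)) with the step size bound from Theorem 1.

    F, F0: current and input features, (n, d).  A: binary adjacency, (n, n).
    W:     current edge weights W^(k), (n, n).  lam: lambda in Eq. (3).
    """
    deg = A.sum(axis=1)
    A_tilde = A / np.sqrt(np.outer(deg, deg))                   # normalized adjacency
    q = (W * A).sum(axis=1) / deg                               # q_m = sum_j W_mj A_mj / d_m
    Q_hat = 2.0 * (np.diag(q) + lam * np.eye(len(q)))
    grad = (Q_hat - 2.0 * W * A_tilde) @ F - 2.0 * lam * F0     # gradient of H_hat at F^(k)
    M = np.diag(q) - W * A_tilde + lam * np.eye(len(q))
    eta = 1.0 / np.linalg.norm(M, 2)                            # step size from Theorem 1
    return F - eta * grad
```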

Quasi-Newton IRLS. Theorem 1 suggests the difficulty in properly selecting the stepsize for (first-order) IRLS due to its non-trivial dependency on the graph (\tilde{\bm{A}}) and the dynamic terms ({\bm{q}}^{(k)} and {\bm{W}}^{(k)}). (A related work (Yang et al., 2021) adopts the IRLS algorithm to optimize the problem in Eq. (1); a preconditioned version is proposed to handle the unnormalized graph Laplacian, but its step size needs to satisfy \eta\leq\|\Delta^\top{\bf\Gamma}^{(k)}\Delta+\lambda{\bm{I}}\|_2^{-1} as shown in Lemma 3.3 of (Yang et al., 2021), which is expensive to estimate.) The dilemma is that a small stepsize leads to slow convergence, while a large stepsize easily causes divergence and instability, as verified by our experiments in Section 4.3 (Figure 5), which reveals its critical shortcoming for the construction of GNN layers.

To overcome this limitation, we aim to propose a second-order Newton method, {\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-(\nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)}))^{-1}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)}), to achieve faster convergence and stepsize-free hyperparameter tuning by better capturing the geometry of the optimization landscape. However, obtaining the analytic expression for the inverse Hessian matrix (\nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)}))^{-1}\in\mathbb{R}^{n\times n} is intractable, and the numerical solution requires expensive computation for large graphs. Therefore, we propose a novel Quasi-Newton IRLS algorithm (QN-IRLS) that approximates the Hessian matrix \nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)})=2(\text{diag}({\bm{q}}^{(k)})-{\bm{W}}^{(k)}\odot\tilde{\bm{A}}+\lambda{\bm{I}}) by the diagonal matrix \hat{{\bm{Q}}}^{(k)}=2(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}}) such that the inverse is trivial. The proposed QN-IRLS works as follows:

{\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\big(\hat{{\bm{Q}}}^{(k)}\big)^{-1}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})=(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (7)

where (\hat{{\bm{Q}}}^{(k)})^{-1} automatically adjusts the per-coordinate stepsize according to the local geometry of the optimization landscape, and {\bm{q}}^{(k)} and {\bm{W}}^{(k)} are defined as in Eq. (5) and (6). In this way, QN-IRLS provides faster convergence without needing to select a stepsize. The convergence is guaranteed by Theorem 2, with the proof in Appendix B.

Theorem 2.

If {\bm{F}}^{(k+1)} follows the update rule in Eq. (7), where \rho satisfies that \frac{d\rho(y)}{dy^2} is non-decreasing \forall y\in(0,\infty), it is guaranteed that \mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)}).

3.3 GNN with Robust Unbiased Aggregation

Figure 4: \frac{d\rho(y)}{dy^2}.

The proposed QN-IRLS provides an efficient algorithm to optimize the RUGE in Eq. (3) with a theoretical convergence guarantee. Instantiated with \rho=\rho_\gamma, each iteration of QN-IRLS in Eq. (7) can be used as one layer in GNNs, which yields the Robust Unbiased Aggregation (RUNG):

{\bm{F}}^{(k+1)}=(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (8)

where {\bm{q}}^{(k)}_m=\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m as in Eq. (6), {\bm{W}}^{(k)}_{ij}=\bm{1}_{i\neq j}\max(0,\frac{1}{y_{ij}^{(k)}}-\frac{1}{\gamma}), and y^{(k)}_{ij}=\big\|\frac{{\bm{f}}_i^{(k)}}{\sqrt{d_i}}-\frac{{\bm{f}}_j^{(k)}}{\sqrt{d_j}}\big\|_2.
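For clarity, here is a dense NumPy sketch of one RUNG aggregation layer in Eq. (8). A practical implementation would use sparse message passing over the edge list; the dense formulation, the epsilon clamp (to avoid the undefined derivative at y_{ij}=0), and the toy graph below are our own choices:

```python
import numpy as np

def rung_layer(F, F0, A, lam=0.1, gamma=3.0, eps=1e-12):
    """One robust unbiased aggregation step (Eq. (8)), written densely for clarity.

    F:  current features F^(k), shape (n, d).   F0: input features F^(0), shape (n, d).
    A:  binary adjacency matrix, shape (n, n).  lam, gamma: lambda in Eq. (3), gamma in Eq. (4).
    """
    deg = A.sum(axis=1)
    A_tilde = A / np.sqrt(np.outer(deg, deg))                      # normalized adjacency
    Fn = F / np.sqrt(deg)[:, None]                                 # f_i / sqrt(d_i)
    Y = np.linalg.norm(Fn[:, None, :] - Fn[None, :, :], axis=-1)   # pairwise differences y_ij
    W = np.maximum(0.0, 1.0 / np.maximum(Y, eps) - 1.0 / gamma)    # MCP-induced weights
    np.fill_diagonal(W, 0.0)                                       # indicator 1_{i != j}
    q = (W * A).sum(axis=1) / deg                                  # q_m = sum_j W_mj A_mj / d_m
    return ((W * A_tilde) @ F + lam * F0) / (q + lam)[:, None]

# Toy usage: unroll 10 aggregation layers on a 4-node path graph.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
F0 = np.random.RandomState(0).randn(4, 2)
F = F0.copy()
for _ in range(10):
    F = rung_layer(F, F0, A, lam=0.1, gamma=3.0)
```

Note that W becomes exactly zero whenever y_{ij} exceeds γ, which is the edge-pruning behavior discussed in the interpretability paragraph below.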

Interpretability. The proposed RUNG can be interpreted intuitively as edge reweighting. In Eq. (8), the normalized adjacency matrix \tilde{{\bm{A}}} is reweighted by {\bm{W}}^{(k)}, where {\bm{W}}^{(k)}_{ij}=\frac{d\rho(y)}{dy^2}|_{y=y_{ij}^{(k)}}. It is shown in Figure 4 that {\bm{W}}_{ij} becomes zero for any edge e_l=(i,j) with a node difference y_{ij}^{(k)}\geq\gamma, thus pruning suspicious edges. This implies RUNG's strong robustness under large-budget adversarial attacks. With the inclusion of the skip connection {\bm{F}}^{(0)}, \text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}} can be seen as a normalizer of the layer output.

Relations with Existing GNNs. RUNG can adopt any \rho that Theorem 2 allows, thus covering many classic GNNs as special cases. When \rho(y)=y^2, RUNG in Eq. (8) exactly reduces to APPNP (Gasteiger et al., 2018) ({\bm{F}}^{(k+1)}=\frac{1}{1+\lambda}\tilde{\bm{A}}{\bm{F}}^{(k)}+\frac{\lambda}{1+\lambda}{\bm{F}}^{(0)}) and to GCN ({\bm{F}}^{(k+1)}=\tilde{\bm{A}}{\bm{F}}^{(k)}) when choosing \lambda=0. When \rho(y)=y, the objective of RUGE is equivalent to ElasticGNN with p=2, which is analogous to SoftMedian and TWIRLS due to their inherent connections as \ell_1-based graph smoothing.

Complexity analysis. RUNG is scalable, with time complexity O(k(m+n)d) and space complexity O(m+nd), where m is the number of edges, d is the number of features, n is the number of nodes, and k is the number of GNN layers. Therefore, the complexity of RUNG is comparable to that of a normal GCN (up to a constant factor), and it is feasible to implement. Detailed discussions of the computational efficiency can be found in Appendix C.

4 Experiment

In this section, we perform comprehensive experiments to validate the robustness of the proposed RUNG. Besides, ablation studies show its convergence and defense mechanism.

4.1 Experiment Setting

Datasets. We test RUNG on the node classification task on two widely used real-world citation networks, Cora ML and Citeseer (Sen et al., 2008), as well as a large-scale network, Ogbn-Arxiv (Hu et al., 2020). We adopt the data split of 10% training, 10% validation, and 80% testing, and report the classification accuracy of the attacked nodes following (Mujkanovic et al., 2022). Each experiment is averaged over 5 different random splits.

Baselines. To evaluate the performance of RUNG, we compare it with a set of representative baselines. Among them, MLP, GCN (Kipf & Welling, 2016), APPNP (Gasteiger et al., 2018), and GAT (Velickovic et al., 2017) are undefended vanilla models. GNNGuard (Zhang & Zitnik, 2020), RGCN (Zhu et al., 2019), GRAND (Feng et al., 2020), ProGNN (Jin et al., 2020), Jaccard-GCN (Wu et al., 2019), SVD-GCN (Entezari et al., 2020), EvenNet (Lei et al., 2022), HANG (Zhao et al., 2024), NoisyGNN (Ennadir et al., 2024), and GARNET (Deng et al., 2022) are representative robust GNNs. Besides, SoftMedian and TWIRLS are representative methods with \ell_1-based graph smoothing. (We do not include ElasticGNN because it is still unclear how to attack it adaptively due to its special incidence matrix formulation (Liu et al., 2021); in the preliminary study (Section 2.1), we evaluate the robustness of ElasticGNN following the unit test setting proposed in (Mujkanovic et al., 2022).) We also evaluate a variant of TWIRLS with thresholding attention (TWIRLS-T). For RUNG, we test two variants: the default RUNG (Eq. (8)) and RUNG-\ell_1 with the \ell_1 penalty (\rho(y)=y).

Hyperparameters. The model hyperparameters, including learning rate, weight decay, and dropout rate, are tuned as in (Mujkanovic et al., 2022). Other hyperparameters follow the settings in the original papers. RUNG uses an MLP connected to 10 graph aggregation layers following the decoupled GNN architecture of APPNP. \hat{\lambda}=\frac{1}{1+\lambda} is tuned in \{0.7, 0.8, 0.9\}, and \gamma is tuned in \{0.5, 1, 2, 3, 5\}. We chose the hyperparameter setting that yields the best robustness without a notable impact (smaller than 1%) on the clean accuracy, following (Bojchevski & Günnemann, 2019).

Attack setting. We use the PGD attack (Xu et al., 2019) to execute the adaptive evasion and poisoning topology attack since it delivers the strongest attack in most settings (Mujkanovic et al., 2022). The adaptive attack setting is provided in Appendix F. The adversarial attacks aim to misclassify specific target nodes (local attack) or the entire set of test nodes (global attack). To avoid a false sense of robustness, our adaptive attacks directly target the victim model instead of the surrogate model. Additionally, we include the transfer attacks with a 2-layer GCN as the surrogate model. We also include graph injection attack following the setting in TDGIA (Zou et al., 2021).

4.2 Adversarial Robustness

Table 1: Adaptive local attack on Cora ML. The best and second are marked.
Model 0% (Clean) 20% 50% 100% 150% 200%
MLP 72.6±6.4 72.6±6.4 72.6±6.4 \mathbf{72.6±6.4} \mathbf{72.6±6.4} \mathbf{72.6±6.4}
GCN 82.7±4.9 40.7±10.2 12.0±6.2 2.7±2.5 0.0±0.0 0.0±0.0
APPNP \mathbf{84.7±6.8} 50.0±13.0 27.3±6.5 14.0±5.3 3.3±3.0 0.7±1.3
GAT 80.7±10.0 30.7±16.1 16.0±12.2 11.3±4.5 1.3±1.6 2.0±1.6
GNNGuard 82.7±6.7 44.0±11.6 30.7±11.6 14.0±6.8 5.3±3.4 2.0±2.7
RGCN 84.6±4.0 46.0±9.3 18.0±8.1 6.0±3.9 0.0±0.0 0.0±0.0
GRAND 84.0±6.8 47.3±9.0 18.7±9.1 7.3±4.9 1.3±1.6 0.0±0.0
ProGNN \mathbf{84.7±6.2} 47.3±10.4 21.3±7.8 4.0±2.5 0.0±0.0 0.0±0.0
Jaccard-GCN 81.3±5.0 46.0±6.8 17.3±4.9 4.7±3.4 0.7±1.3 0.7±1.3
GARNET 82.4±6.8 70.9±7.5 61.9±7.9 42.7±9.3 11.6±3.4 9.6±3.5
HANG 83.1±7.4 71.2±7.8 60.1±6.3 39.5±3.4 9.4±2.3 5.6±2.3
NoisyGNN 82.5±5.6 57.4±5.7 47.8±6.2 36.1±4.5 5.8±3.4 4.1±1.2
EvenNet 83.4±8.1 64.8±6.9 56.1±5.6 29.5±5.3 3.1±1.1 1.1±1.3
GraphCON 81.3±7.1 67.3±7.3 54.7±7.1 41.3±6.2 4.1±2.1 2.3±1.2
SoftMedian 80.0±10.2 \underline{72.7±13.7} 62.7±12.7 46.7±11.0 8.0±4.5 8.7±3.4
TWIRLS 83.3±7.3 71.3±8.6 60.7±11.0 36.0±8.8 20.7±10.4 12.0±6.9
TWIRLS-T 82.0±4.5 70.7±4.4 62.7±7.4 54.7±6.2 44.0±11.2 40.7±11.8
RUNG-\ell_1 (Ours) 84.0±6.8 \underline{72.7±7.1} 62.7±11.2 53.3±8.2 22.0±9.3 14.0±7.4
RUNG (Ours) 84.0±5.3 \mathbf{75.3±6.9} \mathbf{72.7±8.5} 70.7±10.6 69.3±9.8 69.3±9.0
Table 2: Adaptive global attack on Cora ML. The best and second are marked.
Model 0% (Clean) 5% 10% 20% 30% 40%
MLP 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0 \underline{65.0±1.0} 65.0±1.0
GCN 85.0±0.4 75.3±0.5 69.6±0.5 60.9±0.7 54.2±0.6 48.4±0.5
APPNP \mathbf{86.3±0.4} 75.8±0.5 69.7±0.7 60.3±0.9 53.8±1.2 49.0±1.6
GAT 83.5±0.5 75.8±0.8 71.2±1.2 65.0±0.9 60.5±0.9 56.7±0.9
GNNGuard 83.1±0.7 74.6±0.7 70.2±1.0 63.1±1.1 57.5±1.6 51.0±1.2
RGCN 85.7±0.4 75.0±0.8 69.1±0.4 59.8±0.7 52.8±0.7 46.1±0.7
GRAND 86.1±0.7 76.2±0.8 70.7±0.7 61.6±0.7 56.7±0.8 51.9±0.9
ProGNN 85.6±0.5 76.5±0.7 71.0±0.5 63.0±0.7 56.8±0.7 51.3±0.6
Jaccard-GCN 83.7±0.7 73.9±0.5 68.3±0.7 60.0±1.1 54.0±1.7 49.1±2.4
GARNET 84.0±0.5 76.5±0.4 72.1±0.1 66.4±0.7 62.1±1.3 58.7±1.5
HANG 84.5±0.2 75.7±0.6 74.2±0.3 69.5±0.4 64.7±0.9 58.4±0.1
NoisyGNN 83.9±0.5 76.7±0.1 72.1±0.3 64.7±0.7 58.0±0.2 53.3±0.6
EvenNet 84.8±0.9 75.8±0.9 70.7±0.6 63.9±0.5 58.7±0.7 54.7±0.8
GraphCON 83.7±0.6 75.8±0.3 70.9±0.8 65.9±0.7 62.2±0.9 56.6±0.3
SoftMedian 85.0±0.7 78.6±0.3 \underline{75.5±0.9} 69.5±0.5 62.8±0.8 58.1±0.7
TWIRLS 84.2±0.6 77.3±0.8 72.9±0.3 66.9±0.2 62.4±0.6 58.7±1.1
TWIRLS-T 82.8±0.5 76.8±0.6 73.2±0.4 67.7±0.4 63.8±0.2 60.8±0.3
RUNG-\ell_1 (Ours) 85.8±0.5 \underline{78.4±0.4} 74.3±0.3 68.1±0.6 63.5±0.7 59.8±0.8
RUNG (Ours) 84.6±0.5 \mathbf{78.9±0.4} \mathbf{75.7±0.2} \mathbf{71.8±0.4} \mathbf{67.8±1.3} \mathbf{65.1±1.2}

Here we evaluate the performance of RUNG against the baselines under different settings. The results of local and global adaptive attacks on Cora ML are presented in Table 1 and Table 2, while those on Citeseer are presented in Table 3 and Table 4 in Appendix E due to space limits. We summarize the following analysis from Cora ML, noting that the same observations apply to Citeseer.

  • Under adaptive attacks, many existing defenses are not significantly more robust than undefended models. The \ell_1-based models such as TWIRLS, SoftMedian, and RUNG-\ell_1 demonstrate considerable and closely aligned robustness under both local and global attacks, which supports our unified view of \ell_1-based robust estimation in Section 2.2.

  • RUNG exhibits significant improvements over all baselines across various budgets under both global and local attacks. Local attacks are stronger than global attacks since local attacks concentrate on targeted nodes. The robustness improvement of RUNG appears to be more remarkable in local attacks.

  • When there is no attack, RUNG largely preserves an excellent clean performance. RUNG also achieves state-of-the-art performance under small attack budgets.

4.3 Ablation Study

Figure 5: Convergence of our QN-IRLS compared to IRLS.
Figure 6: Bias induced by different attack budgets.
Figure 7: Distribution of feature difference on attacked edges.

Convergence. To verify the advantage of our QN-IRLS method in Eq. (7) over the first-order IRLS in Eq. (6), we show the objective \mathcal{H} at each layer in Figure 5. It can be observed that our stepsize-free QN-IRLS demonstrates the best convergence, as discussed in Section 3.

Estimation bias. The bias effect in \ell_1-based GNNs and the unbiasedness of our RUNG can be empirically verified. We quantify the bias with \sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{\star}\|_2^2, where {\bm{f}}_i^{\star} and {\bm{f}}_i denote the aggregated feature on the clean graph and the attacked graph, respectively. As shown in Figure 6, when the budget scales up, \ell_1 GNNs exhibit a notable bias, whereas RUNG has nearly zero bias. We provide a comprehensive discussion of the unbiasedness of RUNG in Appendix D.
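For reference, a minimal sketch of this bias metric (the function name is our own):

```python
import numpy as np

def estimation_bias(F_attacked: np.ndarray, F_clean: np.ndarray) -> float:
    """Bias metric sum_i ||f_i - f_i*||_2^2 between features aggregated on the
    attacked graph (F_attacked) and on the clean graph (F_clean), both of shape (n, d)."""
    return float(np.sum(np.linalg.norm(F_attacked - F_clean, axis=1) ** 2))
```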

Defense Mechanism. To further investigate how our defense takes effect, we analyze the edges added under adaptive attacks. The distribution of the node feature differences \big\|{\bm{f}}_i/\sqrt{d_i}-{\bm{f}}_j/\sqrt{d_j}\big\|_2 of attacked edges is shown in Figure 7 for different graph signal estimators. It can be observed that our RUNG forces the attacker to focus on the edges with a small feature difference, indicating that RUNG can improve robustness by down-weighting or pruning malicious edges that connect distinct nodes. Therefore, the attacks become less influential, which explains why RUNG demonstrates outstanding robustness.

Transfer Attacks. In addition to the adaptive attack, we also conduct a set of transfer attacks that take every baseline GNN as the surrogate model to comprehensively test the robustness of RUNG, following the unit test attack protocol proposed in (Mujkanovic et al., 2022). We summarize the results on Cora ML and Citeseer in Figure 9 and Figure 10 in Appendix E due to space limits. All transfer attacks are weaker than the adaptive attack in Section 4.2, indicating the necessity of evaluating against the strongest adaptive attack to avoid the false sense of security emphasized in this paper. Note that the attack transferred from the RUNG model is slightly weaker than the adaptive attack since the surrogate and victim RUNG models have different model parameters in the transfer attack setting.

Hyperparameters. Due to the space limit, we provide additional ablation studies on the hyperparameters of RUNG (including \gamma and \lambda in MCP as well as the number of layers) in Appendix G. The results offer guidance for the choice of optimal hyperparameters. We observe that \gamma in MCP has a significant impact on the performance of RUNG. Specifically, a larger \gamma makes RUNG closer to an \ell_1-based model, while a smaller \gamma encourages more edges to be pruned. This pruning helps RUNG remove more malicious edges and improve robustness, although a small \gamma may slightly reduce clean performance.

Robustness under various scenarios. Besides the evaluation under strong adaptive attacks, we also validate the consistent effectiveness of our proposed RUNG under various scenarios, including transfer attacks (Appendix E.2), poisoning attacks (Appendix E.3), large-scale datasets (Appendix E.4), adversarial training (Appendix E.5), and graph injection attacks (Appendix E.6).

5 Related Work

To the best of our knowledge, although there are works unifying existing GNNs from a graph signal denoising perspective (Ma et al., 2021), no work has been dedicated to uniformly understanding the robustness and limitations of robust GNNs such as SoftMedian (Geisler et al., 2021a), SoftMedoid (Geisler et al., 2020), TWIRLS (Yang et al., 2021), ElasticGNN (Liu et al., 2021), and TVGNN (Hansen & Bianchi, 2023) from the perspectives of \ell_1 robust statistics and bias analysis. To mitigate the estimation bias, the MCP penalty is promising since it is well known for its near-unbiasedness property (Zhang, 2010a) and has been applied to the graph trend filtering problem (Varma et al., 2019) to promote piecewise signal modeling, but its robustness remains unexplored. Nevertheless, other robust GNNs have utilized alternative penalties that might alleviate the bias effect. For example, GNNGuard (Zhang & Zitnik, 2020) prunes the edges whose cosine similarity is too small. Another example is that TWIRLS (Yang et al., 2021) with a thresholding penalty can also exclude edges using graph attention. However, the design of their edge weighting or graph attention is heuristic-based and exhibits suboptimal performance compared to the RUNG proposed in this work.

6 Conclusion & Limitation

In this work, we propose a unified view of \ell_1 robust graph smoothing to uniformly understand the robustness and limitations of representative robust GNNs. The established view not only justifies their improved and closely aligned robustness but also explains their severe performance degradation under large attack budgets by a novel estimation bias analysis. To mitigate the estimation bias, we propose a robust and unbiased graph signal estimator. To solve this non-trivial estimation problem, we design a novel and efficient Quasi-Newton IRLS algorithm that can better capture the landscape of the optimization problem and converge stably with a theoretical guarantee. This algorithm can be unfolded and used as a building block for constructing robust GNNs with Robust Unbiased Aggregation (RUNG). As verified by our experiments, RUNG provides the best performance under strong adaptive attacks among all the baselines. Furthermore, RUNG also covers many classic GNNs as special cases. Most importantly, this work provides a deeper understanding of existing approaches and reveals a principled direction for designing robust GNNs.

Regarding the limitations, first, the improvement of RUNG is more significant under large budgets compared to the robust baselines. Second, we primarily include experiments on homophilic graphs in the main paper, but we can generalize the proposed robust aggregation to heterophilic graphs in future work. Third, although our Quasi-Newton IRLS algorithm has exhibited excellent convergence compared to the vanilla IRLS, the efficiency of RUNG could be further improved.

Acknowledgment

Zhichao Hou and Dr. Xiaorui Liu are supported by the National Science Foundation (NSF) National AI Research Resource Pilot Award, Amazon Research Award, NCSU Data Science Academy Seed Grant Award, and NCSU Faculty Research and Professional Development Award.

References

  • Beck & Sabach (2015) Amir Beck and Shoham Sabach. Weiszfeld’s method: Old and new results. Journal of Optimization Theory and Applications, 164(1):1–40, 2015. doi: 10.1007/s10957-014-0586-7. URL https://doi.org/10.1007/s10957-014-0586-7.
  • Bojchevski & Günnemann (2019) Aleksandar Bojchevski and Stephan Günnemann. Certifiable robustness to graph perturbations. In Neural Information Processing Systems, 2019. URL https://api.semanticscholar.org/CorpusID:202782978.
  • Candès et al. (2008) Emmanuel J. Candès, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted l1l_{1} minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, Dec 2008. ISSN 1069-5869. Funding by NSF.
  • Chen et al. (2020) Jinyin Chen, Xiang Lin, Hui Xiong, Yangyang Wu, Haibin Zheng, and Qi Xuan. Smoothing adversarial training for gnn. IEEE Transactions on Computational Social Systems, 8(3):618–629, 2020.
  • Deng et al. (2022) Chenhui Deng, Xiuyu Li, Zhuo Feng, and Zhiru Zhang. Garnet: Reduced-rank topology learning for robust and scalable graph neural networks. In Learning on Graphs Conference, pp.  3–1. PMLR, 2022.
  • Deng et al. (2023) Zhijie Deng, Yinpeng Dong, and Jun Zhu. Batch virtual adversarial training for graph convolutional networks. AI Open, 2023.
  • Donoho (1995) David L Donoho. De-noising by soft-thresholding. IEEE transactions on information theory, 41(3):613–627, 1995.
  • Ennadir et al. (2024) Sofiane Ennadir, Yassine Abbahaddou, Johannes F Lutzeyer, Michalis Vazirgiannis, and Henrik Boström. A simple and yet fairly effective defense for graph neural networks. In 38th AAAI Conference on Artificial Intelligence (AAAI), 2024.
  • Entezari et al. (2020) Negin Entezari, Saba A Al-Sayouri, Amirali Darvishzadeh, and Evangelos E Papalexakis. All you need is low (rank) defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp.  169–177, 2020.
  • Fan & Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. ISSN 01621459. URL http://www.jstor.org/stable/3085904.
  • Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems, 33:22092–22103, 2020.
  • Gasteiger et al. (2018) Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • Geisler et al. (2020) Simon Geisler, Daniel Zügner, and Stephan Günnemann. Reliable graph neural networks via robust aggregation. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Geisler et al. (2021a) Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. Robustness of graph neural networks at scale. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  7637–7649. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/3ea2db50e62ceefceaf70a9d9a56a6f4-Paper.pdf.
  • Geisler et al. (2021b) Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. Robustness of graph neural networks at scale. Advances in Neural Information Processing Systems, 34:7637–7649, 2021b.
  • Hansen & Bianchi (2023) Jonas Berg Hansen and Filippo Maria Bianchi. Total variation graph neural networks. In Proceedings of the 40th international conference on Machine learning. ACM, 2023.
  • Holland & Welsch (1977) Paul W Holland and Roy E Welsch. Robust regression using iteratively reweighted least-squares. Communications in Statistics-theory and Methods, 6(9):813–827, 1977.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huber (2004) Peter J Huber. Robust statistics, volume 523. John Wiley & Sons, 2004.
  • Jin et al. (2020) Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  66–74, 2020.
  • Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Lei et al. (2022) Runlin Lei, Zhen Wang, Yaliang Li, Bolin Ding, and Zhewei Wei. Evennet: Ignoring odd-hop neighbors improves robustness of graph neural networks. Advances in Neural Information Processing Systems, 35:4694–4706, 2022.
  • Liu et al. (2021) Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  6837–6849. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/liu21k.html.
  • Ma & Tang (2021) Yao Ma and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021.
  • Ma et al. (2021) Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, pp.  1202–1211, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384469. doi: 10.1145/3459637.3482225. URL https://doi.org/10.1145/3459637.3482225.
  • Mujkanovic et al. (2022) Felix Mujkanovic, Simon Geisler, Stephan Günnemann, and Aleksandar Bojchevski. Are defenses for graph neural networks robust? In Neural Information Processing Systems, NeurIPS, 2022.
  • Rusch et al. (2022) T Konstantin Rusch, Ben Chamberlain, James Rowbottom, Siddhartha Mishra, and Michael Bronstein. Graph-coupled oscillator networks. In International Conference on Machine Learning, pp.  18888–18909. PMLR, 2022.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Tibshirani (1996a) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996a. ISSN 00359246. URL http://www.jstor.org/stable/2346178.
  • Tibshirani (1996b) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996b.
  • Varma et al. (2019) Rohan Varma, Harlin Lee, Jelena Kovačević, and Yuejie Chi. Vector-valued graph trend filtering with non-convex penalties. IEEE transactions on signal and information processing over networks, 6:48–62, 2019.
  • Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. Graph attention networks. stat, 1050(20):10–48550, 2017.
  • Wu et al. (2019) Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. Adversarial examples on graph data: Deep insights into attack and defense. arXiv preprint arXiv:1903.01610, 2019.
  • Xu et al. (2019) Kaidi Xu, Hongge Chen, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Mingyi Hong, and Xue Lin. Topology attack and defense for graph neural networks: an optimization perspective. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp.  3961–3967, 2019.
  • Yang et al. (2021) Yongyi Yang, Tang Liu, Yangkun Wang, Jinjing Zhou, Quan Gan, Zhewei Wei, Zheng Zhang, Zengfeng Huang, and David Wipf. Graph neural networks inspired by classical iterative algorithms. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  11773–11783. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21g.html.
  • Zhang (2010a) Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894 – 942, 2010a. doi: 10.1214/09-AOS729. URL https://doi.org/10.1214/09-AOS729.
  • Zhang (2010b) Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010b. ISSN 00905364, 21688966. URL http://www.jstor.org/stable/25662264.
  • Zhang & Zitnik (2020) Xiang Zhang and Marinka Zitnik. Gnnguard: Defending graph neural networks against adversarial attacks. Advances in neural information processing systems, 33:9263–9275, 2020.
  • Zhao et al. (2024) Kai Zhao, Qiyu Kang, Yang Song, Rui She, Sijie Wang, and Wee Peng Tay. Adversarial robustness in graph neural networks: A hamiltonian approach. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhu et al. (2019) Dingyuan Zhu, Ziwei Zhang, Peng Cui, and Wenwu Zhu. Robust graph convolutional networks against adversarial attacks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  1399–1407, 2019.
  • Zou (2006) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. doi: 10.1198/016214506000000735. URL https://doi.org/10.1198/016214506000000735.
  • Zou et al. (2021) Xu Zou, Qinkai Zheng, Yuxiao Dong, Xinyu Guan, Evgeny Kharlamov, Jialiang Lu, and Jie Tang. Tdgia: Effective injection attacks on graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.  2461–2471, 2021.

Appendix A Bias Accumulation of L1 Models

A.1 Details of the Numerical Simulation Settings

To provide a more intuitive illustration of the estimation biases of different models, we simulate a mean estimation problem on synthetic data since most message passing schemes in GNNs essentially estimate the mean of neighboring node features. In the context of mean estimation, the bias is measured as the distance between different mean estimators and the true mean. We first generate clean samples \{{\bm{x}}_i\}_{i=1}^n (blue dots) and outlier samples \{{\bm{x}}_i\}_{i=n+1}^{n+m} (red dots) from 2-dimensional Gaussian distributions, {\mathcal{N}}((0,0),1\cdot I) and {\mathcal{N}}((8,8),0.5\cdot I), respectively. We calculate the mean of the clean samples \frac{1}{n}\sum_{i=1}^n{\bm{x}}_i as the ground truth of the mean estimator. Then we estimate the mean of all the samples by solving \operatorname*{arg\,min}_{\bm{z}}\sum_{i=1}^{n+m}\eta({\bm{z}}-{\bm{x}}_i) using the Weiszfeld method (Candès et al., 2008; Beck & Sabach, 2015), where \eta(\cdot) can take different norms such as the \ell_2 norm \|\cdot\|_2^2 and the \ell_1 norm \|\cdot\|_2.

The mean estimators are formulated as minimization operators

\bar{\bm{z}}=\operatorname*{arg\,min}_{\bm{z}}\sum_{i=1}^{n+m}\eta({\bm{z}}-{\bm{x}}_{i}), \qquad (9)

where $n$ is the number of clean samples and $m$ is the number of adversarial samples.

$\ell_1$ estimator. The $\ell_1$ estimator ($\eta({\bm{y}})\coloneqq\|{\bm{y}}\|_2$) is essentially the geometric median. We adopt the Weiszfeld method to iteratively reweight ${\bm{z}}$ to minimize the objective, following

{\bm{z}}^{(k+1)}=\frac{\sum_{i}w_{i}^{(k)}{\bm{x}}_{i}}{\sum_{i}w_{i}^{(k)}}, \qquad (10)

where $w_i^{(k)}=\frac{1}{\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2}$. This can be seen as a gradient descent step ${\bm{z}}^{(k+1)}={\bm{z}}^{(k)}-\alpha\nabla_{\bm{z}}\sum_{i}\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2={\bm{z}}^{(k)}-\alpha\sum_{i}\frac{{\bm{z}}^{(k)}-{\bm{x}}_i}{\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2}$. Taking $\alpha=\frac{1}{\sum_{i}w_i^{(k)}}$ immediately yields Eq. (10).
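For illustration, a minimal NumPy sketch of this Weiszfeld reweighting is given below. It is our own approximate reproduction of the simulation setup; the sample sizes, iteration count, and the small constant eps guarding against division by zero are illustrative assumptions, not the exact script behind Figure 2.

import numpy as np

def weiszfeld_median(X, num_iters=100, eps=1e-8):
    """Geometric median (l1 estimator) of the rows of X via Weiszfeld reweighting, Eq. (10)."""
    z = X.mean(axis=0)                              # initialize at the l2 mean
    for _ in range(num_iters):
        d = np.linalg.norm(X - z, axis=1)           # ||z^(k) - x_i||_2
        w = 1.0 / np.maximum(d, eps)                # w_i^(k) = 1 / ||z^(k) - x_i||_2
        z = (w[:, None] * X).sum(axis=0) / w.sum()  # reweighted average
    return z

# Illustrative data following the setup of Section A.1 (sizes are assumptions).
rng = np.random.default_rng(0)
clean = rng.normal((0.0, 0.0), 1.0, size=(100, 2))
outliers = rng.normal((8.0, 8.0), 0.5, size=(30, 2))
z_l1 = weiszfeld_median(np.vstack([clean, outliers]))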

MCP-based estimator. We therefore adopt a similar approach for the MCP-based estimator (“Ours” in Figure 2), where $\eta({\bm{y}})\coloneqq\rho_\gamma({\bm{y}})$:

{\bm{z}}^{(k+1)} = {\bm{z}}^{(k)}-\alpha\nabla_{\bm{z}}\sum_{i}\rho_{\gamma}(\|{\bm{z}}-{\bm{x}}_{i}\|_{2}) \qquad (11)
= {\bm{z}}^{(k)}-\alpha\sum_{i}\max\left(0,\frac{1}{\|{\bm{z}}^{(k)}-{\bm{x}}_{i}\|_{2}}-\frac{1}{\gamma}\right)({\bm{z}}^{(k)}-{\bm{x}}_{i}). \qquad (12)

Denoting $\max(0,\|{\bm{z}}^{(k)}-{\bm{x}}_i\|_2^{-1}-\frac{1}{\gamma})$ as $w_i^{(k)}$, taking $\alpha=\frac{1}{\sum_{i}w_i^{(k)}}$ yields the same form of reweighting iteration ${\bm{z}}^{(k+1)}=\frac{\sum_{i}w_i^{(k)}{\bm{x}}_i}{\sum_{i}w_i^{(k)}}$.
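A corresponding sketch of the MCP-based reweighting (same assumptions as the $\ell_1$ sketch above; $\gamma$ and the iteration count are illustrative):

import numpy as np

def mcp_mean_estimator(X, gamma=3.0, num_iters=100, eps=1e-8):
    """Robust mean estimate with the MCP penalty: w_i = max(0, 1/||z - x_i||_2 - 1/gamma), Eq. (12)."""
    z = X.mean(axis=0)
    for _ in range(num_iters):
        d = np.linalg.norm(X - z, axis=1)
        w = np.maximum(0.0, 1.0 / np.maximum(d, eps) - 1.0 / gamma)  # samples farther than gamma get zero weight
        if w.sum() <= eps:   # all samples pruned; keep the current estimate
            break
        z = (w[:, None] * X).sum(axis=0) / w.sum()
    return z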

$\ell_2$ estimator. The same technique applies to the $\ell_2$ estimator with $\eta({\bm{z}})\coloneqq\|{\bm{z}}\|_2^2$. The iteration becomes

{\bm{z}}^{(k+1)} = {\bm{z}}^{(k)}-\alpha\nabla_{\bm{z}}\sum_{i}\|{\bm{z}}-{\bm{x}}_{i}\|_{2}^{2} \qquad (13)
= {\bm{z}}^{(k)}-\alpha\sum_{i}({\bm{z}}^{(k)}-{\bm{x}}_{i}), \qquad (14)

and $\alpha=\frac{1}{n+m}$ yields ${\bm{z}}^{(k+1)}=\frac{1}{n+m}\sum_{i}{\bm{x}}_i$, i.e., the mean of all samples is recovered in a single iteration.

This mean estimation scenario parallels our QN-IRLS in graph smoothing: both iteratively reweight to minimize similar objectives, and the approximate Hessian in our QN-IRLS plays the same role as the step size choice in the Weiszfeld method, which cancels the ${\bm{z}}^{(k)}$ term.

In Figure 2, we visualize the generated clean samples and outliers, as well as the ground truth means and the mean estimators with $\eta(\cdot)=\|\cdot\|_2^2$ or $\|\cdot\|_2$ under different outlier ratios (15%, 30%, 45%). The results show that the $\ell_2$-based estimator deviates far from the true mean, while the $\ell_1$-based estimator is more resistant to outliers, which explains why $\ell_1$-based methods exhibit stronger robustness. However, as the ratio of outliers escalates, the $\ell_1$-based estimator drifts further from the true mean due to the bias accumulated from the outliers. This observation explains why $\ell_1$-based graph smoothing models suffer from catastrophic degradation under large attack budgets. Our estimator stays much closer to the ground truth than the other estimators in the presence of outliers.

A.2 Additional Simulation Results and Discussions

Here, we complement Figure 2 with settings of higher attack budgets. As the outlier ratio exceeds the $50\%$ breakdown point, we observe that our MCP-based mean estimator still recovers the majority of the samples, i.e., it converges to the center of the “outliers”, which now constitute the majority.

Figure 8: The trajectory of our MCP-based mean estimator.

Appendix B Convergence Analysis

To begin with, we provide an overview of our proof, followed by a detailed presentation of the formal convergence analysis.

Overview of proof. First, for both IRLS and QN-IRLS, we construct, for the ${\bm{F}}^{(k)}$ in every iteration $k$, a quadratic upper bound $\hat{\mathcal{H}}$ that satisfies $\hat{\mathcal{H}}+C\geq\mathcal{H}$, where equality is reached at ${\bm{F}}^{(k)}$. We can then minimize $\hat{\mathcal{H}}$ to guarantee the iterative descent of $\mathcal{H}$, since $\mathcal{H}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k+1)})+C\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})+C=\mathcal{H}({\bm{F}}^{(k)})$.

To find an ${\bm{F}}^{(k+1)}$ such that $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$, IRLS simply adopts plain gradient descent ${\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})$, whose convergence condition can be analyzed via the $\beta$-smoothness of the quadratic $\hat{\mathcal{H}}$ (Theorem 1). To address the problems of IRLS motivated in Section 3.2, our quasi-Newton IRLS utilizes the diagonal approximate Hessian $\hat{{\bm{Q}}}$ to scale the update step size in each dimension as ${\bm{F}}^{(k+1)}={\bm{F}}^{(k)}-\hat{{\bm{Q}}}^{-1}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})$. Thereafter, by bounding the Hessian with $2\hat{{\bm{Q}}}$, the descent condition of $\hat{\mathcal{H}}$ is simplified (Theorem 2).

Lemma 1.

For any $\rho(y)$ such that $\frac{d\rho(y)}{dy^2}$ is non-increasing, denote $y_{ij}:=\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2$. Then $\mathcal{H}({\bm{F}})=\sum_{(i,j)\in\mathcal{E},i\neq j}\rho(y_{ij})+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2$ has the following upper bound:

\mathcal{H}({\bm{F}})\leq\hat{\mathcal{H}}({\bm{F}})+C=\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}^{(k)}y_{ij}^{2}+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_{i}-{\bm{f}}_{i}^{(0)}\|_{2}^{2}+C, \qquad (15)

where ${\bm{W}}_{ij}^{(k)}=\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{ij}^{(k)}}$, $y_{ij}^{(k)}=\left\|\frac{{\bm{f}}_i^{(k)}}{\sqrt{d_i}}-\frac{{\bm{f}}_j^{(k)}}{\sqrt{d_j}}\right\|_2$, and $C=\mathcal{H}({\bm{F}}^{(k)})-\hat{\mathcal{H}}({\bm{F}}^{(k)})$ is a constant. The equality in Eq. (15) is achieved when ${\bm{F}}={\bm{F}}^{(k)}$.

Proof.

Let $v=y^2$ and define $\psi(v):=\rho(y)=\rho(\sqrt{v})$. Then $\psi$ is concave since $\frac{d\psi(v)}{dv}=\frac{d\rho(y)}{dy^2}$ is non-increasing. By concavity, we have $\psi(v)\leq\psi(v_0)+\psi^{\prime}(\nu)\big|_{\nu=v_0}(v-v_0)$. Substituting $v=y^2$ and $v_0=y_0^2$, we obtain:

\rho(y) \leq y^{2}\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{0}}-y_{0}^{2}\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{0}}+\rho(y_{0}) \qquad (16)
= y^{2}\frac{\partial\rho(y)}{\partial y^{2}}\big|_{y=y_{0}}+C(y_{0}), \qquad (17)

where the equality is attained when $y=y_0$. Next, substituting $y=y_{ij}$ and $y_0=y_{ij}^{(k)}$, we get $\rho(y_{ij})\leq{\bm{W}}_{ij}^{(k)}y_{ij}^{2}+C(y_{ij}^{(k)})$, with equality at $y_{ij}=y_{ij}^{(k)}$. Finally, summing over all edges and adding the regularization term proves Eq. (15). ∎

Remark 1.

It can be seen that the definition of $\hat{\mathcal{H}}$ depends on ${\bm{F}}^{(k)}$, which ensures that the bound is tight when ${\bm{F}}={\bm{F}}^{(k)}$. This tight-bound condition is essential in the majorization-minimization algorithm, as seen in Theorem 1.

Lemma 2.

For $\hat{\mathcal{H}}=\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_{i}-{\bm{f}}_{i}^{(0)}\|_{2}^{2}$, the gradient and Hessian w.r.t. ${\bm{F}}$ (see footnote 4 below) satisfy

\nabla_{{\bm{F}}_{mn}}\hat{\mathcal{H}}({\bm{F}})=2\left((\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}){\bm{F}}-\lambda{\bm{F}}^{(0)}\right)_{mn}, \qquad (18)

and

\nabla_{{\bm{F}}_{mn}}\nabla_{{\bm{F}}_{kl}}\hat{\mathcal{H}}({\bm{F}})=2\left(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\right)_{mk}, \qquad (19)

where ${\bm{q}}_m=\sum_j{\bm{W}}_{mj}{\bm{A}}_{mj}/d_m$ and $\tilde{{\bm{A}}}_{ij}=\frac{{\bm{A}}_{ij}}{\sqrt{d_i d_j}}$ is the symmetrically normalized adjacency matrix.

Footnote 4: A remark on the tensor “Hessian” $\nabla^2\hat{\mathcal{H}}$: since $\hat{\mathcal{H}}({\bm{F}})$ depends on a matrix, there is some subtlety in defining the Hessian. However, as can be observed in Eq. (27) and Eq. (32), the feature dimension can be accounted for as follows. We first treat the feature dimension as an irrelevant dimension excluded from the matrix operations, e.g., ${\bm{F}}\nabla^2\hat{\mathcal{H}}{\bm{F}}=\sum_{ik}{\bm{F}}_{ij}\nabla^2\hat{\mathcal{H}}_{ijkl}{\bm{F}}_{kl}$, where the feature dimensions $j$ and $l$ remain free indices while the node indices $i$ and $k$ are contracted as dummy indices. Finally, we take the trace of the resulting #feature $\times$ #feature matrix to obtain the desired value.

Proof.

Using ${\bm{A}}={\bm{A}}^{\top}$ and the definition $y_{ij}^2\coloneqq\left\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\right\|_2^2$, the first-order gradient of $\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^2$ is

\nabla_{{\bm{F}}_{mn}}\left(\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}\right) \qquad (20)
= \sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\frac{\partial y^{2}_{ij}}{\partial{\bm{F}}_{mn}} \qquad (21)
= \sum_{(m,j)\in\mathcal{E}}{\bm{W}}_{mj}\frac{\partial y^{2}_{mj}}{\partial{\bm{F}}_{mn}} \qquad (22)
= \sum_{(m,j)\in\mathcal{E}}{\bm{W}}_{mj}\frac{\partial\left(\frac{{\bm{F}}_{mn}}{\sqrt{d_{m}}}-\frac{{\bm{F}}_{jn}}{\sqrt{d_{j}}}\right)^{2}}{\partial{\bm{F}}_{mn}} \qquad (23)
= \sum_{j\in\mathcal{N}(m)}2{\bm{W}}_{mj}\left(\frac{{\bm{F}}_{mn}}{d_{m}}-\frac{{\bm{F}}_{jn}}{\sqrt{d_{m}d_{j}}}\right) \qquad (24)
= 2\sum_{j}{\bm{W}}_{mj}\left(\frac{{\bm{A}}_{mj}}{d_{m}}{\bm{F}}_{mn}-\frac{{\bm{A}}_{mj}}{\sqrt{d_{m}d_{j}}}{\bm{F}}_{jn}\right) \qquad (25)
= 2\left(\frac{\sum_{j}{\bm{W}}_{mj}{\bm{A}}_{mj}}{d_{m}}\right){\bm{F}}_{mn}-2\left(({\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right)_{mn} \qquad (26)
= \left(2(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right)_{mn}, \qquad (27)

and the second-order Hessian is:

\nabla^{2}_{{\bm{F}}_{mn}{\bm{F}}_{kl}}\left(\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}\right) \qquad (28)
= \sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\frac{\partial^{2} y^{2}_{ij}}{\partial{\bm{F}}_{mn}\,\partial{\bm{F}}_{kl}} \qquad (29)
= 2\frac{\partial}{\partial{\bm{F}}_{kl}}\left(\sum_{j}{\bm{W}}_{mj}\left(\frac{{\bm{A}}_{mj}}{d_{m}}{\bm{F}}_{mn}-\frac{{\bm{A}}_{mj}}{\sqrt{d_{m}d_{j}}}{\bm{F}}_{jn}\right)\right) \qquad (30)
= 2\left({\bm{q}}_{m}\delta_{mk}-\sum_{j}{\bm{W}}_{mj}\frac{{\bm{A}}_{mj}}{\sqrt{d_{m}d_{j}}}\delta_{jk}\right)\delta_{nl} \qquad (31)
= 2(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}})_{mk}\delta_{nl}. \qquad (32)

Remark 2.

From Eq. (21) to Eq. (24), one can assume $m\notin\mathcal{N}(m)$ and thus ${\bm{W}}_{mm}=0$. However, a self-loop is often added to ${\bm{A}}$ to facilitate stability by avoiding zero-degree nodes that cannot be normalized. This is not as problematic as it seems: because $\sum_{(i,j)\in\mathcal{E},i\neq j}$ intrinsically excludes the diagonal terms, we can simply assign zero to the diagonal of ${\bm{W}}$ so that the $j=m$ term still vanishes in Eq. (24), as defined in Eq. (5).

To minimize $\hat{\mathcal{H}}$, the gradient descent update rule takes the form of Eq. (6). One may expect that when $\eta$ is chosen small enough, $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$. For a formal justification, Theorem 1 determines the convergence condition on $\eta$.

Theorem 1.

If ${\bm{F}}^{(k)}$ follows the update rule in Eq. (6), where $\rho$ satisfies that $\frac{d\rho(y)}{dy^2}$ is non-increasing for $y\in(0,\infty)$, then a sufficient condition for $\mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)})$ is that the step size $\eta$ satisfies $0<\eta\leq\|\text{diag}({\bm{q}}^{(k)})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\|_2^{-1}$.

Proof.

The descent of $\hat{\mathcal{H}}({\bm{F}})$ ensures the descent of $\mathcal{H}({\bm{F}})$ since $\mathcal{H}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k+1)})+C\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})+C=\mathcal{H}({\bm{F}}^{(k)})$. Therefore, we only need to prove $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$.

Noting that $\hat{\mathcal{H}}$ is a quadratic function and ${\bm{F}}^{(k+1)}-{\bm{F}}^{(k)}=-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})$, $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})$ and $\hat{\mathcal{H}}({\bm{F}}^{(k)})$ can be connected using a Taylor expansion (see footnote 4), where $\nabla\hat{\mathcal{H}}$ and $\nabla^2\hat{\mathcal{H}}$ are given in Lemma 2:

\hat{\mathcal{H}}({\bm{F}}^{(k+1)})-\hat{\mathcal{H}}({\bm{F}}^{(k)}) \qquad (33)
= {\mathrm{tr}}\left(\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})^{\top}({\bm{F}}^{(k+1)}-{\bm{F}}^{(k)})\right) \qquad (34)
\quad+\frac{1}{2}{\mathrm{tr}}\left(({\bm{F}}^{(k+1)}-{\bm{F}}^{(k)})^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{F}}^{(k)})({\bm{F}}^{(k+1)}-{\bm{F}}^{(k)})\right) \qquad (35)
= {\mathrm{tr}}\left(-\eta\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})^{\top}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})\right) \qquad (36)
\quad+{\mathrm{tr}}\left(\frac{\eta^{2}}{2}\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{F}}^{(k)})\nabla\hat{\mathcal{H}}({\bm{F}}^{(k)})\right). \qquad (37)

Inserting $\nabla^2\hat{\mathcal{H}}({\bm{F}}^{(k)})=2(\text{diag}({\bm{q}})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}})$ from Lemma 2 into the above equation, we find a sufficient condition for $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$ to be

-\eta+\left\|\text{diag}({\bm{q}})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\right\|_{2}\eta^{2}\leq 0, \qquad (38)

or

\eta\leq\left\|\text{diag}({\bm{q}})-{\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}\right\|_{2}^{-1}. \qquad (39)

Now we prove that when taking the quasi-Newton IRLS step as in Eq. (7), the objective $\hat{\mathcal{H}}$ is guaranteed to descend. Since the feature dimensions are independent of each other, we simplify our notation as if the feature dimension were $1$; one can easily recover the general case by taking the trace (see footnote 4).

Lemma 3.

$2\hat{{\bm{Q}}}-\nabla^2\hat{\mathcal{H}}({\bm{y}})$ is positive semi-definite, where $\hat{\mathcal{H}}=\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}y_{ij}^{2}+\lambda\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{(0)}\|_2^2$, $\hat{{\bm{Q}}}=2(\text{diag}({\bm{q}})+\lambda{\bm{I}})$, and ${\bm{q}}_m=\sum_j{\bm{W}}_{mj}{\bm{A}}_{mj}/d_m$.

Proof.

In Lemma 2, we have $\nabla^2\hat{\mathcal{H}}({\bm{y}})=2(\text{diag}({\bm{q}})+\lambda{\bm{I}}-{\bm{W}}\odot\tilde{{\bm{A}}})$, so

2\hat{{\bm{Q}}}-\nabla^{2}\hat{\mathcal{H}}({\bm{y}})=2(\text{diag}({\bm{q}})+\lambda{\bm{I}}+{\bm{W}}\odot\tilde{{\bm{A}}}). \qquad (40)

Recall the derivation of Eq. (27) from Eq. (20), where we effectively showed that

\sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\left\|\frac{{\bm{f}}_{i}}{\sqrt{d_{i}}}-\frac{{\bm{f}}_{j}}{\sqrt{d_{j}}}\right\|_{2}^{2}={\mathrm{tr}}\left({\bm{F}}^{\top}(\text{diag}({\bm{q}})-{\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right), \qquad (41)

which holds for all ${\bm{F}}$. The identity still holds after flipping the sign before ${\bm{f}}_j/\sqrt{d_j}$ and before ${\bm{W}}\odot\tilde{{\bm{A}}}$. We then have, for all ${\bm{F}}$ and all $\lambda\geq 0$,

0 \leq \sum_{(i,j)\in\mathcal{E},i\neq j}{\bm{W}}_{ij}\left\|\frac{{\bm{f}}_{i}}{\sqrt{d_{i}}}+\frac{{\bm{f}}_{j}}{\sqrt{d_{j}}}\right\|_{2}^{2}={\mathrm{tr}}\left({\bm{F}}^{\top}(\text{diag}({\bm{q}})+{\bm{W}}\odot\tilde{{\bm{A}}}){\bm{F}}\right) \qquad (42)
\leq {\mathrm{tr}}\left({\bm{F}}^{\top}(\text{diag}({\bm{q}})+{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}}){\bm{F}}\right). \qquad (43)

Thus $(\text{diag}({\bm{q}})+{\bm{W}}\odot\tilde{{\bm{A}}}+\lambda{\bm{I}})\succeq 0$, and therefore $2\hat{{\bm{Q}}}-\nabla^2\hat{\mathcal{H}}({\bm{y}})\succeq 0$. ∎

Using Lemma 1 and Lemma 3, we can prove Theorem 2. Note that we continue to assume the feature dimension is $1$ for simplicity but without loss of generality (see footnote 4).

Theorem 2.

If ${\bm{F}}^{(k+1)}$ follows the update rule in Eq. (7), where $\rho$ satisfies that $\frac{d\rho(y)}{dy^2}$ is non-increasing for all $y\in(0,\infty)$, then it is guaranteed that $\mathcal{H}({\bm{F}}^{(k+1)})\leq\mathcal{H}({\bm{F}}^{(k)})$.

Proof.

Following the discussion in Theorem 1, we only need to prove $\hat{\mathcal{H}}({\bm{F}}^{(k+1)})\leq\hat{\mathcal{H}}({\bm{F}}^{(k)})$. For the quadratic $\hat{\mathcal{H}}$, we have:

\hat{\mathcal{H}}({\bm{x}})=\hat{\mathcal{H}}({\bm{y}})+\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\frac{1}{2}({\bm{x}}-{\bm{y}})^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{y}})({\bm{x}}-{\bm{y}}). \qquad (44)

We can define $\mathcal{Q}({\bm{y}})=2\hat{{\bm{Q}}}({\bm{y}})$ as in Lemma 3 such that $\mathcal{Q}({\bm{y}})-\nabla^2\hat{\mathcal{H}}({\bm{y}})\succeq 0$, so that

\forall{\bm{z}},\quad{\bm{z}}^{\top}\mathcal{Q}({\bm{y}}){\bm{z}}\geq{\bm{z}}^{\top}\nabla^{2}\hat{\mathcal{H}}({\bm{y}}){\bm{z}}. \qquad (45)

Then an upper bound of $\hat{\mathcal{H}}({\bm{x}})$ can be found by inserting Eq. (45) into Eq. (44):

\hat{\mathcal{H}}({\bm{x}})\leq\hat{\mathcal{H}}({\bm{y}})+\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\frac{1}{2}({\bm{x}}-{\bm{y}})^{\top}\mathcal{Q}({\bm{y}})({\bm{x}}-{\bm{y}}). \qquad (46)

Then, insert $\mathcal{Q}=2\hat{{\bm{Q}}}$ into Eq. (46). Note that $\hat{{\bm{Q}}}\coloneqq 2(\text{diag}({\bm{q}})+\lambda{\bm{I}})$, so $\hat{{\bm{Q}}}\succeq 0$ and $\hat{{\bm{Q}}}^{\top}=\hat{{\bm{Q}}}$. Thereafter, the update rule ${\bm{x}}={\bm{y}}-\hat{{\bm{Q}}}^{-1}\nabla\hat{\mathcal{H}}({\bm{y}})$ in Eq. (7) gives

\hat{\mathcal{H}}({\bm{x}})-\hat{\mathcal{H}}({\bm{y}}) \qquad (47)
\leq \nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\frac{1}{2}({\bm{x}}-{\bm{y}})^{\top}\mathcal{Q}({\bm{y}})({\bm{x}}-{\bm{y}}) \qquad (48)
= \nabla\hat{\mathcal{H}}({\bm{y}})^{\top}({\bm{x}}-{\bm{y}})+\left(\hat{{\bm{Q}}}^{\frac{1}{2}}({\bm{x}}-{\bm{y}})\right)^{\top}\left(\hat{{\bm{Q}}}^{\frac{1}{2}}({\bm{x}}-{\bm{y}})\right) \qquad (49)
= -\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}\hat{{\bm{Q}}}^{-1}\nabla\hat{\mathcal{H}}({\bm{y}})+\nabla\hat{\mathcal{H}}({\bm{y}})^{\top}(\hat{{\bm{Q}}}^{-\frac{1}{2}})^{\top}\hat{{\bm{Q}}}^{-\frac{1}{2}}\nabla\hat{\mathcal{H}}({\bm{y}}) \qquad (50)
= 0. \qquad (51)

Therefore, our QN-IRLS in Eq. (7) is guaranteed to descend. ∎

Appendix C Computation Efficiency

Our RUNG model remains computationally efficient even though it adopts the quasi-Newton IRLS algorithm.

C.1 Time Complexity Analysis

Each RUNG layer involves computing ${\bm{W}}$, ${\bm{q}}$, and the subsequent aggregations. We elaborate on them one by one. We denote the number of feature dimensions by $d$, the number of nodes by $n$, and the number of edges by $m$, which are assumed to satisfy $m\gg 1$, $n\gg 1$, and $d\gg 1$. The number of layers is denoted by $k$. Asymptotic computational complexity is denoted by $\mathcal{O}(\cdot)$.

Computation of ${\bm{W}}\odot{\bm{A}}$ and ${\bm{W}}\odot\tilde{{\bm{A}}}$.

${\bm{W}}\coloneqq\bm{1}_{i\neq j}\frac{d\rho_\gamma(y_{ij})}{dy_{ij}^2}$ is the edge weighting matrix dependent on the node feature matrix ${\bm{F}}$. Computing $y_{ij}=\|\frac{{\bm{f}}_i}{\sqrt{d_i}}-\frac{{\bm{f}}_j}{\sqrt{d_j}}\|_2$ costs $\mathcal{O}(d)$ and computing $\frac{d\rho_\gamma(y_{ij})}{dy_{ij}^2}$ costs $\mathcal{O}(1)$. ${\bm{W}}_{ij}$ only needs to be computed when $(i,j)\in\mathcal{E}$, because for $(i,j)\notin\mathcal{E}$, ${\bm{W}}_{ij}$ is masked out by ${\bm{A}}$ or $\tilde{{\bm{A}}}$ anyway. Each element of ${\bm{W}}$ takes $\mathcal{O}(d)$ time and $m$ elements are needed. In total, ${\bm{W}}$ costs $\mathcal{O}(md)$, and ${\bm{W}}\odot{\bm{A}}$ and ${\bm{W}}\odot\tilde{{\bm{A}}}$ cost $\mathcal{O}(md+m)=\mathcal{O}(md)$.
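As a concrete illustration of this edge-wise cost, a minimal PyTorch sketch of the MCP edge weighting is given below; it is our own illustration under assumed inputs (a COO edge list and precomputed degrees), not the released implementation.

import torch

def edge_weights(F, edge_index, deg, gamma=3.0, eps=1e-8):
    """MCP edge weights W_ij for the m stored edges only, i.e., O(md) time overall.

    F: (n, d) node features; edge_index: (2, m) COO edge list; deg: (n,) node degrees (float)."""
    src, dst = edge_index
    diff = F[src] / deg[src].sqrt().unsqueeze(1) - F[dst] / deg[dst].sqrt().unsqueeze(1)
    y = diff.norm(dim=1).clamp(min=eps)              # y_ij, O(d) per edge
    w = torch.clamp(0.5 / y - 0.5 / gamma, min=0.0)  # d rho_gamma(y) / d y^2, O(1) per edge
    w[src == dst] = 0.0                              # W_ii = 0 (self-loops carry no penalty)
    return w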

Computation of $\hat{{\bm{Q}}}^{-1}$.

$\hat{{\bm{Q}}}^{-1}\coloneqq\frac{1}{2}(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}$ is the inverse approximate Hessian in our quasi-Newton IRLS. Because $\hat{{\bm{Q}}}$ is designed to be diagonal, its inverse can be evaluated as an element-wise reciprocal, which is efficient. As for ${\bm{q}}_m\coloneqq\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m$, only existing edges $(i,j)\in\mathcal{E}$ need to be evaluated in the summation. Therefore, this computation costs $\mathcal{O}(m)$, and $\hat{{\bm{Q}}}^{-1}$ costs $\mathcal{O}(m)$ in total.

Computation of aggregation.

A RUNG layer follows

{\bm{F}}^{(k+1)}=2\hat{{\bm{Q}}}^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (52)

which combines the quantities computed above. The extra graph aggregation, realized by the matrix multiplication between ${\bm{W}}\odot\tilde{{\bm{A}}}$ and ${\bm{F}}$, costs $\mathcal{O}(md)$. The subsequent addition of $\lambda{\bm{F}}^{(0)}$ and the multiplication by the diagonal $\hat{{\bm{Q}}}^{-1}$ each cost $\mathcal{O}(nd)$.
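For illustration, a dense PyTorch sketch of one such layer is given below. It is a simplified reference for Eq. (52) using a dense adjacency matrix (the actual $\mathcal{O}(md)$ implementation would operate on the sparse edge list as analyzed above); all tensor names and default values are our own assumptions.

import torch

def rung_layer(F, F0, A, lam=1.0, gamma=3.0, eps=1e-8):
    """One RUNG aggregation step, Eq. (52): F <- 2 Q^{-1} ((W * A_tilde) F + lam * F0).

    F, F0: (n, d) current and input node features; A: (n, n) dense 0/1 adjacency (float)."""
    deg = A.sum(dim=1).clamp(min=1.0)                      # node degrees d_i
    A_tilde = A / torch.sqrt(deg[:, None] * deg[None, :])  # symmetrically normalized adjacency
    Fn = F / deg.sqrt()[:, None]                           # f_i / sqrt(d_i)
    y = torch.cdist(Fn, Fn).clamp(min=eps)                 # y_ij = ||f_i/sqrt(d_i) - f_j/sqrt(d_j)||_2
    W = torch.clamp(0.5 / y - 0.5 / gamma, min=0.0)        # MCP weights; distant (likely attacked) edges get zero weight
    W.fill_diagonal_(0.0)                                  # W_ii = 0
    q = (W * A).sum(dim=1) / deg                           # q_m = sum_j W_mj A_mj / d_m
    Q_inv = 0.5 / (q + lam)                                # (2 (diag(q) + lam I))^{-1}, diagonal
    return 2.0 * Q_inv[:, None] * ((W * A_tilde) @ F + lam * F0)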

Stacking layers.

RUNG unrolls the QN-IRLS optimization procedure, which runs for multiple iterations. The faster convergence that QN-IRLS provides therefore allows RUNG to use fewer layers, reducing the overall complexity. It is worth noting that QN-IRLS utilizes a diagonal approximate Hessian, so the computation per iteration also remains efficient, as discussed above.

Summing up all the costs, the total computational complexity of our RUNG is $\mathcal{O}((m+n)kd)$. Our RUNG thus scales well to larger graph datasets such as ogbn-arxiv.

Space Complexity Analysis

The only notable extra storage cost is ${\bm{W}}$, whose sparse layout takes $\mathcal{O}(m)$ memory. This is the same order of size as the adjacency matrix itself and thus does not affect the total asymptotic complexity.

C.2 Alternative Perspective

In fact, the above analysis can be simplified by looking at the local aggregation behavior of RUNG. Node $i$ is updated via the aggregation ${\bm{f}}_i=2\hat{{\bm{Q}}}_{ii}^{-1}\left(\sum_{j\in\mathcal{N}(i)}({\bm{W}}\odot\tilde{{\bm{A}}})_{ij}{\bm{f}}_j+\lambda{\bm{f}}_i^{(0)}\right)$. The summation over the neighbors' ${\bm{f}}_j$ contributes $\mathcal{O}(m)$ to the total time complexity for each feature dimension, and each ${\bm{W}}_{ij}$ involves $\mathcal{O}(d)$ computation per neighbor. This also sums up to $\mathcal{O}(md)$. Essentially, the high efficiency of RUNG stems from the fact that each edge weight involves only the two nodes incident to that edge.

Appendix D Comprehensive Bias Analysis

In this section, we provide multiple pieces of evidence from various perspectives to reveal the estimation bias of $\ell_1$-based estimation, as follows.

(1) From the theoretical perspective, the extensive literature on high-dimensional statistics (Tibshirani, 1996b; Zhang, 2010b) has shown that $\ell_1$ regularization induces an estimation bias.

(2) From the algorithmic perspective, in Section 2.3 we explain the solver of the $\ell_1$-based estimation problem. Specifically, the soft-thresholding operator $S_\lambda(\theta):=\text{sign}(\theta)\max(|\theta|-\lambda,0)$ induced by the $\ell_1$ regularized problem shrinks every $\theta$ with $|\theta|>\lambda$ by the constant $\lambda$, biasing the estimator towards zero by the magnitude $\lambda$ (a small sketch contrasting the two thresholding operators is given at the end of this section).

(3) From the numerical simulation in Section 2.3, we provide a mean estimation example to verify this estimation bias. As shown in Figure 2, the $\ell_1$ estimator (green) deviates further from the true mean as the ratio of outliers escalates. This can be clearly explained by the accumulation of estimation bias: each outlier contributes a constant bias, and the bias accumulates with more outliers.

(4) From the performance perspective, $\ell_1$-based GNNs such as SoftMedian, TWIRLS, and RUNG-$\ell_1$ (the $\ell_1$ variant of our model) suffer from significant performance degradation when the attack budget increases.

(5) In our ablation study in Figure 6, we quantify the estimation bias of the aggregated feature ${\bm{f}}_i^{\star}$ on the attacked graph relative to the feature ${\bm{f}}_i$ on the clean graph as $\sum_{i\in\mathcal{V}}\|{\bm{f}}_i-{\bm{f}}_i^{\star}\|_2^2$. The results demonstrate that $\ell_1$-based GNNs produce biased estimations under adversarial attacks and that the bias indeed scales up with the attack budget. In contrast, our proposed RUNG exhibits a nearly zero estimation bias under the same attack budgets.

All of this evidence supports our claim that $\ell_1$-based robust estimators suffer from estimation bias, which validates the motivation of our new algorithm design.
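To make point (2) concrete, here is a minimal NumPy sketch contrasting soft-thresholding with one common form of the MCP thresholding operator; the test values, $\lambda=1$, and $\gamma=3$ are illustrative choices of ours, not settings from the experiments.

import numpy as np

def soft_threshold(theta, lam):
    """l1 proximal operator: every |theta| > lam is shrunk by the constant lam (biased)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def mcp_threshold(theta, lam, gamma=3.0):
    """One common MCP proximal operator (gamma > 1): values with |theta| > gamma * lam are left untouched."""
    shrunk = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0) / (1.0 - 1.0 / gamma)
    return np.where(np.abs(theta) > gamma * lam, theta, shrunk)

theta = np.array([0.5, 2.0, 10.0])
print(soft_threshold(theta, lam=1.0))  # [0.  1.  9.]   -- even 10.0 is shifted by the constant 1.0
print(mcp_threshold(theta, lam=1.0))   # [0.  1.5 10.]  -- large values remain unbiased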

Appendix E Additional Experiment Results

In this section, we present the experiment results that are not shown in the main paper due to space limits.

E.1 Adaptive Attacks

Table 3 and Table 4 present the results of adaptive local and global attacks on Citeseer, as referred to in Section 4.2.

Table 3: Adaptive local attack on Citeseer. The best and second are marked.
Model 0% 20% 50% 100% 150% 200%
MLP 69.3±2.5 69.3±2.5 \underline{69.3±2.5} \mathbf{69.3±2.5} \mathbf{69.3±2.5} \mathbf{69.3±2.5}
GCN 79.3±3.3 44.7±8.8 27.3±7.7 6.7±3.7 0.7±1.3 0.0±0.0
APPNP \mathbf{80.7±4.4} 50.0±6.7 39.3±6.5 16.7±12.3 16.0±8.0 0.0±0.0
GAT 74.7±5.0 15.3±17.5 13.3±13.5 12.0±11.3 12.7±6.5 9.3±5.3
GNNGuard 74.7±4.5 46.0±10.4 32.7±11.0 18.0±7.5 6.0±3.9 4.0±3.9
RGCN \underline{80.0±2.1} 46.7±9.4 32.7±8.8 10.0±5.2 0.7±1.3 0.7±1.3
GRAND 77.3±2.5 56.7±4.2 44.0±3.9 16.7±6.3 0.7±1.3 0.0±0.0
ProGNN \underline{80.0±2.1} 42.7±7.4 26.0±5.3 10.0±4.7 0.7±1.3 0.0±0.0
Jaccard-GCN 78.7±3.4 46.7±7.3 28.0±7.5 6.7±4.7 0.7±1.3 0.0±0.0
SoftMedian 78.7±3.4 69.3±6.5 66.0±7.1 56.0±4.4 8.7±6.9 3.3±3.0
TWIRLS 77.3±2.5 69.3±1.3 68.7±1.6 57.3±2.5 36.7±4.7 26.7±4.7
TWIRLS-T 76.0±3.3 \underline{70.7±2.5} 68.7±2.7 62.0±3.4 52.7±5.3 47.3±8.3
RUNG-$\ell_1$ (ours) \underline{80.0±3.7} \mathbf{75.3±4.5} \mathbf{73.3±3.0} \underline{67.3±3.3} 36.0±9.3 26.0±8.3
RUNG (ours) 77.3±1.3 \underline{70.7±5.7} \underline{69.3±6.8} \underline{67.3±7.1} \underline{64.0±5.7} \underline{61.3±5.8}
Table 4: Adaptive global attack on Citeseer. The best and second are marked.
Model Clean 5% 10% 20% 30% 40%
MLP 67.7±0.3 67.7±0.3 \underline{67.7±0.3} \mathbf{67.7±0.3} \mathbf{67.7±0.3} \mathbf{67.7±0.3}
GCN 74.8±1.2 66.1±1.0 60.9±0.8 53.0±1.0 47.0±0.8 41.2±1.1
APPNP \mathbf{75.3±1.1} 65.8±0.9 60.7±1.6 52.3±1.6 46.0±2.0 41.2±2.2
GAT 73.4±1.2 65.4±1.3 60.4±1.4 52.6±2.5 47.2±3.4 41.2±4.8
GNNGuard 72.4±1.1 65.6±0.9 61.8±1.4 55.6±1.4 51.0±1.3 47.3±1.3
RGCN 74.4±1.0 66.0±0.8 60.6±0.9 52.5±0.8 46.1±0.9 40.2±1.0
GRAND 74.8±0.6 66.6±0.7 61.8±0.7 53.6±1.1 47.4±1.2 42.2±0.9
ProGNN 74.2±1.3 65.6±1.1 60.3±1.1 52.7±1.4 46.2±0.9 40.8±0.6
Jaccard-GCN 74.8±1.2 66.3±1.2 60.9±1.2 53.3±0.9 46.5±0.9 41.1±1.0
EvenNet 74.6±0.5 66.8±0.5 62.0±0.6 55.9±0.4 51.1±0.4 47.4±0.8
GARNET 74.8±1.3 68.0±0.9 64.0±1.1 58.2±0.7 53.9±0.8 51.0±0.9
SoftMedian 74.6±0.7 68.0±0.7 64.4±0.9 59.3±1.1 55.2±2.0 51.9±2.1
TWIRLS 74.2±0.8 69.2±0.8 66.4±0.7 61.6±0.9 58.1±1.2 51.8±1.5
TWIRLS-T 73.7±1.1 69.1±1.2 66.4±1.0 62.8±1.5 60.0±1.4 57.4±1.5
RUNG-$\ell_1$ (ours) \mathbf{75.5±1.1} \underline{69.3±1.2} 65.9±1.2 61.1±1.1 57.2±1.4 53.9±1.3
RUNG (ours) 74.3±0.7 \mathbf{71.4±1.0} \mathbf{69.8±1.3} \underline{67.6±1.2} \underline{66.5±1.3} \underline{65.3±1.5}

E.2 Transfer Attacks

Table 5 and Table 6 report the results of transfer global attacks on Cora ML and Citeseer. Figure 9 and Figure 10 show the results of our RUNG under transfer attacks generated with different surrogate models, as mentioned in Section 4.3.

Table 5: Transfer global evasion attack on Cora ML.
Model Clean 5% 10% 20% 30% 40%
MLP 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0 65.0±1.0
GCN 85.0±0.4 76.0±0.7 70.6±0.9 62.1±1.0 55.4±1.0 49.8±1.1
APPNP \mathbf{86.3±0.4} 78.0±1.0 72.7±1.2 64.4±1.8 57.7±1.5 53.0±1.2
GAT 83.5±0.5 78.4±0.7 75.1±0.9 69.5±1.7 65.0±1.8 61.4±2.4
GNNGuard 83.1±0.7 79.9±0.7 78.1±0.7 74.8±0.9 71.8±1.1 69.9±1.1
RGCN 85.7±0.4 77.7±0.6 72.7±0.8 64.4±0.8 57.9±0.8 52.6±1.0
GRAND 86.1±0.7 80.2±0.7 76.4±1.0 70.0±1.3 64.6±1.4 60.1±1.1
ProGNN 85.6±0.5 77.8±0.7 72.8±0.9 64.8±1.1 58.7±1.3 53.2±1.3
Jaccard-GCN 83.7±0.7 77.8±0.7 74.1±1.1 68.4±1.1 63.5±1.6 59.3±1.4
ElasticGNN 85.9±0.5 81.9±0.9 79.0±1.0 73.3±1.3 68.1±1.4 63.6±1.5
SoftMedian 85.0±0.7 81.8±0.4 79.0±0.6 73.0±1.1 66.3±1.6 60.4±2.4
TWIRLS 84.2±0.6 81.5±0.7 79.7±0.6 76.8±0.5 74.3±0.7 72.8±0.9
RUNG-$\ell_1$ (ours) 85.8±0.5 82.3±0.8 79.7±0.6 75.1±0.8 70.5±0.8 67.1±0.9
RUNG (ours) 84.6±0.5 \mathbf{83.7±0.6} \mathbf{82.9±0.9} \mathbf{81.3±1.3} \mathbf{79.2±1.6} \mathbf{77.9±2.0}
Table 6: Transfer global evasion attack on Citeseer.
Model Clean 5% 10% 20% 30% 40%
MLP 67.7±0.3 67.7±0.3 67.7±0.3 67.7±0.3 \underline{67.7±0.3} \underline{67.7±0.3}
GCN 74.8±1.2 66.7±1.3 62.1±1.3 54.5±1.7 48.7±1.8 43.4±2.1
APPNP \underline{75.3±1.1} 67.7±1.2 62.9±1.0 55.6±1.0 50.3±1.0 45.8±1.6
GAT 73.4±1.2 68.2±1.3 64.6±1.4 58.2±2.1 53.2±3.1 48.7±3.7
GNNGuard 72.4±1.1 70.6±1.2 69.1±1.3 67.1±1.4 65.4±1.7 64.2±1.9
RGCN 74.4±1.0 67.8±1.0 63.5±1.3 56.3±1.5 51.0±1.4 46.1±1.7
GRAND 74.8±0.6 68.9±0.8 65.0±0.9 59.2±1.3 54.8±1.5 51.2±1.9
ProGNN 74.2±1.3 66.8±1.0 62.3±1.0 55.0±1.1 49.2±1.0 43.7±1.3
Jaccard-GCN 74.8±1.2 68.5±1.1 65.1±1.0 59.0±1.7 54.6±1.9 51.1±1.7
ElasticGNN 75.2±1.2 70.9±1.4 67.7±1.6 62.0±2.5 57.9±3.2 53.4±4.0
SoftMedian 74.6±0.7 68.0±0.7 64.4±0.9 59.3±1.1 55.2±2.0 51.9±2.2
TWIRLS 74.2±0.8 \underline{71.8±1.0} \underline{70.1±0.9} \underline{68.4±1.4} 67.4±1.6 66.4±1.8
RUNG-$\ell_1$ (ours) \mathbf{75.5±1.1} 72.0±1.3 69.3±1.4 65.1±1.8 61.8±2.0 58.7±2.4
RUNG (ours) 74.3±0.7 \mathbf{74.2±1.0} \mathbf{74.2±1.0} \mathbf{74.3±1.0} \mathbf{74.2±1.1} \mathbf{74.1±1.1}
Figure 9: Transfer global attack from different surrogate models to our RUNG on Cora ML.
Figure 10: Transfer global attack from different surrogate models to our RUNG on Citeseer.

Figure 11 shows the results of global evasion transfer attacks between different models on Cora ML. Our observations are summarized below:

  • The attacks generated by RUNG are stronger when applied to more robust models like SoftMedian, while they are not as strong against undefended or weakly defended models.

  • For $\ell_1$ GNNs, the attacks are strongest when transferred from other $\ell_1$ GNNs. This again supports our unified view of $\ell_1$ GNNs. An exception is TWIRLS, because it only has one attention layer and does not always converge to the actual $\ell_1$ objective.

Figure 11: Transfer global attack between different model pairs on Cora.

E.3 Poisoning Attacks

We provide the experiment results under poisoning attacks on Cora ML and Citeseer in Table 7 and Table 8, respectively.

Table 7: Poisoning Attacks on Cora ML.
Budget 5% 10% 20% 30% 40%
GCN 74.9±0.4 69.7±0.7 60.7±0.7 54.0±1.0 48.7±1.0
APPNP 76.3±0.9 71.1±1.2 63.0±1.3 57.1±0.6 53.2±1.1
SoftMedian 79.2±0.7 75.6±0.3 67.8±0.6 62.9±1.0 58.6±0.7
RUNG-$\ell_1$ (ours) 79.7±0.6 76.4±0.6 68.1±0.6 63.8±0.5 60.1±0.9
RUNG (ours) 78.5±0.5 75.5±0.3 71.5±0.4 67.1±1.6 64.6±1.3
Table 8: Poisoning Attacks on Citeseer.
Budget 5% 10% 20% 30% 40%
GCN 65.5±1.1 59.8±1.0 51.0±1.0 44.0±1.2 37.9±1.0
APPNP 64.2±1.8 58.1±2.6 49.8±2.5 43.4±2.3 40.6±2.7
SoftMedian 67.1±1.0 63.8±1.0 58.5±1.1 54.3±1.9 51.2±2.4
RUNG-$\ell_1$ (ours) 68.9±1.0 65.9±1.1 61.0±1.0 57.2±1.1 53.9±1.3
RUNG (ours) 72.4±0.9 72.1±1.2 71.3±1.4 70.8±1.3 69.7±1.4

E.4 Large Scale Ogbn-Arxiv

On large-scale datasets, we cannot directly apply the vanilla PGD attack (Xu et al., 2019) due to its excessive memory and computation requirements on the dense adjacency matrix. Instead, we leverage PRBCD (Geisler et al., 2021b) to scale up the PGD attack without manipulating the dense adjacency matrix. We conduct experiments on the large-scale ogbn-arxiv dataset, and the results in Table 9 verify the superior robustness of our RUNG model. RUNG outperforms its $\ell_1$ variant, which delivers performance similar to SoftMedian and ElasticGNN due to their shared $\ell_1$ graph smoothing.

Table 9: Global PGD Attacks on Ogbn-Arxiv.
Model Clean 1% 5% 10%
GCN 71.9±0.5 63.1±0.4 48.9±3.2 41.8±0.5
APPNP 71.7±0.3 64.2±0.4 50.1±2.3 42.2±1.3
SoftMedian 71.2±0.5 65.1±0.3 54.1±1.6 50.1±2.6
RUNG-$\ell_1$ (ours) 71.6±0.6 65.5±0.4 55.0±1.1 49.6±1.1
RUNG (ours) 70.2±2.1 65.2±0.2 64.0±0.8 61.2±0.7

E.5 Adversarial Training

We conduct adversarial training following Xu et al. (2019) and present the results in Table 10. We observe that adversarial training further improves the robustness of RUNG.

Table 10: Adversarial Training vs Normal Training on RUNG.
Budget Clean 5% 10% 20% 30% 40%
Normal Training 84.6±0.5 78.9±0.4 75.7±0.2 71.8±0.4 67.8±1.3 65.1±1.2
Adversarial Training 84.3±1.2 81.8±0.7 79.9±0.3 77.3±1.1 72.8±0.6 69.1±0.9

E.6 Graph Injection Attack

The injection attack is conducted following the settings in Zou et al. (2021) to evaluate the robustness of different methods. We set the budget on the number of injected nodes to 100 and the degree budget to 200. The results in Table 11 show that our RUNG significantly outperforms the baseline models.

Table 11: Graph Injection Attack on Citeseer.
Model Clean Attacked
GCN 75.40 28.14
APPNP 75.84 25.24
GNNGuard 73.47 34.73
SoftMedian 74.82 38.91
RUNG-1\ell_{1} 75.01 39.22
RUNG-MCP (ours) 75.65 51.13

Appendix F Adaptive Attack Settings

For the adaptive PGD attack adopted in the experiments, we mainly follow the algorithm in Mujkanovic et al. (2022) for the adaptive evasion attack. For the sake of completeness, we describe it below.

In summary, we consider the topology attack setting where the adjacency matrix ${\bm{A}}$ is perturbed by $\delta{\bm{A}}$, whose elements $\delta{\bm{A}}_{ij}\in\{0,1\}$. The budget $B$ is defined by $B\geq\|\delta{\bm{A}}\|_0$. The PGD attack first relaxes $\delta{\bm{A}}$ from binary to continuous values so that a gradient ascent attack can be conducted on the relaxed graph.

During the attack, the minimization problem below is solved:

\delta{\bm{A}}_{\star}=\operatorname*{arg\,min}_{\delta{\bm{A}}}\mathcal{L}_{\text{attack}}\left(\text{GNN}_{\theta}({\bm{A}}+({\bm{I}}-2{\bm{A}})\odot\delta{\bm{A}},{\bm{F}}),y_{\text{target}}\right), \qquad (53)

where $\mathcal{L}_{\text{attack}}$ is a carefully designed attack loss function (Geisler et al., 2021a; Mujkanovic et al., 2022); ${\bm{A}}$, ${\bm{F}}$, and $y_{\text{target}}$ are respectively the graph, the node feature matrix, and the ground-truth labels of the dataset; and $\theta$ denotes the parameters of the GNN under attack, which are not altered in the evasion setting. $({\bm{I}}-2{\bm{A}})\odot\delta{\bm{A}}$ is the perturbation that “flips” the corresponding entries of the adjacency matrix between $0$ and $1$. The gradient $\frac{\partial\mathcal{L}_{\text{attack}}}{\partial\delta{\bm{A}}}$ is computed and used to update the perturbation matrix $\delta{\bm{A}}$.

After the optimization problem is solved, $\delta{\bm{A}}$ is projected back to the feasible domain $\delta{\bm{A}}_{ij}\in[0,1]$ subject to the budget. The relaxed perturbation then serves as a probability matrix for Bernoulli sampling of the binary perturbed adjacency matrix ${\bm{A}}^{\prime}$. The sampling is repeated multiple times, and the ${\bm{A}}^{\prime}$ producing the strongest perturbation is finally selected.
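A schematic PyTorch sketch of this procedure is given below. It is deliberately simplified: the cross-entropy objective, the clamping-based projection, and the step counts are placeholders for the carefully designed attack loss and exact budget projection of Mujkanovic et al. (2022), the symmetry of the undirected graph is ignored, and the interface gnn(A, F) is an assumption.

import torch

def pgd_topology_attack(gnn, A, F, y_target, budget, steps=200, lr=0.1, samples=20):
    """Relax the edge-flip matrix dA to [0, 1], run gradient ascent, then Bernoulli-sample binary flips."""
    n = A.shape[0]
    dA = torch.zeros(n, n, requires_grad=True)
    flip = torch.eye(n) - 2.0 * A                    # (I - 2A) * dA flips the selected entries, cf. Eq. (53)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(gnn(A + flip * dA, F), y_target)
        loss.backward()
        with torch.no_grad():
            dA += lr * dA.grad                       # gradient ascent on the relaxed perturbation
            dA.clamp_(0.0, 1.0)                      # naive projection onto [0, 1]
            if dA.sum() > budget:                    # naive rescaling in place of the exact budget projection
                dA *= budget / dA.sum()
            dA.grad.zero_()
    # Sample binary perturbations from the relaxed dA and keep the strongest one.
    best_A, best_loss = A, float('-inf')
    with torch.no_grad():
        for _ in range(samples):
            A_pert = A + flip * torch.bernoulli(dA)  # budget check of the sampled flips omitted for brevity
            l = torch.nn.functional.cross_entropy(gnn(A_pert, F), y_target).item()
            if l > best_loss:
                best_A, best_loss = A_pert, l
    return best_A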

Appendix G Additional Ablation Study of RUNG

G.1 Hyperparameters

The choice of the hyperparameters $\gamma$ and $\lambda$ is crucial to the performance of RUNG. We therefore experiment with different combinations of them and conduct adaptive attacks on Cora, as shown in Figure 12.

Recall the formulation of RUNG (restated below as Eq. (54)):

{\bm{F}}^{(k+1)}=(\text{diag}({\bm{q}}^{(k)})+\lambda{\bm{I}})^{-1}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}+\lambda{\bm{F}}^{(0)}\right), \qquad (54)

where ${\bm{q}}^{(k)}_m=\sum_j{\bm{W}}^{(k)}_{mj}{\bm{A}}_{mj}/d_m$, ${\bm{W}}^{(k)}_{ij}=\bm{1}_{i\neq j}\max\left(0,\frac{1}{2y_{ij}^{(k)}}-\frac{1}{2\gamma}\right)$, and $y^{(k)}_{ij}=\left\|\frac{{\bm{f}}_i^{(k)}}{\sqrt{d_i}}-\frac{{\bm{f}}_j^{(k)}}{\sqrt{d_j}}\right\|_2$.

In this formulation, $\lambda$ controls the intensity of the regularization in graph smoothing. In our experiments, we tune $\hat{\lambda}\coloneqq\frac{1}{1+\lambda}$, which is normalized into $(0,1)$. In Figure 12, the optimal value of $\hat{\lambda}$ is found almost always near $0.9$, regardless of the attack budget. This indicates that in our penalty formulation $\lambda$ is effectively decoupled from $\gamma$, which makes the tuning easier, in contrast to the commonly used formulation of MCP (Zhang, 2010a).

On the other hand, $\gamma$ has a more intricate impact on the performance of RUNG. Generally speaking, the smaller $\gamma$ is, the more edges get pruned, which leads to higher robustness but lower clean accuracy. We discuss three cases:

Small attack budget ($0\%, 5\%, 10\%$). The performance is largely determined by the clean accuracy. Besides, when $\gamma\rightarrow\infty$, RUNG becomes a state-of-the-art robust $\ell_1$ model. Therefore, a small $\gamma$ likely harms the clean performance more than it improves robustness over $\ell_1$ models, and the optimal $\gamma$ essentially recovers the performance of $\ell_1$ models.

Large attack budget ($20\%, 30\%, 40\%$). In these cases, $\gamma\rightarrow\infty$ is no longer a good choice because $\ell_1$ models begin to suffer from the accumulated bias effect. The optimal $\gamma$ is thus smaller (near $0.5$). However, for fairness, we choose the same $\gamma$ under different budgets in our experiments, so the reported RUNG fixes $\gamma=3$. In practice, when the possible attack budgets are known in advance, $\gamma$ can be tuned for even better performance.

Very large attack budget ($50\%, 60\%$). We do not include these scenarios because almost all GNNs perform poorly in this regime. However, we believe it can provide some insights into robust graph learning. Under these budgets, more than half of the edges are perturbed; in the context of robust statistics (e.g., mean estimation), the estimator would certainly break down. However, in our graph estimation problem, the input node features offer extra information, allowing us to exploit the graph information even beyond the breakdown point. In the “peak” near $(0.9, 0.5)$, RUNG achieves $>70\%$ accuracy, which is higher than MLP. This indicates that the edge weighting of RUNG is capable of securely harnessing the graph information even in the presence of strong adversarial attacks. The “ridge” near $\hat{\lambda}=0.2$, on the other hand, emerges because of MLP: when the regularization dominates, $\lambda\rightarrow\infty$ and $\hat{\lambda}\rightarrow 0$, so a small $\hat{\lambda}$ corresponds to a larger emphasis on the input node feature prior. Under large attack budgets, MLP delivers relatively good estimation, so a small $\hat{\lambda}$ is beneficial.

Figure 12: The performance dependence of RUNG on the hyperparameters $\gamma$ and $\lambda$, evaluated under different attack budgets. The attack setting is the global evasion attack and the dataset is Cora. Note that the x-axis is $\hat{\lambda}\coloneqq\frac{1}{1+\lambda}$ instead of $\lambda$.

G.2 GNN Layers

In RUNG, QN-IRLS is unrolled into GNN layers. We would naturally expect RUNG to have a sufficient number of layers so that the estimator converges as desired. We conduct an ablation study on the performance (clean and adversarial) of RUNG with different numbers of layers, and the results are shown in Figure 13. We make the following observations:

  • As the number of layers increases, RUNG exhibits better performance. This verifies the effectiveness of our proposed RUGE, as well as the stable convergence of QN-IRLS.

  • The performance of RUNG reaches a reasonably good level even with a small number of layers ($3$–$5$ layers), thanks to the accelerated convergence powered by QN-IRLS. This can further reduce the computational complexity of RUNG.

Figure 13: The performance dependence of RUNG on the number of aggregation layers.

Appendix H Robustness of GCN and APPNP

In addition to the formulation in Section 3 of the main text, we can also apply our edge reweighting technique to the GCN architecture. Essentially, the aggregation operation in GCN can be viewed as an APPNP layer with $\hat{\lambda}=0$. The GCN version of a layer in our model can then be formulated as

{\bm{F}}^{(k+1)}=\text{ReLU}\left(({\bm{W}}^{(k)}\odot\tilde{{\bm{A}}}){\bm{F}}^{(k)}{\bm{M}}^{(k)}\right), \qquad (55)

where ${\bm{M}}^{(k)}$ is the learned weight matrix of the $k$-th GCN layer.

GCN-based defenses are less robust than APPNP-based defenses

GCN consists of layers in which both feature transformation and message passing are performed. This coupled graph convolution operation weakens defense methods that rely on edge weighting, such as GNNGuard, models using the $\ell_1$ penalty, as well as our method using the MCP penalty.

Consider an edge added by the attacker (almost all attacks add new edges, as shown by our experiments, so this is almost always the case). A successful defense should detect the attack on this edge from the large difference between the nodes it connects, and then assign a zero weight, or at least a smaller weight, to this edge. However, in GCN, even if this edge is detected in the first layer's message passing, the subsequent feature transformation makes the node difference less likely to be preserved until the second layer. This is where the attack can evade the defense, and it is thus a vulnerability that lets adaptive attacks through. According to our experiments, using different defense parameters in different layers of GCN unfortunately does not help much either.

On the other hand, in APPNP the node features are also altered along the message-passing layers, but the change in node distances is more regulated than in GCN since the MLP is decoupled from the graph smoothing layers. Within the graph smoothing submodule, node differences simply decrease, which enables our defense based on node differences.

Experiments

It can be seen from Figure 14 that the correlation of node feature differences across layers in GCN is about 4 times lower than in APPNP, which means that an attack detected in the first layer is less likely to remain detectable in the second layer than in APPNP. This property of GCN makes the many defense methods that use the GCN architecture as a backbone less robust, as shown in the experiment results in Table 12 and Table 13. Nevertheless, our GCN-based MCP model outperforms the SOTA models using $\ell_1$ methods.
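The per-layer node-distance statistic behind Figure 14 can be computed with a short NumPy sketch like the following; features_per_layer (hidden node features collected at each propagation layer) and edges are hypothetical inputs, not variables from our code.

import numpy as np

def edge_distances(F, edges):
    """||f_i - f_j||_2 for every edge (i, j); F: (n, d) features, edges: (m, 2) int array."""
    src, dst = edges[:, 0], edges[:, 1]
    return np.linalg.norm(F[src] - F[dst], axis=1)

def layer_distance_correlation(features_per_layer, edges):
    """Pearson correlation of per-edge distances between consecutive layers (cf. Figure 14)."""
    dists = [edge_distances(F, edges) for F in features_per_layer]
    return [np.corrcoef(dists[k], dists[k + 1])[0, 1] for k in range(len(dists) - 1)]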

Figure 14: The distances $\|{\bm{f}}_i-{\bm{f}}_j\|_2$ between connected nodes in different layers. The linear transformation in the aggregation layers of GCN reduces the correlation between the node distances at different propagation layers.
Model Clean 5% 10% 20% 30% 40%
RUNG-$\ell_1$ 85.8±0.5 78.4±0.4 74.3±0.3 68.1±0.6 63.5±0.7 59.8±0.8
RUNG 84.6±0.5 78.9±0.4 75.7±0.2 71.8±0.4 67.8±1.3 65.1±1.2
RUNG-$\ell_1$-GCN 84.0±0.4 74.7±0.6 69.7±0.7 62.7±0.6 57.6±0.6 53.5±0.8
RUNG-GCN 82.6±0.6 76.1±0.7 71.0±0.9 64.3±1.1 59.9±0.8 56.5±1.1
Table 12: Comparison between APPNP-based and GCN-based QN-IRLS on Cora.
Model Clean 5% 10% 20% 30% 40%
RUNG-$\ell_1$ 75.5±1.1 69.3±1.2 65.9±1.2 61.1±1.1 57.2±1.4 53.9±1.3
RUNG 74.3±0.7 71.4±1.0 69.8±1.3 67.6±1.2 66.5±1.3 65.3±1.5
RUNG-$\ell_1$-GCN 73.8±1.2 66.1±1.1 61.9±1.0 56.0±0.7 51.3±0.8 47.7±0.6
RUNG-GCN 73.0±0.8 68.2±0.8 64.3±1.3 59.4±1.0 55.7±1.4 53.1±1.6
Table 13: Comparison between APPNP-based and GCN-based QN-IRLS on Citeseer.