
Spatio-Temporal Sparsification for General Robust Graph Convolution Networks

Mingming Lu1 Ya Zhang1
Abstract

Graph Neural Networks (GNNs) have attracted increasing attention due to their successful applications to various graph-structured data. However, recent studies have shown that adversarial attacks threaten the functionality of GNNs. Although numerous defense methods have been proposed from various perspectives, most of them are robust only in specific attack scenarios. To address this lack of robust generalization, we propose to defend GNNs against adversarial attacks by applying Spatio-Temporal sparsification (called ST-Sparse) to the hidden node representations of the GNN. ST-Sparse is similar in spirit to the Dropout regularization. Through extensive experimental evaluation with GCN as the target GNN model, we identify the following benefits of ST-Sparse: (1) ST-Sparse improves defense performance in most cases, increasing robust accuracy by up to 6%; (2) ST-Sparse exhibits robust generalization, as it can be integrated with the existing defense methods, similar to how Dropout is integrated into various deep learning models as a standard regularization technique; (3) ST-Sparse also exhibits ordinary generalization on clean datasets, in that ST-SparseGCN (the integration of ST-Sparse and the original GCN) even outperforms the original GCN, while the other three representative defense methods are inferior to the original GCN.

1 Introduction

Recently, Graph Neural Networks (GNNs) have attracted increasing attention due to their successful applications to various graph-structured data, such as social networks, chemical composition structures, and biological gene proteins (Zhou et al. 2018; Wu et al. 2019b). However, recent works (Sun et al. 2018; Xu et al. 2019a; Dai et al. 2018) have pointed out that GNNs are vulnerable to adversarial attacks, which can crash safety-critical GNN applications such as autonomous driving and medical diagnosis (Wu et al. 2019b).

To address this issue, numerous works (Sun et al. 2018; Zhang, Cui, and Zhu 2020; Jin et al. 2020) have been proposed to defend against adversarial attacks from the perspectives of data preprocessing (Wu et al. 2019a), structure modification (Wang et al. 2019), adversarial training (Feng et al. 2019), adversarial detection (Zhang, Hossain Khan, and Coates 2019), etc. However, our experimental study and the evaluation in existing work (Jin et al. 2020) show that none of the existing defense methods is superior to the others under all attacks, on all datasets, and for all perturbation sizes. This illustrates the limitation of the existing defense methods in terms of robust generalization capability.

Recently, (Tsipras et al. 2019) revealed that the existence of adversarial attacks might originate from the utilization of weakly correlated features, whose effect can be reduced by keeping only the strongly correlated features. This observation motivates us to adopt sparse representation, which is widely utilized in computational neuroscience (Ahmad and Scheinkman 2019), to reduce the effect of weakly correlated features. Thus, in this work, we propose a spatio-temporal feature sparsification framework to improve the robustness of GNN models.

The spatial feature sparsification (called TopK) in the proposed framework simply keeps the $k$ features with the largest values and sets all the other features to zero. In spirit, TopK is the same as the Dropout regularization technique (Srivastava et al. 2014), except that Dropout randomly drops neurons, while TopK orders the neurons according to their output values and keeps only those with the top $k$ values. Through experimental studies, we find that TopK can improve defense performance under four representative adversarial attacks on three typical benchmark datasets with various perturbation sizes.

However, the robustness brought by TopK comes at the expense of generalization capability. Compared with Dropout, TopK loses randomness, which sacrifices generalization, as the randomness in Dropout decomposes a complex model into an ensemble of a large number of simpler models. To address this issue, temporal feature sparsification is introduced to alternate the non-zero features (also called active features) across training epochs. Through this feature alternation, more features can participate in node representation in turn. Thus, spatial sparsification together with temporal sparsification (abbreviated as ST-Sparse) behaves similarly to the Dropout regularization technique, and ST-Sparse can therefore achieve a generalization capability similar to that of Dropout. Moreover, through experimental evaluation, we find that ST-Sparse achieves robust generalization in that it can be integrated with the existing defense methods to further improve model robustness, similar to the integration of Dropout into various deep learning models as a standard regularization technique.

Figure 1: The illustration of spatio-temporal sparsification, where the vertical rectangular bar associated with each node represents the node's feature vector and the colored/white squares in the bar denote active/inactive features. In the temporal sparsification part, the horizontal red rectangle illustrates the on-and-off activation pattern of temporal sparsity.

Fig. 1 illustrates the basic idea of the proposed ST-Sparse mechanism. The spatial sparsification transforms a dense hidden node vector of a GNN into a sparse high-dimensional vector, where only the top $k$ salient features are activated, as illustrated by the red rectangle in the top left part of Fig. 1. The temporal sparsification further sparsifies the active features along the time dimension during the GNN training process. More specifically, the duty cycle of each active feature dimension is kept sparse so that no active feature is used too intensively.

Note that the temporal sparsification is applied to feature dimensions instead of the features of individual nodes. On one hand, the salient features of an individual node usually concentrate on a few dimensions, so temporally sparsifying these features may significantly degrade model performance; on the other hand, the overall distribution over all nodes better reflects the temporal sparsity. By balancing the activation duty cycle among different dimensions, it is possible to avoid the intensive usage of certain dimensions, thereby increasing the model's expressive capability, which in turn increases the robustness of the model.

The main contributions of this work are summarized as follows.

  • From the perspective of spatio-temporal sparsity, we explore how to construct a robust feature space, in which the information propagation in GNNs is less vulnerable to adversarial attacks.

  • We provide a novel ST-Sparse mechanism, which utilizes TopK to realize spatial sparsity in a high-dimensional vector space, and adopts attention to balance the activation duty cycles among different dimensions, so as to realize temporal sparsity in the feature space.

  • To verify the effectiveness of ST-Sparse, we apply it to the graph convolution network (GCN) (Kipf and Welling 2017) (denoted as ST-SparseGCN). Extensive experiments on three benchmark datasets show that ST-SparseGCN can significantly improve the robustness, robust generalization, and ordinary generalization of GCN in terms of classification accuracy.

2 Related Works

Adversarial attacks on general graphs. The basic idea of adversarial attacks on graphs is to change the graph topology or feature information to intentionally interfere with the classifier. (Dai et al. 2018) studied a non-targeted evasion attack based on reinforcement learning. (Zügner, Akbarnejad, and Günnemann 2018) proposed Nettack, a poisoning attack on GCN, which modifies the training data to misclassify the target node. Further, (Zügner and Günnemann 2019) used the meta-gradient to solve the min-max problem in attacks during training, and proposed an attack method that reduces the overall classification performance. Besides, (Xu et al. 2019b) simplified the discrete graph problem by convex relaxation, and thus proposed a gradient-based topological attack.

Defense methods on general graphs. The existing defense methods can be classified from the perspectives of data preprocessing (Wu et al. 2019a), structure modification (Wang et al. 2019), adversarial training (Feng et al. 2019), modification of the objective function (Bojchevski and Günnemann 2019), adversarial detection (Zhang, Hossain Khan, and Coates 2019), and hybrid defense (Chen et al. 2019). The proposed ST-SparseGCN model can be regarded as a structure modification method, because it modifies the original GCN structure, as shown in Fig. 2. However, the proposed ST-Sparse defense method can also be integrated with other GCN defense models, such as GCN-Jaccard (Wu et al. 2019a) and GCN-SVD (Entezari et al. 2020), which can be regarded as data preprocessing methods; the integrated models can then be classified as hybrid defense models. Although dozens of defense methods on graphs have been proposed, none of them shows robust generalization, as none is superior to the others under all attacks, on all datasets, and for all perturbation sizes (Jin et al. 2020).

Sparsity and Robustness. The relation between sparsity and robustness has been revealed in the fields of image classification (Guo et al. 2018) and neuroscience (Ahmad and Scheinkman 2019). From the perspective of image classification, (Guo et al. 2018) clarified the inherent relation between sparsity and robustness through theoretical analysis and experimental evaluation. (Cosentino et al. 2019; Tsipras et al. 2019) revealed that the existence of adversarial attacks might originate from the utilization of weakly correlated features, whose effect can be reduced by keeping only the strongly correlated features. This also illustrates the necessity of sparsity for reducing the effect of weakly correlated features.

Difference from the Existing Methods. Unlike the existing works on GNN robustness, most of which assume certain prior knowledge concerning the attack, we intend to construct a robust feature space that can resist attacks without any prior knowledge about them, which can be called "black-box defense". (Zheng et al. 2020) also considered the relation between model robustness and sparsity. However, its sparsity refers to the sparsity of the graph structure, rather than the sparsity of the hidden node representations introduced by our ST-Sparse method. It is worth noting that in ST-Sparse, since the perturbation is absorbed in the hidden layer, there is no need to handle perturbations on the graph structure and node features separately.

3 Preliminaries

Notations

Given an undirected graph $G=(V,E,X)$, where $V=\{v_{1},v_{2},\ldots,v_{n}\}$ is a set of nodes with $|V|=n$, $E\subseteq V\times V$ is a set of edges that can be represented as an adjacency matrix $A\in\{0,1\}^{n\times n}$, and $X=[x_{1},x_{2},\ldots,x_{n}]^{T}\in\mathbb{R}^{n\times d}$ is a feature matrix with $x_{i}$ denoting the feature vector of node $v_{i}\in V$. $C=\langle c_{1},c_{2},\ldots,c_{n}\rangle$ is the class label vector with $c_{i}$ representing the label of node $v_{i}$.

Graph Convolution Networks

In this paper, we focus on GCNs for node classification. In particular, we consider the well-established model of (Kipf and Welling 2017). As a semi-supervised model, GCN learns a hidden representation for each node. The hidden vectors of all nodes in the $(l+1)$-th layer are represented recursively by the hidden vectors of the $l$-th layer as follows.

$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$ (1)

where $\tilde{A}=A+I_{n}$, $W^{(l)}\in\mathbb{R}^{d^{(l)}\times d^{(l+1)}}$ denotes the learnable weight matrix at layer $l$, $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$, and $\sigma(\cdot)$ is an activation function, such as ReLU. Initially, $H^{(0)}=X$.
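For concreteness, the propagation rule in Eq. (1) can be sketched in a few lines of PyTorch. This is a minimal dense-matrix illustration written by us, not the paper's implementation (which builds on PyG); the function and variable names are our own assumptions.

```python
import torch

def gcn_layer(A, H, W):
    """One GCN propagation step, Eq. (1):
    H' = sigma(D~^{-1/2} (A + I) D~^{-1/2} H W), with sigma = ReLU."""
    n = A.size(0)
    A_tilde = A + torch.eye(n)                  # add self-loops: A~ = A + I_n
    deg = A_tilde.sum(dim=1)                    # D~_ii = sum_j A~_ij
    d_inv_sqrt = deg.pow(-0.5)
    A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
    return torch.relu(A_hat @ H @ W)            # propagate, transform, activate
```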

4 The ST-SparseGCN Framework

In the following, we introduce the technical details of the proposed ST-SparseGCN. As shown in Fig. 2, ST-Sparse can be integrated into the GCN model as an activation layer that replaces the ReLU activation function. The ST-Sparse layer transforms the dense feature $h_{i}$ of each node $v_{i}$ into an ST-Sparse feature $s_{i}$. This transformation can be further decomposed into a spatial sparsification process and a temporal sparsification process.

Figure 2: The ST-SparseGCN framework.

The High-dimensional Sparse Space

First, we describe the mapping from the dense space to the high-dimensional space, which can be realized simply by replacing the parameter matrix $W^{(l)}$ in Eq. (1) with a high-dimensional version $W_{h}^{(l)}$, as shown in Eq. (2).

$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W_{h}^{(l)}\right),$ (2)

where $W_{h}^{(l)}\in\mathbb{R}^{d^{(l)}\times d_{h}}$. Compared to $d^{(l+1)}$ (the second dimension of $W^{(l)}$ in Eq. (1)), $d_{h}$ (the second dimension of $W_{h}^{(l)}$) is much larger. In Section 5, we illustrate the underlying reason for the high-dimensional space through experimental evaluation, which shows that a low dimension significantly reduces the performance of the proposed ST-SparseGCN. Thus, the high-dimensional space is one of the key factors in the effectiveness of the proposed ST-SparseGCN. It is worth noting that $d_{h}$ is the same for all layers except the input layer, i.e., $\forall l\geq 1$, $H^{(l)}\in\mathbb{R}^{n\times d_{h}}$ and $H^{(0)}=X\in\mathbb{R}^{n\times d}$.

Next, we formally define spatial sparsity as follows.

Definition 1.

Spatial Sparsity. $\forall v_{i}\in V$, its high-dimensional feature vector $h_{i}=\langle h_{i1},h_{i2},\ldots,h_{id_{h}}\rangle$ satisfies spatial sparsity if $||h_{i}||_{0}\ll d_{h}$, where $||\cdot||_{0}$ denotes the $\ell_{0}$-norm, i.e., the number of non-zero elements.

Def. 1 implies that the number of non-zero elements of a spatially sparse vector should be much smaller than the vector dimension. In the following, we adopt $s_{i}$ to denote the sparse version of $h_{i}$. Also, $S=[s_{1},s_{2},\ldots,s_{n}]^{T}$ represents the sparse matrix consisting of the sparse vectors of all nodes.

Although spatial sparsity ensures the feature sparsity of individual nodes, it cannot guarantee that individual features are sparse, i.e., that the number of nodes activated on any given feature is much smaller than the total number of nodes. For example, in Fig. 2, feature $j$ is not temporally sparse after spatial sparsification because too many nodes activate feature $j$. Through temporal sparsification, the non-zero elements associated with feature $j$ are gradually reduced.

This new type of sparsity can be illustrated through a simple calculation. If $\forall t\in[1,2,\ldots,T]$, where $T$ is the total number of training epochs, and $\forall v_{i}\in V$, $||s_{i}^{t}||_{0}\leq k$, then $||S^{t}||_{0}\leq n\times k$, where $s_{i}^{t}$ and $S^{t}$ represent $s_{i}$ and $S$ at epoch $t$, respectively, as there are $n$ nodes in total. Thus, on average, each feature will be on duty (i.e., take non-zero values) for at most $\frac{n\times k}{d_{h}}$ nodes, because there are $d_{h}$ features in total. Since $k\ll d_{h}$ according to Def. 1, it can be concluded that $\frac{n\times k}{d_{h}}\ll n$, where $n$ is the maximal possible number of non-zero elements for any feature at epoch $t$. Thus, from the feature's perspective, if the duty cycle (in terms of non-zero elements) is balanced over all features, each feature also exhibits sparsity.
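As a concrete illustration with assumed numbers (Cora's $n=2708$ nodes, together with hypothetical choices $d_{h}=2048$ and $k=64$, so that $k\ll d_{h}$), the average per-feature duty cycle is bounded by

$\frac{n\times k}{d_{h}}=\frac{2708\times 64}{2048}\approx 85\ll 2708=n.$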

The underlying reason for the necessity of duty-cycle balancing is that, if a feature is on duty for too many nodes, it may exhibit the Matthew effect: the more a feature is used at the current epoch, the more often it will be used in the following epochs. This Matthew effect can be exploited by an adversarial attacker through manipulating the heavily used features.

Thus, it is desirable to introduce temporal sparsity so that the duty cycles of the features are balanced over the training epochs. To formally define temporal sparsity, we introduce $s_{*j}^{t}$ to denote the values of the $j$-th feature over all nodes, which is the $j$-th column of $S^{t}$, i.e., $s_{*j}^{t}=\langle s_{1j}^{t},s_{2j}^{t},\ldots,s_{nj}^{t}\rangle$.

Based on the above description, the temporal sparsity of the $j$-th feature can be formally defined as follows.

Definition 2.

Temporal Sparsity. $\forall j\in\{1,\cdots,d_{h}\}$, the sequence $\langle s_{*j}^{1},\ldots,s_{*j}^{t},\ldots,s_{*j}^{T}\rangle$ satisfies temporal sparsity if

$\lim_{T\to+\infty}\frac{\sum_{t=1}^{T}||s_{*j}^{t}||_{0}}{T}=\frac{n\times k}{d_{h}}$

TopK-Based Spatial Sparsification

In our ST-SparseGCN model, the spatial sparsification is implemented through TopK, which simply selects the top $k_{\alpha}=\lfloor\alpha\cdot d_{h}\rfloor$ features for any $h_{i}\in H$, where $\alpha\in(0,1)$ is the spatial sparse ratio. TopK can be formalized as follows.

$s_{i}=\mathrm{TopK}(h_{i},k_{\alpha})=\begin{cases}h_{ij}, & j\in z_{i},\\ 0, & j\notin z_{i},\end{cases}$ (3)

where $z_{i}$ represents the set of features with the largest $k_{\alpha}$ values in $h_{i}$. In other words, TopK keeps the values of the top $k_{\alpha}$ features and sets all the other features to zero.

By replacing the activation function in Eq. (2) with TopK, we obtain a spatially sparsified GCN, formalized in Eq. (4).

$S^{(l+1)}=\mathrm{TopK}\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}S^{(l)}W_{h}^{(l)},\;k_{\alpha}\right),$ (4)

where $S^{(0)}=X$ initially. It is worth noting that the TopK function in Eq. (4) is a matrix version of the TopK function in Eq. (3); this matrix version selects the top $k_{\alpha}$ features for each node $v_{i}$ independently.
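The row-wise TopK of Eq. (4) can be sketched in PyTorch as follows. This is our own minimal illustration, not the paper's released code; the function name and the dense-tensor interface are assumptions, and a production version would operate on sparse tensors.

```python
import torch

def topk_sparsify(H, alpha):
    """Row-wise TopK, Eqs. (3)-(4): for each node (row of H), keep the
    k_alpha = floor(alpha * d_h) largest features and zero out the rest."""
    k_alpha = max(1, int(alpha * H.size(1)))    # k_alpha, with at least 1 feature
    vals, idx = torch.topk(H, k_alpha, dim=1)   # per-node top-k values and indices
    S = torch.zeros_like(H)
    S.scatter_(1, idx, vals)                    # keep h_ij for j in z_i, else 0
    return S
```

Note that only the selected entries receive gradients during backpropagation, which is consistent with the slower convergence discussed under "The Cost of Robustness" below.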

The spatial sparse ratio $\alpha$ in the TopK function is a hyperparameter to be tuned. Intuitively, on one hand, a small $\alpha$ implies fewer non-zero features, which might seriously compromise the generalization capability of the proposed model, because the number of vectors that can be represented in the high-dimensional space decreases with smaller $k_{\alpha}$. On the other hand, a large $\alpha$ may compromise the model's robustness. The appropriate value of $\alpha$ is evaluated in Section 5.

TopK vs. ReLU. In ST-SparseGCN, the ReLU function of GCN is replaced by the TopK function. The effect of this replacement is illustrated in Fig. 3(a), where GCN with TopK and GCN with ReLU are compared in terms of the ratio of activated neurons during the training process. From Fig. 3(a), it can be observed that TopK greatly reduces the number of activated neurons. TopK and ReLU can also be compared through their function curves, shown in Fig. 3(b), from which it can be observed that it is more difficult for a neuron to be activated through the TopK activation function.

(a) The comparison of activated neurons. (b) The comparison of function curves.
Figure 3: The comparison of TopK and ReLU.

The Cost of Robustness. (Xiao, Zhong, and Zheng 2020) proved that the computational complexity of TopK is asymptotically $O(N)$, the same as ReLU. However, TopK takes more time to converge in our experiments, which might originate from the spatial sparsity: since only a small number of neurons are activated, the gradient update covers only a small number of neurons in each epoch. Nevertheless, the computational cost of TopK can be reduced through optimized sparse matrix computation.

Attention-Based Temporal Sparsification

At first glance, it seems that temporal sparsity could be realized by applying the TopK function as $\mathrm{TopK}(s_{*j}^{t},\frac{n\times k}{d_{h}})$. However, this may reduce the number of non-zero features of certain nodes, which might compromise the generalization capability as discussed previously. Furthermore, it may cause sudden discontinuities in the model output.

To avoid the above issues, we propose an attention-based temporal sparsification mechanism: at any epoch $t$, for any node $v_{i}$, its feature $j$ is assigned an attention value $b_{ij}^{t}$. This attention value is adaptively adjusted according to the historical sparsity information of feature $j$, namely $||s_{*j}^{t^{\prime}}||_{0}$ ($t^{\prime}\in\{1,2,\cdots,t-1\}$). The adjusted attention value $b_{ij}^{t}$ is then used as a weight on the corresponding feature $j$ of node $v_{i}$ in the spatially sparsified GCN hidden representation (namely $S^{(l)}$, as defined in Eq. (4)), so that a feature with a larger sparsity value has a lower chance of being selected by the TopK function.

Concretely, in each epoch $t$, the attention mechanism updates the attention value of each node $v_{i}$'s feature $j$ based on the integration of the historical sparsity $\hat{s}_{ij}^{t}$ and the current sparsity of the feature (i.e., $||s_{*j}^{t}||_{0}$). If a feature's integrated sparsity is higher than those of the other features, its attention value $b_{ij}^{t}$ is reduced accordingly.

Formally, the integrated sparsity of any feature $j$ associated with node $v_{i}$ is updated as shown in Eq. (5).

$\hat{s}_{ij}^{t+1}=\hat{s}_{ij}^{t}+\tau\times||s_{*j}^{t}||_{0},$ (5)

where $\hat{s}_{ij}^{t}$ is the historical sparsity of node $v_{i}$'s feature $j$ before epoch $t$ and $\tau$ is a hyperparameter that controls the rate at which the sparsity information accumulates over epochs. Initially, $\hat{s}_{ij}^{0}=0$.

Based on the integrated sparsity $\hat{s}_{ij}^{t}$, the attention value $b_{ij}^{t}$ is updated through a smooth exponential function, as shown in Eq. (6).

$b_{ij}^{t}=\exp(-\gamma\hat{s}_{ij}^{t}),$ (6)

where $\gamma$ is a hyperparameter. From Eq. (6), it can be observed that the larger the integrated sparsity $\hat{s}_{ij}^{t}$, the smaller the updated attention value $b_{ij}^{t}$, because $\exp(-\gamma x)$ is monotonically decreasing in $x$ for $\gamma>0$.

From Eq. (5), it can be observed that the historical sparsity $\hat{s}_{ij}^{t}$ is actually independent of the node $i$, and by Eq. (6), so is the attention value $b_{ij}^{t}$. Thus, Eq. (5) and Eq. (6) need to be computed only once for all nodes, yielding feature $j$'s historical sparsity and attention values for all nodes, namely the vectors $\hat{s}_{*j}^{t}$ and $b_{*j}^{t}$, respectively.
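Since both updates are node-independent, Eqs. (5) and (6) admit a compact vectorized sketch in which a single $d_{h}$-dimensional state vector suffices for all nodes. As before, this is our own hedged illustration with assumed names, not the paper's code.

```python
import torch

def update_attention(s_hat, S, tau, gamma):
    """Temporal-sparsity attention update, Eqs. (5)-(6).
    s_hat: accumulated sparsity per feature dimension, shape (d_h,);
    S: current spatially sparsified representation S^t, shape (n, d_h)."""
    duty = (S != 0).sum(dim=0).float()   # ||s_{*j}^t||_0 for every feature j
    s_hat = s_hat + tau * duty           # Eq. (5): integrate the current duty cycle
    b = torch.exp(-gamma * s_hat)        # Eq. (6): busy features get small weights
    return s_hat, b                      # b is shared by all nodes (rows of B^t)
```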

From $b_{*j}^{t}$, $j\in\{1,\cdots,d_{h}\}$, we can construct the attention mask matrix $\mathcal{B}^{t}$ as follows.

$\mathcal{B}^{t}=\langle b_{*1}^{t},\ldots,b_{*j}^{t},\ldots,b_{*d_{h}}^{t}\rangle,$ (7)

Based on the attention mask matrix, the proposed ST-SparseGCN can be formalized through Eq. (8).

$S^{(l+1)}=\mathrm{TopK}\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\left(\mathcal{B}_{t}^{(l)}\odot S^{(l)}\right)W_{h}^{(l)},\;k_{\alpha}\right).$ (8)

Initially, $\mathcal{B}_{t}^{(0)}=0$ and $S^{(0)}=X$. Eq. (8) can be described as follows. At epoch $t$, in the $(l+1)$-th layer, the sparse matrix $S^{(l)}$ is first multiplied element-wise by the attention mask matrix $\mathcal{B}_{t}^{(l)}$ to temporally sparsify the feature space, so as to mitigate the Matthew effect. The sparsified matrix is then fed into GCN for information propagation among nodes, and the GCN output is further spatially sparsified through the TopK activation function. In the end, $S^{(L)}$ is passed to a fully connected layer with the softmax activation function to predict the labels $Y$.
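Putting the pieces together, one layer of Eq. (8) might look as follows, reusing topk_sparsify and the attention vector b from the sketches above. This is again a hedged sketch under assumed names; A_hat denotes the pre-computed $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$.

```python
def st_sparse_gcn_layer(A_hat, S, W_h, b, alpha):
    """One ST-SparseGCN layer, Eq. (8)."""
    S_masked = S * b                     # B^t (elementwise) * S: b broadcasts over rows
    H = A_hat @ S_masked @ W_h           # GCN propagation in the high-dim space
    return topk_sparsify(H, alpha)       # spatial sparsification via TopK
```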

5 Experimental Evaluation

Experimental Settings

Baselines. To evaluate the robustness and effectiveness of ST-SparseGCN, experiments are performed on the deep learning framework PyTorch (Steiner et al. 2019) and the GNN extension library PyG (Fey and Lenssen 2019). The proposed defense model (ST-SparseGCN) is compared with four baselines, three of which are representative graph defense methods, on the task of node-level semi-supervised classification as follows.

  • GCN (Kipf and Welling 2017): GCN simplifies graph convolution by using only the first-order polynomial, i.e., the immediate neighborhood. By stacking multiple convolutional layers, GCN achieves state-of-the-art performance on clean datasets.

  • GCN-Jaccard (Wu et al. 2019a): GCN-Jaccard utilizes the Jaccard similarity of features to prune perturbed graphs, based on the assumption that connected nodes usually show high feature similarity.

  • GCN-SVD (Entezari et al. 2020): GCN-SVD proposes a low-rank representation method, which approximates the original node representation with a low-rank one so as to resist adversarial attacks.

  • RGCN (Zhu et al. 2019): RGCN aims to defend against adversarial edges by modeling the latent node representations in the hidden layers as Gaussian distributions, which absorb the negative effects of adversarial edges.

We implement the above baseline methods by referring to the implementations in DeepRobust (Li et al. 2020).

Attacker models. To validate the defensive ability of our proposed defender, we choose four representative GCN attacker models.

  • DICE (Waniek et al. 2018): DICE randomly selects node pairs and flips their connectivity (i.e., removing existing edges and connecting non-adjacent nodes).

  • Mettack (Zügner and Günnemann 2019): Mettack aims at reducing the overall performance of GNNs via meta-learning. We use the Meta-Self variant of the attack.

  • PGD (Xu et al. 2019b): PGD is a projected gradient descent topology attack that targets a pre-defined GNN.

  • Min-Max (Xu et al. 2019b): Min-Max is a min-max topology attack that targets a re-trainable GNN. The minimization is optimized using the PGD method, and the maximization constrains the attack loss by retraining $W$.

Parameter Setting. The following common parameters are the same for ST-SparseGCN and all baselines. The number of GCN layers is 2 and the number of training epochs is 200. The optimizer is Adam (Kingma and Ba 2015) with a fixed learning rate of 0.01. The other hyperparameters of the baselines closely follow the benchmark setup, while the hyperparameters of ST-SparseGCN are tuned on the validation set to achieve the best robust performance. The parameter sensitivity of ST-SparseGCN is analyzed in Section 5. The final results of all experiments are obtained by averaging over 5 repeated runs. Our experiments are performed on an NVIDIA RTX 2080Ti GPU.

Datasets. ST-SparseGCN is evaluated on three well-known datasets: Cora, Citeseer, and Polblogs (Sen et al. 2008), where nodes represent documents and edges represent citations. The sparse bag-of-words feature vector associated with each node is the model input. Table 1 summarizes the statistics of the datasets. The same training, test, and validation splits are used across models to ensure a fair comparison.

Table 1: Statistics of datasets
Nodes Edges Features Classes
Cora 2708(1 graph) 5429 1433 7
Citeseer 3327(1 graph) 4732 3703 6
Polblogs 1490(1 graph) 33430 1490 2

Classification Performance Evaluation

To properly measure the impact of the perturbations, we first evaluate the performance of ST-SparseGCN and all baselines on the clean datasets. The average accuracy with standard deviation is reported in Table 2, which indicates that ST-SparseGCN achieves excellent performance on clean datasets. Compared with the four baselines, the superiority of ST-SparseGCN may come from the generalization capability brought by the temporal sparsification of the feature space.

Table 2: Classification accuracies (%) on clean datasets
Cora Citeseer Polblogs
GCN 81.6±0.6 70.7±0.8 85.9±0.9
GCN-Jaccard 78.9±0.8 71.4±0.7 50.4±0.9
GCN-SVD 68.4±0.8 59.8±0.9 80.4±0.4
RGCN 81.1±0.6 71.4±0.5 85.3±0.7
ST-SparseGCN 82.2±0.6 72.0±0.6 89.1±0.4
Table 3: Summary of mDRs (in percent) in classification accuracy, relative to GCN on the clean/original graph. Lower is better.
Dataset Cora Citeseer
Defender / Attacker DICE Mettack MinMax PGD DICE Mettack MinMax PGD
GCN 5.28 54.13 21.90 10.09 2.53 64.39 22.82 5.74
GCN_Jaccard 6.82 38.51 13.18 17.57 1.12 53.08 12.16 4.13
GCN_SVD 25.81 50.27 61.43 13.06 18.57 16.61 52.12 19.14
RGCN 5.04 35.13 20.24 13.11 1.46 61.56 11.22 10.65
ST-SparseGCN 4.33 48.42 17.44 7.21 1.92 60.74 17.96 4.73
ST-SparseGCN_Jaccard 6.23 29.53 11.05 8.53 0.52 45.69 8.20 2.30
ST-SparseGCN_SVD 24.87 47.02 59.13 13.40 18.22 15.82 47.82 19.31

Defense Performance Evaluation

In this section, we evaluate the overall defense performance of the proposed ST-SparseGCN by comparing it with various defense methods under different adversarial attackers and different perturbation sizes.

Perturbation Size. For each attacker, we increase the perturbation rate from 0 to 0.25 with a step size of 0.05. In general, the defense performance decreases as the perturbation size increases. To present the experimental results concisely, we define a new metric to evaluate defense performance, termed the dropping rate (DR), as shown in Eq. (9).

$DR(Acc,\widehat{Acc})=\frac{\widehat{Acc}-Acc}{Acc}$ (9)

where $\widehat{Acc}$ is the accuracy of GCN on the clean/original graph. The dropping rate characterizes defense performance by jointly measuring the performance degradation caused by the attacker and the performance remedy provided by the defense method. The smaller the dropping rate, the better the defense. We use the mean dropping rate (mDR) over the different perturbation sizes to describe the overall defense performance.
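In code, the metric reads as below. The helper names are ours, and the simple mean over perturbation rates is our reading of how mDR aggregates Eq. (9).

```python
def dropping_rate(acc, acc_clean):
    """Eq. (9): DR = (Acc_hat - Acc) / Acc, where acc_clean (Acc_hat) is the
    clean-graph GCN accuracy and acc the accuracy under attack. Lower is better."""
    return (acc_clean - acc) / acc

def mean_dropping_rate(accs_per_rate, acc_clean):
    """mDR: average DR over the accuracies measured at the different
    perturbation rates (0.05, 0.10, ..., 0.25)."""
    return sum(dropping_rate(a, acc_clean) for a in accs_per_rate) / len(accs_per_rate)
```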

Hybrid Defense. To illustrate that the proposed ST-Sparse defense method is complementary to the existing defense methods, we integrate ST-SparseGCN with two existing defense models, namely GCN_Jaccard and GCN_SVD, both of which improve GCN robustness through data preprocessing. The two integrated defense models are called ST-SparseGCN_Jaccard and ST-SparseGCN_SVD, respectively.

Experimental Results. The experimental results on the Cora and Citeseer datasets are enumerated in Table 3. Due to space limitations, the experimental results on the Polblogs dataset are not tabulated but are illustrated in Fig. 4 instead.

From Table 3, we can make the following observations: (i) the proposed ST-SparseGCN defense model or its variants (ST-SparseGCN_Jaccard and ST-SparseGCN_SVD) achieve the best defense performance under the various attackers on all datasets, as ST-SparseGCN constructs a robust feature space in each GCN layer; (ii) the hybrid defenders (ST-SparseGCN_Jaccard and ST-SparseGCN_SVD) improve defense performance over the corresponding baselines (GCN_Jaccard and GCN_SVD) in most cases, which implies that ST-SparseGCN is complementary to the existing defenders, since it defends against adversarial attacks from the perspective of sparsification in the feature space, a perspective complementary to most existing defense methods; (iii) no existing non-hybrid graph defender (including ST-SparseGCN alone) performs best under all attackers on all datasets. This phenomenon may originate from the fact that adversarial attacks succeed through various aspects of the GCN model. Thus, hybrid defense models deserve further exploration.

Fig. 4 summarizes the performance under different attackers with varied perturbation sizes on the Polblogs dataset. The results show that ST-SparseGCN again consistently achieves better performance than all the baselines, which demonstrates the superiority of the proposed model. The results in Table 3 and Fig. 4 together illustrate that our defender improves defense performance under all attackers on all datasets. It is worth noting that ST-SparseGCN does not rely on any prior knowledge of a particular adversarial attack method; the advantage of ST-Sparse might lie in its construction of a robust feature space, which is effective against various adversarial attacks.

Figure 4: Results of different defenders under different attackers on the Polblogs dataset.

ST-Sparse and Dropout

In this section, we compare the generalization performance and robustness of ST-Sparse and Dropout through experiments. Table 4 shows that both Dropout and ST-Sparse improve the generalization ability of the model, with ST-Sparse performing even better. In terms of robustness, ST-Sparse outperforms Dropout under attack. In addition, combining Dropout with ST-Sparse damages the performance of the model; this may be because the random inactivation of Dropout impairs ST-Sparse's ability to preferentially select features.

Table 4: Defense performance (in percent) in classification accuracy with Dropout and ST-Sparse.
GCN ST-SparseGCN
Clean 81.5±0.6 82.7±0.6
+Dropout 81.7±0.7 81.5±0.6
+Attacker 65.6±0.9 69.6±0.6
+Dropout+Attacker 66.9±0.7 68.3±0.8
  1. The dataset is Cora. The attacker is Mettack. The perturbation size is 0.05.
  2. The results are averaged over five runs.

Time Complexity

We conduct experiments on the dataset-model pairs mentioned above and report the runtime of a whole training procedure of 200 epochs on a single NVIDIA RTX 2080 Ti (cf. Fig. 5). Thanks to the PyG framework's ability to quickly process sparse data, the runtime is basically the same across models. Among them, GCN_Jaccard takes the most time because its data preprocessing is very slow.

Figure 5: The runtime of one complete training procedure for each model.

Ablation Study and Parameter Analysis

In this section, we evaluate the marginal effects of the temporal sparsity, the spatial sparsity, and the dimension $d_{h}$ on the accuracy and robustness of ST-SparseGCN. The performance on clean and perturbed datasets is shown separately. Due to space limitations, we only show the experimental results on the Cora dataset under Mettack with a perturbation size of 0.05. The experiments on the other datasets exhibit similar patterns and are included in the supplementary material.

(a) ST-sparsity. (b) Dimensions.
Figure 6: Results of parameter analysis.

In the experiments, the extent of the spatial sparsity is controlled by the spatial sparse ratio $\alpha$. Fig. 6(a) shows the influence of the temporal sparsity and of Mettack as the sparse ratio $\alpha$ increases from 0.02 to 0.5, where T-Sparsity and Perturbed denote the temporal sparsity and the perturbation from Mettack, respectively. From Fig. 6(a), by comparing the models with and without temporal sparsity, it can be observed that temporal sparsity effectively improves ST-SparseGCN's classification performance on both clean and perturbed graphs. This illustrates the benefits of temporal sparsity, which not only increases the model's generalization capability (from the improvement on clean graphs) but also improves its defense performance (from the improvement on perturbed graphs).

It can also be inferred from Fig. 6(a) that, when the spatial sparse ratio $\alpha$ varies from a small value (0.02) to a relatively larger value (0.08), both the models with and without temporal sparsity show significant performance improvement, which illustrates the necessity of spatial sparsity. Moreover, when $\alpha$ is larger than 0.08, the accuracy of ST-SparseGCN remains basically unchanged. A very small $\alpha$, however, degrades performance, probably because the number of non-zero features is insufficient to distinguish the different categories in the node classification task.

Fig. 6(b) illustrates the impact of the dimension $d_{h}$ on performance. It can be observed that, on both the clean and the perturbed dataset, performance drops drastically when $d_{h}$ is reduced to a small value. On the other hand, once $d_{h}$ increases beyond a certain level, performance remains almost stable. This illustrates that there exists an appropriate value for $d_{h}$. The results also show that the high-dimensional space enables the GCN model to perform more robustly under the perturbations incurred by attackers.

6 Conclusion

Although GNN models have emerged rapidly, they still suffer from the adversarial attack problem. Unlike current works, which defend against attacks in certain specific scenarios, this paper intends to address the attack problem universally. The proposed ST-Sparse mechanism is similar in spirit to the Dropout regularization technique, as it provides a general adversarial defense layer that can be readily integrated into numerous GNN variants. Meanwhile, ST-Sparse ensures both robust generalization and ordinary generalization. To evaluate ST-Sparse's effectiveness, we conduct extensive experiments. The results show that, in the face of four representative attack methods on three representative datasets with different levels of perturbation, ST-SparseGCN outperforms three representative defense methods.

References

  • Ahmad and Scheinkman (2019) Ahmad, S.; and Scheinkman, L. 2019. How Can We Be So Dense? The Benefits of Using Highly Sparse Representations. In ICML 2019 Workshop on Uncertainty & Robustness in Deep Learning. URL http://arxiv.org/abs/1903.11257.
  • Bojchevski and Günnemann (2019) Bojchevski, A.; and Günnemann, S. 2019. Certifiable Robustness to Graph Perturbations. In Advances in Neural Information Processing Systems 32, 8319–8330. Curran Associates, Inc. URL http://papers.nips.cc/paper/9041-certifiable-robustness-to-graph-perturbations.pdf.
  • Chen et al. (2019) Chen, J.; Wu, Y.; Lin, X.; and Xuan, Q. 2019. Can Adversarial Network Attack be Defended? CoRR abs/1903.05994. URL http://arxiv.org/abs/1903.05994.
  • Cosentino et al. (2019) Cosentino, J.; Zaiter, F.; Pei, D.; and Zhu, J. 2019. The Search for Sparse, Robust Neural Networks. arXiv preprint arXiv:1912.02386 (NeurIPS). URL http://arxiv.org/abs/1912.02386.
  • Dai et al. (2018) Dai, H.; Li, H.; Tian, T.; Xin, H.; Wang, L.; Jun, Z.; and Le, S. 2018. Adversarial attack on graph structured data. In 35th International Conference on Machine Learning, ICML 2018, volume 3, 1799–1808. ISBN 9781510867963.
  • Entezari et al. (2020) Entezari, N.; Al-Sayouri, S. A.; Darvishzadeh, A.; and Papalexakis, E. E. 2020. All you need is Low (rank): Defending against adversarial attacks on graphs. WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining 169–177. doi:10.1145/3336191.3371789.
  • Feng et al. (2019) Feng, F.; He, X.; Tang, J.; and Chua, T.-S. 2019. Graph Adversarial Training: Dynamically Regularizing Based on Graph Structure. IEEE Transactions on Knowledge and Data Engineering 1–1. ISSN 1041-4347. doi:10.1109/tkde.2019.2957786.
  • Fey and Lenssen (2019) Fey, M.; and Lenssen, J. E. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 1, 1–9. URL http://arxiv.org/abs/1903.02428.
  • Guo et al. (2018) Guo, Y.; Zhang, C.; Zhang, C.; and Chen, Y. 2018. Sparse DNNs with improved adversarial robustness. In Advances in Neural Information Processing Systems (NeurIPS 2018), 242–251.
  • Jin et al. (2020) Jin, W.; Li, Y.; Xu, H.; Wang, Y.; and Tang, J. 2020. Adversarial Attacks and Defenses on Graphs: A Review and Empirical Study. arXiv preprint arXiv:2003.00653 URL http://arxiv.org/abs/2003.00653.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. L. 2015. Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings 1–15.
  • Kipf and Welling (2017) Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 1–14.
  • Li et al. (2020) Li, Y.; Jin, W.; Xu, H.; and Tang, J. 2020. DeepRobust: A PyTorch Library for Adversarial Attacks and Defenses. arXiv preprint arXiv:2005.06149 .
  • Sen et al. (2008) Sen, P.; Namata, G. M.; Bilgic, M.; Getoor, L.; Gallagher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3): 93–106. ISSN 07384602. doi:10.1609/aimag.v29i3.2157.
  • Srivastava et al. (2014) Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1): 1929–1958.
  • Steiner et al. (2019) Steiner, B.; Devito, Z.; Chintala, S.; Gross, S.; Paszke, A.; Massa, F.; Lerer, A.; Chanan, G.; Lin, Z.; Yang, E.; Desmaison, A.; Tejani, A.; Kopf, A.; Bradbury, J.; Antiga, L.; Raison, M.; Gimelshein, N.; Chilamkurthy, S.; Killeen, T.; Fang, L.; and Bai, J. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS 2019).
  • Sun et al. (2018) Sun, L.; Wang, J.; Yu, P. S.; and Li, B. 2018. Adversarial Attack and Defense on Graph Data: A Survey. arXiv preprint arXiv:1812.10528 URL http://arxiv.org/abs/1812.10528.
  • Tsipras et al. (2019) Tsipras, D.; Santurkar, S.; Engstrom, L.; Turner, A.; and Madry, A. 2019. Robustness may be at odds with accuracy. In 7th International Conference on Learning Representations, ICLR 2019, 1–24.
  • Wang et al. (2019) Wang, S.; Chen, Z.; Ni, J.; Yu, X.; Li, Z.; Chen, H.; and Yu, P. S. 2019. Adversarial Defense Framework for Graph Neural Network. arXiv preprint arXiv:1905.03679 URL http://arxiv.org/abs/1905.03679.
  • Waniek et al. (2018) Waniek, M.; Michalak, T. P.; Wooldridge, M. J.; and Rahwan, T. 2018. Hiding individuals and communities in a social network. Nature Human Behaviour 2(2): 139–147. ISSN 23973374. doi:10.1038/s41562-017-0290-3.
  • Wu et al. (2019a) Wu, H.; Wang, C.; Tyshetskiy, Y.; Docherty, A.; Lu, K.; and Zhu, L. 2019a. Adversarial examples for graph data: Deep insights into attack and defense. IJCAI International Joint Conference on Artificial Intelligence 2019-August: 4816–4823. ISSN 10450823. doi:10.24963/ijcai.2019/669.
  • Wu et al. (2019b) Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Yu, P. S. 2019b. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems. URL http://arxiv.org/abs/1901.00596.
  • Xiao, Zhong, and Zheng (2020) Xiao, C.; Zhong, P.; and Zheng, C. 2020. Enhancing Adversarial Defense by k-Winners-Take-All. In International Conference on Learning Representations. URL https://openreview.net/forum?id=Skgvy64tvr.
  • Xu et al. (2019a) Xu, H.; Ma, Y.; Liu, H.; Deb, D.; Liu, H.; Tang, J.; and Jain, A. K. 2019a. Adversarial Attacks and Defenses in Images, Graphs and Text: A Review URL http://arxiv.org/abs/1909.08072.
  • Xu et al. (2019b) Xu, K.; Chen, H.; Liu, S.; Chen, P. Y.; Weng, T. W.; Hong, M.; and Lin, X. 2019b. Topology attack and defense for graph neural networks: An optimization perspective. In IJCAI International Joint Conference on Artificial Intelligence, volume 2019-Augus, 3961–3967. ISBN 9780999241141. ISSN 10450823. doi:10.24963/ijcai.2019/550.
  • Zhang, Hossain Khan, and Coates (2019) Zhang, Y.; Hossain Khan, S.; and Coates, M. 2019. Comparing and Detecting Adversarial Attacks for Graph Deep Learning. In ICLR 2019 Workshop: Representation Learning on Graphs and Manifolds, 1–7. URL https://www.kdd.in.tum.de/research/nettack/.
  • Zhang, Cui, and Zhu (2020) Zhang, Z.; Cui, P.; and Zhu, W. 2020. Deep Learning on Graphs: A Survey. IEEE Transactions on Knowledge and Data Engineering 1–1. ISSN 1041-4347. doi:10.1109/tkde.2020.2981333.
  • Zheng et al. (2020) Zheng, C.; Zong, B.; Cheng, W.; Song, D.; Ni, J.; Yu, W.; Chen, H.; and Wang, W. 2020. Robust Graph Representation Learning via Neural Sparsification. In Proceedings of ICML 2020.
  • Zhou et al. (2018) Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; and Sun, M. 2018. Graph Neural Networks: A Review of Methods and Applications 1–22. URL http://arxiv.org/abs/1812.08434.
  • Zhu et al. (2019) Zhu, D.; Cui, P.; Zhang, Z.; and Zhu, W. 2019. Robust graph convolutional networks against adversarial attacks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1399–1407. ISBN 9781450362016. doi:10.1145/3292500.3330851.
  • Zügner, Akbarnejad, and Günnemann (2018) Zügner, D.; Akbarnejad, A.; and Günnemann, S. 2018. Adversarial attacks on neural networks for graph data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2847–2856. ISBN 9781450355520. doi:10.1145/3219819.3220078.
  • Zügner and Günnemann (2019) Zügner, D.; and Günnemann, S. 2019. Adversarial attacks on graph neural networks via meta learning. In 7th International Conference on Learning Representations, ICLR 2019, 1–15.