
Deformable Graph Transformer

Jinyoung Park1, Seongjun Yun1, Hyeonjin Park2,3, Jaewoo Kang1,
Jisu Jeong2,3, Kyung-Min Kim2,3, Jung-Woo Ha2,3, Hyunwoo J. Kim1
Korea University1, NAVER CLOVA2, NAVER AI LAB3
{lpmn678,ysj5419,kangj,hyunwoojkim}@korea.ac.kr
{hyeonjin.park.ml,jisu.jeong}@navercorp.com
{kyungmin.kim.ml,jungwoo.ha}@navercorp.com
The first two authors contributed equally. Hyunwoo J. Kim is the corresponding author.
Abstract

Transformer-based models have recently shown success in representation learning on graph-structured data beyond natural language processing and computer vision. However, the success is limited to small-scale graphs due to the drawbacks of full dot-product attention on graphs, such as the quadratic complexity with respect to the number of nodes and message aggregation from enormous irrelevant nodes. To address these issues, we propose Deformable Graph Transformer (DGT) that performs sparse attention via dynamically sampled relevant nodes for efficiently handling large-scale graphs with a linear complexity in the number of nodes. Specifically, our framework first constructs multiple node sequences with various criteria to consider both structural and semantic proximity. Then, combined with our learnable Katz Positional Encodings, the sparse attention is applied to the node sequences for learning node representations with a significantly reduced computational cost. Extensive experiments demonstrate that our DGT achieves state-of-the-art performance on 7 graph benchmark datasets with $2.5\sim 449$ times less computational cost compared to Transformer-based graph models with full attention.

1 Introduction

Transformer (Vaswani et al., 2017) has proven its effectiveness in modeling sequential data in various tasks such as natural language understanding (Devlin et al., 2019; Yang et al., 2019; Brown et al., 2020) and speech recognition (Zhang et al., 2020; Gulati et al., 2020). Beyond sequential data, recent works (Dosovitskiy et al., 2021; Liu et al., 2021; Yang et al., 2021; Carion et al., 2020; Zhu et al., 2021; Zhao et al., 2021) have successfully generalized Transformer to various computer vision tasks such as image classification (Dosovitskiy et al., 2021; Liu et al., 2021; Yang et al., 2021), object detection (Carion et al., 2020; Zhu et al., 2021; Song et al., 2021), and 3D shape classification (Zhao et al., 2021). Inspired by the success of Transformer-based models, there have been recent efforts to apply the Transformer to graph domains by using graph structural information through structural encodings (Ying et al., 2021; Dwivedi & Bresson, 2020; Mialon et al., 2021; Kreuzer et al., 2021), and they have achieved the best performance on various graph-related tasks.

However, while existing Transformer-based graph models have shown their superiority on small-scale graphs, most of them have difficulty learning representations on large-scale graphs. Since Transformer-based graph models perform self-attention by treating each input node as an input token, the computational cost is quadratic in the number of input nodes, which is problematic on large-scale graphs. In addition, unlike graph neural networks that aggregate messages from local neighborhoods, Transformer-based graph models globally aggregate messages from numerous nodes. So, on large-scale graphs, a huge number of messages from falsely correlated nodes often overwhelm the information from relevant nodes. As a result, Transformer-based graph models often exhibit poor generalization performance. A simple way to address these issues is to perform masked attention, where the key and value pairs are restricted to the neighborhoods of query nodes (Dwivedi & Bresson, 2020). However, since masked attention has a fixed small receptive field, it struggles to learn representations on large-scale graphs that require a large receptive field.

In this paper, we propose a novel Transformer for graphs named Deformable Graph Transformer (DGT) that performs sparse attention with a small set of key and value pairs adaptively sampled considering both semantic and structural proximity. To be specific, our approach first generates multiple node sequences for each query node with diverse sorting criteria such as Personalized PageRank (PPR) score, BFS, and feature similarity. Then, our Deformable Graph Attention (DGA), the key module of DGT, dynamically adjusts offsets to sample the key and value pairs on the generated node sequences and learns representations with the sampled key-value pairs. In addition, we present simple and effective positional encodings to capture structural information. Motivated by the Katz index (Katz, 1953), which measures connectivity between nodes, we design Katz Positional Encoding (Katz PE) to incorporate structural similarity and distance between nodes on a graph into the attention. Our extensive experiments show that DGT achieves state-of-the-art performance on 7 benchmark datasets and outperforms existing Transformer-based graph models on all 8 datasets at a significantly reduced computational cost.

Our contributions are as follows:

  • We propose Deformable Graph Transformer (DGT) that performs sparse attention with a reduced number of keys and values for learning node representations, which significantly improves the scalability and expressive power of Transformer-based graph models.

  • We design deformable attention for graph-structured data, Deformable Graph Attention (DGA), that flexibly attends to a small set of relevant nodes based on various types of proximity between nodes.

  • We present learnable positional encodings named Katz PE to improve the expressive power of Transformer-based graph models by incorporating structural similarity and distance between nodes based on Katz index (Katz, 1953).

  • We validate the effectiveness of the Deformable Graph Transformer with extensive experimental results showing that our DGT achieves state-of-the-art performance on 7 graph benchmark datasets with $2.5\sim 449$ times less computational cost compared to Transformer-based graph models with full attention.

2 Related Works

Graph Neural Networks. Graph Neural Networks have become the de facto standard approach for various graph-related tasks (Kipf & Welling, 2017; Hamilton et al., 2017; Wu et al., 2019; Xu et al., 2018; Gilmer et al., 2017). Motivated by the success of attention mechanisms, several works apply attention to graph neural networks (Rong et al., 2020; Veličković et al., 2018; Brody et al., 2022; Kim & Oh, 2021). GAT (Veličković et al., 2018) and GATv2 (Brody et al., 2022) adaptively aggregate messages from neighborhoods with an attention scheme. However, these works often show poor performance on heterophilic graphs due to their homophily assumption that nodes within a small neighborhood have similar attributes and potentially the same labels. Hence, recent works (Abu-El-Haija et al., 2019; Pei et al., 2020; Zhu et al., 2020; Park et al., 2022) have been proposed to extend message aggregation beyond a few-hop neighborhood to cope with both homophilic and heterophilic graphs. H2GCN (Zhu et al., 2020) separates input features and aggregated features to preserve the information of input features. Deformable GCN (Park et al., 2022) improves the flexibility of convolution by performing deformable convolution.

Transformer-based Graph Models. Recently, several works (Ying et al., 2021; Dwivedi & Bresson, 2020; Mialon et al., 2021; Kreuzer et al., 2021; Wu et al., 2021) have adopted the Transformer architecture for learning on graphs. Graphormer (Ying et al., 2021) and GT (Dwivedi & Bresson, 2020) are built upon the standard Transformer architecture and incorporate structural information of graphs into the dot-product self-attention. However, these approaches, which we will refer to as ‘graph Transformers’ for brevity, are not suitable for large-scale graphs: referencing numerous key nodes for each query node is prohibitively costly, and the noisy features from irrelevant nodes hinder the attention module from learning a proper function. Although restricting the attention scope to local neighbors is a simple remedy to reduce the computational complexity, it fails to capture long-range dependency, which is crucial for large-scale or heterophilic graphs. To mitigate the shortcomings of existing Transformer-based graph models, we propose DGT equipped with deformable sparse attention that dynamically samples relevant nodes to efficiently learn powerful representations on both homophilic and heterophilic graphs with significantly improved scalability.

Sparse Transformers in Other Domains. Transformer (Vaswani et al., 2017) and its variants have achieved performance improvements in various domains such as natural language processing (Devlin et al., 2019; Brown et al., 2020) and computer vision (Dosovitskiy et al., 2021; Carion et al., 2020). However, these models require quadratic space and time complexity, which is especially problematic with long input sequences. Recent works (Choromanski et al., 2021; Jaegle et al., 2021; Kitaev et al., 2020) have studied this issue and proposed various efficient Transformer architectures. (Choromanski et al., 2021; Xiong et al., 2021) study low-rank approximations of attention to reduce the complexity. Perceiver (Jaegle et al., 2021; 2022) leverages a cross-attention mechanism to iteratively distill inputs into latent vectors to scale linearly with the input size. Sparse Transformer (Child et al., 2019) uses pre-defined sparse attention patterns, restricting attention to fixed local windows of keys. (Zhu et al., 2021) proposes sparse attention that dynamically samples a set of key/value pairs for each query without a fixed attention pattern. Inspired by this deformable attention (Zhu et al., 2021), we propose deformable attention for graph-structured data that flexibly attends to informative key nodes considering various types of proximity between nodes via multiple node sequences and our learnable Katz Positional Encodings.

3 Methods

The goal of our architecture is to address the limitations of Transformer-based graph models and generalize Transformers to large-scale graphs. Specifically, existing Transformer-based graph models suffer from two main challenges: 1) a scalability issue caused by the quadratic computational cost with respect to the number of nodes, and 2) the aggregation of distracting information, since messages from an enormous number of nodes are aggregated. To address these challenges, we propose Deformable Graph Transformer (DGT). Our framework is composed of two main components: 1) deformable attention that attends to only a small set of adaptively selected key nodes considering diverse relations between nodes, and 2) positional encodings that capture structural similarity and distance between nodes. Before introducing our proposed architecture, we revisit the basic concepts of graph neural networks and attention in Transformers.

3.1 Preliminaries

Graph Neural Networks (GNNs). Consider an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with a set of $N$ nodes $\mathcal{V}=\{v_{1},v_{2},\dots,v_{N}\}$ and a set of edges $\mathcal{E}=\{(v_{i},v_{j})\,|\,v_{i},v_{j}\in\mathcal{V}\}$ where the nodes $v_{i},v_{j}\in\mathcal{V}$ are connected. Each node $v_{i}\in\mathcal{V}$ has a feature vector $\mathbf{x}_{i}\in\mathbb{R}^{F}$, where $F$ is the dimensionality of the node feature, and a set of neighborhoods $\mathcal{N}(i)=\{v_{j}\in\mathcal{V}\,|\,(v_{i},v_{j})\in\mathcal{E}\}$.

Given a graph $\mathcal{G}$ and a set of node features $\{\mathbf{x}_{i}\}_{i=1}^{N}$, Graph Neural Networks (GNNs) aim to learn each node representation by an iterative aggregation of transformed representations of the node itself and its neighborhoods as follows:

\mathbf{h}^{(l)}_{i}=\sigma\left(\mathbf{W}^{(l)}\left(c_{ii}^{(l)}\mathbf{h}_{i}^{(l-1)}+\sum_{v_{j}\in\mathcal{N}(i)}c_{ij}^{(l)}\mathbf{h}_{j}^{(l-1)}\right)\right), \quad (1)

where $\mathbf{h}^{(l)}_{i}\in\mathbb{R}^{d^{(l)}}$ is a hidden representation of node $v_{i}$ in the $l$-th GNN layer, $\mathbf{h}^{(0)}_{i}=\mathbf{x}_{i}$, $\mathbf{W}^{(l)}\in\mathbb{R}^{d^{(l)}\times d^{(l-1)}}$ is a learnable weight matrix at the $l$-th GNN layer, $\sigma$ is a non-linear activation function, and $c_{ij}^{(l)}$ and $c_{ii}^{(l)}$ represent weights for aggregation characterized by each GNN. For example, GCN (Kipf & Welling, 2017) can be represented as a form of (1) if $c_{ij}=(\text{deg}(i)\,\text{deg}(j))^{-1/2}$ and $c_{ii}=(\text{deg}(i))^{-1}$, where $\text{deg}(i)$ is the degree of node $v_{i}$, and GAT (Veličković et al., 2018) learns $c_{ij}^{(l)}$ and $c_{ii}^{(l)}$ based on the attention mechanism.
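For concreteness, the aggregation in Eq. (1) can be sketched in a few lines of PyTorch. This is a minimal illustration assuming a dense adjacency matrix and the GCN-style weights described above; it is not the implementation used in this paper.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Minimal sketch of the aggregation in Eq. (1) with GCN-style weights."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # W^{(l)}

    def forward(self, h, adj):
        # h: (N, in_dim) node features, adj: dense (N, N) adjacency without self-loops
        deg = adj.sum(dim=1).clamp(min=1)
        c_ij = deg.pow(-0.5).unsqueeze(1) * adj * deg.pow(-0.5).unsqueeze(0)  # (deg(i)deg(j))^{-1/2}
        c_ii = torch.diag(deg.pow(-1.0))                                      # deg(i)^{-1}
        agg = c_ii @ h + c_ij @ h                                             # sum inside sigma in Eq. (1)
        return torch.relu(self.W(agg))                                        # sigma = ReLU here

# e.g.: SimpleGNNLayer(16, 32)(torch.randn(5, 16), torch.ones(5, 5) - torch.eye(5))
```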

Transformer-based Graph Models. Most Transformer-based graph models (Dwivedi & Bresson, 2020; Kreuzer et al., 2021) learn each node representation with the self-attention mechanism, as Transformers (Vaswani et al., 2017) learn each token with self-attention. Given a node $v_{q}$ with a corresponding node embedding $\mathbf{z}_{q}\in\mathbb{R}^{c}$ and a set of key/value vectors $\mathcal{F}=\{\mathbf{f}_{k}\}_{k=1}^{N}$, the Multi-Head Attention (MHA) for Transformer-based graph models is formulated as follows:

\text{MHA}\left(\mathbf{z}_{q},\mathcal{F}\right)=\sum_{m=1}^{M}\mathbf{W}_{m}\left[\sum_{k=1}^{N}\mathbf{A}_{mqk}\cdot\mathbf{W}^{\prime}_{m}\mathbf{f}_{k}\right], \quad (2)

where $m$ indexes the $M$ attention heads, and $\mathbf{W}_{m}\in\mathbb{R}^{c\times c_{v}}$ and $\mathbf{W}_{m}^{\prime}\in\mathbb{R}^{c_{v}\times c}$ are learnable weight matrices. The attention weight $\mathbf{A}_{mqk}$ between the query $\mathbf{z}_{q}\in\mathbb{R}^{c}$ and the key $\mathbf{f}_{k}\in\mathbb{R}^{c}$ is generally calculated as $\mathbf{A}_{mqk}=\frac{\exp\left[\mathbf{z}_{q}^{\top}\mathbf{U}^{\top}_{m}\mathbf{V}_{m}\mathbf{f}_{k}/\sqrt{c_{v}}\right]}{Z}$, where $Z\in\mathbb{R}$ is a normalization factor such that $\sum_{k=1}^{N}\mathbf{A}_{mqk}=1$, and $\mathbf{U}_{m},\mathbf{V}_{m}$ are weight matrices that compute queries and keys.
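The following is a minimal PyTorch sketch of the full multi-head attention in Eq. (2), where every query node attends to all $N$ key nodes. It is included only to make the quadratic cost explicit; the class and argument names are illustrative, not the models' released code.

```python
import torch
import torch.nn as nn

class FullGraphAttention(nn.Module):
    """Sketch of Eq. (2): every query attends to all N key/value nodes (quadratic in N)."""
    def __init__(self, c, c_v, num_heads):
        super().__init__()
        self.U = nn.ModuleList([nn.Linear(c, c_v, bias=False) for _ in range(num_heads)])
        self.V = nn.ModuleList([nn.Linear(c, c_v, bias=False) for _ in range(num_heads)])
        self.Wp = nn.ModuleList([nn.Linear(c, c_v, bias=False) for _ in range(num_heads)])  # W'_m
        self.W = nn.ModuleList([nn.Linear(c_v, c, bias=False) for _ in range(num_heads)])   # W_m
        self.scale = c_v ** 0.5

    def forward(self, z, f):
        # z: (N, c) query embeddings, f: (N, c) key/value node features
        out = 0
        for U, V, Wp, W in zip(self.U, self.V, self.Wp, self.W):
            A = torch.softmax(U(z) @ V(f).t() / self.scale, dim=-1)  # (N, N) attention matrix
            out = out + W(A @ Wp(f))                                  # aggregate values, project back
        return out
```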

By harnessing the power of multi-head self-attention, Transformer-based graph models have shown superior performance on graph-related tasks (Dwivedi & Bresson, 2020; Kreuzer et al., 2021). However, they have difficulty learning representations on large-scale graphs. Since large-scale graphs have a vast number of nodes, the scalability issue of multi-head attention is dramatically exacerbated. Also, the enormous number of keys to attend to per query node increases the risk of aggregating noisy information from irrelevant key nodes. In this paper, we design a sparse Transformer named Deformable Graph Transformer (DGT) that attends to only a small number of relevant keys through deformable sampling (Zhu et al., 2021). By reducing the number of key nodes, the Transformer-based architecture becomes effective and efficient on large-scale graphs.

3.2 Deformable Graph Attention

Figure 1: Overview of the deformable graph attention module. In the pre-processing phase, the NodeSort module first constructs multiple node sequences $\{S_{\pi q}\}_{\pi\in\Pi}$ depending on the query node $v_{q}$ by sorting nodes according to diverse criteria $\pi\in\Pi$. Then, kernel-based interpolation is applied at each offset to obtain values, where the offsets are computed from the queries with a linear projection. The deformable graph attention module aggregates the values of each head with attention weights to generate the output.

In this section, we present Deformable Graph Attention (DGA), the key module of our proposed Deformable Graph Transformer (DGT). An overview of the deformable graph attention is illustrated in Figure 1. The general idea of deformable graph attention is to adapt the deformable attention mechanism to graph-structured data: DGT performs attention over a partially selected set of relevant key nodes, which makes the Transformer-based architecture efficient and effective on large-scale graphs. However, extending the deformable attention mechanism to graph representation learning is non-trivial, since the offset-based interpolation in deformable attention only works in Euclidean space whereas graphs are non-Euclidean data. Also, even when embedding nodes of a graph into a low-dimensional space via graph embedding methods (e.g., Node2vec (Grover & Leskovec, 2016)), the nodes are irregularly distributed in the low-dimensional space, which makes it difficult to effectively deform an attention map.

To address this challenge, we propose a NodeSort module that converts a graph into a sorted sequence of nodes in a regular space. We define a base node $v_{b}$ as the first node in a sorted sequence, similar to an ego node in an ego graph. NodeSort sorts nodes differently depending on the base node. In other words, NodeSort provides a relative ordering that varies across base nodes, whereas a conventional topological sort yields a single (absolute) ordering for a graph. Specifically, given a base node, NodeSort sorts the nodes and returns a sequence of their features as follows:

S_{\pi b}=\text{NodeSort}_{\pi}(\mathcal{G},v_{b},\{\mathbf{x}_{i}\}_{i=1}^{N})=[\mathbf{x}_{\sigma_{\pi b}^{-1}(i)}]_{i=1}^{N}, \quad (3)

where $\pi$ denotes a specific criterion for sorting nodes and $\sigma_{\pi b}$ is a bijective mapping from $\mathcal{V}$ to $\mathcal{V}$ depending on a base node $v_{b}$. To consider both structural and semantic proximity, we generate multiple sorted node sequences for each node $v_{b}$, $\{S_{\pi b}\}_{\pi\in\Pi}$, from a set of diverse criteria $\Pi$, such as Personalized PageRank (PPR) score, BFS, and feature similarity. Note that $\sigma_{\pi b}$ redefines neighbors of each node $v_{b}$ based on various aspects $\pi\in\Pi$, beyond the 1-hop neighbors used in existing GNNs.
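As an illustration, a NodeSort step as in Eq. (3) for a single criterion could be sketched as follows. This is a simplified sketch under our own assumptions (an adjacency-list graph representation, power-iteration PPR, and cosine similarity for the feature criterion); it is not the released implementation.

```python
import torch
from collections import deque

def nodesort(adj_list, x, base, criterion="bfs", alpha=0.15, iters=50):
    """Return node features sorted relative to `base` under one criterion (cf. Eq. (3))."""
    N = len(adj_list)
    if criterion == "bfs":
        order, seen, queue = [], {base}, deque([base])
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in adj_list[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        order += [v for v in range(N) if v not in seen]        # unreachable nodes go last
    elif criterion == "ppr":
        # personalized PageRank by power iteration, restarting at the base node
        A = torch.zeros(N, N)
        for v, nbrs in enumerate(adj_list):
            for u in nbrs:
                A[u, v] = 1.0 / max(len(adj_list[v]), 1)
        p = torch.zeros(N)
        p[base] = 1.0
        r = p.clone()
        for _ in range(iters):
            r = (1 - alpha) * (A @ r) + alpha * p
        order = torch.argsort(r, descending=True).tolist()
    else:  # "feat": cosine similarity of features to the base node
        sim = torch.nn.functional.cosine_similarity(x, x[base].unsqueeze(0), dim=-1)
        order = torch.argsort(sim, descending=True).tolist()
    return x[order]                                             # the sequence S_{pi b}

# e.g.: sequences = {pi: nodesort(adj_list, x, q, pi) for pi in ("bfs", "ppr", "feat")}
```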

Now, we introduce our Deformable Graph Attention (DGA), a sparse attention that dynamically samples key/value pairs from the set of sorted sequences of node features. To benefit from various properties of graphs, the deformable graph attention module is designed to deal with diverse node sequences. Different from existing deformable sampling modules in computer vision, which only consider spatial proximity on a grid, DGA captures both structural and semantic proximity. Given the set of sorted sequences $\{S_{\pi q}\}_{\pi\in\Pi}$ and the query node features $\mathbf{z}_{q}$, DGA is defined as

\text{DGA}\left(\mathbf{z}_{q},\{S_{\pi q}\}_{\pi\in\Pi}\right)=\sum_{\pi\in\Pi}\sum_{m=1}^{M}\mathbf{W}_{\pi m}\left[\sum_{k=1}^{K}\mathbf{A}_{\pi mqk}\cdot\mathbf{W}^{\prime}_{\pi m}\tilde{S}_{\pi q}\left(\Delta\mathbf{p}_{\pi mqk}\right)\right], \quad (4)

where $\tilde{S}_{\pi q}(\Delta\mathbf{p}_{\pi mqk})$ denotes the representation of the $k$-th key, $K$ denotes the number of keys, and $\mathbf{W}_{\pi m}\in\mathbb{R}^{c\times c_{v}}$ and $\mathbf{W}^{\prime}_{\pi m}\in\mathbb{R}^{c_{v}\times c}$ are the learnable weight matrices for criterion $\pi\in\Pi$ and the $m$-th attention head. $\mathbf{A}_{\pi mqk}=\theta_{\pi mk}^{\text{att}}(\mathbf{z}_{q})$ denotes the attention weight between the $q$-th query and the $k$-th key for the $m$-th attention head and criterion $\pi$, computed by a linear function $\theta^{\text{att}}$; the attention weights lie in the interval $[0,1]$ and are normalized so that $\sum_{k=1}^{K}\mathbf{A}_{\pi mqk}=1$.

$\Delta\mathbf{p}_{\pi mqk}$ is a sampling offset of the $k$-th key for criterion $\pi$ and the $m$-th attention head, which is generated by $\theta_{\pi mk}^{\text{off}}(\mathbf{z}_{q})$, where $\theta^{\text{off}}$ is a linear function followed by an activation function. As the offset $\Delta\mathbf{p}_{\pi mqk}$ is fractional, we compute $\tilde{S}_{\pi q}(\mathbf{p})$ by kernel-based interpolation:

\tilde{S}_{\pi q}\left(\mathbf{p}\right)=\sum_{i}g(\mathbf{p},i)\cdot S_{\pi q}[i],\quad g(a,b)=\begin{cases}\exp\left(-\frac{(a-b)^{2}}{\gamma}\right), & \text{if }\lvert a-b\rvert<\epsilon\\ 0, & \text{otherwise}\end{cases} \quad (5)

where $\gamma\in\mathbb{R}^{++}$ denotes the bandwidth of the kernel, $\epsilon$ is a hyper-parameter for truncating the kernel, and $S_{\pi q}[i]$ is the node feature at the $i$-th index of the sequence $S_{\pi q}$.

A standard way of computing $\tilde{S}_{\pi q}(\mathbf{p})$ is to evaluate Eq. (5) with $g$ replaced by the linear interpolation kernel $g(a,b)=\max(0,1-\lvert a-b\rvert)$. However, this considers only the two nodes whose coordinates are the ceiling and floor of the point $\mathbf{p}$, which may not provide enough room for the offsets $\Delta\mathbf{p}$ to gather relevant information. As the scale of the graph increases, a target node requires many more nodes to learn its representation. We therefore apply the Radial Basis Function kernel in Eq. (5) so that $\tilde{S}_{\pi q}$ can attend to a wider range of nodes when representing query nodes.
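Putting Eqs. (4) and (5) together, a simplified single-criterion sketch of Deformable Graph Attention might look as follows. For brevity, the offsets produced by the linear layer are treated directly as fractional positions in the node sequence, and each query carries its own pre-gathered sequence tensor; both are simplifications of the method described above rather than the released code.

```python
import torch
import torch.nn as nn

class DeformableGraphAttention(nn.Module):
    """Simplified single-criterion sketch of DGA (Eq. (4)) with the RBF kernel of Eq. (5)."""
    def __init__(self, c, c_v, num_heads=4, num_keys=4, gamma=8.0, eps=4.0):
        super().__init__()
        self.M, self.K, self.gamma, self.eps = num_heads, num_keys, gamma, eps
        self.theta_off = nn.Linear(c, num_heads * num_keys)          # sampling positions Delta p
        self.theta_att = nn.Linear(c, num_heads * num_keys)          # attention logits A
        self.Wp = nn.Linear(c, num_heads * c_v, bias=False)          # W'_{pi m}, all heads at once
        self.W = nn.Linear(num_heads * c_v, c, bias=False)           # W_{pi m}, all heads at once

    def forward(self, z_q, S_q):
        # z_q: (N, c) query features; S_q: (N, L, c) one sorted node sequence per query
        N, L, _ = S_q.shape
        pos = self.theta_off(z_q).view(N, self.M, self.K)            # fractional positions in the sequence
        attn = torch.softmax(self.theta_att(z_q).view(N, self.M, self.K), dim=-1)
        idx = torch.arange(L, dtype=z_q.dtype, device=z_q.device)
        diff = pos.unsqueeze(-1) - idx.view(1, 1, 1, L)              # (N, M, K, L)
        g = torch.exp(-diff.pow(2) / self.gamma) * (diff.abs() < self.eps)   # truncated RBF, Eq. (5)
        values = self.Wp(S_q).view(N, L, self.M, -1)                 # (N, L, M, c_v)
        sampled = torch.einsum("nmkl,nlmd->nmkd", g, values)         # kernel-interpolated keys
        out = (attn.unsqueeze(-1) * sampled).sum(dim=2)              # attention-weighted sum over K keys
        return self.W(out.reshape(N, -1))
```

Summing the outputs of one such module per criterion $\pi\in\Pi$ recovers the outer sum over $\Pi$ in Eq. (4).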

3.3 Katz Positional Encoding

Positional encoding is a crucial component of Transformers for injecting domain-specific positional information into the attention mechanism. A major issue with positional encoding on graphs is the absence of absolute node positions, unlike in other domains. One remedy is to encode relations between nodes. Here, we propose learnable positional encodings, Katz PE, for Transformer-based graph models based on the connectivity between nodes. Specifically, inspired by the matrix of Katz indices (Katz, 1953), which counts all paths between nodes with a decaying weight $\beta$ to reflect the preference for shorter paths, i.e., $\hat{\mathbf{A}}=\sum_{k=1}^{\infty}\beta^{k-1}\mathbf{A}^{k}$, our method learns positional embeddings $\text{PE}_{i}$ of each node $i$ by a nonlinear transform of $\hat{\mathbf{A}}$ as follows:

\text{Katz PE}(v_{i})=\text{MLP}(\hat{\mathbf{A}}[v_{i}]^{T}), \quad (6)

where MLP is a Multi-Layer Perceptron and $\hat{\mathbf{A}}[v_{i}]$ is the row vector of node $v_{i}$ in $\hat{\mathbf{A}}$. For efficient computation, we limit the maximum $k$ in $\hat{\mathbf{A}}$ to $K$, i.e., $\hat{\mathbf{A}}=\sum_{k=1}^{K}\beta^{k-1}\mathbf{A}^{k}$. In addition, when $N$ is large, we sample $N^{\prime}$ anchor nodes with high degree and use the submatrix of Katz indices $\hat{\mathbf{A}}^{\prime}\in\mathbb{R}^{N\times N^{\prime}}$. Our learnable Katz PE is simple yet more effective for both our DGT and the vanilla Transformer than existing pre-computed positional encodings; see Section 4.3 for more details.
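A minimal sketch of the Katz PE computation in Eq. (6), assuming a dense adjacency matrix and a two-layer MLP; the hidden size and the default values of $\beta$ and $K$ here are illustrative, not the tuned hyper-parameters.

```python
import torch
import torch.nn as nn

class KatzPE(nn.Module):
    """Sketch of the learnable Katz positional encoding of Eq. (6)."""
    def __init__(self, num_nodes, hidden=64, out_dim=64, beta=0.1, K=3):
        super().__init__()
        self.beta, self.K = beta, K
        self.mlp = nn.Sequential(nn.Linear(num_nodes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, adj):
        # truncated Katz index matrix: A_hat = sum_{k=1..K} beta^{k-1} A^k
        A_hat = torch.zeros_like(adj)
        A_k = torch.eye(adj.size(0), device=adj.device)
        for k in range(1, self.K + 1):
            A_k = A_k @ adj
            A_hat = A_hat + (self.beta ** (k - 1)) * A_k
        return self.mlp(A_hat)        # row i is the positional embedding of node v_i
```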

3.4 Deformable Graph Transformer

Finally, we introduce our Deformable Graph Transformer (DGT), built upon the proposed Deformable Graph Attention and positional encoding. DGT first encodes the node feature $\mathbf{x}_{i}$ with a learnable function $f_{\theta}$, which can be an MLP, and combines it with the positional embedding from Eq. (6) as

\mathbf{z}_{i}^{(0)}=f_{\theta}(\mathbf{x}_{i})+\text{Katz PE}(v_{i}). \quad (7)

Then, given the set of sorted sequences $\{S_{\pi i}\}_{\pi\in\Pi}$ from Eq. (3), the $l$-th Deformable Graph Attention layer in DGT performs the attention mechanism with a sampled set of informative keys and applies a skip-connection and an MLP to update node representations as follows:

\hat{\mathbf{z}}_{i}^{(l)}=\text{DGA}\left(\mathbf{z}_{i}^{(l-1)},\{S_{\pi i}\}_{\pi\in\Pi},\mathbf{Z}^{(l-1)}\right)+\mathbf{z}_{i}^{(l-1)},\quad\mathbf{z}_{i}^{(l)}=\text{MLP}(\hat{\mathbf{z}}_{i}^{(l)})+\hat{\mathbf{z}}_{i}^{(l)}. \quad (8)

After a stack of $L$ Deformable Graph Attention blocks, each node representation $\mathbf{z}_{i}^{(L)}$ is used for node classification: an MLP followed by a softmax layer predicts $\hat{y}_{i}=\text{Softmax}(\text{MLP}(\mathbf{z}^{(L)}_{i}))$. Our loss function is the cross-entropy over nodes with ground-truth labels.
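The overall forward pass of Eqs. (7) and (8) can then be sketched as below, reusing the DeformableGraphAttention and KatzPE sketches from the previous subsections; the layer count, hidden sizes, and the plain two-layer MLP are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DGTBlock(nn.Module):
    """One layer of Eq. (8): deformable attention with a skip connection, then an MLP with a skip."""
    def __init__(self, dga, c):
        super().__init__()
        self.dga = dga                                          # a DeformableGraphAttention sketch
        self.mlp = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, c))

    def forward(self, z, S_q):
        z_hat = self.dga(z, S_q) + z                            # first residual update in Eq. (8)
        return self.mlp(z_hat) + z_hat                          # feed-forward residual in Eq. (8)

class DGTSketch(nn.Module):
    """Input encoding of Eq. (7), L stacked blocks, and a node-classification head."""
    def __init__(self, in_dim, c, num_classes, num_nodes, num_layers=2):
        super().__init__()
        self.encode = nn.Linear(in_dim, c)                      # f_theta in Eq. (7)
        self.pe = KatzPE(num_nodes, out_dim=c)                  # Katz PE, Eq. (6)
        self.blocks = nn.ModuleList(
            [DGTBlock(DeformableGraphAttention(c, c // 4), c) for _ in range(num_layers)])
        self.head = nn.Linear(c, num_classes)

    def forward(self, x, adj, S_q):
        z = self.encode(x) + self.pe(adj)                       # Eq. (7)
        for block in self.blocks:
            z = block(z, S_q)
        return self.head(z)   # logits; train with softmax + cross-entropy on labeled nodes
```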

3.5 Complexity Analysis

We compare the computational complexity of the self-attention used in most Transformer-based graph models with that of the Deformable Graph Attention (DGA) in our DGT. As the number of nodes $N$ increases, our DGA, whose complexity is linear in $N$, is more efficient than self-attention, whose complexity is quadratic in $N$.

Self-Attention in most Transformer-based graph models (Vaswani et al., 2017; Ying et al., 2021; Dwivedi & Bresson, 2020). Suppose $N$ is the number of nodes and $C$ is the dimensionality of hidden representations. The self-attention operation then incurs a large computational cost with complexity $\mathcal{O}\left(N^{2}C+NC^{2}\right)$.

Deformable Graph Attention. Let $K$ be the number of keys, $T$ the number of criteria, and $W$ the number of nonzero values of $g$ in Eq. (5). Then the computational complexity of Deformable Graph Attention is $\mathcal{O}(N\cdot(C^{2}T+WKCT))$, which is linear with respect to the number of nodes $N\gg W,K,C,T$. The details are in Sec. A.
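As a back-of-the-envelope illustration of these two complexity expressions (not the measured FLOPs reported in Table 2), one can compare them directly; the constants C = 64, T = 3, K = 4, and W = 8 below are assumed values for illustration only.

```python
def full_attention_cost(N, C):
    # O(N^2 C + N C^2): pairwise dot-product scores plus per-node projections
    return N * N * C + N * C * C

def dga_cost(N, C, T=3, K=4, W=8):
    # O(N (C^2 T + W K C T)): per-node projections plus kernel-interpolated keys
    return N * (C * C * T + W * K * C * T)

for N in (2_277, 168_114):   # a Chameleon-sized vs. a twitch-gamers-sized graph
    print(N, full_attention_cost(N, 64) / dga_cost(N, 64))
```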

4 Experiments

In this section, we evaluate the effectiveness of our proposed Deformable Graph Transformer (DGT) against state-of-the-art models on node classification benchmark datasets.

4.1 Experimental Setup

Datasets. We validate the effectiveness of our model on node classification using four heterophilic graph datasets and four homophilic graph datasets, which are distinguished by the edge-based homophily ratio (Zhu et al., 2020) defined as $h=\frac{\lvert\{(v_{i},v_{j}):(v_{i},v_{j})\in\mathcal{E}\land y_{i}=y_{j}\}\rvert}{\lvert\mathcal{E}\rvert}$. The edge-based homophily ratios of the datasets range from $h=0.22$ (very heterophilic) to $h=0.81$ (very homophilic). For large-scale graphs, we evaluate our method on the twitch-gamers, ogbn-arxiv, and Reddit datasets. More details about the datasets are in Sec. C.1.
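For reference, the edge-based homophily ratio h can be computed in a few lines; this sketch assumes a PyTorch Geometric-style (2, E) edge_index tensor and integer node labels y.

```python
import torch

def edge_homophily(edge_index, y):
    """Fraction of edges whose endpoints share a label (the edge-based homophily ratio h)."""
    src, dst = edge_index                      # edge_index: (2, E) tensor, y: (N,) integer labels
    return (y[src] == y[dst]).float().mean().item()

# e.g.: edge_homophily(torch.tensor([[0, 1, 2], [1, 2, 0]]), torch.tensor([0, 0, 1]))  # -> 0.333...
```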

Baselines. To demonstrate the effectiveness of our Deformable Graph Transformer (DGT), we compare DGT with the following baselines: (1) six standard GNNs: GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), JKNet (Xu et al., 2018), SGC (Wu et al., 2019), and GATv2 (Brody et al., 2022); (2) four GNNs designed for heterophilic settings: MixHop (Abu-El-Haija et al., 2019), Geom-GCN (Pei et al., 2020), H2GCN (Zhu et al., 2020), and DeformableGCN (Park et al., 2022); (3) four Transformer-based architectures for graphs: Transformer (Vaswani et al., 2017), Graphormer (Ying et al., 2021), GT-full (Dwivedi & Bresson, 2020), and GT-sparse (Dwivedi & Bresson, 2020); and (4) two variants of our proposed model: DGT-light using a single sorting criterion ($\lvert\Pi\rvert=1$) and DGT using multiple sorting criteria ($\lvert\Pi\rvert=3$).

Implementation details. The best model on the validation split is used for reporting the performance. We adopt splits of (48%/32%/20%) of nodes per class for (train/validation/test) following (Pei et al., 2020; Zhu et al., 2020), and the experiments are repeated 30 times on the Actor, Squirrel, Chameleon, Cora, and Citeseer datasets. For twitch-gamers, ogbn-arxiv, and Reddit, the experiments are conducted with the splits provided by (Lim et al., 2021), (Hu et al., 2020), and (Hamilton et al., 2017), respectively, and repeated 10 times. More implementation details are in Sec. C.2.

Table 1: Evaluation results on node classification task (Mean accuracy (%) ± 95% confidence interval). OOM denotes ‘out-of-memory’. Bold indicates the model with the best performance and underline indicates the second best model.
Model Actor Squirrel Chameleon Cora Citeseer twitch-gamers ogbn-arxiv Reddit Avg. Rank
# Nodes 7,600 5,201 2,277 2,708 3,327 168,114 169,343 232,965
# Edges 26,659 198,353 31,371 5,278 4,552 6,797,557 1,166,243 11,606,919
Hom. ratio h 0.22 0.22 0.23 0.81 0.74 0.55 0.66 0.76
GNN-based Models
MLP 35.05±0.38 31.66±0.82 47.11±0.71 75.10±0.84 73.54±0.69 61.14±0.06 53.89±0.21 70.03±0.16 12.38
GCN 30.13±0.30 50.42±0.66 66.33±0.64 87.22±0.26 76.08±0.43 64.34±0.12 71.27±0.11 95.06±0.03 7.25
GAT 30.25±0.39 54.26±1.21 66.85±0.88 86.21±0.43 75.71±0.42 62.90±0.22 70.92±0.11 OOM 9.13
GraphSAGE 35.24±0.49 43.75±0.75 63.28±0.68 86.94±0.36 76.25±0.53 64.73±0.11 70.19±0.11 96.27±0.01 7.13
JKNet 30.39±0.35 55.17±0.62 67.81±0.86 87.17±0.33 76.33±0.53 65.08±0.07 71.00±0.15 95.28±0.02 6.25
SGC 29.43±0.41 35.07±0.51 49.95±1.15 87.33±0.39 75.47±0.56 60.47±0.14 66.56±0.01 94.72±0.00 11.25
GATv2 30.54±0.41 57.41±0.94 67.25±0.58 86.10±0.41 75.63±0.49 64.15±0.09 71.01±0.15 OOM 8.25
MixHop 35.79±0.33 38.78±0.86 59.27±0.83 87.16±0.38 75.95±0.57 65.20±0.12 71.47±0.15 96.23±0.04 6.25
Geom-GCN 31.53±0.31 37.98±0.42 60.70±0.91 85.38±0.55 76.57±0.56 N/A N/A N/A 10.50
H2GCN 35.32±0.34 36.89±0.80 58.21±0.70 87.73±0.64 76.88±0.54 OOM OOM OOM 8.50
DeformableGCN 36.53±0.42 62.09±0.68 71.03±0.57 87.32±0.44 76.67±0.43 OOM 70.22±0.19 OOM 6.00
Transformer-based Graph Models
Transformer 36.61±0.39 31.00±0.60 45.93±0.83 73.75±0.71 72.99±0.61 OOM OOM OOM 12.63
Graphormer 36.54±0.44 36.25±0.72 50.15±1.26 73.44±0.90 72.60±0.63 OOM OOM OOM 12.00
GT-full 34.53±0.38 32.33±0.64 49.07±1.25 69.51±1.01 70.18±0.67 OOM OOM OOM 13.63
GT-sparse 34.69±0.35 44.22±0.67 64.82±0.57 85.63±0.44 75.49±0.58 63.09±0.71 71.45±0.14 OOM 8.75
DGT-light (Ours) 36.86±0.53 62.58±0.57 73.04±0.65 86.60±0.60 75.72±0.40 65.59±0.25 71.18±0.13 96.14±0.05 4.38
DGT (Ours) 36.93±0.39 63.78±0.59 73.48±0.88 87.55±0.59 77.04±0.57 66.09±0.22 71.77±0.10 96.32±0.02 1.13

4.2 Experimental Results

Table 2: Efficiency comparisons on Transformer-based graph models. † denotes the performance measured by CPU implementation.
Chameleon Cora Citeseer Squirrel twitch-gamers ogbn-arxiv
# Nodes 2,277 2,708 3,327 5,201 168,114 169,343
# Edges 31,371 5,278 4,552 198,353 6,797,557 1,166,243
Model FLOPs Acc. FLOPs Acc. FLOPs Acc. FLOPs Acc. FLOPs Acc. FLOPs Acc.
Transformer 1.06G 45.93±0.83 1.26G 73.75±0.71 2.29G 72.99±0.61 4.29G 31.00±0.60 3622G 59.85 OOM OOM
Graphormer 1.78G 50.15±1.26 2.26G 73.44±0.90 3.79G 72.60±0.63 7.88G 36.25±0.72 OOM OOM OOM OOM
GT-full 1.07G 49.07±1.25 1.27G 69.51±1.01 2.31G 70.18±0.67 4.31G 32.33±0.64 3623G 59.18 OOM OOM
GT-sparse 0.43G 64.82±0.57 0.43G 85.63±0.44 0.99G 75.49±0.58 1.49G 44.22±0.67 17.04G 63.09±0.71 20.24G 71.45±0.14
DGT-light (Ours) 0.43G 73.04±0.65 0.36G 86.60±0.60 0.87G 75.72±0.40 1.24G 62.58±0.57 8.05G 65.59±0.25 5.02G 71.18±0.13
DGT (Ours) 0.49G 73.48±0.88 0.65G 87.55±0.59 1.05G 77.04±0.57 2.63G 63.78±0.59 16.19G 66.09±0.22 6.66G 71.77±0.10
Table 3: Performance comparisons on different ordering and criteria $\pi$ for constructing node sequences.
Name Ordering Criteria ($\pi$) Cora Citeseer Chameleon Squirrel
AR Absolute Random 81.53±1.21 72.27±0.72 70.51±0.45 62.44±0.62
AB Absolute BFS 81.22±1.14 72.28±0.69 71.95±0.49 62.31±0.89
AM Absolute BFS,PPR,Feat 83.28±0.77 70.20±0.84 72.41±0.57 62.28±0.81
RB Relative BFS 86.60±0.60 75.72±0.40 73.04±0.65 62.58±0.57
RM Relative BFS,PPR,Feat 87.55±0.59 77.04±0.57 73.48±0.88 63.78±0.59
Table 4: Effects of different positional encodings.
Model Positional Encoding Cora Citeseer Chameleon Squirrel
Transformer w/o PE 73.75±0.71 72.99±0.61 45.93±0.83 31.00±0.60
Node2Vec 81.52±0.68 72.07±0.58 49.00±0.82 39.15±0.53
Laplacian Eigvecs 69.51±1.01 70.18±0.67 49.07±1.25 32.33±0.64
Katz PE (ours) 81.44±1.16 73.02±0.71 72.13±0.60 59.79±0.76
DGT (ours) w/o PE 86.32±0.52 76.61±0.61 59.44±0.87 46.59±0.75
Node2Vec 86.80±0.51 75.36±0.54 62.02±0.63 50.26±0.54
Laplacian Eigvecs 83.94±0.67 76.21±0.61 58.02±0.91 43.86±0.67
Katz PE (ours) 87.55±0.59 77.04±0.57 73.48±0.88 63.78±0.59
(a) Graphormer
(b) DGT (ours)
Figure 2: Visualization of the 20 most important key nodes for a given query node $v_{q}$ in (a) Graphormer and (b) DGT on the Chameleon validation set. Red nodes denote nodes with the same label as the query node, whereas blue nodes have different labels. Orange dashed lines represent connections between the query node and key nodes with the same label as the query node, and gray dashed lines represent connections between nodes with different labels.

Table 1 summarizes the node classification results of our proposed DGT and DGT-light, GNN-based models, and Transformer-based graph models. DGT consistently shows superior performance across all eight datasets, achieving state-of-the-art performance on seven datasets and competitive performance on the remaining one. In particular, our DGT significantly outperforms Transformer-based baselines by a large margin, up to 106%. Surprisingly, existing Transformer-based graph models show poor performance compared to GNN baselines except on Actor. This implies that Transformer-based graph models fail to filter out irrelevant messages and focus on useful nodes. In addition, they are not applicable to large-scale graphs such as ogbn-arxiv, twitch-gamers, and Reddit due to their huge computational costs.

On the other hand, our DGT consistently outperforms both GNNs and Transformer-based graph models on almost all datasets and efficiently handles large-scale graphs by utilizing a small set of relevant nodes. GNNs generally perform well on homophilic graphs such as Cora and Citeseer, but show relatively inferior performance on heterophilic graphs such as Actor, Squirrel, Chameleon, and twitch-gamers. This is because most GNNs aggregate over directly connected nodes even in heterophilic graphs. Instead, our DGT and DGT-light show larger performance gains on heterophilic graph datasets since they capture long-range dependencies, which is important for learning on heterophilic graphs.

We also compare our DGT and DGT-light with other Transformer-based architectures to validate the efficiency (FLOPs) of our approach in Table 2. The table shows that DGT and DGT-light improve not only the performance but also the efficiency with deformable graph attention on various datasets. As the number of nodes increases, the gap in FLOPs between the Transformer and the DGT variants becomes larger. In particular, on twitch-gamers, DGT-light (8.05G) reduces the computational cost by a factor of 449 compared to the Transformer (3622G).

4.3 Quantitative Analysis

Here, we provide quantitative results of additional experiments to demonstrate the contribution of each component of our Deformable Graph Transformer. We first provide an ablation study of the NodeSort module and then examine the effectiveness of our positional encoding in comparison with popular positional encoding methods.

NodeSort module. We conduct an ablation study of the NodeSort module to study the effect of the relative ordering that varies depending on base nodes and of the various sorting criteria. We compare several constructions: absolute ordering with random permutation (AR), absolute ordering with BFS (AB), absolute ordering with multiple criteria (AM), relative ordering with BFS (RB), and relative ordering with multiple criteria (RM). As shown in Table 3, the absolute ordering approaches (AR, AB, AM) show poor performance, even worse than MLP on Citeseer. The absolute ordering with BFS (AB) shows no performance gain over a randomly permuted absolute ordering (AR) on three datasets. This shows that a single absolute node sequence is not sufficient to learn complex relationships between nodes in the graph. On the other hand, the relative ordering with BFS (RB) yields a significant performance improvement of up to 5.07%. Further, the relative ordering with multiple criteria (RM) consistently outperforms the relative ordering with BFS (RB).

Positional encoding. To validate the effectiveness of our positional encoding, we compare our proposed PE method with models without positional encodings (w/o PE) and with other encoding methods such as Node2Vec (Grover & Leskovec, 2016) and the Laplacian eigenvectors used in GT (Dwivedi & Bresson, 2020) on four datasets. We use Transformer and DGT as the base models. Table 4 demonstrates that our positional encoding is effective for both base models. In particular, it improves the performance by 17.19% compared to DGT without positional encoding on Squirrel.

4.4 Qualitative Analysis

We provide a qualitative analysis to understand why DGT is effective. We visualize the top 20 key nodes with the highest attention scores for a given query node $v_{q}$ in Graphormer and DGT. In DGT, we compute the attention score of each node $v_{i}$ as $w_{i}=\sum_{k}\mathbf{A}_{\pi mqk}\cdot g(\Delta\mathbf{p}_{\pi mqk},i)$. As shown in Figure 2, within the 1-hop neighborhood of the given query node $v_{q}$, 6 out of 7 neighbors have different labels. So, both DGT and Graphormer aggregate messages (dashed lines) from remote nodes beyond 1-hop neighbors through the attention mechanism. 17 of the top 20 nodes in DGT have the same label as the query, whereas only 4 out of 20 nodes do in Graphormer. Also, the ratio of attention scores on nodes with the same label, i.e., $\sum_{\{v_{i}:v_{i}\in\mathcal{V}\land y_{q}=y_{i}\}}w_{i}/\sum_{v_{i}\in\mathcal{V}}w_{i}$, is 0.97 for DGT and 0.21 for Graphormer. This evidences that our DGT effectively performs sparse attention by focusing on a small set of relevant key nodes compared to Graphormer.
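For reproducibility of this analysis, the same-label attention ratio and the top-20 selection described above can be computed with a short sketch like the following, assuming the per-node scores w_i have already been aggregated from the attention weights and kernel values.

```python
import torch

def same_label_attention_ratio(w, y, q):
    """Fraction of the query's total attention mass on nodes sharing the query's label."""
    mask = (y == y[q]).float()
    return (w * mask).sum().item() / w.sum().item()

def top_k_keys(w, k=20):
    """Indices of the k most attended key nodes used for the visualization."""
    return torch.topk(w, k).indices

# e.g.: same_label_attention_ratio(torch.rand(100), torch.randint(0, 5, (100,)), q=0)
```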

5 Conclusion

We propose Deformable Graph Transformer (DGT) that performs a sparse attention, named Deformable Graph Attention (DGA), for learning node representations on large-scale graphs. With Deformable Graph Attention, our model addresses two limitations of Transformer-based graph models: the scalability issue and the aggregation of noisy information. Different from standard deformable attention (Zhu et al., 2021; Xia et al., 2022), Deformable Graph Attention considers both structural and semantic proximity based on diverse node sequences. Also, we design simple and effective positional encodings for graph Transformers. Our extensive experiments demonstrate that DGT outperforms existing Transformer-based graph models on eight graph datasets. We hope our work paves the way for generalizing Transformers to large-scale graphs.

References

  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML, 2019.
  • Brody et al. (2022) Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? In ICLR, 2022.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
  • Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019.
  • Choromanski et al. (2021) Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In ICLR, 2021.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • Dwivedi & Bresson (2020) Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. In AAAI Workshop, 2020.
  • Fey & Lenssen (2019) Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLRW, 2019.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
  • Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
  • Gulati et al. (2020) Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech, 2020.
  • Hamilton et al. (2017) William Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In NeurIPS, 2020.
  • Jaegle et al. (2021) Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In ICML, 2021.
  • Jaegle et al. (2022) Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. In ICLR, 2022.
  • Katz (1953) Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
  • Kim & Oh (2021) Dongkwan Kim and Alice Oh. How to find your friendly neighborhood: Graph attention design with self-supervision. In ICLR, 2021.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Kitaev et al. (2020) Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
  • Kreuzer et al. (2021) Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. In NeurIPS, 2021.
  • Lim et al. (2021) Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. In NeurIPS, 2021.
  • Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • Mialon et al. (2021) Grégoire Mialon, Dexiong Chen, Margot Selosse, and Julien Mairal. Graphit: Encoding graph structure in transformers. arXiv:2106.05667, 2021.
  • Park et al. (2022) Jinyoung Park, Sungdong Yoo, Jihwan Park, and Hyunwoo J Kim. Deformable graph convolutional networks. In AAAI, 2022.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Workshop, 2017.
  • Pei et al. (2020) Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In ICLR, 2020.
  • Rong et al. (2020) Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In NeurIPS, 2020.
  • Rozemberczki & Sarkar (2021) Benedek Rozemberczki and Rik Sarkar. Twitch gamers: a dataset for evaluating proximity preserving and structural role-based node embeddings. arXiv:2101.03091, 2021.
  • Rozemberczki et al. (2021) Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Song et al. (2021) Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han, Byeongho Heo, Wonjae Kim, and Ming-Hsuan Yang. ViDT: An efficient and effective fully transformer-based object detector. In ICLR, 2021.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
  • Tang et al. (2009) Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In SIGKDD, pp.  807–816, 2009.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • Wang et al. (2019) Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv:1909.01315, 2019.
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
  • Wu et al. (2021) Zhanghao Wu, Paras Jain, Matthew Wright, Azalia Mirhoseini, Joseph E Gonzalez, and Ion Stoica. Representing long-range context for graph neural networks with global attention. In NeurIPS, 2021.
  • Xia et al. (2022) Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In CVPR, 2022.
  • Xiong et al. (2021) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In AAAI, 2021.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
  • Yang et al. (2021) Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. In NeurIPS, 2021.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
  • Ying et al. (2021) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In NeurIPS, 2021.
  • Zhang et al. (2020) Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In ICASSP, 2020.
  • Zhao et al. (2021) Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In ICCV, 2021.
  • Zhu et al. (2020) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In NeurIPS, 2020.
  • Zhu et al. (2021) Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2021.

Appendix A Complexity for Deformable Graph Attention

Let $M$ be the number of heads, $K$ the number of keys, $T$ the number of criteria, and $W$ the number of nonzero values of $g$ in Eq. (5). In the Deformable Graph Attention (Eq. (4)), calculating the sampling offsets $\Delta\mathbf{p}_{\pi mqk}$ and attention weights $\mathbf{A}_{\pi mqk}$ has complexity $\mathcal{O}(NCMKT)$. Then, given the sampling offsets and attention weights, the complexity of calculating Eq. (4) is $\mathcal{O}(NC^{2}T+NKC^{2}T+WNKCT)$, where $W$ is the number of nonzero values of $\tilde{S}_{\pi q}(\Delta\mathbf{p}_{\pi mqk})$ and the factor $W$ comes from the kernel-based interpolation. We can compute the linear transformation of the interpolated features $\mathbf{W}^{\prime}_{\pi m}\tilde{S}_{\pi q}(\Delta\mathbf{p}_{\pi mqk})$ by interpolating the linearly transformed features $\mathbf{W}^{\prime}_{\pi m}\mathbf{X}$, so the complexity of calculating Eq. (4) becomes $\mathcal{O}(NC^{2}T+WNKCT)$. As a result, the overall complexity of Deformable Graph Attention is $\mathcal{O}(N\cdot(C^{2}T+WKCT+CMKT))$. In our implementation, we set $M=4$, $K=4$, and $C=64$ by default, thus $MK<C$ and the complexity is $\mathcal{O}(N\cdot(C^{2}T+WKCT))$, which is linear with respect to the number of nodes $N\gg W,K,C,T$.

Appendix B Additional Experiments

Here, we provide additional experimental results to analyze the contributions of each component of our Deformable Graph Transformer (DGT) including the ablation study of the NodeSort module and our kernel-based interpolation.

B.1 NodeSort module.

We conduct an ablation study of the NodeSort module to validate the effectiveness of a single node sequence with each criterion and of multiple node sequences. We use three criteria for sorting: Breadth-First Search (BFS), Personalized PageRank score (PPR), and feature similarity (Feature). As shown in Table 5, DGT with multiple node sequences shows the best performance on all four datasets. In particular, on the Cora dataset, multiple node sequences improve accuracy by 3.46% over the model with only the feature similarity criterion.

Table 5: Performance comparisons on different criteria for constructing node sequences.
Criteria Squirrel Chameleon Cora Citeseer
BFS 62.58±0.57 73.04±0.65 86.60±0.60 75.72±0.40
PPR 62.57±0.59 72.66±0.77 86.31±0.45 75.42±0.64
Feature 62.66±0.73 72.67±0.80 84.09±0.62 75.31±0.61
BFS + PPR + Feature 63.78±0.59 73.48±0.88 87.55±0.59 77.04±0.57

B.2 Kernel-based interpolation.

In Figure 3, we conduct an ablation study of our kernel-based interpolation with $\epsilon=1,2,3,4,5,6,7,8$ in $g$ on the Cora dataset. The model shows the worst performance when $\epsilon=1$. As the value of $\epsilon$ increases, the model performs better. We believe that $\epsilon$ needs to be sufficiently large to find relevant nodes for a query node.

We also compare the performance of the bilinear interpolation used in (Zhu et al., 2021) and our kernel-based interpolation in Table 6. As shown in the table, our kernel-based interpolation improves the learnability of offsets compared to bilinear interpolation.

Figure 3: Analysis of our kernel-based interpolation according to $\epsilon$ in $g$.
Table 6: Performance comparisons on different interpolation methods.
Method Cora Citeseer Chameleon Squirrel twitch-gamers
Bilinear interpolation 86.02±0.53 76.61±0.56 68.51±0.86 61.69±0.65 65.30±0.15
Kernel-based interpolation (Ours) 87.55±0.59 77.04±0.57 73.48±0.88 63.78±0.59 66.09±0.22

Appendix C Experimental Details

C.1 Datasets

We use four heterophilic graph datasets and four homophilic graph datasets for our experiments. To the best of our knowledge, none of the datasets used in our experiments contain personally identifiable information or offensive content.

For homophilic graphs, we use two Planetoid datasets (Citeseer and Cora) (Sen et al., 2008), the OGB node classification dataset ogbn-arxiv (Hu et al., 2020) (Copyright (c) 2019 OGB Team; MIT License), and Reddit (Hamilton et al., 2017). The Planetoid and ogbn-arxiv datasets are citation networks whose nodes represent papers and edges indicate citations between papers. Node labels are the topics of each paper, and node features are the bag-of-words of papers in the Planetoid datasets and word2vec embeddings of papers in ogbn-arxiv. The Reddit dataset is composed of Reddit posts, and the node label is the community to which a post belongs.

For heterophilic graphs, we use four graph datasets: Squirrel and Chameleon (Rozemberczki et al., 2021) (Copyright (c) 2007 Free Software Foundation, Inc.; GNU General Public License), Actor (Tang et al., 2009) (Wikipedia: Text of Creative Commons Attribution-ShareAlike 3.0 Unported License), and twitch-gamers (Lim et al., 2021; Rozemberczki & Sarkar, 2021) (Copyright (c) 2019 Benedek Rozemberczki; MIT License). Squirrel and Chameleon are web page datasets collected from Wikipedia (Rozemberczki et al., 2021), where the nodes are web pages, edges are links between them, node features are keywords of the pages, and labels are five categories based on the monthly traffic of the web pages. Actor is an actor co-occurrence network, where nodes are actors, edges represent co-occurrence on the same Wikipedia page, node features are keywords in the Wikipedia pages, and labels are five categories based on the words of each actor's Wikipedia page. twitch-gamers is an online social network (Lim et al., 2021; Rozemberczki & Sarkar, 2021), where nodes are Twitch users, edges are mutual follower relationships between them, and node features include the number of views, creation and update dates, language, lifetime, and whether the account is dead. Node labels denote whether the channel has explicit content.

C.2 Implementation Details

As written in Section 4.1 of the main paper, all models, including the baselines and our DGT, are optimized using the Adam optimizer (Kingma & Ba, 2015). The experiments are conducted on a single GPU (RTX 2080 Ti or A6000). For all cases, learning rates and weight decay are optimized in the same search space: learning rate in {0.05, 0.01, 0.005} and weight decay in {1e-3, 5e-4, 5e-5}. The hidden dimension is fixed at 64 for all cases. Dropout (Srivastava et al., 2014) is applied, and models are trained for up to 1000 epochs with early stopping (patience 100). The best model on the validation split is used for reporting the performance. We adopt splits of (48%/32%/20%) of nodes per class for (train/validation/test) following (Pei et al., 2020; Zhu et al., 2020), and the experiments are repeated 30 times on the Actor, Squirrel, Chameleon, Cora, and Citeseer datasets. For twitch-gamers, ogbn-arxiv, and Reddit, the experiments are conducted with the splits provided by (Lim et al., 2021), (Hu et al., 2020), and (Hamilton et al., 2017), respectively, and repeated 10 times.

Implementation details of Baselines.

We implement the baselines using PyTorch (Paszke et al., 2017) (Copyright (c) 2016- Facebook, Inc. (Adam Paszke); BSD-3-Clause License), the geometric deep learning library PyTorch Geometric (Fey & Lenssen, 2019) (Copyright (c) 2020 Matthias Fey; MIT License), and the Deep Graph Library (Wang et al., 2019) (Copyright (c) 2019 DGL; Apache License, Version 2.0). The detailed experimental settings for each baseline model are as follows:

  • MLP : learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}

  • GCN (Kipf & Welling, 2017) (Copyright (c) 2016 Thomas Kipf; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, and layer in {1, 2, 3, 4}.

  • GAT (Veličković et al., 2018) (Copyright (c) 2018 Petar Veličković; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, layer in {1, 2, 3, 4}, and the number of heads in {1, 4}.

  • GraphSAGE (Hamilton et al., 2017) (Copyright (c) 2018 Petar Veličković; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, and layer in {1, 2, 3, 4}.

  • JKNet (Xu et al., 2018): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, and layer in {1, 2, 3, 4}.

  • SGC (Wu et al., 2019) (Copyright (c) 2019 Tianyi Zhang; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, and K in {1, 2, 3, 4}.

  • GATv2 (Brody et al., 2022) (Copyright (c) 2022 Shaked Brody; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, layer in {1, 2, 3, 4}, and the number of heads in {1, 4}.

  • MixHop (Abu-El-Haija et al., 2019) (Copyright (c) 2007 Free Software Foundation, Inc.; GNU License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, layer in {1, 2, 3, 4}, and the maximum value of P is 2.

  • Geom-GCN (Pei et al., 2020) (Copyright (c) 2019 Geom-GCN Authors; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}.

  • H2GCN (Zhu et al., 2020): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, and K in {1, 2}.

  • DeformableGCN (Park et al., 2022): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, block in {1, 2}, and the number of neighbors in {5, 10, 15, 20, 25}.

  • Transformer (Vaswani et al., 2017) (Copyright (c) 2017 Victor Huang; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, the number of blocks in {1, 2, 3}, and the number of heads in {1, 4}.

  • Graphormer (Ying et al., 2021) (Copyright (c) Microsoft Corporation; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, the number of blocks in {1, 2, 3}, and the number of heads in {1, 4}.

  • GT (Dwivedi & Bresson, 2020) (Copyright (c) 2020 Vijay Prakash Dwivedi, Xavier Bresson; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, the number of blocks in {1, 2, 3}, and the number of heads in {1, 4}.

  • GT-sparse (Dwivedi & Bresson, 2020) (Copyright (c) 2020 Vijay Prakash Dwivedi, Xavier Bresson; MIT License): learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, the number of blocks in {1, 2, 3}, and the number of heads in {1, 4}.

Ours.

We implement our models (DGT and DGT-light) using PyTorch and PyTorch Geometric. For constructing node sequences, we employ three criteria: BFS, sorting by Personalized PageRank score, and sorting by feature similarity between a base node and the other nodes in the graph. The detailed hyperparameter settings are as follows: learning rate in {0.05, 0.01, 0.005}, weight decay in {1e-3, 5e-4, 5e-5}, the number of blocks in {1, 2}, and $\gamma$ in {1, 2, 4, 8, 16, 32, 64, 128, 256}.

Appendix D Discussion

Negative societal impacts.

Deformable Graph Transformer (DGT) is a graph Transformer for learning node representations on large-scale graphs. We believe that this paper has no direct negative societal impacts. However, similar to other neural networks for graph-structured data such as Graph Neural Networks (GNNs), DGT could be used in malicious applications. Graph neural networks can be applied to predict otherwise unknowable information such as religion, political views, and personal preferences based on graph information. If this technology were applied to identify the personalities of voters and influence their behavior, it could cause interference in elections. To mitigate these societal problems, illegal data collection and data harvesting should be prevented, and benchmark datasets should be released without any private information.

Limitations.

Deformable Graph Transformer (DGT) performs deformable attention based on diverse node sequences. The node sequences play the role of coordinates in 2D images. However, different from works based on deformable attention in computer vision (Zhu et al., 2021; Xia et al., 2022), nodes in graphs do not have exact positions. So, the position of each node needs to be defined with appropriate mechanisms based on the properties of graphs. In this paper, we utilize multiple criteria for generating node sequences to capture various properties of the graphs for node classification. For other tasks such as link prediction and graph classification, other criteria for generating node sequences could lead to more powerful representations on graphs. We leave this for future work.