
PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication

Cheng Wan
Rice University
chwan@rice.edu
Youjie Li
UIUC
li238@illinois.edu
Cameron R. Wolfe
Rice University
crw13@rice.edu
Anastasios Kyrillidis
Rice University
anastasios@rice.edu
Nam Sung Kim
UIUC
nam.sung.kim@gmail.com
Yingyan Lin
Rice University
yl150@rice.edu
Abstract

Graph Convolutional Networks (GCNs) are the state-of-the-art method for learning graph-structured data, and training large-scale GCNs requires distributed training across multiple accelerators such that each accelerator is able to hold a partitioned subgraph. However, distributed GCN training incurs prohibitive overhead of communicating node features and feature gradients among partitions for every GCN layer during each training iteration, limiting the achievable training efficiency and model scalability. To this end, we propose PipeGCN, a simple yet effective scheme that hides the communication overhead by pipelining inter-partition communication with intra-partition computation. It is non-trivial to pipeline for efficient GCN training, as the communicated node features/gradients become stale and thus can harm the convergence, negating the pipeline benefit. Notably, little is known regarding the convergence rate of GCN training with both stale features and stale feature gradients. This work not only provides a theoretical convergence analysis but also finds the convergence rate of PipeGCN to be close to that of vanilla distributed GCN training without any staleness. Furthermore, we develop a smoothing method to further improve PipeGCN's convergence. Extensive experiments show that PipeGCN can largely boost the training throughput (1.7×~28.5×) while achieving the same accuracy as its vanilla counterpart and existing full-graph training methods. The code is available at https://github.com/RICE-EIC/PipeGCN.

1 Introduction

Graph Convolutional Networks (GCNs) (Kipf & Welling, 2016) have gained great popularity recently as they have demonstrated state-of-the-art (SOTA) performance in learning graph-structured data (Zhang & Chen, 2018; Xu et al., 2018; Ying et al., 2018). Their promising performance results from their ability to capture diverse neighborhood connectivity. In particular, a GCN aggregates the features of all neighbors of a given node and then updates that node's feature via a multi-layer perceptron. Such a two-step process (neighbor aggregation and node update) empowers GCNs to better learn graph structures. Despite their promising performance, training GCNs at scale is still challenging, as a prohibitive amount of compute and memory resources is required to train on a real-world large-scale graph, let alone to explore deeper and more advanced models. To overcome this challenge, various sampling-based methods have been proposed to reduce the resource requirement at the cost of incurring feature approximation errors. A straightforward instance is to create mini-batches by sampling neighbors (e.g., GraphSAGE (Hamilton et al., 2017) and VR-GCN (Chen et al., 2018)) or to extract subgraphs as training samples (e.g., Cluster-GCN (Chiang et al., 2019) and GraphSAINT (Zeng et al., 2020)).

In addition to sampling-based methods, distributed GCN training has emerged as a promising alternative, as it enables full-graph training of large GCNs across multiple accelerators such as GPUs. This approach first partitions a giant graph into multiple small subgraphs, each of which fits into a single GPU, and then trains these partitioned subgraphs locally on GPUs together with the indispensable communication across partitions. Following this direction, several recent works (Ma et al., 2019; Jia et al., 2020; Tripathy et al., 2020; Thorpe et al., 2021; Wan et al., 2022) have been proposed and have verified the great potential of distributed GCN training. $P^3$ (Gandhi & Iyer, 2021) follows another direction that splits the data along the feature dimension and leverages intra-layer model parallelism for training, which shows superior performance on small models.

In this work, we propose a new method for distributed GCN training, PipeGCN, which targets achieving full-graph accuracy with boosted training efficiency. Our main contributions are as follows:

  • We first analyze two efficiency bottlenecks in distributed GCN training, i.e., the significant communication overhead and the frequently synchronized communication, and then propose a simple yet effective technique called PipeGCN that addresses both bottlenecks by pipelining inter-partition communication with intra-partition computation to hide the communication overhead.

  • We address the challenge raised by PipeGCN, i.e., the resulting staleness in communicated features and feature gradients (but neither weights nor weight gradients), by providing a theoretical convergence analysis and showing that PipeGCN's convergence rate is $\mathcal{O}(T^{-\frac{2}{3}})$, i.e., close to that of vanilla distributed GCN training without staleness. To the best of our knowledge, we are the first to provide a theoretical convergence proof of GCN training with both stale features and stale feature gradients.

  • We further propose a low-overhead smoothing method to improve PipeGCN's convergence by reducing the error incurred by staleness.

  • Extensive empirical and ablation studies consistently validate the advantages of PipeGCN over both vanilla distributed GCN training and SOTA full-graph training methods (e.g., boosting the training throughput by 1.7×~28.5× while achieving the same or better accuracy).

2 Background and Related Works

Graph Convolutional Networks. GCNs represent each node in a graph as a feature (embedding) vector and learn the feature vector via a two-step process (neighbor aggregation and then node update) for each layer, which can be mathematically described as:

z_v^{(\ell)} = \zeta^{(\ell)}\left(\left\{h_u^{(\ell-1)} \mid u \in \mathcal{N}(v)\right\}\right)    (1)
h_v^{(\ell)} = \phi^{(\ell)}\left(z_v^{(\ell)},\, h_v^{(\ell-1)}\right)    (2)

where $\mathcal{N}(v)$ is the neighbor set of node $v$ in the graph, $h_v^{(\ell)}$ represents the learned embedding vector of node $v$ at the $\ell$-th layer, $z_v^{(\ell)}$ is an intermediate aggregated feature calculated by an aggregation function $\zeta^{(\ell)}$, and $\phi^{(\ell)}$ is the function for updating the feature of node $v$. The original GCN (Kipf & Welling, 2016) uses a weighted-average aggregator for $\zeta^{(\ell)}$, and the update function $\phi^{(\ell)}$ is a single-layer perceptron $\sigma(W^{(\ell)} z_v^{(\ell)})$, where $\sigma(\cdot)$ is a non-linear activation function and $W^{(\ell)}$ is a weight matrix. Another famous GCN instance is GraphSAGE (Hamilton et al., 2017), in which $\phi^{(\ell)}$ is $\sigma\left(W^{(\ell)} \cdot \textsc{CONCAT}\left(z_v^{(\ell)}, h_v^{(\ell-1)}\right)\right)$.
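To make the two-step process concrete, below is a minimal PyTorch sketch of Equ. 1 and Equ. 2 with a mean aggregator and a GraphSAGE-style update; the dense adjacency matrix, class name, and tensor shapes are our own illustrative assumptions rather than the implementation used in this paper.

import torch
import torch.nn as nn

class GraphSAGELayer(nn.Module):
    # One GCN layer: neighbor aggregation (Equ. 1) followed by node update (Equ. 2).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # GraphSAGE-style update: sigma(W * CONCAT(z_v, h_v))
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, adj, h):
        # adj: [N, N] dense adjacency (adj[v, u] = 1 if u is a neighbor of v); h: [N, in_dim]
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        z = (adj @ h) / deg                                        # mean aggregation over neighbors
        return torch.relu(self.linear(torch.cat([z, h], dim=1)))  # node update

# toy usage: 4 nodes on a path graph
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
h = torch.randn(4, 8)
layer = GraphSAGELayer(8, 16)
print(layer(adj, h).shape)  # torch.Size([4, 16])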

Distributed Training for GCNs. A real-world graph can contain millions of nodes and billions of edges (Hu et al., 2020), for which a feasible training approach is to partition it into small subgraphs (to fit into each GPU's resource) and train them in parallel, with the necessary communication performed to exchange boundary node features and gradients to satisfy GCNs' neighbor aggregation (Equ. 1). Such an approach is called vanilla partition-parallel training and is illustrated in Fig. 1(a). Following this approach, several works have been proposed recently. NeuGraph (Ma et al., 2019), AliGraph (Zhu et al., 2019), and ROC (Jia et al., 2020) perform such partition-parallel training but rely on CPUs to store all partitions and repeatedly swap partial partitions to GPUs. Inevitably, prohibitive CPU-GPU swaps are incurred, plaguing the achievable training efficiency. CAGNET (Tripathy et al., 2020) is different in that it splits each node feature vector into tiny sub-vectors which are then broadcast and computed sequentially, thus requiring redundant communication and frequent synchronization. Furthermore, $P^3$ (Gandhi & Iyer, 2021) proposes to split both the feature and the GCN layer to mitigate the communication overhead, but it makes a strong assumption that the hidden dimensions of a GCN should be considerably smaller than that of the input features, which restricts the model size. A concurrent work, Dorylus (Thorpe et al., 2021), adopts a fine-grained pipeline along each compute operation in GCN training and supports asynchronous usage of stale features. Nevertheless, the resulting staleness of feature gradients is neither analyzed nor covered by a convergence proof, let alone addressed with error-reduction methods.

Figure 1: An illustrative comparison between vanilla partition-parallel training and PipeGCN.
Table 1: Differences between conventional asynchronous distributed training and PipeGCN.
Method | Target | Staleness
Hogwild!, SSP, MXNet, Pipe-SGD, PipeDream, PipeMare | Large Model, Small Feature | Weight Gradients
PipeGCN | Large Feature | Features and Feature Gradients

Asynchronous Distributed Training. Many prior works have been proposed for asynchronous distributed training of DNNs. Most works (e.g., Hogwild! (Niu et al., 2011), SSP (Ho et al., 2013), and MXNet (Li et al., 2014)) rely on a parameter server with multiple workers running asynchronously to hide the communication overhead of weights/weight gradients among each other, at the cost of using stale weight gradients from previous iterations. Other works like Pipe-SGD (Li et al., 2018b) pipeline such communication with the local computation of each worker. Another direction is to partition a large model along its layers across multiple GPUs and then stream small data batches through the layer pipeline, e.g., PipeDream (Harlap et al., 2018) and PipeMare (Yang et al., 2021). Nonetheless, all these works target large models with small data, where the communication overhead of model weights/weight gradients is substantial but data feature communication is marginal (if not absent), and thus they are not well suited for GCNs. More importantly, they focus on convergence with stale weight gradients of models, rather than the stale features/feature gradients incurred in GCN training. Tab. 1 summarizes the differences. In a nutshell, little effort has been made to study asynchronous or pipelined distributed training of GCNs, where feature communication plays the major role, let alone the corresponding theoretical convergence proofs.

GCNs with Stale Features/Feature Gradients. Several recent works have been proposed to adopt either stale features (Chen et al., 2018; Cong et al., 2020) or stale feature gradients (Cong et al., 2021) in single-GPU training of GCNs. Nevertheless, their convergence analyses consider only one of the two kinds of staleness and derive a convergence rate of $\mathcal{O}(T^{-\frac{1}{2}})$ for pure sampling-based methods. This is, however, of limited use for distributed GCN training, whose convergence is simultaneously affected by both kinds of staleness. PipeGCN proves such convergence with both stale features and stale feature gradients and offers a better rate of $\mathcal{O}(T^{-\frac{2}{3}})$. Furthermore, none of the previous works has studied the errors incurred by staleness, which harm the convergence speed, while PipeGCN develops a low-overhead smoothing method to reduce such errors.

3 The Proposed PipeGCN Framework

Overview. To enable efficient distributed GCN training, we first identify the two bottlenecks associated with vanilla partition-parallel training: substantial communication overhead and frequently synchronized communication (see Fig. 1(b)). We then address them directly by proposing a novel strategy, PipeGCN, which pipelines the communication and computation stages across two adjacent iterations in each partition of distributed GCN training, breaking the synchrony and thereby hiding the communication latency (see Fig. 1(c)). It is non-trivial to achieve efficient GCN training with such a pipeline method, as staleness is incurred in the communicated features/feature gradients and, more importantly, little effort has been made to study the convergence guarantee of GCN training using stale feature gradients. This work takes an initial effort to prove both the theoretical and empirical convergence of such a pipelined GCN training method, and for the first time shows its convergence rate to be close to that of vanilla GCN training without staleness. Furthermore, we propose a low-overhead smoothing method to reduce the errors due to stale features/feature gradients and thus further improve the convergence.

3.1 Bottlenecks in Vanilla Partition-Parallel Training

Table 2: The substantial communication overhead in vanilla partition-parallel training, where Comm. Ratio is the communication time divided by the total training time.
Dataset | # Partitions | Comm. Ratio
Reddit | 2 | 65.83%
Reddit | 4 | 82.89%
ogbn-products | 5 | 76.17%
ogbn-products | 10 | 85.79%
Yelp | 3 | 61.16%
Yelp | 6 | 76.84%

Significant communication overhead. Fig. 1(a) illustrates vanilla partition-parallel training, where each partition holds inner nodes that come from the original graph and boundary nodes that come from other subgraphs. These boundary nodes are required by the neighbor aggregation of GCNs across neighboring partitions, e.g., in Fig. 1(a) node-5 needs nodes-[3,4,6] from other partitions to calculate Equ. 1. Therefore, it is the features/gradients of boundary nodes that dominate the communication overhead in distributed GCN training. Note that the number of boundary nodes can be excessive and far exceed that of inner nodes, as boundary nodes are replicated across partitions and scale with the number of partitions. Besides the sheer size, communication of boundary nodes occurs for (1) each layer and (2) both forward and backward passes, making the communication overhead substantial. We evaluate such overhead in Tab. 2 (the detailed setting can be found in Sec. 4) and find communication to be dominant, which is consistent with CAGNET (Tripathy et al., 2020).
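To illustrate why boundary traffic dominates, the short sketch below counts, for a toy edge list and a hypothetical 2-way partition assignment, how many remote (boundary) nodes each partition must receive per layer and per pass; all names and numbers here are illustrative assumptions, not part of PipeGCN's implementation.

import torch

# toy graph: undirected edge list and a hypothetical node-to-partition assignment
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4], [4, 0], [1, 4]])
part = torch.tensor([0, 0, 0, 1, 1])   # node id -> partition id

def boundary_counts(edges, part, num_parts):
    # For each partition, count the distinct remote nodes whose features must be
    # communicated for every layer, in both the forward and backward passes.
    counts = []
    for p in range(num_parts):
        remote = set()
        for u, v in edges.tolist():
            if part[u] == p and part[v] != p:
                remote.add(v)
            if part[v] == p and part[u] != p:
                remote.add(u)
        counts.append(len(remote))
    return counts

print(boundary_counts(edges, part, 2))  # [2, 3] for this toy partitioning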

Frequently synchronized communication. The aforementioned communication of boundary nodes must finish before calculating Equ. 1 and Equ. 2, which inevitably forces synchronization between communication and computation and requires a fully sequential execution (see Fig. 1(b)). Thus, for most of the training time, each partition waits for the dominant feature/gradient communication to finish before the actual computation, and this repeats for each layer and for both the forward and backward passes.

Figure 2: A detailed comparison between vanilla partition-parallel training of GCNs and PipeGCN.
Input: partition id $i$, partition count $n$, graph partition $\mathcal{G}_i$, propagation matrix $P_i$, node features $X_i$, labels $Y_i$, boundary node set $\mathcal{B}_i$, layer count $L$, learning rate $\eta$, initial model $W_0$
Output: trained model $W_T$ after $T$ iterations

$\mathcal{V}_i \leftarrow \{\text{node } v \in \mathcal{G}_i : v \notin \mathcal{B}_i\}$  ▷ create inner node set
Broadcast $\mathcal{B}_i$ and Receive $[\mathcal{B}_1, \cdots, \mathcal{B}_n]$
$[\mathcal{S}_{i,1}, \cdots, \mathcal{S}_{i,n}] \leftarrow [\mathcal{B}_1 \cap \mathcal{V}_i, \cdots, \mathcal{B}_n \cap \mathcal{V}_i]$
Broadcast $\mathcal{V}_i$ and Receive $[\mathcal{V}_1, \cdots, \mathcal{V}_n]$
$[\mathcal{S}_{1,i}, \cdots, \mathcal{S}_{n,i}] \leftarrow [\mathcal{B}_i \cap \mathcal{V}_1, \cdots, \mathcal{B}_i \cap \mathcal{V}_n]$
$H^{(0)} \leftarrow [X_i ; 0]$  ▷ initialize node features, set boundary features to 0
for $t \coloneqq 1 \rightarrow T$ do
    for $\ell \coloneqq 1 \rightarrow L$ do  ▷ forward pass
        if $t > 1$ then
            wait until $thread_f^{(\ell)}$ completes
            $[H^{(\ell-1)}_{\mathcal{S}_{1,i}}, \cdots, H^{(\ell-1)}_{\mathcal{S}_{n,i}}] \leftarrow [B_1^{(\ell)}, \cdots, B_n^{(\ell)}]$  ▷ update boundary features
        end if
        with $thread_f^{(\ell)}$ do  ▷ communicate boundary features in parallel
            Send $[H^{(\ell-1)}_{\mathcal{S}_{i,1}}, \cdots, H^{(\ell-1)}_{\mathcal{S}_{i,n}}]$ to partitions $[1, \cdots, n]$ and Receive $[B_1^{(\ell)}, \cdots, B_n^{(\ell)}]$
        $H^{(\ell)}_{\mathcal{V}_i} \leftarrow \sigma(P_i H^{(\ell-1)} W^{(\ell)}_{t-1})$  ▷ update inner node features
    end for
    $J^{(L)}_{\mathcal{V}_i} \leftarrow \partial Loss(H^{(L)}_{\mathcal{V}_i}, Y_i) / \partial H^{(L)}_{\mathcal{V}_i}$
    for $\ell \coloneqq L \rightarrow 1$ do  ▷ backward pass
        $G^{(\ell)}_i \leftarrow [P_i H^{(\ell-1)}]^{\top} \left(J^{(\ell)}_{\mathcal{V}_i} \circ \sigma'(P_i H^{(\ell-1)} W^{(\ell)}_{t-1})\right)$  ▷ calculate weight gradient
        if $\ell > 1$ then
            $J^{(\ell-1)} \leftarrow P_i^{\top} \left(J^{(\ell)}_{\mathcal{V}_i} \circ \sigma'(P_i H^{(\ell-1)} W^{(\ell)}_{t-1})\right) [W^{(\ell)}_{t-1}]^{\top}$  ▷ calculate feature gradient
            if $t > 1$ then
                wait until $thread_b^{(\ell)}$ completes
                for $j \coloneqq 1 \rightarrow n$ do
                    $J^{(\ell-1)}_{\mathcal{S}_{i,j}} \leftarrow J^{(\ell-1)}_{\mathcal{S}_{i,j}} + C_j^{(\ell)}$  ▷ accumulate feature gradient
                end for
            end if
            with $thread_b^{(\ell)}$ do  ▷ communicate boundary feature gradients in parallel
                Send $[J^{(\ell-1)}_{\mathcal{S}_{1,i}}, \cdots, J^{(\ell-1)}_{\mathcal{S}_{n,i}}]$ to partitions $[1, \cdots, n]$ and Receive $[C_1^{(\ell)}, \cdots, C_n^{(\ell)}]$
        end if
    end for
    $G \leftarrow \mathrm{AllReduce}(G_i)$  ▷ synchronize model gradient
    $W_t \leftarrow W_{t-1} - \eta \cdot G$  ▷ update model
end for
return $W_T$

Algorithm 1: Training a GCN with PipeGCN (per-partition view).

3.2 The Proposed PipeGCN Method

Fig. 1(c) illustrates the high-level overview of PipeGCN, which pipelines the communication and computation stages of each GCN layer across two iterations. Fig. 2 further provides the detailed end-to-end flow, where PipeGCN removes the heavy communication overhead of the vanilla approach by breaking the synchronization between communication and computation and hiding the communication within the computation of each GCN layer. This is achieved by deferring the communication to the next iteration's computation (instead of serving the current iteration) so that computation and communication can run in parallel. Inevitably, the deferred communication introduces staleness and results in a mixed usage of fresh inner features/gradients and stale boundary features/gradients.

Analytically, PipeGCN is achieved by modifying Equ. 1. For instance, when using a mean aggregator, Equ. 1 and its corresponding backward formulation in PipeGCN become:

z_v^{(t,\ell)} = \textsc{MEAN}\left(\{h_u^{(t,\ell-1)} \mid u \in \mathcal{N}(v) \setminus \mathcal{B}(v)\} \cup \{h_u^{(t-1,\ell-1)} \mid u \in \mathcal{B}(v)\}\right)    (3)
\delta_{h_u}^{(t,\ell)} = \sum_{v:\,u \in \mathcal{N}(v) \setminus \mathcal{B}(v)} \frac{1}{d_v}\,\delta_{z_v}^{(t,\ell+1)} + \sum_{v:\,u \in \mathcal{B}(v)} \frac{1}{d_v}\,\delta_{z_v}^{(t-1,\ell+1)}    (4)

where $\mathcal{B}(v)$ is node $v$'s boundary node set, $d_v$ denotes node $v$'s degree, and $\delta_{h_u}^{(t,\ell)}$ and $\delta_{z_v}^{(t,\ell)}$ represent the gradient approximations of $h_u$ and $z_v$ at layer $\ell$ and iteration $t$, respectively. Lastly, the implementation of PipeGCN is outlined in Alg. 1.
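To convey the control flow of Alg. 1 in a compact form, the sketch below simulates the one-iteration-deferred boundary exchange with a background thread on a single process; the helper exchange_boundary and all tensor shapes are hypothetical stand-ins (a real implementation would use inter-partition collectives such as torch.distributed point-to-point communication), so this is only a sketch of the pipelining idea, not PipeGCN's actual code.

import torch
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the inter-partition boundary exchange; it just copies
# the tensor so the sketch stays self-contained and runnable on one process.
def exchange_boundary(features_to_send):
    return features_to_send.clone()

executor = ThreadPoolExecutor(max_workers=1)
pending = None                         # transfer launched during the previous iteration
boundary = torch.zeros(3, 8)           # boundary features initialized to 0, as in Alg. 1
inner = torch.randn(5, 8)              # this partition's inner-node features
weight = torch.randn(8, 8, requires_grad=True)

for t in range(4):                     # training iterations
    if pending is not None:
        boundary = pending.result()    # wait for last iteration's transfer (stale by one iteration)
    # launch this iteration's transfer; it overlaps with the compute below
    pending = executor.submit(exchange_boundary, inner[:3])
    h = torch.cat([inner, boundary], dim=0)    # fresh inner + stale boundary features
    out = torch.relu(h @ weight)               # layer compute runs while the transfer is in flight
    loss = out.sum()
    loss.backward()
    with torch.no_grad():
        weight -= 0.01 * weight.grad           # model update
    weight.grad = None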

3.3 PipeGCN’s Convergence Guarantee

As PipeGCN adopts a mixed usage of fresh inner features/gradients and stale boundary features/gradients, its convergence rate was previously unknown. We prove the convergence of PipeGCN and present its convergence property in the following theorem.

Theorem 3.1 (Convergence of PipeGCN, informal version).

There exists a constant $E$ such that for any arbitrarily small constant $\varepsilon > 0$, we can choose a learning rate $\eta = \frac{\sqrt{\varepsilon}}{E}$ and a number of training iterations $T = (\mathcal{L}(\theta^{(1)}) - \mathcal{L}(\theta^{*})) E \varepsilon^{-\frac{3}{2}}$ such that:

\frac{1}{T}\sum_{t=1}^{T}\|\nabla\mathcal{L}(\theta^{(t)})\|_{2} \leq \mathcal{O}(\varepsilon)

where $\mathcal{L}(\cdot)$ is the loss function, and $\theta^{(t)}$ and $\theta^{*}$ represent the parameter vector at iteration $t$ and the optimal parameter vector, respectively.

Therefore, the convergence rate of PipeGCN is $\mathcal{O}(T^{-\frac{2}{3}})$, which is better than that of sampling-based methods ($\mathcal{O}(T^{-\frac{1}{2}})$) (Chen et al., 2018; Cong et al., 2021) and close to that of full-graph training ($\mathcal{O}(T^{-1})$). The formal version of the theorem and our detailed proof can be found in Appendix A.
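To make the stated rate explicit, inverting the choice of $T$ in Theorem 3.1 gives the following short check (our own algebra, not part of the formal statement):

\varepsilon = \left(\frac{(\mathcal{L}(\theta^{(1)}) - \mathcal{L}(\theta^{*}))\,E}{T}\right)^{\frac{2}{3}} = \mathcal{O}\!\left(T^{-\frac{2}{3}}\right),
\qquad \text{hence} \qquad
\frac{1}{T}\sum_{t=1}^{T}\|\nabla\mathcal{L}(\theta^{(t)})\|_{2} \leq \mathcal{O}\!\left(T^{-\frac{2}{3}}\right).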

3.4 The Proposed Smoothing Method

To further improve the convergence of PipeGCN, we propose a smoothing method that reduces the errors incurred by stale features/feature gradients at minimal overhead. Here we present the smoothing of feature gradients; the same formulation also applies to features. To improve the approximated gradient for each feature, fluctuations in feature gradients between adjacent iterations should be reduced. Therefore, we apply a light-weight moving average to the feature gradients of each boundary node $v$ as follows:

\hat{\delta}_{z_v}^{(t,\ell)} = \gamma\,\hat{\delta}_{z_v}^{(t-1,\ell)} + (1-\gamma)\,\delta_{z_v}^{(t,\ell)}

where $\hat{\delta}_{z_v}^{(t,\ell)}$ is the smoothed feature gradient at layer $\ell$ and iteration $t$, and $\gamma$ is the decay rate. When integrating this smoothed feature gradient into the backward pass, Equ. 4 can be rewritten as:

\hat{\delta}_{h_u}^{(t,\ell)} = \sum_{v:\,u \in \mathcal{N}(v) \setminus \mathcal{B}(v)} \frac{1}{d_v}\,\delta_{z_v}^{(t,\ell+1)} + \sum_{v:\,u \in \mathcal{B}(v)} \frac{1}{d_v}\,\hat{\delta}_{z_v}^{(t-1,\ell+1)}

Note that the smoothing of stale features and gradients can be independently applied to PipeGCN.
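In practice, the smoothing is just an exponential moving average kept per boundary node; a minimal sketch (our own variable names, with the default decay rate $\gamma = 0.95$ used in Sec. 4) is:

import torch

gamma = 0.95        # decay rate of the moving average
smoothed = None     # holds the previous smoothed boundary feature gradients

def smooth_boundary_grad(stale_grad):
    # Apply the moving average of Sec. 3.4 to the received boundary feature gradients.
    global smoothed
    if smoothed is None:
        smoothed = stale_grad.clone()
    else:
        smoothed = gamma * smoothed + (1 - gamma) * stale_grad
    return smoothed

# toy usage: gradients received over three iterations for 3 boundary nodes
for _ in range(3):
    print(smooth_boundary_grad(torch.randn(3, 8)).norm())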

4 Experiment Results

We evaluate PipeGCN on four large-scale datasets, Reddit (Hamilton et al., 2017), ogbn-products (Hu et al., 2020), Yelp (Zeng et al., 2020), and ogbn-papers100M (Hu et al., 2020). More details are provided in Tab. 3. To ensure robustness and reproducibility, we fix (i.e., do not tune) the hyper-parameters and settings of PipeGCN and its variants throughout all experiments. To implement partition parallelism (for both vanilla distributed GCN training and PipeGCN), the widely used METIS (Karypis & Kumar, 1998) partitioning algorithm is adopted for graph partitioning, with its objective set to minimize the communication volume. We implement PipeGCN in PyTorch (Paszke et al., 2019) and DGL (Wang et al., 2019). Experiments are conducted on a machine with 10 RTX-2080Ti GPUs (11GB), a Xeon 6230R CPU @2.10GHz (187GB), and PCIe 3.0 x16 connecting CPU-GPU and GPU-GPU. Only for ogbn-papers100M do we use 4 compute nodes (each with 8 MI60 GPUs, an AMD EPYC 7642 CPU, and 48-lane PCIe 3.0 connecting CPU-GPU and GPU-GPU) networked with 10Gbps Ethernet. To support full-graph GCN training with the model sizes in Tab. 3, the minimum required partition numbers are 2, 3, 5, and 32 for Reddit, ogbn-products, Yelp, and ogbn-papers100M, respectively.

Table 3: Detailed experiment setups: graph datasets, GCN models, and training hyper-parameters.
Dataset | # Nodes | # Edges | Feat. size | GraphSAGE model size | Optimizer | Learning rate | Dropout | # Epochs
Reddit | 233K | 114M | 602 | 4 layers, 256 hidden units | Adam | 0.01 | 0.5 | 3000
ogbn-products | 2.4M | 62M | 100 | 3 layers, 128 hidden units | Adam | 0.003 | 0.3 | 500
Yelp | 716K | 7.0M | 300 | 4 layers, 512 hidden units | Adam | 0.001 | 0.1 | 3000
ogbn-papers100M | 111M | 1.6B | 128 | 3 layers, 48 hidden units | Adam | 0.01 | 0.0 | 1000

For convenience, we name the evaluated methods as follows: vanilla partition-parallel training of GCNs (GCN), PipeGCN with feature gradient smoothing (PipeGCN-G), PipeGCN with feature smoothing (PipeGCN-F), and PipeGCN with both smoothings (PipeGCN-GF). The default decay rate $\gamma$ for all smoothing methods is set to 0.95.

Figure 3: Throughput comparison. Each partition uses one GPU (except CAGNET (c=2), which uses two).

4.1 Improving Training Throughput over Full-Graph Training Methods

Fig. 3 compares the training throughput of PipeGCN and the SOTA full-graph training methods (ROC (Jia et al., 2020) and CAGNET (Tripathy et al., 2020)). We observe that both vanilla partition-parallel training (GCN) and PipeGCN greatly outperform ROC and CAGNET across different numbers of partitions, because they avoid both the expensive CPU-GPU swaps (ROC) and the redundant node broadcasts (CAGNET). Specifically, GCN is 3.1×~16.4× faster than ROC and 2.1×~10.2× faster than CAGNET (c=2). PipeGCN further improves upon GCN, achieving a throughput improvement of 5.6×~28.5× over ROC and 3.9×~17.7× over CAGNET (c=2) (more detailed comparisons among full-graph training methods can be found in Appendix B). Note that we are not able to compare PipeGCN with NeuGraph (Ma et al., 2019), AliGraph (Zhu et al., 2019), and $P^3$ (Gandhi & Iyer, 2021), as their code is not publicly available. Besides, Dorylus (Thorpe et al., 2021) is not comparable, as it is not designed for regular GPU servers. Considering the substantial performance gap between ROC/CAGNET and GCN, we focus on comparing GCN with PipeGCN for the remainder of the section.

4.2 Improving Training Throughput without Compromising Accuracy

We compare both the test score and the training throughput of GCN and PipeGCN in Tab. 4. We can see that PipeGCN without smoothing already achieves a test score comparable to vanilla GCN training on both Reddit and Yelp, and incurs only a negligible accuracy drop (-0.08%~-0.23%) on ogbn-products, while boosting the training throughput by 1.72×~2.16× across all datasets and numbers of partitions (more details regarding PipeGCN's throughput advantages can be found in Appendix C), thus validating the effectiveness of PipeGCN.

With the proposed smoothing method plugged in, PipeGCN-G/F/GF is able to compensate for the accuracy drop of vanilla PipeGCN, achieving a test score equal to or even better than that of vanilla GCN training (without staleness), e.g., 97.14% vs. 97.11% on Reddit, 79.36% vs. 79.14% on ogbn-products, and 65.28% vs. 65.26% on Yelp. Meanwhile, PipeGCN-G/F/GF enjoys a throughput improvement similar to vanilla PipeGCN, validating the negligible overhead of the proposed smoothing method. Therefore, the pipelined transfer of features and gradients greatly improves the training throughput while maintaining the full-graph accuracy.

Note that our distributed GCN training methods consistently achieve higher test scores than SOTA sampling-based methods for GraphSAGE-based models reported in (Zeng et al., 2020) and (Hu et al., 2020), confirming that full-graph training is preferred to obtain better GCN models. For example, the best sampling-based method achieves a 96.6% accuracy on Reddit (Zeng et al., 2020) while full-graph GCN training achieves 97.1%, and PipeGCN improves the accuracy by 0.28% over sampling-based GraphSAGE models on ogbn-products (Hu et al., 2020). This advantage of full-graph training is also validated by recent works (Jia et al., 2020; Tripathy et al., 2020; Liu et al., 2022; Wan et al., 2022).

Table 4: Training performance comparison among vanilla partition-parallel training (GCN) and PipeGCN variants (PipeGCN*), where we report the test accuracy for Reddit and ogbn-products, and the F1-micro score for Yelp. Highest performance is in bold.
Dataset | Method | Test Score (%) | Throughput
Reddit (2 partitions) | GCN | 97.11±0.02 | 1× (1.94 epochs/s)
Reddit (2 partitions) | PipeGCN | 97.12±0.02 | 1.91×
Reddit (2 partitions) | PipeGCN-G | 97.14±0.03 | 1.89×
Reddit (2 partitions) | PipeGCN-F | 97.09±0.02 | 1.89×
Reddit (2 partitions) | PipeGCN-GF | 97.12±0.02 | 1.87×
Reddit (4 partitions) | GCN | 97.11±0.02 | 1× (2.07 epochs/s)
Reddit (4 partitions) | PipeGCN | 97.04±0.03 | 2.12×
Reddit (4 partitions) | PipeGCN-G | 97.09±0.03 | 2.07×
Reddit (4 partitions) | PipeGCN-F | 97.10±0.02 | 2.10×
Reddit (4 partitions) | PipeGCN-GF | 97.10±0.02 | 2.06×
ogbn-products (5 partitions) | GCN | 79.14±0.35 | 1× (1.45 epochs/s)
ogbn-products (5 partitions) | PipeGCN | 79.06±0.42 | 1.94×
ogbn-products (5 partitions) | PipeGCN-G | 79.20±0.38 | 1.90×
ogbn-products (5 partitions) | PipeGCN-F | 79.36±0.38 | 1.90×
ogbn-products (5 partitions) | PipeGCN-GF | 78.86±0.34 | 1.91×
ogbn-products (10 partitions) | GCN | 79.14±0.35 | 1× (1.28 epochs/s)
ogbn-products (10 partitions) | PipeGCN | 78.91±0.65 | 1.87×
ogbn-products (10 partitions) | PipeGCN-G | 79.08±0.58 | 1.82×
ogbn-products (10 partitions) | PipeGCN-F | 79.21±0.31 | 1.81×
ogbn-products (10 partitions) | PipeGCN-GF | 78.77±0.23 | 1.82×
Yelp (3 partitions) | GCN | 65.26±0.02 | 1× (2.00 epochs/s)
Yelp (3 partitions) | PipeGCN | 65.27±0.01 | 2.16×
Yelp (3 partitions) | PipeGCN-G | 65.26±0.02 | 2.15×
Yelp (3 partitions) | PipeGCN-F | 65.26±0.03 | 2.15×
Yelp (3 partitions) | PipeGCN-GF | 65.26±0.04 | 2.11×
Yelp (6 partitions) | GCN | 65.26±0.02 | 1× (2.25 epochs/s)
Yelp (6 partitions) | PipeGCN | 65.24±0.02 | 1.72×
Yelp (6 partitions) | PipeGCN-G | 65.28±0.02 | 1.69×
Yelp (6 partitions) | PipeGCN-F | 65.25±0.04 | 1.68×
Yelp (6 partitions) | PipeGCN-GF | 65.26±0.04 | 1.67×

4.3 Maintaining Convergence Speed

Figure 4: Epoch-to-accuracy comparison among vanilla partition-parallel training (GCN) and PipeGCN variants (PipeGCN*), where PipeGCN and its variants achieve a convergence similar to that of the vanilla training (without staleness) but are twice as fast in wall-clock time (see Tab. 4).

To understand PipeGCN's influence on the convergence speed, we compare the training curves of the different methods in Fig. 4. We observe that the convergence of PipeGCN without smoothing is still comparable with that of vanilla GCN training, although PipeGCN converges more slowly in the early phase of training and then catches up in the later phase, due to the staleness of boundary features/gradients. With the proposed smoothing methods, PipeGCN-G/F boosts the convergence substantially and matches the convergence speed of vanilla GCN training; there is no clear difference between PipeGCN-G and PipeGCN-F. Lastly, with combined smoothing of features and gradients, PipeGCN-GF can achieve the same or even slightly better convergence speed than vanilla GCN training (e.g., on Reddit) but can gradually overfit, similar to vanilla GCN training, which is further investigated in Sec. 4.4. Therefore, PipeGCN maintains the convergence speed w.r.t. the number of epochs while reducing the end-to-end training time by around 50% thanks to its boosted training throughput (see Tab. 4).

4.4 Benefit of Staleness Smoothing Method

Error Reduction and Convergence Speedup. To understand why the proposed smoothing technique (Sec. 3.4) speeds up convergence, we compare the errors incurred by stale communication in PipeGCN and PipeGCN-G/F. The error is calculated as the Frobenius norm of the gap between the correct gradient/feature and the stale gradient/feature used in PipeGCN training. Fig. 5 compares the error at each GCN layer. We can see that the proposed smoothing technique (PipeGCN-G/F) substantially reduces the staleness error compared to the base version of PipeGCN, and this benefit consistently holds across different layers in terms of both feature and gradient errors, validating the effectiveness of our smoothing method and explaining its improvement in convergence speed.
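For reference, the error metric itself is simply a Frobenius norm of the gap; a small sketch of how one might compute it on toy tensors (our own naming, not the paper's evaluation script) is:

import torch

def staleness_error(exact, stale):
    # Frobenius norm of the gap between the exact and the stale matrix (feature or gradient).
    return torch.linalg.norm(exact - stale)

# toy tensors standing in for one layer's boundary feature gradients
exact = torch.randn(100, 64)                   # gradient computed with fresh boundary information
stale = exact + 0.05 * torch.randn(100, 64)    # gradient actually used under pipelining
print(staleness_error(exact, stale).item())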

Figure 5: Comparison of the resulting feature gradient error and feature error from PipeGCN and PipeGCN-G/F at each GCN layer on Reddit (2 partitions). PipeGCN-G/F here uses a default smoothing decay rate of 0.95.
Figure 6: Test-accuracy convergence comparison among different smoothing decay rates $\gamma$ in PipeGCN-GF on ogbn-products (10 partitions).
Figure 7: Comparison of the resulting feature gradient error and feature error when adopting different decay rates $\gamma$ at each GCN layer on ogbn-products (10 partitions).

Overfitting Mitigation. To understand the effect of staleness smoothing on model overfitting, we also evaluate the test-accuracy convergence under different decay rates $\gamma$ in Fig. 6. Here ogbn-products is adopted as the study case because the distribution of its test set largely differs from that of its training set. From Fig. 6, we observe that smoothing with a large $\gamma$ (0.7/0.95) offers fast convergence, i.e., close to vanilla GCN training, but overfits rapidly. To understand this issue, we further provide detailed comparisons of the errors incurred under different $\gamma$ in Fig. 7. We can see that a larger $\gamma$ enjoys lower approximation errors and makes the gradients/features more stable, thus improving the convergence speed. The increased stability on the training set, however, constrains the model from exploring a more general minimum on the test set, thus leading to overfitting, as in vanilla GCN training. In contrast, a small $\gamma$ (0~0.5) mitigates this overfitting and achieves better accuracy (see Fig. 6), but a too-small $\gamma$ (e.g., 0) incurs high errors for both stale features and gradients (see Fig. 7) and thus suffers from slower convergence. Therefore, a trade-off between convergence speed and achievable optimality exists among different smoothing decay rates, and $\gamma=0.5$ combines the best of both worlds in this study.

4.5 Scaling Large Graph Training over Multiple Servers

Table 5: Comparison of epoch training time on ogbn-papers100M.
Method | Total | Communication
GCN | 1.00× (10.5s) | 1.00× (6.6s)
PipeGCN | 0.62× (6.5s) | 0.39× (2.6s)
PipeGCN-GF | 0.64× (6.7s) | 0.42× (2.8s)

To further test the capability of PipeGCN, we scale up the graph to ogbn-papers100M and train a GCN over multiple GPU servers with 32 GPUs in total. Tab. 5 shows that even in such a large-scale setting where communication overhead dominates, PipeGCN still reduces the communication time by 61%, leading to a total training time reduction of 38% compared to the vanilla GCN baseline (more experiments on multi-server training can be found in Appendix E).

5 Conclusion

In this work, we propose a new method, PipeGCN, for efficient full-graph GCN training. PipeGCN pipelines communication with computation in distributed GCN training to hide the prohibitive communication overhead. More importantly, we are the first to provide convergence analysis for GCN training with both stale features and feature gradients, and further propose a light-weight smoothing method for convergence speedup. Extensive experiments validate the advantages of PipeGCN over both vanilla GCN training (without staleness) and state-of-the-art full-graph training.

6 Acknowledgement

The work is supported by the National Science Foundation (NSF) through the MLWiNS program (Award number: 2003137), the CC Compute program (Award number: 2019007), and the NeTS program (Award number: 1801865).

References

  • Alistarh et al. (2017) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.
  • Chen et al. (2018) Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning, pp. 942–950. PMLR, 2018.
  • Chiang et al. (2019) Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  257–266, 2019.
  • Cong et al. (2020) Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  1393–1403, 2020.
  • Cong et al. (2021) Weilin Cong, Morteza Ramezani, and Mehrdad Mahdavi. On the importance of sampling in learning graph convolutional networks. arXiv preprint arXiv:2103.02696, 2021.
  • Gandhi & Iyer (2021) Swapnil Gandhi and Anand Padmanabha Iyer. P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pp.  551–568, 2021.
  • Garg et al. (2020) Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pp. 3419–3430. PMLR, 2020.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034, 2017.
  • Harlap et al. (2018) Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018.
  • Ho et al. (2013) Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B Gibbons, Garth A Gibson, Gregory R Ganger, and Eric P Xing. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems, 2013:1223, 2013.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  • Jia et al. (2020) Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. Improving the accuracy, scalability, and performance of graph neural networks with roc. Proceedings of Machine Learning and Systems (MLSys), pp. 187–198, 2020.
  • Karypis & Kumar (1998) George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392, 1998.
  • Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Li et al. (2014) Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems, 27:19–27, 2014.
  • Li et al. (2018a) Youjie Li, Jongse Park, Mohammad Alian, Yifan Yuan, Zheng Qu, Peitian Pan, Ren Wang, Alexander Gerhard Schwing, Hadi Esmaeilzadeh, and Nam Sung Kim. A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks. In Proceedings of the 51st IEEE/ACM International Symposium on Microarchitecture (MICRO’18), Fukuoka City, Japan, 2018a.
  • Li et al. (2018b) Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, and Alexander Schwing. Pipe-SGD: A decentralized pipelined sgd framework for distributed deep net training. Advances in Neural Information Processing Systems, 2018b.
  • Liao et al. (2020) Renjie Liao, Raquel Urtasun, and Richard Zemel. A pac-bayesian approach to generalization bounds for graph neural networks. arXiv preprint arXiv:2012.07690, 2020.
  • Liu et al. (2022) Zirui Liu, Kaixiong Zhou, Fan Yang, Li Li, Rui Chen, and Xia Hu. EXACT: Scalable graph neural networks training via extreme activation compression. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vkaMaq95_rX.
  • Ma et al. (2019) Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. Neugraph: parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pp.  443–458, 2019.
  • Niu et al. (2011) Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730, 2011.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037, 2019.
  • Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association. Citeseer, 2014.
  • Thorpe et al. (2021) John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, et al. Dorylus: affordable, scalable, and accurate gnn training with distributed cpu servers and serverless threads. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pp.  495–514, 2021.
  • Tripathy et al. (2020) Alok Tripathy, Katherine Yelick, and Aydin Buluc. Reducing communication in graph neural network training. arXiv preprint arXiv:2005.03300, 2020.
  • Wan et al. (2022) Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, and Yingyan Lin. BNS-GCN: Efficient full-graph training of graph convolutional networks with partition-parallelism and random boundary node sampling. Fifth Conference on Machine Learning and Systems, 2022.
  • Wang et al. (2019) Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
  • Wen et al. (2017) Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. Advances in neural information processing systems, 30, 2017.
  • Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
  • Yang et al. (2021) Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher Aberger, and Christopher De Sa. Pipemare: Asynchronous pipeline parallel dnn training. Proceedings of Machine Learning and Systems, 3, 2021.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  974–983, 2018.
  • Yu et al. (2018) Mingchao Yu, Zhifeng Lin, Krishna Giri Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, and Salman Avestimehr. Gradiveq: Vector quantization for bandwidth-efficient gradient aggregation in distributed cnn training. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS’18), Montreal, Canada, 2018.
  • Zeng et al. (2020) Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. Graphsaint: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2020.
  • Zhang & Chen (2018) Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175, 2018.
  • Zhu et al. (2019) Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. Aligraph: A comprehensive graph neural network platform. arXiv preprint arXiv:1902.08730, 2019.

Appendix A Convergence Proof

In this section, we prove the convergence of PipeGCN. Specifically, we first show that when the model is updated via gradient descent, the changes of the intermediate features and their gradients are bounded by a constant proportional to the learning rate $\eta$ under standard assumptions. Based on this, we further demonstrate that the error introduced by staleness is proportional to $\eta$, which guarantees that the gradient error is bounded by $\eta E$, where $E$ is defined in Corollary A.10, and thus PipeGCN converges in $\mathcal{O}(\varepsilon^{-\frac{3}{2}})$ iterations.

A.1 Notations and Assumptions

For a given graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with an adjacency matrix $A$ and a feature matrix $X$, we define the propagation matrix $P$ as $P \coloneqq \widetilde{D}^{-1/2}\widetilde{A}\widetilde{D}^{-1/2}$, where $\widetilde{A}=A+I$ and $\widetilde{D}_{u,u}=\sum_{v}\widetilde{A}_{u,v}$. One GCN layer performs one step of feature propagation (Kipf & Welling, 2016), as formulated below:

H^{(0)} = X
Z^{(\ell)} = PH^{(\ell-1)}W^{(\ell)}
H^{(\ell)} = \sigma(Z^{(\ell)})

where $H^{(\ell)}$, $W^{(\ell)}$, and $Z^{(\ell)}$ denote the embedding matrix, the trainable weight matrix, and the intermediate embedding matrix in the $\ell$-th layer, respectively, and $\sigma$ denotes a non-linear activation function. For an $L$-layer GCN, the loss function is denoted by $\mathcal{L}(\theta)$, where $\theta = \text{vec}[W^{(1)},W^{(2)},\cdots,W^{(L)}]$. We define the $\ell$-th layer as a function $f^{(\ell)}(\cdot,\cdot)$:

f^{(\ell)}(H^{(\ell-1)},W^{(\ell)}) \coloneqq \sigma(PH^{(\ell-1)}W^{(\ell)})

Its gradient w.r.t. the input embedding matrix can be represented as

J^{(\ell-1)} = \nabla_{H}f^{(\ell)}(J^{(\ell)},H^{(\ell-1)},W^{(\ell)}) \coloneqq P^{\top}M^{(\ell)}[W^{(\ell)}]^{\top}

and its gradient w.r.t. the weight can be represented as

G^{(\ell)} = \nabla_{W}f^{(\ell)}(J^{(\ell)},H^{(\ell-1)},W^{(\ell)}) \coloneqq [PH^{(\ell-1)}]^{\top}M^{(\ell)}

where $M^{(\ell)} = J^{(\ell)} \circ \sigma^{\prime}(PH^{(\ell-1)}W^{(\ell)})$ and $\circ$ denotes the Hadamard product.

For partition-parallel training, we can split $P$ into two parts, $P = P_{in} + P_{bd}$, where $P_{in}$ represents intra-partition propagation and $P_{bd}$ denotes inter-partition propagation. For PipeGCN, we can represent one GCN layer as below:

\widetilde{H}^{(t,0)} = X
\widetilde{Z}^{(t,\ell)} = P_{in}\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} + P_{bd}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)}
\widetilde{H}^{(t,\ell)} = \sigma(\widetilde{Z}^{(t,\ell)})

where $t$ is the epoch number and $\widetilde{W}^{(t,\ell)}$ is the weight at epoch $t$ and layer $\ell$. We define the loss function for this setting as $\widetilde{\mathcal{L}}(\widetilde{\theta}^{(t)})$, where $\widetilde{\theta}^{(t)} = \text{vec}[\widetilde{W}^{(t,1)},\widetilde{W}^{(t,2)},\cdots,\widetilde{W}^{(t,L)}]$. We can also summarize the layer as a function $\widetilde{f}^{(t,\ell)}(\cdot,\cdot)$:

\widetilde{f}^{(t,\ell)}(\widetilde{H}^{(t,\ell-1)},\widetilde{W}^{(t,\ell)}) \coloneqq \sigma(P_{in}\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} + P_{bd}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)})

Note that $\widetilde{H}^{(t-1,\ell-1)}$ is not a part of the input of $\widetilde{f}^{(t,\ell)}(\cdot,\cdot)$ because it is a constant for the $t$-th epoch. The corresponding backward propagation follows the computation below:

\widetilde{J}^{(t,\ell-1)} = \nabla_{H}\widetilde{f}^{(t,\ell)}(\widetilde{J}^{(t,\ell)},\widetilde{H}^{(t,\ell-1)},\widetilde{W}^{(t,\ell)})
\widetilde{G}^{(t,\ell)} = \nabla_{W}\widetilde{f}^{(t,\ell)}(\widetilde{J}^{(t,\ell)},\widetilde{H}^{(t,\ell-1)},\widetilde{W}^{(t,\ell)})

where

\widetilde{M}^{(t,\ell)} = \widetilde{J}^{(t,\ell)} \circ \sigma^{\prime}(P_{in}\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} + P_{bd}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)})
\nabla_{H}\widetilde{f}^{(t,\ell)}(\widetilde{J}^{(t,\ell)},\widetilde{H}^{(t,\ell-1)},\widetilde{W}^{(t,\ell)}) \coloneqq P_{in}^{\top}\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top} + P_{bd}^{\top}\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}
\nabla_{W}\widetilde{f}^{(t,\ell)}(\widetilde{J}^{(t,\ell)},\widetilde{H}^{(t,\ell-1)},\widetilde{W}^{(t,\ell)}) \coloneqq [P_{in}\widetilde{H}^{(t,\ell-1)} + P_{bd}\widetilde{H}^{(t-1,\ell-1)}]^{\top}\widetilde{M}^{(t,\ell)}

Again, $\widetilde{J}^{(t-1,\ell)}$ is not a part of the input of $\nabla_{H}\widetilde{f}^{(t,\ell)}(\cdot,\cdot,\cdot)$ or $\nabla_{W}\widetilde{f}^{(t,\ell)}(\cdot,\cdot,\cdot)$ because it is a constant for epoch $t$. Finally, we define $\nabla\widetilde{\mathcal{L}}(\widetilde{\theta}^{(t)}) = \text{vec}[\widetilde{G}^{(t,1)},\widetilde{G}^{(t,2)},\cdots,\widetilde{G}^{(t,L)}]$. It should be highlighted that the "gradients" $\nabla_{H}\widetilde{f}^{(t,\ell)}(\cdot,\cdot,\cdot)$, $\nabla_{W}\widetilde{f}^{(t,\ell)}(\cdot,\cdot,\cdot)$, and $\nabla\widetilde{\mathcal{L}}(\widetilde{\theta}^{(t)})$ are not the standard gradients of the corresponding forward process due to the stale communication, so properties of gradients cannot be directly applied to these variables.

Before proceeding with our proof, we make the following standard assumptions about the adopted GCN architecture and input graph.

Assumption A.1.

The loss function $\mathrm{Loss}(\cdot,\cdot)$ is $C_{\text{loss}}$-Lipschitz continuous and $L_{\text{loss}}$-smooth w.r.t. the input node embedding vector, i.e., $|\mathrm{Loss}(h^{(L)},y) - \mathrm{Loss}(h^{\prime(L)},y)| \leq C_{\text{loss}}\|h^{(L)} - h^{\prime(L)}\|_{2}$ and $\|\nabla \mathrm{Loss}(h^{(L)},y) - \nabla \mathrm{Loss}(h^{\prime(L)},y)\|_{2} \leq L_{\text{loss}}\|h^{(L)} - h^{\prime(L)}\|_{2}$, where $h^{(L)}$ is the predicted label and $y$ is the correct label vector.

Assumption A.2.

The activation function $\sigma(\cdot)$ is $C_{\sigma}$-Lipschitz continuous and $L_{\sigma}$-smooth, i.e., $\|\sigma(z^{(\ell)}) - \sigma(z^{\prime(\ell)})\|_{2} \leq C_{\sigma}\|z^{(\ell)} - z^{\prime(\ell)}\|_{2}$ and $\|\sigma^{\prime}(z^{(\ell)}) - \sigma^{\prime}(z^{\prime(\ell)})\|_{2} \leq L_{\sigma}\|z^{(\ell)} - z^{\prime(\ell)}\|_{2}$.

Assumption A.3.

For any $\ell \in [L]$, the norms of the weight matrices, the propagation matrix, and the input feature matrix are bounded: $\|W^{(\ell)}\|_{F} \leq B_{W}$, $\|P\|_{F} \leq B_{P}$, $\|X\|_{F} \leq B_{X}$. (This generic assumption is also used in (Chen et al., 2018; Liao et al., 2020; Garg et al., 2020; Cong et al., 2021).)

A.2 Bounded Matrices and Changes

Lemma A.1.

For any $\ell \in [L]$, the Frobenius norms of the node embedding matrices, the gradients passed from the $\ell$-th layer node embeddings to the $(\ell-1)$-th layer, and the gradient matrices are bounded, i.e.,

\|H^{(\ell)}\|_{F},\ \|\widetilde{H}^{(t,\ell)}\|_{F} \leq B_{H},
\|J^{(\ell)}\|_{F},\ \|\widetilde{J}^{(t,\ell)}\|_{F} \leq B_{J},
\|M^{(\ell)}\|_{F},\ \|\widetilde{M}^{(t,\ell)}\|_{F} \leq B_{M},
\|G^{(\ell)}\|_{F},\ \|\widetilde{G}^{(t,\ell)}\|_{F} \leq B_{G}

where

B_{H} = \max_{1\leq\ell\leq L}(C_{\sigma}B_{P}B_{W})^{\ell}B_{X}
B_{J} = \max_{2\leq\ell\leq L}(C_{\sigma}B_{P}B_{W})^{L-\ell}C_{\text{loss}}
B_{M} = C_{\sigma}B_{J}
B_{G} = B_{P}B_{H}B_{M}
Proof.

The proof of $\|H^{(\ell)}\|_{F} \leq B_{H}$ and $\|J^{(\ell)}\|_{F} \leq B_{J}$ can be found in Proposition 1 of (Cong et al., 2021). By induction,

\|\widetilde{H}^{(t,\ell)}\|_{F} = \|\sigma(P_{in}\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} + P_{bd}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)})\|_{F}
\leq C_{\sigma}B_{W}\|P_{in}+P_{bd}\|_{F}(C_{\sigma}B_{P}B_{W})^{\ell-1}B_{X}
\leq (C_{\sigma}B_{P}B_{W})^{\ell}B_{X}

\|\widetilde{J}^{(t,\ell-1)}\|_{F} = \left\|P_{in}^{\top}\left(\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})\right)[\widetilde{W}^{(t,\ell)}]^{\top} + P_{bd}^{\top}\left(\widetilde{J}^{(t-1,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})\right)[\widetilde{W}^{(t-1,\ell)}]^{\top}\right\|_{F}
\leq C_{\sigma}B_{W}\|P_{in}+P_{bd}\|_{F}(C_{\sigma}B_{P}B_{W})^{L-\ell}C_{\text{loss}}
\leq (C_{\sigma}B_{P}B_{W})^{L-\ell+1}C_{\text{loss}}

\|M^{(\ell)}\|_{F} = \|J^{(\ell)}\circ\sigma^{\prime}(Z^{(\ell)})\|_{F} \leq C_{\sigma}B_{J}
\|\widetilde{M}^{(t,\ell)}\|_{F} = \|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})\|_{F} \leq C_{\sigma}B_{J}

\|G^{(\ell)}\|_{F} = \|[PH^{(\ell-1)}]^{\top}M^{(\ell)}\|_{F} \leq B_{P}B_{H}B_{M}
\|\widetilde{G}^{(t,\ell)}\|_{F} = \|[P_{in}\widetilde{H}^{(t,\ell-1)}+P_{bd}\widetilde{H}^{(t-1,\ell-1)}]^{\top}\widetilde{M}^{(t,\ell)}\|_{F} \leq B_{P}B_{H}B_{M}

Because the gradient matrices are bounded, the weight change is bounded.

Corollary A.2.

For any $t$ and $\ell$, $\|\widetilde{W}^{(t,\ell)} - \widetilde{W}^{(t-1,\ell)}\|_{F} \leq B_{\Delta W} = \eta B_{G}$, where $\eta$ is the learning rate.

Now we can analyze the changes of intermediate variables.

Lemma A.3.

For any $t$ and $\ell$, we have $\|\widetilde{Z}^{(t,\ell)} - \widetilde{Z}^{(t-1,\ell)}\|_{F} \leq B_{\Delta Z}$ and $\|\widetilde{H}^{(t,\ell)} - \widetilde{H}^{(t-1,\ell)}\|_{F} \leq B_{\Delta H}$, where $B_{\Delta Z} = \sum_{i=0}^{L-1} C_{\sigma}^{i}B_{P}^{i+1}B_{W}^{i}B_{H}B_{\Delta W}$ and $B_{\Delta H} = C_{\sigma}B_{\Delta Z}$.

Proof.

When $\ell = 0$, $\|\widetilde{H}^{(t,0)} - \widetilde{H}^{(t-1,0)}\|_{F} = \|X - X\|_{F} = 0$. Now we consider $\ell > 0$ by induction.

\|\widetilde{Z}^{(t,\ell)} - \widetilde{Z}^{(t-1,\ell)}\|_{F} = \|(P_{in}\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} + P_{bd}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)}) - (P_{in}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t-1,\ell)} + P_{bd}\widetilde{H}^{(t-2,\ell-1)}\widetilde{W}^{(t-1,\ell)})\|_{F}
= \|P_{in}(\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} - \widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t-1,\ell)}) + P_{bd}(\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)} - \widetilde{H}^{(t-2,\ell-1)}\widetilde{W}^{(t-1,\ell)})\|_{F}

Then we analyze the bound of $\|\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} - \widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t-1,\ell)}\|_{F}$, which is denoted by $s^{(t,\ell)}$.

s^{(t,\ell)} \leq \|\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} - \widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t-1,\ell)}\|_{F} + \|\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t-1,\ell)} - \widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t-1,\ell)}\|_{F}
\leq B_{H}\|\widetilde{W}^{(t,\ell)} - \widetilde{W}^{(t-1,\ell)}\|_{F} + B_{W}\|\widetilde{H}^{(t,\ell-1)} - \widetilde{H}^{(t-1,\ell-1)}\|_{F}

According to Corollary A.2, $\|\widetilde{W}^{(t,\ell)} - \widetilde{W}^{(t-1,\ell)}\|_{F} \leq B_{\Delta W}$. By induction, $\|\widetilde{H}^{(t,\ell-1)} - \widetilde{H}^{(t-1,\ell-1)}\|_{F} \leq \sum_{i=0}^{\ell-2} C_{\sigma}^{i+1}B_{P}^{i+1}B_{W}^{i}B_{H}B_{\Delta W}$. Combining these inequalities,

s^{(t,\ell)} \leq B_{H}B_{\Delta W} + \sum_{i=1}^{\ell-1} C_{\sigma}^{i}B_{P}^{i}B_{W}^{i}B_{H}B_{\Delta W}

Plugging it back, we have

\|\widetilde{Z}^{(t,\ell)} - \widetilde{Z}^{(t-1,\ell)}\|_{F} \leq \|P_{in}(\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)} - \widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t-1,\ell)}) + P_{bd}(\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)} - \widetilde{H}^{(t-2,\ell-1)}\widetilde{W}^{(t-1,\ell)})\|_{F}
\leq B_{P}\left(B_{H}B_{\Delta W} + \sum_{i=1}^{\ell-1} C_{\sigma}^{i}B_{P}^{i}B_{W}^{i}B_{H}B_{\Delta W}\right)
= \sum_{i=0}^{\ell-1} C_{\sigma}^{i}B_{P}^{i+1}B_{W}^{i}B_{H}B_{\Delta W}

\|\widetilde{H}^{(t,\ell)} - \widetilde{H}^{(t-1,\ell)}\|_{F} = \|\sigma(\widetilde{Z}^{(t,\ell)}) - \sigma(\widetilde{Z}^{(t-1,\ell)})\|_{F} \leq C_{\sigma}\|\widetilde{Z}^{(t,\ell)} - \widetilde{Z}^{(t-1,\ell)}\|_{F} \leq C_{\sigma}B_{\Delta Z}

Lemma A.4.

\|\widetilde{J}^{(t,\ell)}-\widetilde{J}^{(t-1,\ell)}\|_{F}\leq B_{\Delta J} where

B_{\Delta J}=\max\limits_{2\leq\ell\leq L}(B_{P}B_{W}C_{\sigma})^{L-\ell}B_{\Delta H}L_{\text{loss}}+(B_{M}B_{\Delta W}+L_{\sigma}B_{J}B_{\Delta Z}B_{W})\sum\limits_{i=0}^{L-3}B_{P}^{i+1}B_{W}^{i}C_{\sigma}^{i}
Proof.

For the last layer (\ell=L), \|\widetilde{J}^{(t,L)}-\widetilde{J}^{(t-1,L)}\|_{F}\leq L_{\text{loss}}\|\widetilde{H}^{(t,L)}-\widetilde{H}^{(t-1,L)}\|_{F}\leq L_{\text{loss}}B_{\Delta H}. For the case of \ell<L, we prove the lemma by induction.

\|\widetilde{J}^{(t,\ell-1)}-\widetilde{J}^{(t-1,\ell-1)}\|_{F}= \left\|\left(P_{in}^{\top}\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}+P_{bd}^{\top}\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}\right)\right.
\left.-\left(P_{in}^{\top}\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}+P_{bd}^{\top}\widetilde{M}^{(t-2,\ell)}[\widetilde{W}^{(t-2,\ell)}]^{\top}\right)\right\|_{F}
\leq \left\|P_{in}^{\top}\left(\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}-\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}\right)\right\|_{F}
+\left\|P_{bd}^{\top}\left(\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t-2,\ell)}[\widetilde{W}^{(t-2,\ell)}]^{\top}\right)\right\|_{F}

We denote \left\|\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}-\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}\right\|_{F} by s^{(t,\ell)} and analyze its bound.

s^{(t,\ell)}\leq \left\|\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}\right\|_{F}
+\left\|\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}\right\|_{F}
\leq B_{M}\left\|[\widetilde{W}^{(t,\ell)}]^{\top}-[\widetilde{W}^{(t-1,\ell)}]^{\top}\right\|_{F}+B_{W}\left\|\widetilde{M}^{(t,\ell)}-\widetilde{M}^{(t-1,\ell)}\right\|_{F}

According to Corollary A.2, \left\|[\widetilde{W}^{(t,\ell)}]^{\top}-[\widetilde{W}^{(t-1,\ell)}]^{\top}\right\|_{F}\leq B_{\Delta W}. For the second term,

\|\widetilde{M}^{(t,\ell)}-\widetilde{M}^{(t-1,\ell)}\|_{F}
=\|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\widetilde{J}^{(t-1,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})\|_{F}
\leq\|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})\|_{F}+\|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})-\widetilde{J}^{(t-1,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})\|_{F}
\leq B_{J}\|\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})\|_{F}+C_{\sigma}\|\widetilde{J}^{(t,\ell)}-\widetilde{J}^{(t-1,\ell)}\|_{F} (5)

According to the smoothness of \sigma and Lemma A.3, \|\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\sigma^{\prime}(\widetilde{Z}^{(t-1,\ell)})\|_{F}\leq L_{\sigma}B_{\Delta Z}. By induction,

\|\widetilde{J}^{(t,\ell)}-\widetilde{J}^{(t-1,\ell)}\|_{F}
\leq(B_{P}B_{W}C_{\sigma})^{L-\ell}B_{\Delta H}L_{\text{loss}}+(B_{M}B_{\Delta W}+L_{\sigma}B_{J}B_{\Delta Z}B_{W})\sum\limits_{i=0}^{L-\ell-1}B_{P}^{i+1}B_{W}^{i}C_{\sigma}^{i}

As a result,

s^{(t,\ell)}\leq B_{M}B_{\Delta W}+B_{W}B_{J}L_{\sigma}B_{\Delta Z}+B_{W}C_{\sigma}\|\widetilde{J}^{(t,\ell)}-\widetilde{J}^{(t-1,\ell)}\|_{F}
\leq(B_{M}B_{\Delta W}+B_{W}B_{J}L_{\sigma}B_{\Delta Z})+B_{P}^{L-\ell}B_{W}^{L-\ell+1}C_{\sigma}^{L-\ell+1}B_{\Delta H}L_{\text{loss}}
+(B_{M}B_{\Delta W}+L_{\sigma}B_{J}B_{\Delta Z}B_{W})\sum\limits_{i=1}^{L-\ell}B_{P}^{i}B_{W}^{i}C_{\sigma}^{i}
\leq B_{P}^{L-\ell}B_{W}^{L-\ell+1}C_{\sigma}^{L-\ell+1}B_{\Delta H}L_{\text{loss}}
+(B_{M}B_{\Delta W}+L_{\sigma}B_{J}B_{\Delta Z}B_{W})\sum\limits_{i=0}^{L-\ell}B_{P}^{i}B_{W}^{i}C_{\sigma}^{i}
\|\widetilde{J}^{(t,\ell-1)}-\widetilde{J}^{(t-1,\ell-1)}\|_{F}\leq \left\|P_{in}^{\top}\left(\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}-\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}\right)\right\|_{F}
+\left\|P_{bd}^{\top}\left(\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t-2,\ell)}[\widetilde{W}^{(t-2,\ell)}]^{\top}\right)\right\|_{F}
\leq B_{P}s^{(t,\ell)}
\leq(B_{P}B_{W}C_{\sigma})^{L-\ell+1}B_{\Delta H}L_{\text{loss}}
+(B_{M}B_{\Delta W}+L_{\sigma}B_{J}B_{\Delta Z}B_{W})\sum\limits_{i=0}^{L-\ell}B_{P}^{i+1}B_{W}^{i}C_{\sigma}^{i}

From Equation 5, we can also conclude that

Corollary A.5.

\|\widetilde{M}^{(t,\ell)}-\widetilde{M}^{(t-1,\ell)}\|_{F}\leq B_{\Delta M} with B_{\Delta M}=B_{J}L_{\sigma}B_{\Delta Z}+C_{\sigma}B_{\Delta J}.

A.3 Bounded Feature Error and Gradient Error

In this subsection, we bound the difference between the generic GCN and PipeGCN under the same parameter set, i.e., \theta=\widetilde{\theta}^{(t)}.

Lemma A.6.

\|\widetilde{Z}^{(t,\ell)}-Z^{(\ell)}\|_{F}\leq E_{Z} and \|\widetilde{H}^{(t,\ell)}-H^{(\ell)}\|_{F}\leq E_{H}, where E_{Z}=B_{\Delta H}\sum\limits_{i=1}^{L}C_{\sigma}^{i-1}B_{W}^{i}B_{P}^{i} and E_{H}=B_{\Delta H}\sum\limits_{i=1}^{L}(C_{\sigma}B_{W}B_{P})^{i}.

Proof.
\|\widetilde{Z}^{(t,\ell)}-Z^{(\ell)}\|_{F} =\|(P_{in}\widetilde{H}^{(t,\ell-1)}\widetilde{W}^{(t,\ell)}+P_{bd}\widetilde{H}^{(t-1,\ell-1)}\widetilde{W}^{(t,\ell)})-(PH^{(\ell-1)}W^{(\ell)})\|_{F}
\leq\|(P_{in}\widetilde{H}^{(t,\ell-1)}+P_{bd}\widetilde{H}^{(t-1,\ell-1)}-PH^{(\ell-1)})W^{(\ell)}\|_{F}
\leq B_{W}\|P(\widetilde{H}^{(t,\ell-1)}-H^{(\ell-1)})+P_{bd}(\widetilde{H}^{(t-1,\ell-1)}-\widetilde{H}^{(t,\ell-1)})\|_{F}
\leq B_{W}B_{P}\left(\|\widetilde{H}^{(t,\ell-1)}-H^{(\ell-1)}\|_{F}+B_{\Delta H}\right)

By induction, we assume that \|\widetilde{H}^{(t,\ell-1)}-H^{(\ell-1)}\|_{F}\leq B_{\Delta H}\sum\limits_{i=1}^{\ell-1}(C_{\sigma}B_{W}B_{P})^{i}. Therefore,

\|\widetilde{Z}^{(t,\ell)}-Z^{(\ell)}\|_{F} \leq B_{W}B_{P}B_{\Delta H}\sum\limits_{i=0}^{\ell-1}(C_{\sigma}B_{W}B_{P})^{i}
=B_{\Delta H}\sum\limits_{i=1}^{\ell}C_{\sigma}^{i-1}B_{W}^{i}B_{P}^{i}
\|\widetilde{H}^{(t,\ell)}-H^{(\ell)}\|_{F} =\|\sigma(\widetilde{Z}^{(t,\ell)})-\sigma(Z^{(\ell)})\|_{F}
\leq C_{\sigma}\|\widetilde{Z}^{(t,\ell)}-Z^{(\ell)}\|_{F}
\leq B_{\Delta H}\sum\limits_{i=1}^{\ell}(C_{\sigma}B_{W}B_{P})^{i}

Lemma A.7.

\|\widetilde{J}^{(t,\ell)}-J^{(\ell)}\|_{F}\leq E_{J} and \|\widetilde{M}^{(t,\ell)}-M^{(\ell)}\|_{F}\leq E_{M} with

E_{J}=\max\limits_{2\leq\ell\leq L}(B_{P}B_{W}C_{\sigma})^{L-\ell}L_{\text{loss}}E_{H}+B_{P}(B_{W}(B_{J}E_{Z}L_{\sigma}+B_{\Delta M})+B_{\Delta W}B_{M})\sum\limits_{i=0}^{L-3}(B_{P}B_{W}C_{\sigma})^{i}
E_{M}=C_{\sigma}E_{J}+L_{\sigma}B_{J}E_{Z}
Proof.

When \ell=L, \|\widetilde{J}^{(t,L)}-J^{(L)}\|_{F}\leq L_{\text{loss}}E_{H}. For any \ell, we assume that

\|\widetilde{J}^{(t,\ell)}-J^{(\ell)}\|_{F}\leq(B_{P}B_{W}C_{\sigma})^{L-\ell}L_{\text{loss}}E_{H}+U\sum\limits_{i=0}^{L-\ell-1}(B_{P}B_{W}C_{\sigma})^{i} (6)
\|\widetilde{M}^{(t,\ell)}-M^{(\ell)}\|_{F}\leq(B_{P}B_{W}C_{\sigma})^{L-\ell}C_{\sigma}L_{\text{loss}}E_{H}+UC_{\sigma}\sum\limits_{i=0}^{L-\ell-1}(B_{P}B_{W}C_{\sigma})^{i}+L_{\sigma}B_{J}E_{Z} (7)

where U=B_{P}(B_{W}B_{J}E_{Z}L_{\sigma}+B_{\Delta W}B_{M}+B_{W}B_{\Delta M}). We prove them by induction as follows.

\|\widetilde{M}^{(t,\ell)}-M^{(\ell)}\|_{F}
=\|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-J^{(\ell)}\circ\sigma^{\prime}(Z^{(\ell)})\|_{F}
\leq\|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(Z^{(\ell)})\|_{F}+\|\widetilde{J}^{(t,\ell)}\circ\sigma^{\prime}(Z^{(\ell)})-J^{(\ell)}\circ\sigma^{\prime}(Z^{(\ell)})\|_{F}
\leq B_{J}\|\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\sigma^{\prime}(Z^{(\ell)})\|_{F}+C_{\sigma}\|\widetilde{J}^{(t,\ell)}-J^{(\ell)}\|_{F}

Here \|\sigma^{\prime}(\widetilde{Z}^{(t,\ell)})-\sigma^{\prime}(Z^{(\ell)})\|_{F}\leq L_{\sigma}E_{Z}. With Equation 6,

\|\widetilde{M}^{(t,\ell)}-M^{(\ell)}\|_{F}\leq(B_{P}B_{W}C_{\sigma})^{L-\ell}C_{\sigma}L_{\text{loss}}E_{H}+UC_{\sigma}\sum\limits_{i=0}^{L-\ell-1}(B_{P}B_{W}C_{\sigma})^{i}+L_{\sigma}B_{J}E_{Z}

On the other hand,

\|\widetilde{J}^{(t,\ell-1)}-J^{(\ell-1)}\|_{F}
=\|P_{in}^{\top}\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}+P_{bd}^{\top}\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-P^{\top}M^{(\ell)}[W^{(\ell)}]^{\top}\|_{F}
=\|P^{\top}(\widetilde{M}^{(t,\ell)}-M^{(\ell)})[W^{(\ell)}]^{\top}+P_{bd}^{\top}(\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top})\|_{F}
\leq\|P^{\top}(\widetilde{M}^{(t,\ell)}-M^{(\ell)})[W^{(\ell)}]^{\top}\|_{F}+\|P_{bd}^{\top}(\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top})\|_{F}
\leq B_{P}B_{W}\|\widetilde{M}^{(t,\ell)}-M^{(\ell)}\|_{F}+B_{P}\|\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}\|_{F}

The first part is bounded by Equation 7. For the second part,

\|\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}\|_{F}
\leq\|\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}\|_{F}+\|\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}\|_{F}
\leq B_{\Delta W}B_{M}+B_{W}B_{\Delta M}

Therefore,

\|\widetilde{J}^{(t,\ell-1)}-J^{(\ell-1)}\|_{F}
\leq B_{P}B_{W}\|\widetilde{M}^{(t,\ell)}-M^{(\ell)}\|_{F}+B_{P}\|\widetilde{M}^{(t-1,\ell)}[\widetilde{W}^{(t-1,\ell)}]^{\top}-\widetilde{M}^{(t,\ell)}[\widetilde{W}^{(t,\ell)}]^{\top}\|_{F}
\leq(B_{P}B_{W}C_{\sigma})^{L-\ell+1}L_{\text{loss}}E_{H}+U\sum\limits_{i=1}^{L-\ell}(B_{P}B_{W}C_{\sigma})^{i}+U
=(B_{P}B_{W}C_{\sigma})^{L-\ell+1}L_{\text{loss}}E_{H}+U\sum\limits_{i=0}^{L-\ell}(B_{P}B_{W}C_{\sigma})^{i}

Lemma A.8.

\|\widetilde{G}^{(t,\ell)}-G^{(\ell)}\|_{F}\leq E_{G} where E_{G}=B_{P}(B_{H}E_{M}+B_{M}E_{H})

Proof.
\|\widetilde{G}^{(t,\ell)}-G^{(\ell)}\|_{F}
= \left\|[P_{in}\widetilde{H}^{(t,\ell-1)}+P_{bd}\widetilde{H}^{(t-1,\ell-1)}]^{\top}\widetilde{M}^{(t,\ell)}-[PH^{(\ell-1)}]^{\top}M^{(\ell)}\right\|_{F}
\leq \left\|[P_{in}\widetilde{H}^{(t,\ell-1)}+P_{bd}\widetilde{H}^{(t-1,\ell-1)}]^{\top}\widetilde{M}^{(t,\ell)}-[PH^{(\ell-1)}]^{\top}\widetilde{M}^{(t,\ell)}\right\|_{F}
+\left\|[PH^{(\ell-1)}]^{\top}\widetilde{M}^{(t,\ell)}-[PH^{(\ell-1)}]^{\top}M^{(\ell)}\right\|_{F}
\leq B_{M}\|P(\widetilde{H}^{(t,\ell-1)}-H^{(\ell-1)})+P_{bd}(\widetilde{H}^{(t-1,\ell-1)}-\widetilde{H}^{(t,\ell-1)})\|_{F}+B_{P}B_{H}E_{M}
\leq B_{M}B_{P}(E_{H}+B_{\Delta H})+B_{P}B_{H}E_{M}

By summing up both sides from \ell=1 to \ell=L, we have

Corollary A.9.

\|\nabla\widetilde{\mathcal{L}}(\theta)-\nabla\mathcal{L}(\theta)\|_{2}\leq E_{\text{loss}} where E_{\text{loss}}=LE_{G}.
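Concretely, since the parameter gradient stacks the per-layer gradients, the triangle inequality together with Lemma A.8 gives (written out for completeness)

\|\nabla\widetilde{\mathcal{L}}(\theta)-\nabla\mathcal{L}(\theta)\|_{2}\leq\sum\limits_{\ell=1}^{L}\|\widetilde{G}^{(t,\ell)}-G^{(\ell)}\|_{F}\leq LE_{G}.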

According to the derivation of E_{\text{loss}}, we observe that E_{\text{loss}} contains a factor \eta (through B_{\Delta W}=\eta B_{G}). To simplify the expression of E_{\text{loss}}, we assume that B_{P}B_{W}C_{\sigma}\leq\frac{1}{2} without loss of generality, and rewrite Corollary A.9 as follows.

Corollary A.10.

\|\nabla\widetilde{\mathcal{L}}(\theta)-\nabla\mathcal{L}(\theta)\|_{2}\leq\eta E where

E=\frac{1}{8}LB_{P}^{3}B_{X}^{2}C_{\text{loss}}C_{\sigma}\left(3B_{X}C_{\sigma}^{2}L_{\text{loss}}+6B_{X}C_{\text{loss}}L_{\sigma}+10C_{\text{loss}}C_{\sigma}^{2}\right)

A.4 Proof of the Main Theorem

We first introduce a lemma before the proof of our main theorem.

Lemma A.11 (Lemma 1 in (Cong et al., 2021)).

An L-layer GCN is L_{f}-Lipschitz smooth, i.e., \|\nabla\mathcal{L}(\theta_{1})-\nabla\mathcal{L}(\theta_{2})\|_{2}\leq L_{f}\|\theta_{1}-\theta_{2}\|_{2}.

Now we prove the main theorem.

Theorem A.12 (Convergence of PipeGCN, formal).

Under Assumptions A.1, A.2, and A.3, we can derive the following by choosing a learning rate \eta=\frac{\sqrt{\varepsilon}}{E} and a number of training iterations T=(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{*}))E\varepsilon^{-\frac{3}{2}}:

\frac{1}{T}\sum\limits_{t=1}^{T}\|\nabla\mathcal{L}(\theta^{(t)})\|_{2}^{2}\leq 3\varepsilon

where E is defined in Corollary A.10, \varepsilon>0 is an arbitrarily small constant, \mathcal{L}(\cdot) is the loss function, and \theta^{(t)} and \theta^{*} represent the parameter vector at iteration t and the optimal parameter, respectively.

Proof.

With the smoothness of the model,

\mathcal{L}(\theta^{(t+1)}) \leq\mathcal{L}(\theta^{(t)})+\left<\nabla\mathcal{L}(\theta^{(t)}),\theta^{(t+1)}-\theta^{(t)}\right>+\frac{L_{f}}{2}\|\theta^{(t+1)}-\theta^{(t)}\|_{2}^{2}
=\mathcal{L}(\theta^{(t)})-\eta\left<\nabla\mathcal{L}(\theta^{(t)}),\nabla\widetilde{\mathcal{L}}(\theta^{(t)})\right>+\frac{\eta^{2}L_{f}}{2}\|\nabla\widetilde{\mathcal{L}}(\theta^{(t)})\|_{2}^{2}

Letting \delta^{(t)}=\nabla\widetilde{\mathcal{L}}(\theta^{(t)})-\nabla\mathcal{L}(\theta^{(t)}) and \eta\leq 1/L_{f}, we have

\mathcal{L}(\theta^{(t+1)}) \leq\mathcal{L}(\theta^{(t)})-\eta\left<\nabla\mathcal{L}(\theta^{(t)}),\nabla\mathcal{L}(\theta^{(t)})+\delta^{(t)}\right>+\frac{\eta}{2}\|\nabla\mathcal{L}(\theta^{(t)})+\delta^{(t)}\|_{2}^{2}
\leq\mathcal{L}(\theta^{(t)})-\frac{\eta}{2}\|\nabla\mathcal{L}(\theta^{(t)})\|_{2}^{2}+\frac{\eta}{2}\|\delta^{(t)}\|_{2}^{2}
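The second inequality holds because the cross terms cancel when the squared norm is expanded; writing g=\nabla\mathcal{L}(\theta^{(t)}) for brevity (a short intermediate step included for completeness),

-\eta\left<g,g+\delta^{(t)}\right>+\frac{\eta}{2}\|g+\delta^{(t)}\|_{2}^{2}=-\eta\|g\|_{2}^{2}-\eta\left<g,\delta^{(t)}\right>+\frac{\eta}{2}\|g\|_{2}^{2}+\eta\left<g,\delta^{(t)}\right>+\frac{\eta}{2}\|\delta^{(t)}\|_{2}^{2}=-\frac{\eta}{2}\|g\|_{2}^{2}+\frac{\eta}{2}\|\delta^{(t)}\|_{2}^{2}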

From Corollary A.10, we know that \|\delta^{(t)}\|_{2}\leq\eta E. After rearranging the terms,

\|\nabla\mathcal{L}(\theta^{(t)})\|_{2}^{2}\leq\frac{2}{\eta}(\mathcal{L}(\theta^{(t)})-\mathcal{L}(\theta^{(t+1)}))+\eta^{2}E^{2}

Summing up from t=1 to T and taking the average,

\frac{1}{T}\sum\limits_{t=1}^{T}\|\nabla\mathcal{L}(\theta^{(t)})\|_{2}^{2} \leq\frac{2}{\eta T}(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{(T+1)}))+\eta^{2}E^{2}
\leq\frac{2}{\eta T}(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{*}))+\eta^{2}E^{2}

where \theta^{*} is the minimum point of \mathcal{L}(\cdot). By taking \eta=\frac{\sqrt{\varepsilon}}{E} and T=(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{*}))E\varepsilon^{-\frac{3}{2}} with an arbitrarily small constant \varepsilon>0, we have

\frac{1}{T}\sum\limits_{t=1}^{T}\|\nabla\mathcal{L}(\theta^{(t)})\|_{2}^{2}\leq 3\varepsilon
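For completeness, the constant 3\varepsilon comes from plugging the chosen \eta and T into the bound above:

\frac{2}{\eta T}(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{*}))=\frac{2(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{*}))}{\frac{\sqrt{\varepsilon}}{E}\cdot(\mathcal{L}(\theta^{(1)})-\mathcal{L}(\theta^{*}))E\varepsilon^{-\frac{3}{2}}}=2\varepsilon\qquad\text{and}\qquad\eta^{2}E^{2}=\frac{\varepsilon}{E^{2}}\cdot E^{2}=\varepsilon,

which sum to 3\varepsilon.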

Appendix B Training Time Breakdown of Full-Graph Training Methods

To understand why PipeGCN significantly boosts the training throughput over full-graph training methods, we provide the detailed time breakdown in Tab. 6 using the same model as Tab. 3 (4-layer GraphSAGE, 256 hidden units), in which “GCN” denotes the vanilla partition-parallel training illustrated in Fig. 1(a). We observe that PipeGCN greatly reduces the communication time.

Table 6: Epoch time breakdown of full-graph training methods on the Reddit dataset.
Method Total time (s) Compute (s) Communication (s) Reduce (s)
ROC (2 GPUs) 3.63 0.5 3.13 0.00
CAGNET (c=1, 2 GPUs) 2.74 1.91 0.65 0.18
CAGNET (c=2, 2 GPUs) 5.41 4.36 0.09 0.96
GCN (2 GPUs) 0.52 0.17 0.34 0.01
PipeGCN (2 GPUs) 0.27 0.25 0.00 0.02
ROC (4 GPUs) 3.34 0.42 2.92 0.00
CAGNET (c=1, 4 GPUs) 2.31 0.97 1.23 0.11
CAGNET (c=2, 4 GPUs) 2.26 1.03 0.55 0.68
GCN (4 GPUs) 0.48 0.07 0.40 0.01
PipeGCN (4 GPUs) 0.23 0.10 0.10 0.03

Appendix C Training Time Improvement Breakdown of PipeGCN

To understand the training time improvement offered by PipeGCN, we further break down the epoch time into three parts (intra-partition computation, inter-partition communication, and the reduce step for aggregating model gradients) and provide the results in Fig. 8. We observe that: 1) inter-partition communication dominates the training time in vanilla partition-parallel training (GCN); 2) PipeGCN (with or without smoothing) largely hides the communication overhead across different numbers of partitions and all datasets, e.g., the communication time is hidden completely in 2-partition Reddit and almost completely in 3-partition Yelp, which leads to the substantial reduction in training time; and 3) the proposed smoothing incurs only minimal overhead (i.e., a minor difference between PipeGCN and PipeGCN-GF). Lastly, we also notice that when the communication ratio is extremely large (85%+), PipeGCN hides the communication significantly but not completely (e.g., 10-partition ogbn-products), in which case we can employ compression and quantization techniques (Alistarh et al., 2017; Seide et al., 2014; Wen et al., 2017; Li et al., 2018a; Yu et al., 2018) from the area of general distributed SGD to further reduce the communication, as compression is orthogonal to our pipelining method. Besides compression, we can also increase the pipeline depth of PipeGCN, e.g., using two iterations of computation to hide one iteration of communication, which we leave to future work.

Figure 8: Training time breakdown of vanilla partition-parallel training (GCN), PipeGCN, and PipeGCN with smoothing (PipeGCN-GF).

Appendix D Maintaining Convergence Speed (Additional Experiments)

We provide additional convergence curves on Yelp in Fig. 9. We can see that PipeGCN and its variants maintain the convergence speed w.r.t. the number of epochs while substantially reducing the end-to-end training time.

Figure 9: The epoch-to-accuracy comparison on “Yelp” among the vanilla partition-parallel training (GCN) and PipeGCN variants (PipeGCN*), where PipeGCN and its variants achieve a similar convergence as the vanilla training (without staleness) but are twice as fast in terms of wall-clock time (see the Throughput improvement in Tab. 4 of the main content).

Appendix E Scaling GCN Training over Multiple GPU Servers

We also scale up PipeGCN training over multiple GPU servers (each of which contains AMD Radeon Instinct MI60 GPUs, an AMD EPYC 7642 CPU, and 48-lane PCIe 3.0 connections between CPU-GPU and GPU-GPU) networked with 10Gbps Ethernet.

The accuracy results of PipeGCN and its variants are summarized in Tab. 7:

Table 7: The accuracy of PipeGCN and its variants on Reddit.
#partitions (#nodes×#gpus) PipeGCN PipeGCN-F PipeGCN-G PipeGCN-GF
2 (1×2) 97.12% 97.09% 97.14% 97.12%
3 (1×3) 97.01% 97.15% 97.17% 97.14%
4 (1×4) 97.04% 97.10% 97.09% 97.10%
6 (2×3) 97.09% 97.12% 97.08% 97.10%
8 (2×4) 97.02% 97.06% 97.15% 97.03%
9 (3×3) 97.03% 97.08% 97.11% 97.08%
12 (3×4) 97.05% 97.05% 97.12% 97.10%
16 (4×4) 96.99% 97.02% 97.14% 97.12%

Furthermore, we provide PipeGCN’s speedup against vanilla partition-parallel training in Tab. 8:

Table 8: The speedup of PipeGCN and its variants against vanilla partition-parallel training on Reddit.
#nodes×#gpus GCN PipeGCN PipeGCN-G PipeGCN-F PipeGCN-GF
1×2 1.00× 1.16× 1.16× 1.16× 1.16×
1×3 1.00× 1.22× 1.22× 1.22× 1.22×
1×4 1.00× 1.29× 1.28× 1.29× 1.28×
2×2 1.00× 1.61× 1.60× 1.61× 1.60×
2×3 1.00× 1.64× 1.64× 1.64× 1.64×
2×4 1.00× 1.41× 1.42× 1.41× 1.37×
3×2 1.00× 1.65× 1.65× 1.65× 1.65×
3×3 1.00× 1.48× 1.49× 1.50× 1.48×
3×4 1.00× 1.35× 1.36× 1.35× 1.34×
4×2 1.00× 1.64× 1.63× 1.63× 1.62×
4×3 1.00× 1.38× 1.38× 1.38× 1.38×
4×4 1.00× 1.30× 1.29× 1.29× 1.29×

From the two tables above, we can observe that the PipeGCN family consistently maintains the accuracy of full-graph training, while improving the training throughput by 15%∼66% regardless of the machine settings and the number of partitions.

Appendix F Implementation Details

We discuss the details of the effective and efficient implementation of PipeGCN in this section.

First, to overlap communication with computation, a second cudaStream is required for communication in addition to the default cudaStream used for computation. To save memory buffers for communication, we also batch all communication (e.g., from different layers) into this second cudaStream. When the popular communication backend Gloo is used, we further parallelize the CPU-GPU transfer with the CPU-CPU transfer.
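The PyTorch-style sketch below illustrates this pattern (a minimal sketch, not the actual PipeGCN implementation; the buffer layout and helper names such as launch_boundary_exchange are hypothetical):

import torch
import torch.distributed as dist

# A dedicated CUDA stream so that boundary-feature communication can overlap
# with the computation issued on the default stream.
comm_stream = torch.cuda.Stream()

def launch_boundary_exchange(send_bufs, recv_bufs):
    """Post all sends/recvs of boundary features from the side stream.

    send_bufs[r] / recv_bufs[r] are pre-allocated tensors exchanged with rank r
    (hypothetical buffers; the entry for the local rank is unused).
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    works = []
    with torch.cuda.stream(comm_stream):
        # Ensure the kernels that produced the send buffers on the default
        # stream have finished before the buffers are read for communication.
        comm_stream.wait_stream(torch.cuda.current_stream())
        for r in range(world):
            if r == rank:
                continue
            works.append(dist.isend(send_bufs[r], dst=r))
            works.append(dist.irecv(recv_bufs[r], src=r))
    return works

def finish_boundary_exchange(works):
    """Block before the consumer touches the received (stale) features."""
    for w in works:
        w.wait()
    torch.cuda.current_stream().wait_stream(comm_stream)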

Second, when a dropout layer is used in the GCN model, it should be applied after communication. The dropout layer in PipeGCN needs to be handled carefully so that the dropout mask stays consistent between an input tensor and its corresponding gradient. If the input features passed through the dropout layer before being communicated, the dropout mask would have changed by the backward phase, so the gradients of masked values would enter the computation and introduce noise into the calculation of follow-up gradients. As a result, the dropout layer should only be applied after the boundary features are received.
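As a concrete illustration, the layer forward below applies dropout only after the received boundary features are concatenated with the local features (a simplified sketch; graph_aggregate is a hypothetical placeholder for the neighbor-aggregation primitive of the underlying GNN library, and the function name is illustrative only):

import torch
import torch.nn.functional as F

def pipegcn_layer_forward(local_feats, recv_boundary_feats, weight,
                          graph_aggregate, p_drop=0.5, training=True):
    # 1) Assemble the full layer input: local nodes plus the received
    #    (one-iteration-stale) boundary nodes.
    h = torch.cat([local_feats, recv_boundary_feats], dim=0)
    # 2) Apply dropout only after communication, so the same mask covers the
    #    exact tensor whose gradient later flows back through this layer.
    h = F.dropout(h, p=p_drop, training=training)
    # 3) Usual GCN computation: neighbor aggregation, then linear transform.
    h = graph_aggregate(h)
    return h @ weight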