Distributed Graph Neural Network Training: A Survey
Abstract.
Graph neural networks (GNNs) are a class of deep learning models that are trained on graphs and have been successfully applied in various domains. Despite the effectiveness of GNNs, it is still challenging for GNNs to scale to large graphs efficiently. As a remedy, distributed computing has become a promising solution for training large-scale GNNs, since it provides abundant computing resources. However, the dependency imposed by the graph structure increases the difficulty of achieving high-efficiency distributed GNN training, which suffers from massive communication and workload imbalance. In recent years, many efforts have been made on distributed GNN training, and an array of training algorithms and systems have been proposed. Yet, there is a lack of systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training, namely massive feature communication, the loss of model accuracy, and workload imbalance. Then we introduce a new taxonomy for the optimization techniques in distributed GNN training that address the above challenges. The new taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. In the end, we summarize existing distributed GNN systems for multi-GPUs, GPU clusters and CPU clusters, respectively, and discuss future directions of distributed GNN training.
1. Introduction
GNNs are powerful tools to handle problems modeled as graphs and have been widely adopted in various applications, including social networks (e.g., social spammer detection (Wu et al., 2020; Song et al., 2021), social network analysis (Tan et al., 2019)), bio-informatics (e.g., protein interface prediction (Fout et al., 2017), disease–gene association (Rao et al., 2021)), drug discovery (Li et al., 2021b; Bongini et al., 2021), traffic forecasting (Jiang and Luo, 2022), health care (Choi et al., 2020; Ahmedt-Aristizabal et al., 2021), recommendation (Fan et al., 2019; Wu et al., 2022; Huang et al., 2021a; Hu et al., 2020b), natural language processing (Zhong et al., 2020; Zhang et al., 2020d) and others (Sanchez-Gonzalez et al., 2018; Do et al., 2019; Zheng et al., 2020a, b; Zhang et al., 2021). By integrating graph structure information into deep learning models, GNNs can achieve significantly better results than traditional machine learning and data mining methods.
A GNN model generally contains multiple graph convolutional layers, where each vertex aggregates the latest states of its neighbors, updates its own state, and applies a neural network to the updated state. Taking the traditional graph convolutional network (GCN) as an example, in each layer, a vertex uses a sum function to aggregate the neighbor states and its own state, then applies a single-layer MLP to transform the new state. Such procedures are repeated $L$ times if the number of layers is $L$. The vertex states generated in the $L$-th layer are used by the downstream tasks, like node classification, link prediction, and so on. In the past years, many research works have made remarkable progress in the design of graph neural network models. Prominent models include GCN (Welling and Kipf, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), GIN (Xu et al., 2019), and many other application-specific GNN models (Zhang et al., 2020b, 2019a). To date, tens of surveys have reviewed GNN models (Zhang et al., 2022a; Wu et al., 2021b; Zhou et al., 2020; Xie et al., 2022b). On the other hand, to efficiently develop different GNN models, many GNN-oriented frameworks have been proposed based on various deep learning libraries (Liu et al., 2021a; Grattarola and Alippi, 2021; Battaglia et al., 2018; Fey and Lenssen, 2019; Cen et al., 2021; Wang et al., 2019b). Many new optimizations have been proposed to speed up GNN training, including GNN computation kernels (Tian et al., 2020; Chen et al., 2020; Huang et al., 2020; Rahman et al., 2021; Zhang et al., 2022b; Fu et al., 2022; Huang et al., 2021b), efficient programming models (Xie et al., 2022a; Wu et al., 2021a; Hu et al., 2020c), and full utilization of new hardware (Chen et al., 2022; Zeng and Prasanna, 2020; Zhou et al., 2021a; Gong et al., 2022). However, these frameworks and optimizations mainly focus on training GNNs on a single machine, while not paying much attention to the scalability of input graphs.
Nowadays, large-scale graph neural networks (Liu et al., 2022; Joshi, 2022) have become a hot topic because of the prevalence of massive graph data. It is common to have graphs with billions of vertices and trillions of edges, such as the social networks of Sina Weibo, WeChat, Twitter and Meta. However, most of the existing GNN models are only tested on small graph datasets, and they cannot process large graph datasets efficiently, if at all (Hu et al., 2020a). This is because GNN models are complex and require massive computation resources when handling large graphs. One line of work achieves large-scale GNNs by designing scalable GNN models, using simplification (Wu et al., 2019; He et al., 2020; Frasca et al., 2020), quantization (Tailor et al., 2020; Wang et al., 2021d; Bahri et al., 2021; Huang et al., 2022; Liu et al., 2021b; Wang et al., 2022a; Zhao et al., 2020; Wang et al., 2021c; Feng et al., 2020), sampling (You et al., 2020; Zeng et al., 2020; Chiang et al., 2019) and distillation (Yan et al., 2020; Zhang et al., 2020c; Deng and Zhang, 2021) to design efficient models. Another line of work adopts distributed computing for GNN training, a.k.a. distributed GNN training. When handling large graphs, the limited memory and computing resources of a single device (e.g., a GPU) become the bottleneck of large-scale GNN training, and distributed computing provides more computing resources (e.g., multi-GPUs, CPU clusters, etc.) to improve the training efficiency. However, previous systems target distributed graph processing (Malewicz et al., 2010; Low et al., 2012) and distributed deep learning (Li et al., 2020; Abadi et al., 2016) separately. The graph processing systems do not consider the acceleration of neural network operations, while the deep learning systems lack the ability to process graph data. Therefore, many efforts have been made in designing efficient distributed GNN training frameworks and systems (Wan et al., 2022b, a; Zhu et al., 2019; Zheng et al., 2020c; Jia et al., 2020).
In this survey, we focus on the works and specific techniques proposed for distributed GNN training. Distributed GNN training divides the whole workload of model training among a set of workers, and all the workers process the workload in parallel. However, due to the data dependency in GNNs, it is non-trivial to apply existing distributed machine learning methods (Wang et al., 2022b; Verbraeken et al., 2020) to GNNs, and many new techniques for optimizing the distributed GNN training pipeline have been proposed. Although there are many surveys (Zhang et al., 2022a; Wu et al., 2021b; Zhou et al., 2020) about GNNs, to the best of our knowledge, little effort has been made to systematically review the techniques for distributed GNN training. Recently, Besta et al. (Besta and Hoefler, 2022) reviewed only the parallel computing paradigms of GNNs, Abadal et al. (Abadal et al., 2021) surveyed GNN computing from algorithms to hardware accelerators, and Vatter et al. (Vatter et al., 2023) provided a comprehensive overview of the evolution of distributed systems for scalable GNN training.
To clearly organize the techniques for distributed GNN training, we introduce a general distributed GNN training pipeline which consists of three stages: data partition, batch generation, and GNN model training. These stages involve GNN-specific execution logic that includes graph processing and graph aggregation. In the context of this general pipeline, we discuss three main challenges of distributed GNN training which are caused by the data dependency in graph data and require techniques specifically designed for distributed GNN training. To help readers better understand the optimization techniques that address these challenges, we introduce a new taxonomy that classifies the techniques into four orthogonal categories: GNN data partition, GNN batch generation, GNN execution model and GNN communication protocol. This taxonomy not only covers the optimization techniques used in both mini-batch and full-graph distributed GNN training, but also covers the techniques from graph processing to model execution. We carefully review the existing techniques in each category and then describe 28 representative distributed GNN systems and frameworks from both industry and academia. Finally, we briefly discuss the future directions of distributed GNN training.
The contributions of this survey are as follows:
• This is the first survey focusing on the optimization techniques for efficient distributed GNN training, and it helps researchers quickly understand the landscape of distributed GNN training.
• We introduce a new taxonomy of distributed GNN training techniques by considering the life-cycle of end-to-end distributed GNN training. At a high level, the new taxonomy consists of four orthogonal categories: GNN data partition, GNN batch generation, GNN execution model and GNN communication protocol.
• We provide a detailed and comprehensive technical summary of each category in our new taxonomy.
• We review 28 representative distributed GNN training systems and frameworks from both industry and academia.
• We discuss the future directions of distributed GNN training.
Survey organization. In Section 2, we introduce the background of GNNs and discuss the differences between GNN training and traditional neural network training. In Section 3, we present the distributed GNN training pipeline, highlight the specific challenges faced by distributed GNN training and introduce our taxonomy of the techniques used in distributed GNN training. On the basis of the taxonomy, we discuss the techniques of distributed GNN training in detail, including data partition (Section 4), batch generation (Section 5), execution model (Section 6), and communication protocol (Section 7). In Section 8, we discuss various existing distributed GNN training systems. In the end, we unveil some promising future directions for distributed GNN training and conclude the survey.
2. Preliminaries of Graph Neural Networks
Graph & Graph partition. A graph consists of a vertex set and an edge set . The set of neighbors of a vertex is denoted by and the degree of is denoted by . In directed graph, the set of incoming and outgoing neighbors of a vertex is denoted by and , respectively, and the corresponding degree is denoted by and . An edge is represented by where , . The adjacent matrix of is represented by , where an entry of equals 1 when , otherwise . Each vertex in a graph has an initial feature vector , where is the dimension size of feature vectors, and the feature matrix of the graph is denoted by .
A graph can be partitioned in distributed settings. For an edge-cut graph partition, a graph $G$ is divided into $k$ partitions $G_i = (V_i, E_i)$, $i = 1, \ldots, k$, which satisfy $V = \bigcup_i V_i$ and $V_i \cap V_j = \emptyset$ for $i \neq j$. The end vertices of cross edges are called boundary vertices, and the other vertices in $G_i$ are inner vertices. For a vertex-cut graph partition, the above partitions should satisfy $E = \bigcup_i E_i$ and $E_i \cap E_j = \emptyset$ for $i \neq j$. A vertex may be replicated among partitions and the number of replications of a vertex is called the replication factor. When the replication factor of a vertex is larger than 1, the vertex is called a boundary vertex, and the other vertices in $G_i$ are inner vertices.
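As a minimal illustration of these definitions, the sketch below applies a hash-based edge-cut partition to an edge list and identifies the boundary vertices. It is our own illustrative example, not taken from any surveyed system, and it ignores feature assignment and load balancing.

```python
from collections import defaultdict

def edge_cut_partition(edges, num_parts):
    """Hash-based edge-cut partition: every vertex is owned by exactly one
    partition; an edge whose endpoints live in different partitions is a
    cross edge, and its endpoints become boundary vertices."""
    owner = lambda v: hash(v) % num_parts          # vertex -> partition id
    parts = defaultdict(lambda: {"vertices": set(), "boundary": set()})

    for u, v in edges:
        pu, pv = owner(u), owner(v)
        parts[pu]["vertices"].add(u)
        parts[pv]["vertices"].add(v)
        if pu != pv:                               # cross edge
            parts[pu]["boundary"].add(u)
            parts[pv]["boundary"].add(v)
    return parts

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
for pid, p in edge_cut_partition(edges, num_parts=2).items():
    print(pid, "vertices:", p["vertices"], "boundary:", p["boundary"])
```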
Graph Neural Networks (GNNs). Given a graph $G$ with adjacency matrix $A$ and feature matrix $X$ where each row is the initial feature vector of a vertex in the graph, the $l$-th layer in a GNN updates the vertex features by aggregating the features from the corresponding neighborhoods, and the process can be formalized in matrix view as below

(1) $H^{(l)} = \sigma\big(\hat{A} H^{(l-1)} W^{(l)}\big)$

where $H^{(l)}$ is the hidden embedding matrix and $H^{(0)} = X$, $W^{(l)}$ is the model weight matrix, $\hat{A}$ is a normalized adjacency matrix and $\sigma$ is a non-linear function, e.g., ReLU, Sigmoid, etc. Eq. 1 is the global view of GNN computation. The local view of GNN computation is the computation of a single vertex. Given a vertex $v$, the local computation of the $l$-th layer in the message passing schema (Gilmer et al., 2017) can be formalized as below

(2) $h_v^{(l)} = \phi\Big(h_v^{(l-1)},\ \bigoplus_{u \in N(v)} \big(h_u^{(l-1)}, e_{uv}^{(l-1)}\big)\Big)$

where $\bigoplus$ is an aggregation function, $\phi$ is an update function, and $e_{uv}^{(l)}$ is the hidden embedding of edge $(u, v)$ at the $l$-th layer.
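The two views can be made concrete with a short sketch. The NumPy code below is our own minimal example (not tied to any surveyed framework): it computes one GCN-style layer both in matrix form (Eq. 1) and vertex by vertex (Eq. 2), using sum aggregation and a ReLU update; for simplicity the adjacency matrix includes self-loops but skips degree normalization.

```python
import numpy as np

def gcn_layer_global(A_hat, H, W):
    """Global (matrix) view of Eq. 1: H_next = sigma(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)          # ReLU non-linearity

def gcn_layer_local(neighbors, H, W):
    """Local (vertex) view of Eq. 2 with sum aggregation and ReLU update."""
    H_next = np.zeros((H.shape[0], W.shape[1]))
    for v, nbrs in neighbors.items():
        agg = H[v] + sum(H[u] for u in nbrs)       # aggregate self + neighbor states
        H_next[v] = np.maximum(agg @ W, 0.0)       # update with a linear layer
    return H_next

# Tiny 3-vertex graph: edges 0-1 and 1-2 (undirected), self-loops in A_hat.
A_hat = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H = np.random.rand(3, 4)                            # input features, f = 4
W = np.random.rand(4, 2)                            # layer weights
assert np.allclose(gcn_layer_global(A_hat, H, W), gcn_layer_local(neighbors, H, W))
```

Both views produce identical embeddings; the global view maps naturally to sparse matrix multiplication, while the local view exposes the per-vertex aggregation that distributed systems must coordinate.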
In contrast to traditional neural networks, where the data samples are typically independent and identically distributed (i.i.d.), GNN introduces data dependencies due to the underlying graph structure. These dependencies arise from the fact that the hidden embedding of each vertex in a GNN is updated based on the embeddings of its neighbors at the previous layer, which is formalized by Eq. 1 in global view and Eq. 2 in local view, respectively. This dependency enables GNNs to capture the structural information in the graph. By leveraging the graph structure and capturing the dependencies between vertices and edges, GNNs are able to model complex relational data and perform tasks such as node classification and link prediction.
GNN Training Methods. To train a GNN model, a basic solution is to update the model parameters once in a single epoch; such a training method is called full-graph GNN training. However, full-graph GNN training is memory-intensive (Welling and Kipf, 2017) since it requires access to the whole training graph and cannot scale to large graphs. Alternatively, we may choose to update the GNN model parameters multiple times over the course of a single epoch. This is known as mini-batch GNN training. An epoch is divided into multiple iterations and each iteration updates the GNN model with a batch (a.k.a. a subset of the whole training dataset).
Compared to traditional deep neural network (DNN) training, GNN training has two essential differences, which are illustrated by an example in Figure 1. First, for the batch generation in mini-batch training, GNN constructs a batch with the $L$-hop neighbors involved, while traditional DNN training involves the selected training samples only. Because of the data dependency in GNN training, to compute the output embedding in the final layer (layer $L$) of a training vertex, we need all the embeddings of its neighbors from the previous layer (layer $L-1$). Recursively, in order to compute the embeddings of the neighboring vertices at layer $L-1$, the embeddings of its 2-hop neighbors at layer $L-2$ are required. Therefore, an exact mini-batch containing only one training vertex involves all its $L$-hop neighbors, along with their features, as the input of mini-batch training. A significant challenge arises when constructing such mini-batches: they can quickly become large and expand to include a substantial portion of the complete graph, even if only a few vertices are selected as the central training vertices. Some methods are proposed to mitigate this problem by sampling in the neighborhood so as to reduce the total size of the mini-batch, such as node-wise sampling (Hamilton et al., 2017; Chen et al., 2018b), layer-wise sampling (Chen et al., 2018a; Huang et al., 2018; Zou et al., 2019) and subgraph sampling (Cong et al., 2020; Zeng et al., 2020).
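The neighbor-explosion effect and the sampling remedy can be sketched as follows. This is a hedged, framework-agnostic example of node-wise fan-out sampling in the spirit of GraphSAGE (Hamilton et al., 2017); the function and variable names are our own.

```python
import random

def build_minibatch(adj, seeds, num_layers, fanout=None):
    """Expand the L-hop computation graph of a set of training (seed) vertices.
    With fanout=None the expansion is exact and suffers from neighbor
    explosion; otherwise at most `fanout` neighbors are kept per vertex/layer."""
    layers = [set(seeds)]
    frontier = set(seeds)
    for _ in range(num_layers):
        nxt = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            if fanout is not None and len(nbrs) > fanout:
                nbrs = random.sample(nbrs, fanout)
            nxt.update(nbrs)
        layers.append(nxt)
        frontier = nxt
    # Union over all layers = vertices whose features this mini-batch needs.
    return set().union(*layers)

random.seed(0)
adj = {v: random.sample(range(10_000), 10) for v in range(10_000)}  # 10 random out-neighbors per vertex
print("exact 2-hop batch:  ", len(build_minibatch(adj, {0}, num_layers=2)))
print("sampled 2-hop batch:", len(build_minibatch(adj, {0}, num_layers=2, fanout=3)))
```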
Second, for the model computation in both mini-batch training and full-graph training, an additional aggregation process is introduced to collect vertex information from the neighborhood. After the output embeddings of training vertices are computed, a backward propagation is performed to compute the gradient for each parameter. Similar to the forward computation, the gradients on each vertex are sent along the edges to its neighbors during backward propagation, incurring an additional scatter process compared to DNN training. The aggregation and scatter operations among data samples lead to a much more complex computation process compared to DNN training.
Distributed DNN Training. Distributed DNN training (Langer et al., 2020; Ko et al., 2021) is a solution for training large-scale DNNs. Parallelism and synchronization are two key components. 1) Parallelism. Data parallelism is a prevalent training paradigm in distributed DNN training. In data parallelism, the model is replicated across multiple devices or machines, with each replica processing a different subset of the training data. Computed gradients are exchanged and averaged to synchronize the model parameters, often utilizing communication operations like all-reduce (Chen et al., 2016). To handle large models that exceed the capacity of a single GPU, model parallelism is adopted, where each device processes a distinct part of the model during forward and backward propagation. 2) Synchronization. Distributed training can be categorized into synchronous and asynchronous training. In synchronous training, all workers complete a forward and backward pass before model updates occur. This ensures that all workers are using the same model parameters for computation. In asynchronous training, workers update the model parameters independently, so as to remove the global synchronization point between mini-batches. Based on the idea of asynchronous training, pipeline parallelism is proposed (Huang et al., 2019; Narayanan et al., 2019) to perform flexible mini-batch training and improve resource utilization, thus reducing the overall training time.
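As a reference point for the GNN-specific techniques discussed later, the snippet below sketches synchronous data parallelism on a toy linear model. It is our own simplification: the all-reduce is emulated by averaging gradients in plain NumPy rather than calling a real communication library.

```python
import numpy as np

def all_reduce_mean(grads_per_worker):
    """Emulated all-reduce: every worker ends up with the average gradient."""
    avg = sum(grads_per_worker) / len(grads_per_worker)
    return [avg.copy() for _ in grads_per_worker]

# Synchronous data parallelism on a toy linear model y = x @ w.
num_workers, lr = 4, 0.1
w = np.zeros(3)                                     # replicated model parameters
data = [(np.random.rand(8, 3), np.random.rand(8)) for _ in range(num_workers)]

for step in range(10):
    local_grads = []
    for X, y in data:                               # each worker: forward + backward
        pred = X @ w
        local_grads.append(2 * X.T @ (pred - y) / len(y))
    synced = all_reduce_mean(local_grads)           # synchronization barrier
    w -= lr * synced[0]                             # identical update on every replica
```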
Because DNN models are large, most efforts in distributed DNN training are devoted to managing the storage and synchronization of model parameters. Since GNN models typically have shallow network structures, the management of model parameters is trivial. However, the input graphs of GNNs can be extremely large. The specific data dependency introduced by the graph structure significantly impacts the computation process of distributed GNN training, resulting in a substantial communication overhead that differs from DNN training; this overhead becomes the main consideration and introduces new challenges (see Section 3.2).
3. Distributed GNN Training and Challenges
3.1. General Distributed GNN Training Pipeline
To better understand the general workflow of end-to-end distributed GNN training, we divide the training pipeline into three stages that are data partition, batch generation and GNN model training. The GNN model training stage is further divided into model execution (including the design of the execution model and communication protocol) and parameter update. Figure 2 visualizes the high-level abstraction of end-to-end distributed GNN training workflow.
Data partition. This is a preprocessing stage enabling distributed training. It distributes the input data (i.e., graph and features) among a set of workers. Considering that the training data in GNNs are dependent, the data partition stage becomes more complex than the one in traditional distributed machine learning. As shown in Figure 2, the cross-worker edges among the partitioned data (i.e., subgraphs) imply the data dependency. If we acknowledge the data dependency among partitions, the distributed training efficiency is reduced by communication; if we simply ignore the data dependency, the model accuracy is compromised. Therefore, data partition is a critical stage for the efficiency of end-to-end distributed GNN training.
GNN batch generation. This stage is only involved in distributed mini-batch training. In this stage, each worker generates a computation graph from the partitioned input graph and features. However, due to the data dependency, the computation graph generation is quite different from that of traditional deep learning models. The computation graph for the mini-batch training strategy may not be generated correctly without accessing remote input data. Remote data access makes distributed batch generation more complicated than that of traditional deep learning models.
GNN model training. This is the core stage for embedding computation and model updating. It is further divided into a model execution stage and a parameter update stage. The model execution stage is specifically designed for distributed GNN training, while the parameter update stage is consistent with that of traditional DNN model training. We further divide the model execution stage into execution model and communication protocol. For distributed mini-batch training, only the execution model is involved, which manages the scheduling of different training operators and how to overlap them efficiently with the batch generation stage. For distributed full-graph training, both the execution model and the communication protocol are included. The execution model involves the $L$-layer graph aggregation of GNN models, and the aggregation exhibits an irregular data access pattern. Furthermore, the graph aggregation in each layer needs to access the hidden embeddings and computed gradients of remote neighbors via the communication protocol, and the synchronization schema between layers should also be considered. Finally, for the parameter update stage, the existing techniques in classical distributed machine learning can be directly applied to distributed GNN training. In conclusion, the distributed GNN model training stage is more complicated than traditional DNN training, and needs careful design for both the execution model and the communication protocol.
3.2. Challenges in Distributed GNN Training
Due to the data dependency, it is non-trivial to train distributed GNNs efficiently. We summarize three major challenges as below.
Challenge #1: Massive feature communication. For distributed GNN training, massive feature communication is incurred in both the batch generation and GNN model training stages. When the GNN model is trained in a mini-batch manner, the batch generation needs to access the remote graph and features (operation ① in Figure 2) to create a batch for local training. Even in multi-GPU training with a centralized graph store (e.g., CPU memory), the computation graph generation incurs massive data movement from the graph store to workers (e.g., GPU memory). Existing empirical studies demonstrate that the cost of mini-batch construction becomes the bottleneck during end-to-end training (Zheng et al., 2022a; Jangda et al., 2021). When the GNN model is trained in a full-graph manner, the computation graph can be the same as the partitioned graph without communication. However, the graph aggregation in each layer needs to access the hidden features of the remote neighbors of vertices (operation ② in Figure 2), which causes massive hidden feature (or embedding) communication. In summary, both distributed mini-batch training and distributed full-graph training of GNN models suffer from massive feature communication.
Challenge #2: The loss of model accuracy. Mini-batch GNN training is more scalable than full-graph training, so it is adopted by most existing distributed GNN training systems. However, with the increase of model depth, exact mini-batch training suffers from the neighbor-explosion problem. A de-facto solution is to construct an approximated mini-batch by sampling or by ignoring cross-worker edges. Although the approximated mini-batches improve the training efficiency, they cannot guarantee the model convergence in theory (Jia et al., 2020). Therefore, for distributed mini-batch GNN training, we need to make a trade-off between model accuracy and training efficiency. In addition, there is a trend to develop full-graph distributed GNN training, which has a convergence guarantee and is able to achieve the same model accuracy as local training on a single worker.
Challenge #3: Workload imbalance. Workload balance is an intrinsic problem in distributed computing. However, the various workload characteristics of GNN models increase the difficulty of partitioning the training workload evenly among workers, because it is hard to model the GNN workload in a simple and unified way. Without formal cost models, the classical graph partitioning algorithms cannot be used to balance the GNN workload among workers. In addition, distributed mini-batch GNN training requires that each worker handles the same number of mini-batches with the same batch size (i.e., subgraph size), not simply balancing the number of vertices in subgraphs. In summary, it is easy to encounter workload imbalance when training GNNs in a distributed environment, thus causing workers to wait for each other and degrading the training efficiency.
3.3. Taxonomy of Distributed GNN Training Techniques
To realize distributed GNN training and optimize its efficiency by solving the above challenges, many new techniques have been proposed in the past years. Figure 3 illustrates the relationship between the challenges and the proposed techniques. Previous works typically present their own contributions as a component of their proposed systems or frameworks, and may lack a comprehensive discussion of the techniques used in the same stage of distributed GNN training, or the techniques addressing the same challenge. In this survey, we introduce a new taxonomy by organizing the techniques specifically designed for distributed GNN training on the basis of the stages in the end-to-end distributed training pipeline. With such a design, we group together similar techniques that optimize the same stage in the distributed GNN training pipeline and help readers fully understand the existing solutions for the different stages in distributed GNN training.
According to previous empirical studies, due to the data dependency, the bottleneck of distributed GNN training generally arises in the stages of data partition, batch generation and GNN model training, as shown in the pipeline. Furthermore, different training strategies (e.g., mini-batch training, full-graph training) bring in different workload patterns and result in different optimization techniques used in the batch generation stage and the model training stage. For example, the computation graph generation in the batch generation stage is important to mini-batch training, while the communication protocol is important to full-graph training. In consequence, our new taxonomy classifies the techniques specifically designed for distributed GNN training into four categories (i.e., GNN data partition, GNN batch generation, GNN execution model and GNN communication protocol), as shown in Figure 4. In the following, we give an overview of each category.
GNN data partition. In this category, we review the data partition techniques for distributed GNN training. The goal of data partition is to balance the workload and minimize the communication cost for a GNN workload. Since GNN training is an instance of distributed graph computing, many traditional graph partition methods can be directly used. However, they are not optimal for distributed GNN training because of the new characteristics of GNN workloads. Researchers have devoted much effort to designing GNN-friendly cost models which guide traditional graph partition methods. In addition, the graph and the features are two typical types of data in GNNs and both of them are partitioned. Some works decouple the features from the graph structure and partition them independently. In Section 4, we elaborate on the existing GNN data partition techniques.
GNN batch generation. In this category, we review the techniques of GNN batch generation for mini-batch distributed GNN training. The methods of mini-batch generation affect both the training efficiency and the model accuracy. Graph sampling is a popular approach to generate a mini-batch for large-scale GNN training. However, the standard graph sampling techniques do not consider the factors of distributed environments, and each sampler on a worker will frequently access data from other workers, incurring massive communication. Recently, several new GNN batch generation methods optimized for distributed settings have been introduced. We further classify them into distributed-sampling mini-batch generation and partition-based mini-batch generation. In addition, caching has been extensively studied to reduce communication during GNN batch generation. In Section 5, we elaborate on the existing GNN batch generation techniques.
GNN execution model. In this category, we review the execution models for both mini-batch and full-graph GNN training. In mini-batch GNN training, sampling and feature extraction are two main operations that dominate the total training time. To improve efficiency, different execution models are proposed to schedule the training stage of a mini-batch with the sampling and feature extraction stages, in order to fully utilize computing resources. In full-graph GNN training, the neighbors' states of each vertex are aggregated in the forward computation and the gradients are scattered back to the neighbors in the backward computation, leading to massive communication of remote vertices. Due to the data dependency and irregular computation pattern, traditional machine learning parallelism models (e.g., data parallelism, model parallelism, etc.) are not optimal for graph aggregation, especially when feature vectors are high-dimensional. We introduce the full-graph execution model from both the matrix view and the graph view. From the matrix view, we classify the execution of distributed SpMM into computation-only, communication-computation, and communication-computation-reduction models. From the graph view, we classify the execution models from two perspectives. From the perspective of the execution orders of graph operators, we classify the full-graph execution model into the one-shot execution model and the chunk-based execution model. From the perspective of the synchronization timeliness in graph aggregation, we classify the full-graph execution models into the synchronous execution model and the asynchronous execution model. Based on the removal of different synchronization points, the asynchronous execution model can be further categorized into two types. A detailed description of these techniques is presented in Section 6.
GNN communication protocol. In this category, we review the communication protocols of distributed full-graph parallel training. The graph aggregation of distributed full-graph training needs to access remote hidden embeddings because the neighbors of a vertex cannot always be stored locally after graph partitioning. Based on the synchronous and asynchronous execution models in graph aggregation, these communication protocols can also be categorized as synchronous communication protocols and asynchronous communication protocols, respectively. The former assumes that in each layer the model accesses the latest embeddings, while the latter assumes that in each layer the model is allowed to access stale (or historical) embeddings. A detailed description of various communication protocols is presented in Section 7.
4. GNN Data Partition
In this section, we review existing techniques of GNN data partition in distributed GNN training. Figure 5 describes the overview of the techniques. Considering that graphs and features are two typical types of data in GNNs, we classify partition methods into graph partition and feature partition. The optimization objectives are workload balance and the minimization of communication and computation, which aim at addressing challenges #1 and #3. In addition, the cost model is another critical component that captures the characteristics of GNN workloads. With a well-designed cost model, the workload on each partition can be accurately estimated, so a graph partition strategy can better address the challenges in GNNs by adopting a more accurate cost model. In the following, we first present various cost models, which are the basis of graph partition. Then we discuss graph partition and feature partition, respectively.
4.1. Cost model of GNN
The cost model estimates the computation and communication cost of a GNN workload. Therefore, building a cost model is an important technique that underpins the data partition. In general, for a graph analysis task, we use the number of vertices to estimate the computation cost and the number of cross-edges to estimate the communication cost (Karypis and Kumar, 1998). However, this simple method is not suitable for GNN tasks because their costs are influenced not only by the number of vertices and cross-edges, but also by the dimension of features, the number of layers, and the distribution of training vertices. Researchers have proposed several GNN-specific cost models, including the heuristics model, the learning-based model, and the operator-based model.
Heuristics model selects several graph metrics to estimate the cost via simple user-defined functions. Several heuristics models for GNN workloads have been introduced in the context of streaming graph partition (Stanton and Kliot, 2012; Xu et al., 2015), which assigns a vertex or block to a partition one by one. Such models define an affinity score for each vertex or block and the score helps the vertex or block select a proper partition.
Assume a GNN task is denoted by $(L, G, V_{train}, V_{val}, V_{test})$, where $L$ is the number of layers, $G$ is the graph, and $V_{train}$, $V_{val}$ and $V_{test}$ are the vertex sets for training, validation and test, respectively. The graph is partitioned into $k$ subgraphs. In the context of streaming graph partition, for each assignment of a vertex or block, let $V_i$ be the set of vertices that have already been assigned to partition $i$, and let $V_{train}^{i}$, $V_{val}^{i}$ and $V_{test}^{i}$ be the corresponding vertex sets belonging to partition $i$.
Lin et al. (Lin et al., 2020) define an affinity score vector with dimension $k$ for each training vertex $v$, in which each score represents the affinity of the vertex to a partition. Let $N_{in}^{L}(v)$ be the $L$-hop in-neighbor set of training vertex $v$; the score of $v$ with respect to partition $i$ is defined as below,

(3) $score_i(v) = \big|N_{in}^{L}(v) \cap V_i\big| \cdot \dfrac{|V_{train}|/k - |V_{train}^{i}|}{|V_i|}$
where $|V_{train}|/k$ is the expected number of training vertices per partition. This score function implicitly balances the number of training vertices among partitions. Liu et al. (Liu et al., 2023) define a similar affinity score with respect to a block (or subgraph) $B$, and the formal definition is

(4) $score_i(B) = \big|N(B) \cap V_i\big| \cdot \dfrac{|V_{train}|/k - |V_{train}^{i}|}{|V_i|}$
where $N(B)$ is the neighbor set of block $B$. Zheng et al. (Zheng et al., 2022a) define the affinity score of a block by considering all the training, validation and test vertices; the formula is

(5) $score_i(B) = \alpha \cdot CE(B, G_i) + \beta \cdot \Big(\tfrac{|V_{train}|}{k} - |V_{train}^{i}|\Big) + \gamma \cdot \Big(\tfrac{|V_{val}|}{k} - |V_{val}^{i}|\Big) + \delta \cdot \Big(\tfrac{|V_{test}|}{k} - |V_{test}^{i}|\Big)$

where $CE(B, G_i)$ is the number of cross-edges between $B$ and $G_i$, and $\alpha, \beta, \gamma, \delta$ are hyper-parameters manually set by users.
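To illustrate how such affinity scores drive streaming partitioning, the sketch below greedily assigns training vertices to the highest-scoring partition. The score combines a neighborhood-locality term with a balance term, in the spirit of Eq. 3, but the exact weighting is our own simplification rather than the formula used by any particular system.

```python
def stream_partition(train_vertices, in_neighbors, k):
    """Greedy streaming assignment of training vertices to k partitions.
    score(v, i) ~ (# already-assigned in-neighbors of v in partition i)
                  * (capacity left in partition i), a simplified affinity score."""
    parts = [set() for _ in range(k)]
    capacity = len(train_vertices) / k
    for v in train_vertices:
        scores = [
            (len(in_neighbors.get(v, set()) & parts[i]) + 1)   # locality term
            * max(capacity - len(parts[i]), 0.0)               # balance term
            for i in range(k)
        ]
        parts[scores.index(max(scores))].add(v)
    return parts

in_neighbors = {v: {(v - 1) % 20, (v - 2) % 20} for v in range(20)}
for i, p in enumerate(stream_partition(list(range(20)), in_neighbors, k=4)):
    print(f"partition {i}: {sorted(p)}")
```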
Learning-based model takes advantage of machine learning techniques to model the complex cost of GNN workloads. The basic idea is to manually extract features via feature engineering and apply classical machine learning to train the cost model. The learning-based model is able to estimate the cost using not only static graph structural information but also runtime statistics of GNN workloads, thus achieving a more accurate estimation than the heuristic models. Jia et al. (Jia et al., 2020) introduce a linear regression model for GNN computation cost estimation. The model estimates the computation cost of a single GNN layer with regard to any input graph $G$. For each vertex $v$ in the graph, they select five features $x_1(v), \ldots, x_5(v)$ (listed in Table 1), including three graph-structural features and two runtime features. The estimation model is formalized as below,

(6) $X_i(G) = \sum_{v \in G} x_i(v)$

(7) $cost(G, l) = \sum_{i=1}^{5} w_i^{l} \cdot X_i(G)$

where $w_i^{l}$ is a trainable parameter for layer $l$, $x_i(v)$ is the $i$-th feature of $v$, and $X_i(G)$ sums up the $i$-th feature of all vertices in $G$.
Table 1. Vertex features used in the linear regression cost model of (Jia et al., 2020).

| Type | Description |
|---|---|
| graph-structural feature | the vertex itself (constant 1) |
| graph-structural feature | the number of neighbors |
| graph-structural feature | continuity of neighbors |
| runtime feature | memory accesses to load neighbors |
| runtime feature | memory accesses to load the activations of all neighbors |
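A hedged sketch of how such a learning-based cost model could be fit is shown below: we collect per-graph feature sums (Eq. 6), measure (here: synthetically generate) per-layer runtimes, and fit the layer-wise weights with least squares. The feature generation and weights are illustrative placeholders, not the calibration procedure of (Jia et al., 2020).

```python
import numpy as np

def graph_features(vertex_features):
    """Per-graph features: the sum of the five per-vertex features (Eq. 6)."""
    return vertex_features.sum(axis=0)

# Simulated training data: per-vertex features of several graphs and the
# measured per-layer runtime of each graph (generated synthetically here).
rng = np.random.default_rng(0)
true_w = np.array([0.5, 1.2, 0.1, 2.0, 3.5])        # unknown "hardware" weights
graphs = [rng.random((rng.integers(50, 200), 5)) for _ in range(30)]
X = np.stack([graph_features(g) for g in graphs])   # shape (num_graphs, 5)
y = X @ true_w + rng.normal(0, 0.01, len(graphs))   # measured runtimes

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # fit the cost model (Eq. 7)
new_graph = rng.random((120, 5))
print("predicted layer cost:", graph_features(new_graph) @ w_hat)
```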
Wang et al. (Wang et al., 2021e) use a polynomial function $g$ to estimate the computation cost of a vertex $v$ over a set of manually selected features. The formal definition is

(8) $cost(v) = g\big(\{(n_t, d_t)\}_{t=1}^{T}\big)$

where $T$ is the number of neighbor types (e.g., a metapath (Wang et al., 2021e)) defined by the GNN model, $n_t$ is the number of neighbors of the $t$-th type, and $d_t$ is the total feature dimension of the $t$-th type of neighbor instance (i.e., if a neighbor instance of the $t$-th type has $m_t$ vertices and each vertex has feature dimension $f$, then $d_t = m_t \cdot f$). Refer to the original work (Wang et al., 2021e) for detailed examples of the function $g$. The total computation cost of a subgraph is the sum of the estimated costs of the vertices in the subgraph.
Operator-based model enumerates the operators in a GNN workload and estimates the total computation cost by summing the cost of each operator. Zhao et al. (Zhao et al., 2021c) divide the computation of a GNN workload into forward computation and backward computation. The forward computation of a GNN layer is divided into aggregation, linear transformation, and activation function; while the backward computation of a GNN layer is divided into gradient computation towards the loss function, embedding gradient computation, and gradient multiplications. As a result, the costs of computing the embedding of vertex $v$ in layer $l$ in forward and backward propagation are estimated by $cost_f(v, l)$ and $cost_b(v, l)$, respectively.

(9) $cost_f(v, l) = c_1 \cdot |N(v)| \cdot d_{l-1} + c_2 \cdot d_{l-1} \cdot d_l + c_3 \cdot d_l$

(10) $cost_b(v, l) = c_4 \cdot |N(v)| \cdot d_l + c_5 \cdot d_{l-1} \cdot d_l$

where $d_l$ is the dimension of hidden embeddings in the $l$-th GNN layer, $|N(v)|$ is the number of neighbors of vertex $v$, and $c_1, \ldots, c_5$ are constant factors which can be learned by measuring the running time in practice. Finally, the computation cost of a mini-batch $B$ is computed by summing up the computation cost of all vertices involved in $B$ from layer 1 to layer $L$ as below,

(11) $cost(B) = \sum_{u \in B} \sum_{l=1}^{L} \sum_{v \in N_G^{l}(u)} \big(cost_f(v, l) + cost_b(v, l)\big)$

where $N_G^{l}(u)$ represents the vertices in graph $G$ that are $l$ hops away from vertex $u$.
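The operator-based model can be sketched as a simple accounting routine. The decomposition and constants below are illustrative placeholders chosen by us, not the calibrated values or exact formulas of (Zhao et al., 2021c).

```python
def forward_cost(num_neighbors, d_in, d_out, c=(1e-9, 1e-9, 1e-10)):
    """Aggregation + linear transformation + activation of one vertex, one layer."""
    agg, transform, act = c
    return agg * num_neighbors * d_in + transform * d_in * d_out + act * d_out

def backward_cost(num_neighbors, d_in, d_out, c=(1e-9, 1e-9)):
    """Gradient scatter to neighbors + weight-gradient computation of one vertex."""
    grad_emb, grad_w = c
    return grad_emb * num_neighbors * d_out + grad_w * d_in * d_out

def minibatch_cost(hop_sets, degrees, dims):
    """Sum per-vertex costs over all hop sets of the mini-batch.
    hop_sets[l] holds the vertices l hops away from the seed vertices;
    dims[l] is the embedding dimension produced by layer l (dims[0] = input)."""
    total = 0.0
    for l in range(1, len(dims)):
        for v in hop_sets[l]:
            total += forward_cost(degrees[v], dims[l - 1], dims[l])
            total += backward_cost(degrees[v], dims[l - 1], dims[l])
    return total

degrees = {v: 3 + (v % 5) for v in range(10)}
hop_sets = {1: {0, 1, 2}, 2: {3, 4, 5, 6}}
print(minibatch_cost(hop_sets, degrees, dims=[128, 64, 32]))
```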
4.2. Graph partition in GNN
GNN is a type of graph computation task. Classical graph partition methods can be directly applied to the GNN workloads. Many distributed GNN training algorithms and systems like AliGraph (Zhu et al., 2019), DistGNN (Md et al., 2021) adopt METIS (Karypis and Kumar, 1998), Vertex-cut (Gonzalez et al., 2012), Edge-cut (Karypis and Kumar, 1998), streaming graph partition (Stanton and Kliot, 2012), and other graph partition methods (Boman et al., 2013) to train the GNN models in different applications. Recently, Tripathy et al. (Tripathy et al., 2020) empirically studied the performance of 1-D, 2-D, 1.5-D, and 3-D graph partitioning methods with respect to GCN models. However, due to the unique characteristics of GNN workloads, classical graph partition methods are not optimal to balance the GNN workload while minimizing the communication cost.
In distributed mini-batch GNN training, to balance the workload, we need to balance the number of training vertices (i.e., the number of batches per epoch) among workers. As a result, new optimization objectives are introduced. Zheng et al. (Zheng et al., 2020c) formulate the graph partition problem in DistDGL as a multi-constraint partition problem that aims at balancing training/validation/test vertices/edges in each partition. They adopt the multi-constraint mechanism in METIS to realize the customized objective of graph partition for both homogeneous graphs and heterogeneous graphs (Zheng et al., 2022b), and further expand this mechanism to a two-level partitioning strategy that handles the situation with multi-GPUs and multi-machines and balances the computation. Lin et al. (Lin et al., 2020) apply streaming graph partition – Linear Deterministic Greedy (LDG) (Stanton and Kliot, 2012) – and use a new affinity score (Eq. 3) to balance the GNN workload. Furthermore, they ensure each vertex in a partition has its complete $L$-hop neighbors to avoid massive communication during sampling. Recently, block-based streaming graph partition methods (Liu et al., 2023; Zheng et al., 2022a) have been introduced for mini-batch distributed GNN training. They first use multi-source BFS to divide the graph into many small blocks, then apply greedy assignment heuristics with customized affinity scores (Eqs. 4 and 5) to partition the coarsened graph, and finally uncoarsen the block-based graph partition by mapping the blocks back to the vertices in the original graph. Besides workload balance, some works minimize the total computation cost. Zhao et al. (Zhao et al., 2021c) prove that the computation cost function (Eq. 11) is a submodular function and use METIS to partition the graph into cohesive mini-batches, achieving a provable approximation of the optimal computation cost.
In distributed full-graph GNN training, we need to balance the workload of each GNN layer among a set of workers while minimizing the embedding communication among workers. One approach is to leverage a heuristic method to model the computation workload and the communication overhead, and adopt the multi-constraint graph partitioning mechanism. Wan et al. (Wan et al., 2023a) model the computation workload as both the number of vertices and the number of edges, and treat them as two constraints. The communication overhead is modeled as the number of remote vertices in each worker. Therefore, with the multi-constraint graph partitioning, the computation workload can be balanced and the communication overhead can be reduced. To further balance the communication overhead, an iterative re-partitioning stage is adopted to swap vertices between the workers with the highest and lowest communication volumes. Demirci et al. (Demirci et al., 2023) model the computation workload as the total number of vertices' neighbors in each worker. To model the communication overhead under a P2P-based communication protocol (introduced in Section 7.1), they introduce a hypergraph to model the communication cost of each vertex as the connectivity of its corresponding net in the hypergraph. Based on the hypergraph model, existing methods (Catalyurek and Aykanat, 1999) for hypergraph partitioning are adopted to distribute vertices to different workers. Another approach is to apply learning-based cost models to model the complex cost of diverse GNN models. On top of the model-aware cost models, existing graph partition methods are adopted. For example, Jia et al. (Jia et al., 2020) utilize the linear-regression-based cost model (Eq. 7) and apply range graph partitioning to balance the workload estimated by the cost model. In addition, the range graph partition guarantees that each partition holds consecutively numbered vertices, thus reducing the data movement cost between CPU and GPU. Wang et al. (Wang et al., 2021e) use Eq. 8 to estimate the computation cost of a partition, and apply an application-driven graph partition method (Fan et al., 2020) to generate a workload-balancing plan, which adaptively migrates workload and minimizes the communication cost. Based on the vertex-cut partition, Hoang et al. (Hoang et al., 2021) leverage the 2D Cartesian vertex cut to improve the scalability.
4.3. Feature partition in GNN
As an important type of data in GNNs, features need to be partitioned as well. Most distributed GNN training solutions treat features as the vertex properties of a graph and partition them along with the graph partition. For example, if the graph is partitioned by the edge-cut method, then each vertex feature is stored in the partition where the corresponding vertex lies. In other words, the feature matrix is partitioned row-wise.
Considering the different processing patterns of features compared to the graph structure, a line of works partitions the feature matrix independently of the graph. When a graph is partitioned by the 2D partition method, Tripathy et al. (Tripathy et al., 2020) partition the feature matrix with the 2D partition method as well, while Vasimuddin et al. (Md et al., 2021) use a row-wise method to partition the feature matrix with replication and ensure each vertex has its complete feature locally. Gandhi et al. (Gandhi and Iyer, 2021) deeply analyze the communication pattern in distributed mini-batch GNN training and find that classical graph partition methods only reduce the communication cost of the first hop. Furthermore, sophisticated graph partitioning, e.g., METIS, is an expensive preprocessing step. They propose to partition the graph and the features separately. The graph is partitioned via random partition to avoid the expensive preprocessing. The input features are partitioned along the feature dimension (a.k.a. column-wise partition). In other words, each partition contains a sub-column of all vertices' features. With such a data partition strategy, they further design a new execution model introduced in Section 6.2.5 and train the GNN model with minimal communication as described in Section 7.2.2, especially when the input features have high dimension and the hidden features have low dimension. Similarly, Wolfe et al. (Wolfe et al., 2021) also use a column-wise feature partition method for distributed training of ultra-wide GCN models. Differently, in this method, the graph is partitioned with METIS and both input and intermediate features are partitioned along the feature dimension.
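The difference between the two feature partition layouts is easy to state in code. The NumPy snippet below is our own illustration: it splits the same feature matrix row-wise following an edge-cut graph partition, and column-wise along the feature dimension in the spirit of (Gandhi and Iyer, 2021); the vertex assignment is an arbitrary example.

```python
import numpy as np

X = np.arange(6 * 8).reshape(6, 8)                  # 6 vertices, 8-dim features

# Row-wise: features follow the graph partition; worker i owns the full
# feature vectors of the vertices assigned to it.
vertex_assignment = {0: [0, 2, 4], 1: [1, 3, 5]}    # result of an edge-cut partition
row_partitions = {w: X[vs, :] for w, vs in vertex_assignment.items()}

# Column-wise: every worker owns a slice of every vertex's feature vector.
num_workers = 2
col_partitions = {w: cols for w, cols in enumerate(np.array_split(X, num_workers, axis=1))}

print(row_partitions[0].shape)                      # (3, 8): some vertices, all dims
print(col_partitions[0].shape)                      # (6, 4): all vertices, some dims
```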
5. GNN Batch Generation
Mini-batch GNN training is a common approach to scaling GNNs to large graphs. Graph sampling is a de-facto tool to generate mini-batches in standalone mode. So far, many sampling-based GNNs (Hamilton et al., 2017; Chen et al., 2018a; Huang et al., 2018; Zou et al., 2019; Zeng et al., 2020; Cong et al., 2020) have been proposed, and they can be classified into vertex-wise sampling, layer-wise sampling and subgraph-wise sampling according to different types of sampling methods. Different batch generation methods influence both the training efficiency and training accuracy. To avoid the graph sampling becoming a bottleneck, there are some explorations for the efficient GNN dataloaders (Dong et al., 2021; Min et al., 2021; Bai et al., 2021; Deutsch et al., 2019).
In mini-batch distributed GNN training, the data dependence brings in massive communication for the batch generation process. To improve the training efficiency in distributed settings, several new GNN batch generation techniques that are specific to distributed training are proposed and address challenge #1 and challenge #2. As shown in Figure 6, one solution is generating a mini-batch via distributed sampling and the other is directly using the local partition (or subgraph) as the mini-batch. In the following, we describe these techniques.
5.1. Mini-batch generation with distributed sampling
Based on existing sampling-based GNNs, it is straightforward to obtain the distributed version by implementing distributed sampling. In other words, we use distributed graph sampling to create mini-batches over large graphs. Most existing distributed GNN systems, like AliGraph (Zhu et al., 2019), DistDGL (Zheng et al., 2020c), and BGL (Liu et al., 2023), follow this idea. However, in a distributed environment, a basic sampler on a worker will frequently access data from other workers, incurring massive communication. In multi-GPU settings, although graphs and features might be centralized and stored in CPU memory, each GPU needs to access the mini-batch from CPU memory via massive CPU-GPU data movement.
To reduce communication in distributed computing, caching is a common optimization mechanism for improving data locality. At the system level, many GNN-oriented cache strategies have been proposed. The basic idea of caching is to store frequently accessed remote vertices locally. Zheng et al. (Zheng et al., 2020c) propose to replicate the remote neighbors of boundary vertices locally for each partitioned graph in DistDGL and ensure each local vertex has its full neighbors, so that the mini-batch generator (e.g., sampler) can compute locally without communicating with other partitions. However, such a method only reduces the communication caused by direct (i.e., one-hop) neighbor access and is unaware of the access frequency. Zhu et al. (Zhu et al., 2019) introduce a metric called the $k$-th importance of a vertex $v$, denoted by $Imp^{(k)}(v)$. The metric is formally defined as $Imp^{(k)}(v) = D_{in}^{(k)}(v) / D_{out}^{(k)}(v)$, where $D_{in}^{(k)}(v)$ and $D_{out}^{(k)}(v)$ are the number of $k$-hop in-neighbors and out-neighbors of the vertex $v$. To balance the communication reduction and storage overhead, they only cache a vertex when its importance is larger than a threshold. Lin et al. (Lin et al., 2020) design a static GPU cache that stores the high out-degree vertices for multi-GPU training. The static cache can reduce the feature movement between CPU and GPU. Min et al. (Min et al., 2022) introduce weighted reverse PageRank (weighted R-PageRank) (Bar-Yossef and Mashiach, 2008), which incorporates the labeling status of the vertices into the reverse PageRank method to identify frequently accessed vertices. The hot vertex features with a high score are cached in GPU memory and the rest of the cold vertex features are placed in CPU memory. Furthermore, due to the different bandwidths of NVLink and PCIe, they divide the hot vertex features into the hottest ones and the next hottest ones. The hottest vertex features are replicated over multiple GPUs. The next hottest data is scattered over multiple GPUs and can be accessed from peer GPU memory. While the above strategies only focus on feature caching, the latest studies (Zhang et al., 2023; Sun et al., 2023) find that when GPU sampling is used, caching both the vertex features and the graph topology achieves better performance. To find the optimal cache allocation for graph topology and vertex features, Sun et al. (Sun et al., 2023) leverage a linear search strategy to obtain the ratio of the two cache sizes, and allocate each cache with the top hottest data. In contrast, Zhang et al. (Zhang et al., 2023) model this allocation problem as a knapsack problem, and design a greedy algorithm to find the optimal cache allocation strategy.
The above static cache strategies cannot adapt to various graph sampling workloads. Liu et al. (Liu et al., 2023) propose a dynamic cache mechanism that applies the FIFO policy by leveraging the distribution of vertex features. To improve the cache hit ratio, they introduce proximity-aware ordering for mini-batch generation. The order constrains the training vertex sequence to BFS order. To ensure model convergence, they introduce randomness to the BFS-based sequence by selecting a BFS sequence in a round-robin fashion and randomly shifting the BFS sequence. Yang et al. (Yang et al., 2022) propose a more robust cache approach, called a pre-sampling-based caching policy. Given a graph G, a sampling algorithm A, and a training set T, it first executes several pre-sampling stages to collect statistics, then uses the statistics to identify the access frequency (or hotness) of the data, and finally caches the data with the largest hotness. Kaler et al. (Kaler et al., 2023) adopt the same idea and propose an analysis-based cache policy that performs a more accurate prediction of the data hotness. A propagation model is designed in the analysis method, and it iteratively propagates and calculates the sampled probability of each vertex given G, A, and T. Both the pre-sampling-based and analysis-based caching policies consider all three factors that influence the data access frequency, and they are more robust than previous cache strategies for various GNN workloads.
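A pre-sampling-based caching policy can be approximated in a few lines: run the sampler for a handful of warm-up iterations, count how often each vertex's feature would be fetched, and pin the hottest vertices in the (GPU) cache. The sketch below loosely follows the idea of (Yang et al., 2022); the sampler, counters and cache capacity are our own illustrative choices.

```python
import random
from collections import Counter

def presample_hotness(adj, train_vertices, fanout, num_layers, warmup_iters, batch_size):
    """Estimate per-vertex access frequency by running the sampler for a few
    warm-up iterations before training starts (pre-sampling)."""
    hotness = Counter()
    for _ in range(warmup_iters):
        seeds = random.sample(train_vertices, batch_size)
        frontier = set(seeds)
        for _ in range(num_layers):
            nxt = set()
            for v in frontier:
                nbrs = adj.get(v, [])
                nxt.update(random.sample(nbrs, min(fanout, len(nbrs))))
            hotness.update(nxt)                     # these features would be fetched
            frontier = nxt
    return hotness

adj = {v: [(v + d) % 1000 for d in (1, 2, 3, 5, 8)] for v in range(1000)}
hotness = presample_hotness(adj, list(range(200)), fanout=2,
                            num_layers=2, warmup_iters=20, batch_size=16)
cache_capacity = 50
cached = {v for v, _ in hotness.most_common(cache_capacity)}   # pin hottest features
print("cached vertices:", len(cached))
```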
Besides the cache strategy, another approach is to design new distributed sampling techniques that are communication-efficient and maintain the model accuracy. A basic idea of communication-efficient sampling is to prioritize sampling local vertices, yet this introduces bias into the generated mini-batch. Inspired by the linear weighted sampling methods (Chen et al., 2018a; Zou et al., 2019), Jiang et al. (Jiang and Rumi, 2021) propose a skewed linear weighted sampling for neighbor selection. Concretely, the skewed sampling scales the sampling weights of local vertices by a factor $s$, and Jiang et al. theoretically prove that the training can achieve the same convergence rate as linear weighted sampling by properly selecting the value of $s$. Cai et al. propose a technique called collective sampling primitive (CSP) (Cai et al., 2023). CSP reduces communication costs by pushing the sampling tasks to remote workers instead of pulling the entire neighbor lists to the local worker. This is beneficial because the full neighbor list of a center vertex is often much larger than the sampled results. With CSP, sampling tasks for remote vertices are conducted on remote workers, and only the results are sent back, reducing communication overhead.
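The idea of skewed weighted sampling can be illustrated as follows: the sampling weight of a neighbor that is stored locally is multiplied by a skew factor greater than 1, so the sampler prefers vertices that do not require communication. The code is a hedged sketch of ours; the exact weight construction and the choice of the factor in (Jiang and Rumi, 2021) differ in detail.

```python
import random

def skewed_neighbor_sampling(neighbors, base_weight, local_vertices, fanout, skew=4.0):
    """Sample `fanout` neighbors with weights skewed toward local vertices."""
    weights = [
        base_weight[u] * (skew if u in local_vertices else 1.0)
        for u in neighbors
    ]
    # Weighted sampling without replacement (small fanout, so repeated draws).
    chosen, pool, w = [], list(neighbors), list(weights)
    for _ in range(min(fanout, len(pool))):
        idx = random.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(idx))
        w.pop(idx)
    return chosen

neighbors = list(range(10))
base_weight = {u: 1.0 for u in neighbors}           # e.g., uniform or degree-based weights
local_vertices = {0, 1, 2, 3}                       # neighbors stored on this worker
print(skewed_neighbor_sampling(neighbors, base_weight, local_vertices, fanout=4))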
5.2. Partition-based mini-batch generation
According to the abstraction of distributed GNN training pipeline in Figure 2, we know that the graph is partitioned at first. If we restrict each worker to only training the GNN model with the local partition, massive communication can be avoided. In such a training strategy, a partition is a mini-batch, and we call it a partition-based mini-batch. PSGD-PA (Ramezani et al., 2021) is a straightforward implementation of the above idea with a Parameter Server. In GraphTheta (Li et al., 2021a), the partitions are obtained via a community detection algorithm.
However, ignoring the cross edges among partitions leads to a loss of model accuracy. To improve the model accuracy, subgraph (i.e., partition) expansion is introduced. It preserves the local structural information of boundary vertices by replicating remote vertices locally. Xue et al. (Xue et al., 2022) use METIS to obtain a set of non-overlapping subgraphs, and then expand each subgraph by adding the one-hop neighbors of vertices that do not belong to the subgraph. Similarly, Angerd et al. (Angerd et al., 2020) use a breadth-first method to expand the subgraph by multiple hops. Zhao et al. (Zhao et al., 2021a) define a Monte-Carlo-based vertex importance measurement and use a depth-first sampling to expand the subgraph. They further introduce variance-based subgraph importance and weighted global consensus to reduce the impact of subgraph variance and improve the model accuracy. In addition, Ramezani et al. (Ramezani et al., 2021) introduce LLCG (Learn Locally, Correct Globally) to improve the model accuracy of partition-based mini-batch training. LLCG follows the Parameter Server architecture and periodically executes a global correction operation on a parameter server to update the global model. The empirical results demonstrate that the global correction operation can improve the model accuracy significantly.
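Subgraph expansion can be expressed compactly: starting from an unoverlapped partition, replicate the out-of-partition neighbors of boundary vertices as read-only halo vertices. The sketch below is our own illustration of this expansion step, independent of any particular system.

```python
def expand_partition(local_vertices, adj, hops=1):
    """Replicate the `hops`-hop out-of-partition neighbors (halo vertices) so
    that boundary vertices keep their local structural context."""
    local = set(local_vertices)
    halo, frontier = set(), set(local_vertices)
    for _ in range(hops):
        nxt = set()
        for v in frontier:
            nxt.update(adj.get(v, []))
        new_halo = nxt - local - halo               # vertices owned by other workers
        halo |= new_halo
        frontier = new_halo
    return local, halo                              # train on `local`, read `halo`

adj = {v: [(v + 1) % 12, (v - 1) % 12] for v in range(12)}
local, halo = expand_partition({0, 1, 2, 3}, adj, hops=1)
print("local:", sorted(local), "halo:", sorted(halo))
```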
6. GNN Execution Model
The execution model is the core of the GNN model training stage, as shown in Figure 2. It is responsible for scheduling different operations to achieve high training efficiency. For distributed mini-batch training, as the sampling and feature extraction operations in batch generation dominate the training efficiency and make the computation graph generation expensive, the execution model focuses on how to efficiently schedule these operations with the batch training operation. For distributed full-graph training, the computation graph execution is complicated due to the data dependency among workers, and the execution model focuses on the scheduling of graph aggregation and communication during computation graph execution.
In the following, due to the different execution manners of mini-batch and full-graph GNN training, we discuss their execution models respectively.
6.1. Mini-batch Execution Model
In distributed mini-batch GNN execution, sampling and feature extraction are two unique operations that often become the bottleneck. As shown in (Liu et al., 2023; Gandhi and Iyer, 2021), sampling and feature extraction take up 83%–99% of the overall training time. The low efficiency is caused by massive communication and resource contention. Therefore, several execution models for mini-batch execution have been proposed.
Conventional execution model performs the sampling and feature extraction of each computation graph (i.e., batch) in sequence before batch training is performed, as shown in Figure 7(a). When CPU-based graph sampling is applied, the sampling is executed on a CPU to construct the graph structure, feature extraction with respect to the graph is executed on a GPU with cache support, and training is executed on the GPU as well. When GPU-based sampling (Jangda et al., 2021; Pandey et al., 2020) is applied, all the operations in a pipeline are executed on a single device. Such a conventional execution model is popular among existing GNN training systems (Zheng et al., 2020c; Zhu et al., 2019; Lin et al., 2020). The optimizations in the context of the conventional execution model lie in the optimization of the sampling operation and the cache strategy for feature extraction, as introduced in Sections 4 and 5. However, the conventional execution model brings in resource contention. For example, processing graph sampling requests and subgraph construction compete for CPUs when CPU-based graph sampling is applied; graph sampling and feature extraction compete for GPU memory when GPU-based graph sampling is used.
The factored execution model uses a dedicated device or isolated resources to execute a single operation, as shown in Figure 7(b). Empirical studies demonstrate that during mini-batch GNN training, an operation shares a large amount of data across different epochs and preserves inter-epoch data locality. This observation motivates the factored execution model, which avoids resource contention and improves data locality. Yang et al. (Yang et al., 2022) introduce a factored execution model that eliminates resource contention on GPU memory. The new model assigns the sampling operator and the training operator to individual devices. More specifically, it uses dedicated GPUs for graph sampling, while the remaining GPUs are used for feature extraction and model training. With the help of the factored design, the new execution model is able to sample larger graphs, while the GPUs for feature extraction can cache more features. To improve CPU utilization when CPU-based graph sampling is used, Liu et al. (Liu et al., 2023) assign isolated resources (i.e., CPUs) to different operations and balance the execution time of each operation by adjusting the resource allocation. They introduce a profiling-based resource allocation method to improve CPU utilization and processing efficiency. The method first profiles the cost of each operator and then solves an optimization problem that minimizes the maximum completion time over all operations under resource constraints.
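As an illustration of the profiling-based idea, the sketch below uses a simple greedy heuristic to balance stage completion times. The stage names and costs are hypothetical, and the real method in the cited work solves a constrained optimization problem rather than this greedy approximation.

```python
def allocate_cpus(stage_costs, total_cpus):
    """Greedy sketch of profiling-based resource allocation: repeatedly give the next
    CPU to the stage with the largest estimated completion time (cost / #CPUs),
    which heuristically reduces the maximum stage time."""
    alloc = {stage: 1 for stage in stage_costs}          # every stage needs at least one CPU
    for _ in range(total_cpus - len(alloc)):
        bottleneck = max(stage_costs, key=lambda s: stage_costs[s] / alloc[s])
        alloc[bottleneck] += 1
    return alloc

# profiled per-epoch cost of each operation (arbitrary units, hypothetical numbers)
costs = {"sampling": 40.0, "feature_extraction": 90.0, "batch_assembly": 20.0}
print(allocate_cpus(costs, total_cpus=16))
```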

The operator-parallel execution model enables inter-batch parallelism and generates multiple computation graphs in parallel by mixing the execution of all operators in the pipeline, as shown in Figure 7(c). Zheng et al. (Zheng et al., 2022b) decompose the batch generation stage into multiple stages (e.g., graph sampling, neighborhood subgraph construction, and feature extraction), and each stage is treated as an operator. Dependencies exist among different operators, while there are no dependencies among mini-batches. Therefore, the operators of different mini-batches can be scheduled into a pipeline, and operators without dependencies can be executed in parallel. In this way, the batch generation of a later batch can be performed immediately after, or concurrently with, that of the former batch, instead of waiting for the former batch to finish the model training stage. The idea introduced by Cai et al. (Cai et al., 2023) is very similar: they use a sampler and a loader to conduct the sampling and feature extraction operations, respectively, and a queue to manage the execution of different mini-batches. With a subsequent trainer performing the computation graph execution, the training of mini-batches forms a pipeline. Several mini-batches can be executed simultaneously, improving GPU utilization and accelerating both the computation graph generation and execution stages. Zheng et al. (Zheng et al., 2022a) also decouple batch generation, and further model the dependencies among different operators as a DAG. Treating each operator as a node in the DAG, they propose a two-level scheduling strategy that considers both the scheduling among DAGs (coarse-grained) and among operators (fine-grained). With such two-level scheduling, the costs of CPU-based sampling and GPU-based feature extraction can be balanced to improve CPU utilization and reduce the end-to-end GNN training time.
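A minimal sketch of the operator-parallel idea is given below, using Python threads and bounded queues to let the sampler, loader, and trainer of different mini-batches run concurrently. The stage functions are hypothetical placeholders, and the sketch ignores multi-GPU placement and DAG-level scheduling.

```python
import queue
import threading

def pipeline_train(seed_batches, sample, extract_features, train_step, depth=4):
    """Operator-parallel execution: a sampler thread and a loader thread feed a trainer
    through bounded queues, so batch generation of later mini-batches overlaps with the
    training of earlier ones."""
    sampled, loaded = queue.Queue(depth), queue.Queue(depth)

    def sampler():
        for seeds in seed_batches:
            sampled.put(sample(seeds))
        sampled.put(None)                      # end-of-epoch marker

    def loader():
        while (sg := sampled.get()) is not None:
            loaded.put((sg, extract_features(sg)))
        loaded.put(None)

    threading.Thread(target=sampler, daemon=True).start()
    threading.Thread(target=loader, daemon=True).start()

    while (item := loaded.get()) is not None:  # trainer consumes ready mini-batches
        train_step(*item)

pipeline_train(
    seed_batches=[[0, 1], [2, 3], [4, 5]],
    sample=lambda seeds: {"seeds": seeds},
    extract_features=lambda sg: [0.0] * len(sg["seeds"]),
    train_step=lambda sg, fe: sum(fe),
)
```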
Execution model with pull-push parallelism. All of the above execution models first construct the subgraph by sampling and then extract features according to the subgraph. When the input features are high-dimensional, the communication caused by feature extraction becomes the bottleneck of the end-to-end GNN training pipeline. The expensive high-quality graph partition introduced in Section 4.2 is one option to reduce this communication. Gandhi et al. (Gandhi and Iyer, 2021) instead introduce a new execution model with pull-push parallelism. The new model partitions the graph using fast hash partitioning and the feature matrix using column-wise hash partitioning. It constructs the subgraph by pulling remote vertices and edges to the local worker, and then pushes the constructed subgraph to all workers. Each worker then extracts partial features and performs aggregation of these partial features for all the received subgraphs in parallel, forming the model parallel stage shown in Figure 7(d). A data parallel stage, which trains the subgraphs (i.e., mini-batches) individually on each worker, is then performed. Only light communication overhead is incurred between the model parallel stage and the data parallel stage to move the low-dimensional hidden embeddings. With the help of pull-push parallelism, the execution model replaces the expensive input feature movement with lightweight graph structure movement, thus improving the efficiency of batch generation and model training.
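The following NumPy sketch illustrates why pull-push parallelism avoids moving high-dimensional features: each simulated worker holds only a column slice of the input features and the matching rows of the first-layer weights, computes a partial hidden embedding for the pushed subgraph, and only these small partial embeddings need to be reduced. All names and sizes are hypothetical, and this is not P3's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, feat_dim, hid_dim, num_workers = 8, 12, 4, 3

X = rng.normal(size=(num_nodes, feat_dim))       # input features, column-partitioned
W = rng.normal(size=(feat_dim, hid_dim))         # first-layer weights, row-partitioned
A = (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)  # toy adjacency of a pushed subgraph

# column-wise hash partition of the features (and the matching rows of W)
cols = [np.arange(feat_dim)[np.arange(feat_dim) % num_workers == w] for w in range(num_workers)]

# model-parallel stage: each worker aggregates and transforms only its feature slice,
# producing a *partial* hidden embedding of the pushed subgraph
partial = [A @ X[:, c] @ W[c, :] for c in cols]

# only the small hidden embeddings are communicated and reduced,
# after which the data-parallel stage proceeds as usual
H = sum(partial)
assert np.allclose(H, A @ X @ W)
```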
6.2. Full-graph Execution Model
For distributed mini-batch GNN training, it is trivial to parallelize the execution of computation graphs (i.e., mini-batches) among the distributed workers, since each mini-batch is completely stored on a single worker. For distributed full-graph or large-batch GNN training, however, the execution of a computation graph is partitioned among a set of workers. Due to the data dependencies in the computation graph, it is non-trivial to achieve high training efficiency. So far, many computation graph execution models specific to GNN training have been proposed.
In the following, we first discuss the full-graph execution model from the perspective of vertex computation. We delve into the execution model from both the view of graph operations (Section 6.2.1) and the view of matrix multiplication (Section 6.2.2). From the graph operation view, GNN execution can be modeled with SAGA-NN (Ma et al., 2019), which is inspired by the traditional GAS model in graph processing. From the matrix multiplication view, the core of GNN execution can be modeled as SpMM (sparse matrix multiplication). We compare the two views in Section 6.2.3. Second, we focus on the update mode, which determines whether the vertex embeddings and model parameters used for computation are updated in time or with delay. We categorize the full-graph execution model into synchronous execution models (Section 6.2.4) and asynchronous execution models (Section 6.2.5). Note that the above two perspectives (i.e., vertex computation and update mode) are orthogonal, and a system simultaneously adopts one execution model from each of them.
6.2.1. Graph View.
From the view of graph processing, we use the most well-known programming model, SAGA-NN (Ma et al., 2019), for the following discussion. SAGA-NN divides the forward computation of a single GNN layer into four operators: Scatter (SC), ApplyEdge (AE), Gather (GA), and ApplyVertex (AV). SC and GA are two graph operations, in which vertex features are scattered along the edges and gathered to the target vertices, respectively. AE and AV may contain neural network (NN) operations, which operate directly on the edge features or on the aggregated features of the target vertices, respectively.
According to different computation paradigms of graph operators (i.e., SC and GA), we divide computation graph execution models into one-shot execution and chunk-based execution.

One-shot Execution. One-shot execution is the most common model adopted by GNN training. On each worker, all the neighbor features (or embeddings) needed for one graph aggregation of the local vertices are collected into local storage in advance, and the graph aggregation is processed in one shot, as illustrated in the upper part of Figure 8. Collecting remote information may require communication with remote workers, and the communication protocol will be discussed in Section 7. From the local view of graph aggregation, each target vertex is assigned to a specific worker. The worker gathers the neighbors' features of the target vertex and aggregates these features all at once. As an example, DistDGL (Zheng et al., 2020c) adopts this model straightforwardly. BNS-GCN (Wan et al., 2022a) leverages a sampling-based method that aggregates the features of randomly sampled neighbor vertices. The one-shot execution can also be easily incorporated with the asynchronous execution model (Wan et al., 2022b) (see Section 6.2.5), in which stale embeddings are collected for the graph aggregation.
Chunk-based Execution. The one-shot execution is straightforward, yet it incurs high memory consumption and heavy network communication. For example, storing all the required data in the local storage of a worker may cause an out-of-memory (OOM) problem, and transferring all the vertex features may lead to heavy network overhead. To address these problems, an alternative approach is to split the complete aggregation into several partial aggregations and accumulate the partial aggregations to obtain the final graph aggregation. To achieve this, the neighborhood of a vertex is grouped into several sub-neighborhoods, and we call each sub-neighborhood along with its vertex features or embeddings a chunk. The chunks can be processed sequentially (sequential chunk-based execution) or in parallel (parallel chunk-based execution), as illustrated in the lower part of Figure 8.
Under the sequential chunk-based execution, the partial aggregations are conducted sequentially and accumulated into the final aggregation result one after another. NeuGraph (Ma et al., 2019) uses a 2D partitioning method to generate several edge chunks, so the neighborhood of a vertex is partitioned accordingly into several sub-neighborhoods. It assigns each worker the aggregation job of certain vertices and feeds their edge chunks sequentially to compute the final result. SAR (Mostafa, 2022) uses an edge-cut partitioning method to create chunks, retrieves the chunks of a vertex from remote workers sequentially, and computes the partial aggregations locally. The sequential chunk-based execution effectively addresses the OOM problem, since each worker only needs to handle the storage and computation of one chunk at a time.
Under the parallel chunk-based execution, the partial aggregations of different chunks are computed in parallel. After all chunks finish their partial aggregation, communication is invoked to transfer the results, and the final aggregation result is computed in one step. Since the communication volume incurred by transferring the partial aggregation results is much smaller than that of transferring the complete chunks, the network communication overhead is reduced. DeepGalois (Hoang et al., 2021) straightforwardly adopts this execution model. DistGNN (Md et al., 2021) orchestrates the parallel chunk-based execution model with the asynchronous execution model to transfer the partial aggregations with staleness. FlexGraph (Wang et al., 2021e) further overlaps the communication of remote partial aggregations with the computation of the local partial aggregation to improve efficiency.
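The following NumPy sketch contrasts chunk-based execution with one-shot aggregation: the columns of the adjacency matrix are split into chunks, partial aggregations are accumulated one chunk at a time, and the accumulated result equals the one-shot result. The sizes and the sum aggregator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
num_nodes, dim, num_chunks = 6, 4, 3
H = rng.normal(size=(num_nodes, dim))                         # vertex embeddings
A = (rng.random((num_nodes, num_nodes)) < 0.5).astype(float)  # adjacency (sum aggregation)

# split the neighborhood of every vertex into chunks by partitioning the columns of A
chunk_cols = np.array_split(np.arange(num_nodes), num_chunks)

# sequential chunk-based execution: accumulate one partial aggregation at a time,
# so only one chunk of embeddings has to be resident in memory
acc = np.zeros((num_nodes, dim))
for cols in chunk_cols:
    acc += A[:, cols] @ H[cols]          # partial aggregation of one sub-neighborhood

# parallel chunk-based execution would compute the partial terms on different workers
# and reduce them afterwards; the result is identical to one-shot aggregation
assert np.allclose(acc, A @ H)
```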
6.2.2. Matrix View.
As described in Section 2, the matrix formulation of a GNN layer is given by $H^{(l+1)} = \sigma(AH^{(l)}W^{(l)})$, which involves SpMM since the adjacency matrix $A$ is sparse. In order to perform computations for the GNN model, three matrices (i.e., $A$, $H$, and $W$) need to be stored either locally or in a distributed manner. For GNNs on large graphs, it is impractical for a single worker (e.g., a GPU processor) to store all three matrices simultaneously. At least one matrix must be partitioned and distributed across different workers. The execution of distributed SpMM in GNNs can be divided into three stages: communication, computation, and reduction. The computation stage is the core of matrix multiplication, and it is performed locally on each worker. Prior to the computation stage, workers may require specific blocks of a matrix from other workers, necessitating a communication stage. Typically, the communication stage is accomplished through a broadcasting mechanism. After the computation stage, a worker may only possess partial results of its final matrix block. In such cases, a reduction stage is necessary to gather the remaining partial results from other workers. Workers retrieve and sum up these partial results to obtain the final matrix.
According to the existence of the above three stages, we categorize distributed SpMM into three execution models: computation-only, communication-computation, and communication-computation-reduction. The existence of the communication and reduction stages is determined by two factors: ➀ the partition strategy, i.e., how each matrix is partitioned and stored; ➁ the stationary strategy, i.e., which matrix is kept as the stationary matrix (no communication is incurred to move data in this matrix). The partition strategy determines whether a matrix is replicated on multiple workers or partitioned into blocks and distributed to different workers, and it also dictates the partitioning mechanism, such as 1D, 2D, etc. The stationary strategy determines the choice of the stationary matrix. During the execution of distributed SpMM, each partitioned block of the stationary matrix is pinned on the corresponding worker, and no communication is required for that matrix. We summarize how these factors impact the execution model in Table 2 and introduce them in detail in the following.
As discussed in Section 2, the weight parameter matrix $W$ is relatively small in GNN models. As a result, $W$ is fully replicated across all workers, adhering to the basic data parallelism principle. The subsequent discussion primarily focuses on the matrices $A$, $H$, and their product $AH$.
Computation-only Execution Model. In this model, both the communication and reduction stages are eliminated by adopting a specific partition strategy. One of $A$ or $H$ is fully replicated on each worker, while the other matrix is partitioned properly among different workers to achieve a communication-free paradigm (Kurt et al., [n.d.]). For instance, matrix $A$ is replicated across all workers, and matrix $H$ is partitioned into column blocks. Each worker holds a column block, which contains multiple columns of $H$. In this case, no communication is required prior to local computation. After local computation, each worker holds the corresponding column block of the final matrix $AH$, so no reduction is required either. Thus, only the computation stage is performed in the SpMM operation. However, this execution model lacks scalability if both $A$ and $H$ exceed the memory capacity of an individual worker, since at least one of the matrices needs to be fully replicated on all workers.
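A minimal NumPy sketch of the computation-only model is shown below, assuming $A$ is replicated and $H$ is split into column blocks; each simulated worker computes its column block of $AH$ without any communication or reduction. Sizes and worker counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, workers = 6, 8, 2
A = (rng.random((n, n)) < 0.4).astype(float)   # sparse adjacency, replicated on every worker
H = rng.normal(size=(n, d))                    # dense features, partitioned into column blocks

col_blocks = np.array_split(np.arange(d), workers)
# each worker multiplies the replicated A with its own column block of H;
# the result is the corresponding column block of AH, so no communication or reduction is needed
local_results = [A @ H[:, cols] for cols in col_blocks]
assert np.allclose(np.hstack(local_results), A @ H)
```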
Communication-computation Execution Model. When both matrix $A$ and matrix $H$ have to be partitioned and stored on the workers in a distributed manner, the communication-computation execution model is introduced. In this execution model, workers need to share the matrix partitions they hold with each other, necessitating a communication stage prior to local computation. This communication can be performed either in a broadcast fashion or in a point-to-point (P2P) fashion, as detailed in Section 7. Furthermore, since the reduction stage is not executed, the $AH$-Stationary strategy is adopted in the communication-computation execution model, and no communication is required to obtain the result matrix $AH$.
The third column in Table 2 shows that when the $AH$-Stationary strategy is adopted, several partition strategies (1D, 1.5D, and 2D) can be applied in the communication-computation execution model. For $AH$-Stationary 1D partitioning, each worker stores a row block of the matrices $A$, $H$, and $AH$. During the communication stage, each worker broadcasts its row block of $H$ to all other workers. Subsequently, local computation is performed to compute the row blocks of $AH$. In this paradigm, matrix $A$ is also stationary, making this 1D $AH$-Stationary SpMM also $A$-Stationary. It is worth noting that 1D partitioning can also be performed in a column-wise manner (Gandhi and Iyer, 2021). In such cases, the 1D $AH$-Stationary is also $H$-Stationary. In short, under 1D partitioning, both $A$-Stationary and $H$-Stationary are equivalent to $AH$-Stationary. Therefore, 1D partitioning follows a communication-computation execution model. However, 1D $AH$-Stationary faces scalability challenges, as each worker needs to broadcast its partition to all remote workers, resulting in communication costs that grow linearly with the number of workers (Tripathy et al., 2020). To address this, optimizations such as P2P communication and non-blocking techniques (Demirci et al., 2023; Peng et al., 2022) can be employed to accelerate the communication stage, which will be discussed in Section 7.
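The sketch below simulates 1D $AH$-Stationary SpMM with row blocks: the inner loop over j stands in for receiving the broadcast row blocks of $H$, while the row blocks of $A$ and $AH$ never move. Worker indices and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, workers = 8, 4, 4
A = (rng.random((n, n)) < 0.3).astype(float)
H = rng.normal(size=(n, d))

rows = np.array_split(np.arange(n), workers)   # every matrix is partitioned into row blocks

# communication stage: worker j broadcasts its row block H[rows[j]] to all other workers.
# computation stage: worker i accumulates A[rows[i], rows[j]] @ H[rows[j]] for every j,
# using only the row block of A it already holds, so A and AH stay stationary.
AH_blocks = []
for i in range(workers):
    local = np.zeros((len(rows[i]), d))
    for j in range(workers):                   # in practice these blocks arrive via broadcast
        local += A[np.ix_(rows[i], rows[j])] @ H[rows[j]]
    AH_blocks.append(local)                    # worker i now owns its row block of AH

assert np.allclose(np.vstack(AH_blocks), A @ H)
```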
For $AH$-Stationary 1.5D partitioning, either matrix $A$ or matrix $H$ is partitioned in a 2D manner, while the other matrix is partitioned in a 1D manner. To partition a matrix in a 2D manner, each processor holds a row-column block of the complete matrix, comprising the elements that satisfy both the column IDs and the row IDs assigned to the processor. Under 1.5D partitioning, although matrix $AH$ is set to be stationary, either matrix $A$ or matrix $H$ needs to be broadcast to all processors, leading to scalability challenges similar to those of 1D partitioning.
Another approach is to leverage $AH$-Stationary 2D partitioning, where both matrix $A$ and matrix $H$ are partitioned in a 2D manner. For a row-column block of matrix $AH$, the processor holding this block also holds the corresponding row-column blocks of matrix $A$ and matrix $H$. It only needs to receive the blocks of matrix $A$ with the same row ID and the blocks of matrix $H$ with the same column ID. In this way, the total communication overhead is reduced.
Communication-computation-reduction Execution Model. As described above, adopting the $AH$-Stationary strategy is the key to eliminating the reduction stage. Therefore, in this model, the $AH$-Stationary strategy is not adopted. In other words, different from the communication-computation execution model, the other stationary strategies, including $A$-Stationary, $H$-Stationary, and Non-Stationary, are considered. As discussed previously, under both replicated and 1D partitioning, all stationary strategies are equivalent to $AH$-Stationary, and no reduction stage is required. For $A$-Stationary and $H$-Stationary under 1.5D and 2D partitioning, the results obtained after the local computation stage are still partial. Each worker performs a reduction operation to sum up the remote partial results with its local partial result. The 3D partitioning can be viewed as the aggregation of multiple 2D-partitioned SpMM operations, so a final reduction operation is required to aggregate their results. Under 3D partitioning, none of the three matrices can be stationary; therefore, 3D partitioning only corresponds to the Non-Stationary strategy.
Table 2. Distributed SpMM execution models under different partition and stationary strategies (C: computation-only; CC: communication-computation; CCR: communication-computation-reduction).

| | $A$-Stationary | $H$-Stationary | $AH$-Stationary | Non-Stationary |
|---|---|---|---|---|
| Replicated | C | C | C | - |
| 1D | CC | CC | CC | - |
| 1.5D | CCR | CCR | CC | - |
| 2D | CCR | CCR | CC | - |
| 3D | - | - | - | CCR |
Discussion. Both the partition strategy and the stationary strategy play crucial roles in distributed SpMM computation. We summarize the impact of these two factors on the distributed SpMM execution model in Table 2. When one of $A$ and $H$ is stored in a replicated manner, a communication-free SpMM can be performed, leading to the computation-only execution model. However, since a GNN model consists of both SpMM and GeMM (general dense matrix multiplication) operations, a redistribution step is required between them to make both SpMM and GeMM communication-free (Kurt et al., [n.d.]). Additionally, the order in which SpMM and GeMM are executed (i.e., SpMM first or GeMM first) can also impact the communication and computation overhead (Kurt et al., [n.d.]; Wang et al., 2021a). When both matrices $A$ and $H$ are distributed across different processors and 1D partitioning or the $AH$-Stationary strategy is adopted, SpMM follows the communication-computation execution model; however, this execution model suffers from scalability and communication-overhead issues. Finally, $A$-Stationary or $H$-Stationary with 1.5D or 2D partitioning, as well as the Non-Stationary strategy, follow the communication-computation-reduction execution model. The communication cost of the overall SpMM operation varies depending on the selected stationary mechanism, and different matrix partitioning methods contribute differently to the communication cost, as discussed in more detail in previous studies (Tripathy et al., 2020; Selvitopi et al., 2021).
6.2.3. Comparison of graph view and matrix view.
We provide a comparison between the graph view and the matrix view of GNN models and highlight the connections between these two views. The graph view and the matrix view are two different programming interfaces that handle GNN models at different levels. The matrix view is a lower-level representation of GNN models, as it directly manages the data represented as matrices. The graph view, on the other hand, provides a higher-level understanding of GNN models. In most GNN frameworks, such as DGL and PyG, the programming interfaces are organized around the graph view for ease of use. However, during model training, graph operations such as feature aggregation are ultimately transformed into matrix multiplications. Both the one-shot execution and the chunk-based execution in the graph view can be mapped to the matrix view. In a simple one-shot execution, each vertex is assigned to a specific worker along with its input features, and the worker performs an aggregation once it receives the features of all the vertex's neighbors. This corresponds to 1D $AH$-Stationary partitioning in the matrix view, which follows a communication-computation paradigm. In the other stationary mechanisms that follow a communication-computation-reduction paradigm, the partial results combined in the reduction stage correspond to the partial aggregation results of chunk-based execution in the graph view.
6.2.4. Synchronous execution model
In the synchronous execution model (Figure 9(a)), the computation of each operator begins after the communication is synchronized. Almost all distributed SpMM executions follow a synchronous model. Therefore, to better explain the difference between synchronous and asynchronous execution models while adhering to the page limit, we introduce the following from the graph view only, using the SAGA-NN model as an example. In the SAGA-NN model, the four operators and their backward counterparts form an execution pipeline over the graph data, following the order SC-AE-GA-AV in the forward pass and ∇AV-∇GA-∇AE-∇SC in the backward pass (we use ∇X to denote the backward counterpart of operator X). Among these eight stages, two involve the communication of the states of boundary vertices, namely GA and ∇GA. In GA, the features of neighbor vertices are aggregated to the target vertices, so the features of boundary vertices have to be transferred. In ∇GA, the gradients of the boundary vertices are sent back to the workers they belong to. Therefore, GA and ∇GA are two synchronization points in the synchronous execution model. At these points, the execution flow is blocked until all communication finishes. Systems like NeuGraph (Ma et al., 2019), CAGNET (Tripathy et al., 2020), FlexGraph (Wang et al., 2021e), and DistGNN (Md et al., 2021) apply this execution model. To reduce the communication cost and improve training efficiency, several communication protocols have been proposed, and we review them in Section 7.1. We point out that these two synchronization points exist irrespective of the view of the execution model. From the matrix view, GA corresponds to the SpMM operation $AH$, and ∇GA corresponds to the SpMM operation $A^{\top}G$, where $G$ represents the gradient matrix in backward propagation.
6.2.5. Asynchronous execution model
The asynchronous execution model allows the computation to start with historical (stale) states, avoiding the expensive synchronization cost. According to the type of states involved, we classify the asynchronous execution model into type I asynchronization and type II asynchronization.


Type I asynchronization. As mentioned above, two synchronization points exist in the execution pipeline of the SAGA-NN model, and similar synchronization points exist in other programming models as well. Removing these synchronization points of the computation graph execution from the pipeline introduces type I asynchronization (Peng et al., 2022; Md et al., 2021; Wan et al., 2022b). In type I asynchronization (Figure 9(b)), workers do not wait for the hidden embeddings (hidden features after the first GNN layer) of boundary vertices to arrive. Instead, they use historical hidden embeddings of the boundary vertices from previous epochs, which were cached or received earlier, to perform the aggregation of the target vertices. Similarly, historical gradients are used in the ∇GA stage.
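A minimal sketch of the historical-embedding idea behind type I asynchronization is given below: boundary embeddings received in earlier epochs are cached and reused as long as they do not exceed a staleness bound. The class and its interface are hypothetical, not the API of any cited system.

```python
import numpy as np

class HistoricalEmbeddingCache:
    """Minimal sketch of type I asynchronization: boundary-vertex embeddings received in
    earlier epochs are reused for aggregation instead of waiting for fresh ones."""
    def __init__(self, bound):
        self.bound = bound          # maximum allowed staleness in epochs
        self.store = {}             # vertex id -> (epoch, embedding)

    def push(self, vid, epoch, emb):
        self.store[vid] = (epoch, emb)          # e.g., filled by an asynchronous receive

    def pull(self, vid, current_epoch):
        epoch, emb = self.store[vid]
        if current_epoch - epoch > self.bound:  # too stale: block / trigger synchronization
            raise RuntimeError(f"vertex {vid} exceeds staleness bound {self.bound}")
        return emb

cache = HistoricalEmbeddingCache(bound=1)
cache.push(vid=7, epoch=3, emb=np.ones(4))
print(cache.pull(vid=7, current_epoch=4))      # stale by one epoch, still usable
```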
Type II asynchronization. During GNN training, another synchronization point occurs when the weight parameters need to be updated. This is the same as in traditional deep learning models, where many efforts have been made to remove this synchronization point (Recht et al., 2011; Narayanan et al., 2019). Gandhi et al. (Gandhi and Iyer, 2021) further adopt this idea for GNN models, which forms type II asynchronization. Under such a protocol, mini-batches are executed in parallel and form a mini-batch pipeline.
Note that type I and type II asynchronization can be adopted individually or jointly. Detailed communication protocol for GNN-specific Type I asynchronous execution model is described in Section 7.2.
7. GNN Communication Protocols
In this section, we review the GNN communication protocols that support different computation graph execution models and are responsible for transferring hidden embeddings and gradients. According to the type of computation graph execution model, we classify GNN communication protocols into synchronous GNN communication protocols and asynchronous GNN communication protocols.
7.1. Synchronous GNN Communication Protocols
Synchronous GNN communication protocols handle the communication in the synchronous execution model, where a synchronization barrier is set at GA and ∇GA of each layer to ensure that the hidden embeddings used for aggregation are computed from the exact previous layer in the current iteration. There are several approaches to transferring the hidden embeddings, and we introduce them one by one as follows.
7.1.1. Broadcast-based protocol
Broadcast is a general technique for sharing information among a set of workers. In the GNN context, each worker broadcasts the latest hidden embeddings or graph structure it holds according to the underlying data partitioning method, and other workers start the corresponding graph convolutional computation once they receive the required data from that worker. When all workers finish broadcasting, the hidden embeddings of the full graph are shared among all workers, and every worker has received the latest hidden embeddings it needs in the current iteration. The authors of CAGNET (Tripathy et al., 2020) model the GNN computation as matrix computation. They partition the matrices (i.e., the graph adjacency matrix and the feature matrix) into sub-matrices with 1D, 1.5D, 2D, and 3D partitioning strategies. During GNN training, each worker directly broadcasts its partial matrix of the graph structure or vertex features to other workers for synchronization, guaranteeing that the vertices on a worker have a complete neighborhood. However, the efficiency of the broadcast can be undermined by workload imbalance, which stalls the computation.
7.1.2. Point-to-point-based protocol
The aforementioned broadcast sends data to all workers in the distributed environment and might cause redundant communication. In contrast, point-to-point (P2P) transmission is a fine-grained communication method for sharing information among workers: it only transfers data along cross edges in an edge-cut graph partition, or for replicated vertices in a vertex-cut graph partition. For example, ParallelGCN (Demirci et al., 2023) leverages edge-cut graph partitioning and maintains a special data structure to manage different vertex sets for different worker pairs. Each vertex set indicates the vertices that need to be sent between the two workers, and the communication of these vertices is then performed in a P2P fashion. The communication volume is thus reduced, since only the necessary vertices are sent and received. DistGNN (Md et al., 2021) employs a vertex-cut partitioning strategy. Each edge is assigned to one specific worker, while each vertex, along with its features, may be stored redundantly. Among the replicas, one vertex is designated the master vertex. When aggregating the neighbors of a vertex, P2P transmission is applied: the partial aggregations computed at the replicas are sent to the worker where the master vertex is located, instead of being broadcast to all workers. As a result, the master vertex can aggregate the embeddings from all its neighbors. However, according to the study of CAGNET (Tripathy et al., 2020), naive P2P transmission brings in the overhead of request and send operations and increases the latency. To mitigate this extra overhead, the pipeline-based protocol is introduced, which further overlaps the communication with local computation, as described in Section 7.1.3.
7.1.3. Pipeline-based protocol
Since the naive P2P-based protocol incurs the extra overhead of request and send operations introduced in Section 7.1.2, the pipeline-based protocol is proposed as an optimization. It divides the complete vertex aggregation task on each worker into several sub-tasks and overlaps the communication and computation of these sub-tasks. The execution of these sub-tasks thus forms a pipeline.
G3 (Wan et al., 2023a) introduces a bin-packing mechanism to pack local vertices into different bins (i.e., sub-tasks). During the aggregation stage, some vertices may become ready to execute the aggregation of the next layer earlier than others, as soon as all of their remote neighbors' embeddings are received. These vertices are prioritized to be packed into bins so that their aggregation results are computed first. Once the computation of a bin finishes, the aggregation results of the vertices in that bin are sent to the remote workers. In this way, a fine-grained overlap of communication and computation is achieved, and different bins form a pipeline.
In systems that adopt the chunk-based execution model introduced in Section 6.2.1, such as SAR (Mostafa, 2022), DistGNN (Md et al., 2021), and FlexGraph (Wang et al., 2021e), the aggregation of each neighborhood chunk can be treated as a sub-task, providing the opportunity to overlap the communication and computation of different chunks. Here we call the aggregation over a vertex's partial neighborhood a partial aggregation. SAR (Mostafa, 2022) aggregates the chunks in a predefined order. Each worker in SAR first computes the partial aggregation of the local neighborhood of the target vertex, then fetches the remote neighborhoods in the predefined order, computes the partial aggregation of each remote neighborhood, and accumulates the results locally one by one. Consequently, the aggregation of the complete neighborhood in SAR is divided into different stages, and in each stage a partial aggregation is performed. The final aggregation is obtained when all partial aggregations complete. SAR applies the pipeline-based communication protocol to reduce memory consumption, since it does not need to store all the features of the neighborhood and compute the aggregation at once. ParallelGCN adopts a similar idea without predefining the aggregation order: the remote chunks of the neighborhood are received in a random order, and it starts aggregating the received vertices immediately. Furthermore, to reduce the network communication overhead caused by transferring the partial neighborhood, DeepGalois (Hoang et al., 2021) performs a remote partial aggregation before communication, so that only the partial aggregation results need to be transferred. The cd-0 communication strategy in DistGNN is similar: it simply performs the remote partial aggregation, fetches the result to the local worker, and computes the final aggregation. To further improve training efficiency, FlexGraph (Wang et al., 2021e) overlaps computation and communication. Each worker first issues requests to partially aggregate the neighborhood at the remote workers and then computes the local partial aggregation while waiting for the remote partial aggregations to arrive. After receiving the remote partial aggregations, the worker directly aggregates them with the local partial aggregation. Therefore, the local partial aggregation is overlapped with the communication of the remote partial aggregations.
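The following sketch illustrates this style of overlapping under simplified assumptions: a thread pool stands in for the remote worker that computes a partial aggregation, and the local partial aggregation proceeds while the request is in flight; the final result equals the fully synchronous aggregation. Names and sizes are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def remote_partial_aggregate(A_remote_block, H_remote):
    """Stands in for a remote worker that aggregates its part of the neighborhood
    and ships back only the partial result."""
    return A_remote_block @ H_remote

rng = np.random.default_rng(4)
n_local, n_remote, d = 5, 7, 4
A_local  = (rng.random((n_local, n_local))  < 0.5).astype(float)
A_remote = (rng.random((n_local, n_remote)) < 0.5).astype(float)
H_local, H_remote = rng.normal(size=(n_local, d)), rng.normal(size=(n_remote, d))

with ThreadPoolExecutor(max_workers=1) as pool:
    # issue the request for the remote partial aggregation first ...
    future = pool.submit(remote_partial_aggregate, A_remote, H_remote)
    local_partial = A_local @ H_local          # ... and overlap it with local computation
    final = local_partial + future.result()    # accumulate once the partial result arrives

assert np.allclose(final, A_local @ H_local + A_remote @ H_remote)
```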
7.1.4. Communication via Shared Memory
To train large GNNs at scale, it is also feasible to leverage CPU memory to retrieve the required information instead of using GPU-GPU communication. The complete graph and feature embeddings are stored in shared memory (i.e., CPU memory), and the device memory of each GPU is treated as a cache. In ROC (Jia et al., 2020), the authors assume that all the GNN data fits in CPU memory, and they replicate the whole graph structure and features in the CPU DRAM of each worker. GPU-based workers retrieve the vertex features and hidden embeddings from the local CPU shared memory. For larger graphs, or scenarios where all the GNN data cannot fit into the CPU DRAM of a single worker, distributed shared memory is preferred. DistDGL (Zheng et al., 2020c) partitions the graph and stores the partitions in a specially designed KVStore. During training, if the data is co-located with the worker, DistDGL accesses it via local CPU memory; otherwise, it issues RPC requests to retrieve the information from remote CPU memory. In NeuGraph (Ma et al., 2019), the graph structure data and hidden embeddings are also stored in shared memory. To support large-scale graphs, the input graph is partitioned into edge chunks by a 2D partitioning method, and the feature matrix (as well as the hidden embedding matrix) is partitioned into vertex chunks by a row-wise partitioning method. As described in Section 6.2.1, it follows a chunk-based execution model. To compute the partial aggregation of a chunk, each GPU fetches the corresponding edge chunk as well as the vertex chunk from CPU memory. To speed up the vertex chunk transfer among GPUs, a chain-based streaming scheduling method is applied, which considers the communication topology to avoid bandwidth contention.
7.2. Asynchronous GNN communication protocol
Asynchronous GNN communication protocols are used to support asynchronous execution models. As described in Section 6.2.5, there are two types of asynchronization in GNN training: type I asynchronization, which removes the synchronization in the aggregation stages, and type II asynchronization, which removes the synchronization in the model weight update stage. Type I asynchronization is specific to GNN model computation, while type II asynchronization has been widely used in traditional deep learning, and the corresponding techniques can be directly introduced into GNN training. In the following, we focus on reviewing the asynchronous GNN communication protocols that are specific to distributed GNN training. Table 3 summarizes the popular asynchronous GNN communication protocols.
Table 3. Summary of popular asynchronous GNN communication protocols.

| System | Type I | Type II | | | Staleness Model | Staleness bound | Parallelism | Mechanism |
|---|---|---|---|---|---|---|---|---|
| SANCUS | ✓ | ✗ | ✓ | ✗ | Variation-based | Algorithm-dependent | Data | Skip-broadcast |
| DIGEST | ✓ | ✗ | ✓ | ✗ | Epoch-adaptive | s | Data | Embedding server |
| DIGEST-A | ✓ | ✓ | ✓ | ✗ | Epoch-adaptive | s | Data | Embedding server |
| DistGNN | ✓ | - | ✓ | ✗ | Epoch-fixed | s | Data | Overlap |
| PipeGCN | ✓ | ✗ | ✓ | ✓ | Epoch-fixed | 1 | Data | Pipeline |
| Dorylus | ✓ | ✓ | ✓ | ✓ | Epoch-adaptive | s | Data | Pipeline |
| P3 | ✗ | ✓ | ✗ | ✗ | Epoch-fixed | 3 | Model | Mini-batch pipeline |
| CM-GCN | - | - | - | - | - | 0 | Data | Counter-based vertex-level async. |
7.2.1. Staleness Models for Asynchronous GNN communication protocol
Generally, introducing asynchronization means that information with staleness is used in training. Specifically, in type I asynchronization, the GA or ∇GA stage is performed without completely gathering the latest states of the neighborhood, and historical vertex embeddings or gradients are used in the aggregation. Different staleness models are introduced to manage the staleness of the historical information, and each of them must ensure that the staleness of the aggregated information is bounded so that GNN training converges. In the following, we review three popular staleness models.
Epoch-Fixed Staleness. One straightforward method is to aggregate historical information with a fixed staleness (Wan et al., 2022b; Md et al., 2021; Gandhi and Iyer, 2021). Let $t$ be the current epoch in training, and $t'$ be the epoch of the historical information used in aggregation. In the epoch-fixed staleness model, $t - t' = s$, where $s$ is a hyper-parameter set by users. In this way, the staleness is bounded explicitly by $s$.
Epoch-Adaptive Staleness. A variation of the above basic model is to use an epoch-adaptive staleness (Thorpe et al., 2021; Chai et al., 2022). During the aggregation of a vertex $v$ in epoch $t$, let $t_u$ be the epoch of the historical information of any vertex $u$ (i.e., any neighbor vertex of $v$) used in the aggregation. In the epoch-adaptive staleness model, $t - t_u \le s$ holds for all $u$. This means that in different epochs, the staleness of the historical information used for aggregation may be different, and within one epoch, the staleness of the aggregated neighbors of a vertex may also differ. Generally, once $t - t_u$ reaches $s$, the latest embeddings or gradients should be broadcast under decentralized training, or pushed to the historical embedding server under centralized training. If the above condition does not hold when aggregation is performed, the aggregation is blocked until historical information within the staleness bound (i.e., satisfying $t - t_u \le s$) can be retrieved.
Variation-Based Staleness. Third, the staleness can also be measured by the variation of the embeddings or gradients; in other words, embeddings or gradients are only re-shared when they have changed significantly. Specifically, let $H_i^{(l)}$ be the embeddings of layer $l$ held by worker $i$, and $\tilde{H}_i^{(l)}$ be the historical embeddings last shared with other workers. In the variation-based staleness model, $\|H_i^{(l)} - \tilde{H}_i^{(l)}\| \le \epsilon$, where $\epsilon$ is the maximum difference bound set by users, so that the latest embeddings or gradients are broadcast once the historical version available to other workers becomes too stale. In this way, the staleness is bounded with the help of $\epsilon$.
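The three staleness conditions reconstructed above can be summarized by the small sketch below; the function and parameter names are illustrative, and real systems enforce these checks inside their communication layers.

```python
import numpy as np

def epoch_fixed_ok(t, t_hist, s):
    """Epoch-fixed staleness: historical information must be exactly s epochs old."""
    return t - t_hist == s

def epoch_adaptive_ok(t, neighbor_epochs, s):
    """Epoch-adaptive staleness: every aggregated neighbor is at most s epochs old."""
    return all(t - t_u <= s for t_u in neighbor_epochs)

def variation_ok(h_current, h_shared, eps):
    """Variation-based staleness: the shared historical embedding may be reused
    as long as it differs from the current one by at most eps."""
    return np.linalg.norm(h_current - h_shared) <= eps

print(epoch_fixed_ok(t=10, t_hist=9, s=1))                         # True
print(epoch_adaptive_ok(t=10, neighbor_epochs=[9, 10, 8], s=2))    # True
print(variation_ok(np.ones(4), np.ones(4) * 1.01, eps=0.1))        # True
```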
7.2.2. Realization of Asynchronous GNN communication protocols
As described above, historical information can be used in the aggregation during the GA or ∇GA stage. Some protocols only take the GA stage into consideration, supporting asynchronous embedding aggregation. Other protocols are designed to use historical information during both the GA and ∇GA stages, supporting both asynchronous embedding and gradient aggregation.
Asynchronous Embedding Aggregation. Peng et al. (Peng et al., 2022) build on the broadcast-based training paradigm of CAGNET and design a skip-broadcast mechanism under a 1D partition. This mechanism automatically checks the staleness of the partitioned vertex embeddings cached on each worker and skips the broadcast if they are not too stale. If the broadcast of a partition is skipped, other workers use the previously cached historical vertex embeddings in the forward computation. Three staleness check algorithms are further proposed, each of which ensures that the staleness of the cached vertex embeddings is bounded.
Chai et al. (Chai et al., 2022) adopt a similar idea while using parameter servers to maintain the historical embeddings of all workers. The vertex hidden embeddings are pushed to these servers every several epochs, and the workers pull the historical embeddings to their local cache in the next epoch after the push. The staleness of historical vertex embeddings is thus bounded by the push-pull period.
The cd-r algorithms introduced by Md et al. (Md et al., 2021) overlap the communication of the partial aggregation results from each worker with the forward computation of the GNN. Specifically, the partial aggregation computed in iteration $t$ is transmitted asynchronously to the target vertex, and the final aggregation is performed in iteration $t + r$. Under this algorithm, the staleness is bounded by $r$.
Asynchronous Embedding and Gradient Aggregation. As introduced in Section 6.2.4, GA and ∇GA are two synchronization points in the GNN execution model. While the above protocols only use historical vertex embeddings during the GA stage, Wan et al. (Wan et al., 2022b) also take the synchronization point ∇GA into consideration and use stale vertex gradients in the backward propagation. During training, both the embeddings and gradients of the neighborhood are asynchronously sent in a point-to-point fashion in the previous epoch and received by the target vertex in the current epoch. Therefore, the communication is overlapped with both forward and backward computation, and a training pipeline with a fixed epoch gap is constructed, in which workers are only allowed to use features or gradients from exactly one epoch earlier, so the staleness is bounded. Thorpe et al. (Thorpe et al., 2021) design a finer and more flexible pipeline. Similar to the method proposed by Wan et al. (Wan et al., 2022b), both stale vertex embeddings and stale vertex gradients are used in the pipeline. Moreover, it also removes the type II synchronization, so a trainer may use stale weight parameters and start the next epoch immediately. The staleness in the pipeline is explicitly bounded by a value $s$ set by users: the fastest trainer in the pipeline is allowed to use historical embeddings or gradients that are at most $s$ epochs old. With this staleness bound, the staleness of the weight parameters is also bounded accordingly.
Vertex-level Asynchronization. Without using stale embeddings or gradients, asynchronization can also be designed at the vertex level (Zhao et al., 2021c), where each vertex starts the computation of the next layer as soon as all its neighbors' embeddings are received. Different from the traditional synchronous method, in which all vertices on a worker start the computation of a layer together, during vertex-level asynchronous processing, different vertices on one worker may be computing different layers at the same time. Note that this asynchronization does not change the aggregation result, since all the required information is up to date and no embeddings or gradients from previous epochs are used.
8. Distributed GNN Training Systems
Table 4. Summary of representative distributed GNN training systems.

| Systems | Hardware | Cost model | Graph partition | Feature partition | Batch | | Execution model | Communication | Open source |
|---|---|---|---|---|---|---|---|---|---|
| NeuGraph (Ma et al., 2019) | Multi-GPU | N/A | 2D | Row-wise | full-graph | N | Chunk-based | Shared memory | N |
| PaGraph (Lin et al., 2020) | Multi-GPU | Heuristics | LGD | Shared graph store | k-hop | Y | - | N/A | Y |
| BGL (Liu et al., 2023) | Multi-GPU | Heuristics | LGD | Shared graph store | k-hop | Y | Factored | N/A | N |
| GNNLab (Yang et al., 2022) | Multi-GPU | N/A | Shared graph store | Shared graph store | k-hop | Y | Factored | N/A | Y |
| Rethinking (Song and Jiang, 2022) | Multi-GPU | N/A | Shared graph store | Row-wise | k-hop | Y | - | N/A | N |
| DSP (Cai et al., 2023) | Multi-GPU | N/A | METIS | Row-wise | k-hop | Y | Operator-parallel | N/A | Y |
| Legion (Sun et al., 2023) | Multi-GPU | N/A | METIS | Row-wise | k-hop | Y | - | N/A | N |
| ROC (Jia et al., 2020) | GPU-cluster | Learning-based | Edge-cut | Shared graph store | full-graph | Y | - | Shared memory | Y |
| P3 (Gandhi and Iyer, 2021) | GPU-cluster | N/A | Hash | Column-wise | k-hop | Y | Pull-push, Async. | Weight staleness | N |
| Sancus (Peng et al., 2022) | GPU-cluster | N/A | 1D | Row-wise | full-graph | Y | Async. | Embedding staleness | Y |
| NeutronStar (Ma et al., 2019) | GPU-cluster | N/A | Edge-cut | Row-wise | full-graph | Y | - | P2P | Y |
| DGCL (Cai et al., 2021) | GPU-cluster | N/A | METIS | Row-wise | full-graph | N | - | P2P | Y |
| PipeGCN (Wan et al., 2022b) | GPU-cluster | N/A | METIS | Row-wise | full-graph | Y | Async. | | Y |
| BNS-GCN (Wan et al., 2022a) | GPU-cluster | N/A | Edge-cut | Row-wise | full-graph | N | - | P2P | Y |
| DistDGLv2 (Zheng et al., 2022b) | GPU-cluster | Heuristics | METIS | Row-wise | subgraph | Y | Factored | N/A | N |
| SALIENT++ (Kaler et al., 2023) | GPU-cluster | N/A | Edge-cut | Row-wise | full-graph | Y | - | N/A | Y |
| G3 (Wan et al., 2023a) | GPU-cluster | Heuristics | Edge-cut | Row-wise | full-graph | N | - | P2P | N |
| AdaQP (Wan et al., 2023b) | GPU-cluster | N/A | Edge-cut | Row-wise | full-graph | N | - | Broadcast | Y |
| AliGraph (Zhu et al., 2019) | CPU-cluster | N/A | | Row-wise | k-hop | Y | - | N/A | Y |
| AGL (Zhang et al., 2020a) | CPU-cluster | N/A | DFS | Row-wise | k-hop | N | - | N/A | N |
| DistDGL (Zheng et al., 2020c) | CPU-cluster | Heuristics | METIS | Row-wise | k-hop | Y | - | N/A | Y |
| CM-GCN (Zhao et al., 2021c) | CPU-cluster | Operator-based | Edge-cut | Row-wise | subgraph | N | Async. | Vertex-level | N |
| DistGNN (Md et al., 2021) | CPU-cluster | N/A | 2D | Row-wise | full-graph | Y | Chunk-based, Async. | | N |
| FlexGraph (Wang et al., 2021e) | CPU-cluster | Learning-based | Par2E | Row-wise | full-graph | Y | Chunk-based | Pipeline-based | N |
| ByteGNN (Zheng et al., 2022a) | CPU-cluster | Heuristics | LGD | Row-wise | k-hop | N | Operator-parallel | N/A | N |
| ParallelGCN (Demirci et al., 2023) | CPU-cluster | Heuristics | Hypergraph | Row-wise | full-graph | N | Chunk-based | P2P | Y |
| Dorylus (Thorpe et al., 2021) | Serverless | N/A | Edge-cut | Row-wise | full-graph | Y | Async. | | Y |
| SUGAR (Xue et al., 2022) | Low-resource devices | N/A | Edge-cut | Row-wise | subgraph | N | - | N/A | N |
In recent years, many distributed GNN systems have been proposed as a combination of distributed graph processing systems and distributed deep learning systems. These distributed GNN systems take the general idea of efficiently processing graph data from the former, while combining it with the capability of processing structured data from the latter. A recent survey systematically introduces the evolution from graph processing systems to GNN systems (Vatter et al., 2023). Based on the taxonomy proposed in this survey, we review 28 representative distributed GNN training systems and frameworks, summarized in Table 4, and classify them into three main categories according to their computation resources: a) systems on a single machine with multiple GPUs, b) systems on GPU clusters, and c) systems on CPU clusters. Besides, a few distributed GNN training systems use more diverse computation resources, such as serverless threads, mobile devices, and edge devices; we classify them as miscellaneous systems. In the table, we summarize the techniques each system incorporates according to our new taxonomy. Note that in the execution model column, we omit the conventional, one-shot, and synchronous execution models, as they are the basic execution models adopted by most systems; we only note the execution models of systems with special designs. For a detailed introduction of each system and their relations, please refer to Section A in the Appendix.
9. Future Direction
As a general solution for training on large-scale graphs, distributed GNN training has gained widespread attention in recent years. In addition to the techniques and systems discussed above, there are other interesting and emerging research topics in distributed GNN training. We discuss some of these directions in the following.
Benchmark for distributed GNN training. Many efforts have been made to benchmark traditional deep learning models. For instance, DAWNBench (Coleman et al., 2017) provides a standard evaluation criterion to compare different training algorithms. It focuses on the end-to-end training time required to converge to a state-of-the-art accuracy, as well as the inference time, and considers both single-machine and distributed computing scenarios. Furthermore, many benchmark suites have been developed for traditional DNN training to profile and characterize the computation (Mattson et al., 2020; Zhu et al., 2018; Dong and Kaeli, 2017). As for GNNs, Dwivedi et al. (Dwivedi et al., 2020) attempt to benchmark the GNN pipeline and compare the performance of different GNN models on medium-size datasets. Meanwhile, larger datasets (Hu et al., 2020a; Du et al., 2021; Freitas et al., 2021) for graph machine learning tasks have been published. However, to our knowledge, few efforts have been made to compare the efficiency of different GNN training algorithms, especially in distributed computing scenarios. GNNMark (Baruah et al., 2021) is the first benchmark suite developed specifically for GNN training. It leverages the NVIDIA nvprof profiler (Bradley, 2012) to characterize kernel-level computation, uses the NVBit framework (Villa et al., 2019) to profile memory divergence, and further modifies the PyTorch source code to collect data sparsity. However, it lacks the flexibility for quick profiling of different GNN models and does not pay much attention to distributed, multi-machine training scenarios. Therefore, a new benchmark designed for large-scale distributed GNN training would be of practical value.
Large-scale dynamic graph neural networks. In many applications, graphs are not static. The vertex attributes or graph structures often evolve over time, which requires the representations to be updated promptly. Li and Chen (Li and Chen, 2021) propose a general cache-based GNN system to accelerate representation updating. It caches hidden representations and selects valuable representations to reduce the update time. DynaGraph (Guan et al., 2022) efficiently trains dynamic GNNs via cached message passing and timestep fusion, and further optimizes graph partitioning to train GNNs in a data-parallel manner. Although dynamic GNNs have long been an interesting area of research, as far as we know, few works specifically focus on distributed dynamic GNN training. The dynamism of both features and structure poses new challenges to the common solutions in a distributed environment. Graph partitioning has to adjust quickly to changes in vertices and edges while still meeting the requirements of load balance and communication reduction. Updates to the graph structure also lower the cache hit ratio, which significantly influences the end-to-end performance of GNN training.
Large-scale heterogeneous graph neural networks. Many heterogeneous GNN architectures have been proposed in recent years (Zhang et al., 2019b; Fu et al., 2020; Wang et al., 2019a; Zhao et al., 2021b; Chang et al., 2022). However, few distributed systems take the unique characteristics of heterogeneous graphs into consideration to support heterogeneous GNNs. Since the feature size and the number of neighbors may vary greatly for vertices of different types, processing heterogeneous graphs in a distributed manner may cause severe problems such as load imbalance and high network overhead. The Paddle Graph Learning framework (https://github.com/PaddlePaddle/PGL) provides easy and fast programming of message-passing-based graph learning on heterogeneous graphs with distributed computing. DistDGLv2 (Zheng et al., 2022b) takes imbalanced workload partitioning into consideration and leverages the multi-constraint technique in METIS to mitigate this problem. More attention should be paid to this research topic to address the above problems.
GNN model compression technique. Although model compression, including pruning, quantization, and weight sharing, is widely used in deep learning, it has not been extensively applied to distributed GNN training. Compression of the network structure, such as pruning (Zhou et al., 2021b), can be combined with a graph sampling strategy to solve the out-of-memory problem. Model quantization is another promising approach to improving the scalability of GNN models. SGQuant (Feng et al., 2020) is a GNN-tailored quantization algorithm that develops multi-granularity quantization and automatic bit selection. Degree-Quant (Tailor et al., 2020) stochastically protects (high-degree) vertices from quantization to improve weight update accuracy. BinaryGNN (Bahri et al., 2021) applies a binarization strategy inspired by the latest developments in binary neural networks for images and knowledge distillation for graph networks. For distributed settings, Song et al. (Song et al., 2022) recently proposed EC-Graph, a distributed GNN training system for CPU clusters that aims to reduce communication costs through message compression. It adopts lossy compression and designs compensation methods to mitigate the induced errors, while a Bit-Tuner keeps the balance between model accuracy and message size. GNN model compression is orthogonal to the aforementioned distributed GNN optimization techniques, and it deserves more attention as a way to further improve the efficiency of distributed GNN training.
10. Conclusion
Distributed GNN training is one of the successful approaches to scaling GNN models to large graphs. In this survey, we systematically reviewed the existing distributed GNN training techniques from graph data processing to distributed model execution, covering the life-cycle of end-to-end distributed GNN training. We divided the distributed GNN training pipeline into three stages, data partition, batch generation, and GNN model training, which heavily influence GNN training efficiency. To clearly organize the technical contributions that optimize these stages, we proposed a new taxonomy consisting of four orthogonal categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. In the GNN data partition category, we described the data partition techniques for distributed GNN training; in the GNN batch generation category, we presented techniques for fast batch generation in mini-batch distributed GNN training; in the GNN execution model category, we discussed the execution models used in mini-batch and full-graph training, respectively; in the GNN communication protocol category, we discussed both synchronous and asynchronous protocols for distributed GNN training. After carefully reviewing the techniques in the four categories, we summarized existing representative distributed GNN systems for multi-GPUs, GPU clusters, and CPU clusters, respectively, and discussed future directions for optimizing distributed GNN training.
References
- Abadal et al. (2021) Sergi Abadal, Akshay Jain, Robert Guirado, Jorge López-Alonso, and Eduard Alarcón. 2021. Computing Graph Neural Networks: A Survey from Algorithms to Accelerators. ACM Comput. Surv. 54, 9 (2021), 1–38.
- Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
- Ahmedt-Aristizabal et al. (2021) David Ahmedt-Aristizabal, Mohammad Ali Armin, Simon Denman, Clinton Fookes, and Lars Petersson. 2021. Graph-Based Deep Learning for Medical Diagnosis and Analysis: Past, Present and Future. Sensors 21, 14 (2021), 4758.
- Angerd et al. (2020) Alexandra Angerd, Keshav Balasubramanian, and Murali Annavaram. 2020. Distributed training of graph convolutional networks using subgraph approximation. arXiv preprint arXiv:2012.04930 (2020), 1–14.
- Bahri et al. (2021) Mehdi Bahri, Gaetan Bahl, and Stefanos Zafeiriou. 2021. Binary Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9492–9501.
- Bai et al. (2021) Youhui Bai, Cheng Li, Zhiqi Lin, Yufei Wu, Youshan Miao, Yunxin Liu, and Yinlong Xu. 2021. Efficient Data Loader for Fast Sampling-Based GNN Training on Large Graphs. IEEE Transactions on Parallel and Distributed Systems 32, 10 (2021), 2541–2556.
- Bar-Yossef and Mashiach (2008) Ziv Bar-Yossef and Li-Tal Mashiach. 2008. Local approximation of pagerank and reverse pagerank. In Proceedings of the 17th ACM conference on Information and knowledge management. 279–288.
- Baruah et al. (2021) Trinayan Baruah, Kaustubh Shivdikar, Shi Dong, Yifan Sun, Saiful A Mojumder, Kihoon Jung, José L Abellán, Yash Ukidave, Ajay Joshi, John Kim, et al. 2021. GNNMark: A benchmark suite to characterize graph neural network training on GPUs. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software. 13–23.
- Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018), 1–40.
- Besta and Hoefler (2022) Maciej Besta and Torsten Hoefler. 2022. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. arXiv preprint arXiv:2205.09702 (2022), 1–27.
- Boman et al. (2013) Erik G Boman, Karen D Devine, and Sivasankaran Rajamanickam. 2013. Scalable matrix computations on large scale-free graphs using 2D graph partitioning. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 1–12.
- Bongini et al. (2021) Pietro Bongini, Monica Bianchini, and Franco Scarselli. 2021. Molecular generative graph neural networks for drug discovery. Neurocomputing 450 (2021), 242–252.
- Bradley (2012) Thomas Bradley. 2012. GPU performance analysis and optimisation. NVIDIA Corporation (2012), 1–117.
- Cai et al. (2021) Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. 2021. DGCL: An efficient communication library for distributed GNN training. In Proceedings of the Sixteenth European Conference on Computer Systems. 130–144.
- Cai et al. (2023) Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, and George Karypis. 2023. DSP: Efficient GNN training with multiple GPUs. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 392–404.
- Catalyurek and Aykanat (1999) Umit V Catalyurek and Cevdet Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on parallel and distributed systems 10, 7 (1999), 673–693.
- Cen et al. (2021) Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Peng Zhang, Guohao Dai, et al. 2021. Cogdl: An extensive toolkit for deep learning on graphs. arXiv preprint arXiv:2103.00959 (2021), 1–11.
- Chai et al. (2022) Zheng Chai, Guangji Bai, Liang Zhao, and Yue Cheng. 2022. Distributed Graph Neural Network Training with Periodic Historical Embedding Synchronization. arXiv preprint arXiv:2206.00057 (2022), 1–20.
- Chang et al. (2022) Yaomin Chang, Chuan Chen, Weibo Hu, Zibin Zheng, Xiaocong Zhou, and Shouzhi Chen. 2022. Megnn: Meta-path extracted graph neural network for heterogeneous graph representation learning. Knowledge-Based Systems 235 (2022), 107611.
- Chen et al. (2018a) Jie Chen, Tengfei Ma, and Cao Xiao. 2018a. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In International Conference on Learning Representations. 1–15.
- Chen et al. (2016) Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016), 1–10.
- Chen et al. (2018b) Jianfei Chen, Jun Zhu, and Le Song. 2018b. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80). 942–950.
- Chen et al. (2022) Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang, Mingyu Yan, Lei Deng, Yufei Ding, Zidong Du, et al. 2022. Rubik: A Hierarchical Architecture for Efficient Graph Neural Network Training. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 4 (2022), 936–949.
- Chen et al. (2020) Zhaodong Chen, Mingyu Yan, Maohua Zhu, Lei Deng, Guoqi Li, Shuangchen Li, and Yuan Xie. 2020. fuseGNN: Accelerating Graph Convolutional Neural Network Training on GPGPU. In 2020 IEEE/ACM International Conference On Computer Aided Design. 1–9.
- Chiang et al. (2019) Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 257–266.
- Choi et al. (2020) Edward Choi, Zhen Xu, Yujia Li, Michael Dusenberry, Gerardo Flores, Emily Xue, and Andrew Dai. 2020. Learning the graphical structure of electronic health records with graph convolutional transformer. In Proceedings of the AAAI conference on artificial intelligence. 606–613.
- Coleman et al. (2017) Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. 2017. Dawnbench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
- Cong et al. (2020) Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. 2020. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1393–1403.
- Demirci et al. (2023) Gunduz Vehbi Demirci, Aparajita Haldar, and Hakan Ferhatosmanoglu. 2023. Scalable Graph Convolutional Network Training on Distributed-Memory Systems. In Proceedings of the VLDB Endowment, Vol. 16. 711–724.
- Deng and Zhang (2021) Xiang Deng and Zhongfei Zhang. 2021. Graph-Free Knowledge Distillation for Graph Neural Networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. 2321–2327.
- Deutsch et al. (2019) Alin Deutsch, Yu Xu, Mingxi Wu, and Victor Lee. 2019. Tigergraph: A native MPP graph database. arXiv preprint arXiv:1901.08248 (2019), 1–28.
- Do et al. (2019) Kien Do, Truyen Tran, and Svetha Venkatesh. 2019. Graph Transformation Policy Network for Chemical Reaction Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 750–760.
- Dong et al. (2021) Jialin Dong, Da Zheng, Lin F Yang, and George Karypis. 2021. Global Neighbor Sampling for Mixed CPU-GPU Training on Giant Graphs. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 289–299.
- Dong and Kaeli (2017) Shi Dong and David Kaeli. 2017. DNNMark: A Deep Neural Network Benchmark Suite for GPUs. In Proceedings of the General Purpose GPUs. 63–72.
- Du et al. (2021) Yuanqi Du, Shiyu Wang, Xiaojie Guo, Hengning Cao, Shujie Hu, Junji Jiang, Aishwarya Varala, Abhinav Angirekula, and Liang Zhao. 2021. Graphgt: Machine learning datasets for graph generation and transformation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 1–17.
- Dwivedi et al. (2020) Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2020. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982 (2020), 1–47.
- Fan et al. (2020) Wenfei Fan, Ruochun Jin, Muyang Liu, Ping Lu, Xiaojian Luo, Ruiqi Xu, Qiang Yin, Wenyuan Yu, and Jingren Zhou. 2020. Application Driven Graph Partitioning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1765–1779.
- Fan et al. (2019) Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph Neural Networks for Social Recommendation. In The World Wide Web Conference. 417–426.
- Feng et al. (2020) Boyuan Feng, Yuke Wang, Xu Li, Shu Yang, Xueqiao Peng, and Yufei Ding. 2020. SGQuant: Squeezing the Last Bit on Graph Neural Networks with Specialized Quantization. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI). 1044–1052.
- Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428 (2019), 1–9.
- Fout et al. (2017) Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. 2017. Protein Interface Prediction Using Graph Convolutional Networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6533–6542.
- Frasca et al. (2020) Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, and Federico Monti. 2020. Sign: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198 (2020), 1–17.
- Freitas et al. (2021) Scott Freitas, Yuxiao Dong, Joshua Neil, and Duen Horng Chau. 2021. A Large-Scale Database for Graph Representation Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). 1–13.
- Fu et al. (2022) Qiang Fu, Yuede Ji, and H Howie Huang. 2022. TLPGNN: A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on GPU. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing. 122–134.
- Fu et al. (2020) Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. 2020. Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020. 2331–2341.
- Gandhi and Iyer (2021) Swapnil Gandhi and Anand Padmanabha Iyer. 2021. P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on Operating Systems Design and Implementation. 551–568.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural Message Passing for Quantum Chemistry. In Proceedings of the 34th International Conference on Machine Learning. 1263–1272.
- Gong et al. (2022) Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W Fletcher, Christopher J Hughes, and Josep Torrellas. 2022. Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 916–931.
- Gonzalez et al. (2012) Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In 10th USENIX symposium on operating systems design and implementation (OSDI 12). 17–30.
- Grattarola and Alippi (2021) Daniele Grattarola and Cesare Alippi. 2021. Graph neural networks in TensorFlow and keras with spektral [application notes]. IEEE Computational Intelligence Magazine 16, 1 (2021), 99–106.
- Guan et al. (2022) Mingyu Guan, Anand Padmanabha Iyer, and Taesoo Kim. 2022. DynaGraph: dynamic graph neural networks at scale. In Proceedings of the 5th ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). 1–10.
- Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
- He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
- Hoang et al. (2021) Loc Hoang, Xuhao Chen, Hochan Lee, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2021. Efficient Distribution for Deep Learning on Large Graphs. In Proceedings of the First MLSys Workshop on Graph Neural Networks and Systems. 1–9.
- Hu et al. (2020b) Linmei Hu, Siyong Xu, Chen Li, Cheng Yang, Chuan Shi, Nan Duan, Xing Xie, and Ming Zhou. 2020b. Graph neural news recommendation with unsupervised preference disentanglement. In Proceedings of the 58th annual meeting of the association for computational linguistics. 4255–4264.
- Hu et al. (2020a) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020a. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33 (2020), 22118–22133.
- Hu et al. (2020c) Yuwei Hu, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020c. Featgraph: A flexible and efficient backend for graph neural network systems. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13.
- Huang et al. (2020) Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. 2020. Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12.
- Huang et al. (2021b) Kezhao Huang, Jidong Zhai, Zhen Zheng, Youngmin Yi, and Xipeng Shen. 2021b. Understanding and Bridging the Gaps in Current GNN Performance Optimizations. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 119–132.
- Huang et al. (2022) Linyong Huang, Zhe Zhang, Zhaoyang Du, Shuangchen Li, Hongzhong Zheng, Yuan Xie, and Nianxiong Tan. 2022. EPQuant: A Graph Neural Network compression approach based on product quantization. Neurocomputing 503 (2022), 49–61.
- Huang et al. (2021a) Tinglin Huang, Yuxiao Dong, Ming Ding, Zhen Yang, Wenzheng Feng, Xinyu Wang, and Jie Tang. 2021a. MixGCF: An Improved Training Method for Graph Neural Network-Based Recommender Systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 665–674.
- Huang et al. (2018) Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive Sampling towards Fast Graph Representation Learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 4563–4572.
- Huang et al. (2019) Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, Hyoukjoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, Vol. 32. 1–10.
- Jangda et al. (2021) Abhinav Jangda, Sandeep Polisetty, Arjun Guha, and Marco Serafini. 2021. Accelerating graph sampling for graph machine learning using GPUs. In Proceedings of the Sixteenth European Conference on Computer Systems. 311–326.
- Jia et al. (2020) Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. In Proceedings of Machine Learning and Systems. 187–198.
- Jia et al. (2019) Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems. 1–13.
- Jiang and Rumi (2021) Peng Jiang and Masuma Akter Rumi. 2021. Communication-efficient sampling for distributed training of graph convolutional networks. arXiv preprint arXiv:2101.07706 (2021), 1–11.
- Jiang and Luo (2022) Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications 207 (2022), 117921.
- Joshi (2022) Chaitanya K. Joshi. 2022. Recent Advances in Efficient and Scalable Graph Neural Networks. https://www.chaitjo.com/post/efficient-gnns/. (2022).
- Kaler et al. (2023) Tim Kaler, Alexandros Iliopoulos, Philip Murzynowski, Tao Schardl, Charles E Leiserson, and Jie Chen. 2023. Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching. In Proceedings of Machine Learning and Systems. 1–14.
- Karypis and Kumar (1998) George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on scientific Computing 20, 1 (1998), 359–392.
- Ko et al. (2021) Yunyong Ko, Kibong Choi, Jiwon Seo, and Sang Wook Kim. 2021. An in-depth analysis of distributed training of deep neural networks. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. 994–1003.
- Kurt et al. ([n.d.]) Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, and P Sadayappan. [n.d.]. Communication Optimization for Distributed Execution of Graph Neural Networks. ([n. d.]), 1–12.
- Langer et al. (2020) Matthias Langer, Zhen He, Wenny Rahayu, and Yanbo Xue. 2020. Distributed Training of Deep Learning Models: A Taxonomic Perspective. IEEE Transactions on Parallel and Distributed Systems 31, 12 (2020), 2802–2818.
- Li and Chen (2021) Haoyang Li and Lei Chen. 2021. Cache-Based GNN System for Dynamic Graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 937–946.
- Li et al. (2021a) Houyi Li, Yongchao Liu, Yongyong Li, Bin Huang, Peng Zhang, Guowei Zhang, Xintan Zeng, Kefeng Deng, Wenguang Chen, and Changhua He. 2021a. GraphTheta: A distributed graph neural network learning system with flexible training strategy. arXiv preprint arXiv:2104.10569 (2021), 1–18.
- Li et al. (2020) Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of the VLDB Endowment 13, 12 (2020), 3005–3018.
- Li et al. (2021b) Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong. 2021b. Structure-Aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 975–985.
- Lin et al. (2020) Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. 2020. PaGraph: Scaling GNN Training on Large Graphs via Computation-Aware Caching. In Proceedings of the 11th ACM Symposium on Cloud Computing. 401–415.
- Liu et al. (2021a) Meng Liu, Youzhi Luo, Limei Wang, Yaochen Xie, Hao Yuan, Shurui Gui, Haiyang Yu, Zhao Xu, Jingtun Zhang, Yi Liu, et al. 2021a. DIG: A Turnkey Library for Diving into Graph Deep Learning Research. Journal of Machine Learning Research 22 (2021), 1–9.
- Liu et al. (2023) Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, and Chuanxiong Guo. 2023. BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation. 103–118.
- Liu et al. (2022) Xin Liu, Mingyu Yan, Lei Deng, Guoqi Li, Xiaochun Ye, Dongrui Fan, Shirui Pan, and Yuan Xie. 2022. Survey on Graph Neural Network Acceleration: An Algorithmic Perspective. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. 5521–5529.
- Liu et al. (2021b) Zirui Liu, Kaixiong Zhou, Fan Yang, Li Li, Rui Chen, and Xia Hu. 2021b. EXACT: Scalable graph neural networks training via extreme activation compression. In International Conference on Learning Representations. 1–32.
- Low et al. (2012) Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (2012), 716–727.
- Ma et al. (2019) Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. 2019. NeuGraph: Parallel Deep Neural Network Computation on Large Graphs. In 2019 USENIX Annual Technical Conference. 443–458.
- Malewicz et al. (2010) Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 135–146.
- Mattson et al. (2020) Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Cliff Young, and Matei Zaharia. 2020. MLPerf Training Benchmark. In Proceedings of Machine Learning and Systems. 336–349.
- Md et al. (2021) Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K Ahmed, and Sasikanth Avancha. 2021. Distgnn: Scalable distributed training for large-scale graph neural networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
- Min et al. (2022) Seung Won Min, Kun Wu, Mert Hidayetoglu, Jinjun Xiong, Xiang Song, and Wen-mei Hwu. 2022. Graph Neural Network Training and Data Tiering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3555–3565.
- Min et al. (2021) Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, and Wen-mei Hwu. 2021. Large graph convolutional network training with GPU-oriented data communication architecture. Proceedings of the VLDB Endowment 14, 11 (2021), 2087–2100.
- Mostafa (2022) Hesham Mostafa. 2022. Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs. In Proceedings of Machine Learning and Systems. 265–275.
- Narayanan et al. (2019) Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
- Pandey et al. (2020) Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S. Li, and Hang Liu. 2020. C-SAW: A Framework for Graph Sampling and Random Walk on GPUs. In International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
- Peng et al. (2022) Jingshu Peng, Zhao Chen, Yingxia Shao, Yanyan Shen, Lei Chen, and Jiannong Cao. 2022. Sancus: staleness-aware communication-avoiding full-graph decentralized training in large-scale graph neural networks. Proceedings of the VLDB Endowment 15, 9 (2022), 1937–1950.
- Rahman et al. (2021) Md Khaledur Rahman, Majedul Haque Sujon, and Ariful Azad. 2021. Fusedmm: A unified sddmm-spmm kernel for graph embedding and graph neural networks. In 2021 IEEE International Parallel and Distributed Processing Symposium. 256–266.
- Ramezani et al. (2021) Morteza Ramezani, Weilin Cong, Mehrdad Mahdavi, Mahmut T Kandemir, and Anand Sivasubramaniam. 2021. Learn locally, correct globally: A distributed algorithm for training graph neural networks. arXiv preprint arXiv:2111.08202 (2021), 1–32.
- Rao et al. (2021) Jiahua Rao, Xiang Zhou, Yutong Lu, Huiying Zhao, and Yuedong Yang. 2021. Imputing single-cell RNA-seq data by combining graph convolution and autoencoder neural networks. Iscience 24, 5 (2021), 102393.
- Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems. 1–9.
- Sanchez-Gonzalez et al. (2018) Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. 2018. Graph Networks as Learnable Physics Engines for Inference and Control. In Proceedings of the 35th International Conference on Machine Learning. 4470–4479.
- Selvitopi et al. (2021) Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2021. Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication. In Proceedings of the ACM International Conference on Supercomputing. 431–442.
- Song and Jiang (2022) Shihui Song and Peng Jiang. 2022. Rethinking Graph Data Placement for Graph Neural Network Training on Multiple GPUs. In Proceedings of the 36th ACM International Conference on Supercomputing. 1–10.
- Song et al. (2021) Zheng Song, Fengshan Bai, Jianfeng Zhao, and Jie Zhang. 2021. Spammer Detection Using Graph-level Classification Model of Graph Neural Network. In 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering. 531–538.
- Song et al. (2022) Zhen Song, Yu Gu, Jianzhong Qi, Zhigang Wang, and Ge Yu. 2022. EC-Graph: A Distributed Graph Neural Network System with Error-Compensated Compression. In 2022 IEEE 38th International Conference on Data Engineering. 648–660.
- Stanton and Kliot (2012) Isabelle Stanton and Gabriel Kliot. 2012. Streaming Graph Partitioning for Large Distributed Graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1222–1230.
- Sun et al. (2023) Jie Sun, Li Su, Zuocheng Shi, Wenting Shen, Zeke Wang, Lei Wang, Jie Zhang, Yong Li, Wenyuan Yu, Jingren Zhou, and Fei Wu. 2023. Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training. In Proceedings of the USENIX Annual Technical Conference. 165–179.
- Tailor et al. (2020) Shyam A Tailor, Javier Fernandez-Marques, and Nicholas D Lane. 2020. Degree-quant: Quantization-aware training for graph neural networks. arXiv preprint arXiv:2008.05000 (2020), 1–22.
- Tan et al. (2019) Qiaoyu Tan, Ninghao Liu, and Xia Hu. 2019. Deep Representation Learning for Social Network Analysis. Frontiers in Big Data 2 (2019), 2.
- Thorpe et al. (2021) John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, et al. 2021. Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. In 15th USENIX Symposium on Operating Systems Design and Implementation. 495–514.
- Tian et al. (2020) Chao Tian, Lingxiao Ma, Zhi Yang, and Yafei Dai. 2020. PCGCN: Partition-Centric Processing for Accelerating Graph Convolutional Network. In 2020 IEEE International Parallel and Distributed Processing Symposium. 936–945.
- Tripathy et al. (2020) Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2020. Reducing communication in graph neural network training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
- Vatter et al. (2023) Jana Vatter, Ruben Mayer, and Hans-Arno Jacobsen. 2023. The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey. Comput. Surveys (2023), 1–35.
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations. 1–12.
- Verbraeken et al. (2020) Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A Survey on Distributed Machine Learning. Acm computing surveys 53, 2 (2020), 1–33.
- Villa et al. (2019) Oreste Villa, Mark Stephenson, David Nellans, and Stephen W Keckler. 2019. Nvbit: A dynamic binary instrumentation framework for nvidia gpus. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 372–383.
- Wan et al. (2023b) Borui Wan, Juntao Zhao, and Chuan Wu. 2023b. Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training. In Proceedings of Machine Learning and Systems. 1–15.
- Wan et al. (2022a) Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, and Yingyan Lin. 2022a. BNS-GCN: Efficient full-graph training of graph convolutional networks with partition-parallelism and random boundary node sampling. In Proceedings of Machine Learning and Systems. 673–693.
- Wan et al. (2022b) Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, and Yingyan Lin. 2022b. PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication. In International Conference on Learning Representations. 1–24.
- Wan et al. (2023a) Xinchen Wan, Kaiqiang Xu, Xudong Liao, Yilun Jin, Kai Chen, and Xin Jin. 2023a. Scalable and Efficient Full-Graph GNN Training for Large Graphs. In Proceedings of the 2023 ACM SIGMOD International Conference on Management of Data. 1–23.
- Wang et al. (2021c) Hanchen Wang, Defu Lian, Ying Zhang, Lu Qin, Xiangjian He, Yiguang Lin, and Xuemin Lin. 2021c. Binarized graph neural network. In World Wide Web. 825–848.
- Wang et al. (2021d) Junfu Wang, Yunhong Wang, Zhen Yang, Liang Yang, and Yuanfang Guo. 2021d. Bi-GCN: Binary Graph Convolutional Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1561–1570.
- Wang et al. (2021e) Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, and Jingren Zhou. 2021e. FlexGraph: a flexible and efficient distributed framework for GNN training. In Proceedings of the Sixteenth European Conference on Computer Systems. 67–82.
- Wang et al. (2022b) M. Wang, W. Fu, X. He, S. Hao, and X. Wu. 2022b. A Survey on Large-Scale Machine Learning. IEEE Transactions on Knowledge & Data Engineering 34, 06 (2022), 2574–2594.
- Wang et al. (2019b) Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. 2019b. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. In ICLR workshop on representation learning on graphs and manifolds. 1–7.
- Wang et al. (2021b) Pengyu Wang, Chao Li, Jing Wang, Taolei Wang, Lu Zhang, Jingwen Leng, Quan Chen, and Minyi Guo. 2021b. Skywalker: Efficient alias-method-based graph sampling and random walk on gpus. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 304–317.
- Wang et al. (2022c) Qiange Wang, Yanfeng Zhang, Hao Wang, Chaoyi Chen, Xiaodong Zhang, and Ge Yu. 2022c. NeutronStar: Distributed GNN Training with Hybrid Dependency Management. In Proceedings of the 2022 International Conference on Management of Data. 1301–1315.
- Wang et al. (2019a) Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019a. Heterogeneous graph attention network. In The world wide web conference. 2022–2032.
- Wang et al. (2022a) Yuke Wang, Boyuan Feng, and Yufei Ding. 2022a. QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 107–119.
- Wang et al. (2021a) Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding. 2021a. GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 515–531. https://www.usenix.org/conference/osdi21/presentation/wang-yuke
- Welling and Kipf (2017) Max Welling and Thomas N Kipf. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations. 1–14.
- Wolfe et al. (2021) Cameron R Wolfe, Jingkang Yang, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, and Anastasios Kyrillidis. 2021. GIST: Distributed training for large-scale graph convolutional networks. arXiv preprint arXiv:2102.10424 (2021), 1–28.
- Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning. 6861–6871.
- Wu et al. (2022) Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. 2022. Graph Neural Networks in Recommender Systems: A Survey. Comput. Surveys (2022), 1–37.
- Wu et al. (2020) Yongji Wu, Defu Lian, Yiheng Xu, Le Wu, and Enhong Chen. 2020. Graph Convolutional Networks with Markov Random Field Reasoning for Social Spammer Detection. In Proceedings of the AAAI Conference on Artificial Intelligence. 1054–1061.
- Wu et al. (2021a) Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chenguang Zheng, James Cheng, and Fan Yu. 2021a. Seastar: vertex-centric programming for graph neural networks. In Proceedings of the Sixteenth European Conference on Computer Systems. 359–375.
- Wu et al. (2021b) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2021b. A Comprehensive Survey on Graph Neural Networks. IEEE transactions on neural networks and learning systems 32, 1 (2021), 4–24.
- Xie et al. (2014) Cong Xie, Ling Yan, Wu-Jun Li, and Zhihua Zhang. 2014. Distributed Power-law Graph Computing: Theoretical and Empirical Analysis. In Advances in Neural Information Processing Systems. 1–9.
- Xie et al. (2022b) Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. 2022b. Self-Supervised Learning of Graph Neural Networks: A Unified Review. IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (2022), 1–1.
- Xie et al. (2022a) Zhiqiang Xie, Minjie Wang, Zihao Ye, Zheng Zhang, and Rui Fan. 2022a. Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph. In Proceedings of Machine Learning and Systems. 515–528.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations. 1–17.
- Xu et al. (2015) Ning Xu, Bin Cui, Lei Chen, Zi Huang, and Yingxia Shao. 2015. Heterogeneous Environment Aware Streaming Graph Partitioning. IEEE Transactions on Knowledge and Data Engineering (2015), 1560–1572.
- Xue et al. (2022) Zihui Xue, Yuedong Yang, Mengtian Yang, and Radu Marculescu. 2022. SUGAR: Efficient Subgraph-level Training via Resource-aware Graph Partitioning. arXiv preprint arXiv:2202.00075 (2022), 1–16.
- Yan et al. (2020) Bencheng Yan, Chaokun Wang, Gaoyang Guo, and Yunkai Lou. 2020. TinyGNN: Learning Efficient Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1848–1856.
- Yang et al. (2022) Jianbang Yang, Dahai Tang, Xiaoniu Song, Lei Wang, Qiang Yin, Rong Chen, Wenyuan Yu, and Jingren Zhou. 2022. GNNLab: A Factored System for Sample-Based GNN Training over GPUs. In Proceedings of the Seventeenth European Conference on Computer Systems. 417–434.
- Yang et al. (2019) Ke Yang, MingXing Zhang, Kang Chen, Xiaosong Ma, Yang Bai, and Yong Jiang. 2019. Knightking: a fast distributed graph random walk engine. In Proceedings of the ACM symposium on operating systems principles. 524–537.
- You et al. (2020) Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. 2020. L2-GCN: Layer-Wise and Learned Efficient Training of Graph Convolutional Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1–9.
- Zeng and Prasanna (2020) Hanqing Zeng and Viktor Prasanna. 2020. GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms. In 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 255–265.
- Zeng et al. (2020) Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. GraphSAINT: Graph Sampling Based Inductive Learning Method. In International Conference on Learning Representations. 1–19.
- Zhang et al. (2019b) Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. 2019b. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 793–803.
- Zhang et al. (2020a) Dalong Zhang, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi. 2020a. AGL: A Scalable System for Industrial-Purpose Graph Machine Learning. Proceedings of the VLDB Endowment 13, 12 (2020), 3125–3137.
- Zhang et al. (2019a) Guo Zhang, Hao He, and Dina Katabi. 2019a. Circuit-GNN: Graph Neural Networks for Distributed Circuit Design. In Proceedings of the 36th International Conference on Machine Learning. 7364–7373.
- Zhang et al. (2022b) Hengrui Zhang, Zhongming Yu, Guohao Dai, Guyue Huang, Yufei Ding, Yuan Xie, and Yu Wang. 2022b. Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective. In Proceedings of Machine Learning and Systems. 467–484.
- Zhang et al. (2020b) Weijia Zhang, Hao Liu, Yanchi Liu, Jingbo Zhou, and Hui Xiong. 2020b. Semi-Supervised Hierarchical Recurrent Graph Neural Network for City-Wide Parking Availability Prediction. In The Thirty-Fourth AAAI Conference on Artificial Intelligence. 1186–1193.
- Zhang et al. (2020c) Wentao Zhang, Xupeng Miao, Yingxia Shao, Jiawei Jiang, Lei Chen, Olivier Ruas, and Bin Cui. 2020c. Reliable Data Distillation on Graph Convolutional Network. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1399–1414.
- Zhang et al. (2023) Xin Zhang, Yanyan Shen, Yingxia Shao, and Lei Chen. 2023. DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU. In Proceedings of the ACM on Management of Data. 1–24.
- Zhang et al. (2021) Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. 2021. Graph Neural Networks and Their Current Applications in Bioinformatics. Frontiers in Genetics 12 (2021), 1–22.
- Zhang et al. (2020d) Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020d. Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 334–339.
- Zhang et al. (2022a) Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2022a. Deep Learning on Graphs: A Survey. IEEE Transactions on Knowledge and Data Engineering 34, 1 (2022), 249–270.
- Zhao et al. (2021c) Guoyi Zhao, Tian Zhou, and Lixin Gao. 2021c. CM-GCN: A Distributed Framework for Graph Convolutional Networks using Cohesive Mini-batches. In 2021 IEEE International Conference on Big Data. 153–163.
- Zhao et al. (2021b) Jianan Zhao, Xiao Wang, Chuan Shi, Binbin Hu, Guojie Song, and Yanfang Ye. 2021b. Heterogeneous graph structure learning for graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence. 4697–4705.
- Zhao et al. (2021a) Taige Zhao, Xiangyu Song, Jianxin Li, Wei Luo, and Imran Razzak. 2021a. Distributed Optimization of Graph Convolutional Network using Subgraph Variance. arXiv preprint arXiv:2110.02987 (2021), 1–12.
- Zhao et al. (2020) Yiren Zhao, Duo Wang, Daniel Bates, Robert Mullins, Mateja Jamnik, and Pietro Lio. 2020. Learned low precision graph neural networks. arXiv preprint arXiv:2009.09232 (2020), 1–14.
- Zheng et al. (2022a) Chenguang Zheng, Hongzhi Chen, Yuxuan Cheng, Zhezheng Song, Yifan Wu, Changji Li, James Cheng, Hao Yang, and Shuai Zhang. 2022a. ByteGNN: efficient graph neural network training at large scale. Proceedings of the VLDB Endowment 15, 6 (2022), 1228–1242.
- Zheng et al. (2020b) Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020b. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence. 1234–1241.
- Zheng et al. (2020c) Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. 2020c. DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs. In 10th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms. 36–44.
- Zheng et al. (2022b) Da Zheng, Xiang Song, Chengru Yang, Dominique LaSalle, and George Karypis. 2022b. Distributed Hybrid CPU and GPU Training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4582–4591.
- Zheng et al. (2020a) Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. 2020a. Learning Sparse Nonparametric DAGs. In The 23rd International Conference on Artificial Intelligence and Statistics. 3414–3425.
- Zhong et al. (2020) Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2020. Reasoning Over Semantic-Level Graph for Fact Checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6170–6180.
- Zhou et al. (2021b) Hongkuan Zhou, Ajitesh Srivastava, Hanqing Zeng, Rajgopal Kannan, and Viktor Prasanna. 2021b. Accelerating large scale real-time GNN inference using channel pruning. Proceedings of the VLDB Endowment 14, 9 (2021), 1597–1605.
- Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open (2020), 57–81.
- Zhou et al. (2021a) Zhe Zhou, Cong Li, Xuechao Wei, and Guangyu Sun. 2021a. Gcnear: A hybrid architecture for efficient gcn training with near-memory processing. arXiv preprint arXiv:2111.00680 (2021), 1–15.
- Zhu et al. (2018) Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. Benchmarking and Analyzing Deep Neural Network Training. In 2018 IEEE International Symposium on Workload Characterization. 88–100.
- Zhu et al. (2019) Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. 2019. AliGraph: a comprehensive graph neural network platform. Proceedings of the VLDB Endowment 12, 12 (2019), 2094–2105.
- Zou et al. (2019) Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. 2019. Layer-dependent importance sampling for training deep and large graph convolutional networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 11249–11259.
Appendix A Distributed GNN Training Systems
In this section, we present a detailed introduction to the representative systems listed in Table 4. We systematically summarize the evolution of these systems in Figure 10, where the evolution of techniques is marked on the edges. In the following, we highlight the main techniques each system leverages to optimize training efficiency and further introduce some other system settings and optimizations not presented in the table.

A.1. Systems on a single machine with multiple GPUs
GPUs are the most powerful computing resource for training neural network models. A common solution for distributed GNN training is to use multiple GPUs on a single machine. Here we introduce some popular systems that support efficient multi-GPU training.
NeuGraph (Ma et al., 2019) is a pioneering work that aims at enabling scalable full-graph GNN training by bridging deep learning systems and graph processing systems with a novel programming abstraction called SAGA-NN. To support the execution of the SAGA-NN model on large graphs, it adopts a locality-aware 2D graph partitioning scheme that splits the large graph into disjoint edge chunks. These edge chunks are scheduled sequentially to each GPU to perform chunk-based computation graph execution. To improve the performance of CPU-GPU data transfer, a selective scheduling strategy is proposed, which filters only the useful vertices from a large chunk and sends them to the GPU. A chain-based chunk scheduling scheme is designed to fully utilize the PCIe bandwidth and maximize performance in multi-GPU GNN training. Other optimizations, such as redundant computation elimination and stage fusion, speed up the computation in GPU kernels by considering both the characteristics of the graph structure and the specific GNN model. Overall, NeuGraph, implemented on top of TensorFlow with C++ and Python, naturally combines deep learning systems with graph computation and fully utilizes the CPU's large memory, PCIe's hierarchical structure, and the GPU's computing power.
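As an illustration of the 2D edge-chunk layout, the Python sketch below splits the vertex set into equal-width blocks and assigns edge (u, v) to chunk (block(u), block(v)); the chunks can then be streamed to GPUs one at a time. The function and its block-assignment rule are a simplified assumption, not NeuGraph's actual locality-aware algorithm.

```python
def partition_edges_2d(edges, num_vertices, num_blocks):
    """Split edges into a num_blocks x num_blocks grid of disjoint edge chunks."""
    width = (num_vertices + num_blocks - 1) // num_blocks  # vertices per block
    chunks = {(i, j): [] for i in range(num_blocks) for j in range(num_blocks)}
    for u, v in edges:
        chunks[(u // width, v // width)].append((u, v))
    return chunks

# Toy usage: 6 vertices, 2 blocks of 3 vertices each -> a 2 x 2 grid of chunks.
edges = [(0, 1), (0, 4), (3, 2), (5, 5)]
chunks = partition_edges_2d(edges, num_vertices=6, num_blocks=2)
```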
PaGraph (Lin et al., 2020) is a GNN framework implemented on top of DGL and PyTorch that supports sampling-based GNN training on multiple GPUs. The basic idea of PaGraph is to leverage spare GPU memory as a feature cache during sampling-based GNN training. It mainly focuses on the workload imbalance and redundant vertex access problems, and designs a heuristic partitioning strategy that considers the affinity of each vertex to a partition. To support multiple GPUs and reduce the storage overhead of partitions, PaGraph removes, after partitioning, the redundant vertices and edges that do not contribute to GNN training. To further improve computing efficiency, resource isolation is used to avoid interference between different operations, and local shuffling is added to improve data-parallel training. In summary, PaGraph accelerates GNN training through its cache and graph partition strategies, and achieves up to a 96.8% reduction in data transfer and up to a 16× performance improvement compared to DGL.
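The caching idea can be sketched in a few lines of Python: preload the features of the vertices most likely to be touched (approximated here by out-degree) into spare GPU memory, and fall back to CPU memory only on cache misses. The function names, the NumPy feature layout, and the exact hotness metric are illustrative assumptions rather than PaGraph's precise policy.

```python
import numpy as np

def build_feature_cache(out_degrees, features, cache_budget):
    """Preload the features of the highest out-degree vertices into the (GPU) cache."""
    hot = np.argsort(-out_degrees)[:cache_budget]
    return {int(v): features[v] for v in hot}

def gather_features(batch_vertices, cache, cpu_features):
    """Serve hits from the cache; misses are what would cross PCIe at run time."""
    gathered = {}
    misses = 0
    for v in batch_vertices:
        if v in cache:
            gathered[v] = cache[v]
        else:
            gathered[v] = cpu_features[v]
            misses += 1
    hit_ratio = 1.0 - misses / max(len(batch_vertices), 1)
    return gathered, hit_ratio
```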
BGL (Liu et al., 2023) is a GPU-efficient GNN training system for large graph learning, which focuses on sampling-based GNNs. To remove the heavy network contention caused by feature retrieval, BGL adopts a two-level cache spanning multi-GPU and CPU memory to maximize the cache size. After partitioning the graph via BFS and an assignment heuristic, BGL further improves training speed by allocating resources based on profiling. The computation in BGL is split into eight stages, and the profiling-based resource allocation method distributes isolated resources to different stages with the goal of minimizing the maximal completion time over all stages. In conclusion, BGL is a generic system that can be applied to various computation backends (DGL, Euler (https://github.com/alibaba/euler), PyG) and pushes GPU utilization to a significantly higher level compared to existing frameworks.
GNNLab (Yang et al., 2022) is a new GNN system optimized for sampling-based GNN models, built on an overall investigation of conventionally designed sampling-based systems. It introduces a novel cache policy based on pre-sampling, which achieves efficient and robust results across various sampling algorithms and datasets. Based on its factored design for sampling-based training, GNNLab further addresses workload imbalance across GPUs by assigning GPUs to different executors on demand according to the workload. GNNLab first assigns a certain number of GPUs computed by a heuristic equation, and then dynamically switches executors between standby and active Trainers according to their working efficiency in order to keep the GPU allocation close to optimal. In summary, this system eliminates the heavy data I/O for feature retrieval and the GPU memory contention of the conventional GNN execution model, outperforming DGL and PyG by up to 9.1× and 74.3×, respectively.
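The pre-sampling-based cache policy can be approximated by the short Python sketch below: run the sampler for a few warm-up epochs, count how often each vertex appears in sampled mini-batches, and cache the most frequently observed vertices. The sampler and train_seed_batches callables are hypothetical placeholders; GNNLab's actual policy is more elaborate.

```python
from collections import Counter

def presample_hotness(sampler, train_seed_batches, num_epochs=3):
    """Count how often each vertex appears in sampled mini-batches during warm-up."""
    counts = Counter()
    for _ in range(num_epochs):
        for seeds in train_seed_batches:
            for v in sampler(seeds):   # sampler(seeds) yields the vertex ids of one mini-batch
                counts[v] += 1
    return counts

def choose_cache_set(counts, cache_budget):
    """Cache the vertices observed most often during pre-sampling."""
    return {v for v, _ in counts.most_common(cache_budget)}
```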
Rethinking (Song and Jiang, 2022) develops a data-parallel GNN training system that focuses on optimizing data movement between the CPU and GPUs. Unlike other systems, it allows GPUs to fetch data from each other by leveraging the high bandwidth among them. It builds a cost model to estimate the cost of data movement between the CPU and GPUs, which can be minimized by an appropriate data placement strategy. Based on the cost model, it formulates data placement as a constrained optimization problem and develops an efficient solution to find a good data partition. In addition, to further reduce data movement, it introduces locality-aware neighbor sampling to increase the sampling probability of vertices already resident on a GPU, which incur little loading time. Empirical studies demonstrate that, with this smart data placement strategy, it achieves better data loading efficiency than DGL (Wang et al., 2019b) and PaGraph (Lin et al., 2020).
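A heavily simplified, greedy approximation of cost-driven data placement is sketched below: the most frequently accessed vertices are placed on the devices reachable over the fastest links, subject to per-device capacity. The actual system solves a constrained optimization problem rather than this greedy loop, and all names here are illustrative.

```python
def place_features(access_freq, bandwidth, capacity):
    """Greedy placement: hot vertices go to the devices with the fastest links."""
    # Devices ordered from fastest to slowest effective bandwidth
    # (e.g. local GPU, peer GPU over NVLink, CPU host memory).
    devices = sorted(bandwidth, key=bandwidth.get, reverse=True)
    remaining = dict(capacity)          # capacity measured in number of cached vertices
    placement = {}
    for v, _ in sorted(access_freq.items(), key=lambda kv: -kv[1]):
        for d in devices:
            if remaining[d] > 0:
                placement[v] = d
                remaining[d] -= 1
                break
    return placement

# Toy usage with two GPUs and host memory.
placement = place_features(
    access_freq={0: 90, 1: 55, 2: 10},
    bandwidth={"gpu0": 600, "gpu1": 300, "cpu": 16},
    capacity={"gpu0": 1, "gpu1": 1, "cpu": 10},
)
```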
DSP (Cai et al., 2023) introduces a distributed sampling technique to accelerate the batch generation stage in distributed GNN training. It presents a Collective Sampling Primitive (CSP) that sends sampling tasks to remote workers and receives only the sampled results instead of the entire neighbor lists. This reduces communication cost, since a sampled result requires fewer bits than a complete neighbor list. To support CSP effectively, DSP prioritizes caching the partitioned graph topology on each GPU. Theoretical analysis confirms that CSP can express both node-wise and layer-wise sampling for GNN batch generation. Moreover, DSP implements a mini-batch pipeline, assigning a sampler, a loader, and a trainer to each GPU, and a carefully designed centralized communication coordination prevents deadlocks in the pipeline. The distributed sampling technique proposed by DSP achieves up to a 20× speedup compared to existing CPU-based and UVA-based sampling methods, and in most cases the end-to-end training time also improves substantially compared to DGL (Wang et al., 2019b).
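The spirit of CSP, pushing the sampling to the owner of the data instead of pulling full neighbor lists, is captured by the Python sketch below. In a real system the per-owner loop would be an all-to-all exchange among GPUs; the dictionaries used here are illustrative stand-ins for DSP's cached graph topology.

```python
import random
from collections import defaultdict

def collective_sample(frontier, owner, adj_by_rank, fanout):
    """Owners sample locally and return only the sampled neighbor ids."""
    # Group the frontier vertices by the rank that owns them.
    requests = defaultdict(list)
    for v in frontier:
        requests[owner[v]].append(v)

    sampled = []
    # In a real system this loop is an all-to-all exchange across GPUs.
    for dst, verts in requests.items():
        for v in verts:
            neighbors = adj_by_rank[dst][v]     # adjacency held by the owning rank
            k = min(fanout, len(neighbors))
            sampled.extend(random.sample(neighbors, k))
    return sampled
```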
Legion (Sun et al., 2023) focuses on multi-GPU training with NVLink connections. It introduces a hierarchical graph partitioning mechanism that considers the structure of NVLink cliques: it first partitions the graph with METIS and assigns each partition to an NVLink clique. Exploiting the fact that communication among GPUs in the same NVLink clique is more cost-effective than between PCIe-connected GPUs, Legion sets up a public cache for the GPUs in each NVLink clique, which is stored in a distributed manner across that clique. Legion employs a batch generation mechanism similar to PaGraph (Lin et al., 2020), sampling training vertices only from the partition assigned to an NVLink clique; each graph partition therefore enjoys a dedicated cache, which improves the cache hit ratio. Inspired by GNNLab (Yang et al., 2022), Legion also utilizes a pre-sampling mechanism to identify the most frequently accessed vertices for graph sampling and feature extraction. Legion proposes a caching strategy that caches both the graph topology and the vertex features, which is also proposed in a concurrent work (Zhang et al., 2023). The split of GPU memory between the topology cache and the feature cache is controlled by a tunable ratio parameter, and a linear search over this parameter finds the best split by carefully modeling the cache benefit of each candidate value. Legion successfully pushes the envelope of multi-GPU GNN training to billion-scale graphs and achieves significant speedups for end-to-end GNN training.
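The linear search over the cache split can be sketched as follows; topo_benefit and feat_benefit stand in for the benefit models (e.g., estimated transfer volume saved by a cache of a given size), and both the step size and the function names are our assumptions.

```python
def search_cache_split(total_bytes, topo_benefit, feat_benefit, step=0.05):
    """Sweep the fraction of cache memory given to topology vs. features."""
    best_ratio, best_gain = 0.0, float("-inf")
    num_steps = int(round(1.0 / step))
    for i in range(num_steps + 1):
        ratio = i * step                      # fraction of memory for the topology cache
        gain = topo_benefit(ratio * total_bytes) + feat_benefit((1.0 - ratio) * total_bytes)
        if gain > best_gain:
            best_ratio, best_gain = ratio, gain
    return best_ratio

# Toy benefit models with diminishing returns (purely illustrative).
best = search_cache_split(
    total_bytes=8e9,
    topo_benefit=lambda b: 3.0 * (b ** 0.5),
    feat_benefit=lambda b: 5.0 * (b ** 0.5),
)
```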
A.2. Systems on GPU clusters
A GPU cluster extends the single-machine multi-GPU setting by providing multiple machines, each with multiple GPUs, for the case where the graph and the input features are too large to fit in a single machine. With a GPU cluster, both the communication between CPU and GPU and the communication among machines must be considered. The following systems have been evaluated on GPU clusters in their original experiments.
ROC (Jia et al., 2020) is a distributed full-graph GNN training and inference system implemented on top of FlexFlow (Jia et al., 2019), a GPU-optimized graph processing system. ROC supports both multiple GPUs and GPU clusters. In ROC, GPU memory is used as a cache, while CPU memory stores all GNN tensors and intermediate results so that larger GNNs can be supported. Hence, ROC only requires all GNN tensors to fit in CPU DRAM, and each GPU accesses non-local graph data and features via data transfers between CPU and GPU. ROC applies a learning-based graph partitioner to balance the workload and a dynamic programming-based memory manager to minimize data transfers between CPU and GPU memories. The memory manager is guided by the topology of the GNN architecture and determines which tensors to cache in GPU memory by analyzing the GNN's future operations.
P3 (Gandhi and Iyer, 2021) is a distributed system that introduces model parallelism into GNN training and orchestrates it with data parallelism to reduce the total communication cost. P3 builds on the observation that a large communication overhead is incurred when transferring vertex features in the first layer, and designs a column-wise partitioning strategy to distribute the features to different workers. Specifically, unlike traditional data-parallel GNN training, P3 partitions the graph structure and the vertex features independently. During training, in the first layer, each worker handles the given columns of features of all vertices and fetches graph structures from other workers. In this way, only the graph structure is transferred among workers, which incurs less network overhead than transferring the vertex features. For the remaining layers, traditional data parallelism is adopted. This hybrid model-data parallelism is named push-pull parallelism in P3, and under it a simple hash partitioning of the graph structure is sufficient. Moreover, a mini-batch pipeline is designed to fully utilize the computing resources, and an asynchronous communication protocol with bounded staleness is used. P3 is the first to incorporate model parallelism into distributed GNN training, and its new training approach improves overall performance, especially when the feature size is large.
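The benefit of the column-wise split in the first layer can be seen from the following NumPy sketch: each worker holds a column slice of every vertex's features and the matching row slice of the layer-1 weight matrix, computes a partial hidden state, and the partial results are summed across workers (an all-reduce), so raw features never need to be shipped. The toy example simply replicates the adjacency matrix on every worker; in P3 the graph structure is hash-partitioned and pulled instead.

```python
import numpy as np

def first_layer_partial(feature_cols, weight_rows, adj):
    """Compute a partial layer-1 output from one worker's feature/weight slice."""
    aggregated = adj @ feature_cols      # neighbor aggregation over the structure
    return aggregated @ weight_rows      # partial transform; same shape on every worker

def combine_partials(partials):
    """All-reduce (here a plain sum) of the per-worker partial results."""
    return np.sum(partials, axis=0)

# Toy example with 2 workers splitting a 6-dim feature into two 3-dim slices.
n, f, h = 5, 6, 4
adj = np.random.rand(n, n)
X = np.random.rand(n, f)
W = np.random.rand(f, h)
parts = [first_layer_partial(X[:, s], W[s, :], adj) for s in (slice(0, 3), slice(3, 6))]
full = combine_partials(parts)
assert np.allclose(full, adj @ X @ W)    # the column split is exact, not an approximation
```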
SANCUS (Peng et al., 2022) is a decentralized GNN training framework that introduces historical embeddings into full-graph training in order to reduce the large communication overhead of broadcasting the latest embeddings. Unlike a traditional parameter server-based architecture, it applies a decentralized training architecture to eliminate the bandwidth bottleneck of a central node. The graph and features are co-partitioned among a set of workers (e.g., GPUs). To avoid massive communication among workers, a new communication primitive, called skip-broadcast, is proposed to support staleness-aware training. Skip-broadcast is compatible with collective operations such as ring-based pipeline broadcast and all-reduce, and it reshapes the underlying communication topology by automatically skipping communication when the cached embedding is not too stale. SANCUS supports three different staleness metrics, among which the epoch-adaptive variation-gap embedding staleness works best in general. Empirical results show that SANCUS reduces communication by up to 74% without losing accuracy.
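A minimal sketch of the skip-broadcast decision, using a simple epoch-gap staleness check (only one of the staleness notions SANCUS supports), is shown below; broadcast_fn, the cache layout, and the bookkeeping are illustrative assumptions.

```python
def maybe_broadcast(epoch, local_embeddings, peer_cache, last_sync_epoch,
                    staleness_bound, broadcast_fn):
    """Re-broadcast a partition's embeddings only when the peers' copy is too stale."""
    if epoch - last_sync_epoch >= staleness_bound:
        broadcast_fn(local_embeddings)                 # e.g. a ring-based pipeline broadcast
        peer_cache["embeddings"] = local_embeddings    # peers now hold a fresh copy
        return epoch                                   # new last-sync epoch
    # Otherwise skip the broadcast and let peers keep using the historical copy.
    return last_sync_epoch
```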
NeutronStar (Wang et al., 2022c) is a distributed GNN training system with a hybrid dependency management approach. It first introduces a cost model to estimate the redundant computation caused by storing duplicated vertices on different GPUs and the communication overhead caused by transferring boundary vertices. It then proposes a greedy heuristic algorithm that splits the dependencies into cached and communicated groups so as to minimize the cost of accessing dependencies. In addition, NeutronStar decouples dependency management from graph operations and NN functions for flexible automatic differentiation, and provides high-level Python APIs for both forward and backward computation. To sum up, NeutronStar combines the two dependency-access mechanisms for speedy GNN training on GPU clusters and integrates optimizations for CPU-GPU heterogeneous computation and communication, outperforming DistDGL and ROC by up to 1.8× and 14.3×, respectively.
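The greedy split between cached and communicated dependencies can be sketched as follows; comm_cost and recompute_cost are placeholders for the system's cost model, and the tie-breaking rule is our own assumption.

```python
def split_dependencies(remote_deps, comm_cost, recompute_cost):
    """Assign each cross-partition dependency to the cheaper handling strategy."""
    cached, communicated = [], []
    for dep in remote_deps:
        # Either duplicate/cache the vertex locally (paying recomputation and memory)
        # or fetch its embedding over the network each time it is needed.
        if recompute_cost(dep) <= comm_cost(dep):
            cached.append(dep)
        else:
            communicated.append(dep)
    return cached, communicated
```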
DGCL (Cai et al., 2021) is a GNN-oriented communication library for GPU clusters. It takes both the physical communication topology and the communication relation determined by the GNN model into consideration when planning communication. Based on the observation that the communication topology is commonly hierarchical, DGCL splits each data transfer into stages corresponding to the number of links between the source GPU and the destination GPU. DGCL then introduces a heuristic cost model for each communication plan and obtains the optimal plan via an efficient spanning-tree algorithm. For its implementation, DGCL uses MPI and PyTorch to support the distributed environment, and applies Horovod and PyTorch for distributed model synchronization. According to the authors' empirical studies, the library reduces peer-to-peer communication time by 77.5% on average and per-epoch training time by up to 47%.
PipeGCN (Wan et al., 2022b) is a full-graph distributed training system developed on top of DistDGL that introduces stale embeddings to form a pipeline. For neighbors on remote workers, PipeGCN leverages a deferred communication strategy that uses the historical embeddings and gradients from the previous epoch, while the latest embeddings are still computed for local neighbors. With this mixture of fresh and stale features (and gradients), PipeGCN can better overlap the communication overhead with the computation cost. PipeGCN further gives a detailed proof of its bounded error and convergence. In addition, to reduce the effect of stale information, PipeGCN proposes a lightweight moving average to smooth the fluctuations of boundary vertices' features (and gradients). PipeGCN is a representative system equipped with an asynchronous computation graph execution model and an asynchronous communication protocol. With a simple strategy that fixes the staleness bound to one epoch, PipeGCN reduces the end-to-end training time by around 50% without compromising accuracy.
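A minimal sketch of the moving-average smoothing applied to the historical embeddings of boundary vertices is shown below; the decay factor gamma and the NumPy layout are illustrative assumptions, not PipeGCN's exact formulation.

```python
import numpy as np

def smooth_boundary(history, fresh, gamma=0.95):
    """Exponential moving average that damps epoch-to-epoch fluctuations
    of a boundary vertex's stale feature (or gradient)."""
    return gamma * history + (1.0 - gamma) * fresh

# Per-epoch update of the historical embeddings kept for remote (boundary) vertices.
hist = np.zeros((8, 16), dtype=np.float32)               # cached embeddings from earlier epochs
for epoch in range(3):
    fresh = np.random.randn(8, 16).astype(np.float32)    # embeddings received this epoch
    hist = smooth_boundary(hist, fresh)
```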
BNS-GCN (Wan et al., 2022a) aims to find a balance between the large communication overhead of full-graph training and the accuracy loss of sampling-based training. The idea is to communicate only a portion of the boundary vertices of each remote worker: each partition stochastically selects a fraction of vertices from its boundary set and uses only their features for GNN training. Thanks to its simple strategy, BNS-GCN can be merged into any distributed GNN training system based on edge-cut partitioning without introducing excessive computation. Implemented on top of DGL, this strategy achieves full-graph accuracy while enjoying the training speed of sampling-based methods. This simple combination of the two basic GNN acceleration techniques, i.e., distributed full-graph training and sampling-based training, leads to an interesting training paradigm for further study.
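A minimal sketch of boundary-node sampling is given below; the keep fraction and names are illustrative.

```python
# Minimal sketch of boundary-node sampling in the spirit of BNS-GCN
# (the sampling fraction and names are illustrative).
import random

def sample_boundary(boundary_set, keep_fraction=0.1, seed=None):
    """Randomly keep only a fraction of boundary vertices; only their
    features are communicated in this epoch."""
    rng = random.Random(seed)
    k = max(1, int(len(boundary_set) * keep_fraction))
    return set(rng.sample(sorted(boundary_set), k))

# Toy usage: communicate roughly 10% of 1000 boundary vertices.
boundary = set(range(1000))
print(len(sample_boundary(boundary, keep_fraction=0.1, seed=0)))  # -> 100
```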
DistDGLv2 (Zheng et al., 2022b) is a follow-up work to DistDGL by the DGL team at Amazon. It takes the characteristics of heterogeneous graphs into consideration for balanced graph partitioning. Specifically, it extends METIS to split heterogeneous graphs regardless of their vertex types and adopts hierarchical partitioning to first reduce the communication across the network and then reduce the data copied to GPUs in a mini-batch. In addition, it designs a mini-batch generation pipeline to enable parallel preparation of several mini-batches on the CPU. A distributed hybrid CPU/GPU sampling technique is designed to fully utilize GPU computation: it samples vertices and edges for each hop of the neighborhood on the CPU and performs graph compaction on the GPU to remove vertices and edges unnecessary for mini-batch computation. The proposed mini-batch pipeline achieves a 2-3× end-to-end training speedup over DistDGL and is planned to be merged and released in the DGL source code.
SALIENT++ (Kaler et al., 2023) extends SALIENT to distributed GNN training and optimizes the communication cost with an analysis-driven cache strategy. It designs a propagation model to compute the probability of each vertex being sampled. Given a GNN model with a specific sampling strategy, the sampled probability of each neighbor of a center vertex can be accurately computed. With this propagation model, given a graph partition and the initial probability of each training vertex being sampled into a training batch, the sampled probability of every vertex in the partition is computed. SALIENT++ uses this probability to select important vertices into the cache of each worker. Furthermore, SALIENT++ proposes a fine-grained pipeline composed of 10 stages for feature collection. Experiments show that SALIENT++ achieves a speedup over DistDGL (Zheng et al., 2020c).
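The following hedged sketch illustrates the flavor of an analysis-driven cache: approximate per-vertex sampling probabilities are propagated over the hops of a uniform neighbor sampler, and the most probable remote vertices are cached. The propagation rule here is a simplification, not SALIENT++'s exact analysis.

```python
# Hedged sketch of a probability-propagation cache policy (uniform neighbor
# sampling is assumed; the real analysis is model- and sampler-specific).
from collections import defaultdict

def propagate_probabilities(adj, seed_prob, fanout, hops):
    """adj: vertex -> list of neighbors; seed_prob: probability that a vertex
    is drawn as a training seed. Returns an approximate probability that each
    vertex appears in a sampled mini-batch computation graph."""
    prob = dict(seed_prob)
    frontier = dict(seed_prob)
    for _ in range(hops):
        nxt = defaultdict(float)
        for v, p in frontier.items():
            nbrs = adj.get(v, [])
            if not nbrs:
                continue
            # Each neighbor is sampled with probability ~ fanout / deg(v).
            p_nbr = p * min(1.0, fanout / len(nbrs))
            for u in nbrs:
                nxt[u] += p_nbr
        for u, p in nxt.items():
            prob[u] = min(1.0, prob.get(u, 0.0) + p)
        frontier = nxt
    return prob

def pick_cache(prob, remote_vertices, capacity):
    # Cache the remote vertices most likely to be needed.
    return sorted(remote_vertices, key=lambda v: prob.get(v, 0.0),
                  reverse=True)[:capacity]

adj = {0: [1, 2], 1: [2, 3], 2: [3]}
prob = propagate_probabilities(adj, {0: 0.5, 1: 0.5}, fanout=2, hops=2)
print(pick_cache(prob, remote_vertices=[2, 3], capacity=1))
```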
G3 (Wan et al., 2023a) focuses on scalable full-graph training on GPU clusters. Graph partitioning and a fine-grained pipeline are the two main concerns of G3. First, G3 designs a two-step graph partitioning strategy to balance the workload and the communication cost on each worker. The first step employs multi-constraint METIS partitioning, with the balance of the number of vertices and the number of edges in each partition as the two constraints, while minimizing the total communication cost modeled by the number of cut edges. In the second step, G3 iteratively exchanges vertices between the partitions with the largest and smallest numbers of remote neighbors. This second step yields a communication-balanced workload partition by making each partition involve a roughly equal number of remote neighbors. Second, G3 proposes a fine-grained pipeline for full-graph distributed training. A bin-packing mechanism packs the training vertices on one GPU into different bins with the help of a proposed scheduling algorithm, so that on the local worker the execution of a GNN layer is split into the execution of bins, which can be overlapped. Moreover, some vertices become ready to process the next layer as soon as all of their remote neighbors' embeddings are received; these vertices are packed into a bin, and the execution of the next GNN layer on this bin starts immediately. This fine-grained pipeline implements both an intra-layer pipeline and an inter-layer pipeline, fully overlapping communication with computation. G3 is the first distributed system to propose a well-designed pipeline for full-graph synchronous GNN training.
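The bin-packing idea can be sketched as follows: vertices whose remote dependencies have already arrived are packed into a bin that can start the next layer immediately, while the rest keep waiting and overlap with communication. This is an illustrative simplification of G3's scheduler, with hypothetical names.

```python
# Illustrative sketch (not G3's real scheduler) of packing vertices into bins
# so that bins whose remote dependencies have arrived can start the next
# layer while other bins are still waiting on communication.
def pack_bins(vertices, remote_deps, arrived):
    """remote_deps: vertex -> set of remote neighbors it needs;
    arrived: set of remote vertices whose embeddings have been received."""
    ready_bin, waiting_bin = [], []
    for v in vertices:
        if remote_deps.get(v, set()) <= arrived:
            ready_bin.append(v)      # can enter the next GNN layer now
        else:
            waiting_bin.append(v)    # keeps overlapping with communication
    return ready_bin, waiting_bin

# Toy usage: vertex 0 has no remote deps, vertex 1 waits for remote vertex 9.
print(pack_bins([0, 1], {1: {9}}, arrived=set()))   # ([0], [1])
print(pack_bins([0, 1], {1: {9}}, arrived={9}))     # ([0, 1], [])
```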
AdaQP (Wan et al., 2023b) aims to reduce the communication overhead among workers in full-graph distributed GNN training. It proposes an approach based on stochastic quantization of messages to reduce communication traffic and enhance efficiency: messages are quantized into lower-precision integers before transmission across devices. An adaptive bit-width assignment scheme optimizes the quantization bit-widths to balance the data volume distribution between devices, leading to improved training efficiency. Theoretical analysis establishes the convergence rate with respect to the total number of training epochs. Furthermore, AdaQP proposes a parallelization mechanism on each worker: the computation of vertices without remote neighbors (central nodes) is executed in advance, while the remote neighbors of the other vertices (marginal nodes) are being transferred, so that the computation of central nodes overlaps with the communication for marginal nodes. This overlapping insight is also brought up by G3 (Wan et al., 2023a). Experiments show that message quantization improves the throughput of GNN training while incurring only a negligible accuracy drop.
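A minimal sketch of stochastic (unbiased) message quantization is shown below; the fixed 4-bit width is an illustrative choice, whereas AdaQP assigns bit-widths adaptively.

```python
# Minimal sketch of stochastic message quantization as used in quantized-
# communication schemes such as AdaQP; bit-width choice and names are
# illustrative, not the system's actual implementation.
import numpy as np

def quantize(x, bits):
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    y = (x - lo) / scale
    # Stochastic rounding keeps the quantizer unbiased in expectation.
    q = np.floor(y + np.random.rand(*x.shape)).astype(np.uint32)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

x = np.random.randn(4, 8).astype(np.float32)
q, lo, scale = quantize(x, bits=4)
print(np.abs(dequantize(q, lo, scale) - x).mean())  # small reconstruction error
```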
A.3. Systems on CPU clusters
Compared to GPU clusters, CPU clusters are easier to scale and more economical. Existing big-data processing infrastructure is usually built on CPU clusters, and many industrial-level distributed GNN training systems run on them.
AliGraph (Zhu et al., 2019) is the first industrial-level distributed GNN platform, developed by Alibaba Group. The platform consists of the application, algorithm, storage, sampling, and operator layers, where the latter three layers form the GNN system. The storage layer stores and organizes the raw data to support fast data access: it applies various graph partition algorithms to adapt to different scenarios, stores the attributes separately to save space, and uses a cache of important vertices to reduce communication. The sampling layer abstracts three kinds of sampling methods, and the operator layer abstracts the GNN computation with two kinds of operators. Users can define their own models by choosing sampling methods and implementing their own operators. Therefore, AliGraph supports not only in-house-developed GNNs and classical graph embedding models, but also the quick implementation of the latest SOTA GNN models.
AGL (Zhang et al., 2020a) is a distributed GNN system with fully functional training and inference, designed for industrial-purpose graph machine learning. AGL follows the message passing scheme and mainly supports sampling-based GNN training. The system consists of three major modules: GraphFlat, GraphTrainer, and GraphInfer. GraphFlat is a distributed multi-hop neighborhood batch generator implemented with MapReduce. It merges in-degree neighbors and propagates the merged information to out-degree neighbors via message passing; besides, it follows a workflow of sampling and indexing to eliminate the adverse effects caused by graph skewness. GraphTrainer is the distributed training framework following the parameter-server architecture. It leverages pipelining, pruning, and edge partitioning to eliminate data I/O overhead and optimize floating-point calculations. GraphInfer is a distributed inference module that splits a multi-layer GNN model into per-layer slices and applies message passing once per slice based on MapReduce. This scheme eliminates redundant computation and dramatically reduces inference time. In summary, AGL achieves a nearly linear speedup in training with 100 workers and is able to finish training a 2-layer GAT on a graph with billions of vertices within an acceptable number of hours.
DistDGL (Zheng et al., 2020c) is a distributed version of DGL designed by Amazon, which performs efficient and scalable mini-batch GNN training on CPU clusters. The main components of DistDGL are the distributed samplers, a distributed key-value (KV) store, the trainers, and a dense model update component. The samplers are in charge of mini-batch generation and expose a set of flexible APIs that help users define various sampling algorithms. The KV store holds the graph and features in a distributed manner and supports separate partition methods for vertex data and edge data. The trainers compute gradients of the parameters over a mini-batch. While the dense model parameters are updated with synchronous SGD, the sparse vertex embeddings are updated with asynchronous SGD. As one of the earliest open-source systems and frameworks for distributed GNN training, DistDGL has made a great impact on the fast implementation of ideas for new training algorithms and distributed systems.
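The hybrid update scheme can be sketched as follows, with communication mocked in plain Python: dense parameters are averaged across trainers (standing in for a synchronous all-reduce), while sparse embedding rows are updated independently without a barrier. The names below are illustrative, not DistDGL's API.

```python
# Hedged sketch of hybrid dense/sparse updates: dense model parameters are
# averaged across trainers (synchronous SGD), while sparse vertex-embedding
# rows are updated as each trainer touches them (asynchronous).
import numpy as np

def sync_dense_update(param, grads_from_all_trainers, lr=0.1):
    avg = np.mean(grads_from_all_trainers, axis=0)   # stands in for all-reduce
    return param - lr * avg

def async_sparse_update(embedding_table, row_grads, lr=0.1):
    # Each trainer pushes only the rows it used; no barrier with other trainers.
    for row, g in row_grads.items():
        embedding_table[row] -= lr * g
    return embedding_table

w = np.zeros(3)
print(sync_dense_update(w, [np.ones(3), 3 * np.ones(3)]))   # -> [-0.2, -0.2, -0.2]
emb = {5: np.ones(2)}
print(async_sparse_update(emb, {5: np.array([1.0, -1.0])})) # row 5 -> [0.9, 1.1]
```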
CM-GCN (Zhao et al., 2021c) proposes a new training model based on the characteristics of distributed graph partitioning and mini-batch selection. Its basic idea is to form cohesive mini-batches, in which training vertices are tightly connected and share common neighbors, in order to reduce the communication cost of retrieving neighbor vertices from remote workers. A well-designed cost model is proposed to partition the workload, which we classify as an operator-based model. In addition, CM-GCN enables vertex-level asynchronous computation by decomposing the computation of vertices and processing them based on their data availability. Similar to BNS-GCN, CM-GCN introduces a new GCN training paradigm that attempts to reduce communication by relying less on remote data. Although this idea may introduce training bias compared with the original GCN, experimental results show that both BNS-GCN and CM-GCN maintain a similar accuracy to GCN.
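A hedged sketch of forming a cohesive mini-batch is given below; the greedy shared-neighbor heuristic is an illustration of the idea, not CM-GCN's actual batch-generation algorithm.

```python
# Illustrative sketch of forming a "cohesive" mini-batch: greedily add the
# training vertex that shares the most neighbors with the current batch.
def cohesive_batch(train_vertices, adj, batch_size):
    remaining = set(train_vertices)
    batch = [remaining.pop()]
    covered = set(adj[batch[0]])
    while remaining and len(batch) < batch_size:
        v = max(remaining, key=lambda u: len(covered & set(adj[u])))
        batch.append(v)
        covered |= set(adj[v])
        remaining.remove(v)
    return batch

# Toy usage: vertices 0, 1, 3 share neighbors, vertex 2 does not.
adj = {0: [10, 11], 1: [10, 11, 12], 2: [99], 3: [11, 12]}
print(cohesive_batch([0, 1, 2, 3], adj, batch_size=3))  # e.g. [0, 1, 3]
```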
DistGNN (Md et al., 2021) also focuses on full-graph GNN training and adds optimizations on top of DGL to support distributed training. For distributed-memory GNN training, DistGNN adopts Libra (Xie et al., 2014) for fast partitioning with balanced workloads. Based on this vertex-cut partitioning, DistGNN designs three aggregation strategies: biased aggregation (which ignores remote neighborhoods), synchronous aggregation, and asynchronous aggregation with bounded staleness. The asynchronous aggregation adopts the Type I asynchronous execution model and effectively reduces communication costs. DistGNN is implemented on top of DGL with an optimized C++ backend using the LIBXSMM library. With its single-socket SpMM and distributed Libra partitioning merged into the DGL source code, DistGNN has made a great contribution to open-source GNN training.
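The bounded-staleness aggregation can be sketched as follows; the epoch bookkeeping and names are illustrative rather than DistGNN's concrete implementation.

```python
# Hedged sketch of bounded-staleness aggregation: remote partial aggregates
# from earlier epochs are reused as long as they are within the bound.
import numpy as np

def aggregate_with_staleness(local_partial, remote_partials, current_epoch,
                             staleness_bound=1):
    """remote_partials: list of (partial_sum, epoch_received). Remote partial
    aggregates are reused as long as they are not older than the bound;
    otherwise the caller must block and refresh them."""
    total = local_partial.copy()
    for partial, epoch in remote_partials:
        if current_epoch - epoch > staleness_bound:
            raise RuntimeError("stale beyond bound: refresh before proceeding")
        total += partial
    return total

local = np.ones(4)
remote = [(2 * np.ones(4), 3)]          # partial aggregate received at epoch 3
print(aggregate_with_staleness(local, remote, current_epoch=4))  # within bound
```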
FlexGraph (Wang et al., 2021e) is a distributed full-graph GNN training framework with a new programming abstraction, NAU (NeighborSelection, Aggregation, and Update). NAU tackles the limitation that GAS-like GNN programming abstractions are unable to express GNN models that use indirect neighbors during the aggregation phase. To support the new abstraction, FlexGraph employs hierarchical dependency graphs (HDGs) to compactly store the “neighbors” under different definitions and the hierarchical aggregation strategies of GNN models. In addition, FlexGraph applies a hybrid aggregation strategy to distinguish aggregation operations in different contexts and designs suitable methods according to their characteristics. The system is implemented on top of PyTorch and libgrape-lite, a library for parallel graph processing. Overall, based on the novel NAU abstraction, FlexGraph adopts various optimizations from different aspects, gaining better expressiveness and scalability than DGL, PyG, NeuGraph, and Euler.
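A minimal sketch of an NAU-style layer with three user-defined hooks is shown below; the class and method names are hypothetical and do not correspond to FlexGraph's real API.

```python
# Minimal sketch of an NAU-style (NeighborSelection, Aggregation, Update)
# abstraction; neighbor_selection may return indirect neighbors rather than
# only 1-hop ones, which GAS-like abstractions cannot express.
import numpy as np

class NAULayer:
    def neighbor_selection(self, v, graph):
        # Default: direct neighbors; a user could return indirect neighbors.
        return graph[v]

    def aggregation(self, messages):
        return np.mean(messages, axis=0) if messages else None

    def update(self, h_v, agg):
        return h_v if agg is None else 0.5 * (h_v + agg)

    def forward(self, graph, h):
        return {v: self.update(h[v],
                               self.aggregation([h[u] for u in
                                                 self.neighbor_selection(v, graph)]))
                for v in graph}

graph = {0: [1], 1: [0], 2: []}
h = {0: np.zeros(2), 1: np.ones(2), 2: np.ones(2)}
print(NAULayer().forward(graph, h))
```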
ByteGNN (Zheng et al., 2022a) is a distributed GNN training system designed by ByteDance that focuses on improving resource utilization during mini-batch training. Noticing that GPUs bring limited benefit to the sampling process, ByteGNN focuses on fully utilizing the capability of CPU clusters. An abstraction of the sampling phase for both supervised and unsupervised GNN models is proposed; with this abstraction, the workflow of the sampling phase can be arranged as a DAG corresponding to the operator-parallel execution model. A two-level scheduling strategy is designed to schedule different DAGs and their inner operators. ByteGNN also attempts to reduce the large network overhead by addressing the workload imbalance problem and designing a new graph partitioning method. Implemented on top of GraphLearn (i.e., AliGraph), ByteGNN significantly improves CPU utilization and achieves up to a 3.5× speedup compared with DistDGL.
ParallelGCN (Demirci et al., 2023) focuses on accelerating the distributed SpMM execution of graph convolutional networks. The authors demonstrate that the broadcast communication protocol employed by CAGNET (Huang et al., 2020) is inefficient, since redundant embeddings of some vertices are transferred. To improve communication efficiency, ParallelGCN adopts a P2P-based communication protocol and only sends the vertex embeddings that are necessary between workers. It proposes a non-blocking P2P communication operation that implements the chunk-based execution model by overlapping the computation on received chunks with the communication of other chunks. ParallelGCN also examines existing graph partitioning strategies and points out that the hypergraph partitioning model is more expressive than existing graph partitioning models for distributed GNN training. Thanks to the hypergraph partitioning model, ParallelGCN is able to model the communication cost between each worker pair more accurately and achieves substantial communication reduction. Based on 1D-partitioned distributed GNN execution, ParallelGCN introduces optimizations in both the communication protocol and the data partition strategy.
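The chunk-based overlap of communication and computation can be sketched as follows; a background thread stands in for the non-blocking P2P receives used by the real system, and all names are illustrative.

```python
# Illustrative sketch of chunk-based overlap: a worker starts computing on a
# chunk as soon as it arrives, while later chunks are still in flight.
import threading, queue, time
import numpy as np

def receiver(q, n_chunks):
    # Stand-in for asynchronous P2P receives of embedding chunks.
    for i in range(n_chunks):
        time.sleep(0.01)                  # pretend per-chunk network latency
        q.put((i, np.full((2, 2), i, dtype=float)))

def train_step(n_chunks=4):
    q = queue.Queue()
    threading.Thread(target=receiver, args=(q, n_chunks), daemon=True).start()
    out = 0.0
    for _ in range(n_chunks):
        i, chunk = q.get()                # chunk i has arrived
        out += chunk.sum()                # compute on it while the remaining
    return out                            # chunks are still being received

print(train_step())
```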
A.4. Miscellaneous
Dorylus (Thorpe et al., 2021) studies the possibility of leveraging a more affordable computing resource, i.e., serverless threads, for GNN training. GNN computing is decomposed into different operations, including graph operations and neural operations. CPU clusters are used for graph operations such as feature extraction and feature aggregation, while serverless threads are used for the dense matrix multiplications in neural operations. Several optimizations for serverless thread management are specifically designed in Dorylus. Pipelining the GNN computation is another major contribution of Dorylus. The SAGA-NN model is adopted, where different operators (e.g., Scatter, ApplyVertex) are executed in parallel, forming the execution pipeline. The training vertices are split into vertex intervals, and Dorylus continuously feeds the pipeline with these intervals, so that different vertex intervals are processed by different operators at the same moment. Such a mini-batch pipeline introduces asynchrony, so an asynchronous communication protocol with bounded staleness is designed. It is shown that serverless threads achieve efficiency close to GPUs yet are much cheaper, and that the asynchronous execution pipeline still reaches higher accuracy than sampling-based GNN training methods.
SUGAR (Xue et al., 2022) is proposed to support resource-efficient GNN training. It uses multiple devices with limited resources (e.g., mobile and edge devices) to train GNN models and provides rigorous proofs of complexity, error bound, and convergence for distributed GNN training and graph partitioning. After generating subgraphs via graph weighting and graph expansion, each device in SUGAR trains a local GNN model at the subgraph level. SUGAR maintains local models instead of a global model by keeping model updates within each device, so that the parallel training saves computation, memory, and communication costs. In summary, SUGAR aims at enlarging the scalability of GNN training and achieves considerable runtime and memory savings compared to SOTA GNN models on large-scale graphs.