
Federated Graph Learning with Adaptive Importance-based Sampling

Anran Li1, Yuanyuan Chen2, Chao Ren2, Wenhan Wang3, Ming Hu4, Tianlin Li2,
Han Yu2, Qingyu Chen1
Abstract

For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the explosively increasing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ignore graph structural information or the dynamics of optimization, resulting in high variance and inaccurate node embeddings. To address this limitation, we propose the Federated Adaptive Importance-based Sampling (FedAIS) approach. It achieves substantial computational cost savings by focusing the limited resources on training important nodes, while reducing communication overhead via adaptive historical embedding synchronization. The proposed adaptive importance-based sampling method jointly considers the graph structural heterogeneity and the optimization dynamics to achieve an optimal trade-off between efficiency and accuracy. Extensive evaluations against five state-of-the-art baselines on five real-world graph datasets show that FedAIS achieves comparable or up to 3.23% higher test accuracy, while reducing communication and computation costs by 91.77% and 85.59%, respectively.

Introduction

Graph convolutional networks (GCNs) (Kipf and Welling 2016; Fey et al. 2021; Chen et al. 2018) have achieved impressive performance in a wide range of learning tasks on graph data. However, due to privacy concerns, heterogeneous subgraphs are stored separately by different data owners, so constructing globally applicable GCNs requires collaborative learning. Federated graph learning (FedGL), which trains GCNs collaboratively while preserving data privacy, has attracted increasing attention (He et al. 2021; Chen et al. 2021; Liu et al. 2022). Based on the distribution of graph data, FedGL can be divided into inter-graph FedGL (He et al. 2021) and intra-graph FedGL (Chen et al. 2021). Intra-graph FedGL is common in practice, e.g., in an online social application where each user has a local social network containing interests and user interactions, and all networks together form the latent complete human social network.

However, training intra-graph FedGL on large-scale heterogeneous graphs remains a challenge. Firstly, the exponentially increasing dependency on neighbor nodes over layers (i.e., neighbor explosion) causes the computation graph to become extremely large, which incurs prohibitively high computation costs. Secondly, intra-graph FedGL requires transferring intermediate embeddings across clients, since a large number of edges connect nodes that are stored by different clients. Because calculating the embedding of a node requires embeddings from its recursive neighbors several hops away, which could be stored by other clients, fetching neighbor embeddings across clients incurs high communication costs. Ignoring the information from neighbors across clients and treating subgraphs at various clients as independent can degrade model performance (Chen et al. 2021). Thirdly, client data in intra-graph FedGL exhibit statistical and structural heterogeneity, so different subsets of training data lead to different model accuracy and latency.

Existing methods for efficient FedGCN training fall into three categories. The first category uses missing neighbor generation (Zhang et al. 2021b, a) to acquire accurate node embeddings. However, the additional training of the generative models incurs high computation and communication costs. The second category focuses on graph sampling (Zhang et al. 2021b), e.g., FedGraph (Chen et al. 2021) uses deep reinforcement learning (DRL) to select neighbor nodes for embedding aggregation. However, it ignores graph structural heterogeneity information, resulting in inaccurate node embeddings and large variance in the gradients. The third category periodically transfers cross-client neighbor embeddings (Chen et al. 2021; Zhang et al. 2022; Deng et al. 2023) to reduce communication costs. However, it fails to capture the dynamics of model training to determine the optimal transfer period, resulting in inferior model performance. As for graph sampling in centralized learning, there are three types: 1) node-wise sampling methods iteratively sample a number of neighbor nodes for each node (Hamilton et al. 2017; Dai et al. 2018); 2) layer-wise sampling methods select a number of nodes for each GCN layer (Chen et al. 2018; Zou et al. 2019); and 3) subgraph sampling methods sample a number of subgraphs from each training batch (Chiang et al. 2019; Zeng et al. 2019). However, since these approaches are not designed for FL, they require access to potentially sensitive raw data, which risks the privacy of local clients. Besides, such methods neglect communication costs and would incur a substantial communication burden when the volume of graph data is large.

To address these limitations, we propose a novel federated graph sampling scheme, the Federated Adaptive Importance-based Sampling (FedAIS) approach, for node classification tasks on large-scale graph data. It reduces the graph sampling variance and achieves substantial cost savings by efficiently leveraging historical embedding estimators and focusing the limited communication and computation resources on training important local samples. Through an adaptive embedding synchronization scheme, it is capable of achieving the optimal trade-off between test accuracy and computation and communication cost savings. The key advantages of FedAIS are summarized as follows.

  • Scalability: FedAIS is able to scale FedGCN to large graphs with a constant memory cost with respect to input node sizes. For a selected set of batches, FedAIS prunes the GCN computation graph so that only nodes inside the current batches and their direct 1-hop cross-client neighbors are retained, regardless of the depth of the GCN. Historical embeddings are used to accurately fill in the inter-dependency information of cross-client neighbors.

  • Efficiency: FedAIS achieves highly efficient FedGL and reduces unnecessary sample training via dynamic importance-based sampling that considers both structural information and optimization dynamics. It reduces cross-client neighbor embedding communication through adaptive embedding synchronization, selecting the optimal transmission interval that achieves fast decay of the objective function.

  • Convergence: FedAIS ensures that the global model converges in an efficient manner. Theoretical analyses show that the approximation variance induced by importance-based node sampling and the staleness of historical embedding is upper bounded.

We evaluate FedAIS on five graph datasets of different scales under real-world workloads. Compared to five state-of-the-art approaches, FedAIS achieves significant cost savings when training FedGCN models with thousands of FL participants. On average, it achieves comparable or up to 3.23% higher test accuracy, while incurring 91.77% lower communication and 85.59% lower computation costs, respectively. In this way, FedAIS achieves significantly more advantageous trade-offs between efficiency and accuracy compared to existing approaches.

Related Work

FedGCN Training

Existing work on efficient intra-graph FedGCN training on large graphs can be divided into three branches. The first branch uses missing neighbor generation to obtain accurate node embeddings (Zhang et al. 2021b, a). However, it only focuses on improving prediction accuracy without considering the computation and communication overhead caused by the additional training of the generative model. The second branch focuses on graph sampling approaches (Zhang et al. 2021b), e.g., FedGraph (Chen et al. 2021) uses deep reinforcement learning (DRL) to select neighbor nodes for embedding aggregation. However, since it ignores the graph topology and the heterogeneity of clients, it results in inaccurate node embeddings and large variance in the gradients. Besides, it incurs large computation overhead as each local client needs to separately train two additional DRL networks. The third branch periodically transfers cross-client neighbor node embeddings to reduce communication costs (Chen et al. 2021; Du and Wu 2022; Zhang et al. 2022; Deng et al. 2023). However, these methods fail to capture the dynamics of model training to determine the optimal transfer period, resulting in inferior model performance.

Sampling-based GCN Training

One approach to scaling up GCN training is graph sampling, which can be categorized into: 1) node-wise sampling, 2) layer-wise sampling, and 3) subgraph sampling. Node-wise sampling methods (Hamilton et al. 2017; Cong et al. 2020) iteratively sample a number of neighbors for each node based on specific probabilities (e.g., calculated based on node centrality). Layer-wise sampling methods (Chen et al. 2018; Zou et al. 2019) independently sample a number of nodes for each GCN layer. Since multiple nodes are jointly sampled in each layer, the time cost of the sampling process is reduced by avoiding the exponential expansion of neighbors. However, since nodes in different layers are sampled independently, some sampled nodes may have no connections with those in the previous layer, which deteriorates training performance. Subgraph sampling methods (Chiang et al. 2019; Zeng et al. 2019) sample a number of subgraphs for each batch in GCN training. However, partitioning large graphs is time-consuming, and the model performance is highly sensitive to the cluster size. These three categories of methods, however, require direct access to data features, which would risk privacy leakage of local clients in FL settings. Besides, these methods neglect communication costs and would incur substantial communication costs when the volume of graph data is large. Thus, we propose FedAIS to improve the trade-offs between accuracy and efficiency (Fig. 1). Here, FedLocal is federated GraphSage (Hamilton et al. 2017), which conducts random selection of within-client neighbor nodes and ignores cross-client neighbor information. FedPNS performs periodic selection of both within-client and cross-client neighbor nodes.

Problem Formulation

There are two types of entities involved: a server $S$ and $K$ clients $\{1,2,\cdots,K\}$. Each client $k$ owns a graph dataset $D_{k}=(G_{k},Y_{k})$, where $G_{k}=(V_{k},E_{k})$ is an undirected graph with $n_{k}=|V_{k}|$ vertices and $|E_{k}|$ edges. $N=\sum_{k=1}^{K}n_{k}$ is the total number of all clients' local data samples. We focus on the task of node classification, where each vertex $v\in V_{k}$ is associated with a feature vector $x_{v}\in X_{k}$ and a label $y_{v}\in Y_{k}$. Given an $L$-layer FedGCN, let $f(h_{v}^{(L)},\theta,y_{v})$ and $F_{k}(h^{(L)},\theta)$ denote the loss functions of an individual sample $x_{v}$ and of all samples on client $k$'s local model, respectively. $F(h^{(L)},\theta)$ denotes the loss function of the global model. We formulate FedGCN learning as a distributed optimization problem:

\begin{split}\theta^{*}=&\arg\min\Big\{F(h^{(L)},\theta)=\sum_{k=1}^{K}\frac{n_{k}}{N}F_{k}(h^{(L)},\theta)\Big\},\\ &\text{where }F_{k}(h^{(L)},\theta)=\frac{1}{n_{k}}\sum_{v\in V_{k}}f(h_{v}^{(L)},\theta,y_{v}),\end{split} \quad (1)

where the $(l+1)$-th graph convolution layer embedding $h_{v}^{(l+1)}$ of node $v\in V_{k}$ is defined as:

\begin{split}h_{v}^{(l+1)}=\sigma^{(l+1)}\Big(h_{v}^{(l)},\underbrace{\{h_{w}^{(l)}\}_{w\in N(v)\cap V_{k}}}_{\text{Within-client nodes}}\cup\underbrace{\{h_{w}^{(l)}\}_{w\in N(v)\setminus V_{k}}}_{\text{Cross-client nodes}}\Big).\end{split} \quad (2)

Here, $\sigma(\cdot)$ is the activation function, $N(v)$ denotes the set of neighbor nodes of $v$, and $h_{v}^{(1)}=x_{v}$. Suppose the average degree in a local graph $G_{k}$ is $d_{k}$. To calculate $h_{v}^{(l)}$ of node $v\in V_{k}$ in an $L$-layer FedGCN, the number of neighbors involved is on average $d_{k}^{L}$ (Chen et al. 2017), which results in an exponential increase in computation and communication overhead with respect to $L$. Thus, local clients cannot afford to calculate all embedding terms $h_{v}^{(l)}$, which need to be computed and transmitted recursively. Aggregating only within-client neighbor embeddings while ignoring cross-client information leads to inferior model performance (Chen et al. 2021).
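To make the neighbor-explosion cost concrete, the following toy Python snippet (an illustration under an assumed random-graph model, not part of FedAIS) counts how many nodes are pulled into the $L$-hop computation graph of a single node; the count grows roughly as $d_{k}^{L}$ until the graph saturates.

```python
# Toy illustration of neighbor explosion: count the distinct nodes required to
# compute an L-layer embedding of one node. Graph parameters are assumptions.
import random

def computation_graph_size(adj, root, num_layers):
    """Number of distinct nodes in the L-hop computation graph of `root`."""
    frontier, seen = {root}, {root}
    for _ in range(num_layers):
        frontier = {w for v in frontier for w in adj[v]} - seen
        seen |= frontier
    return len(seen)

random.seed(0)
n, d = 10_000, 10                                    # n nodes, average degree ~d
adj = {v: random.sample(range(n), d) for v in range(n)}
for L in (1, 2, 3):
    print(L, computation_graph_size(adj, root=0, num_layers=L))
# Typically prints counts on the order of d, d^2, d^3.
```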

Our FedAIS Approach

Joint Analysis of Variance and Overhead

To construct a global FedGCN model with fast convergence and low prediction error, we first need to analyze the sources of variance and bias when applying graph sampling strategies. Existing graph sampling approaches suffer from high variance and bias introduced into the stochastic gradients due to the approximation of node embeddings at different layers (Cong et al. 2020; Du and Wu 2022). We denote $\tilde{h}_{v}^{(l)}$ as the embedding approximation of node $v\in V_{k}$ in the $l$-th layer. Specifically, the variance of the stochastic gradient estimator $\tilde{g}=\sum_{k=1}^{K}\frac{1}{|B_{k}|}\sum_{v\in B_{k}}\nabla f(\tilde{h}_{v}^{(L)},\theta_{t},y_{v})$ can be decomposed as:

\mathbb{E}[||\tilde{g}-\nabla F(h^{(L)},\theta)||]=\mathbb{E}[||\tilde{g}-g||]+\mathbb{E}[||g-\nabla F(h^{(L)},\theta)||]. \quad (3)

$\mathbb{E}[||\tilde{g}-g||]$ denotes the variance of the estimated gradients with respect to their exact values $g=\sum_{k=1}^{K}\frac{1}{n_{k}}\sum_{v\in V_{k}}\nabla f(h_{v}^{(L)},\theta_{t},y_{v})$, resulting from the inner-layer embedding approximation in forward propagation. The term $\mathbb{E}[||g-\nabla F(h^{(L)},\theta)||]$ denotes the variance of the mini-batch gradients with respect to the full gradients due to mini-batch sampling.

Assumption 1.

Let $F(h^{(L)},\theta)$ be differentiable and $\lambda$-Lipschitz smooth, and let the value of $F(h^{(L)},\theta)$ be bounded below by a scalar $F_{inf}$.

Figure 1: Test accuracy vs. communication costs. (a) Coauthor; (b) Reddit.
Theorem 1.

Under Assumption 1, for all $v\in V_{k}$ and all $l\in\{1,2,\cdots,L-1\}$, the final output error of layer $L$ in training round $t\in[T]$ is bounded by:

||\tilde{h}_{v}^{(L)}-h_{v}^{(L)}||\leq\sum_{l=1}^{L-1}\alpha_{1}^{L-l}\alpha_{2}^{L-l}|N(v)|^{L-l}. \quad (4)

Theorem 1 lets us immediately derive an upper error bound for the estimated gradients, i.e.,

\mathbb{E}[||\tilde{g}-g||]\leq\lambda||\tilde{h}_{v}^{(L)}-h_{v}^{(L)}||. \quad (5)

From the decomposition of variance in Eq. (3) and the upper error bound for the estimated gradients in Eq. (5), we conclude that any graph sampling method introduces two sources of variance (i.e., the embedding approximation variance and the stochastic gradient variance). Therefore, to accelerate model convergence and reduce computation and communication overhead, both kinds of variance need to be accounted for when designing a graph sampling strategy.

System Overview

FedAIS consists of two modules (as shown in Fig. 2).

  1. Historical Embedding-based Graph Sampling. In training round $t$, client $k$ updates the importance scores of its local training samples based on the historical embeddings and training losses of individual samples. Then, client $k$ selects its most influential samples to be used for FedGCN model training. Using historical embeddings and training losses helps reduce both the embedding approximation variance and the stochastic gradient variance, thereby enabling accurate FedGCN model training.

  2. Adaptive Embedding Synchronization and Model Updating. With the influence estimation and sampling results, each client $k$ updates its historical embeddings and performs node aggregation. Then, it updates its local model and estimates the next optimal synchronization interval via a joint analysis of overhead and error-convergence. Finally, client $k$ sends the updated local model to the server, which aggregates the local models to produce the global model.

Figure 2: System overview of FedAIS. Legend: cross-client neighbor embeddings, local model, global model $\theta_{t+1}$.

Historical Embedding-based Graph Sampling

When evaluating the embedding $h_{v}^{(l)}$, it is prohibitively costly to calculate all $h_{v}^{(l)}$ terms, since they need to be computed and transmitted recursively (i.e., we again need the embeddings $h_{w}^{(l-1)}$ of all of $v$'s neighbor nodes $w$). To reduce computation and communication costs, we introduce the historical embedding for FedGCN as an affordable approximation:

\begin{split}\tilde{h}_{v}^{(l+1)}&=\sigma^{(l+1)}\Big(h_{v}^{(l)},\{h_{w}^{(l)}\}_{w\in N(v)\cap B}\cup\{\bar{h}_{w}^{(l)}\}_{w\in N(v)\setminus B}\Big),\\ \{\bar{h}_{w}^{(l)}\}_{w\in N(v)\setminus B}&=\underbrace{\{\bar{h}_{w}^{(l)}\}_{w\in(N(v)\setminus B)\cap V_{k}}\cup\{\bar{h}_{w}^{(l)}\}_{w\in N(v)\setminus V_{k}}}_{\text{Historical embeddings}}.\end{split} \quad (6)

Here, we separate the within-client neighbors into two parts: 1) within-client in-batch nodes $w\in N(v)\cap B$, which are part of the current batch $B_{k}\subseteq V_{k}$; and 2) within-client out-of-batch nodes $w\in(N(v)\setminus B_{k})\cap V_{k}$, which belong to client $k$ but are not included in the current batch $B_{k}$. For the within-client out-of-batch nodes and the cross-client nodes, we approximate their embeddings $h_{w}^{(l)}$ via historical embeddings $\bar{h}_{w}^{(l)}$ acquired in previous iterations. Compared to the previous approach, which incurs exponential computation and communication costs, the cost incurred by the historical embedding estimator increases linearly with $L$, i.e., $O(\sum_{k=1}^{K}V_{k}|\cup_{v\in V_{k}}N(v)\cup\{v\}|\cdot LTJ)$ computation operations and $O(\sum_{k=1}^{K}\sum_{v\in V_{k}}\sum_{w\in N_{v}\setminus V_{k}}\sum_{l=1}^{L}d^{l}\cdot TJ)$ communication cost, where $T$ and $J$ are the numbers of global training rounds and local training epochs, respectively.
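As a concrete illustration, the sketch below (with assumed tensor shapes, a mean aggregator, and a simple table-based cache that are not specified in the paper) maintains historical embeddings and substitutes them for out-of-batch and cross-client neighbors when approximating Eq. (6).

```python
# Hedged sketch of a historical-embedding estimator: cached embeddings replace
# the recursive recomputation/fetching of out-of-batch and cross-client neighbors.
import torch

class HistoricalEmbeddingStore:
    """Cache of previously computed layer embeddings, one row per node."""
    def __init__(self, num_nodes, dim):
        self.table = torch.zeros(num_nodes, dim)

    def push(self, node_ids, embeddings):
        self.table[node_ids] = embeddings.detach()   # refresh after a forward pass

    def pull(self, node_ids):
        return self.table[node_ids]                  # stale but cheap approximation

def approx_embedding(h_v, in_batch_neighbors, cached_neighbor_ids, store, W):
    """Eq. (6) for one node v: exact embeddings for in-batch neighbors,
    historical embeddings for out-of-batch and cross-client neighbors."""
    neigh = torch.cat([in_batch_neighbors, store.pull(cached_neighbor_ids)], dim=0)
    return torch.relu(torch.cat([h_v, neigh.mean(dim=0)]) @ W)

# toy usage: 10 nodes with 8-dimensional embeddings
store = HistoricalEmbeddingStore(num_nodes=10, dim=8)
store.push(torch.tensor([3, 7]), torch.randn(2, 8))  # cached in an earlier round
h_tilde = approx_embedding(torch.randn(8), torch.randn(2, 8),
                           torch.tensor([3, 7]), store, torch.randn(16, 4))
print(h_tilde.shape)                                 # torch.Size([4])
```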

Based on the historical embedding estimator and the variance analysis, we select the training nodes that contribute most to the objective function and accelerate model convergence. This can be cast as the following optimization problem:

\min_{P}\frac{1}{K}\sum_{k\in[K]}\frac{1}{n_{k}}\sum_{v\in V_{k}}\frac{||\nabla f(\tilde{h}_{v}^{(L)},\theta_{t},y_{v})||^{2}}{p_{v}}. \quad (7)

The most straightforward solution is to use the $L_{2}$ norm of the gradient as the probability. However, this requires the calculation of $n_{k}$ derivatives for each client $k$ at each local epoch $j\in[\tau_{t}]$, which is computationally prohibitive (Li et al. 2021). To address this issue, we instead use the difference $\Delta_{j}=f(\tilde{h}_{v}^{(L)},\theta_{j+1},y_{v})-f(\tilde{h}_{v}^{(L)},\theta_{j},y_{v})$ between the training losses of two consecutive local model updates to approximate the gradient, $\Delta_{j}\approx\nabla f(\tilde{h}_{v}^{(L)},\theta_{j},y_{v})$. Then, client $k$ calculates the probability $p_{v}$ by normalizing the differences across all of its training samples:

p_{v}=\frac{||\Delta_{j}||}{\sum_{v\in V_{k}}||\Delta_{j}||},\quad v\in V_{k},\ j\in[\tau_{t}]. \quad (8)

Thus, the computational complexity is $O(n_{k})$, since client $k$ only requires one forward propagation.
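A minimal sketch of this rule is shown below, assuming per-sample loss vectors from two consecutive local updates are available; the function and variable names are illustrative.

```python
# Hedged sketch of Eq. (8): selection probabilities proportional to the change in
# per-sample loss between two consecutive local updates (forward passes only).
import torch

def selection_probs(loss_prev, loss_curr, eps=1e-12):
    """loss_prev / loss_curr: (n_k,) per-sample losses from consecutive updates."""
    delta = (loss_curr - loss_prev).abs()            # ||Delta_j|| per sample
    return delta / (delta.sum() + eps)               # normalization of Eq. (8)

def sample_batch(probs, batch_size):
    """Draw node indices with probability proportional to their importance."""
    return torch.multinomial(probs, batch_size, replacement=False)

# toy usage: 1,000 local samples, 10% sampling budget
loss_prev, loss_curr = torch.rand(1000), torch.rand(1000)
batch = sample_batch(selection_probs(loss_prev, loss_curr), batch_size=100)
print(batch.shape)                                   # torch.Size([100])
```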

Adaptive Embedding Synchronization

To analyze the effect of $\tau$ on the expected runtime, we consider the following delay model. In round $t$, the time taken by client $k$ to conduct a local model update at the $j$-th epoch is modeled as a random variable $c_{k,j}^{t}$, and the total communication delay is a random variable $o_{\tau}^{t}$. Then, the communication cost is $o_{k,\tau}^{t}b_{t}$, where $b_{t}$ is the average network bandwidth during round $t$. For full synchronization, the total time to complete each round is $c_{\rm syn}=\max\{c_{1,1}^{t},\cdots,c_{K,1}^{t}\}+o_{\tau}^{t}$, while for periodic synchronization, the average time to complete each round is $c_{\rm avg}=\max\{\bar{c}_{1}^{t},\cdots,\bar{c}_{K}^{t}\}+o_{\tau}^{t}/\tau_{t}$, where $\bar{c}_{k}^{t}=\frac{1}{\tau_{t}}\sum_{j=1}^{\tau_{t}}c_{k,j}^{t}$. Consider the simplest case where $c_{k,\tau}^{t}=c$ and $o_{\tau}^{t}=o$ are constants; then $c/o$ is the ratio of computation cost to communication delay, which depends on the size of the FedGCN model, the network bandwidth, client computing capacity, etc.
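The following toy calculation, with assumed constants $c$ and $o$, illustrates how periodic synchronization amortizes the communication delay over $\tau$ local epochs compared with full synchronization.

```python
# Hedged numerical illustration of the delay model: per-round time under full
# versus periodic synchronization. The constants c and o are assumptions.
def round_time_full_sync(c, o):
    return c + o                      # c_syn: communicate at every epoch

def round_time_periodic(c, o, tau):
    return c + o / tau                # c_avg: communication amortized over tau epochs

c, o = 1.0, 4.0                       # communication 4x slower than local compute
print("full sync:", round_time_full_sync(c, o))
for tau in (1, 2, 4, 8):
    print("tau =", tau, "->", round_time_periodic(c, o, tau))
# A larger tau shrinks the amortized communication term, at the price of staler
# cross-client embeddings (the variance term in Eq. (9) below).
```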

Assumption 2.

The stochastic gradient evaluated on the mini-batch $B_{k}$ has bounded variance, $\mathbb{E}[||\tilde{g}-g||]\leq\zeta^{2}$, where $\tilde{g}=\sum_{k=1}^{K}\frac{1}{|B_{k}|}\sum_{v\in B_{k}}\nabla f(\tilde{h}_{v}^{(L)},\theta_{t},y_{v})$.

Theorem 2.

For periodic synchronization, under Assumptions 1-2, Theorem 1 and Eq. (5), if the learning rate $\eta$ satisfies $\eta\lambda+\eta^{2}\lambda^{2}\tau(\tau-1)\leq 1$, and $\theta_{0}$ is the initial model generated by the server, then after a total runtime of $c_{\rm total}$, the minimal expected squared gradient norm is bounded by

\frac{2(F(\tilde{h}^{(L)},\theta_{0})-F_{inf})}{\eta c_{\rm total}}\Big(c+\frac{o}{\tau}\Big)+\eta^{2}\lambda^{2}\zeta^{2}(\tau-1). \quad (9)

From the optimization error bound in Eq. (9), the error-runtime trade-off for different synchronization intervals can be derived. While a larger $\tau$ reduces the runtime per iteration and makes the first term in Eq. (9) smaller, it also adds noise and increases the last term.

Then, we determine the optimal embedding synchronization interval $\tau$ to minimize the optimization error for each training batch. We start with infrequent cross-client embedding synchronization for improved convergence speed, and gradually transition to higher embedding synchronization frequencies to reduce the prediction error of the learned global model. Specifically, at each training round $t$, the server selects the embedding transmission interval that achieves the fastest decay of the test loss of the global model $\theta_{t}$ over the next interval. Theorem 2 shows that there is an optimal value $\tau_{t}$ that minimizes the optimization error bound at round $t$ between the server and all selected clients:

\tau_{t}=\sqrt{\frac{2(F(\tilde{h}^{(L)},\theta_{t})-F_{inf})\,o}{\eta^{3}c_{\rm total}\lambda^{2}\zeta^{2}}}. \quad (10)

It can be observed from Eq. (10) that the generated synchronization period sequence decreases along with the objective value on the test set when the learning rate is fixed. This is consistent with the intuition that the trade-off between error-convergence and overhead varies over time. Compared to the initial training phase, the benefit of using a large synchronization interval diminishes as the model converges, since a lower error is preferred in the later training phase. In some scenarios, the Lipschitz constant $\lambda$ and the gradient variance bound $\zeta^{2}$ are unknown, and estimating these constants is difficult due to the highly non-convex and high-dimensional loss surface. As an alternative, we propose a simpler rule: we approximate $F_{inf}$ by 0 and take the ratio of Eq. (10) at round $t$ to its value at round 0, so that the unknown constants cancel, and scale it by the basic synchronization interval $\tau_{0}$, obtaining

\tau_{t}=\Big\lceil\sqrt{\frac{F(\tilde{h}^{(L)},\theta_{t})}{F(\tilde{h}^{(L)},\theta_{0})}}\,\tau_{0}\Big\rceil, \quad (11)

where $\lceil r\rceil$ is the ceiling function that rounds $r$ up to the nearest integer. In practical implementations, we take the test loss as the objective function value and the average batch number $\sum_{k=1}^{K}n_{k}B/K$ as the initial synchronization period $\tau_{0}$, both of which can be easily obtained during training.
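A minimal sketch of this rule, assuming the initial interval $\tau_{0}$ and the test-loss values are given, is shown below.

```python
# Hedged sketch of the adaptive synchronization rule in Eq. (11): the interval
# shrinks as the test loss decays, so later rounds synchronize more often.
import math

def sync_interval(loss_t, loss_0, tau0):
    """Eq. (11): tau_t = ceil(sqrt(F_t / F_0) * tau_0)."""
    return math.ceil(math.sqrt(loss_t / loss_0) * tau0)

tau0, loss_0 = 10, 2.3                # assumed initial interval and test loss
for loss_t in (2.3, 1.2, 0.6, 0.1):
    print(loss_t, "->", sync_interval(loss_t, loss_0, tau0))
# Prints 10, 8, 6, 3: frequent cross-client synchronization is deferred to the
# later rounds, where a low optimization error matters more than runtime.
```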

Input: Initial probability $p_{k,i}^{0}=\frac{1}{n_{k}}$, sampling ratio $r_{k}$
Output: The optimal global model $\theta^{*}$
1  // At the FL Server:
2  Initialize global model $\theta_{0}$;
3  for each round $t\in\{1,\cdots,T\}$ do
4      $M_{t}\leftarrow$ randomly select $m$ clients;
5      for each client $k\in M_{t}$ do
6          $\theta_{t+1}^{k}\leftarrow$ LocalUpdate($k,\theta_{t},\tau_{t}$);
7      $\theta_{t+1}=\frac{1}{m}\sum_{k\in M_{t}}\theta_{t+1}^{k}$;  // update global model
8      Calculate $\tau_{t+1}$ with Eq. (11);  // update interval
9  // At FL Client $k\in[K]$:
10 Function LocalUpdate($k,\theta_{t},\tau_{t}$):
11     Calculate loss difference $\Delta_{j-1}$;
12     Update selection probability $p_{t}^{v}=\frac{||\Delta_{j-1}||}{\sum_{v\in V_{k}}||\Delta_{j-1}||}$;
13     for each local epoch $j\in\{1,2,\cdots,J\}$ do
14         Select a batch $B$ of $\frac{n_{k}}{|B|}r_{k}$ samples $x_{k,i}\propto p_{k,i}^{t}$;
15         if $j\,\%\,\tau_{t}==0$ then
16             Calculate $\tilde{h}_{i}^{(l+1)}$ with Eq. (6);
17             Update historical embedding $\bar{h}_{i}^{(L)}$;
18         $\theta_{j}^{k}\leftarrow\theta_{j-1}^{k}-\eta\frac{1}{|B|}\sum_{v\in B}\nabla f(\tilde{h}_{v}^{(L)},y_{v})$;
19     Return $\theta_{J}^{k}$;
Algorithm 1: FedAIS

Implementation

The proposed FedAIS approach is illustrated in Algorithm 1. Specifically, in the $t$-th global round, the server randomly selects a set $M_{t}$ of $m$ clients and distributes the current model $\theta_{t}$ to them (Lines 4-6). Each selected client $k$ calculates the loss for each sample $i\in V_{k}$ and updates its selection probability $p_{k,i}^{t}$ (Lines 11-12). Then, during each local epoch $j$, client $k$ selects a batch $B_{k}$ of samples with $x_{i}\propto p_{k,i}^{t},\ i\in V_{k}$. For each layer of the FedGCN, when the number of local batch training epochs $j$ satisfies $j\%\tau_{t}==0$, client $k$ first calculates $\tilde{h}_{i}^{(l+1)}$ with Eq. (6) by aggregating the embeddings of both within-client and cross-client neighbors. It then performs embedding synchronization by asking each neighbor client $q\in Q$ to update the selected cross-client neighbor embeddings and transmit them back (Lines 13-18). Afterwards, client $k$ updates its local model $\theta_{j}^{k}$ and sends $\theta_{J}^{k}$ to the server. The server aggregates the updates by conducting model aggregation and updating the interval $\tau_{t+1}$ (Lines 7-8). In this way, the server and clients collaboratively train a FedGCN model $\theta^{*}$ with high prediction accuracy and low overhead.
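For concreteness, the sketch below mirrors the control flow of Algorithm 1, with plain NumPy parameter vectors standing in for FedGCN models; the per-sample losses, gradients, and the embedding-synchronization step are placeholders (assumptions), not the actual implementation.

```python
# Hedged end-to-end sketch of Algorithm 1: importance-based batch selection on
# the clients, FedAvg aggregation and adaptive interval update on the server.
import numpy as np

rng = np.random.default_rng(0)

def local_update(theta, losses_prev, losses_curr, tau, n_epochs=5, ratio=0.7, lr=0.01):
    """Client side (Lines 10-19): sample by loss difference, sync every tau epochs."""
    delta = np.abs(losses_curr - losses_prev)
    probs = delta / delta.sum()                          # Eq. (8)
    n_k = len(probs)
    batch_size = max(1, int(n_k * ratio) // n_epochs)
    theta = theta.copy()
    for j in range(1, n_epochs + 1):
        batch = rng.choice(n_k, size=batch_size, replace=False, p=probs)
        if j % tau == 0:
            pass                                         # refresh embeddings via Eq. (6)
        grad = rng.standard_normal(theta.shape) * losses_curr[batch].mean() * 0.1
        theta -= lr * grad                               # placeholder mini-batch step
    return theta

def server_loop(num_clients=10, m=4, dim=32, rounds=20, tau0=2):
    """Server side (Lines 1-8): client selection, aggregation, interval update."""
    theta, loss_0, tau = np.zeros(dim), 1.0, tau0
    for t in range(rounds):
        selected = rng.choice(num_clients, size=m, replace=False)
        updates = [local_update(theta, rng.random(100), rng.random(100), tau)
                   for _ in selected]
        theta = np.mean(updates, axis=0)                 # FedAvg aggregation
        loss_t = max(0.05, loss_0 * 0.9 ** (t + 1))      # placeholder test loss
        tau = int(np.ceil(np.sqrt(loss_t / loss_0) * tau0))  # Eq. (11)
    return theta

print(server_loop().shape)                               # (32,)
```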

Convergence Analysis

Without loss of generality, we analyze an arbitrary interval sequence $\{\tau_{1},\cdots,\tau_{R}\}$ with $R$ synchronization iterations.

Theorem 3.

(Convergence of FedAIS) Suppose the learning rate $\eta_{r}$ remains the same within each synchronization interval and, as $R\rightarrow\infty$,

\sum_{r=0}^{R}\eta_{r}\tau_{r}\rightarrow\infty,\quad\sum_{r=0}^{R}\eta_{r}^{2}\tau_{r}<\infty,\quad\sum_{r=0}^{R}\eta_{r}^{3}\tau_{r}^{2}<\infty, \quad (12)

then the global model $\theta$ is guaranteed to converge, i.e.,

\mathbb{E}\Big[\frac{\sum_{r=0}^{R-1}\eta_{r}\sum_{k=1}^{\tau_{r}}||\nabla F(\tilde{h}^{(L)},\theta_{\sum_{i=0}^{r-1}\tau_{i}+k})||}{\sum_{r=0}^{R-1}\eta_{r}\tau_{r}}\Big]\rightarrow 0. \quad (13)

The key idea of the proof is as follows. To understand condition (12), we consider the case where $\tau_{0}=\cdots=\tau_{R}$ is a constant. Then, the convergence condition is identical to that of mini-batch SGD: $\sum_{r=0}^{R}\eta_{r}\rightarrow\infty,\ \sum_{r=0}^{R}\eta_{r}^{2}<\infty$. Provided that the sequence of communication periods is bounded, the learning rate in mini-batch SGD can be easily adjusted to satisfy condition (12). Specifically, when the communication period sequence decreases, the last two terms in (12) become easier to satisfy, and the differences between the objective values of two consecutive rounds are bounded. The full proof of the convergence of FedAIS is presented in the Appendix.

Experiment Evaluation

Experimental Settings

Table 1: Statistics of the datasets. ΔE denotes the total number of cross-client edges.
Dataset: Coauthor | Pubmed | Yelp | Reddit | Amazon2M
|V|: 18,333 | 19,717 | 716,847 | 232,965 | 2,449,029
|E|: 163,788 | 88,648 | 13,954,819 | 114,615,892 | 61,859,140
# features: 6,805 | 500 | 300 | 602 | 100
# classes: 15 | 3 | 100 | 41 | 47
Train/Val/Test: 0.8/0.1/0.1 | 0.8/0.1/0.1 | 0.75/0.10/0.15 | 0.66/0.10/0.24 | 0.8/0.1/0.1
With 100 clients:
V_k: 146 | 158 | 5,376 | 1,538 | 19,592
E_k: 173 | 879 | 138,815 | 1,140,985 | 610,748
ΔE: 1,030 | 747 | 73,230 | 517,533 | 784,277
Table 2: Performance comparison for training different FedGCN models on various datasets. Results are in % ± standard deviation; for each dataset, the two values correspond to the iid and non-iid settings, respectively.
Columns: Coauthor | Pubmed | Yelp | Reddit | Amazon2M
FedAll
  testAcc: 89.98±0.78, 84.46±1.08 | 87.74±2.16, 86.42±1.07 | 91.57±1.29, 89.69±2.08 | 82.38±0.61, 81.66±0.62 | 71.46±1.03, 67.89±0.82
  F1-score: 78.38±1.25, 73.03±1.13 | 85.62±0.87, 84.51±0.91 | 29.03±1.14, 27.71±1.41 | 76.34±1.12, 74.89±1.15 | 29.89±0.85, 25.82±0.79
  AUC: 92.35±1.32, 75.26±1.28 | 96.14±0.14, 96.03±0.74 | 75.45±2.03, 73.36±1.83 | 95.54±0.05, 92.41±1.02 | 83.62±0.65, 60.12±0.87
FedRandom
  testAcc: 87.09±0.78, 81.53±1.28 | 87.16±2.16, 84.92±1.07 | 92.25±2.29, 86.79±1.12 | 79.34±0.52, 77.52±0.91 | 71.78±0.49, 65.45±0.64
  F1-score: 72.95±1.05, 68.13±2.13 | 83.62±0.87, 81.32±0.91 | 29.61±2.14, 27.21±1.41 | 72.34±1.52, 70.56±1.05 | 30.54±0.68, 24.43±0.48
  AUC: 88.58±1.27, 70.76±0.82 | 96.12±0.16, 92.06±0.74 | 74.84±2.03, 71.58±0.79 | 93.66±0.05, 88.42±1.08 | 83.16±0.54, 56.98±1.12
FedSage+
  testAcc: 85.02±0.78, 81.95±1.03 | 86.82±1.16, 85.08±1.14 | 83.45±0.79, 82.79±0.43 | 78.34±0.28, 74.52±0.62 | 71.47±0.79, 66.42±0.81
  F1-score: 76.78±1.05, 54.93±0.63 | 84.62±0.87, 83.21±0.91 | 30.12±0.74, 29.84±0.81 | 76.54±0.97, 66.56±1.05 | 29.89±0.38, 27.38±0.37
  AUC: 89.29±1.17, 74.45±0.92 | 90.83±0.58, 89.06±0.49 | 71.26±1.03, 72.27±0.72 | 94.76±0.55, 92.42±1.18 | 83.62±0.48, 58.24±0.42
FedPNS
  testAcc: 87.03±0.21, 81.65±0.93 | 87.38±1.23, 85.87±1.05 | 90.92±0.59, 85.97±0.82 | 82.25±1.26, 80.24±0.73 | 71.36±0.78, 66.74±0.43
  F1-score: 72.58±2.34, 68.32±1.62 | 85.96±2.74, 84.81±2.46 | 28.99±1.02, 28.01±1.24 | 76.10±0.62, 73.63±0.62 | 29.63±0.78, 25.95±1.07
  AUC: 87.69±2.37, 72.81±1.72 | 95.87±1.29, 93.86±1.09 | 75.22±0.25, 73.74±1.02 | 95.07±1.28, 91.12±0.36 | 83.39±0.52, 58.19±0.62
FedGraph
  testAcc: 88.09±1.06, 83.18±1.25 | 87.82±0.97, 85.17±0.84 | 90.98±0.82, 87.41±0.58 | 82.18±0.75, 80.12±1.14 | 71.75±0.62, 66.93±0.47
  F1-score: 73.19±1.24, 68.03±1.86 | 85.69±1.04, 83.58±1.41 | 31.39±1.22, 29.16±1.02 | 75.88±0.83, 71.42±1.04 | 30.21±0.82, 25.43±0.93
  AUC: 88.85±0.78, 72.15±1.43 | 95.89±1.62, 93.97±1.27 | 77.75±0.51, 73.31±0.89 | 94.96±1.17, 92.61±0.72 | 82.82±0.78, 59.89±0.53
FedAIS
  testAcc: 88.12±0.12, 85.49±0.79 | 88.36±0.59, 85.26±1.07 | 94.12±0.17, 91.13±0.72 | 82.48±0.18, 81.84±0.93 | 71.84±0.31, 67.99±0.58
  F1-score: 74.86±1.07, 69.16±0.37 | 86.34±0.21, 83.72±0.83 | 31.65±0.26, 30.87±0.69 | 76.16±0.23, 74.69±0.53 | 30.84±0.52, 27.52±0.81
  AUC: 92.52±0.12, 73.75±1.04 | 95.97±0.31, 96.28±0.94 | 79.96±0.12, 74.35±0.72 | 95.26±1.05, 92.47±0.82 | 84.16±0.26, 60.72±0.43
Figure 3: Accuracy scores with sizes of total communication cost for training different FedGCN models. (a) FedReddit; (b) FedAmazon.
Figure 4: The total computation and communication costs for training various FedGCN models. (a) Computation cost; (b) Communication cost.

Implementation. We implemented FedAIS and deployed it in an FL system consisting of one server and 100 clients. To further investigate the performance of FedAIS in large-scale FL systems, we also tested it in an environment with up to 1,000 clients. Our implementation is based on Python 3.11 and PyTorch Geometric 2.0.1 (Fey and Lenssen 2019). All experiments are performed on the Ubuntu 20.04 operating system with a 32-core AMD Ryzen Threadripper PRO 5965WX @ 3.8 GHz CPU, 192 GB of RAM, and an NVIDIA RTX A5000 GPU with 24 GB of memory.

Datasets and Models. We use five real-world graph datasets of different scales, i.e., Coauthor (Shchur et al. 2018), Pubmed (Sen et al. 2008), Yelp (Zeng et al. 2019), Reddit (Hamilton et al. 2017), and Amazon2M (Hu et al. 2020). We partition the training/validation sets over 100 clients in both the independent and identically distributed (iid) setting and the non-iid setting. For the non-iid partition, we draw $p_{i}\sim Dir_{k}(\alpha)$ with $\alpha=0.5$ from a Dirichlet distribution and allocate a $p_{i,k}$ proportion of the instances of class $i$ to client $k$ (Li et al. 2022; Yurochkin et al. 2019). Since the original graphs are extremely dense, we downsample the edges in local subgraphs by 50% (Hamilton et al. 2017). The test dataset is located at the server, and the statistics of the datasets are presented in Table 1. We use the widely adopted GraphSage model (Hamilton et al. 2017) with FedAvg to construct the FedGCN models: FedAuthor, FedPubmed, FedYelp, FedReddit, and FedAmazon (Hu et al. 2020). Each model has two hidden layers with 256 and 128 neurons, respectively. We set the ratio of sample selection to 0.7, the number of neighbors sampled to 10, and the minimum embedding synchronization interval $\tau_{0}$ to 2 batch training epochs. We use Adam as the optimizer with weight decay 0.001 and ReLU as the activation function. We set the learning rate $\eta=0.001$, the fixed batch number to 10, and the number of global warm-up rounds to 1. We conduct training until a pre-specified test accuracy is reached or a maximum number of iterations has elapsed (e.g., 100 rounds). We perform 5-fold cross-validation and report the average results.
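For reference, a minimal sketch of this Dirichlet-based non-iid partition is given below (the label array is a toy stand-in for real dataset labels).

```python
# Hedged sketch of the Dirichlet partition: for each class i, a proportion vector
# p_i ~ Dir(alpha) over the K clients decides how many nodes of that class each
# client receives. alpha=0.5 follows the setting described above.
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))  # p_i ~ Dir_k(alpha)
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices

labels = np.random.default_rng(1).integers(0, 5, size=2_000)       # toy: 5 classes
parts = dirichlet_partition(labels, num_clients=10, alpha=0.5)
print([len(p) for p in parts])   # skewed class/size distribution across clients
```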

Comparison Baselines. 1) FedAll: It conducts training using all local samples and performs random neighbor node selection of both local subgraph neighbors and cross-client neighbors. 2) FedRandom: It performs random selection for both local samples and neighbor nodes in each batch training. 3) FedSage+ (Zhang et al. 2021b): It proposes a GNN-based neighbor generative model for each client to predict the features of each node’s cross-client neighbors. 4) FedPNS (Du and Wu 2022): It conducts training using all local samples and conducts periodic neighbor node selection for cross-client neighbor nodes. We set the periodic interval to 2 local batch training epochs. 5) FedGraph (Chen et al. 2021): It performs FL training using all local samples and selects local subgraph neighbors and cross-client neighbors by adjusting sampling policies based on DRL.

Figure 5: Model performance vs. various ablation baselines. (a) Test accuracy, FedReddit; (b) Communication cost.
Figure 6: Model performance vs. various numbers of clients. (a) Test accuracy; (b) Communication cost.

Results and Discussions

FedAIS achieves comparable or higher accuracy. We adopt three metrics (Falessi et al. 2021; Herbold et al. 2018), i.e., test accuracy, F1-score, and area under the curve (AUC), to evaluate the accuracy of the FedGCN models. We compare FedAIS with the baseline methods by training different FedGCN models in both iid and non-iid settings. We report these accuracy scores and the standard deviations of the final global models in Table 2. It shows that FedAIS achieves test accuracy, F1-score, and AUC scores that are comparable to or better than those of the other methods. For example, for the FedYelp model trained on the Yelp dataset in the iid setting, the test accuracy of FedAIS is 4.55%, 3.87%, 5.20%, and 5.14% higher than that of the other methods, respectively.

FedAIS significantly saves computation and communication costs. We present the test accuracy against the size of the communication cost for training FedReddit and FedAmazon in the iid setting in Fig. 3. The test accuracy, F1-score and AUC scores against the size of the communication cost for training FedAuthor, FedPubmed and FedYelp in both iid and non-iid settings are very similar to those in Fig. 3. The results in Fig. 3 show that FedAIS requires a much smaller communication volume than the other baselines to achieve the target test accuracy, which leads to less computation time for transmitting node embedding bytes and updating model parameters. Besides, we present the total computation and communication overhead for training the FedGCN models in Fig. 4. It shows that FedAIS achieves significantly larger savings of both computation and communication costs than the other baselines.

Ablation Study

We perform ablation studies to show the effectiveness of each component of FedAIS. We compare FedAIS against the following ablation baselines: 1) FedAll; 2) FedAIS1, which only conducts the proposed dynamic importance sampling of local samples without adaptive embedding synchronization; and 3) FedAIS2, which conducts training using all local samples with the proposed adaptive embedding synchronization. We present the test accuracy against the size of the communication cost and the total communication costs in Fig. 5. The results show that FedAIS, FedAIS1 and FedAIS2 all save substantial communication costs in reaching the target accuracy scores, with FedAIS performing the best among them. Thus, both the dynamic importance sampling module and the adaptive embedding synchronization module are effective components of FedAIS.

Figure 7: Sensitivity analysis of non-iid degree and sampling ratio. (a) Various non-iid degrees; (b) Different sampling ratios.

Sensitivity Analysis

Impact of the number of clients. We conduct experiments with different numbers of participating clients, i.e., $K=100, 300, 500, 700, 1{,}000$. We present the test accuracy and communication cost of the FedReddit model trained with different numbers of clients in Fig. 6. It shows that the test accuracy of FedAIS remains consistently high, i.e., above 75.0%, as the number of clients increases to 1,000, and is comparable to or higher than that of the others. Besides, it shows that communication costs increase as the number of clients increases, and FedAIS achieves substantial cost savings compared to the other baselines in all settings.

Impact of the non-iid degree. We present the test accuracy of the FedReddit model with different non-iid degrees, i.e., $\alpha=0.05, 0.1, 0.5, 1.0, 10, 100$, in Fig. 7(a). It shows that FedAIS achieves accuracy scores that are comparable to or higher than those of the other baselines. Besides, these accuracy scores increase as $\alpha$ increases, and when $\alpha$ is greater than 0.5, the accuracy scores are relatively high. The accuracy scores against the size of the communication cost for training the FedReddit model with different non-iid degrees are very similar to those in Fig. 3(a), which shows that FedAIS consistently requires less communication cost than the others.

Impact of the ratio of samples selected. Fig. 7(b) presents the test accuracy and communication costs of the FedReddit model when the selection ratio is $r=0.1, 0.3, 0.5, 0.7, 0.9$. Here, we adjust FedPNS and FedGraph so that they can select the corresponding ratios of nodes. It shows that both the test accuracy and the communication cost increase as the sampling ratio increases, and FedAIS performs considerably better. Besides, due to the large size of the Reddit dataset, FedAIS can construct a FedGCN model with high test accuracy and low communication cost by sampling only 10% of the local samples, which achieves a significantly more advantageous trade-off between accuracy and efficiency.

Conclusions

In this paper, we proposed a federated adaptive importance-based sampling approach, FedAIS, for node classification tasks on large-scale graph data. It achieves substantial computation and communication cost savings by efficiently utilizing historical embedding estimators and reducing unnecessary sample training via dynamic importance-based sampling. Besides, it reduces cross-client neighbor embedding communication through adaptive embedding synchronization. In this way, FedAIS determines the optimal communication period and achieves faster convergence with lower costs and lower prediction errors. Extensive evaluations show that FedAIS achieves comparable or higher test accuracy, while saving significant communication and computation costs.

References

  • Chen et al. (2021) Fahao Chen, Peng Li, Toshiaki Miyazaki, and Celimuge Wu. Fedgraph: Federated graph learning with intelligent sampling. IEEE Transactions on Parallel and Distributed Systems, 33(8):1775–1786, 2021.
  • Chen et al. (2017) Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. arXiv preprint arXiv:1710.10568, 2017.
  • Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.
  • Chiang et al. (2019) Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 257–266, 2019.
  • Cong et al. (2020) Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1393–1403, 2020.
  • Dai et al. (2018) Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. Learning steady-states of iterative algorithms over graphs. In International conference on machine learning, pages 1106–1114. PMLR, 2018.
  • Deng et al. (2023) Pan Deng, Xuefeng Liu, Jianwei Niu, and Chunming Hu. Graphfed: A personalized subgraph federated learning framework for non-iid graphs. In 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems (MASS), pages 227–233. IEEE, 2023.
  • Du and Wu (2022) Bingqian Du and Chuan Wu. Federated graph learning with periodic neighbour sampling. In 2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS), pages 1–10. IEEE, 2022.
  • Falessi et al. (2021) Davide Falessi, Aalok Ahluwalia, and Massimiliano DI Penta. The impact of dormant defects on defect prediction: A study of 19 apache projects. ACM Transactions on Software Engineering and Methodology (TOSEM), 31(1):1–26, 2021.
  • Fey and Lenssen (2019) Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • Fey et al. (2021) Matthias Fey, Jan E Lenssen, Frank Weichert, and Jure Leskovec. Gnnautoscale: Scalable and expressive graph neural networks via historical embeddings. In ICML, pages 3294–3304. PMLR, 2021.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
  • He et al. (2021) Chaoyang He, Keshav Balasubramanian, Lichao Sun, Lifang He, Liangwei Yang, Philip S Yu, Yu Rong, et al. Fedgraphnn: A federated learning system and benchmark for graph neural networks. arXiv preprint arXiv:2104.07145, 2021.
  • Herbold et al. (2018) Steffen Herbold, Alexander Trautsch, and Jens Grabowski. A comparative study to benchmark cross-project defect prediction approaches. In Proceedings of the 40th International Conference on Software Engineering, pages 1063–1063, 2018.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Li et al. (2021) Anran Li, Lan Zhang, Juntao Tan, Yaxuan Qin, Junhao Wang, and Xiang-Yang Li. Sample-level data selection for federated learning. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications, pages 1–10. IEEE, 2021.
  • Li et al. (2022) Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-iid data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 965–978. IEEE, 2022.
  • Liu et al. (2022) Rui Liu, Pengwei Xing, Zichao Deng, Anran Li, Cuntai Guan, and Han Yu. Federated graph neural networks: Overview, techniques and challenges. arXiv preprint arXiv:2202.07256, 2022.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
  • Yurochkin et al. (2019) Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International conference on machine learning, pages 7252–7261. PMLR, 2019.
  • Zeng et al. (2019) Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. Graphsaint: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2019.
  • Zhang et al. (2021a) Chenhan Zhang, Shuyu Zhang, JQ James, and Shui Yu. Fastgnn: A topological information protected federated learning approach for traffic speed forecasting. IEEE Transactions on Industrial Informatics, 17(12):8464–8474, 2021a.
  • Zhang et al. (2021b) Ke Zhang, Carl Yang, Xiaoxiao Li, Lichao Sun, and Siu Ming Yiu. Subgraph federated learning with missing neighbor generation. Advances in Neural Information Processing Systems, 34:6671–6682, 2021b.
  • Zhang et al. (2022) Taolin Zhang, Chuan Chen, Yaomin Chang, Lin Shu, and Zibin Zheng. Fedego: Privacy-preserving personalized federated graph learning with ego-graphs. arXiv preprint arXiv:2208.13685, 2022.
  • Zou et al. (2019) Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer-dependent importance sampling for training deep and large graph convolutional networks. Advances in neural information processing systems, 32, 2019.