
Subgraph Federated Learning with Missing Neighbor Generation

Ke Zhang1,4, Carl Yang1 , Xiaoxiao Li2, Lichao Sun3, Siu Ming Yiu4
1Emory University, 2University of British Columbia, 3Lehigh University, 4University of Hong Kong
kzhang2@cs.hku.hk, j.carlyang@emory.edu,
xiaoxiao.li@ece.ubc.ca, lis221@lehigh.edu, smyiu@cs.hku.hk

Corresponding author.
Abstract

Graphs have been widely used in data mining and machine learning due to their unique representation of real-world objects and their interactions. As graphs grow larger and larger nowadays, it is common to see their subgraphs separately collected and stored in multiple local systems. Therefore, it is natural to consider the subgraph federated learning setting, where each local system holds a small subgraph that may be biased from the distribution of the whole graph. Subgraph federated learning then aims to collaboratively train a powerful and generalizable graph mining model without directly sharing graph data. In this work, towards this novel yet realistic setting, we propose two major techniques: (1) FedSage, which trains a GraphSage model based on FedAvg to integrate node features, link structures, and task labels on multiple local subgraphs; (2) FedSage+, which trains a missing neighbor generator alongside FedSage to deal with missing links across local subgraphs. Empirical results on four real-world graph datasets with synthesized subgraph federated learning settings demonstrate the effectiveness and efficiency of our proposed techniques. At the same time, consistent theoretical implications are drawn regarding their generalization ability on the global graph.

1 Introduction

Graph mining leverages links among connected nodes in graphs to conduct inference. Recently, graph neural networks (GNNs) have gained wide attention with their impressive performance and generalizability in many graph mining tasks [29, 11, 16, 20, 32]. Similar to machine learning tasks in other domains, attaining a well-performing GNN model requires its training data to not only be sufficient but also follow a distribution similar to that of general queries. In reality, however, data owners often collect limited and biased graphs and cannot observe the global distribution. With heterogeneous subgraphs separately stored by local data owners, attaining a globally applicable GNN requires collaboration.

Federated learning (FL) [17, 35], which targets training machine learning models on data distributed across multiple local systems to resolve the information-silo problem, has shown its advantage in enhancing the performance and generalizability of collaboratively trained models without the need to share any actual data. For example, FL has been devised in computer vision (CV) and natural language processing (NLP) to allow the joint training of powerful and generalizable deep convolutional neural networks and language models on separately stored datasets of images and texts [19, 6, 18, 39, 13].

Figure 1: A toy example of the distributed subgraph storage system: In this example, there are four hospitals and a medical administration center. The global graph records, for a certain period, the city’s patients (nodes), their information (attributes), and interactions (links). Specifically, the left part of the figure shows how the global graph is stored in each hospital, where the grey solid lines are the links explicitly stored in each hospital, and the red dashed lines are the cross-hospital links that may exist but are not stored in any hospital. The right part of the figure indicates our goal that without sharing actual data, the system obtains a globally powerful graph mining model.

Motivating Scenario. Taking the healthcare system as an example, as shown in Fig. 1, residents of a city may go to different hospitals for various reasons. As a result, their healthcare data, such as demographics and living conditions, as well as patient interactions, such as co-staying in a sickroom and co-diagnosis of a disease, are stored only within the hospitals they visit. When any healthcare problem is to be studied in the whole city, e.g., the prediction of infections when a pandemic occurs, a single powerful graph mining model is needed to conduct effective inference over the entire global patient network, which contains all subgraphs from different hospitals. However, it is rather difficult to let all hospitals share their patient networks with others to train the graph mining model due to conflicts of interest.

In such scenarios, it is desirable to train a powerful and generalizable graph mining model over multiple distributed subgraphs without actual data sharing. However, this novel yet realistic setting brings two unique technical challenges, which have never been explored so far.

Challenge 1: How to jointly learn from multiple local subgraphs? In our considered scenario, the global graph is distributed into a set of small subgraphs with heterogeneous feature and structure distributions. Training a separate graph mining model on each subgraph may not capture the global data distribution and is also prone to overfitting. Moreover, it is unclear how to integrate multiple graph mining models into a universally applicable one that can handle any queries from the underlying global graph.

Solution 1: FedSage: Training GraphSage with FedAvg. To attain a powerful and generalizable graph mining model from small and biased subgraphs distributed in multiple local owners, we develop a framework of subgraph federated learning, specifically, with the vanilla mechanism of FedAvg [21]. As for the graph mining model, we resort to GraphSage [11], due to its advantages of inductiveness and scalability. We term this framework as FedSage.

Challenge 2: How to deal with missing links across local subgraphs? Unlike distributed systems in other domains such as CV and NLP, whose data samples of images and texts are isolated and independent, data samples in graphs are connected and correlated. Most importantly, in a subgraph federated learning system, data samples in each subgraph can potentially have connections to those in other subgraphs. These connections, which carry important information about node neighborhoods and serve as bridges among the data owners, are however never directly captured by any data owner.

Solution 2: FedSage+: Generating missing neighbors along FedSage. To deal with cross-subgraph missing links, we add a missing neighbor generator on top of FedSage and propose a novel FedSage+ model. Specifically, for each data owner, instead of training the GraphSage model on the original subgraph, it first mends the subgraph with generated cross-subgraph missing neighbors and then applies FedSage on the mended subgraph. To obtain the missing neighbor generator, each data owner impairs the subgraph by randomly holding out some nodes and related links and then trains the generator based on the held-out neighbors. Training the generator on an individual local subgraph enables it to generate potential missing links within the subgraph. Further training the generator in our subgraph FL setting allows it to generate missing neighbors across distributed subgraphs.

We conduct experiments on four real-world datasets with different numbers of data owners to better simulate the application scenarios. According to our results, both of our models outperform locally trained classifiers in all scenarios. Compared to FedSage, FedSage+ further improves the performance of the outcome classifier. Further in-depth model analysis shows the convergence and generalization ability of our frameworks, which is corroborated by our theoretical analysis in the end.

2 Related works

Graph mining. Graph mining has demonstrated its significance in analyzing informative graph data, ranging from social networks to gene interaction networks [31, 33, 34, 24]. One of the most frequently applied tasks on graph data is node classification. Recently, graph neural networks (GNNs), e.g., graph convolutional networks (GCN) [16] and GraphSage [11], have improved the state of the art in node classification with their elegant yet powerful designs. However, as GNNs leverage the homophily of nodes in both node features and link structures to conduct inference, they are vulnerable to perturbations on graphs [4, 40, 41]. Robust GNNs, which aim at reducing the degradation of GNNs caused by graph perturbation, are gaining attention these days. Current robust GNNs focus on the sensitivity towards modifications of node features [3, 42, 15] or the addition/removal of edges [37]. However, neither of these two types recapitulates the missing neighbor problem, which affects both the feature distribution and the structure distribution.

To obtain a node classifier with good generalizability, domain adaptive GNNs shed light on adapting a GNN model trained on a source domain to a target domain by leveraging the underlying structural consistency [38, 36, 28]. However, in the distributed system we consider, data owners hold subgraphs with heterogeneous feature and structure distributions. Moreover, direct information exchange among subgraphs, such as message passing, is fully blocked due to the missing cross-subgraph links. The violation of domain adaptive GNNs' assumptions on alignable nodes and cross-domain structural consistency prevents their usage in the distributed subgraph system.

Federated learning. FL is proposed for cross-institutional collaborative learning without sharing raw data [17, 35, 21]. FedAvg [21] is an efficient and well-studied FL method. Similar to most FL methods, it was originally proposed for traditional machine learning problems [35] to allow collaborative training on siloed data through local updating and global aggregation. The recently proposed meta-learning frameworks [9, 23, 14], which exploit information from different data sources to obtain a general model, have also attracted FL researchers [8]. However, meta-learning aims to learn general models that easily adapt to different local tasks, while we learn a generalizable model from diverse data owners to assist in solving a global task. In the distributed subgraph system, to obtain a globally applicable model without sharing local graph data, we borrow the idea of FL to collaboratively train GNNs.

Federated graph learning. Researchers have recently made progress in federated graph learning, and there are existing FL frameworks designed for graph learning tasks [12, 27, 30]. [12] designs graph-level FL schemes with graph datasets dispersed over multiple data owners, which are inapplicable to our distributed subgraph system. [27] proposes an FL method for the recommendation problem, with each data owner learning on a subgraph of the whole user-item recommendation graph. It considers a different scenario that assumes subgraphs have overlapping items (nodes) and that the user-item interactions (edges), though distributed, are completely stored in the system, which ignores the possible loss of cross-subgraph information in real-world scenarios. In contrast, we study a more challenging yet realistic case of the distributed subgraph system, where cross-subgraph edges are entirely missing.

In this work, we consider the commonly existing yet understudied scenario, i.e., a distributed subgraph system with missing cross-subgraph edges. Under this scenario, we focus on obtaining a globally applicable node classifier through FL on distributed subgraphs.

3 FedSage

In this section, we first illustrate the definition of the distributed subgraph system derived from real-world application scenarios. Based on this system, we then formulate our novel subgraph FL framework and a vanilla solution called FedSage.

3.1 Subgraphs Distributed in Local Systems

Notation.

We denote a global graph as $G=\{V,E,X\}$, where $V$ is the node set, $X$ is the respective node feature set, and $E$ is the edge set. In the FL system, we have the central server $S$ and $M$ data owners with distributed subgraphs. $G_{i}=\{V_{i},E_{i},X_{i}\}$ is the subgraph owned by $D_{i}$, for $i\in[M]$.

Problem setup.

For the whole system, we assume $V=V_{1}\cup\cdots\cup V_{M}$. To simulate the scenario with the most missing links, we assume no overlapping nodes are shared across data owners, namely $V_{i}\cap V_{j}=\emptyset$ for $\forall i,j\in[M]$ with $i\neq j$. Note that the central server $S$ only maintains a graph mining model and stores no actual graph data. Any data owner $D_{i}$ cannot directly retrieve $u\in V_{j}$ from another data owner $D_{j}$. Therefore, for an edge $e_{v,u}\in E$ with $v\in V_{i}$ and $u\in V_{j}$ ($i\neq j$), we have $e_{v,u}\notin E_{i}\cup E_{j}$; that is, $e_{v,u}$ might exist in reality but is not stored anywhere in the whole system.

For the global graph $G=\{V,E,X\}$, every node $v\in V$ has its features $x_{v}\in X$ and one label $y_{v}\in Y$ for the downstream task, e.g., node classification. Note that for $v\in V$, $v$'s feature $x_{v}\in\mathbb{R}^{d_{x}}$ and its respective label $y_{v}$ is a $d_{y}$-dimensional one-hot vector. In a typical GNN, predicting a node's label requires an ego-graph of the queried node. For a node $v$ from graph $G$, we denote the queried ego-graph of $v$ as $G(v)$, and $(G(v),y_{v})\sim\mathcal{D}_{G}$.
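For concreteness, the queried ego-graph $G(v)$ used by a $K$-layer GNN can be taken as the induced subgraph within $K$ hops of $v$. The sketch below uses networkx purely for illustration; it is not the paper's data pipeline.

```python
import networkx as nx

def ego_graph_k(G, v, K=2):
    """Induced subgraph on all nodes within K hops of v, i.e., the queried ego-graph G(v)."""
    return nx.ego_graph(G, v, radius=K)
```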

With subgraphs distributed in the system defined above, we formulate our goal as follows.

Goal.

The system exploits an FL framework to collaboratively learn on the isolated subgraphs of all data owners, without sharing raw graph data, in order to obtain a global node classifier $F$. The learnable weights $\phi$ of $F$ are optimized for queried ego-graphs following the distribution of those drawn from the global graph $G$. We formalize the problem as finding $\phi^{*}$ that minimizes the aggregated risk

$$\phi^{*}=\operatorname*{arg\,min}_{\phi}\mathcal{R}(F(\phi))=\operatorname*{arg\,min}_{\phi}\frac{1}{M}\sum_{i=1}^{M}\mathcal{R}_{i}(F_{i}(\phi)),$$

where $\mathcal{R}_{i}$ is the local empirical risk defined as

$$\mathcal{R}_{i}(F_{i}(\phi))\coloneqq\mathbb{E}_{(G_{i},Y_{i})\sim\mathcal{D}_{G_{i}}}[\ell(F_{i}(\phi;G_{i}),Y_{i})],$$

where $\ell$ is a task-specific loss function

$$\ell\coloneqq\frac{1}{|V_{i}|}\sum_{v\in V_{i}}l(\phi;G_{i}(v),y_{v}).$$

3.2 Collaborative Learning on Isolated Subgraphs

To fulfill the system’s goal illustrated above, we leverage the simple and efficient FedAvg framework [21] and fix the node classifier $F$ as a GraphSage model. The inductiveness and scalability of the GraphSage model facilitate both the training on diverse subgraphs with heterogeneous query distributions and the later inference upon the global graph. We term the GraphSage model trained with the FedAvg framework as FedSage.

For a queried node $v\in V$, a globally shared $K$-layer GraphSage classifier $F$ integrates $v$ and its $K$-hop neighborhood on graph $G$ to conduct prediction with learnable parameters $\phi=\{\phi^{k}\}^{K}_{k=1}$. Taking a subgraph $G_{i}$ as an example, for $v\in V_{i}$ with features $h^{0}_{v}=x_{v}$, at each layer $k\in[K]$, $F$ computes $v$'s representation $h^{k}_{v}$ as

$$h^{k}_{v}=\sigma\left(\phi^{k}\cdot\left(h_{v}^{k-1}\,||\,Agg\left(\left\{h^{k-1}_{u},\forall u\in\mathcal{N}_{G_{i}}(v)\right\}\right)\right)\right), \quad (1)$$

where $\mathcal{N}_{G_{i}}(v)$ is the set of $v$'s neighbors on graph $G_{i}$, $||$ is the concatenation operation, $Agg(\cdot)$ is the aggregator (e.g., mean pooling), and $\sigma$ is the activation function (e.g., ReLU).
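As a concrete illustration of Eq. (1), the following sketch implements one layer with a mean aggregator in PyTorch. The module and variable names (`SageLayer`, `neighbors`) are ours for illustration only; the actual implementation in this paper builds on StellarGraph [5].

```python
import torch
import torch.nn as nn

class SageLayer(nn.Module):
    """One GraphSage layer following Eq. (1) with a mean aggregator."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # phi^k acts on the concatenation [h_v^{k-1} || Agg(...)], hence 2 * in_dim inputs.
        self.phi = nn.Linear(2 * in_dim, out_dim, bias=False)

    def forward(self, h, neighbors):
        # h: (num_nodes, in_dim) representations from layer k-1
        # neighbors: list of LongTensors; neighbors[v] holds the neighbor ids of node v
        agg = torch.stack([
            h[nbrs].mean(dim=0) if len(nbrs) > 0 else torch.zeros_like(h[0])
            for nbrs in neighbors
        ])
        return torch.relu(self.phi(torch.cat([h, agg], dim=1)))
```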

With $F$ outputting the inferred label $\widetilde{y}_{v}=\text{Softmax}(h^{K}_{v})$ for $v\in V_{i}$, the supervised loss function $l(\phi|\cdot)$ is defined as follows

$$\mathcal{L}^{c}=l(\phi|G_{i}(v),y_{v})=CE(\widetilde{y}_{v},y_{v})=-\left[y_{v}\log\widetilde{y}_{v}+\left(1-y_{v}\right)\log\left(1-\widetilde{y}_{v}\right)\right], \quad (2)$$

where $CE(\cdot)$ is the cross-entropy function and $G_{i}(v)$ is $v$'s $K$-hop ego-graph on $G_{i}$, which contains the information of $v$ and its $K$-hop neighbors on $G_{i}$.

In FedSage, the distributed subgraph system obtains a shared global node classifier $F$ parameterized by $\phi$ through $e_{c}$ epochs of training. During each epoch $t$, every $D_{i}$ first locally computes $\phi_{i}\leftarrow\phi-\eta\nabla\ell(\phi|\{(G_{i}(v),y_{v})|v\in V_{i}^{t}\})$, where $V_{i}^{t}\subseteq V_{i}$ contains the sampled training nodes for epoch $t$ and $\eta$ is the learning rate; then the central server $S$ collects the latest $\{\phi_{i}|i\in[M]\}$; next, $S$ sets $\phi$ to the average of $\{\phi_{i}|i\in[M]\}$; finally, $S$ broadcasts $\phi$ to the data owners, finishing one round of training $F$. After $e_{c}$ epochs, the system obtains $F$ as the outcome global classifier, which is not limited to or biased towards the queries of any specific data owner.
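The round just described can be sketched as follows. `owner.local_update` is an assumed interface standing in for the local gradient step on sampled ego-graphs, and the dictionaries mimic PyTorch state dicts; this is an illustrative sketch, not the authors' implementation.

```python
import copy
import torch

def fedavg_round(global_state, owners):
    """One FedSage round: local updates, server-side averaging, broadcast."""
    local_states = []
    for owner in owners:
        # Each owner computes phi_i <- phi - eta * grad on its sampled ego-graphs
        # (owner.local_update is an assumed interface, not the paper's API).
        local_states.append(owner.local_update(copy.deepcopy(global_state)))
    # Server S averages every (float) parameter tensor and broadcasts the result.
    return {name: torch.stack([s[name] for s in local_states]).mean(dim=0)
            for name in global_state}

# Training loop sketch: for t in range(e_c): global_state = fedavg_round(global_state, owners)
```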

Unlike FL on Euclidean data, nodes in the distributed subgraph system can have potential interactions with each other across subgraphs. However, as the cross-subgraph links cannot be captured by any data owner in the system, incomplete neighborhoods, compared to those on the global graph, commonly exist therein. Thus, directly aggregating incomplete queried ego-graph information through FedSage restricts the outcome $F$ from achieving the desideratum of capturing the global query distribution.

4 FedSage+

In this section, we propose a novel framework of FedSage+, i.e., subgraph FL with missing neighbor generation. We first design a missing neighbor generator (NeighGen) and its training schema via graph mending. Then, we describe the joint training of NeighGen and GraphSage to better achieve the goal in Section 3.1. Without loss of generality, in the following demonstration, we take NeighGen$_i$, i.e., the missing neighbor generator of $D_{i}$, as an example, where $i\in[M]$.

Figure 2: Joint training of missing neighbor generation and node classification.

4.1 Missing Neighbor Generator (NeighGen)

Neural architecture of NeighGen.

As shown in Fig. 2, NeighGen consists of two modules, i.e., an encoder $H^{e}$ and a generator $H^{g}$. We describe their designs in detail in the following.

$H^{e}$: A GNN model, i.e., a $K$-layer GraphSage encoder, with parameters $\theta^{e}$. For node $v\in V_{i}$ on the input graph $G_{i}$, $H^{e}$ computes node embeddings $Z_{i}=\{z_{v}|z_{v}=h^{K}_{v},z_{v}\in\mathbb{R}^{d_{z}},v\in V_{i}\}$ according to Eq. (1) by substituting $\phi$ and $G$ with $\theta^{e}$ and $G_{i}$.

$H^{g}$: A generative model recovering missing neighbors for the input graph based on the node embeddings. $H^{g}$ contains dGen and fGen, where dGen is a linear regression model parameterized by $\theta^{d}$ that predicts the numbers of missing neighbors $\widetilde{N}_{i}=\{\widetilde{n}_{v}|\widetilde{n}_{v}\in\mathbb{N},v\in V_{i}\}$, and fGen is a feature generator parameterized by $\theta^{f}$ that generates a set of $\widetilde{N}_{i}$ feature vectors $\widetilde{X}_{i}=\{\widetilde{x}_{v}|\widetilde{x}_{v}\in\mathbb{R}^{\widetilde{n}_{v}\times d_{x}},\widetilde{n}_{v}\in\widetilde{N}_{i},v\in V_{i}\}$. Both dGen and fGen are constructed as fully connected neural networks (FNNs), while fGen is further equipped with a Gaussian noise generator $\mathbf{N}(0,1)$ that generates $d_{z}$-dimensional noise vectors and a random sampler $R$. For node $v\in V_{i}$, fGen is variational: it generates the missing neighbors' features for $v$ after inserting noise into the embedding $z_{v}$, while $R$ ensures that fGen outputs the features of a specific number of neighbors by sampling $\widetilde{n}_{v}$ feature vectors from the feature generator's output. Mathematically, we have

$$\widetilde{n}_{v}=\sigma\left((\theta^{d})^{T}\cdot z_{v}\right)\text{, and }\widetilde{x}_{v}=R\left(\sigma\left((\theta^{f})^{T}\cdot(z_{v}+\mathbf{N}(0,1))\right),\widetilde{n}_{v}\right). \quad (3)$$
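The sketch below gives one possible reading of Eq. (3): dGen and fGen act on the encoder embedding $z_{v}$, and the sampler $R$ keeps $\widetilde{n}_{v}$ of the generated feature vectors. The single-layer modules and the fixed maximum number of predicted neighbors are our simplifying assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NeighGenHead(nn.Module):
    """Sketch of H^g: dGen predicts the missing-neighbor count, fGen generates features."""
    def __init__(self, d_z, d_x, max_pred=5):
        super().__init__()
        self.dgen = nn.Linear(d_z, 1)               # linear regression on z_v (dGen)
        self.fgen = nn.Linear(d_z, max_pred * d_x)  # feature generator (fGen), simplified to one layer
        self.max_pred, self.d_x = max_pred, d_x

    def forward(self, z):
        # z: (batch, d_z) embeddings produced by the GraphSage encoder H^e
        n_tilde = torch.relu(self.dgen(z)).round().clamp(0, self.max_pred).int()  # dGen output
        noisy = z + torch.randn_like(z)                                           # Gaussian noise N(0, 1)
        feats = torch.relu(self.fgen(noisy)).view(-1, self.max_pred, self.d_x)    # fGen output
        # Sampler R: keep n_tilde_v generated feature vectors for each node v.
        return n_tilde, [feats[i, :int(n)] for i, n in enumerate(n_tilde.squeeze(-1))]
```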

Graph mending simulation.

For each data owner in our system, we assume that only a particular set of nodes have cross-subgraph missing neighbors. This assumption is realistic yet non-trivial, since it both captures the essence of the distributed subgraph system and allows us to locally simulate the missing neighbor situation through a graph impairing and mending process. Specifically, to simulate a graph mending process during the training of NeighGen, in each local subgraph $G_{i}$, we randomly hold out $h\%$ of its nodes $V^{h}_{i}\subset V_{i}$ and all links involving them $E^{h}_{i}=\{e_{uv}|u\in V^{h}_{i}\text{ or }v\in V^{h}_{i}\}\subset E_{i}$ to form an impaired subgraph $\bar{G}_{i}=\{\bar{V}_{i},\bar{E}_{i},\bar{X}_{i}\}$, which contains the impaired set of nodes $\bar{V}_{i}=V_{i}\setminus V^{h}_{i}$, the corresponding node features $\bar{X}_{i}=X_{i}\setminus X^{h}_{i}$, and edges $\bar{E}_{i}=E_{i}\setminus E^{h}_{i}$.
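A minimal sketch of this impairing step, assuming the subgraph is held as a networkx graph (an illustrative choice, not the paper's data pipeline):

```python
import random
import networkx as nx

def impair_subgraph(G_i, h=0.15, seed=0):
    """Hold out h% of the nodes and all their incident links to form the impaired subgraph."""
    rng = random.Random(seed)
    hidden = set(rng.sample(list(G_i.nodes()), int(h * G_i.number_of_nodes())))
    G_bar = G_i.subgraph(n for n in G_i.nodes() if n not in hidden).copy()
    # Ground truth for NeighGen: the hidden neighbors of every remaining node.
    hidden_neighbors = {v: [u for u in G_i.neighbors(v) if u in hidden] for v in G_bar.nodes()}
    return G_bar, hidden, hidden_neighbors
```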

Accordingly, based on the ground-truth missing nodes $V^{h}_{i}$ and links $E^{h}_{i}$, the training of NeighGen on the impaired graph $\bar{G}_{i}$ boils down to jointly training dGen and fGen as below.

$$\mathcal{L}^{n}=\lambda^{d}\mathcal{L}^{d}+\lambda^{f}\mathcal{L}^{f}=\lambda^{d}\frac{1}{|\bar{V}_{i}|}\sum_{v\in\bar{V}_{i}}L_{1}^{S}(\widetilde{n}_{v}-n_{v})+\lambda^{f}\frac{1}{|\bar{V}_{i}|}\sum_{v\in\bar{V}_{i}}\sum_{p\in[\widetilde{n}_{v}]}\min_{u\in\mathcal{N}_{G_{i}}(v)\cap V^{h}_{i}}\left(||\widetilde{x}_{v}^{p}-x_{u}||^{2}_{2}\right), \quad (4)$$

where $L_{1}^{S}$ is the smooth L1 distance [10] and $\widetilde{x}_{v}^{p}\in\mathbb{R}^{d_{x}}$ is the $p$-th predicted feature vector in $\widetilde{x}_{v}$. Note that $\mathcal{N}_{G_{i}}(v)\cap V^{h}_{i}$ contains the $n_{v}$ neighbors of $v$ on $G_{i}$ that were held out into $V_{i}^{h}$; it can be retrieved from $V^{h}_{i}$ and $E^{h}_{i}$ and provides the ground truth for training NeighGen.
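The sketch below mirrors Eq. (4): a smooth-L1 term on the predicted neighbor counts plus, for each generated feature vector, the squared distance to its closest held-out neighbor feature. Tensor layouts and the helper name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def neighgen_loss(n_tilde, n_true, gen_feats, true_feats, lam_d=1.0, lam_f=1.0):
    """Local NeighGen loss mirroring Eq. (4)."""
    # n_tilde, n_true: (num_nodes,) predicted / ground-truth missing-neighbor counts
    loss_d = F.smooth_l1_loss(n_tilde.float(), n_true.float())
    loss_f = torch.tensor(0.0)
    for x_gen, x_true in zip(gen_feats, true_feats):      # iterate over the nodes of the impaired graph
        if len(x_gen) == 0 or len(x_true) == 0:
            continue
        # Squared distance from each generated feature to its closest held-out neighbor feature.
        dists = torch.cdist(x_gen, x_true).pow(2)
        loss_f = loss_f + dists.min(dim=1).values.sum()
    return lam_d * loss_d + lam_f * loss_f / max(len(gen_feats), 1)
```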

Neighbor Generation.

To retrieve $G^{\prime}_{i}$ from $G_{i}$, data owner $D_{i}$ performs two steps, as shown in Fig. 2: 1) $D_{i}$ trains NeighGen on the impaired graph $\bar{G}_{i}$ w.r.t. the ground-truth hidden neighbors $V^{h}_{i}$; 2) $D_{i}$ exploits NeighGen to generate missing neighbors for nodes on $G_{i}$ and then mends $G_{i}$ into $G_{i}^{\prime}$ with the generated neighbors. On the local graph $G_{i}$ alone, this process can be understood as a data augmentation that generates potential missing neighbors within $G_{i}$. However, the actual goal is to allow NeighGen to generate the cross-subgraph missing neighbors, which can be achieved by training NeighGen with FL, as discussed in Section 4.3.
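Step 2) can be sketched as attaching the generated neighbors as new nodes and edges; the node-id scheme and attribute name below are illustrative assumptions, not the paper's implementation.

```python
import networkx as nx

def mend_graph(G_i, generated):
    """Attach generated neighbors as new nodes/edges to obtain the mended graph G_i'."""
    # generated: {v: list of generated feature vectors for node v}
    G_mended = G_i.copy()
    next_id = max(G_i.nodes()) + 1           # assumes integer node ids (illustrative)
    for v, feats in generated.items():
        for x in feats:
            G_mended.add_node(next_id, x=x)  # synthetic neighbor carrying its generated features
            G_mended.add_edge(v, next_id)
            next_id += 1
    return G_mended
```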

4.2 Local Joint Training of GraphSage and NeighGen

While NeighGen is designed to recover missing neighbors, the final goal of our system is to train a node classifier. Therefore, we design the joint training of GraphSage and NeighGen, which leverages neighbors generated by NeighGen to assist the node classification by GraphSage. We term the integration of GraphSage and NeighGen on the local graphs as LocSage+.

After NeighGen mends the graph $G_{i}$ into $G_{i}^{\prime}$, the GraphSage classifier $F$ is applied on $G_{i}^{\prime}$ according to Eq. (1) (with $G_{i}$ replaced by $G_{i}^{\prime}$). Thus, the joint training of NeighGen and GraphSage is done by optimizing the following loss function

$$\mathcal{L}=\mathcal{L}^{n}+\lambda^{c}\mathcal{L}^{c}=\lambda^{d}\mathcal{L}^{d}+\lambda^{f}\mathcal{L}^{f}+\lambda^{c}\mathcal{L}^{c}, \quad (5)$$

where $\mathcal{L}^{d}$ and $\mathcal{L}^{f}$ are defined in Eq. (4), and $\mathcal{L}^{c}$ is defined in Eq. (2) (with $G_{i}$ substituted by $G_{i}^{\prime}$).

The local joint training of GraphSage and NeighGen allows NeighGen to generate missing neighbors in the local graph that are helpful for the classifications made by GraphSage. However, like GraphSage, the information encoded in the local NeighGen is limited to and biased towards the local graph, which does not enable it to really generate neighbors belonging to other data owners connected by the missing cross-subgraph links. To this end, it is natural to train NeighGen with FL as well.

4.3 Federated Learning of GraphSage and NeighGen

Similarly to GraphSage alone, as described in Section 3.2, we can apply FedAvg to the joint training of GraphSage and NeighGen by setting the loss function to $\mathcal{L}$ and the learnable parameters to $\{\theta^{e},\theta^{d},\theta^{f},\phi\}$. However, we observe that cooperation through directly averaging the weights of NeighGen across the system can negatively affect its performance, i.e., averaging the weights of a single NeighGen model does not really allow it to generate diverse neighbors from different subgraphs. Recalling our goal of constructing NeighGen, which is to facilitate the training of a centralized GraphSage classifier by generating diverse missing neighbors in each subgraph, we do not necessarily need a centralized NeighGen. Therefore, instead of training a single centralized NeighGen, we train a local NeighGen$_i$ for each data owner $D_{i}$. In order to allow each NeighGen$_i$ to generate diverse neighbors similar to those missed into other subgraphs $G_{j}$, $j\in[M]\setminus\{i\}$, we add a cross-subgraph feature reconstruction loss into fGen$_i$ as follows:

$$\mathcal{L}^{f}_{i}=\frac{1}{|\bar{V}_{i}|}\sum_{v\in\bar{V}_{i}}\sum_{p\in[\widetilde{n}_{v}]}\left(\min_{u\in\mathcal{N}_{G_{i}}(v)\cap V^{h}_{i}}\left(||\widetilde{x}_{v}^{p}-x_{u}||^{2}_{2}\right)+\alpha\sum_{j\in[M]\setminus\{i\}}\min_{u\in V_{j}}\left(||H^{g}_{i}(z_{v})^{p}-x_{u}||^{2}_{2}\right)\right), \quad (6)$$

where $u\in V_{j}$, $\forall j\in[M]\setminus\{i\}$, is picked as the closest node from $G_{j}$ (other than $G_{i}$) to simulate the neighbor of $v\in\bar{V}_{i}$ missed into $G_{j}$.

As shown above, to optimize Eq. (6), $D_{i}$ needs to pick the closest $u$ from $G_{j}$. However, directly transmitting the node features $X_{j}$ of $D_{j}$ to $D_{i}$ not only violates our subgraph FL system's constraint of no direct data sharing, but is also impractical in reality, as it would require each $D_{i}$ to hold the entire global graph's node features throughout the training of NeighGen$_i$. Therefore, to allow $D_{i}$ to update NeighGen$_i$ using Eq. (6) without direct access to $X_{j}$, for $v\in\bar{V}_{i}$, $D_{j}$ locally computes $\sum_{p\in[\widetilde{n}_{v}]}\min_{u\in V_{j}}(||H^{g}_{i}(z_{v})^{p}-x_{u}||^{2}_{2})$ and sends the respective gradient back to $D_{i}$.

During this process, for $v\in\bar{V}_{i}$, to optimize Eq. (6) in a federated manner, only $H^{g}_{i}$, $H^{g}_{i}$'s input $z_{v}$, and $D_{j}$'s locally computed gradients of the loss term $\sum_{p\in[\widetilde{n}_{v}]}\min_{u\in V_{j}}(||H^{g}_{i}(z_{v})^{p}-x_{u}||^{2}_{2})$ are transmitted within the system via the server $S$. For data owner $D_{i}$, the gradients received from $D_{j}$ are then weighted by $\alpha$ and combined with the local gradients as in Eq. (6) to update the parameters of $H^{g}_{i}$ of NeighGen$_i$. In this way, $D_{i}$ achieves the federated training of NeighGen$_i$ without sharing raw graph data. Note that, due to NeighGen's architecture of a concatenation of $H^{e}$ and $H^{g}$, the locally preserved GNN $H^{e}_{i}$ prevents other data owners from inferring $x_{v}$ by only seeing $z_{v}$. Through Eq. (6), NeighGen$_i$ is expected to perceive diverse neighborhood information from all data owners, so as to generate more realistic cross-subgraph missing neighbors. The expectedly diverse and unbiased neighbors further assist FedSage in training a globally applicable classifier that satisfies our goal in Section 3.1.
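The computation on $D_{j}$'s side can be sketched as below, assuming the received $H^{g}_{i}$ is a differentiable PyTorch module and that only parameter gradients (never $X_{j}$) are sent back; names and shapes are illustrative.

```python
import torch

def remote_feature_gradients(H_i_g, z_batch, X_j):
    """Gradient of the cross-subgraph term of Eq. (6), computed locally on D_j."""
    # H_i_g: D_i's generator (a differentiable module received via the server)
    # z_batch: (B, d_z) embeddings sent by D_i; X_j: (|V_j|, d_x) features kept on D_j
    gen = H_i_g(z_batch)                                   # assumed shape (B, n_pred, d_x)
    dists = torch.cdist(gen.reshape(-1, gen.shape[-1]), X_j).pow(2)
    loss = dists.min(dim=1).values.sum() / z_batch.shape[0]
    grads = torch.autograd.grad(loss, list(H_i_g.parameters()))
    return [g.detach() for g in grads]                     # only gradients leave D_j, never X_j
```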

Note that, to reduce communications and computation time incurred by Eq. (6), batch training can be applied. Appendix A shows the pseudo code of FedSage+.

5 Experiments

We conduct experiments on four datasets to verify the effectiveness of FedSage and FedSage+ under different testing scenarios. We further conduct case studies to visualize how FedSage and FedSage+ assist local data owners in accommodating queries from the global distribution. Finally, we also provide more in-depth studies on the effectiveness of NeighGen in Appendix D.

5.1 Datasets and experimental settings

We synthesize the distributed subgraph system with four widely used real-world graph datasets, i.e., Cora [25], Citeseer [25], PubMed [22], and MSAcademic [26]. To synthesize the distributed subgraph system, we find hierarchical graph clusters on each dataset with the Louvain algorithm [2] and use the clustering results with 3, 5, and 10 clusters of similar sizes to obtain subgraphs for data owners. The statistics of these datasets are presented in Table 1.

Table 1: Statistics of the datasets and the synthesized distributed subgraph systems with $M$ = 3, 5, and 10. The #C row shows the number of classes, the $|V_{i}|$ and $|E_{i}|$ rows show the averaged numbers of nodes and links over all subgraphs, and $\Delta E$ shows the total number of missing cross-subgraph links.
Data Cora Citeseer PubMed MSAcademic
#C 7 6 3 15
$|V|$ 2708 3312 19717 18333
$|E|$ 5429 4715 44338 81894
M 3 5 10 3 5 10 3 5 10 3 5 10
$|V_{i}|$ 903 542 271 1104 662 331 6572 3943 1972 6111 3667 1833
$|E_{i}|$ 1675 968 450 1518 902 442 12932 7630 3789 23584 13949 5915
$\Delta E$ 403 589 929 161 206 300 5543 6189 6445 11141 12151 22743
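A minimal sketch of this synthesis procedure, assuming the python-louvain package; greedily merging Louvain communities into $M$ clusters of similar size is a simplification of the hierarchical clustering described above, and cross-cluster edges are simply dropped (yielding $\Delta E$).

```python
import networkx as nx
import community as community_louvain   # python-louvain package

def split_into_owners(G, M=3):
    """Partition the global graph into M subgraphs via Louvain communities."""
    part = community_louvain.best_partition(G)     # node -> community id
    communities = {}
    for node, cid in part.items():
        communities.setdefault(cid, []).append(node)
    # Greedily merge the smallest communities until M clusters of similar size remain.
    clusters = sorted(communities.values(), key=len)
    while len(clusters) > M:
        smallest = clusters.pop(0)
        clusters[0].extend(smallest)
        clusters.sort(key=len)
    # Each induced subgraph becomes one data owner's G_i; cross-cluster edges are dropped.
    return [G.subgraph(nodes).copy() for nodes in clusters]
```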

We implement GraphSage with two layers using the mean aggregator [5]. The number of nodes sampled in each layer of GraphSage is 5. We use a batch size of 64 and set the number of training epochs to 50. The training-validation-testing ratio is 60%-20%-20% due to the limited sizes of local subgraphs. Based on our observations in the hyper-parameter studies for $\alpha$ and the graph impairing ratio $h$, we set $h\%\in[3.4\%,27.8\%]$ and $\alpha=1$. All $\lambda$s are simply set to 1. Optimization is done with Adam with a learning rate of 0.001. We implement FedSage and FedSage+ in Python and execute all experiments on a server with 8 NVIDIA GeForce GTX 1080 Ti GPUs.

Since we are the first to study the novel yet important setting of subgraph federated learning, there are no existing baselines. We conduct a comprehensive ablation evaluation by comparing FedSage and FedSage+ with three models: 1) GlobSage, the GraphSage model trained on the original global graph without missing links (as an upper bound for the FL frameworks with the GraphSage model alone); 2) LocSage, one GraphSage model trained solely on each subgraph; 3) LocSage+, the GraphSage plus NeighGen model jointly trained solely on each subgraph.

The metric used in our experiments is the node classification accuracy on the queries sampled from the testing nodes on the global graph. For globally shared models of GlobSage, FedSage, and FedSage+, we report the average accuracy over five random repetitions, while for locally possessed models of LocSage and LocSage+, the scores are further averaged across local models.

5.2 Experimental results

Table 2: Node classification results on four datasets with $M$ = 3, 5, and 10. Besides the averaged accuracy, we also provide the corresponding std.
Cora Citeseer
Model M=3 M=5 M=10 M=3 M=5 M=10
LocSage 0.5762 0.4431 0.2798 0.6789 0.5612 0.4240
(±0.0302) (±0.0847) (±0.0080) (±0.054) (±0.086) (±0.0859)
LocSage+ 0.5644 0.4533 0.2851 0.6848 0.5676 0.4323
(±0.0219) (±0.047) (±0.0080) (±0.0517) (±0.0714) (±0.0715)
FedSage 0.8656 0.8645 0.8626 0.7241 0.7226 0.7158
(±0.0043) (±0.0050) (±0.0103) (±0.0022) (±0.0066) (±0.0053)
FedSage+ 0.8686 0.8648 0.8632 0.7454 0.7440 0.7392
(±0.0054) (±0.0051) (±0.0034) (±0.0038) (±0.0025) (±0.0041)
GlobSage 0.8701 (±0.0042) 0.7561 (±0.0031)
PubMed MSAcademic
Model M=3 M=5 M=10 M=3 M=5 M=10
LocSage 0.8447 0.8039 0.7148 0.8188 0.7426 0.5918
(±0.0047) (±0.0337) (±0.0951) (±0.0331) (±0.0790) (±0.1005)
LocSage+ 0.8481 0.8046 0.7039 0.8393 0.7480 0.5927
(±0.0041) (±0.0318) (±0.0925) (±0.0330) (±0.0810) (±0.1094)
FedSage 0.8708 0.8696 0.8692 0.9327 0.9391 0.9262
(±0.0014) (±0.0035) (±0.0010) (±0.0005) (±0.0007) (±0.0009)
FedSage+ 0.8775 0.8755 0.8749 0.9359 0.9414 0.9314
(±0.0012) (±0.0047) (±0.0013) (±0.0005) (±0.0006) (±0.0009)
GlobSage 0.8776 (±0.0011) 0.9681 (±0.0006)
(a) Hyper-parameter study for $\alpha$ with $h=15\%$.
(b) Hyper-parameter study for $h$ with $\alpha=1$.
Figure 3: Node classification results on four datasets under different $\alpha$ and $h$ values with $M$=3.

Overall performance.

We conduct comprehensive ablation experiments to verify the significant improvements brought by FedSage and FedSage+ for local owners in global node classification, as shown in Table 2. The most important observation emerging from the results is that FedSage+ not only clearly outperforms LocSage by an average of 23.18%, but also distinctly mitigates the cross-subgraph missing neighbor problem, reducing the average accuracy drop relative to GlobSage (absolute accuracy difference) from 2.11% for FedSage to 1.28%.

The significant gaps between a locally obtained classifier, i.e., LocSage or LocSage+, and a classifier trained with FL, i.e., FedSage or FedSage+, attest to the benefits brought by the collaboration across data owners in our distributed subgraph system. Compared to FedSage, the further improvement brought by FedSage+ corroborates both the presumed degradation caused by missing cross-subgraph links and the effectiveness of our NeighGen module. Notably, when the graph is relatively sparse (e.g., see Citeseer in Table 1), FedSage+ is significantly more robust against the cross-subgraph information loss than FedSage. Note that the gaps between LocSage and LocSage+ are comparatively smaller, indicating that NeighGen serves as more than a robust GNN trainer and is uniquely crucial in the subgraph FL setting.

(a) Local model predictions
(b) Global ground-truth vs. model predictions
Figure 4: Label distributions on the PubMed dataset with $M$=5.
(a) Accuracy curves
(b) Loss curves
(c) Training time
Figure 5: Training curves of different frameworks (GlobSage provides an upper bound).

Hyper-parameter studies.

We compare the downstream task performance under different $\alpha$ and $h$ values with three data owners. Results are shown in Fig. 3, where Fig. 3 (a) shows results when $h$ is fixed at 15%, and Fig. 3 (b) shows results under $\alpha=1$.

Fig. 3 (a) indicates that choosing a proper $\alpha$, which brings in information from the other subgraphs in the system, can consistently elevate the final testing accuracy. Across different datasets, the optimal $\alpha$ is consistently around 1, and the performance is not influenced much unless $\alpha$ is set to extreme values like 0.1 or 10. Referring to Fig. 3 (b), we can observe that either a too-small (1%) or a too-large (30%) hiding portion can degrade the learning process. A too-small $h$ cannot provide sufficient data for training NeighGen, while a too-large $h$ can result in sparse local subgraphs that harm the effective training of GraphSage. Referring back to the graph statistics in Table 1, the portion of actually missing edges compared to the global graph is within the range of [3.4%, 27.8%], which explains why a value like 15% mostly boosts the performance of FedSage+.

Case studies.

To further understand how FedSage and FedSage+ improve the global classifier over LocSage, we provide case study results on PubMed with five data owners in Fig. 4. In the studied scenario, each data owner only possesses about 20% of the nodes, with rather biased label distributions, as shown in Fig. 4 (a). Such bias is due to the way we synthesize the distributed subgraph system with Louvain clustering, and it is also common in real scenarios. Local bias essentially makes it hard for any local data owner with limited training samples to obtain a generalized classifier that is globally useful. Although 13.9% of the links are missing from the system, both FedSage and FedSage+ empower local data owners to predict labels that closely follow the ground-truth global label distribution, as shown in Fig. 4 (b). The figure clearly evidences that our FL models exhibit their advantages in learning a more realistic label distribution, in line with our goal in Section 3.1, which is consistent with the observed performance in Table 2 and our theoretical implications in Section 6.

For the Cora dataset with five data owners, we visualize the testing accuracy, loss convergence, and runtime over 100 epochs of obtaining $F$ with FedSage, FedSage+, GlobSage, LocSage, and LocSage+. The results are presented in Fig. 5. Both FedSage and FedSage+ consistently converge with rapidly improving testing accuracy. Regarding runtime, even though the classifier in FedSage+ learns from distributed mended subgraphs, FedSage+ does not consume observably more training time than FedSage. Due to the additional communication and computation in subgraph FL, both FedSage and FedSage+ consume slightly more training time than GlobSage.

6 Implications on Generalization Bound

In this section, we provide a theoretical implication for the generalization error associated with the number of training samples, i.e., nodes in the distributed subgraph system, following the Graph Neural Tangent Kernel (GNTK) [7] analysis of universal graph neural networks. This motivates the FedSage and FedSage+ algorithms, which include more nodes of the global graph through collaborative training with FL.

Setting.

Our explanation builds on a generalized setting, where we assume a GNN $F$ with layer-wise aggregation operations and fully-connected layers with ReLU activation functions, which includes GraphSage as a special case. The weights $\phi$ of $F$ are i.i.d. sampled from a multivariate Gaussian distribution $\mathbf{N}(0,I)$. For a graph $G=\{V,E,X\}$, we define the kernel matrix of two nodes $u,v\in V$ as follows. Here we consider $F$ in the GNTK format.

Definition 6.1 (Informal version of GNTK on node classification (Definition B.2))

Consider the overparameterized regime for a GNN $F$, where $F$ is trained using gradient descent with an infinitesimally small learning rate. Given $n$ nodes with corresponding labels as training samples, we denote by $\mathbf{\Theta}\in\mathbb{R}^{n\times n}$ the kernel matrix of the GNTK. $\mathbf{\Theta}_{uv}$ is defined as

$$\mathbf{\Theta}_{uv}=\mathbb{E}_{\phi\sim\mathbf{N}(0,I)}\left[\left\langle\frac{\partial F(\phi,G,u)}{\partial\phi},\frac{\partial F(\phi,G,v)}{\partial\phi}\right\rangle\right]\in\mathbb{R}.$$

The full expression of $\mathbf{\Theta}$ is shown in Appendix B. The generalization ability in the GNTK regime depends on the kernel matrix $\mathbf{\Theta}$. We present the generalization bound associated with the number of training samples $n$ in Theorem 6.2.

Theorem 6.2 (Generalization bound)

Given $n$ training samples of nodes $\{(u_{i},y_{i})\}^{n}_{i=1}$ drawn i.i.d. from the global graph $G$, consider any loss function $l:\mathbb{R}\times\mathbb{R}\mapsto[0,1]$ that is 1-Lipschitz in the first argument such that $l(y,y)=0$. With probability at least $1-\sigma$ and a constant $c\in(0,1)$, the generalization error of the GNTK for node classification can be upper-bounded by

$$L_{\mathcal{D}}(F)=\mathbb{E}_{(u^{\prime},y)\sim G}[l(F(G,u^{\prime}),y)]\lesssim O(1/n^{c}).$$

Following the generalization bound analysis in [7], we use a standard generalization bound for kernel methods from [1], which shows that the upper bound of our GNTK formulation's error depends on $\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}$ and $\mathrm{tr}(\mathbf{\Theta})$, where $\mathbf{y}$ is the label vector. Appendix C shows the full version of the proofs.

Implications.

We show the error bound of the GNTK on node classification with respect to the number of training samples. Under the assumptions in Definition 6.1, our theoretical result indicates that more training samples bring down the generalization error, which provides plausible support for our goal of building a globally useful classifier through FL in Section 3.1. Such implications are also consistent with our experimental findings in Fig. 4, where our FedSage and FedSage+ models learn more generalizable classifiers that follow the label distribution of the global graph by involving more training nodes across different subgraphs.

7 Conclusion

This work aims at obtaining a generalized node classification model in a distributed subgraph system without direct data sharing. To tackle the realistic yet unexplored issue of missing cross-subgraph links, we design a novel missing neighbor generator NeighGen with the corresponding local and federated training processes. Experimental results evidence the significant improvements brought by our FedSage and FedSage+ frameworks, which is consistent with our theoretical implications.

Though FedSage achieves advantageous performance, it incurs additional communication cost and potential privacy concerns. As communication is vital for federated learning, properly reducing communication and rigorously guaranteeing privacy protection in the distributed subgraph system are both promising future directions.

Acknowledgments and Disclosure of Funding

This work is partially supported by the internal funding and GPU servers provided by the Computer Science Department of Emory University. We thank Dr. Pan Li from Purdue University for the suggestions on the design of our NeighGen mechanism.

References

  • [1] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
  • [2] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. JSTAT, 2008(10):P10008, 2008.
  • [3] Liang Chen, Jintang Li, Qibiao Peng, Yang Liu, Zibin Zheng, and Carl Yang. Understanding structural vulnerability in graph convolutional networks. In IJCAI, 2021.
  • [4] Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. Adversarial attack on graph structured data. In ICML, 2018.
  • [5] CSIRO’s Data61. Stellargraph machine learning library. https://github.com/stellargraph/stellargraph, 2018.
  • [6] Qi Dou, Tiffany Y So, Meirui Jiang, Quande Liu, Varut Vardhanabhuti, Georgios Kaissis, Zeju Li, Weixin Si, Heather HC Lee, Kevin Yu, et al. Federated deep learning for detecting covid-19 lung abnormalities in ct: a privacy-preserving multinational validation study. NPJ digital medicine, 4:1–11, 2021.
  • [7] Simon S. Du, Kangcheng Hou, Ruslan Salakhutdinov, Barnabás Póczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In NeurIPS, 2019.
  • [8] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. NeurIPS, 2020.
  • [9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [10] Ross Girshick. Fast r-cnn. In ICCV, 2015.
  • [11] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
  • [12] Chaoyang He, Keshav Balasubramanian, Emir Ceyani, Carl Yang, Han Xie, Lichao Sun, Lifang He, Liangwei Yang, Philip S Yu, Yu Rong, Peilin Zhao, Junzhou Huang, Murali Annavaram, and Salman Avestimehr. Fedgraphnn: A federated learning system and benchmark for graph neural networks. arXiv preprint arXiv:2104.07145, 2021.
  • [13] Chaoyang He, Shen Li, Mahdi Soltanolkotabi, and Salman Avestimehr. Pipetransformer: Automated elastic pipelining for distributed training of transformers. arXiv preprint arXiv:2102.03161, 2021.
  • [14] Timothy M Hospedales, Antreas Antoniou, Paul Micaelli, and Amos J Storkey. Meta-learning in neural networks: A survey. TPAMI, 2021.
  • [15] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In SIGKDD, 2020.
  • [16] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [17] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE SPM, 37:50–60, 2020.
  • [18] Xinle Liang, Yang Liu, Tianjian Chen, Ming Liu, and Qiang Yang. Federated transfer reinforcement learning for autonomous driving. arXiv preprint arXiv:1910.06001, 2019.
  • [19] Quande Liu, Cheng Chen, Jing Qin, Qi Dou, and Pheng-Ann Heng. Feddg: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space. arXiv preprint arXiv:2103.06030, 2021.
  • [20] Gongxu Luo, Jianxin Li, Hao Peng, Carl Yang, Lichao Sun, Philip Yu, and Lifang He. Graph entropy guided node embedding dimension selection for graph neural networks. In IJCAI, 2021.
  • [21] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
  • [22] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In MLG workshop, 2012.
  • [23] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • [24] Saif Ur Rehman, Asmat Ullah Khan, and Simon Fong. Graph mining: A survey of graph mining techniques. In ICDIM, 2012.
  • [25] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • [26] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
  • [27] Chuhan Wu, Fangzhao Wu, Yang Cao, Yongfeng Huang, and Xing Xie. Fedgnn: Federated graph neural network for privacy-preserving recommendation. arXiv preprint arXiv:2102.04925, 2021.
  • [28] Man Wu, Shirui Pan, Chuan Zhou, Xiaojun Chang, and Xingquan Zhu. Unsupervised domain adaptive graph convolutional networks. In WWW, 2020.
  • [29] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. TNNLS, 2020.
  • [30] Han Xie, Jing Ma, Li Xiong, and Carl Yang. Federated graph classification over non-iid graphs. In NeurIPS, 2021.
  • [31] Carl Yang, Haonan Wang, Ke Zhang, Liang Chen, and Lichao Sun. Secure deep graph generation with link differential privacy. In IJCAI, 2021.
  • [32] Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. Heterogeneous network representation learning: A unified framework with survey and benchmark. In TKDE, 2020.
  • [33] Carl Yang, Jieyu Zhang, and Jiawei Han. Co-embedding network nodes and hierarchical labels with taxonomy based generative adversarial nets. In ICDM, 2020.
  • [34] Carl Yang, Peiye Zhuang, Wenhan Shi, Alan Luu, and Pan Li. Conditional structure generation through graph variational generative adversarial nets. In NeurIPS, 2019.
  • [35] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. TIST, 10(2):1–19, 2019.
  • [36] Yizhou Zhang, Guojie Song, Lun Du, Shuwen Yang, and Yilun Jin. DANE: domain adaptive network embedding. In IJCAI, 2019.
  • [37] Dingyuan Zhu, Ziwei Zhang, Peng Cui, and Wenwu Zhu. Robust graph convolutional networks against adversarial attacks. In SIGKDD, 2019.
  • [38] Qi Zhu, Yidan Xu, Haonan Wang, Chao Zhang, Jiawei Han, and Carl Yang. Transfer learning of graph neural networks with ego-graph information maximization. In NeurIPS, 2021.
  • [39] Xinghua Zhu, Jianzong Wang, Zhenhou Hong, and Jing Xiao. Empirical studies of institutional federated learning for natural language processing. In EMNLP, 2020.
  • [40] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In SIGKDD, 2018.
  • [41] Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. In ICLR, 2019.
  • [42] Daniel Zügner and Stephan Günnemann. Certifiable robustness and robust training for graph convolutional networks. In SIGKDD, 2019.

Appendix A FedSage+ Algorithm

Referring to Section 4.3, FedSage+ includes two phases. First, all data owners in the distributed subgraph system jointly train NeighGen models by sharing gradients. Next, after every local graph is mended with the synthetic neighbors generated by its respective NeighGen model, the system executes FedSage to obtain the generalized node classification model. Algorithm 1 shows the pseudo code of FedSage+.

1: Notations. Data owner set $\{D_{1},\dots,D_{M}\}$, server $S$, epochs for jointly training NeighGen $e_{g}$, epochs for FedSage $e_{c}$, learning rate for FedSage $\eta$.
2: For $t=1\rightarrow e_{g}$, iteratively run procedure A, procedure C, procedure D, and procedure E
3: Every $D_{i}\in\{D_{1},\dots,D_{M}\}$ retrieves $G_{i}^{\prime}$ from the FL-trained NeighGen$_i$
4: $S$ initializes and broadcasts $\phi$
5: For $t=1\rightarrow e_{c}$, iteratively run procedure B and procedure F
6:
7: On the server side:
8: procedure A ServerExecutionForGen($t$) $\triangleright$ FL of NeighGen on epoch $t$
9:     Collect $(Z^{t}_{i},H^{g}_{i})\leftarrow\textsc{LocalRequest}(D_{i},t)$ from every data owner $D_{i}$, where $i\in[M]$
10:     Send $\{(Z^{t}_{j},H^{g}_{j})|j\in[M]\setminus\{i\}\}$ to every data owner $D_{i}$, where $i\in[M]$
11:     for $D_{i}\in\{D_{1},\dots,D_{M}\}$ in parallel do
12:         $\{\nabla\mathcal{L}^{f}_{i,1},\dots,\nabla\mathcal{L}^{f}_{i,M}\}\setminus\{\nabla\mathcal{L}^{f}_{i,i}\}\leftarrow\textsc{FeedForward}(D_{i},\{(Z^{t}_{j},H^{g}_{j})|j\in[M]\setminus\{i\}\})$
13:     for $D_{i}\in\{D_{1},\dots,D_{M}\}$ in parallel do
14:         Aggregate gradients as $\nabla\mathcal{L}^{f}_{i,J}\leftarrow\sum_{j\in[M]\setminus\{i\}}\nabla\mathcal{L}^{f}_{i,j}$
15:         Send $\nabla\mathcal{L}^{f}_{i,J}$ to $D_{i}$ for $\textsc{UpdateNeighGen}(D_{i},\nabla\mathcal{L}^{f}_{i,J})$
16: procedure B ServerExecutionForC($t$) $\triangleright$ FedSage for mended subgraphs on epoch $t$
17:     Collect $\phi_{i}\leftarrow\textsc{LocalUpdateC}(D_{i},\phi,t)$ from all data owners
18:     Broadcast $\phi\leftarrow\frac{1}{M}\sum_{i\in[M]}\phi_{i}$
19:
20: On the data owners' side:
21: procedure C LocalRequest($D_{i},t$) $\triangleright$ Run on $D_{i}$
22:     Sample $V_{i}^{t}\subseteq\bar{V}_{i}$ and get $Z_{i}^{t}\leftarrow\{H^{e}_{i}(\bar{G}_{i}(v))|v\in V_{i}^{t}\}$
23:     Send $Z_{i}^{t},H^{g}_{i}$ to Server
24: procedure D FeedForward($D_{i},\{(Z^{t}_{j},H^{g}_{j})|j\in[M]\setminus\{i\}\}$) $\triangleright$ Run on $D_{i}$
25:     for $j\in[M]\setminus\{i\}$ do
26:         $\mathcal{L}^{f}_{j,i}\leftarrow\frac{1}{|Z^{t}_{j}|}\sum_{z_{v}\in Z^{t}_{j}}\sum_{p\in[|H^{g}_{j}(z_{v})|]}\left(\min_{u\in V_{i}}(||H^{g}_{j}(z_{v})^{p}-x_{u}||^{2}_{2})\right)$ $\triangleright$ A part of Eq. (6)
27:     Compute and send $\{\nabla\mathcal{L}^{f}_{1,i},\dots,\nabla\mathcal{L}^{f}_{M,i}\}\setminus\{\nabla\mathcal{L}^{f}_{i,i}\}$ to Server
28: procedure E UpdateNeighGen($D_{i},\nabla\mathcal{L}^{f}_{i,J}$) $\triangleright$ Run on $D_{i}$
29:     Train NeighGen$_i$ by optimizing Eq. (6).
30: procedure F LocalUpdateC($D_{i},\phi,t$) $\triangleright$ Run on $D_{i}$
31:     Sample $V_{i}^{t}\subseteq V_{i}$
32:     $\phi_{i}\leftarrow\phi-\eta\nabla l(\phi|\{(G_{i}^{\prime}(v),y_{v})|v\in V_{i}^{t}\})$
33:     Send $\phi_{i}$ to Server
Algorithm 1 FedSage+: Subgraph federated learning with missing neighbor generation

Appendix B Full Version of Definition 6.1

Notation.

We denote the whole graph as $G=\{V,E,X\}$ with $|V|=n$. To perform node classification on $G$, we consider a GNN $F$ with $K$ aggregation operations (in GraphSage, this is equivalent to having $K$ graph convolutional layers), and each aggregation operation contains $R$ fully-connected layers. We describe the aggregation operation below.

Definition B.1 (Aggregation operation, $\mathsf{AGG}$)

For $\forall k\in[K]$, $\mathsf{AGG}$ aggregates the information from the previous layer and performs $R$ non-linear transformations. Denoting the initial feature vector of node $u\in V$ as $h_{u}^{(0,R)}=x_{u}\in\mathbb{R}^{d}$, for an $\mathsf{AGG}$ with $R=2$ fully-connected layers, the $\mathsf{AGG}$ can be written as:

$$h^{(k,0)}_{u}=c_{u}\sqrt{\frac{c_{\sigma}}{m}}\sigma\left(\phi_{k,2}\sqrt{\frac{c_{\sigma}}{m}}\sigma\left(\phi_{k,1}\cdot c_{u}\sum_{v\in\mathcal{N}(u)\cup u}h_{v}^{(k-1,0)}\right)\right),$$

where $c_{\sigma}\in\mathbb{R}$ is a scaling factor related to initialization, $c_{u}\in\mathbb{R}$ is a scaling factor associated with neighbor aggregation, $\sigma(\cdot)$ is the ReLU activation, and the learnable parameters are $\phi_{k,r}\in\mathbb{R}^{m\times m}$ for $\forall(k,r)\in[K]\times[R]\backslash\{(1,1)\}$, while $\phi_{1,1}\in\mathbb{R}^{m\times d}$.

For notational simplicity, the GNN $F$ here is considered in the GNTK format. The weights $\phi$ of $F$ are i.i.d. sampled from a multivariate Gaussian distribution $\mathcal{N}(0,I)$. For node $u\in V$, we denote $u$'s computational graph as $G_{u}=\{V_{u},E_{u},X_{u}\}$ with $|V_{u}|=n_{u}$. Let $\langle a,b\rangle$ denote the inner product of vectors $a$ and $b$. We now define the kernel matrix of two nodes $u,v\in V$ as follows.

Definition B.2 (GNTK for node classification)

Consider the overparameterized regime for a GNN $F$, where $F$ is trained using gradient descent with an infinitesimally small learning rate. Given $n$ training samples of nodes with corresponding labels, we denote by $\mathbf{\Theta}\in\mathbb{R}^{n\times n}$ the kernel matrix of the GNTK. For $\forall u,v\in V$, $\mathbf{\Theta}_{uv}$ is the $(u,v)$ entry of $\mathbf{\Theta}$ and is defined as

$$\mathbf{\Theta}_{uv}=\mathbb{E}_{\phi\sim\mathcal{N}(0,I)}\left[\left\langle\frac{\partial F(\phi,G,u)}{\partial\phi},\frac{\partial F(\phi,G,v)}{\partial\phi}\right\rangle\right]=\mathbb{E}_{\phi\sim\mathcal{N}(0,I)}\left[\left\langle\frac{\partial F(\phi,G_{u},u)}{\partial\phi},\frac{\partial F(\phi,G_{v},v)}{\partial\phi}\right\rangle\right]\in\mathbb{R}.$$

In the GNTK formulation, an $\mathsf{AGG}$ (Definition B.1) needs to calculate 1) a covariance matrix $\mathbf{\Sigma}(G_{u},G_{v})$, and 2) the intermediate kernel values $\mathbf{\Theta}(G_{u},G_{v})$. Now, we specify the pairwise values in $\mathbf{\Sigma}(G_{u},G_{v})\in\mathbb{R}^{n_{u}\times n_{v}}$ and $\mathbf{\Theta}(G_{u},G_{v})\in\mathbb{R}^{n_{u}\times n_{v}}$. For $\forall k\in[K]$ and $\forall r\in[R]$, $\mathbf{\Sigma}_{(r)}^{(k)}(G_{u},G_{v})$ and $\mathbf{\Theta}_{(r)}^{(k)}(G_{u},G_{v})$ denote the corresponding covariance and intermediate kernel matrices for the $r$-th transformation in the $k$-th layer. Initially, we have $[\mathbf{\Sigma}^{(0)}_{(R)}(G_{u},G_{v})]_{uv}=[\mathbf{\Theta}_{(R)}^{(0)}(G_{u},G_{v})]_{uv}=\langle h_{u},h_{v}\rangle$, where $h_{u},h_{v}\in\mathbb{R}^{d}$ are the input features of nodes $u$ and $v$. Denote the scaling factor for node $u$ as $c_{u}$. $\mathbf{\Theta}^{(k)}_{(R)}(G_{u},G_{v})$ can be calculated recursively through the aggregation operation given in [7]. Specifically, we have the following two steps.

Step 1: Neighborhood Aggregation

Following the $\mathsf{AGG}$ defined above, the aggregation step of the GNTK can be performed as:

\[
\begin{aligned}
\left[\mathbf{\Sigma}^{(k)}_{(0)}(G_{u},G_{v})\right]_{uv}&=c_{u}c_{v}\sum_{u^{\prime}\in\mathcal{N}(u)\cup\{u\}}\sum_{v^{\prime}\in\mathcal{N}(v)\cup\{v\}}\left[\mathbf{\Sigma}^{(k-1)}_{(R)}(G_{u},G_{v})\right]_{u^{\prime}v^{\prime}},\\
\left[\mathbf{\Theta}^{(k)}_{(0)}(G_{u},G_{v})\right]_{uv}&=c_{u}c_{v}\sum_{u^{\prime}\in\mathcal{N}(u)\cup\{u\}}\sum_{v^{\prime}\in\mathcal{N}(v)\cup\{v\}}\left[\mathbf{\Theta}^{(k-1)}_{(R)}(G_{u},G_{v})\right]_{u^{\prime}v^{\prime}}.
\end{aligned}
\]

Step 2: $R$ transformations

Now we consider the $R$ ReLU fully-connected layers that perform non-linear transformations on the aggregated features generated in Step 1. The derivative of the ReLU activation function $\sigma(x)=\max\{0,x\}$ is denoted as $\dot{\sigma}(x)$. For $r\in[R]$ and $u,v\in V$, we define the covariance matrix and its derivative as

\[
\begin{aligned}
\left[\boldsymbol{\Sigma}_{(r)}^{(k)}\left(G_{u},G_{v}\right)\right]_{uv}&=c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\mathbf{0},\left[\boldsymbol{A}_{(r)}^{(k)}\left(G_{u},G_{v}\right)\right]_{uv}\right)}[\sigma(a)\sigma(b)],\\
\left[\dot{\boldsymbol{\Sigma}}_{(r)}^{(k)}\left(G_{u},G_{v}\right)\right]_{uv}&=c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\mathbf{0},\left[\boldsymbol{A}_{(r)}^{(k)}\left(G_{u},G_{v}\right)\right]_{uv}\right)}[\dot{\sigma}(a)\dot{\sigma}(b)],
\end{aligned}
\]

where $[\mathbf{A}_{(r)}^{(k)}(G_{u},G_{v})]_{uv}$ is an intermediate variable defined as

\[
\left[\mathbf{A}_{(r)}^{(k)}(G_{u},G_{v})\right]_{uv}=\left(\begin{array}{cc}\left[\mathbf{\Sigma}_{(r-1)}^{(k)}(G_{u},G_{u})\right]_{uu}&\left[\mathbf{\Sigma}_{(r-1)}^{(k)}(G_{u},G_{v})\right]_{uv}\\ \left[\mathbf{\Sigma}_{(r-1)}^{(k)}(G_{v},G_{u})\right]_{vu}&\left[\mathbf{\Sigma}_{(r-1)}^{(k)}(G_{v},G_{v})\right]_{vv}\end{array}\right)\in\mathbb{R}^{2\times 2}.
\]

Thus, we have

\[
\left[\mathbf{\Theta}^{(k)}_{(r)}(G_{u},G_{v})\right]_{uv}=\left[\mathbf{\Theta}^{(k)}_{(r-1)}(G_{u},G_{v})\right]_{uv}\left[\dot{\mathbf{\Sigma}}^{(k)}_{(r)}(G_{u},G_{v})\right]_{uv}+\left[\mathbf{\Sigma}^{(k)}_{(r)}(G_{u},G_{v})\right]_{uv}.
\]

The kernel matrix $\mathbf{\Theta}$ with entries $\mathbf{\Theta}_{uv}=[\mathbf{\Theta}^{(K)}_{(R)}(G_{u},G_{v})]_{uv}$ can be viewed as the GNTK kernel for node classification. The generalization ability in the NTK regime depends on this kernel matrix.
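To illustrate how Step 1 and Step 2 combine, below is a minimal NumPy sketch that computes the kernel entry $\mathbf{\Theta}_{uv}$ for the simple $K=1,R=1$ case that is also used in the proof of Lemma C.3; it assumes the aggregated features $\bar{h}_{u},\bar{h}_{v}$ have already been computed and normalized to unit norm, and it uses the closed-form ReLU expectations stated in Appendix C.

```python
import numpy as np

def gntk_node_pair(h_u_bar, h_v_bar):
    """Kernel entry Theta_{uv} for a one-layer (K=1, R=1) GNN, assuming the
    aggregated features h_bar are already normalized (||h_bar||_2 = 1)."""
    # Step 1 (neighborhood aggregation) is assumed done, yielding h_u_bar, h_v_bar
    sig0 = float(np.clip(h_u_bar @ h_v_bar, -1.0, 1.0))   # [Sigma^(1)_(0)]_{uv}
    theta0 = sig0                                          # [Theta^(1)_(0)]_{uv}
    # Step 2: one ReLU transformation, using the closed forms from Appendix C
    sig1_dot = (np.pi - np.arccos(sig0)) / (2.0 * np.pi)   # derivative covariance
    sig1 = (np.pi - np.arccos(sig0) + np.sqrt(1.0 - sig0 ** 2)) / (2.0 * np.pi)
    # recursion: Theta^(1)_(1) = Theta^(1)_(0) * Sigma_dot^(1)_(1) + Sigma^(1)_(1)
    return theta0 * sig1_dot + sig1
```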

Appendix C Missing Proofs for Theorem 6.2

In this section, we provide the detailed version and proof of Theorem 6.2.

Theorem C.1 (Full version of the generalization bound in Theorem 6.2)

Given $n$ training samples $\{(h_{i},y_{i})\}_{i=1}^{n}$ drawn i.i.d. from graph $G$, we consider any loss function $l:\mathbb{R}\times\mathbb{R}\mapsto[0,1]$ that is 1-Lipschitz in the first argument and satisfies $l(y,y)=0$. With probability at least $1-\sigma$, the generalization error of the GNTK for node classification can be upper-bounded by

\[
L_{\mathcal{D}}(F)=\mathbb{E}_{(G,y)\sim\mathcal{D}}[l(F(G),y)]=O\left(\frac{\sqrt{\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}\cdot\mathrm{tr}(\mathbf{\Theta})}}{n}+\sqrt{\frac{\log(1/\sigma)}{n}}\right).
\]

To prove the generalization bound, we make the following assumption about the labels.

Assumption C.2

For each $i\in[n]$, let $u$ denote the $i$-th training node. The label $y_{i}=[\mathbf{y}]_{i}\in\mathbb{R}$ satisfies

\[
y_{i}=\alpha_{1}\langle\bar{h}_{u},\mathbf{\beta}_{1}\rangle+\sum_{k=1}^{\infty}\alpha_{2k}\langle\bar{h}_{u},\mathbf{\beta}_{2k}\rangle^{2k},
\]

where $\alpha_{1},\alpha_{2k}\in\mathbb{R}$ and $\mathbf{\beta}_{1},\mathbf{\beta}_{2k}\in\mathbb{R}^{d}$ for $k\geq 1$, and $\bar{h}_{u}=c_{u}\sum_{v\in\mathcal{N}(u)\cup\{u\}}h_{v}\in\mathbb{R}^{d}$.

The following Lemmas C.3 and C.4 give the bounds on $\sqrt{\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}}$ and $\mathrm{tr}(\mathbf{\Theta})$, respectively.

Lemma C.3 (Bound on $\sqrt{\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}}$)

Under Assumption C.2, we have

\[
\sqrt{\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}}\leq 2|\alpha_{1}|\|\mathbf{\beta}_{1}\|_{2}+\sum_{k=1}^{\infty}\sqrt{2\pi}(2k-1)|\alpha_{2k}|\|\mathbf{\beta}_{2k}\|_{2}^{2k}=o(n).
\]

Proof. Without loss of generality, we consider a simple GNN ($K=1,R=1$) as in Appendix B and define the kernel entry on the computational graphs $G_{u},G_{v}$ of nodes $u,v\in V$ as

\[
\mathbf{\Theta}_{uv}=\left[\mathbf{\Sigma}^{(1)}_{(0)}(G_{u},G_{v})\right]_{uv}\left[\dot{\mathbf{\Sigma}}^{(1)}_{(1)}(G_{u},G_{v})\right]_{uv}+\left[\mathbf{\Sigma}^{(1)}_{(1)}(G_{u},G_{v})\right]_{uv}.
\]

We decompose $\mathbf{\Theta}\in\mathbb{R}^{n\times n}$ into $\mathbf{\Theta}=\mathbf{\Theta}^{\prime}+\mathbf{\Theta}^{\prime\prime}$, where

\[
\mathbf{\Theta}^{\prime}_{uv}=\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\left[\dot{\boldsymbol{\Sigma}}_{(1)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv},\quad\text{and}\quad\mathbf{\Theta}^{\prime\prime}_{uv}=\left[\mathbf{\Sigma}^{(1)}_{(1)}(G_{u},G_{v})\right]_{uv}.
\]

Following the proof in [7] and assuming $\|\bar{h}_{u}\|_{2}=1$, we have

\[
\begin{aligned}
\left[\dot{\boldsymbol{\Sigma}}_{(1)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}&=\frac{\pi-\arccos\left(\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\right)}{2\pi},\\
\left[\boldsymbol{\Sigma}_{(1)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}&=\frac{\pi-\arccos\left(\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\right)+\sqrt{1-\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}^{2}}}{2\pi}.
\end{aligned}
\]

Then,

\[
\begin{aligned}
\mathbf{\Theta}^{\prime}_{uv}&=\frac{1}{4}\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}+\frac{1}{2\pi}\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\arcsin\left(\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\right)\\
&=\frac{1}{4}\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}+\frac{1}{2\pi}\sum_{k=1}^{\infty}\frac{(2k-3)!!}{(2k-2)!!\cdot(2k-1)}\cdot\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}^{2k}\\
&=\frac{1}{4}\bar{h}_{u}^{\top}\bar{h}_{v}+\frac{1}{2\pi}\sum_{k=1}^{\infty}\frac{(2k-3)!!}{(2k-2)!!\cdot(2k-1)}\cdot\left(\bar{h}_{u}^{\top}\bar{h}_{v}\right)^{2k}.
\end{aligned}
\]

We denote $\Phi^{2k}$ as the feature map of the kernel at degree $2k$, so that $\langle h_{u},h_{v}\rangle^{2k}=\Phi^{2k}(h_{u})^{\top}\Phi^{2k}(h_{v})$. Following the proof in [7], we have

\[
\mathbf{\Theta}^{\prime}_{uv}=\frac{1}{4}\bar{h}_{u}^{\top}\bar{h}_{v}+\frac{1}{2\pi}\sum_{k=1}^{\infty}\frac{(2k-3)!!}{(2k-2)!!\cdot(2k-1)}\cdot\Phi^{2k}(\bar{h}_{u})^{\top}\Phi^{2k}(\bar{h}_{v}).
\]

As $\mathbf{\Theta}^{\prime\prime}$ is a positive semidefinite matrix, we have

\[
\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}\leq\mathbf{y}^{\top}{\mathbf{\Theta}^{\prime}}^{-1}\mathbf{y}.
\]

We define $y_{i}^{(0)}=\alpha_{1}\bar{h}_{u}^{\top}\mathbf{\beta}_{1}$ and $y_{i}^{(2k)}=\alpha_{2k}\Phi^{2k}(\bar{h}_{u})^{\top}\Phi^{2k}(\mathbf{\beta}_{2k})$ for each $k\geq 1$. Under Assumption C.2, the label $y_{i}$ can be rewritten as

\[
y_{i}=y_{i}^{(0)}+\sum_{k=1}^{\infty}y_{i}^{(2k)}.
\]

Then we have

\[
\sqrt{\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}}\leq\sqrt{\mathbf{y}^{\top}{\mathbf{\Theta}^{\prime}}^{-1}\mathbf{y}}\leq\sqrt{(\mathbf{y}^{(0)})^{\top}{\mathbf{\Theta}^{\prime}}^{-1}\mathbf{y}^{(0)}}+\sum_{k=1}^{\infty}\sqrt{(\mathbf{y}^{(2k)})^{\top}{\mathbf{\Theta}^{\prime}}^{-1}\mathbf{y}^{(2k)}}.
\]

When $k=0$, we have

\[
\sqrt{(\mathbf{y}^{(0)})^{\top}{\mathbf{\Theta}^{\prime}}^{-1}\mathbf{y}^{(0)}}\leq 2|\alpha_{1}|\|\mathbf{\beta}_{1}\|_{2}.
\]

When $k\geq 1$, we have

\[
\sqrt{(\mathbf{y}^{(2k)})^{\top}{\mathbf{\Theta}^{\prime}}^{-1}\mathbf{y}^{(2k)}}\leq\sqrt{2\pi}(2k-1)|\alpha_{2k}|\|\Phi^{2k}(\mathbf{\beta}_{2k})\|_{2}=\sqrt{2\pi}(2k-1)|\alpha_{2k}|\|\mathbf{\beta}_{2k}\|_{2}^{2k}.
\]

Thus,

\[
\sqrt{\mathbf{y}^{\top}\mathbf{\Theta}^{-1}\mathbf{y}}\leq 2|\alpha_{1}|\|\mathbf{\beta}_{1}\|_{2}+\sum_{k=1}^{\infty}\sqrt{2\pi}(2k-1)|\alpha_{2k}|\|\mathbf{\beta}_{2k}\|_{2}^{2k}=o(n).
\]

The bound on $\mathrm{tr}(\mathbf{\Theta})$ is simpler to prove.

Lemma C.4 (Bound on $\mathrm{tr}(\mathbf{\Theta})$)

Let $n$ denote the number of training samples. Then $\mathrm{tr}(\mathbf{\Theta})\leq 2n$.

Proof. We have $\mathbf{\Theta}\in\mathbb{R}^{n\times n}$. For each $u,v\in V$, as shown in the proof of Lemma C.3,

\[
\left[\dot{\boldsymbol{\Sigma}}_{(1)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}=\frac{\pi-\arccos\left(\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\right)}{2\pi}\leq\frac{1}{2}
\]

and

\[
\left[\boldsymbol{\Sigma}_{(1)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}=\frac{\pi-\arccos\left(\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}\right)+\sqrt{1-\left[\boldsymbol{\Sigma}_{(0)}^{(1)}\left(G_{u},G_{v}\right)\right]_{uv}^{2}}}{2\pi}\leq 1.
\]

Since $[\mathbf{\Sigma}^{(1)}_{(0)}(G_{u},G_{v})]_{uv}=\bar{h}_{u}^{\top}\bar{h}_{v}\leq 1$ under the normalization $\|\bar{h}_{u}\|_{2}=1$, we have

\[
\mathbf{\Theta}_{uv}\leq 2.
\]

Thus,

\[
\mathrm{tr}(\mathbf{\Theta})\leq 2n.
\]

Combining Theorem C.1, Lemma C.3, and Lemma C.4, it is easy to see that for a constant $c\in(0,1)$:

\[
L_{\mathcal{D}}(F)=\mathbb{E}_{(v,y)\sim G}[l(F(G,v),y)]\lesssim O(1/n^{c}).
\]

Appendix D Detailed ablation studies of NeighGen

In this section, we provide in-depth studies of NeighGen to empirically explain its power in cross-subgraph missing neighbor generation. Specifically, we first show the intermediate results of NeighGen by decomposing the generation process into the missing cross-subgraph link generation by dGen and the missing cross-subgraph neighbor feature generation by fGen. Next, we experimentally verify the necessity of training locally specialized NeighGen models. Finally, we provide a study of the FL training hyper-parameters, i.e., batch size and local epoch number, to demonstrate the robustness of FedSage+.

D.1 Intermediate results of dGen and fGen.

In this section, we study the two generative components of NeighGen, i.e., dGen and fGen, to explore their expressiveness in reconstructing missing neighbors. In particular, we analyze the outputs of dGen and fGen separately to explain how NeighGen assists in the cross-subgraph missing neighbor generation process.

As described in Section 4, both dGen and fGen are constructed as fully connected neural networks (FNNs) whose depths can be varied according to the target dataset. In principle, due to the expressiveness of FNNs [29], dGen and fGen with even very few layers have the power to approximate complex functions. The node degree and feature distributions, on the other hand, are often highly relevant to the graph structure and less complex in nature. In Fig. 6 and Table 3, we provide intermediate results on how dGen and fGen are able to recover missing neighbor numbers and features, respectively.
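For reference, the following is a minimal PyTorch-style sketch of the two FNN components; the hidden sizes, the Gaussian noise injection, and the cap `max_missing` on the number of generated neighbors are illustrative assumptions rather than the exact architecture described in Section 4.

```python
import torch
import torch.nn as nn

class DGen(nn.Module):
    """Sketch of dGen: an FNN mapping a node embedding to the (float) number of
    missing cross-subgraph neighbors; the output is rounded at reconstruction time."""
    def __init__(self, emb_dim, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, z):
        return self.mlp(z).squeeze(-1)

class FGen(nn.Module):
    """Sketch of fGen: an FNN mapping a (noised) node embedding to the features
    of up to `max_missing` generated neighbors."""
    def __init__(self, emb_dim, feat_dim, max_missing, hidden_dim=64):
        super().__init__()
        self.max_missing, self.feat_dim = max_missing, feat_dim
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, max_missing * feat_dim))

    def forward(self, z):
        noise = torch.randn_like(z)          # random variation across generations
        out = self.mlp(z + noise)
        return out.view(-1, self.max_missing, self.feat_dim)
```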

Additional details for dGen.

Fig. 6 breaks down the performance of dGen on the MSAcademic dataset with $M$=3, clearly showing the effectiveness of dGen in recovering the true number of missing neighbors. Notably, although the raw output of dGen is a float, we simply apply the round function to obtain the integer number of missing neighbors for reconstruction.

Figure 6: Prediction of dGen for nodes in MSAcademic with $M$=3.

Additional details for fGen.

As described in Section 4.1, based on the number of missing neighbors predicted by dGen, fGen further generates the features of the missing neighbors, thus recovering the incomplete neighborhoods resulting from the subgraph federated learning scenario. Regarding our ultimate goal in missing neighbor generation as described in Section 4, i.e., locally modeling the original global graph during graph convolution, we evaluate fGen by comparing the NeighGen-generated neighbors with the neighbors drawn from the original global graph and those from the original subgraph. Specifically, we present the $L_{2}$ distance between the averaged feature distributions of neighborhoods from these three types of graphs (a computation sketch follows Table 3) to show how the NeighGen-generated missing neighbors narrow the gap. For simplicity, we use $N(v)$, $N_{i}(v)$, and $N^{\prime}_{i}(v)$ to represent the first-order neighbors of node $v\in V$ drawn from the global graph $G$, the original subgraph $G_{i}$, and the mended subgraph $G^{\prime}_{i}$, respectively. Smaller values indicate that the locally drawn neighbors ($N_{i}(v)$ or $N^{\prime}_{i}(v)$) are more similar to the true neighbors from the global graph ($N(v)$). The results in Table 3 clearly show the effectiveness of fGen in recovering the true features of missing neighbors.

Table 3: Intermediate prediction evaluation for fGen.
M=3 Cora CiteSeer PubMed MSAcademic
$L_{2}(N^{\prime}_{i}(v),N(v))\pm$ std 0.0124$\pm$0.0140 0.0074$\pm$0.0097 0.0034$\pm$0.0047 1.1457$\pm$1.580
$L_{2}(N_{i}(v),N(v))\pm$ std 0.0168$\pm$0.0182 0.0101$\pm$0.0131 0.0046$\pm$0.0053 1.8690$\pm$1.8387
M=5 Cora CiteSeer PubMed MSAcademic
$L_{2}(N^{\prime}_{i}(v),N(v))\pm$ std 0.0262$\pm$0.0885 0.0065$\pm$0.0083 0.0040$\pm$0.0054 1.1245$\pm$1.5801
$L_{2}(N_{i}(v),N(v))\pm$ std 0.0309$\pm$0.0897 0.0083$\pm$0.0115 0.0053$\pm$0.0060 1.8806$\pm$1.9695
M=10 Cora CiteSeer PubMed MSAcademic
$L_{2}(N^{\prime}_{i}(v),N(v))\pm$ std 0.0636$\pm$0.2100 0.1569$\pm$0.3310 0.0056$\pm$0.0170 2.7136$\pm$4.5595
$L_{2}(N_{i}(v),N(v))\pm$ std 0.0687$\pm$0.2093 0.1586$\pm$0.3307 0.0065$\pm$0.0171 3.2985$\pm$4.5686
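As a reference for how the numbers in Table 3 can be computed, the following is a minimal NumPy sketch of the $L_{2}$ gap between averaged neighborhood feature distributions; the exact averaging protocol (per-node means averaged over the evaluated nodes) is our assumption.

```python
import numpy as np

def avg_neighborhood_feature(neighbor_sets, features):
    """Mean over nodes of the average feature vector of each node's neighbors.
    neighbor_sets: dict v -> list of neighbor ids; features: dict id -> np.ndarray."""
    per_node_means = [np.mean([features[u] for u in nbrs], axis=0)
                      for nbrs in neighbor_sets.values() if len(nbrs) > 0]
    return np.mean(per_node_means, axis=0)

def l2_gap(local_neighbors, global_neighbors, features):
    """L2 distance between averaged neighborhood feature distributions,
    i.e., L2(N_i(v), N(v)) or L2(N'_i(v), N(v)) as reported in Table 3."""
    return float(np.linalg.norm(avg_neighborhood_feature(local_neighbors, features)
                                - avg_neighborhood_feature(global_neighbors, features)))
```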

D.2 Usage of locally specialized NeighGens

To empirically explain why we need separate NeighGen models, we contrast the downstream task performance of FedSage with a globally shared NeighGen, i.e., a NeighGen obtained via FedAvg, against FedSage with locally specialized NeighGens obtained through FL, i.e., FedSage+ (a sketch of the weight-averaging baseline follows Table 4). We conduct ablation experiments on the four datasets with $M$=3, and the results are shown in Table 4. The results clearly support our explanation in Section 4.3, i.e., directly averaging NeighGen weights across the system degrades the downstream task performance, which indicates the insufficiency of FedAvg in assisting local data owners with diverse neighbor generation.

Table 4: Contrast results in node classification accuracy under $M$=3
Model Cora CiteSeer PubMed MSAcademic
FedSage (without NeighGen) 0.8656 ($\pm$0.0064) 0.7393 ($\pm$0.0034) 0.8708 ($\pm$0.0014) 0.9327 ($\pm$0.0005)
FedSage with globally shared NeighGen 0.8619 ($\pm$0.0034) 0.7326 ($\pm$0.0055) 0.8721 ($\pm$0.0012) 0.9210 ($\pm$0.0016)
FedSage+ (with locally specialized NeighGens) 0.8686 ($\pm$0.0054) 0.7454 ($\pm$0.0038) 0.8775 ($\pm$0.0012) 0.9414 ($\pm$0.0006)
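For clarity, the globally shared NeighGen baseline in Table 4 corresponds to replacing every local NeighGen's weights with a FedAvg-style average, as sketched below; the sample-count weighting and state-dict representation are our assumptions, while FedSage+ instead keeps each NeighGen locally specialized.

```python
import copy
import torch

def fedavg_neighgen(local_state_dicts, num_samples):
    """FedAvg-style averaging of local NeighGen weights (the globally shared
    baseline in Table 4), weighted by each owner's number of training samples."""
    total = float(sum(num_samples))
    avg = copy.deepcopy(local_state_dicts[0])
    for key in avg:
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(local_state_dicts, num_samples))
    return avg
```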

D.3 Experiments on Local Epoch and Batch Size

For the proposed FedSage and FedSage+, we further explore the association between the resulting classifiers' performance and different training hyper-parameters, i.e., batch size and local epoch number, which are common concerns in federated learning frameworks.

The experiments are conducted on the PubMed dataset with $M=5$. To control the variance, we fix the number of model parameter updates. Specifically, for the subgraph FL methods, i.e., FedSage and FedSage+, we fix the number of communication rounds at 50, while for the centralized learning method, i.e., GlobSage, we train the model for 50 epochs. In each scenario, we train the GlobSage model with all training samples utilized across the $M$ data owners. Test accuracy indicates how the models perform on the same set of global test samples. Results are shown in Tables 5 and 6. Every result is presented as mean ($\pm$ standard deviation).

Table 5: Node classification accuracy under different batch sizes with local epoch number as 1.
Batch Size FedSage FedSage+ GlobSage
1 0.8682 ($\pm$0.0012) 0.8782 ($\pm$0.0012) 0.8751 ($\pm$0.001)
16 0.8733 ($\pm$0.0018) 0.8814 ($\pm$0.0023) 0.8736 ($\pm$0.0013)
64 0.8696 ($\pm$0.0035) 0.8755 ($\pm$0.0047) 0.8776 ($\pm$0.0011)
Table 6: Node classification accuracy under different local epoch numbers with batch size as 64. Note that GlobSage is trained with 50 epochs, so the same GlobSage result applies to all rows.
Local Epoch FedSage FedSage+ GlobSage (50 epochs)
1 0.8696 ($\pm$0.0035) 0.8755 ($\pm$0.0047) 0.8776 ($\pm$0.0011)
3 0.8663 ($\pm$0.0003) 0.8740 ($\pm$0.0015) 0.8776 ($\pm$0.0011)
5 0.8591 ($\pm$0.0012) 0.8740 ($\pm$0.0011) 0.8776 ($\pm$0.0011)

Tables 5 and 6 both evidence that FedSage+ reliably and consistently improves over FedSage in the global node classification task. Notably, in Table 5, when the batch size is as small as 16 or 1, FedSage+ achieves even higher classification accuracy than the centralized model GlobSage, owing to the employment of NeighGen.

Table 5 reveals that the graph learning models can be affected by the batch size. As GlobSage is trained on the whole global graph rather than a set of subgraphs, it suits a larger batch size, i.e., 64, better than FedSage and FedSage+ do. Both FedSage and FedSage+, where every data owner samples from a limited subgraph, perform best with a batch size of 16. Remarkably, when the batch size equals 1, FedSage is prone to overfitting to the locally biased distribution, while FedSage+ resists this overfitting with the assistance of NeighGen, i.e., by generating cross-subgraph missing neighbors.

Table 6 shows the relation between the number of local epochs and the downstream task performance. For FedSage, more local epochs degrade the resulting model, as the aggregated local weights become more biased, while FedSage+ maintains relatively stable performance on the downstream task. Table 6 thus empirically evidences that the missing neighbor generator in FedSage+ provides additional generalization and robustness against the rapid accuracy loss brought by more local epochs.

Similar to the results in Table 2 of Section 5, FedSage and FedSage+ exhibit competitive performance even compared to the centralized model. The findings in Tables 5 and 6 further contribute to a better understanding of the robustness of FedSage+ compared to vanilla FedSage.