
Toward Model-centric Heterogeneous Federated Graph Learning: A Knowledge-driven Approach

Huilin Lai1, Guang Zeng2, Xunkai Li3, Xudong Shen2, Yinlin Zhu4, Ye Luo1, Jianwei Lu1, and Lei Zhu2
1Tongji University, 2Ant Group, 3Beijing Institute of Technology, 4Sun Yat-Sen University
Abstract

Federated graph learning (FGL) has emerged as a promising paradigm for collaborative machine learning, enabling multiple parties to jointly train models while preserving the privacy of raw graph data. However, existing FGL methods often overlook the model-centric heterogeneous FGL (MHtFGL) problem, which arises in real-world applications, such as the aggregation of models from different companies with varying scales and architectures. MHtFGL presents an additional challenge: the diversity of client model architectures hampers common learning and integration of graph representations. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework, comprising two key components: Client-side Self-Mutual Knowledge Distillation, which fosters effective knowledge sharing among clients through copilot models; and Server-side Knowledge-Aware Model Aggregation, which enhances model integration by accounting for the knowledge acquired by clients. Experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy improvement of 3.74% over baseline models in MHtFGL scenarios, while also maintaining excellent performance in homogeneous settings.

Index Terms:
Federated Graph Learning, Heterogeneity, Knowledge Distillation

I Introduction

In today’s data-driven world, graph data structures are crucial for representing complex relationships in areas like social networks [1], bioinformatics [2], and recommendation systems [3]. They effectively capture relational attributes, providing insights for machine learning. Graph Neural Networks (GNNs) [4] extend deep learning to graph data through message-passing mechanisms, enabling robust modeling by integrating local node features with global structural information. However, rising data volumes and privacy concerns challenge traditional centralized training. Centralized storage risks data breaches, while silos hinder collaboration. Federated Graph Learning (FGL) [5] addresses these issues by training models on local devices and sharing only model updates, promoting collaboration while ensuring data privacy and efficiency.

Figure 1: Illustration of the heterogeneous federated learning approaches in non-graph domains, conventional federated graph learning, and our Federated Graph Knowledge Collaboration.

As FGL applications diversify [6], the distinct needs and resource constraints of companies and individuals become increasingly pronounced. Larger enterprises typically possess extensive data collections, which enable them to implement scalable or large-scale GNN architectures [7, 8, 9]. Conversely, smaller companies with limited data holdings may favor accuracy-oriented GNN architectures or even specialized models tailored to their unique operational contexts [10]. Current FGL research primarily focuses on data-centric heterogeneity [11, 12], often neglecting the complexities introduced by model-centric heterogeneity. Data-centric heterogeneity involves variations in data distributions among clients, whereas model-centric heterogeneity encompasses differences in model architectures, presenting greater challenges. This architectural diversity impedes the effective integration of graph representations, complicating collaborative learning efforts. Furthermore, existing approaches typically operate under the assumption of homogeneous client models, thereby constraining their applicability to real-world scenarios characterized by varied client needs. Therefore, we aim to develop an innovative FGL framework that effectively addresses the model-centric heterogeneous FGL (MHtFGL) problem, ensuring that each client’s unique requirements are met while enhancing the collective learning process.

In the field of non-graph data, as illustrated in Fig. 1, researchers have attempted to tackle the challenges of model-centric heterogeneous federated learning by proposing three main strategies: personalized federated learning, prototype-based methods, and knowledge distillation-based approaches. Personalized federated approaches [13, 14] generate customized models for each client by combining global and local information, allowing for some degree of model heterogeneity, yet they still assume that other parts of the model are homogeneous, thus limiting their application in highly heterogeneous settings. Prototype-based methods [15, 16] aim to share class prototypes from all clients, but in heterogeneous model contexts, computing unbiased class prototypes becomes challenging. Additionally, when class distributions vary significantly across diverse graph datasets in FGL, the effectiveness of prototype sharing diminishes due to the minimal shared global knowledge. In contrast, knowledge distillation methods [17, 18] focus on efficiently transferring model knowledge between clients without the need to share models or data, making them particularly advantageous for completely heterogeneous federated environments. Although these methods demonstrate some efficacy, when applied directly to MHtFGL, these strategies often neglect the unique topological structures of graph data, thus reducing their effectiveness.

The application of knowledge distillation techniques within MHtFGL necessitates specific design considerations, owing to the heightened connectivity inherent in graph data compared to traditional non-graph data. This scenario demands not only the transfer of model parameters and feature representations but also the capture of intricate relationships and structural patterns among nodes. Therefore, leveraging the full potential of knowledge distillation in the MHtFGL scenario necessitates addressing two key questions. First, Q1: “what knowledge to distill” presents a challenge, as the intricate dependencies in graph data complicate this determination. Traditional knowledge distillation approaches often prioritize output layer predictions or intermediate feature representations, which may not capture the full structural information of graph data. Second, Q2: “how to effectively distill” underscores the need for models to adequately learn graph knowledge while preserving the efficiency and accuracy of the distillation process.

Based on the above analysis, we propose the Federated Graph Knowledge Collaboration framework (FedGKC) to tackle challenges from heterogeneous clients in federated graph learning. This framework leverages knowledge distillation while addressing its limitations in processing graph data. We introduce copilot models in client-side learning to accommodate heterogeneous client models and dynamically adjust server-side aggregation strategies based on knowledge gained. Specifically, the FedGKC framework consists of two main components: 1) Self-Mutual Knowledge Distillation (SMKD) on Heterogeneous Clients: We establish a bidirectional knowledge distillation process between local and copilot models. As an answer to Q1, this mechanism distills both predicted class information and neighboring node information, facilitating a deeper understanding of the graph’s topology. Additionally, we employ a multi-view data augmentation technique to enhance self-distillation, helping models recognize data consistency across different perspectives, thereby boosting their generalization and robustness. 2) Knowledge-Aware Model Aggregation (KAMA) on the Server Side: Unlike the traditional FedAvg algorithm [19], we recognize that the heterogeneity among clients signifies considerable differences in model learning efficiency. We propose an aggregation weight calculation method based on knowledge acquisition levels, considering factors such as data volume, node knowledge strength (depth of model learning about graph nodes), and node knowledge clarity (the model’s ability to differentiate between nodes). This approach enables more accurate and efficient model aggregation. These two components are combined as a response to Q2, where we not only design bidirectional and self-distillation processes, but also aggregate models according to knowledge levels.

In summary, the primary contributions of our work are as follows: (1) Problem Identification. We focus on addressing the challenges from MHtFGL, particularly the impediment to unified learning and the integration of graph representations due to divergent client model architectures in real-world deployments. (2) Innovative Framework. We introduce FedGKC, a novel framework designed to facilitate effective knowledge sharing across heterogeneous clients. This is achieved through two key mechanisms: SMKD which facilitates knowledge sharing and learning across heterogeneous models to bridge structural differences, and KAMA, which analyzes and synthesizes knowledge from diverse clients to derive more effective global insights. (3) Superior Performance. To validate the effectiveness of FedGKC, we conduct extensive experiments on multiple representative datasets. Our results demonstrate that FedGKC excels across varying numbers of clients and heterogeneous models, achieving an average accuracy improvement of 3.74% and a maximum improvement of 8.13% relative to existing baseline methods.

II Related Work

II-A Federated Graph Learning

Federated Graph Learning (FGL) applies the principles of federated learning to graph neural networks, enabling collaborative training on graph-structured data while preserving data privacy. FGL allows multiple clients to jointly learn graph representation models without exchanging original graph data. The predominant FGL research can be categorized into two settings: (i) Graph-FL includes GCFL [20] and FedStar [21], where each client collects multiple independent graphs to collaboratively address graph-level downstream tasks (e.g., graph classification). (ii) Subgraph-FL [11, 12, 22, 23], where each client holds a subgraph of an implicit global graph, aims to tackle node-level downstream tasks (e.g., node classification). Notably, previous studies have primarily focused on addressing challenges related to varying data distributions [11, 12] across different clients. However, in real-world scenarios, graph neural networks from different clients may also belong to different architectural models. We are the first to investigate this issue, aiming to design new paradigms for heterogeneous client collaboration.

II-B Model-Centric Heterogeneous Federated Learning

Model-centric heterogeneous federated learning (MHtFL) accommodates the need for personalized model architectures while offering advantages in privacy protection and model customization. Based on the level of model heterogeneity, we classify the existing MHtFL methodologies into two categories: partial heterogeneity and complete heterogeneity. Methods based on partial heterogeneity, such as LG-FedAvg [13], FedGen [24], and FedGH [14], permit the primary components of client models to be heterogeneous, although they assume the remaining (smaller) components to be homogeneous. However, clients can acquire limited global knowledge through this small shared component. In contrast, completely heterogeneous MHtFL approaches impose no architectural restrictions on client models. Classic knowledge distillation-based methods [25, 26] share model outputs over a global dataset. However, acquiring such datasets in practice can be challenging. FML [17] and FedKD [18] do not rely on global datasets but instead leverage mutual distillation between small auxiliary global models and client models. Nevertheless, in early iterations, there is a risk of exchanging ineffective knowledge when both the auxiliary and client models perform poorly. Another strategy involves sharing class prototypes [15, 16, 27]. However, biased classifier phenomena have been widely observed in federated learning when handling heterogeneous data. This bias becomes more pronounced when the models exhibit heterogeneity, leading to biased prototypes, which challenges the aggregation of global knowledge for classification [28]. Notably, the aforementioned methods have not been tailored for the graph data domain. To address the challenges of model-centric heterogeneous federated graph learning (MHtFGL) scenarios, our approach innovatively employs knowledge distillation techniques to facilitate the transfer of graph information within heterogeneous federated contexts.

II-C Knowledge Distillation on Graph

Knowledge distillation [29] was originally proposed to transfer knowledge from larger, more complex models to smaller, more efficient ones to conserve computational resources. In graph learning, knowledge distillation can facilitate knowledge transfer across different models, where the key lies in determining what knowledge to distill and how to do so. Three types of information can be regarded as transferable knowledge: output logits, graph structure, and embeddings. Logits represent the inputs to the final Softmax layer, and some methods aim to minimize the differences between probability distributions from different models after computing logits to extract knowledge [30, 31, 32]. The graph structure delineates the connectivity and relationships between graph elements (such as nodes and edges), with some approaches aligning nodes based on their neighboring relationships to extract structural information [33, 34]. In addition to the logits and graph structures, there are studies that utilize node embeddings learned from intermediate layers of models to guide the training of another model [35, 36, 37]. Regarding distillation methods, direct distillation minimizes differences between models, promoting intuitive knowledge transfer [31]; adaptive distillation offers flexibility by considering the importance of different knowledge components. Methods like RDD [38], FreeKD [39], and MulDE [30] adjust the distillation process based on the certainty or correctness of the knowledge, allowing models to focus more on reliable information while avoiding knowledge that is less informative or noisy. Furthermore, various distillation strategies tailored for specific tasks exist within machine learning [36]. Our study aims to ensure the preservation and transfer of effective graph information within heterogeneous federated scenarios, thereby facilitating collaborative training among multiple clients.

III Methodology

Figure 2: The overall architecture of our method, including a Self-Mutual Knowledge Distillation strategy for client-side knowledge sharing and a Knowledge-Aware Model Aggregation strategy for effective server-side model integration.

III-A Preliminaries

In this study, we propose a model-centric heterogeneous federated graph learning framework designed to leverage diverse graph data and GNN models from multiple clients for collaborative training, while ensuring the protection of data privacy.

Let there be K clients, where the graph data owned by client k is represented as G_{k}=(V_{k},E_{k}) with |V_{k}|=N_{k} nodes and |E_{k}|=M_{k} edges. The graph’s adjacency matrix, node feature matrix, and label matrix are denoted as A_{k}\in\mathbb{R}^{N_{k}\times N_{k}}, X_{k}\in\mathbb{R}^{N_{k}\times f}, and Y_{k}\in\mathbb{R}^{N_{k}\times C}, respectively, where f represents the number of node features and C denotes the number of classes. Note that the graph data and the GNN model architectures can vary among different clients. GNNs constitute a class of deep learning models tailored to perform feature embedding and inference tasks on graph-structured data, requiring an adjacency matrix A_{k} and node features X_{k} as inputs. Federated learning represents a collaborative machine learning paradigm that enables the training of a shared model across multiple data owners without the direct exchange of raw data. Each training round involves the selection of participating clients, the update of the local model via local training on each chosen client, the aggregation of local model parameters on a central server, and the subsequent distribution of the aggregated parameters back to the clients. The objective of this work is to enhance the performance of each client’s local model on graph node classification tasks in the heterogeneous federated environment.

III-B Overview

Figure 2 illustrates an overview of the FedGKC architecture that we designed. First, the parameters of all models are initialized. Then, in each round, the following steps are executed: (1) Client local models and their copilot models are trained using Self-Mutual Knowledge Distillation (SMKD); (2) At the server, Knowledge-Aware Model Aggregation (KAMA) is performed on the copilot models; (3) The aggregated model parameters are distributed back to all client copilot models. Steps (1) to (3) are repeated for multiple rounds until convergence, allowing each client’s local model to acquire global knowledge without transmitting graph data and steering local optimization toward the global objective.
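To make the round structure concrete, the following is a minimal PyTorch-style sketch of the FedGKC loop. The helper names (the Client interface, train_smkd, knowledge_aware_aggregate) are illustrative assumptions rather than the paper’s released implementation; the aggregation routine stands in for the KAMA procedure detailed in Section III-D.

import copy

def run_fedgkc(clients, copilot_init, rounds):
    # clients: objects holding a heterogeneous local model, a shared-architecture
    # copilot model, and the private subgraph (illustrative interface).
    for c in clients:  # initialize every copilot with identical parameters
        c.copilot.load_state_dict(copy.deepcopy(copilot_init.state_dict()))
    for _ in range(rounds):
        reports = []
        # (1) Client side: Self-Mutual Knowledge Distillation (SMKD)
        for c in clients:
            n_k, p_k = c.train_smkd()  # returns data volume N_k and knowledge level P_k
            reports.append((c.copilot.state_dict(), n_k, p_k))
        # (2) Server side: Knowledge-Aware Model Aggregation (KAMA), Eqs. (6)-(12)
        global_state = knowledge_aware_aggregate(reports)
        # (3) Broadcast the aggregated parameters back to every copilot model
        for c in clients:
            c.copilot.load_state_dict(copy.deepcopy(global_state))
    return [c.local_model for c in clients]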

III-C Self-Mutual Knowledge Distillation

The model learning process on the client side consists of two main components. Firstly, there is a mutual knowledge learning process between the local model and the copilot model. This process not only allows the copilot model to acquire more localized knowledge but also enables the local model to obtain global knowledge that includes information from other clients. Secondly, to enhance the predictive capability of the local model, we generate two different views using data augmentation from the same graph data to implement self-distillation. This ensures that the local model generates consistent embedding features and prediction results for the same data across different views.

Mutual distillation addressing model heterogeneity: In traditional knowledge distillation methods [32], knowledge is typically transferred by aligning the prediction results of different models on the same data. However, merely aligning the prediction results may limit the knowledge learned by the models, resulting in a lack of comprehensiveness and representativeness. In the context of graph data processed by GNNs, node representations are primarily derived from capturing neighborhood information. Therefore, we propose propagating neighborhood knowledge from one model to the other to more effectively leverage the topological information captured by GNNs, formulated as follows:

\mathcal{L}_{\mathrm{neigh}}=\frac{1}{|\mathcal{N}|}\sum_{i\in\mathcal{N}}\sum_{j\in\mathcal{V}_{i}\cup i}\mathcal{D}(\sigma(z_{j}),\sigma(h_{i})), (1)

where z and h represent the feature embedding outputs of the feature extractors of the two models, respectively, \mathcal{V}_{i} represents the set of neighbor nodes of node i, \mathcal{N} denotes the set of all nodes (so that |\mathcal{N}| is the total number of nodes), \sigma=\mathrm{softmax}(\cdot) denotes an activation function, and \mathcal{D}(\cdot) denotes a distance function, such as the Kullback-Leibler (KL) divergence [29]. For the copilot model, its optimization objective can be denoted as:

\mathcal{L}_{\text{cop-mutu}}=\alpha\mathcal{L}_{CE}(y,p_{c})+\beta\mathcal{L}_{\mathrm{neigh}}+(1-\alpha-\beta)\mathcal{L}_{KL}(p_{l}||p_{c}), (2)

where \alpha and \beta are weights that balance the influence of the different distillation losses, and \mathcal{L}_{CE}(y,p_{c})=-y\log(\mathrm{softmax}(p_{c})) represents the cross-entropy loss between the copilot model prediction p_{c} and the ground-truth label y. Here, \mathcal{L}_{KL} is the Kullback-Leibler (KL) divergence, and p_{l} and p_{c} are the predictions of the local and copilot models, respectively.

Similarly, the local model needs to learn information from the copilot model while training on local data, and its optimization objective is

\mathcal{L}_{\text{local-mutu}}=\alpha\mathcal{L}_{CE}(y,p_{l})+\beta\mathcal{L}_{\mathrm{neigh}}+(1-\alpha-\beta)\mathcal{L}_{KL}(p_{c}||p_{l}), (3)

where \alpha and \beta are the same balancing weights as in Eq. (2), and p_{l} and p_{c} are the predictions of the local and copilot models, respectively.
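As a concrete illustration, the losses in Eqs. (1)-(3) could be realized in PyTorch roughly as follows. This is a minimal sketch under several assumptions: \mathcal{D} is taken to be the KL divergence, edge_index is a [2, M] list of directed neighbor pairs (with both directions present for undirected graphs), p_c and p_l are pre-softmax logits, the teacher distribution in each KL term is detached, and the default weights use the values \alpha=0.6 and \beta=0.2 reported later in the implementation details.

import torch
import torch.nn.functional as F

def neighborhood_distillation_loss(z, h, edge_index):
    # Eq. (1): z, h are [N, d] embeddings of the two models; edge_index holds pairs (j, i)
    # meaning node j is a neighbor of node i.
    src, dst = edge_index
    self_idx = torch.arange(z.size(0), device=z.device)
    src = torch.cat([src, self_idx])          # add the j = i term from V_i ∪ i
    dst = torch.cat([dst, self_idx])
    target = F.softmax(z[src], dim=-1)        # σ(z_j), treated as the target distribution
    log_pred = F.log_softmax(h[dst], dim=-1)  # σ(h_i) in log space
    kl = F.kl_div(log_pred, target, reduction="none").sum(-1)
    return kl.sum() / z.size(0)               # normalize by the number of nodes |N|

def kl_term(teacher_logits, student_logits):
    # L_KL(teacher || student); the teacher is detached so each model is only
    # updated through its own objective.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits.detach(), dim=-1),
                    reduction="batchmean")

def copilot_objective(y, p_c, p_l, l_neigh, alpha=0.6, beta=0.2):
    # Eq. (2): supervised CE + neighborhood distillation + distillation from the local model
    return (alpha * F.cross_entropy(p_c, y)
            + beta * l_neigh
            + (1.0 - alpha - beta) * kl_term(p_l, p_c))

def local_objective(y, p_l, p_c, l_neigh, alpha=0.6, beta=0.2):
    # Eq. (3): the symmetric objective for the local model, distilling from the copilot
    return (alpha * F.cross_entropy(p_l, y)
            + beta * l_neigh
            + (1.0 - alpha - beta) * kl_term(p_c, p_l))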

Self-distillation mining model potential: To further stimulate the predictive performance of the local model, we introduce a self-distillation strategy. We first create two perspectives on the same graph data — a strongly augmented view and a weakly augmented view. This dual-perspective strategy effectively enhances the diversity of graph data by applying varying degrees of edge removal and feature masking [40]. The weakly augmented view retains more original information, while the strongly augmented view significantly obscures and removes more graph information. We expect the model’s output based on the strongly augmented view to closely align with the prediction results of the weakly augmented view for the same input. This aims to ensure that, even when faced with substantial information alteration, the local model maintains consistency in its prediction results, thereby deepening its understanding of the data’s intrinsic nature. The optimization objective for the self-distillation of the local model is denoted as

\mathcal{L}_{\text{self-distill}}=\mathcal{L}_{MSE}(e_{l}^{\mathrm{weak}},e_{l}^{\mathrm{strong}})+\mathcal{L}_{KL}(p_{l}^{\mathrm{weak}}||p_{l}^{\mathrm{strong}}), (4)

where \mathcal{L}_{MSE} denotes the mean-squared error loss, and e and p represent the embedding output of the model feature extractor and the prediction output of the classifier, respectively.
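A minimal sketch of the self-distillation step in Eq. (4) is given below; the augmentation rates, the choice of the weak view as the (detached) target, and a local model returning an (embedding, logits) pair are assumptions of this illustration.

import torch
import torch.nn.functional as F

def augment(x, edge_index, drop_edge_p, mask_feat_p):
    # Random edge removal and feature masking; the two rates control augmentation strength.
    keep = torch.rand(edge_index.size(1), device=edge_index.device) > drop_edge_p
    mask = (torch.rand_like(x) > mask_feat_p).float()
    return x * mask, edge_index[:, keep]

def self_distillation_loss(local_model, x, edge_index):
    x_w, e_w = augment(x, edge_index, drop_edge_p=0.1, mask_feat_p=0.1)  # weak view
    x_s, e_s = augment(x, edge_index, drop_edge_p=0.4, mask_feat_p=0.4)  # strong view
    emb_weak, p_weak = local_model(x_w, e_w)
    emb_strong, p_strong = local_model(x_s, e_s)
    # Eq. (4): align embeddings with MSE and predictions with KL, weak view as target
    mse = F.mse_loss(emb_strong, emb_weak.detach())
    kl = F.kl_div(F.log_softmax(p_strong, dim=-1),
                  F.softmax(p_weak.detach(), dim=-1),
                  reduction="batchmean")
    return mse + kl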

In summary, the overall optimization objective for the local model integrates the mutual distillation objective with the self-distillation objective:

\mathcal{L}_{\text{local}}=\mathcal{L}_{\text{local-mutu}}+\mathcal{L}_{\text{self-distill}}. (5)

The optimization goal for the copilot model is shown in Eq. (2). Through the coordinated optimization of these two components, we aim to significantly enhance the learning efficiency and global knowledge acquisition capability of the local model in heterogeneous environments.

III-D Knowledge-Aware Model Aggregation

Given the heterogeneity among clients, the learning efficiency of each client’s model varies accordingly. Traditional federated learning methods typically employ an averaging aggregation strategy [19], which does not take into account the differences among the models from various clients. This oversight can result in lower-performing models adversely affecting overall performance. To address this issue, we propose a Knowledge-Aware Model Aggregation mechanism that dynamically assigns weights based on the information content and learning knowledge levels of each client’s model. This dynamic weighting aggregation enhances collaborative efficiency by ensuring that the models from clients with rich knowledge do not suffer from the shortcomings of those with limited knowledge. Simultaneously, it allows clients with lesser knowledge to receive more guidance from the knowledgeable clients, thereby improving the model’s convergence and robustness.

Specifically, when aggregating the copilot models from each client, the server first considers the data volume from an overall perspective and calculates volume weight accordingly, denoted as:

w_{k}^{vol}=\frac{\mathcal{N}_{k}}{\sum_{k=1}^{K}\mathcal{N}_{k}}, (6)

where \mathcal{N}_{k} indicates the amount of data on the k-th client.

Next, it assesses the knowledge levels learned by each node in the copilot model. Ideally, each node should focus on relevant knowledge associated with its respective category, implying that the knowledge should be strong and clear. We measure knowledge level from two dimensions: knowledge strength and knowledge clarity. The knowledge strength at a node can be represented by the maximum value of the output prediction probabilities, which intuitively reflects the model’s confidence in recognizing a particular category, denoted as:

Q_{i}^{stren}=\text{max}(p_{i}), (7)

where \text{max}(\cdot) returns the maximum value of the prediction vector p_{i} for the i-th node. On the other hand, assessing knowledge clarity is more complex, as it involves evaluating the distinction between the predicted result of this node and those of the other categories, expressed as \text{max}(p_{i})-\sum{p_{i}^{*}}, where p^{*} represents the set of values other than the maximum value. Additionally, we must be wary of the over-smoothing phenomenon that can occur in GNNs due to interactions with neighboring nodes, which may compromise the model’s discriminatory ability. We define the clarity evaluation formula as:

Q_{i}^{clar}=\frac{1}{M-1}(\text{max}(p_{i})-\sum{p_{i}^{*}})-\lambda(\frac{1}{|\mathcal{V}_{i}|}\sum_{j\in\mathcal{V}_{i}}\text{simi}(p_{i},p_{j})), (8)

where M represents the number of categories, \mathcal{V}_{i} represents the set of neighbor nodes of node i, \lambda is the weight of the over-smoothing effect, and \text{simi}(\cdot) represents the cosine similarity calculation function. Then, the degree of knowledge mastery of each client is quantified by:

\mathcal{P}_{k}=\frac{1}{\mathcal{N}_{k}}\sum_{i=1}^{\mathcal{N}_{k}}(Q_{i}^{stren}+Q_{i}^{clar}), (9)

where \mathcal{N}_{k} indicates the amount of data on the k-th client. The knowledge-related weights are expressed as:

w_{k}^{knowled}=\frac{\mathcal{P}_{k}}{\sum_{k=1}^{K}\mathcal{P}_{k}}. (10)

By considering data volume, node knowledge strength, and node knowledge clarity, we quantify the learning knowledge level of each client’s copilot model. Consequently, the aggregation weight for the k-th client’s model can be determined as follows:

w_{k}=\frac{1}{2}(w_{k}^{vol}+w_{k}^{knowled}). (11)

The model aggregation formula is:

\theta_{g}=\sum_{k=1}^{K}w_{k}\cdot\theta_{c,k}, (12)

where \theta_{c,k} denotes the parameters of the k-th client’s copilot model.
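The server-side computation of Eqs. (6)-(12) can be sketched as follows; the input format (per-client softmax outputs, neighbor index lists, and floating-point state dictionaries) is an assumption of this illustration rather than a prescribed interface.

import torch
import torch.nn.functional as F

def knowledge_level(probs, neighbors, lam=0.1):
    # probs: [N_k, M] softmax outputs of a copilot model on its client's nodes;
    # neighbors: list of index tensors, neighbors[i] = neighbor ids of node i.
    M = probs.size(1)
    strength = probs.max(dim=1).values                 # Eq. (7): Q_i^stren
    rest = probs.sum(dim=1) - strength                 # sum of the non-maximum entries p_i^*
    clarity = (strength - rest) / (M - 1)              # first term of Eq. (8)
    for i, nbrs in enumerate(neighbors):               # over-smoothing penalty of Eq. (8)
        if len(nbrs) > 0:
            sim = F.cosine_similarity(probs[i].unsqueeze(0), probs[nbrs], dim=1)
            clarity[i] -= lam * sim.mean()
    return (strength + clarity).mean().item()          # Eq. (9): P_k

def kama_aggregate(states, data_volumes, knowledge_levels):
    # states: list of copilot state_dicts; the remaining arguments are N_k and P_k per client.
    n = torch.tensor(data_volumes, dtype=torch.float)
    p = torch.tensor(knowledge_levels, dtype=torch.float)
    w = 0.5 * (n / n.sum() + p / p.sum())              # Eqs. (6), (10), (11)
    return {key: sum(w[k] * states[k][key] for k in range(len(states)))  # Eq. (12)
            for key in states[0]}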

Finally, the aggregated model parameters will be distributed back to each client’s copilot model in preparation for the next round of client training. Through this meticulous adjustment, our aggregation strategy effectively integrates multi-source information, achieving a more efficient and equitable model aggregation process. The complete algorithm of FedGKC is presented in Algorithm 1.

Algorithm 1 The complete algorithm of FedGKC
1: Input: communication rounds T; initial model parameters \theta; the number of clients K;
2: Output: local models for all clients;
3: Initialize all client-side copilot and local models;
4: for t=1,\cdots,T do
5:     // Client Side:
6:     for k=1,\cdots,K in parallel do
7:         Optimize the copilot model according to Eq. (2);
8:         Optimize the local model according to Eq. (5);
9:         Record the data volume \mathcal{N}_{k};
10:         Calculate the knowledge level \mathcal{P}_{k} through Eq. (9);
11:     end for
12:     // Server Side:
13:     Compute the model aggregation weights w_{k} by Eq. (11);
14:     Update the aggregated model parameters by Eq. (12);
15:     Send the aggregated model parameters to the client-side copilot models: \theta_{c,k}\leftarrow\theta_{g};
16:end for

IV Experiments

In this section, a comprehensive evaluation of FedGKC is provided through a series of experiments. Initially, the details of the datasets utilized and the experimental setup are delineated. Subsequently, the experiments aim to answer the following key questions: Q1: Does FedGKC exhibit superior performance compared to state-of-the-art baseline methods? Q2: What are the sources contributing to the performance enhancement of FedGKC? Q3: Is FedGKC capable of generalizing effectively across varying copilot model architectures? Q4: How resilient is the performance of FedGKC to fluctuations in hyperparameter values? Q5: What is the computational complexity of FedGKC?

IV-A Experimental Setup

IV-A1 Datasets

Experiments are conducted utilizing eight commonly employed benchmark datasets within the domain of graph learning. These consist of three small-scale citation network datasets (Cora, CiteSeer, PubMed) [41], two medium-scale co-authorship network datasets (Coauthor CS, Coauthor Physics) [42], two medium-scale user-item datasets (Amazon Photo, Amazon Computer) [43], and one large-scale open graph benchmark dataset (ogbn-arxiv) [44]. Further particulars regarding these datasets are delineated in Table I.

TABLE I: Statistics of the eight employed benchmark datasets.
Dataset Nodes Features Edges Classes Train/Val/Test
Cora 2,708 1,433 5,429 7 20%/40%/40%
CiteSeer 3,327 3,703 4,732 6 20%/40%/40%
PubMed 19,717 500 44,338 3 20%/40%/40%
CS 18,333 6,805 81,894 15 20%/40%/40%
Physics 34,493 8,415 247,692 5 20%/40%/40%
Photo 7,487 745 119,043 8 20%/40%/40%
Computer 13,381 767 245,778 10 20%/40%/40%
ogbn-arxiv 169,343 128 2,315,598 40 60%/20%/20%

IV-A2 Baselines

TABLE II: Performance comparison of different methods across clients with different model architectures. The best results are highlighted in bold.
Dataset & Client Cora CiteSeer PubMed ogbn-arxiv
Method 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg.
LG-FedAvg [13] 78.78 78.11 70.52 75.80 65.64 62.74 58.57 62.32 84.52 82.74 80.95 82.74 61.32 61.29 61.62 61.41
FedProto [15] 78.50 77.75 74.13 76.79 59.77 59.26 57.28 58.77 84.71 82.77 81.74 83.07 38.26 44.82 48.19 43.76
FML [17] 80.33 79.90 75.37 78.53 64.67 62.52 60.39 62.53 84.39 83.05 82.23 83.22 52.15 58.25 60.98 57.13
FedKD [18] 80.32 78.64 73.01 77.32 65.19 62.22 59.58 62.33 85.37 83.21 82.19 83.59 51.77 61.31 63.34 58.81
FedTGP [16] 79.50 78.47 74.58 77.52 61.10 59.47 59.30 59.96 85.01 83.36 81.34 83.24 42.08 49.43 54.24 48.58
Ours 81.42 80.71 75.38 79.17 65.78 64.96 61.40 64.05 85.54 83.48 82.42 83.81 61.47 61.58 64.29 62.45
Dataset & Client CS Physics Photo Computer
Method 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg.
LG-FedAvg [13] 89.07 87.16 82.69 86.31 94.26 93.79 92.24 93.43 90.41 86.81 85.11 87.44 87.42 85.99 83.79 85.73
FedProto [15] 84.09 83.49 81.24 82.94 92.87 92.54 91.81 92.41 71.93 75.38 79.37 75.56 57.61 75.55 71.35 68.17
FML [17] 87.34 85.93 83.34 85.54 93.03 92.58 92.11 92.57 89.80 86.84 79.97 85.54 87.64 83.08 78.18 82.97
FedKD [18] 88.47 86.46 83.43 86.12 93.21 92.81 91.86 92.63 89.90 84.96 85.37 86.74 78.53 83.69 78.47 80.23
FedTGP [16] 85.88 84.99 82.29 84.39 93.42 93.12 92.26 92.93 75.31 81.04 80.13 78.83 67.30 73.76 71.85 70.97
Ours 88.62 87.31 84.04 86.66 94.41 93.65 92.66 93.57 91.19 89.65 88.71 89.85 88.17 86.80 85.84 86.94
TABLE III: Performance comparison of different methods across clients with different model scales. The best results are highlighted in bold.
Dataset & Client Cora CiteSeer PubMed ogbn-arxiv
Method 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg.
LG-FedAvg [13] 79.50 79.90 73.43 77.61 66.16 65.47 58.62 63.42 84.48 83.17 80.70 82.78 62.24 65.05 65.32 64.20
FedProto [15] 77.59 78.19 72.55 76.11 62.08 57.70 57.03 58.94 84.71 83.08 81.27 83.02 50.06 53.92 57.04 53.67
FML [17] 80.42 79.18 74.23 77.94 66.23 64.30 61.10 63.88 84.76 83.13 81.43 83.11 54.89 62.43 64.62 60.65
FedKD [18] 80.23 79.81 74.13 78.06 66.46 65.92 60.30 64.23 85.08 83.25 82.21 83.51 59.71 61.06 63.68 61.48
FedTGP [16] 78.23 78.65 73.25 76.71 63.12 58.60 58.55 60.09 84.43 83.32 81.12 82.96 53.02 58.10 59.23 56.78
Ours 80.05 80.25 74.31 78.20 67.80 66.07 61.75 65.21 85.25 83.94 81.61 83.60 63.15 65.80 66.09 65.01
Dataset & Client CS Physics Photo Computer
Method 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg. 5 10 20 Avg.
LG-FedAvg [13] 90.64 88.04 85.32 88.00 94.42 93.87 93.45 93.91 89.96 87.91 85.59 87.82 85.77 85.53 84.76 85.35
FedProto [15] 83.95 83.08 81.80 82.94 93.12 92.42 91.86 92.47 71.35 73.48 80.17 75.00 48.34 70.88 73.44 64.22
FML [17] 89.23 86.65 85.76 87.21 94.60 93.91 92.11 93.54 82.65 87.68 85.50 85.28 86.19 85.28 84.60 85.36
FedKD [18] 89.48 87.03 85.88 87.46 94.51 94.00 92.76 93.76 84.66 85.70 86.01 85.46 77.84 80.35 81.90 80.03
FedTGP [16] 86.84 85.49 83.81 85.38 93.92 92.57 93.03 93.17 71.42 73.72 81.81 75.65 66.73 67.63 70.27 68.21
Ours 90.74 88.92 86.50 88.72 94.62 94.18 93.54 94.11 90.87 89.78 89.06 89.90 87.69 86.51 85.50 86.57
TABLE IV: Comparison of our method with other federated methods with 10 clients in non-heterogeneous settings. The best results are highlighted in bold.
CS Physics Photo Computer Avg.
FedAvg [19] 81.06 92.71 78.02 66.90 79.67
MOON [45] 85.02 94.37 80.37 69.50 82.32
FedDC [46] 83.71 93.04 72.99 63.35 78.27
FedSage+ [23] 84.83 92.96 79.21 68.84 81.46
Fed-Pub [22] 88.58 92.11 89.17 86.54 89.10
FGSSL [11] 86.47 92.66 87.94 85.50 88.14
FedGTA [12] 87.37 93.80 87.74 84.58 88.37
AdaFGL [47] 85.77 93.88 86.30 83.29 87.31
Ours 88.73 94.47 90.30 87.16 90.17
TABLE V: The experimental results of incorporating the proposed key components into the baseline. “SMKD” indicates Self-Mutual Knowledge Distillation, and “KAMA” indicates Knowledge-Aware Model Aggregation.
Dataset & Client Cora CiteSeer PubMed
SMKD KAMA Client 5 Client 10 Client 20 Avg. Client 5 Client 10 Client 20 Avg. Client 5 Client 10 Client 20 Avg.
✗ ✗ 80.50 79.54 75.21 78.42 64.74 62.73 60.90 62.82 84.64 83.15 81.87 83.22
✓ ✗ 81.42 80.44 74.50 78.79 64.44 64.29 60.88 63.20 84.54 83.37 82.51 83.47
✗ ✓ 81.24 79.73 75.03 78.67 65.04 63.11 61.03 63.06 85.05 83.20 82.27 83.51
✓ ✓ 81.42 80.71 75.38 79.17 65.78 64.96 61.40 64.05 85.54 83.48 82.42 83.81

The proposed FedGKC methodology is compared with alternative heterogeneous federated learning strategies. It is important to note that, given the absence of established strategies specifically designed for handling heterogeneous models within the realm of federated graph learning, we re-implemented these methods under federated graph learning conditions. This re-implementation was based on the available source code of relevant techniques, including personalization-based strategies such as LG-FedAvg [13], prototype-based approaches such as FedProto [15] and FedTGP [16], as well as knowledge distillation-based methods like FML [17] and FedKD [18]. Because the personalization-based method requires partial homogeneity, we equip all clients with the same classifier when implementing LG-FedAvg; beyond this, all client models remain completely heterogeneous.

IV-A3 Heterogeneity Simulations

To partition data across clients, the Louvain [48] community detection algorithm is employed to simulate subgraph systems by partitioning the graph into multiple clusters, which are subsequently allocated to distributed clients. To ensure uniformity and fairness in the experimental setup, experiments are conducted using configurations of 5, 10, and 20 clients. For model heterogeneity settings, we consider two scenarios. The first scenario involves different model architectures, utilizing five distinct two-layer graph neural network architectures: GCN [49], GAT [10], GraphSAGE [50], GIN [51], and SGC [52]. The second scenario addresses variations in model scale. Leveraging the technology of deep GNNs [9], we employ a lightweight two-layer SGC, a conventional two-layer GCN, as well as deeper models including a four-layer GCN, a six-layer GCN, and an eight-layer GCN. For the allocation of models to clients, given K clients, the model architecture assigned to the k-th client is determined by (k \bmod 5), k\in\{1,\cdots,K\}. The parameters of each assigned model architecture are re-initialized for every respective client.
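For clarity, the cyclic (k mod 5) assignment described above can be written as the following small sketch; the architecture identifiers are labels only, and the actual model constructors are left out.

ARCHITECTURE_POOL = ["GCN", "GAT", "GraphSAGE", "GIN", "SGC"]  # architecture-heterogeneity setting
SCALE_POOL = ["SGC-2layer", "GCN-2layer", "GCN-4layer", "GCN-6layer", "GCN-8layer"]  # scale-heterogeneity setting

def assign_models(num_clients, pool):
    # Client k receives pool[k mod 5]; each client re-initializes its own parameters.
    return {k: pool[k % 5] for k in range(num_clients)}

# Example: assign_models(10, ARCHITECTURE_POOL) cycles clients 0-9 over the five GNNs.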

IV-A4 Implementation Details

The experiments are conducted on an NVIDIA GeForce RTX 3090 GPU. Unless otherwise specified, the copilot model employed is a two-layer GCN architecture. The evaluation metric is node classification accuracy on the test set of each client’s subgraph, with performance averaged across all clients. The Adam optimizer is selected for the optimization process. For hyperparameter values, \alpha and \beta in Eq. (2) and Eq. (3) are set to 0.6 and 0.2, respectively, and \lambda in Eq. (8) is set to 0.1.

IV-B Performance Comparison

To answer Q1, we conduct comprehensive comparative experiments to demonstrate the superior performance of FedGKC. Specifically, we test two scenarios under heterogeneous settings and perform further comparisons under non-heterogeneous settings. 1) Clients with Different Model Architectures. In the first scenario, we evaluate the performance of FedGKC in a setting where client models have different architectures. The experimental results, detailed in Table II, show that FedGKC significantly outperforms multiple baseline methods in this scenario. Knowledge distillation-based methods, like FML and FedKD, have demonstrated performance advantages in heterogeneous environments. In comparison to these methods, FedGKC achieves greater performance enhancements, with an average improvement of 2.31% over FML and 2.34% over FedKD. This indicates that FedGKC not only effectively handles model architectural heterogeneity but also achieves better knowledge transfer and model fusion across different architectures. 2) Clients with Different Model Scales. In the second scenario, we consider variations in model scale arising from differences in the computational capabilities of client devices. As shown in Table III, FedGKC demonstrates superior performance on the test set compared to other methods, with an average of 1.80% higher accuracy than FML and 2.17% higher accuracy than FedKD. This confirms that FedGKC not only adapts to heterogeneity in model architecture but also supports and performs well with models of different scales. 3) Comparative Experiments under Non-Heterogeneous Settings. In addition to the experiments under heterogeneous settings, we also conduct comparative experiments in non-heterogeneous settings to evaluate the performance of FedGKC in traditional federated learning environments. The results, presented in Table IV, show that despite being specifically designed for heterogeneous client scenarios, FedGKC performs exceptionally well even in non-heterogeneous settings, outperforming the baseline approach FedAvg by 10.5% on average. FedGKC even surpasses federated learning methods tailored for homogeneous clients, exceeding the second-best method Fed-Pub by an average of 1.07%, which further validates the generality and robustness of FedGKC.

IV-C Ablation Study

To answer Q2, we conduct a series of ablation experiments focusing on two critical modules introduced in FedGKC: Self-Mutual Knowledge Distillation (SMKD) and Knowledge-Aware Model Aggregation (KAMA). Table V illustrates the performance enhancements achieved by each module when applied independently to the baseline method. In terms of implementation, when the SMKD module is added alone, we employ an aggregation strategy based on sample size on the server; when incorporating the KAMA module alone, a traditional dual-model architecture is utilized on the clients. Notably, the SMKD module demonstrates greater improvements on the Cora and CiteSeer datasets, while the KAMA module contributes more on the PubMed dataset, suggesting that it is particularly advantageous on larger-scale datasets. The combination of both modules further enhances overall performance.

To gain deeper insights into the contributions of the internal mechanisms of the SMKD and KAMA modules, we conduct a more detailed ablation study on the CiteSeer dataset with 10 clients, with the experimental results presented in Fig. 3(a). First, within the SMKD module, we test the effects of removing self-distillation and mutual distillation on model performance. The results indicate that the absence of the self-distillation mechanism limits the model’s self-optimization capabilities, while neglecting mutual distillation leads to the model relying solely on local data, thereby failing to effectively assimilate global information and diminishing overall performance. Next, within the KAMA module, we focus on the elements that constitute the knowledge weights, including knowledge strength (as shown in Eq. (7)), the clarity of class-related knowledge (the first term of Eq. (8)), and the clarity of smooth associations (the second term of Eq. (8)). The experimental results reveal that knowledge strength significantly impacts model performance, whereas the effect of knowledge clarity is relatively minor; nevertheless, it still contributes positively to the model’s stability and accuracy.

IV-D Generalization and Sensitivity Analysis

To answer Q3, we investigate the impact of replacing different architectures of the copilot model on overall performance, as illustrated in Fig. 3(b). Our FedGKC is not restricted to specific copilot model architectures. In fact, our method demonstrates significant performance improvements across most widely adopted model architectures, further validating the flexibility and efficacy of FedGKC.

To answer Q4, we evaluate the performance of FedGKC on the CS dataset with 20 clients across various hyperparameter settings. Specifically, we explore different combinations of the trade-off parameters \alpha and \beta in Eq. (2) and Eq. (3). The experimental results are shown in Fig. 4(a), where we find that the method performs optimally with \alpha set between 0.6 and 0.7 and \beta maintained within the range of 0.1 to 0.2. Based on these findings, we finalize our hyperparameters with \alpha=0.6 and \beta=0.2. Additionally, we conduct a sensitivity analysis on \lambda in Eq. (8), which represents the influence of over-smoothing on the clarity of node knowledge. From Fig. 4(b), we find that setting \lambda to 0.1 effectively enhances model performance, whereas excessively high values lead to performance deterioration, underscoring the importance of balancing node similarity and the over-smoothing issue.

Figure 3: Ablation studies of our method. (a) Ablation experiments of key components: Self-Mutual Knowledge Distillation (SMKD) and Knowledge-Aware Model Aggregation (KAMA). (b) Performance comparison using different copilot model architectures.
Figure 4: Sensitivity analysis of hyperparameters. (a) Different combinations of the trade-off parameters \alpha and \beta. (b) Impact of different \lambda values.

IV-E Complexity Analysis

To address Q5, a computational complexity analysis of FedGKC is provided herein. Each client employs a GNN as the local model. For an l-layer GNN model with batch size b, the propagated feature X^{(l)} has a space complexity of \mathcal{O}((b+l)f), and the overhead of the linear transformation operations is \mathcal{O}(f^{2}), where f is the number of feature dimensions. The introduction of the copilot model doubles the required parameter storage, leading to an overall space complexity of \mathcal{O}(2(b+l)f+2f^{2}). Furthermore, the additional alignment losses within the self-mutual knowledge distillation module introduce a time complexity of \mathcal{O}(f^{2}). For the server, the reception and aggregation of model parameters impose space and time complexities of \mathcal{O}(Kf^{2}) and \mathcal{O}(K), respectively, where K represents the number of clients involved.

V Conclusion

Our research introduces the FedGKC framework, which represents a significant breakthrough in accommodating the variability of model architectures. By leveraging knowledge distillation and adaptive aggregation strategies, the framework achieves enhanced model adaptability and optimized global knowledge aggregation, positioning it as a robust solution for federated graph learning scenarios. There is also a need for future research into improving knowledge transfer methodologies specific to graph structures, refining mechanisms to balance efficiency and accuracy, and studying real-world applications across a variety of sectors. The continued search for tailored and secure federated learning methods will be critical to realizing the full potential of graph data structures in an increasingly interconnected world.

References

  • [1] S. Kumar, A. Mallik, A. Khetarpal, and B. S. Panda, “Influence maximization in social networks using graph embedding and graph neural network,” Information Sciences, vol. 607, pp. 1617–1636, 2022.
  • [2] X.-M. Zhang, L. Liang, L. Liu, and M.-J. Tang, “Graph neural networks and their current applications in bioinformatics,” Frontiers in genetics, vol. 12, p. 690049, 2021.
  • [3] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, “Lightgcn: Simplifying and powering graph convolution network for recommendation,” in Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 639–648.
  • [4] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on graph neural networks,” IEEE transactions on neural networks and learning systems, vol. 32, no. 1, pp. 4–24, 2020.
  • [5] R. Liu, P. Xing, Z. Deng, A. Li, C. Guan, and H. Yu, “Federated graph neural networks: Overview, techniques, and challenges,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
  • [6] G. Long, Y. Tan, J. Jiang, and C. Zhang, “Federated learning for open banking,” in Federated learning: privacy and incentive.   Springer, 2020, pp. 240–254.
  • [7] H. Ji, J. Zhu, C. Shi, X. Wang, B. Wang, C. Zhang, Z. Zhu, F. Zhang, and Y. Li, “Large-scale comb-k recommendation,” in Proceedings of the Web Conference 2021, ser. WWW ’21.   New York, NY, USA: Association for Computing Machinery, 2021, p. 2512–2523. [Online]. Available: https://doi.org/10.1145/3442381.3449924
  • [8] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’18.   New York, NY, USA: Association for Computing Machinery, 2018, p. 974–983. [Online]. Available: https://doi.org/10.1145/3219819.3219890
  • [9] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, “Representation learning on graphs with jumping knowledge networks,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   PMLR, 10–15 Jul 2018, pp. 5453–5462.
  • [10] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • [11] W. Huang, G. Wan, M. Ye, and B. Du, “Federated graph semantic and structural learning,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, E. Elkind, Ed.   International Joint Conferences on Artificial Intelligence Organization, 8 2023, pp. 3830–3838, main Track. [Online]. Available: https://doi.org/10.24963/ijcai.2023/426
  • [12] X. Li, Z. Wu, W. Zhang, Y. Zhu, R.-H. Li, and G. Wang, “Fedgta: Topology-aware averaging for federated graph learning,” Proceedings of the VLDB Endowment, vol. 17, no. 1, pp. 41–50, 2023.
  • [13] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency, “Think locally, act globally: Federated learning with local and global representations,” arXiv preprint arXiv:2001.01523, 2020.
  • [14] L. Yi, G. Wang, X. Liu, Z. Shi, and H. Yu, “Fedgh: Heterogeneous federated learning with generalized global header,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 8686–8696.
  • [15] Y. Tan, G. Long, L. Liu, T. Zhou, Q. Lu, J. Jiang, and C. Zhang, “Fedproto: Federated prototype learning across heterogeneous clients,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8432–8440.
  • [16] J. Zhang, Y. Liu, Y. Hua, and J. Cao, “Fedtgp: Trainable global prototypes with adaptive-margin-enhanced contrastive learning for data and model heterogeneity in federated learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 15, 2024, pp. 16768–16776.
  • [17] T. Shen, J. Zhang, X. Jia, F. Zhang, Z. Lv, K. Kuang, C. Wu, and F. Wu, “Federated mutual learning: a collaborative machine learning method for heterogeneous data, models, and objectives,” Frontiers of Information Technology & Electronic Engineering, vol. 24, no. 10, pp. 1390–1402, 2023.
  • [18] C. Wu, F. Wu, L. Lyu, Y. Huang, and X. Xie, “Communication-efficient federated learning via knowledge distillation,” Nature communications, vol. 13, no. 1, p. 2032, 2022.
  • [19] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [20] H. Xie, J. Ma, L. Xiong, and C. Yang, “Federated graph classification over non-iid graphs,” Advances in neural information processing systems, vol. 34, pp. 18839–18852, 2021.
  • [21] Y. Tan, Y. Liu, G. Long, J. Jiang, Q. Lu, and C. Zhang, “Federated learning on non-iid graphs via structural knowledge sharing,” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 8, 2023, pp. 9953–9961.
  • [22] J. Baek, W. Jeong, J. Jin, J. Yoon, and S. J. Hwang, “Personalized subgraph federated learning,” in International conference on machine learning.   PMLR, 2023, pp. 1396–1415.
  • [23] K. Zhang, C. Yang, X. Li, L. Sun, and S. M. Yiu, “Subgraph federated learning with missing neighbor generation,” Advances in Neural Information Processing Systems, vol. 34, pp. 6671–6682, 2021.
  • [24] Z. Zhu, J. Hong, and J. Zhou, “Data-free knowledge distillation for heterogeneous federated learning,” in International conference on machine learning.   PMLR, 2021, pp. 12878–12889.
  • [25] J. Zhang, S. Guo, X. Ma, H. Wang, W. Xu, and F. Wu, “Parameterized knowledge transfer for personalized federated learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 10092–10104, 2021.
  • [26] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
  • [27] Y. Tan, G. Long, J. Ma, L. Liu, T. Zhou, and J. Jiang, “Federated learning from pre-trained models: A contrastive learning approach,” Advances in neural information processing systems, vol. 35, pp. 19332–19344, 2022.
  • [28] Z. Li, X. Shang, R. He, T. Lin, and C. Wu, “No fear of classifier biases: Neural collapse inspired federated learning with synthetic and fixed classifier,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5319–5329.
  • [29] G. Hinton, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [30] K. Wang, Y. Liu, Q. Ma, and Q. Z. Sheng, “Mulde: Multi-teacher knowledge distillation for low-dimensional knowledge graph embeddings,” in Proceedings of the Web Conference 2021, 2021, pp. 1716–1726.
  • [31] B. Yan, C. Wang, G. Guo, and Y. Lou, “Tinygnn: Learning efficient graph neural networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1848–1856.
  • [32] C. Yang, J. Liu, and C. Shi, “Extract the knowledge of graph neural networks and go beyond it: An effective knowledge distillation framework,” in Proceedings of the web conference 2021, 2021, pp. 1227–1237.
  • [33] Y. Yang, J. Qiu, M. Song, D. Tao, and X. Wang, “Distilling knowledge from graph convolutional networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7074–7083.
  • [34] J. Guo, D. Chen, and C. Wang, “Alignahead: online cross-layer knowledge extraction on graph neural networks,” in 2022 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2022, pp. 1–8.
  • [35] C. Huo, D. Jin, Y. Li, D. He, Y.-B. Yang, and L. Wu, “T2-gnn: Graph neural networks for graphs with incomplete features and structure via teacher-student distillation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 4, 2023, pp. 4339–4346.
  • [36] H. He, J. Wang, Z. Zhang, and F. Wu, “Compressing deep graph neural networks via adversarial knowledge distillation,” in Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, 2022, pp. 534–544.
  • [37] L. Yu, S. Pei, L. Ding, J. Zhou, L. Li, C. Zhang, and X. Zhang, “Sail: Self-augmented graph contrastive learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8927–8935.
  • [38] W. Zhang, X. Miao, Y. Shao, J. Jiang, L. Chen, O. Ruas, and B. Cui, “Reliable data distillation on graph convolutional network,” in Proceedings of the 2020 ACM SIGMOD international conference on management of data, 2020, pp. 1399–1414.
  • [39] K. Feng, C. Li, Y. Yuan, and G. Wang, “Freekd: Free-direction knowledge distillation for graph neural networks,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 357–366.
  • [40] T. Zhao, W. Jin, Y. Liu, Y. Wang, G. Liu, S. Günnemann, N. Shah, and M. Jiang, “Graph data augmentation for graph machine learning: A survey,” arXiv preprint arXiv:2202.08871, 2022.
  • [41] Z. Yang, W. Cohen, and R. Salakhudinov, “Revisiting semi-supervised learning with graph embeddings,” in International conference on machine learning.   PMLR, 2016, pp. 40–48.
  • [42] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, “Pitfalls of graph neural network evaluation,” arXiv preprint arXiv:1811.05868, 2018.
  • [43] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel, “Image-based recommendations on styles and substitutes,” in Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, 2015, pp. 43–52.
  • [44] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22118–22133, 2020.
  • [45] Q. Li, B. He, and D. Song, “Model-contrastive federated learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10708–10717.
  • [46] L. Gao, H. Fu, L. Li, Y. Chen, M. Xu, and C.-Z. Xu, “Feddc: Federated learning with non-iid data via local drift decoupling and correction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10102–10111.
  • [47] X. Li, W. Zhang, R.-H. Li, Y. Zhao, Y. Zhu, and G. Wang, “A new paradigm for federated structure non-IID subgraph learning,” 2023. [Online]. Available: https://openreview.net/forum?id=Qyz2cMy-ty6
  • [48] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, oct 2008. [Online]. Available: https://dx.doi.org/10.1088/1742-5468/2008/10/P10008
  • [49] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, 2017.
  • [50] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
  • [51] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2018.
  • [52] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning.   PMLR, 2019, pp. 6861–6871.