
Federated PAC-Bayesian Learning on Non-IID data

Abstract

Existing research has either adapted the Probably Approximately Correct (PAC) Bayesian framework for federated learning (FL) or used information-theoretic PAC-Bayesian bounds while introducing their theorems, but few consider the non-IID challenges in FL. Our work presents the first non-vacuous federated PAC-Bayesian bound tailored for non-IID local data. This bound assumes unique prior knowledge for each client and variable aggregation weights. We also introduce an objective function and an innovative Gibbs-based algorithm for the optimization of the derived bound. The results are validated on real-world datasets.

Index Terms—  Federated learning, PAC-Bayesian framework, generalization error

1 Introduction

To address privacy concerns in distributed learning, federated learning (FL) has emerged as a viable solution, enabling multiple local clients to collaboratively train a model while retaining their private data and without sharing it [1, 2]. However, in real-world scenarios, data across different devices is not identically and independently distributed (non-IID), which poses challenges in model training and convergence [3].

Significant efforts have been made to improve performance and analyze convergence in non-IID FL [4], but few have provided theoretical guarantees by establishing generalization bounds. Most existing FL generalization analyses rely on the Probably Approximately Correct (PAC) Bayesian theory, first formulated by McAllester [5, 6]. Building on McAllester's bound, these analyses typically compute local bounds or apply existing PAC-Bayesian bounds directly, overlooking the non-IID nature of FL. This approach is flawed: the PAC-Bayesian framework assumes that data points are IID, so ignoring non-IID data and directly employing PAC-Bayesian theory potentially results in inaccurate or overly relaxed bounds. Consequently, techniques developed for the PAC-Bayesian framework are not directly applicable to non-IID FL. Therefore, this work aims to advance the theoretical underpinnings of non-IID FL.

Related works. The PAC-Bayesian framework has been extensively researched in recent years [7, 8, 9], yielding tighter and non-vacuous bounds. However, there has been limited exploration in the context of FL. Some studies have proposed information-theoretic PAC-Bayesian bounds using rate-distortion theory to prove generalization bounds [10, 11], providing an information-theoretic perspective on enhancing generalization capacity. Others have followed McAllester's approach, attempting to directly apply the FL paradigm to the bound. For example, the authors in [12, 13] applied McAllester's bound in a multi-step FL scenario; Omni-Fedge [14] used the PAC-Bayesian learning framework to construct a weighted-sum objective function with a penalty, considering only a local client bound instead of the entire system, which precludes obtaining global information; and FedPAC [15] employed PAC learning to balance utility, privacy, and efficiency in FL. However, these approaches do not account for the non-IIDness of FL.

Our contributions. First, we derive a federated PAC-Bayesian learning bound for non-IID local data, providing a unified perspective on federated learning paradigms. To the best of our knowledge, this is the first non-vacuous bound for a model averaging FL framework. Specifically, due to the non-IID nature of clients, we assume that each client has unique prior knowledge rather than a common one. Additionally, the aggregation weights for non-IID clients vary instead of being uniform. Based on the derived bound, we define an objective function that can be computed by each local client rather than on the server and propose a Gibbs-based algorithm dubbed FedPB for its optimization. This algorithm not only preserves the privacy of each client but also enhances efficiency. Finally, we validate our proposed bounds and algorithm on two real-world datasets, demonstrating the effectiveness of our bounds and algorithm.

2 Problem Setting

In this section, we introduce the federated PAC-Bayesian learning setting. The whole system comprises $K$ clients, each equipped with its own dataset $S_k=\{(x_{k,i},y_{k,i})\}_{i=1}^{n}\subseteq(\mathcal{X}\times\mathcal{Y})^{n}$ consisting of $n$ IID data points. Here $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ denotes the output space. Each dataset $S_k$ is presumed to be drawn from an unknown data-generating distribution $D_k^{\otimes n}$. Moreover, let $\ell:\mathcal{Z}\times\mathcal{W}\rightarrow\mathbb{R}^{+}$ be a given loss function and let $h_k\in\mathcal{H}$ be a stochastic estimator on client $k$, where $\mathcal{H}$ is the hypothesis class. In the PAC-Bayesian framework, each client holds a tailored prior distribution $P_k$. The objective of each client is to furnish a posterior distribution $Q_k\in\mathcal{M}$, where $\mathcal{M}$ denotes the set of distributions over $\mathcal{H}$. We then define the population risk:

L(Q_{1},\dots,Q_{K})\triangleq\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\,\mathbb{E}_{(x_{k},y_{k})\sim D_{k}}\left[\ell(h_{k}(x_{k}),y_{k})\right], \qquad (1)

and the empirical risk:

\hat{L}(Q_{1},\dots,Q_{K})\triangleq\frac{1}{nK}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\sum_{i=1}^{n}\ell\left(h_{k}(x_{k,i}),y_{k,i}\right), \qquad (2)

by averaging over the posterior distribution of each client. In federated learning, each client uploads its posterior distribution to a central server, and the server aggregates the transmitted models in a weighted manner:

\bar{P}=\prod_{k=1}^{K}P_{k}^{p(k)}, \qquad \bar{Q}=\prod_{k=1}^{K}Q_{k}^{p(k)},

where $\bar{P}$ and $\bar{Q}$ are the global prior and posterior, respectively, and the averaging weight $p=(p(1),\dots,p(K))$ is a probability distribution on $\{1,\dots,K\}$. For the sake of generality, we assume that $p(k)\in(0,1)$ and $\sum_{k=1}^{K}p(k)=1$. For intuition on this aggregation, note that minimizing the weighted objective function is equivalent to maximizing the logarithm of the corresponding posterior: $\min_{h}L(h)=\min_{h}\sum_{k=1}^{K}p(k)L_{k}(h)=\max_{h}\ln\prod_{k=1}^{K}p(h\mid\mathcal{D}_{k})^{p(k)}$. In addition, we denote the Kullback-Leibler (KL) divergence as $D_{KL}(Q\|P)\triangleq\mathbb{E}_{Q}\left[\log\frac{dQ}{dP}\right]$ if $Q\ll P$ and $D_{KL}(Q\|P)=+\infty$ otherwise.
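As a concrete illustration of the risks in (1) and (2), the empirical risk is an expectation over each client's posterior and can be estimated by Monte Carlo sampling. A minimal sketch, assuming synthetic linear hypotheses with a squared loss (the model family, sample counts, and posterior parameters are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, d = 3, 50, 5  # clients, samples per client, feature dimension (illustrative)

# Synthetic local datasets S_k and diagonal-Gaussian posteriors Q_k over linear predictors.
X = rng.normal(size=(K, n, d))
y = rng.normal(size=(K, n))
mu_Q = rng.normal(size=(K, d))    # posterior means, one per client
sigma_Q = np.full((K, d), 0.1)    # posterior standard deviations

def empirical_risk(num_mc=200):
    """Monte Carlo estimate of L_hat(Q_1,...,Q_K) in Eq. (2) by sampling h_k ~ Q_k."""
    total = 0.0
    for k in range(K):
        h = mu_Q[k] + sigma_Q[k] * rng.normal(size=(num_mc, d))  # h_k ~ Q_k
        preds = h @ X[k].T                                       # shape (num_mc, n)
        losses = (preds - y[k]) ** 2                             # illustrative squared loss
        total += losses.mean()       # approximates E_{h~Q_k} (1/n) sum_i ell(h(x), y)
    return total / K

print(empirical_risk())
```

The population risk (1) would be estimated the same way, replacing the fixed dataset with fresh draws from each $D_k$.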

3 Main theorem

In this section, we present our novel bound for the non-IID FL scenario.

Theorem 1 (Federated PAC-Bayesian learning bound).

Assume the loss function $\ell(\cdot,\cdot)$ is bounded in $[0,C]$. Then for any $\delta\in(0,1)$ and any $\lambda>0$, the following inequality holds uniformly for all posterior distributions $Q_1,\dots,Q_K$:

\mathbb{P}_{S_{1},\dots,S_{K}}\bigg\{\forall Q_{1},\dots,Q_{K}:\; L(Q_{1},\dots,Q_{K})\leq\hat{L}(Q_{1},\dots,Q_{K})+\frac{\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{1}{\delta}}{\lambda}+\frac{\lambda C^{2}}{8Kn}\bigg\}>1-\delta \qquad (3)
Proof.

Define the local generalization error $\operatorname{gen}(D_{k},h_{k})=\mathbb{E}_{(x_{k},y_{k})\sim D_{k}}[\ell(h_{k}(x_{k}),y_{k})]-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{k}(x_{k,i}),y_{k,i})$, and the global generalization error $\overline{\operatorname{gen}}(D,h)=L(Q_{1},\dots,Q_{K})-\hat{L}(Q_{1},\dots,Q_{K})=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\operatorname{gen}(D_{k},h_{k})$. For any $\lambda>0$, applying Hoeffding's lemma to $\mathbb{E}[\ell_{i}]-\ell_{i}$, we have that, for each client $k$,

\mathbb{E}_{S_{k}}\,\mathbb{E}_{P_{k}}\left[\mathrm{e}^{\frac{\lambda}{K}\operatorname{gen}(D_{k},h_{k})}\right]\leq\mathrm{e}^{\frac{\lambda^{2}C^{2}}{8K^{2}n}}.

Since each $S_k$ may come from a different $D_k$, i.e., the data are non-IID, we cannot directly plug this result into the standard PAC-Bayesian bound. Noting that for each client $k\in[K]$, $P_k$ is independent of $S_1,\dots,S_K$, we have

\mathbb{E}_{S_{1}}\,\mathbb{E}_{P_{1}}\cdots\mathbb{E}_{S_{K}}\,\mathbb{E}_{P_{K}}\left[\mathrm{e}^{\frac{\lambda}{K}\sum_{k=1}^{K}\operatorname{gen}(D_{k},h_{k})}\right]\leq\mathrm{e}^{\frac{\lambda^{2}C^{2}}{8Kn}}.

We then apply Donsker and Varadhan's variational formula [16] for $P_1,\dots,P_K$ to get:

\mathbb{E}_{S_{1},\dots,S_{K}}\left[\mathrm{e}^{\sup_{Q_{1},\dots,Q_{K}}\lambda\,\mathbb{E}_{Q_{1}}\cdots\mathbb{E}_{Q_{K}}\left[\frac{1}{K}\sum_{k=1}^{K}\operatorname{gen}(D_{k},h_{k})\right]-D_{KL}\left(\prod_{k=1}^{K}Q_{k}^{p(k)}\,\middle\|\,\prod_{k=1}^{K}P_{k}^{p(k)}\right)}\right]\leq\mathrm{e}^{\frac{\lambda^{2}C^{2}}{8Kn}}. \qquad (4)

Recall the definition of the global generalization error:

\overline{\operatorname{gen}}(D,h)=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\operatorname{gen}(D_{k},h_{k}),

and note that $D_{KL}\left(\prod_{k=1}^{K}Q_{k}^{p(k)}\|\prod_{k=1}^{K}P_{k}^{p(k)}\right)=\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})$. Applying the Chernoff bound:

\mathbb{P}_{S_{1},\dots,S_{K}}\left[\sup_{Q_{1},\dots,Q_{K}}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}>s\right]
\leq\mathbb{E}_{S_{1},\dots,S_{K}}\left[\mathrm{e}^{\sup_{Q_{1},\dots,Q_{K}}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\mathrm{e}^{-s}
\leq\mathrm{e}^{-s}.

Let $\delta=\mathrm{e}^{-s}$, that is, $s=-\log\delta$. Plugging this into the above result, we have

\mathbb{P}_{S_{1},\dots,S_{K}}\bigg\{\exists Q_{1},\dots,Q_{K}:\;\overline{\operatorname{gen}}(D,h)>\frac{1}{\lambda}\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\frac{\lambda C^{2}}{8Kn}+\frac{1}{\lambda}\log\frac{1}{\delta}\bigg\}\leq\delta.

Taking the complement of this event proves the statement. ∎
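The proof rests on per-client KL terms $D_{KL}(Q_k\|P_k)$. For the Gaussian priors and posteriors used later in the paper, these have a closed form that can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_{Q}[\log\frac{dQ}{dP}]$. A quick sketch (all distribution parameters and weights here are illustrative):

```python
import numpy as np

def kl_gauss(mu_q, s_q, mu_p, s_p):
    """Closed-form D_KL(N(mu_q, s_q^2) || N(mu_p, s_p^2)) for 1-D Gaussians."""
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

rng = np.random.default_rng(1)
mu_q, s_q, mu_p, s_p = 0.5, 0.8, 0.0, 1.0

# Monte Carlo estimate of E_Q[log dQ/dP].
x = rng.normal(mu_q, s_q, size=200_000)
log_q = -0.5 * ((x - mu_q) / s_q) ** 2 - np.log(s_q)
log_p = -0.5 * ((x - mu_p) / s_p) ** 2 - np.log(s_p)
mc = (log_q - log_p).mean()
print(kl_gauss(mu_q, s_q, mu_p, s_p), mc)  # the two estimates should agree closely

# The weighted sum sum_k p(k) D_KL(Q_k || P_k) appearing in Theorem 1.
p = np.array([0.2, 0.3, 0.5])
kls = np.array([kl_gauss(0.1 * k, 0.9, 0.0, 1.0) for k in range(3)])
weighted = float(p @ kls)
print(weighted)
```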

The RHS of Equation 3 comprises two components: the empirical term and the complexity term. Note that our bound eschews the typical smoothness and convexity assumptions on the loss often made by other FL frameworks. Moreover, Equation 3 yields the intuition that the bound becomes tighter as the number of clients increases, which is further corroborated by the evaluation in Section 5.3.
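To make this behavior concrete, the RHS of Equation 3 can be computed directly from the empirical risk, the weighted KL terms, and $(\lambda,C,K,n,\delta)$. A small sketch with illustrative numbers (not the paper's experimental values), showing that the $\frac{\lambda C^{2}}{8Kn}$ term shrinks as $K$ grows:

```python
import math

def bound_rhs(emp_risk, kls, weights, lam, C, K, n, delta):
    """RHS of Eq. (3): empirical term plus complexity term."""
    complexity = (sum(w * kl for w, kl in zip(weights, kls))
                  + math.log(1 / delta)) / lam + lam * C**2 / (8 * K * n)
    return emp_risk + complexity

# Illustrative numbers only.
C, n, delta, lam = 1.0, 500, 0.05, 100.0
b10 = bound_rhs(0.2, [2.0] * 10, [1.0 / 10] * 10, lam, C, 10, n, delta)
b50 = bound_rhs(0.2, [2.0] * 50, [1.0 / 50] * 50, lam, C, 50, n, delta)
print(b10, b50)  # with comparable per-client KL, more clients give a tighter bound
```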

Corollary 1 (The choice of λ\lambda).

Suppose $\lambda\in\Xi\triangleq\{0,\dots,\xi\}$ and $|\cdot|$ denotes the cardinality of a set. For any $\delta\in(0,1)$ and a properly chosen $\lambda$, with probability at least $1-\delta$,

L(Q_{1},\dots,Q_{K})\leq\hat{L}(Q_{1},\dots,Q_{K})+C\sqrt{\frac{\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}}{2Kn}}. \qquad (5)
Proof.

Suppose $\mathcal{S}=S_{1}\cap S_{2}\cap\dots\cap S_{K}$ and $\mathcal{Q}=Q_{1}\cap Q_{2}\cap\dots\cap Q_{K}$. Since the previous result (4) holds for any fixed $\lambda$, we can sum (4) over all $\lambda\in\Xi$:

\sum_{\lambda\in\Xi}\mathbb{E}_{\mathcal{S}}\left[\mathrm{e}^{\sup_{\mathcal{Q}}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\leq|\Xi|,

which is equivalent to:

\mathbb{E}_{\mathcal{S}}\left[\mathrm{e}^{\sup_{\mathcal{Q},\lambda\in\Xi}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\leq|\Xi|.

Again, from the Chernoff bound:

\mathbb{P}_{\mathcal{S}}\left[\sup_{\mathcal{Q},\lambda\in\Xi}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}>s\right]
\leq\mathbb{E}_{\mathcal{S}}\left[\mathrm{e}^{\sup_{\mathcal{Q},\lambda\in\Xi}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\mathrm{e}^{-s}
\leq|\Xi|\,\mathrm{e}^{-s}.

Solving $\delta=|\Xi|\mathrm{e}^{-s}$ for $s$, we get:

\mathbb{P}_{\mathcal{S}}\bigg\{\exists\mathcal{Q},\lambda\in\Xi:\;\overline{\operatorname{gen}}(D,h)>\frac{\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}}{\lambda}+\frac{\lambda C^{2}}{8Kn}\bigg\}\leq\delta.

Choosing a proper minimizer

\lambda^{*}=\frac{1}{C}\sqrt{8Kn\left(\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}\right)},

we obtain the bound of this corollary. ∎

Equation 5 offers a general strategy for selecting the value of the parameter $\lambda$ so that the complexity term is minimized.
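The role of the minimizer $\lambda^{*}$ can be checked numerically: writing $A=\sum_{k}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}$, substituting $\lambda^{*}$ into $\frac{A}{\lambda}+\frac{\lambda C^{2}}{8Kn}$ recovers the square-root form of Equation 5. A small sketch with illustrative values:

```python
import math

def lam_star(weighted_kl, C, K, n, delta, xi_size):
    """Minimizer of A/lambda + lambda*C^2/(8Kn), with A = weighted_kl + log(|Xi|/delta)."""
    A = weighted_kl + math.log(xi_size / delta)
    return math.sqrt(8 * K * n * A) / C

# Illustrative values only.
C, K, n, delta, xi = 1.0, 10, 500, 0.05, 100
A = 2.0 + math.log(xi / delta)
lam = lam_star(2.0, C, K, n, delta, xi)

# Substituting lambda* should match the closed form C*sqrt(A/(2Kn)) of Eq. (5).
direct = A / lam + lam * C**2 / (8 * K * n)
closed = C * math.sqrt(A / (2 * K * n))
print(direct, closed)
```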

4 FedPB: Optimize the upper bound

We denote $\mathcal{L}_{k}=\mathbb{E}_{h_{k}\sim Q_{k}}\frac{1}{n}\sum_{i=1}^{n}\ell\left(h_{k}(x_{k,i}),y_{k,i}\right)$. Consider the following local objective function:

\mathcal{J}(Q_{k})=\lambda\mathcal{L}_{k}+p(k)D_{KL}(Q_{k}\|P_{k}) \qquad (6)

In our methodology, we introduce FedPB for a general scenario. It comprises two phases, designed to iteratively optimize the priors and posteriors for each client. Notably, in contrast to previous studies, clients are not required to upload their private prior and posterior distributions to the server, ensuring their privacy.

Phase 1 (Optimize the posterior). Given a fixed parameter $\lambda>0$ and the prior $P_{k}^{t}$ during training epoch $t+1$, we optimize the posterior as $\hat{Q}_{k}^{t+1}=\arg\min_{Q_{k}}\mathcal{J}(Q_{k})$, yielding the solution:

\frac{d\hat{Q}_{k}^{t+1}}{dP_{k}^{t}}(h)=\frac{\exp\left(-\lambda\ell(h,z_{i})\right)}{\mathbb{E}_{h\sim P_{k}^{t}}\left[\exp\left(-\lambda\ell(h,z_{i})\right)\right]}.

Phase 2 (Optimize the prior). Having derived the optimal posterior $\hat{Q}_{k}^{t+1}$, the prior is updated as $\hat{P}_{k}^{t+1}=Q_{k}^{t}$, since this choice minimizes $D_{KL}(Q_{k}\|P_{k})$.
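On a finite hypothesis set, the two phases reduce to elementary operations: the Phase-1 Gibbs posterior reweights the prior by $\exp(-\lambda\cdot\text{loss})$, and Phase 2 copies the posterior into the next prior. A minimal sketch (the hypothesis set, losses, and $\lambda$ are illustrative):

```python
import numpy as np

def gibbs_posterior(prior, losses, lam):
    """Phase-1 closed form on a finite hypothesis set:
    Q(h) is proportional to P(h) * exp(-lam * loss(h))."""
    w = prior * np.exp(-lam * losses)
    return w / w.sum()

# Illustrative: 4 hypotheses with a uniform prior.
prior = np.full(4, 0.25)
losses = np.array([0.9, 0.5, 0.1, 0.7])
Q = gibbs_posterior(prior, losses, lam=5.0)
print(Q)  # mass concentrates on the low-loss hypothesis

# Phase 2: the next prior is simply the current posterior.
next_prior = Q.copy()
```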

Link with personalized federated learning. Since the prior $\hat{P}_{k}^{t+1}=Q_{k}^{t}$ and $Q_{k}^{t}$ is equal to the aggregated global posterior at epoch $t-1$, the prior can be viewed as the global knowledge. Optimizing the objective function (6) minimizes the disparity between global and local knowledge, a prevalent personalization strategy in FL [17].

Re-parameterization trick. Utilizing the Bayesian neural network [18] as the local model aligns with our setting, where all parameters are random and we optimize their posterior distribution. In particular, the prior $P_{k}$ and posterior $Q_{k}$ are defined as follows:

P_{k}(w;\vartheta_{P_{k}})=\mathcal{N}\left(w;\mu_{P_{k}},\sigma_{P_{k}}^{2}I_{d}\right)=\prod_{i=1}^{d}\mathcal{N}\left(w_{i};\mu_{P_{k},i},\sigma_{P_{k},i}^{2}\right),
Q_{k}(w;\vartheta_{Q_{k}})=\mathcal{N}\left(w;\mu_{Q_{k}},\sigma_{Q_{k}}^{2}I_{d}\right)=\prod_{i=1}^{d}\mathcal{N}\left(w_{i};\mu_{Q_{k},i},\sigma_{Q_{k},i}^{2}\right),

with every model parameter $w_{i}$ being independent. Computing the Gibbs posterior directly can be challenging, hence we use gradient descent as an alternative. The update rule of (6) at round $t+1$ is:

\mu_{Q_{k},i}^{t+1}=\mu_{Q_{k},i}^{t}-\lambda\nabla_{\mu_{Q_{k},i}}\mathcal{L}_{k}-\frac{p(k)\left(\mu_{Q_{k},i}-\mu_{P_{k},i}\right)}{\sigma_{Q_{k},i}^{2}},
\sigma_{Q_{k},i}^{t+1}=\sigma_{Q_{k},i}^{t}-\lambda\nabla_{\sigma_{Q_{k},i}}\mathcal{L}_{k}+\frac{p(k)\left(\sigma_{P_{k},i}^{2}-\sigma_{Q_{k},i}^{2}+\left(\mu_{P_{k},i}-\mu_{Q_{k},i}\right)^{2}\right)}{\sigma_{Q_{k},i}^{3}},

where the parameter $\lambda$ can be regarded as the learning rate of the gradient descent and the KL divergence acts as the regularization term. Calculating the gradients $\nabla_{\mu_{Q_{k},i}}$ and $\nabla_{\sigma_{Q_{k},i}}$ directly can be intricate, but the re-parameterization trick tackles this issue. Concretely, we translate $h\sim Q_{k}$ into $\varepsilon\sim\mathcal{N}(0,I_{d})$ and compute the deterministic function $h=\mu+\sigma\odot\varepsilon$, where $\odot$ signifies element-wise multiplication. As a result, we have $\nabla_{\mu_{Q_{k},i}}\mathbb{E}_{h\sim Q_{k}}\ell(h)=\nabla_{\mu_{Q_{k},i}}\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I_{d})}\ell(\mu+\sigma\odot\varepsilon)$, indicating its computability in an end-to-end framework with back-propagation.
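The re-parameterization trick can be checked on a toy loss. Assuming a hypothetical quadratic loss $\ell(h)=\frac{1}{2}\|h-a\|^{2}$ (chosen only because its exact posterior-mean gradient $\mu-a$ is known in closed form), the Monte Carlo gradient through $h=\mu+\sigma\odot\varepsilon$ matches the analytic one:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = np.array([1.0, -0.5, 2.0])
sigma = np.array([0.3, 0.2, 0.1])
a = np.zeros(d)  # target of the illustrative quadratic loss l(h) = ||h - a||^2 / 2

# Re-parameterization: h = mu + sigma * eps with eps ~ N(0, I_d), so the
# gradient with respect to mu passes through the deterministic map.
eps = rng.normal(size=(100_000, d))
h = mu + sigma * eps
grad_mc = (h - a).mean(axis=0)  # MC estimate of grad_mu E_{h~Q} l(h), since grad_h l = h - a

grad_exact = mu - a             # analytic gradient for this quadratic loss
print(grad_mc, grad_exact)
```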

5 Evaluation

In this section, we validate our algorithm and theoretical arguments in the non-IID FL setting. Specifically, the aggregation weight $p(k)$ is defined as the sample ratio of client $k$ relative to the entire data size across all clients. For the global aggregation, the global mean and covariance are calculated by $\bar{\mu}=\sum_{k=1}^{K}p(k)\mu_{k}\sigma_{k}^{-2}/\sum_{k=1}^{K}p(k)\sigma_{k}^{-2}$ and $\bar{\sigma}=1/\sum_{k=1}^{K}p(k)\sigma_{k}^{-2}$, respectively. Furthermore, we utilize two real-world datasets: MedMNIST (medical image analysis) [19] and CIFAR-10 [20]. For each dataset, we adopt three distinct data-generating approaches for local clients: 1) Balanced: each client holds an equal number of samples; 2) Unbalanced: varying sample counts per client (e.g., $[0.05,0.05,0.05,0.05,0.1,0.1,0.1,0.1,0.2,0.2]$ for 10 clients); 3) Dirichlet: differing sample counts per client following a Dirichlet distribution [21]. The entire FL system encompasses $K=10$ clients, initializing their posterior models uniformly from a global posterior model.
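The precision-weighted aggregation of $\bar{\mu}$ and $\bar{\sigma}$ above can be sketched directly. A minimal example (the client parameters and weights are illustrative):

```python
import numpy as np

def aggregate(mus, sigmas, p):
    """Server-side aggregation following the formulas in Section 5:
    mu_bar    = sum_k p(k) mu_k sigma_k^{-2} / sum_k p(k) sigma_k^{-2},
    sigma_bar = 1 / sum_k p(k) sigma_k^{-2}."""
    prec = p[:, None] / sigmas**2                    # p(k) * sigma_k^{-2}, per parameter
    mu_bar = (prec * mus).sum(axis=0) / prec.sum(axis=0)
    sigma_bar = 1.0 / prec.sum(axis=0)
    return mu_bar, sigma_bar

# Illustrative: 3 clients, 2 parameters, sample-ratio weights.
mus = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
sigmas = np.ones((3, 2))
p = np.array([0.25, 0.5, 0.25])

mu_bar, sigma_bar = aggregate(mus, sigmas, p)
print(mu_bar, sigma_bar)  # with equal variances, mu_bar is just the p-weighted mean
```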

Additionally, we deploy two versions of Bayesian neural networks: one with 2 convolutional layers for MedMNIST and another with 3 layers for CIFAR-10. The cross-entropy loss serves as our loss function, optimized by the Adam optimizer [22] with a learning rate of 1e-3.

5.1 Bound evaluation

To validate our bounds, we set the confidence level to $1-\delta=95\%$. Our evaluation underscores a correlation between the generalization error and the complexity, emphasizing the tightness of our bound. Fig. 1 illustrates an initial increase in the generalization error and a concurrent decrease in complexity during the early stages of training, attributed to empirical loss optimization. Subsequently, as neural network training advances, the KL divergence stabilizes. Throughout this progression, we observe that the generalization error is consistently bounded by the complexity value.

[Figure 1: four panels showing (a) generalization error for MedMNIST, (b) accuracy for MedMNIST, (c) generalization error for CIFAR-10, (d) accuracy for CIFAR-10.]

Fig. 1: The results of the generalization error and model performance of FedPB over the Dirichlet generating method.

5.2 Data-dependent prior

Here, we perform an ablation study of the data-dependent (trainable) prior compared with the data-independent (fixed, chosen before training) prior and report the mean ± standard deviation (std) accuracy of the global model in Table 1, evaluated over multiple experimental seeds. The results demonstrate the superior efficacy of the data-dependent strategy on both datasets across all three scenarios. This superiority arises from the data-dependent prior's ability to harness more global knowledge, combined with its adaptability during training.

Table 1: Model accuracy (%) for the data-independent prior and data-dependent prior in three data-generating scenarios (best results in bold in the original).

                    MedMNIST                                      CIFAR-10
Method              Balanced       Unbalanced     Dirichlet       Balanced       Unbalanced     Dirichlet
Data-independent    53.47 ± 1.12   49.44 ± 1.10   55.24 ± 6.92    50.89 ± 0.62   47.19 ± 0.92   57.93 ± 0.55
Data-dependent      77.10 ± 4.25   77.34 ± 3.42   77.48 ± 4.75    84.41 ± 0.94   79.39 ± 0.56   86.11 ± 0.53

5.3 Different client scales

Lastly, we assess the influence of varying client scales on our complexity bounds. As depicted in Fig. 2, in the evaluation of both datasets with the Dirichlet generating method, increasing the number of clients $K$ from 10 to 20 and 50 yields a consistent decrease in the complexity term. This observation aligns with our analysis of Equation 3.

Fig. 2: The impact of different client scales over FedPB on the value of the complexity term.

References

  • [1] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
  • [2] Zihao Zhao, Yuzhu Mao, Yang Liu, Linqi Song, Ye Ouyang, Xinlei Chen, and Wenbo Ding, “Towards efficient communications in federated learning: A contemporary survey,” Journal of the Franklin Institute, 2023.
  • [3] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [4] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang, “On the convergence of fedavg on non-iid data,” arXiv preprint arXiv:1907.02189, 2019.
  • [5] David A McAllester, “Some pac-bayesian theorems,” in Proceedings of the eleventh annual conference on Computational learning theory, 1998, pp. 230–234.
  • [6] David A McAllester, “Pac-bayesian model averaging,” in Proceedings of the twelfth annual conference on Computational learning theory, 1999, pp. 164–170.
  • [7] Matthias Seeger, “Pac-bayesian generalisation error bounds for gaussian process classification,” Journal of machine learning research, vol. 3, no. Oct, pp. 233–269, 2002.
  • [8] Olivier Catoni, “Pac-bayesian supervised classification: the thermodynamics of statistical learning,” arXiv preprint arXiv:0712.0248, 2007.
  • [9] Luca Oneto, Michele Donini, Massimiliano Pontil, and John Shawe-Taylor, “Randomized learning and generalization of fair and private classifiers: From pac-bayes to stability and differential privacy,” Neurocomputing, vol. 416, pp. 231–243, 2020.
  • [10] Milad Sefidgaran, Romain Chor, and Abdellatif Zaidi, “Rate-distortion theoretic bounds on generalization error for distributed learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 19687–19702, 2022.
  • [11] LP Barnes, Alex Dytso, and H Vincent Poor, “Improved information theoretic generalization bounds for distributed and federated learning,” in 2022 IEEE International Symposium on Information Theory (ISIT). IEEE, 2022, pp. 1465–1470.
  • [12] Milad Sefidgaran, Romain Chor, Abdellatif Zaidi, and Yijun Wan, “Federated learning you may communicate less often!,” arXiv preprint arXiv:2306.05862, 2023.
  • [13] Romain Chor, Milad Sefidgaran, and Abdellatif Zaidi, “More communication does not result in smaller generalization error in federated learning,” arXiv preprint arXiv:2304.12216, 2023.
  • [14] Sai Anuroop Kesanapalli and BN Bharath, “Federated algorithm with bayesian approach: Omni-fedge,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3075–3079.
  • [15] Xiaojin Zhang, Anbu Huang, Lixin Fan, Kai Chen, and Qiang Yang, “Probably approximately correct federated learning,” arXiv preprint arXiv:2304.04641, 2023.
  • [16] Monroe D Donsker and SR Srinivasa Varadhan, “On a variational formula for the principal eigenvalue for operators with maximum principle,” Proceedings of the National Academy of Sciences, vol. 72, no. 3, pp. 780–783, 1975.
  • [17] Xu Zhang, Yinchuan Li, Wenpeng Li, Kaiyang Guo, and Yunfeng Shao, “Personalized federated learning via variational bayesian inference,” in International Conference on Machine Learning. PMLR, 2022, pp. 26293–26310.
  • [18] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter, “Bayesian optimization with robust bayesian neural networks,” Advances in neural information processing systems, vol. 29, 2016.
  • [19] Jiancheng Yang, Rui Shi, and Bingbing Ni, “Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis,” in IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021, pp. 191–195.
  • [20] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • [21] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni, “Bayesian nonparametric federated learning of neural networks,” in International conference on machine learning. PMLR, 2019, pp. 7252–7261.
  • [22] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.