
Federated PAC-Bayesian Learning on Non-IID data

Abstract

Existing research has either adapted the Probably Approximately Correct (PAC) Bayesian framework for federated learning (FL) or used information-theoretic PAC-Bayesian bounds while introducing their theorems, but few consider the non-IID challenges in FL. Our work presents the first non-vacuous federated PAC-Bayesian bound tailored for non-IID local data. This bound assumes unique prior knowledge for each client and variable aggregation weights. We also introduce an objective function and an innovative Gibbs-based algorithm for the optimization of the derived bound. The results are validated on real-world datasets.

Index Terms—  Federated learning, PAC-Bayesian framework, generalization error

1 Introduction

To address privacy concerns in distributed learning, federated learning (FL) has emerged as a viable solution, enabling multiple local clients to collaboratively train a model while retaining their private data and without sharing it [1, 2]. However, in real-world scenarios, data across different devices is not identically and independently distributed (non-IID), which poses challenges in model training and convergence [3].

Significant efforts have been made to improve performance and analyze convergence in non-IID FL [4], but few have provided theoretical guarantees by establishing generalization bounds. Most existing FL generalization analyses rely on the Probably Approximately Correct (PAC) Bayesian theory, first formulated by McAllester [5, 6]. Building on McAllester's bound, these analyses typically compute local bounds or apply existing PAC-Bayesian bounds directly, overlooking the non-IID nature of FL. This approach is flawed: the PAC-Bayesian framework assumes that data points are IID, so ignoring non-IID data and directly employing PAC-Bayesian theory potentially results in inaccurate or overly relaxed bounds. Consequently, techniques developed for the PAC-Bayesian framework are not directly applicable to non-IID FL. Therefore, this work aims to advance the theoretical underpinnings of non-IID FL.

Related works. The PAC-Bayesian framework has been extensively researched in recent years [7, 8, 9], yielding tighter and non-vacuous bounds. However, there has been limited exploration in the context of FL. Some studies have proposed information-theoretic PAC-Bayesian bounds using rate-distortion theory to prove generalization bounds [10, 11], providing an information-theoretic perspective on enhancing generalization capacity. Others have followed McAllester's approach, attempting to directly apply the FL paradigm to the bound. For example, the authors in [12, 13] applied McAllester's bound in a multi-step FL scenario; Omni-Fedge [14] used the PAC-Bayesian learning framework to construct a weighted-sum objective function with a penalty, considering only a local client bound instead of the entire system, which precludes obtaining global information; and FedPAC [15] employed PAC learning to balance utility, privacy, and efficiency in FL. However, these approaches do not account for the non-IIDness of FL.

Our contributions. First, we derive a federated PAC-Bayesian learning bound for non-IID local data, providing a unified perspective on federated learning paradigms. To the best of our knowledge, this is the first non-vacuous bound for a model averaging FL framework. Specifically, due to the non-IID nature of clients, we assume that each client has unique prior knowledge rather than a common one. Additionally, the aggregation weights for non-IID clients vary instead of being uniform. Based on the derived bound, we define an objective function that can be computed by each local client rather than on the server and propose a Gibbs-based algorithm dubbed FedPB for its optimization. This algorithm not only preserves the privacy of each client but also enhances efficiency. Finally, we validate our proposed bounds and algorithm on two real-world datasets, demonstrating the effectiveness of our bounds and algorithm.

2 Problem Setting

In this section, we introduce the federated PAC-Bayesian learning setting. The whole system comprises $K$ clients, each equipped with its own dataset $S_k=\{(x_{k,i},y_{k,i})\}_{i=1}^{n}\subseteq(\mathcal{X}\times\mathcal{Y})^{n}$ consisting of $n$ IID data points. Here $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ denotes the output space. Each dataset $S_k$ is presumed to be drawn from an unknown data-generating distribution $D_k^{\otimes n}$. Moreover, let $\ell:\mathcal{Z}\times\mathcal{W}\rightarrow\mathbb{R}^{+}$ be a given loss function and let $h_k\in\mathcal{H}$ be a stochastic estimator on client $k$, where $\mathcal{H}$ is the hypothesis class. In the PAC-Bayesian framework, each client holds a tailored prior distribution $P_k$. The objective of each client is to furnish a posterior distribution $Q_k\in\mathcal{M}$, where $\mathcal{M}$ denotes the set of distributions over $\mathcal{H}$. We then define the population risk:

L(Q_{1},\dots,Q_{K})\triangleq\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\,\mathbb{E}_{(x_{k},y_{k})\sim D_{k}}\left[\ell(h_{k}(x_{k}),y_{k})\right], \qquad (1)

and the empirical risk:

\hat{L}(Q_{1},\dots,Q_{K})\triangleq\frac{1}{nK}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\sum_{i=1}^{n}\ell\left(h_{k}(x_{k,i}),y_{k,i}\right), \qquad (2)

by averaging over the posterior distribution of each client. In federated learning, each client uploads its posterior distribution to a central server, and the server aggregates the transmitted models in a weighted manner:

\bar{P}=\prod_{k=1}^{K}P_{k}^{p(k)}, \qquad \bar{Q}=\prod_{k=1}^{K}Q_{k}^{p(k)},

where $\bar{P}$ and $\bar{Q}$ are the global prior and posterior, respectively, and the averaging weight $p=(p(1),\dots,p(K))$ is a probability distribution on $\{1,\dots,K\}$. For the sake of generality, we assume that $p(k)\in(0,1)$ and $\sum_{k=1}^{K}p(k)=1$. For intuition on this aggregation, note that minimizing the weighted objective function is equivalent to maximizing the logarithm of the corresponding posterior: $\min_{h}L(h)=\min_{h}\sum_{k=1}^{K}p(k)L_{k}(h)=\max_{h}\ln\prod_{k=1}^{K}p(h\mid\mathcal{D}_{k})^{p(k)}$. In addition, we denote the Kullback-Leibler (KL) divergence as $D_{KL}(Q\|P)\triangleq\mathbb{E}_{Q}\left[\log\frac{dQ}{dP}\right]$ if $Q\ll P$ and $D_{KL}(Q\|P)=+\infty$ otherwise.
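As a concrete illustration of the risks in (1) and (2), the empirical risk is an expectation over each client's posterior and can be estimated by Monte Carlo sampling. A minimal sketch, assuming synthetic linear hypotheses with a squared loss (the model family, sample counts, and posterior parameters are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, d = 3, 50, 5  # clients, samples per client, feature dimension (illustrative)

# Synthetic local datasets S_k and diagonal-Gaussian posteriors Q_k over linear predictors.
X = rng.normal(size=(K, n, d))
y = rng.normal(size=(K, n))
mu_Q = rng.normal(size=(K, d))    # posterior means, one per client
sigma_Q = np.full((K, d), 0.1)    # posterior standard deviations

def empirical_risk(num_mc=200):
    """Monte Carlo estimate of L_hat(Q_1,...,Q_K) in Eq. (2) by sampling h_k ~ Q_k."""
    total = 0.0
    for k in range(K):
        h = mu_Q[k] + sigma_Q[k] * rng.normal(size=(num_mc, d))  # h_k ~ Q_k
        preds = h @ X[k].T                                       # shape (num_mc, n)
        losses = (preds - y[k]) ** 2                             # illustrative squared loss
        total += losses.mean()       # approximates E_{h~Q_k} (1/n) sum_i ell(h(x), y)
    return total / K

print(empirical_risk())
```

The population risk (1) would be estimated the same way, replacing the fixed dataset with fresh draws from each $D_k$.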

3 Main theorem

In this section, we present our novel bound for the non-IID FL scenario.

Theorem 1 (Federated PAC-Bayesian learning bound).

Assume the loss function $\ell(\cdot,\cdot)$ is bounded in $[0,C]$. Then for any $\delta\in(0,1)$ and any $\lambda>0$, the following inequality holds uniformly for all posterior distributions $Q_1,\dots,Q_K$:

\mathbb{P}_{S_{1},\dots,S_{K}}\bigg\{\forall Q_{1},\dots,Q_{K}:\; L(Q_{1},\dots,Q_{K})\leq\hat{L}(Q_{1},\dots,Q_{K})+\frac{\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{1}{\delta}}{\lambda}+\frac{\lambda C^{2}}{8Kn}\bigg\}>1-\delta \qquad (3)
Proof.

Define the local generalization error $\operatorname{gen}(D_{k},h_{k})=\mathbb{E}_{(x_{k},y_{k})\sim D_{k}}[\ell(h_{k}(x_{k}),y_{k})]-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{k}(x_{k,i}),y_{k,i})$, and the global generalization error $\overline{\operatorname{gen}}(D,h)=L(Q_{1},\dots,Q_{K})-\hat{L}(Q_{1},\dots,Q_{K})=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\operatorname{gen}(D_{k},h_{k})$. For any $\lambda>0$, applying Hoeffding's lemma to $\mathbb{E}[\ell_{i}]-\ell_{i}$, we have that, for each client $k$,

\mathbb{E}_{S_{k}}\,\mathbb{E}_{P_{k}}\left[\mathrm{e}^{\frac{\lambda}{K}\operatorname{gen}(D_{k},h_{k})}\right]\leq\mathrm{e}^{\frac{\lambda^{2}C^{2}}{8K^{2}n}}.

Since each $S_k$ may come from a different $D_k$, i.e., the data are non-IID, we cannot directly plug this result into the standard PAC-Bayesian bound. Noting that for each client $k\in[K]$, $P_k$ is independent of $S_1,\dots,S_K$, we have

\mathbb{E}_{S_{1}}\,\mathbb{E}_{P_{1}}\cdots\mathbb{E}_{S_{K}}\,\mathbb{E}_{P_{K}}\left[\mathrm{e}^{\frac{\lambda}{K}\sum_{k=1}^{K}\operatorname{gen}(D_{k},h_{k})}\right]\leq\mathrm{e}^{\frac{\lambda^{2}C^{2}}{8Kn}}.

We then apply Donsker and Varadhan's variational formula [16] for $P_1,\dots,P_K$ to get:

\mathbb{E}_{S_{1},\dots,S_{K}}\left[\mathrm{e}^{\sup_{Q_{1},\dots,Q_{K}}\lambda\,\mathbb{E}_{Q_{1}}\cdots\mathbb{E}_{Q_{K}}\left[\frac{1}{K}\sum_{k=1}^{K}\operatorname{gen}(D_{k},h_{k})\right]-D_{KL}\left(\prod_{k=1}^{K}Q_{k}^{p(k)}\,\middle\|\,\prod_{k=1}^{K}P_{k}^{p(k)}\right)}\right]\leq\mathrm{e}^{\frac{\lambda^{2}C^{2}}{8Kn}}. \qquad (4)

Recall the definition of the global generalization error:

\overline{\operatorname{gen}}(D,h)=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{h_{k}\sim Q_{k}}\operatorname{gen}(D_{k},h_{k}),

and note that $D_{KL}\left(\prod_{k=1}^{K}Q_{k}^{p(k)}\|\prod_{k=1}^{K}P_{k}^{p(k)}\right)=\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})$. Applying the Chernoff bound:

\mathbb{P}_{S_{1},\dots,S_{K}}\left[\sup_{Q_{1},\dots,Q_{K}}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}>s\right]
\leq\mathbb{E}_{S_{1},\dots,S_{K}}\left[\mathrm{e}^{\sup_{Q_{1},\dots,Q_{K}}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\mathrm{e}^{-s}
\leq\mathrm{e}^{-s}.

Let $\delta=\mathrm{e}^{-s}$, that is, $s=-\log\delta$. Plugging this into the above result, we have

\mathbb{P}_{S_{1},\dots,S_{K}}\bigg\{\exists Q_{1},\dots,Q_{K}:\;\overline{\operatorname{gen}}(D,h)>\frac{1}{\lambda}\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\frac{\lambda C^{2}}{8Kn}+\frac{1}{\lambda}\log\frac{1}{\delta}\bigg\}\leq\delta.

Taking the complement of this event proves the statement. ∎
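The proof rests on per-client KL terms $D_{KL}(Q_k\|P_k)$. For the Gaussian priors and posteriors used later in the paper, these have a closed form that can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_{Q}[\log\frac{dQ}{dP}]$. A quick sketch (all distribution parameters and weights here are illustrative):

```python
import numpy as np

def kl_gauss(mu_q, s_q, mu_p, s_p):
    """Closed-form D_KL(N(mu_q, s_q^2) || N(mu_p, s_p^2)) for 1-D Gaussians."""
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

rng = np.random.default_rng(1)
mu_q, s_q, mu_p, s_p = 0.5, 0.8, 0.0, 1.0

# Monte Carlo estimate of E_Q[log dQ/dP].
x = rng.normal(mu_q, s_q, size=200_000)
log_q = -0.5 * ((x - mu_q) / s_q) ** 2 - np.log(s_q)
log_p = -0.5 * ((x - mu_p) / s_p) ** 2 - np.log(s_p)
mc = (log_q - log_p).mean()
print(kl_gauss(mu_q, s_q, mu_p, s_p), mc)  # the two estimates should agree closely

# The weighted sum sum_k p(k) D_KL(Q_k || P_k) appearing in Theorem 1.
p = np.array([0.2, 0.3, 0.5])
kls = np.array([kl_gauss(0.1 * k, 0.9, 0.0, 1.0) for k in range(3)])
weighted = float(p @ kls)
print(weighted)
```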

The RHS of Equation 3 comprises two components: the empirical term and the complexity term. Note that our bound eschews the typical smoothness and convexity assumptions on the loss often made by other FL frameworks. Moreover, Equation 3 yields the intuition that the bound becomes tighter as the number of clients increases, which is further corroborated by the evaluation in Section 5.3.
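To make this behavior concrete, the RHS of Equation 3 can be computed directly from the empirical risk, the weighted KL terms, and $(\lambda,C,K,n,\delta)$. A small sketch with illustrative numbers (not the paper's experimental values), showing that the $\frac{\lambda C^{2}}{8Kn}$ term shrinks as $K$ grows:

```python
import math

def bound_rhs(emp_risk, kls, weights, lam, C, K, n, delta):
    """RHS of Eq. (3): empirical term plus complexity term."""
    complexity = (sum(w * kl for w, kl in zip(weights, kls))
                  + math.log(1 / delta)) / lam + lam * C**2 / (8 * K * n)
    return emp_risk + complexity

# Illustrative numbers only.
C, n, delta, lam = 1.0, 500, 0.05, 100.0
b10 = bound_rhs(0.2, [2.0] * 10, [1.0 / 10] * 10, lam, C, 10, n, delta)
b50 = bound_rhs(0.2, [2.0] * 50, [1.0 / 50] * 50, lam, C, 50, n, delta)
print(b10, b50)  # with comparable per-client KL, more clients give a tighter bound
```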

Corollary 1 (The choice of λ\lambda).

Suppose $\lambda\in\Xi\triangleq\{0,\dots,\xi\}$ and $|\cdot|$ denotes the cardinality of a set. For any $\delta\in(0,1)$ and a properly chosen $\lambda$, with probability at least $1-\delta$,

L(Q_{1},\dots,Q_{K})\leq\hat{L}(Q_{1},\dots,Q_{K})+C\sqrt{\frac{\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}}{2Kn}}. \qquad (5)
Proof.

Suppose $\mathcal{S}=S_{1}\cap S_{2}\cap\dots\cap S_{K}$ and $\mathcal{Q}=Q_{1}\cap Q_{2}\cap\dots\cap Q_{K}$. Since the previous result (4) holds for any fixed $\lambda$, we can sum (4) over all $\lambda\in\Xi$:

\sum_{\lambda\in\Xi}\mathbb{E}_{\mathcal{S}}\left[\mathrm{e}^{\sup_{\mathcal{Q}}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\leq|\Xi|,

which is equivalent to:

\mathbb{E}_{\mathcal{S}}\left[\mathrm{e}^{\sup_{\mathcal{Q},\lambda\in\Xi}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\leq|\Xi|.

Again, from the Chernoff bound:

\mathbb{P}_{\mathcal{S}}\left[\sup_{\mathcal{Q},\lambda\in\Xi}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}>s\right]
\leq\mathbb{E}_{\mathcal{S}}\left[\mathrm{e}^{\sup_{\mathcal{Q},\lambda\in\Xi}\lambda\overline{\operatorname{gen}}(D,h)-\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})-\frac{\lambda^{2}C^{2}}{8Kn}}\right]\mathrm{e}^{-s}
\leq|\Xi|\,\mathrm{e}^{-s}.

Solving $\delta=|\Xi|\mathrm{e}^{-s}$ for $s$, we get:

\mathbb{P}_{\mathcal{S}}\bigg\{\exists\mathcal{Q},\lambda\in\Xi:\;\overline{\operatorname{gen}}(D,h)>\frac{\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}}{\lambda}+\frac{\lambda C^{2}}{8Kn}\bigg\}\leq\delta.

Choosing a proper minimizer

\lambda^{*}=\frac{1}{C}\sqrt{8Kn\left(\sum_{k=1}^{K}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}\right)},

we obtain the bound of this corollary. ∎

Equation 5 offers a general strategy for selecting the value of the parameter $\lambda$ so that the complexity term is minimized.
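The role of the minimizer $\lambda^{*}$ can be checked numerically: writing $A=\sum_{k}p(k)D_{KL}(Q_{k}\|P_{k})+\log\frac{|\Xi|}{\delta}$, substituting $\lambda^{*}$ into $\frac{A}{\lambda}+\frac{\lambda C^{2}}{8Kn}$ recovers the square-root form of Equation 5. A small sketch with illustrative values:

```python
import math

def lam_star(weighted_kl, C, K, n, delta, xi_size):
    """Minimizer of A/lambda + lambda*C^2/(8Kn), with A = weighted_kl + log(|Xi|/delta)."""
    A = weighted_kl + math.log(xi_size / delta)
    return math.sqrt(8 * K * n * A) / C

# Illustrative values only.
C, K, n, delta, xi = 1.0, 10, 500, 0.05, 100
A = 2.0 + math.log(xi / delta)
lam = lam_star(2.0, C, K, n, delta, xi)

# Substituting lambda* should match the closed form C*sqrt(A/(2Kn)) of Eq. (5).
direct = A / lam + lam * C**2 / (8 * K * n)
closed = C * math.sqrt(A / (2 * K * n))
print(direct, closed)
```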

4 FedPB: Optimize the upper bound

We denote $\mathcal{L}_{k}=\mathbb{E}_{h_{k}\sim Q_{k}}\frac{1}{n}\sum_{i=1}^{n}\ell\left(h_{k}(x_{k,i}),y_{k,i}\right)$. Consider the following local objective function:

\mathcal{J}(Q_{k})=\lambda\mathcal{L}_{k}+p(k)D_{KL}(Q_{k}\|P_{k}) \qquad (6)

In our methodology, we introduce FedPB for a general scenario. It comprises two phases, designed to iteratively optimize the priors and posteriors for each client. Notably, in contrast to previous studies, clients are not required to upload their private prior and posterior distributions to the server, ensuring their privacy.

Phase 1 (Optimize the posterior). Given a fixed parameter $\lambda>0$ and the prior $P_{k}^{t}$ during training epoch $t+1$, we optimize the posterior as $\hat{Q}_{k}^{t+1}=\arg\min_{Q_{k}}\mathcal{J}(Q_{k})$, yielding the solution:

\frac{d\hat{Q}_{k}^{t+1}}{dP_{k}^{t}}(h)=\frac{\exp\left(-\lambda\ell(h,z_{i})\right)}{\mathbb{E}_{h\sim P_{k}^{t}}\left[\exp\left(-\lambda\ell(h,z_{i})\right)\right]}.

Phase 2 (Optimize the prior). Having derived the optimal posterior $\hat{Q}_{k}^{t+1}$, the prior is updated as $\hat{P}_{k}^{t+1}=Q_{k}^{t}$, since this choice minimizes $D_{KL}(Q_{k}\|P_{k})$.
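On a finite hypothesis set, the two phases reduce to elementary operations: the Phase-1 Gibbs posterior reweights the prior by $\exp(-\lambda\cdot\text{loss})$, and Phase 2 copies the posterior into the next prior. A minimal sketch (the hypothesis set, losses, and $\lambda$ are illustrative):

```python
import numpy as np

def gibbs_posterior(prior, losses, lam):
    """Phase-1 closed form on a finite hypothesis set:
    Q(h) is proportional to P(h) * exp(-lam * loss(h))."""
    w = prior * np.exp(-lam * losses)
    return w / w.sum()

# Illustrative: 4 hypotheses with a uniform prior.
prior = np.full(4, 0.25)
losses = np.array([0.9, 0.5, 0.1, 0.7])
Q = gibbs_posterior(prior, losses, lam=5.0)
print(Q)  # mass concentrates on the low-loss hypothesis

# Phase 2: the next prior is simply the current posterior.
next_prior = Q.copy()
```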

Link with personalized federated learning. Since the prior $\hat{P}_{k}^{t+1}=Q_{k}^{t}$ and $Q_{k}^{t}$ is equal to the aggregated global posterior at epoch $t-1$, the prior can be viewed as the global knowledge. Optimizing the objective function (6) minimizes the disparity between global and local knowledge, a prevalent personalization strategy in FL [17].

Re-parameterization trick. Utilizing the Bayesian neural network [18] as the local model aligns with our setting, where all parameters are random and we optimize their posterior distribution. In particular, the prior $P_{k}$ and posterior $Q_{k}$ are defined as follows:

P_{k}(w;\vartheta_{P_{k}})=\mathcal{N}\left(w;\mu_{P_{k}},\sigma_{P_{k}}^{2}I_{d}\right)=\prod_{i=1}^{d}\mathcal{N}\left(w_{i};\mu_{P_{k},i},\sigma_{P_{k},i}^{2}\right),
Q_{k}(w;\vartheta_{Q_{k}})=\mathcal{N}\left(w;\mu_{Q_{k}},\sigma_{Q_{k}}^{2}I_{d}\right)=\prod_{i=1}^{d}\mathcal{N}\left(w_{i};\mu_{Q_{k},i},\sigma_{Q_{k},i}^{2}\right),

with every model parameter $w_{i}$ being independent. Computing the Gibbs posterior directly can be challenging, hence we use gradient descent as an alternative. The update rule of (6) at round $t+1$ is:

\mu_{Q_{k},i}^{t+1}=\mu_{Q_{k},i}^{t}-\lambda\nabla_{\mu_{Q_{k},i}}\mathcal{L}_{k}-\frac{p(k)\left(\mu_{Q_{k},i}-\mu_{P_{k},i}\right)}{\sigma_{Q_{k},i}^{2}},
\sigma_{Q_{k},i}^{t+1}=\sigma_{Q_{k},i}^{t}-\lambda\nabla_{\sigma_{Q_{k},i}}\mathcal{L}_{k}+\frac{p(k)\left(\sigma_{P_{k},i}^{2}-\sigma_{Q_{k},i}^{2}+\left(\mu_{P_{k},i}-\mu_{Q_{k},i}\right)^{2}\right)}{\sigma_{Q_{k},i}^{3}},

where the parameter $\lambda$ can be regarded as the learning rate of the gradient descent and the KL divergence acts as the regularization term. Calculating the gradients $\nabla_{\mu_{Q_{k},i}}$ and $\nabla_{\sigma_{Q_{k},i}}$ directly can be intricate, but the re-parameterization trick tackles this issue. Concretely, we translate $h\sim Q_{k}$ into $\varepsilon\sim\mathcal{N}(0,I_{d})$ and compute the deterministic function $h=\mu+\sigma\odot\varepsilon$, where $\odot$ signifies element-wise multiplication. As a result, we have $\nabla_{\mu_{Q_{k},i}}\mathbb{E}_{h\sim Q_{k}}\ell(h)=\nabla_{\mu_{Q_{k},i}}\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I_{d})}\ell(\mu+\sigma\odot\varepsilon)$, indicating its computability in an end-to-end framework with back-propagation.
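The re-parameterization trick can be checked on a toy loss. Assuming a hypothetical quadratic loss $\ell(h)=\frac{1}{2}\|h-a\|^{2}$ (chosen only because its exact posterior-mean gradient $\mu-a$ is known in closed form), the Monte Carlo gradient through $h=\mu+\sigma\odot\varepsilon$ matches the analytic one:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = np.array([1.0, -0.5, 2.0])
sigma = np.array([0.3, 0.2, 0.1])
a = np.zeros(d)  # target of the illustrative quadratic loss l(h) = ||h - a||^2 / 2

# Re-parameterization: h = mu + sigma * eps with eps ~ N(0, I_d), so the
# gradient with respect to mu passes through the deterministic map.
eps = rng.normal(size=(100_000, d))
h = mu + sigma * eps
grad_mc = (h - a).mean(axis=0)  # MC estimate of grad_mu E_{h~Q} l(h), since grad_h l = h - a

grad_exact = mu - a             # analytic gradient for this quadratic loss
print(grad_mc, grad_exact)
```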

5 Evaluation

In this section, we validate our algorithm and theoretical arguments in the non-IID FL setting. Specifically, the aggregation weight $p(k)$ is defined as the sample ratio of client $k$ relative to the entire data size across all clients. For the global aggregation, the global mean and covariance are calculated by $\bar{\mu}=\sum_{k=1}^{K}p(k)\mu_{k}\sigma_{k}^{-2}/\sum_{k=1}^{K}p(k)\sigma_{k}^{-2}$ and $\bar{\sigma}=1/\sum_{k=1}^{K}p(k)\sigma_{k}^{-2}$, respectively. Furthermore, we utilize two real-world datasets: MedMNIST (medical image analysis) [19] and CIFAR-10 [20]. For each dataset, we adopt three distinct data-generating approaches for local clients: 1) Balanced: each client holds an equal number of samples; 2) Unbalanced: varying sample counts per client (e.g., $[0.05,0.05,0.05,0.05,0.1,0.1,0.1,0.1,0.2,0.2]$ for 10 clients); 3) Dirichlet: differing sample counts per client following a Dirichlet distribution [21]. The entire FL system encompasses $K=10$ clients, initializing their posterior models uniformly from a global posterior model.
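The precision-weighted aggregation of $\bar{\mu}$ and $\bar{\sigma}$ above can be sketched directly. A minimal example (the client parameters and weights are illustrative):

```python
import numpy as np

def aggregate(mus, sigmas, p):
    """Server-side aggregation following the formulas in Section 5:
    mu_bar    = sum_k p(k) mu_k sigma_k^{-2} / sum_k p(k) sigma_k^{-2},
    sigma_bar = 1 / sum_k p(k) sigma_k^{-2}."""
    prec = p[:, None] / sigmas**2                    # p(k) * sigma_k^{-2}, per parameter
    mu_bar = (prec * mus).sum(axis=0) / prec.sum(axis=0)
    sigma_bar = 1.0 / prec.sum(axis=0)
    return mu_bar, sigma_bar

# Illustrative: 3 clients, 2 parameters, sample-ratio weights.
mus = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
sigmas = np.ones((3, 2))
p = np.array([0.25, 0.5, 0.25])

mu_bar, sigma_bar = aggregate(mus, sigmas, p)
print(mu_bar, sigma_bar)  # with equal variances, mu_bar is just the p-weighted mean
```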

Additionally, we deploy two versions of Bayesian neural networks: one with 2 convolutional layers for MedMNIST and another with 3 layers for CIFAR-10. The cross-entropy loss serves as our loss function, optimized by the Adam optimizer [22] with a learning rate of 1e-3.

5.1 Bound evaluation

To validate our bounds, we set the confidence level to $1-\delta=95\%$. Our evaluation underscores a correlation between the generalization error and the complexity, emphasizing the tightness of our bound. Fig. 1 illustrates an initial increase in the generalization error and a concurrent decrease in complexity during the early stages of training, attributed to empirical loss optimization. Subsequently, as neural network training advances, the KL divergence stabilizes. Throughout this progression, we observe that the generalization error is consistently bounded by the complexity value.

[Figure 1: four panels showing (a) generalization error for MedMNIST, (b) accuracy for MedMNIST, (c) generalization error for CIFAR-10, (d) accuracy for CIFAR-10.]

Fig. 1: The results of the generalization error and model performance of FedPB over the Dirichlet generating method.

5.2 Data-dependent prior

Here, we perform an ablation study of the data-dependent (trainable) prior compared with the data-independent (fixed, chosen before training) prior and report the mean ± standard deviation (std) accuracy of the global model in Table 1, evaluated over multiple experimental seeds. The results demonstrate the superior efficacy of the data-dependent strategy on both datasets across all three scenarios. This superiority arises from the data-dependent prior's ability to harness more global knowledge, combined with its adaptability during training.

Table 1: Model accuracy (%) for the data-independent prior and data-dependent prior in three data-generating scenarios (best results in bold in the original).

                    MedMNIST                                      CIFAR-10
Method              Balanced       Unbalanced     Dirichlet       Balanced       Unbalanced     Dirichlet
Data-independent    53.47 ± 1.12   49.44 ± 1.10   55.24 ± 6.92    50.89 ± 0.62   47.19 ± 0.92   57.93 ± 0.55
Data-dependent      77.10 ± 4.25   77.34 ± 3.42   77.48 ± 4.75    84.41 ± 0.94   79.39 ± 0.56   86.11 ± 0.53

5.3 Different client scales

Lastly, we assess the influence of varying client scales on our complexity bounds. As depicted in Fig. 2, in the evaluation of both datasets with the Dirichlet generating method, increasing the number of clients $K$ from 10 to 20 and 50 yields a consistent decrease in the complexity term. This observation aligns with our analysis of Equation 3.

Fig. 2: The impact of different client scales over FedPB on the value of the complexity term.

References

  • [1] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
  • [2] Zihao Zhao, Yuzhu Mao, Yang Liu, Linqi Song, Ye Ouyang, Xinlei Chen, and Wenbo Ding, “Towards efficient communications in federated learning: A contemporary survey,” Journal of the Franklin Institute, 2023.
  • [3] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [4] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang, “On the convergence of fedavg on non-iid data,” arXiv preprint arXiv:1907.02189, 2019.
  • [5] David A McAllester, “Some pac-bayesian theorems,” in Proceedings of the eleventh annual conference on Computational learning theory, 1998, pp. 230–234.
  • [6] David A McAllester, “Pac-bayesian model averaging,” in Proceedings of the twelfth annual conference on Computational learning theory, 1999, pp. 164–170.
  • [7] Matthias Seeger, “Pac-bayesian generalisation error bounds for gaussian process classification,” Journal of machine learning research, vol. 3, no. Oct, pp. 233–269, 2002.
  • [8] Olivier Catoni, “Pac-bayesian supervised classification: the thermodynamics of statistical learning,” arXiv preprint arXiv:0712.0248, 2007.
  • [9] Luca Oneto, Michele Donini, Massimiliano Pontil, and John Shawe-Taylor, “Randomized learning and generalization of fair and private classifiers: From pac-bayes to stability and differential privacy,” Neurocomputing, vol. 416, pp. 231–243, 2020.
  • [10] Milad Sefidgaran, Romain Chor, and Abdellatif Zaidi, “Rate-distortion theoretic bounds on generalization error for distributed learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 19687–19702, 2022.
  • [11] LP Barnes, Alex Dytso, and H Vincent Poor, “Improved information theoretic generalization bounds for distributed and federated learning,” in 2022 IEEE International Symposium on Information Theory (ISIT). IEEE, 2022, pp. 1465–1470.
  • [12] Milad Sefidgaran, Romain Chor, Abdellatif Zaidi, and Yijun Wan, “Federated learning you may communicate less often!,” arXiv preprint arXiv:2306.05862, 2023.
  • [13] Romain Chor, Milad Sefidgaran, and Abdellatif Zaidi, “More communication does not result in smaller generalization error in federated learning,” arXiv preprint arXiv:2304.12216, 2023.
  • [14] Sai Anuroop Kesanapalli and BN Bharath, “Federated algorithm with bayesian approach: Omni-fedge,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3075–3079.
  • [15] Xiaojin Zhang, Anbu Huang, Lixin Fan, Kai Chen, and Qiang Yang, “Probably approximately correct federated learning,” arXiv preprint arXiv:2304.04641, 2023.
  • [16] Monroe D Donsker and SR Srinivasa Varadhan, “On a variational formula for the principal eigenvalue for operators with maximum principle,” Proceedings of the National Academy of Sciences, vol. 72, no. 3, pp. 780–783, 1975.
  • [17] Xu Zhang, Yinchuan Li, Wenpeng Li, Kaiyang Guo, and Yunfeng Shao, “Personalized federated learning via variational bayesian inference,” in International Conference on Machine Learning. PMLR, 2022, pp. 26293–26310.
  • [18] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter, “Bayesian optimization with robust bayesian neural networks,” Advances in neural information processing systems, vol. 29, 2016.
  • [19] Jiancheng Yang, Rui Shi, and Bingbing Ni, “Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis,” in IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021, pp. 191–195.
  • [20] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • [21] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni, “Bayesian nonparametric federated learning of neural networks,” in International conference on machine learning. PMLR, 2019, pp. 7252–7261.
  • [22] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.