
CoQ-FL: Local Differential Privacy for Federated Learning with Correlated Quantization

Abstract

Federated learning (FL) with local differential privacy (LDP) protects clients by perturbing their model updates before they reach the server, but independent perturbation inflates the variance of the aggregated model. We propose an LDP-FL method in which pairs of clients use shared common randomness, exchanged via end-to-end encryption, to produce negatively correlated perturbations without revealing any private information. We prove that the resulting aggregation remains unbiased and $\epsilon$-differentially private, that the induced correlation is the minimum achievable in the shared-codebook setup, and that for a given level of accuracy the method requires only 50% of the privacy budget of independent weight perturbation. Empirical evaluations across various federated learning scenarios and datasets complement the analysis.

Introduction

Federated learning (FL) (mcmahan2017communication) is an innovative approach to training machine learning (ML) models on decentralized datasets distributed across multiple devices or servers. This method employs locally stored data to train ML models without centralizing or directly accessing the raw data, thereby maintaining a basic level of privacy. In FL, model training occurs without sharing any sensitive data with the central server. Instead, each distributed device (client) updates the model based on its local data and shares the updated model with the server. The server then aggregates the model updates from all clients to update the global model. This aggregation process, known as distributed mean estimation (DME), is fundamental in distributed systems and aims to accurately estimate the mean of data distributed across multiple clients.

Despite the advantage of FL in enhancing privacy by sharing only model updates rather than the clients' datasets, recent research has identified potential information leakage. Studies have shown that, even knowing only the updated model, an adversary can still recover sensitive information about the users' datasets (hitaj2017deep; ganju2018property; melis2019exploiting; wang2019beyond; nasr2019comprehensive; geiping2020inverting; zhao2020idlg; zhu2019deep; shokri2017membership; wei2020federated).

To mitigate information leakage in FL, differential privacy (DP) provides an additional layer of security (dwork2006our; dwork2014algorithmic; abadi2016deep; koloskova2023gradient). DP has been applied to FL to protect clients' information (geyer2017differentially; mcmahan2017learning; zhao2020local; sun2020ldp; cheng2022differentially). To ensure local differential privacy (LDP) in FL, each client perturbs the updated model before sending it to the server, making it difficult for an adversary to infer specific information about the client's data.

Advanced weight-perturbation techniques for preserving LDP include injecting noise into the model updates (kairouz2021distributed; agarwal2018cpsgd; agarwal2021skellam; canonne2020discrete) and encoding the local updates into the parameters of distributions, followed by sampling from the resulting distribution (sun2020ldp; chen2022poisson; wang2019locally; bhowmick2018protection). Methods that map the weights to probability distributions have shown advantages over noise-adding methods in communication efficiency and unbiasedness. However, the main challenge is minimizing the variance of the estimate so as to achieve stable model updates, which requires increasing either the number of quantization points (communication cost) or the number of clients.

To further improve the efficiency of mean estimation in FL, researchers have explored correlated quantization. This approach introduces correlation into the quantization process, allowing each client to quantize in a way that compensates for the movements of the others. (suresh2022correlated) examines correlation between clients in the quantization problem and demonstrates that such correlation can decrease variance and increase DME accuracy, thereby improving the overall performance of the FL system. However, their method improves the accuracy of the server's aggregation; providing a privacy guarantee is not its purpose.

In this paper, we consider encoding weights into probability distributions in Local Differential Privacy Federated Learning (LDP-FL) and propose a private quantization method that uses correlated randomness. We prove that this method is unbiased and improves the mean-estimation variance while preserving privacy. The proposed method for creating correlated randomness is applicable to different private quantization methods. To our knowledge, this is the first paper to consider correlated randomness in differential privacy, and we show theoretically and experimentally that it decreases variance and increases model accuracy.

Our Contribution

This work presents a novel algorithm that introduces negative correlation between clients' perturbations in federated learning systems, resulting in significant improvements in privacy-preserving data aggregation. Our key contributions are as follows:

  • We introduce an algorithm for locally differentially private FL that creates negative correlation between pairs of clients, making their weight perturbations negatively correlated. This negative correlation reduces the variance of the estimated mean in the server's aggregation process. By exploiting it, we achieve a more efficient use of the privacy budget without compromising the quality of the aggregated model.

  • We show that for a given level of accuracy (variance or mean estimation error), our algorithm requires only 50% of the privacy budget ($\epsilon$) compared to independent weight perturbation, allowing stronger privacy guarantees without sacrificing the utility of the federated learning model.

  • We provide a mathematical proof that our algorithm achieves the minimum possible correlation between two agents in the shared-codebook setup. We also formally state the problem of creating minimal correlation among $n$ clients.

  • Our results are complemented by extensive empirical evaluations, demonstrating the practical efficacy of our method across various federated learning scenarios and datasets.

Related Work

Differential Privacy in Federated Learning

Federated learning is a method of collaborative training that is most applicable when clients wish to contribute to the model update but do not want to share their private data with the server. The concept was initially developed by Google (konecny2016federated; konevcny2016flondevice; mcmahan2017communication). Much recent research in this field has focused on improving FL by addressing the statistical challenges of data distribution and making communication efficient (smith2017federated; kairouz2021advances; li2019convergence; zhao2018federated; li2020federated; lin2020ensemble; li2021model; zhang2022fine; wang2021field).

By implementing local differential privacy, federated learning protocols can maintain the utility of collaborative model training while providing provable privacy guarantees for individual clients. Different methods have been used to guarantee the privacy of local information, including shuffling of client updates (girgis2021renyi; sun2020ldp) or adding Gaussian noise with different variances to achieve adjustable local privacy for each client (wei2021user).

Quantization and DME

In federated learning with complex models, the parameters to be transmitted by the clients are numerous, and sending these updates can become a bandwidth-intensive process. To address this, quantization methods have been extensively proposed to enhance communication efficiency. Significant contributions in this field include the works of (konecny2016federated; alistarh2017qsgd; bernstein2018signsgd; reisizadeh2020fedpaq; shlezinger2020uveqfed; chen2024mixed; shahmiri2024communication), which demonstrate various strategies for effective quantization of weights during federated learning.

The primary objective of quantization in this context is to reduce the size of the data transmission and consequently the uplink communication cost. After receiving the quantized weights, the server must estimate the global model update, a challenge known as the Distributed Mean Estimation (DME) problem. Its objective is to accurately estimate the global average from the quantized inputs (weights/gradients) provided by multiple clients, without specific assumptions about the underlying data distribution. Key studies addressing this challenge include (vargaftik2021drive; vargaftik2022eden; chen2020breaking; suresh2017distributed; konevcny2018randomized). In addition, some studies have exploited inter-client correlations and side information to improve the estimation process (mayekar2021wyner; jiang2024correlation; jhunjhunwala2021leveraging). In this context, (suresh2022correlated) utilized correlated quantization to enhance the accuracy of mean estimation.

Differential Privacy and Quantization

Several researchers have developed methods utilizing stochastic quantization and discrete noise to create unbiased estimators that guarantee differential privacy. Due to the discrete nature of quantization, it is generally more appropriate to first quantize the data and then add discrete noise, rather than add continuous noise and then quantize; the latter approach can introduce bias into the estimation, which is undesirable in federated learning (FL) applications. Studies by (agarwal2018cpsgd; kairouz2021distributed; agarwal2021skellam) have applied this concept to develop communication-efficient differentially private methods in FL, adding noise drawn from the Binomial, discrete Gaussian, and Skellam distributions, respectively.

An important perspective in the context of communication-efficient LDP-FL is the integration of differential privacy into the quantization process itself (cai2024privacy; chen2022poisson; chaudhuri2022privacy; youn2023randomized; sun2020ldp; wang2019locally; bhowmick2018protection; he2023clustered; chamikara2022local). Quantization compresses information by transferring less data about the client, which inherently increases privacy. The impact of quantization on privacy has been explored through various discrete DP methods in (jin2024breaking), which also proposed a ternary stochastic compressor with a DP guarantee.

In DP compression approaches, the objective is to create quantization methods that inherently preserve privacy while ensuring unbiased mean estimation at the server. This concept has been explored in depth by (youn2023randomized; chen2022poisson), who investigated randomized and Poisson-Binomial quantization schemes, respectively. (jin2020stochastic) proposed one-bit private quantization of gradients in the context of FL, while (girgis2021shuffled) proposed $b$-bit private quantization combined with shuffling. One-bit quantizers with DP have also been investigated in (nguyen2016collecting; wang2019locally; sun2020ldp).

Preliminaries

Federated Learning

We consider an FL scenario where $n$ clients $\mathcal{C}_{i}$, $i\in[n]$, communicate with a single server $\mathcal{S}$ to train a global model iteratively over multiple rounds. Each client has access to a local dataset $\mathcal{D}_{i}$, $i\in[n]$. In each round, the server sends the current global model to the clients. Each client $\mathcal{C}_{i}$ trains its local model based on $\mathcal{D}_{i}$ and sends the model update vector $w^{d}_{(i)}$ back to the server, where $d$ is the update dimension, e.g., the number of model parameters. The server aggregates the model updates and computes $w^{d}=\sum_{i\in[n]}w^{d}_{(i)}$.


Problem Formulation

The problem of Federated Learning (FL) is closely related to the concept of Distributed Mean Estimation (DME); in fact, FL can be formulated as a DME problem (suresh2017distributed). In the FL process, client $i$ updates the model weights $w_{i}$ based on its local data and then perturbs these updates ($w_{i}^{*}=\mathcal{M}(w_{i})$) before sending them to the server. The server aggregates the perturbed weights by computing their mean, $\bar{w}^{*}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{M}(w_{i})$, and then updates the global model parameters using $\bar{w}^{*}$. The main goal is to design a perturbation method that keeps the aggregation unbiased, $E[\bar{w}^{*}]=\bar{w}=\frac{1}{n}\sum_{i=1}^{n}w_{i}$, while satisfying the differential privacy constraint defined in Definition 1 (dwork2006our).

Definition 1.

Differential Privacy: A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-differential privacy if for any two adjacent datasets $\mathcal{D},\mathcal{D}'$ and all sets $S$ in the range of $\mathcal{M}$, it holds that:

Pr[\mathcal{M}(\mathcal{D})\in S]\leq e^{\epsilon}\,Pr[\mathcal{M}(\mathcal{D}')\in S] \qquad (1)

where $\epsilon\geq 0$ is the privacy budget that controls the trade-off between privacy and utility.

A significant challenge in the model update process is that, even with an unbiased, differentially private method, a large variance ($Var(\bar{w}^{*})=E[(\bar{w}^{*})^{2}]-\bar{w}^{2}$) can degrade the accuracy of FL. To achieve lower variance, one can increase the number of clients (not possible when only a limited number of clients respond to the server) or the number of quantization points in quantization-based DP methods (which increases the communication overhead). In this paper, we introduce correlated random variables into the LDP-FL method to achieve better variance under the same DP constraints, and we investigate how creating correlation in FL can increase accuracy without violating differential privacy requirements.

In this setting, we consider the server to be honest but curious: it performs the updates at each iteration based on the information gathered from the clients, but may attempt to extract additional information about them. Therefore, the clients' information must be differentially private. We also consider protecting client information from eavesdroppers who may access the models transmitted by clients and seek to extract client-specific information. Given the large number of clients in FL scenarios and the limited effect each client has on the global model update (each client sends weights in the range $[c-r,c+r]$), we do not consider the impact of malicious clients and assume all clients are trustworthy.

LDP-FL (sun2020ldp) guarantees user-level DP in FL by encoding weight information into a probability distribution; each client then samples from that distribution and sends only one bit of information to the server per weight. In each communication round, the server sends the global model along with a center $c$ and range $r$ for the weights of each layer, such that $w_{i}\in[c-r,c+r]$. After updating the weights based on their local datasets $\mathcal{D}_{i}$, clients sample each perturbed weight from the following distribution, which encodes the weight in a probability distribution:

w_{i}^{*}=c+U_{i}\,r\alpha(\epsilon),\quad U_{i}=\begin{cases}+1 & \text{WP}\ \ \frac{1}{2}+\frac{w_{i}-c}{2r\alpha(\epsilon)}\\ -1 & \text{WP}\ \ \frac{1}{2}-\frac{w_{i}-c}{2r\alpha(\epsilon)}\end{cases} \qquad (2)

where WP abbreviates "with probability", $\alpha(\epsilon)=\frac{e^{\epsilon}+1}{e^{\epsilon}-1}$, and $\epsilon$ is the privacy budget. The parameter $\alpha(\epsilon)$ controls privacy and ensures $\epsilon$-DP for the clients' weight information. For small $\epsilon$, $\alpha(\epsilon)$ is large, making $U_{i}$ close to $Ber(0.5)$: clients send the upper or lower bound with nearly equal probability, independent of $w_{i}$, preventing information leakage. As $\epsilon$ increases, $\alpha(\epsilon)$ approaches one, and the problem reduces to binary quantization (suresh2017distributed; suresh2022correlated) without privacy guarantees. We present the following lemma and theorem to show that this perturbation method is unbiased and $\epsilon$-differentially private.

Lemma 1.

In FL, if each client uses the randomized algorithm in Equation 2, the server's aggregation will be unbiased, i.e., $E[\mathcal{M}(w)]=\bar{w}$.

Considering Equation 2, the expectation of each perturbed weight is $E[w_{i}^{*}]=w_{i}$. Consequently, the expectation of the average of the perturbed weights across all clients equals the average of the actual weights: $E[\frac{1}{n}\sum_{i=1}^{n}w_{i}^{*}]=\frac{1}{n}\sum_{i=1}^{n}E[w_{i}^{*}]=\frac{1}{n}\sum_{i=1}^{n}w_{i}=\bar{w}$, proving that the perturbation method is unbiased. Having established unbiasedness, we now examine the privacy guarantee.
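For concreteness, the following minimal Python sketch (our own illustration; the function names are hypothetical) implements the per-weight perturbation of Equation 2 and empirically checks the unbiasedness stated in Lemma 1:

import math
import random

def alpha(eps: float) -> float:
    # alpha(eps) = (e^eps + 1) / (e^eps - 1), as defined after Equation 2
    return (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)

def perturb_weight(w: float, c: float, r: float, eps: float) -> float:
    # Perturb a single weight w in [c - r, c + r] according to Equation 2
    a = alpha(eps)
    p_plus = 0.5 + (w - c) / (2.0 * r * a)   # P(U = +1)
    u = +1.0 if random.random() < p_plus else -1.0
    return c + u * r * a

# Empirical check of Lemma 1: the sample mean of w* approaches w
random.seed(1)
w, c, r, eps = 0.2, 0.0, 0.5, 1.0
est = sum(perturb_weight(w, c, r, eps) for _ in range(200_000)) / 200_000
print(f"true weight {w:.3f}, empirical mean of w* {est:.3f}")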

Theorem 1.

Let $w\in[c-r,c+r]$ be a single weight. The perturbation method defined in Equation 2 satisfies $\epsilon$-differential privacy as per Definition 1.

Proof.

To prove the theorem, we need to show that for any two possible input weights $w$ and $w'$, and any output $w^{*}$:

\frac{Pr[\mathcal{M}(w)=w^{*}]}{Pr[\mathcal{M}(w')=w^{*}]}\leq e^{\epsilon}

We consider $U=+1$ (the argument for $U=-1$ is similar). When $U=+1$, $w^{*}=c+r\alpha(\epsilon)$. The worst-case ratio compares the maximum probability (attained at $w=c+r$) to the minimum probability (attained at $w'=c-r$):

\frac{Pr[\mathcal{M}(w)=w^{*}]}{Pr[\mathcal{M}(w')=w^{*}]} \leq \frac{\max_{w}Pr[\mathcal{M}(w)=w^{*}]}{\min_{w'}Pr[\mathcal{M}(w')=w^{*}]} = \frac{\frac{1}{2}+\frac{1}{2\alpha(\epsilon)}}{\frac{1}{2}-\frac{1}{2\alpha(\epsilon)}} = \frac{\alpha(\epsilon)+1}{\alpha(\epsilon)-1} = e^{\epsilon}

For the case $U=-1$, the proof follows the same steps and yields the same result. ∎
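The closed-form step above can also be checked numerically; the short script below (our own sketch) verifies that the worst-case ratio equals $e^{\epsilon}$ for several privacy budgets:

import math

for eps in [0.1, 0.5, 1.0, 2.0, 5.0]:
    a = (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)
    p_max = 0.5 + 1.0 / (2.0 * a)   # P(U = +1) at w  = c + r
    p_min = 0.5 - 1.0 / (2.0 * a)   # P(U = +1) at w' = c - r
    assert abs(p_max / p_min - math.exp(eps)) < 1e-9
    print(f"eps = {eps}: worst-case ratio = {p_max / p_min:.6f}")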

As Theorem 1 shows, $w$ must lie in the range $[c-r,c+r]$. The center $c$ and range $r$ should therefore be selected so that the updated weights remain within this region. One way to guarantee this is to enlarge $r$; however, a larger range leads to higher variance and reduced accuracy in the final model. The server determines the center and range from the weight ranges of the previous step and a clipping parameter. To constrain weights within the specified range, regularization terms can be added to the local objective function (cheng2022differentially), or Kashin's representation can be used to limit values within the bound (chen2022poisson; chen2024privacy). These methods balance privacy (keeping weights within the specified range) against model accuracy (avoiding excessive expansion of the range).

Enhancing Accuracy through Correlated Randomization in DP-FL

One important method for ensuring LDP-FL encodes client data into discrete distributions as a perturbation mechanism (sun2020ldp). While this approach preserves privacy, it can increase the variance of the output and thereby compromise accuracy. This section introduces a novel approach: creating correlated random variables across clients without compromising individual privacy, thereby maintaining the privacy guarantees while improving accuracy and reducing variance. In the perturbation method of Equation 2, the variance of each random variable $w^{*}_{i}$ is bounded as follows:

r^{2}(\alpha^{2}(\epsilon)-1)\leq Var(w_{i}^{*})=\alpha^{2}(\epsilon)r^{2}-(w_{i}-c)^{2}\leq r^{2}\alpha^{2}(\epsilon) \qquad (3)

The variance of the estimate for $n$ clients is:

Var\left(\frac{1}{n}\sum_{i=1}^{n}w^{*}_{i}\right)=\frac{1}{n^{2}}\sum_{i=1}^{n}Var(w^{*}_{i})+\frac{2}{n^{2}}\sum_{i<j}Cov(w^{*}_{i},w^{*}_{j}) \qquad (4)

When the perturbed weights are generated independently, the covariance terms are zero and the variance is upper-bounded by $\frac{\alpha^{2}(\epsilon)r^{2}}{n}$. To decrease the estimation variance, we propose introducing negative correlation between the randomized mechanisms operating on each weight. Our approach, detailed in Algorithm 1, creates negative correlation between clients without sharing private information. The key steps are:

Common Randomness (CR) Generation: One client (e.g., Alice) creates a random vector and shares it with another client (e.g., Bob) using end-to-end encryption, preventing the server from learning the common randomness. This shared randomness is generated independently of the client data, so no information about a client can be gained from the CR. The method also extends to settings where clients have access to locally generated correlated randomness; the main difference is that a fundamental limit then exists on the achievable negative correlation between the clients' outputs, governed by the correlation of their distributed observations (witsenhausen1975sequences; gacs1973common; ghazi2016decidability).

Codebook Creation: Each client creates a codebook of size $2^{N_{R}}$, where $N_{R}$ is the number of shared random bits per model weight. The probability of each sequence $s$ in the codebook is determined by the bias $p_{x}$ of a coin and is calculated as $P(s)=p_{x}^{w_{H}(s)}(1-p_{x})^{N_{R}-w_{H}(s)}$, where $w_{H}(s)$ is the Hamming weight of the sequence.

Marginal Computation and Index Matching: Alice finds the codebook index at which the cumulative sum of sequence probabilities equals her marginal (derived from Equation 2) and sets it as her marginal sequence. This ensures that the probability of the shared coin tosses producing a sequence among the preceding sequences matches Alice's marginal. Bob, conversely, finds the index at which the sum of sequence probabilities after that index equals his marginal and sets it as his marginal sequence, so the probability of the shared coin tosses producing a sequence among the subsequent sequences matches Bob's marginal.

Correlated Perturbation: Each client compares the common-randomness sequence with its marginal sequence to perturb its weights. If the CR falls among the sequences before Alice's marginal sequence, she outputs $c+r\alpha(\epsilon)$; for sequences after it, she outputs $c-r\alpha(\epsilon)$. Conversely, Bob sends $c-r\alpha(\epsilon)$ if the CR falls before his marginal sequence and $c+r\alpha(\epsilon)$ if it falls after (see Figure 1). A minimal code sketch of these steps follows.
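The sketch below (our own illustration, not a reference implementation) instantiates the four steps for a single pair of clients. It assumes $p_{x}=0.5$, so the shared bits form a codebook index that is uniform on $\{0,\dots,2^{N_{R}}-1\}$, and it omits the boundary local coin handled in lines 16-17 of Algorithm 2:

import math
import random

def alpha(eps: float) -> float:
    return (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)

def marginal(w: float, c: float, r: float, a: float) -> float:
    # P(U = +1) from Equation 2
    return 0.5 + (w - c) / (2.0 * r * a)

def correlated_pair(w_a, w_b, c, r, eps, n_rand=8):
    # One draw of common randomness, interpreted as a codebook index
    a = alpha(eps)
    size = 2 ** n_rand
    cr = random.randrange(size)            # uniform because p_x = 0.5
    # Alice counts her +1 region from the start of the codebook ...
    u_a = +1 if cr < marginal(w_a, c, r, a) * size else -1
    # ... while Bob counts his +1 region from the end, so the two
    # +1 regions overlap as little as possible (Figure 1)
    u_b = +1 if cr >= (1.0 - marginal(w_b, c, r, a)) * size else -1
    return c + u_a * r * a, c + u_b * r * a

# Marginals are (approximately) preserved while the covariance is negative
random.seed(0)
S = [correlated_pair(0.1, 0.2, 0.0, 0.5, 1.0) for _ in range(100_000)]
ma = sum(x for x, _ in S) / len(S)
mb = sum(y for _, y in S) / len(S)
cov = sum((x - ma) * (y - mb) for x, y in S) / len(S)
print(f"E[w_a*] = {ma:.3f} (target 0.1), E[w_b*] = {mb:.3f} (target 0.2), cov = {cov:.3f}")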

This method introduces negative correlation between pairs of users, as described in the following lemma.

Figure 1: Generation of negatively correlated random outputs while preserving marginals. Alice and Bob select from opposite directions in the codebook. Red regions indicate areas where, if Alice selects the less probable output, Bob compensates by selecting the more probable output (and vice versa), without knowledge of each other's decisions.
Lemma 2.

In Algorithm 1, the covariance between the outputs of two users $i$ and $j$ is negative and is given by:

Cov(w^{*}_{i},w^{*}_{j})=\begin{cases}-(w_{i}-\underline{w})(w_{j}-\underline{w}) & \text{if}\ \frac{w_{i}+w_{j}}{2}\leq c\\ -(w_{i}-\bar{w})(w_{j}-\bar{w}) & \text{if}\ \frac{w_{i}+w_{j}}{2}\geq c\end{cases} \qquad (5)

where $\underline{w}=c-\alpha(\epsilon)r$ and $\bar{w}=c+\alpha(\epsilon)r$.
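As a sanity check, the covariance in Equation 5 can be recomputed directly from the joint distribution derived in Appendix B. The following sketch (our own code; $c=0$, $r=0.5$, $\epsilon=1$ are arbitrary test values) confirms that the two expressions agree:

import math

def alpha(eps):
    return (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)

def cov_from_joint(wi, wj, c, r, eps):
    # Covariance computed from the joint distribution in Appendix B
    a = alpha(eps)
    pi = 0.5 + (wi - c) / (2 * r * a)
    pj = 0.5 + (wj - c) / (2 * r * a)
    hi, lo = c + a * r, c - a * r
    joint = {(hi, hi): max(0.0, pi + pj - 1.0),
             (hi, lo): min(pi, 1.0 - pj),
             (lo, hi): min(pj, 1.0 - pi),
             (lo, lo): max(0.0, 1.0 - pi - pj)}
    e_prod = sum(p * x * y for (x, y), p in joint.items())
    return e_prod - wi * wj              # E[w_i*] = w_i by Lemma 1

def cov_lemma2(wi, wj, c, r, eps):
    # Equation 5
    a = alpha(eps)
    bound = c - a * r if (wi + wj) / 2 <= c else c + a * r
    return -(wi - bound) * (wj - bound)

for wi, wj in [(-0.3, 0.1), (0.2, 0.4), (-0.2, -0.4)]:
    assert abs(cov_from_joint(wi, wj, 0.0, 0.5, 1.0)
               - cov_lemma2(wi, wj, 0.0, 0.5, 1.0)) < 1e-9
print("Equation 5 matches the joint distribution of Appendix B")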

Having created negative covariance, we now investigate its effect on the variance of a pair of clients; for $n/2$ pairs, the effect is magnified by a factor of $\frac{n}{2}$. Figure 2 shows the effect of negative correlation for different values of the privacy budget: the improvement in variance is larger for smaller values of $\epsilon$. As the privacy budget tends to infinity, this perturbation can be interpreted as stochastic one-bit quantization; even in this regime, our method achieves lower variance than the independent method, which makes it applicable to quantization in DME as well.

Figure 2: Variance for weights $w_{i},w_{j}\in[c-r,c+r]$ with $c=0$ and $r=0.5$, for different privacy budgets. The decrease in variance is larger for smaller privacy budgets. Even in the no-privacy case, our method gains in variance over the independent method, which makes it applicable to the correlated DME methods of (suresh2022correlated).

We now consider the total variance of our method. For two clients, the variance can be expressed as:

Var(w^{*}_{i}+w^{*}_{j})=\begin{cases}-4(E[w]-c)(E[w]-\underline{w}) & \text{if}\ E[w]\leq c\\ -4(E[w]-c)(E[w]-\bar{w}) & \text{if}\ E[w]\geq c\end{cases} \qquad (6)

where $E[w]=\frac{w_{i}+w_{j}}{2}$. This expression attains its maximum when $w_{i}+w_{j}$ equals $c+\bar{w}$ or $c+\underline{w}$, and the maximum value is $\alpha^{2}(\epsilon)r^{2}$, which is 50% less than the uncorrelated case, where $Var(w^{*}_{i}+w^{*}_{j})=Var(w^{*}_{i})+Var(w^{*}_{j})=2\alpha^{2}(\epsilon)r^{2}$. Another important characteristic concerns small privacy budgets: over the region $[c-r,c+r]$, the variance curve of $w^{*}_{i}+w^{*}_{j}$ under independent perturbation is approximately flat at $2\alpha^{2}(\epsilon)r^{2}$, whereas under our method it stays below this level, because its maximum occurs at $w_{i}+w_{j}=2c\pm\alpha(\epsilon)r$, which for large $\alpha(\epsilon)$ lies outside the region, and at $E[w]=c$ the variance of $w^{*}_{i}+w^{*}_{j}$ is zero. Therefore, the correlated method is most valuable when higher accuracy is sought.
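The 50% figure can be verified numerically from Equation 6. The sketch below (our own check, sweeping $E[w]$ over the full interval $[c-\alpha(\epsilon)r,\,c+\alpha(\epsilon)r]$ where the formula applies) recovers the worst case $\alpha^{2}(\epsilon)r^{2}$, half of the independent bound:

import math

eps, c, r = 1.0, 0.0, 0.5
a = (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)

def var_pair(m):
    # Equation 6 with E[w] = m = (w_i + w_j) / 2
    bound = c - a * r if m <= c else c + a * r
    return -4.0 * (m - c) * (m - bound)

grid = [c - a * r + k * (2 * a * r) / 100_000 for k in range(100_001)]
worst = max(var_pair(m) for m in grid)
print(f"max correlated variance   = {worst:.4f}")             # ~ a^2 r^2
print(f"independent-case variance = {2 * a * a * r * r:.4f}")  # 2 a^2 r^2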

Optimality of the Proposed Method

To show that the above procedure attains the minimum possible correlation, we formulate the problem in a setup similar to the NISS problem (ghazi2016decidability; witsenhausen1975sequences); in those setups the problem is harder because the agents only have access to correlated, rather than common, randomness. In our setup, Alice and Bob receive a sequence of common randomness $X\in\{-1,1\}^{d}$ and want to produce $U,V\in\{-1,1\}$ with specified marginals $P(U=1)=P_{U}$, $P(V=1)=P_{V}$ and minimum correlation. Each agent creates a codebook of the $2^{d}$ sequences of length $d$, ordered from the all-zeros sequence to the all-ones sequence. The probability of observing a given sequence of common randomness is $P(X^{d})=p_{x}^{w_{H}(X^{d})}(1-p_{x})^{d-w_{H}(X^{d})}$, where $w_{H}(X^{d})$ is the Hamming weight of $X^{d}$. To simplify the problem, we assume the common randomness is generated by an unbiased coin, i.e., $p_{x}=0.5$, so every sequence in the codebook has probability $P(X^{d})=\frac{1}{2^{d}}$. The problem can then be formulated as an assignment problem: sequences in the codebook are assigned to Alice and Bob such that each outputs $+1$ if the common randomness lies in its assigned set, the objective being to minimize the correlation while preserving the marginals. This can be written as a Boolean program with variables $Y=\{Y_{i}\in\{-1,1\}\}$ and $Z=\{Z_{i}\in\{-1,1\}\}$, $i\in\{1,\dots,2^{d}\}$, which assign the sequence in row $i$ to Alice and Bob, respectively. In other words, upon observing the common-randomness sequence, Alice finds its index $i$ in the codebook and outputs $Y_{i}$, while Bob simultaneously outputs $Z_{i}$. The minimal-correlation problem is then:

\min_{Y,Z\in\{-1,1\}^{2^{d}}}\ Y^{T}Z
s.t.\quad \frac{1}{2^{d}}\sum_{i=1}^{2^{d}}Y_{i}=2P_{U}-1
\quad\quad\ \frac{1}{2^{d}}\sum_{i=1}^{2^{d}}Z_{i}=2P_{V}-1

In this formulation we assume that $P_{U}$ and $P_{V}$ are of the form $\frac{n_{A}}{2^{d}}$ and $\frac{n_{B}}{2^{d}}$. This is not restrictive: if the marginals are not of this form, the clients can use a local coin to realize the exact marginals. The problem above reaches its minimum when the assignments $Y$ and $Z$ overlap as little as possible. When $P_{U}+P_{V}=1$, i.e., $n_{A}+n_{B}=2^{d}$, the minimum overlap is zero, which our method of correlation achieves. When $P_{U}+P_{V}>1$, i.e., $n_{A}+n_{B}>2^{d}$, Alice and Bob must both output $+1$ on at least $n_{A}+n_{B}-2^{d}$ sequences, and our method achieves this minimum; likewise, when $P_{U}+P_{V}<1$, i.e., $n_{A}+n_{B}<2^{d}$, Alice and Bob must both output $-1$ on at least $2^{d}-n_{A}-n_{B}$ sequences, which our method again achieves. We can therefore claim that our method achieves the minimum correlation between two agents without sharing any information about the marginals.
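For small $d$, the optimality claim can be verified exhaustively. The brute-force sketch below (our own code; the values $d=3$, $n_{A}=5$, $n_{B}=6$ are arbitrary) enumerates all assignments with the given marginals and confirms that the opposite-ends construction attains the minimum of $Y^{T}Z$:

from itertools import combinations

d = 3
size = 2 ** d                   # codebook size
n_a, n_b = 5, 6                 # numbers of +1 outputs for Alice and Bob

def assignments(n_plus):
    # All {-1,+1} vectors of length `size` with exactly n_plus entries +1
    for pos in combinations(range(size), n_plus):
        yield [1 if i in pos else -1 for i in range(size)]

brute_min = min(sum(y * z for y, z in zip(Y, Z))
                for Y in assignments(n_a) for Z in assignments(n_b))

# Our construction: Alice takes the first n_a indices, Bob the last n_b
Y = [1 if i < n_a else -1 for i in range(size)]
Z = [1 if i >= size - n_b else -1 for i in range(size)]
ours = sum(y * z for y, z in zip(Y, Z))

print(f"brute-force minimum = {brute_min}, construction = {ours}")
assert ours == brute_min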

We now formulate the creation of negative correlation for $n$ agents, in which case the minimal-correlation problem becomes:

\min_{X_{i}\in\{-1,1\}^{2^{d}},\ i\in\{1,\dots,n\}}\ \sum_{i<j}X_{i}^{T}X_{j}
s.t.\quad E[X_{i}]=2P_{i}-1,\quad i\in\{1,\dots,n\}


Communication Cost

An important aspect of FL is the communication cost at each round. In our proposed method, achieving the negative correlation requires $d$ random bits shared between the two clients per weight; the number of shared bits determines how finely the marginals are resolved. For instance, if we set $d=1$, each pair compares its marginals with the common randomness and flips a local coin to output $c+\alpha(\epsilon)r$ or $c-\alpha(\epsilon)r$, which reduces to the LDP-FL method (sun2020ldp): there is no negative correlation, because the decision rests on the local coin flip, which preserves the marginals but does not force any negative correlation. By increasing the number of shared bits, we decrease the influence of the local randomness on the total probability: each client resorts to its local coin only with probability at most $\frac{1}{2^{d-1}}$, the maximum probability of the boundary region. This probability could be reduced further if the distribution of the weights were known; we do not assume any weight distribution in our method, but $p_{x}$ can be treated as a hyperparameter of the algorithm to achieve better accuracy.

The maximum gain in negative correlation is governed by the number of shared random bits: increasing it from $d$ to $d+1$ increases the achievable correlation by at most $\frac{1}{2^{d}}$. Hence, increasing the communication cost yields diminishing returns in negative correlation, and the expected benefit is obtained even with a small number of shared bits and little communication overhead.

Moreover, each perturbed weight itself requires only one bit to transmit, so the total communication cost per weight is $d+1$ bits.

Algorithm 1 LDP-FedCR (Local Differential Privacy Federated Learning with Correlated Randomness)

Inputs: privacy budget $\epsilon$, communication rounds $T$, global model $M$, number of shared random bits $NumRand$, coin bias $p_{x}$, local epochs $E$
Initialization:
$\alpha(\epsilon)\leftarrow\frac{e^{\epsilon}+1}{e^{\epsilon}-1}$
$ProbList\leftarrow\text{ProbabilityList}(p_{x},NumRand)$

1:  Share $ProbList$ with all clients
2:  for $t=1,\dots,T$ do
3:     Randomly pair clients
4:     for each pair $(i,j)$ do
5:        Send the public key of $j$ to $i$
6:        $CR[i]\leftarrow\text{CommRand}(p_{x},NumRand,NumParam)$
7:        Encrypt and share $CR[i]$ with $j$ via the server
8:     end for
9:     Distribute the global model $M$ to all clients
10:     for each pair of clients $(i,j)$ do
11:        $W_{i}\leftarrow\text{ClientUpdate}(M,E)$
12:        $W^{*}_{i}\leftarrow\text{PerturbWeight}(W_{i},CR[i],ProbList,+1)$
13:        $W_{j}\leftarrow\text{ClientUpdate}(M,E)$
14:        $W^{*}_{j}\leftarrow\text{PerturbWeight}(W_{j},CR[i],ProbList,-1)$
15:     end for
16:     Collect the perturbed weights $W^{*}_{k}$ from all clients
17:     Update the global model $M$ by aggregating the weights
18:  end for
Algorithm 2 Weight perturbation in LDP-FedCR
1:  function PerturbWeight$(W,CR,ProbList,UP)$
2:  for each $w_{k},CR_{k}$ in the client's model $W$ and $CR$ do
3:     $ProbMarginal\leftarrow 0.5+\frac{w_{k}-c}{2\alpha r}$
4:     if $UP=+1$ then
5:        $P\leftarrow ProbMarginal$
6:     else
7:        $P\leftarrow 1-ProbMarginal$
8:     end if
9:     $idx\leftarrow\text{FindIndex}(ProbList,P)$
10:     $IntCR\leftarrow\text{ConvertToInteger}(CR_{k})$
11:     if $IntCR<idx$ then
12:        $U\leftarrow UP$
13:     else if $IntCR\geq idx+1$ then
14:        $U\leftarrow-UP$
15:     else
16:        $ProbGrid\leftarrow\frac{P-ProbList[idx]}{ProbList[idx+1]-ProbList[idx]}$
17:        Sample $U\leftarrow+UP$ with probability $ProbGrid$, else $U\leftarrow-UP$
18:     end if
19:     $w_{k}^{*}\leftarrow c+U\alpha r$
20:  end for
21:  return  $W^{*}$
22:  end function


Privacy Analysis

Unbiasedness. Because the index-matching step preserves each client's marginal from Equation 2 exactly, every output satisfies $E[w^{*}_{i}]=w_{i}$, and the server's aggregation remains unbiased by the same argument as Lemma 1.

Privacy. The common randomness is generated independently of the clients' data and is hidden from the server by encryption; conditioned on its own weight, each client's output has exactly the distribution of Equation 2. Hence the mechanism satisfies $\epsilon$-differential privacy by Theorem 1.

Variance. For each pair of clients, the variance of the aggregate is given by Equation 6 and is at most $\alpha^{2}(\epsilon)r^{2}$, half of the $2\alpha^{2}(\epsilon)r^{2}$ attained by independent perturbation.

Compared with non-private correlated quantization (suresh2022correlated), the main limitation of the private setting is that the output levels $c\pm\alpha(\epsilon)r$ are dictated by the privacy budget: they cannot be brought closer to the true weights without weakening the $\epsilon$-DP guarantee.

Algorithm Accuracy

a table for

Complexity Analysis

complexity of randomness generation in both quantization and our method

Experimental Results

provide results of doing experiment on the different datasets

Appendix A Functions in Algorithm

Algorithm 3 Functions in LDP-FedCR
1:  function ClientUpdate$(M,E)$
2:  Initialize local model $W\leftarrow M$
3:  for $e=1$ to $E$ do
4:     for each batch $b$ in the local dataset do
5:        Compute gradients $\nabla L(W,b)$
6:        Update $W\leftarrow W-\eta\nabla L(W,b)$
7:     end for
8:  end for
9:  return  $W$
10:  end function
11:  function ProbabilityList$(p_{x},NumRand)$
12:  $ProbList[0]\leftarrow(1-p_{x})^{NumRand}$
13:  for $i=1$ to $2^{NumRand}-1$ do
14:     $NumOnes\leftarrow\text{popcount}(i)$
15:     $ProbList[i]\leftarrow ProbList[i-1]+p_{x}^{NumOnes}(1-p_{x})^{NumRand-NumOnes}$
16:  end for
17:  return  $ProbList$
18:  end function
19:  function CommRand$(p_{x},NumRand,NumParam)$
20:  return  a matrix of i.i.d. $Ber(p_{x})$ bits of size $(NumRand,NumParam)$
21:  end function
22:  function FindIndex$(ProbList,value)$
23:  return  the index $i$ such that $ProbList[i]\leq value<ProbList[i+1]$
24:  end function
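For reference, a direct Python rendering of these helpers might look as follows (our own sketch; the names and edge-case handling are our assumptions):

import random

def probability_list(p_x: float, num_rand: int) -> list:
    # Cumulative probabilities of codebook sequences, ordered by integer value
    prob_list, total = [], 0.0
    for i in range(2 ** num_rand):
        num_ones = bin(i).count("1")               # popcount(i)
        total += (p_x ** num_ones) * ((1 - p_x) ** (num_rand - num_ones))
        prob_list.append(total)
    return prob_list

def comm_rand(p_x: float, num_rand: int, num_param: int) -> list:
    # IID Ber(p_x) bits: one length-num_rand sequence per model parameter
    return [[1 if random.random() < p_x else 0 for _ in range(num_rand)]
            for _ in range(num_param)]

def find_index(prob_list: list, value: float) -> int:
    # Index i with prob_list[i] <= value < prob_list[i+1]
    for i in range(len(prob_list) - 1):
        if prob_list[i] <= value < prob_list[i + 1]:
            return i
    return len(prob_list) - 1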

Appendix B Proof of Lemma 2

Let us refer to client $i$ as Alice and client $j$ as Bob. Based on Equation 2, their marginal probabilities can be expressed as follows:

P_{A}=\frac{1}{2}+\frac{w_{A}-c}{2r\alpha(\epsilon)}
P_{B}=\frac{1}{2}+\frac{w_{B}-c}{2r\alpha(\epsilon)}

As shown in Figure 1, the probability that Alice's output is $c+\alpha(\epsilon)r$ while Bob's output is $c-\alpha(\epsilon)r$ is $\min(P_{A},1-P_{B})$, and the joint probability distribution is:

Pr(w^{*}_{A}=c+\alpha r,\ w^{*}_{B}=c+\alpha r)=\max(0,P_{A}+P_{B}-1)
Pr(w^{*}_{A}=c+\alpha r,\ w^{*}_{B}=c-\alpha r)=\min(P_{A},1-P_{B})
Pr(w^{*}_{A}=c-\alpha r,\ w^{*}_{B}=c+\alpha r)=\min(P_{B},1-P_{A})
Pr(w^{*}_{A}=c-\alpha r,\ w^{*}_{B}=c-\alpha r)=\max(0,1-P_{A}-P_{B})

Note that the marginals are preserved, since $\max(0,P_{A}+P_{B}-1)+\min(P_{A},1-P_{B})=P_{A}$ and $\max(0,P_{A}+P_{B}-1)+\min(P_{B},1-P_{A})=P_{B}$. Computing $E[w^{*}_{A}w^{*}_{B}]-w_{A}w_{B}$ from this joint distribution yields Equation 5.