
PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

Ori Peleg, Natalie Lang, Stefano Rini, Nir Shlezinger, and Kobi Cohen Parts of this work were accepted for presentation in the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) as the paper [1]. O. Peleg, N. Lang, N. Shlezinger, and K. Cohen are with School of ECE, Ben-Gurion University of the Negev, Beer-Sheva, Israel (email: {oripele, langn}@post.bgu.ac.il; {nirshl; yakovsec}@bgu.ac.il). S. Rini is with the Department of ECE, National Yang-Ming Chiao-Tung University (NYCU), Hsinchu, Taiwan (email: stefano.rini@nycu.edu.tw). This research was supported by the Israeli Ministry of Science and Technology.
Abstract
Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. FL proceeds through iterative exchanges of model updates, which pose two key challenges: (i) the accumulation of privacy leakage over time and (ii) communication latency. These two limitations are typically addressed separately: (i) via perturbed updates to enhance privacy and (ii) via user selection to mitigate latency, both at the expense of accuracy. In this work, we propose a method that jointly tackles the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE), which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows the best-known rate in the MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy at reduced complexity. We numerically validate the privacy leakage, improved latency, and accuracy gains of our methods for federated training in various scenarios.

Index Terms:
Federated Learning; Communication latency; Privacy; Multi-Armed Bandit; Simulated Annealing.

I Introduction

The effectiveness of deep learning models heavily depends on the availability of large amounts of data. In real-world scenarios, data is often gathered by edge devices such as mobile phones, medical devices, sensors, and vehicles. Because these data often contain sensitive information, there is a pressing need to utilize them for training deep neural networks (DNNs) without compromising user privacy. A popular framework for training DNNs without requiring data centralization is that of federated learning (FL) [2]. In FL, each participating device locally trains its model in parallel, and a central server periodically aggregates these local models into a global one [3].

The distributed operation of FL, and particularly the fact that learning is carried out using multiple remote users in parallel, induces several challenges that are not present in traditional centralized learning [4, 5]. A key challenge stems from the fact that FL involves repeated exchanges of highly parameterized models between the orchestrating server and numerous users. This often entails significant communication latency, which, in turn, impacts convergence, complexity, and scalability [6]. Communication latency can be tackled by model compression [7, 8, 9, 10], and via over-the-air aggregation in settings where the users share a common wireless channel [11, 12, 13]. A complementary approach for balancing communication latency, which is key for scaling FL over massive networks, is user selection [14, 15, 16]. User selection limits the number of users participating in each round, traditionally employing pre-defined policies [17, 18, 19, 20], with more recent schemes exploring active user selection based on the multi-armed bandit (MAB) framework [21, 22, 23, 24]. The latter adapts the selection policy based on, e.g., learning progress and communication delay.

Another prominent challenge of FL is associated with one of its core motivators: privacy preservation. While FL does not involve data sharing, it does not necessarily preserve data privacy, as model inversion attacks were shown to unveil private information and even reconstruct the data from model updates [25, 26, 27, 28]. The common framework for analyzing privacy leakage in FL is based on local differential privacy (LDP) [29]. LDP mechanisms limit privacy leakage in a given FL round, typically by employing privacy preserving noise (PPN) [30, 31, 32], which can also be unified with model compression [33, 34]. However, this causes the amount of leaked privacy to grow with the number of learning rounds [35], degrading performance by restricting the number of learning rounds and necessitating dominant PPN. Existing approaches to avoid accumulation of privacy leakage treat it as a task separate from tackling latency and scalability, often by focusing on a fixed pre-defined number of rounds [36], or by relying on an additional trusted coordinator unit [37, 38, 39], thus deviating from how FL typically operates. This motivates unifying privacy enhancement and user selection as a means to jointly tackle privacy accumulation and latency in FL.

In this work we propose a novel framework for private and scalable multi-round FL with low latency via active user selection. Our proposed method, coined PAUSE, is based on a generic per-round privacy budget, designed to prevent leakage from surpassing a pre-defined limit for any number of FL rounds. Under this operation, users induce more PPN each time they participate. The budget is accounted for in formulating a dedicated reward function for active user selection that balances privacy, communication, and generalization. Based on this reward, we propose a MAB-based policy that prioritizes users with less PPN, balanced with grouping users of similar expected communication latency and exploring new users to enhance generalization. We provide an analysis of PAUSE, rigorously proving that its regret growth rate obeys the desirable growth rate in MAB theory [40, 41, 42].

The direct application of PAUSE involves a brute-force search of a combinatorial nature, whose complexity grows dramatically with the number of users. To circumvent this excessive complexity and enhance scalability, we propose a reduced-complexity implementation of PAUSE based on simulated annealing (SA) [43], coined SA-PAUSE. We analyze the computational complexity of SA-PAUSE, quantifying its reduction compared to direct PAUSE, and rigorously characterize conditions for it to achieve the same performance as the costly brute-force search. We evaluate PAUSE in various learning scenarios with different DNNs, datasets, privacy budgets, and data distributions. Our experimental studies systematically show that by fusing privacy enhancement and user selection, PAUSE enables accurate and rapid learning, approaching the performance of FL without such constraints and notably outperforming alternative approaches that do not account for leakage accumulation. We also show that SA-PAUSE approaches the performance of direct PAUSE in privacy leakage, model accuracy, and latency alike, while supporting scalable implementation over large FL networks.

The rest of this paper is organized as follows. We review some necessary preliminaries and formulate the problem in Section II. PAUSE is introduced and analyzed in Section III, while its reduced complexity, SA-PAUSE, is detailed in Section IV. Numerical simulations are reported in Section V, and Section VI provides concluding remarks.

Notation: Throughout this paper, we use boldface lower-case letters for vectors, e.g., $\bm{x}$. The stochastic expectation, probability operator, indicator function, and $\ell_2$ norm are denoted by $\mathbb{E}[\cdot]$, $\mathbb{P}(\cdot)$, $\mathbf{1}(\cdot)$, and $\|\cdot\|$, respectively. For a set $\mathcal{X}$, we write $|\mathcal{X}|$ for its cardinality.

II System Model and Preliminaries

This section reviews the necessary background for deriving PAUSE. We start by recalling the FL setup and the basics of LDP in Subsections II-A and II-B, respectively. Then, we formulate the active user selection problem in Subsection II-C.

II-A Preliminaries: Federated Learning

II-A1 Objective

The FL setup involves the collaborative training of a machine learning model $\bm{\theta}\in\mathbb{R}^{d}$, carried out by $K$ remote users and orchestrated by a server. Let the set of users be indexed by $\mathbb{K}=\{1,\ldots,K\}$, and let $\mathcal{D}_{k}$ denote the private dataset of user $k\in\mathbb{K}$, which cannot be shared with the server. Define $F_{k}(\bm{\theta})$ as the empirical risk of a model $\bm{\theta}$ evaluated on $\mathcal{D}_{k}$. The goal is to determine the $d\times 1$ optimal parameter vector $\bm{\theta}^{\rm opt}$ that minimizes the overall loss across all users, that is

\bm{\theta}^{\rm opt}=\operatorname*{arg\,min}_{\bm{\theta}}\left\{F(\bm{\theta})\triangleq\sum_{k=1}^{K}\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}F_{k}(\bm{\theta})\right\}. (1)

II-A2 Learning Procedure

FL operates over multiple iterations divided into rounds [4]. At FL round $t$, the server selects a set of participating users $\mathcal{S}_{t}\subseteq\mathbb{K}$, and sends the current model $\bm{\theta}_{t}$ to them. Each participating user of index $k\in\mathcal{S}_{t}$ then trains $\bm{\theta}_{t}$ on its local data $\mathcal{D}_{k}$ using, e.g., multiple iterations of mini-batch stochastic gradient descent (SGD) [44], yielding the updated model $\bm{\theta}^{k}_{t+1}$.

The model update obtained by the $k$th user, denoted $\bm{h}^{k}_{t+1}=\bm{\theta}^{k}_{t+1}-\bm{\theta}_{t}$, is shared with the server, which aggregates the local updates into a global model update. The aggregation rule commonly employed by the central server in FL is that of federated averaging (FedAvg) [2], in which the global model is obtained as

\bm{\theta}_{t+1}=\bm{\theta}_{t}+\sum_{k\in\mathcal{S}_{t}}\alpha_{t}^{k}\bm{h}^{k}_{t+1}=\sum_{k\in\mathcal{S}_{t}}\alpha_{t}^{k}\bm{\theta}^{k}_{t+1}, (2)

where $\alpha_{t}^{k}=\frac{|\mathcal{D}_{k}|}{|\cup_{j\in\mathcal{S}_{t}}\mathcal{D}_{j}|}$. The updated global model is again distributed to the users and the learning procedure continues.
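To make the aggregation rule (2) concrete, the following minimal Python sketch implements a single FedAvg step over illustrative list-valued models; the function name `fedavg_step` and the plain-list representation are our own illustrative choices, not from the paper.

```python
# Minimal sketch of a FedAvg aggregation step, cf. (2).
def fedavg_step(theta, local_updates, dataset_sizes):
    """Combine local updates h_k, weighted by alpha_k = |D_k| / sum_j |D_j|.

    theta         -- current global model theta_t (list of floats)
    local_updates -- dict: user k -> update h_k (list of floats)
    dataset_sizes -- dict: user k -> |D_k|, for the participating users
    """
    total = sum(dataset_sizes[k] for k in local_updates)
    next_theta = list(theta)
    for k, h in local_updates.items():
        alpha = dataset_sizes[k] / total  # alpha_t^k in (2)
        for i, h_i in enumerate(h):
            next_theta[i] += alpha * h_i
    return next_theta
```

Note that the weights of non-participating users are simply absent from the sums, matching the restriction of (2) to the selected set.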

II-A3 Communication Model

Communication between the users and the server is associated with some varying latency [4]. We model this delay via the random variable $\tau_{t,k}$, representing the total latency at the $t$th round between the server and the $k$th user. Accordingly, the communication latency of the whole round, denoted $\tau_{t}^{\rm total}$, is determined by the user with the highest latency:

\tau_{t}^{\rm total}=\max_{k\in\mathcal{S}_{t}}\tau_{t,k}. (3)

The communication latency $\tau_{t,k}$ varies over time (due to fading [6]) and between users (due to system heterogeneity [45]). As the latter is device-specific, we model $\tau_{t,k}$ as being drawn in an i.i.d. manner from a device-specific distribution [21], denoted $\tau_{k}$. We further assume the users differ in their expected latencies $\mathbb{E}[\tau_{k}]$, and denote the minimal difference between these terms by $\delta\triangleq\min_{i\neq j\in\mathbb{K}}|\mathbb{E}[\tau_{i}]-\mathbb{E}[\tau_{j}]|$. We also assume a minimal latency corresponding to, e.g., the minimal propagation delay; mathematically, there exists some $\tau_{\min}>0$ such that $\tau_{t,k}\geq\tau_{\min}$ with probability one.
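The latency model above can be illustrated with a short simulation sketch; the uniform per-device distributions and the constant `TAU_MIN` below are hypothetical choices used only to exercise (3), since the paper leaves the distributions generic.

```python
import random

TAU_MIN = 0.1  # assumed minimal latency tau_min > 0

def draw_latency(rng, mean_k):
    # Illustrative device-specific law: uniform around the mean, clipped
    # from below so that tau_{t,k} >= tau_min holds with probability one.
    return max(TAU_MIN, rng.uniform(0.5 * mean_k, 1.5 * mean_k))

def round_latency(rng, selected, means):
    """tau_t^total = max_{k in S_t} tau_{t,k}, cf. (3)."""
    return max(draw_latency(rng, means[k]) for k in selected)
```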

II-B Preliminaries: Local Differential Privacy

One of the main motivations for FL is the need to preserve the privacy of the users' data. Nonetheless, concealing the dataset of the $k$th user, $\mathcal{D}_{k}$, in favor of sharing model updates trained on $\mathcal{D}_{k}$, was shown to be potentially leaky [25, 26, 27, 28]. Therefore, to satisfy the privacy requirements of FL, dedicated privacy mechanisms are necessary.

In FL, privacy is commonly quantified in terms of LDP [46, 47], as this metric does not assume that the users trust the server.

Definition 1 ($\epsilon$-LDP [48]).

A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-LDP if for any pair of input values $v,v^{\prime}$ in the domain of $\mathcal{M}$ and for any possible output $y$, it holds that

\mathbb{P}[\mathcal{M}(v)=y]\leq e^{\epsilon}\mathbb{P}[\mathcal{M}(v^{\prime})=y]. (4)

In Definition 1, a smaller $\epsilon$ means stronger privacy protection. A common mechanism to achieve $\epsilon$-LDP is the Laplace mechanism (LM). Let $\operatorname{Laplace}(\mu,b)$ denote the Laplace distribution with location $\mu$ and scale $b$. The LM is defined as follows:

Theorem 1 (LM [49]).

Given any function $f:D\to\mathbb{R}^{d}$, where $D$ is a domain of datasets, the LM, defined as

\mathcal{M}^{\rm Laplace}\left(f(x),\epsilon\right)=f(x)+{\left[z_{1},\dots,z_{d}\right]}^{T}, (5)

is $\epsilon$-LDP. In (5), $z_{i}\overset{\rm i.i.d.}{\sim}\operatorname{Laplace}\left(0,\Delta f/\epsilon\right)$, i.e., they obey an i.i.d. zero-mean Laplace distribution with scale $\Delta f/\epsilon$, where $\Delta f\triangleq\max_{x,y\in D}\|f(x)-f(y)\|_{1}$.
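A minimal sketch of the Laplace mechanism in (5), assuming the caller supplies $f(x)$ and its $\ell_1$ sensitivity $\Delta f$; the inverse-CDF sampler is a standard construction and the function names are ours.

```python
import math
import random

def laplace_noise(rng, scale):
    # Inverse-CDF sampling of Laplace(0, scale): u ~ Uniform(-1/2, 1/2),
    # z = -scale * sgn(u) * ln(1 - 2|u|).
    u = rng.random() - 0.5
    return -scale * (1.0 if u >= 0 else -1.0) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(rng, fx, sensitivity, eps):
    """Perturb the vector f(x) with i.i.d. Laplace(0, sensitivity/eps) noise."""
    scale = sensitivity / eps
    return [v + laplace_noise(rng, scale) for v in fx]
```

A smaller $\epsilon$ yields a larger scale and hence more dominant noise, which is exactly the effect the per-round budget policy of Section III must account for.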

LDP mechanisms, such as the LM, guarantee $\epsilon$-LDP for a given query of $\mathcal{M}$ in (4). In FL, this corresponds to a single model update. As FL involves multiple rounds, one has to account for the accumulated leakage, given by the composition theorem:

Theorem 2 (Composition [48]).

Let $\mathcal{M}_{i}$ be an $\epsilon_{i}$-LDP mechanism on input $v$, and let $\mathcal{M}(v)$ be the sequential composition of $\mathcal{M}_{1}(v),\ldots,\mathcal{M}_{m}(v)$. Then $\mathcal{M}(v)$ satisfies $\sum_{i=1}^{m}\epsilon_{i}$-LDP.

Theorem 2 indicates that the privacy leakage of each user in FL is accumulated as the training proceeds.

II-C Problem Formulation

Our goal is to design a privacy leakage policy alongside privacy-aware user selection. Formally, we aim to set, for every round $t\in\mathbb{N}$, an algorithm that selects $m=|\mathcal{S}_{t}|$ users, while setting the privacy leakage budget $\{\epsilon_{k,t}\}_{k\in\mathcal{S}_{t}}$. These policies should account for the following considerations:

  1. C1

Optimize the accuracy of the trained model $\bm{\theta}$ in (1).

  2. C2

    Minimize the overall latency due to (3).

  3. C3

Maintain $\bar{\epsilon}$-LDP, i.e., the overall leakage by each user should not exceed $\bar{\epsilon}$, where $\bar{\epsilon}$ is a pre-defined constant.

  4. C4

    Operate with limited complexity to support real-time implementation in large-scale networks.

The considerations above are addressed in the subsequent sections. We first focus solely on considerations C1-C3, based on which we present PAUSE in Section III. Subsequently, Section IV adapts PAUSE to accommodate consideration C4, yielding SA-PAUSE, thereby jointly tackling C1-C4.

III Privacy-Aware Active User Selection

This section introduces PAUSE. We first formulate its time-varying privacy budget policy and associated reward in Subsection III-A. The resulting user selection algorithm is detailed in Subsection III-B, with its regret growth analyzed in Subsection  III-C. We conclude with a discussion in Subsection III-D.

III-A Reward and Privacy Policy

The formulation of PAUSE relies on two main components: (i) a prefixed round-varying privacy budget; and (ii) a reward holistically accounting for privacy, latency, and generalization. The privacy policy is designed to ensure that C3 is preserved regardless of the number of iterations in which each user participates. Accordingly, we define a sequence $\{\epsilon_{i}\}$ with $\epsilon_{i}>0$, satisfying:

\sum_{i=1}^{\infty}\epsilon_{i}=\bar{\epsilon}, (6)

for finite $\bar{\epsilon}$. Using the sequence $\{\epsilon_{i}\}$, the privacy budget of any user at the $i$th time it participates in training the model is set to $\epsilon_{i}$, achieved using, e.g., the LM. This guarantees that C3 holds. One candidate setting, which is also used in our experiments, is $\epsilon_{i}=\bar{\epsilon}(e^{\eta}-1)e^{-\eta i}$ with $\eta>0$, for which (6) holds since the resulting series is geometric with limit $\bar{\epsilon}$.
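The geometric schedule above can be checked numerically: with $\epsilon_i = \bar{\epsilon}(e^{\eta}-1)e^{-\eta i}$, the partial sum after $n$ participations equals $\bar{\epsilon}(1-e^{-\eta n})$, so the accumulated leakage under composition (Theorem 2) never exceeds $\bar{\epsilon}$. A small sketch (function names are ours):

```python
import math

def eps_schedule(eps_bar, eta, n_terms):
    # eps_i = eps_bar * (e^eta - 1) * e^(-eta * i), i = 1, ..., n_terms
    return [eps_bar * (math.exp(eta) - 1.0) * math.exp(-eta * i)
            for i in range(1, n_terms + 1)]

def cumulative_leakage(eps_bar, eta, n_terms):
    # Accumulated leakage after n_terms participations (Theorem 2);
    # the geometric partial sum equals eps_bar * (1 - e^(-eta * n_terms)).
    return sum(eps_schedule(eps_bar, eta, n_terms))
```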

The reward guides the active user selection procedure and utilizes two terms. The first is the privacy reward, which accounts for the fact that our privacy policy has users introduce more dominant PPN each time they participate. The privacy reward assigned to the $k$th user at round $t$ is

p_{k}(t)\triangleq 1-\frac{\sum_{i=1}^{T_{k}(t)}\epsilon_{i}}{\bar{\epsilon}}, (7)

where $T_{k}(t)$ is the number of rounds in which the $k$th user has been selected up to and including the $t$th round, i.e., $T_{k}(t)\triangleq\sum_{i=1}^{t}\mathbf{1}(k\in\mathcal{S}_{i})$. The privacy reward (7) assigns higher values to users who have participated in fewer rounds.

The second term is the generalization reward, designed to meet C1. It assigns higher values to users whose data have been underutilized relative to the share of their data in the whole available data, $\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}$. We adopt the generalization reward proposed in [23], which was shown to account for both i.i.d. balanced data and non-i.i.d. imbalanced data, and rewards the $k$th user in an $m$-sized group at round $t$ via the function

g_{k}(t)\triangleq\bigg{|}\frac{m}{|\mathcal{D}|/|\mathcal{D}_{k}|}-\frac{T_{k}(t)}{t}\bigg{|}^{\beta}\cdot\operatorname{sign}\biggl{(}\frac{m}{|\mathcal{D}|/|\mathcal{D}_{k}|}-\frac{T_{k}(t)}{t}\biggr{)}. (8)

In (8), $\beta>1$ is a hyper-parameter that adjusts the fuzziness of the function, i.e., a higher $\beta$ yields a lower absolute value when the other parameters are fixed. Fig. 1 depicts $g_{k}(\cdot)$ as a function of $T_{k}(t)/t$, and illustrates the effect of different $\beta$ values as a means of balancing the reward assigned to users that have participated frequently (high $T_{k}(t)/t$).

Figure 1: Generalization reward (8) for different values of $\beta$, with $\frac{|\mathcal{D}|}{|\mathcal{D}_{k}|}=K$.

Our proposed reward encompasses the above terms, grading the selection of a group of users $\mathcal{S}$ of size $m$ at round $t$ as

r(\mathcal{S},t)\triangleq\frac{\tau_{\min}}{\max_{k\in\mathcal{S}}\tau_{k,t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{S}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{S}}p_{k}(t-1)
=\min_{k\in\mathcal{S}}\frac{\tau_{\min}}{\tau_{k,t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{S}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{S}}p_{k}(t-1). (9)

The reward in (9) is composed of three additive terms, which correspond to C2, C1, and C3, respectively, with $\alpha$ and $\gamma$ being hyper-parameters balancing these considerations. At this point we can make two remarks regarding the reward (9):

  1. 1.

Both $g_{k}(\cdot)$ and $p_{k}(\cdot)$ penalize repeated selection of the same users. However, each rewards differently, based on generalization and privacy considerations, respectively: the former accounts for the relative dataset sizes of the users, while the latter does not. In the case of homogeneous data, where $|\mathcal{D}_{k}|=\frac{|\mathcal{D}|}{K}$ for all $k\in\mathbb{K}$, both $g_{k}(\cdot)$ and $p_{k}(\cdot)$ play a similar role. However, they differ significantly in the non-i.i.d. case.

  2. 2.

    The value of the first term is determined solely by the slowest user. This non-linearity, combined with the two other terms, directs the algorithm we derive from this reward to select a group of users with similar latency in a given round.
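Putting the pieces together, the following is a minimal sketch of the reward terms (7), (8) and their combination in (9); all inputs are illustrative stand-ins for the server-side statistics, and the function names are ours.

```python
def privacy_reward(spent_eps, eps_bar):
    # p_k(t) = 1 - (accumulated leakage of user k) / eps_bar, cf. (7).
    return 1.0 - spent_eps / eps_bar

def generalization_reward(m, D_k, D_total, T_k, t, beta):
    # g_k(t) in (8): signed, fuzzified gap between the target share
    # m * |D_k| / |D| and the empirical participation share T_k(t) / t.
    gap = m * D_k / D_total - T_k / t
    sign = 1.0 if gap >= 0 else -1.0
    return (abs(gap) ** beta) * sign

def reward(S, tau, g, p, tau_min, alpha, gamma):
    """r(S, t) of (9); tau, g, p map each user to its latency / g_k / p_k."""
    m = len(S)
    latency_term = min(tau_min / tau[k] for k in S)  # slowest user dominates
    return (latency_term
            + alpha / m * sum(g[k] for k in S)
            + gamma / m * sum(p[k] for k in S))
```

The `min` in the latency term reproduces the non-linearity noted in the second remark: a single slow user caps the first component for the whole group.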

III-B PAUSE Algorithm

Here we present PAUSE, a combinatorial MAB-based algorithm [42] built on the reward (9). To derive PAUSE, we seek a policy $\Pi\triangleq(\mathcal{S}_{1},\mathcal{S}_{2},\ldots)$ such that $\mathbb{E}[\sum_{t=1}^{n}r(\mathcal{S}_{t},t)]$ is maximized for every $n$. To this end, as is customary in MAB settings, we aim to minimize the regret, defined as the loss of the algorithm compared to a Genie policy that has prior knowledge of the expectations of the random variables, i.e., of $\mu_{k}\triangleq\mathbb{E}[\frac{\tau_{\min}}{\tau_{k}}]$.

We define the Genie’s algorithm as selecting

\mathcal{G}_{t}\triangleq\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathbb{K};|\mathcal{S}|=m}\{C^{\mathcal{G}}(\mathcal{S},t)\}, (10)

where

C^{\mathcal{G}}(\mathcal{S},t)\triangleq\min_{k\in\mathcal{S}}\mu_{k}+\frac{\alpha}{m}\sum_{k\in\mathcal{S}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{S}}p_{k}(t-1).

The Genie policy (10) attempts to maximize the expectation of the reward (9) in each round, by exchanging the order of the expectation and the $\min_{k\in\mathcal{S}}$ operator. As the reward $C^{\mathcal{G}}$ is history-dependent, so is the Genie's policy.

We use the Genie policy to derive PAUSE, denoted $\mathcal{P}\triangleq(\mathcal{P}_{1},\mathcal{P}_{2},\ldots)$, as an upper confidence bound (UCB)-type algorithm [41]. Accordingly, PAUSE estimates the unknown expectations $\{\mu_{k}\}$ via their empirical means, computed from the latencies measured in previous rounds as

\overline{\mu_{k}}(n)\triangleq\frac{1}{T_{k}(n)}\sum_{t=1}^{n}\frac{\tau_{\min}}{\tau_{k,t}}\cdot\mathbf{1}(k\in\mathcal{P}_{t}). (11)

Note that (11) can be efficiently updated in a recursive manner, as

\overline{\mu_{k}}(t)=\frac{T_{k}(t-1)}{T_{k}(t)}\overline{\mu_{k}}(t-1)+\frac{\mathbf{1}(k\in\mathcal{P}_{t})}{T_{k}(t)}\frac{\tau_{\min}}{\tau_{k,t}}. (12)
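The equivalence of the recursion (12) and the batch average (11) is easy to sanity-check; the sketch below runs the recursion over a stream of illustrative ratio samples $\tau_{\min}/\tau_{k,t}$ for a user that participates in every round.

```python
def recursive_mean(samples):
    # mu(t) = (T-1)/T * mu(t-1) + x_t / T, matching (12) for a user that
    # participates in every round; reproduces the batch mean (11) exactly.
    mu, T = 0.0, 0
    for x in samples:
        T += 1
        mu = (T - 1) / T * mu + x / T
    return mu
```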

PAUSE uses (11) to compute a UCB term for each user at the end of the $t$th round [41], via

{\rm ucb}(k,t)\triangleq\overline{\mu_{k}}(t)+\sqrt{\frac{(m+1)\log(t)}{T_{k}(t)}}. (13)

The UCB term in (13) is designed to tackle C2. Its formulation encapsulates the inherent exploration-exploitation trade-off in MAB problems: the first term, $\overline{\mu_{k}}(t)$, boosts exploitation of the users that are fastest in expectation, while the second term encourages exploration of other users. The resulting user selection rule at round $t$ is

\mathcal{P}_{t}=\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathbb{K};|\mathcal{S}|=m}\biggl{\{}\min_{k\in\mathcal{S}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{S}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{S}}p_{k}(t-1)\biggr{\}}. (14)

The overall active user selection procedure is summarized as Algorithm 1. The chosen users send their noisy local model updates to the server, which updates the global model via (2) and sends it back to all the users in $\mathbb{K}$. At the end of every round, the users' reward terms are updated for the next round, where $p_{k}(t)$ and $\overline{\mu_{k}}(t)$ change only for the participating users $k\in\mathcal{P}_{t}$. Note that, by the formulation of Algorithm 1, when $m$ is an integer divisor of $K$, the server chooses every user exactly once during the first $\frac{K}{m}$ rounds due to the initial conditions.

Input: Set of users $\mathbb{K}$; number of active users $m$
Init: Set $T_{k}(0),\overline{\mu_{k}}(0),p_{k}(0)\leftarrow 0$; ${\rm ucb}(k,0)\leftarrow\infty$; initial model parameters $\bm{\theta}_{0}$
for $t=1,2,\ldots$ do
  Select $\mathcal{P}_{t}$ via (14);
  Share $\bm{\theta}_{t-1}$ with users in $\mathcal{P}_{t}$;
  Aggregate global model $\bm{\theta}_{t}$ via (2);
  for $k\in\mathbb{K}$ do
    Update $T_{k}(t)\leftarrow T_{k}(t-1)+\mathbf{1}(k\in\mathcal{P}_{t})$;
    Update empirical estimate $\overline{\mu_{k}}(t)$ via (12);
    Update ${\rm ucb}(k,t)$ via (13);
return $\bm{\theta}_{t}$
Algorithm 1 PAUSE
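A compact sketch of one selection step of Algorithm 1, combining the UCB index (13) with the brute-force search (14) over all $\binom{K}{m}$ subsets; the dictionaries holding per-user statistics are illustrative stand-ins for the state maintained by the server.

```python
import itertools
import math

def ucb(mu_bar, T_k, t, m):
    # ucb(k, t) of (13); never-selected users get an infinite index,
    # mirroring the initialization ucb(k, 0) <- infinity in Algorithm 1.
    if T_k == 0:
        return math.inf
    return mu_bar + math.sqrt((m + 1) * math.log(t) / T_k)

def select_users(users, m, t, mu_bar, T, g, p, alpha, gamma):
    # Brute-force maximization of (14) over all m-sized subsets.
    best_set, best_val = None, -math.inf
    for S in itertools.combinations(users, m):
        val = (min(ucb(mu_bar[k], T[k], t, m) for k in S)
               + alpha / m * sum(g[k] for k in S)
               + gamma / m * sum(p[k] for k in S))
        if val > best_val:
            best_set, best_val = set(S), val
    return best_set
```

The exhaustive loop makes the combinatorial cost explicit: it visits every one of the $\binom{K}{m}$ candidate sets, which is exactly the bottleneck that motivates the SA-based relaxation of Section IV.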

III-C Regret Analysis

To evaluate PAUSE, we next analyze its regret, which for a policy $\Pi$ is defined as the expected reward gap between the given policy and the Genie's policy:

R^{\Pi}(n)\triangleq\mathbb{E}\Bigl{[}\sum_{t=1}^{n}r(\mathcal{G}_{t},t)-r(\mathcal{S}_{t},t)\Bigr{]}. (15)

We define the maximal reward gap over any policy as $\Delta_{\max}\triangleq\max_{t\in\mathbb{N},\Pi}r(\mathcal{G}_{t},t)-r(\mathcal{S}_{t},t)$. This quantity is bounded as stated in the following lemma:

Lemma 1.

User selection via (14) with the reward (9) satisfies

\Delta_{\max}\leq 2\alpha+\gamma+\max_{k\in\mathbb{K}}\mu_{k}-\min_{k\in\mathbb{K}}\mu_{k}. (16)
Proof.

Inequality (16) follows from (9), as $g_{k}(t)\in[-1,1]$ and $p_{k}(t)\in[0,1]$ for every $k\in\mathbb{K}$ and $t\in\mathbb{N}$. ∎

We bound the regret of PAUSE in the following theorem:

Theorem 3.

The regret of PAUSE satisfies

R^{\mathcal{P}}(n)\leq K(\Delta_{\max}+\delta)\left(\frac{4(m+1)\log(n)}{\delta^{2}}+1+\frac{2\pi^{2}}{3}\right). (17)
Proof.

The proof is given in Appendix A. ∎

Theorem 3 bounds the regret accumulated by round $n$. In the asymptotic regime, it implies that PAUSE achieves a regret whose growth order does not exceed $\mathcal{O}(\log(n))$, which is the best-known regret rate in MAB [41, 42].

III-D Discussion

PAUSE is particularly designed to facilitate privacy- and communication-constrained FL. It leverages MAB-based active user selection to dynamically cope with privacy leakage accumulation, without restricting the overall number of FL rounds as in [36, 20, 24]. PAUSE is theoretically shown to achieve the best-known regret growth, and it demonstrates promising results in our experiments, as detailed in Section V.

The formulation of PAUSE in Algorithm 1 focuses on the server operation, requiring the users only to send their updates with the proper PPN. As such, it can be naturally combined with existing methods for alleviating latency and enhancing privacy via update encoding [4]. Moreover, Algorithm 1 complies with any imposed privacy policy, while adhering to considerations C1-C3. This inherent adaptability makes it an agile solution across diverse policy frameworks.

A core challenge in applying PAUSE stems from the fact that (14) involves a brute-force search over $\binom{K}{m}$ options. Such computation is expected to become infeasible in large networks, i.e., as $K$ grows, making it incompatible with consideration C4. This complexity can be alleviated by approximating the brute-force search with low-complexity policies based on (14), as we do in the sequel.

IV SA-PAUSE

In this section we alleviate the computational burden associated with the brute-force search operation of PAUSE. The resulting algorithm, termed SA-PAUSE, is based on SA principles, as detailed in Subsection IV-A. We analyze SA-PAUSE in Subsection IV-B, rigorously identifying conditions under which it coincides with PAUSE and characterizing its time complexity.

IV-A Simulated Annealing Algorithm

To improve the computational efficiency of the search procedure in (14), we construct a graph structure whose set of vertices $\mathbb{V}$ comprises all possible subsets of $m$ users in $\mathbb{K}$. For each vertex (i.e., set of users) $\mathcal{V}\in\mathbb{V}$, we denote its neighboring set by $\mathcal{N}_{\mathcal{V}}$. Two vertices $\mathcal{V},\mathcal{U}\in\mathbb{V}$ are designated as neighbors when they satisfy the following requirements:

  1. R1

The intersection of the vertices contains exactly $m-1$ elements, i.e., the sets of users $\mathcal{V}$ and $\mathcal{U}$ differ in a single user, so that $|\mathcal{V}\cap\mathcal{U}|=m-1$.

  2. R2

One of the users which appears in only a single set minimizes one of the terms of the selection rule (14) in its designated group, i.e., one of the sets is an active neighbor of the other. Mathematically, we say that $\mathcal{U}$ is an active neighbor of $\mathcal{V}$ (and $\mathcal{V}$ is a passive neighbor of $\mathcal{U}$) if the distinct user in $\mathcal{V}$, i.e., $k=\mathcal{V}\setminus\mathcal{U}$, satisfies

k\in\Big{\{}\operatorname*{arg\,min}_{k^{\prime}\in\mathcal{V}}{\rm ucb}(k^{\prime},t-1),\operatorname*{arg\,min}_{k^{\prime}\in\mathcal{V}}p_{k^{\prime}}(t-1),\operatorname*{arg\,min}_{k^{\prime}\in\mathcal{V}}g_{k^{\prime}}(t-1)\Big{\}}.

The above graph construction is inherently undirected due to the symmetric nature of the neighbor relationships.

To formalize our optimization objective, we define the energy of each vertex as the quantity we seek to maximize in PAUSE's search (14). Specifically, for any vertex $\mathcal{V}$, define

E(\mathcal{V})\triangleq\min_{k\in\mathcal{V}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{V}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{V}}p_{k}(t-1). (18)

To identify a vertex exhibiting maximal energy, we introduce an optimized SA-based algorithm [43], which iteratively inspects vertices (i.e., candidate user sets) in the graph. The resulting procedure, detailed in Algorithm 2, is comprised of two stages taking place at FL round $t$: initialization and iterative search.

Initialization: Following established SA methodology, we maintain an auxiliary temperature sequence, whose $j$th entry is defined as $\tau_{j}=\frac{C}{\log(j+1)}$, where the parameter $C>0$ exceeds the maximal energy differential between any pair of vertices in the graph. Thus, one must first set the value of $C$.

Accordingly, the initialization phase at round $t$ involves sorting all $K$ users according to their respective ${\rm ucb}(k,t-1)$, $p_{k}(t-1)$, and $g_{k}(t-1)$ values into three distinct lists. These three sorted lists are first used to determine an appropriate value for $C$. For each list $i\in\{1,2,3\}$, we denote by $\mathcal{A}_{m}^{i}$ and $\mathcal{B}_{m}^{i}$ the sets containing the $m$ users with minimal and maximal values, respectively. The parameter $C$ is then set as follows, where $\omega$ represents a small positive constant:

C=\min_{k\in\mathcal{B}^{1}_{m}}{\rm ucb}(k,t-1)-\min_{k\in\mathcal{A}^{1}_{m}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\Big{[}\sum_{k\in\mathcal{B}^{2}_{m}}g_{k}(t-1)-\sum_{k\in\mathcal{A}^{2}_{m}}g_{k}(t-1)\Big{]}+\frac{\gamma}{m}\Big{[}\sum_{k\in\mathcal{B}^{3}_{m}}p_{k}(t-1)-\sum_{k\in\mathcal{A}^{3}_{m}}p_{k}(t-1)\Big{]}+\omega. (19)
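The computation of $C$ in (19) uses only the three sorted lists; below is a sketch under the assumption that the per-user statistics are given as plain value lists (the helper name `energy_gap_bound` is ours).

```python
def energy_gap_bound(ucb_vals, g_vals, p_vals, m, alpha, gamma, omega=1e-6):
    # C of (19): spans of the m largest vs. m smallest entries of the three
    # sorted statistics, plus a small positive constant omega.
    u, g, p = sorted(ucb_vals), sorted(g_vals), sorted(p_vals)
    ucb_span = min(u[-m:]) - min(u[:m])          # UCB term of (19)
    g_span = sum(g[-m:]) - sum(g[:m])            # generalization term
    p_span = sum(p[-m:]) - sum(p[:m])            # privacy term
    return ucb_span + alpha / m * g_span + gamma / m * p_span + omega
```

Since each term is a difference between the best and worst possible contribution of an $m$-sized group, the result upper-bounds the energy gap (18) between any two vertices, as required by the cooling schedule.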

Iterative Search: The algorithm's iterative phase updates an inspected vertex, moving at iteration $j$ from the previously inspected $\mathcal{V}_{j}$ to an updated $\mathcal{V}_{j+1}$. This necessitates the identification of $\mathcal{N}_{\mathcal{V}_{j}}$. We decompose this task into the discovery of active and passive neighbors, as specified in R2, utilizing the previously constructed sorted lists:

  1. N1

    Active Neighbor Identification - To determine the active neighbors in iteration jj, we examine each sorted list (ucb(k,t1){\rm ucb}(k,t-1), pk(t1)p_{k}(t-1), and gk(t1)g_{k}(t-1)) to identify the user with the minimal value within 𝒱j\mathcal{V}_{j}. An active neighbor is generated by substituting any of these minimal-value users with a user not present in 𝒱j\mathcal{V}_{j}. This procedure yields at most 3(Km)3(K-m) active neighbors of 𝒱j\mathcal{V}_{j}.

  2. N2

    Passive Neighbor Identification - For passive neighbors, we establish that a vertex 𝒰\mathcal{U} qualifies as a passive neighbor of 𝒱j\mathcal{V}_{j} if it can be constructed through one of two mechanisms, illustrated using the ucb(k,t1){\rm ucb}(k,t-1) sorted list. Let aa denote the user with minimal ucb(k,t1){\rm ucb}(k,t-1) in 𝒱j\mathcal{V}_{j} and bb represent the user with the second-minimal value. 𝒰\mathcal{U} is a passive neighbor of 𝒱j\mathcal{V}_{j} if it is obtained by either:

    1. (a)

      Replace any user in 𝒱j\mathcal{V}_{j} except aa with a user whose ucb(k,t1){\rm ucb}(k,t-1) value is lower than aa’s (positioned before aa in the sorted list).

    2. (b)

      Replace aa with a user whose ucb(k,t1){\rm ucb}(k,t-1) value is lower than bb’s (positioned before bb in the sorted list).

Once the neighbor set 𝒩𝒱j\mathcal{N}_{\mathcal{V}_{j}} is formed, the algorithm inspects a random neighbor 𝒰\mathcal{U}. This vertex is inspected in the following iteration if it improves upon 𝒱j\mathcal{V}_{j} in terms of the energy (18) (in which case it is also saved as the best set explored so far); otherwise, it is accepted with probability exp(E(𝒱j)E(𝒰)τj)\exp{\big{(}-\frac{E(\mathcal{V}_{j})-E(\mathcal{U})}{\tau_{j}}\big{)}}. The resulting procedure is summarized as Algorithm 2.

Input : Set of users 𝕂\mathbb{K}; Number of active users mm
Init : Randomly sample a vertex 𝒱1\mathcal{V}_{1} and set 𝒫t=𝒱1\mathcal{P}_{t}=\mathcal{V}_{1};
Sort the users along ucb(k,t1){\rm ucb}(k,t-1), pk(t1)p_{k}(t-1), and gk(t1)g_{k}(t-1), in three different lists.
1 Compute CC via (19);
2 for j=1,2j=1,2\ldots do
3  Find N𝒱jN_{\mathcal{V}_{j}} as described in N1 and N2;
4  Sample randomly 𝒰𝒩𝒱j\mathcal{U}\in\mathcal{N}_{\mathcal{V}_{j}};
5 if E(𝒰)E(𝒱j)E(\mathcal{U})\geq E(\mathcal{V}_{j}) then
6      Update inspected vertex 𝒱j+1𝒰\mathcal{V}_{j+1}\leftarrow\mathcal{U};
7      Update best vertex 𝒫t𝒰\mathcal{P}_{t}\leftarrow\mathcal{U} if E(𝒰)>E(𝒫t)E(\mathcal{U})>E(\mathcal{P}_{t});
8    
9 else
10     Sample pp uniformly over [0,1][0,1];
11      Set τj=Clog(1+j)\tau_{j}=\frac{C}{\log(1+j)};
12    if pexp(E(𝒱j)E(𝒰)τj)p\leq\exp{\big{(}-\frac{E(\mathcal{V}_{j})-E(\mathcal{U})}{\tau_{j}}\big{)}} then
13         Update inspected vertex 𝒱j+1𝒰\mathcal{V}_{j+1}\leftarrow\mathcal{U};
14    else
15        Re-inspect vertex 𝒱j+1𝒱j\mathcal{V}_{j+1}\leftarrow\mathcal{V}_{j};
16    
17 
return 𝒫t\mathcal{P}_{t}
Algorithm 2 Tailored SA for PAUSE at round tt
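A minimal Python sketch of the accept/reject loop in Algorithm 2 follows; the energy and neighbor-generation routines are abstracted as callables, and all names here are illustrative assumptions rather than the authors' code. A worse neighbor is accepted with a probability that decays as the temperature τj=C/log(1+j)\tau_j=C/\log(1+j) cools:

```python
import math
import random

def sa_search(initial, neighbors_fn, energy_fn, C, num_iters=2000, rng=random):
    """Simulated-annealing sketch: always accept improving neighbors;
    accept a worse neighbor with probability exp(-(E(V_j)-E(U))/tau_j),
    where tau_j = C / log(1 + j) is the logarithmic cooling schedule."""
    current, best = initial, initial
    for j in range(1, num_iters + 1):
        candidate = rng.choice(neighbors_fn(current))   # sample U in N(V_j)
        diff = energy_fn(candidate) - energy_fn(current)
        if diff >= 0:                                   # improving move
            current = candidate
            if energy_fn(current) > energy_fn(best):
                best = current                          # track best vertex
        else:
            tau = C / math.log(1 + j)                   # cooling temperature
            if rng.random() <= math.exp(diff / tau):    # diff < 0 here
                current = candidate                     # accept worse move
    return best
```

With `neighbors_fn` producing candidate user sets per N1-N2 and `energy_fn` evaluating (18), the returned vertex approximates the energy maximizer.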

The proposed SA-PAUSE implements the FL procedure with active user selection, using Algorithm 2 to approximate PAUSE’s search (14). SA-PAUSE thus realizes Algorithm 1 while replacing its Step 1 with Algorithm 2.

IV-B Theoretical Analysis

Optimality: The SA search of SA-PAUSE, detailed in Algorithm 2, replaces the search over all possible user selections with exploration over a graph. To establish its validity, we first prove that it indeed finds the reward-maximizing set of users, as PAUSE does. Since, in general, there may be more than one set of users that maximizes the reward (or equivalently, the energy (18)), we use 𝒥\mathcal{J} to denote the set of vertices exhibiting maximal energy in the graph. The ability of Algorithm 2 to recover the same user set as the brute-force search in (14) (or one that is equivalent in terms of reward) is stated in the following theorem:

Theorem 4.

For Algorithm 2, it holds that:

limj(𝒱j𝒥)=1.\lim_{j\rightarrow\infty}\mathbb{P}(\mathcal{V}_{j}\in\mathcal{J})=1. (20)
Proof.

The proof is given in Appendix -B. ∎

Theorem 4 shows that Algorithm 2 is guaranteed to recover the reward-maximizing user set in the limit of infinitely many iterations. While the SA algorithm operates over a finite number of iterations, and Theorem 4 applies as jj\rightarrow\infty, the carefully designed cooling temperature sequence and algorithmic structure ensure robust practical performance of SA algorithms [50, 51]. This efficacy is empirically validated in Section V.

Time-Complexity: Having shown that Algorithm 2 can approach the user set recovered via PAUSE, we next show that it fulfills its core motivation, i.e., carrying out this computation with reduced complexity, thereby supporting scalability. While the number of selected users mm is inherently smaller than the overall number of users KK, and often mKm\ll K, our analysis accommodates computationally intensive settings where mm is allowed to grow with KK, up to the order of m=Θ(K)m=\Theta(K).

On each FL round tt, the initialization phase requires 𝒪(KlogK)\mathcal{O}(K\log K) operations due to the list sorting procedures. During each iteration jj, locating the indices of 𝒱j\mathcal{V}_{j}’s users in the sorted lists can be accomplished in 𝒪(KlogK)\mathcal{O}(K\log K) operations through pointer manipulation. The identification of 𝒩𝒱j\mathcal{N}_{\mathcal{V}_{j}} exhibits complexity 𝒪(|𝒩𝒱j|)\mathcal{O}(|\mathcal{N}_{\mathcal{V}_{j}}|), as each neighbor can be found in constant time. While the number of active neighbors is bounded by 3(Km)3(K-m), the number of passive neighbors varies across vertices and iterations. Since each passive neighbor 𝒰\mathcal{U} of 𝒱j\mathcal{V}_{j} corresponds to 𝒱j\mathcal{V}_{j} being an active neighbor of 𝒰\mathcal{U}, and considering the bounded number of active neighbors per vertex, a balanced graph typically exhibits approximately 3(Km)3(K-m) passive neighbors per vertex. Specifically, in the average case where each vertex in 𝕍\mathbb{V} has 𝒪(KlogK)\mathcal{O}(K\log K) passive neighbors, the complexity order of Algorithm 2 is 𝒪(KlogK)\mathcal{O}(K\log K).

For comparative purposes, consider a simplified SA variant (termed Vanilla-SA) in which the neighboring criterion is reduced to only the first condition in R1 (i.e., nodes are neighbors if they share exactly m1m-1 users). This algorithm closely resembles Algorithm 2, but eliminates list sorting and determines N𝒱jN_{\mathcal{V}_{j}} by exhaustively replacing each user in 𝒱j\mathcal{V}_{j} with each user in 𝕂𝒱j\mathbb{K}\setminus\mathcal{V}_{j}. In this case, setting CC to be an upper bound on Δmax\Delta_{max} (16), e.g., C2α+γ+1C\triangleq 2\alpha+\gamma+1, satisfies the conditions of Theorem 4 as well, ensuring asymptotic convergence. However, this approach results in |N𝒱j|=m(Km)|N_{\mathcal{V}_{j}}|=m(K-m), producing a densely connected graph that impedes search efficiency and invariably yields 𝒪(K2)\mathcal{O}(K^{2}) complexity. Table I presents a comprehensive comparison of time complexities across different scenarios.
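As an illustration of why Vanilla-SA yields a dense graph, its neighborhood can be enumerated directly; this sketch (including the frozenset representation of user sets) is an assumption for exposition:

```python
from itertools import product

def vanilla_neighbors(V, K):
    """All user sets sharing exactly m-1 users with V: swap each user in V
    with each user outside V, giving |N_V| = m * (K - m) neighbors."""
    outside = set(range(K)) - set(V)
    return [frozenset((V - {u}) | {w}) for u, w in product(V, outside)]
```

For K=300 and m=15, this already gives 15·285 = 4275 neighbors per vertex, compared with at most 3(K−m) = 855 active neighbors under the tailored criterion.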

Algorithm | Best | Average | Worst
Brute force search (14) | O(eK)O(e^{K}) | O(eK)O(e^{K}) | O(eK)O(e^{K})
Vanilla-SA | O(K2)O(K^{2}) | O(K2)O(K^{2}) | O(K2)O(K^{2})
Algorithm 2 | O(KlogK)O(K\log K) | O(KlogK)O(K\log K) | O(K2)O(K^{2})
TABLE I: Time complexity comparison of different algorithms

Summary: Combining the optimality analysis in Theorem 4 with the complexity characterization in Table I indicates that integrating Algorithm 2 into SA-PAUSE to approximate PAUSE’s search (14) enables the application of PAUSE to large-scale networks, meeting C4. The theoretical convergence guarantees, coupled with its practical efficiency, make it a robust solution for approximating PAUSE while still adhering to considerations C1-C3. The empirical validation of these theoretical results is presented comprehensively in the following section.

V Numerical Study

V-A Experimental Setup

Here, we numerically evaluate PAUSE in FL. The source code used in our experimental study, including all the hyper-parameters, is available online at https://github.com/oritalp/PAUSE/tree/production. We consider the training of a DNN for image classification based on MNIST and CIFAR-10. The trained model comprises a convolutional neural network (CNN) with three hidden layers. These layers are followed by a fully-connected (FC) network with two hidden layers for CIFAR-10, and by a three-layer FC network with 32 neurons at its widest layer for MNIST.

We examine our approach in both small and large network settings with varying privacy budgets. In the former, the data is divided among K=30K=30 users, with m=5m=5 of them chosen at each round, while the latter corresponds to K=300K=300 and m=15m=15 users. The communication latency τk\tau_{k} obeys a normal distribution for every k𝕂k\in\mathbb{K}. The users are equally divided into two groups: fast users, with lower expected communication latency, and slower users. For each configuration, we test our approach under both i.i.d. and non-i.i.d. data distributions. In the imbalanced case, the data quantities are sampled from a Dirichlet distribution with parameter 𝜶\bm{\alpha}, where each user exhibits a dominant label comprising approximately a quarter of its data.
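A sketch of how such an imbalanced split of data quantities can be drawn follows; the helper name, seed handling, and rounding logic are assumptions for illustration, not the paper's experiment code:

```python
import numpy as np

def dirichlet_quantities(num_users, total_samples, alpha=3.0, seed=0):
    """Sample per-user data quantities from a symmetric Dirichlet
    distribution; smaller alpha yields a more imbalanced split."""
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet(alpha * np.ones(num_users))
    quantities = (proportions * total_samples).astype(int)
    quantities[0] += total_samples - quantities.sum()  # absorb rounding drift
    return quantities
```

The label-skew step (assigning each user a dominant label covering about a quarter of its data) would be layered on top of these quantities.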

As PAUSE becomes computationally infeasible in the large network case, it is only tested on small networks, while SA-PAUSE is tested in both scenarios. These algorithms are compared with the following benchmarks:

  • Random, uniformly sampling m=5m=5 users without replacement [44], solely in the i.i.d. balanced case.

  • FedAvg with privacy and FedAvg w.o. privacy, choosing all KK users, with and without privacy, respectively.

  • Fastest in expectation, using the same pre-known mm fastest users in expectation at each round.

  • The clustered sampling selection algorithm proposed in [20].

V-B Small Network with i.i.d. Data

Our first study trains the aforementioned CNN with an overall privacy budget of ϵ¯=40\bar{\epsilon}=40 for image classification using the CIFAR-10 dataset. The resulting FL accuracy versus communication latency is illustrated in Fig. 2. The error curves were smoothed with an averaging window of size 1010 to attenuate fluctuations. As expected, due to privacy leakage accumulation, the more rounds a user participates in, the noisier its updates become. This is evident in Fig. 2, where choosing all users quickly results in ineffective updates. PAUSE consistently achieves both accurate learning and rapid convergence. Further inspection of this figure indicates that SA-PAUSE successfully approximates PAUSE’s brute-force search as well.

Refer to caption
Figure 2: Validation accuracy vs. latency, CIFAR-10, i.i.d. data, small network

PAUSE’s ability to mitigate privacy accumulation is showcased in Fig. 3, where we report the overall leakage as it evolves over epochs. Fig. 3 reveals that the privacy violation at each given epoch under PAUSE is lower than under the random and clustered sampling methods, adding to its improved accuracy and latency noted in Fig. 2. Note that the maximum privacy violations of the FedAvg with privacy and fastest in expectation methods coincide, since in every epoch both are increased by ϵi\epsilon_{i}.

Refer to caption
Figure 3: Privacy leakage vs. global epochs, CIFAR-10. i.i.d. data, small network

V-C Small Network with non-i.i.d. Data

Subsequently, we train the same DNN on CIFAR-10 in the non-i.i.d. case described previously, with an overall privacy budget of ϵ¯=100\bar{\epsilon}=100. As opposed to the balanced-data test, this setting necessitates balancing between users with varying quantities of data, which may contribute differently to the learning process. The data quantities were sampled from a Dirichlet distribution with parameter 𝜶=𝟑{\bm{\alpha}}={\bm{3}}. Analyzing the validation accuracy versus communication latency in Fig. 4 indicates the superiority of our algorithms in this case as well, in terms of both accuracy and latency. Fig. 5 depicts the maximum privacy violation of the system, this time versus the communication latency, and supports this claim by demonstrating the ability of both PAUSE and its approximation to better preserve privacy while performing more server-client iterations in any given time.

Refer to caption
Figure 4: Validation accuracy vs. latency, CIFAR-10, non-i.i.d data, small network
Refer to caption
Figure 5: Privacy leakage vs. latency, CIFAR-10, non-i.i.d data, small network

V-D Large Networks

We proceed to consider the large network settings. Here, we train two models: one for MNIST with i.i.d. data distribution, and one for CIFAR-10 with non-i.i.d. data. For these scenarios, we implemented two modifications. First, to accelerate the convergence of the SA procedure in Algorithm 2 within a reasonable number of iterations, we modulate the temperature coefficient CC as in [52, 53]. This is accomplished by dividing the temperature coefficient by a constant κ=30\kappa=30, i.e., the temperature at the jjth iteration becomes τj=Cκlog(1+j)\tau_{j}=\frac{C}{\kappa\log(1+j)} [52, 53]. Second, to enhance exploitation [54, 55], we amplified the empirical mean μk¯(t)\overline{\mu_{k}}(t) in (13) by another constant, ζ=3\zeta=3.

The overall privacy budgets for the MNIST and CIFAR-10 evaluations are set to ϵ¯=20\bar{\epsilon}=20 and ϵ¯=10\bar{\epsilon}=10, respectively. In the former, the data quantities were sampled from a Dirichlet distribution with parameter 𝜶=𝟐{\bm{\alpha}}={\bm{2}}. Both cases exhibited consistent trends with the small networks tests, systematically demonstrating SA-PAUSE’s robustness across diverse privacy budgets, datasets, and network scales.

As before, the validation accuracy versus communication latency graphs are presented in Fig. 6 and Fig. 8, while the maximum overall privacy leakage versus time graphs are depicted in Fig. 7 and Fig. 9. These results systematically demonstrate the ability of our proposed SA-PAUSE to facilitate rapid learning over large networks with balanced and limited privacy leakage.

Refer to caption
Figure 6: Validation accuracy vs. latency, MNIST, i.i.d data, large network
Refer to caption
Figure 7: Privacy leakage vs. latency, MNIST. i.i.d data, large network
Refer to caption
Figure 8: Validation accuracy vs. latency, CIFAR-10, non-i.i.d data, large network
Refer to caption
Figure 9: Privacy leakage vs. latency, CIFAR-10, non-i.i.d data, large network

VI Conclusion

We proposed PAUSE, an active and dynamic user selection algorithm operating under fixed privacy constraints. The algorithm balances three FL aspects: accuracy of the trained model, communication latency, and system privacy. We showed that, under common assumptions, PAUSE’s regret achieves a logarithmic order with time. To address complexity and scalability, we developed SA-PAUSE, which integrates an SA algorithm with theoretical guarantees to approximate PAUSE’s brute-force search in feasible running time. We numerically demonstrated SA-PAUSE’s ability to approximate PAUSE’s search and its superiority over alternative approaches in diverse experimental scenarios.

-A Proof of Theorem 3

In the following, define hk(t)(m+1)log(t)Tk(t)h_{k}(t)\triangleq\sqrt{\frac{(m+1)\log(t)}{T_{k}(t)}}. The regret can be bounded following the definition of Δmax\Delta_{max} as

R𝒫(n)\displaystyle R^{\mathcal{P}}(n) =𝔼[t=1nr(𝒢t,t)r(𝒫t,t)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{n}r(\mathcal{G}_{t},t)-r(\mathcal{P}_{t},t)\right]
Δmax𝔼[t=1n𝟏(r(𝒢t,t)r(𝒫t,t))].\displaystyle\leq\Delta_{\max}\mathbb{E}\left[\sum_{t=1}^{n}\mathbf{1}\big{(}r(\mathcal{G}_{t},t)\neq r(\mathcal{P}_{t},t)\big{)}\right]. (-A.1)

We introduce another indicator function for every i𝕂i\in\mathbb{K} along with its cumulative sum, denoted:

Ii(t)\displaystyle I_{i}(t) {1,{i=argmink𝒞tTk(t1)r(𝒫t,t)r(𝒢t,t)0,elseNi(n)t=1nIi(t).\displaystyle\triangleq\left.\begin{cases}1,&\begin{cases}&i=\underset{k\in\mathcal{C}_{t}}{\operatorname*{arg\,min}}~T_{k}(t-1)\\ &r(\mathcal{P}_{t},t)\neq r(\mathcal{G}_{t},t)\end{cases}\\ 0,&\text{else}\end{cases}\right.\,N_{i}(n)\triangleq\sum_{t=1}^{n}I_{i}(t).

Let 𝒞t𝒫t𝒢t\mathcal{C}_{t}\triangleq\mathcal{P}_{t}\cup\mathcal{G}_{t}. In every round tt where r(𝒫t,t)r(𝒢t,t)r(\mathcal{P}_{t},t)\neq r(\mathcal{G}_{t},t), the counter Nk(t)N_{k}(t) is incremented for only a single user in 𝒞t\mathcal{C}_{t}, while for the remaining users Nk(t1)=Nk(t)N_{k}(t-1)=N_{k}(t). Thus, it holds that t=1n𝟏(r(𝒢t,t)r(𝒫t,t))=k=1KNk(n)\sum_{t=1}^{n}\mathbf{1}\big{(}r(\mathcal{G}_{t},t)\neq r(\mathcal{P}_{t},t)\big{)}=\sum_{k=1}^{K}N_{k}(n). Substituting this into (-A.1), we obtain that

R𝒫(n)Δmaxk=1K𝔼[Nk(n)].R^{\mathcal{P}}(n)\leq\Delta_{\max}\sum_{k=1}^{K}\mathbb{E}[N_{k}(n)]. (-A.2)

In the remainder, we focus on bounding 𝔼[Nk(n)]\mathbb{E}[N_{k}(n)] for every k𝕂k\in\mathbb{K}, and then substitute the derived upper bound into (-A.2). To that aim, let k𝕂k\in\mathbb{K} and fix some ll\in\mathbb{N}, whose value is determined later. We note that:

𝔼[Nk(n)]=𝔼[t=1n𝟏(Ik(t)=1)]\displaystyle\mathbb{E}[N_{k}(n)]=\mathbb{E}[\sum_{t=1}^{n}\mathbf{1}(I_{k}(t)=1)]
=𝔼[t=1n𝟏(Ik(t)=1,Nk(t)l)+𝟏(Ik(t)=1,Nk(t)>l)]\displaystyle=\mathbb{E}[\sum_{t=1}^{n}\mathbf{1}(I_{k}(t)=1,N_{k}(t)\leq l)+\mathbf{1}(I_{k}(t)=1,N_{k}(t)>l)]
(a)l+𝔼[t=1n𝟏(Ik(t)=1,Nk(t)>l)],\displaystyle\qquad\qquad\stackrel{{\scriptstyle(a)}}{{\leq}}l+\mathbb{E}[\sum_{t=1}^{n}\mathbf{1}(I_{k}(t)=1,N_{k}(t)>l)], (-A.3)

where (a)(a) arises from separately considering the cases Nk(t)lN_{k}(t)\leq l and its complement.

PAUSE’s policy (14) implies that in every iteration:

mink𝒫tucb(k,t1)+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)\displaystyle\!\!\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t-1)\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}\!g_{k}(t\!-\!1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}\!p_{k}(t\!-\!1)\geq
mink𝒢tucb(k,t1)+αmk𝒢tgk(t1)+γmk𝒢tpk(t1).\displaystyle\!\!\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t\!-\!1)\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}\!g_{k}(t\!-\!1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}\!p_{k}(t\!-\!1). (-A.4)

Since this holds with probability one, we can incorporate it into the aforementioned inequality (-A.3):

𝔼[Nk(n)]l+𝔼[t=1n𝟏{Ik(t)=1,Nk(t)>l,mink𝒫tucb(k,t1)+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)mink𝒢tucb(k,t1)+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)}].\begin{split}&\mathbb{E}[N_{k}(n)]\leq l+\mathbb{E}\Biggl{[}\sum_{t=1}^{n}\mathbf{1}\Biggl{\{}I_{k}(t)=1,N_{k}(t)>l,\\ &\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\geq\\ &\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)\Biggr{\}}\Biggr{]}.\end{split}

We now denote the users chosen in the ttth iteration by the Genie and by the PAUSE algorithm as 𝒢t=u~t,1,,u~t,m\mathcal{G}_{t}=\tilde{u}_{t,1},...,\tilde{u}_{t,m} and 𝒫t=ut,1,,ut,m\mathcal{P}_{t}=u_{t,1},...,u_{t,m}, respectively. For every tt, the indicator function in the sum equals 1 only if the kkth user is the one chosen the fewest times at the beginning of the ttth iteration, i.e., Tk(t1)Tj(t1)T_{k}(t-1)\leq T_{j}(t-1) for every j𝒞tj\in\mathcal{C}_{t}. The intersection of Ik(t)=1I_{k}(t)=1 with Nk(t)>lN_{k}(t)>l implies Tk(t1)lT_{k}(t-1)\geq l. Therefore, this intersection of events implies that for every j𝒞tj\in\mathcal{C}_{t}, lTj(t1)t1l\leq T_{j}(t-1)\leq t-1. Using this result, we can further bound every event in the indicator functions in the upper bound of 𝔼[Nk(n)]\mathbb{E}[N_{k}(n)]:

𝔼[Nk(n)]l+𝔼[t=1n𝟏{Ik(t)=1,Nk(t)>l,minlTut,1,,Tut,mt1(mink𝒫tucb(k,t1)+αmk𝒫tgk(t1)+γmk𝒫tpk(t1))minlTu~t,1,,Tu~t,mt1(mink𝒢tucb(k,t1)+αmk𝒢tgk(t1)+γmk𝒢tpk(t1))}].\begin{split}\mathbb{E}[N_{k}(n)]\leq&l+\mathbb{E}\Biggl{[}\sum_{t=1}^{n}\mathbf{1}\biggl{\{}I_{k}(t)=1,N_{k}(t)>l,\\ &\min_{l\leq T_{u_{t,1}},...,T_{u_{t,m}}\leq t-1}\biggr{(}\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t-1)+\\ &\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\biggl{)}\geq\\ &\min_{l\leq T_{\tilde{u}_{t,1}},...,T_{\tilde{u}_{t,m}}\leq t-1}\biggl{(}\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t-1)+\\ &\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)\biggr{)}\biggr{\}}\biggr{]}.\end{split}

Using the fact that for any finite collection of qq events {Ai}i=1q\{A_{i}\}_{i=1}^{q} it holds that 𝟏(i=1qAi)i=1q𝟏(Ai)\mathbf{1}(\cup_{i=1}^{q}A_{i})\leq\sum_{i=1}^{q}\mathbf{1}(A_{i}), and that the expectation of an indicator function is the probability of the corresponding event, we have that

𝔼[Nk(n)]l+t=1nlTu~t,1,,Tu~t,m,Tut,1,,Tut,mt1[mink𝒫tucb(k,t1)+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)mink𝒢tucb(k,t1)+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)].\begin{split}&\mathbb{E}[N_{k}(n)]\leq l+\sum_{t=1}^{n}\sum_{l\leq T_{\tilde{u}_{t,1}},...,T_{\tilde{u}_{t,m}},T_{u_{t,1}},...,T_{u_{t,m}}\leq t-1}\\ &\mathbb{P}\biggl{[}\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t\!-\!1)\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}\!g_{k}(t\!-\!1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}\!p_{k}(t\!-\!1)\geq\\ &\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t\!-\!1)\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}\!g_{k}(t\!-\!1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}\!p_{k}(t\!-\!1)\biggr{]}.\end{split} (-A.5)

In the following steps we focus on bounding the terms in the double sum. To that aim, we define the following:

at=argmink𝒫tucb(k,t1),bt=argmink𝒢tucb(k,t1).a_{t}\!=\!\operatorname*{arg\,min}_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t\!-\!1),~b_{t}\!=\!\operatorname*{arg\,min}_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t\!-\!1). (-A.6)

Using these notations and writing hathat(t)h_{a_{t}}\triangleq h_{a_{t}}(t), we state the following lemma:

Lemma -A.1.

The event (-A.4) implies that at least one of the following three events occurs:

  1. 1.

    x¯bt+hbtμbt\bar{x}_{b_{t}}+h_{b_{t}}\leq\mu_{b_{t}};

  2. 2.

    x¯atμat+hat\bar{x}_{a_{t}}\geq\mu_{a_{t}}+h_{a_{t}};

  3. 3.

    μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1).

Proof.

Proof by contradiction: we assume that none of the three events occurs and examine the following chain:

x¯bt+hbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)>(1)μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)(3)μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)>(2)x¯at+hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1).\begin{split}&\bar{x}_{b_{t}}+h_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)\stackrel{{\scriptstyle(1)}}{{>}}\\ &\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)\stackrel{{\scriptstyle(3)}}{{\geq}}\\ &\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\stackrel{{\scriptstyle(2)}}{{>}}\\ &\bar{x}_{a_{t}}+h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1).\end{split}

By the definitions of ata_{t} and btb_{t} (-A.6), the inequality above can also be written as:

mink𝒫tucb(k,t1)+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)<\displaystyle\!\!\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t\!-\!1)\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}\!g_{k}(t\!-\!1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}\!p_{k}(t\!-\!1)<
mink𝒢tucb(k,t1)+αmk𝒢tgk(t1)+γmk𝒢tpk(t1),\displaystyle\!\!\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t\!-\!1)\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}\!g_{k}(t\!-\!1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}\!p_{k}(t\!-\!1), (-A.7)

contradicting our initially assumed event (-A.4). ∎

Applying the union bound and the relationship between the events shown in Lemma -A.1 implies:

[mink𝒫tucb(k,t1)+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)\displaystyle\mathbb{P}\biggl{[}\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)
mink𝒢tucb(k,t1)+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)]\displaystyle\geq\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)\biggr{]}
[x¯bt+hbtμbt](1)+[x¯atμat+hat](2)+\displaystyle\leq\overbrace{\mathbb{P}[\bar{x}_{b_{t}}+h_{b_{t}}\leq\mu_{b_{t}}]}^{\triangleq(1)}+\overbrace{\mathbb{P}[\bar{x}_{a_{t}}\geq\mu_{a_{t}}+h_{a_{t}}]}^{\triangleq(2)}+
[μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<\displaystyle\mathbb{P}[\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<
μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)](3).\displaystyle\underbrace{\mu_{a_{t}}\!+\!2h_{a_{t}}\!+\!\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)\!+\!\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)]}_{\triangleq(3)}. (-A.8)

We obtained three probability terms, (1)(1), (2)(2), and (3)(3). We first bound the former two using Hoeffding’s inequality [56]; term (3)(3) is bounded afterwards in a different manner. We demonstrate how the first term is bounded; the second is handled similarly by replacing btb_{t} with ata_{t}:

[x¯bt+hbtμbt]=[x¯btμbthbt]\displaystyle\mathbb{P}[\bar{x}_{b_{t}}+h_{b_{t}}\leq\mu_{b_{t}}]=\mathbb{P}[\bar{x}_{b_{t}}-\mu_{b_{t}}\leq-h_{b_{t}}]
=[j=1Tbt(t1)τmin(τbt)jμbtTbt(t1)hbtTbt(t1)]\displaystyle=\mathbb{P}\left[\sum_{j=1}^{T_{b_{t}}(t-1)}\frac{\tau_{min}}{(\tau_{b_{t}})_{j}}-\mu_{b_{t}}T_{b_{t}}(t-1)\leq-h_{b_{t}}T_{b_{t}}(t-1)\right]
e2Tbt2(t1)(m+1)log(t)Tbt2(t1)=t2(m+1),\displaystyle\leq e^{-\frac{2T_{b_{t}}^{2}(t-1)(m+1)\log(t)}{T_{b_{t}}^{2}(t-1)}}=t^{-2(m+1)}, (-A.9)

where (τbt)j(\tau_{b_{t}})_{j} is the latency of user btb_{t} at the jjth round in which it participated. This results in the following inequalities:

[x¯bt+hbtμbt]=(1)t2(m+1),[x¯atμat+hat]=(2)t2(m+1).\overbrace{\mathbb{P}[\bar{x}_{b_{t}}\!+\!h_{b_{t}}\leq\mu_{b_{t}}]}^{=(1)}\leq t^{-2(m\!+\!1)},~\overbrace{\mathbb{P}[\bar{x}_{a_{t}}\geq\mu_{a_{t}}\!+\!h_{a_{t}}]}^{=(2)}\leq t^{-2(m\!+\!1)}.
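For reference, the step above instantiates Hoeffding's inequality for the Tbt(t1)T_{b_{t}}(t-1) bounded reward samples τmin/(τbt)j[0,1]\tau_{min}/(\tau_{b_{t}})_{j}\in[0,1]; writing T=Tbt(t1)T=T_{b_{t}}(t-1) and using the definition of hk(t)h_{k}(t), the bound reads:

```latex
\Pr\left[\frac{1}{T}\sum_{j=1}^{T}\frac{\tau_{min}}{(\tau_{b_t})_j}-\mu_{b_t}\le -h_{b_t}\right]
\le e^{-2Th_{b_t}^2}
= e^{-2(m+1)\log(t)} = t^{-2(m+1)},
\qquad h_{b_t}^2=\frac{(m+1)\log(t)}{T}.
```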

To bound (3)(3), we introduce two additional definitions:

At=argmink𝒫tμk,Bt=argmink𝒢tμk.\begin{split}&A_{t}=\operatorname*{arg\,min}_{k\in\mathcal{P}_{t}}\mu_{k},\quad B_{t}=\operatorname*{arg\,min}_{k\in\mathcal{G}_{t}}\mu_{k}.\end{split} (-A.10)

Using the law of total probability, we divide (3)(3) into two parts:

[μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)](3)=[(μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1))(bt=Bt)]+[(μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1))(btBt)].\begin{split}&\mathbb{P}\biggr{[}\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\\ &\underbrace{\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\biggr{]}}_{\triangleq(3)}\\ &=\mathbb{P}\biggl{[}\bigl{(}\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\\ &\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\bigr{)}\\ &\cap(b_{t}=B_{t})\biggr{]}\\ &+\mathbb{P}\biggl{[}\bigl{(}\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\\ &\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\bigr{)}\\ &\cap(b_{t}\neq B_{t})\biggr{]}.\end{split}

We denote the former term as (3a)(3a) and the latter as (3b)(3b):

(3a)[(μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<\displaystyle(3a)\triangleq\mathbb{P}\biggl{[}\bigl{(}\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<
μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1))\displaystyle\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\bigr{)}
(bt=Bt)],\displaystyle\cap(b_{t}=B_{t})\biggr{]}, (-A.11a)
(3b)[(μbt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<\displaystyle(3b)\triangleq\mathbb{P}\biggl{[}\bigl{(}\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<
μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1))\displaystyle\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\bigr{)}
(btBt)].\displaystyle\cap(b_{t}\neq B_{t})\biggr{]}. (-A.11b)

In the following, we show that for a range of values of ll, which so far was arbitrary, (3a)(3a) equals 0. Recalling the definitions of ata_{t} (-A.6) and AtA_{t} (-A.10), we know that μAtμat\mu_{A_{t}}\leq\mu_{a_{t}}. Upper bounding by omitting the intersection in step (a)(a), and plugging this relation into the probability in step (b)(b), yields:

(3a)(a)[μBt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<μat+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)](b)[μBt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)<μAt+2hat+αmk𝒫tgk(t1)+γmk𝒫tpk(t1)]=[μBt+αmk𝒢tgk(t1)+γmk𝒢tpk(t1)=C𝒢(𝒢t,t)(μAt+αmk𝒫tgk(t1)+γmk𝒫tpk(t1))=C𝒢(𝒫t,t)<2hat]=[C𝒢(𝒢t,t)C𝒢(𝒫t,t)<2(m+1)log(t)Tat(t1)],\begin{split}&(3a)\stackrel{{\scriptstyle(a)}}{{\leq}}\\ &\mathbb{P}\biggl{[}\mu_{B_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\\ &\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\biggr{]}\\ &\stackrel{{\scriptstyle(b)}}{{\leq}}\mathbb{P}\biggl{[}\mu_{B_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\\ &\mu_{A_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\biggr{]}\\ &=\mathbb{P}\biggl{[}\overbrace{\mu_{B_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)}^{=C^{\mathcal{G}}(\mathcal{G}_{t},t)}-\\ &\overbrace{\bigl{(}\mu_{A_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\bigr{)}}^{=C^{\mathcal{G}}(\mathcal{P}_{t},t)}<2h_{a_{t}}\biggr{]}\\ &=\mathbb{P}\Bigl{[}C^{\mathcal{G}}(\mathcal{G}_{t},t)-C^{\mathcal{G}}(\mathcal{P}_{t},t)<2\sqrt{\frac{(m+1)\log(t)}{T_{a_{t}}(t-1)}}\Bigr{]},\end{split}

where the last two equalities follow from reorganizing the event and recalling the definitions of C𝒢(𝒮,t)C^{\mathcal{G}}(\mathcal{S},t) and hk(t)h_{k}(t), respectively. We now show that this event occurs with probability 0, so the latest bound implies that (3a)(3a) equals 0 as well. We examine the mentioned event while recalling that Tat(t)lT_{a_{t}}(t)\geq l by the relevant indices in the summation in (-A.5):

C𝒢(𝒢t,t)C𝒢(𝒫t,t)\displaystyle C^{\mathcal{G}}(\mathcal{G}_{t},t)-C^{\mathcal{G}}(\mathcal{P}_{t},t) <2(m+1)log(t)Tat(t1)\displaystyle<2\sqrt{\frac{(m+1)\log(t)}{T_{a_{t}}(t-1)}}
2(m+1)log(n)l.\displaystyle\leq 2\sqrt{\frac{(m+1)\log(n)}{l}}. (-A.12)

Next, we consider an enhanced version of the Genie that is rewarded by an additive term of δ\delta in every round in which 𝒢t𝒫t\mathcal{G}_{t}\neq\mathcal{P}_{t}. Since we consider solely rounds in which this occurs, the LHS is larger than δ\delta. Thus, to preclude this event, it suffices to set ll such that 2(m+1)log(n)lδ2\sqrt{\frac{(m+1)\log(n)}{l}}\leq\delta. Recalling δ>0\delta>0, we reorganize this condition into:

l4(m+1)log(n)δ2.l\geq\Bigg{\lceil}\frac{4(m+1)\log(n)}{\delta^{2}}\Bigg{\rceil}. (-A.13)

Moreover, this enhanced version adds another term of δt=1n𝔼[𝟏(r(𝒢t,t)r(𝒫t,t))]=δk=1K𝔼[Nk(n)]\delta\sum_{t=1}^{n}\mathbb{E}\big{[}\mathbf{1}\big{(}r(\mathcal{G}_{t},t)\neq r(\mathcal{P}_{t},t)\big{)}\big{]}=\delta\sum_{k=1}^{K}\mathbb{E}[N_{k}(n)] to the regret, as noted later in the closure of the proof.

Recall that we initially aimed to upper bound the probability of the event (-A.7) by splitting it into three events using the union bound (-A.8). We then bounded (1)(1) and (2)(2), and divided (3)(3) into two parts, (3a)(3a) and (3b)(3b). By setting an appropriate value of ll (-A.13), we showed that (3a)(3a) equals 0. The last step is to upper bound (3b)(3b), which is done similarly.

We start by recalling the definition of 3(b)3(b) (-A.11) and then bound it by a containing event:

\begin{split}
(3b)&\triangleq\mathbb{P}\biggl[\Bigl(\mu_{b_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)<\\
&\qquad\mu_{a_{t}}+2h_{a_{t}}+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\Bigr)\\
&\qquad\cap(b_{t}\neq B_{t})\biggr]\\
&\leq\mathbb{P}[b_{t}\neq B_{t}]=\mathbb{P}[\overline{\mu_{b_{t}}}(t)+h_{b_{t}}\leq\overline{\mu_{B_{t}}}(t)+h_{B_{t}}].
\end{split} \quad\text{(-A.14)}

The last equality follows from the definitions of $b_{t}$ (-A.6) and $B_{t}$ (-A.10), together with definition (13). We now prove a lemma regarding this event, whose probability upper bounds $3(b)$:

Lemma -A.2.

The following event implies that at least one of the next three events occurs:

$\overline{\mu_{b_{t}}}(t)+h_{b_{t}}\leq\overline{\mu_{B_{t}}}(t)+h_{B_{t}}$ \quad\text{(-A.15)}
  1. $\overline{\mu_{b_{t}}}(t)+h_{b_{t}}\leq\mu_{b_{t}}$

  2. $\overline{\mu_{B_{t}}}(t)\geq\mu_{B_{t}}+h_{B_{t}}$

  3. $\mu_{b_{t}}<\mu_{B_{t}}+2h_{B_{t}}$

Proof.

We prove by contradiction: assuming none of the three events holds yields $\overline{\mu_{b_{t}}}(t)+h_{b_{t}}\stackrel{(1)}{>}\mu_{b_{t}}\stackrel{(3)}{\geq}\mu_{B_{t}}+2h_{B_{t}}\stackrel{(2)}{>}\overline{\mu_{B_{t}}}(t)+h_{B_{t}}$, contradicting (-A.15), thus proving the lemma. ∎

Combining the lemma, the union bound, and the upper bound we found in (-A.14) yields:

\begin{split}
3(b)\leq\;&\mathbb{P}[\overline{\mu_{b_{t}}}(t)-\mu_{b_{t}}\leq-h_{b_{t}}]+\mathbb{P}[\overline{\mu_{B_{t}}}(t)-\mu_{B_{t}}\geq h_{B_{t}}]\\
&+\mathbb{P}[\mu_{b_{t}}<\mu_{B_{t}}+2h_{B_{t}}].
\end{split}

We already showed in (-A.9) that the first term is bounded by t2(m+1)t^{-2(m+1)}. Repeating the same steps for BtB_{t} instead of btb_{t} we can show that this value also bounds the second term. Furthermore, we now show that the event in the third term occurs with probability 0 when setting an appropriate value of ll. Observing the mentioned event:

$\mu_{b_{t}}-\mu_{B_{t}}<2\sqrt{\frac{(m+1)\log(t)}{T_{B_{t}}(t-1)}}.$ \quad\text{(-A.16)}

Similar to (-A.13), and recalling $b_{t}\neq B_{t}$, by demanding $l\geq\left\lceil\frac{4(m+1)\log(n)}{\delta^{2}}\right\rceil$ we ensure this event occurs with probability 0. As this is the same range as in (-A.13), we set $l$ to the smallest integer in this range, i.e., $l=\left\lceil\frac{4(m+1)\log(n)}{\delta^{2}}\right\rceil$.

Finally, as we showed, $(3)\leq 2t^{-2(m+1)}$. Plugging the bounds on $(1)$, $(2)$, and $(3)$ into (-A.8), we obtain:

\begin{split}
&\mathbb{P}\biggl[\min_{k\in\mathcal{P}_{t}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{P}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{P}_{t}}p_{k}(t-1)\\
&\geq\min_{k\in\mathcal{G}_{t}}{\rm ucb}(k,t-1)+\frac{\alpha}{m}\sum_{k\in\mathcal{G}_{t}}g_{k}(t-1)+\frac{\gamma}{m}\sum_{k\in\mathcal{G}_{t}}p_{k}(t-1)\biggr]\\
&\leq\overbrace{t^{-2(m+1)}}^{\geq(1)}+\overbrace{t^{-2(m+1)}}^{\geq(2)}+\overbrace{2t^{-2(m+1)}}^{\geq(3)}=4t^{-2(m+1)}.
\end{split}

Substituting this bound along with the chosen value of ll into the result we obtained at the beginning of the proof (-A.5) we obtain:

\begin{split}
\mathbb{E}[N_{k}(n)]\leq&\left\lceil\frac{4(m+1)\log(n)}{\delta^{2}}\right\rceil+\sum_{t=1}^{n}\;\sum_{l\leq T_{\tilde{u}_{t,1}},\ldots,T_{\tilde{u}_{t,m}},T_{u_{t,1}},\ldots,T_{u_{t,m}}\leq t-1}4t^{-2(m+1)}\\
\leq&\frac{4(m+1)\log(n)}{\delta^{2}}+1+\sum_{t=1}^{n}4t^{-2(m+1)}\cdot t^{2m}\\
\leq&\frac{4(m+1)\log(n)}{\delta^{2}}+1+4\overbrace{\sum_{t=1}^{\infty}t^{-2}}^{=\pi^{2}/6}.
\end{split}

To conclude the theorem's statement, we substitute this result back into (-A.2), recalling the added regret from the Genie enhancement, obtaining

\begin{split}
R^{\mathcal{P}}(n)&\leq(\Delta_{\max}+\delta)\sum_{k=1}^{K}\mathbb{E}[N_{k}(n)]\\
&\leq K(\Delta_{\max}+\delta)\biggl(\frac{4(m+1)\log(n)}{\delta^{2}}+1+\frac{4\pi^{2}}{3}\biggr),
\end{split}

concluding the proof of the theorem.

-B Proof of Theorem 4

To prove the theorem, we introduce essential terminology and definitions. We define reachability as follows: given two nodes $\mathcal{V}_{1}$ and $\mathcal{V}_{2}$ and an energy level $E$, node $\mathcal{V}_{1}$ is reachable from $\mathcal{V}_{2}$ at level $E$ if there exists a path connecting them that traverses only nodes with energy greater than or equal to $E$. Building upon this definition, a graph exhibits weak reversibility if, for any energy level $E$ and nodes $\mathcal{U}_{1}$ and $\mathcal{U}_{2}$, $\mathcal{U}_{1}$ is reachable from $\mathcal{U}_{2}$ at level $E$ if and only if $\mathcal{U}_{2}$ is reachable from $\mathcal{U}_{1}$ at level $E$.
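To make the definition concrete, the following sketch checks reachability at a given energy level via breadth-first search restricted to nodes of sufficient energy. The adjacency list and per-node energies below are hypothetical, not part of the paper's formulation; note that on an undirected graph the relation is automatically symmetric, which is exactly weak reversibility:

```python
from collections import deque

def reachable(adj, energy, src, dst, E):
    """BFS from src to dst traversing only nodes with energy >= E."""
    if energy[src] < E or energy[dst] < E:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            return True
        for u in adj[v]:
            if u not in seen and energy[u] >= E:
                seen.add(u)
                queue.append(u)
    return False

# Hypothetical undirected graph: 0 -- 1 -- 2, with a low-energy middle node.
adj = {0: [1], 1: [0, 2], 2: [1]}
energy = {0: 3.0, 1: 1.0, 2: 3.0}
ok_low = reachable(adj, energy, 0, 2, E=1.0)   # path through node 1 allowed
ok_high = reachable(adj, energy, 0, 2, E=2.0)  # node 1 is blocked
```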

Following [43], to prove that Theorem 4 holds, one must show that the following requirements are satisfied:

  R1. The graph satisfies weak reversibility [43].

  R2. The temperature sequence is of the form $\tau_{j}=\frac{C}{\log(j+1)}$, where $C$ is greater than the maximal energy difference between any two nodes.

  R3. The Markov chain introduced in Algorithm 2 is irreducible.

We prove that the three conditions are satisfied, thereby concluding the theorem. Requirements R1 and R2 follow from the formulation of SA-PAUSE. Specifically, weak reversibility (R1) stems directly from the definition and the undirected graph property, while the temperature sequence condition (R2) is satisfied since we set $C$ as specified in (19).
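A minimal sketch of the logarithmic cooling schedule required by R2; the node energies below are illustrative, and $C$ is set just above the maximal energy gap (in SA-PAUSE, $C$ is fixed as in (19)):

```python
import math

def temperature(j: int, C: float) -> float:
    """Hajek-style logarithmic cooling: tau_j = C / log(j + 1)."""
    return C / math.log(j + 1)

# Illustrative node energies; R2 requires C to exceed the maximal gap.
energies = [0.2, 1.5, 0.9, 2.4]
C = (max(energies) - min(energies)) + 0.1  # any C above the maximal gap

# Temperatures for rounds j = 1..5: strictly decreasing toward zero.
taus = [temperature(j, C) for j in range(1, 6)]
```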

To prove that R3 holds, we need, by definition, to show that there is a positive-probability path between any two nodes $\mathcal{V},\mathcal{U}\in\mathbb{V}$. Since the graph is undirected, it is sufficient to show a path from $\mathcal{V}$ to $\mathcal{U}$. In Algorithm 3, we present an implicit algorithm yielding a sequence of nodes $\mathcal{V}_{0},\mathcal{V}_{1},\ldots,\mathcal{U}$. Within this sequence, consecutive nodes are neighbors, i.e., the algorithm yields a positive-probability path from $\mathcal{V}_{0}$ to $\mathcal{U}$.

Input: set of users $\mathbb{K}$; an arbitrary node $\mathcal{V}_{0}$; target node $\mathcal{U}$
Init: $j=0$
while $\mathcal{U}\neq\mathcal{V}_{j}$ do
    if $\min_{k\in\mathcal{V}_{j}}\{{\rm ucb}(k)\}\leq\max_{k\in\mathcal{U}\setminus\mathcal{V}_{j}}\{{\rm ucb}(k)\}$ then
        $\mathcal{V}_{j+1}\triangleq\bigl(\mathcal{V}_{j}\setminus\operatorname*{arg\,min}_{k\in\mathcal{V}_{j}}\{{\rm ucb}(k)\}\bigr)\cup\operatorname*{arg\,max}_{k\in\mathcal{U}\setminus\mathcal{V}_{j}}\{{\rm ucb}(k)\}$
    else
        sample a random user $p$ from $\mathcal{V}_{j}\setminus\mathcal{U}$;
        $\mathcal{V}_{j+1}\triangleq(\mathcal{V}_{j}\setminus\{p\})\cup\operatorname*{arg\,max}_{k\in\mathcal{U}\setminus\mathcal{V}_{j}}\{{\rm ucb}(k)\}$
    $j=j+1$

Algorithm 3 Constructing a Path from $\mathcal{V}_{0}$ to $\mathcal{U}$
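The path construction above can be sketched in executable form as follows; the user identifiers and ${\rm ucb}$ values are hypothetical, and argmin/argmax ties are broken arbitrarily:

```python
import random

def build_path(V0, U, ucb):
    """Algorithm 3 sketch: morph the node (user subset) V0 into U one
    single-user swap at a time, so consecutive sets are graph neighbors."""
    V, path = set(V0), [set(V0)]
    while V != set(U):
        # Incoming user: argmax of ucb over U \ V_j (nonempty while V != U).
        incoming = max(set(U) - V, key=lambda k: ucb[k])
        if min(ucb[k] for k in V) <= ucb[incoming]:
            # First phase: swap out the current ucb minimizer of V_j.
            outgoing = min(V, key=lambda k: ucb[k])
        else:
            # Second phase: swap out a random user not belonging to U.
            outgoing = random.choice(sorted(V - set(U)))
        V = (V - {outgoing}) | {incoming}
        path.append(set(V))
    return path

# Hypothetical instance with m = 2 selected users out of K = 4.
ucb = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
path = build_path({1, 2}, {3, 4}, ucb)
```

Each step replaces exactly one user, so consecutive sets in `path` are neighbors in the graph over size-$m$ subsets.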

This algorithm possesses a crucial characteristic: the conditional statement evaluates to true until it transitions to false, and from that moment on it remains false until termination. Thus, the algorithm can be partitioned into two phases: the iterations before the statement becomes false, and the rest. We denote by $j^{0}$ the iteration at which the condition becomes false.

First, observe that when $j<j^{0}$, $\mathcal{V}_{j+1}$ is an active neighbor of $\mathcal{V}_{j}$, whereas during all subsequent iterations the former is a passive neighbor of the latter. This establishes that each transition occurs with positive probability in the first place.

Next, we prove the algorithm's correctness and termination. Let $b$ be the minimum ${\rm ucb}$ value in $\mathcal{V}_{j^{0}}$. For every $k\in\mathcal{U}$, if ${\rm ucb}(k)>b$, then $k$ is added to $\mathcal{V}_{j}$ in some iteration $j<j^{0}$. This is guaranteed because, had such incorporation not occurred by the $j^{0}$th iteration, the conditional statement would remain satisfied, contradicting the definition of $b$. The remaining users, i.e., every $k\in\mathcal{U}$ such that ${\rm ucb}(k)\leq b$, are added during the second phase.

Notice that the algorithm avoids cyclic additions and removals: during the second phase, users from $\mathcal{U}$ already present in $\mathcal{V}_{j}$ for $j\geq j^{0}$ are preserved when constructing $\mathcal{V}_{j+1}$; instead, a user not belonging to $\mathcal{U}$ is removed. We have thus established that the algorithm terminates and that every user $k\in\mathcal{U}$ is eventually incorporated into the evolving set without subsequent removal. This completes the verification of the algorithm's correctness, and the proof as a whole.

References

  • [1] O. Peleg, N. Lang, S. Rini, N. Shlezinger, and K. Cohen, “PAUSE: Privacy-aware active user selection for federated learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
  • [2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
  • [3] P. Kairouz et al., “Advances and open problems in federated learning,” Foundations and trends® in machine learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [4] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated learning: A signal processing perspective,” IEEE Signal Process. Mag., vol. 39, no. 3, pp. 14–41, 2022.
  • [5] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
  • [6] M. Chen, N. Shlezinger, H. V. Poor, Y. C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proceedings of the National Academy of Sciences, vol. 118, no. 17, 2021.
  • [7] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [8] P. Han, S. Wang, and K. K. Leung, “Adaptive gradient sparsification for efficient federated learning: An online learning approach,” in IEEE International Conference on Distributed Computing Systems (ICDCS), 2020, pp. 300–310.
  • [9] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2021–2031.
  • [10] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, “UVeQFed: Universal vector quantization for federated learning,” IEEE Trans. Signal Process., vol. 69, pp. 500–514, 2020.
  • [11] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [12] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” IEEE Trans. Signal Process., vol. 68, pp. 2897–2911, 2020.
  • [13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
  • [14] S. Mayhoub and T. M. Shami, “A review of client selection methods in federated learning,” Archives of Computational Methods in Engineering, vol. 31, no. 2, pp. 1129–1152, 2024.
  • [15] J. Li, T. Chen, and S. Teng, “A comprehensive survey on client selection strategies in federated learning,” Computer Networks, p. 110663, 2024.
  • [16] L. Fu, H. Zhang, G. Gao, M. Zhang, and X. Liu, “Client selection in federated learning: Principles, challenges, and opportunities,” IEEE Internet Things J., vol. 10, no. 24, pp. 21 811–21 819, 2023.
  • [17] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, 2020.
  • [18] S. AbdulRahman, H. Tout, A. Mourad, and C. Talhi, “FedMCCS: Multicriteria client selection model for optimal iot federated learning,” IEEE Internet Things J., vol. 8, no. 6, pp. 4723–4735, 2020.
  • [19] E. Rizk, S. Vlaski, and A. H. Sayed, “Federated learning under importance sampling,” IEEE Trans. Signal Process., vol. 70, pp. 5381–5396, 2022.
  • [20] Y. Fraboni, R. Vidal, L. Kameni, and M. Lorenzi, “Clustered sampling: Low-variance and improved representativity for clients selection in federated learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 3407–3416.
  • [21] W. Xia, T. Q. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu, “Multi-armed bandit-based client scheduling for federated learning,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7108–7123, 2020.
  • [22] B. Xu, W. Xia, J. Zhang, T. Q. Quek, and H. Zhu, “Online client scheduling for fast federated learning,” IEEE Wireless Commun. Lett., vol. 10, no. 7, pp. 1434–1438, 2021.
  • [23] D. Ben-Ami, K. Cohen, and Q. Zhao, “Client selection for generalization in accelerated federated learning: A multi-armed bandit approach,” IEEE Access, 2025.
  • [24] Y. Chen, W. Xu, X. Wu, M. Zhang, and B. Luo, “Personalized local differentially private federated learning with adaptive client sampling,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 6600–6604.
  • [25] L. Zhu and S. Han, “Deep leakage from gradients,” in Federated learning. Springer, 2020, pp. 17–31.
  • [26] B. Zhao, K. R. Mopuri, and H. Bilen, “iDLG: Improved deep leakage from gradients,” arXiv preprint arXiv:2001.02610, 2020.
  • [27] Y. Huang, S. Gupta, Z. Song, K. Li, and S. Arora, “Evaluating gradient inversion attacks and defenses in federated learning,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [28] H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov, “See through gradients: Image batch recovery via gradinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 337–16 346.
  • [29] M. Kim, O. Günlü, and R. F. Schaefer, “Federated learning with local differential privacy: Trade-offs between privacy, utility, and communication,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 2650–2654.
  • [30] K. Wei et al., “Federated learning with differential privacy: Algorithms and performance analysis,” IEEE Trans. Inf. Forensics Security, vol. 15, pp. 3454–3469, 2020.
  • [31] L. Lyu, “DP-SIGNSGD: When efficiency meets privacy and robustness,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3070–3074.
  • [32] A. Lowy and M. Razaviyayn, “Private federated learning without a trusted server: Optimal algorithms for convex losses,” in International Conference on Learning Representations, 2023.
  • [33] N. Lang, E. Sofer, T. Shaked, and N. Shlezinger, “Joint privacy enhancement and quantization in federated learning,” IEEE Trans. Signal Process., vol. 71, pp. 295–310, 2023.
  • [34] N. Lang, N. Shlezinger, R. G. D’Oliveira, and S. E. Rouayheb, “Compressed private aggregation for scalable and robust federated learning over massive networks,” arXiv preprint arXiv:2308.00540, 2023.
  • [35] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differential privacy,” in IEEE Annual Symposium on Foundations of Computer Science, 2010, pp. 51–60.
  • [36] J. Zhang, D. Fay, and M. Johansson, “Dynamic privacy allocation for locally differentially private federated learning with composite objectives,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 9461–9465.
  • [37] L. Sun, J. Qian, X. Chen, and P. S. Yu, “LDP-FL: Practical private aggregation in federated learning with local differential privacy,” in International Joint Conference on Artificial Intelligence, 2021.
  • [38] A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev, “Distributed differential privacy via shuffling,” in Advances in Cryptology–EUROCRYPT 2019: 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, May 19–23, 2019, Proceedings, Part I 38. Springer, 2019, pp. 375–403.
  • [39] B. Balle, J. Bell, A. Gascón, and K. Nissim, “The privacy blanket of the shuffle model,” in Advances in Cryptology–CRYPTO 2019: 39th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 18–22, 2019, Proceedings, Part II 39. Springer, 2019, pp. 638–667.
  • [40] Q. Zhao, Multi-armed bandits: Theory and applications to online learning in networks. Springer Nature, 2022.
  • [41] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, pp. 235–256, 2002.
  • [42] W. Chen, Y. Wang, and Y. Yuan, “Combinatorial multi-armed bandit: General framework and applications,” in International conference on machine learning. PMLR, 2013, pp. 151–159.
  • [43] B. Hajek, “Cooling schedules for optimal annealing,” Mathematics of operations research, vol. 13, no. 2, pp. 311–329, 1988.
  • [44] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-iid data,” in International Conference on Learning Representations, 2019.
  • [45] N. Lang, A. Cohen, and N. Shlezinger, “Stragglers-aware low-latency synchronous federated learning via layer-wise model updates,” IEEE Trans. on Commun., 2024, early access.
  • [46] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?” SIAM Journal on Computing, vol. 40, no. 3, pp. 793–826, 2011.
  • [47] Y. Wang, Y. Tong, and D. Shi, “Federated latent dirichlet allocation: A local differential privacy based framework,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6283–6290.
  • [48] T. Wang, X. Zhang, J. Feng, and X. Yang, “A comprehensive survey on local differential privacy toward data statistics and analysis,” Sensors, vol. 20, no. 24, p. 7030, 2020.
  • [49] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” Journal of Privacy and Confidentiality, vol. 7, no. 3, pp. 17–51, 2016.
  • [50] D. Henderson, S. H. Jacobson, and A. W. Johnson, “The theory and practice of simulated annealing,” Handbook of metaheuristics, pp. 287–319, 2003.
  • [51] S. Ledesma, G. Aviña, and R. Sanchez, “Practical considerations for simulated annealing implementation,” Simulated annealing, vol. 20, pp. 401–420, 2008.
  • [52] W. Ben-Ameur, “Computing the initial temperature of simulated annealing,” Computational optimization and applications, vol. 29, pp. 369–385, 2004.
  • [53] I. Bezáková, D. Štefankovič, V. V. Vazirani, and E. Vigoda, “Accelerating simulated annealing for the permanent and combinatorial counting problems,” SIAM Journal on Computing, vol. 37, no. 5, pp. 1429–1454, 2008.
  • [54] H. Wu, X. Guo, and X. Liu, “Adaptive exploration-exploitation tradeoff for opportunistic bandits,” in International Conference on Machine Learning. PMLR, 2018, pp. 5306–5314.
  • [55] M. M. Drugan, A. Nowé, and B. Manderick, “Pareto upper confidence bounds algorithms: an empirical study,” in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.
  • [56] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” The collected works of Wassily Hoeffding, pp. 409–426, 1994.