
Incentivized Communication for Federated Bandits

Zhepei Wei¹* (tqf5qb@virginia.edu), Chuanhao Li¹* (cl5ev@virginia.edu), Haifeng Xu² (haifengxu@uchicago.edu), Hongning Wang¹ (hw5x@virginia.edu)

¹Department of Computer Science, University of Virginia, USA
²Department of Computer Science, University of Chicago, USA
*Equal contribution
Abstract

Most existing works on federated bandits take it for granted that all clients are altruistic about sharing their data with the server for the collective good whenever needed. Despite their compelling theoretical guarantees on performance and communication efficiency, this assumption is overly idealistic and oftentimes violated in practice, especially when the algorithm is operated over self-interested clients, who are reluctant to share data without explicit benefits. Neglecting such self-interested behaviors can significantly affect the learning efficiency and even the practical operability of federated bandit learning. In light of this, we aim to spark new insights into this under-explored research area by formally introducing an incentivized communication problem for federated bandits, where the server shall motivate clients to share data by providing incentives. Without loss of generality, we instantiate this bandit problem with the contextual linear setting and propose the first incentivized communication protocol, namely Inc-FedUCB, which achieves near-optimal regret with provable communication and incentive cost guarantees. Extensive empirical experiments on both synthetic and real-world datasets further validate the effectiveness of the proposed method across various environments.

Keywords: Contextual bandit, Federated learning, Incentive mechanism

1 Introduction

Federated bandit learning has recently emerged as a promising new direction that promotes the application of bandit models while preserving privacy, by enabling collaboration among multiple distributed clients (Dubey and Pentland, 2020; Wang et al., 2020b; Li and Wang, 2022a; He et al., 2022; Li and Wang, 2022b; Li et al., 2022, 2023; Huang et al., 2021; Tao et al., 2019; Du et al., 2023). The main focus of this line of research is devising communication-efficient protocols that achieve near-optimal regret in various settings. Most notably, the direction of federated contextual bandits has been gaining momentum since the debut of several benchmark communication protocols for contextual linear bandits in P2P (Korda et al., 2016) and star-shaped (Wang et al., 2020b) networks. Many subsequent studies have explored diverse configurations of client and environment modeling factors and addressed new challenges arising in these contexts. Notable recent advances include extensions to asynchronous linear bandits (Li and Wang, 2022a; He et al., 2022), generalized linear bandits (Li and Wang, 2022b), and kernelized contextual bandits (Li et al., 2022, 2023).

Despite the extensive exploration of various settings, almost all existing federated bandit algorithms rely on the assumption that every client in the system is willing to share its local data/model with the server, regardless of the communication protocol design. For instance, synchronous protocols (Wang et al., 2020b; Li and Wang, 2022b) require all clients to simultaneously engage in data exchange with the server in every communication round. Similarly, asynchronous protocols (Li and Wang, 2022a; Li et al., 2023; He et al., 2022) assume clients participate in communication whenever an individualized upload or download event is triggered, albeit allowing interruptions by external factors (e.g., network failure).

In contrast, our work is motivated by the practical observation that many clients in a federated system are inherently self-interested and thus reluctant to share data without receiving explicit benefits from the server (Karimireddy et al., 2022). For instance, consider the following scenario: a recommendation platform (server) wants its mobile app users (clients) to opt in to its new recommendation service, which switches the previous on-device local bandit algorithm to a federated bandit algorithm. Although the new service is expected to improve the overall recommendation quality for all clients, some clients may not be willing to participate in this collaborative learning, as the expected gain might not compensate for their locally increased costs (e.g., communication bandwidth, added computation, and lost control of their data). In this case, additional actions have to be taken by the server to encourage participation, as it has no power to force clients. This exemplifies the most critical concern in the real-world application of federated learning (Karimireddy et al., 2022). A typical solution is an incentive mechanism, which motivates individuals to contribute to the social welfare goal by offering incentives such as monetary compensation.

Recent studies have explored incentivized data sharing in federated learning (Pei, 2020; Tu et al., 2022), but most of them focus on the supervised offline learning setting (Karimireddy et al., 2022). To the best of our knowledge, ours is the first work that studies incentive design for federated bandit learning, which inherently imposes new challenges. First, there is a lack of a well-defined metric to measure the utility of data sharing, which rationalizes a client's participation. Under the context of bandit learning, we measure data utility by the expected regret reduction from the exchanged data for each client. As a result, each client values data (e.g., sufficient statistics) from the server differently, depending on how such data aligns with its local data (e.g., the more similar, the less valuable). Second, the server is set to minimize regret across all clients through data exchange. But as the server does not generate data, it can easily be trapped in a situation where its collected data cannot pass the critical mass needed to ensure every participating client's regret is close to optimal (e.g., the data in the server's possession cannot motivate the clients who hold more valuable data to participate). To break the deadlock, we equip the server to provide monetary incentives. Consequently, the server needs to minimize its cumulative monetary payments, in addition to the regret and communication minimization objectives required in federated bandit learning. We propose a provably effective incentivized communication protocol, based on a heuristic search strategy that balances these distinct learning objectives. Our solution attains near-optimal regret O(d\sqrt{T}\log T) with provable communication and incentive cost guarantees. Extensive empirical simulations on both synthetic and real-world datasets further demonstrate the effectiveness of the proposed protocol in various federated bandit learning environments.

2 Related Work

Federated Bandit Learning

One important branch in this area is federated multi-armed bandits (MABs), which has been well studied in the literature (Liu and Zhao, 2010; Szorenyi et al., 2013; Landgren et al., 2016; Chakraborty et al., 2017; Landgren et al., 2018; Martínez-Rubio et al., 2019; Sankararaman et al., 2019; Wang et al., 2020a; Shi et al., 2020, 2021; Zhu et al., 2021). The other line of work focuses on the federated contextual bandit setting (Korda et al., 2016; Wang et al., 2020b), which has recently attracted increasing attention. Wang et al. (2020b) and Korda et al. (2016) are among the first to investigate this problem, proposing communication protocols for linear bandits (Abbasi-Yadkori et al., 2011; Li et al., 2010) in star-shaped and P2P networks, respectively. Many follow-up works on federated linear bandits (Dubey and Pentland, 2020; Huang et al., 2021; Li and Wang, 2022a; He et al., 2022) have emerged with different client and environment settings, such as fixed arm sets (Huang et al., 2021), differential privacy (Dubey and Pentland, 2020), and asynchronous communication (He et al., 2022; Li and Wang, 2022a). Li and Wang (2022b) extend federated linear bandits to generalized linear bandits (Filippi et al., 2010), and federated learning for kernelized contextual bandits has been investigated in both synchronous and asynchronous settings (Li et al., 2022, 2023).

In this work, we situate the incentivized federated bandit learning problem under linear bandits with time-varying arm sets, a popular setting in many recent works (Wang et al., 2020b; Dubey and Pentland, 2020; Li and Wang, 2022a; He et al., 2022). But we do not assume clients will always participate in data sharing: a client will choose not to share its data with the server if the resulting benefit of data sharing is not deemed to outweigh the cost. Here we need to differentiate our setting from those with asynchronous communication, e.g., Async-LinUCB (Li and Wang, 2022a). Such algorithms still assume all clients are willing to share, though the communication can sometimes be interrupted by external factors (e.g., network failure). We do not model communication failures and leave them as future work. Instead, we assume the clients need to be motivated to participate in federated learning, and our focus is to devise the minimum incentives that attain the desired regret and communication cost for all participating clients.

Incentivized Federated Learning

Data sharing is essential to the success of federated learning (Pei, 2020), where client participation plays a crucial role. However, participation involves costs, such as the need for additional computing and communication resources, and the risk of potential privacy breaches, which can lead to opt-outs (Cho et al., 2022; Hu and Gong, 2020). In light of this, recent research has focused on investigating incentive mechanisms that motivate clients to contribute, rather than assuming their willingness to participate. Most of the existing research involves multiple decentralized clients solving the same task, typically with different copies of IID datasets, where the focus is on designing data valuation methods that ensure fairness or achieve a specific accuracy objective (Sim et al., 2020; Xu et al., 2021; Donahue and Kleinberg, 2023). On the other hand, Donahue and Kleinberg (2021) study voluntary participation in model-sharing games, where clients may opt out due to biased global models caused by the aggregated non-IID datasets. More recently, Karimireddy et al. (2022) investigated incentive mechanism design for data maximization while avoiding free riders. For a detailed discussion of this topic, we refer readers to recent surveys on incentive mechanism design in federated learning (Zhan et al., 2021; Tu et al., 2022).

However, most works on incentivized federated learning only focus on better model estimation over fixed offline datasets, which does not apply to the bandit learning problem, where exploration over growing data is also part of the objective. More importantly, in our incentivized federated bandit problem, the server is obligated to improve the overall performance of the learning system, i.e., to minimize regret across all clients. This is essentially different from previous studies where the server only selectively incentivizes clients to achieve a certain accuracy (Sim et al., 2020) or investigates how much accuracy the system can achieve without payment (Karimireddy et al., 2022).

3 Preliminaries

In this section, we formally introduce the incentivized communication problem for federated bandits under the contextual linear bandit setting.

3.1 Federated Bandit Learning

We consider a learning system consisting of (1) N clients that directly interact with the environment by taking actions and receiving the corresponding rewards, and (2) a central server that coordinates the communication among the clients to facilitate their collective learning. The clients can only communicate with the central server, but not with each other, resulting in a star-shaped communication network. At each time step t\in[T], an arbitrary client i_{t}\in[N] becomes active and chooses an arm \mathbf{x}_{t} from a candidate set \mathcal{A}_{t}\subseteq\mathbb{R}^{d}, and then receives the corresponding reward feedback y_{t}=f(\mathbf{x}_{t})+\eta_{t}\in\mathbb{R}. Note that \mathcal{A}_{t} is time-varying, f denotes the unknown reward function shared by all clients, and \eta_{t} denotes zero-mean sub-Gaussian noise with known variance \sigma^{2}.

The performance of the learning system is measured by the cumulative (pseudo) regret over all N clients in the finite time horizon T, i.e., R_{T}=\sum_{t=1}^{T}r_{t}, where r_{t}=\max_{\mathbf{x}\in\mathcal{A}_{t}}\mathbf{E}[y|\mathbf{x}]-\mathbf{E}[y_{t}|\mathbf{x}_{t}] is the regret incurred by client i_{t} at time step t. Moreover, under the federated learning setting, the system also needs to keep the communication cost C_{T} low, which is measured by the total number of scalars (Wang et al., 2020b) transferred across the system up to time T.

With the linear reward assumption, i.e., f(\mathbf{x})=\mathbf{x}^{\top}\theta_{\star}, where \theta_{\star} denotes the unknown parameter, a ridge regression estimator \hat{\theta}_{t}=V_{t}^{-1}b_{t} can be constructed from the sufficient statistics of all N clients at each time step t, where V_{t}=\sum_{s=1}^{t}\mathbf{x}_{s}\mathbf{x}_{s}^{\top} and b_{t}=\sum_{s=1}^{t}\mathbf{x}_{s}y_{s} (Lattimore and Szepesvári, 2020). Using \hat{\theta}_{t} under the Optimism in the Face of Uncertainty (OFUL) principle (Abbasi-Yadkori et al., 2011), one can obtain the optimal regret R_{T}=O(d\sqrt{T}). To achieve this regret bound in the federated setting, a naive method is to immediately share the statistics of each newly collected data sample with all other clients in the system, which essentially recovers the centralized counterpart. However, this solution incurs a prohibitive communication cost C_{T}=O(d^{2}NT). On the other extreme, if no communication occurs throughout the entire time horizon (i.e., C_{T}=0), the regret upper bound can be as large as R_{T}=O(d\sqrt{NT}) when each client interacts with the environment at the same frequency, indicating the importance of timely data/model aggregation in reducing R_{T}.
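To make the estimator concrete, below is a minimal sketch (our illustration, not the authors' released code) of the centralized sufficient-statistics update and the resulting ridge regression estimate; we include a \lambda I term for invertibility, matching the regularized estimator used in Section 4, and all variable names and constants are our assumptions.

```python
import numpy as np

d, lam = 25, 1.0            # dimension and ridge regularizer (assumed values)
V = lam * np.eye(d)         # V_t(lambda) = sum_s x_s x_s^T + lambda * I
b = np.zeros(d)             # b_t = sum_s x_s y_s

rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)          # unknown parameter (simulated)
for _ in range(100):                     # stream of (x, y) observations
    x = rng.normal(size=d)
    y = x @ theta_star + 0.1 * rng.normal()
    V += np.outer(x, x)                  # rank-one update of V_t
    b += x * y
theta_hat = np.linalg.solve(V, b)        # ridge estimate V_t^{-1} b_t
```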

To balance this trade-off between regret and communication cost, prior research efforts centered around designing communication-efficient protocols for federated bandits that feature the "delayed update" of sufficient statistics (Wang et al., 2020b; Li and Wang, 2022a; He et al., 2022). Specifically, each client i only has a delayed copy of V_{t} and b_{t}, denoted as V_{i,t}=V_{t_{\text{last}}}+\Delta V_{i,t}, b_{i,t}=b_{t_{\text{last}}}+\Delta b_{i,t}, where V_{t_{\text{last}}}, b_{t_{\text{last}}} are the aggregated sufficient statistics shared by the server in the last communication round, and \Delta V_{i,t}, \Delta b_{i,t} are the local updates that client i has accumulated from its interactions with the environment since t_{\text{last}}. In essence, the success of these algorithms lies in the fact that V_{t}, b_{t} typically change slowly and thus have little instant impact on the regret at most time steps. Therefore, existing protocols that only require occasional communication can still achieve nearly optimal regret, despite their limiting assumption of the clients' willingness to participate, as discussed before.
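For illustration, a toy sketch (ours, under the notation above) of the delayed-update bookkeeping: each client folds new observations into both its working statistics and its pending local updates, and a synchronization folds in the server's aggregate while clearing the pending buffers.

```python
import numpy as np

class Client:
    """Delayed-update bookkeeping: V_{i,t} = V_{t_last} + Delta V_{i,t}."""
    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)     # working copy used for arm selection
        self.b = np.zeros(d)
        self.dV = np.zeros((d, d))   # local updates accumulated since t_last
        self.db = np.zeros(d)

    def observe(self, x, y):
        self.V += np.outer(x, x); self.b += x * y
        self.dV += np.outer(x, x); self.db += x * y

    def sync(self, dV_others, db_others):
        # fold in the aggregated updates the server relays from other clients;
        # the pending buffers are cleared because they were just uploaded
        self.V += dV_others; self.b += db_others
        self.dV[:] = 0; self.db[:] = 0
```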

3.2 Incentivized Federated Bandits

Different from prior works in this line of research, where all clients altruistically share their data with the server whenever a communication round is triggered, we are interested in a more realistic setting where clients are self-interested and thus reluctant to share data with the server if not well motivated. Formally, each client in the federated system inherently experiences a cost of data sharing, denoted by \widetilde{D}^{p}_{i}\in\mathbb{R}, due to its consumption of computing resources in local updates or concerns about potential privacy breaches caused by communication with the server. (If the costs are trivially set to zero, clients have no reason to opt out of data sharing, and our problem reduces to the standard federated bandits problem (Wang et al., 2020b).) Moreover, as a client has nothing to lose when there is no local update to share in a communication round at time step t, we assume the cost is 0 in this case, i.e., D^{p}_{i}=\widetilde{D}^{p}_{i}\cdot\mathbb{I}(\Delta V_{i,t}\neq\mathbf{0}). As a result, the server needs to motivate clients to participate in data sharing via an incentive mechanism \mathcal{M}:\mathbb{R}^{N}\times\mathbb{R}^{d\times d}\rightarrow\mathbb{R}^{N}, which takes as input a collection of client local updates \Delta V_{i,t}\in\mathbb{R}^{d\times d} and a vector of cost values D^{p}=\{D^{p}_{1},\cdots,D^{p}_{N}\}\in\mathbb{R}^{N}, and outputs the incentives \mathcal{I}=\{\mathcal{I}_{1,t},\cdots,\mathcal{I}_{N,t}\}\in\mathbb{R}^{N} to be distributed among the clients. To make gains and losses of utility measurable in terms of real-valued incentives (e.g., monetary payment), we adopt the quasi-linear utility function assumption, as is standard in economic analysis (Allais, 1953; Pemberton and Rau, 2007).
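In code, the mechanism's interface and the quasi-linear utility can be sketched as follows (type names and signatures are our assumptions, chosen to mirror the definition above):

```python
from typing import Callable, Dict
import numpy as np

# M maps the disclosed local updates {i: Delta V_{i,t}} and the reported
# cost vector D^p to a vector of incentives, one per client.
Mechanism = Callable[[Dict[int, np.ndarray], np.ndarray], np.ndarray]

def utility(incentive: float, cost: float, has_update: bool) -> float:
    """Quasi-linear utility of participating: the incentive minus the data
    sharing cost, where the cost is zero without a local update."""
    return incentive - (cost if has_update else 0.0)
```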

At each communication round, a client decides whether to share its local update with the server based on the potential utility gained from participation, i.e., the difference between the incentive and the cost of data sharing. This requires the incentive mechanism to be individually rational:

Definition 1 (Individual Rationality (Myerson and Satterthwaite, 1983))

An incentive mechanism \mathcal{M}:\mathbb{R}^{N}\times\mathbb{R}^{d\times d}\rightarrow\mathbb{R}^{N} is individually rational if for any client i in the participant set S_{t} at time step t, we have

\mathcal{I}_{i,t}\geq D^{p}_{i}    (1)

In other words, each participant must be guaranteed non-negative utility by participating in data sharing under \mathcal{M}.

The server coordinates with all clients and incentivizes them to participate in the communication to realize its own objective (e.g., collective regret minimization). This requires \mathcal{M} to be sufficient:

Definition 2 (Sufficiency)

An incentive mechanism \mathcal{M}:\mathbb{R}^{N}\times\mathbb{R}^{d\times d}\rightarrow\mathbb{R}^{N} is sufficient if the resulting outcome satisfies the server's objective.

Under different application scenarios, the server may have different objectives, such as regret minimization or best arm identification. In this work, we set the server's objective to minimizing the regret across all clients; ideally, the server aims to attain, via incentivized communication, the optimal \widetilde{O}(d\sqrt{T}) regret of the centralized setting. Therefore, we consider an incentive mechanism sufficient if it ensures that the resulting accumulated regret is bounded by \widetilde{O}(d\sqrt{T}).

4 Methodology

The communication backbone of our solution derives from DisLinUCB (Wang et al., 2020b), a widely adopted paradigm for federated linear bandits. We adopt its strategy for arm selection and communication triggering, so as to focus on the incentive mechanism design. We name the resulting algorithm Inc-FedUCB and present it in Algorithm 1. Note that the two incentive mechanisms presented in Sections 4.2 and 4.3 are not specific to any federated bandit learning algorithm; each can be easily plugged into alternative algorithms to accommodate the incentivized federated learning setting. For clarity, a summary of technical notations can be found in Table 7.

4.1 A General Framework: Inc-FedUCB Algorithm

Our framework comprises three main steps: 1) client’s local update; 2) communication trigger; and 3) incentivized data exchange among the server and clients. Specifically, after initialization, an active client performs a local update in each time step and checks the communication trigger. If a communication round is triggered, the system performs incentivized data exchange between clients and the server. Otherwise, no communication is needed.

1: Input: D_{c}\geq 0, D^{p}=\{D^{p}_{1},\cdots,D^{p}_{N}\}, \sigma, \lambda>0, \delta\in(0,1)
2: Initialize: [Server] V_{g,0}=\mathbf{0}_{d\times d}\in\mathbb{R}^{d\times d}, b_{g,0}=\mathbf{0}_{d}\in\mathbb{R}^{d}, \Delta V_{-j,0}=\mathbf{0}_{d\times d}, \Delta b_{-j,0}=\mathbf{0}_{d}, \forall j\in[N]
3:              [All clients] V_{i,0}=\mathbf{0}_{d\times d}, b_{i,0}=\mathbf{0}_{d}, \Delta V_{i,0}=\mathbf{0}_{d\times d}, \Delta b_{i,0}=\mathbf{0}_{d}, \Delta t_{i,0}=0, \forall i\in[N]
4: for t=1,2,\dots,T do
5:     [Client i_t] Observe arm set \mathcal{A}_{t}
6:     [Client i_t] Select arm \mathbf{x}_{t}\in\mathcal{A}_{t} by Eq. (2) and observe reward y_{t}
7:     [Client i_t] Update: V_{i_t,t} += \mathbf{x}_{t}\mathbf{x}_{t}^{\top}, b_{i_t,t} += \mathbf{x}_{t}y_{t}
8:                          \Delta V_{i_t,t} += \mathbf{x}_{t}\mathbf{x}_{t}^{\top}, \Delta b_{i_t,t} += \mathbf{x}_{t}y_{t}, \Delta t_{i_t,t} += 1
9:     if \Delta t_{i_t,t}\log\frac{\det(V_{i_t,t}+\lambda I)}{\det(V_{i_t,t}-\Delta V_{i_t,t}+\lambda I)}>D_{c} then
10:        [All clients \rightarrow Server] Upload \Delta V_{i,t}, forming \tilde{S}_{t}=\{\Delta V_{i,t}|i\in[N]\}
11:        [Server] Select incentivized participants S_{t}=\mathcal{M}(\tilde{S}_{t})  \triangleright Incentive Mechanism
12:        for i:\Delta V_{i,t}\in S_{t} do
13:            [Participant i \rightarrow Server] Upload \Delta b_{i,t}
14:            [Server] Update: V_{g,t} += \Delta V_{i,t}, b_{g,t} += \Delta b_{i,t}
15:                             \Delta V_{-j,t} += \Delta V_{i,t}, \Delta b_{-j,t} += \Delta b_{i,t}, \forall j\neq i
16:            [Participant i] Update: \Delta V_{i,t}=0, \Delta b_{i,t}=0, \Delta t_{i,t}=0
17:        for all i\in[N] do
18:            [Server \rightarrow All Clients] Download \Delta V_{-i,t}, \Delta b_{-i,t}
19:            [Client i] Update: V_{i,t} += \Delta V_{-i,t}, b_{i,t} += \Delta b_{-i,t}
20:            [Server] Update: \Delta V_{-i,t}=0, \Delta b_{-i,t}=0
Algorithm 1 Inc-FedUCB Algorithm

Formally, at each time step t=1,\dots,T, an arbitrary client i_{t} becomes active and interacts with its environment using the observed arm set \mathcal{A}_{t} (Line 5). Specifically, it selects the arm \mathbf{x}_{t}\in\mathcal{A}_{t} that maximizes the UCB score as follows (Line 6):

\mathbf{x}_{t}=\operatorname*{arg\,max}_{\mathbf{x}\in\mathcal{A}_{t}}\left\{\mathbf{x}^{\top}\hat{\theta}_{i_{t},t-1}(\lambda)+\alpha_{i_{t},t-1}\|\mathbf{x}\|_{V^{-1}_{i_{t},t-1}(\lambda)}\right\}    (2)

where \hat{\theta}_{i_{t},t-1}(\lambda)=V^{-1}_{i_{t},t-1}(\lambda)b_{i_{t},t-1} is the ridge regression estimator of \theta_{\star} with regularization parameter \lambda>0, V_{i_{t},t-1}(\lambda)=V_{i_{t},t-1}+\lambda I, and \alpha_{i_{t},t-1}=\sigma\sqrt{\log\frac{\det(V_{i_{t},t-1}(\lambda))}{\det(\lambda I)}+2\log(1/\delta)}+\sqrt{\lambda}. Here V_{i_{t},t}(\lambda) denotes the covariance matrix constructed from the data available to client i_{t} up to time t. After obtaining a new data point (\mathbf{x}_{t},y_{t}) from the environment, client i_{t} checks the communication event trigger \Delta t_{i_{t},t}\cdot\log\frac{\det(V_{i_{t},t}(\lambda))}{\det(V_{i_{t},t_{\text{last}}}(\lambda))}>D_{c} (Line 9), where \Delta t_{i_{t},t} denotes the time elapsed since the last time t_{\text{last}} it communicated with the server and D_{c}\geq 0 denotes the specified threshold.
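These two client-side steps can be sketched as follows, assuming the bookkeeping above (arm selection per Eq. (2) and the determinant-based trigger of Line 9; helper names are illustrative and \alpha is passed in precomputed):

```python
import numpy as np

def select_arm(arms, V, b, alpha, lam=1.0):
    """UCB arm selection per Eq. (2); `arms` is an array of shape (K, d)."""
    Vl = V + lam * np.eye(V.shape[0])
    theta = np.linalg.solve(Vl, b)
    Vinv = np.linalg.inv(Vl)
    # x^T theta + alpha * sqrt(x^T Vinv x), vectorized over the arm pool
    scores = arms @ theta + alpha * np.sqrt(np.sum((arms @ Vinv) * arms, axis=1))
    return arms[int(np.argmax(scores))]

def should_communicate(V, dV, dt, D_c, lam=1.0):
    """Line 9: dt * log[det(V + lam I) / det(V - dV + lam I)] > D_c."""
    d = V.shape[0]
    _, logdet_new = np.linalg.slogdet(V + lam * np.eye(d))
    _, logdet_old = np.linalg.slogdet(V - dV + lam * np.eye(d))
    return dt * (logdet_new - logdet_old) > D_c
```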

Incentivized Data Exchange

With the above event trigger, communication rounds only occur if (1) a substantial amount of new data has been accumulated locally at client i_{t}, and/or (2) significant time has elapsed since the last communication. However, in our incentivized setting, triggering a communication round does not necessarily lead to data exchange at time step t, as the participant set S_{t} may be empty (Line 11). This characterizes the fundamental difference between Inc-FedUCB and DisLinUCB (Wang et al., 2020b): we no longer assume all N clients will share their data with the server in an altruistic manner; instead, a rational client only shares its local update with the server if the condition in Eq. (1) is met. In light of this, to evaluate the potential benefit of data sharing, all clients must first reveal the value of their data to the server before the server determines the incentives. Hence, after a communication round is triggered, all clients upload their latest sufficient statistics updates \Delta V_{i,t} to the server (Line 10) to facilitate data valuation and participant selection in the incentive mechanism (Line 11). Note that this disclosure does not compromise clients' privacy, as the clients' secret lies in \Delta b_{i,t}, which is constructed from the rewards. Only participating clients upload their \Delta b_{i,t} to the server (Line 13). After collecting data from all participants, the server downloads the aggregated updates \Delta V_{-i,t} and \Delta b_{-i,t} to every client i (Lines 17-20). Following the convention in federated bandit learning (Wang et al., 2020b), the communication cost is defined as the total number of scalars transferred during this data exchange process.
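The server-side exchange (Lines 12-20) can be sketched as follows, with `pending[j]` holding the aggregated updates \Delta V_{-j,t}, \Delta b_{-j,t} that client j has not yet downloaded (a simplified illustration under our naming, not the authors' code):

```python
import numpy as np

def exchange_round(Vg, bg, shared, pending):
    """shared: {i: (dV_i, db_i)} from participants; pending: {j: [dV, db]}."""
    for i, (dV, db) in shared.items():
        Vg = Vg + dV; bg = bg + db           # server-side aggregation
        for j in pending:                    # buffer the update for others
            if j != i:
                pending[j][0] += dV
                pending[j][1] += db
    # each client downloads its buffer, which the server then clears
    downloads = {j: (buf[0].copy(), buf[1].copy()) for j, buf in pending.items()}
    for buf in pending.values():
        buf[0][:] = 0; buf[1][:] = 0
    return Vg, bg, downloads
```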

4.2 Payment-free Incentive Mechanism

1: Input: D^{p}=\{D^{p}_{i}|i\in[N]\}, \tilde{S}_{t}=\{\Delta V_{i,t}|i\in[N]\}
2: Initialize participant set S_{t}=\tilde{S}_{t}
3: while S_{t}\neq\emptyset do  \triangleright iteratively update S_{t} until it becomes stable
4:     StableFlag = True
5:     for i:\Delta V_{i,t}\in S_{t} do
6:         if \mathcal{I}_{i,t}<D^{p}_{i} then  \triangleright Eq. (4)
7:             Update participant set S_{t}=S_{t}\setminus\{\Delta V_{i,t}\}  \triangleright remove client i from S_{t}
8:             StableFlag = False
9:             break
10:    if StableFlag = True then
11:        break
12: return S_{t}\subseteq\tilde{S}_{t}
Algorithm 2 Payment-free Incentive Mechanism

As mentioned in Section 1, in federated bandit learning, clients can reduce their regret by using models constructed from shared data. Denote \widetilde{V}_{t} as the covariance matrix constructed from all available data in the system at time step t. Based on Lemmas 5 and 7, the instantaneous regret of client i_{t} is upper bounded by:

r_{t}\leq 2\alpha_{i_{t},t-1}\sqrt{\mathbf{x}_{t}^{\top}\widetilde{V}_{t-1}^{-1}\mathbf{x}_{t}}\cdot\sqrt{\frac{\det(\widetilde{V}_{t-1})}{\det(V_{i_{t},t-1})}}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\cdot\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}\cdot\sqrt{\frac{\det(\widetilde{V}_{t-1})}{\det(V_{i_{t},t-1})}}    (3)

where the determinant ratio reflects the additional regret due to the delayed synchronization between client i_{t}'s local sufficient statistics and the global optimal oracle. Therefore, minimizing this ratio directly corresponds to reducing client i_{t}'s regret. For example, full communication keeps the ratio at 1, which recovers the regret of the centralized setting discussed in Section 3.1.

Given the clients' desire for regret minimization, the data itself can thus be used by the server as a form of incentive. The star-shaped communication network also gives the server an information advantage over any single client in the system: a client can only communicate with the server, while the server can communicate with every client. The server should utilize this advantage to create incentives (i.e., the LHS of Eq. (1)), and a natural design to evaluate this data incentive is:

\mathcal{I}_{i,t}:=\mathcal{I}^{d}_{i,t}=\frac{\det\left(D_{i,t}(S_{t})+V_{i,t}\right)}{\det(V_{i,t})}-1    (4)

where D_{i,t}(S_{t})=\sum_{j:\{\Delta V_{j,t}\in S_{t}\}\wedge\{j\neq i\}}\Delta V_{j,t}+\Delta V_{-i,t} denotes the data that the server can offer to client i during the communication at time t (i.e., the current local updates from other participants that have not yet been shared with client i), and \Delta V_{-i,t} is the historically aggregated update stored on the server that has not yet been shared with client i. Eq. (4) indicates that a client desires a substantial increase in the determinant of its local data, which results in regret reduction.
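Numerically, the data incentive of Eq. (4) is best computed with log-determinants; a small sketch (assuming V_{i,t} is positive definite, e.g., it includes the \lambda I regularizer):

```python
import numpy as np

def data_incentive(D_offer, V_local):
    """Eq. (4): det(D_{i,t}(S_t) + V_{i,t}) / det(V_{i,t}) - 1,
    computed in log space for numerical stability."""
    _, logdet_new = np.linalg.slogdet(D_offer + V_local)
    _, logdet_old = np.linalg.slogdet(V_local)
    return np.exp(logdet_new - logdet_old) - 1.0
```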

With the above data valuation in Eq. (4), we propose a payment-free incentive mechanism that motivates clients to share data by redistributing the data collected from participating clients. We present this mechanism in Algorithm 2 and briefly sketch it below. First, we initialize the participant set S_{t}=\tilde{S}_{t}, assuming all clients agree to participate. Then, we iteratively update S_{t} by checking the willingness of each client i in S_{t} according to Eq. (1). If S_{t} is empty or all clients in it are willing to participate, the process terminates; otherwise, we remove client i from S_{t} and repeat.
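A direct sketch of Algorithm 2 in this notation (the `data_incentive` helper from above is restated for self-containment; the dictionary layouts are our assumptions):

```python
import numpy as np

def data_incentive(D_offer, V_local):
    # Eq. (4) via log-determinants; V_local assumed positive definite
    _, ln = np.linalg.slogdet(D_offer + V_local)
    _, lo = np.linalg.slogdet(V_local)
    return np.exp(ln - lo) - 1.0

def payment_free(updates, costs, dV_hist, V_local):
    """updates: {i: Delta V_{i,t}}; costs: {i: D^p_i};
    dV_hist: {i: Delta V_{-i,t}}; V_local: {i: V_{i,t}}."""
    S = set(updates)
    while S:
        stable = True
        for i in list(S):
            # data the server can offer client i if S participates, Eq. (4)
            D = sum((updates[j] for j in S if j != i),
                    start=np.zeros_like(updates[i])) + dV_hist[i]
            if data_incentive(D, V_local[i]) < costs[i]:
                S.remove(i)      # client i is not willing to participate
                stable = False
                break
        if stable:
            break
    return S
```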

While this payment-free incentive mechanism is neat and intuitive, it has no guarantee on the amount of data that can be collected. To see this, we provide a theoretical negative result with rigorous regret analysis in Theorem 3 (see proof in Appendix C).

Theorem 3 (Sub-optimal Regret)

When there are at most \frac{c}{2C}\frac{N}{\log(T/N)} clients (for some constants C,c>0) whose cost value satisfies D_{i}^{p}\leq\min\{(1+\frac{L^{2}}{\lambda})^{T},(1+\frac{TL^{2}}{\lambda d})^{d}\}, there exists a linear bandit instance with \sigma=L=S=1 such that for T\geq Nd, the expected regret of the Inc-FedUCB algorithm with the payment-free incentive mechanism is at least \Omega(d\sqrt{NT}).

Recall from the discussion in Section 3.1 that when there is no communication, R_{T} is upper bounded by O(d\sqrt{NT}). Hence, in the worst-case scenario, the payment-free incentive mechanism might not motivate any client to participate; it is thus not a sufficient mechanism.

4.3 Payment-efficient Incentive Mechanism

To address the insufficiency issue, we further devise a payment-efficient incentive mechanism that introduces additional monetary incentives to motivate clients’ participation:

\mathcal{I}_{i,t}:=\mathcal{I}^{d}_{i,t}+\mathcal{I}^{m}_{i,t}    (5)

where \mathcal{I}^{d}_{i,t} is the data incentive defined in Eq. (4), and \mathcal{I}^{m}_{i,t} is the real-valued monetary incentive, i.e., the payment assigned to a client for its participation. Specifically, we are interested in the following question: rather than trivially paying unlimited amounts to ensure everyone's participation, can we devise an incentive mechanism that guarantees a level of client participation under which the overall regret is still nearly optimal, at an acceptable monetary incentive cost?

Inspired by the determinant ratio principle discussed in Eq. (3), we propose to control the overall regret by ensuring that every client closely approximates the oracle after each communication round, which can be formalized as \det(V_{g,t})/\det(\widetilde{V}_{t})\geq\beta, where V_{g,t}=V_{g,t-1}+\Sigma(S_{t}) is to be shared with all clients and \Sigma(S_{t})=\sum_{j:\Delta V_{j,t}\in S_{t}}\Delta V_{j,t}. The parameter \beta\in[0,1] characterizes the gap between the practical and optimal regret that the server commits to. Denote the set of clients motivated by \mathcal{I}^{d}_{i,t} at time t as S^{d}_{t} and those motivated by \mathcal{I}^{m}_{i,t} as S^{m}_{t}, so that S_{t}=S^{m}_{t}\cup S^{d}_{t}. At each communication round, the server needs to find the minimum \mathcal{I}^{m}_{i,t} such that pooling the local updates from S_{t} satisfies the required regret reduction for the entire system.
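The server's \beta-gap condition can be checked in log space; a sketch under the notation above (the small ridge term is our addition to keep the determinants finite early on):

```python
import numpy as np

def meets_beta(Vg_prev, updates, S, beta, eps=1e-8):
    """Check det(V_{g,t}) / det(V~_t) >= beta, where V_{g,t} pools the
    updates of participants S and V~_t pools updates from all clients."""
    d = Vg_prev.shape[0]
    ridge = eps * np.eye(d)
    _, ls = np.linalg.slogdet(Vg_prev + sum(updates[j] for j in S) + ridge)
    _, la = np.linalg.slogdet(Vg_prev + sum(updates.values()) + ridge)
    return np.exp(ls - la) >= beta
```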

1: Input: \widetilde{S}_{t}=\{\Delta V_{i,t}|i\in[N]\}, data-incentivized participant set \widehat{S}_{t}\subseteq\widetilde{S}_{t}, threshold \beta
2: for client i:\Delta V_{i,t}\in\widetilde{S}_{t}\setminus\widehat{S}_{t} do:
3:     Compute the client's potential contribution to the server (i.e., marginal gain in determinant):
c_{i,t}(\widehat{S}_{t})=\det(\Delta V_{i,t}+V_{g,t}(\widehat{S}_{t}))/\det(V_{g,t}(\widehat{S}_{t})),\;\;V_{g,t}(S_{t})=V_{g,t-1}+\Sigma(S_{t})    (6)
4: Rank clients \{i_{1},\dots,i_{m}\} by their potential contribution, where m=|\widetilde{S}_{t}\setminus\widehat{S}_{t}|
5: Segment the list by finding \alpha=\min\{j\,|\,\frac{\det(V_{g,t}(\widehat{S}_{t})+\Delta V_{i_{j},t})}{\det(V_{g,t}(\widetilde{S}_{t}))}\geq\beta,\;j\in[m]\}
6: k=\alpha-1, \mathcal{I}^{m}_{\text{last}}=D^{p}_{i_{\alpha}}-\mathcal{I}^{d}_{i_{\alpha},t}
7: return participant set S_{t}= Heuristic Search(k,\mathcal{I}^{m}_{\text{last}})  \triangleright Algorithm 4
Algorithm 3 Payment-efficient Incentive Mechanism

Algorithm 2 maximizes \mathcal{I}^{d}_{i,t}, and thus the server should compute \mathcal{I}^{m}_{i,t} on top of the optimal \mathcal{I}^{d}_{i,t} and the resulting S^{d}_{t}, which, however, is still combinatorially hard. First, a brute-force search can incur a time complexity up to O(2^{N}). Second, different from typical optimal subset selection problems (Kohavi and John, 1997), the dynamic interplay among clients in our specific context brings a unique challenge: once a client is incentivized to share data, other uninvolved clients may change their willingness due to the increased data incentive, making the problem even more intricate.

To solve this problem, we propose a heuristic ranking-based method, outlined in Algorithm 3. The heuristic is to rank clients by the marginal gain they bring to the server's determinant, as formally defined in Eq. (6), which helps minimize the number of clients requiring monetary incentives while encouraging the participation of other clients motivated by the aggregated data. This forms an iterative search process. First, we rank all m non-participating clients (Lines 2-3) by their potential contribution to the server (with the participant set S_{t} committed). Then, we segment the list by \beta: anyone whose participation satisfies the overall \beta-gap constraint is an immediately valid choice (Line 4). The first client i_{\alpha} in the valid list and its payment \mathcal{I}^{m}_{\text{last}} (\infty if unavailable) serve as our last resort (Line 5). Lastly, we check whether potentially more favorable solutions exist in the invalid list (Line 6). Specifically, we try to elicit up to k=\alpha-1 clients (k=m if i_{\alpha} is unavailable) from the invalid list in n\leq k rounds, where one client is chosen by the same heuristic in each round. If having n clients from the invalid list also satisfies the \beta constraint and results in a lower monetary incentive cost than \mathcal{I}^{m}_{\text{last}}, we opt for this alternative solution; otherwise, we adhere to the last resort.

The Heuristic Search is detailed in Appendix A; it incurs a time complexity of only O(N) even in the worst-case scenario, i.e., n=m=N. Theorem 4 guarantees the sufficiency of this mechanism w.r.t. communication and payment bounds.

Theorem 4

Under threshold \beta and clients' committed data sharing costs D^{p}=\{D^{p}_{1},\cdots,D^{p}_{N}\}, with high probability the monetary incentive cost of Inc-FedUCB satisfies

M_{T}=O\left(\max D^{p}\cdot P\cdot N-\sum_{i=1}^{N}P_{i}\cdot\left(\frac{\det(\lambda I)}{\det(V_{T})}\right)^{\frac{1}{P_{i}}}\right)

where P_{i} is the number of epochs in which client i gets paid throughout the time horizon T, and P is the total number of epochs, which is bounded by P=O(Nd\log T) when setting the communication threshold D_{c}=\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\log\beta, where R=\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil.

Consequently, the communication cost satisfies

C_{T}=O(Nd^{2})\cdot P=O(N^{2}d^{3}\log T)

Furthermore, by setting \beta\geq e^{-\frac{1}{N}}, the cumulative regret is

R_{T}=O\left(d\sqrt{T}\log T\right)

The proof of Theorem 4 can be found in Appendix D.

5 Experiments

We simulate the incentivized federated bandit problem under various environment settings. Specifically, we create an environment of N=50 clients with data sharing costs D^{p}=\{D^{p}_{1},\cdots,D^{p}_{N}\}, total number of iterations T=5{,}000, feature dimension d=25, and time-varying arm pool size K=25. By default, we set D^{p}_{i}=D^{p}_{\star}\in\mathbb{R}, \forall i\in[N]. Due to the space limit, more detailed results and discussions on the real-world dataset can be found in Appendix E.
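For reference, a sketch of how such a synthetic environment can be instantiated (the constants come from the text; the sampling scheme, noise level, and seed are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d, K = 50, 5000, 25, 25             # clients, horizon, dim, arm pool
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)   # unknown shared parameter
D_p = np.full(N, 1.0)                      # uniform sharing cost D^p_i = D^p_*

def env_step():
    """One interaction: an active client, a fresh arm pool, noisy rewards
    (only the chosen arm's reward would actually be revealed)."""
    i_t = int(rng.integers(N))
    arms = rng.normal(size=(K, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)   # unit-norm arms
    rewards = arms @ theta_star + 0.1 * rng.normal(size=K)
    return i_t, arms, rewards
```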

[Figure 1: Comparison between payment-free vs. payment-efficient incentive designs. Panels: (a) D^{p}_{\star}=1, (b) D^{p}_{\star}=10, (c) D^{p}_{\star}=100.]

5.1 Payment-free vs. Payment-efficient

We first empirically compare the performance of the payment-free mechanism (denoted Inc-FedUCB-PF) and the payment-efficient mechanism Inc-FedUCB in Figure 1. It is clear that the added monetary incentives lead to lower regret and communication costs, particularly as D^{p}_{\star} increases. The lower regret is expected, as more data can be collected and shared, while the reduced communication cost comes from reduced communication frequency: when fewer clients can be motivated in one communication round, more communication rounds will be triggered, as the clients' local statistics tend to become outdated.

5.2 Ablation Study on Heuristic Search

To investigate the impact of the different components of our heuristic search, we compare the full-fledged model Inc-FedUCB with the following variants in various environments: (1) Inc-FedUCB (w/o PF): without the payment-free incentive mechanism, where the server only uses money to incentivize clients; (2) Inc-FedUCB (w/o IS): without iterative search, where the server only ranks the clients once; (3) Inc-FedUCB (w/o PF + IS): without both strategies.

[Figure 2: Ablation study on heuristic search (w.r.t. D^{p}_{\star}\in\{1,10,100\}). Panels: (a) accumulative regret, (b) communication cost, (c) payment cost.]

In Figure 2, we present the averaged learning trajectories of regret and communication cost, along with the final (normalized) payment costs under different D^{p}_{\star}. The results indicate that the full-fledged Inc-FedUCB consistently outperforms all other variants in various environments. Additionally, there is a substantial gap between the variants with and without the PF strategy, emphasizing the significance of leveraging the server's information advantage to motivate participation.

5.3 Environment & Hyper-Parameter Study

We further explore diverse \beta hyper-parameter settings for Inc-FedUCB in various environments with varying D^{p}_{\star}, along with a comparison with DisLinUCB (Wang et al., 2020b) (only comparable when D^{p}_{\star}=0).

Environment                | Metric        | DisLinUCB | Inc-FedUCB (β=1) | Inc-FedUCB (β=0.7)    | Inc-FedUCB (β=0.3)
T=5,000, N=50, D^p_⋆=0     | Regret (Acc.) | 48.46     | 48.46            | 48.46 (Δ=0%)          | 48.46 (Δ=0%)
                           | Commu. Cost   | 7,605,000 | 7,605,000        | 7,605,000 (Δ=0%)      | 7,605,000 (Δ=0%)
                           | Pay. Cost     | N/A       | 0                | 0 (Δ=0%)              | 0 (Δ=0%)
T=5,000, N=50, D^p_⋆=1     | Regret (Acc.) | N/A       | 48.46            | 47.70 (Δ=-1.6%)       | 48.38 (Δ=-0.2%)
                           | Commu. Cost   | N/A       | 7,605,000        | 7,668,825 (Δ=+0.8%)   | 7,733,575 (Δ=+1.7%)
                           | Pay. Cost     | N/A       | 75.12            | 60.94 (Δ=-18.9%)      | 22.34 (Δ=-70.3%)
T=5,000, N=50, D^p_⋆=10    | Regret (Acc.) | N/A       | 48.46            | 48.21 (Δ=-0.5%)       | 47.55 (Δ=-1.9%)
                           | Commu. Cost   | N/A       | 7,605,000        | 7,779,425 (Δ=+2.3%)   | 8,599,950 (Δ=+13%)
                           | Pay. Cost     | N/A       | 12,819.61        | 9,050.61 (Δ=-29.4%)   | 4,859.17 (Δ=-62.1%)
T=5,000, N=50, D^p_⋆=100   | Regret (Acc.) | N/A       | 48.46            | 48.22 (Δ=-0.5%)       | 48.44 (Δ=-0.1%)
                           | Commu. Cost   | N/A       | 7,605,000        | 7,842,775 (Δ=+3.1%)   | 8,718,425 (Δ=+14.6%)
                           | Pay. Cost     | N/A       | 190,882.45       | 133,426.01 (Δ=-30.1%) | 88,893.78 (Δ=-53.4%)
Table 1: Study on the hyper-parameter β of Inc-FedUCB across environments (d=25, K=25). Δ values are relative to Inc-FedUCB (β=1); "N/A" marks inapplicable entries, as DisLinUCB is only comparable when D^p_⋆=0.

As Table 1 shows, when all clients are incentivized to share data, our method essentially recovers the performance of DisLinUCB, while overcoming its limitation in incentivized settings where clients are not willing to share by default. Moreover, by reducing the threshold \beta, we can substantially save payment costs while still maintaining highly competitive regret, albeit at the expense of increased communication costs. The reason for this increase was explained before: more communication rounds are triggered as the clients' local statistics become more outdated.

6 Conclusion

In this work, we introduce a novel incentivized communication problem for federated bandits, where the server must incentivize clients for data sharing. We propose a general solution framework, Inc-FedUCB, and instantiate two specific implementations introducing data and monetary incentives under the linear contextual bandit setting. We prove that Inc-FedUCB flexibly achieves customized levels of near-optimal regret with theoretical guarantees on communication and payment costs. Extensive empirical studies further confirm the effectiveness of our incentive search designs across diverse environments. Currently, we assume all clients truthfully reveal their costs of data sharing to the server. We are interested in extending our solution to settings where clients can exhibit strategic behaviors, such as misreporting their intrinsic costs of data sharing to increase their own utility; it is then necessary to study truthful incentive mechanism design.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pages 2312–2320, 2011.
  • Allais (1953) Maurice Allais. Le comportement de l’homme rationnel devant le risque: critique des postulats et axiomes de l’école américaine. Econometrica: Journal of the econometric society, pages 503–546, 1953.
  • Cesa-Bianchi et al. (2013) Nicolo Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737–745, 2013.
  • Chakraborty et al. (2017) Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In IJCAI, pages 164–170, 2017.
  • Cho et al. (2022) Yae Jee Cho, Divyansh Jhunjhunwala, Tian Li, Virginia Smith, and Gauri Joshi. To federate or not to federate: Incentivizing client participation in federated learning. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
  • Ding and Zhou (2007) Jiu Ding and Aihui Zhou. Eigenvalues of rank-one updated matrices with some applications. Applied Mathematics Letters, 20(12):1223–1226, 2007.
  • Donahue and Kleinberg (2021) Kate Donahue and Jon Kleinberg. Model-sharing games: Analyzing federated learning under voluntary participation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5303–5311, 2021.
  • Donahue and Kleinberg (2023) Kate Donahue and Jon Kleinberg. Fairness in model-sharing games. In Proceedings of the ACM Web Conference 2023, pages 3775–3783, 2023.
  • Du et al. (2023) Yihan Du, Wei Chen, Yuko Kuroki, and Longbo Huang. Collaborative pure exploration in kernel bandit. In The Eleventh International Conference on Learning Representations, 2023.
  • Dubey and Pentland (2020) Abhimanyu Dubey and Alex 'Sandy' Pentland. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003–6014, 2020.
  • Filippi et al. (2010) Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.
  • Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
  • He et al. (2022) Jiafan He, Tianhao Wang, Yifei Min, and Quanquan Gu. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 4762–4775. Curran Associates, Inc., 2022.
  • Hu and Gong (2020) Rui Hu and Yanmin Gong. Trading data for learning: Incentive mechanism for on-device federated learning. In GLOBECOM 2020-2020 IEEE Global Communications Conference, pages 1–6. IEEE, 2020.
  • Huang et al. (2021) Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 27057–27068. Curran Associates, Inc., 2021.
  • Karimireddy et al. (2022) Sai Praneeth Karimireddy, Wenshuo Guo, and Michael Jordan. Mechanisms that incentivize data sharing in federated learning. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
  • Kohavi and John (1997) Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273–324, 1997.
  • Korda et al. (2016) Nathan Korda, Balazs Szorenyi, and Shuai Li. Distributed clustering of linear bandits in peer to peer networks. In International conference on machine learning, pages 1301–1309. PMLR, 2016.
  • Landgren et al. (2016) Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243–248. IEEE, 2016.
  • Landgren et al. (2018) Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Social imitation in cooperative multiarmed bandits: Partition-based algorithms with strictly local information. In 2018 IEEE Conference on Decision and Control (CDC), pages 5239–5244. IEEE, 2018.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Li and Wang (2022a) Chuanhao Li and Hongning Wang. Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 6529–6553. PMLR, 2022a.
  • Li and Wang (2022b) Chuanhao Li and Hongning Wang. Communication efficient federated learning for generalized linear bandits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022b.
  • Li et al. (2022) Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Communication efficient distributed learning for kernelized contextual bandits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • Li et al. (2023) Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Learning kernelized contextual bandits in a distributed and asynchronous environment. In The Eleventh International Conference on Learning Representations, 2023.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
  • Liu and Zhao (2010) Keqin Liu and Qing Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE transactions on signal processing, 58(11):5667–5681, 2010.
  • Martínez-Rubio et al. (2019) David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. Advances in Neural Information Processing Systems, 32, 2019.
  • Myerson and Satterthwaite (1983) Roger B Myerson and Mark A Satterthwaite. Efficient mechanisms for bilateral trading. Journal of economic theory, 29(2):265–281, 1983.
  • Pei (2020) Jian Pei. A survey on data pricing: from economics to data science. IEEE Transactions on knowledge and Data Engineering, 34(10):4586–4608, 2020.
  • Pemberton and Rau (2007) Malcolm Pemberton and Nicholas Rau. Mathematics for economists: an introductory textbook. Manchester University Press, 2007.
  • Sankararaman et al. (2019) Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–35, 2019.
  • Shi et al. (2020) Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang. Decentralized multi-player multi-armed bandits with no collision information. In International Conference on Artificial Intelligence and Statistics, pages 1519–1528. PMLR, 2020.
  • Shi et al. (2021) Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917–2925. PMLR, 2021.
  • Sim et al. (2020) Rachael Hwee Ling Sim, Yehong Zhang, Mun Choon Chan, and Bryan Kian Hsiang Low. Collaborative machine learning with incentive-aware model rewards. In International conference on machine learning, pages 8927–8936. PMLR, 2020.
  • Szorenyi et al. (2013) Balazs Szorenyi, Róbert Busa-Fekete, István Hegedus, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In International conference on machine learning, pages 19–27. PMLR, 2013.
  • Tao et al. (2019) Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126–146. IEEE, 2019.
  • Tu et al. (2022) Xuezhen Tu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato, Yang Zhang, and Juan Li. Incentive mechanisms for federated learning: From economic and game theoretic perspective. IEEE Transactions on Cognitive Communications and Networking, 2022.
  • Wang et al. (2020a) Po-An Wang, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, and Alessio Russo. Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 4120–4129. PMLR, 2020a.
  • Wang et al. (2020b) Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2020b.
  • Xu et al. (2021) Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, and Bryan Kian Hsiang Low. Gradient driven rewards to guarantee fairness in collaborative machine learning. Advances in Neural Information Processing Systems, 34:16104–16117, 2021.
  • Zhan et al. (2021) Yufeng Zhan, Jie Zhang, Zicong Hong, Leijie Wu, Peng Li, and Song Guo. A survey of incentive mechanism design for federated learning. IEEE Transactions on Emerging Topics in Computing, 10(2):1035–1044, 2021.
  • Zhu et al. (2021) Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 3–4, 2021.

A Heuristic Search Algorithm

1: Input: invalid client list \{i_{1},i_{2},\cdots,i_{k}\}, data-incentivized participant set \widehat{S}_{t}, last resort cost \mathcal{I}^{m}_{\text{last}}
2: Initialization: S_{t}=\widehat{S}_{t}
3: for n\in[k] do
4:     Rank clients \{i_{1},\dots,i_{k-n+1}\} (in new order) by Eq. (6)
5:     S_{t}=S_{t}\cup\{i_{k-n+1}\}  \triangleright add the client with the largest contribution
6:     for client j\in\{i_{1},\dots,i_{k-n}\} do  \triangleright find extra data-incentivized participants
7:         Compute data incentive \mathcal{I}^{d}_{j,t} for client j by Eq. (4)
8:         if \mathcal{I}^{d}_{j,t}>D^{p}_{j} then
9:             S_{t}=S_{t}\cup\{\Delta V_{j,t}\}
10:    Compute total payment \mathcal{I}^{m}_{n,t}=\sum_{i\in S^{m}_{t}}\mathcal{I}^{m}_{i,t} by Eq. (5)
11:    if \mathcal{I}^{m}_{n,t}>\mathcal{I}^{m}_{\text{last}} then
12:        return S_{t}=\widehat{S}_{t}\cup\{\Delta V_{i_{\alpha},t}\}  \triangleright fall back to last resort
13:    else
14:        if \det(\Sigma(S_{t})+V_{g,t-1})/\det(\Sigma(\widetilde{S}_{t})+V_{g,t-1})>\beta then
15:            return S_{t}  \triangleright return search result
Algorithm 4 Heuristic Search

As sketched in Section 4.3, we devised an iterative search method based on the following ranking heuristic (formally defined in Eq. (6)): the more a client helps increase the server's determinant, the more valuable its contribution, and thus we should motivate the most valuable clients to participate. Denote n\leq k (initialized as 1) as the number of clients to be selected from the invalid list \{i_{1},\dots,i_{k}\}, and initialize the participant set S_{t}=\widehat{S}_{t}. In each round n, we rank the remaining k-n+1 clients by their potential contribution to the server via Eq. (6) and add the most valuable one to S_{t} (Lines 3-4). With the latest S_{t} committed, we then determine additional data-incentivized participants by Eq. (4) (Lines 5-8) and compute the total payment by Eq. (5) (Line 9). If having n clients results in a total cost \mathcal{I}^{m}_{n,t}>\mathcal{I}^{m}_{\text{last}}, we terminate the search and fall back to the last resort (Lines 10-11). Otherwise, if the resulting S_{t} enables the server to satisfy the \beta-gap requirement, we have found a better solution than the last resort and can terminate the search. However, if having n clients is insufficient for the server to pass the \beta-gap requirement, we increase n by 1 and repeat the search process (Lines 12-14). In particular, if the above process fails to terminate (i.e., having all m clients still does not suffice), we use the last resort. Note that, by utilizing matrix computation to calculate the contribution list in each round, this method only incurs a linear time complexity of O(N) when n=m=N.
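A condensed sketch of this search loop (our code, not the authors'): the helper callables are assumed to close over the current statistics, with `marginal_gain` implementing Eq. (6), `data_incentive` Eq. (4), and `payment` summing the monetary top-ups D^p_j - I^d_{j,t} over the monetarily incentivized participants.

```python
def heuristic_search(invalid, S_hat, i_alpha, I_last, costs,
                     marginal_gain, data_incentive, payment, meets_beta):
    """invalid: clients i_1..i_k below the beta segment; i_alpha: the
    last-resort client (None if unavailable). Names are illustrative."""
    S, remaining = set(S_hat), list(invalid)
    while remaining:
        # add the client with the largest marginal determinant gain, Eq. (6)
        best = max(remaining, key=lambda j: marginal_gain(j, S))
        remaining.remove(best)
        S.add(best)
        # the enlarged pool may now attract extra data-incentivized clients
        for j in [j for j in remaining if data_incentive(j, S) > costs[j]]:
            S.add(j)
            remaining.remove(j)
        if payment(S) > I_last:       # dearer than the last resort: stop
            break
        if meets_beta(S):             # cheaper and sufficient: done
            return S
    # fall back to the last resort S_hat + {i_alpha}, if it exists
    return set(S_hat) | ({i_alpha} if i_alpha is not None else set())
```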

B Technical Lemmas

Lemma 5 (Lemma H.3 of Wang et al. (2020b))

With probability 1δ1-\delta, single step pseudo-regret rt=θ,𝐱𝐱tr_{t}=\langle\theta^{*},\mathbf{x}^{*}-\mathbf{x}_{t}\rangle is bounded by

rt2(2log(det(Vit,t)1/2det(λI)1/2δ)+λ1/2)𝐱tVit,t1=O(dlogTδ)𝐱tVit,t1r_{t}\leq 2\left(\sqrt{2\log\left(\frac{\det({V}_{i_{t},t})^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)}+\lambda^{1/2}\right)\lVert\mathbf{x}_{t}\rVert_{{V}_{i_{t},t}^{-1}}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\left\lVert\mathbf{x}_{t}\right\rVert_{{V}_{i_{t},t}^{-1}}
Lemma 6 (Lemma 11 of Abbasi-Yadkori et al. (2011))

Let \{X_{t}\}_{t=1}^{\infty} be a sequence in \mathbb{R}^{d}, let V be a d\times d positive definite matrix, and define V_{t}=V+\sum_{s=1}^{t}X_{s}X_{s}^{\top}. Then we have that

log(det(Vn)det(V))t=1nXtVt112\log\left(\frac{\operatorname{det}\left({V}_{n}\right)}{\operatorname{det}(V)}\right)\leq\sum_{t=1}^{n}\left\|X_{t}\right\|^{2}_{{V}_{t-1}^{-1}}

Further, if Xt2L\left\|X_{t}\right\|_{2}\leq L for all tt, then

t=1nmin{1,XtVt112}2(logdet(Vn)logdetV)2(dlog((trace(V)+nL2)/d)logdetV)\sum_{t=1}^{n}\min\left\{1,\left\|X_{t}\right\|_{{V}_{t-1}^{-1}}^{2}\right\}\leq 2\left(\log\operatorname{det}\left({V}_{n}\right)-\log\operatorname{det}V\right)\leq 2\left(d\log\left(\left(\operatorname{trace}(V)+nL^{2}\right)/d\right)-\log\operatorname{det}V\right)
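As a quick numerical sanity check of the chained inequality above, the following snippet verifies it on random unit-norm vectors (the values d=5, n=200, λ=1, and L=1 are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 200, 1.0
V = lam * np.eye(d)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # enforce ||X_t||_2 <= L = 1

lhs, V_t = 0.0, V.copy()
for x in X:
    lhs += min(1.0, x @ np.linalg.solve(V_t, x))     # ||x||^2 in V_{t-1}^{-1} norm
    V_t += np.outer(x, x)

mid = 2 * (np.linalg.slogdet(V_t)[1] - np.linalg.slogdet(V)[1])
rhs = 2 * (d * np.log((np.trace(V) + n) / d) - np.linalg.slogdet(V)[1])
assert lhs <= mid <= rhs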
Lemma 7 (Lemma 12 of Abbasi-Yadkori et al. (2011))

Let AA, BB and CC be positive semi-definite matrices such that A=B+CA=B+C. Then, we have that

sup𝐱0𝐱A𝐱𝐱B𝐱det(A)det(B)\displaystyle\sup_{\mathbf{x}\neq\textbf{0}}\frac{\mathbf{x}^{\top}A\mathbf{x}}{\mathbf{x}^{\top}B\mathbf{x}}\leq\frac{\det(A)}{\det(B)}
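Lemma 7 can be spot-checked in the same spirit: the supremum on the left-hand side equals the largest generalized eigenvalue of the pair (A, B), which never exceeds det(A)/det(B). A minimal sketch with randomly generated positive (semi-)definite matrices:

import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.normal(size=(d, d)); B = B @ B.T + np.eye(d)   # positive definite
C = rng.normal(size=(d, d)); C = C @ C.T               # positive semi-definite
A = B + C

# sup_x (x^T A x)/(x^T B x) = largest eigenvalue of B^{-1/2} A B^{-1/2}
L = np.linalg.cholesky(np.linalg.inv(B))               # B^{-1} = L L^T
sup_ratio = np.linalg.eigvalsh(L.T @ A @ L).max()
det_ratio = np.exp(np.linalg.slogdet(A)[1] - np.linalg.slogdet(B)[1])
assert sup_ratio <= det_ratio + 1e-9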

C Proof of Theorem 3

Our proof of this negative result relies on the following lower bound for federated linear bandits, established by He et al. (2022).

Lemma 8 (Theorem 5.3 of He et al. (2022))

Let pip_{i} denote the probability that an agent i[N]i\in[N] will communicate with the server at least once over time horizon TT. Then for any algorithm with

i=1Npic2CNlog(T/N)\sum_{i=1}^{N}p_{i}\leq\frac{c}{2C}\cdot\frac{N}{\log(T/N)} (7)

there always exists a linear bandit instance with σ=L=S=1\sigma=L=S=1, such that for TNdT\geq Nd, the expected regret of this algorithm is at least Ω(dNT)\Omega(d\sqrt{NT}).

In the following, we construct a situation where Eq. (7) always holds for the payment-free incentive mechanism. Specifically, recall that the payment-free incentive mechanism (Section 4.2) motivates clients to participate using only data, i.e., the determinant ratio defined in Eq. (4), which indicates how much client i's confidence ellipsoid can shrink using the data offered by the server. Based on the matrix determinant lemma (Ding and Zhou, 2007), we know that \mathcal{I}^{d}_{i,t}\leq(1+\frac{L^{2}}{\lambda})^{T}. Additionally, by applying the determinant-trace inequality (Lemma 10 of Abbasi-Yadkori et al. (2011)), we have \mathcal{I}^{d}_{i,t}\leq(1+\frac{TL^{2}}{\lambda d})^{d}. Therefore, as long as D_{i}^{p}>\min\{(1+\frac{L^{2}}{\lambda})^{T},(1+\frac{TL^{2}}{\lambda d})^{d}\}, where the tighter of the two upper bounds depends on the specific problem instance (i.e., whether d or T is larger), it becomes impossible for the server to incentivize client i to participate in the communication. Now based on Lemma 8, if the number of clients that satisfy D_{i}^{p}\leq\min\{(1+\frac{L^{2}}{\lambda})^{T},(1+\frac{TL^{2}}{\lambda d})^{d}\} is smaller than \frac{c}{2C}\cdot\frac{N}{\log(T/N)}, a sub-optimal regret of order \Omega(d\sqrt{NT}) is inevitable for the payment-free incentive mechanism, which finishes the proof.
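The two upper bounds used in this argument are easy to evaluate and compare numerically. The snippet below (with assumed values λ = L = 1 and small d, T) checks the realized determinant ratio against both bounds and illustrates that the second bound is much tighter when d is much smaller than T:

import numpy as np

rng = np.random.default_rng(2)
d, T, lam = 5, 50, 1.0
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)               # L = 1
V = lam * np.eye(d) + X.T @ X
ratio = np.exp(np.linalg.slogdet(V)[1] - d * np.log(lam))   # det(V) / det(lam I)
bound_T = (1 + 1 / lam) ** T                                # matrix determinant lemma
bound_d = (1 + T / (lam * d)) ** d                          # determinant-trace inequality
assert ratio <= min(bound_T, bound_d)
print(bound_T, bound_d)   # for d << T the second bound is (much) tighter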

D Proof of Theorem 4

To prove this theorem, we first need the following lemma.

Lemma 9 (Communication Frequency Bound)

By setting the communication threshold Dc=TN2dlogTT2N2dRlogTlogβD_{c}=\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\log\beta, the total number of epochs defined by the communication rounds satisfies,

P=O(NdlogT)P=O(Nd\log T)

where R=dlog(1+Tλd)=O(dlogT)R=\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil=O(d\log T).

Proof of Lemma 9. Denote P as the total number of epochs delimited by communication rounds throughout the time horizon T, and V_{g,t_{p}} as the aggregated covariance matrix at the p-th epoch. Specifically, V_{g,t_{0}}=\lambda I, and \widetilde{V}_{T} is the covariance matrix constructed by all data points available in the system at time step T.

Note that according to the incentivized communication scheme in Inc-FedUCB, not all clients will necessarily share their data in the last epoch, hence \det(V_{g,t_{P}})\leq\det(\widetilde{V}_{T})\leq\left(\frac{\operatorname{tr}(\widetilde{V}_{T})}{d}\right)^{d}\leq(\lambda+T/d)^{d}. Therefore,

logdet(Vg,tP)det(Vg,tP1)+logdet(Vg,tP1)det(Vg,tP2)++logdet(Vg,t1)det(Vg,t0)=logdet(Vg,tP)det(Vg,t0)dlog(1+Tλd)\displaystyle\log\frac{\det(V_{g,t_{P}})}{\det(V_{g,t_{P-1}})}+\log\frac{\det(V_{g,t_{P-1}})}{\det(V_{g,t_{P-2}})}+\dots+\log\frac{\det(V_{g,t_{1}})}{\det(V_{g,t_{0}})}=\log\frac{\det(V_{g,t_{P}})}{\det(V_{g,t_{0}})}\leq\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil

Let \alpha\in\mathbb{R}^{+} be an arbitrary positive value. For epochs of length greater than \alpha, there are at most \lceil\frac{T}{\alpha}\rceil of them. For an epoch of length less than \alpha, say the p-th epoch triggered by client i, we have

Δti,tplogdet(Vi,tp)det(Vi,tlast)>Dc\Delta t_{i,t_{p}}\cdot\log\frac{\det(V_{i,t_{p}})}{\det(V_{i,t_{\text{last}}})}>D_{c}

Since such an epoch is shorter than \alpha, we have \Delta t_{i,t_{p}}\leq\alpha. Combining the \beta gap constraint defined in Section 4.3 with the fact that the server always downloads to all clients at every communication round, we then have

\log\frac{\det(V_{g,t_{p}})}{\det(V_{g,t_{p-1}})}\geq\log\frac{\beta\cdot\det(\widetilde{V}_{t_{p}})}{\det(V_{g,t_{p-1}})}\geq\log\frac{\beta\cdot\det(V_{i,t_{p}})}{\det(V_{g,t_{p-1}})}\geq\log\frac{\beta\cdot\det(V_{i,t_{p}})}{\det(V_{i,t_{\text{last}}})}\geq\frac{D_{c}}{\alpha}+\log\beta

Let R=\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil=O(d\log T); then there are at most \lceil\frac{R}{\frac{D_{c}}{\alpha}+\log\beta}\rceil epochs of length less than \alpha time steps. As a result, the total number of epochs satisfies P\leq\lceil\frac{T}{\alpha}\rceil+\lceil\frac{R}{\frac{D_{c}}{\alpha}+\log\beta}\rceil. Note that \lceil\frac{T}{\alpha}\rceil+\lceil\frac{R}{\frac{D_{c}}{\alpha}+\log\beta}\rceil\geq 2\sqrt{\frac{TR}{D_{c}+\alpha\log\beta}} by the AM-GM inequality, where the equality holds when \alpha=\sqrt{\frac{T(D_{c}+\alpha\log\beta)}{R}}.
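The AM-GM step above can be verified numerically; the following snippet checks the inequality over a grid of feasible α (the constants T, R, D_c, β are arbitrary assumptions chosen so that D_c/α + log β stays positive):

import numpy as np

T, R, D_c, beta = 1e5, 500.0, 40.0, 0.9
for alpha in np.linspace(1.0, 300.0, 100):   # keep D_c/alpha + log(beta) > 0
    lhs = np.ceil(T / alpha) + np.ceil(R / (D_c / alpha + np.log(beta)))
    rhs = 2 * np.sqrt(T * R / (D_c + alpha * np.log(beta)))
    assert lhs >= rhs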

Furthermore, let D_{c}=\frac{T}{N^{2}d\log T}-\alpha\log\beta; then \alpha=\sqrt{\frac{T^{2}}{N^{2}dR\log T}}, and we have

P=O(TRDc+αlogβ)=O(NdRlogT)=O(NdlogT)\displaystyle P=O\left(\sqrt{\frac{TR}{D_{c}+\alpha\log\beta}}\right)=O(N\sqrt{dR\log T})=O(Nd\log T) (8)

This concludes the proof of Lemma 9.  
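To make these quantities concrete, the snippet below evaluates R, the minimizing α, the resulting threshold D_c, and the epoch bound 2\sqrt{TR/(D_{c}+\alpha\log\beta)}=2N\sqrt{dR\log T} for a MovieLens-sized instance (λ and β are assumed values):

import numpy as np

def epoch_bound(T, N, d, lam=1.0, beta=0.7):
    R = np.ceil(d * np.log(1 + T / (lam * d)))
    alpha = np.sqrt(T**2 / (N**2 * d * R * np.log(T)))
    D_c = T / (N**2 * d * np.log(T)) - alpha * np.log(beta)
    P = 2 * np.sqrt(T * R / (D_c + alpha * np.log(beta)))   # equals 2 N sqrt(d R log T)
    return R, alpha, D_c, P

print(epoch_bound(T=214729, N=54, d=25))    # P is O(N d log T)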

Communication Cost:

The proof of the communication cost upper bound directly follows Lemma 9. In each epoch, all clients first upload O(d^{2}) scalars to the server and then download O(d^{2}) scalars. Therefore, the total communication cost is C_{T}=P\cdot O(Nd^{2})=O(N^{2}d^{3}\log T).

Monetary Incentive Cost:

Under the clients’ committed data sharing cost Dp={D1p,,DNp}D^{p}=\{D^{p}_{1},\cdots,D^{p}_{N}\}, during each communication round at time step tpt_{p}, we only pay clients in the participant set StpS_{t_{p}}. Specifically, the payment (i.e., monetary incentive cost) i,tpm=0\mathcal{I}^{m}_{i,t_{p}}=0 if the data incentive is already sufficient to motivate the client to participate, i.e., when i,tpdDip\mathcal{I}^{d}_{i,t_{p}}\geq D^{p}_{i}. Otherwise, we only need to pay the minimum amount of monetary incentive such that Eq. (1) is satisfied, i.e., i,tpm=Dipi,tpd\mathcal{I}^{m}_{i,t_{p}}=D^{p}_{i}-\mathcal{I}^{d}_{i,t_{p}}. Therefore, the accumulative monetary incentive cost is

MT=p=1Pi=1Ni,tpm\displaystyle M_{T}=\sum\limits_{p=1}^{P}\sum\limits_{i=1}^{N}\mathcal{I}^{m}_{i,t_{p}} =p=1Pi=1Nmax{0,Dipi,tpd}𝕀(ΔVi,tpStp)\displaystyle=\sum\limits_{p=1}^{P}\sum\limits_{i=1}^{N}\max\left\{0,D^{p}_{i}-\mathcal{I}^{d}_{i,t_{p}}\right\}\cdot\mathbb{I}(\Delta V_{i,t_{p}}\in S_{t_{p}})
p=1Pi=1Nmax{0,maxi[N]{Dip}i,tpd}𝕀(ΔVi,tpStp)\displaystyle\leq\sum\limits_{p=1}^{P}\sum\limits_{i=1}^{N}\max\left\{0,\max_{i\in[N]}\{D_{i}^{p}\}-\mathcal{I}^{d}_{i,t_{p}}\right\}\cdot\mathbb{I}(\Delta V_{i,t_{p}}\in S_{t_{p}})
p=1Pi𝒩¯p(maxi[N]{Dip}i,tpd)𝕀(ΔVi,tpStp)\displaystyle\leq\sum\limits_{p=1}^{P}\sum\limits_{i\in\mathcal{\bar{N}}_{p}}(\max_{i\in[N]}\{D_{i}^{p}\}-\mathcal{I}^{d}_{i,t_{p}})\cdot\mathbb{I}(\Delta V_{i,t_{p}}\in S_{t_{p}})
maxi[N]{Dip}p=1Pi=1N𝕀(ΔVi,tpStp)p=1Pi𝒩¯pi,tpd𝕀(ΔVi,tpStp)\displaystyle\leq\max_{i\in[N]}\{D_{i}^{p}\}\sum\limits_{p=1}^{P}\sum\limits_{i=1}^{N}\mathbb{I}(\Delta V_{i,t_{p}}\in S_{t_{p}})-\sum\limits_{p=1}^{P}\sum\limits_{i\in\mathcal{\bar{N}}_{p}}\mathcal{I}^{d}_{i,t_{p}}\cdot\mathbb{I}(\Delta V_{i,t_{p}}\in S_{t_{p}})
=maxi[N]{Dip}p=1PNpi=1Np𝒫¯ii,tpd\displaystyle=\max_{i\in[N]}\{D_{i}^{p}\}\sum\limits_{p=1}^{P}N_{p}-\sum\limits_{i=1}^{N}\sum\limits_{p\in\mathcal{\bar{P}}_{i}}\mathcal{I}^{d}_{i,t_{p}}

where P and N represent the number of epochs and clients, N_{p} is the number of participants in the p-th epoch, \mathcal{\bar{N}}_{p} is the set of money-incentivized participants in the p-th epoch, and \mathcal{\bar{P}}_{i} is the set of epochs in which client i receives monetary incentive, whose size is denoted as P_{i}=|\mathcal{\bar{P}}_{i}|. Denote D_{\max}^{p}=\max_{i\in[N]}\{D_{i}^{p}\} to simplify the later discussion.

Recalling the definition of data incentive and D_{i,t_{p}}(S_{t_{p}})=\sum_{j:\{\Delta V_{j,t_{p}}\in S_{t_{p}}\}\wedge\{j\neq i\}}\Delta V_{j,t_{p}}+\Delta V_{-i,t_{p}} introduced in Eq. (4), we can show that

i,tpd\displaystyle\mathcal{I}^{d}_{i,t_{p}} =det(Di,tp(Stp)+Vi,tp)det(Vi,tp)1\displaystyle=\frac{\det\left(D_{i,t_{p}}(S_{t_{p}})+V_{i,t_{p}}\right)}{\det(V_{i,t_{p}})}-1
det(Vg,tp)det(Vi,tp)1\displaystyle\geq\frac{\det(V_{g,t_{p}})}{\det(V_{i,t_{p}})}-1

Therefore, we have

MT\displaystyle M_{T} Dmaxpp=1PNp+i=1Np𝒫¯i1i=1Np𝒫¯idet(Vg,tp)det(Vi,tp)\displaystyle\leq D_{\max}^{p}\cdot\sum\limits_{p=1}^{P}N_{p}+\sum\limits_{i=1}^{N}\sum\limits_{p\in\mathcal{\bar{P}}_{i}}1-\sum\limits_{i=1}^{N}\sum\limits_{p\in\mathcal{\bar{P}}_{i}}\frac{\det(V_{g,t_{p}})}{\det(V_{i,t_{p}})}
Dmaxpp=1PNp+i=1NPii=1NPi(det(Vg,t1)det(Vi,t1)det(Vg,t2)det(Vi,t2)det(Vg,tPi)det(Vi,tPi))1Pi\displaystyle\leq D_{\max}^{p}\cdot\sum\limits_{p=1}^{P}N_{p}+\sum\limits_{i=1}^{N}P_{i}-\sum\limits_{i=1}^{N}P_{i}\cdot\left(\frac{\det(V_{g,t_{1}})}{\det(V_{i,t_{1}})}\cdot\frac{\det(V_{g,t_{2}})}{\det(V_{i,t_{2}})}\cdots\frac{\det(V_{g,t_{P_{i}}})}{\det(V_{i,t_{P_{i}}})}\right)^{\frac{1}{P_{i}}}
Dmaxpp=1PNp+i=1NPii=1NPi(det(Vg,t1)det(Vi,t1)det(Vi,t1)det(Vi,t2)det(Vi,tPi1)det(Vi,tPi))1Pi\displaystyle\leq D_{\max}^{p}\cdot\sum\limits_{p=1}^{P}N_{p}+\sum\limits_{i=1}^{N}P_{i}-\sum\limits_{i=1}^{N}P_{i}\cdot\left(\frac{\det(V_{g,t_{1}})}{\det(V_{i,t_{1}})}\cdot\frac{\det(V_{i,t_{1}})}{\det(V_{i,t_{2}})}\cdots\frac{\det(V_{i,t_{P_{i}-1}})}{\det(V_{i,t_{P_{i}}})}\right)^{\frac{1}{P_{i}}}
=Dmaxpp=1PNp+i=1NPii=1NPi(det(Vg,t1)det(Vi,tPi))1Pi\displaystyle=D_{\max}^{p}\cdot\sum\limits_{p=1}^{P}N_{p}+\sum\limits_{i=1}^{N}P_{i}-\sum\limits_{i=1}^{N}P_{i}\cdot\left(\frac{\det(V_{g,t_{1}})}{\det(V_{i,t_{P_{i}}})}\right)^{\frac{1}{P_{i}}}
(1+Dmaxp)PNi=1NPi(det(λI)det(VT))1Pi\displaystyle\leq(1+D_{\max}^{p})\cdot P\cdot N-\sum\limits_{i=1}^{N}P_{i}\cdot\left(\frac{\det(\lambda I)}{\det(V_{T})}\right)^{\frac{1}{P_{i}}}

where the second step holds by the AM-GM inequality and the last step follows from the facts that P_{i}\leq P, N_{p}\leq N, \det(V_{g,t_{1}})\geq\det(\lambda I), and \det(V_{i,t_{P_{i}}})\leq\det(V_{T}).

Specifically, by setting the communication threshold Dc=TN2dlogTT2N2dRlogTlogβD_{c}=\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\log\beta, where R=dlog(1+Tλd)R=\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil, we have the total number of epochs P=O(NdlogT)P=O(Nd\log T) (Lemma 9). Therefore,

MT\displaystyle M_{T} (1+Dmaxp)O(N2dlogT)i=1NPi(det(λI)det(VT))1Pi\displaystyle\leq(1+D_{\max}^{p})\cdot O(N^{2}d\log T)-\sum\limits_{i=1}^{N}P_{i}\cdot\left(\frac{\det(\lambda I)}{\det(V_{T})}\right)^{\frac{1}{P_{i}}}
=O(N2dlogT)\displaystyle=O(N^{2}d\log T)

which finishes the proof.  
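The per-round payment rule underlying this bound is simple to state in code; the sketch below (helper names are ours) pays each collected client exactly the gap between its committed cost and its realized data incentive, so that Eq. (1) is satisfied with the minimum amount of money:

def round_payments(D_p, data_incentive, participants):
    # I^m_{i,t_p} = max{0, D^p_i - I^d_{i,t_p}} for each collected client i
    return {i: max(0.0, D_p[i] - data_incentive[i]) for i in participants}

# a client already satisfied by data alone (I^d >= D^p) costs nothing:
print(round_payments({1: 10.0, 2: 10.0}, {1: 12.3, 2: 4.5}, [1, 2]))
# {1: 0.0, 2: 5.5}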

Regret:

To prove the regret upper bound, we first need the following lemma.

Lemma 10 (Instantaneous Regret Bound)

Under threshold \beta, with probability 1-\delta, the instantaneous pseudo-regret r_{t}=\langle\theta^{*},\mathbf{x}^{*}-\mathbf{x}_{t}\rangle in the j-th epoch is bounded by

rt=O(dlogTδ)𝐱tV~t111βdet(Vg,tj)det(Vg,tj1)r_{t}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\cdot\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}\cdot\sqrt{\frac{1}{\beta}\cdot\frac{\det(V_{g,t_{j}})}{\det(V_{g,t_{j-1}})}}

Proof of Lemma 10. Denote V~t\widetilde{V}_{t} as the covariance matrix constructed by all available data in the system at time step tt. As Eq. (3) indicates, the instantaneous regret of client ii is upper bounded by

rt2αit,t1𝐱tV~t11𝐱tdet(V~t1)det(Vit,t1)=O(dlogTδ)𝐱tV~t11det(V~t1)det(Vit,t1)r_{t}\leq 2\alpha_{i_{t},t-1}\sqrt{\mathbf{x}_{t}^{\top}\widetilde{V}_{t-1}^{-1}\mathbf{x}_{t}}\cdot\sqrt{\frac{\det(\widetilde{V}_{t-1})}{\det(V_{i_{t},t-1})}}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\cdot\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}\cdot\sqrt{\frac{\det(\widetilde{V}_{t-1})}{\det(V_{i_{t},t-1})}}

Suppose the client iti_{t} appears at the jj-th epoch, i.e., tj1ttjt_{j-1}\leq t\leq t_{j}. As the server always downloads the aggregated data to every client in each communication round, we have

det(V~t)det(Vit,t)det(V~t)det(Vit,tj1)det(V~t)det(Vg,tj1)\frac{\det(\widetilde{V}_{t})}{\det(V_{i_{t},t})}\leq\frac{\det(\widetilde{V}_{t})}{\det(V_{i_{t},t_{j-1}})}\leq\frac{\det(\widetilde{V}_{t})}{\det(V_{g,t_{j-1}})}

Combining the β\beta gap constraint defined in Section 4.3, we can show that

det(V~t)det(Vit,t)det(V~t)det(Vg,tj1)det(Vg,tj)/βdet(Vg,tj1)=1βdet(Vg,tj)det(Vg,tj1)\frac{\det(\widetilde{V}_{t})}{\det(V_{i_{t},t})}\leq\frac{\det(\widetilde{V}_{t})}{\det(V_{g,t_{j-1}})}\leq\frac{\det(V_{g,t_{j}})/\beta}{\det(V_{g,t_{j-1}})}=\frac{1}{\beta}\cdot\frac{\det(V_{g,t_{j}})}{\det(V_{g,t_{j-1}})}

Lastly, plugging the above inequality into Eq. (3), we have

rt=O(dlogTδ)𝐱tV~t111βdet(Vg,tj)det(Vg,tj1)r_{t}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\cdot\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}\cdot\sqrt{\frac{1}{\beta}\cdot\frac{\det(V_{g,t_{j}})}{\det(V_{g,t_{j-1}})}}

which finishes the proof of Lemma 10.  

Now, we are ready to prove the accumulative regret upper bound. Similar to DisLinUCB (Wang et al., 2020b), we group the communication epochs into good epochs and bad epochs.

Good Epochs: Note that for good epochs, we have 1det(Vg,tj)det(Vg,tj1)21\leq\frac{\det(V_{g,t_{j}})}{\det(V_{g,t_{j-1}})}\leq 2. Therefore, based on Lemma 10, the instantaneous regret in good epochs is

rt=O(dlogTδ)𝐱tV~t112βr_{t}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\cdot\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}\cdot\sqrt{\frac{2}{\beta}}

Denote the accumulative regret among all good epochs as REGgoodREG_{good}, then using the Cauchy–Schwarz inequality we can see that

REGgood\displaystyle REG_{good} =pPgoodtprt\displaystyle=\sum\limits_{p\in P_{good}}\sum\limits_{t\in\mathcal{B}_{p}}r_{t}
TpPgoodtprt2\displaystyle\leq\sqrt{T\cdot\sum\limits_{p\in P_{good}}\sum\limits_{t\in\mathcal{B}_{p}}r_{t}^{2}}
O(TdlogTδ2βpPgoodtp𝐱tV~t112)\displaystyle\leq O\left(\sqrt{T\cdot d\log\frac{T}{\delta}\cdot\frac{2}{\beta}\sum\limits_{p\in P_{good}}\sum\limits_{t\in\mathcal{B}_{p}}\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}^{2}}\right)

Combining the fact that x\leq 2\log(1+x),\forall x\in[0,1], with Lemma 6, we have

REGgood\displaystyle REG_{good} O(TdβlogTδpPgoodtp2log(1+𝐱tV~t112))\displaystyle\leq O\left(\sqrt{T\cdot\frac{d}{\beta}\log\frac{T}{\delta}\sum\limits_{p\in P_{good}}\sum\limits_{t\in\mathcal{B}_{p}}2\log\left(1+\|\mathbf{x}_{t}\|_{\widetilde{V}_{t-1}^{-1}}^{2}\right)}\right)
O(TdβlogTδpPgoodlogdet(V~tp)det(V~tp1))\displaystyle\leq O\left(\sqrt{T\cdot\frac{d}{\beta}\log\frac{T}{\delta}\cdot\sum\limits_{p\in P_{good}}\log\frac{\det(\widetilde{V}_{t_{p}})}{\det(\widetilde{V}_{t_{p-1}})}}\right)
O(TdβlogTδpPAlllogdet(V~tp)det(V~tp1))\displaystyle\leq O\left(\sqrt{T\cdot\frac{d}{\beta}\log\frac{T}{\delta}\sum\limits_{p\in P_{All}}\log\frac{\det(\widetilde{V}_{t_{p}})}{\det(\widetilde{V}_{t_{p-1}})}}\right)
=O(TdβlogTδlogdet(V~tP)det(V~t0))\displaystyle=O\left(\sqrt{T\cdot\frac{d}{\beta}\log\frac{T}{\delta}\cdot\log\frac{\det(\widetilde{V}_{t_{P}})}{\det(\widetilde{V}_{t_{0}})}}\right)
O(TdβlogTδdlog(1+Tλd))\displaystyle\leq O\left(\sqrt{T\cdot\frac{d}{\beta}\log\frac{T}{\delta}\cdot d\log\left(1+\frac{T}{\lambda d}\right)}\right)
\displaystyle=O\left(\frac{d}{\sqrt{\beta}}\cdot\sqrt{T}\cdot\sqrt{\log\frac{T}{\delta}\cdot\log T}\right)

Bad Epochs: Now moving on to the bad epochs. For any bad epoch starting from time step t_{s} to time step t_{e}, the regret in this epoch is

REG\displaystyle REG =t=tstert=i=1Nτ𝒩i(te)\𝒩i(ts)rτ\displaystyle=\sum\limits_{t=t_{s}}^{t_{e}}r_{t}=\sum\limits_{i=1}^{N}\sum_{\tau\in\mathcal{N}_{i}(t_{e})\backslash\mathcal{N}_{i}(t_{s})}r_{\tau}

where \mathcal{N}_{i}(t)=\{1\leq\tau\leq t:i_{\tau}=i\} denotes the set of time steps at which client i interacts with the environment up to time t. Combining the fact that r_{\tau}\leq 2 with Lemma 5, we have

rτmin{2,2αiτ,τ1𝐱τViτ,τ11𝐱τ}=O(dlogTδ)min{1,𝐱τViτ,τ11}r_{\tau}\leq\min\{2,2\alpha_{i_{\tau},\tau-1}\sqrt{\mathbf{x}_{\tau}^{\top}V_{i_{\tau},\tau-1}^{-1}\mathbf{x}_{\tau}}\}=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\min\{1,\|\mathbf{x}_{\tau}\|_{V_{i_{\tau},\tau-1}^{-1}}\}

Therefore,

REG\displaystyle REG O(dlogTδ)i=1Nτ𝒩i(te)\𝒩i(ts)min{1,𝐱τVi,τ11}\displaystyle\leq O\left(\sqrt{d\log\frac{T}{\delta}}\right)\sum\limits_{i=1}^{N}\sum_{\tau\in\mathcal{N}_{i}(t_{e})\backslash\mathcal{N}_{i}(t_{s})}\min\{1,\|\mathbf{x}_{\tau}\|_{V_{i,\tau-1}^{-1}}\}
O(dlogTδ)i=1NΔti,teτ𝒩i(te)\𝒩i(ts)min{1,𝐱τVi,τ112}\displaystyle\leq O\left(\sqrt{d\log\frac{T}{\delta}}\right)\sum\limits_{i=1}^{N}\sqrt{\Delta t_{i,t_{e}}\sum_{\tau\in\mathcal{N}_{i}(t_{e})\backslash\mathcal{N}_{i}(t_{s})}\min\{1,\|\mathbf{x}_{\tau}\|_{V_{i,\tau-1}^{-1}}^{2}\}}
O(dlogTδ)i=1NΔti,teτ𝒩i(te)\𝒩i(ts)log(1+𝐱τVi,τ112)\displaystyle\leq O\left(\sqrt{d\log\frac{T}{\delta}}\right)\sum\limits_{i=1}^{N}\sqrt{\Delta t_{i,t_{e}}\sum_{\tau\in\mathcal{N}_{i}(t_{e})\backslash\mathcal{N}_{i}(t_{s})}\log\left(1+\|\mathbf{x}_{\tau}\|_{V_{i,\tau-1}^{-1}}^{2}\right)}
=O(dlogTδ)i=1NΔti,teτ𝒩i(te)\𝒩i(ts)log(det(Vi,τ)det(Vi,τ1))\displaystyle=O\left(\sqrt{d\log\frac{T}{\delta}}\right)\sum\limits_{i=1}^{N}\sqrt{\Delta t_{i,t_{e}}\sum_{\tau\in\mathcal{N}_{i}(t_{e})\backslash\mathcal{N}_{i}(t_{s})}\log\left(\frac{\det(V_{i,\tau})}{\det(V_{i,\tau-1})}\right)}
O(dlogTδ)i=1NΔti,telogdet(Vi,te)det(Vi,tlast)\displaystyle\leq O\left(\sqrt{d\log\frac{T}{\delta}}\right)\sum\limits_{i=1}^{N}\sqrt{\Delta t_{i,t_{e}}\cdot\log\frac{\det(V_{i,t_{e}})}{\det(V_{i,t_{\text{last}}})}}
O(dlogTδ)NDc\displaystyle\leq O\left(\sqrt{d\log\frac{T}{\delta}}\right)N\cdot\sqrt{D_{c}}

where the second step holds by the Cauchy-Schwarz inequality, the third step follows from x\leq 2\log(1+x),\forall x\in[0,1], the fourth step uses elementary algebra, and the last two steps follow from the fact that no client triggers the communication before t_{e}.

Recall that, as introduced in Lemma 9, the number of bad epochs is less than R=\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil=O(d\log T), therefore the regret across all bad epochs is

REGbad\displaystyle REG_{bad} =O(dlogTδ)NDcO(dlogT)\displaystyle=O\left(\sqrt{d\log\frac{T}{\delta}}\right)N\cdot\sqrt{D_{c}}\cdot O(d\log T)
=O(Nd1.5DclogTδlogT)\displaystyle=O\left(Nd^{1.5}\sqrt{D_{c}\cdot\log\frac{T}{\delta}}\log T\right)

Combining the regret for all good and bad epochs, we have accumulative regret

RT\displaystyle R_{T} =REGgood+REGbad\displaystyle=REG_{good}+REG_{bad}
=O(dβTlogTδlogT)+O(Nd1.5DclogTδlogT)\displaystyle=O\left(\frac{d}{\sqrt{\beta}}\cdot\sqrt{T}\cdot\sqrt{\log\frac{T}{\delta}\cdot\log T}\right)+O\left(Nd^{1.5}\sqrt{D_{c}\cdot\log\frac{T}{\delta}}\log T\right)

According to Lemma 10, the above regret bound holds with probability at least 1-\delta. For completeness, we also account for the case where the bound fails to hold, whose expected regret is at most \delta\cdot\sum r_{t}\leq 2T\cdot\delta. This term can be made O(1) by selecting \delta=1/T. We can therefore focus on the regret when the bound holds:

RT=O(dβTlogT)+O(Nd1.5log1.5TDc)R_{T}=O\left(\frac{d}{\sqrt{\beta}}\sqrt{T}\log T\right)+O\left(Nd^{1.5}\log^{1.5}T\cdot\sqrt{D_{c}}\right)

With our choice of Dc=TN2dlogTT2N2dRlogTlogβD_{c}=\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\log\beta in Lemma 9, we have

RT\displaystyle R_{T} =O(dβTlogT)+O(Nd1.5log1.5TTN2dlogTT2N2dRlogTlogβ)\displaystyle=O\left(\frac{d}{\sqrt{\beta}}\sqrt{T}\log T\right)+O\left(Nd^{1.5}\log^{1.5}T\cdot\sqrt{\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\log\beta}\right)

Plugging in R=dlog(1+Tλd)=O(dlogT)R=\left\lceil d\log(1+\frac{T}{\lambda d})\right\rceil=O(d\log T), we get

RT=O(dβTlogT)+O(Nd1.5log1.5TTN2dlogT+TNdlogTlog1β)\displaystyle R_{T}=O\left(\frac{d}{\sqrt{\beta}}\sqrt{T}\log T\right)+O\left(Nd^{1.5}\log^{1.5}T\cdot\sqrt{\frac{T}{N^{2}d\log T}+\frac{T}{Nd\log T}\log\frac{1}{\beta}}\right)

Furthermore, by setting β>e1N\beta>e^{-\frac{1}{N}}, we can show that TN2dlogT>TNdlogTlog1β\frac{T}{N^{2}d\log T}>\frac{T}{Nd\log T}\log\frac{1}{\beta}, and therefore

RT=O(dβTlogT)+O(dTlogT)=O(dTlogT)\displaystyle R_{T}=O\left(\frac{d}{\sqrt{\beta}}\sqrt{T}\log T\right)+O\left(d\sqrt{T}\log T\right)=O\left(d\sqrt{T}\log T\right)

This concludes the proof.  
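The final condition \beta>e^{-\frac{1}{N}} can be checked directly. With the MovieLens-sized constants assumed below, \log(1/\beta)<1/N indeed makes the bad-epoch term dominated by the good-epoch term:

import numpy as np

T, N, d = 214729, 54, 25
beta = np.exp(-1.0 / N) + 1e-3                 # any beta > e^{-1/N}
lhs = T / (N**2 * d * np.log(T))
rhs = (T / (N * d * np.log(T))) * np.log(1.0 / beta)
assert lhs > rhs                               # so D_c = O(T / (N^2 d log T))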

E Detailed Experimental Results

In addition to the empirical studies on the synthetic datasets reported in Section 5, we also conduct comprehensive experiments on the real-world recommendation dataset MovieLens (Harper and Konstan, 2015). Following Cesa-Bianchi et al. (2013); Li and Wang (2022a), we pre-processed the dataset to align it with the linear bandit problem setting, with feature dimension d=25d=25 and arm set size K=25K=25. Specifically, it contains N=54N=54 users and 26567 items (movies), where items receiving non-zero ratings are considered as having positive feedback, i.e., denoted by a reward of 1; otherwise, the reward is 0. In total, there are T=214729T=214729 interactions, with each user having at least 3000 observations. By default, we set all clients’ costs of data sharing as Dip=Dp,i[N]D^{p}_{i}=D^{p}_{\star}\in\mathbb{R},\forall i\in[N].
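A rough sketch of this preprocessing is given below. The file name, the binarized user-item matrix, and the truncated-SVD featurization are our assumptions, in the spirit of Cesa-Bianchi et al. (2013), rather than the exact pipeline:

import numpy as np
import pandas as pd

d, K = 25, 25
ratings = pd.read_csv("ratings.csv")          # assumed columns: userId, movieId, rating
counts = ratings["userId"].value_counts()
ratings = ratings[ratings["userId"].isin(counts[counts >= 3000].index)]  # N = 54 users

# d-dimensional item features from a truncated SVD of the binarized user-item matrix
mat = (ratings.pivot_table(index="userId", columns="movieId",
                           values="rating", aggfunc="max", fill_value=0.0)
              .to_numpy() > 0).astype(float)
_, _, Vt = np.linalg.svd(mat, full_matrices=False)
item_feat = Vt[:d].T
item_feat = item_feat / np.clip(np.linalg.norm(item_feat, axis=1, keepdims=True),
                                1e-12, None)

# each logged interaction becomes one bandit round with K candidate arms;
# the rated item gets reward 1, the K-1 sampled distractors get reward 0
def make_round(rng, rated_idx):
    distractors = rng.choice(len(item_feat), size=K - 1, replace=False)
    arms = np.vstack([item_feat[rated_idx], item_feat[distractors]])
    rewards = np.zeros(K); rewards[0] = 1.0
    return arms, rewards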

E.1 Payment-free vs. Payment-efficient incentive mechanism (Supplement to Section 5.1)

Figure 3: Comparison between payment-free vs. payment-efficient incentive designs, under (a) D^{p}_{\star}=1, (b) D^{p}_{\star}=10, and (c) D^{p}_{\star}=100.

Aligned with the findings presented in Section 5.1, the results on the real-world dataset also confirm the advantage of the payment-efficient mechanism over the payment-free incentive mechanism in terms of both accumulative (normalized) reward and communication cost. As illustrated in Figure 3, this performance advantage is particularly notable in a more conservative environment, where clients have a higher D^{p}_{\star}. When the cost of data sharing for clients is relatively low, the performance gap between the two mechanisms becomes less significant. We attribute this to the fact that clients with low D^{p}_{i} values are more readily motivated by data alone, thus alleviating the need for additional monetary incentives. On the other hand, higher values of D^{p}_{i} indicate that clients are more reluctant to share their data. As a result, the payment-free incentive mechanism fails to motivate a sufficient number of clients to participate in data sharing, leading to a noticeable performance gap, as evidenced in Figure 3(c). Note that the communication cost exhibits a sharp increase towards the end of the interactions. This is due to the highly skewed user distribution in the real-world dataset: for instance, in the last 2,032 rounds, only one client remains actively engaged with the environment, rapidly accumulating a sufficient amount of local updates and thus increasing both communication frequency and cost.

E.2 Ablation Study on Heuristic Search (Supplement to Section 5.2)

Figure 4: Ablation study on heuristic search (w.r.t. D^{p}_{\star}\in\{1,10,100\}): (a) Normalized Reward, (b) Communication Cost, (c) Payment Cost.

We further study the effectiveness of each component in the heuristic search of Inc-FedUCB on the real-world dataset and compare the performance among different variants across different data sharing costs. As presented in Figures 4(a) and 4(b), the variants without the payment-free (PF) component, which rely only on monetary incentives to motivate clients to participate, generally exhibit lower rewards and higher communication costs. The reason is a bit subtle: as the payment-efficient mechanism is subject to both the β gap constraint and the minimum payment cost requirement, it tends to satisfy the β gap constraint with the minimum amount of data collected, whereas the payment-free mechanism always collects the maximum amount of data possible. As a result, without the PF component, the server tends to collect less (but just enough) data, which in turn leads to more communication rounds and worse regret. The side effect of the increased communication frequency is higher payment costs, incurred to meet the β gap requirement in each communication round. This is particularly notable in a more collaborative environment, where clients have lower data sharing costs. As exemplified in Figure 4(c), when clients are more willing to share data (e.g., D^{p}_{\star}=1), the variants without PF incur significantly higher payment costs than those with PF, as the server misses the opportunity to recruit those easy-to-motivate clients. Therefore, providing data incentives becomes even more crucial in such scenarios to ensure effective client participation and minimize payment costs. On the other hand, the variants without iterative search (IS) tend to maintain competitive performance compared to the fully-fledged model, despite incurring a higher payment cost, highlighting the advantage of IS in minimizing payment.

E.3 Environment & Hyper-Parameter Study (Supplement to Section 5.3)

d=25,K=25,Dc=TN2dlogTT2N2dRlogTlogβd=25,K=25,D_{c}=\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\cdot\log\beta DisLinUCB Inc-FedUCB (β=1\beta=1) Inc-FedUCB (β=0.7\beta=0.7) Inc-FedUCB (β=0.3\beta=0.3)
MovieLens (Dp=0)(D^{p}_{\star}=0) Reward (Acc.) 38,353 38,353 37,731 (Δ1.6%\Delta-1.6\%) 36,829 (Δ2.4%\Delta-2.4\%)
Commu. Cost 33,415,200 33,415,200 5,967,000 (Δ82%\Delta-82\%) 2,457,000 (Δ92.6%\Delta-92.6\%)
Pay. Cost \ 0 0 (Δ=0%\Delta=0\%) 0 (Δ=0%\Delta=0\%)
MovieLens (Dp=1)(D^{p}_{\star}=1) Reward (Acc.) \ 38,353 37,717 (Δ1.7%\Delta-1.7\%) 36,833 (Δ4%\Delta-4\%)
Commu. Cost \ 33,415,200 13,372,250 (Δ60%\Delta-60\%) 5,038,675 (Δ84.9%\Delta-84.9\%)
Pay. Cost \ 7859.67 124.41 (Δ98.4%\Delta-98.4\%) 0 (Δ100%\Delta-100\%)
MovieLens (Dp=10)(D^{p}_{\star}=10) Reward (Acc.) \ 38,353 37,648 (Δ1.8%\Delta-1.8\%) 36,675 (Δ4.4%\Delta-4.4\%)
Commu. Cost \ 33,415,200 10,041,250 (Δ70%\Delta-70\%) 4,240,625 (Δ87.3%\Delta-87.3\%)
Pay. Cost \ 110,737.62 8,590.43 (Δ92.2%\Delta-92.2\%) 2,076.98 (Δ98.1%\Delta-98.1\%)
MovieLens (Dp=100)(D^{p}_{\star}=100) Reward (Acc.) \ 38,353 37,641 (Δ1.9%\Delta-1.9\%) 36,562 (Δ4.7%\Delta-4.7\%)
Commu. Cost \ 33,415,200 8,496,600 (Δ74.6%\Delta-74.6\%) 5,136,700 (Δ84.6%\Delta-84.6\%)
Pay. Cost \ 1,155,616.99 105,847.84 (Δ90.8%\Delta-90.8\%) 32,618.34 (Δ97.2%\Delta-97.2\%)
Table 2: Study on hyper-parameter of Inc-FedUCB and environment (w/ theoretical DcD_{c}).
d=25,K=25,Dc=TN2dlogTd=25,K=25,D_{c}=\frac{T}{N^{2}d\log T} DisLinUCB Inc-FedUCB (β=1\beta=1) Inc-FedUCB (β=0.7\beta=0.7) Inc-FedUCB (β=0.3\beta=0.3)
MovieLens (Dp=0)(D^{p}_{\star}=0) Reward (Acc.) 38,353 38,353 38,353 (Δ=0%\Delta=0\%) 38,353 (Δ=0%\Delta=0\%)
Commu. Cost 33,415,200 33,415,200 33,415,200 (Δ=0%\Delta=0\%) 33,415,200 (Δ=0%\Delta=0\%)
Pay. Cost \ 0 0 (Δ=0%\Delta=0\%) 0 (Δ=0%\Delta=0\%)
MovieLens (Dp=1)(D^{p}_{\star}=1) Reward (Acc.) \ 38,353 38,207 (Δ0.4%\Delta-0.4\%) 38,208 (Δ0.4%\Delta-0.4\%)
Commu. Cost \ 33,415,200 171,046,600 (Δ+412%\Delta+412\%) 191,280,875 (Δ+472%\Delta+472\%)
Pay. Cost \ 7859.67 2095.73 (Δ73.3%\Delta-73.3\%) 36.32 (Δ99.5%\Delta-99.5\%)
MovieLens (Dp=10)(D^{p}_{\star}=10) Reward (Acc.) \ 38,353 38,251 (Δ0.3%\Delta-0.3\%) 37,609 (Δ1.9%\Delta-1.9\%)
Commu. Cost \ 33,415,200 135,521,025 (Δ+306%\Delta+306\%) 424,465,650 (Δ+1170%\Delta+1170\%)
Pay. Cost \ 110,737.62 33,271.39 (Δ70%\Delta-70\%) 33,872.78 (Δ69.4%\Delta-69.4\%)
MovieLens (Dp=100)(D^{p}_{\star}=100) Reward (Acc.) \ 38,353 38,251 (Δ0.3%\Delta-0.3\%) 37,970 (Δ1%\Delta-1\%)
Commu. Cost \ 33,415,200 135,521,025 (Δ+306%\Delta+306\%) 522,196,225 (Δ+1463%\Delta+1463\%)
Pay. Cost \ 1,155,616.99 352,231.39 (Δ69.5%\Delta-69.5\%) 346,619.77 (Δ70%\Delta-70\%)
Table 3: Study on hyper-parameter of Inc-FedUCB and environment (w/ lower fixed DcD_{c}).
d=25,K=25,Dc=TNdlogTd=25,K=25,D_{c}=\frac{T}{Nd\log T} DisLinUCB Inc-FedUCB (β=1\beta=1) Inc-FedUCB (β=0.7\beta=0.7) Inc-FedUCB (β=0.3\beta=0.3)
MovieLens (Dp=0)(D^{p}_{\star}=0) Reward (Acc.) 37,308 37,308 37,308 (Δ=0%\Delta=0\%) 37,308 (Δ=0%\Delta=0\%)
Commu. Cost 2,737,800 2,737,800 2,737,800 (Δ=0%\Delta=0\%) 2,737,800 (Δ=0%\Delta=0\%)
Pay. Cost \ 0 0 (Δ=0%\Delta=0\%) 0 (Δ=0%\Delta=0\%)
MovieLens (Dp=1)(D^{p}_{\star}=1) Reward (Acc.) \ 37,308 37,296 (Δ0.1%\Delta-0.1\%) 37,306 (Δ0.1%\Delta-0.1\%)
Commu. Cost \ 2,737,800 4,197,525 (Δ+53.3%\Delta+53.3\%) 5,948,950 (Δ+117.3%\Delta+117.3\%)
Pay. Cost \ 55.31 44.76 (Δ19.1%\Delta-19.1\%) 0 (Δ100%\Delta-100\%)
MovieLens (Dp=10)(D^{p}_{\star}=10) Reward (Acc.) \ 37,308 37,297 (Δ0.1%\Delta-0.1\%) 37,167 (Δ0.1%\Delta-0.1\%)
Commu. Cost \ 2,737,800 3,696,350 (Δ+35%\Delta+35\%) 5,765,075 (Δ+110.6%\Delta+110.6\%)
Pay. Cost \ 4048.69 3779.77 (Δ6.6%\Delta-6.6\%) 2242.22 (Δ44.6%\Delta-44.6\%)
MovieLens (Dp=100)(D^{p}_{\star}=100) Reward (Acc.) \ 37,308 37,273 (Δ0.1%\Delta-0.1\%) 36,946 (Δ0.1%\Delta-0.1\%)
Commu. Cost \ 2,737,800 3,484,850 (Δ+27.3%\Delta+27.3\%) 5,690,250 (Δ+107.8%\Delta+107.8\%)
Pay. Cost \ 77,041.04 65,286.90 (Δ15.3%\Delta-15.3\%) 40,010.59 (Δ48.1%\Delta-48.1\%)
Table 4: Study on hyper-parameter of Inc-FedUCB and environment (w/ higher fixed DcD_{c}).

In contrast to the hyper-parameter study on the synthetic dataset with a fixed communication threshold reported in Section 5.3, in this section we comprehensively investigate the impact of β and D^{p}_{\star} on the real-world dataset by varying the communication threshold D_{c}. First, we empirically validate the effectiveness of the theoretical value D_{c}=\frac{T}{N^{2}d\log T}-\sqrt{\frac{T^{2}}{N^{2}dR\log T}}\log\beta as introduced in Theorem 4. The results presented in Table 2 are generally consistent with the findings in Section 5.3: decreasing β can substantially lower the payment cost while still maintaining competitive rewards. We also find that using the theoretical value of D_{c} saves communication cost. This results from the fact that setting D_{c} as a function of β leads to a higher communication threshold for lower β, thereby reducing communication frequency. This observation is essentially aligned with the intuition behind lower β: when the system has a higher tolerance for outdated sufficient statistics, it should not only pay less in each communication round but also trigger communication less frequently.

On the other hand, we investigate Inc-FedUCB's performance under two fixed communication thresholds D_{c}=T/(N^{2}d\log T) and D_{c}=T/(Nd\log T), presented in Tables 3 and 4, respectively. These two values are obtained by dropping the β-dependent term from the theoretical D_{c}, with the second further scaled up by a factor of N. Overall, the main findings align with those reported in Section 5.3, confirming our previous statements. While reducing β achieves competitive rewards with lower payment costs, it comes at the expense of increased communication costs, suggesting a trade-off between payment costs and communication costs. Interestingly, the setting with a higher D^{p}_{\star} and D_{c} can help mitigate the impact of β. Specifically, while increasing clients' cost of data sharing inherently brings additional incentive costs, raising the communication threshold results in fewer communication rounds, leading to reduced overall communication costs. This finding highlights the importance of thoughtful design in choosing D_{c} and β to balance the trade-off between payment costs and communication costs in real-world scenarios with diverse data sharing costs.
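For reference, the three threshold choices compared across Tables 2-4 can be computed directly (λ = 1 assumed); note how the theoretical D_c grows as β decreases, which is what suppresses communication in Table 2:

import numpy as np

T, N, d, lam = 214729, 54, 25, 1.0
R = np.ceil(d * np.log(1 + T / (lam * d)))

def D_c_theory(beta):
    return (T / (N**2 * d * np.log(T))
            - np.sqrt(T**2 / (N**2 * d * R * np.log(T))) * np.log(beta))

for beta in [1.0, 0.7, 0.3]:
    print(f"beta={beta}: D_c={D_c_theory(beta):.2f}")   # larger for smaller beta
print("fixed (Table 3):", T / (N**2 * d * np.log(T)))
print("fixed (Table 4):", T / (N * d * np.log(T)))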

E.4 Extreme Case Study

d=25,K=25,Dp=100d=25,K=25,D^{p}_{\star}=100 Inc-FedUCB (β=1\beta=1) Inc-FedUCB (β=0.7\beta=0.7) Inc-FedUCB (β=0.3\beta=0.3) Inc-FedUCB (β=0.01\beta=0.01)
T=5000,N=50T=5000,N=50 (Dc=T/N2dlogT)(D_{c}=T/N^{2}d\log T) Regret (Acc.) 45.37 46.33 (Δ+2.1%\Delta+2.1\%) 48.49 (Δ+6.9%\Delta+6.9\%) 51.22 (Δ+12.9%\Delta+12.9\%)
Commu. Cost 174,720,000 264,193,275 (Δ+51.2%\Delta+51.2\%) 299,134,900 (Δ+71.2%\Delta+71.2\%) 314,667,500 (Δ+80.1%\Delta+80.1\%)
Pay. Cost 479,397.18 229,999.66 (Δ52%\Delta-52\%) 115,600 (Δ75.9%\Delta-75.9\%) 42,800 (Δ91.1%\Delta-91.1\%)
T=5000,N=50T=5000,N=50 (Dc=T/N2dlogTT2/N2dRlogTlogβ)(D_{c}=T/N^{2}d\log T-\sqrt{T^{2}/N^{2}dR\log T}\cdot\log\beta) Regret (Acc.) 45.37 46.72 (Δ+3%\Delta+3\%) 49.13 (Δ+8.3%\Delta+8.3\%) 53.72 (Δ+18.4%\Delta+18.4\%)
Commu. Cost 174,720,000 17,808,725 (Δ89.8%\Delta-89.8\%) 7,237,600 (Δ95.9%\Delta-95.9\%) 2,981,175 (Δ98.3%\Delta-98.3\%)
Pay. Cost 479,397.18 178,895.78 (Δ62.7%\Delta-62.7\%) 84,989.39 (Δ82.3%\Delta-82.3\%) 1,200 (Δ99.7%\Delta-99.7\%)
Table 5: Case study on synthetic dataset.
d=25,K=25,Dp=100d=25,K=25,D^{p}_{\star}=100 Inc-FedUCB (β=1\beta=1) Inc-FedUCB (β=0.7\beta=0.7) Inc-FedUCB (β=0.3\beta=0.3) Inc-FedUCB (β=0.01\beta=0.01)
MovieLens (Dc=T/N2dlogT)(D_{c}=T/N^{2}d\log T) Reward (Acc.) 38,353 38,251 (Δ0.3%\Delta-0.3\%) 37,970 (Δ1%\Delta-1\%) 37,039 (Δ3.4%\Delta-3.4\%)
Commu. Cost 33,415,200 135,521,025 (Δ+306%\Delta+306\%) 522,196,225 (Δ+1463%\Delta+1463\%) 1,226,741,425 (Δ+3571.2%\Delta+3571.2\%)
Pay. Cost 1,155,616.99 352,231.39 (Δ69.5%\Delta-69.5\%) 346,619.77 (Δ70%\Delta-70\%) 75,799.39 (Δ93.4%\Delta-93.4\%)
MovieLens (Dc=T/N2dlogTT2/N2dRlogTlogβ)(D_{c}=T/N^{2}d\log T-\sqrt{T^{2}/N^{2}dR\log T}\cdot\log\beta) Reward (Acc.) 38,353 37,641 (Δ1.9%\Delta-1.9\%) 36,562 (Δ4.7%\Delta-4.7\%) 31,873 (Δ16.9%\Delta-16.9\%)
Commu. Cost 33,415,200 8,496,600 (Δ74.6%\Delta-74.6\%) 5,136,700 (Δ84.6%\Delta-84.6\%) 1,880,450 (Δ94.4%\Delta-94.4\%)
Pay. Cost 1,155,616.99 105,847.84 (Δ90.8%\Delta-90.8\%) 32,618.34 (Δ97.2%\Delta-97.2\%) 200 (Δ99.9%\Delta-99.9\%)
Table 6: Case study on real-world dataset.

To further investigate the utility of Inc-FedUCB in extreme cases, we conduct a set of case studies on both synthetic and real-world datasets with fixed data sharing costs. As shown in Tables 5 and 6, when β is extremely small, we achieve almost 100% savings in incentive cost compared to the case where every client has to be incentivized to participate in data sharing (i.e., β=1). However, this extreme setting inevitably results in a considerable drop in regret/reward performance and potentially tremendous extra communication cost, due to the extremely outdated local statistics on the clients. Nevertheless, by strategically choosing the communication threshold, we can mitigate the additional communication costs associated with low β values. For instance, on the synthetic dataset, the difference in performance drop between the theoretical D_{c} setting and the heuristic D_{c} value is relatively small (Δ+18.4% vs. Δ+12.9%). However, these two choices of D_{c} exhibit opposite effects on communication costs, with the theoretical one achieving a significant reduction (Δ-98.3%) while the heuristic one incurs a significant increase (Δ+80.1%). On the other hand, on the real-world dataset, the heuristic choice of D_{c} may lead to a smaller performance drop than the theoretical setting (e.g., Δ-3.4% vs. Δ-16.9%), reflecting the specific characteristics of the environment (e.g., a high demand for up-to-date sufficient statistics). Similar to the findings in Section E.3, this case study also emphasizes the significance of properly setting the system hyper-parameters β and D_{c}. By doing so, we can effectively accommodate the trade-off between performance, incentive costs, and communication costs, even in extreme cases.

F Notation Table

Notation Meaning
d context dimension
N total number of clients
T total number of time steps
\beta hyper-parameter that controls the regret level
D_{c} communication threshold
D_{i}^{p} data sharing cost of client i
S_{t} participant set at time step t
D_{i,t}(S_{t}) data offered by the server to client i at time step t
\mathcal{I}_{i,t}^{d}/\mathcal{I}_{i,t}^{m} data/monetary incentive for client i at time step t
\widetilde{V}_{t} covariance matrix constructed by all available data in the system
V_{i,t}, b_{i,t} local data of client i at time step t
V_{g,t}, b_{g,t} global data stored at the server at time step t
\Delta V_{i,t}, \Delta b_{i,t} data stored at client i that has not been shared with the server
\Delta V_{-i,t}, \Delta b_{-i,t} data stored at the server that has not been shared with client i
Table 7: Main technical notations used in this paper.