
An Efficient Algorithm for Cooperative Semi-Bandits

Riccardo Della Vecchia
Artificial Intelligence Lab, Institute for Data Science & Analytics
Bocconi University, Milano, Italy
   Tommaso Cesari
Toulouse School of Economics (TSE), Toulouse, France
Artificial and Natural Intelligence Toulouse Institute (ANITI), Toulouse, France
Abstract

We consider the problem of asynchronous online combinatorial optimization on a network of communicating agents. At each time step, some of the agents are stochastically activated, requested to make a prediction, and the system pays the corresponding loss. Then, neighbors of active agents receive semi-bandit feedback and exchange some succinct local information. The goal is to minimize the network regret, defined as the difference between the cumulative loss of the predictions of active agents and that of the best action in hindsight, selected from a combinatorial decision set. The main challenge in such a context is to control the computational complexity of the resulting algorithm while retaining minimax optimal regret guarantees. We introduce Coop-FTPL, a cooperative version of the well-known Follow The Perturbed Leader algorithm, that implements a new loss estimation procedure generalizing the Geometric Resampling of Neu and Bartók [2013] to our setting. Assuming that the elements of the decision set are $k$-dimensional binary vectors with at most $m$ non-zero entries and $\alpha_1$ is the independence number of the network, we show that the expected regret of our algorithm after $T$ time steps is of order $Q\sqrt{mkT\log(k)(k\alpha_1/Q+m)}$, where $Q$ is the total activation probability mass. Furthermore, we prove that this is only $\sqrt{k\log k}$-away from the best achievable rate and that Coop-FTPL has a state-of-the-art $T^{3/2}$ worst-case computational complexity.

1 Introduction

Distributed online settings with communication constraints arise naturally in large-scale learning systems. For example, in domains such as finance or online advertising, agents often serve high volumes of prediction requests and have to update their local models in an online fashion. Bandwidth and computational constraints may therefore preclude a central processor from having access to all the observations from all sessions and synchronizing all local models at the same time. With these motivations in mind, we introduce and analyze a new online learning setting in which a network of agents solves efficiently a common nonstochastic combinatorial semi-bandit problem by sharing information only with their network neighbors. At each time step $t$, some agents $v$ belonging to a communication network $\mathcal{G}$ are asked to make a prediction $x_t(v)$ belonging to a subset $\mathcal{A}$ of $\{0,1\}^k$ and pay a (linear) loss $\langle x_t(v),\ell_t\rangle$, where $\ell_t\in[0,1]^k$ is chosen adversarially by an oblivious environment. Then, any such agent $v$ receives the feedback $\bigl(x_t(1,v)\ell_t(1),\ldots,x_t(k,v)\ell_t(k)\bigr)$, which is shared, together with some local information, with its first neighbors in the graph. The goal is to minimize the network regret after $T$ time steps

$$R_T=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{v\in\mathcal{S}_t}\bigl\langle x_t(v),\ell_t\bigr\rangle-\sum_{t=1}^{T}\sum_{v\in\mathcal{S}_t}\bigl\langle a,\ell_t\bigr\rangle\right]\,,\qquad(1)$$

where $\mathcal{S}_t$ is the set of agents $v$ that made a prediction at time $t$. In words, this is the difference between the cumulative loss of the “active” agents and the loss that they would have incurred had they consistently made the best prediction in hindsight.

For this setting, we design a distributed algorithm that we call Coop-FTPL (Algorithm 1), and prove that its regret is upper bounded by $Q\sqrt{mkT\log(k)(k\alpha_1/Q+m)}$ (Theorem 1), where $\alpha_1$ is the independence number of the network $\mathcal{G}$ and $Q$ is the sum over all agents of the probability that the agent is active during a time step. Our algorithm employs an estimation technique that we call Cooperative Geometric Resampling (Coop-GR, Algorithm 2). It is an extension of a similar procedure appearing in [Neu and Bartók, 2013] that relies on the fact that the reciprocal of the probability of an event can be estimated by measuring the reoccurrence time. We can leverage this idea in the context of cooperative learning thanks to some statistical properties of the minimum of a family of geometric random variables (see Lemmas 1–3). Our algorithm has a state-of-the-art dependence on time of order $T^{3/2}$ for the worst-case computational complexity (Proposition 1). Moreover, we show with a lower bound (Theorem 2) that no algorithm can achieve a regret smaller than $Q\sqrt{mkT\alpha_1/Q}$ on all cooperative semi-bandit instances. Thus, our Coop-FTPL is at most a multiplicative factor of $\sqrt{k\log k}$ away from the minimax result.

To the best of our knowledge, ours is the first computationally efficient near-optimal learning algorithm for the problem of cooperative learning with nonstochastic combinatorial bandits, where not all agents are necessarily active at all time steps.

2 Related work and further applications

Single-agent combinatorial bandits find applications in several fields, such as path planning, ranking and matching problems, finding minimum-weight spanning trees, cut sets, and multitask bandits. An efficient algorithm for this setting is Follow-The-Perturbed-Leader (FTPL), which was first proposed by Hannan [1957] and later rediscovered by Kalai and Vempala [2005]. Neu and Bartók [2013] show that combining FTPL with a loss estimation procedure called Geometric Resampling (GR) leads to a computationally efficient solution for this problem. More precisely, the solution is efficient given that the offline optimization problem of finding

$$a^\star=\operatorname*{argmin}_{a\in\mathcal{A}}\langle a,y\rangle\,,\qquad\forall y\in[0,+\infty)^k\qquad(2)$$

admits a computationally efficient algorithm. This assumption is minimal, in the sense that if the offline problem in Eq. (2) is hard to approximate, then any algorithm with low regret must also be inefficient. (A slight relaxation in this direction would be assuming that Eq. (2) can be approximated accurately and efficiently.) Grötschel et al. [2012] and Lee et al. [2018] give some sufficient conditions for the validity of this assumption. They essentially rely on having an efficient membership oracle for the convex hull $\mathrm{co}(\mathcal{A})$ of $\mathcal{A}$ and an evaluation oracle for the linear function to optimize. Audibert et al. [2014] note that Online Stochastic Mirror Descent (OSMD) or Follow The Regularized Leader (FTRL)-type algorithms can be efficiently implemented by convex programming if the convex hull of the decision set can be described by a polynomial number of constraints. Suehiro et al. [2012] investigate the details of such efficient implementations and design an algorithm with $k^6$ time complexity, which might still be infeasible in practice. Methods based on the exponential weighting of each decision vector can be implemented efficiently only in a handful of special cases; see, e.g., [Koolen et al., 2010] and [Cesa-Bianchi and Lugosi, 2012] for some examples.
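To make the oracle assumption concrete, consider the classical decision set of all subsets of exactly $m$ out of $k$ components (an illustrative choice of ours, not one prescribed by the paper): there, Eq. (2) is solved by simply keeping the $m$ coordinates of $y$ with the smallest losses. A minimal Python sketch, with a hypothetical function name:

import numpy as np

def m_sets_oracle(y, m):
    """Oracle for Eq. (2) when A = {a in {0,1}^k : sum(a) = m}:
    select the m coordinates of y with the smallest losses."""
    a = np.zeros(len(y), dtype=int)
    a[np.argsort(y)[:m]] = 1  # indices of the m smallest entries of y
    return a

# Example: k = 5 components, choose m = 2
y = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
print(m_sets_oracle(y, m=2))  # -> [0 1 0 1 0]

Here the oracle costs $O(k\log k)$ elementary operations, so the constant $c$ below is small for this decision class.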

The study of cooperative nonstochastic online learning on networks was pioneered by Awerbuch and Kleinberg [2008], who investigated a bandit setting in which the communication graph is a clique, agents belong to clusters characterized by the same loss, and some agents may be non-cooperative. In our multi-agent setting, the end goal is to control the total network regret (1). This objective was already studied by Cesa-Bianchi et al. [2019a] in the full-information case. A similar line of work was pursued by Cesa-Bianchi et al. [2019b], where the authors consider networks of learning agents that cooperate to solve the same nonstochastic bandit problem. In their setting, all agents are simultaneously active at all time steps, and the feedback propagates throughout the network with a maximum delay of $d$ time steps, where $d$ is a parameter of the proposed algorithm. The authors introduce a cooperative version of Exp3 that they call Exp3-COOP, with regret of order $\sqrt{(d+1+K\alpha_d/N)(T\log K)}$, where $K$ is the number of arms in the nonstochastic bandit problem, $N$ is the total number of agents in the network, and $\alpha_d$ is the independence number of the $d$-th power of the communication network. The case $d=1$ corresponds to information that arrives with one round of delay and communication limited to first neighbors. In this setting Exp3-COOP has regret of order $\sqrt{(1+K\alpha_1/N)(T\log K)}$. Thus, our work can be seen as an extension of this setting to the case of combinatorial bandits with stochastic activation of agents. Finally, we point out that if the network consists of a single node, our cooperative setting collapses into a single-agent combinatorial semi-bandit problem. In particular, when the number of arms is $k$ and $m=1$, this becomes the well-known adversarial multiarmed bandit problem (see [Auer et al., 2002]). Hence, ours is a proper generalization of all the settings mentioned above. These cooperative problems are also studied in the stochastic setting (see, e.g., Martínez-Rubio et al. [2019]).

Finally, the reader may wonder what kind of results could be achieved if the agents were activated adversarially rather than stochastically. Cesa-Bianchi et al. [2019a] showed that in this setting no learning can occur, not even with full-information feedback.

3 Cooperative semi-bandit setting

In this section, we present our cooperative semi-bandit protocol and we introduce all relevant definitions and notation.

We say that $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is a communication network over $N$ agents if it is an undirected graph over a set $\mathcal{V}$ with cardinality $|\mathcal{V}|=N$, whose elements we refer to as agents. Without loss of generality, we assume that $\mathcal{V}=\{1,\ldots,N\}$. For any agent $v\in\mathcal{V}$, we denote by $\mathcal{N}(v)$ the set of agents consisting of $v$ together with its neighborhood $\{w\in\mathcal{V}:(v,w)\in\mathcal{E}\}$. We say that $\alpha_1$ is the independence number of the network $\mathcal{G}$ if it is the largest cardinality of an independent set of $\mathcal{G}$, where an independent set of $\mathcal{G}$ is a subset of agents, no two of which are neighbors.

We study the following cooperative online combinatorial optimization protocol. Initially, hidden from the agents, the environment draws a sequence of subsets $\mathcal{S}_1,\mathcal{S}_2,\ldots\subset\mathcal{V}$ of agents, that we call active, and a sequence of loss vectors $\ell_1,\ell_2,\ldots\in\mathbb{R}^k$. We assume that each agent $v$ has a probability $q(v)$ of being activated, which need only be known by $v$. The set of active agents $\mathcal{S}_t$ at time $t\in\{1,2,\ldots\}$ is then determined by drawing, for each agent $v\in\mathcal{V}$, a Bernoulli random variable $X_t(v)$ with bias $q(v)$, independently of the past; $\mathcal{S}_t$ consists exclusively of the agents $v\in\mathcal{V}$ for which $X_t(v)=1$. The decision set is a subset $\mathcal{A}$ of $\bigl\{a\in\{0,1\}^k:\sum_{i=1}^k a_i\leq m\bigr\}$, for some $m\in\{1,\ldots,k\}$.

For each time step $t\in\{1,2,\ldots\}$:

  1. each active agent $v\in\mathcal{S}_t$ predicts with $x_t(v)\in\mathcal{A}$ (possibly drawn at random);

  2. each neighbor $v\in\mathcal{N}(w)$ of an active agent $w\in\mathcal{S}_t$ receives the feedback
     $$f_t(w):=\bigl(x_t(1,w)\ell_t(1),\ldots,x_t(k,w)\ell_t(k)\bigr)\,;\qquad(3)$$

  3. each agent $v\in\bigcup_{w\in\mathcal{S}_t}\mathcal{N}(w)$ receives some local information from its neighbors in $\mathcal{N}(v)$;

  4. the system incurs the loss $\sum_{v\in\mathcal{S}_t}\bigl\langle x_t(v),\ell_t\bigr\rangle$.

The goal is to minimize the expected network regret as a function of the time horizon $T$, defined by

$$R_T:=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{v\in\mathcal{S}_t}\bigl\langle x_t(v),\ell_t\bigr\rangle-\sum_{t=1}^{T}\sum_{v\in\mathcal{S}_t}\bigl\langle a,\ell_t\bigr\rangle\right]\,,\qquad(4)$$

where the expectation is taken with respect to the draws of $\mathcal{S}_1,\ldots,\mathcal{S}_T$ and (possibly) the randomization of the learners. In the next sections we will also denote by $\mathbb{P}_t$ the probability conditioned on the history up to and including round $t-1$, and by $\mathbb{E}_t$ the corresponding expectation.

The nature of the local information exchanged by neighbors of active agents will be clarified in the next section. In short, they share succinct representations of the current state of their local prediction model.
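For concreteness, one round of the protocol above can be simulated as follows. This is a sketch of ours, not part of the paper: `predict` stands in for an arbitrary local prediction rule returning a binary vector in $\mathcal{A}$, and `neighbors[w]` is assumed to contain $w$ itself, consistently with the definition of $\mathcal{N}(w)$.

import numpy as np

rng = np.random.default_rng(0)

def one_round(t, agents, neighbors, q, loss, predict):
    """Simulate one round of the cooperative semi-bandit protocol of Section 3."""
    # draw the active set S_t: X_t(v) ~ Bernoulli(q(v)), independently of the past
    active = [v for v in agents if rng.random() < q[v]]
    # step 1: active agents predict with binary vectors in A
    x = {v: predict(v, t) for v in active}
    # step 4: the system incurs the sum of the active agents' linear losses
    network_loss = sum(x[v] @ loss for v in active)
    # steps 2-3: neighbors of active agents receive the semi-bandit feedback f_t(w)
    feedback = {}
    for w in active:
        f_w = x[w] * loss  # only the components played by w are revealed
        for v in neighbors[w]:
            feedback.setdefault(v, []).append((w, f_w))
    return network_loss, feedback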

4 Coop-FTPL and upper bound

In this section we introduce and analyze our efficient Coop-FTPL algorithm for cooperative online combinatorial optimization.

4.1 The Coop-FTPL algorithm

Coop-FTPL (Algorithm 1) takes as input a decision set $\mathcal{A}\subset\{0,1\}^k$, a time horizon $T\in\mathbb{N}$, a learning rate $\eta>0$, a truncation parameter $\beta$, and an exploration distribution $\zeta$. At each time step $t$, all active agents $v$ make an FTPL prediction $x_t(v)$ (line 4) with an i.i.d. perturbation sampled from $\zeta$ (line 3); then they receive some feedback $f_t(v)$ and share it with their first neighbors (line 6). Afterwards, each agent who received some feedback this round requests from its neighbors $w$ a vector $K_t(w)$ of geometric random samples (line 7), which is efficiently computed by Algorithm 2 and will be described in detail later. With these geometric samples, each agent $v$ computes an estimated loss $\widehat{\ell}_t(v)$ (line 8) and updates the cumulative loss estimate $\widehat{L}_t(v)$ (line 9). This estimator is described in detail in Section 4.3.

Input: decision set $\mathcal{A}$, horizon $T$, learning rate $\eta$, truncation parameter $\beta$, exploration pdf $\zeta$
Initialization: $\widehat{L}_0=0\in\mathbb{R}^k$
1 for each time $t=1,2,\ldots$ do
2   for each active agent $v\in\mathcal{S}_t$ do // active agents
3     sample $Z_t(v)\sim\zeta$, independently of the past
4     make the prediction $x_t(v)=\operatorname*{argmax}_{a\in\mathcal{A}}\bigl\langle a,Z_t(v)-\eta\widehat{L}_{t-1}(v)\bigr\rangle$
5   for each agent $v\in\bigcup_{w\in\mathcal{S}_t}\mathcal{N}(w)$ do // neighbors of active agents
6     receive the feedback $f_t(w)$ (Eq. (3)) from each active neighbor $w\in\mathcal{N}(v)\cap\mathcal{S}_t$
7     receive $K_t(i,w)$ (for all $i\in\{1,\ldots,k\}$) from each neighbor $w\in\mathcal{N}(v)$ using Algorithm 2
8     compute $\widehat{\ell}_t(i,v)=\ell_t(i)\,B_t(i,v)\min_{w\in\mathcal{N}(v)}\bigl\{K_t(i,w)\bigr\}$ for all $i\in\{1,\ldots,k\}$ (Eq. (6)–(7))
9     update $\widehat{L}_t(v)=\widehat{L}_{t-1}(v)+\widehat{\ell}_t(v)$
Algorithm 1: Follow the perturbed leader for cooperative semi-bandits (Coop-FTPL)
Input: time step $t$, component $i$, agent $w$, truncation parameter $\beta$, exploration pdf $\zeta$
1 for $s=1,2,\ldots$ do
2   sample $z'_s\sim\zeta$, independently of the past
3   let $x'_s$ be the $i$-th component of the vector $\operatorname*{argmax}_{a\in\mathcal{A}}\bigl\langle a,z'_s-\eta\widehat{L}_{t-1}(w)\bigr\rangle$
4   sample $y'_s$ from a Bernoulli distribution with parameter $q(w)$, independently of the past
5   if ($x'_s=1$ and $y'_s=1$) or $s=\beta$, break
6 return $K_t(i,w)=s$
Algorithm 2: Geometric resampling for cooperative semi-bandits (Coop-GR)
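The resampling loop of Algorithm 2 is short enough to state directly in code. Below is a minimal Python sketch of ours (not the authors' implementation); `oracle` stands for the argmax over $\mathcal{A}$ used in line 4 of Algorithm 1, and `sample_z` for a sampler from $\zeta$, both assumed to be supplied by the caller.

import numpy as np

def coop_gr(i, beta, eta, L_hat_w, q_w, sample_z, oracle, rng):
    """Coop-GR (Algorithm 2): returns K_t(i, w), i.e. a geometric random variable
    with success probability xbar_t(i, w) * q(w), truncated at beta."""
    for s in range(1, beta + 1):
        z = sample_z()                      # z'_s ~ zeta, e.g. a k-dim. Laplace draw
        x_i = oracle(z - eta * L_hat_w)[i]  # i-th component of argmax_a <a, z - eta*L_hat>
        y = rng.random() < q_w              # y'_s ~ Bernoulli(q(w))
        if (x_i == 1 and y) or s == beta:   # success, or truncation at beta
            return s

One call per component $i$ reproduces the vector $K_t(w)$ that agent $w$ sends to its neighbors at line 7 of Algorithm 1.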

4.2 Reduction to OSMD

Before describing $K_t(w)$ and $\widehat{\ell}_t(v)$, we make a connection between FTPL and the Online Stochastic Mirror Descent algorithm (OSMD) that will be crucial for our analysis. (For a brief overview of some key convex analysis and OSMD facts, see Appendices A and B; for a similar approach in the single-agent case, see [Lattimore and Szepesvári, 2020].)

Fix any time step $t$ and an agent $v$. As we mentioned above, if $v$ is active, it makes the following FTPL prediction (line 4 of Algorithm 1):

$$x_t(v)=\operatorname*{argmin}_{a\in\mathcal{A}}\bigl\langle a,\eta\widehat{L}_{t-1}(v)-Z_t(v)\bigr\rangle\,,$$

where $Z_t(v)\in\mathbb{R}^k$ is sampled i.i.d. from $\zeta$ (the random perturbations introduce the exploration, which for an appropriate choice of $\zeta$ is sufficient to guarantee small regret). On the other hand, given a Legendre potential $F$ with $\mathrm{dom}(\nabla F)=\mathrm{int}\bigl(\mathrm{co}(\mathcal{A})\bigr)$, an OSMD algorithm makes the prediction

$$\overline{x}_t(v)=\operatorname*{argmin}_{x\in\mathrm{co}(\mathcal{A})}\Bigl(\bigl\langle x,\eta\,\widehat{\ell}_{t-1}(v)\bigr\rangle+\mathcal{B}_F\bigl(x,\overline{x}_{t-1}(v)\bigr)\Bigr)\,,$$

where $\mathcal{B}_F$ is the Bregman divergence induced by $F$ and $\mathrm{co}(\mathcal{A})$ is the convex hull of $\mathcal{A}$. Using the fact that $\mathrm{dom}(\nabla F)=\mathrm{int}\bigl(\mathrm{co}(\mathcal{A})\bigr)$, the $\operatorname*{argmin}$ above can be computed in a standard way by setting the gradient of its argument to zero; proceeding inductively, we obtain the two identities $\nabla F\bigl(\overline{x}_t(v)\bigr)=\nabla F\bigl(\overline{x}_{t-1}(v)\bigr)-\eta\,\widehat{\ell}_{t-1}(v)=-\eta\widehat{L}_{t-1}(v)$. By duality this implies that $\overline{x}_t(v)=\nabla F^*\bigl(-\eta\widehat{L}_{t-1}(v)\bigr)$. We now want to relate $x_t(v)$ and $\overline{x}_t(v)$ so that

$$\overline{x}_t(v)=\mathbb{E}_t\bigl[x_t(v)\bigr]=\mathbb{E}_t\left[\operatorname*{argmin}_{a\in\mathcal{A}}\bigl\langle a,\eta\widehat{L}_{t-1}(v)-Z_t(v)\bigr\rangle\right]\,,\qquad(5)$$

where the conditional expectation $\mathbb{E}_t$ (given the history up to time $t-1$) is taken with respect to $Z_t(v)$. Thus, in order to view FTPL as an instance of OSMD, it suffices to find a Legendre potential $F$ with $\mathrm{dom}(\nabla F)=\mathrm{int}\bigl(\mathrm{co}(\mathcal{A})\bigr)$ such that $\nabla F^*\bigl(-\eta\widehat{L}_{t-1}(v)\bigr)=\mathbb{E}_t\bigl[\operatorname*{argmax}_{a\in\mathcal{A}}\bigl\langle a,Z_t(v)-\eta\widehat{L}_{t-1}(v)\bigr\rangle\bigr]$. In order to satisfy this condition, we need that for any $x\in\mathbb{R}^k$, the Fenchel conjugate $F^*$ of $F$ enjoys $\nabla F^*(x)=\int_{\mathbb{R}^k}\operatorname*{argmax}_{a\in\mathcal{A}}\langle a,z-x\rangle\,\zeta(z)\,\mathrm{d}z$. Then, we define $h(x):=\operatorname*{argmax}_{a\in\mathcal{A}}\langle a,x\rangle$ for any $x\in\mathbb{R}^k$, where $h(x)$ is chosen to be an arbitrary maximizer if multiple maximizers exist. From convex analysis, if the convex hull $\mathrm{co}(\mathcal{A})$ of $\mathcal{A}$ had a smooth boundary, then the support function $x\mapsto\phi(x):=\max_{a\in\mathrm{co}(\mathcal{A})}\langle a,x\rangle=\max_{a\in\mathcal{A}}\langle a,x\rangle$ of $\mathrm{co}(\mathcal{A})$ would satisfy $\nabla\phi(x)=h(x)$. For combinatorial bandits, $\mathrm{co}(\mathcal{A})$ is non-smooth, but, since $\zeta$ is a density with respect to the Lebesgue measure, one can prove (see, e.g., Lattimore and Szepesvári [2020]) that $\nabla\int_{\mathbb{R}^k}\phi(x+z)\,\zeta(z)\,\mathrm{d}z=\int_{\mathbb{R}^k}h(x+z)\,\zeta(z)\,\mathrm{d}z$ for all $x\in\mathbb{R}^k$. This shows that FTPL can be interpreted as OSMD with a potential $F$ defined implicitly by its Fenchel conjugate

$$F^*(x):=\int_{\mathbb{R}^k}\phi(x+z)\,\zeta(z)\,\mathrm{d}z\,,\qquad\forall x\in\mathbb{R}^k\,.$$

Thus, recalling (5), we can think of the OSMD update $\overline{x}_t(v)$ as the component-wise average $\overline{x}_t(i,v)=\sum_{a\in\mathcal{A}}P_t(a,v)\,a(i)$ (for all $i\in\{1,\ldots,k\}$) of a random draw with respect to a distribution $P_t(v)$ on $\mathcal{A}$, defined in terms of the distribution of $Z_t$ as

$$P_t(a,v)=\mathbb{P}_t\Bigl[h\bigl(Z_t(v)-\eta\widehat{L}_{t-1}(v)\bigr)=a\Bigr]\,,\qquad\forall a\in\mathcal{A}\,,$$

where $\mathbb{P}_t$ is the probability conditioned on the history up to time $t-1$.
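Although $\overline{x}_t(v)$ is never computed explicitly by the algorithm, Eq. (5) suggests a straightforward Monte Carlo approximation of it that can serve as a sanity check. A sketch of ours, assuming the Laplace perturbation later used in Theorem 1 and an `oracle` implementing $h$:

import numpy as np

def estimate_xbar(L_hat, eta, oracle, n_samples=100_000, rng=None):
    """Monte Carlo estimate of xbar_t(v) = E_t[ h(Z_t(v) - eta * L_hat_{t-1}(v)) ],
    cf. Eq. (5), with Z_t(v) drawn from zeta(z) = 2^{-k} exp(-||z||_1)."""
    rng = rng or np.random.default_rng(0)
    total = np.zeros(len(L_hat))
    for _ in range(n_samples):
        z = rng.laplace(size=len(L_hat))  # Z ~ zeta, a k-dimensional Laplace draw
        total += oracle(z - eta * L_hat)  # h(Z - eta * L_hat), a vertex of A
    return total / n_samples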

4.3 An efficient estimator

To help with the definitions and analyses of $K_t(w)$ and $\widehat{\ell}_t(v)$, we introduce three useful lemmas on geometric distributions. We defer their proofs to Appendix C.

Lemma 1.

Let $Y_1,\ldots,Y_m$ be $m$ independent random variables such that each $Y_j$ has a geometric distribution with parameter $p_j\in[0,1]$. Then, the random variable $Z:=\min_{j\in\{1,\ldots,m\}}Y_j$ has a geometric distribution with parameter $1-\prod_{j=1}^m(1-p_j)$.

Lemma 2.

Let $G$ be a geometric random variable with parameter $q\in(0,1]$ and let $\beta>0$. Then, the expectation of the random variable $\min\{G,\beta\}$ satisfies $\mathbb{E}\bigl[\min\{G,\beta\}\bigr]=\bigl(1-(1-q)^\beta\bigr)/q$.

Lemma 3.

For all $v\in\mathcal{V}$, fix two arbitrary numbers $p_1(v),p_2(v)\in[0,1]$. Consider a collection $\bigl\{X_s(v),Y_s(v)\bigr\}_{s\in\mathbb{N},v\in\mathcal{V}}$ of independent Bernoulli random variables such that $\mathbb{E}\bigl[X_s(v)\bigr]=p_1(v)$ and $\mathbb{E}\bigl[Y_s(v)\bigr]=p_2(v)$ for any $s\in\mathbb{N}$ and all $v\in\mathcal{V}$. Then, the random variables $\{G(v)\}_{v\in\mathcal{V}}$ defined for all $v\in\mathcal{V}$ by $G(v):=\inf\bigl\{s\in\mathbb{N}:X_s(v)\,Y_s(v)=1\bigr\}$ are all independent and have a geometric distribution with parameter $p_1(v)\,p_2(v)$.
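Lemmas 1 and 2 are easy to verify numerically; the following quick Monte Carlo check (ours, with arbitrary parameter values) may help build intuition:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])  # parameters p_1, ..., p_m
n = 200_000

# Lemma 1: the min of independent geometrics is geometric with parameter 1 - prod(1 - p_j)
Z = rng.geometric(p, size=(n, len(p))).min(axis=1)
q = 1 - np.prod(1 - p)
print(Z.mean(), 1 / q)  # both approximately 1/q, the mean of a Geometric(q)

# Lemma 2: E[min{G, beta}] = (1 - (1 - q)^beta) / q for G ~ Geometric(q)
beta = 4
G = rng.geometric(q, size=n)
print(np.minimum(G, beta).mean(), (1 - (1 - q) ** beta) / q)  # approximately equal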

Fix now any time step $t$, agent $v$, and component $i\in\{1,\ldots,k\}$. The loss estimator $\widehat{\ell}_t(i,v)$ depends on the algorithmic definition of $K_t(i,w)$ in Algorithm 2, where $w\in\mathcal{N}(v)$. By Lemma 3, we have that for any $w$, conditionally on the history up to time $t-1$, the random variable $K_t(i,w)$ has a truncated geometric distribution with success probability equal to $\overline{x}_t(i,w)\,q(w)$ and truncation parameter $\beta$. The idea is that using $K_t(i,w)$ as an estimator we can reconstruct the inverse of the probability that agent $v$ observes $i$ at time $t$. The truncation parameter $\beta$ is not needed for the analysis; it is used only to optimize the computational efficiency of the algorithm. (Previously known cooperative algorithms for limited-feedback settings need to exchange at least two real numbers: the probability according to which predictions are drawn and the loss. Instead of the probability, we only need to pass the integer $K_t(i,w)$, which requires at most $\log_2(\beta)$ bits, of order $\log(T)$ when tuned. Note also that for the loss, we could exchange an approximation $l_t$ of $\ell_t$ using only $n=O(\log(T))$ bits. Indeed, one can show that Lemma 4 remains true in this case with $\ell_t$ replaced by $l_t$ in the first equality; everything else in the proof of Theorem 1 works the same, up to an extra (negligible) $m2^{-n}T$ term.)

The loss estimator of $v$ is then defined as

$$\widehat{\ell}_t(i,v):=\ell_t(i)\,B_t(i,v)\min_{w\in\mathcal{N}(v)}\bigl\{K_t(i,w)\bigr\}\,,\qquad(6)$$

where

$$B_t(i,v)=\mathbb{I}\bigl\{\exists w\in\mathcal{N}(v):w\in\mathcal{S}_t,\,x_t(i,w)=1\bigr\}\,,\qquad K_t(i,w)=\min\bigl\{G_t(i,w),\beta\bigr\}\,,\qquad(7)$$

and, given the history up to time $t-1$, for each $i\in\{1,\ldots,k\}$ the family $\bigl\{G_t(i,w)\bigr\}_{w\in\mathcal{V}}$ consists of independent geometric random variables with parameter $\overline{x}_t(i,w)\,q(w)$. Note that the geometric random variables $G_t(i,w)$ are actually never computed by Algorithm 2, which efficiently computes only their truncations $K_t(i,w)$ with truncation parameter $\beta$. Nevertheless, as will be apparent later, they are a useful tool for the theoretical analysis. Note that, by Eq. (5), we have

$$\mathbb{P}_t\bigl[x_t(i,w)=1\bigr]=\mathbb{E}_t\bigl[x_t(i,w)\bigr]=\overline{x}_t(i,w)\,,$$

therefore

$$\overline{B}_t(i,v):=\mathbb{E}_t\bigl[B_t(i,v)\bigr]=1-\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)=\frac{1}{\mathbb{E}_t\bigl[\min_{w\in\mathcal{N}(v)}G_t(i,w)\bigr]}\,,$$

where the last identity follows by Lemma 1. Moreover, from Lemmas 1 and 2, we have

$$\mathbb{E}_t\left[\min_{w\in\mathcal{N}(v)}K_t(i,w)\right]=\frac{1-\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)^{\beta}}{\overline{B}_t(i,v)}\,.$$

The following key lemma gives an upper bound on the expected estimated loss.

Lemma 4.

For any time $t$, component $i$, agent $v$, and truncation parameter $\beta$, the expectation of the loss estimator in (6), given the history up to time $t-1$, satisfies

$$\mathbb{E}_t\Bigl[\widehat{\ell}_t(i,v)\Bigr]=\ell_t(i)\left(1-\Biggl(\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)\Biggr)^{\!\beta}\right)\leq\ell_t(i)\,.$$
Proof.

Using the fact that, conditioned on the history up to time $t-1$, the random variable $\min_{w\in\mathcal{N}(v)}G_t(i,w)$ has a geometric distribution with parameter $\overline{B}_t(i,v)$ (Lemmas 1–3), we get

$$\begin{aligned}
\mathbb{E}_t\Bigl[\widehat{\ell}_t(i,v)\Bigr]&=\mathbb{E}_t\Bigl[\ell_t(i)\,B_t(i,v)\min_{w\in\mathcal{N}(v)}\bigl\{\min\{G_t(i,w),\beta\}\bigr\}\Bigr]=\mathbb{E}_t\Bigl[\ell_t(i)\,B_t(i,v)\min\Bigl\{\min_{w\in\mathcal{N}(v)}G_t(i,w),\,\beta\Bigr\}\Bigr]\\
&=\ell_t(i)\,\mathbb{E}_t\bigl[B_t(i,v)\bigr]\,\mathbb{E}_t\Bigl[\min\Bigl\{\min_{w\in\mathcal{N}(v)}G_t(i,w),\,\beta\Bigr\}\Bigr]=\ell_t(i)\,\overline{B}_t(i,v)\,\frac{1-\bigl(1-\overline{B}_t(i,v)\bigr)^{\beta}}{\overline{B}_t(i,v)}\\
&=\ell_t(i)\Bigl(1-\bigl(1-\overline{B}_t(i,v)\bigr)^{\beta}\Bigr)=\ell_t(i)\left(1-\Biggl(\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)\Biggr)^{\!\beta}\right)\,,
\end{aligned}$$

where we plugged in the definition of $\overline{B}_t(i,v)$ in the last equality. Since $\overline{x}_t(i,w)\,q(w)\in[0,1]$ and $\beta>0$, it follows that $\mathbb{E}_t\bigl[\widehat{\ell}_t(i,v)\bigr]\leq\ell_t(i)$. ∎

4.4 Analysis

We can finally state our upper bound on the regret of Coop-FTPL. The key idea is to apply OSMD techniques to our FTPL algorithm, as explained in Section 4.2. The proof proceeds by splitting the regret of each agent in the network into three terms. The first two are treated with standard techniques: the first depends on the diameter of $\mathrm{co}(\mathcal{A})$ with respect to the regularizer $F$, and the second on the Bregman divergence of consecutive updates. The last term is related to the bias of the estimator and is analyzed by leveraging the lemmas on geometric distributions introduced in Section 4.3. These terms are then summed, each weighted by the corresponding agent's activation probability, and the sum is controlled using results that relate a weighted sum over the nodes of a graph to the independence number of the graph.

Theorem 1.

If $\zeta$ is the Laplace density $z\mapsto\zeta(z)=2^{-k}\exp\bigl(-\lVert z\rVert_1\bigr)$, $\eta>0$, and $0<\beta\leq 1/(\eta k)$, then the regret of our Coop-FTPL algorithm (Algorithm 1) satisfies

$$R_T\leq\frac{3Qm\log k}{\eta}+3Q\eta kT\left(\frac{k}{Q}\alpha_1+m\right)+\frac{\alpha_1\,k\,T}{\beta}\,.$$

In particular, tuning the parameters $\eta,\beta$ as

$$\beta=\left\lfloor\frac{1}{k\eta}\right\rfloor\quad\text{and}\quad\eta=\sqrt{\frac{3m\log(k)}{5kT\bigl(\frac{k}{Q}\alpha_1+m\bigr)}}\,,\qquad\text{where}\quad Q=\sum_{v\in\mathcal{V}}q(v)\,,\qquad(8)$$

yields

$$R_T\leq 2Q\sqrt{15\,mkT\log(k)\left(\frac{k}{Q}\alpha_1+m\right)}\,.$$
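As a quick numerical illustration of the tuning in Eq. (8), here is a sketch of ours with made-up problem sizes:

import math

def coop_ftpl_tuning(k, m, T, alpha1, Q):
    """Tuned eta and beta from Eq. (8), plus the regret bound of Theorem 1."""
    eta = math.sqrt(3 * m * math.log(k) / (5 * k * T * (k * alpha1 / Q + m)))
    beta = math.floor(1 / (k * eta))
    bound = 2 * Q * math.sqrt(15 * m * k * T * math.log(k) * (k * alpha1 / Q + m))
    return eta, beta, bound

# e.g. k = 20 components, m = 3, T = 10^4 rounds, alpha_1 = 5, Q = 2.5
print(coop_ftpl_tuning(k=20, m=3, T=10_000, alpha1=5, Q=2.5))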

We now present a detailed sketch of the proof of our main result (full proof in Appendix D).

Sketch of the proof.

For the sake of convenience, we define the expected individual regret of an agent $v\in\mathcal{V}$ in the network with respect to a fixed action $a\in\mathcal{A}$ by

$$R_T(a,v):=\mathbb{E}\left[\sum_{t=1}^{T}\bigl\langle x_t(v),\ell_t\bigr\rangle-\sum_{t=1}^{T}\bigl\langle a,\ell_t\bigr\rangle\right]\,,$$

where the expectation is taken with respect to the internal randomization of the agent, but not to its activation probability $q(v)$. With this definition, the total regret on the network in Eq. (4) can be decomposed as

$$\begin{aligned}
R_T&=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{v\in\mathcal{S}_t}\Bigl(\bigl\langle x_t(v),\ell_t\bigr\rangle-\bigl\langle a,\ell_t\bigr\rangle\Bigr)\right]=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\left[\sum_{v\in\mathcal{S}_t}\Bigl(\bigl\langle x_t(v),\ell_t\bigr\rangle-\bigl\langle a,\ell_t\bigr\rangle\Bigr)\right]\right]\\
&=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{v\in\mathcal{V}}q(v)\,\mathbb{E}_t\Bigl[\bigl\langle x_t(v),\ell_t\bigr\rangle-\bigl\langle a,\ell_t\bigr\rangle\Bigr]\right]=\max_{a\in\mathcal{A}}\sum_{v\in\mathcal{V}}q(v)\,R_T(a,v)\,.\qquad(9)
\end{aligned}$$

The proof then proceeds by isolating the bias in the loss estimators. For each $a\in\mathcal{A}$ we have

$$R_T(a,v)=\mathbb{E}\left[\sum_{t=1}^{T}\Bigl\langle\overline{x}_t(v)-a,\,\widehat{\ell}_t(v)\Bigr\rangle\right]+\mathbb{E}\left[\sum_{t=1}^{T}\Bigl\langle\overline{x}_t(v)-a,\,\ell_t-\widehat{\ell}_t(v)\Bigr\rangle\right]\,.$$

Exploiting the analogy that we established between FTPL and OSMD, we begin by using the standard bound for the regret of OSMD in the first term of the previous equation. For the reader’s convenience, we restate it in Appendix B, Theorem 4. This leads to

$$R_T(a,v)\leq\underbrace{\frac{F(\overline{x}_1(v))-F(a)}{\eta}}_{(\mathrm{I})}+\underbrace{\mathbb{E}\left[\frac{1}{\eta}\sum_{t=1}^{T}\mathcal{B}_F\bigl(\overline{x}_t(v),\overline{x}_{t+1}(v)\bigr)\right]}_{(\mathrm{II})}+\underbrace{\mathbb{E}\left[\sum_{t=1}^{T}\Bigl\langle\overline{x}_t(v)-a,\,\ell_t-\widehat{\ell}_t(v)\Bigr\rangle\right]}_{(\mathrm{III})}\,.$$

The three terms are studied separately and in detail in Appendix D. Here, we provide a sketch of the bounds.

For the first term $(\mathrm{I})$, we use the fact that the regularizer $F$ satisfies, for all $a\in\mathcal{A}$,

$$F(a)\geq -m\bigl(1+\log(k)\bigr)\,,\qquad(10)$$

which follows from the definition of $F$, the properties of the perturbation distribution, and the fact that $\lVert a\rVert_1\leq m$ for any $a\in\mathcal{A}$. One can also show that $F(a)\leq 0$ for all $a\in\mathcal{A}$, and this, combined with the previous equation, leads to

$$(\mathrm{I})\leq\frac{m(1+\log k)}{\eta}\,.$$

For the second term $(\mathrm{II})$, we have

$$\begin{aligned}
\mathcal{B}_F\bigl(\overline{x}_t(v),\overline{x}_{t+1}(v)\bigr)&=\mathcal{B}_{F^*}\Bigl(\nabla F\bigl(\overline{x}_{t+1}(v)\bigr),\nabla F\bigl(\overline{x}_t(v)\bigr)\Bigr)\\
&=\mathcal{B}_{F^*}\Bigl(-\eta\widehat{L}_{t-1}(v)-\eta\widehat{\ell}_t(v),\,-\eta\widehat{L}_{t-1}(v)\Bigr)\\
&=\frac{\eta^2}{2}\left\lVert\widehat{\ell}_t(v)\right\rVert_{\nabla^2F^*(\xi(v))}^2\,,\qquad(11)
\end{aligned}$$

where the first equality is a standard property of the Bregman divergence (see Theorem 3 in Appendix A), the second follows from the definitions of the updates, and the last from Taylor's theorem, where $\xi(v)=-\eta\widehat{L}_{t-1}(v)-\alpha\eta\widehat{\ell}_t(v)$ for some $\alpha\in[0,1]$. The estimation of the entries of the Hessian is nontrivial (but tedious); the interested reader can find it in Appendix D. Exploiting our assumption that $\beta\leq 1/(\eta k)$, we get, for all $i,j\in\{1,\ldots,k\}$,

$$\nabla^2F^*\bigl(\xi(v)\bigr)_{ij}\leq e\,\overline{x}_t(i,v)\,.$$

Plugging this estimate into Eq. (11) yields

$$\begin{aligned}
\frac{\eta^2}{2}\left\lVert\widehat{\ell}_t(v)\right\rVert_{\nabla^2F^*(\xi(v))}^2&=\frac{\eta^2}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\nabla^2F^*\bigl(\xi(v)\bigr)_{i,j}\,\widehat{\ell}_t(i,v)\,\widehat{\ell}_t(j,v)\\
&\leq\frac{\eta^2 e}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_t(i,v)\,\widehat{\ell}_t(i,v)\,\widehat{\ell}_t(j,v)\\
&\leq\frac{\eta^2 e}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_t(i,v)\,B_t(i,v)\min_{w\in\mathcal{N}(v)}\bigl\{G_t(i,w)\bigr\}\,B_t(j,v)\min_{w\in\mathcal{N}(v)}\bigl\{G_t(j,w)\bigr\}\,,
\end{aligned}$$

where the last inequality follows by neglecting the truncation with $\beta$ (and bounding $\ell_t(i),\ell_t(j)\leq 1$). Hence, multiplying $(\mathrm{II})$ by $q(v)$ and summing over $v\in\mathcal{V}$ yields

$$\begin{aligned}
&\sum_{v\in\mathcal{V}}q(v)\,\mathbb{E}\left[\frac{\eta}{2}\sum_{t=1}^{T}\left\lVert\widehat{\ell}_t(v)\right\rVert_{\nabla^2F^*(\xi(v))}^2\right]=\sum_{v\in\mathcal{V}}q(v)\,\frac{\eta}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\left[\left\lVert\widehat{\ell}_t(v)\right\rVert_{\nabla^2F^*(\xi(v))}^2\right]\right]\\
&\quad\leq\sum_{v\in\mathcal{V}}q(v)\,\frac{\eta e}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\left[\sum_{i,j=1}^{k}\overline{x}_t(i,v)\,B_t(i,v)\min_{w\in\mathcal{N}(v)}\bigl\{G_t(i,w)\bigr\}\,B_t(j,v)\min_{w\in\mathcal{N}(v)}\bigl\{G_t(j,w)\bigr\}\right]\right]\,,
\end{aligned}$$

which can be rewritten as

$$\begin{aligned}
&\sum_{v\in\mathcal{V}}q(v)\,\frac{\eta e}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\left[\sum_{i,j=1}^{k}\overline{x}_t(i,v)\,B_t(i,v)\min_{w\in\mathcal{N}(v)}\bigl\{G_t(i,w)\bigr\}\,B_t(j,v)\min_{w\in\mathcal{N}(v)}\bigl\{G_t(j,w)\bigr\}\right]\right]\\
&\quad=\sum_{v\in\mathcal{V}}q(v)\,\frac{\eta e}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\left[\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_t(i,v)\,B_t(i,v)\,\tilde{G}_t(i,v)\,B_t(j,v)\,\tilde{G}_t(j,v)\right]\right]\\
&\quad=\sum_{v\in\mathcal{V}}q(v)\,\frac{\eta e}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_t(i,v)\,\mathbb{E}_t\bigl[B_t(i,v)\,B_t(j,v)\bigr]\,\mathbb{E}_t\bigl[\tilde{G}_t(i,v)\bigr]\,\mathbb{E}_t\bigl[\tilde{G}_t(j,v)\bigr]\right]=:(\star)\,,
\end{aligned}$$

where in the first equality we defined $\tilde{G}_t(i,v)=\min_{w\in\mathcal{N}(v)}\bigl\{G_t(i,w)\bigr\}$ and, analogously, $\tilde{G}_t(j,v)=\min_{w\in\mathcal{N}(v)}\bigl\{G_t(j,w)\bigr\}$, while the second follows by the conditional independence of the three terms $\bigl(B_t(i,v),B_t(j,v)\bigr)$, $\tilde{G}_t(i,v)$, and $\tilde{G}_t(j,v)$ given the history up to time $t-1$. Furthermore, making use of Lemmas 1–3 and upper bounding, we get

$$\begin{aligned}
(\star)&=\sum_{v\in\mathcal{V}}q(v)\,\frac{\eta e}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\left[\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{\overline{x}_t(i,v)}{\overline{B}_t(i,v)}\,B_t(i,v)\,\frac{B_t(j,v)}{\overline{B}_t(j,v)}\right]\right]\\
&\leq\frac{\eta ek}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{v\in\mathcal{V}}\frac{\overline{x}_t(i,v)\,q(v)}{\overline{B}_t(i,v)}\right]\leq\frac{\eta ekT}{2\bigl(1-e^{-1}\bigr)}\bigl(k\alpha_1+mQ\bigr)\,,
\end{aligned}$$

where the first equality uses the expected value of the geometric random variables $\tilde{G}$, the first inequality is obtained by neglecting the indicator function $B_t(i,v)$ and taking the conditional expectation of $B_t(j,v)$, and the last inequality follows by a known upper bound involving independence numbers appearing, for example, in Cesa-Bianchi et al. [2019a, b]. For the sake of convenience, we restate this result in Appendix E, Lemma 6. We now consider the last term $(\mathrm{III})$. Since $\ell_t\geq\mathbb{E}_t\bigl[\widehat{\ell}_t(v)\bigr]$ by Lemma 4, we have

$$\begin{aligned}
(\mathrm{III})&=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\Bigl[\Bigl\langle\overline{x}_t(v)-a,\,\ell_t-\widehat{\ell}_t(v)\Bigr\rangle\Bigr]\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_t\Bigl[\Bigl\langle\overline{x}_t(v),\,\ell_t-\widehat{\ell}_t(v)\Bigr\rangle\Bigr]\right]\\
&=\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\ell_t(i)\,\overline{x}_t(i,v)\Biggl(\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)\Biggr)^{\!\beta}\right]\,.
\end{aligned}$$

Multiplying $(\mathrm{III})$ by $q(v)$ and summing over the agents, we now upper bound $\ell_t(i)$ by $1$ and use the facts that $1-x\leq e^{-x}$ for all $x\in[0,1]$ and $e^{-y}\leq 1/y$ for all $y>0$, to obtain

$$\begin{aligned}
&\sum_{v\in\mathcal{V}}q(v)\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\ell_t(i)\,\overline{x}_t(i,v)\Biggl(\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)\Biggr)^{\!\beta}\right]\\
&\quad\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{v\in\mathcal{V}}\overline{x}_t(i,v)\,q(v)\Biggl(\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)\Biggr)^{\!\beta}\right]\\
&\quad=\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{\substack{v\in\mathcal{V}\\ \overline{x}_t(i,v)\,q(v)>0}}\overline{x}_t(i,v)\,q(v)\Biggl(\prod_{w\in\mathcal{N}(v)}\bigl(1-\overline{x}_t(i,w)\,q(w)\bigr)\Biggr)^{\!\beta}\right]\\
&\quad\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{\substack{v\in\mathcal{V}\\ \overline{x}_t(i,v)\,q(v)>0}}\overline{x}_t(i,v)\,q(v)\exp\Biggl(-\beta\sum_{w\in\mathcal{N}(v)}\overline{x}_t(i,w)\,q(w)\Biggr)\right]\\
&\quad\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{\substack{v\in\mathcal{V}\\ \overline{x}_t(i,v)\,q(v)>0}}\frac{\overline{x}_t(i,v)\,q(v)}{\beta\sum_{w\in\mathcal{N}(v)}\overline{x}_t(i,w)\,q(w)}\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\frac{\alpha_1}{\beta}\right]=\frac{\alpha_1\,k\,T}{\beta}\,,
\end{aligned}$$

where the last inequality follows by a known upper bound involving independence numbers appearing, for example, in [Alon et al., 2017, Lemma 10]. For the sake of convenience, we restate this result in Appendix E, Lemma 5. Putting everything together and recalling that $\beta=\bigl\lfloor 1/(k\eta)\bigr\rfloor\geq 1/(2k\eta)$, we can finally conclude that

$$\begin{aligned}
R_T&\leq Q\,\frac{m\bigl(1+\log(k)\bigr)}{\eta}+Q\,\frac{\eta ekT}{2\bigl(1-e^{-1}\bigr)}\left(\frac{k}{Q}\alpha_1+m\right)+\frac{\alpha_1\,k\,T}{\beta}\\
&\leq Q\,\frac{m\bigl(1+\log(k)\bigr)}{\eta}+Q\,\frac{\eta ekT}{2\bigl(1-e^{-1}\bigr)}\left(\frac{k}{Q}\alpha_1+m\right)+2\eta\alpha_1k^2T\\
&=Q\,\frac{m\bigl(1+\log(k)\bigr)}{\eta}+\eta QkT\left(\frac{e}{2\bigl(1-e^{-1}\bigr)}\left(\frac{k}{Q}\alpha_1+m\right)+2\alpha_1\frac{k}{Q}\right)\\
&\leq Q\,\frac{m\bigl(1+\log(k)\bigr)}{\eta}+5\eta QkT\left(\frac{k}{Q}\alpha_1+m\right)\\
&\leq 2Q\sqrt{15\,mkT\log(k)\left(\frac{k}{Q}\alpha_1+m\right)}\,.
\end{aligned}$$

We conclude this section by discussing the computational complexity of our Coop-FTPL algorithm. The next result shows that the total number of elementary operations performed by Coop-FTPL over $T$ time steps scales with $T^{3/2}$ in the worst case. To the best of our knowledge, no known algorithm attains a lower worst-case computational complexity.

Proposition 1.

If the optimization problem (2) can be solved with at most $c\in\mathbb{N}$ elementary operations, the worst-case computational complexity $\gamma_{\text{Coop-FTPL}}$ of each agent $v\in\mathcal{V}$ running our Coop-FTPL algorithm with the optimal tuning (8) for $T$ rounds is

$$\gamma_{\text{Coop-FTPL}}=\mathcal{O}\left(T^{3/2}\,c\,\sqrt{\frac{\alpha_1/Q+1}{m}}\right)\,.$$
Proof.

The result follows immediately by noting that the number of elementary operations performed by each agent $v$ at each time step $t$ is at most

$$c(\beta+1)\leq c\left(\frac{1}{k\eta}+1\right)=c\left(\frac{1}{k}\sqrt{\frac{5kT(k\alpha_1/Q+m)}{2m\log k}}+1\right)=c\left(\sqrt{\frac{5T(\alpha_1/Q+m/k)}{2m\log k}}+1\right)\,.$$

5 Lower bound

In this section we show that no cooperative semi-bandit algorithm can beat the $Q\sqrt{mkT\alpha_1/Q}$ rate. The key idea for constructing the lower bound is simple: if the activation probabilities $q(v)$ are non-zero only for agents $v$ belonging to an independent set with cardinality $\alpha_1$, then the problem reduces to $\alpha_1$ independent instances of single-agent semi-bandits, whose minimax rate is known.

Theorem 2.

For each communication network $\mathcal{G}$ with independence number $\alpha_1$ there exist cooperative semi-bandit instances for which the regret of any learning algorithm satisfies

$$R_T=\Omega\bigl(Q\sqrt{mkT\alpha_1/Q}\bigr)\,.$$
Proof.

Let $\mathcal{W}=\{w_1,\ldots,w_{\alpha_1}\}\subset\mathcal{V}$ be an independent set with cardinality $\alpha_1$. Furthermore, let $q\in(0,1]$ be a positive probability and, for all agents $v\in\mathcal{V}$, let

$$q(v)=q\,\mathbb{I}\{v\in\mathcal{W}\}\,.$$

In words, only agents belonging to an independent set with largest cardinality are activated (with positive probability), and all with the same probability. Thus, only agents in $\mathcal{W}$ contribute to the expected regret, and their total mass $Q=\sum_{v\in\mathcal{V}}q(v)$ is equal to $\alpha_1 q$. Moreover, note that, being non-adjacent, agents in $\mathcal{W}$ never exchange any information. Each agent $w\in\mathcal{W}$ is therefore running an independent single-agent online linear optimization problem with semi-bandit feedback for an average of $qT$ rounds. Since for single-agent semi-bandits the worst-case lower bound on the regret after $T'$ time steps is known to be $\Omega\bigl(\sqrt{mkT'}\bigr)$ (see, e.g., Audibert et al. [2014], Lattimore et al. [2018]) and the cardinality of $\mathcal{W}$ is $\alpha_1$, the regret of any cooperative semi-bandit algorithm run on this instance satisfies

$$R_T=\Omega\bigl(\alpha_1\sqrt{mk\,qT}\bigr)=\Omega\bigl(\alpha_1q\sqrt{mkT/q}\bigr)=\Omega\bigl(Q\sqrt{mkT\alpha_1/Q}\bigr)\,,$$

where we used $Q=\alpha_1q$. This concludes the proof. ∎

In the previous section we showed that the expected regret of our Coop-FTPL algorithm can always be upper bounded by $Q\sqrt{mkT\log(k)(k\alpha_1/Q+m)}$ (ignoring constants). Thus, Theorem 2 shows that, up to the additive $m$ term inside the rightmost bracket, the regret of Coop-FTPL is at most $\sqrt{k\log k}$ away from the minimax optimal rate.

6 Conclusions and open problems

Motivated by spatially distributed large-scale learning systems, we introduced a new cooperative setting for adversarial semi-bandits in which only some of the agents are active at any given time step. We designed and analyzed an efficient algorithm that we called Coop-FTPL, for which we proved near-optimal regret guarantees with state-of-the-art computational complexity. Our analysis relies on the fact that agents are aware of their activation probabilities and have some prior knowledge about the connectivity of the graph. Two interesting new lines of research are investigating whether either of these assumptions could be lifted while retaining low regret and good computational complexity. In particular, removing the need for prior knowledge of the independence number would represent a significant theoretical and practical improvement, given that computing $\alpha_1$ is NP-hard in the worst case. It is however unclear whether existing techniques that address this problem in similar settings (e.g., Cesa-Bianchi et al. [2019b]) would yield any results in our general case. We believe that entirely new ideas will be required to deal with this issue. We leave these intriguing problems open for future work.

Acknowledgements

This project has received funding from the French “Investing for the Future – PIA3” program under the Grant agreement ANITI ANR-19-PI3A-0004 (https://aniti.univ-toulouse.fr/).

References

  • Neu and Bartók [2013] Gergely Neu and Gábor Bartók. An efficient algorithm for learning with semi-bandit feedback. In International Conference on Algorithmic Learning Theory, pages 234–248. Springer, 2013.
  • Hannan [1957] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • Kalai and Vempala [2005] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
  • Grötschel et al. [2012] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • Lee et al. [2018] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pages 1292–1294. PMLR, 2018.
  • Audibert et al. [2014] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
  • Suehiro et al. [2012] Daiki Suehiro, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Kiyohito Nagano. Online prediction under submodular constraints. In International Conference on Algorithmic Learning Theory, pages 260–274. Springer, 2012.
  • Koolen et al. [2010] Wouter M Koolen, Manfred K Warmuth, Jyrki Kivinen, et al. Hedging structured concepts. In COLT, pages 93–105. Citeseer, 2010.
  • Cesa-Bianchi and Lugosi [2012] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
  • Awerbuch and Kleinberg [2008] Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.
  • Cesa-Bianchi et al. [2019a] Nicolò Cesa-Bianchi, Tommaso R Cesari, and Claire Monteleoni. Cooperative online learning: Keeping your neighbors updated. arXiv preprint arXiv:1901.08082, 2019a.
  • Cesa-Bianchi et al. [2019b] Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research, 20(1):613–650, 2019b.
  • Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Martínez-Rubio et al. [2019] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 4529–4540. Curran Associates, Inc., 2019.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Alon et al. [2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
  • Lattimore et al. [2018] Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. Toprank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3945–3954, 2018.

Appendix A Legendre functions and Fenchel conjugates

In this section, we briefly recall a few known definitions and facts in convex analysis.

Definition 1 (Interior, boundary, and convex hull).

For any subset $E$ of $\mathbb{R}^k$, we denote its topological interior by $\mathrm{int}(E)$, its boundary by $\partial E$, and its convex hull by $\mathrm{co}(E)$.

Definition 2 (Effective domain).

The effective domain of a convex function $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ is

$$\mathrm{dom}(F):=\bigl\{x\in\mathbb{R}^k:F(x)<+\infty\bigr\}\,.\qquad(12)$$

With a slight abuse of notation, we will denote with the same symbol $f$ a convex function $f\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ and its restriction $\widetilde{f}\colon\mathrm{dom}(f)\to\mathbb{R}$ to its effective domain.

Definition 3 (Legendre function).

A convex function $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ is Legendre if

  1. $\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$ is non-empty;

  2. $F$ is differentiable and strictly convex on $\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$;

  3. for all $x_0\in\partial\bigl[\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)\bigr]$, if $x\in\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$ and $x\to x_0$, then $\lVert\nabla F(x)\rVert_2\to+\infty$.
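As a standard illustration (ours, not taken from the paper): the unnormalized negative entropy is Legendre, with explicit conjugate and gradient maps:

% Unnormalized negentropy, dom(F) = [0, +infty)^k, with 0 log 0 := 0:
F(x) = \sum_{i=1}^{k} \bigl( x_i \log x_i - x_i \bigr), \qquad
\nabla F(x)_i = \log x_i \longrightarrow -\infty \ \text{as } x_i \to 0^{+},
% so conditions 1-3 all hold; moreover,
F^{*}(z) = \sum_{i=1}^{k} e^{z_i}, \qquad \nabla F^{*}(z)_i = e^{z_i} = (\nabla F)^{-1}(z)_i .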

Definition 4 (Fenchel conjugate).

Let $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ be a convex function. The Fenchel conjugate $F^*$ of $F$ is defined as the function

$$F^*\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}\,,\qquad z\mapsto F^*(z):=\sup_{x\in\mathbb{R}^k}\bigl(\langle x,z\rangle-F(x)\bigr)\,.$$
Definition 5 (Bregman divergence).

Let $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ be a convex function with non-empty $\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$ that is differentiable on $\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$. The Bregman divergence induced by $F$ is

$$\mathcal{B}_F\colon\mathbb{R}^k\times\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)\to\mathbb{R}\cup\{+\infty\}\,,\qquad(x,y)\mapsto\mathcal{B}_F(x,y):=F(x)-F(y)-\bigl\langle\nabla F(y),x-y\bigr\rangle\,.$$

The following results are taken from [Lattimore and Szepesvári, 2020, Theorem 26.6 and Corollary 26.8].

Theorem 3.

Let $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ be a Legendre function. Then:

  1. the Fenchel conjugate $F^*$ of $F$ is Legendre;

  2. $\nabla F\colon\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)\to\mathrm{int}\bigl(\mathrm{dom}(F^*)\bigr)$ is bijective, with inverse $(\nabla F)^{-1}=\nabla F^*$;

  3. $\mathcal{B}_F(x,y)=\mathcal{B}_{F^*}\bigl(\nabla F(y),\nabla F(x)\bigr)$ for all $x,y\in\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$.

Corollary 1.

If $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ is a Legendre function and $x\in\operatorname*{argmin}_{x\in\mathrm{dom}(F)}F(x)$, then $x\in\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)$.

Appendix B Online Stochastic Mirror Descent (OSMD)

In this section, we briefly recall the standard Online Stochastic Mirror Descent algorithm (OSMD) (Algorithm 3) and its analysis.

For an overview of some basic convex analysis definitions and results, we refer the reader to Appendix A. For a convex function $F\colon\mathbb{R}^k\to\mathbb{R}\cup\{+\infty\}$ that is differentiable on the non-empty interior $\mathrm{int}\bigl(\mathrm{dom}(F)\bigr)\neq\varnothing$ of its effective domain $\mathrm{dom}(F)$, we denote by $\mathcal{B}_F$ the Bregman divergence induced by $F$ (Definition 5). Following the existing convention, we refer to the input function $F$ of OSMD as a potential. Furthermore, given a measure $P$ on a subset of $\mathbb{R}^k$, we say that a vector $x\in\mathbb{R}^k$ is the mean of the measure $P$ if $x$ is the component-wise expectation of an $\mathbb{R}^k$-valued random variable with distribution $P$. For any time step $t\in\{1,2,\ldots\}$, we denote by $\mathbb{E}_t$ the expectation conditioned on the history up to and including round $t-1$.

Input: Legendre potential F:k{+}F\colon\mathbb{R}^{k}\to\mathbb{R}\cup\{+\infty\}, compact action set 𝒜k\mathcal{A}\subset\mathbb{R}^{k} with int(dom(F))co(𝒜)\mathrm{int}\bigl{(}\mathrm{dom}(F)\bigr{)}\cap\mathrm{co}(\mathcal{A})\neq\varnothing, learning rate η>0\eta>0
Initialization: x¯1=argminxdom(F)co(𝒜)F(x)\overline{x}_{1}=\operatorname*{argmin}_{x\in\mathrm{dom}(F)\cap\mathrm{co}(\mathcal{A})}F(x)
1 for t=1,2,t=1,2,\ldots do
2      choose a measure PtP_{t} on 𝒜\mathcal{A} with mean x¯t\overline{x}_{t}
3      make a prediction xtx_{t} drawn from 𝒜\mathcal{A} according to PtP_{t} and suffer the loss xt,t\left\langle x_{t},\ell_{t}\right\rangle
4      compute an estimate ^t\widehat{\ell}_{t} of the loss vector t\ell_{t}
5      make the update: x¯t+1=argminxdom(F)co(𝒜)ηx,^t+F(x,x¯t)\overline{x}_{t+1}=\operatorname*{argmin}_{x\in\mathrm{dom}(F)\cap\mathrm{co}(\mathcal{A})}\eta\bigl{\langle}x,\widehat{\ell}_{t}\bigr{\rangle}+\mathcal{B}_{F}(x,\overline{x}_{t})
Algorithm 3 Online Stochastic Mirror Descent (OSMD)

Since co(𝒜)\mathrm{co}(\mathcal{A}) is convex and compact, int(dom(F))co(𝒜)\mathrm{int}\bigl{(}\mathrm{dom}(F)\bigr{)}\cap\mathrm{co}(\mathcal{A})\neq\varnothing, and FF is Legendre, all the argmin\operatorname*{argmin}’s in Algorithm 3 exist and x¯tint(dom(F))co(𝒜)\overline{x}_{t}\in\mathrm{int}\bigl{(}\mathrm{dom}(F)\bigr{)}\cap\mathrm{co}(\mathcal{A}) for all t{1,2,}t\in\{1,2,\ldots\} (see, e.g., [Lattimore and Szepesvári, 2020, Exercise 28.2]).
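
To fix ideas, here is a minimal runnable sketch of OSMD (ours, purely illustrative), under the simplifying assumptions that the potential is the unnormalized negative entropy, that 𝒜\mathcal{A} is the set of standard basis vectors of k\mathbb{R}^{k} (so that co(𝒜)\mathrm{co}(\mathcal{A}) is the simplex and the update at line 5 has the closed-form multiplicative-weights solution), and that the loss estimate at line 4 is simply the full-information t\ell_{t}; the loss estimation procedure of the main text would be plugged in at that step.

import numpy as np

rng = np.random.default_rng(0)
k, T, eta = 5, 2000, 0.05
losses = rng.uniform(size=(T, k))    # oblivious loss sequence in [0, 1]^k

x_bar = np.ones(k) / k               # minimizer of neg-entropy on the simplex
total = 0.0
for t in range(T):
    i = rng.choice(k, p=x_bar)       # P_t on the corners e_1..e_k has mean x_bar
    total += losses[t, i]
    ell_hat = losses[t]              # full-information estimate (simplification)
    x_bar = x_bar * np.exp(-eta * ell_hat)  # mirror step in closed form
    x_bar /= x_bar.sum()

best = losses.sum(axis=0).min()
print(total - best, np.sqrt(2 * T * np.log(k)))  # regret vs. O(sqrt(T log k))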

The following result is taken from [Lattimore and Szepesvári, 2020, Theorem 28.10] and gives an upper bound on the regret of OSMD.

Theorem 4.

Suppose that OSMD (Algorithm 3) is run with input F,𝒜,ηF,\mathcal{A},\eta. If the estimates ^t\widehat{\ell}_{t} computed at line 4 of Algorithm 3 satisfy 𝔼t[^t]=t\mathbb{E}_{t}\bigl{[}\widehat{\ell}_{t}\bigr{]}=\ell_{t} for all t{1,2,}t\in\{1,2,\ldots\}, then, for all xco(𝒜)x\in\mathrm{co}(\mathcal{A}),

𝔼[t=1Tx¯t,tt=1Tx,t]𝔼[F(x)F(x¯1)η+t=1Tx¯tx¯t+1,^t1ηt=1TF(x¯t+1,x¯t)].\mathbb{E}\Biggl{[}\sum_{t=1}^{T}\left\langle\overline{x}_{t},\ell_{t}\right\rangle-\sum_{t=1}^{T}\left\langle x,\ell_{t}\right\rangle\Biggr{]}\leq\mathbb{E}\left[\frac{F(x)-F(\overline{x}_{1})}{\eta}+\sum_{t=1}^{T}\bigl{\langle}\overline{x}_{t}-\overline{x}_{t+1},\widehat{\ell}_{t}\bigr{\rangle}-\frac{1}{\eta}\sum_{t=1}^{T}\mathcal{B}_{F}(\overline{x}_{t+1},\overline{x}_{t})\right]\;.

Furthermore, letting

x~t+1=argminxdom(F)ηx,^t+F(x,x¯t)\widetilde{x}_{t+1}=\operatorname*{argmin}_{x\in\mathrm{dom}(F)}\eta\bigl{\langle}x,\widehat{\ell}_{t}\bigr{\rangle}+\mathcal{B}_{F}(x,\overline{x}_{t})

and assuming that η^t+F(x)F(dom(F))-\eta\widehat{\ell}_{t}+\nabla F(x)\in\nabla F\bigl{(}\mathrm{dom}(F)\bigr{)} for all xco(𝒜)x\in\mathrm{co}(\mathcal{A}) almost surely, we have

supxco(𝒜)𝔼[t=1Tx¯t,tt=1Tx,t]diamF(co(𝒜))η+1ηt=1T𝔼[F(x¯t,x~t+1)],\sup_{x\in\mathrm{co}(\mathcal{A})}\mathbb{E}\Biggl{[}\sum_{t=1}^{T}\left\langle\overline{x}_{t},\ell_{t}\right\rangle-\sum_{t=1}^{T}\left\langle x,\ell_{t}\right\rangle\Biggr{]}\leq\frac{\mathrm{diam}_{F}\bigl{(}\mathrm{co}(\mathcal{A})\bigr{)}}{\eta}+\frac{1}{\eta}\sum_{t=1}^{T}\mathbb{E}\bigl{[}\mathcal{B}_{F}(\overline{x}_{t},\widetilde{x}_{t+1})\bigr{]}\;,

where diamF(co(𝒜)):=supx,yco(𝒜)(F(x)F(y))\mathrm{diam}_{F}\bigl{(}\mathrm{co}(\mathcal{A})\bigr{)}:=\sup_{x,y\in\mathrm{co}(\mathcal{A})}\bigl{(}F(x)-F(y)\bigr{)} is the diameter of co(𝒜)\mathrm{co}(\mathcal{A}) with respect to FF.

Appendix C Proofs of lemmas on geometric distributions

In this section, we provide the missing proofs of the lemmas on geometric distributions stated in Section 4.

Proof of Lemma 1.

For all j{1,,m}j\in\{1,\ldots,m\}, the cumulative distribution function (c.d.f.) of YjY_{j} is given, for all nn\in\mathbb{N}, by

[Yjn]=pji=1n(1pj)i1=1(1pj)n.\mathbb{P}\left[Y_{j}\leq n\right]=p_{j}\sum_{i=1}^{n}\left(1-p_{j}\right)^{i-1}=1-\left(1-p_{j}\right)^{n}\;.

The c.d.f. of ZZ is given, for all nn\in\mathbb{N}, by

[Zn]\displaystyle\mathbb{P}\left[Z\leq n\right] =[minj{1,,m}Yjn]=1j=1m[Yj>n]=1j=1m(1[Yjn])\displaystyle=\mathbb{P}\left[\min_{j\in\{1,\ldots,m\}}Y_{j}\leq n\right]=1-\prod_{j=1}^{m}\mathbb{P}\left[Y_{j}>n\right]=1-\prod_{j=1}^{m}\left(1-\mathbb{P}\left[Y_{j}\leq n\right]\right)
=1j=1m(1(1(1pj)n))=1(j=1m(1pj))n\displaystyle=1-\prod_{j=1}^{m}\left(1-\left(1-\left(1-p_{j}\right)^{n}\right)\right)=1-\left(\prod_{j=1}^{m}\left(1-p_{j}\right)\right)^{n}
=1(1[1j=1m(1pj)])n,\displaystyle=1-\left(1-\left[1-\prod_{j=1}^{m}\left(1-p_{j}\right)\right]\right)^{n}\;,

and this is the c.d.f. of a geometric random variable with parameter 1j=1m(1pj)1-\prod_{j=1}^{m}\left(1-p_{j}\right). ∎
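
As a quick Monte Carlo sanity check of Lemma 1 (ours, illustrative only):

import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.15, 0.3, 0.05])            # parameters p_1, ..., p_m
q = 1 - np.prod(1 - p)                     # parameter claimed by Lemma 1

Z = rng.geometric(p, size=(200_000, p.size)).min(axis=1)
print(Z.mean(), 1 / q)                     # empirical vs. exact mean
print((Z <= 3).mean(), 1 - (1 - q) ** 3)   # empirical vs. exact c.d.f. at n = 3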

Proof of Lemma 2.

By elementary calculations,

𝔼[min{G,β}]\displaystyle\mathbb{E}\left[\min\left\{G,\beta\right\}\right] =n=1n(1q)n1qn=β(nβ)(1q)n1q\displaystyle=\sum_{n=1}^{\infty}n\left(1-q\right)^{n-1}q-\sum_{n=\beta}^{\infty}\left(n-\beta\right)\left(1-q\right)^{n-1}q
=n=1n(1q)n1q(1q)βn=β(nβ)(1q)nβ1q\displaystyle=\sum_{n=1}^{\infty}n\left(1-q\right)^{n-1}q-\left(1-q\right)^{\beta}\sum_{n=\beta}^{\infty}\left(n-\beta\right)\left(1-q\right)^{n-\beta-1}q
=(1(1q)β)n=1n(1q)n1q=(1(1q)β)q.\displaystyle=\left(1-\left(1-q\right)^{\beta}\right)\sum_{n=1}^{\infty}n\left(1-q\right)^{n-1}q=\frac{\left(1-\left(1-q\right)^{\beta}\right)}{q}\;. ∎
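
The identity is again easy to verify numerically (our sketch):

import numpy as np

rng = np.random.default_rng(2)
q, beta = 0.2, 7
G = rng.geometric(q, size=500_000)         # geometric with parameter q
print(np.minimum(G, beta).mean())          # empirical E[min{G, beta}]
print((1 - (1 - q) ** beta) / q)           # closed form from Lemma 2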

Proof of Lemma 3.

The statement follows immediately from the fact that {Xs(v)Ys(v)}s,v𝒱\bigl{\{}X_{s}(v)\,Y_{s}(v)\bigr{\}}_{s\in\mathcal{I},v\in\mathcal{V}} is a collection of independent Bernoulli random variables with expectation 𝔼[Xs(v)Ys(v)]=p1(v)p2(v)\mathbb{E}\bigl{[}X_{s}(v)\,Y_{s}(v)\bigr{]}=p_{1}(v)\,p_{2}(v) for all ss\in\mathcal{I} and all v𝒱v\in\mathcal{V}. ∎

Appendix D Proof of Theorem 1

In this section, we present a complete proof of Theorem 1.

Proof of Theorem 1.

For the sake of convenience, we define the expected individual regret of an agent v𝒱v\in\mathcal{V} in the network with respect to a fixed action a𝒜a\in\mathcal{A} by

RT(a,v):=𝔼[t=1Txt(v),tt=1Ta,t],R_{T}(a,v):=\mathbb{E}\left[\sum_{t=1}^{T}\left\langle x_{t}(v),\ell_{t}\right\rangle-\sum_{t=1}^{T}\left\langle a,\ell_{t}\right\rangle\right]\;,

where the expectation is taken with respect to the internal randomization of the agent, but not its activation probability q(v)q(v). With this definition the total regret on the network in Eq. (4) can be decomposed as

RT\displaystyle R_{T} =maxa𝒜𝔼[t=1Tv𝒮t(xt(v),ta,t)]=maxa𝒜𝔼[t=1T𝔼t[v𝒮t(xt(v),ta,t)]]\displaystyle=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{v\in\mathcal{S}_{t}}\Bigl{(}\bigl{\langle}x_{t}(v),\ell_{t}\bigr{\rangle}-\left\langle a,\ell_{t}\right\rangle\Bigr{)}\right]=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{v\in\mathcal{S}_{t}}\Bigl{(}\bigl{\langle}x_{t}(v),\ell_{t}\bigr{\rangle}-\left\langle a,\ell_{t}\right\rangle\Bigr{)}\right]\right]
=maxa𝒜𝔼[t=1Tv𝒱q(v)𝔼t[xt(v),ta,t]]=maxa𝒜v𝒱q(v)RT(a,v).\displaystyle=\max_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{v\in\mathcal{V}}q(v)\mathbb{E}_{t}\Bigl{[}\bigl{\langle}x_{t}(v),\ell_{t}\bigr{\rangle}-\left\langle a,\ell_{t}\right\rangle\Bigr{]}\right]=\max_{a\in\mathcal{A}}\sum_{v\in\mathcal{V}}q(v)R_{T}(a,v)\;. (13)
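
The key step in this decomposition is the identity \mathbb{E}_{t}\bigl{[}\sum_{v\in\mathcal{S}_{t}}c(v)\bigr{]}=\sum_{v\in\mathcal{V}}q(v)c(v), which holds by linearity of expectation whenever each agent vv is active at time tt with probability q(v)q(v). A quick Monte Carlo check (ours; activations are drawn independently here purely for simplicity):

import numpy as np

rng = np.random.default_rng(6)
n_agents, trials = 8, 200_000
q = rng.random(n_agents) * 0.9                # activation probabilities q(v)
c = rng.random(n_agents)                      # any fixed per-agent quantity

active = rng.random((trials, n_agents)) < q   # rows are realizations of S_t
print((active * c).sum(axis=1).mean())        # empirical E[sum_{v in S_t} c(v)]
print(np.dot(q, c))                           # sum_v q(v) c(v)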

The proof then proceeds by isolating the bias in the loss estimators. For each a𝒜a\in\mathcal{A} we get

R_{T}(a,v)\\ \begin{aligned} &=\mathbb{E}\left[\sum_{t=1}^{T}\left\langle x_{t}(v)-a,\ell_{t}\right\rangle\right]=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\Bigl{[}\left\langle x_{t}(v)-a,\ell_{t}\right\rangle\Bigr{]}\right]=\mathbb{E}\left[\sum_{t=1}^{T}\left\langle\overline{x}_{t}(v)-a,\ell_{t}\right\rangle\right]\\ &=\mathbb{E}\left[\sum_{t=1}^{T}\left\langle\overline{x}_{t}(v)-a,\widehat{\ell}_{t}(v)\right\rangle\right]+\mathbb{E}\left[\sum_{t=1}^{T}\left\langle\overline{x}_{t}(v)-a,\ell_{t}-\widehat{\ell}_{t}(v)\right\rangle\right]\\ &\leq\underbrace{\frac{F(\overline{x}_{1}(v))-F(a)}{\eta}}_{(\mathrm{I})}+\underbrace{\mathbb{E}\left[\frac{1}{\eta}\sum_{t=1}^{T}\mathcal{B}_{F}\left(\overline{x}_{t}(v),\overline{x}_{t+1}(v)\right)\right]}_{(\mathrm{II})}+\underbrace{\mathbb{E}\left[\sum_{t=1}^{T}\left\langle\overline{x}_{t}(v)-a,\ell_{t}-\widehat{\ell}_{t}(v)\right\rangle\right]}_{(\mathrm{III})}\end{aligned}

where the inequality follows from the standard analysis of OSMD (Appendix B). We bound the three terms separately. For the first term (I)(\mathrm{I}), we have

F(a)\displaystyle F(a) =supxk(a,xF(x))=supxk(a,x𝔼[maxa𝒜a,x+Z])\displaystyle=\sup_{x\in\mathbb{R}^{k}}\bigl{(}\left\langle a,x\right\rangle-F^{*}(x)\bigr{)}=\sup_{x\in\mathbb{R}^{k}}\left(\left\langle a,x\right\rangle-\mathbb{E}\left[\max_{a^{\prime}\in\mathcal{A}}\left\langle a^{\prime},x+Z\right\rangle\right]\right)
𝔼[maxa𝒜a,Z]m𝔼[Z]=mi=1k1im(1+log(k)),\displaystyle\geq-\mathbb{E}\left[\max_{a^{\prime}\in\mathcal{A}}\left\langle a^{\prime},Z\right\rangle\right]\geq-m\mathbb{E}\left[\left\|Z\right\|_{\infty}\right]=-m\sum_{i=1}^{k}\frac{1}{i}\geq-m\left(1+\log(k)\right), (14)

where the first inequality follows by choosing x=0x=0, the second follows from Hölder’s inequality and a1m\left\|a\right\|_{1}\leq m for any a𝒜a\in\mathcal{A}, and the last equality is Exercise 30.4 in Lattimore and Szepesvári [2020]. It follows that

F(x¯1(v))F(a)m(1+log(k)),F\bigl{(}\overline{x}_{1}(v)\bigr{)}-F(a)\leq m\bigl{(}1+\log(k)\bigr{)}\;,

where we use two facts: first, F(a)0F(a)\leq 0 for all a𝒜a\in\mathcal{A}, which follows from the first line of Eq. (14) because, by the convexity of the maximum of a finite number of linear functions, Jensen’s inequality, and the fact that the random variable ZZ is centered, F^{*}(x)\geq\max_{a^{\prime}\in\mathcal{A}}\left\langle a^{\prime},x\right\rangle\geq\left\langle a,x\right\rangle for all x\in\mathbb{R}^{k}; second, x¯1(v)\overline{x}_{1}(v) minimizes FF over dom(F)co(𝒜)\mathrm{dom}(F)\cap\mathrm{co}(\mathcal{A}), a set that contains 𝒜\mathcal{A}, so that F(\overline{x}_{1}(v))\leq F(a)\leq 0 and the claim follows from Eq. (14). Thus

(I)m(1+logk)η.(\mathrm{I})\leq\frac{m(1+\log k)}{\eta}\;.
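
The identity \mathbb{E}\left\|Z\right\|_{\infty}=\sum_{i=1}^{k}1/i used above can be checked numerically (our sketch, assuming ZZ has i.i.d. standard exponential coordinates, the distribution for which this identity holds):

import numpy as np

rng = np.random.default_rng(3)
k = 50
Z = rng.exponential(size=(100_000, k))     # i.i.d. standard exponential coords
print(Z.max(axis=1).mean())                # empirical E[ ||Z||_inf ]
print(np.sum(1 / np.arange(1, k + 1)))     # harmonic sum H_k
print(1 + np.log(k))                       # the bound H_k <= 1 + log(k)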

We now study the second term (II).(\mathrm{II}). We have

F(x¯t(v),x¯t+1(v))\displaystyle\mathcal{B}_{F}\left(\overline{x}_{t}(v),\overline{x}_{t+1}(v)\right) =F(F(x¯t+1(v)),F(x¯t(v)))\displaystyle=\mathcal{B}_{F^{*}}\left(\nabla F\left(\overline{x}_{t+1}(v)\right),\nabla F\left(\overline{x}_{t}(v)\right)\right)
=F(ηL^t1(v)η^t(v),ηL^t1(v))\displaystyle=\mathcal{B}_{F^{*}}\left(-\eta\widehat{L}_{t-1}(v)-\eta\widehat{\ell}_{t}(v),-\eta\widehat{L}_{t-1}(v)\right)
=η22^t(v)2F(ξ(v))2,\displaystyle=\frac{\eta^{2}}{2}\left\|\widehat{\ell}_{t}(v)\right\|_{\nabla^{2}F^{*}(\xi(v))}^{2}\;, (15)

where the first equality is a standard property of the Bregman divergence, the second follows from the definitions of the updates, and the last from Taylor’s theorem, where ξ(v)=ηL^t1(v)αη^t(v)\xi(v)=-\eta\widehat{L}_{t-1}(v)-\alpha\eta\widehat{\ell}_{t}(v), for some α[0,1]\alpha\in[0,1]. To calculate the Hessian, we use a change of variable to avoid applying the gradient to the (possibly) non-differentiable argmax\operatorname*{argmax}, and we get:

2F(x)\displaystyle\nabla^{2}F^{*}(x) =(F(x))=𝔼[h(x+Z)]=kh(x+z)ζ(z)𝑑z\displaystyle=\nabla\left(\nabla F^{*}(x)\right)=\nabla\mathbb{E}[h(x+Z)]=\nabla\int_{\mathbb{R}^{k}}h(x+z)\zeta(z)dz
=kh(u)ζ(ux)𝑑u=kh(u)(ζ(ux))𝑑u\displaystyle=\nabla\int_{\mathbb{R}^{k}}h(u)\zeta(u-x)du=\int_{\mathbb{R}^{k}}h(u)(\nabla\zeta(u-x))^{\top}du
=kh(u)sign(ux)ζ(ux)du=kh(x+z)sign(z)ζ(z)dz\displaystyle=\int_{\mathbb{R}^{k}}h(u)\mathrm{sign}(u-x)^{\top}\zeta(u-x)du=\int_{\mathbb{R}^{k}}h(x+z)\mathrm{sign}(z)^{\top}\zeta(z)dz\;.

Using the definition of ξ(v)\xi(v) and the fact that h(x)h(x) is nonnegative,

2F(ξ(v))ij\displaystyle\nabla^{2}F^{*}(\xi(v))_{ij} =kh(ξ(v)+z)isign(z)jζ(z)𝑑z\displaystyle=\int_{\mathbb{R}^{k}}h(\xi(v)+z)_{i}\mathrm{sign}(z)_{j}\zeta(z)dz
kh(ξ(v)+z)iζ(z)𝑑z\displaystyle\leq\int_{\mathbb{R}^{k}}h(\xi(v)+z)_{i}\zeta(z)dz
=kh(zηL^t1(v)αη^t(v))iζ(z)𝑑z\displaystyle=\int_{\mathbb{R}^{k}}h\left(z-\eta\widehat{L}_{t-1}(v)-\alpha\eta\widehat{\ell}_{t}(v)\right)_{i}\zeta(z)dz
=kh(uηL^t1(v))iζ(u+αη^t(v))𝑑u\displaystyle=\int_{\mathbb{R}^{k}}h\left(u-\eta\widehat{L}_{t-1}(v)\right)_{i}\zeta\left(u+\alpha\eta\widehat{\ell}_{t}(v)\right)du
exp(αη^t(v)1)kh(uηL^t1(v))iζ(u)𝑑u\displaystyle\leq\exp\left(\left\|\alpha\eta\widehat{\ell}_{t}(v)\right\|_{1}\right)\int_{\mathbb{R}^{k}}h\left(u-\eta\widehat{L}_{t-1}(v)\right)_{i}\zeta(u)du
\displaystyle\leq\exp\left(\alpha\eta\beta\sum_{l=1}^{k}B_{t}(l,v)\right)\overline{x}_{t}(i,v)
exp(αηkβ)x¯t(i,v)\displaystyle\leq\exp\left(\alpha\eta k\beta\right)\overline{x}_{t}(i,v)
ex¯t(i,v)\displaystyle\leq e\,\overline{x}_{t}(i,v)

where the last inequality follows by α1\alpha\leq 1 and β1/(ηk)\beta\leq 1/(\eta k). Plugging this estimate in Eq. (15) yields

η22^t(v)2F(ξ(v))2\displaystyle\frac{\eta^{2}}{2}\left\|\widehat{\ell}_{t}(v)\right\|_{\nabla^{2}F^{*}(\xi(v))}^{2} =η22i=1kj=1k2F(ξ(v))i,j^t(i,v)^t(j,v)\displaystyle=\frac{\eta^{2}}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\nabla^{2}F^{*}\bigl{(}\xi(v)\bigr{)}_{i,j}\widehat{\ell}_{t}(i,v)\widehat{\ell}_{t}(j,v)
η2e2i=1kj=1kx¯t(i,v)^t(i,v)^t(j,v)\displaystyle\leq\frac{\eta^{2}e}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_{t}(i,v)\widehat{\ell}_{t}(i,v)\widehat{\ell}_{t}(j,v)
η2e2i=1kj=1kx¯t(i,v)Bt(i,v)minw𝒩(v){Gt(i,w)}Bt(j,v)minw𝒩(v){Gt(j,w)},\displaystyle\leq\frac{\eta^{2}e}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_{t}(i,v)B_{t}(i,v)\min_{w\in\mathcal{N}(v)}\left\{G_{t}(i,w)\right\}B_{t}(j,v)\min_{w\in\mathcal{N}(v)}\left\{G_{t}(j,w)\right\},

where the last inequality follows by neglecting the truncation with β\beta. Hence multiplying (II)(\mathrm{II}) by q(v)q(v) and summing over v𝒱v\in\mathcal{V} yields

v𝒱q(v)𝔼[η2t=1T^t(v)2F(ξ(v))2]=v𝒱q(v)η2𝔼[t=1T𝔼t[^t(v)2F(ξ(v))2]]\displaystyle\sum_{v\in\mathcal{V}}q(v)\mathbb{E}\left[\frac{\eta}{2}\sum_{t=1}^{T}\left\|\widehat{\ell}_{t}(v)\right\|_{\nabla^{2}F^{*}(\xi(v))}^{2}\right]=\sum_{v\in\mathcal{V}}q(v)\frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\left\|\widehat{\ell}_{t}(v)\right\|_{\nabla^{2}F^{*}(\xi(v))}^{2}\right]\right]
v𝒱q(v)ηe2𝔼[t=1T𝔼t[i,j=1kx¯t(i,v)Bt(i,v)minw𝒩(v){Gt(i,w)}Bt(j,v)minw𝒩(v){Gt(j,w)}]],\displaystyle\hskip 2.94753pt\leq\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{i,j=1}^{k}\overline{x}_{t}(i,v)B_{t}(i,v)\min_{w\in\mathcal{N}(v)}\left\{G_{t}(i,w)\right\}B_{t}(j,v)\min_{w\in\mathcal{N}(v)}\left\{G_{t}(j,w)\right\}\right]\right]\;,

which can be rewritten as

v𝒱q(v)ηe2𝔼[t=1T𝔼t[i,j=1kx¯t(i,v)Bt(i,v)minw𝒩(v){Gt(i,w)}Bt(j,v)minw𝒩(v){Gt(j,w)}]]=v𝒱q(v)ηe2𝔼[t=1T𝔼t[i=1kj=1kx¯t(i,v)Bt(i,v)Gt~(i,v)Bt(j,v)G~t(j,v)]]=v𝒱q(v)ηe2𝔼[t=1Ti=1kj=1kx¯t(i,v)𝔼t[Bt(i,v)Bt(j,v)]𝔼t[Gt~(i,v)]𝔼t[G~t(j,v)]]=:(),\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{i,j=1}^{k}\overline{x}_{t}(i,v)B_{t}(i,v)\min_{w\in\mathcal{N}(v)}\left\{G_{t}(i,w)\right\}B_{t}(j,v)\min_{w\in\mathcal{N}(v)}\left\{G_{t}(j,w)\right\}\right]\right]\\ \begin{aligned} &=\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_{t}(i,v)B_{t}(i,v)\tilde{G_{t}}(i,v)B_{t}(j,v)\tilde{G}_{t}(j,v)\right]\right]\\ &=\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{j=1}^{k}\overline{x}_{t}(i,v)\mathbb{E}_{t}\left[B_{t}(i,v)B_{t}(j,v)\right]\mathbb{E}_{t}\left[\tilde{G_{t}}(i,v)\right]\mathbb{E}_{t}\left[\tilde{G}_{t}(j,v)\right]\right]=:\left(\star\right)\;,\end{aligned}

where in the first equality we defined G~t(i,v)=minw𝒩(v){Gt(i,w)}\tilde{G}_{t}(i,v)=\min_{w\in\mathcal{N}(v)}\bigl{\{}G_{t}(i,w)\bigr{\}} and, analogously, G~t(j,v)=minw𝒩(v){Gt(j,w)}\tilde{G}_{t}(j,v)=\min_{w\in\mathcal{N}(v)}\bigl{\{}G_{t}(j,w)\bigr{\}}, while the second follows by the conditional independence of the three terms (Bt(i,v),Bt(j,v))\bigl{(}B_{t}(i,v),B_{t}(j,v)\bigr{)}, Gt~(i,v)\tilde{G_{t}}(i,v), and G~t(j,v)\tilde{G}_{t}(j,v) given the history up to time t1t-1. Furthermore, making use of Lemmas 13, we get

()\displaystyle\left(\star\right) =v𝒱q(v)ηe2𝔼[t=1T𝔼t[i=1kj=1kx¯t(i,v)B¯t(i,v)Bt(i,v)Bt(j,v)B¯t(j,v)]]\displaystyle=\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{\overline{x}_{t}(i,v)}{\overline{B}_{t}(i,v)}B_{t}(i,v)\frac{B_{t}(j,v)}{\overline{B}_{t}(j,v)}\right]\right]
v𝒱q(v)ηe2𝔼[t=1T𝔼t[i=1kj=1kx¯t(i,v)B¯t(i,v)Bt(j,v)B¯t(j,v)]]\displaystyle\leq\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{\overline{x}_{t}(i,v)}{\overline{B}_{t}(i,v)}\frac{B_{t}(j,v)}{\overline{B}_{t}(j,v)}\right]\right]
=v𝒱q(v)ηe2𝔼[t=1Ti=1kj=1kx¯t(i,v)B¯t(i,v)B¯t(j,v)B¯t(j,v)]\displaystyle=\sum_{v\in\mathcal{V}}q(v)\frac{\eta e}{2}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{\overline{x}_{t}(i,v)}{\overline{B}_{t}(i,v)}\frac{\cancel{\overline{B}_{t}(j,v)}}{\cancel{\overline{B}_{t}(j,v)}}\right]
=ηek2𝔼[t=1Ti=1kv𝒱x¯t(i,v)q(v)B¯t(i,v)]\displaystyle=\frac{\eta ek}{2}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{v\in\mathcal{V}}\frac{\overline{x}_{t}(i,v)q(v)}{\overline{B}_{t}(i,v)}\right]
ηek2𝔼[t=1Ti=1k(11e1(α1+v𝒱x¯t(i,v)q(v)))]\displaystyle\leq\frac{\eta ek}{2}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\left(\frac{1}{1-e^{-1}}\left(\alpha_{1}+\sum_{v\in\mathcal{V}}\overline{x}_{t}(i,v)q(v)\right)\right)\right]
=ηek2𝔼[t=1T(11e1(kα1+v𝒱i=1kx¯t(i,v)q(v)))]\displaystyle=\frac{\eta ek}{2}\mathbb{E}\left[\sum_{t=1}^{T}\left(\frac{1}{1-e^{-1}}\left(k\alpha_{1}+\sum_{v\in\mathcal{V}}\sum_{i=1}^{k}\overline{x}_{t}(i,v)q(v)\right)\right)\right]
ηekT2(1e1)(kα1+mQ),\displaystyle\leq\frac{\eta ekT}{2\left(1-e^{-1}\right)}\left(k\alpha_{1}+mQ\right)\;,

where the first equality uses the expected value of the geometric random variables G~\tilde{G}, the first inequality is obtained by neglecting the indicator function Bt(i,v)B_{t}(i,v), the following equality uses the expected value of the Bernoulli random variables BtB_{t}, and the second inequality follows from Lemma 6. We now consider the last term (III)(\mathrm{III}). Since t𝔼[^t(v)]\ell_{t}\geq\mathbb{E}[\widehat{\ell}_{t}(v)] componentwise, by Lemma 4, we have

(III)\displaystyle(\mathrm{III}) =𝔼[t=1T𝔼t[x¯t(v)a,t^t(v)]]𝔼[t=1T𝔼t[x¯t(v),t^t(v)]]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\left\langle\overline{x}_{t}(v)-a,\ell_{t}-\widehat{\ell}_{t}(v)\right\rangle\right]\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\left\langle\overline{x}_{t}(v),\ell_{t}-\widehat{\ell}_{t}(v)\right\rangle\right]\right]
=𝔼[t=1Ti=1kt(i)x¯t(i,v)(w𝒩(v)(1x¯t(i,w)q(w)))β].\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\ell_{t}(i)\overline{x}_{t}(i,v)\left(\prod_{w\in\mathcal{N}(v)}\left(1-\overline{x}_{t}(i,w)\,q(w)\right)\right)^{\beta}\right]\;.

Multiplying (III)(\mathrm{III}) by q(v)q(v) and summing over the agents, we can upper bound t(i)\ell_{t}(i) by 11 and then use the facts that 1xex1-x\leq e^{-x} for x[0,1]x\in[0,1] and that ey1/ye^{-y}\leq 1/y for all y>0y>0, to obtain

v𝒱q(v)𝔼[t=1Ti=1kt(i)x¯t(i,v)(w𝒩(v)(1x¯t(i,w)q(w)))β]𝔼[t=1Ti=1kv𝒱x¯t(i,v)q(v)(w𝒩(v)(1x¯t(i,w)q(w)))β]=𝔼[t=1Ti=1kv𝒱x¯t(i,v)q(v)>0x¯t(i,v)q(v)(w𝒩(v)(1x¯t(i,w)q(w)))β]𝔼[t=1Ti=1kv𝒱x¯t(i,v)q(v)>0x¯t(i,v)q(v)exp(βw𝒩(v)x¯t(i,w)q(w))]𝔼[t=1Ti=1kv𝒱x¯t(i,v)q(v)>0x¯t(i,v)q(v)βw𝒩(v)x¯t(i,w)q(w)]𝔼[t=1Ti=1kα1β]=α1kTβ\sum_{v\in\mathcal{V}}q(v)\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\ell_{t}(i)\overline{x}_{t}(i,v)\left(\prod_{w\in\mathcal{N}(v)}\left(1-\overline{x}_{t}(i,w)\,q(w)\right)\right)^{\beta}\right]\\ \begin{aligned} &\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{v\in\mathcal{V}}\overline{x}_{t}(i,v)\,q(v)\left(\prod_{w\in\mathcal{N}(v)}\left(1-\overline{x}_{t}(i,w)\,q(w)\right)\right)^{\beta}\right]\\ &=\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{\begin{subarray}{c}v\in\mathcal{V}\\ \overline{x}_{t}(i,v)\,q(v)>0\end{subarray}}\overline{x}_{t}(i,v)\,q(v)\left(\prod_{w\in\mathcal{N}(v)}\left(1-\overline{x}_{t}(i,w)\,q(w)\right)\right)^{\beta}\right]\\ &\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{\begin{subarray}{c}v\in\mathcal{V}\\ \overline{x}_{t}(i,v)\,q(v)>0\end{subarray}}\overline{x}_{t}(i,v)q(v)\exp\left(-\beta\sum_{w\in\mathcal{N}(v)}\overline{x}_{t}(i,w)\,q(w)\right)\right]\\ &\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\sum_{\begin{subarray}{c}v\in\mathcal{V}\\ \overline{x}_{t}(i,v)\,q(v)>0\end{subarray}}\frac{\overline{x}_{t}(i,v)\,q(v)}{\beta\sum_{w\in\mathcal{N}(v)}\overline{x}_{t}(i,w)\,q(w)}\right]\\ &\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{k}\frac{\alpha_{1}}{\beta}\right]=\frac{\alpha_{1}\,k\,T}{\beta}\end{aligned}

where the last inequality follows from Lemma 5. Putting everything together and recalling that β=1kη12kη,\beta=\left\lfloor\frac{1}{k\eta}\right\rfloor\geq\frac{1}{2k\eta}, we can conclude, thanks to Eq. (13), that for every a𝒜,a\in\mathcal{A}, we have

RT\displaystyle R_{T} Qm(1+log(k))η+QηekT2(1e1)(kQα1+m)+α1kTβ\displaystyle\leq Q\frac{m\left(1+\log(k)\right)}{\eta}+Q\frac{\eta ekT}{2\left(1-e^{-1}\right)}\left(\frac{k}{Q}\alpha_{1}+m\right)+\frac{\alpha_{1}\,k\,T}{\beta}
Qm(1+log(k))η+QηekT2(1e1)(kQα1+m)+2ηα1k2T\displaystyle\leq Q\frac{m\left(1+\log(k)\right)}{\eta}+Q\frac{\eta ekT}{2\left(1-e^{-1}\right)}\left(\frac{k}{Q}\alpha_{1}+m\right)+2\eta\alpha_{1}k^{2}T
=Qm(1+log(k))η+ηQkT(e2(1e1)(kQα1+m)+2α1kQ)\displaystyle=Q\frac{m\left(1+\log(k)\right)}{\eta}+\eta QkT\left(\frac{e}{2\left(1-e^{-1}\right)}\left(\frac{k}{Q}\alpha_{1}+m\right)+2\alpha_{1}\frac{k}{Q}\right)
Qm(1+log(k))η+5ηQkT(kQα1+m)\displaystyle\leq Q\frac{m\left(1+\log(k)\right)}{\eta}+5\eta QkT\left(\frac{k}{Q}\alpha_{1}+m\right)
2Q15mkTlog(k)(kQα1+m).\displaystyle\leq 2Q\sqrt{15mkT\log(k)\left(\frac{k}{Q}\alpha_{1}+m\right)}\;.

The last inequality follows by choosing the learning rate \eta=\sqrt{m\bigl{(}1+\log(k)\bigr{)}\big{/}\bigl{(}5kT\left(k\alpha_{1}/Q+m\right)\bigr{)}}, which balances the first two terms, together with the bound 1+\log(k)\leq 3\log(k), valid for all k\geq 2. ∎
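
The consolidation of constants in the last steps is elementary but easy to get wrong, so we also record a numerical check (ours; the values of m,k,T,Q,\alpha_{1} are arbitrary placeholders):

import numpy as np

c = np.e / (2 * (1 - 1 / np.e))            # ~2.15, coefficient of term (II)
print(c, c + 2 <= 5)                       # so c*(x + m) + 2*x <= 5*(x + m)

m, k, T, Q, a1 = 3, 20, 10_000, 2.0, 4.0   # arbitrary placeholder values
x = k * a1 / Q
eta = np.sqrt(m * (1 + np.log(k)) / (5 * k * T * (x + m)))
bound = Q * m * (1 + np.log(k)) / eta + 5 * eta * Q * k * T * (x + m)
print(bound, 2 * Q * np.sqrt(15 * m * k * T * np.log(k) * (x + m)))  # <= final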

Appendix E Bounds on independence numbers

The following two lemmas provide upper bounds on sums of weights over the nodes of a graph in terms of its independence number.

Lemma 5.

Let 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) be an undirected graph with independence number α1\alpha_{1}, and let q(v)0q(v)\geq 0 and Q(v)=w𝒩(v)q(w)>0Q(v)=\sum_{w\in\mathcal{N}(v)}q(w)>0 for all v𝒱v\in\mathcal{V}. Then

v𝒱q(v)Q(v)α1\sum_{v\in\mathcal{V}}\frac{q(v)}{Q(v)}\leq\alpha_{1}\;.
Proof.

Initialize V1=𝒱V_{1}=\mathcal{V}, fix w1argminwV1Q(w)w_{1}\in\operatorname*{argmin}_{w\in V_{1}}Q(w), and denote V2=𝒱𝒩(w1)V_{2}=\mathcal{V}\setminus\mathcal{N}(w_{1}). For k2k\geq 2, fix wkargminwVkQ(w)w_{k}\in\operatorname*{argmin}_{w\in V_{k}}Q(w) and shrink Vk+1=Vk𝒩(wk)V_{k+1}=V_{k}\setminus\mathcal{N}(w_{k}) until Vk+1=V_{k+1}=\varnothing. Since 𝒢\mathcal{G} is undirected, wks=1k1𝒩(ws)w_{k}\notin\bigcup_{s=1}^{k-1}\mathcal{N}(w_{s}); the nodes picked this way are therefore pairwise non-adjacent, so the number mm of nodes that can be picked is upper bounded by the independence number α1\alpha_{1}. Denoting 𝒩(wk)=Vk𝒩(wk)\mathcal{N}^{\prime}(w_{k})=V_{k}\cap\mathcal{N}(w_{k}), this implies

v𝒱q(v)Q(v)\displaystyle\sum_{v\in\mathcal{V}}\frac{q(v)}{Q(v)} =k=1mv𝒩(wk)q(v)Q(v)k=1mv𝒩(wk)q(v)Q(wk)\displaystyle=\sum_{k=1}^{m}\sum_{v\in\mathcal{N}^{\prime}(w_{k})}\frac{q(v)}{Q(v)}\leq\sum_{k=1}^{m}\sum_{v\in\mathcal{N}^{\prime}(w_{k})}\frac{q(v)}{Q(w_{k})}
k=1mv𝒩(wk)q(v)Q(wk)=mα1\displaystyle\leq\sum_{k=1}^{m}\frac{\sum_{v\in\mathcal{N}(w_{k})}q(v)}{Q(w_{k})}=m\leq\alpha_{1}

concluding the proof. ∎
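
The greedy construction used in the proof is straightforward to implement; the following sketch (ours) verifies the inequality on a small random graph, computing \alpha_{1} by brute force:

import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 12
adj = rng.random((n, n)) < 0.3
adj = adj | adj.T
np.fill_diagonal(adj, True)                # N(v) includes v itself
q = rng.random(n)
Q = adj.astype(float) @ q                  # Q(v) = sum of q over N(v)

# Greedy procedure from the proof: repeatedly pick a remaining node minimizing Q.
remaining, picked = set(range(n)), []
while remaining:
    w = min(remaining, key=lambda v: Q[v])
    picked.append(w)
    remaining -= {u for u in range(n) if adj[w, u]}

def independent(S):                        # brute-force independence number
    return all(not adj[a, b] for a in S for b in S if a != b)
alpha1 = max(len(S) for r in range(n + 1)
             for S in itertools.combinations(range(n), r) if independent(S))

print(np.sum(q / Q), len(picked), alpha1)  # sum <= #picked <= alpha1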

Lemma 6.

Let 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) be an undirected graph with independence number α1\alpha_{1}. For each v𝒱v\in\mathcal{V}, let 𝒩(v)\mathcal{N}(v) be the neighborhood of node vv (including vv itself), and p(1,v),,p(k,v)0p(1,v),\dots,p(k,v)\geq 0. Then, for all i{1,,k}i\in\{1,\dots,k\},

v𝒱p(i,v)q(i,v)11e1(α1+v𝒱p(i,v))whereq(i,v)=1w𝒩(v)(1p(i,w)).\sum_{v\in\mathcal{V}}\frac{p(i,v)}{q(i,v)}\leq\frac{1}{1-e^{-1}}\left(\alpha_{1}+\sum_{v\in\mathcal{V}}p(i,v)\right)\quad\text{where}\quad q(i,v)=1-\prod_{w\in\mathcal{N}(v)}\bigl{(}1-p(i,w)\bigr{)}~{}.
Proof.

Fix i{1,,k}i\in\{1,\dots,k\} and set for brevity P(i,v)=w𝒩(v)p(i,w)P(i,v)=\sum_{w\in\mathcal{N}(v)}p(i,w). We can write

v𝒱p(i,v)q(i,v)\displaystyle\sum_{v\in\mathcal{V}}\frac{p(i,v)}{q(i,v)} =v𝒱:P(i,v)1p(i,v)q(i,v)(I)+v𝒱:P(i,v)<1p(i,v)q(i,v)(II),\displaystyle=\underbrace{\sum_{v\in\mathcal{V}\,:\,P(i,v)\geq 1}\frac{p(i,v)}{q(i,v)}}_{\mathrm{(I)}}\quad+\quad\underbrace{\sum_{v\in\mathcal{V}\,:\,P(i,v)<1}\frac{p(i,v)}{q(i,v)}}_{\mathrm{(II)}}~{},

and proceed by upper bounding the two terms (I) and (II) separately. Let r(v)r(v) be the cardinality of 𝒩(v)\mathcal{N}(v). We have, for any given v𝒱v\in\mathcal{V},

min{q(i,v):w𝒩(v)p(i,w)1}=1(11r(v))r(v)1e1.\min\left\{q(i,v)\,:\,\sum_{w\in\mathcal{N}(v)}p(i,w)\geq 1\right\}=1-\left(1-\frac{1}{r(v)}\right)^{r(v)}\geq 1-e^{-1}~{}.

The equality is due to the fact that the minimum is achieved when p(i,w)=1r(v)p(i,w)=\frac{1}{r(v)} for all w𝒩(v)w\in\mathcal{N}(v), and the inequality comes from r(v)1r(v)\geq 1 (since v𝒩(v)v\in\mathcal{N}(v)). Hence

(I)v𝒱:P(i,v)1p(i,v)1e1v𝒱p(i,v)1e1.\displaystyle\mathrm{(I)}\leq\sum_{v\in\mathcal{V}\,:\,P(i,v)\geq 1}\frac{p(i,v)}{1-e^{-1}}\leq\sum_{v\in\mathcal{V}}\frac{p(i,v)}{1-e^{-1}}~{}.

As for (II), using the inequality 1xex,x[0,1]1-x\leq e^{-x},x\in[0,1], with x=p(i,w)x=p(i,w), we can write

q(i,v)1exp(w𝒩(v)p(i,w))=1exp(P(i,v)).\displaystyle q(i,v)\geq 1-\exp\left(-\sum_{w\in\mathcal{N}(v)}p(i,w)\right)=1-\exp\left(-P(i,v)\right)~{}.

In turn, because P(i,v)<1P(i,v)<1 in terms (II), we can use the inequality 1ex(1e1)x1-e^{-x}\geq(1-e^{-1})\,x, holding when x[0,1]x\in[0,1], with x=P(i,v)x=P(i,v), thereby concluding that

q(i,v)(1e1)P(i,v)q(i,v)\geq(1-e^{-1})P(i,v)\;.

Thus

(II)v𝒱:P(i,v)<1p(i,v)(1e1)P(i,v)11e1v𝒱p(i,v)P(i,v)α11e1,\displaystyle\mathrm{(II)}\leq\sum_{v\in\mathcal{V}\,:\,P(i,v)<1}\frac{p(i,v)}{(1-e^{-1})P(i,v)}\leq\frac{1}{1-e^{-1}}\,\sum_{v\in\mathcal{V}}\frac{p(i,v)}{P(i,v)}\leq\frac{\alpha_{1}}{1-e^{-1}}~{},

where in the last step we used Lemma 5. ∎
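
As with Lemma 5, the bound can be verified numerically on small random graphs (our sketch; the component index ii is fixed and dropped):

import itertools
import numpy as np

rng = np.random.default_rng(5)
n = 10
adj = rng.random((n, n)) < 0.35
adj = adj | adj.T
np.fill_diagonal(adj, True)                # N(v) includes v itself
p = rng.random(n) * 0.8                    # the weights p(i, v) for fixed i

# q(i, v) = 1 - prod over w in N(v) of (1 - p(i, w))
q = np.array([1 - np.prod(1 - p[adj[v]]) for v in range(n)])

def independent(S):                        # brute-force independence number
    return all(not adj[a, b] for a in S for b in S if a != b)
alpha1 = max(len(S) for r in range(n + 1)
             for S in itertools.combinations(range(n), r) if independent(S))

print(np.sum(p / q), (alpha1 + p.sum()) / (1 - np.exp(-1)))  # lhs <= rhs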