An Efficient Algorithm for Cooperative Semi-Bandits
Abstract
We consider the problem of asynchronous online combinatorial optimization on a network of communicating agents. At each time step, some of the agents are stochastically activated, requested to make a prediction, and the system pays the corresponding loss. Then, neighbors of active agents receive semi-bandit feedback and exchange some succinct local information. The goal is to minimize the network regret, defined as the difference between the cumulative loss of the predictions of active agents and that of the best action in hindsight, selected from a combinatorial decision set. The main challenge in such a context is to control the computational complexity of the resulting algorithm while retaining minimax optimal regret guarantees. We introduce Coop-FTPL, a cooperative version of the well-known Follow The Perturbed Leader algorithm, that implements a new loss estimation procedure generalizing the Geometric Resampling of Neu and Bartók [2013] to our setting. Assuming that the elements of the decision set are -dimensional binary vectors with at most non-zero entries and is the independence number of the network, we show that the expected regret of our algorithm after time steps is of order , where is the total activation probability mass. Furthermore, we prove that this is only -away from the best achievable rate and that Coop-FTPL has a state-of-the-art worst-case computational complexity.
1 Introduction
Distributed online settings with communication constraints arise naturally in large-scale learning systems. For example, in domains such as finance or online advertising, agents often serve high volumes of prediction requests and have to update their local models in an online fashion. Bandwidth and computational constraints may therefore preclude a central processor from having access to all the observations from all sessions and synchronizing all local models at the same time. With these motivations in mind, we introduce and analyze a new online learning setting in which a network of agents efficiently solves a common nonstochastic combinatorial semi-bandit problem by sharing information only with their network neighbors. At each time step , some agents belonging to a communication network are asked to make a prediction belonging to a subset of and pay a (linear) loss where is chosen adversarially by an oblivious environment. Then, any such agent receives the feedback , which is shared, together with some local information, with its first neighbors in the graph. The goal is to minimize the network regret after time steps
| (1) |
where is the set of agents that made a prediction at time . In words, this is the difference between the cumulative loss of the “active” agents and the loss that they would have incurred had they consistently made the best prediction in hindsight.
For this setting, we design a distributed algorithm that we call Coop-FTPL (Algorithm 1), and prove that its regret is upper bounded by (Theorem 1), where is the independence number of the network and is the sum over all agents of the probability that the agent is active during a time step. Our algorithm employs an estimation technique that we call Cooperative Geometric Resampling (Coop-GR, Algorithm 2). It is an extension of a similar procedure appearing in [Neu and Bartók, 2013] that relies on the fact that the reciprocal of the probability of an event can be estimated by measuring the reoccurrence time. We can leverage this idea in the context of cooperative learning thanks to some statistical properties of the minimum of a family of geometric random variables (see Lemmas 1–3). Our algorithm has a state-of-the-art dependence on time of order for the worst-case computational complexity (Proposition 1). Moreover, we show with a lower bound (Theorem 2) that no algorithm can achieve a regret smaller than on all cooperative semi-bandit instances. Thus, our Coop-FTPL is at most a multiplicative factor of -away from the minimax result.
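To illustrate the core reoccurrence-time idea behind Geometric Resampling in its simplest single-agent form, here is a minimal sketch (an illustration only; the function name and truncation cap are assumptions, not the paper's notation).

```python
import random

def reoccurrence_time(p: float, cap: int) -> int:
    """Count Bernoulli(p) trials until the first success, truncated at `cap`.
    The count is a (truncated) geometric random variable whose expectation,
    (1 - (1 - p)**cap) / p, approaches 1/p as the cap grows, so it serves as
    an (almost) unbiased estimate of the inverse probability 1/p."""
    for k in range(1, cap + 1):
        if random.random() < p:
            return k
    return cap

# Quick sanity check: the empirical mean should be close to 1/p for large caps.
p, cap = 0.3, 100
samples = [reoccurrence_time(p, cap) for _ in range(100_000)]
print(sum(samples) / len(samples), 1 / p)
```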
To the best of our knowledge, ours is the first computationally efficient near-optimal learning algorithm for the problem of cooperative learning with nonstochastic combinatorial bandits, where not all agents are necessarily active at all time steps.
2 Related work and further applications
Single-agent combinatorial bandits find applications in several fields, such as path planning, ranking and matching problems, finding minimum-weight spanning trees, cut sets, and multitask bandits. An efficient algorithm for this setting is Follow-The-Perturbed-Leader (FTPL), which was first proposed by Hannan [1957] and later rediscovered by Kalai and Vempala [2005]. Neu and Bartók [2013] show that combining FTPL with a loss estimation procedure called Geometric Resampling (GR) leads to a computationally efficient solution for this problem. More precisely, the solution is efficient given that the offline optimization problem of finding
| (2) |
admits a computationally efficient algorithm. This assumption is minimal, in the sense that if the offline problem in Eq. (2) is hard to approximate, then any algorithm with low regret must also be inefficient (a slight relaxation in this direction would be assuming that Eq. (2) can be approximated accurately and efficiently). Grötschel et al. [2012] and Lee et al. [2018] give some sufficient conditions for the validity of this assumption. They essentially rely on having an efficient membership oracle for the convex hull of the decision set and an evaluation oracle for the linear function to optimize. Audibert et al. [2014] note that Online Stochastic Mirror Descent (OSMD) or Follow The Regularized Leader (FTRL)-type algorithms can be efficiently implemented by convex programming if the convex hull of the decision set can be described by a polynomial number of constraints. Suehiro et al. [2012] investigate the details of such efficient implementations and design an algorithm whose time complexity may still be infeasible in practice. Methods based on the exponential weighting of each decision vector can be implemented efficiently only in a handful of special cases; see, e.g., [Koolen et al., 2010] and [Cesa-Bianchi and Lugosi, 2012] for some examples.
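To make the oracle assumption in Eq. (2) concrete, here is a minimal sketch for one common special case, the m-sets (binary vectors with exactly m ones); the decision set, the function name, and the use of NumPy are illustrative choices, not part of the paper.

```python
import numpy as np

def m_set_oracle(loss: np.ndarray, m: int) -> np.ndarray:
    """Linear minimization oracle over the m-sets: return the binary vector
    with exactly m ones that minimizes the inner product with `loss`.
    The minimizer switches on the m coordinates with the smallest losses,
    so the oracle runs in O(d log d) time."""
    action = np.zeros_like(loss)
    action[np.argsort(loss)[:m]] = 1.0
    return action

# Example: with the per-coordinate losses below and m = 2, the oracle picks
# the two cheapest coordinates (indices 1 and 3).
print(m_set_oracle(np.array([0.9, 0.1, 0.7, 0.2, 0.5]), m=2))
```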
The study of cooperative nonstochastic online learning on networks was pioneered by Awerbuch and Kleinberg [2008], who investigated a bandit setting in which the communication graph is a clique, agents belong to clusters characterized by the same loss, and some agents may be non-cooperative. In our multi-agent setting, the end goal is to control the total network regret (1). This objective was already studied by Cesa-Bianchi et al. [2019a] in the full-information case. A similar line of work was pursued by Cesa-Bianchi et al. [2019b], where the authors consider networks of learning agents that cooperate to solve the same nonstochastic bandit problem. In their setting, all agents are simultaneously active at all time steps, and the feedback propagates throughout the network with a maximum delay of time steps, where is a parameter of the proposed algorithm. The authors introduce a cooperative version of Exp3 that they call Exp3-COOP with regret of order where is the number of arms in the nonstochastic bandit problem, is the total number of agents in the network, and is the independence number of the -th power of the communication network. The case corresponds to information that arrives with one round of delay and communication limited to first neighbors. In this setting, Exp3-COOP has regret of order . Thus, our work can be seen as an extension of this setting to the case of combinatorial bandits with stochastic activation of agents. Finally, we point out that if the network consists of a single node, our cooperative setting collapses into a single-agent combinatorial semi-bandit problem. In particular, when the number of arms is and , this becomes the well-known adversarial multiarmed bandit problem (see [Auer et al., 2002]). Hence, ours is a proper generalization of all the settings mentioned above. These cooperative problems are also studied in the stochastic setting (see, e.g., Martínez-Rubio et al. [2019]).
Finally, the reader may wonder what kind of results could be achieved if the agents were activated adversarially rather than stochastically. Cesa-Bianchi et al. [2019a] showed that in this setting no learning can occur, not even with full-information feedback.
3 Cooperative semi-bandit setting
In this section, we present our cooperative semi-bandit protocol and we introduce all relevant definitions and notation.
We say that is a communication network over agents if it is an undirected graph over a set with cardinality , whose elements we refer to as agents. Without loss of generality, we assume that . For any agent, we denote its neighborhood as the set consisting of the agent itself together with all agents adjacent to it. The independence number of the network is the largest cardinality of an independent set, where an independent set is a subset of agents no two of which are neighbors.
We study the following cooperative online combinatorial optimization protocol. Initially, hidden from the agents, the environment draws a sequence of subsets of agents, which we call active, and a sequence of loss vectors . We assume that each agent has a probability of being activated, which need only be known by the agent itself. The set of active agents at time is then determined by drawing, for each agent, a Bernoulli random variable with bias equal to its activation probability, independently of the past, and it consists exclusively of the agents whose Bernoulli draw is equal to one. The decision set is a subset of , for some . A toy simulation of one round of this protocol is sketched after the numbered steps below.
For each time step :
1. each active agent makes a prediction in the decision set (possibly drawn at random);
2. each neighbor of an active agent receives the semi-bandit feedback defined in Eq. (3);
3. each agent receives some local information from its neighbors;
4. the system incurs the loss of the predictions of the active agents.
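The following toy simulation illustrates one round of this protocol; the ring network, the activation probabilities, and all function names are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, d = 5, 8
q = rng.uniform(0.2, 0.9, size=n_agents)                    # activation probabilities
neighbors = {v: {v, (v - 1) % n_agents, (v + 1) % n_agents}  # ring network (incl. the agent itself)
             for v in range(n_agents)}

def one_round(predict, loss_vector):
    """One protocol round: agents activate independently, active agents
    predict a binary action, the system pays the corresponding losses, and
    every neighbor of an active agent receives the semi-bandit feedback."""
    active = np.flatnonzero(rng.random(n_agents) < q)
    total_loss, feedback = 0.0, {}
    for v in active:
        a_v = predict(v)                                     # binary vector in the decision set
        total_loss += float(a_v @ loss_vector)
        for w in neighbors[v]:                               # first neighbors observe the played
            feedback.setdefault(w, []).append((a_v, a_v * loss_vector))  # components and their losses
    return total_loss, feedback

# Toy usage: random predictions with roughly 3 non-zero entries each.
loss, fb = one_round(lambda v: (rng.random(d) < 3 / d).astype(float), rng.random(d))
```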
The goal is to minimize the expected network regret as a function of the time horizon , defined by
| (4) |
where the expectation is taken with respect to the draws of and (possibly) the randomization of the learners. In the next sections we will also denote by the probability conditioned on the history up to and including round , and by the corresponding expectation.
The nature of the local information exchanged by neighbors of active agents will be clarified in the next section. In short, they share succinct representations of the current state of their local prediction model.
4 Coop-FTPL and upper bound
In this section we introduce and analyze our efficient Coop-FTPL algorithm for cooperative online combinatorial optimization.
4.1 The Coop-FTPL algorithm
Coop-FTPL (Algorithm 1) takes as input a decision set , a time horizon , a learning rate , a truncation parameter , and an exploration distribution . At each time step , all active agents make an FTPL prediction with an i.i.d. perturbation sampled from the exploration distribution, then they receive some feedback and share it with their first neighbors. Afterwards, each agent that received some feedback in this round requests from its neighbors a vector of geometric random samples, which is efficiently computed by Algorithm 2 and will be described in detail later. With these geometric samples, each agent computes an estimated loss and updates its cumulative loss estimate. This estimator is described in detail in Section 4.3.
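As a rough illustration of the two local operations just described, here is a minimal single-agent sketch; the Laplace perturbation matches the choice made in Theorem 1, but the function names, the oracle interface, and the exact form of the update are assumptions made for illustration only.

```python
import numpy as np

def ftpl_prediction(L_hat_v, eta, oracle, rng):
    """Perturbed-leader prediction of one agent: draw an i.i.d. Laplace
    perturbation and call the linear optimization oracle on the perturbed
    cumulative loss estimate."""
    Z = rng.laplace(size=L_hat_v.shape)
    return oracle(L_hat_v - Z / eta)

def update_loss_estimate(L_hat_v, observed, losses, K, beta):
    """Semi-bandit loss estimate: each observed component's loss is scaled by
    the truncated reoccurrence-time estimate K of the inverse observation
    probability (computed by the cooperative geometric resampling routine),
    then added to the running cumulative estimate."""
    ell_hat = np.where(observed, np.minimum(K, beta) * losses, 0.0)
    return L_hat_v + ell_hat
```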
4.2 Reduction to OSMD
Before describing the loss estimator and the geometric samples it relies on, we make a connection between FTPL and the Online Stochastic Mirror Descent algorithm (OSMD) that will be crucial for our analysis (for a brief overview of some key convex analysis and OSMD facts, see Appendices A and B; for a similar approach in the single-agent case, see [Lattimore and Szepesvári, 2020]).
Fix any time step and an agent . As we mentioned above, if is active, it makes the following FTPL prediction (line 1)
where is sampled i.i.d. from (the random perturbations introduce the exploration, which for an appropriate choice of is sufficient to guarantee small regret). On the other hand, given a Legendre potential with , an OSMD algorithm makes the prediction
where is the Bregman divergence induced by and is the convex hull of . Using the fact that , the above can be computed in a standard way by studying when the gradient of its argument is equal to zero, and proceeding inductively, we obtain the two identities . By duality, this implies that . We now want to relate and so that
| (5) |
where the conditional expectation (given the history up to time ) is taken with respect to . Thus, in order to view FTPL as an instance of OSMD, it suffices to find a Legendre potential with such that . In order to satisfy this condition, we need that for any , the Fenchel conjugate of enjoys . Then, we define for any , where is chosen to be an arbitrary maximizer if multiple maximizers exist. From convex analysis, if the convex hull of had a smooth boundary, then the support function of would satisfy . For combinatorial bandits, is non-smooth, but, being a density with respect to Lebesgue measure, one can prove (see, e.g., Lattimore and Szepesvári [2020]) that , for all . This shows that FTPL can be interpreted as OSMD with a potential defined implicitly by its Fenchel conjugate
Thus, recalling (5), we can think of the update of OSMD as the average of a random component-wise draw (for all ), with respect to a distribution on defined in terms of the distribution of , as
where is the probability conditioned on the history up to time .
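To make the FTPL-as-OSMD correspondence sketched above concrete, the standard single-agent form of the argument (following Lattimore and Szepesvári [2020]) can be written as
\[
A_t = \operatorname*{argmin}_{a\in\mathcal A}\Big\langle a,\ \widehat L_{t-1}-\tfrac{1}{\eta}Z_t\Big\rangle,
\qquad
\bar A_t := \mathbb E_t\big[A_t\big] = \nabla\Phi^*\big(-\widehat L_{t-1}\big),
\qquad
\Phi^*(x) = \mathbb E\Big[\max_{a\in\mathcal A}\big\langle a,\ x+\tfrac{1}{\eta}Z\big\rangle\Big],
\]
where the learning rate, the perturbation, the cumulative loss estimate, and the implicit potential are generic placeholders that may differ by constants from the paper's notation; the second identity follows by differentiating under the expectation, which is valid whenever ties happen with probability zero.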
4.3 An efficient estimator
To help follow the definitions and analyses of the loss estimator and of the geometric samples it relies on, we introduce three useful lemmas on geometric distributions. We defer their proofs to Appendix C.
Lemma 1.
Let be independent random variables such that each has a geometric distribution with parameter . Then, the random variable has a geometric distribution with parameter .
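Lemma 1 is the standard fact that the minimum of independent geometric random variables with parameters p_1, ..., p_n is itself geometric with parameter 1 - (1 - p_1)···(1 - p_n). A quick Monte Carlo check of this fact (an illustration only, not part of the paper) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.5, 0.35])                  # parameters of the independent geometrics
n_samples = 200_000

# Draw the minimum of the independent geometrics (support {1, 2, ...}).
mins = rng.geometric(p, size=(n_samples, p.size)).min(axis=1)

# Lemma 1 predicts a geometric with parameter 1 - prod(1 - p_i).
p_min = 1.0 - np.prod(1.0 - p)
print(mins.mean(), 1.0 / p_min)                 # empirical vs. theoretical mean
print((mins == 1).mean(), p_min)                # P(min = 1) should equal p_min
```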
Lemma 2.
Let be a geometric random variable with parameter and . Then, the expectation of the random variable satisfies .
Lemma 3.
For all , fix two arbitrary numbers . Consider a collection of independent Bernoulli random variables such that and for any and all . Then, the random variables defined for all by are all independent and they have a geometric distribution with parameter .
Fix now any time step , agent , and component . The loss estimator depends on the algorithmic definition of in Algorithm 2, where . By Lemma 3, we have that for any , conditionally on the history up to time , the random variable has a truncated geometric distribution with success probability equal to and truncation parameter . The idea is that using it as an estimator we can reconstruct the inverse of the probability that the agent observes the relevant component at time . The truncation parameter is not needed for the analysis; it is used only to optimize the computational efficiency of the algorithm. (Previously known cooperative algorithms for limited-feedback settings need to exchange at least two real numbers: the probability according to which predictions are drawn and the loss. Instead of the probability, we only need to pass the integer , which requires at most bits (order of , when tuned). Note also that for the loss, we could exchange an approximation of using only bits. Indeed, one can show that Lemma 4, in this case, remains true when replacing with in the first equality. Everything else works the same in the proof of Theorem 1 up to an extra (negligible) term.)
The loss estimator of is then defined as
| (6) |
where
| (7) |
and given the history up to time , for each , the family consists of independent geometric random variables with parameter . Note that the geometric random variables are actually never computed by Algorithm 2, which efficiently computes only their truncations , with truncation parameter . Nevertheless, as will be apparent later, they are a useful tool for the theoretical analysis. Note that, by Eq. (5), we have
therefore
where the last identity follows by Lemma 1. Moreover from Lemmas 1 and 2, we have
The following key lemma gives an upper bound on the expected estimated loss.
Lemma 4.
For any time , component , agents , and truncation parameter , the expectation of the loss estimator in (6), given the history up to time , satisfies
4.4 Analysis
We can finally state our upper bound on the regret of Coop-FTPL. The key idea is to apply OSMD techniques to our FTPL algorithm, as explained in Section 4.2. The proof proceeds by splitting the regret of each agent in the network into three terms. The first two are treated with standard techniques; the first one depends on the diameter of with respect to the regularizer and the second one on the Bregman divergence of consecutive updates. The last term is related to the bias of the estimator and is analyzed leveraging the lemmas on geometric distributions introduced in Section 4.3. Then, these terms are summed, each with a weight corresponding to the agent's activation probability, and this sum is controlled using results that relate a sum of weights over the nodes of a graph with the independence number of the graph.
Theorem 1.
If is the Laplace density , , and , then the regret of our Coop-FTPL algorithm (Algorithm 1) satisfies
In particular, tuning the parameters as follows
| (8) |
yields
We now present a detailed sketch of the proof of our main result (full proof in Appendix D).
Sketch of the proof.
For the sake of convenience, we define the expected individual regret of an agent in the network with respect to a fixed action by
where the expectation is taken with respect to the internal randomization of the agent, but not to its activation probability . With this definition the total regret on the network in Eq. (4) can be decomposed as
| (9) |
The proof then proceeds by isolating the bias in the loss estimators. For each we have
Exploiting the analogy that we established between FTPL and OSMD, we begin by using the standard bound for the regret of OSMD in the first term of the previous equation. For the reader’s convenience, we restate it in Appendix B, Theorem 4. This leads to
The three terms are studied separately and in detail in Appendix D. Here, we provide a sketch of the bounds.
For the first term , we use the fact that the regularizer satisfies, for all ,
| (10) |
which follows by the definition of , the properties of the perturbation distribution, and the fact that for any . One can also show that for all , and this, combined with the previous equation, leads to
For the second term we have
| (11) |
where the first equality is a standard property of the Bregman divergence (see Theorem 3 in Appendix A), the second follows from the definitions of the updates, and the last by Taylor’s theorem, where , for some . The estimates of the entries of the Hessian are nontrivial (but tedious); the interested reader can find them in Appendix D. Exploiting our assumption that , we get, for all ,
Plugging this estimate in Eq. (11) yields
where the last inequality follows by neglecting the truncation with . Hence multiplying by and summing over yields
which is rewritten as
where in the first equality we defined and, analogously, , while the second follows by the conditional independence of the three terms , , and given the history up to time . Furthermore, making use of Lemmas 1–3 and upper bounding, we get:
where the first equality uses the expected value of the geometric random variables , the first inequality is obtained neglecting the indicator function and taking the conditional expectation of , and the last inequality follows by a known upper bound involving independence numbers appearing, for example in Cesa-Bianchi et al. [2019a, b]. For the sake of convenience, we add this result to Appendix E, Lemma 6. We now consider the last term . Since by Lemma 4, we have
Multiplying by and summing over the agents, we now upper bound with and use the facts that for all and for all , to obtain
where the last inequality follows by a known upper bound involving independence numbers appearing, for example in [Alon et al., 2017, Lemma 10]. For the sake of convenience, we add this result to Appendix E, Lemma 5. Putting everything together and recalling that we can finally conclude that
∎
We conclude this section by discussing the computational complexity of our Coop-FTPL algorithm. The next result shows that the total number of elementary operations performed by Coop-FTPL over time steps scales with in the worst case. To the best of our knowledge, no known algorithm attains a lower worst-case computational complexity.
Proposition 1.
Proof.
The result follows immediately by noting that the number of elementary operations performed by each agent at each time step is at most
∎
5 Lower bound
In this section we show that no cooperative semi-bandit algorithm can beat the rate. The key idea for constructing the lower bound is simple: if the activation probabilities are non-zero only for agents belonging to an independent set with cardinality , then the problem is reduced to independent instances of single-agent semi-bandits, whose minimax rate is known.
Theorem 2.
For each communication network with independence number there exist cooperative semi-bandit instances for which the regret of any learning algorithm satisfies
Proof.
Let be an independent set with cardinality . Furthermore, let be a positive probability and for all agents , let
In words, only agents belonging to an independent set with largest cardinality are activated (with positive probability), and all with the same probability. Thus, only agents in contribute to the expected regret and their total mass is equal to . Moreover, note that being non-adjacent, agents in never exchange any information. Each agent is therefore running an independent single-agent online linear optimization problem with semi-bandit feedback for an average of rounds. Since for single-agent semi-bandits, the worst-case lower bound on the regret after time steps is known to be (see, e.g., Audibert et al. [2014], Lattimore et al. [2018]) and the cardinality of is , the regret of any cooperative semi-bandit algorithm run on this instance satisfies
where we used . This concludes the proof. ∎
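A minimal sketch of this hard-instance construction is given below; it uses networkx's greedy maximal independent set as a stand-in (the proof uses a maximum independent set, which is NP-hard to compute in general), assumes the agents are labeled 0, ..., n-1, and all names are illustrative.

```python
import networkx as nx
import numpy as np

def lower_bound_activations(G: nx.Graph, q0: float) -> np.ndarray:
    """Activation probabilities of the hard instance in Theorem 2: agents in
    an independent set are activated with probability q0, all other agents
    are never activated. Since no two activated agents are ever neighbors,
    no feedback is ever shared, and each activated agent faces an independent
    single-agent semi-bandit problem."""
    independent_set = nx.maximal_independent_set(G, seed=0)   # greedy maximal set
    q = np.zeros(G.number_of_nodes())
    q[list(independent_set)] = q0
    return q

# Toy usage on a 10-node cycle (independence number 5).
print(lower_bound_activations(nx.cycle_graph(10), q0=0.3))
```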
In the previous section we showed that the expected regret of our Coop-FTPL algorithm can always be upper bounded by (ignoring constants). Thus, Theorem 2 shows that, up to the additive term inside the rightmost bracket, the regret of Coop-FTPL is at most -away from the minimax optimal rate.
6 Conclusions and open problems
Motivated by spatially distributed large-scale learning systems, we introduced a new cooperative setting for adversarial semi-bandits in which only some of the agents are active at any given time step. We designed and analyzed an efficient algorithm that we called Coop-FTPL, for which we proved near-optimal regret guarantees with state-of-the-art computational complexity costs. Our analysis relies on the fact that agents are aware of their activation probabilities, and they have some prior knowledge about the connectivity of the graph. Two interesting new lines of research are investigating whether either of these assumptions can be lifted while retaining low regret and good computational complexity. In particular, removing the need for prior knowledge of the independence number would represent a significant theoretical and practical improvement, given that computing is NP-hard in the worst case. It is however unclear if existing techniques that address this problem in similar settings (e.g., Cesa-Bianchi et al. [2019b]) would yield any results in our general case. We believe that entirely new ideas will be required to deal with this issue. We leave these intriguing problems open for future work.
Acknowledgements
This project has received funding from the French “Investing for the Future – PIA3” program under the Grant agreement ANITI ANR-19-PI3A-0004 (https://aniti.univ-toulouse.fr/).
References
- Neu and Bartók [2013] Gergely Neu and Gábor Bartók. An efficient algorithm for learning with semi-bandit feedback. In International Conference on Algorithmic Learning Theory, pages 234–248. Springer, 2013.
- Hannan [1957] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
- Kalai and Vempala [2005] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- Grötschel et al. [2012] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
- Lee et al. [2018] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pages 1292–1294. PMLR, 2018.
- Audibert et al. [2014] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
- Suehiro et al. [2012] Daiki Suehiro, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Kiyohito Nagano. Online prediction under submodular constraints. In International Conference on Algorithmic Learning Theory, pages 260–274. Springer, 2012.
- Koolen et al. [2010] Wouter M Koolen, Manfred K Warmuth, Jyrki Kivinen, et al. Hedging structured concepts. In COLT, pages 93–105. Citeseer, 2010.
- Cesa-Bianchi and Lugosi [2012] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
- Awerbuch and Kleinberg [2008] Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.
- Cesa-Bianchi et al. [2019a] Nicolò Cesa-Bianchi, Tommaso R Cesari, and Claire Monteleoni. Cooperative online learning: Keeping your neighbors updated. arXiv preprint arXiv:1901.08082, 2019a.
- Cesa-Bianchi et al. [2019b] Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research, 20(1):613–650, 2019b.
- Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Martínez-Rubio et al. [2019] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 4529–4540. Curran Associates, Inc., 2019.
- Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- Alon et al. [2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
- Lattimore et al. [2018] Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. Toprank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3945–3954, 2018.
Appendix A Legendre functions and Fenchel conjugates
In this section, we briefly recall a few known definitions and facts in convex analysis.
Definition 1 (Interior, boundary, and convex hull).
For any subset of , we denote its topological interior by , its boundary by , and its convex hull by .
Definition 2 (Effective domain).
The effective domain of a convex function is
| (12) |
With a slight abuse of notation, we will denote with the same symbol a convex function and its restriction to its effective domain.
Definition 3 (Legendre function).
A convex function is Legendre if
-
1.
is non-empty;
-
2.
is differentiable and strictly convex on ;
-
3.
for all , if , then .
Definition 4 (Fenchel conjugate).
Let be a convex function. The Fenchel conjugate of is defined as the function
Definition 5 (Bregman divergence).
Let be a convex function with non-empty that is differentiable on . The Bregman divergence induced by is
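For the reader's convenience, the standard form of this definition and two classical examples (added here as an illustration) are
\[
D_F(x,y) = F(x) - F(y) - \langle \nabla F(y),\, x-y\rangle,
\qquad
F(x)=\tfrac12\|x\|_2^2 \;\Rightarrow\; D_F(x,y)=\tfrac12\|x-y\|_2^2,
\qquad
F(x)=\sum_i x_i\ln x_i \;\Rightarrow\; D_F(x,y)=\sum_i x_i\ln\tfrac{x_i}{y_i}-\sum_i(x_i-y_i).
\]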
The following results are taken from [Lattimore and Szepesvári, 2020, Theorem 26.6 and Corollary 26.8].
Theorem 3.
Let be a Legendre function. Then:
-
1.
the Fenchel conjugate of is Legendre;
-
2.
is bijective with inverse ;
-
3.
, for all .
Corollary 1.
If is a Legendre function and , then .
Appendix B Online Stochastic Mirror Descent (OSMD)
In this section, we briefly recall the standard Online Stochastic Mirror Descent algorithm (OSMD) (Algorithm 3) and its analysis.
For an overview of some basic convex analysis definitions and results, we refer the reader to Appendix A above. For a convex function that is differentiable on the non-empty interior of its effective domain , we denote by the Bregman divergence induced by (Definition 5). Following the existing convention, we refer to the input function of OSMD as a potential. Furthermore, given a measure on a subset of , we say that a vector is the mean of the measure if is the component-wise expectation of a -valued random variable with distribution . For any time step , we denote by the expectation conditioned on the history up to and including round .
It is known that since is convex and compact, , and is Legendre, all the ’s exist in Algorithm 3 and for all (see, e.g., [Lattimore and Szepesvári, 2020, Exercise 28.2]).
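For intuition, the simplest concrete instance of an OSMD update is sketched below: the negative-entropy potential over the probability simplex, which recovers exponential weights. This is an illustration only, since the paper instead works on the convex hull of the combinatorial decision set with the potential implicitly defined by the FTPL perturbation.

```python
import numpy as np

def osmd_step_negative_entropy(x, loss_estimate, eta):
    """One OSMD step on the probability simplex with the (unnormalized)
    negative-entropy potential: the dual gradient step multiplies each
    coordinate by exp(-eta * loss), and the Bregman projection back onto
    the simplex reduces to a renormalization (exponential weights)."""
    y = x * np.exp(-eta * loss_estimate)   # gradient step in the dual space
    return y / y.sum()                     # Bregman projection onto the simplex

# Toy usage: uniform start, one observed loss estimate.
x = np.full(4, 0.25)
print(osmd_step_negative_entropy(x, np.array([1.0, 0.0, 0.5, 0.0]), eta=0.7))
```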
The following result is taken from [Lattimore and Szepesvári, 2020, Theorem 28.10] and gives an upper bound on the regret of OSMD.
Appendix C Proofs of lemmas on geometric distributions
In this section we provide all missing proofs on geometric distributions that we stated in Section 4.
Proof of Lemma 1. For all , the cumulative distribution function (c.d.f.) of is given, for all , by
The c.d.f. of is given, for all , by
and this is the c.d.f. of a geometric random variable with parameter . ∎
Proof of Lemma 2.
Proof of Lemma 3. The proof follows immediately from the fact that is a collection of independent Bernoulli random variables with expectation for any and all . ∎
Appendix D Proof of Theorem 1
In this section, we present a complete proof of Theorem 1.
Proof of Theorem 1. For the sake of convenience, we define the expected individual regret of an agent in the network with respect to a fixed action by
where the expectation is taken with respect to the internal randomization of the agent, but not its activation probability . With this definition the total regret on the network in Eq. (4) can be decomposed as
| (13) |
The proof then proceeds by isolating the bias in the loss estimators. For each we get
where the inequality follows by the standard analysis of OSMD. We bound the three terms separately. For the first term , we have
| (14) |
where the first inequality follows by choosing , the second follows from Hölder’s inequality and for any , and the last equality is Exercise 30.4 in Lattimore and Szepesvári [2020]. It follows that
where we use the fact that for all and this follows from the first line of Eq. (14) by the convexity of the maximum of a finite number of linear functions, using Jensen’s inequality and the fact that the random variable is centered. Thus
We now study the second term We have
| (15) |
where the first equality is a standard property of the Bregman divergence, the second follows from the definitions of the updates, and the last by Taylor’s theorem, where , for some . To calculate the Hessian we use a change of variable to avoid applying the gradient to the (possibly) non-differentiable and we get:
Using the definition of and the fact that is nonnegative,
where the last inequality follows by and . Plugging this estimate in Eq. (15) yields
where the last inequality follows by neglecting the truncation with . Hence multiplying by and summing over yields
which is rewritten as
where in the first equality we defined and, analogously, , while the second follows by the conditional independence of the three terms , , and given the history up to time . Furthermore, making use of Lemmas 1–3, we get
where the first equality uses the expected value of the geometric random variables , the first inequality is obtained neglecting the indicator function , the following equality uses the expected value of the geometric random variables , the second inequality follows by Lemma 6. We now consider the last term . Since , from Lemma 4, we have
Multiplying by and summing over the agents, we can now upper bound with ; then we use the facts that for and that for all , to obtain
where the last inequality follows by Lemma 5. Putting everything together and recalling that , we can conclude, thanks to Eq. (13), that for every we have
∎
Appendix E Bounds on independence numbers
The two following lemmas provide upper bounds of sums of weights over nodes of a graph expressed in terms of its independence number.
Lemma 5.
Let be an undirected graph with independence number , , and for all . Then
Proof.
Initialize , fix , and denote . For fix and shrink until . Since is undirected, the number of times that an action can be picked this way is upper bounded by . Denoting , this implies
concluding the proof. ∎
Lemma 6.
Let be an undirected graph with independence number . For each , let be the neighborhood of node (including itself), and . Then, for all ,
Proof.
Fix and set for brevity . We can write
and proceed by upper bounding the two terms (I) and (II) separately. Let be the cardinality of . We have, for any given ,
The equality is due to the fact that the minimum is achieved when for all , and the inequality comes from (for ). Hence
As for (II), using the inequality , with , we can write
In turn, because in term (II), we can use the inequality , which holds when , with , thereby concluding that