An Efficient Algorithm for Cooperative Semi-Bandits
Abstract
We consider the problem of asynchronous online combinatorial optimization on a network of communicating agents. At each time step, some of the agents are stochastically activated, requested to make a prediction, and the system pays the corresponding loss. Then, neighbors of active agents receive semi-bandit feedback and exchange some succinct local information. The goal is to minimize the network regret, defined as the difference between the cumulative loss of the predictions of active agents and that of the best action in hindsight, selected from a combinatorial decision set. The main challenge in such a context is to control the computational complexity of the resulting algorithm while retaining minimax optimal regret guarantees. We introduce Coop-FTPL, a cooperative version of the well-known Follow The Perturbed Leader algorithm, that implements a new loss estimation procedure generalizing the Geometric Resampling of Neu and Bartók [2013] to our setting. Assuming that the elements of the decision set are -dimensional binary vectors with at most non-zero entries and is the independence number of the network, we show that the expected regret of our algorithm after time steps is of order , where is the total activation probability mass. Furthermore, we prove that this is only -away from the best achievable rate and that Coop-FTPL has a state-of-the-art worst-case computational complexity.
1 Introduction
Distributed online settings with communication constraints arise naturally in large-scale learning systems. For example, in domains such as finance or online advertising, agents often serve high volumes of prediction requests and have to update their local models in an online fashion. Bandwidth and computational constraints may therefore preclude a central processor from having access to all the observations from all sessions and synchronizing all local models at the same time. With these motivations in mind, we introduce and analyze a new online learning setting in which a network of agents efficiently solves a common nonstochastic combinatorial semi-bandit problem by sharing information only with their network neighbors. At each time step , some agents belonging to a communication network are asked to make a prediction belonging to a subset of and pay a (linear) loss where is chosen adversarially by an oblivious environment. Then, any such agent receives the feedback , which is shared, together with some local information, with its first neighbors in the graph. The goal is to minimize the network regret after time steps
| (1) |
where is the set of agents that made a prediction at time . In words, this is the difference between the cumulative loss of the “active” agents and the loss that they would have incurred had they consistently made the best prediction in hindsight.
For this setting, we design a distributed algorithm that we call Coop-FTPL (Algorithm 1), and prove that its regret is upper bounded by (Theorem 1), where is the independence number of the network and is the sum over all agents of the probability that the agent is active during a time step. Our algorithm employs an estimation technique that we call Cooperative Geometric Resampling (Coop-GR, Algorithm 2). It is an extension of a similar procedure appearing in [Neu and Bartók, 2013] that relies on the fact that the reciprocal of the probability of an event can be estimated by measuring the reoccurrence time. We can leverage this idea in the context of cooperative learning thanks to some statistical properties of the minimum of a family of geometric random variables (see Lemmas 1–3). Our algorithm has a state-of-the-art dependence on time of order for the worst-case computational complexity (Proposition 1). Moreover, we show with a lower bound (Theorem 2) that no algorithm can achieve a regret smaller than on all cooperative semi-bandit instances. Thus, our Coop-FTPL is at most a multiplicative factor of -away from the minimax result.
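To illustrate the core reoccurrence-time idea behind Geometric Resampling in its simplest single-agent form, here is a minimal sketch (an illustration only; the function name and truncation cap are assumptions, not the paper's notation).

```python
import random

def reoccurrence_time(p: float, cap: int) -> int:
    """Count Bernoulli(p) trials until the first success, truncated at `cap`.
    The count is a (truncated) geometric random variable whose expectation,
    (1 - (1 - p)**cap) / p, approaches 1/p as the cap grows, so it serves as
    an (almost) unbiased estimate of the inverse probability 1/p."""
    for k in range(1, cap + 1):
        if random.random() < p:
            return k
    return cap

# Quick sanity check: the empirical mean should be close to 1/p for large caps.
p, cap = 0.3, 100
samples = [reoccurrence_time(p, cap) for _ in range(100_000)]
print(sum(samples) / len(samples), 1 / p)
```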
To the best of our knowledge, ours is the first computationally efficient near-optimal learning algorithm for the problem of cooperative learning with nonstochastic combinatorial bandits, where not all agents are necessarily active at all time steps.
2 Related work and further applications
Single-agent combinatorial bandits find applications in several fields, such as path planning, ranking and matching problems, finding minimum-weight spanning trees, cut sets, and multitask bandits. An efficient algorithm for this setting is Follow-The-Perturbed-Leader (FTPL), which was first proposed by Hannan [1957] and later rediscovered by Kalai and Vempala [2005]. Neu and Bartók [2013] show that combining FTPL with a loss estimation procedure called Geometric Resampling (GR) leads to a computationally efficient solution for this problem. More precisely, the solution is efficient given that the offline optimization problem of finding
| (2) |
admits a computationally efficient algorithm. This assumption is minimal, in the sense that if the offline problem in Eq. (2) is hard to approximate, then any algorithm with low regret must also be inefficient (a slight relaxation in this direction would be assuming that Eq. (2) can be approximated accurately and efficiently). Grötschel et al. [2012] and Lee et al. [2018] give some sufficient conditions for the validity of this assumption. They essentially rely on having an efficient membership oracle for the convex hull of the decision set and an evaluation oracle for the linear function to optimize. Audibert et al. [2014] note that Online Stochastic Mirror Descent (OSMD) or Follow The Regularized Leader (FTRL)-type algorithms can be efficiently implemented by convex programming if the convex hull of the decision set can be described by a polynomial number of constraints. Suehiro et al. [2012] investigate the details of such efficient implementations and design an algorithm whose time complexity may still be infeasible in practice. Methods based on the exponential weighting of each decision vector can be implemented efficiently only in a handful of special cases; see, e.g., [Koolen et al., 2010] and [Cesa-Bianchi and Lugosi, 2012] for some examples.
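To make the oracle assumption in Eq. (2) concrete, here is a minimal sketch for one common special case, the m-sets (binary vectors with exactly m ones); the decision set, the function name, and the use of NumPy are illustrative choices, not part of the paper.

```python
import numpy as np

def m_set_oracle(loss: np.ndarray, m: int) -> np.ndarray:
    """Linear minimization oracle over the m-sets: return the binary vector
    with exactly m ones that minimizes the inner product with `loss`.
    The minimizer switches on the m coordinates with the smallest losses,
    so the oracle runs in O(d log d) time."""
    action = np.zeros_like(loss)
    action[np.argsort(loss)[:m]] = 1.0
    return action

# Example: with the per-coordinate losses below and m = 2, the oracle picks
# the two cheapest coordinates (indices 1 and 3).
print(m_set_oracle(np.array([0.9, 0.1, 0.7, 0.2, 0.5]), m=2))
```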
The study of cooperative nonstochastic online learning on networks was pioneered by Awerbuch and Kleinberg [2008], who investigated a bandit setting in which the communication graph is a clique, agents belong to clusters characterized by the same loss, and some agents may be non-cooperative. In our multi-agent setting, the end goal is to control the total network regret (1). This objective was already studied by Cesa-Bianchi et al. [2019a] in the full-information case. A similar line of work was pursued by Cesa-Bianchi et al. [2019b], where the authors consider networks of learning agents that cooperate to solve the same nonstochastic bandit problem. In their setting, all agents are simultaneously active at all time steps, and the feedback propagates throughout the network with a maximum delay of time steps, where is a parameter of the proposed algorithm. The authors introduce a cooperative version of Exp3 that they call Exp3-COOP with regret of order where is the number of arms in the nonstochastic bandit problem, is the total number of agents in the network, and is the independence number of the -th power of the communication network. The case corresponds to information that arrives with one round of delay and communication limited to first neighbors. In this setting, Exp3-COOP has regret of order . Thus, our work can be seen as an extension of this setting to the case of combinatorial bandits with stochastic activation of agents. Finally, we point out that if the network consists of a single node, our cooperative setting collapses into a single-agent combinatorial semi-bandit problem. In particular, when the number of arms is and , this becomes the well-known adversarial multiarmed bandit problem (see [Auer et al., 2002]). Hence, ours is a proper generalization of all the settings mentioned above. These cooperative problems are also studied in the stochastic setting (see, e.g., Martínez-Rubio et al. [2019]).
Finally, the reader may wonder what kind of results could be achieved if the agents were activated adversarially rather than stochastically. Cesa-Bianchi et al. [2019a] showed that in this setting no learning can occur, not even with full-information feedback.
3 Cooperative semi-bandit setting
In this section, we present our cooperative semi-bandit protocol and we introduce all relevant definitions and notation.
We say that is a communication network over agents if it is an undirected graph over a set with cardinality , whose elements we refer to as agents. Without loss of generality, we assume that . For any agent, we denote its neighborhood as the set consisting of the agent itself together with all agents adjacent to it. The independence number of the network is the largest cardinality of an independent set, where an independent set is a subset of agents no two of which are neighbors.
We study the following cooperative online combinatorial optimization protocol. Initially, hidden from the agents, the environment draws a sequence of subsets of agents, which we call active, and a sequence of loss vectors . We assume that each agent has a probability of being activated, which need only be known by the agent itself. The set of active agents at time is then determined by drawing, for each agent, a Bernoulli random variable with bias equal to its activation probability, independently of the past, and it consists exclusively of the agents whose Bernoulli draw is equal to one. The decision set is a subset of , for some . A toy simulation of one round of this protocol is sketched after the numbered steps below.
For each time step :
1. each active agent makes a prediction in the decision set (possibly drawn at random);
2. each neighbor of an active agent receives the semi-bandit feedback defined in Eq. (3);
3. each agent receives some local information from its neighbors;
4. the system incurs the loss of the predictions of the active agents.
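The following toy simulation illustrates one round of this protocol; the ring network, the activation probabilities, and all function names are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, d = 5, 8
q = rng.uniform(0.2, 0.9, size=n_agents)                    # activation probabilities
neighbors = {v: {v, (v - 1) % n_agents, (v + 1) % n_agents}  # ring network (incl. the agent itself)
             for v in range(n_agents)}

def one_round(predict, loss_vector):
    """One protocol round: agents activate independently, active agents
    predict a binary action, the system pays the corresponding losses, and
    every neighbor of an active agent receives the semi-bandit feedback."""
    active = np.flatnonzero(rng.random(n_agents) < q)
    total_loss, feedback = 0.0, {}
    for v in active:
        a_v = predict(v)                                     # binary vector in the decision set
        total_loss += float(a_v @ loss_vector)
        for w in neighbors[v]:                               # first neighbors observe the played
            feedback.setdefault(w, []).append((a_v, a_v * loss_vector))  # components and their losses
    return total_loss, feedback

# Toy usage: random predictions with roughly 3 non-zero entries each.
loss, fb = one_round(lambda v: (rng.random(d) < 3 / d).astype(float), rng.random(d))
```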
The goal is to minimize the expected network regret as a function of the time horizon , defined by
| (4) |
where the expectation is taken with respect to the draws of and (possibly) the randomization of the learners. In the next sections we will also denote by the probability conditioned on the history up to and including round , and by the corresponding expectation.
The nature of the local information exchanged by neighbors of active agents will be clarified in the next section. In short, they share succinct representations of the current state of their local prediction model.
4 Coop-FTPL and upper bound
In this section we introduce and analyze our efficient Coop-FTPL algorithm for cooperative online combinatorial optimization.
4.1 The Coop-FTPL algorithm
Coop-FTPL (Algorithm 1) takes as input a decision set , a time horizon , a learning rate , a truncation parameter , and an exploration distribution . At each time step , all active agents make an FTPL prediction with an i.i.d. perturbation sampled from the exploration distribution, then they receive some feedback and share it with their first neighbors. Afterwards, each agent that received some feedback in this round requests from its neighbors a vector of geometric random samples, which is efficiently computed by Algorithm 2 and will be described in detail later. With these geometric samples, each agent computes an estimated loss and updates its cumulative loss estimate. This estimator is described in detail in Section 4.3.
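As a rough illustration of the two local operations just described, here is a minimal single-agent sketch; the Laplace perturbation matches the choice made in Theorem 1, but the function names, the oracle interface, and the exact form of the update are assumptions made for illustration only.

```python
import numpy as np

def ftpl_prediction(L_hat_v, eta, oracle, rng):
    """Perturbed-leader prediction of one agent: draw an i.i.d. Laplace
    perturbation and call the linear optimization oracle on the perturbed
    cumulative loss estimate."""
    Z = rng.laplace(size=L_hat_v.shape)
    return oracle(L_hat_v - Z / eta)

def update_loss_estimate(L_hat_v, observed, losses, K, beta):
    """Semi-bandit loss estimate: each observed component's loss is scaled by
    the truncated reoccurrence-time estimate K of the inverse observation
    probability (computed by the cooperative geometric resampling routine),
    then added to the running cumulative estimate."""
    ell_hat = np.where(observed, np.minimum(K, beta) * losses, 0.0)
    return L_hat_v + ell_hat
```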
4.2 Reduction to OSMD
Before describing the loss estimator and the geometric samples it relies on, we make a connection between FTPL and the Online Stochastic Mirror Descent algorithm (OSMD) that will be crucial for our analysis (for a brief overview of some key convex analysis and OSMD facts, see Appendices A and B; for a similar approach in the single-agent case, see [Lattimore and Szepesvári, 2020]).
Fix any time step and an agent . As we mentioned above, if is active, it makes the following FTPL prediction (line 1)
where is sampled i.i.d. from (the random perturbations introduce the exploration, which for an appropriate choice of is sufficient to guarantee small regret). On the other hand, given a Legendre potential with , an OSMD algorithm makes the prediction
where is the Bregman divergence induced by and is the convex hull of . Using the fact that , the above can be computed in a standard way by studying when the gradient of its argument is equal to zero, and proceeding inductively, we obtain the two identities . By duality, this implies that . We now want to relate and so that
| (5) |
where the conditional expectation (given the history up to time ) is taken with respect to . Thus, in order to view FTPL as an instance of OSMD, it suffices to find a Legendre potential with such that . In order to satisfy this condition, we need that for any , the Fenchel conjugate of enjoys . Then, we define for any , where is chosen to be an arbitrary maximizer if multiple maximizers exist. From convex analysis, if the convex hull of had a smooth boundary, then the support function of would satisfy . For combinatorial bandits, is non-smooth, but, being a density with respect to Lebesgue measure, one can prove (see, e.g., Lattimore and Szepesvári [2020]) that , for all . This shows that FTPL can be interpreted as OSMD with a potential defined implicitly by its Fenchel conjugate
Thus, recalling (5), we can think of the update of OSMD as the average of a random component-wise draw (for all ), with respect to a distribution on defined in terms of the distribution of , as
where is the probability conditioned on the history up to time .
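To make the FTPL-as-OSMD correspondence sketched above concrete, the standard single-agent form of the argument (following Lattimore and Szepesvári [2020]) can be written as
\[
A_t = \operatorname*{argmin}_{a\in\mathcal A}\Big\langle a,\ \widehat L_{t-1}-\tfrac{1}{\eta}Z_t\Big\rangle,
\qquad
\bar A_t := \mathbb E_t\big[A_t\big] = \nabla\Phi^*\big(-\widehat L_{t-1}\big),
\qquad
\Phi^*(x) = \mathbb E\Big[\max_{a\in\mathcal A}\big\langle a,\ x+\tfrac{1}{\eta}Z\big\rangle\Big],
\]
where the learning rate, the perturbation, the cumulative loss estimate, and the implicit potential are generic placeholders that may differ by constants from the paper's notation; the second identity follows by differentiating under the expectation, which is valid whenever ties happen with probability zero.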
4.3 An efficient estimator
To help follow the definitions and analyses of the loss estimator and of the geometric samples it relies on, we introduce three useful lemmas on geometric distributions. We defer their proofs to Appendix C.
Lemma 1.
Let be independent random variables such that each has a geometric distribution with parameter . Then, the random variable has a geometric distribution with parameter .
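Lemma 1 is the standard fact that the minimum of independent geometric random variables with parameters p_1, ..., p_n is itself geometric with parameter 1 - (1 - p_1)···(1 - p_n). A quick Monte Carlo check of this fact (an illustration only, not part of the paper) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.5, 0.35])                  # parameters of the independent geometrics
n_samples = 200_000

# Draw the minimum of the independent geometrics (support {1, 2, ...}).
mins = rng.geometric(p, size=(n_samples, p.size)).min(axis=1)

# Lemma 1 predicts a geometric with parameter 1 - prod(1 - p_i).
p_min = 1.0 - np.prod(1.0 - p)
print(mins.mean(), 1.0 / p_min)                 # empirical vs. theoretical mean
print((mins == 1).mean(), p_min)                # P(min = 1) should equal p_min
```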
Lemma 2.
Let be a geometric random variable with parameter and . Then, the expectation of the random variable satisfies .
Lemma 3.
For all , fix two arbitrary numbers . Consider a collection of independent Bernoulli random variables such that and for any and all . Then, the random variables defined for all by are all independent and they have a geometric distribution with parameter .
Fix now any time step , agent , and component . The loss estimator depends on the algorithmic definition of in Algorithm 2, where . By Lemma 3, we have that for any , conditionally on the history up to time , the random variable has a truncated geometric distribution with success probability equal to and truncation parameter . The idea is that using it as an estimator we can reconstruct the inverse of the probability that the agent observes the relevant component at time . The truncation parameter is not needed for the analysis; it is used only to optimize the computational efficiency of the algorithm. (Previously known cooperative algorithms for limited-feedback settings need to exchange at least two real numbers: the probability according to which predictions are drawn and the loss. Instead of the probability, we only need to pass the integer , which requires at most bits (order of , when tuned). Note also that for the loss, we could exchange an approximation of using only bits. Indeed, one can show that Lemma 4, in this case, remains true when replacing with in the first equality. Everything else works the same in the proof of Theorem 1 up to an extra (negligible) term.)
The loss estimator of is then defined as
| (6) |
where
| (7) |
and given the history up to time , for each , the family consists of independent geometric random variables with parameter . Note that the geometric random variables are actually never computed by Algorithm 2, which efficiently computes only their truncations , with truncation parameter . Nevertheless, as will be apparent later, they are a useful tool for the theoretical analysis. Note that, by Eq. (5), we have
therefore
where the last identity follows by Lemma 1. Moreover from Lemmas 1 and 2, we have
The following key lemma gives an upper bound on the expected estimated loss.
Lemma 4.
For any time , component , agents , and truncation parameter , the expectation of the loss estimator in (6), given the history up to time , satisfies
4.4 Analysis
We can finally state our upper bound on the regret of Coop-FTPL. The key idea is to apply OSMD techniques to our FTPL algorithm, as explained in Section 4.2. The proof proceeds by splitting the regret of each agent in the network into three terms. The first two are treated with standard techniques; the first one depends on the diameter of with respect to the regularizer and the second one on the Bregman divergence of consecutive updates. The last term is related to the bias of the estimator and is analyzed leveraging the lemmas on geometric distributions introduced in Section 4.3. Then, these terms are summed, each with a weight corresponding to the agent's activation probability, and this sum is controlled using results that relate a sum of weights over the nodes of a graph with the independence number of the graph.
Theorem 1.
If is the Laplace density , , and , then the regret of our Coop-FTPL algorithm (Algorithm 1) satisfies
In particular, tuning the parameters as follows
| (8) |
yields
We now present a detailed sketch of the proof of our main result (full proof in Appendix D).
Sketch of the proof.
For the sake of convenience, we define the expected individual regret of an agent in the network with respect to a fixed action by
where the expectation is taken with respect to the internal randomization of the agent, but not to its activation probability . With this definition the total regret on the network in Eq. (4) can be decomposed as
| (9) |
The proof then proceeds by isolating the bias in the loss estimators. For each we have
Exploiting the analogy that we established between FTPL and OSMD, we begin by using the standard bound for the regret of OSMD in the first term of the previous equation. For the reader’s convenience, we restate it in Appendix B, Theorem 4. This leads to
The three terms are studied separately and in detail in Appendix D. Here, we provide a sketch of the bounds.
For the first term , we use the fact that the regularizer satisfies, for all ,
| (10) |
which follows by the definition of , the properties of the perturbation distribution, and the fact that for any . One can also show that for all , and this, combined with the previous equation, leads to
For the second term we have
| (11) |
where the first equality is a standard property of the Bregman divergence (see Theorem 3 in Appendix A), the second follows from the definitions of the updates, and the last by Taylor’s theorem, where , for some . The estimates of the entries of the Hessian are nontrivial (but tedious); the interested reader can find them in Appendix D. Exploiting our assumption that , we get, for all ,
Plugging this estimate in Eq. (11) yields
where the last inequality follows by neglecting the truncation with . Hence multiplying by and summing over yields
which is rewritten as
where in the first equality we defined and, analogously, , while the second follows by the conditional independence of the three terms , , and given the history up to time . Furthermore, making use of Lemmas 1–3 and upper bounding, we get:
where the first equality uses the expected value of the geometric random variables , the first inequality is obtained neglecting the indicator function and taking the conditional expectation of , and the last inequality follows by a known upper bound involving independence numbers appearing, for example in Cesa-Bianchi et al. [2019a, b]. For the sake of convenience, we add this result to Appendix E, Lemma 6. We now consider the last term . Since by Lemma 4, we have
Multiplying by and summing over the agents, we now upper bound with and use the facts that for all and for all , to obtain
where the last inequality follows by a known upper bound involving independence numbers appearing, for example in [Alon et al., 2017, Lemma 10]. For the sake of convenience, we add this result to Appendix E, Lemma 5. Putting everything together and recalling that we can finally conclude that
∎
We conclude this section by discussing the computational complexity of our Coop-FTPL algorithm. The next result shows that the total number of elementary operations performed by Coop-FTPL over time steps scales with in the worst case. To the best of our knowledge, no known algorithm attains a lower worst-case computational complexity.
Proposition 1.
Proof.
The result follows immediately by noting that the number of elementary operations performed by each agent at each time step is at most
∎
5 Lower bound
In this section we show that no cooperative semi-bandit algorithm can beat the rate. The key idea for constructing the lower bound is simple: if the activation probabilities are non-zero only for agents belonging to an independent set with cardinality , then the problem is reduced to independent instances of single-agent semi-bandits, whose minimax rate is known.
Theorem 2.
For each communication network with independence number there exist cooperative semi-bandit instances for which the regret of any learning algorithm satisfies
Proof.
Let be an independent set with cardinality . Furthermore, let be a positive probability and for all agents , let
In words, only agents belonging to an independent set with largest cardinality are activated (with positive probability), and all with the same probability. Thus, only agents in contribute to the expected regret and their total mass is equal to . Moreover, note that being non-adjacent, agents in never exchange any information. Each agent is therefore running an independent single-agent online linear optimization problem with semi-bandit feedback for an average of rounds. Since for single-agent semi-bandits, the worst-case lower bound on the regret after time steps is known to be (see, e.g., Audibert et al. [2014], Lattimore et al. [2018]) and the cardinality of is , the regret of any cooperative semi-bandit algorithm run on this instance satisfies
where we used . This concludes the proof. ∎
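A minimal sketch of this hard-instance construction is given below; it uses networkx's greedy maximal independent set as a stand-in (the proof uses a maximum independent set, which is NP-hard to compute in general), assumes the agents are labeled 0, ..., n-1, and all names are illustrative.

```python
import networkx as nx
import numpy as np

def lower_bound_activations(G: nx.Graph, q0: float) -> np.ndarray:
    """Activation probabilities of the hard instance in Theorem 2: agents in
    an independent set are activated with probability q0, all other agents
    are never activated. Since no two activated agents are ever neighbors,
    no feedback is ever shared, and each activated agent faces an independent
    single-agent semi-bandit problem."""
    independent_set = nx.maximal_independent_set(G, seed=0)   # greedy maximal set
    q = np.zeros(G.number_of_nodes())
    q[list(independent_set)] = q0
    return q

# Toy usage on a 10-node cycle (independence number 5).
print(lower_bound_activations(nx.cycle_graph(10), q0=0.3))
```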
In the previous section we showed that the expected regret of our Coop-FTPL algorithm can always be upper bounded by (ignoring constants). Thus, Theorem 2 shows that, up to the additive term inside the rightmost bracket, the regret of Coop-FTPL is at most -away from the minimax optimal rate.
6 Conclusions and open problems
Motivated by spatially distributed large-scale learning systems, we introduced a new cooperative setting for adversarial semi-bandits in which only some of the agents are active at any given time step. We designed and analyzed an efficient algorithm that we called Coop-FTPL, for which we proved near-optimal regret guarantees with state-of-the-art computational complexity costs. Our analysis relies on the fact that agents are aware of their activation probabilities, and they have some prior knowledge about the connectivity of the graph. Two interesting new lines of research are investigating whether either of these assumptions can be lifted while retaining low regret and good computational complexity. In particular, removing the need for prior knowledge of the independence number would represent a significant theoretical and practical improvement, given that computing is NP-hard in the worst case. It is however unclear if existing techniques that address this problem in similar settings (e.g., Cesa-Bianchi et al. [2019b]) would yield any results in our general case. We believe that entirely new ideas will be required to deal with this issue. We leave these intriguing problems open for future work.
Acknowledgements
This project has received funding from the French “Investing for the Future – PIA3” program under the Grant agreement ANITI ANR-19-PI3A-0004 (https://aniti.univ-toulouse.fr/).
References
- Neu and Bartók [2013] Gergely Neu and Gábor Bartók. An efficient algorithm for learning with semi-bandit feedback. In International Conference on Algorithmic Learning Theory, pages 234–248. Springer, 2013.
- Hannan [1957] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
- Kalai and Vempala [2005] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- Grötschel et al. [2012] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
- Lee et al. [2018] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pages 1292–1294. PMLR, 2018.
- Audibert et al. [2014] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
- Suehiro et al. [2012] Daiki Suehiro, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Kiyohito Nagano. Online prediction under submodular constraints. In International Conference on Algorithmic Learning Theory, pages 260–274. Springer, 2012.
- Koolen et al. [2010] Wouter M Koolen, Manfred K Warmuth, Jyrki Kivinen, et al. Hedging structured concepts. In COLT, pages 93–105. Citeseer, 2010.
- Cesa-Bianchi and Lugosi [2012] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
- Awerbuch and Kleinberg [2008] Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.
- Cesa-Bianchi et al. [2019a] Nicolò Cesa-Bianchi, Tommaso R Cesari, and Claire Monteleoni. Cooperative online learning: Keeping your neighbors updated. arXiv preprint arXiv:1901.08082, 2019a.
- Cesa-Bianchi et al. [2019b] Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research, 20(1):613–650, 2019b.
- Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Martínez-Rubio et al. [2019] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 4529–4540. Curran Associates, Inc., 2019.
- Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- Alon et al. [2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
- Lattimore et al. [2018] Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. Toprank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3945–3954, 2018.
Appendix A Legendre functions and Fenchel conjugates
In this section, we briefly recall a few known definitions and facts in convex analysis.
Definition 1 (Interior, boundary, and convex hull).
For any subset of , we denote its topological interior by , its boundary by , and its convex hull by .
Definition 2 (Effective domain).
The effective domain of a convex function is
| (12) |
With a slight abuse of notation, we will denote with the same symbol a convex function and its restriction to its effective domain.
Definition 3 (Legendre function).
A convex function is Legendre if
-
1.
is non-empty;
-
2.
is differentiable and strictly convex on ;
-
3.
for all , if , then .
Definition 4 (Fenchel conjugate).
Let be a convex function. The Fenchel conjugate of is defined as the function
Definition 5 (Bregman divergence).
Let be a convex function with non-empty that is differentiable on . The Bregman divergence induced by is
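For the reader's convenience, the standard form of this definition and two classical examples (added here as an illustration) are
\[
D_F(x,y) = F(x) - F(y) - \langle \nabla F(y),\, x-y\rangle,
\qquad
F(x)=\tfrac12\|x\|_2^2 \;\Rightarrow\; D_F(x,y)=\tfrac12\|x-y\|_2^2,
\qquad
F(x)=\sum_i x_i\ln x_i \;\Rightarrow\; D_F(x,y)=\sum_i x_i\ln\tfrac{x_i}{y_i}-\sum_i(x_i-y_i).
\]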
The following results are taken from [Lattimore and Szepesvári, 2020, Theorem 26.6 and Corollary 26.8].
Theorem 3.
Let be a Legendre function. Then:
-
1.
the Fenchel conjugate of is Legendre;
-
2.
is bijective with inverse ;
-
3.
, for all .
Corollary 1.
If is a Legendre function and , then .
Appendix B Online Stochastic Mirror Descent (OSMD)
In this section, we briefly recall the standard Online Stochastic Mirror Descent algorithm (OSMD) (Algorithm 3) and its analysis.
For an overview of some basic convex analysis definitions and results, we refer the reader to Appendix A above. For a convex function that is differentiable on the non-empty interior of its effective domain , we denote by the Bregman divergence induced by (Definition 5). Following the existing convention, we refer to the input function of OSMD as a potential. Furthermore, given a measure on a subset of , we say that a vector is the mean of the measure if is the component-wise expectation of a -valued random variable with distribution . For any time step , we denote by the expectation conditioned on the history up to and including round .
It is known that since is convex and compact, , and is Legendre, all the ’s exist in Algorithm 3 and for all (see, e.g., [Lattimore and Szepesvári, 2020, Exercise 28.2]).
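For intuition, the simplest concrete instance of an OSMD update is sketched below: the negative-entropy potential over the probability simplex, which recovers exponential weights. This is an illustration only, since the paper instead works on the convex hull of the combinatorial decision set with the potential implicitly defined by the FTPL perturbation.

```python
import numpy as np

def osmd_step_negative_entropy(x, loss_estimate, eta):
    """One OSMD step on the probability simplex with the (unnormalized)
    negative-entropy potential: the dual gradient step multiplies each
    coordinate by exp(-eta * loss), and the Bregman projection back onto
    the simplex reduces to a renormalization (exponential weights)."""
    y = x * np.exp(-eta * loss_estimate)   # gradient step in the dual space
    return y / y.sum()                     # Bregman projection onto the simplex

# Toy usage: uniform start, one observed loss estimate.
x = np.full(4, 0.25)
print(osmd_step_negative_entropy(x, np.array([1.0, 0.0, 0.5, 0.0]), eta=0.7))
```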
The following result is taken from [Lattimore and Szepesvári, 2020, Theorem 28.10] and gives an upper bound on the regret of OSMD.
Appendix C Proofs of lemmas on geometric distributions
In this section we provide all missing proofs on geometric distributions that we stated in Section 4.
Proof of Lemma 1. For all , the cumulative distribution function (c.d.f.) of is given, for all , by
The c.d.f. of is given, for all , by
and this is the c.d.f. of a geometric random variable with parameter . ∎
Proof of Lemma 2.
Proof of Lemma 3. The proof follows immediately from the fact that is a collection of independent Bernoulli random variables with expectation for any and all . ∎
Appendix D Proof of Theorem 1
In this section, we present a complete proof of Theorem 1.
Proof of Theorem 1. For the sake of convenience, we define the expected individual regret of an agent in the network with respect to a fixed action by
where the expectation is taken with respect to the internal randomization of the agent, but not its activation probability . With this definition the total regret on the network in Eq. (4) can be decomposed as
| (13) |
The proof then proceeds by isolating the bias in the loss estimators. For each we get
where the inequality follows by the standard analysis of OSMD. We bound the three terms separately. For the first term , we have
| (14) |
where the first inequality follows by choosing , the second follows from Hölder’s inequality and for any , and the last equality is Exercise 30.4 in Lattimore and Szepesvári [2020]. It follows that
where we use the fact that for all and this follows from the first line of Eq. (14) by the convexity of the maximum of a finite number of linear functions, using Jensen’s inequality and the fact that the random variable is centered. Thus
We now study the second term We have
| (15) |
where the first equality is a standard property of the Bregman divergence, the second follows from the definitions of the updates, and the last by Taylor’s theorem, where , for some . To calculate the Hessian we use a change of variable to avoid applying the gradient to the (possibly) non-differentiable and we get:
Using the definition of and the fact that is nonnegative,
where the last inequality follows by and . Plugging this estimate in Eq. (15) yields
where the last inequality follows by neglecting the truncation with . Hence multiplying by and summing over yields
which is rewritten as
where in the first equality we defined and, analogously, , while the second follows by the conditional independence of the three terms , , and given the history up to time . Furthermore, making use of Lemmas 1–3, we get
where the first equality uses the expected value of the geometric random variables , the first inequality is obtained neglecting the indicator function , the following equality uses the expected value of the geometric random variables , the second inequality follows by Lemma 6. We now consider the last term . Since , from Lemma 4, we have
Multiplying by and summing over the agents, we can now upper bound with ; then we use the facts that for and that for all , to obtain
where the last inequality follows by Lemma 5. Putting everything together and recalling that , we can conclude, thanks to Eq. (13), that for every we have
∎
Appendix E Bounds on independence numbers
The two following lemmas provide upper bounds of sums of weights over nodes of a graph expressed in terms of its independence number.
Lemma 5.
Let be an undirected graph with independence number , , and for all . Then
Proof.
Initialize , fix , and denote . For fix and shrink until . Since is undirected, the number of times that an action can be picked this way is upper bounded by . Denoting , this implies
concluding the proof. ∎
Lemma 6.
Let be an undirected graph with independence number . For each , let be the neighborhood of node (including itself), and . Then, for all ,
Proof.
Fix and set for brevity . We can write
and proceed by upper bounding the two terms (I) and (II) separately. Let be the cardinality of . We have, for any given ,
The equality is due to the fact that the minimum is achieved when for all , and the inequality comes from (for ). Hence
As for (II), using the inequality , with , we can write
In turn, because in term (II), we can use the inequality , which holds when , with , thereby concluding that