
Near-Optimal No-Regret Learning in General Games

Constantinos Daskalakis
MIT CSAIL
costis@csail.mit.edu
Supported by NSF Awards CCF-1901292, DMS-2022448 and DMS-2134108, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, by a DSTA grant, and by the DOE PhILMs project (No. DE-AC05-76RL01830).
   Maxwell Fishelson
MIT CSAIL
maxfish@mit.edu
   Noah Golowich
MIT CSAIL
nzg@mit.edu
Supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship.
Abstract

We show that Optimistic Hedge – a common variant of multiplicative-weights-updates with recency bias – attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O(T^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias [SALS15], and the $O(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games [CP20]. A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac{1}{T}\right)$.

1 Introduction

Online learning has a long history that is intimately related to the development of game theory, convex optimization, and machine learning. One of its earliest instantiations can be traced to Brown’s proposal [Bro49] of fictitious play as a method to solve two-player zero-sum games. Indeed, as shown by [Rob51], when the players of a (zero-sum) matrix game use fictitious play to iteratively update their actions in response to each other’s history of play, the resulting dynamics converge in the following sense: the product of the empirical distributions of strategies for each player converges to the set of Nash equilibria of the game, though the rate of convergence is now known to be exponentially slow [DP14]. Moreover, such convergence to Nash equilibria fails in non-zero-sum games [Sha64].

The slow convergence of fictitious play to Nash equilibria in zero-sum matrix games, and its non-convergence in general-sum games, can be mitigated by appealing to the pioneering works [Bla54, Han57] and the ensuing literature on no-regret learning [CBL06]. It is known that if both players of a zero-sum matrix game experience regret at most $\varepsilon(T)$, the product of the players’ empirical distributions of strategies is an $O(\varepsilon(T)/T)$-approximate Nash equilibrium. More generally, if each player of a general-sum, multi-player game experiences regret at most $\varepsilon(T)$, the empirical distribution of joint strategies converges to a coarse correlated equilibrium of the game at a rate of $O(\varepsilon(T)/T)$. (In general-sum games, it is typical to focus on proving convergence rates for weaker types of equilibrium than Nash, such as coarse correlated equilibria, since finding Nash equilibria is PPAD-complete [DGP06, CDT09].) Importantly, a multitude of online learning algorithms, such as the celebrated Hedge and Follow-The-Perturbed-Leader algorithms, guarantee adversarial regret $O(\sqrt{T})$ [CBL06]. Thus, when such algorithms are employed by all players in a game, their $O(\sqrt{T})$ regret implies convergence to coarse correlated equilibria (and Nash equilibria of matrix games) at a rate of $O(1/\sqrt{T})$.

While standard no-regret learners guarantee $O(\sqrt{T})$ regret for each player in a game, the players can do better by employing specialized no-regret learning procedures. Indeed, it was established by [DDK11] that there exists a somewhat complex no-regret learner based on Nesterov’s excessive gap technique [Nes05] which guarantees $O(\log T)$ regret to each player of a two-player zero-sum game. This represents an exponential improvement over the regret guaranteed by standard no-regret learners. More generally, [SALS15] established that if the players of a multi-player, general-sum game use any algorithm from the family of Optimistic Mirror Descent (MD) or Optimistic Follow-the-Regularized-Leader (FTRL) algorithms (which are analogues of the MD and FTRL algorithms, respectively, with recency bias), each player enjoys regret that is $O(T^{1/4})$. This was recently improved by [CP20] to $O(T^{1/6})$ in the special case of two-player games in which the players use Optimistic Hedge, a particularly simple representative of both the Optimistic MD and Optimistic FTRL families.

The above results for general-sum games represent significant improvements over the $O(\sqrt{T})$ regret attainable by standard no-regret learners, but are not as dramatic as the logarithmic regret that has been shown attainable by no-regret learners, albeit more complex ones, in 2-player zero-sum games (e.g., [DDK11]). Indeed, despite extensive work on no-regret learning, understanding the optimal regret that can be guaranteed by no-regret learning algorithms in general-sum games has remained elusive. This question is especially intriguing in light of experiments suggesting that polylogarithmic regret should be attainable [SALS15, HAM21]. In this paper we settle this question by showing that no-regret learners can guarantee polylogarithmic regret to each player in general-sum multi-player games. Moreover, this regret is attainable by a particularly simple algorithm – Optimistic Hedge:

Table 1: Overview of prior work on fast rates for learning in games. $m$ denotes the number of players, and $n$ denotes the number of actions per player (assumed to be the same for all players). For Optimistic Hedge, the adversarial regret bounds in the right-hand column are obtained via a choice of adaptive step-sizes. The $\tilde{O}(\cdot)$ notation hides factors that are polynomial in $\log T$.

Algorithm | Setting | Regret in games | Adversarial regret
Hedge (& many other algs.) | multi-player, general-sum | $O(\sqrt{T\log n})$ [CBL06] | $O(\sqrt{T\log n})$ [CBL06]
Excessive Gap Technique | 2-player, 0-sum | $O(\log n(\log T+\log^{3/2}n))$ [DDK11] | $O(\sqrt{T\log n})$ [DDK11]
DS-OptMD, OptDA | 2-player, 0-sum | $\log^{O(1)}(n)$ [HAM21] | $\sqrt{T\log^{O(1)}(n)}$ [HAM21]
Optimistic Hedge | multi-player, general-sum | $O(\log n\cdot\sqrt{m}\cdot T^{1/4})$ [RS13b, SALS15] | $\tilde{O}(\sqrt{T\log n})$ [RS13b, SALS15]
Optimistic Hedge | 2-player, general-sum | $O(\log^{5/6}n\cdot T^{1/6})$ [CP20] | $\tilde{O}(\sqrt{T\log n})$
Optimistic Hedge | multi-player, general-sum | $O(\log n\cdot m\cdot\log^{4}T)$ (Theorem 3.1) | $\tilde{O}(\sqrt{T\log n})$ (Corollary D.1)

Theorem 1.1 (Abbreviated version of Theorem 3.1).

Suppose that $m$ players play a general-sum multi-player game, with a finite set of $n$ strategies per player, over $T$ rounds. Suppose also that each player uses Optimistic Hedge to update her strategy in every round, as a function of the history of play so far. Then each player experiences $O(m\cdot\log n\cdot\log^{4}T)$ regret.

An immediate corollary of Theorem 1.1 is that the empirical distribution of play is an $O\left(\frac{m\log n\log^{4}T}{T}\right)$-approximate coarse correlated equilibrium (CCE) of the game. We remark that Theorem 1.1 bounds the total regret experienced by each player of the multi-player game, which is the most standard regret objective for no-regret learning in games and which is essential to achieve convergence to CCE. For the looser objective of the average of all players’ regrets, [RS13b] established an $O(\log n)$ bound for Optimistic Hedge in two-player zero-sum games, and [SALS15] generalized this bound to $O(m\log n)$ in $m$-player general-sum games. Note that since some players may experience negative regret [HAM21], the average of the players’ regrets cannot in general be used to bound the maximum regret experienced by any individual player. Finally, we remark that several results in the literature posit no-regret learning as a model of agents’ rational behavior; for instance, [Rou09, ST13, RST17] show that no-regret learners in smooth games enjoy strong Price-of-Anarchy bounds. By showing that each agent can obtain very small regret in games by playing Optimistic Hedge, Theorem 1.1 strengthens the plausibility of the common assumption made in this literature that each agent will choose to use such a no-regret algorithm.

1.1 Related work

Table 1 summarizes the prior works that aim to establish optimal regret bounds for no-regret learners in games. We remark that [CP20] shows that the regret of Hedge is $\Omega(\sqrt{T})$ even in 2-player games where each player has 2 actions, meaning that optimism is necessary to obtain fast rates. The table also includes a recent result of [HAM21] showing that when the players in a 2-player zero-sum game with $n$ actions per player use a variant of Optimistic Hedge with adaptive step size (a special case of their algorithms DS-OptMD and OptDA), each player has $\log^{O(1)}n$ regret. The techniques of [HAM21] differ substantially from ours: the result in [HAM21] is based on showing that the joint strategies $x^{(t)}$ rapidly converge, pointwise, to a Nash equilibrium $x^{\star}$. Such a result seems very unlikely to extend to our setting of general-sum games, since finding an approximate Nash equilibrium even in 2-player games is PPAD-complete [CDT09]. We also remark that the earlier work [KHSC18] shows that each player’s regret is at most $O(\log T\cdot\log n)$ when they use a certain algorithm based on Optimistic MD in 2-player zero-sum games; their technique is heavily tailored to 2-player zero-sum games, relying on the notion of duality in that setting.

[FLL+16] shows that one can obtain fast rates in games for a broader class of algorithms (e.g., including Hedge) if one adopts a relaxed (approximate) notion of optimality. [WL18] uses optimism to obtain adaptive regret bounds for bandit problems. Many recent papers (e.g., [DP19, GPD20, LGNPw21, HAM21, WLZL21, AIMM21]) have studied the last-iterate convergence of algorithms from the Optimistic Mirror Descent family, which includes Optimistic Hedge. Finally, a long line of papers (e.g., [HMcW+03, DFP+10, KLP11, BCM12, PP16, BP18, MPP18, BP19, CP19, VGFL+20]) has studied the dynamics of learning algorithms in games. Essentially none of these papers uses optimism, and many of them show non-convergence (e.g., divergence or recurrence) of the iterates of various learning algorithms, such as FTRL and Mirror Descent, when used in games.

2 Preliminaries

Notation.

For a positive integer $n$, let $[n]:=\{1,2,\ldots,n\}$. For a finite set $\mathcal{S}$, let $\Delta(\mathcal{S})$ denote the space of distributions on $\mathcal{S}$. For $\mathcal{S}=[n]$, we will write $\Delta^{n}:=\Delta(\mathcal{S})$ and interpret elements of $\Delta^{n}$ as vectors in $\mathbb{R}^{n}$. For a vector $v\in\mathbb{R}^{n}$ and $j\in[n]$, we denote the $j$th coordinate of $v$ as $v(j)$. For vectors $v,w\in\mathbb{R}^{n}$, write $\langle v,w\rangle=\sum_{j=1}^{n}v(j)w(j)$. The base-2 logarithm of $x>0$ is denoted $\log x$.

No-regret learning in games.

We consider a game $G$ with $m\in\mathbb{N}$ players, where player $i\in[m]$ has action space $\mathcal{A}_{i}$ with $n_{i}:=|\mathcal{A}_{i}|$ actions. We may assume that $\mathcal{A}_{i}=[n_{i}]$ for each player $i$. The joint action space is $\mathcal{A}:=\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{m}$. The specification of the game $G$ is completed by a collection of loss functions $\mathcal{L}_{1},\ldots,\mathcal{L}_{m}:\mathcal{A}\rightarrow[0,1]$. For an action profile $a=(a_{1},\ldots,a_{m})\in\mathcal{A}$ and $i\in[m]$, $\mathcal{L}_{i}(a)$ is the loss player $i$ experiences when each player $i^{\prime}\in[m]$ plays $a_{i^{\prime}}$. A mixed strategy $x_{i}\in\Delta(\mathcal{A}_{i})$ for player $i$ is a distribution over $\mathcal{A}_{i}$, with the probability of playing action $j\in\mathcal{A}_{i}$ given by $x_{i}(j)$. Given a mixed strategy profile $x=(x_{1},\ldots,x_{m})$ (or an action profile $a=(a_{1},\ldots,a_{m})$) and a player $i\in[m]$, we let $x_{-i}$ (or $a_{-i}$, respectively) denote the profile after removing the $i$th mixed strategy $x_{i}$ (or the $i$th action $a_{i}$, respectively).

The $m$ players play the game $G$ for a total of $T$ rounds. At the beginning of each round $t\in[T]$, each player $i$ chooses a mixed strategy $x_{i}^{(t)}\in\Delta(\mathcal{A}_{i})$. The loss vector of player $i$, denoted $\ell_{i}^{(t)}\in[0,1]^{n_{i}}$, is defined as $\ell_{i}^{(t)}(j)=\mathbb{E}_{a_{-i}\sim x_{-i}^{(t)}}[\mathcal{L}_{i}(j,a_{-i})]$. As a matter of convention, set $\ell_{i}^{(0)}=\mathbf{0}$ to be the all-zeros vector. We consider the full-information setting in this paper, meaning that player $i$ observes its full loss vector $\ell_{i}^{(t)}$ for each round $t$. Finally, player $i$ experiences a loss of $\langle\ell_{i}^{(t)},x_{i}^{(t)}\rangle$. The goal of each player $i$ is to minimize its regret, defined as: $\operatorname{Reg}_{i,T}:=\sum_{t\in[T]}\langle x_{i}^{(t)},\ell_{i}^{(t)}\rangle-\min_{j\in[n_{i}]}\sum_{t\in[T]}\ell_{i}^{(t)}(j)$.
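For concreteness, this regret can be computed directly from a player's sequence of mixed strategies and loss vectors; a minimal sketch in Python (function name ours):

```python
import numpy as np

def regret(strategies, losses):
    """Regret of one player after T rounds.

    strategies: (T, n) array; row t is the mixed strategy x^{(t)}.
    losses:     (T, n) array; row t is the loss vector ell^{(t)}.
    Returns sum_t <x^{(t)}, ell^{(t)}> minus the cumulative loss
    of the best fixed action in hindsight.
    """
    incurred = float(np.sum(strategies * losses))      # sum of expected losses
    best_fixed = float(np.min(losses.sum(axis=0)))     # best single action
    return incurred - best_fixed
```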

Optimistic Hedge.

The Optimistic Hedge algorithm chooses mixed strategies for player $i\in[m]$ as follows: at time $t=1$, it sets $x_{i}^{(1)}=(1/n_{i},\ldots,1/n_{i})$ to be the uniform distribution on $\mathcal{A}_{i}$. Then for all $t<T$, player $i$’s strategy at iteration $t+1$ is defined as follows, for $j\in[n_{i}]$:

$$x_{i}^{(t+1)}(j):=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot(2\ell_{i}^{(t)}(j)-\ell_{i}^{(t-1)}(j)))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot(2\ell_{i}^{(t)}(k)-\ell_{i}^{(t-1)}(k)))}. \tag{1}$$

Optimistic Hedge is a modification of Hedge, which performs the updates $x_{i}^{(t+1)}(j):=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(j))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(k))}$. The update (1) modifies the Hedge update by replacing the loss vector $\ell_{i}^{(t)}$ with a predictor of the following iteration’s loss vector, $\ell_{i}^{(t)}+(\ell_{i}^{(t)}-\ell_{i}^{(t-1)})$. Hedge corresponds to FTRL with a negative entropy regularizer (see, e.g., [Bub15]), whereas Optimistic Hedge corresponds to Optimistic FTRL with a negative entropy regularizer [RS13b, RS13a].
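Both updates are one line of code each; the sketch below (our own naming) makes the recency bias visible: when $\ell_{i}^{(t)}=\ell_{i}^{(t-1)}$, the predictor $2\ell_{i}^{(t)}-\ell_{i}^{(t-1)}$ reduces to $\ell_{i}^{(t)}$ and the two updates coincide.

```python
import numpy as np

def hedge_step(x, loss, eta):
    """Plain Hedge: reweight by exp(-eta * loss) and renormalize."""
    w = x * np.exp(-eta * loss)
    return w / w.sum()

def optimistic_hedge_step(x, loss, loss_prev, eta):
    """Optimistic Hedge update (1): reweight by the predicted next
    loss 2*loss - loss_prev instead of the current loss alone."""
    w = x * np.exp(-eta * (2.0 * loss - loss_prev))
    return w / w.sum()
```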

Distributions & divergences.

For distributions $P,Q$ on a finite domain $[n]$, the KL divergence between $P,Q$ is $\operatorname{KL}(P;Q)=\sum_{j=1}^{n}P(j)\cdot\log\left(\frac{P(j)}{Q(j)}\right)$. The chi-squared divergence between $P,Q$ is $\chi^{2}(P;Q)=\sum_{j=1}^{n}Q(j)\cdot\left(\frac{P(j)}{Q(j)}\right)^{2}-1=\sum_{j=1}^{n}\frac{(P(j)-Q(j))^{2}}{Q(j)}$. For a distribution $P$ on $[n]$ and a vector $v\in\mathbb{R}^{n}$, we write $\operatorname{Var}_{P}(v):=\sum_{j=1}^{n}P(j)\cdot\left(v(j)-\sum_{k=1}^{n}P(k)v(k)\right)^{2}$. Also define $\|v\|_{P}:=\sqrt{\sum_{j=1}^{n}P(j)\cdot v(j)^{2}}$. If further $P$ has full support, then define $\|v\|_{P}^{\star}=\sqrt{\sum_{j=1}^{n}\frac{v(j)^{2}}{P(j)}}$. The above notations will often be used when $P$ is the mixed strategy $x_{i}$ of some player $i$ and $v$ is a loss vector $\ell_{i}$; in such a case the norms $\|v\|_{P}$ and $\|v\|_{P}^{\star}$ are often called local norms.
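These quantities translate directly into code; a small reference implementation (names ours, base-2 logarithm per the notation above):

```python
import numpy as np

def kl(p, q):
    """KL divergence (base-2 log, per the paper's convention)."""
    return float(np.sum(p * np.log2(p / q)))

def chi_sq(p, q):
    """Chi-squared divergence between distributions p and q."""
    return float(np.sum((p - q) ** 2 / q))

def var(p, v):
    """Var_P(v): variance of the vector v under distribution p."""
    mean = float(np.dot(p, v))
    return float(np.dot(p, (v - mean) ** 2))

def local_norm(p, v):
    """The local norm ||v||_P."""
    return float(np.sqrt(np.dot(p, v ** 2)))

def dual_local_norm(p, v):
    """The dual local norm ||v||_P^*, defined when p has full support."""
    return float(np.sqrt(np.sum(v ** 2 / p)))
```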

3 Results

Below we state our main theorem, which shows that when all players in a game play according to Optimistic Hedge with appropriate step size, they all experience polylogarithmic individual regrets.

Theorem 3.1 (Formal version of Theorem 1.1).

There are constants $C,C^{\prime}>1$ so that the following holds. Suppose a time horizon $T\in\mathbb{N}$ and a game $G$ with $m$ players and $n_{i}$ actions for each player $i\in[m]$ are given. Suppose all players play according to Optimistic Hedge with any positive step size $\eta\leq\frac{1}{C\cdot m\log^{4}T}$. Then for any $i\in[m]$, the regret of player $i$ satisfies

$$\operatorname{Reg}_{i,T}\leq\frac{\log n_{i}}{\eta}+C^{\prime}\cdot\log T. \tag{2}$$

In particular, if the players’ step size is chosen as $\eta=\frac{1}{C\cdot m\log^{4}T}$, then the regret of player $i$ satisfies

$$\operatorname{Reg}_{i,T}\leq O\left(m\cdot\log n_{i}\cdot\log^{4}T\right). \tag{3}$$

A common goal in the literature on learning in games is to obtain an algorithm that achieves fast rates when played by all players, while each player $i$ still obtains the optimal rate of $O(\sqrt{T})$ in the adversarial setting (i.e., when $i$ receives an arbitrary sequence of losses $\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}$). We show in Corollary D.1 (in the appendix) that this is possible by running Optimistic Hedge with an adaptive step size. Table 1 compares the regret bounds discussed in this section to those of prior work.

4 Proof overview

In this section we overview the proof of Theorem 3.1; the full proof may be found in the appendix.

4.1 New adversarial regret bound

The first step in the proof of Theorem 3.1 is to prove a new regret bound (Lemma 4.1 below) for Optimistic Hedge that holds for an adversarial sequence of losses. We will show in later sections that when all players play according to Optimistic Hedge, the right-hand side of the regret bound (4) is bounded by a quantity that grows only poly-logarithmically in $T$.

Lemma 4.1.

There is a constant $C>0$ so that the following holds. Suppose any player $i\in[m]$ follows the Optimistic Hedge updates (1) with step size $\eta<1/C$, for an arbitrary sequence of losses $\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in[0,1]^{n_{i}}$. Then

$$\operatorname{Reg}_{i,T}\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left(\frac{\eta}{2}+C\eta^{2}\right)\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\right)-\sum_{t=1}^{T}\frac{(1-C\eta)\eta}{2}\cdot\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right). \tag{4}$$

The detailed proof of Lemma 4.1 can be found in Section A, but we sketch the main steps here. The starting point is a refinement of [RS13a, Lemma 3] (stated as Lemma A.5), which gives an upper bound for $\operatorname{Reg}_{i,T}$ in terms of local norms corresponding to each of the iterates $x_{i}^{(t)}$ of Optimistic Hedge. The bound involves the difference between the Optimistic Hedge iterates $x_{i}^{(t)}$ and iterates $\tilde{x}_{i}^{(t)}$ defined by $\tilde{x}_{i}^{(t)}(j)=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(j)-\ell_{i}^{(t-1)}(j)))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(k)-\ell_{i}^{(t-1)}(k)))}$:

$$\operatorname{Reg}_{i,T}\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left\|x_{i}^{(t)}-\tilde{x}_{i}^{(t)}\right\|_{x_{i}^{(t)}}^{\star}\sqrt{\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\right)}-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)}). \tag{5}$$

We next show (in Lemma A.2) that $\operatorname{KL}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})$ and $\operatorname{KL}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})$ may be lower bounded by $(1/2-O(\eta))\cdot\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})$ and $(1/2-O(\eta))\cdot\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})$, respectively. It is a standard fact that the KL divergence between two distributions is upper bounded by the chi-squared divergence between them; by contrast, Lemma A.2 exploits the fact that $x_{i}^{(t)}$, $\tilde{x}_{i}^{(t)}$, and $\tilde{x}_{i}^{(t-1)}$ are close to each other to show a reverse inequality. Finally, exploiting the exponential-weights-style functional relationship between $x_{i}^{(t)}$ and $\tilde{x}_{i}^{(t-1)}$, we show (in Lemma A.3) that the $\chi^{2}$-divergence $\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})$ may be lower bounded by $(1-O(\eta))\cdot\eta^{2}\cdot\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)$, leading to the term $\frac{(1-C\eta)\eta}{2}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)$ being subtracted in (4). The $\chi^{2}$-divergence $\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})$, as well as the term $\left\|x_{i}^{(t)}-\tilde{x}_{i}^{(t)}\right\|_{x_{i}^{(t)}}^{\star}$ in (5), are bounded in a similar manner to obtain (4).
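The reverse KL–chi-squared comparison behind Lemma A.2 is, at heart, a second-order phenomenon: for nearby distributions, $\operatorname{KL}(P;Q)\approx\frac{1}{2}\chi^{2}(P;Q)$ (in nats). A quick numerical illustration of this approximation (not the lemma itself, which tracks the $O(\eta)$ error precisely; the perturbation below is our own toy example):

```python
import numpy as np

def kl_nats(p, q):
    # Natural-log KL, under which the (1/2) * chi-squared approximation is stated.
    return float(np.sum(p * np.log(p / q)))

def chi_sq(p, q):
    return float(np.sum((p - q) ** 2 / q))

q = np.array([0.3, 0.5, 0.2])
p = q + 1e-3 * np.array([1.0, -1.0, 0.0])  # tiny perturbation, still a distribution

ratio = kl_nats(p, q) / chi_sq(p, q)       # close to 1/2 when p is near q
```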

4.2 Finite differences

Given Lemma 4.1, in order to establish Theorem 3.1, it suffices to show Lemma 4.2 below. Indeed, (6) below implies that the right-hand side of (4) is bounded above by $\frac{\log n_{i}}{\eta}+\eta\cdot O(\log^{5}T)$, which is in turn bounded above by $O(m\log n_{i}\log^{4}T)$ for the choice $\eta=\Theta\left(\frac{1}{m\cdot\log^{4}T}\right)$ of Theorem 3.1. (Notice that the factor $\frac{1}{2}$ in (6) is not important for this argument – any constant less than 1 would suffice.)

Lemma 4.2 (Abbreviated; detailed version in Section C.3).

Suppose all players play according to Optimistic Hedge with step size $\eta$ satisfying $1/T\leq\eta\leq\frac{1}{Cm\cdot\log^{4}T}$ for a sufficiently large constant $C$. Then for any $i\in[m]$, the losses $\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in\mathbb{R}^{n_{i}}$ for player $i$ satisfy:

$$\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\right)\leq\frac{1}{2}\cdot\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)+O\left(\log^{5}T\right). \tag{6}$$

The definition below allows us to streamline our notation when proving Lemma 4.2.

Definition 4.1 (Finite differences).

Suppose $L=(L^{(1)},\ldots,L^{(T)})$ is a sequence of vectors $L^{(t)}\in\mathbb{R}^{n}$. For integers $h\geq 0$, the order-$h$ finite difference sequence for the sequence $L$, denoted by $\operatorname{D}_{h}L$, is the sequence $\operatorname{D}_{h}L:=((\operatorname{D}_{h}L)^{(1)},\ldots,(\operatorname{D}_{h}L)^{(T-h)})$ defined recursively as: $(\operatorname{D}_{0}L)^{(t)}:=L^{(t)}$ for all $1\leq t\leq T$, and

$$(\operatorname{D}_{h}L)^{(t)}:=(\operatorname{D}_{h-1}L)^{(t+1)}-(\operatorname{D}_{h-1}L)^{(t)} \tag{7}$$

for all $h\geq 1$, $1\leq t\leq T-h$. (We remark that while Definition 4.1 is stated for a 1-indexed sequence $L^{(1)},L^{(2)},\ldots$, we will also occasionally consider 0-indexed sequences $L^{(0)},L^{(1)},\ldots$, in which case the same recursive definition (7) holds for the finite differences $(\operatorname{D}_{h}L)^{(t)}$, $t\geq 0$.)

Remark 4.3.

Notice that another way of writing (7) is: $\operatorname{D}_{h}L=\operatorname{D}_{1}\operatorname{D}_{h-1}L$. We also remark for later use that $(\operatorname{D}_{h}L)^{(t)}=\sum_{s=0}^{h}\binom{h}{s}(-1)^{h-s}L^{(t+s)}$.
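Definition 4.1 and the closed form in Remark 4.3 can be cross-checked mechanically; a short sketch (0-indexed sequences, names ours):

```python
import numpy as np
from math import comb

def finite_diff(L, h):
    """Order-h finite difference sequence D_h L (Definition 4.1).
    L is a (T, n) array; the result has T - h rows."""
    D = np.asarray(L, dtype=float)
    for _ in range(h):
        D = D[1:] - D[:-1]
    return D

def finite_diff_closed_form(L, h, t):
    """(D_h L)^{(t)} via the binomial formula of Remark 4.3."""
    L = np.asarray(L, dtype=float)
    return sum(comb(h, s) * (-1) ** (h - s) * L[t + s] for s in range(h + 1))
```

Order-$h$ differences annihilate polynomial sequences of degree below $h$, which is one intuition for why they shrink for slowly-varying loss sequences.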

Let $H=\log T$, where $T$ denotes the fixed time horizon from Theorem 3.1 (and thus Lemma 4.2). In the proof of Lemma 4.2, we will bound the finite differences of order $h\leq H$ for certain sequences. The bound (6) of Lemma 4.2 may be rephrased as upper bounding $\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left((\operatorname{D}_{1}\ell_{i})^{(t-1)}\right)$ by $\frac{1}{2}\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)$; to prove this, we proceed in two steps:

1. (Upwards induction step) First, in Lemma 4.4 below, we find an upper bound on $\left\|(\operatorname{D}_{h}\ell_{i})^{(t)}\right\|_{\infty}$ for all $t\in[T]$, $h\geq 0$, which decays exponentially in $h$ for $h\leq H$. This is done via upwards induction on $h$, i.e., first proving the base case $h=0$ using boundedness of the losses $\ell_{i}^{(t)}$ and then $h=1,2,\ldots$ inductively. The main technical tool we develop for the inductive step is a weak form of the chain rule for finite differences, Lemma 4.5. The inductive step uses the fact that all players are following Optimistic Hedge to relate the $h$th order finite differences of player $i$’s loss sequence $\ell_{i}^{(t)}$ to the $h$th order finite differences of the strategy sequences $x_{i^{\prime}}^{(t)}$ for players $i^{\prime}\neq i$; then we use the exponential-weights-style updates of Optimistic Hedge and Lemma 4.5 to relate the $h$th order finite differences of the strategies $x_{i^{\prime}}^{(t)}$ to the $(h-1)$th order finite differences of the losses $\ell_{i^{\prime}}^{(t)}$.

2. (Downwards induction step) We next show that for all $0\leq h\leq H$, $\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left((\operatorname{D}_{h+1}\ell_{i})^{(t-1)}\right)$ is bounded above by $c_{h}\cdot\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left((\operatorname{D}_{h}\ell_{i})^{(t-1)}\right)+\mu_{h}$, for some $c_{h}<1/2$ and $\mu_{h}<O(\log^{5}T)$. This is shown via downwards induction on $h$, namely first establishing the base case $h=H$ by using the result of item 1 for $h=H$ and then treating the cases $h=H-1,H-2,\ldots,0$. The inductive step makes use of the discrete Fourier transform (DFT) to relate the finite differences of different orders (see Lemmas 4.7 and 4.8). In particular, Parseval’s equality, together with a standard relationship between the DFT of the finite differences of a sequence and the DFT of that sequence, allows us to first prove the inductive step in the frequency domain and then transport it back to the original (time) domain.
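The frequency-domain facts invoked in the downwards induction can be illustrated for the circular first difference (the paper's sequences are not circular, so this is only the idealized relation): differencing multiplies the $k$th DFT coefficient by $e^{2\pi ik/T}-1$, and Parseval's equality transfers energy bounds between the two domains.

```python
import numpy as np

T = 16
rng = np.random.default_rng(1)
L = rng.random(T)

# Circular first finite difference: (D_1 L)^{(t)} = L^{(t+1 mod T)} - L^{(t)}.
D1 = np.roll(L, -1) - L

k = np.arange(T)
dft_of_diff = np.fft.fft(D1)
# Differencing acts diagonally in the frequency domain.
predicted = (np.exp(2j * np.pi * k / T) - 1.0) * np.fft.fft(L)
```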

In the following subsections we explain in further detail how the two steps above are completed.

4.3 Upwards induction proof overview

Addressing item 1 in the previous subsection, the lemma below gives a bound on the supremum norm of the $h$th order finite differences of each player’s loss vector, when all players play according to Optimistic Hedge and experience losses according to their loss functions $\mathcal{L}_{1},\ldots,\mathcal{L}_{m}:\mathcal{A}\rightarrow[0,1]$.

Lemma 4.4 (Abbreviated).

Fix a step size $\eta>0$ satisfying $\eta\leq o\left(\frac{1}{m\log T}\right)$. If all players follow Optimistic Hedge updates with step size $\eta$, then for any player $i\in[m]$, integer $h$ satisfying $0\leq h\leq H$, and time step $t\in[T-h]$, it holds that $\|(\operatorname{D}_{h}\ell_{i})^{(t)}\|_{\infty}\leq O(m\eta)^{h}\cdot h^{O(h)}$.

A detailed version of Lemma 4.4, together with its full proof, may be found in Section B.4. We next give a proof overview of Lemma 4.4 for the case of 2 players, i.e., $m=2$; we show in Section B.4 how to generalize this computation to general $m$. Below we introduce the main technical tool in the proof, a “boundedness chain rule,” and then outline how it is used to prove Lemma 4.4.

Main technical tool for Lemma 4.4: boundedness chain rule.

We say that a function $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a softmax-type function if there are real numbers $\xi_{1},\ldots,\xi_{n}$ and some $j\in[n]$ so that for all $(z_{1},\ldots,z_{n})\in\mathbb{R}^{n}$, $\phi((z_{1},\ldots,z_{n}))=\frac{\exp(z_{j})}{\sum_{k=1}^{n}\xi_{k}\cdot\exp(z_{k})}$. Lemma 4.5 below may be interpreted as a “boundedness chain rule” for finite differences. To explain the context for this lemma, recall that given an infinitely differentiable vector-valued function $L:\mathbb{R}\rightarrow\mathbb{R}^{n}$ and an infinitely differentiable function $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}$, the higher order derivatives of the function $\phi(L(t))$ may be computed in terms of those of $L$ and $\phi$ using the chain rule. Lemma 4.5 considers an analogous setting where the input variable $t$ to $L$ is discrete-valued, taking values in $[T]$ (and so we identify the function $L$ with the sequence $L^{(1)},\ldots,L^{(T)}$). In this case, the higher order finite differences of the sequence $L^{(1)},\ldots,L^{(T)}$ (Definition 4.1) take the place of the higher order derivatives of $L$ with respect to $t$. Though there is no generic chain rule for finite differences, Lemma 4.5 states that, at least when $\phi$ is a softmax-type function, we may bound the higher order finite differences of the sequence $\phi(L^{(1)}),\ldots,\phi(L^{(T)})$. In the lemma’s statement we let $\phi\circ L$ denote the sequence $\phi(L^{(1)}),\ldots,\phi(L^{(T)})$.

Lemma 4.5 (“Boundedness chain rule” for finite differences; abbreviated).

Suppose that $h,n\in\mathbb{N}$, $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a softmax-type function, and $L=(L^{(1)},\ldots,L^{(T)})$ is a sequence of vectors in $\mathbb{R}^{n}$ satisfying $\|L^{(t)}\|_{\infty}\leq 1$ for $t\in[T]$. Suppose for some $\alpha\in(0,1)$, for each $0\leq h^{\prime}\leq h$ and $t\in[T-h^{\prime}]$, it holds that $\|(\operatorname{D}_{h^{\prime}}L)^{(t)}\|_{\infty}\leq O(\alpha^{h^{\prime}})\cdot(h^{\prime})^{O(h^{\prime})}$. Then for all $t\in[T-h]$,

$$|(\operatorname{D}_{h}(\phi\circ L))^{(t)}|\leq O(\alpha^{h})\cdot h^{O(h)}.$$

A detailed version of Lemma 4.5 may be found in Section B.3. While Lemma 4.5 requires $\phi$ to be a softmax-type function for simplicity (and this is the only type of function $\phi$ we will need to consider for the case $m=2$), we remark that the detailed version of Lemma 4.5 allows $\phi$ to be from a more general family of analytic functions whose higher order derivatives are appropriately bounded. The proof of Lemma 4.4 for all $m\geq 2$ requires that more general form of Lemma 4.5.

The proof of Lemma 4.5 proceeds by considering the Taylor expansion $P_{\phi}(\cdot)$ of the function $\phi$ at the origin, which we write as follows: for $z=(z_{1},\ldots,z_{n})\in\mathbb{R}^{n}$, $P_{\phi}(z):=\sum_{k\geq 0,\,\gamma\in\mathbb{Z}_{\geq 0}^{n}:\,|\gamma|=k}a_{\gamma}z^{\gamma}$, where $a_{\gamma}\in\mathbb{R}$, $|\gamma|$ denotes the quantity $\gamma_{1}+\cdots+\gamma_{n}$, and $z^{\gamma}$ denotes $z_{1}^{\gamma_{1}}\cdots z_{n}^{\gamma_{n}}$. The fact that $\phi$ is a softmax-type function ensures that the radius of convergence of its Taylor series is at least 1, i.e., $\phi(z)=P_{\phi}(z)$ for any $z$ satisfying $\|z\|_{\infty}\leq 1$. By the assumption that $\|L^{(t)}\|_{\infty}\leq 1$ for each $t$, we may therefore decompose $(\operatorname{D}_{h}(\phi\circ L))^{(t)}$ as:

(Dh(ϕL))(t)=k0,γ0n:|γ|=kaγ(DhLγ)(t),\displaystyle\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}=\sum_{k\geq 0,\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}a_{\gamma}\cdot\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}, (8)

where LγL^{\gamma} denotes the sequence of scalars (Lγ)(t):=(L(t))γ(L^{\gamma})^{(t)}:=(L^{(t)})^{\gamma} for all tt. The fact that ϕ\phi is a softmax-type function allows us to establish strong bounds on |aγ||a_{\gamma}| for each γ\gamma in Lemma B.5. The proof of Lemma B.5 bounds the |aγ||a_{\gamma}| by exploiting the simple form of the derivative of a softmax-type function to decompose each aγa_{\gamma} into a sum of |γ|!|\gamma|! terms. Then we establish a bijection between the terms of this decomposition and graph structures we refer to as factorial trees; that bijection together with the use of an appropriate generating function allows us to complete the proof of Lemma B.5.

Thus, to prove Lemma 4.5, it suffices to bound |(DhLγ)(t)|\left|\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}\right| for all γ\gamma. We do so by using Lemma 4.6.

Lemma 4.6 (Abbreviated; detailed version in Section B.2).

Fix any h0h\geq 0, a multi-index γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n} and set k=|γ|k=|\gamma|. For each of the khk^{h} functions π:[h][k]\pi:[h]\rightarrow[k], and for each r[k]r\in[k], there are integers hπ,r{0,1,,h}h^{\prime}_{\pi,r}\in\{0,1,\ldots,h\}, tπ,r0t^{\prime}_{\pi,r}\geq 0, and jπ,r[n]j^{\prime}_{\pi,r}\in[n], so that the following holds. For any sequence L(1),,L(T)nL^{(1)},\ldots,L^{(T)}\in\mathbb{R}^{n} of vectors, it holds that, for each t[Th]t\in[T-h],

(DhLγ)(t)=π:[h][k]r=1k(Dhπ,r(L(jπ,r)))(t+tπ,r).\displaystyle\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}=\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}\left(\operatorname{D}_{h^{\prime}_{\pi,r}}{(L(j^{\prime}_{\pi,r}))}\right)^{(t+t^{\prime}_{\pi,r})}. (9)

Lemma 4.6 expresses the hhth order finite differences of the sequence LγL^{\gamma} as a sum of khk^{h} terms, each of which is a product of kk finite differences (of various orders) of a sequence L(t)(jπ,r)L^{(t)}(j^{\prime}_{\pi,r}) (i.e., the jπ,rj^{\prime}_{\pi,r}th coordinate of the vectors L(t)L^{(t)}). Crucially, when using Lemma 4.6 to prove Lemma 4.5, the assumption of Lemma 4.5 gives that for each j[n]j^{\prime}\in[n], each h[h]h^{\prime}\in[h], and each t[Th]t^{\prime}\in[T-h^{\prime}], we have the bound |(DhL(j))(t)|O(αh)(h)O(h)\left|\left(\operatorname{D}_{h^{\prime}}{L(j^{\prime})}\right)^{(t^{\prime})}\right|\leq O({\alpha^{h^{\prime}}})\cdot(h^{\prime})^{O(h^{\prime})}. These assumed bounds may be used to bound the right-hand side of (9), which together with Lemma 4.6 and (8) lets us complete the proof of Lemma 4.5.
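The h=1h=1, k=2k=2 instance of Lemma 4.6 is just the discrete product (Leibniz) rule, which splits the first difference of a product into kh=2k^{h}=2 terms, each a product of finite differences at shifted times. A quick check (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.standard_normal(20), rng.standard_normal(20)

d1 = lambda s: s[1:] - s[:-1]   # first finite difference

# D_1(f*g)^{(t)} = (D_1 f)^{(t)} g^{(t)} + f^{(t+1)} (D_1 g)^{(t)}:
# two terms, each a product of (possibly zeroth-order) finite differences
lhs = d1(f * g)
rhs = d1(f) * g[:-1] + f[1:] * d1(g)
assert np.allclose(lhs, rhs)
```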

Proving Lemma 4.4 using the boundedness chain rule.

Next we discuss how Lemma 4.5 is used to prove Lemma 4.4, namely to bound (Dhi)(t)\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\|_{\infty} for each t[Th]t\in[T-h], i[m]i\in[m], and 0hH0\leq h\leq H. Lemma 4.4 is proved using induction, with the base case h=0h=0 being a straightforward consequence of the fact that (D0i)(t)=i(t)1\|\left(\operatorname{D}_{0}{\ell_{i}}\right)^{(t)}\|_{\infty}=\|\ell_{i}^{(t)}\|_{\infty}\leq 1 for all i[m],t[T]i\in[m],t\in[T]. For the rest of this section we focus on the inductive case, i.e., we pick some h[H]h\in[H] and assume Lemma 4.4 holds for all h<hh^{\prime}<h.

The first step is to reduce the claim of Lemma 4.4 to the claim that the upper bound (Dhxi)(t)1O(mη)hhO(h)\|\left(\operatorname{D}_{h}{x_{i}}\right)^{(t)}\|_{1}\leq O\left(m\eta\right)^{h}\cdot h^{O(h)} holds for each t[Th],i[m]t\in[T-h],i\in[m]. Recalling that we are only sketching here the case m=2m=2 for simplicity, this reduction proceeds as follows: for i{1,2}i\in\{1,2\}, define the matrix Ain1×n2A_{i}\in\mathbb{R}^{n_{1}\times n_{2}} by (Ai)a1a2=i(a1,a2)(A_{i})_{a_{1}a_{2}}=\mathcal{L}_{i}(a_{1},a_{2}), for a1[n1],a2[n2]a_{1}\in[n_{1}],a_{2}\in[n_{2}]. We have assumed that all players are using Optimistic Hedge and thus i(t)=𝔼aixi(t),ii[i(a1,,am)]\ell_{i}^{(t)}=\mathbb{E}_{a_{i^{\prime}}\sim x_{i^{\prime}}^{(t)},\ \forall i^{\prime}\neq i}[\mathcal{L}_{i}(a_{1},\ldots,a_{m})]; for our case here (m=2m=2), this may be rewritten as 1(t)=A1x2(t)\ell_{1}^{(t)}=A_{1}x_{2}^{(t)}, 2(t)=A2x1(t)\ell_{2}^{(t)}=A_{2}^{\top}x_{1}^{(t)}. Thus

(Dh1)(t)=A1s=0h(hs)(1)hsx2(t+s)s=0h(hs)(1)hsx2(t+s)1=(Dhx2)(t)1,\displaystyle\|\left(\operatorname{D}_{h}{\ell_{1}}\right)^{(t)}\|_{\infty}=\left\|A_{1}\cdot\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}x_{2}^{(t+s)}\right\|_{\infty}\leq\left\|\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}x_{2}^{(t+s)}\right\|_{1}=\|\left(\operatorname{D}_{h}{x_{2}}\right)^{(t)}\|_{1},

where the first equality is from Remark 4.3 and the inequality follows since all entries of A1A_{1} have absolute value 1\leq 1. A similar computation allows us to show (Dh2)(t)(Dhx1)(t)1\|\left(\operatorname{D}_{h}{\ell_{2}}\right)^{(t)}\|_{\infty}\leq\|\left(\operatorname{D}_{h}{x_{1}}\right)^{(t)}\|_{1}.
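The inequality above is the elementary bound A1vmaxa1,a2|(A1)a1a2|v1\|A_{1}v\|_{\infty}\leq\max_{a_{1},a_{2}}|(A_{1})_{a_{1}a_{2}}|\cdot\|v\|_{1}; a numeric sanity check of the resulting chain (assuming NumPy, with a random payoff matrix and random strategy sequence):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, T, h = 4, 5, 30, 3
A1 = rng.uniform(-1, 1, size=(n1, n2))   # payoff entries of absolute value <= 1
x2 = rng.dirichlet(np.ones(n2), size=T)  # player 2's mixed strategies over time

def fd(seq, h):
    for _ in range(h):
        seq = seq[1:] - seq[:-1]
    return seq

ell1 = x2 @ A1.T                          # ell_1^{(t)} = A_1 x_2^{(t)}
lhs = np.abs(fd(ell1, h)).max(axis=1)     # ||(D_h ell_1)^{(t)}||_inf
rhs = np.abs(fd(x2, h)).sum(axis=1)       # ||(D_h x_2)^{(t)}||_1
assert np.all(lhs <= rhs + 1e-12)
```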

To complete the inductive step it remains to upper bound the quantities (Dhxi)(t)1\|\left(\operatorname{D}_{h}{x_{i}}\right)^{(t)}\|_{1} for i[m],t[Th]i\in[m],t\in[T-h]. To do so, we note that the definition of the Optimistic Hedge updates (1) implies that for any i[m],t[T],j[ni]i\in[m],t\in[T],j\in[n_{i}], and t1t^{\prime}\geq 1, we have

xi(t+t)(j)=xi(t)(j)exp(η(i(t1)(j)s=0t1i(t+s)(j)i(t+t1)(j)))k=1nixi(t)(k)exp(η(i(t1)(k)s=0t1i(t+s)(k)i(t+t1)(k))).\displaystyle x_{i}^{(t+t^{\prime})}(j)=\frac{x_{i}^{(t)}(j)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t-1)}(j)-\sum_{s=0}^{t^{\prime}-1}\ell_{i}^{(t+s)}(j)-\ell_{i}^{(t+t^{\prime}-1)}(j)\right)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t)}(k)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t-1)}(k)-\sum_{s=0}^{t^{\prime}-1}\ell_{i}^{(t+s)}(k)-\ell_{i}^{(t+t^{\prime}-1)}(k)\right)\right)}. (10)

For t[T]t\in[T], t0t^{\prime}\geq 0, set ¯i,t(t):=η(i(t1)s=0t1i(t+s)i(t+t1)).\bar{\ell}_{i,t}^{(t^{\prime})}:=\eta\cdot\left(\ell_{i}^{(t-1)}-\sum_{s=0}^{t^{\prime}-1}\ell_{i}^{(t+s)}-\ell_{i}^{(t+t^{\prime}-1)}\right). Also, for each i[m]i\in[m], j[ni]j\in[n_{i}], t[T]t\in[T], and any vector z=(z(1),,z(ni))niz=(z(1),\ldots,z(n_{i}))\in\mathbb{R}^{n_{i}} define ϕt,i,j(z):=xi(t)(j)exp(z(j))k=1nixi(t)(k)exp(z(k)).\phi_{t,i,j}(z):=\frac{x_{i}^{(t)}(j)\cdot\exp\left(z(j)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t)}(k)\cdot\exp\left(z(k)\right)}. Thus (10) gives that for t1t^{\prime}\geq 1, xi(t+t)(j)=ϕt,i,j(¯i,t(t))x_{i}^{(t+t^{\prime})}(j)=\phi_{t,i,j}(\bar{\ell}_{i,t}^{(t^{\prime})}). Viewing tt as a fixed parameter and letting tt^{\prime} vary, it follows that for h0h\geq 0 and t1t^{\prime}\geq 1, (Dhxi(t+)(j))(t)=(Dh(ϕt,i,j¯i,t))(t)\left(\operatorname{D}_{h}{x_{i}^{(t+\cdot)}(j)}\right)^{(t^{\prime})}=\left(\operatorname{D}_{h}{(\phi_{t,i,j}\circ\bar{\ell}_{i,t})}\right)^{(t^{\prime})}.
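The closed form (10) follows by telescoping the one-step Optimistic Hedge update, i.e., (10) with t=1t^{\prime}=1. The sketch below (assuming NumPy; the loss sequence is synthetic) iterates the one-step rule and compares against (10):

```python
import numpy as np

rng = np.random.default_rng(2)
n, eta = 4, 0.1
ell = rng.uniform(0, 1, size=(13, n))     # synthetic losses ell^{(0)}, ..., ell^{(12)}

def step(x, u):
    # one Optimistic Hedge step: the t' = 1 case of (10)
    w = x * np.exp(eta * (ell[u - 1] - 2 * ell[u]))
    return w / w.sum()

t, tp = 3, 6
xs = {1: np.full(n, 1.0 / n)}
for u in range(1, t + tp):
    xs[u + 1] = step(xs[u], u)

# closed form (10): x^{(t + t')} directly from x^{(t)}
bar = eta * (ell[t - 1] - ell[t : t + tp].sum(axis=0) - ell[t + tp - 1])
w = xs[t] * np.exp(bar)
assert np.allclose(xs[t + tp], w / w.sum())
```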

Recalling that our goal is to bound |(Dhxi(j))(t+1)||\left(\operatorname{D}_{h}{x_{i}(j)}\right)^{(t+1)}| for each tt, we can do so by using Lemma 4.5 with ϕ=ϕt,i,j\phi=\phi_{t,i,j} and α=O(mη)\alpha=O(m\eta), if we can show that its precondition is met, i.e., that (Dh¯i,t)(t)1B1αh(h)B0h\|\left(\operatorname{D}_{h^{\prime}}{\bar{\ell}_{i,t}}\right)^{(t^{\prime})}\|_{\infty}\leq\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}h^{\prime}} for all hhh^{\prime}\leq h, for an appropriate value of α\alpha and appropriate constants B0,B1B_{0},B_{1}. Helpfully, the definition of ¯i,t(t)\bar{\ell}_{i,t}^{(t^{\prime})} as a partial sum allows us to relate the hh^{\prime}-th order finite differences of the sequence ¯i,t(t)\bar{\ell}_{i,t}^{(t^{\prime})} to the (h1)(h^{\prime}-1)-th order finite differences of the sequence i(t)\ell_{i}^{(t)} as follows:

(Dh¯i,t)(t)=η(Dh1i)(t+t1)2η(Dh1i)(t+t).\left(\operatorname{D}_{h^{\prime}}{\bar{\ell}_{i,t}}\right)^{(t^{\prime})}=\eta\cdot\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t+t^{\prime}-1)}-2\eta\cdot\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t+t^{\prime})}. (11)

Since h1<hh^{\prime}-1<h for hhh^{\prime}\leq h, the inductive assumption of Lemma 4.4 gives a bound on the \ell_{\infty}-norm of the terms on the right-hand side of (11), which are sufficient for us to apply Lemma 4.5. Note that the inductive assumption gives an upper bound on (Dh1i)(t)\|\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t)}\|_{\infty} that only scales with αh1\alpha^{h^{\prime}-1}, whereas Lemma 4.5 requires scaling of αh\alpha^{h^{\prime}}. This discrepancy is corrected by the factor of η\eta on the right-hand side of (11), which gives the desired scaling αh\alpha^{h^{\prime}} (since η<α\eta<\alpha for the choice α=O(mη)\alpha=O(m\eta)).
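Identity (11) is a short computation; it can also be verified numerically (assuming NumPy, with a synthetic loss sequence):

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, eta, t, hp = 40, 3, 0.05, 5, 3
ell = rng.uniform(0, 1, size=(T, n))      # synthetic loss vectors

def fd(seq, h):
    for _ in range(h):
        seq = seq[1:] - seq[:-1]
    return seq

# bar_ell_{i,t}^{(t')} as a sequence indexed by t' = 1, 2, ...
tps = np.arange(1, 20)
bar = np.array([eta * (ell[t - 1] - ell[t : t + tp].sum(axis=0) - ell[t + tp - 1])
                for tp in tps])

lhs = fd(bar, hp)                          # (D_{h'} bar_ell)^{(t')}
d = fd(ell, hp - 1)                        # (D_{h'-1} ell)
rhs = np.array([eta * d[t + tp - 1] - 2 * eta * d[t + tp]
                for tp in tps[: len(tps) - hp]])
assert np.allclose(lhs, rhs)               # identity (11)
```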

4.4 Downwards induction proof overview

In this section we discuss in further detail item 2 in Section 4.2; in particular, we will show that there is a parameter μ=Θ~(ηm)\mu=\tilde{\Theta}(\eta m) so that for all integers hh satisfying H1h0H-1\geq h\geq 0,

t=1Th1Varxi(t)((Dh+1i)(t))O(1/H)t=1ThVarxi(t)((Dhi)(t))+O~(μ2h),\sum_{t=1}^{T-h-1}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\left(\operatorname{D}_{h+1}{\ell_{i}}\right)^{(t)}}\right)\leq O(1/H)\cdot\sum_{t=1}^{T-h}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}}\right)+\tilde{O}\left(\mu^{2h}\right), (12)

where O~\tilde{O} hides factors polynomial in logT\log T. The validity of (12) for h=0h=0 implies Lemma 4.2. On the other hand, as long as we choose the value μ\mu in (12) to satisfy μmηHΩ(1)\mu\geq m\eta H^{\Omega(1)}, then Lemma 4.4 implies that t=1THVarxi(t)((DHi)(t))O(μ2H)\sum_{t=1}^{T-H}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\left(\operatorname{D}_{H}{\ell_{i}}\right)^{(t)}}\right)\leq O(\mu^{2H}). This gives that (12) holds for h=H1h=H-1. To show that (12) holds for all H1>h0H-1>h\geq 0, we use downwards induction; fix any hh, and assume that (12) has been shown for all hh^{\prime} satisfying h<hH1h<h^{\prime}\leq H-1. Our main tool in the inductive step is to apply Lemma 4.7 below. To state it, for ζ>0,n\zeta>0,\ n\in\mathbb{N}, we say that a sequence of distributions P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} is ζ\zeta-consecutively close if for each 1t<T1\leq t<T, it holds that max{P(t)P(t+1),P(t+1)P(t)}1+ζ\max\left\{\left\|\frac{P^{(t)}}{P^{(t+1)}}\right\|_{\infty},\left\|\frac{P^{(t+1)}}{P^{(t)}}\right\|_{\infty}\right\}\leq 1+\zeta. (Here, for distributions P,QΔnP,Q\in\Delta^{n}, PQn\frac{P}{Q}\in\mathbb{R}^{n} denotes the vector whose jjth entry is P(j)/Q(j)P(j)/Q(j).) Lemma 4.7 shows that given a sequence of vectors for which the variances of its second-order finite differences are bounded by the variances of its first-order finite differences, a similar relationship holds between its first- and zeroth-order finite differences.

Lemma 4.7.

There is a sufficiently large constant C0>1C_{0}>1 so that the following holds. For any M,ζ,α,μ>0M,\zeta,\alpha,\mu>0 and nn\in\mathbb{N}, suppose that P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} and Z(1),,Z(T)[M,M]nZ^{(1)},\ldots,Z^{(T)}\in[-M,M]^{n} satisfy the following conditions:

  1.

    The sequence P(1),,P(T)P^{(1)},\ldots,P^{(T)} is ζ\zeta-consecutively close for some ζ[1/(2T),α4/C0]\zeta\in[1/(2T),\alpha^{4}/C_{0}].

  2.

    It holds that t=1T2VarP(t)((D2Z)(t))αt=1T1VarP(t)((D1Z)(t))+μ.\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)\leq\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)+\mu.

Then  t=1T1VarP(t)((D1Z)(t))α(1+α)t=1TVarP(t)(Z(t))+μα+C0M2α3.\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\leq\alpha\cdot(1+\alpha)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{\mu}{\alpha}+\frac{C_{0}M^{2}}{\alpha^{3}}.

Given Lemma 4.7, the inductive step for establishing (12) is straightforward: we apply Lemma 4.7 with P(t)=xi(t)P^{(t)}=x_{i}^{(t)} and Z(t)=(Dhi)(t)Z^{(t)}=\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)} for all tt. The fact that the xi(t)x_{i}^{(t)} are updated with Optimistic Hedge may be used to establish that precondition 1 of Lemma 4.7 holds. Since (D1Z)(t)=(Dh+1i)(t)\left(\operatorname{D}_{1}{Z}\right)^{(t)}=\left(\operatorname{D}_{h+1}{\ell_{i}}\right)^{(t)} and (D2Z)(t)=(Dh+2i)(t)\left(\operatorname{D}_{2}{Z}\right)^{(t)}=\left(\operatorname{D}_{h+2}{\ell_{i}}\right)^{(t)}, the inductive hypothesis, namely that (12) holds for h+1h+1, implies that precondition 2 of Lemma 4.7 holds for appropriate α,μ>0\alpha,\mu>0. Thus Lemma 4.7 implies that (12) holds for the value hh, which completes the inductive step.
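Precondition 1 concerns ζ\zeta-consecutively close sequences; a consequence used later in the proof of Lemma 4.7 (Lemma C.1 in the appendix) is that weighted variances under two such consecutive distributions agree up to a factor 1±ζ1\pm\zeta. A numeric check of this fact (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
P = rng.dirichlet(np.ones(n))
Q = P * (1 + 0.01 * rng.uniform(-1, 1, n))
Q = Q / Q.sum()
zeta = max((P / Q).max(), (Q / P).max()) - 1   # closeness parameter of the pair

Z = rng.standard_normal(n)
var = lambda p, z: p @ (z - p @ z) ** 2        # Var_p(z)
ratio = var(P, Z) / var(Q, Z)
assert 1 - zeta - 1e-12 <= ratio <= 1 + zeta + 1e-12
```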

On the proof of Lemma 4.7.

Finally we discuss the proof of Lemma 4.7. One technical challenge is the fact that the vectors P(t)P^{(t)} are not constant functions of tt, but rather change slowly (as constrained by being ζ\zeta-consecutively close). The main tool for dealing with this difficulty is Lemma C.1, which shows that for a ζ\zeta-consecutively close sequence P(t)P^{(t)}, for any vector Z(t)Z^{(t)}, VarP(t)(Z(t))VarP(t+1)(Z(t))[1ζ,1+ζ]\frac{\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)}{\operatorname{Var}_{{P^{(t+1)}}}\left({Z^{(t)}}\right)}\in[1-\zeta,1+\zeta]. This fact, together with some algebraic manipulations, lets us reduce to the case that all P(t)P^{(t)} are equal. It is also relatively straightforward to reduce to the case that P(t),Z(t)=0\langle P^{(t)},Z^{(t)}\rangle=0 for all tt, i.e., so that VarP(t)(Z(t))=Z(t)P(t)2\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)=\left\|{Z^{(t)}}\right\|_{P^{(t)}}^{2}. We may further separate Z(t)P(t)2=j=1nP(t)(j)(Z(t)(j))2\left\|{Z^{(t)}}\right\|_{P^{(t)}}^{2}=\sum_{j=1}^{n}P^{(t)}(j)\cdot(Z^{(t)}(j))^{2} into its individual components P(t)(j)(Z(t)(j))2P^{(t)}(j)\cdot(Z^{(t)}(j))^{2}, and treat each one separately, thus allowing us to reduce to a one-dimensional problem. Finally, we make one further reduction, which is to replace the finite differences Dh()\operatorname{D}_{h}{(\cdot)} in Lemma 4.7 with circular finite differences, defined below:

Definition 4.2 (Circular finite difference).

Suppose L=(L(0),,L(S1))L=(L^{(0)},\ldots,L^{(S-1)}) is a sequence of vectors L(t)nL^{(t)}\in\mathbb{R}^{n}. For integers h0h\geq 0, the level-hh circular finite difference sequence for the sequence LL, denoted by DhL\operatorname{D}^{\circ}_{h}{L}, is the sequence defined recursively as: (D0L)(t)=L(t)\left(\operatorname{D}^{\circ}_{0}{L}\right)^{(t)}=L^{(t)} for all 0t<S0\leq t<S, and

\left(\operatorname{D}^{\circ}_{h}{L}\right)^{(t)}=\begin{cases}\left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(t+1)}-\left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(t)}\quad:0\leq t\leq S-2\\ \left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(0)}-\left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(S-1)}\quad:t=S-1.\end{cases} (13)

Circular finite differences for a sequence L(0),,L(S1)L^{(0)},\ldots,L^{(S-1)} are defined similarly to finite differences (Definition 4.1) except that unlike for finite differences, where (DhL)(Sh),,(DhL)(S1)\left(\operatorname{D}_{h}{L}\right)^{(S-h)},\ldots,\left(\operatorname{D}_{h}{L}\right)^{(S-1)} are not defined, (DhL)(Sh),,(DhL)(S1)\left(\operatorname{D}^{\circ}_{h}{L}\right)^{(S-h)},\ldots,\left(\operatorname{D}^{\circ}_{h}{L}\right)^{(S-1)} are defined by “wrapping around” back to the beginning of the sequence. The above-described reductions, which are worked out in detail in Section C.2, allow us to reduce proving Lemma 4.7 to proving the following simpler lemma:

Lemma 4.8.

Suppose μ\mu\in\mathbb{R}, α>0\alpha>0, and W(0),,W(S1)W^{(0)},\ldots,W^{(S-1)}\in\mathbb{R} is a sequence of reals satisfying

t=0S1((D2W)(t))2αt=0S1((D1W)(t))2+μ.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu. (14)

Then  t=0S1((D1W)(t))2αt=0S1(W(t))2+μ/α.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=0}^{S-1}(W^{(t)})^{2}+\mu/\alpha.
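Definition 4.2 is straightforward to implement with index wrap-around; the sketch below (assuming NumPy) checks that DhL\operatorname{D}^{\circ}_{h}{L} agrees with DhL\operatorname{D}_{h}{L} on the indices where the latter is defined:

```python
import numpy as np

def cfd(L, h):
    # level-h circular finite difference: index t + 1 wraps around mod S
    for _ in range(h):
        L = np.roll(L, -1, axis=0) - L
    return L

def fd(L, h):
    # ordinary finite difference, losing one index per level
    for _ in range(h):
        L = L[1:] - L[:-1]
    return L

rng = np.random.default_rng(5)
S, h = 16, 3
W = rng.standard_normal(S)
assert cfd(W, h).shape == (S,)                    # all S entries are defined
assert np.allclose(cfd(W, h)[: S - h], fd(W, h))  # agreement off the wrap-around
```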

To prove Lemma 4.8, we apply the discrete Fourier transform to both sides of (14) and use the Cauchy-Schwarz inequality in the frequency domain. For a sequence W(0),,W(S1)W^{(0)},\ldots,W^{(S-1)}\in\mathbb{R}, its (discrete) Fourier transform is the sequence W^(0),,W^(S1)\widehat{W}^{(0)},\ldots,\widehat{W}^{(S-1)} defined by W^(s)=t=0S1W(t)e2πistS\widehat{W}^{(s)}=\sum_{t=0}^{S-1}W^{(t)}\cdot e^{-\frac{2\pi ist}{S}}. Below we prove Lemma 4.8 for the special case μ=0\mu=0; we defer the general case to Section C.1.
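Both ingredients of the proof, Parseval's equality and the effect of a circular difference on Fourier coefficients, can be checked directly with a fast Fourier transform (assuming NumPy, whose `np.fft.fft` uses the same sign convention as the definition above):

```python
import numpy as np

rng = np.random.default_rng(6)
S = 32
W = rng.standard_normal(S)
What = np.fft.fft(W)   # hat W^{(s)} = sum_t W^{(t)} e^{-2 pi i s t / S}

# Parseval: sum_s |hat W^{(s)}|^2 = S * sum_t (W^{(t)})^2
assert np.isclose((np.abs(What) ** 2).sum(), S * (W ** 2).sum())

# the DFT of the circular first difference is hat W^{(s)} * (e^{2 pi i s / S} - 1)
D1W = np.roll(W, -1) - W
mult = np.exp(2j * np.pi * np.arange(S) / S) - 1
assert np.allclose(np.fft.fft(D1W), What * mult)
```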

Proof of Lemma 4.8 for special case μ=0\mu=0.

We have the following:

\displaystyle S\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}=\sum_{s=0}^{S-1}\left|\widehat{\operatorname{D}^{\circ}_{1}{W}}^{(s)}\right|^{2}=\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}(e^{2\pi is/S}-1)\right|^{2}\leq\sqrt{\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}}\sqrt{\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}\left|e^{2\pi is/S}-1\right|^{4}},

where the first equality uses Parseval's equality, the second uses Fact C.3 (in the appendix) for h=1h=1, and the inequality uses Cauchy-Schwarz. By Parseval's equality and Fact C.3 for h=2h=2, the right-hand side of the above equals S\cdot\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}}, which, by assumption, is at most S\cdot\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}}. Rearranging terms completes the proof. ∎
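The μ=0\mu=0 argument can be exercised numerically: choose α\alpha so that (14) holds with equality for a random sequence, then check the conclusion (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
S = 24
W = rng.standard_normal(S)

cd1 = np.roll(W, -1) - W          # circular first differences
cd2 = np.roll(cd1, -1) - cd1      # circular second differences

# choose alpha so that hypothesis (14) holds with mu = 0 (with equality)
alpha = (cd2 ** 2).sum() / (cd1 ** 2).sum()
# conclusion of Lemma 4.8 with mu = 0
assert (cd1 ** 2).sum() <= alpha * (W ** 2).sum() + 1e-9
```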

References

  • [AIMM21] Waïss Azizian, Franck Iutzeler, Jérome Malick, and Panayotis Mertikopoulos. The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities. In Conference on Learning Theory, pages 1–32, 2021.
  • [BCM12] Maria-Florina Balcan, Florin Constantin, and Ruta Mehta. The Weighted Majority Algorithm does not Converge in Nearly Zero-sum Games. In ICML Workshop on Markets, Mechanisms, and Multi-Agent Models, 2012.
  • [Bla54] David Blackwell. Controlled Random Walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954.
  • [BP18] James P. Bailey and Georgios Piliouras. Multiplicative Weights Update in Zero-Sum Games. In Proceedings of the 2018 ACM Conference on Economics and Computation - EC ’18, pages 321–338, Ithaca, NY, USA, 2018. ACM Press.
  • [BP19] James P. Bailey and Georgios Piliouras. Fast and furious learning in zero-sum games: Vanishing regret with non-vanishing step sizes. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12977–12987, 2019.
  • [Bro49] George W Brown. Some Notes on Computation of Games Solutions. Technical report, RAND Corporation, Santa Monica, CA, 1949.
  • [Bub15] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Found. Trends Mach. Learn., 8(3–4):231–357, November 2015.
  • [CBL06] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge university press, 2006.
  • [CDT09] Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player nash equilibria. Journal of the ACM (JACM), 56(3):1–57, 2009.
  • [CP19] Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 807–834, 2019.
  • [CP20] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. In Advances in Neural Information Processing Systems, volume 33, pages 18990–18999. Curran Associates, Inc., 2020.
  • [CS04] Imre Csiszár and Paul C. Shields. Information Theory and Statistics: A Tutorial. Commun. Inf. Theory, 1(4):417–528, December 2004.
  • [DDK11] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-Optimal No-Regret Algorithms for Zero-Sum Games. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms (SODA), 2011.
  • [DFP+10] Constantinos Daskalakis, Rafael Frongillo, Christos H. Papadimitriou, George Pierrakos, and Gregory Valiant. On learning algorithms for nash equilibria. In Proceedings of the Third International Conference on Algorithmic Game Theory, SAGT’10, page 114–125, Berlin, Heidelberg, 2010. Springer-Verlag.
  • [DGP06] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a nash equilibrium. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing (STOC), 2006.
  • [DP14] Constantinos Daskalakis and Qinxuan Pan. A counter-example to Karlin’s strong conjecture for fictitious play. In Proceedings of the 55th Annual Symposium on Foundations of Computer Science (FOCS), 2014.
  • [DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 27:1–27:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [FLL+16] Dylan J Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [GKP89] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, 1989.
  • [GPD20] Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [HAM21] Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium. In Conference on Learning Theory, 2021.
  • [Han57] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • [HMcW+03] Sergiu Hart and Andreu Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830–1836, 2003.
  • [KHSC18] Ehsan Asadi Kangarshahi, Ya-Ping Hsieh, Mehmet Fatih Sahin, and Volkan Cevher. Let’s be honest: An optimal no-regret framework for zero-sum games. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2488–2496. PMLR, 10–15 Jul 2018.
  • [KLP11] Robert Kleinberg, Katrina Ligett, and Georgios Piliouras. Beyond the nash equilibrium barrier. In Innovations in Computer Science (ICS), pages 125–140, 2011.
  • [LGNPw21] Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1441–1449. PMLR, 13–15 Apr 2021.
  • [MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’18, page 2703–2717, USA, 2018. Society for Industrial and Applied Mathematics.
  • [Nes05] Yu Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.
  • [PP16] Christos Papadimitriou and Georgios Piliouras. From nash equilibria to chain recurrent sets: Solution concepts and topology. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, page 227–235, New York, NY, USA, 2016. Association for Computing Machinery.
  • [Rob51] Julia Robinson. An Iterative Method of Solving a Game. Annals of mathematics, pages 296–301, 1951.
  • [Rou09] Tim Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, page 513–522, New York, NY, USA, 2009. Association for Computing Machinery.
  • [RS13a] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory, pages 993–1019, 2013.
  • [RS13b] Alexander Rakhlin and Karthik Sridharan. Optimization, Learning, and Games with Predictable Sequences. arXiv:1311.1869 [cs], November 2013. arXiv: 1311.1869.
  • [RST17] Tim Roughgarden, Vasilis Syrgkanis, and Éva Tardos. The price of anarchy in auctions. J. Artif. Int. Res., 59(1):59–101, May 2017.
  • [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [Sha64] L. Shapley. Some Topics in Two-Person Games. Advances in Game Theory, 1964.
  • [ST13] Vasilis Syrgkanis and Eva Tardos. Composable and efficient mechanisms. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, page 211–220, New York, NY, USA, 2013. Association for Computing Machinery.
  • [VGFL+20] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, Thanasis Lianeas, Panayotis Mertikopoulos, and Georgios Piliouras. No-regret learning and mixed nash equilibria: They do not mix. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1380–1391. Curran Associates, Inc., 2020.
  • [WL18] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory, pages 1263–1291, 2018.
  • [WLZL21] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations, 2021.

Appendix A Proofs for Section 4.1

In this section we prove Lemma 4.1. Throughout the section we use the notation of Lemma 4.1: in particular, we assume that any player i[m]i\in[m] follows the Optimistic Hedge updates (1) with step size η>0\eta>0, for an arbitrary sequence of losses i(1),,i(T)\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}.

A.1 Preliminary lemmas

The first few lemmas in this section pertain to vectors P,QΔnP,Q\in\Delta^{n}, for some nn\in\mathbb{N}; note that such vectors P,QP,Q may be viewed as distributions on [n][n]. Let P/QnP/Q\in\mathbb{R}^{n} denote the Radon-Nikodym derivative, i.e., the vector whose jjth component is P(j)/Q(j)P(j)/Q(j).

Lemma A.1.

If P/QA\|P/Q\|_{\infty}\leq A, then χ2(P;Q)Aχ2(Q;P)\chi^{2}({P};{Q})\leq A\cdot\chi^{2}({Q};{P}).

Proof.

The lemma is immediate from the definition of the χ2\chi^{2} divergence:

χ2(P;Q)=j=1n(P(j)Q(j))2Q(j)Aj=1n(P(j)Q(j))2P(j)=Aχ2(Q;P).\displaystyle\chi^{2}({P};{Q})=\sum_{j=1}^{n}\frac{(P(j)-Q(j))^{2}}{Q(j)}\leq A\cdot\sum_{j=1}^{n}\frac{(P(j)-Q(j))^{2}}{P(j)}=A\cdot\chi^{2}({Q};{P}).
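A quick numeric check of Lemma A.1 (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
P = rng.dirichlet(np.ones(n))
Q = rng.dirichlet(np.ones(n))
A = (P / Q).max()                              # ||P/Q||_inf

chi2 = lambda p, q: ((p - q) ** 2 / q).sum()
assert chi2(P, Q) <= A * chi2(Q, P) + 1e-12
```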

It is a standard fact (though one which we do not need in our proofs) that for all P,QΔnP,Q\in\Delta^{n}, KL(P;Q)χ2(P;Q)\operatorname{KL}({P};{Q})\leq\chi^{2}({P};{Q}). The lemma below shows an inequality in the opposite direction when P/Q,Q/P\|P/Q\|_{\infty},\|Q/P\|_{\infty} are bounded:

Lemma A.2.

There is a constant CC so that the following holds. Suppose that for some A32A\leq\frac{3}{2} we have P/QA\left\|P/Q\right\|_{\infty}\leq A and Q/PA\left\|Q/P\right\|_{\infty}\leq A. Then (1/2C(A1))χ2(P;Q)KL(P;Q)(1/2-C(A-1))\cdot\chi^{2}(P;Q)\leq\operatorname{KL}(P;Q).

Proof.

There is a constant C>0C>0 so that for any 0<β1/20<\beta\leq 1/2, for all |x|β|x|\leq\beta, we have

log(1+x)x(1/2+Cβ)x2.\log(1+x)\geq x-(1/2+C\beta)x^{2}.

Set a=A1a=A-1, so that |P(j)/Q(j)1|a|P(j)/Q(j)-1|\leq a for all jj by assumption; note that a1/2a\leq 1/2. Then for C=1/2+3C/2C^{\prime}=1/2+3C/2, we have

\displaystyle\operatorname{KL}(P;Q) =\sum_{j}P(j)\log\frac{P(j)}{Q(j)}
\geq\sum_{j}P(j)\cdot\left(\left(\frac{P(j)}{Q(j)}-1\right)-(1/2+Ca)\left(\frac{P(j)}{Q(j)}-1\right)^{2}\right)
=\chi^{2}(P;Q)-(1/2+Ca)\sum_{j}P(j)\cdot\frac{(P(j)-Q(j))^{2}}{Q(j)^{2}}
\geq\chi^{2}(P;Q)-(1/2+Ca)\cdot A\cdot\chi^{2}(P;Q)
=\left(1-(1/2+Ca)(1+a)\right)\cdot\chi^{2}(P;Q)
=\left(1/2-a\cdot(1/2+C+Ca)\right)\cdot\chi^{2}(P;Q)
\geq(1/2-C^{\prime}a)\cdot\chi^{2}(P;Q),

where the second inequality uses P(j)/Q(j)AP(j)/Q(j)\leq A, so that P(j)(P(j)Q(j))2/Q(j)2A(P(j)Q(j))2/Q(j)P(j)\cdot(P(j)-Q(j))^{2}/Q(j)^{2}\leq A\cdot(P(j)-Q(j))^{2}/Q(j), and the final inequality uses a1/2a\leq 1/2.
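In the regime of Lemma A.2, where AA is close to 1, the Kullback-Leibler divergence is sandwiched near χ2(P;Q)/2\chi^{2}(P;Q)/2 from both sides. A small worked check (the specific distributions are illustrative only):

```python
import numpy as np

P = np.array([0.5, 0.5])
Q = np.array([0.55, 0.45])
A = max((P / Q).max(), (Q / P).max())   # here A - 1 is about 0.11

kl = (P * np.log(P / Q)).sum()
chi2 = ((P - Q) ** 2 / Q).sum()

assert kl <= chi2                       # the standard direction
assert kl >= 0.45 * chi2                # reverse direction, since A is close to 1
```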

The next lemma considers two vectors x,xΔnx,x^{\prime}\in\Delta^{n} which are related by a multiplicative weights-style update with loss vector wnw\in\mathbb{R}^{n}; the lemma relates χ2(x;x)\chi^{2}({x^{\prime}};{x}) to wx2\|w\|_{x}^{2}.

Lemma A.3.

There is a constant C>0C>0 so that the following holds. Suppose that wnw\in\mathbb{R}^{n}, α>0\alpha>0, wα/21/C\|w\|_{\infty}\leq\alpha/2\leq 1/C, and x,xΔnx,x^{\prime}\in\Delta^{n} satisfy, for each j[n]j\in[n],

x(j)=x(j)exp(w(j))k[n]x(k)exp(w(k)).\displaystyle x^{\prime}(j)=\frac{x(j)\cdot\exp(w(j))}{\sum_{k\in[n]}x(k)\cdot\exp(w(k))}. (15)

Then

(1Cα)Varx(w)χ2(x;x)(1+Cα)Varx(w).(1-C\alpha)\cdot\operatorname{Var}_{{x}}\left({w}\right)\leq\chi^{2}(x^{\prime};x)\leq(1+C\alpha)\operatorname{Var}_{{x}}\left({w}\right).
Proof.

Let w=wx,w𝟏w^{\prime}=w-\langle x,w\rangle\mathbf{1}, where 𝟏\mathbf{1} denotes the all-1s vector. Note that Varx(w)=Varx(w)\operatorname{Var}_{{x}}\left({w}\right)=\operatorname{Var}_{{x}}\left({w^{\prime}}\right), and that if we replace ww with ww^{\prime}, (15) remains true. Moreover, w2wα\|w^{\prime}\|_{\infty}\leq 2\|w\|_{\infty}\leq\alpha. Thus, by replacing ww with ww^{\prime}, we may assume from here on that w,x=0\langle w,x\rangle=0 and that wα\|w\|_{\infty}\leq\alpha.

Note that

χ2(x;x)=1+i=1nx(i)(x(i)/x(i))2=1+𝔼(exp(W)𝔼exp(W))2,\chi^{2}(x^{\prime};x)=-1+\sum_{i=1}^{n}x(i)\cdot(x^{\prime}(i)/x(i))^{2}=-1+\mathbb{E}\left(\frac{\exp(W)}{\mathbb{E}\exp(W)}\right)^{2},

where WW is a random variable that takes values w(j)w(j) with probability x(j)x(j). As long as CC is a sufficiently large constant, we have that, for all zz satisfying |z|α|z|\leq\alpha,

1+z+(1Cα)z2/2exp(z)1+z+(1+Cα)z2/2.1+z+(1-C\alpha)z^{2}/2\leq\exp(z)\leq 1+z+(1+C\alpha)z^{2}/2. (16)

Thus, for a sufficiently large constant C0C_{0}^{\prime}, we have, for all zz satisfying |z|α|z|\leq\alpha,

1+2z+(2C0α)z2exp(z)21+2z+(2+C0α)z2.1+2z+(2-C_{0}^{\prime}\alpha)z^{2}\leq\exp(z)^{2}\leq 1+2z+(2+C_{0}^{\prime}\alpha)z^{2}. (17)

Moreover, since 𝔼W=0\mathbb{E}W=0, we have from (16) that 1+(1Cα)𝔼W2/2𝔼exp(W)1+(1+Cα)𝔼W2/21+(1-C\alpha)\mathbb{E}W^{2}/2\leq\mathbb{E}\exp(W)\leq 1+(1+C\alpha)\mathbb{E}W^{2}/2. For a sufficiently large constant C1C_{1}^{\prime} it follows that

1+(1C1α)𝔼W2(𝔼exp(W))21+(1+C1α)𝔼W2.1+(1-C_{1}^{\prime}\alpha)\mathbb{E}W^{2}\leq(\mathbb{E}\exp(W))^{2}\leq 1+(1+C_{1}^{\prime}\alpha)\mathbb{E}W^{2}. (18)

Combining (17) and (18) and again using the fact that 𝔼W=0\mathbb{E}W=0, we get, for some sufficiently large constant C′′C^{\prime\prime}, as long as α<1/C1\alpha<1/C_{1}^{\prime},

(1C′′α)𝔼W2\displaystyle(1-C^{\prime\prime}\alpha)\mathbb{E}W^{2}\leq 1+1+(2C0α)𝔼W21+(1+C1α)𝔼W2\displaystyle-1+\frac{1+(2-C_{0}^{\prime}\alpha)\mathbb{E}W^{2}}{1+(1+C_{1}^{\prime}\alpha)\mathbb{E}W^{2}}
\displaystyle\leq 1+𝔼(exp(W)2)(𝔼exp(W))2\displaystyle-1+\frac{\mathbb{E}(\exp(W)^{2})}{(\mathbb{E}\exp(W))^{2}}
\displaystyle\leq 1+1+(2+C0α)𝔼W21+(1C1α)𝔼W2\displaystyle-1+\frac{1+(2+C_{0}^{\prime}\alpha)\mathbb{E}W^{2}}{1+(1-C_{1}^{\prime}\alpha)\mathbb{E}W^{2}}
\displaystyle\leq (1+C′′α)𝔼W2.\displaystyle(1+C^{\prime\prime}\alpha)\mathbb{E}W^{2}.

By the assumption that w,x=0\langle w,x\rangle=0, we have 𝔼W2=Varx(w)\mathbb{E}W^{2}=\operatorname{Var}_{{x}}\left({w}\right), and thus the above gives the desired result. ∎
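The second-order exponential bounds of the form (16) used in this proof are easy to sanity-check numerically. The sketch below is an illustration only, not part of the argument; the concrete constant C = 2 is our choice and suffices for α ≤ 1:

```python
import math

def check_exp_bounds(alpha, C=2.0, steps=200):
    """Check 1+z+(1-C*a)z^2/2 <= e^z <= 1+z+(1+C*a)z^2/2 for |z| <= a = alpha."""
    for i in range(steps + 1):
        z = -alpha + 2 * alpha * i / steps
        lower = 1 + z + (1 - C * alpha) * z ** 2 / 2
        upper = 1 + z + (1 + C * alpha) * z ** 2 / 2
        assert lower <= math.exp(z) <= upper, (alpha, z)

for a in (0.01, 0.05, 0.1, 0.5, 1.0):
    check_exp_bounds(a)
```

Squaring these bounds (and absorbing lower-order terms into the constant) gives exactly the shape of (17).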

We will need the following standard lemma:

Lemma A.4 ([RS13a], Eq. (26)).

For any nn\in\mathbb{N}, n\ell\in\mathbb{R}^{n}, yΔny\in\Delta^{n}, if it holds that x=argminxΔnx,+KL(x;y)x=\operatorname*{arg\,min}_{x^{\prime}\in\Delta^{n}}\langle x^{\prime},\ell\rangle+\operatorname{KL}({x^{\prime}};{y}), then for any zΔnz\in\Delta^{n},

xz,KL(z;y)KL(z;x)KL(x;y).\displaystyle\langle x-z,\ell\rangle\leq\operatorname{KL}({z};{y})-\operatorname{KL}({z};{x})-\operatorname{KL}({x};{y}).

For t[T]t\in[T], we define the vector x~i(t)Δni\tilde{x}_{i}^{(t)}\in\Delta^{n_{i}} by

x~i(t)(j):=xi(t)(j)exp(η(i(t)(j)i(t1)(j)))k[ni]xi(t)(k)exp(η(i(t)(k)i(t1)(k))).\displaystyle\tilde{x}_{i}^{(t)}(j):=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(j)-\ell_{i}^{(t-1)}(j)))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(k)-\ell_{i}^{(t-1)}(k)))}. (19)

Additionally define x~i(0):=(1/ni,,1/ni)\tilde{x}_{i}^{(0)}:=(1/n_{i},\ldots,1/n_{i}) to be the uniform distribution over [ni][n_{i}].
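The auxiliary sequence (19) is simply a multiplicative-weights reweighting of x_i^{(t)} by the one-step loss difference. The following sketch makes this concrete (illustrative only; the helper name tilde_update is ours):

```python
import math

def tilde_update(x, loss_t, loss_prev, eta):
    """Eq. (19): reweight x by exp(-eta * (loss_t - loss_prev)) and renormalize."""
    w = [xj * math.exp(-eta * (lt - lp)) for xj, lt, lp in zip(x, loss_t, loss_prev)]
    s = sum(w)
    return [wj / s for wj in w]

x = [0.5, 0.3, 0.2]
xt = tilde_update(x, [1.0, 0.0, 0.5], [0.2, 0.1, 0.9], eta=0.1)
assert abs(sum(xt) - 1.0) < 1e-12 and all(v > 0 for v in xt)
# A constant loss difference leaves the distribution unchanged:
u = tilde_update([0.25] * 4, [1.0] * 4, [0.0] * 4, eta=0.3)
assert all(abs(v - 0.25) < 1e-12 for v in u)
```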

The next lemma, Lemma A.5, is very similar to [RS13a, Lemma 3], and is indeed essentially shown in the course of the proof of that lemma. Note that no boundedness assumption is placed on the vectors \ell_{i}^{(t)} in Lemma A.5. For completeness we provide a full proof of the lemma.

Lemma A.5 (Refinement of Lemma 3, [RS13a]).

Suppose that any player i[m]i\in[m] follows the Optimistic Hedge updates (1) with step size η>0\eta>0, for an arbitrary sequence of losses i(1),,i(T)ni\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in\mathbb{R}^{n_{i}}. For any vector xΔnix^{\star}\in\Delta^{n_{i}}, it holds that

t=1Txi(t)x,i(t)logniη+t=1Txi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))1ηt=1TKL(x~i(t);xi(t))1ηt=1TKL(xi(t);x~i(t1)).\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (20)
Proof.

For any xΔnix^{\star}\in\Delta^{n_{i}}, it holds that

xi(t)x,i(t)=xi(t)x~i(t),i(t)i(t1)+xi(t)x~i(t),i(t1)+x~i(t)x,i(t).\displaystyle\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle=\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\rangle+\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t-1)}\rangle+\langle\tilde{x}_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle. (21)

For t[T]t\in[T], set c(t)=xi(t),i(t)i(t1)c^{(t)}=\langle x_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\rangle. Using the definition of the dual norm and the fact xi(t)x~i(t),𝟏=0\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\mathbf{1}\rangle=0, we have

xi(t)x~i(t),i(t)i(t1)=\displaystyle\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\rangle= xi(t)x~i(t),i(t)i(t1)c(t)𝟏\displaystyle\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}-c^{(t)}\mathbf{1}\rangle
\displaystyle\leq xi(t)x~i(t)xi(t)i(t)i(t1)c(t)𝟏xi(t)\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\left\|{\ell_{i}^{(t)}-\ell_{i}^{(t-1)}-c^{(t)}\mathbf{1}}\right\|_{x_{i}^{(t)}}
\displaystyle\leq xi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1)).\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}. (22)

It is immediate from the definitions of x~i(t)\tilde{x}_{i}^{(t)} (in (19)) and xi(t)x_{i}^{(t)} (in (1)) that for j[ni]j\in[n_{i}],

xi(t)(j)=x~i(t1)(j)exp(ηi(t1)(j))k[ni]x~i(t1)(k)exp(ηi(t1)(k))=argminxΔnix,ηi(t1)+KL(x;x~i(t1))\displaystyle x_{i}^{(t)}(j)=\frac{\tilde{x}_{i}^{(t-1)}(j)\cdot\exp(-\eta\cdot\ell_{i}^{(t-1)}(j))}{\sum_{k\in[n_{i}]}\tilde{x}_{i}^{(t-1)}(k)\cdot\exp(-\eta\cdot\ell_{i}^{(t-1)}(k))}=\operatorname*{arg\,min}_{x\in\Delta^{n_{i}}}\left\langle x,\eta\cdot\ell_{i}^{(t-1)}\right\rangle+\operatorname{KL}({x};{\tilde{x}_{i}^{(t-1)}}) (23)

Using Lemma A.4 with x=xi(t),=ηi(t1),y=x~i(t1),z=x~i(t)x=x_{i}^{(t)},\ell=\eta\ell_{i}^{(t-1)},y=\tilde{x}_{i}^{(t-1)},z=\tilde{x}_{i}^{(t)}, we obtain

xi(t)x~i(t),i(t1)1ηKL(x~i(t);x~i(t1))1ηKL(x~i(t);xi(t))1ηKL(xi(t);x~i(t1)).\displaystyle\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t-1)}\rangle\leq\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (24)

Next, we note that, again by (19) and (1), for j[ni]j\in[n_{i}],

x~i(t)(j)=x~i(t1)(j)exp(ηi(t)(j))k[ni]x~i(t1)(k)exp(ηi(t)(k))=argminxΔnix,ηi(t)+KL(x;x~i(t1)).\displaystyle\tilde{x}_{i}^{(t)}(j)=\frac{\tilde{x}_{i}^{(t-1)}(j)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(j))}{\sum_{k\in[n_{i}]}\tilde{x}_{i}^{(t-1)}(k)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(k))}=\operatorname*{arg\,min}_{x\in\Delta^{n_{i}}}\left\langle x,\eta\cdot\ell_{i}^{(t)}\right\rangle+\operatorname{KL}({x};{\tilde{x}_{i}^{(t-1)}}).

Using Lemma A.4 with x=x~i(t),=ηi(t),y=x~i(t1),z=xx=\tilde{x}_{i}^{(t)},\ell=\eta\ell_{i}^{(t)},y=\tilde{x}_{i}^{(t-1)},z=x^{\star}, we obtain

x~i(t)x,i(t)1ηKL(x;x~i(t1))1ηKL(x;x~i(t))1ηKL(x~i(t);x~i(t1)).\displaystyle\langle\tilde{x}_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (25)

By (21), (22), (24), and (25), we have

xi(t)x,i(t)\displaystyle\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq xi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}
+1ηKL(x~i(t);x~i(t1))1ηKL(x~i(t);xi(t))1ηKL(xi(t);x~i(t1))\displaystyle+\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})
+1ηKL(x;x~i(t1))1ηKL(x;x~i(t))1ηKL(x~i(t);x~i(t1))\displaystyle+\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})
=\displaystyle= xi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))+1ηKL(x;x~i(t1))1ηKL(x;x~i(t))\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}+\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t)}})
1ηKL(x~i(t);xi(t))1ηKL(xi(t);x~i(t1)).\displaystyle-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (26)

The statement of the lemma follows by summing (26) over t[T]t\in[T] and using the fact that for any choice of xx^{\star}, KL(x;xi(0))logni\operatorname{KL}({x^{\star}};{x_{i}^{(0)}})\leq\log n_{i}. ∎
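A useful sanity check on identities such as (23) is that the closed-form multiplicative update is the exact minimizer of the linearized loss plus a KL regularizer. The sketch below (illustrative only; the helper names are ours) compares the objective value of the closed-form point against random points of the simplex:

```python
import math, random

def kl(p, q):
    """KL divergence between probability vectors with positive entries."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

def objective(x, ell, y, eta):
    return eta * sum(xj * lj for xj, lj in zip(x, ell)) + kl(x, y)

def closed_form(y, ell, eta):
    """The multiplicative update, as in (23)."""
    w = [yj * math.exp(-eta * lj) for yj, lj in zip(y, ell)]
    s = sum(w)
    return [wj / s for wj in w]

random.seed(0)
y, ell, eta = [0.4, 0.35, 0.25], [0.9, 0.1, 0.5], 0.2
xstar = closed_form(y, ell, eta)
best = objective(xstar, ell, y, eta)
for _ in range(1000):
    z = [random.random() + 1e-9 for _ in range(3)]
    s = sum(z)
    z = [zj / s for zj in z]
    assert objective(z, ell, y, eta) >= best - 1e-12
```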

A.2 Proof of Lemma 4.1

Now we are ready to prove Lemma 4.1. For convenience we restate the lemma.

Lemma 4.1 (restated).

There is a constant C>0C>0 so that the following holds. Suppose any player i[m]i\in[m] follows the Optimistic Hedge updates (1) with step size 0<η<1/C0<\eta<1/C, for an arbitrary sequence of losses i(1),,i(T)[0,1]ni\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in[0,1]^{n_{i}}. Then for any vector xΔnix^{\star}\in\Delta^{n_{i}}, it holds that

t=1Txi(t)x,i(t)logniη+t=1T(η2+Cη2)Varxi(t)(i(t)i(t1))t=1T(1Cη)η2Varxi(t)(i(t1)).\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left(\frac{\eta}{2}+C\eta^{2}\right)\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\sum_{t=1}^{T}\frac{(1-C\eta)\eta}{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right). (27)
Proof.

Lemma A.5 gives that, for any xΔnix^{\star}\in\Delta^{n_{i}},

t=1Txi(t)x,i(t)logniη+t=1Txi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))1ηt=1TKL(x~i(t);xi(t))1ηt=1TKL(xi(t);x~i(t1)).\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (28)

Note that for any vectors x,x^{\prime}\in\Delta^{n_{i}}, if there is a vector \ell\in\mathbb{R}^{n_{i}} so that for all j\in[n_{i}], x^{\prime}(j)=\frac{x(j)\cdot\exp(\eta\cdot\ell(j))}{\sum_{k\in[n_{i}]}x(k)\cdot\exp(\eta\cdot\ell(k))}, we have that

exp(2η)xxexp(2η).\exp(-2\eta\|\ell\|_{\infty})\leq\left\|\frac{x^{\prime}}{x}\right\|_{\infty}\leq\exp(2\eta\|\ell\|_{\infty}).

Therefore, by (19) and (23), respectively, we obtain that, for η1/4\eta\leq 1/4,

exp(2ηi(t)i(t1))\displaystyle\exp(-2\eta\|\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\|_{\infty}) x~i(t)xi(t)exp(2ηi(t)i(t1))exp(4η)1+8η\displaystyle\leq\left\|{\frac{\tilde{x}_{i}^{(t)}}{x_{i}^{(t)}}}\right\|_{\infty}\leq\exp(2\eta\|\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\|_{\infty})\leq\exp(4\eta)\leq 1+8\eta
exp(2ηi(t1))\displaystyle\exp(-2\eta\|\ell_{i}^{(t-1)}\|_{\infty}) xi(t)x~i(t1)exp(2ηi(t1))exp(2η)1+4η.\displaystyle\leq\left\|{\frac{x_{i}^{(t)}}{\tilde{x}_{i}^{(t-1)}}}\right\|_{\infty}\leq\exp(2\eta\|\ell_{i}^{(t-1)}\|_{\infty})\leq\exp(2\eta)\leq 1+4\eta. (29)

(Above we have also used that \|\ell_{i}^{(t)}\|_{\infty}\leq 1 for all t.) Thus, for \eta\leq\frac{1}{16}, we can apply Lemma A.2 to show, for a sufficiently large constant C_{0},

KL(x~i(t);xi(t))\displaystyle\operatorname{KL}(\tilde{x}_{i}^{(t)};x_{i}^{(t)}) χ2(x~i(t);xi(t))(1/2C0η)\displaystyle\geq\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})\cdot(1/2-C_{0}\eta) (30)
KL(xi(t);x~i(t1))\displaystyle\operatorname{KL}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)}) χ2(xi(t);x~i(t1))(1/2C0η).\displaystyle\geq\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})\cdot(1/2-C_{0}\eta). (31)

Note also that for distributions x,y\in\Delta^{n_{i}} we have that \chi^{2}(x;y)=\left(\left\|{x-y}\right\|_{y}^{\star}\right)^{2}. By Lemma A.3 and (19), we have that, for a sufficiently large constant C_{1}, as long as \eta\leq 1/C_{1},

(xi(t)x~i(t)xi(t))2=χ2(x~i(t);xi(t))(1+C1η)η2Varxi(t)(i(t)i(t1))\displaystyle\left(\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\right)^{2}=\chi^{2}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})\leq(1+C_{1}\eta)\eta^{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right) (32)

and

χ2(x~i(t);xi(t))(1C1η)η2Varxi(t)(i(t)i(t1)).\displaystyle\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})\geq(1-C_{1}\eta)\eta^{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right). (33)

Next we lower bound χ2(xi(t);x~i(t1))\chi^{2}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}) as follows, where C2C_{2} denotes a sufficiently large constant: as long as η1/C2\eta\leq 1/C_{2},

χ2(xi(t);x~i(t1))\displaystyle\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)}) χ2(x~i(t1);xi(t))exp(2η)\displaystyle\geq\chi^{2}(\tilde{x}_{i}^{(t-1)};x_{i}^{(t)})\cdot\exp(-2\eta) (34)
(1C2η)η2Varxi(t)(i(t1)),\displaystyle\geq(1-C_{2}\eta)\eta^{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right), (35)

where (34) follows from Lemma A.1 and (29), and (35) follows from Lemma A.3 and (23).

Combining (28), (30), (31), (32), (33), and (35) gives that for a sufficiently large constant CC, as long as η<1/C\eta<1/C,

\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left((\eta/2+C\eta^{2})\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\frac{(1-C\eta)\eta}{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)\right),

as desired. ∎

Appendix B Proofs for Section 4.3

In this section we give the full proof of Lemma 4.4. In Section B.1 we introduce some preliminaries. In Section B.2 we prove Lemma 4.5, the “boundedness chain rule” for finite differences. In Section B.4 we show how to use this lemma to prove Lemma 4.4.

B.1 Additional preliminaries

In this section we introduce some additional notation and basic combinatorial lemmas. Definition B.1 introduces the shift operator \operatorname{E}_{s}{}, which, like the finite difference operator \operatorname{D}_{h}{}, maps one sequence to another sequence.

Definition B.1 (Shift operator).

Suppose L=(L(1),,L(T))L=(L^{(1)},\ldots,L^{(T)}) is a sequence of vectors L(t)nL^{(t)}\in\mathbb{R}^{n}. For integers s0s\geq 0, the ss-shift sequence for the sequence LL, denoted by EsL\operatorname{E}_{s}{L}, is the sequence EsL=((EsL)(1),,(EsL)(Ts))\operatorname{E}_{s}{L}=(\left(\operatorname{E}_{s}{L}\right)^{(1)},\ldots,\left(\operatorname{E}_{s}{L}\right)^{(T-s)}), defined by (EsL)(t)=L(t+s)\left(\operatorname{E}_{s}{L}\right)^{(t)}=L^{(t+s)} for 1tTs1\leq t\leq T-s.

For sequences L=(L^{(1)},\ldots,L^{(T)}) and K=(K^{(1)},\ldots,K^{(T)}) of real numbers, we denote by L\cdot K the product sequence L\cdot K:=(L^{(1)}K^{(1)},\ldots,L^{(T)}K^{(T)}). Lemmas B.1 and B.2 below are standard analogues of the product rule for finite differences. The (straightforward) proofs are provided for completeness.

Lemma B.1 (Product rule; Eq. (2.55) of [GKP89]).

Suppose L=(L(1),,L(T))L=(L^{(1)},\ldots,L^{(T)}) and K=(K(1),,K(T))K=(K^{(1)},\ldots,K^{(T)}) are sequences of real numbers. Then the product sequence LKL\cdot K satisfies

D1(LK)=LD1K+D1LE1K.\operatorname{D}_{1}{(L\cdot K)}=L\cdot\operatorname{D}_{1}{K}+\operatorname{D}_{1}{L}\cdot\operatorname{E}_{1}{K}.
Proof.

We compute

D1(LK)(t)\displaystyle\operatorname{D}_{1}{(L\cdot K)}^{(t)} =L(t+1)K(t+1)L(t)K(t)\displaystyle=L^{(t+1)}K^{(t+1)}-L^{(t)}K^{(t)}
=L(t+1)K(t+1)L(t)K(t+1)+L(t)K(t+1)L(t)K(t)\displaystyle=L^{(t+1)}K^{(t+1)}-L^{(t)}K^{(t+1)}+L^{(t)}K^{(t+1)}-L^{(t)}K^{(t)}
=(LD1K+D1LE1K)(t).\displaystyle=(L\cdot\operatorname{D}_{1}{K}+\operatorname{D}_{1}{L}\cdot\operatorname{E}_{1}{K})^{(t)}.
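The operators \operatorname{D}_{1} and \operatorname{E}_{1}, together with the product rule of Lemma B.1, translate directly into code on finite sequences. A minimal sketch (illustrative only; the names are ours):

```python
def D1(seq):
    """Finite difference: (D1 L)^(t) = L^(t+1) - L^(t)."""
    return [b - a for a, b in zip(seq, seq[1:])]

def E1(seq):
    """Shift operator of Definition B.1 with s = 1: (E1 L)^(t) = L^(t+1)."""
    return seq[1:]

L = [1.0, 4.0, 2.0, 7.0, 3.5]
K = [2.0, -1.0, 0.5, 3.0, 6.0]
prod = [l * k for l, k in zip(L, K)]
# Lemma B.1: D1(L*K) = L * D1(K) + D1(L) * E1(K)
lhs = D1(prod)
rhs = [l * dk + dl * ek for l, dk, dl, ek in zip(L, D1(K), D1(L), E1(K))]
assert lhs == rhs
```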

Lemma B.2 (Multivariate product rule).

Suppose that mm\in\mathbb{N} and for 1im1\leq i\leq m, Li=(Li(1),,Li(T))L_{i}=(L_{i}^{(1)},\ldots,L_{i}^{(T)}) are sequences of real numbers. Then the product sequence i=1mLi\prod_{i=1}^{m}L_{i} satisfies

D1i=1mLi=i=1m(i<iLi)D1Li(i>iE1Li).\operatorname{D}_{1}{\prod_{i=1}^{m}L_{i}}=\sum_{i=1}^{m}\left(\prod_{i^{\prime}<i}L_{i^{\prime}}\right)\cdot\operatorname{D}_{1}{L_{i}}\cdot\left(\prod_{i^{\prime}>i}\operatorname{E}_{1}{L_{i^{\prime}}}\right).
Proof.

We compute

(D1i=1mLi)(t)\displaystyle\left({\operatorname{D}_{1}{\prod_{i=1}^{m}L_{i}}}\right)^{(t)} =i=1mLi(t+1)i=1mLi(t)\displaystyle=\prod_{i=1}^{m}L_{i}^{(t+1)}-\prod_{i=1}^{m}L_{i}^{(t)}
=i=1m(iiLi(t+1)i>iLi(t)i<iLi(t+1)iiLi(t))\displaystyle=\sum_{i=1}^{m}\left({\prod_{i^{\prime}\leq i}L_{i^{\prime}}^{(t+1)}\prod_{i^{\prime}>i}L_{i^{\prime}}^{(t)}-\prod_{i^{\prime}<i}L_{i^{\prime}}^{(t+1)}\prod_{i^{\prime}\geq i}L_{i^{\prime}}^{(t)}}\right)
=i=1m(i<iLi(t+1)i>iLi(t)(Li(t+1)Li(t)))\displaystyle=\sum_{i=1}^{m}\left({\prod_{i^{\prime}<i}L_{i^{\prime}}^{(t+1)}\cdot\prod_{i^{\prime}>i}L_{i^{\prime}}^{(t)}\cdot\left({L_{i}^{(t+1)}-L_{i}^{(t)}}\right)}\right)
=(i=1m(i<iLi)D1Li(i>iE1Li))(t).\displaystyle=\left({\sum_{i=1}^{m}\left(\prod_{i^{\prime}<i}L_{i^{\prime}}\right)\cdot\operatorname{D}_{1}{L_{i}}\cdot\left(\prod_{i^{\prime}>i}\operatorname{E}_{1}{L_{i^{\prime}}}\right)}\right)^{(t)}.
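Lemma B.2 can likewise be checked mechanically: differencing a pointwise product of m sequences expands into m terms, with factors before index i left un-shifted and factors after index i shifted. A sketch (illustrative only; the names are ours):

```python
import math

def D1(seq):
    return [b - a for a, b in zip(seq, seq[1:])]

def E1(seq):
    return seq[1:]

def seq_prod(seqs):
    """Pointwise product of equal-length sequences."""
    return [math.prod(s[t] for s in seqs) for t in range(len(seqs[0]))]

seqs = [[1.0, 2.0, 0.5, 3.0], [4.0, 1.0, 2.0, 2.0], [0.5, 3.0, 1.0, 5.0]]
T = len(seqs[0]) - 1
lhs = D1(seq_prod(seqs))
rhs = [0.0] * T
for i in range(len(seqs)):
    term = seq_prod(
        [s[:T] for s in seqs[:i]]          # L_{i'} for i' < i (un-shifted)
        + [D1(seqs[i])]                    # D1 L_i
        + [E1(s) for s in seqs[i + 1:]]    # E1 L_{i'} for i' > i
    )
    rhs = [r + t for r, t in zip(rhs, term)]
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```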

Lemma B.4, together with Lemma B.3 (which is used in its proof), serves to bound certain sums with many terms in the proof of Lemma 4.5. To state Lemma B.3 we make one definition. For positive integers k,m and any h,C>0, define

Rh,m,k,C=0n1,,nkm(i=1kninihi=1kni)C,R_{h,m,k,C}=\sum_{0\leq n_{1},\cdots,n_{k}\leq m}\left({\frac{\prod_{i=1}^{k}n_{i}^{n_{i}}}{h^{\sum_{i=1}^{k}n_{i}}}}\right)^{C},

where the sum is over integers n1,,nkn_{1},\ldots,n_{k} satisfying 0nim0\leq n_{i}\leq m for i[k]i\in[k]. In the definition of Rh,m,k,CR_{h,m,k,C}, the quantity 000^{0} (which arises when some ni=0n_{i}=0) is interpreted as 1.

Lemma B.3.

If k,m are positive integers and h,C>0 satisfy m\leq h/2, C\geq 2, and h\geq 8, then

Rh,m,k,Cexp(2khC).R_{h,m,k,C}\leq\exp\left({\frac{2k}{h^{C}}}\right).
Proof of Lemma B.3.

We may rewrite Rh,m,k,CR_{h,m,k,C} and then upper bound it as follows:

Rh,m,k,C\displaystyle R_{h,m,k,C} =(j=0m(jh)Cj)k\displaystyle=\left({\sum_{j=0}^{m}\left({\frac{j}{h}}\right)^{Cj}}\right)^{k}
(1+(1h)C+(m1)max((2h)2C,(mh)mC))k\displaystyle\leq\left({1+\left({\frac{1}{h}}\right)^{C}+(m-1)\max\left({\left({\frac{2}{h}}\right)^{2C},\left({\frac{m}{h}}\right)^{mC}}\right)}\right)^{k} (36)
(1+(1h)C+(h/2)max((2h)2C,(12)hC/2))k\displaystyle\leq\left({1+\left({\frac{1}{h}}\right)^{C}+(h/2)\max\left({\left({\frac{2}{h}}\right)^{2C},\left({\frac{1}{2}}\right)^{hC/2}}\right)}\right)^{k}

where (36) follows since (ih)Ci\left({\frac{i}{h}}\right)^{Ci} is convex in ii for i0i\geq 0, and therefore, in the interval [2,m][2,h/2][2,m]\subseteq[2,h/2], takes on maximal values at the endpoints. We see

(h/2)(2h)2C=(2h)2C1(1h)C(h/2)\left({\frac{2}{h}}\right)^{2C}=\left({\frac{2}{h}}\right)^{2C-1}\leq\left({\frac{1}{h}}\right)^{C}

for h8h\geq 8 when C2C\geq 2. Also,

(h/2)(12)hC/2(1h)C(h/2)\left({\frac{1}{2}}\right)^{hC/2}\leq\left({\frac{1}{h}}\right)^{C}

for h8h\geq 8 when C2C\geq 2. (This inequality is easily seen to be equivalent to the fact that (C+1)loghCh21(C+1)\log h-\frac{Ch}{2}\leq 1, which follows from the fact that loghh/20\log h-h/2\leq 0 for h8h\geq 8 and 3loghh13\log h-h\leq 1 for h8h\geq 8.) Therefore,

Rh,m,k,C\displaystyle R_{h,m,k,C} (1+(1h)C+(h/2)max((2h)2C,(12)hC/2))k\displaystyle\leq\left({1+\left({\frac{1}{h}}\right)^{C}+(h/2)\max\left({\left({\frac{2}{h}}\right)^{2C},\left({\frac{1}{2}}\right)^{hC/2}}\right)}\right)^{k}
(1+2(1h)C)k\displaystyle\leq\left({1+2\left({\frac{1}{h}}\right)^{C}}\right)^{k}
exp(2khC).\displaystyle\leq\exp\left({\frac{2k}{h^{C}}}\right).
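Since the sum over tuples factorizes, R_{h,m,k,C}=\left(\sum_{j=0}^{m}(j/h)^{Cj}\right)^{k}, which makes the bound of Lemma B.3 easy to test numerically for small parameters satisfying m\leq h/2, C\geq 2, h\geq 8. A sketch (illustrative only; the names are ours):

```python
import math
from itertools import product

def R(h, m, k, C):
    """R_{h,m,k,C} via the factorized form (0^0 = 1, as in the text)."""
    inner = sum((j ** j / h ** j) ** C for j in range(m + 1))
    return inner ** k

def R_brute(h, m, k, C):
    """Direct sum over all tuples (n_1,...,n_k) with 0 <= n_i <= m."""
    return sum(
        (math.prod(n ** n for n in ns) / h ** sum(ns)) ** C
        for ns in product(range(m + 1), repeat=k)
    )

for h, m, k, C in [(8, 4, 2, 2), (10, 5, 3, 2), (16, 8, 2, 3)]:
    r = R(h, m, k, C)
    assert abs(r - R_brute(h, m, k, C)) < 1e-9
    assert r <= math.exp(2 * k / h ** C)  # bound of Lemma B.3
```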

Lemma B.4.

Fix integers h0,k1h\geq 0,k\geq 1. For any function π:[h][k]\pi:[h]\rightarrow[k], define, for each i[k]i\in[k], hi(π)=|{q[h]|π(q)=i}|h_{i}(\pi)=\left|\left\{q\in[h]|\pi(q)=i\right\}\right|. Then, for any C3C\geq 3,

π:[h][k]i=1khi(π)Chi(π)hChmax{k7,(hk+1)exp(2khC1)}.\sum_{\pi:[h]\rightarrow[k]}\frac{\prod_{i=1}^{k}h_{i}(\pi)^{Ch_{i}(\pi)}}{h^{Ch}}\leq\max\left\{k^{7},(hk+1)\cdot\exp\left(\frac{2k}{h^{C-1}}\right)\right\}. (37)
Proof.

In the case that h7h\leq 7, we simply use the fact that the number of functions π:[h][k]\pi:[h]\rightarrow[k] is khk7k^{h}\leq k^{7}, and each term of the summation on the left-hand side of (37) is at most 1. In the remainder of the proof we may thus assume that h8h\geq 8.

For any tuple (h1,,hk)(h_{1},\cdots,h_{k}) of non-negative integers with i=1khi=h\sum_{i=1}^{k}h_{i}=h, there are (hh1,h2,,hk)hhihihi{h\choose h_{1},h_{2},\cdots,h_{k}}\leq\frac{h^{h}}{\prod_{i}h_{i}^{h_{i}}} (see [CS04, Lemma 2.2] for a proof of this inequality) functions π:[h][k]\pi:[h]\rightarrow[k] such that hi(π)=hih_{i}(\pi)=h_{i} for all i[k]i\in[k]. Combining these like terms,

π:[h][k]ihi(π)Chi(π)hCh\displaystyle\sum_{\pi:[h]\rightarrow[k]}\frac{\prod_{i}h_{i}(\pi)^{Ch_{i}(\pi)}}{h^{Ch}} h1,,hk0hi=hhhihihi(ihihihh)C\displaystyle\leq\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ \sum h_{i}=h\end{subarray}}\frac{h^{h}}{\prod_{i}h_{i}^{h_{i}}}\cdot\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C}
h1,,hk0hi=h(ihihihh)C1.\displaystyle\leq\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1}. (38)

We bound this sum in two cases, according to whether or not h_{\max}:=\max_{i}\{h_{i}\} is greater than h/2. The contribution to this sum coming from terms with h_{\max}\leq h/2 is

h1,,hk0h1,,hkh/2hi=h(ihihihh)C1\displaystyle\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{1},\cdots,h_{k}\leq h/2\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1} h1,,hk0h1,,hkh/2(ihihihhi)C1\displaystyle\leq\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{1},\cdots,h_{k}\leq h/2\\ \end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{\sum h_{i}}}}\right)^{C-1}
=Rh,h/2,k,C1\displaystyle=R_{h,\lfloor h/2\rfloor,k,C-1}
exp(2khC1),\displaystyle\leq\exp\left({\frac{2k}{h^{C-1}}}\right), (39)

by Lemma B.3.

We next consider the case where hmax>h/2h_{\max}>h/2. For a specific term (h1,,hk)(h_{1},\cdots,h_{k}) with maxi{hi}>h/2\max_{i}\{h_{i}\}>h/2, we know there is a unique M[k]M\in[k] such that hM=maxi{hi}h_{M}=\max_{i}\{h_{i}\} since i=1khi=h\sum_{i=1}^{k}h_{i}=h. So, we can represent the contribution to the sum from this case as

M=1kh1,,hk0hM>h/2hi=h(ihihihh)C1\displaystyle\sum_{M=1}^{k}\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{M}>h/2\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1} =kh1,,hk0hk>h/2hi=h(ihihihh)C1\displaystyle=k\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{k}>h/2\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1} (40)
kd=0h/2((hd)hdhhd)C1h1,,hk10hi=d(ihihihd)C1\displaystyle\leq k\sum_{d=0}^{\lfloor h/2\rfloor}\left({\frac{(h-d)^{h-d}}{h^{h-d}}}\right)^{C-1}\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k-1}\geq 0\\ \sum h_{i}=d\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{d}}}\right)^{C-1} (41)
kd=0h/2h1,,hk10h1,,hk1d(ihihihhi)C1\displaystyle\leq k\sum_{d=0}^{\lfloor h/2\rfloor}\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k-1}\geq 0\\ h_{1},\cdots,h_{k-1}\leq d\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{\sum h_{i}}}}\right)^{C-1}
=kd=0h/2Rh,d,k1,C1\displaystyle=k\sum_{d=0}^{\lfloor h/2\rfloor}R_{h,d,k-1,C-1}
khexp(2khC1),\displaystyle\leq kh\cdot\exp\left(\frac{2k}{h^{C-1}}\right), (42)

where (40) follows by symmetry, (41) follows by factoring out the contribution of \left(\frac{h_{k}^{h_{k}}}{h^{h_{k}}}\right)^{C-1} and letting d=h-h_{k}, and (42) follows by Lemma B.3.

The statement of the lemma follows from (38), (39), and (42). ∎
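For small h and k, the left-hand side of (37) can be evaluated by brute force over all k^h functions π, giving a direct check of Lemma B.4. A sketch (illustrative only; the names are ours):

```python
import math
from itertools import product
from collections import Counter

def lhs(h, k, C):
    """Left-hand side of (37): sum over all functions pi: [h] -> [k]."""
    total = 0.0
    for pi in product(range(k), repeat=h):
        counts = Counter(pi)  # h_i(pi) for the colors that actually appear
        total += math.prod(c ** (C * c) for c in counts.values()) / h ** (C * h)
    return total

def rhs(h, k, C):
    return max(k ** 7, (h * k + 1) * math.exp(2 * k / h ** (C - 1)))

for h, k, C in [(2, 2, 3), (4, 2, 3), (5, 3, 3), (6, 2, 4)]:
    assert lhs(h, k, C) <= rhs(h, k, C)
```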

Lemma B.5.

For nn\in\mathbb{N}, let ξ1,,ξn0\xi_{1},\ldots,\xi_{n}\geq 0 such that ξ1++ξn=1\xi_{1}+\cdots+\xi_{n}=1. For each j[n]j\in[n], define ϕj:n\phi_{j}:\mathbb{R}^{n}\rightarrow\mathbb{R} to be the function

ϕj((z1,,zn))=ξjexp(zj)k=1nξkexp(zk)\displaystyle\phi_{j}((z_{1},\ldots,z_{n}))=\frac{\xi_{j}\exp(z_{j})}{\sum_{k=1}^{n}\xi_{k}\cdot\exp(z_{k})}

and let Pϕj(z)=γ0naj,γzγP_{\phi_{j}}(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{j,\gamma}\cdot z^{\gamma} denote the Taylor series of ϕj\phi_{j}. Then for any j[n]j\in[n] and any integer k1k\geq 1,

γ0n:|γ|=k|aj,γ|ξjek+1.\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right|\leq\xi_{j}e^{k+1}.
Proof.

Note that, for each j[n]j\in[n],

\displaystyle a_{j,\gamma}=\frac{1}{\gamma_{1}!\gamma_{2}!\cdots\gamma_{n}!}\cdot\frac{\partial^{k}\phi_{j}(0)}{\partial z_{1}^{\gamma_{1}}\partial z_{2}^{\gamma_{2}}\cdots\partial z_{n}^{\gamma_{n}}},

and so

γ0n:|γ|=k|aj,γ|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right| =γ0n:|γ|=k1γ1!γ2!γn!|kϕj(0)z1γ1z2γ2znγn|\displaystyle=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\frac{1}{\gamma_{1}!\gamma_{2}!\cdots\gamma_{n}!}\cdot\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{1}^{\gamma_{1}}\partial z_{2}^{\gamma_{2}}\cdots z_{n}^{\gamma_{n}}}\right|
=1k!γ0n:|γ|=kk!γ1!γ2!γn!|kϕj(0)z1γ1z2γ2znγn|\displaystyle=\frac{1}{k!}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\frac{k!}{\gamma_{1}!\gamma_{2}!\cdots\gamma_{n}!}\cdot\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{1}^{\gamma_{1}}\partial z_{2}^{\gamma_{2}}\cdots z_{n}^{\gamma_{n}}}\right|
=1k!t[n]k|kϕj(0)zt1zt2ztk|.\displaystyle=\frac{1}{k!}\sum_{t\in[n]^{k}}\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{t_{1}}\partial z_{t_{2}}\cdots\partial z_{t_{k}}}\right|.

It is straightforward to see that the following equalities hold for any i[n]i\in[n], iji\neq j:

ϕjzj\displaystyle\frac{\partial\phi_{j}}{\partial z_{j}} =ϕj(1ϕj)\displaystyle=\phi_{j}(1-\phi_{j})
ϕjzi\displaystyle\frac{\partial\phi_{j}}{\partial z_{i}} =ϕiϕj\displaystyle=-\phi_{i}\phi_{j}
(1ϕj)zj\displaystyle\frac{\partial(1-\phi_{j})}{\partial z_{j}} =ϕj(1ϕj)\displaystyle=-\phi_{j}(1-\phi_{j})
(1ϕj)zi\displaystyle\frac{\partial(1-\phi_{j})}{\partial z_{i}} =ϕiϕj\displaystyle=\phi_{i}\phi_{j}
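These softmax derivative identities are also easy to confirm with finite differences, a quick guard against sign errors. A sketch (illustrative only; the helper names are ours), evaluating the identities at a generic point rather than at 0:

```python
import math

def phi(z, xi):
    """phi_j(z) = xi_j e^{z_j} / sum_k xi_k e^{z_k}, returned for all j."""
    w = [x * math.exp(zj) for x, zj in zip(xi, z)]
    s = sum(w)
    return [wj / s for wj in w]

def partial(j, i, z, xi, eps=1e-6):
    """Central finite-difference approximation of d(phi_j)/d(z_i)."""
    zp, zm = list(z), list(z)
    zp[i] += eps
    zm[i] -= eps
    return (phi(zp, xi)[j] - phi(zm, xi)[j]) / (2 * eps)

xi = [0.5, 0.3, 0.2]
z = [0.1, -0.4, 0.7]
p = phi(z, xi)
for j in range(3):
    for i in range(3):
        # d(phi_j)/d(z_j) = phi_j (1 - phi_j);  d(phi_j)/d(z_i) = -phi_i phi_j for i != j
        expected = p[j] * (1 - p[j]) if i == j else -p[i] * p[j]
        assert abs(partial(j, i, z, xi) - expected) < 1e-6
```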

We claim that for any (t_{1},\ldots,t_{k})\in[n]^{k}, we can express \frac{\partial^{k}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}} as a polynomial in \phi_{1},\cdots,\phi_{n},(1-\phi_{1}),\cdots,(1-\phi_{n}) consisting of k! monomials, each of degree k+1. We verify this by induction, first noting that after taking zero derivatives, the function \phi_{j} is a degree-1 monomial. Assume that for some sequence b_{1},\ldots,b_{(\ell-1)!}\in\{0,1\}, we can express

1ϕjzt1zt1=f=1(1)!(1)bfd=01mf,d\displaystyle\frac{\partial^{\ell-1}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{\ell-1}}}=\sum_{f=1}^{(\ell-1)!}(-1)^{b_{f}}\prod_{d=0}^{\ell-1}m_{f,d}

where each mf,d{ϕ1,,ϕn,(1ϕ1),,(1ϕn)}m_{f,d}\in\left\{\phi_{1},\cdots,\phi_{n},(1-\phi_{1}),\cdots,(1-\phi_{n})\right\}. We see that for each ff, there is some sequence of bits bf,0,,bf,1{0,1}b_{f,0},\ldots,b_{f,\ell-1}\in\{0,1\} so that

ztd=01mf,d=d=01(1)bf,dmf,0mf,dmf,d,\displaystyle\frac{\partial}{\partial z_{t_{\ell}}}\prod_{d=0}^{\ell-1}m_{f,d}=\sum_{d=0}^{\ell-1}(-1)^{b_{f,d}}\cdot m_{f,0}\cdots m_{f,d}^{\prime}\cdots m_{f,d,\ell} (43)

where we define, for each 0d10\leq d\leq\ell-1,

mf,d and mf,d,={mf,d and ϕt if mf,d=ϕi with itmf,d and (1ϕt) if mf,d=ϕt(1mf,d) and ϕt if mf,d=1ϕi with it(1mf,d) and (1ϕt) if mf,d=1ϕt.\displaystyle m_{f,d}^{\prime}\text{ and }m_{f,d,\ell}=\begin{cases}m_{f,d}\text{ and }\phi_{t_{\ell}}&\text{ if }m_{f,d}=\phi_{i}\text{ with }i\neq t_{\ell}\\ m_{f,d}\text{ and }(1-\phi_{t_{\ell}})&\text{ if }m_{f,d}=\phi_{t_{\ell}}\\ (1-m_{f,d})\text{ and }\phi_{t_{\ell}}&\text{ if }m_{f,d}=1-\phi_{i}\text{ with }i\neq t_{\ell}\\ (1-m_{f,d})\text{ and }(1-\phi_{t_{\ell}})&\text{ if }m_{f,d}=1-\phi_{t_{\ell}}.\end{cases}

Thus, ϕjzt1zt\frac{\partial^{\ell}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{\ell}}} can be expressed as a sum of !\ell! monomials of degree (+1)(\ell+1), completing the inductive step.

This inductive argument also demonstrates a bijection between the k!k! monomials of kϕjzt1ztk\frac{\partial^{k}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}} and a combinatorial structure that we call factorial trees. Formally, we define a factorial tree to be a directed graph on vertices {0,1,,k}\left\{0,1,\cdots,k\right\} such that each vertex i0i\neq 0 has a single incoming edge from one of the vertices in [0,i1][0,i-1]. (For a non-negative integer ii, we write [0,i]:={0,1,,i}[0,i]:=\{0,1,\ldots,i\}.) For a factorial tree ff, let pf()[0,1]p_{f}(\ell)\in[0,\ell-1] denote the parent of a vertex \ell. A particular factorial tree ff represents the monomial that was generated by choosing the pf()thp_{f}(\ell)^{\text{th}} term in (43) for derivation when taking the derivative zt\frac{\partial}{\partial z_{t_{\ell}}}, for each [k]\ell\in[k]. (See Figure 1 for an example.)

Refer to caption
Figure 1: A monomial ϕjϕiϕjϕk-\phi_{j}\phi_{i}\phi_{j}\phi_{k} of 3ϕjzizjzk\frac{\partial^{3}\phi_{j}}{\partial z_{i}\partial z_{j}\partial z_{k}} and its corresponding factorial tree

Each of the k!k! monomials comprising kϕjzt1ztk\frac{\partial^{k}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}} is a product of k+1k+1 terms corresponding to indices j,t1,,tkj,t_{1},\cdots,t_{k} (i.e., the first term in the product is either ϕj\phi_{j} or 1ϕj1-\phi_{j}, the second term is either ϕt1\phi_{t_{1}} or 1ϕt11-\phi_{t_{1}}, and so on). We say that a term corresponding to index i[n]i\in[n] is perturbed if it is (1ϕi)(1-\phi_{i}) (as opposed to ϕi\phi_{i}). From our construction, we see that the th\ell^{\text{th}} term is perturbed if t=tpf()t_{\ell}=t_{p_{f}(\ell)} and there is no \ell^{\prime} such that pf()=p_{f}(\ell^{\prime})=\ell. That is, \ell is a leaf in the corresponding factorial tree ff and the parent of \ell corresponds to the same index as \ell. One can think of t1,,tkt_{1},\cdots,t_{k} as a coloring of all the vertices of the factorial tree with nn colors, except the root of the tree (vertex 0) which has fixed color jj. Then, we can say the th\ell^{\text{th}} term is perturbed if and only if \ell is a leaf with the same color as its parent. We call such a leaf a petal. For t[n]t\in[n], we let Pf,t[k]P_{f,t}\subseteq[k] be the set of petals on tree ff with color tt, Lf[k]L_{f}\subseteq[k] be the set of leaves of tree ff, and Bf=[k]LfB_{f}=[k]\setminus L_{f} be the set of all non-leaves other than the fixed-color root. Therefore,

γ0n:|γ|=k|aj,γ|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right| =1k!t[n]k|kϕj(0)zt1ztk|\displaystyle=\frac{1}{k!}\sum_{t\in[n]^{k}}\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}}\right|
1k!t[n]kf=0k(ϕt(0)𝟙[Pf,t]+(1ϕt(0))𝟙[Pf,t])\displaystyle\leq\frac{1}{k!}\sum_{t\in[n]^{k}}\sum_{f}\prod_{\ell=0}^{k}(\phi_{t_{\ell}}(0)\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\phi_{t_{\ell}}(0))\cdot\mathbbm{1}[\ell\in P_{f,t}])
(where we let t0=jt_{0}=j for notational convenience)
=1k!t[n]kf=0k(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle=\frac{1}{k!}\sum_{t\in[n]^{k}}\sum_{f}\prod_{\ell=0}^{k}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
=1k!ftBf[n]BftLf[n]Lf=0k(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t]),\displaystyle=\frac{1}{k!}\sum_{f}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\sum_{t_{L_{f}}\in[n]^{L_{f}}}\prod_{\ell=0}^{k}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}]),

where in the last step we decompose, for each factorial tree ff, t[n]kt\in[n]^{k} into the tuple of indices tBf[n]Bft_{B_{f}}\in[n]^{B_{f}} corresponding to the non-leaves BfB_{f}, and the tuple of indices tLf[n]Lft_{L_{f}}\in[n]^{L_{f}} corresponding to the leaves LfL_{f}.

We note that, fixing tree ff and the colors of all non-leaves tBt_{B},

tLf[n]LfLf(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle\sum_{t_{L_{f}}\in[n]^{L_{f}}}\prod_{\ell\in L_{f}}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
=Lf(t[n]ξt𝟙[ttpf()]+(1ξt)𝟙[t=tpf()])\displaystyle=\prod_{\ell\in L_{f}}\left({\sum_{t_{\ell}\in[n]}\xi_{t_{\ell}}\cdot\mathbbm{1}[t_{\ell}\neq t_{p_{f}(\ell)}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[t_{\ell}=t_{p_{f}(\ell)}]}\right)
=Lf(22ξtpf())\displaystyle=\prod_{\ell\in L_{f}}\left({2-2\xi_{t_{p_{f}(\ell)}}}\right)
2|Lf|\displaystyle\leq 2^{|L_{f}|}

And so,

1k!ftBf[n]BftLf[n]Lf=0k(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle\frac{1}{k!}\sum_{f}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\sum_{t_{L_{f}}\in[n]^{L_{f}}}\prod_{\ell=0}^{k}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
1k!f2|Lf|tBf[n]BfBf{0}(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle\leq\frac{1}{k!}\sum_{f}2^{|L_{f}|}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\prod_{\ell\in B_{f}\cup\left\{0\right\}}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
=1k!f2|Lf|tBf[n]BfBf{0}ξt\displaystyle=\frac{1}{k!}\sum_{f}2^{|L_{f}|}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\prod_{\ell\in B_{f}\cup\left\{0\right\}}\xi_{t_{\ell}}
(as no non-leaf can ever be a petal)
=ξjk!f2|Lf|Bf(t[n]ξt)\displaystyle=\frac{\xi_{j}}{k!}\sum_{f}2^{|L_{f}|}\prod_{\ell\in B_{f}}\left({\sum_{t_{\ell}\in[n]}\xi_{t_{\ell}}}\right)
=ξjk!f2|Lf|=ξj𝔼f𝒰()[2|Lf|]\displaystyle=\frac{\xi_{j}}{k!}\sum_{f}2^{|L_{f}|}=\xi_{j}\mathbb{E}_{f\sim\mathcal{U}(\mathcal{F})}\left[{2^{|L_{f}|}}\right]

where \mathcal{F} is the set of all factorial trees and 𝒰()\mathcal{U}(\mathcal{F}) is the uniform distribution over \mathcal{F}. For a specific vertex [0,k]\ell\in[0,k], we note that Lf\ell\in L_{f} if and only if it is not the parent of any of the vertices +1,,k\ell+1,\cdots,k. So,

Prf𝒰()[Lf]=i=+1ki1i=k\Pr_{f\sim\mathcal{U}(\mathcal{F})}\left[{\ell\in L_{f}}\right]=\prod_{i=\ell+1}^{k}\frac{i-1}{i}=\frac{\ell}{k} (44)
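Equation (44) can be checked by brute force for small kk. The sketch below assumes, consistent with the 1/k!1/k! normalization above, that a factorial tree on vertex set {0,,k}\{0,\ldots,k\} is formed by each vertex c1c\geq 1 independently choosing a parent pf(c){0,,c1}p_{f}(c)\in\{0,\ldots,c-1\} (so there are k!k! trees in total); this reading is inferred from the counting in the proof, not stated explicitly in this chunk.

```python
from itertools import product
from fractions import Fraction
from math import factorial

def leaf_prob(k):
    # Exact Pr[ell in L_f] for each vertex ell in {0,...,k}, by enumerating
    # all k! factorial trees: vertex c >= 1 picks a parent p(c) in {0,...,c-1}.
    counts = [0] * (k + 1)
    for parents in product(*[range(c) for c in range(1, k + 1)]):
        non_leaves = set(parents)  # a vertex is a leaf iff it is nobody's parent
        for ell in range(k + 1):
            if ell not in non_leaves:
                counts[ell] += 1
    total = factorial(k)
    return [Fraction(c, total) for c in counts]

for k in (1, 2, 3, 4, 5):
    # Matches equation (44): Pr[ell in L_f] = ell / k.
    assert leaf_prob(k) == [Fraction(ell, k) for ell in range(k + 1)]
```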

We will show via induction that, for any vertex set S[0,k]S\subseteq[0,k]

Prf𝒰()[SLf]Sk\Pr_{f\sim\mathcal{U}(\mathcal{F})}\left[{S\subseteq L_{f}}\right]\leq\prod_{\ell\in S}\frac{\ell}{k} (45)

Having established the base case |S|=1|S|=1 via (44), we assume (45) holds for all SS with |S|<s|S|<s. For any set of s2s\geq 2 vertices VV, consider an arbitrary partition of VV into two nonempty sets ST=VS\cup T=V with |S|,|T|<s|S|,|T|<s. We see

Prf𝒰()[VLf]\displaystyle\Pr_{f\sim\mathcal{U}(\mathcal{F})}\left[{V\subseteq L_{f}}\right] =c=1kPr[pf(c)V]\displaystyle=\prod_{c=1}^{k}\Pr\left[{p_{f}(c)\not\in V}\right]
=c=1kPr[pf(c)S]Pr[pf(c)T|pf(c)S]\displaystyle=\prod_{c=1}^{k}\Pr\left[{p_{f}(c)\not\in S}\right]\Pr\left[{p_{f}(c)\not\in T|p_{f}(c)\not\in S}\right]
c=1kPr[pf(c)S]Pr[pf(c)T]\displaystyle\leq\prod_{c=1}^{k}\Pr\left[{p_{f}(c)\not\in S}\right]\Pr\left[{p_{f}(c)\not\in T}\right]
=Pr[SLf]Pr[TLf]\displaystyle=\Pr\left[{S\subseteq L_{f}}\right]\Pr\left[{T\subseteq L_{f}}\right]
Vk\displaystyle\leq\prod_{\ell\in V}\frac{\ell}{k}

by the inductive hypothesis, as desired. Thus, Pr[|Lf|s]S:|S|=sPr[SLf]\Pr\left[{|L_{f}|\geq s}\right]\leq\sum_{S:|S|=s}\Pr\left[{S\subseteq L_{f}}\right], which is at most the coefficient of xsx^{s} in the polynomial

R(x)==0k(1+kx)R(x)=\prod_{\ell=0}^{k}\left({1+\frac{\ell}{k}x}\right)

and so

𝔼f𝒰()[2|Lf|]\displaystyle\mathbb{E}_{f\sim\mathcal{U}(\mathcal{F})}\left[{2^{|L_{f}|}}\right] s=0k2sPr[|Lf|s]\displaystyle\leq\sum_{s=0}^{k}2^{s}\Pr[|L_{f}|\geq s]
R(2)\displaystyle\leq R(2)
e=0k2/k=ek+1\displaystyle\leq e^{\sum_{\ell=0}^{k}2\ell/k}=e^{k+1}

and

γ0n:|γ|=k|aj,γ|ξjek+1,\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right|\leq\xi_{j}e^{k+1},

as desired. ∎
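The final bound 𝔼[2|Lf|]ek+1\mathbb{E}[2^{|L_{f}|}]\leq e^{k+1} can be verified exactly for small kk by enumeration. As in the previous sketch, this assumes a factorial tree on vertices {0,,k}\{0,\ldots,k\} attaches each vertex c1c\geq 1 to a parent chosen from {0,,c1}\{0,\ldots,c-1\}.

```python
from itertools import product
from math import factorial, exp

def expected_2_pow_leaves(k):
    # E_{f ~ U(F)}[2^{|L_f|}] by enumerating all k! factorial trees.
    total = 0.0
    for parents in product(*[range(c) for c in range(1, k + 1)]):
        n_leaves = (k + 1) - len(set(parents))
        total += 2.0 ** n_leaves
    return total / factorial(k)

for k in range(1, 7):
    lhs = expected_2_pow_leaves(k)
    # R(2) = prod_{ell=0}^{k} (1 + 2*ell/k), then R(2) <= e^{k+1}.
    R2 = 1.0
    for ell in range(k + 1):
        R2 *= 1 + 2 * ell / k
    assert lhs <= R2 <= exp(k + 1)
```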

Lemma B.6.

Let ϕ1,,ϕm\phi_{1},\cdots,\phi_{m} be softmax-type functions. That is, for each ϕi\phi_{i}, there is some ji[n]j_{i}\in[n] and nonnegative weights ξi1,,ξin\xi_{i1},\ldots,\xi_{in} such that

ϕi((z1,,zn))=exp(zji)k=1nξikexp(zk)\displaystyle\phi_{i}((z_{1},\ldots,z_{n}))=\frac{\exp(z_{j_{i}})}{\sum_{k=1}^{n}\xi_{ik}\cdot\exp(z_{k})}

where ξi1++ξin=1\xi_{i1}+\cdots+\xi_{in}=1 for all ii. Let P(z)=γ0naγzγP(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}z^{\gamma} denote the Taylor series of iϕi\prod_{i}\phi_{i}. Then for any integer kk,

γ0n:|γ|=k|aγ|(e3m)k.\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{\gamma}\right|\leq(e^{3}m)^{k}.
Proof.

Letting Pi(z)=γ0nai,γzγP_{i}(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{i,\gamma}z^{\gamma} denote the Taylor series of ϕi\phi_{i} for all ii, we have P(z)=iPi(z)P(z)=\prod_{i}P_{i}(z) and therefore

γ0n:|γ|=k|aγ|k1,,km0ki=kiγ0n:|γ|=ki|ai,γ|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{\gamma}\right|\leq\sum_{\begin{subarray}{c}k_{1},\cdots,k_{m}\in\mathbb{Z}_{\geq 0}\\ \sum k_{i}=k\end{subarray}}\prod_{i}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k_{i}}\left|a_{i,\gamma}\right|

We have that |γ|=ki|ai,γ|e2ki\sum_{|\gamma|=k_{i}}\left|a_{i,\gamma}\right|\leq e^{2k_{i}} for all kik_{i} since, for ki=0k_{i}=0, ai,0=ϕi(0)=1a_{i,0}=\phi_{i}(0)=1, and for ki1k_{i}\geq 1,

|γ|=ki|ai,γ|ξijξijeki+1e2ki\sum_{|\gamma|=k_{i}}\left|a_{i,\gamma}\right|\leq\frac{\xi_{ij}}{\xi_{ij}}e^{k_{i}+1}\leq e^{2k_{i}} (46)

from Lemma B.5. Note that the softmax-type functions discussed in Lemma B.5 have a ξij\xi_{ij} term in the numerator, while those discussed here do not. This accounts for the extra ξij\xi_{ij} term that appears in equation (46). Thus,

k1,,km0ki=kiγ0n:|γ|=ki|ai,γ|\displaystyle\sum_{\begin{subarray}{c}k_{1},\cdots,k_{m}\in\mathbb{Z}_{\geq 0}\\ \sum k_{i}=k\end{subarray}}\prod_{i}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k_{i}}\left|a_{i,\gamma}\right| k1,,km0ki=ke2k\displaystyle\leq\sum_{\begin{subarray}{c}k_{1},\cdots,k_{m}\in\mathbb{Z}_{\geq 0}\\ \sum k_{i}=k\end{subarray}}e^{2k}
=e2k(m+k1k)\displaystyle=e^{2k}{m+k-1\choose k}
e2k(e(m+k1)k)k\displaystyle\leq e^{2k}\left({\frac{e(m+k-1)}{k}}\right)^{k}
(e3m)k\displaystyle\leq(e^{3}m)^{k}

as desired. ∎
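The counting in the last display can be checked numerically: the number of compositions (k1,,km)(k_{1},\ldots,k_{m}) of kk is (m+k1k){m+k-1\choose k} by stars and bars, and the final chain of inequalities holds for all m,k1m,k\geq 1 because (m+k1)/km(m+k-1)/k\leq m.

```python
from math import comb, e
from itertools import product

# Stars and bars: #{(k_1,...,k_m) >= 0 with k_1+...+k_m = k} = C(m+k-1, k).
for m in (2, 3):
    for k in (2, 3, 4):
        count = sum(1 for t in product(range(k + 1), repeat=m) if sum(t) == k)
        assert count == comb(m + k - 1, k)

# The chain e^{2k} C(m+k-1,k) <= e^{2k} (e(m+k-1)/k)^k <= (e^3 m)^k.
for m in range(1, 8):
    for k in range(1, 10):
        mid = e ** (2 * k) * (e * (m + k - 1) / k) ** k
        assert e ** (2 * k) * comb(m + k - 1, k) <= mid * (1 + 1e-12)
        assert mid <= (e ** 3 * m) ** k * (1 + 1e-12)
```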

Lemma B.7.

Let ϕ((z1,,zn))=exp(zj)k=1nξkexp(zk)\phi((z_{1},\ldots,z_{n}))=\frac{\exp(z_{j})}{\sum_{k=1}^{n}\xi_{k}\exp(z_{k})} be any softmax-type function. Then the radius of convergence of the Taylor series of ϕ\phi at the origin is at least 1.

Proof.

For a complex number zz, write (z),(z)\Re(z),\Im(z) to denote the real and imaginary parts, respectively, of zz. Note that for any ζ1,,ζn\zeta_{1},\ldots,\zeta_{n}\in\mathbb{C} with |ζk|π/3|\zeta_{k}|\leq\pi/3 for all k[n]k\in[n], we have

(exp(ζk))cos(π/3)exp(π/3)>1/10,\Re(\exp(\zeta_{k}))\geq\cos(\pi/3)\cdot\exp(-\pi/3)>1/10,

and thus |k=1nξkexp(ζk)|1/10\left|\sum_{k=1}^{n}\xi_{k}\cdot\exp(\zeta_{k})\right|\geq 1/10. Moreover, for any such point ζ=(ζ1,,ζn)\zeta=(\zeta_{1},\ldots,\zeta_{n}), it holds that |exp(ζj)|exp(π/3)<3|\exp(\zeta_{j})|\leq\exp(\pi/3)<3. It then follows that for such ζ\zeta we have |ϕ(ζ)|30|\phi(\zeta)|\leq 30. In particular, ϕ\phi is holomorphic on the region {ζ:|ζk|π/3k[n]}\{\zeta:|\zeta_{k}|\leq\pi/3\ \ \forall k\in[n]\}.

Fix any γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, and let k=|γ|k=|\gamma|. By the multivariate version of Cauchy’s integral formula,

|dγdzγϕ(z)|=\displaystyle\left|\frac{d^{\gamma}}{dz^{\gamma}}\phi(z)\right|= |γ!(2πi)n|ζ1z1|=π/3|ζnzn|=π/3ϕ(ζ1,,ζn)(ζ1z1)γ1+1(ζnzn)γn+1𝑑ζ1𝑑ζn|\displaystyle\left|\frac{\gamma!}{(2\pi i)^{n}}\int_{|\zeta_{1}-z_{1}|=\pi/3}\cdots\int_{|\zeta_{n}-z_{n}|=\pi/3}\frac{\phi(\zeta_{1},\ldots,\zeta_{n})}{(\zeta_{1}-z_{1})^{\gamma_{1}+1}\cdots(\zeta_{n}-z_{n})^{\gamma_{n}+1}}d\zeta_{1}\cdots d\zeta_{n}\right|
\displaystyle\leq 30γ!(π/3)k+n30γ!(π/3)k.\displaystyle\frac{30\gamma!}{(\pi/3)^{k+n}}\leq\frac{30\gamma!}{(\pi/3)^{k}}.

The power series of ϕ\phi at 𝟎\mathbf{0} is defined as Pϕ(z)=γ0naγzγP_{\phi}(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}\cdot z^{\gamma}, where aγ=1γ!dγdzγϕ(𝟎)a_{\gamma}=\frac{1}{\gamma!}\frac{d^{\gamma}}{dz^{\gamma}}\phi(\mathbf{0}). For any γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n} with k=|γ|k=|\gamma|, we have |aγ|1/k(30/(π/3)k)1/k=(30)1/k3/π|a_{\gamma}|^{1/k}\leq(30/(\pi/3)^{k})^{1/k}=(30)^{1/k}\cdot 3/\pi, which tends to 3/π<13/\pi<1 as kk\rightarrow\infty. Thus, by the (multivariate version of the) Cauchy-Hadamard theorem, the radius of convergence of the power series of ϕ\phi at 𝟎\mathbf{0} is at least π/31\pi/3\geq 1. ∎
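The numerical constants used in this proof are easy to verify directly; a minimal sketch:

```python
from math import cos, exp, pi

# Constants used in the proof of Lemma B.7.
assert cos(pi / 3) * exp(-pi / 3) > 1 / 10   # lower bound on Re(exp(zeta_k))
assert exp(pi / 3) < 3                        # upper bound on |exp(zeta_j)|
assert pi / 3 >= 1 and 3 / pi < 1             # radius-of-convergence conclusion
for k in (100, 1000):
    # |a_gamma|^{1/k} <= 30^{1/k} * 3/pi, which drops below 1 for large k.
    assert 30 ** (1 / k) * 3 / pi < 1
```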

B.2 Proof of Lemma 4.6

In this section we prove Lemma 4.6, which, as explained in Section 4.3, is an important ingredient in the proof of Lemma 4.5. The detailed version of Lemma 4.6 is presented below; it includes several claims which are omitted for simplicity in the abbreviated version in Section 4.3.

Lemma 4.6 (Detailed).

Fix any integer h0h\geq 0, a multi-index γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n} and set k=|γ|k=|\gamma|. For each of the khk^{h} functions π:[h][k]\pi:[h]\rightarrow[k], and for each r[k]r\in[k], there are integers hπ,r{0,1,,h}h^{\prime}_{\pi,r}\in\{0,1,\ldots,h\}, tπ,r0t^{\prime}_{\pi,r}\geq 0, and jπ,r[n]j^{\prime}_{\pi,r}\in[n], so that the following holds. For any sequence L(1),,L(T)nL^{(1)},\ldots,L^{(T)}\in\mathbb{R}^{n} of vectors, it holds that

DhLγ=π:[h][k]r=1kEtπ,rDhπ,r(L(jπ,r)).\displaystyle\operatorname{D}_{h}{L^{\gamma}}=\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{(L(j^{\prime}_{\pi,r}))}}. (47)

Moreover, the following properties hold:

  1. 1.

    For each π\pi and r[k]r\in[k], hπ,r=|{q[h]:π(q)=r}|h^{\prime}_{\pi,r}=|\{q\in[h]:\pi(q)=r\}|. In particular, r=1khπ,r=h\sum_{r=1}^{k}h^{\prime}_{\pi,r}=h.

  2. 2.

    For each π\pi and r[k]r\in[k], it holds that 0tπ,r+hπ,rh0\leq t^{\prime}_{\pi,r}+h^{\prime}_{\pi,r}\leq h.

  3. 3.

    For each π\pi and j[n]j\in[n], γj=|{r[k]:jπ,r=j}|\gamma_{j}=|\{r\in[k]:j^{\prime}_{\pi,r}=j\}|.

Proof of Lemma 4.6.

We use induction on hh. First note that in the case h=0h=0 and for any k0k\geq 0, we have that (DhLγ)(t)=(L(t))γ\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}=(L^{(t)})^{\gamma}, and so for the unique function π:[k]\pi:\emptyset\rightarrow[k], for all r[k]r\in[k], we may take tπ,r=0t^{\prime}_{\pi,r}=0, hπ,r=0h^{\prime}_{\pi,r}=0, and ensure that for each j[n]j\in[n] there are γj\gamma_{j} values of rr so that jπ,r=jj^{\prime}_{\pi,r}=j.

Now fix any integer h>0h>0, and suppose the statement of the claim holds for all h<hh^{\prime}<h. We have that

DhLγ\displaystyle\operatorname{D}_{h}{L^{\gamma}}{}
=\displaystyle= D1Dh1Lγ\displaystyle\operatorname{D}_{1}{\operatorname{D}_{h-1}{L^{\gamma}}}{}
=\displaystyle= D1π:[h1][k]r=1kEtπ,rDhπ,rL(jπ,r)\displaystyle\operatorname{D}_{1}{\sum_{\pi:[h-1]\rightarrow[k]}\prod_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}}{}
=\displaystyle= π:[h1][k]r=1kD1Etπ,rDhπ,rL(jπ,r)r=1r1Etπ,rDhπ,rL(jπ,r)r=r+1kEtπ,r+1Dhπ,rL(jπ,r)\displaystyle\sum_{\pi:[h-1]\rightarrow[k]}\sum_{r=1}^{k}\operatorname{D}_{1}{\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}}\cdot\prod_{r^{\prime}=1}^{r-1}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}}\cdot\prod_{r^{\prime}=r+1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}+1}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}} (48)
=\displaystyle= π:[h1][k]r=1kEtπ,rDhπ,r+1L(jπ,r)r=1r1Etπ,rDhπ,rL(jπ,r)r=r+1kEtπ,r+1Dhπ,rL(jπ,r).\displaystyle\sum_{\pi:[h-1]\rightarrow[k]}\sum_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}+1}{L(j^{\prime}_{\pi,r})}}\cdot\prod_{r^{\prime}=1}^{r-1}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}}\cdot\prod_{r^{\prime}=r+1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}+1}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}}. (49)

where (48) uses Lemma B.2 and (49) uses the commutativity of Et\operatorname{E}_{t^{\prime}}{} and D1\operatorname{D}_{1}{}. For each π:[h1][k]\pi:[h-1]\rightarrow[k], we construct kk functions π1,,πk:[h][k]\pi_{1},\ldots,\pi_{k}:[h]\rightarrow[k], defined by πr(q)=π(q)\pi_{r}(q)=\pi(q) for q<hq<h, and πr(h)=r\pi_{r}(h)=r for r[k]r\in[k]. Next, for r,r[k]r,r^{\prime}\in[k], we define the quantities hπr,r,tπr,r,jπr,rh^{\prime}_{\pi_{r},r^{\prime}},t^{\prime}_{\pi_{r},r^{\prime}},j^{\prime}_{\pi_{r},r^{\prime}} as follows:

  • Set hπr,r=hπ,rh^{\prime}_{\pi_{r},r^{\prime}}=h^{\prime}_{\pi,r^{\prime}} if rrr\neq r^{\prime}, and hπr,r=hπ,r+1h^{\prime}_{\pi_{r},r}=h^{\prime}_{\pi,r}+1.

  • Set tπr,r=tπ,rt^{\prime}_{\pi_{r},r^{\prime}}=t^{\prime}_{\pi,r^{\prime}} if rrr^{\prime}\leq r, and tπr,r=tπ,r+1t^{\prime}_{\pi_{r},r^{\prime}}=t^{\prime}_{\pi,r^{\prime}}+1 if r>rr^{\prime}>r.

  • Set jπr,r=jπ,rj^{\prime}_{\pi_{r},r^{\prime}}=j^{\prime}_{\pi,r^{\prime}}.

By (49) and the above definitions, we have

DhLγ=π:[h][k]r=1kEtπ,rDhπ,rL(jπ,r),\displaystyle\operatorname{D}_{h}{L^{\gamma}}=\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}},

thus verifying (47) for the value hh.

Finally, we verify that items 1 through 3 in the lemma statement hold. The definition of hπr,rh^{\prime}_{\pi_{r},r^{\prime}} above together with the inductive hypothesis ensures that for all r,r[k]r,r^{\prime}\in[k], hπr,r=|{q[h]:πr(q)=r}|h^{\prime}_{\pi_{r},r^{\prime}}=|\{q\in[h]:\pi_{r}(q)=r^{\prime}\}|, thus verifying item 1 of the lemma statement. Since hπr,r+tπr,rhπ,r+tπ,r+1h^{\prime}_{\pi_{r},r^{\prime}}+t^{\prime}_{\pi_{r},r^{\prime}}\leq h^{\prime}_{\pi,r^{\prime}}+t^{\prime}_{\pi,r^{\prime}}+1 for all r,rr,r^{\prime}, it follows from the inductive hypothesis that 0hπr,r+tπr,rh0\leq h^{\prime}_{\pi_{r},r^{\prime}}+t^{\prime}_{\pi_{r},r^{\prime}}\leq h; this verifies item 2. Finally, note that for any j[n]j\in[n] and r[k]r\in[k], {r[k]:jπ,r=j}={r[k]:jπr,r=j}\{r^{\prime}\in[k]:j^{\prime}_{\pi,r^{\prime}}=j\}=\{r^{\prime}\in[k]:j^{\prime}_{\pi_{r},r^{\prime}}=j\}, and thus item 3 follows from the inductive hypothesis. ∎
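The identity driving step (48) is the discrete Leibniz rule of Lemma B.2. As a sanity check, the sketch below numerically verifies the form of that rule used above: D1\operatorname{D}_{1} of a product of kk sequences expands into kk terms, with the factors before the differenced one evaluated at time tt and the factors after it shifted to t+1t+1. Here D1\operatorname{D}_{1} and E1\operatorname{E}_{1} are assumed to be the forward-difference and shift operators, as used throughout this appendix.

```python
import random

def D1(seq):
    # First-order forward finite difference: (D1 x)^(t) = x^(t+1) - x^(t).
    return [seq[t + 1] - seq[t] for t in range(len(seq) - 1)]

def E1(seq):
    # Shift operator: (E1 x)^(t) = x^(t+1).
    return seq[1:]

def prod_seq(seqs):
    # Pointwise product of a list of sequences.
    return [eval_prod := None or __import__("math").prod(s[t] for s in seqs)
            for t in range(len(seqs[0]))]

random.seed(0)
k, T = 4, 8
A = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(k)]

lhs = D1(prod_seq(A))
for t in range(T - 1):
    rhs = 0.0
    for r in range(k):
        term = D1(A[r])[t]
        for rp in range(r):       # factors before r: evaluated at time t
            term *= A[rp][t]
        for rp in range(r + 1, k):  # factors after r: shifted to time t+1
            term *= E1(A[rp])[t]
        rhs += term
    assert abs(lhs[t] - rhs) < 1e-9
```

The check succeeds because the sum over rr telescopes to rAr(t+1)rAr(t)\prod_{r}A_{r}^{(t+1)}-\prod_{r}A_{r}^{(t)}.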

B.3 Proof of Lemma 4.5

In this section we prove Lemma 4.5. To introduce the detailed version of the lemma we need the following definition. Suppose ϕ:n\phi:\mathbb{R}^{n}\rightarrow\mathbb{R} is a real-valued function that is real-analytic in a neighborhood of the origin. For real numbers Q,R>0Q,R>0, we say that ϕ\phi is (Q,R)(Q,R)-bounded if the Taylor series of ϕ\phi at 𝟎\mathbf{0}, denoted Pϕ(z1,,zn)=γ0naγzγP_{\phi}(z_{1},\ldots,z_{n})=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}z^{\gamma}, satisfies, for each integer k0k\geq 0, γ0n:|γ|=k|aγ|QRk\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:|\gamma|=k}|a_{\gamma}|\leq Q\cdot R^{k}. In the statement of Lemma 4.5 below, the quantity 000^{0} is interpreted as 1 (in particular, (h)B0h=1(h^{\prime})^{B_{0}h^{\prime}}=1 for h=0h^{\prime}=0).

Lemma 4.5 (“Boundedness chain rule” for finite differences; detailed).

Suppose that h,nh,n\in\mathbb{N}, ϕ:n\phi:\mathbb{R}^{n}\rightarrow\mathbb{R} is a (Q,R)(Q,R)-bounded function so that the radius of convergence of its power series at 𝟎\mathbf{0} is at least ν>0\nu>0, and L=(L(1),,L(T))nL=(L^{(1)},\ldots,L^{(T)})\in\mathbb{R}^{n} is a sequence of vectors satisfying L(t)ν\|L^{(t)}\|_{\infty}\leq\nu for t[T]t\in[T]. Suppose for some α(0,1)\alpha\in(0,1), for each 0hh0\leq h^{\prime}\leq h and t[Th]t\in[T-h^{\prime}], it holds that DhL(t)1B1αh(h)B0h\|\operatorname{D}_{h^{\prime}}{L}^{(t)}\|_{\infty}\leq\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}h^{\prime}} for some B12e2R,B03B_{1}\geq 2e^{2}R,B_{0}\geq 3. Then for all t[Th]t\in[T-h],

|(Dh(ϕL))(t)|12RQe2B1αhhB0h+1.\displaystyle|\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}|\leq\frac{12RQe^{2}}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}.
Proof of Lemma 4.5.

Note that the hhth order finite differences of a constant sequence are identically 0 for h1h\geq 1, so by subtracting ϕ(𝟎)\phi(\mathbf{0}) from ϕ\phi, we may assume without loss of generality that ϕ(𝟎)=0\phi(\mathbf{0})=0. (Here 𝟎\mathbf{0} denotes the all-zeros vector.)

By assumption, the radius of convergence of the power series of ϕ\phi at the origin is at least ν\nu, and so for each γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, there is a real number aγa_{\gamma} so that for z=(z1,,zn)z=(z_{1},\ldots,z_{n}) with |zj|ν|z_{j}|\leq\nu for each jj,

ϕ(z)=k,γ0n:|γ|=kaγzγ.\phi(z)=\sum_{k\in\mathbb{N},\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}a_{\gamma}z^{\gamma}. (50)

Let Ak:=γ0n:|γ|=k|aγ|A_{k}:=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:|\gamma|=k}|a_{\gamma}|; by the assumption that ϕ\phi is (Q,R)(Q,R)-bounded, we have that AkQRkA_{k}\leq Q\cdot R^{k} for all kk\in\mathbb{N}.

For γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, recall that LγL^{\gamma} denotes the sequence ((Lγ)(1),,(Lγ)(T))((L^{\gamma})^{(1)},\ldots,(L^{\gamma})^{(T)}), defined by (Lγ)(t)=(L(t)(1))γ1(L(t)(n))γn(L^{\gamma})^{(t)}=(L^{(t)}(1))^{\gamma_{1}}\cdots(L^{(t)}(n))^{\gamma_{n}}. Then since L(t)ν\|L^{(t)}\|_{\infty}\leq\nu for all t[T]t\in[T], we have that, for t[Th]t\in[T-h], (Dh(ϕL))(t)=γ0naγ(DhLγ)(t)\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}\cdot\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}.

We next upper bound the quantities |(DhLγ)(t)||\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}|. To do so, fix some γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, and set k=|γ|k=|\gamma|. For each function π:[h][k]\pi:[h]\rightarrow[k] and r[k]r\in[k], recall the integers hπ,r{0,1,,h}h^{\prime}_{\pi,r}\in\{0,1,\ldots,h\}, tπ,r0t^{\prime}_{\pi,r}\geq 0, jπ,r[n]j^{\prime}_{\pi,r}\in[n] defined in Lemma 4.6. By assumption it holds that for each t[Th]t\in[T-h], each hhh^{\prime}\leq h, each 0th0\leq t^{\prime}\leq h, |(DhL(j))(t+t)|1B1αh(h)B0h|\left(\operatorname{D}_{h^{\prime}}{L(j)}\right)^{(t+t^{\prime})}|\leq\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}h^{\prime}}. It follows that for each t[Th]t\in[T-h] and function π:[h][k]\pi:[h]\rightarrow[k],

|r=1k(Etπ,rDhπ,rL(jπ,r))(t)|r=1k1B1αhπ,r(hπ,r)B0hπ,r=αhB1kr=1k(hπ,r)B0hπ,r,\displaystyle\left|\prod_{r=1}^{k}\left(\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}\right)^{(t)}\right|\leq\prod_{r=1}^{k}\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}_{\pi,r}}\cdot(h^{\prime}_{\pi,r})^{B_{0}h^{\prime}_{\pi,r}}=\frac{\alpha^{h}}{B_{1}^{k}}\cdot\prod_{r=1}^{k}(h^{\prime}_{\pi,r})^{B_{0}h^{\prime}_{\pi,r}},

where the last equality uses that r=1khπ,r=h\sum_{r=1}^{k}h^{\prime}_{\pi,r}=h (item 1 of Lemma 4.6). Then by Lemma 4.6, we have:

|(DhLγ)(t)|\displaystyle\left|\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}\right|\leq π:[h][k]|r=1k(Etπ,rDhπ,rL(jπ,r))(t)|\displaystyle\sum_{\pi:[h]\rightarrow[k]}\left|\prod_{r=1}^{k}\left(\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}\right)^{(t)}\right|
\displaystyle\leq αhB1kπ:[h][k]r=1k(hπ,r)B0hπ,r\displaystyle\frac{\alpha^{h}}{B_{1}^{k}}\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}(h^{\prime}_{\pi,r})^{B_{0}h^{\prime}_{\pi,r}}
\displaystyle\leq αhB1khB0hmax{k7,(hk+1)exp(2khB01)},\displaystyle\frac{\alpha^{h}}{B_{1}^{k}}\cdot h^{B_{0}h}\max\left\{k^{7},(hk+1)\cdot\exp\left(\frac{2k}{h^{B_{0}-1}}\right)\right\}, (51)

where (51) follows from Lemma B.4, the fact that B03B_{0}\geq 3, and that hπ,r=|{q[h]:π(q)=r}|h^{\prime}_{\pi,r}=|\{q\in[h]:\pi(q)=r\}| (item 1 of Lemma 4.6).

We may now bound the order-hh finite differences of the sequence ϕL\phi\circ L as follows: for t[Th]t\in[T-h],

|(Dh(ϕL))(t)|\displaystyle|\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}|\leq γ0n|aγ||(DhLγ)(t)|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}|a_{\gamma}|\cdot\left|\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}\right|
\displaystyle\leq αhhB0h+1γ0n|aγ|B1|γ|max{|γ|7,(|γ|+1)exp(2|γ|hB01)}\displaystyle\alpha^{h}\cdot h^{B_{0}h+1}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}|a_{\gamma}|\cdot B_{1}^{-|\gamma|}\cdot\max\left\{|\gamma|^{7},(|\gamma|+1)\cdot\exp\left(\frac{2|\gamma|}{h^{B_{0}-1}}\right)\right\} (52)
\displaystyle\leq αhhB0h+1kAkB1k(k7+2kexp(2k/hB01))\displaystyle\alpha^{h}\cdot h^{B_{0}h+1}\cdot\sum_{k\in\mathbb{N}}A_{k}\cdot B_{1}^{-k}\cdot\left(k^{7}+2k\cdot\exp(2k/h^{B_{0}-1})\right)
\displaystyle\leq αhhB0h+1Q(kk7(R/B1)k+k2k(R/B1)ke2k)\displaystyle\alpha^{h}\cdot h^{B_{0}h+1}\cdot Q\cdot\left(\sum_{k\in\mathbb{N}}k^{7}\cdot(R/B_{1})^{k}+\sum_{k\in\mathbb{N}}2k\cdot(R/B_{1})^{k}\cdot e^{2k}\right) (53)
\displaystyle\leq 2RQe2B1αhhB0h+1(kk7(2e2)k+2kk2k)\displaystyle\frac{2RQe^{2}}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}\cdot\left(\sum_{k\in\mathbb{N}}k^{7}\cdot(2e^{2})^{-k}+2\sum_{k\in\mathbb{N}}k\cdot 2^{-k}\right) (54)
=\displaystyle= 12RQe2B1αhhB0h+1.\displaystyle\frac{12RQe^{2}}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}.

where (52) uses (51), (53) uses the bound AkQRkA_{k}\leq QR^{k}, and (54) uses the assumption B12e2RB_{1}\geq 2e^{2}R. This gives the desired conclusion of the lemma. ∎
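The arithmetic in the last step can be checked directly: the two series in parentheses sum to at most 6, which combined with the prefactor 2RQe2/B12RQe^{2}/B_{1} yields the constant 12 in the conclusion.

```python
from math import e

# S1 = sum_k k^7 (2e^2)^{-k},  S2 = 2 * sum_k k 2^{-k} = 4; need S1 + S2 <= 6.
S1 = sum(k ** 7 * (2 * e ** 2) ** (-k) for k in range(1, 200))
S2 = 2 * sum(k * 2.0 ** (-k) for k in range(1, 200))
assert abs(S2 - 4) < 1e-9
assert S1 + S2 <= 6
```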

B.4 Proof of Lemma 4.4

In this section we prove Lemma 4.4. The detailed version of Lemma 4.4 is stated below.

Lemma 4.4 (Detailed).

Fix a parameter α(0,1H+3)\alpha\in\left(0,\frac{1}{H+3}\right). If all players follow Optimistic Hedge updates with step size ηα36e5m\eta\leq\frac{\alpha}{36e^{5}m}, then for any player i[m]i\in[m], integer hh satisfying 0hH0\leq h\leq H, time step t[Th]t\in[T-h], it holds that

(Dhi)(t)αhh3h+1.\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq\alpha^{h}\cdot h^{3h+1}.
Proof.

We have that for each agent i[m]i\in[m], each t[T]t\in[T], and each ai[ni]a_{i}\in[n_{i}], i(t)(ai)=𝔼aixi(t):ii[i(a1,,am)]\ell_{i}^{(t)}(a_{i})=\mathbb{E}_{a_{i^{\prime}}\sim x_{i^{\prime}}^{(t)}:\ i^{\prime}\neq i}[\mathcal{L}_{i}(a_{1},\ldots,a_{m})]. Thus, for 1tT1\leq t\leq T,

|(Dhi)(t)(ai)|=\displaystyle\left|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}(a_{i})\right|= |s=0h(hs)(1)hsi(t+s)(ai)|\displaystyle\left|\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}\ell_{i}^{(t+s)}(a_{i})\right| (55)
=\displaystyle= |ai[ni],iii(a1,,am)s=0h(hs)(1)hsiixi(t+s)(ai)|\displaystyle\left|\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\mathcal{L}_{i}(a_{1},\ldots,a_{m})\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}\cdot\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t+s)}(a_{i^{\prime}})\right|
\displaystyle\leq ai[ni],ii|s=0h(hs)(1)hsiixi(t+s)(ai)|\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\left|\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}\cdot\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t+s)}(a_{i^{\prime}})\right|
=\displaystyle= ai[ni],ii|(Dh(iixi(ai)))(t)|,\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}})\right)}\right)^{(t)}\right|, (56)

where (55) and (56) use Remark 4.3 and in (56), iixi(ai)\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}}) refers to the sequence iixi(1)(ai),\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(1)}(a_{i^{\prime}}), iixi(2)(ai),\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(2)}(a_{i^{\prime}}),\ldots, iixi(T)(ai)\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(T)}(a_{i^{\prime}}).

In the remainder of this lemma we will prepend to the loss sequence i(1),,i(T)\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)} the vectors i(0)=i(1):=𝟎ni\ell_{i}^{(0)}=\ell_{i}^{(-1)}:=\mathbf{0}\in\mathbb{R}^{n_{i}}. We will also prepend xi(0):=xi(1)=(1/ni,,1/ni)Δnix_{i}^{(0)}:=x_{i}^{(1)}=(1/n_{i},\ldots,1/n_{i})\in\Delta^{n_{i}} to the strategy sequence xi(1),,xi(T)x_{i}^{(1)},\ldots,x_{i}^{(T)}. Next notice that for any agent i[m]i\in[m], any t0{0,1,,T}t_{0}\in\{0,1,\ldots,T\}, and any t0t\geq 0, by the definition (1) of the Optimistic Hedge updates, it holds that, for each j[ni]j\in[n_{i}],

xi(t0+t+1)(j)=xi(t0)(j)exp(η(i(t01)(j)s=0ti(t0+s)(j)i(t0+t)(j)))k=1nixi(t0)(k)exp(η(i(t01)(k)s=0ti(t0+s)(k)i(t0+t)(k))).x_{i}^{(t_{0}+t+1)}(j)=\frac{x_{i}^{(t_{0})}(j)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t_{0}-1)}(j)-\sum_{s=0}^{t}\ell_{i}^{(t_{0}+s)}(j)-\ell_{i}^{(t_{0}+t)}(j)\right)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t_{0})}(k)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t_{0}-1)}(k)-\sum_{s=0}^{t}\ell_{i}^{(t_{0}+s)}(k)-\ell_{i}^{(t_{0}+t)}(k)\right)\right)}.

Note in particular that our definitions of i(0),i(1),xi(0)\ell_{i}^{(0)},\ell_{i}^{(-1)},x_{i}^{(0)} ensure that the above equation holds even for t0{0,1}t_{0}\in\{0,1\}. Now fix an integer t0t_{0} satisfying 0t0T0\leq t_{0}\leq T; for t0t\geq 0, let us write

¯i,t0(t):=i(t01)s=0t1i(t0+s)i(t0+t1).\bar{\ell}_{i,t_{0}}^{(t)}:=\ell_{i}^{(t_{0}-1)}-\sum_{s=0}^{t-1}\ell_{i}^{(t_{0}+s)}-\ell_{i}^{(t_{0}+t-1)}.

Also, for a vector z=(z(1),,z(ni))niz=(z(1),\ldots,z(n_{i}))\in\mathbb{R}^{n_{i}} and an index j[ni]j\in[n_{i}], define

ϕt0,j(z):=exp(z(j))k=1nixi(t0)(k)exp(z(k)),\phi_{t_{0},j}(z):=\frac{\exp\left(z(j)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t_{0})}(k)\cdot\exp\left(z(k)\right)}, (57)

so that xi(t0+t)(j)=xi(t0)(j)ϕt0,j(η¯i,t0(t))x_{i}^{(t_{0}+t)}(j)=x_{i}^{(t_{0})}(j)\cdot\phi_{t_{0},j}(\eta\cdot\bar{\ell}_{i,t_{0}}^{(t)}) for t1t\geq 1. In particular, for any i[m]i\in[m], and any choices of ai[ni]a_{i^{\prime}}\in[n_{i^{\prime}}] for all iii^{\prime}\neq i,

iixi(t0+t)(ai)=iixi(t0)(ai)ϕt0,ai(η¯i,t0(t)).\displaystyle\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0}+t)}(a_{i^{\prime}})=\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0})}(a_{i^{\prime}})\cdot\phi_{t_{0},a_{i^{\prime}}}(\eta\cdot\bar{\ell}_{i^{\prime},t_{0}}^{(t)}). (58)
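The closed-form expression for the iterates can be sanity-checked numerically. Update (1) is not reproduced in this appendix, so the sketch below assumes the one-step rule implied by the ratio of consecutive closed-form iterates, namely x(t+1)(j)x(t)(j)exp(η(2(t)(j)(t1)(j)))x^{(t+1)}(j)\propto x^{(t)}(j)\cdot\exp(-\eta(2\ell^{(t)}(j)-\ell^{(t-1)}(j))), and verifies that iterating it from the uniform x(1)x^{(1)} (with (0)=𝟎\ell^{(0)}=\mathbf{0}) reproduces the closed form with t0=0t_{0}=0.

```python
import math, random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def step(x, loss_t, loss_prev, eta):
    # One Optimistic Hedge step (assumed form): x^(t+1)(j) proportional to
    # x^(t)(j) * exp(-eta * (2*loss_t(j) - loss_prev(j))).
    return normalize([xj * math.exp(-eta * (2 * lt - lp))
                      for xj, lt, lp in zip(x, loss_t, loss_prev)])

random.seed(1)
n, T, eta = 3, 6, 0.1
losses = [[0.0] * n] + [[random.random() for _ in range(n)] for _ in range(T)]

x = [1.0 / n] * n          # x^(1) uniform
xs = [x]
for t in range(1, T + 1):
    x = step(x, losses[t], losses[t - 1], eta)
    xs.append(x)           # xs[t] = x^(t+1)

# Closed form with t0 = 0 (ell^(-1) = 0, x^(0) uniform):
# x^(t+1)(j) proportional to exp(-eta * (sum_{s<=t} ell^(s)(j) + ell^(t)(j))).
for t in range(1, T + 1):
    cum = [sum(losses[s][j] for s in range(1, t + 1)) + losses[t][j]
           for j in range(n)]
    closed = normalize([math.exp(-eta * c) for c in cum])
    assert all(abs(a - b) < 1e-9 for a, b in zip(xs[t], closed))
```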

Next, note that

(D1¯i,t0)(t)=i(t0+t1)2i(t0+t)=i(t0+t1)2(E1i)(t0+t1),\left(\operatorname{D}_{1}{\bar{\ell}_{i,t_{0}}}\right)^{(t)}=\ell_{i}^{(t_{0}+t-1)}-2\ell_{i}^{(t_{0}+t)}=\ell_{i}^{(t_{0}+t-1)}-2\left(\operatorname{E}_{1}{\ell_{i}}\right)^{(t_{0}+t-1)},

meaning that for any h1h^{\prime}\geq 1,

(Dh¯i,t0)(t)=(Dh1i)(t0+t1)2(E1Dh1i)(t0+t1).\left(\operatorname{D}_{h^{\prime}}{\bar{\ell}_{i,t_{0}}}\right)^{(t)}=\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t_{0}+t-1)}-2\left(\operatorname{E}_{1}{\operatorname{D}_{h^{\prime}-1}{\ell_{i}}}\right)^{(t_{0}+t-1)}. (59)

We next establish the following claims which will allow us to prove Lemma 4.4 by induction.

Claim B.8.

For any t0{0,1,,T}t_{0}\in\{0,1,\ldots,T\}, t0t\geq 0, and i[m]i\in[m], it holds that ¯i,t0(t)t+2\|\bar{\ell}_{i,t_{0}}^{(t)}\|_{\infty}\leq t+2.

Proof of Claim B.8.

The claim is immediate from the triangle inequality and the fact that i(t)1\|\ell_{i}^{(t)}\|_{\infty}\leq 1 for all t[T]t\in[T]. ∎

Claim B.9.

Fix hh so that 1hH1\leq h\leq H. Suppose that for some B03B_{0}\geq 3 and for all 0h<h0\leq h^{\prime}<h, all i[m]i\in[m], and all tTht\leq T-h^{\prime}, it holds that (Dhi)(t)αh(h+1)B0(h+1)\|\left(\operatorname{D}_{h^{\prime}}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq\alpha^{h^{\prime}}\cdot(h^{\prime}+1)^{B_{0}(h^{\prime}+1)}. Suppose that the step size η\eta satisfies ηmin{α36e5m,112e5(H+3)m}\eta\leq\min\left\{\frac{\alpha}{36e^{5}m},\frac{1}{12e^{5}(H+3)m}\right\}. Then for all i[m]i\in[m] and 1tTh1\leq t\leq T-h,

(Dhi)(t)αhhB0h+1.\left\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\right\|_{\infty}\leq\alpha^{h}\cdot h^{B_{0}h+1}. (60)
Proof of Claim B.9.

Set B1:=12e5mB_{1}:=12e^{5}m, so that the assumption of the claim gives ηmin{α3B1,1B1(H+3)}\eta\leq\min\left\{\frac{\alpha}{3B_{1}},\frac{1}{B_{1}(H+3)}\right\}.

We first use Lemma 4.5 to bound, for each 0t0Th0\leq t_{0}\leq T-h, i[m]i\in[m], and ai[ni]a_{i^{\prime}}\in[n_{i^{\prime}}] for all iii^{\prime}\neq i, the quantity |(Dh(iixi(ai)))(t0+1)|\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}})\right)}\right)^{(t_{0}+1)}\right| . In particular, we will apply Lemma 4.5 with n=iinin=\sum_{i^{\prime}\neq i}n_{i^{\prime}}, ν=1\nu=1, the value of hh in the statement of Claim B.9, T=h+1T=h+1, and the sequence L(t)L^{(t)}, for 1th+11\leq t\leq h+1, defined as

L(t)=(η¯1,t0(t),,η¯i1,t0(t),η¯i+1,t0(t),,η¯m,t0(t)),L^{(t)}=\left(\eta\cdot\bar{\ell}_{1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{i-1,t_{0}}^{(t)},\eta\cdot\bar{\ell}_{i+1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{m,t_{0}}^{(t)}\right),

namely the concatenation of the vectors η¯1,t0(t),,η¯i1,t0(t),η¯i+1,t0(t),,η¯m,t0(t)\eta\cdot\bar{\ell}_{1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{i-1,t_{0}}^{(t)},\eta\cdot\bar{\ell}_{i+1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{m,t_{0}}^{(t)}. The function ϕ\phi in Lemma 4.5 is set to the function that takes as input the concatenation of ziniz_{i^{\prime}}\in\mathbb{R}^{n_{i^{\prime}}} for all iii^{\prime}\neq i and outputs:

ϕt0,ai(z1,,zi1,zi+1,,zm):=iiϕt0,ai(zi),\displaystyle\phi_{t_{0},a_{-i}}(z_{1},\ldots,z_{i-1},z_{i+1},\ldots,z_{m}):=\prod_{i^{\prime}\neq i}\phi_{t_{0},a_{i^{\prime}}}(z_{i^{\prime}}), (61)

where the functions ϕt0,ai\phi_{t_{0},a_{i^{\prime}}} are as defined in (57). We first verify the preconditions of Lemma 4.5. By Lemma B.6, ϕt0,ai\phi_{t_{0},a_{-i}} is a (1,e3m)(1,e^{3}m)-bounded function. By Lemma B.7, the radius of convergence of each function ϕt0,ai\phi_{t_{0},a_{i^{\prime}}} at 𝟎\mathbf{0} is at least 1; thus the radius of convergence of ϕt0,ai\phi_{t_{0},a_{-i}} at 𝟎\mathbf{0} is at least ν=1\nu=1. Claim B.8 gives that ¯i,t0(t)t+2h+3\|\bar{\ell}_{i,t_{0}}^{(t)}\|_{\infty}\leq t+2\leq h+3 for all th+1t\leq h+1. Thus, since η1B1(H+3)\eta\leq\frac{1}{B_{1}(H+3)},

(D0(η¯i,t0))(t)=η¯i,t0(t)η(H+3)1B1\displaystyle\left\|\left(\operatorname{D}_{0}{\left(\eta\cdot\bar{\ell}_{i,t_{0}}\right)}\right)^{(t)}\right\|_{\infty}=\|\eta\cdot\bar{\ell}_{i,t_{0}}^{(t)}\|_{\infty}\leq\eta\cdot(H+3)\leq\frac{1}{B_{1}}

for 1th+11\leq t\leq h+1. Next, for 1hh1\leq h^{\prime}\leq h and 1th+1h1\leq t\leq h+1-h^{\prime}, we have

(Dh(η¯i,t0))(t)\displaystyle\left\|\left(\operatorname{D}_{h^{\prime}}{(\eta\cdot\bar{\ell}_{i,t_{0}})}\right)^{(t)}\right\|_{\infty}\leq η(Dh1i)(t0+t1)+2η(Dh1i)(t0+t)\displaystyle\eta\cdot\left\|\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t_{0}+t-1)}\right\|_{\infty}+2\eta\cdot\left\|\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t_{0}+t)}\right\|_{\infty} (62)
\displaystyle\leq 3ηαh1(h)B0(h)\displaystyle 3\eta\cdot\alpha^{h^{\prime}-1}\cdot(h^{\prime})^{B_{0}(h^{\prime})} (63)
\displaystyle\leq 1B1αh(h)B0(h),\displaystyle\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}(h^{\prime})}, (64)

where (62) follows from (59), (63) follows from the assumption in the statement of Claim B.9 and t0+t+h1t0+hTt_{0}+t+h^{\prime}-1\leq t_{0}+h\leq T, and (64) follows from the fact that 3ηαB13\eta\leq\frac{\alpha}{B_{1}}. It then follows from Lemma 4.5 and (58) that

1iixi(t0)(ai)|(Dh(iixi(ai)))(t0+1)|\displaystyle\frac{1}{\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0})}(a_{i^{\prime}})}\cdot\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}{x_{i^{\prime}}(a_{i^{\prime}})}\right)}\right)^{(t_{0}+1)}\right|
=\displaystyle= |(Dh(ii(ϕt0,aiη¯i,t0)))(t0+1)|\displaystyle\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}\left(\phi_{t_{0},a_{i^{\prime}}}\circ\eta\bar{\ell}_{i^{\prime},t_{0}}\right)\right)}\right)^{(t_{0}+1)}\right| (65)
=\displaystyle= |(Dh(ϕt0,ai(η¯1,t0,,η¯i1,t0,η¯i+1,t0,,η¯m,t0)))(1)|\displaystyle\left|\left(\operatorname{D}_{h}{\left(\phi_{t_{0},a_{-i}}\circ(\eta\bar{\ell}_{1,t_{0}},\ldots,\eta\bar{\ell}_{i-1,t_{0}},\eta\bar{\ell}_{i+1,t_{0}},\ldots,\eta\bar{\ell}_{m,t_{0}})\right)}\right)^{(1)}\right| (66)
\displaystyle\leq 12e5mB1αhhB0h+1=αh(h)B0h+1.\displaystyle\frac{12e^{5}m}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}=\alpha^{h}\cdot(h)^{B_{0}h+1}. (67)

(In particular, (65) uses (58), (66) uses the definition of ϕt0,ai\phi_{t_{0},a_{-i}} in (61), and (67) uses Lemma 4.5.)

Next we use (56), which gives that for each i[m]i\in[m] and t1t\geq 1,

(Dhi)(t)\displaystyle\left\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\right\|_{\infty}\leq ai[ni],ii|(Dh(iixi(ai)))(t)|\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}})\right)}\right)^{(t)}\right|
\displaystyle\leq ai[ni],iiiixi(t0)(ai)αh(h)B0h+1\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0})}(a_{i^{\prime}})\cdot\alpha^{h}\cdot(h)^{B_{0}h+1} (68)
=\displaystyle= αh(h)B0h+1,\displaystyle\alpha^{h}\cdot(h)^{B_{0}h+1},

where (68) follows from (67) with t0=t1t_{0}=t-1 (here we use that t0t_{0} may be 0). This completes the proof of Claim B.9.∎

It is immediate that for all i[m],t[T]i\in[m],t\in[T], we have that (D0i)(t)1=α01B01\|\left(\operatorname{D}_{0}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq 1=\alpha^{0}\cdot 1^{B_{0}\cdot 1}. We now apply Claim B.9 inductively with B0=3B_{0}=3, for which it suffices to have ηα36e5m\eta\leq\frac{\alpha}{36e^{5}m} as long as α1/(H+3)\alpha\leq 1/(H+3). This gives that for 0hH0\leq h\leq H, i[m]i\in[m], and t[Th]t\in[T-h], (Dhi)(t)αhh3h+1\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq\alpha^{h}\cdot h^{3h+1}, completing the proof of Lemma 4.4. ∎

Appendix C Proofs for Section 4.4

The main goal of this section is to prove Lemma 4.7. First, in Section C.1 we prove some preliminary lemmas, and then we prove Lemma 4.7 in Section C.2.

C.1 Preliminary lemmas

Lemma C.1 shows that VarP(W)\operatorname{Var}_{{P}}\left({W}\right) and VarP(W)\operatorname{Var}_{{P^{\prime}}}\left({W}\right) are close when the entries of P,PP,P^{\prime} are close; it will be applied with P,PP,P^{\prime} equal to the strategies xi(t)Δnix_{i}^{(t)}\in\Delta^{n_{i}} played in the course of Optimistic Hedge.

Lemma C.1.

Suppose nn\in\mathbb{N} and M>0M>0 are given, and WnW\in\mathbb{R}^{n} is a vector. Suppose P,PΔnP,P^{\prime}\in\Delta^{n} are distributions with max{PP,PP}1+α\max\left\{\left\|\frac{P}{P^{\prime}}\right\|_{\infty},\left\|\frac{P^{\prime}}{P}\right\|_{\infty}\right\}\leq 1+\alpha for some α>0\alpha>0. Then

(1α)VarP(W)VarP(W)(1+α)VarP(W).\displaystyle(1-\alpha)\operatorname{Var}_{{P}}\left({W}\right)\leq\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\leq(1+\alpha)\operatorname{Var}_{{P}}\left({W}\right). (69)
Proof.

We first prove that VarP(W)(1+α)VarP(W)\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\leq(1+\alpha)\operatorname{Var}_{{P}}\left({W}\right). To do so, note that since adding a constant to every entry of WW does not change VarP(W)\operatorname{Var}_{{P}}\left({W}\right) or VarP(W)\operatorname{Var}_{{P^{\prime}}}\left({W}\right), by replacing WW with WP,W𝟏W-\langle P,W\rangle\cdot\mathbf{1}, we may assume without loss of generality that P,W=0\langle P,W\rangle=0. Thus VarP(W)=j=1nP(j)W(j)2\operatorname{Var}_{{P}}\left({W}\right)=\sum_{j=1}^{n}P(j)W(j)^{2}. Now we may compute:

VarP(W)\displaystyle\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\leq jP(j)W(j)2\displaystyle\sum_{j}P^{\prime}(j)\cdot W(j)^{2}
=\displaystyle= jP(j)W(j)2+j(P(j)P(j))W(j)2\displaystyle\sum_{j}P(j)\cdot W(j)^{2}+\sum_{j}(P^{\prime}(j)-P(j))\cdot W(j)^{2}
\displaystyle\leq (1+α)VarP(W),\displaystyle(1+\alpha)\operatorname{Var}_{{P}}\left({W}\right), (70)

where (70) uses the fact that PP1+α\left\|\frac{P^{\prime}}{P}\right\|_{\infty}\leq 1+\alpha.

By interchanging the roles of P,PP,P^{\prime}, we obtain that

VarP(W)11+αVarP(W)(1α)VarP(W).\displaystyle\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\geq\frac{1}{1+\alpha}\operatorname{Var}_{{P}}\left({W}\right)\geq(1-\alpha)\operatorname{Var}_{{P}}\left({W}\right).

This completes the proof of the lemma. ∎
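As a quick numerical sanity check of Lemma C.1 (our own illustration, not part of the paper; the distributions and the test vector below are arbitrary choices), the two-sided bound (69) can be verified in a few lines of Python:

```python
import random

def var(P, W):
    # Variance of the vector W under the distribution P
    mean = sum(p * w for p, w in zip(P, W))
    return sum(p * (w - mean) ** 2 for p, w in zip(P, W))

random.seed(0)
n, alpha = 10, 0.2
W = [random.uniform(-1.0, 1.0) for _ in range(n)]
P = [random.uniform(0.5, 1.0) for _ in range(n)]
P = [p / sum(P) for p in P]

# Perturb each coordinate by a factor in [1, 1+alpha]; after renormalizing,
# the coordinate-wise ratios P'/P and P/P' stay below 1 + alpha.
Pp = [p * random.uniform(1.0, 1.0 + alpha) for p in P]
Pp = [p / sum(Pp) for p in Pp]

ratio = max(max(p / q, q / p) for p, q in zip(P, Pp))
assert ratio <= 1 + alpha
assert (1 - alpha) * var(P, W) <= var(Pp, W) <= (1 + alpha) * var(P, W)
print("Lemma C.1 bound (69) holds on this instance")
```

In the proof of Lemma 4.7 the same bound is applied with P, P' taken to be nearby Optimistic Hedge iterates.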

Next we prove Lemma 4.8 (recall that only the special case μ=0\mu=0 was proved in Section 4.4). For convenience the lemma is repeated below.

Lemma 4.8 (Restated).

Suppose μ\mu\in\mathbb{R}, α>0\alpha>0, and W(0),,W(S1)W^{(0)},\ldots,W^{(S-1)}\in\mathbb{R} is a sequence of reals satisfying

t=0S1((D2W)(t))2αt=0S1((D1W)(t))2+μ.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu. (71)

Then

t=0S1((D1W)(t))2αt=1S1(W(t))2+μ/α.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=1}^{S-1}(W^{(t)})^{2}+\mu/\alpha.

To prove Lemma 4.8 we need the following basic facts about the Fourier transform:

Fact C.2 (Parseval’s equality).

It holds that t=0S1|W(t)|2=1Ss=0S1|W^(s)|2\sum_{t=0}^{S-1}|W^{(t)}|^{2}=\frac{1}{S}\sum_{s=0}^{S-1}|\widehat{W}^{(s)}|^{2}.

The second fact gives a formula for the Fourier transform of the circular finite differences; its simple form is the reason we work with circular finite differences in this section:

Fact C.3.

For h0h\in\mathbb{Z}_{\geq 0}, DhW^(s)=W^(s)(e2πis/S1)h\widehat{\operatorname{D}^{\circ}_{h}{W}}^{(s)}=\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1)^{h}.
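Both facts are standard and can be checked numerically. The sketch below is our own illustration; it assumes the unnormalized forward transform in which the s-th Fourier coefficient of W is the sum over t of W(t) times exp(-2*pi*i*s*t/S), which is the convention under which Facts C.2 and C.3 hold as stated:

```python
import cmath, random

def dft(W):
    # Unnormalized forward DFT: What[s] = sum_t W[t] * exp(-2*pi*1j*s*t/S)
    S = len(W)
    return [sum(W[t] * cmath.exp(-2j * cmath.pi * s * t / S) for t in range(S))
            for s in range(S)]

def circ_diff(W, h):
    # h-fold circular finite difference: one step maps W[t] to W[(t+1) % S] - W[t]
    S = len(W)
    for _ in range(h):
        W = [W[(t + 1) % S] - W[t] for t in range(S)]
    return W

random.seed(1)
S = 16
W = [random.uniform(-1.0, 1.0) for _ in range(S)]
What = dft(W)

# Fact C.2 (Parseval): sum_t |W(t)|^2 = (1/S) * sum_s |What(s)|^2
lhs = sum(w * w for w in W)
rhs = sum(abs(c) ** 2 for c in What) / S
assert abs(lhs - rhs) < 1e-9

# Fact C.3: the DFT of the h-th circular difference is What(s) * (exp(2*pi*1j*s/S) - 1)^h
h = 2
Dh_hat = dft(circ_diff(W, h))
for s in range(S):
    predicted = What[s] * (cmath.exp(2j * cmath.pi * s / S) - 1) ** h
    assert abs(Dh_hat[s] - predicted) < 1e-8
print("Facts C.2 and C.3 verified on a random sequence")
```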

Proof of Lemma 4.8.

Note that the discrete Fourier transform of D1W\operatorname{D}^{\circ}_{1}{W} satisfies D1W^(s)=W^(s)(e2πis/S1)\widehat{\operatorname{D}^{\circ}_{1}{W}}^{(s)}=\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1), and similarly D2W^(s)=W^(s)(e2πis/S1)2\widehat{\operatorname{D}^{\circ}_{2}{W}}^{(s)}=\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1)^{2}, for 0sS10\leq s\leq S-1. By the Cauchy-Schwarz inequality, Parseval’s equality (Fact C.2), Fact C.3, and the assumption that (71) holds, we have

t=0S1((D1W)(t))2=\displaystyle\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}= 1Ss=0S1|D1W^(s)|2\displaystyle\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{\operatorname{D}^{\circ}_{1}{W}}^{(s)}\right|^{2}
=\displaystyle= 1Ss=0S1|W^(s)(e2πis/S1)|2\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1)\right|^{2}
=\displaystyle= 1Ss=0S1|W^(s)||W^(s)||e2πis/S1|2\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|\cdot\left|\widehat{W}^{(s)}\right|\left|e^{2\pi is/S}-1\right|^{2}
\displaystyle\leq 1Ss=0S1|W^(s)|21Ss=0S1|W^(s)|2|e2πis/S1|4\sqrt{\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}}\cdot\sqrt{\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}\cdot\left|e^{2\pi is/S}-1\right|^{4}}
=\displaystyle= t=0S1(W(t))21Ss=0S1|D2W^(s)|2\displaystyle\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{\operatorname{D}^{\circ}_{2}{W}}^{(s)}\right|^{2}}
=\displaystyle= t=0S1(W(t))2t=0S1((D2W)(t))2\displaystyle\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}}
\displaystyle\leq t=0S1(W(t))2αt=0S1((D1W)(t))2+μ.\displaystyle\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu}. (72)

Note that for real numbers A>0A>0 and ϵ\epsilon with A+ϵ>0A+\epsilon>0, it holds that

A2A+ϵ=A1+ϵ/AA(1ϵ/A)=Aϵ.\frac{A^{2}}{{A+\epsilon}}=\frac{{A}}{1+{\epsilon/A}}\geq A\cdot(1-\epsilon/A)=A-\epsilon.

Taking A=t=0S1((D1W)(t))2A=\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2} and ϵ=μ/α\epsilon=\mu/\alpha (for which A+ϵ>0A+\epsilon>0 is immediate) and using (72) then gives

t=0S1((D1W)(t))2μ/α(t=0S1((D1W)(t))2)2t=0S1((D1W)(t))2+μ/ααt=0S1(W(t))2,\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}-\mu/\alpha\leq\frac{\left(\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}\right)^{2}}{{\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu/\alpha}}\leq\alpha\cdot\sum_{t=0}^{S-1}\left(W^{(t)}\right)^{2},

as desired. ∎
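The lemma can also be sanity-checked numerically. In the following sketch (our own illustration), mu is chosen as the exact slack in the hypothesis (71), and we verify both the Cauchy-Schwarz step ending at (72) (in squared form) and the conclusion, with the sum of W(t)^2 taken over all t as in the proof:

```python
import random

def circ_diff(W):
    # Circular first difference: maps W[t] to W[(t+1) % S] - W[t]
    S = len(W)
    return [W[(t + 1) % S] - W[t] for t in range(S)]

random.seed(2)
S, alpha = 32, 0.25
W = [random.uniform(-1.0, 1.0) for _ in range(S)]
D1 = circ_diff(W)
D2 = circ_diff(D1)

A1 = sum(d * d for d in D1)    # sum of squared circular first differences
A2 = sum(d * d for d in D2)    # sum of squared circular second differences
mu = A2 - alpha * A1           # slack making hypothesis (71) hold with equality
total = sum(w * w for w in W)

# Squared form of the Cauchy-Schwarz chain ending at (72): A1^2 <= total * A2
assert A1 ** 2 <= total * A2 + 1e-9
# Conclusion of Lemma 4.8
assert A1 <= alpha * total + mu / alpha + 1e-9
print("Lemma 4.8 verified on a random sequence")
```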

C.2 Proof of Lemma 4.7

Now we prove Lemma 4.7. For convenience we restate the lemma below with the exact value of the constant C0C_{0} referred to in the version in Section 4.4.

Lemma 4.7 (Restated).

For any M,ζ,α>0M,\zeta,\alpha>0 and nn\in\mathbb{N}, suppose that P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} and Z(1),,Z(T)[M,M]nZ^{(1)},\ldots,Z^{(T)}\in[-M,M]^{n} satisfy the following conditions:

  1. 1.

    The sequence P(1),,P(T)P^{(1)},\ldots,P^{(T)} is ζ\zeta-consecutively close for some ζ[1/(2T),α4/8256]\zeta\in[1/(2T),\alpha^{4}/8256].

  2. 2.

    It holds that t=1T2VarP(t)((D2Z)(t))αt=1T1VarP(t)((D1Z)(t))+μ.\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)\leq\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)+\mu.

Then

t=1T1VarP(t)((D1Z)(t))α(1+α)t=1TVarP(t)(Z(t))+μα+1290M2α3.\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\leq\alpha\cdot(1+\alpha)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{\mu}{\alpha}+\frac{1290M^{2}}{\alpha^{3}}. (73)
Proof.

Fix a positive integer S<1/(2ζ)<TS<1/(2\zeta)<T, to be specified exactly below. For 1t0TS+11\leq t_{0}\leq T-S+1, define μt0\mu_{t_{0}}\in\mathbb{R} by

μt0=s=0S3VarP(t0+s)((D2Z)(t0+s))αs=0S3VarP(t0+s)((D1Z)(t0+s)).\mu_{t_{0}}=\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)-\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right). (74)

Then

t0=1TS+1μt0\displaystyle\sum_{t_{0}=1}^{T-S+1}\mu_{t_{0}}
=\displaystyle= t=1T2VarP(t)((D2Z)(t))min{S2,t,Tt1}\displaystyle\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)\cdot\min\{S-2,t,T-t-1\}
αt=1T1VarP(t)((D1Z)(t))min{S2,t,Tt1}\displaystyle-\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\cdot\min\{S-2,t,T-t-1\}
\displaystyle\leq (S2)t=1T2VarP(t)((D2Z)(t))(S2)αt=1T1VarP(t)((D1Z)(t))+2α(S2)2M2\displaystyle(S-2)\cdot\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)-(S-2)\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)+2\alpha(S-2)^{2}M^{2} (75)
\displaystyle\leq (S2)μ+2α(S2)2M2,\displaystyle(S-2)\mu+2\alpha(S-2)^{2}M^{2}, (76)

where (75) uses the fact that Z(t)M\|Z^{(t)}\|_{\infty}\leq M and so (D1Z)(t)2M\|\left(\operatorname{D}_{1}{Z}\right)^{(t)}\|_{\infty}\leq 2M for all t[T]t\in[T], and the final inequality (76) follows from assumption 2 of the lemma statement.
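The first step above rests on a window-counting identity: summing a quantity over all length-(S-2) windows counts the index t exactly min{S-2, t, T-t-1} times. A minimal pure-Python check of this identity (our own illustration, with arbitrary T, S, and weights):

```python
import random

random.seed(3)
T, S = 40, 9
f = {t: random.random() for t in range(1, T - 1)}  # arbitrary weights f(1), ..., f(T-2)

# Sum over windows t0, t0+1, ..., t0+S-3, as in the definition of mu_{t0} in (74)
lhs = sum(f[t0 + s] for t0 in range(1, T - S + 2) for s in range(S - 2))

# Each index t is counted min{S-2, t, T-t-1} times
rhs = sum(f[t] * min(S - 2, t, T - t - 1) for t in range(1, T - 1))

assert abs(lhs - rhs) < 1e-9
print("window-counting identity verified")
```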

By (74) and Lemma C.1 with P=P(t0)P=P^{(t_{0})}, we have

s=0S3VarP(t0)((D2Z)(t0+s))\displaystyle\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)\leq (1+2ζS)s=0S3VarP(t0+s)((D2Z)(t0+s))\displaystyle(1+2\zeta S)\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)
=\displaystyle= (1+2ζS)αs=0S3VarP(t0+s)((D1Z)(t0+s))+(1+2ζS)μt0\displaystyle(1+2\zeta S)\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+(1+2\zeta S)\mu_{t_{0}}
\displaystyle\leq (1+2ζS)2αs=0S3VarP(t0)((D1Z)(t0+s))+(1+2ζS)μt0.\displaystyle(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+(1+2\zeta S)\mu_{t_{0}}. (77)

Here we have used that for 0sS0\leq s\leq S, it holds that max{P(t0+s)P(t0),P(t0)P(t0+s)}(1+ζ)S1+2ζS\max\left\{\left\|\frac{P^{(t_{0}+s)}}{P^{(t_{0})}}\right\|_{\infty},\left\|\frac{P^{(t_{0})}}{P^{(t_{0}+s)}}\right\|_{\infty}\right\}\leq(1+\zeta)^{S}\leq 1+2\zeta S since ζS1/2\zeta S\leq 1/2.

For any integer 1t0TS+11\leq t_{0}\leq T-S+1, we define the sequence Zt0(s):=Z(t0+s)P(t0),Z(t0+s)𝟏Z_{t_{0}}^{(s)}:=Z^{(t_{0}+s)}-\langle{P^{(t_{0})}},Z^{(t_{0}+s)}\rangle\mathbf{1}, for 0sS10\leq s\leq S-1. Thus Zt0(s),P(t0)=0\langle Z_{t_{0}}^{(s)},P^{(t_{0})}\rangle=0 for 0sS10\leq s\leq S-1, which implies that for all h0h\geq 0, 0sS10\leq s\leq S-1, (DhZt0)(s),P(t0)=0\langle\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)},P^{(t_{0})}\rangle=0, and thus

VarP(t0)((DhZt0)(s))=j=1nP(t0)(j)(DhZt0)(s)(j)2.\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)}}\right)=\sum_{j=1}^{n}P^{(t_{0})}(j)\cdot\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)}(j)^{2}. (78)

By the definition of the sequence Zt0Z_{t_{0}}, for 0sSh10\leq s\leq S-h-1, we have

VarP(t0)((DhZ)(t0+s))=VarP(t0)((DhZt0)(s))=VarP(t0)((DhZt0)(s)).\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{h}{Z}\right)^{(t_{0}+s)}}\right)=\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{h}{Z_{t_{0}}}\right)^{(s)}}\right)=\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)}}\right). (79)

For 1t0TS+11\leq t_{0}\leq T-S+1, let us now define

νt0,j:=s=0S1(D2Zt0)(s)(j)2(1+2ζS)2αs=0S1(D1Zt0)(s)(j)2,\displaystyle\nu_{t_{0},j}:=\sum_{s=0}^{S-1}\left(\operatorname{D}^{\circ}_{2}{Z_{t_{0}}}\right)^{(s)}(j)^{2}-(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}(j)^{2}, (80)

so that, by (77), (78), and (79),

j=1nP(t0)(j)νt0,j\displaystyle\sum_{j=1}^{n}P^{(t_{0})}(j)\cdot\nu_{t_{0},j}
=\displaystyle= s=0S1VarP(t0)((D2Zt0)(s))(1+2ζS)2αs=0S1VarP(t0)((D1Zt0)(s))\displaystyle\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z_{t_{0}}}\right)^{(s)}}\right)-(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}}\right)
\displaystyle\leq (s=0S3VarP(t0)((D2Z)(t0+s)))+VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1))\displaystyle\left(\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right)
(1+2ζS)2αs=0S3VarP(t0)((D1Z)(t0+s))\displaystyle-(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)
\displaystyle\leq (1+2ζS)μt0+VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1)).\displaystyle(1+2\zeta S)\mu_{t_{0}}+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right). (81)

By (80) and Lemma 4.8 applied to the sequence Zt0(0),,Zt0(S1)Z_{t_{0}}^{(0)},\ldots,Z_{t_{0}}^{(S-1)}, it holds that, for each j[n]j\in[n],

s=0S1(D1Zt0)(s)(j)2(1+2ζS)2αs=0S1Zt0(s)(j)2+νt0,j(1+2ζS)2α.\sum_{s=0}^{S-1}\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}(j)^{2}\leq(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}Z_{t_{0}}^{(s)}(j)^{2}+\frac{\nu_{t_{0},j}}{(1+2\zeta S)^{2}\alpha}. (82)

Then we have:

s=0S2VarP(t0)((D1Z)(t0+s))\displaystyle\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)
=\displaystyle= s=0S2VarP(t0)((D1Zt0)(s))\displaystyle\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}}\right) (83)
\displaystyle\leq (1+2ζS)2αs=0S1VarP(t0)(Zt0(s))+j=1nP(t0)(j)νt0,j(1+2ζS)2α\displaystyle(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({Z_{t_{0}}^{(s)}}\right)+\sum_{j=1}^{n}P^{(t_{0})}(j)\cdot\frac{\nu_{t_{0},j}}{(1+2\zeta S)^{2}\alpha} (84)
\displaystyle\leq (1+2ζS)2αs=0S1VarP(t0)(Zt0(s))+μt0(1+2ζS)α+VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1))(1+2ζS)2α,\displaystyle(1+2\zeta S)^{2}\alpha\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({Z_{t_{0}}^{(s)}}\right)+\frac{\mu_{t_{0}}}{(1+2\zeta S)\alpha}+\frac{\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right)}{(1+2\zeta S)^{2}\alpha}, (85)

where (83) follows from (79), (84) follows from (82) and (78), and (85) follows from (81). Summing the above for 1t0TS+11\leq t_{0}\leq T-S+1, we obtain

(S1)t=1T1VarP(t)((D1Z)(t))\displaystyle(S-1)\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)
\displaystyle\leq t0=1TS+1s=0S2VarP(t0+s)((D1Z)(t0+s))+8(S1)2M2\displaystyle\sum_{t_{0}=1}^{T-S+1}\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+8(S-1)^{2}M^{2} (86)
\displaystyle\leq t0=1TS+1(1+2ζS)s=0S2VarP(t0)((D1Z)(t0+s))+8(S1)2M2\displaystyle\sum_{t_{0}=1}^{T-S+1}(1+2\zeta S)\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+8(S-1)^{2}M^{2} (87)
\displaystyle\leq (1+2ζS)3αt0=1TS+1s=0S1VarP(t0)(Zt0(s))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{3}\alpha\sum_{t_{0}=1}^{T-S+1}\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({Z_{t_{0}}^{(s)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2}
+t0=1TS+1VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1))(1+2ζS)α\displaystyle+\sum_{t_{0}=1}^{T-S+1}\frac{\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right)}{(1+2\zeta S)\alpha} (88)
\displaystyle\leq (1+2ζS)4αt0=1TS+1s=0S1VarP(t0+s)(Z(t0+s))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{4}\alpha\sum_{t_{0}=1}^{T-S+1}\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({Z^{(t_{0}+s)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2}
+4(1+2ζS)αt0=1TS+1[VarP(t0)(Z(t0+S2))+3VarP(t0)(Z(t0+S1))\displaystyle+\frac{4}{(1+2\zeta S)\alpha}\sum_{t_{0}=1}^{T-S+1}\left[\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0}+S-2)}}\right)+3\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0}+S-1)}}\right)\right.
+3VarP(t0)(Z(t0))+VarP(t0)(Z(t0+1))]\displaystyle\left.+3\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0})}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0}+1)}}\right)\right] (89)
\displaystyle\leq (1+2ζS)4αSt=1TVarP(t)(Z(t))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{4}\alpha S\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2}
+32(1+2ζS)αt=1TVarP(t)(Z(t))\displaystyle+\frac{32}{(1+2\zeta S)\alpha}\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right) (90)
\displaystyle\leq (1+2ζS)4αS(1+32α2S)t=1TVarP(t)(Z(t))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{4}\alpha S\cdot\left(1+\frac{32}{\alpha^{2}S}\right)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2} (91)
\displaystyle\leq (1+2ζS)4αS(1+32α2S)t=1TVarP(t)(Z(t))+(S2)μα+10(S1)2M2,\displaystyle(1+2\zeta S)^{4}\alpha S\cdot\left(1+\frac{32}{\alpha^{2}S}\right)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{(S-2)\mu}{\alpha}+10(S-1)^{2}M^{2}, (92)

where:

  • (86) follows since Z(t)M\|Z^{(t)}\|_{\infty}\leq M and thus (D1Z)(t)2M\|\left(\operatorname{D}_{1}{Z}\right)^{(t)}\|_{\infty}\leq 2M for all tt;

  • (87) follows from Lemma C.1 and the fact that for 0sS20\leq s\leq S-2, max{P(t0+s)P(t0),P(t0)P(t0+s)}1+2ζS\max\left\{\left\|\frac{P^{(t_{0}+s)}}{P^{(t_{0})}}\right\|_{\infty},\left\|\frac{P^{(t_{0})}}{P^{(t_{0}+s)}}\right\|_{\infty}\right\}\leq 1+2\zeta S, as established above using the fact that the distributions P(t)P^{(t)} are ζ\zeta-consecutively close;

  • (88) follows from (85);

  • The first term in (89) is bounded using Lemma C.1 and the fact that the distributions P(t)P^{(t)} are ζ\zeta-consecutively close, and the final term in (89) is bounded using the fact that for any vectors Z1,,ZknZ_{1},\ldots,Z_{k}\in\mathbb{R}^{n} and any PΔnP\in\Delta^{n}, we have VarP(Z1++Zk)k(VarP(Z1)++VarP(Zk))\operatorname{Var}_{{P}}\left({Z_{1}+\cdots+Z_{k}}\right)\leq k\cdot\left(\operatorname{Var}_{{P}}\left({Z_{1}}\right)+\cdots+\operatorname{Var}_{{P}}\left({Z_{k}}\right)\right);

  • (90) and (91) follow by rearranging terms;

  • (92) follows from (76).

Now choose S=128α3S=\left\lceil\frac{128}{\alpha^{3}}\right\rceil, so that 32α2Sα4\frac{32}{\alpha^{2}S}\leq\frac{{\alpha}}{4}. Therefore, as long as 2ζSα322\zeta S\leq\frac{\alpha}{32}, we have, since α1/2\alpha\leq 1/2, that

(1+2ζS)4αSS1(1+32α2S)α(1+α/4)3α(1+α).(1+2\zeta S)^{4}\alpha\cdot\frac{S}{S-1}\cdot\left(1+\frac{32}{\alpha^{2}S}\right)\leq\alpha\cdot(1+\alpha/4)^{3}\leq\alpha\cdot(1+\alpha).

Then it follows from (92) that

t=1T1VarP(t)((D1Z)(t))\displaystyle\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\leq α(1+α)t=1TVarP(t)(Z(t))+μα+10SM2.\alpha(1+\alpha)\cdot\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{\mu}{\alpha}+10SM^{2}. (93)

Using that S129α3S\leq\frac{129}{\alpha^{3}}, the inequality 2ζSα/322\zeta S\leq\alpha/32 can be satisfied by ensuring that ζα48256=α412964α64S\zeta\leq\frac{\alpha^{4}}{8256}=\frac{\alpha^{4}}{129\cdot 64}\leq\frac{\alpha}{64S}. Note that our choice of SS ensures that ζS1/2\zeta S\leq 1/2, as was assumed earlier. Moreover, we have 10SM21290M2α310SM^{2}\leq\frac{1290M^{2}}{\alpha^{3}}. Thus, (93) gives the desired result. ∎
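The parameter arithmetic in the final paragraph can be checked mechanically. The following sketch (our own) verifies, over a grid of values alpha below 1/2, that the choice S = ceil(128/alpha^3) gives 32/(alpha^2 * S) <= alpha/4, and that (1 + alpha/4)^3 <= 1 + alpha:

```python
import math

for k in range(1, 200):
    alpha = k / 400.0                      # grid of alpha values in (0, 1/2)
    S = math.ceil(128 / alpha ** 3)
    # The choice of S right after (92):
    assert 32 / (alpha ** 2 * S) <= alpha / 4 + 1e-12
    # The elementary bound used to absorb the (1 + alpha/4) factors:
    assert (1 + alpha / 4) ** 3 <= 1 + alpha
print("parameter choices for S are consistent")
```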

C.3 Completing the proof of Theorem 3.1

Using the lemmas developed in the previous sections, we can now complete the proof of Theorem 3.1. We begin by proving Lemma 4.2, which is restated formally below.

Lemma 4.2 (Detailed).

There are constants C,C>1C,C^{\prime}>1 so that the following holds. Suppose a time horizon T4T\geq 4 is given, we set H:=logTH:=\lceil\log T\rceil, and all players play according to Optimistic Hedge with step size η\eta satisfying 1/Tη1CmH41/T\leq\eta\leq\frac{1}{C\cdot mH^{4}}. Then for any i[m]i\in[m], the losses i(1),,i(T)[0,1]ni\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in[0,1]^{n_{i}} for player ii satisfy:

t=1TVarxi(t)(i(t)i(t1))12t=1TVarxi(t)(i(t1))+CH5.\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)\leq\frac{1}{2}\cdot\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)+C^{\prime}H^{5}. (94)

We state a generic version of this lemma that can be applied in more general settings.

Lemma C.4.

For any integers n2n\geq 2 and T4T\geq 4, we set H:=logTH:=\lceil\log T\rceil, α=1/(4H)\alpha=1/(4H), and α0=α/8H3\alpha_{0}=\frac{\sqrt{\alpha/8}}{H^{3}}. Suppose that Z(1),,Z(T)[0,1]nZ^{(1)},\ldots,Z^{(T)}\in[0,1]^{n} and P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} satisfy the following

  1. 1.

    For each 0hH0\leq h\leq H and 1tTh1\leq t\leq T-h, it holds that (DhZ)(t)H(α0H3)h\left\|\left(\operatorname{D}_{h}{Z}\right)^{(t)}\right\|_{\infty}\leq H\cdot\left(\alpha_{0}H^{3}\right)^{h}

  2. 2.

    The sequence P(1),,P(T)P^{(1)},\ldots,P^{(T)} is ζ\zeta-consecutively close for some ζ[1/(2T),α4/8256]\zeta\in[1/(2T),\alpha^{4}/8256].

Then,

t=1TVarP(t)(Z(t)Z(t1))\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}-Z^{(t-1)}}\right)\leq 2αt=1TVarP(t)(Z(t1))+165120(1+ζ)H5+2\displaystyle 2\alpha\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+165120(1+\zeta)H^{5}+2 (95)

First, we demonstrate how Lemma C.4 implies Lemma 4.2.

Proof of Lemma 4.2.

We hope to apply Lemma C.4 with n=nin=n_{i}, P(t)=xi(t)P^{(t)}=x_{i}^{(t)} and Z(t)=i(t)Z^{(t)}=\ell_{i}^{(t)}, as well as ζ=7η\zeta=7\eta. To do so, we must verify that the preconditions of Lemma C.4 hold when our sequences arise from the dynamics of players playing Optimistic Hedge with step size η\eta satisfying 1/Tη1CmH41/T\leq\eta\leq\frac{1}{C\cdot mH^{4}}.

Set C1=8256C_{1}=8256 (note that C1C_{1} is the constant appearing in item 2 of Lemma C.4 and item 1 of Lemma 4.7 in Section C.2). Our assumption that η1CmH4\eta\leq\frac{1}{C\cdot mH^{4}} implies that, as long as the constant CC satisfies C447C1=14794752C\geq 4^{4}\cdot 7\cdot C_{1}=14794752,

ηmin{α47C1,α036e5m}.\eta\leq\min\left\{\frac{\alpha^{4}}{7C_{1}},\frac{\alpha_{0}}{36e^{5}m}\right\}. (96)

To verify precondition 1, we apply Lemma 4.4 with the parameter α\alpha in the lemma set to α0\alpha_{0}: a valid selection as α0<1H+3\alpha_{0}<\frac{1}{H+3}. We conclude that, for each i[m]i\in[m], 0hH0\leq h\leq H and 1tTh1\leq t\leq T-h, it holds that (Dhi)(t)H(α0H3)h\left\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\right\|_{\infty}\leq H\cdot\left(\alpha_{0}H^{3}\right)^{h} since ηα036e5m\eta\leq\frac{\alpha_{0}}{36e^{5}m} as required by the lemma. To verify precondition 2, we first confirm that our selection of ζ=7η\zeta=7\eta places it in the desired interval [1/(2T),α4/C1][1/(2T),\alpha^{4}/C_{1}] as ηα47C1\eta\leq\frac{\alpha^{4}}{7C_{1}}. By the definition of the Optimistic Hedge updates, for all i[m]i\in[m] and 1tT1\leq t\leq T, we have max{xi(t)xi(t+1),xi(t+1)xi(t)}exp(6η)\max\left\{\left\|\frac{x_{i}^{(t)}}{x_{i}^{(t+1)}}\right\|_{\infty},\left\|\frac{x_{i}^{(t+1)}}{x_{i}^{(t)}}\right\|_{\infty}\right\}\leq\exp(6\eta). Thus, the sequence xi(1),,xi(T)x_{i}^{(1)},\ldots,x_{i}^{(T)} is (7η)(7\eta)-consecutively close (since exp(6η)1+7η\exp(6\eta)\leq 1+7\eta for η\eta satisfying (96)). Therefore, Lemma C.4 applies and we have

t=1TVarxi(t)(i(t)i(t1))\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right) 2αt=1TVarxi(t)(i(t1))+165120(1+7η)H5+2\displaystyle\leq 2\alpha\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)+165120(1+7\eta)H^{5}+2
12t=1TVarxi(t)(i(t1))+CH5\displaystyle\leq\frac{1}{2}\cdot\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)+C^{\prime}H^{5}

for C=2+165120(1+7/8256)=165262C^{\prime}=2+165120(1+7/8256)=165262, as desired. ∎
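The multiplicative closeness of consecutive iterates invoked above can also be observed in simulation. The sketch below is our own illustration and assumes one standard way of writing the Optimistic Hedge update, in which the weight of each action is multiplied by exp(-eta * (2*loss_t - loss_prev)) before renormalizing; for losses in [0, 1] the consecutive iterates then stay within a multiplicative factor exp(6*eta), as used in the proof:

```python
import math, random

def optimistic_hedge_step(x, loss_t, loss_prev, eta):
    # Reweight by exp(-eta * (2*loss_t - loss_prev)) and renormalize (one standard
    # form of the recency-biased multiplicative-weights update).
    w = [xa * math.exp(-eta * (2 * lt - lp))
         for xa, lt, lp in zip(x, loss_t, loss_prev)]
    total = sum(w)
    return [wa / total for wa in w]

random.seed(4)
n, eta, T = 5, 0.05, 50
x = [1.0 / n] * n
loss_prev = [0.0] * n
for _ in range(T):
    loss_t = [random.random() for _ in range(n)]   # arbitrary losses in [0, 1]
    x_new = optimistic_hedge_step(x, loss_t, loss_prev, eta)
    ratio = max(max(a / b, b / a) for a, b in zip(x, x_new))
    assert ratio <= math.exp(6 * eta) + 1e-12      # consecutive closeness
    x, loss_prev = x_new, loss_t
print("consecutive iterates stay within a factor exp(6*eta)")
```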

So, it suffices to prove Lemma C.4.

Proof of Lemma C.4.

Set μ=H(α0H3)H\mu=H\cdot\left(\alpha_{0}H^{3}\right)^{H}. Since VarP(W)W2\operatorname{Var}_{{P}}\left({W}\right)\leq\|W\|_{\infty}^{2} for any PΔnP\in\Delta^{n} and WnW\in\mathbb{R}^{n}, item 1 gives

t=1THVarP(t)((DHZ)(t))αt=1TH+1VarP(t)((DH1Z)(t))+μ2T.\displaystyle\sum_{t=1}^{T-H}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{H}{Z}\right)^{(t)}}\right)\leq\alpha\cdot\sum_{t=1}^{T-H+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{H-1}{Z}\right)^{(t)}}\right)+\mu^{2}T. (97)

We will now prove, via reverse induction on hh, that for all hh satisfying H1h0H-1\geq h\geq 0,

t=1Th1VarP(t)((Dh+1Z)(t))α(1+2α)Hh1t=1ThVarP(t)((DhZ)(t))+2C0H2(2α0H3)2hα3,\sum_{t=1}^{T-h-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h+1}{Z}\right)^{(t)}}\right)\leq\alpha\cdot(1+2\alpha)^{H-h-1}\cdot\sum_{t=1}^{T-h}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h}{Z}\right)^{(t)}}\right)+\frac{2C_{0}\cdot H^{2}\cdot\left({2}\alpha_{0}H^{3}\right)^{2h}}{\alpha^{3}}, (98)

with C0=1290C_{0}=1290 (note that C0C_{0} is the constant appearing in the inequality (73) of the statement of Lemma 4.7 in Section C.2). The base case h=H1h=H-1 is verified by (97) and the fact that 22(H1)2HT2^{2(H-1)}\geq 2^{H}\geq T. Now suppose that (98) holds for some hh satisfying H1h1H-1\geq h\geq 1. We will now apply Lemma 4.7, with P(t)=P(t)P^{(t)}=P^{(t)} and Z(t)=(Dh1Z)(t)Z^{(t)}=\left(\operatorname{D}_{h-1}{Z}\right)^{(t)} for 1tTh+11\leq t\leq T-h+1, as well as M=H(2α0H3)h1M=H\cdot\left(2\alpha_{0}H^{3}\right)^{h-1}, ζ=ζ\zeta=\zeta, μ=2C0H2(2α0H3)2hα3\mu=\frac{2C_{0}\cdot H^{2}\cdot\left(2\alpha_{0}H^{3}\right)^{2h}}{\alpha^{3}}, and the parameter α\alpha of Lemma 4.7 set to α(1+2α)Hh1\alpha\cdot(1+2\alpha)^{H-h-1}. We verify that precondition 1 holds due to precondition 2 of Lemma C.4 and the fact that αα(1+2α)Hh1\alpha\leq\alpha\cdot(1+2\alpha)^{H-h-1}. Moreover, precondition 2 holds by our inductive hypothesis (98) and our choice of μ\mu. Therefore, by Lemma 4.7 and the fact that 1+α(1+2α)H1+2α1+\alpha\cdot(1+2\alpha)^{H}\leq 1+2\alpha for our choice of α=1/(4H)\alpha=1/(4H), it follows that

t=1ThVarP(t)((DhZ)(t))\displaystyle\sum_{t=1}^{T-h}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h}{Z}\right)^{(t)}}\right)\leq α(1+2α)Hht=1Th+1VarP(t)((Dh1Z)(t))+2C0H2(2α0H3)2hα4\displaystyle\alpha\cdot(1+2\alpha)^{H-h}\cdot\sum_{t=1}^{T-h+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h-1}{Z}\right)^{(t)}}\right)+\frac{2C_{0}\cdot H^{2}\cdot\left(2\alpha_{0}H^{3}\right)^{2h}}{\alpha^{4}}
+C0H2(2α0H3)2(h1)α3\displaystyle+\frac{C_{0}\cdot H^{2}\cdot\left(2\alpha_{0}H^{3}\right)^{2(h-1)}}{\alpha^{3}}
\displaystyle\leq α(1+2α)Hht=1Th+1VarP(t)((Dh1Z)(t))\displaystyle\alpha\cdot(1+2\alpha)^{H-h}\cdot\sum_{t=1}^{T-h+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h-1}{Z}\right)^{(t)}}\right)
+C0H2(2α0H3)2(h1)α3(1+2(2α0H3)2α)\displaystyle+\frac{C_{0}\cdot H^{2}\cdot(2\alpha_{0}H^{3})^{2(h-1)}}{\alpha^{3}}\cdot\left(1+\frac{2(2\alpha_{0}H^{3})^{2}}{\alpha}\right)
\displaystyle\leq α(1+2α)Hht=1Th+1VarP(t)((Dh1Z)(t))+2C0H2(2α0H3)2(h1)α3,\displaystyle\alpha\cdot(1+2\alpha)^{H-h}\cdot\sum_{t=1}^{T-h+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h-1}{Z}\right)^{(t)}}\right)+\frac{2C_{0}\cdot H^{2}\cdot(2\alpha_{0}H^{3})^{2(h-1)}}{\alpha^{3}},

where the final inequality follows since α0\alpha_{0} is chosen so that 2(2α0H3)2=α2(2\alpha_{0}H^{3})^{2}=\alpha. This completes the proof of the inductive step. Thus (98) holds for h=0h=0. Using again that the sequence P(t)P^{(t)} is ζ\zeta-consecutively close, we see that

t=1TVarP(t)(Z(t)Z(t1))\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}-Z^{(t-1)}}\right)
\displaystyle\leq 1+t=2TVarP(t)(Z(t)Z(t1))\displaystyle 1+\sum_{t=2}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}-Z^{(t-1)}}\right)
\displaystyle\leq 1+(1+ζ)t=1T1VarP(t)((D1Z)(t))\displaystyle 1+(1+\zeta)\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right) (99)
\displaystyle\leq 1+(1+ζ)(α(1+2α)H1t=1TVarP(t)(Z(t))+2C0H2α3)\displaystyle 1+(1+\zeta)\cdot\left(\alpha(1+2\alpha)^{H-1}\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{2C_{0}H^{2}}{\alpha^{3}}\right) (100)
\displaystyle\leq 2+(1+ζ)(α(1+2α)H1(1+ζ)t=2TVarP(t)(Z(t1))+2C0H2α3)\displaystyle 2+(1+\zeta)\cdot\left(\alpha(1+2\alpha)^{H-1}(1+\zeta)\sum_{t=2}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+\frac{2C_{0}H^{2}}{\alpha^{3}}\right) (101)
\displaystyle\leq α(1+2α)Ht=1TVarP(t)(Z(t1))+2(1+ζ)C0H2α3+2\displaystyle\alpha(1+2\alpha)^{H}\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+\frac{2(1+\zeta)C_{0}H^{2}}{\alpha^{3}}+2
\displaystyle\leq 2αt=1TVarP(t)(Z(t1))+2(1+ζ)C0H2α3+2,\displaystyle 2\alpha\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+\frac{2(1+\zeta)C_{0}H^{2}}{\alpha^{3}}+2, (102)

where (99) and (101) follow from Lemma C.1, and (100) uses (98) for h=0h=0. Now, (102) verifies the statement of the lemma as C0=1290C_{0}=1290 and α=14H\alpha=\frac{1}{4H}. ∎
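Two elementary numeric facts recur in the induction above: for alpha = 1/(4H), one has (1+2*alpha)^H <= 2, and hence 1 + alpha*(1+2*alpha)^H <= 1 + 2*alpha. A quick check over a range of H (our own sketch):

```python
for H in range(1, 100):
    alpha = 1.0 / (4 * H)
    # (1 + 1/(2H))^H increases toward sqrt(e) < 2 as H grows
    assert (1 + 2 * alpha) ** H <= 2
    assert 1 + alpha * (1 + 2 * alpha) ** H <= 1 + 2 * alpha
print("geometric-factor bounds hold for H = 1, ..., 99")
```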

We are finally ready to prove Theorem 3.1. For convenience the theorem is restated below.

Theorem 3.1 (Restated).

There are constants C,C>1C,C^{\prime}>1 so that the following holds. Suppose a time horizon TT\in\mathbb{N} is given. Suppose all players play according to Optimistic Hedge with any positive step size η1Cmlog4T\eta\leq\frac{1}{C\cdot m\log^{4}T}. Then for any i[m]i\in[m], the regret of player ii satisfies

Regi,Tlogniη+ClogT.\displaystyle\operatorname{Reg}_{{i},{T}}\leq\frac{\log n_{i}}{\eta}+C^{\prime}\cdot\log T. (103)

In particular, if the players’ step size is chosen as η=1Cmlog4T\eta=\frac{1}{C\cdot m\log^{4}T}, then the regret of player ii satisfies

Regi,TO(mlognilog4T).\displaystyle\operatorname{Reg}_{{i},{T}}\leq O\left(m\cdot\log n_{i}\cdot\log^{4}T\right). (104)
Proof.

The conclusion of the theorem is immediate if T<4T<4, so we may assume from here on that T4T\geq 4. Moreover, the conclusion of (103) is immediate if η1/T\eta\leq 1/T (as Regi,TT\operatorname{Reg}_{{i},{T}}\leq T necessarily), so we may also assume that η1/T\eta\geq 1/T. Let C′′C^{\prime\prime} be the constant CC of Lemma 4.1, let BB be the constant called CC in Lemma 4.2, and let BB^{\prime} be the constant called CC^{\prime} in Lemma 4.2. As long as the constant CC of Theorem 3.1 is chosen so that CBC\geq B and so that η1Cmlog4T\eta\leq\frac{1}{C\cdot m\log^{4}T} implies C′′η1/6C^{\prime\prime}\eta\leq 1/6, we have the following:

Regi,T\displaystyle\operatorname{Reg}_{{i},{T}}\leq logniη+t=1T(η2+C′′η2)Varxi(t)(i(t)i(t1))t=1T(1C′′η)η2Varxi(t)(i(t1))\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left(\frac{\eta}{2}+C^{\prime\prime}\eta^{2}\right)\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\sum_{t=1}^{T}\frac{(1-C^{\prime\prime}\eta)\eta}{2}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right) (105)
\displaystyle\leq logniη+2η3t=1TVarxi(t)(i(t)i(t1))η3t=1TVarxi(t)(i(t1))\displaystyle\frac{\log n_{i}}{\eta}+\frac{2\eta}{3}\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\frac{\eta}{3}\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)
\displaystyle\leq logniη+2η3(B(2logT)5)\displaystyle\frac{\log n_{i}}{\eta}+\frac{2\eta}{3}\cdot\left(B^{\prime}\cdot(2\log T)^{5}\right) (106)
\displaystyle\leq logniη+32BlogT,\displaystyle\frac{\log n_{i}}{\eta}+32B^{\prime}\cdot\log T, (107)

where (105) follows from Lemma 4.1, (106) follows from Lemma 4.2, and (107) follows from the upper bound $\eta \leq \frac{1}{C m \log^{4} T}$. We have thus established (103). The upper bound (104) follows immediately. ∎
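To make the object of analysis concrete, the following is a minimal sketch of the Optimistic Hedge update analyzed above: Hedge run on the cumulative losses plus one extra copy of the most recent loss (the recency bias), together with a naive regret computation. This is our own illustration, not code from the paper; all names are of our choosing, and losses are assumed to lie in $[0,1]^{n_i}$.

```python
import numpy as np

def optimistic_hedge_weights(losses, eta):
    """Return the Optimistic Hedge iterates x^(1), ..., x^(T+1).

    losses: array of shape (T, n); row t-1 holds the loss vector l^(t) in [0,1]^n.
    eta: positive step size (the theorem takes eta <= 1/(C * m * log^4 T)).

    Optimistic Hedge plays x^(t+1) proportional to
    exp(-eta * (sum_{s<=t} l^(s) + l^(t))), i.e. Hedge on the cumulative loss
    plus one extra copy of the most recent loss (the optimistic prediction).
    """
    T, n = losses.shape
    xs = [np.full(n, 1.0 / n)]  # x^(1) is the uniform distribution
    cumulative = np.zeros(n)
    for t in range(T):
        cumulative += losses[t]
        logits = -eta * (cumulative + losses[t])  # prediction m^(t+1) = l^(t)
        logits -= logits.max()                    # stabilize the softmax
        w = np.exp(logits)
        xs.append(w / w.sum())
    return xs

def regret(losses, xs):
    """External regret of the iterates xs against the best fixed action."""
    realized = sum(float(x @ l) for x, l in zip(xs, losses))
    best_fixed = float(losses.sum(axis=0).min())
    return realized - best_fixed
```

Under the theorem, when all $m$ players run this update with $\eta \leq \frac{1}{Cm\log^4 T}$, each player's regret is $O(m \log n_i \log^4 T)$; against an arbitrary loss sequence, only the generic $O(\log n_i / \eta + \eta T)$ guarantee applies.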

Appendix D Adversarial regret bounds

In this section we discuss how Optimistic Hedge can be modified to yield an algorithm that obtains the fast rates of Theorem 3.1 when played by all players, and which still obtains the optimal rate of $O(\sqrt{T})$ in the adversarial setting. Such guarantees are common in the literature [DDK11, RS13b, SALS15, KHSC18, HAM21]. The guarantees of this modification of Optimistic Hedge are stated in the following corollary (of Lemmas 4.1 and 4.2):

Corollary D.1.

There is an algorithm $\mathcal{A}$ which, if played by all $m$ players in a game, achieves the regret bound $\operatorname{Reg}_{i,T} \leq O(m \cdot \log n_{i} \cdot \log^{4} T)$ for each player $i$; moreover, when player $i$ is faced with an adversarial sequence of losses, the algorithm $\mathcal{A}$'s regret bound is $\operatorname{Reg}_{i,T} \leq O(m \log n_{i} \cdot \log^{4} T + \sqrt{T \log n_{i}})$.

Proof.

Let $C$ be the constant called $C$ in Theorem 3.1 and $C^{\prime}$ be the constant called $C^{\prime}$ in Lemma 4.2. The algorithm $\mathcal{A}$ of Corollary D.1 is obtained as follows:

  1. Initially run Optimistic Hedge, with the step size $\eta = \frac{1}{C m \log^{4} T}$.

  2. If, for some $T_{0} \geq 4$, (94) first fails to hold at time $T_{0}$, i.e.,

\[ \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)} - \ell_{i}^{(t-1)}\right) > \frac{1}{2} \cdot \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right) + C^{\prime} \lceil \log T \rceil^{5}, \tag{108} \]

  then set $\eta^{\prime} = \sqrt{\frac{\log n_{i}}{T}}$ and continue running Optimistic Hedge with step size $\eta^{\prime}$.

If there is no $T_{0} \geq 4$ so that (108) holds (and by Lemma 4.2, this will be the case when $\mathcal{A}$ is played by all $m$ players in a game), then the proof of Theorem 3.1 shows that the regret of each player $i$ is bounded as $\operatorname{Reg}_{i,T} \leq O(m \log n_{i} \cdot \log^{4} T)$. Otherwise, since $T_{0}$ is defined as the smallest integer at least 4 so that (108) holds, we have

\[ \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)} - \ell_{i}^{(t-1)}\right) \leq \frac{1}{2} \cdot \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right) + C^{\prime} \lceil \log T \rceil^{5} + 4, \]

and thus, by Lemma 4.1, for any $x^{\star} \in \Delta^{n_{i}}$,

\[ \sum_{t=1}^{T_{0}} \langle \ell_{i}^{(t)}, x_{i}^{(t)} - x^{\star} \rangle \leq \operatorname{Reg}_{i,T_{0}} \leq O(m \log n_{i} \cdot \log^{4} T_{0}). \tag{109} \]

Further, by the choice of step size $\eta^{\prime} = \sqrt{\frac{\log n_{i}}{T}}$ for time steps $t > T_{0}$, we have, for any $x^{\star} \in \Delta^{n_{i}}$,

\begin{align}
\sum_{t=T_{0}+1}^{T} \langle \ell_{i}^{(t)}, x_{i}^{(t)} - x^{\star} \rangle \leq{} & \frac{\log n_{i}}{\eta^{\prime}} + \eta^{\prime} \sum_{t=T_{0}+1}^{T} \|\ell_{i}^{(t)} - \ell_{i}^{(t-1)}\|_{\infty}^{2} \tag{110}\\
\leq{} & \frac{\log n_{i}}{\eta^{\prime}} + \eta^{\prime} T \leq O\left(\sqrt{T \log n_{i}}\right), \tag{111}
\end{align}

where (110) uses [SALS15, Proposition 7]. Adding (109) and (111) completes the proof of the corollary. ∎
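The two-phase algorithm $\mathcal{A}$ above can be sketched in code as follows. This is a hypothetical implementation, not code from the paper: the constants `C` and `C_prime` stand in for the unspecified constants of Theorem 3.1 and Lemma 4.2, `variance` computes $\operatorname{Var}_{x}(\cdot)$, and recomputing the iterate from the cumulative losses after the switch is one simple interpretation of "continue running Optimistic Hedge with step size $\eta'$".

```python
import numpy as np

def variance(x, v):
    """Var_x(v) = E_x[v^2] - (E_x[v])^2 for a vector v under distribution x."""
    mean = float(x @ v)
    return float(x @ (v * v)) - mean ** 2

def adaptive_optimistic_hedge(loss_stream, T, n, m, C=4.0, C_prime=4.0):
    """Sketch of the two-phase algorithm A of Corollary D.1.

    Runs Optimistic Hedge with the small "game" step size
    eta = 1 / (C * m * log^4 T); if condition (108) ever fails (evidence that
    the other players are not also running the algorithm), it falls back to
    the robust step size eta' = sqrt(log n / T).
    """
    eta = 1.0 / (C * m * np.log(T) ** 4)
    x = np.full(n, 1.0 / n)          # x^(1) is uniform
    cumulative = np.zeros(n)
    prev_loss = np.zeros(n)          # l^(0) := 0
    lhs = rhs = 0.0                  # running sums appearing in (108)
    switched = False
    plays = []
    for t, loss in enumerate(loss_stream, start=1):
        plays.append(x)
        # update the two variance sums of condition (108)
        lhs += variance(x, loss - prev_loss)
        rhs += variance(x, prev_loss)
        if (not switched and t >= 4
                and lhs > 0.5 * rhs + C_prime * np.ceil(np.log(T)) ** 5):
            switched = True
            eta = np.sqrt(np.log(n) / T)      # adversarial fallback step size
        cumulative += loss
        logits = -eta * (cumulative + loss)   # Optimistic Hedge update
        logits -= logits.max()
        w = np.exp(logits)
        x = w / w.sum()
        prev_loss = loss
    return plays, switched
```

When all $m$ players run this procedure, Lemma 4.2 guarantees the switch never triggers, so each player retains the polylogarithmic regret of Theorem 3.1; against an adversary the fallback step size recovers the $O(\sqrt{T \log n_i})$ rate.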