
Near-Optimal No-Regret Learning in General Games

Constantinos Daskalakis
MIT CSAIL
costis@csail.mit.edu
Supported by NSF Awards CCF-1901292, DMS-2022448 and DMS-2134108, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, by a DSTA grant, and by the DOE PhILMs project (No. DE-AC05-76RL01830).
   Maxwell Fishelson
MIT CSAIL
maxfish@mit.edu
   Noah Golowich
MIT CSAIL
nzg@mit.edu
Supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship.
Abstract

We show that Optimistic Hedge – a common variant of multiplicative-weights-updates with recency bias – attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O(T^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias [SALS15], and the $O(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games [CP20]. A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac{1}{T}\right)$.

1 Introduction

Online learning has a long history that is intimately related to the development of game theory, convex optimization, and machine learning. One of its earliest instantiations can be traced to Brown’s proposal [Bro49] of fictitious play as a method to solve two-player zero-sum games. Indeed, as shown by [Rob51], when the players of a (zero-sum) matrix game use fictitious play to iteratively update their actions in response to each other’s history of play, the resulting dynamics converge in the following sense: the product of the empirical distributions of strategies for each player converges to the set of Nash equilibria of the game, though the rate of convergence is now known to be exponentially slow [DP14]. Moreover, such convergence to Nash equilibria fails in non-zero-sum games [Sha64].

The slow convergence of fictitious play to Nash equilibria in zero-sum matrix games, and its non-convergence in general-sum games, can be mitigated by appealing to the pioneering works [Bla54, Han57] and the ensuing literature on no-regret learning [CBL06]. It is known that if both players of a zero-sum matrix game experience regret at most $\varepsilon(T)$, the product of the players’ empirical distributions of strategies is an $O(\varepsilon(T)/T)$-approximate Nash equilibrium. More generally, if each player of a general-sum, multi-player game experiences regret at most $\varepsilon(T)$, the empirical distribution of joint strategies converges to a coarse correlated equilibrium of the game at a rate of $O(\varepsilon(T)/T)$. (In general-sum games, it is typical to focus on proving convergence rates for weaker types of equilibrium than Nash, such as coarse correlated equilibria, since finding Nash equilibria is PPAD-complete [DGP06, CDT09].) Importantly, a multitude of online learning algorithms, such as the celebrated Hedge and Follow-The-Perturbed-Leader algorithms, guarantee adversarial regret $O(\sqrt{T})$ [CBL06]. Thus, when such algorithms are employed by all players in a game, their $O(\sqrt{T})$ regret implies convergence to coarse correlated equilibria (and Nash equilibria of matrix games) at a rate of $O(1/\sqrt{T})$.

While standard no-regret learners guarantee $O(\sqrt{T})$ regret for each player in a game, the players can do better by employing specialized no-regret learning procedures. Indeed, it was established by [DDK11] that there exists a somewhat complex no-regret learner based on Nesterov’s excessive gap technique [Nes05] which guarantees $O(\log T)$ regret to each player of a two-player zero-sum game. This represents an exponential improvement over the regret guaranteed by standard no-regret learners. More generally, [SALS15] established that if the players of a multi-player, general-sum game use any algorithm from the family of Optimistic Mirror Descent (MD) or Optimistic Follow-the-Regularized-Leader (FTRL) algorithms (which are analogues of the MD and FTRL algorithms, respectively, with recency bias), each player enjoys regret that is $O(T^{1/4})$. This was recently improved by [CP20] to $O(T^{1/6})$ in the special case of two-player games in which the players use Optimistic Hedge, a particularly simple representative of both the Optimistic MD and Optimistic FTRL families.

The above results for general-sum games represent significant improvements over the $O(\sqrt{T})$ regret attainable by standard no-regret learners, but are not as dramatic as the logarithmic regret that has been shown attainable by no-regret learners, albeit more complex ones, in 2-player zero-sum games (e.g., [DDK11]). Indeed, despite extensive work on no-regret learning, understanding the optimal regret that can be guaranteed by no-regret learning algorithms in general-sum games has remained elusive. This question is especially intriguing in light of experiments suggesting that polylogarithmic regret should be attainable [SALS15, HAM21]. In this paper we settle this question by showing that no-regret learners can guarantee polylogarithmic regret to each player in general-sum multi-player games. Moreover, this regret is attainable by a particularly simple algorithm – Optimistic Hedge:

Table 1: Overview of prior work on fast rates for learning in games. $m$ denotes the number of players, and $n$ denotes the number of actions per player (assumed to be the same for all players). For Optimistic Hedge, the adversarial regret bounds in the right-hand column are obtained via a choice of adaptive step-sizes. The $\tilde{O}(\cdot)$ notation hides factors that are polynomial in $\log T$.

Algorithm | Setting | Regret in games | Adversarial regret
Hedge (& many other algs.) | multi-player, general-sum | $O(\sqrt{T\log n})$ [CBL06] | $O(\sqrt{T\log n})$ [CBL06]
Excessive Gap Technique | 2-player, 0-sum | $O(\log n(\log T+\log^{3/2}n))$ [DDK11] | $O(\sqrt{T\log n})$ [DDK11]
DS-OptMD, OptDA | 2-player, 0-sum | $\log^{O(1)}(n)$ [HAM21] | $\sqrt{T\log^{O(1)}(n)}$ [HAM21]
Optimistic Hedge | multi-player, general-sum | $O(\log n\cdot\sqrt{m}\cdot T^{1/4})$ [RS13b, SALS15] | $\tilde{O}(\sqrt{T\log n})$ [RS13b, SALS15]
Optimistic Hedge | 2-player, general-sum | $O(\log^{5/6}n\cdot T^{1/6})$ [CP20] | $\tilde{O}(\sqrt{T\log n})$
Optimistic Hedge | multi-player, general-sum | $O(\log n\cdot m\cdot\log^{4}T)$ (Theorem 3.1) | $\tilde{O}(\sqrt{T\log n})$ (Corollary D.1)

Theorem 1.1 (Abbreviated version of Theorem 3.1).

Suppose that $m$ players play a general-sum multi-player game, with a finite set of $n$ strategies per player, over $T$ rounds. Suppose also that each player uses Optimistic Hedge to update her strategy in every round, as a function of the history of play so far. Then each player experiences $O(m\cdot\log n\cdot\log^{4}T)$ regret.

An immediate corollary of Theorem 1.1 is that the empirical distribution of play is an $O\left(\frac{m\log n\log^{4}T}{T}\right)$-approximate coarse correlated equilibrium (CCE) of the game. We remark that Theorem 1.1 bounds the total regret experienced by each player of the multi-player game, which is the most standard regret objective for no-regret learning in games and which is essential to achieve convergence to CCE. For the looser objective of the average of all players’ regrets, [RS13b] established an $O(\log n)$ bound for Optimistic Hedge in two-player zero-sum games, and [SALS15] generalized this bound to $O(m\log n)$ in $m$-player general-sum games. Note that since some players may experience negative regret [HAM21], the average of the players’ regrets cannot in general be used to bound the maximum regret experienced by any individual player. Finally, we remark that several results in the literature posit no-regret learning as a model of agents’ rational behavior; for instance, [Rou09, ST13, RST17] show that no-regret learners in smooth games enjoy strong Price-of-Anarchy bounds. By showing that each agent can obtain very small regret in games by playing Optimistic Hedge, Theorem 1.1 strengthens the plausibility of the common assumption made in this literature that each agent will choose to use such a no-regret algorithm.

1.1 Related work

Table 1 summarizes the prior works that aim to establish optimal regret bounds for no-regret learners in games. We remark that [CP20] shows that the regret of Hedge is $\Omega(\sqrt{T})$ even in 2-player games where each player has 2 actions, meaning that optimism is necessary to obtain fast rates. The table also includes a recent result of [HAM21] showing that when the players in a 2-player zero-sum game with $n$ actions per player use a variant of Optimistic Hedge with adaptive step size (a special case of their algorithms DS-OptMD and OptDA), each player has $\log^{O(1)}n$ regret. The techniques of [HAM21] differ substantially from ours: the result in [HAM21] is based on showing that the joint strategies $x^{(t)}$ rapidly converge, pointwise, to a Nash equilibrium $x^{\star}$. Such a result seems very unlikely to extend to our setting of general-sum games, since finding an approximate Nash equilibrium even in 2-player games is PPAD-complete [CDT09]. We also remark that the earlier work [KHSC18] shows that each player’s regret is at most $O(\log T\cdot\log n)$ when they use a certain algorithm based on Optimistic MD in 2-player zero-sum games; their technique is heavily tailored to 2-player zero-sum games, relying on the notion of duality in that setting.

[FLL+16] shows that one can obtain fast rates in games for a broader class of algorithms (e.g., including Hedge) if one adopts a relaxed (approximate) notion of optimality. [WL18] uses optimism to obtain adaptive regret bounds for bandit problems. Many recent papers (e.g., [DP19, GPD20, LGNPw21, HAM21, WLZL21, AIMM21]) have studied the last-iterate convergence of algorithms from the Optimistic Mirror Descent family, which includes Optimistic Hedge. Finally, a long line of papers (e.g., [HMcW+03, DFP+10, KLP11, BCM12, PP16, BP18, MPP18, BP19, CP19, VGFL+20]) has studied the dynamics of learning algorithms in games. Essentially none of these papers uses optimism, and many of them show non-convergence (e.g., divergence or recurrence) of the iterates of various learning algorithms, such as FTRL and Mirror Descent, when used in games.

2 Preliminaries

Notation.

For a positive integer $n$, let $[n]:=\{1,2,\ldots,n\}$. For a finite set $\mathcal{S}$, let $\Delta(\mathcal{S})$ denote the space of distributions on $\mathcal{S}$. For $\mathcal{S}=[n]$, we will write $\Delta^{n}:=\Delta(\mathcal{S})$ and interpret elements of $\Delta^{n}$ as vectors in $\mathbb{R}^{n}$. For a vector $v\in\mathbb{R}^{n}$ and $j\in[n]$, we denote the $j$th coordinate of $v$ as $v(j)$. For vectors $v,w\in\mathbb{R}^{n}$, write $\langle v,w\rangle=\sum_{j=1}^{n}v(j)w(j)$. The base-2 logarithm of $x>0$ is denoted $\log x$.

No-regret learning in games.

We consider a game $G$ with $m\in\mathbb{N}$ players, where player $i\in[m]$ has action space $\mathcal{A}_{i}$ with $n_{i}:=|\mathcal{A}_{i}|$ actions. We may assume that $\mathcal{A}_{i}=[n_{i}]$ for each player $i$. The joint action space is $\mathcal{A}:=\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{m}$. The specification of the game $G$ is completed by a collection of loss functions $\mathcal{L}_{1},\ldots,\mathcal{L}_{m}:\mathcal{A}\rightarrow[0,1]$. For an action profile $a=(a_{1},\ldots,a_{m})\in\mathcal{A}$ and $i\in[m]$, $\mathcal{L}_{i}(a)$ is the loss player $i$ experiences when each player $i^{\prime}\in[m]$ plays $a_{i^{\prime}}$. A mixed strategy $x_{i}\in\Delta(\mathcal{A}_{i})$ for player $i$ is a distribution over $\mathcal{A}_{i}$, with the probability of playing action $j\in\mathcal{A}_{i}$ given by $x_{i}(j)$. Given a mixed strategy profile $x=(x_{1},\ldots,x_{m})$ (or an action profile $a=(a_{1},\ldots,a_{m})$) and a player $i\in[m]$, we let $x_{-i}$ (or $a_{-i}$, respectively) denote the profile after removing the $i$th mixed strategy $x_{i}$ (or the $i$th action $a_{i}$, respectively).

The $m$ players play the game $G$ for a total of $T$ rounds. At the beginning of each round $t\in[T]$, each player $i$ chooses a mixed strategy $x_{i}^{(t)}\in\Delta(\mathcal{A}_{i})$. The loss vector of player $i$, denoted $\ell_{i}^{(t)}\in[0,1]^{n_{i}}$, is defined as $\ell_{i}^{(t)}(j)=\mathbb{E}_{a_{-i}\sim x_{-i}^{(t)}}[\mathcal{L}_{i}(j,a_{-i})]$. As a matter of convention, set $\ell_{i}^{(0)}=\mathbf{0}$ to be the all-zeros vector. We consider the full-information setting in this paper, meaning that player $i$ observes its full loss vector $\ell_{i}^{(t)}$ for each round $t$. Finally, player $i$ experiences a loss of $\langle\ell_{i}^{(t)},x_{i}^{(t)}\rangle$. The goal of each player $i$ is to minimize its regret, defined as: $\operatorname{Reg}_{i,T}:=\sum_{t\in[T]}\langle x_{i}^{(t)},\ell_{i}^{(t)}\rangle-\min_{j\in[n_{i}]}\sum_{t\in[T]}\ell_{i}^{(t)}(j)$.
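For concreteness, this regret can be computed directly from a player's sequence of mixed strategies and loss vectors; a minimal sketch in Python (function name ours):

```python
import numpy as np

def regret(strategies, losses):
    """Regret of one player after T rounds.

    strategies: (T, n) array; row t is the mixed strategy x^{(t)}.
    losses:     (T, n) array; row t is the loss vector ell^{(t)}.
    Returns sum_t <x^{(t)}, ell^{(t)}> minus the cumulative loss
    of the best fixed action in hindsight.
    """
    incurred = float(np.sum(strategies * losses))      # sum of expected losses
    best_fixed = float(np.min(losses.sum(axis=0)))     # best single action
    return incurred - best_fixed
```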

Optimistic Hedge.

The Optimistic Hedge algorithm chooses mixed strategies for player $i\in[m]$ as follows: at time $t=1$, it sets $x_{i}^{(1)}=(1/n_{i},\ldots,1/n_{i})$ to be the uniform distribution on $\mathcal{A}_{i}$. Then for all $t<T$, player $i$’s strategy at iteration $t+1$ is defined as follows, for $j\in[n_{i}]$:

$$x_{i}^{(t+1)}(j):=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot(2\ell_{i}^{(t)}(j)-\ell_{i}^{(t-1)}(j)))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot(2\ell_{i}^{(t)}(k)-\ell_{i}^{(t-1)}(k)))}. \tag{1}$$

Optimistic Hedge is a modification of Hedge, which performs the updates $x_{i}^{(t+1)}(j):=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(j))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(k))}$. The update (1) modifies the Hedge update by replacing the loss vector $\ell_{i}^{(t)}$ with a predictor of the following iteration’s loss vector, $\ell_{i}^{(t)}+(\ell_{i}^{(t)}-\ell_{i}^{(t-1)})$. Hedge corresponds to FTRL with a negative entropy regularizer (see, e.g., [Bub15]), whereas Optimistic Hedge corresponds to Optimistic FTRL with a negative entropy regularizer [RS13b, RS13a].
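Both updates are one line of code each; the sketch below (our own naming) makes the recency bias visible: when $\ell_{i}^{(t)}=\ell_{i}^{(t-1)}$, the predictor $2\ell_{i}^{(t)}-\ell_{i}^{(t-1)}$ reduces to $\ell_{i}^{(t)}$ and the two updates coincide.

```python
import numpy as np

def hedge_step(x, loss, eta):
    """Plain Hedge: reweight by exp(-eta * loss) and renormalize."""
    w = x * np.exp(-eta * loss)
    return w / w.sum()

def optimistic_hedge_step(x, loss, loss_prev, eta):
    """Optimistic Hedge update (1): reweight by the predicted next
    loss 2*loss - loss_prev instead of the current loss alone."""
    w = x * np.exp(-eta * (2.0 * loss - loss_prev))
    return w / w.sum()
```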

Distributions & divergences.

For distributions $P,Q$ on a finite domain $[n]$, the KL divergence between $P,Q$ is $\operatorname{KL}(P;Q)=\sum_{j=1}^{n}P(j)\cdot\log\left(\frac{P(j)}{Q(j)}\right)$. The chi-squared divergence between $P,Q$ is $\chi^{2}(P;Q)=\sum_{j=1}^{n}Q(j)\cdot\left(\frac{P(j)}{Q(j)}\right)^{2}-1=\sum_{j=1}^{n}\frac{(P(j)-Q(j))^{2}}{Q(j)}$. For a distribution $P$ on $[n]$ and a vector $v\in\mathbb{R}^{n}$, we write $\operatorname{Var}_{P}(v):=\sum_{j=1}^{n}P(j)\cdot\left(v(j)-\sum_{k=1}^{n}P(k)v(k)\right)^{2}$. Also define $\|v\|_{P}:=\sqrt{\sum_{j=1}^{n}P(j)\cdot v(j)^{2}}$. If further $P$ has full support, then define $\|v\|_{P}^{\star}=\sqrt{\sum_{j=1}^{n}\frac{v(j)^{2}}{P(j)}}$. The above notations will often be used when $P$ is the mixed strategy $x_{i}$ of some player $i$ and $v$ is a loss vector $\ell_{i}$; in such a case the norms $\|v\|_{P}$ and $\|v\|_{P}^{\star}$ are often called local norms.
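These quantities translate directly into code; a small reference implementation (names ours, base-2 logarithm per the notation above):

```python
import numpy as np

def kl(p, q):
    """KL divergence (base-2 log, per the paper's convention)."""
    return float(np.sum(p * np.log2(p / q)))

def chi_sq(p, q):
    """Chi-squared divergence between distributions p and q."""
    return float(np.sum((p - q) ** 2 / q))

def var(p, v):
    """Var_P(v): variance of the vector v under distribution p."""
    mean = float(np.dot(p, v))
    return float(np.dot(p, (v - mean) ** 2))

def local_norm(p, v):
    """The local norm ||v||_P."""
    return float(np.sqrt(np.dot(p, v ** 2)))

def dual_local_norm(p, v):
    """The dual local norm ||v||_P^*, defined when p has full support."""
    return float(np.sqrt(np.sum(v ** 2 / p)))
```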

3 Results

Below we state our main theorem, which shows that when all players in a game play according to Optimistic Hedge with appropriate step size, they all experience polylogarithmic individual regrets.

Theorem 3.1 (Formal version of Theorem 1.1).

There are constants $C,C^{\prime}>1$ so that the following holds. Suppose a time horizon $T\in\mathbb{N}$ and a game $G$ with $m$ players and $n_{i}$ actions for each player $i\in[m]$ are given. Suppose all players play according to Optimistic Hedge with any positive step size $\eta\leq\frac{1}{C\cdot m\log^{4}T}$. Then for any $i\in[m]$, the regret of player $i$ satisfies

$$\operatorname{Reg}_{i,T}\leq\frac{\log n_{i}}{\eta}+C^{\prime}\cdot\log T. \tag{2}$$

In particular, if the players’ step size is chosen as $\eta=\frac{1}{C\cdot m\log^{4}T}$, then the regret of player $i$ satisfies

$$\operatorname{Reg}_{i,T}\leq O\left(m\cdot\log n_{i}\cdot\log^{4}T\right). \tag{3}$$

A common goal in the literature on learning in games is to obtain an algorithm that achieves fast rates when played by all players, while each player $i$ still obtains the optimal rate of $O(\sqrt{T})$ in the adversarial setting (i.e., when $i$ receives an arbitrary sequence of losses $\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}$). We show in Corollary D.1 (in the appendix) that this is possible by running Optimistic Hedge with an adaptive step size. Table 1 compares the regret bounds discussed in this section to those of prior work.

4 Proof overview

In this section we overview the proof of Theorem 3.1; the full proof may be found in the appendix.

4.1 New adversarial regret bound

The first step in the proof of Theorem 3.1 is to prove a new regret bound (Lemma 4.1 below) for Optimistic Hedge that holds for an adversarial sequence of losses. We will show in later sections that when all players play according to Optimistic Hedge, the right-hand side of the regret bound (4) is bounded by a quantity that grows only poly-logarithmically in $T$.

Lemma 4.1.

There is a constant $C>0$ so that the following holds. Suppose any player $i\in[m]$ follows the Optimistic Hedge updates (1) with step size $\eta<1/C$, for an arbitrary sequence of losses $\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in[0,1]^{n_{i}}$. Then

$$\operatorname{Reg}_{i,T}\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left(\frac{\eta}{2}+C\eta^{2}\right)\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\right)-\sum_{t=1}^{T}\frac{(1-C\eta)\eta}{2}\cdot\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right). \tag{4}$$

The detailed proof of Lemma 4.1 can be found in Section A, but we sketch the main steps here. The starting point is a refinement of [RS13a, Lemma 3] (stated as Lemma A.5), which gives an upper bound for $\operatorname{Reg}_{i,T}$ in terms of local norms corresponding to each of the iterates $x_{i}^{(t)}$ of Optimistic Hedge. The bound involves the difference between the Optimistic Hedge iterates $x_{i}^{(t)}$ and iterates $\tilde{x}_{i}^{(t)}$ defined by $\tilde{x}_{i}^{(t)}(j)=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(j)-\ell_{i}^{(t-1)}(j)))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(k)-\ell_{i}^{(t-1)}(k)))}$:

$$\operatorname{Reg}_{i,T}\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left\|x_{i}^{(t)}-\tilde{x}_{i}^{(t)}\right\|_{x_{i}^{(t)}}^{\star}\sqrt{\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\right)}-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)}). \tag{5}$$

We next show (in Lemma A.2) that $\operatorname{KL}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})$ and $\operatorname{KL}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})$ may be lower bounded by $(1/2-O(\eta))\cdot\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})$ and $(1/2-O(\eta))\cdot\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})$, respectively. It is a standard fact that the KL divergence between two distributions is upper bounded by the chi-squared divergence between them; by contrast, Lemma A.2 exploits the fact that $x_{i}^{(t)}$, $\tilde{x}_{i}^{(t)}$, and $\tilde{x}_{i}^{(t-1)}$ are close to each other to show a reverse inequality. Finally, exploiting the exponential-weights-style functional relationship between $x_{i}^{(t)}$ and $\tilde{x}_{i}^{(t-1)}$, we show (in Lemma A.3) that the $\chi^{2}$-divergence $\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})$ may be lower bounded by $(1-O(\eta))\cdot\eta^{2}\cdot\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)$, leading to the term $\frac{(1-C\eta)\eta}{2}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)$ being subtracted in (4). The $\chi^{2}$-divergence $\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})$, as well as the term $\left\|x_{i}^{(t)}-\tilde{x}_{i}^{(t)}\right\|_{x_{i}^{(t)}}^{\star}$ in (5), are bounded in a similar manner to obtain (4).
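The reverse KL–chi-squared comparison behind Lemma A.2 is, at heart, a second-order phenomenon: for nearby distributions, $\operatorname{KL}(P;Q)\approx\frac{1}{2}\chi^{2}(P;Q)$ (in nats). A quick numerical illustration of this approximation (not the lemma itself, which tracks the $O(\eta)$ error precisely; the perturbation below is our own toy example):

```python
import numpy as np

def kl_nats(p, q):
    # Natural-log KL, under which the (1/2) * chi-squared approximation is stated.
    return float(np.sum(p * np.log(p / q)))

def chi_sq(p, q):
    return float(np.sum((p - q) ** 2 / q))

q = np.array([0.3, 0.5, 0.2])
p = q + 1e-3 * np.array([1.0, -1.0, 0.0])  # tiny perturbation, still a distribution

ratio = kl_nats(p, q) / chi_sq(p, q)       # close to 1/2 when p is near q
```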

4.2 Finite differences

Given Lemma 4.1, in order to establish Theorem 3.1, it suffices to show Lemma 4.2 below. Indeed, (6) below implies that the right-hand side of (4) is bounded above by $\frac{\log n_{i}}{\eta}+\eta\cdot O(\log^{5}T)$, which is in turn bounded above by $O(m\log n_{i}\log^{4}T)$ for the choice $\eta=\Theta\left(\frac{1}{m\cdot\log^{4}T}\right)$ of Theorem 3.1. (Notice that the factor $\frac{1}{2}$ in (6) is not important for this argument – any constant less than 1 would suffice.)

Lemma 4.2 (Abbreviated; detailed version in Section C.3).

Suppose all players play according to Optimistic Hedge with step size $\eta$ satisfying $1/T\leq\eta\leq\frac{1}{Cm\cdot\log^{4}T}$ for a sufficiently large constant $C$. Then for any $i\in[m]$, the losses $\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in\mathbb{R}^{n_{i}}$ for player $i$ satisfy:

$$\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\right)\leq\frac{1}{2}\cdot\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)+O\left(\log^{5}T\right). \tag{6}$$

The definition below allows us to streamline our notation when proving Lemma 4.2.

Definition 4.1 (Finite differences).

Suppose $L=(L^{(1)},\ldots,L^{(T)})$ is a sequence of vectors $L^{(t)}\in\mathbb{R}^{n}$. For integers $h\geq 0$, the order-$h$ finite difference sequence for the sequence $L$, denoted by $\operatorname{D}_{h}L$, is the sequence $\operatorname{D}_{h}L:=((\operatorname{D}_{h}L)^{(1)},\ldots,(\operatorname{D}_{h}L)^{(T-h)})$ defined recursively as: $(\operatorname{D}_{0}L)^{(t)}:=L^{(t)}$ for all $1\leq t\leq T$, and

$$(\operatorname{D}_{h}L)^{(t)}:=(\operatorname{D}_{h-1}L)^{(t+1)}-(\operatorname{D}_{h-1}L)^{(t)} \tag{7}$$

for all $h\geq 1$, $1\leq t\leq T-h$. (We remark that while Definition 4.1 is stated for a 1-indexed sequence $L^{(1)},L^{(2)},\ldots$, we will also occasionally consider 0-indexed sequences $L^{(0)},L^{(1)},\ldots$, in which case the same recursive definition (7) holds for the finite differences $(\operatorname{D}_{h}L)^{(t)}$, $t\geq 0$.)

Remark 4.3.

Notice that another way of writing (7) is: $\operatorname{D}_{h}L=\operatorname{D}_{1}\operatorname{D}_{h-1}L$. We also remark for later use that $(\operatorname{D}_{h}L)^{(t)}=\sum_{s=0}^{h}\binom{h}{s}(-1)^{h-s}L^{(t+s)}$.
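Definition 4.1 and the closed form in Remark 4.3 can be cross-checked mechanically; a short sketch (0-indexed sequences, names ours):

```python
import numpy as np
from math import comb

def finite_diff(L, h):
    """Order-h finite difference sequence D_h L (Definition 4.1).
    L is a (T, n) array; the result has T - h rows."""
    D = np.asarray(L, dtype=float)
    for _ in range(h):
        D = D[1:] - D[:-1]
    return D

def finite_diff_closed_form(L, h, t):
    """(D_h L)^{(t)} via the binomial formula of Remark 4.3."""
    L = np.asarray(L, dtype=float)
    return sum(comb(h, s) * (-1) ** (h - s) * L[t + s] for s in range(h + 1))
```

Order-$h$ differences annihilate polynomial sequences of degree below $h$, which is one intuition for why they shrink for slowly-varying loss sequences.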

Let $H=\log T$, where $T$ denotes the fixed time horizon from Theorem 3.1 (and thus Lemma 4.2). In the proof of Lemma 4.2, we will bound the finite differences of order $h\leq H$ for certain sequences. The bound (6) of Lemma 4.2 may be rephrased as upper bounding $\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left((\operatorname{D}_{1}\ell_{i})^{(t-1)}\right)$ by $\frac{1}{2}\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right)$; to prove this, we proceed in two steps:

1. (Upwards induction step) First, in Lemma 4.4 below, we find an upper bound on $\left\|(\operatorname{D}_{h}\ell_{i})^{(t)}\right\|_{\infty}$ for all $t\in[T]$, $h\geq 0$, which decays exponentially in $h$ for $h\leq H$. This is done via upwards induction on $h$, i.e., first proving the base case $h=0$ using boundedness of the losses $\ell_{i}^{(t)}$ and then $h=1,2,\ldots$ inductively. The main technical tool we develop for the inductive step is a weak form of the chain rule for finite differences, Lemma 4.5. The inductive step uses the fact that all players are following Optimistic Hedge to relate the $h$th order finite differences of player $i$’s loss sequence $\ell_{i}^{(t)}$ to the $h$th order finite differences of the strategy sequences $x_{i^{\prime}}^{(t)}$ for players $i^{\prime}\neq i$; then we use the exponential-weights-style updates of Optimistic Hedge and Lemma 4.5 to relate the $h$th order finite differences of the strategies $x_{i^{\prime}}^{(t)}$ to the $(h-1)$th order finite differences of the losses $\ell_{i^{\prime}}^{(t)}$.

2. (Downwards induction step) We next show that for all $0\leq h\leq H$, $\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left((\operatorname{D}_{h+1}\ell_{i})^{(t-1)}\right)$ is bounded above by $c_{h}\cdot\sum_{t=1}^{T}\operatorname{Var}_{x_{i}^{(t)}}\left((\operatorname{D}_{h}\ell_{i})^{(t-1)}\right)+\mu_{h}$, for some $c_{h}<1/2$ and $\mu_{h}<O(\log^{5}T)$. This is shown via downwards induction on $h$, namely first establishing the base case $h=H$ by using the result of item 1 for $h=H$ and then treating the cases $h=H-1,H-2,\ldots,0$. The inductive step makes use of the discrete Fourier transform (DFT) to relate the finite differences of different orders (see Lemmas 4.7 and 4.8). In particular, Parseval’s equality, together with a standard relationship between the DFT of the finite differences of a sequence and the DFT of that sequence, allows us to first prove the inductive step in the frequency domain and then transport it back to the original (time) domain.
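The frequency-domain facts invoked in the downwards induction can be illustrated for the circular first difference (the paper's sequences are not circular, so this is only the idealized relation): differencing multiplies the $k$th DFT coefficient by $e^{2\pi ik/T}-1$, and Parseval's equality transfers energy bounds between the two domains.

```python
import numpy as np

T = 16
rng = np.random.default_rng(1)
L = rng.random(T)

# Circular first finite difference: (D_1 L)^{(t)} = L^{(t+1 mod T)} - L^{(t)}.
D1 = np.roll(L, -1) - L

k = np.arange(T)
dft_of_diff = np.fft.fft(D1)
# Differencing acts diagonally in the frequency domain.
predicted = (np.exp(2j * np.pi * k / T) - 1.0) * np.fft.fft(L)
```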

In the following subsections we explain in further detail how the two steps above are completed.

4.3 Upwards induction proof overview

Addressing item 1 in the previous subsection, the lemma below gives a bound on the supremum norm of the $h$th order finite differences of each player’s loss vector, when all players play according to Optimistic Hedge and experience losses according to their loss functions $\mathcal{L}_{1},\ldots,\mathcal{L}_{m}:\mathcal{A}\rightarrow[0,1]$.

Lemma 4.4 (Abbreviated).

Fix a step size $\eta>0$ satisfying $\eta\leq o\left(\frac{1}{m\log T}\right)$. If all players follow Optimistic Hedge updates with step size $\eta$, then for any player $i\in[m]$, integer $h$ satisfying $0\leq h\leq H$, and time step $t\in[T-h]$, it holds that $\|(\operatorname{D}_{h}\ell_{i})^{(t)}\|_{\infty}\leq O(m\eta)^{h}\cdot h^{O(h)}$.

A detailed version of Lemma 4.4, together with its full proof, may be found in Section B.4. We next give a proof overview of Lemma 4.4 for the case of 2 players, i.e., $m=2$; we show in Section B.4 how to generalize this computation to general $m$. Below we introduce the main technical tool in the proof, a “boundedness chain rule,” and then outline how it is used to prove Lemma 4.4.

Main technical tool for Lemma 4.4: boundedness chain rule.

We say that a function $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a softmax-type function if there are real numbers $\xi_{1},\ldots,\xi_{n}$ and some $j\in[n]$ so that for all $(z_{1},\ldots,z_{n})\in\mathbb{R}^{n}$, $\phi((z_{1},\ldots,z_{n}))=\frac{\exp(z_{j})}{\sum_{k=1}^{n}\xi_{k}\cdot\exp(z_{k})}$. Lemma 4.5 below may be interpreted as a “boundedness chain rule” for finite differences. To explain the context for this lemma, recall that given an infinitely differentiable vector-valued function $L:\mathbb{R}\rightarrow\mathbb{R}^{n}$ and an infinitely differentiable function $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}$, the higher order derivatives of the function $\phi(L(t))$ may be computed in terms of those of $L$ and $\phi$ using the chain rule. Lemma 4.5 considers an analogous setting where the input variable $t$ to $L$ is discrete-valued, taking values in $[T]$ (and so we identify the function $L$ with the sequence $L^{(1)},\ldots,L^{(T)}$). In this case, the higher order finite differences of the sequence $L^{(1)},\ldots,L^{(T)}$ (Definition 4.1) take the place of the higher order derivatives of $L$ with respect to $t$. Though there is no generic chain rule for finite differences, Lemma 4.5 states that, at least when $\phi$ is a softmax-type function, we may bound the higher order finite differences of the sequence $\phi(L^{(1)}),\ldots,\phi(L^{(T)})$. In the lemma’s statement we let $\phi\circ L$ denote the sequence $\phi(L^{(1)}),\ldots,\phi(L^{(T)})$.

Lemma 4.5 (“Boundedness chain rule” for finite differences; abbreviated).

Suppose that $h,n\in\mathbb{N}$, $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a softmax-type function, and $L=(L^{(1)},\ldots,L^{(T)})$ is a sequence of vectors in $\mathbb{R}^{n}$ satisfying $\|L^{(t)}\|_{\infty}\leq 1$ for $t\in[T]$. Suppose for some $\alpha\in(0,1)$, for each $0\leq h^{\prime}\leq h$ and $t\in[T-h^{\prime}]$, it holds that $\|(\operatorname{D}_{h^{\prime}}L)^{(t)}\|_{\infty}\leq O(\alpha^{h^{\prime}})\cdot(h^{\prime})^{O(h^{\prime})}$. Then for all $t\in[T-h]$,

$$|(\operatorname{D}_{h}(\phi\circ L))^{(t)}|\leq O(\alpha^{h})\cdot h^{O(h)}.$$

A detailed version of Lemma 4.5 may be found in Section B.3. While Lemma 4.5 requires $\phi$ to be a softmax-type function for simplicity (and this is the only type of function $\phi$ we will need to consider for the case $m=2$), we remark that the detailed version of Lemma 4.5 allows $\phi$ to be from a more general family of analytic functions whose higher order derivatives are appropriately bounded. The proof of Lemma 4.4 for all $m\geq 2$ requires that more general form of Lemma 4.5.

The proof of Lemma 4.5 proceeds by considering the Taylor expansion $P_{\phi}(\cdot)$ of the function $\phi$ at the origin, which we write as follows: for $z=(z_{1},\ldots,z_{n})\in\mathbb{R}^{n}$, $P_{\phi}(z):=\sum_{k\geq 0,\,\gamma\in\mathbb{Z}_{\geq 0}^{n}:\,|\gamma|=k}a_{\gamma}z^{\gamma}$, where $a_{\gamma}\in\mathbb{R}$, $|\gamma|$ denotes the quantity $\gamma_{1}+\cdots+\gamma_{n}$, and $z^{\gamma}$ denotes $z_{1}^{\gamma_{1}}\cdots z_{n}^{\gamma_{n}}$. The fact that $\phi$ is a softmax-type function ensures that the radius of convergence of its Taylor series is at least 1, i.e., $\phi(z)=P_{\phi}(z)$ for any $z$ satisfying $\|z\|_{\infty}\leq 1$. By the assumption that $\|L^{(t)}\|_{\infty}\leq 1$ for each $t$, we may therefore decompose $(\operatorname{D}_{h}(\phi\circ L))^{(t)}$ as:

(Dh(ϕL))(t)=k0,γ0n:|γ|=kaγ(DhLγ)(t),\displaystyle\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}=\sum_{k\geq 0,\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}a_{\gamma}\cdot\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}, (8)

where LγL^{\gamma} denotes the sequence of scalars (Lγ)(t):=(L(t))γ(L^{\gamma})^{(t)}:=(L^{(t)})^{\gamma} for all tt. The fact that ϕ\phi is a softmax-type function allows us to establish strong bounds on |aγ||a_{\gamma}| for each γ\gamma in Lemma B.5. The proof of Lemma B.5 bounds the |aγ||a_{\gamma}| by exploiting the simple form of the derivative of a softmax-type function to decompose each aγa_{\gamma} into a sum of |γ|!|\gamma|! terms. Then we establish a bijection between the terms of this decomposition and graph structures we refer to as factorial trees; that bijection together with the use of an appropriate generating function allows us to complete the proof of Lemma B.5.

Thus, to prove Lemma 4.5, it suffices to bound |(DhLγ)(t)|\left|\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}\right| for all γ\gamma. We do so by using Lemma 4.6.

Lemma 4.6 (Abbreviated; detailed version in Section B.2).

Fix any h0h\geq 0, a multi-index γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n} and set k=|γ|k=|\gamma|. For each of the khk^{h} functions π:[h][k]\pi:[h]\rightarrow[k], and for each r[k]r\in[k], there are integers hπ,r{0,1,,h}h^{\prime}_{\pi,r}\in\{0,1,\ldots,h\}, tπ,r0t^{\prime}_{\pi,r}\geq 0, and jπ,r[n]j^{\prime}_{\pi,r}\in[n], so that the following holds. For any sequence L(1),,L(T)nL^{(1)},\ldots,L^{(T)}\in\mathbb{R}^{n} of vectors, it holds that, for each t[Th]t\in[T-h],

(DhLγ)(t)=π:[h][k]r=1k(Dhπ,r(L(jπ,r)))(t+tπ,r).\displaystyle\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}=\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}\left(\operatorname{D}_{h^{\prime}_{\pi,r}}{(L(j^{\prime}_{\pi,r}))}\right)^{(t+t^{\prime}_{\pi,r})}. (9)

Lemma 4.6 expresses the hhth order finite differences of the sequence LγL^{\gamma} as a sum of khk^{h} terms, each of which is a product of kk finite differences (of various orders) of a sequence L(t)(jπ,r)L^{(t)}(j^{\prime}_{\pi,r}) (i.e., the jπ,rj^{\prime}_{\pi,r}th coordinate of the vectors L(t)L^{(t)}). Crucially, when using Lemma 4.6 to prove Lemma 4.5, the assumption of Lemma 4.5 gives that for each j[n]j^{\prime}\in[n], each h[h]h^{\prime}\in[h], and each t[Th]t^{\prime}\in[T-h^{\prime}], we have the bound |(DhL(j))(t)|O(αh)(h)O(h)\left|\left(\operatorname{D}_{h^{\prime}}{L(j^{\prime})}\right)^{(t^{\prime})}\right|\leq O({\alpha^{h^{\prime}}})\cdot(h^{\prime})^{O(h^{\prime})}. These assumed bounds may be used to bound the right-hand side of (9), which together with Lemma 4.6 and (8) lets us complete the proof of Lemma 4.5.
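The h=1h=1, k=2k=2 instance of Lemma 4.6 is just the discrete product (Leibniz) rule, which splits the first difference of a product into kh=2k^{h}=2 terms, each a product of finite differences at shifted times. A quick check (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.standard_normal(20), rng.standard_normal(20)

d1 = lambda s: s[1:] - s[:-1]   # first finite difference

# D_1(f*g)^{(t)} = (D_1 f)^{(t)} g^{(t)} + f^{(t+1)} (D_1 g)^{(t)}:
# two terms, each a product of (possibly zeroth-order) finite differences
lhs = d1(f * g)
rhs = d1(f) * g[:-1] + f[1:] * d1(g)
assert np.allclose(lhs, rhs)
```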

Proving Lemma 4.4 using the boundedness chain rule.

Next we discuss how Lemma 4.5 is used to prove Lemma 4.4, namely to bound (Dhi)(t)\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\|_{\infty} for each t[Th]t\in[T-h], i[m]i\in[m], and 0hH0\leq h\leq H. Lemma 4.4 is proved using induction, with the base case h=0h=0 being a straightforward consequence of the fact that (D0i)(t)=i(t)1\|\left(\operatorname{D}_{0}{\ell_{i}}\right)^{(t)}\|_{\infty}=\|\ell_{i}^{(t)}\|_{\infty}\leq 1 for all i[m],t[T]i\in[m],t\in[T]. For the rest of this section we focus on the inductive case, i.e., we pick some h[H]h\in[H] and assume Lemma 4.4 holds for all h<hh^{\prime}<h.

The first step is to reduce the claim of Lemma 4.4 to the claim that the upper bound (Dhxi)(t)1O(mη)hhO(h)\|\left(\operatorname{D}_{h}{x_{i}}\right)^{(t)}\|_{1}\leq O\left(m\eta\right)^{h}\cdot h^{O(h)} holds for each t[Th],i[m]t\in[T-h],i\in[m]. Recalling that we are only sketching here the case m=2m=2 for simplicity, this reduction proceeds as follows: for i{1,2}i\in\{1,2\}, define the matrix Ain1×n2A_{i}\in\mathbb{R}^{n_{1}\times n_{2}} by (Ai)a1a2=i(a1,a2)(A_{i})_{a_{1}a_{2}}=\mathcal{L}_{i}(a_{1},a_{2}), for a1[n1],a2[n2]a_{1}\in[n_{1}],a_{2}\in[n_{2}]. We have assumed that all players are using Optimistic Hedge and thus i(t)=𝔼aixi(t),ii[i(a1,,am)]\ell_{i}^{(t)}=\mathbb{E}_{a_{i^{\prime}}\sim x_{i^{\prime}}^{(t)},\ \forall i^{\prime}\neq i}[\mathcal{L}_{i}(a_{1},\ldots,a_{m})]; for our case here (m=2m=2), this may be rewritten as 1(t)=A1x2(t)\ell_{1}^{(t)}=A_{1}x_{2}^{(t)}, 2(t)=A2x1(t)\ell_{2}^{(t)}=A_{2}^{\top}x_{1}^{(t)}. Thus

(Dh1)(t)=A1s=0h(hs)(1)hsx2(t+s)s=0h(hs)(1)hsx2(t+s)1=(Dhx2)(t)1,\displaystyle\|\left(\operatorname{D}_{h}{\ell_{1}}\right)^{(t)}\|_{\infty}=\left\|A_{1}\cdot\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}x_{2}^{(t+s)}\right\|_{\infty}\leq\left\|\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}x_{2}^{(t+s)}\right\|_{1}=\|\left(\operatorname{D}_{h}{x_{2}}\right)^{(t)}\|_{1},

where the first equality is from Remark 4.3 and the inequality follows since all entries of A1A_{1} have absolute value 1\leq 1. A similar computation allows us to show (Dh2)(t)(Dhx1)(t)1\|\left(\operatorname{D}_{h}{\ell_{2}}\right)^{(t)}\|_{\infty}\leq\|\left(\operatorname{D}_{h}{x_{1}}\right)^{(t)}\|_{1}.
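The inequality above is the elementary bound A1vmaxa1,a2|(A1)a1a2|v1\|A_{1}v\|_{\infty}\leq\max_{a_{1},a_{2}}|(A_{1})_{a_{1}a_{2}}|\cdot\|v\|_{1}; a numeric sanity check of the resulting chain (assuming NumPy, with a random payoff matrix and random strategy sequence):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, T, h = 4, 5, 30, 3
A1 = rng.uniform(-1, 1, size=(n1, n2))   # payoff entries of absolute value <= 1
x2 = rng.dirichlet(np.ones(n2), size=T)  # player 2's mixed strategies over time

def fd(seq, h):
    for _ in range(h):
        seq = seq[1:] - seq[:-1]
    return seq

ell1 = x2 @ A1.T                          # ell_1^{(t)} = A_1 x_2^{(t)}
lhs = np.abs(fd(ell1, h)).max(axis=1)     # ||(D_h ell_1)^{(t)}||_inf
rhs = np.abs(fd(x2, h)).sum(axis=1)       # ||(D_h x_2)^{(t)}||_1
assert np.all(lhs <= rhs + 1e-12)
```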

To complete the inductive step it remains to upper bound the quantities (Dhxi)(t)1\|\left(\operatorname{D}_{h}{x_{i}}\right)^{(t)}\|_{1} for i[m],t[Th]i\in[m],t\in[T-h]. To do so, we note that the definition of the Optimistic Hedge updates (1) implies that for any i[m],t[T],j[ni]i\in[m],t\in[T],j\in[n_{i}], and t1t^{\prime}\geq 1, we have

xi(t+t)(j)=xi(t)(j)exp(η(i(t1)(j)s=0t1i(t+s)(j)i(t+t1)(j)))k=1nixi(t)(k)exp(η(i(t1)(k)s=0t1i(t+s)(k)i(t+t1)(k))).\displaystyle x_{i}^{(t+t^{\prime})}(j)=\frac{x_{i}^{(t)}(j)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t-1)}(j)-\sum_{s=0}^{t^{\prime}-1}\ell_{i}^{(t+s)}(j)-\ell_{i}^{(t+t^{\prime}-1)}(j)\right)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t)}(k)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t-1)}(k)-\sum_{s=0}^{t^{\prime}-1}\ell_{i}^{(t+s)}(k)-\ell_{i}^{(t+t^{\prime}-1)}(k)\right)\right)}. (10)

For t[T]t\in[T], t0t^{\prime}\geq 0, set ¯i,t(t):=η(i(t1)s=0t1i(t+s)i(t+t1)).\bar{\ell}_{i,t}^{(t^{\prime})}:=\eta\cdot\left(\ell_{i}^{(t-1)}-\sum_{s=0}^{t^{\prime}-1}\ell_{i}^{(t+s)}-\ell_{i}^{(t+t^{\prime}-1)}\right). Also, for each i[m]i\in[m], j[ni]j\in[n_{i}], t[T]t\in[T], and any vector z=(z(1),,z(ni))niz=(z(1),\ldots,z(n_{i}))\in\mathbb{R}^{n_{i}} define ϕt,i,j(z):=xi(t)(j)exp(z(j))k=1nixi(t)(k)exp(z(k)).\phi_{t,i,j}(z):=\frac{x_{i}^{(t)}(j)\cdot\exp\left(z(j)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t)}(k)\cdot\exp\left(z(k)\right)}. Thus (10) gives that for t1t^{\prime}\geq 1, xi(t+t)(j)=ϕt,i,j(¯i,t(t))x_{i}^{(t+t^{\prime})}(j)=\phi_{t,i,j}(\bar{\ell}_{i,t}^{(t^{\prime})}). Viewing tt as a fixed parameter and letting tt^{\prime} vary, it follows that for h0h\geq 0 and t1t^{\prime}\geq 1, (Dhxi(t+)(j))(t)=(Dh(ϕt,i,j¯i,t))(t)\left(\operatorname{D}_{h}{x_{i}^{(t+\cdot)}(j)}\right)^{(t^{\prime})}=\left(\operatorname{D}_{h}{(\phi_{t,i,j}\circ\bar{\ell}_{i,t})}\right)^{(t^{\prime})}.
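The closed form (10) follows by telescoping the one-step Optimistic Hedge update, i.e., (10) with t=1t^{\prime}=1. The sketch below (assuming NumPy; the loss sequence is synthetic) iterates the one-step rule and compares against (10):

```python
import numpy as np

rng = np.random.default_rng(2)
n, eta = 4, 0.1
ell = rng.uniform(0, 1, size=(13, n))     # synthetic losses ell^{(0)}, ..., ell^{(12)}

def step(x, u):
    # one Optimistic Hedge step: the t' = 1 case of (10)
    w = x * np.exp(eta * (ell[u - 1] - 2 * ell[u]))
    return w / w.sum()

t, tp = 3, 6
xs = {1: np.full(n, 1.0 / n)}
for u in range(1, t + tp):
    xs[u + 1] = step(xs[u], u)

# closed form (10): x^{(t + t')} directly from x^{(t)}
bar = eta * (ell[t - 1] - ell[t : t + tp].sum(axis=0) - ell[t + tp - 1])
w = xs[t] * np.exp(bar)
assert np.allclose(xs[t + tp], w / w.sum())
```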

Recalling that our goal is to bound |(Dhxi(j))(t+1)||\left(\operatorname{D}_{h}{x_{i}(j)}\right)^{(t+1)}| for each tt, we can do so by using Lemma 4.5 with ϕ=ϕt,i,j\phi=\phi_{t,i,j} and α=O(mη)\alpha=O(m\eta), if we can show that its precondition is met, i.e., that (Dh¯i,t)(t)1B1αh(h)B0h\|\left(\operatorname{D}_{h^{\prime}}{\bar{\ell}_{i,t}}\right)^{(t^{\prime})}\|_{\infty}\leq\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}h^{\prime}} for all hhh^{\prime}\leq h, for an appropriate value of α\alpha and appropriate constants B0,B1B_{0},B_{1}. Helpfully, the definition of ¯i,t(t)\bar{\ell}_{i,t}^{(t^{\prime})} as a partial sum allows us to relate the hh^{\prime}-th order finite differences of the sequence ¯i,t(t)\bar{\ell}_{i,t}^{(t^{\prime})} to the (h1)(h^{\prime}-1)-th order finite differences of the sequence i(t)\ell_{i}^{(t)} as follows:

(Dh¯i,t)(t)=η(Dh1i)(t+t1)2η(Dh1i)(t+t).\left(\operatorname{D}_{h^{\prime}}{\bar{\ell}_{i,t}}\right)^{(t^{\prime})}=\eta\cdot\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t+t^{\prime}-1)}-2\eta\cdot\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t+t^{\prime})}. (11)

Since h1<hh^{\prime}-1<h for hhh^{\prime}\leq h, the inductive assumption of Lemma 4.4 gives a bound on the \ell_{\infty}-norm of the terms on the right-hand side of (11), which are sufficient for us to apply Lemma 4.5. Note that the inductive assumption gives an upper bound on (Dh1i)(t)\|\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t)}\|_{\infty} that only scales with αh1\alpha^{h^{\prime}-1}, whereas Lemma 4.5 requires scaling of αh\alpha^{h^{\prime}}. This discrepancy is corrected by the factor of η\eta on the right-hand side of (11), which gives the desired scaling αh\alpha^{h^{\prime}} (since η<α\eta<\alpha for the choice α=O(mη)\alpha=O(m\eta)).
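Identity (11) is a short computation; it can also be verified numerically (assuming NumPy, with a synthetic loss sequence):

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, eta, t, hp = 40, 3, 0.05, 5, 3
ell = rng.uniform(0, 1, size=(T, n))      # synthetic loss vectors

def fd(seq, h):
    for _ in range(h):
        seq = seq[1:] - seq[:-1]
    return seq

# bar_ell_{i,t}^{(t')} as a sequence indexed by t' = 1, 2, ...
tps = np.arange(1, 20)
bar = np.array([eta * (ell[t - 1] - ell[t : t + tp].sum(axis=0) - ell[t + tp - 1])
                for tp in tps])

lhs = fd(bar, hp)                          # (D_{h'} bar_ell)^{(t')}
d = fd(ell, hp - 1)                        # (D_{h'-1} ell)
rhs = np.array([eta * d[t + tp - 1] - 2 * eta * d[t + tp]
                for tp in tps[: len(tps) - hp]])
assert np.allclose(lhs, rhs)               # identity (11)
```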

4.4 Downwards induction proof overview

In this section we discuss in further detail item 2 in Section 4.2; in particular, we will show that there is a parameter μ=Θ~(ηm)\mu=\tilde{\Theta}(\eta m) so that for all integers hh satisfying H1h0H-1\geq h\geq 0,

t=1Th1Varxi(t)((Dh+1i)(t))O(1/H)t=1ThVarxi(t)((Dhi)(t))+O~(μ2h),\sum_{t=1}^{T-h-1}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\left(\operatorname{D}_{h+1}{\ell_{i}}\right)^{(t)}}\right)\leq O(1/H)\cdot\sum_{t=1}^{T-h}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}}\right)+\tilde{O}\left(\mu^{2h}\right), (12)

where O~\tilde{O} hides factors polynomial in logT\log T. The validity of (12) for h=0h=0 implies Lemma 4.2. On the other hand, as long as we choose the value μ\mu in (12) to satisfy μmηHΩ(1)\mu\geq m\eta H^{\Omega(1)}, then Lemma 4.4 implies that t=1THVarxi(t)((DHi)(t))O(μ2H)\sum_{t=1}^{T-H}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\left(\operatorname{D}_{H}{\ell_{i}}\right)^{(t)}}\right)\leq O(\mu^{2H}). This gives that (12) holds for h=H1h=H-1. To show that (12) holds for all H1>h0H-1>h\geq 0, we use downwards induction; fix any hh, and assume that (12) has been shown for all hh^{\prime} satisfying h<hH1h<h^{\prime}\leq H-1. Our main tool in the inductive step is to apply Lemma 4.7 below. To state it, for ζ>0,n\zeta>0,\ n\in\mathbb{N}, we say that a sequence of distributions P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} is ζ\zeta-consecutively close if for each 1t<T1\leq t<T, it holds that max{P(t)P(t+1),P(t+1)P(t)}1+ζ\max\left\{\left\|\frac{P^{(t)}}{P^{(t+1)}}\right\|_{\infty},\left\|\frac{P^{(t+1)}}{P^{(t)}}\right\|_{\infty}\right\}\leq 1+\zeta. (Here, for distributions P,QΔnP,Q\in\Delta^{n}, PQn\frac{P}{Q}\in\mathbb{R}^{n} denotes the vector whose jjth entry is P(j)/Q(j)P(j)/Q(j).) Lemma 4.7 shows that given a sequence of vectors for which the variances of its second-order finite differences are bounded by the variances of its first-order finite differences, a similar relationship holds between its first- and zeroth-order finite differences.

Lemma 4.7.

There is a sufficiently large constant C0>1C_{0}>1 so that the following holds. For any M,ζ,α,μ>0M,\zeta,\alpha,\mu>0 and nn\in\mathbb{N}, suppose that P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} and Z(1),,Z(T)[M,M]nZ^{(1)},\ldots,Z^{(T)}\in[-M,M]^{n} satisfy the following conditions:

  1.

    The sequence P(1),,P(T)P^{(1)},\ldots,P^{(T)} is ζ\zeta-consecutively close for some ζ[1/(2T),α4/C0]\zeta\in[1/(2T),\alpha^{4}/C_{0}].

  2.

    It holds that t=1T2VarP(t)((D2Z)(t))αt=1T1VarP(t)((D1Z)(t))+μ.\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)\leq\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)+\mu.

Then  t=1T1VarP(t)((D1Z)(t))α(1+α)t=1TVarP(t)(Z(t))+μα+C0M2α3.\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\leq\alpha\cdot(1+\alpha)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{\mu}{\alpha}+\frac{C_{0}M^{2}}{\alpha^{3}}.

Given Lemma 4.7, the inductive step for establishing (12) is straightforward: we apply Lemma 4.7 with P(t)=xi(t)P^{(t)}=x_{i}^{(t)} and Z(t)=(Dhi)(t)Z^{(t)}=\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)} for all tt. The fact that the xi(t)x_{i}^{(t)} are updated with Optimistic Hedge may be used to establish that precondition 1 of Lemma 4.7 holds. Since (D1Z)(t)=(Dh+1i)(t)\left(\operatorname{D}_{1}{Z}\right)^{(t)}=\left(\operatorname{D}_{h+1}{\ell_{i}}\right)^{(t)} and (D2Z)(t)=(Dh+2i)(t)\left(\operatorname{D}_{2}{Z}\right)^{(t)}=\left(\operatorname{D}_{h+2}{\ell_{i}}\right)^{(t)}, the inductive hypothesis, namely that (12) holds for h+1h+1, implies that precondition 2 of Lemma 4.7 holds for appropriate α,μ>0\alpha,\mu>0. Thus Lemma 4.7 implies that (12) holds for the value hh, which completes the inductive step.
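Precondition 1 concerns ζ\zeta-consecutively close sequences; a consequence used later in the proof of Lemma 4.7 (Lemma C.1 in the appendix) is that weighted variances under two such consecutive distributions agree up to a factor 1±ζ1\pm\zeta. A numeric check of this fact (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
P = rng.dirichlet(np.ones(n))
Q = P * (1 + 0.01 * rng.uniform(-1, 1, n))
Q = Q / Q.sum()
zeta = max((P / Q).max(), (Q / P).max()) - 1   # closeness parameter of the pair

Z = rng.standard_normal(n)
var = lambda p, z: p @ (z - p @ z) ** 2        # Var_p(z)
ratio = var(P, Z) / var(Q, Z)
assert 1 - zeta - 1e-12 <= ratio <= 1 + zeta + 1e-12
```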

On the proof of Lemma 4.7.

Finally we discuss the proof of Lemma 4.7. One technical challenge is the fact that the vectors P(t)P^{(t)} are not constant functions of tt, but rather change slowly (as constrained by being ζ\zeta-consecutively close). The main tool for dealing with this difficulty is Lemma C.1, which shows that for a ζ\zeta-consecutively close sequence P(t)P^{(t)}, for any vector Z(t)Z^{(t)}, VarP(t)(Z(t))VarP(t+1)(Z(t))[1ζ,1+ζ]\frac{\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)}{\operatorname{Var}_{{P^{(t+1)}}}\left({Z^{(t)}}\right)}\in[1-\zeta,1+\zeta]. This fact, together with some algebraic manipulations, lets us reduce to the case that all P(t)P^{(t)} are equal. It is also relatively straightforward to reduce to the case that P(t),Z(t)=0\langle P^{(t)},Z^{(t)}\rangle=0 for all tt, i.e., so that VarP(t)(Z(t))=Z(t)P(t)2\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)=\left\|{Z^{(t)}}\right\|_{P^{(t)}}^{2}. We may further separate Z(t)P(t)2=j=1nP(t)(j)(Z(t)(j))2\left\|{Z^{(t)}}\right\|_{P^{(t)}}^{2}=\sum_{j=1}^{n}P^{(t)}(j)\cdot(Z^{(t)}(j))^{2} into its individual components P(t)(j)(Z(t)(j))2P^{(t)}(j)\cdot(Z^{(t)}(j))^{2}, and treat each one separately, thus allowing us to reduce to a one-dimensional problem. Finally, we make one further reduction, which is to replace the finite differences Dh()\operatorname{D}_{h}{(\cdot)} in Lemma 4.7 with circular finite differences, defined below:

Definition 4.2 (Circular finite difference).

Suppose L=(L(0),,L(S1))L=(L^{(0)},\ldots,L^{(S-1)}) is a sequence of vectors L(t)nL^{(t)}\in\mathbb{R}^{n}. For integers h0h\geq 0, the level-hh circular finite difference sequence for the sequence LL, denoted by DhL\operatorname{D}^{\circ}_{h}{L}, is the sequence defined recursively as: (D0L)(t)=L(t)\left(\operatorname{D}^{\circ}_{0}{L}\right)^{(t)}=L^{(t)} for all 0t<S0\leq t<S, and

\left(\operatorname{D}^{\circ}_{h}{L}\right)^{(t)}=\begin{cases}\left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(t+1)}-\left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(t)}\quad:0\leq t\leq S-2\\ \left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(0)}-\left(\operatorname{D}^{\circ}_{h-1}{L}\right)^{(S-1)}\quad:t=S-1.\end{cases} (13)

Circular finite differences for a sequence L(0),,L(S1)L^{(0)},\ldots,L^{(S-1)} are defined similarly to finite differences (Definition 4.1) except that unlike for finite differences, where (DhL)(Sh),,(DhL)(S1)\left(\operatorname{D}_{h}{L}\right)^{(S-h)},\ldots,\left(\operatorname{D}_{h}{L}\right)^{(S-1)} are not defined, (DhL)(Sh),,(DhL)(S1)\left(\operatorname{D}^{\circ}_{h}{L}\right)^{(S-h)},\ldots,\left(\operatorname{D}^{\circ}_{h}{L}\right)^{(S-1)} are defined by “wrapping around” back to the beginning of the sequence. The above-described reductions, which are worked out in detail in Section C.2, allow us to reduce proving Lemma 4.7 to proving the following simpler lemma:

Lemma 4.8.

Suppose μ\mu\in\mathbb{R}, α>0\alpha>0, and W(0),,W(S1)W^{(0)},\ldots,W^{(S-1)}\in\mathbb{R} is a sequence of reals satisfying

t=0S1((D2W)(t))2αt=0S1((D1W)(t))2+μ.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu. (14)

Then  t=0S1((D1W)(t))2αt=0S1(W(t))2+μ/α.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=0}^{S-1}(W^{(t)})^{2}+\mu/\alpha.
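Definition 4.2 is straightforward to implement with index wrap-around; the sketch below (assuming NumPy) checks that DhL\operatorname{D}^{\circ}_{h}{L} agrees with DhL\operatorname{D}_{h}{L} on the indices where the latter is defined:

```python
import numpy as np

def cfd(L, h):
    # level-h circular finite difference: index t + 1 wraps around mod S
    for _ in range(h):
        L = np.roll(L, -1, axis=0) - L
    return L

def fd(L, h):
    # ordinary finite difference, losing one index per level
    for _ in range(h):
        L = L[1:] - L[:-1]
    return L

rng = np.random.default_rng(5)
S, h = 16, 3
W = rng.standard_normal(S)
assert cfd(W, h).shape == (S,)                    # all S entries are defined
assert np.allclose(cfd(W, h)[: S - h], fd(W, h))  # agreement off the wrap-around
```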

To prove Lemma 4.8, we apply the discrete Fourier transform to both sides of (14) and use the Cauchy-Schwarz inequality in the frequency domain. For a sequence W(0),,W(S1)W^{(0)},\ldots,W^{(S-1)}\in\mathbb{R}, its (discrete) Fourier transform is the sequence W^(0),,W^(S1)\widehat{W}^{(0)},\ldots,\widehat{W}^{(S-1)} defined by W^(s)=t=0S1W(t)e2πistS\widehat{W}^{(s)}=\sum_{t=0}^{S-1}W^{(t)}\cdot e^{-\frac{2\pi ist}{S}}. Below we prove Lemma 4.8 for the special case μ=0\mu=0; we defer the general case to Section C.1.
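Both ingredients of the proof, Parseval's equality and the effect of a circular difference on Fourier coefficients, can be checked directly with a fast Fourier transform (assuming NumPy, whose `np.fft.fft` uses the same sign convention as the definition above):

```python
import numpy as np

rng = np.random.default_rng(6)
S = 32
W = rng.standard_normal(S)
What = np.fft.fft(W)   # hat W^{(s)} = sum_t W^{(t)} e^{-2 pi i s t / S}

# Parseval: sum_s |hat W^{(s)}|^2 = S * sum_t (W^{(t)})^2
assert np.isclose((np.abs(What) ** 2).sum(), S * (W ** 2).sum())

# the DFT of the circular first difference is hat W^{(s)} * (e^{2 pi i s / S} - 1)
D1W = np.roll(W, -1) - W
mult = np.exp(2j * np.pi * np.arange(S) / S) - 1
assert np.allclose(np.fft.fft(D1W), What * mult)
```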

Proof of Lemma 4.8 for special case μ=0\mu=0.

We have the following:

\displaystyle S\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}=\sum_{s=0}^{S-1}\left|\widehat{\operatorname{D}^{\circ}_{1}{W}}^{(s)}\right|^{2}=\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}(e^{2\pi is/S}-1)\right|^{2}\leq\sqrt{\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}}\sqrt{\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}\left|e^{2\pi is/S}-1\right|^{4}},

where the first equality uses Parseval's equality, the second uses Fact C.3 (in the appendix) for h=1h=1, and the inequality uses Cauchy-Schwarz. By Parseval's equality and Fact C.3 for h=2h=2, the right-hand side of the above equals S\cdot\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}}, which, by assumption, is at most S\cdot\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}}. Rearranging terms completes the proof. ∎
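The μ=0\mu=0 argument can be exercised numerically: choose α\alpha so that (14) holds with equality for a random sequence, then check the conclusion (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
S = 24
W = rng.standard_normal(S)

cd1 = np.roll(W, -1) - W          # circular first differences
cd2 = np.roll(cd1, -1) - cd1      # circular second differences

# choose alpha so that hypothesis (14) holds with mu = 0 (with equality)
alpha = (cd2 ** 2).sum() / (cd1 ** 2).sum()
# conclusion of Lemma 4.8 with mu = 0
assert (cd1 ** 2).sum() <= alpha * (W ** 2).sum() + 1e-9
```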

References

  • [AIMM21] Waïss Azizian, Franck Iutzeler, Jérome Malick, and Panayotis Mertikopoulos. The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities. In Conference on Learning Theory, pages 1–32, 2021.
  • [BCM12] Maria-Florina Balcan, Florin Constantin, and Ruta Mehta. The Weighted Majority Algorithm does not Converge in Nearly Zero-sum Games. In ICML Workshop on Markets, Mechanisms, and Multi-Agent Models, 2012.
  • [Bla54] David Blackwell. Controlled Random Walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954.
  • [BP18] James P. Bailey and Georgios Piliouras. Multiplicative Weights Update in Zero-Sum Games. In Proceedings of the 2018 ACM Conference on Economics and Computation - EC ’18, pages 321–338, Ithaca, NY, USA, 2018. ACM Press.
  • [BP19] James P. Bailey and Georgios Piliouras. Fast and furious learning in zero-sum games: Vanishing regret with non-vanishing step sizes. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12977–12987, 2019.
  • [Bro49] George W Brown. Some Notes on Computation of Games Solutions. Technical report, RAND Corporation, Santa Monica, CA, 1949.
  • [Bub15] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Found. Trends Mach. Learn., 8(3–4):231–357, November 2015.
  • [CBL06] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge university press, 2006.
  • [CDT09] Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player nash equilibria. Journal of the ACM (JACM), 56(3):1–57, 2009.
  • [CP19] Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 807–834, 2019.
  • [CP20] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. In Advances in Neural Information Processing Systems, volume 33, pages 18990–18999. Curran Associates, Inc., 2020.
  • [CS04] Imre Csiszár and Paul C. Shields. Information Theory and Statistics: A Tutorial. Commun. Inf. Theory, 1(4):417–528, December 2004.
  • [DDK11] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-Optimal No-Regret Algorithms for Zero-Sum Games. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms (SODA), 2011.
  • [DFP+10] Constantinos Daskalakis, Rafael Frongillo, Christos H. Papadimitriou, George Pierrakos, and Gregory Valiant. On learning algorithms for nash equilibria. In Proceedings of the Third International Conference on Algorithmic Game Theory, SAGT’10, page 114–125, Berlin, Heidelberg, 2010. Springer-Verlag.
  • [DGP06] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a nash equilibrium. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing (STOC), 2006.
  • [DP14] Constantinos Daskalakis and Qinxuan Pan. A counter-example to Karlin’s strong conjecture for fictitious play. In Proceedings of the 55th Annual Symposium on Foundations of Computer Science (FOCS), 2014.
  • [DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 27:1–27:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [FLL+16] Dylan J Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [GKP89] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, 1989.
  • [GPD20] Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [HAM21] Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium. In Conference on Learning Theory, 2021.
  • [Han57] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • [HMcW+03] Sergiu Hart and Andreu Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830–1836, 2003.
  • [KHSC18] Ehsan Asadi Kangarshahi, Ya-Ping Hsieh, Mehmet Fatih Sahin, and Volkan Cevher. Let’s be honest: An optimal no-regret framework for zero-sum games. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2488–2496. PMLR, 10–15 Jul 2018.
  • [KLP11] Robert Kleinberg, Katrina Ligett, and Georgios Piliouras. Beyond the nash equilibrium barrier. In Innovations in Computer Science (ICS), pages 125–140, 2011.
  • [LGNPw21] Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1441–1449. PMLR, 13–15 Apr 2021.
  • [MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’18, page 2703–2717, USA, 2018. Society for Industrial and Applied Mathematics.
  • [Nes05] Yu Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.
  • [PP16] Christos Papadimitriou and Georgios Piliouras. From nash equilibria to chain recurrent sets: Solution concepts and topology. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, page 227–235, New York, NY, USA, 2016. Association for Computing Machinery.
  • [Rob51] Julia Robinson. An Iterative Method of Solving a Game. Annals of mathematics, pages 296–301, 1951.
  • [Rou09] Tim Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, page 513–522, New York, NY, USA, 2009. Association for Computing Machinery.
  • [RS13a] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory, pages 993–1019, 2013.
  • [RS13b] Alexander Rakhlin and Karthik Sridharan. Optimization, Learning, and Games with Predictable Sequences. arXiv:1311.1869 [cs], November 2013. arXiv: 1311.1869.
  • [RST17] Tim Roughgarden, Vasilis Syrgkanis, and Éva Tardos. The price of anarchy in auctions. J. Artif. Int. Res., 59(1):59–101, May 2017.
  • [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [Sha64] L. Shapley. Some Topics in Two-Person Games. Advances in Game Theory, 1964.
  • [ST13] Vasilis Syrgkanis and Eva Tardos. Composable and efficient mechanisms. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, page 211–220, New York, NY, USA, 2013. Association for Computing Machinery.
  • [VGFL+20] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, Thanasis Lianeas, Panayotis Mertikopoulos, and Georgios Piliouras. No-regret learning and mixed nash equilibria: They do not mix. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1380–1391. Curran Associates, Inc., 2020.
  • [WL18] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory, pages 1263–1291, 2018.
  • [WLZL21] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations, 2021.

Appendix A Proofs for Section 4.1

In this section we prove Lemma 4.1. Throughout the section we use the notation of Lemma 4.1: in particular, we assume that any player i[m]i\in[m] follows the Optimistic Hedge updates (1) with step size η>0\eta>0, for an arbitrary sequence of losses i(1),,i(T)\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}.

A.1 Preliminary lemmas

The first few lemmas in this section pertain to vectors P,QΔnP,Q\in\Delta^{n}, for some nn\in\mathbb{N}; note that such vectors P,QP,Q may be viewed as distributions on [n][n]. Let P/QnP/Q\in\mathbb{R}^{n} denote the Radon-Nikodym derivative, i.e., the vector whose jjth component is P(j)/Q(j)P(j)/Q(j).

Lemma A.1.

If P/QA\|P/Q\|_{\infty}\leq A, then χ2(P;Q)Aχ2(Q;P)\chi^{2}({P};{Q})\leq A\cdot\chi^{2}({Q};{P}).

Proof.

The lemma is immediate from the definition of the χ2\chi^{2} divergence:

χ2(P;Q)=j=1n(P(j)Q(j))2Q(j)Aj=1n(P(j)Q(j))2P(j)=Aχ2(Q;P).\displaystyle\chi^{2}({P};{Q})=\sum_{j=1}^{n}\frac{(P(j)-Q(j))^{2}}{Q(j)}\leq A\cdot\sum_{j=1}^{n}\frac{(P(j)-Q(j))^{2}}{P(j)}=A\cdot\chi^{2}({Q};{P}).
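A quick numeric check of Lemma A.1 (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
P = rng.dirichlet(np.ones(n))
Q = rng.dirichlet(np.ones(n))
A = (P / Q).max()                              # ||P/Q||_inf

chi2 = lambda p, q: ((p - q) ** 2 / q).sum()
assert chi2(P, Q) <= A * chi2(Q, P) + 1e-12
```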

It is a standard fact (though one which we do not need in our proofs) that for all P,QΔnP,Q\in\Delta^{n}, KL(P;Q)χ2(P;Q)\operatorname{KL}({P};{Q})\leq\chi^{2}({P};{Q}). The lemma below shows an inequality in the opposite direction when P/Q,Q/P\|P/Q\|_{\infty},\|Q/P\|_{\infty} are bounded:

Lemma A.2.

There is a constant CC so that the following holds. Suppose that for some A32A\leq\frac{3}{2} we have P/QA\left\|P/Q\right\|_{\infty}\leq A and Q/PA\left\|Q/P\right\|_{\infty}\leq A. Then (1/2C(A1))χ2(P;Q)KL(P;Q)(1/2-C(A-1))\cdot\chi^{2}(P;Q)\leq\operatorname{KL}(P;Q).

Proof.

There is a constant C>0C>0 so that for any 0<β1/20<\beta\leq 1/2, for all |x|β|x|\leq\beta, we have

log(1+x)x(1/2+Cβ)x2.\log(1+x)\geq x-(1/2+C\beta)x^{2}.

Set a=A1a=A-1, so that |P(j)/Q(j)1|a|P(j)/Q(j)-1|\leq a for all jj by assumption; note that a1/2a\leq 1/2. Then for C=1/2+3C/2C^{\prime}=1/2+3C/2, we have

\displaystyle\operatorname{KL}(P;Q) =\sum_{j}P(j)\log\frac{P(j)}{Q(j)}
\geq\sum_{j}P(j)\cdot\left(\left(\frac{P(j)}{Q(j)}-1\right)-(1/2+Ca)\left(\frac{P(j)}{Q(j)}-1\right)^{2}\right)
=\chi^{2}(P;Q)-(1/2+Ca)\sum_{j}P(j)\cdot\frac{(P(j)-Q(j))^{2}}{Q(j)^{2}}
\geq\chi^{2}(P;Q)-(1/2+Ca)\cdot A\cdot\chi^{2}(P;Q)
=\left(1-(1/2+Ca)(1+a)\right)\cdot\chi^{2}(P;Q)
=\left(1/2-a\cdot(1/2+C+Ca)\right)\cdot\chi^{2}(P;Q)
\geq(1/2-C^{\prime}a)\cdot\chi^{2}(P;Q),

where the second inequality uses P(j)/Q(j)AP(j)/Q(j)\leq A, so that P(j)(P(j)Q(j))2/Q(j)2A(P(j)Q(j))2/Q(j)P(j)\cdot(P(j)-Q(j))^{2}/Q(j)^{2}\leq A\cdot(P(j)-Q(j))^{2}/Q(j), and the final inequality uses a1/2a\leq 1/2.
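In the regime of Lemma A.2, where AA is close to 1, the Kullback-Leibler divergence is sandwiched near χ2(P;Q)/2\chi^{2}(P;Q)/2 from both sides. A small worked check (the specific distributions are illustrative only):

```python
import numpy as np

P = np.array([0.5, 0.5])
Q = np.array([0.55, 0.45])
A = max((P / Q).max(), (Q / P).max())   # here A - 1 is about 0.11

kl = (P * np.log(P / Q)).sum()
chi2 = ((P - Q) ** 2 / Q).sum()

assert kl <= chi2                       # the standard direction
assert kl >= 0.45 * chi2                # reverse direction, since A is close to 1
```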

The next lemma considers two vectors x,xΔnx,x^{\prime}\in\Delta^{n} which are related by a multiplicative weights-style update with loss vector wnw\in\mathbb{R}^{n}; the lemma relates χ2(x;x)\chi^{2}({x^{\prime}};{x}) to wx2\|w\|_{x}^{2}.

Lemma A.3.

There is a constant C>0C>0 so that the following holds. Suppose that wnw\in\mathbb{R}^{n}, α>0\alpha>0, wα/21/C\|w\|_{\infty}\leq\alpha/2\leq 1/C, and x,xΔnx,x^{\prime}\in\Delta^{n} satisfy, for each j[n]j\in[n],

x(j)=x(j)exp(w(j))k[n]x(k)exp(w(k)).\displaystyle x^{\prime}(j)=\frac{x(j)\cdot\exp(w(j))}{\sum_{k\in[n]}x(k)\cdot\exp(w(k))}. (15)

Then

(1Cα)Varx(w)χ2(x;x)(1+Cα)Varx(w).(1-C\alpha)\cdot\operatorname{Var}_{{x}}\left({w}\right)\leq\chi^{2}(x^{\prime};x)\leq(1+C\alpha)\operatorname{Var}_{{x}}\left({w}\right).
Proof.

Let w=wx,w𝟏w^{\prime}=w-\langle x,w\rangle\mathbf{1}, where 𝟏\mathbf{1} denotes the all-1s vector. Note that Varx(w)=Varx(w)\operatorname{Var}_{{x}}\left({w}\right)=\operatorname{Var}_{{x}}\left({w^{\prime}}\right), and that if we replace ww with ww^{\prime}, (15) remains true. Moreover, w2wα\|w^{\prime}\|_{\infty}\leq 2\|w\|_{\infty}\leq\alpha. Thus, by replacing ww with ww^{\prime}, we may assume from here on that w,x=0\langle w,x\rangle=0 and that wα\|w\|_{\infty}\leq\alpha.

Note that

χ2(x;x)=1+i=1nx(i)(x(i)/x(i))2=1+𝔼(exp(W)𝔼exp(W))2,\chi^{2}(x^{\prime};x)=-1+\sum_{i=1}^{n}x(i)\cdot(x^{\prime}(i)/x(i))^{2}=-1+\mathbb{E}\left(\frac{\exp(W)}{\mathbb{E}\exp(W)}\right)^{2},

where WW is a random variable that takes values w(j)w(j) with probability x(j)x(j). As long as CC is a sufficiently large constant, we have that, for all zz satisfying |z|α|z|\leq\alpha,

1+z+(1Cα)z2/2exp(z)1+z+(1+Cα)z2/2.1+z+(1-C\alpha)z^{2}/2\leq\exp(z)\leq 1+z+(1+C\alpha)z^{2}/2. (16)

Thus, for a sufficiently large constant C0C_{0}^{\prime}, we have, for all zz satisfying |z|α|z|\leq\alpha,

1+2z+(2C0α)z2exp(z)21+2z+(2+C0α)z2.1+2z+(2-C_{0}^{\prime}\alpha)z^{2}\leq\exp(z)^{2}\leq 1+2z+(2+C_{0}^{\prime}\alpha)z^{2}. (17)

Moreover, since 𝔼W=0\mathbb{E}W=0, we have from (16) that 1+(1Cα)𝔼W2/2𝔼exp(W)1+(1+Cα)𝔼W2/21+(1-C\alpha)\mathbb{E}W^{2}/2\leq\mathbb{E}\exp(W)\leq 1+(1+C\alpha)\mathbb{E}W^{2}/2. For a sufficiently large constant C1C_{1}^{\prime} it follows that

1+(1C1α)𝔼W2(𝔼exp(W))21+(1+C1α)𝔼W2.1+(1-C_{1}^{\prime}\alpha)\mathbb{E}W^{2}\leq(\mathbb{E}\exp(W))^{2}\leq 1+(1+C_{1}^{\prime}\alpha)\mathbb{E}W^{2}. (18)

Combining (17) and (18) and again using the fact that 𝔼W=0\mathbb{E}W=0, we get, for some sufficiently large constant C′′C^{\prime\prime}, as long as α<1/C1\alpha<1/C_{1}^{\prime},

(1C′′α)𝔼W2\displaystyle(1-C^{\prime\prime}\alpha)\mathbb{E}W^{2}\leq 1+1+(2C0α)𝔼W21+(1+C1α)𝔼W2\displaystyle-1+\frac{1+(2-C_{0}^{\prime}\alpha)\mathbb{E}W^{2}}{1+(1+C_{1}^{\prime}\alpha)\mathbb{E}W^{2}}
\displaystyle\leq 1+𝔼(exp(W)2)(𝔼exp(W))2\displaystyle-1+\frac{\mathbb{E}(\exp(W)^{2})}{(\mathbb{E}\exp(W))^{2}}
\displaystyle\leq 1+1+(2+C0α)𝔼W21+(1C1α)𝔼W2\displaystyle-1+\frac{1+(2+C_{0}^{\prime}\alpha)\mathbb{E}W^{2}}{1+(1-C_{1}^{\prime}\alpha)\mathbb{E}W^{2}}
\displaystyle\leq (1+C′′α)𝔼W2.\displaystyle(1+C^{\prime\prime}\alpha)\mathbb{E}W^{2}.

By the assumption that w,x=0\langle w,x\rangle=0, we have 𝔼W2=Varx(w)\mathbb{E}W^{2}=\operatorname{Var}_{{x}}\left({w}\right), and thus the above gives the desired result. ∎
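The second-order exponential bounds of the form (16) used in this proof are easy to sanity-check numerically. The sketch below is an illustration only, not part of the argument; the concrete constant C = 2 is our choice and suffices for α ≤ 1:

```python
import math

def check_exp_bounds(alpha, C=2.0, steps=200):
    """Check 1+z+(1-C*a)z^2/2 <= e^z <= 1+z+(1+C*a)z^2/2 for |z| <= a = alpha."""
    for i in range(steps + 1):
        z = -alpha + 2 * alpha * i / steps
        lower = 1 + z + (1 - C * alpha) * z ** 2 / 2
        upper = 1 + z + (1 + C * alpha) * z ** 2 / 2
        assert lower <= math.exp(z) <= upper, (alpha, z)

for a in (0.01, 0.05, 0.1, 0.5, 1.0):
    check_exp_bounds(a)
```

Squaring these bounds (and absorbing lower-order terms into the constant) gives exactly the shape of (17).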

We will need the following standard lemma:

Lemma A.4 ([RS13a], Eq. (26)).

For any nn\in\mathbb{N}, n\ell\in\mathbb{R}^{n}, yΔny\in\Delta^{n}, if it holds that x=argminxΔnx,+KL(x;y)x=\operatorname*{arg\,min}_{x^{\prime}\in\Delta^{n}}\langle x^{\prime},\ell\rangle+\operatorname{KL}({x^{\prime}};{y}), then for any zΔnz\in\Delta^{n},

xz,KL(z;y)KL(z;x)KL(x;y).\displaystyle\langle x-z,\ell\rangle\leq\operatorname{KL}({z};{y})-\operatorname{KL}({z};{x})-\operatorname{KL}({x};{y}).

For t[T]t\in[T], we define the vector x~i(t)Δni\tilde{x}_{i}^{(t)}\in\Delta^{n_{i}} by

x~i(t)(j):=xi(t)(j)exp(η(i(t)(j)i(t1)(j)))k[ni]xi(t)(k)exp(η(i(t)(k)i(t1)(k))).\displaystyle\tilde{x}_{i}^{(t)}(j):=\frac{x_{i}^{(t)}(j)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(j)-\ell_{i}^{(t-1)}(j)))}{\sum_{k\in[n_{i}]}x_{i}^{(t)}(k)\cdot\exp(-\eta\cdot(\ell_{i}^{(t)}(k)-\ell_{i}^{(t-1)}(k)))}. (19)

Additionally define x~i(0):=(1/ni,,1/ni)\tilde{x}_{i}^{(0)}:=(1/n_{i},\ldots,1/n_{i}) to be the uniform distribution over [ni][n_{i}].
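The auxiliary sequence (19) is simply a multiplicative-weights reweighting of x_i^{(t)} by the one-step loss difference. The following sketch makes this concrete (illustrative only; the helper name tilde_update is ours):

```python
import math

def tilde_update(x, loss_t, loss_prev, eta):
    """Eq. (19): reweight x by exp(-eta * (loss_t - loss_prev)) and renormalize."""
    w = [xj * math.exp(-eta * (lt - lp)) for xj, lt, lp in zip(x, loss_t, loss_prev)]
    s = sum(w)
    return [wj / s for wj in w]

x = [0.5, 0.3, 0.2]
xt = tilde_update(x, [1.0, 0.0, 0.5], [0.2, 0.1, 0.9], eta=0.1)
assert abs(sum(xt) - 1.0) < 1e-12 and all(v > 0 for v in xt)
# A constant loss difference leaves the distribution unchanged:
u = tilde_update([0.25] * 4, [1.0] * 4, [0.0] * 4, eta=0.3)
assert all(abs(v - 0.25) < 1e-12 for v in u)
```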

The next lemma, Lemma A.5, is very similar to [RS13a, Lemma 3], and is indeed essentially shown in the course of the proof of that lemma. Note that no boundedness assumption is placed on the vectors \ell_{i}^{(t)} in Lemma A.5. For completeness we provide a full proof of the lemma.

Lemma A.5 (Refinement of Lemma 3, [RS13a]).

Suppose that any player i[m]i\in[m] follows the Optimistic Hedge updates (1) with step size η>0\eta>0, for an arbitrary sequence of losses i(1),,i(T)ni\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in\mathbb{R}^{n_{i}}. For any vector xΔnix^{\star}\in\Delta^{n_{i}}, it holds that

t=1Txi(t)x,i(t)logniη+t=1Txi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))1ηt=1TKL(x~i(t);xi(t))1ηt=1TKL(xi(t);x~i(t1)).\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (20)
Proof.

For any xΔnix^{\star}\in\Delta^{n_{i}}, it holds that

xi(t)x,i(t)=xi(t)x~i(t),i(t)i(t1)+xi(t)x~i(t),i(t1)+x~i(t)x,i(t).\displaystyle\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle=\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\rangle+\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t-1)}\rangle+\langle\tilde{x}_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle. (21)

For t[T]t\in[T], set c(t)=xi(t),i(t)i(t1)c^{(t)}=\langle x_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\rangle. Using the definition of the dual norm and the fact xi(t)x~i(t),𝟏=0\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\mathbf{1}\rangle=0, we have

xi(t)x~i(t),i(t)i(t1)=\displaystyle\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\rangle= xi(t)x~i(t),i(t)i(t1)c(t)𝟏\displaystyle\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t)}-\ell_{i}^{(t-1)}-c^{(t)}\mathbf{1}\rangle
\displaystyle\leq xi(t)x~i(t)xi(t)i(t)i(t1)c(t)𝟏xi(t)\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\left\|{\ell_{i}^{(t)}-\ell_{i}^{(t-1)}-c^{(t)}\mathbf{1}}\right\|_{x_{i}^{(t)}}
\displaystyle\leq xi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1)).\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}. (22)

It is immediate from the definitions of x~i(t)\tilde{x}_{i}^{(t)} (in (19)) and xi(t)x_{i}^{(t)} (in (1)) that for j[ni]j\in[n_{i}],

xi(t)(j)=x~i(t1)(j)exp(ηi(t1)(j))k[ni]x~i(t1)(k)exp(ηi(t1)(k))=argminxΔnix,ηi(t1)+KL(x;x~i(t1))\displaystyle x_{i}^{(t)}(j)=\frac{\tilde{x}_{i}^{(t-1)}(j)\cdot\exp(-\eta\cdot\ell_{i}^{(t-1)}(j))}{\sum_{k\in[n_{i}]}\tilde{x}_{i}^{(t-1)}(k)\cdot\exp(-\eta\cdot\ell_{i}^{(t-1)}(k))}=\operatorname*{arg\,min}_{x\in\Delta^{n_{i}}}\left\langle x,\eta\cdot\ell_{i}^{(t-1)}\right\rangle+\operatorname{KL}({x};{\tilde{x}_{i}^{(t-1)}}) (23)

Using Lemma A.4 with x=xi(t),=ηi(t1),y=x~i(t1),z=x~i(t)x=x_{i}^{(t)},\ell=\eta\ell_{i}^{(t-1)},y=\tilde{x}_{i}^{(t-1)},z=\tilde{x}_{i}^{(t)}, we obtain

xi(t)x~i(t),i(t1)1ηKL(x~i(t);x~i(t1))1ηKL(x~i(t);xi(t))1ηKL(xi(t);x~i(t1)).\displaystyle\langle x_{i}^{(t)}-\tilde{x}_{i}^{(t)},\ell_{i}^{(t-1)}\rangle\leq\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (24)

Next, we note that, again by (19) and (1), for j[ni]j\in[n_{i}],

x~i(t)(j)=x~i(t1)(j)exp(ηi(t)(j))k[ni]x~i(t1)(k)exp(ηi(t)(k))=argminxΔnix,ηi(t)+KL(x;x~i(t1)).\displaystyle\tilde{x}_{i}^{(t)}(j)=\frac{\tilde{x}_{i}^{(t-1)}(j)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(j))}{\sum_{k\in[n_{i}]}\tilde{x}_{i}^{(t-1)}(k)\cdot\exp(-\eta\cdot\ell_{i}^{(t)}(k))}=\operatorname*{arg\,min}_{x\in\Delta^{n_{i}}}\left\langle x,\eta\cdot\ell_{i}^{(t)}\right\rangle+\operatorname{KL}({x};{\tilde{x}_{i}^{(t-1)}}).

Using Lemma A.4 with x=x~i(t),=ηi(t),y=x~i(t1),z=xx=\tilde{x}_{i}^{(t)},\ell=\eta\ell_{i}^{(t)},y=\tilde{x}_{i}^{(t-1)},z=x^{\star}, we obtain

x~i(t)x,i(t)1ηKL(x;x~i(t1))1ηKL(x;x~i(t))1ηKL(x~i(t);x~i(t1)).\displaystyle\langle\tilde{x}_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (25)

By (21), (22), (24), and (25), we have

xi(t)x,i(t)\displaystyle\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq xi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}
+1ηKL(x~i(t);x~i(t1))1ηKL(x~i(t);xi(t))1ηKL(xi(t);x~i(t1))\displaystyle+\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})
+1ηKL(x;x~i(t1))1ηKL(x;x~i(t))1ηKL(x~i(t);x~i(t1))\displaystyle+\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}})
=\displaystyle= xi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))+1ηKL(x;x~i(t1))1ηKL(x;x~i(t))\displaystyle\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\cdot\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}+\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t-1)}})-\frac{1}{\eta}\operatorname{KL}({x^{\star}};{\tilde{x}_{i}^{(t)}})
1ηKL(x~i(t);xi(t))1ηKL(xi(t);x~i(t1)).\displaystyle-\frac{1}{\eta}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (26)

The statement of the lemma follows by summing (26) over t[T]t\in[T] and using the fact that for any choice of xx^{\star}, KL(x;xi(0))logni\operatorname{KL}({x^{\star}};{x_{i}^{(0)}})\leq\log n_{i}. ∎
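A useful sanity check on identities such as (23) is that the closed-form multiplicative update is the exact minimizer of the linearized loss plus a KL regularizer. The sketch below (illustrative only; the helper names are ours) compares the objective value of the closed-form point against random points of the simplex:

```python
import math, random

def kl(p, q):
    """KL divergence between probability vectors with positive entries."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

def objective(x, ell, y, eta):
    return eta * sum(xj * lj for xj, lj in zip(x, ell)) + kl(x, y)

def closed_form(y, ell, eta):
    """The multiplicative update, as in (23)."""
    w = [yj * math.exp(-eta * lj) for yj, lj in zip(y, ell)]
    s = sum(w)
    return [wj / s for wj in w]

random.seed(0)
y, ell, eta = [0.4, 0.35, 0.25], [0.9, 0.1, 0.5], 0.2
xstar = closed_form(y, ell, eta)
best = objective(xstar, ell, y, eta)
for _ in range(1000):
    z = [random.random() + 1e-9 for _ in range(3)]
    s = sum(z)
    z = [zj / s for zj in z]
    assert objective(z, ell, y, eta) >= best - 1e-12
```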

A.2 Proof of Lemma 4.1

Now we are ready to prove Lemma 4.1. For convenience we restate the lemma.

Lemma 4.1 (restated).

There is a constant C>0C>0 so that the following holds. Suppose any player i[m]i\in[m] follows the Optimistic Hedge updates (1) with step size 0<η<1/C0<\eta<1/C, for an arbitrary sequence of losses i(1),,i(T)[0,1]ni\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in[0,1]^{n_{i}}. Then for any vector xΔnix^{\star}\in\Delta^{n_{i}}, it holds that

t=1Txi(t)x,i(t)logniη+t=1T(η2+Cη2)Varxi(t)(i(t)i(t1))t=1T(1Cη)η2Varxi(t)(i(t1)).\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left(\frac{\eta}{2}+C\eta^{2}\right)\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\sum_{t=1}^{T}\frac{(1-C\eta)\eta}{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right). (27)
Proof.

Lemma A.5 gives that, for any xΔnix^{\star}\in\Delta^{n_{i}},

t=1Txi(t)x,i(t)logniη+t=1Txi(t)x~i(t)xi(t)Varxi(t)(i(t)i(t1))1ηt=1TKL(x~i(t);xi(t))1ηt=1TKL(xi(t);x~i(t1)).\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\sqrt{\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)}-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})-\frac{1}{\eta}\sum_{t=1}^{T}\operatorname{KL}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}). (28)

Note that for any vectors x,x^{\prime}\in\Delta^{n_{i}}, if there is a vector \ell\in\mathbb{R}^{n_{i}} so that for all j\in[n_{i}], x^{\prime}(j)=\frac{x(j)\cdot\exp(\eta\cdot\ell(j))}{\sum_{k\in[n_{i}]}x(k)\cdot\exp(\eta\cdot\ell(k))}, we have that

exp(2η)xxexp(2η).\exp(-2\eta\|\ell\|_{\infty})\leq\left\|\frac{x^{\prime}}{x}\right\|_{\infty}\leq\exp(2\eta\|\ell\|_{\infty}).

Therefore, by (19) and (23), respectively, we obtain that, for η1/4\eta\leq 1/4,

exp(2ηi(t)i(t1))\displaystyle\exp(-2\eta\|\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\|_{\infty}) x~i(t)xi(t)exp(2ηi(t)i(t1))exp(4η)1+8η\displaystyle\leq\left\|{\frac{\tilde{x}_{i}^{(t)}}{x_{i}^{(t)}}}\right\|_{\infty}\leq\exp(2\eta\|\ell_{i}^{(t)}-\ell_{i}^{(t-1)}\|_{\infty})\leq\exp(4\eta)\leq 1+8\eta
exp(2ηi(t1))\displaystyle\exp(-2\eta\|\ell_{i}^{(t-1)}\|_{\infty}) xi(t)x~i(t1)exp(2ηi(t1))exp(2η)1+4η.\displaystyle\leq\left\|{\frac{x_{i}^{(t)}}{\tilde{x}_{i}^{(t-1)}}}\right\|_{\infty}\leq\exp(2\eta\|\ell_{i}^{(t-1)}\|_{\infty})\leq\exp(2\eta)\leq 1+4\eta. (29)

(Above we have also used that \|\ell_{i}^{(t)}\|_{\infty}\leq 1 for all t.) Thus, for \eta\leq\frac{1}{16}, we can apply Lemma A.2 to show, for a sufficiently large constant C_{0},

KL(x~i(t);xi(t))\displaystyle\operatorname{KL}(\tilde{x}_{i}^{(t)};x_{i}^{(t)}) χ2(x~i(t);xi(t))(1/2C0η)\displaystyle\geq\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})\cdot(1/2-C_{0}\eta) (30)
KL(xi(t);x~i(t1))\displaystyle\operatorname{KL}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)}) χ2(xi(t);x~i(t1))(1/2C0η).\displaystyle\geq\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)})\cdot(1/2-C_{0}\eta). (31)

Note also that for distributions x,y\in\Delta^{n_{i}} we have that \chi^{2}(x;y)=\left(\left\|{x-y}\right\|_{y}^{\star}\right)^{2}. By Lemma A.3 and (19), we have that, for a sufficiently large constant C_{1}, as long as \eta\leq 1/C_{1},

(xi(t)x~i(t)xi(t))2=χ2(x~i(t);xi(t))(1+C1η)η2Varxi(t)(i(t)i(t1))\displaystyle\left(\left\|{x_{i}^{(t)}-\tilde{x}_{i}^{(t)}}\right\|_{x_{i}^{(t)}}^{\star}\right)^{2}=\chi^{2}({\tilde{x}_{i}^{(t)}};{x_{i}^{(t)}})\leq(1+C_{1}\eta)\eta^{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right) (32)

and

χ2(x~i(t);xi(t))(1C1η)η2Varxi(t)(i(t)i(t1)).\displaystyle\chi^{2}(\tilde{x}_{i}^{(t)};x_{i}^{(t)})\geq(1-C_{1}\eta)\eta^{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right). (33)

Next we lower bound χ2(xi(t);x~i(t1))\chi^{2}({x_{i}^{(t)}};{\tilde{x}_{i}^{(t-1)}}) as follows, where C2C_{2} denotes a sufficiently large constant: as long as η1/C2\eta\leq 1/C_{2},

χ2(xi(t);x~i(t1))\displaystyle\chi^{2}(x_{i}^{(t)};\tilde{x}_{i}^{(t-1)}) χ2(x~i(t1);xi(t))exp(2η)\displaystyle\geq\chi^{2}(\tilde{x}_{i}^{(t-1)};x_{i}^{(t)})\cdot\exp(-2\eta) (34)
(1C2η)η2Varxi(t)(i(t1)),\displaystyle\geq(1-C_{2}\eta)\eta^{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right), (35)

where (34) follows from Lemma A.1 and (29), and (35) follows from Lemma A.3 and (23).

Combining (28), (30), (31), (32), (33), and (35) gives that for a sufficiently large constant CC, as long as η<1/C\eta<1/C,

\displaystyle\sum_{t=1}^{T}\langle x_{i}^{(t)}-x^{\star},\ell_{i}^{(t)}\rangle\leq\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left((\eta/2+C\eta^{2})\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\frac{(1-C\eta)\eta}{2}\cdot\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)\right),

as desired. ∎

Appendix B Proofs for Section 4.3

In this section we give the full proof of Lemma 4.4. In Section B.1 we introduce some preliminaries. In Section B.2 we prove Lemma 4.5, the “boundedness chain rule” for finite differences. In Section B.4 we show how to use this lemma to prove Lemma 4.4.

B.1 Additional preliminaries

In this section we introduce some additional notation and basic combinatorial lemmas. Definition B.1 introduces the shift operator \operatorname{E}_{s}{}, which, like the finite difference operator \operatorname{D}_{h}{}, maps one sequence to another sequence.

Definition B.1 (Shift operator).

Suppose L=(L(1),,L(T))L=(L^{(1)},\ldots,L^{(T)}) is a sequence of vectors L(t)nL^{(t)}\in\mathbb{R}^{n}. For integers s0s\geq 0, the ss-shift sequence for the sequence LL, denoted by EsL\operatorname{E}_{s}{L}, is the sequence EsL=((EsL)(1),,(EsL)(Ts))\operatorname{E}_{s}{L}=(\left(\operatorname{E}_{s}{L}\right)^{(1)},\ldots,\left(\operatorname{E}_{s}{L}\right)^{(T-s)}), defined by (EsL)(t)=L(t+s)\left(\operatorname{E}_{s}{L}\right)^{(t)}=L^{(t+s)} for 1tTs1\leq t\leq T-s.

For sequences L=(L^{(1)},\ldots,L^{(T)}) and K=(K^{(1)},\ldots,K^{(T)}) of real numbers, we denote by L\cdot K the product sequence L\cdot K:=(L^{(1)}K^{(1)},\ldots,L^{(T)}K^{(T)}). Lemmas B.1 and B.2 below are standard analogues of the product rule for finite differences. The (straightforward) proofs are provided for completeness.

Lemma B.1 (Product rule; Eq. (2.55) of [GKP89]).

Suppose L=(L(1),,L(T))L=(L^{(1)},\ldots,L^{(T)}) and K=(K(1),,K(T))K=(K^{(1)},\ldots,K^{(T)}) are sequences of real numbers. Then the product sequence LKL\cdot K satisfies

D1(LK)=LD1K+D1LE1K.\operatorname{D}_{1}{(L\cdot K)}=L\cdot\operatorname{D}_{1}{K}+\operatorname{D}_{1}{L}\cdot\operatorname{E}_{1}{K}.
Proof.

We compute

D1(LK)(t)\displaystyle\operatorname{D}_{1}{(L\cdot K)}^{(t)} =L(t+1)K(t+1)L(t)K(t)\displaystyle=L^{(t+1)}K^{(t+1)}-L^{(t)}K^{(t)}
=L(t+1)K(t+1)L(t)K(t+1)+L(t)K(t+1)L(t)K(t)\displaystyle=L^{(t+1)}K^{(t+1)}-L^{(t)}K^{(t+1)}+L^{(t)}K^{(t+1)}-L^{(t)}K^{(t)}
=(LD1K+D1LE1K)(t).\displaystyle=(L\cdot\operatorname{D}_{1}{K}+\operatorname{D}_{1}{L}\cdot\operatorname{E}_{1}{K})^{(t)}.
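The operators \operatorname{D}_{1} and \operatorname{E}_{1}, together with the product rule of Lemma B.1, translate directly into code on finite sequences. A minimal sketch (illustrative only; the names are ours):

```python
def D1(seq):
    """Finite difference: (D1 L)^(t) = L^(t+1) - L^(t)."""
    return [b - a for a, b in zip(seq, seq[1:])]

def E1(seq):
    """Shift operator of Definition B.1 with s = 1: (E1 L)^(t) = L^(t+1)."""
    return seq[1:]

L = [1.0, 4.0, 2.0, 7.0, 3.5]
K = [2.0, -1.0, 0.5, 3.0, 6.0]
prod = [l * k for l, k in zip(L, K)]
# Lemma B.1: D1(L*K) = L * D1(K) + D1(L) * E1(K)
lhs = D1(prod)
rhs = [l * dk + dl * ek for l, dk, dl, ek in zip(L, D1(K), D1(L), E1(K))]
assert lhs == rhs
```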

Lemma B.2 (Multivariate product rule).

Suppose that mm\in\mathbb{N} and for 1im1\leq i\leq m, Li=(Li(1),,Li(T))L_{i}=(L_{i}^{(1)},\ldots,L_{i}^{(T)}) are sequences of real numbers. Then the product sequence i=1mLi\prod_{i=1}^{m}L_{i} satisfies

D1i=1mLi=i=1m(i<iLi)D1Li(i>iE1Li).\operatorname{D}_{1}{\prod_{i=1}^{m}L_{i}}=\sum_{i=1}^{m}\left(\prod_{i^{\prime}<i}L_{i^{\prime}}\right)\cdot\operatorname{D}_{1}{L_{i}}\cdot\left(\prod_{i^{\prime}>i}\operatorname{E}_{1}{L_{i^{\prime}}}\right).
Proof.

We compute

(D1i=1mLi)(t)\displaystyle\left({\operatorname{D}_{1}{\prod_{i=1}^{m}L_{i}}}\right)^{(t)} =i=1mLi(t+1)i=1mLi(t)\displaystyle=\prod_{i=1}^{m}L_{i}^{(t+1)}-\prod_{i=1}^{m}L_{i}^{(t)}
=i=1m(iiLi(t+1)i>iLi(t)i<iLi(t+1)iiLi(t))\displaystyle=\sum_{i=1}^{m}\left({\prod_{i^{\prime}\leq i}L_{i^{\prime}}^{(t+1)}\prod_{i^{\prime}>i}L_{i^{\prime}}^{(t)}-\prod_{i^{\prime}<i}L_{i^{\prime}}^{(t+1)}\prod_{i^{\prime}\geq i}L_{i^{\prime}}^{(t)}}\right)
=i=1m(i<iLi(t+1)i>iLi(t)(Li(t+1)Li(t)))\displaystyle=\sum_{i=1}^{m}\left({\prod_{i^{\prime}<i}L_{i^{\prime}}^{(t+1)}\cdot\prod_{i^{\prime}>i}L_{i^{\prime}}^{(t)}\cdot\left({L_{i}^{(t+1)}-L_{i}^{(t)}}\right)}\right)
=(i=1m(i<iLi)D1Li(i>iE1Li))(t).\displaystyle=\left({\sum_{i=1}^{m}\left(\prod_{i^{\prime}<i}L_{i^{\prime}}\right)\cdot\operatorname{D}_{1}{L_{i}}\cdot\left(\prod_{i^{\prime}>i}\operatorname{E}_{1}{L_{i^{\prime}}}\right)}\right)^{(t)}.
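Lemma B.2 can likewise be checked mechanically: differencing a pointwise product of m sequences expands into m terms, with factors before index i left un-shifted and factors after index i shifted. A sketch (illustrative only; the names are ours):

```python
import math

def D1(seq):
    return [b - a for a, b in zip(seq, seq[1:])]

def E1(seq):
    return seq[1:]

def seq_prod(seqs):
    """Pointwise product of equal-length sequences."""
    return [math.prod(s[t] for s in seqs) for t in range(len(seqs[0]))]

seqs = [[1.0, 2.0, 0.5, 3.0], [4.0, 1.0, 2.0, 2.0], [0.5, 3.0, 1.0, 5.0]]
T = len(seqs[0]) - 1
lhs = D1(seq_prod(seqs))
rhs = [0.0] * T
for i in range(len(seqs)):
    term = seq_prod(
        [s[:T] for s in seqs[:i]]          # L_{i'} for i' < i (un-shifted)
        + [D1(seqs[i])]                    # D1 L_i
        + [E1(s) for s in seqs[i + 1:]]    # E1 L_{i'} for i' > i
    )
    rhs = [r + t for r, t in zip(rhs, term)]
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```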

Lemma B.4, together with Lemma B.3 (which is used in its proof), serves to bound certain sums with many terms in the proof of Lemma 4.5. To state Lemma B.3 we make one definition. For positive integers k,m and any h,C>0, define

Rh,m,k,C=0n1,,nkm(i=1kninihi=1kni)C,R_{h,m,k,C}=\sum_{0\leq n_{1},\cdots,n_{k}\leq m}\left({\frac{\prod_{i=1}^{k}n_{i}^{n_{i}}}{h^{\sum_{i=1}^{k}n_{i}}}}\right)^{C},

where the sum is over integers n1,,nkn_{1},\ldots,n_{k} satisfying 0nim0\leq n_{i}\leq m for i[k]i\in[k]. In the definition of Rh,m,k,CR_{h,m,k,C}, the quantity 000^{0} (which arises when some ni=0n_{i}=0) is interpreted as 1.

Lemma B.3.

If k,m are positive integers and h,C>0 satisfy m\leq h/2, C\geq 2, and h\geq 8, then

Rh,m,k,Cexp(2khC).R_{h,m,k,C}\leq\exp\left({\frac{2k}{h^{C}}}\right).
Proof of Lemma B.3.

We may rewrite Rh,m,k,CR_{h,m,k,C} and then upper bound it as follows:

Rh,m,k,C\displaystyle R_{h,m,k,C} =(j=0m(jh)Cj)k\displaystyle=\left({\sum_{j=0}^{m}\left({\frac{j}{h}}\right)^{Cj}}\right)^{k}
(1+(1h)C+(m1)max((2h)2C,(mh)mC))k\displaystyle\leq\left({1+\left({\frac{1}{h}}\right)^{C}+(m-1)\max\left({\left({\frac{2}{h}}\right)^{2C},\left({\frac{m}{h}}\right)^{mC}}\right)}\right)^{k} (36)
(1+(1h)C+(h/2)max((2h)2C,(12)hC/2))k\displaystyle\leq\left({1+\left({\frac{1}{h}}\right)^{C}+(h/2)\max\left({\left({\frac{2}{h}}\right)^{2C},\left({\frac{1}{2}}\right)^{hC/2}}\right)}\right)^{k}

where (36) follows since (ih)Ci\left({\frac{i}{h}}\right)^{Ci} is convex in ii for i0i\geq 0, and therefore, in the interval [2,m][2,h/2][2,m]\subseteq[2,h/2], takes on maximal values at the endpoints. We see

(h/2)(2h)2C=(2h)2C1(1h)C(h/2)\left({\frac{2}{h}}\right)^{2C}=\left({\frac{2}{h}}\right)^{2C-1}\leq\left({\frac{1}{h}}\right)^{C}

for h8h\geq 8 when C2C\geq 2. Also,

(h/2)(12)hC/2(1h)C(h/2)\left({\frac{1}{2}}\right)^{hC/2}\leq\left({\frac{1}{h}}\right)^{C}

for h8h\geq 8 when C2C\geq 2. (This inequality is easily seen to be equivalent to the fact that (C+1)loghCh21(C+1)\log h-\frac{Ch}{2}\leq 1, which follows from the fact that loghh/20\log h-h/2\leq 0 for h8h\geq 8 and 3loghh13\log h-h\leq 1 for h8h\geq 8.) Therefore,

Rh,m,k,C\displaystyle R_{h,m,k,C} (1+(1h)C+(h/2)max((2h)2C,(12)hC/2))k\displaystyle\leq\left({1+\left({\frac{1}{h}}\right)^{C}+(h/2)\max\left({\left({\frac{2}{h}}\right)^{2C},\left({\frac{1}{2}}\right)^{hC/2}}\right)}\right)^{k}
(1+2(1h)C)k\displaystyle\leq\left({1+2\left({\frac{1}{h}}\right)^{C}}\right)^{k}
exp(2khC).\displaystyle\leq\exp\left({\frac{2k}{h^{C}}}\right).
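Since the sum over tuples factorizes, R_{h,m,k,C}=\left(\sum_{j=0}^{m}(j/h)^{Cj}\right)^{k}, which makes the bound of Lemma B.3 easy to test numerically for small parameters satisfying m\leq h/2, C\geq 2, h\geq 8. A sketch (illustrative only; the names are ours):

```python
import math
from itertools import product

def R(h, m, k, C):
    """R_{h,m,k,C} via the factorized form (0^0 = 1, as in the text)."""
    inner = sum((j ** j / h ** j) ** C for j in range(m + 1))
    return inner ** k

def R_brute(h, m, k, C):
    """Direct sum over all tuples (n_1,...,n_k) with 0 <= n_i <= m."""
    return sum(
        (math.prod(n ** n for n in ns) / h ** sum(ns)) ** C
        for ns in product(range(m + 1), repeat=k)
    )

for h, m, k, C in [(8, 4, 2, 2), (10, 5, 3, 2), (16, 8, 2, 3)]:
    r = R(h, m, k, C)
    assert abs(r - R_brute(h, m, k, C)) < 1e-9
    assert r <= math.exp(2 * k / h ** C)  # bound of Lemma B.3
```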

Lemma B.4.

Fix integers h0,k1h\geq 0,k\geq 1. For any function π:[h][k]\pi:[h]\rightarrow[k], define, for each i[k]i\in[k], hi(π)=|{q[h]|π(q)=i}|h_{i}(\pi)=\left|\left\{q\in[h]|\pi(q)=i\right\}\right|. Then, for any C3C\geq 3,

π:[h][k]i=1khi(π)Chi(π)hChmax{k7,(hk+1)exp(2khC1)}.\sum_{\pi:[h]\rightarrow[k]}\frac{\prod_{i=1}^{k}h_{i}(\pi)^{Ch_{i}(\pi)}}{h^{Ch}}\leq\max\left\{k^{7},(hk+1)\cdot\exp\left(\frac{2k}{h^{C-1}}\right)\right\}. (37)
Proof.

In the case that h7h\leq 7, we simply use the fact that the number of functions π:[h][k]\pi:[h]\rightarrow[k] is khk7k^{h}\leq k^{7}, and each term of the summation on the left-hand side of (37) is at most 1. In the remainder of the proof we may thus assume that h8h\geq 8.

For any tuple (h1,,hk)(h_{1},\cdots,h_{k}) of non-negative integers with i=1khi=h\sum_{i=1}^{k}h_{i}=h, there are (hh1,h2,,hk)hhihihi{h\choose h_{1},h_{2},\cdots,h_{k}}\leq\frac{h^{h}}{\prod_{i}h_{i}^{h_{i}}} (see [CS04, Lemma 2.2] for a proof of this inequality) functions π:[h][k]\pi:[h]\rightarrow[k] such that hi(π)=hih_{i}(\pi)=h_{i} for all i[k]i\in[k]. Combining these like terms,

π:[h][k]ihi(π)Chi(π)hCh\displaystyle\sum_{\pi:[h]\rightarrow[k]}\frac{\prod_{i}h_{i}(\pi)^{Ch_{i}(\pi)}}{h^{Ch}} h1,,hk0hi=hhhihihi(ihihihh)C\displaystyle\leq\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ \sum h_{i}=h\end{subarray}}\frac{h^{h}}{\prod_{i}h_{i}^{h_{i}}}\cdot\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C}
h1,,hk0hi=h(ihihihh)C1.\displaystyle\leq\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1}. (38)

We bound this sum in two cases, according to whether or not h_{\max}:=\max_{i}\{h_{i}\} is greater than h/2. The contribution to this sum coming from terms with h_{\max}\leq h/2 is

h1,,hk0h1,,hkh/2hi=h(ihihihh)C1\displaystyle\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{1},\cdots,h_{k}\leq h/2\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1} h1,,hk0h1,,hkh/2(ihihihhi)C1\displaystyle\leq\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{1},\cdots,h_{k}\leq h/2\\ \end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{\sum h_{i}}}}\right)^{C-1}
=Rh,h/2,k,C1\displaystyle=R_{h,\lfloor h/2\rfloor,k,C-1}
exp(2khC1),\displaystyle\leq\exp\left({\frac{2k}{h^{C-1}}}\right), (39)

by Lemma B.3.

We next consider the case where hmax>h/2h_{\max}>h/2. For a specific term (h1,,hk)(h_{1},\cdots,h_{k}) with maxi{hi}>h/2\max_{i}\{h_{i}\}>h/2, we know there is a unique M[k]M\in[k] such that hM=maxi{hi}h_{M}=\max_{i}\{h_{i}\} since i=1khi=h\sum_{i=1}^{k}h_{i}=h. So, we can represent the contribution to the sum from this case as

M=1kh1,,hk0hM>h/2hi=h(ihihihh)C1\displaystyle\sum_{M=1}^{k}\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{M}>h/2\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1} =kh1,,hk0hk>h/2hi=h(ihihihh)C1\displaystyle=k\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k}\geq 0\\ h_{k}>h/2\\ \sum h_{i}=h\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{h}}}\right)^{C-1} (40)
kd=0h/2((hd)hdhhd)C1h1,,hk10hi=d(ihihihd)C1\displaystyle\leq k\sum_{d=0}^{\lfloor h/2\rfloor}\left({\frac{(h-d)^{h-d}}{h^{h-d}}}\right)^{C-1}\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k-1}\geq 0\\ \sum h_{i}=d\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{d}}}\right)^{C-1} (41)
kd=0h/2h1,,hk10h1,,hk1d(ihihihhi)C1\displaystyle\leq k\sum_{d=0}^{\lfloor h/2\rfloor}\sum_{\begin{subarray}{c}h_{1},\cdots,h_{k-1}\geq 0\\ h_{1},\cdots,h_{k-1}\leq d\end{subarray}}\left({\frac{\prod_{i}h_{i}^{h_{i}}}{h^{\sum h_{i}}}}\right)^{C-1}
=kd=0h/2Rh,d,k1,C1\displaystyle=k\sum_{d=0}^{\lfloor h/2\rfloor}R_{h,d,k-1,C-1}
khexp(2khC1),\displaystyle\leq kh\cdot\exp\left(\frac{2k}{h^{C-1}}\right), (42)

where (40) follows by symmetry, (41) follows by factoring out the contribution of \left(\frac{h_{k}^{h_{k}}}{h^{h_{k}}}\right)^{C-1} and letting d=h-h_{k}, and (42) follows by Lemma B.3.

The statement of the lemma follows from (38), (39), and (42). ∎
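For small h and k, the left-hand side of (37) can be evaluated by brute force over all k^h functions π, giving a direct check of Lemma B.4. A sketch (illustrative only; the names are ours):

```python
import math
from itertools import product
from collections import Counter

def lhs(h, k, C):
    """Left-hand side of (37): sum over all functions pi: [h] -> [k]."""
    total = 0.0
    for pi in product(range(k), repeat=h):
        counts = Counter(pi)  # h_i(pi) for the colors that actually appear
        total += math.prod(c ** (C * c) for c in counts.values()) / h ** (C * h)
    return total

def rhs(h, k, C):
    return max(k ** 7, (h * k + 1) * math.exp(2 * k / h ** (C - 1)))

for h, k, C in [(2, 2, 3), (4, 2, 3), (5, 3, 3), (6, 2, 4)]:
    assert lhs(h, k, C) <= rhs(h, k, C)
```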

Lemma B.5.

For nn\in\mathbb{N}, let ξ1,,ξn0\xi_{1},\ldots,\xi_{n}\geq 0 such that ξ1++ξn=1\xi_{1}+\cdots+\xi_{n}=1. For each j[n]j\in[n], define ϕj:n\phi_{j}:\mathbb{R}^{n}\rightarrow\mathbb{R} to be the function

ϕj((z1,,zn))=ξjexp(zj)k=1nξkexp(zk)\displaystyle\phi_{j}((z_{1},\ldots,z_{n}))=\frac{\xi_{j}\exp(z_{j})}{\sum_{k=1}^{n}\xi_{k}\cdot\exp(z_{k})}

and let Pϕj(z)=γ0naj,γzγP_{\phi_{j}}(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{j,\gamma}\cdot z^{\gamma} denote the Taylor series of ϕj\phi_{j}. Then for any j[n]j\in[n] and any integer k1k\geq 1,

γ0n:|γ|=k|aj,γ|ξjek+1.\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right|\leq\xi_{j}e^{k+1}.
Proof.

Note that, for each j[n]j\in[n],

\displaystyle a_{j,\gamma}=\frac{1}{\gamma_{1}!\gamma_{2}!\cdots\gamma_{n}!}\cdot\frac{\partial^{k}\phi_{j}(0)}{\partial z_{1}^{\gamma_{1}}\partial z_{2}^{\gamma_{2}}\cdots\partial z_{n}^{\gamma_{n}}},

and so

γ0n:|γ|=k|aj,γ|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right| =γ0n:|γ|=k1γ1!γ2!γn!|kϕj(0)z1γ1z2γ2znγn|\displaystyle=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\frac{1}{\gamma_{1}!\gamma_{2}!\cdots\gamma_{n}!}\cdot\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{1}^{\gamma_{1}}\partial z_{2}^{\gamma_{2}}\cdots z_{n}^{\gamma_{n}}}\right|
=1k!γ0n:|γ|=kk!γ1!γ2!γn!|kϕj(0)z1γ1z2γ2znγn|\displaystyle=\frac{1}{k!}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\frac{k!}{\gamma_{1}!\gamma_{2}!\cdots\gamma_{n}!}\cdot\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{1}^{\gamma_{1}}\partial z_{2}^{\gamma_{2}}\cdots z_{n}^{\gamma_{n}}}\right|
=1k!t[n]k|kϕj(0)zt1zt2ztk|.\displaystyle=\frac{1}{k!}\sum_{t\in[n]^{k}}\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{t_{1}}\partial z_{t_{2}}\cdots\partial z_{t_{k}}}\right|.

It is straightforward to see that the following equalities hold for any i[n]i\in[n], iji\neq j:

ϕjzj\displaystyle\frac{\partial\phi_{j}}{\partial z_{j}} =ϕj(1ϕj)\displaystyle=\phi_{j}(1-\phi_{j})
ϕjzi\displaystyle\frac{\partial\phi_{j}}{\partial z_{i}} =ϕiϕj\displaystyle=-\phi_{i}\phi_{j}
(1ϕj)zj\displaystyle\frac{\partial(1-\phi_{j})}{\partial z_{j}} =ϕj(1ϕj)\displaystyle=-\phi_{j}(1-\phi_{j})
(1ϕj)zi\displaystyle\frac{\partial(1-\phi_{j})}{\partial z_{i}} =ϕiϕj\displaystyle=\phi_{i}\phi_{j}
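These softmax derivative identities are also easy to confirm with finite differences, a quick guard against sign errors. A sketch (illustrative only; the helper names are ours), evaluating the identities at a generic point rather than at 0:

```python
import math

def phi(z, xi):
    """phi_j(z) = xi_j e^{z_j} / sum_k xi_k e^{z_k}, returned for all j."""
    w = [x * math.exp(zj) for x, zj in zip(xi, z)]
    s = sum(w)
    return [wj / s for wj in w]

def partial(j, i, z, xi, eps=1e-6):
    """Central finite-difference approximation of d(phi_j)/d(z_i)."""
    zp, zm = list(z), list(z)
    zp[i] += eps
    zm[i] -= eps
    return (phi(zp, xi)[j] - phi(zm, xi)[j]) / (2 * eps)

xi = [0.5, 0.3, 0.2]
z = [0.1, -0.4, 0.7]
p = phi(z, xi)
for j in range(3):
    for i in range(3):
        # d(phi_j)/d(z_j) = phi_j (1 - phi_j);  d(phi_j)/d(z_i) = -phi_i phi_j for i != j
        expected = p[j] * (1 - p[j]) if i == j else -p[i] * p[j]
        assert abs(partial(j, i, z, xi) - expected) < 1e-6
```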

We claim that for any (t_{1},\ldots,t_{k})\in[n]^{k}, we can express \frac{\partial^{k}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}} as a polynomial in \phi_{1},\cdots,\phi_{n},(1-\phi_{1}),\cdots,(1-\phi_{n}) consisting of k! monomials, each of degree k+1. We verify this by induction, first noting that after taking zero derivatives, the function \phi_{j} is a degree-1 monomial. Assume that for some sequence b_{1},\ldots,b_{(\ell-1)!}\in\{0,1\}, we can express

1ϕjzt1zt1=f=1(1)!(1)bfd=01mf,d\displaystyle\frac{\partial^{\ell-1}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{\ell-1}}}=\sum_{f=1}^{(\ell-1)!}(-1)^{b_{f}}\prod_{d=0}^{\ell-1}m_{f,d}

where each mf,d{ϕ1,,ϕn,(1ϕ1),,(1ϕn)}m_{f,d}\in\left\{\phi_{1},\cdots,\phi_{n},(1-\phi_{1}),\cdots,(1-\phi_{n})\right\}. We see that for each ff, there is some sequence of bits bf,0,,bf,1{0,1}b_{f,0},\ldots,b_{f,\ell-1}\in\{0,1\} so that

ztd=01mf,d=d=01(1)bf,dmf,0mf,dmf,d,\displaystyle\frac{\partial}{\partial z_{t_{\ell}}}\prod_{d=0}^{\ell-1}m_{f,d}=\sum_{d=0}^{\ell-1}(-1)^{b_{f,d}}\cdot m_{f,0}\cdots m_{f,d}^{\prime}\cdots m_{f,d,\ell} (43)

where we define, for each 0d10\leq d\leq\ell-1,

mf,d and mf,d,={mf,d and ϕt if mf,d=ϕi with itmf,d and (1ϕt) if mf,d=ϕt(1mf,d) and ϕt if mf,d=1ϕi with it(1mf,d) and (1ϕt) if mf,d=1ϕt.\displaystyle m_{f,d}^{\prime}\text{ and }m_{f,d,\ell}=\begin{cases}m_{f,d}\text{ and }\phi_{t_{\ell}}&\text{ if }m_{f,d}=\phi_{i}\text{ with }i\neq t_{\ell}\\ m_{f,d}\text{ and }(1-\phi_{t_{\ell}})&\text{ if }m_{f,d}=\phi_{t_{\ell}}\\ (1-m_{f,d})\text{ and }\phi_{t_{\ell}}&\text{ if }m_{f,d}=1-\phi_{i}\text{ with }i\neq t_{\ell}\\ (1-m_{f,d})\text{ and }(1-\phi_{t_{\ell}})&\text{ if }m_{f,d}=1-\phi_{t_{\ell}}.\end{cases}

Thus, ϕjzt1zt\frac{\partial^{\ell}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{\ell}}} can be expressed as a sum of !\ell! monomials of degree (+1)(\ell+1), completing the inductive step.

This inductive argument also demonstrates a bijection between the k!k! monomials of kϕjzt1ztk\frac{\partial^{k}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}} and a combinatorial structure that we call factorial trees. Formally, we define a factorial tree to be a directed graph on vertices {0,1,,k}\left\{0,1,\cdots,k\right\} such that each vertex i0i\neq 0 has a single incoming edge from one of the vertices in [0,i1][0,i-1]. (For a non-negative integer ii, we write [0,i]:={0,1,,i}[0,i]:=\{0,1,\ldots,i\}.) For a factorial tree ff, let pf()[0,1]p_{f}(\ell)\in[0,\ell-1] denote the parent of a vertex \ell. A particular factorial tree ff represents the monomial that was generated by choosing the pf()thp_{f}(\ell)^{\text{th}} term in (43) for derivation when taking the derivative zt\frac{\partial}{\partial z_{t_{\ell}}}, for each [k]\ell\in[k]. (See Figure 1 for an example.)

Refer to caption
Figure 1: A monomial ϕjϕiϕjϕk-\phi_{j}\phi_{i}\phi_{j}\phi_{k} of 3ϕjzizjzk\frac{\partial^{3}\phi_{j}}{\partial z_{i}\partial z_{j}\partial z_{k}} and its corresponding factorial tree

Each of the k!k! monomials comprising kϕjzt1ztk\frac{\partial^{k}\phi_{j}}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}} is a product of k+1k+1 terms corresponding to indices j,t1,,tkj,t_{1},\cdots,t_{k} (i.e., the first term in the product is either ϕj\phi_{j} or 1ϕj1-\phi_{j}, the second term is either ϕt1\phi_{t_{1}} or 1ϕt11-\phi_{t_{1}}, and so on). We say that a term corresponding to index i[n]i\in[n] is perturbed if it is (1ϕi)(1-\phi_{i}) (as opposed to ϕi\phi_{i}). From our construction, we see that the th\ell^{\text{th}} term is perturbed if t=tpf()t_{\ell}=t_{p_{f}(\ell)} and there is no \ell^{\prime} such that pf()=p_{f}(\ell^{\prime})=\ell. That is, \ell is a leaf in the corresponding factorial tree ff and the parent of \ell corresponds to the same index as \ell. One can think of t1,,tkt_{1},\cdots,t_{k} as a coloring of all the vertices of the factorial tree with nn colors, except the root of the tree (vertex 0) which has fixed color jj. Then, we can say the th\ell^{\text{th}} term is perturbed if and only if \ell is a leaf with the same color as its parent. We call such a leaf a petal. For t[n]t\in[n], we let Pf,t[k]P_{f,t}\subseteq[k] be the set of petals on tree ff with color tt, Lf[k]L_{f}\subseteq[k] be the set of leaves of tree ff, and Bf=[k]LfB_{f}=[k]\setminus L_{f} be the set of all non-leaves other than the fixed-color root. Therefore,

γ0n:|γ|=k|aj,γ|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right| =1k!t[n]k|kϕj(0)zt1ztk|\displaystyle=\frac{1}{k!}\sum_{t\in[n]^{k}}\left|\frac{\partial^{k}\phi_{j}(0)}{\partial z_{t_{1}}\cdots\partial z_{t_{k}}}\right|
1k!t[n]kf=0k(ϕt(0)𝟙[Pf,t]+(1ϕt(0))𝟙[Pf,t])\displaystyle\leq\frac{1}{k!}\sum_{t\in[n]^{k}}\sum_{f}\prod_{\ell=0}^{k}(\phi_{t_{\ell}}(0)\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\phi_{t_{\ell}}(0))\cdot\mathbbm{1}[\ell\in P_{f,t}])
(where we let t0=jt_{0}=j for notational convenience)
=1k!t[n]kf=0k(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle=\frac{1}{k!}\sum_{t\in[n]^{k}}\sum_{f}\prod_{\ell=0}^{k}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
=1k!ftBf[n]BftLf[n]Lf=0k(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t]),\displaystyle=\frac{1}{k!}\sum_{f}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\sum_{t_{L_{f}}\in[n]^{L_{f}}}\prod_{\ell=0}^{k}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}]),

where in the last step we decompose, for each factorial tree ff, t[n]kt\in[n]^{k} into the tuple of indices tBf[n]Bft_{B_{f}}\in[n]^{B_{f}} corresponding to the non-leaves BfB_{f}, and the tuple of indices tLf[n]Lft_{L_{f}}\in[n]^{L_{f}} corresponding to the leaves LfL_{f}.

We note that, fixing tree ff and the colors of all non-leaves tBt_{B},

tLf[n]LfLf(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle\sum_{t_{L_{f}}\in[n]^{L_{f}}}\prod_{\ell\in L_{f}}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
=Lf(t[n]ξt𝟙[ttpf()]+(1ξt)𝟙[t=tpf()])\displaystyle=\prod_{\ell\in L_{f}}\left({\sum_{t_{\ell}\in[n]}\xi_{t_{\ell}}\cdot\mathbbm{1}[t_{\ell}\neq t_{p_{f}(\ell)}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[t_{\ell}=t_{p_{f}(\ell)}]}\right)
=Lf(22ξtpf())\displaystyle=\prod_{\ell\in L_{f}}\left({2-2\xi_{t_{p_{f}(\ell)}}}\right)
2|Lf|\displaystyle\leq 2^{|L_{f}|}

And so,

1k!ftBf[n]BftLf[n]Lf=0k(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle\frac{1}{k!}\sum_{f}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\sum_{t_{L_{f}}\in[n]^{L_{f}}}\prod_{\ell=0}^{k}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
1k!f2|Lf|tBf[n]BfBf{0}(ξt𝟙[Pf,t]+(1ξt)𝟙[Pf,t])\displaystyle\leq\frac{1}{k!}\sum_{f}2^{|L_{f}|}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\prod_{\ell\in B_{f}\cup\left\{0\right\}}(\xi_{t_{\ell}}\cdot\mathbbm{1}[\ell\not\in P_{f,t}]+(1-\xi_{t_{\ell}})\cdot\mathbbm{1}[\ell\in P_{f,t}])
=1k!f2|Lf|tBf[n]BfBf{0}ξt\displaystyle=\frac{1}{k!}\sum_{f}2^{|L_{f}|}\sum_{t_{B_{f}}\in[n]^{B_{f}}}\prod_{\ell\in B_{f}\cup\left\{0\right\}}\xi_{t_{\ell}}
(as no non-leaf can ever be a petal)
=ξjk!f2|Lf|Bf(t[n]ξt)\displaystyle=\frac{\xi_{j}}{k!}\sum_{f}2^{|L_{f}|}\prod_{\ell\in B_{f}}\left({\sum_{t_{\ell}\in[n]}\xi_{t_{\ell}}}\right)
=ξjk!f2|Lf|=ξj𝔼f𝒰()[2|Lf|]\displaystyle=\frac{\xi_{j}}{k!}\sum_{f}2^{|L_{f}|}=\xi_{j}\mathbb{E}_{f\sim\mathcal{U}(\mathcal{F})}\left[{2^{|L_{f}|}}\right]

where \mathcal{F} is the set of all factorial trees and 𝒰()\mathcal{U}(\mathcal{F}) is the uniform distribution over \mathcal{F}. For a specific vertex [0,k]\ell\in[0,k], we note that Lf\ell\in L_{f} if and only if it is not the parent of any of the vertices +1,,k\ell+1,\cdots,k. So,

Prf𝒰()[Lf]=i=+1ki1i=k\Pr_{f\sim\mathcal{U}(\mathcal{F})}\left[{\ell\in L_{f}}\right]=\prod_{i=\ell+1}^{k}\frac{i-1}{i}=\frac{\ell}{k} (44)
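Equation (44) can be checked by brute force for small kk. The sketch below assumes, consistent with the 1/k!1/k! normalization above, that a factorial tree on vertex set {0,,k}\{0,\ldots,k\} is formed by each vertex c1c\geq 1 independently choosing a parent pf(c){0,,c1}p_{f}(c)\in\{0,\ldots,c-1\} (so there are k!k! trees in total); this reading is inferred from the counting in the proof, not stated explicitly in this chunk.

```python
from itertools import product
from fractions import Fraction
from math import factorial

def leaf_prob(k):
    # Exact Pr[ell in L_f] for each vertex ell in {0,...,k}, by enumerating
    # all k! factorial trees: vertex c >= 1 picks a parent p(c) in {0,...,c-1}.
    counts = [0] * (k + 1)
    for parents in product(*[range(c) for c in range(1, k + 1)]):
        non_leaves = set(parents)  # a vertex is a leaf iff it is nobody's parent
        for ell in range(k + 1):
            if ell not in non_leaves:
                counts[ell] += 1
    total = factorial(k)
    return [Fraction(c, total) for c in counts]

for k in (1, 2, 3, 4, 5):
    # Matches equation (44): Pr[ell in L_f] = ell / k.
    assert leaf_prob(k) == [Fraction(ell, k) for ell in range(k + 1)]
```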

We will show via induction that, for any vertex set S[0,k]S\subseteq[0,k]

Prf𝒰()[SLf]Sk\Pr_{f\sim\mathcal{U}(\mathcal{F})}\left[{S\subseteq L_{f}}\right]\leq\prod_{\ell\in S}\frac{\ell}{k} (45)

Having established the base case |S|=1|S|=1 via (44), we assume (45) holds for all SS with |S|<s|S|<s. For any set of s2s\geq 2 vertices VV, consider an arbitrary partition of VV into two nonempty sets ST=VS\cup T=V with |S|,|T|<s|S|,|T|<s. We see

Prf𝒰()[VLf]\displaystyle\Pr_{f\sim\mathcal{U}(\mathcal{F})}\left[{V\subseteq L_{f}}\right] =c=1kPr[pf(c)V]\displaystyle=\prod_{c=1}^{k}\Pr\left[{p_{f}(c)\not\in V}\right]
=c=1kPr[pf(c)S]Pr[pf(c)T|pf(c)S]\displaystyle=\prod_{c=1}^{k}\Pr\left[{p_{f}(c)\not\in S}\right]\Pr\left[{p_{f}(c)\not\in T|p_{f}(c)\not\in S}\right]
c=1kPr[pf(c)S]Pr[pf(c)T]\displaystyle\leq\prod_{c=1}^{k}\Pr\left[{p_{f}(c)\not\in S}\right]\Pr\left[{p_{f}(c)\not\in T}\right]
=Pr[SLf]Pr[TLf]\displaystyle=\Pr\left[{S\subseteq L_{f}}\right]\Pr\left[{T\subseteq L_{f}}\right]
Vk\displaystyle\leq\prod_{\ell\in V}\frac{\ell}{k}

by the inductive hypothesis, as desired. Thus, Pr[|Lf|s]S:|S|=sPr[SLf]\Pr\left[{|L_{f}|\geq s}\right]\leq\sum_{S:|S|=s}\Pr\left[{S\subseteq L_{f}}\right], which is at most the coefficient of xsx^{s} in the polynomial

R(x)==0k(1+kx)R(x)=\prod_{\ell=0}^{k}\left({1+\frac{\ell}{k}x}\right)

and so

𝔼f𝒰()[2|Lf|]\displaystyle\mathbb{E}_{f\sim\mathcal{U}(\mathcal{F})}\left[{2^{|L_{f}|}}\right] s=0k2sPr[|Lf|s]\displaystyle\leq\sum_{s=0}^{k}2^{s}\Pr[|L_{f}|\geq s]
R(2)\displaystyle\leq R(2)
e=0k2/k=ek+1\displaystyle\leq e^{\sum_{\ell=0}^{k}2\ell/k}=e^{k+1}

and

γ0n:|γ|=k|aj,γ|ξjek+1,\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{j,\gamma}\right|\leq\xi_{j}e^{k+1},

as desired. ∎
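The final bound 𝔼[2|Lf|]ek+1\mathbb{E}[2^{|L_{f}|}]\leq e^{k+1} can be verified exactly for small kk by enumeration. As in the previous sketch, this assumes a factorial tree on vertices {0,,k}\{0,\ldots,k\} attaches each vertex c1c\geq 1 to a parent chosen from {0,,c1}\{0,\ldots,c-1\}.

```python
from itertools import product
from math import factorial, exp

def expected_2_pow_leaves(k):
    # E_{f ~ U(F)}[2^{|L_f|}] by enumerating all k! factorial trees.
    total = 0.0
    for parents in product(*[range(c) for c in range(1, k + 1)]):
        n_leaves = (k + 1) - len(set(parents))
        total += 2.0 ** n_leaves
    return total / factorial(k)

for k in range(1, 7):
    lhs = expected_2_pow_leaves(k)
    # R(2) = prod_{ell=0}^{k} (1 + 2*ell/k), then R(2) <= e^{k+1}.
    R2 = 1.0
    for ell in range(k + 1):
        R2 *= 1 + 2 * ell / k
    assert lhs <= R2 <= exp(k + 1)
```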

Lemma B.6.

Let ϕ1,,ϕm\phi_{1},\cdots,\phi_{m} be softmax-type functions. That is, for each ϕi\phi_{i}, there is some ji[n]j_{i}\in[n] and nonnegative weights ξi1,,ξin\xi_{i1},\ldots,\xi_{in} such that

ϕi((z1,,zn))=exp(zji)k=1nξikexp(zk)\displaystyle\phi_{i}((z_{1},\ldots,z_{n}))=\frac{\exp(z_{j_{i}})}{\sum_{k=1}^{n}\xi_{ik}\cdot\exp(z_{k})}

where ξi1++ξin=1\xi_{i1}+\cdots+\xi_{in}=1 for all ii. Let P(z)=γ0naγzγP(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}z^{\gamma} denote the Taylor series of iϕi\prod_{i}\phi_{i}. Then for any integer kk,

γ0n:|γ|=k|aγ|(e3m)k.\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{\gamma}\right|\leq(e^{3}m)^{k}.
Proof.

Letting Pi(z)=γ0nai,γzγP_{i}(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{i,\gamma}z^{\gamma} denote the Taylor series of ϕi\phi_{i} for all ii, we have P(z)=iPi(z)P(z)=\prod_{i}P_{i}(z) and therefore

γ0n:|γ|=k|aγ|k1,,km0ki=kiγ0n:|γ|=ki|ai,γ|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}\left|a_{\gamma}\right|\leq\sum_{\begin{subarray}{c}k_{1},\cdots,k_{m}\in\mathbb{Z}_{\geq 0}\\ \sum k_{i}=k\end{subarray}}\prod_{i}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k_{i}}\left|a_{i,\gamma}\right|

We have that |γ|=ki|ai,γ|e2ki\sum_{|\gamma|=k_{i}}\left|a_{i,\gamma}\right|\leq e^{2k_{i}} for all kik_{i} since, for ki=0k_{i}=0, ai,0=ϕi(0)=1a_{i,0}=\phi_{i}(0)=1, and for ki1k_{i}\geq 1,

|γ|=ki|ai,γ|ξijξijeki+1e2ki\sum_{|\gamma|=k_{i}}\left|a_{i,\gamma}\right|\leq\frac{\xi_{ij}}{\xi_{ij}}e^{k_{i}+1}\leq e^{2k_{i}} (46)

from Lemma B.5. Note that the softmax-type functions discussed in Lemma B.5 have a ξij\xi_{ij} term in the numerator, while those discussed here do not. This accounts for the extra ξij\xi_{ij} term that appears in equation (46). Thus,

k1,,km0ki=kiγ0n:|γ|=ki|ai,γ|\displaystyle\sum_{\begin{subarray}{c}k_{1},\cdots,k_{m}\in\mathbb{Z}_{\geq 0}\\ \sum k_{i}=k\end{subarray}}\prod_{i}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k_{i}}\left|a_{i,\gamma}\right| k1,,km0ki=ke2k\displaystyle\leq\sum_{\begin{subarray}{c}k_{1},\cdots,k_{m}\in\mathbb{Z}_{\geq 0}\\ \sum k_{i}=k\end{subarray}}e^{2k}
=e2k(m+k1k)\displaystyle=e^{2k}{m+k-1\choose k}
e2k(e(m+k1)k)k\displaystyle\leq e^{2k}\left({\frac{e(m+k-1)}{k}}\right)^{k}
(e3m)k\displaystyle\leq(e^{3}m)^{k}

as desired. ∎
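The counting in the last display can be checked numerically: the number of compositions (k1,,km)(k_{1},\ldots,k_{m}) of kk is (m+k1k){m+k-1\choose k} by stars and bars, and the final chain of inequalities holds for all m,k1m,k\geq 1 because (m+k1)/km(m+k-1)/k\leq m.

```python
from math import comb, e
from itertools import product

# Stars and bars: #{(k_1,...,k_m) >= 0 with k_1+...+k_m = k} = C(m+k-1, k).
for m in (2, 3):
    for k in (2, 3, 4):
        count = sum(1 for t in product(range(k + 1), repeat=m) if sum(t) == k)
        assert count == comb(m + k - 1, k)

# The chain e^{2k} C(m+k-1,k) <= e^{2k} (e(m+k-1)/k)^k <= (e^3 m)^k.
for m in range(1, 8):
    for k in range(1, 10):
        mid = e ** (2 * k) * (e * (m + k - 1) / k) ** k
        assert e ** (2 * k) * comb(m + k - 1, k) <= mid * (1 + 1e-12)
        assert mid <= (e ** 3 * m) ** k * (1 + 1e-12)
```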

Lemma B.7.

Let ϕ((z1,,zn))=exp(zj)k=1nξkexp(zk)\phi((z_{1},\ldots,z_{n}))=\frac{\exp(z_{j})}{\sum_{k=1}^{n}\xi_{k}\exp(z_{k})} be any softmax-type function. Then the radius of convergence of the Taylor series of ϕ\phi at the origin is at least 1.

Proof.

For a complex number zz, write (z),(z)\Re(z),\Im(z) to denote the real and imaginary parts, respectively, of zz. Note that for any ζ1,,ζn\zeta_{1},\ldots,\zeta_{n}\in\mathbb{C} with |ζk|π/3|\zeta_{k}|\leq\pi/3 for all k[n]k\in[n], we have

(exp(ζk))cos(π/3)exp(π/3)>1/10,\Re(\exp(\zeta_{k}))\geq\cos(\pi/3)\cdot\exp(-\pi/3)>1/10,

and thus |k=1nξkexp(ζk)|1/10\left|\sum_{k=1}^{n}\xi_{k}\cdot\exp(\zeta_{k})\right|\geq 1/10. Moreover, for any such point ζ=(ζ1,,ζn)\zeta=(\zeta_{1},\ldots,\zeta_{n}), it holds that |exp(ζj)|exp(π/3)<3|\exp(\zeta_{j})|\leq\exp(\pi/3)<3. It then follows that for such ζ\zeta we have |ϕ(ζ)|30|\phi(\zeta)|\leq 30. In particular, ϕ\phi is holomorphic on the region {ζ:|ζk|π/3k[n]}\{\zeta:|\zeta_{k}|\leq\pi/3\ \ \forall k\in[n]\}.

Fix any γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, and let k=|γ|k=|\gamma|. By the multivariate version of Cauchy’s integral formula,

|dγdzγϕ(z)|=\displaystyle\left|\frac{d^{\gamma}}{dz^{\gamma}}\phi(z)\right|= |γ!(2πi)n|ζ1z1|=π/3|ζnzn|=π/3ϕ(ζ1,,ζn)(ζ1z1)γ1+1(ζnzn)γn+1𝑑ζ1𝑑ζn|\displaystyle\left|\frac{\gamma!}{(2\pi i)^{n}}\int_{|\zeta_{1}-z_{1}|=\pi/3}\cdots\int_{|\zeta_{n}-z_{n}|=\pi/3}\frac{\phi(\zeta_{1},\ldots,\zeta_{n})}{(\zeta_{1}-z_{1})^{\gamma_{1}+1}\cdots(\zeta_{n}-z_{n})^{\gamma_{n}+1}}d\zeta_{1}\cdots d\zeta_{n}\right|
\displaystyle\leq 30γ!(π/3)k+n30γ!(π/3)k.\displaystyle\frac{30\gamma!}{(\pi/3)^{k+n}}\leq\frac{30\gamma!}{(\pi/3)^{k}}.

The power series of ϕ\phi at 𝟎\mathbf{0} is defined as Pϕ(z)=γ0naγzγP_{\phi}(z)=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}\cdot z^{\gamma}, where aγ=1γ!dγdzγϕ(𝟎)a_{\gamma}=\frac{1}{\gamma!}\frac{d^{\gamma}}{dz^{\gamma}}\phi(\mathbf{0}). For any γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n} with k=|γ|k=|\gamma|, we have |aγ|1/k(30/(π/3)k)1/k=(30)1/k3/π|a_{\gamma}|^{1/k}\leq(30/(\pi/3)^{k})^{1/k}=(30)^{1/k}\cdot 3/\pi, which tends to 3/π<13/\pi<1 as kk\rightarrow\infty. Thus, by the (multivariate version of the) Cauchy-Hadamard theorem, the radius of convergence of the power series of ϕ\phi at 𝟎\mathbf{0} is at least π/31\pi/3\geq 1. ∎
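The numerical constants used in this proof are easy to verify directly; a minimal sketch:

```python
from math import cos, exp, pi

# Constants used in the proof of Lemma B.7.
assert cos(pi / 3) * exp(-pi / 3) > 1 / 10   # lower bound on Re(exp(zeta_k))
assert exp(pi / 3) < 3                        # upper bound on |exp(zeta_j)|
assert pi / 3 >= 1 and 3 / pi < 1             # radius-of-convergence conclusion
for k in (100, 1000):
    # |a_gamma|^{1/k} <= 30^{1/k} * 3/pi, which drops below 1 for large k.
    assert 30 ** (1 / k) * 3 / pi < 1
```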

B.2 Proof of Lemma 4.6

In this section we prove Lemma 4.6, which, as explained in Section 4.3, is an important ingredient in the proof of Lemma 4.5. The detailed version of Lemma 4.6 is presented below; it includes several claims which are omitted for simplicity in the abbreviated version in Section 4.3.

Lemma 4.6 (Detailed).

Fix any integer h0h\geq 0, a multi-index γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n} and set k=|γ|k=|\gamma|. For each of the khk^{h} functions π:[h][k]\pi:[h]\rightarrow[k], and for each r[k]r\in[k], there are integers hπ,r{0,1,,h}h^{\prime}_{\pi,r}\in\{0,1,\ldots,h\}, tπ,r0t^{\prime}_{\pi,r}\geq 0, and jπ,r[n]j^{\prime}_{\pi,r}\in[n], so that the following holds. For any sequence L(1),,L(T)nL^{(1)},\ldots,L^{(T)}\in\mathbb{R}^{n} of vectors, it holds that

DhLγ=π:[h][k]r=1kEtπ,rDhπ,r(L(jπ,r)).\displaystyle\operatorname{D}_{h}{L^{\gamma}}=\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{(L(j^{\prime}_{\pi,r}))}}. (47)

Moreover, the following properties hold:

  1. 1.

    For each π\pi and r[k]r\in[k], hπ,r=|{q[h]:π(q)=r}|h^{\prime}_{\pi,r}=|\{q\in[h]:\pi(q)=r\}|. In particular, r=1khπ,r=h\sum_{r=1}^{k}h^{\prime}_{\pi,r}=h.

  2. 2.

    For each π\pi and r[k]r\in[k], it holds that 0tπ,r+hπ,rh0\leq t^{\prime}_{\pi,r}+h^{\prime}_{\pi,r}\leq h.

  3. 3.

    For each π\pi and j[n]j\in[n], γj=|{r[k]:jπ,r=j}|\gamma_{j}=|\{r\in[k]:j^{\prime}_{\pi,r}=j\}|.

Proof of Lemma 4.6.

We use induction on hh. First note that in the case h=0h=0 and for any k0k\geq 0, we have that (DhLγ)(t)=(L(t))γ\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}=(L^{(t)})^{\gamma}, and so for the unique function π:[k]\pi:\emptyset\rightarrow[k], for all r[k]r\in[k], we may take tπ,r=0t^{\prime}_{\pi,r}=0, hπ,r=0h^{\prime}_{\pi,r}=0, and ensure that for each j[n]j\in[n] there are γj\gamma_{j} values of rr so that jπ,r=jj^{\prime}_{\pi,r}=j.

Now fix any integer h>0h>0, and suppose the statement of the claim holds for all h<hh^{\prime}<h. We have that

DhLγ\displaystyle\operatorname{D}_{h}{L^{\gamma}}{}
=\displaystyle= D1Dh1Lγ\displaystyle\operatorname{D}_{1}{\operatorname{D}_{h-1}{L^{\gamma}}}{}
=\displaystyle= D1π:[h1][k]r=1kEtπ,rDhπ,rL(jπ,r)\displaystyle\operatorname{D}_{1}{\sum_{\pi:[h-1]\rightarrow[k]}\prod_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}}{}
=\displaystyle= π:[h1][k]r=1kD1Etπ,rDhπ,rL(jπ,r)r=1r1Etπ,rDhπ,rL(jπ,r)r=r+1kEtπ,r+1Dhπ,rL(jπ,r)\displaystyle\sum_{\pi:[h-1]\rightarrow[k]}\sum_{r=1}^{k}\operatorname{D}_{1}{\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}}\cdot\prod_{r^{\prime}=1}^{r-1}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}}\cdot\prod_{r^{\prime}=r+1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}+1}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}} (48)
=\displaystyle= π:[h1][k]r=1kEtπ,rDhπ,r+1L(jπ,r)r=1r1Etπ,rDhπ,rL(jπ,r)r=r+1kEtπ,r+1Dhπ,rL(jπ,r).\displaystyle\sum_{\pi:[h-1]\rightarrow[k]}\sum_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}+1}{L(j^{\prime}_{\pi,r})}}\cdot\prod_{r^{\prime}=1}^{r-1}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}}\cdot\prod_{r^{\prime}=r+1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r^{\prime}}+1}{\operatorname{D}_{h^{\prime}_{\pi,r^{\prime}}}{L(j^{\prime}_{\pi,r^{\prime}})}}. (49)

where (48) uses Lemma B.2 and (49) uses the commutativity of Et\operatorname{E}_{t^{\prime}}{} and D1\operatorname{D}_{1}{}. For each π:[h1][k]\pi:[h-1]\rightarrow[k], we construct kk functions π1,,πk:[h][k]\pi_{1},\ldots,\pi_{k}:[h]\rightarrow[k], defined by πr(q)=π(q)\pi_{r}(q)=\pi(q) for q<hq<h, and πr(h)=r\pi_{r}(h)=r for r[k]r\in[k]. Next, for r,r[k]r,r^{\prime}\in[k], we define the quantities hπr,r,tπr,r,jπr,rh^{\prime}_{\pi_{r},r^{\prime}},t^{\prime}_{\pi_{r},r^{\prime}},j^{\prime}_{\pi_{r},r^{\prime}} as follows:

  • Set hπr,r=hπ,rh^{\prime}_{\pi_{r},r^{\prime}}=h^{\prime}_{\pi,r^{\prime}} if rrr\neq r^{\prime}, and hπr,r=hπ,r+1h^{\prime}_{\pi_{r},r}=h^{\prime}_{\pi,r}+1.

  • Set tπr,r=tπ,rt^{\prime}_{\pi_{r},r^{\prime}}=t^{\prime}_{\pi,r^{\prime}} if rrr^{\prime}\leq r, and tπr,r=tπ,r+1t^{\prime}_{\pi_{r},r^{\prime}}=t^{\prime}_{\pi,r^{\prime}}+1 if r>rr^{\prime}>r.

  • Set jπr,r=jπ,rj^{\prime}_{\pi_{r},r^{\prime}}=j^{\prime}_{\pi,r^{\prime}}.

By (49) and the above definitions, we have

DhLγ=π:[h][k]r=1kEtπ,rDhπ,rL(jπ,r),\displaystyle\operatorname{D}_{h}{L^{\gamma}}=\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}},

thus verifying (47) for the value hh.

Finally, we verify that items 1 through 3 in the lemma statement hold. The definition of hπr,rh^{\prime}_{\pi_{r},r^{\prime}} above together with the inductive hypothesis ensures that for all r,r[k]r,r^{\prime}\in[k], hπr,r=|{q[h]:πr(q)=r}|h^{\prime}_{\pi_{r},r^{\prime}}=|\{q\in[h]:\pi_{r}(q)=r^{\prime}\}|, thus verifying item 1 of the lemma statement. Since hπr,r+tπr,rhπ,r+tπ,r+1h^{\prime}_{\pi_{r},r^{\prime}}+t^{\prime}_{\pi_{r},r^{\prime}}\leq h^{\prime}_{\pi,r^{\prime}}+t^{\prime}_{\pi,r^{\prime}}+1 for all r,rr,r^{\prime}, it follows from the inductive hypothesis that 0hπr,r+tπr,rh0\leq h^{\prime}_{\pi_{r},r^{\prime}}+t^{\prime}_{\pi_{r},r^{\prime}}\leq h; this verifies item 2. Finally, note that for any j[n]j\in[n] and r[k]r\in[k], {r[k]:jπ,r=j}={r[k]:jπr,r=j}\{r^{\prime}\in[k]:j^{\prime}_{\pi,r^{\prime}}=j\}=\{r^{\prime}\in[k]:j^{\prime}_{\pi_{r},r^{\prime}}=j\}, and thus item 3 follows from the inductive hypothesis. ∎
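The identity driving step (48) is the discrete Leibniz rule of Lemma B.2. As a sanity check, the sketch below numerically verifies the form of that rule used above: D1\operatorname{D}_{1} of a product of kk sequences expands into kk terms, with the factors before the differenced one evaluated at time tt and the factors after it shifted to t+1t+1. Here D1\operatorname{D}_{1} and E1\operatorname{E}_{1} are assumed to be the forward-difference and shift operators, as used throughout this appendix.

```python
import random

def D1(seq):
    # First-order forward finite difference: (D1 x)^(t) = x^(t+1) - x^(t).
    return [seq[t + 1] - seq[t] for t in range(len(seq) - 1)]

def E1(seq):
    # Shift operator: (E1 x)^(t) = x^(t+1).
    return seq[1:]

def prod_seq(seqs):
    # Pointwise product of a list of sequences.
    return [eval_prod := None or __import__("math").prod(s[t] for s in seqs)
            for t in range(len(seqs[0]))]

random.seed(0)
k, T = 4, 8
A = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(k)]

lhs = D1(prod_seq(A))
for t in range(T - 1):
    rhs = 0.0
    for r in range(k):
        term = D1(A[r])[t]
        for rp in range(r):       # factors before r: evaluated at time t
            term *= A[rp][t]
        for rp in range(r + 1, k):  # factors after r: shifted to time t+1
            term *= E1(A[rp])[t]
        rhs += term
    assert abs(lhs[t] - rhs) < 1e-9
```

The check succeeds because the sum over rr telescopes to rAr(t+1)rAr(t)\prod_{r}A_{r}^{(t+1)}-\prod_{r}A_{r}^{(t)}.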

B.3 Proof of Lemma 4.5

In this section we prove Lemma 4.5. To introduce the detailed version of the lemma we need the following definition. Suppose ϕ:n\phi:\mathbb{R}^{n}\rightarrow\mathbb{R} is a real-valued function that is real-analytic in a neighborhood of the origin. For real numbers Q,R>0Q,R>0, we say that ϕ\phi is (Q,R)(Q,R)-bounded if the Taylor series of ϕ\phi at 𝟎\mathbf{0}, denoted Pϕ(z1,,zn)=γ0naγzγP_{\phi}(z_{1},\ldots,z_{n})=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}z^{\gamma}, satisfies, for each integer k0k\geq 0, γ0n:|γ|=k|aγ|QRk\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:|\gamma|=k}|a_{\gamma}|\leq Q\cdot R^{k}. In the statement of Lemma 4.5 below, the quantity 000^{0} is interpreted as 1 (in particular, (h)B0h=1(h^{\prime})^{B_{0}h^{\prime}}=1 for h=0h^{\prime}=0).

Lemma 4.5 (“Boundedness chain rule” for finite differences; detailed).

Suppose that h,nh,n\in\mathbb{N}, ϕ:n\phi:\mathbb{R}^{n}\rightarrow\mathbb{R} is a (Q,R)(Q,R)-bounded function so that the radius of convergence of its power series at 𝟎\mathbf{0} is at least ν>0\nu>0, and L=(L(1),,L(T))nL=(L^{(1)},\ldots,L^{(T)})\in\mathbb{R}^{n} is a sequence of vectors satisfying L(t)ν\|L^{(t)}\|_{\infty}\leq\nu for t[T]t\in[T]. Suppose for some α(0,1)\alpha\in(0,1), for each 0hh0\leq h^{\prime}\leq h and t[Th]t\in[T-h^{\prime}], it holds that DhL(t)1B1αh(h)B0h\|\operatorname{D}_{h^{\prime}}{L}^{(t)}\|_{\infty}\leq\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}h^{\prime}} for some B12e2R,B03B_{1}\geq 2e^{2}R,B_{0}\geq 3. Then for all t[Th]t\in[T-h],

|(Dh(ϕL))(t)|12RQe2B1αhhB0h+1.\displaystyle|\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}|\leq\frac{12RQe^{2}}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}.
Proof of Lemma 4.5.

Note that the hhth order finite differences of a constant sequence are identically 0 for h1h\geq 1, so by subtracting ϕ(𝟎)\phi(\mathbf{0}) from ϕ\phi, we may assume without loss of generality that ϕ(𝟎)=0\phi(\mathbf{0})=0. (Here 𝟎\mathbf{0} denotes the all-zeros vector.)

By assumption, the radius of convergence of the power series of ϕ\phi at the origin is at least ν\nu, and so for each γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, there is a real number aγa_{\gamma} so that for z=(z1,,zn)z=(z_{1},\ldots,z_{n}) with |zj|ν|z_{j}|\leq\nu for each jj,

ϕ(z)=k,γ0n:|γ|=kaγzγ.\phi(z)=\sum_{k\in\mathbb{N},\gamma\in\mathbb{Z}_{\geq 0}^{n}:\ |\gamma|=k}a_{\gamma}z^{\gamma}. (50)

Let Ak:=γ0n:|γ|=k|aγ|A_{k}:=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}:|\gamma|=k}|a_{\gamma}|; by the assumption that ϕ\phi is (Q,R)(Q,R)-bounded, we have that AkQRkA_{k}\leq Q\cdot R^{k} for all kk\in\mathbb{N}.

For γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, recall that LγL^{\gamma} denotes the sequence ((Lγ)(1),,(Lγ)(T))((L^{\gamma})^{(1)},\ldots,(L^{\gamma})^{(T)}), defined by (Lγ)(t)=(L(t)(1))γ1(L(t)(n))γn(L^{\gamma})^{(t)}=(L^{(t)}(1))^{\gamma_{1}}\cdots(L^{(t)}(n))^{\gamma_{n}}. Then since L(t)ν\|L^{(t)}\|_{\infty}\leq\nu for all t[T]t\in[T], we have that, for t[Th]t\in[T-h], (Dh(ϕL))(t)=γ0naγ(DhLγ)(t)\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}=\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}a_{\gamma}\cdot\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}.

We next upper bound the quantities |(DhLγ)(t)||\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}|. To do so, fix some γ0n\gamma\in\mathbb{Z}_{\geq 0}^{n}, and set k=|γ|k=|\gamma|. For each function π:[h][k]\pi:[h]\rightarrow[k] and r[k]r\in[k], recall the integers hπ,r{0,1,,h}h^{\prime}_{\pi,r}\in\{0,1,\ldots,h\}, tπ,r0t^{\prime}_{\pi,r}\geq 0, jπ,r[n]j^{\prime}_{\pi,r}\in[n] defined in Lemma 4.6. By assumption it holds that for each t[Th]t\in[T-h], each hhh^{\prime}\leq h, each 0th0\leq t^{\prime}\leq h, |(DhL(j))(t+t)|1B1αh(h)B0h|\left(\operatorname{D}_{h^{\prime}}{L(j)}\right)^{(t+t^{\prime})}|\leq\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}h^{\prime}}. It follows that for each t[Th]t\in[T-h] and function π:[h][k]\pi:[h]\rightarrow[k],

|r=1k(Etπ,rDhπ,rL(jπ,r))(t)|r=1k1B1αhπ,r(hπ,r)B0hπ,r=αhB1kr=1k(hπ,r)B0hπ,r,\displaystyle\left|\prod_{r=1}^{k}\left(\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}\right)^{(t)}\right|\leq\prod_{r=1}^{k}\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}_{\pi,r}}\cdot(h^{\prime}_{\pi,r})^{B_{0}h^{\prime}_{\pi,r}}=\frac{\alpha^{h}}{B_{1}^{k}}\cdot\prod_{r=1}^{k}(h^{\prime}_{\pi,r})^{B_{0}h^{\prime}_{\pi,r}},

where the last equality uses that r=1khπ,r=h\sum_{r=1}^{k}h^{\prime}_{\pi,r}=h (item 1 of Lemma 4.6). Then by Lemma 4.6, we have:

|(DhLγ)(t)|\displaystyle\left|\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}\right|\leq π:[h][k]|r=1k(Etπ,rDhπ,rL(jπ,r))(t)|\displaystyle\sum_{\pi:[h]\rightarrow[k]}\left|\prod_{r=1}^{k}\left(\operatorname{E}_{t^{\prime}_{\pi,r}}{\operatorname{D}_{h^{\prime}_{\pi,r}}{L(j^{\prime}_{\pi,r})}}\right)^{(t)}\right|
\displaystyle\leq αhB1kπ:[h][k]r=1k(hπ,r)B0hπ,r\displaystyle\frac{\alpha^{h}}{B_{1}^{k}}\sum_{\pi:[h]\rightarrow[k]}\prod_{r=1}^{k}(h^{\prime}_{\pi,r})^{B_{0}h^{\prime}_{\pi,r}}
\displaystyle\leq αhB1khB0hmax{k7,(hk+1)exp(2khB01)},\displaystyle\frac{\alpha^{h}}{B_{1}^{k}}\cdot h^{B_{0}h}\max\left\{k^{7},(hk+1)\cdot\exp\left(\frac{2k}{h^{B_{0}-1}}\right)\right\}, (51)

where (51) follows from Lemma B.4, the fact that B03B_{0}\geq 3, and that hπ,r=|{q[h]:π(q)=r}|h^{\prime}_{\pi,r}=|\{q\in[h]:\pi(q)=r\}| (item 1 of Lemma 4.6).

We may now bound the order-hh finite differences of the sequence ϕL\phi\circ L as follows: for t[Th]t\in[T-h],

|(Dh(ϕL))(t)|\displaystyle|\left(\operatorname{D}_{h}{(\phi\circ L)}\right)^{(t)}|\leq γ0n|aγ||(DhLγ)(t)|\displaystyle\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}|a_{\gamma}|\cdot\left|\left(\operatorname{D}_{h}{L^{\gamma}}\right)^{(t)}\right|
\displaystyle\leq αhhB0h+1γ0n|aγ|B1|γ|max{|γ|7,(|γ|+1)exp(2|γ|hB01)}\displaystyle\alpha^{h}\cdot h^{B_{0}h+1}\sum_{\gamma\in\mathbb{Z}_{\geq 0}^{n}}|a_{\gamma}|\cdot B_{1}^{-|\gamma|}\cdot\max\left\{|\gamma|^{7},(|\gamma|+1)\cdot\exp\left(\frac{2|\gamma|}{h^{B_{0}-1}}\right)\right\} (52)
\displaystyle\leq αhhB0h+1kAkB1k(k7+2kexp(2k/hB01))\displaystyle\alpha^{h}\cdot h^{B_{0}h+1}\cdot\sum_{k\in\mathbb{N}}A_{k}\cdot B_{1}^{-k}\cdot\left(k^{7}+2k\cdot\exp(2k/h^{B_{0}-1})\right)
\displaystyle\leq αhhB0h+1Q(kk7(R/B1)k+k2k(R/B1)ke2k)\displaystyle\alpha^{h}\cdot h^{B_{0}h+1}\cdot Q\cdot\left(\sum_{k\in\mathbb{N}}k^{7}\cdot(R/B_{1})^{k}+\sum_{k\in\mathbb{N}}2k\cdot(R/B_{1})^{k}\cdot e^{2k}\right) (53)
\displaystyle\leq 2RQe2B1αhhB0h+1(kk7(2e2)k+2kk2k)\displaystyle\frac{2RQe^{2}}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}\cdot\left(\sum_{k\in\mathbb{N}}k^{7}\cdot(2e^{2})^{-k}+2\sum_{k\in\mathbb{N}}k\cdot 2^{-k}\right) (54)
=\displaystyle= 12RQe2B1αhhB0h+1.\displaystyle\frac{12RQe^{2}}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}.

where (52) uses (51), (53) uses the bound AkQRkA_{k}\leq QR^{k}, and (54) uses the assumption B12e2RB_{1}\geq 2e^{2}R. This gives the desired conclusion of the lemma. ∎
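The arithmetic in the last step can be checked directly: the two series in parentheses sum to at most 6, which combined with the prefactor 2RQe2/B12RQe^{2}/B_{1} yields the constant 12 in the conclusion.

```python
from math import e

# S1 = sum_k k^7 (2e^2)^{-k},  S2 = 2 * sum_k k 2^{-k} = 4; need S1 + S2 <= 6.
S1 = sum(k ** 7 * (2 * e ** 2) ** (-k) for k in range(1, 200))
S2 = 2 * sum(k * 2.0 ** (-k) for k in range(1, 200))
assert abs(S2 - 4) < 1e-9
assert S1 + S2 <= 6
```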

B.4 Proof of Lemma 4.4

In this section we prove Lemma 4.4. The detailed version of Lemma 4.4 is stated below.

Lemma 4.4 (Detailed).

Fix a parameter α(0,1H+3)\alpha\in\left(0,\frac{1}{H+3}\right). If all players follow Optimistic Hedge updates with step size ηα36e5m\eta\leq\frac{\alpha}{36e^{5}m}, then for any player i[m]i\in[m], integer hh satisfying 0hH0\leq h\leq H, time step t[Th]t\in[T-h], it holds that

(Dhi)(t)αhh3h+1.\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq\alpha^{h}\cdot h^{3h+1}.
Proof.

We have that for each agent i[m]i\in[m], each t[T]t\in[T], and each ai[ni]a_{i}\in[n_{i}], i(t)(ai)=𝔼aixi(t):ii[i(a1,,am)]\ell_{i}^{(t)}(a_{i})=\mathbb{E}_{a_{i^{\prime}}\sim x_{i^{\prime}}^{(t)}:\ i^{\prime}\neq i}[\mathcal{L}_{i}(a_{1},\ldots,a_{m})]. Thus, for 1tT1\leq t\leq T,

|(Dhi)(t)(ai)|=\displaystyle\left|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}(a_{i})\right|= |s=0h(hs)(1)hsi(t+s)(ai)|\displaystyle\left|\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}\ell_{i}^{(t+s)}(a_{i})\right| (55)
=\displaystyle= |ai[ni],iii(a1,,am)s=0h(hs)(1)hsiixi(t+s)(ai)|\displaystyle\left|\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\mathcal{L}_{i}(a_{1},\ldots,a_{m})\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}\cdot\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t+s)}(a_{i^{\prime}})\right|
\displaystyle\leq ai[ni],ii|s=0h(hs)(1)hsiixi(t+s)(ai)|\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\left|\sum_{s=0}^{h}{h\choose s}(-1)^{h-s}\cdot\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t+s)}(a_{i^{\prime}})\right|
=\displaystyle= ai[ni],ii|(Dh(iixi(ai)))(t)|,\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}})\right)}\right)^{(t)}\right|, (56)

where (55) and (56) use Remark 4.3 and in (56), iixi(ai)\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}}) refers to the sequence iixi(1)(ai),\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(1)}(a_{i^{\prime}}), iixi(2)(ai),\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(2)}(a_{i^{\prime}}),\ldots, iixi(T)(ai)\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(T)}(a_{i^{\prime}}).

In the remainder of this lemma we will prepend to the loss sequence i(1),,i(T)\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)} the vectors i(0)=i(1):=𝟎ni\ell_{i}^{(0)}=\ell_{i}^{(-1)}:=\mathbf{0}\in\mathbb{R}^{n_{i}}. We will also prepend xi(0):=xi(1)=(1/ni,,1/ni)Δnix_{i}^{(0)}:=x_{i}^{(1)}=(1/n_{i},\ldots,1/n_{i})\in\Delta^{n_{i}} to the strategy sequence xi(1),,xi(T)x_{i}^{(1)},\ldots,x_{i}^{(T)}. Next notice that for any agent i[m]i\in[m], any t0{0,1,,T}t_{0}\in\{0,1,\ldots,T\}, and any t0t\geq 0, by the definition (1) of the Optimistic Hedge updates, it holds that, for each j[ni]j\in[n_{i}],

xi(t0+t+1)(j)=xi(t0)(j)exp(η(i(t01)(j)s=0ti(t0+s)(j)i(t0+t)(j)))k=1nixi(t0)(k)exp(η(i(t01)(k)s=0ti(t0+s)(k)i(t0+t)(k))).x_{i}^{(t_{0}+t+1)}(j)=\frac{x_{i}^{(t_{0})}(j)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t_{0}-1)}(j)-\sum_{s=0}^{t}\ell_{i}^{(t_{0}+s)}(j)-\ell_{i}^{(t_{0}+t)}(j)\right)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t_{0})}(k)\cdot\exp\left(\eta\cdot\left(\ell_{i}^{(t_{0}-1)}(k)-\sum_{s=0}^{t}\ell_{i}^{(t_{0}+s)}(k)-\ell_{i}^{(t_{0}+t)}(k)\right)\right)}.

Note in particular that our definitions of i(0),i(1),xi(0)\ell_{i}^{(0)},\ell_{i}^{(-1)},x_{i}^{(0)} ensure that the above equation holds even for t0{0,1}t_{0}\in\{0,1\}. Now fix an integer t0t_{0} satisfying 0t0T0\leq t_{0}\leq T; for t0t\geq 0, let us write

¯i,t0(t):=i(t01)s=0t1i(t0+s)i(t0+t1).\bar{\ell}_{i,t_{0}}^{(t)}:=\ell_{i}^{(t_{0}-1)}-\sum_{s=0}^{t-1}\ell_{i}^{(t_{0}+s)}-\ell_{i}^{(t_{0}+t-1)}.

Also, for a vector z=(z(1),,z(ni))niz=(z(1),\ldots,z(n_{i}))\in\mathbb{R}^{n_{i}} and an index j[ni]j\in[n_{i}], define

ϕt0,j(z):=exp(z(j))k=1nixi(t0)(k)exp(z(k)),\phi_{t_{0},j}(z):=\frac{\exp\left(z(j)\right)}{\sum_{k=1}^{n_{i}}x_{i}^{(t_{0})}(k)\cdot\exp\left(z(k)\right)}, (57)

so that xi(t0+t)(j)=xi(t0)(j)ϕt0,j(η¯i,t0(t))x_{i}^{(t_{0}+t)}(j)=x_{i}^{(t_{0})}(j)\cdot\phi_{t_{0},j}(\eta\cdot\bar{\ell}_{i,t_{0}}^{(t)}) for t1t\geq 1. In particular, for any i[m]i\in[m], and any choices of ai[ni]a_{i^{\prime}}\in[n_{i^{\prime}}] for all iii^{\prime}\neq i,

iixi(t0+t)(ai)=iixi(t0)(ai)ϕt0,ai(η¯i,t0(t)).\displaystyle\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0}+t)}(a_{i^{\prime}})=\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0})}(a_{i^{\prime}})\cdot\phi_{t_{0},a_{i^{\prime}}}(\eta\cdot\bar{\ell}_{i^{\prime},t_{0}}^{(t)}). (58)
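The closed-form expression for the iterates can be sanity-checked numerically. Update (1) is not reproduced in this appendix, so the sketch below assumes the one-step rule implied by the ratio of consecutive closed-form iterates, namely x(t+1)(j)x(t)(j)exp(η(2(t)(j)(t1)(j)))x^{(t+1)}(j)\propto x^{(t)}(j)\cdot\exp(-\eta(2\ell^{(t)}(j)-\ell^{(t-1)}(j))), and verifies that iterating it from the uniform x(1)x^{(1)} (with (0)=𝟎\ell^{(0)}=\mathbf{0}) reproduces the closed form with t0=0t_{0}=0.

```python
import math, random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def step(x, loss_t, loss_prev, eta):
    # One Optimistic Hedge step (assumed form): x^(t+1)(j) proportional to
    # x^(t)(j) * exp(-eta * (2*loss_t(j) - loss_prev(j))).
    return normalize([xj * math.exp(-eta * (2 * lt - lp))
                      for xj, lt, lp in zip(x, loss_t, loss_prev)])

random.seed(1)
n, T, eta = 3, 6, 0.1
losses = [[0.0] * n] + [[random.random() for _ in range(n)] for _ in range(T)]

x = [1.0 / n] * n          # x^(1) uniform
xs = [x]
for t in range(1, T + 1):
    x = step(x, losses[t], losses[t - 1], eta)
    xs.append(x)           # xs[t] = x^(t+1)

# Closed form with t0 = 0 (ell^(-1) = 0, x^(0) uniform):
# x^(t+1)(j) proportional to exp(-eta * (sum_{s<=t} ell^(s)(j) + ell^(t)(j))).
for t in range(1, T + 1):
    cum = [sum(losses[s][j] for s in range(1, t + 1)) + losses[t][j]
           for j in range(n)]
    closed = normalize([math.exp(-eta * c) for c in cum])
    assert all(abs(a - b) < 1e-9 for a, b in zip(xs[t], closed))
```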

Next, note that

(D1¯i,t0)(t)=i(t0+t1)2i(t0+t)=i(t0+t1)2(E1i)(t0+t1),\left(\operatorname{D}_{1}{\bar{\ell}_{i,t_{0}}}\right)^{(t)}=\ell_{i}^{(t_{0}+t-1)}-2\ell_{i}^{(t_{0}+t)}=\ell_{i}^{(t_{0}+t-1)}-2\left(\operatorname{E}_{1}{\ell_{i}}\right)^{(t_{0}+t-1)},

meaning that for any h1h^{\prime}\geq 1,

(Dh¯i,t0)(t)=(Dh1i)(t0+t1)2(E1Dh1i)(t0+t1).\left(\operatorname{D}_{h^{\prime}}{\bar{\ell}_{i,t_{0}}}\right)^{(t)}=\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t_{0}+t-1)}-2\left(\operatorname{E}_{1}{\operatorname{D}_{h^{\prime}-1}{\ell_{i}}}\right)^{(t_{0}+t-1)}. (59)

We next establish the following claims which will allow us to prove Lemma 4.4 by induction.

Claim B.8.

For any t0{0,1,,T}t_{0}\in\{0,1,\ldots,T\}, t0t\geq 0, and i[m]i\in[m], it holds that ¯i,t0(t)t+2\|\bar{\ell}_{i,t_{0}}^{(t)}\|_{\infty}\leq t+2.

Proof of Claim B.8.

The claim is immediate from the triangle inequality and the fact that i(t)1\|\ell_{i}^{(t)}\|_{\infty}\leq 1 for all t[T]t\in[T]. ∎

Claim B.9.

Fix hh so that 1hH1\leq h\leq H. Suppose that for some B03B_{0}\geq 3 and for all 0h<h0\leq h^{\prime}<h, all i[m]i\in[m], and all tTht\leq T-h^{\prime}, it holds that (Dhi)(t)αh(h+1)B0(h+1)\|\left(\operatorname{D}_{h^{\prime}}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq\alpha^{h^{\prime}}\cdot(h^{\prime}+1)^{B_{0}(h^{\prime}+1)}. Suppose that the step size η\eta satisfies ηmin{α36e5m,112e5(H+3)m}\eta\leq\min\left\{\frac{\alpha}{36e^{5}m},\frac{1}{12e^{5}(H+3)m}\right\}. Then for all i[m]i\in[m] and 1tTh1\leq t\leq T-h,

(Dhi)(t)αhhB0h+1.\left\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\right\|_{\infty}\leq\alpha^{h}\cdot h^{B_{0}h+1}. (60)
Proof of Claim B.9.

Set B1:=12e5mB_{1}:=12e^{5}m, so that the assumption of the claim gives ηmin{α3B1,1B1(H+3)}\eta\leq\min\left\{\frac{\alpha}{3B_{1}},\frac{1}{B_{1}(H+3)}\right\}.

We first use Lemma 4.5 to bound, for each 0t0Th0\leq t_{0}\leq T-h, i[m]i\in[m], and ai[ni]a_{i^{\prime}}\in[n_{i^{\prime}}] for all iii^{\prime}\neq i, the quantity |(Dh(iixi(ai)))(t0+1)|\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}})\right)}\right)^{(t_{0}+1)}\right| . In particular, we will apply Lemma 4.5 with n=iinin=\sum_{i^{\prime}\neq i}n_{i^{\prime}}, ν=1\nu=1, the value of hh in the statement of Claim B.9, T=h+1T=h+1, and the sequence L(t)L^{(t)}, for 1th+11\leq t\leq h+1, defined as

L(t)=(η¯1,t0(t),,η¯i1,t0(t),η¯i+1,t0(t),,η¯m,t0(t)),L^{(t)}=\left(\eta\cdot\bar{\ell}_{1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{i-1,t_{0}}^{(t)},\eta\cdot\bar{\ell}_{i+1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{m,t_{0}}^{(t)}\right),

namely the concatenation of the vectors η¯1,t0(t),,η¯i1,t0(t),η¯i+1,t0(t),,η¯m,t0(t)\eta\cdot\bar{\ell}_{1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{i-1,t_{0}}^{(t)},\eta\cdot\bar{\ell}_{i+1,t_{0}}^{(t)},\ldots,\eta\cdot\bar{\ell}_{m,t_{0}}^{(t)}. The function ϕ\phi in Lemma 4.5 is set to the function that takes as input the concatenation of ziniz_{i^{\prime}}\in\mathbb{R}^{n_{i^{\prime}}} for all iii^{\prime}\neq i and outputs:

ϕt0,ai(z1,,zi1,zi+1,,zm):=iiϕt0,ai(zi),\displaystyle\phi_{t_{0},a_{-i}}(z_{1},\ldots,z_{i-1},z_{i+1},\ldots,z_{m}):=\prod_{i^{\prime}\neq i}\phi_{t_{0},a_{i^{\prime}}}(z_{i^{\prime}}), (61)

where the functions ϕt0,ai\phi_{t_{0},a_{i^{\prime}}} are as defined in (57). We first verify the preconditions of Lemma 4.5. By Lemma B.6, ϕt0,ai\phi_{t_{0},a_{-i}} is a (1,e3m)(1,e^{3}m)-bounded function. By Lemma B.7, the radius of convergence of each function ϕt0,ai\phi_{t_{0},a_{i^{\prime}}} at 𝟎\mathbf{0} is at least 1; thus the radius of convergence of ϕt0,ai\phi_{t_{0},a_{-i}} at 𝟎\mathbf{0} is at least ν=1\nu=1. Claim B.8 gives that ¯i,t0(t)t+2h+3\|\bar{\ell}_{i,t_{0}}^{(t)}\|_{\infty}\leq t+2\leq h+3 for all th+1t\leq h+1. Thus, since η1B1(H+3)\eta\leq\frac{1}{B_{1}(H+3)},

(D0(η¯i,t0))(t)=η¯i,t0(t)η(H+3)1B1\displaystyle\left\|\left(\operatorname{D}_{0}{\left(\eta\cdot\bar{\ell}_{i,t_{0}}\right)}\right)^{(t)}\right\|_{\infty}=\|\eta\cdot\bar{\ell}_{i,t_{0}}^{(t)}\|_{\infty}\leq\eta\cdot(H+3)\leq\frac{1}{B_{1}}

for 1th+11\leq t\leq h+1. Next, for 1hh1\leq h^{\prime}\leq h and 1th+1h1\leq t\leq h+1-h^{\prime}, we have

(Dh(η¯i,t0))(t)\displaystyle\left\|\left(\operatorname{D}_{h^{\prime}}{(\eta\cdot\bar{\ell}_{i,t_{0}})}\right)^{(t)}\right\|_{\infty}\leq η(Dh1i)(t0+t1)+2η(Dh1i)(t0+t)\displaystyle\eta\cdot\left\|\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t_{0}+t-1)}\right\|_{\infty}+2\eta\cdot\left\|\left(\operatorname{D}_{h^{\prime}-1}{\ell_{i}}\right)^{(t_{0}+t)}\right\|_{\infty} (62)
\displaystyle\leq 3ηαh1(h)B0(h)\displaystyle 3\eta\cdot\alpha^{h^{\prime}-1}\cdot(h^{\prime})^{B_{0}(h^{\prime})} (63)
\displaystyle\leq 1B1αh(h)B0(h),\displaystyle\frac{1}{B_{1}}\cdot\alpha^{h^{\prime}}\cdot(h^{\prime})^{B_{0}(h^{\prime})}, (64)

where (62) follows from (59), (63) follows from the assumption in the statement of Claim B.9 and t0+t+h1t0+hTt_{0}+t+h^{\prime}-1\leq t_{0}+h\leq T, and (64) follows from the fact that 3ηαB13\eta\leq\frac{\alpha}{B_{1}}. It then follows from Lemma 4.5 and (58) that

1iixi(t0)(ai)|(Dh(iixi(ai)))(t0+1)|\displaystyle\frac{1}{\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0})}(a_{i^{\prime}})}\cdot\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}{x_{i^{\prime}}(a_{i^{\prime}})}\right)}\right)^{(t_{0}+1)}\right|
=\displaystyle= |(Dh(ii(ϕt0,aiη¯i,t0)))(t0+1)|\displaystyle\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}\left(\phi_{t_{0},a_{i^{\prime}}}\circ\eta\bar{\ell}_{i^{\prime},t_{0}}\right)\right)}\right)^{(t_{0}+1)}\right| (65)
=\displaystyle= |(Dh(ϕt0,ai(η¯1,t0,,η¯i1,t0,η¯i+1,t0,,η¯m,t0)))(1)|\displaystyle\left|\left(\operatorname{D}_{h}{\left(\phi_{t_{0},a_{-i}}\circ(\eta\bar{\ell}_{1,t_{0}},\ldots,\eta\bar{\ell}_{i-1,t_{0}},\eta\bar{\ell}_{i+1,t_{0}},\ldots,\eta\bar{\ell}_{m,t_{0}})\right)}\right)^{(1)}\right| (66)
\displaystyle\leq 12e5mB1αhhB0h+1=αh(h)B0h+1.\displaystyle\frac{12e^{5}m}{B_{1}}\cdot\alpha^{h}\cdot h^{B_{0}h+1}=\alpha^{h}\cdot(h)^{B_{0}h+1}. (67)

(In particular, (65) uses (58), (66) uses the definition of ϕt0,ai\phi_{t_{0},a_{-i}} in (61), and (67) uses Lemma 4.5.)

Next we use (56), which gives that for each i[m]i\in[m] and t1t\geq 1,

(Dhi)(t)\displaystyle\left\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\right\|_{\infty}\leq ai[ni],ii|(Dh(iixi(ai)))(t)|\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\left|\left(\operatorname{D}_{h}{\left(\prod_{i^{\prime}\neq i}x_{i^{\prime}}(a_{i^{\prime}})\right)}\right)^{(t)}\right|
\displaystyle\leq ai[ni],iiiixi(t0)(ai)αh(h)B0h+1\displaystyle\sum_{a_{i^{\prime}}\in[n_{i^{\prime}}],\ \forall i^{\prime}\neq i}\prod_{i^{\prime}\neq i}x_{i^{\prime}}^{(t_{0})}(a_{i^{\prime}})\cdot\alpha^{h}\cdot(h)^{B_{0}h+1} (68)
=\displaystyle= αh(h)B0h+1,\displaystyle\alpha^{h}\cdot(h)^{B_{0}h+1},

where (68) follows from (67) with t0=t1t_{0}=t-1 (here we use that t0t_{0} may be 0). This completes the proof of Claim B.9.∎

It is immediate that for all i[m],t[T]i\in[m],t\in[T], we have that (D0i)(t)1=α01B01\|\left(\operatorname{D}_{0}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq 1=\alpha^{0}\cdot 1^{B_{0}\cdot 1}. We now apply Claim B.9 inductively with B0=3B_{0}=3, for which it suffices to have ηα36e5m\eta\leq\frac{\alpha}{36e^{5}m} as long as α1/(H+3)\alpha\leq 1/(H+3). This gives that for 0hH0\leq h\leq H, i[m]i\in[m], and t[Th]t\in[T-h], (Dhi)(t)αhh3h+1\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\|_{\infty}\leq\alpha^{h}\cdot h^{3h+1}, completing the proof of Lemma 4.4. ∎

Appendix C Proofs for Section 4.4

The main goal of this section is to prove Lemma 4.7. First, in Section C.1 we prove some preliminary lemmas, and then we prove Lemma 4.7 in Section C.2.

C.1 Preliminary lemmas

Lemma C.1 shows that VarP(W)\operatorname{Var}_{{P}}\left({W}\right) and VarP(W)\operatorname{Var}_{{P^{\prime}}}\left({W}\right) are close when the entries of P,PP,P^{\prime} are close; it will be applied with P,PP,P^{\prime} equal to the strategies xi(t)Δnix_{i}^{(t)}\in\Delta^{n_{i}} played in the course of Optimistic Hedge.

Lemma C.1.

Suppose nn\in\mathbb{N} and M>0M>0 are given, and WnW\in\mathbb{R}^{n} is a vector. Suppose P,PΔnP,P^{\prime}\in\Delta^{n} are distributions with max{PP,PP}1+α\max\left\{\left\|\frac{P}{P^{\prime}}\right\|_{\infty},\left\|\frac{P^{\prime}}{P}\right\|_{\infty}\right\}\leq 1+\alpha for some α>0\alpha>0. Then

(1α)VarP(W)VarP(W)(1+α)VarP(W).\displaystyle(1-\alpha)\operatorname{Var}_{{P}}\left({W}\right)\leq\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\leq(1+\alpha)\operatorname{Var}_{{P}}\left({W}\right). (69)
Proof.

We first prove that VarP(W)(1+α)VarP(W)\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\leq(1+\alpha)\operatorname{Var}_{{P}}\left({W}\right). To do so, note that since adding a constant to every entry of WW does not change VarP(W)\operatorname{Var}_{{P}}\left({W}\right) or VarP(W)\operatorname{Var}_{{P^{\prime}}}\left({W}\right), by replacing WW with WP,W𝟏W-\langle P,W\rangle\cdot\mathbf{1}, we may assume without loss of generality that P,W=0\langle P,W\rangle=0. Thus VarP(W)=j=1nP(j)W(j)2\operatorname{Var}_{{P}}\left({W}\right)=\sum_{j=1}^{n}P(j)W(j)^{2}. Now we may compute:

VarP(W)\displaystyle\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\leq jP(j)W(j)2\displaystyle\sum_{j}P^{\prime}(j)\cdot W(j)^{2}
=\displaystyle= jP(j)W(j)2+j(P(j)P(j))W(j)2\displaystyle\sum_{j}P(j)\cdot W(j)^{2}+\sum_{j}(P^{\prime}(j)-P(j))\cdot W(j)^{2}
\displaystyle\leq (1+α)VarP(W),\displaystyle(1+\alpha)\operatorname{Var}_{{P}}\left({W}\right), (70)

where (70) uses the fact that PP1+α\left\|\frac{P^{\prime}}{P}\right\|_{\infty}\leq 1+\alpha.

By interchanging the roles of P,PP,P^{\prime}, we obtain that

VarP(W)11+αVarP(W)(1α)VarP(W).\displaystyle\operatorname{Var}_{{P^{\prime}}}\left({W}\right)\geq\frac{1}{1+\alpha}\operatorname{Var}_{{P}}\left({W}\right)\geq(1-\alpha)\operatorname{Var}_{{P}}\left({W}\right).

This completes the proof of the lemma. ∎
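As a quick numerical sanity check of Lemma C.1 (our own illustration, not part of the paper; the distributions and the test vector below are arbitrary choices), the two-sided bound (69) can be verified in a few lines of Python:

```python
import random

def var(P, W):
    # Variance of the vector W under the distribution P
    mean = sum(p * w for p, w in zip(P, W))
    return sum(p * (w - mean) ** 2 for p, w in zip(P, W))

random.seed(0)
n, alpha = 10, 0.2
W = [random.uniform(-1.0, 1.0) for _ in range(n)]
P = [random.uniform(0.5, 1.0) for _ in range(n)]
P = [p / sum(P) for p in P]

# Perturb each coordinate by a factor in [1, 1+alpha]; after renormalizing,
# the coordinate-wise ratios P'/P and P/P' stay below 1 + alpha.
Pp = [p * random.uniform(1.0, 1.0 + alpha) for p in P]
Pp = [p / sum(Pp) for p in Pp]

ratio = max(max(p / q, q / p) for p, q in zip(P, Pp))
assert ratio <= 1 + alpha
assert (1 - alpha) * var(P, W) <= var(Pp, W) <= (1 + alpha) * var(P, W)
print("Lemma C.1 bound (69) holds on this instance")
```

In the proof of Lemma 4.7 the same bound is applied with P, P' taken to be nearby Optimistic Hedge iterates.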

Next we prove Lemma 4.8 (recall that only the special case μ=0\mu=0 was proved in Section 4.4). For convenience the lemma is repeated below.

Lemma 4.8 (Restated).

Suppose μ\mu\in\mathbb{R}, α>0\alpha>0, and W(0),,W(S1)W^{(0)},\ldots,W^{(S-1)}\in\mathbb{R} is a sequence of reals satisfying

t=0S1((D2W)(t))2αt=0S1((D1W)(t))2+μ.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu. (71)

Then

t=0S1((D1W)(t))2αt=1S1(W(t))2+μ/α.\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}\leq\alpha\cdot\sum_{t=1}^{S-1}(W^{(t)})^{2}+\mu/\alpha.

To prove Lemma 4.8 we need the following basic facts about the Fourier transform:

Fact C.2 (Parseval’s equality).

It holds that t=0S1|W(t)|2=1Ss=0S1|W^(s)|2\sum_{t=0}^{S-1}|W^{(t)}|^{2}=\frac{1}{S}\sum_{s=0}^{S-1}|\widehat{W}^{(s)}|^{2}.

The second fact gives a formula for the Fourier transform of the circular finite differences; its simple form is the reason we work with circular finite differences in this section:

Fact C.3.

For h0h\in\mathbb{Z}_{\geq 0}, DhW^(s)=W^(s)(e2πis/S1)h\widehat{\operatorname{D}^{\circ}_{h}{W}}^{(s)}=\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1)^{h}.
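Both facts are standard and can be checked numerically. The sketch below is our own illustration; it assumes the unnormalized forward transform in which the s-th Fourier coefficient of W is the sum over t of W(t) times exp(-2*pi*i*s*t/S), which is the convention under which Facts C.2 and C.3 hold as stated:

```python
import cmath, random

def dft(W):
    # Unnormalized forward DFT: What[s] = sum_t W[t] * exp(-2*pi*1j*s*t/S)
    S = len(W)
    return [sum(W[t] * cmath.exp(-2j * cmath.pi * s * t / S) for t in range(S))
            for s in range(S)]

def circ_diff(W, h):
    # h-fold circular finite difference: one step maps W[t] to W[(t+1) % S] - W[t]
    S = len(W)
    for _ in range(h):
        W = [W[(t + 1) % S] - W[t] for t in range(S)]
    return W

random.seed(1)
S = 16
W = [random.uniform(-1.0, 1.0) for _ in range(S)]
What = dft(W)

# Fact C.2 (Parseval): sum_t |W(t)|^2 = (1/S) * sum_s |What(s)|^2
lhs = sum(w * w for w in W)
rhs = sum(abs(c) ** 2 for c in What) / S
assert abs(lhs - rhs) < 1e-9

# Fact C.3: the DFT of the h-th circular difference is What(s) * (exp(2*pi*1j*s/S) - 1)^h
h = 2
Dh_hat = dft(circ_diff(W, h))
for s in range(S):
    predicted = What[s] * (cmath.exp(2j * cmath.pi * s / S) - 1) ** h
    assert abs(Dh_hat[s] - predicted) < 1e-8
print("Facts C.2 and C.3 verified on a random sequence")
```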

Proof of Lemma 4.8.

Note that the discrete Fourier transform of D1W\operatorname{D}^{\circ}_{1}{W} satisfies D1W^(s)=W^(s)(e2πis/S1)\widehat{\operatorname{D}^{\circ}_{1}{W}}^{(s)}=\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1), and similarly D2W^(s)=W^(s)(e2πis/S1)2\widehat{\operatorname{D}^{\circ}_{2}{W}}^{(s)}=\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1)^{2}, for 0sS10\leq s\leq S-1. By the Cauchy-Schwarz inequality, Parseval’s equality (Fact C.2), Fact C.3, and the assumption that (71) holds, we have

t=0S1((D1W)(t))2=\displaystyle\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}= 1Ss=0S1|D1W^(s)|2\displaystyle\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{\operatorname{D}^{\circ}_{1}{W}}^{(s)}\right|^{2}
=\displaystyle= 1Ss=0S1|W^(s)(e2πis/S1)|2\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\cdot(e^{2\pi is/S}-1)\right|^{2}
=\displaystyle= 1Ss=0S1|W^(s)||W^(s)||e2πis/S1|2\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|\cdot\left|\widehat{W}^{(s)}\right|\left|e^{2\pi is/S}-1\right|^{2}
\displaystyle\leq 1Ss=0S1|W^(s)|21Ss=0S1|W^(s)|2|e2πis/S1|4\sqrt{\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}}\cdot\sqrt{\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{W}^{(s)}\right|^{2}\cdot\left|e^{2\pi is/S}-1\right|^{4}}
=\displaystyle= t=0S1(W(t))21Ss=0S1|D2W^(s)|2\displaystyle\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\frac{1}{S}\sum_{s=0}^{S-1}\left|\widehat{\operatorname{D}^{\circ}_{2}{W}}^{(s)}\right|^{2}}
=\displaystyle= t=0S1(W(t))2t=0S1((D2W)(t))2\displaystyle\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{2}{W}\right)^{(t)}\right)^{2}}
\displaystyle\leq t=0S1(W(t))2αt=0S1((D1W)(t))2+μ.\displaystyle\sqrt{\sum_{t=0}^{S-1}(W^{(t)})^{2}}\cdot\sqrt{\alpha\cdot\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu}. (72)

Note that for real numbers A>0A>0 and ϵ\epsilon with A+ϵ>0A+\epsilon>0, it holds that

A2A+ϵ=A1+ϵ/AA(1ϵ/A)=Aϵ.\frac{A^{2}}{{A+\epsilon}}=\frac{{A}}{1+{\epsilon/A}}\geq A\cdot(1-\epsilon/A)=A-\epsilon.

Taking A=t=0S1((D1W)(t))2A=\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2} and ϵ=μ/α\epsilon=\mu/\alpha (for which A+ϵ>0A+\epsilon>0 is immediate) and using (72) then gives

t=0S1((D1W)(t))2μ/α(t=0S1((D1W)(t))2)2t=0S1((D1W)(t))2+μ/ααt=0S1(W(t))2,\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}-\mu/\alpha\leq\frac{\left(\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}\right)^{2}}{{\sum_{t=0}^{S-1}\left(\left(\operatorname{D}^{\circ}_{1}{W}\right)^{(t)}\right)^{2}+\mu/\alpha}}\leq\alpha\cdot\sum_{t=0}^{S-1}\left(W^{(t)}\right)^{2},

as desired. ∎
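The lemma can also be sanity-checked numerically. In the following sketch (our own illustration), mu is chosen as the exact slack in the hypothesis (71), and we verify both the Cauchy-Schwarz step ending at (72) (in squared form) and the conclusion, with the sum of W(t)^2 taken over all t as in the proof:

```python
import random

def circ_diff(W):
    # Circular first difference: maps W[t] to W[(t+1) % S] - W[t]
    S = len(W)
    return [W[(t + 1) % S] - W[t] for t in range(S)]

random.seed(2)
S, alpha = 32, 0.25
W = [random.uniform(-1.0, 1.0) for _ in range(S)]
D1 = circ_diff(W)
D2 = circ_diff(D1)

A1 = sum(d * d for d in D1)    # sum of squared circular first differences
A2 = sum(d * d for d in D2)    # sum of squared circular second differences
mu = A2 - alpha * A1           # slack making hypothesis (71) hold with equality
total = sum(w * w for w in W)

# Squared form of the Cauchy-Schwarz chain ending at (72): A1^2 <= total * A2
assert A1 ** 2 <= total * A2 + 1e-9
# Conclusion of Lemma 4.8
assert A1 <= alpha * total + mu / alpha + 1e-9
print("Lemma 4.8 verified on a random sequence")
```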

C.2 Proof of Lemma 4.7

Now we prove Lemma 4.7. For convenience we restate the lemma below with the exact value of the constant C0C_{0} referred to in the version in Section 4.4.

Lemma 4.7 (Restated).

For any M,ζ,α>0M,\zeta,\alpha>0 and nn\in\mathbb{N}, suppose that P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} and Z(1),,Z(T)[M,M]nZ^{(1)},\ldots,Z^{(T)}\in[-M,M]^{n} satisfy the following conditions:

  1. 1.

    The sequence P(1),,P(T)P^{(1)},\ldots,P^{(T)} is ζ\zeta-consecutively close for some ζ[1/(2T),α4/8256]\zeta\in[1/(2T),\alpha^{4}/8256].

  2. 2.

    It holds that t=1T2VarP(t)((D2Z)(t))αt=1T1VarP(t)((D1Z)(t))+μ.\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)\leq\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)+\mu.

Then

t=1T1VarP(t)((D1Z)(t))α(1+α)t=1TVarP(t)(Z(t))+μα+1290M2α3.\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\leq\alpha\cdot(1+\alpha)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{\mu}{\alpha}+\frac{1290M^{2}}{\alpha^{3}}. (73)
Proof.

Fix a positive integer S<1/(2ζ)<TS<1/(2\zeta)<T, to be specified exactly below. For 1t0TS+11\leq t_{0}\leq T-S+1, define μt0\mu_{t_{0}}\in\mathbb{R} by

μt0=s=0S3VarP(t0+s)((D2Z)(t0+s))αs=0S3VarP(t0+s)((D1Z)(t0+s)).\mu_{t_{0}}=\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)-\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right). (74)

Then

t0=1TS+1μt0\displaystyle\sum_{t_{0}=1}^{T-S+1}\mu_{t_{0}}
=\displaystyle= t=1T2VarP(t)((D2Z)(t))min{S2,t,Tt1}\displaystyle\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)\cdot\min\{S-2,t,T-t-1\}
αt=1T1VarP(t)((D1Z)(t))min{S2,t,Tt1}\displaystyle-\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\cdot\min\{S-2,t,T-t-1\}
\displaystyle\leq (S2)t=1T2VarP(t)((D2Z)(t))(S2)αt=1T1VarP(t)((D1Z)(t))+2α(S2)2M2\displaystyle(S-2)\cdot\sum_{t=1}^{T-2}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t)}}\right)-(S-2)\alpha\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)+2\alpha(S-2)^{2}M^{2} (75)
\displaystyle\leq (S2)μ+2α(S2)2M2,\displaystyle(S-2)\mu+2\alpha(S-2)^{2}M^{2}, (76)

where (75) uses the fact that Z(t)M\|Z^{(t)}\|_{\infty}\leq M and so (D1Z)(t)2M\|\left(\operatorname{D}_{1}{Z}\right)^{(t)}\|_{\infty}\leq 2M for all t[T]t\in[T], and the final inequality (76) follows from assumption 2 of the lemma statement.
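The first step above rests on a window-counting identity: summing a quantity over all length-(S-2) windows counts the index t exactly min{S-2, t, T-t-1} times. A minimal pure-Python check of this identity (our own illustration, with arbitrary T, S, and weights):

```python
import random

random.seed(3)
T, S = 40, 9
f = {t: random.random() for t in range(1, T - 1)}  # arbitrary weights f(1), ..., f(T-2)

# Sum over windows t0, t0+1, ..., t0+S-3, as in the definition of mu_{t0} in (74)
lhs = sum(f[t0 + s] for t0 in range(1, T - S + 2) for s in range(S - 2))

# Each index t is counted min{S-2, t, T-t-1} times
rhs = sum(f[t] * min(S - 2, t, T - t - 1) for t in range(1, T - 1))

assert abs(lhs - rhs) < 1e-9
print("window-counting identity verified")
```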

By (74) and Lemma C.1 with P=P(t0)P=P^{(t_{0})}, we have

s=0S3VarP(t0)((D2Z)(t0+s))\displaystyle\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)\leq (1+2ζS)s=0S3VarP(t0+s)((D2Z)(t0+s))\displaystyle(1+2\zeta S)\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)
=\displaystyle= (1+2ζS)αs=0S3VarP(t0+s)((D1Z)(t0+s))+(1+2ζS)μt0\displaystyle(1+2\zeta S)\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+(1+2\zeta S)\mu_{t_{0}}
\displaystyle\leq (1+2ζS)2αs=0S3VarP(t0)((D1Z)(t0+s))+(1+2ζS)μt0.\displaystyle(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+(1+2\zeta S)\mu_{t_{0}}. (77)

Here we have used that for 0sS0\leq s\leq S, it holds that max{P(t0+s)P(t0),P(t0)P(t0+s)}(1+ζ)S1+2ζS\max\left\{\left\|\frac{P^{(t_{0}+s)}}{P^{(t_{0})}}\right\|_{\infty},\left\|\frac{P^{(t_{0})}}{P^{(t_{0}+s)}}\right\|_{\infty}\right\}\leq(1+\zeta)^{S}\leq 1+2\zeta S since ζS1/2\zeta S\leq 1/2.

For any integer 1t0TS+11\leq t_{0}\leq T-S+1, we define the sequence Zt0(s):=Z(t0+s)P(t0),Z(t0+s)𝟏Z_{t_{0}}^{(s)}:=Z^{(t_{0}+s)}-\langle{P^{(t_{0})}},Z^{(t_{0}+s)}\rangle\mathbf{1}, for 0sS10\leq s\leq S-1. Thus Zt0(s),P(t0)=0\langle Z_{t_{0}}^{(s)},P^{(t_{0})}\rangle=0 for 0sS10\leq s\leq S-1, which implies that for all h0h\geq 0, 0sS10\leq s\leq S-1, (DhZt0)(s),P(t0)=0\langle\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)},P^{(t_{0})}\rangle=0, and thus

VarP(t0)((DhZt0)(s))=j=1nP(t0)(j)(DhZt0)(s)(j)2.\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)}}\right)=\sum_{j=1}^{n}P^{(t_{0})}(j)\cdot\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)}(j)^{2}. (78)

By the definition of the sequence Zt0Z_{t_{0}}, for 0sSh10\leq s\leq S-h-1, we have

VarP(t0)((DhZ)(t0+s))=VarP(t0)((DhZt0)(s))=VarP(t0)((DhZt0)(s)).\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{h}{Z}\right)^{(t_{0}+s)}}\right)=\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{h}{Z_{t_{0}}}\right)^{(s)}}\right)=\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{h}{Z_{t_{0}}}\right)^{(s)}}\right). (79)

For 1t0TS+11\leq t_{0}\leq T-S+1, let us now define

νt0,j:=s=0S1(D2Zt0)(s)(j)2(1+2ζS)2αs=0S1(D1Zt0)(s)(j)2,\displaystyle\nu_{t_{0},j}:=\sum_{s=0}^{S-1}\left(\operatorname{D}^{\circ}_{2}{Z_{t_{0}}}\right)^{(s)}(j)^{2}-(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}(j)^{2}, (80)

so that, by (77), (78), and (79),

j=1nP(t0)(j)νt0,j\displaystyle\sum_{j=1}^{n}P^{(t_{0})}(j)\cdot\nu_{t_{0},j}
=\displaystyle= s=0S1VarP(t0)((D2Zt0)(s))(1+2ζS)2αs=0S1VarP(t0)((D1Zt0)(s))\displaystyle\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z_{t_{0}}}\right)^{(s)}}\right)-(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}}\right)
\displaystyle\leq (s=0S3VarP(t0)((D2Z)(t0+s)))+VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1))\displaystyle\left(\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{2}{Z}\right)^{(t_{0}+s)}}\right)\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right)
(1+2ζS)2αs=0S3VarP(t0)((D1Z)(t0+s))\displaystyle-(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-3}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)
\displaystyle\leq (1+2ζS)μt0+VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1)).\displaystyle(1+2\zeta S)\mu_{t_{0}}+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right). (81)

By (80) and Lemma 4.8 applied to the sequence Zt0(0),,Zt0(S1)Z_{t_{0}}^{(0)},\ldots,Z_{t_{0}}^{(S-1)}, it holds that, for each j[n]j\in[n],

s=0S1(D1Zt0)(s)(j)2(1+2ζS)2αs=0S1Zt0(s)(j)2+νt0,j(1+2ζS)2α.\sum_{s=0}^{S-1}\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}(j)^{2}\leq(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}Z_{t_{0}}^{(s)}(j)^{2}+\frac{\nu_{t_{0},j}}{(1+2\zeta S)^{2}\alpha}. (82)

Then we have:

s=0S2VarP(t0)((D1Z)(t0+s))\displaystyle\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)
=\displaystyle= s=0S2VarP(t0)((D1Zt0)(s))\displaystyle\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{1}{Z_{t_{0}}}\right)^{(s)}}\right) (83)
\displaystyle\leq (1+2ζS)2αs=0S1VarP(t0)(Zt0(s))+j=1nP(t0)(j)νt0,j(1+2ζS)2α\displaystyle(1+2\zeta S)^{2}\alpha\cdot\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({Z_{t_{0}}^{(s)}}\right)+\sum_{j=1}^{n}P^{(t_{0})}(j)\cdot\frac{\nu_{t_{0},j}}{(1+2\zeta S)^{2}\alpha} (84)
\displaystyle\leq (1+2ζS)2αs=0S1VarP(t0)(Zt0(s))+μt0(1+2ζS)α+VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1))(1+2ζS)2α,\displaystyle(1+2\zeta S)^{2}\alpha\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({Z_{t_{0}}^{(s)}}\right)+\frac{\mu_{t_{0}}}{(1+2\zeta S)\alpha}+\frac{\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right)}{(1+2\zeta S)^{2}\alpha}, (85)

where (83) follows from (79), (84) follows from (82) and (78), and (85) follows from (81). Summing the above for 1t0TS+11\leq t_{0}\leq T-S+1, we obtain

(S1)t=1T1VarP(t)((D1Z)(t))\displaystyle(S-1)\cdot\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)
\displaystyle\leq t0=1TS+1s=0S2VarP(t0+s)((D1Z)(t0+s))+8(S1)2M2\displaystyle\sum_{t_{0}=1}^{T-S+1}\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+8(S-1)^{2}M^{2} (86)
\displaystyle\leq t0=1TS+1(1+2ζS)s=0S2VarP(t0)((D1Z)(t0+s))+8(S1)2M2\displaystyle\sum_{t_{0}=1}^{T-S+1}(1+2\zeta S)\sum_{s=0}^{S-2}\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t_{0}+s)}}\right)+8(S-1)^{2}M^{2} (87)
\displaystyle\leq (1+2ζS)3αt0=1TS+1s=0S1VarP(t0)(Zt0(s))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{3}\alpha\sum_{t_{0}=1}^{T-S+1}\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0})}}}\left({Z_{t_{0}}^{(s)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2}
+t0=1TS+1VarP(t0)((D2Z)(t0+S2))+VarP(t0)((D2Z)(t0+S1))(1+2ζS)α\displaystyle+\sum_{t_{0}=1}^{T-S+1}\frac{\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-2)}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({\left(\operatorname{D}^{\circ}_{2}{Z}\right)^{(t_{0}+S-1)}}\right)}{(1+2\zeta S)\alpha} (88)
\displaystyle\leq (1+2ζS)4αt0=1TS+1s=0S1VarP(t0+s)(Z(t0+s))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{4}\alpha\sum_{t_{0}=1}^{T-S+1}\sum_{s=0}^{S-1}\operatorname{Var}_{{P^{(t_{0}+s)}}}\left({Z^{(t_{0}+s)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2}
+4(1+2ζS)αt0=1TS+1[VarP(t0)(Z(t0+S2))+3VarP(t0)(Z(t0+S1))\displaystyle+\frac{4}{(1+2\zeta S)\alpha}\sum_{t_{0}=1}^{T-S+1}\left[\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0}+S-2)}}\right)+3\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0}+S-1)}}\right)\right.
+3VarP(t0)(Z(t0))+VarP(t0)(Z(t0+1))]\displaystyle\left.+3\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0})}}\right)+\operatorname{Var}_{{P^{(t_{0})}}}\left({Z^{(t_{0}+1)}}\right)\right] (89)
\displaystyle\leq (1+2ζS)4αSt=1TVarP(t)(Z(t))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{4}\alpha S\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2}
+32(1+2ζS)αt=1TVarP(t)(Z(t))\displaystyle+\frac{32}{(1+2\zeta S)\alpha}\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right) (90)
\displaystyle\leq (1+2ζS)4αS(1+32α2S)t=1TVarP(t)(Z(t))+t0=1TS+1μt0α+8(S1)2M2\displaystyle(1+2\zeta S)^{4}\alpha S\cdot\left(1+\frac{32}{\alpha^{2}S}\right)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\sum_{t_{0}=1}^{T-S+1}\frac{\mu_{t_{0}}}{\alpha}+8(S-1)^{2}M^{2} (91)
\displaystyle\leq (1+2ζS)4αS(1+32α2S)t=1TVarP(t)(Z(t))+(S2)μα+10(S1)2M2,\displaystyle(1+2\zeta S)^{4}\alpha S\cdot\left(1+\frac{32}{\alpha^{2}S}\right)\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{(S-2)\mu}{\alpha}+10(S-1)^{2}M^{2}, (92)

where:

  • (86) follows since Z(t)M\|Z^{(t)}\|_{\infty}\leq M and thus (D1Z)(t)2M\|\left(\operatorname{D}_{1}{Z}\right)^{(t)}\|_{\infty}\leq 2M for all tt;

  • (87) follows from Lemma C.1 and the fact that for 0sS20\leq s\leq S-2, max{P(t0+s)P(t0),P(t0)P(t0+s)}1+2ζS\max\left\{\left\|\frac{P^{(t_{0}+s)}}{P^{(t_{0})}}\right\|_{\infty},\left\|\frac{P^{(t_{0})}}{P^{(t_{0}+s)}}\right\|_{\infty}\right\}\leq 1+2\zeta S, as established above using the fact that the distributions P(t)P^{(t)} are ζ\zeta-consecutively close;

  • (88) follows from (85);

  • The first term in (89) is bounded using Lemma C.1 and the fact that the distributions P(t)P^{(t)} are ζ\zeta-consecutively close, and the final term in (89) is bounded using the fact that for any vectors Z1,,ZknZ_{1},\ldots,Z_{k}\in\mathbb{R}^{n} and any PΔnP\in\Delta^{n}, we have VarP(Z1++Zk)k(VarP(Z1)++VarP(Zk))\operatorname{Var}_{{P}}\left({Z_{1}+\cdots+Z_{k}}\right)\leq k\cdot\left(\operatorname{Var}_{{P}}\left({Z_{1}}\right)+\cdots+\operatorname{Var}_{{P}}\left({Z_{k}}\right)\right);

  • (90) and (91) follow by rearranging terms;

  • (92) follows from (76).

Now choose S=128α3S=\left\lceil\frac{128}{\alpha^{3}}\right\rceil, so that 32α2Sα4\frac{32}{\alpha^{2}S}\leq\frac{{\alpha}}{4}. Therefore, as long as 2ζSα322\zeta S\leq\frac{\alpha}{32}, we have, since α1/2\alpha\leq 1/2, that

(1+2ζS)4αSS1(1+32α2S)α(1+α/4)3α(1+α).(1+2\zeta S)^{4}\alpha\cdot\frac{S}{S-1}\cdot\left(1+\frac{32}{\alpha^{2}S}\right)\leq\alpha\cdot(1+\alpha/4)^{3}\leq\alpha\cdot(1+\alpha).

Then it follows from (92) that

t=1T1VarP(t)((D1Z)(t))\displaystyle\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right)\leq α(1+α)t=1TVarP(t)(Z(t))+μα+10SM2.\alpha(1+\alpha)\cdot\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{\mu}{\alpha}+10SM^{2}. (93)

Using that S129α3S\leq\frac{129}{\alpha^{3}}, the inequality 2ζSα/322\zeta S\leq\alpha/32 can be satisfied by ensuring that ζα48256=α412964α64S\zeta\leq\frac{\alpha^{4}}{8256}=\frac{\alpha^{4}}{129\cdot 64}\leq\frac{\alpha}{64S}. Note that our choice of SS ensures that ζS1/2\zeta S\leq 1/2, as was assumed earlier. Moreover, we have 10SM21290M2α310SM^{2}\leq\frac{1290M^{2}}{\alpha^{3}}. Thus, (93) gives the desired result. ∎
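The parameter arithmetic in the final paragraph can be checked mechanically. The following sketch (our own) verifies, over a grid of values alpha below 1/2, that the choice S = ceil(128/alpha^3) gives 32/(alpha^2 * S) <= alpha/4, and that (1 + alpha/4)^3 <= 1 + alpha:

```python
import math

for k in range(1, 200):
    alpha = k / 400.0                      # grid of alpha values in (0, 1/2)
    S = math.ceil(128 / alpha ** 3)
    # The choice of S right after (92):
    assert 32 / (alpha ** 2 * S) <= alpha / 4 + 1e-12
    # The elementary bound used to absorb the (1 + alpha/4) factors:
    assert (1 + alpha / 4) ** 3 <= 1 + alpha
print("parameter choices for S are consistent")
```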

C.3 Completing the proof of Theorem 3.1

Using the lemmas developed in the previous sections, we can now complete the proof of Theorem 3.1. We begin by proving Lemma 4.2, which is restated formally below.

Lemma 4.2 (Detailed).

There are constants C,C>1C,C^{\prime}>1 so that the following holds. Suppose a time horizon T4T\geq 4 is given, we set H:=logTH:=\lceil\log T\rceil, and all players play according to Optimistic Hedge with step size η\eta satisfying 1/Tη1CmH41/T\leq\eta\leq\frac{1}{C\cdot mH^{4}}. Then for any i[m]i\in[m], the losses i(1),,i(T)[0,1]ni\ell_{i}^{(1)},\ldots,\ell_{i}^{(T)}\in[0,1]^{n_{i}} for player ii satisfy:

t=1TVarxi(t)(i(t)i(t1))12t=1TVarxi(t)(i(t1))+CH5.\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)\leq\frac{1}{2}\cdot\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)+C^{\prime}H^{5}. (94)

We state a generic version of this lemma that can be applied in more general settings.

Lemma C.4.

For any integers n2n\geq 2 and T4T\geq 4, we set H:=logTH:=\lceil\log T\rceil, α=1/(4H)\alpha=1/(4H), and α0=α/8H3\alpha_{0}=\frac{\sqrt{\alpha/8}}{H^{3}}. Suppose that Z(1),,Z(T)[0,1]nZ^{(1)},\ldots,Z^{(T)}\in[0,1]^{n} and P(1),,P(T)ΔnP^{(1)},\ldots,P^{(T)}\in\Delta^{n} satisfy the following

  1. 1.

    For each 0hH0\leq h\leq H and 1tTh1\leq t\leq T-h, it holds that (DhZ)(t)H(α0H3)h\left\|\left(\operatorname{D}_{h}{Z}\right)^{(t)}\right\|_{\infty}\leq H\cdot\left(\alpha_{0}H^{3}\right)^{h}

  2. 2.

    The sequence P(1),,P(T)P^{(1)},\ldots,P^{(T)} is ζ\zeta-consecutively close for some ζ[1/(2T),α4/8256]\zeta\in[1/(2T),\alpha^{4}/8256].

Then,

t=1TVarP(t)(Z(t)Z(t1))\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}-Z^{(t-1)}}\right)\leq 2αt=1TVarP(t)(Z(t1))+165120(1+ζ)H5+2\displaystyle 2\alpha\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+165120(1+\zeta)H^{5}+2 (95)

First, we demonstrate how Lemma C.4 implies Lemma 4.2.

Proof of Lemma 4.2.

We hope to apply Lemma C.4 with n=nin=n_{i}, P(t)=xi(t)P^{(t)}=x_{i}^{(t)} and Z(t)=i(t)Z^{(t)}=\ell_{i}^{(t)}, as well as ζ=7η\zeta=7\eta. To do so, we must verify that the preconditions of Lemma C.4 hold when our sequences arise from the dynamics of players playing Optimistic Hedge with step size η\eta satisfying 1/Tη1CmH41/T\leq\eta\leq\frac{1}{C\cdot mH^{4}}.

Set C1=8256C_{1}=8256 (note that C1C_{1} is the constant appearing in item 2 of Lemma C.4 and item 1 of Lemma 4.7 in Section C.2). Our assumption that η1CmH4\eta\leq\frac{1}{C\cdot mH^{4}} implies that, as long as the constant CC satisfies C447C1=14794752C\geq 4^{4}\cdot 7\cdot C_{1}=14794752,

ηmin{α47C1,α036e5m}.\eta\leq\min\left\{\frac{\alpha^{4}}{7C_{1}},\frac{\alpha_{0}}{36e^{5}m}\right\}. (96)

To verify precondition 1, we apply Lemma 4.4 with the parameter α\alpha in the lemma set to α0\alpha_{0}: a valid selection as α0<1H+3\alpha_{0}<\frac{1}{H+3}. We conclude that, for each i[m]i\in[m], 0hH0\leq h\leq H and 1tTh1\leq t\leq T-h, it holds that (Dhi)(t)H(α0H3)h\left\|\left(\operatorname{D}_{h}{\ell_{i}}\right)^{(t)}\right\|_{\infty}\leq H\cdot\left(\alpha_{0}H^{3}\right)^{h} since ηα036e5m\eta\leq\frac{\alpha_{0}}{36e^{5}m} as required by the lemma. To verify precondition 2, we first confirm that our selection of ζ=7η\zeta=7\eta places it in the desired interval [1/(2T),α4/C1][1/(2T),\alpha^{4}/C_{1}] as ηα47C1\eta\leq\frac{\alpha^{4}}{7C_{1}}. By the definition of the Optimistic Hedge updates, for all i[m]i\in[m] and 1tT1\leq t\leq T, we have max{xi(t)xi(t+1),xi(t+1)xi(t)}exp(6η)\max\left\{\left\|\frac{x_{i}^{(t)}}{x_{i}^{(t+1)}}\right\|_{\infty},\left\|\frac{x_{i}^{(t+1)}}{x_{i}^{(t)}}\right\|_{\infty}\right\}\leq\exp(6\eta). Thus, the sequence xi(1),,xi(T)x_{i}^{(1)},\ldots,x_{i}^{(T)} is (7η)(7\eta)-consecutively close (since exp(6η)1+7η\exp(6\eta)\leq 1+7\eta for η\eta satisfying (96)). Therefore, Lemma C.4 applies and we have

t=1TVarxi(t)(i(t)i(t1))\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right) 2αt=1TVarxi(t)(i(t1))+165120(1+7η)H5+2\displaystyle\leq 2\alpha\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)+165120(1+7\eta)H^{5}+2
12t=1TVarxi(t)(i(t1))+CH5\displaystyle\leq\frac{1}{2}\cdot\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)+C^{\prime}H^{5}

for C=2+165120(1+7/8256)=165262C^{\prime}=2+165120(1+7/8256)=165262, as desired. ∎
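The multiplicative closeness of consecutive iterates invoked above can also be observed in simulation. The sketch below is our own illustration and assumes one standard way of writing the Optimistic Hedge update, in which the weight of each action is multiplied by exp(-eta * (2*loss_t - loss_prev)) before renormalizing; for losses in [0, 1] the consecutive iterates then stay within a multiplicative factor exp(6*eta), as used in the proof:

```python
import math, random

def optimistic_hedge_step(x, loss_t, loss_prev, eta):
    # Reweight by exp(-eta * (2*loss_t - loss_prev)) and renormalize (one standard
    # form of the recency-biased multiplicative-weights update).
    w = [xa * math.exp(-eta * (2 * lt - lp))
         for xa, lt, lp in zip(x, loss_t, loss_prev)]
    total = sum(w)
    return [wa / total for wa in w]

random.seed(4)
n, eta, T = 5, 0.05, 50
x = [1.0 / n] * n
loss_prev = [0.0] * n
for _ in range(T):
    loss_t = [random.random() for _ in range(n)]   # arbitrary losses in [0, 1]
    x_new = optimistic_hedge_step(x, loss_t, loss_prev, eta)
    ratio = max(max(a / b, b / a) for a, b in zip(x, x_new))
    assert ratio <= math.exp(6 * eta) + 1e-12      # consecutive closeness
    x, loss_prev = x_new, loss_t
print("consecutive iterates stay within a factor exp(6*eta)")
```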

So, it suffices to prove Lemma C.4.

Proof of Lemma C.4.

Set μ=H(α0H3)H\mu=H\cdot\left(\alpha_{0}H^{3}\right)^{H}. Since VarP(W)W2\operatorname{Var}_{{P}}\left({W}\right)\leq\|W\|_{\infty}^{2} for any PΔnP\in\Delta^{n} and WnW\in\mathbb{R}^{n}, item 1 gives

t=1THVarP(t)((DHZ)(t))αt=1TH+1VarP(t)((DH1Z)(t))+μ2T.\displaystyle\sum_{t=1}^{T-H}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{H}{Z}\right)^{(t)}}\right)\leq\alpha\cdot\sum_{t=1}^{T-H+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{H-1}{Z}\right)^{(t)}}\right)+\mu^{2}T. (97)

We will now prove, via reverse induction on hh, that for all hh satisfying H1h0H-1\geq h\geq 0,

t=1Th1VarP(t)((Dh+1Z)(t))α(1+2α)Hh1t=1ThVarP(t)((DhZ)(t))+2C0H2(2α0H3)2hα3,\sum_{t=1}^{T-h-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h+1}{Z}\right)^{(t)}}\right)\leq\alpha\cdot(1+2\alpha)^{H-h-1}\cdot\sum_{t=1}^{T-h}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h}{Z}\right)^{(t)}}\right)+\frac{2C_{0}\cdot H^{2}\cdot\left({2}\alpha_{0}H^{3}\right)^{2h}}{\alpha^{3}}, (98)

with C0=1290C_{0}=1290 (note that C0C_{0} is the constant appearing in the inequality (73) of the statement of Lemma 4.7 in Section C.2). The base case h=H1h=H-1 is verified by (97) and the fact that 22(H1)2HT2^{2(H-1)}\geq 2^{H}\geq T. Now suppose that (98) holds for some hh satisfying H1h1H-1\geq h\geq 1. We will now apply Lemma 4.7, with P(t)=P(t)P^{(t)}=P^{(t)} and Z(t)=(Dh1Z)(t)Z^{(t)}=\left(\operatorname{D}_{h-1}{Z}\right)^{(t)} for 1tTh+11\leq t\leq T-h+1, as well as M=H(2α0H3)h1M=H\cdot\left(2\alpha_{0}H^{3}\right)^{h-1}, ζ=ζ\zeta=\zeta, μ=2C0H2(2α0H3)2hα3\mu=\frac{2C_{0}\cdot H^{2}\cdot\left(2\alpha_{0}H^{3}\right)^{2h}}{\alpha^{3}}, and the parameter α\alpha of Lemma 4.7 set to α(1+2α)Hh1\alpha\cdot(1+2\alpha)^{H-h-1}. We verify that precondition 1 holds due to precondition 2 of Lemma C.4 and the fact that αα(1+2α)Hh1\alpha\leq\alpha\cdot(1+2\alpha)^{H-h-1}. Moreover, precondition 2 holds by our inductive hypothesis (98) and our choice of μ\mu. Therefore, by Lemma 4.7 and the fact that 1+α(1+2α)H1+2α1+\alpha\cdot(1+2\alpha)^{H}\leq 1+2\alpha for our choice of α=1/(4H)\alpha=1/(4H), it follows that

t=1ThVarP(t)((DhZ)(t))\displaystyle\sum_{t=1}^{T-h}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h}{Z}\right)^{(t)}}\right)\leq α(1+2α)Hht=1Th+1VarP(t)((Dh1Z)(t))+2C0H2(2α0H3)2hα4\displaystyle\alpha\cdot(1+2\alpha)^{H-h}\cdot\sum_{t=1}^{T-h+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h-1}{Z}\right)^{(t)}}\right)+\frac{2C_{0}\cdot H^{2}\cdot\left(2\alpha_{0}H^{3}\right)^{2h}}{\alpha^{4}}
+C0H2(2α0H3)2(h1)α3\displaystyle+\frac{C_{0}\cdot H^{2}\cdot\left(2\alpha_{0}H^{3}\right)^{2(h-1)}}{\alpha^{3}}
\displaystyle\leq α(1+2α)Hht=1Th+1VarP(t)((Dh1Z)(t))\displaystyle\alpha\cdot(1+2\alpha)^{H-h}\cdot\sum_{t=1}^{T-h+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h-1}{Z}\right)^{(t)}}\right)
+C0H2(2α0H3)2(h1)α3(1+2(2α0H3)2α)\displaystyle+\frac{C_{0}\cdot H^{2}\cdot(2\alpha_{0}H^{3})^{2(h-1)}}{\alpha^{3}}\cdot\left(1+\frac{2(2\alpha_{0}H^{3})^{2}}{\alpha}\right)
\displaystyle\leq α(1+2α)Hht=1Th+1VarP(t)((Dh1Z)(t))+2C0H2(2α0H3)2(h1)α3,\displaystyle\alpha\cdot(1+2\alpha)^{H-h}\cdot\sum_{t=1}^{T-h+1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{h-1}{Z}\right)^{(t)}}\right)+\frac{2C_{0}\cdot H^{2}\cdot(2\alpha_{0}H^{3})^{2(h-1)}}{\alpha^{3}},

where the final inequality follows since α0\alpha_{0} is chosen so that 2(2α0H3)2=α2(2\alpha_{0}H^{3})^{2}=\alpha. This completes the proof of the inductive step. Thus (98) holds for h=0h=0. Using again that the sequence P(t)P^{(t)} is ζ\zeta-consecutively close, we see that

t=1TVarP(t)(Z(t)Z(t1))\displaystyle\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}-Z^{(t-1)}}\right)
\displaystyle\leq 1+t=2TVarP(t)(Z(t)Z(t1))\displaystyle 1+\sum_{t=2}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}-Z^{(t-1)}}\right)
\displaystyle\leq 1+(1+ζ)t=1T1VarP(t)((D1Z)(t))\displaystyle 1+(1+\zeta)\sum_{t=1}^{T-1}\operatorname{Var}_{{P^{(t)}}}\left({\left(\operatorname{D}_{1}{Z}\right)^{(t)}}\right) (99)
\displaystyle\leq 1+(1+ζ)(α(1+2α)H1t=1TVarP(t)(Z(t))+2C0H2α3)\displaystyle 1+(1+\zeta)\cdot\left(\alpha(1+2\alpha)^{H-1}\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t)}}\right)+\frac{2C_{0}H^{2}}{\alpha^{3}}\right) (100)
\displaystyle\leq 2+(1+ζ)(α(1+2α)H1(1+ζ)t=2TVarP(t)(Z(t1))+2C0H2α3)\displaystyle 2+(1+\zeta)\cdot\left(\alpha(1+2\alpha)^{H-1}(1+\zeta)\sum_{t=2}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+\frac{2C_{0}H^{2}}{\alpha^{3}}\right) (101)
\displaystyle\leq α(1+2α)Ht=1TVarP(t)(Z(t1))+2(1+ζ)C0H2α3+2\displaystyle\alpha(1+2\alpha)^{H}\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+\frac{2(1+\zeta)C_{0}H^{2}}{\alpha^{3}}+2
\displaystyle\leq 2αt=1TVarP(t)(Z(t1))+2(1+ζ)C0H2α3+2,\displaystyle 2\alpha\sum_{t=1}^{T}\operatorname{Var}_{{P^{(t)}}}\left({Z^{(t-1)}}\right)+\frac{2(1+\zeta)C_{0}H^{2}}{\alpha^{3}}+2, (102)

where (99) and (101) follow from Lemma C.1, and (100) uses (98) for h=0h=0. Now, (102) verifies the statement of the lemma as C0=1290C_{0}=1290 and α=14H\alpha=\frac{1}{4H}. ∎
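Two elementary numeric facts recur in the induction above: for alpha = 1/(4H), one has (1+2*alpha)^H <= 2, and hence 1 + alpha*(1+2*alpha)^H <= 1 + 2*alpha. A quick check over a range of H (our own sketch):

```python
for H in range(1, 100):
    alpha = 1.0 / (4 * H)
    # (1 + 1/(2H))^H increases toward sqrt(e) < 2 as H grows
    assert (1 + 2 * alpha) ** H <= 2
    assert 1 + alpha * (1 + 2 * alpha) ** H <= 1 + 2 * alpha
print("geometric-factor bounds hold for H = 1, ..., 99")
```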

We are finally ready to prove Theorem 3.1. For convenience the theorem is restated below.

Theorem 3.1 (Restated).

There are constants C,C>1C,C^{\prime}>1 so that the following holds. Suppose a time horizon TT\in\mathbb{N} is given. Suppose all players play according to Optimistic Hedge with any positive step size η1Cmlog4T\eta\leq\frac{1}{C\cdot m\log^{4}T}. Then for any i[m]i\in[m], the regret of player ii satisfies

Regi,Tlogniη+ClogT.\displaystyle\operatorname{Reg}_{{i},{T}}\leq\frac{\log n_{i}}{\eta}+C^{\prime}\cdot\log T. (103)

In particular, if the players’ step size is chosen as η=1Cmlog4T\eta=\frac{1}{C\cdot m\log^{4}T}, then the regret of player ii satisfies

Regi,TO(mlognilog4T).\displaystyle\operatorname{Reg}_{{i},{T}}\leq O\left(m\cdot\log n_{i}\cdot\log^{4}T\right). (104)
Proof.

The conclusion of the theorem is immediate if T<4T<4, so we may assume from here on that T4T\geq 4. Moreover, the conclusion of (103) is immediate if η1/T\eta\leq 1/T (as Regi,TT\operatorname{Reg}_{{i},{T}}\leq T necessarily), so we may also assume that η1/T\eta\geq 1/T. Let C′′C^{\prime\prime} be the constant CC of Lemma 4.1, let BB be the constant called CC in Lemma 4.2, and let BB^{\prime} be the constant called CC^{\prime} in Lemma 4.2. As long as the constant CC of Theorem 3.1 is chosen so that CBC\geq B and so that η1Cmlog4T\eta\leq\frac{1}{C\cdot m\log^{4}T} implies C′′η1/6C^{\prime\prime}\eta\leq 1/6, we have the following:

Regi,T\displaystyle\operatorname{Reg}_{{i},{T}}\leq logniη+t=1T(η2+C′′η2)Varxi(t)(i(t)i(t1))t=1T(1C′′η)η2Varxi(t)(i(t1))\frac{\log n_{i}}{\eta}+\sum_{t=1}^{T}\left(\frac{\eta}{2}+C^{\prime\prime}\eta^{2}\right)\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\sum_{t=1}^{T}\frac{(1-C^{\prime\prime}\eta)\eta}{2}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right) (105)
\displaystyle\leq logniη+2η3t=1TVarxi(t)(i(t)i(t1))η3t=1TVarxi(t)(i(t1))\displaystyle\frac{\log n_{i}}{\eta}+\frac{2\eta}{3}\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t)}-\ell_{i}^{(t-1)}}\right)-\frac{\eta}{3}\sum_{t=1}^{T}\operatorname{Var}_{{x_{i}^{(t)}}}\left({\ell_{i}^{(t-1)}}\right)
\displaystyle\leq logniη+2η3(B(2logT)5)\displaystyle\frac{\log n_{i}}{\eta}+\frac{2\eta}{3}\cdot\left(B^{\prime}\cdot(2\log T)^{5}\right) (106)
\displaystyle\leq logniη+32BlogT,\displaystyle\frac{\log n_{i}}{\eta}+32B^{\prime}\cdot\log T, (107)

where (105) follows from Lemma 4.1, (106) follows from Lemma 4.2, and (107) follows from the upper bound $\eta \leq \frac{1}{C m \log^{4} T}$. We have thus established (103). The upper bound (104) follows immediately. ∎
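To make the object of analysis concrete, the following is a minimal sketch of the Optimistic Hedge update analyzed above: Hedge run on the cumulative losses plus one extra copy of the most recent loss (the recency bias), together with a naive regret computation. This is our own illustration, not code from the paper; all names are of our choosing, and losses are assumed to lie in $[0,1]^{n_i}$.

```python
import numpy as np

def optimistic_hedge_weights(losses, eta):
    """Return the Optimistic Hedge iterates x^(1), ..., x^(T+1).

    losses: array of shape (T, n); row t-1 holds the loss vector l^(t) in [0,1]^n.
    eta: positive step size (the theorem takes eta <= 1/(C * m * log^4 T)).

    Optimistic Hedge plays x^(t+1) proportional to
    exp(-eta * (sum_{s<=t} l^(s) + l^(t))), i.e. Hedge on the cumulative loss
    plus one extra copy of the most recent loss (the optimistic prediction).
    """
    T, n = losses.shape
    xs = [np.full(n, 1.0 / n)]  # x^(1) is the uniform distribution
    cumulative = np.zeros(n)
    for t in range(T):
        cumulative += losses[t]
        logits = -eta * (cumulative + losses[t])  # prediction m^(t+1) = l^(t)
        logits -= logits.max()                    # stabilize the softmax
        w = np.exp(logits)
        xs.append(w / w.sum())
    return xs

def regret(losses, xs):
    """External regret of the iterates xs against the best fixed action."""
    realized = sum(float(x @ l) for x, l in zip(xs, losses))
    best_fixed = float(losses.sum(axis=0).min())
    return realized - best_fixed
```

Under the theorem, when all $m$ players run this update with $\eta \leq \frac{1}{Cm\log^4 T}$, each player's regret is $O(m \log n_i \log^4 T)$; against an arbitrary loss sequence, only the generic $O(\log n_i / \eta + \eta T)$ guarantee applies.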

Appendix D Adversarial regret bounds

In this section we discuss how Optimistic Hedge can be modified to yield an algorithm that obtains the fast rates of Theorem 3.1 when played by all players, and which still obtains the optimal rate of $O(\sqrt{T})$ in the adversarial setting. Such guarantees are common in the literature [DDK11, RS13b, SALS15, KHSC18, HAM21]. The guarantees of this modification of Optimistic Hedge are stated in the following corollary (of Lemmas 4.1 and 4.2):

Corollary D.1.

There is an algorithm $\mathcal{A}$ which, if played by all $m$ players in a game, achieves the regret bound $\operatorname{Reg}_{i,T} \leq O(m \cdot \log n_{i} \cdot \log^{4} T)$ for each player $i$; moreover, when player $i$ is faced with an adversarial sequence of losses, the algorithm $\mathcal{A}$'s regret bound is $\operatorname{Reg}_{i,T} \leq O(m \log n_{i} \cdot \log^{4} T + \sqrt{T \log n_{i}})$.

Proof.

Let $C$ be the constant called $C$ in Theorem 3.1 and $C^{\prime}$ be the constant called $C^{\prime}$ in Lemma 4.2. The algorithm $\mathcal{A}$ of Corollary D.1 is obtained as follows:

  1. Initially run Optimistic Hedge, with the step size $\eta = \frac{1}{C m \log^{4} T}$.

  2. If, for some $T_{0} \geq 4$, (94) first fails to hold at time $T_{0}$, i.e.,

\[ \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)} - \ell_{i}^{(t-1)}\right) > \frac{1}{2} \cdot \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right) + C^{\prime} \lceil \log T \rceil^{5}, \tag{108} \]

  then set $\eta^{\prime} = \sqrt{\frac{\log n_{i}}{T}}$ and continue running Optimistic Hedge with step size $\eta^{\prime}$.

If there is no $T_{0} \geq 4$ so that (108) holds (and by Lemma 4.2, this will be the case when $\mathcal{A}$ is played by all $m$ players in a game), then the proof of Theorem 3.1 shows that the regret of each player $i$ is bounded as $\operatorname{Reg}_{i,T} \leq O(m \log n_{i} \cdot \log^{4} T)$. Otherwise, since $T_{0}$ is defined as the smallest integer at least 4 so that (108) holds, we have

\[ \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t)} - \ell_{i}^{(t-1)}\right) \leq \frac{1}{2} \cdot \sum_{t=1}^{T_{0}} \operatorname{Var}_{x_{i}^{(t)}}\left(\ell_{i}^{(t-1)}\right) + C^{\prime} \lceil \log T \rceil^{5} + 4, \]

and thus, by Lemma 4.1, for any $x^{\star} \in \Delta^{n_{i}}$,

\[ \sum_{t=1}^{T_{0}} \langle \ell_{i}^{(t)}, x_{i}^{(t)} - x^{\star} \rangle \leq \operatorname{Reg}_{i,T_{0}} \leq O(m \log n_{i} \cdot \log^{4} T_{0}). \tag{109} \]

Further, by the choice of step size $\eta^{\prime} = \sqrt{\frac{\log n_{i}}{T}}$ for time steps $t > T_{0}$, we have, for any $x^{\star} \in \Delta^{n_{i}}$,

\begin{align}
\sum_{t=T_{0}+1}^{T} \langle \ell_{i}^{(t)}, x_{i}^{(t)} - x^{\star} \rangle \leq{} & \frac{\log n_{i}}{\eta^{\prime}} + \eta^{\prime} \sum_{t=T_{0}+1}^{T} \|\ell_{i}^{(t)} - \ell_{i}^{(t-1)}\|_{\infty}^{2} \tag{110}\\
\leq{} & \frac{\log n_{i}}{\eta^{\prime}} + \eta^{\prime} T \leq O\left(\sqrt{T \log n_{i}}\right), \tag{111}
\end{align}

where (110) uses [SALS15, Proposition 7]. Adding (109) and (111) completes the proof of the corollary. ∎
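The two-phase algorithm $\mathcal{A}$ above can be sketched in code as follows. This is a hypothetical implementation, not code from the paper: the constants `C` and `C_prime` stand in for the unspecified constants of Theorem 3.1 and Lemma 4.2, `variance` computes $\operatorname{Var}_{x}(\cdot)$, and recomputing the iterate from the cumulative losses after the switch is one simple interpretation of "continue running Optimistic Hedge with step size $\eta'$".

```python
import numpy as np

def variance(x, v):
    """Var_x(v) = E_x[v^2] - (E_x[v])^2 for a vector v under distribution x."""
    mean = float(x @ v)
    return float(x @ (v * v)) - mean ** 2

def adaptive_optimistic_hedge(loss_stream, T, n, m, C=4.0, C_prime=4.0):
    """Sketch of the two-phase algorithm A of Corollary D.1.

    Runs Optimistic Hedge with the small "game" step size
    eta = 1 / (C * m * log^4 T); if condition (108) ever fails (evidence that
    the other players are not also running the algorithm), it falls back to
    the robust step size eta' = sqrt(log n / T).
    """
    eta = 1.0 / (C * m * np.log(T) ** 4)
    x = np.full(n, 1.0 / n)          # x^(1) is uniform
    cumulative = np.zeros(n)
    prev_loss = np.zeros(n)          # l^(0) := 0
    lhs = rhs = 0.0                  # running sums appearing in (108)
    switched = False
    plays = []
    for t, loss in enumerate(loss_stream, start=1):
        plays.append(x)
        # update the two variance sums of condition (108)
        lhs += variance(x, loss - prev_loss)
        rhs += variance(x, prev_loss)
        if (not switched and t >= 4
                and lhs > 0.5 * rhs + C_prime * np.ceil(np.log(T)) ** 5):
            switched = True
            eta = np.sqrt(np.log(n) / T)      # adversarial fallback step size
        cumulative += loss
        logits = -eta * (cumulative + loss)   # Optimistic Hedge update
        logits -= logits.max()
        w = np.exp(logits)
        x = w / w.sum()
        prev_loss = loss
    return plays, switched
```

When all $m$ players run this procedure, Lemma 4.2 guarantees the switch never triggers, so each player retains the polylogarithmic regret of Theorem 3.1; against an adversary the fallback step size recovers the $O(\sqrt{T \log n_i})$ rate.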