
Pseudonorm Approachability and
Applications to Regret Minimization

Christoph Dann
Google Research
   Yishay Mansour
Google Research
   Mehryar Mohri
Google Research
   Jon Schneider
Google Research
   Balasubramanian Sivan
Google Research
Abstract

Blackwell’s celebrated approachability theory provides a general framework for a variety of learning problems, including regret minimization. However, Blackwell’s proof and implicit algorithm measure approachability using the \ell_{2} (Euclidean) distance. We argue that in many applications such as regret minimization, it is more useful to study approachability under other distance metrics, most commonly the \ell_{\infty}-metric. However, the time and space complexity of the algorithms designed for \ell_{\infty}-approachability depends on the dimension of the space of the vectorial payoffs, which is often prohibitively large. Thus, we present a framework for converting high-dimensional \ell_{\infty}-approachability problems to low-dimensional pseudonorm approachability problems, thereby resolving such issues. We first show that the \ell_{\infty}-distance between the average payoff and the approachability set can be equivalently defined as a pseudodistance between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for \ell_{2} and other norms, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use this to show, modulo mild normalization assumptions, that there exists an \ell_{\infty}-approachability algorithm whose convergence is independent of the dimension of the original vectorial payoff. We further show that this algorithm can be implemented in polynomial time, assuming that the original \ell_{\infty}-distance can be computed efficiently. We also give an \ell_{\infty}-approachability algorithm whose convergence is logarithmic in that dimension, using an FTRL algorithm with a maximum-entropy regularizer. Finally, we illustrate the benefits of our framework by applying it to several problems in regret minimization.

1 Introduction

The notion of approachability introduced by Blackwell (1956) can be viewed as an extension of von Neumann’s minimax theorem (von Neumann, 1928) to the case of vectorial payoffs. Blackwell gave a simple example showing that the straightforward analog of von Neumann’s minimax theorem does not hold for vectorial payoffs. However, in contrast with this negative result for one-shot games, he proved that, in repeated games, a player admits an adaptive strategy guaranteeing that their average payoff approaches a closed convex set in the limit, provided that the set satisfies a natural separability condition.

The theory of Blackwell approachability is intimately connected with the field of online learning for the reason that the problem of regret minimization can be viewed as an approachability problem: in particular, the learner would like their vector of regrets (with respect to each competing benchmark) to converge to a non-positive vector. In this vein, Abernethy et al. (2011) demonstrated how to use algorithms for approachability to solve a general class of regret minimization problems (and conversely, how to use regret minimization to construct approachability algorithms). However, applying their reduction sometimes leads to suboptimal regret guarantees – for example, for the specific case of minimizing external regret over T rounds with K actions, their reduction results in an algorithm with O(\sqrt{TK}) regret (instead of the optimal O(\sqrt{T\log K}) regret bound achievable by e.g. multiplicative weights).

One reason for this suboptimality is the choice of distance used to define approachability. Both Blackwell and Abernethy et al. consider approachability algorithms that minimize the Euclidean (\ell_{2}) distance between their average payoff and the desired set. We argue that, for applications to regret minimization, it is often more useful to study approachability under other distance metrics, most commonly approachability under the \ell_{\infty} metric to the non-positive orthant, which is well suited to capture the fact that regret is a maximum over various competing benchmarks. This has been observed in several recent publications (Perchet, 2015; Shimkin, 2016; Kwon, 2021). In particular, by constructing algorithms for \ell_{\infty} approachability, it is possible to naturally recover an O(\sqrt{T\log K}) external regret learning algorithm (and algorithms with optimal regret guarantees for many other problems of interest).

However, there is still one significant problem with developing regret minimization algorithms via \ell_{\infty} approachability (or any of the forms of approachability previously mentioned): the time and space complexity of these algorithms depends polynomially on the dimension d of the space of vectorial payoffs, which in turn equals the number of benchmarks we compete against in our regret minimization problem. In some regret minimization settings, this can be prohibitively expensive. For example, in the setting of swap regret (where the benchmarks are parameterized by the d=K^{K} swap functions mapping [K] to [K]), this results in algorithms with complexity exponential in K. On the other hand, there exist algorithms, e.g. (Blum and Mansour, 2007), which are both efficient (\operatorname*{poly}(K) time and space) and obtain optimal regret guarantees.

1.1 Main results

In this paper, we present a framework for converting high-dimensional \ell_{\infty} approachability problems to low-dimensional “pseudonorm” approachability problems, in turn resolving many of these issues. To be precise, recall that the setting of approachability can be thought of as a T-round repeated game, where in round t the learner chooses an action p_{t} from some convex “action” set {\mathscr{P}}\subseteq\mathbb{R}^{n}, the adversary simultaneously chooses an action \ell_{t} from some convex “loss” set {\mathscr{L}}\subseteq\mathbb{R}^{m}, and the learner receives a vector-valued payoff u(p_{t},\ell_{t}), where u\colon{\mathscr{P}}\times{\mathscr{L}}\rightarrow\mathbb{R}^{d} is a d-dimensional bilinear function. The learner would like the \ell_{\infty} distance between their average payoff \frac{1}{T}\sum u(p_{t},\ell_{t}) and some convex set\footnote{For simplicity, throughout this paper we assume {\mathscr{S}} to be the negative orthant (-\infty,0]^{d}, as this is the case most relevant to regret minimization. However, most of our results extend straightforwardly to arbitrary convex sets.} {\mathscr{S}} to be as small as possible.

We first demonstrate how to construct a new d^{\prime}-dimensional bilinear function \widetilde{u}\colon{\mathscr{P}}\times{\mathscr{L}}\rightarrow\mathbb{R}^{d^{\prime}}, a new convex set {\mathscr{S}}^{\prime}\subseteq\mathbb{R}^{d^{\prime}}, and a pseudonorm\footnote{In this paper a pseudonorm is a function f\colon\mathbb{R}^{d^{\prime}}\rightarrow\mathbb{R}_{\geq 0} which satisfies most of the properties of a norm (e.g. positive homogeneity, triangle inequality), but may be asymmetric (there may exist x where f(x)\neq f(-x)) and may not be definite (there may exist x\neq 0 where f(x)=0). Just as a norm \lVert\cdot\rVert defines a distance between x,y\in\mathbb{R}^{d^{\prime}} via \lVert x-y\rVert, a pseudonorm f defines the pseudodistance f(x-y).} f such that the “pseudodistance” between the average modified payoff \frac{1}{T}\sum\widetilde{u}(p_{t},\ell_{t}) and {\mathscr{S}}^{\prime} is equal to the \ell_{\infty} distance between the original average payoff and {\mathscr{S}}. Importantly, the new dimension d^{\prime} is equal to mn and is independent of the original dimension d.

We then develop an algorithmic theory of pseudonorm approachability analogous to that developed in (Abernethy et al., 2011) for the \ell_{2} norm and in (Shimkin, 2016; Kwon, 2021) for other norms, showing that, in order to perform pseudonorm approachability, it suffices to be able to perform online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball (and that the rate of approachability is directly related to the regret guarantees of this OLO subalgorithm). This has the following consequences for approachability:

  • First, by solving this OLO problem with a quadratic regularized Follow-The-Regularized-Leader (FTRL) algorithm, we show (modulo mild normalization assumptions on the sizes of {\mathscr{P}}, {\mathscr{L}}, and u) that there exists a pseudonorm approachability algorithm (and hence an \ell_{\infty} approachability algorithm for the original problem) which converges at a rate of O(nm/\sqrt{T}). We additionally provide a stronger bound on the rate which scales as O(D_{u}D_{p}D_{\ell}/\sqrt{T}), where D_{p}=\operatorname*{diam}{\mathscr{P}}, D_{\ell}=\operatorname*{diam}{\mathscr{L}}, and D_{u} is the maximum \ell_{2} norm of the set of vectors formed by taking the coefficients of components of u (Theorem 3.9). In comparison, the best-known generic guarantee for \ell_{\infty}-approachability prior to this work converged at a d-dependent rate of O(\sqrt{(\log d)/T}).

  • Second, we show that as long as we can evaluate the original \ell_{\infty}-distance between \frac{1}{T}\sum u(p_{t},\ell_{t}) and (-\infty,0]^{d} efficiently, we can implement the above algorithm in \operatorname*{poly}(m,n) time per round (Theorem 3.15). This has the following natural consequence for the class of regret minimization problems that can be written as \ell_{\infty}-approachability problems: if it is possible to efficiently compute some notion of regret for a sequence of losses and actions, then there is an efficient (in the dimensions of the actions and losses) learner that minimizes this regret.

  • Finally, in some cases, the O(\sqrt{(\log d)/T}) approachability rate from (inefficient) \ell_{\infty}-approachability outperforms the rate obtained by the quadratic regularized FTRL algorithm. We define a new regularizer whose value is given by finding the maximum-entropy distribution within a subset of the distributions of support d, and show that, by using this regularizer, we recover this O(\sqrt{(\log d)/T}) rate. In particular, whenever we can efficiently compute this maxent regularizer, there is an efficient learning algorithm with an O(\sqrt{(\log d)/T}) approachability rate.

We then apply our framework to various problems in regret minimization:

  • We show that our framework straightforwardly recovers a regret-optimal and efficient algorithm for swap regret minimization. Doing so requires computing the above maximum-entropy regularizer for this specific case, where we show that it has a nice closed form. In particular, to our knowledge, this is the first approachability-based algorithm for swap regret that is both efficient and has the optimal minimax regret.

  • In Section 4.3, we apply our framework to develop the first efficient contextual learning algorithms with low Bayesian swap regret. Such algorithms have the property that if learners employ them in a repeated Bayesian game, the time-average of their strategies will converge to a Bayesian correlated equilibrium, a well-studied equilibrium notion in game theory (see e.g. Bergemann and Morris (2016)).

    This notion of Bayesian swap regret was recently introduced by Mansour et al. (2022), who also provided an algorithm with low Bayesian swap regret, albeit one that is not computationally efficient. By applying our framework, we easily obtain an efficient contextual learning algorithm with O(CK\sqrt{T}) Bayesian swap regret, resolving an open question of Mansour et al. (2022) (here C is the number of “contexts” / “types” of the learner).

  • In Section 4.4, we further analyze the application of our general \ell_{\infty}-approachability theory and algorithm to reinforcement learning (RL) with vectorial losses. We point out how our framework can provide a general solution in the full-information setting with known transition probabilities, and how we can recover the best known solution for standard regret minimization in episodic RL. More importantly, we show how our framework and algorithm can lead to an algorithm for constrained MDPs with a significantly more favorable regret guarantee, logarithmic in the number of constraints k, in contrast with the \sqrt{k}-dependency of the results of Miryoosefi et al. (2019).

1.2 Related Work

There is a wide literature dealing with various aspects of Blackwell’s approachability, including its applications to game theory, regret minimization, reinforcement learning, and multiple extensions.

Hart and Mas-Colell (2000) described an adaptive procedure for players in a game based on Blackwell’s approachability, which guarantees that the empirical distribution of play converges to the set of correlated equilibria. This procedure is related to internal regret minimization, for which, as shown by Foster and Vohra (1999), the existence of an algorithm follows from the proof of Hart and Mas-Colell (2000). Hart and Mas-Colell (2001) further gave a general class of adaptive strategies based on approachability. Approachability has been widely used for calibration (Dawid, 1982); see Foster and Hart (2018) for recent work on the topic. Approachability and partial monitoring were studied in a series of publications by Perchet (2010); Mannor et al. (2014a, b); Perchet and Quincampoix (2015, 2018); Kwon and Perchet (2017). More recently, approachability has also been used in the analysis of fairness in machine learning (Chzhen et al., 2021).

Approachability has also been extensively used in the context of reinforcement learning. Mannor and Shimkin (2003) discussed an extension of regret minimization in competitive Markov decision processes (MDPs) whose analysis is based on Blackwell’s approachability theory. Mannor and Shimkin (2004) presented a geometric approach to multiple-criteria reinforcement learning formulated as approachability conditions. Kalathil et al. (2014) presented strategies for approachability for MDPs and Stackelberg stochastic games based on Blackwell’s approachability theory. More recently, Miryoosefi et al. (2019) used approachability to derive solutions for reinforcement learning with convex constraints.

The notion of approachability was further extended in several studies. Vieille (1992) used differential games with a fixed duration to study weak approachability in finite-dimensional spaces. Spinat (2002) formulated a necessary and sufficient condition for the approachability of not necessarily convex sets. Lehrer (2003) extended Blackwell’s approachability theory to infinite-dimensional spaces.

The most closely related work to this paper, which we build upon, is that of Abernethy et al. (2011) who showed that, remarkably, any algorithm for Blackwell’s approachability could be converted into one for online convex optimization and vice-versa. Bernstein and Shimkin (2015) also discussed a related response-based approachability algorithm.

Perchet (2015) presented a specific study of \ell_{\infty} approachability, for which they gave an exponential weight algorithm. Shimkin (2016, Section 5) studied approachability for an arbitrary norm and gave a general duality result using Sion’s minimax theorem. The pseudonorm duality theorem we prove, using Fenchel duality, can be viewed as a generalization. Kwon (2016, 2021) also presented a duality theorem similar to that of Shimkin (2016), which they used to derive an FTRL algorithm for general norm approachability. They further treated the special case of internal and swap regret. However, unlike the algorithms derived in this work, the computational complexity of their swap regret algorithm is in O(K^{K}). This is also true of Perchet (2015), which also analyzes the swap regret problem.

It is known that if all players follow a swap regret minimization algorithm, then the empirical distribution of their play converges to a correlated equilibrium (Blum and Mansour, 2007). Hazan and Kale (2008) showed a result generalizing this property to the case of \Phi-regret and \Phi-equilibrium, where the \Phi-regret is the difference between the cumulative expected loss suffered by the learner and that of the best \Phi-modification of the sequence in hindsight. Gordon et al. (2008) further generalized the results of Hazan and Kale (2008) to a more general class of \Phi-modification regrets. The algorithms discussed in (Gordon et al., 2008) are distinct from those discussed in this paper (they do not clearly extend to the general approachability setting, and they require significantly different computational assumptions than ours). Nevertheless, they bear some similarity with our work.

2 Preliminaries

Notation.

We use [n] as a shorthand for the set \{1,2,\dots,n\}. We write \Delta_{d}=\{x\in\mathbb{R}^{d}\mid x_{i}\geq 0,\sum x_{i}=1\} to denote the simplex over d dimensions and {\overline{\Delta}}_{d}=\{x\in\mathbb{R}^{d}\mid x_{i}\geq 0,\sum x_{i}\leq 1\} to denote the convex hull of the d-simplex with the origin. \operatorname*{conv}(S) denotes the convex hull of the points in S, and \operatorname*{cone}(S)=\{\alpha x\mid\alpha\geq 0,x\in\operatorname*{conv}(S)\} the convex cone generated by the points in S.

Some of the more standard proofs have been deferred to Appendix A.

2.1 Blackwell approachability and regret minimization

We begin by illustrating the theory of Blackwell approachability for the specific case of the \ell_{\infty}-distance; this case is both particularly suited to the application of regret minimization, and will play an important role in the results (e.g. reductions to pseudonorm approachability) that follow.

We consider a repeated game setting, where every round t a learner chooses an action p_{t} belonging to a bounded\footnote{We bound the entries of {\mathscr{P}}, {\mathscr{L}}, and u within [-1,1] for convenience, but it is generally easy to translate between different boundedness assumptions (since almost all relevant quantities are linear). We express the majority of our theorem statements (with the notable exception of Theorem 3.9) in a way that is independent of the choice of bounds.} convex set {\mathscr{P}}\subseteq[-1,1]^{n}, and an adversary simultaneously chooses a loss \ell_{t} belonging to a bounded convex set {\mathscr{L}}\subseteq[-1,1]^{m}. Let u\colon{\mathscr{P}}\times{\mathscr{L}}\rightarrow[-1,1]^{d} be a bounded bilinear\footnote{We briefly note that all our results also hold for biaffine functions u_{i}; in particular, extending the loss and action sets slightly (by replacing {\mathscr{P}} and {\mathscr{L}} with {\mathscr{P}}\times\{1\} and {\mathscr{L}}\times\{1\}) allows us to write any biaffine function over the original sets as a bilinear function over the extended sets.} vector-valued payoff function, and let {\mathscr{S}}\subseteq\mathbb{R}^{d} be a closed convex set with the property that for every \ell\in{\mathscr{L}}, there exists a p\in{\mathscr{P}} such that u(p,\ell)\in{\mathscr{S}} (we say that such a set is “separable”). When d=1, the minimax theorem implies that there exists a single p\in{\mathscr{P}} such that for all \ell, u(p,\ell)\in{\mathscr{S}}.

This is not true for d>1, but the theory of Blackwell approachability provides the following algorithmic analogue of this statement. Define a learning algorithm \mathcal{A} to be a collection of functions \mathcal{A}_{t}\colon{\mathscr{L}}^{t-1}\rightarrow{\mathscr{P}} for each t\in[T], where \mathcal{A}_{t} describes how the learner decides their action p_{t} as a function of the observed losses \ell_{1},\ell_{2},\dots,\ell_{t-1} up until time t-1. Blackwell approachability guarantees that there exists a learning algorithm \mathcal{A} such that when \mathcal{A} is run on any loss sequence \bm{\ell}, the resulting action sequence \bm{p} has the property that:

\lim_{T\rightarrow\infty}d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),{\mathscr{S}}\right)=0. (1)

(Here, for v,w\in\mathbb{R}^{d}, d_{\infty}(v,w)=\max_{i}|v_{i}-w_{i}| represents the \ell_{\infty} distance between v and w, and d_{\infty}(v,{\mathscr{S}})=\inf_{s\in{\mathscr{S}}}d_{\infty}(v,s) the \ell_{\infty} distance between v and the set {\mathscr{S}}.)

As mentioned, one of the main motivations for studying Blackwell approachability is its connections to regret minimization. In particular, for a fixed choice of u, define

\operatorname{\mathsf{Reg}}(\mathbf{p},\bm{\ell})=\max\left(\max_{i\in[d]}\left(\sum_{t=1}^{T}u_{i}(p_{t},\ell_{t})\right),0\right). (2)

Note first that this definition of “regret” is exactly T times the \ell_{\infty} approachability distance in the case where {\mathscr{S}}=(-\infty,0]^{d} is the negative orthant; that is,

\operatorname{\mathsf{Reg}}(\bm{p},\bm{\ell})=T\cdot d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),(-\infty,0]^{d}\right). (3)

But secondly, note that by choosing {\mathscr{P}}, {\mathscr{L}}, and u carefully, the definition (2) can capture a wide variety of forms of regret studied in regret minimization. For example:

  • When {\mathscr{P}}=\Delta_{K}, {\mathscr{L}}=[0,1]^{K}, d=K, and u(p,\ell)_{i}=\langle p,\ell\rangle-\ell_{i}, \operatorname{\mathsf{Reg}}(\bm{p},\bm{\ell}) is the external regret of playing action sequence \bm{p} against loss sequence \bm{\ell}; i.e., it measures the regret compared to the best single action.

  • When {\mathscr{P}}=\Delta_{K}, {\mathscr{L}}=[0,1]^{K}, d=K^{K}, and (for each function \pi\colon[K]\rightarrow[K]) u(p,\ell)_{\pi}=\sum_{i=1}^{K}p_{i}(\ell_{i}-\ell_{\pi(i)}), \operatorname{\mathsf{Reg}}(\bm{p},\bm{\ell}) is the swap regret of playing action sequence \bm{p} against loss sequence \bm{\ell}; i.e., it measures the regret compared to the best action sequence \bm{p^{\prime}} obtained by applying a fixed swap function to sequence \bm{p} (see the code sketch following this list).

  • When {\mathscr{P}}\subseteq\mathbb{R}^{n} is a convex polytope, {\mathscr{L}}=[0,1]^{n}, d=|V({\mathscr{P}})| (where V({\mathscr{P}}) is the set of vertices of {\mathscr{P}}), and (for each vertex v\in V({\mathscr{P}})) u(p,\ell)_{v}=\langle p,\ell\rangle-\langle v,\ell\rangle, this captures the (external) regret from performing online linear optimization over the polytope {\mathscr{P}} (see Section 2.2).

  • Finally, to illustrate the power of this framework, we present an unusual swap regret minimization problem that we call “Procrustean swap regret minimization” (after the orthogonal Procrustes problem, see (Gower and Dijksterhuis, 2004)). Let {\mathscr{P}}=\{x\in\mathbb{R}^{n}\,\colon||x||_{2}\leq 1\} be the unit ball in n dimensions, {\mathscr{L}}=[-1,1]^{n}, and, for each orthogonal matrix\footnote{Technically, this leads to an infinite-dimensional u (since the group of orthogonal matrices is infinite), but one can instead take an arbitrarily fine discrete approximation of the set. Indeed, one of the advantages of the results we present is that they are largely independent of the dimension d of u.} Q\in O(n), let u(p,\ell)_{Q}=\langle p,\ell\rangle-\langle Qp,\ell\rangle.
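To make these regret notions concrete, here is a short Python sketch (ours, not part of the paper) that evaluates the external-regret and swap-regret instances above for given action and loss sequences; the array shapes and helper names are our own choices. Note that the maximum over the K^{K} swap benchmarks decouples across actions, so it can be evaluated without enumerating them.

import numpy as np

def external_regret(p, losses):
    # Reg from equation (2) with u(p, l)_i = <p, l> - l_i; p and losses have shape (T, K).
    per_round = np.einsum('tk,tk->t', p, losses)       # learner's loss in each round
    totals = per_round.sum() - losses.sum(axis=0)      # cumulative regret vs. each fixed action
    return max(totals.max(), 0.0)

def swap_regret(p, losses):
    # Reg from equation (2) with u(p, l)_pi = sum_i p_i (l_i - l_pi(i)); the best swap
    # function decouples across source actions, so we never enumerate all K^K of them.
    weighted = (p * losses).sum(axis=0)                # sum_t p_{t,i} l_{t,i}
    cross = p.T @ losses                               # cross[i, j] = sum_t p_{t,i} l_{t,j}
    return max((weighted - cross.min(axis=1)).sum(), 0.0)

T, K = 1000, 5
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(K), size=T)                  # an arbitrary action sequence in Delta_K
losses = rng.uniform(0.0, 1.0, size=(T, K))            # an arbitrary loss sequence in [0, 1]^K
print(external_regret(p, losses), swap_regret(p, losses))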

When the negative orthant (-\infty,0]^{d} is separable with respect to u, which is true in all of the above examples, the theory of Blackwell approachability immediately guarantees the existence of a sublinear regret learning algorithm \mathcal{A} for the corresponding notion of regret. Specifically, define the regret of a learning algorithm \mathcal{A} to be the worst-case regret over all possible loss sequences \bm{\ell}\in{\mathscr{L}}^{T}; i.e.,

\operatorname{\mathsf{Reg}}(\mathcal{A})=\max_{\bm{\ell}\in{\mathscr{L}}^{T}}\,\operatorname{\mathsf{Reg}}(\bm{p},\bm{\ell}), (4)

where p_{t}=\mathcal{A}_{t}(\ell_{1},\dots,\ell_{t-1}). Then (1) implies that the same algorithm \mathcal{A} satisfies \operatorname{\mathsf{Reg}}(\mathcal{A})=o(T). Motivated by this application, we will restrict our attention for the remainder of this paper to the setting where {\mathscr{S}}=(-\infty,0]^{d} and will assume (unless otherwise specified) that this {\mathscr{S}} is separable with respect to the bilinear function u we consider.

In fact, the theory of \ell_{\infty}-approachability is constructive and allows us to write down explicit algorithms \mathcal{A} along with explicit (and in many cases, near optimal) regret bounds. However, before we introduce these algorithms, we will need to introduce the problem of online linear optimization.

2.2 Online linear optimization

Here we discuss algorithms for online linear optimization (OLO), a special case of online convex optimization where all the loss functions are linear functions of the learner’s action. These will form an important primitive of our algorithms for approachability.

Let {\mathscr{X}},{\mathscr{Y}} be two bounded convex subsets of \mathbb{R}^{r} and consider the following learning problem. Every round t (for T rounds), our learning algorithm must choose an element x_{t}\in{\mathscr{X}} as a function of y_{1},y_{2},\dots,y_{t-1}. The adversary simultaneously chooses a loss “function” y_{t}\in{\mathscr{Y}}. This causes the learner to incur a loss of \langle x_{t},y_{t}\rangle in round t. The goal of the learner is to minimize their regret, defined as the difference between their total loss and the loss they would have incurred by playing the best fixed action in hindsight, i.e.,

\operatorname{\mathsf{Reg}}(\bm{x},\bm{y})=\max_{x^{*}\in{\mathscr{X}}}\left(\sum_{t=1}^{T}\langle x_{t},y_{t}\rangle-\sum_{t=1}^{T}\langle x^{*},y_{t}\rangle\right).

There are many similarities between this problem and the approachability and regret minimization problems discussed in Section 2.1. For example, if we take {\mathscr{X}}={\mathscr{P}}=\Delta_{K} and {\mathscr{Y}}={\mathscr{L}}=[0,1]^{K}, then OLO is equivalent to the problem of external regret minimization. However, not all regret minimization problems can be written directly as an instance of OLO – for example, there is no clear way to write swap regret minimization as an OLO instance. Eventually we will demonstrate how to apply OLO as a subroutine to solve any regret minimization problem, but this will involve a reduction to Blackwell approachability and will require running OLO on different spaces than the action/loss sets {\mathscr{P}} and {\mathscr{L}} directly (which is why we distinguish the action/loss sets for OLO as {\mathscr{X}} and {\mathscr{Y}}, respectively).

There is an important subclass of algorithms for OLO known as Follow the Regularized Leader (FTRL) algorithms (Shalev-Shwartz, 2007; Abernethy et al., 2008). An FTRL algorithm is completely specified by a strongly convex function R\colon{\mathscr{X}}\rightarrow\mathbb{R}, and plays the action

x_{t}=\operatorname*{argmin}_{x\in{\mathscr{X}}}\left(R(x)+\sum_{s=1}^{t-1}\langle x,y_{s}\rangle\right).

In words, this algorithm plays the action that minimizes the total loss on the rounds until the present (“following the leader”), subject to an additional regularization term. It is possible to characterize the worst-case regret of an FTRL algorithm in terms of properties of the regularizer R and the sets {\mathscr{X}} and {\mathscr{Y}} (see e.g. Theorem 15 of Hazan (2016)). For our purposes, we will only need the following two results for specific regularizers.

Lemma 2.1 (Quadratic regularizer, (Zinkevich, 2003)).

Let D_{x}=\operatorname*{diam}{\mathscr{X}} and D_{y}=\max_{y\in{\mathscr{Y}}}||y||_{2}. Let x_{0} be an arbitrary element of {\mathscr{X}}, and let R(x)=||x-x_{0}||^{2}. Then, the FTRL algorithm with regularizer R incurs worst-case regret at most O(D_{x}D_{y}\sqrt{T}).

Lemma 2.2 (Negative entropy regularizer, (Kivinen and Warmuth, 1995)).

Let {\mathscr{X}}=\Delta_{d}, {\mathscr{Y}}=[0,1]^{d}, and let R(x)=\sum_{i=1}^{d}x_{i}\log x_{i} (where we extend this to the boundary of {\mathscr{X}} by letting 0\log 0=0). Then, the FTRL algorithm with regularizer R incurs worst-case regret at most O(\sqrt{T\log d}).
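As an illustration, here is a minimal Python sketch (ours; a learning rate eta is folded into the regularizer, an assumption not spelled out in the lemmas above) of the two FTRL updates over {\mathscr{X}}=\Delta_{d}: the quadratic version reduces to a Euclidean projection of the follow-the-leader point onto the simplex, and the negative-entropy version has the familiar multiplicative-weights closed form.

import numpy as np

def project_to_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def ftrl_quadratic(cum_loss, eta):
    # argmin_x ||x||^2 / eta + <x, cum_loss> over Delta_d (taking x_0 = 0):
    # the unconstrained minimizer is -eta * cum_loss / 2, then project onto the simplex.
    return project_to_simplex(-eta * cum_loss / 2.0)

def ftrl_entropy(cum_loss, eta):
    # argmin_x (1/eta) * sum_i x_i log x_i + <x, cum_loss> over Delta_d.
    logits = -eta * cum_loss
    w = np.exp(logits - logits.max())        # stabilized exponentiation
    return w / w.sum()

# One run of the OLO protocol against random linear losses in [0, 1]^d.
d, T, eta = 10, 500, 0.1
rng = np.random.default_rng(1)
cum = np.zeros(d)
for t in range(T):
    x = ftrl_entropy(cum, eta)               # or ftrl_quadratic(cum, eta)
    y = rng.uniform(0.0, 1.0, size=d)        # adversary's loss vector for this round
    cum += y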

2.3 Algorithms for \ell_{\infty}-approachability

We can now write down an explicit description of our algorithm for \ell_{\infty}-approachability (in terms of a blackbox OLO algorithm) and get quantitative bounds on the rate of convergence in the LHS of (1). Let \mathcal{F} be an OLO algorithm for the sets {\mathscr{X}}={\overline{\Delta}}_{d} and {\mathscr{Y}}=[-1,1]^{d}. Then we can describe our algorithm \mathcal{A} for \ell_{\infty}-approachability as Algorithm 1.

Initialize \theta_{1} to an arbitrary point in {\overline{\Delta}}_{d};
for t\leftarrow 1 to T do
   Choose p_{t}\in{\mathscr{P}} so that for all \ell\in{\mathscr{L}}, \langle\theta_{t},u(p_{t},\ell)\rangle\leq 0. Such a p_{t} is guaranteed to exist by the separability condition on u, since \langle\theta,s\rangle\leq 0 for any \theta\in{\overline{\Delta}}_{d} and s\in(-\infty,0]^{d};
   Play action p_{t} and receive as feedback \ell_{t}\in{\mathscr{L}};
   Set y_{t}=-u(p_{t},\ell_{t});
   Set \theta_{t+1}=\mathcal{F}(y_{1},y_{2},\dots,y_{t}).
end for
Algorithm 1 Description of Algorithm \mathcal{A} for \ell_{\infty}-approachability
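As a concrete illustration, the following Python sketch (ours, not from the paper) instantiates Algorithm 1 for the external-regret example of Section 2.1 ({\mathscr{P}}=\Delta_{K}, {\mathscr{L}}=[0,1]^{K}, d=K, u(p,\ell)_{i}=\langle p,\ell\rangle-\ell_{i}). For this instance the halfspace step has a closed form: playing p_{t} proportional to \theta_{t} makes \langle\theta_{t},u(p_{t},\ell)\rangle=0 for every \ell, and for simplicity we run the entropy FTRL of Lemma 2.2 directly on the simplex as \mathcal{F}, so the scheme behaves like an exponential-weights method.

import numpy as np

def algorithm_one_external_regret(losses, eta):
    T, K = losses.shape
    cum_y = np.zeros(K)                 # cumulative OLO losses y_t = -u(p_t, l_t)
    total_u = np.zeros(K)               # running sum of the vector payoffs u(p_t, l_t)
    for t in range(T):
        logits = -eta * cum_y
        w = np.exp(logits - logits.max())
        theta = w / w.sum()             # theta_t from entropy FTRL (Lemma 2.2)
        p = theta                       # halfspace step: <theta, u(p, l)> = 0 for all l
        l = losses[t]
        u = p @ l - l                   # u(p_t, l_t)_i = <p, l> - l_i
        total_u += u
        cum_y += -u                     # feed y_t = -u(p_t, l_t) to the OLO algorithm
    # l_inf distance of the average payoff to the negative orthant = (external regret) / T.
    return max(np.max(total_u / T), 0.0)

rng = np.random.default_rng(2)
print(algorithm_one_external_regret(rng.uniform(0.0, 1.0, size=(2000, 5)), eta=0.1))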

It turns out that we can relate the \ell_{\infty} approachability distance to (-\infty,0]^{d} (and hence the regret of this algorithm \mathcal{A}) to the regret of our OLO algorithm \mathcal{F}.

Theorem 2.3.

We have that

\operatorname{\mathsf{Reg}}(\mathcal{A})=T\cdot d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),(-\infty,0]^{d}\right)\leq\operatorname{\mathsf{Reg}}(\mathcal{F}).

If we let \mathcal{F} be the negative entropy FTRL algorithm (Lemma 2.2)\footnote{There is a slight technical difference between the sets \mathcal{X}={\overline{\Delta}}_{d} and \mathcal{Y}=[-1,1]^{d} in Algorithm 1 and the sets \mathcal{X}^{\prime}=\Delta_{d} and \mathcal{Y}^{\prime}=[0,1]^{d} in Lemma 2.2. However, note that if we map x\in\mathcal{X} to (x_{1},x_{2},\dots,x_{d},1-\sum x_{i})\in\Delta_{d+1} and y\in[-1,1]^{d} to ((1+y_{1})/2,(1+y_{2})/2,\dots,(1+y_{d})/2,0)\in[0,1]^{d+1}, we preserve the inner product of x and y up to an additive constant, which disappears when computing regret, and a factor of 1/2.}, we obtain the following regret guarantee for \mathcal{A}.

Corollary 2.4.

For any bilinear regret function u\colon{\mathscr{P}}\times{\mathscr{L}}\rightarrow[-1,1]^{d}, there exists a regret minimization algorithm \mathcal{A} with worst-case regret \operatorname{\mathsf{Reg}}(\mathcal{A})=O(\sqrt{T\log d}).

Equivalently, Corollary 2.4 can be interpreted as saying that there exists an \ell_{\infty}-approachability algorithm which approaches the negative orthant at an average rate of O(\sqrt{(\log d)/T}). In general, (3) lets us straightforwardly convert between results for approachability and results for regret minimization. Throughout the remainder of the paper we will primarily phrase our results in terms of \operatorname{\mathsf{Reg}}(\mathcal{A}), but switch between quantities of interest when convenient.

3 Main Results

Already, Corollary 2.4 leads to a number of impressive consequences. For example, when applied to the problem of swap-regret minimization (where u has dimension d=K^{K}), it leads to a learning algorithm with O(\sqrt{KT\log K}) regret, matching the best known regret bounds for this problem (Blum and Mansour, 2007). However, the algorithm we obtain in this way has two unfortunate properties.

First, since u is d-dimensional, implementing the algorithm as written above requires O(\operatorname*{poly}(d)) time and space complexity (even storing any specific y_{t} or \theta_{t} requires \Omega(d) space). This is fine if d is small, but in many of our applications d is much (e.g., exponentially) larger than the dimensions m and n of the loss and action sets. For example, for swap regret we have d=K^{K} but m=n=K. Although Corollary 2.4 gives us an optimal O(\sqrt{KT\log K}) swap regret algorithm, it takes exponential time/space to implement (in contrast to other known swap regret algorithms, such as that of Blum and Mansour (2007)).

Secondly, although Corollary 2.4 has only a logarithmic dependence on d, sometimes even this may be too large (for example, when we want to compete against an uncountable set of benchmarks). In such cases, we would ideally like a regret bound that depends on the action and loss sets but not directly on d.

In the following subsections, we will demonstrate a framework for regret minimization that allows us to achieve both of these goals (under some fairly light computational assumptions on u).

3.1 Approachability for pseudonorms

In Section 2.3, we described the theory of Blackwell approachability for a distance metric defined by the \ell_{\infty} norm (i.e., ||z||_{\infty}=\max_{i}|z_{i}|). We begin here by describing a generalization of this approachability theory to functions we refer to as pseudonorms and pseudodistances. A function f\colon\mathbb{R}^{d}\to\mathbb{R}_{\geq 0} is a pseudonorm if f(0)=0, f is positive homogeneous (for all z\in\mathbb{R}^{d}, \alpha\in\mathbb{R}_{\geq 0}, f(\alpha z)=\alpha f(z)), and f satisfies the triangle inequality (f(z+z^{\prime})\leq f(z)+f(z^{\prime}) for all z,z^{\prime}\in\mathbb{R}^{d}). Note that unlike norms, pseudonorms may not satisfy definiteness and are not necessarily symmetric; it may be the case that f(z)\neq f(-z). However, all norms are pseudonorms. Note also that by positive homogeneity and the triangle inequality, a pseudonorm is a convex function. A pseudonorm f defines a pseudodistance function d_{f} via: \forall z,z^{\prime}\in\mathbb{R}^{d}, d_{f}(z,z^{\prime})=f(z-z^{\prime}).

In order to effectively work with pseudodistances, it will be useful to define the dual set {\mathscr{T}}^{*}_{f} associated to f as follows: {\mathscr{T}}^{*}_{f}=\left\{\theta\in\mathbb{R}^{d}\colon\forall z\in\mathbb{R}^{d},\left\langle\theta,z\right\rangle\leq f(z)\right\}. This coincides with the traditional notion of duality in convex analysis; for example, when f is a norm, {\mathscr{T}}^{*}_{f} coincides with the dual ball of radius one: {\mathscr{T}}^{*}_{f}=\left\{\theta\colon\left\lVert\theta\right\rVert_{*}\leq 1\right\} (e.g., when f is the d-dimensional \ell_{\infty} norm, {\mathscr{T}}_{f}^{*} is the d-dimensional \ell_{1}-ball). The following theorem relates the pseudodistance between z and a convex set {\mathscr{S}} to a convex optimization problem over the dual set.

Theorem 3.1.

For any closed convex set {\mathscr{S}}\subset\mathbb{R}^{d}, the following equality holds for any z\in\mathbb{R}^{d}:

d_{f}(z,{\mathscr{S}})=\inf_{s\in{\mathscr{S}}}f(z-s)=\sup_{\theta\in{\mathscr{T}}^{*}_{f}}\left\{\theta\cdot z-\sup_{s\in{\mathscr{S}}}\theta\cdot s\right\}.
Proof.

We adopt the standard definition and notation in optimization for the indicator function I_{{\mathscr{K}}} of a set {\mathscr{K}}: for any x, I_{{\mathscr{K}}}(x)=0 if x is in {\mathscr{K}}, +\infty otherwise. Define \widetilde{f} by \widetilde{f}(s)=f(z-s) for all s\in\mathbb{R}^{d} and set g=I_{\mathscr{S}}.

By definition, the conjugate function of f is given by \forall y\in\mathbb{R}^{d},f^{*}(y)=\sup_{x\in\mathbb{R}^{d}}x\cdot y-f(x). Now, if y is in {\mathscr{T}}^{*}_{f}, then for any x we have x\cdot y-f(x)\leq f(x)-f(x)=0. Thus, since f(0)=0, the supremum in the definition of f^{*} is achieved for x=0 and f^{*}(y)=0. Otherwise, if y\not\in{\mathscr{T}}^{*}_{f}, there exists x such that x\cdot y>f(x). For that x, for any t>0, by the positive homogeneity of f, we have (tx)\cdot y-f(tx)=t\left(x\cdot y-f(x)\right)>0. Taking the limit t\to+\infty, this shows that f^{*}(y)=+\infty. Thus, we have f^{*}=I_{{\mathscr{T}}^{*}_{f}}.

By definition, the conjugate function \widetilde{f}^{*} is given for all \theta by \widetilde{f}^{*}(\theta)=\sup_{x\in\mathbb{R}^{d}}x\cdot\theta-f(z-x)=\sup_{u\in\mathbb{R}^{d}}(z-u)\cdot\theta-f(u)=f^{*}(-\theta)+z\cdot\theta=I_{{\mathscr{T}}^{*}_{f}}(-\theta)+z\cdot\theta, which can also be derived from the conjugate function calculus in Table B.1 of (Mohri et al., 2018).

It is also known that the conjugate function of the indicator function g is given by g^{*}\colon\theta\mapsto\sup_{s\in{\mathscr{S}}}\theta\cdot s (Boyd and Vandenberghe, 2014). Since \operatorname*{dom}(\widetilde{f})=\mathbb{R}^{d} and \operatorname*{cont}(g)={\mathscr{S}}, we have \operatorname*{dom}(\widetilde{f})\cap\operatorname*{cont}(g)={\mathscr{S}}\neq\emptyset. Thus, for any convex and bounded set {\mathscr{S}} in \mathbb{R}^{d}, by Fenchel duality (Theorem B.1, Appendix B), we can write:

\displaystyle d_{f}(z,{\mathscr{S}}) =\inf_{s\in{\mathscr{S}}}f(z-s)
=\inf_{s\in\mathbb{R}^{d}}\left\{f(z-s)+I_{\mathscr{S}}(s)\right\} (def. of I_{\mathscr{S}})
=\sup_{\theta\in\mathbb{R}^{d}}\left\{-\left(I_{{\mathscr{T}}^{*}_{f}}(-\theta)+\theta\cdot z\right)-\sup_{s\in{\mathscr{S}}}\left\{-\theta\cdot s\right\}\right\} (Fenchel duality theorem)
=\sup_{-\theta\in{\mathscr{T}}^{*}_{f}}\left\{-\theta\cdot z-\sup_{s\in{\mathscr{S}}}\left\{-\theta\cdot s\right\}\right\} (def. of I_{{\mathscr{T}}^{*}_{f}})
=\sup_{\theta\in{\mathscr{T}}^{*}_{f}}\left\{\theta\cdot z-\sup_{s\in{\mathscr{S}}}\theta\cdot s\right\}. (change of variable)

This completes the proof. ∎

We will primarily be concerned with the case where {\mathscr{S}} is a convex cone. A convex cone is a convex set {\mathscr{C}} such that if x\in{\mathscr{C}}, then \alpha x\in{\mathscr{C}} for all \alpha\geq 0. In this case, Theorem 3.1 can be simplified to write d_{f}(z,{\mathscr{C}}) as a linear optimization problem over the intersection of {\mathscr{T}}^{*}_{f} and the polar cone of {\mathscr{C}}.

Corollary 3.2.

Let {\mathscr{C}} be a convex cone, and let {\mathscr{C}}^{\circ}=\{y\,\mid\,\langle y,x\rangle\leq 0\,\forall x\in{\mathscr{C}}\} be the polar cone of {\mathscr{C}}. Fix a d-dimensional pseudonorm f and let {\mathscr{T}}^{{\mathscr{C}}}_{f}={\mathscr{T}}^{*}_{f}\cap{\mathscr{C}}^{\circ}. Then for any z\in\mathbb{R}^{d},

d_{f}(z,{\mathscr{C}})=\sup_{\theta\in{\mathscr{T}}^{{\mathscr{C}}}_{f}}(\theta\cdot z).
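For intuition, the following small Python check (ours) instantiates Corollary 3.2 with f the \ell_{\infty} norm and {\mathscr{C}}=(-\infty,0]^{d}: then {\mathscr{C}}^{\circ} is the nonnegative orthant, {\mathscr{T}}^{*}_{f} is the \ell_{1} ball, and {\mathscr{T}}^{{\mathscr{C}}}_{f}={\overline{\Delta}}_{d} (the set used in Algorithm 1), so d_{f}(z,{\mathscr{C}})=\max(\max_{i}z_{i},0). We compare that closed form against a random search over {\overline{\Delta}}_{d}.

import numpy as np

rng = np.random.default_rng(6)
d = 4
z = rng.uniform(-1.0, 1.0, size=d)

closed_form = max(z.max(), 0.0)            # l_inf distance of z to the negative orthant

# Random points of Delta-bar_d = conv({0, e_1, ..., e_d}): sample a Dirichlet over
# d + 1 coordinates and drop the last one (its mass plays the role of the origin).
samples = rng.dirichlet(np.ones(d + 1), size=20000)[:, :d]
random_search = np.max(samples @ z)        # approximate sup over Delta-bar_d of <theta, z>

print(closed_form, random_search)          # the search approaches the closed form from below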

3.2 From high-dimensional \ell_{\infty} approachability to low-dimensional pseudonorm approachability

In Section 2.3, we expressed the regret for a general regret minimization problem as the \ell_{\infty} distance

\operatorname{\mathsf{Reg}}=T\cdot d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),(-\infty,0]^{d}\right). (5)

In this section, we will demonstrate how to rewrite this in terms of a lower-dimensional pseudodistance.

Consider the bilinear “basis map” \widetilde{u}\colon{\mathscr{P}}\times{\mathscr{L}}\to\mathbb{R}^{nm} given by \widetilde{u}(p,\ell)_{i,j}=p_{i}\ell_{j}. Note that every bilinear function b\colon{\mathscr{P}}\times{\mathscr{L}}\to\mathbb{R} can be written in the form b(p,\ell)=\langle\widetilde{u}(p,\ell),v\rangle for some vector v\in\mathbb{R}^{nm} (i.e., the monomials in \widetilde{u} form a basis for the set of biaffine functions on {\mathscr{P}}\times{\mathscr{L}})\footnote{If it is helpful, one can think of \widetilde{u} as the natural map from {\mathscr{P}}\times{\mathscr{L}} to the tensor product {\mathscr{P}}\,\otimes\,{\mathscr{L}}. The vectors v_{i} can then be thought of as elements of the dual space, each representing a linear functional on {\mathscr{P}}\,\otimes\,{\mathscr{L}}.}.

Let v_{i}\in\mathbb{R}^{nm} be the vector of coefficients corresponding to the component u_{i} of u, and consider the function f(x)=\max(\max_{i\in[d]}\langle v_{i},x\rangle,0). Note that f is a pseudonorm on \mathbb{R}^{nm}; indeed, it is straightforward to see that f is non-negative, satisfies the triangle inequality, and satisfies f(\alpha x)=\alpha f(x) for any \alpha\geq 0. In addition, let {\mathscr{C}} be the convex cone defined by {\mathscr{C}}=\{x\,\mid\,\langle x,v_{i}\rangle\leq 0\,\forall i\in[d]\}. We claim that we can rewrite the \ell_{\infty} distance in (5) in the following way.
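As a sanity check, the following Python snippet (ours) builds \widetilde{u} and the vectors v_{i} for the external-regret instance ({\mathscr{P}}=\Delta_{K}, where u(p,\ell)_{i}=\langle p,\ell\rangle-\ell_{i} corresponds, using \sum_{j}p_{j}=1, to (v_{i})_{jk}=\mathbf{1}[j=k]-\mathbf{1}[k=i]) and verifies numerically that the \ell_{\infty} distance of the average payoff to (-\infty,0]^{d} coincides with f evaluated at the average basis payoff, as asserted by Theorem 3.3 below.

import numpy as np

K, T = 4, 50
rng = np.random.default_rng(3)
# v_i as K x K coefficient matrices: (v_i)_{jk} = 1[j = k] - 1[k = i].
V = np.stack([np.eye(K) - np.outer(np.ones(K), np.eye(K)[i]) for i in range(K)])

ps = rng.dirichlet(np.ones(K), size=T)       # actions in Delta_K
ls = rng.uniform(0.0, 1.0, size=(T, K))      # losses in [0, 1]^K

# Left-hand side: l_inf distance of the average d-dimensional payoff to (-inf, 0]^d.
u_avg = np.mean([p @ l - l for p, l in zip(ps, ls)], axis=0)
lhs = max(u_avg.max(), 0.0)

# Right-hand side: f(z) = max(max_i <v_i, z>, 0) at the average basis payoff z.
z_avg = np.mean([np.outer(p, l) for p, l in zip(ps, ls)], axis=0)
rhs = max(max(np.sum(V[i] * z_avg) for i in range(K)), 0.0)

print(lhs, rhs)   # these agree up to floating-point error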

Theorem 3.3.

We have that

d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),(-\infty,0]^{d}\right)=d_{f}\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t}),{\mathscr{C}}\right). (6)

The dual set {\mathscr{T}}^{*}_{f} associated to this pseudonorm is given by:

{\mathscr{T}}^{*}_{f}=\left\{\widetilde{\theta}\in\mathbb{R}^{nm}\colon\forall x\in\mathbb{R}^{nm},\langle\widetilde{\theta},x\rangle\leq\max_{i\in[d]}\langle v_{i},x\rangle\right\}.

The following two properties of this dual set will be useful in the sections that follow. First, we show that we can alternatively think of {\mathscr{T}}^{*}_{f} as the convex hull of the v_{i}.

Lemma 3.4.

The dual set {\mathscr{T}}^{*}_{f} coincides with the convex hull of the v_{i}: {\mathscr{T}}^{*}_{f}=\operatorname*{conv}\left\{v_{1},\ldots,v_{d}\right\}.

Secondly, we show that the dual set {\mathscr{T}}_{f}^{*} is contained within the polar cone {\mathscr{C}}^{\circ} of {\mathscr{C}}.

Lemma 3.5.

We have that {\mathscr{T}}_{f}^{*}\subseteq{\mathscr{C}}^{\circ}, where {\mathscr{C}}^{\circ}=\{y\,\mid\,\langle y,x\rangle\leq 0\,\forall x\in{\mathscr{C}}\}.

Proof.

By Lemma 3.4, it suffices to show that each v_{i}\in{\mathscr{C}}^{\circ}, i.e., for each v_{i}, that \langle v_{i},x\rangle\leq 0 for all x\in{\mathscr{C}}. However, this immediately follows from the definition of {\mathscr{C}}, since each x\in{\mathscr{C}} satisfies \langle x,v_{i}\rangle\leq 0 for all i\in[d]. (An equivalent way of thinking about this is that {\mathscr{C}} is already the polar cone of the cone generated by the v_{i}, so the v_{i} must lie in the polar cone of {\mathscr{C}}.) ∎

This allows us to simplify Corollary 3.2 even further.

Corollary 3.6.

For this convex cone {\mathscr{C}} and pseudonorm f, we have that for any z\in\mathbb{R}^{mn},

d_{f}(z,{\mathscr{C}})=\sup_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}\left\langle\widetilde{\theta},z\right\rangle.

Finally, we prove the following “separability” conditions for this approachability problem.

Lemma 3.7.

The following two statements are true.

  1. For any \ell\in{\mathscr{L}}, there exists a p\in{\mathscr{P}} such that \widetilde{u}(p,\ell)\in{\mathscr{C}}.

  2. For any \widetilde{\theta}\in{\mathscr{T}}_{f}^{*}, there exists a p\in{\mathscr{P}} such that \langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}.

3.3 Algorithms for pseudonorm approachability

We now present an algorithm for pseudonorm approachability in the setting of Section 3.2 (i.e., for the bilinear function \widetilde{u} and the convex cone {\mathscr{C}}). Just as the algorithm in Section 2.3 for \ell_{\infty}-approachability required an OLO algorithm for the simplex, this algorithm will assume we have access to an OLO algorithm \widetilde{\mathcal{F}} for the sets {\mathscr{X}}={\mathscr{T}}_{f}^{*} and {\mathscr{Y}}=-\operatorname*{conv}(\widetilde{u}({\mathscr{P}},{\mathscr{L}})) (recall that \operatorname*{conv}(\widetilde{u}({\mathscr{P}},{\mathscr{L}})) denotes the convex hull of the points \widetilde{u}(p,\ell)\in\mathbb{R}^{nm} for p\in{\mathscr{P}} and \ell\in{\mathscr{L}}).

Initialize \widetilde{\theta}_{1} to an arbitrary point in {\mathscr{T}}_{f}^{*};
for t\leftarrow 1 to T do
   Choose p_{t}\in{\mathscr{P}} so that for all \ell\in{\mathscr{L}}, \langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},\ell)\rangle\leq 0 (such a p_{t} is guaranteed to exist by the second statement in Lemma 3.7);
   Play action p_{t} and receive as feedback \ell_{t}\in{\mathscr{L}};
   Set y_{t}=-\widetilde{u}(p_{t},\ell_{t});
   Set \widetilde{\theta}_{t+1}=\widetilde{\mathcal{F}}(y_{1},y_{2},\dots,y_{t}).
end for
Algorithm 2 Description of Algorithm \widetilde{\mathcal{A}} for f-approachability

Our algorithm \widetilde{\mathcal{A}} is summarized above in Algorithm 2. We now have the following analogue of Theorem 2.3.

Theorem 3.8.

The following guarantee holds for the pseudonorm approachability algorithm \widetilde{\mathcal{A}}:

\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{A}})=T\cdot d_{f}\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t}),{\mathscr{C}}\right)\leq\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{F}}).
Proof.

The first equality follows as a direct consequence of Theorem 3.3 (which proves that the d_{f} distance to {\mathscr{C}} is equal to the analogous \ell_{\infty} distance to (-\infty,0]^{d}) and Theorem 2.3 (which shows that this \ell_{\infty} distance is equal to the regret of our regret minimization problem). It therefore suffices to prove the second relation, the inequality.

Note that

\displaystyle d_{f}\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t}),{\mathscr{C}}\right) =\sup_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}\left\langle\widetilde{\theta},\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t})\right\rangle
=-\inf_{\widetilde{\theta}\in{\mathscr{X}}}\left\langle\widetilde{\theta},\frac{1}{T}\sum_{t=1}^{T}\left(-\widetilde{u}(p_{t},\ell_{t})\right)\right\rangle
=\frac{1}{T}\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{F}})-\frac{1}{T}\sum_{t=1}^{T}\left\langle\widetilde{\theta}_{t},-\widetilde{u}(p_{t},\ell_{t})\right\rangle
\leq\frac{1}{T}\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{F}}).

Here, the first equality holds as a consequence of Corollary 3.6, and the last inequality holds since (by the choice of p_{t} in Algorithm 2) \langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},\ell_{t})\rangle\leq 0 for all t\in[T]. ∎

If we choose \widetilde{\mathcal{F}} to be FTRL with a quadratic regularizer, Lemma 2.1 implies the following result.

Theorem 3.9.

Let u\colon{\mathscr{P}}\times{\mathscr{L}}\to\mathbb{R}^{d} be a bilinear regret function. Then there exists a regret minimization algorithm \widetilde{\mathcal{A}} for u with regret

\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{A}})=O(D_{z}(\operatorname*{diam}{\mathscr{P}})(\operatorname*{diam}{\mathscr{L}})\sqrt{T}),

where D_{z}=\max_{i\in[d]}||v_{i}||. If we let

\lambda=\max_{i\in[d]}||v_{i}||_{\infty}=\max_{i\in[d],j\in[n],k\in[m]}\left|\frac{\partial^{2}u_{i}}{\partial p_{j}\partial\ell_{k}}\right|,

then this regret bound further satisfies

\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{A}})=O(\lambda nm\sqrt{T}).
Proof.

To see the first result, note that Lemma 2.1 directly implies a bound of O(\operatorname*{diam}({\mathscr{X}})\operatorname*{diam}({\mathscr{Y}})\sqrt{T}), where {\mathscr{X}}={\mathscr{T}}_{f}^{*} and {\mathscr{Y}}=-\operatorname*{conv}(\widetilde{u}({\mathscr{P}},{\mathscr{L}})). We’ll now proceed by simplifying \operatorname*{diam}({\mathscr{X}}) and \operatorname*{diam}({\mathscr{Y}}). First, for {\mathscr{X}}, recall from Lemma 3.4 that {\mathscr{X}}={\mathscr{T}}_{f}^{*} is the convex hull of the d vectors v_{i}\in\mathbb{R}^{nm}, so \operatorname*{diam}({\mathscr{X}})=O(\max_{i\in[d]}||v_{i}||)=O(D_{z}). Second, for {\mathscr{Y}}, note that \operatorname*{diam}({\mathscr{Y}})=O(\max_{y\in{\mathscr{Y}}}||y||)=O(\max_{p\in{\mathscr{P}},\ell\in{\mathscr{L}}}||\widetilde{u}(p,\ell)||). But ||\widetilde{u}(p,\ell)||=||p||\cdot||\ell||, so \operatorname*{diam}({\mathscr{Y}})=O\left(\max_{p\in{\mathscr{P}}}||p||\cdot\max_{\ell\in{\mathscr{L}}}||\ell||\right)=O((\operatorname*{diam}{\mathscr{P}})(\operatorname*{diam}{\mathscr{L}})). Combining these, we obtain the first inequality.

The second result directly follows from the following three facts: i. \operatorname*{diam}{\mathscr{P}}=O(\sqrt{n}) (since {\mathscr{P}}\subseteq[-1,1]^{n}), ii. \operatorname*{diam}{\mathscr{L}}=O(\sqrt{m}) (since {\mathscr{L}}\subseteq[-1,1]^{m}), and iii. D_{z}\leq\lambda\sqrt{mn}. ∎

Theorem 3.9 shows that in settings where \lambda is constant (which is true in all the settings we consider), there exists an algorithm where \operatorname{\mathsf{Reg}}(\widetilde{\mathcal{A}})=\operatorname*{poly}(n,m)\sqrt{T}; notably, \operatorname{\mathsf{Reg}}(\widetilde{\mathcal{A}}) does not depend on the dimension d, which in many cases can be thought of as the number of benchmarks of comparison for our learning algorithm.
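To illustrate this dimension independence, the following Python sketch (ours; the projection and stationary-distribution subroutines are our own simplifications, and this is only the quadratic-regularizer instantiation of Theorem 3.9, not the paper's optimal swap-regret construction) runs Algorithm 2 for swap regret. A short calculation shows that {\mathscr{T}}_{f}^{*}=\operatorname*{conv}\{v_{\pi}\} here equals \{I-Q: Q\ \text{row-stochastic}\}, so the OLO step is a row-wise Euclidean projection onto simplices and the halfspace step asks for a stationary distribution of Q_{t}; everything runs in \operatorname*{poly}(K) time and never touches the d=K^{K} benchmarks.

import numpy as np

def project_to_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def project_theta(M):
    # Project M onto {I - Q : Q row-stochastic}: equivalently, project I - M row-wise.
    K = len(M)
    Q = np.apply_along_axis(project_to_simplex, 1, np.eye(K) - M)
    return np.eye(K) - Q

def stationary(Q, iters=300):
    # Power iteration on the lazy chain (I + Q)/2, which shares Q's stationary distributions.
    K = len(Q)
    lazy = 0.5 * (np.eye(K) + Q)
    p = np.full(K, 1.0 / K)
    for _ in range(iters):
        p = p @ lazy
    return p / p.sum()

def swap_regret_learner(losses, eta):
    T, K = losses.shape
    cum_y = np.zeros((K, K))            # cumulative OLO losses y_t = -u~(p_t, l_t)
    swap_payoff = np.zeros((K, K))      # sufficient statistics to read off swap regret
    for t in range(T):
        theta = project_theta(-eta * cum_y / 2.0)   # quadratic-FTRL iterate in T_f*
        Q = np.eye(K) - theta
        p = stationary(Q)                            # <theta, u~(p, l)> = <p - Q^T p, l> = 0
        l = losses[t]
        z = np.outer(p, l)                           # basis payoff u~(p_t, l_t)
        swap_payoff += z
        cum_y += -z
    best = swap_payoff.diagonal().sum() - swap_payoff.min(axis=1).sum()
    return max(best, 0.0)                            # realized swap regret

rng = np.random.default_rng(4)
print(swap_regret_learner(rng.uniform(0.0, 1.0, size=(2000, 6)), eta=0.05))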

In the following two sections, we will show how to strengthen this result in two different ways. First, we will show that (modulo some fairly mild computational assumptions) it is possible to efficiently implement the algorithm of Theorem 3.9. Second, we will show that by using a different choice of regularizer, we can recover exactly the regret obtained in Corollary 2.4.

3.4 Efficient algorithms for pseudonorm approachability

3.4.1 Computational assumptions

In this section, we will discuss how to transform the algorithm in Theorem 3.9 into a computationally efficient algorithm. Note that without any constraints on {\mathscr{L}}, {\mathscr{P}}, and u or how they are specified, performing any sort of regret minimization efficiently is a hopeless task (e.g., consider the case where it is computationally hard to even determine membership in {\mathscr{P}}). We’ll therefore make the following three structural/computational assumptions on {\mathscr{L}}, {\mathscr{P}}, and u.

First, we will restrict our attention to cases where the loss set {\mathscr{L}} is “orthant-generating”. A convex subset S of \mathbb{R}^{d} is orthant-generating if S\subseteq[0,\infty)^{d} and for each i\in[d], there exists a \lambda_{i}>0 such that \lambda_{i}e_{i}\in S (as part of this assumption, we will also assume we have access to the values \lambda_{i}). Note that many common choices for {\mathscr{L}} (e.g., [0,1]^{m}, \Delta_{m}, intersections of other \ell_{p} balls with the positive orthant) are all orthant-generating.

Second, we will assume we have an efficient separation oracle for the action set {\mathscr{P}}; that is, an oracle which takes a p\in\mathbb{R}^{n} and outputs (in time \operatorname*{poly}(n)) either that p\in{\mathscr{P}} or a separating hyperplane h\in\mathbb{R}^{n} such that \langle p,h\rangle>\max_{p^{\prime}\in{\mathscr{P}}}\langle p^{\prime},h\rangle.

Finally, we will assume we have access to what we call an efficient regret oracle for u. Given a collection of R action/loss pairs (p_{r},\ell_{r})\in{\mathscr{P}}\times{\mathscr{L}} and R nonnegative constants \alpha_{r}\geq 0, an efficient regret oracle can compute (in time \operatorname*{poly}(n,m,R)) the value of \max_{i\in[d]}\sum_{r=1}^{R}\alpha_{r}u_{i}(p_{r},\ell_{r}). This can be thought of as evaluating the function \operatorname{\mathsf{Reg}}(\bm{p},\bm{\ell}) for a pair of action and loss sequences that take on the action/loss pair (p_{r},\ell_{r}) an \alpha_{r}/\sum_{r^{\prime}}\alpha_{r^{\prime}} fraction of the time. At a high level, having access to an efficient regret oracle means that a learner can efficiently compute their overall regret at the end of T rounds (it is hard to imagine how one could efficiently minimize regret without being able to compute it).
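As an illustration of this assumption, here is a small Python sketch (ours; the polytope and helper names are hypothetical) of a regret oracle for the online-linear-optimization-over-a-polytope example of Section 2.1, where u_{v}(p,\ell)=\langle p,\ell\rangle-\langle v,\ell\rangle for each vertex v: the maximum over the d=|V({\mathscr{P}})| benchmarks reduces to a single linear minimization over {\mathscr{P}}, shown here for the box [0,1]^{n}, whose linear minimizer is coordinatewise (we allow signed losses purely for illustration).

import numpy as np

def regret_oracle_box(pairs, alphas):
    # pairs: list of (p_r, l_r) with p_r in [0,1]^n, l_r in R^n; alphas: nonnegative weights.
    own = sum(a * (p @ l) for a, (p, l) in zip(alphas, pairs))   # sum_r alpha_r <p_r, l_r>
    l_bar = sum(a * l for a, (p, l) in zip(alphas, pairs))       # weighted total loss vector
    best_vertex = (l_bar < 0).astype(float)                      # argmin over vertices of <v, l_bar>
    return own - best_vertex @ l_bar                             # max_v sum_r alpha_r u_v(p_r, l_r)

rng = np.random.default_rng(5)
pairs = [(rng.uniform(0.0, 1.0, 3), rng.uniform(-1.0, 1.0, 3)) for _ in range(10)]
print(regret_oracle_box(pairs, np.ones(10)))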

3.4.2 Extending the dual set

One of the ingredients we will need to implement algorithm \widetilde{\mathcal{A}} is a membership oracle for the dual set {\mathscr{T}}_{f}^{*} (e.g. in order to perform OLO over this dual set). To check whether \widetilde{\theta}\in{\mathscr{T}}_{f}^{*}, it suffices to check whether \langle\widetilde{\theta},x\rangle\leq\max_{i\in[d]}\langle v_{i},x\rangle for all x\in\mathbb{R}^{nm}, so in turn, it will be useful to be able to compute the function \max_{i\in[d]}\langle v_{i},x\rangle for any x\in\mathbb{R}^{nm}.

Computing this maximum is very similar\footnote{In fact, in almost all practical cases where we have an efficient regret oracle, it is possible by a similar computation to directly compute this maximum. Here we describe an approach that works in a blackbox way given only a strict regret oracle – if you accept the existence of an oracle that can optimize linear functions over the v_{i}, you can skip this subsection.} to what is provided by our regret oracle: note that by writing u_{i}(p_{r},\ell_{r}) as \langle v_{i},\widetilde{u}(p_{r},\ell_{r})\rangle, we can think of the regret oracle as providing the value of:

\max_{i\in[d]}\left\langle v_{i},\sum_{r=1}^{R}\alpha_{r}\widetilde{u}(p_{r},\ell_{r})\right\rangle.

If it were possible to write any x\in\mathbb{R}^{nm} in the form \sum_{r=1}^{R}\alpha_{r}\widetilde{u}(p_{r},\ell_{r}), we would be done. However, this may not be possible: in particular, \sum_{r=1}^{R}\alpha_{r}\widetilde{u}(p_{r},\ell_{r}) must lie in the convex cone \operatorname*{cone}(\widetilde{u}({\mathscr{P}},{\mathscr{L}})) generated by all points of the form \widetilde{u}(p,\ell).

We will therefore briefly generalize the theory of Section 3.1 to cases where we are only able to optimize over an extension of the dual set. Given a convex cone {\mathscr{Z}}, let {\mathscr{T}}_{f}^{*}({\mathscr{Z}})=\{\theta\,\mid\,\forall z\in{\mathscr{Z}},\langle\theta,z\rangle\leq f(z)\}. Note that {\mathscr{T}}_{f}^{*}\subseteq{\mathscr{T}}_{f}^{*}({\mathscr{Z}}) (since in {\mathscr{T}}_{f}^{*}, \langle\theta,z\rangle\leq f(z) must hold for all z in the ambient space). The following lemma shows that if z is in {\mathscr{Z}}, to maximize a linear function over {\mathscr{T}}_{f}^{*} it suffices to maximize it over {\mathscr{T}}_{f}^{*}({\mathscr{Z}}).

Lemma 3.10.

Let {\mathscr{Z}} be an arbitrary convex cone. Then if z\in{\mathscr{Z}}, the following equalities hold:

\sup_{\theta\in{\mathscr{T}}^{*}_{f}}\langle\theta,z\rangle=\sup_{\theta\in{\mathscr{T}}^{*}_{f}({\mathscr{Z}})}\langle\theta,z\rangle=f(z).

Now, consider the specific convex cone {\mathscr{Z}}=\operatorname*{cone}(\widetilde{u}({\mathscr{P}},{\mathscr{L}})). Given Lemma 3.10, it is straightforward to check that Theorem 3.8 continues to hold even if the domain {\mathscr{X}} of the OLO algorithm \widetilde{\mathcal{F}} is set to {\mathscr{T}}_{f}^{*}({\mathscr{Z}}) instead of {\mathscr{T}}_{f}^{*}. In particular, the first equality in the proof of Theorem 3.8 is still true when {\mathscr{T}}_{f}^{*} is replaced by {\mathscr{T}}_{f}^{*}({\mathscr{Z}}), since \frac{1}{T}\sum_{t}\widetilde{u}(p_{t},\ell_{t})\in{\mathscr{Z}}.

There is one other issue we must deal with: it is possible that 𝒯f(𝒵){\mathscr{T}}_{f}^{*}({\mathscr{Z}}) is significantly larger than 𝒯f{\mathscr{T}}_{f}^{*}, and therefore an OLO algorithm with domain 𝒯f(𝒵){\mathscr{T}}_{f}^{*}({\mathscr{Z}}) might incur more regret than one with domain 𝒯f{\mathscr{T}}_{f}^{*}. In fact, there are cases where the set 𝒯f(𝒵){\mathscr{T}}_{f}^{*}({\mathscr{Z}}) is unbounded. Nonetheless, the following lemma shows that for OLO with a quadratic regularizer, we will never encounter very large values of θ~\widetilde{\theta}.

Lemma 3.11.

Let R(θ~)=θ~2R(\widetilde{\theta})=||\widetilde{\theta}||^{2}, and fix a z𝒵z\in{\mathscr{Z}} and η>0\eta>0. Let

θ~opt=argminθ~𝒯f(𝒵)(ηR(θ~)θ~,z).\widetilde{\theta}_{opt}=\operatorname*{argmin}_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}({\mathscr{Z}})}\left(\eta R(\widetilde{\theta})-\langle\widetilde{\theta},z\rangle\right).

Then θ~optdiam(𝒯f)||\widetilde{\theta}_{opt}||\leq\operatorname*{diam}({\mathscr{T}}_{f}^{*}).

Lemma 3.11 therefore implies that a quadratic-regularized OLO algorithm ~\widetilde{\mathcal{F}} will only output values of θ~\widetilde{\theta} with θ~diam(𝒯f)||\widetilde{\theta}||\leq\operatorname*{diam}({\mathscr{T}}_{f}^{*}), and therefore we can safely set 𝒳=𝒯f{θθρ}{\mathscr{X}}={\mathscr{T}}_{f}^{*}\cap\{\theta\mid||\theta||\leq\rho\}, where ρ=Θ(diam(𝒯f))\rho=\Theta(\operatorname*{diam}({\mathscr{T}}_{f}^{*})) is an efficiently computable upper bound on diam(𝒯f)\operatorname*{diam}({\mathscr{T}}_{f}^{*}). This guarantees that diam(𝒳)=O(diam(𝒯f))\operatorname*{diam}({\mathscr{X}})=O(\operatorname*{diam}({\mathscr{T}}_{f}^{*})), and thus the regret guarantees of Theorem 3.9 remain unchanged.
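To illustrate the shape of this regularized step, here is a minimal sketch for the toy case where the domain is just the Euclidean ball of radius ρ; the actual domain 𝒳 = 𝒯_f^* ∩ {||θ|| ≤ ρ} additionally requires the membership-oracle machinery developed below, so this is only a stand-in under that simplifying assumption.

```python
import numpy as np

def regularized_step_on_ball(z, eta, rho):
    """Minimize eta * ||theta||^2 - <theta, z> over {||theta|| <= rho}
    (the objective of Lemma 3.11, with the plain ball standing in for the
    clipped domain X).  The unconstrained minimizer is z / (2 * eta); since
    the objective is a shifted squared norm, the constrained minimizer is
    the Euclidean projection of that point onto the ball."""
    theta = np.asarray(z, dtype=float) / (2.0 * eta)
    norm = np.linalg.norm(theta)
    if norm > rho:
        theta = theta * (rho / norm)
    return theta
```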

3.4.3 Constructing a membership oracle

We will now demonstrate how to use our regret oracle to construct a membership oracle for the expanded set 𝒳{\mathscr{X}} defined in the previous section. We first show that it is possible to check for membership in 𝒵{\mathscr{Z}} (and further, when z𝒵z\in{\mathscr{Z}}, write zz as a convex combination of the generators of 𝒵{\mathscr{Z}}).

Lemma 3.12.

Given a point zz\in\mathbb{R}^{nm}, we can efficiently check whether z𝒵z\in{\mathscr{Z}}. If z𝒵z\in{\mathscr{Z}}, we can also efficiently write zz in the form k=1mαku~(pk,k)\sum_{k=1}^{m}\alpha_{k}\widetilde{u}(p_{k},\ell_{k}) (for an explicit choice of pkp_{k}, k\ell_{k}, and αk\alpha_{k}).

Note that expressing zz in the form k=1mαku~(pk,k)\sum_{k=1}^{m}\alpha_{k}\widetilde{u}(p_{k},\ell_{k}) allows us to directly apply our efficient regret oracle (with R=mR=m). We therefore gain the following optimization oracle as a corollary.

Corollary 3.13.

Given an efficient regret oracle, for any z𝒵z\in{\mathscr{Z}} we can efficiently (in time poly(n,m)\operatorname*{poly}(n,m)) compute maxi[d]vi,z\max_{i\in[d]}\langle v_{i},z\rangle.

Finally, we will show that given these two results, we can efficiently construct a membership oracle for our set 𝒳{\mathscr{X}}. To do this, we will need the following fact (loosely stated; see Lemma B.2 in Appendix B for an accurate statement): it is possible to minimize a convex function over a convex set as long as one has an evaluation oracle for the function and a membership oracle for the convex set.

Lemma 3.14.

Given an efficient regret oracle for uu, we can construct an efficient membership oracle for the set 𝒳{\mathscr{X}}.

3.4.4 Implementing regret minimization

Equipped with this membership oracle, we can now state and prove our main theorem.

Theorem 3.15.

Assume that:

  1. 1.

    The convex set {\mathscr{L}} is orthant-generating.

  2. 2.

    We have an efficient separation oracle for the convex set 𝒫{\mathscr{P}}.

  3. 3.

    We have an efficient regret oracle for the regret function uu.

Then it is possible to implement the algorithm of Theorem 3.9 in poly(n,m)\operatorname*{poly}(n,m) time per round.

Proof.

There are two steps of unclear computational complexity in the description of the algorithm in Section 3.3: step 2a, where we find a ptp_{t} such that θ~t,u~(pt,)0\langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}, and step 2d, where we have to run our OLO subalgorithm over 𝒯f{\mathscr{T}}_{f}^{*} (which we specifically instantiate as FTRL with a quadratic regularizer).

We begin by describing how to perform step 2a efficiently. Fix a θ~t\widetilde{\theta}_{t}. Note that since {\mathscr{L}} is orthant-generating, to check whether θ~t,u~(pt,)0\langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}, it suffices to check whether θ~t,u~(pt,ek)0\langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},e_{k})\rangle\leq 0 for each unit vector ekme_{k}\in\mathbb{R}^{m}. Therefore, to find a valid ptp_{t} we must find a point pt𝒫p_{t}\in{\mathscr{P}} that satisfies an additional mm explicit linear constraints. Since we have an efficient separation oracle for 𝒫{\mathscr{P}}, this is possible in poly(n,m)\operatorname*{poly}(n,m) time.

To implement the FTRL algorithm \mathcal{F} over the set 𝒳=𝒯f{\mathscr{X}}={\mathscr{T}}_{f}^{*} with a quadratic regularizer, each round we must find the minimizer of the convex function g(θ~)=R(θ~)+ηs=1t1θ~,u~(ps,s)g(\widetilde{\theta})=R(\widetilde{\theta})+\eta\sum_{s=1}^{t-1}\langle\widetilde{\theta},\widetilde{u}(p_{s},\ell_{s})\rangle over the convex set 𝒳{\mathscr{X}}. To do this, it suffices to exhibit an evaluation oracle for gg and a membership oracle for 𝒳{\mathscr{X}}. To evaluate gg, note that we simply need to be able to compute R(θ~)R(\widetilde{\theta}) (which is a squared Euclidean norm in mn\mathbb{R}^{mn}) and s=1t1θ~,u~(ps,s)\sum_{s=1}^{t-1}\langle\widetilde{\theta},\widetilde{u}(p_{s},\ell_{s})\rangle, which we can do in O(mn)O(mn) time by keeping track of the cumulative sum s=1t1u~(ps,s)\sum_{s=1}^{t-1}\widetilde{u}(p_{s},\ell_{s}) and computing a single inner product in mn\mathbb{R}^{mn}. A membership oracle for 𝒳{\mathscr{X}} is provided by Lemma 3.14. ∎
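The bookkeeping described in this proof can be sketched as follows (our illustration; here we take ũ(p, ℓ) to be the flattened outer product of p and ℓ, one concrete choice consistent with the bilinear setup of this section):

```python
import numpy as np

class GEvaluator:
    """Evaluation oracle for g(theta) = ||theta||^2 + eta * <theta, S_t>,
    where S_t = sum_{s < t} u_tilde(p_s, l_s) is maintained incrementally.
    Here u_tilde(p, l) is taken to be the flattened outer product p l^T,
    an assumption matching the bilinear setup of this section."""

    def __init__(self, n, m, eta):
        self.cum = np.zeros(n * m)   # running sum of u_tilde(p_s, l_s)
        self.eta = eta

    def observe(self, p, l):
        # Add u_tilde(p_t, l_t) to the cumulative sum: O(nm) work per round.
        self.cum += np.outer(p, l).ravel()

    def g(self, theta):
        # Evaluate g at theta with a single inner product in R^{nm}.
        theta = np.asarray(theta, dtype=float)
        return float(theta @ theta + self.eta * (theta @ self.cum))
```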

3.5 Recovering \ell_{\infty} regret via maxent regularizers

One interesting aspect of this reduction to pseudonorm approachability is that, in some cases, the regret bound of O(Tlogd)O(\sqrt{T\log d}) achievable via \ell_{\infty}-approachability (Corollary 2.4) outperforms the regret bound of O(λmnT)O(\lambda mn\sqrt{T}) achieved by pseudonorm approachability (Theorem 3.9), for example when d=poly(m,n)d=\operatorname*{poly}(m,n). Of course, this comparison is not completely fair: in both cases there is flexibility to specify the underlying OLO algorithm, and the O(Tlogd)O(\sqrt{T\log d}) bound uses an FTRL algorithm with a negative entropy regularizer, whereas our pseudonorm approachability bound uses an FTRL algorithm with a quadratic regularizer. Indeed, there are well-known cases (e.g. for OLO over a simplex, as in \ell_{\infty}-approachability) where the negative entropy regularizer leads to exponentially better (in dd) regret bounds than the quadratic regularizer.

In this subsection, we will show that there exists a different regularizer for pseudonorm approachability – one we call a maxent regularizer – which recovers the O(Tlogd)O(\sqrt{T\log d}) regret bound of Corollary 2.4. In doing so, we will also better understand the parallels between the regret minimization algorithm 𝒜\mathcal{A} (which works via reduction to \ell_{\infty}-approachability in a dd-dimensional space) and the regret minimization algorithm 𝒜~\widetilde{\mathcal{A}} (which works via reduction to pseudonorm approachability in an nmnm-dimensional space).

Let VV be the dd-by-nmnm matrix whose iith row equals viv_{i}. Note that VV allows us to translate between analogous concepts/quantities for 𝒜\mathcal{A} and 𝒜~\widetilde{\mathcal{A}} in the following way.

Lemma 3.16.

The following statements are true:

  • For any p𝒫p\in{\mathscr{P}} and \ell\in{\mathscr{L}}, u(p,)=Vu~(p,)u(p,\ell)=V\widetilde{u}(p,\ell).

  • The dual set 𝒯f=VTΔd{\mathscr{T}}_{f}^{*}=V^{T}\Delta_{d} (i.e., θ~𝒯f\widetilde{\theta}\in{\mathscr{T}}_{f}^{*} iff there exists a θΔd\theta\in\Delta_{d} such that θ~=VTθ\widetilde{\theta}=V^{T}\theta).

  • If θΔd\theta\in\Delta_{d} and θ~=VTθ𝒯f\widetilde{\theta}=V^{T}\theta\in{\mathscr{T}}_{f}^{*}, then for any p𝒫p\in{\mathscr{P}} and \ell\in{\mathscr{L}}, θ,u(p,)=θ~,u~(p,)\langle\theta,u(p,\ell)\rangle=\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle.

  • Fix θtΔd\theta_{t}\in\Delta_{d} and let θ~t=VTθt\widetilde{\theta}_{t}=V^{T}\theta_{t}. If ptp_{t} satisfies θt,u(pt,)0\langle\theta_{t},u(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}, then ptp_{t} also satisfies θ~t,u~(pt,)0\langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}.

Proof.

The first statement follows from the fact that Vu~(p,)i=vi,u~(p,)=u(p,)iV\widetilde{u}(p,\ell)_{i}=\langle v_{i},\widetilde{u}(p,\ell)\rangle=u(p,\ell)_{i}. The second claim follows as a consequence of Lemma 3.4 (that 𝒯f{\mathscr{T}}_{f}^{*} is the convex hull of the vectors viv_{i}). The third claim follows from the first claim: θ,u(p,)=θ,Vu~(p,)=VTθ,u~(p,)=θ~,u~(p,)\langle\theta,u(p,\ell)\rangle=\langle\theta,V\widetilde{u}(p,\ell)\rangle=\langle V^{T}\theta,\widetilde{u}(p,\ell)\rangle=\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle. Finally, the fourth claim follows directly from the third claim. ∎
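As a quick numerical sanity check of the first and third claims (our own illustration, using the swap-regret payoff of Section 4.1 with a small K; building V explicitly is feasible only because d = K^K is tiny here):

```python
import itertools
import numpy as np

K = 3
# Enumerate all K^K swap functions pi and build the d-by-K^2 matrix V,
# where v_pi[j, k] = 1(k = j) - 1(k = pi(j)).
swaps = list(itertools.product(range(K), repeat=K))
V = np.zeros((len(swaps), K * K))
for idx, pi in enumerate(swaps):
    for j in range(K):
        V[idx, j * K + j] += 1.0
        V[idx, j * K + pi[j]] -= 1.0

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(K))             # a point in the simplex
l = rng.uniform(0, 1, size=K)             # a loss vector in [0, 1]^K
u_tilde = np.outer(p, l).ravel()          # the lifted payoff in R^{K^2}

# First claim: u(p, l) = V u_tilde(p, l), coordinate by coordinate.
u_direct = np.array([np.dot(p, l) - sum(p[j] * l[pi[j]] for j in range(K))
                     for pi in swaps])
assert np.allclose(u_direct, V @ u_tilde)

# Third claim: <theta, u(p, l)> = <V^T theta, u_tilde(p, l)> for theta in Delta_d.
theta = rng.dirichlet(np.ones(len(swaps)))
assert np.isclose(theta @ u_direct, (V.T @ theta) @ u_tilde)
```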

Now, fix a regret minimization problem (specified by 𝒫{\mathscr{P}}, {\mathscr{L}}, and uu) and consider the execution of algorithm 𝒜\mathcal{A} on some specific loss sequence =(1,2,,T)\bm{\ell}=(\ell_{1},\ell_{2},\dots,\ell_{T}). Each round tt, 𝒜\mathcal{A} runs the entropy-regularized FTRL algorithm \mathcal{F} to generate a θtΔd\theta_{t}\in\Delta_{d} (as a function of the actions and losses up until round tt) and then uses θt\theta_{t} to select a ptp_{t} that satisfies θt,u(pt,)0\langle\theta_{t},u(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}. If we execute algorithm 𝒜~\widetilde{\mathcal{A}} for the same regret minimization problem on the same loss sequence, each round tt, 𝒜~\widetilde{\mathcal{A}} runs some (to be determined) FTRL algorithm ~\widetilde{\mathcal{F}} to generate a θ~t𝒯f\widetilde{\theta}_{t}\in{\mathscr{T}}_{f}^{*}, and then uses θ~t\widetilde{\theta}_{t} to select a ptp_{t} that satisfies θ~t,u~(pt,)0\langle\widetilde{\theta}_{t},\widetilde{u}(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}. Lemma 3.16 shows that if θ~t=VTθt\widetilde{\theta}_{t}=V^{T}\theta_{t} for each tt, then both algorithms will generate exactly the same sequence of actions in response to this loss sequence999Technically, there may be some leeway in terms of which ptp_{t} to choose that e.g. satisfies θt,u(pt,)0\langle\theta_{t},u(p_{t},\ell)\rangle\leq 0 for all \ell\in{\mathscr{L}}, and the two different procedures could result in different choices of ptp_{t}. But if we break ties consistently (e.g. add the additional constraint of choosing the ptp_{t} that maximizes the inner product of ptp_{t} with some generic vector), then both procedures will produce the same value of ptp_{t}., and hence the same regret.

The question then becomes: how do we design an OLO algorithm ~\widetilde{\mathcal{F}} that outputs θ~t=VTθt\widetilde{\theta}_{t}=V^{T}\theta_{t} in each round in which \mathcal{F} would output θt\theta_{t}? Recall that if \mathcal{F} is an FTRL algorithm with regularizer R:ΔdR:\Delta_{d}\to\mathbb{R}, then

θt=argminθΔd(R(θ)+s=1t1θ,u(ps,s)).\theta_{t}=\operatorname*{argmin}_{\theta\in\Delta_{d}}\left(R(\theta)+\sum_{s=1}^{t-1}\langle\theta,u(p_{s},\ell_{s})\rangle\right).

Define R~:𝒯f\widetilde{R}:{\mathscr{T}}_{f}^{*}\to\mathbb{R} via

R~(θ~)=minθΔdVTθ=θ~R(θ).\widetilde{R}(\widetilde{\theta})=\min_{\begin{subarray}{c}\theta\in\Delta_{d}\\ V^{T}\theta=\widetilde{\theta}\end{subarray}}R(\theta). (7)

We claim that if we let ~\widetilde{\mathcal{F}} be the FTRL algorithm that uses regularizer R~\widetilde{R}, then ~\widetilde{\mathcal{F}} will output our desired sequence of θ~t\widetilde{\theta}_{t}.

Lemma 3.17.

The following equality holds for the output θ~t\widetilde{\theta}_{t}:

θ~t=argminθ~𝒯f(R~(θ~)+s=1t1θ~,u~(ps,s))=VTθt.\widetilde{\theta}_{t}=\operatorname*{argmin}_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}\left(\widetilde{R}(\widetilde{\theta})+\sum_{s=1}^{t-1}\langle\widetilde{\theta},\widetilde{u}(p_{s},\ell_{s})\rangle\right)=V^{T}\theta_{t}.
Proof.

Note that we can write

θ~t\displaystyle\widetilde{\theta}_{t} =\displaystyle= argminθ~𝒯f(R~(θ~)+s=1t1θ~,u~(ps,s))\displaystyle\operatorname*{argmin}_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}\left(\widetilde{R}(\widetilde{\theta})+\sum_{s=1}^{t-1}\langle\widetilde{\theta},\widetilde{u}(p_{s},\ell_{s})\rangle\right)
=\displaystyle= argminθ~𝒯f((minθΔdVTθ=θ~R(θ))+s=1t1θ~,u~(ps,s))\displaystyle\operatorname*{argmin}_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}\left(\left(\min_{\begin{subarray}{c}\theta\in\Delta_{d}\\ V^{T}\theta=\widetilde{\theta}\end{subarray}}R(\theta)\right)+\sum_{s=1}^{t-1}\langle\widetilde{\theta},\widetilde{u}(p_{s},\ell_{s})\rangle\right)
=\displaystyle= argminθ~𝒯f(minθΔdVTθ=θ~(R(θ)+s=1t1θ,u(ps,s)))\displaystyle\operatorname*{argmin}_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}\left(\min_{\begin{subarray}{c}\theta\in\Delta_{d}\\ V^{T}\theta=\widetilde{\theta}\end{subarray}}\left(R(\theta)+\sum_{s=1}^{t-1}\langle\theta,u(p_{s},\ell_{s})\rangle\right)\right)
=\displaystyle= VTargminθΔd(R(θ)+s=1t1θ,u(ps,s))\displaystyle V^{T}\operatorname*{argmin}_{\theta\in\Delta_{d}}\left(R(\theta)+\sum_{s=1}^{t-1}\langle\theta,u(p_{s},\ell_{s})\rangle\right)
=\displaystyle= VTθt.\displaystyle V^{T}\theta_{t}.

The third equality follows from the fact that if θ~=VTθ\widetilde{\theta}=V^{T}\theta, then θ,u(p,)=θ~,u~(p,)\langle\theta,u(p,\ell)\rangle=\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle (by Lemma 3.16). ∎

Corollary 3.18.

Let \mathcal{F} be an FTRL algorithm over Δd\Delta_{d} with regularizer RR, and let ~\widetilde{\mathcal{F}} be an FTRL algorithm over 𝒯f{\mathscr{T}}_{f}^{*} with regularizer R~\widetilde{R}. Then 𝖱𝖾𝗀(𝒜)=𝖱𝖾𝗀(𝒜~)\operatorname{\mathsf{Reg}}(\mathcal{A})=\operatorname{\mathsf{Reg}}(\widetilde{\mathcal{A}}).

We now consider the specific case where RR is the negentropy function; i.e., for θΔd\theta\in\Delta_{d}, R(θ)=H(θ)=i=1dθilogθiR(\theta)=-H(\theta)=\sum_{i=1}^{d}\theta_{i}\log\theta_{i}. For this choice of RR, R~\widetilde{R} becomes

R~(θ~)=maxθΔdVTθ=θ~H(θ).\widetilde{R}(\widetilde{\theta})=-\max_{\begin{subarray}{c}\theta\in\Delta_{d}\\ V^{T}\theta=\widetilde{\theta}\end{subarray}}H(\theta). (8)

In other words, R~(θ~)-\widetilde{R}(\widetilde{\theta}) is the maximum entropy of any distribution θ\theta in Δd\Delta_{d} that satisfies the nmnm linear constraints imposed by VTθ=θ~V^{T}\theta=\widetilde{\theta}. This is exactly an instance of the Maxent problem studied in (Berger et al., 1996; Rosenfeld, 1996; Pietra et al., 1997; Dudík et al., 2007) and (Mohri et al., 2018, Chapter 12).

It is known (Pietra et al., 1997; Dudík et al., 2007; Mohri et al., 2018) that the entropy maximizing distribution is a Gibbs distribution. In particular, the θ\theta maximizing the expression in (8) satisfies (for some real constants λjk\lambda_{jk} for j[n]j\in[n], k[m]k\in[m])

θi=exp(j=1nk=1mλjkvijk)Z(λ),\theta_{i}=\frac{\exp\left(\sum_{j=1}^{n}\sum_{k=1}^{m}\lambda_{jk}v_{ijk}\right)}{Z(\lambda)}, (9)

where Z(λ)Z(\lambda) (the “partition function”) is defined via

Z(λ)=i=1dexp(j=1nk=1mλjkvijk).Z(\lambda)=\sum_{i=1}^{d}\exp\left(\sum_{j=1}^{n}\sum_{k=1}^{m}\lambda_{jk}v_{ijk}\right). (10)

Generally, there is exactly one choice of λjk\lambda_{jk} which results in VTθ=θ~V^{T}\theta=\widetilde{\theta} (since there are mnmn free variables and mnmn linear constraints). For this optimal λ\lambda, it is known that the maximum entropy is given by logZ(λ)λ,θ~\log Z(\lambda)-\langle\lambda,\widetilde{\theta}\rangle, so that R~(θ~)=λ,θ~logZ(λ)\widetilde{R}(\widetilde{\theta})=\langle\lambda,\widetilde{\theta}\rangle-\log Z(\lambda). If it is possible to solve this system for λjk\lambda_{jk} and evaluate Z(λ)Z(\lambda) efficiently, we can then evaluate R~\widetilde{R} efficiently. In Section 4.1 we will see how to do this for the specific case of swap regret (where we are helped by the fact that (9) guarantees that θ\theta is a product distribution).
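For a given λ, evaluating the Gibbs distribution (9) and the partition function (10) is straightforward in principle; the sketch below (our illustration) materializes the matrix V with rows v_i, which is only viable when d is small, so in applications one exploits structure instead, as in the swap-regret case of Section 4.1.

```python
import numpy as np

def gibbs_theta_and_logZ(V, lam):
    """Evaluate the Gibbs distribution of (9) and log Z(lambda) of (10).

    V: d-by-(nm) array whose i-th row is v_i; lam: vector in R^{nm}.
    Materializing V is only a didactic shortcut: in applications d is
    typically exponential, so one uses problem structure instead."""
    scores = V @ lam                 # <v_i, lambda> for each benchmark i
    shift = scores.max()             # stabilize the exponentials
    w = np.exp(scores - shift)
    Z_shifted = w.sum()
    theta = w / Z_shifted            # equals exp(<v_i, lambda>) / Z(lambda)
    log_Z = np.log(Z_shifted) + shift
    return theta, log_Z
```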

Regardless of how we compute the maxent regularizer R~\widetilde{R}, efficient computation leads to an efficient O(Tlogd)O(\sqrt{T\log d}) regret minimization algorithm.

Corollary 3.19.

If there exists an efficient (poly(n,m)\operatorname*{poly}(n,m) time) algorithm for computing the maxent regularizer R~(θ~)\widetilde{R}(\widetilde{\theta}), then there exists a regret minimization algorithm 𝒜~\widetilde{\mathcal{A}} for uu with regret O(Tlogd)O(\sqrt{T\log d}) and that can be implemented in poly(n,m)\operatorname*{poly}(n,m) time per round.

4 Applications

4.1 Swap regret

Recall that in the setting of swap regret, the action set is 𝒫=ΔK{\mathscr{P}}=\Delta_{K} (distributions over KK actions), the loss set is =[0,1]K{\mathscr{L}}=[0,1]^{K}, and the swap regret of playing a sequence of actions 𝒑\bm{p} on a sequence of losses \bm{\ell} is given by

𝖲𝗐𝖺𝗉𝖱𝖾𝗀(𝒑,)=maxπ:[K][K]t=1Ti=1K(ptitiptitπ(i)).\operatorname{\mathsf{SwapReg}}(\bm{p},\bm{\ell})=\max_{\pi:[K]\to[K]}\sum_{t=1}^{T}\sum_{i=1}^{K}(p_{ti}\ell_{ti}-p_{ti}\ell_{t\pi(i)}).

In words, swap regret compares the total loss achieved by this sequence of actions with the loss achieved by any transformed action sequence formed by applying an arbitrary “swap function” to the actions (i.e., always playing action π(i)\pi(i) instead of action ii). Swap regret minimization can be directly written as an \ell_{\infty}-approachability problem for the bilinear function u:𝒫×𝕂𝕂u\colon{\mathscr{P}}\times{\mathscr{L}}\to\mathbb{R}^{K^{K}} where

uπ(p,)=i=1K(piipiπ(i)).u_{\pi}(p,\ell)=\sum_{i=1}^{K}(p_{i}\ell_{i}-p_{i}\ell_{\pi(i)}).

(here we index uu by the KKK^{K} functions π:[K][K]\pi:[K]\to[K]). Note that the negative orthant is indeed separable with respect to uu: if we let p()=argminp𝒫p,p^{*}(\ell)=\operatorname*{argmin}_{p\in{\mathscr{P}}}\langle p,\ell\rangle, then uπ(p(),)0u_{\pi}(p^{*}(\ell),\ell)\leq 0 for any swap function π\pi.

We can now apply the theory developed in Section 3. First, since m=n=Km=n=K and the maximum absolute value of any coefficient in uu is 11, Theorem 3.9 immediately results in a swap regret algorithm with regret O(K2T)O(K^{2}\sqrt{T}). Moreover, since we can write

maxπ:[K][K]t=1Ti=1K(ptitiptitπ(i))=i=1Kmaxj[K]t=1T(ptitiptitj),\max_{\pi:[K]\to[K]}\sum_{t=1}^{T}\sum_{i=1}^{K}(p_{ti}\ell_{ti}-p_{ti}\ell_{t\pi(i)})=\sum_{i=1}^{K}\max_{j\in[K]}\sum_{t=1}^{T}(p_{ti}\ell_{ti}-p_{ti}\ell_{tj}),

we can compute 𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{SwapReg}} efficiently. By Theorem 3.15, we can therefore implement this O(K2T)O(K^{2}\sqrt{T})-regret algorithm efficiently (in poly(K)\operatorname*{poly}(K) time per round). We can improve upon this regret bound by noting that diam(𝒫)=O(1)\operatorname*{diam}({\mathscr{P}})=O(1), diam()=O(K)\operatorname*{diam}({\mathscr{L}})=O(\sqrt{K}), and Dz=O(K)D_{z}=O(\sqrt{K}) (the coefficient vector vπv_{\pi} corresponding to uπ(p,)u_{\pi}(p,\ell) contains at most 2K2K coefficients that are ±1\pm 1). It follows that Dz(diam𝒫)(diam)=O(K)D_{z}(\operatorname*{diam}{\mathscr{P}})(\operatorname*{diam}{\mathscr{L}})=O(K), and hence, by the first part of Theorem 3.9, the regret of the aforementioned algorithm is actually only O(KT)O(K\sqrt{T}).
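A minimal sketch (our illustration) of this efficient evaluation of SwapReg, using the per-coordinate decomposition displayed above:

```python
import numpy as np

def swap_regret(P, L):
    """Compute SwapReg(p, l) via the decomposition
    sum_i max_j sum_t (p_{ti} l_{ti} - p_{ti} l_{tj}).

    P, L: arrays of shape (T, K) holding the played distributions and losses."""
    P = np.asarray(P, dtype=float)
    L = np.asarray(L, dtype=float)
    own = np.sum(P * L, axis=0)      # own[i]      = sum_t p_{ti} l_{ti}
    cross = P.T @ L                  # cross[i, j] = sum_t p_{ti} l_{tj}
    return float(np.sum(own - cross.min(axis=1)))
```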

This is still a factor of approximately O(K)O(\sqrt{K}) larger than the optimal O(TKlogK)O(\sqrt{TK\log K}) bound. To achieve the optimal regret bound, we will show how to compute the maxent regularizer for swap regret (and hence can efficiently implement an O(Tlogd)=O(TKlogK)O(\sqrt{T\log d})=O(\sqrt{TK\log K}) algorithm via Corollary 3.19). Recall that the maxent regularizer R~(θ~)\widetilde{R}(\widetilde{\theta}), defined for θ~VTΔd\widetilde{\theta}\in V^{T}\Delta_{d}, is the negative of the maximum entropy of a distribution θΔd\theta\in\Delta_{d} that satisfies θ~=VTθ\widetilde{\theta}=V^{T}\theta. For our problem, θ~\widetilde{\theta} is (K2)(K^{2})-dimensional, and this imposes the following linear constraints on θ\theta (which we view as a distribution over swap functions π\pi):

θ~ij\displaystyle\widetilde{\theta}_{ij} =\displaystyle= ππ(i)=jθπ for ji\displaystyle-\sum_{\pi\mid\pi(i)=j}\theta_{\pi}\hskip 28.45274pt\mbox{ for $j\neq i$}
θ~ii\displaystyle\widetilde{\theta}_{ii} =\displaystyle= 1ππ(i)=iθπ.\displaystyle 1-\sum_{\pi\mid\pi(i)=i}\theta_{\pi}.

Now, by the characterization presented in (9)101010In particular, in the case of swap regret we have that vπ,j,k=𝟏(k=j)𝟏(k=π(j))v_{\pi,j,k}=\bm{1}(k=j)-\bm{1}(k=\pi(j)). Since the first term (𝟏(k=j)\bm{1}(k=j)) does not depend on π\pi, we can ignore its contribution to θπ\theta_{\pi} (it cancels from the numerator and denominator)., we know the entropy maximizing θ\theta satisfies (for some K2K^{2} constants λij\lambda_{ij})

θπ=exp(i=1Kλiπ(i))πexp(i=1Kλiπ(i))=i=1Kexp(λiπ(i))πi=1Kexp(λiπ(i))=i=1Kexp(λiπ(i))j=1Kexp(λij).\displaystyle\theta_{\pi}=\frac{\exp\left(\sum_{i=1}^{K}\lambda_{i\pi(i)}\right)}{\sum_{\pi^{\prime}}\exp\left(\sum_{i=1}^{K}\lambda_{i\pi^{\prime}(i)}\right)}=\frac{\prod_{i=1}^{K}\exp\left(\lambda_{i\pi(i)}\right)}{\sum_{\pi^{\prime}}\prod_{i=1}^{K}\exp\left(\lambda_{i\pi^{\prime}(i)}\right)}=\prod_{i=1}^{K}\frac{\exp(\lambda_{i\pi(i)})}{\sum_{j=1}^{K}\exp(\lambda_{ij})}.

In particular, this shows that the entropy maximizing distribution θ\theta is a product distribution over the set of swap functions π\pi where for each i[K]i\in[K] the value of π(i)\pi(i) is chosen independently. Moreover, from θ~\widetilde{\theta} we can recover the overall marginal probability qij=πθ[π(i)=j]q_{ij}=\operatorname*{\mathbb{P}}_{\pi\sim\theta}[\pi(i)=j] (it is θ~ij-\widetilde{\theta}_{ij} if jij\neq i and 1θ~ii1-\widetilde{\theta}_{ii} if j=ij=i). The entropy of this product distribution can therefore be written as:

H(θ)=i=1KH(π(i))=i=1Kj=1Kqijlogqij.H(\theta)=\sum_{i=1}^{K}H(\pi(i))=-\sum_{i=1}^{K}\sum_{j=1}^{K}q_{ij}\log q_{ij}.

Our regularizer R~(θ~)\widetilde{R}(\widetilde{\theta}) is simply H(θ)-H(\theta) and can clearly be efficiently computed as a function of θ~\widetilde{\theta}. It follows from Corollary 3.19 that there exists an efficient (poly(K)\operatorname*{poly}(K) time per round) regret minimization algorithm for swap regret that incurs O(Tlogd)=O(TKlogK)O(\sqrt{T\log d})=O(\sqrt{TK\log K}) worst-case regret.
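A minimal sketch (our illustration) of this computation, taking θ̃ as a K-by-K array and assuming it lies in T_f^* so that the recovered marginals form a valid row-stochastic matrix:

```python
import numpy as np

def swap_maxent_regularizer(theta_tilde):
    """Maxent regularizer for swap regret: R_tilde(theta_tilde) = -H(theta),
    computed from the marginals q_{ij} = P[pi(i) = j] recovered from the
    K-by-K matrix theta_tilde (q_{ij} = -theta_tilde_{ij} for j != i and
    q_{ii} = 1 - theta_tilde_{ii})."""
    q = -np.asarray(theta_tilde, dtype=float)
    q[np.diag_indices_from(q)] += 1.0       # q_{ii} = 1 - theta_tilde_{ii}
    q = np.clip(q, 1e-12, 1.0)              # guard against log(0) at the boundary
    return float(np.sum(q * np.log(q)))     # sum_{ij} q_{ij} log q_{ij} = -H(theta)
```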

Finally, we briefly remark that the pseudonorm ff we construct here is closely related to the 1,\ell_{1,\infty} group norm defined over KK-by-KK square matrices as the 1\ell_{1} norm of the vector formed by the \ell_{\infty} norms of the rows (i.e., M1,=i=1Kmaxj[K]|Mij|||M||_{1,\infty}=\sum_{i=1}^{K}\max_{j\in[K]}|M_{ij}|). This is not unique to swap regret; in many of our applications, the relevant pseudonorm can be thought of as a composition of multiple smaller norms (often 1\ell_{1} or \ell_{\infty} norms).

4.2 Procrustean swap regret

To illustrate the power of Theorem 3.15, we present a toy variant of swap regret where the learner must compete against an infinite set of swap functions (in particular, all orthogonal linear transformations of their sequence of actions) and yet can do this efficiently while incurring low (polynomial in the dimension of their action set) regret.

In this problem, the action set 𝒫={xn;x21}{\mathscr{P}}=\{x\in\mathbb{R}^{n}\,;\,||x||_{2}\leq 1\} is the unit ball (the set of vectors with 2\ell_{2} norm at most 1) and the loss set =[1,1]n{\mathscr{L}}=[-1,1]^{n}. The learner would like to minimize the following notion of regret:

𝖯𝗋𝗈𝖼𝗋𝗎𝗌𝗍𝖾𝗌𝖱𝖾𝗀(𝒑,)=maxQO(n)t=1T(pt,tQpt,t).\operatorname{\mathsf{ProcrustesReg}}(\bm{p},\bm{\ell})=\max_{Q\in O(n)}\sum_{t=1}^{T}(\langle p_{t},\ell_{t}\rangle-\langle Qp_{t},\ell_{t}\rangle). (11)

Here, O(n)O(n) is the set of all orthogonal nn-by-nn matrices. We call this notion of regret Procrustean swap regret due to its similarity with the orthogonal Procrustes problem from linear algebra, which (loosely) asks for the orthogonal matrix which most closely maps one sequence of points 𝒙1\bm{x}_{1} onto another sequence of points 𝒙2\bm{x}_{2} (in our setting, we intuitively want to map 𝒙\bm{x} onto -\bm{\ell} to minimize the loss of our benchmark). See Gower and Dijksterhuis (2004) for a more detailed discussion of the Procrustes problem. Regardless, note that we can compute 𝖯𝗋𝗈𝖼𝗋𝗎𝗌𝗍𝖾𝗌𝖱𝖾𝗀(𝒑,)\operatorname{\mathsf{ProcrustesReg}}(\bm{p},\bm{\ell}) efficiently, since we have an efficient membership oracle for the convex hull conv(O(n))\operatorname*{conv}(O(n)) of the set of orthogonal matrices (specifically, an nn-by-nn matrix MM belongs to conv(O(n))\operatorname*{conv}(O(n)) iff all its singular values are at most 11).
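The membership test for conv(O(n)) mentioned above reduces to a singular value computation; a minimal sketch (our illustration):

```python
import numpy as np

def in_conv_orthogonal(M, tol=1e-9):
    """Membership oracle for conv(O(n)): an n-by-n matrix lies in the convex
    hull of the orthogonal matrices iff its largest singular value is at most 1."""
    sigma_max = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False).max()
    return bool(sigma_max <= 1.0 + tol)
```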

In our approachability framework, we can capture this notion of regret with the bilinear function u(p,)u(p,\ell) with coordinates indexed by QO(n)Q\in O(n) with u(p,)Q=p,Qp,u(p,\ell)_{Q}=\langle p,\ell\rangle-\langle Qp,\ell\rangle (which is separable with respect to the negative orthant, since for p=/2p=-\ell/||\ell||_{2}, u(p,)Q0u(p,\ell)_{Q}\leq 0). Since all the conditions of Theorem 3.15 hold, there is an efficient learning algorithm which incurs at most O(nT)O(\sqrt{nT}) Procrustean swap regret (in particular, diam𝒫=O(1)\operatorname*{diam}{\mathscr{P}}=O(1), diam=O(n)\operatorname*{diam}{\mathscr{L}}=O(\sqrt{n}), and Dz=O(1)D_{z}=O(1)).

4.3 Converging to Bayesian correlated equilibria

Swap regret has the nice property that if in a repeated nn-player normal-form game, all players run a low-swap regret algorithm to select their actions, their time-averaged strategy profile will converge to a correlated equilibrium (indeed, this is one of the major motivations for studying swap regret).

In repeated Bayesian games (games where each player has private information drawn independently from some distribution each round) the analogue of correlated equilibria is Bayesian correlated equilibria. Playing a repeated Bayesian game requires a contextual learning algorithm, which can observe the private information of the player (the “context”) and select an action based on this. Mansour et al. (2022) show that there is a notion of regret (that we call 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}}) such that if all learners are playing an algorithm with low 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}}, then over time they converge on average to a Bayesian correlated equilibrium. However, while Mansour et al. (2022) provide a low 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}} algorithm, their algorithm is not provably efficient (it requires finding the fixed point of a system of quadratic equations); by applying our framework, we show that it is possible to obtain a polynomial-time, low-𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}} algorithm for this problem.

Formally, we study the following full-information contextual online learning setting. As before, there are KK actions, but there are now CC different contexts (“types”). Every round tt, the adversary specifies a loss function t=[0,1]CK\ell_{t}\in{\mathscr{L}}=[0,1]^{CK}, where t,i,c\ell_{t,i,c} represents the loss from playing action ii in context cc. Simultaneously, the learner specifies an action pt𝒫=Δ([K])C𝕂p_{t}\in{\mathscr{P}}=\Delta([K])^{C}\subset\mathbb{R}^{CK}, which we view as a function mapping each context to a distribution over actions. Overall, the learner incurs expected loss c=1C𝒞[c]i=1Kpt(c)it,i,c\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}p_{t}(c)_{i}\ell_{t,i,c} this round (the learner’s context is drawn iid from the publicly known distribution 𝒞\mathcal{C} each round). In this formulation, 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}} can be written in the form

𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀=maxκ,πct=1Tc=1C𝒞[c]i=1K(pt(c)it,i,cpt(κ(c))it,πc(i),c),\operatorname{\mathsf{BayesSwapReg}}=\max_{\kappa,\pi_{c}}\sum_{t=1}^{T}\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right), (12)

where the maximum is over all “type swap functions” κ:[C][C]\kappa\colon[C]\to[C] and CC-tuples of “action-deviation swap functions” πc:[K][K]\pi_{c}\colon[K]\to[K]. It is straightforward to verify that 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}} (as written in (12)) can be written as an \ell_{\infty}-approachability problem for a bilinear function u:𝒫×du\colon{\mathscr{P}}\times{\mathscr{L}}\to\mathbb{R}^{d} with d=CCKKCd=C^{C}K^{KC}. The theory of \ell_{\infty}-approachability guarantees the existence of an algorithm with O(Tlogd)=O(T(ClogC+KClogK))O(\sqrt{T\log d})=O(\sqrt{T(C\log C+KC\log K)}) regret, but this algorithm has time/space complexity O(d)O(d) and is very inefficient for even moderate values of CC or KK.

Instead, in Appendix C, we show that 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}} can be written in the form

𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀=c=1Cmaxc[C]i=1Kmaxj[K]t=1T𝒞[c](pt(c)it,i,cpt(c)it,j,c).\operatorname{\mathsf{BayesSwapReg}}=\sum_{c=1}^{C}\max_{c^{\prime}\in[C]}\sum_{i=1}^{K}\max_{j\in[K]}\sum_{t=1}^{T}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(c^{\prime})_{i}\ell_{t,j,c}\right).

This allows us to evaluate 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\operatorname{\mathsf{BayesSwapReg}} in poly(C,K,T)\operatorname*{poly}(C,K,T) time and apply our pseudonorm approachability framework. Directly from Theorems 3.9 and 3.15, we know that there exists an efficient (poly(C,K)\operatorname*{poly}(C,K) time per round) learning algorithm that incurs at most O(K2C2T)O(K^{2}C^{2}\sqrt{T}) Bayesian swap regret. As with swap regret, we can tighten this bound somewhat by examining the values of diam(𝒫)\operatorname*{diam}({\mathscr{P}}) and diam()\operatorname*{diam}({\mathscr{L}}) and show that this algorithm actually incurs at most O(KCT)O(KC\sqrt{T}) regret (details in Appendix C).
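A minimal sketch (our illustration) of evaluating BayesSwapReg via the decomposition above; the array layout for the action and loss sequences and the explicit context distribution mu are representational assumptions made only for this example.

```python
import numpy as np

def bayes_swap_regret(P, L, mu):
    """Evaluate BayesSwapReg via the decomposition above in poly(C, K, T) time.

    P:  array of shape (T, C, K), P[t, c, i] = p_t(c)_i.
    L:  array of shape (T, K, C), L[t, i, c] = loss of action i in context c.
    mu: array of shape (C,), the context distribution."""
    P = np.asarray(P, dtype=float)
    L = np.asarray(L, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # A[c, i] = mu[c] * sum_t p_t(c)_i * l_{t,i,c}
    A = mu[:, None] * np.einsum("tci,tic->ci", P, L)
    # S[d, i, j, c] = sum_t p_t(d)_i * l_{t,j,c}   (d plays the role of c')
    S = np.einsum("tdi,tjc->dijc", P, L)
    total = 0.0
    for c in range(len(mu)):
        # inner[c', i] = A[c, i] - mu[c] * min_j S[c', i, j, c]
        inner = A[c] - mu[c] * S[:, :, :, c].min(axis=2)
        total += inner.sum(axis=1).max()    # sum over i, then max over c'
    return float(total)
```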

This is within approximately an O(KC)O(\sqrt{KC}) factor of optimal. Interestingly, unlike with swap regret, it is unclear if it is possible to efficiently solve the relevant entropy maximization problem for Bayesian swap regret (and hence achieve the optimal regret bound). We pose this as an open question.

Open Question 1.

Is it possible to efficiently (in poly(K,C)\operatorname*{poly}(K,C) time) evaluate the maximum entropy regularizer R~(θ~)\widetilde{R}(\widetilde{\theta}) for the problem of Bayesian swap regret?

4.4 Reinforcement learning in constrained MDPs

We consider episodic reinforcement learning in constrained Markov decision processes (MDPs). Here, the agent receives a vectorial reward (loss) in each time step and aims to find a policy that achieves a certain minimum total reward (maximum total loss) in each dimension. Approachability has been used to derive reinforcement learning algorithms for constrained MDPs before (Miryoosefi et al., 2019; Yu et al., 2021; Miryoosefi and Jin, 2021), however, exclusively using 2\ell_{2} geometry. As a result, these methods aim to bound the 2\ell_{2} distance to the feasible set and the bounds scale with the 2\ell_{2} norm of the reward vector. This deviates from the more common, and perhaps more natural, formulation for constrained MDPs studied in other works (Efroni et al., 2020; Brantley et al., 2020; Ding et al., 2021). Here, each component of the loss vector is within a given range (e.g. [0,1][0,1]) and the goal is to minimize the largest constraint violation among all components. We will show that \ell_{\infty}-approachability is the natural approach for this problem and yields algorithms whose regret avoids the d\sqrt{d} factor, where dd is the number of constraints, incurred by algorithms based on 2\ell_{2} approachability. While the number of constraints dd can sometimes be small, there are many applications where dd is large and a poly(d)\operatorname{poly}(d) dependency in the regret is undesirable, even when a computational dependency on dd is manageable. For example, constraints may arise from fairness considerations like demographic disparity that ensure that the policy behaves similarly across all protected groups. This would require a constraint for each pair of groups, of which there can be many.

The formal problem setup is as follows. We consider an MDP (𝒳,𝒜,P,{}t[T])({\mathscr{X}},{\mathscr{A}},P,\left\{\ell\right\}_{t\in[T]}) defined by a state space 𝒳{\mathscr{X}}, an action set 𝒜{\mathscr{A}}, a transition function P:𝒳×𝒜×𝒳[0,1]P\colon{\mathscr{X}}\times{\mathscr{A}}\times{\mathscr{X}}\to[0,1], where P(x|x,a)P(x^{\prime}|x,a) is the probability of reaching state xx^{\prime} when choosing action aa at state xx, and :𝒳×𝒜[0,1]d\ell\colon{\mathscr{X}}\times{\mathscr{A}}\to[0,1]^{d} the loss vector function with =(1,,d)\ell=(\ell^{1},\ldots,\ell^{d}). We work with losses for consistency with the other sections but our results also readily apply to rewards. To simplify the presentation, we will assume a layered MDP with L+1L+1 layers with 𝒳=l=0L𝒳l{\mathscr{X}}=\bigcup_{l=0}^{L}{\mathscr{X}}_{l}, 𝒳i𝒳j={\mathscr{X}}_{i}\cap{\mathscr{X}}_{j}=\emptyset for iji\neq j, and with 𝒳0={x0}{\mathscr{X}}_{0}=\{x_{0}\} and 𝒳L={xL}{\mathscr{X}}_{L}=\{x_{L}\}.

We define a (stochastic) policy π\pi as a mapping from 𝒜×𝒳[0,1]{\mathscr{A}}\times{\mathscr{X}}\to[0,1] where π(a|x)\pi(a|x) represents the probability of action aa in state xx. Given a policy π\pi and the transition probability PP, we define the occupancy measure 𝗊P,π{\mathsf{q}}^{P,\pi} as the probability of visiting state-action pair (x,a)(x,a) when following policy π\pi under transitions PP (Altman, 1999; Neu et al., 2012): 𝗊P,π(x,a)=[x,aP,π]{\mathsf{q}}^{P,\pi}(x,a)=\operatorname*{\mathbb{P}}\left[x,a\mid P,\pi\right]. We will denote by Δ(P)\Delta(P) the set of all occupancy measures, obtained by varying the policy π\pi. It is known that Δ(P)\Delta(P) forms a polytope.

We consider the feasibility problem in CMDPs with stochastic rewards and unknown dynamics. The loss vector :𝒳×𝒜[0,1]d\ell\colon{\mathscr{X}}\times{\mathscr{A}}\to[0,1]^{d} is the same in all episodes, ={}{\mathscr{L}}=\{\ell\}. Our goal is to learn a policy πΠ\pi\in\Pi such that 𝗊P,π,𝐜\langle{\mathsf{q}}^{P,\pi},\ell\rangle\leq{\mathbf{c}} for a given threshold vector 𝐜[0,L]d{\mathbf{c}}\in[0,L]^{d}. The payoff function u:Δ(P)×du\colon\Delta(P)\times{\mathscr{L}}\to\mathbb{R}^{d} is defined as: u(𝗉,)=[𝗉1c1𝗉dcd]u({\mathsf{p}},\ell)=\left[\begin{smallmatrix}{\mathsf{p}}\cdot\ell^{1}-c_{1}\\ \vdots\\ {\mathsf{p}}\cdot\ell^{d}-c_{d}\end{smallmatrix}\right]. The set 𝒮=(,0]d{\mathscr{S}}=(-\infty,0]^{d} is separable as long as there is a policy that satisfies all constraints. Although we define the payoff function uu in terms of occupancy measures Δ(P)\Delta(P), they will be implicit in the algorithm.

To aid the comparison of our approach to existing work in this setting, we omit the dimensionality reduction with pseudonorms in this application and directly work in the dd-dimensional space. We will analyze Algorithm 1 for \ell_{\infty} approachability, which can be implemented in MDPs by adopting the following oracles from prior work (Miryoosefi et al., 2019) on CMDPs:

  • BestResponse-Oracle For a given θd\theta\in\mathbb{R}^{d}, this oracle returns a policy π\pi that is ϵ0\epsilon_{0}-optimal with respect to the reward function (s,a)(s,a),θ(s,a)\mapsto-\langle\ell(s,a),\theta\rangle.

  • Est-Oracle For a given policy π\pi, this oracle returns a vector z[0,L]dz\in[0,L]^{d} such that z𝗊P,π,ϵ1\|z-\langle{\mathsf{q}}^{P,\pi},\ell\rangle\|_{\infty}\leq\epsilon_{1}.

Consider first the case without approximation errors, ϵ0=ϵ1=0\epsilon_{0}=\epsilon_{1}=0, for illustration. For a vector θtΔd+1\theta_{t}\in\Delta_{d+1}, let θtd\theta_{t}^{\prime}\in\mathbb{R}^{d} be the vector that contains only the first dd dimensions of θt\theta_{t}. When we call the BestResponse oracle with vector θt\theta_{t}^{\prime}, it returns a policy πt\pi_{t} such that its occupancy measure satisfies 𝗊P,πt,θtmax𝗊Δ(P)𝗊,θt\langle{\mathsf{q}}^{P,\pi_{t}},-\theta_{t}^{\prime}\cdot\ell\rangle\geq\max_{{\mathsf{q}}^{\star}\in\Delta(P)}\langle{\mathsf{q}}^{\star},-\theta_{t}^{\prime}\cdot\ell\rangle. We can use this to show:

θt[u(𝗊P,πt,),0]\displaystyle\theta_{t}\cdot[u({\mathsf{q}}^{P,\pi_{t}},\ell),~0] =θtu(𝗊P,πt,)=𝗊P,πt,θtθtc𝗊,θtθtc0\displaystyle=\theta_{t}^{\prime}\cdot u({\mathsf{q}}^{P,\pi_{t}},\ell)=\langle{\mathsf{q}}^{P,\pi_{t}},\theta_{t}^{\prime}\cdot\ell\rangle-\theta_{t}^{\prime}\cdot c\leq\langle{\mathsf{q}}^{\star},\theta_{t}^{\prime}\cdot\ell\rangle-\theta_{t}^{\prime}\cdot c\leq 0

where 𝗊{\mathsf{q}}^{\star} is the occupancy measure of a policy that satisfies all constraints.

Thus 𝗊P,πt{\mathsf{q}}^{P,\pi_{t}} is a valid choice for ptp_{t} in Line 3 of Algorithm 1. Passing the policy πt\pi_{t} to the Est-Oracle yields z^t=u(𝗊P,πt,)+c\hat{z}_{t}=u({\mathsf{q}}^{P,\pi_{t}},\ell)+c; this is enough to compute

yt\displaystyle y_{t} =[z^tc,0]=[u(pt,),0]\displaystyle=-[\hat{z}_{t}-c,~0]=-[u(p_{t},\ell),~0]

in Line 5 of Algorithm 1111111Since FTRL with negative entropy regularizer operates on the simplex with θt=1\|\theta_{t}\|=1, we pad the inputs with a zero dimension to obtain an OLO algorithm on the interior of Δd\Delta_{d}.. Finally, we obtain θt+1d+1\theta_{t+1}\in\mathbb{R}^{d+1} by passing y1,y2,,yty_{1},y_{2},\dots,y_{t} to a negative entropy FTRL algorithm as the OLO algorithm \mathcal{F} (Line 6 of Alg. 1).
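The per-round loop just described can be sketched as follows (our illustration, not the paper's pseudocode for Algorithm 1): best_response and est stand for the two oracles above, and exponential weights plays the role of the negative-entropy FTRL algorithm.

```python
import numpy as np

def run_cmdp_approachability(best_response, est, c, d, T, eta):
    """Sketch of the l_infty-approachability loop for constrained MDPs.

    best_response(theta_prime): returns a policy (approximately) maximizing
        the reward (s, a) -> -<l(s, a), theta_prime>.
    est(policy): returns an estimate of the expected-loss vector q^{P,pi} . l.
    c: constraint thresholds in R^d; eta: FTRL step size."""
    cum_y = np.zeros(d + 1)                 # cumulative OLO losses y_1 + ... + y_t
    theta = np.ones(d + 1) / (d + 1)        # uniform start on the padded simplex
    policies = []
    for _ in range(T):
        pi_t = best_response(theta[:d])     # only the first d coordinates matter
        policies.append(pi_t)
        z_hat = est(pi_t)                   # approximates q^{P,pi_t} . l
        y_t = -np.concatenate([z_hat - c, [0.0]])    # y_t = -[u(p_t, l), 0]
        cum_y += y_t
        logits = -eta * cum_y               # negative-entropy FTRL = exponential weights
        w = np.exp(logits - logits.max())
        theta = w / w.sum()
    return policies                         # their uniform mixture is pi_bar
```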

Using similar steps as for Theorem 2.3 and relying on the regret bound for \mathcal{F} in Lemma 2.2, we can show the following guarantee:

Proposition.

Consider a constrained episodic MDP with horizon LL, fixed loss vectors \ell with (s,a)[0,1]d\ell(s,a)\in[0,1]^{d} for all state-action pairs (s,a)𝒳×𝒜(s,a)\in{\mathscr{X}}\times{\mathscr{A}}, and a constraint threshold vector c[0,L]dc\in[0,L]^{d}. Assume that there exists a feasible policy that satisfies all constraints, and let π¯\bar{\pi} be the mixture policy of π1,,πT\pi_{1},\dots,\pi_{T} generated by the approach described above. Then the maximum constraint violation of π¯\bar{\pi} satisfies

maxi[d]{𝗊P,π¯ici}=d(1Tt=1Tu(𝗊P,πt,),𝒮)O(Llog(d)T)+ϵ0+2ϵ1.\displaystyle\max_{i\in[d]}\{{\mathsf{q}}^{P,\bar{\pi}}\cdot\ell^{i}-c_{i}\}=d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u({\mathsf{q}}^{P,\pi_{t}},\ell),{\mathscr{S}}\right)\leq O\left(L\sqrt{\frac{\log(d)}{T}}\right)+\epsilon_{0}+2\epsilon_{1}.

Applying the results from prior work (Miryoosefi et al., 2019) based on 2\ell_{2} approachability would yield a bound of (Ld+ϵ1)T1/2+ϵ0+2ϵ1(L\sqrt{d}+\epsilon_{1})T^{-1/2}+\epsilon_{0}+2\epsilon_{1} in our setting, with additional d\sqrt{d} and ϵ1\epsilon_{1} factors in front of T1/2T^{-1/2}. For the sake of exposition, we illustrated the benefit of \ell_{\infty} approachability using the oracles adopted by Miryoosefi et al. (2019), but our approach can also be applied with similar advantages to other works that make oracle calls explicit (Yu et al., 2021; Miryoosefi and Jin, 2021).

5 Conclusion

We presented a new algorithmic framework for \ell_{\infty}-approachability, which we argued is the most suitable notion of approachability for a variety of applications such as regret minimization. Our algorithms leverage a key dimensionality reduction and a reduction to online linear optimization. These ideas can be used to derive similarly useful algorithms for approachability with alternative distance metrics. In fact, as already pointed out, some of our algorithms can equivalently be viewed as reducing a group-norm approachability problem to online linear optimization.

References

  • Abernethy et al. [2008] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 263–274. Omnipress, 2008.
  • Abernethy et al. [2011] Jacob D. Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of COLT, volume 19 of JMLR Proceedings, pages 27–46, 2011.
  • Altman [1999] Eitan Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
  • Bergemann and Morris [2016] Dirk Bergemann and Stephen Morris. Bayes correlated equilibrium and the comparison of information structures in games. Theoretical Economics, 11(2):487–522, 2016.
  • Berger et al. [1996] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Comp. Linguistics, 22(1), 1996.
  • Bernstein and Shimkin [2015] Andrey Bernstein and Nahum Shimkin. Response-based approachability with applications to generalized no-regret problems. J. Mach. Learn. Res., 16:747–773, 2015.
  • Blackwell [1956] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
  • Borwein and Zhu [2005] Jonathan Borwein and Qiji Zhu. Techniques of Variational Analysis. Springer, 2005.
  • Boyd and Vandenberghe [2014] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2014.
  • Brantley et al. [2020] Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Proceedings of NIPS, 33:16315–16326, 2020.
  • Chzhen et al. [2021] Evgenii Chzhen, Christophe Giraud, and Gilles Stoltz. A unified approach to fair online learning via blackwell approachability. In Proceedings of NeurIPS, pages 18280–18292, 2021.
  • Dawid [1982] A. P. Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982. doi: 10.1080/01621459.1982.10477856. URL https://www.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856.
  • Ding et al. [2021] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In Proceedings of AISTATS, pages 3304–3312. PMLR, 2021.
  • Dudík et al. [2007] Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 2007.
  • Efroni et al. [2020] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
  • Forges [1993] Françoise Forges. Five legitimate definitions of correlated equilibrium in games with incomplete information. Theory and decision, 35(3):277–310, 1993.
  • Forges et al. [2006] Françoise Forges et al. Correlated equilibrium in games with incomplete information revisited. Theory and decision, 61(4):329–344, 2006.
  • Foster and Hart [2018] Dean P. Foster and Sergiu Hart. Smooth calibration, leaky forecasts, finite recall, and nash dynamics. Games and Economic Behavior, 109:271–293, 2018. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2017.12.022. URL https://www.sciencedirect.com/science/article/pii/S0899825618300113.
  • Foster and Vohra [1999] Dean P Foster and Rakesh V Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7–35, 1999.
  • Gordon et al. [2008] Geoffrey J. Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In Proceedings of ICML, volume 307, pages 360–367. ACM, 2008.
  • Gower and Dijksterhuis [2004] John C Gower and Garmt B Dijksterhuis. Procrustes problems, volume 30. OUP Oxford, 2004.
  • Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Hart and Mas-Colell [2001] Sergiu Hart and Andreu Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(5):26–54, 2001.
  • Hartline et al. [2015] Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in bayesian games. Advances in Neural Information Processing Systems, 28, 2015.
  • Hazan [2016] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hazan and Kale [2008] Elad Hazan and Satyen Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In NIPS, pages 625–632, 2008.
  • Kalathil et al. [2014] Dileep M. Kalathil, Vivek S. Borkar, and Rahul Jain. Blackwell’s approachability in Stackelberg stochastic games: A learning version. In Proceedings of CDC, pages 4467–4472. IEEE, 2014.
  • Kivinen and Warmuth [1995] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proceedings of STOC, pages 209–218. ACM, 1995.
  • Kwon [2016] Joon Kwon. Mirror descent strategies for regret minimization and approachability. PhD thesis, Université Pierre et Marie Curie, Paris 6, 2016.
  • Kwon [2021] Joon Kwon. Refined approachability algorithms and application to regret minimization with global costs. The Journal of Machine Learning Research, 22:200–1, 2021.
  • Kwon and Perchet [2017] Joon Kwon and Vianney Perchet. Online learning and blackwell approachability with partial monitoring: Optimal convergence rates. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of AISTATS, volume 54, pages 604–613. PMLR, 2017.
  • Lee et al. [2018] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pages 1292–1294. PMLR, 2018.
  • Lehrer [2003] Ehud Lehrer. Approachability in infinite dimensional spaces. Int. J. Game Theory, 31(2):253–268, 2003.
  • Mannor and Shimkin [2003] Shie Mannor and Nahum Shimkin. The empirical bayes envelope and regret minimization in competitive markov decision processes. Math. Oper. Res., 28(2):327–345, 2003.
  • Mannor and Shimkin [2004] Shie Mannor and Nahum Shimkin. A geometric approach to multi-criterion reinforcement learning. J. Mach. Learn. Res., 5:325–360, 2004.
  • Mannor et al. [2014a] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Set-valued approachability and online learning with partial monitoring. J. Mach. Learn. Res., 15(1):3247–3295, 2014a.
  • Mannor et al. [2014b] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Proceedings of COLT, volume 35, pages 339–355, 2014b.
  • Mansour et al. [2022] Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in Bayesian games, 2022. URL https://arxiv.org/abs/2205.08562.
  • Miryoosefi and Jin [2021] Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. CoRR, abs/2107.05216, 2021.
  • Miryoosefi et al. [2019] Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, and Robert E. Schapire. Reinforcement learning with convex constraints. In Proceedings of NIPS, pages 14070–14079, 2019.
  • Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, second edition, 2018.
  • Neu et al. [2012] Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of AISTATS, volume 22 of JMLR Proceedings, pages 805–813, 2012.
  • Perchet [2010] Vianney Perchet. Approchabilité, Calibration et Regret dans les Jeux à Observations Partielles. PhD thesis, Université Pierre et Marie Curie - Paris VI, 2010.
  • Perchet [2015] Vianney Perchet. Exponential weight approachability, applications to calibration and regret minimization. Dynamic Games and Applications, 5(1):136–153, 2015.
  • Perchet and Quincampoix [2015] Vianney Perchet and Marc Quincampoix. On a unified framework for approachability with full or partial monitoring. Mathematics of Operations Research, 40(3):596–610, 2015.
  • Perchet and Quincampoix [2018] Vianney Perchet and Marc Quincampoix. A differential game on Wasserstein space. Application to weak approachability with partial monitoring. CoRR, abs/1811.04575, 2018.
  • Pietra et al. [1997] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell., 19(4), 1997.
  • Rockafellar [1970] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1970.
  • Rosenfeld [1996] Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech & Language, 10(3):187–228, 1996.
  • Shalev-Shwartz [2007] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
  • Shimkin [2016] Nahum Shimkin. An online convex optimization approach to blackwell’s approachability. The Journal of Machine Learning Research, 17(1):4434–4456, 2016.
  • Spinat [2002] Xavier Spinat. A Necessary and Sufficient Condition for Approachability. Mathematics of Operations Research, 27(1):31–44, 2002.
  • Vieille [1992] Nicolas Vieille. Weak approachability. Math. Oper. Res., 17(4):781–791, 1992.
  • von Neumann [1928] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
  • Yu et al. [2021] Tiancheng Yu, Yi Tian, Jingzhao Zhang, and Suvrit Sra. Provably efficient algorithms for multi-objective competitive rl. In International Conference on Machine Learning, pages 12167–12176. PMLR, 2021.
  • Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 928–936. AAAI Press, 2003.

Appendix A Omitted proofs

A.1 Proof of Theorem 2.3

Proof of Theorem 2.3.

Similar results appear in e.g. Kwon [2021]. For completeness, we include a proof here.

We will need the following fact: for any xdx\in\mathbb{R}^{d}, d(x,(,0]d)=supθΔ¯dθ,xd_{\infty}(x,(-\infty,0]^{d})=\sup_{\theta\in{\overline{\Delta}}_{d}}\langle\theta,x\rangle. This is easy to verify (both sides are equal to max(maxi[d]xi,0)\max(\max_{i\in[d]}x_{i},0)), but is also a consequence of Fenchel duality (see e.g. Appendix B and the proof of Theorem 3.1).

Armed with this fact, note that

d(1Tt=1Tu(pt,t),(,0]d)\displaystyle d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),(-\infty,0]^{d}\right) =\displaystyle= supθΔ¯dθ,1Tt=1Tu(pt,t)\displaystyle\sup_{\theta\in{\overline{\Delta}}_{d}}\left\langle\theta,\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t})\right\rangle
=\displaystyle= infθ𝒳θ,1Tt=1T(u(pt,t))\displaystyle-\inf_{\theta\in{\mathscr{X}}}\left\langle\theta,\frac{1}{T}\sum_{t=1}^{T}\left(-u(p_{t},\ell_{t})\right)\right\rangle
=\displaystyle= 1T𝖱𝖾𝗀()1Tt=1Tθt,u(pt,t)\displaystyle\frac{1}{T}\operatorname{\mathsf{Reg}}(\mathcal{F})-\frac{1}{T}\sum_{t=1}^{T}\left\langle\theta_{t},-u(p_{t},\ell_{t})\right\rangle
\displaystyle\leq 1T𝖱𝖾𝗀().\displaystyle\frac{1}{T}\operatorname{\mathsf{Reg}}(\mathcal{F}).

A.2 Proof of Corollary 3.2

Proof of Corollary 3.2.

Note that if θ𝒞\theta\not\in{\mathscr{C}}^{\circ}, then there exists a z𝒞z\in{\mathscr{C}} such that θ,z>0\langle\theta,z\rangle>0. Therefore, if θ𝒞\theta\not\in{\mathscr{C}}^{\circ}, then the supremum sups𝒞θs=\sup_{s\in{\mathscr{C}}}\theta\cdot s=\infty (we can take ss to be a large multiple of zz). On the other hand, if θ𝒞\theta\in{\mathscr{C}}^{\circ}, then the supremum sups𝒞θs=0\sup_{s\in{\mathscr{C}}}\theta\cdot s=0 (taking s=0s=0). It follows that the supremum supθ𝒯f{θzsups𝒞θs}\sup_{\theta\in{\mathscr{T}}^{*}_{f}}\left\{\theta\cdot z-\sup_{s\in{\mathscr{C}}}\theta\cdot s\right\} in Theorem 3.1 must be achieved for a θ𝒞\theta\in{\mathscr{C}}^{\circ} (we know that df(z,𝒞)d_{f}(z,{\mathscr{C}}) is not infinite since 0𝒞0\in{\mathscr{C}}, so df(z,𝒞)f(z)d_{f}(z,{\mathscr{C}})\leq f(z)). For such θ\theta, the sups𝒞θs\sup_{s\in{\mathscr{C}}}\theta\cdot s term vanishes, and we are left with the statement of this corollary. ∎

A.3 Proof of Theorem 3.3

Proof of Theorem 3.3.

Note that for any ρ0\rho\geq 0 we have the following chain of equivalences:

d(1Tt=1Tu(pt,t),(,0]d)ρ\displaystyle d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u(p_{t},\ell_{t}),(-\infty,0]^{d}\right)\leq\rho (13)
\displaystyle\Leftrightarrow 1Tt=1Tui(pt,t)ρi[d]\displaystyle\frac{1}{T}\sum_{t=1}^{T}u_{i}(p_{t},\ell_{t})\leq\rho\;\forall i\in[d]
\displaystyle\Leftrightarrow 1Tt=1Tu~(pt,t),viρi[d]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\langle\widetilde{u}(p_{t},\ell_{t}),v_{i}\rangle\leq\rho\;\forall i\in[d]
\displaystyle\Leftrightarrow 1Tt=1Tu~(pt,t),viρi[d]\displaystyle\left\langle\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t}),v_{i}\right\rangle\leq\rho\;\forall i\in[d]
\displaystyle\Leftrightarrow f(1Tt=1Tu~(pt,t))ρ.\displaystyle f\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t})\right)\leq\rho.

We will now prove that this final inequality on the pseudonorm ff in (13) implies the desired inequality on dfd_{f}:

df(1Tt=1Tu~(pt,t),𝒞)ρ.d_{f}\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t}),{\mathscr{C}}\right)\leq\rho. (14)

Note that since 0𝒞0\in{\mathscr{C}}, (14) directly implies (13). To show that (13) implies (14), we must show that

f((1Tt=1Tu~(pt,t))c)ρ.f\left(\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t})\right)-c\right)\leq\rho.

for any c𝒞c\in{\mathscr{C}}. This in turn is equivalent to proving that:

(1Tt=1Tu~(pt,t))c,viρi[d].\left\langle\left(\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t})\right)-c,v_{i}\right\rangle\leq\rho\;\forall i\in[d].

But since c𝒞c\in{\mathscr{C}}, c,vi0\langle c,v_{i}\rangle\geq 0 for all i[d]i\in[d]. Therefore, as long as

1Tt=1Tu~(pt,t),viρi[d],\left\langle\frac{1}{T}\sum_{t=1}^{T}\widetilde{u}(p_{t},\ell_{t}),v_{i}\right\rangle\leq\rho\;\forall i\in[d],

(which is true by (13)), (14) will be true as well, as desired.

A.4 Proof of Lemma 3.4

Proof of Lemma 3.4.

Let {\mathscr{H}} denote conv{v1,,vd}\operatorname*{conv}\left\{v_{1},\ldots,v_{d}\right\}. If θ~\widetilde{\theta} is in {\mathscr{H}}, then we can write θ~=i=1dαivi\widetilde{\theta}=\sum_{i=1}^{d}\alpha_{i}v_{i} for some αi0\alpha_{i}\geq 0, i[d]i\in[d], with i=1dαi=1\sum_{i=1}^{d}\alpha_{i}=1. Thus, for any xx\in\mathbb{R}^{mn}, we have

θ~,x=i=1dαivi,xi=1dαimaxj[d]vj,x=maxj[d]vj,x,\langle\widetilde{\theta},x\rangle=\sum_{i=1}^{d}\alpha_{i}\langle v_{i},x\rangle\leq\sum_{i=1}^{d}\alpha_{i}\max_{j\in[d]}\langle v_{j},x\rangle=\max_{j\in[d]}\langle v_{j},x\rangle,

which implies that θ~\widetilde{\theta} is in 𝒯f{\mathscr{T}}^{*}_{f}. Conversely, if θ~\widetilde{\theta} is not in {\mathscr{H}}, since {\mathscr{H}} is a non-empty closed convex set, θ~\widetilde{\theta} can be separated from {\mathscr{H}}; that is, there exists xmnx\in\mathbb{R}^{mn} such that

θ~,x>supvv,xmaxi[d]vi,x,\langle\widetilde{\theta},x\rangle>\sup_{v\in{\mathscr{H}}}\langle v,x\rangle\geq\max_{i\in[d]}\langle v_{i},x\rangle,

which implies that θ~\widetilde{\theta} is not in 𝒯f{\mathscr{T}}^{*}_{f}. This completes the proof. ∎

A.5 Proof of Lemma 3.7

Proof of Lemma 3.7.

We begin with the first statement. Note that since the negative orthant (,0]d(-\infty,0]^{d} is separable with respect to uu, for any \ell\in{\mathscr{L}}, there exists a p𝒫p\in{\mathscr{P}} such that ui(p,)0u_{i}(p,\ell)\leq 0 for all i[d]i\in[d]. Now, recall that we can write ui(p,)=u~(p,),viu_{i}(p,\ell)=\langle\widetilde{u}(p,\ell),v_{i}\rangle, so it follows that u~(p,),vi0\langle\widetilde{u}(p,\ell),v_{i}\rangle\leq 0 for all i[d]i\in[d]. This implies u~(p,)𝒞\widetilde{u}(p,\ell)\in{\mathscr{C}}, as desired.

We next prove the second statement. Note that since θ~𝒯f𝒞\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}\subseteq{\mathscr{C}}^{\circ}, this implies that if u~(p,)𝒞\widetilde{u}(p,\ell)\in{\mathscr{C}}, then θ~,u~(p,)0\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle\leq 0. By the first statement, this means that for any \ell\in{\mathscr{L}}, there exists a p𝒫p\in{\mathscr{P}} such that θ~,u~(p,)0\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle\leq 0. By the minimax theorem (since 𝒫{\mathscr{P}} and {\mathscr{L}} are both convex sets and θ~,u~(p,)\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle is a bilinear function of pp and \ell), this implies that there exists a p𝒫p\in{\mathscr{P}} such that for all \ell\in{\mathscr{L}}, θ~,u~(p,)0\langle\widetilde{\theta},\widetilde{u}(p,\ell)\rangle\leq 0, as desired. ∎

A.6 Proof of Lemma 3.10

Proof of Lemma 3.10.

Note that by the definition of 𝒯f(𝒵){\mathscr{T}}^{*}_{f}({\mathscr{Z}}), it must be the case that supθ𝒯f(𝒵)θ,zf(z)\sup_{\theta\in{\mathscr{T}}^{*}_{f}({\mathscr{Z}})}\langle\theta,z\rangle\leq f(z) (this is one of the constraints defining 𝒯f(𝒵){\mathscr{T}}^{*}_{f}({\mathscr{Z}})). Since 𝒯f𝒯f(𝒵){\mathscr{T}}^{*}_{f}\subseteq{\mathscr{T}}^{*}_{f}({\mathscr{Z}}), if we show that supθ𝒯fθ,z=f(z)\sup_{\theta\in{\mathscr{T}}^{*}_{f}}\langle\theta,z\rangle=f(z), this proves the lemma. But this follows directly from Theorem 3.1 applied to the closed convex set 𝒮={0}{\mathscr{S}}=\{0\}. ∎

A.7 Proof of Lemma 3.11

Proof of Lemma 3.11.

Consider the element

θ~=argminθ~𝒯fθ~,z\widetilde{\theta}^{\prime}=\operatorname*{argmin}_{\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}}-\langle\widetilde{\theta},z\rangle

which minimizes the unregularized objective over the smaller set ${\mathscr{T}}_{f}^{*}$. By Lemma 3.10, we must have that $\langle\widetilde{\theta}^{\prime},z\rangle\geq\langle\widetilde{\theta}_{opt},z\rangle$. But also, by the definition of $\widetilde{\theta}_{opt}$, we must have that $\eta R(\widetilde{\theta}_{opt})-\langle\widetilde{\theta}_{opt},z\rangle\leq\eta R(\widetilde{\theta}^{\prime})-\langle\widetilde{\theta}^{\prime},z\rangle$. Combining these two inequalities gives $\eta R(\widetilde{\theta}_{opt})\leq\eta R(\widetilde{\theta}^{\prime})+\langle\widetilde{\theta}_{opt},z\rangle-\langle\widetilde{\theta}^{\prime},z\rangle\leq\eta R(\widetilde{\theta}^{\prime})$, so $R(\widetilde{\theta}_{opt})\leq R(\widetilde{\theta}^{\prime})$, and thus $\|\widetilde{\theta}_{opt}\|\leq\|\widetilde{\theta}^{\prime}\|\leq\operatorname*{diam}({\mathscr{T}}_{f}^{*})$. ∎

A.8 Proof of Lemma 3.12

Proof of Lemma 3.12.

Extend the domain of u~\widetilde{u} as a function to ×\mathbb{R}^{n}\times\mathbb{R}^{m}, and note that there is a unique way to write z=k=1mu~(zk,ek)z=\sum_{k=1}^{m}\widetilde{u}(z_{k},e_{k}), where for each k[m]k\in[m], zkz_{k} is an element of \mathbb{R}^{n} and eke_{k} is the kkth unit vector in \mathbb{R}^{m}. We first claim z𝒵=cone(u~(𝒫,))z\in{\mathscr{Z}}=\operatorname*{cone}(\widetilde{u}({\mathscr{P}},{\mathscr{L}})) iff each zkcone(𝒫)z_{k}\in\operatorname*{cone}({\mathscr{P}}).

To see this, first note that since ${\mathscr{L}}$ is orthant-generating, there exist positive scalars $\lambda_{k}$ such that $\lambda_{k}e_{k}\in{\mathscr{L}}$ for each $k\in[m]$. Now, if each $z_{k}\in\operatorname*{cone}({\mathscr{P}})$, then $\widetilde{u}(z_{k},e_{k})\in\operatorname*{cone}(\widetilde{u}({\mathscr{P}},{\mathscr{L}}))$ (since $e_{k}\in\operatorname*{cone}({\mathscr{L}})$), so $z\in{\mathscr{Z}}$. Conversely, if $z\in{\mathscr{Z}}$, then we can write $z=\sum_{r=1}^{R}\alpha_{r}\widetilde{u}(p_{r},\ell_{r})$ for some $R>0$, $\alpha_{r}\geq 0$, $p_{r}\in{\mathscr{P}}$, and $\ell_{r}\in{\mathscr{L}}$. Expanding each $\ell_{r}$ in the $\{\lambda_{k}e_{k}\}$ basis, we find that each $z_{k}$ must be a non-negative linear combination of the values $p_{r}$, and therefore $z_{k}\in\operatorname*{cone}({\mathscr{P}})$.

Therefore, to check whether $z\in{\mathscr{Z}}$, it suffices to check whether each component $z_{k}$ of $z$ belongs to $\operatorname*{cone}({\mathscr{P}})$. This can be done efficiently given an efficient separation oracle for ${\mathscr{P}}$: we solve the convex feasibility problem of finding a $\beta>0$ with $\beta z_{k}\in{\mathscr{P}}$. Finally, if each $z_{k}\in\operatorname*{cone}({\mathscr{P}})$, we can also recover a value $\beta_{k}$ for each $k$ such that $\beta_{k}z_{k}\in{\mathscr{P}}$ (via the same convex program). This allows us to explicitly write $z=\sum_{k=1}^{m}\alpha_{k}\widetilde{u}(p_{k},\ell_{k})$ with $p_{k}=\beta_{k}z_{k}$, $\ell_{k}=\lambda_{k}e_{k}$, and $\alpha_{k}=1/(\beta_{k}\lambda_{k})$. ∎
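As an illustration, the decomposition above becomes fully explicit under the additional assumptions (made only for this sketch, not part of the lemma) that $\widetilde{u}(p,\ell)=p\ell^{\top}$ viewed as an $n\times m$ matrix, ${\mathscr{P}}=\Delta([n])$ is the probability simplex, and ${\mathscr{L}}=[0,1]^{m}$ (so $\lambda_{k}=1$ and $e_{k}\in{\mathscr{L}}$). In that case $z_{k}$ is the $k$th column of $z$, $\beta_{k}=1/\sum_{j}(z_{k})_{j}$, and no convex program is needed.

```python
# A minimal sketch of the Lemma 3.12 decomposition under the assumptions
# stated above: u~(p, ell) = p ell^T, P = Delta([n]), L = [0,1]^m.
import numpy as np

def decompose(Z, tol=1e-12):
    """Given Z (an n x m matrix viewed as z in R^{nm}), decide whether
    z lies in cone(u~(P, L)) and, if so, return (alpha_k, p_k, k) triples
    with z = sum_k alpha_k * p_k e_k^T."""
    n, m = Z.shape
    if (Z < -tol).any():
        return None                      # some column z_k is not in cone(Delta)
    terms = []
    for k in range(m):
        s_k = Z[:, k].sum()              # alpha_k = s_k, beta_k = 1 / s_k
        if s_k > tol:
            p_k = Z[:, k] / s_k          # a point of the simplex
            terms.append((s_k, p_k, k))  # contributes s_k * p_k e_k^T
    return terms

# Sanity check: rebuild Z from the decomposition.
rng = np.random.default_rng(0)
Z = np.abs(rng.normal(size=(4, 3)))
rebuilt = np.zeros_like(Z)
for alpha, p, k in decompose(Z):
    rebuilt[:, k] += alpha * p
assert np.allclose(rebuilt, Z)
```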

A.9 Proof of Lemma 3.14

Proof of Lemma 3.14.

Checking for membership in the ball of radius $\rho$ is straightforward, so it suffices to exhibit a membership oracle for the set ${\mathscr{T}}_{f}^{*}({\mathscr{Z}})$. Fix a $\widetilde{\theta}$ and consider the convex function $h(z)=\max_{i\in[d]}\langle v_{i},z\rangle-\langle\widetilde{\theta},z\rangle$. By the definition of ${\mathscr{T}}_{f}^{*}({\mathscr{Z}})$, we have $\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}({\mathscr{Z}})$ if and only if $\min_{z\in{\mathscr{Z}}}h(z)\geq 0$, so it suffices to compute the minimum of $h(z)$ over the convex set ${\mathscr{Z}}$.

As mentioned, to do this it suffices to exhibit a membership oracle for 𝒵{\mathscr{Z}} and an evaluation oracle for h(z)h(z). But now, Lemma 3.12 provides a membership oracle for 𝒵{\mathscr{Z}} and Corollary 3.13 allows us to efficiently evaluate h(z)h(z) for z𝒵z\in{\mathscr{Z}}. ∎
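For intuition, the membership test just described can be prototyped directly. The sketch below is not the cutting-plane procedure of Lemma B.2; it only illustrates the test "$\widetilde{\theta}\in{\mathscr{T}}_{f}^{*}({\mathscr{Z}})$ iff $\min_{z\in{\mathscr{Z}}}h(z)\geq 0$" by running projected subgradient descent on $h$ over a toy choice of ${\mathscr{Z}}$ (the non-negative part of the unit Euclidean ball). The set, step sizes, and tolerance are all assumptions made for illustration.

```python
# Illustrative membership test for the dual set: theta is accepted iff the
# (approximate) minimum of h(z) = max_i <v_i, z> - <theta, z> over a toy
# convex set Z is non-negative.
import numpy as np

def project_toy_Z(z):
    # Euclidean projection onto {z >= 0, ||z||_2 <= 1}: clip, then rescale.
    z = np.maximum(z, 0.0)
    norm = np.linalg.norm(z)
    return z / norm if norm > 1.0 else z

def approx_min_h(V, theta, steps=5000, lr=0.05):
    # V has shape (d, dim); h is convex, so projected subgradient descent works.
    z = project_toy_Z(np.ones(V.shape[1]) / V.shape[1])
    best = np.inf
    for t in range(steps):
        i_star = int(np.argmax(V @ z))        # active linear piece of the max
        subgrad = V[i_star] - theta           # a subgradient of h at z
        z = project_toy_Z(z - lr / np.sqrt(t + 1) * subgrad)
        best = min(best, np.max(V @ z) - theta @ z)
    return best

def in_dual_set(V, theta, tol=1e-3):
    return approx_min_h(V, theta) >= -tol
```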

A.10 Proof of Proposition 4.4

Proof of Proposition 4.4.

We define $y_{t}=-[u(p_{t},\ell),\,0]$ as the noiseless loss and $\tilde{y}_{t}=-[\hat{z}_{t}-c,\,0]$ as the actual loss passed to the OLO algorithm. We bound

maxi[k]{𝗊P,π¯c}\displaystyle\max_{i\in[k]}\{{\mathsf{q}}^{P,\bar{\pi}}\cdot\ell-c\} =d(1Tt=1Tu(𝗊P,πt,),𝒮)\displaystyle=d_{\infty}\left(\frac{1}{T}\sum_{t=1}^{T}u({\mathsf{q}}^{P,\pi_{t}},\ell),{\mathscr{S}}\right)
=maxθΔd+1θ(1Tt=1Tu(𝗊P,πt,)0)\displaystyle=\max_{\theta\in\Delta_{d+1}}\theta\cdot\begin{pmatrix}\frac{1}{T}\sum_{t=1}^{T}u({\mathsf{q}}^{P,\pi_{t}},\ell)\\ 0\end{pmatrix}
=maxθΔd+11Tt=1Tθ(u(𝗊P,πt,)0)\displaystyle=\max_{\theta\in\Delta_{d+1}}\frac{1}{T}\sum_{t=1}^{T}\theta\cdot\begin{pmatrix}u({\mathsf{q}}^{P,\pi_{t}},\ell)\\ 0\end{pmatrix}
maxθΔd+1{1Tt=1Tθ,y~t+ϵ1}\displaystyle\leq\max_{\theta\in\Delta_{d+1}}\left\{-\frac{1}{T}\sum_{t=1}^{T}\langle\theta,\tilde{y}_{t}\rangle+\epsilon_{1}\right\} (def. y~t\tilde{y}_{t} and Est oracle)
=minθΔd+1{1Tt=1Tθ,y~t}+ϵ1\displaystyle=-\min_{\theta\in\Delta_{d+1}}\left\{\frac{1}{T}\sum_{t=1}^{T}\langle\theta,\tilde{y}_{t}\rangle\right\}+\epsilon_{1}
𝖱𝖾𝗀()T1Tt=1Tθt,y~t+ϵ1\displaystyle\leq\frac{\operatorname{\mathsf{Reg}}(\mathcal{F})}{T}-\frac{1}{T}\sum_{t=1}^{T}\langle\theta_{t},\tilde{y}_{t}\rangle+\epsilon_{1}
\displaystyle\leq\frac{\operatorname{\mathsf{Reg}}(\mathcal{F})}{T}-\frac{1}{T}\sum_{t=1}^{T}\langle\theta_{t},y_{t}\rangle+2\epsilon_{1} \qquad\text{(Est oracle)}
\displaystyle\leq\frac{\operatorname{\mathsf{Reg}}(\mathcal{F})}{T}+\frac{1}{T}\sum_{t=1}^{T}\left\langle\theta_{t},\begin{pmatrix}u({\mathsf{q}}^{P,\pi_{t}},\ell)\\ 0\end{pmatrix}\right\rangle+2\epsilon_{1} \qquad\text{(definition of $y_{t}$)}
𝖱𝖾𝗀()T+ϵ0+2ϵ1\displaystyle\leq\frac{\operatorname{\mathsf{Reg}}(\mathcal{F})}{T}+\epsilon_{0}+2\epsilon_{1} (BestResponse oracle)
2Llog(d+1)T+ϵ0+2ϵ1\displaystyle\leq 2L\sqrt{\frac{\log(d+1)}{T}}+\epsilon_{0}+2\epsilon_{1} (Lemma 2.2)

Appendix B General theorems from convex optimization

We will use the following Fenchel duality theorem [Borwein and Zhu, 2005, Rockafellar, 1970]; see also [Mohri et al., 2018].

Theorem B.1 (Fenchel duality).

Let 𝒳{\mathscr{X}} and 𝒴{\mathscr{Y}} be Banach spaces, f:𝒳{+}f\colon{\mathscr{X}}\to\mathbb{R}\cup\{+\infty\} and g:𝒴{+}g\colon{\mathscr{Y}}\to\mathbb{R}\cup\{+\infty\} convex functions and A:𝒳𝒴A\colon{\mathscr{X}}\to{\mathscr{Y}} a bounded linear map. Assume that ff, gg and AA satisfy one of the following conditions:

  • ff and gg are lower semi-continuous and 0core(dom(g)Adom(f))0\in\operatorname*{core}(\operatorname*{dom}(g)-A\operatorname*{dom}(f));

  • Adom(f)cont(g)A\operatorname*{dom}(f)\cap\operatorname*{cont}(g)\neq\emptyset;

then p=dp=d for the dual optimization problems

p\displaystyle p =infx𝒳{f(x)+g(Ax)}\displaystyle=\inf_{x\in{\mathscr{X}}}\left\{f(x)+g(Ax)\right\}
d\displaystyle d =supx𝒴{f(Ax)g(x)}\displaystyle=\sup_{x^{*}\in{\mathscr{Y}}^{*}}\left\{-f^{*}(A^{*}x^{*})-g^{*}(-x^{*})\right\}

and the supremum in the second problem is attained if finite.
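As a concrete illustrative instance (not part of the theorem statement), take ${\mathscr{X}}={\mathscr{Y}}=\mathbb{R}^{k}$, $A=\mathrm{Id}$, $f=\delta_{{\mathscr{S}}}$ the indicator of a non-empty closed convex set ${\mathscr{S}}$, and $g(y)=\|z-y\|$ for a fixed $z$ and a norm $\|\cdot\|$ (condition 2 holds since $g$ is continuous). Then $f^{*}$ is the support function of ${\mathscr{S}}$ and $g^{*}(\theta)=\langle\theta,z\rangle$ on the dual unit ball (and $+\infty$ outside it), so the theorem yields the distance duality

d(z,{\mathscr{S}})\;=\;\inf_{x\in{\mathscr{S}}}\|z-x\|\;=\;\sup_{\|\theta\|_{*}\leq 1}\Bigl\{\langle\theta,z\rangle-\sup_{s\in{\mathscr{S}}}\langle\theta,s\rangle\Bigr\},

which has the same shape as the dual formulations of distances used earlier in the paper; for a (convex, positively homogeneous) pseudonorm in place of $\|\cdot\|$, the same computation replaces the dual unit ball by the set of linear functionals dominated by the pseudonorm.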

When constructing efficient algorithms, we will need an efficient method for minimizing a convex function over a convex set, given only a membership oracle for the set and an evaluation oracle for the function. This is provided by the following lemma.

Lemma B.2.

Let $K$ be a bounded convex subset of $\mathbb{R}^{d}$ and $f\colon K\to[0,1]$ a convex function over $K$. Then, given access to a membership oracle for $K$ (along with an interior point $x_{0}\in K$ satisfying $B(x_{0},r)\subseteq K\subseteq B(x_{0},R)$ for some given radii $r,R>0$) and an evaluation oracle for $f$, there exists an algorithm which computes the minimum value of $f$ over $K$ (to within precision $\varepsilon$) using $\operatorname*{poly}(d,\log(R/r),\log(1/\varepsilon))$ time and queries to these oracles.

Proof.

See Lee et al. [2018]. ∎

Appendix C Bayesian correlated equilibria

Here we provide another application of the \ell_{\infty} approachability framework, to the problem of constructing learning algorithms that converge to correlated equilibria in Bayesian games. Correlated equilibria in Bayesian games (alternatively, “games with incomplete information”) are well-studied throughout the economics and game theory literature; see e.g. [Forges, 1993, Forges et al., 2006, Bergemann and Morris, 2016]. Unlike ordinary correlated equilibria, which are also well-studied from a learning perspective, relatively little is known about algorithms that converge to correlated equilibria in Bayesian games. Hartline et al. [2015] study no-regret learning in Bayesian games, showing that no-regret algorithms converge to a Bayesian coarse correlated equilibrium. More recently, Mansour et al. [2022] introduce a notion of Bayesian swap regret with the property that learners with sublinear Bayesian swap regret converge to correlated equilibria in Bayesian games. Mansour et al. [2022] construct a learning algorithm that achieves low Bayesian swap regret, albeit not a provably efficient one. In this section, we apply our approachability framework to develop the first efficient low-regret algorithm for Bayesian swap regret.

We begin with some preliminaries about standard (non-Bayesian) normal-form games. In a normal-form game $G$ with $N$ players, each player $n$ must choose a mixed action ${\mathbf{x}}_{n}\in\Delta([K])$ (for simplicity we will assume each player has the same number of pure actions). We call the collection of mixed strategies ${\mathbf{x}}=({\mathbf{x}}_{1},{\mathbf{x}}_{2},\dots,{\mathbf{x}}_{N})$ played by all players a strategy profile. We will let $U_{n}({\mathbf{x}})$ denote the utility of player $n$ under strategy profile ${\mathbf{x}}$ (and insist that $U_{n}$ is linear in each player's strategy).

Given a function π:[K][K]\pi\colon[K]\to[K] and a mixed action 𝐱nΔ([K]){\mathbf{x}}_{n}\in\Delta([K]), let π(𝐱n)\pi({\mathbf{x}}_{n}) be the mixed action in Δ([K])\Delta([K]) formed by sampling an action from 𝐱n{\mathbf{x}}_{n} and then applying the function π\pi (i.e., π(𝐱n)i=i|π(i)=i𝐱n,i\pi({\mathbf{x}}_{n})_{i}=\sum_{i^{\prime}|\pi(i^{\prime})=i}{\mathbf{x}}_{n,i^{\prime}}). A correlated equilibrium of GG is a distribution 𝒟\mathcal{D} over strategy profiles 𝐱{\mathbf{x}} such that for any player nn and function π:[K][K]\pi\colon[K]\to[K], it is the case that

𝔼𝐱𝒟[Un(𝐱)]𝔼𝐱𝒟[Un(𝐱)],\operatorname*{\mathbb E}_{{\mathbf{x}}\sim\mathcal{D}}[U_{n}({\mathbf{x}}^{\prime})]\leq\operatorname*{\mathbb E}_{{\mathbf{x}}\sim\mathcal{D}}[U_{n}({\mathbf{x}})],

where 𝐱=(𝐱1,𝐱2,,π(𝐱n),,𝐱N){\mathbf{x}}^{\prime}=({\mathbf{x}}_{1},{\mathbf{x}}_{2},\dots,\pi({\mathbf{x}}_{n}),\dots,{\mathbf{x}}_{N}). Similarly, an ε\varepsilon-correlated equilibrium of GG is a distribution 𝒟\mathcal{D} with the property that

𝔼𝐱𝒟[Un(𝐱)]𝔼𝐱𝒟[Un(𝐱)]+ε,\operatorname*{\mathbb E}_{{\mathbf{x}}\sim\mathcal{D}}[U_{n}({\mathbf{x}}^{\prime})]\leq\operatorname*{\mathbb E}_{{\mathbf{x}}\sim\mathcal{D}}[U_{n}({\mathbf{x}})]+\varepsilon,

for any $n\in[N]$ and $\pi\colon[K]\to[K]$. Correlated equilibria have the following natural interpretation: a mediator samples a strategy profile ${\mathbf{x}}$ from $\mathcal{D}$ and tells each player $n$ a pure action $a_{n}$ randomly sampled from ${\mathbf{x}}_{n}$. If no player can gain by deviating from the action $a_{n}$ they are told, then $\mathcal{D}$ is a correlated equilibrium.
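The mediator interpretation also suggests a direct finite check. The sketch below (not from the paper) tests whether a distribution supported on finitely many pure action profiles is an $\varepsilon$-correlated equilibrium, using the fact that for such a distribution the best deviation $\pi$ can be chosen separately for each recommended action; the data layout (a utility tensor per player and a weighted list of profiles) is an assumption made for illustration.

```python
# Check whether a finitely supported distribution over *pure* action profiles
# is an epsilon-correlated equilibrium (illustrative sketch).
import numpy as np

def is_eps_correlated_eq(U, D, N, K, eps=0.0):
    """U[n]: player n's utility tensor of shape (K,)*N.
    D: list of (weight, profile) pairs, each profile a tuple in [K]^N."""
    for n in range(N):
        # gain[i, j] = expected gain for player n from the deviation
        # "whenever I am told to play i, play j instead"
        gain = np.zeros((K, K))
        for w, profile in D:
            i = profile[n]
            for j in range(K):
                deviated = profile[:n] + (j,) + profile[n + 1:]
                gain[i, j] += w * (U[n][deviated] - U[n][profile])
        # the best swap deviation picks the best j independently for each i
        if gain.max(axis=1).sum() > eps + 1e-12:
            return False
    return True
```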

We define Bayesian games similarly to standard games, with the modification that now each player $n$ also has some private information $c_{n}\in[C]$ drawn from a public distribution $\mathcal{C}_{n}$. We call the vector of realized types ${\mathbf{c}}=(c_{1},c_{2},\dots,c_{N})$ the type profile of the players (drawn randomly from $\mathcal{C}=\mathcal{C}_{1}\times\dots\times\mathcal{C}_{N}$), and now let the utility $U_{n}({\mathbf{x}},{\mathbf{c}})$ of player $n$ depend on both the strategy profile and type profile of the players. Note that we can alternatively think of the strategy of player $n$ as a function ${\mathbf{x}}_{n}\colon[C]\to\Delta([K])$ mapping contexts to mixed actions; in this case we can again treat the expected utility of player $n$ (with the expectation taken over the random type profile) as a multilinear function $U_{n}({\mathbf{x}})$ of the strategy profile ${\mathbf{x}}=({\mathbf{x}}_{1},\dots,{\mathbf{x}}_{N})$.

As with regular correlated equilibria, we can motivate the definition of Bayesian correlated equilibria via the introduction of a mediator. In the Bayesian case, all players begin by revealing their private types to the mediator, and the mediator observes a type profile ${\mathbf{c}}$. The mediator then samples a joint action profile ${\mathbf{x}}$ from a distribution $\mathcal{D}({\mathbf{c}})$ that depends on the observed type profile. Finally, for each player $n$, the mediator samples a pure action $a_{n}\sim{\mathbf{x}}_{n}$ from the mixed strategy for $n$, and relays $a_{n}$ to player $n$ (who is expected to follow it). In order for this to be a Bayesian correlated equilibrium, the following incentive compatibility constraints must be met:

  • Players must have no incentive to deviate from the action $a_{n}$ relayed to them. As in correlated equilibria, this includes deviations of the form “if I am told to play action $i$, I will instead play action $j$”.

  • Players also must have no incentive to misreport their type (thus affecting the distribution 𝒟(𝐜)\mathcal{D}({\mathbf{c}}) over joint strategy profiles).

  • Moreover, no combination of the above two deviations should result in improved utility for a player.

Formally, we define a Bayesian correlated equilibrium for a Bayesian game as follows. The distributions 𝒟(𝐜)\mathcal{D}({\mathbf{c}}) form a Bayesian correlated equilibrium if, for any player n[N]n\in[N], any “type deviation” κ:[C][C]\kappa\colon[C]\to[C], and any collection of “action deviations” πc:[K][K]\pi_{c}\colon[K]\to[K] (for each c[C]c\in[C]),

\operatorname*{\mathbb{E}}_{{\mathbf{c}}\sim\mathcal{C},\,{\mathbf{x}}\sim\mathcal{D}({\mathbf{c}})}[U_{n}({\mathbf{x}}^{\prime},{\mathbf{c}})]\leq\operatorname*{\mathbb{E}}_{{\mathbf{c}}\sim\mathcal{C},\,{\mathbf{x}}\sim\mathcal{D}({\mathbf{c}})}[U_{n}({\mathbf{x}},{\mathbf{c}})]

where 𝐱{\mathbf{x}}^{\prime} is derived from 𝐱{\mathbf{x}} by deviating from 𝐱n{\mathbf{x}}_{n} in the following way: when the player nn has type cc, they first report type c=κ(c)c^{\prime}=\kappa(c) to the mediator; if the mediator then tells them to play action aa, they instead play πc(a)\pi_{c}(a). No such deviation should improve the utility of an agent in a Bayesian correlated equilibrium. Likewise, an ε\varepsilon-Bayesian correlated equilibrium is a collection of distributions 𝒟(𝐜)\mathcal{D}({\mathbf{c}}) where no deviation increases the utility of a player by more than ε\varepsilon.

In order to play a Bayesian game, a learning algorithm must be contextual (using the agent’s private information to decide what action to play). We study the following setting of full-information contextual online learning. As before, there are $K$ actions, but there are now $C$ different contexts (types). Every round $t$, the adversary specifies a loss function $\ell_{t}\in\mathcal{L}=[0,1]^{CK}$, where $\ell_{t,i,c}$ represents the loss from playing action $i$ in context $c$. Simultaneously, the learner specifies a strategy $p_{t}\colon[C]\to\Delta([K])$ mapping each context to a distribution over actions. Overall, the learner incurs expected loss $\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}p_{t}(c)_{i}\ell_{t,i,c}$ this round (here $\mathcal{C}$ is a distribution over contexts; we assume that the learner’s context is drawn i.i.d. from $\mathcal{C}$ each round, and that this distribution $\mathcal{C}$ is publicly known).

Motivated by the deviations considered in Bayesian correlated equilibria, we can define the following notion of swap regret in Bayesian games (“Bayesian swap regret”):

𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀=maxκ,πct=1Tc=1C𝒞[c]i=1K(pt(c)it,i,cpt(κ(c))it,πc(i),c).\operatorname{\mathsf{BayesSwapReg}}=\max_{\kappa,\pi_{c}}\sum_{t=1}^{T}\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right). (15)
Lemma C.1.

Let $\mathcal{A}$ be an algorithm with $\operatorname{\mathsf{BayesSwapReg}}(\mathcal{A})=o(T)$. Assume each player $n$ in a repeated Bayesian game $G$ (over $T$ rounds) runs a copy of algorithm $\mathcal{A}$, and let $p_{t}=(p^{(1)}_{t},p^{(2)}_{t},\dots,p^{(N)}_{t})$ be the strategy profile at time $t$. Then the time-averaged strategy profile $\mathcal{D}({\mathbf{c}})\colon[C]^{N}\to\Delta([K]^{N})$, defined by sampling a $t$ uniformly at random from $[T]$ and returning $p_{t}$, is an $\varepsilon$-Bayesian correlated equilibrium with $\varepsilon=o(1)$.

Proof.

We will show that there exists no deviation for player nn which increases their utility by more than 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀(𝒜)/T=o(1)\operatorname{\mathsf{BayesSwapReg}}(\mathcal{A})/T=o(1).

Fix a type deviation κ:[C][C]\kappa\colon[C]\to[C] and set of action deviations πc:[K][K]\pi_{c}\colon[K]\to[K]. This collection of deviations transforms an arbitrary strategy pΔ([K])Cp\in\Delta([K])^{C} into the strategy pp^{\prime} satisfying p(c)i=iπc(i)=ip(κ(c))ip^{\prime}(c)_{i}=\sum_{i^{\prime}\mid\pi_{c}(i^{\prime})=i}p(\kappa(c))_{i^{\prime}}.

For each $t\in[T]$, with probability $1/T$ the mediator will return the strategy profile $p_{t}=(p^{(1)}_{t},p^{(2)}_{t},\dots,p^{(N)}_{t})$. Now, since $U_{n}$ is multilinear in the strategies of the players, there exists some vector $\ell_{t}$ such that, with the other players’ strategies fixed to those in $p_{t}$, the utility of player $n$ from playing any strategy $p$ is given by the inner product $\langle p,\ell_{t}\rangle$.

In particular, conditioned on the mediator returning $p_{t}$, the difference in utility for player $n$ between playing $p^{(n)}_{t}$ and the strategy $p^{\prime}$ formed by applying the above deviations to $p^{(n)}_{t}$ is exactly

c=1C𝒞[c]i=1K(pt(n)(c)it,i,cpt(n)(κ(c))it,πc(i),c).\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}\left(p^{(n)}_{t}(c)_{i}\ell_{t,i,c}-p^{(n)}_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right).

Taking expectations over all tt, we have that the expected difference in utility by deviating is

1Tt=1Tc=1C𝒞[c]i=1K(pt(n)(c)it,i,cpt(n)(κ(c))it,πc(i),c).\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}\left(p^{(n)}_{t}(c)_{i}\ell_{t,i,c}-p^{(n)}_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right).

But since player nn selected their strategies by playing 𝒜\mathcal{A}, this is at most 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀(𝒜)/T=o(1)\operatorname{\mathsf{BayesSwapReg}}(\mathcal{A})/T=o(1), as desired. ∎

It is possible to phrase (15) in the language of $\ell_{\infty}$-approachability by considering the $C^{C}\cdot K^{KC}$-dimensional vectorial payoff $u(p,\ell)$ given by:

u(p,)κ,πc=c=1C𝒞[c]i=1K(p(c)ii,cp(κ(c))iπc(i),c).u(p,\ell)_{\kappa,\pi_{c}}=\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}\left(p(c)_{i}\ell_{i,c}-p(\kappa(c))_{i}\ell_{\pi_{c}(i),c}\right).

A straightforward computation shows that the negative orthant is separable with respect to uu.

Lemma C.2.

The set 𝒮=(,0]d{\mathscr{S}}=(-\infty,0]^{d} is separable with respect to the vectorial payoff uu.

Proof.

Fix an $\ell\in[0,1]^{CK}$. For each context $c$, let $p(c)=e_{i^{*}(c)}$, where $i^{*}(c)=\operatorname*{argmin}_{j\in[K]}\ell_{j,c}$ (i.e., $p(c)$ is entirely supported on the best action to play in context $c$). Then for any deviation $(\kappa,\{\pi_{c}\})$ and any context $c$, we have $\sum_{i=1}^{K}p(c)_{i}\ell_{i,c}=\min_{j\in[K]}\ell_{j,c}\leq\ell_{\pi_{c}(i^{*}(\kappa(c))),c}=\sum_{i=1}^{K}p(\kappa(c))_{i}\ell_{\pi_{c}(i),c}$. Summing over contexts weighted by $\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]$ shows that every coordinate of $u(p,\ell)$ is non-positive, and therefore $u(p,\ell)\in{\mathscr{S}}$. ∎

This in turn leads (via Theorem 2.3) to a low-regret (albeit computationally inefficient) algorithm for Bayesian swap regret. Instead, as in the case of swap regret, we will apply our pseudonorm approachability framework. First, we show that (15) can be rewritten in a way that allows us to evaluate $\operatorname{\mathsf{BayesSwapReg}}$ efficiently.

Lemma C.3.

We have that

𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀=c=1Cmaxc[C]i=1Kmaxj[K]t=1T𝒞[c](pt(c)it,i,cpt(c)it,j,c).\operatorname{\mathsf{BayesSwapReg}}=\sum_{c=1}^{C}\max_{c^{\prime}\in[C]}\sum_{i=1}^{K}\max_{j\in[K]}\sum_{t=1}^{T}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(c^{\prime})_{i}\ell_{t,j,c}\right). (16)
Proof.

From (15), we have that

𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀\displaystyle\operatorname{\mathsf{BayesSwapReg}} =\displaystyle= maxκ,πct=1Tc=1C𝒞[c]i=1K(pt(c)it,i,cpt(κ(c))it,πc(i),c)\displaystyle\max_{\kappa,\pi_{c}}\sum_{t=1}^{T}\sum_{c=1}^{C}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\sum_{i=1}^{K}\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right)
=\displaystyle= maxκ,πcc=1Ci=1Kt=1T𝒞[c](pt(c)it,i,cpt(κ(c))it,πc(i),c)\displaystyle\max_{\kappa,\pi_{c}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{t=1}^{T}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right)
=\displaystyle= maxκ:[C][C]c=1Cmaxπc:[K][K]i=1Kt=1T𝒞[c](pt(c)it,i,cpt(κ(c))it,πc(i),c)\displaystyle\max_{\kappa:[C]\to[C]}\sum_{c=1}^{C}\max_{\pi_{c}\colon[K]\to[K]}\sum_{i=1}^{K}\sum_{t=1}^{T}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(\kappa(c))_{i}\ell_{t,\pi_{c}(i),c}\right)
=\displaystyle= c=1Cmaxc[C]i=1Kmaxj[K]t=1T𝒞[c](pt(c)it,i,cpt(c)it,j,c).\displaystyle\sum_{c=1}^{C}\max_{c^{\prime}\in[C]}\sum_{i=1}^{K}\max_{j\in[K]}\sum_{t=1}^{T}\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]\left(p_{t}(c)_{i}\ell_{t,i,c}-p_{t}(c^{\prime})_{i}\ell_{t,j,c}\right).
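The rewriting (16) immediately yields a simple procedure for evaluating $\operatorname{\mathsf{BayesSwapReg}}$ from the played strategies and observed losses. The following sketch is not from the paper (the array layout is an assumption made for illustration); it computes (16) directly, with the running time dominated by an $O(TC^{2}K^{2})$ contraction.

```python
# Evaluate Bayesian swap regret via (16) in poly(K, C, T) time.
import numpy as np

def bayes_swap_regret(p, loss, prior):
    """p:     array of shape (T, C, K); p[t, c, i]    = p_t(c)_i
    loss:  array of shape (T, C, K); loss[t, c, j] = ell_{t, j, c}
    prior: array of shape (C,);      prior[c]      = P_C[c]"""
    # on-path term: sum_t p_t(c)_i * ell_{t,i,c}, shape (C, K)
    on_path = np.einsum('tci,tci->ci', p, loss)
    # deviation term: dev[c', c, i, j] = sum_t p_t(c')_i * ell_{t,j,c}
    dev = np.einsum('tai,tcj->acij', p, loss)
    # A[c, c', i, j] = P_C[c] * (on_path[c, i] - dev[c', c, i, j])
    A = prior[:, None, None, None] * (
        on_path[:, None, :, None] - dev.transpose(1, 0, 2, 3))
    # sum over c of (max over c' of (sum over i of (max over j)))
    return A.max(axis=3).sum(axis=2).max(axis=1).sum()
```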

Note that (16) allows us to efficiently (in $\operatorname*{poly}(K,C,T)$ time) evaluate $\operatorname{\mathsf{BayesSwapReg}}$. As mentioned in the main text, directly from Theorems 3.9 and 3.15 this gives us an efficient ($\operatorname*{poly}(C,K)$ time per round) learning algorithm that incurs at most $O(K^{2}C^{2}\sqrt{T})$ Bayesian swap regret. We will now examine the values of $\operatorname*{diam}({\mathscr{P}})$, $\operatorname*{diam}({\mathscr{L}})$, and $D_{z}$ and show that this algorithm actually incurs at most $O(KC\sqrt{T})$ Bayesian swap regret.

First, note that since ${\mathscr{P}}=\Delta([K])^{C}$, every $p\in{\mathscr{P}}$ can be viewed as $C$ tuples of $K$ non-negative numbers, each tuple summing to $1$. Each such $K$-tuple has squared $\ell_{2}$ norm at most $1$, so $\operatorname*{diam}({\mathscr{P}})\leq\sqrt{C}$. Second, since ${\mathscr{L}}=[0,1]^{KC}$, $\operatorname*{diam}({\mathscr{L}})=\sqrt{KC}$. Finally, the nonzero coefficients of each $z_{\kappa,\pi_{c}}$ consist of $2K$ copies of the distribution $\operatorname*{\mathbb{P}}_{\mathcal{C}}[c]$; since a probability distribution has $\ell_{2}$ norm at most $1$, each $z_{\kappa,\pi_{c}}$ has $\ell_{2}$ norm at most $\sqrt{2K}$, so $D_{z}=O(\sqrt{2K})$. Combining these three quantities according to Theorem 3.9, we obtain the following corollary.

Corollary C.4.

There exists an efficient contextual learning algorithm with 𝖡𝖺𝗒𝖾𝗌𝖲𝗐𝖺𝗉𝖱𝖾𝗀=O(CKT)\operatorname{\mathsf{BayesSwapReg}}=O(CK\sqrt{T}).