Pseudonorm Approachability and
Applications to Regret Minimization
Abstract
Blackwell’s celebrated approachability theory provides a general framework for a variety of learning problems, including regret minimization. However, Blackwell’s proof and implicit algorithm measure approachability using the Euclidean (ℓ2) distance. We argue that in many applications such as regret minimization, it is more useful to study approachability under other distance metrics, most commonly the ℓ∞-metric. However, the time and space complexity of the algorithms designed for ℓ∞-approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. Thus, we present a framework for converting high-dimensional ℓ∞-approachability problems to low-dimensional pseudonorm approachability problems, thereby resolving such issues. We first show that the ℓ∞-distance between the average payoff and the approachability set can be equivalently defined as a pseudodistance between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for ℓ2 and other norms, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use this to show, modulo mild normalization assumptions, that there exists an ℓ∞-approachability algorithm whose convergence is independent of the dimension of the original vectorial payoff. We further show that this algorithm admits a polynomial-time implementation, assuming that the original ℓ∞-distance can be computed efficiently. We also give an ℓ∞-approachability algorithm whose convergence is logarithmic in that dimension, using an FTRL algorithm with a maximum-entropy regularizer. Finally, we illustrate the benefits of our framework by applying it to several problems in regret minimization.
1 Introduction
The notion of approachability introduced by Blackwell (1956) can be viewed as an extension of von Neumann’s minimax theorem (von Neumann, 1928) to the case of vectorial payoffs. Blackwell gave a simple example showing that the straightforward analog of von Neumann’s minimax theorem does not hold for vectorial payoffs. However, in contrast with this negative result for one-shot games, he proved that, in repeated games, a player admits an adaptive strategy guaranteeing that their average payoff approaches a closed convex set in the limit, provided that the set satisfies a natural separability condition.
The theory of Blackwell approachability is intimately connected with the field of online learning because the problem of regret minimization can be viewed as an approachability problem: in particular, the learner would like their vector of regrets (with respect to each competing benchmark) to converge to a non-positive vector. In this vein, Abernethy et al. (2011) demonstrated how to use algorithms for approachability to solve a general class of regret minimization problems (and conversely, how to use regret minimization to construct approachability algorithms). However, applying their reduction sometimes leads to suboptimal regret guarantees – for example, for the specific case of minimizing external regret over T rounds with N actions, their reduction results in an algorithm with O(√(NT)) regret (instead of the optimal O(√(T log N)) regret bound achievable by e.g. multiplicative weights).
One reason for this suboptimality is the choice of distance used to define approachability. Both Blackwell and Abernethy et al. consider approachability algorithms that minimize the Euclidean (ℓ2) distance between their average payoff and the desired set. We argue that, for applications to regret minimization, it is often more useful to study approachability under other distance metrics – most commonly, approachability under the ℓ∞ metric to the non-positive orthant, which is well suited to capture the fact that regret is a maximum over various competing benchmarks. This has been observed in several recent publications (Perchet, 2015; Shimkin, 2016; Kwon, 2021). In particular, by constructing algorithms for ℓ∞-approachability, it is possible to naturally recover an O(√(T log N)) external regret learning algorithm (and algorithms with optimal regret guarantees for many other problems of interest).
However, there is still one significant problem with developing regret minimization algorithms via ℓ∞-approachability (or any of the other forms of approachability previously mentioned): the time and space complexity of these algorithms depends polynomially on the dimension d of the space of vectorial payoffs, which in turn equals the number of benchmarks we compete against in our regret minimization problem. In some regret minimization settings, this can be prohibitively expensive. For example, in the setting of swap regret (where the benchmarks are parameterized by the N^N swap functions mapping [N] to [N]), this results in algorithms with complexity exponential in N. On the other hand, there exist algorithms, e.g. (Blum and Mansour, 2007), which are both efficient (poly(N) time and space) and obtain optimal regret guarantees.
1.1 Main results
In this paper, we present a framework for converting high-dimensional ℓ∞-approachability problems to low-dimensional “pseudonorm” approachability problems, in turn resolving many of these issues. To be precise, recall that the setting of approachability can be thought of as a T-round repeated game, where in round t the learner chooses an action x_t from some convex “action” set X ⊆ ℝ^n, the adversary simultaneously chooses an action y_t from some convex “loss” set Y ⊆ ℝ^m, and the learner receives a vector-valued payoff u(x_t, y_t), where u: X × Y → ℝ^d is a d-dimensional bilinear function. The learner would like the ℓ∞ distance between their average payoff (1/T)∑_t u(x_t, y_t) and some convex set S111For simplicity, throughout this paper we assume S to be the negative orthant ℝ^d_{≤0}, as this is the case most relevant to regret minimization. However, most of our results extend straightforwardly to arbitrary convex sets. to be as small as possible.
We first demonstrate how to construct a new (n·m)-dimensional bilinear function Φ, a new convex set S′, and a pseudonorm222In this paper a pseudonorm is a function ψ which satisfies most of the properties of a norm (e.g. positive homogeneity, triangle inequality), but may be asymmetric (there may exist x where ψ(x) ≠ ψ(−x)) and may not be definite (there may exist x ≠ 0 where ψ(x) = 0). Just as a norm defines a distance between x and y via ‖x − y‖, a pseudonorm defines the pseudodistance ψ(x − y). ψ such that the “pseudodistance” between the average modified payoff (1/T)∑_t Φ(x_t, y_t) and S′ is equal to the ℓ∞ distance between the original average payoff and S. Importantly, the new dimension is equal to n·m and is independent of the original dimension d.
We then develop an algorithmic theory of pseudonorm approachability analogous to that developed in (Abernethy et al., 2011) for the ℓ2 norm and in (Shimkin, 2016; Kwon, 2021) for other norms, showing that, in order to perform pseudonorm approachability, it suffices to be able to perform online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball (and that the rate of approachability is directly related to the regret guarantees of this OLO subalgorithm). This has the following consequences for ℓ∞-approachability:
• First, by solving this OLO problem with a quadratically regularized Follow-The-Regularized-Leader (FTRL) algorithm, we show (modulo mild normalization assumptions on the sizes of X, Y, and u) that there exists a pseudonorm approachability algorithm (and hence an ℓ∞-approachability algorithm for the original problem) whose convergence rate is independent of the original dimension d. We additionally provide a stronger bound on the rate, stated in terms of the sizes of X and Y and the maximum norm of the set of vectors formed by taking the coefficients of components of u (Theorem 3.9). In comparison, the best-known generic guarantee for ℓ∞-approachability prior to this work converged at a d-dependent rate of O(√(log(d)/T)).
• Second, we show that as long as we can evaluate the original ℓ∞-distance between the average payoff and S efficiently, we can implement the above algorithm in polynomial time per round (Theorem 3.15). This has the following natural consequence for the class of regret minimization problems that can be written as ℓ∞-approachability problems: if it is possible to efficiently compute some notion of regret for a sequence of losses and actions, then there is an efficient (in the dimensions of the actions and losses) learner that minimizes this regret.
• Finally, in some cases, the approachability rate from (inefficient) ℓ∞-approachability outperforms the rate obtained by the quadratically regularized FTRL algorithm. We define a new regularizer whose value is given by finding the maximum-entropy distribution within a subset of distributions of support size d, and show that, by using this regularizer, we recover this rate. In particular, whenever we can efficiently compute this maxent regularizer, there is an efficient learning algorithm with an O(√(log(d)/T)) approachability rate.
We then apply our framework to various problems in regret minimization:
• We show that our framework straightforwardly recovers a regret-optimal and efficient algorithm for swap regret minimization. Doing so requires computing the above maximum-entropy regularizer for this specific case, where we show that it has a nice closed form. In particular, to our knowledge, this is the first approachability-based algorithm for swap regret that is both efficient and achieves the optimal minimax regret.
• In Section 4.3, we apply our framework to develop the first efficient contextual learning algorithms with low Bayesian swap regret. Such algorithms have the property that if learners employ them in a repeated Bayesian game, the time-average of their strategies will converge to a Bayesian correlated equilibrium, a well-studied equilibrium notion in game theory (see e.g. Bergemann and Morris (2016)).
This notion of Bayesian swap regret was recently introduced by Mansour et al. (2022), who also provided an algorithm with low Bayesian swap regret, albeit one that is not computationally efficient. By applying our framework, we easily obtain an efficient contextual learning algorithm with low Bayesian swap regret, resolving an open question of Mansour et al. (2022) (here and below, M denotes the number of “contexts” / “types” of the learner).
• In Section 4.4, we further analyze the application of our general ℓ∞-approachability theory and algorithm to reinforcement learning (RL) with vectorial losses. We point out how our framework can provide a general solution in the full-information setting with known transition probabilities, and how we can recover the best known solution for standard regret minimization in episodic RL. More importantly, we show how our framework and algorithm can lead to an algorithm for constrained MDPs with a significantly more favorable regret guarantee, logarithmic in the number of constraints d, in contrast with the √d-dependency of the results of Miryoosefi et al. (2019).
1.2 Related Work
There is a wide literature dealing with various aspects of Blackwell’s approachability, including its applications to game theory, regret minimization, reinforcement learning, and multiple extensions.
Hart and Mas-Colell (2000) described an adaptive procedure for players in a game based on Blackwell’s approachability, which guarantees that the empirical distribution of play converges to the set of correlated equilibria. This procedure is related to internal regret minimization, for which, as shown by Foster and Vohra (1999), the existence of an algorithm follows from the proof of Hart and Mas-Colell (2000). Hart and Mas-Colell (2001) further gave a general class of adaptive strategies based on approachability. Approachability has been widely used for calibration (Dawid, 1982); see Foster and Hart (2018) for a recent work on the topic. Approachability and partial monitoring were studied in a series of publications by Perchet (2010); Mannor et al. (2014a, b); Perchet and Quincampoix (2015, 2018); Kwon and Perchet (2017). More recently, approachability has also been used in the analysis of fairness in machine learning (Chzhen et al., 2021).
Approachability has also been extensively used in the context of reinforcement learning. Mannor and Shimkin (2003) discussed an extension of regret minimization in competitive Markov decision processes (MDPs) whose analysis is based on Blackwell’s approachability theory. Mannor and Shimkin (2004) presented a geometric approach to multiple-criteria reinforcement learning formulated as approachability conditions. Kalathil et al. (2014) presented strategies for approachability for MDPs and Stackelberg stochastic games based on Blackwell’s approachability theory. More recently, Miryoosefi et al. (2019) used approachability to derive solutions for reinforcement learning with convex constraints.
The notion of approachability was further extended in several studies. Vieille (1992) used differential games with a fixed duration to study weak approachability in finite dimensional spaces. Spinat (2002) formulated a necessary and sufficient condition for approachability of non-necessary convex sets. Lehrer (2003) extended Blackwell’s approachability theory to infinite-dimensional spaces.
The most closely related work to this paper, which we build upon, is that of Abernethy et al. (2011) who showed that, remarkably, any algorithm for Blackwell’s approachability could be converted into one for online convex optimization and vice-versa. Bernstein and Shimkin (2015) also discussed a related response-based approachability algorithm.
Perchet (2015) presented a specific study of ℓ∞-approachability, for which they gave an exponential weight algorithm. Shimkin (2016, Section 5) studied approachability for an arbitrary norm and gave a general duality result using Sion’s minimax theorem. The pseudonorm duality theorem we prove, using Fenchel duality, can be viewed as a generalization. Kwon (2016, 2021) also presented a duality theorem similar to that of Shimkin (2016), which they used to derive an FTRL algorithm for general norm approachability. They further treated the special case of internal and swap regret. However, unlike the algorithms derived in this work, the computational complexity of their swap regret algorithm is exponential in N. The same holds for Perchet (2015), which also analyzes the swap regret problem.
It is known that if all players follow a swap regret minimization algorithm, then the empirical distribution of their play converges to a correlated equilibrium (Blum and Mansour, 2007). Hazan and Kale (2008) showed a result generalizing this property to the case of Φ-regret and Φ-equilibria, where the Φ-regret is the difference between the cumulative expected loss suffered by the learner and that of the best Φ-modification of the sequence in hindsight. Gordon et al. (2008) further generalized the results of Hazan and Kale (2008) to a more general class of Φ-modification regrets. The algorithms discussed in (Gordon et al., 2008) are distinct from those discussed in this paper (they do not clearly extend to the general approachability setting, and they require significantly different computational assumptions than ours). Nevertheless, they bear some similarity with our work.
2 Preliminaries
Notation.
We use [N] as a shorthand for the set {1, 2, …, N}. We write Δ_N to denote the simplex over N dimensions and Δ̄_N to denote the convex hull of the N-simplex with the origin. conv(S) denotes the convex hull of the points in S, and cone(S) the convex cone generated by the points in S.
Some of the more standard proofs have been deferred to Appendix A.
2.1 Blackwell approachability and regret minimization
We begin by illustrating the theory of Blackwell approachability for the specific case of the ℓ∞-distance; this case is both particularly suited to the application of regret minimization and will play an important role in the results (e.g. reductions to pseudonorm approachability) that follow.
We consider a repeated game setting, where every round t a learner chooses an action x_t belonging to a bounded333We bound the entries of X, Y, and u for convenience, but it is generally easy to translate between different boundedness assumptions (since almost all relevant quantities are linear). We express the majority of our theorem statements (with the notable exception of Theorem 3.9) in a way that is independent of the choice of bounds. convex set X ⊆ ℝ^n, and an adversary simultaneously chooses a loss y_t belonging to a bounded convex set Y ⊆ ℝ^m. Let u: X × Y → ℝ^d be a bounded bilinear444We briefly note that all our results also hold for biaffine functions; in particular, extending the loss and action sets slightly (by replacing X and Y with X × {1} and Y × {1}) allows us to write any biaffine function over the original sets as a bilinear function over the extended sets. vector-valued payoff function, and let S ⊆ ℝ^d be a closed convex set with the property that for every y ∈ Y, there exists an x ∈ X such that u(x, y) ∈ S (we say that such a set S is “separable”). When d = 1, the minimax theorem implies that there exists a single x ∈ X such that u(x, y) ∈ S for all y ∈ Y.
This is not true for d > 1, but the theory of Blackwell approachability provides the following algorithmic analogue of this statement. Define a learning algorithm A to be a collection of functions A_t for each t ∈ [T], where A_t: Y^{t−1} → X describes how the learner decides their action x_t as a function of the observed losses up until time t. Blackwell approachability guarantees that there exists a learning algorithm A such that, when A is run on any loss sequence y_1, …, y_T, the resulting action sequence x_1, …, x_T has the property that:
d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), S ) → 0 as T → ∞.   (1)
(Here, for v ∈ ℝ^d, d_∞(v, S) = inf_{s ∈ S} ‖v − s‖_∞ represents the ℓ∞ distance between v and S.)
As mentioned, one of the main motivations for studying Blackwell approachability is its connections to regret minimization. In particular, for a fixed choice of X, Y, and u, define
Reg(x_{1:T}, y_{1:T}) := max( 0, max_{i ∈ [d]} ∑_{t=1}^T u_i(x_t, y_t) ).   (2)
Note first that this definition of “regret” is exactly T times the ℓ∞ approachability distance in the case where S is the negative orthant; that is,
Reg(x_{1:T}, y_{1:T}) = T · d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), ℝ^d_{≤0} ).   (3)
But secondly, note that by choosing X, Y, and u carefully, this definition can capture a wide variety of forms of regret studied in regret minimization. For example:
• When X = Δ_N, Y = [0, 1]^N, d = N, and u_i(x, y) = ⟨x, y⟩ − y_i for each i ∈ [N], Reg is the external regret of playing action sequence x_{1:T} against loss sequence y_{1:T}; i.e., it measures the regret compared to the best single action.
• When X = Δ_N, Y = [0, 1]^N, d = N^N, and u_π(x, y) = ⟨x, y⟩ − ∑_{i=1}^N x_i y_{π(i)} for each swap function π: [N] → [N], Reg is the swap regret of playing action sequence x_{1:T} against loss sequence y_{1:T}; i.e., it measures the regret compared to the best action sequence obtained by applying a fixed swap function π to the sequence x_{1:T} (both notions are illustrated in the short computational sketch after this list).
• When X is a convex polytope, Y is a bounded set of linear losses, d = |V| (where V is the set of vertices of X), and u_v(x, y) = ⟨x, y⟩ − ⟨v, y⟩ for each vertex v ∈ V, this captures the (external) regret from performing online linear optimization over the polytope X (see Section 2.2).
• Finally, to illustrate the power of this framework, we present an unusual swap regret minimization problem that we call “Procrustean swap regret minimization” (after the orthogonal Procrustes problem, see (Gower and Dijksterhuis, 2004)). Let X be the unit ball in n dimensions, let Y = X, and, for each orthogonal matrix555Technically, this leads to an infinite-dimensional u (since the group of orthogonal matrices is infinite), but one can instead take an arbitrarily fine discrete approximation of the set. Indeed, one of the advantages of the results we present is that they are largely independent of the dimension d of u. W, let u_W(x, y) = ⟨x, y⟩ − ⟨W x, y⟩.
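Both the external and swap regret benchmarks above can be evaluated directly from a play history; the following minimal sketch (ours, using standard formulas rather than anything specific to this paper) computes them for simplex actions and loss vectors in [0, 1]^N, exploiting the fact that the best swap function decomposes coordinate-wise.

```python
import numpy as np

def external_regret(xs, ys):
    """Regret of the action sequence xs (rows in the simplex) against
    losses ys, compared to the best fixed action in hindsight."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    learner_loss = np.sum(xs * ys)              # sum_t <x_t, y_t>
    best_fixed = np.sum(ys, axis=0).min()       # min_i sum_t y_{t,i}
    return learner_loss - best_fixed

def swap_regret(xs, ys):
    """Regret against the best fixed swap function pi: [N] -> [N].
    The best pi decomposes across actions, so no enumeration of the
    N^N swap functions is needed."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    # A[i, j] = sum_t x_{t,i} * y_{t,j}: loss attributed to action i
    # if every play of i were replaced by action j.
    A = xs.T @ ys
    learner_loss = np.trace(A)                  # pi = identity
    best_swapped = A.min(axis=1).sum()          # choose pi(i) independently
    return learner_loss - best_swapped

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 1000, 5
    ys = rng.random((T, N))
    xs = rng.dirichlet(np.ones(N), size=T)
    print("external regret:", external_regret(xs, ys))
    print("swap regret:    ", swap_regret(xs, ys))
    # Swap regret always dominates external regret (constant swaps are a special case).
    assert swap_regret(xs, ys) >= external_regret(xs, ys) - 1e-9
```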
When the negative orthant ℝ^d_{≤0} is separable with respect to u, which is true in all of the above examples, the theory of Blackwell approachability immediately guarantees the existence of a sublinear regret learning algorithm for the corresponding notion of regret. Specifically, define the regret of a learning algorithm A to be the worst-case regret over all possible loss sequences y_1, …, y_T; i.e.,
Reg(A, T) := max_{y_1, …, y_T ∈ Y} Reg(x_{1:T}, y_{1:T}),   (4)
where x_t = A_t(y_1, …, y_{t−1}). Then (1) implies that the same algorithm satisfies Reg(A, T) = o(T). Motivated by this application, we will restrict our attention for the remainder of this paper to the setting where S = ℝ^d_{≤0} and will assume (unless otherwise specified) that this set is separable with respect to the bilinear function u we consider.
In fact, the theory of ℓ∞-approachability is constructive and allows us to write down explicit algorithms along with explicit (and in many cases, near-optimal) regret bounds. However, before we introduce these algorithms, we will need to introduce the problem of online linear optimization.
2.2 Online linear optimization
Here we discuss algorithms for online linear optimization (OLO), a special case of online convex optimization where all the loss functions are linear functions of the learner’s action. These will form an important primitive of our algorithms for approachability.
Let be two bounded convex subsets of and consider the following learning problem. Every round (for rounds) our learning algorithm must choose an element as a function of . The adversary simultaneously chooses a loss “function” . This causes the learner to incur a loss of in round . The goal of the learner is to minimize their regret, defined as the difference between their total loss and the loss they would have incurred by playing the best fixed action in hindsight, i.e.,
There are many similarities between this problem and the approachability and regret minimization problems discussed in Section 2.1. For example, if we take and , then OLO is equivalent to the problem of external regret minimization. However, not all regret minimization problems can be written directly as an instance of OLO – for example, there is no clear way to write swap regret minimization as an OLO instance. Eventually we will demonstrate how to apply OLO as a subroutine to solve any regret minimization problem, but this will involve a reduction to Blackwell approachability and will require running OLO on different spaces than the action/loss sets and directly (which is why we distinguish the action/loss sets for OLO as and respectively).
There is an important subclass of algorithms for OLO known as Follow the Regularized Leader (FTRL) algorithms (Shalev-Shwartz, 2007; Abernethy et al., 2008). An FTRL algorithm is completely specified by a strongly convex function R (the regularizer), and plays the action
z_t = argmin_z ( ∑_{s=1}^{t−1} ⟨f_s, z⟩ + R(z) ),
where the minimum is taken over the OLO decision set and f_s denotes the loss vector observed in round s.
In words, this algorithm plays the action that minimizes the total loss on the rounds until the present (“following the leader”), subject to an additional regularization term. It is possible to characterize the worst-case regret of an FTRL algorithm in terms of properties of the regularizer and the sets and (see e.g. Theorem 15 of Hazan (2016)). For our purposes, we will only need the following two results for specific regularizers.
Lemma 2.1 (Quadratic regularizer, (Zinkevich, 2003)).
Let and . Let be an arbitrary element of , and let . Then, the FTRL algorithm with regularizer incurs worst-case regret at most .
Lemma 2.2 (Negative entropy regularizer, (Kivinen and Warmuth, 1995)).
Let , , and let (where we extend this to the boundary of by letting ). Then, the FTRL algorithm with regularizer incurs worst-case regret at most .
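As an illustration of the two regularizers in Lemmas 2.1 and 2.2, here is a minimal self-contained FTRL sketch; it is generic code under illustrative assumptions (a Euclidean-ball domain for the quadratic regularizer, the simplex for the entropy regularizer, and textbook step sizes), not an implementation taken from this paper.

```python
import numpy as np

def ftrl_quadratic(loss_vectors, eta, project):
    """FTRL with the quadratic regularizer (1/(2*eta)) * ||z||^2.
    The minimizer of <cumulative loss, z> + (1/(2*eta))||z||^2 over a convex
    set is the Euclidean projection of -eta * (cumulative loss) onto the set."""
    cum = np.zeros(len(loss_vectors[0]))
    plays = []
    for f in loss_vectors:
        plays.append(project(-eta * cum))
        cum += f
    return plays

def ftrl_entropy(loss_vectors, eta):
    """FTRL on the simplex with the negative entropy regularizer
    (1/eta) * sum_i z_i log z_i; the closed form is multiplicative weights."""
    cum = np.zeros(len(loss_vectors[0]))
    plays = []
    for f in loss_vectors:
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        plays.append(w / w.sum())
        cum += f
    return plays

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T, d = 2000, 10
    losses = list(rng.random((T, d)))
    ball_proj = lambda z: z / max(1.0, np.linalg.norm(z))   # unit-ball domain
    quad_plays = ftrl_quadratic(losses, eta=1.0 / np.sqrt(T), project=ball_proj)
    ent_plays = ftrl_entropy(losses, eta=np.sqrt(np.log(d) / T))
    cum = np.sum(losses, axis=0)
    for name, plays, best in [("quadratic / ball", quad_plays, -np.linalg.norm(cum)),
                              ("entropy / simplex", ent_plays, cum.min())]:
        total = sum(float(np.dot(f, z)) for f, z in zip(losses, plays))
        print(f"{name}: regret = {total - best:.2f}")
```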
2.3 Algorithms for ℓ∞-approachability
We can now write down an explicit description of our algorithm for ℓ∞-approachability (in terms of a blackbox OLO algorithm) and get quantitative bounds on the rate of convergence in the LHS of (1). Let O be an OLO algorithm for the appropriate decision and loss sets (essentially, OLO over the simplex Δ_d against the payoff vectors; see the footnote to Corollary 2.4). Then we can describe our algorithm for ℓ∞-approachability as Algorithm 1.
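Since Algorithm 1 itself is displayed as a figure, the following sketch shows one possible rendering of the OLO-based reduction it describes: an OLO iterate θ_t over the simplex, an action x_t chosen so that ⟨θ_t, u(x_t, y)⟩ ≤ 0 for every y (such an x_t exists by separability; here it is supplied by a caller-provided oracle), and the observed payoff fed back to the OLO subroutine. The multiplicative-weights subroutine, the step size, and the oracle interface are our assumptions, not a verbatim transcription.

```python
import numpy as np

def linf_approach(u, halfspace_best_response, loss_sequence, d, eta):
    """Sketch of the OLO-based l_infinity-approachability loop.

    u(x, y) -> payoff vector in R^d.
    halfspace_best_response(theta) -> an action x with <theta, u(x, y)> <= 0
        for every y (exists by separability); supplied by the caller.
    The OLO subroutine over the simplex is multiplicative weights."""
    cum = np.zeros(d)                          # cumulative payoff fed to the OLO
    payoffs = []
    for y in loss_sequence:
        w = np.exp(eta * (cum - cum.max()))    # maximize <theta, cumulative payoff>
        theta = w / w.sum()                    # OLO iterate in the simplex
        x = halfspace_best_response(theta)     # force <theta, u(x, .)> <= 0
        v = u(x, y)
        payoffs.append(v)
        cum += v
    avg = np.mean(payoffs, axis=0)
    return max(0.0, avg.max())                 # l_inf distance to negative orthant

if __name__ == "__main__":
    # External regret instance: u_i(x, y) = <x, y> - y_i, x in the simplex.
    N = 4
    u = lambda x, y: np.dot(x, y) - y
    # For theta in the simplex, playing x = theta gives
    # <theta, u(theta, y)> = <theta, y> - <theta, y> = 0 for every y.
    best_response = lambda theta: theta
    rng = np.random.default_rng(2)
    T = 5000
    losses = rng.random((T, N))
    dist = linf_approach(u, best_response, losses, d=N, eta=np.sqrt(np.log(N) / T))
    print("distance of average payoff to the negative orthant:", dist)
```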
It turns out that we can relate the approachability distance (and hence the regret of this algorithm) to the regret of our OLO algorithm O.
Theorem 2.3.
We have that
If we let be the negative entropy FTRL algorithm (Lemma 2.2)666There is a slight technical difference between the sets and in Algorithm 1 and the sets and in Lemma 2.2. However, note that if we map to and to , we preserve the inner product of and up to an additive constant, which disappears when computing regret, and a factor of ., we obtain the following regret guarantee for .
Corollary 2.4.
For any bilinear regret function u, there exists a regret minimization algorithm with worst-case regret O(√(T log d)).
Equivalently, Corollary 2.4 can be interpreted as saying that there exists an ℓ∞-approachability algorithm which approaches the negative orthant at an average rate of O(√(log(d)/T)). In general, (3) lets us straightforwardly convert between results for ℓ∞-approachability and results for regret minimization. Throughout the remainder of the paper we will primarily phrase our results in terms of regret, but switch between the two quantities of interest when convenient.
3 Main Results
Corollary 2.4 already leads to a number of impressive consequences. For example, when applied to the problem of swap-regret minimization (where u has dimension d = N^N), it leads to a learning algorithm with O(√(T N log N)) regret, matching the best known regret bounds for this problem (Blum and Mansour, 2007). However, the algorithm we obtain in this way has two unfortunate properties.
First, since u is d-dimensional, implementing the algorithm as written above requires poly(d) time and space (even storing a single payoff vector u(x_t, y_t) or dual vector θ_t requires Ω(d) space). This is fine if d is small, but in many of our applications d is much (e.g., exponentially) larger than the dimensions n and m of the loss and action sets. For example, for swap regret we have n = m = N but d = N^N. Although Corollary 2.4 gives us an optimal swap regret algorithm, it takes exponential time / space to implement (in contrast to other known swap regret algorithms, such as that of Blum and Mansour (2007)).
Secondly, although Corollary 2.4 has only a logarithmic dependence on d, sometimes even this may be too large (for example, when we want to compete against an uncountable set of benchmarks). In such cases, we would ideally like a regret bound that depends on the action and loss sets but not directly on d.
In the following subsections, we will demonstrate a framework for regret minimization that allows us to achieve both of these goals (under some fairly light computational assumptions on u).
3.1 Approachability for pseudonorms
In Section 2.3, we described the theory of Blackwell approachability for the distance defined by the ℓ∞ norm. We begin here by describing a generalization of this approachability theory to functions we refer to as pseudonorms and pseudodistances. A function ψ: ℝ^K → ℝ is a pseudonorm if ψ(x) ≥ 0 for all x, ψ is positively homogeneous (ψ(λx) = λψ(x) for all λ ≥ 0 and all x), and ψ satisfies the triangle inequality (ψ(x + y) ≤ ψ(x) + ψ(y) for all x, y). Note that unlike norms, pseudonorms may not satisfy definiteness and are not necessarily symmetric; it may be the case that ψ(x) ≠ ψ(−x). However, all norms are pseudonorms. Note also that, by positive homogeneity and the triangle inequality, a pseudonorm is a convex function. A pseudonorm defines a pseudodistance function via d_ψ(x, S) = inf_{s ∈ S} ψ(x − s).
In order to effectively work with pseudodistances, it will be useful to define the dual set associated to ψ as follows: B_ψ := {θ ∈ ℝ^K : ⟨θ, x⟩ ≤ ψ(x) for all x}. This coincides with the traditional notion of duality in convex analysis; for example, when ψ is a norm, B_ψ coincides with the dual ball of radius one (e.g., when ψ is the d-dimensional ℓ∞ norm, B_ψ is the d-dimensional ℓ1-ball). The following theorem relates the pseudodistance between a point and a convex set to a convex optimization problem over the dual set.
Theorem 3.1.
For any closed convex set , the following equality holds for any :
Proof.
We adopt the standard definition and notation in optimization for an indicator function of a set : for any , if is in , otherwise. Define by for all and set .
By definition, the conjugate function of is defined by: . Now, if is in , then we have . Thus, since , the supremum in the definition of is achieved for and . Otherwise, if , there exists such that . For that , for any , by the positive homogeneity of , we have . Taking the limit , this shows that . Thus, we have .
By definition, the conjugate function is defined for all by , which can also be derived from the conjugate function calculus in Table B.1 of (Mohri et al., 2018).
It is also known that the conjugate function of the indicator function is defined by (Boyd and Vandenberghe, 2014). Since and , we have . Thus, for any convex and bounded set in , by Fenchel duality (Theorem B.1, Appendix B), we can write:
[a chain of equalities justified, in order, by the definition of the conjugate function, the Fenchel duality theorem, the definition of the conjugate of the indicator function, and a change of variable]
This completes the proof. ∎
We will primarily be concerned with the case where S is a convex cone. A convex cone is a set C such that if x ∈ C, then λx ∈ C for all λ ≥ 0. In this case, Theorem 3.1 can be simplified to write d_ψ(x, S) as a linear optimization problem over the intersection of the dual set and the polar cone of S.
Corollary 3.2.
Let S be a convex cone, and let S° = {θ : ⟨θ, s⟩ ≤ 0 for all s ∈ S} be the polar cone of S. Fix a K-dimensional pseudonorm ψ and let B_ψ be its dual set. Then for any x, d_ψ(x, S) = max_{θ ∈ B_ψ ∩ S°} ⟨θ, x⟩.
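As a sanity check of Corollary 3.2 in the familiar ℓ∞ case (where the dual set is the ℓ1 ball and the polar cone of the negative orthant is the positive orthant), the following small sketch compares the direct distance computation with the dual linear program; this is a numerical illustration of standard facts, not part of the proof.

```python
import numpy as np
from scipy.optimize import linprog

def linf_dist_to_negative_orthant(x):
    """Direct computation: d_inf(x, R^d_{<=0}) = max(0, max_i x_i)."""
    return max(0.0, float(np.max(x)))

def dual_formulation(x):
    """Corollary 3.2 with psi = l_inf and S = negative orthant:
    maximize <theta, x> over theta >= 0 with sum(theta) <= 1
    (the l_1 unit ball intersected with the positive orthant)."""
    d = len(x)
    res = linprog(c=-np.asarray(x),                  # linprog minimizes
                  A_ub=np.ones((1, d)), b_ub=[1.0],
                  bounds=[(0, None)] * d, method="highs")
    return -res.fun

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    for _ in range(5):
        x = rng.normal(size=6)
        assert abs(linf_dist_to_negative_orthant(x) - dual_formulation(x)) < 1e-7
    print("primal distance and dual linear program agree")
```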
3.2 From high-dimensional ℓ∞-approachability to low-dimensional pseudonorm approachability
In Section 2.3, we expressed the regret for a general regret minimization problem as the distance
Reg(x_{1:T}, y_{1:T}) = T · d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), ℝ^d_{≤0} ).   (5)
In this section, we will demonstrate how to rewrite this in terms of a lower-dimensional pseudodistance.
Consider the bilinear “basis map” Φ: X × Y → ℝ^{nm} given by Φ(x, y)_{jk} = x_j y_k. Note that every bilinear function u_i: X × Y → ℝ can be written in the form u_i(x, y) = ⟨w_i, Φ(x, y)⟩ for some vector w_i ∈ ℝ^{nm} (i.e., the monomials x_j y_k in Φ form a basis for the set of biaffine functions on X × Y)777If it is helpful, one can think of Φ as the natural map from X × Y to the tensor product ℝ^n ⊗ ℝ^m. The vectors w_i can then be thought of as elements of the dual space, each representing a linear functional on ℝ^n ⊗ ℝ^m.
Let w_i ∈ ℝ^{nm} be the vector of coefficients corresponding to the i-th component u_i of u, and consider the function ψ(s) = max_{i ∈ [d]} max(⟨w_i, s⟩, 0). Note that ψ is a pseudonorm on ℝ^{nm}; indeed, it is straightforward to see that ψ is non-negative, positively homogeneous, and satisfies the triangle inequality. In addition, let S′ be the convex cone defined by S′ := {s ∈ ℝ^{nm} : ⟨w_i, s⟩ ≤ 0 for all i ∈ [d]}. We claim that we can rewrite the distance in (5) in the following way.
Theorem 3.3.
We have that
d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), ℝ^d_{≤0} ) = d_ψ( (1/T) ∑_{t=1}^T Φ(x_t, y_t), S′ ).   (6)
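To make the basis map and the coefficient vectors w_i concrete, the following sketch instantiates them for the external regret example and checks both the identity u_i(x, y) = ⟨w_i, Φ(x, y)⟩ and the distance equality of Theorem 3.3 numerically; the explicit form of ψ used in the final check (a clamped maximum over the w_i) is our reading of the construction and should be treated as an assumption.

```python
import numpy as np

N = 3                                   # actions; external regret example

def Phi(x, y):
    """Basis map: the outer product x y^T, flattened to an N*N vector."""
    return np.outer(x, y).ravel()

# Coefficient vectors with u_i(x, y) = <x, y> - y_i = <w_i, Phi(x, y)>
# (using sum_j x_j = 1 on the simplex): (w_i)_{jk} = [k == j] - [k == i].
W = np.array([(np.eye(N) - np.outer(np.ones(N), np.eye(N)[i])).ravel()
              for i in range(N)])

def u(x, y):
    return np.dot(x, y) - y             # external regret payoff vector

rng = np.random.default_rng(4)
for _ in range(5):
    x, y = rng.dirichlet(np.ones(N)), rng.random(N)
    assert np.allclose(u(x, y), W @ Phi(x, y))

# Distance check (sketch): the l_inf distance of the average payoff to the
# negative orthant versus max_i max(<w_i, avg Phi>, 0), our assumed form of psi.
T = 200
xs = rng.dirichlet(np.ones(N), size=T)
ys = rng.random((T, N))
avg_u = np.mean([u(x, y) for x, y in zip(xs, ys)], axis=0)
avg_phi = np.mean([Phi(x, y) for x, y in zip(xs, ys)], axis=0)
lhs = max(0.0, avg_u.max())
rhs = max(0.0, (W @ avg_phi).max())
assert abs(lhs - rhs) < 1e-12
print("l_inf distance:", lhs, "pseudodistance:", rhs)
```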
The dual set associated to this pseudonorm is given by Θ := {θ ∈ ℝ^{nm} : ⟨θ, s⟩ ≤ ψ(s) for all s ∈ ℝ^{nm}}.
The following two properties of this dual set will be useful in the sections that follow. First, we show that we can alternately think of Θ as the convex hull of the w_i.
Lemma 3.4.
The dual set Θ coincides with the convex hull of the w_i’s.
Secondly, we show that the dual set is contained within the polar cone of .
Lemma 3.5.
We have that Θ ⊆ (S′)°, where (S′)° := {θ : ⟨θ, s⟩ ≤ 0 for all s ∈ S′} is the polar cone of S′.
Proof.
By Lemma 3.4, it suffices to show that each w_i ∈ (S′)°, i.e., for each i, that ⟨w_i, s⟩ ≤ 0 for all s ∈ S′. However, this immediately follows from the definition of S′, since each s ∈ S′ satisfies ⟨w_i, s⟩ ≤ 0 for all i. (An equivalent way of thinking about this is that S′ is already the polar cone of the cone generated by the w_i, so the w_i must lie in the polar cone to S′.) ∎
This allows us to simplify Corollary 3.2 even further.
Corollary 3.6.
For this convex cone S′ and pseudonorm ψ, we have that for any s ∈ ℝ^{nm}, d_ψ(s, S′) = max_{θ ∈ Θ} ⟨θ, s⟩.
Finally, we prove the following “separability” conditions for this approachability problem.
Lemma 3.7.
The following two statements are true.
1. For any y ∈ Y, there exists an x ∈ X such that Φ(x, y) ∈ S′.
2. For any θ ∈ Θ, there exists an x ∈ X such that ⟨θ, Φ(x, y)⟩ ≤ 0 for all y ∈ Y.
3.3 Algorithms for pseudonorm approachability
We now present an algorithm for pseudonorm approachability in the setting of Section 3.2 (i.e., for the bilinear function Φ and the convex cone S′). Just as the algorithm in Section 2.3 for ℓ∞-approachability required an OLO algorithm for the simplex, this algorithm will assume we have access to an OLO algorithm O for the decision set Θ and the loss set given by the convex hull of the points Φ(x, y) for x ∈ X and y ∈ Y.
Our algorithm is summarized above in Algorithm 2. We now have the following analogue of Theorem 2.3.
Theorem 3.8.
The following guarantee holds for the pseudonorm approachability algorithm :
Proof.
The first equality follows as a direct consequence of Theorem 3.3 (which proves that the distance to is equal to the analogous distance to ) and Theorem 2.3 (which shows that this distance is equal to the regret of our regret minimization problem). It therefore suffices to prove the second equality.
Note that
Here, the first equality holds as a consequence of Corollary 3.6, and the last inequality holds since (by the choice of x_t in step 2a) ⟨θ_t, Φ(x_t, y_t)⟩ ≤ 0 for all t. ∎
If we choose to be FTRL with a quadratic regularizer, Lemma 2.1 implies the following result.
Theorem 3.9.
Let be a bilinear regret function. Then there exists a regret minimization algorithm for with regret
where . If we let
then this regret bound further satisfies
Proof.
To see the first result, note that Lemma 2.1 directly implies a bound of , where and . We’ll now proceed by simplifying and . First, for , recall from Lemma 3.4 that is the convex hull of the vectors , so . Second, for , note that . But , so . Combining these, we obtain the first inequality.
The second result directly follows from the following three facts: i. (since , ii. (since ), and iii. . ∎
Theorem 3.9 shows that, in settings where these norm bounds are constant (which is true in all the settings we consider), there exists an algorithm whose regret is O(√T); notably, this bound does not depend on the dimension d, which in many cases can be thought of as the number of benchmarks of comparison for our learning algorithm.
In the following two sections, we will show how to strengthen this result in two different ways. First, we will show that (modulo some fairly mild computational assumptions) it is possible to efficiently implement the algorithm of Theorem 3.9. Second, we will show that by using a different choice of regularizer, we can recover exactly the regret obtained in Corollary 2.4.
3.4 Efficient algorithms for pseudonorm approachability
3.4.1 Computational assumptions
In this section we will discuss how to transform the algorithm in Theorem 3.9 into a computationally efficient algorithm. Note that without any constraints on , , and or how they are specified, performing any sort of regret minimization efficiently is a hopeless task (e.g., consider the case where it is computationally hard to even determine membership in ). We’ll therefore make the following three structural / computational assumptions on , , and .
First, we will restrict our attention to cases where the loss set Y is “orthant-generating”. A convex subset Y of ℝ^m is orthant-generating if Y is contained in the non-negative orthant and, for each i ∈ [m], there exists a λ_i > 0 such that λ_i e_i ∈ Y (as part of this assumption, we will also assume we have access to the values λ_i). Note that many common choices for Y (e.g., the hypercube, the simplex, intersections of other norm balls with the positive orthant) are all orthant-generating.
Second, we will assume we have an efficient separation oracle for the action set X; that is, an oracle which takes a point x and outputs (in polynomial time) either that x ∈ X or a hyperplane separating x from X.
Finally, we will assume we have access to what we call an efficient regret oracle for u. Given a collection of action/loss pairs (x_1, y_1), …, (x_k, y_k) and positive constants c_1, …, c_k, an efficient regret oracle can compute (in time polynomial in n, m, and k) the value of max_{i ∈ [d]} ∑_{j=1}^k c_j u_i(x_j, y_j). This can be thought of as evaluating the regret for a pair of action and loss sequences that take on the action/loss pair (x_j, y_j) a c_j fraction of the time. At a higher level, having access to an efficient regret oracle means that a learner can efficiently compute their overall regret at the end of T rounds (it is hard to imagine how one could efficiently minimize regret without being able to compute it).
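For concreteness, here is what such a regret oracle looks like for swap regret: a short sketch (ours) that evaluates the weighted swap regret of a collection of action/loss pairs in time polynomial in N and the number of pairs, even though the maximum ranges over N^N swap functions.

```python
import numpy as np

def swap_regret_oracle(pairs, weights):
    """Sketch of an 'efficient regret oracle' for swap regret: given
    action/loss pairs (x_j, y_j) and nonnegative weights c_j, return
        max over swap functions pi of  sum_j c_j * u_pi(x_j, y_j),
    where u_pi(x, y) = <x, y> - sum_i x_i y_{pi(i)}.
    The maximum decomposes across rows of the weighted co-occurrence
    matrix, so the cost is O(k * N^2) rather than O(N^N)."""
    A = sum(c * np.outer(x, y) for (x, y), c in zip(pairs, weights))
    return float(np.trace(A) - A.min(axis=1).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    N, k = 6, 10
    pairs = [(rng.dirichlet(np.ones(N)), rng.random(N)) for _ in range(k)]
    weights = rng.random(k)
    print("weighted swap regret:", swap_regret_oracle(pairs, weights))
```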
3.4.2 Extending the dual set
One of the ingredients we will need to implement algorithm is a membership oracle for the dual set (e.g. in order to perform OLO over this dual set). To check whether , it suffices to check whether for all , so in turn, it will be useful to be able to compute the function for any .
Computing this maximum is very similar888In fact, in almost all practical cases where we have an efficient regret oracle, it is possible by a similar computation to directly compute this maximum. Here we describe an approach that works in a blackbox way given only a strict regret oracle – if you accept the existence of an oracle that can optimize linear functions over the w_i, you can skip this subsection. to what is provided by our regret oracle: note that, by writing a point s in the form ∑_j c_j Φ(x_j, y_j), we can think of the regret oracle as providing the value of:
If it were possible to write any in the form , we would be done. However, this may not be possible: in particular, must lie in the convex cone generated by all points of the form .
We will therefore briefly generalize the theory of Section 3.1 to cases where we are only able to optimize over an extension of the dual set. Given a convex cone , let . Note that (since in , must hold for all in the ambient space). The following lemma shows that if is in , to maximize a linear function over it suffices to maximize it over .
Lemma 3.10.
Let be an arbitrary convex cone. Then if , the following equalities hold:
Now, consider the specific convex cone . Given Lemma 3.10, it is straightforward to check that Theorem 3.8 continues to hold even if the domain of the OLO algorithm is set to instead of . In particular, the first equality in the proof of Theorem 3.8 is still true when is replaced by , since .
There is one other issue we must deal with: it is possible that is significantly larger than , and therefore an OLO algorithm with domain might incur more regret than that of . In fact, there are cases where the set is unbounded. Nonetheless, the following lemma shows that for OLO with a quadratic regularizer, we will never encounter very large values of .
Lemma 3.11.
Let , and fix a and . Let
Then .
3.4.3 Constructing a membership oracle
We will now demonstrate how to use our regret oracle to construct a membership oracle for the expanded set defined in the previous section. We first show that it is possible to check for membership in (and further, when , write as a convex combination of the generators of ).
Lemma 3.12.
Given a point , we can efficiently check whether . If , we can also efficiently write in the form (for an explicit choice of , , and ).
Note that expressing in the form allows us to directly apply our efficient regret oracle (with ). We therefore gain the following optimization oracle as a corollary.
Corollary 3.13.
Given an efficient regret oracle, for any we can efficiently (in time ) compute .
Finally, we will show that given these two results, we can efficiently construct a membership oracle for our set . To do this, we will need the following fact (loosely stated; see Lemma B.2 in Appendix B for an accurate statement): it is possible to minimize a convex function over a convex set as long as one has an evaluation oracle for the function and a membership oracle for the convex set.
Lemma 3.14.
Given an efficient regret oracle for , we can construct an efficient membership oracle for the set .
3.4.4 Implementing regret minimization
Equipped with this membership oracle, we can now state and prove our main theorem.
Theorem 3.15.
Assume that:
1. The convex set Y is orthant-generating.
2. We have an efficient separation oracle for the convex set X.
3. We have an efficient regret oracle for the regret function defined by u.
Then it is possible to implement the algorithm of Theorem 3.9 in polynomial time per round.
Proof.
There are two steps of unclear computational complexity in the description of the algorithm in Section 3.3: step 2a, where we find a such that for all , and step 2d, where we have to run our OLO subalgorithm over (which we specifically instantiate as FTRL with a quadratic regularizer).
We begin by describing how to perform step 2a efficiently. Fix a θ ∈ Θ. Note that since Y is orthant-generating, to check whether ⟨θ, Φ(x, y)⟩ ≤ 0 for all y ∈ Y, it suffices to check whether ⟨θ, Φ(x, λ_i e_i)⟩ ≤ 0 for each scaled unit vector λ_i e_i. Therefore, to find a valid x we must find a point of X that satisfies m additional explicit linear constraints. Since we have an efficient separation oracle for X, this is possible in polynomial time.
To implement the FTRL algorithm over the set with a quadratic regularizer, each round we must find the minimizer of the convex function over the convex set . To do this, it suffices to exhibit an evaluation oracle for and a membership oracle for . To evaluate , note that we simply need to be able to compute (which is a Euclidean distance in ) and which we can do in time by keeping track of the cumulative sum and computing a single inner product in . A membership oracle for is provided by Lemma 3.14. ∎
3.5 Recovering regret via maxent regularizers
One interesting aspect of this reduction to pseudonorm approachability is that, in some cases, the regret bound achievable via (inefficient) ℓ∞-approachability (Corollary 2.4) outperforms the regret bound achieved by pseudonorm approachability (Theorem 3.9); this is the case, for example, for swap regret. Of course, this comparison is not completely fair: in both cases there is flexibility in specifying the underlying OLO algorithm, and the ℓ∞ bound uses an FTRL algorithm with a negative entropy regularizer, whereas our pseudonorm approachability bound uses an FTRL algorithm with a quadratic regularizer. After all, there are well-known cases (e.g. for OLO over a simplex, as in ℓ∞-approachability) where the negative entropy regularizer leads to exponentially better (in the dimension) regret bounds than the quadratic regularizer.
In this subsection, we will show that there exists a different regularizer for pseudonorm approachability – one we call a maxent regularizer – which recovers the regret bound of Corollary 2.4. In doing so, we will also better understand the parallels between the regret minimization algorithm that works via reduction to ℓ∞-approachability in a d-dimensional space and the one that works via reduction to pseudonorm approachability in an (nm)-dimensional space.
Let M be the d-by-(nm) matrix whose i-th row equals w_i. Note that M allows us to translate between analogous concepts/quantities for the two reductions in the following way.
Lemma 3.16.
The following statements are true:
• For any x ∈ X and y ∈ Y, u(x, y) = M Φ(x, y).
• The dual set satisfies Θ = {M^⊤ q : q ∈ Δ̄_d} (i.e., θ ∈ Θ iff there exists a q ∈ Δ̄_d such that θ = M^⊤ q).
• If θ = M^⊤ q for some q, then for any x ∈ X and y ∈ Y, ⟨θ, Φ(x, y)⟩ = ⟨q, u(x, y)⟩.
• Fix a q and let θ = M^⊤ q. If x ∈ X satisfies ⟨q, u(x, y)⟩ ≤ 0 for all y ∈ Y, then x also satisfies ⟨θ, Φ(x, y)⟩ ≤ 0 for all y ∈ Y.
Proof.
The first statement follows from the fact that . The second claim follows as a consequence of Lemma 3.4 (that is the convex hull of the vectors ). The third claim follows from the first claim: . Finally, the fourth claim follows directly from the third claim. ∎
Now, fix a regret minimization problem (specified by , , and ) and consider the execution of algorithm on some specific loss sequence . Each round , runs the entropy-regularized FTRL algorithm to generate a (as a function of the actions and losses up until round ) and then uses to select a that satisfies for all . If we execute algorithm for the same regret minimization problem on the same loss sequence, each round , runs some (to be determined) FTRL algorithm to generate a , and then uses to select a that satisfies for all . Lemma 3.16 shows that if for each , then both algorithms will generate exactly the same sequence of actions in response to this loss sequence999Technically, there may be some leeway in terms of which to choose that e.g. satisfies for all , and the two different procedures could result in different choices of . But if we break ties consistently (e.g. add the additional constraint of choosing the that maximizes the inner product of with some generic vector), then both procedures will produce the same value of ., and hence the same regret.
The question then becomes: how do we design an OLO algorithm that outputs each round that would output ? Recall that if is an FTRL algorithm with regularizer , then
Define the regularizer R̃ on Θ via
R̃(θ) := min { R(q) : q ∈ Δ_d, M^⊤ q = θ }.   (7)
We claim that if we let be the FTRL algorithm that uses regularizer , then will output our desired sequence of .
Lemma 3.17.
The following equality holds for the output :
Proof.
Note that we can write
The third equality follows from the fact that if , then (by Lemma 3.16). ∎
Corollary 3.18.
Let be an FTRL algorithm over with regularizer , and let be an FTRL algorithm over with regularizer . Then .
We now consider the specific case where R is the negentropy function; i.e., for q ∈ Δ_d, R(q) = ∑_{i=1}^d q_i log q_i. For this choice of R, R̃ becomes
R̃(θ) = min { ∑_{i=1}^d q_i log q_i : q ∈ Δ_d, M^⊤ q = θ }.   (8)
In other words, −R̃(θ) is the maximum entropy of any distribution q ∈ Δ_d that satisfies the linear constraints M^⊤ q = θ imposed by θ. This is exactly an instance of the Maxent problem studied in (Berger et al., 1996; Rosenfeld, 1996; Pietra et al., 1997; Dudík et al., 2007) and (Mohri et al., 2018)[chapter 12].
It is known (Pietra et al., 1997; Dudík et al., 2007; Mohri et al., 2018) that the entropy maximizing distribution is a Gibbs distribution. In particular, the q maximizing the expression in (8) satisfies (for some real constants λ ∈ ℝ^{nm})
q_i = exp(⟨λ, w_i⟩) / Z(λ),   (9)
where Z(λ) (the “partition function”) is defined via
Z(λ) = ∑_{i=1}^d exp(⟨λ, w_i⟩).   (10)
Generally, there is exactly one choice of λ which results in M^⊤ q = θ (since there are nm free variables and nm linear constraints). For this optimal λ, it is known that the maximum entropy is given by log Z(λ) − ⟨λ, θ⟩. If it is possible to solve this system for λ and to evaluate Z(λ) efficiently, we can then evaluate R̃(θ) efficiently. In Section 4.1 we will see how to do this for the specific case of swap regret (where we are helped by the fact that (9) guarantees that q is a product distribution).
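The Gibbs characterization also suggests a generic numerical route to the maxent regularizer when no closed form is available: minimize the convex dual log Z(λ) − ⟨λ, θ⟩ over λ. The sketch below (standard maxent code, not the paper's) does this for a small random instance, under the assumption that θ lies in the relative interior of the convex hull of the w_i so that strong duality holds.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def maxent_value(W, theta):
    """Maximum entropy of a distribution q over [d] subject to W.T @ q = theta,
    computed through the dual: minimize log Z(lam) - <lam, theta>, with the
    optimal q given by the Gibbs distribution q_i ~ exp(<lam, w_i>)."""
    def dual(lam):
        return logsumexp(W @ lam) - lam @ theta
    res = minimize(dual, x0=np.zeros(W.shape[1]), method="BFGS")
    lam = res.x
    q = np.exp(W @ lam - logsumexp(W @ lam))         # Gibbs distribution
    return res.fun, q

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    d, K = 8, 3
    W = rng.normal(size=(d, K))                      # rows play the role of w_i
    q0 = rng.dirichlet(np.ones(d))                   # some feasible distribution
    theta = W.T @ q0
    H_max, q_star = maxent_value(W, theta)
    H0 = -np.sum(q0 * np.log(q0))
    print(f"entropy of q0: {H0:.4f}, maximum entropy: {H_max:.4f}")
    assert H_max >= H0 - 1e-6                        # maxent dominates any feasible q
    assert np.allclose(W.T @ q_star, theta, atol=1e-4)  # constraints (approx.) satisfied
```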
Regardless of how we compute the maxent regularizer , efficient computation leads to an efficient regret minimization algorithm.
Corollary 3.19.
If there exists an efficient ( time) algorithm for computing the maxent regularizer , then there exists a regret minimization algorithm for with regret and that can be implemented in time per round.
4 Applications
4.1 Swap regret
Recall that in the setting of swap regret, the action set is X = Δ_N (distributions over N actions), the loss set is Y = [0, 1]^N, and the swap regret of playing a sequence of actions x_1, …, x_T on a sequence of losses y_1, …, y_T is given by
SwapReg(x_{1:T}, y_{1:T}) = max_{π: [N] → [N]} ∑_{t=1}^T ( ⟨x_t, y_t⟩ − ∑_{i=1}^N x_{t,i} y_{t,π(i)} ).
In words, swap regret compares the total loss achieved by this sequence of actions with the loss achieved by any transformed action sequence formed by applying an arbitrary “swap function” π: [N] → [N] to the actions (i.e., always playing action π(i) instead of action i). Swap regret minimization can be directly written as an ℓ∞-approachability problem for the bilinear function u with components
u_π(x, y) = ⟨x, y⟩ − ∑_{i=1}^N x_i y_{π(i)}
(here we index u by the swap functions π: [N] → [N]). Note that the negative orthant is indeed separable with respect to u since, if we let x = e_{i*} for i* ∈ argmin_i y_i, then u_π(x, y) = y_{i*} − y_{π(i*)} ≤ 0 for any swap function π.
We can now apply the theory developed in Section 3. First, since and the maximum absolute value of any coefficient in is , Theorem 3.9 immediately results in a swap regret algorithm with regret . Moreover, since we can write
we can compute efficiently. By Theorem 3.15, we can therefore implement this -regret algorithm efficiently (in time per round). We can improve upon this regret bound by noting that , , and (the coefficient vector corresponding to contains at most coefficients that are ). It follows that , and it follows from the first part of Theorem 3.9 that the regret of the aforementioned algorithm is actually only .
This is still a factor of approximately √(N / log N) larger than the optimal bound. To achieve the optimal regret bound, we will show how to compute the maxent regularizer for swap regret (and hence can efficiently implement an algorithm via Corollary 3.19). Recall that the maxent regularizer evaluated at θ is the negative of the maximum entropy of a distribution q over swap functions that satisfies M^⊤ q = θ. For our problem, θ is N²-dimensional (we view it as an N-by-N matrix), and this imposes the following linear constraints on q (which we view as a distribution over swap functions π: [N] → [N]):
Pr_{π ∼ q}[π(i) = j] = 𝟙[i = j] − θ_{ij} for all i, j ∈ [N].
Now, by the characterization presented in (9)101010In particular, in the case of swap regret we have that . Since the first term () does not depend on , we can ignore its contribution to (it cancels from the numerator and denominator)., we know the entropy maximizing satisfies (for some constants )
In particular, this shows that the entropy maximizing distribution is a product distribution over the set of swap functions, where for each i ∈ [N] the value of π(i) is chosen independently. Moreover, from θ we can recover the overall marginal probability Q_{ij} := Pr_{π ∼ q}[π(i) = j] (it is 1 − θ_{ii} if i = j and −θ_{ij} if i ≠ j). The entropy of this product distribution can therefore be written as:
H(q) = − ∑_{i=1}^N ∑_{j=1}^N Q_{ij} log Q_{ij}.
Our regularizer is simply R̃(θ) = ∑_{i,j} Q_{ij} log Q_{ij} and can clearly be efficiently computed as a function of θ. It follows from Corollary 3.19 that there exists an efficient (poly(N) time per round) regret minimization algorithm for swap regret that incurs O(√(T N log N)) worst-case regret.
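The closed form above is easy to implement; the following sketch does so, with the caveat that the identification of the marginal matrix Q with I − θ (and hence the sign convention for the w_π) is our reconstruction of the elided formulas and is labeled as an assumption in the code.

```python
import numpy as np

def swap_maxent_regularizer(theta):
    """Sketch of the closed-form maxent regularizer for swap regret.
    Assumption: theta is an N-by-N matrix in the dual set with marginals
    Q[i, j] = Pr[pi(i) = j] recovered as Q = I - theta; the regularizer is
    the negative entropy of the induced product distribution over swaps."""
    Q = np.eye(theta.shape[0]) - theta
    Q = np.clip(Q, 1e-12, 1.0)                    # guard the log at the boundary
    return float(np.sum(Q * np.log(Q)))

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    N = 4
    # A point of the dual set: theta = I - Q for a random row-stochastic Q,
    # i.e. a convex combination of the vectors w_pi.
    Q = rng.dirichlet(np.ones(N), size=N)
    theta = np.eye(N) - Q
    R = swap_maxent_regularizer(theta)
    print("regularizer value:", R)
    # Sanity check: minus the sum of the row entropies of Q, bounded below
    # by -N log N (attained at uniform marginals) and above by 0.
    assert -N * np.log(N) - 1e-9 <= R <= 0.0
```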
Finally, we briefly remark that the pseudonorm we construct here is closely related to the group norm defined over N-by-N square matrices as the ℓ1 norm of the vector formed by the ℓ∞ norms of the rows. This is not unique to swap regret; in many of our applications, the relevant pseudonorm can be thought of as a composition of multiple smaller norms (often ℓ1 or ℓ∞ norms).
4.2 Procrustean swap regret
To illustrate the power of Theorem 3.15, we present a toy variant of swap regret where the learner must compete against an infinite set of swap functions (in particular, all orthogonal linear transformations of their sequence of actions) and yet can do this efficiently while incurring low (polynomial in the dimension of their action set) regret.
In this problem, the action set X is the set of vectors with ℓ2 norm at most 1 (the unit ball in ℝ^n) and the loss set is Y = X. The learner would like to minimize the following notion of regret:
Reg_O(x_{1:T}, y_{1:T}) = max_{W ∈ O(n)} ∑_{t=1}^T ( ⟨x_t, y_t⟩ − ⟨W x_t, y_t⟩ ).   (11)
Here, O(n) is the set of all orthogonal n-by-n matrices. We call this notion of regret Procrustean swap regret due to its similarity with the orthogonal Procrustes problem from linear algebra, which (loosely) asks for the orthogonal matrix which most closely maps one sequence of points onto another sequence of points (in our setting, we intuitively want to map the x_t onto the y_t to minimize the loss of our benchmark). See Gower and Dijksterhuis (2004) for a more detailed discussion of the Procrustes problem. Regardless, note that we can compute Reg_O efficiently, since we have an efficient membership oracle for the convex hull of the set of orthogonal matrices (specifically, an n-by-n matrix belongs to this convex hull iff all its singular values are at most 1).
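Both computations mentioned here reduce to singular values: membership in the convex hull of the orthogonal matrices is a largest-singular-value test, and the Procrustean benchmark is (minus) a nuclear norm, as in the orthogonal Procrustes problem. The sketch below uses the regret formula as reconstructed in (11) and standard linear algebra; it is illustrative rather than the paper's implementation.

```python
import numpy as np

def in_orthogonal_hull(A, tol=1e-9):
    """Membership oracle for the convex hull of the n-by-n orthogonal matrices:
    A belongs to the hull iff its largest singular value is at most 1."""
    return np.linalg.svd(A, compute_uv=False).max() <= 1.0 + tol

def procrustean_regret(xs, ys):
    """Procrustean swap regret of unit-ball actions xs against losses ys.
    The benchmark  min over orthogonal W of sum_t <W x_t, y_t>  equals minus
    the nuclear norm of sum_t x_t y_t^T, so no enumeration of the infinite
    benchmark set is needed."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    learner_loss = np.sum(xs * ys)
    C = xs.T @ ys                                   # sum_t x_t y_t^T
    nuclear = np.linalg.svd(C, compute_uv=False).sum()
    return learner_loss + nuclear                   # loss - (-nuclear norm)

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    n, T = 5, 200
    xs = rng.normal(size=(T, n))
    xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))
    ys = rng.normal(size=(T, n))
    ys /= np.maximum(1.0, np.linalg.norm(ys, axis=1, keepdims=True))
    print("in hull (identity):  ", in_orthogonal_hull(np.eye(n)))
    print("in hull (2*identity):", in_orthogonal_hull(2 * np.eye(n)))
    print("Procrustean swap regret:", procrustean_regret(xs, ys))
```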
In our approachability framework, we can capture this notion of regret with the bilinear function u with coordinates indexed by orthogonal matrices W, given by u_W(x, y) = ⟨x, y⟩ − ⟨W x, y⟩ (which is separable with respect to the negative orthant, since for x = 0 we have u_W(0, y) = 0 for all W and y). Since all the conditions of Theorem 3.15 hold, there is an efficient learning algorithm which incurs at most poly(n) · √T Procrustean swap regret (in particular, the relevant norm bounds in Theorem 3.9 are all polynomial in n).
4.3 Converging to Bayesian correlated equilibria
Swap regret has the nice property that if in a repeated -player normal-form game, all players run a low-swap regret algorithm to select their actions, their time-averaged strategy profile will converge to a correlated equilibrium (indeed, this is one of the major motivations for studying swap regret).
In repeated Bayesian games (games where each player has private information drawn independently from some distribution each round) the analogue of correlated equilibria is Bayesian correlated equilibria. Playing a repeated Bayesian game requires a contextual learning algorithm, which can observe the private information of the player (the “context”) and select an action based on this. Mansour et al. (2022) show that there is a notion of regret (that we call Bayesian swap regret) such that if all learners are playing an algorithm with low Bayesian swap regret, then over time they converge on average to a Bayesian correlated equilibrium. However, while Mansour et al. (2022) provide an algorithm with low Bayesian swap regret, their algorithm is not provably efficient (it requires finding the fixed point of a system of quadratic equations); by applying our framework, we show that it is possible to obtain a polynomial-time algorithm with low Bayesian swap regret for this problem.
Formally, we study the following full-information contextual online learning setting. As before, there are actions, but there are now different contexts (“types”). Every round , the adversary specifies a loss function , where represents the loss from playing action in context . Simultaneously, the learner specifies an action which we view as a function mapping each context to a distribution over actions. Overall, the learner receives expected utility this round (the learner’s context is drawn iid from the publicly known distribution each round). In this formulation can be written in the form
(12) |
where the maximum is over all “type swap functions” and -tuples of “action-deviation swap functions” . It is straightforward to verify that (as written in (12)) can be written as an -approachability problem for a bilinear function for . The theory of -approachability guarantees the existence of an algorithm with regret, but this algorithm has time/space complexity and is very inefficient for even moderate values of or .
Instead, in Appendix C, we show that can be written in the form
This allows us to evaluate in time and apply our pseudonorm approachability framework. Directly from Theorems 3.9 and 3.15, we know that there exists an efficient ( time per round) learning algorithm that incurs at most swap regret. As with swap regret, we can tighten this bound somewhat by examining the values of and and show that this algorithm actually incurs at most regret (details left to Appendix C).
This is within approximately an factor of optimal. Interestingly, unlike with swap regret, it is unclear if it is possible to efficiently solve the relevant entropy maximization problem for Bayesian swap regret (and hence achieve the optimal regret bound). We pose this as an open question.
Open Question 1.
Is it possible to efficiently (in time) evaluate the maximum entropy regularizer for the problem of Bayesian swap regret?
4.4 Reinforcement learning in constrained MDPs
We consider episodic reinforcement learning in constrained Markov decision processes (MDPs). Here, the agent receives a vectorial reward (loss) in each time step and aims to find a policy that achieves a certain minimum total reward (maximum total loss) in each dimension. Approachability has been used to derive reinforcement learning algorithms for constrained MDPs before (Miryoosefi et al., 2019; Yu et al., 2021; Miryoosefi and Jin, 2021), however, exclusively using ℓ2 geometry. As a result, these methods aim to bound the ℓ2 distance to the feasible set, and the bounds scale with the ℓ2 norm of the reward vector. This deviates from the more common, and perhaps more natural, formulation for constrained MDPs studied in other works (Efroni et al., 2020; Brantley et al., 2020; Ding et al., 2021). Here, each component of the loss vector is within a given range (e.g. [0, 1]) and the goal is to minimize the largest constraint violation among all components. We will show that ℓ∞-approachability is the natural approach for this problem and yields algorithms that avoid the √d factor in the regret, where d is the number of constraints, that algorithms based on ℓ2 approachability suffer. While the number of constraints can sometimes be small, there are many applications where d is large and a √d dependency in the regret is undesirable, even when a computational dependency on d is manageable. For example, constraints may arise from fairness considerations like demographic disparity that ensure that the policy behaves similarly across all protected groups. This would require a constraint for each pair of groups, which could be very many.
The formal problem setup is as follows. We consider an MDP defined by a state space , an action set , a transition function , where is the probability of reaching state when choosing action at state , and the loss vector function with . We work with losses for consistency with the other sections but our results also readily apply to rewards. To simplify the presentation, we will assume a layered MDP with layers with , for , and with and .
We define a (stochastic) policy as mapping from where represents the probability of action in state . Given a policy and the transition probability , we define the occupancy measure as the probability of visiting state-action pair when following policy (Altman, 1999; Neu et al., 2012): . We will denote by the set of all occupancy measures, obtained by varying the policy . It is known that forms a polytope.
We consider the feasibility problem in CMDPs with stochastic rewards and unknown dynamics. The loss vector is the same in all episodes, . Our goal is to learn a policy such that for a given threshold vector . The payoff function is defined as: . The set is separable as long as there is a policy that satisfies all constraints. Although we define the payoff function in terms of occupancy measures , they will be implicit in the algorithm.
To aid the comparison of our approach to existing work in this setting, we omit the dimensionality reduction with pseudonorms in this application and directly work in the d-dimensional space. We will analyze Algorithm 1 for ℓ∞-approachability, which can be implemented in MDPs by adopting the following oracles from prior work (Miryoosefi et al., 2019) on CMDPs:
• BestResponse-Oracle: For a given weight vector λ, this oracle returns a policy that is ε-optimal with respect to the scalar reward function induced by λ.
• Est-Oracle: For a given policy π, this oracle returns a vector estimating the expected vectorial payoff of π to within ε in each coordinate.
Consider first the case without approximation errors, , for illustration. For a vector , let be the vector that contains only the first dimensions of . When we call the BestResponse oracle with vector , it returns a policy such that its occupancy measure satisfies . We can use this to show:
where is the occupancy measure of a policy that satisfies all constraints.
Thus the returned policy is a valid choice in Line 3 of Algorithm 1. Passing this policy to the Est oracle yields an estimate of its expected vectorial payoff; this is enough to compute the quantity needed
in Line 5 of Algorithm 1111111Since FTRL with the negative entropy regularizer operates on the simplex, we pad the inputs with a zero dimension to obtain an OLO algorithm on the interior of the padded simplex.. Finally, we obtain the next weight vector by passing this quantity to a negative entropy FTRL algorithm as the OLO algorithm (Line 6 of Alg. 1).
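A compact rendering of this oracle-based loop is sketched below. The oracle signatures, the step size, and the toy instance (which replaces a real MDP by a fixed set of behaviours whose mixtures play the role of occupancy measures) are all illustrative assumptions; the structure follows the description above.

```python
import numpy as np

def constrained_mdp_approachability(best_response, estimate, c, T, eta):
    """Sketch of the oracle-based approachability loop for constrained MDPs.

    best_response(lam) -> a policy approximately minimizing <lam, v(pi)>,
        where v(pi) is the vector of expected constraint losses of pi.
    estimate(pi)       -> an estimate of v(pi).
    c                  -> constraint thresholds (goal: v(pi) <= c).
    The weights lam are maintained by entropy-regularized FTRL (exponential
    weights) over the simplex, matching the l_inf-approachability analysis."""
    cum = np.zeros(len(c))
    policies = []
    for _ in range(T):
        w = np.exp(eta * (cum - cum.max()))
        lam = w / w.sum()                      # direction of largest violation
        pi = best_response(lam)                # oracle call (as in Line 3)
        policies.append(pi)
        v_hat = estimate(pi)                   # oracle call (as in Line 5)
        cum += v_hat - c                       # feed violations to the OLO (Line 6)
    return policies                            # play the uniform mixture of these

if __name__ == "__main__":
    # Toy instance: "policies" are distributions over 3 base behaviours whose
    # constraint-loss vectors are the columns of V; thresholds c must be met.
    V = np.array([[0.9, 0.1, 0.5],
                  [0.1, 0.9, 0.5],
                  [0.5, 0.5, 0.2]])            # d = 3 constraints, 3 behaviours
    c = np.array([0.55, 0.55, 0.45])
    best_response = lambda lam: np.eye(3)[np.argmin(lam @ V)]   # best pure behaviour
    estimate = lambda pi: V @ pi
    T = 2000
    policies = constrained_mdp_approachability(best_response, estimate, c, T,
                                               eta=np.sqrt(np.log(3) / T))
    mix = np.mean(policies, axis=0)
    print("maximum constraint violation of the mixture policy:",
          float(np.max(V @ mix - c)))
```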
Using similar steps as for Theorem 2.3 and relying on the regret bound for in Lemma 2.2, we can show the following guarantee:
Consider a constrained episodic MDP with horizon , fixed loss vectors with for all state-action pairs, and a constraint threshold vector . Assume that there exists a feasible policy that satisfies all constraints, and let be the mixture policy of generated by the approach described above. Then the maximum constraint violation of satisfies
Applying the results from prior work (Miryoosefi et al., 2019) based on approachability would yield a bound of in our setting, with additional and factors in front of . For the sake of exposition, we illustrated the benefit of approachability using the oracles adopted by Miryoosefi et al. (2019), but our approach can also be applied, with similar advantages, to other works that make oracle calls explicit (Yu et al., 2021; Miryoosefi and Jin, 2021).
5 Conclusion
We presented a new algorithmic framework for -approachability, which we argued is the most suitable notion of approachability for a variety of applications such as regret minimization. Our algorithms leverage a key dimensionality reduction and a reduction to online linear optimization. These ideas can similarly be used to derive useful algorithms for approachability under alternative distance metrics. In fact, as already pointed out, some of our algorithms can equivalently be viewed as reducing an equivalent group-norm approachability problem to online linear optimization.
References
- Abernethy et al. [2008] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 263–274. Omnipress, 2008.
- Abernethy et al. [2011] Jacob D. Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of COLT, volume 19 of JMLR Proceedings, pages 27–46, 2011.
- Altman [1999] Eitan Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
- Bergemann and Morris [2016] Dirk Bergemann and Stephen Morris. Bayes correlated equilibrium and the comparison of information structures in games. Theoretical Economics, 11(2):487–522, 2016.
- Berger et al. [1996] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Comp. Linguistics, 22(1), 1996.
- Bernstein and Shimkin [2015] Andrey Bernstein and Nahum Shimkin. Response-based approachability with applications to generalized no-regret problems. J. Mach. Learn. Res., 16:747–773, 2015.
- Blackwell [1956] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
- Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
- Borwein and Zhu [2005] Jonathan Borwein and Qiji Zhu. Techniques of Variational Analysis. Springer, 2005.
- Boyd and Vandenberghe [2014] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2014.
- Brantley et al. [2020] Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Proceedings of NIPS, 33:16315–16326, 2020.
- Chzhen et al. [2021] Evgenii Chzhen, Christophe Giraud, and Gilles Stoltz. A unified approach to fair online learning via blackwell approachability. In Proceedings of NeurIPS, pages 18280–18292, 2021.
- Dawid [1982] A. P. Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982. doi: 10.1080/01621459.1982.10477856. URL https://www.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856.
- Ding et al. [2021] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In Proceedings of AISTATS, pages 3304–3312. PMLR, 2021.
- Dudík et al. [2007] Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 2007.
- Efroni et al. [2020] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
- Forges [1993] Françoise Forges. Five legitimate definitions of correlated equilibrium in games with incomplete information. Theory and decision, 35(3):277–310, 1993.
- Forges et al. [2006] Françoise Forges et al. Correlated equilibrium in games with incomplete information revisited. Theory and decision, 61(4):329–344, 2006.
- Foster and Hart [2018] Dean P. Foster and Sergiu Hart. Smooth calibration, leaky forecasts, finite recall, and nash dynamics. Games and Economic Behavior, 109:271–293, 2018. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2017.12.022. URL https://www.sciencedirect.com/science/article/pii/S0899825618300113.
- Foster and Vohra [1999] Dean P Foster and Rakesh V Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7–35, 1999.
- Gordon et al. [2008] Geoffrey J. Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In Proceedings of ICML, volume 307, pages 360–367. ACM, 2008.
- Gower and Dijksterhuis [2004] John C Gower and Garmt B Dijksterhuis. Procrustes problems, volume 30. OUP Oxford, 2004.
- Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- Hart and Mas-Colell [2001] Sergiu Hart and Andreu Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(5):26–54, 2001.
- Hartline et al. [2015] Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in bayesian games. Advances in Neural Information Processing Systems, 28, 2015.
- Hazan [2016] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Hazan and Kale [2008] Elad Hazan and Satyen Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In NIPS, pages 625–632, 2008.
- Kalathil et al. [2014] Dileep M. Kalathil, Vivek S. Borkar, and Rahul Jain. Blackwell’s approachability in Stackelberg stochastic games: A learning version. In Proceedings of CDC, pages 4467–4472. IEEE, 2014.
- Kivinen and Warmuth [1995] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proceedings of STOC, pages 209–218. ACM, 1995.
- Kwon [2016] Joon Kwon. Mirror descent strategies for regret minimization and approachability. PhD thesis, Université Pierre et Marie Curie, Paris 6, 2016.
- Kwon [2021] Joon Kwon. Refined approachability algorithms and application to regret minimization with global costs. The Journal of Machine Learning Research, 22:200–1, 2021.
- Kwon and Perchet [2017] Joon Kwon and Vianney Perchet. Online learning and blackwell approachability with partial monitoring: Optimal convergence rates. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of AISTATS, volume 54, pages 604–613. PMLR, 2017.
- Lee et al. [2018] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pages 1292–1294. PMLR, 2018.
- Lehrer [2003] Ehud Lehrer. Approachability in infinite dimensional spaces. Int. J. Game Theory, 31(2):253–268, 2003.
- Mannor and Shimkin [2003] Shie Mannor and Nahum Shimkin. The empirical bayes envelope and regret minimization in competitive markov decision processes. Math. Oper. Res., 28(2):327–345, 2003.
- Mannor and Shimkin [2004] Shie Mannor and Nahum Shimkin. A geometric approach to multi-criterion reinforcement learning. J. Mach. Learn. Res., 5:325–360, 2004.
- Mannor et al. [2014a] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Set-valued approachability and online learning with partial monitoring. J. Mach. Learn. Res., 15(1):3247–3295, 2014a.
- Mannor et al. [2014b] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Proceedings of COLT, volume 35, pages 339–355, 2014b.
- Mansour et al. [2022] Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in Bayesian games, 2022. URL https://arxiv.org/abs/2205.08562.
- Miryoosefi and Jin [2021] Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. CoRR, abs/2107.05216, 2021.
- Miryoosefi et al. [2019] Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, and Robert E. Schapire. Reinforcement learning with convex constraints. In Proceedings of NIPS, pages 14070–14079, 2019.
- Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, second edition, 2018.
- Neu et al. [2012] Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of AISTATS, volume 22 of JMLR Proceedings, pages 805–813, 2012.
- Perchet [2010] Vianney Perchet. Approchabilité, Calibration et Regret dans les Jeux à Observations Partielles. PhD thesis, Université Pierre et Marie Curie - Paris VI, 2010.
- Perchet [2015] Vianney Perchet. Exponential weight approachability, applications to calibration and regret minimization. Dynamic Games and Applications, 5(1):136–153, 2015.
- Perchet and Quincampoix [2015] Vianney Perchet and Marc Quincampoix. On a unified framework for approachability with full or partial monitoring. Mathematics of Operations Research, 40(3):596–610, 2015.
- Perchet and Quincampoix [2018] Vianney Perchet and Marc Quincampoix. A differential game on Wasserstein space. Application to weak approachability with partial monitoring. CoRR, abs/1811.04575, 2018.
- Pietra et al. [1997] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell., 19(4), 1997.
- Rockafellar [1970] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1970.
- Rosenfeld [1996] Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech & Language, 10(3):187–228, 1996.
- Shalev-Shwartz [2007] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
- Shimkin [2016] Nahum Shimkin. An online convex optimization approach to blackwell’s approachability. The Journal of Machine Learning Research, 17(1):4434–4456, 2016.
- Spinat [2002] Xavier Spinat. A Necessary and Sufficient Condition for Approachability. Mathematics of Operations Research, 27(1):31–44, 2002.
- Vieille [1992] Nicolas Vieille. Weak approachability. Math. Oper. Res., 17(4):781–791, 1992.
- von Neumann [1928] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
- Yu et al. [2021] Tiancheng Yu, Yi Tian, Jingzhao Zhang, and Suvrit Sra. Provably efficient algorithms for multi-objective competitive rl. In International Conference on Machine Learning, pages 12167–12176. PMLR, 2021.
- Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 928–936. AAAI Press, 2003.
Appendix A Omitted proofs
A.1 Proof of Theorem 2.3
Proof of Theorem 2.3.
Similar results appear in e.g. Kwon [2021]. For completeness, we include a proof here.
We will need the following fact: for any , . This is easy to verify (both sides are equal to ), but is also a consequence of Fenchel duality (see e.g. Appendix B and the proof of Theorem 3.1).
Armed with this fact, note that
∎
A.2 Proof of Corollary 3.2
Proof of Corollary 3.2.
Note that if , then there exists a such that . Therefore, if , then the supremum (we can take to be a large multiple of ). On the other hand, if , then the supremum (taking ). It follows that the supremum in Theorem 3.1 must be achieved for a (we know that is not infinite since , so ). For such , the term vanishes, and we are left with the statement of this corollary. ∎
A.3 Proof of Theorem 3.3
A.4 Proof of Lemma 3.4
Proof of Lemma 3.4.
Let denote . If is in , then we can write for some , , with . Thus, for any , we have
which implies that is in . Conversely, if is not in , then since is a non-empty closed convex set, can be separated from ; that is, there exists such that
which implies that is not in . This completes the proof. ∎
A.5 Proof of Lemma 3.7
Proof of Lemma 3.7.
We begin with the first statement. Note that since the negative orthant is separable with respect to , for any , there exists a such that for all . Now, recall that we can write , so it follows that for all . This implies , as desired.
We next prove the second statement. Note that since , this implies that if , then . By the first statement, this means that for any , there exists a such that . By the minimax theorem (since and are both convex sets and is a bilinear function of and ), this implies that there exists a such that for all , , as desired. ∎
A.6 Proof of Lemma 3.10
A.7 Proof of Lemma 3.11
A.8 Proof of Lemma 3.12
Proof of Lemma 3.12.
Extend the domain of as a function to , and note that there is a unique way to write , where for each , is an element of and is the th unit vector in . We first claim iff each .
To see this, first note that since is orthant-generating, there exists a sequence of such that for each . Now, if each , then (since ), so . Conversely, if , then we can write for some and . Expanding each in the basis, we find that each must be a positive linear combination of the values and therefore .
Therefore, to check whether , it suffices to check whether each component of belongs to . This is possible to do efficiently given an efficient separation oracle for (we can write the convex program for ). Finally, if each we can also recover a value for each such that (via the same convex program). This allows us to explicitly write with , , and . ∎
A.9 Proof of Lemma 3.14
Proof of Lemma 3.14.
Checking for membership in the ball of radius is straightforward, so it suffices to exhibit a membership oracle for the set . Fix a and consider the convex function . Note that by the definition of iff , so it suffices to compute the minimum of over the convex set .
A.10 Proof of Proposition 4.4
Appendix B General theorems from convex optimization
We will use the following Fenchel duality theorem [Borwein and Zhu, 2005, Rockafellar, 1970], see also [Mohri et al., 2018].
Theorem B.1 (Fenchel duality).
Let and be Banach spaces, and convex functions and a bounded linear map. Assume that , and satisfy one of the following conditions:
• and are lower semi-continuous and ;
• ;
then for the dual optimization problems
and the supremum in the second problem is attained if finite.
When constructing efficient algorithms, we will need an efficient method for minimizing a convex function over a convex set given only a membership oracle for the set and an evaluation oracle for the function. This is provided by the following lemma.
Lemma B.2.
Let be a bounded convex subset of and a convex function over . Then given access to a membership oracle for (along with an interior point satisfying for some given radii ) and an evaluation oracle for , there exists an algorithm which computes the minimum value over (to within precision ) using time and queries to these oracles.
Proof.
See Lee et al. [2018]. ∎
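As a toy illustration of the oracle interface in Lemma B.2 (and emphatically not the cutting-plane method of Lee et al. [2018] on which the lemma relies), the following Python sketch minimizes a convex function given only a membership oracle and an evaluation oracle, by sampling candidate points around the provided interior point. All names are hypothetical and the sketch carries no accuracy guarantee.

```python
import numpy as np

def naive_oracle_minimize(membership, f, center, R, n_samples=10000, seed=0):
    """Toy minimizer using only a membership oracle and an evaluation oracle.

    membership(x) -> True iff x belongs to the convex set K.
    f(x)          -> value of the convex objective at x.
    center, R     -> an interior point of K and the radius of a ball containing K.
    Illustration of the interface only; it does not match the guarantee of Lemma B.2.
    """
    rng = np.random.default_rng(seed)
    d = len(center)
    best_x, best_val = np.array(center, dtype=float), f(center)
    for _ in range(n_samples):
        x = center + rng.uniform(-R, R, size=d)    # candidate in a bounding box around K
        if membership(x) and f(x) < best_val:      # keep only feasible improvements
            best_x, best_val = x, f(x)
    return best_x, best_val
```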
Appendix C Bayesian correlated equilibria
Here we provide another application of the approachability framework, to the problem of constructing learning algorithms that converge to correlated equilibria in Bayesian games. Correlated equilibria in Bayesian games (alternatively, “games with incomplete information”) are well-studied throughout the economics and game theory literature; see e.g. [Forges, 1993, Forges et al., 2006, Bergemann and Morris, 2016]. Unlike ordinary correlated equilibria, which are also well-studied from a learning perspective, relatively little is known about algorithms that converge to correlated equilibria in Bayesian games. Hartline et al. [2015] study no-regret learning in Bayesian games, showing that no-regret algorithms converge to a Bayesian coarse correlated equilibrium. More recently, Mansour et al. [2022] introduce a notion of Bayesian swap regret with the property that learners with sublinear Bayesian swap regret converge to correlated equilibria in Bayesian games. Mansour et al. [2022] construct a learning algorithm that achieves low Bayesian swap regret, albeit not a provably efficient one. In this section, we apply our approachability framework to develop the first efficient low-regret algorithm for Bayesian swap regret.
We begin with some preliminaries about standard (non-Bayesian) normal-form games. In a normal form game with players, each player must choose a mixed action (for simplicity we will assume each player has the same number of pure actions). We call the collection of mixed strategies played by all players a strategy profile. We will let denote the utility of player under strategy profile (and insist that is linear in each player’s strategy).
Given a function and a mixed action , let be the mixed action in formed by sampling an action from and then applying the function (i.e., ). A correlated equilibrium of is a distribution over strategy profiles such that for any player and function , it is the case that
where . Similarly, an -correlated equilibrium of is a distribution with the property that
for any and . Correlated equilibria have the following natural interpretation: a mediator samples a strategy profile from and tells each player a pure action randomly sampled from . If each player is incentivized to play the action they are told, then is a correlated equilibrium.
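To make the definition concrete, here is a minimal sketch (in Python, with hypothetical array shapes) that checks the approximate correlated-equilibrium conditions for a finite normal-form game. It uses the standard fact that checking every single-action swap for every player is equivalent to checking every deviation function, since the constraint for a deviation function decomposes over the recommended action.

```python
import itertools
import numpy as np

def is_approx_correlated_eq(mu, utils, eps=0.0):
    """Check the epsilon-correlated-equilibrium conditions for a finite game.

    mu:    array of shape (A,) * n; mu[a_1, ..., a_n] = probability of that pure profile.
    utils: list of n arrays, each of shape (A,) * n, giving each player's utility.
    """
    n = len(utils)
    A = mu.shape[0]
    for i in range(n):
        for a, a_dev in itertools.product(range(A), repeat=2):
            if a == a_dev:
                continue
            gain = 0.0
            for profile in itertools.product(range(A), repeat=n):
                if profile[i] != a:
                    continue
                deviated = profile[:i] + (a_dev,) + profile[i + 1:]
                gain += mu[profile] * (utils[i][deviated] - utils[i][profile])
            if gain > eps:
                return False   # player i gains more than eps by swapping a -> a_dev
    return True
```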
We define Bayesian games similarly to standard games, with the modification that now each player also has some private information drawn from some public distribution . We call the vector of realized types the type profile of the players (drawn randomly from ), and now let the utility of player depend on both the strategy profile and type profile of the players. Note that we can alternatively think of the strategy of player as a function mapping contexts to mixed actions; in this case we can again treat the expected utility for player (with expectation taken over the random type profile) as a multilinear function of the strategy profile .
As with regular correlated equilibria, we can motivate the definition of Bayesian correlated equilibria via the introduction of a mediator. In the Bayesian case, all players begin by revealing their private types to the mediator, and the mediator observes a type profile . The mediator then samples a joint action profile from a distribution that depends on the observed type profile. Finally, for each player , the mediator samples a pure action from the mixed strategy for , and relays to player (which they should follow). In order for this to be a valid correlated equilibrium, the following incentive compatibility constraints must be met:
• Players must have no incentive to deviate from the strategy relayed to them. As in correlated equilibria, this includes deviations of the form “if I am told to play action , I will instead play action ”.
• Players also must have no incentive to misreport their type (thus affecting the distribution over joint strategy profiles).
• Moreover, no combination of the above two deviations should result in improved utility for a player.
Formally, we define a Bayesian correlated equilibrium for a Bayesian game as follows. The distributions form a Bayesian correlated equilibrium if, for any player , any “type deviation” , and any collection of “action deviations” (for each ),
where is derived from by deviating from in the following way: when the player has type , they first report type to the mediator; if the mediator then tells them to play action , they instead play . No such deviation should improve the utility of an agent in a Bayesian correlated equilibrium. Likewise, an -Bayesian correlated equilibrium is a collection of distributions where no deviation increases the utility of a player by more than .
In order to play a Bayesian game, a learning algorithm must be contextual (using the agent’s private information to decide what action to play). We study the following setting of full-information contextual online learning. As before, there are actions, but there are now different contexts (types). Every round , the adversary specifies a loss function , where represents the loss from playing action in context . Simultaneously, the learner specifies an action mapping each context to a distribution over actions. Overall, the learner receives expected utility this round (here is a distribution over contexts; we assume that the learner’s context is drawn i.i.d. from each round, and that this distribution is publicly known).
Motivated by the deviations considered in Bayesian correlated equilibria, we can define the following notion of swap regret in Bayesian games (“Bayesian swap regret”):
(15)
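The definition in (15) is easiest to see operationally. The following Python sketch computes the Bayesian swap regret of a sequence of contextual strategies against a sequence of loss functions. The deviation convention used here — mapping each true context to a reported context and swapping each recommended action — is our reading of the definition above and should be treated as an assumption of the sketch.

```python
import numpy as np

def bayesian_swap_regret(pis, losses, D):
    """Compute the Bayesian swap regret of a sequence of contextual strategies.

    pis:    array (T, C, A); pis[t, c] = distribution over actions played in context c at round t.
    losses: array (T, C, A); losses[t, c, a] = loss of action a in context c at round t.
    D:      array (C,); the public distribution over contexts.
    """
    T, C, A = pis.shape
    regret = 0.0
    for c in range(C):
        # realized cumulative loss when the true context is c
        base = sum(pis[t, c] @ losses[t, c] for t in range(T))
        # best deviation: report some context c_rep, then best-swap each recommended action
        best_dev = np.inf
        for c_rep in range(C):
            # cum[a, a2] = total loss of playing a2 whenever a is recommended under c_rep
            cum = sum(np.outer(pis[t, c_rep], losses[t, c]) for t in range(T))
            best_dev = min(best_dev, cum.min(axis=1).sum())
        regret += D[c] * (base - best_dev)
    return regret
```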
Lemma C.1.
Let be an algorithm with . Assume each player in a repeated Bayesian game (over rounds) runs a copy of algorithm , and let be the strategy profile at time . Then the time-averaged strategy profile , defined by sampling a uniformly at random from and returning , is an -Bayesian correlated equilibrium with .
Proof.
We will show that there exists no deviation for player which increases their utility by more than .
Fix a type deviation and set of action deviations . This collection of deviations transforms an arbitrary strategy into the strategy satisfying .
For each , with probability the mediator will return the strategy profiles . Now, since is multilinear in the strategies of each player, there exists some vector such that the utility of player if they defect to some strategy is given by the inner product .
In particular, conditioned on the mediator returning , the difference in utility for player between playing and the strategy formed by applying the above deviations to is exactly
Taking expectations over all , we have that the expected difference in utility by deviating is
But since player selected their strategies by playing , this is at most , as desired. ∎
It is possible to phrase (15) in the language of -approachability by considering the dimensional vectorial payoff given by:
A straightforward computation shows that the negative orthant is separable with respect to .
Lemma C.2.
The set is separable with respect to the vectorial payoff .
Proof.
Fix an . Then note that if we let , where (i.e., is entirely supported on the best fixed action to play in context ), it follows that for all and , and therefore that . ∎
This in turn leads (via Theorem 2.3) to a low-regret (albeit computationally inefficient) algorithm for Bayesian swap regret. Instead, as in the case of swap regret, we will apply our pseudonorm approachability framework. First, we will show that we can rewrite (15) in a way that allows us to easily evaluate .
Lemma C.3.
We have that
(16)
Note that (16) allows us to efficiently (in time) evaluate . As mentioned in the main text, directly from Theorems 3.9 and 3.15 this gives us an efficient ( time per round) learning algorithm that incurs at most swap regret. We will now examine the values of , and and show that this algorithm actually incurs at most regret.
First, note that since , elements can be thought of as -tuples of positive numbers that add to . Each such -tuple has squared distance at most , so . Second, since , . Finally, the coefficients of each consist of copies of the distribution ; this has norm at most , so . Combining these three quantities according to Theorem 3.9, we obtain the following corollary.
Corollary C.4.
There exists an efficient contextual learning algorithm with .