Pseudonorm Approachability and
Applications to Regret Minimization
Abstract
Blackwell’s celebrated approachability theory provides a general framework for a variety of learning problems, including regret minimization. However, Blackwell’s proof and implicit algorithm measure approachability using the Euclidean (ℓ2) distance. We argue that in many applications such as regret minimization, it is more useful to study approachability under other distance metrics, most commonly the ℓ∞-metric. However, the time and space complexity of the algorithms designed for ℓ∞-approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. Thus, we present a framework for converting high-dimensional ℓ∞-approachability problems to low-dimensional pseudonorm approachability problems, thereby resolving such issues. We first show that the ℓ∞-distance between the average payoff and the approachability set can be equivalently defined as a pseudodistance between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for ℓ2 and other norms, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use this to show, modulo mild normalization assumptions, that there exists an ℓ∞-approachability algorithm whose convergence is independent of the dimension of the original vectorial payoff. We further show that this algorithm admits a polynomial-time implementation, assuming that the original ℓ∞-distance can be computed efficiently. We also give an ℓ∞-approachability algorithm whose convergence is logarithmic in that dimension, using an FTRL algorithm with a maximum-entropy regularizer. Finally, we illustrate the benefits of our framework by applying it to several problems in regret minimization.
1 Introduction
The notion of approachability introduced by Blackwell (1956) can be viewed as an extension of von Neumann’s minimax theorem (von Neumann, 1928) to the case of vectorial payoffs. Blackwell gave a simple example showing that the straightforward analog of von Neumann’s minimax theorem does not hold for vectorial payoffs. However, in contrast with this negative result for one-shot games, he proved that, in repeated games, a player admits an adaptive strategy guaranteeing that their average payoff approaches a closed convex set in the limit, provided that the set satisfies a natural separability condition.
The theory of Blackwell approachability is intimately connected with the field of online learning because the problem of regret minimization can be viewed as an approachability problem: in particular, the learner would like their vector of regrets (with respect to each competing benchmark) to converge to a non-positive vector. In this vein, Abernethy et al. (2011) demonstrated how to use algorithms for approachability to solve a general class of regret minimization problems (and conversely, how to use regret minimization to construct approachability algorithms). However, applying their reduction sometimes leads to suboptimal regret guarantees – for example, for the specific case of minimizing external regret over T rounds with N actions, their reduction results in an algorithm with O(√(NT)) regret (instead of the optimal O(√(T log N)) regret bound achievable by e.g. multiplicative weights).
One reason for this suboptimality is the choice of distance used to define approachability. Both Blackwell and Abernethy et al. consider approachability algorithms that minimize the Euclidean (ℓ2) distance between their average payoff and the desired set. We argue that, for applications to regret minimization, it is often more useful to study approachability under other distance metrics – most commonly, approachability under the ℓ∞ metric to the non-positive orthant, which is well suited to capture the fact that regret is a maximum over various competing benchmarks. This has been observed in several recent publications (Perchet, 2015; Shimkin, 2016; Kwon, 2021). In particular, by constructing algorithms for ℓ∞-approachability, it is possible to naturally recover an O(√(T log N)) external regret learning algorithm (and algorithms with optimal regret guarantees for many other problems of interest).
However, there is still one significant problem with developing regret minimization algorithms via ℓ∞-approachability (or any of the other forms of approachability previously mentioned): the time and space complexity of these algorithms depends polynomially on the dimension d of the space of vectorial payoffs, which in turn equals the number of benchmarks we compete against in our regret minimization problem. In some regret minimization settings, this can be prohibitively expensive. For example, in the setting of swap regret (where the benchmarks are parameterized by the N^N swap functions mapping [N] to [N]), this results in algorithms with complexity exponential in N. On the other hand, there exist algorithms, e.g. (Blum and Mansour, 2007), which are both efficient (poly(N) time and space) and obtain optimal regret guarantees.
1.1 Main results
In this paper, we present a framework for converting high-dimensional ℓ∞-approachability problems to low-dimensional “pseudonorm” approachability problems, in turn resolving many of these issues. To be precise, recall that the setting of approachability can be thought of as a T-round repeated game, where in round t the learner chooses an action x_t from some convex “action” set X ⊆ ℝ^n, the adversary simultaneously chooses an action y_t from some convex “loss” set Y ⊆ ℝ^m, and the learner receives a vector-valued payoff u(x_t, y_t), where u: X × Y → ℝ^d is a d-dimensional bilinear function. The learner would like the ℓ∞ distance between their average payoff (1/T)∑_t u(x_t, y_t) and some convex set S111For simplicity, throughout this paper we assume S to be the negative orthant ℝ^d_{≤0}, as this is the case most relevant to regret minimization. However, most of our results extend straightforwardly to arbitrary convex sets. to be as small as possible.
We first demonstrate how to construct a new (n·m)-dimensional bilinear function Φ, a new convex set S′, and a pseudonorm222In this paper a pseudonorm is a function ψ which satisfies most of the properties of a norm (e.g. positive homogeneity, triangle inequality), but may be asymmetric (there may exist x where ψ(x) ≠ ψ(−x)) and may not be definite (there may exist x ≠ 0 where ψ(x) = 0). Just as a norm defines a distance between x and y via ‖x − y‖, a pseudonorm defines the pseudodistance ψ(x − y). ψ such that the “pseudodistance” between the average modified payoff (1/T)∑_t Φ(x_t, y_t) and S′ is equal to the ℓ∞ distance between the original average payoff and S. Importantly, the new dimension is equal to n·m and is independent of the original dimension d.
We then develop an algorithmic theory of pseudonorm approachability analogous to that developed in (Abernethy et al., 2011) for the ℓ2 norm and in (Shimkin, 2016; Kwon, 2021) for other norms, showing that, in order to perform pseudonorm approachability, it suffices to be able to perform online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball (and that the rate of approachability is directly related to the regret guarantees of this OLO subalgorithm). This has the following consequences for ℓ∞-approachability:
• First, by solving this OLO problem with a quadratically regularized Follow-The-Regularized-Leader (FTRL) algorithm, we show (modulo mild normalization assumptions on the sizes of X, Y, and u) that there exists a pseudonorm approachability algorithm (and hence an ℓ∞-approachability algorithm for the original problem) whose convergence rate is independent of the original dimension d. We additionally provide a stronger bound on the rate, stated in terms of the sizes of X and Y and the maximum norm of the set of vectors formed by taking the coefficients of components of u (Theorem 3.9). In comparison, the best-known generic guarantee for ℓ∞-approachability prior to this work converged at a d-dependent rate of O(√(log(d)/T)).
• Second, we show that as long as we can evaluate the original ℓ∞-distance between the average payoff and S efficiently, we can implement the above algorithm in polynomial time per round (Theorem 3.15). This has the following natural consequence for the class of regret minimization problems that can be written as ℓ∞-approachability problems: if it is possible to efficiently compute some notion of regret for a sequence of losses and actions, then there is an efficient (in the dimensions of the actions and losses) learner that minimizes this regret.
• Finally, in some cases, the approachability rate from (inefficient) ℓ∞-approachability outperforms the rate obtained by the quadratically regularized FTRL algorithm. We define a new regularizer whose value is given by finding the maximum-entropy distribution within a subset of distributions of support size d, and show that, by using this regularizer, we recover this rate. In particular, whenever we can efficiently compute this maxent regularizer, there is an efficient learning algorithm with an O(√(log(d)/T)) approachability rate.
We then apply our framework to various problems in regret minimization:
• We show that our framework straightforwardly recovers a regret-optimal and efficient algorithm for swap regret minimization. Doing so requires computing the above maximum-entropy regularizer for this specific case, where we show that it has a nice closed form. In particular, to our knowledge, this is the first approachability-based algorithm for swap regret that is both efficient and achieves the optimal minimax regret.
• In Section 4.3, we apply our framework to develop the first efficient contextual learning algorithms with low Bayesian swap regret. Such algorithms have the property that if learners employ them in a repeated Bayesian game, the time-average of their strategies will converge to a Bayesian correlated equilibrium, a well-studied equilibrium notion in game theory (see e.g. Bergemann and Morris (2016)).
This notion of Bayesian swap regret was recently introduced by Mansour et al. (2022), who also provided an algorithm with low Bayesian swap regret, albeit one that is not computationally efficient. By applying our framework, we easily obtain an efficient contextual learning algorithm with low Bayesian swap regret, resolving an open question of Mansour et al. (2022) (here and below, M denotes the number of “contexts” / “types” of the learner).
• In Section 4.4, we further analyze the application of our general ℓ∞-approachability theory and algorithm to reinforcement learning (RL) with vectorial losses. We point out how our framework can provide a general solution in the full-information setting with known transition probabilities, and how we can recover the best known solution for standard regret minimization in episodic RL. More importantly, we show how our framework and algorithm can lead to an algorithm for constrained MDPs with a significantly more favorable regret guarantee, logarithmic in the number of constraints d, in contrast with the √d-dependency of the results of Miryoosefi et al. (2019).
1.2 Related Work
There is a wide literature dealing with various aspects of Blackwell’s approachability, including its applications to game theory, regret minimization, reinforcement learning, and multiple extensions.
Hart and Mas-Colell (2000) described an adaptive procedure for players in a game based on Blackwell’s approachability, which guarantees that the empirical distribution of play converges to the set of correlated equilibria. This procedure is related to internal regret minimization, for which, as shown by Foster and Vohra (1999), the existence of an algorithm follows from the proof of Hart and Mas-Colell (2000). Hart and Mas-Colell (2001) further gave a general class of adaptive strategies based on approachability. Approachability has been widely used for calibration (Dawid, 1982); see Foster and Hart (2018) for a recent work on the topic. Approachability and partial monitoring were studied in a series of publications by Perchet (2010); Mannor et al. (2014a, b); Perchet and Quincampoix (2015, 2018); Kwon and Perchet (2017). More recently, approachability has also been used in the analysis of fairness in machine learning (Chzhen et al., 2021).
Approachability has also been extensively used in the context of reinforcement learning. Mannor and Shimkin (2003) discussed an extension of regret minimization in competitive Markov decision processes (MDPs) whose analysis is based on Blackwell’s approachability theory. Mannor and Shimkin (2004) presented a geometric approach to multiple-criteria reinforcement learning formulated as approachability conditions. Kalathil et al. (2014) presented strategies for approachability for MDPs and Stackelberg stochastic games based on Blackwell’s approachability theory. More recently, Miryoosefi et al. (2019) used approachability to derive solutions for reinforcement learning with convex constraints.
The notion of approachability was further extended in several studies. Vieille (1992) used differential games with a fixed duration to study weak approachability in finite dimensional spaces. Spinat (2002) formulated a necessary and sufficient condition for approachability of non-necessary convex sets. Lehrer (2003) extended Blackwell’s approachability theory to infinite-dimensional spaces.
The most closely related work to this paper, which we build upon, is that of Abernethy et al. (2011) who showed that, remarkably, any algorithm for Blackwell’s approachability could be converted into one for online convex optimization and vice-versa. Bernstein and Shimkin (2015) also discussed a related response-based approachability algorithm.
Perchet (2015) presented a specific study of ℓ∞-approachability, for which they gave an exponential weight algorithm. Shimkin (2016, Section 5) studied approachability for an arbitrary norm and gave a general duality result using Sion’s minimax theorem. The pseudonorm duality theorem we prove, using Fenchel duality, can be viewed as a generalization. Kwon (2016, 2021) also presented a duality theorem similar to that of Shimkin (2016), which they used to derive an FTRL algorithm for general norm approachability. They further treated the special case of internal and swap regret. However, unlike the algorithms derived in this work, the computational complexity of their swap regret algorithm is exponential in N. The same holds for Perchet (2015), which also analyzes the swap regret problem.
It is known that if all players follow a swap regret minimization algorithm, then the empirical distribution of their play converges to a correlated equilibrium (Blum and Mansour, 2007). Hazan and Kale (2008) showed a result generalizing this property to the case of Φ-regret and Φ-equilibria, where the Φ-regret is the difference between the cumulative expected loss suffered by the learner and that of the best Φ-modification of the sequence in hindsight. Gordon et al. (2008) further generalized the results of Hazan and Kale (2008) to a more general class of Φ-modification regrets. The algorithms discussed in (Gordon et al., 2008) are distinct from those discussed in this paper (they do not clearly extend to the general approachability setting, and they require significantly different computational assumptions than ours). Nevertheless, they bear some similarity with our work.
2 Preliminaries
Notation.
We use [N] as a shorthand for the set {1, 2, …, N}. We write Δ_N to denote the simplex over N dimensions and Δ̄_N to denote the convex hull of the N-simplex with the origin. conv(S) denotes the convex hull of the points in S, and cone(S) the convex cone generated by the points in S.
Some of the more standard proofs have been deferred to Appendix A.
2.1 Blackwell approachability and regret minimization
We begin by illustrating the theory of Blackwell approachability for the specific case of the ℓ∞-distance; this case is both particularly suited to the application of regret minimization and will play an important role in the results (e.g. reductions to pseudonorm approachability) that follow.
We consider a repeated game setting, where every round t a learner chooses an action x_t belonging to a bounded333We bound the entries of X, Y, and u for convenience, but it is generally easy to translate between different boundedness assumptions (since almost all relevant quantities are linear). We express the majority of our theorem statements (with the notable exception of Theorem 3.9) in a way that is independent of the choice of bounds. convex set X ⊆ ℝ^n, and an adversary simultaneously chooses a loss y_t belonging to a bounded convex set Y ⊆ ℝ^m. Let u: X × Y → ℝ^d be a bounded bilinear444We briefly note that all our results also hold for biaffine functions; in particular, extending the loss and action sets slightly (by replacing X and Y with X × {1} and Y × {1}) allows us to write any biaffine function over the original sets as a bilinear function over the extended sets. vector-valued payoff function, and let S ⊆ ℝ^d be a closed convex set with the property that for every y ∈ Y, there exists an x ∈ X such that u(x, y) ∈ S (we say that such a set S is “separable”). When d = 1, the minimax theorem implies that there exists a single x ∈ X such that u(x, y) ∈ S for all y ∈ Y.
This is not true for d > 1, but the theory of Blackwell approachability provides the following algorithmic analogue of this statement. Define a learning algorithm A to be a collection of functions A_t for each t ∈ [T], where A_t: Y^{t−1} → X describes how the learner decides their action x_t as a function of the observed losses up until time t. Blackwell approachability guarantees that there exists a learning algorithm A such that, when A is run on any loss sequence y_1, …, y_T, the resulting action sequence x_1, …, x_T has the property that:
d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), S ) → 0 as T → ∞.   (1)
(Here, for v ∈ ℝ^d, d_∞(v, S) = inf_{s ∈ S} ‖v − s‖_∞ represents the ℓ∞ distance between v and S.)
As mentioned, one of the main motivations for studying Blackwell approachability is its connections to regret minimization. In particular, for a fixed choice of X, Y, and u, define
Reg(x_{1:T}, y_{1:T}) := max( 0, max_{i ∈ [d]} ∑_{t=1}^T u_i(x_t, y_t) ).   (2)
Note first that this definition of “regret” is exactly T times the ℓ∞ approachability distance in the case where S is the negative orthant; that is,
Reg(x_{1:T}, y_{1:T}) = T · d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), ℝ^d_{≤0} ).   (3)
But secondly, note that by choosing X, Y, and u carefully, this definition can capture a wide variety of forms of regret studied in regret minimization. For example:
• When X = Δ_N, Y = [0, 1]^N, d = N, and u_i(x, y) = ⟨x, y⟩ − y_i for each i ∈ [N], Reg is the external regret of playing action sequence x_{1:T} against loss sequence y_{1:T}; i.e., it measures the regret compared to the best single action.
• When X = Δ_N, Y = [0, 1]^N, d = N^N, and u_π(x, y) = ⟨x, y⟩ − ∑_{i=1}^N x_i y_{π(i)} for each swap function π: [N] → [N], Reg is the swap regret of playing action sequence x_{1:T} against loss sequence y_{1:T}; i.e., it measures the regret compared to the best action sequence obtained by applying a fixed swap function π to the sequence x_{1:T} (both notions are illustrated in the short computational sketch after this list).
• When X is a convex polytope, Y is a bounded set of linear losses, d = |V| (where V is the set of vertices of X), and u_v(x, y) = ⟨x, y⟩ − ⟨v, y⟩ for each vertex v ∈ V, this captures the (external) regret from performing online linear optimization over the polytope X (see Section 2.2).
• Finally, to illustrate the power of this framework, we present an unusual swap regret minimization problem that we call “Procrustean swap regret minimization” (after the orthogonal Procrustes problem, see (Gower and Dijksterhuis, 2004)). Let X be the unit ball in n dimensions, let Y = X, and, for each orthogonal matrix555Technically, this leads to an infinite-dimensional u (since the group of orthogonal matrices is infinite), but one can instead take an arbitrarily fine discrete approximation of the set. Indeed, one of the advantages of the results we present is that they are largely independent of the dimension d of u. W, let u_W(x, y) = ⟨x, y⟩ − ⟨W x, y⟩.
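Both the external and swap regret benchmarks above can be evaluated directly from a play history; the following minimal sketch (ours, using standard formulas rather than anything specific to this paper) computes them for simplex actions and loss vectors in [0, 1]^N, exploiting the fact that the best swap function decomposes coordinate-wise.

```python
import numpy as np

def external_regret(xs, ys):
    """Regret of the action sequence xs (rows in the simplex) against
    losses ys, compared to the best fixed action in hindsight."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    learner_loss = np.sum(xs * ys)              # sum_t <x_t, y_t>
    best_fixed = np.sum(ys, axis=0).min()       # min_i sum_t y_{t,i}
    return learner_loss - best_fixed

def swap_regret(xs, ys):
    """Regret against the best fixed swap function pi: [N] -> [N].
    The best pi decomposes across actions, so no enumeration of the
    N^N swap functions is needed."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    # A[i, j] = sum_t x_{t,i} * y_{t,j}: loss attributed to action i
    # if every play of i were replaced by action j.
    A = xs.T @ ys
    learner_loss = np.trace(A)                  # pi = identity
    best_swapped = A.min(axis=1).sum()          # choose pi(i) independently
    return learner_loss - best_swapped

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 1000, 5
    ys = rng.random((T, N))
    xs = rng.dirichlet(np.ones(N), size=T)
    print("external regret:", external_regret(xs, ys))
    print("swap regret:    ", swap_regret(xs, ys))
    # Swap regret always dominates external regret (constant swaps are a special case).
    assert swap_regret(xs, ys) >= external_regret(xs, ys) - 1e-9
```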
When the negative orthant ℝ^d_{≤0} is separable with respect to u, which is true in all of the above examples, the theory of Blackwell approachability immediately guarantees the existence of a sublinear regret learning algorithm for the corresponding notion of regret. Specifically, define the regret of a learning algorithm A to be the worst-case regret over all possible loss sequences y_1, …, y_T; i.e.,
Reg(A, T) := max_{y_1, …, y_T ∈ Y} Reg(x_{1:T}, y_{1:T}),   (4)
where x_t = A_t(y_1, …, y_{t−1}). Then (1) implies that the same algorithm satisfies Reg(A, T) = o(T). Motivated by this application, we will restrict our attention for the remainder of this paper to the setting where S = ℝ^d_{≤0} and will assume (unless otherwise specified) that this set is separable with respect to the bilinear function u we consider.
In fact, the theory of ℓ∞-approachability is constructive and allows us to write down explicit algorithms along with explicit (and in many cases, near-optimal) regret bounds. However, before we introduce these algorithms, we will need to introduce the problem of online linear optimization.
2.2 Online linear optimization
Here we discuss algorithms for online linear optimization (OLO), a special case of online convex optimization where all the loss functions are linear functions of the learner’s action. These will form an important primitive of our algorithms for approachability.
Let be two bounded convex subsets of and consider the following learning problem. Every round (for rounds) our learning algorithm must choose an element as a function of . The adversary simultaneously chooses a loss “function” . This causes the learner to incur a loss of in round . The goal of the learner is to minimize their regret, defined as the difference between their total loss and the loss they would have incurred by playing the best fixed action in hindsight, i.e.,
There are many similarities between this problem and the approachability and regret minimization problems discussed in Section 2.1. For example, if we take and , then OLO is equivalent to the problem of external regret minimization. However, not all regret minimization problems can be written directly as an instance of OLO – for example, there is no clear way to write swap regret minimization as an OLO instance. Eventually we will demonstrate how to apply OLO as a subroutine to solve any regret minimization problem, but this will involve a reduction to Blackwell approachability and will require running OLO on different spaces than the action/loss sets and directly (which is why we distinguish the action/loss sets for OLO as and respectively).
There is an important subclass of algorithms for OLO known as Follow the Regularized Leader (FTRL) algorithms (Shalev-Shwartz, 2007; Abernethy et al., 2008). An FTRL algorithm is completely specified by a strongly convex function R (the regularizer), and plays the action
z_t = argmin_z ( ∑_{s=1}^{t−1} ⟨f_s, z⟩ + R(z) ),
where the minimum is taken over the OLO decision set and f_s denotes the loss vector observed in round s.
In words, this algorithm plays the action that minimizes the total loss on the rounds until the present (“following the leader”), subject to an additional regularization term. It is possible to characterize the worst-case regret of an FTRL algorithm in terms of properties of the regularizer and the sets and (see e.g. Theorem 15 of Hazan (2016)). For our purposes, we will only need the following two results for specific regularizers.
Lemma 2.1 (Quadratic regularizer, (Zinkevich, 2003)).
Let and . Let be an arbitrary element of , and let . Then, the FTRL algorithm with regularizer incurs worst-case regret at most .
Lemma 2.2 (Negative entropy regularizer, (Kivinen and Warmuth, 1995)).
Let , , and let (where we extend this to the boundary of by letting ). Then, the FTRL algorithm with regularizer incurs worst-case regret at most .
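As an illustration of the two regularizers in Lemmas 2.1 and 2.2, here is a minimal self-contained FTRL sketch; it is generic code under illustrative assumptions (a Euclidean-ball domain for the quadratic regularizer, the simplex for the entropy regularizer, and textbook step sizes), not an implementation taken from this paper.

```python
import numpy as np

def ftrl_quadratic(loss_vectors, eta, project):
    """FTRL with the quadratic regularizer (1/(2*eta)) * ||z||^2.
    The minimizer of <cumulative loss, z> + (1/(2*eta))||z||^2 over a convex
    set is the Euclidean projection of -eta * (cumulative loss) onto the set."""
    cum = np.zeros(len(loss_vectors[0]))
    plays = []
    for f in loss_vectors:
        plays.append(project(-eta * cum))
        cum += f
    return plays

def ftrl_entropy(loss_vectors, eta):
    """FTRL on the simplex with the negative entropy regularizer
    (1/eta) * sum_i z_i log z_i; the closed form is multiplicative weights."""
    cum = np.zeros(len(loss_vectors[0]))
    plays = []
    for f in loss_vectors:
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        plays.append(w / w.sum())
        cum += f
    return plays

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T, d = 2000, 10
    losses = list(rng.random((T, d)))
    ball_proj = lambda z: z / max(1.0, np.linalg.norm(z))   # unit-ball domain
    quad_plays = ftrl_quadratic(losses, eta=1.0 / np.sqrt(T), project=ball_proj)
    ent_plays = ftrl_entropy(losses, eta=np.sqrt(np.log(d) / T))
    cum = np.sum(losses, axis=0)
    for name, plays, best in [("quadratic / ball", quad_plays, -np.linalg.norm(cum)),
                              ("entropy / simplex", ent_plays, cum.min())]:
        total = sum(float(np.dot(f, z)) for f, z in zip(losses, plays))
        print(f"{name}: regret = {total - best:.2f}")
```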
2.3 Algorithms for ℓ∞-approachability
We can now write down an explicit description of our algorithm for ℓ∞-approachability (in terms of a blackbox OLO algorithm) and get quantitative bounds on the rate of convergence in the LHS of (1). Let O be an OLO algorithm for the appropriate decision and loss sets (essentially, OLO over the simplex Δ_d against the payoff vectors; see the footnote to Corollary 2.4). Then we can describe our algorithm for ℓ∞-approachability as Algorithm 1.
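Since Algorithm 1 itself is displayed as a figure, the following sketch shows one possible rendering of the OLO-based reduction it describes: an OLO iterate θ_t over the simplex, an action x_t chosen so that ⟨θ_t, u(x_t, y)⟩ ≤ 0 for every y (such an x_t exists by separability; here it is supplied by a caller-provided oracle), and the observed payoff fed back to the OLO subroutine. The multiplicative-weights subroutine, the step size, and the oracle interface are our assumptions, not a verbatim transcription.

```python
import numpy as np

def linf_approach(u, halfspace_best_response, loss_sequence, d, eta):
    """Sketch of the OLO-based l_infinity-approachability loop.

    u(x, y) -> payoff vector in R^d.
    halfspace_best_response(theta) -> an action x with <theta, u(x, y)> <= 0
        for every y (exists by separability); supplied by the caller.
    The OLO subroutine over the simplex is multiplicative weights."""
    cum = np.zeros(d)                          # cumulative payoff fed to the OLO
    payoffs = []
    for y in loss_sequence:
        w = np.exp(eta * (cum - cum.max()))    # maximize <theta, cumulative payoff>
        theta = w / w.sum()                    # OLO iterate in the simplex
        x = halfspace_best_response(theta)     # force <theta, u(x, .)> <= 0
        v = u(x, y)
        payoffs.append(v)
        cum += v
    avg = np.mean(payoffs, axis=0)
    return max(0.0, avg.max())                 # l_inf distance to negative orthant

if __name__ == "__main__":
    # External regret instance: u_i(x, y) = <x, y> - y_i, x in the simplex.
    N = 4
    u = lambda x, y: np.dot(x, y) - y
    # For theta in the simplex, playing x = theta gives
    # <theta, u(theta, y)> = <theta, y> - <theta, y> = 0 for every y.
    best_response = lambda theta: theta
    rng = np.random.default_rng(2)
    T = 5000
    losses = rng.random((T, N))
    dist = linf_approach(u, best_response, losses, d=N, eta=np.sqrt(np.log(N) / T))
    print("distance of average payoff to the negative orthant:", dist)
```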
It turns out that we can relate the approachability distance (and hence the regret of this algorithm) to the regret of our OLO algorithm O.
Theorem 2.3.
We have that
If we let be the negative entropy FTRL algorithm (Lemma 2.2)666There is a slight technical difference between the sets and in Algorithm 1 and the sets and in Lemma 2.2. However, note that if we map to and to , we preserve the inner product of and up to an additive constant, which disappears when computing regret, and a factor of ., we obtain the following regret guarantee for .
Corollary 2.4.
For any bilinear regret function u, there exists a regret minimization algorithm with worst-case regret O(√(T log d)).
Equivalently, Corollary 2.4 can be interpreted as saying that there exists an ℓ∞-approachability algorithm which approaches the negative orthant at an average rate of O(√(log(d)/T)). In general, (3) lets us straightforwardly convert between results for ℓ∞-approachability and results for regret minimization. Throughout the remainder of the paper we will primarily phrase our results in terms of regret, but switch between the two quantities of interest when convenient.
3 Main Results
Corollary 2.4 already leads to a number of impressive consequences. For example, when applied to the problem of swap-regret minimization (where u has dimension d = N^N), it leads to a learning algorithm with O(√(T N log N)) regret, matching the best known regret bounds for this problem (Blum and Mansour, 2007). However, the algorithm we obtain in this way has two unfortunate properties.
First, since u is d-dimensional, implementing the algorithm as written above requires poly(d) time and space (even storing a single payoff vector u(x_t, y_t) or dual vector θ_t requires Ω(d) space). This is fine if d is small, but in many of our applications d is much (e.g., exponentially) larger than the dimensions n and m of the loss and action sets. For example, for swap regret we have n = m = N but d = N^N. Although Corollary 2.4 gives us an optimal swap regret algorithm, it takes exponential time / space to implement (in contrast to other known swap regret algorithms, such as that of Blum and Mansour (2007)).
Secondly, although Corollary 2.4 has only a logarithmic dependence on d, sometimes even this may be too large (for example, when we want to compete against an uncountable set of benchmarks). In such cases, we would ideally like a regret bound that depends on the action and loss sets but not directly on d.
In the following subsections, we will demonstrate a framework for regret minimization that allows us to achieve both of these goals (under some fairly light computational assumptions on u).
3.1 Approachability for pseudonorms
In Section 2.3, we described the theory of Blackwell approachability for the distance defined by the ℓ∞ norm. We begin here by describing a generalization of this approachability theory to functions we refer to as pseudonorms and pseudodistances. A function ψ: ℝ^K → ℝ is a pseudonorm if ψ(x) ≥ 0 for all x, ψ is positively homogeneous (ψ(λx) = λψ(x) for all λ ≥ 0 and all x), and ψ satisfies the triangle inequality (ψ(x + y) ≤ ψ(x) + ψ(y) for all x, y). Note that unlike norms, pseudonorms may not satisfy definiteness and are not necessarily symmetric; it may be the case that ψ(x) ≠ ψ(−x). However, all norms are pseudonorms. Note also that, by positive homogeneity and the triangle inequality, a pseudonorm is a convex function. A pseudonorm defines a pseudodistance function via d_ψ(x, S) = inf_{s ∈ S} ψ(x − s).
In order to effectively work with pseudodistances, it will be useful to define the dual set associated to ψ as follows: B_ψ := {θ ∈ ℝ^K : ⟨θ, x⟩ ≤ ψ(x) for all x}. This coincides with the traditional notion of duality in convex analysis; for example, when ψ is a norm, B_ψ coincides with the dual ball of radius one (e.g., when ψ is the d-dimensional ℓ∞ norm, B_ψ is the d-dimensional ℓ1-ball). The following theorem relates the pseudodistance between a point and a convex set to a convex optimization problem over the dual set.
Theorem 3.1.
For any closed convex set , the following equality holds for any :
Proof.
We adopt the standard definition and notation in optimization for an indicator function of a set : for any , if is in , otherwise. Define by for all and set .
By definition, the conjugate function of is defined by: . Now, if is in , then we have . Thus, since , the supremum in the definition of is achieved for and . Otherwise, if , there exists such that . For that , for any , by the positive homogeneity of , we have . Taking the limit , this shows that . Thus, we have .
By definition, the conjugate function is defined for all by , which can also be derived from the conjugate function calculus in Table B.1 of (Mohri et al., 2018).
It is also known that the conjugate function of the indicator function is defined by (Boyd and Vandenberghe, 2014). Since and , we have . Thus, for any convex and bounded set in , by Fenchel duality (Theorem B.1, Appendix B), we can write:
[a chain of equalities justified, in order, by the definition of the conjugate function, the Fenchel duality theorem, the definition of the conjugate of the indicator function, and a change of variable]
This completes the proof. ∎
We will primarily be concerned with the case where S is a convex cone. A convex cone is a set C such that if x ∈ C, then λx ∈ C for all λ ≥ 0. In this case, Theorem 3.1 can be simplified to write d_ψ(x, S) as a linear optimization problem over the intersection of the dual set and the polar cone of S.
Corollary 3.2.
Let S be a convex cone, and let S° = {θ : ⟨θ, s⟩ ≤ 0 for all s ∈ S} be the polar cone of S. Fix a K-dimensional pseudonorm ψ and let B_ψ be its dual set. Then for any x, d_ψ(x, S) = max_{θ ∈ B_ψ ∩ S°} ⟨θ, x⟩.
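As a sanity check of Corollary 3.2 in the familiar ℓ∞ case (where the dual set is the ℓ1 ball and the polar cone of the negative orthant is the positive orthant), the following small sketch compares the direct distance computation with the dual linear program; this is a numerical illustration of standard facts, not part of the proof.

```python
import numpy as np
from scipy.optimize import linprog

def linf_dist_to_negative_orthant(x):
    """Direct computation: d_inf(x, R^d_{<=0}) = max(0, max_i x_i)."""
    return max(0.0, float(np.max(x)))

def dual_formulation(x):
    """Corollary 3.2 with psi = l_inf and S = negative orthant:
    maximize <theta, x> over theta >= 0 with sum(theta) <= 1
    (the l_1 unit ball intersected with the positive orthant)."""
    d = len(x)
    res = linprog(c=-np.asarray(x),                  # linprog minimizes
                  A_ub=np.ones((1, d)), b_ub=[1.0],
                  bounds=[(0, None)] * d, method="highs")
    return -res.fun

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    for _ in range(5):
        x = rng.normal(size=6)
        assert abs(linf_dist_to_negative_orthant(x) - dual_formulation(x)) < 1e-7
    print("primal distance and dual linear program agree")
```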
3.2 From high-dimensional ℓ∞-approachability to low-dimensional pseudonorm approachability
In Section 2.3, we expressed the regret for a general regret minimization problem as the distance
Reg(x_{1:T}, y_{1:T}) = T · d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), ℝ^d_{≤0} ).   (5)
In this section, we will demonstrate how to rewrite this in terms of a lower-dimensional pseudodistance.
Consider the bilinear “basis map” Φ: X × Y → ℝ^{nm} given by Φ(x, y)_{jk} = x_j y_k. Note that every bilinear function u_i: X × Y → ℝ can be written in the form u_i(x, y) = ⟨w_i, Φ(x, y)⟩ for some vector w_i ∈ ℝ^{nm} (i.e., the monomials x_j y_k in Φ form a basis for the set of biaffine functions on X × Y)777If it is helpful, one can think of Φ as the natural map from X × Y to the tensor product ℝ^n ⊗ ℝ^m. The vectors w_i can then be thought of as elements of the dual space, each representing a linear functional on ℝ^n ⊗ ℝ^m.
Let w_i ∈ ℝ^{nm} be the vector of coefficients corresponding to the i-th component u_i of u, and consider the function ψ(s) = max_{i ∈ [d]} max(⟨w_i, s⟩, 0). Note that ψ is a pseudonorm on ℝ^{nm}; indeed, it is straightforward to see that ψ is non-negative, positively homogeneous, and satisfies the triangle inequality. In addition, let S′ be the convex cone defined by S′ := {s ∈ ℝ^{nm} : ⟨w_i, s⟩ ≤ 0 for all i ∈ [d]}. We claim that we can rewrite the distance in (5) in the following way.
Theorem 3.3.
We have that
d_∞( (1/T) ∑_{t=1}^T u(x_t, y_t), ℝ^d_{≤0} ) = d_ψ( (1/T) ∑_{t=1}^T Φ(x_t, y_t), S′ ).   (6)
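To make the basis map and the coefficient vectors w_i concrete, the following sketch instantiates them for the external regret example and checks both the identity u_i(x, y) = ⟨w_i, Φ(x, y)⟩ and the distance equality of Theorem 3.3 numerically; the explicit form of ψ used in the final check (a clamped maximum over the w_i) is our reading of the construction and should be treated as an assumption.

```python
import numpy as np

N = 3                                   # actions; external regret example

def Phi(x, y):
    """Basis map: the outer product x y^T, flattened to an N*N vector."""
    return np.outer(x, y).ravel()

# Coefficient vectors with u_i(x, y) = <x, y> - y_i = <w_i, Phi(x, y)>
# (using sum_j x_j = 1 on the simplex): (w_i)_{jk} = [k == j] - [k == i].
W = np.array([(np.eye(N) - np.outer(np.ones(N), np.eye(N)[i])).ravel()
              for i in range(N)])

def u(x, y):
    return np.dot(x, y) - y             # external regret payoff vector

rng = np.random.default_rng(4)
for _ in range(5):
    x, y = rng.dirichlet(np.ones(N)), rng.random(N)
    assert np.allclose(u(x, y), W @ Phi(x, y))

# Distance check (sketch): the l_inf distance of the average payoff to the
# negative orthant versus max_i max(<w_i, avg Phi>, 0), our assumed form of psi.
T = 200
xs = rng.dirichlet(np.ones(N), size=T)
ys = rng.random((T, N))
avg_u = np.mean([u(x, y) for x, y in zip(xs, ys)], axis=0)
avg_phi = np.mean([Phi(x, y) for x, y in zip(xs, ys)], axis=0)
lhs = max(0.0, avg_u.max())
rhs = max(0.0, (W @ avg_phi).max())
assert abs(lhs - rhs) < 1e-12
print("l_inf distance:", lhs, "pseudodistance:", rhs)
```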
The dual set associated to this pseudonorm is given by Θ := {θ ∈ ℝ^{nm} : ⟨θ, s⟩ ≤ ψ(s) for all s ∈ ℝ^{nm}}.
The following two properties of this dual set will be useful in the sections that follow. First, we show that we can alternately think of Θ as the convex hull of the w_i.
Lemma 3.4.
The dual set Θ coincides with the convex hull of the w_i’s.
Secondly, we show that the dual set is contained within the polar cone of .
Lemma 3.5.
We have that Θ ⊆ (S′)°, where (S′)° := {θ : ⟨θ, s⟩ ≤ 0 for all s ∈ S′} is the polar cone of S′.
Proof.
By Lemma 3.4, it suffices to show that each w_i ∈ (S′)°, i.e., for each i, that ⟨w_i, s⟩ ≤ 0 for all s ∈ S′. However, this immediately follows from the definition of S′, since each s ∈ S′ satisfies ⟨w_i, s⟩ ≤ 0 for all i. (An equivalent way of thinking about this is that S′ is already the polar cone of the cone generated by the w_i, so the w_i must lie in the polar cone to S′.) ∎
This allows us to simplify Corollary 3.2 even further.
Corollary 3.6.
For this convex cone S′ and pseudonorm ψ, we have that for any s ∈ ℝ^{nm}, d_ψ(s, S′) = max_{θ ∈ Θ} ⟨θ, s⟩.
Finally, we prove the following “separability” conditions for this approachability problem.
Lemma 3.7.
The following two statements are true.
1. For any y ∈ Y, there exists an x ∈ X such that Φ(x, y) ∈ S′.
2. For any θ ∈ Θ, there exists an x ∈ X such that ⟨θ, Φ(x, y)⟩ ≤ 0 for all y ∈ Y.
3.3 Algorithms for pseudonorm approachability
We now present an algorithm for pseudonorm approachability in the setting of Section 3.2 (i.e., for the bilinear function Φ and the convex cone S′). Just as the algorithm in Section 2.3 for ℓ∞-approachability required an OLO algorithm for the simplex, this algorithm will assume we have access to an OLO algorithm O for the decision set Θ and the loss set given by the convex hull of the points Φ(x, y) for x ∈ X and y ∈ Y.
Our algorithm is summarized above in Algorithm 2. We now have the following analogue of Theorem 2.3.
Theorem 3.8.
The following guarantee holds for the pseudonorm approachability algorithm :
Proof.
The first equality follows as a direct consequence of Theorem 3.3 (which proves that the distance to is equal to the analogous distance to ) and Theorem 2.3 (which shows that this distance is equal to the regret of our regret minimization problem). It therefore suffices to prove the second equality.
Note that
Here, the first equality holds as a consequence of Corollary 3.6, and the last inequality holds since (by the choice of x_t in step 2a) ⟨θ_t, Φ(x_t, y_t)⟩ ≤ 0 for all t. ∎
If we choose to be FTRL with a quadratic regularizer, Lemma 2.1 implies the following result.
Theorem 3.9.
Let be a bilinear regret function. Then there exists a regret minimization algorithm for with regret
where . If we let
then this regret bound further satisfies
Proof.
To see the first result, note that Lemma 2.1 directly implies a bound of , where and . We’ll now proceed by simplifying and . First, for , recall from Lemma 3.4 that is the convex hull of the vectors , so . Second, for , note that . But , so . Combining these, we obtain the first inequality.
The second result directly follows from the following three facts: i. (since , ii. (since ), and iii. . ∎
Theorem 3.9 shows that, in settings where these norm bounds are constant (which is true in all the settings we consider), there exists an algorithm whose regret is O(√T); notably, this bound does not depend on the dimension d, which in many cases can be thought of as the number of benchmarks of comparison for our learning algorithm.
In the following two sections, we will show how to strengthen this result in two different ways. First, we will show that (modulo some fairly mild computational assumptions) it is possible to efficiently implement the algorithm of Theorem 3.9. Second, we will show that by using a different choice of regularizer, we can recover exactly the regret obtained in Corollary 2.4.
3.4 Efficient algorithms for pseudonorm approachability
3.4.1 Computational assumptions
In this section we will discuss how to transform the algorithm in Theorem 3.9 into a computationally efficient algorithm. Note that without any constraints on , , and or how they are specified, performing any sort of regret minimization efficiently is a hopeless task (e.g., consider the case where it is computationally hard to even determine membership in ). We’ll therefore make the following three structural / computational assumptions on , , and .
First, we will restrict our attention to cases where the loss set Y is “orthant-generating”. A convex subset Y of ℝ^m is orthant-generating if Y is contained in the non-negative orthant and, for each i ∈ [m], there exists a λ_i > 0 such that λ_i e_i ∈ Y (as part of this assumption, we will also assume we have access to the values λ_i). Note that many common choices for Y (e.g., the hypercube, the simplex, intersections of other norm balls with the positive orthant) are all orthant-generating.
Second, we will assume we have an efficient separation oracle for the action set X; that is, an oracle which takes a point x and outputs (in polynomial time) either that x ∈ X or a hyperplane separating x from X.
Finally, we will assume we have access to what we call an efficient regret oracle for u. Given a collection of action/loss pairs (x_1, y_1), …, (x_k, y_k) and positive constants c_1, …, c_k, an efficient regret oracle can compute (in time polynomial in n, m, and k) the value of max_{i ∈ [d]} ∑_{j=1}^k c_j u_i(x_j, y_j). This can be thought of as evaluating the regret for a pair of action and loss sequences that take on the action/loss pair (x_j, y_j) a c_j fraction of the time. At a higher level, having access to an efficient regret oracle means that a learner can efficiently compute their overall regret at the end of T rounds (it is hard to imagine how one could efficiently minimize regret without being able to compute it).
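For concreteness, here is what such a regret oracle looks like for swap regret: a short sketch (ours) that evaluates the weighted swap regret of a collection of action/loss pairs in time polynomial in N and the number of pairs, even though the maximum ranges over N^N swap functions.

```python
import numpy as np

def swap_regret_oracle(pairs, weights):
    """Sketch of an 'efficient regret oracle' for swap regret: given
    action/loss pairs (x_j, y_j) and nonnegative weights c_j, return
        max over swap functions pi of  sum_j c_j * u_pi(x_j, y_j),
    where u_pi(x, y) = <x, y> - sum_i x_i y_{pi(i)}.
    The maximum decomposes across rows of the weighted co-occurrence
    matrix, so the cost is O(k * N^2) rather than O(N^N)."""
    A = sum(c * np.outer(x, y) for (x, y), c in zip(pairs, weights))
    return float(np.trace(A) - A.min(axis=1).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    N, k = 6, 10
    pairs = [(rng.dirichlet(np.ones(N)), rng.random(N)) for _ in range(k)]
    weights = rng.random(k)
    print("weighted swap regret:", swap_regret_oracle(pairs, weights))
```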
3.4.2 Extending the dual set
One of the ingredients we will need to implement algorithm is a membership oracle for the dual set (e.g. in order to perform OLO over this dual set). To check whether , it suffices to check whether for all , so in turn, it will be useful to be able to compute the function for any .
Computing this maximum is very similar888In fact, in almost all practical cases where we have an efficient regret oracle, it is possible by a similar computation to directly compute this maximum. Here we describe an approach that works in a blackbox way given only a strict regret oracle – if you accept the existence of an oracle that can optimize linear functions over the w_i, you can skip this subsection. to what is provided by our regret oracle: note that, by writing a point s in the form ∑_j c_j Φ(x_j, y_j), we can think of the regret oracle as providing the value of:
If it were possible to write any in the form , we would be done. However, this may not be possible: in particular, must lie in the convex cone generated by all points of the form .
We will therefore briefly generalize the theory of Section 3.1 to cases where we are only able to optimize over an extension of the dual set. Given a convex cone , let . Note that (since in , must hold for all in the ambient space). The following lemma shows that if is in , to maximize a linear function over it suffices to maximize it over .
Lemma 3.10.
Let be an arbitrary convex cone. Then if , the following equalities hold:
Now, consider the specific convex cone . Given Lemma 3.10, it is straightforward to check that Theorem 3.8 continues to hold even if the domain of the OLO algorithm is set to instead of . In particular, the first equality in the proof of Theorem 3.8 is still true when is replaced by , since .
There is one other issue we must deal with: it is possible that is significantly larger than , and therefore an OLO algorithm with domain might incur more regret than that of . In fact, there are cases where the set is unbounded. Nonetheless, the following lemma shows that for OLO with a quadratic regularizer, we will never encounter very large values of .
Lemma 3.11.
Let , and fix a and . Let
Then .
3.4.3 Constructing a membership oracle
We will now demonstrate how to use our regret oracle to construct a membership oracle for the expanded set defined in the previous section. We first show that it is possible to check for membership in (and further, when , write as a convex combination of the generators of ).
Lemma 3.12.
Given a point , we can efficiently check whether . If , we can also efficiently write in the form (for an explicit choice of , , and ).
Note that expressing in the form allows us to directly apply our efficient regret oracle (with ). We therefore gain the following optimization oracle as a corollary.
Corollary 3.13.
Given an efficient regret oracle, for any we can efficiently (in time ) compute .
Finally, we will show that given these two results, we can efficiently construct a membership oracle for our set . To do this, we will need the following fact (loosely stated; see Lemma B.2 in Appendix B for an accurate statement): it is possible to minimize a convex function over a convex set as long as one has an evaluation oracle for the function and a membership oracle for the convex set.
Lemma 3.14.
Given an efficient regret oracle for , we can construct an efficient membership oracle for the set .
3.4.4 Implementing regret minimization
Equipped with this membership oracle, we can now state and prove our main theorem.
Theorem 3.15.
Assume that:
1. The convex set Y is orthant-generating.
2. We have an efficient separation oracle for the convex set X.
3. We have an efficient regret oracle for the regret function defined by u.
Then it is possible to implement the algorithm of Theorem 3.9 in polynomial time per round.
Proof.
There are two steps of unclear computational complexity in the description of the algorithm in Section 3.3: step 2a, where we find a such that for all , and step 2d, where we have to run our OLO subalgorithm over (which we specifically instantiate as FTRL with a quadratic regularizer).
We begin by describing how to perform step 2a efficiently. Fix a θ ∈ Θ. Note that since Y is orthant-generating, to check whether ⟨θ, Φ(x, y)⟩ ≤ 0 for all y ∈ Y, it suffices to check whether ⟨θ, Φ(x, λ_i e_i)⟩ ≤ 0 for each scaled unit vector λ_i e_i. Therefore, to find a valid x we must find a point of X that satisfies m additional explicit linear constraints. Since we have an efficient separation oracle for X, this is possible in polynomial time.
To implement the FTRL algorithm over the set with a quadratic regularizer, each round we must find the minimizer of the convex function over the convex set . To do this, it suffices to exhibit an evaluation oracle for and a membership oracle for . To evaluate , note that we simply need to be able to compute (which is a Euclidean distance in ) and which we can do in time by keeping track of the cumulative sum and computing a single inner product in . A membership oracle for is provided by Lemma 3.14. ∎
3.5 Recovering regret via maxent regularizers
One interesting aspect of this reduction to pseudonorm approachability is that, in some cases, the regret bound achievable via (inefficient) ℓ∞-approachability (Corollary 2.4) outperforms the regret bound achieved by pseudonorm approachability (Theorem 3.9); this is the case, for example, for swap regret. Of course, this comparison is not completely fair: in both cases there is flexibility in specifying the underlying OLO algorithm, and the ℓ∞ bound uses an FTRL algorithm with a negative entropy regularizer, whereas our pseudonorm approachability bound uses an FTRL algorithm with a quadratic regularizer. After all, there are well-known cases (e.g. for OLO over a simplex, as in ℓ∞-approachability) where the negative entropy regularizer leads to exponentially better (in the dimension) regret bounds than the quadratic regularizer.
In this subsection, we will show that there exists a different regularizer for pseudonorm approachability – one we call a maxent regularizer – which recovers the regret bound of Corollary 2.4. In doing so, we will also better understand the parallels between the regret minimization algorithm that works via reduction to ℓ∞-approachability in a d-dimensional space and the one that works via reduction to pseudonorm approachability in an (nm)-dimensional space.
Let M be the d-by-(nm) matrix whose i-th row equals w_i. Note that M allows us to translate between analogous concepts/quantities for the two reductions in the following way.
Lemma 3.16.
The following statements are true:
• For any x ∈ X and y ∈ Y, u(x, y) = M Φ(x, y).
• The dual set satisfies Θ = {M^⊤ q : q ∈ Δ̄_d} (i.e., θ ∈ Θ iff there exists a q ∈ Δ̄_d such that θ = M^⊤ q).
• If θ = M^⊤ q for some q, then for any x ∈ X and y ∈ Y, ⟨θ, Φ(x, y)⟩ = ⟨q, u(x, y)⟩.
• Fix a q and let θ = M^⊤ q. If x ∈ X satisfies ⟨q, u(x, y)⟩ ≤ 0 for all y ∈ Y, then x also satisfies ⟨θ, Φ(x, y)⟩ ≤ 0 for all y ∈ Y.
Proof.
The first statement follows from the fact that . The second claim follows as a consequence of Lemma 3.4 (that is the convex hull of the vectors ). The third claim follows from the first claim: . Finally, the fourth claim follows directly from the third claim. ∎
Now, fix a regret minimization problem (specified by , , and ) and consider the execution of algorithm on some specific loss sequence . Each round , runs the entropy-regularized FTRL algorithm to generate a (as a function of the actions and losses up until round ) and then uses to select a that satisfies for all . If we execute algorithm for the same regret minimization problem on the same loss sequence, each round , runs some (to be determined) FTRL algorithm to generate a , and then uses to select a that satisfies for all . Lemma 3.16 shows that if for each , then both algorithms will generate exactly the same sequence of actions in response to this loss sequence999Technically, there may be some leeway in terms of which to choose that e.g. satisfies for all , and the two different procedures could result in different choices of . But if we break ties consistently (e.g. add the additional constraint of choosing the that maximizes the inner product of with some generic vector), then both procedures will produce the same value of ., and hence the same regret.
The question then becomes: how do we design an OLO algorithm that outputs each round that would output ? Recall that if is an FTRL algorithm with regularizer , then
Define the regularizer R̃ on Θ via
R̃(θ) := min { R(q) : q ∈ Δ_d, M^⊤ q = θ }.   (7)
We claim that if we let be the FTRL algorithm that uses regularizer , then will output our desired sequence of .
Lemma 3.17.
The following equality holds for the output :
Proof.
Note that we can write
The third equality follows from the fact that if , then (by Lemma 3.16). ∎
Corollary 3.18.
Let be an FTRL algorithm over with regularizer , and let be an FTRL algorithm over with regularizer . Then .
We now consider the specific case where R is the negentropy function; i.e., for q ∈ Δ_d, R(q) = ∑_{i=1}^d q_i log q_i. For this choice of R, R̃ becomes
R̃(θ) = min { ∑_{i=1}^d q_i log q_i : q ∈ Δ_d, M^⊤ q = θ }.   (8)
In other words, −R̃(θ) is the maximum entropy of any distribution q ∈ Δ_d that satisfies the linear constraints M^⊤ q = θ imposed by θ. This is exactly an instance of the Maxent problem studied in (Berger et al., 1996; Rosenfeld, 1996; Pietra et al., 1997; Dudík et al., 2007) and (Mohri et al., 2018)[chapter 12].
It is known (Pietra et al., 1997; Dudík et al., 2007; Mohri et al., 2018) that the entropy maximizing distribution is a Gibbs distribution. In particular, the q maximizing the expression in (8) satisfies (for some real constants λ ∈ ℝ^{nm})
q_i = exp(⟨λ, w_i⟩) / Z(λ),   (9)
where Z(λ) (the “partition function”) is defined via
Z(λ) = ∑_{i=1}^d exp(⟨λ, w_i⟩).   (10)
Generally, there is exactly one choice of λ which results in M^⊤ q = θ (since there are nm free variables and nm linear constraints). For this optimal λ, it is known that the maximum entropy is given by log Z(λ) − ⟨λ, θ⟩. If it is possible to solve this system for λ and to evaluate Z(λ) efficiently, we can then evaluate R̃(θ) efficiently. In Section 4.1 we will see how to do this for the specific case of swap regret (where we are helped by the fact that (9) guarantees that q is a product distribution).
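The Gibbs characterization also suggests a generic numerical route to the maxent regularizer when no closed form is available: minimize the convex dual log Z(λ) − ⟨λ, θ⟩ over λ. The sketch below (standard maxent code, not the paper's) does this for a small random instance, under the assumption that θ lies in the relative interior of the convex hull of the w_i so that strong duality holds.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def maxent_value(W, theta):
    """Maximum entropy of a distribution q over [d] subject to W.T @ q = theta,
    computed through the dual: minimize log Z(lam) - <lam, theta>, with the
    optimal q given by the Gibbs distribution q_i ~ exp(<lam, w_i>)."""
    def dual(lam):
        return logsumexp(W @ lam) - lam @ theta
    res = minimize(dual, x0=np.zeros(W.shape[1]), method="BFGS")
    lam = res.x
    q = np.exp(W @ lam - logsumexp(W @ lam))         # Gibbs distribution
    return res.fun, q

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    d, K = 8, 3
    W = rng.normal(size=(d, K))                      # rows play the role of w_i
    q0 = rng.dirichlet(np.ones(d))                   # some feasible distribution
    theta = W.T @ q0
    H_max, q_star = maxent_value(W, theta)
    H0 = -np.sum(q0 * np.log(q0))
    print(f"entropy of q0: {H0:.4f}, maximum entropy: {H_max:.4f}")
    assert H_max >= H0 - 1e-6                        # maxent dominates any feasible q
    assert np.allclose(W.T @ q_star, theta, atol=1e-4)  # constraints (approx.) satisfied
```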
Regardless of how we compute the maxent regularizer , efficient computation leads to an efficient regret minimization algorithm.
Corollary 3.19.
If there exists an efficient ( time) algorithm for computing the maxent regularizer , then there exists a regret minimization algorithm for with regret and that can be implemented in time per round.
4 Applications
4.1 Swap regret
Recall that in the setting of swap regret, the action set is X = Δ_N (distributions over N actions), the loss set is Y = [0, 1]^N, and the swap regret of playing a sequence of actions x_1, …, x_T on a sequence of losses y_1, …, y_T is given by
SwapReg(x_{1:T}, y_{1:T}) = max_{π: [N] → [N]} ∑_{t=1}^T ( ⟨x_t, y_t⟩ − ∑_{i=1}^N x_{t,i} y_{t,π(i)} ).
In words, swap regret compares the total loss achieved by this sequence of actions with the loss achieved by any transformed action sequence formed by applying an arbitrary “swap function” π: [N] → [N] to the actions (i.e., always playing action π(i) instead of action i). Swap regret minimization can be directly written as an ℓ∞-approachability problem for the bilinear function u with components
u_π(x, y) = ⟨x, y⟩ − ∑_{i=1}^N x_i y_{π(i)}
(here we index u by the swap functions π: [N] → [N]). Note that the negative orthant is indeed separable with respect to u since, if we let x = e_{i*} for i* ∈ argmin_i y_i, then u_π(x, y) = y_{i*} − y_{π(i*)} ≤ 0 for any swap function π.
We can now apply the theory developed in Section 3. First, since and the maximum absolute value of any coefficient in is , Theorem 3.9 immediately results in a swap regret algorithm with regret . Moreover, since we can write
we can compute efficiently. By Theorem 3.15, we can therefore implement this -regret algorithm efficiently (in time per round). We can improve upon this regret bound by noting that , , and (the coefficient vector corresponding to contains at most coefficients that are ). It follows that , and it follows from the first part of Theorem 3.9 that the regret of the aforementioned algorithm is actually only .
This is still a factor of approximately √(N / log N) larger than the optimal bound. To achieve the optimal regret bound, we will show how to compute the maxent regularizer for swap regret (and hence can efficiently implement an algorithm via Corollary 3.19). Recall that the maxent regularizer evaluated at θ is the negative of the maximum entropy of a distribution q over swap functions that satisfies M^⊤ q = θ. For our problem, θ is N²-dimensional (we view it as an N-by-N matrix), and this imposes the following linear constraints on q (which we view as a distribution over swap functions π: [N] → [N]):
Pr_{π ∼ q}[π(i) = j] = 𝟙[i = j] − θ_{ij} for all i, j ∈ [N].
Now, by the characterization presented in (9)101010In particular, in the case of swap regret we have that . Since the first term () does not depend on , we can ignore its contribution to (it cancels from the numerator and denominator)., we know the entropy maximizing satisfies (for some constants )
In particular, this shows that the entropy maximizing distribution is a product distribution over the set of swap functions, where for each i ∈ [N] the value of π(i) is chosen independently. Moreover, from θ we can recover the overall marginal probability Q_{ij} := Pr_{π ∼ q}[π(i) = j] (it is 1 − θ_{ii} if i = j and −θ_{ij} if i ≠ j). The entropy of this product distribution can therefore be written as:
H(q) = − ∑_{i=1}^N ∑_{j=1}^N Q_{ij} log Q_{ij}.
Our regularizer is simply R̃(θ) = ∑_{i,j} Q_{ij} log Q_{ij} and can clearly be efficiently computed as a function of θ. It follows from Corollary 3.19 that there exists an efficient (poly(N) time per round) regret minimization algorithm for swap regret that incurs O(√(T N log N)) worst-case regret.
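The closed form above is easy to implement; the following sketch does so, with the caveat that the identification of the marginal matrix Q with I − θ (and hence the sign convention for the w_π) is our reconstruction of the elided formulas and is labeled as an assumption in the code.

```python
import numpy as np

def swap_maxent_regularizer(theta):
    """Sketch of the closed-form maxent regularizer for swap regret.
    Assumption: theta is an N-by-N matrix in the dual set with marginals
    Q[i, j] = Pr[pi(i) = j] recovered as Q = I - theta; the regularizer is
    the negative entropy of the induced product distribution over swaps."""
    Q = np.eye(theta.shape[0]) - theta
    Q = np.clip(Q, 1e-12, 1.0)                    # guard the log at the boundary
    return float(np.sum(Q * np.log(Q)))

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    N = 4
    # A point of the dual set: theta = I - Q for a random row-stochastic Q,
    # i.e. a convex combination of the vectors w_pi.
    Q = rng.dirichlet(np.ones(N), size=N)
    theta = np.eye(N) - Q
    R = swap_maxent_regularizer(theta)
    print("regularizer value:", R)
    # Sanity check: minus the sum of the row entropies of Q, bounded below
    # by -N log N (attained at uniform marginals) and above by 0.
    assert -N * np.log(N) - 1e-9 <= R <= 0.0
```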
Finally, we briefly remark that the pseudonorm we construct here is closely related to the group norm defined over N-by-N square matrices as the ℓ1 norm of the vector formed by the ℓ∞ norms of the rows. This is not unique to swap regret; in many of our applications, the relevant pseudonorm can be thought of as a composition of multiple smaller norms (often ℓ1 or ℓ∞ norms).
4.2 Procrustean swap regret
To illustrate the power of Theorem 3.15, we present a toy variant of swap regret where the learner must compete against an infinite set of swap functions (in particular, all orthogonal linear transformations of their sequence of actions) and yet can do this efficiently while incurring low (polynomial in the dimension of their action set) regret.
In this problem, the action set X is the set of vectors with ℓ2 norm at most 1 (the unit ball in ℝ^n) and the loss set is Y = X. The learner would like to minimize the following notion of regret:
Reg_O(x_{1:T}, y_{1:T}) = max_{W ∈ O(n)} ∑_{t=1}^T ( ⟨x_t, y_t⟩ − ⟨W x_t, y_t⟩ ).   (11)
Here, O(n) is the set of all orthogonal n-by-n matrices. We call this notion of regret Procrustean swap regret due to its similarity with the orthogonal Procrustes problem from linear algebra, which (loosely) asks for the orthogonal matrix which most closely maps one sequence of points onto another sequence of points (in our setting, we intuitively want to map the x_t onto the y_t to minimize the loss of our benchmark). See Gower and Dijksterhuis (2004) for a more detailed discussion of the Procrustes problem. Regardless, note that we can compute Reg_O efficiently, since we have an efficient membership oracle for the convex hull of the set of orthogonal matrices (specifically, an n-by-n matrix belongs to this convex hull iff all its singular values are at most 1).
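Both computations mentioned here reduce to singular values: membership in the convex hull of the orthogonal matrices is a largest-singular-value test, and the Procrustean benchmark is (minus) a nuclear norm, as in the orthogonal Procrustes problem. The sketch below uses the regret formula as reconstructed in (11) and standard linear algebra; it is illustrative rather than the paper's implementation.

```python
import numpy as np

def in_orthogonal_hull(A, tol=1e-9):
    """Membership oracle for the convex hull of the n-by-n orthogonal matrices:
    A belongs to the hull iff its largest singular value is at most 1."""
    return np.linalg.svd(A, compute_uv=False).max() <= 1.0 + tol

def procrustean_regret(xs, ys):
    """Procrustean swap regret of unit-ball actions xs against losses ys.
    The benchmark  min over orthogonal W of sum_t <W x_t, y_t>  equals minus
    the nuclear norm of sum_t x_t y_t^T, so no enumeration of the infinite
    benchmark set is needed."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    learner_loss = np.sum(xs * ys)
    C = xs.T @ ys                                   # sum_t x_t y_t^T
    nuclear = np.linalg.svd(C, compute_uv=False).sum()
    return learner_loss + nuclear                   # loss - (-nuclear norm)

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    n, T = 5, 200
    xs = rng.normal(size=(T, n))
    xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))
    ys = rng.normal(size=(T, n))
    ys /= np.maximum(1.0, np.linalg.norm(ys, axis=1, keepdims=True))
    print("in hull (identity):  ", in_orthogonal_hull(np.eye(n)))
    print("in hull (2*identity):", in_orthogonal_hull(2 * np.eye(n)))
    print("Procrustean swap regret:", procrustean_regret(xs, ys))
```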
In our approachability framework, we can capture this notion of regret with the bilinear function u with coordinates indexed by orthogonal matrices W, given by u_W(x, y) = ⟨x, y⟩ − ⟨W x, y⟩ (which is separable with respect to the negative orthant, since for x = 0 we have u_W(0, y) = 0 for all W and y). Since all the conditions of Theorem 3.15 hold, there is an efficient learning algorithm which incurs at most poly(n) · √T Procrustean swap regret (in particular, the relevant norm bounds in Theorem 3.9 are all polynomial in n).
4.3 Converging to Bayesian correlated equilibria
Swap regret has the nice property that if in a repeated -player normal-form game, all players run a low-swap regret algorithm to select their actions, their time-averaged strategy profile will converge to a correlated equilibrium (indeed, this is one of the major motivations for studying swap regret).
In repeated Bayesian games (games where each player has private information drawn independently from some distribution each round) the analogue of correlated equilibria is Bayesian correlated equilibria. Playing a repeated Bayesian game requires a contextual learning algorithm, which can observe the private information of the player (the “context”) and select an action based on this. Mansour et al. (2022) show that there is a notion of regret (that we call Bayesian swap regret) such that if all learners are playing an algorithm with low Bayesian swap regret, then over time they converge on average to a Bayesian correlated equilibrium. However, while Mansour et al. (2022) provide an algorithm with low Bayesian swap regret, their algorithm is not provably efficient (it requires finding the fixed point of a system of quadratic equations); by applying our framework, we show that it is possible to obtain a polynomial-time algorithm with low Bayesian swap regret for this problem.
Formally, we study the following full-information contextual online learning setting. As before, there are actions, but there are now different contexts (“types”). Every round , the adversary specifies a loss function , where represents the loss from playing action in context . Simultaneously, the learner specifies an action which we view as a function mapping each context to a distribution over actions. Overall, the learner receives expected utility this round (the learner’s context is drawn iid from the publicly known distribution each round). In this formulation can be written in the form
(12) |
where the maximum is over all “type swap functions” and -tuples of “action-deviation swap functions” . It is straightforward to verify that (as written in (12)) can be written as an -approachability problem for a bilinear function for . The theory of -approachability guarantees the existence of an algorithm with regret, but this algorithm has time/space complexity and is very inefficient for even moderate values of or .
Instead, in Appendix C, we show that can be written in the form
This allows us to evaluate in time and apply our pseudonorm approachability framework. Directly from Theorems 3.9 and 3.15, we know that there exists an efficient ( time per round) learning algorithm that incurs at most swap regret. As with swap regret, we can tighten this bound somewhat by examining the values of and and show that this algorithm actually incurs at most regret (details left to Appendix C).
This is within approximately an factor of optimal. Interestingly, unlike with swap regret, it is unclear if it is possible to efficiently solve the relevant entropy maximization problem for Bayesian swap regret (and hence achieve the optimal regret bound). We pose this as an open question.
Open Question 1.
Is it possible to efficiently (in time) evaluate the maximum entropy regularizer for the problem of Bayesian swap regret?
4.4 Reinforcement learning in constrained MDPs
We consider episodic reinforcement learning in constrained Markov decision processes (MDPs). Here, the agent receives a vectorial reward (loss) in each time step and aims to find a policy that achieves a certain minimum total reward (maximum total loss) in each dimension. Approachability has been used to derive reinforcement learning algorithms for constrained MDPs before (Miryoosefi et al., 2019; Yu et al., 2021; Miryoosefi and Jin, 2021), however, exclusively using ℓ2 geometry. As a result, these methods aim to bound the ℓ2 distance to the feasible set, and the bounds scale with the ℓ2 norm of the reward vector. This deviates from the more common, and perhaps more natural, formulation for constrained MDPs studied in other works (Efroni et al., 2020; Brantley et al., 2020; Ding et al., 2021). Here, each component of the loss vector is within a given range (e.g. [0, 1]) and the goal is to minimize the largest constraint violation among all components. We will show that ℓ∞-approachability is the natural approach for this problem and yields algorithms that avoid the √d factor in the regret, where d is the number of constraints, that algorithms based on ℓ2 approachability suffer. While the number of constraints can sometimes be small, there are many applications where d is large and a √d dependency in the regret is undesirable, even when a computational dependency on d is manageable. For example, constraints may arise from fairness considerations like demographic disparity that ensure that the policy behaves similarly across all protected groups. This would require a constraint for each pair of groups, which could be very many.
The formal problem setup is as follows. We consider an MDP defined by a state space , an action set , a transition function , where is the probability of reaching state when choosing action at state , and the loss vector function with . We work with losses for consistency with the other sections but our results also readily apply to rewards. To simplify the presentation, we will assume a layered MDP with layers with , for , and with and .
We define a (stochastic) policy as mapping from where represents the probability of action in state . Given a policy and the transition probability , we define the occupancy measure as the probability of visiting state-action pair when following policy (Altman, 1999; Neu et al., 2012): . We will denote by the set of all occupancy measures, obtained by varying the policy . It is known that forms a polytope.
We consider the feasibility problem in CMDPs with stochastic rewards and unknown dynamics. The loss vector is the same in all episodes, . Our goal is to learn a policy such that for a given threshold vector . The payoff function is defined as: . The set is separable as long as there is a policy that satisfies all constraints. Although we define the payoff function in terms of occupancy measures , they will be implicit in the algorithm.
To aid the comparison of our approach to existing work in this setting, we omit the dimensionality reduction with pseudonorms in this application and directly work in the d-dimensional space. We will analyze Algorithm 1 for ℓ∞-approachability, which can be implemented in MDPs by adopting the following oracles from prior work (Miryoosefi et al., 2019) on CMDPs:
• BestResponse-Oracle: For a given weight vector λ, this oracle returns a policy that is ε-optimal with respect to the scalar reward function induced by λ.
• Est-Oracle: For a given policy π, this oracle returns a vector estimating the expected vectorial payoff of π to within ε in each coordinate.
Consider first the case without approximation errors, , for illustration. For a vector , let be the vector that contains only the first dimensions of . When we call the BestResponse oracle with vector , it returns a policy such that its occupancy measure satisfies . We can use this to show:
where is the occupancy measure of a policy that satisfies all constraints.
Thus the returned policy is a valid choice in Line 3 of Algorithm 1. Passing this policy to the Est oracle yields an estimate of its expected vectorial payoff; this is enough to compute the quantity needed
in Line 5 of Algorithm 1111111Since FTRL with the negative entropy regularizer operates on the simplex, we pad the inputs with a zero dimension to obtain an OLO algorithm on the interior of the padded simplex.. Finally, we obtain the next weight vector by passing this quantity to a negative entropy FTRL algorithm as the OLO algorithm (Line 6 of Alg. 1).
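A compact rendering of this oracle-based loop is sketched below. The oracle signatures, the step size, and the toy instance (which replaces a real MDP by a fixed set of behaviours whose mixtures play the role of occupancy measures) are all illustrative assumptions; the structure follows the description above.

```python
import numpy as np

def constrained_mdp_approachability(best_response, estimate, c, T, eta):
    """Sketch of the oracle-based approachability loop for constrained MDPs.

    best_response(lam) -> a policy approximately minimizing <lam, v(pi)>,
        where v(pi) is the vector of expected constraint losses of pi.
    estimate(pi)       -> an estimate of v(pi).
    c                  -> constraint thresholds (goal: v(pi) <= c).
    The weights lam are maintained by entropy-regularized FTRL (exponential
    weights) over the simplex, matching the l_inf-approachability analysis."""
    cum = np.zeros(len(c))
    policies = []
    for _ in range(T):
        w = np.exp(eta * (cum - cum.max()))
        lam = w / w.sum()                      # direction of largest violation
        pi = best_response(lam)                # oracle call (as in Line 3)
        policies.append(pi)
        v_hat = estimate(pi)                   # oracle call (as in Line 5)
        cum += v_hat - c                       # feed violations to the OLO (Line 6)
    return policies                            # play the uniform mixture of these

if __name__ == "__main__":
    # Toy instance: "policies" are distributions over 3 base behaviours whose
    # constraint-loss vectors are the columns of V; thresholds c must be met.
    V = np.array([[0.9, 0.1, 0.5],
                  [0.1, 0.9, 0.5],
                  [0.5, 0.5, 0.2]])            # d = 3 constraints, 3 behaviours
    c = np.array([0.55, 0.55, 0.45])
    best_response = lambda lam: np.eye(3)[np.argmin(lam @ V)]   # best pure behaviour
    estimate = lambda pi: V @ pi
    T = 2000
    policies = constrained_mdp_approachability(best_response, estimate, c, T,
                                               eta=np.sqrt(np.log(3) / T))
    mix = np.mean(policies, axis=0)
    print("maximum constraint violation of the mixture policy:",
          float(np.max(V @ mix - c)))
```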
Using similar steps as for Theorem 2.3 and relying on the regret bound for in Lemma 2.2, we can show the following guarantee:
Consider a constrained episodic MDP with horizon , fixed loss vectors with for all state-action pairs, and a constraint threshold vector . Assume that there exists a feasible policy that satisfies all constraints, and let be the mixture policy of generated by the approach described above. Then the maximum constraint violation of satisfies
Applying the results from prior work (Miryoosefi et al., 2019) based on approachability would yield a bound of in our setting, with additional and factors in front of . For the sake of exposition, we illustrated the benefit of approachability using the oracles adopted by Miryoosefi et al. (2019), but our approach can also be applied, with similar advantages, to other works that make oracle calls explicit (Yu et al., 2021; Miryoosefi and Jin, 2021).
5 Conclusion
We presented a new algorithmic framework for -approachability, which we argued is the most suitable notion of approachability for a variety of applications such as regret minimization. Our algorithms leverage a key dimensionality reduction and a reduction to online linear optimization. These ideas can similarly be used to derive useful algorithms for approachability under alternative distance metrics. In fact, as already pointed out, some of our algorithms can equivalently be viewed as reducing an equivalent group-norm approachability problem to online linear optimization.
References
- Abernethy et al. [2008] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 263–274. Omnipress, 2008.
- Abernethy et al. [2011] Jacob D. Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of COLT, volume 19 of JMLR Proceedings, pages 27–46, 2011.
- Altman [1999] Eitan Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
- Bergemann and Morris [2016] Dirk Bergemann and Stephen Morris. Bayes correlated equilibrium and the comparison of information structures in games. Theoretical Economics, 11(2):487–522, 2016.
- Berger et al. [1996] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Comp. Linguistics, 22(1), 1996.
- Bernstein and Shimkin [2015] Andrey Bernstein and Nahum Shimkin. Response-based approachability with applications to generalized no-regret problems. J. Mach. Learn. Res., 16:747–773, 2015.
- Blackwell [1956] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
- Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
- Borwein and Zhu [2005] Jonathan Borwein and Qiji Zhu. Techniques of Variational Analysis. Springer, 2005.
- Boyd and Vandenberghe [2014] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2014.
- Brantley et al. [2020] Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Proceedings of NIPS, 33:16315–16326, 2020.
- Chzhen et al. [2021] Evgenii Chzhen, Christophe Giraud, and Gilles Stoltz. A unified approach to fair online learning via blackwell approachability. In Proceedings of NeurIPS, pages 18280–18292, 2021.
- Dawid [1982] A. P. Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982. doi: 10.1080/01621459.1982.10477856. URL https://www.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856.
- Ding et al. [2021] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In Proceedings of AISTATS, pages 3304–3312. PMLR, 2021.
- Dudík et al. [2007] Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 2007.
- Efroni et al. [2020] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
- Forges [1993] Françoise Forges. Five legitimate definitions of correlated equilibrium in games with incomplete information. Theory and decision, 35(3):277–310, 1993.
- Forges et al. [2006] Françoise Forges et al. Correlated equilibrium in games with incomplete information revisited. Theory and decision, 61(4):329–344, 2006.
- Foster and Hart [2018] Dean P. Foster and Sergiu Hart. Smooth calibration, leaky forecasts, finite recall, and nash dynamics. Games and Economic Behavior, 109:271–293, 2018. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2017.12.022. URL https://www.sciencedirect.com/science/article/pii/S0899825618300113.
- Foster and Vohra [1999] Dean P Foster and Rakesh V Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7–35, 1999.
- Gordon et al. [2008] Geoffrey J. Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In Proceedings of ICML, volume 307, pages 360–367. ACM, 2008.
- Gower and Dijksterhuis [2004] John C Gower and Garmt B Dijksterhuis. Procrustes problems, volume 30. OUP Oxford, 2004.
- Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- Hart and Mas-Colell [2001] Sergiu Hart and Andreu Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(5):26–54, 2001.
- Hartline et al. [2015] Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in bayesian games. Advances in Neural Information Processing Systems, 28, 2015.
- Hazan [2016] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Hazan and Kale [2008] Elad Hazan and Satyen Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In NIPS, pages 625–632, 2008.
- Kalathil et al. [2014] Dileep M. Kalathil, Vivek S. Borkar, and Rahul Jain. Blackwell’s approachability in Stackelberg stochastic games: A learning version. In Proceedings of CDC, pages 4467–4472. IEEE, 2014.
- Kivinen and Warmuth [1995] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proceedings of STOC, pages 209–218. ACM, 1995.
- Kwon [2016] Joon Kwon. Mirror descent strategies for regret minimization and approachability. PhD thesis, Université Pierre et Marie Curie, Paris 6, 2016.
- Kwon [2021] Joon Kwon. Refined approachability algorithms and application to regret minimization with global costs. The Journal of Machine Learning Research, 22:200–1, 2021.
- Kwon and Perchet [2017] Joon Kwon and Vianney Perchet. Online learning and blackwell approachability with partial monitoring: Optimal convergence rates. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of AISTATS, volume 54, pages 604–613. PMLR, 2017.
- Lee et al. [2018] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pages 1292–1294. PMLR, 2018.
- Lehrer [2003] Ehud Lehrer. Approachability in infinite dimensional spaces. Int. J. Game Theory, 31(2):253–268, 2003.
- Mannor and Shimkin [2003] Shie Mannor and Nahum Shimkin. The empirical bayes envelope and regret minimization in competitive markov decision processes. Math. Oper. Res., 28(2):327–345, 2003.
- Mannor and Shimkin [2004] Shie Mannor and Nahum Shimkin. A geometric approach to multi-criterion reinforcement learning. J. Mach. Learn. Res., 5:325–360, 2004.
- Mannor et al. [2014a] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Set-valued approachability and online learning with partial monitoring. J. Mach. Learn. Res., 15(1):3247–3295, 2014a.
- Mannor et al. [2014b] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Proceedings of COLT, volume 35, pages 339–355, 2014b.
- Mansour et al. [2022] Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in Bayesian games, 2022. URL https://arxiv.org/abs/2205.08562.
- Miryoosefi and Jin [2021] Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. CoRR, abs/2107.05216, 2021.
- Miryoosefi et al. [2019] Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, and Robert E. Schapire. Reinforcement learning with convex constraints. In Proceedings of NIPS, pages 14070–14079, 2019.
- Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, second edition, 2018.
- Neu et al. [2012] Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of AISTATS, volume 22 of JMLR Proceedings, pages 805–813, 2012.
- Perchet [2010] Vianney Perchet. Approchabilité, Calibration et Regret dans les Jeux à Observations Partielles. PhD thesis, Université Pierre et Marie Curie - Paris VI, 2010.
- Perchet [2015] Vianney Perchet. Exponential weight approachability, applications to calibration and regret minimization. Dynamic Games and Applications, 5(1):136–153, 2015.
- Perchet and Quincampoix [2015] Vianney Perchet and Marc Quincampoix. On a unified framework for approachability with full or partial monitoring. Mathematics of Operations Research, 40(3):596–610, 2015.
- Perchet and Quincampoix [2018] Vianney Perchet and Marc Quincampoix. A differential game on Wasserstein space. Application to weak approachability with partial monitoring. CoRR, abs/1811.04575, 2018.
- Pietra et al. [1997] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell., 19(4), 1997.
- Rockafellar [1970] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1970.
- Rosenfeld [1996] Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech & Language, 10(3):187–228, 1996.
- Shalev-Shwartz [2007] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
- Shimkin [2016] Nahum Shimkin. An online convex optimization approach to blackwell’s approachability. The Journal of Machine Learning Research, 17(1):4434–4456, 2016.
- Spinat [2002] Xavier Spinat. A Necessary and Sufficient Condition for Approachability. Mathematics of Operations Research, 27(1):31–44, 2002.
- Vieille [1992] Nicolas Vieille. Weak approachability. Math. Oper. Res., 17(4):781–791, 1992.
- von Neumann [1928] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
- Yu et al. [2021] Tiancheng Yu, Yi Tian, Jingzhao Zhang, and Suvrit Sra. Provably efficient algorithms for multi-objective competitive rl. In International Conference on Machine Learning, pages 12167–12176. PMLR, 2021.
- Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 928–936. AAAI Press, 2003.
Appendix A Omitted proofs
A.1 Proof of Theorem 2.3
Proof of Theorem 2.3.
Similar results appear in e.g. Kwon [2021]. For completeness, we include a proof here.
We will need the following fact: for any , . This is easy to verify (both sides are equal to ), but is also a consequence of Fenchel duality (see e.g. Appendix B and the proof of Theorem 3.1).
Armed with this fact, note that
∎
A.2 Proof of Corollary 3.2
Proof of Corollary 3.2.
Note that if , then there exists a such that . Therefore, if , then the supremum (we can take to be a large multiple of ). On the other hand, if , then the supremum (taking ). It follows that the supremum in Theorem 3.1 must be achieved for a (we know that is not infinite since , so ). For such , the term vanishes, and we are left with the statement of this corollary. ∎
A.3 Proof of Theorem 3.3
A.4 Proof of Lemma 3.4
Proof of Lemma 3.4.
Let denote . If is in , then we can write for some , , with . Thus, for any , we have
which implies that is in . Conversely, if is not in , then since is a non-empty closed convex set, can be separated from ; that is, there exists such that
which implies that is not in . This completes the proof. ∎
A.5 Proof of Lemma 3.7
Proof of Lemma 3.7.
We begin with the first statement. Note that since the negative orthant is separable with respect to , for any , there exists a such that for all . Now, recall that we can write , so it follows that for all . This implies , as desired.
We next prove the second statement. Note that since , this implies that if , then . By the first statement, this means that for any , there exists a such that . By the minimax theorem (since and are both convex sets and is a bilinear function of and ), this implies that there exists a such that for all , , as desired. ∎
A.6 Proof of Lemma 3.10
A.7 Proof of Lemma 3.11
A.8 Proof of Lemma 3.12
Proof of Lemma 3.12.
Extend the domain of as a function to , and note that there is a unique way to write , where for each , is an element of and is the th unit vector in . We first claim iff each .
To see this, first note that since is orthant-generating, there exists a sequence of such that for each . Now, if each , then (since ), so . Conversely, if , then we can write for some and . Expanding each in the basis, we find that each must be a positive linear combination of the values and therefore .
Therefore, to check whether , it suffices to check whether each component of belongs to . This is possible to do efficiently given an efficient separation oracle for (we can write the convex program for ). Finally, if each we can also recover a value for each such that (via the same convex program). This allows us to explicitly write with , , and . ∎
A.9 Proof of Lemma 3.14
Proof of Lemma 3.14.
Checking for membership in the ball of radius is straightforward, so it suffices to exhibit a membership oracle for the set . Fix a and consider the convex function . Note that by the definition of iff , so it suffices to compute the minimum of over the convex set .
A.10 Proof of Proposition 4.4
Appendix B General theorems from convex optimization
We will use the following Fenchel duality theorem [Borwein and Zhu, 2005, Rockafellar, 1970], see also [Mohri et al., 2018].
Theorem B.1 (Fenchel duality).
Let and be Banach spaces, and convex functions and a bounded linear map. Assume that , and satisfy one of the following conditions:
• and are lower semi-continuous and ;
• ;
then for the dual optimization problems
and the supremum in the second problem is attained if finite.
When constructing efficient algorithms, we will need an efficient method for minimizing a convex function over a convex set given only a membership oracle for the set and an evaluation oracle for the function. This is provided by the following lemma.
Lemma B.2.
Let be a bounded convex subset of and a convex function over . Then given access to a membership oracle for (along with an interior point satisfying for some given radii ) and an evaluation oracle for , there exists an algorithm which computes the minimum value over (to within precision ) using time and queries to these oracles.
Proof.
See Lee et al. [2018]. ∎
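As a toy illustration of the oracle interface in Lemma B.2 (and emphatically not the cutting-plane method of Lee et al. [2018] on which the lemma relies), the following Python sketch minimizes a convex function given only a membership oracle and an evaluation oracle, by sampling candidate points around the provided interior point. All names are hypothetical and the sketch carries no accuracy guarantee.

```python
import numpy as np

def naive_oracle_minimize(membership, f, center, R, n_samples=10000, seed=0):
    """Toy minimizer using only a membership oracle and an evaluation oracle.

    membership(x) -> True iff x belongs to the convex set K.
    f(x)          -> value of the convex objective at x.
    center, R     -> an interior point of K and the radius of a ball containing K.
    Illustration of the interface only; it does not match the guarantee of Lemma B.2.
    """
    rng = np.random.default_rng(seed)
    d = len(center)
    best_x, best_val = np.array(center, dtype=float), f(center)
    for _ in range(n_samples):
        x = center + rng.uniform(-R, R, size=d)    # candidate in a bounding box around K
        if membership(x) and f(x) < best_val:      # keep only feasible improvements
            best_x, best_val = x, f(x)
    return best_x, best_val
```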
Appendix C Bayesian correlated equilibria
Here we provide another application of the approachability framework, to the problem of constructing learning algorithms that converge to correlated equilibria in Bayesian games. Correlated equilibria in Bayesian games (alternatively, “games with incomplete information”) are well-studied throughout the economics and game theory literature; see e.g. [Forges, 1993, Forges et al., 2006, Bergemann and Morris, 2016]. Unlike ordinary correlated equilibria, which are also well-studied from a learning perspective, relatively little is known about algorithms that converge to correlated equilibria in Bayesian games. Hartline et al. [2015] study no-regret learning in Bayesian games, showing that no-regret algorithms converge to a Bayesian coarse correlated equilibrium. More recently, Mansour et al. [2022] introduce a notion of Bayesian swap regret with the property that learners with sublinear Bayesian swap regret converge to correlated equilibria in Bayesian games. Mansour et al. [2022] construct a learning algorithm that achieves low Bayesian swap regret, albeit not a provably efficient one. In this section, we apply our approachability framework to develop the first efficient low-regret algorithm for Bayesian swap regret.
We begin with some preliminaries about standard (non-Bayesian) normal-form games. In a normal form game with players, each player must choose a mixed action (for simplicity we will assume each player has the same number of pure actions). We call the collection of mixed strategies played by all players a strategy profile. We will let denote the utility of player under strategy profile (and insist that is linear in each player’s strategy).
Given a function and a mixed action , let be the mixed action in formed by sampling an action from and then applying the function (i.e., ). A correlated equilibrium of is a distribution over strategy profiles such that for any player and function , it is the case that
where . Similarly, an -correlated equilibrium of is a distribution with the property that
for any and . Correlated equilibria have the following natural interpretation: a mediator samples a strategy profile from and tells each player a pure action randomly sampled from . If each player is incentivized to play the action they are told, then is a correlated equilibrium.
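To make the definition concrete, here is a minimal sketch (in Python, with hypothetical array shapes) that checks the approximate correlated-equilibrium conditions for a finite normal-form game. It uses the standard fact that checking every single-action swap for every player is equivalent to checking every deviation function, since the constraint for a deviation function decomposes over the recommended action.

```python
import itertools
import numpy as np

def is_approx_correlated_eq(mu, utils, eps=0.0):
    """Check the epsilon-correlated-equilibrium conditions for a finite game.

    mu:    array of shape (A,) * n; mu[a_1, ..., a_n] = probability of that pure profile.
    utils: list of n arrays, each of shape (A,) * n, giving each player's utility.
    """
    n = len(utils)
    A = mu.shape[0]
    for i in range(n):
        for a, a_dev in itertools.product(range(A), repeat=2):
            if a == a_dev:
                continue
            gain = 0.0
            for profile in itertools.product(range(A), repeat=n):
                if profile[i] != a:
                    continue
                deviated = profile[:i] + (a_dev,) + profile[i + 1:]
                gain += mu[profile] * (utils[i][deviated] - utils[i][profile])
            if gain > eps:
                return False   # player i gains more than eps by swapping a -> a_dev
    return True
```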
We define Bayesian games similarly to standard games, with the modification that now each player also has some private information drawn from some public distribution . We call the vector of realized types the type profile of the players (drawn randomly from ), and now let the utility of player depend on both the strategy profile and type profile of the players. Note that we can alternatively think of the strategy of player as a function mapping contexts to mixed actions; in this case we can again treat the expected utility for player (with expectation taken over the random type profile) as a multilinear function of the strategy profile .
As with regular correlated equilibria, we can motivate the definition of Bayesian correlated equilibria via the introduction of a mediator. In the Bayesian case, all players begin by revealing their private types to the mediator, and the mediator observes a type profile . The mediator then samples a joint action profile from a distribution that depends on the observed type profile. Finally, for each player , the mediator samples a pure action from the mixed strategy for , and relays to player (which they should follow). In order for this to be a valid correlated equilibrium, the following incentive compatibility constraints must be met:
• Players must have no incentive to deviate from the strategy relayed to them. As in correlated equilibria, this includes deviations of the form “if I am told to play action , I will instead play action ”.
• Players also must have no incentive to misreport their type (thus affecting the distribution over joint strategy profiles).
• Moreover, no combination of the above two deviations should result in improved utility for a player.
Formally, we define a Bayesian correlated equilibrium for a Bayesian game as follows. The distributions form a Bayesian correlated equilibrium if, for any player , any “type deviation” , and any collection of “action deviations” (for each ),
where is derived from by deviating from in the following way: when the player has type , they first report type to the mediator; if the mediator then tells them to play action , they instead play . No such deviation should improve the utility of an agent in a Bayesian correlated equilibrium. Likewise, an -Bayesian correlated equilibrium is a collection of distributions where no deviation increases the utility of a player by more than .
In order to play a Bayesian game, a learning algorithm must be contextual (using the agent’s private information to decide what action to play). We study the following setting of full-information contextual online learning. As before, there are actions, but there are now different contexts (types). Every round , the adversary specifies a loss function , where represents the loss from playing action in context . Simultaneously, the learner specifies an action mapping each context to a distribution over actions. Overall, the learner receives expected utility this round (here is a distribution over contexts; we assume that the learner’s context is drawn i.i.d. from each round, and that this distribution is publicly known).
Motivated by the deviations considered in Bayesian correlated equilibria, we can define the following notion of swap regret in Bayesian games (“Bayesian swap regret”):
(15)
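The definition in (15) is easiest to see operationally. The following Python sketch computes the Bayesian swap regret of a sequence of contextual strategies against a sequence of loss functions. The deviation convention used here — mapping each true context to a reported context and swapping each recommended action — is our reading of the definition above and should be treated as an assumption of the sketch.

```python
import numpy as np

def bayesian_swap_regret(pis, losses, D):
    """Compute the Bayesian swap regret of a sequence of contextual strategies.

    pis:    array (T, C, A); pis[t, c] = distribution over actions played in context c at round t.
    losses: array (T, C, A); losses[t, c, a] = loss of action a in context c at round t.
    D:      array (C,); the public distribution over contexts.
    """
    T, C, A = pis.shape
    regret = 0.0
    for c in range(C):
        # realized cumulative loss when the true context is c
        base = sum(pis[t, c] @ losses[t, c] for t in range(T))
        # best deviation: report some context c_rep, then best-swap each recommended action
        best_dev = np.inf
        for c_rep in range(C):
            # cum[a, a2] = total loss of playing a2 whenever a is recommended under c_rep
            cum = sum(np.outer(pis[t, c_rep], losses[t, c]) for t in range(T))
            best_dev = min(best_dev, cum.min(axis=1).sum())
        regret += D[c] * (base - best_dev)
    return regret
```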
Lemma C.1.
Let be an algorithm with . Assume each player in a repeated Bayesian game (over rounds) runs a copy of algorithm , and let be the strategy profile at time . Then the time-averaged strategy profile , defined by sampling a uniformly at random from and returning , is an -Bayesian correlated equilibrium with .
Proof.
We will show that there exists no deviation for player which increases their utility by more than .
Fix a type deviation and set of action deviations . This collection of deviations transforms an arbitrary strategy into the strategy satisfying .
For each , with probability the mediator will return the strategy profiles . Now, since is multilinear in the strategies of each player, there exists some vector such that the utility of player if they defect to some strategy is given by the inner product .
In particular, conditioned on the mediator returning , the difference in utility for player between playing and the strategy formed by applying the above deviations to is exactly
Taking expectations over all , we have that the expected difference in utility by deviating is
But since player selected their strategies by playing , this is at most , as desired. ∎
It is possible to phrase (15) in the language of -approachability by considering the dimensional vectorial payoff given by:
A straightforward computation shows that the negative orthant is separable with respect to .
Lemma C.2.
The set is separable with respect to the vectorial payoff .
Proof.
Fix an . Then note that if we let , where (i.e., is entirely supported on the best fixed action to play in context ), it follows that for all and , and therefore that . ∎
This in turn leads (via Theorem 2.3) to a low-regret (albeit computationally inefficient) algorithm for Bayesian swap regret. Instead, as in the case of swap regret, we will apply our pseudonorm approachability framework. First, we will show that we can rewrite (15) in a way that allows us to easily evaluate .
Lemma C.3.
We have that
(16)
Note that (16) allows us to efficiently (in time) evaluate . As mentioned in the main text, directly from Theorems 3.9 and 3.15 this gives us an efficient ( time per round) learning algorithm that incurs at most swap regret. We will now examine the values of , and and show that this algorithm actually incurs at most regret.
First, note that since , elements can be thought of as -tuples of positive numbers that add to . Each such -tuple has squared distance at most , so . Second, since , . Finally, the coefficients of each consist of copies of the distribution ; this has norm at most , so . Combining these three quantities according to Theorem 3.9, we obtain the following corollary.
Corollary C.4.
There exists an efficient contextual learning algorithm with .