Hardness of Independent Learning and
Sparse Equilibrium Computation in Markov Games
Abstract
We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results for no-regret learning in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework was open. We provide a decisive negative resolution to this problem, from both a computational and a statistical perspective. We show that:
1. Under the widely-believed complexity-theoretic assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant.
2. When the game is unknown, no algorithm—regardless of computational efficiency—can achieve no-regret without observing a number of episodes that is exponential in the number of players.
Perhaps surprisingly, our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute by any means—centralized, decentralized, or otherwise—a coarse correlated equilibrium that is “sparse” in the sense that it can be represented as a mixture of a small number of “product” policies. The crux of our approach is a novel application of aggregation techniques from online learning [Vov90, CBL06], whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero-sum normal-form games; this enables the application of well-known hardness results for Nash.
1 Introduction
The framework of multi-agent reinforcement learning (MARL), which describes settings in which multiple agents interact in a dynamic environment, has played a key role in recent breakthroughs in artificial intelligence, including the development of agents that approach or surpass human performance in games such as Go [SHM+16], Poker [BS18], Stratego [PVH+22], and Diplomacy [KEG+22, BBD+22]. MARL also shows promise for real-world multi-agent systems, including autonomous driving [SSS16], cybersecurity [MK15], and economic policy [ZTS+22]. These applications, where reliability is critical, necessitate the development of algorithms that are practical and efficient, yet provide strong formal guarantees and robustness.
Multi-agent reinforcement learning is typically studied using the framework of Markov games (also known as stochastic games) [Sha53]. In a Markov game, agents interact over a finite number of steps: at each step, each agent observes the state of the environment, takes an action, and observes a reward which depends on the current state as well as the other agents’ actions. Then the environment transitions to a new state as a function of the current state and the actions taken. An episode consists of a finite number of such steps, and agents interact over the course of multiple episodes, progressively learning new information about their environment. Markov games generalize the well-known model of Markov Decision Processes (MDPs) [Put94], which describe the special case in which there is a single agent acting in a dynamic environment, and we wish to find a policy that maximizes its reward. By contrast, for Markov games, we typically aim to find a distribution over agents’ policies which constitutes some type of equilibrium.
1.1 Decentralized learning
In this paper, we focus on the problem of decentralized (or, independent) learning in Markov games. In decentralized MARL, each agent in the Markov game behaves independently, optimizing their policy myopically while treating the effects of the other agents as exogenous. Agents observe local information (in particular, their own actions and rewards), but do not observe the actions of the other agents directly. Decentralized learning enjoys a number of desirable properties, including scalability (computation is inherently linear in the number of agents), versatility (by virtue of independence, algorithms can be applied in uncertain environments in which the nature of the interaction and number of other agents are not known), and practicality (architectures for single-agent reinforcement learning can often be adapted directly). The central question we consider is whether there exist decentralized learning algorithms which, when employed by all agents in a Markov game, lead them to play near-equilibrium strategies over time.
Decentralized equilibrium computation in MARL is not well understood theoretically, and algorithms with provable guarantees are scarce. To motivate the challenges and most salient issues, it will be helpful to contrast with the simpler problem of decentralized learning in normal-form games, which may be interpreted as Markov games with a single state. Normal-form games enjoy a rich and celebrated theory of decentralized learning, dating back to Brown’s work on fictitious play [Bro49] and Blackwell’s theory of approachability [Bla56]. Much of the modern work on decentralized learning in normal-form games centers on no-regret learning, where agents select actions independently using online learning algorithms [CBL06] designed to minimize their regret (that is, the gap between realized payoffs and the payoff of the best fixed action in hindsight). In particular, a foundational result is that if each agent employs a no-regret learning strategy, then the average of the agents’ joint action distributions approaches a coarse correlated equilibrium (CCE) for the normal-form game [CBL06, Han57, Bla56]. CCE is a natural relaxation of the foundational concept of Nash equilibrium, which has the downside of being intractable to compute. On the other hand, there are many efficient algorithms that can achieve vanishing regret in a normal-form game, even when opponents select their actions in an arbitrary, potentially adaptive fashion, and thus converge to a CCE [Vov90, LW94, CBFH+97, HMC00, SALS15].
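To make this baseline concrete, the following minimal sketch (illustrative payoff matrices and step size, not taken from any of the cited works) has two players run Hedge independently with full-information feedback and then measures the CCE gap of the time-averaged joint play.

```python
import numpy as np

rng = np.random.default_rng(0)
A, T = 3, 5000
eta = np.sqrt(np.log(A) / T)                 # standard Hedge step-size scaling for T rounds

# Illustrative general-sum payoffs R[i][a0, a1] in [0, 1] for players i = 0, 1.
R = [rng.random((A, A)), rng.random((A, A))]

w = [np.zeros(A), np.zeros(A)]               # cumulative payoff ("gain") vectors
avg_joint = np.zeros((A, A))                 # running average of the product play distributions

for t in range(T):
    p = [np.exp(wi - wi.max()) for wi in w]
    p = [q / q.sum() for q in p]
    avg_joint += np.outer(p[0], p[1]) / T
    # Full-information feedback: each player sees its expected payoff for every action.
    w[0] += eta * (R[0] @ p[1])
    w[1] += eta * (R[1].T @ p[0])

# CCE gap of the averaged joint distribution: best fixed deviation minus realized payoff.
for i in range(2):
    realized = (avg_joint * R[i]).sum()
    marg_other = avg_joint.sum(axis=i)       # marginal over the other player's actions
    best_dev = ((R[i] @ marg_other) if i == 0 else (R[i].T @ marg_other)).max()
    print(f"player {i}: CCE gap ~ {best_dev - realized:.4f}")
```

As the regret bounds in the cited works predict, the printed gaps shrink at roughly the rate of the players' average regret as T grows.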
This simple connection between no-regret learning and decentralized convergence to equilibria has been influential in game theory, leading to numerous lines of research including fast rates of convergence to equilibria [SALS15, CP20, DFG21, AFK+22], price of anarchy bounds for smooth games [Rou15], and lower bounds on query and communication complexity for equilibrium computation [FGGS13, Rub16, BR17]. Empirically, no-regret algorithms such as regret matching [HMC00] and Hedge [Vov90, LW94, CBFH+97] have been used to compute equilibria that can achieve state-of-the-art performance in application domains such as Poker [BS18] and Diplomacy [BBD+22]. Motivated by these successes, we ask whether an analogous theory can be developed for Markov games. In particular:
Are there efficient algorithms for no-regret learning in Markov games?
Any Markov game can be viewed as a large normal-form game where each agent’s action space consists of their exponentially-sized space of policies, and their utility function is given by their expected reward. Thus, any learning algorithm for normal-form games can also be applied to Markov games, but the resulting sample and computational complexities will be intractably large. Our goal is to explore whether more efficient decentralized learning guarantees can be established.
Challenges for no-regret learning.
In spite of active research effort and many promising pieces of progress [JLWY21, SMB22, MB21, DGZ22, ELS+22], no-regret learning guarantees for Markov games have been elusive. A barrier faced by naive algorithms is that it is intractable to ensure no-regret against an arbitrary adversary, both computationally [BJY20, AYBK+13] and statistically [LWJ22, KECM21, FRSS22].
Fortunately, many of the implications of no-regret learning (in particular, convergence to equilibria) do not require the algorithm to have sublinear regret against an arbitrary adversary, but rather only against other agents who are running the same algorithm independently. This observation has been influential in normal-form games, where the well-known line of work on fast rates of convergence to equilibrium [SALS15, CP20, DFG21, AFK+22] holds only in this more restrictive setting. This motivates the following relaxation to our central question.
Problem 1.1.
Is there an efficient algorithm that, when adopted by all agents in a Markov game and run independently, leads to sublinear regret for each individual agent?
Attempts to address Problem 1.1.
Two recent lines of research have made progress toward addressing Problem 1.1 and related questions. In one direction, several recent papers have provided algorithms, including V-learning [JLWY21, SMB22, MB21] and SPoCMAR [DGZ22], that do not achieve no-regret, but can nevertheless compute and then sample from a coarse correlated equilibrium in a Markov game in a (mostly) decentralized fashion, with the caveat that they require a shared source of random bits as a mechanism to coordinate. Notably, V-learning depends only mildly on the shared randomness: agents first play policies in a fully independent fashion (i.e., without shared randomness) according to a simple learning algorithm for episodes, and use shared random bits only once learning finishes as part of a post-processing procedure to extract a CCE policy. A question left open by these works is whether the sequence of policies played by the V-learning algorithm in the initial independent phase can itself guarantee each agent sublinear regret; this would eliminate the need for a separate post-processing procedure and shared randomness.
Most closely related to our work, [ELS+22] recently showed that Problem 1.1 can be solved positively for a restricted setting in which regret for each agent is defined as the maximum gain in value they can achieve by deviating to a fixed Markov policy. Markov policies are those whose choice of action depends only on the current state as opposed to the entire history of interaction. This notion of deviation is restrictive because in general, even when the opponent plays a sequence of Markov policies, the best response will be non-Markov. In challenging settings that abound in practice, it is standard to consider non-Markov policies [LDGV+21, AVDG+22], since they often achieve higher value than Markov policies; we provide a simple example in Proposition 6.1. Thus, while a regret guarantee with respect to the class of Markov policies (as in [ELS+22]) is certainly interesting, it may be too weak in general, and it is of great interest to understand whether Problem 1.1 can be answered positively in the general setting.111We remark that the V-learning and SPoCMAR algorithms mentioned above do learn equilibria that are robust to deviations to non-Markov policies, though they do not address Problem 1.1 since they do not have sublinear regret.
1.2 Our contributions
We resolve Problem 1.1 in the negative, from both a computational and statistical perspective.
Computational hardness.
We provide two computational lower bounds (Theorems 1.2 and 1.3) which show that under standard complexity-theoretic assumptions, there is no efficient algorithm that runs for a polynomial number of episodes and guarantees each agent non-trivial (“sublinear”) regret when used in tandem by all agents. Both results hold even if the Markov game is explicitly known to the algorithm designer; Theorem 1.3 is stronger and more general, but applies only to 3-player games, while Theorem 1.2 applies to 2-player games, but only for agents restricted to playing Markovian policies.
To state our first result, Theorem 1.2, we define a product Markov policy to be a joint policy in which players choose their actions independently according to Markov policies (see Sections 2 and 3 for formal definitions). Note that if all players use independent no-regret algorithms to choose Markov policies at each episode, then their joint play at each round is described by a product Markov policy, since any randomness in each player’s policy must be generated independently.
Theorem 1.2 (Informal version of Corollary 3.3).
If PPAD ⊄ P, then there is no polynomial-time algorithm that, given the description of a 2-player Markov game, outputs a sequence of joint product Markov policies which guarantees each agent sublinear regret.
Theorem 1.2 provides a decisive negative resolution to Problem 1.1 under the assumption that PPAD ⊄ P,222Technically, the class we are denoting by P, namely that of total search problems that have a deterministic polynomial-time algorithm, is sometimes denoted by , as it is a search problem. We ignore this distinction. which is standard in the theory of computational complexity [Pap94].333PPAD is the most well-studied complexity class in algorithmic game theory, and is widely believed not to admit polynomial-time algorithms. Notably, the problem of computing a Nash equilibrium for normal-form games with two or more players is PPAD-complete [DGP09, CDT06, Rub18]. Beyond ruling out the existence of fully decentralized no-regret algorithms, the theorem rules out the existence of centralized algorithms that compute a sequence of product policies for which each agent has sublinear regret, even if such a sequence does not arise naturally as the result of agents independently following some learning algorithm. Salient implications include:
• Theorem 1.2 provides a separation between Markov games and normal-form games, since standard no-regret algorithms for normal-form games i) run in polynomial time and ii) produce sequences of joint product policies that guarantee each agent sublinear regret. Notably, no-regret learning for normal-form games is efficient whenever the number of agents is polynomial, whereas Theorem 1.2 rules out polynomial-time algorithms for as few as two agents.
• A question left open by the work of [JLWY21, SMB22, MB21] was whether the sequence of policies played by the V-learning algorithm during its independent learning phase can guarantee each agent sublinear regret. Since V-learning plays product Markov policies during the independent phase and is computationally efficient, Theorem 1.2 implies that these policies do not enjoy sublinear regret (assuming PPAD ⊄ P).
Our second result, Theorem 1.3, extends Theorem 1.2 to the more general setting in which agents can select arbitrary, potentially non-Markovian policies at each episode. This additional strength comes at the cost of only providing hardness for 3-player games as opposed to 2-player games, as well as relying on the slightly stronger complexity-theoretic assumption that PPAD ⊄ RP.444We use RP to denote the class of total search problems for which there exists a polynomial-time randomized algorithm which outputs a solution with probability at least , and otherwise outputs “fail”.
Theorem 1.3 (Informal version of Corollary 4.4).
If PPAD ⊄ RP, then there is no polynomial-time algorithm that, given the description of a 3-player Markov game, outputs a sequence of joint product general policies (i.e., potentially non-Markov) which guarantees each agent sublinear regret.
Statistical hardness.
Theorems 1.2 and 1.3 rely on the widely-believed complexity-theoretic assumption that PPAD-complete problems cannot be solved in (randomized) polynomial time. Such a restriction is inherent if we assume that the game is known to the algorithm designer. To avoid complexity-theoretic assumptions, we consider a setting in which the Markov game is unknown to the algorithm designer, and algorithms must learn about the game by executing policies (“querying”) and observing the resulting sequences of states, actions, and rewards. Our final result, Theorem 1.4, shows unconditionally that when the parameters of the Markov game are unknown, any algorithm computing a no-regret sequence as in Theorem 1.3 requires a number of queries that is exponential in the number of players.
Theorem 1.4 (Informal version of Theorem 5.2).
Given query access to a multi-player Markov game, no algorithm that makes a number of queries subexponential in the number of players can output a sequence of joint product policies which guarantees each agent sublinear regret.
Similar to our computational lower bounds, Theorem 1.4 goes far beyond decentralized algorithms, and rules out even centralized algorithms that compute a no-regret sequence by jointly controlling all players. The result provides another separation between Markov games and normal-form games, since standard no-regret algorithms for normal-form games can achieve sublinear regret using a number of queries that is polynomial in the number of actions per player, for any number of players. The exponential scaling in the lower bound, which does not rule out query-efficient algorithms when the number of players is constant, is to be expected for an unconditional result: if the game has only polynomially many parameters (which is the case for a constant number of players), one can estimate all of the parameters using standard techniques [JKSY20], then directly find a no-regret sequence.
Proof techniques: the SparseCCE problem.
Rather than directly proving lower bounds for the problem of no-regret learning, we establish lower bounds for a simpler problem we refer to as SparseCCE. In the SparseCCE problem, the aim is to compute by any means—centralized, decentralized, or otherwise—a coarse correlated equilibrium that is “sparse” in the sense that it can be represented as the mixture of a small number of product policies. Any algorithm that computes a sequence of product policies with sublinear regret (in the sense of Theorem 1.3) immediately yields an algorithm for the SparseCCE problem, as—using the standard connection between CCE and no-regret—the uniform mixture of the policies in the no-regret sequence forms a sparse CCE. Thus, any lower bound for the SparseCCE problem yields a lower bound for computation of no-regret sequences.
To provide lower bounds for the SparseCCE problem, we reduce from the problem of Nash equilibrium computation in normal-form games. We show that given any two-player normal-form game, it is possible to construct a Markov game (with two players in the case of Theorem 1.2 and three players in the case of Theorem 1.3) with the property that i) the description length is polynomial in the description length of the normal-form game, and ii) any (approximate) SparseCCE for the Markov game can be efficiently transformed into an approximate Nash equilibrium for the normal-form game. With this reduction established, our computational lower bounds follow from celebrated PPAD-hardness results for approximate Nash equilibrium computation in two-player normal-form games, and our statistical lower bounds follow from query complexity lower bounds for Nash. Proving the reduction from Nash to SparseCCE constitutes the bulk of our work, and makes novel use of aggregation techniques from online learning [Vov90, CBL06], as well as techniques from the literature on anti-folk theorems in game theory [BCI+08].
1.3 Organization
Section 2 presents preliminaries on no-regret learning and equilibrium computation in Markov games and normal-form games. Sections 3, 4 and 5 present our main results:
• Sections 3 and 4 provide our computational lower bounds for no-regret in Markov games. Section 3 gives a lower bound for the setting in which algorithms are constrained to play Markovian policies, and Section 4 builds on the approach in this section to give a lower bound for general, potentially non-Markovian policies.
• Section 5 provides statistical (query complexity) lower bounds for multi-player Markov games.
Proofs are deferred to the appendix unless otherwise stated.
Notation.
For $n \in \mathbb{N}$, we write $[n] := \{1, \dots, n\}$. For a finite set $\mathcal{X}$, $\Delta(\mathcal{X})$ denotes the space of distributions on $\mathcal{X}$. For an element $x \in \mathcal{X}$, $\delta_x$ denotes the delta distribution that places probability mass $1$ on $x$. We adopt standard big-oh notation, and write $f = \widetilde{O}(g)$ to denote that $f = O(g \cdot \mathrm{polylog}(g))$, with $\widetilde{\Omega}$ and $\widetilde{\Theta}$ defined analogously.
2 Preliminaries
This section contains preliminaries necessary to present our main results. We first introduce the Markov game framework (Section 2.1) and the associated policies and value functions (Section 2.2), then provide a brief review of normal-form games (Section 2.3), and finally introduce the concepts of coarse correlated equilibria and regret minimization (Section 2.4).
2.1 Markov games
We consider general-sum Markov games in a finite-horizon, episodic framework. For , an -player Markov game consists of a tuple , where:
• denotes a finite state space and denotes a finite time horizon. We write .
• For , denotes a finite action space for agent . We let denote the joint action space and . We denote joint actions in boldface, for example, . We write and .
• is the transition kernel, with each denoting the kernel for step . In particular, is the probability of transitioning to from the state at step when agents play .
• For and , is the instantaneous reward function of agent :555We assume that rewards lie in for notational convenience, as this ensures that the cumulative reward for each episode lies in . This assumption is not important to our results. the reward agent receives in state at step if agents play is given by .666We consider Markov games in which the rewards at each step are a deterministic function of the state and action profile. While some works consider the more general case of stochastic rewards, since our main goal is to prove lower bounds, it is without loss for us to assume that rewards are deterministic.
• denotes the initial state distribution.
An episode in the Markov game proceeds as follows:
• The initial state is drawn from the initial state distribution .
• For each , given state , each agent plays action , and given the joint action profile , each agent receives reward of and the state of the system transitions to .
We denote the tuple of agents’ rewards at each step by , and refer to the resulting sequence as a trajectory. For , we define the prefix of the trajectory via .
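To make the episode protocol concrete, here is a minimal sketch (Python; hypothetical tabular representation, illustrative names) that samples one trajectory when each agent follows its own Markov policy (Markov policies are formalized in Section 2.2).

```python
import numpy as np

def sample_episode(mu, P, R, policies, rng):
    """Sample one trajectory of a tabular n-player Markov game.

    mu       : initial state distribution, shape (S,)
    P        : transitions; P[h][s][a] is a distribution over next states (a is a joint-action tuple)
    R        : rewards; R[h][s][a] is the length-n tuple of per-agent rewards
    policies : policies[i][h][s] is a distribution over agent i's actions
    """
    n, H = len(policies), len(P)
    s = rng.choice(len(mu), p=mu)
    trajectory = []
    for h in range(H):
        # Each agent draws its action independently from its own Markov policy.
        a = tuple(rng.choice(len(policies[i][h][s]), p=policies[i][h][s]) for i in range(n))
        r = R[h][s][a]
        trajectory.append((s, a, r))
        s = rng.choice(len(P[h][s][a]), p=P[h][s][a])
    return trajectory
```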
Indexing.
We use the following notation: for some quantity $x = (x_1, \dots, x_n)$ (e.g., action, reward, etc.) indexed by agents, and an agent $i$, we write $x_{-i}$ to denote the tuple consisting of all $x_j$ for $j \neq i$.
2.2 Policies and value functions
We now introduce the notion of policies and value functions for Markov games. Policies are mappings from states (or sequences of states) to actions for the agents. We consider several different types of policies, which play a crucial role in distinguishing the types of equilibria that are tractable and those that are intractable to compute efficiently.
Markov policies.
A randomized Markov policy for agent is a sequence , where . We denote the space of randomized Markov policies for agent by . We write to denote the space of product Markov policies, which are joint policies in which each agent independently follows a policy in . In particular, a policy is specified by a collection , where . We additionally define , and for a policy , write to denote the collection of mappings , where denotes the tuple of all but player ’s policies.
When the Markov game is clear from context, for a policy we let denote the law of the trajectory when players select actions via , and let denote the corresponding expectation.
General (non-Markov) policies.
In addition to Markov policies, we will consider general history-dependent (or, non-Markov) policies, which select actions based on the entire sequence of states and actions observed up to the current step. To streamline notation, for , let denote the history of agent ’s states, actions, and rewards up to step . Let denote the space of all possible histories of agent up to step . For , a randomized general (i.e., non-Markov) policy of agent is a collection of mappings where is a mapping that takes the history observed by agent up to step and the current state and outputs a distribution over actions for agent .
We denote by the space of randomized general policies of agent , and further write to denote the space of product general policies; note that and . In particular, a policy is specified by a collection , where . When agents play according to a general policy , at each step , each agent, given the current state and their history , chooses to play an action , independently of all other agents. For a policy , we let and denote the law and expectation operator for the trajectory when players select actions via , and write to denote the collection of policies of all agents but , i.e., .
We will also consider distributions over product randomized general policies, namely elements of .777When is not a finite set, we take to be the set of Radon probability measures over equipped with the Borel $\sigma$-algebra. We will refer to elements of as distributional policies. To play according to some distributional policy , agents draw a randomized policy (so that ) and then play according to .
Remark 2.1 (Alternative definition for randomized general policies).
Instead of defining distributional policies as above, one might alternatively define as the set of distributions over agent ’s deterministic general policies, namely as the set . We show in Section D that this alternative definition is equivalent to our own: in particular, there is a mapping from to so that, for any Markov game, any policy produces identically distributed trajectories to its corresponding policy in . Further, this mapping is one-to-one if we identify policies that produce the same distributions over trajectories for all Markov games.
Deterministic policies.
It will be helpful to introduce notation for deterministic general (non-Markov) policies, which correspond to the special case of randomized policies where each policy exclusively maps to singleton distributions. In particular, a deterministic general policy of agent is a collection of mappings , where . We denote by the space of deterministic general policies of agent , and further write to denote the space of joint deterministic policies. We use the convention throughout that deterministic policies are denoted by the letter , whereas randomized policies are denoted by .
Value functions.
For a general policy $\pi$, we define the value function for agent $i$ as
$$V_i(\pi) := \mathbb{E}^{\pi}\Big[\textstyle\sum_{h=1}^{H} r_{i,h}\Big], \tag{1}$$
where $r_{i,h}$ denotes agent $i$'s reward at step $h$; this represents the expected reward that agent $i$ receives when each agent chooses their actions via $\pi$. For a distributional policy $\sigma$, we extend this notation by defining $V_i(\sigma) := \mathbb{E}_{\pi \sim \sigma}[V_i(\pi)]$.
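In the tabular case, the value of a product Markov policy can be computed exactly by backward induction over steps; the sketch below (illustrative names, same hypothetical tabular representation as the episode-sampling sketch above) does this for a single agent i. It enumerates joint actions, so it is only practical when the number of players is small.

```python
import itertools
import numpy as np

def value_backward_induction(mu, P, R, policies, i):
    """Exact value V_i of a product Markov policy in a tabular Markov game,
    computed by backward induction over steps h = H-1, ..., 0."""
    n, H, S = len(policies), len(P), len(mu)
    num_actions = [len(policies[j][0][0]) for j in range(n)]   # assumes a fixed action count per player
    V_next = np.zeros(S)                                       # value-to-go after the final step is 0
    for h in reversed(range(H)):
        V = np.zeros(S)
        for s in range(S):
            for a in itertools.product(*(range(k) for k in num_actions)):
                prob = np.prod([policies[j][h][s][a[j]] for j in range(n)])
                V[s] += prob * (R[h][s][a][i] + P[h][s][a] @ V_next)
        V_next = V
    return float(mu @ V_next)
```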
2.3 Normal-form games
To motivate the solution concepts we consider for Markov games, let us first revisit the notion of normal-form games, which may be interpreted as Markov games with a single state. For , an -player -action normal-form game is specified by a tuple of reward tensors , where each tensor is of order (i.e., has entries). We will write . We assume for simplicity that each player has the same number of actions, and identify each player’s action space with . Then an action profile is specified by ; if each player acts according to , then the reward for player is given by . Our hardness results will use the standard notion of Nash equilibrium in normal-form games. We define the -player -Nash problem to be the problem of computing an -approximate Nash equilibrium of a given -player -action normal-form game. (See Definition A.1 for a formal definition of -Nash equilibrium.) A celebrated result is that Nash equilibria are PPAD-hard to approximate, i.e., the 2-player -Nash problem is PPAD-hard for any constant [DGP09, CDT06]. We refer the reader to Section A.1 for further background on these concepts.
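As a concrete reference point, the following sketch (illustrative; it uses the standard additive notion of approximation, with the formal definition given in Definition A.1) computes the approximation error of a candidate mixed-strategy profile in a 2-player normal-form game.

```python
import numpy as np

def nash_gap(R1, R2, x, y):
    """Max gain from a unilateral deviation in a 2-player normal-form game with payoff
    matrices R1, R2 (both indexed by (a1, a2)) and mixed strategies x, y.
    Under the standard additive notion, (x, y) is an eps-Nash equilibrium iff the gap is <= eps."""
    v1, v2 = x @ R1 @ y, x @ R2 @ y
    gap1 = (R1 @ y).max() - v1      # player 1's best deviation gain
    gap2 = (x @ R2).max() - v2      # player 2's best deviation gain
    return max(gap1, gap2)
```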
2.4 Markov games: Equilibria, no-regret, and independent learning
We now turn our focus back to Markov games, and introduce the main solution concepts we consider, as well as the notion of no-regret. Since computing Nash equilibria is intractable even for normal-form games, much of the work on efficient equilibrium computation has focused on alternative notions of equilibrium, notably coarse correlated equilibria and correlated equilibria. We focus on coarse correlated equilibria: being a superset of correlated equilibria, any lower bound for computing a coarse correlated equilibrium implies a lower bound for computing a correlated equilibrium.
For a distributional policy and a randomized policy of player , we let denote the distributional policy which is given by the distribution of for (and denotes the marginal of on all players but ). For , we write to denote the policy given by . Let us fix a Markov game , which in particular determines the players’ value functions as in (1).
Definition 2.2 (Coarse correlated equilibrium).
For $\epsilon \geq 0$, a distributional policy $\sigma$ is defined to be an $\epsilon$-coarse correlated equilibrium (CCE) if for each agent $i$, it holds that $\max_{\pi_i^\dagger} V_i(\pi_i^\dagger \times \sigma_{-i}) - V_i(\sigma) \leq \epsilon$, where the maximum ranges over agent $i$'s randomized general policies.
The maximizing policy $\pi_i^\dagger$ can always be chosen to be deterministic, so $\sigma$ is an $\epsilon$-CCE if and only if the same condition holds with the maximum restricted to deterministic general policies.
Coarse correlated equilibria can be computed efficiently for both normal-form games and Markov games, and are fundamentally connected to the notion of no-regret and independent learning, which we now introduce.
Regret.
For a policy $\pi$, we denote the distributional policy which puts all its mass on $\pi$ by $\delta_{\pi}$. Thus, for a sequence of policies $\pi^{(1)}, \dots, \pi^{(T)}$, the distributional policy $\frac{1}{T}\sum_{t=1}^{T} \delta_{\pi^{(t)}}$ randomizes uniformly over the $\pi^{(t)}$. We define regret as follows.
Definition 2.3 (Regret).
Consider a sequence of policies $\pi^{(1)}, \dots, \pi^{(T)}$. For each agent $i$, the regret of agent $i$ with respect to this sequence is defined as:
$$\mathrm{Reg}_i\big(\pi^{(1)}, \dots, \pi^{(T)}\big) := \max_{\pi_i^\dagger}\ \frac{1}{T} \sum_{t=1}^{T} \Big( V_i\big(\pi_i^\dagger \times \pi_{-i}^{(t)}\big) - V_i\big(\pi^{(t)}\big) \Big), \tag{2}$$
where the maximum ranges over agent $i$'s randomized general policies. In (2) the maximum over $\pi_i^\dagger$ is always achieved by a deterministic general policy, so it may equivalently be taken over deterministic general policies.
The following standard result shows that the uniform average of any no-regret sequence forms an approximate coarse correlated equilibrium.
Fact 2.4 (No-regret is equivalent to CCE).
Suppose that a sequence of policies $\pi^{(1)}, \dots, \pi^{(T)}$ satisfies $\mathrm{Reg}_i(\pi^{(1)}, \dots, \pi^{(T)}) \leq \epsilon$ for each agent $i$. Then the uniform average of these policies, namely the distributional policy $\bar{\sigma} := \frac{1}{T}\sum_{t=1}^{T} \delta_{\pi^{(t)}}$, is an $\epsilon$-CCE.
Likewise, if a sequence of policies $\pi^{(1)}, \dots, \pi^{(T)}$ has the property that the distributional policy $\bar{\sigma} = \frac{1}{T}\sum_{t=1}^{T} \delta_{\pi^{(t)}}$ is an $\epsilon$-CCE, then $\mathrm{Reg}_i(\pi^{(1)}, \dots, \pi^{(T)}) \leq \epsilon$ for all $i$.
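Indeed (in the notation above), since $V_i$ is linear in the distributional policy, for each agent $i$,
$$\max_{\pi_i^\dagger}\ V_i\big(\pi_i^\dagger \times \bar{\sigma}_{-i}\big) - V_i(\bar{\sigma}) \;=\; \max_{\pi_i^\dagger}\ \frac{1}{T}\sum_{t=1}^{T} \Big( V_i\big(\pi_i^\dagger \times \pi_{-i}^{(t)}\big) - V_i\big(\pi^{(t)}\big) \Big) \;=\; \mathrm{Reg}_i\big(\pi^{(1)}, \dots, \pi^{(T)}\big) \;\leq\; \epsilon,$$
which gives the first direction; the second direction reads the same identity in reverse.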
No-regret learning.
Fact 2.4 is an immediate consequence of Definitions 2.3 and 2.2. A standard approach to decentralized equilibrium computation, which exploits Fact 2.4, is to select using independent no-regret learning algorithms. A no-regret learning algorithm for player selects based on the realized trajectories that player observes over the course of play,888An alternative model allows for player to have knowledge of the previous joint policies , when selecting . but with no knowledge of , so as to ensure that no-regret is achieved: . If each player uses their own, independent no-regret learning algorithm, this approach yields product policies , and the uniform average of the yields a CCE as long as all of the players can keep their regret small.999In Section 6, we discuss the implications of relaxing the stipulation that be product policies (for example, by allowing the use of shared randomness, as in V-learning). In short, allowing to be non-product essentially trivializes the problem.
For the special case of normal-form games, the no-regret learning approach has been fruitful. There are several efficient algorithms, including regret matching [HMC00], Hedge (also known as exponential weights) [Vov90, LW94, CBFH+97], and generalizations of Hedge based on the follow-the-regularized-leader (FTRL) framework [SS12], which ensure that each player’s regret after episodes is bounded above by (that is ), even when the other players’ actions are chosen adversarially. All of these guarantees, which bound regret by a sublinear function in , lead to efficient, decentralized computation of approximate coarse correlated equilibrium in normal-form games. The success of this motivates our central question, which is whether similar guarantees may be established for Markov games. In particular, a formal version of Problem 1.1 asks: Is there an efficient algorithm that, when adopted by all agents in a Markov game and run independently, ensures that for all , for some ?
3 Lower bound for Markovian algorithms
In this section we prove Theorem 1.2 (restated formally below as Theorem 3.2), establishing that in two-player Markov games, there is no computationally efficient algorithm that computes a sequence of product Markov policies so that each player has small regret under this sequence. This section serves as a warm-up for our results in Section 4, which remove the assumption that the policies are Markovian.
3.1 SparseMarkovCCE problem and computational model
As discussed in the introduction, our lower bounds for no-regret learning are a consequence of lower bounds for the SparseCCE problem. In what follows, we formalize this problem (specifically, the Markovian variant, which we refer to as SparseMarkovCCE), as well as our computational model.
Description length for Markov games (constant ).
Given a Markov game , we let denote the maximum number of bits needed to describe any of the rewards or transition probabilities in binary.101010We emphasize that is defined as the maximum number of bits required by any particular pair, not the total number of bits required for all pairs. We define . The interpretation of depends on the number of players : If is a constant (as will be the case in the current section and Section 4), then should be interpreted as the description length of the game , up to polynomial factors. In particular, for constant , the game can be described using bits. In Section 5, we discuss the interpretation of when is large.
The SparseMarkovCCE problem.
From Fact 2.4, we know that the problem of computing a sequence of joint product Markov policies for which each player has at most regret is equivalent to computing a sequence for which the uniform mixture forms an -approximate CCE. We define -SparseMarkovCCE as the computational problem of computing such a CCE directly.
Definition 3.1 (SparseMarkovCCE problem).
For an -player Markov game and parameters and (which may depend on the size of the game ), -SparseMarkovCCE is the problem of finding a sequence , with each , such that the distributional policy is an -CCE of (or equivalently, such that for all , ).
Decentralized learning algorithms naturally lead to solutions to the SparseMarkovCCE problem. In particular, consider any decentralized protocol which runs for episodes, where at each timestep , each player chooses a Markov policy to play, without knowledge of the other players’ policies (but possibly using the history); any strategy in which players independently run online learning algorithms falls under this protocol. If each player experiences overall regret at most , then the sequence is a solution to the -SparseMarkovCCE problem. However, one might expect the -SparseMarkovCCE problem to be much easier than decentralized learning, since it allows for algorithms that produce satisfying the constraints of Definition 3.1 in a centralized manner. The main result of this section, Theorem 3.2, rules out the existence of any efficient algorithms, including centralized ones, that solve the SparseMarkovCCE problem.
Before moving on, let us give a sense of what sort of scaling one should expect for the parameters and in the -SparseMarkovCCE problem. First, we note that there always exists a solution to the -SparseMarkovCCE problem in a Markov game, which is given by a (Markov) Nash equilibrium of the game; of course, Nash equilibria are intractable to compute in general.111Such a Nash equilibrium can be seen to exist by using backwards induction to specify the players’ joint distribution of play at each state at steps . For the special case of normal-form games (where there is only a single state, and ), no-regret learning (e.g., Hedge) yields a computationally efficient solution to the -SparseMarkovCCE problem, where the hides a factor. Refined convergence guarantees of [DFG21, AFK+22] improve upon this result, and yield an efficient solution to the -SparseMarkovCCE problem.
3.2 Main result
Theorem 3.2.
There is a constant so that the following holds. Let be given, and let and satisfy . Suppose there is an algorithm that, given the description of any 2-player Markov game with , solves the -SparseMarkovCCE problem in time , for some . Then, for each , the 2-player -Nash problem (Definition A.1) can be solved in time .
We emphasize that the range ruled out by Theorem 3.2 is the most natural parameter regime, since the runtime of any decentralized algorithm which runs for episodes and produces a solution to the SparseMarkovCCE problem is at least linear in . Using that 2-player -Nash is PPAD-complete for (for any ) [DGP09, CDT06, Rub18], we obtain the following corollary.
Corollary 3.3 (SparseMarkovCCE is PPAD-complete).
For any constant , if there is an algorithm which, given the description of a 2-player Markov game , solves the -SparseMarkovCCE problem in time , then PPAD ⊆ P.
The condition in Corollary 3.3 is set to ensure that for sufficiently large , so as to satisfy the condition of Theorem 3.2. Corollary 3.3 rules out the existence of a polynomial-time algorithm that solves the SparseMarkovCCE problem with accuracy polynomially small and polynomially large in . Using a stronger complexity-theoretic assumption, the Exponential Time Hypothesis for PPAD [Rub16], we can obtain a stronger hardness result which rules out efficient algorithms even when 1) the accuracy is constant and 2) is quasipolynomially large.121212This is a consequence of the fact that for some absolute constant , there are no polynomial-time algorithms for computing -Nash equilibria in 2-player normal-form games under the Exponential Time Hypothesis for PPAD [Rub16].
Corollary 3.4 (ETH-hardness of SparseMarkovCCE).
There is a constant such that if there exists an algorithm that solves the -SparseMarkovCCE problem in time, then the Exponential Time Hypothesis for PPAD fails to hold.
Proof overview.
The proof of Theorem 3.2 is based on a reduction, which shows that any algorithm that efficiently solves the -SparseMarkovCCE problem, for not too large, can be used to efficiently compute an approximate Nash equilibrium of any given normal-form game. In particular, fix , and let a 2-player normal form game with actions be given. We construct a Markov game with horizon and action sets identical to those of the game , i.e., . The state space of consists of states, which are indexed by joint action profiles; the transitions are defined so that the value of the state at step encodes the action profile taken by the agents at step .131313For technical reasons, this is only the case for even values of ; we discuss further details in the full proof in Section B.2. At each state of , the reward functions are given by the payoff matrices of , scaled down by a factor of (which ensures that the rewards received at each step belong to ). In particular, the rewards and transitions out of a given state do not depend on the identity of the state, and so can be thought of as a repeated game where is played times. The formal definition of is given in Definition B.3.
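To give a feel for the construction, here is a simplified sketch (it elides the even-step technicality mentioned in the footnote and makes an illustrative choice of initial distribution, so it is not the exact game of Definition B.3).

```python
import numpy as np

def repeated_game_markov_game(R1, R2, H):
    """Sketch of the reduction: a Markov game in which the 2-player normal-form game
    (R1, R2) is effectively played H times. States are indexed by joint action
    profiles; the next state deterministically records the profile just played; and
    per-step rewards are the normal-form payoffs scaled by 1/H so that cumulative
    rewards stay bounded. The actual construction (Definition B.3) differs in details."""
    A = R1.shape[0]
    states = [(a1, a2) for a1 in range(A) for a2 in range(A)]
    S = len(states)
    mu = np.full(S, 1.0 / S)                       # illustrative choice of initial distribution

    def transition(h, s, a1, a2):
        nxt = np.zeros(S)
        nxt[states.index((a1, a2))] = 1.0          # record the joint action just played
        return nxt

    def reward(h, s, a1, a2):
        return (R1[a1, a2] / H, R2[a1, a2] / H)    # independent of the state s

    return states, mu, transition, reward
```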
Fix any algorithm for the SparseMarkovCCE problem, and recall that for each step and state for , denotes the joint action distribution taken in at step for the sequence of produced by the algorithm. The bulk of the proof of Theorem 3.2 consists of proving a key technical result, Lemma B.4, which states that if indeed solves -SparseMarkovCCE, then there exists some tuple such that is an approximate Nash equilibrium for . With this established, it follows that we can find a Nash equilibrium efficiently by simply trying all choices for .
To prove Lemma B.4, we reason as follows. Assume that is an -CCE. If, by contradiction, none of the distributions are approximate Nash equilibria for , then it must be the case that for each , one of the players has a profitable deviation in with respect to the product strategy , at least for a constant fraction of the tuples . We will argue that if this were to be the case, it would imply that there exists a non-Markov deviation policy for at least one player in Definition 2.2, meaning that is not in fact an -CCE.
To sketch the idea, recall that to draw a trajectory from , we first draw an index uniformly at random, and then execute for an episode. We will show (roughly) that for each player , it is possible to compute a non-Markov deviation policy which, under the draw of a trajectory from , can “infer” the value of the index within the first few steps of the episode. The policy then, at each state and step after the first few steps, plays a best response to the opponent’s portion of the strategy . If, for each possible value of , none of the distributions are approximate Nash equilibria of , this means that at least one of the players can significantly increase their value in over that of by playing , which contradicts the assumption that is an -CCE.
It remains to explain how we can construct a non-Markov policy which “infers” the value of . Unfortunately, exactly inferring the value of in the fashion described above is impossible: for instance, if there are so that , then clearly it is impossible to distinguish between the cases and . Nevertheless, by using the fact that each player observes the full joint action profile played at each step , we can construct a non-Markov policy which employs Vovk’s aggregating algorithm for online density estimation [Vov90, CBL06] in order to compute a distribution which is close to for most .141414Vovk’s aggregating algorithm is essentially the exponential weights algorithm with the logarithmic loss. A detailed background for the algorithm is provided in Section B.1. This guarantee is stated formally in an abstract setting in Proposition B.2, and is instantiated in the proof of Theorem 3.2 in (Equation 6). As we show in Section B.2, approximating as we have described is sufficient to carry out the reasoning from the previous paragraph.
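Vovk's aggregating algorithm is, as noted in the footnote above, essentially exponential weights run with the logarithmic loss. A minimal, simplified sketch of this style of update (maintaining a posterior-style weight over the T candidate indices and predicting with the weighted mixture) is below; it is illustrative only, with hypothetical names, and the construction in Section B.2 applies a per-step, state-dependent version.

```python
import numpy as np

def aggregate_predictions(models, observations):
    """Exponential weights with log loss over a finite set of candidate 'models'.

    models[t]    : np.array giving the probability model t assigns to each observable symbol
                   (here, the joint action distribution induced by pi^t at the current state)
    observations : sequence of observed symbols (here, observed joint action profiles)
    Returns the mixture distribution used to predict each observation in turn.
    """
    T = len(models)
    logw = np.zeros(T)                       # uniform prior over the T candidate indices
    predictions = []
    for obs in observations:
        w = np.exp(logw - logw.max())
        w /= w.sum()
        predictions.append(sum(w[t] * models[t] for t in range(T)))
        # Log-loss update: each index's weight is multiplied by the likelihood it assigns to obs.
        logw += np.log(np.array([models[t][obs] for t in range(T)]) + 1e-300)
    return predictions
```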
4 Lower bound for non-Markov algorithms
In this section, we prove Theorem 1.3 (restated formally below as Theorem 4.3), which strengthens Theorem 3.2 by allowing the sequence of product policies to be non-Markovian. This additional strength comes at the cost of our lower bound only applying to 3-player Markov games (as opposed to Theorem 3.2, which applied to 2-player games).
4.1 SparseCCE problem and computational model
To formalize the computational model for the SparseCCE problem, we must first describe how the non-Markov product policies are represented. Recall that a non-Markov policy is, by definition, a mapping from agent ’s history and current state to a distribution over their next action. Since there are exponentially many possible histories, it is information-theoretically impossible to express an arbitrary policy in with polynomially many bits. As our focus is on computing a sequence of such policies in polynomial time, certainly a prerequisite is that can be expressed in polynomial space. Thus, we adopt the representational assumption, stated formally in Definition 4.1, that each of the policies is described by a bounded-size circuit that can compute the conditional distribution of each next action given the history. This assumption is satisfied by essentially all empirical and theoretical work concerning non-Markov policies (e.g., [LDGV+21, AVDG+22, JLWY21, SMB22]).
Definition 4.1 (Computable policy).
Given a -player Markov game and , we say that a policy is -computable if for each , there is a circuit of size that,151515For concreteness, we suppose that “circuit” means “boolean circuit” as in [AB06, Definition 6.1], where probabilities are represented in binary. The precise model of computation we use does not matter, though, and we could equally assume that the policies may be computed by Turing machines that terminate after steps. on input , outputs the distribution . A policy is -computable if each constituent policy is.
Our lower bound applies to algorithms that produce sequences for which each is -computable, where the value is taken to be polynomial in the description length of the game . For example, Markov policies whose probabilities can be expressed with bits are -computable for each player , since one can simply store each of the probabilities (for , , , ), each of which takes bits to represent.
The SparseCCE problem.
SparseCCE is the problem of computing a sequence of non-Markov product policies such that the uniform mixture forms an -approximate CCE. The problem generalizes SparseMarkovCCE (Definition 3.1) by relaxing the condition that the policies be Markov.
Definition 4.2 (SparseCCE Problem).
For an -player Markov game and parameters and (which may depend on the size of the game ), -SparseCCE is the problem of finding a sequence , with each being -computable, such that the distributional policy is an -CCE for (equivalently, such that for all , ).
4.2 Main result
Our main theorem for this section, Theorem 4.3, shows that for appropriate values of , , and , solving the -SparseCCE problem is at least as hard as computing Nash equilibria in normal-form games.
Theorem 4.3.
Fix , and let , and satisfy . Suppose there exists an algorithm that, given the description of any -player Markov game with , solves the -SparseCCE problem in time , for some . Then, for any , the -player -Nash problem can be solved in randomized time with failure probability , where is an absolute constant.
By analogy to Corollary 3.3, we obtain the following immediate consequence.
Corollary 4.4 (SparseCCE is hard under PPAD ⊄ RP).
For any constant , if there is an algorithm which, given the description of a 3-player Markov game , solves the -SparseCCE problem in time , then PPAD ⊆ RP.
Proof overview for Theorem 4.3.
The proof of Theorem 4.3 has a similar high-level structure to that of Theorem 3.2: given an -player normal-form game , we define an -player Markov game which has actions per player and horizon . The key difference in the proof of Theorem 4.3 is the structure of the players’ reward functions. To motivate this difference and the addition of an -th player, let us consider what goes wrong in the proof of Theorem 3.2 when the policies are allowed to be non-Markov. We will explain how a sequence can hypothetically solve the SparseCCE problem by attempting to punish any one player’s deviation policy, and thus avoid having to compute a Nash equilibrium of . In particular, for each player , suppose tries to detect, based on the state transitions and player ’s rewards, whether every other player is playing according to . If some player is not playing according to at some step , then at steps , the policy can select actions that attempt to minimize player ’s rewards. In particular, if player plays according to the policy that we described in Section 3.2, then other players can adjust their choice of actions in later rounds to decrease player ’s value.
This behavior is reminiscent of “tit-for-tat” strategies which are used to establish the folk theorem in the theory of repeated games [MF86, FLM94]. The folk theorem describes how Nash equilibria are more numerous (and potentially easier to find) in repeated games than in single-shot normal form games. As it turns out, the folk theorem does not provably yield worst-case computational speedups in repeated games, at least when the number of players is at least 3. Indeed, [BCI+08] gave an “anti-folk theorem”, showing that computing Nash equilibria in -player repeated games is PPAD-hard for , via a reduction from -player normal-form games. We utilize their reduction, for which the key idea is as follows: given an -player normal form game , we construct an -player Markov game in which the -th player acts as a kibitzer,161616Kibitzer is a Yiddish term for an observer who offers advice. with actions indexed by tuples , for and . The kibitzer’s action represents 1) a player to give advice to, and 2) their advice to the player, which is to take action . In particular, if the kibitzer plays , it receives reward equal to the amount that player would obtain by deviating to , and player receives the negation of the kibitzer’s reward. Furthermore, all other players receive 0 reward.
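Read literally, the stage rewards in this construction can be sketched as follows (a simplified, stage-game-only sketch under our reading of the description above; the actual reward functions are specified in Appendix C and may differ in details such as normalization).

```python
def kibitzer_stage_rewards(R, a, kibitzer_action):
    """Stage rewards in the kibitzer construction, read literally from the text above.

    R               : payoff tensors of the underlying m-player normal-form game;
                      R[j][profile] is player j's payoff under the joint action profile
    a               : joint action profile actually played by the m original players
    kibitzer_action : pair (j, b), i.e. advise player j to play action b
    """
    j, b = kibitzer_action
    a_dev = tuple(b if k == j else a[k] for k in range(len(a)))
    kibitzer_reward = R[j][a_dev]          # what player j would obtain by deviating to b
    rewards = [0.0] * (len(a) + 1)         # all other original players receive 0
    rewards[j] = -kibitzer_reward          # player j receives the negation
    rewards[-1] = kibitzer_reward          # last index is the kibitzer
    return rewards
```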
To see why the addition of the kibitzer is useful, suppose that solves the SparseCCE problem, so that is an -CCE. We will show that, with at least constant probability over a trajectory drawn from (which involves drawing uniformly), the joint strategy profile played by the first players constitutes an approximate Nash equilibrium of . Suppose for the purpose of contradiction that this were not the case. We show that there exists a non-Markov deviation policy for the kibitzer which, similar to the proof of Theorem 3.2, learns the value of and plays a tuple such that action increases player ’s payoff in , thereby increasing its own payoff. Even if the other players attempt to punish the kibitzer for this deviation, they will not be able to since, roughly speaking, the kibitzer game as constructed above has the property that for any strategy for the first players, the kibitzer can always achieve reward at least .
The above argument shows that under the joint policy (namely, the first players play according to and the kibitzer plays according to ), with constant probability over a trajectory drawn from this policy, the distribution of the first players’ actions is an approximate Nash equilibrium of . Thus, in order to efficiently find such a Nash equilibrium (see Algorithm 2), we need to simulate the policy , which involves running Vovk’s aggregating algorithm. This approach is in contrast to the proof of Theorem 3.2, for which Vovk’s aggregating algorithm was an ingredient in the proof but was not actually used in the Nash computation algorithm (Algorithm 1). The details of the proof of correctness of Algorithm 2 are somewhat delicate, and may be found in Appendix C.
Two-player games.
One intriguing question we leave open is whether the SparseCCE problem remains hard for two-player Markov games. Interestingly, as shown by [LS05], there is a polynomial time algorithm to find an exact Nash equilibrium for the special case of repeated two-player normal-form games. Though their result only applies in the infinite-horizon setting, it is possible to extend their results to the finite-horizon setting, which rules out naive approaches to extending the proof of Theorem 4.3 and Corollary 4.4 to two players.
5 Multi-player games: Statistical lower bounds
In this section we present Theorem 1.4 (restated formally below as Theorem 5.2), which gives a statistical lower bound for the SparseCCE problem. The lower bound applies to any algorithm, regardless of computational cost, that accesses the underlying Markov game through a generative model.
Definition 5.1 (Generative model).
For an -player Markov game , a generative model oracle is defined as follows: given a query described by a tuple , the oracle returns the distribution and the tuple of rewards .
From the perspective of lower bounds, the assumption that the algorithm has access to a generative model is quite reasonable, as it encompasses most standard access models in RL, including the online access model, in which the algorithm repeatedly queries a policy and observes a trajectory drawn from it, as well as the local access generative model used in [YHAY+22, WAJ+21]. We remark that it is slightly more standard to assume that queries to the generative model only return a sample from the distribution as opposed to the distribution itself [Kak03, KMN99], but since our goal is to prove lower bounds, the notion in Definition 5.1 only makes our results stronger.
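Operationally, Definition 5.1 corresponds to a simple oracle interface. A minimal sketch, assuming a hypothetical tabular representation of the game:

```python
class GenerativeModel:
    """Oracle access per Definition 5.1: a query (h, s, a) returns the full
    next-state distribution at step h and the tuple of per-player rewards."""

    def __init__(self, P, R):
        self.P = P              # P[h][s][a] : distribution over next states
        self.R = R              # R[h][s][a] : tuple of per-player rewards
        self.num_queries = 0    # query counter, the quantity bounded by Theorem 5.2

    def query(self, h, s, a):
        self.num_queries += 1
        return self.P[h][s][a], self.R[h][s][a]
```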
To state our main result, we recall the definition . In the present section, we consider the setting where the number of players is large. Here, does not necessarily correspond to the description length for , and should be interpreted, roughly speaking, as a measure of the description complexity of with respect to decentralized learning algorithms. In particular, from the perspective of an individual agent implementing a decentralized learning algorithm, their sample complexity should depend only on the size of their individual action set (as well as the global parameters ), as opposed to the size of the joint action set, which grows exponentially in ; the former is captured by , while the latter is not. Indeed, a key advantage shared by much prior work on decentralized RL [JLWY21, SMB22, MB21, DGZ22] is their avoidance of the curse of multi-agents, which describes the situation where an algorithm has sample and computational costs that scale exponentially in .
Our main result for this section, Theorem 5.2, states that for -player Markov games, exponentially many generative model queries (in ) are necessary to produce a solution to the -SparseCCE problem, unless itself is exponential in .
Theorem 5.2.
Let be given. There are constants so that the following holds. Suppose there is an algorithm which, given access to a generative model for a -player Markov game with , solves the -SparseCCE problem for for some satisfying , and any . Then must make at least queries to the generative model.
Theorem 5.2 establishes that there are -player Markov games, where the number of states, actions per player, and horizon are bounded by , but any algorithm with regret must make queries (via Fact 2.4). In particular, if there are queries per episode, as is standard in the online simulator model where a trajectory is drawn from the policy at each episode , then episodes are required to have regret . This is in stark contrast to the setting of normal-form games, where even for the case of bandit feedback (which is a special case of the generative model setting), standard no-regret algorithms have the property that each player’s regret scales as (i.e., independently of ), where denotes the number of actions per player [LS20]. As with our computational lower bounds, Theorem 5.2 is not limited to decentralized algorithms, and also rules out centralized algorithms which, with access to a generative model, compute a sequence of policies which constitutes a solution to the SparseCCE problem. Furthermore, it holds for arbitrary values of , thus allowing the policies solving the SparseCCE problem to be arbitrary general policies.
6 Discussion and interpretation
Theorems 3.2, 4.3, and 5.2 present barriers—both computational and statistical—toward developing efficient decentralized no-regret guarantees for multi-agent reinforcement learning. We emphasize that no-regret algorithms are the only known approach for obtaining fully decentralized learning algorithms (i.e., those which do not rely even on shared randomness) in normal-form games, and it seems unlikely that a substantially different approach would work in Markov games. Thus, these lower bounds for finding subexponential-length sequences of policies with the no-regret property represent a significant obstacle for fully decentralized multi-agent reinforcement learning. Moreover, these results rule out even the prospect of developing efficient centralized algorithms that produce no-regret sequences of policies, i.e., those which “resemble” independent learning. In this section, we compare our lower bounds with recent upper bounds for decentralized learning in Markov games, and explain how to reconcile these results.
6.1 Comparison to V-learning
The V-learning algorithm [JLWY21, SMB22, MB21] is a polynomial-time decentralized learning algorithm that proceeds in two phases. In the first phase, the agents interact over the course of episodes in a decentralized fashion, playing product Markov policies . In the second phase, the agents use data gathered during the first phase to produce a distributional policy , which we refer to as the output policy of V-learning. As discussed in Section 1, one implication of Theorem 3.2 is that the first phase of V-learning cannot guarantee each agent sublinear regret. Indeed if is of polynomial size (and ), this follows because a bound of the form for all implies that solves the -SparseMarkovCCE problem.
The output policy produced by V-learning is an approximate CCE (per Definition 2.2), and it is natural to ask how many product policies it takes to represent as a uniform mixture (that is, whether solves the -SparseMarkovCCE problem for a reasonable value of ). First, recall that V-learning requires episodes to ensure that is an -CCE. It is straightforward to show that can be expressed as a non-uniform mixture of at most policies in (we prove this fact in detail below). By discretizing the non-uniform mixture, one can equivalently represent it as a uniform mixture of product policies, up to error. Recalling the value of , we conclude that we can express as a uniform mixture of product policies in . Note that the lower bound of Theorem 4.3 rules out the efficient computation of an -CCE represented as a uniform mixture of efficiently computable policies in . Thus, in the regime where is polynomial in , this upper bound on the sparsity of the policy produced by V-learning matches that from Theorem 4.3, up to a polynomial in the exponent.
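The discretization step alluded to above is a standard rounding argument; a minimal illustrative sketch, which converts mixing weights over k policies into a uniform mixture of N (repeated) policy indices with total-variation error at most k/N:

```python
import numpy as np

def uniformize_mixture(weights, N):
    """Approximate the mixture sum_i weights[i] * pi_i by a uniform mixture over N
    policy indices (with repetition). The rounding changes each weight by at most
    1/N, so the total-variation error is at most k/N, where k = len(weights)."""
    weights = np.asarray(weights, dtype=float)
    counts = np.floor(weights * N).astype(int)
    # Hand out the remaining copies to the entries with the largest rounding remainders.
    remainders = weights * N - counts
    for idx in np.argsort(-remainders)[: N - counts.sum()]:
        counts[idx] += 1
    # Index i appears counts[i] times in the resulting uniform mixture of size N.
    return np.repeat(np.arange(len(weights)), counts)
```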
The sparsity of the output policy from V-learning.
We now sketch a proof of the fact that the output policy produced by V-learning can be expressed as a (non-uniform) average of policies in , where is the number of episodes in the algorithm’s initial phase. We adopt the notation and terminology from [JLWY21].
Consider Algorithm 3 of [JLWY21], which describes the second phase of V-learning and produces the output policy . We describe how to write as a weighted average of a collection of product policies, each of which is indexed by a function and a parameter : in particular, we will write , where are mixing weights summing to 1 and . The number of tuples is .
We define the mixing weight allocated to any tuple to be:
where and (for ) are defined as in [JLWY21].
Next, for each , we define to be the following policy: it maintains a parameter over the first steps of the episode (as in Algorithm 3 of [JLWY21]), but upon reaching state at step , given the present value of , sets , and updates , and then samples an action (where are defined in [JLWY21]). Since the mixing weights defined above exactly simulate the random draws of the parameter in Line 1 and the parameters in Line 4 of [JLWY21, Algorithm 3], it follows that the distributional policy defined by [JLWY21, Algorithm 3] is equal to .
6.2 No-regret learning against Markov deviations
As discussed in Section 1, [ELS+22] showed the existence of a learning algorithm with the property that if each agent plays it independently for episodes, then no player can achieve regret more than by deviating to any fixed Markov policy. This notion of regret corresponds to, in the context of Definition 2.3, replacing with the smaller quantity . Thus, the result of [ELS+22] applies to a weaker notion of regret than that of the SparseCCE problem, and so does not contradict any of our lower bounds. One may wonder which of these two notions of regret (namely, best possible gain via deviation to a Markov versus non-Markov policy) is the “right” one. We do not believe that there is a definitive answer to this question, but we remark that in many empirical applications of multi-agent reinforcement learning it is standard to consider non-Markov policies [LDGV+21, AVDG+22]. Furthermore, as shown in the proposition below, there are extremely simple games, e.g., of constant size, in which Markov deviations lead to “vacuous” behavior: in particular, all Markov policies have the same (suboptimal) value but the best non-Markov policy has much greater value:
Proposition 6.1.
There is a 2-player, 2-action, 1-state Markov game with horizon and a non-Markov policy for player 2 so that for all , yet .
Other recent work has also proved no-regret guarantees with respect to deviations to restricted policy classes. In particular, [ZLY22] studies a setting in which each agent is allowed to play policies in an arbitrary restricted policy class in each episode, and regret is measured with respect to deviations to any policy in . [ZLY22] introduces an algorithm, DORIS, with the property that when all agents play it independently, each agent experiences regret to their respective class .17 Note that in the tabular setting, the sample complexity of DORIS (Corollary 1) scales with the size of the joint action set, since each player’s value function class consists of the class of all functions , which has Eluder dimension scaling with , i.e., exponential in .
DORIS is not computationally efficient, since it involves performing exponential weights over the class , which requires space complexity . Nonetheless, one can compare the statistical guarantees the algorithm provides to our own results. Let denote the set of deterministic Markov policies of agent , namely sequences so that . In the case that , , we have , which means that DORIS obtains no-regret against Markov deviations when is constant, comparable to [ELS+22].18 [ELS+22] has the added bonus of computational efficiency, even for polynomially large , though has the significant drawback of assuming that the Markov game is known. However, we are interested in the setting in which each player’s regret is measured with respect to all deviations in (equivalently, ). Accordingly, if we take ,19 (DORIS plays distributions over policies in at each episode, whereas in our lower bounds we consider the setting where a policy in is played each episode; Facts D.2 and D.3 show that these two settings are essentially equivalent, in that any policy in can be simulated by one in , and vice versa.) then , meaning that DORIS does not imply any sort of sample-efficient guarantee, even for .
Finally, we remark that the algorithm DORIS [ZLY22], as well as the similar algorithm OPMD from earlier work of [LWJ22], obtains the same regret bound stated above even when the opponents are controlled by (possibly adaptive) adversaries. However, this guarantee crucially relies on the fact that any agent implementing DORIS must observe the policies played by opponents following each episode; this feature is the reason that the regret bound of DORIS does not contradict the exponential lower bound of [LWJ22] for no-regret learning against an adversarial opponent. As a result of being restricted to this “revealed-policy” setting, DORIS is not a fully decentralized algorithm in the sense we consider in this paper.
6.3 On the role of shared randomness
A key assumption in our lower bounds for no-regret learning is that each of the joint policies produced by the algorithm is a product policy; such an assumption is natural, since it subsumes independent learning protocols in which each agent selects without knowledge of . Compared to general (stochastic) joint policies, product policies have the desirable property that, to sample a trajectory from , the agents do not require access to shared randomness. In particular, each agent can independently sample its action from at each of the steps of the episode. It is natural to ask how the situation changes if we allow the agents to use shared random bits when sampling from their policies, which corresponds to allowing to be non-product policies. In this case, V-learning yields a positive result via a standard “batch-to-online” conversion: by applying the first phase of V-learning during the first episodes and playing trajectories sampled i.i.d. from the output policy produced by V-learning during the remaining episodes (which requires shared randomness), it is straightforward to see that a regret bound of order can be obtained. Similar remarks apply to SPoCMAR [DGZ22], which can obtain a slightly worse regret bound of order in the same fashion. In fact, the batch-to-online conversion approach gives a generic solution for the setting in which shared randomness is available. That is, the assumption of shared randomness eliminates any distinction between no-regret algorithms and (non-sparse) equilibrium computation algorithms, modulo a slight loss in rates. For this reason, the shared randomness assumption is too strong to develop any sort of distinct theory of no-regret learning.
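As a schematic illustration of the batch-to-online conversion (the symbols below are ours: $T$ is the total number of episodes, $T_0(\epsilon) \approx C/\epsilon^2$ is the number of episodes the first phase needs to produce an $\epsilon$-CCE, and $C$ stands in for the polynomial factors elided above):

$$\mathrm{Regret}(T) \;\le\; \underbrace{T_0(\epsilon)}_{\text{first phase}} \;+\; \underbrace{\epsilon \cdot \big(T - T_0(\epsilon)\big)}_{\text{playing the } \epsilon\text{-CCE}} \;\le\; \frac{C}{\epsilon^{2}} \;+\; \epsilon\, T,$$

and choosing $\epsilon = (C/T)^{1/3}$ gives $\mathrm{Regret}(T) \lesssim C^{1/3}\, T^{2/3}$; a different first-phase sample complexity yields a correspondingly different rate via the same calculation.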
6.4 Comparison to lower bounds for finding stationary CCE
A separate line of work [DGZ22, JMS22] has recently shown PPAD-hardness for the problem of finding stationary Markov CCE in infinite-horizon discounted stochastic games. These results are incomparable with our own: stationary Markov CCE are not sparse (in the sense of Definition 3.1), whereas we do not require stationarity of policies (as is standard in the finite-horizon setting).
6.5 Proof of Proposition 6.1
Below we prove Proposition 6.1.
Proof of Proposition 6.1.
We construct the claimed Markov game as follows. The single state is denoted by ; as there is only a single state, the transitions are trivial. We denote each player’s action space as . The rewards to player 1 are given as follows: for all ,
We allow the rewards of player 2 to be arbitrary; they do not affect the proof in any way.
We let be the policy which plays a uniformly random action at step 1 and then plays the same action at step 2: formally, , and . Then for any Markov policy of player 1, we must have , which means that .
On the other hand, any general (non-Markov) policy which satisfies
has . ∎
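Since the reward table in the construction above is not reproduced here, the following simulation sketch assumes the natural instantiation: player 1 receives reward 1 at step 2 exactly when its step-2 action matches player 2's step-2 action (and reward 0 at step 1), while $\pi_2$ plays a uniformly random action at step 1 and repeats it at step 2. Under this assumption, every Markov policy for player 1 has value 1/2, while the history-dependent "copycat" policy has value 1:

```python
import random

def value(player1, n_episodes=200_000, seed=0):
    """Monte Carlo estimate of player 1's value in the assumed 1-state, horizon-2 game."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        b = rng.randint(0, 1)                 # player 2's step-1 action, repeated at step 2
        _a1, a2 = player1(rng, b)
        total += 1.0 if a2 == b else 0.0      # assumed reward: match player 2 at step 2
    return total / n_episodes

def markov(p1, p2):
    # A Markov policy cannot condition on player 2's observed step-1 action.
    return lambda rng, b: (int(rng.random() < p1), int(rng.random() < p2))

def copycat(rng, b):
    # Non-Markov deviation: copy the action that player 2 was observed to play at step 1.
    return rng.randint(0, 1), b

print(value(markov(0.5, 0.5)), value(markov(1.0, 0.0)), value(copycat))  # ~0.5, ~0.5, 1.0
```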
Acknowledgements
This work was performed in part while NG was an intern at Microsoft Research. NG is supported at MIT by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. SK acknowledges funding from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation Grant under award #CCF-2212841.
References
- [AB06] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2006.
- [AFK+22] Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Tuomas Sandholm. Uncoupled learning dynamics with O(log T) swap regret in multiplayer games. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [AVDG+22] John P. Agapiou, Alexander Sasha Vezhnevets, Edgar A. Duéñez-Guzmán, Jayd Matyas, Yiran Mao, Peter Sunehag, Raphael Köster, Udari Madhushani, Kavya Kopparapu, Ramona Comanescu, DJ Strouse, Michael B. Johanson, Sukhdeep Singh, Julia Haas, Igor Mordatch, Dean Mobbs, and Joel Z. Leibo. Melting pot 2.0, 2022.
- [AYBK+13] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvari. Online learning in markov decision processes with adversarially chosen transition probability distributions. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
- [Bab16] Yakov Babichenko. Query complexity of approximate nash equilibria. J. ACM, 63(4), oct 2016.
- [BBD+22] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.
- [BCI+08] Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 365–372, 2008.
- [BJY20] Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
- [Bla56] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
- [BR17] Yakov Babichenko and Aviad Rubinstein. Communication complexity of approximate nash equilibria. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, page 878–889, New York, NY, USA, 2017. Association for Computing Machinery.
- [Bro49] George Williams Brown. Some notes on computation of games solutions. 1949.
- [BS18] Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
- [CBFH+97] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
- [CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
- [CCT17] Xi Chen, Yu Cheng, and Bo Tang. Well-Supported vs. Approximate Nash Equilibria: Query Complexity of Large Games. In Christos H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pages 57:1–57:9, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
- [CDT06] Xi Chen, Xiaotie Deng, and Shang-hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 603–612, 2006.
- [CP20] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18990–18999. Curran Associates, Inc., 2020.
- [DFG21] Constantinos Costis Daskalakis, Maxwell Fishelson, and Noah Golowich. Near-optimal no-regret learning in general games. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
- [DGP09] Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
- [DGZ22] Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of markov equilibrium in stochastic games, 2022.
- [ELS+22] Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, and Yishay Mansour. Regret minimization and convergence to equilibria in general-sum markov games, 2022.
- [FGGS13] John Fearnley, Martin Gairing, Paul Goldberg, and Rahul Savani. Learning equilibria of games via payoff queries. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, EC ’13, page 397–414, New York, NY, USA, 2013. Association for Computing Machinery.
- [FLM94] Drew Fudenberg, David Levine, and Eric Maskin. The folk theorem with imperfect public information. Econometrica, 62(5):997–1039, 1994.
- [FRSS22] Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. arXiv preprint arXiv:2206.13063, 2022.
- [Han57] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
- [HMC00] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- [JKSY20] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning (ICML), pages 4870–4879. PMLR, 2020.
- [JLWY21] Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent RL. arXiv preprint arXiv:2110.14555, 2021.
- [JMS22] Yujia Jin, Vidya Muthukumar, and Aaron Sidford. The complexity of infinite-horizon general-sum stochastic games, 2022.
- [Kak03] Sham M Kakade. On the sample complexity of reinforcement learning, 2003.
- [KECM21] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
- [KEG+22] János Kramár, Tom Eccles, Ian Gemp, Andrea Tacchetti, Kevin R. McKee, Mateusz Malinowski, Thore Graepel, and Yoram Bachrach. Negotiation and honesty in artificial intelligence methods for the board game of Diplomacy. Nature Communications, 13(1):7214, December 2022.
- [KMN99] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1324–1331, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
- [LDGV+21] Joel Z Leibo, Edgar A Dueñez-Guzman, Alexander Vezhnevets, John P Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with melting pot. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6187–6199. PMLR, 18–24 Jul 2021.
- [LS05] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. Decision Support Systems, 39:55–66, 2005.
- [LS20] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- [LW94] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
- [LWJ22] Qinghua Liu, Yuanhao Wang, and Chi Jin. Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 14036–14053. PMLR, 17–23 Jul 2022.
- [MB21] Weichao Mao and Tamer Basar. Provably efficient reinforcement learning in decentralized general-sum markov games. CoRR, abs/2110.05682, 2021.
- [MF86] Eric Maskin and D Fudenberg. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 53(3):533–554, 1986. Reprinted in A. Rubinstein (ed.), Game Theory in Economics, London: Edward Elgar, 1995. Also reprinted in D. Fudenberg and D. Levine (eds.), A Long-Run Collaboration on Games with Long-Run Patient Players, World Scientific Publishers, 2009, pp. 209-230.
- [MK15] Kleanthis Malialis and Daniel Kudenko. Distributed response to network intrusions using multiagent reinforcement learning. Engineering Applications of Artificial Intelligence, 41:270–284, 2015.
- [Nas51] John Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
- [Pap94] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. J. Comput. Syst. Sci., 48(3):498–532, 1994.
- [Put94] Martin Puterman. Markov Decision Processes. John Wiley & Sons, Ltd, 1 edition, 1994.
- [PVH+22] Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, and Karl Tuyls. Mastering the game of stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022.
- [Rou15] Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM, 2015.
- [Rub16] Aviad Rubinstein. Settling the complexity of computing approximate two-player Nash equilibria. In Annual Symposium on Foundations of Computer Science (FOCS), pages 258–265. IEEE, 2016.
- [Rub18] Aviad Rubinstein. Inapproximability of Nash equilibrium. SIAM Journal on Computing, 47(3):917–959, 2018.
- [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems (NIPS), pages 2989–2997, 2015.
- [Sha53] Lloyd Shapley. Stochastic Games. PNAS, 1953.
- [SHM+16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [SMB22] Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations, 2022.
- [SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, feb 2012.
- [SSS16] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016.
- [Vov90] Vladimir Vovk. Aggregating strategies. Proc. of Computational Learning Theory, 1990.
- [WAJ+21] Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, and Csaba Szepesvari. On query-efficient planning in mdps under linear realizability of the optimal state-value function. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4355–4385. PMLR, 15–19 Aug 2021.
- [YHAY+22] Dong Yin, Botao Hao, Yasin Abbasi-Yadkori, Nevena Lazić, and Csaba Szepesvári. Efficient local planning with linear function approximation. In Sanjoy Dasgupta and Nika Haghtalab, editors, Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pages 1165–1192. PMLR, 29 Mar–01 Apr 2022.
- [ZLY22] Wenhao Zhan, Jason D Lee, and Zhuoran Yang. Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in markov games. arXiv preprint arXiv:2206.01588, 2022.
- [ZTS+22] Stephan Zheng, Alexander Trott, Sunil Srinivasa, David C. Parkes, and Richard Socher. The ai economist: Taxation policy design via two-level deep multiagent reinforcement learning. Science Advances, 8(18):eabk2607, 2022.
Appendix A Additional preliminaries
A.1 Nash equilibria and computational hardness.
The most foundational and well known solution concept for normal-form games is the Nash equilibrium [Nas51].
Definition A.1 (-Nash problem).
For a normal-form game and , a product distribution is said to be an -Nash equilibrium for if for all ,
We define the -player -Nash problem to be the problem of computing an -Nash equilibrium of a given -player -action normal-form game.20 One must also take care to specify the bit complexity of representing a normal-form game. We assume that the payoffs of any normal-form game given as an instance to the -Nash problem can each be expressed with bits; this assumption is without loss of generality as long as (which it will be for us).
Informally, is an -Nash equilibrium if no player can gain more than in reward by deviating to a single fixed action , while all other players randomly choose their actions according to . Despite the intuitive appeal of Nash equilibria, they are intractable to compute: for any , it is PPAD-hard to solve the -Nash problem, namely, to compute -approximate Nash equilibria in 2-player -action normal-form games [DGP09, CDT06, Rub18]. We recall that the complexity class PPAD consists of all total search problems which have a polynomial-time reduction to the End-of-The-Line (EOTL) problem. PPAD is the most well-studied complexity class in algorithmic game theory, and it is widely believed that . We refer the reader to [DGP09, CDT06, Rub18, Pap94] for further background on the class PPAD and the EOTL problem.
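As a quick illustration of the deviation condition in Definition A.1 (the function and variable names below are ours), here is a minimal sketch that checks whether a given product distribution is an $\epsilon$-Nash equilibrium of a 2-player normal-form game:

```python
import numpy as np

def is_eps_nash(R1, R2, x, y, eps):
    """Check the eps-Nash condition for the 2-player game with payoff matrices
    R1 (row player) and R2 (column player) at the product distribution (x, y)."""
    val1, val2 = x @ R1 @ y, x @ R2 @ y
    best1 = np.max(R1 @ y)        # best payoff from deviating to a single fixed row
    best2 = np.max(R2.T @ x)      # best payoff from deviating to a single fixed column
    return best1 - val1 <= eps and best2 - val2 <= eps

# Matching pennies: the uniform product distribution is an exact Nash equilibrium.
R1 = np.array([[1.0, 0.0], [0.0, 1.0]])
R2 = 1.0 - R1
x = y = np.array([0.5, 0.5])
assert is_eps_nash(R1, R2, x, y, eps=1e-9)
```

Note that verifying the condition is easy; the PPAD-hardness discussed above concerns computing such a distribution.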
A.2 Query complexity of Nash equilibria
Our statistical lower bound for the SparseCCE problem in Theorem 5.2 relies on existing query complexity lower bounds for computing approximate Nash equilibria in -player normal-form games. We first review the query complexity model for normal-form games.
Oracle model for normal-form games.
For , consider an -player -action normal form game , specified by payoff tensors . Since the tensors contain a total of real-valued payoffs, in the setting when is large, it is unrealistic to assume that an algorithm is given the full payoff tensors as input. Therefore, prior work on computing equilibria in such games has studied the setting in which the algorithm makes adaptive oracle queries to the payoff tensors.
In particular, the algorithm, which is allowed to be randomized, has access to a payoff oracle for the game , which works as follows. At each time step, the algorithm can choose to specify an action profile and then query at the action profile . The oracle then returns the payoffs for each player if the action profile is played.
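A minimal sketch of this query model (the class and method names are ours, not part of any formal definition in the paper):

```python
import numpy as np

class PayoffOracle:
    """Query access to an m-player, n-action normal-form game: given an action
    profile, return every player's payoff. Tracks the number of queries made."""
    def __init__(self, payoff_tensors):           # payoff_tensors[i] has shape (n,) * m
        self.payoffs = payoff_tensors
        self.num_queries = 0

    def query(self, action_profile):
        self.num_queries += 1
        return [tensor[tuple(action_profile)] for tensor in self.payoffs]

# Example: a random 3-player, 2-action game queried at the profile (0, 1, 1).
rng = np.random.default_rng(0)
oracle = PayoffOracle([rng.random((2, 2, 2)) for _ in range(3)])
print(oracle.query((0, 1, 1)), oracle.num_queries)
```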
Query complexity lower bound for approximate Nash equilibrium.
The following theorem gives a lower bound on the number of queries any randomized algorithm needs to make to compute an approximate Nash equilibrium in an -player game.
Theorem A.2 (Corollary 4.5 of [Rub16]).
There is a constant so that any randomized algorithm which solves the -Nash problem for -player normal-form games with probability at least must use at least payoff queries.
Appendix B Proofs of lower bounds for SparseMarkovCCE (Section 3)
B.1 Preliminaries: Online density estimation
Our proof makes use of tools for online learning with the logarithmic loss, also known as conditional density estimation. In particular, we use a variant of the exponential weights algorithm known as Vovk’s aggregating algorithm in the context of density estimation [Vov90, CBL06]. We consider the following setting with two players, a Learner and Nature. Furthermore, there is a set , called the outcome space, and a set , called the context space; for our applications it suffices to assume and are finite. For some , there are time steps . At each time step :
- Nature reveals a context ;
- Having seen the context , the learner predicts a distribution ;
- Nature chooses an outcome , and the learner suffers loss
For each , we let denote the history of interaction up to step ; we emphasize that each context may be chosen adaptively as a function of . Let denote the sigma-algebra generated by . We measure performance in terms of regret against a set of experts, also known as the expert setting. Each expert consists of a function . The regret of an algorithm against the expert class when it receives contexts and observes outcomes is defined as
Note that the learner can observe the expert predictions and use them to make its own prediction at each round .
Proposition B.1 (Vovk’s aggregating algorithm).
Consider Vovk’s aggregating algorithm, which predicts via
(3)
This algorithm guarantees a regret bound of .
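For concreteness, the following sketch implements the aggregating algorithm for the logarithmic loss (class and variable names are ours; contexts are suppressed since the example experts below are constant, whereas in the text each expert maps a context to a distribution):

```python
import numpy as np

class AggregatingAlgorithm:
    """Vovk's aggregating algorithm for the log loss: keep a weight per expert
    proportional to the likelihood it assigned to past outcomes and predict with
    the weighted mixture. Its cumulative log loss exceeds that of the best expert
    by at most log(#experts)."""
    def __init__(self, num_experts):
        self.log_weights = np.zeros(num_experts)       # uniform prior over experts

    def predict(self, expert_dists):
        # expert_dists: array of shape (num_experts, num_outcomes), rows summing to 1
        w = np.exp(self.log_weights - self.log_weights.max())
        return (w / w.sum()) @ expert_dists            # mixture prediction over outcomes

    def update(self, expert_dists, outcome):
        # multiply each expert's weight by the probability it assigned to the outcome
        self.log_weights += np.log(expert_dists[:, outcome] + 1e-300)

# Tiny usage example: two constant experts on a binary outcome space, with outcomes
# generated by expert 0, so the algorithm's log loss tracks expert 0 up to log(2).
experts = np.array([[0.9, 0.1], [0.2, 0.8]])
alg, rng, loss = AggregatingAlgorithm(2), np.random.default_rng(1), 0.0
for _ in range(100):
    p = alg.predict(experts)
    y = int(rng.random() < 0.1)
    loss += -np.log(p[y])
    alg.update(experts, y)
```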
Recall that for probability distributions on a finite set , their total variation distance is defined as
(4)
As a (standard) consequence of Proposition B.1, in the realizable setting in which the distribution of follows for some fixed (unknown) expert , we can obtain a bound on the total variation distance between the algorithm’s predictions and those of .
Proposition B.2.
If the distribution of outcomes is realizable, i.e., there exists an expert so that for all , then the predictions of the aggregation algorithm (3) satisfy
For completeness, we provide the proof of Proposition B.2 here.
Proof of Proposition B.2.
To simplify notation, for an expert , a context , and an outcome , we write to denote .
Proposition B.1 gives that the following inequality holds (almost surely):
For each , note that and are -measurable (by definition). Then
where the first inequality uses Pinsker’s inequality and the final equality uses the fact that . It follows that
Jensen’s inequality now gives that
∎
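For reference, the chain of inequalities used in the proof above can be written schematically as follows (in our notation, with $q_t$ the algorithm's prediction and $f^\star$ the realizing expert; constants may differ from the paper's normalization):

$$\sum_{t=1}^{T} \mathbb{E}\big[d_{\mathrm{TV}}\big(q_t,\, f^\star(\cdot \mid x_t)\big)\big] \;\le\; \sqrt{T \sum_{t=1}^{T} \mathbb{E}\big[d_{\mathrm{TV}}^2\big(q_t,\, f^\star(\cdot \mid x_t)\big)\big]} \;\le\; \sqrt{\frac{T}{2} \sum_{t=1}^{T} \mathbb{E}\big[\mathrm{KL}\big(f^\star(\cdot \mid x_t)\,\big\|\,q_t\big)\big]} \;\le\; \sqrt{\frac{T \log|\mathcal{F}|}{2}},$$

where the three steps are Cauchy–Schwarz (with Jensen), Pinsker's inequality, and the fact that in the realizable case the expected sum of KL divergences equals the expected log-loss regret, which Proposition B.1 bounds by $\log|\mathcal{F}|$.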
B.2 Proof of Theorem 3.2
Proof of Theorem 3.2.
Fix , which we recall represents an upper bound on the description length of the Markov game. Assume that we are given an algorithm that solves the -SparseMarkovCCE problem for Markov games satisfying in time . We proceed to describe an algorithm which solves the 2-player -Nash problem in time , as long as . First, define , and consider an arbitrary 2-player -action normal-form game , which is specified by payoff matrices , so that all entries of the game can be written in binary using at most bits (recall, per footnote 20, that we may assume that the entries of an instance of -Nash can be specified with bits). Based on , we construct a 2-player Markov game as follows:
Definition B.3.
We define the game to consist of the tuple , where:
- The horizon of is (i.e., the largest even number at most ).
- Let ; the action spaces of the 2 agents are given by .
- There are a total of states: in particular, there is a state for each , as well as a distinguished state , so we have:
- For all odd , the reward to agents given that the action profile is played at step is given by , for all . All agents receive 0 reward at even steps .
- At odd steps , if actions are taken, the game transitions to the state . At even steps , the game always transitions to the state .
- The initial state (i.e., at step ) is (i.e., is a singleton distribution supported on ).
It is evident that this construction takes polynomial time, and satisfies . We will now show that, by applying the algorithm to , we can efficiently compute an -approximate Nash equilibrium for the original game . To do so, we appeal to Algorithm 1.
Algorithm 1 proceeds as follows. First, it constructs the 2-player Markov game as defined above, and calls the algorithm , which returns a sequence of product Markov policies with the property that the average is an -CCE of . It then enumerates over the distributions for each and odd, and checks whether each one is a -approximate Nash equilibrium of . If so, the algorithm outputs such a Nash equilibrium, and otherwise, it fails. The proof of Theorem 3.2 is thus completed by the following lemma, which states that as long as is an -CCE of , Algorithm 1 never fails.
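To make the enumerate-and-check step concrete, here is a minimal sketch of the verification loop at the heart of Algorithm 1 (names are ours; the per-step product distributions are assumed to have already been read off from the Markov policies returned by the SparseMarkovCCE algorithm):

```python
import numpy as np

def extract_nash_from_sparse_cce(R1, R2, stage_dists, eps_check):
    """Enumerate the per-step action distributions (x, y) played by the returned
    product Markov policies (one pair per policy index and odd step) and return
    the first pair that is an eps_check-Nash equilibrium of (R1, R2)."""
    for x, y in stage_dists:
        gain1 = np.max(R1 @ y) - x @ R1 @ y       # best-response gain for player 1
        gain2 = np.max(R2.T @ x) - x @ R2 @ y     # best-response gain for player 2
        if max(gain1, gain2) <= eps_check:
            return x, y
    return None   # Lemma B.4 shows this branch is never reached when the input is a valid CCE
```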
Lemma B.4 (Correctness of Algorithm 1).
Consider the normal form game and the Markov game as constructed above, which has horizon . For any , , if and are product Markov policies so that is an -CCE of , then there is some odd and so that is an -Nash equilibrium of .
The proof of Lemma B.4 is given below. Applying Lemma B.4 with (which is a valid application since by our assumption on ), yields that Algorithm 1 always finds a -Nash equilibrium of the -action normal form game , thus solving the given instance of the -Nash problem. Furthermore, it is straightforward to see that Algorithm 1 runs in time , for some constant .
∎
Proof of Lemma B.4.
Consider a sequence of product Markov policies with the property that the average is an -CCE of . For all odd and , let , which is the distribution played under by player at step (at the unique state with positive probability of being reached at step ). For odd , we have , and our goal is to show that for some odd and , is an -Nash equilibrium of . To proceed, suppose for the sake of contradiction that this is not the case.
Let us write to denote the set of odd-numbered steps, and to denote the set of even-numbered steps. Let . We first note that for , agent ’s value under the mixture policy is given as follows:
For each , we will derive a contradiction by constructing a (non-Markov) deviation policy for player in , denoted , which will give player a significant gain in value against the policy . To do so, we need to specify , for all and ; note that we may restrict our attention only to histories that occur with positive probability under the transitions of .
Fix any , , and . If occurs with positive probability under the transitions of , then for each , and both , the action played by agent at step is determined by . Namely, if the state at step of is , then player played action at step . So, for each with , we may define as the action profile played at step , which is a measurable function of . With this in mind, we define by applying Vovk’s aggregating algorithm (Proposition B.2) as follows.
1. If is even, play an arbitrary action (note that the actions at even-numbered steps have no influence on the transitions or rewards).
2. If is odd, define , by , where is defined as follows: for ,
Note that is a function of via the action profiles ; to simplify notation, we suppress this dependence.
3. Then for any state , define to be a best response to , namely
(5)
Note that, for odd , the distribution defined above can be viewed as an application of Vovk’s online aggregation algorithm at step in the following setting: the number of steps (, in the notation of Proposition B.2; note that plays a different role in the present proof) is , the context space is , and the outcome space is .21 Here denotes the index of the player who is not . There are experts (i.e., we have ), whose predictions on a context are defined as follows: the expert predicts . Then, the distribution is obtained by updating the aggregation algorithm with the context-observation pairs , for odd values of .
We next analyze the value of for to show that the deviation strategy we have defined indeed obtains significant gain. To do so, recall that this value represents the payoff for player under the process in which we draw an index uniformly at random, then for each step , player plays according to and player plays according to . (In particular, at odd-numbered steps, player plays according to .) We recall that denotes the expectation under this process. We let denote the random variable which is the history observed by player in this setup, i.e., when the policy played is , and let denote the action profiles for odd rounds, which are a measurable function of each player’s trajectory.
We apply Proposition B.2 with the time horizon as , and with the set of experts set to as defined above. The context sequence is the sequence of increasing values of , and for each , the outcome at step (for which the context is ) is distributed as conditioned on , which in particular satisfies the realizability assumption stated in Proposition B.2. Then, since (as remarked above), the distributions , for , are exactly the predictions made by Vovk’s aggregating algorithm, Proposition B.2 gives that22 (In fact, Proposition B.2 implies that a similar bound holds uniformly for each possible realization of , but Equation 6 suffices for our purposes.)
(6)
Recall that we have assumed for the sake of contradiction that is not an -Nash equilibrium of for each and . Consider a fixed draw of the random variable defined above. Then it holds that for and , defining
(7)
we have . Consider any , , and a history of agent up to step (conditioned on ). Let us write ; note that is a function of , through its dependence on . We have, by the definition of in (5) and the definition of ,
(8)
Combining Equation 7 and Equation 8, we get that for any fixed , , and ,
(9)
Averaging over the draw of , which we recall is chosen uniformly, we see that
(10)
(11)
(12)
where (10) follows from the definition , (11) follows from (9), and (12) uses (6). As long as , this expression is bounded below by , meaning that is not an -approximate CCE. This completes the contradiction. ∎
Appendix C Proofs of lower bounds for SparseCCE (Sections 4 and 5)
In this section we prove our computational lower bounds for solving the SparseCCE problem with players (Theorem 4.3 and Corollary 4.4), as well as our statistical lower bound for solving the SparseCCE problem with a general number of players (Theorem 5.2).
Both theorems are proven as consequences of a more general result given in Theorem C.1 below, which reduces the Nash problem in -player normal-form games to the SparseCCE problem in -player Markov games. In more detail, the theorem shows that (a) if an algorithm for SparseCCE makes few calls to a generative model oracle, then we get an algorithm for the Nash problem with few calls to a payoff oracle (see Section A.2 for background on the payoff oracle for the Nash problem), and (b) if the algorithm for SparseCCE is computationally efficient, then so is the algorithm for the Nash problem.
Theorem C.1.
There is a constant so that the following holds. Consider , and suppose and satisfy . Suppose there is an algorithm which, given a generative model oracle for a -player Markov game with , solves the -SparseCCE problem for using generative model oracle queries. Then the following conclusions hold:
- For any , the -player -Nash problem for any normal-form game can be solved, with failure probability , using at most queries to a payoff oracle for .
- If the algorithm additionally runs in time for some , then the algorithm solving Nash from the previous bullet point runs in time .
Proof of Theorem 4.3.
Suppose there is an algorithm which, given the description of any 3-player Markov game with , solves the -SparseCCE problem in time . Such an algorithm immediately yields an algorithm which can solve the -SparseCCE problem in time using only a generative model oracle, since the exact description of the Markov game can be obtained with queries to the generative model (across all tuples). We can now solve the problem of computing a -Nash equilibrium of a given 2-player -action normal-form game as follows. We simply apply the algorithm of Theorem C.1 with , noting that the oracle in the theorem statement can be implemented by reading the corresponding bits of the input game . The second bullet point yields that this algorithm takes time , for some constant . Furthermore, the assumption of Theorem C.1 is implied by the assumption that of Theorem 4.3. ∎
In a similar manner, Theorem 5.2 follows from Theorem C.1 by applying Theorem A.2, which states that there is no randomized algorithm that finds approximate Nash equilibria of -player, 2-action normal form games in time .
Proof of Theorem 5.2.
Let be the constant from Theorem A.2, and consider any . Suppose there is an algorithm which, for any -player Markov game with , makes oracle queries to a generative model oracle for , and solves the -SparseCCE problem for for some so that , for a sufficiently small absolute constant . Then, by Theorem C.1 with and (which ensures that as long as is sufficiently small), there is an algorithm which solves the -Nash problem—and thus the -Nash problem—for -player games with failure probability , using queries to a payoff oracle. But by Theorem A.2, any such algorithm requires queries to a payoff oracle. It follows that , as desired. ∎
C.1 Proof of Theorem C.1
Proof of Theorem C.1.
Fix any , . Suppose we are given an algorithm that solves the -player -SparseCCE problem for Markov games satisfying , running in time and using at most generative model queries. We proceed to describe an algorithm which solves the -player -Nash problem using queries to a payoff oracle, and running in time , where represents the failure probability. Define , and assume we are given an arbitrary -player -action normal-form game , which is specified by payoff matrices . We assume that all entries of each of the matrices have only the most significant bits nonzero; this assumption is without loss of generality, since by truncating the utilities to satisfy this assumption, we change all payoffs by at most , which degrades the quality of any approximate equilibrium by at most (in addition, we have since we have assumed ). We assume without loss of generality. Based on , we construct an -player Markov game as follows.
Definition C.2.
We define the Markov game as the tuple , where:
- The horizon of is chosen to be the power of satisfying .
- Let . The action spaces of agents are given by . The action space of agent is
so that .
We write to denote the joint action space of the first agents, and to denote the joint action space of all agents.
- There is a single state, denoted by , i.e., (in particular, is a singleton distribution supported on ).
- For all , the reward for agent , given an action profile at the unique state , is as follows: writing , we have
(13) where is defined per the kibitzer construction of [BCI+08]:
(14) In (13) above, is the binary representation of a binary encoding of the action profile . In particular, if the binary encoding of is , with , then . Note that takes bits to specify.
(15)
It is evident that this construction takes polynomial time and satisfies . Furthermore, it is clear that a single generative model oracle call for the Markov game (per Definition 5.1) can be implemented using at most 2 calls to the oracle for the normal-form game . We will now show that, by applying the algorithm to , we can efficiently (in terms of runtime and oracle calls) compute a -approximate Nash equilibrium for the original game . To do so, we appeal to Algorithm 2.
Algorithm 2 proceeds as follows. First, it calls the algorithm on the -player Markov game , using the oracle to simulate ’s calls to the generative model oracle for . By assumption, the algorithm returns a sequence of product policies of the form , so that each is -computable, and so that the average is an -CCE of . Next, Algorithm 2 samples a trajectory from in which:
- Players each play according to a policy for an index chosen uniformly at the start of the episode.
- In order to avoid exponential dependence on the number of players when computing an approximate best response to , we draw (for ) samples from and use these samples to compute the best response. In particular, letting denote the th sampled action profile, we construct a function in Lines 11 and 14 which, for each , is defined as the average over samples of the realized payoffs ; note that to compute the payoffs for each sample, Algorithm 2 needs only two oracle calls to . (A generic sketch of this sampling-based best-response step appears after this list.)
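The following generic sketch illustrates the sampling-based best-response step (it ignores the careful oracle-call accounting performed by Algorithm 2, and the names are ours):

```python
import numpy as np

def sampled_best_response(payoff_fn, opponent_dists, num_actions, num_samples, rng):
    """Estimate an approximate best response for one player against a product
    distribution over the other players' actions via Monte Carlo sampling; by
    Hoeffding, the error is roughly sqrt(log(num_actions) / num_samples)."""
    estimates = np.zeros(num_actions)
    for _ in range(num_samples):
        others = tuple(rng.choice(len(d), p=d) for d in opponent_dists)   # one joint sample
        estimates += np.array([payoff_fn(a, others) for a in range(num_actions)])
    estimates /= num_samples
    return int(np.argmax(estimates)), estimates

# Example: the payoff rewards matching the first opponent, who plays action 1 w.p. 0.8.
rng = np.random.default_rng(0)
opp = [np.array([0.2, 0.8]), np.array([0.5, 0.5])]
br, _ = sampled_best_response(lambda a, o: float(a == o[0]), opp, 2, 2_000, rng)   # br == 1
```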
The following lemma, proven in the sequel, gives a correctness guarantee for Algorithm 2.
Lemma C.3 (Correctness of Algorithm 2).
Given any -player -action normal form game , if the algorithm solves the -SparseCCE problem for the game with satisfying , then Algorithm 2 outputs a -approximate Nash equilibrium of with probability at least , and otherwise fails.
The assumption that from the statement of Theorem C.1 yields that , so Lemma C.3 yields that Algorithm 2 outputs a -Nash equilibrium of with probability at least (and otherwise fails). By iterating Algorithm 2 for times, we may thus compute a -Nash equilibrium of with failure probability .
We now analyze the oracle cost and computational cost of Algorithm 2. It takes oracle calls to to simulate the generative model oracle calls of , and therefore, if runs in time , then the call to on Line 5, using oracle calls to to simulate the generative model oracle calls, runs in time . Next, the computations of (and thus ) in Line 10 can be performed in time, the computation of in Line 14 requires time (and oracle calls to ) bounded above by , constructing the actions (for ) in Lines 13 and 14 takes time (using the fact that the policies are -computable), and constructing the rewards on Line 15 requires another oracle calls to . Altogether, Algorithm 2 requires oracle calls to and, if runs in time , then Algorithm 2 takes time , for some absolute constant .
∎
Remark C.4 (Bit complexity of exponential weights updates).
In the above proof we have noted that (as defined in Line 10 of Algorithm 2) can be computed in time . A detail we do not handle formally is that, since the values of are in general irrational, only the most significant bits of each real number can be computed in time . To give a truly polynomial-time implementation of Algorithm 2, one can compute only the most significant bits of each distribution , which is sufficient to approximate the true value of to within in total variation distance. Since only influences the subsequent execution of Algorithm 2 via the samples drawn in Line 11, by a union bound, the approximation of we have described perturbs the execution of the algorithm by at most in total variation distance. In particular, the correctness guarantee of Lemma C.3 still holds, with success probability at least .
Proof of Lemma C.3.
We will establish the following two facts:
1.
2. Second, we will show that, since is an -CCE of , the strategy cannot lead to a large increase of value for player , which will imply that Algorithm 2 must return a Nash equilibrium with high enough probability.
Defining for .
We begin by constructing the policy described; for later use in the proof, it will be convenient to construct a collection of closely related policies for , also representing strategies for deviating from the equilibrium .
Let be fixed. For , the mapping is defined as follows. Given a history (we assume without loss of generality that occurs with positive probability under some sequence of general policies) and a current state , we define through the following process.
1. First, we claim that for all players , it is possible to extract the trajectory from the trajectory of player .
(a) Recall that for each , from the definition in Equation 13 and the function , the bits following position of the reward given to player at step of the trajectory encode an action profile . Since occurs with positive probability, this is precisely the action profile which was played by agents at step . Note we also use here that by definition of the rewards in (13), the component of the reward only affects the first bits.
(b) For and , define .
(c) For , write ; in particular, is a deterministic function of . (Note that, since occurs with positive probability, the history observed by player up to step can be computed from it via Steps (a) and (b)). Going forward, for , we let denote the prefix of up to step .
2. Now, using that player can compute all players’ trajectories, for each we define
(16) where is defined as follows: for ,
(17) Note that is a random variable which depends on the trajectory (which can be computed from ). In addition, the definition of (for each ) is exactly as is defined in Line 10 of Algorithm 2.
3. For , define as follows:
(18) For the case , define (implicitly) to be the following distribution over : draw , define for , and finally set
(19) Note that, for each choice of , the distribution as defined above coincides with the distribution of the action defined in Eq. 15 in Algorithm 2, when player ’s history is and the state at step is . The following lemma, for use later in the proof, bounds the approximation error incurred in sampling .
Lemma C.5.
Fix any . With probability at least over the draw of , it holds that for all ,
which implies in particular that with probability at least over the draw of ,
(20)
It is immediate from our construction above that the following fact holds.
Lemma C.6.
The joint distribution of , for and , as computed by Algorithm 2, coincides with the distribution of in an episode of when players follow the policy .
Analyzing the distributions .
Fix any . We next prove some facts about the distributions defined above (as a function of ) in the process of computing .
For each , consider any choice of ; note that for each , the distributions for may be viewed as an application of Vovk’s aggregating algorithm (Proposition B.2) in the following setting: the number of steps (, in the context of Proposition B.2; note that has a different meaning in the present proof) is , the context space is , and the output space is . The expert set is (which has ), and the experts’ predictions on a context are defined via . Then for each , the distribution is obtained by updating the aggregating algorithm with the context-observation pairs for .
In more detail, fix any and with . We may apply Proposition B.2 with the number of steps set to , the set of experts as , and contexts and outcomes generated according to the distribution induced by running the policy in the Markov game as follows:
- For each , we are given, at steps , the actions and rewards for all agents , as well as the states .
– For each , set to be agent ’s history.
– The context fed to the aggregation algorithm at step is .
– The outcome at step is given by ; note that this choice satisfies the realizability assumption in Proposition B.2.
– To aid in generating the next context at step , choose for all and . Then set to be the next state given the transitions of and the action profile .
By Proposition B.2, it follows that for any fixed and with , under the process described above we have
(21)
Analyzing the value of .
Next, using the development above, we show that Algorithm 2 successfully computes a Nash equilibrium with constant probability (via ) whenever is an -CCE. We first state the following claim, which is proven in the sequel by analyzing the values for .
Lemma C.7.
If is an -CCE of , then it holds that for all ,
Note that in the game , since for all , and , it holds that (which holds since in (13), is multiplied by ), it follows that . Thus, by Lemma C.7, we have , and since is an -CCE of it follows that
(22)
To simplify notation, we will write in the below calculations, where we recall that each is determined given the history up to step , , as defined in (16) and (17). An action profile drawn from is denoted as , with . We may now write as follows:
where:
- The first inequality follows from the fact that takes values in and the fact that the total variation between product distributions is bounded above by the sum of total variation distances between each of the pairs of component distributions.
- •
- •
Rearranging and using (22) as well as the fact that (as ), we get that
Since is a product distribution a.s., we have that
Therefore, by Markov’s inequality, with probability at least over the choice of and the trajectories for (which collectively determine ), there is some so that
(23)
where the final inequality follows as long as , i.e., , which holds since and we have assumed that .
Note that (23) implies that with probability at least under an episode drawn from , there is some so that is a -Nash equilibrium of the stage game . Thus, by Lemma C.6, with probability at least under an episode drawn from the distribution of Algorithm 2, there is some so that is a -Nash equilibrium of .
Finally, the following two observations conclude the proof of Lemma C.3.
- •
- •
Taking a union bound over all of the probability- failure events from Lemma C.5 for the sampling (for ), as well as over the probability- event that there is no which is a -Nash equilibrium of , we obtain that with probability at least , Algorithm 2 outputs a -Nash equilibrium of . ∎
Finally, we prove the remaining claims stated without proof above.
Proof of Lemma C.5.
Since for each , by Hoeffding’s inequality, for any fixed , with probability at least over the draw of , it holds that
where the final inequality follows from the choice of . The statement of the lemma follows by a union bound over all actions . ∎
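For completeness, the form of Hoeffding's inequality and union bound invoked above, stated schematically in our notation (the paper's exact choice of the number of samples $N$ is not reproduced here): for i.i.d. $X_1, \dots, X_N \in [0, 1]$,

$$\Pr\Big[\Big|\tfrac{1}{N}\textstyle\sum_{j=1}^{N} X_j - \mathbb{E}[X_1]\Big| > \epsilon\Big] \;\le\; 2\exp(-2N\epsilon^2),$$

so a union bound over the $n$ actions gives failure probability at most $2n\exp(-2N\epsilon^2)$; in particular, $N \ge \frac{\log(2n/\delta)}{2\epsilon^2}$ samples suffice for accuracy $\epsilon$ simultaneously over all actions with probability at least $1 - \delta$.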
Proof of Lemma C.7.
Fix any agent . We will argue that the policy defined within the proof of Lemma C.3 satisfies . Since is an -CCE of , it follows that
from which the result of Lemma C.7 follows after rearranging terms.
To simplify notation, let us write , where we recall that each is determined given the history up to step , , as defined in (16) and (17). An action profile drawn from is denoted by , with . We compute
where:
- The first inequality follows from the fact that the rewards take values in and that the total variation between product distributions is bounded above by the sum of total variation distances between each of the pairs of component distributions.
- •
- The final inequality follows by Lemma C.8 below, applied to agent and to the distribution , which we recall is a product distribution almost surely.
∎
Lemma C.8.
For any , , and any product distribution , it holds that
Proof.
Choose . Now we compute
where the first inequality follows since is a product distribution, the second inequality uses that is non-negative, and the final inequality follows since by choice of we have for all . ∎
C.2 Remarks on bit complexity of the rewards
The Markov game constructed to prove Theorem C.1 uses lower-order bits of the rewards to record the action profile taken at each step. These lower-order bits may be used by each agent to infer what actions were taken by other agents at the previous step, and we use this idea to construct the best-response policies defined in the proof. As a result of this aspect of the construction, the rewards of the game each take bits to specify. As discussed in the proof of Theorem C.1, it is without loss of generality to assume that the payoffs of the given normal-form game take bits each to specify, so when either or , the construction of uses more bits to express its rewards than what is used for the normal-form game .
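A schematic round-trip of this encode/decode idea, in fixed-point integer arithmetic (the bit positions and helper names below are chosen for illustration and do not exactly match the construction in Definition C.2):

```python
def encode_reward(payoff_bits_int, action_profile, num_actions, num_players):
    """Pack a (truncated) payoff into the high-order bits and a base-num_actions
    encoding of the joint action profile into the low-order bits of one integer,
    to be read as the binary expansion of a reward in [0, 1)."""
    idx = 0
    for a in action_profile:                      # mixed-radix index of the joint profile
        idx = idx * num_actions + a
    profile_bits = (num_actions ** num_players - 1).bit_length()
    return (payoff_bits_int << profile_bits) | idx, profile_bits

def decode_action_profile(encoded, profile_bits, num_actions, num_players):
    """Recover the joint action profile from the low-order bits of the encoded reward."""
    idx = encoded & ((1 << profile_bits) - 1)
    profile = []
    for _ in range(num_players):
        profile.append(idx % num_actions)
        idx //= num_actions
    return list(reversed(profile))

# Round-trip check for a 3-player, 4-action profile.
enc, pb = encode_reward(0b1011, [2, 0, 3], num_actions=4, num_players=3)
assert decode_action_profile(enc, pb, num_actions=4, num_players=3) == [2, 0, 3]
```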
It is possible to avoid this phenomenon by instead using the state transitions of the Markov game to encode the action profile taken at each step, as was done in the proof of Theorem 3.2. The idea, which we sketch here, is to replace the game of Definition C.2 with the following game :
Definition C.9 (Alternative construction to Definition C.2).
Given an -player, -action normal-form game , we define the game as follows.
- The horizon of is .
- Let . The action spaces of agents are given by . The action space of agent is
so that .
We write to denote the joint action space of the first agents, and to denote the joint action space of all agents. Then .
- The state space is defined as follows. There are states, one for each action tuple . For each , we denote the corresponding state by .
- For all , the reward to agent given action profile at any state is as follows: writing ,
(24)
- At each step , if action profile is taken, the game transitions to the state .
Note that the number of states of is equal to , and so . As a result, if we were to use the game in place of in the proof of Theorem C.1, we would need to define to ensure that , and so the condition would be replaced by . This would only lead to a small quantitative degradation in the statement of Theorem 4.3, with the condition in the statement replaced by for some constant . However, it would render the statement of Theorem 5.2 essentially vacuous. For this reason, we opt to go with the approach of Definition C.2 as opposed to Definition C.9.
We expect that the construction of Definition C.2 can nevertheless still be modified to use bits to express each reward in the Markov game . In particular, one could introduce stochastic transitions to encode in the state of the Markov game a small number of random bits of the full action profile played at each step. We leave such an approach for future work.
Appendix D Equivalence between and
In this section we consider an alternate definition of the space of randomized general policies of player , and show that it is equivalent to the one we gave in Section 2.
In particular, suppose we were to define a randomized general policy of agent as a distribution over deterministic general policies of agent : we write to denote the space of such distributions. Moreover, write to denote the space of product distributions over agents’ deterministic policies. Our goal in this section is to show that policies in are equivalent to those in in the following sense: there is an embedding map , not depending on the Markov game, so that the distribution of a trajectory drawn from any , for any Markov game, is the same as the distribution of a trajectory drawn from (Fact D.2). Furthermore, is surjective in the following sense: any policy produces trajectories that are distributed identically to those of (and thus of ), for some (Fact D.3). In Definition D.1 below, we define .
Definition D.1.
For and , define to put the following amount of mass on each :
(25)
Furthermore, for , define .
Note that, in the special case that , is the point mass on .
Fact D.2 (Embedding equivalence).
Fix a -player Markov game and, arbitrary policies . Then a trajectory drawn from the product policy is distributed identically to a trajectory drawn from .
The proof of Fact D.2 is provided in Section D.1. Next, we show that the mapping is surjective in the following sense:
Fact D.3 (Right inverse of ).
There is a mapping so that for any Markov game and any , the distribution of a trajectory drawn from is identical to the distribution of a trajectory drawn from .
We will write . Fact D.3 states that the policy maps, under , to a policy in which is equivalent to (in the sense that their trajectories are identically distributed for any Markov game).
An important consequence of Fact D.2 is that the expected reward (i.e., value) under any is the same as that of . Thus given a Markov game, the induced normal-form game in which the players’ pure action sets are is equivalent to the normal-form game in which the players’ pure action sets are , in the following sense: for any mixed strategy in the former, namely a product distributional policy , the policy is a mixed strategy in the latter which gives each player the same value as under . (Note that is indeed a product distribution since is a product distribution and factors into individual coordinates.) Furthermore, by Fact D.3, any distributional policy in arises in this manner, for some ; in fact, may be chosen to place all its mass on a single . Since factors into individual coordinates, it follows that yields a one-to-one mapping between the coarse correlated equilibria (or any other notion of equilibria, e.g., Nash equilibria or correlated equilibria) of these two normal-form games.
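As a toy numerical check of Fact D.2, consider a single agent with one state, horizon two, and two actions; the sketch below uses the natural product-of-conditionals weighting for the embedding, which we take as our reading of Definition D.1 since the display (25) is not reproduced here:

```python
import itertools
import numpy as np

# Behavioral policy: pi1[a] at step 1; pi2[h][a] at step 2 given the step-1 action h.
pi1 = np.array([0.3, 0.7])
pi2 = np.array([[0.6, 0.4], [0.1, 0.9]])

# A deterministic general policy is (d1, d20, d21): the step-1 action and the step-2
# action for each possible observed history. The embedding assigns it the product of
# the behavioral probabilities of its choices at every history.
traj_dist = np.zeros((2, 2))
for d1, d20, d21 in itertools.product(range(2), repeat=3):
    weight = pi1[d1] * pi2[0][d20] * pi2[1][d21]
    a2 = (d20, d21)[d1]                  # the action actually taken at step 2
    traj_dist[d1, a2] += weight

# The mixture of deterministic policies induces the same trajectory distribution.
assert np.allclose(traj_dist, pi1[:, None] * pi2)
```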
D.1 Proofs of the equivalence
Proof of Fact D.2.
Consider any trajectory consisting of a sequence of states and actions and rewards for each of the agents. Assume that for all (as otherwise has probability 0 under any policy). Write:
Then the probability of observing under is
(26)
where, as usual, . Write . The probability of observing under is
(27)
It is now straightforward to see from the definition of in (25) that the quantities in (26) and (27) are equal. ∎
Proof of Fact D.3.
Fix a policy . We define to be the policy , which is defined as follows: for , , we have, for ,
If the denominator of the above expression is 0, then is defined to be an arbitrary distribution on . (For concreteness, let us say that it puts all its mass on a fixed action in .) Furthermore, for , define .
Next, fix any . Let . By Fact D.2, it suffices to show that the distribution of trajectories under is the same as the distribution of trajectories drawn from .
So consider any trajectory consisting of a sequence of states and actions and rewards for each of the agents. Assume that for all (as otherwise has probability 0 under any policy). Write:
Then the probability of observing under is
which is equal to the probability of observing under . ∎