Hardness of Independent Learning and
Sparse Equilibrium Computation in Markov Games

Dylan J. Foster
dylanfoster@microsoft.com
   Noah Golowich
nzg@mit.edu
   Sham M. Kakade
sham@seas.harvard.edu
Abstract

We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results for no-regret learning in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework was open. We provide a decisive negative resolution to this problem, both from a computational and a statistical perspective. We show that:

  1.

    Under the widely-believed complexity-theoretic assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant.

  2.

    When the game is unknown, no algorithm—regardless of computational efficiency—can achieve no-regret without observing a number of episodes that is exponential in the number of players.

Perhaps surprisingly, our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute by any means (centralized, decentralized, or otherwise) a coarse correlated equilibrium that is “sparse” in the sense that it can be represented as a mixture of a small number of “product” policies. The crux of our approach is a novel application of aggregation techniques from online learning [Vov90, CBL06], whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero-sum normal-form games; this enables the application of well-known hardness results for Nash.

1 Introduction

The framework of multi-agent reinforcement learning (MARL), which describes settings in which multiple agents interact in a dynamic environment, has played a key role in recent breakthroughs in artificial intelligence, including the development of agents that approach or surpass human performance in games such as Go [SHM+16], Poker [BS18], Stratego [PVH+22], and Diplomacy [KEG+22, BBD+22]. MARL also shows promise for real-world multi-agent systems, including autonomous driving [SSS16], cybersecurity [MK15], and economic policy [ZTS+22]. These applications, where reliability is critical, necessitate the development of algorithms that are practical and efficient, yet provide strong formal guarantees and robustness.

Multi-agent reinforcement learning is typically studied using the framework of Markov games (also known as stochastic games) [Sha53]. In a Markov game, agents interact over a finite number of steps: at each step, each agent observes the state of the environment, takes an action, and observes a reward which depends on the current state as well as the other agents’ actions. Then the environment transitions to a new state as a function of the current state and the actions taken. An episode consists of a finite number of such steps, and agents interact over the course of multiple episodes, progressively learning new information about their environment. Markov games generalize the well-known model of Markov Decision Processes (MDPs) [Put94], which describe the special case in which there is a single agent acting in a dynamic environment, and we wish to find a policy that maximizes its reward. By contrast, for Markov games, we typically aim to find a distribution over agents’ policies which constitutes some type of equilibrium.

1.1 Decentralized learning

In this paper, we focus on the problem of decentralized (or, independent) learning in Markov games. In decentralized MARL, each agent in the Markov game behaves independently, optimizing their policy myopically while treating the effects of the other agents as exogenous. Agents observe local information (in particular, their own actions and rewards), but do not observe the actions of the other agents directly. Decentralized learning enjoys a number of desirable properties, including scalability (computation is inherently linear in the number of agents), versatility (by virtue of independence, algorithms can be applied in uncertain environments in which the nature of the interaction and number of other agents are not known), and practicality (architectures for single-agent reinforcement learning can often be adapted directly). The central question we consider is whether there exist decentralized learning algorithms which, when employed by all agents in a Markov game, lead them to play near-equilibrium strategies over time.

Decentralized equilibrium computation in MARL is not well understood theoretically, and algorithms with provable guarantees are scarce. To motivate the challenges and most salient issues, it will be helpful to contrast with the simpler problem of decentralized learning in normal-form games, which may be interpreted as Markov games with a single state. Normal-form games enjoy a rich and celebrated theory of decentralized learning, dating back to Brown’s work on fictitious play [Bro49] and Blackwell’s theory of approachability [Bla56]. Much of the modern work on decentralized learning in normal-form games centers on no-regret learning, where agents select actions independently using online learning algorithms [CBL06] designed to minimize their regret (that is, the gap between realized payoffs and the payoff of the best fixed action in hindsight). In particular, a foundational result is that if each agent employs a no-regret learning strategy, then the average of the agents’ joint action distributions approaches a coarse correlated equilibrium (CCE) for the normal-form game [CBL06, Han57, Bla56]. CCE is a natural relaxation of the foundational concept of Nash equilibrium, which has the downside of being intractable to compute. On the other hand, there are many efficient algorithms that can achieve vanishing regret in a normal-form game, even when opponents select their actions in an arbitrary, potentially adaptive fashion, and thus converge to a CCE [Vov90, LW94, CBFH+97, HMC00, SALS15].
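As a concrete illustration of this connection, the following minimal sketch (Python, assuming NumPy; the random payoff matrices, step size, and horizon are illustrative choices of ours, not taken from the text) has two players independently run Hedge on a 2-player normal-form game and then measures the coarse-correlated-equilibrium gap of their empirical joint play.

```python
# Two players independently run Hedge (exponential weights) on a random
# 2-player normal-form game; the empirical average of their joint play
# approaches a coarse correlated equilibrium. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 3, 5000, 0.05
# Payoffs in [0, 1]: R1[a1, a2] is player 1's reward, R2[a1, a2] is player 2's.
R1, R2 = rng.random((n, n)), rng.random((n, n))

w1, w2 = np.zeros(n), np.zeros(n)   # cumulative rewards driving the Hedge weights
joint = np.zeros((n, n))            # empirical joint action distribution

for t in range(T):
    p1 = np.exp(eta * (w1 - w1.max())); p1 /= p1.sum()
    p2 = np.exp(eta * (w2 - w2.max())); p2 /= p2.sum()
    joint += np.outer(p1, p2) / T
    # Full-information feedback: each player sees the payoff every action
    # would have earned against the opponent's current mixed strategy.
    w1 += R1 @ p2
    w2 += R2.T @ p1

# CCE gap: best fixed-action deviation gain for each player under `joint`.
v1, v2 = np.sum(joint * R1), np.sum(joint * R2)
gap1 = (R1 @ joint.sum(axis=0)).max() - v1   # deviate against player 2's marginal
gap2 = (R2.T @ joint.sum(axis=1)).max() - v2
print(f"CCE gap: player 1 = {gap1:.4f}, player 2 = {gap2:.4f}")  # shrinks as T grows
```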

This simple connection between no-regret learning and decentralized convergence to equilibria has been influential in game theory, leading to numerous lines of research including fast rates of convergence to equilibria [SALS15, CP20, DFG21, AFK+22], price of anarchy bounds for smooth games [Rou15], and lower bounds on query and communication complexity for equilibrium computation [FGGS13, Rub16, BR17]. Empirically, no-regret algorithms such as regret matching [HMC00] and Hedge [Vov90, LW94, CBFH+97] have been used to compute equilibria that can achieve state-of-the-art performance in application domains such as Poker [BS18] and Diplomacy [BBD+22]. Motivated by these successes, we ask whether an analogous theory can be developed for Markov games. In particular:

Are there efficient algorithms for no-regret learning in Markov games?

Any Markov game can be viewed as a large normal-form game where each agent’s action space consists of their exponentially-sized space of policies, and their utility function is given by their expected reward. Thus, any learning algorithm for normal-form games can also be applied to Markov games, but the resulting sample and computational complexities will be intractably large. Our goal is to explore whether more efficient decentralized learning guarantees can be established.
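To make the size of this induced normal-form game concrete, here is a back-of-the-envelope count (our own illustration, using the notation introduced in Section 2; the symbol $\Pi^{\mathrm{markov,det}}_i$ for player $i$'s deterministic Markov policies is ours and is not used in the formal development).

```latex
% Illustrative count: a deterministic Markov policy picks one of A_i actions
% at each of the S states and each of the H steps, so
\bigl|\Pi^{\mathrm{markov,det}}_i\bigr| \;=\; A_i^{\,H\cdot S},
% which is already exponential in the game's description length; the set of
% general (history-dependent) deterministic policies is larger still.
```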

Challenges for no-regret learning.

In spite of active research effort and many promising pieces of progress [JLWY21, SMB22, MB21, DGZ22, ELS+22], no-regret learning guarantees for Markov games have been elusive. A barrier faced by naive algorithms is that it is intractable to ensure no-regret against an arbitrary adversary, both computationally [BJY20, AYBK+13] and statistically [LWJ22, KECM21, FRSS22].

Fortunately, many of the implications of no-regret learning (in particular, convergence to equilibria) do not require the algorithm to have sublinear regret against an arbitrary adversary, but rather only against other agents who are running the same algorithm independently. This observation has been influential in normal-form games, where the well-known line of work on fast rates of convergence to equilibrium [SALS15, CP20, DFG21, AFK+22] holds only in this more restrictive setting. This motivates the following relaxation to our central question.

Problem 1.1.

Is there an efficient algorithm that, when adopted by all agents in a Markov game and run independently, leads to sublinear regret for each individual agent?

Attempts to address Problem 1.1.

Two recent lines of research have made progress toward addressing Problem 1.1 and related questions. In one direction, several recent papers have provided algorithms, including V-learning [JLWY21, SMB22, MB21] and SPoCMAR [DGZ22], that do not achieve no-regret, but can nevertheless compute and then sample from a coarse correlated equilibrium in a Markov game in a (mostly) decentralized fashion, with the caveat that they require a shared source of random bits as a mechanism to coordinate. Notably, V-learning depends only mildly on the shared randomness: agents first play policies in a fully independent fashion (i.e., without shared randomness) according to a simple learning algorithm for $T$ episodes, and use shared random bits only once learning finishes, as part of a post-processing procedure to extract a CCE policy. A question left open by these works is whether the sequence of policies played by the V-learning algorithm in the initial independent phase can itself guarantee each agent sublinear regret; this would eliminate the need for a separate post-processing procedure and shared randomness.

Most closely related to our work, [ELS+22] recently showed that Problem 1.1 can be solved positively for a restricted setting in which regret for each agent is defined as the maximum gain in value they can achieve by deviating to a fixed Markov policy. Markov policies are those whose choice of action depends only on the current state, as opposed to the entire history of interaction. This notion of deviation is restrictive because in general, even when the opponent plays a sequence of Markov policies, the best response will be non-Markov. In challenging settings that abound in practice, it is standard to consider non-Markov policies [LDGV+21, AVDG+22], since they often achieve higher value than Markov policies; we provide a simple example in Proposition 6.1. Thus, while a regret guarantee with respect to the class of Markov policies (as in [ELS+22]) is certainly interesting, it may be too weak in general, and it is of great interest to understand whether Problem 1.1 can be answered positively in the general setting.\footnote{We remark that the V-learning and SPoCMAR algorithms mentioned above do learn equilibria that are robust to deviations to non-Markov policies, though they do not address Problem 1.1 since they do not have sublinear regret.}

1.2 Our contributions

We resolve Problem 1.1 in the negative, from both a computational and statistical perspective.

Computational hardness.

We provide two computational lower bounds (Theorems 1.2 and 1.3) which show that under standard complexity-theoretic assumptions, there is no efficient algorithm that runs for a polynomial number of episodes and guarantees each agent non-trivial (“sublinear”) regret when used in tandem by all agents. Both results hold even if the Markov game is explicitly known to the algorithm designer; Theorem 1.3 is stronger and more general, but applies only to 3-player games, while Theorem 1.2 applies to 2-player games, but only for agents restricted to playing Markovian policies.

To state our first result, Theorem 1.2, we define a product Markov policy to be a joint policy in which players choose their actions independently according to Markov policies (see Sections 2 and 3 for formal definitions). Note that if all players use independent no-regret algorithms to choose Markov policies at each episode, then their joint play at each round is described by a product Markov policy, since any randomness in each player’s policy must be generated independently.

Theorem 1.2 (Informal version of Corollary 3.3).

If $\textsf{PPAD}\neq\textsf{P}$, then there is no polynomial-time algorithm that, given the description of a 2-player Markov game, outputs a sequence of joint product Markov policies which guarantees each agent sublinear regret.

Theorem 1.2 provides a decisive negative resolution to Problem 1.1 under the assumption that $\textsf{PPAD}\neq\textsf{P}$,\footnote{Technically, the class we are denoting by $\textsf{P}$, namely the class of total search problems that have a deterministic polynomial-time algorithm, is sometimes denoted by $\mathsf{FP}$, as it is a search problem. We ignore this distinction.} which is standard in the theory of computational complexity [Pap94].\footnote{PPAD is the most well-studied complexity class in algorithmic game theory, and is widely believed to not admit polynomial-time algorithms. Notably, the problem of computing a Nash equilibrium for normal-form games with two or more players is PPAD-complete [DGP09, CDT06, Rub18].} Beyond simply ruling out the existence of fully decentralized no-regret algorithms, it rules out the existence of centralized algorithms that compute a sequence of product policies for which each agent has sublinear regret, even if such a sequence does not arise naturally as the result of agents independently following some learning algorithm. Salient implications include:

  • Theorem 1.2 provides a separation between Markov games and normal-form games, since standard no-regret algorithms for normal-form games i) run in polynomial time and ii) produce sequences of joint product policies that guarantee each agent sublinear regret. Notably, no-regret learning for normal-form games is efficient whenever the number of agents is polynomial, whereas Theorem 1.2 rules out polynomial-time algorithms for as few as two agents.

  • A question left open by the work of [JLWY21, SMB22, MB21] was whether the sequence of policies played by the V-learning algorithm during its independent learning phase can guarantee each agent sublinear regret. Since V-learning plays product Markov policies during the independent phase and is computationally efficient, Theorem 1.2 implies that these policies do not enjoy sublinear regret (assuming $\textsf{PPAD}\neq\textsf{P}$).

Our second result, Theorem 1.3, extends the guarantee of Theorem 1.2 to the more general setting in which agents can select arbitrary, potentially non-Markovian policies at each episode. This comes at the cost of only providing hardness for 3-player games as opposed to 2-player games, as well as relying on the slightly stronger complexity-theoretic assumption that $\textsf{PPAD}\nsubseteq\textsf{RP}$.\footnote{We use $\textsf{RP}$ to denote the class of total search problems for which there exists a polynomial-time randomized algorithm which outputs a solution with probability at least $2/3$, and otherwise outputs “fail”.}

Theorem 1.3 (Informal version of Corollary 4.4).

If $\textsf{PPAD}\nsubseteq\textsf{RP}$, then there is no polynomial-time algorithm that, given the description of a 3-player Markov game, outputs a sequence of joint product general policies (i.e., potentially non-Markov) which guarantees each agent sublinear regret.

Statistical hardness.

Theorems 1.2 and 1.3 rely on the widely-believed complexity-theoretic assumption that PPAD-complete problems cannot be solved in (randomized) polynomial time. Such a restriction is inherent if we assume that the game is known to the algorithm designer. To avoid complexity-theoretic assumptions, we consider a setting in which the Markov game is unknown to the algorithm designer, and algorithms must learn about the game by executing policies (“querying”) and observing the resulting sequences of states, actions, and rewards. Our final result, Theorem 1.4, shows unconditionally that, for $m$-player Markov games whose parameters are unknown, any algorithm computing a no-regret sequence as in Theorem 1.3 requires a number of queries that is exponential in $m$.

Theorem 1.4 (Informal version of Theorem 5.2).

Given query access to an $m$-player Markov game, no algorithm that makes fewer than $2^{\Omega(m)}$ queries can output a sequence of joint product policies which guarantees each agent sublinear regret.

Similar to our computational lower bounds, Theorem 1.4 goes far beyond decentralized algorithms, and rules out even centralized algorithms that compute a no-regret sequence by jointly controlling all players. The result provides another separation between Markov games and normal-form games, since standard no-regret algorithms for normal-form games can achieve sublinear regret using $\operatorname{poly}(m)$ queries for any $m$. The $2^{\Omega(m)}$ scaling in the lower bound, which does not rule out query-efficient algorithms when $m$ is constant, is to be expected for an unconditional result: if the game has only polynomially many parameters (which is the case for constant $m$), one can estimate all of the parameters using standard techniques [JKSY20], then directly find a no-regret sequence.

Proof techniques: the SparseCCE problem.

Rather than directly proving lower bounds for the problem of no-regret learning, we establish lower bounds for a simpler problem we refer to as SparseCCE. In the SparseCCE problem, the aim is to compute by any means (centralized, decentralized, or otherwise) a coarse correlated equilibrium that is “sparse” in the sense that it can be represented as the mixture of a small number of product policies. Any algorithm that computes a sequence of product policies with sublinear regret (in the sense of Theorem 1.3) immediately yields an algorithm for the SparseCCE problem: by the standard connection between no-regret and CCE, the uniform mixture of the policies in the no-regret sequence forms a sparse CCE. Thus, any lower bound for the SparseCCE problem yields a lower bound for the computation of no-regret sequences.

To provide lower bounds for the SparseCCE problem, we reduce from the problem of Nash equilibrium computation in normal-form games. We show that given any two-player normal-form game, it is possible to construct a Markov game (with two players in the case of Theorem 1.2 and three players in the case of Theorem 1.3) with the property that i) the description length is polynomial in the description length of the normal-form game, and ii) any (approximate) SparseCCE for the Markov game can be efficiently transformed into an approximate Nash equilibrium for the normal-form game. With this reduction established, our computational lower bounds follow from celebrated PPAD-hardness results for approximate Nash equilibrium computation in two-player normal-form games, and our statistical lower bounds follow from query complexity lower bounds for Nash. Proving the reduction from Nash to SparseCCE constitutes the bulk of our work, and makes novel use of aggregation techniques from online learning [Vov90, CBL06], as well as techniques from the literature on anti-folk theorems in game theory [BCI+08].

1.3 Organization

Section 2 presents preliminaries on no-regret learning and equilibrium computation in Markov games and normal-form games. Sections 3, 4 and 5 present our main results:

  • Sections 3 and 4 provide our computational lower bounds for no-regret in Markov games. Section 3 gives a lower bound for the setting in which algorithms are constrained to play Markovian policies, and Section 4 builds on this approach to give a lower bound for general, potentially non-Markovian policies.

  • Section 5 provides statistical (query complexity) lower bounds for multi-player Markov games.

Proofs are deferred to the appendix unless otherwise stated.

Notation.

For $n\in\mathbb{N}$, we write $[n]:=\{1,2,\ldots,n\}$. For a finite set $\mathcal{T}$, $\Delta(\mathcal{T})$ denotes the space of distributions on $\mathcal{T}$. For an element $t\in\mathcal{T}$, $\mathbb{I}_{t}\in\Delta(\mathcal{T})$ denotes the delta distribution that places probability mass $1$ on $t$. We adopt standard big-oh notation, and write $f=\widetilde{O}(g)$ to denote that $f=O(g\cdot\max\{1,\mathrm{polylog}(g)\})$, with $\Omega(\cdot)$ and $\widetilde{\Omega}(\cdot)$ defined analogously.

2 Preliminaries

This section contains the preliminaries necessary to present our main results. We first introduce the Markov game framework (Sections 2.1 and 2.2), then provide a brief review of normal-form games (Section 2.3), and finally introduce the concepts of coarse correlated equilibria and regret minimization (Section 2.4).

2.1 Markov games

We consider general-sum Markov games in a finite-horizon, episodic framework. For $m\in\mathbb{N}$, an $m$-player Markov game $\mathcal{G}$ consists of a tuple $\mathcal{G}=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m]},\mathbb{P},(R_{i})_{i\in[m]},\mu)$, where:

  • $\mathcal{S}$ denotes a finite state space and $H\in\mathbb{N}$ denotes a finite time horizon. We write $S:=|\mathcal{S}|$.

  • For $i\in[m]$, $\mathcal{A}_{i}$ denotes a finite action space for agent $i$. We let $\mathcal{A}:=\prod_{i=1}^{m}\mathcal{A}_{i}$ denote the joint action space and $\mathcal{A}_{-i}:=\prod_{i^{\prime}\neq i}\mathcal{A}_{i^{\prime}}$. We denote joint actions in boldface, for example, $\mathbf{a}=(a_{1},\ldots,a_{m})\in\mathcal{A}$. We write $A_{i}:=|\mathcal{A}_{i}|$ and $A:=|\mathcal{A}|$.

  • $\mathbb{P}=(\mathbb{P}_{1},\ldots,\mathbb{P}_{H})$ is the transition kernel, with each $\mathbb{P}_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ denoting the kernel for step $h\in[H]$. In particular, $\mathbb{P}_{h}(s^{\prime}|s,\mathbf{a})$ is the probability of transitioning to $s^{\prime}$ from the state $s$ at step $h$ when agents play $\mathbf{a}$.

  • For $i\in[m]$ and $h\in[H]$, $R_{i,h}:\mathcal{S}\times\mathcal{A}\rightarrow[-1/H,1/H]$ is the instantaneous reward function of agent $i$:\footnote{We assume that rewards lie in $[-1/H,1/H]$ for notational convenience, as this ensures that the cumulative reward for each episode lies in $[-1,1]$. This assumption is not important to our results.} the reward agent $i$ receives in state $s$ at step $h$ if agents play $\mathbf{a}$ is given by $R_{i,h}(s,\mathbf{a})$.\footnote{We consider Markov games in which the rewards at each step are a deterministic function of the state and action profile. While some works consider the more general case of stochastic rewards, since our main goal is to prove lower bounds, it is without loss for us to assume that rewards are deterministic.}

  • $\mu\in\Delta(\mathcal{S})$ denotes the initial state distribution.

An episode in the Markov game proceeds as follows:

  • The initial state $s_{1}$ is drawn from the initial state distribution $\mu$.

  • For each $h\leq H$, given the state $s_{h}$, each agent $i$ plays an action $a_{i,h}\in\mathcal{A}_{i}$; given the joint action profile $\mathbf{a}_{h}=(a_{1,h},\ldots,a_{m,h})$, each agent $i$ receives reward $r_{i,h}=R_{i,h}(s_{h},\mathbf{a}_{h})$ and the state of the system transitions to $s_{h+1}\sim\mathbb{P}_{h}(\cdot|s_{h},\mathbf{a}_{h})$.

We denote the tuple of agents' rewards at each step $h$ by $\mathbf{r}_{h}=(r_{1,h},\ldots,r_{m,h})$, and refer to the resulting sequence $\tau_{H}:=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1}),\ldots,(s_{H},\mathbf{a}_{H},\mathbf{r}_{H})$ as a trajectory. For $h\in[H]$, we define the prefix of the trajectory via $\tau_{h}:=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1}),\ldots,(s_{h},\mathbf{a}_{h},\mathbf{r}_{h})$.
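To make the episodic protocol above concrete, here is a minimal simulation sketch in Python. The array-based layout (a `MarkovGame` dataclass and a `rollout` helper) is an assumption made for this example, not a data structure from the paper; policies here are product Markov policies, whereas general (non-Markov) policies would additionally condition on the history $\tau_{i,h-1}$.

```python
# Illustrative sketch of one episode of an m-player finite Markov game played
# under a product Markov policy. Layout: transitions and rewards are stored as
# dense arrays indexed by step, state, and (flattened) joint action.
from dataclasses import dataclass
import numpy as np

@dataclass
class MarkovGame:
    P: np.ndarray        # transitions, shape (H, S, A_joint, S)
    R: np.ndarray        # rewards,     shape (m, H, S, A_joint), values in [-1/H, 1/H]
    mu: np.ndarray       # initial state distribution, shape (S,)
    action_sizes: tuple  # (A_1, ..., A_m); A_joint = prod(action_sizes)

def rollout(game: MarkovGame, policies, rng):
    """Play one episode; policies[i][h] is an (S, A_i) array of action probabilities."""
    m, H = len(game.action_sizes), game.P.shape[0]
    s = rng.choice(len(game.mu), p=game.mu)
    trajectory = []
    for h in range(H):
        # Each agent draws its action independently from its own Markov policy.
        actions = tuple(rng.choice(game.action_sizes[i], p=policies[i][h][s])
                        for i in range(m))
        a_joint = np.ravel_multi_index(actions, game.action_sizes)
        rewards = game.R[:, h, s, a_joint]
        trajectory.append((s, actions, rewards.copy()))
        s = rng.choice(game.P.shape[-1], p=game.P[h, s, a_joint])
    return trajectory
```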

Indexing.

We use the following notation: for a quantity $x$ (e.g., action, reward, etc.) indexed by agents, i.e., $x=(x_{1},\ldots,x_{m})$, and an agent $i\in[m]$, we write $x_{-i}=(x_{1},\ldots,x_{i-1},x_{i+1},\ldots,x_{m})$ to denote the tuple consisting of all $x_{i^{\prime}}$ for $i^{\prime}\neq i$.

2.2 Policies and value functions

We now introduce the notions of policies and value functions for Markov games. Policies are mappings from states (or sequences of states) to actions for the agents. We consider several different types of policies; the distinctions between them play a crucial role in separating the types of equilibria that are tractable to compute efficiently from those that are not.

Markov policies.

A randomized Markov policy for agent $i$ is a sequence $\sigma_{i}=(\sigma_{i,1},\ldots,\sigma_{i,H})$, where $\sigma_{i,h}:\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i})$. We denote the space of randomized Markov policies for agent $i$ by $\Pi^{\mathrm{markov}}_{i}$. We write $\Pi^{\mathrm{markov}}:=\Pi^{\mathrm{markov}}_{1}\times\cdots\times\Pi^{\mathrm{markov}}_{m}$ to denote the space of product Markov policies, which are joint policies in which each agent $i$ independently follows a policy in $\Pi^{\mathrm{markov}}_{i}$. In particular, a policy $\sigma\in\Pi^{\mathrm{markov}}$ is specified by a collection $\sigma=(\sigma_{1},\ldots,\sigma_{H})$, where $\sigma_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A}_{1})\times\cdots\times\Delta(\mathcal{A}_{m})$. We additionally define $\Pi^{\mathrm{markov}}_{-i}:=\prod_{i^{\prime}\neq i}\Pi^{\mathrm{markov}}_{i^{\prime}}$, and for a policy $\sigma\in\Pi^{\mathrm{markov}}$, write $\sigma_{-i}$ to denote the collection of mappings $\sigma_{-i}=(\sigma_{-i,1},\ldots,\sigma_{-i,H})$, where $\sigma_{-i,h}:\mathcal{S}\rightarrow\prod_{i^{\prime}\neq i}\Delta(\mathcal{A}_{i^{\prime}})$ denotes the tuple of all but player $i$'s policies.

When the Markov game $\mathcal{G}$ is clear from context, for a policy $\sigma\in\Pi^{\mathrm{markov}}$ we let $\mathbb{P}_{\sigma}[\cdot]$ denote the law of the trajectory $\tau$ when players select actions via $\mathbf{a}_{h}\sim\sigma(s_{h})$, and let $\mathbb{E}_{\sigma}[\cdot]$ denote the corresponding expectation.

General (non-Markov) policies.

In addition to Markov policies, we will consider general history-dependent (or, non-Markov) policies, which select actions based on the entire sequence of states and actions observed up to the current step. To streamline notation, for $i\in[m]$, let $\tau_{i,h}=(s_{1},a_{i,1},r_{i,1},\ldots,s_{h},a_{i,h},r_{i,h})$ denote the history of agent $i$'s states, actions, and rewards up to step $h$. Let $\mathscr{H}_{i,h}=(\mathcal{S}\times\mathcal{A}_{i}\times[0,1])^{h}$ denote the space of all possible histories of agent $i$ up to step $h$. For $i\in[m]$, a randomized general (i.e., non-Markov) policy of agent $i$ is a collection of mappings $\sigma_{i}=(\sigma_{i,1},\ldots,\sigma_{i,H})$, where $\sigma_{i,h}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i})$ is a mapping that takes the history observed by agent $i$ up to step $h-1$ and the current state and outputs a distribution over actions for agent $i$.

We denote by $\Pi^{\mathrm{gen,rnd}}_{i}$ the space of randomized general policies of agent $i$, and further write $\Pi^{\mathrm{gen,rnd}}:=\Pi^{\mathrm{gen,rnd}}_{1}\times\cdots\times\Pi^{\mathrm{gen,rnd}}_{m}$ to denote the space of product general policies; note that $\Pi^{\mathrm{markov}}_{i}\subset\Pi^{\mathrm{gen,rnd}}_{i}$ and $\Pi^{\mathrm{markov}}\subset\Pi^{\mathrm{gen,rnd}}$. In particular, a policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$ is specified by a collection $(\sigma_{i,h})_{i\in[m],h\in[H]}$, where $\sigma_{i,h}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i})$. When agents play according to a general policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, at each step $h$, each agent $i$, given the current state $s_{h}$ and their history $\tau_{i,h-1}\in\mathscr{H}_{i,h-1}$, chooses an action $a_{i,h}\sim\sigma_{i,h}(\tau_{i,h-1},s_{h})$, independently from all other agents. For a policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we let $\mathbb{P}_{\sigma}[\cdot]$ and $\mathbb{E}_{\sigma}[\cdot]$ denote the law and expectation operator for the trajectory $\tau$ when players select actions via $\mathbf{a}_{h}\sim\sigma(\tau_{h-1},s_{h})$, and write $\sigma_{-i}$ to denote the collection of policies of all agents but $i$, i.e., $\sigma_{-i}=(\sigma_{j,h})_{h\in[H],j\in[m]\backslash\{i\}}$.

We will also consider distributions over product randomized general policies, namely elements of $\Delta(\Pi^{\mathrm{gen,rnd}})$.\footnote{When $\mathcal{T}$ is not a finite set, we take $\Delta(\mathcal{T})$ to be the set of Radon probability measures over $\mathcal{T}$ equipped with the Borel $\sigma$-algebra.} We will refer to elements of $\Delta(\Pi^{\mathrm{gen,rnd}})$ as distributional policies. To play according to some distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$, agents draw a randomized policy $\sigma\sim P$ (so that $\sigma\in\Pi^{\mathrm{gen,rnd}}$) and then play according to $\sigma$.

Remark 2.1 (Alternative definition for randomized general policies).

Instead of defining distributional policies as above, one might alternatively define $\Pi^{\mathrm{gen,rnd}}_{i}$ as the set of distributions over agent $i$'s deterministic general policies, namely as the set $\Delta(\Pi^{\mathrm{gen,det}}_{i})$. We show in Section D that this alternative definition is equivalent to our own: in particular, there is a mapping from $\Pi^{\mathrm{gen,rnd}}$ to $\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m})$ so that, for any Markov game, any policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$ produces identically distributed trajectories to its corresponding policy in $\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m})$. Further, this mapping is one-to-one if we identify policies that produce the same distributions over trajectories for all Markov games.

Deterministic policies.

It will be helpful to introduce notation for deterministic general (non-Markov) policies, which correspond to the special case of randomized policies where each policy $\sigma_{i,h}$ exclusively maps to singleton distributions. In particular, a deterministic general policy of agent $i$ is a collection of mappings $\pi_{i}=(\pi_{i,1},\ldots,\pi_{i,H})$, where $\pi_{i,h}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\mathcal{A}_{i}$. We denote by $\Pi^{\mathrm{gen,det}}_{i}$ the space of deterministic general policies of agent $i$, and further write $\Pi^{\mathrm{gen,det}}:=\Pi^{\mathrm{gen,det}}_{1}\times\cdots\times\Pi^{\mathrm{gen,det}}_{m}$ to denote the space of joint deterministic policies. We use the convention throughout that deterministic policies are denoted by the letter $\pi$, whereas randomized policies are denoted by $\sigma$.

Value functions.

For a general policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we define the value function for agent $i\in[m]$ as

$$V_{i}^{\sigma}:=\mathbb{E}_{\sigma}\left[\sum_{h=1}^{H}R_{i,h}(s_{h},\mathbf{a}_{h})\ \middle|\ s_{1}\sim\mu\right]; \tag{1}$$

this represents the expected reward that agent $i$ receives when each agent chooses their actions via $a_{i,h}\sim\sigma_{i,h}(\tau_{i,h-1},s_{h})$. For a distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$, we extend this notation by defining $V_{i}^{P}:=\mathbb{E}_{\sigma\sim P}[V_{i}^{\sigma}]$.
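As a sanity check on definition (1), $V_{i}^{\sigma}$ can be estimated by Monte Carlo rollouts; the sketch below (Python; `sample_episode` is a hypothetical stand-in for a simulator of the joint policy $\sigma$, not an object defined in the paper) simply averages each agent's total episode reward.

```python
# Monte Carlo estimate of V_i^sigma from (1): average agent i's total reward
# over independently sampled episodes. `sample_episode` is assumed to play one
# episode under the joint policy sigma and return the per-agent reward sums.
import numpy as np

def estimate_values(sample_episode, m: int, num_episodes: int = 10_000) -> np.ndarray:
    totals = np.zeros(m)
    for _ in range(num_episodes):
        totals += sample_episode()   # entry i is sum_h R_{i,h}(s_h, a_h) for this episode
    return totals / num_episodes     # entry i estimates V_i^sigma
```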

2.3 Normal-form games

To motivate the solution concepts we consider for Markov games, let us first revisit the notion of normal-form games, which may be interpreted as Markov games with a single state. For $m,n\in\mathbb{N}$, an $m$-player $n$-action normal-form game $G$ is specified by a tuple of $m$ reward tensors $M_{1},\ldots,M_{m}\in[0,1]^{n\times\cdots\times n}$, where each tensor is of order $m$ (i.e., has $n^{m}$ entries). We will write $G=(M_{1},\ldots,M_{m})$. We assume for simplicity that each player has the same number $n$ of actions, and identify each player's action space with $[n]$. Then an action profile is specified by $\mathbf{a}\in[n]^{m}$; if each player acts according to $\mathbf{a}$, then the reward for player $i\in[m]$ is given by $(M_{i})_{\mathbf{a}}\in[0,1]$. Our hardness results will use the standard notion of Nash equilibrium in normal-form games. We define the $m$-player $(n,\epsilon)$-Nash problem to be the problem of computing an $\epsilon$-approximate Nash equilibrium of a given $m$-player $n$-action normal-form game. (See Definition A.1 for a formal definition of $\epsilon$-Nash equilibrium.) A celebrated result is that Nash equilibria are PPAD-hard to approximate, i.e., the 2-player $(n,n^{-c})$-Nash problem is PPAD-hard for any constant $c>0$ [DGP09, CDT06]. We refer the reader to Section A.1 for further background on these concepts.
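The following sketch (Python; a helper of our own for illustration, not part of the paper's formal development) spells out the approximation notion behind the $(n,\epsilon)$-Nash problem in the 2-player case: a pair of mixed strategies is an $\epsilon$-Nash equilibrium exactly when neither player can gain more than $\epsilon$ by a unilateral deviation (see Definition A.1 for the formal statement).

```python
# Epsilon-Nash test for a 2-player normal-form game with reward matrices M1, M2
# (shape (n, n), indexed by (a1, a2)) and mixed strategies x, y (length-n arrays).
import numpy as np

def nash_gap(M1: np.ndarray, M2: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Largest gain either player can obtain by a unilateral deviation from (x, y)."""
    v1, v2 = x @ M1 @ y, x @ M2 @ y
    gain1 = (M1 @ y).max() - v1     # player 1 deviates to their best pure action
    gain2 = (M2.T @ x).max() - v2   # player 2 deviates to their best pure action
    return max(gain1, gain2)

# (x, y) is an epsilon-approximate Nash equilibrium iff nash_gap(M1, M2, x, y) <= epsilon.
```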

2.4 Markov games: Equilibria, no-regret, and independent learning

We now turn our focus back to Markov games, and introduce the main solution concepts we consider, as well as the notion of no-regret. Since computing Nash equilibria is intractable even for normal-form games, much of the work on efficient equilibrium computation has focused on alternative notions of equilibrium, notably coarse correlated equilibria and correlated equilibria. We focus on coarse correlated equilibria: since they form a superset of correlated equilibria, any lower bound for computing a coarse correlated equilibrium implies a lower bound for computing a correlated equilibrium.

For a distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$ and a randomized policy $\sigma_{i}^{\prime}\in\Pi^{\mathrm{gen,rnd}}_{i}$ of player $i$, we let $\sigma_{i}^{\prime}\times P_{-i}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ denote the distributional policy given by the distribution of $(\sigma_{i}^{\prime},\sigma_{-i})\in\Pi^{\mathrm{gen,rnd}}$ for $\sigma\sim P$ (and $\sigma_{-i}$ denotes the marginal of $\sigma$ on all players but $i$). For $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we write $\sigma_{i}^{\prime}\times\sigma_{-i}$ to denote the policy given by $(\sigma_{i}^{\prime},\sigma_{-i})\in\Pi^{\mathrm{gen,rnd}}$. Let us fix a Markov game $\mathcal{G}$, which in particular determines the players' value functions $V_{i}^{\sigma}$ as in (1).

Definition 2.2 (Coarse correlated equilibrium).

For $\epsilon>0$, a distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is defined to be an $\epsilon$-coarse correlated equilibrium (CCE) if for each $i\in[m]$, it holds that

$$\max_{\sigma_{i}^{\prime}\in\Pi^{\mathrm{gen,rnd}}_{i}}V_{i}^{\sigma_{i}^{\prime}\times P_{-i}}-V_{i}^{P}\leq\epsilon.$$

The maximizing policy $\sigma_{i}^{\prime}$ can always be chosen to be deterministic, so $P$ is an $\epsilon$-CCE if and only if $\max_{\pi_{i}\in\Pi^{\mathrm{gen,det}}_{i}}V_{i}^{\pi_{i}\times P_{-i}}-V_{i}^{P}\leq\epsilon$.

Coarse correlated equilibria can be computed efficiently for both normal-form games and Markov games, and are fundamentally connected to the notion of no-regret and independent learning, which we now introduce.

Regret.

For a policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we denote the distributional policy which puts all of its mass on $\sigma$ by $\mathbb{I}_{\sigma}\in\Delta(\Pi^{\mathrm{gen,rnd}})$. Thus $\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ denotes the distributional policy which randomizes uniformly over the $\sigma^{(t)}$. We define regret as follows.

Definition 2.3 (Regret).

Consider a sequence of policies $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$. For $i\in[m]$, the regret of agent $i$ with respect to this sequence is defined as:

$$\mathrm{Reg}_{i,T}=\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})=\max_{\sigma_{i}\in\Pi^{\mathrm{gen,rnd}}_{i}}\sum_{t=1}^{T}V_{i}^{\sigma_{i}\times\sigma_{-i}^{(t)}}-V_{i}^{\sigma^{(t)}}. \tag{2}$$

In (2), the maximum over $\sigma_{i}\in\Pi^{\mathrm{gen,rnd}}_{i}$ is always achieved by a deterministic general policy, so we have $\mathrm{Reg}_{i,T}=\max_{\pi_{i}\in\Pi^{\mathrm{gen,det}}_{i}}\sum_{t=1}^{T}\big(V_{i}^{\pi_{i}\times\sigma_{-i}^{(t)}}-V_{i}^{\sigma^{(t)}}\big)$.

The following standard result shows that the uniform average of any no-regret sequence forms an approximate coarse correlated equilibrium.

Fact 2.4 (No-regret is equivalent to CCE).

Suppose that a sequence of policies $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$ satisfies $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$ for each $i\in[m]$. Then the uniform average of these $T$ policies, namely the distributional policy $\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$, is an $\epsilon$-CCE.

Likewise, if a sequence of policies $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$ has the property that the distributional policy $\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is an $\epsilon$-CCE, then $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$ for all $i\in[m]$.

No-regret learning.

Fact 2.4 is an immediate consequence of Definitions 2.2 and 2.3. A standard approach to decentralized equilibrium computation, which exploits Fact 2.4, is to select $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$ using independent no-regret learning algorithms. A no-regret learning algorithm for player $i$ selects $\sigma_{i}^{(t)}\in\Pi^{\mathrm{gen,rnd}}_{i}$ based on the realized trajectories $\tau^{(1)}_{i,H},\ldots,\tau^{(t-1)}_{i,H}\in\mathscr{H}_{i,H}$ that player $i$ observes over the course of play,\footnote{An alternative model allows player $i$ to have knowledge of the previous joint policies $\sigma^{(1)},\ldots,\sigma^{(t-1)}$ when selecting $\sigma_{i}^{(t)}$.} but with no knowledge of $\sigma^{(t)}_{-i}$, so as to ensure that no-regret is achieved: $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$. If each player $i$ uses their own independent no-regret learning algorithm, this approach yields product policies $\sigma^{(t)}=\sigma_{1}^{(t)}\times\cdots\times\sigma^{(t)}_{m}$, and the uniform average of the $\sigma^{(t)}$ yields a CCE as long as all of the players can keep their regret small.\footnote{In Section 6, we discuss the implications of relaxing the stipulation that the $\sigma^{(t)}$ be product policies (for example, by allowing the use of shared randomness, as in V-learning). In short, allowing $\sigma^{(t)}$ to be non-product essentially trivializes the problem.}

For the special case of normal-form games, the no-regret learning approach has been fruitful. There are several efficient algorithms, including regret matching [HMC00], Hedge (also known as exponential weights) [Vov90, LW94, CBFH+97], and generalizations of Hedge based on the follow-the-regularized-leader (FTRL) framework [SS12], which ensure that each player's regret after $T$ episodes is bounded above by $O(\sqrt{T})$ (that is, $\epsilon=O(1/\sqrt{T})$), even when the other players' actions are chosen adversarially. All of these guarantees, which bound regret by a function sublinear in $T$, lead to efficient, decentralized computation of approximate coarse correlated equilibria in normal-form games. The success of this approach motivates our central question, which is whether similar guarantees may be established for Markov games. In particular, a formal version of Problem 1.1 asks: Is there an efficient algorithm that, when adopted by all agents in a Markov game and run independently, ensures that for all $i$, $\mathrm{Reg}_{i,T}\leq\epsilon\cdot T$ for some $\epsilon=o(1)$?
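For concreteness, here is a minimal sketch of regret matching in the full-information setting just described (Python; the interface and inputs are our own illustration, and no attempt is made to match the constants in the cited bounds).

```python
# Regret matching [HMC00] for one player facing an arbitrary sequence of reward
# vectors (full-information feedback). Sketch only; constants are illustrative.
import numpy as np

def regret_matching(reward_vectors):
    """reward_vectors: iterable of length-n arrays; entry a of r_t is the reward
    action a would have earned at round t."""
    cum_regret = None
    played = []
    for r in reward_vectors:
        r = np.asarray(r, dtype=float)
        if cum_regret is None:
            cum_regret = np.zeros_like(r)
        pos = np.maximum(cum_regret, 0.0)
        # Play proportionally to positive cumulative regrets (uniform if none are positive).
        p = pos / pos.sum() if pos.sum() > 0 else np.full(len(r), 1.0 / len(r))
        played.append(p)
        # Each action's regret grows by its reward minus the expected reward of p.
        cum_regret += r - p @ r
    return played  # the uniform average of independent copies' joint play approaches a CCE
```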

3 Lower bound for Markovian algorithms

In this section we prove Theorem 1.2 (restated formally below as Theorem 3.2), establishing that in two-player Markov games, there is no computationally efficient algorithm that computes a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ of product Markov policies so that each player has small regret under this sequence. This section serves as a warm-up for our results in Section 4, which remove the assumption that $\sigma^{(1)},\ldots,\sigma^{(T)}$ are Markovian.

3.1 SparseMarkovCCE problem and computational model

As discussed in the introduction, our lower bounds for no-regret learning are a consequence of lower bounds for the SparseCCE problem. In what follows, we formalize this problem (specifically, the Markovian variant, which we refer to as SparseMarkovCCE), as well as our computational model.

Description length for Markov games (constant mm).

Given a Markov game $\mathcal{G}$, we let $\beta(\mathcal{G})$ denote the maximum number of bits needed to describe any of the rewards $R_{i,h}(s,\mathbf{a})$ or transition probabilities $\mathbb{P}_{h}(s^{\prime}|s,\mathbf{a})$ in binary.\footnote{We emphasize that $\beta(\mathcal{G})$ is defined as the maximum number of bits required by any particular $(s,\mathbf{a})$ pair, not the total number of bits required for all $(s,\mathbf{a})$ pairs.} We define $|\mathcal{G}|:=\max\{S,\max_{i\in[m]}A_{i},H,\beta(\mathcal{G})\}$. The interpretation of $|\mathcal{G}|$ depends on the number of players $m$: if $m$ is a constant (as will be the case in the current section and Section 4), then $|\mathcal{G}|$ should be interpreted as the description length of the game $\mathcal{G}$, up to polynomial factors. In particular, for constant $m$, the game $\mathcal{G}$ can be described using $|\mathcal{G}|^{O(1)}$ bits. In Section 5, we discuss the interpretation of $|\mathcal{G}|$ when $m$ is large.

The SparseMarkovCCE problem.

From Fact 2.4, we know that the problem of computing a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ of joint product Markov policies for which each player has at most $\epsilon\cdot T$ regret is equivalent to computing a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ for which the uniform mixture forms an $\epsilon$-approximate CCE. We define $(T,\epsilon)$-SparseMarkovCCE as the computational problem of computing such a CCE directly.

Definition 3.1 (SparseMarkovCCE problem).

For an $m$-player Markov game $\mathcal{G}$ and parameters $T\in\mathbb{N}$ and $\epsilon>0$ (which may depend on the size of the game $\mathcal{G}$), $(T,\epsilon)$-SparseMarkovCCE is the problem of finding a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$, with each $\sigma^{(t)}\in\Pi^{\mathrm{markov}}$, such that the distributional policy $\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is an $\epsilon$-CCE of $\mathcal{G}$ (or equivalently, such that for all $i\in[m]$, $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$).

Decentralized learning algorithms naturally lead to solutions to the SparseMarkovCCE problem. In particular, consider any decentralized protocol which runs for $T$ episodes, where at each timestep $t\in[T]$, each player $i\in[m]$ chooses a Markov policy $\sigma_{i}^{(t)}\in\Pi^{\mathrm{markov}}_{i}$ to play, without knowledge of the other players' policies $\sigma_{-i}^{(t)}$ (but possibly using the history); any strategy in which players independently run online learning algorithms falls under this protocol. If each player experiences overall regret at most $\epsilon\cdot T$, then the sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ is a solution to the $(T,\epsilon)$-SparseMarkovCCE problem. However, one might expect the $(T,\epsilon)$-SparseMarkovCCE problem to be much easier than decentralized learning, since it allows for algorithms that produce $(\sigma^{(1)},\ldots,\sigma^{(T)})$ satisfying the constraints of Definition 3.1 in a centralized manner. The main result of this section, Theorem 3.2, rules out the existence of any efficient algorithms, including centralized ones, that solve the SparseMarkovCCE problem.

Before moving on, let us give a sense of what sort of scaling one should expect for the parameters $T$ and $\epsilon$ in the $(T,\epsilon)$-SparseMarkovCCE problem. First, we note that there always exists a solution to the $(1,0)$-SparseMarkovCCE problem in a Markov game, which is given by a (Markov) Nash equilibrium of the game; of course, Nash equilibria are intractable to compute in general.\footnote{Such a Nash equilibrium can be seen to exist by using backwards induction to specify the players' joint distribution of play at each state at steps $H,H-1,\ldots,1$.} For the special case of normal-form games (where there is only a single state, and $H=1$), no-regret learning (e.g., Hedge) yields a computationally efficient solution to the $(T,\widetilde{O}(1/\sqrt{T}))$-SparseMarkovCCE problem, where the $\widetilde{O}(\cdot)$ hides a $\max_{i}\log|A_{i}|$ factor. The refined convergence guarantees of [DFG21, AFK+22] improve upon this result, and yield an efficient solution to the $(T,\widetilde{O}(1/T))$-SparseMarkovCCE problem.
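In this single-state special case, checking a candidate solution is straightforward. The sketch below (Python, 2-player case; a helper of our own for illustration) computes the largest per-player deviation gain of the uniform mixture of $T$ product strategy profiles, which is exactly the quantity that Definition 3.1 requires to be at most $\epsilon$.

```python
# In the normal-form special case (single state, H = 1), a candidate solution to
# (T, eps)-SparseMarkovCCE is a list of T product strategy profiles; this check
# computes the CCE gap of their uniform mixture for a 2-player game (M1, M2).
import numpy as np

def sparse_cce_gap(M1, M2, profiles):
    """profiles: list of (x_t, y_t) product strategy pairs; returns the larger player's gap."""
    avg_value_1 = np.mean([x @ M1 @ y for x, y in profiles])
    avg_value_2 = np.mean([x @ M2 @ y for x, y in profiles])
    # Deviations: a fixed action played against the opponent's strategy at every t.
    dev_1 = np.mean([M1 @ y for _, y in profiles], axis=0).max()
    dev_2 = np.mean([M2.T @ x for x, _ in profiles], axis=0).max()
    return max(dev_1 - avg_value_1, dev_2 - avg_value_2)

# The T profiles solve (T, eps)-SparseMarkovCCE for this game iff the gap is <= eps.
```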

3.2 Main result

Theorem 3.2.

There is a constant $C_{0}>1$ so that the following holds. Let $n\in\mathbb{N}$ be given, and let $T\in\mathbb{N}$ and $\epsilon>0$ satisfy $T<\exp(\epsilon^{2}\cdot n^{1/2}/2^{5})$. Suppose there is an algorithm that, given the description of any 2-player Markov game $\mathcal{G}$ with $|\mathcal{G}|\leq n$, solves the $(T,\epsilon)$-SparseMarkovCCE problem in time $U$, for some $U\in\mathbb{N}$. Then, for each $n\in\mathbb{N}$, the 2-player $(\lfloor n^{1/2}\rfloor,4\cdot\epsilon)$-Nash problem (Definition A.1) can be solved in time $(nTU)^{C_{0}}$.

We emphasize that the range $T<\exp(n^{O(1)})$ ruled out by Theorem 3.2 is the most natural parameter regime, since the runtime of any decentralized algorithm which runs for $T$ episodes and produces a solution to the SparseMarkovCCE problem is at least linear in $T$. Using that 2-player $(n,\epsilon)$-Nash is PPAD-complete for $\epsilon=n^{-c}$ (for any constant $c>0$) [DGP09, CDT06, Rub18], we obtain the following corollary.

Corollary 3.3 (SparseMarkovCCE is PPAD-complete).

For any constant $C>4$, if there is an algorithm which, given the description of a 2-player Markov game $\mathcal{G}$, solves the $(|\mathcal{G}|^{C},|\mathcal{G}|^{-\frac{1}{C}})$-SparseMarkovCCE problem in time $\operatorname{poly}(|\mathcal{G}|)$, then $\textsf{PPAD}=\textsf{P}$.

The condition $C>4$ in Corollary 3.3 is set to ensure that $|\mathcal{G}|^{C}<\exp(|\mathcal{G}|^{-2/C}\cdot\sqrt{|\mathcal{G}|}/2^{6})$ for sufficiently large $|\mathcal{G}|$, so as to satisfy the condition of Theorem 3.2. Corollary 3.3 rules out the existence of a polynomial-time algorithm that solves the SparseMarkovCCE problem with accuracy $\epsilon$ polynomially small and $T$ polynomially large in $|\mathcal{G}|$. Using a stronger complexity-theoretic assumption, the Exponential Time Hypothesis for PPAD [Rub16], we can obtain a stronger hardness result which rules out efficient algorithms even when 1) the accuracy $\epsilon$ is constant and 2) $T$ is quasipolynomially large.\footnote{This is a consequence of the fact that for some absolute constant $\epsilon_{0}>0$, there are no polynomial-time algorithms for computing $\epsilon_{0}$-Nash equilibria in 2-player normal-form games under the Exponential Time Hypothesis for PPAD [Rub16].}

Corollary 3.4 (ETH-hardness of SparseMarkovCCE).

There is a constant $\epsilon_{0}>0$ such that if there exists an algorithm that solves the $(|\mathcal{G}|^{o(\log|\mathcal{G}|)},\epsilon_{0})$-SparseMarkovCCE problem in $|\mathcal{G}|^{o(\log|\mathcal{G}|)}$ time, then the Exponential Time Hypothesis for PPAD fails to hold.

Proof overview.

The proof of Theorem 3.2 is based on a reduction which shows that any algorithm that efficiently solves the $(T,\epsilon)$-SparseMarkovCCE problem, for $T$ not too large, can be used to efficiently compute an approximate Nash equilibrium of any given normal-form game. In particular, fix $n_{0}\in\mathbb{N}$, and let a 2-player normal-form game $G$ with $n_{0}$ actions be given. We construct a Markov game $\mathcal{G}=\mathcal{G}(G)$ with horizon $H=n_{0}$ and action sets identical to those of the game $G$, i.e., $\mathcal{A}_{1}=\mathcal{A}_{2}=[n_{0}]$. The state space of $\mathcal{G}$ consists of $n_{0}^{2}$ states, which are indexed by joint action profiles; the transitions are defined so that the state at step $h$ encodes the action profile taken by the agents at step $h-1$.\footnote{For technical reasons, this is only the case for even values of $h$; we discuss further details in the full proof in Section B.2.} At each state of $\mathcal{G}$, the reward functions are given by the payoff matrices of $G$, scaled down by a factor of $1/H$ (which ensures that the rewards received at each step belong to $[0,1/H]$). In particular, the rewards and transitions out of a given state do not depend on the identity of the state, and so $\mathcal{G}$ can be thought of as a repeated game in which $G$ is played $H$ times. The formal definition of $\mathcal{G}$ is given in Definition B.3.

Fix any algorithm for the SparseMarkovCCE problem, and recall that for each step $h$ and state $s$ of $\mathcal{G}$, $\sigma^{(t)}_{h}(s)\in\Delta(\mathcal{A}_{1})\times\Delta(\mathcal{A}_{2})$ denotes the joint action distribution taken in $s$ at step $h$ under the sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ produced by the algorithm. The bulk of the proof of Theorem 3.2 consists of proving a key technical result, Lemma B.4, which states that if $\sigma^{(1)},\ldots,\sigma^{(T)}$ indeed solves $(T,\epsilon)$-SparseMarkovCCE, then there exists some tuple $(h,s,t)$ such that $\sigma_{h}^{(t)}(s)$ is an approximate Nash equilibrium for $G$. With this established, it follows that we can find a Nash equilibrium efficiently by simply trying all $HST$ choices for $(h,s,t)$.

To prove Lemma B.4, we reason as follows. Assume that $\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is an $\epsilon$-CCE. If, toward a contradiction, none of the distributions $\{\sigma_{h}^{(t)}(s)\}_{h\in[H],s\in\mathcal{S},t\in[T]}$ is an approximate Nash equilibrium for $G$, then it must be the case that for each $t$, one of the players has a profitable deviation in $G$ with respect to the product strategy $\sigma_{h}^{(t)}(s)$, at least for a constant fraction of the tuples $(s,h)$. We will argue that if this were the case, then at least one player $i$ would have a non-Markov deviation policy violating the condition of Definition 2.2, meaning that $\overline{\sigma}$ is not in fact an $\epsilon$-CCE.

To sketch the idea, recall that to draw a trajectory from $\overline{\sigma}$, we first draw an index $t^{\star}\sim[T]$ uniformly at random, and then execute $\sigma^{(t^{\star})}$ for an episode. We will show (roughly) that for each player $i$, it is possible to compute a non-Markov deviation policy $\pi_{i}^{\dagger}$ which, under the draw of a trajectory from $\overline{\sigma}$, can “infer” the value of the index $t^{\star}$ within the first few steps of the episode. The policy $\pi_{i}^{\dagger}$ then, at each state $s$ and step $h$ after the first few steps, plays a best response to the opponent's portion of the strategy $\sigma_{h}^{(t^{\star})}(s)$. If, for each possible value of $t^{\star}$, none of the distributions $\sigma_{h}^{(t^{\star})}(s)$ is an approximate Nash equilibrium of $G$, this means that at least one of the players $i$ can significantly increase their value in $\mathcal{G}$ over that of $\overline{\sigma}$ by playing $\pi_{i}^{\dagger}$, which contradicts the assumption that $\overline{\sigma}$ is an $\epsilon$-CCE.

It remains to explain how we can construct a non-Markov policy $\pi_{i}^{\dagger}$ which “infers” the value of $t^{\star}$. Unfortunately, exactly inferring the value of $t^{\star}$ in the fashion described above is impossible: for instance, if there are $t_{1}\neq t_{2}$ so that $\sigma^{(t_{1})}=\sigma^{(t_{2})}$, then clearly it is impossible to distinguish between the cases $t^{\star}=t_{1}$ and $t^{\star}=t_{2}$. Nevertheless, by using the fact that each player observes the full joint action profile played at each step $h$, we can construct a non-Markov policy which employs Vovk's aggregating algorithm for online density estimation [Vov90, CBL06] in order to compute a distribution which is close to $\sigma_{h}^{(t^{\star})}(s)$ for most $h\in[H]$.\footnote{Vovk's aggregating algorithm is essentially the exponential weights algorithm with the logarithmic loss. Detailed background on the algorithm is provided in Section B.1.} This guarantee is stated formally in an abstract setting in Proposition B.2, and is instantiated in the proof of Theorem 3.2 in Equation (6). As we show in Section B.2, approximating $\sigma_{h}^{(t^{\star})}(s)$ as we have described is sufficient to carry out the reasoning from the previous paragraph.
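The following sketch shows the flavor of this aggregation step (Python; the interface is our own illustration, and the formal construction appears in Section B): exponential weights with the logarithmic loss maintains a posterior over the $T$ candidate indices and predicts with the corresponding mixture of the distributions $\sigma_h^{(t)}(s)$.

```python
# Vovk-style aggregation (exponential weights with the logarithmic loss) over T
# experts, where expert t predicts the joint action distribution sigma_h^{(t)}(s)
# at each step h given the realized state. Illustrative only.
import numpy as np

def aggregate_predictions(expert_dists, observed_actions):
    """
    expert_dists: array of shape (H, T, K) -- expert t's distribution over the K
                  joint actions at each step h (evaluated at the realized state).
    observed_actions: length-H sequence of observed joint-action indices.
    Returns the aggregated predictive distribution used at each step.
    """
    H, T, K = expert_dists.shape
    log_w = np.zeros(T)                      # uniform prior over the T experts
    predictions = []
    for h in range(H):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        predictions.append(w @ expert_dists[h])   # mixture prediction at step h
        # Log-loss update: reweight each expert by the probability it assigned
        # to the joint action that was actually played.
        a = observed_actions[h]
        log_w += np.log(expert_dists[h, :, a] + 1e-12)
    return predictions
```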

4 Lower bound for non-Markov algorithms

In this section, we prove Theorem 1.3 (restated formally below as Theorem 4.3), which strengthens Theorem 3.2 by allowing the sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ of product policies to be non-Markovian. This additional strength comes at the cost of our lower bound only applying to 3-player Markov games (as opposed to Theorem 3.2, which applies to 2-player games).

4.1 SparseCCE problem and computational model

To formalize the computational model for the SparseCCE problem, we must first describe how the non-Markov product policies σ(t)=(σ1(t),,σm(t))\sigma^{\scriptscriptstyle{(t)}}=(\sigma^{\scriptscriptstyle{(t)}}_{1},\ldots,\sigma^{\scriptscriptstyle{(t)}}_{m}) are represented. Recall that a non-Markov policy σi(t)Πigen,rnd\sigma_{i}^{\scriptscriptstyle{(t)}}\in{\Pi}^{\mathrm{gen,rnd}}_{i} is, by definition, a mapping from agent ii’s history and current state to a distribution over their next action. Since there are exponentially many possible histories, it is information-theoretically impossible to express an arbitrary policy in Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} with polynomially many bits. As our focus is on computing a sequence of such policies σ(t)\sigma^{\scriptscriptstyle{(t)}} in polynomial time, certainly a prerequisite is that σ(t)\sigma^{\scriptscriptstyle{(t)}} can be expressed in polynomial space. Thus, we adopt the representational assumption, stated formally in Definition 4.1, that each of the policies σi(t)Πigen,rnd\sigma_{i}^{\scriptscriptstyle{(t)}}\in{\Pi}^{\mathrm{gen,rnd}}_{i} is described by a bounded-size circuit that can compute the conditional distribution of each next action given the history. This assumption is satisfied by essentially all empirical and theoretical work concerning non-Markov policies (e.g., [LDGV+21, AVDG+22, JLWY21, SMB22]).

Definition 4.1 (Computable policy).

Given an mm-player Markov game 𝒢\mathcal{G} and NN\in\mathbb{N}, we say that a policy σiΠigen,rnd\sigma_{i}\in{\Pi}^{\mathrm{gen,rnd}}_{i} is NN-computable if for each h[H]h\in[H], there is a circuit of size NN that,151515For concreteness, we suppose that “circuit” means “boolean circuit” as in [AB06, Definition 6.1], where probabilities are represented in binary. The precise model of computation we use does not matter, though, and we could equally assume that the policies σi\sigma_{i} may be computed by Turing machines that terminate after NN steps. on input (τi,h1,s)i,h1×𝒮(\tau_{i,h-1},s)\in\mathscr{H}_{i,h-1}\times\mathcal{S}, outputs the distribution σi(τi,h1,s)Δ(𝒜i)\sigma_{i}(\tau_{i,h-1},s)\in\Delta(\mathcal{A}_{i}). A policy σ=(σ1,,σm)Πgen,rnd\sigma=(\sigma_{1},\ldots,\sigma_{m})\in{\Pi}^{\mathrm{gen,rnd}} is NN-computable if each constituent policy σi\sigma_{i} is.

Our lower bound applies to algorithms that produce sequences σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} for which each σ(t)\sigma^{\scriptscriptstyle{(t)}} is NN-computable, where the value NN is taken to be polynomial in the description length of the game 𝒢\mathcal{G}. For example, Markov policies whose probabilities can be expressed with β\beta bits are O(HSAiβ)O(HSA_{i}\beta)-computable for each player ii, since one can simply store each of the probabilities σi,h(sh,ai,h)\sigma_{i,h}(s_{h},a_{i,h}) (for h[H]h\in[H], i[m]i\in[m], ai,h𝒜ia_{i,h}\in\mathcal{A}_{i}, sh𝒮s_{h}\in\mathcal{S}), each of which takes β\beta bits to represent.
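To make the representational discussion concrete, the following is a minimal sketch (in Python; the class and variable names are our own, not part of the formal model) of a tabular Markov policy viewed as a bounded-size lookup "circuit", matching the O(HSA_iβ)-computability bound above. An N-computable non-Markov policy would additionally take the history τ_{i,h-1} as input rather than ignoring it.

```python
import numpy as np

class TabularMarkovPolicy:
    """A Markov policy for player i stored as an explicit table.

    The table has H * S * A_i entries, each a probability (which, if written
    with beta bits each, gives the O(H * S * A_i * beta) bound from the text):
    the "circuit" simply looks up the stored row for (h, s).
    """

    def __init__(self, H, S, A_i, rng=None):
        rng = rng or np.random.default_rng(0)
        # probs[h, s] is player i's action distribution at step h and state s.
        self.probs = rng.dirichlet(np.ones(A_i), size=(H, S))

    def action_distribution(self, history, h, s):
        # A Markov policy ignores the history; a general (non-Markov) policy
        # in Pi_i^{gen,rnd} would condition on it as well.
        return self.probs[h, s]

policy = TabularMarkovPolicy(H=3, S=4, A_i=2)
print(policy.action_distribution(history=(), h=0, s=2))
```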

The SparseCCE problem.

SparseCCE is the problem of computing a sequence of non-Markov product policies σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} such that the uniform mixture forms an ϵ\epsilon-approximate CCE. The problem generalizes SparseMarkovCCE (Definition 3.1) by relaxing the condition that the policies σ(t)\sigma^{\scriptscriptstyle{(t)}} be Markov.

Definition 4.2 (SparseCCE Problem).

For an mm-player Markov game 𝒢\mathcal{G} and parameters T,NT,N\in\mathbb{N} and ϵ>0\epsilon>0 (which may depend on the size of the game 𝒢\mathcal{G}), (T,ϵ,N)(T,\epsilon,N)-SparseCCE is the problem of finding a sequence σ(1),,σ(T)Πgen,rnd\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in{\Pi}^{\mathrm{gen,rnd}}, with each σ(t)\sigma^{\scriptscriptstyle{(t)}} being NN-computable, such that the distributional policy σ¯=1Tt=1T𝕀σ(t)Δ(Πgen,rnd)\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}}\in\Delta({\Pi}^{\mathrm{gen,rnd}}) is an ϵ\epsilon-CCE for 𝒢\mathcal{G} (equivalently, such that for all i[m]i\in[m], Regi,T(σ(1),,σ(T))ϵT\mathrm{Reg}_{i,T}(\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}})\leq\epsilon\cdot T).

4.2 Main result

Our main theorem for this section, Theorem 4.3, shows that for appropriate values of TT, ϵ\epsilon, and NN, solving the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem is at least as hard as computing Nash equilibria in normal-form games.

Theorem 4.3.

Fix nn\in\mathbb{N}, and let T,NT,N\in\mathbb{N}, and ϵ>0\epsilon>0 satisfy 1<T<exp(ϵ2n16)1<T<\exp\left(\frac{\epsilon^{2}\cdot n}{16}\right). Suppose there exists an algorithm that, given the description of any 33-player Markov game 𝒢\mathcal{G} with |𝒢|n|\mathcal{G}|\leq n, solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem in time UU, for some UU\in\mathbb{N}. Then, for any δ>0\delta>0, the 22-player (n/2,50ϵ)(\lfloor n/2\rfloor,50\epsilon)-Nash problem can be solved in randomized time (nTNUlog(1/δ)/ϵ)C0(nTNU\log(1/\delta)/\epsilon)^{C_{0}} with failure probability δ\delta, where C0>0C_{0}>0 is an absolute constant.

By analogy to Corollary 3.3, we obtain the following immediate consequence.

Corollary 4.4 (SparseCCE is hard under PPADRP\textsf{PPAD}\nsubseteq\textsf{RP}).

For any constant C>4C>4, if there is an algorithm which, given the description of a 3-player Markov game 𝒢\mathcal{G}, solves the (|𝒢|C,|𝒢|1C,|𝒢|C)(|\mathcal{G}|^{C},|\mathcal{G}|^{-\frac{1}{C}},|\mathcal{G}|^{C})-SparseCCE problem in time poly(|𝒢|)\operatorname{poly}(|\mathcal{G}|), then PPADRP\textsf{PPAD}\subseteq\textsf{RP}.

Proof overview for Theorem 4.3.

The proof of Theorem 4.3 has a similar high-level structure to that of Theorem 3.2: given an mm-player normal-form game GG, we define an (m+1)(m+1)-player Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) which has n0:=n/mn_{0}:=\lfloor n/m\rfloor actions per player and horizon Hn0H\approx n_{0}. The key difference in the proof of Theorem 4.3 is the structure of the players’ reward functions. To motivate this difference and the addition of an (m+1)(m+1)-th player, let us consider what goes wrong in the proof of Theorem 3.2 when the policies σ(t)\sigma^{\scriptscriptstyle{(t)}} are allowed to be non-Markov. We will explain how a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} can hypothetically solve the SparseCCE problem by attempting to punish any one player’s deviation policy, and thus avoid having to compute a Nash equilibrium of GG. In particular, for each player jj, suppose σj(t)\sigma_{j}^{\scriptscriptstyle{(t)}} tries to detect, based on the state transitions and player jj’s rewards, whether every other player iji\neq j is playing according to σi(t)\sigma_{i}^{\scriptscriptstyle{(t)}}. If some player ii is not playing according to σi(t)\sigma_{i}^{\scriptscriptstyle{(t)}} at some step hh, then at steps h>hh^{\prime}>h, the policy σj(t)\sigma_{j}^{\scriptscriptstyle{(t)}} can select actions that attempt to minimize player ii’s rewards. For example, if player ii plays according to the policy πi\pi_{i}^{\dagger} that we described in Section 3.2, then other players jij\neq i can adjust their choice of actions in later rounds to decrease player ii’s value.

This behavior is reminiscent of “tit-for-tat” strategies which are used to establish the folk theorem in the theory of repeated games [MF86, FLM94]. The folk theorem describes how Nash equilibria are more numerous (and potentially easier to find) in repeated games than in single-shot normal-form games. As it turns out, the folk theorem does not provably yield worst-case computational speedups in repeated games, at least when the number of players is at least 3. Indeed, [BCI+08] gave an “anti-folk theorem”, showing that computing Nash equilibria in (m+1)(m+1)-player repeated games is PPAD-hard for m2m\geq 2, via a reduction from mm-player normal-form games. We utilize their reduction, for which the key idea is as follows: given an mm-player normal-form game GG, we construct an (m+1)(m+1)-player Markov game 𝒢(G)\mathcal{G}(G) in which the (m+1)(m+1)-th player acts as a kibitzer,161616Kibitzer is a Yiddish term for an observer who offers advice. with actions indexed by tuples (j,aj)(j,a_{j}), for j[m]j\in[m] and aj𝒜ja_{j}\in\mathcal{A}_{j}. The kibitzer’s action (j,aj)(j,a_{j}) represents 1) a player jj to give advice to, and 2) their advice to the player, which is to take action aja_{j}. In particular, if the kibitzer plays (j,aj)(j,a_{j}), it receives reward equal to the amount that player jj would obtain by deviating to aja_{j}, and player jj receives the negation of the kibitzer’s reward. Furthermore, all other players receive 0 reward.
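To illustrate the reward structure, the following sketch computes the stage rewards in the induced game. We interpret "the amount player j would obtain by deviating" as the deviation gain; this reading, the zero-indexing of players, and all function and variable names are our own choices for illustration rather than the paper's formal construction.

```python
import numpy as np

def kibitzer_stage_rewards(payoffs, a, kibitzer_action):
    """Stage rewards in the (m+1)-player game induced by an m-player
    normal-form game.

    payoffs: list of m payoff tensors; payoffs[j][a] is player j's payoff
             under the joint action profile a (a tuple of length m).
    a:       joint action profile of the first m players.
    kibitzer_action: a pair (j, a_j_prime), i.e. "advise player j to play a_j_prime".
    """
    m = len(payoffs)
    j, a_j_prime = kibitzer_action
    a_deviated = a[:j] + (a_j_prime,) + a[j + 1:]
    # The kibitzer is paid the gain player j would obtain by deviating.
    r_kibitzer = payoffs[j][a_deviated] - payoffs[j][a]
    rewards = [0.0] * (m + 1)
    rewards[j] = -r_kibitzer       # player j pays the kibitzer (zero-sum between the two)
    rewards[m] = r_kibitzer        # the kibitzer is player m+1 (index m); others stay at 0
    return rewards

# Example: 2-player matching pennies; the kibitzer advises player 0 to switch and match.
M0 = np.array([[1.0, 0.0], [0.0, 1.0]])
M1 = 1.0 - M0
print(kibitzer_stage_rewards([M0, M1], a=(0, 1), kibitzer_action=(0, 1)))  # [-1.0, 0.0, 1.0]
```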

To see why the addition of the kibitzer is useful, suppose that σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} solves the SparseCCE problem, so that σ¯:=1Tt=1T𝕀σ(t)\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an ϵ\epsilon-CCE. We will show that, with at least constant probability over a trajectory drawn from σ¯\overline{\sigma} (which involves drawing t[T]t^{\star}\sim[T] uniformly), the joint strategy profile played by the first mm players constitutes an approximate Nash equilibrium of GG. Suppose for the purpose of contradiction that this were not the case. We show that there exists a non-Markov deviation policy πm+1\pi_{m+1}^{\dagger} for the kibitzer which, similar to the proof of Theorem 3.2, learns the value of tt^{\star} and plays a tuple (j,aj)(j,a_{j}) such that action aja_{j} increases player jj’s payoff in GG, thereby increasing its own payoff. Even if the other players attempt to punish the kibitzer for this deviation, they will not be able to since, roughly speaking, the kibitzer game as constructed above has the property that for any strategy for the first mm players, the kibitzer can always achieve reward at least 0.

The above argument shows that under the joint policy σ¯(m+1)×πm+1\overline{\sigma}_{-(m+1)}\times\pi_{m+1}^{\dagger} (namely, the first mm players play according to σ¯\overline{\sigma} and the kibitzer plays according to πm+1\pi_{m+1}^{\dagger}), with constant probability over a trajectory drawn from this policy, the distribution of the first mm players’ actions is an approximate Nash equilibrium of GG. Thus, in order to efficiently find such a Nash equilibrium (see Algorithm 2), we need to simulate the policy σ¯(m+1)×πm+1\overline{\sigma}_{-(m+1)}\times\pi_{m+1}^{\dagger}, which involves running Vovk’s aggregating algorithm. This approach is in contrast to the proof of Theorem 3.2, for which Vovk’s aggregating algorithm was an ingredient in the proof but was not actually used in the Nash computation algorithm (Algorithm 1). The details of the proof of correctness of Algorithm 2 are somewhat delicate, and may be found in Appendix C.

Two-player games.

One intriguing question we leave open is whether the SparseCCE problem remains hard for two-player Markov games. Interestingly, as shown by [LS05], there is a polynomial-time algorithm to find an exact Nash equilibrium for the special case of repeated two-player normal-form games. Though their result applies only in the infinite-horizon setting, it can be extended to the finite-horizon setting, which rules out naive approaches to extending the proof of Theorem 4.3 and Corollary 4.4 to two players.

5 Multi-player games: Statistical lower bounds

In this section we present Theorem 1.4 (restated formally below as Theorem 5.2), which gives a statistical lower bound for the SparseCCE problem. The lower bound applies to any algorithm, regardless of computational cost, that accesses the underlying Markov game through a generative model.

Definition 5.1 (Generative model).

For an mm-player Markov game 𝒢=(𝒮,H,(𝒜i)i[m],,(Ri)i[m],μ)\mathcal{G}=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m]},\mathbb{P},(R_{i})_{i\in[m]},\mu), a generative model oracle is defined as follows: given a query described by a tuple (h,s,𝐚)[H]×𝒮×𝒜(h,s,\mathbf{a})\in[H]\times\mathcal{S}\times\mathcal{A}, the oracle returns the distribution h(|s,𝐚)Δ(𝒮)\mathbb{P}_{h}(\cdot|s,\mathbf{a})\in\Delta(\mathcal{S}) and the tuple of rewards (Ri,h(s,𝐚))i[m](R_{i,h}(s,\mathbf{a}))_{i\in[m]}.

From the perspective of lower bounds, the assumption that the algorithm has access to a generative model is quite reasonable, as it encompasses most standard access models in RL, including the online access model, in which the algorithm repeatedly queries a policy and observes a trajectory drawn from it, as well as the local access generative model used in [YHAY+22, WAJ+21]. We remark that it is slightly more standard to assume that queries to the generative model only return a sample from the distribution h(|s,𝐚)\mathbb{P}_{h}(\cdot|s,\mathbf{a}) as opposed to the distribution itself [Kak03, KMN99], but since our goal is to prove lower bounds, the notion in Definition 5.1 only makes our results stronger.
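As a concrete picture of the access model in Definition 5.1, here is a minimal sketch of a generative model oracle (the class and field names are our own); note that a query returns the full transition distribution and the tuple of rewards, not a sample.

```python
class GenerativeModelOracle:
    """Query interface from Definition 5.1 for an m-player Markov game.

    P[h][s][a] : dict mapping next states to probabilities, i.e. the full
                 distribution P_h(. | s, a) rather than a sampled next state.
    R[h][s][a] : tuple of the m players' rewards (R_{i,h}(s, a))_{i in [m]}.
    A query is a tuple (h, s, a), where a is the joint action profile.
    """

    def __init__(self, P, R):
        self.P = P
        self.R = R
        self.num_queries = 0  # Theorem 5.2 lower-bounds this count.

    def query(self, h, s, a):
        self.num_queries += 1
        return self.P[h][s][a], self.R[h][s][a]

# A toy one-step, one-state, two-player game: joint action (0, 0) stays in state 0.
P = {0: {0: {(0, 0): {0: 1.0}}}}
R = {0: {0: {(0, 0): (1.0, 0.0)}}}
oracle = GenerativeModelOracle(P, R)
print(oracle.query(0, 0, (0, 0)))
```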

To state our main result, we recall the definition |𝒢|=max{S,maxi[m]Ai,H,β(𝒢)}|\mathcal{G}|=\max\{S,\max_{i\in[m]}A_{i},H,\beta(\mathcal{G})\}. In the present section, we consider the setting where the number of players mm is large. Here, |𝒢||\mathcal{G}| does not necessarily correspond to the description length for 𝒢\mathcal{G}, and should be interpreted, roughly speaking, as a measure of the description complexity of 𝒢\mathcal{G} with respect to decentralized learning algorithms. In particular, from the perspective of an individual agent implementing a decentralized learning algorithm, its sample complexity should depend only on the size of its individual action set (as well as the global parameters S,H,β(𝒢)S,H,\beta(\mathcal{G})), as opposed to the size of the joint action set, which grows exponentially in mm; the former is captured by |𝒢|\lvert\mathcal{G}\rvert, while the latter is not. Indeed, a key advantage shared by much prior work on decentralized RL [JLWY21, SMB22, MB21, DGZ22] is the avoidance of the curse of multi-agents, which describes the situation where an algorithm has sample and computational costs that scale exponentially in mm.

Our main result for this section, Theorem 5.2, states that for mm-player Markov games, exponentially many generative model queries (in mm) are necessary to produce a solution to the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem, unless TT itself is exponential in mm.

Theorem 5.2.

Let m2m\geq{}2 be given. There are constants c,ϵ>0c,\epsilon>0 so that the following holds. Suppose there is an algorithm \mathscr{B} which, given access to a generative model for an (m+1)(m+1)-player Markov game 𝒢\mathcal{G} with |𝒢|2m6|\mathcal{G}|\leq 2m^{6}, solves the (T,ϵ/(10m),N)(T,\epsilon/(10m),N)-SparseCCE problem for 𝒢\mathcal{G} for some TT satisfying 1<T<exp(cm)1<T<\exp(cm), and any NN\in\mathbb{N}. Then \mathscr{B} must make at least 2Ω(m)2^{\Omega(m)} queries to the generative model.

Theorem 5.2 establishes that there are mm-player Markov games, where the number of states, actions per player, and horizon are bounded by poly(m)\operatorname{poly}(m), but any algorithm with regret o(T/m)o(T/m) must make 2Ω(m)2^{\Omega(m)} queries (via Fact 2.4). In particular, if there are poly(m)\operatorname{poly}(m) queries per episode, as is standard in the online simulator model where a trajectory is drawn from the policy σ(t)\sigma^{\scriptscriptstyle{(t)}} at each episode t[T]t\in[T], then T>2Ω(m)T>2^{\Omega(m)} episodes are required to have regret o(T/m)o(T/m). This is in stark contrast to the setting of normal-form games, where even for the case of bandit feedback (which is a special case of the generative model setting), standard no-regret algorithms have the property that each player’s regret scales as O~(Tn)\widetilde{O}(\sqrt{Tn}) (i.e., independently of mm), where nn denotes the number of actions per player [LS20]. As with our computational lower bounds, Theorem 5.2 is not limited to decentralized algorithms, and also rules out centralized algorithms which, with access to a generative model, compute a sequence of policies which constitutes a solution to the SparseCCE problem. Furthermore, it holds for arbitrary values of NN, thus allowing the policies σ(1),,σ(T)Πgen,rnd\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in{\Pi}^{\mathrm{gen,rnd}} solving the SparseCCE problem to be arbitrary general policies.

6 Discussion and interpretation

Theorems 3.2, 4.3, and 5.2 present barriers—both computational and statistical—toward developing efficient decentralized no-regret guarantees for multi-agent reinforcement learning. We emphasize that no-regret algorithms are the only known approach for obtaining fully decentralized learning algorithms (i.e., those which do not rely even on shared randomness) in normal-form games, and it seems unlikely that a substantially different approach would work in Markov games. Thus, these lower bounds for finding subexponential-length sequences of policies with the no-regret property represent a significant obstacle for fully decentralized multi-agent reinforcement learning. Moreover, these results rule out even the prospect of developing efficient centralized algorithms that produce no-regret sequences of policies, i.e., those which “resemble” independent learning. In this section, we compare our lower bounds with recent upper bounds for decentralized learning in Markov games, and explain how to reconcile these results.

6.1 Comparison to V-learning

The V-learning algorithm [JLWY21, SMB22, MB21] is a polynomial-time decentralized learning algorithm that proceeds in two phases. In the first phase, the mm agents interact over the course of KK episodes in a decentralized fashion, playing product Markov policies σ(1),,σ(K)Πmarkov\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(K)}}\in\Pi^{\mathrm{markov}}. In the second phase, the agents use data gathered during the first phase to produce a distributional policy σ^Δ(Πgen,rnd)\widehat{\sigma}\in\Delta({\Pi}^{\mathrm{gen,rnd}}), which we refer to as the output policy of V-learning. As discussed in Section 1, one implication of Theorem 3.2 is that the first phase of V-learning cannot guarantee each agent sublinear regret. Indeed if KK is of polynomial size (and PPADP\textsf{PPAD}\neq\textsf{P}), this follows because a bound of the form Regi,K(σ(1),,σ(K))ϵK\mathrm{Reg}_{i,K}(\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(K)}})\leq\epsilon K for all ii implies that (σ(1),,σ(K))(\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(K)}}) solves the (K,ϵ)(K,\epsilon)-SparseMarkovCCE problem.

The output policy σ^Δ(Πgen,rnd)\widehat{\sigma}\in\Delta({\Pi}^{\mathrm{gen,rnd}}) produced by V-learning is an approximate CCE (per Definition 2.2), and it is natural to ask how many product policies it takes to represent σ^\widehat{\sigma} as a uniform mixture (that is, whether σ^\widehat{\sigma} solves the (T,ϵ)(T,\epsilon)-SparseMarkovCCE problem for a reasonable value of TT). First, recall that V-learning requires K=poly(H,S,maxiAi)/ϵ2K=\operatorname{poly}(H,S,\max_{i}A_{i})/\epsilon^{2} episodes to ensure that σ^\widehat{\sigma} is an ϵ\epsilon-CCE. It is straightforward to show that σ^\widehat{\sigma} can be expressed as a non-uniform mixture of at most KKHS+1K^{KHS+1} policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}} (we prove this fact in detail below). By discretizing the non-uniform mixture, one can equivalently represent it as a uniform mixture of O(1/ϵ)KKHS+1O(1/\epsilon)\cdot K^{KHS+1} product policies, up to ϵ\epsilon error. Recalling the value of KK, we conclude that we can express σ^\widehat{\sigma} as a uniform mixture of T=exp(O~(1/ϵ2)poly(H,S,maxiAi))T=\exp(\widetilde{O}(1/\epsilon^{2})\cdot\operatorname{poly}(H,S,\max_{i}A_{i})) product policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}}. Note that the lower bound of Theorem 4.3 rules out the efficient computation of an ϵ\epsilon-CCE represented as a uniform mixture of Texp(ϵ2max{H,S,maxiAi})T\ll\exp(\epsilon^{2}\cdot\max\{H,S,\max_{i}A_{i}\}) efficiently computable policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}}. Thus, in the regime where 1/ϵ1/\epsilon is polynomial in H,S,maxiAiH,S,\max_{i}A_{i}, this upper bound on the sparsity of the policy σ^\widehat{\sigma} produced by V-learning matches that from Theorem 4.3, up to a polynomial in the exponent.
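For completeness, the discretization step can be carried out by a standard rounding argument, sketched here under the simplifying assumption that $1/\delta$ is an integer: writing $\widehat{\sigma}=\sum_{j=1}^{M}w_{j}\,\sigma_{j}$ with $M=K^{KHS+1}$, set $\delta:=\epsilon/M$ and form a uniform mixture over $1/\delta$ slots, assigning $\lfloor w_{j}/\delta\rfloor$ slots to $\sigma_{j}$ and any leftover slots to $\sigma_{1}$. The total probability mass moved in this way is at most $M\delta=\epsilon$, so the rounded mixture is within total variation distance $\epsilon$ of $\widehat{\sigma}$, and it uses $1/\delta=O(1/\epsilon)\cdot K^{KHS+1}$ product policies, as claimed.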

The sparsity of the output policy from V-learning.

We now sketch a proof of the fact that the output policy σ^\widehat{\sigma} produced by V-learning can be expressed as a (non-uniform) average of KKHS+1K^{KHS+1} policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}}, where KK is the number of episodes in the algorithm’s initial phase. We adopt the notation and terminology from [JLWY21].

Consider Algorithm 3 of [JLWY21], which describes the second phase of V-learning and produces the output policy σ^\widehat{\sigma}. We describe how to write σ^\widehat{\sigma} as a weighted average of a collection of product policies, each of which is indexed by a function ϕ:[H]×𝒮×[K][K]\phi:[H]\times\mathcal{S}\times[K]\rightarrow[K] and a parameter k0[K]k_{0}\in[K]: in particular, we will write σ^=k0,ϕwk0,ϕσk0,ϕΔ(Πgen,rnd)\widehat{\sigma}=\sum_{k_{0},\phi}w_{k_{0},\phi}\cdot\sigma_{k_{0},\phi}\in\Delta({\Pi}^{\mathrm{gen,rnd}}), where wk0,ϕ[0,1]w_{k_{0},\phi}\in[0,1] are mixing weights summing to 1 and σk0,ϕΠgen,rnd\sigma_{k_{0},\phi}\in{\Pi}^{\mathrm{gen,rnd}}. The number of tuples (k0,ϕ)(k_{0},\phi) is K1+KHSK^{1+KHS}.

We define the mixing weight wk0,ϕw_{k_{0},\phi} allocated to any tuple (k0,ϕ)(k_{0},\phi) to be:

wk0,ϕ:=1K(h,s,k)[H]×𝒮×[K]𝟙{ϕ(h,s,k)[Nhk(s)]}αNhk(s)ϕ(h,s,k),\displaystyle w_{k_{0},\phi}:=\frac{1}{K}\cdot\prod_{(h,s,k)\in[H]\times\mathcal{S}\times[K]}\mathbbm{1}\{{\phi(h,s,k)\in[N_{h}^{k}(s)]}\}\cdot\alpha_{N_{h}^{k}(s)}^{\phi(h,s,k)},

where Nhk(s)[K]N_{h}^{k}(s)\in[K] and αNhk(s)i[0,1]\alpha_{N_{h}^{k}(s)}^{i}\in[0,1] (for i[Nhk(s)]i\in[N_{h}^{k}(s)]) are defined as in [JLWY21].

Next, for each k0,ϕk_{0},\phi, we define σk0,ϕΠgen,rnd\sigma_{k_{0},\phi}\in{\Pi}^{\mathrm{gen,rnd}} to be the following policy: it maintains a parameter k[K]k\in[K] over the first hHh\leq{}H steps of the episode (as in Algorithm 3 of [JLWY21]), but upon reaching state ss at step hh, given the present value of k[K]k\in[K], sets i:=ϕ(h,s,k)i:=\phi(h,s,k), and updates kkhi(s)k\leftarrow k_{h}^{i}(s), and then samples an action 𝐚πhk(|s)\mathbf{a}\sim\pi_{h}^{k}(\cdot|s) (where khi(s),πhk(|s)k_{h}^{i}(s),\pi_{h}^{k}(\cdot|s) are defined in [JLWY21]). Since the mixing weights wk0,ϕw_{k_{0},\phi} defined above exactly simulate the random draws of the parameter kk in Line 1 and the parameters ii in Line 4 of [JLWY21, Algorithm 3], it follows that the distributional policy σ^\widehat{\sigma} defined by [JLWY21, Algorithm 3] is equal to k0,ϕwk0,ϕσk0,ϕΔ(Πgen,rnd)\sum_{k_{0},\phi}w_{k_{0},\phi}\cdot\sigma_{k_{0},\phi}\in\Delta({\Pi}^{\mathrm{gen,rnd}}).

6.2 No-regret learning against Markov deviations

As discussed in Section 1, [ELS+22] showed the existence of a learning algorithm with the property that if each agent plays it independently for TT episodes, then no player can achieve regret more than O(poly(m,H,S,maxiAi)T3/4)O(\operatorname{poly}(m,H,S,\max_{i}A_{i})\cdot T^{3/4}) by deviating to any fixed Markov policy. This notion of regret corresponds to, in the context of Definition 2.3, replacing maxσiΠigen,rnd\max_{\sigma_{i}\in{\Pi}^{\mathrm{gen,rnd}}_{i}} with the smaller quantity maxσiΠimarkov\max_{\sigma_{i}\in\Pi^{\mathrm{markov}}_{i}}. Thus, the result of [ELS+22] applies to a weaker notion of regret than that of the SparseCCE problem, and so does not contradict any of our lower bounds. One may wonder which of these two notions of regret (namely, best possible gain via deviation to a Markov versus non-Markov policy) is the “right” one. We do not believe that there is a definitive answer to this question, but we remark that in many empirical applications of multi-agent reinforcement learning it is standard to consider non-Markov policies [LDGV+21, AVDG+22]. Furthermore, as shown in the proposition below, there are extremely simple games, e.g., of constant size, in which Markov deviations lead to “vacuous” behavior: in particular, all Markov policies have the same (suboptimal) value but the best non-Markov policy has much greater value:

Proposition 6.1.

There is a 2-player, 2-action, 1-state Markov game with horizon 22 and a non-Markov policy σ2Π2gen,rnd\sigma_{2}\in{\Pi}^{\mathrm{gen,rnd}}_{2} for player 2 so that for all σ1Π1markov\sigma_{1}\in\Pi^{\mathrm{markov}}_{1}, V1σ1×σ2=1/2V_{1}^{\sigma_{1}\times\sigma_{2}}=1/2 yet maxσ1Π1gen,rnd{V1σ1×σ2}=3/4\max_{\sigma_{1}\in{\Pi}^{\mathrm{gen,rnd}}_{1}}\left\{V_{1}^{\sigma_{1}\times\sigma_{2}}\right\}=3/4.

The proof of Proposition 6.1 is provided in Section 6.5 below.

Other recent work has also proved no-regret guarantees with respect to deviations to restricted policy classes. In particular, [ZLY22] studies a setting in which each agent ii is allowed to play policies in an arbitrary restricted policy class ΠiΠigen,rnd\Pi_{i}^{\prime}\subseteq{\Pi}^{\mathrm{gen,rnd}}_{i} in each episode, and regret is measured with respect to deviations to any policy in Πi\Pi_{i}^{\prime}. [ZLY22] introduces an algorithm, DORIS, with the property that when all agents play it independently, each agent ii experiences regret O(poly(m,A,S,H)Ti=1mlog|Πi|)O\left(\operatorname{poly}(m,A,S,H)\cdot\sqrt{T\sum_{i=1}^{m}\log|\Pi_{i}^{\prime}|}\right) to their respective class Πi\Pi^{\prime}_{i}.171717Note that in the tabular setting, the sample complexity of DORIS (Corollary 1) scales with the size AA of the joint action set, since each player’s value function class consists of the class of all functions f:𝒮×𝒜[0,1]f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1], which has Eluder dimension scaling with SAS\cdot A, i.e., exponential in mm.

DORIS is not computationally efficient, since it involves performing exponential weights over the class Πi\Pi_{i}^{\prime}, which requires space complexity |Πi|\lvert\Pi^{\prime}_{i}\rvert. Nonetheless, one can compare the statistical guarantees the algorithm provides to our own results. Let Πimarkov,detΠimarkov\Pi^{\mathrm{markov,det}}_{i}\subset\Pi^{\mathrm{markov}}_{i} denote the set of deterministic Markov policies of agent ii, namely sequences πi=(πi,1,,πi,H)\pi_{i}=(\pi_{i,1},\ldots,\pi_{i,H}) so that πi,h:𝒮𝒜i\pi_{i,h}:\mathcal{S}\rightarrow\mathcal{A}_{i}. In the case that Πi=Πimarkov,det\Pi_{i}^{\prime}=\Pi^{\mathrm{markov,det}}_{i}, we have log|Πi|=O(SHlogAi)\log|\Pi_{i}^{\prime}|=O(SH\log A_{i}), which means that DORIS obtains no-regret against Markov deviations when mm is constant, comparable to [ELS+22].181818[ELS+22] has the added bonus of computational efficiency, even for polynomially large mm, though it has the significant drawback of assuming that the Markov game is known. However, we are interested in the setting in which each player’s regret is measured with respect to all deviations in Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} (equivalently, Πigen,det\Pi^{\mathrm{gen,det}}_{i}). Accordingly, if we take Πi=Πigen,detΠigen,rnd\Pi_{i}^{\prime}=\Pi^{\mathrm{gen,det}}_{i}\subset{\Pi}^{\mathrm{gen,rnd}}_{i},191919DORIS plays distributions over policies in Πi=Πigen,det\Pi_{i}^{\prime}=\Pi^{\mathrm{gen,det}}_{i} at each episode, whereas in our lower bounds we consider the setting where a policy in Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} is played each episode; Facts D.2 and D.3 show that these two settings are essentially equivalent, in that any policy in Π1gen,rnd××Πmgen,rnd{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m} can be simulated by one in Δ(Π1gen,det)××Δ(Πmgen,det)\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m}), and vice versa. then log|Πi|>(SAi)H1\log|\Pi_{i}^{\prime}|>(SA_{i})^{H-1}, meaning that DORIS does not imply any sort of sample-efficient guarantee, even for m=2m=2.

Finally, we remark that the algorithm DORIS [ZLY22], as well as the similar algorithm OPMD from earlier work of [LWJ22], obtains the same regret bound stated above even when the opponents are controlled by (possibly adaptive) adversaries. However, this guarantee crucially relies on the fact that any agent implementing DORIS must observe the policies played by opponents following each episode; this feature is the reason that the regret bound of DORIS does not contradict the exponential lower bound of [LWJ22] for no-regret learning against an adversarial opponent. As a result of being restricted to this “revealed-policy” setting, DORIS is not a fully decentralized algorithm in the sense we consider in this paper.

6.3 On the role of shared randomness

A key assumption in our lower bounds for no-regret learning is that each of the joint policies σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} produced by the algorithm is a product policy; such an assumption is natural, since it subsumes independent learning protocols in which each agent ii selects σi(t)\sigma_{i}^{\scriptscriptstyle(t)} without knowledge of σi(t)\sigma_{-i}^{\scriptscriptstyle(t)}. Compared to general (stochastic) joint policies, product policies have the desirable property that, to sample a trajectory from σ(t)=(σ1(t),,σm(t))Π1gen,rnd××Πmgen,rnd=Πgen,rnd\sigma^{\scriptscriptstyle{(t)}}=(\sigma_{1}^{\scriptscriptstyle{(t)}},\ldots,\sigma_{m}^{\scriptscriptstyle{(t)}})\in{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m}={\Pi}^{\mathrm{gen,rnd}}, the agents do not require access to shared randomness. In particular, each agent ii can independently sample its action from σi(t)\sigma_{i}^{\scriptscriptstyle{(t)}} at each of the HH steps of the episode. It is natural to ask how the situation changes if we allow the agents to use shared random bits when sampling from their policies, which corresponds to allowing σ(1),,σ(T)\sigma^{\scriptscriptstyle(1)},\ldots,\sigma^{\scriptscriptstyle{(T)}} to be non-product policies. In this case, V-learning yields a positive result via a standard “batch-to-online” conversion: by applying the first phase of V-learning during the first T2/3T^{2/3} episodes and playing trajectories sampled i.i.d. from the output policy produced by V-learning during the remaining TT2/3T-T^{2/3} episodes (which requires shared randomness), it is straightforward to see that a regret bound of order poly(H,S,maxiAi)T2/3\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{2/3} can be obtained. Similar remarks apply to SPoCMAR [DGZ22], which can obtain a slightly worse regret bound of order poly(H,S,maxiAi)T3/4\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{3/4} in the same fashion. In fact, the batch-to-online conversion approach gives a generic solution for the setting in which shared randomness is available. That is, the assumption of shared randomness eliminates any distinction between no-regret algorithms and (non-sparse) equilibrium computation algorithms, modulo a slight loss in rates. For this reason, the shared randomness assumption is too strong to develop any sort of distinct theory of no-regret learning.
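To spell out the accounting behind the $T^{2/3}$ rate (a rough calculation, treating each episode's total reward as bounded by $H$ and suppressing concentration terms): running the first phase for $K=T^{2/3}$ episodes contributes at most $H\cdot T^{2/3}$ regret and yields an $\epsilon$-CCE with $\epsilon=\operatorname{poly}(H,S,\max_{i}A_{i})\cdot K^{-1/2}=\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{-1/3}$, and each remaining episode then contributes at most $\epsilon$ expected regret, so that in total

$$\mathrm{Reg}_{i,T}\;\leq\;H\cdot T^{2/3}\;+\;(T-T^{2/3})\cdot\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{-1/3}\;\leq\;\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{2/3}.$$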

6.4 Comparison to lower bounds for finding stationary CCE

A separate line of work [DGZ22, JMS22] has recently shown PPAD-hardness for the problem of finding stationary Markov CCE in infinite-horizon discounted stochastic games. These results are incomparable with our own: stationary Markov CCE are not sparse (in the sense of Definition 3.1), whereas we do not require stationarity of policies (as is standard in the finite-horizon setting).

6.5 Proof of Proposition 6.1

Below we prove Proposition 6.1.

Proof of Proposition 6.1.

We construct the claimed Markov game 𝒢\mathcal{G} as follows. The single state is denoted by 𝔰\mathfrak{s}; as there is only a single state, the transitions are trivial. We denote each player’s action space as 𝒜1=𝒜2={1,2}\mathcal{A}_{1}=\mathcal{A}_{2}=\{1,2\}. The rewards to player 1 are given as follows: for all (a1,a2)𝒜(a_{1},a_{2})\in\mathcal{A},

R1,1(𝔰,(a1,a2))=12𝕀a2=1,R1,2(𝔰,(a1,a2))=12𝕀a1=a2.\displaystyle R_{1,1}(\mathfrak{s},(a_{1},a_{2}))=\frac{1}{2}\cdot\mathbb{I}_{a_{2}=1},\qquad R_{1,2}(\mathfrak{s},(a_{1},a_{2}))=\frac{1}{2}\cdot\mathbb{I}_{a_{1}=a_{2}}.

We allow the rewards of player 2 to be arbitrary; they do not affect the proof in any way.

We let σ2=(σ2,1,σ2,2)Π2gen,rnd\sigma_{2}=(\sigma_{2,1},\sigma_{2,2})\in{\Pi}^{\mathrm{gen,rnd}}_{2} be the policy which plays a uniformly random action at step 1 and then plays the same action at step 2: formally, σ2,1(s1)=Unif(𝒜2)\sigma_{2,1}(s_{1})=\operatorname{Unif}(\mathcal{A}_{2}), and σ2,2((s1,a2,1,r2,1),s2)=𝕀a2,1\sigma_{2,2}((s_{1},a_{2,1},r_{2,1}),s_{2})=\mathbb{I}_{a_{2,1}}. Then for any Markov policy σ1Π1markov\sigma_{1}\in\Pi^{\mathrm{markov}}_{1} of player 1, we must have σ1×σ2(a1,2=a2,2)=1/2\mathbb{P}_{\sigma_{1}\times\sigma_{2}}(a_{1,2}=a_{2,2})=1/2, which means that V1σ1×σ2=12𝔼σ1×σ2[𝕀a2,1=1+𝕀a1,2=a2,2]=1/2(1/2+1/2)=1/2V_{1}^{\sigma_{1}\times\sigma_{2}}=\frac{1}{2}\cdot\mathbb{E}_{\sigma_{1}\times\sigma_{2}}[\mathbb{I}_{a_{2,1}=1}+\mathbb{I}_{a_{1,2}=a_{2,2}}]=1/2\cdot(1/2+1/2)=1/2.

On the other hand, any general (non-Markov) policy σ1Π1gen,rnd\sigma_{1}\in{\Pi}^{\mathrm{gen,rnd}}_{1} which satisfies

σ1,2((s1,a1,1,r1,1),s2)={𝕀1:r1,1=1/2𝕀2:r1,1=0\displaystyle\sigma_{1,2}((s_{1},a_{1,1},r_{1,1}),s_{2})=\begin{cases}\mathbb{I}_{1}:&r_{1,1}=1/2\\ \mathbb{I}_{2}:&r_{1,1}=0\end{cases}

has V1σ1×σ2=1/2(1/2+1)=3/4V_{1}^{\sigma_{1}\times\sigma_{2}}=1/2\cdot(1/2+1)=3/4. ∎
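As a sanity check, the two values computed in the proof can be verified by simulation; the sketch below (with names of our own choosing) plays the game forward against sigma_2 and estimates V_1 both for a Markov rule and for the non-Markov deviation.

```python
import random

def simulate(player1_step2_rule, n_trials=200_000, seed=0):
    """Estimate player 1's value V_1 in the two-step, one-state game from the
    proof of Proposition 6.1, against the "repeat my first action" policy sigma_2.
    player1_step2_rule(a11, r11) returns player 1's step-2 action."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        a11 = rng.choice([1, 2])          # player 1's step-1 action (irrelevant to its rewards)
        a21 = rng.choice([1, 2])          # sigma_2 plays uniformly at step 1
        r11 = 0.5 if a21 == 1 else 0.0    # R_{1,1} = (1/2) * I{a_{2,1} = 1}
        a12 = player1_step2_rule(a11, r11)
        a22 = a21                         # sigma_2 repeats its step-1 action
        r12 = 0.5 if a12 == a22 else 0.0  # R_{1,2} = (1/2) * I{a_{1,2} = a_{2,2}}
        total += r11 + r12
    return total / n_trials

# Markov rules for player 1 ignore r11; as shown in the proof, every such rule gives ~1/2.
print(round(simulate(lambda a11, r11: 1), 2))                       # ~0.50
# The non-Markov deviation infers a_{2,1} from its step-1 reward; value ~3/4.
print(round(simulate(lambda a11, r11: 1 if r11 == 0.5 else 2), 2))  # ~0.75
```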

Acknowledgements

This work was performed in part while NG was an intern at Microsoft Research. NG is supported at MIT by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. SK acknowledges funding from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation Grant under award #CCF-2212841.

References

  • [AB06] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2006.
  • [AFK+22] Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Tuomas Sandholm. Uncoupled learning dynamics with O(log T) swap regret in multiplayer games. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [AVDG+22] John P. Agapiou, Alexander Sasha Vezhnevets, Edgar A. Duéñez-Guzmán, Jayd Matyas, Yiran Mao, Peter Sunehag, Raphael Köster, Udari Madhushani, Kavya Kopparapu, Ramona Comanescu, DJ Strouse, Michael B. Johanson, Sukhdeep Singh, Julia Haas, Igor Mordatch, Dean Mobbs, and Joel Z. Leibo. Melting pot 2.0, 2022.
  • [AYBK+13] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvari. Online learning in markov decision processes with adversarially chosen transition probability distributions. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • [Bab16] Yakov Babichenko. Query complexity of approximate nash equilibria. J. ACM, 63(4), oct 2016.
  • [BBD+22] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.
  • [BCI+08] Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 365–372, 2008.
  • [BJY20] Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
  • [Bla56] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
  • [BR17] Yakov Babichenko and Aviad Rubinstein. Communication complexity of approximate nash equilibria. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, page 878–889, New York, NY, USA, 2017. Association for Computing Machinery.
  • [Bro49] George Williams Brown. Some notes on computation of games solutions. 1949.
  • [BS18] Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
  • [CBFH+97] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
  • [CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
  • [CCT17] Xi Chen, Yu Cheng, and Bo Tang. Well-Supported vs. Approximate Nash Equilibria: Query Complexity of Large Games. In Christos H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pages 57:1–57:9, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [CDT06] Xi Chen, Xiaotie Deng, and Shang-hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 603–612, 2006.
  • [CP20] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18990–18999. Curran Associates, Inc., 2020.
  • [DFG21] Constantinos Costis Daskalakis, Maxwell Fishelson, and Noah Golowich. Near-optimal no-regret learning in general games. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • [DGP09] Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
  • [DGZ22] Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of markov equilibrium in stochastic games, 2022.
  • [ELS+22] Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, and Yishay Mansour. Regret minimization and convergence to equilibria in general-sum markov games, 2022.
  • [FGGS13] John Fearnley, Martin Gairing, Paul Goldberg, and Rahul Savani. Learning equilibria of games via payoff queries. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, EC ’13, page 397–414, New York, NY, USA, 2013. Association for Computing Machinery.
  • [FLM94] Drew Fudenberg, David Levine, and Eric Maskin. The folk theorem with imperfect public information. Econometrica, 62(5):997–1039, 1994.
  • [FRSS22] Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. arXiv preprint arXiv:2206.13063, 2022.
  • [Han57] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • [HMC00] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • [JKSY20] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning (ICML), pages 4870–4879. PMLR, 2020.
  • [JLWY21] Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent RL. arXiv preprint arXiv:2110.14555, 2021.
  • [JMS22] Yujia Jin, Vidya Muthukumar, and Aaron Sidford. The complexity of infinite-horizon general-sum stochastic games, 2022.
  • [Kak03] Sham M Kakade. On the sample complexity of reinforcement learning, 2003.
  • [KECM21] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
  • [KEG+22] János Kramár, Tom Eccles, Ian Gemp, Andrea Tacchetti, Kevin R. McKee, Mateusz Malinowski, Thore Graepel, and Yoram Bachrach. Negotiation and honesty in artificial intelligence methods for the board game of Diplomacy. Nature Communications, 13(1):7214, December 2022. Number: 1 Publisher: Nature Publishing Group.
  • [KMN99] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1324–1331, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
  • [LDGV+21] Joel Z Leibo, Edgar A Dueñez-Guzman, Alexander Vezhnevets, John P Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with melting pot. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6187–6199. PMLR, 18–24 Jul 2021.
  • [LS05] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. Decision Support Systems, 39:55–66, 2005.
  • [LS20] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • [LW94] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
  • [LWJ22] Qinghua Liu, Yuanhao Wang, and Chi Jin. Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 14036–14053. PMLR, 17–23 Jul 2022.
  • [MB21] Weichao Mao and Tamer Basar. Provably efficient reinforcement learning in decentralized general-sum markov games. CoRR, abs/2110.05682, 2021.
  • [MF86] Eric Maskin and D Fudenberg. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 53(3):533–554, 1986. Reprinted in A. Rubinstein (ed.), Game Theory in Economics, London: Edward Elgar, 1995. Also reprinted in D. Fudenberg and D. Levine (eds.), A Long-Run Collaboration on Games with Long-Run Patient Players, World Scientific Publishers, 2009, pp. 209-230.
  • [MK15] Kleanthis Malialis and Daniel Kudenko. Distributed response to network intrusions using multiagent reinforcement learning. Engineering Applications of Artificial Intelligence, 41:270–284, 2015.
  • [Nas51] John Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
  • [Pap94] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. J. Comput. Syst. Sci., 48(3):498–532, 1994.
  • [Put94] Martin Puterman. Markov Decision Processes. John Wiley & Sons, Ltd, 1 edition, 1994.
  • [PVH+22] Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, and Karl Tuyls. Mastering the game of stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022.
  • [Rou15] Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM, 2015.
  • [Rub16] Aviad Rubinstein. Settling the complexity of computing approximate two-player Nash equilibria. In Annual Symposium on Foundations of Computer Science (FOCS), pages 258–265. IEEE, 2016.
  • [Rub18] Aviad Rubinstein. Inapproximability of Nash equilibrium. SIAM Journal on Computing, 47(3):917–959, 2018.
  • [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems (NIPS), pages 2989–2997, 2015.
  • [Sha53] Lloyd Shapley. Stochastic Games. PNAS, 1953.
  • [SHM+16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [SMB22] Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations, 2022.
  • [SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, feb 2012.
  • [SSS16] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016.
  • [Vov90] Vladimir Vovk. Aggregating strategies. Proc. of Computational Learning Theory, 1990, 1990.
  • [WAJ+21] Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, and Csaba Szepesvari. On query-efficient planning in mdps under linear realizability of the optimal state-value function. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4355–4385. PMLR, 15–19 Aug 2021.
  • [YHAY+22] Dong Yin, Botao Hao, Yasin Abbasi-Yadkori, Nevena Lazić, and Csaba Szepesvári. Efficient local planning with linear function approximation. In Sanjoy Dasgupta and Nika Haghtalab, editors, Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pages 1165–1192. PMLR, 29 Mar–01 Apr 2022.
  • [ZLY22] Wenhao Zhan, Jason D Lee, and Zhuoran Yang. Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in markov games. arXiv preprint arXiv:2206.01588, 2022.
  • [ZTS+22] Stephan Zheng, Alexander Trott, Sunil Srinivasa, David C. Parkes, and Richard Socher. The ai economist: Taxation policy design via two-level deep multiagent reinforcement learning. Science Advances, 8(18):eabk2607, 2022.

Appendix A Additional preliminaries

A.1 Nash equilibria and computational hardness.

The most foundational and well known solution concept for normal-form games is the Nash equilibrium [Nas51].

Definition A.1 ((n,ϵ)(n,\epsilon)-Nash problem).

For a normal-form game G=(M1,,Mm)G=(M_{1},\ldots,M_{m}) and ϵ>0\epsilon>0, a product distribution pj=1mΔ([n])p\in\prod_{j=1}^{m}\Delta([n]) is said to be an ϵ\epsilon-Nash equilibrium for GG if for all i[m]i\in[m],

maxai[n]𝔼𝐚p[(Mi)ai,𝐚i]𝔼𝐚p[(Mi)𝐚]ϵ.\displaystyle\max_{a_{i}^{\prime}\in[n]}\mathbb{E}_{\mathbf{a}\sim p}[(M_{i})_{a_{i}^{\prime},\mathbf{a}_{-i}}]-\mathbb{E}_{\mathbf{a}\sim p}[(M_{i})_{\mathbf{a}}]\leq\epsilon.

We define the mm-player (n,ϵ)(n,\epsilon)-Nash problem to be the problem of computing an ϵ\epsilon-Nash equilibrium of a given mm-player nn-action normal-form game.202020One must also take care to specify the bit complexity of representing a normal-form game. We assume that the payoffs of any normal-form game given as an instance to the (n,ϵ)(n,\epsilon)-Nash problem can each be expressed with max{n,m}\max\{n,m\} bits; this assumption is without loss of generality as long as ϵ2max{n,m}\epsilon\geq 2^{-\max\{n,m\}} (which it will be for us).

Informally, pp is an ϵ\epsilon-Nash equilibrium if no player ii can gain more than ϵ\epsilon in reward by deviating to a single fixed action aia_{i}^{\prime}, while all other players randomly choose their actions according to pp. Despite the intuitive appeal of Nash equilibria, they are intractable to compute: for any c>0c>0, it is PPAD-hard to solve the (n,nc)(n,n^{-c})-Nash problem, namely, to compute ncn^{-c}-approximate Nash equilibria in 2-player nn-action normal-form games [DGP09, CDT06, Rub18]. We recall that the complexity class PPAD consists of all total search problems which have a polynomial-time reduction to the End-of-The-Line (EOTL) problem. PPAD is the most well-studied complexity class in algorithmic game theory, and it is widely believed that PPADP\textsf{PPAD}\neq\textsf{P}. We refer the reader to [DGP09, CDT06, Rub18, Pap94] for further background on the class PPAD and the EOTL problem.
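The deviation check in Definition A.1 is simple to state in code; the following is a minimal sketch for the two-player case (function and variable names are ours), which computes the largest gain either player can obtain by deviating to a fixed action.

```python
import numpy as np

def nash_gap(M1, M2, p1, p2):
    """Maximum gain any player can obtain by deviating to a fixed action
    when play follows the product distribution p1 x p2.
    (p1, p2) is an eps-Nash equilibrium iff nash_gap(...) <= eps.
    """
    u1_dev = M1 @ p2   # player 1's expected payoff for each pure action a1'
    u2_dev = p1 @ M2   # player 2's expected payoff for each pure action a2'
    u1 = p1 @ M1 @ p2  # player 1's expected payoff under (p1, p2)
    u2 = p1 @ M2 @ p2  # player 2's expected payoff under (p1, p2)
    return max(u1_dev.max() - u1, u2_dev.max() - u2)

# Matching pennies: the uniform product profile is an exact Nash equilibrium.
M1 = np.array([[1.0, 0.0], [0.0, 1.0]])
M2 = 1.0 - M1
p = np.array([0.5, 0.5])
print(nash_gap(M1, M2, p, p))   # 0.0
```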

A.2 Query complexity of Nash equilibria

Our statistical lower bound for the SparseCCE problem in Theorem 5.2 relies on existing query complexity lower bounds for computing approximate Nash equilibria in mm-player normal-form games. We first review the query complexity model for normal-form games.

Oracle model for normal-form games.

For m,nm,n\in\mathbb{N}, consider an mm-player nn-action normal form game GG, specified by payoff tensors M1,,MmM_{1},\ldots,M_{m}. Since the tensors M1,,MmM_{1},\ldots,M_{m} contain a total of mnmmn^{m} real-valued payoffs, in the setting when mm is large, it is unrealistic to assume that an algorithm is given the full payoff tensors as input. Therefore, prior work on computing equilibria in such games has studied the setting in which the algorithm makes adaptive oracle queries to the payoff tensors.

In particular, the algorithm, which is allowed to be randomized, has access to a payoff oracle 𝒪G\mathcal{O}_{G} for the game GG, which works as follows. At each time step, the algorithm can choose to specify an action profile 𝐚[n]m\mathbf{a}\in[n]^{m} and then query 𝒪G\mathcal{O}_{G} at the action profile 𝐚\mathbf{a}. The oracle 𝒪G\mathcal{O}_{G} then returns the payoffs (M1)𝐚,,(Mm)𝐚(M_{1})_{\mathbf{a}},\ldots,(M_{m})_{\mathbf{a}} for each player if the action profile 𝐚\mathbf{a} is played.

Query complexity lower bound for approximate Nash equilibrium.

The following theorem gives a lower bound on the number of queries any randomized algorithm needs to make to compute an approximate Nash equilibrium in an mm-player game.

Theorem A.2 (Corollary 4.5 of [Rub16]).

There is a constant ϵ0>0\epsilon_{0}>0 so that any randomized algorithm which solves the (2,ϵ0)(2,\epsilon_{0})-Nash problem for mm-player normal-form games with probability at least 2/32/3 must use at least 2Ω(m)2^{\Omega(m)} payoff queries.

We remark that [Bab16, CCT17] provide similar, though quantitatively weaker, lower bounds to that in Theorem A.2. We also emphasize that the lower bound of Theorem A.2 applies to any algorithm, i.e., including those which require extremely large computation time.

Appendix B Proofs of lower bounds for SparseMarkovCCE (Section 3)

B.1 Preliminaries: Online density estimation

Our proof makes use of tools for online learning with the logarithmic loss, also known as conditional density estimation. In particular, we use a variant of the exponential weights algorithm known as Vovk’s aggregating algorithm in the context of density estimation [Vov90, CBL06]. We consider the following setting with two players, a Learner and Nature. Furthermore, there is a set 𝒴\mathcal{Y}, called the outcome space, and a set 𝒳\mathcal{X}, called the context space; for our applications it suffices to assume 𝒴\mathcal{Y} and 𝒳\mathcal{X} are finite. For some TT\in\mathbb{N}, there are TT time steps t=1,2,,Tt=1,2,\ldots,T. At each time step t[T]t\in[T]:

  • Nature reveals a context x(t)𝒳x^{\scriptscriptstyle{(t)}}\in\mathcal{X};

  • Having seen the context x(t)x^{\scriptscriptstyle{(t)}}, the learner predicts a distribution q^(t)Δ(𝒴)\widehat{q}^{\scriptscriptstyle{(t)}}\in\Delta(\mathcal{Y});

  • Nature chooses an outcome y(t)𝒴y^{\scriptscriptstyle{(t)}}\in\mathcal{Y}, and the learner suffers loss log(t)(q^(t)):=log(1q^(t)(y(t))).\ell_{\mathrm{log}}^{\scriptscriptstyle{(t)}}(\widehat{q}^{\scriptscriptstyle{(t)}}):=\log\left(\frac{1}{\widehat{q}^{\scriptscriptstyle{(t)}}(y^{\scriptscriptstyle{(t)}})}\right).

For each t[T]t\in[T], we let (t)={(x(1),y(1),q^(1)),,(x(t),y(t),q^(t))}\mathcal{H}^{\scriptscriptstyle{(t)}}=\{(x^{\scriptscriptstyle{(1)}},y^{\scriptscriptstyle{(1)}},\widehat{q}^{\scriptscriptstyle{(1)}}),\ldots,(x^{\scriptscriptstyle{(t)}},y^{\scriptscriptstyle{(t)}},\widehat{q}^{\scriptscriptstyle{(t)}})\} denote the history of interaction up to step tt; we emphasize that each context x(t)x^{\scriptscriptstyle{(t)}} may be chosen adaptively as a function of (t1)\mathcal{H}^{\scriptscriptstyle{(t-1)}}. Let (t)\mathscr{F}^{\scriptscriptstyle{(t)}} denote the sigma-algebra generated by ((t),x(t+1))(\mathcal{H}^{\scriptscriptstyle{(t)}},x^{\scriptscriptstyle{(t+1)}}). We measure performance in terms of regret against a set \mathcal{I} of experts, also known as the expert setting. Each expert ii\in\mathcal{I} consists of a function pi:𝒳Δ(𝒴)p_{i}:\mathcal{X}\rightarrow\Delta(\mathcal{Y}). The regret of an algorithm against the expert class \mathcal{I} when it receives contexts x(1),,x(T)x^{\scriptscriptstyle{(1)}},\ldots,x^{\scriptscriptstyle{(T)}} and observes outcomes y(1),,y(T)y^{\scriptscriptstyle{(1)}},\ldots,y^{\scriptscriptstyle{(T)}} is defined as

Reg,T=t=1Tlog(t)(q^(t))minit=1Tlog(t)(pi(x(t))).\displaystyle\mathrm{Reg}_{\mathcal{I},T}=\sum_{t=1}^{T}\ell_{\mathrm{log}}^{\scriptscriptstyle{(t)}}(\widehat{q}^{\scriptscriptstyle{(t)}})-\min_{i\in\mathcal{I}}\sum_{t=1}^{T}\ell_{\mathrm{log}}^{\scriptscriptstyle{(t)}}(p_{i}(x^{\scriptscriptstyle{(t)}})).

Note that the learner can observe the expert predictions {pi(x(t))}i\{p_{i}(x^{\scriptscriptstyle(t)})\}_{i\in\mathcal{I}} and use them to make its own prediction at each round tt.

Proposition B.1 (Vovk’s aggregating algorithm).

Consider Vovk’s aggregating algorithm, which predicts via

q^(t)(y):=𝔼iq~(t)[pi(x(t))(y)],whereq~(t)(i):=exp(s=1t1log(s)(pi(x(s))))jexp(s=1t1log(s)(pj(x(s)))).\displaystyle\widehat{q}^{\scriptscriptstyle{(t)}}(y):=\mathbb{E}_{i\sim\widetilde{q}^{\scriptscriptstyle{(t)}}}[p_{i}(x^{\scriptscriptstyle{(t)}})(y)],\quad\text{where}\quad\widetilde{q}^{\scriptscriptstyle{(t)}}(i)\vcentcolon=\frac{\exp\left(-\sum_{s=1}^{t-1}\ell_{\mathrm{log}}^{\scriptscriptstyle{(s)}}(p_{i}(x^{\scriptscriptstyle{(s)}}))\right)}{\sum_{j\in\mathcal{I}}\exp\left(-\sum_{s=1}^{t-1}\ell_{\mathrm{log}}^{\scriptscriptstyle{(s)}}(p_{j}(x^{\scriptscriptstyle{(s)}}))\right)}. (3)

This algorithm guarantees a regret bound of Reg,Tlog||\mathrm{Reg}_{\mathcal{I},T}\leq\log|\mathcal{I}|.
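For concreteness, here is a minimal implementation sketch of the aggregation rule (3), together with a toy realizable instance in the spirit of Proposition B.2; the class name, the finite expert list, and the numbers are our own illustration.

```python
import numpy as np

class VovkAggregator:
    """Vovk's aggregating algorithm: exponential weights with the log loss.

    experts: a finite list of functions x -> probability vector over outcomes.
    Cumulative log loss is at most the best expert's loss plus log(len(experts)).
    """

    def __init__(self, experts):
        self.experts = experts
        self.log_weights = np.zeros(len(experts))  # minus each expert's cumulative log loss

    def predict(self, x):
        w = np.exp(self.log_weights - self.log_weights.max())
        w /= w.sum()                               # the distribution q_tilde^{(t)} in (3)
        preds = np.array([e(x) for e in self.experts])
        return w @ preds                           # q_hat^{(t)}: mixture of expert predictions

    def update(self, x, y):
        # After Nature reveals the outcome y, charge each expert its log loss.
        for i, e in enumerate(self.experts):
            self.log_weights[i] += np.log(e(x)[y])

# A toy realizable instance (cf. Proposition B.2): outcomes follow a fixed unknown expert.
rng = np.random.default_rng(0)
experts = [lambda x, q=q: q for q in (np.array([0.9, 0.1]), np.array([0.2, 0.8]))]
alg, truth = VovkAggregator(experts), experts[1]
for _ in range(200):
    x = None                       # contexts play no role in this toy example
    q_hat = alg.predict(x)
    y = int(rng.choice(2, p=truth(x)))
    alg.update(x, y)
print(np.round(q_hat, 2))          # close to the true expert's distribution [0.2, 0.8]
```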

Recall that for probability distributions p,qp,q on a finite set \mathcal{B}, their total variation distance is defined as

D𝖳𝖵(p,q)=max|p()q()|.\displaystyle D_{\mathsf{TV}}({p},{q})=\max_{\mathcal{E}\subset\mathcal{B}}|p(\mathcal{E})-q(\mathcal{E})|. (4)

As a (standard) consequence of Proposition B.1, in the realizable setting in which the distribution of y(t)|x(t)y^{\scriptscriptstyle{(t)}}|x^{\scriptscriptstyle{(t)}} follows pi(x(t))p_{i^{\star}}(x^{\scriptscriptstyle{(t)}}) for some fixed (unknown) expert ii^{\star}\in\mathcal{I}, we can obtain a bound on the total variation distance between the algorithm’s predictions and those of pi(x(t))p_{i^{\star}}(x^{\scriptscriptstyle{(t)}}).

Proposition B.2.

If the distribution of outcomes is realizable, i.e., there exists an expert ii^{\star}\in\mathcal{I} so that y(t)pi(x(t))|x(t),(t1)y^{\scriptscriptstyle{(t)}}\sim p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})\ |\ x^{\scriptscriptstyle{(t)}},\mathcal{H}^{\scriptscriptstyle{(t-1)}} for all t[T]t\in[T], then the predictions q^(t)\widehat{q}^{\scriptscriptstyle{(t)}} of the aggregation algorithm (3) satisfy

t=1T𝔼[D𝖳𝖵(q^(t),pi(x(t)))]Tlog||.\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})\right]\leq\sqrt{T\log|\mathcal{I}|}.

For completeness, we provide the proof of Proposition B.2 here.

Proof of Proposition B.2.

To simplify notation, for an expert ii\in\mathcal{I}, a context x𝒳x\in\mathcal{X}, and an outcome y𝒴y\in\mathcal{Y}, we write pi(y|x)p_{i}(y|x) to denote pi(x)(y)p_{i}(x)(y).

Proposition B.1 gives that the following inequality holds (almost surely):

Reg,T=t=1Tlog(1q^(t)(y(t)))t=1Tlog(1pi(y(t)|x(t)))log||.\displaystyle\mathrm{Reg}_{\mathcal{I},T}=\sum_{t=1}^{T}\log\left(\frac{1}{\widehat{q}^{\scriptscriptstyle{(t)}}(y^{\scriptscriptstyle{(t)}})}\right)-\sum_{t=1}^{T}\log\left(\frac{1}{p_{i^{\star}}(y^{\scriptscriptstyle{(t)}}|x^{\scriptscriptstyle{(t)}})}\right)\leq\log|\mathcal{I}|.

For each t[T]t\in[T], note that q^(t)\widehat{q}^{\scriptscriptstyle{(t)}} and x(t)x^{\scriptscriptstyle{(t)}} are (t1)\mathscr{F}^{\scriptscriptstyle{(t-1)}}-measurable (by definition). Then

t=1TD𝖳𝖵(q^(t),pi(x(t)))2\displaystyle\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})^{2}\leq t=1TD𝖪𝖫(pi(x(t))q^(t))\displaystyle\sum_{t=1}^{T}D_{\mathsf{KL}}({p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})}\|{\widehat{q}^{\scriptscriptstyle{(t)}}})
=\displaystyle= t=1Ty𝒴pi(y|x(t))log(pi(y|x(t))q^(t)(y))\displaystyle\sum_{t=1}^{T}\sum_{y\in\mathcal{Y}}p_{i^{\star}}(y|x^{\scriptscriptstyle{(t)}})\cdot\log\left(\frac{p_{i^{\star}}(y|x^{\scriptscriptstyle{(t)}})}{\widehat{q}^{\scriptscriptstyle{(t)}}(y)}\right)
=\displaystyle= t=1T𝔼[log(1q^(t)(y(t)))log(1pi(y(t)|x(t)))|(t1)],\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[\log\left(\frac{1}{\widehat{q}^{\scriptscriptstyle{(t)}}(y^{\scriptscriptstyle{(t)}})}\right)-\log\left(\frac{1}{p_{i^{\star}}(y^{\scriptscriptstyle{(t)}}|x^{\scriptscriptstyle{(t)}})}\right)\ |\ \mathscr{F}^{\scriptscriptstyle{(t-1)}}\right],

where the first inequality uses Pinsker’s inequality and the final equality uses the fact that y(t)pi(x(t))|x(t),(t1)y^{\scriptscriptstyle{(t)}}\sim p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})|x^{\scriptscriptstyle{(t)}},\mathcal{H}^{\scriptscriptstyle{(t-1)}}. It follows that

𝔼[t=1TD𝖳𝖵(q^(t),pi(x(t)))2]𝔼[Reg,T]log||.\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})^{2}\right]\leq\mathbb{E}[\mathrm{Reg}_{\mathcal{I},T}]\leq\log\lvert\mathcal{I}\rvert.

Jensen’s inequality now gives that

𝔼[t=1TD𝖳𝖵(q^(t),pi(x(t)))]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})\right]\leq T𝔼[t=1TD𝖳𝖵(q^(t),pi(x(t)))2]Tlog||.\displaystyle\sqrt{T}\cdot\sqrt{\mathbb{E}\left[\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})^{2}\right]}\leq\sqrt{T\log|\mathcal{I}|}.
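As a quick sanity check (our own illustration, not part of the paper), one can simulate the realizable setting and observe that the cumulative total variation distance stays below the bound of Proposition B.2; contexts are omitted purely for simplicity.

```python
# Illustrative simulation of Proposition B.2: in a realizable setting, the cumulative
# TV distance between the aggregated predictions and the true expert stays below
# sqrt(T log |I|).
import numpy as np

rng = np.random.default_rng(0)
T, n_experts, n_outcomes = 2000, 50, 5
experts = rng.dirichlet(np.ones(n_outcomes), size=n_experts)  # expert i predicts experts[i]
i_star = 7                                                    # outcomes are drawn from this expert

cum_log_loss = np.zeros(n_experts)
tv_sum = 0.0
for t in range(T):
    w = np.exp(cum_log_loss.min() - cum_log_loss)
    q_hat = (w / w.sum()) @ experts                  # aggregated prediction, eq. (3)
    tv_sum += 0.5 * np.abs(q_hat - experts[i_star]).sum()
    y = rng.choice(n_outcomes, p=experts[i_star])    # realizable outcome
    cum_log_loss -= np.log(experts[:, y])

print(tv_sum, "<=", np.sqrt(T * np.log(n_experts)))  # the bound of Proposition B.2
```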

B.2 Proof of Theorem 3.2

Proof of Theorem 3.2.

Fix nn\in\mathbb{N}, which we recall represents an upper bound on the description length of the Markov game. Assume that we are given an algorithm \mathscr{B} that solves the (T,ϵ)(T,\epsilon)-SparseMarkovCCE problem for Markov games 𝒢\mathcal{G} satisfying |𝒢|n|\mathcal{G}|\leq n in time UU. We proceed to describe an algorithm which solves the 2-player (n1/2/2,4ϵ)(\lfloor n^{1/2}/2\rfloor,4\cdot\epsilon)-Nash problem in time (nTU)C0(nTU)^{C_{0}}, as long as T<exp(ϵ2n1/2/25)T<\exp(\epsilon^{2}\cdot n^{1/2}/2^{5}). First, define n0:=n1/2/2n_{0}:=\lfloor n^{1/2}/2\rfloor, and consider an arbitrary 2-player n0n_{0}-action normal form GG, which is specified by payoff matrices M1,M2[0,1]n0×n0M_{1},M_{2}\in[0,1]^{n_{0}\times n_{0}}, so that all entries of the game can be written in binary using at most n0n_{0} bits (recall, per footnote 20, that we may assume that the entries of an instance of (n0,4ϵ)(n_{0},4\cdot\epsilon)-Nash can be specified with n0n_{0} bits). Based on GG, we construct a 2-player Markov game 𝒢:=𝒢(G)\mathcal{G}:=\mathcal{G}(G) as follows:

Definition B.3.

We define the game 𝒢(G)\mathcal{G}(G) to consist of the tuple 𝒢(G)=(𝒮,H,(𝒜i)i[2],,(Ri)i[2],μ)\mathcal{G}(G)=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[2]},\mathbb{P},(R_{i})_{i\in[2]},\mu), where:

  • The horizon of 𝒢\mathcal{G} is H=2n0/2H=2\lfloor n_{0}/2\rfloor (i.e., the largest even number at most n0n_{0}).

  • Let A=n0A=n_{0}; the action spaces of the 2 agents are given by 𝒜1=𝒜2=[A]\mathcal{A}_{1}=\mathcal{A}_{2}=[A].

  • There are a total of A2+1A^{2}+1 states: in particular, there is a state 𝔰(a1,a2)\mathfrak{s}_{(a_{1},a_{2})} for each (a1,a2)[A]2(a_{1},a_{2})\in[A]^{2}, as well as a distinguished state 𝔰\mathfrak{s}, so we have:

    𝒮={𝔰}{𝔰(a1,a2):(a1,a2)[A]2}.\displaystyle\mathcal{S}=\{\mathfrak{s}\}\cup\{\mathfrak{s}_{(a_{1},a_{2})}\ :\ (a_{1},a_{2})\in[A]^{2}\}.
  • For all odd h[H]h\in[H], the reward to agents j[2]j\in[2] given that the action profile (a1,a2)(a_{1},a_{2}) is played at step hh is given by Rj,h(s,(a1,a2)):=1H(Mj)a1,a2R_{j,h}(s,(a_{1},a_{2})):=\frac{1}{H}\cdot(M_{j})_{a_{1},a_{2}}, for all s𝒮s\in\mathcal{S}. All agents receive 0 reward at even steps h[H]h\in[H].

  • At odd steps h[H]h\in[H], if actions a1,a2[A]a_{1},a_{2}\in[A] are taken, the game transitions to the state 𝔰(a1,a2)\mathfrak{s}_{(a_{1},a_{2})}. At even steps h[H]h\in[H], the game always transitions to the state 𝔰\mathfrak{s}.

  • The initial state (i.e., at step h=1h=1) is 𝔰\mathfrak{s} (i.e., μ\mu is a singleton distribution supported on 𝔰\mathfrak{s}).

It is evident that this construction takes polynomial time, and satisfies |𝒢|A2+1n02+1n|\mathcal{G}|\leq A^{2}+1\leq n_{0}^{2}+1\leq n. We will now show that, by applying the algorithm \mathscr{B} to 𝒢\mathcal{G}, we can efficiently compute a 4ϵ4\cdot\epsilon-approximate Nash equilibrium for the original game GG. To do so, we appeal to Algorithm 1.

1:Input: 2-player, n0n_{0}-action normal form game GG.
2:Construct the 2-player Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) per Definition B.3, which satisfies |𝒢|n|\mathcal{G}|\leq n.
3:Call the algorithm \mathscr{B} on the game 𝒢\mathcal{G}, which produces a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}, where each σ(t)Πmarkov\sigma^{\scriptscriptstyle{(t)}}\in\Pi^{\mathrm{markov}}.
4:for t[T]t\in[T] and odd h[H]h\in[H] do
5:     if σh(t)(𝔰)Δ(𝒜1)×Δ(𝒜2)\sigma^{\scriptscriptstyle{(t)}}_{h}(\mathfrak{s})\in\Delta(\mathcal{A}_{1})\times\Delta(\mathcal{A}_{2}) is a (4ϵ,n)(4\cdot\epsilon,n)-Nash equilibrium of GG then return σh(t)(𝔰)\sigma^{\scriptscriptstyle{(t)}}_{h}(\mathfrak{s}).
6:if the for loop terminates without returning: return fail.
Algorithm 1 Algorithm to compute Nash equilibrium used in proof of Theorem 3.2.

Algorithm 1 proceeds as follows. First, it constructs the 2-player Markov game 𝒢(G)\mathcal{G}(G) as defined above, and calls the algorithm \mathscr{B}, which returns a sequence σ(1),,σ(T)Πmarkov\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in\Pi^{\mathrm{markov}} of product Markov policies with the property that the average σ¯:=1Tt=1T𝕀σ(t)\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}. It then enumerates over the distributions σh(t)(𝔰)Δ(𝒜1)×Δ(𝒜2)\sigma^{\scriptscriptstyle{(t)}}_{h}(\mathfrak{s})\in\Delta(\mathcal{A}_{1})\times\Delta(\mathcal{A}_{2}) for each t[T]t\in[T] and h[H]h\in[H] odd, and checks whether each one is a 4ϵ4\cdot\epsilon-approximate Nash equilibrium of GG. If so, the algorithm outputs such a Nash equilibrium, and otherwise, it fails. The proof of Theorem 3.2 is thus completed by the following lemma, which states that as long as σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}, Algorithm 1 never fails.
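The test in Line 5 only requires access to the payoff matrices of GG. A small Python sketch (ours, with illustrative names) of this check:

```python
# Sketch of the check in Line 5 of Algorithm 1: a product distribution (p1, p2) is
# accepted if neither player can gain more than 4*eps by deviating to a pure action
# in the normal-form game with payoff matrices (M1, M2).
import numpy as np

def is_approx_nash(p1, p2, M1, M2, eps):
    v1 = p1 @ M1 @ p2                    # player 1's expected payoff
    v2 = p1 @ M2 @ p2                    # player 2's expected payoff
    gain1 = np.max(M1 @ p2) - v1         # best pure deviation for player 1
    gain2 = np.max(p1 @ M2) - v2         # best pure deviation for player 2
    return max(gain1, gain2) <= 4 * eps
```

Each such check takes time polynomial in the number of actions, so enumerating over the at most TH candidate distributions adds only polynomial overhead.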

Lemma B.4 (Correctness of Algorithm 1).

Consider the normal form game GG and the Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) as constructed above, which has horizon HH. For any ϵ0>0\epsilon_{0}>0, TT\in\mathbb{N}, if T<exp(Hϵ02/28)T<\exp(H\cdot\epsilon_{0}^{2}/2^{8}) and σ(1),,σ(T)Πmarkov\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in\Pi^{\mathrm{markov}} are product Markov policies so that 1Tt=1T𝕀σ(t)\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an (ϵ0/4)(\epsilon_{0}/4)-CCE of 𝒢\mathcal{G}, then there is some odd h[H]h\in[H] and t[T]t\in[T] so that σh(t)(𝔰)\sigma_{h}^{\scriptscriptstyle{(t)}}(\mathfrak{s}) is an ϵ0\epsilon_{0}-Nash equilibrium of GG.

The proof of Lemma B.4 is given below. Applying Lemma B.4 with ϵ0=4ϵ\epsilon_{0}=4\epsilon (which is a valid application since T<exp(n0(4ϵ)2/28)T<\exp(n_{0}\cdot(4\epsilon)^{2}/2^{8}) by our assumption on T,ϵT,\epsilon), yields that Algorithm 1 always finds a 4ϵ4\epsilon-Nash equilibrium of the n0n_{0}-action normal form game GG, thus solving the given instance of the (n0,4ϵ)(n_{0},4\cdot\epsilon)-Nash problem. Furthermore, it is straightforward to see that Algorithm 1 runs in time U+(nT)C0(UnT)C0U+(nT)^{C_{0}}\leq(UnT)^{C_{0}}, for some constant C01C_{0}\geq 1.

Proof of Lemma B.4.

Consider a sequence of product Markov policies σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} with the property that the average σ¯=1Tt=1T𝕀σ(t)\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an (ϵ0/4)(\epsilon_{0}/4)-CCE of 𝒢\mathcal{G}. For all odd h[H]h\in[H] and j[2]j\in[2], let pj,h(t):=σj,h(t)(𝔰)Δ(𝒜j)p^{\scriptscriptstyle{(t)}}_{j,h}:=\sigma^{\scriptscriptstyle{(t)}}_{j,h}(\mathfrak{s})\in\Delta(\mathcal{A}_{j}), which is the distribution played under σ(t)\sigma^{\scriptscriptstyle{(t)}} by player jj at step hh (at the unique state 𝔰\mathfrak{s} with positive probability of being reached at step hh). For odd hh, we have σh(t)(𝔰)=p1,h(t)×p2,h(t)\sigma_{h}^{\scriptscriptstyle{(t)}}(\mathfrak{s})=p_{1,h}^{\scriptscriptstyle{(t)}}\times p_{2,h}^{\scriptscriptstyle{(t)}}, and our goal is to show that for some odd h[H]h\in[H] and t[T]t\in[T], p1,h(t)×p2,h(t)p_{1,h}^{\scriptscriptstyle{(t)}}\times p_{2,h}^{\scriptscriptstyle{(t)}} is an ϵ0\epsilon_{0}-Nash equilibrium of GG. To proceed, suppose for the sake of contradiction that this is not the case.

Let us write 𝒪H:={h[H]:hodd}\mathcal{O}_{H}:=\{h\in[H]:h\ \rm{odd}\} to denote the set of odd-numbered steps, and H=[H]\𝒪H\mathcal{E}_{H}=[H]\backslash\mathcal{O}_{H} to denote the set of even-numbered steps. Let H0=|𝒪H|=|H|=H/2H_{0}=|\mathcal{O}_{H}|=|\mathcal{E}_{H}|=H/2. We first note that for j[2]j\in[2], agent jj’s value under the mixture policy σ¯\overline{\sigma} is given as follows:

Vjσ¯=1THt=1Th𝒪H𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2].\displaystyle V_{j}^{\overline{\sigma}}=\frac{1}{TH}\sum_{t=1}^{T}\sum_{h\in\mathcal{O}_{H}}\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t)}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t)}}}\left[(M_{j})_{a_{1},a_{2}}\right].

For each j[2]j\in[2], we will derive a contradiction by constructing a (non-Markov) deviation policy for player jj in 𝒢\mathcal{G}, denoted πjΠjgen,det\pi_{j}^{\dagger}\in\Pi^{\mathrm{gen,det}}_{j}, which will give player jj a significant gain in value against the policy σ¯\overline{\sigma}. To do so, we need to specify πj,h(τj,h1,sh)𝒜j\pi_{j,h}^{\dagger}(\tau_{j,h-1},s_{h})\in\mathcal{A}_{j}, for all τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1} and sh𝒮s_{h}\in\mathcal{S}; note that we may restrict our attention only to histories τj,h01\tau_{j,h_{0}-1} that occur with positive probability under the transitions of 𝒢\mathcal{G}.

Fix any h0[H]h_{0}\in[H], τj,h01j,h01\tau_{j,h_{0}-1}\in\mathscr{H}_{j,h_{0}-1}, and sh0𝒮s_{h_{0}}\in\mathcal{S}. If τj,h01\tau_{j,h_{0}-1} occurs with positive probability under the transitions of 𝒢\mathcal{G}, then for each h𝒪Hh\in\mathcal{O}_{H} with h<h01h<h_{0}-1 and both j[2]j^{\prime}\in[2], the action played by agent jj^{\prime} at step hh is determined by τj,h01\tau_{j,h_{0}-1}. Namely, if the state at step h+1h+1 of τj,h01\tau_{j,h_{0}-1} is 𝔰(a1,a2)\mathfrak{s}_{(a_{1}^{\prime},a_{2}^{\prime})}, then player jj^{\prime} played action aja_{j^{\prime}}^{\prime} at step hh. So, for each h𝒪Hh\in\mathcal{O}_{H} with h<h01h<h_{0}-1, we may define (a1,h,a2,h)(a_{1,h},a_{2,h}) as the action profile played at step hh, which is a measurable function of τj,h01\tau_{j,h_{0}-1}. With this in mind, we define πj,h0(τj,h01,sh0)\pi_{j,h_{0}}^{\dagger}(\tau_{j,h_{0}-1},s_{h_{0}}) by applying Vovk’s aggregating algorithm (Proposition B.2) as follows.

  1. 1.

    If h0h_{0} is even, play an arbitrary action (note that the actions at even-numbered steps have no influence on the transitions or rewards).

  2. 2.

    If h0h_{0} is odd, define q^j,h0Δ(𝒜j)\widehat{q}_{j,h_{0}}\in\Delta(\mathcal{A}_{-j}) by q^j,h0:=𝔼tq~j,h0[pj,h0(t)]\widehat{q}_{j,h_{0}}:=\mathbb{E}_{t\sim\widetilde{q}_{j,h_{0}}}[p_{-j,h_{0}}^{\scriptscriptstyle{(t)}}], where q~j,h0Δ([T])\widetilde{q}_{j,h_{0}}\in\Delta([T]) is defined as follows: for t[T]t\in[T],

    q~j,h0(t):=exp(h<h0:h𝒪Hlog(1pj,h(t)(aj,h)))t=1Texp(h<h0:h𝒪Hlog(1pj,h(t)(aj,h))).\displaystyle\widetilde{q}_{j,h_{0}}(t):=\frac{\exp\left(-\sum_{h<h_{0}:\ h\in\mathcal{O}_{H}}\log\left(\frac{1}{p^{\scriptscriptstyle{(t)}}_{-j,h}(a_{-j,h})}\right)\right)}{\sum_{t^{\prime}=1}^{T}\exp\left(-\sum_{h<h_{0}:\ h\in\mathcal{O}_{H}}\log\left(\frac{1}{p^{\scriptscriptstyle{(t^{\prime})}}_{-j,h}(a_{-j,h})}\right)\right)}.

    Note that q^j,h0\widehat{q}_{j,h_{0}} is a function of τj,h01\tau_{j,h_{0}-1} via the action profiles {(a1,h,a2,h)}h<h0:h𝒪H\left\{(a_{1,h},a_{2,h})\right\}_{h<h_{0}:h\in\mathcal{O}_{H}}; to simplify notation, we suppress this dependence.

  3. 3.

    Then for any state sh0𝒮s_{h_{0}}\in\mathcal{S}, define πj,h0(τj,h01,sh0)\pi_{j,h_{0}}^{\dagger}(\tau_{j,h_{0}-1},s_{h_{0}}) to be a best response to q^j,h0\widehat{q}_{j,h_{0}}, namely

    πj,h0(τj,h01,sh0):=argmaxaj𝒜j𝔼ajq^j,h0[Rj,h0(sh0,(a1,a2))]=argmaxaj𝒜j𝔼ajq^j,h0[(Mj)a1,a2].\displaystyle\pi_{j,h_{0}}^{\dagger}(\tau_{j,h_{0}-1},s_{h_{0}}):=\operatorname*{arg\,max}_{a_{j}\in\mathcal{A}_{j}}\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h_{0}}}\left[R_{j,h_{0}}(s_{h_{0}},(a_{1},a_{2}))\right]=\operatorname*{arg\,max}_{a_{j}\in\mathcal{A}_{j}}\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h_{0}}}\left[(M_{j})_{a_{1},a_{2}}\right]. (5)

Note that, for odd h0h_{0}, the distribution q^j,h0Δ(𝒜j)\widehat{q}_{j,h_{0}}\in\Delta(\mathcal{A}_{-j}) defined above can be viewed as an application of Vovk’s online aggregation algorithm at step (h0+1)/2(h_{0}+1)/2 in the following setting: the number of steps (TT, in the notation of Proposition B.2; note that TT plays a different role in the present proof) is H0=H/2H_{0}=H/2, the context space is 𝒪H\mathcal{O}_{H}, and the outcome space is 𝒜j\mathcal{A}_{-j} (here j-j denotes the index of the player who is not jj). There are TT experts p~(1),,p~(T)\widetilde{p}^{\scriptscriptstyle{(1)}},\ldots,\widetilde{p}^{\scriptscriptstyle{(T)}} (i.e., we have ={p~(t)}t[T]\mathcal{I}=\left\{\widetilde{p}^{\scriptscriptstyle(t)}\right\}_{t\in[T]}), whose predictions on a context h𝒪Hh\in\mathcal{O}_{H} are defined as follows: the expert p~(t)\widetilde{p}^{\scriptscriptstyle{(t)}} predicts p~(t)(h):=pj,h(t)\widetilde{p}^{\scriptscriptstyle{(t)}}(h):=p_{-j,h}^{\scriptscriptstyle{(t)}}. Then, the distribution q^j,h0\widehat{q}_{j,h_{0}} is obtained by updating the aggregation algorithm with the context-observation pairs (h,aj,h)(h,a_{-j,h}), for odd values of h<h0h<h_{0}.
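To make the construction concrete, here is a small Python sketch (ours; the array names are illustrative) of a single odd step of this deviation policy: the aggregation weights are computed from the opponent's past actions, and a best response to the aggregated prediction is returned as in (5).

```python
# Sketch of the deviation at an odd step: player j aggregates the T candidate
# opponent marginals {p_{-j,h}^{(t)}} via exponential weights on the opponent's
# past actions, then best-responds to the aggregated prediction.
import numpy as np

def deviation_action(Mj, opp_marginals, past_opp_actions):
    """Mj: (A, A) payoff matrix of player j, indexed (own action, opponent action);
    opp_marginals: (T, H0, A) array with opp_marginals[t, k] = p_{-j, 2k+1}^{(t)};
    past_opp_actions: the opponent's actions at the odd steps played so far."""
    T = opp_marginals.shape[0]
    cum_log_loss = np.zeros(T)
    for k, a in enumerate(past_opp_actions):          # log loss of "expert" t so far
        cum_log_loss -= np.log(opp_marginals[:, k, a])
    w = np.exp(cum_log_loss.min() - cum_log_loss)     # exponential weights over t
    k_now = len(past_opp_actions)                     # index of the current odd step
    q_hat = (w / w.sum()) @ opp_marginals[:, k_now]   # aggregated prediction of the opponent's play
    return int(np.argmax(Mj @ q_hat))                 # best response, as in (5)
```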

We next analyze the value of Vjπj,σ¯jV_{j}^{\pi_{j}^{\dagger},\overline{\sigma}_{-j}} for j[2]j\in[2] to show that the deviation strategy we have defined indeed obtains significant gain. To do so, recall that this value represents the payoff for player jj under the process in which we draw an index t[T]t^{\star}\in\left[T\right] uniformly at random, then for each step h[H]h\in[H], player jj plays according to πj\pi_{j}^{\dagger} and player j-j plays according to σj(t)\sigma_{-j}^{\scriptscriptstyle{(t^{\star})}}. (In particular, at odd-numbered steps, player j-j plays according to pj,h(t)p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}.) We recall that 𝔼πj×σ¯j[]\mathbb{E}_{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}\left[\cdot\right] denotes the expectation under this process. We let τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1} denote the random variable which is the history observed by player jj in this setup, i.e., when the policy played is πj×σ¯j\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}, and let {(a1,h,a2,h)}h𝒪H\left\{(a_{1,h},a_{2,h})\right\}_{h\in\mathcal{O}_{H}} denote the action profiles for odd rounds, which are a measurable function of each player’s trajectory.

We apply Proposition B.2 with the time horizon as H0H_{0}, and with the set of experts set to :={p~(1),,p~(T)}\mathcal{I}\vcentcolon={}\{\widetilde{p}^{\scriptscriptstyle{(1)}},\ldots,\widetilde{p}^{\scriptscriptstyle{(T)}}\} as defined above. The context sequence is the sequence of increasing values of h𝒪Hh\in\mathcal{O}_{H}, and for each h𝒪Hh\in\mathcal{O}_{H}, the outcome at step (h+1)/2(h+1)/2 (for which the context is hh) is distributed as aj,hp~(t)(h)=pj,h(t)a_{-j,h}\sim\widetilde{p}^{\scriptscriptstyle{(t^{\star})}}(h)=p^{\scriptscriptstyle{(t^{\star})}}_{-j,h} conditioned on tt^{\star}, which in particular satisfies the realizability assumption stated in Proposition B.2. Then, since (as remarked above) the distributions q^j,h\widehat{q}_{j,h}, for h𝒪Hh\in\mathcal{O}_{H}, are exactly the predictions made by Vovk’s aggregating algorithm, Proposition B.2 gives that (in fact, a similar bound holds uniformly for each possible realization of tt^{\star}, but Equation 6 suffices for our purposes)

𝔼πj×σ¯j[h𝒪HD𝖳𝖵(q^j,h,pj,h(t))]=𝔼πj×σ¯j[h𝒪HD𝖳𝖵(q^j,h,p~(t)(h))]H0logT.\displaystyle\mathbb{E}_{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}\left[\sum_{h\in\mathcal{O}_{H}}D_{\mathsf{TV}}({\widehat{q}_{j,h}},{p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}})\right]=\mathbb{E}_{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}\left[\sum_{h\in\mathcal{O}_{H}}D_{\mathsf{TV}}({\widehat{q}_{j,h}},{\widetilde{p}^{\scriptscriptstyle{(t^{\star})}}(h)})\right]\leq\sqrt{H_{0}\log T}. (6)

Recall that we have assumed for the sake of contradiction that p1,h(t)×p2,h(t)p_{1,h}^{\scriptscriptstyle{(t)}}\times p_{2,h}^{\scriptscriptstyle{(t)}} is not an ϵ0\epsilon_{0}-Nash equilibrium of GG for each odd h[H]h\in[H] and t[T]t\in[T]. Consider a fixed draw of the random variable t[T]t^{\star}\in[T] defined above. Then it holds that for j[2]j\in[2] and h𝒪Hh\in\mathcal{O}_{H}, defining

ϵ0,j,h:=maxaj[A]𝔼ajpj,h(t)[(Mj)a1,a2]𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2],\displaystyle\epsilon_{0,j,h}:=\max_{a_{j}\in[A]}\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{1},a_{2}}\right]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t^{\star})}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{1},a_{2}}\right], (7)

we have ϵ0,1,h+ϵ0,2,hϵ0\epsilon_{0,1,h}+\epsilon_{0,2,h}\geq\epsilon_{0}. Consider any j[2]j\in[2], h𝒪Hh\in\mathcal{O}_{H}, and a history τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1} of agent jj up to step h1h-1 (conditioned on tt^{\star}). Let us write δj,h(t):=D𝖳𝖵(pj,h(t),q^j,h)\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}:=D_{\mathsf{TV}}({p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}},{\widehat{q}_{j,h}}); note that δj,h(t)\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}} is a function of τj,h1\tau_{j,h-1}, through its dependence on q^j,h\widehat{q}_{j,h}. We have, by the definition of πj,h(τj,h1,sh)\pi_{j,h}^{\dagger}(\tau_{j,h-1},s_{h}) in (5) and the definition of δj,h(t)\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}},

𝔼ajpj,h(t)[(Mj)πh,j(τj,h1,𝔰),aj|t,τj,h1]\displaystyle\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{\pi_{h,j}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]\geq 𝔼ajq^j,h[(Mj)πh,j(τj,h1,𝔰),aj|t,τj,h1]δj,h(t)\displaystyle\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h}}\left[(M_{j})_{\pi_{h,j}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]-\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}
=\displaystyle= maxaj[A]𝔼ajq^j,h[(Mj)aj,aj|t,τj,h1]δh,j(t)\displaystyle\max_{a_{j}\in[A]}\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h}}\left[(M_{j})_{a_{j},a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]-\delta_{h,-j}^{\scriptscriptstyle{(t^{\star})}}
\displaystyle\geq maxaj[A]𝔼ajph,j(t)[(Mj)aj,aj]2δj,h(t).\displaystyle\max_{a_{j}\in[A]}\mathbb{E}_{a_{-j}\sim p_{h,-j}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{j},a_{-j}}\right]-2\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}. (8)

Combining Equation 7 and Equation 8, we get that for any fixed h𝒪Hh\in\mathcal{O}_{H}, j[2]j\in[2], and τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1},

𝔼ajpj,h(t)[(Mj)πj,h(τj,h1,𝔰),aj|t,τj,h1]𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2]>ϵ0,j,h2δj,h(t).\displaystyle\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{\pi_{j,h}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t^{\star})}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{1},a_{2}}\right]>\epsilon_{0,j,h}-2\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}. (9)

Averaging over the draw of t[T]t^{\star}\in[T], which we recall is chosen uniformly, we see that

j[2]Vjπj×σ¯jVjσ¯\displaystyle\sum_{j\in[2]}V_{j}^{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}-V_{j}^{\overline{\sigma}}
=\displaystyle= 1Tt=1Tj[2]Vjπj×σj(t)Vjσ(t)\displaystyle\frac{1}{T}\sum_{t=1}^{T}\sum_{j\in[2]}V_{j}^{\pi_{j}^{\dagger}\times\sigma^{\scriptscriptstyle{(t)}}_{-j}}-V_{j}^{\sigma^{\scriptscriptstyle{(t)}}} (10)
=\displaystyle= 1Tt=1Tj[2]𝔼πj×σj(t)[h𝒪H𝔼ajpj,h(t)[Rj,h(𝔰,(πj,h(τj,h1,𝔰),aj))|t,τj,h1]𝔼a1p1,h(t),a2p2,h(t)[Rj,h(𝔰,(a1,a2))]]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\sum_{j\in[2]}\mathbb{E}_{\pi_{j}^{\dagger}\times\sigma_{-j}^{\scriptscriptstyle{(t)}}}\left[\sum_{h\in\mathcal{O}_{H}}\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t)}}}[R_{j,h}(\mathfrak{s},(\pi_{j,h}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}))\ |\ t,\ \tau_{j,h-1}]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t)}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t)}}}[R_{j,h}(\mathfrak{s},(a_{1},a_{2}))]\right]
=\displaystyle= 1THt=1Tj[2]𝔼πj×σj(t)[h𝒪H𝔼ajpj,h(t)[(Mj)πj,h(τj,h1,𝔰),aj|t,τj,h1]𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2]]\displaystyle\frac{1}{TH}\sum_{t=1}^{T}\sum_{j\in[2]}\mathbb{E}_{\pi_{j}^{\dagger}\times\sigma_{-j}^{\scriptscriptstyle{(t)}}}\left[\sum_{h\in\mathcal{O}_{H}}\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t)}}}[(M_{j})_{\pi_{j,h}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t,\ \tau_{j,h-1}]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t)}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t)}}}[(M_{j})_{a_{1},a_{2}}]\right]
\displaystyle\geq 1THt=1Tj[2]𝔼πj×σj(t)[h𝒪H(ϵ0,j,h2δj,h(t))]\displaystyle\frac{1}{TH}\sum_{t=1}^{T}\sum_{j\in[2]}\mathbb{E}_{\pi_{j}^{\dagger}\times\sigma_{-j}^{\scriptscriptstyle{(t)}}}\left[\sum_{h\in\mathcal{O}_{H}}\left(\epsilon_{0,j,h}-2\delta_{-j,h}^{\scriptscriptstyle{(t)}}\right)\right] (11)
\displaystyle\geq ϵ022THt=1T2H0logTϵ024log(T)/H,\displaystyle\frac{\epsilon_{0}}{2}-\frac{2}{TH}\sum_{t=1}^{T}2\sqrt{H_{0}\log T}\geq\frac{\epsilon_{0}}{2}-4\sqrt{\log(T)/H}, (12)

where (10) follows from the definition σ¯=1Tt=1T𝕀σ(t)\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}}, (11) follows from (9), and (12) uses (6). As long as T<exp(H(ϵ0/16)2)T<\exp(H\cdot(\epsilon_{0}/16)^{2}), this expression is bounded below by ϵ0/4\epsilon_{0}/4, meaning that σ¯\overline{\sigma} is not an ϵ0/4\epsilon_{0}/4-approximate CCE. This completes the contradiction. ∎

Appendix C Proofs of lower bounds for SparseCCE (Sections 4 and 5)

In this section we prove our computational lower bounds for solving the SparseCCE problem with m=3m=3 players (Theorem 4.3 and Corollary 4.4), as well as our statistical lower bound for solving the SparseCCE problem with a general number mm of players (Theorem 5.2).

Both theorems are proven as consequences of a more general result given in Theorem C.1 below, which reduces the Nash problem in mm-player normal-form games to the SparseCCE problem in (m+1)(m+1)-player Markov games. In more detail, the theorem shows that (a) if an algorithm for SparseCCE makes few calls to a generative model oracle, then we get an algorithm for the Nash problem with few calls to a payoff oracle (see Section A.2 for background on the payoff oracle for the Nash problem), and (b) if the algorithm for SparseCCE is computationally efficient, then so is the algorithm for the Nash problem.

Theorem C.1.

There is a constant C0>0C_{0}>0 so that the following holds. Consider n,mn,m\in\mathbb{N}, and suppose T,N,QT,N,Q\in\mathbb{N} and ϵ>0\epsilon>0 satisfy 1<T<exp(ϵ2n/mm2)1<T<\exp\left(\frac{\epsilon^{2}\cdot\lfloor n/m\rfloor}{m^{2}}\right). Suppose there is an algorithm \mathscr{B} which, given a generative model oracle for a (m+1)(m+1)-player Markov game 𝒢\mathcal{G} with |𝒢|n|\mathcal{G}|\leq n, solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem for 𝒢\mathcal{G} using QQ generative model oracle queries. Then the following conclusions hold:

  • For any δ>0\delta>0, the mm-player (n/m,16(m+1)ϵ)(\lfloor n/m\rfloor,16(m+1)\cdot\epsilon)-Nash problem for any normal-form game GG can be solved, with failure probability δ\delta, using at most C0(Qlog(1/δ))+(log(1/δ)nm/ϵ)C0C_{0}\cdot(Q\cdot\log(1/\delta))+(\log(1/\delta)\cdot nm/\epsilon)^{C_{0}} queries to a payoff oracle 𝒪G\mathcal{O}_{G} for GG.

  • If the algorithm \mathscr{B} additionally runs in time UU for some UU\in\mathbb{N}, then the algorithm solving Nash from the previous bullet point runs in time (nmTNUlog(1/δ)/ϵ)C0(nmTNU\log(1/\delta)/\epsilon)^{C_{0}}.

Theorem 4.3 follows directly from Theorem C.1 by taking m=2m=2.

Proof of Theorem 4.3.

Suppose there is an algorithm which, given the description of any 3-player Markov game 𝒢\mathcal{G} with |𝒢|n|\mathcal{G}|\leq n, solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem in time UU. Such an algorithm immediately yields an algorithm which can solve the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem in time U+|𝒢|O(1)U+|\mathcal{G}|^{O(1)} using only a generative model oracle, since the exact description of the Markov game can be obtained with HS|𝒜|HS(maxiAi)3|𝒢|5HS|\mathcal{A}|\leq HS(\max_{i}A_{i})^{3}\leq|\mathcal{G}|^{5} queries to the generative model (across all (h,s,𝐚)(h,s,\mathbf{a}) tuples). We can now solve the problem of computing a 50ϵ50\cdot\epsilon-Nash equilibrium of a given 2-player n/2\lfloor n/2\rfloor-action normal form game GG as follows. We simply apply the algorithm of Theorem C.1 with m=2m=2, noting that the oracle 𝒪G\mathcal{O}_{G} in the theorem statement can be implemented by reading the corresponding bits of input of the input game GG. The second bullet point yields that this algorithm takes time (nTNUlog(1/δ)/ϵ)C0(nTNU\log(1/\delta)/\epsilon)^{C_{0}}, for some constant C0C_{0}. Furthermore, the assumption T<exp(ϵ2n/m/m2)T<\exp(\epsilon^{2}\cdot\lfloor n/m\rfloor/m^{2}) of Theorem C.1 is implied by the assumption that T<exp(ϵ2n/16)T<\exp(\epsilon^{2}n/16) of Theorem 4.3. ∎

In a similar manner, Theorem 5.2 follows from Theorem C.1 by applying Theorem A.2, which states that there is no randomized algorithm that finds approximate Nash equilibria of mm-player, 2-action normal form games in time 2o(m)2^{o(m)}.

Proof of Theorem 5.2.

Let ϵ0\epsilon_{0} be the constant from Theorem A.2, and consider any m3m\geq 3. Suppose there is an algorithm which, for any mm-player Markov game 𝒢\mathcal{G} with |𝒢|2m6|\mathcal{G}|\leq 2m^{6}, makes QQ oracle queries to a generative model oracle for 𝒢\mathcal{G}, and solves the (T,ϵ0/(10m),N)(T,\epsilon_{0}/(10m),N)-SparseCCE problem for 𝒢\mathcal{G} for some T,NT,N\in\mathbb{N} so that T<exp(cm)T<\exp(cm), for a sufficiently small absolute constant cc. Then, by Theorem C.1 with ϵ=ϵ0/(10m)\epsilon=\epsilon_{0}/(10m) and n=m6n=m^{6} (which ensures that T<exp((ϵ0/(10m))2n/m/m2)T<\exp((\epsilon_{0}/(10m))^{2}\cdot\lfloor n/m\rfloor/m^{2}) as long as cc is sufficiently small), there is an algorithm which solves the (m5,ϵ0)(m^{5},\epsilon_{0})-Nash problem—and thus the (2,ϵ0)(2,\epsilon_{0})-Nash problem—for (m1)(m-1)-player games with failure probability 1/31/3, using O(Q)+mO(1)O(Q)+m^{O(1)} queries to a payoff oracle. But by Theorem A.2, any such algorithm requires 2Ω(m)2^{\Omega(m)} queries to a payoff oracle. It follows that Q2Ω(m)Q\geq 2^{\Omega(m)}, as desired. ∎

C.1 Proof of Theorem C.1

Proof of Theorem C.1.

Fix any m2m\geq 2, nn\in\mathbb{N}. Suppose we are given an algorithm \mathscr{B} that solves the (m+1)(m+1)-player (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem for Markov games 𝒢\mathcal{G} satisfying |𝒢|n|\mathcal{G}|\leq n, running in time UU and using at most QQ generative model queries. We proceed to describe an algorithm which solves the mm-player (n/m,16(m+1)ϵ)(\lfloor n/m\rfloor,16(m+1)\cdot\epsilon)-Nash problem using C0(Qlog(1/δ))+(log(1/δ)nm/ϵ)C0C_{0}\cdot(Q\cdot\log(1/\delta))+(\log(1/\delta)\cdot nm/\epsilon)^{C_{0}} queries to a payoff oracle, and running in time (nmTNUlog(1/δ)/ϵ)C0(nmTNU\log(1/\delta)/\epsilon)^{C_{0}}, where δ\delta represents the failure probability. Define n0:=n/mn_{0}:=\lfloor n/m\rfloor, and assume we are given an arbitrary mm-player n0n_{0}-action normal form GG, which is specified by payoff matrices M1,,Mm[0,1]n0××n0M_{1},\ldots,M_{m}\in[0,1]^{n_{0}\times\cdots\times n_{0}}. We assume that all entries of each of the matrices MjM_{j} have only the most significant max{n0,log1/ϵ}\max\{n_{0},\lceil\log 1/\epsilon\rceil\} bits nonzero; this assumption is without loss of generality, since by truncating the utilities to satisfy this assumption, we change all payoffs by at most ϵ\epsilon, which degrades the quality of any approximate equilibrium by at most 2ϵ2\epsilon (in addition, we have log1/ϵn0\lceil\log 1/\epsilon\rceil\leq n_{0} since we have assumed 1<T<exp(ϵ2n0/m2)1<T<\exp(\epsilon^{2}n_{0}/m^{2})). We assume ϵ1/2\epsilon\leq 1/2 without loss of generality. Based on GG, we construct an (m+1)(m+1)-player Markov game 𝒢:=𝒢(G)\mathcal{G}:=\mathcal{G}(G) as follows.

Definition C.2.

We define the Markov game 𝒢(G)\mathcal{G}(G) as the tuple 𝒢(G)=(𝒮,H,(𝒜i)i[m+1],,(Ri)i[m+1],μ)\mathcal{G}(G)=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m+1]},\mathbb{P},(R_{i})_{i\in[m+1]},\mu), where:

  • The horizon of 𝒢\mathcal{G} is chosen to be the power of 22 satisfying n0H<2n0n_{0}\leq H<2n_{0}.

  • Let A:=n0A\vcentcolon=n_{0}. The action spaces of agents 1,2,,m1,2,\ldots,m are given by 𝒜1==𝒜m=[A]\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=[A]. The action space of agent m+1m+1 is

    𝒜m+1={(j,aj):j[m],aj𝒜j},\displaystyle\mathcal{A}_{m+1}=\{(j,a_{j})\ :\ j\in[m],a_{j}\in\mathcal{A}_{j}\},

    so that |𝒜m+1|=Amn|\mathcal{A}_{m+1}|=Am\leq n.

    We write 𝒜=j=1m𝒜j\mathcal{A}=\prod_{j=1}^{m}\mathcal{A}_{j} to denote the joint action space of the first mm agents, and 𝒜¯:=j=1m+1𝒜j\overline{\mathcal{A}}:=\prod_{j=1}^{m+1}\mathcal{A}_{j} to denote the joint action space of all agents.

  • There is a single state, denoted by 𝔰\mathfrak{s}, i.e., 𝒮={𝔰}\mathcal{S}=\{\mathfrak{s}\} (in particular, μ\mu is a singleton distribution supported on 𝔰\mathfrak{s}).

  • For all h[H]h\in[H], the reward for agent j[m+1]j\in[m+1], given an action profile 𝐚=(a1,,am+1)\mathbf{a}=(a_{1},\ldots,a_{m+1}) at the unique state 𝔰\mathfrak{s}, is as follows: writing am+1=(j,aj)a_{m+1}=(j^{\prime},a_{j^{\prime}}^{\prime}), we have

    Rj,h(𝔰,𝐚)=R¯j,h(𝔰,𝐚)+1H23log1/ϵ𝖾𝗇𝖼(𝐚),\displaystyle R_{j,h}(\mathfrak{s},\mathbf{a})=\overline{R}_{j,h}(\mathfrak{s},\mathbf{a})+\frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}\cdot\mathsf{enc}(\mathbf{a}), (13)

    where R¯j,h(𝔰,𝐚)\overline{R}_{j,h}(\mathfrak{s},\mathbf{a}) is defined per the kibitzer construction of [BCI+08]:

    R¯j,h(𝔰,𝐚):={0:j{j,m+1}1H((Mj)a1,,am(Mj)a1,,aj,,am):j=j1H((Mj)a1,,aj,,am(Mj)a1,,am):j=m+1.\displaystyle\overline{R}_{j,h}(\mathfrak{s},\mathbf{a}):=\begin{cases}0&:j\not\in\{j^{\prime},m+1\}\\ \frac{1}{H}\cdot\left((M_{j})_{a_{1},\ldots,a_{m}}-(M_{j})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}\right)&:j=j^{\prime}\\ \frac{1}{H}\cdot\left((M_{j})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}-(M_{j})_{a_{1},\ldots,a_{m}}\right)&:j=m+1.\end{cases} (14)

    In (13) above, 𝖾𝗇𝖼(𝐚)[0,1]\mathsf{enc}(\mathbf{a})\in[0,1] is a real number whose binary expansion encodes the action profile 𝐚\mathbf{a}. In particular, if the binary encoding of 𝐚\mathbf{a} is (b1,,bN)(b_{1},\ldots,b_{N}), with bi{0,1}b_{i}\in\{0,1\}, then 𝖾𝗇𝖼(𝐚)=i=1N2ibi\mathsf{enc}(\mathbf{a})=\sum_{i=1}^{N}2^{-i}\cdot b_{i}. Note that 𝖾𝗇𝖼(𝐚)\mathsf{enc}(\mathbf{a}) takes N=O(mlogn0)O(mlogn)N=O(m\log n_{0})\leq O(m\log n) bits to specify; a small illustrative sketch of this reward construction is given below.
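The following Python sketch (ours) illustrates the reward construction (13)–(14). For brevity it packs the action profile using base-|𝒜̄| digits rather than the exact bit layout of enc(𝐚), and uses 0-indexed players and actions, so it should be read as an illustration of the construction rather than a faithful implementation.

```python
# Sketch of the kibitzer reward: enc_profile packs the action profile into low-order
# digits, and the base reward pays the kibitzer the gain of its suggested swap while
# charging it to the named player.
import math

def enc_profile(profile, base):
    """Pack a profile of nonnegative integers (each < base) into a number in [0, 1)."""
    x, scale = 0.0, 1.0
    for a in profile:
        scale /= base
        x += a * scale
    return x

def reward(j, actions, suggestion, payoffs, H, eps):
    """j: player index in 0..m, where j == m plays the role of player m+1 (the kibitzer);
    actions: tuple (a_1, ..., a_m) of the first m players' actions (0-indexed);
    suggestion: the kibitzer's action (j_prime, a_swap);
    payoffs[i]: an m-dimensional numpy array standing in for the payoff tensor of player i+1."""
    m = len(actions)
    j_prime, a_swap = suggestion
    swapped = list(actions)
    swapped[j_prime] = a_swap
    if j == j_prime:        # the named player: actual payoff minus counterfactual payoff
        base_reward = (payoffs[j][actions] - payoffs[j][tuple(swapped)]) / H
    elif j == m:            # the kibitzer: the negated difference
        base_reward = (payoffs[j_prime][tuple(swapped)] - payoffs[j_prime][actions]) / H
    else:
        base_reward = 0.0
    A = payoffs[0].shape[0]
    low_order = 2.0 ** (-3 * math.ceil(math.log2(1 / eps)))   # the 2^{-3*ceil(log 1/eps)} factor
    full_profile = actions + (j_prime * A + a_swap,)          # include the kibitzer's action
    return base_reward + low_order * enc_profile(full_profile, m * A) / H
```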

1:Input:
2: Parameters n,n0,m,Tn,n_{0},m,T\in\mathbb{N}, δ=ϵ/(6H)\delta=\epsilon/(6H), K=4log(mn0/δ)/ϵ2K=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil.
3: An mm-player, n0n_{0}-action normal form game GG, with utilities accessible by oracle 𝒪G\mathcal{O}_{G}.
4: An algorithm \mathscr{B} for computing approximate CCE of Markov games.
5:Call the algorithm \mathscr{B} on the (m+1)(m+1)-player Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) constructed as in Definition C.2, which produces a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}, where each σ(t)=(σ1(t),,σm+1(t))\sigma^{\scriptscriptstyle{(t)}}=(\sigma^{\scriptscriptstyle{(t)}}_{1},\ldots,\sigma^{\scriptscriptstyle{(t)}}_{m+1}) with σj(t)Πjgen,rnd\sigma^{\scriptscriptstyle{(t)}}_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j}. Here, we use the oracle 𝒪G\mathcal{O}_{G} to simulate generative model oracle queries made by \mathscr{B}.
6:Draw t[T]t^{\star}\in[T] uniformly at random.
7:For each j[m]j\in[m], initialize τj,0\tau_{j,0} to be an empty trajectory.
8:for h[H]h\in[H] do \triangleright Simulate a trajectory from 𝒢\mathcal{G}
9:     Set sh=𝔰s_{h}=\mathfrak{s} (per the transitions of 𝒢\mathcal{G}).
10:      For each j[m]j\in[m], define q^j,h:=𝔼tq~j,h[σj,h(t)(τj,h1,sh)]Δ(𝒜j)\widehat{q}_{j,h}:=\mathbb{E}_{t\sim\widetilde{q}_{j,h}}\left[\sigma_{j,h}^{\scriptscriptstyle{(t)}}(\tau_{j,h-1},s_{h})\right]\in\Delta(\mathcal{A}_{j}), where q~j,hΔ([T])\widetilde{q}_{j,h}\in\Delta([T]) is defined as follows: for t[T]t\in[T],
q~j,h(t):=exp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg)))t=1Texp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg))).\displaystyle\widetilde{q}_{j,h}(t):=\frac{\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t)}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}{\sum_{t^{\prime}=1}^{T}\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t^{\prime})}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}.
11:     Draw KK i.i.d. samples 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}.
12:      For each a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, define R^m+1,h(a):=1Kk=1KRm+1,h(sh,(𝐚hk,a))\widehat{R}_{m+1,h}(a^{\prime}):=\frac{1}{K}\sum_{k=1}^{K}R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})). Here, we use the oracle 𝒪G\mathcal{O}_{G} to compute Rm+1,h(sh,(𝐚hk,a))R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})) for each tuple (𝐚hk,a)(\mathbf{a}_{h}^{k},a^{\prime}).
13:     For each j[m]j\in[m], draw aj,hσj,h(t)(|τj,h1,sh)a_{j,h}\sim\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\cdot|\tau_{j,h-1},s_{h}).
14:      Choose the action am+1,ha_{m+1,h} of player m+1m+1 as follows: (Action am+1,ha_{m+1,h} corresponds to the action selected by the policy πm+1\pi_{m+1}^{\dagger} of player m+1m+1 defined within the proof of Lemma C.3; this policy is well-defined because the action profiles of all players i[m]i\in[m] can be extracted from the lower-order bits of player m+1m+1’s reward)
am+1,h:=argmaxa𝒜m+1{R^m+1,h(a)}.\displaystyle a_{m+1,h}:=\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}. (15)
15:      For each j[m+1]j\in[m+1], let rj,h=Rj,h(sh,(a1,h,,am+1,h))r_{j,h}=R_{j,h}(s_{h},(a_{1,h},\ldots,a_{m+1,h})).
16:      Each player jj constructs τj,h\tau_{j,h} by updating τj,h1\tau_{j,h-1} with (sh,aj,h,rj,h)(s_{h},a_{j,h},r_{j,h}).
17:     if R^m+1,h(am+1,h)14(m+1)ϵ/H\widehat{R}_{m+1,h}(a_{m+1,h})\leq 14(m+1)\cdot\epsilon/H then return q^h:=×j[m]q^j,h\widehat{q}_{h}:=\bigtimes_{j\in[m]}\widehat{q}_{j,h} as a candidate approximate Nash equilibrium for GG.      
18:if the for loop terminates without returning: return fail.
Algorithm 2 Algorithm to compute Nash equilibrium used in proof of Theorem C.1.

It is evident that this construction takes polynomial time and satisfies |𝒢|mn0n|\mathcal{G}|\leq mn_{0}\leq n. Furthermore, it is clear that a single generative model oracle call for the Markov game 𝒢\mathcal{G} (per Definition 5.1) can be implemented using at most 2 calls to the oracle 𝒪G\mathcal{O}_{G} for the normal-form game GG. We will now show that, by applying the algorithm \mathscr{B} to 𝒢\mathcal{G}, we can efficiently (in terms of runtime and oracle calls) compute a 16(m+1)ϵ16(m+1)\cdot\epsilon-approximate Nash equilibrium for the original game GG. To do so, we appeal to Algorithm 2.

Algorithm 2 proceeds as follows. First, it calls the algorithm \mathscr{B} on the (m+1)(m+1)-player Markov game 𝒢(G)\mathcal{G}(G), using the oracle 𝒪G\mathcal{O}_{G} to simulate \mathscr{B}’s calls to the generative model oracle for 𝒢\mathcal{G}. By assumption, the algorithm \mathscr{B} returns a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} of product policies of the form σ(t)=(σ1(t),,σm+1(t))\sigma^{\scriptscriptstyle{(t)}}=(\sigma^{\scriptscriptstyle{(t)}}_{1},\ldots,\sigma^{\scriptscriptstyle{(t)}}_{m+1}), so that each σj(t)Πjgen,rnd\sigma^{\scriptscriptstyle{(t)}}_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j} is NN-computable, and so that the average σ¯:=1Tt=1T𝕀σ(t)\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}. Next, Algorithm 2 samples a trajectory from 𝒢\mathcal{G} in which:

  • Players 1,,m1,\ldots,m each play according to a policy σ(t)\sigma^{\scriptscriptstyle{(t^{\star})}} for an index t[T]t^{\star}\in[T] chosen uniformly at the start of the episode.

  • Player m+1m+1 plays according to a strategy that, at each step h[H]h\in[H], computes distributions q^j,h\widehat{q}_{j,h} representing its “belief” of what action each player j[m]j\in[m] will play at step hh (Line 10), and plays an approximate best response to the product of the strategies q^j,h\widehat{q}_{j,h}, j[m]j\in[m] (Line 14).

In order to avoid exponential dependence on the number of players mm when computing an approximate best response to ×j[m]q^j,h\bigtimes_{j\in[m]}\widehat{q}_{j,h}, we draw K:=4log(mn0/δ)/ϵ2K:=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil (for δ=ϵ/(6H)\delta=\epsilon/(6H)) samples from ×j[m]q^j,h\bigtimes_{j\in[m]}\widehat{q}_{j,h} and use these samples to compute the best response. In particular, letting 𝐚hk𝒜\mathbf{a}_{h}^{k}\in\mathcal{A} denote the kkth sampled action profile, we construct a function R^m+1,h:𝒜m+1\widehat{R}_{m+1,h}:\mathcal{A}_{m+1}\rightarrow\mathbb{R} in Lines 11 and 12 which, for each a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, is defined as the average over samples {𝐚hk}k[K]\{\mathbf{a}_{h}^{k}\}_{k\in[K]} of the realized payoffs Rm+1,h(sh,(𝐚hk,a))R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})); note that to compute each such payoff, Algorithm 2 needs only two oracle calls to 𝒪G\mathcal{O}_{G}. A small illustrative sketch of this sampling step is given below.
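A Python sketch (ours, with illustrative names) of this sampling step; the callable reward_m1 stands in for the two oracle calls to 𝒪_G needed per payoff evaluation.

```python
# Sketch of the sampled best response in Lines 11-14 of Algorithm 2: draw K profiles
# from the product of the beliefs q_hat_{j,h}, estimate player (m+1)'s reward for
# every suggestion a', and return the empirical argmax.
import numpy as np

def sampled_best_response(q_hats, reward_m1, suggestions, K, rng):
    """q_hats: list of m probability vectors, q_hats[j] = belief over player (j+1)'s actions;
    reward_m1(actions, a_prime): player (m+1)'s reward R_{m+1,h}(s_h, (actions, a_prime));
    suggestions: the action set A_{m+1}, e.g. [(j, a) for j in range(m) for a in range(A)]."""
    samples = [tuple(rng.choice(len(q), p=q) for q in q_hats) for _ in range(K)]
    r_hat = {a_prime: float(np.mean([reward_m1(a, a_prime) for a in samples]))
             for a_prime in suggestions}
    best = max(r_hat, key=r_hat.get)
    return best, r_hat[best]        # the argmax in (15) and its estimated reward
```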

The following lemma, proven in the sequel, gives a correctness guarantee for Algorithm 2.

Lemma C.3 (Correctness of Algorithm 2).

Given any mm-player n0n_{0}-action normal form game GG, if the algorithm \mathscr{B} solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem for the game 𝒢(G)\mathcal{G}(G) with T,ϵ,NT,\epsilon,N satisfying Texp(n0ϵ2/m2)T\leq\exp(n_{0}\epsilon^{2}/m^{2}), then Algorithm 2 outputs a 16(m+1)ϵ16(m+1)\cdot\epsilon-approximate Nash equilibrium of GG with probability at least 1/31/3, and otherwise fails.

The assumption that T<exp(ϵ2n/mm2)T<\exp\left(\frac{\epsilon^{2}\cdot\lfloor n/m\rfloor}{m^{2}}\right) from the statement of Theorem C.1 yields that Texp(n0ϵ2/m2)T\leq\exp(n_{0}\epsilon^{2}/m^{2}), so Lemma C.3 yields that Algorithm 2 outputs a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG with probability at least 1/31/3 (and otherwise fails). By iterating Algorithm 2 O(log(1/δ))O(\log(1/\delta)) times, we may thus compute a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG with failure probability at most δ\delta.

We now analyze the oracle cost and computational cost of Algorithm 2. It takes 2Q2Q oracle calls to 𝒪G\mathcal{O}_{G} to simulate the QQ generative model oracle calls of \mathscr{B}, and therefore, if \mathscr{B} runs in time UU, then the call to \mathscr{B} on Line 5, using oracle calls to 𝒪G\mathcal{O}_{G} to simulate the generative model oracle calls, runs in time O(U)O(U). Next, the computations of q~j,h\widetilde{q}_{j,h} (and thus q^j,h\widehat{q}_{j,h}) in Line 10 can be performed in (nmTN)O(1)(nmTN)^{O(1)} time, the computation of R^m+1,h:𝒜m+1\widehat{R}_{m+1,h}:\mathcal{A}_{m+1}\rightarrow\mathbb{R} in Line 12 requires time (and oracle calls to 𝒪G\mathcal{O}_{G}) bounded above by O(|𝒜m+1|K)(nmlog(1/δ)/ϵ)O(1)O(|\mathcal{A}_{m+1}|\cdot K)\leq(nm\log(1/\delta)/\epsilon)^{O(1)}, constructing the actions aj,ha_{j,h} (for j[m+1]j\in[m+1]) in Lines 13 and 14 takes time (Nmn)O(1)(Nmn)^{O(1)} (using the fact that the policies σj,h(t)\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}} are NN-computable), and constructing the rewards rj,hr_{j,h} on Line 15 requires another 2(m+1)2(m+1) oracle calls to 𝒪G\mathcal{O}_{G}. Altogether, Algorithm 2 requires 2Q+(nmlog(1/δ)/ϵ)C02Q+(nm\log(1/\delta)/\epsilon)^{C_{0}} oracle calls to 𝒪G\mathcal{O}_{G} and, if \mathscr{B} runs in time UU, then Algorithm 2 takes time (nmTNUlog(1/δ)/ϵ)C0(nmTNU\log(1/\delta)/\epsilon)^{C_{0}}, for some absolute constant C0C_{0}.

Remark C.4 (Bit complexity of exponential weights updates).

In the above proof we have noted that q~j,h\widetilde{q}_{j,h} (as defined in Line 10 of Algorithm 2) can be computed in time (nmTN)O(1)(nmTN)^{O(1)}. A detail we do not handle formally is that, since the values of q~j,h(t)\widetilde{q}_{j,h}(t) are in general irrational, only the (nmTN)O(1)(nmTN)^{O(1)} most significant bits of each real number q~j,h(t)\widetilde{q}_{j,h}(t) can be computed in time (nmTN)O(1)(nmTN)^{O(1)}. To give a truly polynomial-time implementation of Algorithm 2, one can compute only the (nmTN)O(1)(nmTN)^{O(1)} most significant bits of each distribution q~j,h\widetilde{q}_{j,h}, which is sufficient to approximate the true value of q^j,h\widehat{q}_{j,h} to within exp((nmTN)O(1))\exp(-(nmTN)^{O(1)}) in total variation distance. Since q^j,h\widehat{q}_{j,h} only influences the subsequent execution of Algorithm 2 via the samples 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h} drawn in Line 11, by a union bound, the approximation of q^j,h\widehat{q}_{j,h} we have described perturbs the execution of the algorithm by at most O(KH)exp((nmTN)O(1))O(KH)\cdot\exp(-(nmTN)^{O(1)}) in total variation distance. In particular, the correctness guarantee of Lemma C.3 still holds, with success probability at least 1/3exp((nmTN)O(1))>1/41/3-\exp(-(nmTN)^{O(1)})>1/4.

It remains to prove Lemma C.3, which is the bulk of the proof of Theorem C.1.

Proof of Lemma C.3.

We will establish the following two facts:

  1. 1.

    First, the choices of am+1,ha_{m+1,h} in Line 14 (i.e., Eq. 15) of Algorithm 2 correspond to a valid policy πm+1Πgen,rnd\pi_{m+1}^{\dagger}\in{\Pi}^{\mathrm{gen,rnd}} for player m+1m+1 (representing a strategy for deviating from the equilibrium σ¯\overline{\sigma}), in that they can be expressed as a function of player (m+1)(m+1)’s history, (τm+1,h1,sh)(\tau_{m+1,h-1},s_{h}) at each step hh.

  2. 2.

    Second, we will show that, since σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}, the strategy πm+1\pi_{m+1}^{\dagger} cannot lead to a large increase of value for player m+1m+1, which will imply that Algorithm 2 must return a Nash equilibrium with high enough probability.

Defining πi\pi_{i}^{\dagger} for i[m+1]i\in[m+1].

We begin by constructing the policy πm+1\pi_{m+1}^{\dagger} described above; for later use in the proof, it will be convenient to construct a collection of closely related policies πiΠgen,rnd\pi_{i}^{\dagger}\in{\Pi}^{\mathrm{gen,rnd}} for i[m]i\in[m], also representing strategies for deviating from the equilibrium σ¯\overline{\sigma}.

Let i[m+1]i\in[m+1] be fixed. For h[H]h\in[H], the mapping πi,h:i,h1×𝒮𝒜i\pi_{i,h}^{\dagger}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\mathcal{A}_{i} is defined as follows. Given a history τi,h1=(s1,ai,1,ri,1,,sh1,ai,h1,ri,h1)i,h1\tau_{i,h-1}=(s_{1},a_{i,1},r_{i,1},\ldots,s_{h-1},a_{i,h-1},r_{i,h-1})\in\mathscr{H}_{i,h-1} (we assume without loss of generality that τi,h1\tau_{i,h-1} occurs with positive probability under some sequence of general policies) and a current state shs_{h}, we define πi,h(τi,h1,sh)𝒜i\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h})\in\mathcal{A}_{i} through the following process.

  1. 1.

    First, we claim that for all players j[m+1]{i}j\in[m+1]\setminus\{i\}, it is possible to extract the trajectory τj,h1\tau_{j,h-1} from the trajectory τi,h1\tau_{i,h-1} of player ii.

    1. (a)

      Recall that for each g<hg<h, from the definition in Equation 13 and the function 𝖾𝗇𝖼(𝐚)\mathsf{enc}(\mathbf{a}), the bits following position 3log1/ϵ3\lceil\log 1/\epsilon\rceil of the reward ri,gr_{i,g} given to player ii at step gg of the trajectory τi,g1\tau_{i,g-1} encode an action profile 𝐚g𝒜¯\mathbf{a}_{g}\in\overline{\mathcal{A}}. Since τi,h1\tau_{i,h-1} occurs with positive probability, this is precisely the action profile which was played by agents at step gg. Note we also use here that by definition of the rewards Rj,h(s,𝐚)R_{j,h}(s,\mathbf{a}) in (13), the component R¯j,h(s,𝐚)\overline{R}_{j,h}(s,\mathbf{a}) of the reward only affects the first 2log1/ϵ2\lceil\log 1/\epsilon\rceil bits.

    2. (b)

      For g<hg<h and j[m+1]\{i}j\in[m+1]\backslash\{i\}, define rj,g:=Rj,g(sg,𝐚g)r_{j,g}:=R_{j,g}(s_{g},\mathbf{a}_{g}).

    3. (c)

      For j[m+1]\{i}j\in[m+1]\backslash\{i\}, write τj,h1:=(s1,aj,1,rj,1,,sh1,aj,h1,rj,h1)\tau_{j,h-1}:=(s_{1},a_{j,1},r_{j,1},\ldots,s_{h-1},a_{j,h-1},r_{j,h-1}); in particular, τj,h1\tau_{j,h-1} is a deterministic function of (τi,h1,sh)(\tau_{i,h-1},s_{h}). (Note that, since τi,h1\tau_{i,h-1} occurs with positive probability, the history τj,h1\tau_{j,h-1} observed by player jj up to step h1h-1 can be computed from it via Steps (a) and (b)). Going forward, for g<h1g<h-1, we let τj,g\tau_{j,g} denote the prefix of τj,h1\tau_{j,h-1} up to step gg.

  2. 2.

    Now, using that player ii can compute all players’ trajectories, for each j[m+1]j\in[m+1] we define

    q^j,h:=𝔼tq~j,h[σj,h(t)(τj,h1,sh)]Δ(𝒜j),\displaystyle\widehat{q}_{j,h}:=\mathbb{E}_{t\sim\widetilde{q}_{j,h}}\left[\sigma_{j,h}^{\scriptscriptstyle{(t)}}(\tau_{j,h-1},s_{h})\right]\in\Delta(\mathcal{A}_{j}), (16)

    where q~j,hΔ([T])\widetilde{q}_{j,h}\in\Delta([T]) is defined as follows: for t[T]t\in[T],

    q~j,h(t):=exp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg)))t=1Texp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg))).\displaystyle\widetilde{q}_{j,h}(t):=\frac{\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t)}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}{\sum_{t^{\prime}=1}^{T}\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t^{\prime})}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}. (17)

    Note that q^j,h\widehat{q}_{j,h} is a random variable which depends on the trajectory (τj,h1,sh)(\tau_{j,h-1},s_{h}) (which can be computed from (τi,h1,sh)(\tau_{i,h-1},s_{h})). In addition, the definition of q^j,h\widehat{q}_{j,h} (for each j[m]j\in[m]) is exactly as is defined in Line 10 of Algorithm 2.

  3. 3.

    For i[m]i\in[m], define πi,h(τi,h1,sh)\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}) as follows:

    πi,h(τi,h1,sh):=argmaxa𝒜i𝔼𝐚i×jiq^j,h[Ri,h(sh,(a,𝐚i))].\displaystyle\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}):=\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}_{-i}\sim\bigtimes_{j\neq i}\widehat{q}_{j,h}}\left[R_{i,h}(s_{h},(a^{\prime},\mathbf{a}_{-i}))\right]. (18)

    For the case i=m+1i=m+1, define πm+1,h(τm+1,h1,sh)Δ(𝒜m+1)\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h})\in\Delta(\mathcal{A}_{m+1}) (implicitly) to be the following distribution over am+1,h𝒜m+1a_{m+1,h}^{\dagger}\in\mathcal{A}_{m+1}: draw 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}, define R^m+1,h(a):=1Kk=1KRm+1,h(sh,(𝐚hk,a))\widehat{R}_{m+1,h}(a^{\prime}):=\frac{1}{K}\sum_{k=1}^{K}R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})) for a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, and finally set

    am+1,h:=argmaxa𝒜m+1{R^m+1,h(a)}.\displaystyle a_{m+1,h}^{\dagger}:=\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}. (19)

    Note that, for each choice of (τm+1,h1,sh)(\tau_{m+1,h-1},s_{h}), the distribution πm+1,h(τm+1,h1,sh)\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h}) as defined above coincides with the distribution of the action am+1,ha_{m+1,h}^{\dagger} defined in Eq. 15 in Algorithm 2, when player m+1m+1’s history is τm+1,h1\tau_{m+1,h-1} and the state at step hh is shs_{h}. The following lemma, for use later in the proof, bounds the approximation error incurred in sampling 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}.

    Lemma C.5.

    Fix any (τm+1,h1,sh)m+1,h1×𝒮(\tau_{m+1,h-1},s_{h})\in\mathscr{H}_{m+1,h-1}\times\mathcal{S}. With probability at least 1δ1-\delta over the draw of 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}, it holds that for all a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1},

    |R^m+1,h(a)𝔼ajq^j,hj[m][Rm+1,h(sh,(a1,,am,a))]|ϵH,\displaystyle\left|\widehat{R}_{m+1,h}(a^{\prime})-\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a^{\prime}))]\right|\leq\frac{\epsilon}{H},

    which implies in particular that with probability at least 1δ1-\delta over the draw of am+1,hπm+1,h(τm+1,h1,sh)a_{m+1,h}^{\dagger}\sim\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h}),

    maxa𝒜m+1{𝔼ajq^j,hj[m][Rm+1,h(sh,(a1,,am,a))]}2ϵH\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a^{\prime}))]\right\}-\frac{2\epsilon}{H}\leq 𝔼ajq^j,hj[m][Rm+1,h(sh,(a1,,am,am+1,h))].\displaystyle\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a_{m+1,h}^{\dagger}))]. (20)
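    For intuition, Lemma C.5 is consistent with a standard Hoeffding-plus-union-bound calculation (this is our own back-of-the-envelope sketch, not the paper's proof, and constants are not tracked carefully):

```latex
% Each sampled reward lies in an interval of length at most (2+\epsilon^{3})/H \le 3/H,
% so Hoeffding's inequality gives, for each fixed a' \in \mathcal{A}_{m+1},
\Pr\left[\,\bigl|\widehat{R}_{m+1,h}(a') - \mathbb{E}\bigl[R_{m+1,h}(s_h,(a_1,\ldots,a_m,a'))\bigr]\bigr| > \tfrac{\epsilon}{H}\,\right]
   \;\le\; 2\exp\!\left(-\tfrac{2K(\epsilon/H)^{2}}{(3/H)^{2}}\right)
   \;=\; 2\exp\!\left(-\tfrac{2K\epsilon^{2}}{9}\right).
% A union bound over the |\mathcal{A}_{m+1}| = m n_{0} suggestions a' makes the total
% failure probability at most 2 m n_{0}\exp(-2K\epsilon^{2}/9) \le \delta whenever
K \;\gtrsim\; \frac{\log(m n_{0}/\delta)}{\epsilon^{2}},
% matching the choice K = \lceil 4\log(m n_{0}/\delta)/\epsilon^{2}\rceil in Algorithm 2 up to constants.
```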

It is immediate from our construction above that the following fact holds.

Lemma C.6.

The joint distribution of τj,h\tau_{j,h}, for j[m+1]j\in[m+1] and h[H]h\in[H], as computed by Algorithm 2, coincides with the distribution of τj,h\tau_{j,h} in an episode of 𝒢\mathcal{G} when players follow the policy πm+1×σ¯(m+1)\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}.

Analyzing the distributions q^j,h\widehat{q}_{j,h}.

Fix any i[m+1]i\in[m+1]. We next prove some facts about the distributions q^j,h\widehat{q}_{j,h} defined above (as a function of (τi,h1,sh)(\tau_{i,h-1},s_{h})) in the process of computing πi,h(τi,h1,sh)\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}).

For each h[H]h\in[H], consider any choice of (τi,h1,sh)i,h1×𝒮(\tau_{i,h-1},s_{h})\in\mathscr{H}_{i,h-1}\times\mathcal{S}; note that for each j[m+1]j\in[m+1], the distributions q^j,hΔ(𝒜j)\widehat{q}_{j,h}\in\Delta(\mathcal{A}_{j}) for h[H]h\in[H] may be viewed as an application of Vovk’s aggregating algorithm (Proposition B.2) in the following setting: the number of steps (TT, in the notation of Proposition B.2; note that TT has a different meaning in the present proof) is HH, the context space is h=1Hj,h1×𝒮\bigcup_{h=1}^{H}\mathscr{H}_{j,h-1}\times\mathcal{S}, and the outcome space is 𝒜j\mathcal{A}_{j}. The expert set is ={ρj(1),,ρj(T)}\mathcal{I}=\{\rho_{j}^{\scriptscriptstyle{(1)}},\ldots,\rho_{j}^{\scriptscriptstyle{(T)}}\} (which has ||=T\lvert\mathcal{I}\rvert=T), and the experts’ predictions on a context (τj,h1,s)j,h1×𝒮(\tau_{j,h-1},s)\in\mathscr{H}_{j,h-1}\times\mathcal{S} are defined via ρj(t)(|τj,h1,s):=σj,h(t)(|τj,h1,s)Δ(𝒜j)\rho_{j}^{\scriptscriptstyle{(t)}}(\cdot|\tau_{j,h-1},s):=\sigma_{j,h}^{\scriptscriptstyle{(t)}}(\cdot|\tau_{j,h-1},s)\in\Delta(\mathcal{A}_{j}). Then for each h[H]h\in[H], the distribution q^j,h\widehat{q}_{j,h} is obtained by updating the aggregating algorithm with the context-observation pairs ((τj,h1,sh),aj,h)((\tau_{j,h^{\prime}-1},s_{h^{\prime}}),a_{j,h^{\prime}}) for h=1,2,,h1h^{\prime}=1,2,\ldots,h-1.

In more detail, fix any t[T]t^{\star}\in[T] and j[m+1]j\in[m+1] with iji\neq j. We may apply Proposition B.2 with the number of steps set to HH, the set of experts as ={ρj(1),,ρj(T)}\mathcal{I}=\{\rho_{j}^{\scriptscriptstyle{(1)}},\ldots,\rho_{j}^{\scriptscriptstyle{(T)}}\}, and contexts and outcomes generated according to the distribution induced by running the policy πi×σi(t)\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}} in the Markov game 𝒢\mathcal{G} as follows:

  • For each h[H]h\in[H], we are given, at steps h<hh^{\prime}<h, the actions ak,ha_{k,h^{\prime}} and rewards rk,hr_{k,h^{\prime}} for all agents k[m+1]k\in[m+1], as well as the states s1,,shs_{1},\ldots,s_{h}.

    • For each k[m+1]k\in[m+1], set τk,h1=(s1,ak,1,rk,1,,sh1,ak,h1,rk,h1)\tau_{k,h-1}=(s_{1},a_{k,1},r_{k,1},\ldots,s_{h-1},a_{k,h-1},r_{k,h-1}) to be agent kk’s history.

    • The context fed to the aggregation algorithm at step hh is (τj,h1,sh)(\tau_{j,h-1},s_{h}).

    • The outcome at step hh is given by aj,hσj,h(t)(|τj,h1,sh)a_{j,h}\sim\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\cdot|\tau_{j,h-1},s_{h}); note that this choice satisfies the realizability assumption in Proposition B.2.

    • To aid in generating the next context at step h+1h+1, choose ak,hσk,ht(τk,h1,sh)a_{k,h}\sim\sigma_{k,h}^{t^{\star}}(\tau_{k,h-1},s_{h}) for all k[m+1]\{i,j}k\in[m+1]\backslash\{i,j\} and ai,h=πi,h(τi,h1,sh)a_{i,h}=\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}). Then set sh+1s_{h+1} to be the next state given the transitions of 𝒢\mathcal{G} and the action profile 𝐚h=(a1,h,,am+1,h)\mathbf{a}_{h}=(a_{1,h},\ldots,a_{m+1,h}).

By Proposition B.2, it follows that for any fixed t[T]t^{\star}\in[T] and j[m+1]j\in[m+1] with jij\neq i, under the process described above we have

𝔼πi×σi(t)[h=1HD𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h)]HlogT.\displaystyle\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\left[\sum_{h=1}^{H}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\right]\leq\sqrt{H\cdot\log T}. (21)

Analyzing the value of πm+1\pi_{m+1}^{\dagger}.

Next, using the development above, we show that Algorithm 2 successfully computes a Nash equilibrium with constant probability (via πm+1\pi_{m+1}^{\dagger}) whenever σ¯\overline{\sigma} is an ϵ\epsilon-CCE. We first state the following claim, which is proven in the sequel by analyzing the values Viπi×σ¯iV_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}} for i[m]i\in[m].

Lemma C.7.

If σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}, then it holds that for all i[m]i\in[m],

Viσ¯ϵmlog(T)/H.\displaystyle V_{i}^{\overline{\sigma}}\geq-\epsilon-m\sqrt{\log(T)/H}.

Note that in the game 𝒢\mathcal{G}, since for all h[H]h\in[H], s𝒮s\in\mathcal{S} and 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}, it holds that |j=1m+1Rj,h(s,𝐚)|(m+1)ϵ2H\left|\sum_{j=1}^{m+1}R_{j,h}(s,\mathbf{a})\right|\leq\frac{(m+1)\epsilon^{2}}{H} (which holds since in (13), 𝖾𝗇𝖼(𝐚)\mathsf{enc}(\mathbf{a}) is multiplied by 1H23log1/ϵ\frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}), it follows that |j=1m+1Vjσ¯|(m+1)ϵ2\left|\sum_{j=1}^{m+1}V_{j}^{\overline{\sigma}}\right|\leq(m+1)\epsilon^{2}. Thus, by Lemma C.7, we have Vm+1σ¯(m+1)ϵ2+m(ϵ+mlog(T)/H)V_{m+1}^{\overline{\sigma}}\leq(m+1)\epsilon^{2}+m\cdot(\epsilon+m\sqrt{\log(T)/H}), and since σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G} it follows that

Vm+1πm+1×σ¯(m+1)2(m+1)ϵ+m2log(T)/H.\displaystyle V_{m+1}^{\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}}\leq 2(m+1)\cdot\epsilon+m^{2}\cdot\sqrt{\log(T)/H}. (22)

To simplify notation, we will write q^h:=q^1,h××q^m,h\widehat{q}_{h}:=\widehat{q}_{1,h}\times\cdots\times\widehat{q}_{m,h} in the below calculations, where we recall that each q^j,h\widehat{q}_{j,h} is determined given the history up to step hh, (τj,h1,sh)(\tau_{j,h-1},s_{h}), as defined in (16) and (17). An action profile drawn from q^h\widehat{q}_{h} is denoted as 𝐚q^h\mathbf{a}\sim\widehat{q}_{h}, with 𝐚𝒜\mathbf{a}\in\mathcal{A}. We may now write Vm+1πm+1×σ¯(m+1)V_{m+1}^{\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}} as follows:

Vm+1πm+1×σ¯(m+1)\displaystyle V_{m+1}^{\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}}
=\displaystyle= 𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)𝔼aj,hσj,h(t)(τj,h1,sh)j[m]am+1,hπm+1,h(τm+1,h1,sh)𝐚:=(a1,h,,am+1,h)[Rm+1,h(sh,𝐚)]\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\mathbb{E}_{\begin{subarray}{c}a_{j,h}\sim\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})\ \forall j\in[m]\\ a_{m+1,h}\sim\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h})\\ \mathbf{a}:=(a_{1,h},\ldots,a_{m+1,h})\end{subarray}}\left[R_{m+1,h}(s_{h},\mathbf{a})\right]
\displaystyle\geq 𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)(𝔼aj,hq^j,hj[m]am+1,hπm+1,h(τm+1,h1,sh)𝐚:=(a1,h,,am+1,h)[Rm+1,h(sh,𝐚)]1Hj[m]D𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h))\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\Bigg{(}\mathbb{E}_{\begin{subarray}{c}a_{j,h}\sim\widehat{q}_{j,h}\ \forall j\in[m]\\ a_{m+1,h}\sim\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h})\\ \mathbf{a}:=(a_{1,h},\ldots,a_{m+1,h})\end{subarray}}\left[R_{m+1,h}(s_{h},\mathbf{a})\right]-\frac{1}{H}\sum_{j\in[m]}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\Bigg{)}
\displaystyle\geq 𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)(maxam+1,h𝒜m+1𝔼𝐚q^h[Rm+1,h(sh,(𝐚,am+1,h))]2ϵHδH1Hj[m]D𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h))\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\Bigg{(}\max_{a_{m+1,h}^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}\left[R_{m+1,h}(s_{h},(\mathbf{a},a_{m+1,h}^{\prime}))\right]-\frac{2\epsilon}{H}-\frac{\delta}{H}-\frac{1}{H}\sum_{j\in[m]}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\Bigg{)}
\displaystyle\geq 1H𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)(maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚])mHHlogT2ϵδϵ2,\displaystyle\frac{1}{H}\cdot\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\left(\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\right)-\frac{m}{H}\cdot\sqrt{H\log T}-2\epsilon-\delta-\epsilon^{2},

where:

  • The first inequality follows from the fact that Rm+1,h()R_{m+1,h}(\cdot) takes values in [1/H,1/H][-1/H,1/H] and the fact that the total variation between product distributions is bounded above by the sum of total variation distances between each of the pairs of component distributions.

  • The second inequality follows from the inequality (20) of Lemma C.5.

  • The final inequality follows from the definition of the rewards in (13) and (14), and by summing (21) over j\in[m]. We remark that the -\epsilon^{2} term in the final line comes from the term \frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}\cdot\mathsf{enc}(\mathbf{a}) in (13).

Rearranging and using (22) as well as the fact that δ+ϵ2=ϵ/(6H)+ϵ2ϵ\delta+\epsilon^{2}=\epsilon/(6H)+\epsilon^{2}\leq\epsilon (as ϵ1/2\epsilon\leq 1/2), we get that

𝔼t[T]𝔼πm+1×σ(m+1)(t)h=1H(maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚])\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\sum_{h=1}^{H}\left(\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\right)
\displaystyle\leq 2Hϵ(m+1)+(m+1)mHlogT+3Hϵ.\displaystyle 2H\cdot\epsilon\cdot(m+1)+(m+1)m\cdot\sqrt{H\log T}+3H\epsilon.

Since q^h\widehat{q}_{h} is a product distribution a.s., we have that

maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚]0.\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\geq 0.

Therefore, by Markov’s inequality, with probability at least 1/21/2 over the choice of t[T]t^{\star}\sim[T] and the trajectories (τj,h1,sh)πm+1×σ(m+1)(t)(\tau_{j,h-1},s_{h})\sim\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}} for j[m]j\in[m] (which collectively determine q^h\widehat{q}_{h}), there is some h[H]h\in[H] so that

maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚]10(m+1)ϵ+2(m+1)mlog(T)/H12(m+1)ϵ,\displaystyle\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\leq 10(m+1)\cdot\epsilon+2(m+1)m\cdot\sqrt{\log(T)/H}\leq 12(m+1)\cdot\epsilon, (23)

where the final inequality follows as long as Hϵ2m2logTH\cdot\epsilon^{2}\geq m^{2}\log T, i.e., Texp(Hϵ2m2)T\leq\exp\left(\frac{H\cdot\epsilon^{2}}{m^{2}}\right), which holds since Hn0H\geq n_{0} and we have assumed that Texp(ϵ2n0/m2)T\leq\exp(\epsilon^{2}\cdot n_{0}/m^{2}).

Note that (23) implies that with probability at least 1/21/2 under an episode drawn from πm+1×σ¯(m+1)\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}, there is some h[H]h\in[H] so that q^h\widehat{q}_{h} is a 12(m+1)ϵ12(m+1)\cdot\epsilon-Nash equilibrium of the stage game GG. Thus, by Lemma C.6, with probability at least 1/21/2 under an episode drawn from the distribution of Algorithm 2, there is some h[H]h\in[H] so that q^h\widehat{q}_{h} is a 12(m+1)ϵ12(m+1)\cdot\epsilon-Nash equilibrium of GG.

Finally, the following two observations conclude the proof of Lemma C.3.

  • If \widehat{q}_{h} is a 12(m+1)\cdot\epsilon-Nash equilibrium of G, then by the definition of the reward function R_{m+1,h}(\cdot) in (13), and upper bounding \frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}\cdot\mathsf{enc}(\mathbf{a}) by \epsilon^{2}/H, we have

    maxa𝒜m+1𝔼𝐚q^h[Rm+1,h(𝔰,(𝐚,a))]1H12(m+1)ϵ+ϵ2H,\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}\left[R_{m+1,h}(\mathfrak{s},(\mathbf{a},a^{\prime}))\right]\leq\frac{1}{H}\cdot 12(m+1)\cdot\epsilon+\frac{\epsilon^{2}}{H},

    which implies, by Lemma C.5, that with probability at least 1δ1-\delta over the draw of 𝐚h1,,𝐚hK\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K},

    maxa𝒜m+1{R^m+1,h(a)}1H12(m+1)ϵ+ϵ2H+ϵH1H14(m+1)ϵ,\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}\leq\frac{1}{H}\cdot 12(m+1)\cdot\epsilon+\frac{\epsilon^{2}}{H}+\frac{\epsilon}{H}\leq\frac{1}{H}\cdot 14(m+1)\cdot\epsilon,

    i.e., the check in Line 17 of Algorithm 2 will pass and the algorithm will return q^h\widehat{q}_{h} (if step hh is reached).

  • Conversely, if \max_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}\leq\frac{1}{H}\cdot 14(m+1)\cdot\epsilon, i.e., the check in Line 17 passes, then by Lemma C.5, with probability at least 1-\delta over \mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K},

    maxa𝒜m+1𝔼𝐚q^h[Rm+1,h(𝔰,(𝐚,a))]1H14(m+1)ϵ+ϵH1H15(m+1)ϵ,\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}\left[R_{m+1,h}(\mathfrak{s},(\mathbf{a},a^{\prime}))\right]\leq\frac{1}{H}\cdot 14(m+1)\cdot\epsilon+\frac{\epsilon}{H}\leq\frac{1}{H}\cdot 15(m+1)\cdot\epsilon,

    which implies, by the definition of Rm+1,h()R_{m+1,h}(\cdot) in (13) and (14), that q^h\widehat{q}_{h} is a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG.

Taking a union bound over all HH of the probability-δ\delta failure events from Lemma C.5 for the sampling 𝐚h1,,𝐚hKq^h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\widehat{q}_{h} (for h[H]h\in[H]), as well as over the probability-1/21/2 event that there is no q^h\widehat{q}_{h} which is a 12(m+1)ϵ12(m+1)\cdot\epsilon-Nash equilibrium of GG, we obtain that with probability at least 11/2Hϵ/(6H)1/31-1/2-H\cdot\epsilon/(6H)\geq 1/3, Algorithm 2 outputs a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG. ∎

Finally, we prove the remaining claims stated without proof above.

Proof of Lemma C.5.

Since Rm+1,h(𝔰,𝐚)[1/H,1/H]R_{m+1,h}(\mathfrak{s},\mathbf{a})\in[-1/H,1/H] for each 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}, by Hoeffding’s inequality, for any fixed a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, with probability at least 1δ/|𝒜m+1|=1δ/(mn0)1-\delta/|\mathcal{A}_{m+1}|=1-\delta/(mn_{0}) over the draw of 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}, it holds that

\displaystyle\left|\widehat{R}_{m+1,h}(a^{\prime})-\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a^{\prime}))]\right|\leq\frac{2}{H}\cdot\sqrt{\frac{\log(mn_{0}/\delta)}{K}}\leq\frac{\epsilon}{H},

where the final inequality follows from the choice of K=4log(mn0/δ)/ϵ2K=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil. The statement of the lemma follows by a union bound over all |𝒜m+1||\mathcal{A}_{m+1}| actions a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}. ∎
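To illustrate the estimator analyzed in Lemma C.5 together with the threshold check discussed in the two observations above, the following Python sketch draws K=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil profiles from \widehat{q}_{h} and compares the resulting empirical estimates against the threshold \frac{1}{H}\cdot 14(m+1)\cdot\epsilon. The interfaces q_hat, actions_m1, and R_m1 are hypothetical placeholders for \widehat{q}_{1,h},\ldots,\widehat{q}_{m,h}, \mathcal{A}_{m+1}, and R_{m+1,h}; the exact statement of Line 17 of Algorithm 2 is not reproduced here.

    import math
    import random

    def passes_check(q_hat, actions_m1, R_m1, s, H, m, n0, eps, delta):
        # q_hat: list of m probability vectors, one per agent j in [m], over that
        # agent's actions (playing the role of q-hat_{j,h}).
        # actions_m1: the action set A_{m+1} of agent m+1.
        # R_m1(s, profile): the reward R_{m+1,h}(s, profile), with values in [-1/H, 1/H].
        K = math.ceil(4 * math.log(m * n0 / delta) / eps ** 2)   # as in Lemma C.5
        # Draw K i.i.d. profiles a_h^1, ..., a_h^K from the product of the q-hat_{j,h}.
        samples = [tuple(random.choices(range(len(q)), weights=q)[0] for q in q_hat)
                   for _ in range(K)]
        # Empirical estimate of R_{m+1,h} against each candidate action a' of agent m+1.
        def R_hat(a_prime):
            return sum(R_m1(s, a + (a_prime,)) for a in samples) / K
        # Threshold used in the analysis above (Line 17 of Algorithm 2 plays this role).
        return max(R_hat(a_prime) for a_prime in actions_m1) <= 14 * (m + 1) * eps / H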

Proof of Lemma C.7.

Fix any agent i\in[m]. We will argue below that the policy \pi_{i}^{\dagger}\in\Pi^{\mathrm{gen,det}}_{i} defined within the proof of Lemma C.3 satisfies V_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}}\geq-m\sqrt{\log(T)/H}. Granting this, since \overline{\sigma} is an \epsilon-CCE of \mathcal{G}, it follows that

\displaystyle\epsilon\geq V_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}}-V_{i}^{\overline{\sigma}}\geq-m\sqrt{\log(T)/H}-V_{i}^{\overline{\sigma}},

from which the result of Lemma C.7 follows after rearranging terms.

To simplify notation, let us write q^i,h:=×jiq^j,h\widehat{q}_{-i,h}:=\bigtimes_{j\neq i}\widehat{q}_{j,h}, where we recall that each q^j,h\widehat{q}_{j,h} is determined given the history up to step hh, (τj,h1,sh)(\tau_{j,h-1},s_{h}), as defined in (16) and (17). An action profile drawn from q^i,h\widehat{q}_{-i,h} is denoted by 𝐚iq^i,h\mathbf{a}_{-i}\sim\widehat{q}_{-i,h}, with 𝐚i𝒜¯i\mathbf{a}_{-i}\in\overline{\mathcal{A}}_{-i}. We compute

Viπi×σ¯i\displaystyle V_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}}
=\displaystyle= 𝔼t[T]h=1H𝔼πi×σi(t)𝔼𝐚i×jiσj,h(t)(τj,h1,sh)[Ri,h(sh,(πi,h(τi,h1,sh),𝐚i))]\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\mathbb{E}_{\mathbf{a}_{-i}\sim\bigtimes_{j\neq i}\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})}\left[R_{i,h}(s_{h},(\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}),\mathbf{a}_{-i}))\right]
\displaystyle\geq 𝔼t[T]h=1H𝔼πi×σi(t)(𝔼𝐚iq^i,h[Ri,h(sh,(πi,h(τi,h1,sh),𝐚i))]1HjiD𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h))\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\left(\mathbb{E}_{\mathbf{a}_{-i}\sim\widehat{q}_{-i,h}}\left[R_{i,h}(s_{h},(\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}),\mathbf{a}_{-i}))\right]-\frac{1}{H}\sum_{j\neq i}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\right)
\displaystyle\geq 𝔼t[T]h=1H𝔼πi×σi(t)(maxai𝒜i𝔼𝐚iq^i,h[Ri,h(sh,(ai,𝐚i))])mHHlogT\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\left(\max_{a_{i}^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}_{-i}\sim\widehat{q}_{-i,h}}\left[R_{i,h}(s_{h},(a_{i}^{\prime},\mathbf{a}_{-i}))\right]\right)-\frac{m}{H}\cdot\sqrt{H\log T}
\displaystyle\geq mlog(T)/H,\displaystyle-m\sqrt{\log(T)/H},

where:

  • The first inequality follows from the fact that the rewards Ri,h()R_{i,h}(\cdot) take values in [1/H,1/H][-1/H,1/H] and that the total variation between product distributions is bounded above by the sum of total variation distances between each of the pairs of component distributions.

  • The second inequality follows from the definition of πi,h(τi,h1,sh)\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}) in terms of q^i,h\widehat{q}_{-i,h} in (18) as well as (21) applied to each jij\neq i and each t[T]t^{\star}\in[T].

  • The final inequality follows by Lemma C.8 below, applied to agent ii and to the distribution q^i,h\widehat{q}_{-i,h}, which we recall is a product distribution almost surely.

Lemma C.8.

For any i[m]i\in[m], s𝒮,h[H]s\in\mathcal{S},h\in[H], and any product distribution qΔ(𝒜¯i)q\in\Delta(\overline{\mathcal{A}}_{-i}), it holds that

maxai𝒜i𝔼𝐚q[Ri,h(s,(ai,𝐚))]0.\displaystyle\max_{a_{i}^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}\sim q}\left[R_{i,h}(s,(a_{i}^{\prime},\mathbf{a}))\right]\geq 0.
Proof.

Choose a_{i}^{\star}:=\operatorname*{arg\,max}_{a_{i}^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\prime},\mathbf{a}}\right]. Now we compute

H𝔼𝐚q[Ri,h(s,(ai,𝐚))]\displaystyle H\cdot\mathbb{E}_{\mathbf{a}\sim q}\left[R_{i,h}(s,(a_{i}^{\star},\mathbf{a}))\right]\geq Hminam+1𝒜m+1𝔼𝐚q[Ri,h(s,(ai,am+1,𝐚(m+1)))]\displaystyle H\cdot\min_{a_{m+1}^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim q}\left[R_{i,h}(s,(a_{i}^{\star},a_{m+1}^{\prime},\mathbf{a}_{-(m+1)}))\right]
\displaystyle\geq min(j,aj)𝒜m+1𝟙{j=i}𝔼𝐚q[(Mi)ai,𝐚(Mi)ai,𝐚]\displaystyle\min_{(j,a_{j}^{\prime})\in\mathcal{A}_{m+1}}\mathbbm{1}\{{j=i}\}\cdot\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\star},\mathbf{a}}-(M_{i})_{a_{i}^{\prime},\mathbf{a}}\right]
\displaystyle\geq 0,\displaystyle 0,

where the first inequality follows since qq is a product distribution, the second inequality uses that 𝖾𝗇𝖼()\mathsf{enc}(\cdot) is non-negative, and the final inequality follows since by choice of aia_{i}^{\star} we have 𝔼𝐚q[(Mi)ai,𝐚]𝔼𝐚q[(Mi)ai,𝐚]\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\star},\mathbf{a}}\right]\geq\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\prime},\mathbf{a}}\right] for all ai𝒜ia_{i}^{\prime}\in\mathcal{A}_{i}. ∎

C.2 Remarks on bit complexity of the rewards

The Markov game \mathcal{G}(G) constructed to prove Theorem C.1 uses lower-order bits of the rewards to record the action profile taken at each step. These lower-order bits may be used by each agent to infer which actions were taken by the other agents at the previous step, and we use this idea to construct the best-response policies \pi_{i}^{\dagger} defined in the proof. As a result of this aspect of the construction, the rewards of the game \mathcal{G}(G) each take O(m\cdot\log(n)+\log(1/\epsilon)) bits to specify. As discussed in the proof of Theorem C.1, it is without loss of generality to assume that the payoffs of the given normal-form game G take O(\log 1/\epsilon) bits each to specify, so when either m\gg 1 or n\gg 1/\epsilon, the construction of \mathcal{G}(G) uses more bits to express its rewards than are used for the normal-form game G.
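As a purely illustrative sketch of this bit-level bookkeeping (the precise definition of \mathsf{enc}(\cdot) in (13) is not reproduced here), the following Python snippet packs an action-profile index into bits sitting below the O(\log 1/\epsilon) bits of the main payoff and then recovers it. The specific indexing of profiles, the common coordinate range, and the assumption that the main payoff has exactly B fractional bits are all choices made only for this example.

    import math
    from fractions import Fraction

    def pack(base_payoff_num, B, profile, base, num_agents):
        # The main payoff is assumed to equal base_payoff_num / 2^B, i.e., it is
        # specified with B fractional bits (B = O(log 1/eps) in the construction).
        # Each coordinate of the profile is assumed to lie in {0, ..., base - 1}.
        P = math.ceil(num_agents * math.log2(base))      # bits needed for the profile index
        idx = 0
        for a in profile:
            idx = idx * base + a
        reward = Fraction(base_payoff_num, 2 ** B) + Fraction(idx, 2 ** (B + P))
        return reward, P

    def unpack(reward, B, P, base, num_agents):
        # Read off the P low-order bits sitting below the main payoff.
        idx = int(reward * 2 ** (B + P)) % 2 ** P
        profile = []
        for _ in range(num_agents):
            profile.append(idx % base)
            idx //= base
        return tuple(reversed(profile))

    # Example: 3 agents, 4 actions each, main payoff 5 / 2^6.
    r, P = pack(5, 6, (2, 0, 3), base=4, num_agents=3)
    assert unpack(r, 6, P, base=4, num_agents=3) == (2, 0, 3)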

It is possible to avoid this phenomenon by instead using the state transitions of the Markov game to encode the action profile taken at each step, as was done in the proof of Theorem 3.2. The idea, which we sketch here, is to replace the game 𝒢(G)\mathcal{G}(G) of Definition C.2 with the following game 𝒢(G)\mathcal{G}^{\prime}(G):

Definition C.9 (Alternative construction to Definition C.2).

Given an m-player, n_{0}-action normal-form game G, we define the game \mathcal{G}^{\prime}(G)=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m+1]},\mathbb{P},(R_{i})_{i\in[m+1]},\mu) as follows; a schematic code sketch of its rewards and transitions is given after the definition.

  • The horizon of \mathcal{G}^{\prime}(G) is H=n_{0}.

  • Let A=n0A=n_{0}. The action spaces of agents 1,2,,m1,2,\ldots,m are given by 𝒜1==𝒜m=[A]\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=[A]. The action space of agent m+1m+1 is

    𝒜m+1={(j,aj):j[m],aj𝒜j},\displaystyle\mathcal{A}_{m+1}=\{(j,a_{j})\ :\ j\in[m],a_{j}\in\mathcal{A}_{j}\},

    so that |𝒜m+1|=Amn|\mathcal{A}_{m+1}|=Am\leq n.

    We write 𝒜=j=1m𝒜j\mathcal{A}=\prod_{j=1}^{m}\mathcal{A}_{j} to denote the joint action space of the first mm agents, and 𝒜¯:=j=1m+1𝒜j\overline{\mathcal{A}}:=\prod_{j=1}^{m+1}\mathcal{A}_{j} to denote the joint action space of all agents. Then |𝒜¯|=Am(mA)=mAm+1n|\overline{\mathcal{A}}|=A^{m}\cdot(mA)=mA^{m+1}\leq n.

  • The state space 𝒮\mathcal{S} is defined as follows. There are |𝒜¯||\overline{\mathcal{A}}| states, one for each action tuple 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}. For each 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}, we denote the corresponding state by 𝔰𝐚\mathfrak{s}_{\mathbf{a}}.

  • For all h[H]h\in[H], the reward to agent j[m+1]j\in[m+1] given action profile 𝐚=(a1,,am+1)\mathbf{a}=(a_{1},\ldots,a_{m+1}) at any state s𝒮s\in\mathcal{S} is as follows: writing am+1=(j,aj)a_{m+1}=(j^{\prime},a_{j^{\prime}}^{\prime}),

    \displaystyle R_{j,h}(s,\mathbf{a}):=\begin{cases}0&:j\not\in\{j^{\prime},m+1\}\\ \frac{1}{H}\cdot\left((M_{j^{\prime}})_{a_{1},\ldots,a_{m}}-(M_{j^{\prime}})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}\right)&:j=j^{\prime}\\ \frac{1}{H}\cdot\left((M_{j^{\prime}})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}-(M_{j^{\prime}})_{a_{1},\ldots,a_{m}}\right)&:j=m+1.\end{cases} (24)
  • At each step h[H]h\in[H], if action profile 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}} is taken, the game transitions to the state 𝔰𝐚\mathfrak{s}_{\mathbf{a}}.
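The schematic sketch referenced in Definition C.9 is as follows (Python). The interface M, a list of m payoff tables for G indexed by m-tuples of actions with agents indexed from 0, is an assumption made only for this example; the sketch mirrors the rewards (24) and the deterministic transitions.

    def reward(j, profile, M, H):
        # R_{j,h}(s, a) from (24); it depends on neither the state s nor the step h.
        # profile = (a_1, ..., a_m, (j', a'_{j'})) with 0-indexed agents; agent m+1
        # corresponds to j == m here.
        a_first_m = profile[:-1]
        j_prime, a_dev = profile[-1]
        deviated = tuple(a_dev if k == j_prime else a for k, a in enumerate(a_first_m))
        m = len(a_first_m)
        if j == m:                   # agent m+1: the deviation gain of the probed agent j'
            return (M[j_prime][deviated] - M[j_prime][a_first_m]) / H
        if j == j_prime:             # the probed agent j': minus that deviation gain
            return (M[j_prime][a_first_m] - M[j_prime][deviated]) / H
        return 0.0                   # all other agents receive zero reward

    def transition(state, profile):
        # The next state simply records the action profile just played (there is one
        # state per profile in the definition above).
        return profile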

Note that the number of states of \mathcal{G}^{\prime}(G) is equal to |\overline{\mathcal{A}}|=mn_{0}^{m+1}, and so |\mathcal{G}^{\prime}(G)|=mn_{0}^{m+1}. As a result, if we were to use the game \mathcal{G}^{\prime}(G) in place of \mathcal{G}(G) in the proof of Theorem C.1, we would need to define n_{0}:=\lfloor n^{1/(m+1)}/m\rfloor to ensure that |\mathcal{G}^{\prime}(G)|\leq n, and so the condition T<\exp(\epsilon^{2}\cdot\lfloor n/m\rfloor/m^{2}) would be replaced by T<\exp(\epsilon^{2}\cdot\lfloor n^{1/(m+1)}/m\rfloor/m^{2}). This would only lead to a small quantitative degradation in the statement of Theorem 4.3, with the condition in the statement replaced by T<\exp(c\cdot\epsilon^{2}\cdot n^{1/3}) for some constant c>0. However, it would render the statement of Theorem 5.2 essentially vacuous. For this reason, we opt for the approach of Definition C.2 rather than that of Definition C.9.

We expect that the construction of Definition C.2 can nevertheless still be modified to use O(log1/ϵ)O(\log 1/\epsilon) bits to express each reward in the Markov game 𝒢\mathcal{G}. In particular, one could introduce stochastic transitions to encode in the state of the Markov game a small number of random bits of the full action profile played at each step. We leave such an approach for future work.

Appendix D Equivalence between Πjgen,rnd{\Pi}^{\mathrm{gen,rnd}}_{j} and Δ(Πjgen,det)\Delta(\Pi^{\mathrm{gen,det}}_{j})

In this section we consider an alternate definition of the space Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} of randomized general policies of player ii, and show that it is equivalent to the one we gave in Section 2.

In particular, suppose we were to define a randomized general policy of agent ii as a distribution over deterministic general policies of agent ii: we write Π~igen,rnd:=Δ(Πigen,det){\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{i}:=\Delta(\Pi^{\mathrm{gen,det}}_{i}) to denote the space of such distributions. Moreover, write Π~gen,rnd:=Π~1gen,rnd××Π~mgen,rnd=Δ(Π1gen,det)××Δ(Πmgen,det){\widetilde{\Pi}}^{\mathrm{gen,rnd}}:={\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{m}=\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m}) to denote the space of product distributions over agents’ deterministic policies. Our goal in this section is to show that policies in Π~gen,rnd{\widetilde{\Pi}}^{\mathrm{gen,rnd}} are equivalent to those in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}} in the following sense: there is an embedding map 𝖤𝗆𝖻:Πgen,rndΠ~gen,rnd\mathsf{Emb}:{\Pi}^{\mathrm{gen,rnd}}\rightarrow{\widetilde{\Pi}}^{\mathrm{gen,rnd}}, not depending on the Markov game, so that the distribution of a trajectory drawn from any σΠgen,rnd\sigma\in{\Pi}^{\mathrm{gen,rnd}}, for any Markov game, is the same as the distribution of a trajectory drawn from 𝖤𝗆𝖻(σ)\mathsf{Emb}(\sigma) (Fact D.2). Furthermore, 𝖤𝗆𝖻\mathsf{Emb} is surjective in the following sense: any policy σ~Π~gen,rnd\widetilde{\sigma}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}} produces trajectories that are distributed identically to those of 𝖤𝗆𝖻(σ)\mathsf{Emb}(\sigma) (and thus of σ\sigma), for some σΠgen,rnd\sigma\in{\Pi}^{\mathrm{gen,rnd}} (Fact D.3). In Definition D.1 below, we define 𝖤𝗆𝖻\mathsf{Emb}.

Definition D.1.

For j[m]j\in[m] and σjΠjgen,rnd\sigma_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j}, define 𝖤𝗆𝖻j(σj)Π~jgen,rnd=Δ(Πjgen,det)\mathsf{Emb}_{j}(\sigma_{j})\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{j}=\Delta(\Pi^{\mathrm{gen,det}}_{j}) to put the following amount of mass on each πjΠjgen,det\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}:

(𝖤𝗆𝖻j(σj))(πj):=h=1H(τj,h1,sh)j,h1×𝒮σj(πj,h(τj,h1,sh)|τj,h1,sh)\displaystyle(\mathsf{Emb}_{j}(\sigma_{j}))(\pi_{j}):=\prod_{h=1}^{H}\prod_{(\tau_{j,h-1},s_{h})\in\mathscr{H}_{j,h-1}\times\mathcal{S}}\sigma_{j}(\pi_{j,h}(\tau_{j,h-1},s_{h})\ |\ \tau_{j,h-1},s_{h}) (25)

Furthermore, for \sigma=(\sigma_{1},\ldots,\sigma_{m})\in{\Pi}^{\mathrm{gen,rnd}}, define \mathsf{Emb}(\sigma)=(\mathsf{Emb}_{1}(\sigma_{1}),\ldots,\mathsf{Emb}_{m}(\sigma_{m})).

Note that, in the special case that σjΠjgen,det\sigma_{j}\in\Pi^{\mathrm{gen,det}}_{j}, 𝖤𝗆𝖻j(σj)\mathsf{Emb}_{j}(\sigma_{j}) is the point mass on σj\sigma_{j}.
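To make (25) concrete, the following toy Python sketch computes the mass that \mathsf{Emb}_{j}(\sigma_{j}) assigns to a deterministic policy; representing policies as explicit tables over a small finite set of (history, state) contexts is an assumption made only for this example.

    def emb_mass(sigma_j, pi_j, contexts):
        # sigma_j: dict mapping each context (tau_{j,h-1}, s_h) to a dict of action
        # probabilities; pi_j: dict mapping each context to a single action.
        # Returns the product over contexts appearing in (25).
        mass = 1.0
        for ctx in contexts:
            mass *= sigma_j[ctx][pi_j[ctx]]
        return mass

    # Sanity check on a one-step, one-state example with two actions: the masses
    # assigned to the two deterministic policies sum to one.
    ctx = ((), "s1")
    sigma_j = {ctx: {0: 0.3, 1: 0.7}}
    total = sum(emb_mass(sigma_j, {ctx: a}, [ctx]) for a in (0, 1))
    assert abs(total - 1.0) < 1e-9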

Fact D.2 (Embedding equivalence).

Fix an m-player Markov game \mathcal{G} and arbitrary policies \sigma_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j} for each j\in[m]. Then a trajectory drawn from the product policy \sigma=(\sigma_{1},\ldots,\sigma_{m})\in{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m} is distributed identically to a trajectory drawn from \mathsf{Emb}(\sigma)\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}.

The proof of Fact D.2 is provided in Section D.1. Next, we show that the mapping 𝖤𝗆𝖻\mathsf{Emb} is surjective in the following sense:

Fact D.3 (Right inverse of 𝖤𝗆𝖻j\mathsf{Emb}_{j}).

There is a mapping 𝖥𝖺𝖼:Π~gen,rndΠgen,rnd\mathsf{Fac}:{\widetilde{\Pi}}^{\mathrm{gen,rnd}}\rightarrow{\Pi}^{\mathrm{gen,rnd}} so that for any Markov game 𝒢\mathcal{G} and any σ~Π~gen,rnd\widetilde{\sigma}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}, the distribution of a trajectory drawn from σ~\widetilde{\sigma} is identical to the distribution of a trajectory drawn from 𝖤𝗆𝖻𝖥𝖺𝖼(σ~)\mathsf{Emb}\circ\mathsf{Fac}(\widetilde{\sigma}).

We will write 𝖥𝖺𝖼((σ~1,,σ~m)):=(𝖥𝖺𝖼1(σ~1),,𝖥𝖺𝖼m(σ~m))\mathsf{Fac}((\widetilde{\sigma}_{1},\ldots,\widetilde{\sigma}_{m})):=(\mathsf{Fac}_{1}(\widetilde{\sigma}_{1}),\ldots,\mathsf{Fac}_{m}(\widetilde{\sigma}_{m})). Fact D.3 states that the policy 𝖥𝖺𝖼(σ~)\mathsf{Fac}(\widetilde{\sigma}) maps, under 𝖤𝗆𝖻\mathsf{Emb}, to a policy in Π~gen,rnd{\widetilde{\Pi}}^{\mathrm{gen,rnd}} which is equivalent to σ~\widetilde{\sigma} (in the sense that their trajectories are identically distributed for any Markov game).

An important consequence of Fact D.2 is that the expected reward (i.e., value) under any \sigma\in{\Pi}^{\mathrm{gen,rnd}} is the same as that of \mathsf{Emb}(\sigma). Thus, given a Markov game, the induced normal-form game in which the players' pure action sets are {\Pi}^{\mathrm{gen,rnd}}_{1},\ldots,{\Pi}^{\mathrm{gen,rnd}}_{m} is equivalent to the normal-form game in which the players' pure action sets are \Pi^{\mathrm{gen,det}}_{1},\ldots,\Pi^{\mathrm{gen,det}}_{m}, in the following sense: for any mixed strategy in the former, namely a product distributional policy P\in\Delta({\Pi}^{\mathrm{gen,rnd}}_{1})\times\cdots\times\Delta({\Pi}^{\mathrm{gen,rnd}}_{m}), the policy \mathbb{E}_{\sigma\sim P}[\mathsf{Emb}(\sigma)]\in\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m})={\widetilde{\Pi}}^{\mathrm{gen,rnd}} is a mixed strategy in the latter which gives each player the same value as under P. (Note that \mathbb{E}_{\sigma\sim P}[\mathsf{Emb}(\sigma)] is indeed a product distribution since P is a product distribution and \mathsf{Emb} factors into individual coordinates.) Furthermore, by Fact D.3, any distributional policy in {\widetilde{\Pi}}^{\mathrm{gen,rnd}} arises in this manner, for some P\in\Delta({\Pi}^{\mathrm{gen,rnd}}_{1})\times\cdots\times\Delta({\Pi}^{\mathrm{gen,rnd}}_{m}); in fact, P may be chosen to place all its mass on a single \sigma\in{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m}. Since \mathsf{Emb} factors into individual coordinates, it follows that \mathsf{Emb} yields a one-to-one mapping between the coarse correlated equilibria (or any other notion of equilibrium, e.g., Nash equilibria or correlated equilibria) of these two normal-form games.

D.1 Proofs of the equivalence

Proof of Fact D.2.

Consider any trajectory \tau=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1},\ldots,s_{H},\mathbf{a}_{H},\mathbf{r}_{H}) consisting of a sequence of H states together with actions and rewards for each of the m agents. Assume that r_{i,h}=R_{i,h}(s_{h},\mathbf{a}_{h}) for all i,h (as otherwise \tau has probability 0 under any policy). Write:

pτ:=h=1H1h(sh+1|sh,𝐚h).\displaystyle p_{\tau}:=\prod_{h=1}^{H-1}\mathbb{P}_{h}(s_{h+1}|s_{h},\mathbf{a}_{h}).

Then the probability of observing τ\tau under σ\sigma is

\displaystyle p_{\tau}\cdot\prod_{h=1}^{H}\prod_{j=1}^{m}\sigma_{j,h}(a_{j,h}|\tau_{j,h-1},s_{h}) (26)

where, as usual, \tau_{j,h-1}=(s_{1},a_{j,1},r_{j,1},\ldots,s_{h-1},a_{j,h-1},r_{j,h-1}). Write \widetilde{\sigma}=(\widetilde{\sigma}_{1},\ldots,\widetilde{\sigma}_{m}):=\mathsf{Emb}(\sigma). The probability of observing \tau under \widetilde{\sigma} is

\displaystyle p_{\tau}\cdot\prod_{j\in[m]}\sum_{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}:\ \forall h,\ \pi_{j,h}(\tau_{j,h-1},s_{h})=a_{j,h}}\widetilde{\sigma}_{j}(\pi_{j}) (27)

It is now straightforward to see from the definition of \widetilde{\sigma}_{j}(\pi_{j})=(\mathsf{Emb}_{j}(\sigma_{j}))(\pi_{j}) in (25) that the quantities in (26) and (27) are equal. ∎

Proof of Fact D.3.

Fix a policy \widetilde{\sigma}_{j}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{j}=\Delta(\Pi^{\mathrm{gen,det}}_{j}). We define \mathsf{Fac}_{j}(\widetilde{\sigma}_{j}) to be the policy \sigma_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j} defined as follows: for \tau_{j,h-1}=(s_{1},a_{j,1},r_{j,1},\ldots,s_{h-1},a_{j,h-1},r_{j,h-1})\in\mathscr{H}_{j,h-1} and s_{h}\in\mathcal{S}, we have, for a_{j,h}\in\mathcal{A}_{j},

\displaystyle\sigma_{j}(\tau_{j,h-1},s_{h})(a_{j,h})=\frac{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h\}\right)}{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h-1\}\right)}.

If the denominator of the above expression is 0, then \sigma_{j}(\tau_{j,h-1},s_{h}) is defined to be an arbitrary distribution in \Delta(\mathcal{A}_{j}). (For concreteness, let us say that it puts all its mass on a fixed action in \mathcal{A}_{j}.) Furthermore, for \widetilde{\sigma}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}, define \mathsf{Fac}(\widetilde{\sigma}):=(\mathsf{Fac}_{1}(\widetilde{\sigma}_{1}),\ldots,\mathsf{Fac}_{m}(\widetilde{\sigma}_{m}))\in{\Pi}^{\mathrm{gen,rnd}}.
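In code, the conditional probabilities defining \mathsf{Fac}_{j}(\widetilde{\sigma}_{j}) can be sketched as follows; representing \widetilde{\sigma}_{j} as an explicit list of (deterministic policy, probability) pairs is an assumption made only for this illustration.

    def fac_prob(sigma_tilde_j, observed, context, action, fallback_action=0):
        # sigma_tilde_j: list of (pi_j, prob) pairs, each pi_j a dict mapping
        # contexts (tau_{j,g-1}, s_g) to actions.
        # observed: list of (context, action) pairs realized at steps g <= h-1.
        # Returns Fac_j(sigma_tilde_j)(tau_{j,h-1}, s_h)(action).
        def consistent(pi_j, pairs):
            return all(pi_j[c] == a for c, a in pairs)
        denom = sum(p for pi_j, p in sigma_tilde_j if consistent(pi_j, observed))
        if denom == 0:
            # Arbitrary fixed distribution, as in the proof: a point mass on a fixed action.
            return 1.0 if action == fallback_action else 0.0
        numer = sum(p for pi_j, p in sigma_tilde_j
                    if consistent(pi_j, observed + [(context, action)]))
        return numer / denom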

Next, fix any \widetilde{\sigma}=(\widetilde{\sigma}_{1},\ldots,\widetilde{\sigma}_{m})\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{m}. Let \sigma=\mathsf{Fac}(\widetilde{\sigma}). By Fact D.2, it suffices to show that the distribution of trajectories under \sigma is the same as the distribution of trajectories drawn from \widetilde{\sigma}.

So consider any trajectory \tau=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1},\ldots,s_{H},\mathbf{a}_{H},\mathbf{r}_{H}) consisting of a sequence of H states together with actions and rewards for each of the m agents. Assume that r_{i,h}=R_{i,h}(s_{h},\mathbf{a}_{h}) for all i,h (as otherwise \tau has probability 0 under any policy). Write:

pτ:=h=1H1h(sh+1|sh,𝐚h).\displaystyle p_{\tau}:=\prod_{h=1}^{H-1}\mathbb{P}_{h}(s_{h+1}|s_{h},\mathbf{a}_{h}).

Then the probability of observing τ\tau under σ\sigma is

pτh=1Hj=1mσj,h(aj,h|τj,h1,sh)\displaystyle p_{\tau}\cdot\prod_{h=1}^{H}\prod_{j=1}^{m}\sigma_{j,h}(a_{j,h}|\tau_{j,h-1},s_{h})
=\displaystyle= p_{\tau}\cdot\prod_{j=1}^{m}\prod_{h=1}^{H}\frac{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h\}\right)}{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h-1\}\right)}
=\displaystyle= p_{\tau}\cdot\prod_{j=1}^{m}\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq H\}\right),

which is equal to the probability of observing \tau under \widetilde{\sigma}. (If any denominator in the display above vanishes, then the trajectory \tau has probability 0 under both \sigma and \widetilde{\sigma}, so the equality holds trivially.) ∎