Hardness of Independent Learning and
Sparse Equilibrium Computation in Markov Games

Dylan J. Foster
dylanfoster@microsoft.com
   Noah Golowich
nzg@mit.edu
   Sham M. Kakade
sham@seas.harvard.edu
Abstract

We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results for no-regret learning in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework was open. We provide a decisive negative resolution to this problem, both from a computational and a statistical perspective. We show that:

  1.

    Under the widely-believed complexity-theoretic assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant.

  2.

    When the game is unknown, no algorithm—regardless of computational efficiency—can achieve no-regret without observing a number of episodes that is exponential in the number of players.

Perhaps surprisingly, our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute by any means (centralized, decentralized, or otherwise) a coarse correlated equilibrium that is “sparse” in the sense that it can be represented as a mixture of a small number of “product” policies. The crux of our approach is a novel application of aggregation techniques from online learning [Vov90, CBL06], whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero-sum normal-form games; this enables the application of well-known hardness results for Nash.

1 Introduction

The framework of multi-agent reinforcement learning (MARL), which describes settings in which multiple agents interact in a dynamic environment, has played a key role in recent breakthroughs in artificial intelligence, including the development of agents that approach or surpass human performance in games such as Go [SHM+16], Poker [BS18], Stratego [PVH+22], and Diplomacy [KEG+22, BBD+22]. MARL also shows promise for real-world multi-agent systems, including autonomous driving [SSS16], cybersecurity [MK15], and economic policy [ZTS+22]. These applications, where reliability is critical, necessitate the development of algorithms that are practical and efficient, yet provide strong formal guarantees and robustness.

Multi-agent reinforcement learning is typically studied using the framework of Markov games (also known as stochastic games) [Sha53]. In a Markov game, agents interact over a finite number of steps: at each step, each agent observes the state of the environment, takes an action, and observes a reward which depends on the current state as well as the other agents’ actions. Then the environment transitions to a new state as a function of the current state and the actions taken. An episode consists of a finite number of such steps, and agents interact over the course of multiple episodes, progressively learning new information about their environment. Markov games generalize the well-known model of Markov Decision Processes (MDPs) [Put94], which describe the special case in which there is a single agent acting in a dynamic environment, and we wish to find a policy that maximizes its reward. By contrast, for Markov games, we typically aim to find a distribution over agents’ policies which constitutes some type of equilibrium.

1.1 Decentralized learning

In this paper, we focus on the problem of decentralized (or, independent) learning in Markov games. In decentralized MARL, each agent in the Markov game behaves independently, optimizing their policy myopically while treating the effects of the other agents as exogenous. Agents observe local information (in particular, their own actions and rewards), but do not observe the actions of the other agents directly. Decentralized learning enjoys a number of desirable properties, including scalability (computation is inherently linear in the number of agents), versatility (by virtue of independence, algorithms can be applied in uncertain environments in which the nature of the interaction and number of other agents are not known), and practicality (architectures for single-agent reinforcement learning can often be adapted directly). The central question we consider is whether there exist decentralized learning algorithms which, when employed by all agents in a Markov game, lead them to play near-equilibrium strategies over time.

Decentralized equilibrium computation in MARL is not well understood theoretically, and algorithms with provable guarantees are scarce. To motivate the challenges and most salient issues, it will be helpful to contrast with the simpler problem of decentralized learning in normal-form games, which may be interpreted as Markov games with a single state. Normal-form games enjoy a rich and celebrated theory of decentralized learning, dating back to Brown’s work on fictitious play [Bro49] and Blackwell’s theory of approachability [Bla56]. Much of the modern work on decentralized learning in normal-form games centers on no-regret learning, where agents select actions independently using online learning algorithms [CBL06] designed to minimize their regret (that is, the gap between realized payoffs and the payoff of the best fixed action in hindsight). In particular, a foundational result is that if each agent employs a no-regret learning strategy, then the average of the agents’ joint action distributions approaches a coarse correlated equilibrium (CCE) for the normal-form game [CBL06, Han57, Bla56]. CCE is a natural relaxation of the foundational concept of Nash equilibrium, which has the downside of being intractable to compute. On the other hand, there are many efficient algorithms that can achieve vanishing regret in a normal-form game, even when opponents select their actions in an arbitrary, potentially adaptive fashion, and thus converge to a CCE [Vov90, LW94, CBFH+97, HMC00, SALS15].
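As a concrete illustration of this connection, the following minimal sketch (Python, assuming NumPy; the random payoff matrices, step size, and horizon are illustrative choices of ours, not taken from the text) has two players independently run Hedge on a 2-player normal-form game and then measures the coarse-correlated-equilibrium gap of their empirical joint play.

```python
# Two players independently run Hedge (exponential weights) on a random
# 2-player normal-form game; the empirical average of their joint play
# approaches a coarse correlated equilibrium. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 3, 5000, 0.05
# Payoffs in [0, 1]: R1[a1, a2] is player 1's reward, R2[a1, a2] is player 2's.
R1, R2 = rng.random((n, n)), rng.random((n, n))

w1, w2 = np.zeros(n), np.zeros(n)   # cumulative rewards driving the Hedge weights
joint = np.zeros((n, n))            # empirical joint action distribution

for t in range(T):
    p1 = np.exp(eta * (w1 - w1.max())); p1 /= p1.sum()
    p2 = np.exp(eta * (w2 - w2.max())); p2 /= p2.sum()
    joint += np.outer(p1, p2) / T
    # Full-information feedback: each player sees the payoff every action
    # would have earned against the opponent's current mixed strategy.
    w1 += R1 @ p2
    w2 += R2.T @ p1

# CCE gap: best fixed-action deviation gain for each player under `joint`.
v1, v2 = np.sum(joint * R1), np.sum(joint * R2)
gap1 = (R1 @ joint.sum(axis=0)).max() - v1   # deviate against player 2's marginal
gap2 = (R2.T @ joint.sum(axis=1)).max() - v2
print(f"CCE gap: player 1 = {gap1:.4f}, player 2 = {gap2:.4f}")  # shrinks as T grows
```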

This simple connection between no-regret learning and decentralized convergence to equilibria has been influential in game theory, leading to numerous lines of research including fast rates of convergence to equilibria [SALS15, CP20, DFG21, AFK+22], price of anarchy bounds for smooth games [Rou15], and lower bounds on query and communication complexity for equilibrium computation [FGGS13, Rub16, BR17]. Empirically, no-regret algorithms such as regret matching [HMC00] and Hedge [Vov90, LW94, CBFH+97] have been used to compute equilibria that can achieve state-of-the-art performance in application domains such as Poker [BS18] and Diplomacy [BBD+22]. Motivated by these successes, we ask whether an analogous theory can be developed for Markov games. In particular:

Are there efficient algorithms for no-regret learning in Markov games?

Any Markov game can be viewed as a large normal-form game where each agent’s action space consists of their exponentially-sized space of policies, and their utility function is given by their expected reward. Thus, any learning algorithm for normal-form games can also be applied to Markov games, but the resulting sample and computational complexities will be intractably large. Our goal is to explore whether more efficient decentralized learning guarantees can be established.
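To make the size of this induced normal-form game concrete, here is a back-of-the-envelope count (our own illustration, using the notation introduced in Section 2; the symbol $\Pi^{\mathrm{markov,det}}_i$ for player $i$'s deterministic Markov policies is ours and is not used in the formal development).

```latex
% Illustrative count: a deterministic Markov policy picks one of A_i actions
% at each of the S states and each of the H steps, so
\bigl|\Pi^{\mathrm{markov,det}}_i\bigr| \;=\; A_i^{\,H\cdot S},
% which is already exponential in the game's description length; the set of
% general (history-dependent) deterministic policies is larger still.
```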

Challenges for no-regret learning.

In spite of active research effort and many promising pieces of progress [JLWY21, SMB22, MB21, DGZ22, ELS+22], no-regret learning guarantees for Markov games have been elusive. A barrier faced by naive algorithms is that it is intractable to ensure no-regret against an arbitrary adversary, both computationally [BJY20, AYBK+13] and statistically [LWJ22, KECM21, FRSS22].

Fortunately, many of the implications of no-regret learning (in particular, convergence to equilibria) do not require the algorithm to have sublinear regret against an arbitrary adversary, but rather only against other agents who are running the same algorithm independently. This observation has been influential in normal-form games, where the well-known line of work on fast rates of convergence to equilibrium [SALS15, CP20, DFG21, AFK+22] holds only in this more restrictive setting. This motivates the following relaxation to our central question.

Problem 1.1.

Is there an efficient algorithm that, when adopted by all agents in a Markov game and run independently, leads to sublinear regret for each individual agent?

Attempts to address Problem 1.1.

Two recent lines of research have made progress toward addressing Problem 1.1 and related questions. In one direction, several recent papers have provided algorithms, including V-learning [JLWY21, SMB22, MB21] and SPoCMAR [DGZ22], that do not achieve no-regret, but can nevertheless compute and then sample from a coarse correlated equilibrium in a Markov game in a (mostly) decentralized fashion, with the caveat that they require a shared source of random bits as a mechanism to coordinate. Notably, V-learning depends only mildly on the shared randomness: agents first play policies in a fully independent fashion (i.e., without shared randomness) according to a simple learning algorithm for $T$ episodes, and use shared random bits only once learning finishes, as part of a post-processing procedure to extract a CCE policy. A question left open by these works is whether the sequence of policies played by the V-learning algorithm in the initial independent phase can itself guarantee each agent sublinear regret; this would eliminate the need for a separate post-processing procedure and shared randomness.

Most closely related to our work, [ELS+22] recently showed that Problem 1.1 can be solved positively for a restricted setting in which regret for each agent is defined as the maximum gain in value they can achieve by deviating to a fixed Markov policy. Markov policies are those whose choice of action depends only on the current state, as opposed to the entire history of interaction. This notion of deviation is restrictive because in general, even when the opponent plays a sequence of Markov policies, the best response will be non-Markov. In challenging settings that abound in practice, it is standard to consider non-Markov policies [LDGV+21, AVDG+22], since they often achieve higher value than Markov policies; we provide a simple example in Proposition 6.1. Thus, while a regret guarantee with respect to the class of Markov policies (as in [ELS+22]) is certainly interesting, it may be too weak in general, and it is of great interest to understand whether Problem 1.1 can be answered positively in the general setting.\footnote{We remark that the V-learning and SPoCMAR algorithms mentioned above do learn equilibria that are robust to deviations to non-Markov policies, though they do not address Problem 1.1 since they do not have sublinear regret.}

1.2 Our contributions

We resolve Problem 1.1 in the negative, from both a computational and statistical perspective.

Computational hardness.

We provide two computational lower bounds (Theorems 1.2 and 1.3) which show that under standard complexity-theoretic assumptions, there is no efficient algorithm that runs for a polynomial number of episodes and guarantees each agent non-trivial (“sublinear”) regret when used in tandem by all agents. Both results hold even if the Markov game is explicitly known to the algorithm designer; Theorem 1.3 is stronger and more general, but applies only to 3-player games, while Theorem 1.2 applies to 2-player games, but only for agents restricted to playing Markovian policies.

To state our first result, Theorem 1.2, we define a product Markov policy to be a joint policy in which players choose their actions independently according to Markov policies (see Sections 2 and 3 for formal definitions). Note that if all players use independent no-regret algorithms to choose Markov policies at each episode, then their joint play at each round is described by a product Markov policy, since any randomness in each player’s policy must be generated independently.

Theorem 1.2 (Informal version of Corollary 3.3).

If $\textsf{PPAD}\neq\textsf{P}$, then there is no polynomial-time algorithm that, given the description of a 2-player Markov game, outputs a sequence of joint product Markov policies which guarantees each agent sublinear regret.

Theorem 1.2 provides a decisive negative resolution to Problem 1.1 under the assumption that $\textsf{PPAD}\neq\textsf{P}$,\footnote{Technically, the class we are denoting by $\textsf{P}$, namely the class of total search problems that have a deterministic polynomial-time algorithm, is sometimes denoted by $\mathsf{FP}$, as it is a search problem. We ignore this distinction.} which is standard in the theory of computational complexity [Pap94].\footnote{PPAD is the most well-studied complexity class in algorithmic game theory, and is widely believed to not admit polynomial-time algorithms. Notably, the problem of computing a Nash equilibrium for normal-form games with two or more players is PPAD-complete [DGP09, CDT06, Rub18].} Beyond simply ruling out the existence of fully decentralized no-regret algorithms, it rules out the existence of centralized algorithms that compute a sequence of product policies for which each agent has sublinear regret, even if such a sequence does not arise naturally as the result of agents independently following some learning algorithm. Salient implications include:

  • Theorem 1.2 provides a separation between Markov games and normal-form games, since standard no-regret algorithms for normal-form games i) run in polynomial time and ii) produce sequences of joint product policies that guarantee each agent sublinear regret. Notably, no-regret learning for normal-form games is efficient whenever the number of agents is polynomial, whereas Theorem 1.2 rules out polynomial-time algorithms for as few as two agents.

  • A question left open by the work of [JLWY21, SMB22, MB21] was whether the sequence of policies played by the V-learning algorithm during its independent learning phase can guarantee each agent sublinear regret. Since V-learning plays product Markov policies during the independent phase and is computationally efficient, Theorem 1.2 implies that these policies do not enjoy sublinear regret (assuming $\textsf{PPAD}\neq\textsf{P}$).

Our second result, Theorem 1.3, extends the guarantee of Theorem 1.2 to the more general setting in which agents can select arbitrary, potentially non-Markovian policies at each episode. This comes at the cost of only providing hardness for 3-player games as opposed to 2-player games, as well as relying on the slightly stronger complexity-theoretic assumption that $\textsf{PPAD}\nsubseteq\textsf{RP}$.\footnote{We use $\textsf{RP}$ to denote the class of total search problems for which there exists a polynomial-time randomized algorithm which outputs a solution with probability at least $2/3$, and otherwise outputs “fail”.}

Theorem 1.3 (Informal version of Corollary 4.4).

If $\textsf{PPAD}\nsubseteq\textsf{RP}$, then there is no polynomial-time algorithm that, given the description of a 3-player Markov game, outputs a sequence of joint product general policies (i.e., potentially non-Markov) which guarantees each agent sublinear regret.

Statistical hardness.

Theorems 1.2 and 1.3 rely on the widely-believed complexity-theoretic assumption that PPAD-complete problems cannot be solved in (randomized) polynomial time. Such a restriction is inherent if we assume that the game is known to the algorithm designer. To avoid complexity-theoretic assumptions, we consider a setting in which the Markov game is unknown to the algorithm designer, and algorithms must learn about the game by executing policies (“querying”) and observing the resulting sequences of states, actions, and rewards. Our final result, Theorem 1.4, shows unconditionally that, for $m$-player Markov games whose parameters are unknown, any algorithm computing a no-regret sequence as in Theorem 1.3 requires a number of queries that is exponential in $m$.

Theorem 1.4 (Informal version of Theorem 5.2).

Given query access to an $m$-player Markov game, no algorithm that makes fewer than $2^{\Omega(m)}$ queries can output a sequence of joint product policies which guarantees each agent sublinear regret.

Similar to our computational lower bounds, Theorem 1.4 goes far beyond decentralized algorithms, and rules out even centralized algorithms that compute a no-regret sequence by jointly controlling all players. The result provides another separation between Markov games and normal-form games, since standard no-regret algorithms for normal-form games can achieve sublinear regret using $\operatorname{poly}(m)$ queries for any $m$. The $2^{\Omega(m)}$ scaling in the lower bound, which does not rule out query-efficient algorithms when $m$ is constant, is to be expected for an unconditional result: if the game has only polynomially many parameters (which is the case for constant $m$), one can estimate all of the parameters using standard techniques [JKSY20], then directly find a no-regret sequence.

Proof techniques: the SparseCCE problem.

Rather than directly proving lower bounds for the problem of no-regret learning, we establish lower bounds for a simpler problem we refer to as SparseCCE. In the SparseCCE problem, the aim is to compute by any means (centralized, decentralized, or otherwise) a coarse correlated equilibrium that is “sparse” in the sense that it can be represented as the mixture of a small number of product policies. Any algorithm that computes a sequence of product policies with sublinear regret (in the sense of Theorem 1.3) immediately yields an algorithm for the SparseCCE problem: by the standard connection between no-regret and CCE, the uniform mixture of the policies in the no-regret sequence forms a sparse CCE. Thus, any lower bound for the SparseCCE problem yields a lower bound for the computation of no-regret sequences.

To provide lower bounds for the SparseCCE problem, we reduce from the problem of Nash equilibrium computation in normal-form games. We show that given any two-player normal-form game, it is possible to construct a Markov game (with two players in the case of Theorem 1.2 and three players in the case of Theorem 1.3) with the property that i) the description length is polynomial in the description length of the normal-form game, and ii) any (approximate) SparseCCE for the Markov game can be efficiently transformed into an approximate Nash equilibrium for the normal-form game. With this reduction established, our computational lower bounds follow from celebrated PPAD-hardness results for approximate Nash equilibrium computation in two-player normal-form games, and our statistical lower bounds follow from query complexity lower bounds for Nash. Proving the reduction from Nash to SparseCCE constitutes the bulk of our work, and makes novel use of aggregation techniques from online learning [Vov90, CBL06], as well as techniques from the literature on anti-folk theorems in game theory [BCI+08].

1.3 Organization

Section 2 presents preliminaries on no-regret learning and equilibrium computation in Markov games and normal-form games. Sections 3, 4 and 5 present our main results:

  • Sections 3 and 4 provide our computational lower bounds for no-regret in Markov games. Section 3 gives a lower bound for the setting in which algorithms are constrained to play Markovian policies, and Section 4 builds on this approach to give a lower bound for general, potentially non-Markovian policies.

  • Section 5 provides statistical (query complexity) lower bounds for multi-player Markov games.

Proofs are deferred to the appendix unless otherwise stated.

Notation.

For $n\in\mathbb{N}$, we write $[n]:=\{1,2,\ldots,n\}$. For a finite set $\mathcal{T}$, $\Delta(\mathcal{T})$ denotes the space of distributions on $\mathcal{T}$. For an element $t\in\mathcal{T}$, $\mathbb{I}_{t}\in\Delta(\mathcal{T})$ denotes the delta distribution that places probability mass $1$ on $t$. We adopt standard big-oh notation, and write $f=\widetilde{O}(g)$ to denote that $f=O(g\cdot\max\{1,\mathrm{polylog}(g)\})$, with $\Omega(\cdot)$ and $\widetilde{\Omega}(\cdot)$ defined analogously.

2 Preliminaries

This section contains the preliminaries necessary to present our main results. We first introduce the Markov game framework (Sections 2.1 and 2.2), then provide a brief review of normal-form games (Section 2.3), and finally introduce the concepts of coarse correlated equilibria and regret minimization (Section 2.4).

2.1 Markov games

We consider general-sum Markov games in a finite-horizon, episodic framework. For $m\in\mathbb{N}$, an $m$-player Markov game $\mathcal{G}$ consists of a tuple $\mathcal{G}=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m]},\mathbb{P},(R_{i})_{i\in[m]},\mu)$, where:

  • $\mathcal{S}$ denotes a finite state space and $H\in\mathbb{N}$ denotes a finite time horizon. We write $S:=|\mathcal{S}|$.

  • For $i\in[m]$, $\mathcal{A}_{i}$ denotes a finite action space for agent $i$. We let $\mathcal{A}:=\prod_{i=1}^{m}\mathcal{A}_{i}$ denote the joint action space and $\mathcal{A}_{-i}:=\prod_{i^{\prime}\neq i}\mathcal{A}_{i^{\prime}}$. We denote joint actions in boldface, for example, $\mathbf{a}=(a_{1},\ldots,a_{m})\in\mathcal{A}$. We write $A_{i}:=|\mathcal{A}_{i}|$ and $A:=|\mathcal{A}|$.

  • $\mathbb{P}=(\mathbb{P}_{1},\ldots,\mathbb{P}_{H})$ is the transition kernel, with each $\mathbb{P}_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ denoting the kernel for step $h\in[H]$. In particular, $\mathbb{P}_{h}(s^{\prime}|s,\mathbf{a})$ is the probability of transitioning to $s^{\prime}$ from the state $s$ at step $h$ when agents play $\mathbf{a}$.

  • For $i\in[m]$ and $h\in[H]$, $R_{i,h}:\mathcal{S}\times\mathcal{A}\rightarrow[-1/H,1/H]$ is the instantaneous reward function of agent $i$:\footnote{We assume that rewards lie in $[-1/H,1/H]$ for notational convenience, as this ensures that the cumulative reward for each episode lies in $[-1,1]$. This assumption is not important to our results.} the reward agent $i$ receives in state $s$ at step $h$ if agents play $\mathbf{a}$ is given by $R_{i,h}(s,\mathbf{a})$.\footnote{We consider Markov games in which the rewards at each step are a deterministic function of the state and action profile. While some works consider the more general case of stochastic rewards, since our main goal is to prove lower bounds, it is without loss for us to assume that rewards are deterministic.}

  • $\mu\in\Delta(\mathcal{S})$ denotes the initial state distribution.

An episode in the Markov game proceeds as follows:

  • The initial state $s_{1}$ is drawn from the initial state distribution $\mu$.

  • For each $h\leq H$, given the state $s_{h}$, each agent $i$ plays an action $a_{i,h}\in\mathcal{A}_{i}$; given the joint action profile $\mathbf{a}_{h}=(a_{1,h},\ldots,a_{m,h})$, each agent $i$ receives reward $r_{i,h}=R_{i,h}(s_{h},\mathbf{a}_{h})$ and the state of the system transitions to $s_{h+1}\sim\mathbb{P}_{h}(\cdot|s_{h},\mathbf{a}_{h})$.

We denote the tuple of agents' rewards at each step $h$ by $\mathbf{r}_{h}=(r_{1,h},\ldots,r_{m,h})$, and refer to the resulting sequence $\tau_{H}:=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1}),\ldots,(s_{H},\mathbf{a}_{H},\mathbf{r}_{H})$ as a trajectory. For $h\in[H]$, we define the prefix of the trajectory via $\tau_{h}:=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1}),\ldots,(s_{h},\mathbf{a}_{h},\mathbf{r}_{h})$.
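To make the episodic protocol above concrete, here is a minimal simulation sketch in Python. The array-based layout (a `MarkovGame` dataclass and a `rollout` helper) is an assumption made for this example, not a data structure from the paper; policies here are product Markov policies, whereas general (non-Markov) policies would additionally condition on the history $\tau_{i,h-1}$.

```python
# Illustrative sketch of one episode of an m-player finite Markov game played
# under a product Markov policy. Layout: transitions and rewards are stored as
# dense arrays indexed by step, state, and (flattened) joint action.
from dataclasses import dataclass
import numpy as np

@dataclass
class MarkovGame:
    P: np.ndarray        # transitions, shape (H, S, A_joint, S)
    R: np.ndarray        # rewards,     shape (m, H, S, A_joint), values in [-1/H, 1/H]
    mu: np.ndarray       # initial state distribution, shape (S,)
    action_sizes: tuple  # (A_1, ..., A_m); A_joint = prod(action_sizes)

def rollout(game: MarkovGame, policies, rng):
    """Play one episode; policies[i][h] is an (S, A_i) array of action probabilities."""
    m, H = len(game.action_sizes), game.P.shape[0]
    s = rng.choice(len(game.mu), p=game.mu)
    trajectory = []
    for h in range(H):
        # Each agent draws its action independently from its own Markov policy.
        actions = tuple(rng.choice(game.action_sizes[i], p=policies[i][h][s])
                        for i in range(m))
        a_joint = np.ravel_multi_index(actions, game.action_sizes)
        rewards = game.R[:, h, s, a_joint]
        trajectory.append((s, actions, rewards.copy()))
        s = rng.choice(game.P.shape[-1], p=game.P[h, s, a_joint])
    return trajectory
```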

Indexing.

We use the following notation: for a quantity $x$ (e.g., action, reward, etc.) indexed by agents, i.e., $x=(x_{1},\ldots,x_{m})$, and an agent $i\in[m]$, we write $x_{-i}=(x_{1},\ldots,x_{i-1},x_{i+1},\ldots,x_{m})$ to denote the tuple consisting of all $x_{i^{\prime}}$ for $i^{\prime}\neq i$.

2.2 Policies and value functions

We now introduce the notions of policies and value functions for Markov games. Policies are mappings from states (or sequences of states) to actions for the agents. We consider several different types of policies; the distinctions between them play a crucial role in separating the types of equilibria that are tractable to compute efficiently from those that are not.

Markov policies.

A randomized Markov policy for agent $i$ is a sequence $\sigma_{i}=(\sigma_{i,1},\ldots,\sigma_{i,H})$, where $\sigma_{i,h}:\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i})$. We denote the space of randomized Markov policies for agent $i$ by $\Pi^{\mathrm{markov}}_{i}$. We write $\Pi^{\mathrm{markov}}:=\Pi^{\mathrm{markov}}_{1}\times\cdots\times\Pi^{\mathrm{markov}}_{m}$ to denote the space of product Markov policies, which are joint policies in which each agent $i$ independently follows a policy in $\Pi^{\mathrm{markov}}_{i}$. In particular, a policy $\sigma\in\Pi^{\mathrm{markov}}$ is specified by a collection $\sigma=(\sigma_{1},\ldots,\sigma_{H})$, where $\sigma_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A}_{1})\times\cdots\times\Delta(\mathcal{A}_{m})$. We additionally define $\Pi^{\mathrm{markov}}_{-i}:=\prod_{i^{\prime}\neq i}\Pi^{\mathrm{markov}}_{i^{\prime}}$, and for a policy $\sigma\in\Pi^{\mathrm{markov}}$, write $\sigma_{-i}$ to denote the collection of mappings $\sigma_{-i}=(\sigma_{-i,1},\ldots,\sigma_{-i,H})$, where $\sigma_{-i,h}:\mathcal{S}\rightarrow\prod_{i^{\prime}\neq i}\Delta(\mathcal{A}_{i^{\prime}})$ denotes the tuple of all but player $i$'s policies.

When the Markov game $\mathcal{G}$ is clear from context, for a policy $\sigma\in\Pi^{\mathrm{markov}}$ we let $\mathbb{P}_{\sigma}[\cdot]$ denote the law of the trajectory $\tau$ when players select actions via $\mathbf{a}_{h}\sim\sigma(s_{h})$, and let $\mathbb{E}_{\sigma}[\cdot]$ denote the corresponding expectation.

General (non-Markov) policies.

In addition to Markov policies, we will consider general history-dependent (or, non-Markov) policies, which select actions based on the entire sequence of states and actions observed up to the current step. To streamline notation, for $i\in[m]$, let $\tau_{i,h}=(s_{1},a_{i,1},r_{i,1},\ldots,s_{h},a_{i,h},r_{i,h})$ denote the history of agent $i$'s states, actions, and rewards up to step $h$. Let $\mathscr{H}_{i,h}=(\mathcal{S}\times\mathcal{A}_{i}\times[0,1])^{h}$ denote the space of all possible histories of agent $i$ up to step $h$. For $i\in[m]$, a randomized general (i.e., non-Markov) policy of agent $i$ is a collection of mappings $\sigma_{i}=(\sigma_{i,1},\ldots,\sigma_{i,H})$, where $\sigma_{i,h}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i})$ is a mapping that takes the history observed by agent $i$ up to step $h-1$ and the current state and outputs a distribution over actions for agent $i$.

We denote by $\Pi^{\mathrm{gen,rnd}}_{i}$ the space of randomized general policies of agent $i$, and further write $\Pi^{\mathrm{gen,rnd}}:=\Pi^{\mathrm{gen,rnd}}_{1}\times\cdots\times\Pi^{\mathrm{gen,rnd}}_{m}$ to denote the space of product general policies; note that $\Pi^{\mathrm{markov}}_{i}\subset\Pi^{\mathrm{gen,rnd}}_{i}$ and $\Pi^{\mathrm{markov}}\subset\Pi^{\mathrm{gen,rnd}}$. In particular, a policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$ is specified by a collection $(\sigma_{i,h})_{i\in[m],h\in[H]}$, where $\sigma_{i,h}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\Delta(\mathcal{A}_{i})$. When agents play according to a general policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, at each step $h$, each agent $i$, given the current state $s_{h}$ and their history $\tau_{i,h-1}\in\mathscr{H}_{i,h-1}$, chooses an action $a_{i,h}\sim\sigma_{i,h}(\tau_{i,h-1},s_{h})$, independently from all other agents. For a policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we let $\mathbb{P}_{\sigma}[\cdot]$ and $\mathbb{E}_{\sigma}[\cdot]$ denote the law and expectation operator for the trajectory $\tau$ when players select actions via $\mathbf{a}_{h}\sim\sigma(\tau_{h-1},s_{h})$, and write $\sigma_{-i}$ to denote the collection of policies of all agents but $i$, i.e., $\sigma_{-i}=(\sigma_{j,h})_{h\in[H],j\in[m]\backslash\{i\}}$.

We will also consider distributions over product randomized general policies, namely elements of $\Delta(\Pi^{\mathrm{gen,rnd}})$.\footnote{When $\mathcal{T}$ is not a finite set, we take $\Delta(\mathcal{T})$ to be the set of Radon probability measures over $\mathcal{T}$ equipped with the Borel $\sigma$-algebra.} We will refer to elements of $\Delta(\Pi^{\mathrm{gen,rnd}})$ as distributional policies. To play according to some distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$, agents draw a randomized policy $\sigma\sim P$ (so that $\sigma\in\Pi^{\mathrm{gen,rnd}}$) and then play according to $\sigma$.

Remark 2.1 (Alternative definition for randomized general policies).

Instead of defining distributional policies as above, one might alternatively define $\Pi^{\mathrm{gen,rnd}}_{i}$ as the set of distributions over agent $i$'s deterministic general policies, namely as the set $\Delta(\Pi^{\mathrm{gen,det}}_{i})$. We show in Section D that this alternative definition is equivalent to our own: in particular, there is a mapping from $\Pi^{\mathrm{gen,rnd}}$ to $\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m})$ so that, for any Markov game, any policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$ produces identically distributed trajectories to its corresponding policy in $\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m})$. Further, this mapping is one-to-one if we identify policies that produce the same distributions over trajectories for all Markov games.

Deterministic policies.

It will be helpful to introduce notation for deterministic general (non-Markov) policies, which correspond to the special case of randomized policies where each policy $\sigma_{i,h}$ exclusively maps to singleton distributions. In particular, a deterministic general policy of agent $i$ is a collection of mappings $\pi_{i}=(\pi_{i,1},\ldots,\pi_{i,H})$, where $\pi_{i,h}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\mathcal{A}_{i}$. We denote by $\Pi^{\mathrm{gen,det}}_{i}$ the space of deterministic general policies of agent $i$, and further write $\Pi^{\mathrm{gen,det}}:=\Pi^{\mathrm{gen,det}}_{1}\times\cdots\times\Pi^{\mathrm{gen,det}}_{m}$ to denote the space of joint deterministic policies. We use the convention throughout that deterministic policies are denoted by the letter $\pi$, whereas randomized policies are denoted by $\sigma$.

Value functions.

For a general policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we define the value function for agent $i\in[m]$ as

$$V_{i}^{\sigma}:=\mathbb{E}_{\sigma}\left[\sum_{h=1}^{H}R_{i,h}(s_{h},\mathbf{a}_{h})\ \middle|\ s_{1}\sim\mu\right]; \tag{1}$$

this represents the expected reward that agent $i$ receives when each agent chooses their actions via $a_{i,h}\sim\sigma_{i,h}(\tau_{i,h-1},s_{h})$. For a distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$, we extend this notation by defining $V_{i}^{P}:=\mathbb{E}_{\sigma\sim P}[V_{i}^{\sigma}]$.
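As a sanity check on definition (1), $V_{i}^{\sigma}$ can be estimated by Monte Carlo rollouts; the sketch below (Python; `sample_episode` is a hypothetical stand-in for a simulator of the joint policy $\sigma$, not an object defined in the paper) simply averages each agent's total episode reward.

```python
# Monte Carlo estimate of V_i^sigma from (1): average agent i's total reward
# over independently sampled episodes. `sample_episode` is assumed to play one
# episode under the joint policy sigma and return the per-agent reward sums.
import numpy as np

def estimate_values(sample_episode, m: int, num_episodes: int = 10_000) -> np.ndarray:
    totals = np.zeros(m)
    for _ in range(num_episodes):
        totals += sample_episode()   # entry i is sum_h R_{i,h}(s_h, a_h) for this episode
    return totals / num_episodes     # entry i estimates V_i^sigma
```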

2.3 Normal-form games

To motivate the solution concepts we consider for Markov games, let us first revisit the notion of normal-form games, which may be interpreted as Markov games with a single state. For $m,n\in\mathbb{N}$, an $m$-player $n$-action normal-form game $G$ is specified by a tuple of $m$ reward tensors $M_{1},\ldots,M_{m}\in[0,1]^{n\times\cdots\times n}$, where each tensor is of order $m$ (i.e., has $n^{m}$ entries). We will write $G=(M_{1},\ldots,M_{m})$. We assume for simplicity that each player has the same number $n$ of actions, and identify each player's action space with $[n]$. Then an action profile is specified by $\mathbf{a}\in[n]^{m}$; if each player acts according to $\mathbf{a}$, then the reward for player $i\in[m]$ is given by $(M_{i})_{\mathbf{a}}\in[0,1]$. Our hardness results will use the standard notion of Nash equilibrium in normal-form games. We define the $m$-player $(n,\epsilon)$-Nash problem to be the problem of computing an $\epsilon$-approximate Nash equilibrium of a given $m$-player $n$-action normal-form game. (See Definition A.1 for a formal definition of $\epsilon$-Nash equilibrium.) A celebrated result is that Nash equilibria are PPAD-hard to approximate, i.e., the 2-player $(n,n^{-c})$-Nash problem is PPAD-hard for any constant $c>0$ [DGP09, CDT06]. We refer the reader to Section A.1 for further background on these concepts.
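The following sketch (Python; a helper of our own for illustration, not part of the paper's formal development) spells out the approximation notion behind the $(n,\epsilon)$-Nash problem in the 2-player case: a pair of mixed strategies is an $\epsilon$-Nash equilibrium exactly when neither player can gain more than $\epsilon$ by a unilateral deviation (see Definition A.1 for the formal statement).

```python
# Epsilon-Nash test for a 2-player normal-form game with reward matrices M1, M2
# (shape (n, n), indexed by (a1, a2)) and mixed strategies x, y (length-n arrays).
import numpy as np

def nash_gap(M1: np.ndarray, M2: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Largest gain either player can obtain by a unilateral deviation from (x, y)."""
    v1, v2 = x @ M1 @ y, x @ M2 @ y
    gain1 = (M1 @ y).max() - v1     # player 1 deviates to their best pure action
    gain2 = (M2.T @ x).max() - v2   # player 2 deviates to their best pure action
    return max(gain1, gain2)

# (x, y) is an epsilon-approximate Nash equilibrium iff nash_gap(M1, M2, x, y) <= epsilon.
```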

2.4 Markov games: Equilibria, no-regret, and independent learning

We now turn our focus back to Markov games, and introduce the main solution concepts we consider, as well as the notion of no-regret. Since computing Nash equilibria is intractable even for normal-form games, much of the work on efficient equilibrium computation has focused on alternative notions of equilibrium, notably coarse correlated equilibria and correlated equilibria. We focus on coarse correlated equilibria: since they form a superset of correlated equilibria, any lower bound for computing a coarse correlated equilibrium implies a lower bound for computing a correlated equilibrium.

For a distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$ and a randomized policy $\sigma_{i}^{\prime}\in\Pi^{\mathrm{gen,rnd}}_{i}$ of player $i$, we let $\sigma_{i}^{\prime}\times P_{-i}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ denote the distributional policy given by the distribution of $(\sigma_{i}^{\prime},\sigma_{-i})\in\Pi^{\mathrm{gen,rnd}}$ for $\sigma\sim P$ (and $\sigma_{-i}$ denotes the marginal of $\sigma$ on all players but $i$). For $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we write $\sigma_{i}^{\prime}\times\sigma_{-i}$ to denote the policy given by $(\sigma_{i}^{\prime},\sigma_{-i})\in\Pi^{\mathrm{gen,rnd}}$. Let us fix a Markov game $\mathcal{G}$, which in particular determines the players' value functions $V_{i}^{\sigma}$ as in (1).

Definition 2.2 (Coarse correlated equilibrium).

For $\epsilon>0$, a distributional policy $P\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is defined to be an $\epsilon$-coarse correlated equilibrium (CCE) if for each $i\in[m]$, it holds that

$$\max_{\sigma_{i}^{\prime}\in\Pi^{\mathrm{gen,rnd}}_{i}}V_{i}^{\sigma_{i}^{\prime}\times P_{-i}}-V_{i}^{P}\leq\epsilon.$$

The maximizing policy $\sigma_{i}^{\prime}$ can always be chosen to be deterministic, so $P$ is an $\epsilon$-CCE if and only if $\max_{\pi_{i}\in\Pi^{\mathrm{gen,det}}_{i}}V_{i}^{\pi_{i}\times P_{-i}}-V_{i}^{P}\leq\epsilon$.

Coarse correlated equilibria can be computed efficiently for both normal-form games and Markov games, and are fundamentally connected to the notion of no-regret and independent learning, which we now introduce.

Regret.

For a policy $\sigma\in\Pi^{\mathrm{gen,rnd}}$, we denote the distributional policy which puts all of its mass on $\sigma$ by $\mathbb{I}_{\sigma}\in\Delta(\Pi^{\mathrm{gen,rnd}})$. Thus $\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ denotes the distributional policy which randomizes uniformly over the $\sigma^{(t)}$. We define regret as follows.

Definition 2.3 (Regret).

Consider a sequence of policies $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$. For $i\in[m]$, the regret of agent $i$ with respect to this sequence is defined as:

$$\mathrm{Reg}_{i,T}=\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})=\max_{\sigma_{i}\in\Pi^{\mathrm{gen,rnd}}_{i}}\sum_{t=1}^{T}V_{i}^{\sigma_{i}\times\sigma_{-i}^{(t)}}-V_{i}^{\sigma^{(t)}}. \tag{2}$$

In (2), the maximum over $\sigma_{i}\in\Pi^{\mathrm{gen,rnd}}_{i}$ is always achieved by a deterministic general policy, so we have $\mathrm{Reg}_{i,T}=\max_{\pi_{i}\in\Pi^{\mathrm{gen,det}}_{i}}\sum_{t=1}^{T}\big(V_{i}^{\pi_{i}\times\sigma_{-i}^{(t)}}-V_{i}^{\sigma^{(t)}}\big)$.

The following standard result shows that the uniform average of any no-regret sequence forms an approximate coarse correlated equilibrium.

Fact 2.4 (No-regret is equivalent to CCE).

Suppose that a sequence of policies $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$ satisfies $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$ for each $i\in[m]$. Then the uniform average of these $T$ policies, namely the distributional policy $\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$, is an $\epsilon$-CCE.

Likewise, if a sequence of policies $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$ has the property that the distributional policy $\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is an $\epsilon$-CCE, then $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$ for all $i\in[m]$.

No-regret learning.

Fact 2.4 is an immediate consequence of Definitions 2.2 and 2.3. A standard approach to decentralized equilibrium computation, which exploits Fact 2.4, is to select $\sigma^{(1)},\ldots,\sigma^{(T)}\in\Pi^{\mathrm{gen,rnd}}$ using independent no-regret learning algorithms. A no-regret learning algorithm for player $i$ selects $\sigma_{i}^{(t)}\in\Pi^{\mathrm{gen,rnd}}_{i}$ based on the realized trajectories $\tau^{(1)}_{i,H},\ldots,\tau^{(t-1)}_{i,H}\in\mathscr{H}_{i,H}$ that player $i$ observes over the course of play,\footnote{An alternative model allows player $i$ to have knowledge of the previous joint policies $\sigma^{(1)},\ldots,\sigma^{(t-1)}$ when selecting $\sigma_{i}^{(t)}$.} but with no knowledge of $\sigma^{(t)}_{-i}$, so as to ensure that no-regret is achieved: $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$. If each player $i$ uses their own independent no-regret learning algorithm, this approach yields product policies $\sigma^{(t)}=\sigma_{1}^{(t)}\times\cdots\times\sigma^{(t)}_{m}$, and the uniform average of the $\sigma^{(t)}$ yields a CCE as long as all of the players can keep their regret small.\footnote{In Section 6, we discuss the implications of relaxing the stipulation that the $\sigma^{(t)}$ be product policies (for example, by allowing the use of shared randomness, as in V-learning). In short, allowing $\sigma^{(t)}$ to be non-product essentially trivializes the problem.}

For the special case of normal-form games, the no-regret learning approach has been fruitful. There are several efficient algorithms, including regret matching [HMC00], Hedge (also known as exponential weights) [Vov90, LW94, CBFH+97], and generalizations of Hedge based on the follow-the-regularized-leader (FTRL) framework [SS12], which ensure that each player's regret after $T$ episodes is bounded above by $O(\sqrt{T})$ (that is, $\epsilon=O(1/\sqrt{T})$), even when the other players' actions are chosen adversarially. All of these guarantees, which bound regret by a function sublinear in $T$, lead to efficient, decentralized computation of approximate coarse correlated equilibria in normal-form games. The success of this approach motivates our central question, which is whether similar guarantees may be established for Markov games. In particular, a formal version of Problem 1.1 asks: Is there an efficient algorithm that, when adopted by all agents in a Markov game and run independently, ensures that for all $i$, $\mathrm{Reg}_{i,T}\leq\epsilon\cdot T$ for some $\epsilon=o(1)$?
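For concreteness, here is a minimal sketch of regret matching in the full-information setting just described (Python; the interface and inputs are our own illustration, and no attempt is made to match the constants in the cited bounds).

```python
# Regret matching [HMC00] for one player facing an arbitrary sequence of reward
# vectors (full-information feedback). Sketch only; constants are illustrative.
import numpy as np

def regret_matching(reward_vectors):
    """reward_vectors: iterable of length-n arrays; entry a of r_t is the reward
    action a would have earned at round t."""
    cum_regret = None
    played = []
    for r in reward_vectors:
        r = np.asarray(r, dtype=float)
        if cum_regret is None:
            cum_regret = np.zeros_like(r)
        pos = np.maximum(cum_regret, 0.0)
        # Play proportionally to positive cumulative regrets (uniform if none are positive).
        p = pos / pos.sum() if pos.sum() > 0 else np.full(len(r), 1.0 / len(r))
        played.append(p)
        # Each action's regret grows by its reward minus the expected reward of p.
        cum_regret += r - p @ r
    return played  # the uniform average of independent copies' joint play approaches a CCE
```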

3 Lower bound for Markovian algorithms

In this section we prove Theorem 1.2 (restated formally below as Theorem 3.2), establishing that in two-player Markov games, there is no computationally efficient algorithm that computes a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ of product Markov policies so that each player has small regret under this sequence. This section serves as a warm-up for our results in Section 4, which remove the assumption that $\sigma^{(1)},\ldots,\sigma^{(T)}$ are Markovian.

3.1 SparseMarkovCCE problem and computational model

As discussed in the introduction, our lower bounds for no-regret learning are a consequence of lower bounds for the SparseCCE problem. In what follows, we formalize this problem (specifically, the Markovian variant, which we refer to as SparseMarkovCCE), as well as our computational model.

Description length for Markov games (constant mm).

Given a Markov game $\mathcal{G}$, we let $\beta(\mathcal{G})$ denote the maximum number of bits needed to describe any of the rewards $R_{i,h}(s,\mathbf{a})$ or transition probabilities $\mathbb{P}_{h}(s^{\prime}|s,\mathbf{a})$ in binary.\footnote{We emphasize that $\beta(\mathcal{G})$ is defined as the maximum number of bits required by any particular $(s,\mathbf{a})$ pair, not the total number of bits required for all $(s,\mathbf{a})$ pairs.} We define $|\mathcal{G}|:=\max\{S,\max_{i\in[m]}A_{i},H,\beta(\mathcal{G})\}$. The interpretation of $|\mathcal{G}|$ depends on the number of players $m$: if $m$ is a constant (as will be the case in the current section and Section 4), then $|\mathcal{G}|$ should be interpreted as the description length of the game $\mathcal{G}$, up to polynomial factors. In particular, for constant $m$, the game $\mathcal{G}$ can be described using $|\mathcal{G}|^{O(1)}$ bits. In Section 5, we discuss the interpretation of $|\mathcal{G}|$ when $m$ is large.

The SparseMarkovCCE problem.

From Fact 2.4, we know that the problem of computing a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ of joint product Markov policies for which each player has at most $\epsilon\cdot T$ regret is equivalent to computing a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ for which the uniform mixture forms an $\epsilon$-approximate CCE. We define $(T,\epsilon)$-SparseMarkovCCE as the computational problem of computing such a CCE directly.

Definition 3.1 (SparseMarkovCCE problem).

For an $m$-player Markov game $\mathcal{G}$ and parameters $T\in\mathbb{N}$ and $\epsilon>0$ (which may depend on the size of the game $\mathcal{G}$), $(T,\epsilon)$-SparseMarkovCCE is the problem of finding a sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$, with each $\sigma^{(t)}\in\Pi^{\mathrm{markov}}$, such that the distributional policy $\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is an $\epsilon$-CCE of $\mathcal{G}$ (or equivalently, such that for all $i\in[m]$, $\mathrm{Reg}_{i,T}(\sigma^{(1)},\ldots,\sigma^{(T)})\leq\epsilon\cdot T$).

Decentralized learning algorithms naturally lead to solutions to the SparseMarkovCCE problem. In particular, consider any decentralized protocol which runs for $T$ episodes, where at each timestep $t\in[T]$, each player $i\in[m]$ chooses a Markov policy $\sigma_{i}^{(t)}\in\Pi^{\mathrm{markov}}_{i}$ to play, without knowledge of the other players' policies $\sigma_{-i}^{(t)}$ (but possibly using the history); any strategy in which players independently run online learning algorithms falls under this protocol. If each player experiences overall regret at most $\epsilon\cdot T$, then the sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ is a solution to the $(T,\epsilon)$-SparseMarkovCCE problem. However, one might expect the $(T,\epsilon)$-SparseMarkovCCE problem to be much easier than decentralized learning, since it allows for algorithms that produce $(\sigma^{(1)},\ldots,\sigma^{(T)})$ satisfying the constraints of Definition 3.1 in a centralized manner. The main result of this section, Theorem 3.2, rules out the existence of any efficient algorithms, including centralized ones, that solve the SparseMarkovCCE problem.

Before moving on, let us give a sense of what sort of scaling one should expect for the parameters $T$ and $\epsilon$ in the $(T,\epsilon)$-SparseMarkovCCE problem. First, we note that there always exists a solution to the $(1,0)$-SparseMarkovCCE problem in a Markov game, which is given by a (Markov) Nash equilibrium of the game; of course, Nash equilibria are intractable to compute in general.\footnote{Such a Nash equilibrium can be seen to exist by using backwards induction to specify the players' joint distribution of play at each state at steps $H,H-1,\ldots,1$.} For the special case of normal-form games (where there is only a single state, and $H=1$), no-regret learning (e.g., Hedge) yields a computationally efficient solution to the $(T,\widetilde{O}(1/\sqrt{T}))$-SparseMarkovCCE problem, where the $\widetilde{O}(\cdot)$ hides a $\max_{i}\log|A_{i}|$ factor. The refined convergence guarantees of [DFG21, AFK+22] improve upon this result, and yield an efficient solution to the $(T,\widetilde{O}(1/T))$-SparseMarkovCCE problem.
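In this single-state special case, checking a candidate solution is straightforward. The sketch below (Python, 2-player case; a helper of our own for illustration) computes the largest per-player deviation gain of the uniform mixture of $T$ product strategy profiles, which is exactly the quantity that Definition 3.1 requires to be at most $\epsilon$.

```python
# In the normal-form special case (single state, H = 1), a candidate solution to
# (T, eps)-SparseMarkovCCE is a list of T product strategy profiles; this check
# computes the CCE gap of their uniform mixture for a 2-player game (M1, M2).
import numpy as np

def sparse_cce_gap(M1, M2, profiles):
    """profiles: list of (x_t, y_t) product strategy pairs; returns the larger player's gap."""
    avg_value_1 = np.mean([x @ M1 @ y for x, y in profiles])
    avg_value_2 = np.mean([x @ M2 @ y for x, y in profiles])
    # Deviations: a fixed action played against the opponent's strategy at every t.
    dev_1 = np.mean([M1 @ y for _, y in profiles], axis=0).max()
    dev_2 = np.mean([M2.T @ x for x, _ in profiles], axis=0).max()
    return max(dev_1 - avg_value_1, dev_2 - avg_value_2)

# The T profiles solve (T, eps)-SparseMarkovCCE for this game iff the gap is <= eps.
```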

3.2 Main result

Theorem 3.2.

There is a constant $C_{0}>1$ so that the following holds. Let $n\in\mathbb{N}$ be given, and let $T\in\mathbb{N}$ and $\epsilon>0$ satisfy $T<\exp(\epsilon^{2}\cdot n^{1/2}/2^{5})$. Suppose there is an algorithm that, given the description of any 2-player Markov game $\mathcal{G}$ with $|\mathcal{G}|\leq n$, solves the $(T,\epsilon)$-SparseMarkovCCE problem in time $U$, for some $U\in\mathbb{N}$. Then, for each $n\in\mathbb{N}$, the 2-player $(\lfloor n^{1/2}\rfloor,4\cdot\epsilon)$-Nash problem (Definition A.1) can be solved in time $(nTU)^{C_{0}}$.

We emphasize that the range $T<\exp(n^{O(1)})$ ruled out by Theorem 3.2 is the most natural parameter regime, since the runtime of any decentralized algorithm which runs for $T$ episodes and produces a solution to the SparseMarkovCCE problem is at least linear in $T$. Using that 2-player $(n,\epsilon)$-Nash is PPAD-complete for $\epsilon=n^{-c}$ (for any constant $c>0$) [DGP09, CDT06, Rub18], we obtain the following corollary.

Corollary 3.3 (SparseMarkovCCE is PPAD-complete).

For any constant $C>4$, if there is an algorithm which, given the description of a 2-player Markov game $\mathcal{G}$, solves the $(|\mathcal{G}|^{C},|\mathcal{G}|^{-\frac{1}{C}})$-SparseMarkovCCE problem in time $\operatorname{poly}(|\mathcal{G}|)$, then $\textsf{PPAD}=\textsf{P}$.

The condition $C>4$ in Corollary 3.3 is set to ensure that $|\mathcal{G}|^{C}<\exp(|\mathcal{G}|^{-2/C}\cdot\sqrt{|\mathcal{G}|}/2^{6})$ for sufficiently large $|\mathcal{G}|$, so as to satisfy the condition of Theorem 3.2. Corollary 3.3 rules out the existence of a polynomial-time algorithm that solves the SparseMarkovCCE problem with accuracy $\epsilon$ polynomially small and $T$ polynomially large in $|\mathcal{G}|$. Using a stronger complexity-theoretic assumption, the Exponential Time Hypothesis for PPAD [Rub16], we can obtain a stronger hardness result which rules out efficient algorithms even when 1) the accuracy $\epsilon$ is constant and 2) $T$ is quasipolynomially large.\footnote{This is a consequence of the fact that for some absolute constant $\epsilon_{0}>0$, there are no polynomial-time algorithms for computing $\epsilon_{0}$-Nash equilibria in 2-player normal-form games under the Exponential Time Hypothesis for PPAD [Rub16].}

Corollary 3.4 (ETH-hardness of SparseMarkovCCE).

There is a constant $\epsilon_{0}>0$ such that if there exists an algorithm that solves the $(|\mathcal{G}|^{o(\log|\mathcal{G}|)},\epsilon_{0})$-SparseMarkovCCE problem in $|\mathcal{G}|^{o(\log|\mathcal{G}|)}$ time, then the Exponential Time Hypothesis for PPAD fails to hold.

Proof overview.

The proof of Theorem 3.2 is based on a reduction which shows that any algorithm that efficiently solves the $(T,\epsilon)$-SparseMarkovCCE problem, for $T$ not too large, can be used to efficiently compute an approximate Nash equilibrium of any given normal-form game. In particular, fix $n_{0}\in\mathbb{N}$, and let a 2-player normal-form game $G$ with $n_{0}$ actions be given. We construct a Markov game $\mathcal{G}=\mathcal{G}(G)$ with horizon $H=n_{0}$ and action sets identical to those of the game $G$, i.e., $\mathcal{A}_{1}=\mathcal{A}_{2}=[n_{0}]$. The state space of $\mathcal{G}$ consists of $n_{0}^{2}$ states, which are indexed by joint action profiles; the transitions are defined so that the state at step $h$ encodes the action profile taken by the agents at step $h-1$.\footnote{For technical reasons, this is only the case for even values of $h$; we discuss further details in the full proof in Section B.2.} At each state of $\mathcal{G}$, the reward functions are given by the payoff matrices of $G$, scaled down by a factor of $1/H$ (which ensures that the rewards received at each step belong to $[0,1/H]$). In particular, the rewards and transitions out of a given state do not depend on the identity of the state, and so $\mathcal{G}$ can be thought of as a repeated game in which $G$ is played $H$ times. The formal definition of $\mathcal{G}$ is given in Definition B.3.

Fix any algorithm for the SparseMarkovCCE problem, and recall that for each step $h$ and state $s$ of $\mathcal{G}$, $\sigma^{(t)}_{h}(s)\in\Delta(\mathcal{A}_{1})\times\Delta(\mathcal{A}_{2})$ denotes the joint action distribution taken in $s$ at step $h$ under the sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ produced by the algorithm. The bulk of the proof of Theorem 3.2 consists of proving a key technical result, Lemma B.4, which states that if $\sigma^{(1)},\ldots,\sigma^{(T)}$ indeed solves $(T,\epsilon)$-SparseMarkovCCE, then there exists some tuple $(h,s,t)$ such that $\sigma_{h}^{(t)}(s)$ is an approximate Nash equilibrium for $G$. With this established, it follows that we can find a Nash equilibrium efficiently by simply trying all $HST$ choices for $(h,s,t)$.

To prove Lemma B.4, we reason as follows. Assume that $\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{(t)}}\in\Delta(\Pi^{\mathrm{gen,rnd}})$ is an $\epsilon$-CCE. If, toward a contradiction, none of the distributions $\{\sigma_{h}^{(t)}(s)\}_{h\in[H],s\in\mathcal{S},t\in[T]}$ is an approximate Nash equilibrium for $G$, then it must be the case that for each $t$, one of the players has a profitable deviation in $G$ with respect to the product strategy $\sigma_{h}^{(t)}(s)$, at least for a constant fraction of the tuples $(s,h)$. We will argue that if this were the case, then at least one player $i$ would have a non-Markov deviation policy violating the condition of Definition 2.2, meaning that $\overline{\sigma}$ is not in fact an $\epsilon$-CCE.

To sketch the idea, recall that to draw a trajectory from $\overline{\sigma}$, we first draw an index $t^{\star}\sim[T]$ uniformly at random, and then execute $\sigma^{(t^{\star})}$ for an episode. We will show (roughly) that for each player $i$, it is possible to compute a non-Markov deviation policy $\pi_{i}^{\dagger}$ which, under the draw of a trajectory from $\overline{\sigma}$, can “infer” the value of the index $t^{\star}$ within the first few steps of the episode. The policy $\pi_{i}^{\dagger}$ then, at each state $s$ and step $h$ after the first few steps, plays a best response to the opponent's portion of the strategy $\sigma_{h}^{(t^{\star})}(s)$. If, for each possible value of $t^{\star}$, none of the distributions $\sigma_{h}^{(t^{\star})}(s)$ is an approximate Nash equilibrium of $G$, this means that at least one of the players $i$ can significantly increase their value in $\mathcal{G}$ over that of $\overline{\sigma}$ by playing $\pi_{i}^{\dagger}$, which contradicts the assumption that $\overline{\sigma}$ is an $\epsilon$-CCE.

It remains to explain how we can construct a non-Markov policy $\pi_{i}^{\dagger}$ which “infers” the value of $t^{\star}$. Unfortunately, exactly inferring the value of $t^{\star}$ in the fashion described above is impossible: for instance, if there are $t_{1}\neq t_{2}$ so that $\sigma^{(t_{1})}=\sigma^{(t_{2})}$, then clearly it is impossible to distinguish between the cases $t^{\star}=t_{1}$ and $t^{\star}=t_{2}$. Nevertheless, by using the fact that each player observes the full joint action profile played at each step $h$, we can construct a non-Markov policy which employs Vovk's aggregating algorithm for online density estimation [Vov90, CBL06] in order to compute a distribution which is close to $\sigma_{h}^{(t^{\star})}(s)$ for most $h\in[H]$.\footnote{Vovk's aggregating algorithm is essentially the exponential weights algorithm with the logarithmic loss. Detailed background on the algorithm is provided in Section B.1.} This guarantee is stated formally in an abstract setting in Proposition B.2, and is instantiated in the proof of Theorem 3.2 in Equation (6). As we show in Section B.2, approximating $\sigma_{h}^{(t^{\star})}(s)$ as we have described is sufficient to carry out the reasoning from the previous paragraph.
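The following sketch shows the flavor of this aggregation step (Python; the interface is our own illustration, and the formal construction appears in Section B): exponential weights with the logarithmic loss maintains a posterior over the $T$ candidate indices and predicts with the corresponding mixture of the distributions $\sigma_h^{(t)}(s)$.

```python
# Vovk-style aggregation (exponential weights with the logarithmic loss) over T
# experts, where expert t predicts the joint action distribution sigma_h^{(t)}(s)
# at each step h given the realized state. Illustrative only.
import numpy as np

def aggregate_predictions(expert_dists, observed_actions):
    """
    expert_dists: array of shape (H, T, K) -- expert t's distribution over the K
                  joint actions at each step h (evaluated at the realized state).
    observed_actions: length-H sequence of observed joint-action indices.
    Returns the aggregated predictive distribution used at each step.
    """
    H, T, K = expert_dists.shape
    log_w = np.zeros(T)                      # uniform prior over the T experts
    predictions = []
    for h in range(H):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        predictions.append(w @ expert_dists[h])   # mixture prediction at step h
        # Log-loss update: reweight each expert by the probability it assigned
        # to the joint action that was actually played.
        a = observed_actions[h]
        log_w += np.log(expert_dists[h, :, a] + 1e-12)
    return predictions
```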

4 Lower bound for non-Markov algorithms

In this section, we prove Theorem 1.3 (restated formally below as Theorem 4.3), which strengthens Theorem 3.2 by allowing the sequence $\sigma^{(1)},\ldots,\sigma^{(T)}$ of product policies to be non-Markovian. This additional strength comes at the cost of our lower bound only applying to 3-player Markov games (as opposed to Theorem 3.2, which applies to 2-player games).

4.1 SparseCCE problem and computational model

To formalize the computational model for the SparseCCE problem, we must first describe how the non-Markov product policies σ(t)=(σ1(t),,σm(t))\sigma^{\scriptscriptstyle{(t)}}=(\sigma^{\scriptscriptstyle{(t)}}_{1},\ldots,\sigma^{\scriptscriptstyle{(t)}}_{m}) are represented. Recall that a non-Markov policy σi(t)Πigen,rnd\sigma_{i}^{\scriptscriptstyle{(t)}}\in{\Pi}^{\mathrm{gen,rnd}}_{i} is, by definition, a mapping from agent ii’s history and current state to a distribution over their next action. Since there are exponentially many possible histories, it is information-theoretically impossible to express an arbitrary policy in Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} with polynomially many bits. As our focus is on computing a sequence of such policies σ(t)\sigma^{\scriptscriptstyle{(t)}} in polynomial time, certainly a prerequisite is that σ(t)\sigma^{\scriptscriptstyle{(t)}} can be expressed in polynomial space. Thus, we adopt the representational assumption, stated formally in Definition 4.1, that each of the policies σi(t)Πigen,rnd\sigma_{i}^{\scriptscriptstyle{(t)}}\in{\Pi}^{\mathrm{gen,rnd}}_{i} is described by a bounded-size circuit that can compute the conditional distribution of each next action given the history. This assumption is satisfied by essentially all empirical and theoretical work concerning non-Markov policies (e.g., [LDGV+21, AVDG+22, JLWY21, SMB22]).

Definition 4.1 (Computable policy).

Given an mm-player Markov game 𝒢\mathcal{G} and NN\in\mathbb{N}, we say that a policy σiΠigen,rnd\sigma_{i}\in{\Pi}^{\mathrm{gen,rnd}}_{i} is NN-computable if for each h[H]h\in[H], there is a circuit of size NN that,151515For concreteness, we suppose that “circuit” means “boolean circuit” as in [AB06, Definition 6.1], where probabilities are represented in binary. The precise model of computation we use does not matter, though, and we could equally assume that the policies σi\sigma_{i} may be computed by Turing machines that terminate after NN steps. on input (τi,h1,s)i,h1×𝒮(\tau_{i,h-1},s)\in\mathscr{H}_{i,h-1}\times\mathcal{S}, outputs the distribution σi(τi,h1,s)Δ(𝒜i)\sigma_{i}(\tau_{i,h-1},s)\in\Delta(\mathcal{A}_{i}). A policy σ=(σ1,,σm)Πgen,rnd\sigma=(\sigma_{1},\ldots,\sigma_{m})\in{\Pi}^{\mathrm{gen,rnd}} is NN-computable if each constituent policy σi\sigma_{i} is.

Our lower bound applies to algorithms that produce sequences σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} for which each σ(t)\sigma^{\scriptscriptstyle{(t)}} is NN-computable, where the value NN is taken to be polynomial in the description length of the game 𝒢\mathcal{G}. For example, Markov policies whose probabilities can be expressed with β\beta bits are O(HSAiβ)O(HSA_{i}\beta)-computable for each player ii, since one can simply store each of the probabilities σi,h(sh,ai,h)\sigma_{i,h}(s_{h},a_{i,h}) (for h[H]h\in[H], i[m]i\in[m], ai,h𝒜ia_{i,h}\in\mathcal{A}_{i}, sh𝒮s_{h}\in\mathcal{S}), each of which takes β\beta bits to represent.
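To make the representational discussion concrete, the following is a minimal sketch (in Python; the class and variable names are our own, not part of the formal model) of a tabular Markov policy viewed as a bounded-size lookup "circuit", matching the O(HSA_iβ)-computability bound above. An N-computable non-Markov policy would additionally take the history τ_{i,h-1} as input rather than ignoring it.

```python
import numpy as np

class TabularMarkovPolicy:
    """A Markov policy for player i stored as an explicit table.

    The table has H * S * A_i entries, each a probability (which, if written
    with beta bits each, gives the O(H * S * A_i * beta) bound from the text):
    the "circuit" simply looks up the stored row for (h, s).
    """

    def __init__(self, H, S, A_i, rng=None):
        rng = rng or np.random.default_rng(0)
        # probs[h, s] is player i's action distribution at step h and state s.
        self.probs = rng.dirichlet(np.ones(A_i), size=(H, S))

    def action_distribution(self, history, h, s):
        # A Markov policy ignores the history; a general (non-Markov) policy
        # in Pi_i^{gen,rnd} would condition on it as well.
        return self.probs[h, s]

policy = TabularMarkovPolicy(H=3, S=4, A_i=2)
print(policy.action_distribution(history=(), h=0, s=2))
```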

The SparseCCE problem.

SparseCCE is the problem of computing a sequence of non-Markov product policies σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} such that the uniform mixture forms an ϵ\epsilon-approximate CCE. The problem generalizes SparseMarkovCCE (Definition 3.1) by relaxing the condition that the policies σ(t)\sigma^{\scriptscriptstyle{(t)}} be Markov.

Definition 4.2 (SparseCCE Problem).

For an mm-player Markov game 𝒢\mathcal{G} and parameters T,NT,N\in\mathbb{N} and ϵ>0\epsilon>0 (which may depend on the size of the game 𝒢\mathcal{G}), (T,ϵ,N)(T,\epsilon,N)-SparseCCE is the problem of finding a sequence σ(1),,σ(T)Πgen,rnd\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in{\Pi}^{\mathrm{gen,rnd}}, with each σ(t)\sigma^{\scriptscriptstyle{(t)}} being NN-computable, such that the distributional policy σ¯=1Tt=1T𝕀σ(t)Δ(Πgen,rnd)\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}}\in\Delta({\Pi}^{\mathrm{gen,rnd}}) is an ϵ\epsilon-CCE for 𝒢\mathcal{G} (equivalently, such that for all i[m]i\in[m], Regi,T(σ(1),,σ(T))ϵT\mathrm{Reg}_{i,T}(\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}})\leq\epsilon\cdot T).

4.2 Main result

Our main theorem for this section, Theorem 4.3, shows that for appropriate values of TT, ϵ\epsilon, and NN, solving the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem is at least as hard as computing Nash equilibria in normal-form games.

Theorem 4.3.

Fix nn\in\mathbb{N}, and let T,NT,N\in\mathbb{N}, and ϵ>0\epsilon>0 satisfy 1<T<exp(ϵ2n16)1<T<\exp\left(\frac{\epsilon^{2}\cdot n}{16}\right). Suppose there exists an algorithm that, given the description of any 33-player Markov game 𝒢\mathcal{G} with |𝒢|n|\mathcal{G}|\leq n, solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem in time UU, for some UU\in\mathbb{N}. Then, for any δ>0\delta>0, the 22-player (n/2,50ϵ)(\lfloor n/2\rfloor,50\epsilon)-Nash problem can be solved in randomized time (nTNUlog(1/δ)/ϵ)C0(nTNU\log(1/\delta)/\epsilon)^{C_{0}} with failure probability δ\delta, where C0>0C_{0}>0 is an absolute constant.

By analogy to Corollary 3.3, we obtain the following immediate consequence.

Corollary 4.4 (SparseCCE is hard under PPADRP\textsf{PPAD}\nsubseteq\textsf{RP}).

For any constant C>4C>4, if there is an algorithm which, given the description of a 3-player Markov game 𝒢\mathcal{G}, solves the (|𝒢|C,|𝒢|1C,|𝒢|C)(|\mathcal{G}|^{C},|\mathcal{G}|^{-\frac{1}{C}},|\mathcal{G}|^{C})-SparseCCE problem in time poly(|𝒢|)\operatorname{poly}(|\mathcal{G}|), then PPADRP\textsf{PPAD}\subseteq\textsf{RP}.

Proof overview for Theorem 4.3.

The proof of Theorem 4.3 has a similar high-level structure to that of Theorem 3.2: given an mm-player normal-form game GG, we define an (m+1)(m+1)-player Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) which has n0:=n/mn_{0}:=\lfloor n/m\rfloor actions per player and horizon Hn0H\approx n_{0}. The key difference in the proof of Theorem 4.3 is the structure of the players’ reward functions. To motivate this difference and the addition of an (m+1)(m+1)-th player, let us consider what goes wrong in the proof of Theorem 3.2 when the policies σ(t)\sigma^{\scriptscriptstyle{(t)}} are allowed to be non-Markov. We will explain how a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} can hypothetically solve the SparseCCE problem by attempting to punish any one player’s deviation policy, and thus avoid having to compute a Nash equilibrium of GG. In particular, for each player jj, suppose σj(t)\sigma_{j}^{\scriptscriptstyle{(t)}} tries to detect, based on the state transitions and player jj’s rewards, whether every other player iji\neq j is playing according to σi(t)\sigma_{i}^{\scriptscriptstyle{(t)}}. If some player ii is not playing according to σi(t)\sigma_{i}^{\scriptscriptstyle{(t)}} at some step hh, then at steps h>hh^{\prime}>h, the policy σj(t)\sigma_{j}^{\scriptscriptstyle{(t)}} can select actions that attempt to minimize player ii’s rewards. For example, if player ii plays according to the policy πi\pi_{i}^{\dagger} that we described in Section 3.2, then other players jij\neq i can adjust their choice of actions in later rounds to decrease player ii’s value.

This behavior is reminiscent of “tit-for-tat” strategies which are used to establish the folk theorem in the theory of repeated games [MF86, FLM94]. The folk theorem describes how Nash equilibria are more numerous (and potentially easier to find) in repeated games than in single-shot normal-form games. As it turns out, the folk theorem does not provably yield worst-case computational speedups in repeated games, at least when the number of players is at least 3. Indeed, [BCI+08] gave an “anti-folk theorem”, showing that computing Nash equilibria in (m+1)(m+1)-player repeated games is PPAD-hard for m2m\geq 2, via a reduction from mm-player normal-form games. We utilize their reduction, for which the key idea is as follows: given an mm-player normal-form game GG, we construct an (m+1)(m+1)-player Markov game 𝒢(G)\mathcal{G}(G) in which the (m+1)(m+1)-th player acts as a kibitzer,161616Kibitzer is a Yiddish term for an observer who offers advice. with actions indexed by tuples (j,aj)(j,a_{j}), for j[m]j\in[m] and aj𝒜ja_{j}\in\mathcal{A}_{j}. The kibitzer’s action (j,aj)(j,a_{j}) represents 1) a player jj to give advice to, and 2) their advice to the player, which is to take action aja_{j}. In particular, if the kibitzer plays (j,aj)(j,a_{j}), it receives reward equal to the amount that player jj would obtain by deviating to aja_{j}, and player jj receives the negation of the kibitzer’s reward. Furthermore, all other players receive 0 reward.
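To illustrate the reward structure, the following sketch computes the stage rewards in the induced game. We interpret "the amount player j would obtain by deviating" as the deviation gain; this reading, the zero-indexing of players, and all function and variable names are our own choices for illustration rather than the paper's formal construction.

```python
import numpy as np

def kibitzer_stage_rewards(payoffs, a, kibitzer_action):
    """Stage rewards in the (m+1)-player game induced by an m-player
    normal-form game.

    payoffs: list of m payoff tensors; payoffs[j][a] is player j's payoff
             under the joint action profile a (a tuple of length m).
    a:       joint action profile of the first m players.
    kibitzer_action: a pair (j, a_j_prime), i.e. "advise player j to play a_j_prime".
    """
    m = len(payoffs)
    j, a_j_prime = kibitzer_action
    a_deviated = a[:j] + (a_j_prime,) + a[j + 1:]
    # The kibitzer is paid the gain player j would obtain by deviating.
    r_kibitzer = payoffs[j][a_deviated] - payoffs[j][a]
    rewards = [0.0] * (m + 1)
    rewards[j] = -r_kibitzer       # player j pays the kibitzer (zero-sum between the two)
    rewards[m] = r_kibitzer        # the kibitzer is player m+1 (index m); others stay at 0
    return rewards

# Example: 2-player matching pennies; the kibitzer advises player 0 to switch and match.
M0 = np.array([[1.0, 0.0], [0.0, 1.0]])
M1 = 1.0 - M0
print(kibitzer_stage_rewards([M0, M1], a=(0, 1), kibitzer_action=(0, 1)))  # [-1.0, 0.0, 1.0]
```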

To see why the addition of the kibitzer is useful, suppose that σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} solves the SparseCCE problem, so that σ¯:=1Tt=1T𝕀σ(t)\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an ϵ\epsilon-CCE. We will show that, with at least constant probability over a trajectory drawn from σ¯\overline{\sigma} (which involves drawing t[T]t^{\star}\sim[T] uniformly), the joint strategy profile played by the first mm players constitutes an approximate Nash equilibrium of GG. Suppose for the purpose of contradiction that this were not the case. We show that there exists a non-Markov deviation policy πm+1\pi_{m+1}^{\dagger} for the kibitzer which, similar to the proof of Theorem 3.2, learns the value of tt^{\star} and plays a tuple (j,aj)(j,a_{j}) such that action aja_{j} increases player jj’s payoff in GG, thereby increasing its own payoff. Even if the other players attempt to punish the kibitzer for this deviation, they will not be able to since, roughly speaking, the kibitzer game as constructed above has the property that for any strategy for the first mm players, the kibitzer can always achieve reward at least 0.

The above argument shows that under the joint policy σ¯(m+1)×πm+1\overline{\sigma}_{-(m+1)}\times\pi_{m+1}^{\dagger} (namely, the first mm players play according to σ¯\overline{\sigma} and the kibitzer plays according to πm+1\pi_{m+1}^{\dagger}), with constant probability over a trajectory drawn from this policy, the distribution of the first mm players’ actions is an approximate Nash equilibrium of GG. Thus, in order to efficiently find such a Nash equilibrium (see Algorithm 2), we need to simulate the policy σ¯(m+1)×πm+1\overline{\sigma}_{-(m+1)}\times\pi_{m+1}^{\dagger}, which involves running Vovk’s aggregating algorithm. This approach is in contrast to the proof of Theorem 3.2, for which Vovk’s aggregating algorithm was an ingredient in the proof but was not actually used in the Nash computation algorithm (Algorithm 1). The details of the proof of correctness of Algorithm 2 are somewhat delicate, and may be found in Appendix C.

Two-player games.

One intriguing question we leave open is whether the SparseCCE problem remains hard for two-player Markov games. Interestingly, as shown by [LS05], there is a polynomial-time algorithm to find an exact Nash equilibrium for the special case of repeated two-player normal-form games. Though their result applies only in the infinite-horizon setting, it can be extended to the finite-horizon setting, which rules out naive approaches to extending the proof of Theorem 4.3 and Corollary 4.4 to two players.

5 Multi-player games: Statistical lower bounds

In this section we present Theorem 1.4 (restated formally below as Theorem 5.2), which gives a statistical lower bound for the SparseCCE problem. The lower bound applies to any algorithm, regardless of computational cost, that accesses the underlying Markov game through a generative model.

Definition 5.1 (Generative model).

For an mm-player Markov game 𝒢=(𝒮,H,(𝒜i)i[m],,(Ri)i[m],μ)\mathcal{G}=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m]},\mathbb{P},(R_{i})_{i\in[m]},\mu), a generative model oracle is defined as follows: given a query described by a tuple (h,s,𝐚)[H]×𝒮×𝒜(h,s,\mathbf{a})\in[H]\times\mathcal{S}\times\mathcal{A}, the oracle returns the distribution h(|s,𝐚)Δ(𝒮)\mathbb{P}_{h}(\cdot|s,\mathbf{a})\in\Delta(\mathcal{S}) and the tuple of rewards (Ri,h(s,𝐚))i[m](R_{i,h}(s,\mathbf{a}))_{i\in[m]}.

From the perspective of lower bounds, the assumption that the algorithm has access to a generative model is quite reasonable, as it encompasses most standard access models in RL, including the online access model, in which the algorithm repeatedly queries a policy and observes a trajectory drawn from it, as well as the local access generative model used in [YHAY+22, WAJ+21]. We remark that it is slightly more standard to assume that queries to the generative model only return a sample from the distribution h(|s,𝐚)\mathbb{P}_{h}(\cdot|s,\mathbf{a}) as opposed to the distribution itself [Kak03, KMN99], but since our goal is to prove lower bounds, the notion in Definition 5.1 only makes our results stronger.
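As a concrete picture of the access model in Definition 5.1, here is a minimal sketch of a generative model oracle (the class and field names are our own); note that a query returns the full transition distribution and the tuple of rewards, not a sample.

```python
class GenerativeModelOracle:
    """Query interface from Definition 5.1 for an m-player Markov game.

    P[h][s][a] : dict mapping next states to probabilities, i.e. the full
                 distribution P_h(. | s, a) rather than a sampled next state.
    R[h][s][a] : tuple of the m players' rewards (R_{i,h}(s, a))_{i in [m]}.
    A query is a tuple (h, s, a), where a is the joint action profile.
    """

    def __init__(self, P, R):
        self.P = P
        self.R = R
        self.num_queries = 0  # Theorem 5.2 lower-bounds this count.

    def query(self, h, s, a):
        self.num_queries += 1
        return self.P[h][s][a], self.R[h][s][a]

# A toy one-step, one-state, two-player game: joint action (0, 0) stays in state 0.
P = {0: {0: {(0, 0): {0: 1.0}}}}
R = {0: {0: {(0, 0): (1.0, 0.0)}}}
oracle = GenerativeModelOracle(P, R)
print(oracle.query(0, 0, (0, 0)))
```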

To state our main result, we recall the definition |𝒢|=max{S,maxi[m]Ai,H,β(𝒢)}|\mathcal{G}|=\max\{S,\max_{i\in[m]}A_{i},H,\beta(\mathcal{G})\}. In the present section, we consider the setting where the number of players mm is large. Here, |𝒢||\mathcal{G}| does not necessarily correspond to the description length for 𝒢\mathcal{G}, and should be interpreted, roughly speaking, as a measure of the description complexity of 𝒢\mathcal{G} with respect to decentralized learning algorithms. In particular, from the perspective of an individual agent implementing a decentralized learning algorithm, its sample complexity should depend only on the size of its individual action set (as well as the global parameters S,H,β(𝒢)S,H,\beta(\mathcal{G})), as opposed to the size of the joint action set, which grows exponentially in mm; the former is captured by |𝒢|\lvert\mathcal{G}\rvert, while the latter is not. Indeed, a key advantage shared by much prior work on decentralized RL [JLWY21, SMB22, MB21, DGZ22] is the avoidance of the curse of multi-agents, which describes the situation where an algorithm has sample and computational costs that scale exponentially in mm.

Our main result for this section, Theorem 5.2, states that for mm-player Markov games, exponentially many generative model queries (in mm) are necessary to produce a solution to the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem, unless TT itself is exponential in mm.

Theorem 5.2.

Let m2m\geq{}2 be given. There are constants c,ϵ>0c,\epsilon>0 so that the following holds. Suppose there is an algorithm \mathscr{B} which, given access to a generative model for an (m+1)(m+1)-player Markov game 𝒢\mathcal{G} with |𝒢|2m6|\mathcal{G}|\leq 2m^{6}, solves the (T,ϵ/(10m),N)(T,\epsilon/(10m),N)-SparseCCE problem for 𝒢\mathcal{G} for some TT satisfying 1<T<exp(cm)1<T<\exp(cm), and any NN\in\mathbb{N}. Then \mathscr{B} must make at least 2Ω(m)2^{\Omega(m)} queries to the generative model.

Theorem 5.2 establishes that there are mm-player Markov games, where the number of states, actions per player, and horizon are bounded by poly(m)\operatorname{poly}(m), but any algorithm with regret o(T/m)o(T/m) must make 2Ω(m)2^{\Omega(m)} queries (via Fact 2.4). In particular, if there are poly(m)\operatorname{poly}(m) queries per episode, as is standard in the online simulator model where a trajectory is drawn from the policy σ(t)\sigma^{\scriptscriptstyle{(t)}} at each episode t[T]t\in[T], then T>2Ω(m)T>2^{\Omega(m)} episodes are required to have regret o(T/m)o(T/m). This is in stark contrast to the setting of normal-form games, where even for the case of bandit feedback (which is a special case of the generative model setting), standard no-regret algorithms have the property that each player’s regret scales as O~(Tn)\widetilde{O}(\sqrt{Tn}) (i.e., independently of mm), where nn denotes the number of actions per player [LS20]. As with our computational lower bounds, Theorem 5.2 is not limited to decentralized algorithms, and also rules out centralized algorithms which, with access to a generative model, compute a sequence of policies which constitutes a solution to the SparseCCE problem. Furthermore, it holds for arbitrary values of NN, thus allowing the policies σ(1),,σ(T)Πgen,rnd\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in{\Pi}^{\mathrm{gen,rnd}} solving the SparseCCE problem to be arbitrary general policies.

6 Discussion and interpretation

Theorems 3.2, 4.3, and 5.2 present barriers—both computational and statistical—toward developing efficient decentralized no-regret guarantees for multi-agent reinforcement learning. We emphasize that no-regret algorithms are the only known approach for obtaining fully decentralized learning algorithms (i.e., those which do not rely even on shared randomness) in normal-form games, and it seems unlikely that a substantially different approach would work in Markov games. Thus, these lower bounds for finding subexponential-length sequences of policies with the no-regret property represent a significant obstacle for fully decentralized multi-agent reinforcement learning. Moreover, these results rule out even the prospect of developing efficient centralized algorithms that produce no-regret sequences of policies, i.e., those which “resemble” independent learning. In this section, we compare our lower bounds with recent upper bounds for decentralized learning in Markov games, and explain how to reconcile these results.

6.1 Comparison to V-learning

The V-learning algorithm [JLWY21, SMB22, MB21] is a polynomial-time decentralized learning algorithm that proceeds in two phases. In the first phase, the mm agents interact over the course of KK episodes in a decentralized fashion, playing product Markov policies σ(1),,σ(K)Πmarkov\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(K)}}\in\Pi^{\mathrm{markov}}. In the second phase, the agents use data gathered during the first phase to produce a distributional policy σ^Δ(Πgen,rnd)\widehat{\sigma}\in\Delta({\Pi}^{\mathrm{gen,rnd}}), which we refer to as the output policy of V-learning. As discussed in Section 1, one implication of Theorem 3.2 is that the first phase of V-learning cannot guarantee each agent sublinear regret. Indeed if KK is of polynomial size (and PPADP\textsf{PPAD}\neq\textsf{P}), this follows because a bound of the form Regi,K(σ(1),,σ(K))ϵK\mathrm{Reg}_{i,K}(\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(K)}})\leq\epsilon K for all ii implies that (σ(1),,σ(K))(\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(K)}}) solves the (K,ϵ)(K,\epsilon)-SparseMarkovCCE problem.

The output policy σ^Δ(Πgen,rnd)\widehat{\sigma}\in\Delta({\Pi}^{\mathrm{gen,rnd}}) produced by V-learning is an approximate CCE (per Definition 2.2), and it is natural to ask how many product policies it takes to represent σ^\widehat{\sigma} as a uniform mixture (that is, whether σ^\widehat{\sigma} solves the (T,ϵ)(T,\epsilon)-SparseMarkovCCE problem for a reasonable value of TT). First, recall that V-learning requires K=poly(H,S,maxiAi)/ϵ2K=\operatorname{poly}(H,S,\max_{i}A_{i})/\epsilon^{2} episodes to ensure that σ^\widehat{\sigma} is an ϵ\epsilon-CCE. It is straightforward to show that σ^\widehat{\sigma} can be expressed as a non-uniform mixture of at most KKHS+1K^{KHS+1} policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}} (we prove this fact in detail below). By discretizing the non-uniform mixture, one can equivalently represent it as a uniform mixture of O(1/ϵ)KKHS+1O(1/\epsilon)\cdot K^{KHS+1} product policies, up to ϵ\epsilon error. Recalling the value of KK, we conclude that we can express σ^\widehat{\sigma} as a uniform mixture of T=exp(O~(1/ϵ2)poly(H,S,maxiAi))T=\exp(\widetilde{O}(1/\epsilon^{2})\cdot\operatorname{poly}(H,S,\max_{i}A_{i})) product policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}}. Note that the lower bound of Theorem 4.3 rules out the efficient computation of an ϵ\epsilon-CCE represented as a uniform mixture of Texp(ϵ2max{H,S,maxiAi})T\ll\exp(\epsilon^{2}\cdot\max\{H,S,\max_{i}A_{i}\}) efficiently computable policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}}. Thus, in the regime where 1/ϵ1/\epsilon is polynomial in H,S,maxiAiH,S,\max_{i}A_{i}, this upper bound on the sparsity of the policy σ^\widehat{\sigma} produced by V-learning matches that from Theorem 4.3, up to a polynomial in the exponent.
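For completeness, the discretization step can be carried out by a standard rounding argument, sketched here under the simplifying assumption that $1/\delta$ is an integer: writing $\widehat{\sigma}=\sum_{j=1}^{M}w_{j}\,\sigma_{j}$ with $M=K^{KHS+1}$, set $\delta:=\epsilon/M$ and form a uniform mixture over $1/\delta$ slots, assigning $\lfloor w_{j}/\delta\rfloor$ slots to $\sigma_{j}$ and any leftover slots to $\sigma_{1}$. The total probability mass moved in this way is at most $M\delta=\epsilon$, so the rounded mixture is within total variation distance $\epsilon$ of $\widehat{\sigma}$, and it uses $1/\delta=O(1/\epsilon)\cdot K^{KHS+1}$ product policies, as claimed.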

The sparsity of the output policy from V-learning.

We now sketch a proof of the fact that the output policy σ^\widehat{\sigma} produced by V-learning can be expressed as a (non-uniform) average of KKHS+1K^{KHS+1} policies in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}}, where KK is the number of episodes in the algorithm’s initial phase. We adopt the notation and terminology from [JLWY21].

Consider Algorithm 3 of [JLWY21], which describes the second phase of V-learning and produces the output policy σ^\widehat{\sigma}. We describe how to write σ^\widehat{\sigma} as a weighted average of a collection of product policies, each of which is indexed by a function ϕ:[H]×𝒮×[K][K]\phi:[H]\times\mathcal{S}\times[K]\rightarrow[K] and a parameter k0[K]k_{0}\in[K]: in particular, we will write σ^=k0,ϕwk0,ϕσk0,ϕΔ(Πgen,rnd)\widehat{\sigma}=\sum_{k_{0},\phi}w_{k_{0},\phi}\cdot\sigma_{k_{0},\phi}\in\Delta({\Pi}^{\mathrm{gen,rnd}}), where wk0,ϕ[0,1]w_{k_{0},\phi}\in[0,1] are mixing weights summing to 1 and σk0,ϕΠgen,rnd\sigma_{k_{0},\phi}\in{\Pi}^{\mathrm{gen,rnd}}. The number of tuples (k0,ϕ)(k_{0},\phi) is K1+KHSK^{1+KHS}.

We define the mixing weight wk0,ϕw_{k_{0},\phi} allocated to any tuple (k0,ϕ)(k_{0},\phi) to be:

wk0,ϕ:=1K(h,s,k)[H]×𝒮×[K]𝟙{ϕ(h,s,k)[Nhk(s)]}αNhk(s)ϕ(h,s,k),\displaystyle w_{k_{0},\phi}:=\frac{1}{K}\cdot\prod_{(h,s,k)\in[H]\times\mathcal{S}\times[K]}\mathbbm{1}\{{\phi(h,s,k)\in[N_{h}^{k}(s)]}\}\cdot\alpha_{N_{h}^{k}(s)}^{\phi(h,s,k)},

where Nhk(s)[K]N_{h}^{k}(s)\in[K] and αNhk(s)i[0,1]\alpha_{N_{h}^{k}(s)}^{i}\in[0,1] (for i[Nhk(s)]i\in[N_{h}^{k}(s)]) are defined as in [JLWY21].

Next, for each k0,ϕk_{0},\phi, we define σk0,ϕΠgen,rnd\sigma_{k_{0},\phi}\in{\Pi}^{\mathrm{gen,rnd}} to be the following policy: it maintains a parameter k[K]k\in[K] over the first hHh\leq{}H steps of the episode (as in Algorithm 3 of [JLWY21]), but upon reaching state ss at step hh, given the present value of k[K]k\in[K], sets i:=ϕ(h,s,k)i:=\phi(h,s,k), and updates kkhi(s)k\leftarrow k_{h}^{i}(s), and then samples an action 𝐚πhk(|s)\mathbf{a}\sim\pi_{h}^{k}(\cdot|s) (where khi(s),πhk(|s)k_{h}^{i}(s),\pi_{h}^{k}(\cdot|s) are defined in [JLWY21]). Since the mixing weights wk0,ϕw_{k_{0},\phi} defined above exactly simulate the random draws of the parameter kk in Line 1 and the parameters ii in Line 4 of [JLWY21, Algorithm 3], it follows that the distributional policy σ^\widehat{\sigma} defined by [JLWY21, Algorithm 3] is equal to k0,ϕwk0,ϕσk0,ϕΔ(Πgen,rnd)\sum_{k_{0},\phi}w_{k_{0},\phi}\cdot\sigma_{k_{0},\phi}\in\Delta({\Pi}^{\mathrm{gen,rnd}}).

6.2 No-regret learning against Markov deviations

As discussed in Section 1, [ELS+22] showed the existence of a learning algorithm with the property that if each agent plays it independently for TT episodes, then no player can achieve regret more than O(poly(m,H,S,maxiAi)T3/4)O(\operatorname{poly}(m,H,S,\max_{i}A_{i})\cdot T^{3/4}) by deviating to any fixed Markov policy. This notion of regret corresponds to, in the context of Definition 2.3, replacing maxσiΠigen,rnd\max_{\sigma_{i}\in{\Pi}^{\mathrm{gen,rnd}}_{i}} with the smaller quantity maxσiΠimarkov\max_{\sigma_{i}\in\Pi^{\mathrm{markov}}_{i}}. Thus, the result of [ELS+22] applies to a weaker notion of regret than that of the SparseCCE problem, and so does not contradict any of our lower bounds. One may wonder which of these two notions of regret (namely, best possible gain via deviation to a Markov versus non-Markov policy) is the “right” one. We do not believe that there is a definitive answer to this question, but we remark that in many empirical applications of multi-agent reinforcement learning it is standard to consider non-Markov policies [LDGV+21, AVDG+22]. Furthermore, as shown in the proposition below, there are extremely simple games, e.g., of constant size, in which Markov deviations lead to “vacuous” behavior: in particular, all Markov policies have the same (suboptimal) value but the best non-Markov policy has much greater value:

Proposition 6.1.

There is a 2-player, 2-action, 1-state Markov game with horizon 22 and a non-Markov policy σ2Π2gen,rnd\sigma_{2}\in{\Pi}^{\mathrm{gen,rnd}}_{2} for player 2 so that for all σ1Π1markov\sigma_{1}\in\Pi^{\mathrm{markov}}_{1}, V1σ1×σ2=1/2V_{1}^{\sigma_{1}\times\sigma_{2}}=1/2 yet maxσ1Π1gen,rnd{V1σ1×σ2}=3/4\max_{\sigma_{1}\in{\Pi}^{\mathrm{gen,rnd}}_{1}}\left\{V_{1}^{\sigma_{1}\times\sigma_{2}}\right\}=3/4.

The proof of Proposition 6.1 is provided in Section 6.5 below.

Other recent work has also proved no-regret guarantees with respect to deviations to restricted policy classes. In particular, [ZLY22] studies a setting in which each agent ii is allowed to play policies in an arbitrary restricted policy class ΠiΠigen,rnd\Pi_{i}^{\prime}\subseteq{\Pi}^{\mathrm{gen,rnd}}_{i} in each episode, and regret is measured with respect to deviations to any policy in Πi\Pi_{i}^{\prime}. [ZLY22] introduces an algorithm, DORIS, with the property that when all agents play it independently, each agent ii experiences regret O(poly(m,A,S,H)Ti=1mlog|Πi|)O\left(\operatorname{poly}(m,A,S,H)\cdot\sqrt{T\sum_{i=1}^{m}\log|\Pi_{i}^{\prime}|}\right) to their respective class Πi\Pi^{\prime}_{i}.171717Note that in the tabular setting, the sample complexity of DORIS (Corollary 1) scales with the size AA of the joint action set, since each player’s value function class consists of the class of all functions f:𝒮×𝒜[0,1]f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1], which has Eluder dimension scaling with SAS\cdot A, i.e., exponential in mm.

DORIS is not computationally efficient, since it involves performing exponential weights over the class Πi\Pi_{i}^{\prime}, which requires space complexity |Πi|\lvert\Pi^{\prime}_{i}\rvert. Nonetheless, one can compare the statistical guarantees the algorithm provides to our own results. Let Πimarkov,detΠimarkov\Pi^{\mathrm{markov,det}}_{i}\subset\Pi^{\mathrm{markov}}_{i} denote the set of deterministic Markov policies of agent ii, namely sequences πi=(πi,1,,πi,H)\pi_{i}=(\pi_{i,1},\ldots,\pi_{i,H}) so that πi,h:𝒮𝒜i\pi_{i,h}:\mathcal{S}\rightarrow\mathcal{A}_{i}. In the case that Πi=Πimarkov,det\Pi_{i}^{\prime}=\Pi^{\mathrm{markov,det}}_{i}, we have log|Πi|=O(SHlogAi)\log|\Pi_{i}^{\prime}|=O(SH\log A_{i}), which means that DORIS obtains no-regret against Markov deviations when mm is constant, comparable to [ELS+22].181818[ELS+22] has the added bonus of computational efficiency, even for polynomially large mm, though it has the significant drawback of assuming that the Markov game is known. However, we are interested in the setting in which each player’s regret is measured with respect to all deviations in Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} (equivalently, Πigen,det\Pi^{\mathrm{gen,det}}_{i}). Accordingly, if we take Πi=Πigen,detΠigen,rnd\Pi_{i}^{\prime}=\Pi^{\mathrm{gen,det}}_{i}\subset{\Pi}^{\mathrm{gen,rnd}}_{i},191919DORIS plays distributions over policies in Πi=Πigen,det\Pi_{i}^{\prime}=\Pi^{\mathrm{gen,det}}_{i} at each episode, whereas in our lower bounds we consider the setting where a policy in Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} is played each episode; Facts D.2 and D.3 show that these two settings are essentially equivalent, in that any policy in Π1gen,rnd××Πmgen,rnd{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m} can be simulated by one in Δ(Π1gen,det)××Δ(Πmgen,det)\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m}), and vice versa. then log|Πi|>(SAi)H1\log|\Pi_{i}^{\prime}|>(SA_{i})^{H-1}, meaning that DORIS does not imply any sort of sample-efficient guarantee, even for m=2m=2.

Finally, we remark that the algorithm DORIS [ZLY22], as well as the similar algorithm OPMD from earlier work of [LWJ22], obtains the same regret bound stated above even when the opponents are controlled by (possibly adaptive) adversaries. However, this guarantee crucially relies on the fact that any agent implementing DORIS must observe the policies played by opponents following each episode; this feature is the reason that the regret bound of DORIS does not contradict the exponential lower bound of [LWJ22] for no-regret learning against an adversarial opponent. As a result of being restricted to this “revealed-policy” setting, DORIS is not a fully decentralized algorithm in the sense we consider in this paper.

6.3 On the role of shared randomness

A key assumption in our lower bounds for no-regret learning is that each of the joint policies σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} produced by the algorithm is a product policy; such an assumption is natural, since it subsumes independent learning protocols in which each agent ii selects σi(t)\sigma_{i}^{\scriptscriptstyle(t)} without knowledge of σi(t)\sigma_{-i}^{\scriptscriptstyle(t)}. Compared to general (stochastic) joint policies, product policies have the desirable property that, to sample a trajectory from σ(t)=(σ1(t),,σm(t))Π1gen,rnd××Πmgen,rnd=Πgen,rnd\sigma^{\scriptscriptstyle{(t)}}=(\sigma_{1}^{\scriptscriptstyle{(t)}},\ldots,\sigma_{m}^{\scriptscriptstyle{(t)}})\in{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m}={\Pi}^{\mathrm{gen,rnd}}, the agents do not require access to shared randomness. In particular, each agent ii can independently sample its action from σi(t)\sigma_{i}^{\scriptscriptstyle{(t)}} at each of the HH steps of the episode. It is natural to ask how the situation changes if we allow the agents to use shared random bits when sampling from their policies, which corresponds to allowing σ(1),,σ(T)\sigma^{\scriptscriptstyle(1)},\ldots,\sigma^{\scriptscriptstyle{(T)}} to be non-product policies. In this case, V-learning yields a positive result via a standard “batch-to-online” conversion: by applying the first phase of V-learning during the first T2/3T^{2/3} episodes and playing trajectories sampled i.i.d. from the output policy produced by V-learning during the remaining TT2/3T-T^{2/3} episodes (which requires shared randomness), it is straightforward to see that a regret bound of order poly(H,S,maxiAi)T2/3\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{2/3} can be obtained. Similar remarks apply to SPoCMAR [DGZ22], which can obtain a slightly worse regret bound of order poly(H,S,maxiAi)T3/4\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{3/4} in the same fashion. In fact, the batch-to-online conversion approach gives a generic solution for the setting in which shared randomness is available. That is, the assumption of shared randomness eliminates any distinction between no-regret algorithms and (non-sparse) equilibrium computation algorithms, modulo a slight loss in rates. For this reason, the shared randomness assumption is too strong to develop any sort of distinct theory of no-regret learning.
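To spell out the accounting behind the $T^{2/3}$ rate (a rough calculation, treating each episode's total reward as bounded by $H$ and suppressing concentration terms): running the first phase for $K=T^{2/3}$ episodes contributes at most $H\cdot T^{2/3}$ regret and yields an $\epsilon$-CCE with $\epsilon=\operatorname{poly}(H,S,\max_{i}A_{i})\cdot K^{-1/2}=\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{-1/3}$, and each remaining episode then contributes at most $\epsilon$ expected regret, so that in total

$$\mathrm{Reg}_{i,T}\;\leq\;H\cdot T^{2/3}\;+\;(T-T^{2/3})\cdot\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{-1/3}\;\leq\;\operatorname{poly}(H,S,\max_{i}A_{i})\cdot T^{2/3}.$$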

6.4 Comparison to lower bounds for finding stationary CCE

A separate line of work [DGZ22, JMS22] has recently shown PPAD-hardness for the problem of finding stationary Markov CCE in infinite-horizon discounted stochastic games. These results are incomparable with our own: stationary Markov CCE are not sparse (in the sense of Definition 3.1), whereas we do not require stationarity of policies (as is standard in the finite-horizon setting).

6.5 Proof of Proposition 6.1

Below we prove Proposition 6.1.

Proof of Proposition 6.1.

We construct the claimed Markov game 𝒢\mathcal{G} as follows. The single state is denoted by 𝔰\mathfrak{s}; as there is only a single state, the transitions are trivial. We denote each player’s action space as 𝒜1=𝒜2={1,2}\mathcal{A}_{1}=\mathcal{A}_{2}=\{1,2\}. The rewards to player 1 are given as follows: for all (a1,a2)𝒜(a_{1},a_{2})\in\mathcal{A},

R1,1(𝔰,(a1,a2))=12𝕀a2=1,R1,2(𝔰,(a1,a2))=12𝕀a1=a2.\displaystyle R_{1,1}(\mathfrak{s},(a_{1},a_{2}))=\frac{1}{2}\cdot\mathbb{I}_{a_{2}=1},\qquad R_{1,2}(\mathfrak{s},(a_{1},a_{2}))=\frac{1}{2}\cdot\mathbb{I}_{a_{1}=a_{2}}.

We allow the rewards of player 2 to be arbitrary; they do not affect the proof in any way.

We let σ2=(σ2,1,σ2,2)Π2gen,rnd\sigma_{2}=(\sigma_{2,1},\sigma_{2,2})\in{\Pi}^{\mathrm{gen,rnd}}_{2} be the policy which plays a uniformly random action at step 1 and then plays the same action at step 2: formally, σ2,1(s1)=Unif(𝒜2)\sigma_{2,1}(s_{1})=\operatorname{Unif}(\mathcal{A}_{2}), and σ2,2((s1,a2,1,r2,1),s2)=𝕀a2,1\sigma_{2,2}((s_{1},a_{2,1},r_{2,1}),s_{2})=\mathbb{I}_{a_{2,1}}. Then for any Markov policy σ1Π1markov\sigma_{1}\in\Pi^{\mathrm{markov}}_{1} of player 1, we must have σ1×σ2(a1,2=a2,2)=1/2\mathbb{P}_{\sigma_{1}\times\sigma_{2}}(a_{1,2}=a_{2,2})=1/2, which means that V1σ1×σ2=12𝔼σ1×σ2[𝕀a2,1=1+𝕀a1,2=a2,2]=1/2(1/2+1/2)=1/2V_{1}^{\sigma_{1}\times\sigma_{2}}=\frac{1}{2}\cdot\mathbb{E}_{\sigma_{1}\times\sigma_{2}}[\mathbb{I}_{a_{2,1}=1}+\mathbb{I}_{a_{1,2}=a_{2,2}}]=1/2\cdot(1/2+1/2)=1/2.

On the other hand, any general (non-Markov) policy σ1Π1gen,rnd\sigma_{1}\in{\Pi}^{\mathrm{gen,rnd}}_{1} which satisfies

σ1,2((s1,a1,1,r1,1),s2)={𝕀1:r1,1=1/2𝕀2:r1,1=0\displaystyle\sigma_{1,2}((s_{1},a_{1,1},r_{1,1}),s_{2})=\begin{cases}\mathbb{I}_{1}:&r_{1,1}=1/2\\ \mathbb{I}_{2}:&r_{1,1}=0\end{cases}

has V1σ1×σ2=1/2(1/2+1)=3/4V_{1}^{\sigma_{1}\times\sigma_{2}}=1/2\cdot(1/2+1)=3/4. ∎
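As a sanity check, the two values computed in the proof can be verified by simulation; the sketch below (with names of our own choosing) plays the game forward against sigma_2 and estimates V_1 both for a Markov rule and for the non-Markov deviation.

```python
import random

def simulate(player1_step2_rule, n_trials=200_000, seed=0):
    """Estimate player 1's value V_1 in the two-step, one-state game from the
    proof of Proposition 6.1, against the "repeat my first action" policy sigma_2.
    player1_step2_rule(a11, r11) returns player 1's step-2 action."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        a11 = rng.choice([1, 2])          # player 1's step-1 action (irrelevant to its rewards)
        a21 = rng.choice([1, 2])          # sigma_2 plays uniformly at step 1
        r11 = 0.5 if a21 == 1 else 0.0    # R_{1,1} = (1/2) * I{a_{2,1} = 1}
        a12 = player1_step2_rule(a11, r11)
        a22 = a21                         # sigma_2 repeats its step-1 action
        r12 = 0.5 if a12 == a22 else 0.0  # R_{1,2} = (1/2) * I{a_{1,2} = a_{2,2}}
        total += r11 + r12
    return total / n_trials

# Markov rules for player 1 ignore r11; as shown in the proof, every such rule gives ~1/2.
print(round(simulate(lambda a11, r11: 1), 2))                       # ~0.50
# The non-Markov deviation infers a_{2,1} from its step-1 reward; value ~3/4.
print(round(simulate(lambda a11, r11: 1 if r11 == 0.5 else 2), 2))  # ~0.75
```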

Acknowledgements

This work was performed in part while NG was an intern at Microsoft Research. NG is supported at MIT by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. SK acknowledges funding from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation Grant under award #CCF-2212841.

References

  • [AB06] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2006.
  • [AFK+22] Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Tuomas Sandholm. Uncoupled learning dynamics with O(log T) swap regret in multiplayer games. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [AVDG+22] John P. Agapiou, Alexander Sasha Vezhnevets, Edgar A. Duéñez-Guzmán, Jayd Matyas, Yiran Mao, Peter Sunehag, Raphael Köster, Udari Madhushani, Kavya Kopparapu, Ramona Comanescu, DJ Strouse, Michael B. Johanson, Sukhdeep Singh, Julia Haas, Igor Mordatch, Dean Mobbs, and Joel Z. Leibo. Melting pot 2.0, 2022.
  • [AYBK+13] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvari. Online learning in markov decision processes with adversarially chosen transition probability distributions. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • [Bab16] Yakov Babichenko. Query complexity of approximate nash equilibria. J. ACM, 63(4), oct 2016.
  • [BBD+22] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.
  • [BCI+08] Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 365–372, 2008.
  • [BJY20] Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
  • [Bla56] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
  • [BR17] Yakov Babichenko and Aviad Rubinstein. Communication complexity of approximate nash equilibria. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, page 878–889, New York, NY, USA, 2017. Association for Computing Machinery.
  • [Bro49] George Williams Brown. Some notes on computation of games solutions. 1949.
  • [BS18] Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
  • [CBFH+97] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
  • [CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
  • [CCT17] Xi Chen, Yu Cheng, and Bo Tang. Well-Supported vs. Approximate Nash Equilibria: Query Complexity of Large Games. In Christos H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pages 57:1–57:9, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [CDT06] Xi Chen, Xiaotie Deng, and Shang-hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 603–612, 2006.
  • [CP20] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18990–18999. Curran Associates, Inc., 2020.
  • [DFG21] Constantinos Costis Daskalakis, Maxwell Fishelson, and Noah Golowich. Near-optimal no-regret learning in general games. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • [DGP09] Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
  • [DGZ22] Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of markov equilibrium in stochastic games, 2022.
  • [ELS+22] Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, and Yishay Mansour. Regret minimization and convergence to equilibria in general-sum markov games, 2022.
  • [FGGS13] John Fearnley, Martin Gairing, Paul Goldberg, and Rahul Savani. Learning equilibria of games via payoff queries. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, EC ’13, page 397–414, New York, NY, USA, 2013. Association for Computing Machinery.
  • [FLM94] Drew Fudenberg, David Levine, and Eric Maskin. The folk theorem with imperfect public information. Econometrica, 62(5):997–1039, 1994.
  • [FRSS22] Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. arXiv preprint arXiv:2206.13063, 2022.
  • [Han57] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • [HMC00] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • [JKSY20] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning (ICML), pages 4870–4879. PMLR, 2020.
  • [JLWY21] Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent RL. arXiv preprint arXiv:2110.14555, 2021.
  • [JMS22] Yujia Jin, Vidya Muthukumar, and Aaron Sidford. The complexity of infinite-horizon general-sum stochastic games, 2022.
  • [Kak03] Sham M Kakade. On the sample complexity of reinforcement learning, 2003.
  • [KECM21] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
  • [KEG+22] János Kramár, Tom Eccles, Ian Gemp, Andrea Tacchetti, Kevin R. McKee, Mateusz Malinowski, Thore Graepel, and Yoram Bachrach. Negotiation and honesty in artificial intelligence methods for the board game of Diplomacy. Nature Communications, 13(1):7214, December 2022. Number: 1 Publisher: Nature Publishing Group.
  • [KMN99] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1324–1331, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
  • [LDGV+21] Joel Z Leibo, Edgar A Dueñez-Guzman, Alexander Vezhnevets, John P Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with melting pot. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6187–6199. PMLR, 18–24 Jul 2021.
  • [LS05] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. Decision Support Systems, 39:55–66, 2005.
  • [LS20] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • [LW94] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
  • [LWJ22] Qinghua Liu, Yuanhao Wang, and Chi Jin. Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 14036–14053. PMLR, 17–23 Jul 2022.
  • [MB21] Weichao Mao and Tamer Basar. Provably efficient reinforcement learning in decentralized general-sum markov games. CoRR, abs/2110.05682, 2021.
  • [MF86] Eric Maskin and D Fudenberg. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 53(3):533–554, 1986. Reprinted in A. Rubinstein (ed.), Game Theory in Economics, London: Edward Elgar, 1995. Also reprinted in D. Fudenberg and D. Levine (eds.), A Long-Run Collaboration on Games with Long-Run Patient Players, World Scientific Publishers, 2009, pp. 209-230.
  • [MK15] Kleanthis Malialis and Daniel Kudenko. Distributed response to network intrusions using multiagent reinforcement learning. Engineering Applications of Artificial Intelligence, 41:270–284, 2015.
  • [Nas51] John Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
  • [Pap94] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. J. Comput. Syst. Sci., 48(3):498–532, 1994.
  • [Put94] Martin Puterman. Markov Decision Processes. John Wiley & Sons, Ltd, 1 edition, 1994.
  • [PVH+22] Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, and Karl Tuyls. Mastering the game of stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022.
  • [Rou15] Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM, 2015.
  • [Rub16] Aviad Rubinstein. Settling the complexity of computing approximate two-player Nash equilibria. In Annual Symposium on Foundations of Computer Science (FOCS), pages 258–265. IEEE, 2016.
  • [Rub18] Aviad Rubinstein. Inapproximability of Nash equilibrium. SIAM Journal on Computing, 47(3):917–959, 2018.
  • [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems (NIPS), pages 2989–2997, 2015.
  • [Sha53] Lloyd Shapley. Stochastic Games. PNAS, 1953.
  • [SHM+16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [SMB22] Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations, 2022.
  • [SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, feb 2012.
  • [SSS16] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016.
  • [Vov90] Vladimir Vovk. Aggregating strategies. Proc. of Computational Learning Theory, 1990, 1990.
  • [WAJ+21] Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, and Csaba Szepesvari. On query-efficient planning in mdps under linear realizability of the optimal state-value function. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4355–4385. PMLR, 15–19 Aug 2021.
  • [YHAY+22] Dong Yin, Botao Hao, Yasin Abbasi-Yadkori, Nevena Lazić, and Csaba Szepesvári. Efficient local planning with linear function approximation. In Sanjoy Dasgupta and Nika Haghtalab, editors, Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pages 1165–1192. PMLR, 29 Mar–01 Apr 2022.
  • [ZLY22] Wenhao Zhan, Jason D Lee, and Zhuoran Yang. Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in markov games. arXiv preprint arXiv:2206.01588, 2022.
  • [ZTS+22] Stephan Zheng, Alexander Trott, Sunil Srinivasa, David C. Parkes, and Richard Socher. The ai economist: Taxation policy design via two-level deep multiagent reinforcement learning. Science Advances, 8(18):eabk2607, 2022.

Appendix A Additional preliminaries

A.1 Nash equilibria and computational hardness.

The most foundational and well known solution concept for normal-form games is the Nash equilibrium [Nas51].

Definition A.1 ((n,ϵ)(n,\epsilon)-Nash problem).

For a normal-form game G=(M1,,Mm)G=(M_{1},\ldots,M_{m}) and ϵ>0\epsilon>0, a product distribution pj=1mΔ([n])p\in\prod_{j=1}^{m}\Delta([n]) is said to be an ϵ\epsilon-Nash equilibrium for GG if for all i[m]i\in[m],

maxai[n]𝔼𝐚p[(Mi)ai,𝐚i]𝔼𝐚p[(Mi)𝐚]ϵ.\displaystyle\max_{a_{i}^{\prime}\in[n]}\mathbb{E}_{\mathbf{a}\sim p}[(M_{i})_{a_{i}^{\prime},\mathbf{a}_{-i}}]-\mathbb{E}_{\mathbf{a}\sim p}[(M_{i})_{\mathbf{a}}]\leq\epsilon.

We define the mm-player (n,ϵ)(n,\epsilon)-Nash problem to be the problem of computing an ϵ\epsilon-Nash equilibrium of a given mm-player nn-action normal-form game.202020One must also take care to specify the bit complexity of representing a normal-form game. We assume that the payoffs of any normal-form game given as an instance to the (n,ϵ)(n,\epsilon)-Nash problem can each be expressed with max{n,m}\max\{n,m\} bits; this assumption is without loss of generality as long as ϵ2max{n,m}\epsilon\geq 2^{-\max\{n,m\}} (which it will be for us).

Informally, pp is an ϵ\epsilon-Nash equilibrium if no player ii can gain more than ϵ\epsilon in reward by deviating to a single fixed action aia_{i}^{\prime}, while all other players randomly choose their actions according to pp. Despite the intuitive appeal of Nash equilibria, they are intractable to compute: for any c>0c>0, it is PPAD-hard to solve the (n,nc)(n,n^{-c})-Nash problem, namely, to compute ncn^{-c}-approximate Nash equilibria in 2-player nn-action normal-form games [DGP09, CDT06, Rub18]. We recall that the complexity class PPAD consists of all total search problems which have a polynomial-time reduction to the End-of-The-Line (EOTL) problem. PPAD is the most well-studied complexity class in algorithmic game theory, and it is widely believed that PPADP\textsf{PPAD}\neq\textsf{P}. We refer the reader to [DGP09, CDT06, Rub18, Pap94] for further background on the class PPAD and the EOTL problem.
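The deviation check in Definition A.1 is simple to state in code; the following is a minimal sketch for the two-player case (function and variable names are ours), which computes the largest gain either player can obtain by deviating to a fixed action.

```python
import numpy as np

def nash_gap(M1, M2, p1, p2):
    """Maximum gain any player can obtain by deviating to a fixed action
    when play follows the product distribution p1 x p2.
    (p1, p2) is an eps-Nash equilibrium iff nash_gap(...) <= eps.
    """
    u1_dev = M1 @ p2   # player 1's expected payoff for each pure action a1'
    u2_dev = p1 @ M2   # player 2's expected payoff for each pure action a2'
    u1 = p1 @ M1 @ p2  # player 1's expected payoff under (p1, p2)
    u2 = p1 @ M2 @ p2  # player 2's expected payoff under (p1, p2)
    return max(u1_dev.max() - u1, u2_dev.max() - u2)

# Matching pennies: the uniform product profile is an exact Nash equilibrium.
M1 = np.array([[1.0, 0.0], [0.0, 1.0]])
M2 = 1.0 - M1
p = np.array([0.5, 0.5])
print(nash_gap(M1, M2, p, p))   # 0.0
```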

A.2 Query complexity of Nash equilibria

Our statistical lower bound for the SparseCCE problem in Theorem 5.2 relies on existing query complexity lower bounds for computing approximate Nash equilibria in mm-player normal-form games. We first review the query complexity model for normal-form games.

Oracle model for normal-form games.

For m,nm,n\in\mathbb{N}, consider an mm-player nn-action normal form game GG, specified by payoff tensors M1,,MmM_{1},\ldots,M_{m}. Since the tensors M1,,MmM_{1},\ldots,M_{m} contain a total of mnmmn^{m} real-valued payoffs, in the setting when mm is large, it is unrealistic to assume that an algorithm is given the full payoff tensors as input. Therefore, prior work on computing equilibria in such games has studied the setting in which the algorithm makes adaptive oracle queries to the payoff tensors.

In particular, the algorithm, which is allowed to be randomized, has access to a payoff oracle 𝒪G\mathcal{O}_{G} for the game GG, which works as follows. At each time step, the algorithm can choose to specify an action profile 𝐚[n]m\mathbf{a}\in[n]^{m} and then query 𝒪G\mathcal{O}_{G} at the action profile 𝐚\mathbf{a}. The oracle 𝒪G\mathcal{O}_{G} then returns the payoffs (M1)𝐚,,(Mm)𝐚(M_{1})_{\mathbf{a}},\ldots,(M_{m})_{\mathbf{a}} for each player if the action profile 𝐚\mathbf{a} is played.

Query complexity lower bound for approximate Nash equilibrium.

The following theorem gives a lower bound on the number of queries any randomized algorithm needs to make to compute an approximate Nash equilibrium in an mm-player game.

Theorem A.2 (Corollary 4.5 of [Rub16]).

There is a constant ϵ0>0\epsilon_{0}>0 so that any randomized algorithm which solves the (2,ϵ0)(2,\epsilon_{0})-Nash problem for mm-player normal-form games with probability at least 2/32/3 must use at least 2Ω(m)2^{\Omega(m)} payoff queries.

We remark that [Bab16, CCT17] provide similar, though quantitatively weaker, lower bounds to that in Theorem A.2. We also emphasize that the lower bound of Theorem A.2 applies to any algorithm, i.e., including those which require extremely large computation time.

Appendix B Proofs of lower bounds for SparseMarkovCCE (Section 3)

B.1 Preliminaries: Online density estimation

Our proof makes use of tools for online learning with the logarithmic loss, also known as conditional density estimation. In particular, we use a variant of the exponential weights algorithm known as Vovk’s aggregating algorithm in the context of density estimation [Vov90, CBL06]. We consider the following setting with two players, a Learner and Nature. Furthermore, there is a set 𝒴\mathcal{Y}, called the outcome space, and a set 𝒳\mathcal{X}, called the context space; for our applications it suffices to assume 𝒴\mathcal{Y} and 𝒳\mathcal{X} are finite. For some TT\in\mathbb{N}, there are TT time steps t=1,2,,Tt=1,2,\ldots,T. At each time step t[T]t\in[T]:

  • Nature reveals a context x(t)𝒳x^{\scriptscriptstyle{(t)}}\in\mathcal{X};

  • Having seen the context x(t)x^{\scriptscriptstyle{(t)}}, the learner predicts a distribution q^(t)Δ(𝒴)\widehat{q}^{\scriptscriptstyle{(t)}}\in\Delta(\mathcal{Y});

  • Nature chooses an outcome y(t)𝒴y^{\scriptscriptstyle{(t)}}\in\mathcal{Y}, and the learner suffers loss log(t)(q^(t)):=log(1q^(t)(y(t))).\ell_{\mathrm{log}}^{\scriptscriptstyle{(t)}}(\widehat{q}^{\scriptscriptstyle{(t)}}):=\log\left(\frac{1}{\widehat{q}^{\scriptscriptstyle{(t)}}(y^{\scriptscriptstyle{(t)}})}\right).

For each t[T]t\in[T], we let (t)={(x(1),y(1),q^(1)),,(x(t),y(t),q^(t))}\mathcal{H}^{\scriptscriptstyle{(t)}}=\{(x^{\scriptscriptstyle{(1)}},y^{\scriptscriptstyle{(1)}},\widehat{q}^{\scriptscriptstyle{(1)}}),\ldots,(x^{\scriptscriptstyle{(t)}},y^{\scriptscriptstyle{(t)}},\widehat{q}^{\scriptscriptstyle{(t)}})\} denote the history of interaction up to step tt; we emphasize that each context x(t)x^{\scriptscriptstyle{(t)}} may be chosen adaptively as a function of (t1)\mathcal{H}^{\scriptscriptstyle{(t-1)}}. Let (t)\mathscr{F}^{\scriptscriptstyle{(t)}} denote the sigma-algebra generated by ((t),x(t+1))(\mathcal{H}^{\scriptscriptstyle{(t)}},x^{\scriptscriptstyle{(t+1)}}). We measure performance in terms of regret against a set \mathcal{I} of experts, also known as the expert setting. Each expert ii\in\mathcal{I} consists of a function pi:𝒳Δ(𝒴)p_{i}:\mathcal{X}\rightarrow\Delta(\mathcal{Y}). The regret of an algorithm against the expert class \mathcal{I} when it receives contexts x(1),,x(T)x^{\scriptscriptstyle{(1)}},\ldots,x^{\scriptscriptstyle{(T)}} and observes outcomes y(1),,y(T)y^{\scriptscriptstyle{(1)}},\ldots,y^{\scriptscriptstyle{(T)}} is defined as

Reg,T=t=1Tlog(t)(q^(t))minit=1Tlog(t)(pi(x(t))).\displaystyle\mathrm{Reg}_{\mathcal{I},T}=\sum_{t=1}^{T}\ell_{\mathrm{log}}^{\scriptscriptstyle{(t)}}(\widehat{q}^{\scriptscriptstyle{(t)}})-\min_{i\in\mathcal{I}}\sum_{t=1}^{T}\ell_{\mathrm{log}}^{\scriptscriptstyle{(t)}}(p_{i}(x^{\scriptscriptstyle{(t)}})).

Note that the learner can observe the expert predictions {pi(x(t))}i\{p_{i}(x^{\scriptscriptstyle(t)})\}_{i\in\mathcal{I}} and use them to make its own prediction at each round tt.

Proposition B.1 (Vovk’s aggregating algorithm).

Consider Vovk’s aggregating algorithm, which predicts via

q^(t)(y):=𝔼iq~(t)[pi(x(t))(y)],whereq~(t)(i):=exp(s=1t1log(s)(pi(x(s))))jexp(s=1t1log(s)(pj(x(s)))).\displaystyle\widehat{q}^{\scriptscriptstyle{(t)}}(y):=\mathbb{E}_{i\sim\widetilde{q}^{\scriptscriptstyle{(t)}}}[p_{i}(x^{\scriptscriptstyle{(t)}})(y)],\quad\text{where}\quad\widetilde{q}^{\scriptscriptstyle{(t)}}(i)\vcentcolon=\frac{\exp\left(-\sum_{s=1}^{t-1}\ell_{\mathrm{log}}^{\scriptscriptstyle{(s)}}(p_{i}(x^{\scriptscriptstyle{(s)}}))\right)}{\sum_{j\in\mathcal{I}}\exp\left(-\sum_{s=1}^{t-1}\ell_{\mathrm{log}}^{\scriptscriptstyle{(s)}}(p_{j}(x^{\scriptscriptstyle{(s)}}))\right)}. (3)

This algorithm guarantees a regret bound of Reg,Tlog||\mathrm{Reg}_{\mathcal{I},T}\leq\log|\mathcal{I}|.
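For concreteness, here is a minimal implementation sketch of the aggregation rule (3), together with a toy realizable instance in the spirit of Proposition B.2; the class name, the finite expert list, and the numbers are our own illustration.

```python
import numpy as np

class VovkAggregator:
    """Vovk's aggregating algorithm: exponential weights with the log loss.

    experts: a finite list of functions x -> probability vector over outcomes.
    Cumulative log loss is at most the best expert's loss plus log(len(experts)).
    """

    def __init__(self, experts):
        self.experts = experts
        self.log_weights = np.zeros(len(experts))  # minus each expert's cumulative log loss

    def predict(self, x):
        w = np.exp(self.log_weights - self.log_weights.max())
        w /= w.sum()                               # the distribution q_tilde^{(t)} in (3)
        preds = np.array([e(x) for e in self.experts])
        return w @ preds                           # q_hat^{(t)}: mixture of expert predictions

    def update(self, x, y):
        # After Nature reveals the outcome y, charge each expert its log loss.
        for i, e in enumerate(self.experts):
            self.log_weights[i] += np.log(e(x)[y])

# A toy realizable instance (cf. Proposition B.2): outcomes follow a fixed unknown expert.
rng = np.random.default_rng(0)
experts = [lambda x, q=q: q for q in (np.array([0.9, 0.1]), np.array([0.2, 0.8]))]
alg, truth = VovkAggregator(experts), experts[1]
for _ in range(200):
    x = None                       # contexts play no role in this toy example
    q_hat = alg.predict(x)
    y = int(rng.choice(2, p=truth(x)))
    alg.update(x, y)
print(np.round(q_hat, 2))          # close to the true expert's distribution [0.2, 0.8]
```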

Recall that for probability distributions p,qp,q on a finite set \mathcal{B}, their total variation distance is defined as

D𝖳𝖵(p,q)=max|p()q()|.\displaystyle D_{\mathsf{TV}}({p},{q})=\max_{\mathcal{E}\subset\mathcal{B}}|p(\mathcal{E})-q(\mathcal{E})|. (4)

As a (standard) consequence of Proposition B.1, in the realizable setting in which the distribution of y(t)|x(t)y^{\scriptscriptstyle{(t)}}|x^{\scriptscriptstyle{(t)}} follows pi(x(t))p_{i^{\star}}(x^{\scriptscriptstyle{(t)}}) for some fixed (unknown) expert ii^{\star}\in\mathcal{I}, we can obtain a bound on the total variation distance between the algorithm’s predictions and those of pi(x(t))p_{i^{\star}}(x^{\scriptscriptstyle{(t)}}).

Proposition B.2.

If the distribution of outcomes is realizable, i.e., there exists an expert ii^{\star}\in\mathcal{I} so that y(t)pi(x(t))|x(t),(t1)y^{\scriptscriptstyle{(t)}}\sim p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})\ |\ x^{\scriptscriptstyle{(t)}},\mathcal{H}^{\scriptscriptstyle{(t-1)}} for all t[T]t\in[T], then the predictions q^(t)\widehat{q}^{\scriptscriptstyle{(t)}} of the aggregation algorithm (3) satisfy

t=1T𝔼[D𝖳𝖵(q^(t),pi(x(t)))]Tlog||.\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})\right]\leq\sqrt{T\log|\mathcal{I}|}.

For completeness, we provide the proof of Proposition B.2 here.

Proof of Proposition B.2.

To simplify notation, for an expert ii\in\mathcal{I}, a context x𝒳x\in\mathcal{X}, and an outcome y𝒴y\in\mathcal{Y}, we write pi(y|x)p_{i}(y|x) to denote pi(x)(y)p_{i}(x)(y).

Proposition B.1 gives that the following inequality holds (almost surely):

Reg,T=t=1Tlog(1q^(t)(y(t)))t=1Tlog(1pi(y(t)|x(t)))log||.\displaystyle\mathrm{Reg}_{\mathcal{I},T}=\sum_{t=1}^{T}\log\left(\frac{1}{\widehat{q}^{\scriptscriptstyle{(t)}}(y^{\scriptscriptstyle{(t)}})}\right)-\sum_{t=1}^{T}\log\left(\frac{1}{p_{i^{\star}}(y^{\scriptscriptstyle{(t)}}|x^{\scriptscriptstyle{(t)}})}\right)\leq\log|\mathcal{I}|.

For each t[T]t\in[T], note that q^(t)\widehat{q}^{\scriptscriptstyle{(t)}} and x(t)x^{\scriptscriptstyle{(t)}} are (t1)\mathscr{F}^{\scriptscriptstyle{(t-1)}}-measurable (by definition). Then

t=1TD𝖳𝖵(q^(t),pi(x(t)))2\displaystyle\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})^{2}\leq t=1TD𝖪𝖫(pi(x(t))q^(t))\displaystyle\sum_{t=1}^{T}D_{\mathsf{KL}}({p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})}\|{\widehat{q}^{\scriptscriptstyle{(t)}}})
=\displaystyle= t=1Ty𝒴pi(y|x(t))log(pi(y|x(t))q^(t)(y))\displaystyle\sum_{t=1}^{T}\sum_{y\in\mathcal{Y}}p_{i^{\star}}(y|x^{\scriptscriptstyle{(t)}})\cdot\log\left(\frac{p_{i^{\star}}(y|x^{\scriptscriptstyle{(t)}})}{\widehat{q}^{\scriptscriptstyle{(t)}}(y)}\right)
=\displaystyle= t=1T𝔼[log(1q^(t)(y(t)))log(1pi(y(t)|x(t)))|(t1)],\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[\log\left(\frac{1}{\widehat{q}^{\scriptscriptstyle{(t)}}(y^{\scriptscriptstyle{(t)}})}\right)-\log\left(\frac{1}{p_{i^{\star}}(y^{\scriptscriptstyle{(t)}}|x^{\scriptscriptstyle{(t)}})}\right)\ |\ \mathscr{F}^{\scriptscriptstyle{(t-1)}}\right],

where the first inequality uses Pinsker’s inequality and the final equality uses the fact that y(t)pi(x(t))|x(t),(t1)y^{\scriptscriptstyle{(t)}}\sim p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})|x^{\scriptscriptstyle{(t)}},\mathcal{H}^{\scriptscriptstyle{(t-1)}}. It follows that

𝔼[t=1TD𝖳𝖵(q^(t),pi(x(t)))2]𝔼[Reg,T]log||.\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})^{2}\right]\leq\mathbb{E}[\mathrm{Reg}_{\mathcal{I},T}]\leq\log\lvert\mathcal{I}\rvert.

Jensen’s inequality now gives that

𝔼[t=1TD𝖳𝖵(q^(t),pi(x(t)))]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})\right]\leq T𝔼[t=1TD𝖳𝖵(q^(t),pi(x(t)))2]Tlog||.\displaystyle\sqrt{T}\cdot\sqrt{\mathbb{E}\left[\sum_{t=1}^{T}D_{\mathsf{TV}}({\widehat{q}^{\scriptscriptstyle{(t)}}},{p_{i^{\star}}(x^{\scriptscriptstyle{(t)}})})^{2}\right]}\leq\sqrt{T\log|\mathcal{I}|}.
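As a quick sanity check (our own illustration, not part of the paper), one can simulate the realizable setting and observe that the cumulative total variation distance stays below the bound of Proposition B.2; contexts are omitted purely for simplicity.

```python
# Illustrative simulation of Proposition B.2: in a realizable setting, the cumulative
# TV distance between the aggregated predictions and the true expert stays below
# sqrt(T log |I|).
import numpy as np

rng = np.random.default_rng(0)
T, n_experts, n_outcomes = 2000, 50, 5
experts = rng.dirichlet(np.ones(n_outcomes), size=n_experts)  # expert i predicts experts[i]
i_star = 7                                                    # outcomes are drawn from this expert

cum_log_loss = np.zeros(n_experts)
tv_sum = 0.0
for t in range(T):
    w = np.exp(cum_log_loss.min() - cum_log_loss)
    q_hat = (w / w.sum()) @ experts                  # aggregated prediction, eq. (3)
    tv_sum += 0.5 * np.abs(q_hat - experts[i_star]).sum()
    y = rng.choice(n_outcomes, p=experts[i_star])    # realizable outcome
    cum_log_loss -= np.log(experts[:, y])

print(tv_sum, "<=", np.sqrt(T * np.log(n_experts)))  # the bound of Proposition B.2
```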

B.2 Proof of Theorem 3.2

Proof of Theorem 3.2.

Fix nn\in\mathbb{N}, which we recall represents an upper bound on the description length of the Markov game. Assume that we are given an algorithm \mathscr{B} that solves the (T,ϵ)(T,\epsilon)-SparseMarkovCCE problem for Markov games 𝒢\mathcal{G} satisfying |𝒢|n|\mathcal{G}|\leq n in time UU. We proceed to describe an algorithm which solves the 2-player (n1/2/2,4ϵ)(\lfloor n^{1/2}/2\rfloor,4\cdot\epsilon)-Nash problem in time (nTU)C0(nTU)^{C_{0}}, as long as T<exp(ϵ2n1/2/25)T<\exp(\epsilon^{2}\cdot n^{1/2}/2^{5}). First, define n0:=n1/2/2n_{0}:=\lfloor n^{1/2}/2\rfloor, and consider an arbitrary 2-player n0n_{0}-action normal form GG, which is specified by payoff matrices M1,M2[0,1]n0×n0M_{1},M_{2}\in[0,1]^{n_{0}\times n_{0}}, so that all entries of the game can be written in binary using at most n0n_{0} bits (recall, per footnote 20, that we may assume that the entries of an instance of (n0,4ϵ)(n_{0},4\cdot\epsilon)-Nash can be specified with n0n_{0} bits). Based on GG, we construct a 2-player Markov game 𝒢:=𝒢(G)\mathcal{G}:=\mathcal{G}(G) as follows:

Definition B.3.

We define the game 𝒢(G)\mathcal{G}(G) to consist of the tuple 𝒢(G)=(𝒮,H,(𝒜i)i[2],,(Ri)i[2],μ)\mathcal{G}(G)=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[2]},\mathbb{P},(R_{i})_{i\in[2]},\mu), where:

  • The horizon of 𝒢\mathcal{G} is H=2n0/2H=2\lfloor n_{0}/2\rfloor (i.e., the largest even number at most n0n_{0}).

  • Let A=n0A=n_{0}; the action spaces of the 2 agents are given by 𝒜1=𝒜2=[A]\mathcal{A}_{1}=\mathcal{A}_{2}=[A].

  • There are a total of A2+1A^{2}+1 states: in particular, there is a state 𝔰(a1,a2)\mathfrak{s}_{(a_{1},a_{2})} for each (a1,a2)[A]2(a_{1},a_{2})\in[A]^{2}, as well as a distinguished state 𝔰\mathfrak{s}, so we have:

    𝒮={𝔰}{𝔰(a1,a2):(a1,a2)[A]2}.\displaystyle\mathcal{S}=\{\mathfrak{s}\}\cup\{\mathfrak{s}_{(a_{1},a_{2})}\ :\ (a_{1},a_{2})\in[A]^{2}\}.
  • For all odd h[H]h\in[H], the reward to agents j[2]j\in[2] given that the action profile (a1,a2)(a_{1},a_{2}) is played at step hh is given by Rj,h(s,(a1,a2)):=1H(Mj)a1,a2R_{j,h}(s,(a_{1},a_{2})):=\frac{1}{H}\cdot(M_{j})_{a_{1},a_{2}}, for all s𝒮s\in\mathcal{S}. All agents receive 0 reward at even steps h[H]h\in[H].

  • At odd steps h[H]h\in[H], if actions a1,a2[A]a_{1},a_{2}\in[A] are taken, the game transitions to the state 𝔰(a1,a2)\mathfrak{s}_{(a_{1},a_{2})}. At even steps h[H]h\in[H], the game always transitions to the state 𝔰\mathfrak{s}.

  • The initial state (i.e., at step h=1h=1) is 𝔰\mathfrak{s} (i.e., μ\mu is a singleton distribution supported on 𝔰\mathfrak{s}).

It is evident that this construction takes polynomial time, and satisfies |𝒢|A2+1n02+1n|\mathcal{G}|\leq A^{2}+1\leq n_{0}^{2}+1\leq n. We will now show that, by applying the algorithm \mathscr{B} to 𝒢\mathcal{G}, we can efficiently compute a 4ϵ4\cdot\epsilon-approximate Nash equilibrium for the original game GG. To do so, we appeal to Algorithm 1.

1:Input: 2-player, n0n_{0}-action normal form game GG.
2:Construct the 2-player Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) per Definition B.3, which satisfies |𝒢|n|\mathcal{G}|\leq n.
3:Call the algorithm \mathscr{B} on the game 𝒢\mathcal{G}, which produces a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}, where each σ(t)Πmarkov\sigma^{\scriptscriptstyle{(t)}}\in\Pi^{\mathrm{markov}}.
4:for t[T]t\in[T] and odd h[H]h\in[H] do
5:     if σh(t)(𝔰)Δ(𝒜1)×Δ(𝒜2)\sigma^{\scriptscriptstyle{(t)}}_{h}(\mathfrak{s})\in\Delta(\mathcal{A}_{1})\times\Delta(\mathcal{A}_{2}) is a (4ϵ,n)(4\cdot\epsilon,n)-Nash equilibrium of GG then return σh(t)(𝔰)\sigma^{\scriptscriptstyle{(t)}}_{h}(\mathfrak{s}).
6:if the for loop terminates without returning: return fail.
Algorithm 1 Algorithm to compute Nash equilibrium used in proof of Theorem 3.2.

Algorithm 1 proceeds as follows. First, it constructs the 2-player Markov game 𝒢(G)\mathcal{G}(G) as defined above, and calls the algorithm \mathscr{B}, which returns a sequence σ(1),,σ(T)Πmarkov\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in\Pi^{\mathrm{markov}} of product Markov policies with the property that the average σ¯:=1Tt=1T𝕀σ(t)\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}. It then enumerates over the distributions σh(t)(𝔰)Δ(𝒜1)×Δ(𝒜2)\sigma^{\scriptscriptstyle{(t)}}_{h}(\mathfrak{s})\in\Delta(\mathcal{A}_{1})\times\Delta(\mathcal{A}_{2}) for each t[T]t\in[T] and h[H]h\in[H] odd, and checks whether each one is a 4ϵ4\cdot\epsilon-approximate Nash equilibrium of GG. If so, the algorithm outputs such a Nash equilibrium, and otherwise, it fails. The proof of Theorem 3.2 is thus completed by the following lemma, which states that as long as σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}, Algorithm 1 never fails.
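The test in Line 5 only requires access to the payoff matrices of GG. A small Python sketch (ours, with illustrative names) of this check:

```python
# Sketch of the check in Line 5 of Algorithm 1: a product distribution (p1, p2) is
# accepted if neither player can gain more than 4*eps by deviating to a pure action
# in the normal-form game with payoff matrices (M1, M2).
import numpy as np

def is_approx_nash(p1, p2, M1, M2, eps):
    v1 = p1 @ M1 @ p2                    # player 1's expected payoff
    v2 = p1 @ M2 @ p2                    # player 2's expected payoff
    gain1 = np.max(M1 @ p2) - v1         # best pure deviation for player 1
    gain2 = np.max(p1 @ M2) - v2         # best pure deviation for player 2
    return max(gain1, gain2) <= 4 * eps
```

Each such check takes time polynomial in the number of actions, so enumerating over the at most TH candidate distributions adds only polynomial overhead.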

Lemma B.4 (Correctness of Algorithm 1).

Consider the normal form game GG and the Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) as constructed above, which has horizon HH. For any ϵ0>0\epsilon_{0}>0, TT\in\mathbb{N}, if T<exp(Hϵ02/28)T<\exp(H\cdot\epsilon_{0}^{2}/2^{8}) and σ(1),,σ(T)Πmarkov\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}\in\Pi^{\mathrm{markov}} are product Markov policies so that 1Tt=1T𝕀σ(t)\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an (ϵ0/4)(\epsilon_{0}/4)-CCE of 𝒢\mathcal{G}, then there is some odd h[H]h\in[H] and t[T]t\in[T] so that σh(t)(𝔰)\sigma_{h}^{\scriptscriptstyle{(t)}}(\mathfrak{s}) is an ϵ0\epsilon_{0}-Nash equilibrium of GG.

The proof of Lemma B.4 is given below. Applying Lemma B.4 with ϵ0=4ϵ\epsilon_{0}=4\epsilon (which is a valid application since T<exp(n0(4ϵ)2/28)T<\exp(n_{0}\cdot(4\epsilon)^{2}/2^{8}) by our assumption on T,ϵT,\epsilon), yields that Algorithm 1 always finds a 4ϵ4\epsilon-Nash equilibrium of the n0n_{0}-action normal form game GG, thus solving the given instance of the (n0,4ϵ)(n_{0},4\cdot\epsilon)-Nash problem. Furthermore, it is straightforward to see that Algorithm 1 runs in time U+(nT)C0(UnT)C0U+(nT)^{C_{0}}\leq(UnT)^{C_{0}}, for some constant C01C_{0}\geq 1.

Proof of Lemma B.4.

Consider a sequence of product Markov policies σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} with the property that the average σ¯=1Tt=1T𝕀σ(t)\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an (ϵ0/4)(\epsilon_{0}/4)-CCE of 𝒢\mathcal{G}. For all odd h[H]h\in[H] and j[2]j\in[2], let pj,h(t):=σj,h(t)(𝔰)Δ(𝒜j)p^{\scriptscriptstyle{(t)}}_{j,h}:=\sigma^{\scriptscriptstyle{(t)}}_{j,h}(\mathfrak{s})\in\Delta(\mathcal{A}_{j}), which is the distribution played under σ(t)\sigma^{\scriptscriptstyle{(t)}} by player jj at step hh (at the unique state 𝔰\mathfrak{s} with positive probability of being reached at step hh). For odd hh, we have σh(t)(𝔰)=p1,h(t)×p2,h(t)\sigma_{h}^{\scriptscriptstyle{(t)}}(\mathfrak{s})=p_{1,h}^{\scriptscriptstyle{(t)}}\times p_{2,h}^{\scriptscriptstyle{(t)}}, and our goal is to show that for some odd h[H]h\in[H] and t[T]t\in[T], p1,h(t)×p2,h(t)p_{1,h}^{\scriptscriptstyle{(t)}}\times p_{2,h}^{\scriptscriptstyle{(t)}} is an ϵ0\epsilon_{0}-Nash equilibrium of GG. To proceed, suppose for the sake of contradiction that this is not the case.

Let us write 𝒪H:={h[H]:hodd}\mathcal{O}_{H}:=\{h\in[H]:h\ \rm{odd}\} to denote the set of odd-numbered steps, and H=[H]\𝒪H\mathcal{E}_{H}=[H]\backslash\mathcal{O}_{H} to denote the set of even-numbered steps. Let H0=|𝒪H|=|H|=H/2H_{0}=|\mathcal{O}_{H}|=|\mathcal{E}_{H}|=H/2. We first note that for j[2]j\in[2], agent jj’s value under the mixture policy σ¯\overline{\sigma} is given as follows:

Vjσ¯=1THt=1Th𝒪H𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2].\displaystyle V_{j}^{\overline{\sigma}}=\frac{1}{TH}\sum_{t=1}^{T}\sum_{h\in\mathcal{O}_{H}}\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t)}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t)}}}\left[(M_{j})_{a_{1},a_{2}}\right].

For each j[2]j\in[2], we will derive a contradiction by constructing a (non-Markov) deviation policy for player jj in 𝒢\mathcal{G}, denoted πjΠjgen,det\pi_{j}^{\dagger}\in\Pi^{\mathrm{gen,det}}_{j}, which will give player jj a significant gain in value against the policy σ¯\overline{\sigma}. To do so, we need to specify πj,h(τj,h1,sh)𝒜j\pi_{j,h}^{\dagger}(\tau_{j,h-1},s_{h})\in\mathcal{A}_{j}, for all τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1} and sh𝒮s_{h}\in\mathcal{S}; note that we may restrict our attention only to histories τj,h01\tau_{j,h_{0}-1} that occur with positive probability under the transitions of 𝒢\mathcal{G}.

Fix any h0[H]h_{0}\in[H], τj,h01j,h01\tau_{j,h_{0}-1}\in\mathscr{H}_{j,h_{0}-1}, and sh0𝒮s_{h_{0}}\in\mathcal{S}. If τj,h01\tau_{j,h_{0}-1} occurs with positive probability under the transitions of 𝒢\mathcal{G}, then for each h𝒪Hh\in\mathcal{O}_{H} with h<h01h<h_{0}-1 and both j[2]j^{\prime}\in[2], the action played by agent jj^{\prime} at step hh is determined by τj,h01\tau_{j,h_{0}-1}. Namely, if the state at step h+1h+1 of τj,h01\tau_{j,h_{0}-1} is 𝔰(a1,a2)\mathfrak{s}_{(a_{1}^{\prime},a_{2}^{\prime})}, then player jj^{\prime} played action aja_{j^{\prime}}^{\prime} at step hh. So, for each h𝒪Hh\in\mathcal{O}_{H} with h<h01h<h_{0}-1, we may define (a1,h,a2,h)(a_{1,h},a_{2,h}) as the action profile played at step hh, which is a measurable function of τj,h01\tau_{j,h_{0}-1}. With this in mind, we define πj,h0(τj,h01,sh0)\pi_{j,h_{0}}^{\dagger}(\tau_{j,h_{0}-1},s_{h_{0}}) by applying Vovk’s aggregating algorithm (Proposition B.2) as follows.

  1. 1.

    If h0h_{0} is even, play an arbitrary action (note that the actions at even-numbered steps have no influence on the transitions or rewards).

  2. 2.

    If h0h_{0} is odd, define q^j,h0Δ(𝒜j)\widehat{q}_{j,h_{0}}\in\Delta(\mathcal{A}_{-j}) by q^j,h0:=𝔼tq~j,h0[pj,h0(t)]\widehat{q}_{j,h_{0}}:=\mathbb{E}_{t\sim\widetilde{q}_{j,h_{0}}}[p_{-j,h_{0}}^{\scriptscriptstyle{(t)}}], where q~j,h0Δ([T])\widetilde{q}_{j,h_{0}}\in\Delta([T]) is defined as follows: for t[T]t\in[T],

    q~j,h0(t):=exp(h<h0:h𝒪Hlog(1pj,h(t)(aj,h)))t=1Texp(h<h0:h𝒪Hlog(1pj,h(t)(aj,h))).\displaystyle\widetilde{q}_{j,h_{0}}(t):=\frac{\exp\left(-\sum_{h<h_{0}:\ h\in\mathcal{O}_{H}}\log\left(\frac{1}{p^{\scriptscriptstyle{(t)}}_{-j,h}(a_{-j,h})}\right)\right)}{\sum_{t^{\prime}=1}^{T}\exp\left(-\sum_{h<h_{0}:\ h\in\mathcal{O}_{H}}\log\left(\frac{1}{p^{\scriptscriptstyle{(t^{\prime})}}_{-j,h}(a_{-j,h})}\right)\right)}.

    Note that q^j,h0\widehat{q}_{j,h_{0}} is a function of τj,h01\tau_{j,h_{0}-1} via the action profiles {(a1,h,a2,h)}h<h0:h𝒪H\left\{(a_{1,h},a_{2,h})\right\}_{h<h_{0}:h\in\mathcal{O}_{H}}; to simplify notation, we suppress this dependence.

  3. 3.

    Then for any state sh0𝒮s_{h_{0}}\in\mathcal{S}, define πj,h0(τj,h01,sh0)\pi_{j,h_{0}}^{\dagger}(\tau_{j,h_{0}-1},s_{h_{0}}) to be a best response to q^j,h0\widehat{q}_{j,h_{0}}, namely

    πj,h0(τj,h01,sh0):=argmaxaj𝒜j𝔼ajq^j,h0[Rj,h0(sh0,(a1,a2))]=argmaxaj𝒜j𝔼ajq^j,h0[(Mj)a1,a2].\displaystyle\pi_{j,h_{0}}^{\dagger}(\tau_{j,h_{0}-1},s_{h_{0}}):=\operatorname*{arg\,max}_{a_{j}\in\mathcal{A}_{j}}\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h_{0}}}\left[R_{j,h_{0}}(s_{h_{0}},(a_{1},a_{2}))\right]=\operatorname*{arg\,max}_{a_{j}\in\mathcal{A}_{j}}\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h_{0}}}\left[(M_{j})_{a_{1},a_{2}}\right]. (5)

Note that, for odd h0h_{0}, the distribution q^j,h0Δ(𝒜j)\widehat{q}_{j,h_{0}}\in\Delta(\mathcal{A}_{-j}) defined above can be viewed as an application of Vovk’s online aggregation algorithm at step (h0+1)/2(h_{0}+1)/2 in the following setting: the number of steps (TT, in the notation of Proposition B.2; note that TT plays a different role in the present proof) is H0=H/2H_{0}=H/2, the context space is 𝒪H\mathcal{O}_{H}, and the outcome space is 𝒜j\mathcal{A}_{-j} (here j-j denotes the index of the player who is not jj). There are TT experts p~(1),,p~(T)\widetilde{p}^{\scriptscriptstyle{(1)}},\ldots,\widetilde{p}^{\scriptscriptstyle{(T)}} (i.e., we have ={p~(t)}t[T]\mathcal{I}=\left\{\widetilde{p}^{\scriptscriptstyle(t)}\right\}_{t\in[T]}), whose predictions on a context h𝒪Hh\in\mathcal{O}_{H} are defined as follows: the expert p~(t)\widetilde{p}^{\scriptscriptstyle{(t)}} predicts p~(t)(h):=pj,h(t)\widetilde{p}^{\scriptscriptstyle{(t)}}(h):=p_{-j,h}^{\scriptscriptstyle{(t)}}. Then, the distribution q^j,h0\widehat{q}_{j,h_{0}} is obtained by updating the aggregation algorithm with the context-observation pairs (h,aj,h)(h,a_{-j,h}), for odd values of h<h0h<h_{0}.
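To make the construction concrete, here is a small Python sketch (ours; the array names are illustrative) of a single odd step of this deviation policy: the aggregation weights are computed from the opponent's past actions, and a best response to the aggregated prediction is returned as in (5).

```python
# Sketch of the deviation at an odd step: player j aggregates the T candidate
# opponent marginals {p_{-j,h}^{(t)}} via exponential weights on the opponent's
# past actions, then best-responds to the aggregated prediction.
import numpy as np

def deviation_action(Mj, opp_marginals, past_opp_actions):
    """Mj: (A, A) payoff matrix of player j, indexed (own action, opponent action);
    opp_marginals: (T, H0, A) array with opp_marginals[t, k] = p_{-j, 2k+1}^{(t)};
    past_opp_actions: the opponent's actions at the odd steps played so far."""
    T = opp_marginals.shape[0]
    cum_log_loss = np.zeros(T)
    for k, a in enumerate(past_opp_actions):          # log loss of "expert" t so far
        cum_log_loss -= np.log(opp_marginals[:, k, a])
    w = np.exp(cum_log_loss.min() - cum_log_loss)     # exponential weights over t
    k_now = len(past_opp_actions)                     # index of the current odd step
    q_hat = (w / w.sum()) @ opp_marginals[:, k_now]   # aggregated prediction of the opponent's play
    return int(np.argmax(Mj @ q_hat))                 # best response, as in (5)
```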

We next analyze the value of Vjπj,σ¯jV_{j}^{\pi_{j}^{\dagger},\overline{\sigma}_{-j}} for j[2]j\in[2] to show that the deviation strategy we have defined indeed obtains significant gain. To do so, recall that this value represents the payoff for player jj under the process in which we draw an index t[T]t^{\star}\in\left[T\right] uniformly at random, then for each step h[H]h\in[H], player jj plays according to πj\pi_{j}^{\dagger} and player j-j plays according to σj(t)\sigma_{-j}^{\scriptscriptstyle{(t^{\star})}}. (In particular, at odd-numbered steps, player j-j plays according to pj,h(t)p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}.) We recall that 𝔼πj×σ¯j[]\mathbb{E}_{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}\left[\cdot\right] denotes the expectation under this process. We let τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1} denote the random variable which is the history observed by player jj in this setup, i.e., when the policy played is πj×σ¯j\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}, and let {(a1,h,a2,h)}h𝒪H\left\{(a_{1,h},a_{2,h})\right\}_{h\in\mathcal{O}_{H}} denote the action profiles for odd rounds, which are a measurable function of each player’s trajectory.

We apply Proposition B.2 with the time horizon as H0H_{0}, and with the set of experts set to :={p~(1),,p~(T)}\mathcal{I}\vcentcolon={}\{\widetilde{p}^{\scriptscriptstyle{(1)}},\ldots,\widetilde{p}^{\scriptscriptstyle{(T)}}\} as defined above. The context sequence is the sequence of increasing values of h𝒪Hh\in\mathcal{O}_{H}, and for each h𝒪Hh\in\mathcal{O}_{H}, the outcome at step (h+1)/2(h+1)/2 (for which the context is hh) is distributed as aj,hp~(t)(h)=pj,h(t)a_{-j,h}\sim\widetilde{p}^{\scriptscriptstyle{(t^{\star})}}(h)=p^{\scriptscriptstyle{(t^{\star})}}_{-j,h} conditioned on tt^{\star}, which in particular satisfies the realizability assumption stated in Proposition B.2. Then, since (as remarked above) the distributions q^j,h\widehat{q}_{j,h}, for h𝒪Hh\in\mathcal{O}_{H}, are exactly the predictions made by Vovk’s aggregating algorithm, Proposition B.2 gives that (in fact, a similar bound holds uniformly for each possible realization of tt^{\star}, but Equation 6 suffices for our purposes)

𝔼πj×σ¯j[h𝒪HD𝖳𝖵(q^j,h,pj,h(t))]=𝔼πj×σ¯j[h𝒪HD𝖳𝖵(q^j,h,p~(t)(h))]H0logT.\displaystyle\mathbb{E}_{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}\left[\sum_{h\in\mathcal{O}_{H}}D_{\mathsf{TV}}({\widehat{q}_{j,h}},{p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}})\right]=\mathbb{E}_{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}\left[\sum_{h\in\mathcal{O}_{H}}D_{\mathsf{TV}}({\widehat{q}_{j,h}},{\widetilde{p}^{\scriptscriptstyle{(t^{\star})}}(h)})\right]\leq\sqrt{H_{0}\log T}. (6)

Recall that we have assumed for the sake of contradiction that p1,h(t)×p2,h(t)p_{1,h}^{\scriptscriptstyle{(t)}}\times p_{2,h}^{\scriptscriptstyle{(t)}} is not an ϵ0\epsilon_{0}-Nash equilibrium of GG for each odd h[H]h\in[H] and t[T]t\in[T]. Consider a fixed draw of the random variable t[T]t^{\star}\in[T] defined above. Then it holds that for j[2]j\in[2] and h𝒪Hh\in\mathcal{O}_{H}, defining

ϵ0,j,h:=maxaj[A]𝔼ajpj,h(t)[(Mj)a1,a2]𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2],\displaystyle\epsilon_{0,j,h}:=\max_{a_{j}\in[A]}\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{1},a_{2}}\right]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t^{\star})}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{1},a_{2}}\right], (7)

we have ϵ0,1,h+ϵ0,2,hϵ0\epsilon_{0,1,h}+\epsilon_{0,2,h}\geq\epsilon_{0}. Consider any j[2]j\in[2], h𝒪Hh\in\mathcal{O}_{H}, and a history τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1} of agent jj up to step h1h-1 (conditioned on tt^{\star}). Let us write δj,h(t):=D𝖳𝖵(pj,h(t),q^j,h)\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}:=D_{\mathsf{TV}}({p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}},{\widehat{q}_{j,h}}); note that δj,h(t)\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}} is a function of τj,h1\tau_{j,h-1}, through its dependence on q^j,h\widehat{q}_{j,h}. We have, by the definition of πj,h(τj,h1,sh)\pi_{j,h}^{\dagger}(\tau_{j,h-1},s_{h}) in (5) and the definition of δj,h(t)\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}},

𝔼ajpj,h(t)[(Mj)πh,j(τj,h1,𝔰),aj|t,τj,h1]\displaystyle\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{\pi_{h,j}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]\geq 𝔼ajq^j,h[(Mj)πh,j(τj,h1,𝔰),aj|t,τj,h1]δj,h(t)\displaystyle\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h}}\left[(M_{j})_{\pi_{h,j}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]-\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}
=\displaystyle= maxaj[A]𝔼ajq^j,h[(Mj)aj,aj|t,τj,h1]δh,j(t)\displaystyle\max_{a_{j}\in[A]}\mathbb{E}_{a_{-j}\sim\widehat{q}_{j,h}}\left[(M_{j})_{a_{j},a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]-\delta_{h,-j}^{\scriptscriptstyle{(t^{\star})}}
\displaystyle\geq maxaj[A]𝔼ajph,j(t)[(Mj)aj,aj]2δj,h(t).\displaystyle\max_{a_{j}\in[A]}\mathbb{E}_{a_{-j}\sim p_{h,-j}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{j},a_{-j}}\right]-2\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}. (8)

Combining Equation 7 and Equation 8, we get that for any fixed h𝒪Hh\in\mathcal{O}_{H}, j[2]j\in[2], and τj,h1j,h1\tau_{j,h-1}\in\mathscr{H}_{j,h-1},

𝔼ajpj,h(t)[(Mj)πj,h(τj,h1,𝔰),aj|t,τj,h1]𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2]>ϵ0,j,h2δj,h(t).\displaystyle\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{\pi_{j,h}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t^{\star},\ \tau_{j,h-1}\right]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t^{\star})}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t^{\star})}}}\left[(M_{j})_{a_{1},a_{2}}\right]>\epsilon_{0,j,h}-2\delta_{-j,h}^{\scriptscriptstyle{(t^{\star})}}. (9)

Averaging over the draw of t[T]t^{\star}\in[T], which we recall is chosen uniformly, we see that

j[2]Vjπj×σ¯jVjσ¯\displaystyle\sum_{j\in[2]}V_{j}^{\pi_{j}^{\dagger}\times\overline{\sigma}_{-j}}-V_{j}^{\overline{\sigma}}
=\displaystyle= 1Tt=1Tj[2]Vjπj×σj(t)Vjσ(t)\displaystyle\frac{1}{T}\sum_{t=1}^{T}\sum_{j\in[2]}V_{j}^{\pi_{j}^{\dagger}\times\sigma^{\scriptscriptstyle{(t)}}_{-j}}-V_{j}^{\sigma^{\scriptscriptstyle{(t)}}} (10)
=\displaystyle= 1Tt=1Tj[2]𝔼πj×σj(t)[h𝒪H𝔼ajpj,h(t)[Rj,h(𝔰,(πj,h(τj,h1,𝔰),aj))|t,τj,h1]𝔼a1p1,h(t),a2p2,h(t)[Rj,h(𝔰,(a1,a2))]]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\sum_{j\in[2]}\mathbb{E}_{\pi_{j}^{\dagger}\times\sigma_{-j}^{\scriptscriptstyle{(t)}}}\left[\sum_{h\in\mathcal{O}_{H}}\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t)}}}[R_{j,h}(\mathfrak{s},(\pi_{j,h}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}))\ |\ t,\ \tau_{j,h-1}]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t)}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t)}}}[R_{j,h}(\mathfrak{s},(a_{1},a_{2}))]\right]
=\displaystyle= 1THt=1Tj[2]𝔼πj×σj(t)[h𝒪H𝔼ajpj,h(t)[(Mj)πj,h(τj,h1,𝔰),aj|t,τj,h1]𝔼a1p1,h(t),a2p2,h(t)[(Mj)a1,a2]]\displaystyle\frac{1}{TH}\sum_{t=1}^{T}\sum_{j\in[2]}\mathbb{E}_{\pi_{j}^{\dagger}\times\sigma_{-j}^{\scriptscriptstyle{(t)}}}\left[\sum_{h\in\mathcal{O}_{H}}\mathbb{E}_{a_{-j}\sim p_{-j,h}^{\scriptscriptstyle{(t)}}}[(M_{j})_{\pi_{j,h}^{\dagger}(\tau_{j,h-1},\mathfrak{s}),a_{-j}}\ |\ t,\ \tau_{j,h-1}]-\mathbb{E}_{a_{1}\sim p_{1,h}^{\scriptscriptstyle{(t)}},a_{2}\sim p_{2,h}^{\scriptscriptstyle{(t)}}}[(M_{j})_{a_{1},a_{2}}]\right]
\displaystyle\geq 1THt=1Tj[2]𝔼πj×σj(t)[h𝒪H(ϵ0,j,h2δj,h(t))]\displaystyle\frac{1}{TH}\sum_{t=1}^{T}\sum_{j\in[2]}\mathbb{E}_{\pi_{j}^{\dagger}\times\sigma_{-j}^{\scriptscriptstyle{(t)}}}\left[\sum_{h\in\mathcal{O}_{H}}\left(\epsilon_{0,j,h}-2\delta_{-j,h}^{\scriptscriptstyle{(t)}}\right)\right] (11)
\displaystyle\geq ϵ022THt=1T2H0logTϵ024log(T)/H,\displaystyle\frac{\epsilon_{0}}{2}-\frac{2}{TH}\sum_{t=1}^{T}2\sqrt{H_{0}\log T}\geq\frac{\epsilon_{0}}{2}-4\sqrt{\log(T)/H}, (12)

where (10) follows from the definition σ¯=1Tt=1T𝕀σ(t)\overline{\sigma}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}}, (11) follows from (9), and (12) uses (6). As long as T<exp(H(ϵ0/16)2)T<\exp(H\cdot(\epsilon_{0}/16)^{2}), this expression is bounded below by ϵ0/4\epsilon_{0}/4, meaning that σ¯\overline{\sigma} is not an ϵ0/4\epsilon_{0}/4-approximate CCE. This completes the contradiction. ∎

Appendix C Proofs of lower bounds for SparseCCE (Sections 4 and 5)

In this section we prove our computational lower bounds for solving the SparseCCE problem with m=3m=3 players (Theorem 4.3 and Corollary 4.4), as well as our statistical lower bound for solving the SparseCCE problem with a general number mm of players (Theorem 5.2).

Both theorems are proven as consequences of a more general result given in Theorem C.1 below, which reduces the Nash problem in mm-player normal-form games to the SparseCCE problem in (m+1)(m+1)-player Markov games. In more detail, the theorem shows that (a) if an algorithm for SparseCCE makes few calls to a generative model oracle, then we get an algorithm for the Nash problem with few calls to a payoff oracle (see Section A.2 for background on the payoff oracle for the Nash problem), and (b) if the algorithm for SparseCCE is computationally efficient, then so is the algorithm for the Nash problem.

Theorem C.1.

There is a constant C0>0C_{0}>0 so that the following holds. Consider n,mn,m\in\mathbb{N}, and suppose T,N,QT,N,Q\in\mathbb{N} and ϵ>0\epsilon>0 satisfy 1<T<exp(ϵ2n/mm2)1<T<\exp\left(\frac{\epsilon^{2}\cdot\lfloor n/m\rfloor}{m^{2}}\right). Suppose there is an algorithm \mathscr{B} which, given a generative model oracle for a (m+1)(m+1)-player Markov game 𝒢\mathcal{G} with |𝒢|n|\mathcal{G}|\leq n, solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem for 𝒢\mathcal{G} using QQ generative model oracle queries. Then the following conclusions hold:

  • For any δ>0\delta>0, the mm-player (n/m,16(m+1)ϵ)(\lfloor n/m\rfloor,16(m+1)\cdot\epsilon)-Nash problem for any normal-form game GG can be solved, with failure probability δ\delta, using at most C0(Qlog(1/δ))+(log(1/δ)nm/ϵ)C0C_{0}\cdot(Q\cdot\log(1/\delta))+(\log(1/\delta)\cdot nm/\epsilon)^{C_{0}} queries to a payoff oracle 𝒪G\mathcal{O}_{G} for GG.

  • If the algorithm \mathscr{B} additionally runs in time UU for some UU\in\mathbb{N}, then the algorithm solving Nash from the previous bullet point runs in time (nmTNUlog(1/δ)/ϵ)C0(nmTNU\log(1/\delta)/\epsilon)^{C_{0}}.

Theorem 4.3 follows directly from Theorem C.1 by taking m=2m=2.

Proof of Theorem 4.3.

Suppose there is an algorithm which, given the description of any 3-player Markov game 𝒢\mathcal{G} with |𝒢|n|\mathcal{G}|\leq n, solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem in time UU. Such an algorithm immediately yields an algorithm which can solve the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem in time U+|𝒢|O(1)U+|\mathcal{G}|^{O(1)} using only a generative model oracle, since the exact description of the Markov game can be obtained with HS|𝒜|HS(maxiAi)3|𝒢|5HS|\mathcal{A}|\leq HS(\max_{i}A_{i})^{3}\leq|\mathcal{G}|^{5} queries to the generative model (across all (h,s,𝐚)(h,s,\mathbf{a}) tuples). We can now solve the problem of computing a 50ϵ50\cdot\epsilon-Nash equilibrium of a given 2-player n/2\lfloor n/2\rfloor-action normal form game GG as follows. We simply apply the algorithm of Theorem C.1 with m=2m=2, noting that the oracle 𝒪G\mathcal{O}_{G} in the theorem statement can be implemented by reading the corresponding bits of input of the input game GG. The second bullet point yields that this algorithm takes time (nTNUlog(1/δ)/ϵ)C0(nTNU\log(1/\delta)/\epsilon)^{C_{0}}, for some constant C0C_{0}. Furthermore, the assumption T<exp(ϵ2n/m/m2)T<\exp(\epsilon^{2}\cdot\lfloor n/m\rfloor/m^{2}) of Theorem C.1 is implied by the assumption that T<exp(ϵ2n/16)T<\exp(\epsilon^{2}n/16) of Theorem 4.3. ∎

In a similar manner, Theorem 5.2 follows from Theorem C.1 by applying Theorem A.2, which states that there is no randomized algorithm that finds approximate Nash equilibria of mm-player, 2-action normal form games in time 2o(m)2^{o(m)}.

Proof of Theorem 5.2.

Let ϵ0\epsilon_{0} be the constant from Theorem A.2, and consider any m3m\geq 3. Suppose there is an algorithm which, for any mm-player Markov game 𝒢\mathcal{G} with |𝒢|2m6|\mathcal{G}|\leq 2m^{6}, makes QQ oracle queries to a generative model oracle for 𝒢\mathcal{G}, and solves the (T,ϵ0/(10m),N)(T,\epsilon_{0}/(10m),N)-SparseCCE problem for 𝒢\mathcal{G} for some T,NT,N\in\mathbb{N} so that T<exp(cm)T<\exp(cm), for a sufficiently small absolute constant cc. Then, by Theorem C.1 with ϵ=ϵ0/(10m)\epsilon=\epsilon_{0}/(10m) and n=m6n=m^{6} (which ensures that T<exp((ϵ0/(10m))2n/m/m2)T<\exp((\epsilon_{0}/(10m))^{2}\cdot\lfloor n/m\rfloor/m^{2}) as long as cc is sufficiently small), there is an algorithm which solves the (m5,ϵ0)(m^{5},\epsilon_{0})-Nash problem—and thus the (2,ϵ0)(2,\epsilon_{0})-Nash problem—for (m1)(m-1)-player games with failure probability 1/31/3, using O(Q)+mO(1)O(Q)+m^{O(1)} queries to a payoff oracle. But by Theorem A.2, any such algorithm requires 2Ω(m)2^{\Omega(m)} queries to a payoff oracle. It follows that Q2Ω(m)Q\geq 2^{\Omega(m)}, as desired. ∎

C.1 Proof of Theorem C.1

Proof of Theorem C.1.

Fix any m2m\geq 2, nn\in\mathbb{N}. Suppose we are given an algorithm \mathscr{B} that solves the (m+1)(m+1)-player (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem for Markov games 𝒢\mathcal{G} satisfying |𝒢|n|\mathcal{G}|\leq n, running in time UU and using at most QQ generative model queries. We proceed to describe an algorithm which solves the mm-player (n/m,16(m+1)ϵ)(\lfloor n/m\rfloor,16(m+1)\cdot\epsilon)-Nash problem using C0(Qlog(1/δ))+(log(1/δ)nm/ϵ)C0C_{0}\cdot(Q\cdot\log(1/\delta))+(\log(1/\delta)\cdot nm/\epsilon)^{C_{0}} queries to a payoff oracle, and running in time (nmTNUlog(1/δ)/ϵ)C0(nmTNU\log(1/\delta)/\epsilon)^{C_{0}}, where δ\delta represents the failure probability. Define n0:=n/mn_{0}:=\lfloor n/m\rfloor, and assume we are given an arbitrary mm-player n0n_{0}-action normal form GG, which is specified by payoff matrices M1,,Mm[0,1]n0××n0M_{1},\ldots,M_{m}\in[0,1]^{n_{0}\times\cdots\times n_{0}}. We assume that all entries of each of the matrices MjM_{j} have only the most significant max{n0,log1/ϵ}\max\{n_{0},\lceil\log 1/\epsilon\rceil\} bits nonzero; this assumption is without loss of generality, since by truncating the utilities to satisfy this assumption, we change all payoffs by at most ϵ\epsilon, which degrades the quality of any approximate equilibrium by at most 2ϵ2\epsilon (in addition, we have log1/ϵn0\lceil\log 1/\epsilon\rceil\leq n_{0} since we have assumed 1<T<exp(ϵ2n0/m2)1<T<\exp(\epsilon^{2}n_{0}/m^{2})). We assume ϵ1/2\epsilon\leq 1/2 without loss of generality. Based on GG, we construct an (m+1)(m+1)-player Markov game 𝒢:=𝒢(G)\mathcal{G}:=\mathcal{G}(G) as follows.

Definition C.2.

We define the Markov game 𝒢(G)\mathcal{G}(G) as the tuple 𝒢(G)=(𝒮,H,(𝒜i)i[m+1],,(Ri)i[m+1],μ)\mathcal{G}(G)=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m+1]},\mathbb{P},(R_{i})_{i\in[m+1]},\mu), where:

  • The horizon of 𝒢\mathcal{G} is chosen to be the power of 22 satisfying n0H<2n0n_{0}\leq H<2n_{0}.

  • Let A:=n0A\vcentcolon=n_{0}. The action spaces of agents 1,2,,m1,2,\ldots,m are given by 𝒜1==𝒜m=[A]\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=[A]. The action space of agent m+1m+1 is

    𝒜m+1={(j,aj):j[m],aj𝒜j},\displaystyle\mathcal{A}_{m+1}=\{(j,a_{j})\ :\ j\in[m],a_{j}\in\mathcal{A}_{j}\},

    so that |𝒜m+1|=Amn|\mathcal{A}_{m+1}|=Am\leq n.

    We write 𝒜=j=1m𝒜j\mathcal{A}=\prod_{j=1}^{m}\mathcal{A}_{j} to denote the joint action space of the first mm agents, and 𝒜¯:=j=1m+1𝒜j\overline{\mathcal{A}}:=\prod_{j=1}^{m+1}\mathcal{A}_{j} to denote the joint action space of all agents.

  • There is a single state, denoted by 𝔰\mathfrak{s}, i.e., 𝒮={𝔰}\mathcal{S}=\{\mathfrak{s}\} (in particular, μ\mu is a singleton distribution supported on 𝔰\mathfrak{s}).

  • For all h[H]h\in[H], the reward for agent j[m+1]j\in[m+1], given an action profile 𝐚=(a1,,am+1)\mathbf{a}=(a_{1},\ldots,a_{m+1}) at the unique state 𝔰\mathfrak{s}, is as follows: writing am+1=(j,aj)a_{m+1}=(j^{\prime},a_{j^{\prime}}^{\prime}), we have

    Rj,h(𝔰,𝐚)=R¯j,h(𝔰,𝐚)+1H23log1/ϵ𝖾𝗇𝖼(𝐚),\displaystyle R_{j,h}(\mathfrak{s},\mathbf{a})=\overline{R}_{j,h}(\mathfrak{s},\mathbf{a})+\frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}\cdot\mathsf{enc}(\mathbf{a}), (13)

    where R¯j,h(𝔰,𝐚)\overline{R}_{j,h}(\mathfrak{s},\mathbf{a}) is defined per the kibitzer construction of [BCI+08]:

    R¯j,h(𝔰,𝐚):={0:j{j,m+1}1H((Mj)a1,,am(Mj)a1,,aj,,am):j=j1H((Mj)a1,,aj,,am(Mj)a1,,am):j=m+1.\displaystyle\overline{R}_{j,h}(\mathfrak{s},\mathbf{a}):=\begin{cases}0&:j\not\in\{j^{\prime},m+1\}\\ \frac{1}{H}\cdot\left((M_{j})_{a_{1},\ldots,a_{m}}-(M_{j})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}\right)&:j=j^{\prime}\\ \frac{1}{H}\cdot\left((M_{j})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}-(M_{j})_{a_{1},\ldots,a_{m}}\right)&:j=m+1.\end{cases} (14)

    In (13) above, 𝖾𝗇𝖼(𝐚)[0,1]\mathsf{enc}(\mathbf{a})\in[0,1] is a real number whose binary expansion encodes the action profile 𝐚\mathbf{a}. In particular, if the binary encoding of 𝐚\mathbf{a} is (b1,,bN)(b_{1},\ldots,b_{N}), with bi{0,1}b_{i}\in\{0,1\}, then 𝖾𝗇𝖼(𝐚)=i=1N2ibi\mathsf{enc}(\mathbf{a})=\sum_{i=1}^{N}2^{-i}\cdot b_{i}. Note that 𝖾𝗇𝖼(𝐚)\mathsf{enc}(\mathbf{a}) takes N=O(mlogn0)O(mlogn)N=O(m\log n_{0})\leq O(m\log n) bits to specify; a small illustrative sketch of this reward construction is given below.
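The following Python sketch (ours) illustrates the reward construction (13)–(14). For brevity it packs the action profile using base-|𝒜̄| digits rather than the exact bit layout of enc(𝐚), and uses 0-indexed players and actions, so it should be read as an illustration of the construction rather than a faithful implementation.

```python
# Sketch of the kibitzer reward: enc_profile packs the action profile into low-order
# digits, and the base reward pays the kibitzer the gain of its suggested swap while
# charging it to the named player.
import math

def enc_profile(profile, base):
    """Pack a profile of nonnegative integers (each < base) into a number in [0, 1)."""
    x, scale = 0.0, 1.0
    for a in profile:
        scale /= base
        x += a * scale
    return x

def reward(j, actions, suggestion, payoffs, H, eps):
    """j: player index in 0..m, where j == m plays the role of player m+1 (the kibitzer);
    actions: tuple (a_1, ..., a_m) of the first m players' actions (0-indexed);
    suggestion: the kibitzer's action (j_prime, a_swap);
    payoffs[i]: an m-dimensional numpy array standing in for the payoff tensor of player i+1."""
    m = len(actions)
    j_prime, a_swap = suggestion
    swapped = list(actions)
    swapped[j_prime] = a_swap
    if j == j_prime:        # the named player: actual payoff minus counterfactual payoff
        base_reward = (payoffs[j][actions] - payoffs[j][tuple(swapped)]) / H
    elif j == m:            # the kibitzer: the negated difference
        base_reward = (payoffs[j_prime][tuple(swapped)] - payoffs[j_prime][actions]) / H
    else:
        base_reward = 0.0
    A = payoffs[0].shape[0]
    low_order = 2.0 ** (-3 * math.ceil(math.log2(1 / eps)))   # the 2^{-3*ceil(log 1/eps)} factor
    full_profile = actions + (j_prime * A + a_swap,)          # include the kibitzer's action
    return base_reward + low_order * enc_profile(full_profile, m * A) / H
```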

1:Input:
2: Parameters n,n0,m,Tn,n_{0},m,T\in\mathbb{N}, δ=ϵ/(6H)\delta=\epsilon/(6H), K=4log(mn0/δ)/ϵ2K=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil.
3: An mm-player, n0n_{0}-action normal form game GG, with utilities accessible by oracle 𝒪G\mathcal{O}_{G}.
4: An algorithm \mathscr{B} for computing approximate CCE of Markov games.
5:Call the algorithm \mathscr{B} on the (m+1)(m+1)-player Markov game 𝒢=𝒢(G)\mathcal{G}=\mathcal{G}(G) constructed as in Definition C.2, which produces a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}}, where each σ(t)=(σ1(t),,σm+1(t))\sigma^{\scriptscriptstyle{(t)}}=(\sigma^{\scriptscriptstyle{(t)}}_{1},\ldots,\sigma^{\scriptscriptstyle{(t)}}_{m+1}) with σj(t)Πjgen,rnd\sigma^{\scriptscriptstyle{(t)}}_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j}. Here, we use the oracle 𝒪G\mathcal{O}_{G} to simulate generative model oracle queries made by \mathscr{B}.
6:Draw t[T]t^{\star}\in[T] uniformly at random.
7:For each j[m]j\in[m], initialize τj,0\tau_{j,0} to be an empty trajectory.
8:for h[H]h\in[H] do \triangleright Simulate a trajectory from 𝒢\mathcal{G}
9:     Set sh=𝔰s_{h}=\mathfrak{s} (per the transitions of 𝒢\mathcal{G}).
10:      For each j[m]j\in[m], define q^j,h:=𝔼tq~j,h[σj,h(t)(τj,h1,sh)]Δ(𝒜j)\widehat{q}_{j,h}:=\mathbb{E}_{t\sim\widetilde{q}_{j,h}}\left[\sigma_{j,h}^{\scriptscriptstyle{(t)}}(\tau_{j,h-1},s_{h})\right]\in\Delta(\mathcal{A}_{j}), where q~j,hΔ([T])\widetilde{q}_{j,h}\in\Delta([T]) is defined as follows: for t[T]t\in[T],
q~j,h(t):=exp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg)))t=1Texp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg))).\displaystyle\widetilde{q}_{j,h}(t):=\frac{\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t)}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}{\sum_{t^{\prime}=1}^{T}\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t^{\prime})}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}.
11:     Draw KK i.i.d. samples 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}.
12:      For each a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, define R^m+1,h(a):=1Kk=1KRm+1,h(sh,(𝐚hk,a))\widehat{R}_{m+1,h}(a^{\prime}):=\frac{1}{K}\sum_{k=1}^{K}R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})). Here, we use the oracle 𝒪G\mathcal{O}_{G} to compute Rm+1,h(sh,(𝐚hk,a))R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})) for each tuple (𝐚hk,a)(\mathbf{a}_{h}^{k},a^{\prime}).
13:     For each j[m]j\in[m], draw aj,hσj,h(t)(|τj,h1,sh)a_{j,h}\sim\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\cdot|\tau_{j,h-1},s_{h}).
14:      Choose the action am+1,ha_{m+1,h} of player m+1m+1 as follows: (Action am+1,ha_{m+1,h} corresponds to the action selected by the policy πm+1\pi_{m+1}^{\dagger} of player m+1m+1 defined within the proof of Lemma C.3; this policy is well-defined because the action profiles of all players i[m]i\in[m] can be extracted from the lower-order bits of player m+1m+1’s reward)
am+1,h:=argmaxa𝒜m+1{R^m+1,h(a)}.\displaystyle a_{m+1,h}:=\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}. (15)
15:      For each j[m+1]j\in[m+1], let rj,h=Rj,h(sh,(a1,h,,am+1,h))r_{j,h}=R_{j,h}(s_{h},(a_{1,h},\ldots,a_{m+1,h})).
16:      Each player jj constructs τj,h\tau_{j,h} by updating τj,h1\tau_{j,h-1} with (sh,aj,h,rj,h)(s_{h},a_{j,h},r_{j,h}).
17:     if R^m+1,h(am+1,h)14(m+1)ϵ/H\widehat{R}_{m+1,h}(a_{m+1,h})\leq 14(m+1)\cdot\epsilon/H then return q^h:=×j[m]q^j,h\widehat{q}_{h}:=\bigtimes_{j\in[m]}\widehat{q}_{j,h} as a candidate approximate Nash equilibrium for GG.      
18:if the for loop terminates without returning: return fail.
Algorithm 2 Algorithm to compute Nash equilibrium used in proof of Theorem C.1.

It is evident that this construction takes polynomial time and satisfies |𝒢|mn0n|\mathcal{G}|\leq mn_{0}\leq n. Furthermore, it is clear that a single generative model oracle call for the Markov game 𝒢\mathcal{G} (per Definition 5.1) can be implemented using at most 2 calls to the oracle 𝒪G\mathcal{O}_{G} for the normal-form game GG. We will now show that, by applying the algorithm \mathscr{B} to 𝒢\mathcal{G}, we can efficiently (in terms of runtime and oracle calls) compute a 16(m+1)ϵ16(m+1)\cdot\epsilon-approximate Nash equilibrium for the original game GG. To do so, we appeal to Algorithm 2.

Algorithm 2 proceeds as follows. First, it calls the algorithm \mathscr{B} on the (m+1)(m+1)-player Markov game 𝒢(G)\mathcal{G}(G), using the oracle 𝒪G\mathcal{O}_{G} to simulate \mathscr{B}’s calls to the generative model oracle for 𝒢\mathcal{G}. By assumption, the algorithm \mathscr{B} returns a sequence σ(1),,σ(T)\sigma^{\scriptscriptstyle{(1)}},\ldots,\sigma^{\scriptscriptstyle{(T)}} of product policies of the form σ(t)=(σ1(t),,σm+1(t))\sigma^{\scriptscriptstyle{(t)}}=(\sigma^{\scriptscriptstyle{(t)}}_{1},\ldots,\sigma^{\scriptscriptstyle{(t)}}_{m+1}), so that each σj(t)Πjgen,rnd\sigma^{\scriptscriptstyle{(t)}}_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j} is NN-computable, and so that the average σ¯:=1Tt=1T𝕀σ(t)\overline{\sigma}:=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{\sigma^{\scriptscriptstyle{(t)}}} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}. Next, Algorithm 2 samples a trajectory from 𝒢\mathcal{G} in which:

  • Players 1,,m1,\ldots,m each play according to a policy σ(t)\sigma^{\scriptscriptstyle{(t^{\star})}} for an index t[T]t^{\star}\in[T] chosen uniformly at the start of the episode.

  • Player m+1m+1 plays according to a strategy that, at each step h[H]h\in[H], computes distributions q^j,h\widehat{q}_{j,h} representing its “belief” of what action each player j[m]j\in[m] will play at step hh (Line 10), and plays an approximate best response to the product of the strategies q^j,h\widehat{q}_{j,h}, j[m]j\in[m] (Line 14).

In order to avoid exponential dependence on the number of players mm when computing an approximate best response to ×j[m]q^j,h\bigtimes_{j\in[m]}\widehat{q}_{j,h}, we draw K:=4log(mn0/δ)/ϵ2K:=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil (for δ=ϵ/(6H)\delta=\epsilon/(6H)) samples from ×j[m]q^j,h\bigtimes_{j\in[m]}\widehat{q}_{j,h} and use these samples to compute the best response. In particular, letting 𝐚hk𝒜\mathbf{a}_{h}^{k}\in\mathcal{A} denote the kkth sampled action profile, we construct a function R^m+1,h:𝒜m+1\widehat{R}_{m+1,h}:\mathcal{A}_{m+1}\rightarrow\mathbb{R} in Lines 11 and 12 which, for each a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, is defined as the average over samples {𝐚hk}k[K]\{\mathbf{a}_{h}^{k}\}_{k\in[K]} of the realized payoffs Rm+1,h(sh,(𝐚hk,a))R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})); note that to compute each such payoff, Algorithm 2 needs only two oracle calls to 𝒪G\mathcal{O}_{G}. A small illustrative sketch of this sampling step is given below.
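A Python sketch (ours, with illustrative names) of this sampling step; the callable reward_m1 stands in for the two oracle calls to 𝒪_G needed per payoff evaluation.

```python
# Sketch of the sampled best response in Lines 11-14 of Algorithm 2: draw K profiles
# from the product of the beliefs q_hat_{j,h}, estimate player (m+1)'s reward for
# every suggestion a', and return the empirical argmax.
import numpy as np

def sampled_best_response(q_hats, reward_m1, suggestions, K, rng):
    """q_hats: list of m probability vectors, q_hats[j] = belief over player (j+1)'s actions;
    reward_m1(actions, a_prime): player (m+1)'s reward R_{m+1,h}(s_h, (actions, a_prime));
    suggestions: the action set A_{m+1}, e.g. [(j, a) for j in range(m) for a in range(A)]."""
    samples = [tuple(rng.choice(len(q), p=q) for q in q_hats) for _ in range(K)]
    r_hat = {a_prime: float(np.mean([reward_m1(a, a_prime) for a in samples]))
             for a_prime in suggestions}
    best = max(r_hat, key=r_hat.get)
    return best, r_hat[best]        # the argmax in (15) and its estimated reward
```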

The following lemma, proven in the sequel, gives a correctness guarantee for Algorithm 2.

Lemma C.3 (Correctness of Algorithm 2).

Given any mm-player n0n_{0}-action normal form game GG, if the algorithm \mathscr{B} solves the (T,ϵ,N)(T,\epsilon,N)-SparseCCE problem for the game 𝒢(G)\mathcal{G}(G) with T,ϵ,NT,\epsilon,N satisfying Texp(n0ϵ2/m2)T\leq\exp(n_{0}\epsilon^{2}/m^{2}), then Algorithm 2 outputs a 16(m+1)ϵ16(m+1)\cdot\epsilon-approximate Nash equilibrium of GG with probability at least 1/31/3, and otherwise fails.

The assumption that T<exp(ϵ2n/mm2)T<\exp\left(\frac{\epsilon^{2}\cdot\lfloor n/m\rfloor}{m^{2}}\right) from the statement of Theorem C.1 yields that Texp(n0ϵ2/m2)T\leq\exp(n_{0}\epsilon^{2}/m^{2}), so Lemma C.3 yields that Algorithm 2 outputs a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG with probability at least 1/31/3 (and otherwise fails). By iterating Algorithm 2 O(log(1/δ))O(\log(1/\delta)) times, we may thus compute a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG with failure probability at most δ\delta.

We now analyze the oracle cost and computational cost of Algorithm 2. It takes 2Q2Q oracle calls to 𝒪G\mathcal{O}_{G} to simulate the QQ generative model oracle calls of \mathscr{B}, and therefore, if \mathscr{B} runs in time UU, then the call to \mathscr{B} on Line 5, using oracle calls to 𝒪G\mathcal{O}_{G} to simulate the generative model oracle calls, runs in time O(U)O(U). Next, the computations of q~j,h\widetilde{q}_{j,h} (and thus q^j,h\widehat{q}_{j,h}) in Line 10 can be performed in (nmTN)O(1)(nmTN)^{O(1)} time, the computation of R^m+1,h:𝒜m+1\widehat{R}_{m+1,h}:\mathcal{A}_{m+1}\rightarrow\mathbb{R} in Line 12 requires time (and oracle calls to 𝒪G\mathcal{O}_{G}) bounded above by O(|𝒜m+1|K)(nmlog(1/δ)/ϵ)O(1)O(|\mathcal{A}_{m+1}|\cdot K)\leq(nm\log(1/\delta)/\epsilon)^{O(1)}, constructing the actions aj,ha_{j,h} (for j[m+1]j\in[m+1]) in Lines 13 and 14 takes time (Nmn)O(1)(Nmn)^{O(1)} (using the fact that the policies σj,h(t)\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}} are NN-computable), and constructing the rewards rj,hr_{j,h} on Line 15 requires another 2(m+1)2(m+1) oracle calls to 𝒪G\mathcal{O}_{G}. Altogether, Algorithm 2 requires 2Q+(nmlog(1/δ)/ϵ)C02Q+(nm\log(1/\delta)/\epsilon)^{C_{0}} oracle calls to 𝒪G\mathcal{O}_{G} and, if \mathscr{B} runs in time UU, then Algorithm 2 takes time (nmTNUlog(1/δ)/ϵ)C0(nmTNU\log(1/\delta)/\epsilon)^{C_{0}}, for some absolute constant C0C_{0}.

Remark C.4 (Bit complexity of exponential weights updates).

In the above proof we have noted that q~j,h\widetilde{q}_{j,h} (as defined in Line 10 of Algorithm 2) can be computed in time (nmTN)O(1)(nmTN)^{O(1)}. A detail we do not handle formally is that, since the values of q~j,h(t)\widetilde{q}_{j,h}(t) are in general irrational, only the (nmTN)O(1)(nmTN)^{O(1)} most significant bits of each real number q~j,h(t)\widetilde{q}_{j,h}(t) can be computed in time (nmTN)O(1)(nmTN)^{O(1)}. To give a truly polynomial-time implementation of Algorithm 2, one can compute only the (nmTN)O(1)(nmTN)^{O(1)} most significant bits of each distribution q~j,h\widetilde{q}_{j,h}, which is sufficient to approximate the true value of q^j,h\widehat{q}_{j,h} to within exp((nmTN)O(1))\exp(-(nmTN)^{O(1)}) in total variation distance. Since q^j,h\widehat{q}_{j,h} only influences the subsequent execution of Algorithm 2 via the samples 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h} drawn in Line 11, by a union bound, the approximation of q^j,h\widehat{q}_{j,h} we have described perturbs the execution of the algorithm by at most O(KH)exp((nmTN)O(1))O(KH)\cdot\exp(-(nmTN)^{O(1)}) in total variation distance. In particular, the correctness guarantee of Lemma C.3 still holds, with success probability at least 1/3exp((nmTN)O(1))>1/41/3-\exp(-(nmTN)^{O(1)})>1/4.

It remains to prove Lemma C.3, which is the bulk of the proof of Theorem C.1.

Proof of Lemma C.3.

We will establish the following two facts:

  1. 1.

    First, the choices of am+1,ha_{m+1,h} in Line 14 (i.e., Eq. 15) of Algorithm 2 correspond to a valid policy πm+1Πgen,rnd\pi_{m+1}^{\dagger}\in{\Pi}^{\mathrm{gen,rnd}} for player m+1m+1 (representing a strategy for deviating from the equilibrium σ¯\overline{\sigma}), in that they can be expressed as a function of player (m+1)(m+1)’s history, (τm+1,h1,sh)(\tau_{m+1,h-1},s_{h}) at each step hh.

  2. 2.

    Second, we will show that, since σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}, the strategy πm+1\pi_{m+1}^{\dagger} cannot lead to a large increase of value for player m+1m+1, which will imply that Algorithm 2 must return a Nash equilibrium with high enough probability.

Defining πi\pi_{i}^{\dagger} for i[m+1]i\in[m+1].

We begin by constructing the policy πm+1\pi_{m+1}^{\dagger} described above; for later use in the proof, it will be convenient to construct a collection of closely related policies πiΠgen,rnd\pi_{i}^{\dagger}\in{\Pi}^{\mathrm{gen,rnd}} for i[m]i\in[m], also representing strategies for deviating from the equilibrium σ¯\overline{\sigma}.

Let i[m+1]i\in[m+1] be fixed. For h[H]h\in[H], the mapping πi,h:i,h1×𝒮𝒜i\pi_{i,h}^{\dagger}:\mathscr{H}_{i,h-1}\times\mathcal{S}\rightarrow\mathcal{A}_{i} is defined as follows. Given a history τi,h1=(s1,ai,1,ri,1,,sh1,ai,h1,ri,h1)i,h1\tau_{i,h-1}=(s_{1},a_{i,1},r_{i,1},\ldots,s_{h-1},a_{i,h-1},r_{i,h-1})\in\mathscr{H}_{i,h-1} (we assume without loss of generality that τi,h1\tau_{i,h-1} occurs with positive probability under some sequence of general policies) and a current state shs_{h}, we define πi,h(τi,h1,sh)𝒜i\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h})\in\mathcal{A}_{i} through the following process.

  1. 1.

    First, we claim that for all players j[m+1]{i}j\in[m+1]\setminus\{i\}, it is possible to extract the trajectory τj,h1\tau_{j,h-1} from the trajectory τi,h1\tau_{i,h-1} of player ii.

    1. (a)

      Recall that for each g<hg<h, from the definition in Equation 13 and the function 𝖾𝗇𝖼(𝐚)\mathsf{enc}(\mathbf{a}), the bits following position 3log1/ϵ3\lceil\log 1/\epsilon\rceil of the reward ri,gr_{i,g} given to player ii at step gg of the trajectory τi,g1\tau_{i,g-1} encode an action profile 𝐚g𝒜¯\mathbf{a}_{g}\in\overline{\mathcal{A}}. Since τi,h1\tau_{i,h-1} occurs with positive probability, this is precisely the action profile which was played by agents at step gg. Note we also use here that by definition of the rewards Rj,h(s,𝐚)R_{j,h}(s,\mathbf{a}) in (13), the component R¯j,h(s,𝐚)\overline{R}_{j,h}(s,\mathbf{a}) of the reward only affects the first 2log1/ϵ2\lceil\log 1/\epsilon\rceil bits.

    2. (b)

      For g<hg<h and j[m+1]\{i}j\in[m+1]\backslash\{i\}, define rj,g:=Rj,g(sg,𝐚g)r_{j,g}:=R_{j,g}(s_{g},\mathbf{a}_{g}).

    3. (c)

      For j[m+1]\{i}j\in[m+1]\backslash\{i\}, write τj,h1:=(s1,aj,1,rj,1,,sh1,aj,h1,rj,h1)\tau_{j,h-1}:=(s_{1},a_{j,1},r_{j,1},\ldots,s_{h-1},a_{j,h-1},r_{j,h-1}); in particular, τj,h1\tau_{j,h-1} is a deterministic function of (τi,h1,sh)(\tau_{i,h-1},s_{h}). (Note that, since τi,h1\tau_{i,h-1} occurs with positive probability, the history τj,h1\tau_{j,h-1} observed by player jj up to step h1h-1 can be computed from it via Steps (a) and (b)). Going forward, for g<h1g<h-1, we let τj,g\tau_{j,g} denote the prefix of τj,h1\tau_{j,h-1} up to step gg.

  2. 2.

    Now, using that player ii can compute all players’ trajectories, for each j[m+1]j\in[m+1] we define

    q^j,h:=𝔼tq~j,h[σj,h(t)(τj,h1,sh)]Δ(𝒜j),\displaystyle\widehat{q}_{j,h}:=\mathbb{E}_{t\sim\widetilde{q}_{j,h}}\left[\sigma_{j,h}^{\scriptscriptstyle{(t)}}(\tau_{j,h-1},s_{h})\right]\in\Delta(\mathcal{A}_{j}), (16)

    where q~j,hΔ([T])\widetilde{q}_{j,h}\in\Delta([T]) is defined as follows: for t[T]t\in[T],

    q~j,h(t):=exp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg)))t=1Texp(g<hlog(1σj,g(t)(aj,g|τj,g1,sg))).\displaystyle\widetilde{q}_{j,h}(t):=\frac{\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t)}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}{\sum_{t^{\prime}=1}^{T}\exp\left(-\sum_{g<h}\log\left(\frac{1}{\sigma^{\scriptscriptstyle{(t^{\prime})}}_{j,g}(a_{j,g}|\tau_{j,g-1},s_{g})}\right)\right)}. (17)

    Note that q^j,h\widehat{q}_{j,h} is a random variable which depends on the trajectory (τj,h1,sh)(\tau_{j,h-1},s_{h}) (which can be computed from (τi,h1,sh)(\tau_{i,h-1},s_{h})). In addition, the definition of q^j,h\widehat{q}_{j,h} (for each j[m]j\in[m]) is exactly as is defined in Line 10 of Algorithm 2.

  3. 3.

    For i[m]i\in[m], define πi,h(τi,h1,sh)\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}) as follows:

    πi,h(τi,h1,sh):=argmaxa𝒜i𝔼𝐚i×jiq^j,h[Ri,h(sh,(a,𝐚i))].\displaystyle\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}):=\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}_{-i}\sim\bigtimes_{j\neq i}\widehat{q}_{j,h}}\left[R_{i,h}(s_{h},(a^{\prime},\mathbf{a}_{-i}))\right]. (18)

    For the case i=m+1i=m+1, define πm+1,h(τm+1,h1,sh)Δ(𝒜m+1)\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h})\in\Delta(\mathcal{A}_{m+1}) (implicitly) to be the following distribution over am+1,h𝒜m+1a_{m+1,h}^{\dagger}\in\mathcal{A}_{m+1}: draw 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}, define R^m+1,h(a):=1Kk=1KRm+1,h(sh,(𝐚hk,a))\widehat{R}_{m+1,h}(a^{\prime}):=\frac{1}{K}\sum_{k=1}^{K}R_{m+1,h}(s_{h},(\mathbf{a}_{h}^{k},a^{\prime})) for a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, and finally set

    am+1,h:=argmaxa𝒜m+1{R^m+1,h(a)}.\displaystyle a_{m+1,h}^{\dagger}:=\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}. (19)

    Note that, for each choice of (τm+1,h1,sh)(\tau_{m+1,h-1},s_{h}), the distribution πm+1,h(τm+1,h1,sh)\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h}) as defined above coincides with the distribution of the action am+1,ha_{m+1,h}^{\dagger} defined in Eq. 15 in Algorithm 2, when player m+1m+1’s history is τm+1,h1\tau_{m+1,h-1} and the state at step hh is shs_{h}. The following lemma, for use later in the proof, bounds the approximation error incurred in sampling 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}.

    Lemma C.5.

    Fix any (τm+1,h1,sh)m+1,h1×𝒮(\tau_{m+1,h-1},s_{h})\in\mathscr{H}_{m+1,h-1}\times\mathcal{S}. With probability at least 1δ1-\delta over the draw of 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}, it holds that for all a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1},

    |R^m+1,h(a)𝔼ajq^j,hj[m][Rm+1,h(sh,(a1,,am,a))]|ϵH,\displaystyle\left|\widehat{R}_{m+1,h}(a^{\prime})-\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a^{\prime}))]\right|\leq\frac{\epsilon}{H},

    which implies in particular that with probability at least 1δ1-\delta over the draw of am+1,hπm+1,h(τm+1,h1,sh)a_{m+1,h}^{\dagger}\sim\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h}),

    maxa𝒜m+1{𝔼ajq^j,hj[m][Rm+1,h(sh,(a1,,am,a))]}2ϵH\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a^{\prime}))]\right\}-\frac{2\epsilon}{H}\leq 𝔼ajq^j,hj[m][Rm+1,h(sh,(a1,,am,am+1,h))].\displaystyle\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a_{m+1,h}^{\dagger}))]. (20)
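    For intuition, Lemma C.5 is consistent with a standard Hoeffding-plus-union-bound calculation (this is our own back-of-the-envelope sketch, not the paper's proof, and constants are not tracked carefully):

```latex
% Each sampled reward lies in an interval of length at most (2+\epsilon^{3})/H \le 3/H,
% so Hoeffding's inequality gives, for each fixed a' \in \mathcal{A}_{m+1},
\Pr\left[\,\bigl|\widehat{R}_{m+1,h}(a') - \mathbb{E}\bigl[R_{m+1,h}(s_h,(a_1,\ldots,a_m,a'))\bigr]\bigr| > \tfrac{\epsilon}{H}\,\right]
   \;\le\; 2\exp\!\left(-\tfrac{2K(\epsilon/H)^{2}}{(3/H)^{2}}\right)
   \;=\; 2\exp\!\left(-\tfrac{2K\epsilon^{2}}{9}\right).
% A union bound over the |\mathcal{A}_{m+1}| = m n_{0} suggestions a' makes the total
% failure probability at most 2 m n_{0}\exp(-2K\epsilon^{2}/9) \le \delta whenever
K \;\gtrsim\; \frac{\log(m n_{0}/\delta)}{\epsilon^{2}},
% matching the choice K = \lceil 4\log(m n_{0}/\delta)/\epsilon^{2}\rceil in Algorithm 2 up to constants.
```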

It is immediate from our construction above that the following fact holds.

Lemma C.6.

The joint distribution of τj,h\tau_{j,h}, for j[m+1]j\in[m+1] and h[H]h\in[H], as computed by Algorithm 2, coincides with the distribution of τj,h\tau_{j,h} in an episode of 𝒢\mathcal{G} when players follow the policy πm+1×σ¯(m+1)\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}.

Analyzing the distributions q^j,h\widehat{q}_{j,h}.

Fix any i[m+1]i\in[m+1]. We next prove some facts about the distributions q^j,h\widehat{q}_{j,h} defined above (as a function of (τi,h1,sh)(\tau_{i,h-1},s_{h})) in the process of computing πi,h(τi,h1,sh)\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}).

For each h[H]h\in[H], consider any choice of (τi,h1,sh)i,h1×𝒮(\tau_{i,h-1},s_{h})\in\mathscr{H}_{i,h-1}\times\mathcal{S}; note that for each j[m+1]j\in[m+1], the distributions q^j,hΔ(𝒜j)\widehat{q}_{j,h}\in\Delta(\mathcal{A}_{j}) for h[H]h\in[H] may be viewed as an application of Vovk’s aggregating algorithm (Proposition B.2) in the following setting: the number of steps (TT, in the notation of Proposition B.2; note that TT has a different meaning in the present proof) is HH, the context space is h=1Hj,h1×𝒮\bigcup_{h=1}^{H}\mathscr{H}_{j,h-1}\times\mathcal{S}, and the outcome space is 𝒜j\mathcal{A}_{j}. The expert set is ={ρj(1),,ρj(T)}\mathcal{I}=\{\rho_{j}^{\scriptscriptstyle{(1)}},\ldots,\rho_{j}^{\scriptscriptstyle{(T)}}\} (which has ||=T\lvert\mathcal{I}\rvert=T), and the experts’ predictions on a context (τj,h1,s)j,h1×𝒮(\tau_{j,h-1},s)\in\mathscr{H}_{j,h-1}\times\mathcal{S} are defined via ρj(t)(|τj,h1,s):=σj,h(t)(|τj,h1,s)Δ(𝒜j)\rho_{j}^{\scriptscriptstyle{(t)}}(\cdot|\tau_{j,h-1},s):=\sigma_{j,h}^{\scriptscriptstyle{(t)}}(\cdot|\tau_{j,h-1},s)\in\Delta(\mathcal{A}_{j}). Then for each h[H]h\in[H], the distribution q^j,h\widehat{q}_{j,h} is obtained by updating the aggregating algorithm with the context-observation pairs ((τj,h1,sh),aj,h)((\tau_{j,h^{\prime}-1},s_{h^{\prime}}),a_{j,h^{\prime}}) for h=1,2,,h1h^{\prime}=1,2,\ldots,h-1.

In more detail, fix any t[T]t^{\star}\in[T] and j[m+1]j\in[m+1] with iji\neq j. We may apply Proposition B.2 with the number of steps set to HH, the set of experts as ={ρj(1),,ρj(T)}\mathcal{I}=\{\rho_{j}^{\scriptscriptstyle{(1)}},\ldots,\rho_{j}^{\scriptscriptstyle{(T)}}\}, and contexts and outcomes generated according to the distribution induced by running the policy πi×σi(t)\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}} in the Markov game 𝒢\mathcal{G} as follows:

  • For each h[H]h\in[H], we are given, at steps h<hh^{\prime}<h, the actions ak,ha_{k,h^{\prime}} and rewards rk,hr_{k,h^{\prime}} for all agents k[m+1]k\in[m+1], as well as the states s1,,shs_{1},\ldots,s_{h}.

    • For each k[m+1]k\in[m+1], set τk,h1=(s1,ak,1,rk,1,,sh1,ak,h1,rk,h1)\tau_{k,h-1}=(s_{1},a_{k,1},r_{k,1},\ldots,s_{h-1},a_{k,h-1},r_{k,h-1}) to be agent kk’s history.

    • The context fed to the aggregation algorithm at step hh is (τj,h1,sh)(\tau_{j,h-1},s_{h}).

    • The outcome at step hh is given by aj,hσj,h(t)(|τj,h1,sh)a_{j,h}\sim\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\cdot|\tau_{j,h-1},s_{h}); note that this choice satisfies the realizability assumption in Proposition B.2.

    • To aid in generating the next context at step h+1h+1, choose ak,hσk,ht(τk,h1,sh)a_{k,h}\sim\sigma_{k,h}^{t^{\star}}(\tau_{k,h-1},s_{h}) for all k[m+1]\{i,j}k\in[m+1]\backslash\{i,j\} and ai,h=πi,h(τi,h1,sh)a_{i,h}=\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}). Then set sh+1s_{h+1} to be the next state given the transitions of 𝒢\mathcal{G} and the action profile 𝐚h=(a1,h,,am+1,h)\mathbf{a}_{h}=(a_{1,h},\ldots,a_{m+1,h}).

By Proposition B.2, it follows that for any fixed t[T]t^{\star}\in[T] and j[m+1]j\in[m+1] with jij\neq i, under the process described above we have

𝔼πi×σi(t)[h=1HD𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h)]HlogT.\displaystyle\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\left[\sum_{h=1}^{H}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\right]\leq\sqrt{H\cdot\log T}. (21)

Analyzing the value of πm+1\pi_{m+1}^{\dagger}.

Next, using the development above, we show that Algorithm 2 successfully computes a Nash equilibrium with constant probability (via πm+1\pi_{m+1}^{\dagger}) whenever σ¯\overline{\sigma} is an ϵ\epsilon-CCE. We first state the following claim, which is proven in the sequel by analyzing the values Viπi×σ¯iV_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}} for i[m]i\in[m].

Lemma C.7.

If σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G}, then it holds that for all i[m]i\in[m],

Viσ¯ϵmlog(T)/H.\displaystyle V_{i}^{\overline{\sigma}}\geq-\epsilon-m\sqrt{\log(T)/H}.

Note that in the game 𝒢\mathcal{G}, since for all h[H]h\in[H], s𝒮s\in\mathcal{S} and 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}, it holds that |j=1m+1Rj,h(s,𝐚)|(m+1)ϵ2H\left|\sum_{j=1}^{m+1}R_{j,h}(s,\mathbf{a})\right|\leq\frac{(m+1)\epsilon^{2}}{H} (which holds since in (13), 𝖾𝗇𝖼(𝐚)\mathsf{enc}(\mathbf{a}) is multiplied by 1H23log1/ϵ\frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}), it follows that |j=1m+1Vjσ¯|(m+1)ϵ2\left|\sum_{j=1}^{m+1}V_{j}^{\overline{\sigma}}\right|\leq(m+1)\epsilon^{2}. Thus, by Lemma C.7, we have Vm+1σ¯(m+1)ϵ2+m(ϵ+mlog(T)/H)V_{m+1}^{\overline{\sigma}}\leq(m+1)\epsilon^{2}+m\cdot(\epsilon+m\sqrt{\log(T)/H}), and since σ¯\overline{\sigma} is an ϵ\epsilon-CCE of 𝒢\mathcal{G} it follows that

Vm+1πm+1×σ¯(m+1)2(m+1)ϵ+m2log(T)/H.\displaystyle V_{m+1}^{\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}}\leq 2(m+1)\cdot\epsilon+m^{2}\cdot\sqrt{\log(T)/H}. (22)

To simplify notation, we will write q^h:=q^1,h××q^m,h\widehat{q}_{h}:=\widehat{q}_{1,h}\times\cdots\times\widehat{q}_{m,h} in the below calculations, where we recall that each q^j,h\widehat{q}_{j,h} is determined given the history up to step hh, (τj,h1,sh)(\tau_{j,h-1},s_{h}), as defined in (16) and (17). An action profile drawn from q^h\widehat{q}_{h} is denoted as 𝐚q^h\mathbf{a}\sim\widehat{q}_{h}, with 𝐚𝒜\mathbf{a}\in\mathcal{A}. We may now write Vm+1πm+1×σ¯(m+1)V_{m+1}^{\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}} as follows:

Vm+1πm+1×σ¯(m+1)\displaystyle V_{m+1}^{\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}}
=\displaystyle= 𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)𝔼aj,hσj,h(t)(τj,h1,sh)j[m]am+1,hπm+1,h(τm+1,h1,sh)𝐚:=(a1,h,,am+1,h)[Rm+1,h(sh,𝐚)]\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\mathbb{E}_{\begin{subarray}{c}a_{j,h}\sim\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})\ \forall j\in[m]\\ a_{m+1,h}\sim\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h})\\ \mathbf{a}:=(a_{1,h},\ldots,a_{m+1,h})\end{subarray}}\left[R_{m+1,h}(s_{h},\mathbf{a})\right]
\displaystyle\geq 𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)(𝔼aj,hq^j,hj[m]am+1,hπm+1,h(τm+1,h1,sh)𝐚:=(a1,h,,am+1,h)[Rm+1,h(sh,𝐚)]1Hj[m]D𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h))\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\Bigg{(}\mathbb{E}_{\begin{subarray}{c}a_{j,h}\sim\widehat{q}_{j,h}\ \forall j\in[m]\\ a_{m+1,h}\sim\pi_{m+1,h}^{\dagger}(\tau_{m+1,h-1},s_{h})\\ \mathbf{a}:=(a_{1,h},\ldots,a_{m+1,h})\end{subarray}}\left[R_{m+1,h}(s_{h},\mathbf{a})\right]-\frac{1}{H}\sum_{j\in[m]}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\Bigg{)}
\displaystyle\geq 𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)(maxam+1,h𝒜m+1𝔼𝐚q^h[Rm+1,h(sh,(𝐚,am+1,h))]2ϵHδH1Hj[m]D𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h))\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\Bigg{(}\max_{a_{m+1,h}^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}\left[R_{m+1,h}(s_{h},(\mathbf{a},a_{m+1,h}^{\prime}))\right]-\frac{2\epsilon}{H}-\frac{\delta}{H}-\frac{1}{H}\sum_{j\in[m]}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\Bigg{)}
\displaystyle\geq 1H𝔼t[T]h=1H𝔼πm+1×σ(m+1)(t)(maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚])mHHlogT2ϵδϵ2,\displaystyle\frac{1}{H}\cdot\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\left(\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\right)-\frac{m}{H}\cdot\sqrt{H\log T}-2\epsilon-\delta-\epsilon^{2},

where:

  • The first inequality follows from the fact that Rm+1,h()R_{m+1,h}(\cdot) takes values in [1/H,1/H][-1/H,1/H] and the fact that the total variation between product distributions is bounded above by the sum of total variation distances between each of the pairs of component distributions.

  • The second inequality follows from the inequality (20) of Lemma C.5.

  • The final inequality follows from the definition of the rewards in (13) and (14), and by summing (21) over j\in[m]. We remark that the -\epsilon^{2} term in the final line comes from the term \frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}\cdot\mathsf{enc}(\mathbf{a}) in (13).

Rearranging and using (22) as well as the fact that δ+ϵ2=ϵ/(6H)+ϵ2ϵ\delta+\epsilon^{2}=\epsilon/(6H)+\epsilon^{2}\leq\epsilon (as ϵ1/2\epsilon\leq 1/2), we get that

𝔼t[T]𝔼πm+1×σ(m+1)(t)h=1H(maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚])\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\mathbb{E}_{\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}}}\sum_{h=1}^{H}\left(\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\right)
\displaystyle\leq 2Hϵ(m+1)+(m+1)mHlogT+3Hϵ.\displaystyle 2H\cdot\epsilon\cdot(m+1)+(m+1)m\cdot\sqrt{H\log T}+3H\epsilon.

Since q^h\widehat{q}_{h} is a product distribution a.s., we have that

maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚]0.\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\geq 0.

Therefore, by Markov’s inequality, with probability at least 1/21/2 over the choice of t[T]t^{\star}\sim[T] and the trajectories (τj,h1,sh)πm+1×σ(m+1)(t)(\tau_{j,h-1},s_{h})\sim\pi_{m+1}^{\dagger}\times\sigma_{-(m+1)}^{\scriptscriptstyle{(t^{\star})}} for j[m]j\in[m] (which collectively determine q^h\widehat{q}_{h}), there is some h[H]h\in[H] so that

maxj[m],aj,h𝒜j𝔼𝐚q^h[(Mj)aj,𝐚j(Mj)𝐚]10(m+1)ϵ+2(m+1)mlog(T)/H12(m+1)ϵ,\displaystyle\max_{j\in[m],a_{j,h}^{\prime}\in\mathcal{A}_{j}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}[(M_{j})_{a_{j}^{\prime},\mathbf{a}_{-j}}-(M_{j})_{\mathbf{a}}]\leq 10(m+1)\cdot\epsilon+2(m+1)m\cdot\sqrt{\log(T)/H}\leq 12(m+1)\cdot\epsilon, (23)

where the final inequality follows as long as Hϵ2m2logTH\cdot\epsilon^{2}\geq m^{2}\log T, i.e., Texp(Hϵ2m2)T\leq\exp\left(\frac{H\cdot\epsilon^{2}}{m^{2}}\right), which holds since Hn0H\geq n_{0} and we have assumed that Texp(ϵ2n0/m2)T\leq\exp(\epsilon^{2}\cdot n_{0}/m^{2}).

Note that (23) implies that with probability at least 1/21/2 under an episode drawn from πm+1×σ¯(m+1)\pi_{m+1}^{\dagger}\times\overline{\sigma}_{-(m+1)}, there is some h[H]h\in[H] so that q^h\widehat{q}_{h} is a 12(m+1)ϵ12(m+1)\cdot\epsilon-Nash equilibrium of the stage game GG. Thus, by Lemma C.6, with probability at least 1/21/2 under an episode drawn from the distribution of Algorithm 2, there is some h[H]h\in[H] so that q^h\widehat{q}_{h} is a 12(m+1)ϵ12(m+1)\cdot\epsilon-Nash equilibrium of GG.

Finally, the following two observations conclude the proof of Lemma C.3.

  • If \widehat{q}_{h} is a 12(m+1)\cdot\epsilon-Nash equilibrium of G, then by the definition of the reward function R_{m+1,h}(\cdot) in (13), and upper bounding \frac{1}{H}\cdot 2^{-3\lceil\log 1/\epsilon\rceil}\cdot\mathsf{enc}(\mathbf{a}) by \epsilon^{2}/H, we have

    maxa𝒜m+1𝔼𝐚q^h[Rm+1,h(𝔰,(𝐚,a))]1H12(m+1)ϵ+ϵ2H,\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}\left[R_{m+1,h}(\mathfrak{s},(\mathbf{a},a^{\prime}))\right]\leq\frac{1}{H}\cdot 12(m+1)\cdot\epsilon+\frac{\epsilon^{2}}{H},

    which implies, by Lemma C.5, that with probability at least 1δ1-\delta over the draw of 𝐚h1,,𝐚hK\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K},

    maxa𝒜m+1{R^m+1,h(a)}1H12(m+1)ϵ+ϵ2H+ϵH1H14(m+1)ϵ,\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}\leq\frac{1}{H}\cdot 12(m+1)\cdot\epsilon+\frac{\epsilon^{2}}{H}+\frac{\epsilon}{H}\leq\frac{1}{H}\cdot 14(m+1)\cdot\epsilon,

    i.e., the check in Line 17 of Algorithm 2 will pass and the algorithm will return q^h\widehat{q}_{h} (if step hh is reached).

  • Conversely, if \max_{a^{\prime}\in\mathcal{A}_{m+1}}\left\{\widehat{R}_{m+1,h}(a^{\prime})\right\}\leq\frac{1}{H}\cdot 14(m+1)\cdot\epsilon, i.e., the check in Line 17 passes, then by Lemma C.5, with probability at least 1-\delta over \mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K},

    maxa𝒜m+1𝔼𝐚q^h[Rm+1,h(𝔰,(𝐚,a))]1H14(m+1)ϵ+ϵH1H15(m+1)ϵ,\displaystyle\max_{a^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim\widehat{q}_{h}}\left[R_{m+1,h}(\mathfrak{s},(\mathbf{a},a^{\prime}))\right]\leq\frac{1}{H}\cdot 14(m+1)\cdot\epsilon+\frac{\epsilon}{H}\leq\frac{1}{H}\cdot 15(m+1)\cdot\epsilon,

    which implies, by the definition of Rm+1,h()R_{m+1,h}(\cdot) in (13) and (14), that q^h\widehat{q}_{h} is a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG.

Taking a union bound over all HH of the probability-δ\delta failure events from Lemma C.5 for the sampling 𝐚h1,,𝐚hKq^h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\widehat{q}_{h} (for h[H]h\in[H]), as well as over the probability-1/21/2 event that there is no q^h\widehat{q}_{h} which is a 12(m+1)ϵ12(m+1)\cdot\epsilon-Nash equilibrium of GG, we obtain that with probability at least 11/2Hϵ/(6H)1/31-1/2-H\cdot\epsilon/(6H)\geq 1/3, Algorithm 2 outputs a 16(m+1)ϵ16(m+1)\cdot\epsilon-Nash equilibrium of GG. ∎

Finally, we prove the remaining claims stated without proof above.

Proof of Lemma C.5.

Since Rm+1,h(𝔰,𝐚)[1/H,1/H]R_{m+1,h}(\mathfrak{s},\mathbf{a})\in[-1/H,1/H] for each 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}, by Hoeffding’s inequality, for any fixed a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}, with probability at least 1δ/|𝒜m+1|=1δ/(mn0)1-\delta/|\mathcal{A}_{m+1}|=1-\delta/(mn_{0}) over the draw of 𝐚h1,,𝐚hK×j[m]q^j,h\mathbf{a}_{h}^{1},\ldots,\mathbf{a}_{h}^{K}\sim\bigtimes_{j\in[m]}\widehat{q}_{j,h}, it holds that

\displaystyle\left|\widehat{R}_{m+1,h}(a^{\prime})-\mathbb{E}_{a_{j}\sim\widehat{q}_{j,h}\ \forall j\in[m]}[R_{m+1,h}(s_{h},(a_{1},\ldots,a_{m},a^{\prime}))]\right|\leq\frac{2}{H}\cdot\sqrt{\frac{\log(mn_{0}/\delta)}{K}}\leq\frac{\epsilon}{H},

where the final inequality follows from the choice of K=4log(mn0/δ)/ϵ2K=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil. The statement of the lemma follows by a union bound over all |𝒜m+1||\mathcal{A}_{m+1}| actions a𝒜m+1a^{\prime}\in\mathcal{A}_{m+1}. ∎
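To illustrate the estimator analyzed in Lemma C.5 together with the threshold check discussed in the two observations above, the following Python sketch draws K=\lceil 4\log(mn_{0}/\delta)/\epsilon^{2}\rceil profiles from \widehat{q}_{h} and compares the resulting empirical estimates against the threshold \frac{1}{H}\cdot 14(m+1)\cdot\epsilon. The interfaces q_hat, actions_m1, and R_m1 are hypothetical placeholders for \widehat{q}_{1,h},\ldots,\widehat{q}_{m,h}, \mathcal{A}_{m+1}, and R_{m+1,h}; the exact statement of Line 17 of Algorithm 2 is not reproduced here.

    import math
    import random

    def passes_check(q_hat, actions_m1, R_m1, s, H, m, n0, eps, delta):
        # q_hat: list of m probability vectors, one per agent j in [m], over that
        # agent's actions (playing the role of q-hat_{j,h}).
        # actions_m1: the action set A_{m+1} of agent m+1.
        # R_m1(s, profile): the reward R_{m+1,h}(s, profile), with values in [-1/H, 1/H].
        K = math.ceil(4 * math.log(m * n0 / delta) / eps ** 2)   # as in Lemma C.5
        # Draw K i.i.d. profiles a_h^1, ..., a_h^K from the product of the q-hat_{j,h}.
        samples = [tuple(random.choices(range(len(q)), weights=q)[0] for q in q_hat)
                   for _ in range(K)]
        # Empirical estimate of R_{m+1,h} against each candidate action a' of agent m+1.
        def R_hat(a_prime):
            return sum(R_m1(s, a + (a_prime,)) for a in samples) / K
        # Threshold used in the analysis above (Line 17 of Algorithm 2 plays this role).
        return max(R_hat(a_prime) for a_prime in actions_m1) <= 14 * (m + 1) * eps / H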

Proof of Lemma C.7.

Fix any agent i\in[m]. We will argue below that the policy \pi_{i}^{\dagger}\in\Pi^{\mathrm{gen,det}}_{i} defined within the proof of Lemma C.3 satisfies V_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}}\geq-m\sqrt{\log(T)/H}. Granting this, since \overline{\sigma} is an \epsilon-CCE of \mathcal{G}, it follows that

\displaystyle\epsilon\geq V_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}}-V_{i}^{\overline{\sigma}}\geq-m\sqrt{\log(T)/H}-V_{i}^{\overline{\sigma}},

from which the result of Lemma C.7 follows after rearranging terms.

To simplify notation, let us write q^i,h:=×jiq^j,h\widehat{q}_{-i,h}:=\bigtimes_{j\neq i}\widehat{q}_{j,h}, where we recall that each q^j,h\widehat{q}_{j,h} is determined given the history up to step hh, (τj,h1,sh)(\tau_{j,h-1},s_{h}), as defined in (16) and (17). An action profile drawn from q^i,h\widehat{q}_{-i,h} is denoted by 𝐚iq^i,h\mathbf{a}_{-i}\sim\widehat{q}_{-i,h}, with 𝐚i𝒜¯i\mathbf{a}_{-i}\in\overline{\mathcal{A}}_{-i}. We compute

Viπi×σ¯i\displaystyle V_{i}^{\pi_{i}^{\dagger}\times\overline{\sigma}_{-i}}
=\displaystyle= 𝔼t[T]h=1H𝔼πi×σi(t)𝔼𝐚i×jiσj,h(t)(τj,h1,sh)[Ri,h(sh,(πi,h(τi,h1,sh),𝐚i))]\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\mathbb{E}_{\mathbf{a}_{-i}\sim\bigtimes_{j\neq i}\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})}\left[R_{i,h}(s_{h},(\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}),\mathbf{a}_{-i}))\right]
\displaystyle\geq 𝔼t[T]h=1H𝔼πi×σi(t)(𝔼𝐚iq^i,h[Ri,h(sh,(πi,h(τi,h1,sh),𝐚i))]1HjiD𝖳𝖵(σj,h(t)(τj,h1,sh),q^j,h))\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\left(\mathbb{E}_{\mathbf{a}_{-i}\sim\widehat{q}_{-i,h}}\left[R_{i,h}(s_{h},(\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}),\mathbf{a}_{-i}))\right]-\frac{1}{H}\sum_{j\neq i}D_{\mathsf{TV}}({\sigma_{j,h}^{\scriptscriptstyle{(t^{\star})}}(\tau_{j,h-1},s_{h})},{\widehat{q}_{j,h}})\right)
\displaystyle\geq 𝔼t[T]h=1H𝔼πi×σi(t)(maxai𝒜i𝔼𝐚iq^i,h[Ri,h(sh,(ai,𝐚i))])mHHlogT\displaystyle\mathbb{E}_{t^{\star}\sim[T]}\sum_{h=1}^{H}\mathbb{E}_{\pi_{i}^{\dagger}\times\sigma_{-i}^{\scriptscriptstyle{(t^{\star})}}}\left(\max_{a_{i}^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}_{-i}\sim\widehat{q}_{-i,h}}\left[R_{i,h}(s_{h},(a_{i}^{\prime},\mathbf{a}_{-i}))\right]\right)-\frac{m}{H}\cdot\sqrt{H\log T}
\displaystyle\geq mlog(T)/H,\displaystyle-m\sqrt{\log(T)/H},

where:

  • The first inequality follows from the fact that the rewards Ri,h()R_{i,h}(\cdot) take values in [1/H,1/H][-1/H,1/H] and that the total variation between product distributions is bounded above by the sum of total variation distances between each of the pairs of component distributions.

  • The second inequality follows from the definition of πi,h(τi,h1,sh)\pi_{i,h}^{\dagger}(\tau_{i,h-1},s_{h}) in terms of q^i,h\widehat{q}_{-i,h} in (18) as well as (21) applied to each jij\neq i and each t[T]t^{\star}\in[T].

  • The final inequality follows by Lemma C.8 below, applied to agent ii and to the distribution q^i,h\widehat{q}_{-i,h}, which we recall is a product distribution almost surely.

Lemma C.8.

For any i[m]i\in[m], s𝒮,h[H]s\in\mathcal{S},h\in[H], and any product distribution qΔ(𝒜¯i)q\in\Delta(\overline{\mathcal{A}}_{-i}), it holds that

maxai𝒜i𝔼𝐚q[Ri,h(s,(ai,𝐚))]0.\displaystyle\max_{a_{i}^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}\sim q}\left[R_{i,h}(s,(a_{i}^{\prime},\mathbf{a}))\right]\geq 0.
Proof.

Choose a_{i}^{\star}:=\operatorname*{arg\,max}_{a_{i}^{\prime}\in\mathcal{A}_{i}}\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\prime},\mathbf{a}}\right]. Now we compute

H𝔼𝐚q[Ri,h(s,(ai,𝐚))]\displaystyle H\cdot\mathbb{E}_{\mathbf{a}\sim q}\left[R_{i,h}(s,(a_{i}^{\star},\mathbf{a}))\right]\geq Hminam+1𝒜m+1𝔼𝐚q[Ri,h(s,(ai,am+1,𝐚(m+1)))]\displaystyle H\cdot\min_{a_{m+1}^{\prime}\in\mathcal{A}_{m+1}}\mathbb{E}_{\mathbf{a}\sim q}\left[R_{i,h}(s,(a_{i}^{\star},a_{m+1}^{\prime},\mathbf{a}_{-(m+1)}))\right]
\displaystyle\geq min(j,aj)𝒜m+1𝟙{j=i}𝔼𝐚q[(Mi)ai,𝐚(Mi)ai,𝐚]\displaystyle\min_{(j,a_{j}^{\prime})\in\mathcal{A}_{m+1}}\mathbbm{1}\{{j=i}\}\cdot\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\star},\mathbf{a}}-(M_{i})_{a_{i}^{\prime},\mathbf{a}}\right]
\displaystyle\geq 0,\displaystyle 0,

where the first inequality follows since qq is a product distribution, the second inequality uses that 𝖾𝗇𝖼()\mathsf{enc}(\cdot) is non-negative, and the final inequality follows since by choice of aia_{i}^{\star} we have 𝔼𝐚q[(Mi)ai,𝐚]𝔼𝐚q[(Mi)ai,𝐚]\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\star},\mathbf{a}}\right]\geq\mathbb{E}_{\mathbf{a}\sim q}\left[(M_{i})_{a_{i}^{\prime},\mathbf{a}}\right] for all ai𝒜ia_{i}^{\prime}\in\mathcal{A}_{i}. ∎

C.2 Remarks on bit complexity of the rewards

The Markov game \mathcal{G}(G) constructed to prove Theorem C.1 uses lower-order bits of the rewards to record the action profile taken at each step. These lower-order bits may be used by each agent to infer which actions were taken by the other agents at the previous step, and we use this idea to construct the best-response policies \pi_{i}^{\dagger} defined in the proof. As a result of this aspect of the construction, the rewards of the game \mathcal{G}(G) each take O(m\cdot\log(n)+\log(1/\epsilon)) bits to specify. As discussed in the proof of Theorem C.1, it is without loss of generality to assume that the payoffs of the given normal-form game G take O(\log 1/\epsilon) bits each to specify, so when either m\gg 1 or n\gg 1/\epsilon, the construction of \mathcal{G}(G) uses more bits to express its rewards than are used for the normal-form game G.
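As a purely illustrative sketch of this bit-level bookkeeping (the precise definition of \mathsf{enc}(\cdot) in (13) is not reproduced here), the following Python snippet packs an action-profile index into bits sitting below the O(\log 1/\epsilon) bits of the main payoff and then recovers it. The specific indexing of profiles, the common coordinate range, and the assumption that the main payoff has exactly B fractional bits are all choices made only for this example.

    import math
    from fractions import Fraction

    def pack(base_payoff_num, B, profile, base, num_agents):
        # The main payoff is assumed to equal base_payoff_num / 2^B, i.e., it is
        # specified with B fractional bits (B = O(log 1/eps) in the construction).
        # Each coordinate of the profile is assumed to lie in {0, ..., base - 1}.
        P = math.ceil(num_agents * math.log2(base))      # bits needed for the profile index
        idx = 0
        for a in profile:
            idx = idx * base + a
        reward = Fraction(base_payoff_num, 2 ** B) + Fraction(idx, 2 ** (B + P))
        return reward, P

    def unpack(reward, B, P, base, num_agents):
        # Read off the P low-order bits sitting below the main payoff.
        idx = int(reward * 2 ** (B + P)) % 2 ** P
        profile = []
        for _ in range(num_agents):
            profile.append(idx % base)
            idx //= base
        return tuple(reversed(profile))

    # Example: 3 agents, 4 actions each, main payoff 5 / 2^6.
    r, P = pack(5, 6, (2, 0, 3), base=4, num_agents=3)
    assert unpack(r, 6, P, base=4, num_agents=3) == (2, 0, 3)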

It is possible to avoid this phenomenon by instead using the state transitions of the Markov game to encode the action profile taken at each step, as was done in the proof of Theorem 3.2. The idea, which we sketch here, is to replace the game 𝒢(G)\mathcal{G}(G) of Definition C.2 with the following game 𝒢(G)\mathcal{G}^{\prime}(G):

Definition C.9 (Alternative construction to Definition C.2).

Given an m-player, n_{0}-action normal-form game G, we define the game \mathcal{G}^{\prime}(G)=(\mathcal{S},H,(\mathcal{A}_{i})_{i\in[m+1]},\mathbb{P},(R_{i})_{i\in[m+1]},\mu) as follows; a schematic code sketch of its rewards and transitions is given after the definition.

  • The horizon of \mathcal{G}^{\prime}(G) is H=n_{0}.

  • Let A=n0A=n_{0}. The action spaces of agents 1,2,,m1,2,\ldots,m are given by 𝒜1==𝒜m=[A]\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=[A]. The action space of agent m+1m+1 is

    𝒜m+1={(j,aj):j[m],aj𝒜j},\displaystyle\mathcal{A}_{m+1}=\{(j,a_{j})\ :\ j\in[m],a_{j}\in\mathcal{A}_{j}\},

    so that |𝒜m+1|=Amn|\mathcal{A}_{m+1}|=Am\leq n.

    We write 𝒜=j=1m𝒜j\mathcal{A}=\prod_{j=1}^{m}\mathcal{A}_{j} to denote the joint action space of the first mm agents, and 𝒜¯:=j=1m+1𝒜j\overline{\mathcal{A}}:=\prod_{j=1}^{m+1}\mathcal{A}_{j} to denote the joint action space of all agents. Then |𝒜¯|=Am(mA)=mAm+1n|\overline{\mathcal{A}}|=A^{m}\cdot(mA)=mA^{m+1}\leq n.

  • The state space 𝒮\mathcal{S} is defined as follows. There are |𝒜¯||\overline{\mathcal{A}}| states, one for each action tuple 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}. For each 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}}, we denote the corresponding state by 𝔰𝐚\mathfrak{s}_{\mathbf{a}}.

  • For all h[H]h\in[H], the reward to agent j[m+1]j\in[m+1] given action profile 𝐚=(a1,,am+1)\mathbf{a}=(a_{1},\ldots,a_{m+1}) at any state s𝒮s\in\mathcal{S} is as follows: writing am+1=(j,aj)a_{m+1}=(j^{\prime},a_{j^{\prime}}^{\prime}),

    \displaystyle R_{j,h}(s,\mathbf{a}):=\begin{cases}0&:j\not\in\{j^{\prime},m+1\}\\ \frac{1}{H}\cdot\left((M_{j^{\prime}})_{a_{1},\ldots,a_{m}}-(M_{j^{\prime}})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}\right)&:j=j^{\prime}\\ \frac{1}{H}\cdot\left((M_{j^{\prime}})_{a_{1},\ldots,a_{j^{\prime}}^{\prime},\ldots,a_{m}}-(M_{j^{\prime}})_{a_{1},\ldots,a_{m}}\right)&:j=m+1.\end{cases} (24)
  • At each step h[H]h\in[H], if action profile 𝐚𝒜¯\mathbf{a}\in\overline{\mathcal{A}} is taken, the game transitions to the state 𝔰𝐚\mathfrak{s}_{\mathbf{a}}.
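The schematic sketch referenced in Definition C.9 is as follows (Python). The interface M, a list of m payoff tables for G indexed by m-tuples of actions with agents indexed from 0, is an assumption made only for this example; the sketch mirrors the rewards (24) and the deterministic transitions.

    def reward(j, profile, M, H):
        # R_{j,h}(s, a) from (24); it depends on neither the state s nor the step h.
        # profile = (a_1, ..., a_m, (j', a'_{j'})) with 0-indexed agents; agent m+1
        # corresponds to j == m here.
        a_first_m = profile[:-1]
        j_prime, a_dev = profile[-1]
        deviated = tuple(a_dev if k == j_prime else a for k, a in enumerate(a_first_m))
        m = len(a_first_m)
        if j == m:                   # agent m+1: the deviation gain of the probed agent j'
            return (M[j_prime][deviated] - M[j_prime][a_first_m]) / H
        if j == j_prime:             # the probed agent j': minus that deviation gain
            return (M[j_prime][a_first_m] - M[j_prime][deviated]) / H
        return 0.0                   # all other agents receive zero reward

    def transition(state, profile):
        # The next state simply records the action profile just played (there is one
        # state per profile in the definition above).
        return profile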

Note that the number of states of \mathcal{G}^{\prime}(G) is equal to |\overline{\mathcal{A}}|=mn_{0}^{m+1}, and so |\mathcal{G}^{\prime}(G)|=mn_{0}^{m+1}. As a result, if we were to use the game \mathcal{G}^{\prime}(G) in place of \mathcal{G}(G) in the proof of Theorem C.1, we would need to define n_{0}:=\lfloor n^{1/(m+1)}/m\rfloor to ensure that |\mathcal{G}^{\prime}(G)|\leq n, and so the condition T<\exp(\epsilon^{2}\cdot\lfloor n/m\rfloor/m^{2}) would be replaced by T<\exp(\epsilon^{2}\cdot\lfloor n^{1/(m+1)}/m\rfloor/m^{2}). This would only lead to a small quantitative degradation in the statement of Theorem 4.3, with the condition in the statement replaced by T<\exp(c\cdot\epsilon^{2}\cdot n^{1/3}) for some constant c>0. However, it would render the statement of Theorem 5.2 essentially vacuous. For this reason, we opt for the approach of Definition C.2 rather than that of Definition C.9.

We expect that the construction of Definition C.2 can nevertheless still be modified to use O(log1/ϵ)O(\log 1/\epsilon) bits to express each reward in the Markov game 𝒢\mathcal{G}. In particular, one could introduce stochastic transitions to encode in the state of the Markov game a small number of random bits of the full action profile played at each step. We leave such an approach for future work.

Appendix D Equivalence between Πjgen,rnd{\Pi}^{\mathrm{gen,rnd}}_{j} and Δ(Πjgen,det)\Delta(\Pi^{\mathrm{gen,det}}_{j})

In this section we consider an alternate definition of the space Πigen,rnd{\Pi}^{\mathrm{gen,rnd}}_{i} of randomized general policies of player ii, and show that it is equivalent to the one we gave in Section 2.

In particular, suppose we were to define a randomized general policy of agent ii as a distribution over deterministic general policies of agent ii: we write Π~igen,rnd:=Δ(Πigen,det){\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{i}:=\Delta(\Pi^{\mathrm{gen,det}}_{i}) to denote the space of such distributions. Moreover, write Π~gen,rnd:=Π~1gen,rnd××Π~mgen,rnd=Δ(Π1gen,det)××Δ(Πmgen,det){\widetilde{\Pi}}^{\mathrm{gen,rnd}}:={\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{m}=\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m}) to denote the space of product distributions over agents’ deterministic policies. Our goal in this section is to show that policies in Π~gen,rnd{\widetilde{\Pi}}^{\mathrm{gen,rnd}} are equivalent to those in Πgen,rnd{\Pi}^{\mathrm{gen,rnd}} in the following sense: there is an embedding map 𝖤𝗆𝖻:Πgen,rndΠ~gen,rnd\mathsf{Emb}:{\Pi}^{\mathrm{gen,rnd}}\rightarrow{\widetilde{\Pi}}^{\mathrm{gen,rnd}}, not depending on the Markov game, so that the distribution of a trajectory drawn from any σΠgen,rnd\sigma\in{\Pi}^{\mathrm{gen,rnd}}, for any Markov game, is the same as the distribution of a trajectory drawn from 𝖤𝗆𝖻(σ)\mathsf{Emb}(\sigma) (Fact D.2). Furthermore, 𝖤𝗆𝖻\mathsf{Emb} is surjective in the following sense: any policy σ~Π~gen,rnd\widetilde{\sigma}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}} produces trajectories that are distributed identically to those of 𝖤𝗆𝖻(σ)\mathsf{Emb}(\sigma) (and thus of σ\sigma), for some σΠgen,rnd\sigma\in{\Pi}^{\mathrm{gen,rnd}} (Fact D.3). In Definition D.1 below, we define 𝖤𝗆𝖻\mathsf{Emb}.

Definition D.1.

For j[m]j\in[m] and σjΠjgen,rnd\sigma_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j}, define 𝖤𝗆𝖻j(σj)Π~jgen,rnd=Δ(Πjgen,det)\mathsf{Emb}_{j}(\sigma_{j})\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{j}=\Delta(\Pi^{\mathrm{gen,det}}_{j}) to put the following amount of mass on each πjΠjgen,det\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}:

(𝖤𝗆𝖻j(σj))(πj):=h=1H(τj,h1,sh)j,h1×𝒮σj(πj,h(τj,h1,sh)|τj,h1,sh)\displaystyle(\mathsf{Emb}_{j}(\sigma_{j}))(\pi_{j}):=\prod_{h=1}^{H}\prod_{(\tau_{j,h-1},s_{h})\in\mathscr{H}_{j,h-1}\times\mathcal{S}}\sigma_{j}(\pi_{j,h}(\tau_{j,h-1},s_{h})\ |\ \tau_{j,h-1},s_{h}) (25)

Furthermore, for \sigma=(\sigma_{1},\ldots,\sigma_{m})\in{\Pi}^{\mathrm{gen,rnd}}, define \mathsf{Emb}(\sigma)=(\mathsf{Emb}_{1}(\sigma_{1}),\ldots,\mathsf{Emb}_{m}(\sigma_{m})).

Note that, in the special case that σjΠjgen,det\sigma_{j}\in\Pi^{\mathrm{gen,det}}_{j}, 𝖤𝗆𝖻j(σj)\mathsf{Emb}_{j}(\sigma_{j}) is the point mass on σj\sigma_{j}.
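To make (25) concrete, the following toy Python sketch computes the mass that \mathsf{Emb}_{j}(\sigma_{j}) assigns to a deterministic policy; representing policies as explicit tables over a small finite set of (history, state) contexts is an assumption made only for this example.

    def emb_mass(sigma_j, pi_j, contexts):
        # sigma_j: dict mapping each context (tau_{j,h-1}, s_h) to a dict of action
        # probabilities; pi_j: dict mapping each context to a single action.
        # Returns the product over contexts appearing in (25).
        mass = 1.0
        for ctx in contexts:
            mass *= sigma_j[ctx][pi_j[ctx]]
        return mass

    # Sanity check on a one-step, one-state example with two actions: the masses
    # assigned to the two deterministic policies sum to one.
    ctx = ((), "s1")
    sigma_j = {ctx: {0: 0.3, 1: 0.7}}
    total = sum(emb_mass(sigma_j, {ctx: a}, [ctx]) for a in (0, 1))
    assert abs(total - 1.0) < 1e-9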

Fact D.2 (Embedding equivalence).

Fix an m-player Markov game \mathcal{G} and arbitrary policies \sigma_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j} for each j\in[m]. Then a trajectory drawn from the product policy \sigma=(\sigma_{1},\ldots,\sigma_{m})\in{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m} is distributed identically to a trajectory drawn from \mathsf{Emb}(\sigma)\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}.

The proof of Fact D.2 is provided in Section D.1. Next, we show that the mapping 𝖤𝗆𝖻\mathsf{Emb} is surjective in the following sense:

Fact D.3 (Right inverse of 𝖤𝗆𝖻j\mathsf{Emb}_{j}).

There is a mapping 𝖥𝖺𝖼:Π~gen,rndΠgen,rnd\mathsf{Fac}:{\widetilde{\Pi}}^{\mathrm{gen,rnd}}\rightarrow{\Pi}^{\mathrm{gen,rnd}} so that for any Markov game 𝒢\mathcal{G} and any σ~Π~gen,rnd\widetilde{\sigma}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}, the distribution of a trajectory drawn from σ~\widetilde{\sigma} is identical to the distribution of a trajectory drawn from 𝖤𝗆𝖻𝖥𝖺𝖼(σ~)\mathsf{Emb}\circ\mathsf{Fac}(\widetilde{\sigma}).

We will write 𝖥𝖺𝖼((σ~1,,σ~m)):=(𝖥𝖺𝖼1(σ~1),,𝖥𝖺𝖼m(σ~m))\mathsf{Fac}((\widetilde{\sigma}_{1},\ldots,\widetilde{\sigma}_{m})):=(\mathsf{Fac}_{1}(\widetilde{\sigma}_{1}),\ldots,\mathsf{Fac}_{m}(\widetilde{\sigma}_{m})). Fact D.3 states that the policy 𝖥𝖺𝖼(σ~)\mathsf{Fac}(\widetilde{\sigma}) maps, under 𝖤𝗆𝖻\mathsf{Emb}, to a policy in Π~gen,rnd{\widetilde{\Pi}}^{\mathrm{gen,rnd}} which is equivalent to σ~\widetilde{\sigma} (in the sense that their trajectories are identically distributed for any Markov game).

An important consequence of Fact D.2 is that the expected reward (i.e., value) under any \sigma\in{\Pi}^{\mathrm{gen,rnd}} is the same as that of \mathsf{Emb}(\sigma). Thus, given a Markov game, the induced normal-form game in which the players' pure action sets are {\Pi}^{\mathrm{gen,rnd}}_{1},\ldots,{\Pi}^{\mathrm{gen,rnd}}_{m} is equivalent to the normal-form game in which the players' pure action sets are \Pi^{\mathrm{gen,det}}_{1},\ldots,\Pi^{\mathrm{gen,det}}_{m}, in the following sense: for any mixed strategy in the former, namely a product distributional policy P\in\Delta({\Pi}^{\mathrm{gen,rnd}}_{1})\times\cdots\times\Delta({\Pi}^{\mathrm{gen,rnd}}_{m}), the policy \mathbb{E}_{\sigma\sim P}[\mathsf{Emb}(\sigma)]\in\Delta(\Pi^{\mathrm{gen,det}}_{1})\times\cdots\times\Delta(\Pi^{\mathrm{gen,det}}_{m})={\widetilde{\Pi}}^{\mathrm{gen,rnd}} is a mixed strategy in the latter which gives each player the same value as under P. (Note that \mathbb{E}_{\sigma\sim P}[\mathsf{Emb}(\sigma)] is indeed a product distribution since P is a product distribution and \mathsf{Emb} factors into individual coordinates.) Furthermore, by Fact D.3, any distributional policy in {\widetilde{\Pi}}^{\mathrm{gen,rnd}} arises in this manner, for some P\in\Delta({\Pi}^{\mathrm{gen,rnd}}_{1})\times\cdots\times\Delta({\Pi}^{\mathrm{gen,rnd}}_{m}); in fact, P may be chosen to place all its mass on a single \sigma\in{\Pi}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\Pi}^{\mathrm{gen,rnd}}_{m}. Since \mathsf{Emb} factors into individual coordinates, it follows that \mathsf{Emb} yields a one-to-one mapping between the coarse correlated equilibria (or any other notion of equilibrium, e.g., Nash equilibria or correlated equilibria) of these two normal-form games.

D.1 Proofs of the equivalence

Proof of Fact D.2.

Consider any trajectory \tau=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1},\ldots,s_{H},\mathbf{a}_{H},\mathbf{r}_{H}) consisting of a sequence of H states together with actions and rewards for each of the m agents. Assume that r_{i,h}=R_{i,h}(s_{h},\mathbf{a}_{h}) for all i,h (as otherwise \tau has probability 0 under any policy). Write:

pτ:=h=1H1h(sh+1|sh,𝐚h).\displaystyle p_{\tau}:=\prod_{h=1}^{H-1}\mathbb{P}_{h}(s_{h+1}|s_{h},\mathbf{a}_{h}).

Then the probability of observing τ\tau under σ\sigma is

\displaystyle p_{\tau}\cdot\prod_{h=1}^{H}\prod_{j=1}^{m}\sigma_{j,h}(a_{j,h}|\tau_{j,h-1},s_{h}) (26)

where, as usual, \tau_{j,h-1}=(s_{1},a_{j,1},r_{j,1},\ldots,s_{h-1},a_{j,h-1},r_{j,h-1}). Write \widetilde{\sigma}=(\widetilde{\sigma}_{1},\ldots,\widetilde{\sigma}_{m}):=\mathsf{Emb}(\sigma). The probability of observing \tau under \widetilde{\sigma} is

\displaystyle p_{\tau}\cdot\prod_{j\in[m]}\sum_{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}:\ \forall h,\ \pi_{j,h}(\tau_{j,h-1},s_{h})=a_{j,h}}\widetilde{\sigma}_{j}(\pi_{j}) (27)

It is now straightforward to see from the definition of \widetilde{\sigma}_{j}(\pi_{j})=(\mathsf{Emb}_{j}(\sigma_{j}))(\pi_{j}) in (25) that the quantities in (26) and (27) are equal. ∎

Proof of Fact D.3.

Fix a policy \widetilde{\sigma}_{j}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{j}=\Delta(\Pi^{\mathrm{gen,det}}_{j}). We define \mathsf{Fac}_{j}(\widetilde{\sigma}_{j}) to be the policy \sigma_{j}\in{\Pi}^{\mathrm{gen,rnd}}_{j} defined as follows: for \tau_{j,h-1}=(s_{1},a_{j,1},r_{j,1},\ldots,s_{h-1},a_{j,h-1},r_{j,h-1})\in\mathscr{H}_{j,h-1} and s_{h}\in\mathcal{S}, we have, for a_{j,h}\in\mathcal{A}_{j},

\displaystyle\sigma_{j}(\tau_{j,h-1},s_{h})(a_{j,h})=\frac{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h\}\right)}{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h-1\}\right)}.

If the denominator of the above expression is 0, then \sigma_{j}(\tau_{j,h-1},s_{h}) is defined to be an arbitrary distribution in \Delta(\mathcal{A}_{j}). (For concreteness, let us say that it puts all its mass on a fixed action in \mathcal{A}_{j}.) Furthermore, for \widetilde{\sigma}\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}, define \mathsf{Fac}(\widetilde{\sigma}):=(\mathsf{Fac}_{1}(\widetilde{\sigma}_{1}),\ldots,\mathsf{Fac}_{m}(\widetilde{\sigma}_{m}))\in{\Pi}^{\mathrm{gen,rnd}}.
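In code, the conditional probabilities defining \mathsf{Fac}_{j}(\widetilde{\sigma}_{j}) can be sketched as follows; representing \widetilde{\sigma}_{j} as an explicit list of (deterministic policy, probability) pairs is an assumption made only for this illustration.

    def fac_prob(sigma_tilde_j, observed, context, action, fallback_action=0):
        # sigma_tilde_j: list of (pi_j, prob) pairs, each pi_j a dict mapping
        # contexts (tau_{j,g-1}, s_g) to actions.
        # observed: list of (context, action) pairs realized at steps g <= h-1.
        # Returns Fac_j(sigma_tilde_j)(tau_{j,h-1}, s_h)(action).
        def consistent(pi_j, pairs):
            return all(pi_j[c] == a for c, a in pairs)
        denom = sum(p for pi_j, p in sigma_tilde_j if consistent(pi_j, observed))
        if denom == 0:
            # Arbitrary fixed distribution, as in the proof: a point mass on a fixed action.
            return 1.0 if action == fallback_action else 0.0
        numer = sum(p for pi_j, p in sigma_tilde_j
                    if consistent(pi_j, observed + [(context, action)]))
        return numer / denom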

Next, fix any \widetilde{\sigma}=(\widetilde{\sigma}_{1},\ldots,\widetilde{\sigma}_{m})\in{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{1}\times\cdots\times{\widetilde{\Pi}}^{\mathrm{gen,rnd}}_{m}. Let \sigma=\mathsf{Fac}(\widetilde{\sigma}). By Fact D.2, it suffices to show that the distribution of trajectories under \sigma is the same as the distribution of trajectories drawn from \widetilde{\sigma}.

So consider any trajectory \tau=(s_{1},\mathbf{a}_{1},\mathbf{r}_{1},\ldots,s_{H},\mathbf{a}_{H},\mathbf{r}_{H}) consisting of a sequence of H states together with actions and rewards for each of the m agents. Assume that r_{i,h}=R_{i,h}(s_{h},\mathbf{a}_{h}) for all i,h (as otherwise \tau has probability 0 under any policy). Write:

pτ:=h=1H1h(sh+1|sh,𝐚h).\displaystyle p_{\tau}:=\prod_{h=1}^{H-1}\mathbb{P}_{h}(s_{h+1}|s_{h},\mathbf{a}_{h}).

Then the probability of observing τ\tau under σ\sigma is

pτh=1Hj=1mσj,h(aj,h|τj,h1,sh)\displaystyle p_{\tau}\cdot\prod_{h=1}^{H}\prod_{j=1}^{m}\sigma_{j,h}(a_{j,h}|\tau_{j,h-1},s_{h})
=\displaystyle= p_{\tau}\cdot\prod_{j=1}^{m}\prod_{h=1}^{H}\frac{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h\}\right)}{\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq h-1\}\right)}
=\displaystyle= p_{\tau}\cdot\prod_{j=1}^{m}\widetilde{\sigma}_{j}\left(\{\pi_{j}\in\Pi^{\mathrm{gen,det}}_{j}\ :\ \pi_{j,g}(\tau_{j,g-1},s_{g})=a_{j,g}\ \forall g\leq H\}\right),

which is equal to the probability of observing \tau under \widetilde{\sigma}. (If any denominator in the display above vanishes, then the trajectory \tau has probability 0 under both \sigma and \widetilde{\sigma}, so the equality holds trivially.) ∎