
Information Compression in Dynamic Games

Dengwang Tang [1] (dwtang@umich.edu), Vijay Subramanian [2] (vgsubram@umich.edu), Demosthenis Teneketzis [2] (teneket@umich.edu)

[1] Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089-2560, USA

[2] Electrical and Computer Engineering Division, Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109, USA
Abstract

One of the reasons why stochastic dynamic games with an underlying dynamic system are challenging is that strategic players have access to an enormous amount of information, which leads to the use of extremely complex strategies at equilibrium. One approach to resolve this challenge is to simplify players' strategies by identifying appropriate information compression maps so that the players can make decisions solely based on the compressed version of information, called the information state. Such maps allow players to implement their strategies efficiently. For finite dynamic games with asymmetric information, inspired by the notion of information state for single-agent control problems, we propose two notions of information states, namely mutually sufficient information (MSI) and unilaterally sufficient information (USI). Both these information states are obtained by applying information compression maps that are independent of the strategy profile. We show that Bayes-Nash Equilibria (BNE) and Sequential Equilibria (SE) exist when all players use MSI-based strategies. We prove that when all players employ USI-based strategies the resulting sets of BNE and SE payoff profiles are the same as the sets of BNE and SE payoff profiles resulting when all players use full information-based strategies. We prove that when all players use USI-based strategies the resulting set of weak Perfect Bayesian Equilibrium (wPBE) payoff profiles can be a proper subset of all wPBE payoff profiles. We identify MSI and USI in specific models of dynamic games in the literature. We end by presenting an open problem: Do there exist strategy-dependent information compression maps that guarantee the existence of at least one equilibrium or maintain all equilibria that exist under perfect recall? We show, by a counterexample, that a well-known strategy-dependent information compression map used in the literature does not possess any of the properties of the strategy-independent compression maps that result in MSI or USI.

keywords:
Non-cooperative Games, Dynamic Games, Information State, Sequential Equilibrium, Markov Decision Process
JEL Classification: C72, C73, D80

MSC Classification: 90C40, 91A10, 91A15, 91A25, 91A50

Acknowledgements: The authors would like to thank Yi Ouyang, Hamidreza Tavafoghi, Ashutosh Nayyar, Tilman Börgers, and David Miller for helpful discussions.

1 Introduction

The model of stochastic dynamic games has found application in many engineering and socioeconomic settings, such as transportation networks, power grids, spectrum markets, and online shopping platforms. In these settings, multiple agents/players make decisions over time in an ever-changing environment, with players having different goals and asymmetric information. For example, in transportation networks, individual drivers make routing decisions based on information from online map services in order to reach their respective destinations as fast as possible. Their actions then collectively affect traffic conditions in the future. Another example involves online shopping platforms, where buyers leave reviews to inform potential future buyers, while sellers update prices and make listing decisions based on the feedback from buyers. In these systems, players' decisions are generally not only interdependent, but also affect the underlying environment as well as the future decisions and payoffs of all players in complex ways.

Determining the set of equilibria, or even solving for one equilibrium, in a given stochastic dynamic game can be a challenging task. The main challenges include: (a) the presence of an underlying environment/system that can change over time based on the actions of all players; (b) incomplete and asymmetric information; (c) large numbers of players, states, and actions; and (d) a growing amount of information over time, which results in a massive strategy space. As a result of the advances in technology, stochastic dynamic games today are often played by players (e.g. big corporations) that have access to substantial computational resources along with large amounts of data for decision making. Nevertheless, even these players are computationally constrained, and they must make decisions in real time, hence complicated strategies may not be feasible for them. Therefore, it is important to determine computationally efficient strategies for players to use at equilibrium. Compressing players' information and then using strategies based on the compressed information is a well-established methodology that results in computationally efficient strategies. In this paper we address some of the above-mentioned challenges. We concentrate on the challenges associated with information compression, namely the existence of equilibria under information compression, and the preservation of all equilibrium payoff profiles under information compression. We leave as a topic of future investigation the discovery of efficient algorithms for the computation of equilibria based on strategies that use compressed information.

Specifically, our goal is to identify appropriate strategy-independent (see Footnote 1) information compression maps in dynamic games so that the resulting compressed information has properties/features sufficient to satisfy the following requirements: (R1) existence of equilibria when all players use strategies based on the compressed information; (R2) equality of the set of all equilibrium payoff profiles that are achieved when all players use full-information-based strategies with the set of all equilibrium payoff profiles that are achieved when all players use strategies based on the compressed information. (Footnote 1: Strategy-independent information compression maps are maps that are not parameterized by a strategy profile. Examples of strategy-independent information compression maps include those that use a fixed subset of the game's history (e.g. the most recent observation) or some statistic of the game's history (e.g. the number of times player i takes a certain action). Strategy-dependent maps are parameterized by a strategy profile (see Section 6).)

Inspired by the literature on single-agent decision/control problems, particularly the notion of information state, we develop notions of information state (compressed information) that satisfy requirements (R1) and (R2). Specifically, we introduce the notions of Mutually Sufficient Information (MSI) and Unilaterally Sufficient Information (USI). We show that MSI has properties/features sufficient to satisfy (R1), whereas USI has properties sufficient to satisfy (R2) under several different equilibrium concepts.

The remainder of the paper is organized as follows: In Section 1.1 we briefly review related literature in stochastic control and game theory. In Section 1.2 we list our contributions. In Section 1.3 we introduce our notation. In Section 2 we formulate our game model. In Section 3.1 and Section 3.2 we introduce the notions of mutually sufficient information and unilaterally sufficient information, respectively. We present our main results in Section 4. We discuss these results in Section 5. We discuss an open problem, primarily associated with strategy-dependent information compression, in Section 6. We provide supporting results in Appendix A. We present alternative characterizations of sequential equilibria in Appendix B. We provide proofs of the results of Sections 3 and 4 in Appendix C. We present the details of the discussions in Section 5 and Section 6 in Appendix D.

1.1 Related Literature

We first present a brief literature survey on information compression in single-agent decision problems because it has inspired several of the key ideas presented in this paper.

Single-agent decision/control problems are problems where one agent chooses actions over time on top of an ever-changing system to maximize their total reward. These problems have been extensively studied in the control theory [23], operations research [39], computer science [41], and mathematics [8] literature. Models like Markov Decision Process (MDP) and Partially Observable Markov Decision Process (POMDP) have been analyzed and applied widely in real-world systems. It is well known that in an MDP, the agent can use a Markov strategy—making decisions based on the current state—without loss of optimality. A Markov strategy can be seen as a strategy based on compressed information: the full information—state and action history—is compressed into only the current state. Furthermore, in finite horizon problems, such optimal Markov strategies can be found through a sequential decomposition procedure. It is also well known that any POMDP can be transformed into an MDP with an appropriate belief acting as the underlying state [23, Chapter 6]. As a result, the agent can use a belief-based strategy without loss of optimality. A belief-based strategy compresses the full information into the conditional belief of the current state. Critically, this information compression is strategy-independent [2, 44, 45, 23]. For general single-agent control problems, sufficient conditions that guarantee optimality of compression-based strategies have been proposed under the names of sufficient statistic [43, 46, 58, 18, 47] and information state [23, 24, 48]. In these works, the authors transform single-agent control problems with partial observations into equivalent problems with complete observations with the sufficient statistic/information state acting as the underlying state.
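As a concrete illustration of such strategy-independent compression, the standard POMDP belief update has the following form (a minimal sketch in generic notation, with a transition kernel P and an observation kernel O that are not symbols used elsewhere in this paper): if b_t denotes the belief on the hidden state before action u_t is taken and observation y_{t+1} is received, then

b_{t+1}(x^\prime)=\dfrac{O(y_{t+1}|x^\prime)\sum_{x}P(x^\prime|x,u_t)\,b_t(x)}{\sum_{x^{\prime\prime}}O(y_{t+1}|x^{\prime\prime})\sum_{x}P(x^{\prime\prime}|x,u_t)\,b_t(x)}.

The map (b_t,u_t,y_{t+1})\mapsto b_{t+1} involves only the model primitives and not the agent's strategy, which is precisely the sense in which this compression is strategy-independent.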

Multi-agent dynamic decision problems are either teams where all agents have the same objective, or games where agents have different objectives and are strategic. Information compression in dynamic teams has been investigated in [55, 32, 34, 24, 54, 48, 19], and many other works (see [54] and [48] for a list of references). Dynamic games can be divided into two categories: those with a static underlying environment (e.g. repeated games), and those with an underlying dynamic system. Over the years, economics researchers have studied repeated games extensively (e.g. see [30, Chapter 7]). As our focus is on dynamic games with an underlying dynamic system, we will not discuss the literature on repeated games. Among models for dynamic games with an underlying dynamic system, the model of zero-sum games, as a particular class which possesses special properties, has been analyzed in [42, 28, 40] and many others (see [37] for a list of references). Non-zero-sum games with an underlying dynamic system and symmetric information have also been studied extensively [5, 9]. For such dynamic games with perfect information, the authors of [26] introduce the concept of Markov Perfect Equilibrium (MPE), where each player compresses their information into a Markov state. Dynamic games with asymmetric information have been analyzed in [29, 27, 31, 33, 13, 14, 35, 36, 53, 52, 56, 51, 37]. In [33], the authors introduce the concept of Common Information Based Markov Perfect Equilibrium (CIB-MPE), which is an extension of MPE in partially observable systems. In a CIB-MPE, all players choose their actions at each time based on the Common-Information-Based (CIB) belief (a compression of the common information) and private information instead of full information. The authors establish the existence of CIB-MPE under the assumption that the CIB belief is strategy-independent. Furthermore, the authors develop a sequential decomposition procedure to solve for such equilibria. In [36], the authors extend the result of [33] to a particular model where the CIB beliefs are strategy-dependent. They introduce the concept of Common Information Based Perfect Bayesian Equilibrium (CIB-PBE). In a CIB-PBE all players choose their actions based on the CIB belief and their private information. They show that such equilibria can be found through a sequential decomposition whenever the decomposition has a solution. The authors conjecture the existence of such equilibria. The authors of [51] extend the model of [36] to games among teams. They consider two compression maps and their associated equilibrium concepts. For the first compression map, which is strategy-independent, they establish preservation of equilibrium payoffs. For the second information compression map, which is strategy-dependent, they propose a sequential decomposition of the game. If the decomposition admits a solution, then there exists a CIB-BNE based on the compressed information. Furthermore, they provide an example where CIB-BNE based on this specific compressed information do not exist. The example also proves that the conjecture about the existence of CIB-PBEs, made in [36], is false.

In addition to the methods on information compression that appear in [26, 33, 36, 51], there are two lines of work on games where the players' decisions are based on limited information. In the first line of work, players face exogenous hard constraints on the information that can be used to choose actions [38, 7, 12, 15, 3]. In the second line of work, players can utilize any finite automaton with any number of states to choose actions; however, more complex automata are assumed to be more costly [1, 4]. In our work, we also deal with finite automaton based strategies. However, there is a critical difference between our work and both of the above-mentioned lines of work: Our primary interest is to study conditions under which a compression based strategy profile can form an equilibrium under standard equilibrium concepts when unbounded rationality and perfect recall are allowed. Under these equilibrium concepts, we do not restrict the strategy of any player, nor do we impose any penalty on complicated strategies. In other words, a compression based strategy needs to be a best response, in terms of payoff alone, compared to all possible strategies with full recall. The methodology for information compression presented in this paper is similar in spirit to that of [26, 33, 36, 56, 37]. However, this paper is significantly different from those works as it deals with the discovery of information compression maps that lead not only to the existence (in general) of various types of compressed information based equilibria but also to the preservation of all equilibrium payoff profiles (a topic not investigated in [26, 33, 36, 56, 37]). This paper builds on [51]; it identifies embodiments of the two information compression maps studied in [51] for a much more general class of games than that of [51], and for a broader set of equilibrium concepts.

1.2 Contributions

Our main contributions are the following:

  1.

    We propose two notions of information states/compressed information for dynamic games with asymmetric information that result from strategy-independent compression maps: Mutually Sufficient Information (MSI) and Unilaterally Sufficient Information (USI) (Definitions 4 and 1, respectively). We present an example that highlights the differences between MSI and USI.

  2.

    We show that in finite dynamic games with asymmetric information, Bayes–Nash Equilibria (BNE) and Sequential Equilibria (SE) exist when all players use MSI-based strategies — Theorems 2 and 4, respectively.

  3.

    We prove that when all players employ USI-based strategies, the resulting sets of BNE and SE payoff profiles are the same as the sets of BNE and SE payoff profiles resulting when all players use full-information-based strategies (Theorems 3 and 5, respectively).

  4.

    We prove that when all players use USI-based strategies the resulting set of weak Perfect Bayesian Equilibrium (wPBE) payoff profiles can be a proper subset of the set of all wPBE payoff profiles — Proposition 6. A result similar to that of Proposition 6 is also true under Watson’s PBE [57].

    Figure 1 depicts the results stated in Contributions 3 and 4 above.

    [Figure 1. Labels in the Venn diagram: "USI-based BNE = All BNE"; "All wPBE"; "USI-based wPBE"; "USI-based SE = All SE".]
    Figure 1: A Venn diagram showing the relationship of the sets of payoff profiles for different equilibrium concepts using either unilaterally sufficient information (USI) based strategy profiles or general strategies.
  5.

    We present several examples — Examples 5.3 through 5.6 — of finite dynamic games with asymmetric information where we identify MSI and USI.

Additional contributions of this work are:

  1.

    A set of alternative definitions of SE — Appendix B. These definitions are equivalent to the original definition of SE given in [21] and help simplify some of the proofs of the main results in this paper.

  2.

    A new methodology for establishing existence of equilibria. The methodology is based on a best response function defined through a dynamic program for a single-agent control problem.

  3.

    A counterexample showing that a well-known strategy-dependent compression map, resulting in sufficient private information along with common information-based beliefs, does not guarantee existence of equilibria based on the above-stated compressed information.

1.3 Notation

We follow the notational convention of the stochastic control literature (i.e. using random variables to define the system, representing information as random variables, etc.) instead of the convention of the game theory literature (i.e. game trees, nodes, information sets, etc.) unless otherwise specified. This allows us to apply techniques from stochastic control, which we rely on heavily, in a more natural way. We use capital letters to represent random variables, bold capital letters to denote random vectors, and lower case letters to represent realizations. We use superscripts to indicate players, and subscripts to indicate time. We use i to represent a typical player and -i to represent all players other than i. We use t_1:t_2 to indicate the collection of timestamps (t_1,t_1+1,\cdots,t_2). For example, X_{1:4}^i stands for the random vector (X_1^i,X_2^i,X_3^i,X_4^i). For random variables or random vectors represented by Latin letters, we use the corresponding script capital letters to denote the space of values these random vectors can take. For example, \mathcal{H}_t^i denotes the space of values the random vector H_t^i can take. Products of sets refer to Cartesian products. We use \mathbb{P}(\cdot) and \mathbb{E}[\cdot] to denote probabilities and expectations, respectively. We use \Delta(\varOmega) to denote the set of probability distributions on a finite set \varOmega. For a distribution \nu\in\Delta(\varOmega), we use \mathrm{supp}(\nu) to denote the support of \nu. When writing probabilities, we will omit the random variables when the lower case letters that represent the realizations clearly indicate the random variables they represent. For example, we will use \mathbb{P}(y_t^i|x_t,u_t) as a shorthand for \mathbb{P}(Y_t^i=y_t^i|X_t=x_t,U_t=u_t). When \lambda is a function from \varOmega_1 to \Delta(\varOmega_2), with some abuse of notation we write \lambda(\omega_2|\omega_1):=(\lambda(\omega_1))(\omega_2) as if \lambda were a conditional distribution. We use \bm{1}_A to denote the indicator random variable of an event A.

In general, probability distributions of random variables in a dynamic system are only well defined after a complete strategy profile is specified. We specify the strategy profile that defines the distribution in superscripts, e.g. \mathbb{P}^g(x_t^i|h_t^0). When the conditional probability is independent of a certain part of the strategy (g_t^i)_{(i,t)\in\varOmega}, we may omit this part of the strategy in the notation, e.g. \mathbb{P}^{g_{1:t-1}}(x_t|y_{1:t-1},u_{1:t-1}), \mathbb{P}^{g^i}(u_t^i|h_t^i) or \mathbb{P}(x_{t+1}|x_t,u_t). We say that a realization of some random vector (for example h_t^i) is admissible under a partially specified strategy profile (for example g^{-i}) if the realization has strictly positive probability under some completion of the partially specified strategy profile (in this example, that means \mathbb{P}^{g^i,g^{-i}}(h_t^i)>0 for some g^i). Whenever we write a conditional probability or conditional expectation, we implicitly assume that the condition has non-zero probability under the specified strategy profile. When only part of the strategy profile is specified in the superscript, we implicitly assume that the condition is admissible under the specified partial strategy profile. In this paper, we make heavy use of value functions and reward-to-go functions. Such functions will be clearly defined within their context with the following convention: Q stands for state-action value functions; V stands for state value functions; and J stands for reward-to-go functions for a given strategy profile (as opposed to Q or V, both of which are typically defined via a maximum over all strategies).

2 Game Model and Objectives

2.1 Game Model

In this section we formulate a general model for a finite horizon dynamic game with finitely many players.

Denote the set of players by \mathcal{I}. Denote the set of timestamps by \mathcal{T}=\{1,2,\cdots,T\}. At time t, player i\in\mathcal{I} takes action U_t^i, obtains instantaneous reward R_t^i, and then learns new information Z_t^i. Player i may not necessarily observe the instantaneous reward R_t^i directly. The reward is observable only if it is part of Z_t^i. Define Z_t=(Z_t^i)_{i\in\mathcal{I}}, U_t=(U_t^i)_{i\in\mathcal{I}}, and R_t=(R_t^i)_{i\in\mathcal{I}}. We assume that there is an underlying state variable X_t and

(X_{t+1},Z_t,R_t) = f_t(X_t,U_t,W_t),\qquad t\in\mathcal{T}, (1)

where (f_t)_{t\in\mathcal{T}} are fixed functions. The primitive random variable X_1 represents the initial move of nature. The primitive random vector H_1=(H_1^i)_{i\in\mathcal{I}} represents the initial information of the players. The initial state X_1 and the initial information H_1 are, in general, correlated. The random variables (W_t)_{t=1}^T are mutually independent primitive random variables representing nature's moves. The vector (X_1,H_1) is assumed to be independent of (W_1,W_2,\cdots,W_T). The distributions of the primitive random variables are common knowledge to all players.

Define \mathcal{X}_t,\mathcal{U}_t,\mathcal{Z}_t,\mathcal{W}_t,\mathcal{H}_1 to be the sets of possible values of X_t,U_t,Z_t,W_t,H_1, respectively. These sets are assumed to be common knowledge among all players. In this work, in order to focus on conceptual difficulties instead of technical issues, we make the following assumption.

Assumption 1.

\mathcal{X}_t,\mathcal{U}_t,\mathcal{Z}_t,\mathcal{W}_t,\mathcal{H}_1 are finite sets, and R_t^i is supported on [-1,1].

We assume perfect recall, i.e. the information player i has at time t is H_t^i=(H_1^i,Z_{1:t-1}^i), and player i's action U_t^i is contained in the new information Z_t^i. A behavioral strategy g^i=(g_t^i)_{t\in\mathcal{T}} of player i is a collection of functions g_t^i\colon\mathcal{H}_t^i\mapsto\Delta(\mathcal{U}_t^i), where \mathcal{H}_t^i is the space where H_t^i takes values. Under a behavioral strategy profile g=(g^i)_{i\in\mathcal{I}}, the total reward/payoff of player i in this game is given by

J^i(g):=\mathbb{E}^g\left[\sum_{t=1}^T R_t^i\right]. (2)
Remark 1.

This is not a restrictive model: By choosing an appropriate state representation X_t and instantaneous reward vector R_t, it can be used to model any finite-node extensive form sequential game with perfect recall.

We initially consider two solution concepts for dynamic games with asymmetric information: Bayes–Nash Equilibrium (BNE) and Sequential Equilibrium (SE). We define BNE and SE below.

Definition 1 (Bayes-Nash Equilibrium).

A behavioral strategy profile g is said to form a Bayes-Nash equilibrium (BNE) if for any player i and any behavioral strategy \tilde{g}^i of player i, we have J^i(g)\geq J^i(\tilde{g}^i,g^{-i}).

Definition 2 (Sequential Equilibrium).

Let g=(g^i)_{i\in\mathcal{I}} be a behavioral strategy profile. Let Q=(Q_t^i)_{i\in\mathcal{I},t\in\mathcal{T}} be a collection of history-action value functions, i.e. Q_t^i\colon\mathcal{H}_t^i\times\mathcal{U}_t^i\mapsto\mathbb{R}. The strategy profile g is said to be sequentially rational under Q if for each i\in\mathcal{I}, t\in\mathcal{T} and each h_t^i\in\mathcal{H}_t^i,

\mathrm{supp}(g_t^i(h_t^i))\subseteq\underset{u_t^i}{\arg\max}~Q_t^i(h_t^i,u_t^i). (3)

Q is said to be fully consistent with g if there exists a sequence of pairs of strategies and history-action value functions (g^{(n)},Q^{(n)})_{n=1}^{\infty} such that

  (1)

    g^{(n)} is fully mixed, i.e. every action is chosen with positive probability at every information set.

  (2)

    Q^{(n)} is consistent with g^{(n)}, i.e.,

    Q_\tau^{(n),i}(h_\tau^i,u_\tau^i) = \mathbb{E}^{g^{(n)}}\left[\sum_{t=\tau}^T R_t^i\Big|h_\tau^i,u_\tau^i\right], (4)

    for each i\in\mathcal{I},\tau\in\mathcal{T},h_\tau^i\in\mathcal{H}_\tau^i,u_\tau^i\in\mathcal{U}_\tau^i.

  (3)

    (g^{(n)},Q^{(n)})\rightarrow(g,Q) as n\rightarrow\infty.

A tuple (g,Q) is said to be a sequential equilibrium if g is sequentially rational under Q and Q is fully consistent with g.

Although Definition 2 of SE differs from the definition in [21], we show in Appendix B that the two are equivalent. We use Definition 2 as it is more suitable for the development of our results.

In this paper, we are interested in analyzing the performance of strategy profiles that are based on some form of compressed information. Let K_t^i be a function of H_t^i that can be sequentially updated, i.e. there exist functions (\iota_t^i)_{t\in\mathcal{T}} such that

K_1^i = \iota_1^i(H_1^i), (5)
K_t^i = \iota_t^i(K_{t-1}^i,Z_{t-1}^i),\qquad t\in\mathcal{T}\backslash\{1\}. (6)

Write K^i=(K_t^i)_{t\in\mathcal{T}} and K=(K^i)_{i\in\mathcal{I}}. We will refer to K^i as the compression of player i's information under \iota^i=(\iota_t^i)_{t\in\mathcal{T}}. A K^i-based (behavioral) strategy \rho^i=(\rho_t^i)_{t\in\mathcal{T}} is a collection of functions \rho_t^i\colon\mathcal{K}_t^i\mapsto\Delta(\mathcal{U}_t^i). A strategy profile where each player i uses a K^i-based strategy is called a K-based strategy profile. If a K-based strategy profile forms a Bayes-Nash (resp. sequential) equilibrium, then it is called a K-based Bayes-Nash (resp. sequential) equilibrium. Note that unlike [38, 7, 12, 15, 3], we require K-based BNE and K-based SE to contain no profitable deviation among all full-history-based strategies.
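As a simple illustration of such a compression (an illustrative choice, not one tied to any particular result below), suppose player i retains only the most recent increment of information:

K_1^i=\iota_1^i(H_1^i)=H_1^i,\qquad K_t^i=\iota_t^i(K_{t-1}^i,Z_{t-1}^i)=Z_{t-1}^i,\quad t\in\mathcal{T}\backslash\{1\},

so that \iota_t^i discards K_{t-1}^i entirely and a K^i-based strategy maps Z_{t-1}^i to a mixed action at each time t>1. Whether a given compression of this kind satisfies the sufficiency conditions of Section 3 depends on the model.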

2.2 Objectives

Our goal is to discover properties/features of the compressed information K sufficient to guarantee that (i) there exist K-based BNE and SE; (ii) the set of K-based BNE (resp. SE) payoff profiles is equal to the set of (general strategy based) BNE (resp. SE) payoff profiles under perfect recall.

To achieve the above-stated objectives we proceed as follows: First, we introduce two notions of information state, namely MSI and USI (Section 3). Then, we investigate the existence of MSI-based and USI-based BNE and SE, as well as the preservation of the set of all BNE and SE payoff profiles when USI-based strategies are employed by all players (Section 4).

Remark 2.

A key challenge in achieving the above-stated goal is the following: Unlike the case of perfect recall, one may not be able to recover K_{t-1}^i from K_t^i. Therefore, K^i-based (behavioral) strategies are not equivalent to mixed strategies supported on the set of K^i-based pure strategies. This fact creates difficulty for analyzing K^i-based strategies since the standard technique of using Kuhn's Theorem [22] to transform mixed strategies into behavioral strategies does not apply. To resolve this challenge, we develop stochastic control theory-based techniques that allow us to work with K^i-based behavioral strategies directly rather than transforming from mixed strategies.

Remark 3.

In the following sections, when referring to the compressed information K_t^i, we will consider the compression mappings \iota^i to be fixed and given, so that K_t^i is fixed given H_t^i. The space of compressed information \mathcal{K}_t^i is a fixed, finite set given \iota^i. When we use k_t^i to represent a realization of K_t^i, we assume that it corresponds to the compression of H_t^i=h_t^i under the fixed \iota^i.

3 Two Definitions of Information State

Before we define notions of information state in dynamic games we introduce the notion of information state for one player when other players’ strategies are fixed. The following definition is an extension of the definition of information state in [48].

Definition 3.

Let g^{-i} be a behavioral strategy profile of players other than i. We say that K^i is an information state under g^{-i} if there exist functions (P_t^{i,g^{-i}})_{t\in\mathcal{T}},(r_t^{i,g^{-i}})_{t\in\mathcal{T}}, where P_t^{i,g^{-i}}\colon\mathcal{K}_t^i\times\mathcal{U}_t^i\mapsto\Delta(\mathcal{K}_{t+1}^i) and r_t^{i,g^{-i}}\colon\mathcal{K}_t^i\times\mathcal{U}_t^i\mapsto[-1,1], such that

  (1)

    \mathbb{P}^{g^i,g^{-i}}(k_{t+1}^i|h_t^i,u_t^i)=P_t^{i,g^{-i}}(k_{t+1}^i|k_t^i,u_t^i) for all t\in\mathcal{T}\backslash\{T\};

  (2)

    \mathbb{E}^{g^i,g^{-i}}[R_t^i|h_t^i,u_t^i]=r_t^{i,g^{-i}}(k_t^i,u_t^i) for all t\in\mathcal{T},

for all g^i, and all (h_t^i,u_t^i) admissible under (g^i,g^{-i}). (Both P_t^{i,g^{-i}} and r_t^{i,g^{-i}} may depend on g^{-i}, but they do not depend on g^i.)

In the absence of other players, the above definition is exactly the same as the definition of an information state for player i's control problem. When other players are present, the parameters of player i's control problem, in general, depend on the strategies of the other players. As a consequence, an information state under one strategy profile g^{-i} may not be an information state under a different strategy profile \tilde{g}^{-i}.
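To illustrate what Definition 3 provides, the following is a sketch of the dynamic program it enables for player i against a fixed g^{-i} (this display is only illustrative and is not a restatement of the formal constructions used in the proofs): with V_{T+1}^i\equiv 0, define, for t=T,T-1,\cdots,1,

Q_t^i(k_t^i,u_t^i)=r_t^{i,g^{-i}}(k_t^i,u_t^i)+\sum_{k_{t+1}^i}P_t^{i,g^{-i}}(k_{t+1}^i|k_t^i,u_t^i)\,V_{t+1}^i(k_{t+1}^i),\qquad V_t^i(k_t^i)=\max_{u_t^i}Q_t^i(k_t^i,u_t^i).

Since P_t^{i,g^{-i}} and r_t^{i,g^{-i}} do not depend on player i's own strategy, a K^i-based strategy that selects actions from \arg\max_{u_t^i}Q_t^i(k_t^i,u_t^i) is, by standard stochastic control arguments, a best response to g^{-i} among all full-history-based strategies.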

3.1 Mutually Sufficient Information

Definition 4 (Mutually Sufficient Information).

We say that K=(K^i)_{i\in\mathcal{I}} is mutually sufficient information (MSI) if for all players i\in\mathcal{I} and all K^{-i}-based strategy profiles \rho^{-i}, K^i is an information state under \rho^{-i}.

In words, MSI represents mutually consistent compression of information in a dynamic game: Player i can compress their information to K^i without loss of performance when the other players compress their information to K^{-i}. Note that MSI imposes interdependent conditions on the compression maps of all players: It requires the compression maps of all players to be consistent with each other.

The following lemma provides a sufficient condition for compression maps to yield mutually sufficient information.

Lemma 1.

If for all i\in\mathcal{I} and all K^{-i}-based strategy profiles \rho^{-i}, there exist functions (\Phi_t^{i,\rho^{-i}})_{t\in\mathcal{T}} where \Phi_t^{i,\rho^{-i}}\colon\mathcal{K}_t^i\mapsto\Delta(\mathcal{X}_t\times\mathcal{K}_t^{-i}) such that

\mathbb{P}^{g^i,\rho^{-i}}(x_t,k_t^{-i}|h_t^i)=\Phi_t^{i,\rho^{-i}}(x_t,k_t^{-i}|k_t^i), (7)

for all behavioral strategies g^i, all t\in\mathcal{T}, and all h_t^i admissible under (g^i,\rho^{-i}), then K=(K^i)_{i\in\mathcal{I}} is mutually sufficient information.

Proof 3.1.

See Appendix C.1.

In words, the condition of Lemma 1 means that K_t^i has the same predictive power as H_t^i in terms of forming a belief on the current state and the other players' compressed information whenever the other players are using compression-based strategies. This belief is sufficient for player i to predict the other players' actions and the future state evolution. Since the other players are using compression-based strategies, player i does not have to form a belief on the other players' full information in order to predict their actions.
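To make the connection to Definition 3 concrete, here is a rough sketch of the mechanism (the complete argument is in Appendix C.1; the display below is only meant to convey the idea): given the belief \Phi_t^{i,\rho^{-i}} of Lemma 1, the kernels required by Definition 3 can be assembled from the other players' K^{-i}-based strategies and the dynamics (1), e.g.

P_t^{i,\rho^{-i}}(k_{t+1}^i|k_t^i,u_t^i)=\sum_{x_t,k_t^{-i},u_t^{-i}}\Phi_t^{i,\rho^{-i}}(x_t,k_t^{-i}|k_t^i)\prod_{j\neq i}\rho_t^j(u_t^j|k_t^j)\,\mathbb{P}(k_{t+1}^i|x_t,u_t,k_t^i),

where the last factor is determined by (1) and the update functions \iota^i, and an analogous expression yields r_t^{i,\rho^{-i}}. Neither expression depends on h_t^i beyond k_t^i, or on player i's strategy g^i.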

3.2 Unilaterally Sufficient Information

Definition 1 (Unilaterally Sufficient Information).

We say that K^i is unilaterally sufficient information (USI) for player i\in\mathcal{I} if there exist functions (F_t^{i,g^i})_{t\in\mathcal{T}} and (\Phi_t^{i,g^{-i}})_{t\in\mathcal{T}} where F_t^{i,g^i}\colon\mathcal{K}_t^i\mapsto\Delta(\mathcal{H}_t^i), \Phi_t^{i,g^{-i}}\colon\mathcal{K}_t^i\mapsto\Delta(\mathcal{X}_t\times\mathcal{H}_t^{-i}) such that

\mathbb{P}^g(x_t,h_t|k_t^i)=F_t^{i,g^i}(h_t^i|k_t^i)\,\Phi_t^{i,g^{-i}}(x_t,h_t^{-i}|k_t^i), (8)

for all behavioral strategy profiles g, all t\in\mathcal{T}, and all k_t^i admissible under g. (Footnote 2: In the case where the random vectors X_t, H_t^i and H_t^{-i} share some common components, (8) should be interpreted in the following way: x_t, h_t^i and h_t^{-i} are three separate realizations that are not necessarily congruent with each other (i.e. they can disagree on their common parts). In the case of incongruency, the left-hand side equals 0. The equation needs to hold for all combinations of x_t\in\mathcal{X}_t, h_t^i\in\mathcal{H}_t^i and h_t^{-i}\in\mathcal{H}_t^{-i}.)

The definition of USI can be separated into two parts: The first part states that the conditional distribution of H_t^i, player i's full information, given K_t^i, the compressed information, does not depend on other players' strategies. This is similar to the idea of sufficient statistics in the statistics literature [20]: If player i would like to use their "data" H_t^i to estimate the "parameter" g^{-i}, then K_t^i is a sufficient statistic for this parameter estimation problem. The second part states that K_t^i has the same predictive power as H_t^i in terms of forming a belief on the current state and other players' full information. In contrast to the definition of mutually sufficient information, if K^i is unilaterally sufficient information, then K^i is sufficient for player i's decision making regardless of whether other players are using any information compression map.

3.3 Comparison

Using Lemma 1, it can be shown that if K^i is USI for each i\in\mathcal{I}, then K=(K^i)_{i\in\mathcal{I}} is MSI. The converse is not true. The following example illustrates the difference between MSI and USI.
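A sketch of the first step of this claim (a rough argument, not the full proof): fix i and a K^{-i}-based strategy profile \rho^{-i}. Since k_t^i is a function of h_t^i, the factorization (8) yields

\mathbb{P}^{g^i,\rho^{-i}}(x_t,h_t^{-i}|h_t^i)=\dfrac{\mathbb{P}^{g^i,\rho^{-i}}(x_t,h_t|k_t^i)}{\mathbb{P}^{g^i,\rho^{-i}}(h_t^i|k_t^i)}=\Phi_t^{i,\rho^{-i}}(x_t,h_t^{-i}|k_t^i),

because the factor F_t^{i,g^i}(h_t^i|k_t^i) cancels. Summing the right-hand side over the realizations h_t^{-i} that compress to a given k_t^{-i} produces a belief of the form required by Lemma 1, from which MSI follows.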

Example 3.2.

Consider a two-stage stateless (i.e. X_t=\varnothing) game with two players: Alice (A) moves first and Bob (B) moves afterwards. There is no initial information (i.e. H_1^A=H_1^B=\varnothing).

At time t=1, Alice chooses U_1^A\in\{0,1\}. The instantaneous rewards of both players are given by

R_1^A=U_1^A,\qquad R_1^B=-U_1^A. (9)

The new information of both Alice and Bob at time 1 is Z_1^A=Z_1^B=U_1^A, i.e. Alice's action is observed.

At time t=2, Bob chooses U_2^B\in\{-1,1\}. The instantaneous rewards of both players are given by

R_2^A=U_2^B,\qquad R_2^B=0. (10)

Set K_t^A=H_t^A and K_t^B=\varnothing for both t\in\{1,2\}. It can be shown that K is mutually sufficient information. However, K^B is not unilaterally sufficient information: We have \mathbb{P}^g(h_2^B|k_2^B)=\mathbb{P}^g(u_1^A)=g_1^A(u_1^A|\varnothing), while the definition of USI requires that \mathbb{P}^g(h_2^B|k_2^B)=F_t^{B,g^B}(h_2^B|k_2^B) for some function F_t^{B,g^B} that does not depend on g^A.

4 Information-State Based Equilibrium

In this section, we formulate our results on MSI-based and USI-based equilibria for two equilibrium concepts: Bayes–Nash equilibrium and sequential equilibrium.

4.1 Information-State Based Bayes–Nash Equilibrium

Theorem 2.

If K is mutually sufficient information, then there exists at least one K-based BNE.

Proof 4.1.

See Appendix C.2.

The main idea for the proof of Theorem 2 is the definition of a best-response correspondence through the dynamic program for an underlying single-agent control problem.

Theorem 3.

If K=(K^i)_{i\in\mathcal{I}} where K^i is unilaterally sufficient information for player i, then the set of K-based BNE payoffs is the same as that of all BNE.

Proof 4.2.

See Appendix C.3.

The intuition behind Theorem 3 is that one can think of player i's information that is not included in the unilaterally sufficient information K_t^i as a private randomization device for player i: When player i is using a strategy that depends on their information outside of K_t^i, it is as if they are using a randomized K^i-based strategy. The main idea for the proof of Theorem 3 is to show that for every BNE strategy profile g, player i can switch to an "equivalent" randomized K^i-based strategy \rho^i while maintaining the equilibrium and payoffs. (Footnote 3: Besides the connection of USI to sufficient statistics, the idea behind the construction of the equivalent K^i-based strategy is also closely related to the Rao–Blackwell estimator [20], where a new estimator is obtained by taking the conditional expectation of the old estimator given the sufficient statistic.) The theorem then follows from iteratively switching the strategy of each player.
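A sketch of one natural candidate for this construction (the precise construction and its verification are the subject of Appendix C.3): given g, use the strategy-independent kernel F_t^{i,g^i} of Definition 1 to set

\rho_t^i(u_t^i|k_t^i):=\sum_{h_t^i}F_t^{i,g^i}(h_t^i|k_t^i)\,g_t^i(u_t^i|h_t^i),

i.e. player i's behavioral strategy averaged over the conditional distribution of their full information given the compressed information. The substance of the proof is to show that replacing g^i by \rho^i changes neither player i's payoff nor the other players' incentives.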

Example 3.2 can also be used to illustrate that when K is MSI but not USI, K-based BNE exist but K-based strategies do not attain all equilibrium payoffs.

Example 3.2 (Continued).

In this example, K_t^A=H_t^A, K_t^B=\varnothing for t=1,2 is MSI. Furthermore, it can be shown that the following strategy profiles are BNE of the game: (E1) Alice plays U_1^A=1 at time 1 and Bob plays U_2^B=1 irrespective of Alice's action at time 1; and (E2) Alice plays U_1^A=0 at time 1; Bob plays U_2^B=1 if U_1^A=0 and U_2^B=-1 if U_1^A=1. Equilibrium (E1) is a K-based equilibrium. However, (E2) cannot be attained by a K-based strategy profile for the following reason: In any K-based equilibrium, Bob plays the same mixed strategy irrespective of Alice's action, and his expected payoff at the end of the game is -1. At (E2), Bob's expected payoff at the end of the game is 0. Therefore, the payoff at (E2) cannot be attained by any K-based strategy profile.
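For concreteness, the payoff profiles at the two equilibria can be read off from (9) and (10): at (E1), Alice's total payoff is R_1^A+R_2^A=1+1=2 and Bob's is -1+0=-1; at (E2), Alice's total payoff is 0+1=1 and Bob's is 0+0=0.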

4.2 Information-State Based Sequential Equilibrium

Theorem 4.

If K is mutually sufficient information, then there exists at least one K-based sequential equilibrium.

Proof 4.3.

See Appendix C.4.

The proof of Theorem 4 follows steps similar to those of Theorem 2. The difference is that we explicitly construct a sequence of conjectured history-action value functions Q^{(n)} (as defined in Definition 2) using the dynamic program of player i's decision problem. Then we argue that the strategies and the conjectures satisfy Definition 2.

Theorem 5.

If K=(K^i)_{i\in\mathcal{I}} where K^i is unilaterally sufficient information for player i, then the set of K-based sequential equilibrium payoffs is the same as that of all sequential equilibria.

Proof 4.4.

See Appendix C.5.

The proof of Theorem 5 mostly follows the same ideas as that of Theorem 3: for each sequential equilibrium strategy profile g, we construct an "equivalent" K^i-based strategy \rho^i for player i with a construction similar to that of Theorem 3. The critical part is to show that \rho^i is still sequentially rational under the concept of sequential equilibrium.

5 Discussion

In this section we first investigate whether USI preserves the set of equilibrium payoffs achievable under perfect recall when refinements of BNE other than SE, namely various versions of Perfect Bayesian Equilibrium (PBE), are considered. Then, we identify MSI and USI in specific models that have appeared in the literature.

5.1 Other Equilibrium Concepts

We first present Example 5.1 to show that the result of Theorem 5 does not hold when we replace SE with the concept of weak Perfect Bayesian Equilibrium (wPBE) [25], a refinement of BNE that is weaker than SE. Then, we discuss how the result of Proposition 6, which is part of Example 5.1 and appears below, applies or does not apply to other versions of PBE, namely those defined in [57] and [6].

The concept of wPBE is defined as follows: Let (g,\mu) be an assessment, where g is a behavioral strategy profile as specified in Section 2 and \mu is a system of functions representing players' beliefs in the extensive-form game representation. Then, (g,\mu) is said to be a weak perfect Bayesian equilibrium [25] if g is sequentially rational with respect to \mu and \mu satisfies Bayes rule with respect to g on the equilibrium path. The concept of wPBE does not impose any restrictions on beliefs off the equilibrium path.

Example 5.1.

Consider a two-stage game with two players: Bob (B) moves at stage 1; Alice (A) and Bob move simultaneously at stage 2. Let X_1^A,X_1^B be independent uniform random variables on \{-1,+1\} representing the types of the players. The state satisfies X_1=(X_1^A,X_1^B) and X_2=X_1^B. The sets of actions are \mathcal{U}_1^B=\{-1,+1\}, \mathcal{U}_2^A=\mathcal{U}_2^B=\{-1,0,+1\}. The information structure is given by

H_1^A = X_1^A,\quad H_1^B = X_1^B; (11)
H_2^A = (X_1^A,U_1^B),\quad H_2^B = (X_1^B,U_1^B), (12)

i.e. types are private and actions are observable.

The instantaneous payoffs of Alice are given by

R_1^A = \begin{cases}-1,&\text{if }U_1^B=-1;\\ 0,&\text{otherwise},\end{cases}\qquad R_2^A=\begin{cases}1,&\text{if }U_2^A=X_2\text{ or }U_2^A=0;\\ 0,&\text{otherwise}.\end{cases}

The instantaneous payoffs of Bob are given by

R_1^B = \begin{cases}0.2,&\text{if }U_1^B=-1;\\ 0,&\text{otherwise},\end{cases}\qquad R_2^B=\begin{cases}-1,&\text{if }U_2^A=U_2^B;\\ 0,&\text{otherwise}.\end{cases}

Define K_1^A=X_1^A and K_2^A=U_1^B. It can be shown that K^A is unilaterally sufficient information for Alice. (Footnote 4: In fact, this example can be seen as an instance of the model described in Example 5.6, which we introduce later.) Set K_t^B=H_t^B, i.e. no compression of Bob's information. Then, K^B is trivially unilaterally sufficient information for Bob.

Proposition 6.

In the game defined in Example 5.1, the set of K-based wPBE payoffs is a proper subset of that of all wPBE payoffs.

Proof 5.2.

See Appendix D.1.

Note that since any wPBE is first and foremost a BNE, by Theorem 3, any general strategy based wPBE payoff profile can be attained by a K-based BNE. However, Proposition 6 implies that there exists a wPBE payoff profile such that none of the K-based BNEs attaining this payoff profile are wPBEs.

Intuitively, the reason why some wPBE payoff profiles are unachievable by K-based wPBE in this example can be explained as follows. The state X_1^A in this game can be thought of as a private randomization device of Alice that is payoff irrelevant (i.e. a private coin flip) and should not play a role in the outcome of the game. However, under the concept of wPBE, the presence of X_1^A enables Alice to implement off-equilibrium strategies that are otherwise not sequentially rational. This holds due to the following: For a fixed realization of U_1^B, the two realizations of X_1^A give rise to two different information sets. Under the concept of wPBE, if the two information sets are both off the equilibrium path, Alice is allowed to form different beliefs and hence justify the use of different mixed actions under different realizations of X_1^A. Therefore, the presence of X_1^A can expand Alice's set of "justifiable" mixed actions off-equilibrium. By restricting Alice to K^A-based strategies, i.e. choosing her mixed action independently of X_1^A, Alice loses the ability to use some mixed actions off-equilibrium in a "justifiable" manner, and hence loses her power to sustain certain equilibrium outcomes. This phenomenon, however, does not happen under the concept of sequential equilibrium, since SE (quite reasonably) requires Alice to use the same belief on two information sets if they differ only in the realization of X_1^A.

With similar approaches, one can establish the analogue of Proposition 6 for the perfect Bayesian equilibrium concept defined in [57] (which we refer to as "Watson's PBE"). Simply put, this is because Watson's PBE imposes conditions on the belief update for each pair of successive information sets separately; there are no restrictions across different pairs of successive information sets. As a result, for a fixed realization of U_1^B, Alice is allowed to form different beliefs under the two realizations of X_1^A, just as under wPBE, as long as both beliefs are reasonable on their own. In fact, in the proof of Proposition 6, the two off-equilibrium belief updates both satisfy Watson's condition of plain consistency [57].

Approaches similar to those in the proof of Proposition 6, however, do not apply to the PBE concept defined with the independence property of conditional probability systems specified in [6] (which we refer to as "Battigalli's PBE"). In fact, Battigalli's PBE is equivalent to sequential equilibrium if the dynamic game consists of only two strategic players [6]. We conjecture that in general games with three or more players, if K is USI, then the set of all K-based Battigalli's PBE payoffs is the same as that of all Battigalli's PBE payoffs. However, establishing this result can be difficult due to the complexity of Battigalli's conditions.

5.2 Information States in Specific Models

In this section, we identify MSI and USI in specific game models studied in the literature. Whereas we recover some existing results using our framework, we also develop some new results.

Example 5.3.

Consider stateless dynamic games with observable actions, i.e. X_t=\varnothing, H_1^i=\varnothing, Z_t^i=U_t for all i\in\mathcal{I}. One instance of such games is the class of repeated games [10]. In this game, H_t^i=U_{1:t-1} for all i\in\mathcal{I}. Let (\iota_t^0)_{t\in\mathcal{T}} be an arbitrary common update function and let K^i=K^0 be generated from (\iota_t^0)_{t\in\mathcal{T}}. Then K is mutually sufficient information, since the condition of Lemma 1 is trivially satisfied, as sketched below. As a result, Theorem 2 holds for K, i.e. there exists at least one K-based BNE.
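A sketch of this verification, mirroring the computation displayed in Example 5.4 below: since X_t=\varnothing and each player's compressed information equals the common value k_t^0 computed from u_{1:t-1} through (\iota_t^0)_{t\in\mathcal{T}}, for any K^{-i}-based profile \rho^{-i} and any behavioral strategy g^i we have

\mathbb{P}^{g^i,\rho^{-i}}(\tilde{k}_t^{-i}|h_t^i)=\prod_{j\neq i}\bm{1}_{\{\tilde{k}_t^j=k_t^0\}}=:\Phi_t^{i,\rho^{-i}}(\tilde{k}_t^{-i}|k_t^i),

which depends on h_t^i=u_{1:t-1} only through k_t^i=k_t^0, so the condition of Lemma 1 holds.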

However, in general, K is not unilaterally sufficient information. To see this, consider the case when some player j\neq i uses a strategy that chooses different mixed actions for different realizations of U_{1:t-1}. In this case \mathbb{P}^{g^i,g^{-i}}(\tilde{k}_{t+1}^i|h_t^i,u_t^i) would potentially depend on U_{1:t-1} as a whole. This means that K^i is not an information state for player i under g^{-i}, which violates Lemma C.4.

Furthermore, for this K, the result of Theorem 3 does not necessarily hold, i.e. the set of K-based BNE payoffs may not be the same as that of all BNE. Example 3.2 can be used to show this.

Example 5.4.

The model of [26] is a special case of our dynamic game model where Z_t^i=(X_{t+1},U_t), i.e. the (past and current) states and past actions are observable. In this case, K=(K_t^i)_{t\in\mathcal{T},i\in\mathcal{I}} with K_t^i=X_t is mutually sufficient information; note that H_t^i=(X_{1:t},U_{1:t-1}). Consider a K^{-i}-based strategy profile \rho^{-i}, i.e. \rho_t^j\colon\mathcal{X}_t\mapsto\Delta(\mathcal{U}_t^j) for t\in\mathcal{T}, j\in\mathcal{I}\backslash\{i\}. We have

\mathbb{P}^{g^i,\rho^{-i}}(\tilde{x}_t,\tilde{k}_t^{-i}|h_t^i) = \mathbb{P}^{g^i,\rho^{-i}}(\tilde{x}_t,\tilde{k}_t^{-i}|x_{1:t},u_{1:t-1}) (13)
= \bm{1}_{\{\tilde{x}_t=x_t\}}\prod_{j\neq i}\bm{1}_{\{\tilde{k}_t^j=x_t\}} (14)
=: \Phi_t^{i,\rho^{-i}}(\tilde{x}_t,\tilde{k}_t^{-i}|x_t). (15)

Hence K is mutually sufficient information by Lemma 1. As a result, there exists at least one K-based BNE.

Similar to Example 5.3, in general, K is not unilaterally sufficient information, and the set of K-based BNE payoffs may not be the same as that of all BNE. The argument for both claims can be carried out in a way analogous to Example 5.3.

Example 5.5.

The model of [33] is a special case of our dynamic model satisfying the following conditions.

  (1)

    The information of each player i can be separated into the common information H_t^0 and private information L_t^i, i.e. there exists a strategy-independent bijection between H_t^i and (H_t^0,L_t^i) for all i\in\mathcal{I}.

  (2)

    The common information H_t^0 can be sequentially updated, i.e.

    H_{t+1}^0 = (H_t^0,Z_t^0), (16)

    where Z_t^0=\bigcap_{i\in\mathcal{I}}Z_t^i is the common part of the new information of all players at time t.

  (3)

    The private information L_t^i can be sequentially updated, i.e. there exist functions (\zeta_t^i)_{t=0}^{T-1} such that

    L_{t+1}^i = \zeta_t^i(L_t^i,Z_t^i). (17)

In [33], the authors impose the following assumption.

Assumption 2 (Strategy independence of beliefs).

There exists a function P_t^0 such that

\mathbb{P}^g(x_t,l_t|h_t^0)=P_t^0(x_t,l_t|h_t^0), (18)

for all behavioral strategy profiles g whenever \mathbb{P}^g(h_t^0)>0, where l_t=(l_t^i)_{i\in\mathcal{I}}.

In this model, if we set K_t^i=(\Pi_t,L_t^i) where \Pi_t\in\Delta(\mathcal{X}_t\times\mathcal{S}_t) is a function of H_t^0 defined through

\Pi_t(x_t,l_t):=P_t^0(x_t,l_t|H_t^0), (19)

then K=(K^i)_{i\in\mathcal{I}} is mutually sufficient information. First note that K_t^i can be sequentially updated, since \Pi_t can be sequentially updated using Bayes rule. Then

\mathbb{P}^{g^i,\rho^{-i}}(\tilde{x}_t,\tilde{l}_t^{-i}|h_t^i) = \mathbb{P}^{g^i,\rho^{-i}}(\tilde{x}_t,\tilde{l}_t^{-i}|h_t^0,l_t^i) (20)
= \dfrac{\mathbb{P}^{g^i,\rho^{-i}}(\tilde{x}_t,l_t^i,\tilde{l}_t^{-i}|h_t^0)}{\mathbb{P}^{g^i,\rho^{-i}}(l_t^i|h_t^0)} (21)
= \dfrac{P_t^0(\tilde{x}_t,(l_t^i,\tilde{l}_t^{-i})|h_t^0)}{\sum_{\hat{x}_t,\hat{l}_t^{-i}}P_t^0(\hat{x}_t,(l_t^i,\hat{l}_t^{-i})|h_t^0)} (22)
= \dfrac{\pi_t(\tilde{x}_t,(l_t^i,\tilde{l}_t^{-i}))}{\sum_{\hat{x}_t,\hat{l}_t^{-i}}\pi_t(\hat{x}_t,(l_t^i,\hat{l}_t^{-i}))} (23)
=: \tilde{\Phi}_t^{i,\rho^{-i}}(\tilde{x}_t,\tilde{l}_t^{-i}|k_t^i), (24)

for some function \tilde{\Phi}_t^{i,\rho^{-i}}, where \pi_t is the realization of \Pi_t corresponding to H_t^0=h_t^0. In steps (21) and (22) we apply Bayes rule to the conditional probabilities given h_t^0, and we use Assumption 2 to express the belief via the strategy-independent function P_t^0.

Note that K_t^{-i} is contained in the vector (K_t^i,L_t^{-i}), hence we conclude that

\mathbb{P}^{g^i,\rho^{-i}}(\tilde{x}_t,\tilde{k}_t^{-i}|h_t^i) =: \Phi_t^{i,\rho^{-i}}(\tilde{x}_t,\tilde{k}_t^{-i}|k_t^i), (25)

for some function \Phi_t^{i,\rho^{-i}}. By Lemma 1 we conclude that K is mutually sufficient information. Therefore there exists at least one K-based BNE.

Similar to Examples 5.3 and 5.4, in general, K is not unilaterally sufficient information, and the set of K-based BNE payoffs may not be the same as that of all BNE. The argument for both claims can be carried out in a way analogous to Examples 5.3 and 5.4.

Example 5.6.

The following model is a variant of [36] and [56].

  • Each player i is associated with a local state X_t^i, and X_t=(X_t^i)_{i\in\mathcal{I}}.

  • Each player i is associated with a local noise process W_t^i, and W_t=(W_t^i)_{i\in\mathcal{I}}.

  • There is no initial information, i.e. H_1^i=\varnothing for all i\in\mathcal{I}.

  • There is a public noisy observation Y_t^i of the local state. The state transitions, observation processes, and reward generation processes satisfy the following:

    (X_{t+1}^i,Y_t^i) = f_t^i(X_t^i,U_t,W_t^i),\quad\forall i\in\mathcal{I}, (26)
    R_t^i = r_t^i(X_t,U_t),\quad\forall i\in\mathcal{I}. (27)
  • The information player i has at time t is H_t^i=(Y_{1:t-1},U_{1:t-1},X_{1:t}^i) for i\in\mathcal{I}, where Y_t=(Y_t^i)_{i\in\mathcal{I}}.

  • All the primitive random variables, i.e. the random variables in the collection (X_1^i)_{i\in\mathcal{I}}\cup(W_t^i)_{i\in\mathcal{I},t\in\mathcal{T}}, are mutually independent.

Proposition 7.

In the model of Example 5.6, K_{t}^{i}=(Y_{1:t-1},U_{1:t-1},X_{t}^{i}) is unilaterally sufficient information.555K^{i}-based strategies in this setting are closely related to the "strategies of type s" defined in [56]. In [56], the authors showed that strategy profiles of type s can attain all equilibrium payoffs attainable by general strategy profiles. However, they did not show that these payoffs can be attained by type-s strategy profiles that are themselves equilibria.

Proof 5.7.

See Appendix D.2.

Finally, we note that the concept of USI is also useful in the context of games among teams. We omit the details of the following example due to its complexity.

Example 5.8.

In the model of games among teams with delayed intra-team information sharing analyzed in [51], the authors defined the notion of sufficient private information (SPI). It can be shown (through the arguments in [51, Section 4.3] and [50, Chapters 4.6.1 and 4.6.2]) that Kti=(Ht0,Sti)K_{t}^{i}=(H_{t}^{0},S_{t}^{i}), which consists of the common information Ht0H_{t}^{0} and the SPI StiS_{t}^{i}, is unilaterally sufficient information.

6 An Open Problem

Identifying strategy-dependent compression maps that guarantee existence of at least one equilibrium (BNE or SE) or maintain all equilibria that exist under perfect recall is an open problem.

A known strategy-dependent compression map is one that first compresses each agent's private information separately (resulting in "sufficient private information"), and then compresses the agents' common information (resulting in "common information based (CIB) beliefs" on the system state and the agents' sufficient private information) [35, 36, 56, 51, 37]. Such a compression does not possess any of the properties of the strategy-independent compression maps that result in MSI or USI. The following example presents a game where belief-based equilibria, i.e. equilibrium strategy profiles based on the above-described compression, do not exist.

Example 6.1.

Consider the following two-stage zero-sum game. The players are Alice (A) and Bob (B). Alice acts at stage t=1t=1 and Bob at stage t=2t=2. The game’s initial state X1X_{1} is distributed uniformly at random on {1,+1}\{-1,+1\}. Let HtA,HtBH_{t}^{A},H_{t}^{B} denote Alice’s and Bob’s information at stage tt, and UtA,UtBU_{t}^{A},U_{t}^{B} denote Alice’s and Bob’s actions at stage tt, t=1,2t=1,2. We assume that H1A=X1,H1B=H_{1}^{A}=X_{1},H_{1}^{B}=\varnothing, i.e. Alice knows X1X_{1} and Bob does not. At stage t=1t=1, Alice chooses U1A{1,1}U_{1}^{A}\in\{-1,1\}, and the state transition is given by X2=X1U1AX_{2}=X_{1}\cdot U_{1}^{A}. At stage t=2t=2, we assume that H2A=(X1:2,U1A)H_{2}^{A}=(X_{1:2},U_{1}^{A}) and H2B=U1AH_{2}^{B}=U_{1}^{A}, i.e. Bob observes Alice’s action but not the state before or after Alice’s action. At time t=2t=2, Bob picks an action U2B{U,D}U_{2}^{B}\in\{\mathrm{U},\mathrm{D}\}. Alice’s instantaneous rewards are given by

R1A={cif U1A=+1;0if U1A=1,andR2A={2if X2=+1,U2B=U;1if X2=1,U2B=D;0otherwise,\displaystyle R_{1}^{A}=\begin{cases}c&\text{if }U_{1}^{A}=+1;\\ 0&\text{if }U_{1}^{A}=-1,\end{cases}\qquad\text{and}\qquad R_{2}^{A}=\begin{cases}2&\text{if }X_{2}=+1,U_{2}^{B}=\mathrm{U};\\ 1&\text{if }X_{2}=-1,U_{2}^{B}=\mathrm{D};\\ 0&\text{otherwise},\end{cases} (28)

where c(0,1/3)c\in(0,1/3). The stage reward for Bob is RtB=RtAR_{t}^{B}=-R_{t}^{A} for t=1,2t=1,2.

The above game is a signaling game which can be represented in extensive form as in Figure 2.

Figure 2: Extensive form of the game in Example 6.1.

In order to define the concept of belief-based equilibrium for this game, we specify the common information H_{t}^{0}, along with Alice's and Bob's private information, denoted by L_{t}^{A},L_{t}^{B}, respectively, for t=1,2 as follows:

H10\displaystyle H_{1}^{0} =,L1A=X1,L1B=,\displaystyle=\varnothing,\quad L_{1}^{A}=X_{1},\quad L_{1}^{B}=\varnothing, (29)
H20\displaystyle H_{2}^{0} =U1A,L2A=X2,L2B=.\displaystyle=U_{1}^{A},\quad L_{2}^{A}=X_{2},\quad L_{2}^{B}=\varnothing. (30)
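To illustrate the objects entering a belief-based (CIB) strategy profile in this game, the sketch below computes, for a given (hypothetical) mixed strategy of Alice, the CIB belief on X_2 induced by the observed U_1^A, Bob's corresponding best response, and Alice's resulting expected payoff. This is only an illustration of the compression described above; the nonexistence argument for belief-based equilibria is given in Appendix D.3.

# Illustration of the CIB belief and best responses in Example 6.1.
# Alice's strategy sigma and the off-path belief used below are hypothetical
# inputs; the nonexistence proof for belief-based equilibria is in Appendix D.3.
c = 0.2                                   # any c in (0, 1/3)

def p_u_given_x(sigma, u, x1):
    """P(U_1^A = u | X_1 = x1) when sigma[x1] = P(U_1^A = +1 | X_1 = x1)."""
    return sigma[x1] if u == +1 else 1.0 - sigma[x1]

def cib_belief(sigma, u):
    """CIB belief P(X_2 = +1 | U_1^A = u); X_1 ~ Unif{-1,+1} and X_2 = X_1 * U_1^A."""
    p_u = 0.5 * p_u_given_x(sigma, u, +1) + 0.5 * p_u_given_x(sigma, u, -1)
    if p_u == 0:
        return None                       # off-path: Bayes rule does not pin down the belief
    return 0.5 * p_u_given_x(sigma, u, u) / p_u     # X_2 = +1 iff X_1 = u

def bob_best_response(p):
    """Bob minimizes E[R_2^A]: U yields 2*p, D yields 1*(1-p), where p = P(X_2 = +1)."""
    return "U" if 2 * p < 1 - p else "D"

def alice_total_payoff(sigma, bob):
    total = 0.0
    for x1 in (+1, -1):
        for u in (+1, -1):
            pr = 0.5 * p_u_given_x(sigma, u, x1)
            x2 = x1 * u
            r1 = c if u == +1 else 0.0
            r2 = 2.0 if (x2 == +1 and bob[u] == "U") else 1.0 if (x2 == -1 and bob[u] == "D") else 0.0
            total += pr * (r1 + r2)
    return total

sigma = {+1: 1.0, -1: 1.0}                # e.g. Alice plays U_1^A = +1 regardless of X_1
beliefs = {u: cib_belief(sigma, u) for u in (+1, -1)}
bob = {u: bob_best_response(b if b is not None else 0.5) for u, b in beliefs.items()}  # 0.5: arbitrary off-path belief
print("CIB beliefs P(X_2=+1 | U_1^A):", beliefs)
print("Bob's best response:", bob, "; Alice's expected payoff:", alice_total_payoff(sigma, bob))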

We prove the following result.

Proposition 8.

In the game of Example 6.1 belief-based equilibria do not exist.

Proof 6.2.

See Appendix D.3.

7 Conclusion

In this paper, we investigated sufficient conditions for strategy-independent compression maps to be viable in dynamic games. Motivated by the literature on information states for control problems [23, 24, 48], we provided two notions of information state for dynamic games, both resulting from strategy-independent information compression maps, namely mutually sufficient information (MSI) and unilaterally sufficient information (USI). While MSI guarantees the existence of compression-based equilibria, USI guarantees that compression-based equilibria can attain all equilibrium payoff profiles that are achieved when all agents have perfect recall. We established these results under both the concept of Bayes-Nash equilibrium and that of sequential equilibrium. We discussed how USI does not guarantee the preservation of payoff profiles under certain other equilibrium refinements. We also considered a strategy-dependent compression map that results in sufficient private information for each agent, along with a CIB belief. We showed, by an example, that this information compression map does not possess any of the properties of the strategy-independent compression maps that result in MSI or USI.

The discovery of strategy-dependent information compression maps that lead to results similar to those of Theorems 2 and 4 or to those of Theorems 3 and 5 is a challenging open problem of paramount importance. Another important open problem is the discovery of information compression maps under which certain subsets of equilibrium payoff profiles are attained when strategies based on the resulting compressed information are used. The results of this paper have been derived for finite-horizon finite games. Extensions of the results to infinite-horizon games and to games with continuous action and state spaces are other interesting technical problems.

Author Contributions

This work is a collaborative intellectual effort of the three authors, with Dengwang Tang being the leader. Due to the interconnected nature of the results, it is impossible to separate the contributions of each author.

Funding

This work is supported by National Science Foundation (NSF) Grants No. ECCS 1750041, ECCS 2025732, ECCS 2038416, ECCS 1608361, CCF 2008130, and CMMI 2240981, Army Research Office (ARO) Award No. W911NF-17-1-0232, and Michigan Institute for Data Science (MIDAS) Sponsorship Funds by General Dynamics.

Data Availability

Not applicable since all results in this paper are theoretical.

Declarations

Conflict of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

Ethical Approval

Not applicable since no experiments are involved in this work.

Appendix A Information State of Single-Agent Control Problems

In this section we consider single-agent Markov Decision Processes (MDPs) and develop auxiliary results. This section is a recap of [24] with more detailed results and proofs.

Let XtX_{t} be a controlled Markov Chain controlled by action UtU_{t} with initial distribution ν1Δ(𝒳1)\nu_{1}\in\Delta(\mathcal{X}_{1}) and transition kernels P=(Pt)t𝒯,Pt:𝒳t×𝒰tΔ(𝒳t+1)P=(P_{t})_{t\in\mathcal{T}},P_{t}\colon\mathcal{X}_{t}\times\mathcal{U}_{t}\mapsto\Delta(\mathcal{X}_{t+1}). Let r=(rt)t𝒯,rt:𝒳t×𝒰tr=(r_{t})_{t\in\mathcal{T}},r_{t}\colon\mathcal{X}_{t}\times\mathcal{U}_{t}\mapsto\mathbb{R} be a collection of instantaneous reward functions. An MDP is denoted by a tuple (ν1,P,r)(\nu_{1},P,r).

For a Markov strategy g=(gt)t𝒯,gt:𝒳tΔ(𝒰t)g=(g_{t})_{t\in\mathcal{T}},g_{t}\colon\mathcal{X}_{t}\mapsto\Delta(\mathcal{U}_{t}), we use g,ν1,P\mathbb{P}^{g,\nu_{1},P} and 𝔼g,ν1,P\mathbb{E}^{g,\nu_{1},P} to denote the probabilities of events and expectations of random variables under the distribution specified by controlled Markov Chain (ν1,P)(\nu_{1},P) and strategy gg. When (ν1,P)(\nu_{1},P) is fixed and clear from the context, we use g\mathbb{P}^{g} and 𝔼g\mathbb{E}^{g} respectively.

Define the total expected reward in the MDP (ν1,P,r)(\nu_{1},P,r) under strategy gg by

J(g;ν1,P,r):=𝔼g,ν1,P[t=1Trt(Xt,Ut)].\displaystyle J(g;\nu_{1},P,r):=\mathbb{E}^{g,\nu_{1},P}\left[\sum_{t=1}^{T}r_{t}(X_{t},U_{t})\right]. (31)

Define the value function and state-action quality function by

Vτ(xτ;P,r)\displaystyle V_{\tau}(x_{\tau};P,r) :=maxgτ:T𝔼gτ:T,P[t=τTrt(Xt,Ut)|xτ],τ[T+1],\displaystyle:=\max_{g_{\tau:T}}\mathbb{E}^{g_{\tau:T},P}\left[\sum_{t=\tau}^{T}r_{t}(X_{t},U_{t})|x_{\tau}\right],\qquad\forall\tau\in[T+1], (32)
Qτ(xτ,uτ;P,r)\displaystyle Q_{\tau}(x_{\tau},u_{\tau};P,r) :=rτ(xτ,uτ)+x~τ+1Vτ+1(x~τ+1)Pτ(x~τ+1|xτ,uτ),τ[T].\displaystyle:=r_{\tau}(x_{\tau},u_{\tau})+\sum_{\tilde{x}_{\tau+1}}V_{\tau+1}(\tilde{x}_{\tau+1})P_{\tau}(\tilde{x}_{\tau+1}|x_{\tau},u_{\tau}),\qquad\forall\tau\in[T]. (33)

Note that VT+1(;P,r)0V_{T+1}(\cdot;P,r)\equiv 0.
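The functions in (32)–(33) can be computed by backward induction. A minimal sketch for a finite-horizon MDP with time-invariant state and action spaces is given below; the numerical data are hypothetical.

# Backward induction computing V_t and Q_t of (32)-(33) for a finite MDP.
# The transition kernels and rewards below are hypothetical numerical data.
import numpy as np

T_hor = 3                                  # horizon T
nx, nu_a = 2, 2                            # |X_t|, |U_t| (time-invariant for simplicity)
rng = np.random.default_rng(0)
P = [rng.dirichlet(np.ones(nx), size=(nx, nu_a)) for _ in range(T_hor)]   # P_t(x'|x,u)
r = [rng.uniform(-1, 1, size=(nx, nu_a)) for _ in range(T_hor)]           # r_t(x,u)

V = [np.zeros(nx) for _ in range(T_hor + 1)]            # V_{T+1} = 0
Q = [np.zeros((nx, nu_a)) for _ in range(T_hor)]
for t in reversed(range(T_hor)):
    Q[t] = r[t] + P[t] @ V[t + 1]          # Q_t(x,u) = r_t(x,u) + sum_x' V_{t+1}(x') P_t(x'|x,u)
    V[t] = Q[t].max(axis=1)                # V_t(x) = max_u Q_t(x,u)

print("V_1 =", V[0])
print("an optimal Markov strategy at t=1 (argmax over u):", Q[0].argmax(axis=1))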

Definition 9.

[24] Let Kt=Ψt(Xt)K_{t}=\Psi_{t}(X_{t}) for some function Ψt\Psi_{t}. Then, KtK_{t} is called an information state for (P,r)(P,r) if there exist functions PtK:𝒦t×𝒰tΔ(𝒦t+1),rtK:𝒦t×𝒰tP_{t}^{K}\colon\mathcal{K}_{t}\times\mathcal{U}_{t}\mapsto\Delta(\mathcal{K}_{t+1}),r_{t}^{K}\colon\mathcal{K}_{t}\times\mathcal{U}_{t}\mapsto\mathbb{R} such that

  1. (1)

    Pt(kt+1|xt,ut)=PtK(kt+1|Ψt(xt),ut)P_{t}(k_{t+1}|x_{t},u_{t})=P_{t}^{K}(k_{t+1}|\Psi_{t}(x_{t}),u_{t}); and

  2. (2)

    rt(xt,ut)=rtK(Ψt(xt),ut)r_{t}(x_{t},u_{t})=r_{t}^{K}(\Psi_{t}(x_{t}),u_{t}).

If K_{t} is an information state, then K_{t} is also a controlled Markov Chain with initial distribution \nu_{1}^{K}\in\Delta(\mathcal{K}_{1}) and transition kernel P^{K}=(P_{t}^{K})_{t\in\mathcal{T}}, where

ν1K(k1)=x1𝟏{k1=Ψ1(x1)}ν1(x1).\displaystyle\nu_{1}^{K}(k_{1})=\sum_{x_{1}}\bm{1}_{\{k_{1}=\Psi_{1}(x_{1})\}}\nu_{1}(x_{1}).

The tuple (ν1K,PK,rK)(\nu_{1}^{K},P^{K},r^{K}) defines a new MDP. For a KK-based strategy ρ=(ρt)t𝒯,ρt:𝒦tΔ(𝒰t)\rho=(\rho_{t})_{t\in\mathcal{T}},\rho_{t}\colon\mathcal{K}_{t}\mapsto\Delta(\mathcal{U}_{t}), the J,V,QJ,V,Q functions can be defined as above for the new MDP.
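Given (\nu_{1},P,r) and a candidate compression \Psi, conditions (1)–(2) of Definition 9 can be checked directly, and, when they hold, the compressed MDP is obtained by aggregating \nu_{1} over \Psi-classes and reading P^{K}, r^{K} off any representative state of each class. A minimal sketch, with hypothetical data in which \Psi is an information state by construction, follows.

# Checking Definition 9 for a candidate map Psi and assembling (nu_1^K, P^K, r^K).
# The data are hypothetical: x encodes (k, extra) with the second coordinate
# payoff- and transition-irrelevant, so Psi(x) = x // m is an information state.
import numpy as np

rng = np.random.default_rng(1)
T_hor, nk, m, nu_a = 2, 2, 2, 2
nx = nk * m
Psi = lambda x: x // m

PK = [rng.dirichlet(np.ones(nk), size=(nk, nu_a)) for _ in range(T_hor)]   # P_t^K(k'|k,u)
rK = [rng.uniform(-1, 1, size=(nk, nu_a)) for _ in range(T_hor)]           # r_t^K(k,u)
P = [np.array([[[PK[t][Psi(x), u, Psi(y)] / m for y in range(nx)]
                for u in range(nu_a)] for x in range(nx)]) for t in range(T_hor)]
r = [np.array([[rK[t][Psi(x), u] for u in range(nu_a)] for x in range(nx)]) for t in range(T_hor)]
nu1 = rng.dirichlet(np.ones(nx))

def is_information_state():
    """Conditions (1)-(2): P_t(k_{t+1}|x_t,u_t) and r_t(x_t,u_t) depend on x_t only through Psi(x_t)."""
    for t in range(T_hor):
        agg = np.array([[[sum(P[t][x, u, y] for y in range(nx) if Psi(y) == kp)
                          for kp in range(nk)] for u in range(nu_a)] for x in range(nx)])
        for x in range(nx):
            rep = Psi(x) * m                            # representative of the class of x
            if not (np.allclose(agg[x], agg[rep]) and np.allclose(r[t][x], r[t][rep])):
                return False
    return True

nu1K = np.array([sum(nu1[x] for x in range(nx) if Psi(x) == k) for k in range(nk)])
print("Psi is an information state:", is_information_state())
print("nu_1^K =", nu1K)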

We state the following standard result (see, for example, Section 2 of [48]).

Lemma A.1.

Let Kt=Ψt(Xt)K_{t}=\Psi_{t}(X_{t}) be an information state for (P,r)(P,r). Then

  1. (1)

    Vt(xt;P,r)=Vt(Ψt(xt);PK,rK)V_{t}(x_{t};P,r)=V_{t}(\Psi_{t}(x_{t});P^{K},r^{K}) for all xtx_{t};

  2. (2)

    Qt(xt,ut;P,r)=Qt(Ψt(xt),ut;PK,rK)Q_{t}(x_{t},u_{t};P,r)=Q_{t}(\Psi_{t}(x_{t}),u_{t};P^{K},r^{K}) for all xt,utx_{t},u_{t}.

Definition 10.

Let g be a Markov strategy. A K-based strategy \rho is said to be associated with g if

ρt(kt)=𝔼g,ν1,P[gt(Xt)|kt],\displaystyle\rho_{t}(k_{t})=\mathbb{E}^{g,\nu_{1},P}[g_{t}(X_{t})|k_{t}], (34)

whenever g,ν1,P(kt)>0\mathbb{P}^{g,\nu_{1},P}(k_{t})>0.
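The conditional expectation in (34) can be computed by propagating the state distribution forward under g. A minimal sketch with hypothetical data (and an arbitrary compression map \Psi, which need not be an information state for (34) to be well defined) is shown below.

# Computing a K-based strategy rho associated with a Markov strategy g, as in (34):
# rho_t(u|k) = E^{g, nu_1, P}[ g_t(u|X_t) | K_t = k ].  All numerical data are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
T_hor, nx, nu_a, m = 3, 4, 2, 2
Psi = lambda x: x // m                                  # compression map K_t = Psi_t(X_t)
nk = nx // m

nu1 = rng.dirichlet(np.ones(nx))                                        # nu_1
P = [rng.dirichlet(np.ones(nx), size=(nx, nu_a)) for _ in range(T_hor)] # P_t(x'|x,u)
g = [rng.dirichlet(np.ones(nu_a), size=nx) for _ in range(T_hor)]       # g_t(u|x)

rho = [np.zeros((nk, nu_a)) for _ in range(T_hor)]
mu = nu1.copy()                                         # mu[x] = P^g(X_t = x)
for t in range(T_hor):
    for k in range(nk):
        xs = [x for x in range(nx) if Psi(x) == k]
        pk = sum(mu[x] for x in xs)                     # P^g(K_t = k)
        if pk > 0:                                      # (34) is only required when P^g(k_t) > 0
            rho[t][k] = sum(mu[x] * g[t][x] for x in xs) / pk
    mu = np.einsum("x,xu,xuy->y", mu, g[t], P[t])       # propagate one step forward under g

print("rho_1 =", rho[0])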

The following lemma will be used in the proofs in Appendix C.

Lemma A.2 (Policy Equivalence Lemma).

Let (\nu_{1},P,r) be an MDP. Let K_{t} be an information state for (P,r). Let a K-based strategy \rho be associated with a Markov strategy g. Then

  1. (1)

    g,ν1,P(kt)=ρ,ν1,P(kt)\mathbb{P}^{g,\nu_{1},P}(k_{t})=\mathbb{P}^{\rho,\nu_{1},P}(k_{t}) for all kt𝒦tk_{t}\in\mathcal{K}_{t} and t𝒯t\in\mathcal{T};

  2. (2)

    J(g;ν1,P,r)=J(ρ;ν1,P,r)J(g;\nu_{1},P,r)=J(\rho;\nu_{1},P,r).

Proof A.3.

In this proof all probabilities and expectations are defined with respect to (\nu_{1},P). Given a Markov strategy g, let \rho be an information state-based strategy that satisfies (34).

First, we have

g(ut|kt)=𝔼g[gt(ut|Xt)|kt]=ρt(ut|kt),\displaystyle\mathbb{P}^{g}(u_{t}|k_{t})=\mathbb{E}^{g}[g_{t}(u_{t}|X_{t})|k_{t}]=\rho_{t}(u_{t}|k_{t}), (35)

for all ktk_{t} such that g(kt)>0\mathbb{P}^{g}(k_{t})>0.

  1. (1)

    Proof by induction:

    Induction Base: We have g(k1)=ρ(k1)\mathbb{P}^{g}(k_{1})=\mathbb{P}^{\rho}(k_{1}) since the distribution of K1=Ψ1(X1)K_{1}=\Psi_{1}(X_{1}) is strategy-independent.

    Induction Step: Suppose that

    g(kt)=ρ(kt),\displaystyle\mathbb{P}^{g}(k_{t})=\mathbb{P}^{\rho}(k_{t}), (36)

    for all kt𝒦tk_{t}\in\mathcal{K}_{t}. We prove the result for time t+1t+1. Combining (35) and (36), and incorporating the information state transition kernel PtKP_{t}^{K} defined in Definition 9, we have

    g(kt+1)\displaystyle\mathbb{P}^{g}(k_{t+1}) =k~t,u~tg(kt+1|k~t,u~t)g(u~t|k~t)g(k~t)\displaystyle=\sum_{\tilde{k}_{t},\tilde{u}_{t}}\mathbb{P}^{g}(k_{t+1}|\tilde{k}_{t},\tilde{u}_{t})\mathbb{P}^{g}(\tilde{u}_{t}|\tilde{k}_{t})\mathbb{P}^{g}(\tilde{k}_{t}) (37)
    =k~t,u~tPtK(kt+1|k~t,u~t)ρt(ut|k~t)ρ(k~t)\displaystyle=\sum_{\tilde{k}_{t},\tilde{u}_{t}}P_{t}^{K}(k_{t+1}|\tilde{k}_{t},\tilde{u}_{t})\rho_{t}(u_{t}|\tilde{k}_{t})\mathbb{P}^{\rho}(\tilde{k}_{t}) (38)
    =ρ(kt+1).\displaystyle=\mathbb{P}^{\rho}(k_{t+1}). (39)

    Therefore we have established the induction step.

  2. (2)

    Using (35)(36) along with the result of part (1), we obtain

    𝔼g[rt(Xt,Ut)]\displaystyle\mathbb{E}^{g}[r_{t}(X_{t},U_{t})] =𝔼g[rtK(Kt,Ut)]\displaystyle=\mathbb{E}^{g}[r_{t}^{K}(K_{t},U_{t})] (40)
    =k~t,u~trtK(k~t,u~t)g(u~t|k~t)g(k~t)\displaystyle=\sum_{\tilde{k}_{t},\tilde{u}_{t}}r_{t}^{K}(\tilde{k}_{t},\tilde{u}_{t})\mathbb{P}^{g}(\tilde{u}_{t}|\tilde{k}_{t})\mathbb{P}^{g}(\tilde{k}_{t}) (41)
    =k~t,u~trtK(k~t,u~t)ρt(u~t|k~t)ρ(k~t)\displaystyle=\sum_{\tilde{k}_{t},\tilde{u}_{t}}r_{t}^{K}(\tilde{k}_{t},\tilde{u}_{t})\rho_{t}(\tilde{u}_{t}|\tilde{k}_{t})\mathbb{P}^{\rho}(\tilde{k}_{t}) (42)
    =𝔼ρ[rt(Xt,Ut)],\displaystyle=\mathbb{E}^{\rho}[r_{t}(X_{t},U_{t})], (43)

    for each t𝒯t\in\mathcal{T}. The result then follows from linearity of expectation.

This concludes the proof.
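The two assertions of the Policy Equivalence Lemma can also be verified numerically: build an MDP in which \Psi is an information state, form the associated K-based strategy via (34), and compare the distributions of K_{t} and the total expected rewards under g and \rho. A minimal sketch with hypothetical data:

# Numerical check of the Policy Equivalence Lemma on hypothetical data:
# Psi is an information state by construction, rho is associated with g via (34),
# and we compare P(K_t = .) and the total expected reward under g and rho.
import numpy as np

rng = np.random.default_rng(3)
T_hor, nk, m, nu_a = 3, 2, 2, 2
nx = nk * m
Psi = lambda x: x // m

PK = [rng.dirichlet(np.ones(nk), size=(nk, nu_a)) for _ in range(T_hor)]
rK = [rng.uniform(-1, 1, size=(nk, nu_a)) for _ in range(T_hor)]
P = [np.array([[[PK[t][Psi(x), u, Psi(y)] / m for y in range(nx)]
                for u in range(nu_a)] for x in range(nx)]) for t in range(T_hor)]
r = [np.array([[rK[t][Psi(x), u] for u in range(nu_a)] for x in range(nx)]) for t in range(T_hor)]
nu1 = rng.dirichlet(np.ones(nx))
g = [rng.dirichlet(np.ones(nu_a), size=nx) for _ in range(T_hor)]       # an arbitrary Markov strategy

def evaluate(policy):
    """Exact P(K_t = .) for all t and total expected reward under an X-based policy."""
    mu, J, k_dists = nu1.copy(), 0.0, []
    for t in range(T_hor):
        k_dists.append(np.array([sum(mu[x] for x in range(nx) if Psi(x) == k) for k in range(nk)]))
        J += np.einsum("x,xu,xu->", mu, policy[t], r[t])
        mu = np.einsum("x,xu,xuy->y", mu, policy[t], P[t])
    return J, k_dists

# the associated K-based strategy, written as an X-based policy constant on Psi-classes
mu, rho = nu1.copy(), []
for t in range(T_hor):
    rho_t = np.zeros((nx, nu_a))
    for k in range(nk):
        xs = [x for x in range(nx) if Psi(x) == k]
        pk = sum(mu[x] for x in xs)
        mix = sum(mu[x] * g[t][x] for x in xs) / pk if pk > 0 else np.ones(nu_a) / nu_a
        for x in xs:
            rho_t[x] = mix
    rho.append(rho_t)
    mu = np.einsum("x,xu,xuy->y", mu, g[t], P[t])

Jg, kg = evaluate(g)
Jr, kr = evaluate(rho)
print("same K_t-distributions:", all(np.allclose(a, b) for a, b in zip(kg, kr)))
print("J(g) =", round(Jg, 6), " J(rho) =", round(Jr, 6))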

Appendix B Alternative Characterizations of Sequential Equilibria

This section deals with the game model introduced in Section 2. We provide three alternative definitions of sequential equilibria that are equivalent to the original one given by [21]. These definitions help simplify some of the proofs in Appendix C.

We would like to note that several alternative definitions of sequential equilibria are also given in [21, 16]. The definition of weak perfect equilibrium in Proposition 6 of [21] is close in spirit to our definitions, in that it uses sequences of payoff functions, rather than beliefs, as the vehicle for defining sequential rationality.

Notice that, fixing the behavioral strategies g^{-i} of players other than player i, player i's best response problem (at every information set) can be considered as a Markov Decision Process with state H_{t}^{i} and action U_{t}^{i}, where the transition kernels and instantaneous reward functions depend on g^{-i}. Inspired by this observation, we introduce an alternative definition of sequential equilibrium for our model, where we form conjectures of transition kernels and reward functions instead of forming beliefs on nodes. This allows for a more compact representation of the appraisals and beliefs of players. We will later show that this alternative definition is equivalent to the classical definition of sequential equilibrium in [21].

For player i\in\mathcal{I}, let P^{i}=(P_{t}^{i})_{t\in\mathcal{T}\backslash\{T\}},P_{t}^{i}\colon\mathcal{H}_{t}^{i}\times\mathcal{U}_{t}^{i}\mapsto\Delta(\mathcal{Z}_{t}^{i}) and r^{i}=(r_{t}^{i})_{t\in\mathcal{T}},r_{t}^{i}\colon\mathcal{H}_{t}^{i}\times\mathcal{U}_{t}^{i}\mapsto[-1,1] be collections of functions that represent conjectures of transition kernels and instantaneous reward functions. For a behavioral strategy g^{i} of player i, define the reward-to-go function J_{t}^{i} recursively through

JTi(gTi;hTi,Pi,ri):=u~TirTi(hTi,u~Ti)gTi(u~Ti|hTi);\displaystyle\quad~{}J_{T}^{i}(g_{T}^{i};h_{T}^{i},P^{i},r^{i}):=\sum_{\tilde{u}_{T}^{i}}r_{T}^{i}(h_{T}^{i},\tilde{u}_{T}^{i})g_{T}^{i}(\tilde{u}_{T}^{i}|h_{T}^{i}); (44a)
Jti(gt:Ti;hti,Pi,ri)\displaystyle\quad~{}J_{t}^{i}(g_{t:T}^{i};h_{t}^{i},P^{i},r^{i}) (44b)
:=\displaystyle:= u~ti[rti(hti,u~ti)+z~tiJt+1i(gt+1:Ti;(hti,z~ti),Pi,ri)Pti(z~ti|hti,u~ti)]gti(u~ti|hti).\displaystyle\sum_{\tilde{u}_{t}^{i}}\left[r_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})+\sum_{\tilde{z}_{t}^{i}}J_{t+1}^{i}(g_{t+1:T}^{i};(h_{t}^{i},\tilde{z}_{t}^{i}),P^{i},r^{i})P_{t}^{i}(\tilde{z}_{t}^{i}|h_{t}^{i},\tilde{u}_{t}^{i})\right]g_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i}). (44c)
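The recursion (44) is a backward recursion over player i's information sets under the conjectured model (P^{i},r^{i}). A minimal sketch of its evaluation, with information sets encoded as tuples of increments and hypothetical conjectures and strategies, is shown below.

# Evaluating the reward-to-go recursion (44) under a conjectured model (P^i, r^i).
# Information sets h_t^i are encoded as tuples of increments z_1^i, ..., z_{t-1}^i;
# the conjectures Pc, rc and the behavioral strategy g below are hypothetical.
import random
random.seed(4)

T_hor, U, Z = 2, [0, 1], [0, 1]                       # horizon, action set, increment set
H = {1: [()]}                                         # h_1^i is empty in this sketch
for t in range(1, T_hor):
    H[t + 1] = [h + (z,) for h in H[t] for z in Z]

Pc = {t: {(h, u): [0.5, 0.5] for h in H[t] for u in U} for t in range(1, T_hor)}        # P_t^i(z|h,u)
rc = {t: {(h, u): random.uniform(-1, 1) for h in H[t] for u in U} for t in range(1, T_hor + 1)}  # r_t^i(h,u)
g = {t: {h: [0.5, 0.5] for h in H[t]} for t in range(1, T_hor + 1)}                     # g_t^i(u|h)

def J(t, h):
    """J_t^i(g_{t:T}^i; h, P^i, r^i) as defined in (44a)-(44c)."""
    total = 0.0
    for u in U:
        val = rc[t][(h, u)]
        if t < T_hor:                                 # continuation term in (44c)
            val += sum(J(t + 1, h + (z,)) * Pc[t][(h, u)][z] for z in Z)
        total += val * g[t][h][u]
    return total

print("J_1^i at the empty information set:", J(1, ()))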
Definition 11 (“Model-based” Sequential Equilibrium).

Let g=(gi)ig=(g^{i})_{i\in\mathcal{I}} be a behavioral strategy profile. Let (P,r)=(Pi,ri)i(P,r)=(P^{i},r^{i})_{i\in\mathcal{I}} be a conjectured profile. Then, gg is said to be sequentially rational under (P,r)(P,r) if for each i,t𝒯i\in\mathcal{I},t\in\mathcal{T} and each htitih_{t}^{i}\in\mathcal{H}_{t}^{i},

Jti(gt:Ti;hti,Pi,ri)Jti(g~t:Ti;hti,Pi,ri),J_{t}^{i}(g_{t:T}^{i};h_{t}^{i},P^{i},r^{i})\geq J_{t}^{i}(\tilde{g}_{t:T}^{i};h_{t}^{i},P^{i},r^{i}), (45)

for all behavioral strategies \tilde{g}_{t:T}^{i}. The conjectured profile (P,r) is said to be fully consistent with g if there exists a sequence of behavioral strategy and conjecture profiles (g^{(n)},P^{(n)},r^{(n)})_{n=1}^{\infty} such that

  1. (1)

    g(n)g^{(n)} is fully mixed, i.e. every action is chosen with positive probability at every information set.

  2. (2)

    For each ii\in\mathcal{I}, (P(n),i,r(n),i)(P^{(n),i},r^{(n),i}) is consistent with g(n),ig^{(n),-i}, i.e. for each i,t𝒯,htiti,uti𝒰tii\in\mathcal{I},t\in\mathcal{T},h_{t}^{i}\in\mathcal{H}_{t}^{i},u_{t}^{i}\in\mathcal{U}_{t}^{i},

    Pt(n),i(zti|hti,uti)\displaystyle P_{t}^{(n),i}(z_{t}^{i}|h_{t}^{i},u_{t}^{i}) =g(n),i(zti|hti,uti),\displaystyle=\mathbb{P}^{g^{(n),-i}}(z_{t}^{i}|h_{t}^{i},u_{t}^{i}), (46)
    rt(n),i(hti,uti)\displaystyle r_{t}^{(n),i}(h_{t}^{i},u_{t}^{i}) =𝔼g(n),i[Rti|hti,uti].\displaystyle=\mathbb{E}^{g^{(n),-i}}[R_{t}^{i}|h_{t}^{i},u_{t}^{i}]. (47)
  3. (3)

    (g(n),P(n),r(n))(g,P,r)(g^{(n)},P^{(n)},r^{(n)})\rightarrow(g,P,r) as nn\rightarrow\infty.

A triple (g,P,r)(g,P,r) is said to be a “model-based” sequential equilibrium666Here we borrow the terms “model-based” (resp. “model-free”) from the reinforcement learning literature: “Model-based” means that an algorithm constructs the underlying model (P,r)(P,r), while “model-free” usually means that the algorithm directly constructs state-action value functions QQ. if gg is sequentially rational under (P,r)(P,r) and (P,r)(P,r) is fully consistent with gg.

One can also form conjectures directly on the optimal reward-to-go given a state-action pair (hti,uti)(h_{t}^{i},u_{t}^{i}).

Definition 12 (“Model-free” Sequential Equilibrium, Definition 2 revisited).

Let g=(gi)ig=(g^{i})_{i\in\mathcal{I}} be a behavioral strategy profile. Let Q=(Qti)i,t𝒯Q=(Q_{t}^{i})_{i\in\mathcal{I},t\in\mathcal{T}} be a collection of functions where Qti:ti×𝒰ti[T,T]Q_{t}^{i}\colon\mathcal{H}_{t}^{i}\times\mathcal{U}_{t}^{i}\mapsto[-T,T]. The strategy profile gg is said to be sequentially rational under QQ if for each i,t𝒯i\in\mathcal{I},t\in\mathcal{T} and each htitih_{t}^{i}\in\mathcal{H}_{t}^{i},

supp(gti(hti))argmaxutiQti(hti,uti).\mathrm{supp}(g_{t}^{i}(h_{t}^{i}))\subseteq\underset{u_{t}^{i}}{\arg\max}~{}Q_{t}^{i}(h_{t}^{i},u_{t}^{i}). (48)

The collection of functions Q is said to be fully consistent with g if there exists a sequence of behavioral strategy and conjectured profiles (g^{(n)},Q^{(n)})_{n=1}^{\infty} such that

  1. (1)

    g(n)g^{(n)} is fully mixed, i.e. every action is chosen with positive probability at every information set.

  2. (2)

    Q(n)Q^{(n)} is consistent with g(n)g^{(n)}, i.e.,

    Qτ(n),i(hτi,uτi)\displaystyle Q_{\tau}^{(n),i}(h_{\tau}^{i},u_{\tau}^{i}) =𝔼g(n)[t=τTRti|hτi,uτi],\displaystyle=\mathbb{E}^{g^{(n)}}\left[\sum_{t=\tau}^{T}R_{t}^{i}\Big{|}h_{\tau}^{i},u_{\tau}^{i}\right], (49)

    for each i,τ𝒯,hτiτi,uτi𝒰τii\in\mathcal{I},\tau\in\mathcal{T},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i},u_{\tau}^{i}\in\mathcal{U}_{\tau}^{i}.

  3. (3)

    (g(n),Q(n))(g,Q)(g^{(n)},Q^{(n)})\rightarrow(g,Q) as nn\rightarrow\infty.

A tuple (g,Q)(g,Q) is said to be a “model-free” sequential equilibrium if gg is sequentially rational under QQ and QQ is fully consistent with gg.

A slightly different definition is also equivalent:

Definition 13 (“Model-free” Sequential Equilibrium, Version 2).

A tuple (g,Q)(g,Q) is said to be a “model-free” sequential equilibrium (version 2) if it satisfies Definition 12 with condition (2) for full consistency replaced by the following condition:

  • (2’)

    For each ii, Q(n),iQ^{(n),i} is consistent with g(n),ig^{(n),-i}, i.e.

    Qτ(n),i(hτi,uτi)\displaystyle Q_{\tau}^{(n),i}(h_{\tau}^{i},u_{\tau}^{i}) =𝔼g(n),i[Rτi|hτi,uτi]+maxg~τ+1:Ti𝔼g~τ+1:Ti,g(n),i[t=τ+1TRti|hτi,uτi],\displaystyle=\mathbb{E}^{g^{(n),-i}}[R_{\tau}^{i}|h_{\tau}^{i},u_{\tau}^{i}]+\underset{\tilde{g}_{\tau+1:T}^{i}}{\max}~{}\mathbb{E}^{\tilde{g}_{\tau+1:T}^{i},g^{(n),-i}}\left[\sum_{t=\tau+1}^{T}R_{t}^{i}\Big{|}h_{\tau}^{i},u_{\tau}^{i}\right],

    for each τ𝒯,hτiτi,uτi𝒰τi\tau\in\mathcal{T},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i},u_{\tau}^{i}\in\mathcal{U}_{\tau}^{i}.

To introduce the last definition of SE, which corresponds to the original definition proposed in [21], we first describe the game in Section 2 as an extensive-form game tree as follows: To convert the game from a simultaneous move game to a sequential game, we set ={1,2,,I}\mathcal{I}=\{1,2,\cdots,I\}, where the index indicates the order of movement. For convenience, for ii\in\mathcal{I}, we use the superscript <i<i (resp. >i>i) to represent the set of players {1,,i1}\{1,\cdots,i-1\} (resp. {i+1,,I}\{i+1,\cdots,I\}) that moves before (resp. after) player ii in any given round. At time t=0t=0, nature takes action w0=(x1,h1)w_{0}=(x_{1},h_{1}) and the game enters t=1t=1. For each time t𝒯t\in\mathcal{T}, player 11 takes action ut1u_{t}^{1} first, then followed by player 22 taking action ut2u_{t}^{2}, and so on, while nature takes action wtw_{t} after player II takes action utIu_{t}^{I}. In this extensive form game, there are three types of nodes: (1) a node where some player ii\in\mathcal{I} takes action (at some time t𝒯t\in\mathcal{T}), (2) a node where nature takes action (at some time t{0}𝒯t\in\{0\}\cup\mathcal{T}), and (3) a terminal node, where the game has terminated. We denote the set of the first type of nodes corresponding to player ii and time tt as 𝒪ti\mathcal{O}_{t}^{i}. A node oti𝒪tio_{t}^{i}\in\mathcal{O}_{t}^{i} can also be represented as a vector oti=(x1,h1,w1:t1,u1:t1,ut<i)o_{t}^{i}=(x_{1},h_{1},w_{1:t-1},u_{1:t-1},u_{t}^{<i}) which contains all the moves (by all players and nature) before it. As a result, otio_{t}^{i} also uniquely determines the states x1:tx_{1:t} and information increment vectors z1:t1z_{1:t-1}. We denote the set of the terminal nodes as 𝒪T+1\mathcal{O}_{T+1}. A terminal node oT+1𝒪T+1o_{T+1}\in\mathcal{O}_{T+1} also has a vector representation oT+1=(x1,h1,w1:T,u1:T)o_{T+1}=(x_{1},h_{1},w_{1:T},u_{1:T}).

Given a terminal node oT+1o_{T+1}, all the actions of players and nature throughout the game are uniquely determined, hence the realizations of (Rt)t𝒯(R_{t})_{t\in\mathcal{T}} defined in Section 2 are also uniquely determined. Let Λ=(Λi)i,Λi:𝒪T+1\Lambda=(\Lambda^{i})_{i\in\mathcal{I}},\Lambda^{i}\colon\mathcal{O}_{T+1}\mapsto\mathbb{R} be the mappings from terminal nodes to total payoffs, i.e. Λi(oT+1)=t=1Trti\Lambda^{i}(o_{T+1})=\sum_{t=1}^{T}r_{t}^{i}, where rtir_{t}^{i} is the realization of RtiR_{t}^{i} corresponding to oT+1o_{T+1}. Also define Λτi(oT+1)=t=τTrti\Lambda_{\tau}^{i}(o_{T+1})=\sum_{t=\tau}^{T}r_{t}^{i} for each τ𝒯\tau\in\mathcal{T}.

Now, as we have constructed the extensive-form game, it is helpful to view the nodes in the game tree as a stochastic process. Define OtiO_{t}^{i} to be a random variable with support on 𝒪ti\mathcal{O}_{t}^{i} that represents the node player ii is at before taking action at time tt. Let OT+1O_{T+1} be a random variable with support on 𝒪T+1\mathcal{O}_{T+1} that represents the terminal node the game ends at. If we view (𝒯×){T+1}(\mathcal{T}\times\mathcal{I})\cup\{T+1\} as a set of time indices with lexicographic ordering, the random process (Oti)(t,i)𝒯×(OT+1)(O_{t}^{i})_{(t,i)\in\mathcal{T}\times\mathcal{I}}\cup(O_{T+1}) is a controlled Markov Chain controlled by action UtiU_{t}^{i} at time (t,i)(t,i).

Definition 14 (Classical Sequential Equilibrium [21]).

An assessment is a pair (g,\mu), where g is a behavioral strategy profile of players (excluding nature) as described in Section 2, and \mu=(\mu_{t}^{i})_{t\in\mathcal{T},i\in\mathcal{I}},\mu_{t}^{i}\colon\mathcal{H}_{t}^{i}\mapsto\Delta(\mathcal{O}_{t}^{i}) is a belief system. Then, g is said to be sequentially rational given \mu if

oti𝔼gt:Ti,gt>i,gt:Ti[Λi(OT+1)|oti]μti(oti|hti)oti𝔼g~t:Ti,gt>i,gt:Ti[Λi(OT+1)|oti]μti(oti|hti),\sum_{o_{t}^{i}}\mathbb{E}^{g_{t:T}^{i},g_{t}^{>i},g_{t:T}^{-i}}[\Lambda^{i}(O_{T+1})|o_{t}^{i}]\mu_{t}^{i}(o_{t}^{i}|h_{t}^{i})\geq\sum_{o_{t}^{i}}\mathbb{E}^{\tilde{g}_{t:T}^{i},g_{t}^{>i},g_{t:T}^{-i}}[\Lambda^{i}(O_{T+1})|o_{t}^{i}]\mu_{t}^{i}(o_{t}^{i}|h_{t}^{i}), (50)

for all i\in\mathcal{I},t\in\mathcal{T},h_{t}^{i}\in\mathcal{H}_{t}^{i}, and all behavioral strategies \tilde{g}_{t:T}^{i}. The belief system \mu is said to be fully consistent with g if there exists a sequence of assessments (g^{(n)},\mu^{(n)})_{n=1}^{\infty} such that

  1. (1)

    g(n)g^{(n)} is fully mixed.

  2. (2)

    μ(n)\mu^{(n)} is consistent with g(n)g^{(n)}, i.e. μt(n),i(oti|hti)=g(n)(oti|hti)\mu_{t}^{(n),i}(o_{t}^{i}|h_{t}^{i})=\mathbb{P}^{g^{(n)}}(o_{t}^{i}|h_{t}^{i}) for all t𝒯,i,htitit\in\mathcal{T},i\in\mathcal{I},h_{t}^{i}\in\mathcal{H}_{t}^{i}, and oti𝒪tio_{t}^{i}\in\mathcal{O}_{t}^{i}.

  3. (3)

    (g(n),μ(n))(g,μ)(g^{(n)},\mu^{(n)})\rightarrow(g,\mu) as nn\rightarrow\infty.

An assessment (g,μ)(g,\mu) is said to be a (classical) sequential equilibrium if gg is sequentially rational given μ\mu and μ\mu is fully consistent with gg.

Remark B.1.

Since the instantaneous rewards R1:t1iR_{1:t-1}^{i} have already been realized at time tt, replacing the total reward Λ\Lambda with reward-to-go Λt\Lambda_{t} in (50) would result in an equivalent definition.

Theorem 15.

Definitions 11, 12, 13, and 14 are equivalent in the following sense: a behavioral strategy profile g is part of a sequential equilibrium under one of these definitions if and only if it is part of a sequential equilibrium under each of the others.

Proof B.2.

We complete the proof in four steps: in each step, we show that if g is a strategy profile satisfying one definition of SE, then it satisfies one of the other definitions of SE as well. We proceed according to the following chain of implications: Definition 14 \Rightarrow Definition 11 \Rightarrow Definition 12 \Rightarrow Definition 13 \Rightarrow Definition 14.

Step 1: Classical SE (Definition 14) \Rightarrow “Model-based” SE (Definition 11)

Let (g,\mu) satisfy Definition 14. Let (g^{(n)},\mu^{(n)}) be a sequence of assessments that satisfies conditions (1)-(3) of full consistency in Definition 14.

Set Pt(n),i(zti|hti,uti)=g(n)(zti|hti,uti)P_{t}^{(n),i}(z_{t}^{i}|h_{t}^{i},u_{t}^{i})=\mathbb{P}^{g^{(n)}}(z_{t}^{i}|h_{t}^{i},u_{t}^{i}) and rt(n),i(hti,uti)=𝔼g(n)[Rti|hti,uti]r_{t}^{(n),i}(h_{t}^{i},u_{t}^{i})=\mathbb{E}^{g^{(n)}}[R_{t}^{i}|h_{t}^{i},u_{t}^{i}] for all htiti,uti𝒰tih_{t}^{i}\in\mathcal{H}_{t}^{i},u_{t}^{i}\in\mathcal{U}_{t}^{i}.

Recall that we can write O_{t}^{i}=(X_{1},H_{1},W_{1:t-1},U_{1:t-1},U_{t}^{<i}), and that (X_{1:t},Z_{1:t-1}) can be expressed as a function of O_{t}^{i}. Therefore there exist fixed functions f_{t}^{i,Z},f_{t}^{i,R} such that Z_{t}^{i}=f_{t}^{i,Z}(O_{t}^{i},U_{t}^{i},U_{t}^{>i},W_{t}) and R_{t}^{i}=f_{t}^{i,R}(O_{t}^{i},U_{t}^{i},U_{t}^{>i},W_{t}). Furthermore, for all j>i, there also exist functions f_{t}^{j,i,H} such that H_{t}^{j}=f_{t}^{j,i,H}(O_{t}^{i}) (since H_{t}^{j}=(H_{1}^{j},Z_{1:t-1}^{j})). Since \mu_{t}^{(n),i}(o_{t}^{i}|h_{t}^{i})=\mathbb{P}^{g^{(n)}}(o_{t}^{i}|h_{t}^{i}), we have

Pt(n),i(zti|hti,uti)=oti,u~t>i,w~t𝟏{zti=fti,Z(oti,uti,u~t>i,w~t)}(w~t)(j=i+1Igt(n),j(u~tj|ftj,i,H(oti)))μt(n)(oti|hti),\displaystyle\begin{split}&\enspace\enspace\>P_{t}^{(n),i}(z_{t}^{i}|h_{t}^{i},u_{t}^{i})\\ &=\sum_{o_{t}^{i},\tilde{u}_{t}^{>i},\tilde{w}_{t}}\bm{1}_{\{z_{t}^{i}=f_{t}^{i,Z}(o_{t}^{i},u_{t}^{i},\tilde{u}_{t}^{>i},\tilde{w}_{t})\}}\mathbb{P}(\tilde{w}_{t})\left(\prod_{j=i+1}^{I}g_{t}^{(n),j}(\tilde{u}_{t}^{j}|f_{t}^{j,i,H}(o_{t}^{i}))\right)\mu_{t}^{(n)}(o_{t}^{i}|h_{t}^{i}),\end{split} (51)
rt(n),i(hti,uti)=oti,u~t>i,w~tfti,R(oti,uti,u~t>i,w~t)(w~t)(j=i+1Igt(n),j(u~tj|ftj,i,H(oti)))μt(n)(oti|hti).\displaystyle\begin{split}&\enspace\enspace\>r_{t}^{(n),i}(h_{t}^{i},u_{t}^{i})\\ &=\sum_{o_{t}^{i},\tilde{u}_{t}^{>i},\tilde{w}_{t}}f_{t}^{i,R}(o_{t}^{i},u_{t}^{i},\tilde{u}_{t}^{>i},\tilde{w}_{t})\mathbb{P}(\tilde{w}_{t})\left(\prod_{j=i+1}^{I}g_{t}^{(n),j}(\tilde{u}_{t}^{j}|f_{t}^{j,i,H}(o_{t}^{i}))\right)\mu_{t}^{(n)}(o_{t}^{i}|h_{t}^{i}).\end{split} (52)

Therefore, as μ(n)μ\mu^{(n)}\rightarrow\mu, g(n)gg^{(n)}\rightarrow g, we have (P(n),r(n))(P,r)(P^{(n)},r^{(n)})\rightarrow(P,r) for some (P,r)(P,r).

Let τ𝒯\tau\in\mathcal{T} and g~τ:Ti\tilde{g}_{\tau:T}^{i} be an arbitrary strategy. First, observe that one can represent the conditional reward-to-go 𝔼g(n)[t=τTRti|hτi]\mathbb{E}^{g^{(n)}}[\sum_{t=\tau}^{T}R_{t}^{i}|h_{\tau}^{i}] using μ(n)\mu^{(n)} or (P(n),r(n))(P^{(n)},r^{(n)}). Hence we have

\sum_{o_{\tau}^{i}}\mathbb{E}^{\tilde{g}_{\tau:T}^{i},g_{\tau}^{(n),>i},g_{\tau+1:T}^{(n),-i}}[\Lambda_{\tau}^{i}(O_{T+1})|o_{\tau}^{i}]\mu_{\tau}^{(n),i}(o_{\tau}^{i}|h_{\tau}^{i})=J_{\tau}^{i}(\tilde{g}_{\tau:T}^{i};h_{\tau}^{i},P^{(n),i},r^{(n),i}), (53)

where J_{\tau}^{i} is as defined in (44).

Observe that the left-hand side of (53) is continuous in (gτ(n),>i,gτ+1:T(n),i,μτ(n),i)(g_{\tau}^{(n),>i},g_{\tau+1:T}^{(n),-i},\mu_{\tau}^{(n),i}) since it is a sum of products of components of (gτ(n),>i,gτ+1:T(n),i,μτ(n),i)(g_{\tau}^{(n),>i},g_{\tau+1:T}^{(n),-i},\mu_{\tau}^{(n),i}). Also observe that the right-hand side of (53) is continuous in (P(n),i,r(n),i)(P^{(n),i},r^{(n),i}) since it is a sum of products of components of (P(n),i,r(n),i)(P^{(n),i},r^{(n),i}) by the definition in (44). Therefore by taking limit as nn\rightarrow\infty, we conclude that

oτi𝔼g~τ:Ti,gi[Λτi(OT+1)|oτi]μτi(oτi|hτi)=Jτi(g~τ:Ti;hτi,Pi,ri),\sum_{o_{\tau}^{i}}\mathbb{E}^{\tilde{g}_{\tau:T}^{i},g^{-i}}[\Lambda_{\tau}^{i}(O_{T+1})|o_{\tau}^{i}]\mu_{\tau}^{i}(o_{\tau}^{i}|h_{\tau}^{i})=J_{\tau}^{i}(\tilde{g}_{\tau:T}^{i};h_{\tau}^{i},P^{i},r^{i}), (54)

for all strategies g~τ:Ti\tilde{g}_{\tau:T}^{i}. Using sequential rationality of gg with respect to μ\mu and (54) we conclude that

J_{\tau}^{i}(g_{\tau:T}^{i};h_{\tau}^{i},P^{i},r^{i})\geq J_{\tau}^{i}(\tilde{g}_{\tau:T}^{i};h_{\tau}^{i},P^{i},r^{i}), (55)

for all τ𝒯,i,hτiτi\tau\in\mathcal{T},i\in\mathcal{I},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i}, i.e. gg is also sequentially rational given (P,r)(P,r).

Step 2: “Model-based” SE (Definition 11) \Rightarrow “Model-free” SE version 1 (Definition 12)

Let (g,P,r)(g,P,r) be a sequential equilibrium under Definition 11, and let (g(n),P(n),r(n))(g^{(n)},P^{(n)},r^{(n)}) satisfy conditions (1)-(3) of full consistency in Definition 11. Set

Qτ(n),i(hτi,uτi)=𝔼g(n)[t=τTRti|hτi,uτi],\displaystyle Q_{\tau}^{(n),i}(h_{\tau}^{i},u_{\tau}^{i})=\mathbb{E}^{g^{(n)}}\left[\sum_{t=\tau}^{T}R_{t}^{i}\Big{|}h_{\tau}^{i},u_{\tau}^{i}\right], (56)

for all τ𝒯,i,hτiτi,uτi𝒰τi\tau\in\mathcal{T},i\in\mathcal{I},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i},u_{\tau}^{i}\in\mathcal{U}_{\tau}^{i}. Then Q(n),iQ^{(n),i} satisfies the recurrence relation

QT(n),i(hTi,uTi)\displaystyle Q_{T}^{(n),i}(h_{T}^{i},u_{T}^{i}) =rT(n),i(hTi,uTi),\displaystyle=r_{T}^{(n),i}(h_{T}^{i},u_{T}^{i}), (57a)
Vt(n),i(hti)\displaystyle V_{t}^{(n),i}(h_{t}^{i}) :=u~tiQt(n),i(hti,u~ti)gt(n),i(u~ti|hti),t𝒯,\displaystyle:=\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{(n),i}(\tilde{u}_{t}^{i}|h_{t}^{i}),\quad\forall t\in\mathcal{T}, (57b)
Qt(n),i(hti,uti)\displaystyle Q_{t}^{(n),i}(h_{t}^{i},u_{t}^{i}) =rt(n),i(hti,uti)\displaystyle=r_{t}^{(n),i}(h_{t}^{i},u_{t}^{i}) (57c)
+z~tiVt+1(n),i((hti,z~ti))Pt(n),i(z~ti|hti,uti),t𝒯\{T}.\displaystyle+\sum_{\tilde{z}_{t}^{i}}V_{t+1}^{(n),i}((h_{t}^{i},\tilde{z}_{t}^{i}))P_{t}^{(n),i}(\tilde{z}_{t}^{i}|h_{t}^{i},u_{t}^{i}),\quad\forall t\in\mathcal{T}\backslash\{T\}. (57d)

Since (g(n),P(n),r(n))(g,P,r)(g^{(n)},P^{(n)},r^{(n)})\rightarrow(g,P,r) as nn\rightarrow\infty, we have Q(n)QQ^{(n)}\rightarrow Q where Q=(Qti)t𝒯,iQ=(Q_{t}^{i})_{t\in\mathcal{T},i\in\mathcal{I}} satisfies

QTi(hTi,uTi)\displaystyle Q_{T}^{i}(h_{T}^{i},u_{T}^{i}) =rTi(hTi,uTi),\displaystyle=r_{T}^{i}(h_{T}^{i},u_{T}^{i}), (58a)
Vti(hti)\displaystyle V_{t}^{i}(h_{t}^{i}) :=u~tiQti(hti,u~ti)gti(u~ti|hti),t𝒯,\displaystyle:=\sum_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i}),\quad\forall t\in\mathcal{T}, (58b)
Qti(hti,uti)\displaystyle Q_{t}^{i}(h_{t}^{i},u_{t}^{i}) =rti(hti,uti)\displaystyle=r_{t}^{i}(h_{t}^{i},u_{t}^{i})
+z~tiVt+1i((hti,z~ti))Pti(z~ti|hti,uti),t𝒯\{T}.\displaystyle+\sum_{\tilde{z}_{t}^{i}}V_{t+1}^{i}((h_{t}^{i},\tilde{z}_{t}^{i}))P_{t}^{i}(\tilde{z}_{t}^{i}|h_{t}^{i},u_{t}^{i}),\quad\forall t\in\mathcal{T}\backslash\{T\}. (58c)

Comparing (58) with the reward-to-go function JtiJ_{t}^{i} defined in (44), we observe that

Vti(hti)=Jti(gt:Ti;hti,Pi,ri),\displaystyle V_{t}^{i}(h_{t}^{i})=J_{t}^{i}(g_{t:T}^{i};h_{t}^{i},P^{i},r^{i}), (59)

for all t𝒯,i,hτiτit\in\mathcal{T},i\in\mathcal{I},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i}.

Let \tilde{g}_{t}^{i} be a strategy such that \tilde{g}_{t}^{i}(h_{t}^{i})=\eta\in\Delta(\mathcal{U}_{t}^{i}). Then

Jti((g~ti,gt+1:Ti);hti,Pi,ri)\displaystyle\enspace\enspace\>J_{t}^{i}((\tilde{g}_{t}^{i},g_{t+1:T}^{i});h_{t}^{i},P^{i},r^{i}) (60)
=u~t(rti(hti,u~ti)+z~tiJt+1i(gt+1:Ti;(hti,z~ti),Pi,ri)Pti(z~ti|hti,u~ti))η(u~ti)\displaystyle=\sum_{\tilde{u}_{t}}\left(r_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})+\sum_{\tilde{z}_{t}^{i}}J_{t+1}^{i}(g_{t+1:T}^{i};(h_{t}^{i},\tilde{z}_{t}^{i}),P^{i},r^{i})P_{t}^{i}(\tilde{z}_{t}^{i}|h_{t}^{i},\tilde{u}_{t}^{i})\right)\eta(\tilde{u}_{t}^{i}) (61)
=u~t(rti(hti,u~ti)+z~tiVt+1i((hti,z~ti))Pti(z~ti|hti,u~ti))η(u~ti)\displaystyle=\sum_{\tilde{u}_{t}}\left(r_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})+\sum_{\tilde{z}_{t}^{i}}V_{t+1}^{i}((h_{t}^{i},\tilde{z}_{t}^{i}))P_{t}^{i}(\tilde{z}_{t}^{i}|h_{t}^{i},\tilde{u}_{t}^{i})\right)\eta(\tilde{u}_{t}^{i}) (62)
=\sum_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})\eta(\tilde{u}_{t}^{i}), (63)

where we substitute (44) in (61), (59) in (62), and (58c) in (63).

By sequential rationality of gg with respect to (P,r)(P,r), we have

Jti(gt:Ti;hti,Pi,ri)Jti((g~ti,gt+1:Ti);hti,Pi,ri),J_{t}^{i}(g_{t:T}^{i};h_{t}^{i},P^{i},r^{i})\geq J_{t}^{i}((\tilde{g}_{t}^{i},g_{t+1:T}^{i});h_{t}^{i},P^{i},r^{i}),

which means that

u~tQti(hti,u~ti)gti(u~ti|hti)u~tQti(hti,u~ti)η(u~ti),\displaystyle\sum_{\tilde{u}_{t}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i})\geq\sum_{\tilde{u}_{t}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})\eta(\tilde{u}_{t}^{i}), (64)

for all ηΔ(𝒰ti)\eta\in\Delta(\mathcal{U}_{t}^{i}) for all t𝒯,i,hτiτit\in\mathcal{T},i\in\mathcal{I},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i}. Hence gg is sequentially rational given QQ. Therefore (g,Q)(g,Q) is a sequential equilibrium under Definition 12.

Step 3: “Model-free” SE version 1 (Definition 12) \Rightarrow “Model-free” SE version 2 (Definition 13)

Let (g,Q) be a sequential equilibrium under Definition 12 and let (g^{(n)},Q^{(n)}) satisfy conditions (1)-(3) of full consistency in Definition 12. Then Q^{(n),i} satisfies

QT(n),i(hTi,uTi)\displaystyle Q_{T}^{(n),i}(h_{T}^{i},u_{T}^{i}) =𝔼g(n),i[RTi|hTi,uTi],\displaystyle=\mathbb{E}^{g^{(n),-i}}[R_{T}^{i}|h_{T}^{i},u_{T}^{i}], (65a)
Vt(n),i(hti)\displaystyle V_{t}^{(n),i}(h_{t}^{i}) :=u~tiQt(n),i(hti,u~ti)gt(n),i(u~ti|hti),t𝒯,\displaystyle:=\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{(n),i}(\tilde{u}_{t}^{i}|h_{t}^{i}),\quad\forall t\in\mathcal{T}, (65b)
Qt(n),i(hti,uti)\displaystyle Q_{t}^{(n),i}(h_{t}^{i},u_{t}^{i}) =𝔼g(n),i[Rti|hti,uti]\displaystyle=\mathbb{E}^{g^{(n),-i}}[R_{t}^{i}|h_{t}^{i},u_{t}^{i}]
+z~tiVt+1(n),i((hti,z~ti))g(n),i(z~ti|hti,uti),t𝒯\{T},\displaystyle+\sum_{\tilde{z}_{t}^{i}}V_{t+1}^{(n),i}((h_{t}^{i},\tilde{z}_{t}^{i}))\mathbb{P}^{g^{(n),-i}}(\tilde{z}_{t}^{i}|h_{t}^{i},u_{t}^{i}),\quad\forall t\in\mathcal{T}\backslash\{T\}, (65c)

and Q(n)QQ^{(n)}\rightarrow Q as nn\rightarrow\infty. Set

Q^τ(n),i(hτi,uτi)=𝔼g(n),i[Rτi|hτi,uτi]+maxg~τ+1:Ti𝔼g~τ+1:Ti,g(n),i[t=τ+1TRti|hτi,uτi],\displaystyle\hat{Q}_{\tau}^{(n),i}(h_{\tau}^{i},u_{\tau}^{i})=\mathbb{E}^{g^{(n),-i}}[R_{\tau}^{i}|h_{\tau}^{i},u_{\tau}^{i}]+\underset{\tilde{g}_{\tau+1:T}^{i}}{\max}~{}\mathbb{E}^{\tilde{g}_{\tau+1:T}^{i},g^{(n),-i}}\left[\sum_{t=\tau+1}^{T}R_{t}^{i}\Big{|}h_{\tau}^{i},u_{\tau}^{i}\right], (66)

for each τ𝒯,hτiτi,uτi𝒰τi\tau\in\mathcal{T},h_{\tau}^{i}\in\mathcal{H}_{\tau}^{i},u_{\tau}^{i}\in\mathcal{U}_{\tau}^{i}. Then Q^(n),i\hat{Q}^{(n),i} satisfies the recurrence relation

Q^T(n),i(hTi,uTi)\displaystyle\hat{Q}_{T}^{(n),i}(h_{T}^{i},u_{T}^{i}) =𝔼g(n),i[RTi|hTi,uTi],\displaystyle=\mathbb{E}^{g^{(n),-i}}[R_{T}^{i}|h_{T}^{i},u_{T}^{i}], (67a)
V^t(n),i(hti)\displaystyle\hat{V}_{t}^{(n),i}(h_{t}^{i}) :=maxu~tiQ^t(n),i(hti,u~ti),t𝒯,\displaystyle:=\max_{\tilde{u}_{t}^{i}}\hat{Q}_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i}),\quad\forall t\in\mathcal{T}, (67b)
Q^t(n),i(hti,uti)\displaystyle\hat{Q}_{t}^{(n),i}(h_{t}^{i},u_{t}^{i}) =𝔼g(n),i[Rti|hti,uti]\displaystyle=\mathbb{E}^{g^{(n),-i}}[R_{t}^{i}|h_{t}^{i},u_{t}^{i}]
+z~tiV^t+1(n),i((hti,z~ti))g(n),i(z~ti|hti,uti),t𝒯\{T}.\displaystyle+\sum_{\tilde{z}_{t}^{i}}\hat{V}_{t+1}^{(n),i}((h_{t}^{i},\tilde{z}_{t}^{i}))\mathbb{P}^{g^{(n),-i}}(\tilde{z}_{t}^{i}|h_{t}^{i},u_{t}^{i}),\quad\forall t\in\mathcal{T}\backslash\{T\}. (67c)

Claim: \hat{Q}_{t}^{(n),i}\rightarrow Q_{t}^{i} as n\rightarrow\infty, for all t\in\mathcal{T} and i\in\mathcal{I}.

Given the claim, we have (g(n),Q^(n))(g^{(n)},\hat{Q}^{(n)}) satisfying conditions (1)(2’)(3) of full consistency in Definition 13. Therefore (g,Q)(g,Q) is also a sequential equilibrium under Definition 13, and we complete this part of the proof.

Proof of Claim: By induction on time t𝒯t\in\mathcal{T}.

Induction Base: Observe that Q^T(n)=QT(n)\hat{Q}_{T}^{(n)}=Q_{T}^{(n)} by construction. Since QT(n)QTQ_{T}^{(n)}\rightarrow Q_{T} we also have Q^T(n)QT\hat{Q}_{T}^{(n)}\rightarrow Q_{T}.

Induction Step: Suppose that the result is true for time tt. We prove it for time t1t-1.

By induction hypothesis and g(n)gg^{(n)}\rightarrow g, we have

V^t(n),i(hti)=\displaystyle\hat{V}_{t}^{(n),i}(h_{t}^{i})= maxu~tiQ^t(n),i(hti,u~ti)nmaxu~tiQti(hti,u~ti).\displaystyle\max_{\tilde{u}_{t}^{i}}\hat{Q}_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})\xrightarrow{n\rightarrow\infty}\max_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i}). (68)

Since Q(n)QQ^{(n)}\rightarrow Q and g(n)gg^{(n)}\rightarrow g, we have

Vt(n),i(hti)=\displaystyle V_{t}^{(n),i}(h_{t}^{i})= u~tiQt(n),i(hti,u~ti)gt(n),i(u~ti|hti)\displaystyle\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{(n),i}(\tilde{u}_{t}^{i}|h_{t}^{i}) (69)
n\displaystyle\xrightarrow{n\rightarrow\infty} u~tiQti(hti,u~ti)gti(u~ti|hti)=:Vti(hti).\displaystyle\sum_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i})=:V_{t}^{i}(h_{t}^{i}). (70)

Since gg is sequentially rational given QQ, we have

u~tiQti(hti,u~ti)gti(u~ti|hti)\displaystyle\sum_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})g_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i}) =maxu~tiQti(hti,u~ti).\displaystyle=\max_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i}). (71)

Combining (68)(70)(71) we have V^t(n),i(hti)Vti(hti)\hat{V}_{t}^{(n),i}(h_{t}^{i})\rightarrow V_{t}^{i}(h_{t}^{i}) for all htitih_{t}^{i}\in\mathcal{H}_{t}^{i}. Since ti\mathcal{H}_{t}^{i} is a finite set, we have

maxh~ti|V^t(n),i(h~ti)Vt(n),i(h~ti)|n0.\displaystyle\max_{\tilde{h}_{t}^{i}}|\hat{V}_{t}^{(n),i}(\tilde{h}_{t}^{i})-V_{t}^{(n),i}(\tilde{h}_{t}^{i})|\xrightarrow{n\rightarrow\infty}0. (72)

We then have

|\hat{Q}_{t-1}^{(n),i}(h_{t-1}^{i},u_{t-1}^{i})-Q_{t-1}^{(n),i}(h_{t-1}^{i},u_{t-1}^{i})| (73)
=|z~t1i[V^t(n),i((ht1i,z~t1i))Vt(n),i((ht1i,z~t1i))]gt1(n),i(z~t1i|ht1i,ut1i)|\displaystyle=\left|\sum_{\tilde{z}_{t-1}^{i}}\left[\hat{V}_{t}^{(n),i}((h_{t-1}^{i},\tilde{z}_{t-1}^{i}))-V_{t}^{(n),i}((h_{t-1}^{i},\tilde{z}_{t-1}^{i}))\right]\mathbb{P}^{g_{t-1}^{(n),-i}}(\tilde{z}_{t-1}^{i}|h_{t-1}^{i},u_{t-1}^{i})\right| (74)
maxz~t1i|V^t(n),i((ht1i,z~t1i))Vt(n),i((ht1i,z~t1i))|n0,\displaystyle\leq\max_{\tilde{z}_{t-1}^{i}}|\hat{V}_{t}^{(n),i}((h_{t-1}^{i},\tilde{z}_{t-1}^{i}))-V_{t}^{(n),i}((h_{t-1}^{i},\tilde{z}_{t-1}^{i}))|\xrightarrow{n\rightarrow\infty}0, (75)

where we substitute (65c)(67c) in (74). Since Q_{t-1}^{(n),i}(h_{t-1}^{i},u_{t-1}^{i})\rightarrow Q_{t-1}^{i}(h_{t-1}^{i},u_{t-1}^{i}), we conclude that \hat{Q}_{t-1}^{(n),i}(h_{t-1}^{i},u_{t-1}^{i})\rightarrow Q_{t-1}^{i}(h_{t-1}^{i},u_{t-1}^{i}), establishing the induction step.

Step 4: “Model-free” SE version 2 (Definition 13) \Rightarrow Classical SE (Definition 14)

Let (g,Q) be a sequential equilibrium under Definition 13 and let (g^{(n)},\hat{Q}^{(n)}) satisfy conditions (1)(2')(3) of full consistency in Definition 13.

Define the beliefs \mu^{(n)} on the nodes of the extensive-form game through \mu^{(n)}(o_{t}^{i}|h_{t}^{i})=\mathbb{P}^{g^{(n)}}(o_{t}^{i}|h_{t}^{i}). By taking subsequences, without loss of generality, assume that \mu^{(n)}\rightarrow\mu.

Let \hat{g}_{t}^{i} be an arbitrary strategy. Then, by condition (2') of Definition 13, we can write

u~tiQ^t(n),i(hti,u~ti)g^ti(u~ti|hti)=maxg~t+1:Tioti𝔼g^ti,g~t+1:Ti,gt(n),>i,gt+1:T(n),i[Λti(OT+1)|oti]μt(n),i(oti|hti).\begin{split}&\enspace\enspace\>\sum_{\tilde{u}_{t}^{i}}\hat{Q}_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})\hat{g}_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i})\\ &=\max_{\tilde{g}_{t+1:T}^{i}}\sum_{o_{t}^{i}}\mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{(n),>i},g_{t+1:T}^{(n),-i}}[\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}]\ \mu_{t}^{(n),i}(o_{t}^{i}|h_{t}^{i}).\end{split} (76)

For each o_{t}^{i}, \mathbb{E}^{\tilde{g}_{t}^{\geq i},\tilde{g}_{t+1:T}}[\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}] is continuous in (\tilde{g}_{t}^{\geq i},\tilde{g}_{t+1:T}) since it is a sum of products of components of (\tilde{g}_{t}^{\geq i},\tilde{g}_{t+1:T}). Therefore,

\sum_{o_{t}^{i}}\mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{(n),>i},g_{t+1:T}^{(n),-i}}[\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}]\ \mu_{t}^{(n),i}(o_{t}^{i}|h_{t}^{i})
\xrightarrow{n\rightarrow\infty} \sum_{o_{t}^{i}}\mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{>i},g_{t+1:T}^{-i}}[\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}]\ \mu_{t}^{i}(o_{t}^{i}|h_{t}^{i}), (77)

for each behavioral strategy \tilde{g}_{t+1:T}^{i}. Applying Berge's Maximum Theorem [49], and taking the limit on both sides of (76), we obtain

u~tiQti(hti,u~ti)g^ti(u~ti|hti)=maxg~t+1:Tioti𝔼g^ti,g~t+1:Ti,gt>i,gt+1:Ti[Λti(OT+1)|oti]μti(oti|hti),\displaystyle\sum_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})~{}\hat{g}_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i})=\max_{\tilde{g}_{t+1:T}^{i}}\sum_{o_{t}^{i}}\mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{>i},g_{t+1:T}^{-i}}[\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}]~{}\mu_{t}^{i}(o_{t}^{i}|h_{t}^{i}), (78)

for all t𝒯,i,htitit\in\mathcal{T},i\in\mathcal{I},h_{t}^{i}\in\mathcal{H}_{t}^{i}, and all behavioral strategy g^ti\hat{g}_{t}^{i}.

Sequential rationality of g with respect to Q means that

gtiargmaxg^tiu~tiQti(hti,u~ti)g^ti(u~ti|hti)=argmaxg^timaxg~t+1:Tioti𝔼g^ti,g~t+1:Ti,gt>i,gt+1:Ti[Λti(OT+1)|oti]μti(oti|hti),\begin{split}g_{t}^{i}&\in\underset{\hat{g}_{t}^{i}}{\arg\max}~{}\sum_{\tilde{u}_{t}^{i}}Q_{t}^{i}(h_{t}^{i},\tilde{u}_{t}^{i})~{}\hat{g}_{t}^{i}(\tilde{u}_{t}^{i}|h_{t}^{i})\\ &=\underset{\hat{g}_{t}^{i}}{\arg\max}~{}\max_{\tilde{g}_{t+1:T}^{i}}\sum_{o_{t}^{i}}\mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{>i},g_{t+1:T}^{-i}}[\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}]~{}\mu_{t}^{i}(o_{t}^{i}|h_{t}^{i}),\end{split} (79)

for all t𝒯,it\in\mathcal{T},i\in\mathcal{I}, and all htitih_{t}^{i}\in\mathcal{H}_{t}^{i}.

Recall that the node O_{t}^{i} uniquely determines (X_{1},W_{1:t-1},U_{1:t-1}). Therefore, the instantaneous rewards R_{\tau}^{i} for \tau\leq t-1 are uniquely determined by O_{t}^{i} as well. For \tau\leq t-1, let r_{\tau}^{i} be the realization of R_{\tau}^{i} under O_{t}^{i}=o_{t}^{i}. Recall that \Lambda^{i} is the total reward function and \Lambda_{t}^{i} is the reward-to-go function starting with (and including) time t. Then \mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{>i},g_{t+1:T}^{-i}}[\Lambda^{i}(O_{T+1})-\Lambda_{t}^{i}(O_{T+1})|o_{t}^{i}]=\sum_{\tau=1}^{t-1}r_{\tau}^{i}, which is independent of the strategy profile. Therefore we have

gtiargmaxg^timaxg~t+1:Tioti𝔼g^ti,g~t+1:Ti,gt>i,gt+1:Ti[Λi(OT+1)|oti]μti(oti|hti).g_{t}^{i}\in\underset{\hat{g}_{t}^{i}}{\arg\max}~{}\max_{\tilde{g}_{t+1:T}^{i}}\sum_{o_{t}^{i}}\mathbb{E}^{\hat{g}_{t}^{i},\tilde{g}_{t+1:T}^{i},g_{t}^{>i},g_{t+1:T}^{-i}}[\Lambda^{i}(O_{T+1})|o_{t}^{i}]~{}\mu_{t}^{i}(o_{t}^{i}|h_{t}^{i}). (80)

Fixing hτih_{\tau}^{i}, the problem of optimizing

Jτi(g~τ:Ti;hτi,μτi):=oτi𝔼g~τ:Ti,gτ>i,gτ+1:Ti[Λi(OT+1)|oτi]μτi(oτi|hτi),J_{\tau}^{i}(\tilde{g}_{\tau:T}^{i};h_{\tau}^{i},\mu_{\tau}^{i}):=\sum_{o_{\tau}^{i}}\mathbb{E}^{\tilde{g}_{\tau:T}^{i},g_{\tau}^{>i},g_{\tau+1:T}^{-i}}[\Lambda^{i}(O_{T+1})|o_{\tau}^{i}]~{}\mu_{\tau}^{i}(o_{\tau}^{i}|h_{\tau}^{i}), (81)

over all g~τ:Ti\tilde{g}_{\tau:T}^{i} is a POMDP problem with

  • Timestamps T~={τ,τ+1,,T,T+1}\tilde{T}=\{\tau,\tau+1,\cdots,T,T+1\};

  • State process (Oti)t=τT(OT+1)(O_{t}^{i})_{t=\tau}^{T}\cup(O_{T+1});

  • Control actions (Uti)t=τT(U_{t}^{i})_{t=\tau}^{T};

  • Initial state distribution μτi(hτi)Δ(𝒪τi)\mu_{\tau}^{i}(h_{\tau}^{i})\in\Delta(\mathcal{O}_{\tau}^{i});

  • State transition kernel gt>i,gt+1<i(ot+1i|oti,uti)\mathbb{P}^{g_{t}^{>i},g_{t+1}^{<i}}(o_{t+1}^{i}|o_{t}^{i},u_{t}^{i}) for t<Tt<T and gT>i(oT+1|oTi,uTi)\mathbb{P}^{g_{T}^{>i}}(o_{T+1}|o_{T}^{i},u_{T}^{i}) for t=Tt=T;

  • Observation history: (Hti)t=τT(H_{t}^{i})_{t=\tau}^{T};

  • Instantaneous rewards are 0. Terminal reward is Λi(OT+1)\Lambda^{i}(O_{T+1}).

The belief \mu is fully consistent with g by construction. From standard results in game theory, we know that \mu_{t+1}^{i}(h_{t+1}^{i}) can be updated via Bayes rule from \mu_{t}^{i}(h_{t}^{i}) and g whenever applicable. Therefore, (\mu_{t})_{t=\tau}^{T} represent the true beliefs on the state given the observations in the above POMDP problem. Hence, by standard control theory [23, Section 6.7], (80) is a sufficient condition for g_{t:T}^{i} to be optimal for the above POMDP problem, which means that g is sequentially rational given \mu.

Therefore we conclude that (g,μ)(g,\mu) is a sequential equilibrium under Definition 14.

Appendix C Proofs for Sections 3 and 4

C.1 Proof of Lemma 1

Lemma C.1 (Lemma 1, restated).

If for all ii\in\mathcal{I} and all KiK^{-i}-based strategy profiles ρi\rho^{-i}, there exist functions (Φti,ρi)t𝒯(\Phi_{t}^{i,\rho^{-i}})_{t\in\mathcal{T}} where Φti,ρi:𝒦tiΔ(𝒳t×𝒦ti)\Phi_{t}^{i,\rho^{-i}}\colon\mathcal{K}_{t}^{i}\mapsto\Delta(\mathcal{X}_{t}\times\mathcal{K}_{t}^{-i}) such that

gi,ρi(xt,kti|hti)=Φti,ρi(xt,kti|kti),\mathbb{P}^{g^{i},\rho^{-i}}(x_{t},k_{t}^{-i}|h_{t}^{i})=\Phi_{t}^{i,\rho^{-i}}(x_{t},k_{t}^{-i}|k_{t}^{i}), (82)

for all behavioral strategies gig^{i}, all t𝒯t\in\mathcal{T}, and all htih_{t}^{i} admissible under (gi,ρi)(g^{i},\rho^{-i}), then K=(Ki)iK=(K^{i})_{i\in\mathcal{I}} is mutually sufficient information.

Proof C.2.

Let gig^{i} be an arbitrary behavioral strategy for player ii and ρi\rho^{-i} be any KiK^{-i}-based strategy profile. Let htih_{t}^{i} be admissible under (gi,ρi)(g^{i},\rho^{-i}). We have

gi,ρi(x~t,u~ti|hti)\displaystyle\mathbb{P}^{g^{i},\rho^{-i}}(\tilde{x}_{t},\tilde{u}_{t}^{-i}|h_{t}^{i}) =h~tigi,ρi(u~ti|x~t,h~ti,hti,uti)gi,ρi(x~t,h~ti|hti,uti)\displaystyle=\sum_{\tilde{h}_{t}^{-i}}\mathbb{P}^{g^{i},\rho^{-i}}(\tilde{u}_{t}^{-i}|\tilde{x}_{t},\tilde{h}_{t}^{-i},h_{t}^{i},u_{t}^{i})\mathbb{P}^{g^{i},\rho^{-i}}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|h_{t}^{i},u_{t}^{i}) (83)
=h~ti(jiρtj(u~tj|k~tj))gi,ρi(x~t,h~ti|hti)\displaystyle=\sum_{\tilde{h}_{t}^{-i}}\left(\prod_{j\neq i}\rho_{t}^{j}(\tilde{u}_{t}^{j}|\tilde{k}_{t}^{j})\right)\mathbb{P}^{g^{i},\rho^{-i}}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|h_{t}^{i}) (84)
=k~ti(jiρtj(u~tj|k~tj))gi,ρi(x~t,k~ti|hti)\displaystyle=\sum_{\tilde{k}_{t}^{-i}}\left(\prod_{j\neq i}\rho_{t}^{j}(\tilde{u}_{t}^{j}|\tilde{k}_{t}^{j})\right)\mathbb{P}^{g^{i},\rho^{-i}}(\tilde{x}_{t},\tilde{k}_{t}^{-i}|h_{t}^{i}) (85)
=k~ti(jiρtj(u~tj|k~tj))Φti,ρi(x~t,k~ti|kti),\displaystyle=\sum_{\tilde{k}_{t}^{-i}}\left(\prod_{j\neq i}\rho_{t}^{j}(\tilde{u}_{t}^{j}|\tilde{k}_{t}^{j})\right)\Phi_{t}^{i,\rho^{-i}}(\tilde{x}_{t},\tilde{k}_{t}^{-i}|k_{t}^{i}), (86)

where in (83) we applied the Law of Total Probability. In (84) we used the fact that, given their histories, the players in -i choose their actions independently according to the K^{-i}-based strategies \rho^{-i}, and that U_{t}^{i} is generated from H_{t}^{i} alone and hence provides no additional information about (X_{t},H_{t}^{-i}). In (85) we combined the realizations of \tilde{h}_{t}^{-i} corresponding to the same compressed information \tilde{k}_{t}^{-i}. In the final equation, we used the condition of Lemma 1.

By the definition of the model, Zti=fti,Z(Xt,Ut,Wt)Z_{t}^{i}=f_{t}^{i,Z}(X_{t},U_{t},W_{t}) for some fixed function fti,Zf_{t}^{i,Z} independent of the strategy profile. Since the compressed information can be sequentially updated as Kt+1i=ιt+1i(Kti,Zti)K_{t+1}^{i}=\iota_{t+1}^{i}(K_{t}^{i},Z_{t}^{i}), this means that we can write Kt+1i=ξti(Kti,Xt,Ut,Wt)K_{t+1}^{i}=\xi_{t}^{i}(K_{t}^{i},X_{t},U_{t},W_{t}) for some fixed function ξti\xi_{t}^{i}. Since WtW_{t} is a primitive random variable, we conclude that (kt+1i|kti,xt,ut)\mathbb{P}(k_{t+1}^{i}|k_{t}^{i},x_{t},u_{t}) is independent of any strategy profile. Therefore,

gi,ρi(kt+1i|hti,uti)\displaystyle\quad~{}\mathbb{P}^{g^{i},\rho^{-i}}(k_{t+1}^{i}|h_{t}^{i},u_{t}^{i}) (87)
=x~t,u~ti(kt+1i|kti,x~t,(u~ti,uti))gi,ρi(x~t,u~ti|hti)\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-i}}\mathbb{P}(k_{t+1}^{i}|k_{t}^{i},\tilde{x}_{t},(\tilde{u}_{t}^{-i},u_{t}^{i}))\mathbb{P}^{g^{i},\rho^{-i}}(\tilde{x}_{t},\tilde{u}_{t}^{-i}|h_{t}^{i}) (88)
=x~t,u~ti[(kt+1i|kti,x~t,(u~ti,uti))k~ti(jiρtj(u~tj|k~tj))Φti,ρi(x~t,k~ti|kti)]\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-i}}\left[\mathbb{P}(k_{t+1}^{i}|k_{t}^{i},\tilde{x}_{t},(\tilde{u}_{t}^{-i},u_{t}^{i}))\sum_{\tilde{k}_{t}^{-i}}\left(\prod_{j\neq i}\rho_{t}^{j}(\tilde{u}_{t}^{j}|\tilde{k}_{t}^{j})\right)\Phi_{t}^{i,\rho^{-i}}(\tilde{x}_{t},\tilde{k}_{t}^{-i}|k_{t}^{i})\right] (89)
=:Pti,ρi(kt+1i|kti,uti),\displaystyle=:P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}), (90)

for some function P_{t}^{i,\rho^{-i}}, where in (88) we used the Law of Total Probability, and we substituted (86) in (89).

Since R_{t}^{i}=f_{t}^{i,R}(X_{t},U_{t},W_{t}) for some fixed function f_{t}^{i,R} and W_{t} is a primitive random variable, \mathbb{E}[R_{t}^{i}|X_{t},U_{t}] is independent of the strategy profile g. By an argument similar to the one that leads from (87) to (90), we obtain

𝔼gi,ρi[Rti|hti,uti]\displaystyle\quad~{}\mathbb{E}^{g^{i},\rho^{-i}}[R_{t}^{i}|h_{t}^{i},u_{t}^{i}] (91)
=x~t,u~ti[𝔼[Rti|x~t,(uti,u~ti)]k~ti(jiρtj(u~tj|k~tj))Φti,ρi(x~t,k~ti|kti)]\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-i}}\left[\mathbb{E}[R_{t}^{i}|\tilde{x}_{t},(u_{t}^{i},\tilde{u}_{t}^{-i})]\sum_{\tilde{k}_{t}^{-i}}\left(\prod_{j\neq i}\rho_{t}^{j}(\tilde{u}_{t}^{j}|\tilde{k}_{t}^{j})\right)\Phi_{t}^{i,\rho^{-i}}(\tilde{x}_{t},\tilde{k}_{t}^{-i}|k_{t}^{i})\right] (92)
=:rti,ρi(kti,uti),\displaystyle=:r_{t}^{i,\rho^{-i}}(k_{t}^{i},u_{t}^{i}), (93)

for some function r_{t}^{i,\rho^{-i}}. With (90) and (93), we have shown that K satisfies Definition 4 and hence K is MSI.

C.2 Proof of Theorem 2

Theorem 16 (Theorem 2, restated).

If KK is mutually sufficient information, then there exists at least one KK-based BNE.

Proof C.3.

The proof will proceed as follows: We first construct a best-response correspondence using stochastic control theory, and then we establish the existence of equilibria by applying Kakutani’s fixed-point theorem to this correspondence. For technical reasons, we first consider only behavioral strategies where each action has probability at least ϵ>0\epsilon>0 of being played at each information set. We then take ϵ\epsilon to zero.

Fixing a KiK^{-i}-based strategy profile ρi\rho^{-i}, we first argue that KtiK_{t}^{i} is a controlled Markov process controlled by player ii’s action UtiU_{t}^{i}.

From the definition of an information state (Definition 3) we know that

g~i,ρi(kt+1i|hti,uti)=Pti,ρi(kt+1i|kti,uti).\displaystyle\mathbb{P}^{\tilde{g}^{i},\rho^{-i}}(k_{t+1}^{i}|h_{t}^{i},u_{t}^{i})=P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}). (94)

Since (K1:ti,U1:ti)(K_{1:t}^{i},U_{1:t}^{i}) is a function of (Hti,Uti)(H_{t}^{i},U_{t}^{i}), by the smoothing property of conditional probability we have

g~i,ρi(kt+1i|k1:ti,u1:ti)=Pti,ρi(kt+1i|kti,uti).\displaystyle\mathbb{P}^{\tilde{g}^{i},\rho^{-i}}(k_{t+1}^{i}|k_{1:t}^{i},u_{1:t}^{i})=P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}). (95)

Therefore we have shown that KtiK_{t}^{i} is a controlled Markov process controlled by player ii’s action UtiU_{t}^{i}.

From the definition of information state (Definition 3) we know that

𝔼g~i,ρi[Rti|kti,uti]=rti,ρi(kti,uti),\displaystyle\mathbb{E}^{\tilde{g}^{i},\rho^{-i}}\left[R_{t}^{i}|k_{t}^{i},u_{t}^{i}\right]=r_{t}^{i,\rho^{-i}}(k_{t}^{i},u_{t}^{i}), (96)

for all (kti,uti)(k_{t}^{i},u_{t}^{i}) admissible under (g~i,ρi)(\tilde{g}^{i},\rho^{-i}).

Therefore, using the Law of Total Expectation we have

Ji(g~i,ρi)\displaystyle J^{i}(\tilde{g}^{i},\rho^{-i}) =𝔼g~i,ρi[t=1TRti]=𝔼g~i,ρi[t=1T𝔼g~i,ρi[Rti|Kti,Uti]]\displaystyle=\mathbb{E}^{\tilde{g}^{i},\rho^{-i}}\left[\sum_{t=1}^{T}R_{t}^{i}\right]=\mathbb{E}^{\tilde{g}^{i},\rho^{-i}}\left[\sum_{t=1}^{T}\mathbb{E}^{\tilde{g}^{i},\rho^{-i}}\left[R_{t}^{i}|K_{t}^{i},U_{t}^{i}\right]\right] (97)
=𝔼g~i,ρi[t=1Trti,ρi(Kti,Uti)].\displaystyle=\mathbb{E}^{\tilde{g}^{i},\rho^{-i}}\left[\sum_{t=1}^{T}r_{t}^{i,\rho^{-i}}(K_{t}^{i},U_{t}^{i})\right]. (98)

By standard MDP theory, there exist KiK^{i}-based strategies ρi\rho^{i} that maximize Ji(g~i,ρi)J^{i}(\tilde{g}^{i},\rho^{-i}) over all behavioral strategies g~i\tilde{g}^{i}. Furthermore, optimal KiK^{i}-based strategies can be found through dynamic programming.

Fix \epsilon>0, and let \mathcal{P}^{\epsilon,i} denote the set of K^{i}-based strategies for player i where each action u_{t}^{i}\in\mathcal{U}_{t}^{i} is chosen with probability at least \epsilon at every information set. To endow \mathcal{P}^{\epsilon,i} with a topology, we consider it as a product of sets of distributions, i.e.

𝒫ϵ,i=t𝒯kti𝒦tiΔϵ(𝒰ti),\displaystyle\mathcal{P}^{\epsilon,i}=\prod_{t\in\mathcal{T}}\prod_{k_{t}^{i}\in\mathcal{K}_{t}^{i}}\Delta^{\epsilon}(\mathcal{U}_{t}^{i}), (99)

where

Δϵ(𝒰ti)={ηΔ(𝒰ti):η(uti)ϵuti𝒰ti}.\displaystyle\Delta^{\epsilon}(\mathcal{U}_{t}^{i})=\{\eta\in\Delta(\mathcal{U}_{t}^{i}):\eta(u_{t}^{i})\geq\epsilon~{}\forall u_{t}^{i}\in\mathcal{U}_{t}^{i}\}. (100)

Define 𝒫ϵ=i𝒫ϵ,i\mathcal{P}^{\epsilon}=\prod_{i\in\mathcal{I}}\mathcal{P}^{\epsilon,i}. Denote the set of all KiK^{i}-based strategy profiles by 𝒫0\mathcal{P}^{0}.

For the rest of the proof, assume that ϵ\epsilon is small enough such that Δϵ(𝒰ti)\Delta^{\epsilon}(\mathcal{U}_{t}^{i}) is non-empty for all t𝒯t\in\mathcal{T} and ii\in\mathcal{I}.

For each t𝒯t\in\mathcal{T}, ii\in\mathcal{I} and kti𝒦tik_{t}^{i}\in\mathcal{K}_{t}^{i}, define the correspondence BRtϵ,i[kti]:𝒫ϵ,iΔϵ(𝒰ti)\mathrm{BR}_{t}^{\epsilon,i}[k_{t}^{i}]:\mathcal{P}^{\epsilon,-i}\mapsto\Delta^{\epsilon}(\mathcal{U}_{t}^{i}) sequentially through

QTϵ,i(kTi,uTi;ρi)\displaystyle Q_{T}^{\epsilon,i}(k_{T}^{i},u_{T}^{i};\rho^{-i}) :=rTi,ρi(kTi,uTi),\displaystyle:=r_{T}^{i,\rho^{-i}}(k_{T}^{i},u_{T}^{i}), (101a)
BRtϵ,i[kti](ρi)\displaystyle\mathrm{BR}_{t}^{\epsilon,i}[k_{t}^{i}](\rho^{-i}) :=argmaxηΔϵ(𝒰ti)u~tiQtϵ,i(kti,u~ti;ρi)η(u~ti),\displaystyle:=\underset{\eta\in\Delta^{\epsilon}(\mathcal{U}_{t}^{i})}{\arg\max}~{}\sum_{\tilde{u}_{t}^{i}}Q_{t}^{\epsilon,i}(k_{t}^{i},\tilde{u}_{t}^{i};\rho^{-i})\eta(\tilde{u}_{t}^{i}), (101b)
Vtϵ,i(kti;ρi)\displaystyle V_{t}^{\epsilon,i}(k_{t}^{i};\rho^{-i}) :=maxηΔϵ(𝒰ti)u~tiQtϵ,i(kti,u~ti;ρi)η(u~ti),\displaystyle:=\max_{\eta\in\Delta^{\epsilon}(\mathcal{U}_{t}^{i})}\sum_{\tilde{u}_{t}^{i}}Q_{t}^{\epsilon,i}(k_{t}^{i},\tilde{u}_{t}^{i};\rho^{-i})\eta(\tilde{u}_{t}^{i}), (101c)
Qt1ϵ,i(kt1i,ut1i;ρi)\displaystyle Q_{t-1}^{\epsilon,i}(k_{t-1}^{i},u_{t-1}^{i};\rho^{-i}) :=rt1i,ρi(kt1i,ut1i)+\displaystyle:=r_{t-1}^{i,\rho^{-i}}(k_{t-1}^{i},u_{t-1}^{i})+ (101d)
+kti𝒦ti\displaystyle+\sum_{k_{t}^{i}\in\mathcal{K}_{t}^{i}} Vtϵ,i(kti;ρi)Pt1i,ρi(kti|kt1i,ut1i).\displaystyle V_{t}^{\epsilon,i}(k_{t}^{i};\rho^{-i})P_{t-1}^{i,\rho^{-i}}(k_{t}^{i}|k_{t-1}^{i},u_{t-1}^{i}). (101e)

Define BRϵ:𝒫ϵ𝒫ϵ\mathrm{BR}^{\epsilon}:\mathcal{P}^{\epsilon}\mapsto\mathcal{P}^{\epsilon} by

BRϵ(ρ)=it𝒯kti𝒦tiBRtϵ,i[kti](ρi).\displaystyle\mathrm{BR}^{\epsilon}(\rho)=\prod_{i\in\mathcal{I}}\prod_{t\in\mathcal{T}}\prod_{k_{t}^{i}\in\mathcal{K}_{t}^{i}}\mathrm{BR}_{t}^{\epsilon,i}[k_{t}^{i}](\rho^{-i}). (102)
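For readers who prefer a computational view of the recursion (101), the following minimal sketch (in Python) computes the Q-functions and one selection from the ϵ-constrained best-response correspondence for a fixed player, given the kernels P_t^{i,ρ^{-i}} and rewards r_t^{i,ρ^{-i}} as numeric arrays. The function name eps_constrained_best_response, the array layout, and the use of NumPy are illustrative assumptions rather than part of the model; the sketch relies on the fact that maximizing a linear objective over Δ^ε(𝒰_t^i) is achieved by placing probability ϵ on every action and the remaining mass on a maximizing action.

```python
import numpy as np

def eps_constrained_best_response(P, r, eps):
    """Backward recursion (101) for a fixed player i, holding the other players' strategies fixed.

    P   : list with P[t] of shape (K_t, U_t, K_{t+1}) holding P_t^{i,rho^{-i}}(k' | k, u), t = 0..T-2
    r   : list with r[t] of shape (K_t, U_t) holding r_t^{i,rho^{-i}}(k, u), t = 0..T-1
    eps : lower bound on each action probability (assumed small enough that Delta^eps is non-empty)
    Returns one selection br[t][k] from BR_t^{eps,i}[k] and the Q-functions.
    Array shapes and names are illustrative assumptions about how the model is encoded.
    """
    T = len(r)
    Q = [None] * T
    br = [None] * T
    Q[T - 1] = r[T - 1].copy()                        # (101a)
    for t in range(T - 1, -1, -1):
        K, U = Q[t].shape
        # Maximizing a linear objective over Delta^eps(U): put eps on every action
        # and the remaining mass on a maximizing action; this realizes (101b).
        br[t] = np.full((K, U), eps)
        best = Q[t].argmax(axis=1)
        br[t][np.arange(K), best] += 1.0 - U * eps
        V = (br[t] * Q[t]).sum(axis=1)                # (101c)
        if t > 0:
            Q[t - 1] = r[t - 1] + P[t - 1] @ V        # (101d)-(101e)
    return br, Q
```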

Claim:

  1. (a)

    Pti,ρi(kt+1i|kti,uti)P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} for all t𝒯t\in\mathcal{T} and all kt+1i𝒦t+1i,kti𝒦ti,uti𝒰tik_{t+1}^{i}\in\mathcal{K}_{t+1}^{i},k_{t}^{i}\in\mathcal{K}_{t}^{i},u_{t}^{i}\in\mathcal{U}_{t}^{i}.

  2. (b)

    rti,ρi(kti,uti)r_{t}^{i,\rho^{-i}}(k_{t}^{i},u_{t}^{i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} for all t𝒯t\in\mathcal{T} and all kti𝒦ti,uti𝒰tik_{t}^{i}\in\mathcal{K}_{t}^{i},u_{t}^{i}\in\mathcal{U}_{t}^{i}.

Given the claim, we prove by induction that Qtϵ,i(kti,uti;ρi)Q_{t}^{\epsilon,i}(k_{t}^{i},u_{t}^{i};\rho^{-i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} for each kti𝒦ti,uti𝒰tik_{t}^{i}\in\mathcal{K}_{t}^{i},u_{t}^{i}\in\mathcal{U}_{t}^{i}.

Induction Base: QTϵ,i(kTi,uTi;ρi)=rTi,ρi(kTi,uTi)Q_{T}^{\epsilon,i}(k_{T}^{i},u_{T}^{i};\rho^{-i})=r_{T}^{i,\rho^{-i}}(k_{T}^{i},u_{T}^{i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} due to part (b) of the claim.

Induction Step: Suppose that the induction hypothesis is true for tt. Then Vtϵ,i(kti;ρi)V_{t}^{\epsilon,i}(k_{t}^{i};\rho^{-i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} due to Berge’s Maximum Theorem [49]. Then, Qt1ϵ,i(kt1i,ut1i;ρi)Q_{t-1}^{\epsilon,i}(k_{t-1}^{i},u_{t-1}^{i};\rho^{-i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} due to parts (a) and (b) of the claim.

Applying Berge’s Maximum Theorem [49] once again, we conclude that BRtϵ,i[kti]\mathrm{BR}_{t}^{\epsilon,i}[k_{t}^{i}] is upper hemicontinuous on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i}. For each ρi𝒫ϵ,i\rho^{-i}\in\mathcal{P}^{\epsilon,-i}, BRtϵ,i[kti](ρi)\mathrm{BR}_{t}^{\epsilon,i}[k_{t}^{i}](\rho^{-i}) is non-empty and convex since it is the solution set of a linear program.

As a product of compact-valued upper hemicontinuous correspondences, BRϵ\mathrm{BR}^{\epsilon} is upper hemicontinuous. For each ρ𝒫ϵ\rho\in\mathcal{P}^{\epsilon}, BRϵ(ρ)\mathrm{BR}^{\epsilon}(\rho) is non-empty and convex. By Kakutani’s fixed point theorem, BRϵ\mathrm{BR}^{\epsilon} has a fixed point.

The above construction provides an approximate KK-based BNE for small ϵ\epsilon. Next, we show that we can take ϵ\epsilon to zero to obtain an exact BNE: Let (ϵn)n=1(\epsilon_{n})_{n=1}^{\infty} be a sequence such that ϵn>0,ϵn0\epsilon_{n}>0,\epsilon_{n}\rightarrow 0. Let ρ(n)\rho^{(n)} be a fixed point of BRϵn\mathrm{BR}^{\epsilon_{n}}. Then for each ii\in\mathcal{I} we have

ρ(n),iargmaxρ~i𝒫ϵn,iJi(ρ~i,ρ(n),i).\rho^{(n),i}\in\underset{\tilde{\rho}^{i}\in\mathcal{P}^{\epsilon_{n},i}}{\arg\max}~{}J^{i}(\tilde{\rho}^{i},\rho^{(n),-i}). (103)

Let ρ()𝒫0\rho^{(\infty)}\in\mathcal{P}^{0} be the limit of a sub-sequence of (ρ(n))n=1(\rho^{(n)})_{n=1}^{\infty}. Since Ji(ρ)J^{i}(\rho) is continuous in ρ\rho on 𝒫0\mathcal{P}^{0}, and ϵ𝒫ϵ,i\epsilon\mapsto\mathcal{P}^{\epsilon,i} is a continuous correspondence with compact, non-empty values, applying Berge’s Maximum Theorem [49] one last time we conclude that for each ii\in\mathcal{I}

ρ(),iargmaxρ~i𝒫0,iJi(ρ~i,ρ(),i),\rho^{(\infty),i}\in\underset{\tilde{\rho}^{i}\in\mathcal{P}^{0,i}}{\arg\max}~{}J^{i}(\tilde{\rho}^{i},\rho^{(\infty),-i}), (104)

i.e. ρ(),i\rho^{(\infty),i} is optimal among KiK^{i}-based strategies in response to ρ(),i\rho^{(\infty),-i}. Recall that we have shown that there exist KiK^{i}-based strategies ρi\rho^{i} that maximize Ji(g~i,ρi)J^{i}(\tilde{g}^{i},\rho^{-i}) over all behavioral strategies g~i\tilde{g}^{i}. Therefore, we conclude that ρ()\rho^{(\infty)} forms a BNE, proving the existence of a KK-based BNE.

Proof of Claim: We establish the continuity of the two functions by showing that they can be expressed in terms of basic operations (summation, multiplication, and division) applied to quantities that are continuous in ρi\rho^{-i}.

Let g^i\hat{g}^{i} be a behavioral strategy where player ii chooses actions uniformly at random at every information set. For ρi𝒫ϵ,i\rho^{-i}\in\mathcal{P}^{\epsilon,-i}, we have g^i,ρi(kti)>0\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t}^{i})>0 for all kti𝒦tik_{t}^{i}\in\mathcal{K}_{t}^{i} since (g^i,ρi)(\hat{g}^{i},\rho^{-i}) is a strategy profile that always plays strictly mixed actions. Therefore we have

Pti,ρi(kt+1i|kti,uti)\displaystyle P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}) =g^i,ρi(kt+1i|kti,uti)=g^i,ρi(kt+1i,kti,uti)g^i,ρi(kti,uti),\displaystyle=\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i})=\dfrac{\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t+1}^{i},k_{t}^{i},u_{t}^{i})}{\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t}^{i},u_{t}^{i})}, (105)
rti,ρi(kti,uti)\displaystyle r_{t}^{i,\rho^{-i}}(k_{t}^{i},u_{t}^{i}) =𝔼g^i,ρi[Rti|kti,uti]\displaystyle=\mathbb{E}^{\hat{g}^{i},\rho^{-i}}[R_{t}^{i}|k_{t}^{i},u_{t}^{i}] (106)
=xt𝒳t,uti𝒰ti𝔼[Rti|xt,ut]g^i,ρi(xt,uti|kti,uti),\displaystyle=\sum_{x_{t}\in\mathcal{X}_{t},u_{t}^{-i}\in\mathcal{U}_{t}^{-i}}\mathbb{E}[R_{t}^{i}|x_{t},u_{t}]\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(x_{t},u_{t}^{-i}|k_{t}^{i},u_{t}^{i}), (107)

where 𝔼[Rti|xt,ut]\mathbb{E}[R_{t}^{i}|x_{t},u_{t}] is independent of the strategy profile.

We know that both g^i,ρi(kt+1i,kti,uti)\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t+1}^{i},k_{t}^{i},u_{t}^{i}) and g^i,ρi(kti,uti)\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t}^{i},u_{t}^{i}) are sums of products of components of ρi\rho^{-i} and g^i\hat{g}^{i}, hence both are continuous in ρi\rho^{-i}; moreover, the denominator g^i,ρi(kti,uti)\mathbb{P}^{\hat{g}^{i},\rho^{-i}}(k_{t}^{i},u_{t}^{i}) is strictly positive on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i}. Therefore Pti,ρi(kt+1i|kti,uti)P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}) is continuous in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i}. The continuity of rti,ρi(kti,uti)r_{t}^{i,\rho^{-i}}(k_{t}^{i},u_{t}^{i}) in ρi\rho^{-i} on 𝒫ϵ,i\mathcal{P}^{\epsilon,-i} can be shown with an analogous argument.

C.3 Proof of Theorem 3

Theorem 17 (Theorem 3, restated).

If K=(Ki)iK=(K^{i})_{i\in\mathcal{I}} where KiK^{i} is unilaterally sufficient information for player ii, then the set of KK-based BNE payoffs is the same as that of all BNE.

To establish Theorem 3, we first introduce Definition 18, an extension of Definition 3, for the convenience of the proof. Then, we establish Lemmas C.4, C.6, C.8. Finally, we conclude the proof of Theorem 3 from the three lemmas.

In the following definition, we provide an extension of the definition of the information state in which not only player ii’s own payoff but also the payoffs of other players are considered. This definition allows us to characterize compression maps that preserve payoff profiles, as required in the statement of Theorem 3.

Definition 18.

Let gig^{-i} be a behavioral strategy profile of players other than ii and 𝒥\mathcal{J}\subseteq\mathcal{I} be a subset of players. We say that KiK^{i} is an information state under gig^{-i} for the payoffs of 𝒥\mathcal{J} if there exist functions (Pti,gi)t𝒯,(rtj,gi)j𝒥,t𝒯(P_{t}^{i,g^{-i}})_{t\in\mathcal{T}},(r_{t}^{j,g^{-i}})_{j\in\mathcal{J},t\in\mathcal{T}}, where Pti,gi:𝒦ti×𝒰tiΔ(𝒦t+1i)P_{t}^{i,g^{-i}}:\mathcal{K}_{t}^{i}\times\mathcal{U}_{t}^{i}\mapsto\Delta(\mathcal{K}_{t+1}^{i}) and rtj,gi:𝒦ti×𝒰ti[1,1]r_{t}^{j,g^{-i}}:\mathcal{K}_{t}^{i}\times\mathcal{U}_{t}^{i}\mapsto[-1,1], such that

  1. (1)

    gi,gi(kt+1i|hti,uti)=Pti,gi(kt+1i|kti,uti)\mathbb{P}^{g^{i},g^{-i}}(k_{t+1}^{i}|h_{t}^{i},u_{t}^{i})=P_{t}^{i,g^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}) for all t𝒯\{T}t\in\mathcal{T}\backslash\{T\}; and

  2. (2)

    𝔼gi,gi[Rtj|hti,uti]=rtj,gi(kti,uti)\mathbb{E}^{g^{i},g^{-i}}[R_{t}^{j}|h_{t}^{i},u_{t}^{i}]=r_{t}^{j,g^{-i}}(k_{t}^{i},u_{t}^{i}) for all j𝒥j\in\mathcal{J} and all t𝒯t\in\mathcal{T},

for all gig^{i}, and all (hti,uti)(h_{t}^{i},u_{t}^{i}) admissible under (gi,gi)(g^{i},g^{-i}).

Notice that condition (2) of Definition 18 means that the information state KiK^{i} is sufficient for evaluating other agents’ payoffs as well. This property is essential in establishing the preservation of payoff profiles of other agents when player ii switches to a compression-based strategy.

Lemma C.4.

If KiK^{i} is unilaterally sufficient information, then for every behavioral strategy profile gig^{-i} of the players other than ii, KiK^{i} is an information state under gig^{-i} for the payoffs of \mathcal{I}.

Proof C.5 (Proof of Lemma C.4).

Let Φti,gi\Phi_{t}^{i,g^{-i}} be as in the definition of USI (Definition 1). Then we have

g(xt,hti|hti)\displaystyle\mathbb{P}^{g}(x_{t},h_{t}^{-i}|h_{t}^{i}) =Φti,gi(xt,hti|kti).\displaystyle=\Phi_{t}^{i,g^{-i}}(x_{t},h_{t}^{-i}|k_{t}^{i}). (108)

Applying the Law of Total Probability,

g(x~t,u~ti|hti)\displaystyle\mathbb{P}^{g}(\tilde{x}_{t},\tilde{u}_{t}^{-i}|h_{t}^{i}) =h~tig(u~ti|x~t,h~ti,hti)g(x~t,h~ti|hti)\displaystyle=\sum_{\tilde{h}_{t}^{-i}}\mathbb{P}^{g}(\tilde{u}_{t}^{-i}|\tilde{x}_{t},\tilde{h}_{t}^{-i},h_{t}^{i})\mathbb{P}^{g}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|h_{t}^{i}) (109)
=h~ti(jigtj(u~tj|h~tj))Φti,gi(x~t,h~ti|kti)\displaystyle=\sum_{\tilde{h}_{t}^{-i}}\left(\prod_{j\neq i}g_{t}^{j}(\tilde{u}_{t}^{j}|\tilde{h}_{t}^{j})\right)\Phi_{t}^{i,g^{-i}}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|k_{t}^{i}) (110)
=:P~ti,gi(x~t,u~ti|kti).\displaystyle=:\tilde{P}_{t}^{i,g^{-i}}(\tilde{x}_{t},\tilde{u}_{t}^{-i}|k_{t}^{i}). (111)

We know that Kt+1i=ιt+1i(Kti,Zti)=ξti(Kti,Xt,Ut,Wt)K_{t+1}^{i}=\iota_{t+1}^{i}(K_{t}^{i},Z_{t}^{i})=\xi_{t}^{i}(K_{t}^{i},X_{t},U_{t},W_{t}) for some fixed function ξti\xi_{t}^{i} independent of the strategy profile gg. Since WtW_{t} is a primitive random variable, (kt+1i|kti,xt,ut)\mathbb{P}(k_{t+1}^{i}|k_{t}^{i},x_{t},u_{t}) is independent of the strategy profile gg. Therefore,

g(kt+1i|hti,uti)\displaystyle\mathbb{P}^{g}(k_{t+1}^{i}|h_{t}^{i},u_{t}^{i}) =x~t,u~ti(kt+1i|kti,x~t,(u~ti,uti))P~ti,gi(x~t,u~ti|kti)\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-i}}\mathbb{P}(k_{t+1}^{i}|k_{t}^{i},\tilde{x}_{t},(\tilde{u}_{t}^{-i},u_{t}^{i}))\tilde{P}_{t}^{i,g^{-i}}(\tilde{x}_{t},\tilde{u}_{t}^{-i}|k_{t}^{i}) (112)
=:Pti,gi(kt+1i|kti,uti),\displaystyle=:P_{t}^{i,g^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}), (113)

establishing part (1) of Definition 18.

Consider any jj\in\mathcal{I}. Since RtjR_{t}^{j} is a strategy-independent function of (Xt,Ut,Wt)(X_{t},U_{t},W_{t}), 𝔼[Rtj|xt,ut]\mathbb{E}[R_{t}^{j}|x_{t},u_{t}] is independent of gg. Therefore

𝔼g[Rtj|hti,uti]\displaystyle\mathbb{E}^{g}[R_{t}^{j}|h_{t}^{i},u_{t}^{i}] =x~t,u~ti𝔼[Rtj|x~t,(uti,u~ti)]P~ti,gi(x~t,u~ti|kti)\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-i}}\mathbb{E}[R_{t}^{j}|\tilde{x}_{t},(u_{t}^{i},\tilde{u}_{t}^{-i})]\tilde{P}_{t}^{i,g^{-i}}(\tilde{x}_{t},\tilde{u}_{t}^{-i}|k_{t}^{i}) (114)
=:rtj,gi(kti,uti),\displaystyle=:r_{t}^{j,g^{-i}}(k_{t}^{i},u_{t}^{i}), (115)

establishing part (2) of Definition 18.

In Lemma C.6, we show that any behavioral strategy of player ii can be replaced by an equivalent randomized USI-based strategy while preserving payoffs of all players.

Lemma C.6.

Let KiK^{i} be unilaterally sufficient information. Then for every behavioral strategy gig^{i} of player ii, if the KiK^{i}-based strategy ρi\rho^{i} is given by

ρti(uti|kti)=h~titigti(uti|h~ti)Fti,gi(h~ti|kti),\rho_{t}^{i}(u_{t}^{i}|k_{t}^{i})=\sum_{\tilde{h}_{t}^{i}\in\mathcal{H}_{t}^{i}}g_{t}^{i}(u_{t}^{i}|\tilde{h}_{t}^{i})F_{t}^{i,g^{i}}(\tilde{h}_{t}^{i}|k_{t}^{i}), (116)

where Fti,gi(h~ti|kti)F_{t}^{i,g^{i}}(\tilde{h}_{t}^{i}|k_{t}^{i}) is defined in Definition 1, then

Jj(gi,gi)=Jj(ρi,gi),J^{j}(g^{i},g^{-i})=J^{j}(\rho^{i},g^{-i}),

for all jj\in\mathcal{I} and all behavioral strategy profiles gig^{-i} of players other than ii.

Proof C.7 (Proof of Lemma C.6).

Let jj\in\mathcal{I}. Consider an MDP with state HtiH_{t}^{i}, action UtiU_{t}^{i} and instantaneous reward r~ti,j(hti,uti):=𝔼gi[Rtj|hti,uti]\tilde{r}_{t}^{i,j}(h_{t}^{i},u_{t}^{i}):=\mathbb{E}^{g^{-i}}[R_{t}^{j}|h_{t}^{i},u_{t}^{i}]. By Lemma C.4, KiK^{i} is an information state (as defined in Definition 9) for this MDP. Hence Jj(gi,gi)=Jj(ρi,gi)J^{j}(g^{i},g^{-i})=J^{j}(\rho^{i},g^{-i}) follows from the Policy Equivalence Lemma (Lemma A.2).
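To make the construction (116) concrete, the following minimal sketch (Python; the container layout and the name compress_strategy are illustrative assumptions, not part of the model) marginalizes a history-based behavioral strategy onto the compressed information using the conditional distribution F_t^{i,g^i}(· | k_t^i) from Definition 1.

```python
def compress_strategy(g_t_i, F_t_i, actions):
    """Sketch of (116): marginalize a history-based behavioral strategy onto USI.

    g_t_i   : dict with g_t_i[h][u] = g_t^i(u | h) for each full private history h
    F_t_i   : dict with F_t_i[k][h] = F_t^{i,g^i}(h | k), the conditional distribution of the
              history given the compressed information k (Definition 1)
    actions : the finite action set of player i at time t
    Returns rho_t_i with rho_t_i[k][u] = sum_h g_t^i(u | h) F_t^{i,g^i}(h | k).
    Container layout and names are assumptions made for illustration only.
    """
    return {k: {u: sum(g_t_i[h][u] * p for h, p in hist_dist.items()) for u in actions}
            for k, hist_dist in F_t_i.items()}
```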

In Lemma C.8, we proceed to show that a behavioral strategy can be replaced with a USI-based strategy while preserving not only the payoffs of all players, but also the equilibrium property.

Lemma C.8.

If KiK^{i} is unilaterally sufficient information for player ii, then for any BNE strategy profile g=(gi)ig=(g^{i})_{i\in\mathcal{I}} there exists a KiK^{i}-based strategy ρi\rho^{i} such that (ρi,gi)(\rho^{i},g^{-i}) forms a BNE with the same expected payoff profile as gg.

Proof C.9 (Proof of Lemma C.8).

Let ρi\rho^{i} be associated with gig^{i} as specified in Lemma C.6. Set g¯=(ρi,gi)\bar{g}=(\rho^{i},g^{-i}). Since Ji(ρi,gi)=Ji(gi,gi)J^{i}(\rho^{i},g^{-i})=J^{i}(g^{i},g^{-i}) and gig^{i} is a best response to gig^{-i}, ρi\rho^{i} is also a best response to gig^{-i}.

Consider jij\neq i. Let g~j\tilde{g}^{j} be an arbitrary behavioral strategy of player jj. By using Lemma C.6 twice we have

Jj(g¯j,g¯j)\displaystyle J^{j}(\bar{g}^{j},\bar{g}^{-j}) =Jj(ρi,gi)=Jj(g)Jj(g~j,gj)\displaystyle=J^{j}(\rho^{i},g^{-i})=J^{j}(g)\geq J^{j}(\tilde{g}^{j},g^{-j}) (117)
=Jj(g~j,(ρi,g{i,j}))=Jj(g~j,g¯j).\displaystyle=J^{j}(\tilde{g}^{j},(\rho^{i},g^{-\{i,j\}}))=J^{j}(\tilde{g}^{j},\bar{g}^{-j}). (118)

Therefore g¯j\bar{g}^{j} is a best response to (ρi,g{i,j})(\rho^{i},g^{-\{i,j\}}). We conclude that g¯=(ρi,gi)\bar{g}=(\rho^{i},g^{-i}) is also a BNE.

Proof C.10 (Proof of Theorem 3).

Given any BNE strategy profile gg, applying Lemma C.8 iteratively for each ii\in\mathcal{I}, we obtain a KK-based BNE strategy profile ρ\rho with the same expected payoff profile as gg. Therefore the set of KK-based BNE payoffs is the same as that of all BNE.

C.4 Proof of Theorem 4

Theorem 19 (Theorem 4, restated).

If KK is mutually sufficient information, then there exists at least one KK-based sequential equilibrium.

Proof C.11.

The proof of Theorem 4 follows steps similar to those of the proof of Theorem 2, where we construct a sequence of strictly mixed strategy profiles via the fixed points of dynamic-program-based best-response mappings. In addition, we show the sequential rationality of the strategy profile constructed.

Let (ρ(n))n=1(\rho^{(n)})_{n=1}^{\infty} be a sequence of KK-based strategy profiles that always assign strictly mixed actions as constructed in the proof of Theorem 2. By taking a sub-sequence, without loss of generality, assume that ρ(n)ρ()\rho^{(n)}\rightarrow\rho^{(\infty)} for some KK-based strategy profile ρ()\rho^{(\infty)}.

Let Q(n)Q^{(n)} be conjectures of reward-to-go functions consistent (in the sense of Definition 12) with ρ(n)\rho^{(n)}, i.e.

Qτ(n),i(hτi,uτi):=𝔼ρ(n)[t=τTRti|hτi,uτi].\displaystyle Q_{\tau}^{(n),i}(h_{\tau}^{i},u_{\tau}^{i}):=\mathbb{E}^{\rho^{(n)}}\left[\sum_{t=\tau}^{T}R_{t}^{i}\Big{|}h_{\tau}^{i},u_{\tau}^{i}\right]. (119)

Let Q()Q^{(\infty)} be the limit of a sub-sequence of (Q(n))n=1(Q^{(n)})_{n=1}^{\infty} (such a limit exists since the range of each Qτ(n),iQ_{\tau}^{(n),i} is a compact set). We proceed to show that (ρ(),Q())(\rho^{(\infty)},Q^{(\infty)}) forms a sequential equilibrium (as defined in Definition 12). Note that by construction, Q()Q^{(\infty)} is fully consistent with ρ()\rho^{(\infty)}. We only need to show sequential rationality.

Claim: Let Qtϵ,iQ_{t}^{\epsilon,i} be as defined in (101) in the proof of Theorem 2, then

Qt(n),i(hti,uti)=Qtϵn,i(kti,uti;ρ(n),i),Q_{t}^{(n),i}(h_{t}^{i},u_{t}^{i})=Q_{t}^{\epsilon_{n},i}(k_{t}^{i},u_{t}^{i};\rho^{(n),-i}), (120)

for all i,t𝒯,htitii\in\mathcal{I},t\in\mathcal{T},h_{t}^{i}\in\mathcal{H}_{t}^{i}, and uti𝒰tiu_{t}^{i}\in\mathcal{U}_{t}^{i}.

By construction in the proof of Theorem 2, ρt(n),i(kti)BRtϵn,i[kti](ρ(n),i)\rho_{t}^{(n),i}(k_{t}^{i})\in\mathrm{BR}_{t}^{\epsilon_{n},i}[k_{t}^{i}](\rho^{(n),-i}). Given the claim, this means that

ρt(n),i(kti)argmaxηΔϵn(𝒰ti)u~tiQt(n),i(hti,u~ti)η(u~ti),\rho_{t}^{(n),i}(k_{t}^{i})\in\underset{\eta\in\Delta^{\epsilon_{n}}(\mathcal{U}_{t}^{i})}{\arg\max}~{}\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})\eta(\tilde{u}_{t}^{i}), (121)

for all i,t𝒯i\in\mathcal{I},t\in\mathcal{T} and htitih_{t}^{i}\in\mathcal{H}_{t}^{i}.

Applying Berge’s Maximum Theorem [49] in a similar manner to the proof of Theorem 2 we obtain

ρt(),i(kti)argmaxηΔ(𝒰ti)u~tiQt(),i(hti,u~ti)η(u~ti),\rho_{t}^{(\infty),i}(k_{t}^{i})\in\underset{\eta\in\Delta(\mathcal{U}_{t}^{i})}{\arg\max}~{}\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(\infty),i}(h_{t}^{i},\tilde{u}_{t}^{i})\eta(\tilde{u}_{t}^{i}), (122)

for all i,t𝒯i\in\mathcal{I},t\in\mathcal{T} and htitih_{t}^{i}\in\mathcal{H}_{t}^{i}.

Therefore, we have shown that ρ()\rho^{(\infty)} is sequentially rational under Q()Q^{(\infty)} and we have completed the proof.

Proof of Claim: For clarity of exposition we drop the superscript (n)(n) of ρ(n)\rho^{(n)}. We know that Qt(n),iQ_{t}^{(n),i} satisfies the following equations:

QT(n),i(hTi,uTi)\displaystyle Q_{T}^{(n),i}(h_{T}^{i},u_{T}^{i}) =𝔼ρ[RTi|hTi,uTi],\displaystyle=\mathbb{E}^{\rho}[R_{T}^{i}|h_{T}^{i},u_{T}^{i}], (123a)
Vt(n),i(hti)\displaystyle V_{t}^{(n),i}(h_{t}^{i}) :=u~tiQt(n),i(hti,u~ti)ρti(u~ti|kti),\displaystyle:=\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(n),i}(h_{t}^{i},\tilde{u}_{t}^{i})\rho^{i}_{t}(\tilde{u}_{t}^{i}|k_{t}^{i}), (123b)
Qt1(n),i(ht1i,ut1i)\displaystyle Q_{t-1}^{(n),i}(h_{t-1}^{i},u_{t-1}^{i}) :=𝔼ρ[Rt1i|ht1i,ut1i]+h~titiVt(n),i(h~ti)ρ(h~ti|ht1i,ut1i).\displaystyle:=\mathbb{E}^{\rho}[R_{t-1}^{i}|h_{t-1}^{i},u_{t-1}^{i}]+\sum_{\tilde{h}_{t}^{i}\in\mathcal{H}_{t}^{i}}V_{t}^{(n),i}(\tilde{h}_{t}^{i})\mathbb{P}^{\rho}(\tilde{h}_{t}^{i}|h_{t-1}^{i},u_{t-1}^{i}). (123c)

Since KK is mutually sufficient information, we have

ρ(kt+1i|hti,uti)\displaystyle\mathbb{P}^{\rho}(k_{t+1}^{i}|h_{t}^{i},u_{t}^{i}) :=Pti,ρi(kt+1i|kti,uti),\displaystyle:=P_{t}^{i,\rho^{-i}}(k_{t+1}^{i}|k_{t}^{i},u_{t}^{i}), (124)
𝔼ρ[Rti|hti,uti]\displaystyle\mathbb{E}^{\rho}[R_{t}^{i}|h_{t}^{i},u_{t}^{i}] :=rti,ρi(kti,uti),\displaystyle:=r_{t}^{i,\rho^{-i}}(k_{t}^{i},u_{t}^{i}), (125)

where Pti,ρiP_{t}^{i,\rho^{-i}} and rti,ρir_{t}^{i,\rho^{-i}} are as specified in Definition 3.

Therefore, through an inductive argument, one can show that Qt(n),i(hti,uti)Q_{t}^{(n),i}(h_{t}^{i},u_{t}^{i}) depends on htih_{t}^{i} only through ktik_{t}^{i}, and

QT(n),i(kTi,uTi)\displaystyle Q_{T}^{(n),i}(k_{T}^{i},u_{T}^{i}) =rTi,ρi(kTi,uTi),\displaystyle=r_{T}^{i,\rho^{-i}}(k_{T}^{i},u_{T}^{i}), (126a)
Vt(n),i(kti)\displaystyle V_{t}^{(n),i}(k_{t}^{i}) :=u~tiQt(n),i(kti,u~ti)ρti(u~ti|kti),\displaystyle:=\sum_{\tilde{u}_{t}^{i}}Q_{t}^{(n),i}(k_{t}^{i},\tilde{u}_{t}^{i})\rho^{i}_{t}(\tilde{u}_{t}^{i}|k_{t}^{i}), (126b)
Qt1(n),i(kt1i,ut1i)\displaystyle\quad Q_{t-1}^{(n),i}(k_{t-1}^{i},u_{t-1}^{i}) :=rt1i,ρi(kt1i,ut1i)+k~ti𝒦tiVt(n),i(k~ti)Pt1i,ρi(k~ti|kt1i,ut1i).\displaystyle:=r_{t-1}^{i,\rho^{-i}}(k_{t-1}^{i},u_{t-1}^{i})+\sum_{\tilde{k}_{t}^{i}\in\mathcal{K}_{t}^{i}}V_{t}^{(n),i}(\tilde{k}_{t}^{i})P_{t-1}^{i,\rho^{-i}}(\tilde{k}_{t}^{i}|k_{t-1}^{i},u_{t-1}^{i}). (126c)

The claim is then established by comparing (126) with (101) and combining with the fact that ρti(kti)BRtϵ,i[kti](ρi)\rho_{t}^{i}(k_{t}^{i})\in\mathrm{BR}_{t}^{\epsilon,i}[k_{t}^{i}](\rho^{-i}).

C.5 Proof of Theorem 5

Theorem 20 (Theorem 5, restated).

If K=(Ki)iK=(K^{i})_{i\in\mathcal{I}} where KiK^{i} is unilaterally sufficient information for player ii, then the set of KK-based sequential equilibrium payoffs is the same as that of all sequential equilibria.

To prove the assertion of Theorem 5 we establish a series of technical results that appear in Lemmas C.12 - C.20 below. The two key results needed for the proof of the theorem are provided by Lemmas C.16 and C.20. Lemma C.16 asserts that a player can switch to a USI-based strategy without changing the dynamic decision problems faced by the other players. The result of Lemma C.16 allows us to establish the analogue of the payoff equivalence result of Lemma A.2 under the concept of sequential equilibrium. Lemma C.20 asserts that any one player can switch to a USI-based strategy without affecting the sequential equilibrium (under perfect recall) and its payoffs. The proof of Lemma C.16 is based on two technical results provided by Lemmas C.12 and C.14. The proof of Lemma C.20 is based on Lemmas C.16 and C.18, the latter of which states that the history-action value function of a player ii\in\mathcal{I} can be expressed in terms of their USI.

Lemma C.12.

Suppose that KiK^{i} is unilaterally sufficient information. Then

g(hti|htj)\displaystyle\mathbb{P}^{g}(h_{t}^{i}|h_{t}^{j}) =g(hti|kti)g(kti|htj),\displaystyle=\mathbb{P}^{g}(h_{t}^{i}|k_{t}^{i})\mathbb{P}^{g}(k_{t}^{i}|h_{t}^{j}), (127)

whenever g(kti)>0,g(htj)>0\mathbb{P}^{g}(k_{t}^{i})>0,\mathbb{P}^{g}(h_{t}^{j})>0.

Proof C.13.

From the definition of unilaterally sufficient information (Definition 1) we have

g(h~ti,h~tj|kti)\displaystyle\mathbb{P}^{g}(\tilde{h}_{t}^{i},\tilde{h}_{t}^{j}|k_{t}^{i}) =Fti,gi(h~ti|kti)Fti,j,gi(h~tj|kti),\displaystyle=F_{t}^{i,g^{i}}(\tilde{h}_{t}^{i}|k_{t}^{i})F_{t}^{i,j,g^{-i}}(\tilde{h}_{t}^{j}|k_{t}^{i}), (128)

where

Fti,j,gi(htj|kti):=x~t,h~t{i,j}Φti,gi(x~t,(htj,h~t{i,j})|kti).\displaystyle F_{t}^{i,j,g^{-i}}(h_{t}^{j}|k_{t}^{i}):=\sum_{\tilde{x}_{t},\tilde{h}_{t}^{-\{i,j\}}}\Phi_{t}^{i,g^{-i}}(\tilde{x}_{t},(h_{t}^{j},\tilde{h}_{t}^{-\{i,j\}})|k_{t}^{i}). (129)

Therefore, we conclude that HtiH_{t}^{i} and HtjH_{t}^{j} are conditionally independent given KtiK_{t}^{i}. Since KtiK_{t}^{i} is a function of HtiH_{t}^{i}, we have

g(hti|htj)\displaystyle\mathbb{P}^{g}(h_{t}^{i}|h_{t}^{j}) =g(hti,kti|htj)=g(hti|kti)g(kti|htj).\displaystyle=\mathbb{P}^{g}(h_{t}^{i},k_{t}^{i}|h_{t}^{j})=\mathbb{P}^{g}(h_{t}^{i}|k_{t}^{i})\mathbb{P}^{g}(k_{t}^{i}|h_{t}^{j}). (130)
Lemma C.14.

Suppose that KiK^{i} is unilaterally sufficient information for player ii\in\mathcal{I}. Then there exist functions (Πtj,i,g{i,j})j\{i},t𝒯(\Pi_{t}^{j,i,g^{-\{i,j\}}})_{j\in\mathcal{I}\backslash\{i\},t\in\mathcal{T}}, (rti,j,g{i,j})j\{i},t𝒯(r_{t}^{i,j,g^{-\{i,j\}}})_{j\in\mathcal{I}\backslash\{i\},t\in\mathcal{T}}, where Πtj,i,g{i,j}:𝒦ti×tj×𝒰ti×𝒰tjΔ(t+1j)\Pi_{t}^{j,i,g^{-\{i,j\}}}:\mathcal{K}_{t}^{i}\times\mathcal{H}_{t}^{j}\times\mathcal{U}_{t}^{i}\times\mathcal{U}_{t}^{j}\mapsto\Delta(\mathcal{H}_{t+1}^{j}), rti,j,g{i,j}:𝒦ti×tj×𝒰ti×𝒰tj[1,1]r_{t}^{i,j,g^{-\{i,j\}}}:\mathcal{K}_{t}^{i}\times\mathcal{H}_{t}^{j}\times\mathcal{U}_{t}^{i}\times\mathcal{U}_{t}^{j}\mapsto[-1,1] such that

  1. (1)

    g(h~t+1j|hti,htj,uti,utj)=Πtj,i,g{i,j}(h~t+1j|kti,htj,uti,utj)\mathbb{P}^{g}(\tilde{h}_{t+1}^{j}|h_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j})=\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|k_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}) for all t𝒯\{T}t\in\mathcal{T}\backslash\{T\}; and

  2. (2)

    𝔼g[Rtj|hti,htj,uti,utj]=rti,j,g{i,j}(kti,htj,uti,utj)\mathbb{E}^{g}[R_{t}^{j}|h_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}]=r_{t}^{i,j,g^{-\{i,j\}}}(k_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}) for all t𝒯t\in\mathcal{T},

for all j\{i}j\in\mathcal{I}\backslash\{i\} and all behavioral strategy profiles gg whenever the left-hand side expressions are well-defined.

Proof C.15 (Proof of Lemma C.14).

Let g^l\hat{g}^{l} be some fixed, fully mixed behavioral strategy for player ll\in\mathcal{I}.

Fix jij\neq i. First,

g(xt,ht{i,j}|hti,htj)\displaystyle\mathbb{P}^{g}(x_{t},h_{t}^{-\{i,j\}}|h_{t}^{i},h_{t}^{j}) =g^{i,j},g{i,j}(xt,ht{i,j}|hti,htj)\displaystyle=\mathbb{P}^{\hat{g}^{\{i,j\}},g^{-\{i,j\}}}(x_{t},h_{t}^{-\{i,j\}}|h_{t}^{i},h_{t}^{j}) (131)
=Φti,(g^j,g{i,j})(xt,hti|kti)x~t,h~t{i,j}Φti,(g^j,g{i,j})(x~t,(h~t{i,j},htj)|kti)\displaystyle=\dfrac{\Phi_{t}^{i,(\hat{g}^{j},g^{-\{i,j\}})}(x_{t},h_{t}^{-i}|k_{t}^{i})}{\sum_{\tilde{x}_{t},\tilde{h}_{t}^{-\{i,j\}}}\Phi_{t}^{i,(\hat{g}^{j},g^{-\{i,j\}})}(\tilde{x}_{t},(\tilde{h}_{t}^{-\{i,j\}},h_{t}^{j})|k_{t}^{i})} (132)
=:Φti,j,g{i,j}(xt,ht{i,j}|kti,htj),\displaystyle=:\Phi_{t}^{i,j,g^{-\{i,j\}}}(x_{t},h_{t}^{-\{i,j\}}|k_{t}^{i},h_{t}^{j}), (133)

for any behavioral strategy profile gg, where in (131) we used the fact that since (hti,htj)(h_{t}^{i},h_{t}^{j}) are included in the conditioning, the conditional probability is independent of the strategies of players ii and jj [23, Section 6.5]. In (132) we used Bayes rule and the definition of USI (Definition 1).

Therefore, using the Law of Total Probability,

g(x~t,u~t{i,j}|hti,htj)\displaystyle\mathbb{P}^{g}(\tilde{x}_{t},\tilde{u}_{t}^{-\{i,j\}}|h_{t}^{i},h_{t}^{j}) =h~t{i,j}g(u~t{i,j}|x~t,h~t{i,j},hti,htj)g(x~t,h~t{i,j}|hti,htj)\displaystyle=\sum_{\tilde{h}_{t}^{-\{i,j\}}}\mathbb{P}^{g}(\tilde{u}_{t}^{-\{i,j\}}|\tilde{x}_{t},\tilde{h}_{t}^{-\{i,j\}},h_{t}^{i},h_{t}^{j})\mathbb{P}^{g}(\tilde{x}_{t},\tilde{h}_{t}^{-\{i,j\}}|h_{t}^{i},h_{t}^{j}) (134)
=h~t{i,j}(l\{i,j}gtl(u~tl|h~tl))Φti,j,g{i,j}(x~t,h~t{i,j}|kti,htj)\displaystyle=\sum_{\tilde{h}_{t}^{-\{i,j\}}}\left(\prod_{l\in\mathcal{I}\backslash\{i,j\}}g_{t}^{l}(\tilde{u}_{t}^{l}|\tilde{h}_{t}^{l})\right)\Phi_{t}^{i,j,g^{-\{i,j\}}}(\tilde{x}_{t},\tilde{h}_{t}^{-\{i,j\}}|k_{t}^{i},h_{t}^{j}) (135)
=:P~ti,j,g{i,j}(x~t,u~t{i,j}|kti,htj),\displaystyle=:\tilde{P}_{t}^{i,j,g^{-\{i,j\}}}(\tilde{x}_{t},\tilde{u}_{t}^{-\{i,j\}}|k_{t}^{i},h_{t}^{j}), (136)

for any behavioral strategy profile gg.

We know that Ht+1j=ξtj(Xt,Ut,Htj)H_{t+1}^{j}=\xi_{t}^{j}(X_{t},U_{t},H_{t}^{j}) for some function ξtj\xi_{t}^{j} independent of the strategy profile gg, hence using the Law of Total Probability we have

g(h~t+1j|hti,htj,uti,utj)\displaystyle\enspace\enspace\>\mathbb{P}^{g}(\tilde{h}_{t+1}^{j}|h_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}) (137)
=x~t,u~t{i,j}𝟏{h~t+1j=ξtj(x~t,(ut{i,j},u~t{i,j}),htj)}P~ti,j,g{i,j}(x~t,u~t{i,j}|kti,htj)\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-\{i,j\}}}\bm{1}_{\{\tilde{h}_{t+1}^{j}=\xi_{t}^{j}(\tilde{x}_{t},(u_{t}^{\{i,j\}},\tilde{u}_{t}^{-\{i,j\}}),h_{t}^{j})\}}\tilde{P}_{t}^{i,j,g^{-\{i,j\}}}(\tilde{x}_{t},\tilde{u}_{t}^{-\{i,j\}}|k_{t}^{i},h_{t}^{j}) (138)
=:Πtj,i,g{i,j}(h~t+1j|kti,htj,uti,utj),\displaystyle=:\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|k_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}), (139)

establishing part (1) of Lemma C.14.

Since 𝔼[Rtj|xt,ut]\mathbb{E}[R_{t}^{j}|x_{t},u_{t}] is strategy-independent, for j\{i}j\in\mathcal{I}\backslash\{i\}, using the Law of Total Expectation we have

𝔼g[Rtj|hti,htj,uti,utj]\displaystyle\mathbb{E}^{g}[R_{t}^{j}|h_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}] =x~t,u~t{i,j}𝔼[Rtj|x~t,(ut{i,j},u~t{i,j})]P~ti,j,g{i,j}(x~t,u~t{i,j}|kti,htj)\displaystyle=\sum_{\tilde{x}_{t},\tilde{u}_{t}^{-\{i,j\}}}\mathbb{E}[R_{t}^{j}|\tilde{x}_{t},(u_{t}^{\{i,j\}},\tilde{u}_{t}^{-\{i,j\}})]\tilde{P}_{t}^{i,j,g^{-\{i,j\}}}(\tilde{x}_{t},\tilde{u}_{t}^{-\{i,j\}}|k_{t}^{i},h_{t}^{j}) (140)
=:rti,j,g{i,j}(kti,htj,uti,utj),\displaystyle=:r_{t}^{i,j,g^{-\{i,j\}}}(k_{t}^{i},h_{t}^{j},u_{t}^{i},u_{t}^{j}), (141)

establishing part (2) of Lemma C.14.

Lemma C.16.

Suppose that KiK^{i} is unilaterally sufficient information. Let g=(gj)jg=(g^{j})_{j\in\mathcal{I}} be a fully mixed behavioral strategy profile. Let a KiK^{i}-based strategy ρi\rho^{i} be such that

ρti(uti|kti)=h~tigti(uti|h~ti)Fti,gi(h~ti|kti).\rho_{t}^{i}(u_{t}^{i}|k_{t}^{i})=\sum_{\tilde{h}_{t}^{i}}g_{t}^{i}(u_{t}^{i}|\tilde{h}_{t}^{i})F_{t}^{i,g^{i}}(\tilde{h}_{t}^{i}|k_{t}^{i}). (142)

Then

  1. (1)

    g(h~t+1j|htj,utj)=ρi,gi(h~t+1j|htj,utj)\mathbb{P}^{g}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j})=\mathbb{P}^{\rho^{i},g^{-i}}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}) for all t𝒯\{T}t\in\mathcal{T}\backslash\{T\}; and

  2. (2)

    𝔼g[Rtj|htj,utj]=𝔼ρi,gi[Rtj|htj,utj]\mathbb{E}^{g}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}]=\mathbb{E}^{\rho^{i},g^{-i}}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}] for all t𝒯t\in\mathcal{T},

for all j\{i}j\in\mathcal{I}\backslash\{i\} and all htjtj,utj𝒰tjh_{t}^{j}\in\mathcal{H}_{t}^{j},u_{t}^{j}\in\mathcal{U}_{t}^{j}.

Proof C.17.

Fixing gig^{-i}, HtiH_{t}^{i} is a controlled Markov Chain controlled by UtiU_{t}^{i} and player ii faces a Markov Decision Problem. By Lemma C.4, KtiK_{t}^{i} is an information state (as defined in Definition 9) of this MDP. Therefore, by the Policy Equivalence Lemma (Lemma A.2) we have

gi,gi(kti)=ρi,gi(kti).\mathbb{P}^{g^{i},g^{-i}}(k_{t}^{i})=\mathbb{P}^{\rho^{i},g^{-i}}(k_{t}^{i}). (143)

Furthermore, from the definition of USI we have

gi,gi(htj|kti)\displaystyle\mathbb{P}^{g^{i},g^{-i}}(h_{t}^{j}|k_{t}^{i}) =x~t,h~t{i,j}Φti,gi(x~t,(htj,h~t{i,j})|kti)\displaystyle=\sum_{\tilde{x}_{t},\tilde{h}_{t}^{-\{i,j\}}}\Phi_{t}^{i,g^{-i}}(\tilde{x}_{t},(h_{t}^{j},\tilde{h}_{t}^{-\{i,j\}})|k_{t}^{i}) (144)
=:Fti,j,gi(htj|kti).\displaystyle=:F_{t}^{i,j,g^{-i}}(h_{t}^{j}|k_{t}^{i}). (145)

Using Bayes Rule, we then have

gi,gi(kti|htj)\displaystyle\mathbb{P}^{g^{i},g^{-i}}(k_{t}^{i}|h_{t}^{j}) =gi,gi(htj|kti)gi,gi(kti)k~tigi,gi(htj|k~ti)gi,gi(k~ti)\displaystyle=\dfrac{\mathbb{P}^{g^{i},g^{-i}}(h_{t}^{j}|k_{t}^{i})\mathbb{P}^{g^{i},g^{-i}}(k_{t}^{i})}{\sum_{\tilde{k}_{t}^{i}}\mathbb{P}^{g^{i},g^{-i}}(h_{t}^{j}|\tilde{k}_{t}^{i})\mathbb{P}^{g^{i},g^{-i}}(\tilde{k}_{t}^{i})} (146)
=Fti,j,gi(htj|kti)gi,gi(kti)k~tiFti,j,gi(htj|k~ti)gi,gi(k~ti).\displaystyle=\dfrac{F_{t}^{i,j,g^{-i}}(h_{t}^{j}|k_{t}^{i})\mathbb{P}^{g^{i},g^{-i}}(k_{t}^{i})}{\sum_{\tilde{k}_{t}^{i}}F_{t}^{i,j,g^{-i}}(h_{t}^{j}|\tilde{k}_{t}^{i})\mathbb{P}^{g^{i},g^{-i}}(\tilde{k}_{t}^{i})}. (147)

Note that (147) applies for all strategies gig^{i}. Replacing gig^{i} with ρi\rho^{i} we have

ρi,gi(kti|htj)\displaystyle\mathbb{P}^{\rho^{i},g^{-i}}(k_{t}^{i}|h_{t}^{j}) =Fti,j,gi(htj|kti)ρi,gi(kti)k~tiFti,j,gi(htj|k~ti)ρi,gi(k~ti).\displaystyle=\dfrac{F_{t}^{i,j,g^{-i}}(h_{t}^{j}|k_{t}^{i})\mathbb{P}^{\rho^{i},g^{-i}}(k_{t}^{i})}{\sum_{\tilde{k}_{t}^{i}}F_{t}^{i,j,g^{-i}}(h_{t}^{j}|\tilde{k}_{t}^{i})\mathbb{P}^{\rho^{i},g^{-i}}(\tilde{k}_{t}^{i})}. (148)

Combining (143), (147), and (148) we conclude that

gi,gi(kti|htj)=ρi,gi(kti|htj).\displaystyle\mathbb{P}^{g^{i},g^{-i}}(k_{t}^{i}|h_{t}^{j})=\mathbb{P}^{\rho^{i},g^{-i}}(k_{t}^{i}|h_{t}^{j}). (149)

Using (142), Lemma C.12, and Lemma C.14 we have

g(h~t+1j|htj,utj)\displaystyle\quad~{}\mathbb{P}^{g}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}) (150)
=h~ti:g(h~ti,htj)>0u~tig(h~t+1j|h~ti,htj,u~ti,utj)g(u~ti|h~ti,htj,utj)g(h~ti|htj,utj)\displaystyle=\sum_{\tilde{h}_{t}^{i}:\mathbb{P}^{g}(\tilde{h}_{t}^{i},h_{t}^{j})>0}\sum_{\tilde{u}_{t}^{i}}\mathbb{P}^{g}(\tilde{h}_{t+1}^{j}|\tilde{h}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})\mathbb{P}^{g}(\tilde{u}_{t}^{i}|\tilde{h}_{t}^{i},h_{t}^{j},u_{t}^{j})\mathbb{P}^{g}(\tilde{h}_{t}^{i}|h_{t}^{j},u_{t}^{j}) (151)
=h~ti,u~tiΠtj,i,g{i,j}(h~t+1j|k~ti,htj,u~ti,utj)gti(u~ti|h~ti)g(h~ti|htj)\displaystyle=\sum_{\tilde{h}_{t}^{i},\tilde{u}_{t}^{i}}\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})g_{t}^{i}(\tilde{u}_{t}^{i}|\tilde{h}_{t}^{i})\mathbb{P}^{g}(\tilde{h}_{t}^{i}|h_{t}^{j}) (152)
=h~ti,u~tiΠtj,i,g{i,j}(h~t+1j|k~ti,htj,u~ti,utj)gti(u~ti|h~ti)g(h~ti|k~ti)g(k~ti|htj)\displaystyle=\sum_{\tilde{h}_{t}^{i},\tilde{u}_{t}^{i}}\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})g_{t}^{i}(\tilde{u}_{t}^{i}|\tilde{h}_{t}^{i})\mathbb{P}^{g}(\tilde{h}_{t}^{i}|\tilde{k}_{t}^{i})\mathbb{P}^{g}(\tilde{k}_{t}^{i}|h_{t}^{j}) (153)
=k~ti,u~tiΠtj,i,g{i,j}(h~t+1j|k~ti,htj,u~ti,utj)(h^tigti(u~ti|h^ti)g(h^ti|k~ti))g(k~ti|htj)\displaystyle=\sum_{\tilde{k}_{t}^{i},\tilde{u}_{t}^{i}}\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})\left(\sum_{\hat{h}_{t}^{i}}g_{t}^{i}(\tilde{u}_{t}^{i}|\hat{h}_{t}^{i})\mathbb{P}^{g}(\hat{h}_{t}^{i}|\tilde{k}_{t}^{i})\right)\mathbb{P}^{g}(\tilde{k}_{t}^{i}|h_{t}^{j}) (154)
=k~ti,u~tiΠtj,i,g{i,j}(h~t+1j|k~ti,htj,u~ti,utj)ρti(u~ti|k~ti)g(k~ti|htj),\displaystyle=\sum_{\tilde{k}_{t}^{i},\tilde{u}_{t}^{i}}\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})\rho_{t}^{i}(\tilde{u}_{t}^{i}|\tilde{k}_{t}^{i})\mathbb{P}^{g}(\tilde{k}_{t}^{i}|h_{t}^{j}), (155)

where in (152) we utilized Lemma C.14 and the function Πtj,i,g{i,j}\Pi_{t}^{j,i,g^{-\{i,j\}}} defined in it. In (153) we applied Lemma C.12. In the last equation we used (142) and the definition of USI.

Following a similar argument, we can show that

ρi,gi(h~t+1j|htj,utj)\displaystyle\quad~{}\mathbb{P}^{\rho^{i},g^{-i}}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}) (156)
=k~ti,u~tiΠtj,i,g{i,j}(h~t+1j|k~ti,htj,u~ti,utj)ρti(u~ti|k~ti)ρi,gi(k~ti|htj).\displaystyle=\sum_{\tilde{k}_{t}^{i},\tilde{u}_{t}^{i}}\Pi_{t}^{j,i,g^{-\{i,j\}}}(\tilde{h}_{t+1}^{j}|\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})\rho_{t}^{i}(\tilde{u}_{t}^{i}|\tilde{k}_{t}^{i})\mathbb{P}^{\rho^{i},g^{-i}}(\tilde{k}_{t}^{i}|h_{t}^{j}). (157)

Using (149) and comparing (155) with (157), we conclude that

g(h~t+1j|htj,utj)=ρi,gi(h~t+1j|htj,utj),\displaystyle\mathbb{P}^{g}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j})=\mathbb{P}^{\rho^{i},g^{-i}}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}), (158)

proving statement (1) of the Lemma.

Following an analogous argument, we can show that

𝔼g[Rtj|htj,utj]\displaystyle\mathbb{E}^{g}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}] =k~ti,u~tirti,j,g{i,j}(k~ti,htj,u~ti,utj)ρti(u~ti|k~ti)g(k~ti|htj)\displaystyle=\sum_{\tilde{k}_{t}^{i},\tilde{u}_{t}^{i}}r_{t}^{i,j,g^{-\{i,j\}}}(\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})\rho_{t}^{i}(\tilde{u}_{t}^{i}|\tilde{k}_{t}^{i})\mathbb{P}^{g}(\tilde{k}_{t}^{i}|h_{t}^{j}) (159)
𝔼ρi,gi[Rtj|htj,utj]\displaystyle\mathbb{E}^{\rho^{i},g^{-i}}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}] =k~ti,u~tirti,j,g{i,j}(k~ti,htj,u~ti,utj)ρti(u~ti|k~ti)ρi,gi(k~ti|htj),\displaystyle=\sum_{\tilde{k}_{t}^{i},\tilde{u}_{t}^{i}}r_{t}^{i,j,g^{-\{i,j\}}}(\tilde{k}_{t}^{i},h_{t}^{j},\tilde{u}_{t}^{i},u_{t}^{j})\rho_{t}^{i}(\tilde{u}_{t}^{i}|\tilde{k}_{t}^{i})\mathbb{P}^{\rho^{i},g^{-i}}(\tilde{k}_{t}^{i}|h_{t}^{j}), (160)

where rti,j,g{i,j}r_{t}^{i,j,g^{-\{i,j\}}} is defined in Lemma C.14. We similarly conclude that

𝔼g[Rtj|htj,utj]\displaystyle\mathbb{E}^{g}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}] =𝔼ρi,gi[Rtj|htj,utj],\displaystyle=\mathbb{E}^{\rho^{i},g^{-i}}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}], (161)

proving statement (2) of the Lemma.

Lemma C.18.

Suppose that KiK^{i} is unilaterally sufficient information for player ii. Let gig^{-i} be a fully mixed behavioral strategy profile for players other than ii. Define QτiQ_{\tau}^{i} through

Qτi(hτi,uτi)=𝔼gi[Rτi|hτi,uτi]+maxg~τ+1:Ti𝔼g~τ+1:Ti,gi[t=τ+1TRti|hτi,uτi].Q_{\tau}^{i}(h_{\tau}^{i},u_{\tau}^{i})=\mathbb{E}^{g^{-i}}[R_{\tau}^{i}|h_{\tau}^{i},u_{\tau}^{i}]+\underset{\tilde{g}_{\tau+1:T}^{i}}{\max}~{}\mathbb{E}^{\tilde{g}_{\tau+1:T}^{i},g^{-i}}\left[\sum_{t=\tau+1}^{T}R_{t}^{i}\Big{|}h_{\tau}^{i},u_{\tau}^{i}\right]. (162)

Then there exists a function Q^τi:𝒦τi×𝒰τi[T,T]\hat{Q}_{\tau}^{i}:\mathcal{K}_{\tau}^{i}\times\mathcal{U}_{\tau}^{i}\mapsto[-T,T] such that

Qτi(hτi,uτi)=Q^τi(kτi,uτi).Q_{\tau}^{i}(h_{\tau}^{i},u_{\tau}^{i})=\hat{Q}_{\tau}^{i}(k_{\tau}^{i},u_{\tau}^{i}). (163)
Proof C.19.

By Lemma C.4, KiK^{i} is an information state for the payoff of player ii under gig^{-i}. Fixing gig^{-i}, HtiH_{t}^{i} is a controlled Markov Chain controlled by UtiU_{t}^{i}. By Definition 9, KtiK_{t}^{i} is an information state of this controlled Markov Chain. The Lemma then follows from a direct application of Lemma A.1.

Lemma C.20.

Suppose that KiK^{i} is unilaterally sufficient information for player ii. Let gg be (the strategy part of) a sequential equilibrium. Then there exists a KiK^{i}-based strategy ρi\rho^{i} such that (ρi,gi)(\rho^{i},g^{-i}) is (the strategy part of) a sequential equilibrium with the same expected payoff profile as gg.

Proof C.21 (Proof of Lemma C.20).

Recall that in Theorem 15 we established the equivalence of a variety of definitions of Sequential Equilibrium for strategy profiles. Let (g,Q)(g,Q) be a sequential equilibrium under Definition 13. Let (g(n),Q(n))(g^{(n)},Q^{(n)}) be a sequence of strategy and conjecture profiles that satisfies conditions (1), (2’), and (3) of Definition 13.

Set ρ(n),i\rho^{(n),i} through

ρt(n),i(uti|kti)=h~tigt(n),i(uti|h~ti)Fti,g(n),i(h~ti|kti),\rho_{t}^{(n),i}(u_{t}^{i}|k_{t}^{i})=\sum_{\tilde{h}_{t}^{i}}g_{t}^{(n),i}(u_{t}^{i}|\tilde{h}_{t}^{i})F_{t}^{i,g^{(n),i}}(\tilde{h}_{t}^{i}|k_{t}^{i}), (164)

where Fti,g(n),iF_{t}^{i,g^{(n),i}} is defined in Definition 1. By replacing the sequence with one of its sub-sequences, without loss of generality, assume that ρ(n),iρi\rho^{(n),i}\rightarrow\rho^{i} for some ρi\rho^{i}.

For the ease of notation, denote g¯(n)=(ρ(n),i,g(n),i)\bar{g}^{(n)}=(\rho^{(n),i},g^{(n),-i}) and g¯=(ρi,gi)\bar{g}=(\rho^{i},g^{-i}). We have g¯(n)g¯\bar{g}^{(n)}\rightarrow\bar{g}. In the rest of the proof, we will show that (g¯,Q)(\bar{g},Q) is a sequential equilibrium.

We only need to show that g¯\bar{g} is sequentially rational with respect to QQ and that (g¯(n),Q(n))(\bar{g}^{(n)},Q^{(n)}) satisfies condition (2’) of Definition 13, as conditions (1) and (3) of Definition 13 are true by construction. Since g¯i=gi\bar{g}^{-i}=g^{-i}, g¯j\bar{g}^{j} is automatically sequentially rational given QjQ^{j} for all j\{i}j\in\mathcal{I}\backslash\{i\}, and Q(n),iQ^{(n),i} is consistent with g¯(n),i\bar{g}^{(n),-i} for each nn. It suffices to establish

  1. (i)

    ρi\rho^{i} is sequentially rational with respect to QiQ^{i}; and

  2. (ii)

    Q(n),jQ^{(n),j} is consistent with g¯(n),j\bar{g}^{(n),-j} for each j\{i}j\in\mathcal{I}\backslash\{i\}.

To establish (i), we will use Lemma C.18 to show that Qti(hti,uti)Q_{t}^{i}(h_{t}^{i},u_{t}^{i}) is a function of (kti,uti)(k_{t}^{i},u_{t}^{i}), and hence one can use a ktik_{t}^{i}-based strategy to optimize QtiQ_{t}^{i}.

Proof of (i): By construction,

ρt(n),i(kti)=h~ti:k~ti=ktigt(n),i(h~ti)ηt(n)(h~ti|kti),\rho_{t}^{(n),i}(k_{t}^{i})=\sum_{\tilde{h}_{t}^{i}:\tilde{k}_{t}^{i}=k_{t}^{i}}g_{t}^{(n),i}(\tilde{h}_{t}^{i})\cdot\eta_{t}^{(n)}(\tilde{h}_{t}^{i}|k_{t}^{i}), (165)

for some distribution ηt(n)(kti)Δ(ti)\eta_{t}^{(n)}(k_{t}^{i})\in\Delta(\mathcal{H}_{t}^{i}). Let ηt(kti)\eta_{t}(k_{t}^{i}) be an accumulation point of the sequence [ηt(n)(kti)]n=1[\eta_{t}^{(n)}(k_{t}^{i})]_{n=1}^{\infty}. We have

ρti(kti)=h~ti:k~ti=ktigti(h~ti)ηt(h~ti|kti).\rho_{t}^{i}(k_{t}^{i})=\sum_{\tilde{h}_{t}^{i}:\tilde{k}_{t}^{i}=k_{t}^{i}}g_{t}^{i}(\tilde{h}_{t}^{i})\cdot\eta_{t}(\tilde{h}_{t}^{i}|k_{t}^{i}). (166)

As a result, we have

supp(ρti(kti))h~ti:k~ti=ktisupp(gti(h~ti)).\mathrm{supp}(\rho_{t}^{i}(k_{t}^{i}))\subseteq\bigcup_{\tilde{h}_{t}^{i}:\tilde{k}_{t}^{i}=k_{t}^{i}}\mathrm{supp}(g_{t}^{i}(\tilde{h}_{t}^{i})). (167)

By Lemma C.18 we have Qt(n),i(hti,uti)=Q^t(n),i(kti,uti)Q_{t}^{(n),i}(h_{t}^{i},u_{t}^{i})=\hat{Q}_{t}^{(n),i}(k_{t}^{i},u_{t}^{i}) for some function Q^t(n),i\hat{Q}_{t}^{(n),i}. Since Q(n),iQiQ^{(n),i}\rightarrow Q^{i}, we have Qti(hti,uti)=Q^ti(kti,uti)Q_{t}^{i}(h_{t}^{i},u_{t}^{i})=\hat{Q}_{t}^{i}(k_{t}^{i},u_{t}^{i}) for some function Q^i\hat{Q}^{i}. By sequential rationality we have

supp(gti(h~ti))argmaxutiQ^ti(kti,uti),\mathrm{supp}(g_{t}^{i}(\tilde{h}_{t}^{i}))\subseteq\underset{u_{t}^{i}}{\arg\max}~{}\hat{Q}_{t}^{i}(k_{t}^{i},u_{t}^{i}), (168)

for all h~ti\tilde{h}_{t}^{i} whose corresponding compression k~ti\tilde{k}_{t}^{i} satisfies k~ti=kti\tilde{k}_{t}^{i}=k_{t}^{i}. Therefore, by (167) and (168) we conclude that

supp(ρti(kti))argmaxutiQ^ti(kti,uti),\mathrm{supp}(\rho_{t}^{i}(k_{t}^{i}))\subseteq\underset{u_{t}^{i}}{\arg\max}~{}\hat{Q}_{t}^{i}(k_{t}^{i},u_{t}^{i}), (169)

establishing sequential rationality of ρi\rho^{i} with respect to QiQ^{i}.

To establish (ii), we will use Lemmas C.6 and C.16 to show that when player ii switches their strategy from g(n),ig^{(n),i} to ρ(n),i\rho^{(n),i}, other players face the same control problem at every information set. As a result, their Q(n),jQ^{(n),j} functions stay the same.

Proof of (ii): Consider player jij\neq i. Through standard control theory, we know that a collection of functions Q~j\tilde{Q}^{j} is consistent (in the sense of condition (2’) of Definition 13) with a fully mixed strategy profile g~j\tilde{g}^{-j} if and only if it satisfies the following equations:

Q~Tj(hTj,uTj)\displaystyle\tilde{Q}_{T}^{j}(h_{T}^{j},u_{T}^{j}) =𝔼g~j[RTj|hTj,uTj],\displaystyle=\mathbb{E}^{\tilde{g}^{-j}}[R_{T}^{j}|h_{T}^{j},u_{T}^{j}], (170a)
V~tj(htj)\displaystyle\tilde{V}_{t}^{j}(h_{t}^{j}) =maxu~tjQ~tj(htj,u~tj),t𝒯,\displaystyle=\max_{\tilde{u}_{t}^{j}}\tilde{Q}_{t}^{j}(h_{t}^{j},\tilde{u}_{t}^{j}),\qquad\forall t\in\mathcal{T}, (170b)
Q~tj(htj,utj)\displaystyle\tilde{Q}_{t}^{j}(h_{t}^{j},u_{t}^{j}) =𝔼g~j[Rtj|htj,utj]+h~t+1jV~t+1j(h~t+1j)g~j(h~t+1j|htj,utj),t𝒯\{T}.\displaystyle=\mathbb{E}^{\tilde{g}^{-j}}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}]+\sum_{\tilde{h}_{t+1}^{j}}\tilde{V}_{t+1}^{j}(\tilde{h}_{t+1}^{j})\mathbb{P}^{\tilde{g}^{-j}}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}),\qquad\forall t\in\mathcal{T}\backslash\{T\}. (170c)

By Lemma C.16, we have

g(n),j(h~t+1j|htj,utj)\displaystyle\mathbb{P}^{g^{(n),-j}}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}) =ρ(n),i,g(n),{i,j}(h~t+1j|htj,utj),\displaystyle=\mathbb{P}^{\rho^{(n),i},g^{(n),-\{i,j\}}}(\tilde{h}_{t+1}^{j}|h_{t}^{j},u_{t}^{j}), (171)
𝔼g(n),j[Rtj|htj,utj]\displaystyle\mathbb{E}^{g^{(n),-j}}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}] =𝔼ρ(n),i,g(n),{i,j}[Rtj|htj,utj],\displaystyle=\mathbb{E}^{\rho^{(n),i},g^{(n),-\{i,j\}}}[R_{t}^{j}|h_{t}^{j},u_{t}^{j}], (172)

and hence we conclude that Q(n),jQ^{(n),j} is also consistent with g¯(n),j=(ρ(n),i,g(n),{i,j})\bar{g}^{(n),-j}=(\rho^{(n),i},g^{(n),-\{i,j\}}).

Now we have shown that (g¯,Q)(\bar{g},Q) forms a sequential equilibrium. The second half of the Lemma, which states that g¯\bar{g} yields the same expected payoff profile as gg, can be shown with the following argument: By Lemma C.6, g¯(n)\bar{g}^{(n)} yields the same expected payoff profile as g(n)g^{(n)}. Since the expected payoff of each player is a continuous function of the behavioral strategy profile, we conclude that g¯\bar{g} yields the same expected payoff profile as gg.

Finally, we conclude Theorem 5 from Lemma C.20.

Proof C.22 (Proof of Theorem 5).

Given any SE strategy profile gg, applying Lemma C.20 iteratively for each ii\in\mathcal{I}, we obtain a KK-based SE strategy profile ρ\rho with the same expected payoff profile as gg. Therefore the set of KK-based SE payoffs is the same as that of all SE.

Appendix D Proofs for Section 5 and Section 6

D.1 Proof of Proposition 6

Proposition 21 (Proposition 6, restated).

In the game defined in Example 5.1, the set of KK-based wPBE payoffs is a proper subset of that of all wPBE payoffs.

Proof D.1.

Set g1Bg_{1}^{B} to be the strategy of Bob where he always chooses U1B=+1U_{1}^{B}=+1, and g2A:𝒳1A×𝒰1BΔ(𝒰2A)g_{2}^{A}:\mathcal{X}_{1}^{A}\times\mathcal{U}_{1}^{B}\mapsto\Delta(\mathcal{U}_{2}^{A}) is given by

g2A(x1A,u1B)={0 w.p. 1,if u1B=+1;x1A w.p. 23,0 w.p. 13,otherwise,\displaystyle g_{2}^{A}(x_{1}^{A},u_{1}^{B})=\begin{cases}0\text{ w.p. }1,&\text{if }u_{1}^{B}=+1;\\ x_{1}^{A}\text{ w.p. }\frac{2}{3},~{}0\text{ w.p. }\frac{1}{3},&\text{otherwise},\end{cases}

and g2B:𝒳1B×𝒰1BΔ(𝒰2B)g_{2}^{B}:\mathcal{X}_{1}^{B}\times\mathcal{U}_{1}^{B}\mapsto\Delta(\mathcal{U}_{2}^{B}) is the strategy of Bob where he always chooses U2B=1U_{2}^{B}=-1 irrespective of U1BU_{1}^{B}.

The beliefs μ1B:𝒳1BΔ(𝒳1A)\mu_{1}^{B}:\mathcal{X}_{1}^{B}\mapsto\Delta(\mathcal{X}_{1}^{A}), μ2A:𝒳1A×𝒰1BΔ(𝒳1B)\mu_{2}^{A}:\mathcal{X}_{1}^{A}\times\mathcal{U}_{1}^{B}\mapsto\Delta(\mathcal{X}_{1}^{B}), and μ2B:𝒳1B×𝒰1BΔ(𝒳1A)\mu_{2}^{B}:\mathcal{X}_{1}^{B}\times\mathcal{U}_{1}^{B}\mapsto\Delta(\mathcal{X}_{1}^{A}) are given by

μ1B(x1B)\displaystyle\mu_{1}^{B}(x_{1}^{B}) =the prior of X1A,\displaystyle=\text{the prior of }X_{1}^{A},
μ2A(x1A,u1B)\displaystyle\mu_{2}^{A}(x_{1}^{A},u_{1}^{B}) ={1 w.p. 12,+1 w.p. 12,if u1B=+1;x1A w.p. 1,otherwise,\displaystyle=\begin{cases}-1\text{ w.p. }\frac{1}{2},\;+1\text{ w.p. }\frac{1}{2},&\text{if }u_{1}^{B}=+1;\\ x_{1}^{A}\text{ w.p. }1,&\text{otherwise},\end{cases}
μ2B(x1B,u1B)\displaystyle\mu_{2}^{B}(x_{1}^{B},u_{1}^{B}) =the prior of X1A.\displaystyle=\text{the prior of }X_{1}^{A}.

One can verify that gg is sequentially rational given μ\mu, and μ\mu is “preconsistent” [17] with gg, i.e. the beliefs can be updated with Bayes rule for consecutive information sets both on and off the equilibrium path. Hence, (g,μ)(g,\mu) is a wPBE. (It can also be shown that (g,μ)(g,\mu) satisfies Watson’s PBE definition [57]. However, (g,μ)(g,\mu) is not a PBE in the sense of Fudenberg and Tirole [11], since μ\mu violates their “no-signaling-what-you-don’t-know” condition.)

We proceed to show that no KK-based wPBE can attain the payoff profile of gg.

Suppose that ρ=(ρA,ρB)\rho=(\rho^{A},\rho^{B}) is a KK-based weak PBE strategy profile. First, observe that at t=2t=2, Alice can only choose her actions based on U1BU_{1}^{B} according to the definition of KAK^{A}-based strategies. Let α,βΔ({1,0,1})\alpha,\beta\in\Delta(\{-1,0,1\}) be Alice’s mixed actions at time t=2t=2 under U1B=1U_{1}^{B}=-1 and U1B=+1U_{1}^{B}=+1 respectively under strategy ρA\rho^{A}. With some abuse of notation, denote ρA=(α,β)\rho^{A}=(\alpha,\beta). There exists no belief system under which Alice is indifferent among all three of her actions at time t=2t=2. Therefore, no strictly mixed action at t=2t=2 would be sequentially rational, and sequential rationality of ρA\rho^{A} (with respect to some belief) implies that min{α(1),α(0),α(+1)}=min{β(1),β(0),β(+1)}=0\min\{\alpha(-1),\alpha(0),\alpha(+1)\}=\min\{\beta(-1),\beta(0),\beta(+1)\}=0.

To respond to ρA=(α,β)\rho^{A}=(\alpha,\beta), Bob can always achieve a stage-2 instantaneous reward of 0 by using a suitable response strategy. If Bob plays 1-1 at t=1t=1, his best total payoff is 0.20.2; if Bob plays +1+1 at t=1t=1, his best total payoff is 0. Hence Bob strictly prefers 1-1 to +1+1. Therefore, in any best response (in terms of total expected payoff) to Alice’s strategy ρA\rho^{A}, Bob plays U1B=1U_{1}^{B}=-1 irrespective of his private type. Consequently, Alice has an instantaneous payoff of 1-1 at t=1t=1 and a total payoff 0\leq 0 under ρ\rho, proving that the payoff profile of ρ\rho is different from that of gg.

D.2 Proof of Proposition 7

Proposition 22 (Proposition 7, restated).

In the model of Example 5.6, Kti=(Y1:t1,U1:t1,Xti)K_{t}^{i}=(Y_{1:t-1},U_{1:t-1},X_{t}^{i}) is unilaterally sufficient information.

We first prove Lemma D.2, which establishes the conditional independence of the state processes given the common information.

Lemma D.2.

In the model of Example 5.6, there exist functions (ξtgi)gi𝒢i,i,ξtgi:𝒴1:t1×𝒰1:t1Δ(𝒳1:ti)(\xi_{t}^{g^{i}})_{g^{i}\in\mathcal{G}^{i},i\in\mathcal{I}},\xi_{t}^{g^{i}}:\mathcal{Y}_{1:t-1}\times\mathcal{U}_{1:t-1}\mapsto\Delta(\mathcal{X}_{1:t}^{i}) such that

g(x1:t|y1:t1,u1:t1)=iξtgi(x1:ti|y1:t1,u1:t1),\mathbb{P}^{g}(x_{1:t}|y_{1:t-1},u_{1:t-1})=\prod_{i\in\mathcal{I}}\xi_{t}^{g^{i}}(x_{1:t}^{i}|y_{1:t-1},u_{1:t-1}), (173)

for all strategy profiles gg and all (y1:t1,u1:t1)(y_{1:t-1},u_{1:t-1}) admissible under gg.

Proof D.3 (Proof of Lemma D.2).

Denote Ht0=(𝐘1:t1,𝐔1:t1)H_{t}^{0}=(\mathbf{Y}_{1:t-1},\mathbf{U}_{1:t-1}). We prove the result by induction on time tt.

Induction Base: The result is true for t=1t=1 since H10=H_{1}^{0}=\varnothing and the random variables (X1i)i(X_{1}^{i})_{i\in\mathcal{I}} are assumed to be mutually independent.

Induction Step: Suppose that we have proved Lemma D.2 for time tt. We then prove the result for time t+1t+1.

We have

g(x1:t+1,yt,ut|ht0)\displaystyle\mathbb{P}^{g}(x_{1:t+1},y_{t},u_{t}|h_{t}^{0}) =g(xt+1,yt|x1:t,ut,ht0)g(ut|x1:t,ht0)g(x1:t|ht0)\displaystyle=\mathbb{P}^{g}(x_{t+1},y_{t}|x_{1:t},u_{t},h_{t}^{0})\mathbb{P}^{g}(u_{t}|x_{1:t},h_{t}^{0})\mathbb{P}^{g}(x_{1:t}|h_{t}^{0}) (174)
=i((xt+1i,yti|xti,ut)gti(uti|x1:ti,ht0)ξtgi(x1:ti|ht0))\displaystyle=\prod_{i\in\mathcal{I}}\left(\mathbb{P}(x_{t+1}^{i},y_{t}^{i}|x_{t}^{i},u_{t})g_{t}^{i}(u_{t}^{i}|x_{1:t}^{i},h_{t}^{0})\xi_{t}^{g^{i}}(x_{1:t}^{i}|h_{t}^{0})\right) (175)
=:iνtgi(x1:t+1i,yt,ut,ht0)=iνtgi(x1:t+1i,ht+10),\displaystyle=:\prod_{i\in\mathcal{I}}\nu_{t}^{g^{i}}(x_{1:t+1}^{i},y_{t},u_{t},h_{t}^{0})=\prod_{i\in\mathcal{I}}\nu_{t}^{g^{i}}(x_{1:t+1}^{i},h_{t+1}^{0}), (176)

where the induction hypothesis is utilized in (175).

Therefore, using Bayes rule,

g(x1:t+1|ht+10)\displaystyle\mathbb{P}^{g}(x_{1:t+1}|h_{t+1}^{0}) =g(x1:t+1,yt,ut|ht0)x~1:t+1g(x~1:t+1,yt,ut|ht0)\displaystyle=\dfrac{\mathbb{P}^{g}(x_{1:t+1},y_{t},u_{t}|h_{t}^{0})}{\sum_{\tilde{x}_{1:t+1}}\mathbb{P}^{g}(\tilde{x}_{1:t+1},y_{t},u_{t}|h_{t}^{0})} (177)
=iνtgi(x1:t+1i,ht+10)x~1:t+1iνtgi(x~1:t+1i,ht+10)\displaystyle=\dfrac{\prod_{i\in\mathcal{I}}\nu_{t}^{g^{i}}(x_{1:t+1}^{i},h_{t+1}^{0})}{\sum_{\tilde{x}_{1:t+1}}\prod_{i\in\mathcal{I}}\nu_{t}^{g^{i}}(\tilde{x}_{1:t+1}^{i},h_{t+1}^{0})} (178)
=iνtgi(x1:t+1i,ht+10)ix~1:t+1iνtgi(x~1:t+1i,ht+10)\displaystyle=\dfrac{\prod_{i\in\mathcal{I}}\nu_{t}^{g^{i}}(x_{1:t+1}^{i},h_{t+1}^{0})}{\prod_{i\in\mathcal{I}}\sum_{\tilde{x}_{1:t+1}^{i}}\nu_{t}^{g^{i}}(\tilde{x}_{1:t+1}^{i},h_{t+1}^{0})} (179)
=:iξt+1gi(x1:t+1i|ht+10),\displaystyle=:\prod_{i\in\mathcal{I}}\xi_{t+1}^{g^{i}}(x_{1:t+1}^{i}|h_{t+1}^{0}), (180)

where

ξt+1gi(x1:t+1i|ht+10):=νtgi(x1:t+1i,ht+10)x~1:t+1iνtgi(x~1:t+1i,ht+10),\xi_{t+1}^{g^{i}}(x_{1:t+1}^{i}|h_{t+1}^{0}):=\dfrac{\nu_{t}^{g^{i}}(x_{1:t+1}^{i},h_{t+1}^{0})}{\sum_{\tilde{x}_{1:t+1}^{i}}\nu_{t}^{g^{i}}(\tilde{x}_{1:t+1}^{i},h_{t+1}^{0})}, (181)

establishing the induction step.
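A minimal computational sketch of one induction step of Lemma D.2 is given below (Python; the function and argument names, and the kernel signature P(x_{t+1}^i, y_t^i | x_t^i, u_t) read off from (175), are illustrative assumptions about how the model of Example 5.6 might be encoded). It propagates player i's factor ξ_t^{g^i}(· | h_t^0) to ξ_{t+1}^{g^i}(· | h_{t+1}^0) via the unnormalized update (175)-(176) and the normalization (181), assuming the realized (y_t, u_t) has positive probability under g.

```python
def update_private_belief(xi_i, g_t_i, kernel_i, states_i, u_t, u_t_i, y_t_i, h_t_0):
    """One induction step of Lemma D.2 for a single player i (a sketch, not the paper's code).

    xi_i     : dict mapping a private trajectory x_{1:t}^i (a tuple) to xi_t^{g^i}(x_{1:t}^i | h_t^0)
    g_t_i    : callable (u_t_i, trajectory, h_t_0) -> g_t^i(u_t^i | x_{1:t}^i, h_t^0)
    kernel_i : callable (x_next, y_t_i, x_t_i, u_t) -> P(x_{t+1}^i, y_t^i | x_t^i, u_t), cf. (175)
    states_i : finite private state space of player i
    The remaining arguments are the realized joint action u_t, player i's own action u_t_i,
    player i's observation y_t_i, and the common history h_t_0; all names are assumptions.
    Returns xi_{t+1}^{g^i}( . | h_{t+1}^0) as a dict over trajectories x_{1:t+1}^i.
    """
    nu = {}
    for traj, p in xi_i.items():                      # traj plays the role of x_{1:t}^i
        weight = g_t_i(u_t_i, traj, h_t_0) * p        # g_t^i(u_t^i | x_{1:t}^i, h_t^0) * xi_t(...)
        for x_next in states_i:
            nu[traj + (x_next,)] = kernel_i(x_next, y_t_i, traj[-1], u_t) * weight   # as in (175)
    z = sum(nu.values())                              # normalizing constant, positive on admissible paths
    return {traj: v / z for traj, v in nu.items()}    # the normalization (181)
```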

Proof D.4 (Proof of Proposition 7).

Denote Ht0=(𝐘1:t1,𝐔1:t1)H_{t}^{0}=(\mathbf{Y}_{1:t-1},\mathbf{U}_{1:t-1}). Then Kti=(Ht0,Xti)K_{t}^{i}=(H_{t}^{0},X_{t}^{i}). Given Lemma D.2, we have

g(x1:t1i|kti)\displaystyle\mathbb{P}^{g}(x_{1:t-1}^{i}|k_{t}^{i}) =g(x1:ti|ht0)g(xti|ht0)=ξtgi(x1:ti|ht0)x~1:t1iξtgi((x~1:t1i,xti)|ht0)\displaystyle=\dfrac{\mathbb{P}^{g}(x_{1:t}^{i}|h_{t}^{0})}{\mathbb{P}^{g}(x_{t}^{i}|h_{t}^{0})}=\dfrac{\xi_{t}^{g^{i}}(x_{1:t}^{i}|h_{t}^{0})}{\sum_{\tilde{x}_{1:t-1}^{i}}\xi_{t}^{g^{i}}((\tilde{x}_{1:t-1}^{i},x_{t}^{i})|h_{t}^{0})} (182)
=:F~ti,gi(x1:t1i|kti).\displaystyle=:\tilde{F}_{t}^{i,g^{i}}(x_{1:t-1}^{i}|k_{t}^{i}). (183)

Since Hti=(Kti,X1:t1i)H_{t}^{i}=(K_{t}^{i},X_{1:t-1}^{i}), we conclude that

g(h~ti|kti)=Fti,gi(h~ti|kti),\mathbb{P}^{g}(\tilde{h}_{t}^{i}|k_{t}^{i})=F_{t}^{i,g^{i}}(\tilde{h}_{t}^{i}|k_{t}^{i}), (184)

for some function Fti,giF_{t}^{i,g^{i}}.

Given Lemma D.2, we have

g(x~1:ti|hti)\displaystyle\mathbb{P}^{g}(\tilde{x}_{1:t}^{-i}|h_{t}^{i}) =g(x~1:ti,x1:ti|ht0)g(x1:ti|ht0)=jiξtgj(x~1:tj|ht0).\displaystyle=\dfrac{\mathbb{P}^{g}(\tilde{x}_{1:t}^{-i},x_{1:t}^{i}|h_{t}^{0})}{\mathbb{P}^{g}(x_{1:t}^{i}|h_{t}^{0})}=\prod_{j\neq i}\xi_{t}^{g^{j}}(\tilde{x}_{1:t}^{j}|h_{t}^{0}). (185)

As a result, we have

g(x~1:ti,k~ti|hti)\displaystyle\mathbb{P}^{g}(\tilde{x}_{1:t}^{-i},\tilde{k}_{t}^{i}|h_{t}^{i}) =𝟏{k~ti=kti}jiξtgj(x~1:tj|ht0)\displaystyle=\bm{1}_{\{\tilde{k}_{t}^{i}=k_{t}^{i}\}}\prod_{j\neq i}\xi_{t}^{g^{j}}(\tilde{x}_{1:t}^{j}|h_{t}^{0}) (186)
=:Φ~ti,gi(x~1:ti|kti).\displaystyle=:\tilde{\Phi}_{t}^{i,g^{-i}}(\tilde{x}_{1:t}^{-i}|k_{t}^{i}). (187)

Since (𝐗t,Hti)(\mathbf{X}_{t},H_{t}^{-i}) is a fixed function of (𝐗1:ti,Kti)(\mathbf{X}_{1:t}^{-i},K_{t}^{i}), we conclude that

g(x~t,h~ti|hti)=Φti,gi(x~t,h~ti|kti),\mathbb{P}^{g}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|h_{t}^{i})=\Phi_{t}^{i,g^{-i}}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|k_{t}^{i}), (188)

for some function Φti,gi\Phi_{t}^{i,g^{-i}}.

Combining (184) and (188) while using the fact that KtiK_{t}^{i} is a function of HtiH_{t}^{i}, we obtain

g(x~t,h~t|kti)\displaystyle\mathbb{P}^{g}(\tilde{x}_{t},\tilde{h}_{t}|k_{t}^{i}) =Fti,gi(h~ti|kti)Φti,gi(x~t,h~ti|kti).\displaystyle=F_{t}^{i,g^{i}}(\tilde{h}_{t}^{i}|k_{t}^{i})\Phi_{t}^{i,g^{-i}}(\tilde{x}_{t},\tilde{h}_{t}^{-i}|k_{t}^{i}). (189)

We conclude that KiK^{i} is unilaterally sufficient information.

D.3 Proof of Proposition 8

Proposition 23 (Proposition 8, restated).

In the game of Example 6.1 belief-based equilibria do not exist.

Proof D.5.

We first characterize all the Bayes-Nash equilibria of Example 6.1 in behavioral strategy profiles. Then we will show that none of the BNE corresponds to a belief-based equilibrium.

Let α=(α1,α2)[0,1]2\alpha=(\alpha_{1},\alpha_{2})\in[0,1]^{2} describe Alice’s behavioral strategy: α1\alpha_{1} is the probability that Alice plays U1A=1U_{1}^{A}=-1 given X1A=1X_{1}^{A}=-1; α2\alpha_{2} is the probability that Alice plays U1A=+1U_{1}^{A}=+1 given X1A=+1X_{1}^{A}=+1. Let β=(β1,β2)[0,1]2\beta=(\beta_{1},\beta_{2})\in[0,1]^{2} denote Bob’s behavioral strategy: β1\beta_{1} is the probability that Bob plays U2B=UU_{2}^{B}=\mathrm{U} when observing U1A=1U_{1}^{A}=-1, β2\beta_{2} is the probability that Bob plays U2B=UU_{2}^{B}=\mathrm{U} when observing U1A=+1U_{1}^{A}=+1.

Claim:

α=(13,13),β=(13+c,13c),\alpha^{*}=\left(\frac{1}{3},\frac{1}{3}\right),\quad\beta^{*}=\left(\frac{1}{3}+c,\frac{1}{3}-c\right), (190)

is the unique BNE of Example 6.1.

Given the claim, one can conclude that a belief-based equilibrium does not exist in this game: Bob’s true belief b2b_{2} on X2X_{2} at the beginning of stage 2, given his information H2B=U1AH_{2}^{B}=U_{1}^{A}, would satisfy

b2(+1)\displaystyle b_{2}^{-}(+1) =α1α1+1α2, if α(0,1);\displaystyle=\dfrac{\alpha_{1}}{\alpha_{1}+1-\alpha_{2}},\quad\text{ if }\alpha\neq(0,1); (191)
b2+(+1)\displaystyle b_{2}^{+}(+1) =α2α2+1α1, if α(1,0),\displaystyle=\dfrac{\alpha_{2}}{\alpha_{2}+1-\alpha_{1}},\quad\text{ if }\alpha\neq(1,0), (192)

where b2b_{2}^{-} represents the belief under U1A=1U_{1}^{A}=-1 and b2+b_{2}^{+} represents the belief under U1A=+1U_{1}^{A}=+1. If Alice plays α=(13,13)\alpha^{*}=\left(\frac{1}{3},\frac{1}{3}\right), then b2=b2+b_{2}^{-}=b_{2}^{+}. Under a belief-based equilibrium concept (e.g. [36, 56]), Bob’s stage behavioral strategy β\beta should yield the same action distribution under the same belief, which means that β1=β2\beta_{1}=\beta_{2}. However we have β=(13+c,13c)\beta^{*}=\left(\frac{1}{3}+c,\frac{1}{3}-c\right). Therefore, (α,β)(\alpha^{*},\beta^{*}), the unique BNE of the game, is not a belief-based equilibrium. We conclude that a belief-based equilibrium does not exist in Example 6.1.
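As a quick numerical sanity check of (191)-(192) (a sketch only; the helper name beliefs is our own), one can verify that Alice's strategy α* indeed induces identical beliefs for Bob after either observation:

```python
def beliefs(alpha):
    """Bob's posteriors that X_2 = +1 after observing U_1^A = -1 and U_1^A = +1, from (191)-(192)."""
    a1, a2 = alpha
    b_minus = a1 / (a1 + 1 - a2)   # valid when alpha != (0, 1)
    b_plus = a2 / (a2 + 1 - a1)    # valid when alpha != (1, 0)
    return b_minus, b_plus

b_minus, b_plus = beliefs((1 / 3, 1 / 3))   # Alice's equilibrium strategy alpha*
assert b_minus == b_plus                     # same belief after either observation
```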

Proof of Claim: Denote Alice’s total expected payoff by J(α,β)J(\alpha,\beta). Then

J(α,β)\displaystyle\enspace\enspace\>J(\alpha,\beta)
=12c(1α1+α2)+12α12β1+12(1α1)(1β2)+12(1α2)(1β1)+12α22β2\displaystyle=\frac{1}{2}c(1-\alpha_{1}+\alpha_{2})+\frac{1}{2}\alpha_{1}\cdot 2\beta_{1}+\frac{1}{2}(1-\alpha_{1})(1-\beta_{2})+\frac{1}{2}(1-\alpha_{2})(1-\beta_{1})+\frac{1}{2}\alpha_{2}\cdot 2\beta_{2}
=12c(1α1+α2)+12(2α1α2)+12(2α1+α21)β1+12(2α2+α11)β2.\displaystyle=\frac{1}{2}c(1-\alpha_{1}+\alpha_{2})+\frac{1}{2}(2-\alpha_{1}-\alpha_{2})+\frac{1}{2}(2\alpha_{1}+\alpha_{2}-1)\beta_{1}+\frac{1}{2}(2\alpha_{2}+\alpha_{1}-1)\beta_{2}.

Define J(α)=minβJ(α,β)J^{*}(\alpha)=\min_{\beta}J(\alpha,\beta). Since the game is zero-sum, Alice plays α\alpha at some equilibrium if and only if α\alpha maximizes J(α)J^{*}(\alpha). We compute

J(α)\displaystyle J^{*}(\alpha) =12c(1α1+α2)+12(2α1α2)+\displaystyle=\frac{1}{2}c(1-\alpha_{1}+\alpha_{2})+\frac{1}{2}(2-\alpha_{1}-\alpha_{2})+
+12min{2α1+α21,0}+12min{α1+2α21,0}.\displaystyle+\frac{1}{2}\min\{2\alpha_{1}+\alpha_{2}-1,0\}+\frac{1}{2}\min\{\alpha_{1}+2\alpha_{2}-1,0\}.

Since J(α)J^{*}(\alpha) is a continuous piecewise linear function, the set of maximizers can be found by comparing the values at the extreme points of the pieces. We have

J(0,0)\displaystyle J^{*}(0,0) =12c+11212=12c;\displaystyle=\frac{1}{2}c+1-\frac{1}{2}-\frac{1}{2}=\frac{1}{2}c;
J(12,0)\displaystyle J^{*}\left(\frac{1}{2},0\right) =12c12+1232+1201212=14c+12;\displaystyle=\frac{1}{2}c\cdot\frac{1}{2}+\frac{1}{2}\cdot\frac{3}{2}+\frac{1}{2}\cdot 0-\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}c+\frac{1}{2};
J(0,12)\displaystyle J^{*}\left(0,\frac{1}{2}\right) =12c32+12321212120=34c+12;\displaystyle=\frac{1}{2}c\cdot\frac{3}{2}+\frac{1}{2}\cdot\frac{3}{2}-\frac{1}{2}\cdot\frac{1}{2}-\frac{1}{2}\cdot 0=\frac{3}{4}c+\frac{1}{2};
J(1,0)\displaystyle J^{*}(1,0) =12c0+121+120+120=12;\displaystyle=\frac{1}{2}c\cdot 0+\frac{1}{2}\cdot 1+\frac{1}{2}\cdot 0+\frac{1}{2}\cdot 0=\frac{1}{2};
J(0,1)\displaystyle J^{*}(0,1) =12c2+121+120+120=c+12;\displaystyle=\frac{1}{2}c\cdot 2+\frac{1}{2}\cdot 1+\frac{1}{2}\cdot 0+\frac{1}{2}\cdot 0=c+\frac{1}{2};
J(13,13)\displaystyle J^{*}\left(\frac{1}{3},\frac{1}{3}\right) =12c+1243+120+120=12c+23;\displaystyle=\frac{1}{2}c+\frac{1}{2}\cdot\frac{4}{3}+\frac{1}{2}\cdot 0+\frac{1}{2}\cdot 0=\frac{1}{2}c+\frac{2}{3};
J(1,1)\displaystyle J^{*}(1,1) =12c+120+120+120=12c.\displaystyle=\frac{1}{2}c+\frac{1}{2}\cdot 0+\frac{1}{2}\cdot 0+\frac{1}{2}\cdot 0=\frac{1}{2}c.
Figure 3: The pieces (polygons), in the (α1, α2)-plane, on which J(α)J^{*}(\alpha) is linear. The extreme points (0,0), (1/2,0), (1,0), (0,1/2), (0,1), (1/3,1/3), and (1,1) of the pieces are labeled.

Since c<13c<\frac{1}{3}, the point (13,13)(\frac{1}{3},\frac{1}{3}) attains the strictly largest value among the extreme points. Because J(α)J^{*}(\alpha) is linear on each piece, this implies argmaxαJ(α)={(13,13)}\arg\max_{\alpha}J^{*}(\alpha)=\{(\frac{1}{3},\frac{1}{3})\}, i.e., Alice always plays α=(13,13)\alpha^{*}=(\frac{1}{3},\frac{1}{3}) in any BNE of the game.
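For concreteness, the comparison in which the condition c < 1/3 binds is the one with the extreme point (0,1):

J*(1/3, 1/3) - J*(0, 1) = (c/2 + 2/3) - (c + 1/2) = 1/6 - c/2,

which is positive exactly when c < 1/3; the values at the remaining extreme points are strictly below c/2 + 2/3 for every c ∈ (0, 1/3).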

Now consider Bob’s equilibrium strategy: β\beta^{*} can be Bob’s strategy at a BNE only if Alice’s equilibrium strategy α\alpha^{*} is a best response to it, i.e., only if αargmaxαJ(α,β)\alpha^{*}\in\arg\max_{\alpha}J(\alpha,\beta^{*}).

For each β\beta, J(α,β)J(\alpha,\beta) is a linear function of α\alpha and

αJ(α,β)=(12c12+β1+12β2,12c12+12β1+β2),α(0,1)2.\displaystyle\nabla_{\alpha}J(\alpha,\beta)=\left(-\frac{1}{2}c-\frac{1}{2}+\beta_{1}+\frac{1}{2}\beta_{2},\frac{1}{2}c-\frac{1}{2}+\frac{1}{2}\beta_{1}+\beta_{2}\right),\quad\forall\alpha\in(0,1)^{2}.

Since α* lies in the interior of [0,1]² and J(·, β*) is linear in α, we need αJ(α,β)|α=α=(0,0)\nabla_{\alpha}J(\alpha,\beta^{*})\Big{|}_{\alpha=\alpha^{*}}=(0,0). Hence

12c12+β1+12β2\displaystyle-\frac{1}{2}c-\frac{1}{2}+\beta_{1}^{*}+\frac{1}{2}\beta_{2}^{*} =0;\displaystyle=0;
12c12+12β1+β2\displaystyle\frac{1}{2}c-\frac{1}{2}+\frac{1}{2}\beta_{1}^{*}+\beta_{2}^{*} =0,\displaystyle=0,

Solving this linear system yields β=(13+c,13c)\beta^{*}=(\frac{1}{3}+c,\frac{1}{3}-c). Conversely, at α* = (1/3, 1/3) the coefficients of β1 and β2 in J(α,β) both vanish, so every β, in particular β*, is a best response for Bob; likewise, since the gradient above is zero at β*, J(·, β*) is constant in α and α* is a best response for Alice. Hence (α*, β*) is indeed a BNE, and by the two necessity arguments it is the unique BNE of the game, proving the claim.
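Remark (numerical sanity check, not part of the proof). The short Python sketch below evaluates the payoff expression J(α,β) derived above and confirms on a grid that Alice's maximin strategy is approximately (1/3, 1/3); the value c = 0.2 and the grid resolution are illustrative assumptions, not quantities taken from Example 6.1.

# Sanity check of the claim in Proof D.5 (illustrative; c and the grid are assumptions).
c = 0.2  # any value with 0 < c < 1/3 fits the setting of Example 6.1

def J(a, b):
    # Alice's total expected payoff J(alpha, beta), as derived above
    a1, a2 = a
    b1, b2 = b
    return (0.5 * c * (1 - a1 + a2) + 0.5 * a1 * 2 * b1
            + 0.5 * (1 - a1) * (1 - b2) + 0.5 * (1 - a2) * (1 - b1)
            + 0.5 * a2 * 2 * b2)

def J_star(a):
    # J*(alpha) = min over beta in [0,1]^2 of J(alpha, beta);
    # J is linear in beta, so the minimum is attained at a vertex of [0,1]^2.
    return min(J(a, (b1, b2)) for b1 in (0.0, 1.0) for b2 in (0.0, 1.0))

grid = [i / 200 for i in range(201)]  # step 0.005 on [0, 1]
best = max(((a1, a2) for a1 in grid for a2 in grid), key=J_star)
print("approximate argmax of J*:", best)  # close to (1/3, 1/3)
print("beta* from the gradient condition:", (1/3 + c, 1/3 - c))

Running the sketch prints a grid point close to (1/3, 1/3) together with Bob's equilibrium strategy (1/3 + c, 1/3 - c) obtained from the gradient condition.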

References

  • Abreu and Rubinstein [1988] Abreu D, Rubinstein A (1988) The structure of Nash equilibrium in repeated games with finite automata. Econometrica: Journal of the Econometric Society pp 1259–1281. Available at https://doi.org/10.2307/1913097
  • Åström [1965] Åström KJ (1965) Optimal control of Markov processes with incomplete state information. Journal of mathematical analysis and applications 10(1):174–205. Available at https://doi.org/10.1016/0022-247x(65)90154-x
  • Aumann et al [1997] Aumann RJ, Hart S, Perry M (1997) The absent-minded driver. Games and Economic Behavior 20(1):102–116. Available at https://doi.org/10.1006/game.1997.0577
  • Banks and Sundaram [1990] Banks JS, Sundaram RK (1990) Repeated games, finite automata, and complexity. Games and Economic Behavior 2(2):97–117. Available at https://doi.org/10.1016/0899-8256(90)90024-o
  • Başar and Olsder [1999] Başar T, Olsder GJ (1999) Dynamic noncooperative game theory, vol 23. SIAM
  • Battigalli [1996] Battigalli P (1996) Strategic independence and perfect Bayesian equilibria. Journal of Economic Theory 70(1):201–234. Available at https://doi.org/10.1006/jeth.1996.0082
  • Battigalli [1997] Battigalli P (1997) Dynamic consistency and imperfect recall. Games and Economic Behavior 20(1):31–50. Available at https://doi.org/10.1006/game.1997.0535
  • Bellman [1966] Bellman R (1966) Dynamic programming. Science 153(3731):34–37
  • Filar and Vrieze [2012] Filar J, Vrieze K (2012) Competitive Markov decision processes. Springer Science & Business Media
  • Fudenberg and Tirole [1991a] Fudenberg D, Tirole J (1991a) Game theory. MIT press
  • Fudenberg and Tirole [1991b] Fudenberg D, Tirole J (1991b) Perfect Bayesian equilibrium and sequential equilibrium. Journal of Economic Theory 53(2):236–260. Available at https://doi.org/10.1016/0022-0531(91)90155-w
  • Grove and Halpern [1997] Grove AJ, Halpern JY (1997) On the expected value of games with absentmindedness. Games and Economic Behavior 20(1):51–65. Available at https://doi.org/10.1006/game.1997.0558
  • Gupta et al [2014] Gupta A, Nayyar A, Langbort C, et al (2014) Common information based Markov perfect equilibria for linear-Gaussian games with asymmetric information. SIAM Journal on Control and Optimization 52(5):3228–3260. Available at https://doi.org/10.1137/140953514
  • Gupta et al [2016] Gupta A, Langbort C, Başar T (2016) Dynamic games with asymmetric information and resource constrained players with applications to security of cyberphysical systems. IEEE Transactions on Control of Network Systems 4(1):71–81. Available at https://doi.org/10.1109/tcns.2016.2584183
  • Halpern [1997] Halpern JY (1997) On ambiguities in the interpretation of game trees. Games and Economic Behavior 20(1):66–96. Available at https://doi.org/10.1006/game.1997.0557
  • Halpern [2009] Halpern JY (2009) A nonstandard characterization of sequential equilibrium, perfect equilibrium, and proper equilibrium. International Journal of Game Theory 38(1):37–49. Available at https://doi.org/10.1007/s00182-008-0139-0
  • Hendon et al [1996] Hendon E, Jacobsen HJ, Sloth B (1996) The one-shot-deviation principle for sequential rationality. Games and Economic Behavior 12(2):274–282. Available at https://doi.org/10.1006/game.1996.0018
  • Hinderer [1970] Hinderer K (1970) Sufficient statistics, Markovian and stationary models, vol 33, Springer Berlin Heidelberg, Berlin, Heidelberg, chap 18, pp 118–126
  • Kao and Subramanian [2022] Kao H, Subramanian V (2022) Common information based approximate state representations in multi-agent reinforcement learning. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 6947–6967
  • Kay [1993] Kay SM (1993) Fundamentals of statistical signal processing: Estimation theory. Prentice-Hall, Inc.
  • Kreps and Wilson [1982] Kreps DM, Wilson R (1982) Sequential equilibria. Econometrica: Journal of the Econometric Society pp 863–894. Available at https://doi.org/10.2307/1912767
  • Kuhn [1953] Kuhn H (1953) Extensive games and the problem of information. In: Contributions to the Theory of Games (AM-28), Volume II. Princeton University Press, p 193–216
  • Kumar and Varaiya [2015] Kumar PR, Varaiya P (2015) Stochastic systems: Estimation, identification and adaptive control. SIAM
  • Mahajan and Mannan [2016] Mahajan A, Mannan M (2016) Decentralized stochastic control. Annals of Operations Research 241(1):109–126. Available at https://doi.org/10.1007/s10479-014-1652-0
  • Mas-Colell et al [1995] Mas-Colell A, Whinston MD, Green JR (1995) Microeconomic theory, vol 1. Oxford university press New York
  • Maskin and Tirole [2001] Maskin E, Tirole J (2001) Markov perfect equilibrium: I. Observable actions. Journal of Economic Theory 100(2):191–219. Available at https://doi.org/10.1006/jeth.2000.2785
  • Maskin and Tirole [2013] Maskin E, Tirole J (2013) Markov equilibrium, J. F. Mertens Memorial Conference. Available at https://youtu.be/UNtLnKJzrhs
  • Mertens and Neyman [1981] Mertens JF, Neyman A (1981) Stochastic games. International Journal of Game Theory 10(2):53–66. Available at https://doi.org/10.1007/bf01769259
  • Mertens and Parthasarathy [2003] Mertens JF, Parthasarathy T (2003) Equilibria for discounted stochastic games. In: Stochastic games and applications. Springer, p 131–172
  • Myerson [2013] Myerson RB (2013) Game theory. Harvard university press
  • Nayyar and Başar [2012] Nayyar A, Başar T (2012) Dynamic stochastic games with asymmetric information. In: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), IEEE, pp 7145–7150, available at https://doi.org/10.1109/cdc.2012.6426857
  • Nayyar et al [2011] Nayyar A, Mahajan A, Teneketzis D (2011) Optimal control strategies in delayed sharing information structures. IEEE Transactions on Automatic Control 56(7):1606–1620. Available at https://doi.org/10.1109/tac.2010.2089381
  • Nayyar et al [2013a] Nayyar A, Gupta A, Langbort C, et al (2013a) Common information based Markov perfect equilibria for stochastic games with asymmetric information: Finite games. IEEE Transactions on Automatic Control 59(3):555–570. Available at https://doi.org/10.1109/tac.2013.2283743
  • Nayyar et al [2013b] Nayyar A, Mahajan A, Teneketzis D (2013b) Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control 58(7):1644–1658. Available at https://doi.org/10.1109/tac.2013.2239000
  • Ouyang et al [2015] Ouyang Y, Tavafoghi H, Teneketzis D (2015) Dynamic oligopoly games with private Markovian dynamics. In: 2015 54th IEEE Conference on Decision and Control (CDC), IEEE, pp 5851–5858, available at https://doi.org/10.1109/cdc.2015.7403139
  • Ouyang et al [2016] Ouyang Y, Tavafoghi H, Teneketzis D (2016) Dynamic games with asymmetric information: Common information based perfect Bayesian equilibria and sequential decomposition. IEEE Transactions on Automatic Control 62(1):222–237. Available at https://doi.org/10.1109/tac.2016.2544936
  • Ouyang et al [2024] Ouyang Y, Tavafoghi H, Teneketzis D (2024) An approach to stochastic dynamic games with asymmetric information and hidden actions. Dynamic Games and Applications pp 1–34. Available at https://doi.org/10.1007/s13235-024-00558-7
  • Piccione and Rubinstein [1997] Piccione M, Rubinstein A (1997) On the interpretation of decision problems with imperfect recall. Games and Economic Behavior 20(1):3–24. Available at https://doi.org/10.1016/0165-4896(96)81573-3
  • Powell [2007] Powell WB (2007) Approximate Dynamic Programming: Solving the curses of dimensionality, vol 703. John Wiley & Sons
  • Rosenberg [1998] Rosenberg D (1998) Duality and Markovian strategies. International Journal of Game Theory 27(4). Available at https://doi.org/10.1007/s001820050091
  • Russell and Norvig [2002] Russell S, Norvig P (2002) Artificial intelligence: A modern approach. Prentice Hall
  • Shapley [1953] Shapley LS (1953) Stochastic games. Proceedings of the National Academy of Sciences 39(10):1095–1100. Available at https://doi.org/10.1073/pnas.39.10.1095
  • Shiryaev [1964] Shiryaev AN (1964) On Markov sufficient statistics in non-additive Bayes problems of sequential analysis. Theory of Probability & Its Applications 9(4):604–618. Available at https://doi.org/10.1137/1109082
  • Smallwood and Sondik [1973] Smallwood RD, Sondik EJ (1973) The optimal control of partially observable Markov processes over a finite horizon. Operations research 21(5):1071–1088. Available at https://doi.org/10.1287/opre.21.5.1071
  • Sondik [1978] Sondik EJ (1978) The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations research 26(2):282–304. Available at https://doi.org/10.1287/opre.26.2.282
  • Striebel [1965] Striebel C (1965) Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications 12(3):576–592. Available at https://doi.org/10.1016/0022-247X(65)90027-2
  • Striebel [1975] Striebel C (1975) Statistics Sufficient for Control, Springer Berlin Heidelberg, Berlin, Heidelberg, chap 3, pp 38–58
  • Subramanian et al [2022] Subramanian J, Sinha A, Seraj R, et al (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. Journal of Machine Learning Research 23:12–1
  • Sundaram [1996] Sundaram RK (1996) A first course in optimization theory. Cambridge university press
  • Tang [2021] Tang D (2021) Games in multi-agent dynamic systems: Decision-making with compressed information. PhD thesis, University of Michigan
  • Tang et al [2023] Tang D, Tavafoghi H, Subramanian V, et al (2023) Dynamic games among teams with delayed intra-team information sharing. Dynamic Games and Applications 13:353–411. Available at https://doi.org/10.1007/s13235-022-00424-4
  • Tavafoghi [2017] Tavafoghi H (2017) On design and analysis of cyber-physical systems with strategic agents. PhD thesis, University of Michigan, Ann Arbor
  • Tavafoghi et al [2016] Tavafoghi H, Ouyang Y, Teneketzis D (2016) On stochastic dynamic games with delayed sharing information structure. In: 2016 IEEE 55th Conference on Decision and Control (CDC), IEEE, pp 7002–7009, available at https://doi.org/10.1109/cdc.2016.7799348
  • Tavafoghi et al [2022] Tavafoghi H, Ouyang Y, Teneketzis D (2022) A unified approach to dynamic decision problems with asymmetric information: Nonstrategic agents. IEEE Transactions on Automatic Control 67(3):1105–1119. Available at https://doi.org/10.1109/tac.2021.3060835
  • Varaiya and Walrand [1978] Varaiya P, Walrand J (1978) On delayed sharing patterns. IEEE Transactions on Automatic Control 23(3):443–445. Available at https://doi.org/10.1109/TAC.1978.1101739
  • Vasal et al [2019] Vasal D, Sinha A, Anastasopoulos A (2019) A systematic process for evaluating structured perfect Bayesian equilibria in dynamic games with asymmetric information. IEEE Transactions on Automatic Control 64(1):81–96. Available at https://doi.org/10.1109/tac.2018.2809863
  • Watson [2017] Watson J (2017) A general, practicable definition of perfect Bayesian equilibrium. Unpublished draft. Available at https://econweb.ucsd.edu/~jwatson/PAPERS/WatsonPBE.pdf
  • Whittle [1969] Whittle P (1969) Sequential decision processes with essential unobservables. Advances in Applied Probability 1(2):271–287. Available at https://doi.org/10.2307/1426220