
Learning the Structure and Parameters of Large-Population Graphical Games from Behavioral Data

Jean Honorio
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
jhonorio@csail.mit.edu

Luis Ortiz
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-4400, USA
leortiz@cs.stonybrook.edu
Abstract

We consider learning, from strictly behavioral data, the structure and parameters of linear influence games (LIGs), a class of parametric graphical games introduced by Irfan and Ortiz [2014]. LIGs facilitate causal strategic inference (CSI): Making inferences from causal interventions on stable behavior in strategic settings. Applications include the identification of the most influential individuals in large (social) networks. Such tasks can also support policy-making analysis. Motivated by the computational work on LIGs, we cast the learning problem as maximum-likelihood estimation (MLE) of a generative model defined by pure-strategy Nash equilibria (PSNE). Our simple formulation uncovers the fundamental interplay between goodness-of-fit and model complexity: good models capture equilibrium behavior within the data while controlling the true number of equilibria, including those unobserved. We provide a generalization bound establishing the sample complexity for MLE in our framework. We propose several algorithms including convex loss minimization (CLM) and sigmoidal approximations. We prove that the number of exact PSNE in LIGs is small, with high probability; thus, CLM is sound. We illustrate our approach on synthetic data and real-world U.S. congressional voting records. We briefly discuss our learning framework’s generality and potential applicability to general graphical games.

1 Introduction

Game theory has become a central tool for modeling multi-agent systems in AI. Non-cooperative game theory has been considered the appropriate mathematical framework in which to formally study strategic behavior in multi-agent scenarios (see, e.g., the survey of Shoham [2008] and the books of Nisan et al. [2007] and Shoham and Leyton-Brown [2009] for more information). The core solution concept of Nash equilibrium (NE) [Nash, 1951] serves a descriptive role, capturing the stable outcome of the overall behavior of systems involving self-interested individuals interacting strategically with each other in distributed settings for which no direct global control is possible. NE is also often used in a predictive role as the basis for what one might call causal strategic inference, i.e., inferring the results of causal interventions on stable actions/behavior/outcomes in strategic settings (see, e.g., Ballester et al. 2004, 2006, Heal and Kunreuther 2003, 2006, 2007, Kunreuther and Michel-Kerjan 2007, Ortiz and Kearns 2003, Kearns 2005, Irfan and Ortiz 2014, and the references therein). Needless to say, the computation and analysis of NE in games is of significant interest to the computational game-theory community within AI.

The introduction of compact representations to game theory over the last decade has extended computational/algorithmic game theory's potential for the large-scale, practical applications often encountered in the real world. For the most part, such game model representations are analogous to the probabilistic graphical models widely used in machine learning and AI. (The fundamental property such compact representations of games exploit is that of conditional independence: each player's payoff function values are determined by the actions of the player and those of the player's neighbors only, and thus are conditionally (payoff) independent of the actions of the non-neighboring players, given the actions of the neighboring players.) Introduced within the AI community about a decade ago, graphical games [Kearns et al., 2001] constitute one of the first and arguably one of the most influential graphical models for game theory. (Other game-theoretic graphical models include game networks [La Mura, 2000], multi-agent influence diagrams (MAIDs) [Koller and Milch, 2003], and action-graph games [Jiang and Leyton-Brown, 2008].)

There has been considerable progress on problems of computing classical equilibrium solution concepts such as NE and correlated equilibria (CE) [Aumann, 1974] in graphical games (see, e.g., Kearns et al. 2001, Vickrey and Koller 2002, Ortiz and Kearns 2003, Blum et al. 2006, Kakade et al. 2003, Papadimitriou and Roughgarden 2008, Jiang and Leyton-Brown 2011 and the references therein). Indeed, graphical games played a prominent role in establishing the computational complexity of computing NE in general normal-form games (see, e.g., Daskalakis et al. 2009 and the references therein).

An example of a recent computational application of non-cooperative game-theoretic graphical modeling and causal strategic inference (CSI) that motivates the current paper is the work of Irfan and Ortiz [2014]. They proposed a new approach to the study of influence and the identification of the "most influential" individuals (or nodes) in large (social) networks. Their approach is strictly game-theoretic in the sense that it relies on non-cooperative game theory and the central concept of pure-strategy Nash equilibria (PSNE) as an approximate predictor of stable behavior in strategic settings. (In this paper, because we concern ourselves primarily with PSNE, whenever we use the term "equilibrium" or "equilibria" without qualification, we mean PSNE.) Unlike other models of behavior in mathematical sociology, some of which have recently gained interest within computer science, especially those related to diffusion or contagion processes (see, e.g., Granovetter 1978, Morris 2000, Domingos and Richardson 2001, Domingos 2005, Even-Dar and Shapira 2007), their approach is not concerned with, and thus avoids explicitly modeling, the complex dynamics by which such stable outcomes could have arisen or could be achieved. Instead, it concerns itself with the "bottom-line" end-state stable outcomes (or steady-state behavior). Hence, the proposed approach provides an alternative to models based on the diffusion of behavior through a social network (see Kleinberg 2007 for an introduction and discussion targeted to computer scientists, and further references).

The underlying assumption for most work in computational game theory that deals with algorithms for computing equilibrium concepts is that the games under consideration are already available, or have been "hand-designed" by the analyst. While this may be possible for systems involving a handful of players, it is in general impossible for systems with tens of agent entities or more, which are the systems of interest in this paper. (Of course, modeling and hand-crafting games for systems with many agents may be possible if the system has particular structure one could exploit. To give an example, this would be analogous to how one can exploit the probabilistic structure of HMMs to deal with long stochastic processes in a representationally succinct and computationally tractable way. Yet, we believe it is fair to say that such systems are largely the exception in real-world settings in practice.) For instance, in their paper, Irfan and Ortiz [2014] propose a class of games, called influence games. In particular, they concentrate on linear influence games (LIGs), and, as briefly mentioned above, study a variety of computational problems resulting from their approach, assuming such games are given as input.

Research in computational game theory has paid relatively little attention to the problem of learning (both the structure and parameters of) graphical games from data. Addressing this problem is essential to the development, potential use and success of game-theoretic models in practical applications. Indeed, we are beginning to see an increase in the availability of data collected from processes that are the result of deliberate actions of agents in complex systems. A lot of this data results from the interaction of a large number of individuals, which are not only people (i.e., individual human decision-makers), but also companies, governments, groups or engineered autonomous systems (e.g., autonomous trading agents), for which any form of global control is usually weak. The Internet is currently a major source of such data, and the smart grid, with its trumpeted ability to allow individual customers to install autonomous control devices and systems for electricity demand, will likely be another one in the near future.

In this paper, we investigate in considerable technical depth the problem of learning LIGs from strictly behavioral data: We do not assume the availability of utility, payoff or cost information in the data; the problem is precisely to infer that information from just the joint behavior collected in the data, up to the degree needed to explain the joint behavior itself. We expect that, in most cases, the parameters quantifying a utility function or best-response condition are unavailable and hard to determine in real-world settings. The availability of data resulting from the observation of an individual's public behavior is arguably a weaker assumption than the availability of individual utility observations, which are often private. In addition, we do not assume prior knowledge of the conditional payoff/utility independence structure as represented by the game graph.

Motivated by the work of Irfan and Ortiz [2014] on a strictly non-cooperative game-theoretic approach to influence and strategic behavior in networks, we present a formal framework and design algorithms for learning the structure and parameters of LIGs with a large number of players. We concentrate on data about what one might call "the bottom line," i.e., data about "end-states," "steady-states" or final behavior as represented by possibly noisy samples of joint actions/pure strategies from stable outcomes, which we assume come from a hidden underlying game. Thus, we do not use, consider or assume available any temporal data about the detailed behavioral dynamics. In fact, the data we consider does not contain the dynamics that might have possibly led to the potentially stable joint-action outcome! Since scalability is one of our main goals, we aim to propose methods that are polynomial-time in the number of players.

Given that LIGs belong to the class of 2-action graphical games [Kearns et al., 2001] with parametric payoff functions, we first needed to deal with the relative dearth of work on the broader problem of learning general graphical games from purely behavioral data. Hence, in addressing this problem, while inspired by the computational approach of Irfan and Ortiz [2014], the learning problem formulation we propose is in principle applicable to arbitrary games (although, again, the emphasis is on the PSNE of such games). In particular, we introduce a simple statistical generative mixture model, built "on top of" the game-theoretic model, with the only objective being to capture noise in the data. Despite the simplicity of the generative model, we are able to learn games from U.S. congressional voting records, which we use as a source of real-world behavioral data, that, as we will illustrate, seem to capture interesting, non-trivial aspects of the U.S. Congress. While such models learned from real-world data are impossible to validate, we argue that there exists a considerable amount of anecdotal evidence for such aspects as captured by the models we learned. Figure 1 provides a brief illustration. (Should there be further need for clarification as to why we present this figure, please see the discussion accompanying Figure 1 in Section 3.)

As a final remark, given that LIGs constitute a non-trivial sub-class of parametric graphical games, we view our work as a step in the direction of addressing the broader problem of learning general graphical games with a large number of players from strictly behavioral data. We also hope our work helps to continue to bring and increase attention from the machine-learning community to the problem of inferring games from behavioral data, in which we attempt to learn a game that would "rationalize" players' observed behavior. (This type of problem, arising from game theory and economics, is different from the problem of learning in games, in which the focus is the study of how individual players learn to play a game through a sequence of repeated interactions; the latter is a more mature and perhaps better known problem within machine learning. See, e.g., Fudenberg and Levine 1999.)

Figure 1: 110th US Congress's LIG (January 3, 2007-09): We provide an illustration of the application of our approach to real congressional voting data. Irfan and Ortiz [2014] use such games to address a variety of computational problems, including the identification of the most influential senators. (We refer the reader to their paper for further details.) We show the graph connectivity of a game learned by independent $\ell_{1}$-regularized logistic regression (see Section 6.5). The reader should focus on the overall characteristics of the graph and not the details of the connectivity or the actual "influence" weights between senators. We highlight some particularly interesting characteristics consistent with anecdotal evidence. First, senators are more likely to be influenced by members of the same party than by members of the opposite party (the dashed green line denotes the separation between the parties). Republicans were "more strongly united" (tighter connectivity) than Democrats at the time. Second, the current US Vice President Biden (Dem./Delaware) and McCain (Rep./Arizona) are displayed at the "extreme of each party" (Biden at the bottom-right corner, McCain at the bottom-left), reflecting their opposing ideologies. Third, note that Biden, McCain, the current US President Obama (Dem./Illinois) and US Secretary of State Hillary Clinton (Dem./New York) have very few outgoing arcs; e.g., Obama only directly influences Feingold (Dem./Wisconsin), a prominent senior member with strongly liberal stands. One may wonder why such prominent senators seem to have so little direct influence on others. A possible explanation is that US President Bush was about to complete his second term (the maximum allowed). Both parties had very long presidential primaries. All those senators contended for the presidential candidacy within their parties. Hence, one may posit that those senators were focusing on running their campaigns and that their influence in the day-to-day business of congress was channeled through other prominent senior members of their parties.
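For readers curious about the mechanics behind a graph such as the one in Figure 1, the following is a minimal sketch (not the code used to produce the figure) of independent $\ell_{1}$-regularized logistic regression: each player's observed action is regressed on the actions of all other players, and the fitted coefficients and intercept are read off as that player's row of the influence matrix $\mathbf{W}$ and threshold $b_{i}$. The function name, the data matrix X, and the regularization constant C are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_lig_by_logistic_regression(X, C=0.1):
    """X: (m, n) array of joint actions in {-1, +1}. Returns estimated (W, b).
    Assumes each player's observed actions are not constant in the data."""
    m, n = X.shape
    W = np.zeros((n, n))
    b = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
        clf.fit(X[:, others], X[:, i])      # predict player i's action from the rest
        W[i, others] = clf.coef_.ravel()    # estimated influence weights w_{i,-i}
        b[i] = -clf.intercept_[0]           # estimated threshold b_i (sign convention: +1 when w.x > b)
    return W, b

Section 6.5 discusses this baseline estimator; the sketch above is only meant to convey how a sparse graph over the players can be read off the fitted coefficients.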

1.1 A Framework for Learning Games: Desiderata

The following list summarizes the discussion above and guides our choices in our pursuit of a machine-learning framework for learning game-theoretic graphical models from strictly behavioral data.

  • The learning algorithm

    • must output an LIG (which is a special type of graphical game); and

    • should be practical and tractably deal with a large number of players (typically in the hundreds, and certainly at least 4).

  • The learned model objective is the "bottom line" in the sense that the basis for its evaluation is the prediction of end-state (or steady-state) joint decision-making behavior, and not the temporal behavioral dynamics that might have led to the end-state or the stable steady-state joint behavior. (Note that we are in no way precluding dynamic models as a way to end-state prediction. But there is no inherent need to make any explicit attempt or effort to model or predict the temporal behavioral dynamics that might have led to the end-state or the stable steady-state joint behavior, including pre-play "cheap talk," which are often overly complex processes. See Appendix A.1 for further discussion.)

  • The learning framework

    • would only have available strictly behavioral data on actual decisions/actions taken. It cannot require or use any kind of payoff-related information.

    • should be agnostic as to the type or nature of the decision-maker and does not assume each player is a single human. Players can be institutions or governments, or associated with the decision-making process of a group of individuals representing, e.g., a company (or sub-units, office sites within a company, etc.), a nation state (like in the UN, NATO, etc.), or a voting district. In other words, the recorded behavioral actions of each player may really be a representative of larger entities or groups of individuals, not necessarily a single human.

    • must provide a computationally efficient learning algorithm with provable guarantees: worst-case polynomial running time in the number of players.

    • should be “data efficient” and provide provable guarantees on sample complexity (given in terms of “generalization” bounds).

1.2 Technical Contributions

While our probabilistic model is inspired by the concept of equilibrium from game theory, our technical contributions are not in the field of game theory or computational game theory. Our technical contributions, and the tools that we use, are those of classical machine learning.

Our technical contributions include a novel generative model of behavioral data in Section 4 for general games. Motivated by LIGs and the computational game-theoretic framework put forward by Irfan and Ortiz [2014], we formally define "identifiability" and "triviality" within the context of non-cooperative graphical games based on PSNE as the solution concept for stable outcomes in large strategic systems. We provide conditions that ensure identifiability among non-trivial games. We then present the maximum-likelihood estimation (MLE) problem for general (non-trivial identifiable) games. In Section 5, we show a generalization bound for the MLE problem as well as an upper bound on the functional/strategic complexity (analogous to the "VC-dimension" in supervised learning) of LIGs. In Section 6, we provide technical evidence justifying the approximation of the original problem by maximizing the number of observed equilibria in the data, as suitable for a hypothesis space of games with a small true number of equilibria. We then present our convex loss minimization approach and a baseline sigmoidal approximation for LIGs. For completeness, we also present exhaustive search methods for both general games as well as LIGs. In Section 7, we formally define the concept of absolute indifference of players and show that our convex loss minimization approach produces games in which all players are non-absolutely-indifferent. We provide a bound which shows that LIGs have a small true number of equilibria with high probability.

2 Related Work

We provide a brief overview of previous work on learning games here, and delay discussion of the work presented below until after we formally present our model; this will provide better context and make "comparing and contrasting" easier for those interested, without affecting expert readers who may want to get to the technical aspects of the paper without much delay.

Table 1 constitutes our best attempt at a simple visualization to fairly present the differences and similarities of previous approaches to modeling behavioral data within the computational game-theory community in AI.

Reference | Class | Needs Payoff | Learns Param. | Learns Struct. | Guarant. | Equil. Concept | Dyn. | Num. Agents
Wright and Leyton-Brown [2010] | NF | Y | N^a | - | N | QRE | N | 2
Wright and Leyton-Brown [2012] | NF | Y | N^a | - | N | QRE | N | 2
Gao and Pfeffer [2010] | NF | Y | Y | - | N | QRE | N | 2
Vorobeychik et al. [2007] | NF | Y | Y | - | N | MSNE | N | 2-5
Ficici et al. [2008] | NF | Y | Y | - | N | MSNE | N | 10-200
Duong et al. [2008] | NGT | Y | N^a | N | N | - | N | 4, 10
Duong et al. [2010] | NGT^b | Y | N^c | N | N | - | Y^d | 10
Duong et al. [2012] | NGT^b | Y | N^c | Y^e | N | - | Y^d | 36
Duong et al. [2009] | GG | Y | Y | Y^f | N | PSNE | N | 2-13
Kearns and Wortman [2008] | NGT | N | - | - | Y | - | Y | 100
Ziebart et al. [2010] | NF | N | Y | - | N | CE | N | 2-3
Waugh et al. [2011] | NF | N | Y | - | Y | CE | Y | 7
Our approach | GG | N | Y | Y^g | Y | PSNE | N | 100^g
Table 1: Summary of approaches for learning models of behavior. See main text for a discussion. For each method we show its model class (GG: graphical games, NF: normal-form non-graphical games, NGT: non-game-theoretic model); whether it needs observed payoffs, learns utility parameters, learns graphical structure or provides guarantees (e.g., generalization, sample complexity or PAC learnability); its equilibrium concept (PSNE: pure-strategy or MSNE: mixed-strategy Nash equilibria, CE: correlated equilibria, QRE: quantal response equilibria); whether it is dynamic (i.e., behavior predicted from past behavior); and the number of agents in the experimental validation. Note that there are relatively few models that do not assume observed payoffs; among them, our method is the only one that learns the structure of graphical games; furthermore, we provide guarantees and a polynomial-time algorithm. ^a Learns only the "rationality parameter." ^b A graphical game could in principle be extracted, after removing the temporal/dynamic part. ^c It learns parameters for the "potential functions." ^d If the dynamic part is kept, it is not a graphical game. ^e It performs greedy search by constraining the maximum degree. ^f It performs branch and bound. ^g It has polynomial time-complexity in the number of agents, thus it can scale to thousands.

The research interest of previous work varies in what it intends to capture in terms of different aspects of behavior (e.g., dynamics, probabilistic vs. strategic) or simply different settings/domains (e.g., modeling "real human behavior," knowledge of achieved payoff or utility, etc.).

With the exception of Ziebart et al. [2010], Waugh et al. [2011], and Kearns and Wortman [2008], previous methods assume that the actions as well as the corresponding payoffs (or noisy samples from the true payoff function) are observed in the data. Our setting largely differs from that of Ziebart et al. [2010] and Kearns and Wortman [2008] because of their focus on system dynamics, in which future behavior is predicted from a sequence of past behavior. Kearns and Wortman [2008] proposed a learning-theory framework to model collective behavior based on stochastic models.

Our problem is clearly different from methods in quantal response models [McKelvey and Palfrey, 1995, Wright and Leyton-Brown, 2010, 2012] and graphical multiagent models (GMMs) [Duong et al., 2008, 2010] that assume known structure and observed payoffs. Duong et al. [2012] learn the structure of games that are not graphical, i.e., in which the payoff depends on all other players. Their approach also assumes observed payoffs and considers a dynamic consensus scenario, where agents on a network attempt to reach a unanimous vote. In analogy to voting, we do not assume the availability of the dynamics (i.e., the previous actions) that led to the final vote. They also use fixed information on the conditioning sets of neighbors during their search for graph structure. We also note that the work of Vorobeychik et al. [2007], Gao and Pfeffer [2010], and Ziebart et al. [2010] presents experimental validation mostly for 2 players only, with 7 players in Waugh et al. [2011] and up to 13 players in Duong et al. [2009].

In several cases in previous work, researchers define probabilistic models using knowledge of the payoff functions explicitly (i.e., a Gibbs distribution with potentials that are functions of the players' payoffs, regrets, etc.) to model joint behavior (i.e., joint pure strategies); see, e.g., Duong et al. [2008, 2010, 2012], and to some degree also Wright and Leyton-Brown [2010, 2012]. It should be clear to the reader that this is not the same as our generative model, which is defined directly on the PSNE (or stable outcomes) of the game, which the players' payoffs determine only indirectly.

In contrast, in this paper, we assume that the joint actions are the only observable information and that both the game graph structure and payoff functions are unknown, unobserved and unavailable. We present the first techniques for learning the structure and parameters of a non-trivial class of large-population graphical games from joint actions only. Furthermore, we present experimental validation in games of up to 100 players. Our convex loss minimization approach could potentially be applied to larger problems since it has polynomial time complexity in the number of players.

2.1 On Learning Probabilistic Graphical Models

There has been a significant amount of work on learning the structure of probabilistic graphical models from data. We mention only a few references that follow a maximum likelihood approach for Markov random fields [Lee et al., 2007], bounded tree-width distributions [Chow and Liu, 1968, Srebro, 2001], Ising models [Wainwright et al., 2007, Banerjee et al., 2008, Höfling and Tibshirani, 2009], Gaussian graphical models [Banerjee et al., 2006], Bayesian networks [Guo and Schuurmans, 2006, Schmidt et al., 2007b] and directed cyclic graphs [Schmidt and Murphy, 2009].

Our approach learns the structure and parameters of games by maximum likelihood estimation on a related probabilistic model. Our probabilistic model does not fit into any of the types described above. Although a (directed) graphical game has a directed cyclic graph, there is a semantic difference with respect to graphical models. Structure in a graphical model implies a factorization of the probabilistic model. In a graphical game, the graph structure implies strategic dependence between players, and has no immediate probabilistic implication. Furthermore, our general model differs from Schmidt and Murphy [2009] since our generative model does not decompose as a multiplication of potential functions.

Finally, it is very important to note that our specific aim is to model behavioral data that is strategic in nature. Hence, our modeling and learning approach deviates from those for probabilistic graphical models which are of course better suited for other types of data, mostly probabilistic in nature (i.e., resulting from a fixed underlying probability distribution). As a consequence, it is also very important to keep in mind that our work is not in competition with the work in probabilistic graphical models, and is not meant to replace it (except in the context of data sets collected from complex strategic behavior just mentioned). Each approach has its own aim, merits and pitfalls in terms of the nature of data sets that each seeks to model. We return to this point in Section 8 (Experimental Results).

2.2 On Linear Threshold Models and Econometrics

Irfan and Ortiz [2014] introduced LIGs in the AI community, showed that such games are useful, and addressed a variety of computational problems, including the identification of the most influential senators. The class of LIGs is related to the well-known linear threshold model (LTM) in sociology [Granovetter, 1978], recently very popular within the social network and theoretical computer science communities [Kleinberg, 2007]. (López-Pintado and Watts [2008] also provide an excellent summary of the various models in this area of mathematical social science.) Irfan and Ortiz [2014] discuss linear threshold models in depth; we briefly discuss them here for self-containment. LTMs are usually studied as the basis for some kind of diffusion process. A typical problem is the identification of the most influential individuals in a social network. An LTM is not in itself a game-theoretic model and, in fact, Granovetter himself argues against this view in the context of the setting and the type of questions in which he was most interested [Granovetter, 1978]. Our reading of the relevant literature suggests that subsequent work on LTMs has not taken a strictly game-theoretic view either. The problem of learning mathematical models of influence from behavioral data has only recently started to receive attention. There have been a number of articles in the last couple of years addressing the problem of learning the parameters of a variety of diffusion models of influence [Saito et al., 2008, 2009, 2010, Goyal et al., 2010, Gomez Rodriguez et al., 2010, Cao et al., 2011]. (Often learning consists of estimating the threshold parameter from data given as temporal sequences from "traces" or "action logs." Sometimes the "influence weights" are estimated assuming a given graph, and almost always the weights are assumed positive and estimated as "probabilities of influence." For example, Saito et al. [2010] consider a dynamic (continuous-time) LTM that has only positive influence weights and a randomly generated threshold value. Cao et al. [2011] use active learning to estimate the threshold values of an LTM leading to a maximum spread of influence.)

Our model is also related to a particular model of discrete choice with social interactions in econometrics (see, e.g., Brock and Durlauf 2001). The main difference is that, following Irfan and Ortiz [2014], we take a strictly non-cooperative game-theoretic approach within the classical "static"/one-shot game framework, and thus we do not use a random utility model. In addition, we do not make the assumption of rational expectations, which in the context of models of discrete choice with social interactions essentially implies the assumption that all players use exactly the same mixed strategy. (A formal definition of "rational expectations" is beyond the scope of this paper. We refer the reader to the early part of the article by Brock and Durlauf [2001], where they explain why assuming rational expectations leads to the conclusion that all players use exactly the same mixed strategy; that is the part of their work most relevant to ours.)

3 Background: Game Theory and Linear Influence Games

In classical game theory (see, e.g., Fudenberg and Tirole 1991 for a textbook introduction), a normal-form game is defined by a set of players $V$ (e.g., we can let $V=\{1,\dots,n\}$ if there are $n$ players), and, for each player $i$, a set of actions, or pure strategies, $A_{i}$, and a payoff function $u_{i}:\times_{j\in V}A_{j}\to\mathbb{R}$ mapping the joint actions of all the players, given by the Cartesian product $\mathcal{A}\equiv\times_{j\in V}A_{j}$, to a real number. In non-cooperative game theory we assume players are greedy, rational and act independently, by which we mean that each player $i$ always wants to maximize their own utility, subject to the actions selected by others, irrespective of how the chosen optimal action helps or hurts others.

A core solution concept in non-cooperative game theory is that of a Nash equilibrium. A joint action $\mathbf{x}^{*}\in\mathcal{A}$ is a pure-strategy Nash equilibrium (PSNE) of a non-cooperative game if, for each player $i$, $x^{*}_{i}\in{\arg\max}_{x_{i}\in A_{i}}u_{i}(x_{i},\mathbf{x}^{*}_{-i})$; that is, $\mathbf{x}^{*}$ constitutes a mutual best response: no player $i$ has any incentive to unilaterally deviate from the prescribed action $x^{*}_{i}$, given the joint action of the other players $\mathbf{x}^{*}_{-i}\in\times_{j\in V-\{i\}}A_{j}$ in the equilibrium. In what follows, we denote a game by ${\mathcal{G}}$ and, because this paper concerns mostly PSNE, to simplify notation we denote the set of all pure-strategy Nash equilibria of ${\mathcal{G}}$ by

${\mathcal{NE}}({\mathcal{G}})\equiv\{\mathbf{x}^{*}\mid(\forall i\in V)\ x^{*}_{i}\in{\arg\max}_{x_{i}\in A_{i}}u_{i}(x_{i},\mathbf{x}^{*}_{-i})\}\;.$

A (directed) graphical game is a game-theoretic graphical model [Kearns et al., 2001]. It provides a succinct representation of normal-form games. In a graphical game, we have a (directed) graph $G=(V,E)$ in which each node in $V$ corresponds to a player in the game. The interpretation of the edges/arcs $E$ of $G$ is that the payoff function of player $i$ is only a function of the set of parents/neighbors $\mathcal{N}_{i}\equiv\{j\mid(i,j)\in E\}$ in $G$ (i.e., the set of players corresponding to nodes that point to the node corresponding to player $i$ in the graph). In the context of a graphical game, we refer to the $u_{i}$'s as the local payoff functions/matrices.
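As a concrete (if naive) illustration of the PSNE set ${\mathcal{NE}}({\mathcal{G}})$ defined above, the following sketch enumerates all PSNE of a small two-action game by brute force. The payoff representation (a function u(i, x) returning player $i$'s payoff at joint action x) and the function name are illustrative assumptions, and the $2^{n}$ enumeration is only meant for small $n$; the methods proposed later in the paper aim to avoid such exhaustive search.

from itertools import product

def psne_set(n, u):
    """Enumerate all pure-strategy Nash equilibria of an n-player game
    with actions {-1, +1}, where u(i, x) is player i's payoff at joint action x."""
    equilibria = []
    for x in product([-1, +1], repeat=n):
        x = list(x)
        is_psne = True
        for i in range(n):
            x_dev = x.copy()
            x_dev[i] = -x[i]              # the only unilateral deviation with two actions
            if u(i, x_dev) > u(i, x):     # a profitable deviation breaks the equilibrium
                is_psne = False
                break
        if is_psne:
            equilibria.append(tuple(x))
    return equilibria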

Linear influence games (LIGs) [Irfan and Ortiz, 2014] are a sub-class of 2-action graphical games with parametric payoff functions. For LIGs, we assume that we are given a matrix of influence weights $\mathbf{W}\in\mathbb{R}^{n\times n}$, with $\mathbf{diag}(\mathbf{W})=\mathbf{0}$, and a threshold vector $\mathbf{b}\in\mathbb{R}^{n}$. For each player $i$, we define the influence function $f_{i}(\mathbf{x}_{-i})\equiv\sum_{j\in\mathcal{N}_{i}}w_{ij}x_{j}-b_{i}={\mathbf{w}_{i,-i}}^{\rm T}\mathbf{x}_{-i}-b_{i}$ and the payoff function $u_{i}(\mathbf{x})\equiv x_{i}f_{i}(\mathbf{x}_{-i})$. We further assume binary actions: $A_{i}\equiv\{-1,+1\}$ for all $i$. The best response $x_{i}^{*}$ of player $i$ to the joint action $\mathbf{x}_{-i}$ of the other players is defined as

$\left\{\begin{array}{l}{\mathbf{w}_{i,-i}}^{\rm T}\mathbf{x}_{-i}>b_{i}\Rightarrow x^{*}_{i}=+1,\\ {\mathbf{w}_{i,-i}}^{\rm T}\mathbf{x}_{-i}<b_{i}\Rightarrow x^{*}_{i}=-1,\text{ and}\\ {\mathbf{w}_{i,-i}}^{\rm T}\mathbf{x}_{-i}=b_{i}\Rightarrow x^{*}_{i}\in\{-1,+1\}\end{array}\right\}\Leftrightarrow x^{*}_{i}({\mathbf{w}_{i,-i}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0\;.$
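To make this condition concrete, the following sketch (illustrative function name) checks whether a joint action $\mathbf{x}\in\{-1,+1\}^{n}$ is a PSNE of an LIG $(\mathbf{W},\mathbf{b})$ by testing $x_{i}({\mathbf{w}_{i,-i}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0$ for every player:

import numpy as np

def is_psne_of_lig(W, b, x):
    """W: (n, n) influence weights with zero diagonal; b: (n,) thresholds;
    x: (n,) joint action in {-1, +1}. Returns True iff x is a PSNE of the LIG."""
    x = np.asarray(x, dtype=float)
    margins = x * (W @ x - b)   # W @ x equals w_{i,-i}^T x_{-i} because diag(W) = 0
    return bool(np.all(margins >= 0))

Note that with $\mathbf{W}=\mathbf{0}$ and $\mathbf{b}=\mathbf{0}$ every joint action passes this test; this is exactly the "dummy" LIG discussed in Section 4.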

Intuitively, for any other player $j$, we can think of $w_{ij}\in\mathbb{R}$ as a weight parameter quantifying the "influence factor" that $j$ has on $i$, and we can think of $b_{i}\in\mathbb{R}$ as a threshold parameter quantifying the level of "tolerance" that player $i$ has for playing $-1$. (As we formally/mathematically define here, LIGs are 2-action graphical games with linear-quadratic payoff functions. Given our main interest in this paper on the PSNE solution concept, for the most part, we simply view LIGs as compact representations of the PSNE of graphical games that the algorithms of Irfan and Ortiz [2014] use for CSI. This is in contrast to a perhaps more natural, "intuitive" but still informal description/interpretation one may provide for instructive/pedagogical purposes based on "direct influences," as we do here. This view of LIGs is analogous to the modern, predominant view of Bayesian networks as compact representations of joint probability distributions that are also very useful for modeling uncertainty in complex systems and practical for probabilistic inference [Koller and Friedman, 2009]. Also analogous are the "intuitive" descriptions/interpretations of BN structures, used for instructive/pedagogical purposes, based on "causal" interactions between the random variables [Koller and Friedman, 2009].)

As discussed in Irfan and Ortiz [2014], LIGs are also a sub-class of polymatrix games [Janovskaja, 1968]. Furthermore, in the special case of $\mathbf{b}=\mathbf{0}$ and symmetric $\mathbf{W}$, an LIG becomes a party-affiliation game [Fabrikant et al., 2004].

In this paper, the use of the verb “influence” strictly refers to influences defined by the model.

Figure 1 provides a preview illustration of the application of our approach to congressional voting. (We present this game graph because many people express interest in "seeing" the type of games we learn on this particular data set. The reader should please understand that by presenting this graph we are definitely not implying or arguing that we can identify the ground-truth graph of "direct influences." We say this even in the very unlikely event that the "ground-truth model" were an LIG that faithfully captures the "true direct influences" in this U.S. Congress, something arguably no model could ever do. As we show later in Section 4.2, LIGs are not identifiable with respect to their local compact parametric representation encoding the game graph through their weights and biases, but only with respect to their PSNE, which are joint actions capturing a global property of the game that we really care about for CSI. Certainly, we could never validate the model parameters of an LIG at the local, microscopic level of "direct influences" using only the type of observational data we used to learn the model depicted by the graph in the figure. For that, we would need help from domain experts to design controlled experiments that would yield the right type of data for proper/rigorous scientific validation.)

4 Our Proposed Framework for Learning LIGs

Our goal is to learn the structure and parameters of an LIG from observed joint actions only (i.e., without any payoff data/information). (In principle, the learning framework itself is technically immediately/easily applicable to any class of simultaneous/one-shot games. Generalizing the algorithms and other theoretical results (e.g., on generalization error) while maintaining the tractability in sample complexity and computation may require significant effort.) Yet, for simplicity, most of the presentation in this section is actually in terms of general 2-action games. While we make sporadic references to LIGs throughout the section, it is not until we reach the end of the section that we present and discuss the particular instantiation of our proposed framework with LIGs.

Our main performance measure will be the average log-likelihood (although later we will consider misclassification-type error measures in the context of simultaneous classification, as a result of an approximation of the average log-likelihood). Our emphasis on a PSNE-based statistical model for the behavioral data results from the approach to causal strategic inference taken by Irfan and Ortiz [2014], which is strongly founded on PSNE. (The possibility that PSNE may not exist in some LIGs does not present a significant problem in our case because we are learning the game, and we can require that the output LIG has at least one PSNE. Indeed, in our approach, games with no PSNE achieve the lowest possible likelihood within our generative model of the data; said differently, games with PSNE have higher likelihoods than those that do not have any PSNE.)

Note that our problem is unsupervised, i.e., we do not know a priori which joint actions are PSNE and which ones are not. If our only goal were to find a game ${\mathcal{G}}$ in which all the given observed data is an equilibrium, then any "dummy" game, such as the "dummy" LIG ${\mathcal{G}}=(\mathbf{W},\mathbf{b})$ with $\mathbf{W}=\mathbf{0}$, $\mathbf{b}=\mathbf{0}$, would be an optimal solution because $|{\mathcal{NE}}({\mathcal{G}})|=2^{n}$. (Ng and Russell [2000] made a similar observation in the context of single-agent inverse reinforcement learning (IRL).) In this section, we present a probabilistic formulation that allows finding games that maximize the empirical proportion of equilibria in the data while keeping the true proportion of equilibria as low as possible. Furthermore, we show that trivial games, such as LIGs with $\mathbf{W}=\mathbf{0}$, $\mathbf{b}=\mathbf{0}$, obtain the lowest log-likelihood.

4.1 Our Proposed Generative Model of Behavioral Data

We propose the following simple generative (mixture) model for behavioral data based strictly in the context of "simultaneous"/one-shot play in non-cooperative game theory, again motivated by Irfan and Ortiz [2014]'s PSNE-based approach to causal strategic inference (CSI). (Model "simplicity" and "abstractions" are not necessarily a bad thing in practice. More "realism" often leads to more "complexity" in terms of model representation and computation, and to potentially poorer generalization performance as well [Kearns and Vazirani, 1994]. We believe that even if the data could be the result of complex cognitive, behavioral or neuronal processes underlying human decision making and social interactions, the practical guiding principle of model selection in ML, which governs the fundamental tradeoff between model complexity and generalization performance, still applies.) Let ${\mathcal{G}}$ be a game. With some probability $0<q<1$, a joint action $\mathbf{x}$ is chosen uniformly at random from ${\mathcal{NE}}({\mathcal{G}})$; otherwise, $\mathbf{x}$ is chosen uniformly at random from its complement set $\{-1,+1\}^{n}-{\mathcal{NE}}({\mathcal{G}})$. Hence, the generative model is a mixture model with mixture parameter $q$ corresponding to the probability that a stable outcome (i.e., a PSNE) of the game is observed, uniform over PSNE. Formally, the probability mass function (PMF) over joint behaviors $\{-1,+1\}^{n}$ parameterized by $({\mathcal{G}},q)$ is

$p_{({\mathcal{G}},q)}(\mathbf{x})=q\,\frac{1[\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}})]}{|{\mathcal{NE}}({\mathcal{G}})|}+(1-q)\,\frac{1[\mathbf{x}\notin{\mathcal{NE}}({\mathcal{G}})]}{2^{n}-|{\mathcal{NE}}({\mathcal{G}})|}\;,$ (1)

where we can think of $q$ as the "signal" level, and thus $1-q$ as the "noise" level in the data set.
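As a small sketch (illustrative names; it assumes a non-trivial game, i.e., $0<|{\mathcal{NE}}({\mathcal{G}})|<2^{n}$, consistent with Remark 1 below), Eq. (1) can be evaluated directly once the PSNE set is known, e.g., from a brute-force enumeration like the one sketched in Section 3:

def mixture_pmf(x, psne, n, q):
    """Probability of a joint action x (a tuple in {-1,+1}^n) under Eq. (1).
    psne is the set NE(G) of equilibria; q is the mixture ("signal") parameter.
    Assumes the non-trivial case 0 < |NE(G)| < 2**n."""
    num_ne = len(psne)
    if x in psne:
        return q / num_ne                    # uniform over equilibria, total mass q
    return (1.0 - q) / (2 ** n - num_ne)     # uniform over non-equilibria, total mass 1 - q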

Remark 1.

Note that in order for Eq. (1) to be a valid PMF for any ${\mathcal{G}}$, we need to enforce the following conditions: $|{\mathcal{NE}}({\mathcal{G}})|=0\Rightarrow q=0$ and $|{\mathcal{NE}}({\mathcal{G}})|=2^{n}\Rightarrow q=1$. Furthermore, note that in both cases ($|{\mathcal{NE}}({\mathcal{G}})|\in\{0,2^{n}\}$) the PMF becomes a uniform distribution. We also enforce the following condition: if $0<|{\mathcal{NE}}({\mathcal{G}})|<2^{n}$ then $q\not\in\{0,1\}$. (We can easily remove this condition at the expense of complicating the theoretical analysis on the generalization bounds, because of having to deal with those extreme cases.)

4.2 On PSNE-Equivalence and PSNE-Identifiability of Games

For any valid value of the mixture parameter $q$, the PSNE of a game ${\mathcal{G}}$ completely determines our generative model $p_{({\mathcal{G}},q)}$. Thus, given any such mixture parameter, two games with the same set of PSNE will induce the same PMF over the space of joint actions. (It is not hard to come up with examples of multiple games that have the same PSNE set. In fact, later in this section, we show three instances of LIGs with very different weight-matrix parameters that have this property.) Note that this is not a roadblock to our objective of learning LIGs because our main interest is the PSNE of the game, not the individual parameters that define it. We note that this situation is hardly exclusive to game-theoretic models: an analogous issue occurs in probabilistic graphical models (e.g., Bayesian networks).

Definition 2.

We say that two games ${\mathcal{G}}_{1}$ and ${\mathcal{G}}_{2}$ are PSNE-equivalent if and only if their PSNE sets are identical, i.e., ${\mathcal{G}}_{1}\equiv_{\mathcal{NE}}{\mathcal{G}}_{2}\Leftrightarrow{\mathcal{NE}}({\mathcal{G}}_{1})={\mathcal{NE}}({\mathcal{G}}_{2})$.

We often drop the “PSNE-” qualifier when clear from context.

Definition 3.

We say a set $\Upsilon^{*}$ of valid parameters $({\mathcal{G}},q)$ for the generative model is PSNE-identifiable with respect to the PMF $p_{({\mathcal{G}},q)}$ defined in Eq. (1) if and only if, for every pair $({\mathcal{G}}_{1},q_{1}),({\mathcal{G}}_{2},q_{2})\in\Upsilon^{*}$, if $p_{({\mathcal{G}}_{1},q_{1})}(\mathbf{x})=p_{({\mathcal{G}}_{2},q_{2})}(\mathbf{x})$ for all $\mathbf{x}\in\{-1,+1\}^{n}$ then ${\mathcal{G}}_{1}\equiv_{\mathcal{NE}}{\mathcal{G}}_{2}$ and $q_{1}=q_{2}$. We say a game ${\mathcal{G}}$ is PSNE-identifiable with respect to $\Upsilon^{*}$ and the PMF $p_{({\mathcal{G}},q)}$ if and only if there exists a $q$ such that $({\mathcal{G}},q)\in\Upsilon^{*}$.

Definition 4.

We define the true proportion of equilibria in the game ${\mathcal{G}}$ relative to all possible joint actions as

$\pi({\mathcal{G}})\equiv|{\mathcal{NE}}({\mathcal{G}})|/2^{n}\;.$ (2)

We also say that a game ${\mathcal{G}}$ is trivial if and only if $|{\mathcal{NE}}({\mathcal{G}})|\in\{0,2^{n}\}$ (or equivalently $\pi({\mathcal{G}})\in\{0,1\}$), and non-trivial if and only if $0<|{\mathcal{NE}}({\mathcal{G}})|<2^{n}$ (or equivalently $0<\pi({\mathcal{G}})<1$).

The following propositions establish that the condition $q>\pi({\mathcal{G}})$ ensures that the probability of an equilibrium is strictly greater than that of a non-equilibrium. The condition also guarantees that non-trivial games are identifiable.
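As a small illustrative example of these conditions: for $n=2$ players and a game ${\mathcal{G}}$ with a single equilibrium, $\pi({\mathcal{G}})=1/4$; choosing $q=1/2>\pi({\mathcal{G}})$ gives the equilibrium probability $q/|{\mathcal{NE}}({\mathcal{G}})|=1/2$, while each of the three non-equilibria has probability $(1-q)/(2^{n}-|{\mathcal{NE}}({\mathcal{G}})|)=1/6$, so equilibria are indeed strictly more likely than non-equilibria.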

Proposition 5.

Given a non-trivial game ${\mathcal{G}}$, the mixture parameter $q>\pi({\mathcal{G}})$ if and only if $p_{({\mathcal{G}},q)}(\mathbf{x}_{1})>p_{({\mathcal{G}},q)}(\mathbf{x}_{2})$ for any $\mathbf{x}_{1}\in{\mathcal{NE}}({\mathcal{G}})$ and $\mathbf{x}_{2}\notin{\mathcal{NE}}({\mathcal{G}})$.

Proof.

Note that $p_{({\mathcal{G}},q)}(\mathbf{x}_{1})=q/|{\mathcal{NE}}({\mathcal{G}})|>p_{({\mathcal{G}},q)}(\mathbf{x}_{2})=(1-q)/(2^{n}-|{\mathcal{NE}}({\mathcal{G}})|)\Leftrightarrow q>|{\mathcal{NE}}({\mathcal{G}})|/2^{n}$, and given Eq. (2), we prove our claim. ∎

Proposition 6.

Let $({\mathcal{G}}_{1},q_{1})$ and $({\mathcal{G}}_{2},q_{2})$ be two valid generative-model parameter tuples.

  • (a)

    If ${\mathcal{G}}_{1}\equiv_{{\mathcal{NE}}}{\mathcal{G}}_{2}$ and $q_{1}=q_{2}$ then $(\forall\mathbf{x})\ p_{({\mathcal{G}}_{1},q_{1})}(\mathbf{x})=p_{({\mathcal{G}}_{2},q_{2})}(\mathbf{x})$,

  • (b)

    Let ${\mathcal{G}}_{1}$ and ${\mathcal{G}}_{2}$ also be two non-trivial games such that $q_{1}>\pi({\mathcal{G}}_{1})$ and $q_{2}>\pi({\mathcal{G}}_{2})$. If $(\forall\mathbf{x})\ p_{({\mathcal{G}}_{1},q_{1})}(\mathbf{x})=p_{({\mathcal{G}}_{2},q_{2})}(\mathbf{x})$, then ${\mathcal{G}}_{1}\equiv_{\mathcal{NE}}{\mathcal{G}}_{2}$ and $q_{1}=q_{2}$.

Proof.

Let ${\mathcal{NE}}_{k}\equiv{\mathcal{NE}}({\mathcal{G}}_{k})$. First, we prove part (a). By Definition 2, ${\mathcal{G}}_{1}\equiv_{\mathcal{NE}}{\mathcal{G}}_{2}\Rightarrow{\mathcal{NE}}_{1}={\mathcal{NE}}_{2}$. Note that $p_{({\mathcal{G}}_{k},q_{k})}(\mathbf{x})$ in Eq. (1) depends only on the characteristic functions $1[\mathbf{x}\in{\mathcal{NE}}_{k}]$. Therefore, $(\forall\mathbf{x})\ p_{({\mathcal{G}}_{1},q_{1})}(\mathbf{x})=p_{({\mathcal{G}}_{2},q_{2})}(\mathbf{x})$.

Second, we prove part (b) by contradiction. Assume, wlog, $(\exists\mathbf{x})\ \mathbf{x}\in{\mathcal{NE}}_{1}\wedge\mathbf{x}\notin{\mathcal{NE}}_{2}$. Then $p_{({\mathcal{G}}_{1},q_{1})}(\mathbf{x})=p_{({\mathcal{G}}_{2},q_{2})}(\mathbf{x})$ implies by Eq. (1) that $q_{1}/|{\mathcal{NE}}_{1}|=(1-q_{2})/(2^{n}-|{\mathcal{NE}}_{2}|)$, and thus by Eq. (2) that $q_{1}/\pi({\mathcal{G}}_{1})=(1-q_{2})/(1-\pi({\mathcal{G}}_{2}))$. By assumption, we have $q_{1}>\pi({\mathcal{G}}_{1})$, which implies that $(1-q_{2})/(1-\pi({\mathcal{G}}_{2}))>1\Rightarrow q_{2}<\pi({\mathcal{G}}_{2})$, a contradiction. Hence, we have ${\mathcal{G}}_{1}\equiv_{\mathcal{NE}}{\mathcal{G}}_{2}$. Now assume $q_{1}\neq q_{2}$. Then $p_{({\mathcal{G}}_{1},q_{1})}(\mathbf{x})=p_{({\mathcal{G}}_{2},q_{2})}(\mathbf{x})$ implies, by Eq. (1) and ${\mathcal{G}}_{1}\equiv_{\mathcal{NE}}{\mathcal{G}}_{2}$ (and after some algebraic manipulations), that $(q_{1}-q_{2})\left(\frac{1[\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}}_{1})]}{|{\mathcal{NE}}({\mathcal{G}}_{1})|}-\frac{1[\mathbf{x}\notin{\mathcal{NE}}({\mathcal{G}}_{1})]}{2^{n}-|{\mathcal{NE}}({\mathcal{G}}_{1})|}\right)=0\Rightarrow\frac{1[\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}}_{1})]}{|{\mathcal{NE}}({\mathcal{G}}_{1})|}=\frac{1[\mathbf{x}\notin{\mathcal{NE}}({\mathcal{G}}_{1})]}{2^{n}-|{\mathcal{NE}}({\mathcal{G}}_{1})|}$, a contradiction. ∎

The last proposition, along with our definitions of “trivial” (as given in Definition 4) and “identifiable” (Definition 3), allows us to formally define our hypothesis space.

Definition 7.

Let $\mathcal{H}$ be a class of games of interest. We call the set $\Upsilon\equiv\{({\mathcal{G}},q)\mid{\mathcal{G}}\in\mathcal{H}\wedge 0<\pi({\mathcal{G}})<q<1\}$ the hypothesis space of non-trivial identifiable games and mixture parameters. We also refer to a game ${\mathcal{G}}\in\mathcal{H}$ that appears in some tuple $({\mathcal{G}},q)\in\Upsilon$ for some $q$ as a non-trivial identifiable game. (Technically, we should call the set $\Upsilon$ "the hypothesis space consisting of tuples of non-trivial games from $\mathcal{H}$ and mixture parameters identifiable up to PSNE, with respect to the probabilistic model defined in Eq. (1)." Similarly, we should call such a game ${\mathcal{G}}$ "a non-trivial game from $\mathcal{H}$ identifiable up to PSNE, with respect to the probabilistic model defined in Eq. (1)." We opted for brevity.)

Remark 8.

Recall that a trivial game induces a uniform PMF by Remark 1. Therefore, a non-trivial game is not equivalent to a trivial game, since by Proposition 5 non-trivial games do not induce uniform PMFs. (In general, Section 4.2 characterizes our hypothesis space (non-trivial identifiable games and mixture parameters) via two specific conditions. The first condition, non-triviality (explained in Remark 1), is $0<\pi({\mathcal{G}})<1$. The second condition, identifiability of the PSNE set from its related PMF (discussed in Propositions 5 and 6), is $\pi({\mathcal{G}})<q$. For completeness, in this remark, we clarify that the class of trivial games (uniform PMFs) is different from the class of non-trivial games (non-uniform PMFs).) Thus, in the rest of the paper we focus exclusively on non-trivial identifiable games; that is, games that produce non-uniform PMFs and for which the PSNE set is uniquely identified from their PMFs.

4.3 Additional Discussion on Modeling Choices

We now discuss other equilibrium concepts, such as mixed-strategy Nash equilibria (MSNE) and quantal response equilibria (QRE). We also discuss a more sophisticated noise process as well as a generalization of our model to non-uniform distributions; while likely more realistic, the alternative models are mathematically more complex and potentially less tractable computationally.

4.3.1 On Other Equilibrium Concepts

There is still quite a bit of debate as to the appropriateness of game-theoretic equilibrium concepts for modeling individual human behavior in a social context. Camerer's book on behavioral game theory [Camerer, 2003] addresses some of the issues. Our interpretation of Camerer's position is not that Nash equilibria are universally a bad predictor, but that they are not consistently the best, for reasons that are still not well understood. This point is best illustrated in Chapter 3, Figure 3.1 of Camerer [2003].

(Logit) quantal response equilibria (QRE) [McKelvey and Palfrey, 1995] have been proposed as an alternative to Nash equilibria in the context of behavioral game theory. Models based on QRE have been shown superior during initial play in some experimental settings, but prior work assumes known structure and observed payoffs, and only the "precision/rationality parameter" is estimated; see, e.g., Wright and Leyton-Brown [2010, 2012]. In a logit QRE, the precision parameter, typically denoted by $\lambda$, can be mathematically interpreted as the inverse-temperature parameter of individual Gibbs distributions over the pure strategies of each player $i$.

It is customary to compute the MLE for $\lambda$ from available data. To the best of our knowledge, all work on QRE assumes exact knowledge of the game payoffs, and thus no method has been proposed to simultaneously estimate the payoff functions $u_{i}$ when they are unknown. Indeed, computing the MLE for $\lambda$, given the payoff functions, is relatively efficient for normal-form games using standard techniques, but may be hard for graphical games; on the other hand, MLE estimation of the payoff functions themselves within a QRE model of behavior seems like a highly non-trivial optimization problem, and it is unclear whether it is even computationally tractable, even in normal-form games. At the very least, standard techniques do not apply and more sophisticated optimization algorithms or heuristics would have to be derived. Such extensions are clearly beyond the scope of this paper. (Note that despite the apparent similarity in mathematical expression between logit QRE and the PSNE of the LIG we obtain by using individual logistic regression, they are fundamentally different because of the complex correlations that the QRE conditions impose on the parameters $(\mathbf{W},\mathbf{b})$ of the payoff functions. It is unclear how to adapt techniques for logistic regression similar to the ones we used here to efficiently/tractably compute the MLE within the logit QRE framework.)

Wright and Leyton-Brown [2012] also considers even more mathematically complex variants of behavioral models that combine QRE with different models that account for constraints in “cognitive levels” of reasoning ability/effort, yet the estimation and usage of such models still assumes knowledge of the payoff functions.

It would be fair to say that most of the human-subject experiments in behavioral game theory involve only a handful of players [Camerer, 2003]. The scalability of those results to games with a large population of players is unclear.

Now, just as an example, we do not necessarily view a Senator's final vote as that of a single human individual anyway: after all, such a decision is (or should be) obtained in consultation with their staff and (one would at least hope) the constituency of the state they represent. Also, the final voting decision is taken after consultation or meetings between the staffs of the different senators. We view this underlying process as one of "cheap talk." While cheap talk may be an important process to study, in this paper we concentrate on the end result: the final vote. The reason is more than just scientific; as the congressional voting setting illustrates, data for such a process is sometimes not available, or would seem very hard to infer from the end-states alone. While our experiments concentrate on congressional voting data, because it is publicly available and easy to obtain, the same would hold for other settings such as Supreme Court decisions, voluntary vaccination, UN voting records and governmental decisions, to name a few. We speculate that in almost all those cases, only the end result is likely to be recorded, and little information would be available about the "cheap talk" process or "pre-play" period leading to the final decision.

In our work we consider PSNE because of our motivation to provide LIGs for use within the causal strategic inference framework of Irfan and Ortiz [2014] for modeling "influence" in large-population networks. Note that the universality of MSNE does not diminish the importance of PSNE in game theory. (Research work on the properties and computation of PSNE includes Rosenthal [1973], Gilboa and Zemel [1989], Stanford [1995], Rinott and Scarsini [2000], Fabrikant et al. [2004], Gottlob et al. [2005], Sureka and Wurman [2005], Daskalakis and Papadimitriou [2006], Dunkel [2007], Dunkel and Schulz [2006], Dilkina et al. [2007], Ackermann and Skopalik [2007], Hasan et al. [2008], Hasan and Galiana [2008], Ryan et al. [2010], Chapman et al. [2010], Hasan and Galiana [2010].) Indeed, a debate still exists within the game theory community as to the justification for randomization, especially in human contexts. While concentrating exclusively on PSNE may not make sense in all settings, it does make sense in some. (For example, in the context of congressional voting, we believe Senators almost always have full information about how some, if not all, other Senators they care about would vote. Said differently, we believe uncertainty in a Senator's final vote, by the time the vote is actually taken, is rare, and certainly not the norm. Hence, it is unclear how much there is to gain, in this particular setting, by considering possible randomization in the Senators' voting strategies.) In addition, were we to introduce mixed strategies into the inference and learning framework and model, we would be adding a considerable amount of complexity in almost all respects, thus requiring a substantive effort to study on its own. (For example, note that because in our setting we learn exclusively from observed joint actions, we could not assume knowledge of the internal mixed strategies of players. Perhaps we could generalize our model to allow for mixed strategies by defining a process in which a joint mixed strategy $\mathbf{p}$ from the set of MSNE (or its complement) is drawn according to some distribution, and then a (pure-strategy) realization $\mathbf{x}$ is drawn from $\mathbf{p}$ that would correspond to the observed joint actions. One problem we might face with this approach is that little is known about the structure of MSNE in general multi-player games. For example, it is not even clear that the set of MSNE is always measurable in general!)

4.3.2 On the Noise Process

Here we discuss a more sophisticated noise process as well as a generalization of our model to non-uniform distributions. The problem with these models is that they lead to a significantly more complex expression for the generative model and thus likelihood functions. This is in contrast to the simplicity afforded us by the generative model with a more global noise process defined above. (See Appendix A.2.1 for further discussion.)

In this paper we considered a “global” noise process, modeled using a parameter q corresponding to the probability that a sample observation is an equilibrium of the underlying hidden game. One could easily envision potentially better and more natural/realistic “local” noise processes, at the expense of producing a generative model that is significantly more complex, and less computationally amenable, than the one considered in this paper. For instance, we could use a noise process formed of many independent, individual noise processes, one for each player. (See Appendix A.2.2 for further discussion.)

4.4 Learning Games via MLE

We now formulate the problem of learning games as one of maximum likelihood estimation with respect to our PSNE-based generative model defined in Eq. (1) and the hypothesis space of non-trivial identifiable games and mixture parameters (Definition 7). We remind the reader that our problem is unsupervised; that is, we do not know a priori which joint actions are equilibria and which ones are not. We base our framework on the fact that games are PSNE-identifiable with respect to their induced PMF, under the condition that q>π(𝒢)q>\pi({\mathcal{G}}), by Proposition 6.

First, we introduce a shorthand notation for the Kullback-Leibler (KL) divergence between two Bernoulli distributions with parameters 0\leq p_{1}\leq 1 and 0\leq p_{2}\leq 1:

KL(p_{1}\|p_{2}) \equiv KL({\rm Bernoulli}(p_{1})\,\|\,{\rm Bernoulli}(p_{2})) = p_{1}\log\frac{p_{1}}{p_{2}}+(1-p_{1})\log\frac{1-p_{1}}{1-p_{2}}\;. (3)
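
As a quick sanity check of Eq. (3), the following small Python sketch (an illustration of ours; the function name and example values are not from the paper) evaluates the Bernoulli KL divergence, using the convention 0 log 0 = 0:

  import numpy as np

  def bernoulli_kl(p1, p2):
      # KL(Bernoulli(p1) || Bernoulli(p2)) as in Eq. (3), with 0*log(0) = 0.
      terms = []
      for a, b in [(p1, p2), (1.0 - p1, 1.0 - p2)]:
          if a > 0.0:
              terms.append(a * np.log(a / b))
      return sum(terms)

  assert abs(bernoulli_kl(0.3, 0.3)) < 1e-12   # zero when the parameters coincide
  print(bernoulli_kl(0.9, 0.5))                # approximately 0.368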

Using this function, we can derive the following expression of the MLE problem.

Lemma 9.

Given a data set \mathcal{D}=\{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(m)}\}, define the empirical proportion of equilibria, i.e., the proportion of samples in \mathcal{D} that are equilibria of {\mathcal{G}}, as

\widehat{\pi}({\mathcal{G}})\equiv\frac{1}{m}\sum_{l}1[\mathbf{x}^{(l)}\in{\mathcal{NE}}({\mathcal{G}})]\;. (4)

The MLE problem for the probabilistic model given in Eq. (1) can be expressed as finding:

(\widehat{{\mathcal{G}}},\widehat{q})\in{\arg\max}_{({\mathcal{G}},q)\in\Upsilon}\,\widehat{\mathcal{L}}({\mathcal{G}},q)\,,\quad{\rm where\ }\widehat{\mathcal{L}}({\mathcal{G}},q)=KL(\widehat{\pi}({\mathcal{G}})\|\pi({\mathcal{G}}))-KL(\widehat{\pi}({\mathcal{G}})\|q)-n\log 2\;, (5)

where \mathcal{H} and \Upsilon are as in Definition 7, and \pi({\mathcal{G}}) is defined as in Eq. (2). Also, the optimal mixture parameter is \widehat{q}=\min(\widehat{\pi}({\mathcal{G}}),1-\frac{1}{2m}).

Proof.

Let 𝒩𝒩(𝒢){\mathcal{NE}}\equiv{\mathcal{NE}}({\mathcal{G}}), ππ(𝒢)\pi\equiv\pi({\mathcal{G}}) and π^π^(𝒢)\widehat{\pi}\equiv\widehat{\pi}({\mathcal{G}}). First, for a non-trivial 𝒢{\mathcal{G}}, logp(𝒢,q)(𝐱(l))=logq|𝒩|\log p_{({\mathcal{G}},q)}(\mathbf{x}^{(l)})=\log\frac{q}{|{\mathcal{NE}}|} for 𝐱(l)𝒩\mathbf{x}^{(l)}\in{\mathcal{NE}}, and logp(𝒢,q)(𝐱(l))=log1q2n|𝒩|\log p_{({\mathcal{G}},q)}(\mathbf{x}^{(l)})=\log\frac{1-q}{2^{n}-|{\mathcal{NE}}|} for 𝐱(l)𝒩\mathbf{x}^{(l)}\notin{\mathcal{NE}}. The average log-likelihood ^(𝒢,q)=1mllogp𝒢,q(𝐱(l))=π^logq|𝒩|+(1π^)log1q2n|𝒩|=π^logqπ+(1π^)log1q1πnlog2\widehat{\mathcal{L}}({\mathcal{G}},q)=\frac{1}{m}\sum_{l}{\log p_{{\mathcal{G}},q}(\mathbf{x}^{(l)})}=\widehat{\pi}\log\frac{q}{|{\mathcal{NE}}|}+(1-\widehat{\pi})\log\frac{1-q}{2^{n}-|{\mathcal{NE}}|}=\widehat{\pi}\log\frac{q}{\pi}+(1-\widehat{\pi})\log\frac{1-q}{1-\pi}-n\log 2. By adding 0=π^logπ^+π^logπ^(1π^)log(1π^)+(1π^)log(1π^)0=-\widehat{\pi}\log\widehat{\pi}+\widehat{\pi}\log\widehat{\pi}-(1-\widehat{\pi})\log(1-\widehat{\pi})+(1-\widehat{\pi})\log(1-\widehat{\pi}), this can be rewritten as ^(𝒢,q)=π^logπ^π+(1π^)log1π^1ππ^logπ^q(1π^)log1π^1qnlog2\widehat{\mathcal{L}}({\mathcal{G}},q)=\widehat{\pi}\log\frac{\widehat{\pi}}{\pi}+(1-\widehat{\pi})\log\frac{1-\widehat{\pi}}{1-\pi}-\widehat{\pi}\log\frac{\widehat{\pi}}{q}-(1-\widehat{\pi})\log\frac{1-\widehat{\pi}}{1-q}-n\log 2, and by using Eq. (3) we prove our claim.

Note that by maximizing with respect to the mixture parameter qq and by properties of the KL divergence, we get KL(π^q^)=0q^=π^KL(\widehat{\pi}\|\widehat{q})=0\Leftrightarrow\widehat{q}=\widehat{\pi}. We define our hypothesis space Υ\Upsilon given the conditions in Remark 1 and Propositions 5 and 6. For the case π^=1\widehat{\pi}=1, we “shrink” the optimal mixture parameter q^\widehat{q} to 112m1-\frac{1}{2m} in order to enforce the second condition given in Remark 1. ∎

Remark 10.

Recall that a trivial game (e.g., LIG 𝒢=(𝐖,𝐛),𝐖=𝟎,𝐛=𝟎,π(𝒢)=1{\mathcal{G}}=(\mathbf{W},\mathbf{b}),\mathbf{W}=\mathbf{0},\mathbf{b}=\mathbf{0},\pi({\mathcal{G}})=1) induces a uniform PMF by Remark 1, and therefore its log-likelihood is nlog2-n\log 2. Note that the lowest log-likelihood for non-trivial identifiable games in Eq. (5) is nlog2-n\log 2 by setting the optimal mixture parameter q^=π^(𝒢)\widehat{q}=\widehat{\pi}({\mathcal{G}}) and given that KL(π^(𝒢)π(𝒢))0KL(\widehat{\pi}({\mathcal{G}})\|\pi({\mathcal{G}}))\geq 0.

Furthermore, Eq. (5) implies that for non-trivial identifiable games 𝒢{\mathcal{G}}, we expect the true proportion of equilibria π(𝒢)\pi({\mathcal{G}}) to be strictly less than the empirical proportion of equilibria π^(𝒢)\widehat{\pi}({\mathcal{G}}) in the given data. This is by setting the optimal mixture parameter q^=π^(𝒢)\widehat{q}=\widehat{\pi}({\mathcal{G}}) and the condition q>π(𝒢)q>\pi({\mathcal{G}}) in our hypothesis space.
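
To make Eqs. (4) and (5) concrete, here is a small Python sketch (our own illustration, not code from the paper) that, for an LIG (\mathbf{W},\mathbf{b}) with few enough players to enumerate all 2^{n} joint actions, computes the empirical and true proportions of equilibria and the resulting average log-likelihood with the optimal mixture parameter of Lemma 9:

  import itertools
  import numpy as np

  def is_psne(W, b, x):
      # x in {-1,+1}^n is a PSNE of the LIG (W, b) (diagonal of W assumed zero)
      # iff every player weakly best-responds: x_i (w_{i,-i}^T x_{-i} - b_i) >= 0.
      return bool(np.all(x * (W @ x - b) >= 0))

  def average_log_likelihood(W, b, data):
      # Average log-likelihood of Eq. (5); assumes a non-trivial, identifiable
      # game (0 < pi < pi_hat), and exhausts all joint actions (small n only).
      n, m = len(b), len(data)
      num_ne = sum(is_psne(W, b, np.array(x))
                   for x in itertools.product([-1, 1], repeat=n))
      pi = num_ne / 2.0 ** n                                        # true proportion, Eq. (2)
      pi_hat = np.mean([is_psne(W, b, np.array(x)) for x in data])  # Eq. (4)
      q = min(pi_hat, 1.0 - 1.0 / (2.0 * m))                        # optimal mixture parameter
      return (pi_hat * np.log(q / pi)
              + (1.0 - pi_hat) * np.log((1.0 - q) / (1.0 - pi))
              - n * np.log(2.0))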

4.4.1 Learning LIGs via MLE: Model Selection

Our main learning problem consists of inferring the structure and parameters of an LIG from data, with the main purpose being modeling the game’s PSNE, as reflected in the generative model. Note that, as we have previously stated, different games (i.e., with different payoff functions) can be PSNE-equivalent. For instance, the following three LIGs, with different weight parameter matrices, induce the same PSNE sets, i.e., {\mathcal{NE}}(\mathbf{W}_{k},\mathbf{0})=\{(-1,-1,-1),(+1,+1,+1)\} for k=1,2,3:272727Using the formal mathematical definition of “identifiability” in statistics, we would say that the LIG examples prove that the model parameters (\mathbf{W},\mathbf{b}) of an LIG {\mathcal{G}} are not identifiable with respect to the generative model p_{({\mathcal{G}},q)} defined in Eq. (1). We note that this situation is hardly exclusive to game-theoretic models. An example of an analogous issue in probabilistic graphical models is the fact that two Bayesian networks with different graph structures can represent not only the same conditional independence properties but also exactly the same set of joint probability distributions [Chickering, 2002, Koller and Friedman, 2009]. As a side note, we can distinguish these games with respect to their larger set of mixed-strategy Nash equilibria (MSNE), but, as stated previously, we do not consider MSNE in this paper because our primary motivation is the work of Irfan and Ortiz [2014], which is based on the concept of PSNE.

\mathbf{W}_{1}=\left[\begin{array}{ccc}0&0&0\\ 1/2&0&0\\ 0&1&0\end{array}\right],\ \ \mathbf{W}_{2}=\left[\begin{array}{ccc}0&0&0\\ 2&0&0\\ 1&0&0\end{array}\right],\ \ \mathbf{W}_{3}=\left[\begin{array}{ccc}0&1&1\\ 1&0&1\\ 1&1&0\end{array}\right]\;.

Thus, not only may the MLE not be unique, but all such PSNE-equivalent MLE games will achieve the same level of generalization performance. But, as reflected by our generative model, our main interest in the model parameters of an LIG is only with respect to the PSNE they induce, not the parameters per se. Hence, all we need is a way to select among PSNE-equivalent LIGs.

In our work, the identifiability or interpretability of the exact model parameters of LIGs is not our main interest. That is, in the research presented here, we did not seek to create alternative generative models with the objective of providing a theoretical guarantee that, given an infinite amount of data, we can recover the model parameters of an unknown ground-truth model generating the data, assuming the ground-truth model is an LIG. We opted for a more practical ML approach in which we just want to learn a single LIG that has good generalization performance (i.e., predictive performance in terms of average log-likelihood) with respect to our generative model. Given the nature of our generative model, this essentially translates to learning an LIG that captures the PSNE of the unknown ground-truth game as well as possible. Unfortunately, as we just illustrated, several LIGs with different model parameter values can have the same set of PSNE; thus, they would all achieve the same generalization performance.

Model selection is core to ML. One of the reasons we chose an ML approach to learning games is precisely the elegant way in which ML deals with the problem of how to select among multiple models that achieve the same level of performance: invoke the principle of Ockham’s razor and select the “simplest” model among those with the same (generalization) performance. This ML philosophy is not ad hoc; it is well established in practice and well supported by theory. Seminal results from the various theories of learning, such as computational and statistical learning theory and PAC learning, support the well-known ML adage that “learning requires bias.” In short, as is by now standard in ML, we measure the quality of our data-induced models via their generalization ability and invoke the principle of Ockham’s razor to bias our search toward simpler models using well-known and well-studied regularization techniques.

Now, exactly what “simple” and “bias” mean depends on the problem. In our case, we would prefer games with sparse graphs, if for no reason other than to simplify analysis, exploration, study, and (visual) “interpretability” of the game model by human experts, everything else being equal (i.e., models with the same explanatory power on the data as measured by the likelihoods).282828Just to be clear, here we mean “interpretability” not in any formal mathematical sense, or as often used in some areas of the social sciences such as economics, but instead as we typically use it in ML/AI textbooks, for example when referring to shallow/sparse decision trees, generally considered to be easier to explain and understand. Similarly, the hope here is that the “sparsity” or “simplicity” of the game graph/model would also make it simpler for human experts to explain or understand what about the model leads them to generate novel hypotheses, reach certain conclusions, or make certain inferences about the global strategic behavior of the agents/players, such as those based on the game’s PSNE and facilitated by CSI. We should also point out that, in preliminary empirical work, we have observed that the representationally sparser the LIG graph, the computationally easier it is for algorithms and other heuristics that operate on the LIG, such as those of Irfan and Ortiz [2014] for CSI. For example, among the LIGs presented above, using structural properties alone, we would generally prefer the first two models to the third, all else being equal (e.g., generalization performance).

5 Generalization Bound and VC-Dimension

In this section, we show a generalization bound for the maximum likelihood problem as well as an upper bound on the VC-dimension of LIGs. Our objective is to establish that with probability at least 1-\delta, for some confidence parameter 0<\delta<1, the maximum likelihood estimate is within \epsilon>0 of the optimal parameters, in terms of achievable expected log-likelihood.

Given the ground-truth distribution \mathcal{Q} of the data, let \bar{\pi}({\mathcal{G}}) be the expected proportion of equilibria, i.e.,

\bar{\pi}({\mathcal{G}})=\mathbb{P}_{\mathcal{Q}}[\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}})]\;,

and let \bar{\mathcal{L}}({\mathcal{G}},q) be the expected log-likelihood of a generative model from game {\mathcal{G}} and mixture parameter q, i.e.,

\bar{\mathcal{L}}({\mathcal{G}},q)=\mathbb{E}_{\mathcal{Q}}[\log p_{({\mathcal{G}},q)}(\mathbf{x})]\;.

Let \widehat{\theta}\equiv(\widehat{{\mathcal{G}}},\widehat{q}) be a maximum-likelihood estimate as in Eq. (5) (i.e., \widehat{\theta}\in{\arg\max}_{\theta\in\Upsilon}\widehat{\mathcal{L}}(\theta)), and let \bar{\theta}\equiv(\bar{{\mathcal{G}}},\bar{q}) be the maximum-expected-likelihood estimate: \bar{\theta}\in{\arg\max}_{\theta\in\Upsilon}\bar{\mathcal{L}}(\theta).292929If the ground-truth model belongs to the class of LIGs, then \bar{\theta} is also the ground-truth model, or PSNE-equivalent to it. We use these last definitions, without formally re-stating them, in the technical results presented in the remainder of this section.

Note that our hypothesis space Υ\Upsilon as stated in Definition 7 includes a continuous parameter qq that could potentially have infinite VC-dimension. The following lemma will allow us later to prove that uniform convergence for the extreme values of qq implies uniform convergence for all qq in the domain.

Lemma 11.

Consider any game 𝒢{\mathcal{G}} and, for 0<q′′<q<q<10<q^{\prime\prime}<q^{\prime}<q<1, let θ=(𝒢,q)\theta=({\mathcal{G}},q), θ=(𝒢,q)\theta^{\prime}=({\mathcal{G}},q^{\prime}) and θ′′=(𝒢,q′′)\theta^{\prime\prime}=({\mathcal{G}},q^{\prime\prime}). If, for any ϵ>0\epsilon>0 we have |^(θ)¯(θ)|ϵ/2|\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)|\leq\epsilon/2 and |^(θ′′)¯(θ′′)|ϵ/2|\widehat{\mathcal{L}}(\theta^{\prime\prime})-\bar{\mathcal{L}}(\theta^{\prime\prime})|\leq\epsilon/2, then |^(θ)¯(θ)|ϵ/2|\widehat{\mathcal{L}}(\theta^{\prime})-\bar{\mathcal{L}}(\theta^{\prime})|\leq\epsilon/2.

Proof.

Let 𝒩𝒩(𝒢){\mathcal{NE}}\equiv{\mathcal{NE}}({\mathcal{G}}), ππ(𝒢)\pi\equiv\pi({\mathcal{G}}), π^π^(𝒢)\widehat{\pi}\equiv\widehat{\pi}({\mathcal{G}}), π¯π¯(𝒢)\bar{\pi}\equiv\bar{\pi}({\mathcal{G}}), and 𝔼[]\mathbb{E}[\cdot] and []\mathbb{P}[\cdot] be the expectation and probability with respect to the ground-truth distribution 𝒬\mathcal{Q} of the data.

First note that for any θ=(𝒢,q)\theta=({\mathcal{G}},q), we have ¯(θ)=𝔼[logp(𝒢,q)(𝐱)]=𝔼[1[𝐱𝒩]logq|𝒩|+1[𝐱𝒩]log1q2n|𝒩|]=[𝐱𝒩]logq|𝒩|+[𝐱𝒩]log1q2n|𝒩|=π¯logq|𝒩|+(1π¯)log1q2n|𝒩|=π¯log(q1q2n|𝒩||𝒩|)+log1q2n|𝒩|=π¯log(q1q1ππ)+log1q1πnlog2\bar{\mathcal{L}}(\theta)=\mathbb{E}[\log p_{({\mathcal{G}},q)}(\mathbf{x})]=\mathbb{E}[1[{\mathbf{x}\in{\mathcal{NE}}}]\log\frac{q}{|{\mathcal{NE}}|}+1[{\mathbf{x}\notin{\mathcal{NE}}}]\log\frac{1-q}{2^{n}-|{\mathcal{NE}}|}]=\mathbb{P}[\mathbf{x}\in{\mathcal{NE}}]\log\frac{q}{|{\mathcal{NE}}|}+\mathbb{P}[\mathbf{x}\notin{\mathcal{NE}}]\log\frac{1-q}{2^{n}-|{\mathcal{NE}}|}=\bar{\pi}\log\frac{q}{|{\mathcal{NE}}|}+(1-\bar{\pi})\log\frac{1-q}{2^{n}-|{\mathcal{NE}}|}=\bar{\pi}\log\left(\frac{q}{1-q}\cdot\frac{2^{n}-|{\mathcal{NE}}|}{|{\mathcal{NE}}|}\right)+\log\frac{1-q}{2^{n}-|{\mathcal{NE}}|}=\bar{\pi}\log\left(\frac{q}{1-q}\cdot\frac{1-\pi}{\pi}\right)+\log\frac{1-q}{1-\pi}-n\log 2.

Similarly, for any θ=(𝒢,q)\theta=({\mathcal{G}},q), we have ^(θ)=π^log(q1q1ππ)+log1q1πnlog2\widehat{\mathcal{L}}(\theta)=\widehat{\pi}\log\left(\frac{q}{1-q}\cdot\frac{1-\pi}{\pi}\right)+\log\frac{1-q}{1-\pi}-n\log 2. So that ^(θ)¯(θ)=(π^π¯)log(q1q1ππ)\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)=(\widehat{\pi}-\bar{\pi})\log\left(\frac{q}{1-q}\cdot\frac{1-\pi}{\pi}\right).

Furthermore, the function q1q\frac{q}{1-q} is strictly monotonically increasing for 0q<10\leq q<1. If π^>π¯\widehat{\pi}>\bar{\pi} then ϵ/2^(θ′′)¯(θ′′)<^(θ)¯(θ)<^(θ)¯(θ)ϵ/2-\epsilon/2\leq\widehat{\mathcal{L}}(\theta^{\prime\prime})-\bar{\mathcal{L}}(\theta^{\prime\prime})<\widehat{\mathcal{L}}(\theta^{\prime})-\bar{\mathcal{L}}(\theta^{\prime})<\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)\leq\epsilon/2. Else, if π^<π¯\widehat{\pi}<\bar{\pi}, we have ϵ/2^(θ′′)¯(θ′′)>^(θ)¯(θ)>^(θ)¯(θ)ϵ/2\epsilon/2\geq\widehat{\mathcal{L}}(\theta^{\prime\prime})-\bar{\mathcal{L}}(\theta^{\prime\prime})>\widehat{\mathcal{L}}(\theta^{\prime})-\bar{\mathcal{L}}(\theta^{\prime})>\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)\geq-\epsilon/2. Finally, if π^=π¯\widehat{\pi}=\bar{\pi} then ^(θ′′)¯(θ′′)=^(θ)¯(θ)=^(θ)¯(θ)=0\widehat{\mathcal{L}}(\theta^{\prime\prime})-\bar{\mathcal{L}}(\theta^{\prime\prime})=\widehat{\mathcal{L}}(\theta^{\prime})-\bar{\mathcal{L}}(\theta^{\prime})=\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)=0. ∎

In the remainder of this section, denote by d(\mathcal{H})\equiv\left|\cup_{{\mathcal{G}}\in\mathcal{H}}\{{\mathcal{NE}}({\mathcal{G}})\}\right| the number of all possible PSNE sets induced by games in \mathcal{H}, the class of games of interest.

The following theorem shows that the expected log-likelihood of the maximum likelihood estimate θ^\widehat{\theta} converges in probability to that of the optimal θ¯=(𝒢¯,q¯)\bar{\theta}=(\bar{{\mathcal{G}}},\bar{q}), as the data size mm increases.

Theorem 12.

The following holds with 𝒬\mathcal{Q}-probability at least 1δ1-\delta:

\bar{\mathcal{L}}(\widehat{\theta})\geq\bar{\mathcal{L}}(\bar{\theta})-\left(\log\max(2m,\tfrac{1}{1-\bar{q}})+n\log 2\right)\sqrt{\frac{2}{m}\left(\log d(\mathcal{H})+\log\frac{4}{\delta}\right)}\;.
Proof.

First our objective is to find a lower bound for [¯(θ^)¯(θ¯)ϵ][¯(θ^)¯(θ¯)ϵ+(^(θ^)^(θ¯))][^(θ^)+¯(θ^)ϵ2,^(θ¯)¯(θ¯)ϵ2]=[^(θ^)¯(θ^)ϵ2,^(θ¯)¯(θ¯)ϵ2]=1[^(θ^)¯(θ^)>ϵ2^(θ¯)¯(θ¯)<ϵ2]\mathbb{P}[\bar{\mathcal{L}}(\widehat{\theta})-\bar{\mathcal{L}}(\bar{\theta})\geq-\epsilon]\geq\mathbb{P}[\bar{\mathcal{L}}(\widehat{\theta})-\bar{\mathcal{L}}(\bar{\theta})\geq-\epsilon+(\widehat{\mathcal{L}}(\widehat{\theta})-\widehat{\mathcal{L}}(\bar{\theta}))]\geq\mathbb{P}[-\widehat{\mathcal{L}}(\widehat{\theta})+\bar{\mathcal{L}}(\widehat{\theta})\geq-\frac{\epsilon}{2},\widehat{\mathcal{L}}(\bar{\theta})-\bar{\mathcal{L}}(\bar{\theta})\geq-\frac{\epsilon}{2}]=\mathbb{P}[\widehat{\mathcal{L}}(\widehat{\theta})-\bar{\mathcal{L}}(\widehat{\theta})\leq\frac{\epsilon}{2},\widehat{\mathcal{L}}(\bar{\theta})-\bar{\mathcal{L}}(\bar{\theta})\geq-\frac{\epsilon}{2}]=1-\mathbb{P}[\widehat{\mathcal{L}}(\widehat{\theta})-\bar{\mathcal{L}}(\widehat{\theta})>\frac{\epsilon}{2}\vee\widehat{\mathcal{L}}(\bar{\theta})-\bar{\mathcal{L}}(\bar{\theta})<-\frac{\epsilon}{2}].

Let q~max(112m,q¯)\widetilde{q}\equiv\max(1-\frac{1}{2m},\bar{q}). Now, we have [^(θ^)¯(θ^)>ϵ2^(θ¯)¯(θ¯)<ϵ2][(θΥ,qq~)|^(θ)¯(θ)|>ϵ2]=[(θ,𝒢,q{π(𝒢),q~})|^(θ)¯(θ)|>ϵ2]\mathbb{P}[\widehat{\mathcal{L}}(\widehat{\theta})-\bar{\mathcal{L}}(\widehat{\theta})>\frac{\epsilon}{2}\vee\widehat{\mathcal{L}}(\bar{\theta})-\bar{\mathcal{L}}(\bar{\theta})<-\frac{\epsilon}{2}]\leq\mathbb{P}[(\exists\theta\in\Upsilon,q\leq\widetilde{q}){\rm\ }|\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)|>\frac{\epsilon}{2}]=\mathbb{P}[(\exists\theta,{\mathcal{G}}\in\mathcal{H},q\in\{\pi({\mathcal{G}}),\widetilde{q}\}){\rm\ }|\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)|>\frac{\epsilon}{2}]. The last equality follows from invoking Lemma 11.

Note that 𝔼[^(θ)]=¯(θ)\mathbb{E}[\widehat{\mathcal{L}}(\theta)]=\bar{\mathcal{L}}(\theta) and that since π(𝒢)qq~\pi({\mathcal{G}})\leq q\leq\widetilde{q}, the log-likelihood is bounded as (𝐱)Blogp(𝒢,q)(𝐱)0(\forall\mathbf{x}){\rm\ }-B\leq\log p_{({\mathcal{G}},q)}(\mathbf{x})\leq 0, where B=log11q~+nlog2=logmax(2m,11q¯)+nlog2B=\log\frac{1}{1-\widetilde{q}}+n\log 2=\log\max(2m,\frac{1}{1-\bar{q}})+n\log 2. Therefore, by Hoeffding’s inequality, we have [|^(θ)¯(θ)|>ϵ2]2emϵ22B2\mathbb{P}[|\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)|>\frac{\epsilon}{2}]\leq 2e^{-\frac{m\epsilon^{2}}{2B^{2}}}.

Furthermore, note that there are 2d()2d(\mathcal{H}) possible parameters θ\theta, since we need to consider only two values of q{π(𝒢),q~})q\in\{\pi({\mathcal{G}}),\widetilde{q}\}) and because the number of all possible PSNE sets induced by games in \mathcal{H} is d()|𝒢{𝒩(𝒢)}|d(\mathcal{H})\equiv\left|\cup_{{\mathcal{G}}\in\mathcal{H}}\{{\mathcal{NE}}({\mathcal{G}})\}\right|. Therefore, by the union bound we get the following uniform convergence [(θ,𝒢,q{π(𝒢),q~})|^(θ)¯(θ)|>ϵ2]4d()[|^(θ)¯(θ)|>ϵ2]4d()emϵ22B2=δ\mathbb{P}[(\exists\theta,{\mathcal{G}}\in\mathcal{H},q\in\{\pi({\mathcal{G}}),\widetilde{q}\}){\rm\ }|\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)|>\frac{\epsilon}{2}]\leq 4d(\mathcal{H})\mathbb{P}[|\widehat{\mathcal{L}}(\theta)-\bar{\mathcal{L}}(\theta)|>\frac{\epsilon}{2}]\leq 4d(\mathcal{H})e^{-\frac{m\epsilon^{2}}{2B^{2}}}=\delta. Finally, by solving for δ\delta we prove our claim. ∎

Remark 13.

A more elaborate analysis allows us to tighten the bound in Theorem 12 from \mathcal{O}(\log\frac{1}{1-\bar{q}}) to \mathcal{O}(\log\frac{\bar{q}}{1-\bar{q}}). We chose to provide the former result for clarity of presentation.

The following theorem establishes the complexity of the class of LIGs, which implies that the term logd()\log d(\mathcal{H}) of the generalization bound in Theorem 12 is only polynomial in the number of players nn.

Theorem 14.

If \mathcal{H} is the class of LIGs, then d(\mathcal{H})\equiv\left|\cup_{{\mathcal{G}}\in\mathcal{H}}\{{\mathcal{NE}}({\mathcal{G}})\}\right|\leq 2^{n\frac{n(n+1)}{2}+1}\leq 2^{n^{3}}.

Proof.

The logarithm of the number of possible pure-strategy Nash equilibria sets supported by \mathcal{H} (i.e., that can be produced by some game in \mathcal{H}) is upper bounded by the VC-dimension of the class of neural networks with a single hidden layer of nn units and n+(n2)n+{n\choose 2} input units, linear threshold activation functions, and constant output weights.

For every LIG {\mathcal{G}}=(\mathbf{W},\mathbf{b}) in \mathcal{H}, define the neural network with a single layer of n hidden units, in which n of the inputs correspond to the linear terms x_{1},\ldots,x_{n} and {n\choose 2} correspond to the quadratic polynomial terms x_{i}x_{j} for all pairs of players (i,j), 1\leq i<j\leq n. For every hidden unit i, the weights corresponding to the linear terms x_{1},\ldots,x_{n} are -b_{1},\ldots,-b_{n}, respectively, while the weights corresponding to the quadratic terms x_{i}x_{j} are -w_{ij}, for all pairs of players (i,j), 1\leq i<j\leq n, respectively. The weights of the bias terms of all the hidden units are set to 0. All n output weights are set to 1, while the weight of the output bias term is set to 0. The output of the neural network is 1[\mathbf{x}\notin{\mathcal{NE}}({\mathcal{G}})]. Note that we define the neural network to classify non-equilibrium as opposed to equilibrium in order to keep the convention in the neural network literature that the threshold function outputs 0 for input 0. The alternative is to redefine the threshold function to output 1 instead for input 0.

Finally, we use the VC-dimension of neural networks [Sontag, 1998]. ∎

From Theorems 12 and 14, we state the generalization bounds for LIGs.

Corollary 15.

The following holds with 𝒬\mathcal{Q}-probability at least 1δ1-\delta:

\bar{\mathcal{L}}(\widehat{\theta})\geq\bar{\mathcal{L}}(\bar{\theta})-\left(\log\max(2m,\tfrac{1}{1-\bar{q}})+n\log 2\right)\sqrt{\frac{2}{m}\left(n^{3}\log 2+\log\frac{4}{\delta}\right)}\;,

where \mathcal{H} is the class of LIGs, in which case \Upsilon\equiv\{({\mathcal{G}},q)\mid{\mathcal{G}}\in\mathcal{H}\wedge 0<\pi({\mathcal{G}})<q<1\} (Definition 7) becomes the hypothesis space of non-trivial identifiable LIGs and mixture parameters.
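
As a rough numeric illustration of how this bound scales (the example values below are our own, not from the paper), the slack term in Corollary 15 can be evaluated directly:

  import numpy as np

  def lig_generalization_slack(n, m, q_bar, delta=0.05):
      # Slack term of Corollary 15: how far the expected log-likelihood of the
      # MLE may fall below the optimum, with probability at least 1 - delta.
      scale = np.log(max(2 * m, 1.0 / (1.0 - q_bar))) + n * np.log(2.0)
      rate = np.sqrt(2.0 / m * (n ** 3 * np.log(2.0) + np.log(4.0 / delta)))
      return scale * rate

  for m in [10**4, 10**6, 10**8]:   # n = 20 players; the slack shrinks as m grows
      print(m, lig_generalization_slack(n=20, m=m, q_bar=0.9))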

6 Algorithms

In this section, we approximate the maximum likelihood problem by maximizing the number of observed equilibria in the data, suitable for a hypothesis space of games with small true proportion of equilibria. We then present our convex loss minimization approach. We also discuss baseline methods such as sigmoidal approximation and exhaustive search.

But first, let us discuss some negative results that justifies the use of simple approaches. Irfan and Ortiz [2014] showed that counting the number of Nash equilibria in LIGs is #P-complete; thus, computing the log-likelihood function, and therefore MLE, is NP-hard.303030This is not a disadvantage relative to probabilistic graphical models, since computing the log-likelihood function is also NP-hard for Ising models and Markov random fields in general, while learning is also NP-hard for Bayesian networks. General approximation techniques such as pseudo-likelihood estimation do not lead to tractable methods for learning LIGs.313131We show that evaluating the pseudo-likelihood function for our generative model is NP-hard. Consider a non-trivial LIG 𝒢=(𝐖,𝐛){\mathcal{G}}=(\mathbf{W},\mathbf{b}). Furthermore, assume 𝒢{\mathcal{G}} has a single non-absolutely-indifferent player ii and absolutely-indifferent players ji\forall j\neq i; that is, assume that (𝐰i,i,bi)𝟎(\mathbf{w}_{{i},{{-i}}},b_{i})\neq\mathbf{0} and (ji)(𝐰j,j,bj)=𝟎(\forall j\neq i){\rm\ }(\mathbf{w}_{{j},{{-j}}},b_{j})=\mathbf{0} (See Definition 19). Let fi(𝐱i)𝐰i,iT𝐱ibif_{i}(\mathbf{x}_{-i})\equiv{\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i}, we have 1[𝐱𝒩(𝒢)]=1[xifi(𝐱i)0]1[{\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}})}]=1[{x_{i}f_{i}(\mathbf{x}_{-i})\geq 0}] and therefore p(𝒢,q)(𝐱)=q1[xifi(𝐱i)0]|𝒩(𝒢)|+(1q)11[xifi(𝐱i)0]2n|𝒩(𝒢)|p_{({\mathcal{G}},q)}(\mathbf{x})=q\frac{1[{x_{i}f_{i}(\mathbf{x}_{-i})\geq 0}]}{|{\mathcal{NE}}({\mathcal{G}})|}+(1-q)\frac{1-1[{x_{i}f_{i}(\mathbf{x}_{-i})\geq 0}]}{2^{n}-|{\mathcal{NE}}({\mathcal{G}})|}. The result follows because computing |𝒩(𝒢)||{\mathcal{NE}}({\mathcal{G}})| is #P-complete, even for this specific instance of a single non-absolutely-indifferent player [Irfan and Ortiz, 2014]. From an optimization perspective, the log-likelihood function is not continuous because of the number of equilibria. Therefore, we cannot rely on concepts such as Lipschitz continuity.323232To prove that counting the number of equilibria is not (Lipschitz) continuous, we show how small changes in the parameters 𝒢=(𝐖,𝐛){\mathcal{G}}=(\mathbf{W},\mathbf{b}) can produce big changes in |𝒩(𝒢)||{\mathcal{NE}}({\mathcal{G}})|. For instance, consider two games 𝒢k=(𝐖k,𝐛k){\mathcal{G}}_{k}=(\mathbf{W}_{k},\mathbf{b}_{k}), where 𝐖1=𝟎,𝐛1=𝟎,|𝒩(𝒢1)|=2n\mathbf{W}_{1}=\mathbf{0},\mathbf{b}_{1}=\mathbf{0},|{\mathcal{NE}}({\mathcal{G}}_{1})|=2^{n} and 𝐖2=ε(𝟏𝟏T𝐈),𝐛2=𝟎,|𝒩(𝒢2)|=2\mathbf{W}_{2}=\varepsilon(\mathbf{1}{\mathbf{1}}^{\rm T}-\mathbf{I}),\mathbf{b}_{2}=\mathbf{0},|{\mathcal{NE}}({\mathcal{G}}_{2})|=2 for ε>0\varepsilon>0. For ε0\varepsilon\rightarrow 0, any p\ell_{p}-norm 𝐖1𝐖2p0\|\mathbf{W}_{1}-\mathbf{W}_{2}\|_{p}\rightarrow 0 but |𝒩(𝒢1)||𝒩(𝒢2)|=2n2|{\mathcal{NE}}({\mathcal{G}}_{1})|-|{\mathcal{NE}}({\mathcal{G}}_{2})|=2^{n}-2 remains constant. Furthermore, bounding the number of equilibria by known bounds for Ising models leads to trivial bounds.333333The log-partition function of an Ising model is a trivial bound for counting the number of equilibria. 
To see this, let fi(𝐱i)𝐰i,iT𝐱ibif_{i}(\mathbf{x}_{-i})\equiv{\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i}, |𝒩(𝒢)|=𝐱i1[xifi(𝐱i)0]𝐱iexifi(𝐱i)=𝐱e𝐱T𝐖𝐱𝐛T𝐱=𝒵(12(𝐖+𝐖T),𝐛)|{\mathcal{NE}}({\mathcal{G}})|=\sum_{\mathbf{x}}\prod_{i}1[{x_{i}f_{i}(\mathbf{x}_{-i})\geq 0}]\leq\sum_{\mathbf{x}}\prod_{i}{e^{x_{i}f_{i}(\mathbf{x}_{-i})}}=\sum_{\mathbf{x}}{e^{{\mathbf{x}}^{\rm T}\mathbf{W}\mathbf{x}-{\mathbf{b}}^{\rm T}\mathbf{x}}}=\mathcal{Z}(\frac{1}{2}(\mathbf{W}+{\mathbf{W}}^{\rm T}),\mathbf{b}), where 𝒵\mathcal{Z} denotes the partition function of an Ising model. Given the convexity of 𝒵\mathcal{Z} [Koller and Friedman, 2009], and that the gradient vanishes at 𝐖=𝟎,𝐛=𝟎\mathbf{W}=\mathbf{0},\mathbf{b}=\mathbf{0}, we know that 𝒵(12(𝐖+𝐖T),𝐛)2n\mathcal{Z}(\frac{1}{2}(\mathbf{W}+{\mathbf{W}}^{\rm T}),\mathbf{b})\geq 2^{n}, which is the maximum |𝒩(𝒢)||{\mathcal{NE}}({\mathcal{G}})|.

6.1 An Exact Quasi-Linear Method for General Games: Sample-Picking

As a first approach, consider solving the maximum likelihood estimation problem in Eq. (5) by an exact exhaustive search algorithm. This algorithm iterates through all possible Nash equilibria sets, i.e., for s=0,\dots,2^{n}, we generate all possible sets of size s with elements from the joint-action space \{-1,+1\}^{n}. Recall that there exist {2^{n}\choose s} such sets of size s, and since \sum_{s=0}^{2^{n}}{2^{n}\choose s}=2^{2^{n}}, the search space is super-exponential in the number of players n.

Algorithm 1 Sample-Picking for General Games
  Input: Data set \mathcal{D}=\{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(m)}\}.
  Compute the unique samples \mathbf{y}^{(1)},\dots,\mathbf{y}^{(U)} and their frequencies \widehat{p}^{(1)},\dots,\widehat{p}^{(U)} in the data set \mathcal{D}.
  Sort the unique joint actions by their frequency such that \widehat{p}^{(1)}\geq\widehat{p}^{(2)}\geq\dots\geq\widehat{p}^{(U)}.
  for each unique sample k=1,\dots,U do
   Define {\mathcal{G}}_{k} by the Nash equilibria set {\mathcal{NE}}({\mathcal{G}}_{k})=\{\mathbf{y}^{(1)},\dots,\mathbf{y}^{(k)}\}.
   Compute the log-likelihood \widehat{\mathcal{L}}({\mathcal{G}}_{k},\widehat{q}_{k}) in Eq. (5) (note that \widehat{q}_{k}=\widehat{\pi}({\mathcal{G}}_{k})=\frac{1}{m}(\widehat{p}^{(1)}+\dots+\widehat{p}^{(k)}) and \pi({\mathcal{G}}_{k})=\frac{k}{2^{n}}).
  end for
  Output: The game {\mathcal{G}}_{\widehat{k}} such that \widehat{k}={\arg\max}_{k}\widehat{\mathcal{L}}({\mathcal{G}}_{k},\widehat{q}_{k}).

Based on a few observations, we can obtain an \mathcal{O}(m\log m) algorithm for m samples. First, note that the above method does not constrain the set of Nash equilibria in any fashion. Therefore, only joint actions that are observed in the data are candidates for being Nash equilibria in order to maximize the log-likelihood. This is because introducing an unobserved joint action increases the true proportion of equilibria without increasing the empirical proportion of equilibria, thus leading to a lower log-likelihood in Eq. (5). Second, given a fixed number of Nash equilibria k, the best strategy is to pick the k joint actions that appear most frequently in the observed data. This maximizes the empirical proportion of equilibria, which maximizes the log-likelihood. Based on these observations, we propose Algorithm 1.
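
The following Python sketch (our own illustration; names and the simple tie-handling are our assumptions) implements the sample-picking idea: sort the unique observed joint actions by frequency and scan the prefix sizes k, scoring each candidate PSNE set with the log-likelihood of Eq. (5):

  from collections import Counter
  import numpy as np

  def sample_picking(data, n):
      # data: list of joint actions (tuples in {-1,+1}^n). Returns the candidate
      # PSNE set (a prefix of the most frequent joint actions) maximizing Eq. (5).
      m = len(data)
      counts = Counter(map(tuple, data)).most_common()    # sorted by frequency
      best_ll, best_ne, cum = -np.inf, None, 0
      for k, (_, c) in enumerate(counts, start=1):
          cum += c
          pi_hat = cum / m                      # empirical proportion of equilibria
          pi = k / 2.0 ** n                     # true proportion for this PSNE set
          q = min(pi_hat, 1.0 - 1.0 / (2 * m))  # optimal mixture parameter (Lemma 9)
          ll = (pi_hat * np.log(q / pi)
                + (1.0 - pi_hat) * np.log((1.0 - q) / (1.0 - pi))
                - n * np.log(2.0))
          if ll > best_ll:
              best_ll, best_ne = ll, [y for y, _ in counts[:k]]
      return best_ne, best_ll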

As a side note, the fact that general games do not constrain the set of Nash equilibria makes the method more likely to over-fit. On the other hand, LIGs will potentially include unobserved equilibria, given the linearity constraints in the search space, and thus are less likely to over-fit.

6.2 An Exact Super-Exponential Method for LIGs: Exhaustive Search

Note that in the previous subsection we searched the space of all possible games, not only LIGs. First note that sample-picking for linear games is NP-hard; i.e., at any iteration of sample-picking, checking whether the set of Nash equilibria {\mathcal{NE}} corresponds to an LIG or not is equivalent to the following constraint satisfaction problem with linear constraints:

\min_{\mathbf{W},\mathbf{b}}\ 1\quad{\rm s.t.}\quad(\forall\mathbf{x}\in{\mathcal{NE}})\ x_{1}(\mathbf{w}_{1,-1}^{\rm T}\mathbf{x}_{-1}-b_{1})\geq 0\,\wedge\dots\wedge\,x_{n}(\mathbf{w}_{n,-n}^{\rm T}\mathbf{x}_{-n}-b_{n})\geq 0\;,\quad(\forall\mathbf{x}\notin{\mathcal{NE}})\ x_{1}(\mathbf{w}_{1,-1}^{\rm T}\mathbf{x}_{-1}-b_{1})<0\,\vee\dots\vee\,x_{n}(\mathbf{w}_{n,-n}^{\rm T}\mathbf{x}_{-n}-b_{n})<0\;. (6)

Note that Eq. (6) contains “or” operators in order to account for the non-equilibria. This makes the problem of finding a (\mathbf{W},\mathbf{b}) that satisfies such conditions NP-hard for a non-empty complement set \{-1,+1\}^{n}-{\mathcal{NE}}. Furthermore, since sample-picking only considers observed equilibria, the search is not optimal with respect to the space of LIGs.

Regarding a more refined approach for enumerating LIGs only, note that in an LIG each player separates the hypercube vertices with a linear function, i.e., for \mathbf{v}\equiv(\mathbf{w}_{i,-i},b_{i}) and \mathbf{y}\equiv(x_{i}\mathbf{x}_{-i},-x_{i})\in\{-1,+1\}^{n} we have x_{i}(\mathbf{w}_{i,-i}^{\rm T}\mathbf{x}_{-i}-b_{i})=\mathbf{v}^{\rm T}\mathbf{y}. Assume we assign a binary label to each vertex \mathbf{y}; then note that not all possible labelings are linearly separable. Labelings that are linearly separable are called linear threshold functions (LTFs). A lower bound on the number of LTFs was first provided in Muroga [1965], which showed that the number of LTFs is at least \alpha(n)\equiv 2^{0.33048n^{2}}. Tighter lower bounds were shown later in Yajima and Ibaraki [1965] for n\geq 6 and in Muroga and Toda [1966] for n\geq 8. Regarding an upper bound, Winder [1960] showed that the number of LTFs is at most \beta(n)\equiv 2^{n^{2}}. By using such bounds for all players, we can conclude that there are at least {\alpha(n)}^{n}=2^{0.33048n^{3}} and at most {\beta(n)}^{n}=2^{n^{3}} LIGs (the latter is indeed another upper bound on the VC-dimension of the class of LIGs; the bound in Theorem 14 is tighter and uses bounds on the VC-dimension of neural networks). The bounds discussed above would bound the time-complexity of a search algorithm if we could easily enumerate all LTFs for a single player. Unfortunately, this seems to be far from a trivial problem. By using results in Muroga [1971], a weight vector \mathbf{v} with integer entries such that (\forall i)\ |v_{i}|\leq\beta(n)\equiv{(n+1)}^{(n+1)/2}/2^{n} is sufficient to realize all possible LTFs. Therefore, we can conclude that enumerating LIGs takes at most {(2\beta(n)+1)}^{n^{2}}\approx{(\frac{\sqrt{n+1}}{2})}^{n^{3}} steps, and we propose the use of this method only for n\leq 4.

For n=4n=4 we found that the number of possible PSNE sets induced by LIGs is 23,706. Experimentally, we did not find differences between this method and sample-picking since most of the time, the model with maximum likelihood was an LIG.

6.3 From Maximum Likelihood to Maximum Empirical Proportion of Equilibria

We approximately perform maximum likelihood estimation for LIGs by maximizing the empirical proportion of equilibria, i.e., the proportion of equilibria in the observed data. This strategy allows us to avoid computing \pi({\mathcal{G}}) as in Eq. (2) for maximum likelihood estimation (given its dependence on |{\mathcal{NE}}({\mathcal{G}})|). We propose this approach for games whose true proportion of equilibria is small with high probability, i.e., with probability at least 1-\delta we have \pi({\mathcal{G}})\leq\frac{\kappa^{n}}{\delta} for 1/2\leq\kappa<1. In particular, we will show in Section 7 that for LIGs we have \kappa=3/4. Given this, our approximate problem relies on a bound of the log-likelihood that holds with high probability. We also show that, under very mild conditions, the parameters ({\mathcal{G}},q) belong to the hypothesis space of the original problem with high probability.

First, we derive bounds on the log-likelihood function.

Lemma 16.

Given a non-trivial game {\mathcal{G}} with 0<\pi({\mathcal{G}})<\widehat{\pi}({\mathcal{G}}), the KL divergence in the log-likelihood function in Eq. (5) is bounded as follows:

-\widehat{\pi}({\mathcal{G}})\log\pi({\mathcal{G}})-\log 2<KL(\widehat{\pi}({\mathcal{G}})\|\pi({\mathcal{G}}))<-\widehat{\pi}({\mathcal{G}})\log\pi({\mathcal{G}})\;.
Proof.

Let ππ(𝒢)\pi\equiv\pi({\mathcal{G}}) and π^π^(𝒢)\widehat{\pi}\equiv\widehat{\pi}({\mathcal{G}}). Note that α(π)limπ^0KL(π^π)=0\alpha(\pi)\equiv\lim_{\widehat{\pi}\rightarrow 0}{KL(\widehat{\pi}\|\pi)}=0,343434Here we are making the implicit assumption that π<π^\pi<\widehat{\pi}. This is sensible. For example, in most models learned from the congressional voting data using a variety of learning algorithms we propose, the total number of PSNE would range roughly from 100K—1M; using base 2, this is roughly from 2162^{16}2202^{20}. This may look like a huge number until one recognizes that there could potential be 21002^{100} PSNE. Hence, we have that π\pi would be in the range of 2842^{-84}2802^{-80}. In fact, we believe this holds more broadly because, as a general objective, we want models that can capture as many PSNE behavior as possible but no more than needed, which tend to reduce the PSNE of the learned models, and thus their π\pi values, while simultaneously trying to increase π^\widehat{\pi} as much as possible. and β(π)limπ^1KL(π^π)=logπnlog2\beta(\pi)\equiv\lim_{\widehat{\pi}\rightarrow 1}{KL(\widehat{\pi}\|\pi)}=-\log\pi\leq n\log 2. Since the function is convex we can upper-bound it by α(π)+(β(π)α(π))π^=π^logπ\alpha(\pi)+(\beta(\pi)-\alpha(\pi))\widehat{\pi}=-\widehat{\pi}\log\pi.

To find a lower bound, we find the point in which the derivative of the original function is equal to the slope of the upper bound, i.e., KL(π^π)π^=β(π)α(π)=logπ\frac{\partial KL(\widehat{\pi}\|\pi)}{\partial\widehat{\pi}}=\beta(\pi)-\alpha(\pi)=-\log\pi, which gives π^=12π\widehat{\pi}^{*}=\frac{1}{2-\pi}. Then, the maximum difference between the upper bound and the original function is given by limπ0π^logπKL(π^π)=log2\lim_{\pi\rightarrow 0}{-\widehat{\pi}^{*}\log\pi-KL(\widehat{\pi}^{*}\|\pi)}=\log 2. ∎

Note that the lower and upper bounds are very informative when \pi({\mathcal{G}})\rightarrow 0 (or, in our setting, when n\rightarrow+\infty), since \log 2 becomes small when compared to -\log\pi({\mathcal{G}}), as shown in Figure 2.

Figure 2: KL divergence (blue) and bounds derived in Lemma 16 (red) for \pi=(3/4)^{n} where n=9 (left), n=18 (center) and n=36 (right). Note that the bounds are very informative when n\rightarrow+\infty (or equivalently when \pi\rightarrow 0).

Next, we derive the problem of maximizing the empirical proportion of equilibria from the maximum likelihood estimation problem.

Theorem 17.

Assume that with probability at least 1-\delta we have \pi({\mathcal{G}})\leq\frac{\kappa^{n}}{\delta} for 1/2\leq\kappa<1. Maximizing a lower bound (with high probability) of the log-likelihood in Eq. (5) is equivalent to maximizing the empirical proportion of equilibria:

\max_{{\mathcal{G}}\in\mathcal{H}}\ \widehat{\pi}({\mathcal{G}})\;, (7)

furthermore, for all games {\mathcal{G}} such that \widehat{\pi}({\mathcal{G}})\geq\gamma for some 0<\gamma<1/2, for sufficiently large n>\log_{\kappa}(\delta\gamma) and optimal mixture parameter \widehat{q}=\min(\widehat{\pi}({\mathcal{G}}),1-\frac{1}{2m}), we have ({\mathcal{G}},\widehat{q})\in\Upsilon, where \Upsilon=\{({\mathcal{G}},q)\mid{\mathcal{G}}\in\mathcal{H}\wedge 0<\pi({\mathcal{G}})<q<1\} is the hypothesis space of non-trivial identifiable games and mixture parameters.

Proof.

By applying the lower bound in Lemma 16 in Eq. (5) to non-trivial games, we have ^(𝒢,q^)=KL(π^(𝒢)π(𝒢))KL(π^(𝒢)q^)nlog2>π^(𝒢)logπ(𝒢)KL(π^(𝒢)q^)(n+1)log2\widehat{\mathcal{L}}({\mathcal{G}},\widehat{q})=KL(\widehat{\pi}({\mathcal{G}})\|\pi({\mathcal{G}}))-KL(\widehat{\pi}({\mathcal{G}})\|\widehat{q})-n\log 2>-\widehat{\pi}({\mathcal{G}})\log\pi({\mathcal{G}})-KL(\widehat{\pi}({\mathcal{G}})\|\widehat{q})-(n+1)\log 2. Since π(𝒢)κnδ\pi({\mathcal{G}})\leq\frac{\kappa^{n}}{\delta}, we have logπ(𝒢)logκnδ-\log\pi({\mathcal{G}})\geq-\log\frac{\kappa^{n}}{\delta}. Therefore ^(𝒢,q^)>π^(𝒢)logκnδKL(π^(𝒢)q^)(n+1)log2\widehat{\mathcal{L}}({\mathcal{G}},\widehat{q})>-\widehat{\pi}({\mathcal{G}})\log\frac{\kappa^{n}}{\delta}-KL(\widehat{\pi}({\mathcal{G}})\|\widehat{q})-(n+1)\log 2. Regarding the term KL(π^(𝒢)q^)KL(\widehat{\pi}({\mathcal{G}})\|\widehat{q}), if π^(𝒢)<1KL(π^(𝒢)q^)=KL(π^(𝒢)π^(𝒢))=0\widehat{\pi}({\mathcal{G}})<1\Rightarrow KL(\widehat{\pi}({\mathcal{G}})\|\widehat{q})=KL(\widehat{\pi}({\mathcal{G}})\|\widehat{\pi}({\mathcal{G}}))=0, and if π^(𝒢)=1KL(π^(𝒢)q^)=KL(1112m)=log(112m)log2\widehat{\pi}({\mathcal{G}})=1\Rightarrow KL(\widehat{\pi}({\mathcal{G}})\|\widehat{q})=KL(1\|1-\frac{1}{2m})=-\log(1-\frac{1}{2m})\leq\log 2 and approaches 0 when m+m\rightarrow+\infty. Maximizing the lower bound of the log-likelihood becomes max𝒢π^(𝒢)\max_{{\mathcal{G}}\in\mathcal{H}}{\widehat{\pi}({\mathcal{G}})} by removing the constant terms that do not depend on 𝒢{\mathcal{G}}.

In order to prove (𝒢,q^)Υ({\mathcal{G}},\widehat{q})\in\Upsilon we need to prove 0<π(𝒢)<q^<10<\pi({\mathcal{G}})<\widehat{q}<1. For proving the first inequality 0<π(𝒢)0<\pi({\mathcal{G}}), note that π^(𝒢)γ>0\widehat{\pi}({\mathcal{G}})\geq\gamma>0, and therefore 𝒢{\mathcal{G}} has at least one equilibria. For proving the third inequality q^<1\widehat{q}<1, note that q^=min(π^(𝒢),112m)<1\widehat{q}=\min(\widehat{\pi}({\mathcal{G}}),1-\frac{1}{2m})<1. For proving the second inequality π(𝒢)<q^\pi({\mathcal{G}})<\widehat{q}, we need to prove π(𝒢)<π^(𝒢)\pi({\mathcal{G}})<\widehat{\pi}({\mathcal{G}}) and π(𝒢)<112m\pi({\mathcal{G}})<1-\frac{1}{2m}. Since π(𝒢)κnδ\pi({\mathcal{G}})\leq\frac{\kappa^{n}}{\delta} and γπ^(𝒢)\gamma\leq\widehat{\pi}({\mathcal{G}}), it suffices to prove κnδ<γπ(𝒢)<π^(𝒢)\frac{\kappa^{n}}{\delta}<\gamma\Rightarrow\pi({\mathcal{G}})<\widehat{\pi}({\mathcal{G}}). Similarly we need to prove κnδ<112mπ(𝒢)<112m\frac{\kappa^{n}}{\delta}<1-\frac{1}{2m}\Rightarrow\pi({\mathcal{G}})<1-\frac{1}{2m}. Putting both together, we have κnδ<min(γ,112m)=γ\frac{\kappa^{n}}{\delta}<\min(\gamma,1-\frac{1}{2m})=\gamma since γ<1/2\gamma<1/2 and 112m1/21-\frac{1}{2m}\geq 1/2. Finally, κnδ<γn>logκ(δγ)\frac{\kappa^{n}}{\delta}<\gamma\Leftrightarrow n>\log_{\kappa}{(\delta\gamma)}. ∎

6.4 A Non-Concave Maximization Method: Sigmoidal Approximation

A very simple optimization approach can be devised by using a sigmoid to approximate the 0/1 function 1[z\geq 0] in the maximum likelihood problem of Eq. (5), as well as when maximizing the empirical proportion of equilibria as in Eq. (7). We use the following sigmoidal approximation:

1[z\geq 0]\approx H_{\alpha,\beta}(z)\equiv\frac{1}{2}\left(1+\tanh\left(\frac{z}{\beta}-{\rm arctanh}(1-2\alpha^{1/n})\right)\right)\;. (8)

The additional term \alpha ensures that for {\mathcal{G}}=(\mathbf{W},\mathbf{b}) with \mathbf{W}=\mathbf{0},\mathbf{b}=\mathbf{0} we get 1[\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}})]\approx H_{\alpha,\beta}(0)^{n}=\alpha. We perform gradient ascent on these objective functions, which have many local maxima. Note that when maximizing the “sigmoidal” likelihood, each step of the gradient ascent is NP-hard due to the “sigmoidal” true proportion of equilibria. Therefore, we propose the use of the sigmoidal maximum likelihood only for n\leq 15.

In our implementation, we add an 1\ell_{1}-norm regularizer ρ𝐖1-\rho\|\mathbf{W}\|_{1} where ρ>0\rho>0 to both maximization problems. The 1\ell_{1}-norm regularizer encourages sparseness and attempts to lower the generalization error by controlling over-fitting.
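
For concreteness, a minimal Python sketch of the sigmoidal surrogate in Eq. (8) (the function names and the sample values of \alpha and \beta are our own assumptions), giving a smooth stand-in for the indicator that a joint action is a PSNE of an LIG:

  import numpy as np

  def H(z, alpha, beta, n):
      # Sigmoidal approximation of the 0/1 function 1[z >= 0], as in Eq. (8).
      return 0.5 * (1.0 + np.tanh(z / beta - np.arctanh(1.0 - 2.0 * alpha ** (1.0 / n))))

  def soft_psne_indicator(W, b, x, alpha=0.5, beta=0.1):
      # Smooth surrogate of 1[x in NE(G)] for an LIG G = (W, b) with diag(W) = 0:
      # the product over players of the soft indicators of each payoff margin.
      n = len(b)
      margins = x * (W @ x - b)   # x_i (w_{i,-i}^T x_{-i} - b_i) for every player i
      return float(np.prod(H(margins, alpha, beta, n)))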

6.5 Our Proposed Approach: Convex Loss Minimization

From an optimization perspective, it is more convenient to minimize a convex objective than to maximize a sigmoidal approximation, in order to avoid the many local optima.

Note that maximizing the empirical proportion of equilibria in Eq. (7) is equivalent to minimizing the empirical proportion of non-equilibria, i.e., \min_{{\mathcal{G}}\in\mathcal{H}}(1-\widehat{\pi}({\mathcal{G}})). Furthermore, 1-\widehat{\pi}({\mathcal{G}})=\frac{1}{m}\sum_{l}1[\mathbf{x}^{(l)}\notin{\mathcal{NE}}({\mathcal{G}})]. Denote by \ell the 0/1 loss, i.e., \ell(z)=1[z<0]. For LIGs, maximizing the empirical proportion of equilibria in Eq. (7) is equivalent to solving the loss minimization problem:

\min_{\mathbf{W},\mathbf{b}}\ \frac{1}{m}\sum_{l}\max_{i}\,\ell\!\left(x_{i}^{(l)}(\mathbf{w}_{i,-i}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i})\right)\;. (9)

We can further relax this problem by introducing convex upper bounds of the 0/1 loss. Note that the use of convex losses also avoids the trivial solution of Eq. (9), i.e., \mathbf{W}=\mathbf{0},\mathbf{b}=\mathbf{0} (which obtains the lowest log-likelihood, as discussed in Remark 10). Intuitively speaking, note that minimizing the logistic loss \ell(z)=\log(1+e^{-z}) will make z\rightarrow+\infty, while minimizing the hinge loss \ell(z)=\max(0,1-z) will make z\rightarrow 1, unlike the 0/1 loss \ell(z)=1[z<0], which only requires z=0 in order to be minimized. In what follows, we develop four efficient methods for solving Eq. (9) under specific choices of loss functions, i.e., hinge and logistic.

In our implementation, we add an 1\ell_{1}-norm regularizer ρ𝐖1\rho\|\mathbf{W}\|_{1} where ρ>0\rho>0 to all the minimization problems. The 1\ell_{1}-norm regularizer encourages sparseness and attempts to lower the generalization error by controlling over-fitting.

6.5.1 Independent Support Vector Machines and Logistic Regression

We can relax the loss minimization problem in Eq. (9) by using the loose bound \max_{i}\ell(z_{i})\leq\sum_{i}\ell(z_{i}). This relaxation decomposes the original problem into several independent problems. For each player i, we train the weights (\mathbf{w}_{i,-i},b_{i}) in order to predict independent (disjoint) actions. This leads to the 1-norm SVMs of Bradley and Mangasarian [1998], Zhu et al. [2004] and to \ell_{1}-regularized logistic regression. We solve the latter with the \ell_{1}-projection method of Schmidt et al. [2007a]. While the training is independent, our goal is not prediction for independent players but the characterization of joint actions. The use of these well-known techniques in our context is novel, since we interpret the output of the SVMs and logistic regressions as the parameters of an LIG. Therefore, we use the parameters to measure the empirical and true proportions of equilibria, the KL divergence, and the log-likelihood in our probabilistic model.
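
As a sketch of this independent-regression route (our own illustration; scikit-learn is used here as an off-the-shelf stand-in for the \ell_{1}-projection solver of Schmidt et al. [2007a], and the mapping from \rho to the parameter C is our own rescaling), each player’s action is regressed on the other players’ actions and the fitted coefficients are read off as one row of the LIG:

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  def fit_independent_logistic_lig(X, rho=0.1):
      # X: (m, n) matrix of observed joint actions in {-1, +1}; assumes every
      # player takes both actions somewhere in the data. Fits one l1-regularized
      # logistic regression per player i, predicting x_i from x_{-i}, and reads
      # the coefficients off as the LIG parameters (W, b).
      m, n = X.shape
      W, b = np.zeros((n, n)), np.zeros(n)
      for i in range(n):
          others = np.delete(np.arange(n), i)
          clf = LogisticRegression(penalty="l1", C=1.0 / (rho * m), solver="liblinear")
          clf.fit(X[:, others], X[:, i])
          W[i, others] = clf.coef_.ravel()
          b[i] = -clf.intercept_[0]   # model: x_i = sign(w_{i,-i}^T x_{-i} - b_i)
      return W, b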

6.5.2 Simultaneous Support Vector Machines

While relaxing the loss minimization problem in Eq. (9) with loose bounds allows us to obtain several independent problems, each with a small number of variables, a second reasonable strategy is to use tighter bounds, at the expense of obtaining a single optimization problem with a larger number of variables.

For the hinge loss \ell(z)=\max(0,1-z), we have \max_{i}\ell(z_{i})=\max(0,1-z_{1},\dots,1-z_{n}), and the loss minimization problem in Eq. (9) becomes the following primal linear program:

\min_{\mathbf{W},\mathbf{b},\boldsymbol{\xi}}\ \frac{1}{m}\sum_{l}\xi_{l}+\rho\|\mathbf{W}\|_{1}\quad{\rm s.t.}\quad(\forall l,i)\ x_{i}^{(l)}(\mathbf{w}_{i,-i}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i})\geq 1-\xi_{l}\,,\quad(\forall l)\ \xi_{l}\geq 0\;, (10)

where ρ>0\rho>0.

Note that Eq. (10) is equivalent to a linear program, since we can set \mathbf{W}=\mathbf{W}^{+}-\mathbf{W}^{-}, \|\mathbf{W}\|_{1}=\sum_{ij}(w_{ij}^{+}+w_{ij}^{-}), and add the constraints \mathbf{W}^{+}\geq\mathbf{0} and \mathbf{W}^{-}\geq\mathbf{0}. We follow the regular SVM derivation by adding a slack variable \xi_{l} for each sample l. This problem is a generalization of the 1-norm SVMs of Bradley and Mangasarian [1998], Zhu et al. [2004].
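
To illustrate the primal problem in Eq. (10), here is a compact sketch using CVXPY as a generic convex/LP solver (our own choice of tool, not the one used in the paper; names are ours):

  import cvxpy as cp
  import numpy as np

  def fit_simultaneous_svm(X, rho=0.1):
      # X: (m, n) matrix of observed joint actions in {-1, +1}.
      # Solves the primal problem of Eq. (10) and returns the LIG parameters.
      m, n = X.shape
      W, b, xi = cp.Variable((n, n)), cp.Variable(n), cp.Variable(m, nonneg=True)
      constraints = [cp.diag(W) == 0]        # no self-influence
      for l in range(m):
          x = X[l]
          for i in range(n):
              # margin of player i on sample l: x_i (w_{i,-i}^T x_{-i} - b_i)
              constraints.append(x[i] * (W[i, :] @ x - b[i]) >= 1 - xi[l])
      objective = cp.Minimize(cp.sum(xi) / m + rho * cp.sum(cp.abs(W)))
      cp.Problem(objective, constraints).solve()
      return W.value, b.value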

By Lagrangian duality, the dual of the problem in Eq. (10) is the following linear program:

\max_{\boldsymbol{\alpha}}\ \sum_{l,i}\alpha_{li}\quad{\rm s.t.}\quad(\forall i)\ \Big\|\sum_{l}\alpha_{li}x_{i}^{(l)}\mathbf{x}_{-i}^{(l)}\Big\|_{\infty}\leq\rho\,,\quad(\forall l,i)\ \alpha_{li}\geq 0\,,\quad(\forall i)\ \sum_{l}\alpha_{li}x_{i}^{(l)}=0\,,\quad(\forall l)\ \sum_{i}\alpha_{li}\leq\frac{1}{m}\;. (11)

Furthermore, strong duality holds in this case. Note that Eq. (11) is equivalent to a linear program, since we can transform the constraint \|\mathbf{c}\|_{\infty}\leq\rho into -\rho\mathbf{1}\leq\mathbf{c}\leq\rho\mathbf{1}.

6.5.3 Simultaneous Logistic Regression

For the logistic loss \ell(z)=\log(1+e^{-z}), we could use the non-smooth loss \max_{i}\ell(z_{i}) directly. Instead, we chose a smooth upper bound, i.e., \log(1+\sum_{i}e^{-z_{i}}). The following discussion and technical lemma provide the reason behind our use of this simultaneous logistic loss.

Given that any loss \ell(z) is a decreasing function, the identity \max_{i}\ell(z_{i})=\ell(\min_{i}z_{i}) holds. Hence, we can either upper-bound the \max function by the {\rm logsumexp} function or lower-bound the \min function by a negative {\rm logsumexp}. We chose the latter option for the logistic loss for the following reasons: Claim i of the following technical lemma shows that lower-bounding \min generates a loss that is strictly less than that obtained by upper-bounding \max. Claim ii shows that lower-bounding \min generates a loss that is strictly less than independently penalizing each player. Claim iii shows that there are some cases in which upper-bounding \max generates a loss that is strictly greater than independently penalizing each player.

Lemma 18.

For the logistic loss (z)=log(1+ez)\ell(z)=\log(1+e^{-z}) and a set of n>1n>1 numbers {z1,,zn}\{z_{1},\dots,z_{n}\}:

i. (\forall z_{1},\dots,z_{n})\ \max_{i}\ell(z_{i})\leq\ell\left(-\log\sum_{i}e^{-z_{i}}\right)<\log\sum_{i}e^{\ell(z_{i})}\leq\max_{i}\ell(z_{i})+\log n\;,
ii. (\forall z_{1},\dots,z_{n})\ \ell\left(-\log\sum_{i}e^{-z_{i}}\right)<\sum_{i}\ell(z_{i})\;,
iii. (\exists z_{1},\dots,z_{n})\ \log\sum_{i}e^{\ell(z_{i})}>\sum_{i}\ell(z_{i})\;.
Proof.

Given a set of numbers {a1,,an}\{a_{1},\dots,a_{n}\}, the max\max function is bounded by the logsumexp{\rm logsumexp} function by maxiailogieaimaxiai+logn\max_{i}{a_{i}}\leq\log\sum_{i}{e^{a_{i}}}\leq\max_{i}{a_{i}}+\log n [Boyd and Vandenberghe, 2006]. Equivalently, the min\min function is bounded by miniailognlogieaiminiai\min_{i}{a_{i}}-\log n\leq-\log\sum_{i}{e^{-a_{i}}}\leq\min_{i}{a_{i}}.

These identities allow us to prove two inequalities in Claim i, i.e., maxi(zi)=(minizi)(logiezi)\max_{i}{\ell(z_{i})}=\ell(\min_{i}{z_{i}})\leq\ell\left(-\log\sum_{i}{e^{-z_{i}}}\right) and logie(zi)maxi(zi)+logn\log\sum_{i}{e^{\ell(z_{i})}}\leq\max_{i}{\ell(z_{i})}+\log n. To prove the remaining inequality (logiezi)<logie(zi)\ell\left(-\log\sum_{i}{e^{-z_{i}}}\right)<\log\sum_{i}{e^{\ell(z_{i})}}, note that for the logistic loss (logiezi)=log(1+iezi)\ell\left(-\log\sum_{i}{e^{-z_{i}}}\right)=\log(1+\sum_{i}{e^{-z_{i}}}) and logie(zi)=log(n+iezi)\log\sum_{i}{e^{\ell(z_{i})}}=\log(n+\sum_{i}{e^{-z_{i}}}). Since n>1n>1, strict inequality holds.

To prove Claim ii, we need to show that (logiezi)=log(1+iezi)<i(zi)=ilog(1+ezi)\ell\left(-\log\sum_{i}{e^{-z_{i}}}\right)=\log(1+\sum_{i}{e^{-z_{i}}})<\sum_{i}{\ell(z_{i})}=\sum_{i}{\log(1+e^{-z_{i}})}. This is equivalent to 1+iezi<i(1+ezi)=𝐜{0,1}ne𝐜T𝐳=1+iezi+𝐜{0,1}n,𝟏T𝐜>1e𝐜T𝐳1+\sum_{i}{e^{-z_{i}}}<\prod_{i}{(1+e^{-z_{i}})}=\sum_{\mathbf{c}\in\{0,1\}^{n}}{e^{-{\mathbf{c}}^{\rm T}\mathbf{z}}}=1+\sum_{i}{e^{-z_{i}}}+\sum_{\mathbf{c}\in\{0,1\}^{n},{\mathbf{1}}^{\rm T}\mathbf{c}>1}{e^{-{\mathbf{c}}^{\rm T}\mathbf{z}}}. Finally, we have 𝐜{0,1}n,𝟏T𝐜>1e𝐜T𝐳>0\sum_{\mathbf{c}\in\{0,1\}^{n},{\mathbf{1}}^{\rm T}\mathbf{c}>1}{e^{-{\mathbf{c}}^{\rm T}\mathbf{z}}}>0 because the exponential function is strictly positive.

To prove Claim iii, it suffices to find set of numbers {z1,,zn}\{z_{1},\dots,z_{n}\} for which logie(zi)=log(n+iezi)>i(zi)=ilog(1+ezi)\log\sum_{i}{e^{\ell(z_{i})}}=\log(n+\sum_{i}{e^{-z_{i}}})>\sum_{i}{\ell(z_{i})}=\sum_{i}{\log(1+e^{-z_{i}})}. This is equivalent to n+iezi>i(1+ezi)n+\sum_{i}{e^{-z_{i}}}>\prod_{i}{(1+e^{-z_{i}})}. By setting (i)zi=logn(\forall i){\rm\ }z_{i}=\log n, we reduce the claim we want to prove to n+1>(1+1n)nn+1>(1+\frac{1}{n})^{n}. Strict inequality holds for n>1n>1. Furthermore, note that limn+(1+1n)n=e\lim_{n\rightarrow+\infty}{(1+\frac{1}{n})^{n}}=e. ∎

Returning to our simultaneous logistic regression formulation, the loss minimization problem in Eq. (9) becomes

\min_{\mathbf{W},\mathbf{b}}\ \frac{1}{m}\sum_{l}\log\Big(1+\sum_{i}e^{-x_{i}^{(l)}(\mathbf{w}_{i,-i}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i})}\Big)+\rho\|\mathbf{W}\|_{1}\;, (12)

where ρ>0\rho>0.

In our implementation, we use the \ell_{1}-projection method of Schmidt et al. [2007a] for optimizing Eq. (12). This method performs a limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) step in an expanded model (i.e., \mathbf{W}=\mathbf{W}^{+}-\mathbf{W}^{-}, \|\mathbf{W}\|_{1}=\sum_{ij}(w_{ij}^{+}+w_{ij}^{-})), followed by a projection onto the non-negative orthant to enforce \mathbf{W}^{+}\geq\mathbf{0} and \mathbf{W}^{-}\geq\mathbf{0}.
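
A minimal numpy sketch of the smoothed objective in Eq. (12) (our own illustration; such a function could, for instance, be handed to a generic box-constrained quasi-Newton solver on the expanded non-negative parameterization, as a stand-in for the \ell_{1}-projection method actually used):

  import numpy as np

  def simultaneous_logistic_objective(W, b, X, rho):
      # Objective of Eq. (12): smoothed max-over-players logistic loss + l1 penalty.
      # W: (n, n) with zero diagonal, b: (n,), X: (m, n) joint actions in {-1, +1}.
      margins = X * (X @ W.T - b)   # entry (l, i) = x_i^(l) (w_{i,-i}^T x_{-i}^(l) - b_i)
      loss = np.mean(np.log1p(np.exp(-margins).sum(axis=1)))
      return loss + rho * np.abs(W).sum()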

7 On the True Proportion of Equilibria

In this section, we justify the use of convex loss minimization for learning the structure and parameters of LIGs. We define absolute indifference of players and show that our convex loss minimization approach produces games in which all players are non-absolutely-indifferent. We then provide a high-probability bound on the true proportion of equilibria. Our bound assumes independence of weight vectors among players, and applies to a large family of distributions of weight vectors. Furthermore, we do not assume any connectivity properties of the underlying graph.

Parallel to our analysis, Daskalakis et al. [2011] analyzed a different setting: random games whose structure is drawn from the Erdős-Rényi model (i.e., each edge is present independently with the same probability p) and whose utility functions are random tables. The analysis in Daskalakis et al. [2011], while more general than ours (which focuses only on LIGs), is at the same time more restricted, since it assumes either the Erdős-Rényi model for random structures or connectivity properties for deterministic structures.

7.1 Convex Loss Minimization Produces Non-Absolutely-Indifferent Players

First, we define the notion of absolute indifference of players. Our goal in this subsection is to show that our proposed convex loss algorithms produce LIGs in which all players are non-absolutely-indifferent, and therefore every player imposes constraints on the true proportion of equilibria.

Definition 19.

Given an LIG 𝒢=(𝐖,𝐛){\mathcal{G}}=(\mathbf{W},\mathbf{b}), we say a player ii is absolutely indifferent if and only if (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}},b_{i})=\mathbf{0}, and non-absolutely-indifferent if and only if (𝐰i,i,bi)𝟎(\mathbf{w}_{{i},{{-i}}},b_{i})\neq\mathbf{0}.

Next, we concentrate on the first ingredient for our bound of the true proportion of equilibria. We show that independent and simultaneous SVM and logistic regression produce games in which all players are non-absolutely-indifferent except for some “degenerate” cases. The following lemma applies to independent SVMs for c(l)=0c^{(l)}=0 and simultaneous SVMs for c(l)=max(0,maxji(1xj(l)(𝐰i,iT𝐱i(l)bi)))c^{(l)}=\max(0,\max_{j\neq i}{(1-x_{j}^{(l)}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i}))}).

Lemma 20.

Given (l)c(l)0(\forall l){\rm\ }c^{(l)}\geq 0, the minimization of the hinge training loss ^(𝐰i,i,bi)=1mlmax(c(l),1xi(l)(𝐰i,iT𝐱i(l)bi))\widehat{\ell}(\mathbf{w}_{{i},{{-i}}},b_{i})=\frac{1}{m}\sum_{l}{\max(c^{(l)},1-x_{i}^{(l)}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i}))} guarantees non-absolutely-indifference of player ii except for some “degenerate” cases, i.e., the optimal solution (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}}^{*},b_{i}^{*})=\mathbf{0} if and only if (ji)l1[xi(l)xj(l)=1]u(l)=l1[xi(l)xj(l)=1]u(l)(\forall j\neq i){\rm\ }\sum_{l}{1[{x_{i}^{(l)}x_{j}^{(l)}\hskip-1.8063pt\hskip-1.8063pt=\hskip-1.8063pt1}]u^{(l)}}=\sum_{l}{1[{x_{i}^{(l)}x_{j}^{(l)}\hskip-1.8063pt\hskip-1.8063pt=\hskip-1.8063pt-1}]u^{(l)}} and l1[xi(l)=1]u(l)=l1[xi(l)=1]u(l)\sum_{l}{1[{x_{i}^{(l)}\hskip-1.8063pt\hskip-1.8063pt=\hskip-1.8063pt1}]u^{(l)}}=\sum_{l}{1[{x_{i}^{(l)}\hskip-1.8063pt\hskip-1.8063pt=\hskip-1.8063pt-1}]u^{(l)}} where u(l)u^{(l)} is defined as c(l)>1u(l)=0c^{(l)}>1\Leftrightarrow u^{(l)}=0, c(l)<1u(l)=1c^{(l)}<1\Leftrightarrow u^{(l)}=1 and c(l)=1u(l)[0;1]c^{(l)}=1\Leftrightarrow u^{(l)}\in[0;1].

Proof.

Let fi(𝐱i)𝐰i,iT𝐱ibif_{i}(\mathbf{x}_{-i})\equiv{\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i}. By noting that max(α,β)=max0u1(α+u(βα))\max(\alpha,\beta)=\max_{0\leq u\leq 1}{(\alpha+u(\beta-\alpha))}, we can rewrite ^(𝐰i,i,bi)=1mlmax0u(l)1(c(l)+u(l)(1xi(l)fi(𝐱i(l))c(l)))\widehat{\ell}(\mathbf{w}_{{i},{{-i}}},b_{i})=\frac{1}{m}\sum_{l}{\max_{0\leq u^{(l)}\leq 1}{(c^{(l)}+u^{(l)}(1-x_{i}^{(l)}f_{i}(\mathbf{x}_{-i}^{(l)})-c^{(l)}))}}.

Note that ^\widehat{\ell} has the minimizer (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}}^{*},b_{i}^{*})=\mathbf{0} if and only if 𝟎\mathbf{0} belongs to the subdifferential set of the non-smooth function ^\widehat{\ell} at (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}},b_{i})=\mathbf{0}. In order to maximize ^\widehat{\ell}, we have c(l)>1xi(l)fi(𝐱i(l))u(l)=0c^{(l)}>1-x_{i}^{(l)}f_{i}(\mathbf{x}_{-i}^{(l)})\Leftrightarrow u^{(l)}=0, c(l)<1xi(l)fi(𝐱i(l))u(l)=1c^{(l)}<1-x_{i}^{(l)}f_{i}(\mathbf{x}_{-i}^{(l)})\Leftrightarrow u^{(l)}=1 and c(l)=1xi(l)fi(𝐱i(l))u(l)[0;1]c^{(l)}=1-x_{i}^{(l)}f_{i}(\mathbf{x}_{-i}^{(l)})\Leftrightarrow u^{(l)}\in[0;1]. The previous rules simplify at the solution under analysis, since (𝐰i,i,bi)=𝟎fi(𝐱i(l))=0(\mathbf{w}_{{i},{{-i}}},b_{i})=\mathbf{0}\Rightarrow f_{i}(\mathbf{x}_{-i}^{(l)})=0.

Let gj(𝐰i,i,bi)^wij(𝐰i,i,bi)g_{j}(\mathbf{w}_{{i},{{-i}}},b_{i})\equiv\frac{\partial\widehat{\ell}}{\partial w_{ij}}(\mathbf{w}_{{i},{{-i}}},b_{i}) and h(𝐰i,i,bi)^bi(𝐰i,i,bi)h(\mathbf{w}_{{i},{{-i}}},b_{i})\equiv\frac{\partial\widehat{\ell}}{\partial b_{i}}(\mathbf{w}_{{i},{{-i}}},b_{i}). By making (ji) 0gj(𝟎,0)(\forall j\neq i){\rm\ }0\in g_{j}(\mathbf{0},0) and 0h(𝟎,0)0\in h(\mathbf{0},0), we get (ji)lxi(l)xj(l)u(l)=0(\forall j\neq i){\rm\ }\sum_{l}{x_{i}^{(l)}x_{j}^{(l)}u^{(l)}}=0 and lxi(l)u(l)=0\sum_{l}{x_{i}^{(l)}u^{(l)}}=0. Finally, by noting that xi(l){1,1}x_{i}^{(l)}\in\{-1,1\}, we prove our claim. ∎

Remark 21.

Note that for independent SVMs, the “degenerate” cases in Lemma 20 simplify to (ji)l1[xi(l)xj(l)=1]=m2(\forall j\neq i){\rm\ }\sum_{l}{1[{x_{i}^{(l)}x_{j}^{(l)}=1}]}=\frac{m}{2} and l1[xi(l)=1]=m2\sum_{l}{1[{x_{i}^{(l)}=1}]}=\frac{m}{2}.

The following lemma applies to independent logistic regression for c(l)=0c^{(l)}=0 and simultaneous logistic regression for c(l)=jiexj(l)(𝐰i,iT𝐱i(l)bi)c^{(l)}=\sum_{j\neq i}e^{-x_{j}^{(l)}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i})}.

Lemma 22.

Given (l)c(l)0(\forall l){\rm\ }c^{(l)}\geq 0, the minimization of the logistic training loss ^(𝐰i,i,bi)=1mllog(c(l)+1+exi(l)(𝐰i,iT𝐱i(l)bi))\widehat{\ell}(\mathbf{w}_{{i},{{-i}}},b_{i})=\frac{1}{m}\sum_{l}{\log(c^{(l)}+1+e^{-x_{i}^{(l)}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}^{(l)}-b_{i})})} guarantees non-absolutely-indifference of player ii except for some “degenerate” cases, i.e., the optimal solution (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}}^{*},b_{i}^{*})=\mathbf{0} if and only if (ji)l1[xi(l)xj(l)=1]c(l)+2=l1[xi(l)xj(l)=1]c(l)+2(\forall j\neq i){\rm\ }\sum_{l}\frac{1[{x_{i}^{(l)}x_{j}^{(l)}=1}]}{c^{(l)}+2}=\sum_{l}\frac{1[{x_{i}^{(l)}x_{j}^{(l)}=-1}]}{c^{(l)}+2} and l1[xi(l)=1]c(l)+2=l1[xi(l)=1]c(l)+2\sum_{l}\frac{1[{x_{i}^{(l)}=1}]}{c^{(l)}+2}=\sum_{l}\frac{1[{x_{i}^{(l)}=-1}]}{c^{(l)}+2}.

Proof.

Note that ^\widehat{\ell} has the minimizer (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}}^{*},b_{i}^{*})=\mathbf{0} if and only if the gradient of the smooth function ^\widehat{\ell} is 𝟎\mathbf{0} at (𝐰i,i,bi)=𝟎(\mathbf{w}_{{i},{{-i}}},b_{i})=\mathbf{0}. Let gj(𝐰i,i,bi)^wij(𝐰i,i,bi)g_{j}(\mathbf{w}_{{i},{{-i}}},b_{i})\equiv\frac{\partial\widehat{\ell}}{\partial w_{ij}}(\mathbf{w}_{{i},{{-i}}},b_{i}) and h(𝐰i,i,bi)^bi(𝐰i,i,bi)h(\mathbf{w}_{{i},{{-i}}},b_{i})\equiv\frac{\partial\widehat{\ell}}{\partial b_{i}}(\mathbf{w}_{{i},{{-i}}},b_{i}). By making (ji)gj(𝟎,0)=0(\forall j\neq i){\rm\ }g_{j}(\mathbf{0},0)=0 and h(𝟎,0)=0h(\mathbf{0},0)=0, we get (ji)lxi(l)xj(l)c(l)+2=0(\forall j\neq i){\rm\ }\sum_{l}\frac{x_{i}^{(l)}x_{j}^{(l)}}{c^{(l)}+2}=0 and lxi(l)c(l)+2=0\sum_{l}\frac{x_{i}^{(l)}}{c^{(l)}+2}=0. Finally, by noting that xi(l){1,1}x_{i}^{(l)}\in\{-1,1\}, we prove our claim. ∎

Remark 23.

Note that for independent logistic regression, the “degenerate” cases in Lemma 22 simplify to (ji)l1[xi(l)xj(l)=1]=m2(\forall j\neq i){\rm\ }\sum_{l}{1[{x_{i}^{(l)}x_{j}^{(l)}=1}]}=\frac{m}{2} and l1[xi(l)=1]=m2\sum_{l}{1[{x_{i}^{(l)}=1}]}=\frac{m}{2}.

Based on these results, after termination of our proposed algorithms, we fix cases in which the optimal solution (\mathbf{w}_{{i},{-i}}^{*},b_{i}^{*})=\mathbf{0} by setting b^{*}_{i}=1 if the action of player i was mostly -1, and b^{*}_{i}=-1 otherwise. We point out to the careful reader that we did not include the \ell_{1}-regularization term in the above proofs; since the subdifferential of \rho\|\mathbf{w}_{{i},{-i}}\|_{1} contains \mathbf{0} at \mathbf{w}_{{i},{-i}}=\mathbf{0}, our proofs still hold.
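As an illustration only, the following minimal sketch (ours; the function name is hypothetical) applies the post-processing rule just described to the learned parameters.

import numpy as np

def fix_absolutely_indifferent_players(W, b, X):
    """If (w_{i,-i}, b_i) = 0 after training, set b_i so that player i's best
    response matches its majority action in the data X (entries in {-1, +1})."""
    for i in range(W.shape[0]):
        if not np.any(W[i, :]) and b[i] == 0:
            # payoff is x_i * (0 - b_i): b_i = 1 favors x_i = -1, b_i = -1 favors x_i = +1
            b[i] = 1.0 if np.mean(X[:, i]) < 0 else -1.0
    return W, b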

7.2 Bounding the True Proportion of Equilibria

In what follows, we concentrate on the second ingredient for our bound of the true proportion of equilibria. We show a bound for a single non-absolutely-indifferent player and a fixed joint action \mathbf{x} which, interestingly, does not depend on the specific joint action \mathbf{x}. This is a key ingredient for bounding the true proportion of equilibria in our main theorem.

Lemma 24.

Given an LIG 𝒢=(𝐖,𝐛){\mathcal{G}}=(\mathbf{W},\mathbf{b}) with non-absolutely-indifferent player ii, assume that (𝐰i,i,bi)(\mathbf{w}_{{i},{{-i}}},b_{i}) is a random vector drawn from a distribution 𝒫i\mathcal{P}_{i}. If for all 𝐱{1,+1}n\mathbf{x}\in\{-1,+1\}^{n}, 𝒫i[xi(𝐰i,iT𝐱ibi)=0]=0{\mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})=0]=0}, then

{\rm i.}\ \text{for all }\mathbf{x},\ \mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0]=\mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\leq 0]
\ \text{if and only if, for all }\mathbf{x},\ \mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0]=1/2\;.

If 𝒫i\mathcal{P}_{i} is a uniform distribution of support {1,+1}n\{-1,+1\}^{n}, then

{\rm ii.}\ \text{for all }\mathbf{x},\ \mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0]\in[1/2,\ 3/4]\;.
Proof.

Claim i follows immediately from a simple condition we obtain from the normalization axiom of probability and the hypothesis of the claim: i.e., for all 𝐱{1,+1}n\mathbf{x}\in\{-1,+1\}^{n}, 𝒫i[xi(𝐰i,iT𝐱ibi)0]+𝒫i[xi(𝐰i,iT𝐱ibi)0]=1\mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0]+\mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\leq 0]=1.

To prove Claim ii, first let 𝐯(𝐰i,i,bi)\mathbf{v}\equiv(\mathbf{w}_{{i},{{-i}}},b_{i}) and 𝐲(xi𝐱i,xi){1,+1}n\mathbf{y}\equiv(x_{i}\mathbf{x}_{-i},-x_{i})\in\{-1,+1\}^{n}. Note that xi(𝐰i,iT𝐱ibi)=𝐯T𝐲x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})={\mathbf{v}}^{\rm T}\mathbf{y}. Then, let f1(𝐯1,𝐲)𝐯1T𝐲1+y1f_{1}(\mathbf{v}_{-1},\mathbf{y})\equiv{\mathbf{v}_{-1}}^{\rm T}\mathbf{y}_{-1}+y_{1}. Note that (v1,v1𝐯1)(v_{1},v_{1}\mathbf{v}_{-1}) spans all possible vectors in {1,+1}n\{-1,+1\}^{n}. Because 𝒫i\mathcal{P}_{i} is a uniform distribution of support {1,+1}n\{-1,+1\}^{n}, we have:

\begin{aligned}
\mathbb{P}_{\mathcal{P}_{i}}[{\mathbf{v}}^{\rm T}\mathbf{y}\geq 0]
&=\frac{1}{2^{n}}\sum_{\mathbf{v}}{1[{\mathbf{v}}^{\rm T}\mathbf{y}\geq 0]}\\
&=\frac{1}{2^{n}}\sum_{\mathbf{v}}{1[v_{1}f_{1}(\mathbf{v}_{-1},\mathbf{y})\geq 0]}\\
&=\frac{1}{2^{n}}\sum_{\mathbf{v}}{\left(1[v_{1}=+1]1[f_{1}(\mathbf{v}_{-1},\mathbf{y})\geq 0]+1[v_{1}=-1]1[f_{1}(\mathbf{v}_{-1},\mathbf{y})\leq 0]\right)}\\
&=\frac{1}{2^{n}}\sum_{\mathbf{v}_{-1}}{\left(1[f_{1}(\mathbf{v}_{-1},\mathbf{y})\geq 0]+1[f_{1}(\mathbf{v}_{-1},\mathbf{y})\leq 0]\right)}\\
&=\frac{2^{n-1}}{2^{n}}+\frac{1}{2^{n}}\sum_{\mathbf{v}_{-1}}{1[f_{1}(\mathbf{v}_{-1},\mathbf{y})=0]}\\
&=1/2+\frac{1}{2^{n}}\alpha(\mathbf{y})
\end{aligned}

where α(𝐲)𝐯11[f1(𝐯1,𝐲)=0]=𝐯11[𝐯1T𝐲1+y1=0]\alpha(\mathbf{y})\equiv\sum_{\mathbf{v}_{-1}}{1[{f_{1}(\mathbf{v}_{-1},\mathbf{y})=0}]}=\sum_{\mathbf{v}_{-1}}{1[{{\mathbf{v}_{-1}}^{\rm T}\mathbf{y}_{-1}+y_{1}=0}]}. Note that α(𝐲)0\alpha(\mathbf{y})\geq 0 and thus, 𝒫i[𝐯T𝐲0]1/2\mathbb{P}_{\mathcal{P}_{i}}[{\mathbf{v}}^{\rm T}\mathbf{y}\geq 0]\geq 1/2. Geometrically speaking, α(𝐲)\alpha(\mathbf{y}) is the number of vertices of the (n1)(n-1)-dimensional hypercube that are covered by the hyperplane with normal 𝐲1\mathbf{y}_{-1} and bias y1y_{1}. Recall that 𝐲𝟎\mathbf{y}\neq\mathbf{0} since 𝐲{1,+1}n\mathbf{y}\in\{-1,+1\}^{n}. By relaxing this fact, as noted in Aichholzer and Aurenhammer [1996] a hyperplane with n2n-2 zeros on 𝐲1\mathbf{y}_{-1} (i.e., a (n2)(n-2)-parallel hyperplane) covers exactly half of the 2n12^{n-1} vertices, the maximum possible. Therefore, 𝒫i[𝐯T𝐲0]=1/2+12nα(𝐲)1/2+2n22n=3/4\mathbb{P}_{\mathcal{P}_{i}}[{\mathbf{v}}^{\rm T}\mathbf{y}\geq 0]=1/2+\frac{1}{2^{n}}\alpha(\mathbf{y})\leq 1/2+\frac{2^{n-2}}{2^{n}}=3/4. ∎
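Claim ii can also be verified by brute force for small n. The following sketch (ours, for illustration only) enumerates all (\mathbf{w}_{{i},{-i}},b_{i})\in\{-1,+1\}^{n} for every fixed joint action \mathbf{x} and checks that the probability lies in [1/2,\ 3/4].

import itertools
import numpy as np

def check_claim_ii(n):
    # For player i = 0: v = (w_{0,-0}, b_0), y = (x_0 * x_{-0}, -x_0), and the payoff sign is v^T y.
    for x in itertools.product([-1, 1], repeat=n):
        x = np.array(x)
        y = np.append(x[0] * x[1:], -x[0])
        count = sum(1 for v in itertools.product([-1, 1], repeat=n)
                    if np.dot(v, y) >= 0)
        p = count / 2 ** n
        assert 0.5 <= p <= 0.75, (x, p)

for n in range(2, 8):
    check_claim_ii(n)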

Remark 25.

It is important to note that under the conditions of Lemma 24, in a measure-theoretic sense, for almost all vectors (𝐰i,i,bi)(\mathbf{w}_{{i},{{-i}}},b_{i}) in the surface of the hypersphere in nn-dimensions (i.e., except for a set of Lebesgue-measure zero), we have that, xi(𝐰i,iT𝐱ibi)0x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\neq 0 for all 𝐱{1,+1}n\mathbf{x}\in\{-1,+1\}^{n}. Hence, the hypothesis stated for Claim i of Lemma 24 holds for almost all probability measures 𝒫i\mathcal{P}_{i} (i.e., except for a set of probability measures, over the surface of the hypersphere in nn-dimensions, with Lebesgue-measure zero). Note that Claim ii essentially states that we can still upper bound, for all 𝐱{1,+1}n\mathbf{x}\in\{-1,+1\}^{n}, the probability that such 𝐱\mathbf{x} is a PSNE of a random LIG even if we draw the weights and threshold parameters from a 𝒫i\mathcal{P}_{i} belonging to such sets of Lebesgue-measure zero.

Remark 26.

Note that any distribution that has zero mean and that depends on some norm of (\mathbf{w}_{{i},{-i}},b_{i}) fulfills the requirements for Claim i of Lemma 24. This includes, for instance, the multivariate normal distribution with arbitrary covariance, which is related to the Mahalanobis norm. Additionally, any distribution in which each entry of the vector (\mathbf{w}_{{i},{-i}},b_{i}) is independent and symmetric also fulfills those requirements. This includes, for instance, the Laplace and uniform distributions. Furthermore, note that distributions with support on non-empty subsets of entries of (\mathbf{w}_{{i},{-i}},b_{i}), as well as mixtures of the above cases, are also allowed. This includes, for instance, sparse graphs.

Next, we present our bound for the true proportion of equilibria of games in which all players are non-absolutely-indifferent.

Theorem 27.

Assume that all players are non-absolutely-indifferent and that the rows of an LIG 𝒢=(𝐖,𝐛){\mathcal{G}}=(\mathbf{W},\mathbf{b}) are independent (but not necessarily identically distributed) random vectors, i.e., for every player ii, (𝐰i,i,bi)(\mathbf{w}_{{i},{{-i}}},b_{i}) is independently drawn from an arbitrary distribution 𝒫i\mathcal{P}_{i}. If for all ii and 𝐱\mathbf{x}, 𝒫i[xi(𝐰i,iT𝐱ibi)0]κ{\mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0]\leq\kappa} for 1/2κ<11/2\leq\kappa<1, then the expected true proportion of equilibria is bounded as

\mathbb{E}_{\mathcal{P}_{1},\dots,\mathcal{P}_{n}}[\pi({\mathcal{G}})]\leq\kappa^{n}\;.

Furthermore, the following high probability statement

\mathbb{P}_{\mathcal{P}_{1},\dots,\mathcal{P}_{n}}\!\left[\pi({\mathcal{G}})\leq\frac{\kappa^{n}}{\delta}\right]\geq 1-\delta

holds.

Proof.

Let fi(𝐰i,i,bi,𝐱)1[xi(𝐰i,iT𝐱ibi)0]f_{i}(\mathbf{w}_{{i},{{-i}}},b_{i},\mathbf{x})\equiv 1[{x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0}] and 𝒫{𝒫1,,𝒫n}\mathcal{P}\equiv\{\mathcal{P}_{1},\dots,\mathcal{P}_{n}\}. By Eq. (2), 𝔼𝒫[π(𝒢)]=12n𝐱𝔼𝒫[ifi(𝐰i,i,bi,𝐱)]\mathbb{E}_{\mathcal{P}}[\pi({\mathcal{G}})]=\frac{1}{2^{n}}\sum_{\mathbf{x}}{\mathbb{E}_{\mathcal{P}}[\prod_{i}{f_{i}(\mathbf{w}_{{i},{{-i}}},b_{i},\mathbf{x})}]}. For any 𝐱\mathbf{x}, f1(𝐰1,1,b1,𝐱),,fn(𝐰n,n,bn,𝐱)f_{1}(\mathbf{w}_{{1},{{-1}}},b_{1},\mathbf{x}),\dots,f_{n}(\mathbf{w}_{{n},{{-n}}},b_{n},\mathbf{x}) are independent since (𝐰1,1,b1),,(𝐰n,n,bn)(\mathbf{w}_{{1},{{-1}}},b_{1}),\dots,(\mathbf{w}_{{n},{{-n}}},b_{n}) are independently distributed. Thus, 𝔼𝒫[π(𝒢)]=12n𝐱i𝔼𝒫i[fi(𝐰i,i,bi,𝐱)]\mathbb{E}_{\mathcal{P}}[\pi({\mathcal{G}})]=\frac{1}{2^{n}}\sum_{\mathbf{x}}{\prod_{i}{\mathbb{E}_{\mathcal{P}_{i}}[f_{i}(\mathbf{w}_{{i},{{-i}}},b_{i},\mathbf{x})]}}. Since for all ii and 𝐱\mathbf{x}, 𝔼𝒫i[fi(𝐰i,i,bi,𝐱)]=𝒫i[xi(𝐰i,iT𝐱ibi)0]κ\mathbb{E}_{\mathcal{P}_{i}}[f_{i}(\mathbf{w}_{{i},{{-i}}},b_{i},\mathbf{x})]=\mathbb{P}_{\mathcal{P}_{i}}[x_{i}({\mathbf{w}_{{i},{{-i}}}}^{\rm T}\mathbf{x}_{-i}-b_{i})\geq 0]\leq\kappa, we have 𝔼𝒫[π(𝒢)]κn\mathbb{E}_{\mathcal{P}}[\pi({\mathcal{G}})]\leq\kappa^{n}.

By Markov’s inequality, given that π(𝒢)0\pi({\mathcal{G}})\geq 0, we have 𝒫[π(𝒢)c]𝔼𝒫[π(𝒢)]cκnc\mathbb{P}_{\mathcal{P}}[\pi({\mathcal{G}})\geq c]\leq\frac{\mathbb{E}_{\mathcal{P}}[\pi({\mathcal{G}})]}{c}\leq\frac{\kappa^{n}}{c}. For c=κnδ𝒫[π(𝒢)κnδ]δ𝒫[π(𝒢)κnδ]1δc=\frac{\kappa^{n}}{\delta}\Rightarrow\mathbb{P}_{\mathcal{P}}[\pi({\mathcal{G}})\geq\frac{\kappa^{n}}{\delta}]\leq\delta\Rightarrow\mathbb{P}_{\mathcal{P}}[\pi({\mathcal{G}})\leq\frac{\kappa^{n}}{\delta}]\geq 1-\delta. ∎
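The expectation bound can be probed empirically for small n. The following Monte Carlo sketch (ours; the sample sizes are arbitrary) draws random LIGs with uniform \{-1,+1\} weights and biases, so that \kappa=3/4 applies by Lemma 24, counts PSNE exhaustively, and compares the average true proportion of equilibria against \kappa^{n}.

import itertools
import numpy as np

def proportion_of_psne(W, b):
    """Exact true proportion of equilibria pi(G) of an LIG (W, b) by enumeration."""
    n = len(b)
    count = 0
    for x in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(x)
        # x is a PSNE iff every player i has payoff x_i (w_{i,-i}^T x_{-i} - b_i) >= 0
        count += np.all(x * (W @ x - b) >= 0)
    return count / 2.0 ** n

rng = np.random.default_rng(0)
n, trials, kappa = 8, 2000, 0.75
props = []
for _ in range(trials):
    W = rng.choice([-1.0, 1.0], size=(n, n))
    np.fill_diagonal(W, 0.0)                  # no self-weights
    b = rng.choice([-1.0, 1.0], size=n)
    props.append(proportion_of_psne(W, b))
print(np.mean(props), "<=", kappa ** n)       # empirical average vs. the kappa^n bound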

Remark 28.

Under the same assumptions of Theorem 27, it is possible to prove, by using Hoeffding’s lemma, that with probability at least 1-\delta we have \pi({\mathcal{G}})\leq\kappa^{n}+\sqrt{\frac{1}{2}\log\frac{1}{\delta}}. We point out that such a bound is not better than the Markov bound derived above.

8 Experimental Results

For learning LIGs we used our convex loss methods: independent and simultaneous SVM and logistic regression (see Section 6.5). Additionally, we used the (super-exponential) exhaustive search method (see Section 6.2) only for n\leq 4. As baselines, we used the (NP-hard) sigmoidal maximum likelihood, only for n\leq 15, as well as the sigmoidal maximum empirical proportion of equilibria (see Section 6.4). Regarding the parameters \alpha and \beta of our sigmoidal function in Eq. (8), we found experimentally that \alpha=0.1 and \beta=0.001 achieved the best results.


Figure 3: On the Distinction between Game-Theoretic and Probabilistic Models in the Context of “Probabilistic” vs. “Strategic” Behavioral Data. The plot shows the performance of Ising models (in green) vs. LIGs (in red) when we learn models from each respective class from data generated by drawing i.i.d. samples from a mixture model of an Ising model, p_{\text{Ising}}, and our PSNE-based generative model, p_{\text{LIG}}, with mixing parameter q_{\text{strat}} corresponding to the probability that a sample is drawn from the LIG component. Hence, we can view q_{\text{strat}} as controlling the proportion of the data that is “strategic” in nature. The graph of the Ising model is an (undirected) chain with 4 variable nodes, while that of the LIG, as shown on the left, is also a chain of 4 players with arcs between every consecutive pair of nodes. The parameters of each mixture component in the “ground-truth” mixture model p_{\text{mix}}(\mathbf{x})\equiv q_{\text{strat}}\;p_{\text{LIG}}(\mathbf{x})+(1-q_{\text{strat}})\;p_{\text{Ising}}(\mathbf{x}) are the same: the node-potential/bias-threshold parameters are all 0, and the weights of all edges are +1. We set the “signal” parameter q of our generative model p_{\text{LIG}} to 0.9. The x-axis of the plot on the right-hand side corresponds to the mixture parameter q_{\text{strat}}, so that, as we move from left to right in the plot, a larger proportion of the data is “strategic” in nature: q_{\text{strat}}=0 means the data is “purely probabilistic,” while q_{\text{strat}}=1 means it is “purely strategic.” For each value of q_{\text{strat}}\in\{0,0.25,0.50,0.75,1\}, we generated 50 pairs of data sets from p_{\text{mix}}, each of size 50, each pair corresponding to a training and a validation data set, respectively. The learning methods used the validation data set to estimate their respective \ell_{1} regularization parameter. The Ising models learned correspond exactly to the optimal penalized likelihood. We use a simultaneous logistic regression approach, described in Section 6, to learn LIGs. The y-axis of the plot on the right-hand side shows the average, over the 50 repetitions, of the exact KL divergence between the respective learned model and p_{\text{mix}}(\mathbf{x}). We also include (a linear interpolation of the individual) error bars at the 95% confidence level. The plot clearly shows that the more “strategic” the data, the better the game-theoretic generative model performs. We can see that the learned Ising models (1) do considerably better than the LIG models when the data is purely probabilistic; (2) are more “robust” across the spectrum, degrading very gracefully as the data becomes more strategic in nature; but (3) seem to need more data to learn when the data comes exclusively from an Ising model than the LIG model does when the data is purely strategic: the LIG achieves KL values much closer to 0 when the data is purely strategic than the Ising model does when the data is purely probabilistic.

For reasons briefly discussed at the end of Section 2.1, we have little interest in determining how much worse game-theoretic models are relative to probabilistic models when applied to data from purely probabilistic processes, without any strategic component, as we think this to be a futile exercise. We believe the same is true for evaluating the quality of a probabilistic graphical model vs. a game-theoretic model when applied to strategic behavioral data, resulting from a process defined by game-theoretic concepts based on the (stable) outcomes of a game. Nevertheless, we summarize some experiments in Figure 3 that should help illustrate the point discussed at the end of Section 2.1.

Still, for scientific curiosity, we compare LIGs to learning Ising models. Once again, our goal is not to show the superiority of either games or Ising models. For n15n\leq 15 players, we perform exact 1\ell_{1}-regularized maximum likelihood estimation by using the FOBOS algorithm [Duchi and Singer, 2009a, b] and exact gradients of the log-likelihood of the Ising model. Since the computation of the exact gradient at each step is NP-hard, we used this method only for n15n\leq 15. For n>15n>15 players, we use the Höfling-Tibshirani method [Höfling and Tibshirani, 2009], which uses a sequence of first-order approximations of the exact log-likelihood. We also used a two-step algorithm, by first learning the structure by 1\ell_{1}-regularized logistic regression [Wainwright et al., 2007] and then using the FOBOS algorithm [Duchi and Singer, 2009a, b] with belief propagation for gradient approximation. We did not find a statistically significant difference between the test log-likelihood of both algorithms and therefore we only report the latter.

Our experimental setup is as follows: after learning a model for different values of the regularization parameter \rho on a training set, we select the value of \rho that maximizes the log-likelihood on a validation set, and report statistics on a test set. For synthetic experiments, we report the Kullback-Leibler (KL) divergence, average precision (one minus the fraction of falsely included equilibria), and average recall (one minus the fraction of falsely excluded equilibria) in order to measure the closeness of the recovered models to the ground truth. For real-world experiments, we report the log-likelihood. In both synthetic and real-world experiments, we report the number of equilibria and the empirical proportion of equilibria. Our results are statistically significant; we avoid showing error bars for clarity of presentation, since error bars and markers overlapped.
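For small n, these equilibrium-level metrics can be computed by exhaustive enumeration. The following sketch (ours; function names are illustrative, and a zero diagonal in \mathbf{W} is assumed) spells out equilibrium precision and recall for a learned game against a ground-truth game, as well as the empirical proportion of equilibria of a data set.

import itertools
import numpy as np

def psne_set(W, b):
    """All PSNE of the LIG (W, b) by exhaustive enumeration (assumes zero diagonal in W)."""
    n = len(b)
    return {x for x in itertools.product([-1.0, 1.0], repeat=n)
            if np.all(np.array(x) * (W @ np.array(x) - b) >= 0)}

def equilibrium_precision_recall(W_hat, b_hat, W_true, b_true):
    learned, truth = psne_set(W_hat, b_hat), psne_set(W_true, b_true)
    common = learned & truth
    precision = len(common) / len(learned) if learned else 1.0  # 1 - falsely included fraction
    recall = len(common) / len(truth) if truth else 1.0         # 1 - falsely excluded fraction
    return precision, recall

def empirical_proportion_of_equilibria(W, b, X):
    """Fraction of observed joint actions (rows of X, entries in {-1,+1}) that are PSNE of (W, b)."""
    eq = psne_set(W, b)
    return float(np.mean([tuple(row) in eq for row in X.astype(float)]))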

8.1 Experiments on Synthetic Data

[Figure 4 graphics: first synthetic model]
Figure 4: Closeness of the Recovered Models to the Ground-Truth Synthetic Model for Different Mixture Parameters qgq_{g}. Our convex loss methods (IS,SS: independent and simultaneous SVM, IL,SL: independent and simultaneous logistic regression) and sigmoidal maximum likelihood (S1) have lower KL than exhaustive search (EX), sigmoidal maximum empirical proportion of equilibria (S2) and Ising models (IM). For all methods, the recovery of equilibria is perfect for qg=0.9q_{g}=0.9 (number of equilibria equal to the ground truth, equilibrium precision and recall equal to 1) and the empirical proportion of equilibria resembles the mixture parameter of the ground truth qgq_{g}.

We first test the ability of the proposed methods to recover the PSNE induced by ground-truth games from data when those games are LIGs. We use a small first synthetic model in order to compare with the (super-exponential) exhaustive search method. The ground-truth model 𝒢g=(𝐖g,𝐛g){\mathcal{G}}_{g}=(\mathbf{W}_{g},\mathbf{b}_{g}) has n=4n=4 players and 4 Nash equilibria (i.e., π(𝒢g)\pi({\mathcal{G}}_{g})=0.25), 𝐖g\mathbf{W}_{g} was set according to Figure 4 (the weight of each edge was set to +1+1) and 𝐛g=𝟎\mathbf{b}_{g}=\mathbf{0}. The mixture parameter of the ground-truth model qgq_{g} was set to 0.5,0.7,0.9. For each of 50 repetitions, we generated a training, a validation and a test set of 50 samples each. Figure 4 shows that our convex loss methods and sigmoidal maximum likelihood outperform (lower KL) exhaustive search, sigmoidal maximum empirical proportion of equilibria and Ising models. Note that the exhaustive search method which performs exact maximum likelihood suffers from over-fitting and consequently does not produce the lowest KL. From all convex loss methods, simultaneous logistic regression achieves the lowest KL. For all methods, the recovery of equilibria is perfect for qg=0.9q_{g}=0.9 (number of equilibria equal to the ground truth, equilibrium precision and recall equal to 1). Additionally, the empirical proportion of equilibria resembles the mixture parameter of the ground-truth model qgq_{g}.
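For completeness, here is a sketch (ours) of how training data can be drawn from the PSNE-based mixture generative model used in these experiments: with probability q_{g} a sample is drawn uniformly from the equilibria of the ground-truth game, and otherwise uniformly from the non-equilibria. The example at the end assumes, for concreteness, that the 4-player graph of Figure 4 is a chain with +1 weights and zero biases, which indeed has 4 PSNE.

import itertools
import numpy as np

def sample_from_lig_mixture(W, b, q, m, rng):
    """Draw m joint actions from the PSNE-based mixture model of an LIG (W, b):
    with probability q uniform over NE(G), otherwise uniform over its complement."""
    n = len(b)
    joint = [np.array(x) for x in itertools.product([-1.0, 1.0], repeat=n)]
    is_ne = [np.all(x * (W @ x - b) >= 0) for x in joint]
    ne = [x for x, ok in zip(joint, is_ne) if ok]
    non_ne = [x for x, ok in zip(joint, is_ne) if not ok]
    samples = []
    for _ in range(m):
        pool = ne if rng.random() < q else non_ne
        samples.append(pool[rng.integers(len(pool))])
    return np.array(samples)

# Hypothetical example: a 4-player chain with +1 weights and zero biases, q_g = 0.9, 50 samples.
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
b = np.zeros(4)
X = sample_from_lig_mixture(W, b, q=0.9, m=50, rng=np.random.default_rng(0))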

[Figure 5 graphics: second synthetic model]
Figure 5: Closeness of the recovered models to the ground truth synthetic model for different mixture parameters qgq_{g}. Our convex loss methods (IS,SS: independent and simultaneous SVM, IL,SL: independent and simultaneous logistic regression) have lower KL than sigmoidal maximum likelihood (S1), sigmoidal maximum empirical proportion of equilibria (S2) and Ising models (IM). For convex loss methods, the equilibrium recovery is better than the remaining methods (number of equilibria equal to the ground truth, higher equilibrium precision and recall) and the empirical proportion of equilibria resembles the mixture parameter of the ground truth qgq_{g}.

Next, we use a slightly larger second synthetic model with more complex interactions. We still keep the model small enough in order to compare with the (NP-hard) sigmoidal maximum likelihood method. The ground truth model 𝒢g=(𝐖g,𝐛g){\mathcal{G}}_{g}=(\mathbf{W}_{g},\mathbf{b}_{g}) has n=9n=9 players and 16 Nash equilibria (i.e., π(𝒢g)\pi({\mathcal{G}}_{g})=0.03125), 𝐖g\mathbf{W}_{g} was set according to Figure 5 (the weight of each blue and red edge was set to +1+1 and 1-1 respectively) and 𝐛g=𝟎\mathbf{b}_{g}=\mathbf{0}. The mixture parameter of the ground truth qgq_{g} was set to 0.5,0.7,0.9. For each of 50 repetitions, we generated a training, a validation and a test set of 50 samples each. Figure 5 shows that our convex loss methods outperform (lower KL) sigmoidal methods and Ising models. From all convex loss methods, simultaneous logistic regression achieves the lowest KL. For convex loss methods, the equilibrium recovery is better than the remaining methods (number of equilibria equal to the ground truth, higher equilibrium precision and recall). Additionally, the empirical proportion of equilibria resembles the mixture parameter of the ground truth qgq_{g}.

Figure 6: KL divergence between the recovered models and the ground truth for data sets of different number of samples. Each chart shows the density of the ground truth, probability P(+1)P(+1) that an edge has weight +1, and average number of equilibria (NE). Our convex loss methods (IS,SS: independent and simultaneous SVM, IL,SL: independent and simultaneous logistic regression) have lower KL than sigmoidal maximum empirical proportion of equilibria (S2) and Ising models (IM). The results are remarkably better when the number of equilibria in the ground truth model is small (e.g., for NE<20<20).
Figure 7: KL divergence between the recovered models and the ground truth for data sets of different number of players. Each chart shows the density of the ground truth, probability P(+1)P(+1) that an edge has weight +1, and average number of equilibria (NE) for n=2n=2;n=14n=14. In general, simultaneous logistic regression (SL) has lower KL than sigmoidal maximum empirical proportion of equilibria (S2), and the latter one has lower KL than sigmoidal maximum likelihood (S1). Other convex losses behave the same as simultaneous logistic regression (omitted for clarity of presentation).

In the next experiment, we show that the performance of convex loss minimization improves as the number of samples increases. We used random graphs with slightly more variables and a varying number of samples (10, 30, 100, 300). The ground-truth model {\mathcal{G}}_{g}=(\mathbf{W}_{g},\mathbf{b}_{g}) contains n=20 players. For each of 20 repetitions, we generate edges in the ground-truth model \mathbf{W}_{g} with a required density (either 0.2, 0.5 or 0.8). For simplicity, the weight of each edge is set to +1 with probability P(+1) and to -1 with probability 1-P(+1).353535Part of the reason for using such a “simple”/“limited” binary set of weight values in this synthetic experiment is the ability to generate “interesting” LIGs; that is, games with interesting sets of PSNE. As a word of caution, this is not as simple as it appears at first glance. LIGs with weights and biases generated uniformly at random from some set of real values are almost always not interesting, often having only 1 or 2 PSNE [Irfan and Ortiz, 2014]. It is not until we move to more “special”/restricted classes of games, such as the one used in these experiments, that more interesting PSNE structure arises from randomly generated LIGs. That is in large part why we concentrated our experiments on games with those simple properties. (Simplicity itself also had a role in our decision, of course.) Please understand that we are not saying that LIGs that use a larger set of integers, or non-integer real-valued weights w_{ij} or biases b_{i}, are not interesting, as the LIGs we learn from the real-world data demonstrate. What we are saying is that we do not yet have a good understanding of how to randomly generate “interesting” synthetic games from the standpoint of their induced PSNE. We leave a comprehensive evaluation of our MLE-based algorithms’ ability to recover the PSNE of randomly generated synthetic LIGs, which would involve a diversity of synthetic game graph structures, influence weights and biases that induce “interesting” sets of PSNE, for future work. Hence, the Nash equilibria of the generated games do not depend on the magnitude of the weights, just on their sign. We set the bias \mathbf{b}_{g}=\mathbf{0} and the mixture parameter of the ground truth q_{g}=0.7. We then generated a training and a validation set with the same number of samples. Figure 6 shows that our convex loss methods outperform (lower KL) sigmoidal maximum empirical proportion of equilibria and Ising models (except for the synthetic model with a high true proportion of equilibria: density 0.8, P(+1)=0, NE>1000). The results are remarkably better when the number of equilibria in the ground-truth model is small (e.g., for NE<20). Of all convex loss methods, simultaneous logistic regression achieves the lowest KL.
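A sketch (ours) of the random ground-truth generation just described follows; whether the generated edges are kept directed or symmetrized is not specified in the text, so the sketch simply generates directed edges.

import numpy as np

def random_ground_truth_lig(n, density, p_plus_one, rng):
    """Random LIG: each (directed) edge is present with probability `density`;
    an included edge gets weight +1 with probability p_plus_one, else -1.
    Biases are zero, as in the synthetic experiments."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < density:
                W[i, j] = 1.0 if rng.random() < p_plus_one else -1.0
    return W, np.zeros(n)

W_g, b_g = random_ground_truth_lig(n=20, density=0.5, p_plus_one=0.5,
                                   rng=np.random.default_rng(0))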

In the next experiment, we evaluate two effects in our approximation methods. First, we evaluate the impact of removing the true proportion of equilibria from our objective function, i.e., the use of maximum empirical proportion of equilibria instead of maximum likelihood. Second, we evaluate the impact of using convex losses instead of a sigmoidal approximation of the 0/1 loss. We used random graphs with a varying number of players and 50 samples. The ground-truth model {\mathcal{G}}_{g}=(\mathbf{W}_{g},\mathbf{b}_{g}) contains n=4,6,8,10,12 players. For each of 20 repetitions, we generate edges in the ground-truth model \mathbf{W}_{g} with a required density (either 0.2, 0.5 or 0.8). As in the previous experiment, the weight of each edge is set to +1 with probability P(+1) and to -1 with probability 1-P(+1). We set the bias \mathbf{b}_{g}=\mathbf{0} and the mixture parameter of the ground truth q_{g}=0.7. We then generated a training and a validation set with the same number of samples. Figure 7 shows that, in general, convex loss methods outperform (lower KL) sigmoidal maximum empirical proportion of equilibria, and the latter outperforms sigmoidal maximum likelihood. A different effect is observed for mild (0.5) to high (0.8) density and P(+1)=1, in which case sigmoidal maximum likelihood obtains the lowest KL. On closer inspection, we found that the ground-truth games usually have only 2 equilibria: (+1,\dots,+1) and (-1,\dots,-1), which seems to present a challenge for convex loss methods. It seems that for these specific cases, removing the true proportion of equilibria from the objective function negatively impacts the estimation process; note, however, that sigmoidal maximum likelihood is not computationally feasible for n>15.

8.2 Experiments on Real-World Data: U.S. Congressional Voting Records

Figure 8: Statistics for games learnt from 20 senators from the first session of the 104th congress, first session of the 107th congress and second session of the 110th congress. The log-likelihood of our convex loss methods (IS,SS: independent and simultaneous SVM, IL,SL: independent and simultaneous logistic regression) is higher than sigmoidal maximum empirical proportion of equilibria (S2) and Ising models (IM). For all methods, the number of equilibria (and so the true proportion of equilibria) is low.

[Figure 9 graphics: panels (a)-(e)]

Figure 9: (Top) Matrices of (direct) influence weights \mathbf{W} for games learned from all 100 senators, from the first session of the 104th congress (left), the first session of the 107th congress (center) and the second session of the 110th congress (right), using our independent (a) and simultaneous (b) logistic regression methods. A row represents how much every other senator directly influences the senator in that row, in terms of the influence weights of the learned LIG. Positive influence-weight parameter values are shown in blue; negative values are in red. Democrats are shown in the top/left corner, while Republicans are shown in the bottom/right corner. Note that the simultaneous method produces structures that are sparser than its independent counterpart. (c) Partial view of the graph for simultaneous logistic regression. (d) Most directly-influential senators and (e) least directly-influenceable senators. Regularization parameter \rho=0.0006.
Figure 10: Direct influence between parties, and direct influences from Obama and McCain. Games were learned from all 100 senators, from the 101st congress (Jan 1989) to the 111th congress (Dec 2010), using our simultaneous logistic regression method. Direct influences between senators of the same party are stronger than those between senators of different parties, and cross-party influence decreases over time. In the last sessions, direct influence from Obama to Republicans increased, while influence from McCain to both parties decreased. Regularization parameter \rho=0.0006.

We used the U.S. congressional voting records in order to measure the generalization performance of convex loss minimization on a real-world data set. The data set is publicly available at http://www.senate.gov/. We used the first session of the 104th congress (Jan 1995 to Jan 1996, 613 votes), the first session of the 107th congress (Jan 2001 to Dec 2001, 380 votes) and the second session of the 110th congress (Jan 2008 to Jan 2009, 215 votes). Following other researchers who have experimented with this data set (e.g., Banerjee et al. 2008), abstentions were replaced with negative votes. Since reporting the log-likelihood requires computing the number of equilibria (which is NP-hard), we selected only 20 senators by stratified random sampling. We randomly split the data into three parts. We performed six repetitions by making each third of the data take turns as the training, validation and testing set. Figure 8 shows that our convex loss methods outperform (higher log-likelihood) sigmoidal maximum empirical proportion of equilibria and Ising models. Of all convex loss methods, simultaneous logistic regression achieves the highest test log-likelihood. For all methods, the number of equilibria (and so the true proportion of equilibria) is low.

We apply convex loss minimization to larger problems by learning structures of games from all 100 senators. Figure 9 shows that simultaneous logistic regression produces structures that are sparser than its independent counterpart. The simultaneous method better elicits the bipartisan structure of the congress. We define the (aggregate) direct influence of player j on all other players as \sum_{i}{|w_{ij}|} after normalizing all weights, i.e., for each player i we divide (\mathbf{w}_{{i},{-i}},b_{i}) by \|\mathbf{w}_{{i},{-i}}\|_{1}+|b_{i}|. Note that Jeffords and Clinton are each among the 5 most directly-influential as well as the 5 least directly-influenceable (high bias) senators, in the 107th and 110th congress, respectively. McCain and Feingold are both in the list of the 5 most directly-influential senators in the 104th and 107th congress. McCain appears again in the list of the 5 least directly-influenceable senators in the 110th congress (as defined above in the context of the LIG model).
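A small sketch (ours) of this aggregate direct-influence computation: each row (\mathbf{w}_{{i},{-i}},b_{i}) is first normalized by its \ell_{1} norm plus |b_{i}|, and the influence exerted by player j is then the column sum \sum_{i}{|w_{ij}|}.

import numpy as np

def aggregate_direct_influence(W, b):
    """Aggregate direct influence of each player j: sum_i |w_ij| after row-normalization."""
    W, b = W.copy(), b.copy()
    for i in range(W.shape[0]):
        scale = np.sum(np.abs(W[i, :])) + abs(b[i])
        if scale > 0:
            W[i, :] /= scale
            b[i] /= scale
    return np.abs(W).sum(axis=0)   # column sums: influence exerted by each player j

# Most directly-influential players: largest aggregate direct influence.
# Least directly-influenceable players: largest normalized |b_i| ("high bias").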

We test the hypothesis that the aggregate direct influence, as defined by our model, between senators of the same party is stronger than between senators of different parties. We learn structures of games from all 100 senators from the 101st congress to the 111th congress (Jan 1989 to Dec 2010). The number of votes cast per session was 337 on average (minimum: 215, maximum: 613). Figure 10 validates our hypothesis and, more interestingly, shows that influence between different parties is decreasing over time. Note that the influence from Obama to Republicans increased in the last sessions, while McCain’s influence on Republicans decreased.

Since the U.S. Congressional voting data is observational, we used the log-likelihood as an adequate measure of predictive performance. We argue that the log-likelihood of joint actions provides a more “global view” compared to predicting the action of a single agent. Furthermore, predicting the action of a single agent (i.e., xix_{i}) works under the assumption that we have access to the decisions of the other agents (i.e., 𝐱i\mathbf{x}_{-i}), which is in contrast to our framework. Regarding causal strategic inference, Irfan and Ortiz [2014] use the games that we produce in this section in order to address problems such as the identification of most influential senators. (We refer the reader to their paper for further details.)

9 Concluding Remarks

In Section 6, we present a variety of algorithms to learn LIGs from strictly behavioral data, including what we call independent logistic regression (ILR). There is a very popular technique for learning Ising models that uses independent regularized logistic regression to compute the individual conditional probabilities as a step toward computing a globally coherent joint probability distribution. However, this approach is inherently problematic, as some authors have previously pointed out (see, e.g., Guo et al. 2010). Without getting too technical, the main roadblock is that there is no guarantee that estimates of the weights produced by the individual regressions be symmetric: w^ij=w^ji\widehat{w}_{ij}=\widehat{w}_{ji} for all i,ji,j. Learning an Ising model requires the enforcement of this condition, and a variety of heuristics have been proposed. (Please see Section 2.1 for relevant work and references in this area.)

We also apply ILR in exactly the same manner but for a different objective: learning LIGs. Some seem to think that this diminishes the significance of our contributions. We strongly believe the opposite is true: that we can learn games by using such simple, practical, efficient and well-studied techniques is a significant plus in our view. Again, without getting too technical, the estimates of ILR need not be symmetric for LIG models, and are always perfectly consistent with the LIG definition. In fact, asymmetric estimates are common in practice (the LIG for the 110th Congress depicted in Figure 1 is an example). And we believe this makes the model more interesting. In the ILR-learned LIG, a player may have a positive, negative or no direct effect on another player’s utility, and vice versa.363636It is easy to come up with examples of such opposing/antagonistic interactions between individuals in real-world settings. (See, e.g., “parents with teenagers”; perhaps a more timely example is the U.S. Congress in recent times.)

Thus, despite the process of estimation of model parameters being similar, the view of the output of that estimation process is radically different in each case. Our experiments show that our generative model with LIGs built from ILR estimates achieves higher generalization likelihoods than standard probabilistic models such as Ising models that may also use ILR. This fact, that the generative model defined in terms of game-theoretic equilibrium concepts can explain the data better than traditional probabilistic models, provides further evidence supporting such a game-theoretic “view” of the ILR estimated parameters and yields additional confidence in their use in game-theoretic models.

In short, ILR is a thoroughly studied method with a long tradition and an extensive literature from which we can only benefit. We find it to be a good, unexpected outcome of our research in this work, and thus a reasonably significant contribution, that we can successfully and effectively use ILR, a very simple and practical estimation technique for learning probabilistic graphical models, to learn game-theoretic graphical models too.

9.1 Extensions and Future Work

There are several ways of extending this research. We can extend our approach to \epsilon-approximate PSNE.373737By definition, given \epsilon\geq 0, a joint pure-strategy \mathbf{x}^{*} is an \epsilon-approximate PSNE if for each player i, we have u_{i}(\mathbf{x}^{*})\geq\max_{x_{i}}u_{i}(x_{i},\mathbf{x}_{-i}^{*})-\epsilon; in other words, no player can gain more than \epsilon in payoff/utility value from unilaterally deviating from x_{i}^{*}, assuming the other players play \mathbf{x}_{-i}^{*}. Using this definition, we can see that a PSNE is simply a 0-approximate PSNE. In this case, for each player, instead of one condition we will have two best-response conditions, which are still linear in \mathbf{W} and \mathbf{b} (as sketched below). Additionally, we can extend our approach to a broader class of graphical games and non-Boolean actions. Note that our analysis does not rely on binary actions, but on binary features of one player 1[x_{i}=1] or two players 1[x_{i}=x_{j}]. We can use features of three players 1[x_{i}=x_{j}=x_{k}] or of non-Boolean actions 1[x_{i}=3,x_{j}=7]. This kernelized version is still linear in \mathbf{W} and \mathbf{b}. These extensions are possible because our algorithms and analysis rely on linearity and binary features; additionally, we can obtain a new upper bound on the “VC-dimension” by changing the inputs of the neural-network architecture. We can easily extend our approach to parameter learning for fixed structures by using an \ell_{2}^{2} regularizer instead.
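To make the \epsilon-approximate-PSNE extension concrete, here is our rendering (an illustrative sketch, not taken verbatim from the text) of the two best-response conditions for each player i under the LIG payoff u_{i}(\mathbf{x})=x_{i}({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}-b_{i}):

x_{i}^{*}\,({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}^{*}-b_{i})\;\geq\;({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}^{*}-b_{i})-\epsilon
\quad\text{and}\quad
x_{i}^{*}\,({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}^{*}-b_{i})\;\geq\;-({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}^{*}-b_{i})-\epsilon\;,

since \max_{x_{i}\in\{-1,+1\}}{x_{i}({\mathbf{w}_{{i},{-i}}}^{\rm T}\mathbf{x}_{-i}^{*}-b_{i})} equals the larger of the two right-hand sides (without the -\epsilon terms). Both conditions are linear in (\mathbf{W},\mathbf{b}), and setting \epsilon=0 recovers the exact PSNE constraint.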

Future work should also consider and study more sophisticated noise processes, MSNE, and the analysis of different upper bounds for the 0/1 loss (e.g., exponential, smooth hinge). Finally, we should consider other slightly more complex versions of our model based on Bayesian or stochastic games to account for possible variations of the influence-weights and bias-threshold parameters. As an example, we may consider versions of our model for congressional voting that would explicitly capture game differences in terms of influences and biases that depend on the nature or topic of each specific bill being voted on, as well as Senators’ time-changing preferences and trends.

Acknowledgements

We are grateful to Christian Luhmann for extensive discussion of many aspects of this work and the multiple comments and feedback he provided to improve both the work and the presentation. We warmly thank Tommi Jaakkola for several informal discussions on the topic of learning games and the ideas behind causal strategic inference, as well as suggestions on how to improve the presentation. We also thank Mohammad Irfan for his help with motivating causal strategic inference in inference games and examples to illustrate their use. We would also like to thank Alexander Berg, Tamara Berg and Dimitris Samaras for their comments and questions during informal presentations of this work, which helped us tailor the presentation to a wider audience. We are very grateful to Nicolle Gruzling for sharing her Master’s thesis [Gruzling, 2006] which contains valuable references used in Section 6.2. Finally, we thank several anonymous reviewers for their comments and feedback, and in particular, an anonymous reviewer for the reference to an overview on the literature on composite marginal likelihoods.

Shorter versions of the work presented here appear as Chapter 7 of the first author’s Ph.D. dissertation [Honorio, 2012] and as e-print arXiv:1206.3713 [cs.LG] [Honorio and Ortiz, 2012].

This work was supported in part by NSF CAREER Award IIS-1054541.

Appendix A Additional Discussion

In this section, we discuss our choice of modeling end-state predictions without modeling dynamics. We also discuss some alternative noise models to the one studied in this manuscript.

A.1 On End-State Predictions without Explicitly Modeling Dynamics

Certainly, in cases in which information about the dynamics is available, the learned model may use such information while still making end-state predictions. But no such information, either via data sequences or prior knowledge, is available in any of the publicly-available real-world data sets we study here. Take congressional voting as an example. Considering the voting records as a sequence of votes does not seem sensible in our context from a modeling/engineering perspective, because the data set does not have any detailed information about the nature of the vote: we just have each senator’s vote on whatever bill they considered, and little to no information about the detailed dynamics that might have led to the senators’ final votes. Indeed, one may go further and argue that assuming the availability of information about the dynamics of the process is a considerable burden on the modeler and something of wishful thinking in many practical, real-world settings.

Besides the nature of our setting, the lack of data or information, and the CSI motivation, there are other, more fundamental ML reasons why we have no interest in considering dynamics in this paper. First, we view the additional complexity of a dynamic/temporal model as providing the wrong tradeoff: dynamic/temporal models are often inherently more complex to express and learn from data. Second, it is common ML practice to separate single example/outcome problems from sequence problems; said differently, ML generally treats the problem of learning from individual i.i.d. examples differently from that of learning from sequences or sequence prediction. Third, we invoke Vapnik’s Principle of Transduction (Vapnik, 1998, page 477):383838See also http://www.cs.man.ac.uk/~jknowles/transductive.html for additional information.

“When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one.”

We believe this additional complexity of temporal dynamics, while possibly more “realistic,” might easily weaken the power of the “bottom-line” prediction of the possible stable final outcomes, because the resulting models can get side-tracked by the details of the interaction. We believe the difficulty of modeling such details, especially with the relatively limited amount of data available, leads to poor prediction performance on what we really care about from an engineering standpoint: we would like to know or predict what will end up happening, and have little or no interest in how or why this happens.

We recognize the scientific significance and importance of research in social and behavioral sciences such as sociology, psychology and, in some cases economics, on explaining, at a higher level of abstraction, often going as low as the “cognitive” or “neuroscience” level, the process by which final decisions are reached.

We believe the sudden growth of interest from industry (both online and physical companies), government and other national and international institutions on predicting “behavior” for the purpose of revenue, improving efficiency, instituting effective policies with minimal regulations, etc., should shift the focus of the study of “behavior” closer to an engineering endeavor. We believe such entities are after the “bottom line” and will care more about the end-goal than how or why a specific outcome is achieved, modulo, of course, having simple enough and tractable computational models that provide reasonably accurate predictions of final end-state behavior, or at least accurate enough for their purposes.

A.2 On Alternative Noise Models

Next, we discuss some alternative noise models to the one studied in this manuscript. Specifically, we discuss an extension of the PSNE-based mixture noise model as well as individual-player noise models.

A.2.1 On Generalizations of the PSNE-based Mixture Noise Models

A simple extension to our model in Eq. (1) is to allow for more general distributions for the PSNE and the non-PSNE sets. That is, with some probability 0<q<10<q<1, a joint action 𝐱\mathbf{x} is chosen from 𝒩(𝒢){\mathcal{NE}}({\mathcal{G}}) by following a distribution 𝒫α\mathcal{P}_{\alpha} parameterized by α\alpha; otherwise, 𝐱\mathbf{x} is chosen from its complement set {1,+1}n𝒩(𝒢)\{-1,+1\}^{n}-{\mathcal{NE}}({\mathcal{G}}) by following a distribution 𝒫β\mathcal{P}_{\beta} parameterized by β\beta. The corresponding probability mass function (PMF) over joint-behaviors {1,+1}n\{-1,+1\}^{n} parameterized by (𝒢,q,α,β)({\mathcal{G}},q,\alpha,\beta) is

p_{({\mathcal{G}},q,\alpha,\beta)}(\mathbf{x})=q\,\frac{p_{\alpha}(\mathbf{x})1[\mathbf{x}\in{\mathcal{NE}}({\mathcal{G}})]}{\sum_{\mathbf{z}\in{\mathcal{NE}}({\mathcal{G}})}{p_{\alpha}(\mathbf{z})}}+(1-q)\,\frac{p_{\beta}(\mathbf{x})1[\mathbf{x}\notin{\mathcal{NE}}({\mathcal{G}})]}{\sum_{\mathbf{z}\notin{\mathcal{NE}}({\mathcal{G}})}{p_{\beta}(\mathbf{z})}}\;,

where 𝒢{\mathcal{G}} is a game, pα(𝐱)p_{\alpha}(\mathbf{x}) and pβ(𝐱)p_{\beta}(\mathbf{x}) are PMFs over {1,+1}n\{-1,+1\}^{n}.

One reasonable technique for learning such a model from data is an alternating-optimization method. Compared to our simpler model, this model requires a step that maximizes the likelihood by changing \alpha and \beta while keeping {\mathcal{G}} and q constant. The complexity of this step will depend on how we parameterize the PMFs \mathcal{P}_{\alpha} and \mathcal{P}_{\beta}, but will very likely be an NP-hard problem because of the partition function. Furthermore, the problem of maximizing the likelihood by changing {\mathcal{NE}}({\mathcal{G}}) (while keeping \alpha, \beta and q constant) is combinatorial in nature, and in this paper we provided a tractable approximation method with provable guarantees (for the case of uniform distributions). Other approaches for maximizing the likelihood by changing {\mathcal{G}} are very likely to be exponential, as we discuss briefly at the start of Section 6.

A.2.2 On Individual-Player Noise Models

As an example, consider the generative model in which we first randomly select a PSNE \mathbf{x} of the game from a distribution \mathcal{P}_{\alpha} parameterized by \alpha, and then each player i, independently, acts according to x_{i} with probability q_{i} and switches its action with probability 1-q_{i}. The corresponding probability mass function (PMF) over joint-behaviors \{-1,+1\}^{n} parameterized by ({\mathcal{G}},q_{1},\dots,q_{n},\alpha) is

p_{{\mathcal{G}},\mathbf{q},\alpha}(\mathbf{x})=\sum_{\mathbf{y}\in{\mathcal{NE}}({\mathcal{G}})}{\frac{p_{\alpha}(\mathbf{y})}{\sum_{\mathbf{z}\in{\mathcal{NE}}({\mathcal{G}})}{p_{\alpha}(\mathbf{z})}}}\prod_{i}{q_{i}^{1[x_{i}=y_{i}]}(1-q_{i})^{1[x_{i}\neq y_{i}]}}\;,

where 𝒢{\mathcal{G}} is a game, 𝐪=(q1,,qn)\mathbf{q}=(q_{1},\dots,q_{n}) and pα(𝐲)p_{\alpha}(\mathbf{y}) is a PMF over {1,+1}n\{-1,+1\}^{n}.

One reasonable technique for learning such a model from data is an alternating-optimization method. In one step, we maximize the likelihood by changing {\mathcal{NE}}({\mathcal{G}}) while keeping \mathbf{q} and \alpha constant, which could be tractably performed by applying Jensen’s inequality, since the problem is combinatorial in nature. In another step, we maximize the likelihood by changing \mathbf{q} while keeping {\mathcal{G}} and \alpha constant, which could be performed by gradient ascent (the complexity of this step depends on the size of {\mathcal{NE}}({\mathcal{G}}), which could be exponential!). In yet another step, we maximize the likelihood by changing \alpha while keeping {\mathcal{G}} and \mathbf{q} constant (the complexity of this step will depend on how we parameterize the PMF \mathcal{P}_{\alpha}, but will very likely be an NP-hard problem because of the partition function). The main problem with this technique is that it can be formally proved that the first step (Jensen’s inequality) will almost surely pick a single equilibrium for the model (i.e., |{\mathcal{NE}}({\mathcal{G}})|=1).

References

  • Ackermann and Skopalik [2007] H. Ackermann and A. Skopalik. On the Complexity of Pure Nash Equilibria in Player-Specific Network Congestion Games. In X. Deng and F. Graham, editors, Internet and Network Economics, volume 4858 of Lecture Notes in Computer Science, pages 419–430. Springer Berlin Heidelberg, 2007.
  • Aichholzer and Aurenhammer [1996] O. Aichholzer and F. Aurenhammer. Classifying Hyperplanes in Hypercubes. SIAM Journal on Discrete Mathematics, 9(2):225–232, 1996.
  • Aumann [1974] R. Aumann. Subjectivity and Correlation in Randomized Strategies. Journal of Mathematical Economics, 1:67–96, 1974.
  • Ballester et al. [2004] C. Ballester, A. Calvó-Armengol, and Y. Zenou. Who’s Who in Crime Networks. Wanted: the Key Player. Working Paper Series 617, Research Institute of Industrial Economics, 2004.
  • Ballester et al. [2006] C. Ballester, A. Calvó-Armengol, and Y. Zenou. Who’s Who in Networks. Wanted: The Key Player. Econometrica, 74(5):1403–1417, 2006.
  • Banerjee et al. [2008] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. Journal of Machine Learning Research, 9:485–516, 2008.
  • Banerjee et al. [2006] O. Banerjee, L. El Ghaoui, A. d’Aspremont, and G. Natsoulis. Convex Optimization Techniques for Fitting Sparse Gaussian Graphical Models. In W. Cohen and A. Moore, editors, Proceedings of the 23rd International Machine Learning Conference, pages 89–96. Omni Press, 2006.
  • Blum et al. [2006] B. Blum, C.R. Shelton, and D. Koller. A Continuation Method for Nash Equilibria in Structured Games. Journal of Artificial Intelligence Research, 25:457–502, 2006.
  • Boyd and Vandenberghe [2006] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2006.
  • Bradley and Mangasarian [1998] P. S. Bradley and O. L. Mangasarian. Feature Selection via Concave Minimization and Support Vector Machines. In Proceedings of the 15th International Conference on Machine Learning, ICML ’98, pages 82–90, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
  • Brock and Durlauf [2001] W. Brock and S. Durlauf. Discrete Choice with Social Interactions. The Review of Economic Studies, 68(2):235–260, 2001.
  • Camerer [2003] C. Camerer. Behavioral Game Theory: Experiments on Strategic Interaction. Princeton University Press, 2003.
  • Cao et al. [2011] T. Cao, X. Wu, T. Hu, and S. Wang. Active Learning of Model Parameters for Influence Maximization. In D. Gunopulos, T. Hofmann, D. Malerba, and M. Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6911 of Lecture Notes in Computer Science, pages 280–295. Springer Berlin Heidelberg, 2011.
  • Chapman et al. [2010] A. Chapman, A. Farinelli, E. Munoz de Cote, A. Rogers, and N. Jennings. A Distributed Algorithm for Optimising over Pure Strategy Nash Equilibria. In AAAI Conference on Artificial Intelligence, 2010.
  • Chickering [2002] D. Chickering. Learning Equivalence Classes of Bayesian-Network Structures. Journal of Machine Learning Research, 2:445–498, 2002.
  • Chow and Liu [1968] C. Chow and C. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
  • Daskalakis et al. [2011] C. Daskalakis, A. Dimakisy, and E. Mossel. Connectivity and Equilibrium in Random Games. Annals of Applied Probability, 21(3):987–1016, 2011.
  • Daskalakis et al. [2009] C. Daskalakis, P. Goldberg, and C. Papadimitriou. The complexity of computing a Nash equilibrium. Communications of the ACM, 52(2):89–97, 2009.
  • Daskalakis and Papadimitriou [2006] C. Daskalakis and C.H. Papadimitriou. Computing Pure Nash Equilibria in Graphical Games via Markov Random Fields. In Proceedings of the 7th ACM Conference on Electronic Commerce, EC ’06, pages 91–99, New York, NY, USA, 2006. ACM.
  • Dilkina et al. [2007] B. Dilkina, C. P. Gomes, and A. Sabharwal. The Impact of Network Topology on Pure Nash Equilibria in Graphical Games. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1, AAAI’07, pages 42–49. AAAI Press, 2007.
  • Domingos [2005] P. Domingos. Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1):80–82, 2005. Short Paper.
  • Domingos and Richardson [2001] P. Domingos and M. Richardson. Mining the Network Value of Customers. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pages 57–66, New York, NY, USA, 2001. ACM.
  • Duchi and Singer [2009a] J. Duchi and Y. Singer. Efficient Learning using Forward-Backward Splitting. In Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 495–503. Curran Associates, Inc., 2009a.
  • Duchi and Singer [2009b] J. Duchi and Y. Singer. Efficient Online and Batch Learning using Forward Backward Splitting. Journal of Machine Learning Research, 10:2899–2934, 2009b.
  • Dunkel [2007] J. Dunkel. Complexity of Pure-Strategy Nash Equilibria in Non-Cooperative Games. In K.-H. Waldmann and U. M. Stocker, editors, Operations Research Proceedings, volume 2006, pages 45–51. Springer Berlin Heidelberg, 2007.
  • Dunkel and Schulz [2006] J. Dunkel and A. Schulz. On the Complexity of Pure-Strategy Nash Equilibria in Congestion and Local-Effect Games. In P. Spirakis, M. Mavronicolas, and S. Kontogiannis, editors, Internet and Network Economics, volume 4286 of Lecture Notes in Computer Science, pages 62–73. Springer Berlin Heidelberg, 2006.
  • Duong et al. [2009] Q. Duong, Y. Vorobeychik, S. Singh, and M.P. Wellman. Learning Graphical Game Models. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, pages 116–121, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.
  • Duong et al. [2008] Q. Duong, M. Wellman, and S. Singh. Knowledge Combination in Graphical Multiagent Model. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 145–152, Corvallis, Oregon, 2008. AUAI Press.
  • Duong et al. [2012] Q. Duong, M.P. Wellman, S. Singh, and M. Kearns. Learning and Predicting Dynamic Networked Behavior with Graphical Multiagent Models. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS ’12, pages 441–448, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.
  • Duong et al. [2010] Q. Duong, M.P. Wellman, S. Singh, and Y. Vorobeychik. History-dependent Graphical Multiagent Models. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, AAMAS ’10, pages 1215–1222, Richland, SC, 2010. International Foundation for Autonomous Agents and Multiagent Systems.
  • Even-Dar and Shapira [2007] E. Even-Dar and A. Shapira. A Note on Maximizing the Spread of Influence in Social Networks. In X. Deng and F. Graham, editors, Internet and Network Economics, volume 4858 of Lecture Notes in Computer Science, pages 281–286. Springer Berlin Heidelberg, 2007.
  • Fabrikant et al. [2004] A. Fabrikant, C. Papadimitriou, and K. Talwar. The Complexity of Pure Nash Equilibria. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, STOC ’04, pages 604–612, New York, NY, USA, 2004. ACM.
  • Ficici et al. [2008] S. Ficici, D. Parkes, and A. Pfeffer. Learning and Solving Many-Player Games through a Cluster-Based Representation. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 188–195, Corvallis, Oregon, 2008. AUAI Press.
  • Fudenberg and Levine [1999] D. Fudenberg and D. Levine. The Theory of Learning in Games. MIT Press, 1999.
  • Fudenberg and Tirole [1991] D. Fudenberg and J. Tirole. Game Theory. The MIT Press, 1991.
  • Gao and Pfeffer [2010] X. Gao and A. Pfeffer. Learning Game Representations from Data Using Rationality Constraints. In Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 185–192, Corvallis, Oregon, 2010. AUAI Press.
  • Gilboa and Zemel [1989] I. Gilboa and E. Zemel. Nash and correlated equilibria: some complexity considerations. Games and Economic Behavior, 1(1):80–93, 1989.
  • Gomez Rodriguez et al. [2010] M. Gomez Rodriguez, J. Leskovec, and A. Krause. Inferring Networks of Diffusion and Influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 1019–1028, New York, NY, USA, 2010. ACM.
  • Gottlob et al. [2005] G. Gottlob, G. Greco, and F. Scarcello. Pure Nash equilibria: Hard and easy games. Journal of Artificial Intelligence Research, 24(1):357–406, 2005.
  • Goyal et al. [2010] A. Goyal, F. Bonchi, and L.V.S. Lakshmanan. Learning Influence Probabilities in Social Networks. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, WSDM ’10, pages 241–250, New York, NY, USA, 2010. ACM.
  • Granovetter [1978] M. Granovetter. Threshold Models of Collective Behavior. The American Journal of Sociology, 83(6):1420–1443, 1978.
  • Gruzling [2006] N. Gruzling. Linear Separability of the Vertices of an nn-Dimensional Hypercube. Master’s thesis, The University of British Columbia, 2006.
  • Guo et al. [2010] J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint structure estimation for categorical Markov networks. Technical report, University of Michigan, Department of Statistics, 2010. Submitted. http://www.stat.lsa.umich.edu/~elevina.
  • Guo and Schuurmans [2006] Y. Guo and D. Schuurmans. Convex Structure Learning for Bayesian Networks: Polynomial Feature Selection and Approximate Ordering. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 208–216, Arlington, Virginia, 2006. AUAI Press.
  • Hasan and Galiana [2008] E. Hasan and F. Galiana. Electricity Markets Cleared by Merit Order–Part II: Strategic Offers and Market Power. IEEE Transactions on Power Systems, 23(2):372–379, 2008.
  • Hasan and Galiana [2010] E. Hasan and F. Galiana. Fast Computation of Pure Strategy Nash Equilibria in Electricity Markets Cleared by Merit Order. IEEE Transactions on Power Systems, 25(2):722–728, 2010.
  • Hasan et al. [2008] E. Hasan, F. Galiana, and A. Conejo. Electricity Markets Cleared by Merit Order–Part I: Finding the Market Outcomes Supported by Pure Strategy Nash Equilibria. IEEE Transactions on Power Systems, 23(2):361–371, 2008.
  • Heal and Kunreuther [2003] G. Heal and H. Kunreuther. You Only Die Once: Managing Discrete Interdependent Risks. Working Paper W9885, National Bureau of Economic Research, 2003.
  • Heal and Kunreuther [2006] G. Heal and H. Kunreuther. Supermodularity and Tipping. Working Paper 12281, National Bureau of Economic Research, 2006.
  • Heal and Kunreuther [2007] G. Heal and H. Kunreuther. Modeling Interdependent Risks. Risk Analysis, 27:621–634, 2007.
  • Höfling and Tibshirani [2009] H. Höfling and R. Tibshirani. Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods. Journal of Machine Learning Research, 10:883–906, 2009.
  • Honorio [2012] J. Honorio. Tractable Learning of Graphical Model Structures from Data. PhD thesis, Stony Brook University, Department of Computer Science, 2012.
  • Honorio and Ortiz [2012] J. Honorio and L. Ortiz. Learning the Structure and Parameters of Large-Population Graphical Games from Behavioral Data. Computer Research Repository, 2012. http://arxiv.org/abs/1206.3713.
  • Irfan and Ortiz [2014] M. Irfan and L. E. Ortiz. On Influence, Stable Behavior, and the Most Influential Individuals in Networks: A Game-Theoretic Approach. Artificial Intelligence, 215:79–119, 2014.
  • Janovskaja [1968] E. Janovskaja. Equilibrium situations in multi-matrix games. Litovskiĭ Matematicheskiĭ Sbornik, 8:381–384, 1968.
  • Jiang and Leyton-Brown [2008] A. Jiang and K. Leyton-Brown. Action-Graph Games. Technical Report TR-2008-13, University of British Columbia, Department of Computer Science, 2008.
  • Jiang and Leyton-Brown [2011] A.X. Jiang and K. Leyton-Brown. Polynomial-time Computation of Exact Correlated Equilibrium in Compact Games. In Proceedings of the 12th ACM Conference on Electronic Commerce, EC ’11, pages 119–126, New York, NY, USA, 2011. ACM.
  • Kakade et al. [2003] S. Kakade, M. Kearns, J. Langford, and L. Ortiz. Correlated Equilibria in Graphical Games. In Proceedings of the 4th ACM Conference on Electronic Commerce, EC ’03, pages 42–47, New York, NY, USA, 2003. ACM.
  • Kearns [2005] M. Kearns. Economics, Computer Science, and Policy. Issues in Science and Technology, 2005.
  • Kearns et al. [2001] M. Kearns, M. Littman, and S. Singh. Graphical Models for Game Theory. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 253–260, San Francisco, CA, 2001. Morgan Kaufmann.
  • Kearns and Vazirani [1994] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.
  • Kearns and Wortman [2008] M. Kearns and J. Wortman. Learning from Collective Behavior. In R.A. Servedio and T. Zhang, editors, 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, COLT ’08, pages 99–110. Omnipress, 2008.
  • Kleinberg [2007] J. Kleinberg. Cascading Behavior in Networks: Algorithmic and Economic Issues. In N. Nisan, T. Roughgarden, É. Tardos, and V. V. Vazirani, editors, Algorithmic Game Theory, chapter 24, pages 613–632. Cambridge University Press, 2007.
  • Koller and Friedman [2009] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
  • Koller and Milch [2003] D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181–221, 2003.
  • Kunreuther and Michel-Kerjan [2007] H. Kunreuther and E. Michel-Kerjan. Assessing, Managing and Benefiting from Global Interdependent Risks: The Case of Terrorism and Natural Disasters, 2007. CREATE Symposium: http://opim.wharton.upenn.edu/risk/library/AssessingRisks-2007.pdf.
  • La Mura [2000] P. La Mura. Game Networks. In Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI-00), pages 335–342, San Francisco, CA, 2000. Morgan Kaufmann.
  • Lee et al. [2007] S. Lee, V. Ganapathi, and D. Koller. Efficient Structure Learning of Markov Networks using L1{L}_{1}-Regularization. In B. Schölkopf, J.C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 817–824. MIT Press, 2007.
  • López-Pintado and Watts [2008] D. López-Pintado and D. Watts. Social Influence, Binary Decisions and Collective Dynamics. Rationality and Society, 20(4):399–443, 2008.
  • McKelvey and Palfrey [1995] R. McKelvey and T. Palfrey. Quantal Response Equilibria for Normal Form Games. Games and Economic Behavior, 10(1):6–38, 1995.
  • Morris [2000] S. Morris. Contagion. The Review of Economic Studies, 67(1):57–78, 2000.
  • Muroga [1965] S. Muroga. Lower bounds on the number of threshold functions and a maximum weight. IEEE Transactions on Electronic Computers, 14:136–148, 1965.
  • Muroga [1971] S. Muroga. Threshold Logic and Its Applications. John Wiley & Sons, 1971.
  • Muroga and Toda [1966] S. Muroga and I. Toda. Lower Bound of the Number of Threshold Functions. IEEE Transactions on Electronic Computers, 5:805–806, 1966.
  • Nash [1951] J. Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
  • Ng and Russell [2000] A. Y. Ng and S. J. Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning, ICML ’00, pages 663–670, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
  • Nisan et al. [2007] N. Nisan, T. Roughgarden, É. Tardos, and V. V. Vazirani, editors. Algorithmic Game Theory. Cambridge University Press, 2007.
  • Ortiz and Kearns [2003] L. E. Ortiz and M. Kearns. Nash Propagation for Loopy Graphical Games. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 817–824. MIT Press, 2003.
  • Papadimitriou and Roughgarden [2008] C. Papadimitriou and T. Roughgarden. Computing correlated equilibria in multi-player games. Journal of the ACM, 55(3):1–29, 2008.
  • Rinott and Scarsini [2000] Y. Rinott and M. Scarsini. On the Number of Pure Strategy Nash Equilibria in Random Games. Games and Economic Behavior, 33(2):274–293, 2000.
  • Rosenthal [1973] R. Rosenthal. A class of games possessing pure-strategy Nash equilibria. International Journal of Game Theory, 2(1):65–67, 1973.
  • Ryan et al. [2010] C.T. Ryan, A.X. Jiang, and K. Leyton-Brown. Computing Pure Strategy Nash Equilibria in Compact Symmetric Games. In Proceedings of the 11th ACM Conference on Electronic Commerce, EC ’10, pages 63–72, New York, NY, USA, 2010. ACM.
  • Saito et al. [2009] K. Saito, M. Kimura, K. Ohara, and H. Motoda. Learning Continuous-Time Information Diffusion Model for Social Behavioral Data Analysis. In Z. Zhou and T. Washio, editors, Advances in Machine Learning, volume 5828 of Lecture Notes in Computer Science, pages 322–337. Springer Berlin Heidelberg, 2009.
  • Saito et al. [2010] K. Saito, M. Kimura, K. Ohara, and H. Motoda. Selecting Information Diffusion Models over Social Networks for Behavioral Analysis. In J. L. Balcázar, F. Bonchi, A. Gionis, and M. Sebag, editors, Machine Learning and Knowledge Discovery in Databases, volume 6323 of Lecture Notes in Computer Science, pages 180–195. Springer Berlin Heidelberg, 2010.
  • Saito et al. [2008] K. Saito, R. Nakano, and M. Kimura. Prediction of Information Diffusion Probabilities for Independent Cascade Model. In I. Lovrek, R. J. Howlett, and L. C. Jain, editors, Knowledge-Based Intelligent Information and Engineering Systems, volume 5179 of Lecture Notes in Computer Science, pages 67–75. Springer Berlin Heidelberg, 2008.
  • Schmidt et al. [2007a] M. Schmidt, G. Fung, and R. Rosales. Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches. In Proceedings of the 18th European Conference on Machine Learning, ECML ’07, pages 286–297, Berlin, Heidelberg, 2007a. Springer-Verlag.
  • Schmidt and Murphy [2009] M. Schmidt and K. Murphy. Modeling Discrete Interventional Data using Directed Cyclic Graphical Models. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), pages 487–495, Corvallis, Oregon, 2009. AUAI Press.
  • Schmidt et al. [2007b] M. Schmidt, A. Niculescu-Mizil, and K. Murphy. Learning Graphical Model Structure Using 1{\ell}_{1}-regularization Paths. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, pages 1278–1283, 2007b.
  • Shoham [2008] Y. Shoham. Computer Science and Game Theory. Communications of the ACM, 51(8):74–79, 2008.
  • Shoham and Leyton-Brown [2009] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
  • Sontag [1998] E. Sontag. VC Dimension of Neural Networks. Neural Networks and Machine Learning, pages 69–95, 1998.
  • Srebro [2001] N. Srebro. Maximum Likelihood Bounded Tree-Width Markov Networks. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 504–511, San Francisco, CA, 2001. Morgan Kaufmann.
  • Stanford [1995] W. Stanford. A Note on the Probability of kk Pure Nash Equilibria in Matrix Games. Games and Economic Behavior, 9(2):238–246, 1995.
  • Sureka and Wurman [2005] A. Sureka and P.R. Wurman. Using Tabu Best-response Search to Find Pure Strategy Nash Equilibria in Normal Form Games. In Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’05, pages 1023–1029, New York, NY, USA, 2005. ACM.
  • Vapnik [1998] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
  • Vickrey and Koller [2002] D. Vickrey and D. Koller. Multi-agent Algorithms for Solving Graphical Games. In 18th National Conference on Artificial Intelligence, pages 345–351, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence.
  • Vorobeychik et al. [2007] Y. Vorobeychik, M. Wellman, and S. Singh. Learning payoff functions in infinite games. Machine Learning, 67(1-2):145–168, 2007.
  • Wainwright et al. [2007] M.J. Wainwright, J.D. Lafferty, and P.K. Ravikumar. High-Dimensional Graphical Model Selection Using 1{\ell}_{1}-Regularized Logistic Regression. In B. Schölkopf, J.C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1465–1472. MIT Press, 2007.
  • Waugh et al. [2011] K. Waugh, B. Ziebart, and D. Bagnell. Computational Rationalization: The Inverse Equilibrium Problem. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 1169–1176, New York, NY, USA, 2011. ACM.
  • Winder [1960] R. Winder. Single state threshold logic. Switching Circuit Theory and Logical Design, S-134:321–332, 1960.
  • Wright and Leyton-Brown [2010] J. Wright and K. Leyton-Brown. Beyond Equilibrium: Predicting Human Behavior in Normal-Form Games. In AAAI Conference on Artificial Intelligence, 2010.
  • Wright and Leyton-Brown [2012] J.R. Wright and K. Leyton-Brown. Behavioral Game Theoretic Models: A Bayesian Framework for Parameter Analysis. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS ’12, pages 921–930, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.
  • Yamija and Ibaraki [1965] S. Yamija and T. Ibaraki. A lower bound of the number of threshold functions. IEEE Transactions on Electronic Computers, 14:926–929, 1965.
  • Zhu et al. [2004] J. Zhu, S. Rosset, R. Tibshirani, and T.J. Hastie. 1-norm Support Vector Machines. In S. Thrun, L.K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 49–56. MIT Press, 2004.
  • Ziebart et al. [2010] B. Ziebart, J. A. Bagnell, and A. K. Dey. Modeling Interaction via the Principle of Maximum Causal Entropy. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1255–1262, Haifa, Israel, 2010. Omnipress.