
Learning with Linear Function Approximations in Mean-Field Control

E. Bayraktar is partially supported by the National Science Foundation under grant DMS-2106556 and by the Susan M. Smith chair.

Erhan Bayraktar and Ali Devran Kara

E. Bayraktar is with the Department of Mathematics, University of Michigan, Ann Arbor, MI, USA. Email: erhan@umich.edu
A. D. Kara is with the Department of Mathematics, Florida State University, FL, USA. Email: akara@fsu.edu
Abstract

The paper focuses on mean-field type multi-agent control problems with finite state and action spaces where the dynamics and cost structures are symmetric and homogeneous, and are affected by the distribution of the agents. A standard solution method for these problems is to consider the infinite population limit as an approximation and use symmetric solutions of the limit problem to achieve near optimality. The control policies, and in particular the dynamics, depend on the population distribution in the finite population setting, or the marginal distribution of the state variable of a representative agent for the infinite population setting. Hence, learning and planning for these control problems generally require estimating the reaction of the system to all possible state distributions of the agents. To overcome this issue, we consider linear function approximation for the control problem and provide coordinated and independent learning methods. We rigorously establish error upper bounds for the performance of learned solutions. The performance gap stems from (i) the mismatch due to estimating the true model with a linear one, and (ii) using the infinite population solution in the finite population problem as an approximate control. The provided upper bounds quantify the impact of these error sources on the overall performance.

1 Introduction

The goal of the paper is to present various learning methods for mean-field control problems under linear function approximations and to provide provable error bounds for the learned solutions.

1.1 Literature Review

Learning for multi-agent control problems is a practically relevant and challenging area that has attracted growing interest in recent years. A general solution methodology for multi-agent control problems is difficult to obtain, and the solution is, in general, intractable except for special information structures between the agents. We refer the reader to the survey paper by [43] for a substantive summary of learning methods in the context of multi-agent decision making problems.

In this paper, we study a particular case of multi-agent problems in which both the agents and their interactions are symmetric and homogeneous. For these mean-field type decision making problems, the agents are coupled only through the so-called mean-field term. These problems can be broadly divided into two categories: mean-field game problems, where the agents are competitive and each optimizes their own objective function, and mean-field control problems, where the agents optimize a common objective function. We cite the papers [21, 12, 11, 42, 26, 2, 18, 19, 24, 35, 40, 38, 39] and the references therein for the mean-field game setting. We do not discuss these in detail as our focus will be on mean-field control problems, which are significantly different in both the analysis and the nature of the problems of interest.

For mean-field control problems, where the agents are cooperative and work together to minimize (or maximize) a common cost (or reward) function, see [8, 17, 30, 14, 36, 13, 20, 7, 10] and references therein for the study of dynamic programming principle and learning methods in continuous time. In particular, we point out the papers [28, 17] which provide the justification for studying the centralized limit problem by rigorously connecting the large population decentralized setting and the infinite population limit problem.

For papers studying mean-field control in discrete time, we refer the reader to [33, 5, 22, 23, 32, 15]. [33, 32] study existence of solutions to the control problem in both infinite and finite population settings, and they rigorously establish the connection between the finite and infinite population problems. [5] studies the finite population mean-field control problems and their infinite population limit, and provide solutions of the ergodic control problems for some special cases.

In the context of learning, [22, 23] study dynamic programming principle and Q learning methods directly for the infinite population control problem. The value functions and the Q functions are defined for the lifted problem, where the state process is the measure-valued mean-field flow. They consider dynamics without common noise, and thus the learning problem from the perspective of a coordinator becomes a deterministic one.

[15] also considers the limit (infinite population) problem, studies different classes of policies that achieve optimal performance for it, and focuses on Q learning methods after establishing the optimality of randomized feedback policies for the agents. The learning problem takes the measure valued mean-field term as the state and defines a learning problem over the set of probability measures, where various approximations are considered to deal with the high-dimensionality issues.

[4, 3] have studied learning methods for the mean-field game and control problems from a joint lens. However, for the control setup, they consider a different control objective compared to the previously cited papers. In particular, they aim to optimize the asymptotic phase of the control problem, where the agents are assumed to reach their stationary distributions under joint symmetric policies. Furthermore, the agents only use their local state variables, and thus the objective is to find a stationary measure for the agents where the cost is minimized under this stationary regime. Since the agents only use their local state variables (and not the mean-field term) for their control, the authors can define a Q function over the finite state and action spaces of the agents.

[34] consider a problem closely related to our setting, where they propose model-based learning methods for mean-field control. Similar to [22, 23], they directly work with the infinite population dynamics without analyzing the approximation consistency between the finite-population dynamics and their infinite-population counterpart. Furthermore, they restrict the dynamics to models with additive noise, and the optimality search is within deterministic and Lipschitz continuous controls.

We also note that there are various studies that focus on the application of the mean-field modeling using numerical methods based on machine learning techniques, see e.g. the works by [37, 1, 29].

In this paper, we will consider the learning problem using an alternative formulation where the state is represented as the measure valued mean-field term. To approximate this uncountable space, and the cost and transition functions, different from the previous works in the mean-field control setting, we will consider linear function approximation methods. These methods have been well studied for single-agent discrete-time stochastic control problems. We cite the papers by [31, 16, 41, 27] in which reinforcement learning techniques are used to study Markov decision problems with continuous state spaces using linear function approximations.

Contributions.

  • In Section 2, we present the learning methods using linear function approximation. We focus on various scenarios.

    • We first consider the ideal case where we assume that the team has infinitely many agents. For this case, we study: (i) learning by a coordinator who has access to information about every agent in the team and estimates a model from a data set by fitting a linear model that minimizes the L_{2} distance between the training data and the estimated linear model; (ii) independent learning, where each agent estimates their own linear model using their local information via an iterative algorithm run on a single sequence of data.

    • In Section 2.3, we consider the practical case, where the team has finitely many agents, and they aim to estimate a linear model from a single sequence of data, using their local information variables.

  • The methods we study in Section 2 minimize the L_{2} distance between the learned linear model and the actual model under a probability measure that depends on the training data. However, to find upper bounds for the performance loss of the policies designed for the learned linear estimates in any scenario, we need uniform estimation errors rather than L_{2} estimation errors. In Section 3, we generalize the L_{2} error bounds to uniform error bounds.

  • The proposed learning methods do not match the true model perfectly in general, due to linear approximation mismatch. Therefore, finally, in Section 4, we provide upper bounds on the performance of the policies that are designed for the learned models when they are applied on the true control problem. We note that the flow of the mean-field term is deterministic for infinitely many agents, and thus can be estimated using the dynamics without observing the mean-field term. Therefore, for the execution of the policies we focus on two methods, (i) open loop control, where the agents only observe their local states and estimate the mean-field term with the learned dynamics, (ii) closed loop control where the agents observe both their local information and the mean-field term. For each of these execution procedures, we provide upper bounds for the performance loss. As in Section 2, we first consider the ideal case where it is assumed that the system has infinitely many agents. In this case, the error bound depends on the uniform model mismatch between the learned model and the true model. We then consider the case with finitely many agents. We assume that each agent follows the policy that they calculate considering the limit (infinite population) model. In this case, the error upper bounds depend on both the uniform model mismatch, and an empirical concentration bound since we estimate the finitely many agent model with the infinite population limit problem.

1.2 Problem formulation.

The dynamics for the model are presented as follows: suppose N agents (decision-makers or controllers) act in a cooperative way to minimize a cost function, and the agents share a common state space and a common action space denoted by \mathds{X} and \mathds{U}. We assume that \mathds{X} and \mathds{U} are finite. We refer the reader to the paper by [6] for finite approximations of mean-field control problems where the state and action spaces of the agents are continuous. For any time step t and agent i\in\{1,\dots,N\}, we have

xt+1i=f(xti,uti,μ𝐱𝐭,wti)\displaystyle x^{i}_{t+1}=f(x_{t}^{i},u_{t}^{i},\mu_{\bf x_{t}},w_{t}^{i}) (1)

for a measurable function ff, where {wti}\{w_{t}^{i}\} denotes the i.i.d. idiosyncratic noise process.

Furthermore, μ𝐱𝒫N(𝕏)\mu_{\bf x}\in\mathcal{P}_{N}(\mathds{X}) denotes the empirical distribution of the agents on the state space 𝕏\mathds{X} such that for a given joint state of the team of agents 𝐱:=(x1,,xN)𝕏N{\bf x}:=(x^{1},\dots,x^{N})\in\mathds{X}^{N}

μ𝐱():=1Ni=1Nδxi()\displaystyle\mu_{\bf x}(\cdot):=\frac{1}{N}\sum_{i=1}^{N}\delta_{x^{i}}(\cdot)

where δxi\delta_{x^{i}} represents the Dirac measure centered at xix^{i}. Throughout this paper, we use the notation

𝕏N:=𝕏××𝕏 N times\displaystyle\mathds{X}^{N}:=\underbrace{\mathds{X}\times\dots\times\mathds{X}}_{\text{ $N$ times}}

to denote the space of all joint state variables of the team equipped with the product topology on 𝕏N\mathds{X}^{N}. We further define 𝒫N(𝕏)\mathcal{P}_{N}(\mathds{X}), the set of all empirical measures on 𝕏\mathds{X} constructed using sequences of NN states in 𝕏\mathds{X}, such that

𝒫N(𝕏):={μ𝐱:𝐱=(x1,,xN)𝕏N}.\displaystyle\mathcal{P}_{N}(\mathds{X}):=\{\mu_{\bf x}:{\bf x}=(x^{1},\dots,x^{N})\in\mathds{X}^{N}\}.

Note that 𝒫N(𝕏)𝒫(𝕏)\mathcal{P}_{N}(\mathds{X})\subset{\mathcal{P}}(\mathds{X}) where 𝒫(𝕏){\mathcal{P}}(\mathds{X}) denotes the set of all probability measures on 𝕏\mathds{X} equipped with the weak convergence topology.
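
To fix ideas, the following minimal Python sketch (our own illustration; the states are labeled 0,\dots,|\mathds{X}|-1) computes the empirical measure \mu_{\bf x} of a joint state vector.

    import numpy as np

    def empirical_measure(x, n_states):
        # mu_x: empirical distribution of the joint state vector x = (x^1, ..., x^N)
        # x        : array of length N with entries in {0, ..., n_states - 1}
        # n_states : |X|, the size of the finite state space
        counts = np.bincount(np.asarray(x), minlength=n_states)
        return counts / len(x)

    # Example: N = 4 agents on X = {0, 1}
    mu = empirical_measure([0, 1, 1, 0], n_states=2)  # array([0.5, 0.5])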

Equivalently, the next state of the agent ii is determined by some stochastic kernel, that is, a regular conditional probability distribution:

𝒯(|xti,uti,μ𝐱𝐭).\displaystyle\mathcal{T}(\cdot|x_{t}^{i},u_{t}^{i},\mu_{\bf x_{t}}). (2)

At each time stage tt, each agent receives a cost determined by a measurable stage-wise cost function c:𝕏×𝕌×𝒫N(𝕏)c:\mathds{X}\times\mathds{U}\times\mathcal{P}_{N}(\mathds{X})\to\mathds{R}. If the state, action, and empirical distribution of the agents are given by xti,uti,μ𝐱𝐭x_{t}^{i},u_{t}^{i},\mu_{\bf x_{t}}, then the agent receives the cost.

c(xti,uti,μ𝐱𝐭).\displaystyle c(x_{t}^{i},u_{t}^{i},\mu_{\bf x_{t}}).

For the remainder of the paper, by an abuse of notation, we will sometimes denote the dynamics in terms of the vector state and action variables, 𝐱=(x1,,xN){\bf x}=(x^{1},\dots,x^{N}), and 𝐮=(u1,,uN){\bf u}=(u^{1},\dots,u^{N}), and vector noise variables 𝐰=(w1,,wN){\bf w}=(w^{1},\dots,w^{N}) such that

𝐱𝐭+𝟏=f(𝐱𝐭,𝐮𝐭,𝐰𝐭).\displaystyle{\bf x_{t+1}}=f({\bf x_{t},u_{t},w_{t}}).

For the initial formulation, every agent is assumed to know the state and action variables of every other agent. We define an admissible policy for an agent ii, as a sequence of functions γi:={γti}t\gamma^{i}:=\{\gamma^{i}_{t}\}_{t}, where γti\gamma^{i}_{t} is a 𝕌\mathds{U}-valued (possibly randomized) function which is measurable with respect to the σ\sigma-algebra generated by

It={𝐱0,,𝐱t,𝐮0,,𝐮t1}.\displaystyle I_{t}=\{{\bf x}_{0},\dots,{\bf x}_{t},{\bf u}_{0},\dots,{\bf u}_{t-1}\}. (3)

Accordingly, an admissible team policy, is defined as γ:={γ1,,γN}\gamma:=\{\gamma^{1},\dots,\gamma^{N}\}, where γi\gamma^{i} is an admissible policy for the agent ii. In other words, agents share the complete information.

The objective of the agents is to minimize the following cost function

JβN(𝐱0,γ)=t=0βtEγ[𝐜(𝐱t,𝐮t)]\displaystyle J^{N}_{\beta}({\bf x}_{0},\gamma)=\sum_{t=0}^{\infty}\beta^{t}E_{\gamma}\left[{\bf c}({\bf x}_{t},{\bf u}_{t})\right]

where E_{\gamma} denotes the expectation with respect to the probability measure induced by the team policy \gamma, and where

𝐜(𝐱t,𝐮t):=1Ni=1Nc(xti,uti,μ𝐱t).\displaystyle{\bf c}({\bf x}_{t},{\bf u}_{t}):=\frac{1}{N}\sum_{i=1}^{N}c(x^{i}_{t},u^{i}_{t},\mu_{{\bf x}_{t}}).

The optimal cost is defined by

JβN,(𝐱0):=infγΓJβN(𝐱0,γ)\displaystyle J_{\beta}^{N,*}({\bf x}_{0}):=\inf_{\gamma\in\Gamma}J_{\beta}^{N}({\bf x}_{0},\gamma) (4)

where Γ\Gamma denotes the set of all admissible team policies.

We note that this information structure (3) will be our benchmark for evaluating the performance of the approximate solutions using simpler information structures presented in the paper. In other words, the value function that is achieved when the agents share full information and full history will be taken to be our reference point for simpler information structures.

For example, one immediate observation is that the problem under full information sharing can be reformulated as a centralized control problem where the state and action spaces are 𝕏N\mathds{X}^{N} and 𝕌N\mathds{U}^{N}. Therefore, one can consider Markov policies such that It={𝐱t}I_{t}=\{{\bf x}_{t}\} without loss of optimality.

However, if the problem is modeled as an MDP with state space 𝕏N\mathds{X}^{N} and action space 𝕌N\mathds{U}^{N}, we face some computational challenges:

  • (i)

    the curse of dimensionality when NN is large, since 𝕏N\mathds{X}^{N} and 𝕌N\mathds{U}^{N} might be too large even when 𝕏,𝕌\mathds{X,U} are of manageable size,

  • (ii)

    the curse of coordination: even if the optimal team policy is found, its execution at the agent level requires coordination among the agents. In particular, the agents may need to follow asymmetric policies to achieve optimality, even though we assume full symmetry for the dynamics and the cost models. The following simple example from [9] illustrates this point.

Example 1.1.

Consider a team control problem with two agents, i.e. N=2N=2. We assume that 𝕏=𝕌={0,1}\mathds{X}=\mathds{U}=\{0,1\}. The stage wise cost function of the agents is defined as

c(x,u,\mu_{\bf x})=\|\mu_{\bf x}-\bar{\mu}\|

where

μ¯=12δ0+12δ1.\displaystyle\bar{\mu}=\frac{1}{2}\delta_{0}+\frac{1}{2}\delta_{1}.

In words, the state distribution should be distributed equally over the state space {0,1}\{0,1\} for minimal stage-wise cost. For the dynamics we assume a deterministic model such that

xt+1=ut.\displaystyle x_{t+1}=u_{t}.

In words, the action of an agent purely determines the next state of the same agent. The goal of the agents is to minimize

t=0βtEg1,g2[c(xt1,ut1,μ𝐱𝐭)+c(xt2,ut2,μ𝐱𝐭)2]\displaystyle\sum_{t=0}^{\infty}\beta^{t}E^{g^{1},g^{2}}\left[\frac{c(x_{t}^{1},u_{t}^{1},\mu_{\bf x_{t}})+c(x_{t}^{2},u_{t}^{2},\mu_{\bf x_{t}})}{2}\right]

for some initial state values 𝐱0=[x01,x02]{\bf{x}}_{0}=[x_{0}^{1},x_{0}^{2}], by choosing policies g1,g2g^{1},g^{2}. The expectation is over the possible randomization of the policies. We assume full information sharing such that every agent has access to the state and action information of the other agent.

We let the initial states be x01=x02=0x_{0}^{1}=x_{0}^{2}=0. An optimal policy for the agents for the problem is given by

g1(0,0)=0,g2(0,0)=1\displaystyle g^{1}(0,0)=0,\qquad g^{2}(0,0)=1
g1(0,1)=0,g2(0,1)=1\displaystyle g^{1}(0,1)=0,\qquad g^{2}(0,1)=1
g1(1,0)=1,g2(1,0)=0\displaystyle g^{1}(1,0)=1,\qquad g^{2}(1,0)=0
g1(1,1)=1,g2(1,1)=0\displaystyle g^{1}(1,1)=1,\qquad g^{2}(1,1)=0

which always spreads the agents equally over the state space. Observe that, when the agents are positioned at either (0,0) or (1,1), they have to use personalized policies to decide which one of them is placed at 0 and which at 1.

For any symmetric policy g^{1}(x^{1},x^{2})=g^{2}(x^{1},x^{2})=g(x^{1},x^{2}), including randomized ones, there will always be events of strictly positive probability on which the agents are positioned at the same state, and thus the performance will be strictly worse than the optimal performance.

A standard approach to deal with mean-field control problems when N is large is to consider the infinite population problem, i.e. taking the limit N\to\infty. A propagation of chaos argument can be used to show that in the limit the agents become asymptotically independent. Hence, the problem can be formulated from the perspective of a representative single agent. This approach is suitable for dealing with the coordination challenges, as the correlation between the agents vanishes in the limit, and thus symmetric policies can achieve optimal performance for the infinite population control problem. In particular, for Example 1.1 in the infinite population setting, the optimal policy is to follow a randomized policy such that Pr(u=1)=Pr(u=0)=\frac{1}{2}. We will introduce the limit problem in Section 1.5 and make the connections between the limit problem and the finite population problem rigorous.

1.3 Preliminaries.

Recall that we assume that the state and action spaces of agents 𝕏,𝕌\mathds{X,U} are finite (see [6] for finite approximations of continuous space mean-field control problems).

Note. Even though we assume that \mathds{X} and \mathds{U} are finite, we will continue using integral signs instead of summation signs for expectation computations, for notational consistency, by simply considering Dirac delta measures.

We metrize 𝕏\mathds{X} and 𝕌\mathds{U} so that d(x,x)=1d(x,x^{\prime})=1 if xxx\neq x^{\prime} and d(x,x)=0d(x,x^{\prime})=0 otherwise. Note that with this metric, for any μ,ν𝒫(𝕏)\mu,\nu\in{\mathcal{P}}(\mathds{X}) and for any coupling QQ of μ,ν\mu,\nu, we have that

EQ[|XY|]=PQ(XY)\displaystyle E_{Q}\left[|X-Y|\right]=P_{Q}(X\neq Y)

which in particular implies via the optimal coupling that

W1(μ,ν)=μνTV\displaystyle W_{1}(\mu,\nu)=\|\mu-\nu\|_{TV}

where W1W_{1} denotes the first order Wasserstein distance, or the Kantorovich–Rubinstein metric, and TV\|\cdot\|_{TV} denotes the total variation norm for signed measures.

Note further that for measures defined on finite spaces, we have that

μνTV=12μν1=12x|μ(x)ν(x)|.\displaystyle\|\mu-\nu\|_{TV}=\frac{1}{2}\|\mu-\nu\|_{1}=\frac{1}{2}\sum_{x}\left|\mu(x)-\nu(x)\right|. (5)

Hence, in what follows we will simply write μν\|\mu-\nu\| to refer to the distance between μ\mu and ν\nu, which may correspond to the total variation distance, the first order Wasserstein metric, or the normalized L1L_{1} distance.

We also define the following Dobrushin coefficient for the kernel 𝒯\mathcal{T}:

supμ,γ,x,x^𝕌𝒯(|x,u,μ)γ(du|x)𝕌𝒯(|x^,u,μ)γ(du|x^)=:δT\displaystyle\sup_{\mu,\gamma,x,\hat{x}}\|\int_{\mathds{U}}\mathcal{T}(\cdot|x,u,\mu)\gamma(du|x)-\int_{\mathds{U}}\mathcal{T}(\cdot|\hat{x},u,\mu)\gamma(du|\hat{x})\|=:\delta_{T} (6)

Realize that we always have \delta_{T}\leq 1. In certain cases, we also have strict inequality; e.g. if there exist some x^{*}\in\mathds{X} and some \alpha>0 such that

\mathcal{T}(x^{*}|x,u,\mu)\geq\alpha,\quad\forall x,u,\mu

then one can show that \delta_{T}\leq 1-\alpha<1.

1.4 Measure Valued Formulation of the Finite Population Control Problem

For the remaining part of the paper, we will often consider an alternative formulation of the control problem for the finitely many agent case where the controlled process is the state distribution of the agents, rather than the state vector of the agents. We refer the reader to [6] for the full construction; in this section, we will give an overview.

We define an MDP for the distribution of the agents, where the control actions are the joint distribution of the state and action vectors of the agents.

We let the state space be 𝒵=𝒫N(𝕏)\mathcal{Z}={\mathcal{P}}_{N}(\mathds{X}) which is the set of all empirical measures on 𝕏\mathds{X} that can be constructed using the state vectors of NN-agents. In other words, for a given state vector 𝐱={x1,,xN}{\bf x}=\{x^{1},\dots,x^{N}\}, we consider μ𝐱𝒫N(𝕏)\mu_{\bf x}\in{\mathcal{P}}_{N}(\mathds{X}) to be the new state variable of the team control problem.

The admissible set of actions for some state μ𝒵\mu\in\mathcal{Z}, is denoted by U(μ)U(\mu), where

U(μ)={Θ𝒫N(𝕌×𝕏)|Θ(𝕌,)=μ()},\displaystyle U(\mu)=\{\Theta\in{\mathcal{P}}_{N}(\mathds{U}\times\mathds{X})|\Theta(\mathds{U},\cdot)=\mu(\cdot)\}, (7)

that is, the set of actions for a state μ\mu, is the set of all joint empirical measures on 𝕏×𝕌\mathds{X}\times\mathds{U} whose marginal on 𝕏\mathds{X} coincides with μ\mu.

We equip the state space 𝒵\mathcal{Z}, and the action sets U(μ)U(\mu), with the norm TV\|\cdot\|_{TV} (see (5)) .

One can show that [6, 9] the empirical distributions of the states of agents μt\mu_{t}, and of the joint state and actions Θt\Theta_{t} define a controlled Markov chain such that

Pr(μt+1B|μt,μ0,Θt,,Θ0)=Pr(μt+1B|μt,Θt)\displaystyle Pr(\mu_{t+1}\in B|\mu_{t},\dots\mu_{0},\Theta_{t},\dots,\Theta_{0})=Pr(\mu_{t+1}\in B|\mu_{t},\Theta_{t})
:=η(B|μt,Θt)\displaystyle:=\eta(B|\mu_{t},\Theta_{t}) (8)

where η(|μ,Θ)𝒫(𝒫N(𝕏))\eta(\cdot|\mu,\Theta)\in{\mathcal{P}}({\mathcal{P}}_{N}(\mathds{X})) is the transition kernel of the centralized measure valued MDP, which is induced by the dynamics of the team problem.

We define the stage-wise cost function k(μ,Θ)k(\mu,\Theta) by

k(μ,Θ):=c(x,u,μ)Θ(du,dx)=1Ni=1Nc(xi,ui,μ).\displaystyle k(\mu,\Theta):=\int c(x,u,\mu)\Theta(du,dx)=\frac{1}{N}\sum_{i=1}^{N}c(x^{i},u^{i},\mu). (9)

Thus, we have an MDP with state space 𝒵\mathcal{Z}, action space μ𝒵U(μ)\cup_{\mu\in\mathcal{Z}}U(\mu), transition kernel η\eta and the stage-wise cost function kk.
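
To make the kernel \eta in (8) concrete, the following sketch (an illustration of ours, with hypothetical function names; it is not part of the construction in [6, 9]) samples one transition of the measure valued chain for a population of N agents: given the joint empirical measure \Theta on \mathds{X}\times\mathds{U} and a transition kernel \mathcal{T}, each agent's next state is drawn independently and the new empirical measure \mu_{t+1} is formed.

    import numpy as np

    def sample_next_mu(Theta, N, T, rng):
        # One step of the measure valued chain (8) for a finite population.
        # Theta : (|X|, |U|) array of joint empirical probabilities (multiples of 1/N)
        # N     : number of agents
        # T     : function T(x, u, mu) returning a length-|X| probability vector
        # rng   : numpy random Generator
        n_x, n_u = Theta.shape
        counts = np.rint(Theta * N).astype(int)   # number of agents at each (x, u) pair
        mu = Theta.sum(axis=1)                    # marginal of Theta on X, i.e. mu_t
        next_counts = np.zeros(n_x, dtype=int)
        for x in range(n_x):
            for u in range(n_u):
                for _ in range(counts[x, u]):
                    x_next = rng.choice(n_x, p=T(x, u, mu))
                    next_counts[x_next] += 1
        return next_counts / N                    # mu_{t+1}, an element of P_N(X)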

We define an admissible policy for this measure valued MDP as a sequence of functions g=\{g_{0},g_{1},g_{2},\dots\} such that at every time t, g_{t} is measurable with respect to the \sigma-algebra generated by the information variables

I^t={μ0,,μt,Θ0,,Θt1}.\displaystyle\hat{I}_{t}=\{\mu_{0},\dots,\mu_{t},\Theta_{0},\dots,\Theta_{t-1}\}.

We denote the set of all admissible control policies by GG for the measure valued MDP.

In particular, we define the infinite horizon discounted expected cost function under a policy gg by

KβN(μ0,g)=Eμ0η[t=0βtk(μt,Θt)].\displaystyle K^{N}_{\beta}(\mu_{0},g)=E_{\mu_{0}}^{\eta}\left[\sum_{t=0}^{\infty}\beta^{t}k(\mu_{t},\Theta_{t})\right].

We also define the optimal cost by

KβN,(μ0)=infgGKβN(μ0,g).\displaystyle K_{\beta}^{N,*}(\mu_{0})=\inf_{g\in G}K^{N}_{\beta}(\mu_{0},g). (10)

The following result shows that this formulation is without loss of optimality:

Theorem 1 ([6]).

Under Assumption 1, for any 𝐱0{\bf x}_{0} that satisfies μ𝐱0=μ0\mu_{{\bf x}_{0}}=\mu_{0}, that is for any 𝐱0{\bf x}_{0} with distribution μ0\mu_{0}, we have that

  • i.)
    KβN,(μ0)=JβN,(𝐱0).\displaystyle K_{\beta}^{N,*}(\mu_{0})=J_{\beta}^{N,*}({\bf x}_{0}).
  • ii.)

    There exists a stationary and Markov optimal policy gg^{*} for the measure valued MDP, and using gg^{*}, every agent can construct a policy γi:𝕏×𝒫N(𝕏)𝕌\gamma^{i}:\mathds{X}\times{\mathcal{P}}_{N}(\mathds{X})\to\mathds{U} such that for γ:={γ1,γ2,,γN}\gamma:=\{\gamma^{1},\gamma^{2},\dots,\gamma^{N}\}, we have that

    JβN(𝐱𝟎,γ)=JβN,(𝐱𝟎).\displaystyle J^{N}_{\beta}({\bf x_{0}},\gamma)=J_{\beta}^{N,*}({\bf x_{0}}).

    That is, the policy obtained from the measure valued formulation attains the optimal performance for the original team control problem.

1.5 Mean-field Limit Problem

We now introduce the control problem for infinite population teams, i.e. for NN\to\infty. For some agent ii\in\mathds{N}, we define the dynamics as

xt+1i=f(xti,uti,μti,wti)\displaystyle x^{i}_{t+1}=f(x_{t}^{i},u_{t}^{i},\mu_{t}^{i},w_{t}^{i})

where x0μ0x_{0}\sim\mu_{0} and μti=(Xti)\mu_{t}^{i}=\mathcal{L}(X_{t}^{i}) is the law of the state at time tt. The agent tries to minimize the following cost function:

Jβ(μ0,γ)=t=0βtE[c(Xti,Uti,μti)]\displaystyle J_{\beta}^{\infty}(\mu_{0},\gamma)=\sum_{t=0}^{\infty}\beta^{t}E\left[c(X_{t}^{i},U_{t}^{i},\mu_{t}^{i})\right]

where γ={γt}t\gamma=\{\gamma_{t}\}_{t} is an admissible policy such that γt\gamma_{t} is measurable with respect to the information variables

Iti={x0i,,xti,u0i,,ut1i,μ0i,,μti}.\displaystyle I_{t}^{i}=\left\{x_{0}^{i},\dots,x_{t}^{i},u_{0}^{i},\dots,u_{t-1}^{i},\mu_{0}^{i},\dots,\mu_{t}^{i}\right\}.

Note that the agents are no longer correlated and they are indistinguishable. Hence, in what follows we will drop the dependence on ii when we refer to the infinite population problem.

The problem is now a single agent control problem; however, the local state variable is not Markovian on its own. Nevertheless, we can reformulate the problem as an MDP by viewing the measure valued flow \mu_{t} as the state variable.

We let the state space to be 𝒫(𝕏){\mathcal{P}}(\mathds{X}). Different from the measure valued construction we have introduced in Section 1.4, we let the action space to be Γ=𝒫(𝕌)|𝕏|\Gamma={\mathcal{P}}(\mathds{U})^{|\mathds{X}|}. In particular, an action γ(|x)Γ\gamma(\cdot|x)\in\Gamma for the team is a randomized policy at the agent level. We equip Γ\Gamma with the product topology, where we use the weak convergence for each coordinate. We note that each action γ(du|x)\gamma(du|x) and state μ(dx)\mu(dx) induce a distribution on 𝕏×𝕌\mathds{X}\times\mathds{U}, which we denote by Θ(du,dx)=γ(du|x)μ(dx)\Theta(du,dx)=\gamma(du|x)\mu(dx).

Recall the notation in (2); at time tt, we can use the following stochastic kernel for the dynamics:

xt+1𝒯(|xt,ut,μt)\displaystyle x_{t+1}\sim\mathcal{T}(\cdot|x_{t},u_{t},\mu_{t})

which is induced by the idiosyncratic noise wtiw^{i}_{t}. Hence, we can define

μt+1=F(μt,γt):=𝒯(|x,u,μt)γt(du|x)μt(dx).\displaystyle\mu_{t+1}=F(\mu_{t},\gamma_{t}):=\int\mathcal{T}(\cdot|x,u,\mu_{t})\gamma_{t}(du|x)\mu_{t}(dx). (11)

Note that the dynamics are deterministic for the infinite population measure valued problem. Furthermore, we can define the stage-wise cost function as

k(μ,γ):=c(x,u,μ)γ(du|x)μ(dx).\displaystyle k(\mu,\gamma):=\int c(x,u,\mu)\gamma(du|x)\mu(dx). (12)

Hence, the problem is a deterministic MDP for the measure valued state process μt\mu_{t}. A policy, say g:𝒫(𝕏)Γg:{\mathcal{P}}(\mathds{X})\to\Gamma for the measure-valued MDP, can be written as

g(μ)=γ(du|x)\displaystyle g(\mu)=\gamma(du|x)

for each \mu\in{\mathcal{P}}(\mathds{X}). That is, an agent observes \mu and chooses their actions according to the agent-level randomized policy \gamma(du|x).

We reintroduce the infinite horizon discounted cost of the agents for the measure valued formulation:

Kβ(μ0,g)=t=0βtk(μt,γt)\displaystyle K_{\beta}(\mu_{0},g)=\sum_{t=0}^{\infty}\beta^{t}k(\mu_{t},\gamma_{t})

for some initial mean-field term \mu_{0} and under some policy g. Furthermore, the optimal value is denoted by

Kβ(μ0)=infgKβ(μ0,g).\displaystyle K_{\beta}^{*}(\mu_{0})=\inf_{g}K_{\beta}(\mu_{0},g).

At a given time t, the pair (x_{t},\mu_{t}) can be used as sufficient information for decision making by a given agent. Furthermore, if the model is fully known by the agents, then the mean-field flow \mu_{t} can be perfectly estimated whenever every agent agrees to follow the same policy g(\mu), since the dynamics of \mu_{t} are deterministic.
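
Since the flow (11) is deterministic, K_{\beta}(\mu_{0},g) can be evaluated by a simple forward rollout. The sketch below is a minimal illustration of ours, assuming placeholder callables T(x,u,mu), c(x,u,mu) and a policy g(mu) returning an |\mathds{X}|\times|\mathds{U}| row-stochastic array; it implements F(\mu,\gamma) from (11), k(\mu,\gamma) from (12), and a truncated evaluation of the discounted cost.

    import numpy as np

    def mean_field_step(mu, gamma, T):
        # F(mu, gamma) in (11): mu_{t+1}(y) = sum_{x,u} T(y|x,u,mu) gamma(u|x) mu(x)
        n_x, n_u = gamma.shape
        mu_next = np.zeros(n_x)
        for x in range(n_x):
            for u in range(n_u):
                mu_next += mu[x] * gamma[x, u] * T(x, u, mu)
        return mu_next

    def stage_cost(mu, gamma, c):
        # k(mu, gamma) in (12): sum_{x,u} c(x,u,mu) gamma(u|x) mu(x)
        n_x, n_u = gamma.shape
        return sum(mu[x] * gamma[x, u] * c(x, u, mu)
                   for x in range(n_x) for u in range(n_u))

    def discounted_cost(mu0, policy, T, c, beta, horizon=200):
        # truncated evaluation of K_beta(mu0, g) for a policy g: mu -> gamma(u|x)
        mu, total = np.array(mu0, dtype=float), 0.0
        for t in range(horizon):
            gamma = policy(mu)
            total += beta ** t * stage_cost(mu, gamma, c)
            mu = mean_field_step(mu, gamma, T)
        return total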

We note that for the infinite population control problem, the coordination requirement between the agents may be relaxed, though cannot be fully abandoned in general (see Section 1.6). In particular, if the agents agree on a common policy g(μ)=γ(du|x,μ)g(\mu)=\gamma(du|x,\mu), then for the execution of this policy, no coordination or communication is needed since every agent can estimate the mean-field term μt\mu_{t} independently and perfectly. Furthermore, every agent can use the same agent-level policy γ(du|x,μ)\gamma(du|x,\mu) symmetrically, without any coordination with the other agents.

The following result makes the connection between the finite population and the infinite population control problem rigorous [32, 5, 9].

Assumption 1.
  • i.

    For the transition kernel 𝒯(|x,u,μ)\mathcal{T}(\cdot|x,u,\mu) (see (2))

    𝒯(|x,u,μ)𝒯(|x,u,μ)Kfμμ\displaystyle\|\mathcal{T}(\cdot|x,u,\mu)-\mathcal{T}(\cdot|x,u,\mu^{\prime})\|\leq K_{f}\|\mu-\mu^{\prime}\|

    for some Kf<K_{f}<\infty, for each x,ux,u and for every μ,μ𝒫(𝕏)\mu,\mu^{\prime}\in{\mathcal{P}}(\mathds{X}).

  • ii

    cc is Lipschitz in μ\mu such that

    |c(x,u,μ)c(x,u,μ)|Kcμμ\displaystyle|c(x,u,\mu)-c(x,u,\mu^{\prime})|\leq K_{c}\|\mu-\mu^{\prime}\|

    for some Kc<K_{c}<\infty.

Theorem 2.

Under Assumption 1, the following holds:

  • i.

    For any μ0Nμ0\mu_{0}^{N}\to\mu_{0},

    limNKβN,(μ0N)=Kβ,(μ0).\displaystyle\lim_{N\to\infty}K_{\beta}^{N,*}(\mu_{0}^{N})=K_{\beta}^{\infty,*}(\mu_{0}).

    That is, the optimal value function of the finite population control problem converges to that of the infinite population control problem as NN\to\infty.

  • ii.

    Suppose each agent solves the infinite population control problem given in (11) and (12), and constructs their policies, say

    g(μ)=γ(du|x,μ).g_{\infty}(\mu)=\gamma_{\infty}(du|x,\mu).

    If they follow the infinite population solution in the finite population control problem, for any μ0Nμ0\mu_{0}^{N}\to\mu_{0} we then have

    limNKβN(μ0N,g)=Kβ,(μ0).\displaystyle\lim_{N\to\infty}K_{\beta}^{N}(\mu_{0}^{N},g_{\infty})=K_{\beta}^{\infty,*}(\mu_{0}).

    That is, the symmetric policy constructed using the infinite population problem is near optimal for finite but sufficiently large populations.

Remark 1.

The result has significant implications for the computational challenges we have mentioned earlier. Firstly, the second part of the result states that if the number of agents is large enough, then the symmetric policy obtained from the limit problem is near optimal. Hence, the agents can use symmetric policies without coordination, solving their control problems as long as they have access to the mean-field term and their local state. Secondly, note that the flow of the mean-field term μt\mu_{t} (11) is deterministic if there is no common noise affecting the dynamics. Thus, agents can estimate the marginal distribution of their local state variables xtix_{t}^{i}, without observing the mean-field term if they know the dynamics. In particular, without the common noise, the local state of the agents and the initial mean-field term μ0\mu_{0} are sufficient information for near optimality.

However, as we will see in what follows, to achieve near optimal performance, the agents must agree on a particular policy g(\mu)=\gamma(du|x,\mu)\mu(dx). In particular, if the optimal infinite population policy is not unique and the agents apply different optimal policies without coordination, the conclusions of the previous results might fail. Hence, coordination cannot be fully ignored.

1.6 Limitations of Full Decentralization

We have argued in the previous section that the team control problem can be solved near optimally by using the infinite population control solution. Furthermore, if the agents agree on the application of a common optimal policy, the resulting team policy can be executed independently in a decentralized way and achieves near-optimal performance.

The following example shows that if the agents do not coordinate on which policy to follow, i.e. if they are fully decentralized, then the resulting team policy will not achieve the desired outcome.

Example 1.2.

Consider a team control problem with infinite population where 𝕏=𝕌={0,1}\mathds{X}=\mathds{U}=\{0,1\}. The stage wise cost function of the agents is defined as

c(x,u,μ)={μμ¯1 if μ(0)34μμ¯2 otherwise\displaystyle c(x,u,\mu)=\begin{cases}\|\mu-\bar{\mu}_{1}\|&\text{ if }\mu(0)\leq\frac{3}{4}\\ \|\mu-\bar{\mu}_{2}\|&\text{ otherwise}\end{cases}

where

μ¯1=12δ0+12δ1\displaystyle\bar{\mu}_{1}=\frac{1}{2}\delta_{0}+\frac{1}{2}\delta_{1}
μ¯2=δ0.\displaystyle\bar{\mu}_{2}=\delta_{0}.

In words, for minimal stage-wise cost, the state distribution should either be distributed equally over the state space \{0,1\} or it should fully concentrate at state 0. One can check that the cost function satisfies Assumption 1 for some K_{c}<\infty (e.g. K_{c}=1). For the dynamics we assume a deterministic model such that

xt+1=ut.\displaystyle x_{t+1}=u_{t}.

In words, the action of an agent purely determines the next state of the same agent. The goal of the agents is to minimize

Kβ(μ0,g)=lim supNt=0βtE[1Ni=1Nc(xti,uti,μ𝐱𝐭)]\displaystyle K_{\beta}(\mu_{0},g)=\limsup_{N\to\infty}\sum_{t=0}^{\infty}\beta^{t}E[\frac{1}{N}\sum_{i=1}^{N}c(x_{t}^{i},u_{t}^{i},\mu_{\bf x_{t}})]

where the initial distribution is given by μ0=12δ0+12δ1\mu_{0}=\frac{1}{2}\delta_{0}+\frac{1}{2}\delta_{1}.

It is easy to see that there are two possible optimal policies for the agents g1(μ)=γ1(du|x,μ)μ(dx)g_{1}(\mu)=\gamma_{1}(du|x,\mu)\mu(dx) and g2(μ)=γ2(du|x,μ)μ(dx)g_{2}(\mu)=\gamma_{2}(du|x,\mu)\mu(dx) where

γ1(|x)=12δ0()+12δ1()\displaystyle\gamma_{1}(\cdot|x)=\frac{1}{2}\delta_{0}(\cdot)+\frac{1}{2}\delta_{1}(\cdot)
γ2(|x)=δ0().\displaystyle\gamma_{2}(\cdot|x)=\delta_{0}(\cdot).

If all the agents coordinate and apply either g1g_{1} or g2g_{2} all together, the realized costs will be 0, i.e.

Kβ(μ0,g1)=Kβ(μ0,g2)=0.\displaystyle K_{\beta}(\mu_{0},g_{1})=K_{\beta}(\mu_{0},g_{2})=0.

However, if the agents do not coordinate and pick their policies from g1,g2g_{1},g_{2} randomly, the cost incurred will be strictly greater than 0. For example, assume that any given agent decides to use g1g_{1} with probability 0.50.5 and the policy g2g_{2} with probability 0.50.5. Then the resulting policy, say g^\hat{g} will be such that

g^(μ)=(12γ1(du|x)+12γ2(du|x))μ(dx)\displaystyle\hat{g}(\mu)=\left(\frac{1}{2}\gamma_{1}(du|x)+\frac{1}{2}\gamma_{2}(du|x)\right)\mu(dx)

Thus, at every time step t1t\geq 1, 14\frac{1}{4} of the agents will be in state 11 and 34\frac{3}{4} of the agents will be in state 0, hence the total accumulated cost of the resulting policy g^\hat{g} will be

K_{\beta}(\mu_{0},\hat{g})=\sum_{t=1}^{\infty}\beta^{t}\frac{1}{4}=\frac{\beta}{4(1-\beta)}>0.

Thus, we see that if the optimal policy for the mean-field control problem is not unique, the agents cannot follow fully decentralized policies, and they need to coordinate at some level. For this problem, if they agree initially on which policy to follow, then no other communication is needed afterwards for the execution of the decided policy. Nonetheless, an initial agreement and coordination is needed to achieve the optimal performance.
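
As a numerical sanity check of the uncoordinated scenario above (our own illustration), the following snippet rolls the mean-field flow forward under \hat{g}: a 1/4 fraction of the agents occupies state 1 at every t\geq 1, and the accumulated discounted cost matches \beta/(4(1-\beta)).

    import numpy as np

    beta = 0.9
    mu_bar_1 = np.array([0.5, 0.5])
    mu_bar_2 = np.array([1.0, 0.0])

    def cost(mu):
        # stage-wise cost of Example 1.2, using the total variation distance (5)
        target = mu_bar_1 if mu[0] <= 0.75 else mu_bar_2
        return 0.5 * np.abs(mu - target).sum()

    # Under g_hat, each agent plays u = 1 with probability 1/4 and u = 0 otherwise,
    # so x_{t+1} = u_t gives the mean-field term (3/4, 1/4) for every t >= 1.
    mu_t, total = np.array([0.5, 0.5]), 0.0
    for t in range(2000):
        total += beta ** t * cost(mu_t)
        mu_t = np.array([0.75, 0.25])

    print(total, beta / (4 * (1 - beta)))   # both approximately 2.25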

We note that the issue with the previous example results from the fact that the optimal policy is not unique. If the optimal policy can be guaranteed to be unique, then the agents can act fully independently.

2 Learning for Mean-field Control with Linear Approximations

We have seen in the previous sections that in general there are limitations to full decentralization, and that a certain level of coordination is required for optimal or near optimal performance during control. In this section, we will study the learning problem in which neither the agents nor the coordinator know the dynamics, and the goal is to learn the model or optimal decision strategies from data.

We have observed that the limit problem introduced in Section 1.5 can be seen as a deterministic centralized control problem. In particular, if the model is known, and once it is coordinated which control strategy to follow, the agents do not need further communication or coordination to execute the optimal control. Each agent can simply apply an open-loop policy using only their local state information, and the mean-field term can be estimated perfectly, if every agent is following the same policy. However, to estimate the deterministic mean-field flow μt\mu_{t}, the model must be known. For problems where the model is not fully known, the open-loop policies will not be applicable.

Our goal in this section is to present various learning algorithms to learn the dynamics and the cost model of the control problem. We will first focus on the idealized scenario, where we assume that there are infinitely many agents in the team. For this case, we provide two methods: (i) a coordinated method, where a coordinator has access to all information of every agent and decides on the exploration policy, and (ii) an independent method, where each agent learns the model on their own by tracking their local state and the mean-field term. In the latter case, however, the agents need to coordinate on the exploration policy through a common randomness variable to induce stochastic dynamics for better exploration. Next, we study the realistic setting where the team has a large but finite number of agents. For this case, we only consider an independent learning method where the agents learn the model on their own using their local information variables.

Before we present our learning algorithms, we note that the space 𝒫(𝕏){\mathcal{P}}(\mathds{X}) is uncountable even under the assumption that 𝕏\mathds{X} is finite. Therefore, we will focus on finite representations of the cost function c(x,u,μ)c(x,u,\mu) and the kernel 𝒯(|x,u,μ)\mathcal{T}(\cdot|x,u,\mu). In particular, we will try to learn the functions of the following form

c(x,u,μ)=𝚽(x,u)(μ)θ(x,u)\displaystyle c(x,u,\mu)={\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\theta}_{(x,u)}
𝒯(|x,u,μ)=𝚽(x,u)(μ)𝐐(x,u)()\displaystyle\mathcal{T}(\cdot|x,u,\mu)={\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf Q}_{(x,u)}(\cdot) (13)

where 𝚽(x,u)(μ)=[Φ(x,u)1(μ),,Φ(x,u)d(μ)]{\bf\Phi}_{(x,u)}(\mu)=[\Phi^{1}_{(x,u)}(\mu),\dots,\Phi^{d}_{(x,u)}(\mu)]^{\intercal}, for a set of linearly independent functions Φ(x,u)j(μ):𝒫(𝕏)\Phi^{j}_{(x,u)}(\mu):{\mathcal{P}}(\mathds{X})\to\mathds{R} for each pair (x,u)(x,u), for some d<d<\infty. We assume that the basis functions 𝚽(x,u)(μ){\bf\Phi}_{(x,u)}(\mu) are known and the goal is to learn the parameters θ(x,u)\theta_{(x,u)} and 𝐐(x,u)(){\bf Q}_{(x,u)}(\cdot). We assume θ(x,u)d\theta_{(x,u)}\in\mathds{R}^{d}, and 𝐐(x,u)()=[Q(x,u)1(),,Q(x,u)d()]{\bf Q}_{(x,u)}(\cdot)=[Q^{1}_{(x,u)}(\cdot),\dots,Q_{(x,u)}^{d}(\cdot)] is a vector of unknown signed measures on 𝕏\mathds{X}.

In what follows, we will assume that the basis functions, Φ(x,u)j()\Phi^{j}_{(x,u)}(\cdot) are uniformly bounded. Note that this is without loss of generality.

Assumption 2.

We assume that

Φ(x,u)j()1\displaystyle\|\Phi^{j}_{(x,u)}(\cdot)\|_{\infty}\leq 1

for every (x,u)(x,u) pair, and for all j{1,,d}j\in\{1,\dots,d\}.
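
As a purely illustrative choice (not prescribed by the paper), one may take the features to be a constant together with the point masses of \mu, so that d=|\mathds{X}|+1 and Assumption 2 holds; the sketch below records this choice.

    import numpy as np

    def phi(x, u, mu):
        # Illustrative feature vector Phi_{(x,u)}(mu) of dimension d = |X| + 1.
        # The features here ignore (x, u): a constant plus the probabilities mu(x').
        # Every entry lies in [0, 1], so Assumption 2 is satisfied.
        return np.concatenate(([1.0], np.asarray(mu, dtype=float)))

With this particular choice, the class (13) consists of cost functions and kernels that are affine in the mean-field term; richer bounded features (e.g. products of coordinates of \mu) can be used in the same way.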

For the rest of the paper, we will use θ(x,u)\theta_{(x,u)} and θ(x,u)\theta(x,u) interchangeably; similarly we will use 𝐐(x,u){\bf Q}_{(x,u)} and 𝐐(x,u){\bf Q}(x,u) interchangeably.

Remark 2.

We note that we do not assume that the model and the cost function have the linear form given in (13). However, we will aim to learn and estimate models within the class of linear functions presented in (13). We will later analyze error bounds for the case where the actual model is not linear, so that the learned model does not perfectly match the true model, and study the performance loss when we apply policies that are learned for the linear model.

2.1 Coordinated Learning with Linear Function Approximation for Infinitely Many Players

In this section, we will consider an idealized scenario, where there are infinitely many agents, and a coordinator learns the model by linear function approximation.

Data collection. For this section, we assume that there exists a training set TT that consists of a time sequence of length MM. The training set is assumed to be coming from an arbitrary sequence of data. The data at each time stage contains

xi,ui,X1i,c(xi,ui,μ),μ\displaystyle x^{i},u^{i},X_{1}^{i},c(x^{i},u^{i},\mu),\mu

for all the agents present in the team, i\in\{1,\dots,N,\dots\}, where the agents' states are distributed according to \mu at the given time step. That is, every data point includes the current state and action, the one-step ahead state, the stage-wise cost realization, and the mean-field term for every agent. Furthermore, we assume the ideal scenario where there are infinitely many agents. Hence, at every time step, the coordinator has access to infinitely many data points while the spaces \mathds{X},\mathds{U} are finite. In particular, the coordinator observes infinitely many sample transitions under each triple (x,u,\mu) with \mu(x)>0 and \gamma(u|x)>0, and thus the kernel \mathcal{T}(\cdot|x,u,\mu) can be perfectly estimated for such triples via empirical measures. Here, \gamma represents the exploration policy of the agents. We assume the following:

Assumption 3.

For any x𝕏x\in\mathds{X}, the exploration policy for every agent puts positive probability on to every control action such that

γ(u|x)>0, for all (x,u)𝕏×𝕌.\displaystyle\gamma(u|x)>0,\text{ for all }(x,u)\in\mathds{X\times U}.

We define the following sets for which the model and the cost functions can be learned perfectly within the training data: let x𝕏x\in\mathds{X}, we define

Px:={μT:μ(x)>0}.\displaystyle P_{x}:=\{\mu\in T:\mu(x)>0\}. (14)

P_{x}\subset{\mathcal{P}}(\mathds{X}) denotes the set of mean-field terms in the training data that assign positive measure to the particular state x\in\mathds{X}. In particular, for a given (x,u) pair, under Assumption 3 the kernel \mathcal{T}(\cdot|x,u,\mu) and the cost c(x,u,\mu) can be learned perfectly for every \mu\in P_{x}.

For a given x𝕏x\in\mathds{X}, we denote by MxM_{x} the number of mean-field terms within the set PxP_{x} (see (14)). For every (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U} pair, the coordinator aims to find θ(x,u){\bf\theta}_{(x,u)} and 𝐐(x,u){\bf Q}_{(x,u)} such that

1Mxj=1Mx|c(x,u,μj)𝚽(𝐱,𝐮)(μj)θ(𝐱,𝐮)|2\displaystyle\frac{1}{M_{x}}\sum_{j=1}^{M_{x}}\left|c(x,u,\mu_{j})-{\bf\Phi^{\intercal}_{(x,u)}}(\mu_{j}){\bf\theta_{(x,u)}}\right|^{2}
1Mxj=1Mx𝒯(|x,u,μj)𝚽(𝐱,𝐮)(μj)𝐐(𝐱,𝐮)()\displaystyle\frac{1}{M_{x}}\sum_{j=1}^{M_{x}}\left\|\mathcal{T}(\cdot|x,u,\mu_{j})-{\bf\Phi^{\intercal}_{(x,u)}}(\mu_{j}){\bf Q_{(x,u)}(\cdot)}\right\|

are minimized.

The least squares linear models can be estimated in closed form for θ(x,u){\bf\theta}_{(x,u)} and 𝐐(x,u)(){\bf Q}_{(x,u)}(\cdot) using the training data. We define the following vector and matrices to present the closed form solution in a more compact form: for each (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U} we introduce 𝐛(x,u)Mx{\bf b}_{(x,u)}\in\mathds{R}^{M_{x}}, and 𝐝(x,u)Mx×|𝕏|{\bf d}_{(x,u)}\in\mathds{R}^{M_{x}\times|\mathds{X}|},

𝐛(x,u)=[c(x,u,μ1)c(x,u,μ2)c(x,u,μMx)],𝐝(x,u)=[𝒯(x1|x,u,μ1)𝒯(x2|x,u,μ1)𝒯(x|𝕏||x,u,μ1)𝒯(x1|x,u,μ2)𝒯(x2|x,u,μ2)𝒯(x|𝕏||x,u,μ2)𝒯(x1|x,u,μMx)𝒯(x2|x,u,μMx)𝒯(x|𝕏||x,u,μMx)].\displaystyle{\bf b}_{(x,u)}=\begin{bmatrix}c(x,u,\mu_{1})\\ c(x,u,\mu_{2})\\ \vdots\\ c(x,u,\mu_{M_{x}})\end{bmatrix},\quad{\bf d}_{(x,u)}=\begin{bmatrix}\mathcal{T}(x^{1}|x,u,\mu_{1})&\mathcal{T}(x^{2}|x,u,\mu_{1})&\dots&\mathcal{T}(x^{|\mathds{X}|}|x,u,\mu_{1})\\ \mathcal{T}(x^{1}|x,u,\mu_{2})&\mathcal{T}(x^{2}|x,u,\mu_{2})&\dots&\mathcal{T}(x^{|\mathds{X}|}|x,u,\mu_{2})\\ \vdots\\ \mathcal{T}(x^{1}|x,u,\mu_{M_{x}})&\mathcal{T}(x^{2}|x,u,\mu_{M_{x}})&\dots&\mathcal{T}(x^{|\mathds{X}|}|x,u,\mu_{M_{x}})\end{bmatrix}. (15)

Furthermore, we also define 𝐀(x,u)d×Mx{\bf A}_{(x,u)}\in\mathds{R}^{d\times M_{x}}

𝐀(x,u)=[𝚽(x,u)(μ1),,𝚽(x,u)(μMx)].\displaystyle{\bf A}_{(x,u)}=\left[{\bf\Phi}_{(x,u)}(\mu_{1}),\dots,{\bf\Phi}_{(x,u)}(\mu_{M_{x}})\right]. (16)

Assuming that {\bf A}_{(x,u)} has full row rank, i.e. the feature vectors {\bf\Phi}_{(x,u)}(\mu_{1}),\dots,{\bf\Phi}_{(x,u)}(\mu_{M_{x}}) span \mathds{R}^{d} (which requires M_{x}\geq d), the least squares estimates for \theta_{(x,u)} and {\bf Q}_{(x,u)} can be written as follows

\theta_{(x,u)}=\left({\bf A}_{(x,u)}{\bf A}^{\intercal}_{(x,u)}\right)^{-1}{\bf A}_{(x,u)}{\bf b}_{(x,u)}
{\bf Q}_{(x,u)}=\left({\bf A}_{(x,u)}{\bf A}^{\intercal}_{(x,u)}\right)^{-1}{\bf A}_{(x,u)}{\bf d}_{(x,u)}. (17)

Note that above, each row of 𝐐(x,u){\bf Q}_{(x,u)} represents a signed measure on 𝕏\mathds{X}.
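
The closed-form estimates (17) are ordinary least-squares fits performed separately for each (x,u) pair. The following sketch (an illustration of ours; the arguments are hypothetical data structures) stacks the feature vectors into the M_{x}\times d design matrix {\bf A}^{\intercal}_{(x,u)} and solves the two regressions with numpy.

    import numpy as np

    def fit_linear_model(mus, costs, kernels, phi):
        # Least-squares estimates of theta_{(x,u)} and Q_{(x,u)} as in (17).
        # mus     : list of M_x mean-field terms observed for this (x, u) pair
        # costs   : length-M_x array with c(x, u, mu_j)            (vector b)
        # kernels : (M_x, |X|) array with rows T(.|x, u, mu_j)     (matrix d)
        # phi     : feature map mu -> Phi_{(x,u)}(mu), a vector of length d
        A_T = np.stack([phi(mu) for mu in mus])       # design matrix, shape (M_x, d)
        theta, *_ = np.linalg.lstsq(A_T, np.asarray(costs), rcond=None)
        Q, *_ = np.linalg.lstsq(A_T, np.asarray(kernels), rcond=None)
        return theta, Q     # Q has shape (d, |X|); each row is a signed measure on X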

2.2 Independent Learning with Linear Function Approximation for Infinitely Many Players

In this section, we will introduce a learning method where the agents perform independent learning to some extent. Here, rather than using a training set, we will focus on an online learning algorithm where at every time step, agent ii observes xi,ui,X1i,c(xi,ui,μ),μx^{i},u^{i},X_{1}^{i},c(x^{i},u^{i},\mu),\mu. That is, each agent has access to their local state, action, cost realizations, one-step ahead state, and the mean-field term. However, they do not have access to local information about the other agents.

We first argue that full decentralization is usually not possible in the context of learning either. Recall that the mean-field flow is deterministic if every agent follows the same independently randomized agent-level policy. Furthermore, the flow of the mean-field terms remains deterministic even when the agents choose different exploration policies if the randomization is independent. To see this, assume that each agent picks some policy γw(du|x)\gamma_{w}(du|x) randomly by choosing w𝕎w\in\mathds{W} from some arbitrary distribution, where the mapping wγw(du|x)w\to\gamma_{w}(du|x) is predetermined. If the agents pick w𝕎w\in\mathds{W} independently, the mean-field dynamics is given by

μt+1()=𝒯(|x,u,μt)γw(du|x)Pw(dw)μt(dx)\displaystyle\mu_{t+1}(\cdot)=\int\mathcal{T}(\cdot|x,u,\mu_{t})\gamma_{w}(du|x)P_{w}(dw)\mu_{t}(dx)

where Pw(dw)P_{w}(dw) is the distribution by which the agents perform their independent randomization for the policy selection. Hence, the mean-field term dynamics follow a deterministic flow. Note that for the above example, for simplicity, we assume that the agents pick according to the same distribution. In general, even if the agents follow different distributions for w𝕎w\in\mathds{W}, the dynamics of the mean-field flow would remain deterministic according to a mixture distribution.

Deterministic behavior might cause poor exploration performance. There might be cases where the mean-field flow gets stuck at a fixed distribution without learning or exploring the ‘important’ parts of the space {\mathcal{P}}(\mathds{X}) sufficiently. To overcome this issue, and to make sure that the system is stirred sufficiently well during the exploration, one option is to introduce a common randomness for the selection of the exploration policies. In particular, each agent follows a randomized policy \gamma_{w}(du|x) where the common randomness w\in\mathds{W} is shared by all agents. Then the dynamics of the mean-field flow can be written as

μt+1()=F(μt,w):=𝒯(|x,u,μt)γw(du|x)μt(dx).\displaystyle\mu_{t+1}(\cdot)=F(\mu_{t},w):=\int\mathcal{T}(\cdot|x,u,\mu_{t})\gamma_{w}(du|x)\mu_{t}(dx). (18)

The common noise variable ensures a level of coordination among the agents. However, this is still a significant relaxation compared to full coordination where the agents share their full state or control data. In this section, we show that agents can construct independent learning iterates that converge by coordinating through an arbitrary common source of randomness.

We assume the following for the mean-field flow during the exploration:

Assumption 4.

Consider the Markov chain \{\mu_{t}\}_{t}\subset{\mathcal{P}}(\mathds{X}) whose dynamics are given by (18). We assume that \mu_{t} is geometrically ergodic with a unique invariant probability measure P(\cdot)\in{\mathcal{P}}({\mathcal{P}}(\mathds{X})) such that

Pr(μt)P()TVKρt\displaystyle\|Pr(\mu_{t}\in\cdot)-P(\cdot)\|_{TV}\leq K\rho^{t}

for some K<K<\infty and some ρ<1\rho<1.

Remark 3.

We can establish some sufficient conditions on the transition kernel of the system to test the ergodicity. We note that ([25, p 56, 3.1.1]) a sufficient condition is the following: there exists a mean-field state, say μ𝒫(𝕏)\mu^{*}\in{\mathcal{P}}(\mathds{X}) such that

Pr(w:𝒯(|x,u,μ)γw(du|x)μ(dx)=μ())>0, for all μ𝒫(𝕏).\displaystyle Pr\left(w:\int\mathcal{T}(\cdot|x,u,\mu)\gamma_{w}(du|x)\mu(dx)=\mu^{*}(\cdot)\right)>0,\text{ for all }\mu\in{\mathcal{P}}(\mathds{X}). (19)

That is, we need to be able to find a set of common noise realizations whose induced randomized exploration policies can take the state distribution to μ\mu^{*} independent of the starting distribution μ\mu.

The assumption stated in this form indicates that the condition is of the stochastic reachability or controllability type. It requires that from any initial distribution \mu, there exists a control policy that can steer the distribution of the system to some target measure \mu^{*}. We also note that the above can be generalized to a k-step transition requirement. Analyzing this stochastic controllability behavior for mean-field systems is beyond the scope of the current paper; however, we give some examples in what follows.

We look further into (19). We fix some x1𝕏x_{1}\in\mathds{X}, and we define the following values:

U(x,μ)=maxu𝒯(x1|x,u,μ)\displaystyle U(x,\mu)=\max_{u}\mathcal{T}(x_{1}|x,u,\mu)
L(x,μ)=minu𝒯(x1|x,u,μ).\displaystyle L(x,\mu)=\min_{u}\mathcal{T}(x_{1}|x,u,\mu).

Note that \mu^{*}(x_{1}) must be an average of the probabilities \mathcal{T}^{\gamma}(x_{1}|x,\mu):=\int\mathcal{T}(x_{1}|x,u,\mu)\gamma_{w}(du|x) over x under the measure \mu. Furthermore, by selecting an appropriate randomized policy, we can control these values within the interval

I(x,μ):=[L(x,μ),U(x,μ)].I(x,\mu):=\left[L(x,\mu),U(x,\mu)\right].

Thus, if the intersection of these intervals is nonempty, i.e. if

x,μI(x,μ)\displaystyle\bigcap_{x,\mu}I(x,\mu)\neq\emptyset (20)

then one can set \mu^{*}(x_{1}) to be a value in this intersection, independently of \mu. By doing this for all x_{1}, we can construct a \mu^{*} that is reachable from any \mu\in{\mathcal{P}}(\mathds{X}). As a result, if (20) holds for every x_{1}, then (19), and thus Assumption 4, can be shown to hold.
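
Condition (20) can be tested numerically by discretizing {\mathcal{P}}(\mathds{X}): the intersection \bigcap_{x,\mu}I(x,\mu) is nonempty exactly when \max_{x,\mu}L(x,\mu)\leq\min_{x,\mu}U(x,\mu). The sketch below is a heuristic grid-based check of ours (a coarse grid replaces the supremum over \mu, so it is only indicative), performed for each target coordinate x_{1}.

    import numpy as np
    from itertools import product

    def check_condition_20(T, n_x, n_u, grid_pts=11):
        # Heuristic check of (20): for each x1, is the intersection over (x, mu) of
        # I(x, mu) = [min_u T(x1|x,u,mu), max_u T(x1|x,u,mu)] nonempty on a grid?
        grid = [np.array(w) / (grid_pts - 1)
                for w in product(range(grid_pts), repeat=n_x)
                if sum(w) == grid_pts - 1]            # coarse grid over P(X)
        for x1 in range(n_x):
            lower = max(min(T(x, u, mu)[x1] for u in range(n_u))
                        for x in range(n_x) for mu in grid)
            upper = min(max(T(x, u, mu)[x1] for u in range(n_u))
                        for x in range(n_x) for mu in grid)
            if lower > upper:
                return False
        return True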

A somewhat restrictive example for (20) is the following: assume that there exists a control action uu^{*} which can reset the state to some xx^{*} from any state xx and any mean-field term μ\mu, that is

𝒯(x|x,u,μ)=1 for all x,μ.\displaystyle\mathcal{T}(x^{*}|x,u^{*},\mu)=1\text{ for all }x,\mu.

This means that 1\in I(x,\mu) for all x,\mu when x_{1}=x^{*}, and 0\in I(x,\mu) for all x,\mu when x_{1}\neq x^{*}. Hence, \mu^{*}(\cdot)=\delta_{x^{*}}(\cdot) can be reached from any starting point by applying the policy \gamma(x)=u^{*} for all x. If this policy is among the set of exploration policies, the ergodicity assumption for the mean-field flow is satisfied.

We note again that a general result would require a more in-depth analysis; however, the above gives some idea of the implications of this assumption for the controllability of the mean-field model.

We now define the trained measures, Px𝒫(𝒫(𝕏))P_{x}\in{\mathcal{P}}({\mathcal{P}}(\mathds{X})) for each (x,u)(x,u) pair based on the invariant measure for the mean-field flow. Under Assumption 3, that is assuming Pr(u|x)=γw(u|x)Pw(dw)>0Pr(u|x)=\int\gamma_{w}(u|x)P_{w}(dw)>0 for all (x,u)(x,u) pairs, we can write

P(μA|x,u)\displaystyle P(\mu\in A|x,u) =Pr(x,u,μA)Pr(x,u)\displaystyle=\frac{Pr(x,u,\mu\in A)}{Pr(x,u)}
=μAwγw(u|x)Pw(dw)μ(x)P(dμ)μ𝒫(𝕏)wγw(u|x)Pw(dw)μ(x)P(dμ)\displaystyle=\frac{\int_{\mu\in A}\int_{w}\gamma_{w}(u|x)P_{w}(dw)\mu(x)P(d\mu)}{\int_{\mu\in{\mathcal{P}}(\mathds{X})}\int_{w}\gamma_{w}(u|x)P_{w}(dw)\mu(x)P(d\mu)}
=μAμ(x)P(dμ)μ𝒫(𝕏)μ(x)P(dμ)=:Px(μA)\displaystyle=\frac{\int_{\mu\in A}\mu(x)P(d\mu)}{\int_{\mu\in{\mathcal{P}}(\mathds{X})}\mu(x)P(d\mu)}=:P_{x}(\mu\in A) (21)

Note that the trained measures are independent of the control action u, as the exploration policies are independent of the mean-field terms given the state x. These measures play a role similar to the sets defined in (14). In particular, they indicate for which mean-field terms one can estimate the kernel \mathcal{T}(\cdot|x,u,\mu) and the cost function c(x,u,\mu) via the training process.

We now summarize the algorithm used for each agent; a code sketch of the updates is given after the list below. We drop the dependence on the agent identity i and present the steps for a generic agent. At every time step t, every agent performs the following steps:

  • Observe the common randomness ww given by the coordinator, and pick an action such that utγw(|xt)u_{t}\sim\gamma_{w}(\cdot|x_{t})

  • Collect xt,ut,xt+1,μt,cx_{t},u_{t},x_{t+1},\mu_{t},c where c=c(xt,ut,μt)c=c(x_{t},u_{t},\mu_{t})

  • For all (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U}

    θt+1(x,u)=θt(x,u)+αt(x,u)𝚽(x,u,μt)[c𝚽(x,u,μt)θt(x,u)]\displaystyle{\bf\theta}_{t+1}(x,u)={\bf\theta}_{t}(x,u)+\alpha_{t}(x,u){\bf\Phi}(x,u,\mu_{t})\left[c-{\bf\Phi}^{\intercal}(x,u,\mu_{t}){\bf\theta}_{t}(x,u)\right] (22)
  • Note that the signed measure vector {\bf Q}_{t}(\cdot|x,u) consists of d signed measures defined on \mathds{X}; we denote by {\bf Q}^{j}_{t}(x,u) the vector of values {\bf Q}_{t}(x^{j}|x,u) for x^{j}\in\mathds{X}, where j\in\{1,\dots,|\mathds{X}|\}. For all (x,u) and j\in\{1,\dots,|\mathds{X}|\},

    𝐐t+1j(x,u)=𝐐tj(x,u)+αt(x,u)𝚽(x,u,μt)[𝟙{xt+1=xj}𝚽(x,u,μt)𝐐tj(x,u)].\displaystyle{\bf Q}_{t+1}^{j}(x,u)={\bf Q}_{t}^{j}(x,u)+\alpha_{t}(x,u){\bf\Phi}(x,u,\mu_{t})\left[\mathds{1}_{\{x_{t+1}=x^{j}\}}-{\bf\Phi}^{\intercal}(x,u,\mu_{t}){\bf Q}_{t}^{j}(x,u)\right]. (23)
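
The updates (22)-(23) can be implemented locally by each agent with a few lines of code. The sketch below is a minimal illustration of ours (the containers, the feature map phi, and the learning-rate schedule are assumptions); it performs one update for the currently visited pair (x,u).

    import numpy as np

    def agent_update(theta, Q, x, u, x_next, mu, c, phi, alpha):
        # One step of the iterations (22)-(23) for the visited pair (x, u).
        # theta : dict mapping (x, u) -> parameter vector in R^d
        # Q     : dict mapping (x, u) -> (d, |X|) array (d signed measures on X)
        # phi   : feature map (x, u, mu) -> Phi_{(x,u)}(mu) in R^d
        # alpha : learning rate for this visit; it is zero for all other pairs
        f = phi(x, u, mu)
        # cost parameter update (22)
        theta[(x, u)] = theta[(x, u)] + alpha * f * (c - f @ theta[(x, u)])
        # kernel parameter update (23): one regression per next-state coordinate j
        e = np.zeros(Q[(x, u)].shape[1])
        e[x_next] = 1.0                               # indicator 1{x_{t+1} = x^j}
        Q[(x, u)] = Q[(x, u)] + alpha * np.outer(f, e - f @ Q[(x, u)])
        return theta, Q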

We next show that the above algorithm converges if the learning rates are chosen properly. To show the convergence, we first present a convergence result for stochastic gradient descent algorithms with quadratic cost, where the noise process is Markovian and ergodic. We note that similar results have been established in the literature for stochastic gradient iterations under Markovian noise processes under various assumptions; however, verifying most of these assumptions, such as the boundedness of the gradient, the boundedness of the iterates, or uniformly bounded variance, is not straightforward. Hence, we provide a proof in the appendix for completeness.

Proposition 1.

Let {St}𝕊\{S_{t}\}\subset\mathds{S} denote a Markov chain with the invariant probability measure π()\pi(\cdot) where 𝕊\mathds{S} is a standard Borel space. We assume that {St}\{S_{t}\} has geometric ergodicity such that Pr(St)π()TVKρt\|Pr(S_{t}\in\cdot)-\pi(\cdot)\|_{TV}\leq K\rho^{t} for some K<K<\infty and some ρ<1\rho<1. Let g(s,v)g(s,v) be such that

g(s,v)=(k(s)vh(s))2\displaystyle g(s,v)=\left(k(s)^{\intercal}v-h(s)\right)^{2}

for some k:𝕊dk:\mathds{S}\to\mathds{R}^{d}, h:𝕊h:\mathds{S}\to\mathds{R} and for vdv\in\mathds{R}^{d}. We assume that k,hk,h are uniformly bounded. We denote by

Gt(v)=E[g(St,v)]\displaystyle G_{t}(v)=E[g(S_{t},v)]
G(v)=g(s,v)π(ds).\displaystyle G(v)=\int g(s,v)\pi(ds).

Consider the iterations

vt+1=vtαtg(St,vt)\displaystyle v_{t+1}=v_{t}-\alpha_{t}\nabla g(S_{t},v_{t})

where the gradient is with respect to vtv_{t}. If the learning rates are such that tαt=\sum_{t}\alpha_{t}=\infty and tαt2<\sum_{t}\alpha_{t}^{2}<\infty with probability one, we then have that Gt(vt)minvG(v)=G(v)G_{t}(v_{t})\to\min_{v}G(v)=G(v^{*}) almost surely.

Proof.

The proof can be found in Appendix A. ∎
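As a small sanity check of Proposition 1, the following toy script (our own illustration, not part of the paper) runs the iteration on a hypothetical two-state Markov chain with features k(s)=[1,s] and target h(s)=2+3s, for which the stationary least-squares minimizer is v*=[2,3].

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state Markov noise process with transition matrix P (geometrically ergodic).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
v = np.zeros(2)
s = 0
for t in range(1, 200001):
    k = np.array([1.0, float(s)])
    h = 2.0 + 3.0 * s
    alpha = 2.0 / (t + 5)                    # sum alpha_t = inf, sum alpha_t^2 < inf
    v = v - alpha * 2.0 * k * (k @ v - h)    # gradient step for g(s, v) = (k(s)^T v - h(s))^2
    s = rng.choice(2, p=P[s])                # advance the Markov chain

print(v)                                     # approaches v* = [2.0, 3.0]
```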

Corollary 1.

Let Assumption 4 and Assumption 3 hold and let the learning rates be chosen such that αt(x,u)=0\alpha_{t}(x,u)=0 unless (Xt,Ut)=(x,u)(X_{t},U_{t})=(x,u). Furthermore, tαt(x,u)=\sum_{t}\alpha_{t}(x,u)=\infty and tαt2(x,u)<\sum_{t}\alpha_{t}^{2}(x,u)<\infty with probability one for all (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U}. Then, the iterations given in (22) and (23) converge with probability 1. Furthermore, the limit points, say θ(x,u){\bf\theta}^{*}(x,u) and 𝐐(x,u)(){\bf Q}^{*}_{(x,u)}(\cdot) are such that

θ(x,u)=argminθ(x,u)d|c(x,u,μ)𝚽(x,u,μ)θ(x,u)|2Px(dμ)\displaystyle{\bf\theta}^{*}(x,u)=\mathop{\rm arg\,min}_{\theta(x,u)\in\mathds{R}^{d}}\int\left|c(x,u,\mu)-{\bf\Phi}^{\intercal}(x,u,\mu){\bf\theta}(x,u)\right|^{2}P_{x}(d\mu)
𝐐,j(x,u)=argmin𝐐j(x,u)d|𝒯(j|x,u,μ)𝚽x,u(μ)𝐐j(x,u)|2Px(dμ)\displaystyle{\bf Q}^{*,j}{(x,u)}=\mathop{\rm arg\,min}_{{\bf Q}^{j}{(x,u)}\in\mathds{R}^{d}}\int\left|\mathcal{T}(j|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}{(x,u)}\right|^{2}P_{x}(d\mu)

for every (x,u)(x,u) pair and for every jj, where 𝐐,j(x,u){\bf Q}^{*,j}{(x,u)} is the jjth column of 𝐐(x,u)(){\bf Q}^{*}_{(x,u)}(\cdot). Furthermore, Px()P_{x}(\cdot) denotes the trained measure based on the invariant measure of the mean-field flow under the exploration policy with common randomness (see (2.2)).

Proof.

We define the following stopping times

τk+1=min{t>τk:(Xt,Ut)=(x,u)}\displaystyle\tau_{k+1}=\min{\{t>\tau_{k}:(X_{t},U_{t})=(x,u)\}}

such that τk\tau_{k} indicates the kk-th time the (x,u)(x,u) pair is visited.

For the iterations (22), Proposition 1 applies such that for each (x,u)(x,u), vkθτk(x,u)v_{k}\equiv\theta_{\tau_{k}}(x,u), k(μ)𝚽(x,u,μ)k(\mu)\equiv{\bf\Phi}^{\intercal}(x,u,\mu) and h(μ)c(x,u,μ)h(\mu)\equiv c(x,u,\mu), and finally the noise process is skμτks_{k}\equiv\mu_{\tau_{k}}. Note that 𝚽{\bf\Phi} and cc are assumed to be uniformly bounded, which agrees with the assumptions of Proposition 1. Furthermore, with the strong Markov property, μτk\mu_{\tau_{k}} is also a Markov chain which is sampled when the state-action pair is (x,u)(x,u). Thus, the invariant measure for the sampled process is PxP_{x} as defined in (2.2).

For the iterations (23), Proposition 1 applies such that for each (x,u)(x,u) and each jj, vk𝐐τkj(x,u)v_{k}\equiv{\bf Q}^{j}_{\tau_{k}}(x,u), k(μ)𝚽(x,u,μ)k(\mu)\equiv{\bf\Phi}^{\intercal}(x,u,\mu) and h(μ)𝟙{X1=xj}h(\mu)\equiv\mathds{1}_{\{X_{1}=x^{j}\}}. We note that X1=f(x,u,μ,w)X_{1}=f(x,u,\mu,w) (see (1)) where ww is the i.i.d. noise for the dynamics of agents. Thus, the noise process for iterations (23) can be taken to be the joint process (μt,wt)(\mu_{t},w_{t}) where μt\mu_{t} is an ergodic Markov process, and wtw_{t} is an i.i.d. process. In particular, for every (x,u)(x,u) pair and for every xjx^{j}, if we consider the expectation over (μ,w)(\mu,w) where μPx()\mu\sim P_{x}(\cdot), we get

E[𝟙{X1=xj}]=E[𝟙{f(x,u,μ,w)=xj}]=𝒯(xj|x,u,μ)Px(dμ)\displaystyle E\left[\mathds{1}_{\{X_{1}=x^{j}\}}\right]=E\left[\mathds{1}_{\{f(x,u,\mu,w)=x^{j}\}}\right]=\int\mathcal{T}(x^{j}|x,u,\mu)P_{x}(d\mu)

for every xj𝕏x^{j}\in\mathds{X}.

The algorithm in (23) minimizes

|𝟙{f(x,u,μ,w)=xj}𝚽x,u(μ)𝐐j(x,u)|2Px(dμ)Pw(dw)\displaystyle\int\left|\mathds{1}_{\{f(x,u,\mu,w)=x^{j}\}}-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})\right|^{2}P_{x}(d\mu)P_{w}(dw)

for each jj, where PwP_{w} is the distribution of the noise term. We can then expand the above term to write:

argmin𝐐j(x,u)|𝟙{f(x,u,μ,w)=xj}𝚽x,u(μ)𝐐j(x,u)|2Px(dμ)Pw(dw)\displaystyle\mathop{\rm arg\,min}_{{\bf Q}^{j}(x,u)}\int\left|\mathds{1}_{\{f(x,u,\mu,w)=x^{j}\}}-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})\right|^{2}P_{x}(d\mu)P_{w}(dw)
=argmin𝐐j(x,u)(𝟙{f(x,u,μ,w)=xj})22𝟙{f(x,u,μ,w)=xj}𝚽x,u(μ)𝐐j(x,u)\displaystyle=\mathop{\rm arg\,min}_{{\bf Q}^{j}(x,u)}\int(\mathds{1}_{\{f(x,u,\mu,w)=x^{j}\}})^{2}-2\mathds{1}_{\{f(x,u,\mu,w)=x^{j}\}}{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})
+(𝚽x,u(μ)𝐐j(x,u))2Px(dμ)Pw(dw)\displaystyle\qquad\qquad\qquad+({\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u}))^{2}P_{x}(d\mu)P_{w}(dw)
=argmin𝐐j(x,u)2𝟙{f(x,u,μ,w)=xj}𝚽x,u(μ)𝐐j(x,u)+(𝚽x,u(μ)𝐐j(x,u))2Px(dμ)Pw(dw)\displaystyle=\mathop{\rm arg\,min}_{{\bf Q}^{j}(x,u)}\int-2\mathds{1}_{\{f(x,u,\mu,w)=x^{j}\}}{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})+({\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u}))^{2}P_{x}(d\mu)P_{w}(dw)
=argmin𝐐j(x,u)2𝒯(xj|x,u,μ)𝚽x,u(μ)𝐐j(x,u)+(𝚽x,u(μ)𝐐j(x,u))2Px(dμ)\displaystyle=\mathop{\rm arg\,min}_{{\bf Q}^{j}(x,u)}\int-2\mathcal{T}(x^{j}|x,u,\mu){\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})+({\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u}))^{2}P_{x}(d\mu)
=argmin𝐐j(x,u)(𝒯(xj|x,u,μ))22𝒯(xj|x,u,μ)𝚽x,u(μ)𝐐j(x,u)+(𝚽x,u(μ)𝐐j(x,u))2Px(dμ)\displaystyle=\mathop{\rm arg\,min}_{{\bf Q}^{j}(x,u)}\int(\mathcal{T}(x^{j}|x,u,\mu))^{2}-2\mathcal{T}(x^{j}|x,u,\mu){\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})+({\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u}))^{2}P_{x}(d\mu)
=argmin𝐐j(x,u)(𝒯(xj|x,u,μ)𝚽x,u(μ)𝐐j(x,u))2Px(dμ).\displaystyle=\mathop{\rm arg\,min}_{{\bf Q}^{j}(x,u)}\int\left(\mathcal{T}(x^{j}|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})\right)^{2}P_{x}(d\mu).

Hence, the algorithm minimizes

(𝒯(xj|x,u,μ)𝚽x,u(μ)𝐐j(x,u))2Px(dμ)\displaystyle\int\left(\mathcal{T}(x^{j}|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}({x,u})\right)^{2}P_{x}(d\mu)

for each jj. ∎

2.3 Learning for Finitely Many Players

In this section, we will study the more realistic scenario in which the number of agents is large but finite. The learning methods presented in the previous sections have focused on the ideal case where the system has infinitely many players. Although the setting with infinitely many agents helps us fix the ideas for learning in the mean-field control setup, we should note that it is an artificial setting, used only as an approximation for large population control problems. Hence, we need to study the actual setup for which the limit problem is argued to be a good approximation, that is, the problem with very large but finitely many agents.

We will apply the independent learning algorithm presented for the infinite population case, and study the performance of the learned solutions in the setting with finitely many players. In particular, we will assume that the agents follow the iterations given in (22) and (23). We note, however, that the agents will not need to use common randomness during exploration, since the flow of the mean-field term is already stochastic for finite populations without common randomness. The method remains valid under common randomness as well; in fact, common randomness, in general, encourages the exploration of the state space. The method is essentially the same as the one presented in Section 2.2; however, we present it again since there are some subtle differences.

At every time step tt, agent ii performs the following steps:

  • Pick an action such that utiγi(|xti)u^{i}_{t}\sim\gamma^{i}(\cdot|x^{i}_{t})

  • Collect xti,uti,xt+1i,μtN,cx^{i}_{t},u^{i}_{t},x^{i}_{t+1},\mu^{N}_{t},c where c=c(xti,uti,μtN)c=c(x^{i}_{t},u^{i}_{t},\mu^{N}_{t}) and μtN=μ𝐱𝐭\mu_{t}^{N}=\mu_{\bf x_{t}}

  • For all (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U}

    θt+1(x,u)=θt(x,u)+αt(x,u)𝚽x,u(μtN)[c𝚽x,u(μtN)θt(x,u)]\displaystyle{\bf\theta}_{t+1}(x,u)={\bf\theta}_{t}(x,u)+\alpha_{t}(x,u){\bf\Phi}_{x,u}(\mu^{N}_{t})\left[c-{\bf\Phi}^{\intercal}_{x,u}(\mu^{N}_{t}){\bf\theta}_{t}(x,u)\right] (24)
  • Denote by 𝐐tj(x,u){\bf Q}^{j}_{t}(x,u) the vector of values of 𝐐t(xj|x,u){\bf Q}_{t}(x^{j}|x,u) for xj𝕏x^{j}\in\mathds{X} where j{1,|𝕏|}j\in\{1,\dots,|\mathds{X}|\}. For all (x,u)(x,u) and j{1,|𝕏|}j\in\{1,\dots,|\mathds{X}|\}

    𝐐t+1j(x,u)=𝐐tj(x,u)+αt(x,u)𝚽x,u(μtN)[𝟙{xt+1i=xj}𝚽x,u(μtN)𝐐tj(x,u)]\displaystyle{\bf Q}_{t+1}^{j}(x,u)={\bf Q}_{t}^{j}(x,u)+\alpha_{t}(x,u){\bf\Phi}_{x,u}(\mu^{N}_{t})\left[\mathds{1}_{\{x^{i}_{t+1}=x^{j}\}}-{\bf\Phi}_{x,u}^{\intercal}(\mu^{N}_{t}){\bf Q}_{t}^{j}(x,u)\right] (25)
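The only new ingredient relative to Section 2.2 is that the mean-field term entering the features is now the empirical distribution of the NN agents' states; a small illustrative helper (our own sketch, not part of the formal algorithm) is given below.

```python
import numpy as np

def empirical_mean_field(agent_states, states):
    """Return mu^N_t = mu_{x_t} as a vector over `states` from the list of N agent states."""
    counts = np.array([sum(1 for s in agent_states if s == x) for x in states], dtype=float)
    return counts / len(agent_states)

# Each agent i then runs the update sketched after (23), with mu_t replaced by
# empirical_mean_field(x_t_vector, states) and with its own transition
# (x^i_t, u^i_t, x^i_{t+1}) and cost realization, as in (24)-(25).
```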
Remark 4.

We note that the iterates θt\theta_{t} and 𝐐t{\bf Q}_{t} depend on the agent identity ii, in this case, as each agent can learn the model independently. Moreover, the learning rates αt(x,u)\alpha_{t}(x,u) and the basis functions 𝚽x,u{\bf\Phi}_{x,u} might depend on the agent identity as well. However, we omit the dependence in the notation to reduce notational clutter.

Assumption 5.

Under the exploration team policy γ(|𝐱)=[γ1(|x1),,γN(|xN)]{\bf\gamma}(\cdot|{\bf x})=\left[\gamma^{1}(\cdot|x^{1}),\dots,\gamma^{N}(\cdot|x^{N})\right]^{\intercal}, the state vector process 𝐱𝐭=[xt1,,xtN]{\bf x_{t}}=[x_{t}^{1},\dots,x_{t}^{N}] of the agents is irreducible and aperiodic and in particular admits a unique invariant measure, and thus the mean-field flow μtN=μ𝐱𝐭\mu^{N}_{t}=\mu_{\bf x_{t}} admits a unique invariant measure, say PN()𝒫N(𝕏)P^{N}(\cdot)\in{\mathcal{P}}_{N}(\mathds{X}), as well.

Remark 5.

We note that a sufficient condition for the above assumption to hold is that there exists some x𝕏x^{\prime}\in\mathds{X} such that 𝒯(x|x,u,μN)ϵ>0\mathcal{T}(x^{\prime}|x,u,\mu^{N})\geq\epsilon>0 for any x,ux,u and for any μN\mu^{N}. In particular, this implies that

Pr(𝐗t+1=[x,,x]|𝐱t,γ)=i=1Nu𝒯(x|xti,u,μtN)γi(u|xti)ϵN>0\displaystyle Pr\left({\bf X}_{t+1}=[x^{\prime},\dots,x^{\prime}]|{\bf x}_{t},\gamma\right)=\prod_{i=1}^{N}\sum_{u}\mathcal{T}(x^{\prime}|x_{t}^{i},u,\mu_{t}^{N})\gamma^{i}(u|x^{i}_{t})\geq\epsilon^{N}>0

and thus [25, p 56, 3.1.1] implies that the process 𝐗t{\bf X}_{t} is geometrically ergodic.

The next result shows the convergence of the algorithm. Similar to the previous section, we first define the trained measures of mean-field terms for every (x,u)(x,u) pair using the stationary distribution of the mean-field terms. We assume that Assumption 3 holds for every policy γi\gamma^{i}, that is, γi(u|x)>0\gamma^{i}(u|x)>0 for every (x,u)(x,u) pair. Denoting the invariant distribution of the joint state process by P(𝐱)P(\bf x), for some μN𝒫N(𝕏)\mu^{N}\in{\mathcal{P}}_{N}(\mathds{X}) and for agent ii, we can write

P(μN|(xi,ui)=(x,u))\displaystyle P(\mu^{N}|(x^{i},u^{i})=(x,u)) =𝐱𝕏NPr(μN|𝐱)P(𝐱|(xi,ui)=(x,u))\displaystyle=\int_{{\bf x}\in\mathds{X}^{N}}Pr(\mu^{N}|{\bf x})P({\bf x}|(x^{i},u^{i})=(x,u))
=𝐱𝕏N𝟙{μN=μ𝐱}γi(u|x)𝟙{x=𝐱[i]}P(xi=x,ui=u)P(𝐱)\displaystyle=\int_{{\bf x}\in\mathds{X}^{N}}\mathds{1}_{\{\mu^{N}=\mu_{\bf x}\}}\frac{\gamma^{i}(u|x)\mathds{1}_{\{x={\bf x}[i]\}}}{P(x^{i}=x,u^{i}=u)}P({\bf x})
=𝐱𝕏N𝟙{μN=μ𝐱}γi(u|x)𝟙{x=𝐱[i]}γi(u|x)P(xi=x)P(𝐱)\displaystyle=\int_{{\bf x}\in\mathds{X}^{N}}\mathds{1}_{\{\mu^{N}=\mu_{\bf x}\}}\frac{\gamma^{i}(u|x)\mathds{1}_{\{x={\bf x}[i]\}}}{\gamma^{i}(u|x)P(x^{i}=x)}P({\bf x})
=𝐱𝕏N𝟙{μN=μ𝐱}𝟙{x=𝐱[i]}P(xi=x)P(𝐱)\displaystyle=\int_{{\bf x}\in\mathds{X}^{N}}\mathds{1}_{\{\mu^{N}=\mu_{\bf x}\}}\frac{\mathds{1}_{\{x={\bf x}[i]\}}}{P(x^{i}=x)}P({\bf x})
=P(μN|xi=x)=:Pxi(μN).\displaystyle=P(\mu^{N}|x^{i}=x)=:P^{i}_{x}(\mu^{N}). (26)

Note that the trained measure of mean-field terms is independent of the control actions; however, it may depend on the agent identity, since the agents may follow distinct exploration policies.

Corollary 2.

Let Assumption 5 hold, and let Assumption 3 hold for each policy γi\gamma^{i}. Assume further that the learning rates of every agent satisfy the assumption of Corollary 1 such that tαt(x,u)=\sum_{t}\alpha_{t}(x,u)=\infty and tαt2(x,u)<\sum_{t}\alpha_{t}^{2}(x,u)<\infty with probability one for all (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U}. Then, the iterations given in (24) and (25) converge with probability one. Furthermore, for agent ii, the limit points, say θ(x,u){\bf\theta}^{*}_{(x,u)} and 𝐐(x,u)(){\bf Q}^{*}_{(x,u)}(\cdot) are such that

θ(x,u)=argminθ(x,u)d|c(x,u,μ)𝚽(x,u,μ)θ(x,u)|2Pxi(dμ)\displaystyle{\bf\theta}^{*}(x,u)=\mathop{\rm arg\,min}_{\theta(x,u)\in\mathds{R}^{d}}\int\left|c(x,u,\mu)-{\bf\Phi}^{\intercal}(x,u,\mu){\bf\theta}(x,u)\right|^{2}P^{i}_{x}(d\mu)
𝐐,j(x,u)=argmin𝐐j(x,u)d|𝒯(j|x,u,μ)𝚽x,u(μ)𝐐j(x,u)|2Pxi(dμ)\displaystyle{\bf Q}^{*,j}{(x,u)}=\mathop{\rm arg\,min}_{{\bf Q}^{j}{(x,u)}\in\mathds{R}^{d}}\int\left|\mathcal{T}(j|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{j}{(x,u)}\right|^{2}P^{i}_{x}(d\mu)

for every (x,u)(x,u) pair and for every jj, where 𝐐,j(x,u){\bf Q}^{*,j}{(x,u)} is the jjth column of 𝐐(x,u)(){\bf Q}^{*}_{(x,u)}(\cdot). Furthermore, Pxi()P^{i}_{x}(\cdot) denotes the trained measure based on the invariant measure of the mean-field flow under the exploration policies, conditioned on agent ii being at state xx (see (2.3)).

Proof.

The proof is identical to the proof of Corollary 1, and is an application of Proposition 1. The only difference is the ergodicity of the mean-field process μtN\mu_{t}^{N}, which does not require common randomness for exploration policies. ∎

3 Uniform Error Bounds for Model Approximation

The learning methods we have presented in Section 2 minimize the L2L_{2} distance between the true model and the linear approximate model, under the probability measure induced by the training data. In particular, denoting the learned parameters for a fixed pair (x,u)𝕏×𝕌(x,u)\in\mathds{X}\times\mathds{U} by θ(x,u)\theta^{*}_{(x,u)}, and 𝐐(x,u)(){\bf Q}_{(x,u)}^{*}(\cdot), we have that

θ(x,u)=argminθd|c(x,u,μ)𝚽(x,u)(μ)θ(x,u)|2Px(dμ)\displaystyle{\bf\theta}^{*}_{(x,u)}=\mathop{\rm arg\,min}_{\theta\in\mathds{R}^{d}}\int\left|c(x,u,\mu)-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\theta}_{(x,u)}\right|^{2}P_{x}(d\mu)
𝐐,j(x,u)=argmin𝐐j(x,u)d|𝒯(j|x,u,μ)𝚽(x,u)(μ)𝐐j(x,u)|2Px(dμ)\displaystyle{\bf Q}^{*,j}{(x,u)}=\mathop{\rm arg\,min}_{{\bf Q}^{j}{(x,u)}\in\mathds{R}^{d}}\int\left|\mathcal{T}(j|x,u,\mu)-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf Q}^{j}{(x,u)}\right|^{2}P_{x}(d\mu) (27)

for some probability measure Px()𝒫(𝒫(𝕏))P_{x}(\cdot)\in{\mathcal{P}}({\mathcal{P}}(\mathds{X})). The measure Px()P_{x}(\cdot) depends on the learning method used.

  • For the coordinated learning methods presented in Section 2.1, Px()P_{x}(\cdot) represents the empirical distribution of the mean-field terms in the training data for which the state xx has positive measure (see (14)).

  • For the individual learning method presented in Section 2.2 for infinite populations, Px()P_{x}(\cdot) represents the invariant measure of the mean-field flow under the randomized exploration policies given the state xx is observed. See (2.2).

  • Finally, for the individual learning method for finite populations in Section 2.3, PxP_{x} depends on the agent identity ii, and is thus denoted by PxiP^{i}_{x}. Similar to the infinite population setting, it represents the invariant measure of the process μ𝐱𝐭\mu_{\bf x_{t}} conditioned on the event (xi=x)(x^{i}=x), for agent ii, where 𝐱𝐭{\bf x_{t}} is the NN-dimensional state vector of the team of NN agents. We note that each agent might have a different trained measure of mean-field terms in this setting, since the policies may be distinct.

When the learned policy is executed, the flow of the mean-field is not guaranteed to stay in the support of the training measure Px()P_{x}(\cdot). Hence, in what follows, we aim to generalize the L2L_{2} performance of the learned models over the space 𝒫(𝕏){\mathcal{P}}(\mathds{X}).

In what follows, we will sometimes refer to Px()P_{x}(\cdot) as the training measure.

3.1 Ideal Case: Perfectly Linear Model

If the cost and the kernel are fully linear for a given set of basis functions 𝚽(x,u)(μ)=[Φ(x,u)1(μ),Φ(x,u)d(μ)]{\bf\Phi}_{(x,u)}(\mu)=[\Phi^{1}_{(x,u)}(\mu),\dots\Phi^{d}_{(x,u)}(\mu)]^{\intercal} then the linear model can be learned perfectly. That is, for the given basis functions 𝚽(x,u)(μ){\bf\Phi}_{(x,u)}(\mu), there exist θ(x,u){\bf\theta}^{*}_{(x,u)} and 𝐐(x,u)(){\bf Q}^{*}_{(x,u)}(\cdot) such that

c(x,u,μ)=𝚽(x,u)(μ)θ(x,u)\displaystyle c(x,u,\mu)={\bf\Phi}_{(x,u)}^{\intercal}(\mu)\theta^{*}_{(x,u)}
𝒯(|x,u,μ)=𝚽(x,u)(μ)𝐐(x,u)()\displaystyle\mathcal{T}(\cdot|x,u,\mu)={\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)

The model can be learned perfectly with a coordinator under the method presented in Section 2.1 if

  • the training set TT is such that for each pair (x,u)(x,u), there exist at least dd different data points. Furthermore, for a given data point of the form (xi,ui,X1i,μ,ci)i=1(x^{i},u^{i},X_{1}^{i},\mu,c^{i})_{i=1}^{\infty}, the state-action distribution for this point is such that Pr(x,u)>0Pr(x,u)>0

  • and if the basis functions 𝚽(x,u)(μ){\bf\Phi}_{(x,u)}(\mu) and 𝚽(x,u)(μ){\bf\Phi}_{(x,u)}(\mu^{\prime}) are linearly independent for every μμ\mu\neq\mu^{\prime}, that is, if 𝐀x,u{\bf A}_{x,u} (see (2.1)) has linearly independent columns.

For the independent learning methods given in Section 2.2 and Section 2.3, the learned model will be the true model with no error if the iterations converge.

3.2 Nearly Linear Models

In this section, we provide a result that states that if the true model can be approximated to within ϵ\epsilon by a linear model, then the models learned with the least squares method approximate the true model uniformly with an error of order ϵ\epsilon, provided the training set is informative enough.

The following assumption states that the true model is nearly linear.

Assumption 6.

We assume the existence of θ¯x,ud\bar{\theta}_{x,u}\in\mathds{R}^{d} and 𝐐¯x,u()d×|𝕏|\bar{\bf Q}_{x,u}(\cdot)\in\mathds{R}^{d\times|\mathds{X}|} with the following property: denoting by 𝐐¯j(x,u)d\bar{\bf Q}^{j}(x,u)\in\mathds{R}^{d} the jjth column of 𝐐¯x,u()\bar{\bf Q}_{x,u}(\cdot), for some ϵ>0\epsilon>0, and ϵj>0\epsilon_{j}>0

supx,u,μ|𝒯(j|x,u,μ)𝚽x,u(μ)𝐐¯j(x,u)|ϵj\displaystyle\sup_{x,u,\mu}\left|\mathcal{T}(j|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu)\bar{\bf Q}^{j}{(x,u)}\right|\leq\epsilon_{j}
supx,u,μ|c(x,u,μ)𝚽x,u(μ)θ¯x,u()|ϵ.\displaystyle\sup_{x,u,\mu}\left|c(x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu)\bar{\bf\theta}_{x,u}(\cdot)\right|\leq\epsilon. (28)

In particular, further assuming jϵjϵ\sum_{j}\epsilon_{j}\leq\epsilon, the above implies that

𝒯(|x,u,μ)𝚽x,u(μ)𝐐¯(x,u)()ϵ2\displaystyle\left\|\mathcal{T}(\cdot|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu)\bar{\bf Q}_{(x,u)}(\cdot)\right\|\leq\frac{\epsilon}{2}

for all x,u,μx,u,\mu.

We note that, in general, there is no guarantee that the learned dynamics constitute a proper stochastic kernel. This can be guaranteed when the model is fully linear as discussed in Section 3.1, or when we consider a discretization-based approximation as described in Section 3.3. However, for general linear approximations, as in this section, we project the learned model 𝚽(x,u)(μ)𝐐(x,u){\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)} onto the set of probability measures 𝒫(𝕏){\mathcal{P}}(\mathds{X}), i.e. the simplex over 𝕏\mathds{X}.

In particular, we use the following notation:

c^(x,u,μ):=𝚽x,u(μ)θ(x,u)\displaystyle\hat{c}(x,u,\mu):={\bf\Phi}^{\intercal}_{x,u}(\mu)\theta^{*}_{(x,u)}
𝒯^(|x,u,μ):=argminν𝒫(𝕏)ν𝚽x,u(μ)𝐐(x,u)()\displaystyle\hat{\mathcal{T}}(\cdot|x,u,\mu):=\mathop{\rm arg\,min}_{\nu\in{\mathcal{P}}(\mathds{X})}\|\nu-{\bf\Phi}^{\intercal}_{x,u}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)\| (29)

where θ(x,u)\theta^{*}_{(x,u)} and 𝐐(x,u)(){\bf Q}^{*}_{(x,u)}(\cdot) denote the models learned with the least squares method, see (3).
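As an illustration of the projection in (29), the following sketch maps a learned kernel row, which need not be a probability vector, onto the simplex. For concreteness we use the standard Euclidean projection, which is one possible choice of the norm in (29); the helper name is ours.

```python
import numpy as np

def project_to_simplex(q):
    """Euclidean projection of a vector q onto {p : p >= 0, sum(p) = 1}."""
    q = np.asarray(q, dtype=float)
    u = np.sort(q)[::-1]                                        # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(q)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(q + tau, 0.0)

# Example: an estimated kernel row with a small negative entry
print(project_to_simplex([0.7, 0.4, -0.1]))                     # -> [0.65, 0.35, 0.0]
```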

Proposition 2.

Let Px()𝒫(𝒫(𝕏))P_{x}(\cdot)\in{\mathcal{P}}({\mathcal{P}}(\mathds{X})) denote the training distribution of the mean-field terms for the state x𝕏x\in\mathds{X}. Let Assumption 6 hold. For the estimated models c^(x,u,μ)\hat{c}(x,u,\mu) and 𝒯^(|x,u,μ)\hat{\mathcal{T}}(\cdot|x,u,\mu) defined via the least squares method (see (3.2)), we have that

|c(x,u,μ)c^(x,u,μ)|ϵ(1+2dλmin)\displaystyle\left|c(x,u,\mu)-\hat{c}(x,u,\mu)\right|\leq\epsilon\left(1+\frac{2\sqrt{d}}{\sqrt{\lambda_{\min}}}\right)
𝒯(|x,u,μ)𝒯^(|x,u,μ)ϵ(1+2dλmin)\displaystyle\left\|\mathcal{T}(\cdot|x,u,\mu)-\hat{\mathcal{T}}(\cdot|x,u,\mu)\right\|\leq\epsilon\left(1+\frac{2\sqrt{d}}{\sqrt{\lambda_{\min}}}\right)

where λmin\lambda_{min} is the minimum eigenvalue of ϕ(x,u)(μ)ϕ(x,u)(μ)Px(dμ)\int{\bf\phi}_{(x,u)}(\mu){\bf\phi}_{(x,u)}^{\intercal}(\mu)P_{x}(d\mu).

Proof.

We first note that since the learned θ(x,u)\theta^{*}_{(x,u)} minimizes the L2L_{2} distance to the true model under the training measure PxP_{x}, Assumption 6 implies that

𝒫(𝕏)|c(x,u,μ)𝚽(x,u)(μ)θ(𝐱,𝐮)|2Px(dμ)\displaystyle\int_{{\mathcal{P}}(\mathds{X})}\left|c(x,u,\mu)-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\theta^{*}_{(x,u)}}\right|^{2}P_{x}(d\mu)\leq 𝒫(𝕏)|c(x,u,μ)𝚽(x,u)(μ)θ¯(𝐱,𝐮)|2Px(dμ)ϵ2.\displaystyle\int_{{\mathcal{P}}(\mathds{X})}\left|c(x,u,\mu)-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\bar{\theta}_{(x,u)}}\right|^{2}P_{x}(d\mu)\leq\epsilon^{2}.

In particular, via the triangle inequality (under the L2L_{2} norm) we also have that

𝒫(𝕏)|𝚽(x,u)(μ)θ¯(x,u)𝚽(x,u)(μ)θ(𝐱,𝐮)|2Px(dμ)4ϵ2.\displaystyle\int_{{\mathcal{P}}(\mathds{X})}\left|{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\bar{\theta}}_{(x,u)}-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\theta^{*}_{(x,u)}}\right|^{2}P_{x}(d\mu)\leq 4\epsilon^{2}.

We can further write that

𝒫(𝕏)|𝚽(x,u)(μ)θ¯(x,u)𝚽(x,u)(μ)θ(𝐱,𝐮)|2Px(dμ)\displaystyle\int_{{\mathcal{P}}(\mathds{X})}\left|{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\bar{\theta}}_{(x,u)}-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\theta^{*}_{(x,u)}}\right|^{2}P_{x}(d\mu)
=𝒫(𝕏)|𝚽(x,u)(μ)(θ¯(x,u)θ(x,u))|2Px(dμ)\displaystyle=\int_{{\mathcal{P}}(\mathds{X})}\left|{\bf\Phi}_{(x,u)}^{\intercal}(\mu)\left({\bf\bar{\theta}}_{(x,u)}-\theta^{*}_{(x,u)}\right)\right|^{2}P_{x}(d\mu)
=(θ¯(x,u)θ(x,u))𝒫(𝕏)𝚽(x,u)(μ)𝚽(x,u)(μ)Px(dμ)(θ¯(x,u)θ(x,u))\displaystyle=\left({\bf\bar{\theta}}_{(x,u)}-\theta^{*}_{(x,u)}\right)^{\intercal}\int_{{\mathcal{P}}(\mathds{X})}{\bf\Phi}_{(x,u)}(\mu){\bf\Phi}^{\intercal}_{(x,u)}(\mu)P_{x}(d\mu)\left({\bf\bar{\theta}}_{(x,u)}-\theta^{*}_{(x,u)}\right)
θ¯(x,u)θ(x,u)22λmin\displaystyle\geq\left\|{\bf\bar{\theta}}_{(x,u)}-\theta^{*}_{(x,u)}\right\|_{2}^{2}\lambda_{\min}

where λmin\lambda_{min} is the minimum eigenvalue of ϕ(x,u)(μ)ϕ(x,u)(μ)Px(dμ)\int{\bf\phi}_{(x,u)}(\mu){\bf\phi}_{(x,u)}^{\intercal}(\mu)P_{x}(d\mu). Thus, we have that

θ¯(x,u)θ(x,u)22ϵλmin.\displaystyle\left\|{\bf\bar{\theta}}_{(x,u)}-\theta^{*}_{(x,u)}\right\|_{2}\leq\frac{2\epsilon}{\sqrt{\lambda_{\min}}}.

Finally, using the triangle inequality with the fact that c^(x,u,μ)=𝚽x,u(μ)θ(x,u)\hat{c}(x,u,\mu)={\bf\Phi}^{\intercal}_{x,u}(\mu)\theta^{*}_{(x,u)}

|c(x,u,μ)c^(x,u,μ)||c(x,u,μ)𝚽(x,u)(μ)θ¯(x,u)|+|𝚽(x,u)(μ)θ¯(x,u)𝚽x,u(μ)θ(x,u)|\displaystyle\left|c(x,u,\mu)-\hat{c}(x,u,\mu)\right|\leq\left|c(x,u,\mu)-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\bar{\theta}}_{(x,u)}\right|+\left|{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf\bar{\theta}}_{(x,u)}-{\bf\Phi}^{\intercal}_{x,u}(\mu)\theta^{*}_{(x,u)}\right|
ϵ+𝚽(x,u)(μ)2θ¯(x,u)θ(x,u)2ϵ+2ϵdλmin\displaystyle\leq\epsilon+\|{\bf\Phi}_{(x,u)}(\mu)\|_{2}\left\|{\bf\bar{\theta}}_{(x,u)}-\theta^{*}_{(x,u)}\right\|_{2}\leq\epsilon+\frac{2\epsilon\sqrt{d}}{\sqrt{\lambda_{\min}}}

where we used 𝚽(x,u)(μ)2d\|{\bf\Phi}_{(x,u)}(\mu)\|_{2}\leq\sqrt{d} since we assume that Φ(x,u)j()1\|\Phi^{j}_{(x,u)}(\cdot)\|_{\infty}\leq 1 via Assumption 2.

For the proof of the error bound for the estimated kernel 𝒯^(|x,u,μ)\hat{\mathcal{T}}(\cdot|x,u,\mu), we follow identical steps. Recalling that by construction 𝚽(x,u)(μ)𝐐,j(x,u){\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf Q}^{*,j}{(x,u)} minimizes the L2L_{2} distance to 𝒯(j|x,u,μ)\mathcal{T}(j|x,u,\mu), and using Assumption 6, we write that

λmin𝐐¯j(x,u)𝐐,j(x,u)22|𝚽(𝐱,𝐮)(μ)𝐐¯j(x,u)𝚽(x,u)(μ)𝐐,j(x,u)|2Px(dμ)4ϵj2,\displaystyle\lambda_{\min}\|{\bf\bar{Q}}^{j}{(x,u)}-{\bf Q}^{*,j}{(x,u)}\|^{2}_{2}\leq\int\left|{\bf\Phi^{\intercal}_{(x,u)}}(\mu){\bf\bar{Q}}^{j}{(x,u)}-{\bf\Phi}_{(x,u)}^{\intercal}(\mu){\bf Q}^{*,j}{(x,u)}\right|^{2}P_{x}(d\mu)\leq 4\epsilon_{j}^{2},

which yields

𝐐¯j(x,u)𝐐,j(x,u)22ϵjλmin\displaystyle\|{\bf\bar{Q}}^{j}{(x,u)}-{\bf Q}^{*,j}{(x,u)}\|_{2}\leq\frac{2\epsilon_{j}}{\sqrt{\lambda_{\min}}}

where λmin\lambda_{min} is the minimum eigenvalue of ϕ(x,u)(μ)ϕ(x,u)(μ)Px(dμ)\int{\bf\phi}_{(x,u)}(\mu){\bf\phi}_{(x,u)}^{\intercal}(\mu)P_{x}(d\mu).

Since 𝒯^(|x,u,μ)\hat{\mathcal{T}}(\cdot|x,u,\mu) is the projection of 𝚽x,u(μ)𝐐(x,u)(){\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot) onto the space of probability measures, we have, by the definition of the projection, that

𝒯^(|x,u,μ)𝚽x,u(μ)𝐐(x,u)()\displaystyle\|\hat{\mathcal{T}}(\cdot|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)\| 𝒯(|x,u,μ)𝚽x,u(μ)𝐐(x,u)().\displaystyle\leq\|{\mathcal{T}}(\cdot|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)\|.

We then write the following using the triangle inequality

𝒯(|x,u,μ)𝒯^(|x,u,μ)\displaystyle\left\|\mathcal{T}(\cdot|x,u,\mu)-\hat{\mathcal{T}}(\cdot|x,u,\mu)\right\|
𝒯(|x,u,μ)𝚽x,u(μ)𝐐(x,u)()+𝚽x,u(μ)𝐐(x,u)()𝒯^(|x,u,μ)\displaystyle\leq\left\|\mathcal{T}(\cdot|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)\right\|+\left\|{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)-\hat{\mathcal{T}}(\cdot|x,u,\mu)\right\|
2𝒯(|x,u,μ)𝚽x,u(μ)𝐐(x,u)()\displaystyle\leq 2\|{\mathcal{T}}(\cdot|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)\|
2𝒯(|x,u,μ)𝚽x,u(μ)𝐐¯(x,u)()+2𝚽x,u(μ)𝐐¯(x,u)()𝚽x,u(μ)𝐐(x,u)()\displaystyle\leq 2\left\|{\mathcal{T}}(\cdot|x,u,\mu)-{\bf\Phi}_{x,u}^{\intercal}(\mu)\bar{\bf Q}_{(x,u)}(\cdot)\right\|+2\left\|{\bf\Phi}_{x,u}^{\intercal}(\mu)\bar{\bf Q}_{(x,u)}(\cdot)-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*}_{(x,u)}(\cdot)\right\|
ϵ+j|𝚽x,u(μ)𝐐¯j(x,u)𝚽x,u(μ)𝐐,j(x,u)|\displaystyle\leq\epsilon+\sum_{j}\left|{\bf\Phi}_{x,u}^{\intercal}(\mu)\bar{\bf Q}^{j}{(x,u)}-{\bf\Phi}_{x,u}^{\intercal}(\mu){\bf Q}^{*,j}{(x,u)}\right|
ϵ+j𝚽x,u(μ)2𝐐¯j(x,u)𝐐,j(x,u)2\displaystyle\leq\epsilon+\sum_{j}\|{\bf\Phi}_{x,u}^{\intercal}(\mu)\|_{2}\|{\bf\bar{Q}}^{j}{(x,u)}-{\bf Q}^{*,j}{(x,u)}\|_{2}
ϵ+2dλminjϵjϵ+2dϵλmin\displaystyle\leq\epsilon+\frac{2\sqrt{d}}{\sqrt{\lambda_{\min}}}\sum_{j}{\epsilon_{j}}\leq\epsilon+2\frac{\sqrt{d}\epsilon}{\sqrt{\lambda_{\min}}}

where we used jϵjϵ\sum_{j}\epsilon_{j}\leq\epsilon with Assumption 6. ∎

3.3 A Special Case: Linear Approximation via Discretization

In this section, we show that the discretization of the space 𝒫(𝕏){\mathcal{P}}(\mathds{X}) can be seen as a particular case of linear function approximation with a special class of basis functions. In particular, for this case, we can analyze the error bounds of the learned policy with mild conditions on the model.

Let {Bi}i=1d𝒫(𝕏)\{B_{i}\}_{i=1}^{d}\subset{\mathcal{P}}(\mathds{X}) be a collection of disjoint quantization bins of 𝒫(𝕏){\mathcal{P}}(\mathds{X}) such that iBi=𝒫(𝕏)\cup_{i}B_{i}={\mathcal{P}}(\mathds{X}). We define the basis functions for the linear approximation such that

Φx,ui()=𝟙Bi()\displaystyle\Phi^{i}_{x,u}(\cdot)=\mathds{1}_{B_{i}}(\cdot)

for all (x,u)(x,u) pairs. Note that in general the quantization bins, BiB_{i}’s, can be chosen differently for every (x,u)(x,u); for simplicity of the analysis, we will work with a discretization scheme which is the same for every (x,u)(x,u). An important property of the discretization is that the basis functions are mutually orthogonal under any training measure P()P(\cdot) with P(Bi)>0P(B_{i})>0 for each quantization bin BiB_{i}. That is

Φx,ui(μ),Φx,uj(μ)P(dμ)=P(Bi)𝟙{i=j}\displaystyle\int\big{\langle}\Phi^{i}_{x,u}(\mu),\Phi^{j}_{x,u}(\mu)\big{\rangle}P(d\mu)=P(B_{i})\mathds{1}_{\{i=j\}}

for every (x,u)(x,u) pair. This property allows us to analyze the uniform error bounds of the discretization method more directly.

The linear fitted model (see (3)) with the chosen basis functions becomes

θ(x,u)i\displaystyle\theta^{i}_{(x,u)} =Bic(x,u,μ)P(dμ)P(Bi)\displaystyle=\frac{\int_{B_{i}}c(x,u,\mu)P(d\mu)}{P(B_{i})}
Q(x,u)i()\displaystyle Q^{i}_{(x,u)}(\cdot) =Bi𝒯(|x,u,μ)P(dμ)P(Bi)\displaystyle=\frac{\int_{B_{i}}\mathcal{T}(\cdot|x,u,\mu)P(d\mu)}{P(B_{i})} (30)

In words, the learned coefficients are the averages of cost and transition realizations from the training set of the corresponding quantization bin.
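To make this special case concrete, the following sketch fits the discretized cost coefficients in (30) from training samples for a fixed (x,u) pair; the binning of 𝒫(𝕏) by the first coordinate μ(0) and all helper names are illustrative choices of ours.

```python
import numpy as np

def bin_index(mu, n_bins):
    """Illustrative quantizer: bin a mean-field term by its first coordinate mu(0)."""
    return min(int(mu[0] * n_bins), n_bins - 1)

def fit_discretized_cost(samples, n_bins):
    """samples: iterable of (mu, cost) pairs from the training data for a fixed (x, u)."""
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for mu, cost in samples:
        i = bin_index(mu, n_bins)
        sums[i] += cost
        counts[i] += 1.0
    # theta^i is the bin average, as in (30); empty bins are left at zero in this sketch
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), 0.0)
```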

The following then is an immediate result of Assumption 1.

Proposition 3.

Let θx,u=[θx,u1,,θx,ud]\theta_{x,u}=[\theta^{1}_{x,u},\dots,\theta^{d}_{x,u}]^{\intercal} and 𝐐x,u()=[Qx,u1(),,Qx,ud()]{\bf Q}_{x,u}(\cdot)=\left[Q^{1}_{x,u}(\cdot),\dots,Q^{d}_{x,u}(\cdot)\right]^{\intercal} be given by (3.3). If the training measure P()P(\cdot) is such that P(Bi)>0P(B_{i})>0 for each quantization bin BiB_{i}, under Assumption 1, we then have that

|c(x,u,μ)𝚽x,u(μ)θx,u|KcL\displaystyle\left|c(x,u,\mu)-{\bf\Phi}^{\intercal}_{x,u}(\mu){\bf\theta}_{x,u}\right|\leq K_{c}L
𝒯(|x,u,μ)𝚽x,u(μ)𝐐x,uKfL\displaystyle\left\|\mathcal{T}(\cdot|x,u,\mu)-{\bf\Phi}^{\intercal}_{x,u}(\mu){\bf Q}_{x,u}\right\|\leq K_{f}L

where LL is the largest diameter of the quantization bins such that

L=maxisupμ,μBiμμ.\displaystyle L=\max_{i}\sup_{\mu,\mu^{\prime}\in B_{i}}\|\mu-\mu^{\prime}\|.

4 Error Analysis for Control with Misspecified Models

In the previous section, we have studied the uniform mismatch bounds of the learned models. In this section, we will focus on what happens if the controllers designed for the linear estimates are used for the true dynamics. We will provide error bounds for the performance loss of the control designed for a possibly misspecified model.

We will analyze the infinite population and the finite population settings separately. We note that some of the following results (e.g. Lemma 3) have been studied in the literature to establish the connection between the NN-agent control problems and the limit mean-field control problem without the model mismatch aspect. That is, existing results study what happens if one uses the infinite population solution for the finite population control problem with perfectly known dynamics (see e.g. [33, 5, 32]). However, we present the proof of every result for completeness and because of the connections to the analysis we follow throughout the paper. Furthermore, the existing results are often stated under slightly different assumptions and settings, e.g. only for closed loop policies, or only for policies that are open loop in the sense that they are measurable with respect to the noise process.

4.1 Error Bounds for Infinitely Many Agents

As we have observed in Example 1.2, even when the agents agree on the model knowledge, optimality may not be achieved without coordination on which policy to follow. Therefore, we assume that after the learning period, the team of agents collectively agrees on the cost and transition models given by c^(x,u,μ)\hat{c}(x,u,\mu) and 𝒯^(|x,u,μ)\hat{\mathcal{T}}(\cdot|x,u,\mu) and designs policies for this model. We will assume that

|c(x,u,μ)c^(x,u,μ)|λ\displaystyle\left|c(x,u,\mu)-\hat{c}(x,u,\mu)\right|\leq\lambda
𝒯(|x,u,μ)𝒯^(|x,u,μ)λ\displaystyle\left\|\mathcal{T}(\cdot|x,u,\mu)-\hat{\mathcal{T}}(\cdot|x,u,\mu)\right\|\leq\lambda (31)

for some λ<\lambda<\infty and for all x,u,μx,u,\mu. That is, λ\lambda represents the uniform model mismatch constant.

We will consider two different cases for the execution of the designed control.

  • Closed loop control: The team decides on a policy g^:𝒫(𝕏)Γ\hat{g}:{\mathcal{P}}(\mathds{X})\to\Gamma, and uses their local states and the mean-field term to apply the policy g^\hat{g}. That is, an agent ii observes the mean-field term μt\mu_{t}, chooses g^(μt)=γ^(|xi,μt)\hat{g}(\mu_{t})=\hat{\gamma}(\cdot|x^{i},\mu_{t}) and applies their control action according to γ^(|xi,μt)\hat{\gamma}(\cdot|x^{i},\mu_{t}) with the local state xix^{i}. The important distinction is that the mean-field term μt\mu_{t} is observed by every agent, and they decide on their agent-level policies with the observed mean-field term. Hence, we refer to this execution method as the closed loop method, since the mean-field term is given as a feedback variable.

  • Open loop control: We have argued earlier that the flow of the mean-field term μt\mu_{t} is deterministic for the infinitely many agent case, see (11). In particular, the mean-field term μt\mu_{t} can be estimated with the model information. Hence, for this case, we will assume that the agents only observe their local states, and estimate the mean-field term independently instead of observing it. That is, an agent ii estimates the mean-field term μ^t\hat{\mu}_{t}, and applies their control action according to γ^(|xi,μ^t)\hat{\gamma}(\cdot|x^{i},\hat{\mu}_{t}) with the local state xix^{i}. Note that if the model dynamics were perfectly known, this estimate would coincide with the true flow of the mean-field term. However, when the model is misspecified, the estimate μ^t\hat{\mu}_{t} and the correct mean-field term will deviate from each other, and we will need to study the effects of this deviation on the control performance, in addition to the incorrect computation of the control policy.

In what follows, the previously introduced constants Kc,KfK_{c},K_{f} and δT\delta_{T} will be used often. We refer the reader to Assumption 1 for Kc,KfK_{c},K_{f}, and to equation (6) for δT\delta_{T}.

For the results in this section, we will require that βK<1\beta K<1 where K=Kf+δTK=K_{f}+\delta_{T}. We note that this assumption is needed to show the Lipschitz continuity of the value function Kβ(μ)K_{\beta}^{*}(\mu) with respect to μ\mu. The following provides an example where this bound is not satisfied, and the value function is not Lipschitz continuous.

Example 4.1.

Consider control-free dynamics (without loss of optimality) with a binary state space 𝕏={0,1}\mathds{X}=\{0,1\}. We assume that

𝒯(0|x,μ)=μ(0)2\displaystyle\mathcal{T}(0|x,\mu)=\mu(0)^{2}

that is, the state process moves to 0 with probability μ(0)2\mu(0)^{2} independent of the value of the state at the current step. We first notice that μμ=|μ(0)μ(0)|\|\mu-\mu^{\prime}\|=|\mu(0)-\mu^{\prime}(0)| for the binary state space. Furthermore, we note that this kernel is Lipschitz continuous in μ\mu with Lipschitz constant 2, that is Kf=2K_{f}=2. To see this, consider the following for μ,μ𝒫(𝕏)\mu,\mu^{\prime}\in{\mathcal{P}}(\mathds{X})

𝒯(|x,μ)𝒯(|x,μ)=12(|𝒯(0|x,μ)𝒯(0|x,μ)|+|𝒯(1|x,μ)𝒯(1|x,μ)|)\displaystyle\|\mathcal{T}(\cdot|x,\mu)-\mathcal{T}(\cdot|x,\mu^{\prime})\|=\frac{1}{2}\left(|\mathcal{T}(0|x,\mu)-\mathcal{T}(0|x,\mu^{\prime})|+|\mathcal{T}(1|x,\mu)-\mathcal{T}(1|x,\mu^{\prime})|\right)
=|μ(0)2μ(0)2|=|μ(0)μ(0)|×|μ(0)+μ(0)|2|μ(0)μ(0)|=2μ()μ()\displaystyle=|\mu(0)^{2}-\mu^{\prime}(0)^{2}|=|\mu(0)-\mu^{\prime}(0)|\times|\mu(0)+\mu^{\prime}(0)|\leq 2|\mu(0)-\mu^{\prime}(0)|=2\|\mu(\cdot)-\mu^{\prime}(\cdot)\|

where we used the bound that |μ(0)+μ(0)|2|\mu(0)+\mu^{\prime}(0)|\leq 2 which is the minimal uniform upper bound for all μ,μ\mu,\mu^{\prime}.

Hence, the kernel is Lipschitz continuous with constant 2. Furthermore, since the dynamics do not depend on xx and uu, we have that δT=0\delta_{T}=0, and thus K=Kf+δT=2K=K_{f}+\delta_{T}=2.

The stage-wise cost is given by c(μ)=μ(0)c(\mu)=\mu(0). We consider the Lipschitz continuity of the value function around μ(0)=1\mu(0)=1, i.e. around μ=δ0\mu=\delta_{0}. Note that for some initial distribution μ0(0)=a\mu_{0}(0)=a, one can iteratively show that

μt(0)=a2t.\displaystyle\mu_{t}(0)=a^{{2^{t}}}.

Hence, we can write the value function as

Kβ(a)=t=0βta2t.\displaystyle K_{\beta}(a)=\sum_{t=0}^{\infty}\beta^{t}a^{{2^{t}}}.

To show that this function is not Lipschitz continuous, we consider two points a,b[0,1]a,b\in[0,1]; without loss of generality, assume that aba\geq b:

Kβ(a)Kβ(b)ab=t=0βt(a2tb2t)ab=t=0βt2tc2t1\displaystyle\frac{K_{\beta}(a)-K_{\beta}(b)}{a-b}=\frac{\sum_{t=0}^{\infty}\beta^{t}(a^{2^{t}}-b^{2^{t}})}{a-b}=\sum_{t=0}^{\infty}\beta^{t}2^{t}c^{2^{t}-1}

for some c[b,a]c\in[b,a], where we used the mean value theorem for a2tb2tab\frac{a^{2^{t}}-b^{2^{t}}}{a-b}. We can see that the above cannot be bounded uniformly when cc is around 11 if β1/2\beta\geq 1/2, i.e. if βK1\beta K\geq 1. This implies that the value function cannot be Lipschitz continuous if βK1\beta K\geq 1.
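A short numerical check of this blow-up (our own illustration): with β=0.6 and the series truncated at a finite horizon, the difference quotient of K_β near a=1 keeps growing as the two points get closer.

```python
def K_beta(a, beta, T=60):
    """Truncated value function K_beta(a) = sum_{t < T} beta^t a^(2^t)."""
    return sum(beta**t * a**(2**t) for t in range(T))

beta = 0.6                                  # beta >= 1/2, so beta * K >= 1 in this example
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    quotient = (K_beta(1.0, beta) - K_beta(1.0 - h, beta)) / h
    print(h, quotient)                      # the quotient grows as h shrinks
```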

4.1.1 Error Bounds for Closed Loop Control

We assume that the agents calculate an optimal policy, say g^\hat{g}, for the incorrect model (𝒯^\hat{\mathcal{T}} and c^\hat{c}), and observe the correct mean-field term, say μt\mu_{t}, at every time step tt. The agents then use

g^(μt)=γ^(|xt,μt)\displaystyle\hat{g}({\mu}_{t})=\hat{\gamma}(\cdot|x_{t},{\mu}_{t}) (32)

to select their control actions utu_{t} at time tt.

We denote the accumulated cost under this policy g^\hat{g} by Kβ(μ0,g^)K_{\beta}(\mu_{0},\hat{g}), and we will compare this with the optimal cost that can be achieved, which is Kβ(μ0)K_{\beta}^{*}(\mu_{0}) for some initial distribution μ0\mu_{0}.

Theorem 3.

Consider the closed loop policy g^\hat{g} in (32) designed for an estimated model 𝒯^,c^\hat{\mathcal{T}},\hat{c} which satisfies (4.1) for the infinite population dynamics. Under Assumption 1, if βK<1\beta K<1,

Kβ(μ0,g^)Kβ(μ0)2λ(βCβK+1)(1β)2(1βK)\displaystyle K_{\beta}(\mu_{0},\hat{g})-K_{\beta}^{*}(\mu_{0})\leq 2\lambda\frac{(\beta C-\beta K+1)}{(1-\beta)^{2}(1-\beta K)}

where K=(Kf+δT)K=(K_{f}+\delta_{T}) and C=(c+Kc)C=(\|c\|_{\infty}+K_{c}).

Proof.

We start with the following upper-bound

Kβ(μ,g^)Kβ(μ)|Kβ(μ,g^)K^β(μ)|+|K^β(μ)Kβ(μ)|\displaystyle K_{\beta}(\mu,\hat{g})-K_{\beta}^{*}(\mu)\leq\left|K_{\beta}(\mu,\hat{g})-\hat{K}_{\beta}(\mu)\right|+\left|\hat{K}_{\beta}(\mu)-K_{\beta}^{*}(\mu)\right| (33)

where K^β(μ)\hat{K}_{\beta}(\mu) denotes the optimal value function for the mismatched model. We have an upper-bound for the second term by Lemma 1. We write the following Bellman equations for the first term:

Kβ(μ,g^)=k(μ,γ^)+βKβ(F(μ,γ^),g^)\displaystyle K_{\beta}(\mu,\hat{g})=k(\mu,\hat{\gamma})+\beta K_{\beta}\left(F(\mu,\hat{\gamma}),\hat{g}\right)
K^β(μ)=k^(μ,γ^)+βK^β(F^(μ,γ^)).\displaystyle\hat{K}_{\beta}(\mu)=\hat{k}(\mu,\hat{\gamma})+\beta\hat{K}_{\beta}\left(\hat{F}(\mu,\hat{\gamma})\right).

We can then write

|Kβ(μ,g^)K^β(μ)|\displaystyle\left|K_{\beta}(\mu,\hat{g})-\hat{K}_{\beta}(\mu)\right| |k(μ,γ^)k^(μ,γ^)|\displaystyle\leq\left|k(\mu,\hat{\gamma})-\hat{k}(\mu,\hat{\gamma})\right|
+β|Kβ(F(μ,γ^),g^)K^β(F(μ,γ^))|\displaystyle+\beta\left|K_{\beta}\left(F(\mu,\hat{\gamma}),\hat{g}\right)-\hat{K}_{\beta}\left(F(\mu,\hat{\gamma})\right)\right|
+β|K^β(F(μ,γ^))Kβ(F(μ,γ^))|\displaystyle+\beta\left|\hat{K}_{\beta}\left(F(\mu,\hat{\gamma})\right)-{K}^{*}_{\beta}\left(F(\mu,\hat{\gamma})\right)\right|
+β|Kβ(F(μ,γ^))Kβ(F^(μ,γ^))|\displaystyle+\beta\left|{K}^{*}_{\beta}\left(F(\mu,\hat{\gamma})\right)-K_{\beta}^{*}(\hat{F}(\mu,\hat{\gamma}))\right|
+β|Kβ(F^(μ,γ^))K^β(F^(μ,γ^))|\displaystyle+\beta\left|K_{\beta}^{*}(\hat{F}(\mu,\hat{\gamma}))-\hat{K}_{\beta}\left(\hat{F}(\mu,\hat{\gamma})\right)\right|

We note that |k(μ,γ^)k^(μ,γ^)|λ\left|k(\mu,\hat{\gamma})-\hat{k}(\mu,\hat{\gamma})\right|\leq\lambda and F(μ,γ^)F^(μ,γ^)λ\|F(\mu,\hat{\gamma})-\hat{F}(\mu,\hat{\gamma})\|\leq\lambda. Using Lemma 1 for the third and the last terms above, we get

|Kβ(μ,g^)K^β(μ)|\displaystyle\left|K_{\beta}(\mu,\hat{g})-\hat{K}_{\beta}(\mu)\right| λ+βsupμ|Kβ(μ,g^)K^β(μ)|\displaystyle\leq\lambda+\beta\sup_{\mu}\left|K_{\beta}(\mu,\hat{g})-\hat{K}_{\beta}(\mu)\right|
+2λβ(βCβK+1(1β)(1βK))+βKβLipλ.\displaystyle+2\lambda\beta\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right)+\beta\|K_{\beta}^{*}\|_{Lip}\lambda.

Rearranging the terms and taking the supremum on the left hand side over μ𝒫(𝕏)\mu\in{\mathcal{P}}(\mathds{X}), and noting that KβLipC1βK\|K_{\beta}^{*}\|_{Lip}\leq\frac{C}{1-\beta K} we can then write

|Kβ(μ,g^)K^β(μ)|\displaystyle\left|K_{\beta}(\mu,\hat{g})-\hat{K}_{\beta}(\mu)\right| λ(1β)(1+2β(βCβK+1(1β)(1βK))+βC(1βK))\displaystyle\leq\frac{\lambda}{(1-\beta)}\left(1+2\beta\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right)+\frac{\beta C}{(1-\beta K)}\right)
=λ((1+β)(βCβK+1)(1β)2(1βK))\displaystyle=\lambda\left(\frac{(1+\beta)(\beta C-\beta K+1)}{(1-\beta)^{2}(1-\beta K)}\right)

Combining this bound and Lemma 1 with (33), we conclude the proof. ∎

4.1.2 Error Bounds for Open Loop Control

We assume that the agents calculate an optimal policy, say g^\hat{g}, for the incorrect model, and estimate the mean-field flow under the incorrect model with the policy g^\hat{g}. That is, at every time step tt, the agents use

g^(μ^t)=γ^(|xt,μ^t)\displaystyle\hat{g}(\hat{\mu}_{t})=\hat{\gamma}(\cdot|x_{t},\hat{\mu}_{t}) (34)

to select their control actions utu_{t} at time tt. Furthermore, μ^t\hat{\mu}_{t} is estimated with

μ^t+1()=𝒯^(|x,u,μ^t)γ^(du|x,μ^t)μ^t(dx)\displaystyle\hat{\mu}_{t+1}(\cdot)=\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx) (35)

where 𝒯^\hat{\mathcal{T}} is the learned and possibly incorrect model. We are then interested in the optimality gap given by

Kβ(μ0,g^)Kβ(μ0)\displaystyle K_{\beta}(\mu_{0},\hat{g})-K_{\beta}^{*}(\mu_{0})

where Kβ(μ0,g^)K_{\beta}(\mu_{0},\hat{g}) denotes the accumulated cost when the agents follow the open loop policy g^(μ^t)=γ^(|xt,μ^t)\hat{g}(\hat{\mu}_{t})=\hat{\gamma}(\cdot|x_{t},\hat{\mu}_{t}) at every time tt. We note that the distinction from the closed loop control is that μ^t\hat{\mu}_{t} is not observed but estimated using the model 𝒯^\hat{\mathcal{T}}.
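A sketch of the open-loop propagation (35): given array representations of the learned kernel (evaluated at the current estimate) and of the agent-level policy, the estimated mean-field term is pushed forward deterministically. The array layout below is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def propagate_mean_field(mu_hat, T_hat, gamma_hat):
    """One step of (35).

    mu_hat    : (n_x,) current estimate of the mean-field term
    T_hat     : (n_x, n_u, n_x), with T_hat[x, u, y] ~ T_hat(y | x, u, mu_hat)
    gamma_hat : (n_x, n_u), with gamma_hat[x, u] ~ gamma_hat(u | x, mu_hat)
    """
    n_x, n_u = gamma_hat.shape
    mu_next = np.zeros(n_x)
    for x in range(n_x):
        for u in range(n_u):
            mu_next += mu_hat[x] * gamma_hat[x, u] * T_hat[x, u, :]
    return mu_next
```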

Theorem 4.

Consider the open loop policy g^\hat{g} in (34) which is designed for an estimated model that satisfies (4.1) for the infinite population dynamics. Under Assumption 1, if βK<1\beta K<1,

Kβ(μ0,g^)Kβ(μ0)2λβ(CK)+1(1β)(1βK)\displaystyle K_{\beta}(\mu_{0},\hat{g})-K_{\beta}^{*}(\mu_{0})\leq 2\lambda\frac{\beta(C-K)+1}{(1-\beta)(1-\beta K)}

for any μ0𝒫(𝕏)\mu_{0}\in{\mathcal{P}}(\mathds{X}) where C=c+KcC=\|c\|_{\infty}+K_{c} and K=Kf+δTK=K_{f}+\delta_{T}.

Proof.

We start with the following upper-bound

Kβ(μ0,g^)Kβ(μ0)|Kβ(μ0,g^)K^β(μ0)|+|K^β(μ0)Kβ(μ0)|\displaystyle K_{\beta}(\mu_{0},\hat{g})-K_{\beta}^{*}(\mu_{0})\leq\left|K_{\beta}(\mu_{0},\hat{g})-\hat{K}_{\beta}(\mu_{0})\right|+\left|\hat{K}_{\beta}(\mu_{0})-K_{\beta}^{*}(\mu_{0})\right| (36)

We have an upper-bound for the second term by Lemma 1. We now focus on the first term:

|Kβ(μ0,g^)K^β(μ0)|t=0βt|k(μt,γ^t)k^(μ^t,γ^t)|\displaystyle\left|K_{\beta}(\mu_{0},\hat{g})-\hat{K}_{\beta}(\mu_{0})\right|\leq\sum_{t=0}^{\infty}\beta^{t}\left|k(\mu^{\prime}_{t},\hat{\gamma}_{t})-\hat{k}(\hat{\mu}_{t},\hat{\gamma}_{t})\right|

where we write γ^t:=γ^(|x,μ^t)\hat{\gamma}_{t}:=\hat{\gamma}(\cdot|x,\hat{\mu}_{t}), and μt\mu^{\prime}_{t} denotes the measure flow under the true dynamics with the incorrect policy γ^t\hat{\gamma}_{t}, that is

μt+1=F^(μt,γ^t):=𝒯(|x,u,μt)γ^(du|x,μ^t)μt(dx).\displaystyle\mu^{\prime}_{t+1}=\hat{F}(\mu^{\prime}_{t},\hat{\gamma}_{t}):=\int{\mathcal{T}}(\cdot|x,u,\mu^{\prime}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\mu^{\prime}_{t}(dx).

We next claim that

μtμ^tλn=0t1(δT+Kf)n.\displaystyle\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|\leq\lambda\sum_{n=0}^{t-1}(\delta_{T}+K_{f})^{n}.

We show this by induction. For t=1t=1, we have that

μ1μ^1\displaystyle\|\mu^{\prime}_{1}-\hat{\mu}_{1}\| =𝒯(|x,u,μ0)γ^(du|x,μ0)μ0(dx)𝒯^(|x,u,μ0)γ^(du|x,μ0)μ0(dx)\displaystyle=\left\|\int{\mathcal{T}}(\cdot|x,u,\mu_{0})\hat{\gamma}(du|x,{\mu}_{0})\mu_{0}(dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\mu_{0})\hat{\gamma}(du|x,{\mu}_{0})\mu_{0}(dx)\right\|
λ.\displaystyle\leq\lambda.

We now assume that the claim is true for tt:

μt+1μ^t+1\displaystyle\|\mu^{\prime}_{t+1}-\hat{\mu}_{t+1}\| =𝒯(|x,u,μt)γ^(du|x,μ^t)μt(dx)𝒯^(|x,u,μ^t)γ^(du|x,μ^t)μ^t(dx)\displaystyle=\bigg{\|}\int{\mathcal{T}}(\cdot|x,u,\mu^{\prime}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\mu^{\prime}_{t}(dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\bigg{\|}
𝒯(|x,u,μt)γ^(du|x,μ^t)μt(dx)𝒯(|x,u,μt)γ^(du|x,μ^t)μ^t(dx)\displaystyle\leq\bigg{\|}\int{\mathcal{T}}(\cdot|x,u,\mu^{\prime}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\mu^{\prime}_{t}(dx)-\int{\mathcal{T}}(\cdot|x,u,{\mu}^{\prime}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\bigg{\|}
+𝒯(|x,u,μt)γ^(du|x,μ^t)μ^t(dx)𝒯^(|x,u,μ^t)γ^(du|x,μ^t)μ^t(dx)\displaystyle\quad+\bigg{\|}\int{\mathcal{T}}(\cdot|x,u,\mu^{\prime}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\bigg{\|}
δTμtμ^t+supx,u𝒯(|x,u,μt)𝒯^(|x,u,μ^t)\displaystyle\leq\delta_{T}\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\sup_{x,u}\left\|{\mathcal{T}}(\cdot|x,u,\mu^{\prime}_{t})-\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\right\|
δTμtμ^t+supx,u𝒯(|x,u,μt)𝒯(|x,u,μ^t)\displaystyle\leq\delta_{T}\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\sup_{x,u}\left\|{\mathcal{T}}(\cdot|x,u,\mu^{\prime}_{t})-{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\right\|
+supx,u𝒯(|x,u,μ^t)𝒯^(|x,u,μ^t)\displaystyle\qquad\qquad\qquad\quad+\sup_{x,u}\left\|{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})-\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\right\|
(δT+Kf)μtμ^t+λ\displaystyle\leq(\delta_{T}+K_{f})\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\lambda
(δT+Kf)λn=0t1(δT+Kf)n+λ=λn=0t(δT+Kf)n.\displaystyle\leq(\delta_{T}+K_{f})\lambda\sum_{n=0}^{t-1}(\delta_{T}+K_{f})^{n}+\lambda=\lambda\sum_{n=0}^{t}(\delta_{T}+K_{f})^{n}.

where we used the induction hypothesis in the last inequality. We now return to:

|Kβ(μ0,g^)K^β(μ0)|t=0βt|k(μt,γ^t)k^(μ^t,γ^t)|.\displaystyle\left|K_{\beta}(\mu_{0},\hat{g})-\hat{K}_{\beta}(\mu_{0})\right|\leq\sum_{t=0}^{\infty}\beta^{t}\left|k(\mu^{\prime}_{t},\hat{\gamma}_{t})-\hat{k}(\hat{\mu}_{t},\hat{\gamma}_{t})\right|.

For the term inside the summation, we write

|k(μt,γ^t)k^(μ^t,γ^t)|\displaystyle\left|k(\mu^{\prime}_{t},\hat{\gamma}_{t})-\hat{k}(\hat{\mu}_{t},\hat{\gamma}_{t})\right| =|c(x,u,μt)γ^(du|x,μ^t)μt(dx)c^(x,u,μ^t)γ^(du|x,μ^t)μ^t(dx)|\displaystyle=\left|\int c(x,u,\mu_{t}^{\prime})\hat{\gamma}(du|x,\hat{\mu}_{t})\mu^{\prime}_{t}(dx)-\int\hat{c}(x,u,\hat{\mu}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\right|
|c(x,u,μt)γ^(du|x,μ^t)μt(dx)c(x,u,μt)γ^(du|x,μ^t)μ^t(dx)|\displaystyle\leq\left|\int c(x,u,\mu_{t}^{\prime})\hat{\gamma}(du|x,\hat{\mu}_{t})\mu^{\prime}_{t}(dx)-\int{c}(x,u,\mu^{\prime}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\right|
+|c(x,u,μt)γ^(du|x,μ^t)μ^t(dx)c^(x,u,μ^t)γ^(du|x,μ^t)μ^t(dx)|\displaystyle\quad+\left|\int c(x,u,\mu_{t}^{\prime})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)-\int\hat{c}(x,u,\hat{\mu}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\right|
cμtμ^t+supx,u|c(x,u,μt)c^(x,u,μ^t)|\displaystyle\leq\|c\|_{\infty}\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\sup_{x,u}\left|c(x,u,\mu^{\prime}_{t})-\hat{c}(x,u,\hat{\mu}_{t})\right|
cμtμ^t+supx,u|c(x,u,μt)c(x,u,μ^t)|\displaystyle\leq\|c\|_{\infty}\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\sup_{x,u}\left|c(x,u,\mu^{\prime}_{t})-c(x,u,\hat{\mu}_{t})\right|
+supx,u|c(x,u,μ^t)c^(x,u,μ^t)|\displaystyle\qquad\qquad+\sup_{x,u}\left|c(x,u,\hat{\mu}_{t})-\hat{c}(x,u,\hat{\mu}_{t})\right|
(c+Kc)μtμ^t+λ.\displaystyle\leq(\|c\|_{\infty}+K_{c})\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\lambda.

Using this bound, we finalize our argument. In the following, we write K:=(Kf+δT)K:=(K_{f}+\delta_{T}) and C:=(c+Kc)C:=(\|c\|_{\infty}+K_{c}) and conclude:

|Kβ(μ0,g^)K^β(μ0)|t=0βt|k(μt,γ^t)k^(μ^t,γ^t)|\displaystyle\left|K_{\beta}(\mu_{0},\hat{g})-\hat{K}_{\beta}(\mu_{0})\right|\leq\sum_{t=0}^{\infty}\beta^{t}\left|k(\mu^{\prime}_{t},\hat{\gamma}_{t})-\hat{k}(\hat{\mu}_{t},\hat{\gamma}_{t})\right|
Ct=0βtμtμ^t+λ1β\displaystyle\leq C\sum_{t=0}^{\infty}\beta^{t}\|\mu^{\prime}_{t}-\hat{\mu}_{t}\|+\frac{\lambda}{1-\beta}
Cλt=0βtn=0t1Kn+λ1β=Cλt=0βt1Kt1K+λ1β\displaystyle\leq C\lambda\sum_{t=0}^{\infty}\beta^{t}\sum_{n=0}^{t-1}K^{n}+\frac{\lambda}{1-\beta}=C\lambda\sum_{t=0}^{\infty}\beta^{t}\frac{1-K^{t}}{1-K}+\frac{\lambda}{1-\beta}
=Cλ(1β)(1K)Cλ(1K)(1βK)+λ1β\displaystyle=\frac{C\lambda}{(1-\beta)(1-K)}-\frac{C\lambda}{(1-K)(1-\beta K)}+\frac{\lambda}{1-\beta}
=Cλβ(1β)(1βK)+λ1β=λβ(CK)+1(1β)(1βK).\displaystyle=\frac{C\lambda\beta}{(1-\beta)(1-\beta K)}+\frac{\lambda}{1-\beta}=\lambda\frac{\beta(C-K)+1}{(1-\beta)(1-\beta K)}.

This is the bound for the first term in (36), combining this with the upper-bound on the second term in (36) by Lemma 1, we can complete the proof. ∎

Lemma 1.

Under Assumption 1, if βK<1\beta K<1

|K^β(μ0)Kβ(μ0)|λ(βCβK+1(1β)(1βK))\displaystyle\left|\hat{K}_{\beta}(\mu_{0})-K_{\beta}^{*}(\mu_{0})\right|\leq\lambda\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right)

for any initial distribution μ0𝒫(𝕏)\mu_{0}\in{\mathcal{P}}(\mathds{X}) where C=c+KcC=\|c\|_{\infty}+K_{c} and K=Kf+δTK=K_{f}+\delta_{T}.

Proof.

The proof can be found in the appendix B. ∎

4.2 Error Bounds for Finitely Many Agents

We introduce the following constant to denote the expected distance of an empirical measure to its true distribution:

MN:=supμ𝒫(𝕏)E[μNμ]\displaystyle M_{N}:=\sup_{\mu\in{\mathcal{P}}(\mathds{X})}E\left[\left\|\mu^{N}-\mu\right\|\right] (37)
M¯N=supμ𝒫(𝕏×𝕌)E[μNμ]\displaystyle\bar{M}_{N}=\sup_{\mu\in{\mathcal{P}}(\mathds{X}\times\mathds{U})}E\left[\left\|\mu^{N}-\mu\right\|\right] (38)

where μN\mu^{N} is the empirical measure of NN independent samples from the distribution μ\mu, and the expectation is with respect to the randomness over the realizations of μN\mu^{N}.

Remark 6.

We note that the constants can be bounded in terms of the population size NN. In particular, for the finite space 𝕏\mathds{X} and 𝕌\mathds{U}

MNKN\displaystyle M_{N}\leq\frac{K}{\sqrt{N}}

where K<K<\infty in general depends on the underlying space 𝕏\mathds{X} (or the space 𝕏×𝕌\mathds{X}\times\mathds{U} for M¯N\bar{M}_{N}). Furthermore, for continuous state and action spaces, e.g. for 𝕏d\mathds{X}\subset\mathds{R}^{d}, the empirical error term is in the order of O(N12d)O(N^{\frac{-1}{2d}}).
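A quick Monte Carlo illustration (not taken from the paper) of the O(1/√N) rate in Remark 6 on a four-point space; the distribution μ and the sample sizes are arbitrary choices, and the norm is taken to be the total variation distance (half the ℓ1 distance), as in Example 4.1.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.1, 0.2, 0.3, 0.4])                        # fixed distribution on a 4-point space
for N in [10, 100, 1000, 10000]:
    errs = []
    for _ in range(2000):
        counts = rng.multinomial(N, mu)                    # N i.i.d. samples from mu
        errs.append(0.5 * np.abs(counts / N - mu).sum())   # total variation distance
    print(N, np.mean(errs), np.mean(errs) * np.sqrt(N))    # last column is roughly constant
```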

4.2.1 Error Bounds for Open Loop Control

In this section, we will study the case where each agent in an NN-agent control system follows the open-loop control given by the solution of the infinite population control problem with mismatched model estimation. We summarize this for some agent ii as follows:

  • Collectively agree on a policy g^\hat{g} as in (34) according to the agreed-upon estimated model 𝒯^,c^\hat{\mathcal{T}},\hat{c} that satisfies (4.1). Note that this policy is an optimal policy for the infinite population dynamics under the estimated model.

  • Estimate the mean-field term μ^t\hat{\mu}_{t} at time tt according to (35) using the approximate model 𝒯^\hat{\mathcal{T}}

  • Find the randomized agent level policy γ^\hat{\gamma} using g^(μ^t)=γ^(|xti,μ^t)\hat{g}(\hat{\mu}_{t})=\hat{\gamma}(\cdot|x_{t}^{i},\hat{\mu}_{t})

  • Observe the local state xtix_{t}^{i}, and apply action utiγ^(|xti,μ^t)u_{t}^{i}\sim\hat{\gamma}(\cdot|x_{t}^{i},\hat{\mu}_{t}).

If every agent follows this policy, we have the following upper bound for the performance loss compared to the optimal value of the NN-population control problem.

Theorem 5.

Under Assumption 1, if each agent follows the steps summarized above, we then have that

KβN(μN,γ^)KβN,(μN)2λ(βCβK+1(1β)(1βK))+MN4βC(1β)(1βK).\displaystyle K_{\beta}^{N}(\mu^{N},\hat{\gamma})-K_{\beta}^{N,*}(\mu^{N})\leq 2\lambda\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right)+M_{N}\frac{4\beta C}{(1-\beta)(1-\beta K)}.

where C=(c+Kc)C=(\|c\|_{\infty}+K_{c}), and K=(Kf+δT)K=(K_{f}+\delta_{T}).

Proof.

For some μ^0=μ0=μ𝐱𝟎=μN\hat{\mu}_{0}=\mu_{0}=\mu_{\bf x_{0}}=\mu^{N}

KβN(μN,γ^)KβN,(μN)\displaystyle K_{\beta}^{N}(\mu^{N},\hat{\gamma})-K_{\beta}^{N,*}(\mu^{N})\leq |KβN(μN,γ^)K^β(μN)|+|K^β(μN)Kβ(μN)|\displaystyle\left|K_{\beta}^{N}(\mu^{N},\hat{\gamma})-\hat{K}^{*}_{\beta}(\mu^{N})\right|+\left|\hat{K}^{*}_{\beta}(\mu^{N})-K^{*}_{\beta}(\mu^{N})\right|
+|Kβ(μN)KβN,(μN)|.\displaystyle+\left|K^{*}_{\beta}(\mu^{N})-K_{\beta}^{N,*}(\mu^{N})\right|.

The second term above is bounded by Lemma 1, the last term is bounded by Lemma 3, finally for the first term we have

|KβN(μN,γ^)K^β(μN)|t=0βtE[|k(μ𝐱𝐭,γ^)k^(μ^t,γ^)|]\displaystyle\left|K_{\beta}^{N}(\mu^{N},\hat{\gamma})-\hat{K}^{*}_{\beta}(\mu^{N})\right|\leq\sum_{t=0}^{\infty}\beta^{t}E\left[\left|k(\mu_{\bf x_{t}},\hat{\gamma})-\hat{k}(\hat{\mu}_{t},\hat{\gamma})\right|\right]

For the term inside of the expectation, we have

|k(μ𝐱𝐭,γ^)k^(μ^t,γ^)|\displaystyle\left|k(\mu_{\bf x_{t}},\hat{\gamma})-\hat{k}(\hat{\mu}_{t},\hat{\gamma})\right|
=|c(x,u,μ𝐱𝐭)γ^(du|x,μ^t)μ𝐱𝐭(dx)c^(x,u,μ^t)γ^(du|x,μ^t)μ^t(dx)|\displaystyle=\left|\int c(x,u,\mu_{\bf x_{t}})\hat{\gamma}(du|x,\hat{\mu}_{t})\mu_{\bf x_{t}}(dx)-\int\hat{c}(x,u,\hat{\mu}_{t})\hat{\gamma}(du|x,\hat{\mu}_{t})\hat{\mu}_{t}(dx)\right|
λ+Cμ𝐱𝐭μ^t\displaystyle\leq\lambda+C\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\|

where C=(c+Kc)C=(\|c\|_{\infty}+K_{c}). We can then write

|KβN(μN,γ^)K^β(μN)|\displaystyle\left|K_{\beta}^{N}(\mu^{N},\hat{\gamma})-\hat{K}^{*}_{\beta}(\mu^{N})\right| t=0βtE[|k(μ𝐱𝐭,γ^)k^(μ^t,γ^)|]\displaystyle\leq\sum_{t=0}^{\infty}\beta^{t}E\left[\left|k(\mu_{\bf x_{t}},\hat{\gamma})-\hat{k}(\hat{\mu}_{t},\hat{\gamma})\right|\right]
t=0βt(λ+CE[μ𝐱𝐭μ^t])\displaystyle\leq\sum_{t=0}^{\infty}\beta^{t}\left(\lambda+CE\left[\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\|\right]\right)
λ1β+Ct=0βtn=0t1Kn(λ+2MN)\displaystyle\leq\frac{\lambda}{1-\beta}+C\sum_{t=0}^{\infty}\beta^{t}\sum_{n=0}^{t-1}K^{n}(\lambda+2M_{N})
=λ1β+βC(λ+2MN)(1β)(1βK)\displaystyle=\frac{\lambda}{1-\beta}+\frac{\beta C(\lambda+2M_{N})}{(1-\beta)(1-\beta K)}
=λ(βCβK+1(1β)(1βK))+MN2βC(1β)(1βK).\displaystyle=\lambda\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right)+M_{N}\frac{2\beta C}{(1-\beta)(1-\beta K)}.

where we have used Lemma 2 which is presented below. ∎

Lemma 2.

Let xtix_{t}^{i} be the state of the agent ii at time tt when each agent follows the open-loop policy γ^(|xti,μ^t)\hat{\gamma}(\cdot|x^{i}_{t},\hat{\mu}_{t}) in an NN-agent control dynamics. We denote by 𝐱t{\bf x}_{t} the vector of the states of NN agents at time tt. Under Assumption 1, we then have that

E[μ𝐱𝐭μ^t]n=0t1Kn(λ+2MN)\displaystyle E\left[\left\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\right\|\right]\leq\sum_{n=0}^{t-1}K^{n}(\lambda+2M_{N})

where K=Kf+δTK=K_{f}+\delta_{T}, and where the expectation is with respect to the random dynamics of the NN-player control system.

Proof.

The proof can be found in Appendix C. ∎

Lemma 3.

Under Assumption 1,

|KβN,(μN)Kβ(μN)|2βC(1β)(1βK)MN\displaystyle\left|K_{\beta}^{N,*}(\mu^{N})-K_{\beta}^{*}(\mu^{N})\right|\leq\frac{2\beta C}{(1-\beta)(1-\beta K)}M_{N}

where C=(c+Kc)C=(\|c\|_{\infty}+K_{c}) and K=(Kf+δT)K=(K_{f}+\delta_{T}), for any μN𝒫N(𝕏)𝒫(𝕏)\mu^{N}\in{\mathcal{P}}_{N}(\mathds{X})\subset{\mathcal{P}}(\mathds{X}), that is, for any μN\mu^{N} that can be realized as the empirical distribution of NN agents.

Proof.

The proof can be found in the appendix D. ∎

4.2.2 Error Bounds for Closed Loop Control

In this section, we will assume that the agents find and agree on an optimal policy for the control problem using the agreed-upon mismatched model c^,𝒯^\hat{c},\hat{\mathcal{T}} with the infinite population dynamics. However, unlike open-loop control, to execute this policy, they observe the empirical state distribution of the team of NN agents, say μtN\mu_{t}^{N} at time tt, and apply γ^(|x,μtN)\hat{\gamma}(\cdot|x,\mu_{t}^{N}). We summarize the application of this policy as follows:

  • Collectively agree on a policy g^\hat{g} as in (32) according to the agreed-upon estimated model 𝒯^,c^\hat{\mathcal{T}},\hat{c} that satisfies (4.1). Note that this policy is an optimal policy for the infinite population dynamics under the estimated model.

  • Observe the correct mean-field term μt{\mu}_{t}, i.e. the empirical state distribution μtN=μ𝐱𝐭\mu_{t}^{N}=\mu_{\bf x_{t}} of the NN agents.

  • Find the randomized agent level policy γ^\hat{\gamma} using g^(μt)=γ^(|xti,μt)\hat{g}({\mu}_{t})=\hat{\gamma}(\cdot|x_{t}^{i},{\mu}_{t})

  • Observe the local state xtix_{t}^{i}, and apply action utiγ^(|xti,μt)u_{t}^{i}\sim\hat{\gamma}(\cdot|x_{t}^{i},{\mu}_{t}).

We denote the incurred cost under this policy by KβN(μN,γ^)K_{\beta}^{N}(\mu^{N},\hat{\gamma}) for some initial state distribution μN\mu^{N}.

Theorem 6.

Under Assumption 1, if each agent follows the steps summarized above, we then have that

KβN(μN,γ^)KβN,(μN)λ2(βCβK+1)(1β)2(1βK)+MN4βC(1β)(1βK)\displaystyle K_{\beta}^{N}(\mu^{N},\hat{\gamma})-K_{\beta}^{N,*}(\mu^{N})\leq\lambda\frac{2(\beta C-\beta K+1)}{(1-\beta)^{2}(1-\beta K)}+M_{N}\frac{4\beta C}{(1-\beta)(1-\beta K)}

where K=(Kf+δT)K=(K_{f}+\delta_{T}) and C=(c+Kc)C=(\|c\|_{\infty}+K_{c}).

Proof.

The proof follows steps very similar to those of the results proved earlier. For \hat{\mu}_{0}=\mu_{0}=\mu_{\bf x_{0}}=\mu^{N}, we write

KβN(μN,γ^)KβN,(μN)\displaystyle K_{\beta}^{N}(\mu^{N},\hat{\gamma})-K_{\beta}^{N,*}(\mu^{N})\leq |KβN(μN,γ^)K^β(μN)|+|K^β(μN)Kβ(μN)|\displaystyle\left|K_{\beta}^{N}(\mu^{N},\hat{\gamma})-\hat{K}_{\beta}(\mu^{N})\right|+\left|\hat{K}_{\beta}(\mu^{N})-K^{*}_{\beta}(\mu^{N})\right|
+|Kβ(μN)KβN,(μN)|.\displaystyle+\left|K^{*}_{\beta}(\mu^{N})-K_{\beta}^{N,*}(\mu^{N})\right|. (39)

The second term above is bounded by Lemma 1, and the last term is bounded by Lemma 3. For the first term, we write the Bellman equations:

KβN(μN,γ^)=k(μN,γ^)+βKβN(μ1N,γ^)η(dμ1N|μN,γ^)\displaystyle K_{\beta}^{N}(\mu^{N},\hat{\gamma})=k(\mu^{N},\hat{\gamma})+\beta\int K_{\beta}^{N}(\mu_{1}^{N},\hat{\gamma})\eta(d\mu_{1}^{N}|\mu^{N},\hat{\gamma})
K^β(μN)=k^(μN,γ^)+βK^β(F^(μN,γ^)).\displaystyle\hat{K}_{\beta}(\mu^{N})=\hat{k}(\mu^{N},\hat{\gamma})+\beta\hat{K}_{\beta}\left(\hat{F}(\mu^{N},\hat{\gamma})\right).

We can then write

|KβN(μN,γ^)K^β(μN)|\displaystyle\left|K_{\beta}^{N}(\mu^{N},\hat{\gamma})-\hat{K}^{*}_{\beta}(\mu^{N})\right| |k(μN,γ^)k^(μN,γ^)|\displaystyle\leq\left|k(\mu^{N},\hat{\gamma})-\hat{k}(\mu^{N},\hat{\gamma})\right|
+β|KβN(μ1N,γ^)K^β(μ1N)|η(dμ1N|μN,γ^)\displaystyle+\beta\int\left|K_{\beta}^{N}(\mu_{1}^{N},\hat{\gamma})-\hat{K}_{\beta}(\mu_{1}^{N})\right|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\gamma})
+β|K^β(μ1N)K^β(F^(μN,γ^))|η(dμ1N|μN,γ^)\displaystyle+\beta\int\left|\hat{K}_{\beta}(\mu_{1}^{N})-\hat{K}_{\beta}\left(\hat{F}(\mu^{N},\hat{\gamma})\right)\right|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\gamma})
\displaystyle\leq\lambda+\beta\sup_{\mu}\left|K_{\beta}^{N}(\mu,\hat{\gamma})-\hat{K}_{\beta}(\mu)\right|
+2βλ(βCβK+1(1β)(1βK))\displaystyle+2\beta\lambda\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right)
+β|Kβ(μ1N)Kβ(F^(μN,γ^))|η(dμ1N|μN,γ^)\displaystyle+\beta\int\left|{K}^{*}_{\beta}(\mu_{1}^{N})-{K}^{*}_{\beta}\left(\hat{F}(\mu^{N},\hat{\gamma})\right)\right|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\gamma})

Using arguments almost identical to those used in the proofs of Lemma 2 and Lemma 3, we can bound the last term as

β|Kβ(μ1N)Kβ(F^(μN,γ^))|η(dμ1N|μN,γ^)\displaystyle\beta\int\left|{K}^{*}_{\beta}(\mu_{1}^{N})-{K}^{*}_{\beta}\left(\hat{F}(\mu^{N},\hat{\gamma})\right)\right|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\gamma})
βKβLip(MN+δTMN+λ)\displaystyle\leq\beta\|K_{\beta}^{*}\|_{Lip}\left(M_{N}+\delta_{T}M_{N}+\lambda\right)

Rearranging the terms and noting that \|K_{\beta}^{*}\|_{Lip}\leq\frac{C}{1-\beta K}, we can write that

supμ𝒫N(𝕏)|KβN(μ,γ^)K^β(μ)|MN2βC(1β)(1βK)+λ(1+β)(βCβK+1)(1β)2(1βK).\displaystyle\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{N}(\mu,\hat{\gamma})-\hat{K}_{\beta}(\mu)\right|\leq M_{N}\frac{2\beta C}{(1-\beta)(1-\beta K)}+\lambda\frac{(1+\beta)(\beta C-\beta K+1)}{(1-\beta)^{2}(1-\beta K)}.

Combining this bound with (39), one can show that

KβN(μN,γ^)KβN,(μN)λ2(βCβK+1)(1β)2(1βK)+MN4βC(1β)(1βK)\displaystyle K_{\beta}^{N}(\mu^{N},\hat{\gamma})-K_{\beta}^{N,*}(\mu^{N})\leq\lambda\frac{2(\beta C-\beta K+1)}{(1-\beta)^{2}(1-\beta K)}+M_{N}\frac{4\beta C}{(1-\beta)(1-\beta K)}

5 Numerical Study

We now present a numerical example to verify the results we have established in the earlier sections.

We consider a multi-agent taxi service model where each agent represents a taxi. The state and action spaces are binary, i.e. \mathds{X}=\mathds{U}=\{0,1\}. We assume that at any given time a given zone is either in a surge or in a non-surge mode. The state variable X_{t}^{i} represents the location of agent i:

  • Xti=0X_{t}^{i}=0 \rightarrow agent is in a surge zone (high demand)

  • Xti=1X_{t}^{i}=1 \rightarrow agent is in a non-surge zone (low demand)

The action variable represents the movement decisions:

  • U_{t}^{i}=0 \rightarrow the agent remains where it is

  • U_{t}^{i}=1 \rightarrow the agent relocates to another area.

The cost structure is defined as follows:

  • If an agent is in a non-surge zone (X_{t}^{i}=1), they incur a cost S due to lost earnings.

  • If an agent relocates (U_{t}^{i}=1), they incur a cost R for movement expenses.

  • Furthermore, to encourage a balanced distribution, we penalize deviations from the 40%-60% split by introducing a cost 10\times(\mu(0)-0.4)^{2}, where \mu(0) is the fraction of agents in surge zones.

For the dynamics, we assume that a non-surge area has a fixed probability 0.2 of becoming a surge area in the next time step. Furthermore, we assume that a surge area has probability 0.7\mu(0)+0.2 of becoming a non-surge area, so that the more drivers there are in surge areas (i.e., the higher \mu(0) is), the less likely a surge zone is to remain in surge mode (due to increased supply). This defines the transition probabilities as follows:

Pr(Xt+1i=1|Xti=0,Uti=0,μ)=0.7μ(0)+0.2\displaystyle Pr(X_{t+1}^{i}=1|X_{t}^{i}=0,U_{t}^{i}=0,\mu)=0.7*\mu(0)+0.2
Pr(Xt+1i=1|Xti=0,Uti=1,μ)=0.8\displaystyle Pr(X_{t+1}^{i}=1|X_{t}^{i}=0,U_{t}^{i}=1,\mu)=0.8
Pr(Xt+1i=1|Xti=1,Uti=0,μ)=0.8\displaystyle Pr(X_{t+1}^{i}=1|X_{t}^{i}=1,U_{t}^{i}=0,\mu)=0.8
Pr(Xt+1i=1|Xti=1,Uti=1,μ)=0.7μ(0)+0.2\displaystyle Pr(X_{t+1}^{i}=1|X_{t}^{i}=1,U_{t}^{i}=1,\mu)=0.7*\mu(0)+0.2

We set the parameters as R=1R=1, S=7S=7 and β=0.7\beta=0.7.
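
A minimal sketch of this cost and transition structure is given below; the function and variable names are ours and are introduced only for illustration.

import numpy as np

R, S, beta = 1.0, 7.0, 0.7

def cost(x, u, mu0):
    """Per-agent cost: lost earnings S in the non-surge zone (x = 1), a
    relocation cost R when moving (u = 1), and the shared population penalty
    10 * (mu(0) - 0.4)^2 for deviating from the 40%-60% split."""
    return S * (x == 1) + R * (u == 1) + 10.0 * (mu0 - 0.4) ** 2

def prob_next_nonsurge(x, u, mu0):
    """P(X_{t+1} = 1 | x, u, mu), matching the four cases listed above."""
    if (x == 0 and u == 0) or (x == 1 and u == 1):
        return 0.7 * mu0 + 0.2
    return 0.8

def step(states, actions, mu0, rng):
    """Sample next states for all N agents under the true dynamics."""
    p1 = np.array([prob_next_nonsurge(x, u, mu0) for x, u in zip(states, actions)])
    return (rng.random(len(states)) < p1).astype(int)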

Near optimality of learned models and infinite population approximations. Figure 1 shows the value function loss for different values of N, the number of agents in the system. We plot the loss under three settings:

  • The optimal policy for the infinite population model with perfect knowledge of transition dynamics and costs.

  • The estimated policy for the infinite population model, where the transition-cost function is learned using discretization basis functions based on the discretization of the measure space {\mathcal{P}}(\mathds{X}) into 6 subsets (see Section 3.3).

  • The estimated policy for the infinite population model, where the transition-cost function is learned using the following class of basis functions:

    ϕ(μ)=[1,μ(0),μ(0)2,μ(0)3,sin(μ(0)),cos(μ(0))].\displaystyle{\bf\phi}(\mu)=[1,\mu(0),\mu(0)^{2},\mu(0)^{3},\sin(\mu(0)),\cos(\mu(0))].

    Note that the cost and the transitions are perfectly linear under the basis functions [1,μ(0),μ(0)2][1,\mu(0),\mu(0)^{2}].
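
As an illustration of this third setting, the sketch below fits a linear model on these features with a batch least-squares solve. This is a simplified stand-in for the stochastic iteration analyzed in the paper; mu_samples and targets denote hypothetical data (sampled mean-field terms and the corresponding noisy cost or transition observations) collected during learning.

import numpy as np

def phi(mu0):
    """Basis functions phi(mu) = [1, mu(0), mu(0)^2, mu(0)^3, sin(mu(0)), cos(mu(0))]."""
    return np.array([1.0, mu0, mu0 ** 2, mu0 ** 3, np.sin(mu0), np.cos(mu0)])

def fit_linear_model(mu_samples, targets):
    """Least-squares fit of a target quantity (a cost or transition entry
    evaluated at sampled mean-field terms) onto the span of the basis functions."""
    Phi = np.vstack([phi(m) for m in mu_samples])
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta          # predicted value at mu is phi(mu) @ theta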

For the loss, we compare the value of the policy with the optimal value in an infinite population environment. Furthermore, we assume that the initial distribution is μ0=1/2δ0+1/2δ1\mu_{0}=1/2\delta_{0}+1/2\delta_{1}.

In the figure, we also plot a scaled 1N\frac{1}{\sqrt{N}} line which represents the decay rate of the empirical consistency term MNM_{N} defined in (37). As verified by the results, the loss in all cases decays at a rate similar to 1N\frac{1}{\sqrt{N}}.
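
This 1/\sqrt{N} behaviour can be checked directly. The sketch below, assuming that M_{N} in (37) bounds the expected distance between the empirical measure of N i.i.d. samples and the underlying distribution, estimates this gap by Monte Carlo for a binary state.

import numpy as np

rng = np.random.default_rng(0)

def empirical_gap(N, p0=0.5, reps=2000):
    """Monte Carlo estimate of E|mu_emp(0) - mu(0)| for N i.i.d. binary samples
    with P(X = 0) = p0; on a two-point space this is the Wasserstein-1 distance
    between the empirical and the true distribution."""
    frac0 = (rng.random((reps, N)) < p0).mean(axis=1)   # empirical mass of state 0
    return np.abs(frac0 - p0).mean()

for N in [10, 100, 1000, 10000]:
    print(N, empirical_gap(N))    # decays roughly like 1 / sqrt(N)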

We also observe that the policies for the learned model with polynomial basis functions perform as well as the policies under perfect model knowledge, which is expected as the model is perfectly linear for these basis functions.

For the learned model under discretization, there is a small performance gap, which is also expected since the model is not perfectly linear under discretization basis functions. Thus, the learned model does not perfectly match the true model under discretization.

Figure 1: Value comparison under different policies.

Lack of exploration without common randomness. Another significant observation from the earlier sections, concerning exploration, is also verified in this numerical study. In particular, when the agents perform learning individually without common randomness, the mean-field term tends to get stuck in certain regions. However, if the agents choose their actions based on common randomness, exploration becomes more efficient, as seen in Figure 2. In the right graph, the agents follow policies of the form \gamma^{i}(\cdot|x,w^{i}), where w^{i} is an i.i.d. noise term that is independent across the agents; this results in a deterministic flow of the mean-field term and hence in poor exploration. In the left graph, the agents follow exploration policies of the form \gamma(\cdot|x,w^{0}), where w^{0} is a common noise shared by all agents. As a result, the flow of the mean-field term becomes stochastic and better exploration is observed.
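
A minimal sketch of the two randomization schemes is given below, reusing the step dynamics sketched earlier in this section; the list of visited mean-field terms is the kind of trajectory reported in Figure 2.

import numpy as np

def explore(N, T, common_noise, step, rng):
    """Track the visited mean-field terms mu_t(0) when each agent's exploratory
    action is a fair coin flip: with common noise all agents share one coin
    (w^0), otherwise each agent flips its own coin (w^i)."""
    states = rng.integers(0, 2, size=N)
    visited = []
    for _ in range(T):
        mu0 = float(np.mean(states == 0))
        if common_noise:
            actions = np.full(N, rng.integers(0, 2))   # u_t^i = gamma(x_t^i, w_t^0)
        else:
            actions = rng.integers(0, 2, size=N)       # u_t^i = gamma^i(x_t^i, w_t^i)
        states = step(states, actions, mu0, rng)
        visited.append(float(np.mean(states == 0)))
    return visited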

Figure 2: Learned regions with and without common noise.

6 Conclusion

We have studied model-based learning methods for mean-field control problems using linear function approximations, focusing on both fully coordinated and independent learning approaches. We have observed that full decentralization is generally not possible, even when the agents agree on a common model. For the independent learning method, although the agents do not need to share their local state information for convergence, a certain level of coordination is inevitable, especially for the exploration phase of the control problem, which is carried out using a common noise process. For the learned models, we have provided an error analysis that accounts for two main sources of error: (i) the modeling mismatch due to linear function approximation, and (ii) the error arising from the infinite population approximation.

We have observed that exploration is a key challenge in the learning of mean-field control systems. The analysis in the paper suggests that the stochastic controllability of mean-field systems is closely related to the exploration problem. A natural future direction is therefore a further analysis of the controllability and exploration properties of mean-field control systems.

Appendix A Proof of Proposition 1

Step 1. We first show that E[vtv2]E\left[\|v_{t}-v^{*}\|^{2}\right] remains uniformly bounded over tt. We write

\displaystyle\|v_{t+1}-v^{*}\|^{2}\leq\|v_{t}-v^{*}\|^{2}-2\alpha_{t}\langle\nabla g(s_{t},v_{t}),v_{t}-v^{*}\rangle+\alpha_{t}^{2}\|\nabla g(s_{t},v_{t})\|^{2} (40)

For E\left[\|\nabla g(s_{t},v_{t})\|^{2}\right] we have that

\displaystyle E\left[\|\nabla g(s_{t},v_{t})\|^{2}\right]=E\left[\left|2k(s_{t})(k(s_{t})^{\intercal}v_{t}-h(s_{t}))\right|^{2}\right]\leq KE\left[2\|k(s_{t})\|^{2}\|v_{t}\|^{2}+2\|h(s_{t})\|^{2}\right]
\displaystyle\leq KE\left[\|v_{t}\|^{2}+1\right]\leq K\left(E\left[\|v_{t}-v^{*}\|^{2}\right]+1\right)

where the generic constant K<K<\infty may represent different values at different steps. Denoting by At:=E[vtv2]A_{t}:=E\left[\|v_{t}-v^{*}\|^{2}\right], if we take the expectation on both sides of (40) we can write

At+1\displaystyle A_{t+1} At2αtE[g(st,vt),vtv]+αt2KAt+αt2K\displaystyle\leq A_{t}-2\alpha_{t}E\left[\langle\nabla g(s_{t},v_{t}),v_{t}-v^{*}\rangle\right]+\alpha_{t}^{2}KA_{t}+\alpha_{t}^{2}K
At2αtE[g(st,vt)g(st,v)]+αt2KAt+αt2K\displaystyle\leq A_{t}-2\alpha_{t}E\left[g(s_{t},v_{t})-g(s_{t},v^{*})\right]+\alpha_{t}^{2}KA_{t}+\alpha_{t}^{2}K (41)

where in the last step we used the convexity of g(s_{t},\cdot) for every s_{t}. We now introduce random variables \hat{s}_{t} that are independent over t, with each \hat{s}_{t} distributed according to \pi(\cdot). For the middle term above we write

2αtE[g(st,vt)g(st,v)]=\displaystyle-2\alpha_{t}E\left[g(s_{t},v_{t})-g(s_{t},v^{*})\right]= 2αtE[(g(st,vt)g(s^t,vt))]\displaystyle-2\alpha_{t}E\left[\left(g(s_{t},v_{t})-g(\hat{s}_{t},v_{t})\right)\right]
2αtE[(g(s^t,vt)g(s^t,v))]\displaystyle-2\alpha_{t}E\left[\left(g(\hat{s}_{t},v_{t})-g(\hat{s}_{t},v^{*})\right)\right]
2αtE[(g(s^t,v)g(st,v))]\displaystyle-2\alpha_{t}E\left[\left(g(\hat{s}_{t},v^{*})-g(s_{t},v^{*})\right)\right] (42)

where the expectation is with respect to the independent coupling between st,s^ts_{t},\hat{s}_{t}.

We denote by

bt1=2αtE[(g(st,vt)g(s^t,vt))]\displaystyle b^{1}_{t}=-2\alpha_{t}E\left[\left(g(s_{t},v_{t})-g(\hat{s}_{t},v_{t})\right)\right]
bt2=2αtE[(g(s^t,vt)g(s^t,v))]\displaystyle b_{t}^{2}=-2\alpha_{t}E\left[\left(g(\hat{s}_{t},v_{t})-g(\hat{s}_{t},v^{*})\right)\right]
bt3=2αtE[(g(s^t,v)g(st,v))].\displaystyle b_{t}^{3}=-2\alpha_{t}E\left[\left(g(\hat{s}_{t},v^{*})-g(s_{t},v^{*})\right)\right].

For b_{t}^{1}, we consider its absolute value and write

|bt1|2αtE[|g(st,vt)g(s^t,vt)|]\displaystyle|b_{t}^{1}|\leq 2\alpha_{t}E\left[\left|g(s_{t},v_{t})-g(\hat{s}_{t},v_{t})\right|\right]
2αtE[|(k(st)vth(st))2(k(s^t)vth(s^t))2|]\displaystyle\leq 2\alpha_{t}E\left[\left|(k(s_{t})^{\intercal}v_{t}-h(s_{t}))^{2}-(k(\hat{s}_{t})^{\intercal}v_{t}-h(\hat{s}_{t}))^{2}\right|\right]
2αtE[|(k(st)k(s^t))vt||(k(st)+k(s^t))vth(st)h(s^t)|]\displaystyle\leq 2\alpha_{t}E\left[\left|\left(k(s_{t})^{\intercal}-k(\hat{s}_{t})^{\intercal}\right)v_{t}\right|\left|\left(k(s_{t})^{\intercal}+k(\hat{s}_{t})^{\intercal}\right)v_{t}-h(s_{t})-h(\hat{s}_{t})\right|\right]
2αtE[(2kk(st)k(s^t)vt2+2hk(st)k(s^t)vt)]\displaystyle\leq 2\alpha_{t}E\left[\left(2\|k\|_{\infty}\|k(s_{t})-k(\hat{s}_{t})\|\|v_{t}\|^{2}+2\|h\|_{\infty}\|k(s_{t})-k(\hat{s}_{t})\|\|v_{t}\|\right)\right]
2αtKE[k(st)k(s^t)]E[vt2]+2αtKE[Kk(st)k(s^t)]E[vt]\displaystyle\leq 2\alpha_{t}KE\left[\|k(s_{t})-k(\hat{s}_{t})\|\right]E\left[\|v_{t}\|^{2}\right]+2\alpha_{t}KE\left[K\|k(s_{t})-k(\hat{s}_{t})\|\right]E\left[\|v_{t}\|\right]
2αtKE[k(st)k(s^t)]E[vtv2]\displaystyle\leq 2\alpha_{t}KE\left[\|k(s_{t})-k(\hat{s}_{t})\|\right]E\left[\|v_{t}-v^{*}\|^{2}\right] (43)

where we use a generic constant K<\infty that may take different values at different steps. Furthermore, we used the inequality \|v_{t}\|^{2}\leq 2\|v_{t}-v^{*}\|^{2}+2\|v^{*}\|^{2}. We also assume that \|v_{t}\|\geq 1 so that \|v_{t}\|\leq\|v_{t}\|^{2}; note that this is without loss of generality, since we aim to show that E\|v_{t}-v^{*}\|^{2} is bounded, and for \|v_{t}\|\leq 1 the boundedness is immediate. For the following analysis, we denote

ϵt:=E[k(st)k(s^t)].\displaystyle\epsilon_{t}:=E\left[\|k(s_{t})-k(\hat{s}_{t})\|\right].

We now consider the series t=1αtϵt\sum_{t=1}^{\infty}\alpha_{t}\epsilon_{t}. Since sts_{t} is ergodic with a geometric rate with invariant measure π()\pi(\cdot) and s^tπ()\hat{s}_{t}\sim\pi(\cdot), we have that

t=1αtϵt<\displaystyle\sum_{t=1}^{\infty}\alpha_{t}\epsilon_{t}<\infty (44)

We now go back to (41):

At+1\displaystyle A_{t+1} At2αtE[g(st,vt)g(st,v)]+αt2KAt+αt2K\displaystyle\leq A_{t}-2\alpha_{t}E\left[g(s_{t},v_{t})-g(s_{t},v^{*})\right]+\alpha_{t}^{2}KA_{t}+\alpha_{t}^{2}K
At+|bt1|+bt2+bt3+αt2KAt\displaystyle\leq A_{t}+|b_{t}^{1}|+b_{t}^{2}+b_{t}^{3}+\alpha_{t}^{2}KA_{t}
At+2αtKϵtAt+2αtKϵt+bt2+bt3+αt2KAt+αt2K\displaystyle\leq A_{t}+2\alpha_{t}K\epsilon_{t}A_{t}+2\alpha_{t}K\epsilon_{t}+b_{t}^{2}+b_{t}^{3}+\alpha_{t}^{2}KA_{t}+\alpha_{t}^{2}K
(1+2αtKϵt+αt2K)At+bt2+bt3+αt2K.\displaystyle\leq(1+2\alpha_{t}K\epsilon_{t}+\alpha_{t}^{2}K)A_{t}+b_{t}^{2}+b_{t}^{3}+\alpha_{t}^{2}K.

For the following we denote by

ct=(1+2αtKϵt+αt2K)\displaystyle c_{t}=(1+2\alpha_{t}K\epsilon_{t}+\alpha_{t}^{2}K)

Note that one can show the infinite product t=1ct\prod_{t=1}^{\infty}c_{t} converges if and only if the sum

t=12αtKϵt+αt2K\displaystyle\sum_{t=1}^{\infty}2\alpha_{t}K\epsilon_{t}+\alpha_{t}^{2}K

converges. We have shown that the sum t=1αtϵt\sum_{t=1}^{\infty}\alpha_{t}\epsilon_{t} is convergent due to geometric ergodicity, and we also have that αt2\alpha_{t}^{2} is summable. Thus, we write

t=1ct<C\displaystyle\prod_{t=1}^{\infty}c_{t}<C

for some C<C<\infty. One can then iteratively show that

At+1\displaystyle A_{t+1} n=1tcnA0+Cn=1t(bn2+bn3+αn2K)\displaystyle\leq\prod_{n=1}^{t}c_{n}A_{0}+C\sum_{n=1}^{t}\left(b_{n}^{2}+b_{n}^{3}+\alpha_{n}^{2}K\right)
CA0+Cn=1t(bn2+bn3+αn2K).\displaystyle\leq CA_{0}+C\sum_{n=1}^{t}\left(b_{n}^{2}+b_{n}^{3}+\alpha_{n}^{2}K\right).

Consider b^{2}_{n}=-2\alpha_{n}E\left[g(\hat{s}_{n},v_{n})-g(\hat{s}_{n},v^{*})\right]; since \hat{s}_{t}\sim\pi(\cdot) for all t, and since v^{*}=\mathop{\rm arg\,min}_{v}G(v)=\mathop{\rm arg\,min}_{v}\int g(s,v)\pi(ds), we have b^{2}_{n}\leq 0 for all n. Thus, we can simply drop the b_{n}^{2} terms to get a further upper bound. For b_{n}^{3}, we have that

n=1|bn3|t=12αt|g(s^t,v)g(st,v)|<\displaystyle\sum_{n=1}^{\infty}|b_{n}^{3}|\leq\sum_{t=1}^{\infty}2\alpha_{t}\left|g(\hat{s}_{t},v^{*})-g(s_{t},v^{*})\right|<\infty

using an identical argument we used to show αtϵt<\sum\alpha_{t}\epsilon_{t}<\infty. In particular,

limtAtlimtCA0+Cn=1t(bn3+αn2K)<\displaystyle\lim_{t\to\infty}A_{t}\leq\lim_{t\to\infty}CA_{0}+C\sum_{n=1}^{t}\left(b_{n}^{3}+\alpha_{n}^{2}K\right)<\infty

which shows that Evtv2E\|v_{t}-v^{*}\|^{2} is bounded uniformly over tt, which also implies that Evt2E\|v_{t}\|^{2} is bounded.

Step 2. Now that we have the boundedness, we go back to (41); using the bound on A_{t} (only for the second A_{t} in (41)), and summing over the terms, we can write

AN+1A0t=1N(At+1At)t=1N2αtE[g(st,vt)g(st,v)]+t=1Nαt2K\displaystyle A_{N+1}-A_{0}\leq\sum_{t=1}^{N}\left(A_{t+1}-A_{t}\right)\leq\sum_{t=1}^{N}-2\alpha_{t}E\left[g(s_{t},v_{t})-g(s_{t},v^{*})\right]+\sum_{t=1}^{N}\alpha_{t}^{2}K

again using the boundedness of AtA_{t}, and the fact that t=1αt2<\sum_{t=1}^{\infty}\alpha_{t}^{2}<\infty and sending NN\to\infty, we get

E[t=12αt(g(st,vt)g(st,v))]<\displaystyle E\left[\sum_{t=1}^{\infty}2\alpha_{t}\left(g(s_{t},v_{t})-g(s_{t},v^{*})\right)\right]<\infty

We now introduce s^t\hat{s}_{t} which are independent over tt and each s^t\hat{s}_{t} is distributed according to π()\pi(\cdot). We then write

E[t=12αt(g(st,vt)g(s^t,vt))bt1]\displaystyle E\left[\sum_{t=1}^{\infty}\underbrace{2\alpha_{t}\left(g(s_{t},v_{t})-g(\hat{s}_{t},v_{t})\right)}_{b_{t}^{1}}\right]
+E[t=12αt(g(s^t,vt)g(s^t,v))bt2]\displaystyle+E\left[\sum_{t=1}^{\infty}\underbrace{2\alpha_{t}\left(g(\hat{s}_{t},v_{t})-g(\hat{s}_{t},v^{*})\right)}_{b_{t}^{2}}\right]
+E[t=12αt(g(s^t,v)g(st,v))bt3]<\displaystyle+E\left[\sum_{t=1}^{\infty}\underbrace{2\alpha_{t}\left(g(\hat{s}_{t},v^{*})-g(s_{t},v^{*})\right)}_{b_{t}^{3}}\right]<\infty (45)

where we overwrite the definitions of b_{t}^{1}, b_{t}^{2}, and b_{t}^{3} (only changing the signs of these terms, see (42)).

Recalling the analysis for b_{t}^{1} in (43), together with the uniform boundedness of A_{t}=E\left[\|v_{t}-v^{*}\|^{2}\right] over t, we can write that

E[t=1|bt1|]t=12αtKE[k(st)k(s^t)]<\displaystyle E\left[\sum_{t=1}^{\infty}\left|b_{t}^{1}\right|\right]\leq\sum_{t=1}^{\infty}2\alpha_{t}KE\left[\|k(s_{t})-k(\hat{s}_{t})\|\right]<\infty

where we exchange the sum and expectation with monotone convergence theorem, and where the last step follows from what we have shown in (44).

For the last term similarly, we have that E[t=1bt3]<E\left[\sum_{t=1}^{\infty}b_{t}^{3}\right]<\infty, from (44), since sts_{t} is geometrically ergodic with invariant measure π\pi and s^tπ()\hat{s}_{t}\sim\pi(\cdot) and vv^{*} is fixed.

Going back to (45), now that we have shown that the first and the last terms are finite, we can write

E[t=12αt(g(s^t,vt)g(s^t,v))]<.\displaystyle E\left[\sum_{t=1}^{\infty}2\alpha_{t}\left(g(\hat{s}_{t},v_{t})-g(\hat{s}_{t},v^{*})\right)\right]<\infty.

Since s^t\hat{s}_{t} is i.i.d and distributed according to π()\pi(\cdot), the above also implies that

E[t=12αt(G(vt)G(v))]<\displaystyle E\left[\sum_{t=1}^{\infty}2\alpha_{t}\left(G(v_{t})-G(v^{*})\right)\right]<\infty

which in turn implies that

t=12αt(G(vt)G(v))<\displaystyle\sum_{t=1}^{\infty}2\alpha_{t}\left(G(v_{t})-G(v^{*})\right)<\infty

almost surely. Furthermore, since αt\alpha_{t} is not summable, and (G(vt)G(v))0\left(G(v_{t})-G(v^{*})\right)\geq 0 (as vv^{*} achieves the minimum of G(v)G(v)), we must have that

G(vt)G(v), almost surely.\displaystyle G(v_{t})\to G(v^{*}),\text{ almost surely}.
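
The recursion analyzed above is the stochastic gradient iteration v_{t+1}=v_{t}-\alpha_{t}\nabla_{v}g(s_{t},v_{t}) for the squared loss g(s,v)=(k(s)^{\intercal}v-h(s))^{2} along an ergodic sample path. A minimal sketch is given below, assuming hypothetical callables k (the feature map), h (the target), and sample_next_state (one step of the ergodic chain).

import numpy as np

def sgd_linear_fit(k, h, sample_next_state, s0, d, T, rng):
    """Stochastic gradient iteration v_{t+1} = v_t - a_t * grad_v g(s_t, v_t)
    for g(s, v) = (k(s)^T v - h(s))^2, with samples drawn from an ergodic
    Markov chain; the step sizes a_t = 1/(t+1) are square-summable but not
    summable, matching the step-size conditions used in the proof above."""
    v, s = np.zeros(d), s0
    for t in range(T):
        alpha = 1.0 / (t + 1)
        grad = 2.0 * k(s) * (k(s) @ v - h(s))    # gradient of g(s, .) at v
        v = v - alpha * grad
        s = sample_next_state(s, rng)
    return v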

Appendix B Proof of Lemma 1

We begin the proof by writing the Bellman equations

K^β(μ)=k^(μ,γ^)+βK^β(F^(μ,γ^))\displaystyle\hat{K}_{\beta}(\mu)=\hat{k}(\mu,\hat{\gamma})+\beta\hat{K}_{\beta}(\hat{F}(\mu,\hat{\gamma}))
Kβ(μ)=k(μ,γ)+βKβ(F(μ,γ))\displaystyle K_{\beta}^{*}(\mu)=k(\mu,\gamma)+\beta K_{\beta}^{*}(F(\mu,\gamma))

where \hat{\gamma} and \gamma are optimal agent-level policies that achieve the minimum on the right-hand side of the respective Bellman equations. We can then use the same agent-level policies, exchanging them, to get the following upper bound:

|K^β(μ)Kβ(μ)||k^(μ,γ)k(μ,γ)|+β|K^β(F^(μ,γ))Kβ(F(μ,γ))|\displaystyle\left|\hat{K}_{\beta}(\mu)-K_{\beta}^{*}(\mu)\right|\leq|\hat{k}(\mu,\gamma)-k(\mu,\gamma)|+\beta\left|\hat{K}_{\beta}(\hat{F}(\mu,\gamma))-K_{\beta}^{*}(F(\mu,\gamma))\right|
|k^(μ,γ)k(μ,γ)|+β|K^β(F^(μ,γ))Kβ(F^(μ,γ))|+β|Kβ(F^(μ,γ))Kβ(F(μ,γ))|\displaystyle\leq|\hat{k}(\mu,\gamma)-k(\mu,\gamma)|+\beta\left|\hat{K}_{\beta}(\hat{F}(\mu,\gamma))-K_{\beta}^{*}(\hat{F}(\mu,\gamma))\right|+\beta\left|{K}^{*}_{\beta}(\hat{F}(\mu,\gamma))-K_{\beta}^{*}({F}(\mu,\gamma))\right|
λ+βsupμ|K^β(μ)Kβ(μ)|+βKβLipF^(μ,γ)F(μ,γ)\displaystyle\leq\lambda+\beta\sup_{\mu}\left|\hat{K}_{\beta}(\mu)-K_{\beta}^{*}(\mu)\right|+\beta\|K_{\beta}^{*}\|_{Lip}\|\hat{F}(\mu,\gamma)-F(\mu,\gamma)\| (46)

We have that

\displaystyle\left\|\hat{F}(\mu,\gamma)-F(\mu,\gamma)\right\|\leq\left\|\int\hat{\mathcal{T}}(\cdot|x,u,\mu)\gamma(du|x)\mu(dx)-\int\mathcal{T}(\cdot|x,u,\mu)\gamma(du|x)\mu(dx)\right\|
\displaystyle\leq\lambda.

Hence, by rearranging the terms in (46), we can write

|K^β(μ)Kβ(μ)|λ1β(1+βKβLip).\displaystyle\left|\hat{K}_{\beta}(\mu)-K_{\beta}^{*}(\mu)\right|\leq\frac{\lambda}{1-\beta}\left(1+\beta\|K_{\beta}\|_{Lip}\right).

Finally, a slight modification of [6, Lemma 6] for finite 𝕏,𝕌\mathds{X},\mathds{U} can be used to show that

KβLipC1βK\displaystyle\|K_{\beta}^{*}\|_{Lip}\leq\frac{C}{1-\beta K}

which completes the proof that

|K^β(μ)Kβ(μ)|λ(βCβK+1(1β)(1βK)).\displaystyle\left|\hat{K}_{\beta}(\mu)-K_{\beta}^{*}(\mu)\right|\leq\lambda\left(\frac{\beta C-\beta K+1}{(1-\beta)(1-\beta K)}\right).

Appendix C Proof of Lemma 2

We use the notation \mu_{t}=\mu_{\bf x_{t}} for the following analysis. Note that, by stochastic realization results, there exists a random variable v_{t} uniformly distributed on [0,1], and a measurable function \hat{\gamma}, such that

γ^(x,vt)\displaystyle\hat{\gamma}(x,v_{t})

has the same distribution as \hat{\gamma}(\cdot|x,\hat{\mu}_{t}), where we overwrite the notation for simplicity. Let {\bf\hat{x}}_{t} denote a vector of N state variables distributed according to \hat{\mu}_{t}, i.e. {\bf\hat{x}}_{t}=[\hat{x}^{1}_{t},\dots,\hat{x}^{N}_{t}] with \hat{x}_{t}^{i}\sim\hat{\mu}_{t} for all i\in\{1,\dots,N\}. Furthermore, let {\bf v}_{t} denote a vector of size N whose elements are independent and distributed according to the law of v_{t}. We then study the following conditional expected difference:

E[μ𝐱𝐭+𝟏μ^t+1]=E[E[μ𝐱𝐭+𝟏μ^t+1|𝐱t,𝐱^𝐭,𝐯𝐭]].\displaystyle E\left[\left\|\mu_{\bf x_{t+1}}-\hat{\mu}_{t+1}\right\|\right]=E\left[E\left[\left\|\mu_{\bf x_{t+1}}-\hat{\mu}_{t+1}\right\||{\bf x}_{t},{\bf\hat{x}_{t}},{\bf v_{t}}\right]\right].

Let 𝐰𝐭\bf w_{t} denote the vector of size NN for the noise variables of the agents at time tt. Note that we have 𝐱t+1=f(𝐱t,𝐮t,𝐰t){\bf x}_{t+1}=f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t}) where uti=γ^(xti,vti)u^{i}_{t}=\hat{\gamma}(x^{i}_{t},v^{i}_{t}) for each ii. We also introduce 𝐮^t\hat{\bf u}_{t} such that u^ti=γ^(x^ti,vti)\hat{u}^{i}_{t}=\hat{\gamma}(\hat{x}^{i}_{t},v_{t}^{i}).

We further introduce another vector of noise variables \hat{\bf w}_{t} whose elements are independently distributed, where the distribution of \hat{w}_{t} agrees with the kernel \hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t}). In other words, we use the functional representation of \hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t}), where

f^(x,u,μ^t,w^t)𝒯^(|x,u,μ^t)\displaystyle\hat{f}(x,u,\hat{\mu}_{t},\hat{w}_{t})\sim\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})

for some measurable f^\hat{f}.

We let {\bf P}(d{\bf w}_{t})=P(dw^{1}_{t})\times\dots\times P(dw_{t}^{N}) denote the distribution of the vector {\bf w}_{t}, where it is assumed that w_{t}^{i} and w_{t}^{j} are independent for all i\neq j. {\bf\hat{w}}_{t} is also distributed according to {\bf P}(\cdot). For the joint distribution of {\bf w_{t}},{\bf\hat{w}_{t}}, we use a coupling of the form

𝛀(d𝐰t,d𝐰^t)=Ω1(dwt1,dw^t1)××ΩN(dwtN,dw^tN).\displaystyle{\bf\Omega}(d{\bf w}_{t},d{\bf\hat{w}}_{t})=\Omega^{1}(dw_{t}^{1},d\hat{w}_{t}^{1})\times\dots\times\Omega^{N}(dw_{t}^{N},d\hat{w}_{t}^{N}).

That is, we assume independence over i\in\{1,\dots,N\}, while an arbitrary coupling is allowed between the distributions of w_{t}^{i} and \hat{w}_{t}^{i}. We will later specify the particular selection of coordinate-wise couplings \Omega^{1},\dots,\Omega^{N}; the following analysis holds for any such selection.

For given realizations of 𝐱t,𝐱^𝐭,𝐯𝐭{\bf x}_{t},{\bf\hat{x}_{t}},{\bf v_{t}}, we write

E[μ𝐱𝐭+𝟏μ^t+1|𝐱t,𝐱^𝐭,𝐯𝐭]=μf(𝐱t,𝐮t,𝐰t)μ^t+1P(d𝐰t)\displaystyle E\left[\left\|\mu_{\bf x_{t+1}}-\hat{\mu}_{t+1}\right\||{\bf x}_{t},{\bf\hat{x}_{t}},{\bf v_{t}}\right]=\int\left\|\mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})}-\hat{\mu}_{t+1}\right\|P(d{\bf w}_{t})
=μf(𝐱t,𝐮t,𝐰t)μ^t+1𝛀(d𝐰t,d𝐰^t)\displaystyle=\int\left\|\mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})}-\hat{\mu}_{t+1}\right\|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})
μf(𝐱t,𝐮t,𝐰t)μf^(𝐱^t,𝐮^t,𝐰^t)𝛀(d𝐰t,d𝐰^t)+μf^(𝐱^t,𝐮^t,𝐰^t)μ^t+1𝛀(d𝐰t,d𝐰^t)\displaystyle\leq\int\left\|\mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})}-{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}\right\|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})+\int\left\|{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}-\hat{\mu}_{t+1}\right\|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t}) (47)

Note that \hat{\bf x}_{t} is a vector of size N whose entries are independent and distributed according to \hat{\mu}_{t}. Furthermore, \hat{u}^{i}_{t}=\hat{\gamma}(\hat{x}^{i}_{t},v^{i}_{t}), and thus \hat{u}^{i}_{t}\sim\hat{\gamma}(\cdot|\hat{x}^{i}_{t},\hat{\mu}_{t}) for each i\in\{1,\dots,N\}. Thus, {\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})} is an empirical measure for \hat{\mu}_{t+1}. For the second term above, we then have:

E[μf^(𝐱^t,𝐮^t,𝐰^t)μ^t+1𝛀(d𝐰t,d𝐰^t)]=E[μf^(𝐱^t,𝐮^t,𝐰^t)μ^t+1P(d𝐰^t)]\displaystyle E\left[\int\left\|{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}-\hat{\mu}_{t+1}\right\|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})\right]=E\left[\int\left\|{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}-\hat{\mu}_{t+1}\right\|P(d\hat{\bf w}_{t})\right]
=E[E[μ𝐱^t+1μ^t+1|𝐱^t,𝐮^t]]=E[μ𝐱^t+1μ^t+1]MN\displaystyle=E\left[E\left[\left\|\mu_{\hat{\bf x}_{t+1}}-\hat{\mu}_{t+1}\right\||\hat{\bf x}_{t},\hat{\bf u}_{t}\right]\right]=E\left[\|\mu_{\hat{\bf x}_{t+1}}-\hat{\mu}_{t+1}\|\right]\leq M_{N} (48)

see (37)(\ref{emp_bound}) for the definition of MNM_{N}.

For the first term in (47), we note that \mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})} and {\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})} are empirical measures, and thus for every given realization of {\bf w}_{t} and {\bf\hat{w}}_{t}, the Wasserstein distance is achieved by a particular permutation of the entries of f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t}) and \hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t}). That is, letting \sigma denote a permutation map for the vector \hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t}), we have

μf(𝐱t,𝐮t,𝐰t)μf^(𝐱^t,𝐮^t,𝐰^t)=infσ1Ni=1N|f(xti,uti,wti,μ𝐱𝐭)σ(f^(x^ti,u^ti,w^ti,μ^t))|.\displaystyle\left\|\mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})}-{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}\right\|=\inf_{\sigma}\frac{1}{N}\sum_{i=1}^{N}|f(x_{t}^{i},u_{t}^{i},w_{t}^{i},\mu_{\bf x_{t}})-\sigma(\hat{f}(\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{w}_{t}^{i},\hat{\mu}_{t}))|.

We will, however, consider the particular permutation for which

\displaystyle\left\|\int\mathcal{T}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{\bf(x_{t},u_{t})}(du,dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|
\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\|\mathcal{T}(\cdot|x_{t}^{i},u_{t}^{i},\mu_{{\bf x}_{t}})-\sigma(\hat{\mathcal{T}}(\cdot|\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{\mu}_{t}))\right\| (49)

For the following analysis, we drop the permutation notation \sigma and assume that the given order of \hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t}) achieves the Wasserstein distance in (49). Furthermore, the coupling {\bf\Omega} is assumed to use the same ordering for its coordinate-wise couplings.

We then write

μf(𝐱t,𝐮t,𝐰t)μf^(𝐱^t,𝐮^t,𝐰^t)𝛀(d𝐰t,d𝐰^t)\displaystyle\int\left\|\mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})}-{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}\right\|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})
1Ni=1N|f(xti,uti,wti,μ𝐱𝐭)f^(x^ti,u^ti,w^ti,μ^t)|𝛀(d𝐰t,d𝐰^t)\displaystyle\leq\int\frac{1}{N}\sum_{i=1}^{N}\left|f(x^{i}_{t},u^{i}_{t},w_{t}^{i},\mu_{\bf x_{t}})-\hat{f}(\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{w}_{t}^{i},\hat{\mu}_{t})\right|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})
=1Ni=1N|f(xti,uti,wti,μ𝐱𝐭)f^(x^ti,u^ti,w^ti,μ^t)|𝛀(d𝐰t,d𝐰^t)\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\int\left|f(x^{i}_{t},u^{i}_{t},w_{t}^{i},\mu_{\bf x_{t}})-\hat{f}(\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{w}_{t}^{i},\hat{\mu}_{t})\right|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})
=1Ni=1N|f(xti,uti,wti,μ𝐱𝐭)f^(x^ti,u^ti,w^ti,μ^t)|Ωi(dwti,dw^ti).\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\int\left|f(x^{i}_{t},u^{i}_{t},w_{t}^{i},\mu_{\bf x_{t}})-\hat{f}(\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{w}_{t}^{i},\hat{\mu}_{t})\right|{\Omega^{i}}(dw^{i}_{t},d\hat{w}^{i}_{t}). (50)

The analysis thus far works for any coupling {\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t}). In particular, the analysis holds for the coupling that satisfies

\displaystyle\|\mathcal{T}(\cdot|x^{i}_{t},u^{i}_{t},\mu_{\bf x_{t}})-\hat{\mathcal{T}}(\cdot|\hat{x}^{i}_{t},\hat{u}^{i}_{t},\hat{\mu}_{t})\|=\int\left|f(x^{i}_{t},u^{i}_{t},w_{t}^{i},\mu_{\bf x_{t}})-\hat{f}(\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{w}_{t}^{i},\hat{\mu}_{t})\right|{\Omega^{i}}(dw^{i}_{t},d\hat{w}^{i}_{t})

for every i, for some coordinate-wise coupling \Omega^{i}(dw_{t}^{i},d\hat{w}_{t}^{i}). Continuing from the term (50), we can then write

μf(𝐱t,𝐮t,𝐰t)μf^(𝐱^t,𝐮^t,𝐰^t)𝛀(d𝐰t,d𝐰^t)\displaystyle\int\left\|\mu_{f({\bf x}_{t},{\bf u}_{t},{\bf w}_{t})}-{\mu}_{\hat{f}(\hat{\bf x}_{t},\hat{\bf u}_{t},\hat{\bf w}_{t})}\right\|{\bf\Omega}(d{\bf w}_{t},d\hat{\bf w}_{t})
1Ni=1N|f(xti,uti,wti,μ𝐱𝐭)f^(x^ti,u^ti,w^ti,μ^t)|Ωi(dwti,dw^ti)\displaystyle\leq\frac{1}{N}\sum_{i=1}^{N}\int\left|f(x^{i}_{t},u^{i}_{t},w_{t}^{i},\mu_{\bf x_{t}})-\hat{f}(\hat{x}_{t}^{i},\hat{u}_{t}^{i},\hat{w}_{t}^{i},\hat{\mu}_{t})\right|{\Omega^{i}}(dw^{i}_{t},d\hat{w}^{i}_{t})
\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\|\mathcal{T}(\cdot|x^{i}_{t},u^{i}_{t},\mu_{\bf x_{t}})-\hat{\mathcal{T}}(\cdot|\hat{x}^{i}_{t},\hat{u}^{i}_{t},\hat{\mu}_{t})\right\|
=𝒯(|x,u,μ𝐱t)μ(𝐱𝐭,𝐮𝐭)(du,dx)𝒯^(|x,u,μ^t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)\displaystyle=\left\|\int\mathcal{T}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{\bf(x_{t},u_{t})}(du,dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|

where the last step follows from the particular permutation we consider (see (49)).

Furthermore, we also have that

𝒯(|x,u,μ𝐱t)μ(𝐱𝐭,𝐮𝐭)(du,dx)𝒯^(|x,u,μ^t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)\displaystyle\left\|\int\mathcal{T}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{\bf(x_{t},u_{t})}(du,dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|
𝒯(|x,u,μ𝐱t)μ(𝐱𝐭,𝐮𝐭)(du,dx)𝒯(|x,u,μ𝐱t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)\displaystyle\leq\left\|\int\mathcal{T}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{\bf(x_{t},u_{t})}(du,dx)-\int{\mathcal{T}}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|
+𝒯(|x,u,μ𝐱t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)𝒯(|x,u,μ^t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)\displaystyle+\left\|\int{\mathcal{T}}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)-\int{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|
+𝒯(|x,u,μ^t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)𝒯^(|x,u,μ^t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)\displaystyle+\left\|\int{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)-\int\hat{\mathcal{T}}(\cdot|x,u,\hat{\mu}_{t})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|
δTμ𝐱𝐭μ𝐱^𝐭+Kfμ𝐱𝐭μ^t+λ\displaystyle\leq\delta_{T}\|\mu_{\bf x_{t}}-\mu_{\bf\hat{x}_{t}}\|+K_{f}\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\|+\lambda (51)

where for the first term we use the following bound:

𝒯(|x,u,μ𝐱t)μ(𝐱𝐭,𝐮𝐭)(du,dx)𝒯(|x,u,μ𝐱t)μ(𝐱^𝐭,𝐮^𝐭)(du,dx)\displaystyle\left\|\int\mathcal{T}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{\bf(x_{t},u_{t})}(du,dx)-\int{\mathcal{T}}(\cdot|x,u,\mu_{{\bf x}_{t}})\mu_{(\bf\hat{x}_{t},\hat{u}_{t})}(du,dx)\right\|
=𝒯(|x,γ^(x,vi),μ𝐱t)μ𝐱𝐭(dx)𝒯(|x,γ^(x,vi),μ𝐱t)μ𝐱^𝐭(dx)\displaystyle=\left\|\int\mathcal{T}(\cdot|x,\hat{\gamma}(x,v^{i}),\mu_{{\bf x}_{t}})\mu_{\bf x_{t}}(dx)-\int{\mathcal{T}}(\cdot|x,\hat{\gamma}(x,v^{i}),\mu_{{\bf x}_{t}})\mu_{\bf\hat{x}_{t}}(dx)\right\|
δTμ𝐱𝐭μ𝐱^𝐭\displaystyle\leq\delta_{T}\|\mu_{\bf x_{t}}-\mu_{\bf\hat{x}_{t}}\|

Combining (47), (48), and (51), we can then write

E[μ𝐱𝐭+𝟏μ^t+1]MN+δTE[μ𝐱𝐭μ𝐱^𝐭]+KfE[μ𝐱𝐭μ^t]+λ\displaystyle E\left[\left\|\mu_{\bf x_{t+1}}-\hat{\mu}_{t+1}\right\|\right]\leq M_{N}+\delta_{T}E[\|\mu_{\bf x_{t}}-\mu_{\bf\hat{x}_{t}}\|]+K_{f}E[\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\|]+\lambda
(1+δT)MN+KE[μ𝐱𝐭μ^t]+λ\displaystyle\leq(1+\delta_{T})M_{N}+KE[\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\|]+\lambda

where K=(δT+Kf)K=(\delta_{T}+K_{f}). Noting that we have assumed μ𝐱𝟎=μ^0\mu_{\bf x_{0}}=\hat{\mu}_{0}, this bound implies that

E[μ𝐱𝐭μ^t]n=0t1Kn(λ+2MN)\displaystyle E\left[\left\|\mu_{\bf x_{t}}-\hat{\mu}_{t}\right\|\right]\leq\sum_{n=0}^{t-1}K^{n}(\lambda+2M_{N})

where we have used the fact that δT1\delta_{T}\leq 1 to simplify the notation.
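
The bound of Lemma 2 can also be checked empirically. The sketch below assumes hypothetical callables hat_policy(x, mu) (the agent-level policy \hat{\gamma}), hat_T(x, u, mu) (the estimated kernel \hat{\mathcal{T}}, returned as a vector of next-state probabilities), and true_step (the true N-agent dynamics), and records \|\mu_{\bf x_{t}}-\hat{\mu}_{t}\| along a simulated trajectory for a binary state space.

import numpy as np

def mean_field_gap(hat_policy, hat_T, true_step, N, T, rng):
    """Run N agents with the open-loop policy hat_gamma(.|x, hat_mu_t) under the
    true dynamics, propagate the deterministic flow hat_mu_{t+1} = hat_F(hat_mu_t,
    hat_gamma) under the estimated kernel, and record the gap at every step."""
    states = rng.integers(0, 2, size=N)
    hat_mu = np.bincount(states, minlength=2) / N      # hat_mu_0 = mu_{x_0}
    gaps = []
    for _ in range(T):
        mu_emp = np.bincount(states, minlength=2) / N  # empirical distribution mu_{x_t}
        gaps.append(float(np.abs(mu_emp - hat_mu).sum()))
        # agents randomize against the deterministic flow hat_mu_t, not mu_{x_t}
        actions = np.array([rng.choice(2, p=hat_policy(x, hat_mu)) for x in states])
        states = true_step(states, actions, mu_emp, rng)
        # hat_mu_{t+1}(.) = sum_{x,u} hat_T(.|x,u,hat_mu) hat_gamma(u|x,hat_mu) hat_mu(x)
        hat_mu = sum(hat_mu[x] * hat_policy(x, hat_mu)[u] * np.asarray(hat_T(x, u, hat_mu))
                     for x in range(2) for u in range(2))
    return gaps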

Appendix D Proof of Lemma 3

We start by writing the Bellman equations:

Kβ(μN)=k(μN,γ)+βKβ(F(μN,γ))\displaystyle K_{\beta}^{*}(\mu^{N})=k(\mu^{N},\gamma_{\infty})+\beta K_{\beta}^{*}(F(\mu^{N},\gamma_{\infty}))
KβN,(μN)=k(μN,ΘN)+βKβN,(μ1N)η(dμ1N|μN,ΘN)\displaystyle K_{\beta}^{N,*}(\mu^{N})=k(\mu^{N},\Theta^{N})+\beta\int K_{\beta}^{N,*}(\mu_{1}^{N})\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})

where we assume that an optimal selector for the infinite population problem at μN\mu^{N} is γ\gamma_{\infty} such that the agents should use the randomized agent-level policy γ(|x,μN)\gamma_{\infty}(\cdot|x,\mu^{N}). For the NN-agent problem, we assume that an optimal state-action distribution at μN\mu^{N} is given by some ΘN𝒫N(𝕏×𝕌)\Theta^{N}\in{\mathcal{P}}_{N}(\mathds{X}\times\mathds{U}), which can be achieved by some 𝐱,𝐮,{\bf x,u,} such that μ𝐱=μN\mu_{\bf x}=\mu^{N} and μ(𝐱,𝐮)=ΘN\mu_{(\bf x,u)}=\Theta^{N}.

We first assume that K_{\beta}^{*}(\mu^{N})>K_{\beta}^{N,*}(\mu^{N}). For the infinite population problem, instead of using the optimal selector \gamma_{\infty}, we use a randomized agent-level policy from the finite population problem by writing \Theta^{N}(du,dx)=\gamma^{N}(du|x)\mu^{N}(dx) and letting the agents use \gamma^{N}. We emphasize that the optimal state-action distribution for N agents is not achieved if each agent symmetrically uses \gamma^{N}(du|x); in other words, \gamma^{N} is not an optimal agent-level policy for the N-population problem, and the state-action distribution realized when each agent independently randomizes according to \gamma^{N} coincides with \Theta^{N} only in the limit as the number of agents tends to infinity. We can then write

Kβ(μN)KβN,(μN)Kβ(μN,γN)KβN,(μN)\displaystyle K_{\beta}^{*}(\mu^{N})-K_{\beta}^{N,*}(\mu^{N})\leq K_{\beta}(\mu^{N},\gamma^{N})-K_{\beta}^{N,*}(\mu^{N})
=k(μN,γN)k(μN,ΘN)+βKβ(F(μN,γN))βKβN,(μ1N)η(dμ1N|μN,ΘN)\displaystyle=k(\mu^{N},\gamma^{N})-k(\mu^{N},\Theta^{N})+\beta K_{\beta}^{*}\left(F(\mu^{N},\gamma^{N})\right)-\beta\int K_{\beta}^{N,*}(\mu_{1}^{N})\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})

Note that

k(μN,γN)=c(x,u,μN)γN(du|x)μN(dx)=c(x,u,μN)ΘN(du,dx)=k(μN,ΘN).\displaystyle k(\mu^{N},\gamma^{N})=\int c(x,u,\mu^{N})\gamma^{N}(du|x)\mu^{N}(dx)=\int c(x,u,\mu^{N})\Theta^{N}(du,dx)=k(\mu^{N},\Theta^{N}).

Hence, we can continue:

Kβ(μN)KβN,(μN)βKβ(F(μN,γN))βKβN,(μ1N)η(dμ1N|μN,ΘN)\displaystyle K_{\beta}^{*}(\mu^{N})-K_{\beta}^{N,*}(\mu^{N})\leq\beta K_{\beta}^{*}\left(F(\mu^{N},\gamma^{N})\right)-\beta\int K_{\beta}^{N,*}(\mu_{1}^{N})\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})
β|Kβ(F(μN,γN))Kβ(μ1N)|η(dμ1N|μN,ΘN)\displaystyle\leq\beta\int\left|K_{\beta}^{*}\left(F(\mu^{N},\gamma^{N})\right)-K_{\beta}^{*}(\mu_{1}^{N})\right|\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})
+β|Kβ(μ1N)KβN,(μ1N)|η(dμ1N|μN,ΘN)\displaystyle\quad+\beta\int\left|K_{\beta}^{*}(\mu_{1}^{N})-K_{\beta}^{N,*}(\mu_{1}^{N})\right|\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})
βKβLipF(μN,γN)μ1Nη(dμ1N|μN,ΘN)+βsupμ𝒫N(𝕏)|Kβ(μ)KβN,(μ)|.\displaystyle\leq\beta\|K_{\beta}^{*}\|_{Lip}\int\left\|F(\mu^{N},\gamma^{N})-\mu_{1}^{N}\right\|\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})+\beta\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)\right|. (52)

We now focus on the term \int\left\|F(\mu^{N},\gamma^{N})-\mu_{1}^{N}\right\|\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N}). We follow a methodology very similar to the one used in the proof of Lemma 2, with slight differences. We let {\bf P}(d{\bf w})=P(dw^{1})\times\dots\times P(dw^{N}) denote the distribution of the vector {\bf w}, where it is assumed that w^{i} and w^{j} are independent for all i\neq j. Let {\bf x,u} be such that \mu_{\bf x}=\mu^{N} and \mu_{\bf(x,u)}=\Theta^{N}. We then have that

F(μN,γN)μ1Nη(dμ1N|μN,ΘN)=F(μN,γN)μf(𝐱,𝐮,𝐰)𝐏(d𝐰)\displaystyle\int\left\|F(\mu^{N},\gamma^{N})-\mu_{1}^{N}\right\|\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})=\int\left\|F(\mu^{N},\gamma^{N})-\mu_{f(\bf x,u,w)}\right\|{\bf P}(d{\bf w})

where f({\bf x,u,w})=[f(x^{1},u^{1},w^{1},\mu^{N}),\dots,f(x^{N},u^{N},w^{N},\mu^{N})]. We now introduce (\hat{x}^{i},\hat{u}^{i})\sim\Theta^{N}(du,dx) for i\in\{1,\dots,N\}, which are different from the state-action vectors ({\bf x,u}); \mu_{\bf(\hat{x},\hat{u})} forms an empirical measure for \Theta^{N}, whereas \mu_{\bf\hat{x}} forms an empirical measure for \mu^{N}. We further introduce {\bf\hat{w}}=[\hat{w}^{1},\dots,\hat{w}^{N}]; {\bf\hat{w}} is also distributed according to {\bf P}(\cdot). For the joint distribution of {\bf w,\hat{w}}, we use a coupling of the form

𝛀(d𝐰,d𝐰^)=Ω1(dw1,dw^1)××ΩN(dwN,dw^N).\displaystyle{\bf\Omega}(d{\bf w},d{\bf\hat{w}})=\Omega^{1}(dw^{1},d\hat{w}^{1})\times\dots\times\Omega^{N}(dw^{N},d\hat{w}^{N}).

That is, we assume independence over i\in\{1,\dots,N\}, while an arbitrary coupling is allowed between the distributions of w^{i} and \hat{w}^{i}. We will later specify the particular selection of coordinate-wise couplings \Omega^{1},\dots,\Omega^{N}. We write

F(μN,γN)μf(𝐱,𝐮,𝐰)𝐏(d𝐰)\displaystyle\int\left\|F(\mu^{N},\gamma^{N})-\mu_{f(\bf x,u,w)}\right\|{\bf P}(d{\bf w})
E[F(μN,γN)μf(𝐱^,𝐮^,𝐰^)+μf(𝐱^,𝐮^,𝐰^)μf(𝐱,𝐮,𝐰)𝛀(d𝐰,d𝐰^)]\displaystyle\leq E\left[\int\left\|F(\mu^{N},\gamma^{N})-\mu_{f({\bf\hat{x},\hat{u},\hat{w}})}\right\|+\left\|\mu_{f({\bf\hat{x},\hat{u},\hat{w}})}-\mu_{f(\bf x,u,w)}\right\|{\bf\Omega}(d{\bf w},d{\bf\hat{w}})\right]

where the expectation is with respect to the random realizations of (\hat{x}^{i},\hat{u}^{i})\sim\Theta^{N}(du,dx). The first term corresponds to the expected distance between an empirical measure of \mu_{1}=F(\mu^{N},\gamma^{N}) and \mu_{1} itself, and is thus bounded by M_{N}.

For the second term, we note that \mu_{f({\bf x},{\bf u},{\bf w})} and {\mu}_{f(\hat{\bf x},\hat{\bf u},\hat{\bf w})} are empirical measures, and thus for every given realization of {\bf w} and {\bf\hat{w}}, the Wasserstein distance is achieved by a particular permutation of the entries of f({\bf x},{\bf u},{\bf w}) and {f}(\hat{\bf x},\hat{\bf u},\hat{\bf w}). That is, letting \sigma denote a permutation map for the vector {f}(\hat{\bf x},\hat{\bf u},\hat{\bf w}), we have

\displaystyle\left\|\mu_{f({\bf x},{\bf u},{\bf w})}-{\mu}_{{f}(\hat{\bf x},\hat{\bf u},\hat{\bf w})}\right\|=\inf_{\sigma}\frac{1}{N}\sum_{i=1}^{N}|f(x^{i},u^{i},w^{i},\mu^{N})-\sigma({f}(\hat{x}^{i},\hat{u}^{i},\hat{w}^{i},\mu^{N}))|.

We will, however, consider the particular permutation for which

\displaystyle\left\|\int\mathcal{T}(\cdot|x,u,\mu^{N})\mu_{\bf(x,u)}(du,dx)-\int{\mathcal{T}}(\cdot|x,u,{\mu^{N}})\mu_{(\bf\hat{x},\hat{u})}(du,dx)\right\|
\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\|\mathcal{T}(\cdot|x^{i},u^{i},\mu^{N})-\sigma({\mathcal{T}}(\cdot|\hat{x}^{i},\hat{u}^{i},{\mu}^{N}))\right\|

For the following analysis, we will drop the permutation notation σ\sigma and assume that the given order of f(𝐱^,𝐮^,𝐰^){f}(\hat{\bf x},\hat{\bf u},\hat{\bf w}) achieves the Wasserstein distance above. Furthermore, the coupling 𝛀{\bf\Omega} is assumed to have the same order of coordinate-wise coupling.

We then write

μf(𝐱,𝐮,𝐰)μf(𝐱^,𝐮^,𝐰^)𝛀(d𝐰,d𝐰^)\displaystyle\int\left\|\mu_{f({\bf x},{\bf u},{\bf w})}-{\mu}_{{f}(\hat{\bf x},\hat{\bf u},\hat{\bf w})}\right\|{\bf\Omega}(d{\bf w},d\hat{\bf w})
1Ni=1N|f(xi,ui,wi,μN)f(x^i,u^i,w^i,μN)|𝛀(d𝐰,d𝐰^)\displaystyle\leq\int\frac{1}{N}\sum_{i=1}^{N}\left|f(x^{i},u^{i},w^{i},\mu^{N})-{f}(\hat{x}^{i},\hat{u}^{i},\hat{w}^{i},{\mu^{N}})\right|{\bf\Omega}(d{\bf w},d\hat{\bf w})
=1Ni=1N|f(xi,ui,wi,μN)f(x^i,u^i,w^i,μN)|𝛀(d𝐰,d𝐰^)\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\int\left|f(x^{i},u^{i},w^{i},\mu^{N})-{f}(\hat{x}^{i},\hat{u}^{i},\hat{w}^{i},{\mu^{N}})\right|{\bf\Omega}(d{\bf w},d\hat{\bf w})
=1Ni=1N|f(xi,ui,wi,μN)f(x^i,u^i,w^i,μN)|Ωi(dwi,dw^i).\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\int\left|f(x^{i},u^{i},w^{i},\mu^{N})-{f}(\hat{x}^{i},\hat{u}^{i},\hat{w}^{i},{\mu}^{N})\right|{\Omega^{i}}(dw^{i},d\hat{w}^{i}).

The analysis thus far works for any coupling {\bf\Omega}(d{\bf w},d\hat{\bf w}). In particular, the analysis holds for the coupling that satisfies

\displaystyle\|\mathcal{T}(\cdot|x^{i},u^{i},\mu^{N})-{\mathcal{T}}(\cdot|\hat{x}^{i},\hat{u}^{i},{\mu}^{N})\|=\int\left|f(x^{i},u^{i},w^{i},\mu^{N})-{f}(\hat{x}^{i},\hat{u}^{i},\hat{w}^{i},{\mu}^{N})\right|{\Omega^{i}}(dw^{i},d\hat{w}^{i})

for every i, for some coordinate-wise coupling \Omega^{i}(dw^{i},d\hat{w}^{i}). We can then write

μf(𝐱,𝐮,𝐰)μf(𝐱^,𝐮^,𝐰^)𝛀(d𝐰,d𝐰^)\displaystyle\int\left\|\mu_{f({\bf x},{\bf u},{\bf w})}-{\mu}_{{f}(\hat{\bf x},\hat{\bf u},\hat{\bf w})}\right\|{\bf\Omega}(d{\bf w},d\hat{\bf w})
1Ni=1N|f(xi,ui,wi,μN)f(x^i,u^i,w^i,μN)|Ωi(dwi,dw^i)\displaystyle\leq\frac{1}{N}\sum_{i=1}^{N}\int\left|f(x^{i},u^{i},w^{i},\mu^{N})-{f}(\hat{x}^{i},\hat{u}^{i},\hat{w}^{i},{\mu^{N}})\right|{\Omega^{i}}(dw^{i},d\hat{w}^{i})
\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\|\mathcal{T}(\cdot|x^{i},u^{i},\mu^{N})-{\mathcal{T}}(\cdot|\hat{x}^{i},\hat{u}^{i},{\mu^{N}})\right\|
=𝒯(|x,u,μN)μ(𝐱,𝐮)(du,dx)𝒯(|x,u,μN)μ(𝐱^,𝐮^)(du,dx).\displaystyle=\left\|\int\mathcal{T}(\cdot|x,u,\mu^{N})\mu_{\bf(x,u)}(du,dx)-\int{\mathcal{T}}(\cdot|x,u,{\mu^{N}})\mu_{(\bf\hat{x},\hat{u})}(du,dx)\right\|.

We can then write that

F(μN,γN)μf(𝐱,𝐮,𝐰)𝐏(d𝐰)\displaystyle\int\left\|F(\mu^{N},\gamma^{N})-\mu_{f(\bf x,u,w)}\right\|{\bf P}(d{\bf w})
E[F(μN,γN)μf(𝐱^,𝐮^,𝐰^)+μf(𝐱^,𝐮^,𝐰^)μf(𝐱,𝐮,𝐰)𝛀(d𝐰,d𝐰^)]\displaystyle\leq E\left[\int\left\|F(\mu^{N},\gamma^{N})-\mu_{f({\bf\hat{x},\hat{u},\hat{w}})}\right\|+\left\|\mu_{f({\bf\hat{x},\hat{u},\hat{w}})}-\mu_{f(\bf x,u,w)}\right\|{\bf\Omega}(d{\bf w},d{\bf\hat{w}})\right]
MN+E[𝒯(|x,u,μN)μ(𝐱,𝐮)(du,dx)𝒯(|x,u,μN)μ(𝐱^,𝐮^)(du,dx).]\displaystyle\leq M_{N}+E\left[\left\|\int\mathcal{T}(\cdot|x,u,\mu^{N})\mu_{\bf(x,u)}(du,dx)-\int{\mathcal{T}}(\cdot|x,u,{\mu^{N}})\mu_{(\bf\hat{x},\hat{u})}(du,dx)\right\|.\right]
MN+E[δTμ(𝐱^,𝐮^)μ(𝐱,𝐮)]MN+δTM¯N\displaystyle\leq M_{N}+E\left[\delta_{T}\left\|\mu_{\bf(\hat{x},\hat{u})}-\mu_{(\bf x,u)}\right\|\right]\leq M_{N}+\delta_{T}\bar{M}_{N}

where in the last step we used the fact that μ(𝐱^,𝐮^)\mu_{\bf(\hat{x},\hat{u})} is an empirical measure for μ(𝐱,𝐮)=ΘN\mu_{(\bf x,u)}=\Theta^{N}.

We then conclude, for the term (52), that:

Kβ(μN)KβN,(μN)\displaystyle K_{\beta}^{*}(\mu^{N})-K_{\beta}^{N,*}(\mu^{N})
βKβLipF(μN,γN)μ1Nη(dμ1N|μN,ΘN)+βsupμ𝒫N(𝕏)|Kβ(μ)KβN,(μ)|\displaystyle\leq\beta\|K_{\beta}^{*}\|_{Lip}\int\left\|F(\mu^{N},\gamma^{N})-\mu_{1}^{N}\right\|\eta(d\mu_{1}^{N}|\mu^{N},\Theta^{N})+\beta\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)\right|
βKβLip(MN+δTM¯N)+βsupμ𝒫N(𝕏)|Kβ(μ)KβN,(μ)|\displaystyle\leq\beta\|K_{\beta}^{*}\|_{Lip}\left(M_{N}+\delta_{T}\bar{M}_{N}\right)+\beta\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)\right| (53)

We now assume that K_{\beta}^{*}(\mu^{N})<K_{\beta}^{N,*}(\mu^{N}). To get a similar upper bound, for the finite population problem we let the agents use the randomized policy \gamma_{\infty} that is optimal for the infinite population problem, instead of choosing actions that achieve \Theta^{N}, the optimal selection for the N-population problem at the state distribution \mu^{N}. Let {\bf x} be such that \mu_{\bf x}=\mu^{N}; we introduce {{\bf u}}=[{u}^{1},\dots,{u}^{N}] where {u}^{i}=\gamma_{\infty}(x^{i},v^{i}) for some i.i.d. v^{i}. Denoting \hat{\Theta}^{N}=\mu_{\bf(x,{u})} and following the steps leading to (52), we now write

KβN,(μN)Kβ(μN)βKβN,(μ1N)η(dμ1N|μN,Θ^N)βKβ(F(μN,γ))\displaystyle K_{\beta}^{N,*}(\mu^{N})-K_{\beta}^{*}(\mu^{N})\leq\beta\int K_{\beta}^{N,*}(\mu_{1}^{N})\eta(d\mu_{1}^{N}|\mu^{N},\hat{\Theta}^{N})-\beta K_{\beta}^{*}(F(\mu^{N},\gamma_{\infty}))
β|KβN,(μ1N)Kβ(μ1N)|η(dμ1N|μN,Θ^N)\displaystyle\leq\beta\int\left|K_{\beta}^{N,*}\left(\mu_{1}^{N}\right)-K_{\beta}^{*}(\mu_{1}^{N})\right|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\Theta}^{N})
+β|Kβ(μ1N)Kβ(F(μN,γ))|η(dμ1N|μN,Θ^N)\displaystyle\quad+\beta\int\left|K_{\beta}^{*}(\mu_{1}^{N})-K_{\beta}^{*}\left(F(\mu^{N},\gamma_{\infty})\right)\right|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\Theta}^{N})
βsupμ𝒫N(𝕏)|Kβ(μ)KβN,(μ)|+βKβLipF(μN,γ)μ1Nη(dμ1N|μN,Θ^N)\displaystyle\leq\beta\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)\right|+\beta\|K_{\beta}^{*}\|_{Lip}\int\left\|F(\mu^{N},\gamma_{\infty})-\mu_{1}^{N}\right\|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\Theta}^{N}) (54)

Following almost identical steps as the first case, one can show that

F(μN,γ)μ1Nη(dμ1N|μN,Θ^N)\displaystyle\int\left\|F(\mu^{N},\gamma_{\infty})-\mu_{1}^{N}\right\|\eta(d\mu_{1}^{N}|\mu^{N},\hat{\Theta}^{N})
MN+δTE[μ(𝐱^,𝐮^)μ(𝐱,𝐮)]\displaystyle\leq M_{N}+\delta_{T}E\left[\left\|\mu_{\bf(\hat{x},\hat{u})}-\mu_{(\bf x,u)}\right\|\right]

where \hat{x}^{i}\sim\mu^{N}, \mu_{\bf x}=\mu^{N}, u^{i}=\gamma_{\infty}(x^{i},v^{i}), and \hat{u}^{i}=\gamma_{\infty}(\hat{x}^{i},v^{i}), and the expectation above is with respect to the random selections of \hat{x}^{i} and v^{i}. Note that u^{i} and \hat{u}^{i} use the same randomization v^{i}; hence, averaging over the distribution of v^{i}, we can write that

E[μ(𝐱^,𝐮^)μ(𝐱,𝐮)]\displaystyle E\left[\left\|\mu_{\bf(\hat{x},\hat{u})}-\mu_{(\bf x,u)}\right\|\right]\leq E[γ(du|x)μ𝐱(dx)γ(du|x)μ𝐱^(dx)]\displaystyle E\left[\left\|\gamma_{\infty}(du|x)\mu_{\bf x}(dx)-\gamma_{\infty}(du|x)\mu_{\bf\hat{x}}(dx)\right\|\right]
E[μ𝐱μ𝐱^]MN.\displaystyle\leq E\left[\|\mu_{\bf x}-\mu_{\bf\hat{x}}\|\right]\leq M_{N}.

In particular, the bound (54) can then be concluded as:

\displaystyle K_{\beta}^{N,*}(\mu^{N})-K_{\beta}^{*}(\mu^{N})\leq\beta\|K_{\beta}^{*}\|_{Lip}\left(M_{N}+\delta_{T}M_{N}\right)+\beta\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)\right|. (55)

Thus, noting that M_{N}\leq\bar{M}_{N} and combining (53) and (55), we can write

|Kβ(μN)KβN,(μN)|βKβLip(M¯N+δTM¯N)+βsupμ𝒫N(𝕏)|Kβ(μ)KβN,(μ)|.\displaystyle|K_{\beta}^{*}(\mu^{N})-K_{\beta}^{N,*}(\mu^{N})|\leq\beta\|K_{\beta}^{*}\|_{Lip}\left(\bar{M}_{N}+\delta_{T}\bar{M}_{N}\right)+\beta\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}\left|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)\right|.

Rearranging the terms and taking the supremum on the left hand side over μN𝒫N(𝕏)\mu^{N}\in{\mathcal{P}}_{N}(\mathds{X}), we can write

\displaystyle\sup_{\mu\in{\mathcal{P}}_{N}(\mathds{X})}|K_{\beta}^{*}(\mu)-K_{\beta}^{N,*}(\mu)|\leq\frac{\beta\|K_{\beta}^{*}\|_{Lip}(1+\delta_{T})\bar{M}_{N}}{1-\beta}

which proves the result together with KβLipC1βK\|K_{\beta}^{*}\|_{Lip}\leq\frac{C}{1-\beta K} and δT1\delta_{T}\leq 1.

References

  • [1] Yves Achdou, Pierre Cardaliaguet, François Delarue, Alessio Porretta, Filippo Santambrogio, Yves Achdou, and Mathieu Laurière. Mean field games and applications: Numerical aspects. Mean Field Games: Cetraro, Italy 2019, pages 249–307, 2020.
  • [2] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, pages 1–29, 2022.
  • [3] Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified reinforcement q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, 34(2):217–271, 2022.
  • [4] Andrea Angiuli, Jean-Pierre Fouque, Mathieu Laurière, and Mengrui Zhang. Convergence of multi-scale reinforcement q-learning algorithms for mean field game and control problems. arXiv preprint arXiv:2312.06659, 2023.
  • [5] Nicole Bäuerle. Mean field Markov decision processes. Applied Mathematics & Optimization, 88(1):12, 2023.
  • [6] Erhan Bayraktar, Nicole Bauerle, and Ali Devran Kara. Finite approximations for mean field type multi-agent control and their near optimality, 2023.
  • [7] Erhan Bayraktar, Alekos Cecchin, and Prakash Chakraborty. Mean field control and finite agent approximation for regime-switching jump diffusions. Applied Mathematics & Optimization, 88(2):36, 2023.
  • [8] Erhan Bayraktar, Andrea Cosso, and Huyên Pham. Randomized dynamic programming principle and Feynman-Kac representation for optimal control of Mckean-Vlasov dynamics. Transactions of the American Mathematical Society, 370(3):2115–2160, 2018.
  • [9] Erhan Bayraktar and Ali Devran Kara. Infinite horizon average cost optimality criteria for mean-field control. SIAM Journal on Control and Optimization, 62(5):2776–2806, 2024.
  • [10] Erhan Bayraktar and Xin Zhang. Solvability of infinite horizon Mckean–Vlasov FBSDEs in mean field control problems and games. Applied Mathematics & Optimization, 87(1):13, 2023.
  • [11] Alain Bensoussan, Jens Frehse, and Phillip Yam. Mean field games and mean field type control theory, volume 101. Springer, 2013.
  • [12] René Carmona and François Delarue. Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization, 51(4):2705–2734, 2013.
  • [13] René Carmona and Mathieu Laurière. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games i: The ergodic case. SIAM Journal on Numerical Analysis, 59(3):1455–1485, 2021.
  • [14] René Carmona and Mathieu Laurière. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: Ii—the finite horizon case. The Annals of Applied Probability, 32(6):4065–4105, 2022.
  • [15] René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv preprint arXiv:1910.12802, 2019.
  • [16] D. Carvalho, F. S. Melo, and P. Santos. A new convergent variant of q-learning with linear function approximation. Advances in Neural Information Processing Systems, 33:19412–19421, 2020.
  • [17] Mao Fabrice Djete, Dylan Possamaï, and Xiaolu Tan. McKean–Vlasov optimal control: the dynamic programming principle. The Annals of Probability, 50(2):791–833, 2022.
  • [18] Romuald Elie, Julien Perolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the convergence of model free learning in mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7143–7150, 2020.
  • [19] Zuyue Fu, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. 2020.
  • [20] Maximilien Germain, Joseph Mikael, and Xavier Warin. Numerical resolution of Mckean-Vlasov FBSDEs using neural networks. Methodology and Computing in Applied Probability, pages 1–30, 2022.
  • [21] Diogo A Gomes and João Saúde. Mean field games models—a brief survey. Dynamic Games and Applications, 4(2):110–154, 2014.
  • [22] Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Mean-field controls with q-learning for cooperative marl: convergence and complexity analysis. SIAM Journal on Mathematics of Data Science, 3(4):1168–1196, 2021.
  • [23] Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Dynamic programming principles for mean-field controls with learning. Operations Research, 2023.
  • [24] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. Advances in Neural Information Processing Systems, 32, 2019.
  • [25] O. Hernandez-Lerma. Adaptive Markov control processes, volume 79. Springer Science & Business Media, 2012.
  • [26] Minyi Huang, Peter E Caines, and Roland P Malhamé. Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized epsilon -Nash equilibria. IEEE transactions on automatic control, 52(9):1560–1571, 2007.
  • [27] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on learning theory, pages 2137–2143. PMLR, 2020.
  • [28] Daniel Lacker. Limit theory for controlled McKean–Vlasov dynamics. SIAM Journal on Control and Optimization, 55(3):1641–1672, 2017.
  • [29] Mathieu Lauriere, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Pérolat, Romuald Elie, Olivier Pietquin, et al. Scalable deep reinforcement learning algorithms for mean field games. In International conference on machine learning, pages 12078–12095. PMLR, 2022.
  • [30] Mathieu Laurière and Olivier Pironneau. Dynamic programming for mean-field type control. Comptes Rendus Mathematique, 352(9):707–713, 2014.
  • [31] F. C. Melo, S. P. Meyn, and I. M. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, pages 664–671, 2008.
  • [32] Médéric Motte and Huyên Pham. Mean-field Markov decision processes with common noise and open-loop controls. The Annals of Applied Probability, 32(2):1421–1458, 2022.
  • [33] Médéric Motte and Huyên Pham. Quantitative propagation of chaos for mean field Markov decision process with common noise. Electronic Journal of Probability, 28:1–24, 2023.
  • [34] Barna Pásztor, Andreas Krause, and Ilija Bogunovic. Efficient model-based multi-agent mean-field reinforcement learning. Transactions on Machine Learning Research, 2023.
  • [35] Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199–13213, 2020.
  • [36] Huyên Pham and Xiaoli Wei. Dynamic programming for optimal control of stochastic McKean–Vlasov dynamics. SIAM Journal on Control and Optimization, 55(2):1069–1101, 2017.
  • [37] Lars Ruthotto, Stanley J Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020.
  • [38] Naci Saldi, Tamer Başar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.
  • [39] Naci Saldi, Tamer Başar, and Maxim Raginsky. Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Mathematics of Operations Research, 44(3):1006–1033, 2019.
  • [40] Jayakumar Subramanian and Aditya Mahajan. Reinforcement learning in stationary mean-field games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 251–259, 2019.
  • [41] C. Szepesvári and William D. Smart. Interpolation-based Q-learning. 2004.
  • [42] Hamidou Tembine, Quanyan Zhu, and Tamer Başar. Risk-sensitive mean-field games. IEEE Transactions on Automatic Control, 59(4):835–850, 2013.
  • [43] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of reinforcement learning and control, pages 321–384, 2021.