
Propagation of chaos in path spaces via information theory

Lei Li School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, 200240, P.R.China; Shanghai Artificial Intelligence Laboratory (leili2010@sjtu.edu.cn).    Yuelin Wang School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, 200240, P.R.China (sjtu_wyl@sjtu.edu.cn).    Yuliang Wang School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, 200240, P.R.China (YuliangWang_math@sjtu.edu.cn).
Abstract

Propagation of chaos for interacting particle systems has been an active research topic for decades. We propose an alternative approach to studying the mean-field limit of stochastic interacting particle systems via tools from information theory. In our framework, the propagation of chaos is reduced to the space of driving processes, which may have lower dimension. Indeed, after applying the data processing inequality, one only needs to estimate the difference between the drifts of the particle system and the mean-field McKean stochastic differential equation. This point is particularly useful in situations where the discrepancy between the driving processes is more apparent than that between the investigated processes. We take the second-order system, as well as other examples, to illustrate how our framework can be used. This approach allows us to focus on probability measures in path spaces for the driving processes, avoiding the usual hypocoercivity technique or taking the pseudo-inverse of the diffusion matrix, and it might be more stable for numerical computation. Our framework is different from current approaches in the literature and could provide new insight into the study of interacting particle systems.

keywords:
mean-field limit, interacting particle systems, relative entropy, data processing inequality, Girsanov theorem.
{AMS}

35Q70; 60J60; 82C22

1 Introduction

The interacting particle system, mostly built upon basic physical laws including Newton’s second law, has received growing attention in recent years in the study of both the natural and social sciences. Practical applications of such large-scale interacting particle systems include groups of birds [11], consensus clusters in opinion dynamics [41], chemotaxis of bacteria [23], etc. Despite its strong applicability, the theoretical analysis and practical computation for the interacting particle system are rather complicated, mainly because the particle number $N$ is very large in many practical settings. One classical strategy to reduce this complexity is to study instead the “mean-field” regime. The limiting partial differential equation (mean-field equation) is used to describe the behavior of the particle system as $N\rightarrow\infty$. This approximation allows one to obtain a one-body model instead of the original many-body one. For instance, Jeans proposed a mean-field equation to study galactic dynamics in 1915 [28]. Much work has been done in the past decades to study the mean-field behaviors of various kinds of interacting particle systems [15, 33, 39, 18, 43].

Here, let us take the second-order system as an example to explain the concepts of mean-field limit and propagation of chaos. The second-order system is described by Newton’s second law for $N$ point particles driven by 2-body interaction forces and Brownian motions, satisfying the following system of stochastic differential equations (SDEs):

\left\{\begin{aligned} dX_{i}(t)&=V_{i}(t)dt,\\ m\,dV_{i}(t)&=\frac{1}{N-1}\sum_{j:j\neq i}K\left(X_{i}(t)-X_{j}(t)\right)dt-\gamma V_{i}(t)dt+\sigma\cdot dW_{i}(t),\quad 1\leq i\leq N,\end{aligned}\right. \quad (1.1)

where $m$ and $\gamma$ denote the mass and the friction coefficient respectively, and $X_{i}(t),V_{i}(t)\in\mathbb{R}^{d}$. The processes $W_{i}(t)$ ($1\leq i\leq N$) are independent Brownian motions in $\mathbb{R}^{d^{\prime}}$, and $K:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is the interaction kernel. We assume that the initial data $\{(X_{i}(0),V_{i}(0))\}$ are i.i.d., drawn from some initial law $F_{0}^{N}$ independent of the Brownian motions. Denote $Z_{i}(t):=(X_{i}(t),V_{i}(t))$, and the corresponding joint law

F^{N}_{t}\left(z_{1},\cdots,z_{N}\right)=\operatorname{Law}\left(Z_{1}(t),\cdots,Z_{N}(t)\right)\in\mathcal{P}(\mathbb{R}^{2Nd}), \quad (1.2)

where $\mathcal{P}(\mathbb{R}^{2Nd})$ denotes the space of probability measures on $\mathbb{R}^{2Nd}$. Then, the evolution of the density $F_{t}^{N}$ satisfies a Liouville equation [16, 17]:

\partial_{t}F_{t}^{N}+\sum_{i=1}^{N}\nabla_{x_{i}}\cdot(v_{i}F_{t}^{N})+\frac{1}{m}\sum_{i=1}^{N}\nabla_{v_{i}}\cdot\left(\frac{1}{N-1}\sum_{j\neq i}K\left(x_{i}-x_{j}\right)F_{t}^{N}-\gamma v_{i}F_{t}^{N}\right)=\frac{1}{2m^{2}}\sum_{i=1}^{N}\nabla^{2}_{v_{i}}:(\Lambda F_{t}^{N}), \quad (1.3)

with $F_{t}^{N}|_{t=0}=F_{0}^{N}$. Note that the matrix $\Lambda$ is defined by $\Lambda:=\sigma\sigma^{T}$. Here, “$:$” denotes the Hilbert-Schmidt inner product, so that $\nabla^{2}_{v_{i}}:(\Lambda F_{t}^{N})=\sum_{j,k}\partial_{v_{j}v_{k}}^{2}(\Lambda_{jk}F_{t}^{N})$. As the particle number $N$ tends to infinity, the correlation between any two given particles through the weak interaction is expected to vanish. Hence, if two particles are initially independent, then they are expected to remain independent as $N\to\infty$ at any fixed time $t>0$. This is the so-called propagation of chaos. Due to this asymptotic independence, a fixed particle with position and velocity $\bar{Z}_{i}(t):=(\bar{X}_{i}(t),\bar{V}_{i}(t))$ is then expected to satisfy the following mean-field McKean SDE:

d\bar{X}(t)=\bar{V}(t)dt,\quad m\,d\bar{V}(t)=K{*}\bar{\rho}_{t}(\bar{X}(t))dt-\gamma\bar{V}(t)dt+\sigma\cdot dW(t), \quad (1.4)

where $\bar{F}_{t}\in\mathcal{P}(\mathbb{R}^{2d})$ is the law of $\bar{Z}(t):=(\bar{X}(t),\bar{V}(t))$, and $\bar{\rho}_{t}(x):=\int_{\mathbb{R}^{d}}\bar{F}_{t}(x,v)\,dv$ is its spatial marginal. The law $\bar{F}_{t}$ is then expected to satisfy the following mean-field kinetic Fokker-Planck equation [24, 25]:

\partial_{t}\bar{F}_{t}+\nabla_{x}\cdot(v\bar{F}_{t})+\frac{1}{m}\nabla_{v}\cdot\left(K{*}\bar{\rho}_{t}\bar{F}_{t}-\gamma v\bar{F}_{t}\right)=\frac{1}{2m^{2}}\nabla^{2}_{v}:(\Lambda\bar{F}_{t}),\quad\bar{F}_{t}|_{t=0}=\bar{F}_{0}. \quad (1.5)

Rigorous justification of this mean-field limit, or of the propagation of chaos, has thus become an active research topic.
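Before turning to rigorous justifications, the particle system (1.1) is straightforward to simulate directly; the sketch below uses an Euler-Maruyama discretization with hypothetical parameters and an arbitrary bounded kernel of our own choosing (neither comes from the paper).

```python
import numpy as np

# Hypothetical parameters for illustration only (not from the paper).
N, d, m, gamma, sigma, dt, steps = 50, 2, 1.0, 0.5, 0.3, 0.01, 100
rng = np.random.default_rng(0)

def K(r):
    # An arbitrary smooth, bounded interaction kernel K : R^d -> R^d.
    return -r / (1.0 + np.sum(r * r, axis=-1, keepdims=True))

X = rng.standard_normal((N, d))   # positions X_i(0), i.i.d. initial data
V = rng.standard_normal((N, d))   # velocities V_i(0)
for _ in range(steps):
    diff = X[:, None, :] - X[None, :, :]        # X_i - X_j, shape (N, N, d)
    force = K(diff)
    force[np.arange(N), np.arange(N)] = 0.0     # exclude the j = i term
    drift = force.sum(axis=1) / (N - 1)         # (1/(N-1)) sum_{j != i} K(X_i - X_j)
    X = X + V * dt
    V = V + (drift - gamma * V) * (dt / m) \
        + (sigma / m) * np.sqrt(dt) * rng.standard_normal((N, d))
```

The empirical measure of the resulting $(X_i,V_i)$ is what the mean-field law $\bar{F}_t$ is expected to approximate as $N$ grows.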

The prevalent method for analyzing mean-field limits is based on Dobrushin’s estimate, proposed in 1979 by Dobrushin [13] to study the stability of the mean-field characteristic flow in terms of Wasserstein distances. Dobrushin-type analysis has since become a classical tool for the mean-field limits of Vlasov-type equations. Based on it, one can prove the mean-field limit for the deterministic system on a finite time interval $[0,T]$ in terms of Wasserstein distances [3, 44, 18]. Another way is to compare the stochastic trajectories through certain coupling techniques. By considering trajectory controls, the mean-field limit for stochastic systems with Lipschitz kernel $K$ has been established [50, 19, 21].

Another class of methods compares the laws directly. A recently popular way to quantify chaos is the analysis of the relative entropy (also called the Kullback-Leibler divergence, or KL-divergence) between $F_{t}^{N:k}=\int_{(\mathbb{R}^{2d})^{N-k}}F_{t}^{N}\,dz_{k+1}\cdots dz_{N}$ and the $k$-fold tensor product of $\bar{F}_{t}$, $\bar{F}_{t}^{\otimes k}:=\prod_{i=1}^{k}\bar{F}_{t}(z_{i})$, for $1\leq k\leq N$. The analysis can also be performed on the laws on path space, with $F_{t}^{N:k}$ and $\bar{F}_{t}^{\otimes k}$ being their time marginals. Some early results in path space using the relative entropy were obtained in the last century (e.g. [2, 1]). For time-marginal distributions, Jabin et al. proved the propagation of chaos for Vlasov-type systems with an $\mathcal{O}(k/N)$ bound, assuming the interaction kernel $K$ is bounded, as well as the propagation of chaos for first-order systems with singular kernels [26]. For results in path space, Lacker obtained the propagation of chaos relying on Girsanov’s and Sanov’s theorems [30] and the BBGKY hierarchy [31, 32]. The approach in [31, 32] yields an $\mathcal{O}((k/N)^{2})$ bound on the relative entropy between the marginal law of $k$ particles and its limiting product measure. For singular $L^{p}$-interactions, Tomašević et al. used the partial Girsanov transform to derive the propagation of chaos in [27, 51]. Recently, Hao et al. further showed the strong convergence of the propagation of chaos with singular $L^{p}$-interactions in [22]. Also, based on Lacker’s approach, Cattiaux gave an $\mathcal{O}(k/N)$ estimate on the path space in [6], using the invariance of relative entropy under time reversal [5]. The results in [12] and [20] are uniform in time for the Coulomb and the Biot-Savart kernels, respectively. There is a vast literature on this topic, and we refer readers to the recent review articles [7, 8].

In this work, we propose to use information theory to study the propagation of chaos by comparing the discrepancy between the joint law of the particle system and that of the corresponding mean-field equation in terms of the KL-divergence, defined by

D_{KL}\left(P\|Q\right):=\begin{cases}\displaystyle\int_{E}\log\frac{dP}{dQ}\,dP, & P\ll Q,\\ +\infty, & \text{otherwise,}\end{cases} \quad (1.6)

where $P$ and $Q$ are two probability measures over some appropriate space $E$. In our framework, the propagation of chaos is reduced to the space of driving processes, which may have lower dimension. We will mainly take the second-order system as our example, which avoids the usual hypocoercivity technique or taking the pseudo-inverse of the diffusion matrix. We remark that the bounds under relative entropy for the second-order system can be obtained by a direct Girsanov transform if one takes the pseudo-inverse of the degenerate diffusion matrix, as mentioned in [31, Remark 4.5]. Nevertheless, we believe our approach is still of significance, as there is no degeneracy in the diffusion if we look at the measures in the space of driving processes, which could be more stable for numerical computation. We will also apply our framework to other illustrating examples.
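For intuition, the definition (1.6) can be checked numerically in the simplest setting; the sketch below (purely illustrative, with one-dimensional Gaussians of our own choosing for $P$ and $Q$) compares a Monte Carlo evaluation of $\int_{E}\log\frac{dP}{dQ}\,dP$ with the known closed form for Gaussians.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_gauss(mu1, s1, mu2, s2):
    # Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)).
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_monte_carlo(mu1, s1, mu2, s2, n=200_000):
    # Direct use of the definition: average log(dP/dQ) over samples from P.
    x = rng.normal(mu1, s1, n)
    log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1)
    log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2)
    return float(np.mean(log_p - log_q))

exact = kl_gauss(0.0, 1.0, 0.5, 1.2)
mc = kl_monte_carlo(0.0, 1.0, 0.5, 1.2)
```

The two values agree up to Monte Carlo error, and both are nonnegative, as the divergence must be.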

We focus on an estimate for the KL-divergence between the laws in path space, in particular $D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]})$. Here $F^{N}_{[0,T]}$ and $\bar{F}^{\otimes N}_{[0,T]}$ are probability distributions on the path space $\mathcal{X}:=C([0,T];\mathbb{R}^{2Nd})$ (for a fixed time interval $[0,T]$) corresponding to the SDE system (1.1) and to $N$ independent copies of (1.4), respectively. Denoting $\mathcal{Z}_{[0,T]}:=(Z_{1},\dots,Z_{N})_{[0,T]}$ and $\bar{\mathcal{Z}}_{[0,T]}:=(\bar{Z}_{1},\dots,\bar{Z}_{N})_{[0,T]}$ in the path space, the path measures satisfy $F^{N}_{[0,T]}=\mathcal{Z}_{[0,T]}\#\mathbb{P}$ and $\bar{F}^{\otimes N}_{[0,T]}=\bar{\mathcal{Z}}_{[0,T]}\#\mathbb{P}$ ($\mathbb{P}$ being the original probability measure under which $W$ is a Brownian motion). With this setting, $F_{t}^{N}$ is the time marginal of $F^{N}_{[0,T]}$, and $\bar{F}_{t}^{\otimes N}$ is the time marginal of $\bar{F}^{\otimes N}_{[0,T]}$. We then regard the mean-field McKean SDEs and the interacting particle system as the same dynamical system with different driving processes (input signals). Applying the data processing inequality, we can then work with probability measures in the space of the input signals instead of the space of the particles. The former space is sometimes easier to deal with than the latter, as one may avoid the degeneracy of the diffusion; moreover, the dimension could be lower. This is similar in spirit to the so-called latent space in machine learning [38]. We will also present applications of the framework to neural networks and numerical analysis to illustrate this point.

The rest of the paper is organized as follows. In Section 2, we present our main ideas. The result (Theorem 3.6) on the propagation of chaos for the second-order system in path space is shown in Section 3, for both bounded kernels (not necessarily smooth) and Lipschitz kernels (not necessarily bounded), together with the necessary assumptions and auxiliary lemmas. In Section 4, we provide two applications of our approach, to numerical analysis and to neural networks. Lastly, in Section 5, we discuss the reversed relative entropy and mass-independence.

2 The main idea of the new framework

In this section, taking the second-order system as an example, we present the main ideas without rigorous proofs. The rigorous mathematical setup, assumptions and proofs will be given in the next section.

For fixed $[0,T]$, let $\bar{F}_{[0,T]}$ be the law of the trajectories of the McKean SDE (1.4). Then the tensorized distribution $\bar{F}^{\otimes N}_{[0,T]}$ is the law of the trajectories of the following system:

d\bar{X}_{i}(t)=\bar{V}_{i}(t)dt,\quad m\,d\bar{V}_{i}(t)=K{*}\bar{\rho}_{t}(\bar{X}_{i}(t))dt-\gamma\bar{V}_{i}(t)dt+\sigma\cdot dW_{i}(t),\quad 1\leq i\leq N, \quad (2.7)

and the particles $\bar{Z}_{i}:=(\bar{X}_{i},\bar{V}_{i})$, $1\leq i\leq N$, are independent.

The key idea of this work is to rewrite (1.1) above as:

dX_{i}(t)=V_{i}(t)dt,\quad m\,dV_{i}(t)=K{*}\bar{\rho}_{t}(X_{i}(t))dt-\gamma V_{i}(t)dt+d\theta_{i}^{(1)}(t),\quad 1\leq i\leq N, \quad (2.8)

where the process $\theta_{i}^{(1)}(t)$ is defined by

\theta_{i}^{(1)}(t):=\int_{0}^{t}\left(\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))-K{*}\bar{\rho}_{s}(X_{i}(s))\right)ds+\sigma\cdot W_{i}(t)=\int_{0}^{t}b_{i}(s,X(s))\,ds+\sigma\cdot W_{i}(t). \quad (2.9)

Here,

b_{i}(s,x):=\frac{1}{N-1}\sum_{j:j\neq i}K(x_{i}-x_{j})-K*\bar{\rho}_{s}(x_{i}). \quad (2.10)

We also denote

\theta_{i}^{(2)}(t)=\sigma\cdot W_{i}(t). \quad (2.11)

Based on (2.8) and (2.7), formally, we write the generalized dynamics

d\hat{X}_{i}(t)=\hat{V}_{i}(t)dt,\quad m\,d\hat{V}_{i}(t)=K{*}\bar{\rho}_{t}(\hat{X}_{i}(t))dt-\gamma\hat{V}_{i}(t)dt+d\theta_{i}(t),\quad 1\leq i\leq N. \quad (2.12)

Here, $\theta:=(\theta_{1},\cdots,\theta_{N})$ is a driving process. In (2.8), the driving process is taken as $\theta^{(1)}$, while in (2.7) it is taken as the noise process $\theta^{(2)}$. For fixed initial data, as shown in (2.13), the driving process $\theta$ can be viewed as an input; then, through the equation (2.12), the particle trajectory is obtained as an output:

\text{driving process }\theta\ \xrightarrow{\ (2.12)\ }\ \text{trajectory }(X,V) \quad (2.13)

From this perspective, a natural guess is that if there is only a slight difference between two driving processes, then the difference between the outputs might not be large. Luckily, if the mean-field McKean SDE (1.4) has pathwise uniqueness, the following well-known data processing inequality [9] helps to establish this intuition.

Lemma 2.1 (data processing inequality).

Consider a given conditional probability $P_{Y\mid X}$, and suppose that $Y$ is produced by $P_{Y\mid X}$ given $X$. If $P_{Y}$ is the distribution of $Y$ when $X$ is generated by $P_{X}$, and $Q_{Y}$ is the distribution of $Y$ when $X$ is generated by $Q_{X}$, then for any convex function $f:\mathbb{R}^{+}\rightarrow\mathbb{R}$ satisfying $f(1)=0$ that is strictly convex at $x=1$, it holds that

D_{f}\left(P_{Y}\|Q_{Y}\right)\leq D_{f}\left(P_{X}\|Q_{X}\right), \quad (2.14)

where the $f$-divergence $D_{f}(\cdot\|\cdot)$ is defined by

D_{f}(P\|Q):=\begin{cases}\mathbb{E}_{Q}\left[f\left(\frac{dP}{dQ}\right)\right], & P\ll Q,\\ +\infty, & \text{otherwise}.\end{cases} \quad (2.15)
Remark 2.2.

Taking $f(x)=x\log x$, the $f$-divergence $D_{f}$ is the famous KL-divergence. In this paper, we focus on this special case.
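Lemma 2.1 is easy to verify numerically in the discrete setting; the sketch below (an illustration with made-up distributions and an arbitrary channel, all of our own choosing) pushes two input laws through the same stochastic channel $P_{Y\mid X}$ and checks that the KL-divergence, i.e. the case $f(x)=x\log x$, does not increase.

```python
import numpy as np

def kl(p, q):
    # Discrete KL-divergence; assumes q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two input laws P_X, Q_X on {0, 1, 2} (arbitrary illustrative numbers).
p_x = np.array([0.5, 0.3, 0.2])
q_x = np.array([0.2, 0.3, 0.5])

# A stochastic channel P_{Y|X}: row i is the conditional law of Y given X = i.
channel = np.array([[0.9, 0.1],
                    [0.5, 0.5],
                    [0.1, 0.9]])

p_y = p_x @ channel   # output law when X ~ P_X
q_y = q_x @ channel   # output law when X ~ Q_X
```

One finds `kl(p_y, q_y) <= kl(p_x, q_x)`, as (2.14) predicts; equality generically fails because the channel loses information.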

Remark 2.3.

The data processing inequality is also well known in probability and statistics (e.g. [31]); it states that $D_{KL}(\nu\circ g^{-1}\|\nu^{\prime}\circ g^{-1})\leq D_{KL}(\nu\|\nu^{\prime})$ for any probability measures $\nu,\,\nu^{\prime}$ on a common measurable space and any measurable function $g$ into another measurable space.

Now, by the data processing inequality, we can control the KL-divergence between the outputs by that between the inputs. In this respect, we transfer our problem from the trajectory space to the space of the driving process $\theta$. Precisely, we find that

D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]})\leq D_{KL}(Q^{1}\|Q^{2}),

where we recall that $F^{N}_{[0,T]}$ and $\bar{F}^{\otimes N}_{[0,T]}$ are the path measures introduced in Section 1, and we denote by $Q^{j}$ the path measure for

\theta^{(j)}:=(\theta_{1}^{(j)},\cdots,\theta_{N}^{(j)}).

To compute the latter relative entropy, we rewrite the equation for $\theta^{(1)}$ as

\theta_{i}^{(1)}(t)=\int_{0}^{t}b_{i}(s,X(s))\,ds+\sigma\cdot W_{i}(t)=:\int_{0}^{t}\tilde{b}_{i}(s,[\theta^{(1)}]_{[0,s]})\,ds+\sigma\cdot W_{i}(t). \quad (2.16)

Thus, $\theta^{(1)}$ satisfies an SDE in the space of the driving process, with a dimension smaller than that of $(X,V)$. By Girsanov’s transform, it holds that

D_{KL}(Q^{1}\|Q^{2})=-\mathbb{E}\log\frac{dQ^{2}}{dQ^{1}}[\theta^{(1)}]=\frac{1}{2}\mathbb{E}\sum_{i}\int_{0}^{T}\langle b_{i}(s,X(s)),(\sigma\sigma^{T})^{-1}b_{i}(s,X(s))\rangle\,ds. \quad (2.17)

Note that this reduction avoids the degeneracy of the diffusion coefficient. Though the degeneracy can be treated by using the pseudo-inverse as remarked in [31], such a reduction could be helpful for practical estimates using numerical computations. We will give more details in the next sections.
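The right-hand side of (2.17) is governed by the drift discrepancy $b_i$, the fluctuation of an empirical average of $K$ around its mean, which the law of large numbers suggests is of order $N^{-1/2}$ in $L^2$. The sketch below is purely illustrative: a one-dimensional bounded kernel and a standard Gaussian playing the role of $\bar\rho$, both our own choices, are used to estimate this mean-square discrepancy at a fixed point for growing $N$ and exhibit the $1/N$ decay.

```python
import numpy as np

rng = np.random.default_rng(2)

def K(r):
    # An arbitrary bounded scalar kernel (d = 1 for simplicity).
    return np.tanh(r)

def drift_gap_sq(N, trials=2000):
    # Estimate E|(1/(N-1)) sum_j K(x - X_j) - K*rho(x)|^2 at x = 0,
    # with X_j i.i.d. samples from rho = N(0, 1).
    big = rng.normal(size=1_000_000)
    conv = np.mean(K(0.0 - big))               # reference value of K*rho(0)
    samples = rng.normal(size=(trials, N - 1))
    emp = np.mean(K(0.0 - samples), axis=1)    # empirical drift, one per trial
    return float(np.mean((emp - conv) ** 2))

gaps = [drift_gap_sq(N) for N in (50, 200, 800)]  # decays roughly like 1/N
```

Plugging this $\mathcal{O}(1/N)$ scaling into (2.17), summed over $N$ particles, is what ultimately produces bounds that are uniform in $N$.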

Let us discuss the choice of the noise and dynamical system. One may be tempted to rewrite the mean-field McKean SDE into

d\bar{X}_{i}=\bar{V}_{i}dt,\quad m\,d\bar{V}_{i}=\frac{1}{N-1}\sum_{j:j\neq i}K(\bar{X}_{i}-\bar{X}_{j})dt-\gamma\bar{V}_{i}dt+d\eta_{i}^{(2)},\quad 1\leq i\leq N,

with

\eta_{i}^{(2)}(t):=\int_{0}^{t}\left(K{*}\bar{\rho}_{s}(\bar{X}_{i})-\frac{1}{N-1}\sum_{j:j\neq i}K(\bar{X}_{i}-\bar{X}_{j})\right)ds+\sigma\cdot W_{i}(t).

Then, the $N$-body interacting particle system is given by

dX_{i}=V_{i}dt,\quad m\,dV_{i}=\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}-X_{j})dt-\gamma V_{i}dt+d\eta_{i}^{(1)},\quad 1\leq i\leq N,

with $\eta_{i}^{(1)}(t):=\sigma\cdot W_{i}(t)$ ($1\leq i\leq N$).

The two systems are again the same dynamical system with different driving noises

\eta^{(j)}(\cdot):=(\eta_{1}^{(j)}(\cdot),\cdots,\eta_{N}^{(j)}(\cdot)).

At first glance, this formulation seems good, since the drift in $\eta^{(2)}$ involves only the solution to the mean-field McKean SDE; one may then hope to apply the law of large numbers. However, this is not the case. In fact, applying the data processing inequality, one has

D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]})\leq D_{KL}(\bar{Q}^{1}\|\bar{Q}^{2}),

where $\bar{Q}^{j}$ is the law of $\eta^{(j)}$. We consider

\eta_{i}^{(2)}:=-\int_{0}^{t}b_{i}(s,\bar{X}(s))\,ds+\sigma\cdot W_{i}(t)=-\int_{0}^{t}b_{i}(s,\pi_{s}\circ\hat{\Phi}_{s}(\eta^{(2)}))\,ds+\sigma\cdot W_{i}(t).

Here, the mapping $\hat{\Phi}_{s}:\eta\mapsto(X,V)$ is the solution map for the $N$-body interacting dynamical system, and $\pi_{s}f=f(s)$ is the time marginal. This is again an SDE in the space of the driving process. Then,

D_{KL}(\bar{Q}^{1}\|\bar{Q}^{2})=\mathbb{E}_{X\sim\bar{Q}^{1}}\left[-\log\frac{d\bar{Q}^{2}}{d\bar{Q}^{1}}(X)\right].

The point is that the Radon-Nikodym derivative is integrated against $\bar{Q}^{1}$. Girsanov’s transform then gives

\begin{aligned}\mathbb{E}_{X\sim\bar{Q}^{1}}\left[-\log\frac{d\bar{Q}^{2}}{d\bar{Q}^{1}}(X)\right]&=\sum_{i}\mathbb{E}\int_{0}^{t}\frac{1}{2}\langle b_{i}(s,\pi_{s}\circ\hat{\Phi}_{s}(\eta^{(1)})),\Lambda^{-1}b_{i}(s,\pi_{s}\circ\hat{\Phi}_{s}(\eta^{(1)}))\rangle\,ds\\&=\sum_{i}\mathbb{E}\int_{0}^{t}\frac{1}{2}\langle b_{i}(s,X(s)),\Lambda^{-1}b_{i}(s,X(s))\rangle\,ds,\end{aligned} \quad (2.18)

where the argument inside has changed from $\eta^{(2)}$ to $\eta^{(1)}$! The eventual result is the same as (2.17).

3 The application to the second order systems

In this section, we establish the propagation of chaos in path space for the second order systems using the framework of information theory, in particular the data processing inequality.

We first present our assumptions on the kernels and coefficients. The first set of assumptions requires that $K$ is bounded.

Assumption (bounded kernel).

  • (a)

    The kernel $K$ has a finite essential bound, namely, $\|K\|_{L^{\infty}(\mathbb{R}^{d})}<+\infty$.

  • (b)

    The matrix $\Lambda=\sigma\sigma^{T}$ is non-degenerate with minimum eigenvalue $\lambda>0$.

Remark 3.1.

In the main text, the matrix $\sigma$ is a constant matrix for notational convenience. However, a time- and state-dependent diffusion $\sigma(t,X_{i},V_{i})$ is allowed, as long as the spectrum of $\Lambda=\sigma\sigma^{T}$ is uniformly bounded above and away from zero and the well-posedness results in the following subsection are preserved. This is similar to [31, Remark 4.5].

The boundedness condition on the interaction kernel $K$ (condition (a) in the bounded-kernel assumption above) is sometimes too strong in practice. If we assume instead that the initial distribution has a fast-decaying tail, we can allow a Lipschitz kernel. In fact, we will alternatively assume the following:

Assumption (Lipschitz kernel).

  • (a)

    The initial space-marginal distribution of the McKean SDE (1.4) is sub-Gaussian, namely, there exists $C>0$ such that for any $a\geq 0$, $P(|\bar{X}_{1}(0)|>a)\leq 2\exp(-a^{2}/C^{2})$.

  • (b)

    The interaction kernel $K(\cdot)$ is $L_{K}$-Lipschitz, namely, $\forall x,y\in\mathbb{R}^{d}$, $|K(x)-K(y)|\leq L_{K}|x-y|$.

  • (c)

    The matrix $\Lambda=\sigma\sigma^{T}$ is non-degenerate with minimum eigenvalue $\lambda>0$.

3.1 The well-posedness of the mean field McKean SDE.

Under either the bounded-kernel or the Lipschitz-kernel assumption, we are able to establish the propagation of chaos using nearly the same method. As a first step, we consider the solution map of (1.4). For fixed initial data, we rewrite it as

\begin{split}&\hat{X}_{i}(t)=\hat{X}_{i}(0)+\int_{0}^{t}\hat{V}_{i}(s)\,ds,\\&m\hat{V}_{i}(t)=m\hat{V}_{i}(0)+\int_{0}^{t}K{*}\bar{\rho}_{s}(\hat{X}_{i}(s))\,ds-\gamma\int_{0}^{t}\hat{V}_{i}(s)\,ds+\hat{\theta}_{i}(t),\quad 1\leq i\leq N.\end{split} \quad (3.19)

We first have the following observation.

Lemma 3.2.

Suppose that either the bounded-kernel or the Lipschitz-kernel assumption holds. Then, the mean-field nonlinear kinetic Fokker-Planck equation (1.5) has a unique solution in $C([0,T];\mathcal{P}(\mathbb{R}^{2d}))$, where the topology is that of weak convergence of measures. Moreover, the solution is smooth for any $t>0$.

The result under the Lipschitz-kernel assumption is very standard, because the corresponding SDE system even has strong solutions. For the bounded-kernel case, well-posedness under some more general singular kernels has been established as well; one may refer to [29, 56, 22] for related discussion.

Once we have the well-posedness for the nonlinear Fokker-Planck equation, $K*\bar{\rho}_{t}$ is smooth for any $t>0$, and thus locally Lipschitz. Now, we take $t\mapsto\bar{\rho}_{t}$ as given. We conclude the following.

Lemma 3.3.

Suppose that either the bounded-kernel or the Lipschitz-kernel assumption holds. Then, the following integral equation has a unique continuous solution:

\begin{split}&X(t)=X_{0}+\int_{0}^{t}V(s)\,ds,\\&mV(t)=mV_{0}+\int_{0}^{t}K*\bar{\rho}_{s}(X(s))\,ds-\gamma\int_{0}^{t}V(s)\,ds+\eta(t),\end{split} \quad (3.20)

where $t\mapsto\eta(t)$ is a given continuous driving signal.

The uniqueness is relatively straightforward. In fact, any two continuous solutions on a given interval $[0,T]$ stay in a compact set. On this compact set, $K*\bar{\rho}_{t}$ is Lipschitz on $[\epsilon,T]$ for any $\epsilon>0$, and the integral on $[0,\epsilon]$ can be made arbitrarily small. The uniqueness can then be obtained by direct comparison. For the existence, one may consider the regularized equation where $\bar{\rho}_{t}$ is redefined to be $\bar{\rho}_{\epsilon}$ for $t\in[0,\epsilon]$. The resulting solutions $(X^{\epsilon}(t),V^{\epsilon}(t))$ can be shown to be uniformly bounded. Then, it is not hard to show that they are relatively compact in $C([0,T];\mathbb{R}^{d})$ by the Arzelà-Ascoli criterion, with any limit point being a solution of the integral equation.

With the above facts, the mean-field McKean SDE (1.4) actually has a unique strong solution. For a fixed time $t$, we may introduce the mapping

\Phi_{t}:\quad\hat{\theta}\mapsto\hat{\mathcal{Z}}:=(\hat{Z}_{1},\dots,\hat{Z}_{N}), \quad (3.21)

where $\hat{\theta}=(\hat{\theta}_{1},\dots,\hat{\theta}_{N})\in C([0,t];\mathbb{R}^{Nd})$ is a generic driving process, $\hat{Z}_{i}(\cdot):=(\hat{X}_{i}(\cdot),\hat{V}_{i}(\cdot))$, and $\hat{\mathcal{Z}}\in C([0,t];\mathbb{R}^{2Nd})$ is the solution of the dynamical system (3.19).

For fixed $t$, $\Phi_{t}$ only depends on $\hat{\theta}_{s}$ for $s\leq t$. If we change $t$, the solution processes will clearly agree on the common subinterval. Below, we will consider varying $t$, but we will not change the notation $\hat{\theta}$, for convenience. Moreover, the dependence on the initial data is not written out explicitly, for clarity. Consequently, recalling the definitions $\mathcal{Z}_{[0,T]}=(Z_{1},\dots,Z_{N})$, $\bar{\mathcal{Z}}_{[0,T]}=(\bar{Z}_{1},\dots,\bar{Z}_{N})$, with $Z_{i}(t)=(X_{i}(t),V_{i}(t))$ and $\bar{Z}_{i}(t)=(\bar{X}_{i}(t),\bar{V}_{i}(t))$, one has

\mathcal{Z}_{[0,T]}=\Phi_{T}(\theta^{(1)}_{[0,T]}),\quad\bar{\mathcal{Z}}_{[0,T]}=\Phi_{T}(\theta^{(2)}_{[0,T]}). \quad (3.22)

With the conditions above, we next establish the propagation of chaos result for distributions starting from a chaotic configuration (i.e., $F_{0}^{N}=\bar{F}_{0}^{\otimes N}$).

3.2 Propagation of chaos in path space and the corollaries.

We again note a fact from standard SDE theory.

Lemma 3.4.

Suppose that either the bounded-kernel or the Lipschitz-kernel assumption holds. Then the interacting particle system (1.1) has a weak solution that is unique in law.

The existence of a weak solution for bounded $K$ follows from a standard Girsanov transform (see e.g. [45, Theorem 8.6.5], [49, Theorem 27.1], [34, Theorem 2.1]). The uniqueness in law for bounded kernels is also standard, and one may refer to the discussion in [49, page 155, Chapter 4, Section 18].

The weak well-posedness of the SDE implies that the Liouville equation (1.3) has weak solutions. The uniqueness of the Liouville equation (1.3) can also be established under the bounded or Lipschitz assumption on $K$ (see e.g. [48]). It is straightforward to see that if the initial $F^{N}_{0}$ is symmetric, then $F^{N}$ is symmetric, due to the fact that $t\mapsto F^{N}_{t}(p(z))$ satisfies the same Liouville equation as $t\mapsto F^{N}_{t}(z)$, where $p(z)$ is an arbitrary permutation of $z\in(\mathbb{R}^{2d})^{N}$ (see, for instance, a similar argument in [42]). A similar argument also applies to the law in the path space. In fact, for any weak solution $Z$, it is not hard to see that $p(Z)$ is also a weak solution. Then, the uniqueness in law implies that the law in the path space is symmetric. This in fact arises from the exchangeability of the particle system.

Next, we have the following result under the Lipschitz-kernel assumption.

Lemma 3.5.

Suppose that the Lipschitz-kernel assumption holds. Then, the following statements hold.

  1.

    For any $t\in[0,T]$, the solution of the mean-field McKean SDE (1.4) is sub-Gaussian.

  2.

    The interaction kernel $K(\cdot)$ and the marginal distribution $\bar{\rho}_{t}$ of the McKean SDE (1.4) satisfy: there exists $C>0$ such that $\forall x,y\in\mathbb{R}^{d}$ and $t\in[0,T]$, $|K(x-y)-K{*}\bar{\rho}_{t}(x)|\leq C(1+|y|)$.

The first claim can be verified by estimating $\mathbb{E}\exp(c(|\bar{X}|^{2}+|\bar{V}|^{2}))$ via Itô’s formula. The second follows from the first-order moment bound for $\bar{X}(t)$, which holds under the Lipschitz-kernel assumption. Below, we present and prove the main result of this section.
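For instance, the second claim of Lemma 3.5 admits a one-line sketch under the Lipschitz assumption on $K$, using $\int\bar{\rho}_{t}(dz)=1$ and the first-order moment bound: for all $x,y\in\mathbb{R}^{d}$ and $t\in[0,T]$,

|K(x-y)-K{*}\bar{\rho}_{t}(x)|=\left|\int_{\mathbb{R}^{d}}\big(K(x-y)-K(x-z)\big)\,\bar{\rho}_{t}(dz)\right|\leq L_{K}\int_{\mathbb{R}^{d}}|y-z|\,\bar{\rho}_{t}(dz)\leq L_{K}\Big(|y|+\sup_{t\in[0,T]}\mathbb{E}|\bar{X}(t)|\Big)\leq C(1+|y|).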

Theorem 3.6.

For a fixed time interval $[0,T]$, assume that either the bounded-kernel or the Lipschitz-kernel assumption holds. Consider the path measure $F^{N}_{[0,T]}$ for the weak solution to the second-order system (1.1), with initial law $F^{N}_{0}=\bar{F}^{\otimes N}_{0}$. Then, there exists a constant $C$ such that

D_{KL}\left(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}\right)\leq Ce^{CT}. \quad (3.23)

Consequently, for $1\leq k\leq N$,

D_{KL}\left(F^{N:k}\|\bar{F}^{\otimes k}\right)\leq Ce^{CT}\frac{k}{N}. \quad (3.24)
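The passage from (3.23) to (3.24) can be sketched as follows (a standard argument, using the exchangeability of the path law): relative entropy with respect to a product reference measure is superadditive over blocks, so partitioning the $N$ particles into $\lfloor N/k\rfloor$ disjoint blocks of size $k$ gives

D_{KL}\left(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}\right)\geq\left\lfloor\frac{N}{k}\right\rfloor D_{KL}\left(F^{N:k}_{[0,T]}\|\bar{F}^{\otimes k}_{[0,T]}\right),

and since $\lfloor N/k\rfloor\geq N/(2k)$ for $1\leq k\leq N$, the $k$-particle bound follows with an adjusted constant; the bound for the time marginals is then a consequence of the data processing inequality.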
Proof 3.7.

Recall equations (2.7)-(2.11). Note that we consider the weak solution to (1.1); hence, the Brownian motions are not necessarily defined on the same space. However, since the McKean SDE has a strong solution, we may, without loss of generality, take the Brownian motions in (2.7) to be the ones used for the weak solutions of (1.1), without altering the laws.

The corresponding driving processes in the path space are

\theta^{(j)}_{[0,T]}:=\left(\theta^{(j)}_{1}(\cdot),\dots,\theta^{(j)}_{N}(\cdot)\right)\in C([0,T];\mathbb{R}^{Nd})\text{ for }j=1,2.

Let $F^{N}_{[0,T]}(\cdot|z)$ denote the law of $\mathcal{Z}_{[0,T]}=(Z_{1},\cdots,Z_{N})$ (recall that $Z_{i}=(X_{i},V_{i})$) with initial data $\mathcal{Z}(0)=z\in\mathbb{R}^{2Nd}$, and let $\bar{F}^{N}_{[0,T]}(\cdot|z)$ be similarly defined. Then, for initial data obeying the distribution $\bar{F}_{0}^{\otimes N}$, one has

F[0,T]N=NdF[0,T]N(|z)F¯0N(dz),F¯[0,T]N=NdF¯[0,T]N(|z)F¯0N(dz).\displaystyle F^{N}_{[0,T]}=\int_{\mathbb{R}^{Nd}}F^{N}_{[0,T]}(\cdot|z)\bar{F}_{0}^{\otimes N}(dz),\quad\bar{F}^{\otimes N}_{[0,T]}=\int_{\mathbb{R}^{Nd}}\bar{F}^{N}_{[0,T]}(\cdot|z)\bar{F}_{0}^{\otimes N}(dz). (3.25)

By the data processing inequality (Lemma 2.1), one has that

DKL(F[0,T]N(|z)F¯[0,T]N(|z))DKL(Q1Q2)=𝔼XQ1[logdQ2dQ1(X)],D_{KL}(F^{N}_{[0,T]}(\cdot|z)\|\bar{F}^{N}_{[0,T]}(\cdot|z))\leq D_{KL}(Q^{1}\|Q^{2})=\mathbb{E}_{X\sim Q^{1}}\left[-\log\frac{dQ^{2}}{dQ^{1}}(X)\right], (3.26)

where Q1Q^{1}, Q2Q^{2} are path measures generated by θ[0,T](1)\theta^{(1)}_{[0,T]} and θ[0,T](2)\theta^{(2)}_{[0,T]} , respectively, corresponding to the time interval [0,T][0,T]. Namely, Q1=θ[0,T](1)#Q^{1}=\theta^{(1)}_{[0,T]}{\#}\mathbb{P}, and Q2=θ[0,T](2)#Q^{2}=\theta^{(2)}_{[0,T]}{\#}\mathbb{P}. By definition of the process θ[0,T](1)\theta^{(1)}_{[0,T]}, θ[0,T](2)\theta^{(2)}_{[0,T]}, Q2Q1Q^{2}\ll Q^{1} and the Radon-Nikodym derivative dQ2dQ1\frac{dQ^{2}}{dQ^{1}} exists. One can find the expression of this Radon-Nikodym derivative explicitly by Girsanov’s transform. In fact, denote the NdNd-dimensional vector 𝐛(s,x)=(𝐛1T,,𝐛NT)T\boldsymbol{b}(s,x)=(\boldsymbol{b}^{T}_{1},\cdots,\boldsymbol{b}^{T}_{N})^{T} with

\boldsymbol{b}_{i}(s,x):=\sigma^{T}\Lambda^{-1}\left(K{*}\bar{\rho}_{s}(x_{i})-\frac{1}{N-1}\sum_{j:j\neq i}K(x_{i}-x_{j})\right).

Note that

𝒃(s,X(s))=𝒃(s,πsΦs(θ[0,s](1)))=:𝒃~(s,[θ(1)][0,s]),\boldsymbol{b}(s,X(s))=\boldsymbol{b}(s,\pi_{s}\circ\Phi_{s}(\theta^{(1)}_{[0,s]}))=:\tilde{\boldsymbol{b}}(s,[\theta^{(1)}]_{[0,s]}),

where Φs\Phi_{s} is defined in (3.21), and πs\pi_{s} maps X[0,s]X_{[0,s]} in path space to its time marginal, namely, πs(X[0,s])=Xs\pi_{s}(X_{[0,s]})=X_{s}. Then the Girsanov’s transform asserts that the Radon-Nikodym derivative in the path space satisfies

dQ2dQ1(θ(1)(ω))\displaystyle\frac{dQ^{2}}{dQ^{1}}(\theta^{(1)}(\omega)) =exp(0T𝒃~(s,[θ(1)][0,s])𝑑Ws120T|𝒃~(s,[θ(1)][0,s])|2𝑑s)\displaystyle=\exp\Big{(}\int_{0}^{T}\tilde{\boldsymbol{b}}(s,[\theta^{(1)}]_{[0,s]})\cdot dW_{s}-\frac{1}{2}\int_{0}^{T}\left|\tilde{\boldsymbol{b}}(s,[\theta^{(1)}]_{[0,s]})\right|^{2}ds\Big{)}
=exp(0T𝒃(s,X(s))𝑑Ws120T|𝒃(s,X(s))|2𝑑s).\displaystyle=\exp\Big{(}\int_{0}^{T}\boldsymbol{b}(s,X(s))\cdot dW_{s}-\frac{1}{2}\int_{0}^{T}\left|\boldsymbol{b}(s,X(s))\right|^{2}ds\Big{)}. (3.27)

In Appendix A, we present a formal derivation of the details for (3.7). Rigorous proofs can be found in many textbooks, e.g., [45, Theorem 8.6.5], [49, Theorem 27.1], [34, Theorem 2.1]. Since

|\boldsymbol{b}(s,X(s))|^{2} =\sum_{i=1}^{N}\left|\sigma^{T}\Lambda^{-1}\left(K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right)\right|^{2}
1λi=1N|Kρ¯s(Xi(s))1N1j:jiK(Xi(s)Xj(s))|2,\displaystyle\leq\frac{1}{\lambda}\sum_{i=1}^{N}\left|K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right|^{2},

one has by combining (3.26) and (3.7) that

DKL(F[0,T]N(|z)F¯[0,T]N(|z))12λi=1N0T𝔼|Kρ¯s(Xi(s))1N1j:jiK(Xi(s)Xj(s))|2𝑑s.D_{KL}(F^{N}_{[0,T]}(\cdot|z)\|\bar{F}^{N}_{[0,T]}(\cdot|z))\leq\\ \frac{1}{2\lambda}\sum_{i=1}^{N}\int_{0}^{T}\mathbb{E}\left|K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right|^{2}ds. (3.28)

Moreover, due to the fact (3.25) and the convexity of the KL-divergence, one has by Jensen’s inequality that

DKL(F[0,T]NF¯[0,T]N)12λi=1N0T𝔼|Kρ¯s(Xi(s))1N1j:jiK(Xi(s)Xj(s))|2𝑑s,D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]})\leq\frac{1}{2\lambda}\sum_{i=1}^{N}\int_{0}^{T}\mathbb{E}\left|K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right|^{2}ds, (3.29)

where the expectation on the right-hand side is now the full expectation.

Next, we estimate the right-hand side of (3.29), separately under Assumption 3 (bounded K) and Assumption 3 (unbounded K).

Case 1: Under Assumption 3.

We first split the right-hand side of (3.29) into

i=1N|Kρ¯s(Xi(s))1N1j:jiK(Xi(s)Xj(s))|2=1(N1)2i=1Nj:ji|Ai,j(s)|2+1(N1)2i=1Nj1,j2:j1j2,j1i,j2iAi,j1(s)Ai,j2(s),\sum_{i=1}^{N}\left|K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right|^{2}\\ =\frac{1}{(N-1)^{2}}\sum_{i=1}^{N}\sum_{j:j\neq i}|A_{i,j}^{\prime}(s)|^{2}+\frac{1}{(N-1)^{2}}\sum_{i=1}^{N}\sum_{j_{1},j_{2}:j_{1}\neq j_{2},j_{1}\neq i,j_{2}\neq i}A_{i,j_{1}}^{\prime}(s)\cdot A_{i,j_{2}}^{\prime}(s),

where Ai,j(t)A_{i,j}^{\prime}(t) is defined by

Ai,j(t):=K(Xi(t)Xj(t))Kρ¯t(Xi(t)).A_{i,j}^{\prime}(t):=K\left(X_{i}(t)-X_{j}(t)\right)-K{*}\bar{\rho}_{t}\left(X_{i}(t)\right).

Since K\in L^{\infty} by Assumption 3, it is easy to see that for N\geq 2, the first term above is bounded by 8\|K\|_{\infty}^{2}. For the second term, for any fixed i, choosing \rho=\rho^{N}_{s} (the time marginal distribution of the particle positions X_{s}=(X_{1}(s),\dots,X_{N}(s)) at time s) and \tilde{\rho}=\bar{\rho}^{\otimes N}_{s} in Lemma 3.13 (as we shall present in Section 3.3), for any \eta>0 we have

𝔼[1N1j1,j2:j1j2,j1i,j2iAi,j1(s)Ai,j2(s)]η1DKL(ρsNρ¯sN)+η1log𝔼[exp(ηN1j1,j2:j1j2,j1i,j2iAi,j1(s)Ai,j2(s))],\mathbb{E}\left[\frac{1}{N-1}\sum_{j_{1},j_{2}:j_{1}\neq j_{2},j_{1}\neq i,j_{2}\neq i}A_{i,j_{1}}^{\prime}(s)\cdot A_{i,j_{2}}^{\prime}(s)\right]\\ \leq\eta^{-1}D_{KL}\left(\rho^{N}_{s}\|\bar{\rho}^{\otimes N}_{s}\right)+\eta^{-1}\log\mathbb{E}\left[\exp\left(\frac{\eta}{N-1}\sum_{j_{1},j_{2}:j_{1}\neq j_{2},j_{1}\neq i,j_{2}\neq i}A_{i,j_{1}}(s)A_{i,j_{2}}(s)\right)\right],

where Ai,j(t)A_{i,j}(t) is defined by

Ai,j(t):=K(X¯i(t)X¯j(t))Kρ¯t(X¯i(t)).A_{i,j}(t):=K\left(\bar{X}_{i}(t)-\bar{X}_{j}(t)\right)-K{*}\bar{\rho}_{t}\left(\bar{X}_{i}(t)\right).

Considering the map T_{s}: Z_{[0,s]}\mapsto X_{s}, by the data processing inequality (Lemma 2.1) we know that

DKL(ρsNρ¯sN)DKL(F[0,s]NF¯[0,s]N).D_{KL}\left(\rho^{N}_{s}\|\bar{\rho}^{\otimes N}_{s}\right)\leq D_{KL}\left(F^{N}_{[0,s]}\|\bar{F}^{\otimes N}_{[0,s]}\right).

Also, Lemma 3.14 in Section 3.3 states that for η(0,1/(42eK2))\eta\in(0,1/(4\sqrt{2}e\|K\|^{2}_{\infty})),

supN2,s0𝔼[exp(ηN1j1,j2:j1j2,j1i,j2iAi,j1(s)Ai,j2(s))]1142eK2η<.\sup_{N\geq 2,s\geq 0}\mathbb{E}\left[\exp\left(\frac{\eta}{N-1}\sum_{j_{1},j_{2}:j_{1}\neq j_{2},j_{1}\neq i,j_{2}\neq i}A_{i,j_{1}}(s)A_{i,j_{2}}(s)\right)\right]\leq\frac{1}{1-4\sqrt{2}e\|K\|^{2}_{\infty}\eta}<\infty.

Hence, considering the averaged summation 1N1i=1N()\frac{1}{N-1}\sum_{i=1}^{N}(\cdot) for N2N\geq 2 and combining all the above, one obtains

DKL(F[0,T]NF¯[0,T]N)12λC(η)T+0T1λη1DKL(F[0,s]NF¯[0,s]N)𝑑s,D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]})\leq\frac{1}{2\lambda}C(\eta)T+\int_{0}^{T}\frac{1}{\lambda}\eta^{-1}D_{KL}\left(F^{N}_{[0,s]}\|\bar{F}^{\otimes N}_{[0,s]}\right)ds, (3.30)

where C(η):=8K2+2ηlog1142eK2ηC(\eta):=8\|K\|_{\infty}^{2}+\frac{2}{\eta}\log\frac{1}{1-4\sqrt{2}e\|K\|^{2}_{\infty}\eta}. The result (3.23) is obtained after the Grönwall’s inequality:

DKL(F[0,T]NF¯[0,T]N)\displaystyle D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}) C(η)2λT+0TC(η)2λ1ληse(λη)1(Ts)𝑑s\displaystyle\leq\frac{C(\eta)}{2\lambda}T+\int_{0}^{T}\frac{C(\eta)}{2\lambda}\frac{1}{\lambda\eta}se^{(\lambda\eta)^{-1}(T-s)}ds
=C(η)η2(e(λη)1T1)CeCT,\displaystyle=C(\eta)\frac{\eta}{2}\left(e^{(\lambda\eta)^{-1}T}-1\right)\leq Ce^{CT},

where CC is a positive constant independent of the particle number NN and the particle mass mm. For instance, if we choose η=(82eK2)1\eta=(8\sqrt{2}e\|K\|_{\infty}^{2})^{-1}, then we can choose C=max(C1,C2)C=\max(C_{1},C_{2}) with C1:=24e+log2C_{1}:=\frac{\sqrt{2}}{4e}+\log 2 and C2:=82eK2λ1C_{2}:=8\sqrt{2}e\|K\|_{\infty}^{2}\lambda^{-1}.

Case 2: Under Assumption 3.

Now we consider the case of an unbounded interaction kernel. First, for fixed i, still by Lemma 3.13, for any \eta>0, we have (recalling the notations A_{i,j} and A^{\prime}_{i,j} above)

𝔼i=1N|Kρ¯s(Xi(s))1N1j:jiK(Xi(s)Xj(s))|2η1DKL(F[0,s]NF¯[0,s]N)+η1log𝔼[exp(ηi=1N|Kρ¯s(X¯i(s))1N1j:jiK(X¯i(s)X¯j(s))|2)].\mathbb{E}\sum_{i=1}^{N}\left|K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right|^{2}\leq\eta^{-1}D_{KL}\left(F^{N}_{[0,s]}\|\bar{F}^{\otimes N}_{[0,s]}\right)\\ +\eta^{-1}\log\mathbb{E}\left[\exp\left(\eta\sum_{i=1}^{N}\left|K{*}\bar{\rho}_{s}(\bar{X}_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(\bar{X}_{i}(s)-\bar{X}_{j}(s))\right|^{2}\right)\right]. (3.31)

Now note that

𝔼[Kρ¯s(X¯i(s))1N1j:jiK(X¯i(s)X¯j(s))]=0.\mathbb{E}\left[K{*}\bar{\rho}_{s}(\bar{X}_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(\bar{X}_{i}(s)-\bar{X}_{j}(s))\right]=0. (3.32)

Moreover, under Assumption 3, \bar{X}_{i}(s) is a sub-Gaussian random variable, and for each j\neq i,

\left|K{*}\bar{\rho}_{s}(\bar{X}_{i}(s))-K(\bar{X}_{i}(s)-\bar{X}_{j}(s))\right|\leq C(1+|\bar{X}_{j}(s)|). (3.33)

Therefore, the conditions required in Lemma 3.15 are satisfied. Consequently, we have a similar estimate under Assumption 3:

D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]})\leq\frac{C}{2\lambda}T+\int_{0}^{T}\frac{C^{\prime}}{\lambda}D_{KL}\left(F^{N}_{[0,s]}\|\bar{F}^{\otimes N}_{[0,s]}\right)ds, (3.34)

where C, C^{\prime} are positive constants independent of N and m. Therefore, the O(1) upper bound for D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}) is obtained by Grönwall’s inequality.

Next, noting the symmetry of FtNF_{t}^{N}, one has by Lemma 3.16 that

DKL(F[0,T]N:kF¯[0,T]k)kNDKL(F[0,T]NF¯[0,T]N)CeCTkN.D_{KL}\left(F^{N:k}_{[0,T]}\|\bar{F}^{\otimes k}_{[0,T]}\right)\leq\frac{k}{N}D_{KL}\left(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}\right)\leq Ce^{CT}\frac{k}{N}. (3.35)

Hence, (3.24) holds.

The results above all concern path measures. In fact, we can extend them to the time marginal case, which is commonly studied in the related literature.

Corollary 3.8 (time marginal).

For any t>0, consider the distributions F^{N}_{t}, \bar{F}^{\otimes N}_{t} for the second-order system defined in Section 1, with initial law F^{N}_{0}=\bar{F}^{\otimes N}_{0}. Then under either Assumption 3 or Assumption 3, for the constant C in Theorem 3.6,

DKL(FtNF¯tN)CeCt,t>0.D_{KL}(F^{N}_{t}\|\bar{F}^{\otimes N}_{t})\leq Ce^{Ct},\quad\forall t>0. (3.36)

Then for 1kN1\leq k\leq N,

DKL(FtN:kF¯tk)CeCtkN.D_{KL}\left(F^{N:k}_{t}\|\bar{F}_{t}^{\otimes k}\right)\leq Ce^{Ct}\frac{k}{N}. (3.37)
Proof 3.9.

For any t>0t>0, consider the path measures F[0,t]NF^{N}_{[0,t]}, F¯[0,t]N\bar{F}^{\otimes N}_{[0,t]} corresponding to the time interval [0,t][0,t]. Then by Theorem 3.6,

DKL(F[0,t]NF¯[0,t]N)CeCt.D_{KL}(F^{N}_{[0,t]}\|\bar{F}^{\otimes N}_{[0,t]})\leq Ce^{Ct}.

Now consider the time marginal mapping πt:C([0,t];d)d\pi_{t}:C([0,t];\mathbb{R}^{d})\to\mathbb{R}^{d} given by πt(Z)=Zt\pi_{t}(Z)=Z_{t}, which maps ZZ in the path space to its time marginal ZtZ_{t}. Then by the data processing inequality (Lemma 2.1), one has

DKL(FtNF¯tN)DKL(F[0,t]NF¯[0,t]N)CeCt.D_{KL}(F^{N}_{t}\|\bar{F}^{\otimes N}_{t})\leq D_{KL}(F^{N}_{[0,t]}\|\bar{F}^{\otimes N}_{[0,t]})\leq Ce^{Ct}. (3.38)

Then, (3.37) is a direct result of Lemma 3.16.

Remark 3.10.

The fact that the KL-divergence between path measures can control that between time marginals can actually be proved without the data processing inequality. In fact, for t>0, the Radon-Nikodym derivative in terms of time marginal distributions has the following formula (see, for instance, Appendix A in [36]):

dF¯tNdFtN(z)=𝔼[dF¯[0,t]NdF[0,t]NZt=z].\frac{d\bar{F}^{\otimes N}_{t}}{dF^{N}_{t}}(z)=\mathbb{E}\left[\frac{d\bar{F}^{\otimes N}_{[0,t]}}{dF^{N}_{[0,t]}}\mid Z_{t}=z\right]. (3.39)

Then by Jensen’s inequality, we directly conclude that

DKL(FtNF¯tN)DKL(F[0,t]NF¯[0,t]N).D_{KL}(F^{N}_{t}\|\bar{F}^{\otimes N}_{t})\leq D_{KL}(F^{N}_{[0,t]}\|\bar{F}^{\otimes N}_{[0,t]}).

In fact, these two approaches are essentially the same, since both rely on Jensen’s inequality.

Based on Theorem 3.6 and Pinsker’s inequality [46], we can extend the propagation of chaos to the total variation (TV) distance, defined by

TV(μ,ν):=supA|μ(A)ν(A)|,TV(\mu,\nu):=\sup_{A\in\mathcal{F}}|\mu(A)-\nu(A)|, (3.40)

for two probability measures μ\mu, ν\nu defined on (Ω,)(\Omega,\mathcal{F}).

Corollary 3.11.

Under the same settings of Theorem 3.6 and Corollary 3.8, for 1kN1\leq k\leq N it holds that

TV(F[0,t]N:k,F¯[0,t]k)CeCtkN,TV(F^{N:k}_{[0,t]},\bar{F}^{\otimes k}_{[0,t]})\leq Ce^{Ct}\sqrt{\frac{k}{N}}, (3.41)

for path measures and

TV(FtN:k,F¯tk)CeCtkN,TV(F^{N:k}_{t},\bar{F}^{\otimes k}_{t})\leq Ce^{Ct}\sqrt{\frac{k}{N}}, (3.42)

for time marginal distributions.
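As a purely numerical aside (not needed for the analysis), Pinsker's inequality TV(\mu,\nu)\leq\sqrt{D_{KL}(\mu\|\nu)/2}, which underlies Corollary 3.11, can be checked directly on finite probability spaces; the distributions below are random synthetic test data.

```python
import numpy as np

# Illustrative check of Pinsker's inequality TV(mu, nu) <= sqrt(D_KL(mu||nu)/2)
# on finite probability spaces; the distributions are random test data.
def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """TV(p, q) = sup_A |p(A) - q(A)| = (1/2) sum_i |p_i - q_i|."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.sum(np.abs(p - q)))

rng = np.random.default_rng(0)
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    assert total_variation(p, q) <= np.sqrt(kl_divergence(p, q) / 2) + 1e-12
```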

Remark 3.12.

Our approach can be applied without difficulty to the following first-order system:

dXi(t)=b(Xi(t))dt+1N1j:jiK(Xi(t)Xj(t))dt+σdWi(t),1iN,dX_{i}(t)=b(X_{i}(t))dt+\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(t)-X_{j}(t))dt+\sigma\cdot dW_{i}(t),\quad 1\leq i\leq N, (3.43)

where b:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is the non-interacting drift, and the settings of K, \sigma, W_{i} are the same as in the second-order case. We omit the proof for this case.
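To make the mean-field limit of this first-order system concrete, the following is a small simulation sketch with the hypothetical choices b(x)=-x and K(x)=-x (toy parameters, not from the analysis). For this linear kernel and mean-zero initial data, K{*}\bar\rho_{t}(x)=\mathbb{E}[\bar X_{t}]-x=-x, so the McKean SDE reduces to the Ornstein-Uhlenbeck equation d\bar X=-2\bar X\,dt+\sigma\,dW.

```python
import numpy as np

# Purely illustrative Euler-Maruyama sketch of the first-order system (3.43)
# with the hypothetical choices b(x) = -x and K(x) = -x (not from the paper).
# For this kernel and mean-zero initial data, K * rho_t(x) = E[X_t] - x = -x,
# so the McKean SDE is the OU equation dX = -2 X dt + sigma dW, whose variance
# at time T equals sigma^2/4 * (1 - e^{-4T}).
rng = np.random.default_rng(1)
N, T, dt, sigma = 4000, 1.0, 1e-3, 1.0
steps = int(T / dt)

X = np.zeros(N)        # interacting particles, X_i(0) = 0
Xbar = np.zeros(N)     # i.i.d. copies of the McKean SDE
for _ in range(steps):
    mean_field = (X.sum() - X) / (N - 1)     # (1/(N-1)) * sum_{j != i} X_j
    drift = -X + (mean_field - X)            # b(X_i) + averaged K(X_i - X_j)
    X = X + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(N)
    Xbar = Xbar - 2 * Xbar * dt + sigma * np.sqrt(dt) * rng.standard_normal(N)

target = sigma**2 / 4 * (1 - np.exp(-4 * T))
assert abs(X.var() - target) < 0.1 * target       # particle system near the limit
assert abs(Xbar.var() - target) < 0.1 * target    # McKean copies near the limit
```

The empirical law of a single particle is close to that of an independent McKean copy, in line with the propagation of chaos estimates above.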

3.3 Some auxiliary lemmas.

In this subsection, we present some auxiliary lemmas used in our proof. The detailed proof of Lemma 3.14 is deferred to the Appendix.

Near the end of the proof of Theorem 3.6, in order to estimate the difference between the two drifts

12λi=1N0T𝔼|Kρ¯s(Xi(s))1N1j:jiK(Xi(s)Xj(s))|2𝑑s,\frac{1}{2\lambda}\sum_{i=1}^{N}\int_{0}^{T}\mathbb{E}\left|K{*}\bar{\rho}_{s}(X_{i}(s))-\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))\right|^{2}ds,

we need the following two lemmas: a Fenchel-Young type inequality and an exponential concentration estimate. The Fenchel-Young type inequality ([26, Lemma 1]) states:

Lemma 3.13.

For any two probability measures \rho and \tilde{\rho} on a Polish space E and any test function F\in L^{1}(\rho), one has that \forall\eta>0,

EFρ(dx)1η(DKL(ρρ~)+logEeηFρ~(dx)).\int_{E}F\rho(dx)\leq\frac{1}{\eta}\left(D_{KL}(\rho\|\tilde{\rho})+\log\int_{E}e^{\eta F}\tilde{\rho}(dx)\right).
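As a quick sanity check (not needed for the proofs), the inequality of Lemma 3.13 can be verified numerically on finite probability spaces, where both sides are finite sums; the measures, test functions, and values of \eta below are random synthetic data.

```python
import numpy as np

# Numerical sanity check of the Fenchel-Young type inequality of Lemma 3.13 on
# a finite space; rho, rho_tilde, F, eta are random synthetic test data.
def fenchel_young_gap(rho, rho_tilde, F, eta):
    """Right-hand side minus left-hand side of Lemma 3.13; nonnegative iff the
    inequality holds for this data."""
    lhs = np.sum(F * rho)
    kl = np.sum(rho * np.log(rho / rho_tilde))
    rhs = (kl + np.log(np.sum(np.exp(eta * F) * rho_tilde))) / eta
    return rhs - lhs

rng = np.random.default_rng(2)
for _ in range(1000):
    rho, rho_tilde = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
    gap = fenchel_young_gap(rho, rho_tilde, rng.normal(size=6),
                            rng.uniform(0.1, 5.0))
    assert gap >= -1e-10
```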

We also need the following exponential concentration estimate. Similar results can be found in the related literature, e.g. [37, 26]. For the convenience of the readers, we attach a proof in Appendix B.

Lemma 3.14.

Suppose Assumption 3 holds. Consider solutions \bar{X}_{1}(t), \dots, \bar{X}_{N}(t) to the McKean SDEs (2.7), which are i.i.d. samples from \bar{F}_{t}. Then for fixed \eta\in(0,1/\left(4\sqrt{2}e\|K\|^{2}_{\infty}\right)), for any N\geq 2, t\geq 0, and 1\leq i\leq N, we have

𝔼[exp(ηN1j1,j2:j1j2,j1i,j2iAi,j1(t)Ai,j2(t))X¯i(t)]1142eK2η<+,\mathbb{E}\left[\exp\left(\frac{\eta}{N-1}\sum_{j_{1},j_{2}:j_{1}\neq j_{2},j_{1}\neq i,j_{2}\neq i}A_{i,j_{1}}(t)\cdot A_{i,j_{2}}(t)\right)\mid\bar{X}_{i}(t)\right]\leq\frac{1}{1-4\sqrt{2}e\|K\|_{\infty}^{2}\eta}<+\infty,

where Ai,j(t)A_{i,j}(t) is defined by

Ai,j(t):=K(X¯i(t)X¯j(t))Kρ¯t(X¯i(t)).A_{i,j}(t):=K\left(\bar{X}_{i}(t)-\bar{X}_{j}(t)\right)-K{*}\bar{\rho}_{t}\left(\bar{X}_{i}(t)\right).

When the interaction kernel K is bounded, Lemma 3.13 and Lemma 3.14, along with the previous analysis, enable one to obtain an \mathcal{O}(1) upper bound for D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}), and it is easy to see that the bound is independent of the particle mass m. When K is unbounded, we use Lemma 3.13 together with Lemma 3.15 below instead:

Lemma 3.15.

([14, Lemma 3.3]) Consider \rho\in\mathcal{P}(E) and \psi(x) satisfying \int_{E}\psi(x)\rho(dx)=0 and, for the universal constant c_{*}>0 in Hoeffding’s inequality, the following:

\|\psi(x)\|_{\rho}:=\inf\left\{c>0:\int_{E}\exp\left(|\psi(x)|^{2}/c^{2}\right)\rho(dx)\leq 2\right\}<c_{*}. (3.44)

Then,

supN1ENexp(1N|i=1Nψ(xi)|2)ρNdx<.\sup_{N\geq 1}\int_{E^{N}}\exp\left(\frac{1}{N}\left|\sum_{i=1}^{N}\psi\left(x_{i}\right)\right|^{2}\right)\rho^{\otimes N}\mathrm{dx}<\infty. (3.45)

For the readers’ convenience, here we briefly introduce the Hoeffding bound used in the statement (as well as the proof) of Lemma 3.15 above. The Hoeffding inequality [52] states that for n independent centered real random variables Y_{1},\dots,Y_{n}, there exists a universal constant c_{*}>0 such that

P\left(\left|\sum_{j=1}^{n}Y_{j}\right|\geq y\right)\leq 2\exp\left(-\frac{c_{*}y^{2}}{\sum_{j=1}^{n}\|Y_{j}\|_{\psi_{2}}^{2}}\right),\quad\forall\,y\geq 0, (3.46)

where the ψ2\psi_{2} norm (or the Orlicz norm with ψ2(x)=exp(x2)1\psi_{2}(x)=\exp(x^{2})-1) for some sub-Gaussian random variable XX is given by

\|X\|_{\psi_{2}}:=\inf\left\{c>0:\mathbb{E}\left[\exp(|X|^{2}/c^{2})\right]\leq 2\right\}. (3.47)
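As a concrete illustration (a toy example, not code connected to the paper), the \psi_{2} norm in (3.47) can be computed by bisection whenever the moment c\mapsto\mathbb{E}\exp(|X|^{2}/c^{2}) is available in closed form; for a Rademacher variable X\in\{-1,+1\} this moment equals \exp(1/c^{2}), so the norm is exactly 1/\sqrt{\log 2}.

```python
import numpy as np

# Illustrative computation of the psi_2 (sub-Gaussian Orlicz) norm of (3.47)
# by bisection on c, given the moment function c -> E exp(|X|^2/c^2).
def psi2_norm(moment_fn, lo=0.5, hi=10.0, iters=100):
    """Smallest c with moment_fn(c) <= 2; moment_fn must be decreasing in c,
    with moment_fn(lo) > 2 >= moment_fn(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_fn(mid) > 2:
            lo = mid
        else:
            hi = mid
    return hi

# For Rademacher X = +-1, E exp(|X|^2/c^2) = exp(1/c^2), so the norm solves
# exp(1/c^2) = 2, i.e. c = 1/sqrt(log 2).
rademacher_moment = lambda c: np.exp(1.0 / c**2)
assert abs(psi2_norm(rademacher_moment) - 1 / np.sqrt(np.log(2))) < 1e-9
```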

The following well-known linear scaling property of the relative entropy is useful for controlling the marginal distribution. (See e.g. [40, Lemma 3.9], [10, Equation (2.10), page 772].)

Lemma 3.16 (linear scaling for KL-divergence).

Let \mu^{n}\in\mathcal{P}_{s}(E^{n}) be a symmetric distribution over the tensorized space E^{n} and \bar{\mu}\in\mathcal{P}(E). For 1\leq k\leq n, define its k-th marginal \mu^{n:k} by

\mu^{n:k}(z_{1},\dots,z_{k}):=\int_{E^{n-k}}\mu^{n}(z_{1},\dots,z_{n})dz_{k+1}\dots dz_{n}. (3.48)

Assume that \mu^{n:k}\ll\bar{\mu}^{\otimes k} for any 1\leq k\leq n. Then it holds that

DKL(μn:kμ¯k)2knDKL(μnμ¯n).D_{KL}\left(\mu^{n:k}\|\bar{\mu}^{\otimes k}\right)\leq 2\frac{k}{n}D_{KL}\left(\mu^{n}\|\bar{\mu}^{\otimes n}\right). (3.49)
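The linear scaling factor 2k/n in (3.49) can be checked by brute-force enumeration on small finite spaces; the exchangeable mixture of i.i.d. Bernoulli laws below is arbitrary toy data, not an object from the paper.

```python
import itertools
import numpy as np

# Brute-force check of the linear scaling of Lemma 3.16 on {0,1}^n: mu_n is an
# exchangeable mixture of i.i.d. Bernoulli laws, mu_bar is Bernoulli(q).
def bern(p, z):
    """Probability of the 0/1 tuple z under the i.i.d. Bernoulli(p) law."""
    return np.prod([p if zi else 1 - p for zi in z])

def kl(mu, nu):
    return sum(m * np.log(m / nu[z]) for z, m in mu.items() if m > 0)

n, p1, p2, q = 4, 0.3, 0.7, 0.5
states = list(itertools.product([0, 1], repeat=n))
mu_n = {z: 0.5 * bern(p1, z) + 0.5 * bern(p2, z) for z in states}
full = kl(mu_n, {z: bern(q, z) for z in states})

for k in range(1, n + 1):
    sub = list(itertools.product([0, 1], repeat=k))
    # k-th marginal: sum out the last n-k coordinates, as in (3.48)
    marg = {z: sum(mu_n[z + w] for w in itertools.product([0, 1], repeat=n - k))
            for z in sub}
    assert kl(marg, {z: bern(q, z) for z in sub}) <= 2 * k / n * full + 1e-12
```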

4 Other applications

In this section, we show two applications of our approach, in neural networks and in numerical analysis, respectively.

4.1 Application in neural networks.

An interesting application is on neural networks. To show the characteristics of our approach, we use an artificial single-layer neural network as an example:

Xi(T)=𝔖(0Tb(Xi(t))𝑑t+1N1j:jiK(Xi(t)Xj(t))dt+σdWi(t)),1iN,X_{i}(T)=\mathfrak{S}\Bigg{(}\int_{0}^{T}b(X_{i}(t))dt+\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(t)-X_{j}(t))dt+\sigma\cdot dW_{i}(t)\Bigg{)},\quad 1\leq i\leq N, (4.50)

where \{X_{i}(0)\}, i=1,\cdots,N, denotes the N input features and \mathfrak{S} denotes a certain activation function; \{X_{i}(T)\}, i=1,\cdots,N, denotes the output. This model can be viewed as a single-layer variant, with noise, of the model mentioned in [54]. Our approach can be directly applied to (4.50), transforming the original problem in the space of X into one in the space of the driving process

θi(t)=0t(1N1j:jiK(Xi(s)Xj(s))Kρ¯s(Xi(s)))𝑑s+σWi(t),\theta_{i}(t)=\int_{0}^{t}\left(\frac{1}{N-1}\sum_{j:j\neq i}K(X_{i}(s)-X_{j}(s))-K{*}\bar{\rho}_{s}(X_{i}(s))\right)ds+\sigma\cdot W_{i}(t),

similarly to the discussion in Section 2. The presence of the activation function \mathfrak{S} makes it impossible to use Girsanov’s theorem directly, while our approach works in this case as well. Also, if one uses the second-order dynamics to update the features, that is,

X_{i}(T)=\mathfrak{S}\Bigg{(}\int_{0}^{T}V_{i}(t)\,dt\Bigg{)},\quad V_{i}(t)\ \text{is obtained by \eqref{eq:particle}},

the uniformity in mass is not a direct byproduct of Girsanov’s theorem.

4.2 Application in numerical analysis.

Our approach can also be applied directly in numerical analysis. For example, consider the following scheme for SDE (1.1) with time step h. Without loss of generality, we set m=1 and \sigma=1. Assume that K is globally Lipschitz continuous with constant C_{K} and that the second moment of the initial data is finite:

𝔼|Z(0)|2<.\mathbb{E}|Z(0)|^{2}<\infty. (4.51)

Define

Z:=(XV),A:=(010γ),B(X(t)):=(01N1j:jiK(Xi(t)Xj(t))),C:=(01).Z:=\begin{pmatrix}X\\ V\end{pmatrix},\quad A:=\begin{pmatrix}0&1\\ 0&-\gamma\end{pmatrix},\quad B(X(t)):=\begin{pmatrix}0\\ \frac{1}{N-1}\sum\limits_{j:j\neq i}K(X_{i}(t)-X_{j}(t))\end{pmatrix},\quad C:=\begin{pmatrix}0\\ 1\end{pmatrix}.

We use \tilde{Z},\tilde{X},\tilde{V} to denote the numerical solution. For t\in[t_{k},t_{k+1}) (t_{k}=kh), \tilde{Z} is defined by

Z~t=eA(ttk)Z~(tk)+tkteA(ts)B(X~(tk))𝑑s+tkteA(ts)C𝑑Ws.\tilde{Z}_{t}=e^{A(t-t_{k})}\tilde{Z}(t_{k})+\int_{t_{k}}^{t}e^{A(t-s)}B(\tilde{X}(t_{k}))ds+\int_{t_{k}}^{t}e^{A(t-s)}CdW_{s}.

For T:=nhT:=nh and F~[0,T]N:=Law(Z~),\tilde{F}^{N}_{[0,T]}:=\text{Law}(\tilde{Z}), similar to the proof of Theorem 3.6, one has

DKL(F~[0,T]NF[0,T]N)\displaystyle D_{KL}(\tilde{F}^{N}_{[0,T]}\|F^{N}_{[0,T]}) 𝔼k=0n1tktk+1i,j=1,jiN1N1|K(X~i(t)X~j(t))K(X~i(tk)X~j(tk))|2dt\displaystyle\leq\mathbb{E}\sum\limits_{k=0}^{n-1}\int_{t_{k}}^{t_{k+1}}\sum_{\begin{subarray}{c}i,j=1,\\ j\neq i\end{subarray}}^{N}\frac{1}{N-1}|K(\tilde{X}_{i}(t)-\tilde{X}_{j}(t))-K(\tilde{X}_{i}(t_{k})-\tilde{X}_{j}(t_{k}))|^{2}dt (4.52)
\displaystyle\leq C\mathbb{E}\,N\sum\limits_{k=0}^{n-1}\int_{t_{k}}^{t_{k+1}}|K(\tilde{X}_{1}(t))-K(\tilde{X}_{1}(t_{k}))|^{2}dt.

Consider equation (4.52): by Itô’s calculus and the assumption on K, one has

d𝔼|V~i|2=\displaystyle d\mathbb{E}|\tilde{V}_{i}|^{2}= 2𝔼V~i(1N1j:jiK(X~i(tk)X~j(tk))dtγV~i(t)dt)+ddt\displaystyle 2\mathbb{E}\tilde{V}_{i}\cdot\left(\frac{1}{N-1}\sum\limits_{j:j\neq i}K(\tilde{X}_{i}(t_{k})-\tilde{X}_{j}(t_{k}))dt-\gamma\tilde{V}_{i}(t)dt\right)+d\,dt
\displaystyle\leq C(|K(0)|𝔼|V~i|+𝔼|V~i|1N1j:ji|X~j(tk)X~i(tk)|+𝔼|V~i|2+d)dt\displaystyle C\bigg{(}|K(0)|\mathbb{E}|\tilde{V}_{i}|+\mathbb{E}\,|\tilde{V}_{i}|\cdot\frac{1}{N-1}\sum\limits_{j:j\neq i}|\tilde{X}_{j}(t_{k})-\tilde{X}_{i}(t_{k})|+\mathbb{E}|\tilde{V}_{i}|^{2}+d\bigg{)}dt
\displaystyle\leq C(𝔼|X~i|2+𝔼|X~j|2+𝔼|V~i|2+1).\displaystyle C(\mathbb{E}|\tilde{X}_{i}|^{2}+\mathbb{E}|\tilde{X}_{j}|^{2}+\mathbb{E}|\tilde{V}_{i}|^{2}+1).

By exchangeability, \mathbb{E}|\tilde{X}_{i}|^{2}=\mathbb{E}|\tilde{X}_{j}|^{2}. One has

d\mathbb{E}|\tilde{V}_{i}|^{2}\leq C(\mathbb{E}|\tilde{X}_{i}|^{2}+\mathbb{E}|\tilde{V}_{i}|^{2}+1)\,dt.

By the Grönwall inequality and the assumption (4.51), it holds that

𝔼|V~i(t)|2<,t[0,T].\mathbb{E}|\tilde{V}_{i}(t)|^{2}<\infty,\quad\forall t\in[0,T]. (4.53)

Hence,

𝔼|X~1(t)X~1(tk)|2C𝔼|tktV~1(s)𝑑s|2𝔼supst|V~1(s)|2h2Ch2.\mathbb{E}\,|\tilde{X}_{1}(t)-\tilde{X}_{1}(t_{k})|^{2}\leq C\mathbb{E}\,\big{|}\int_{t_{k}}^{t}\tilde{V}_{1}(s)ds\big{|}^{2}\leq\mathbb{E}\,\sup\limits_{s\leq t}|\tilde{V}_{1}(s)|^{2}h^{2}\leq Ch^{2}. (4.54)

Then, combining (4.52) and (4.54), one obtains

DKL(F~[0,T]NF[0,T]N)\displaystyle D_{KL}(\tilde{F}^{N}_{[0,T]}\|F^{N}_{[0,T]}) CCKN𝔼k=0n1tktk+1|X~1(t)X~1(tk)|2𝑑t\displaystyle\leq CC_{K}N\mathbb{E}\sum\limits_{k=0}^{n-1}\int_{t_{k}}^{t_{k+1}}|\tilde{X}_{1}(t)-\tilde{X}_{1}(t_{k})|^{2}dt (4.55)
CNh2.\displaystyle\leq CNh^{2}.
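As a minimal numerical sketch of the scheme above (with the hypothetical simplifications K=0, d=1, \gamma=1, which are not part of the analysis), one can implement the exponential integrator with e^{Ah} in closed form and test it against the known stationary velocity statistics of the kinetic Langevin equation.

```python
import numpy as np

# Sketch of the exponential scheme with K = 0, d = 1, gamma = 1 (toy choices):
# only the linear flow e^{A h} and the noise remain, and V should approach the
# stationary OU variance 1/(2 gamma). The stochastic convolution
# int e^{A(t-s)} C dW_s is approximated by e^{A h} C (W_{t_{k+1}} - W_{t_k}),
# an O(h) simplification of the exact Gaussian increment.
gamma, h, steps, M = 1.0, 0.01, 2000, 20000
rng = np.random.default_rng(3)

def expA(t):
    """Closed form of e^{A t} for A = [[0, 1], [0, -gamma]]."""
    e = np.exp(-gamma * t)
    return np.array([[1.0, (1.0 - e) / gamma], [0.0, e]])

E = expA(h)
noise_dir = E @ np.array([[0.0], [1.0]])   # e^{A h} C, shape (2, 1)
Z = np.zeros((2, M))                       # rows: positions X, velocities V
for _ in range(steps):
    dW = np.sqrt(h) * rng.standard_normal(M)
    Z = E @ Z + noise_dir * dW             # exact drift flow, approximate noise

assert abs(Z[1].var() - 1 / (2 * gamma)) < 0.05   # V near its stationary variance
```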

5 More discussions

Here we present brief discussions on the reversed relative entropy and the mass independence phenomenon.

5.1 Discussion on the reversed relative entropy.

In Section 3, we estimated the relative entropy D_{KL}(F^{N}_{[0,T]}\|\bar{F}^{\otimes N}_{[0,T]}). If we consider the reversed relative entropy, then by the data processing inequality, one would obtain that

DKL(F¯[0,T]NF[0,T]N)DKL(Q2Q1)=𝔼logdQ1dQ2(θ(2)).D_{KL}(\bar{F}^{\otimes N}_{[0,T]}\|F^{N}_{[0,T]})\leq D_{KL}(Q^{2}\|Q^{1})=-\mathbb{E}\log\frac{dQ^{1}}{dQ^{2}}(\theta^{(2)}). (5.56)

Since

πsΦs(θ(2))=X¯(s),\pi_{s}\circ\Phi_{s}(\theta^{(2)})=\bar{X}(s),

one thus finds that

D_{KL}(Q^{2}\|Q^{1})=\frac{1}{2}\,\mathbb{E}\sum_{i}\int_{0}^{T}|\boldsymbol{b}_{i}(s,\bar{X}(s))|^{2}\,ds.

Here, \bar{X}=(\bar{X}_{1},\cdots,\bar{X}_{N}) is the position process for the mean-field McKean SDE, whose components are i.i.d. Hence, the right-hand side can be estimated by

DKL(Q2Q1)CTλ,D_{KL}(Q^{2}\|Q^{1})\leq C\frac{T}{\lambda}, (5.57)

where C is independent of T and N. The linear dependence on T is similar to [31, Lemma 4.11]. This is an interesting observation, though the consequence of such a relative entropy estimate is unclear.

5.2 Discussion on the mass-independence.

Denote the marginal distributions in the vv-direction:

μvN(v):=NdFN𝑑x,μ¯v(v):=dF¯𝑑x.\mu_{v}^{N}(v):=\int_{\mathbb{R}^{Nd}}F^{N}dx,\quad\bar{\mu}_{v}(v):=\int_{\mathbb{R}^{d}}\bar{F}dx. (5.58)

It is not difficult to see from the proof of Theorem 3.6 that the KL-divergence DKL(μvNμ¯vN)D_{KL}\left(\mu_{v}^{N}\|\bar{\mu}_{v}^{\otimes N}\right) in the vv-direction has an 𝒪(1)\mathcal{O}(1) upper-bound, and the bound is independent of the particle mass mm. The mass-independence result is particularly interesting from a physical perspective. Additionally, when conducting numerical simulations in the regime of large friction, such as in viscous fluids, this phenomenon must be taken into account. Some researchers [55, 4, 53] focus on the zero mass limit under various conditions. If the propagation of chaos can be shown to be uniform in mass, then the result is asymptotically preserving in the overdamped limit.

However, the mass independence result is not very natural from a physical perspective. For fixed mass m and fixed initial data, considering the mapping \varphi^{m}_{T}:\theta\rightarrow V, the limiting behavior as m\rightarrow 0 is poor, and the L^{2} norm of V^{N} (or \bar{V}^{\otimes N}) usually diverges. On the other hand, under our framework, the dependence on m of the mapping \Phi is not important when applying the data processing inequality. This may indicate that the KL divergence is a suitable tool for obtaining a rate independent of the mass. To illustrate this, we provide a simple example. Consider the channel \Psi^{m}(X):=X+Z_{m}, where Z_{m}\sim\mathcal{N}(0,m^{-2}). Then, if we simply consider the Gaussian data X\sim\mathcal{N}(0,1), Y\sim\mathcal{N}(1,1), the inequality for the KL-divergence between their distributions \mu_{X}, \mu_{Y} still holds for any m: D_{KL}(\text{Law}(\Psi^{m}(X))\|\text{Law}(\Psi^{m}(Y)))\leq D_{KL}(\mu_{X}\|\mu_{Y}). In fact, direct calculation gives D_{KL}(\mu_{X}\|\mu_{Y})=\frac{1}{2} and D_{KL}(\text{Law}(\Psi^{m}(X))\|\text{Law}(\Psi^{m}(Y)))=\frac{1}{2(1+m^{-2})}, since \Psi^{m}(X)\sim\mathcal{N}(0,1+m^{-2}), \Psi^{m}(Y)\sim\mathcal{N}(1,1+m^{-2}). However, it is easy to check that the L^{2} norm of a single datum may blow up as m tends to zero, since the variance of \Psi^{m}(X) is exactly 1+m^{-2}.
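The closed-form computations in this example can be verified directly; kl_gauss below is the standard one-dimensional Gaussian KL divergence formula.

```python
import numpy as np

# Direct verification of the Gaussian-channel example: adding independent
# noise Z_m ~ N(0, m^{-2}) contracts the KL divergence between N(0,1), N(1,1).
def kl_gauss(mu1, var1, mu2, var2):
    """D_KL(N(mu1, var1) || N(mu2, var2)), closed form."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

assert np.isclose(kl_gauss(0, 1, 1, 1), 0.5)          # D_KL(mu_X || mu_Y) = 1/2
for m in [0.1, 1.0, 10.0]:
    v = 1 + m**-2                                      # variance of Psi^m(X)
    assert np.isclose(kl_gauss(0, v, 1, v), 1 / (2 * v))
    assert kl_gauss(0, v, 1, v) <= kl_gauss(0, 1, 1, 1)   # data processing
```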

Acknowledgement

This work is financially supported by the National Key R&D Program of China, Project Number 2021YFA1002800 and Project Number 2020YFA0712000. The work of L. Li was partially supported by NSFC 12371400 and 12031013, Shanghai Science and Technology Commission (Grant No. 21JC1403700, 20JC144100, 21JC1402900), the Strategic Priority Research Program of Chinese Academy of Sciences, Grant No. XDA25010403, Shanghai Municipal Science and Technology Major Project 2021SHZDZX0102. We thank Zhenfu Wang and the anonymous referees for some helpful comments.

Appendix A Basics on path measure and Girsanov’s transform

Here we present a formal derivation of Girsanov’s transform. The derivation here is not meant to be a proof; we present it for the reader’s intuitive understanding. Consider the following two SDEs in \mathbb{R}^{d} with different predictable drifts but the same diffusion \sigma, both of which we assume to be weakly well-posed.

{Xt(1)=x0+0tb(1)(s,[X[0,s](1)])𝑑s+0tσ𝑑Ws,tT,Xt(2)=x0+0tb(2)(s,[X[0,s](2)])𝑑s+0tσ𝑑Ws,tT.\left\{\begin{aligned} X^{(1)}_{t}&=x_{0}+\int_{0}^{t}b^{(1)}\left(s,[X^{(1)}_{[0,s]}]\right)ds+\int_{0}^{t}\sigma\cdot dW_{s},\,t\leq T,\\ X^{(2)}_{t}&=x_{0}+\int_{0}^{t}b^{(2)}\left(s,[X^{(2)}_{[0,s]}]\right)ds+\int_{0}^{t}\sigma\cdot dW_{s},\,t\leq T.\end{aligned}\right. (1.59)

Here WW is a standard Brownian motion under the probability measure \mathbb{P} (the same for the two systems), and x0μ0x_{0}\sim\mu_{0} is a common, but random, initial position. Here, the drift b(i)(s,[γ[0,s]])b^{(i)}(s,[\gamma_{[0,s]}]) depends on the path γτ\gamma_{\tau} for 0τs0\leq\tau\leq s.

For a fixed time interval [0,T],[0,T], the two processes X(1)X^{(1)} and X(2)X^{(2)} naturally induce two probability measures in the path space 𝒳:=C([0,T],d)\mathcal{X}^{\prime}:=C([0,T],\mathbb{R}^{d}), denoted by P(1)P^{(1)} and P(2),P^{(2)}, respectively.

Define the process

u(X[0,t](2))=σTΛ1(b(2)b(1))(X[0,t](2)),u\left(X^{(2)}_{[0,t]}\right)=\sigma^{T}\Lambda^{-1}\left(b^{(2)}-b^{(1)}\right)\left(X^{(2)}_{[0,t]}\right), (1.60)

where Λ=σσT\Lambda=\sigma\sigma^{T}. By Girsanov theorem, under the probability measure \mathbb{Q} satisfying

dd(ω)=exp(0Tu(X[0,s](2))dWs120T|u(X[0,s](2))|2𝑑s),\frac{d\mathbb{Q}}{d\mathbb{P}}(\omega)=\exp\Big{(}\int_{0}^{T}-u\left(X^{(2)}_{[0,s]}\right)\cdot dW_{s}-\frac{1}{2}\int_{0}^{T}\left|u\left(X^{(2)}_{[0,s]}\right)\right|^{2}ds\Big{)}, (1.61)

the law of X(2)X^{(2)} is the same as the law of X(1)X^{(1)} under \mathbb{P}. In other words, for any Borel measurable set B𝒳,B\subset\mathcal{X}^{\prime},

𝔼[1B(X(1)(ω))]=𝔼[1B(X(2)(ω))]=𝔼[1B(X(2))dd(ω)].\mathbb{E}_{\mathbb{P}}[\textbf{1}_{B}(X^{(1)}(\omega))]=\mathbb{E}_{\mathbb{Q}}[\textbf{1}_{B}(X^{(2)}(\omega))]=\mathbb{E}_{\mathbb{P}}\left[\textbf{1}_{B}(X^{(2)})\frac{d\mathbb{Q}}{d\mathbb{P}}(\omega)\right].

Since P(1)=(X(1))#P^{(1)}=(X^{(1)})_{\#}\mathbb{P} and P(2)=(X(2))#P^{(2)}=(X^{(2)})_{\#}\mathbb{P} are the laws of X(1)X^{(1)} and X(2)X^{(2)} respectively, then one has

P(1)(B)=𝔼XP(2)[1B(X)dP(1)dP(2)(X)]=𝔼[1B(X(2)(ω))dP(1)dP(2)(X(2)(ω))].P^{(1)}(B)=\mathbb{E}_{X\sim P^{(2)}}\left[\textbf{1}_{B}(X)\frac{dP^{(1)}}{dP^{(2)}}(X)\right]=\mathbb{E}_{\mathbb{P}}\left[\textbf{1}_{B}(X^{(2)}(\omega))\frac{dP^{(1)}}{dP^{(2)}}(X^{(2)}(\omega))\right].

It follows that the Radon-Nikodym derivative satisfies

dP(1)dP(2)(X(2)(ω))=dd(ω)=exp(0Tu(X[0,s](2))dWs120T|u(X[0,s](2))|2𝑑s),a.s.,\frac{dP^{(1)}}{dP^{(2)}}(X^{(2)}(\omega))=\frac{d\mathbb{Q}}{d\mathbb{P}}(\omega)=\exp\Big{(}\int_{0}^{T}-u\left(X^{(2)}_{[0,s]}\right)\cdot dW_{s}-\frac{1}{2}\int_{0}^{T}\left|u\left(X^{(2)}_{[0,s]}\right)\right|^{2}ds\Big{)},\,a.s., (1.62)

which is a martingale under \mathbb{P} and its natural filtration t(2):=σ(Xs(2),st),\mathcal{F}_{t}^{(2)}:=\sigma(X_{s}^{(2)},s\leq t), t[0,T].t\in[0,T].

Below, for the reader’s convenience, we give a simple derivation of formula (1.61) (or (1.62)) from a discrete perspective. This is not a rigorous proof, but it illustrates Girsanov’s transform. For simplicity, let d=d^{\prime} and let \sigma\in\mathbb{R}_{+} be a scalar. The general derivation can be performed similarly.

Consider

Xn+1(1)=Xn(1)+bn(1)τ+τσZn,X0(1)=x0f0,X_{n+1}^{(1)}=X_{n}^{(1)}+b^{(1)}_{n}\tau+\sqrt{\tau}\sigma Z_{n},\quad X_{0}^{(1)}=x_{0}\sim f_{0},

where b^{(1)}_{n}:=b^{(1)}(t_{n},[\tilde{\gamma}]_{[0,t_{n}]}), with \tilde{\gamma} some interpolation using the data X_{0}^{(1)},\cdots,X_{n}^{(1)}, and Z_{n}\sim N(0,I_{d}) under the probability measure \mathbb{P}.

Clearly, the conditional distribution f(X_{i}^{(1)}\mid X_{0}^{(1)},\dots,X_{i-1}^{(1)}) is Gaussian, so one can compute the joint distribution f(x_{0}^{(1)},\dots,x_{N}^{(1)}) of (X_{0}^{(1)},\dots,X_{N}^{(1)}):

f(x0(1),,xN(1))=(2πτσ2)Nd2exp(12τσ2i=1N|xi(1)xi1(1)bi1(1)τ|2)f0.f(x_{0}^{(1)},\dots,x_{N}^{(1)})=\left(2\pi\tau\sigma^{2}\right)^{-\frac{Nd}{2}}\exp\left(-\frac{1}{2\tau\sigma^{2}}\sum_{i=1}^{N}\left|x_{i}^{(1)}-x_{i-1}^{(1)}-b^{(1)}_{{i-1}}\tau\right|^{2}\right)f_{0}.

Suppose there is another probability measure \mathbb{Q} under which the law of X(1)X^{(1)} is the same as the law of X(2)X^{(2)} under \mathbb{P}, where one can similarly introduce the discrete version

Xn+1(2)=Xn(2)+bn(2)τ+τσZn,X0(2)=x0f0,X_{n+1}^{(2)}=X_{n}^{(2)}+b^{(2)}_{n}\tau+\sqrt{\tau}\sigma Z_{n},\quad X_{0}^{(2)}=x_{0}\sim f_{0},

and the joint distribution

f~(x0(2),,xN(2))=(2πτσ2)Nd2exp(12τσ2i=1N|xi(2)xi1(2)bi1(2)τ|2)f0.\tilde{f}(x_{0}^{(2)},\dots,x_{N}^{(2)})=\left(2\pi\tau\sigma^{2}\right)^{-\frac{Nd}{2}}\exp\left(-\frac{1}{2\tau\sigma^{2}}\sum_{i=1}^{N}\left|x_{i}^{(2)}-x_{i-1}^{(2)}-b^{(2)}_{{i-1}}\tau\right|^{2}\right)f_{0}.

Then by the change of measure, for any bounded measurable FF, it holds that

F(X)dd𝑑=F(X)𝑑,\int F(X)\frac{d\mathbb{Q}}{d\mathbb{P}}d\mathbb{P}=\int F(X)d\mathbb{Q},

namely,

F(x0,,xN)f(x0,,xN)ddX1(x0,,xN)𝑑x0𝑑xN=F(x0,,xN)f~(x0,,xN)𝑑x0𝑑xN.\int F(x_{0},\dots,x_{N})f(x_{0},\dots,x_{N})\frac{d\mathbb{Q}}{d\mathbb{P}}\circ X^{-1}(x_{0},\dots,x_{N})dx_{0}\dots dx_{N}\\ =\int F(x_{0},\dots,x_{N})\tilde{f}(x_{0},\dots,x_{N})dx_{0}\dots dx_{N}.

So clearly dd=limτ0L1(τ)\frac{d\mathbb{Q}}{d\mathbb{P}}=\lim\limits_{\tau\rightarrow 0}L^{-1}(\tau), where

L(τ)\displaystyle L(\tau) =ff~=exp(12τσ2i=1N(|xixi1bi1(1)τ|2|xixi1bi1(2)τ|2))\displaystyle=\frac{f}{\tilde{f}}=\exp\left(-\frac{1}{2\tau\sigma^{2}}\sum_{i=1}^{N}\left(\left|x_{i}-x_{i-1}-b^{(1)}_{{i-1}}\tau\right|^{2}-\left|x_{i}-x_{i-1}-b^{(2)}_{{i-1}}\tau\right|^{2}\right)\right)
=exp(12τσ2i=1N(2τ(xixi1)(bi1(2)bi1(1))+τ2(|bi1(1)|2|bi1(2)|2))).\displaystyle=\exp\left(-\frac{1}{2\tau\sigma^{2}}\sum_{i=1}^{N}\left(2\tau(x_{i}-x_{i-1})\cdot(b^{(2)}_{{i-1}}-b^{(1)}_{{i-1}})+\tau^{2}\left(|b^{(1)}_{{i-1}}|^{2}-|b^{(2)}_{{i-1}}|^{2}\right)\right)\right).
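The expansion of the squares in the last display can be checked mechanically. The following small numerical sketch (random stand-ins for the path and the two drifts, scalar case d=1; all names hypothetical) confirms that the ratio of the two Gaussian joint densities equals the expanded exponent:

```python
import numpy as np

# Check that the two expressions for L(tau) = f / f_tilde agree:
# the direct ratio of Gaussian joint densities vs. the expanded exponent.
rng = np.random.default_rng(1)
N, tau, sigma = 50, 0.02, 0.7
x = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(tau) * sigma, N))])
b1 = rng.normal(size=N)  # stand-ins for b^{(1)}_{i-1}
b2 = rng.normal(size=N)  # stand-ins for b^{(2)}_{i-1}
dx = np.diff(x)

log_L_direct = -(1.0 / (2 * tau * sigma**2)) * np.sum(
    (dx - b1 * tau) ** 2 - (dx - b2 * tau) ** 2
)
log_L_expanded = -(1.0 / (2 * tau * sigma**2)) * np.sum(
    2 * tau * dx * (b2 - b1) + tau**2 * (b1**2 - b2**2)
)
print(np.isclose(log_L_direct, log_L_expanded))  # True
```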

Letting τ0\tau\rightarrow 0, we formally obtain

limτ0L1(τ)=exp(1σ2(0t(b(2)b(1))(s,[X[0,s]])dXs+120t(|b(1)|2|b(2)|2)(s,[X[0,s]])ds)).\lim_{\tau\rightarrow 0}L^{-1}(\tau)=\exp\left(\frac{1}{\sigma^{2}}\left(\int_{0}^{t}(b^{(2)}-b^{(1)})(s,[X_{[0,s]}])\cdot dX_{s}\right.\right.\\ +\left.\left.\frac{1}{2}\int_{0}^{t}\left(|b^{(1)}|^{2}-|b^{(2)}|^{2}\right)(s,[X_{[0,s]}])\,ds\right)\right).

Taking into account XP(1)X\sim P^{(1)} (recall P(i)=X#(i)P^{(i)}=X^{(i)}_{\#}\mathbb{P}, i=1,2i=1,2), we derive that

dP(2)dP(1)(X(1))=exp(1σ0t(b(2)b(1))(s,[X(1)][0,s])dWs12σ20t|b(2)b(1)|2(s,[X(1)][0,s])ds).\frac{dP^{(2)}}{dP^{(1)}}(X^{(1)})=\exp\left(\frac{1}{\sigma}\int_{0}^{t}(b^{(2)}-b^{(1)})\left(s,[X^{(1)}]_{[0,s]}\right)\cdot dW_{s}\right.\\ \left.-\frac{1}{2\sigma^{2}}\int_{0}^{t}|b^{(2)}-b^{(1)}|^{2}\left(s,[X^{(1)}]_{[0,s]}\right)ds\right).

Also, since the two measures P(1)P^{(1)}, P(2)P^{(2)} are equivalent, dP(1)dP(2)\frac{dP^{(1)}}{dP^{(2)}} is well defined and can be derived in exactly the same way. We directly present its expression:

dP(1)dP(2)(X(2))=exp(1σ0t(b(1)b(2))(s,[X(2)][0,s])dWs12σ20t|b(2)b(1)|2(s,[X(2)][0,s])ds).\frac{dP^{(1)}}{dP^{(2)}}(X^{(2)})=\exp\left(\frac{1}{\sigma}\int_{0}^{t}(b^{(1)}-b^{(2)})\left(s,[X^{(2)}]_{[0,s]}\right)\cdot dW_{s}\right.\\ \left.-\frac{1}{2\sigma^{2}}\int_{0}^{t}|b^{(2)}-b^{(1)}|^{2}\left(s,[X^{(2)}]_{[0,s]}\right)ds\right).
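Taking logarithms in the last display and then expectations under \mathbb{P} yields, at least formally (assuming the stochastic integral is a true martingale with zero mean, e.g., under Novikov's condition), the relative entropy expressed through the drift difference:

```latex
\mathrm{KL}\left(P^{(2)}\,\middle\|\,P^{(1)}\right)
=-\,\mathbb{E}_{\mathbb{P}}\left[\log\frac{dP^{(1)}}{dP^{(2)}}\left(X^{(2)}\right)\right]
=\frac{1}{2\sigma^{2}}\,\mathbb{E}\int_{0}^{t}\left|b^{(2)}-b^{(1)}\right|^{2}\left(s,[X^{(2)}]_{[0,s]}\right)ds,
```

since the stochastic-integral term has zero expectation. This is the quantity to which the data processing inequality reduces the propagation of chaos estimate.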

Appendix B Proof of Lemma 3.14

Here we prove Lemma 3.14 in Section 3.3. The key ingredient of the proof is the Marcinkiewicz-Zygmund type inequality (see, for instance, Theorem 2.1 in [47], Lemma 5.2 in [37], or Lemma 3.3 in [35]).

Proof B.1.

(Proof of Lemma 3.14.) Fix ii and fix t>0t>0. For 1kN1\leq k\leq N define

Dk:=j:j<k,jiAi,k(t)Ai,j(t).D_{k}:=\sum_{j:j<k,j\neq i}A_{i,k}(t)\cdot A_{i,j}(t).

Then

j1,j2:j1j2,j1i,j2iAi,j1(t)Ai,j2(t)=2k:kiDk.\sum_{j_{1},j_{2}:j_{1}\neq j_{2},j_{1}\neq i,j_{2}\neq i}A_{i,j_{1}}(t)\cdot A_{i,j_{2}}(t)=2\sum_{k:k\neq i}D_{k}.
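This symmetrization identity is elementary; a small numerical sketch with random scalar stand-ins for the Ai,j(t)A_{i,j}(t) (which are vector-valued in the actual proof) checks it:

```python
import numpy as np

# Verify the symmetrization identity
#   sum_{j1 != j2, j1 != i, j2 != i} A[j1] * A[j2] = 2 * sum_{k != i} D_k,
# where D_k = A[k] * sum_{j < k, j != i} A[j].
rng = np.random.default_rng(3)
N, i = 20, 7
A = rng.normal(size=N)  # scalar stand-ins for A_{i,j}(t)
A[i] = 0.0              # setting A[i] = 0 removes all terms with index i

lhs = sum(A[j1] * A[j2] for j1 in range(N) for j2 in range(N) if j1 != j2)
D = [A[k] * A[:k].sum() for k in range(N)]
print(np.isclose(lhs, 2 * sum(D)))  # True
```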

Clearly, since 𝔼[Ai,j1(t)Ai,j2(t)X¯i(t)]=𝔼[Ai,j1(t)X¯i(t)]𝔼[Ai,j2(t)X¯i(t)]=0\mathbb{E}\left[A_{i,j_{1}}(t)\cdot A_{i,j_{2}}(t)\mid\bar{X}_{i}(t)\right]=\mathbb{E}\left[A_{i,j_{1}}(t)\mid\bar{X}_{i}(t)\right]\cdot\mathbb{E}\left[A_{i,j_{2}}(t)\mid\bar{X}_{i}(t)\right]=0 (j1j2j_{1}\neq j_{2}, j1ij_{1}\neq i, j2ij_{2}\neq i) by independence, and since |Ai,j(t)||A_{i,j}(t)| is uniformly bounded by 2K2\|K\|_{\infty} by Assumption 3, we know that (Dk)k(D_{k})_{k} is a sequence of LpL^{p}-martingale differences (p2p\geq 2) with respect to the filtration k:=σ(X¯1(t),X¯k(t);X¯i(t)).\mathcal{F}_{k}:=\sigma\left(\bar{X}_{1}(t),\dots\bar{X}_{k}(t);\bar{X}_{i}(t)\right). That is, for each k1,k\geq 1, DkD_{k} is k\mathcal{F}_{k}-measurable, DkLpD_{k}\in L^{p} and 𝔼[Dkk1]=0.\mathbb{E}\left[D_{k}\mid\mathcal{F}_{k-1}\right]=0. This enables one to apply the Marcinkiewicz-Zygmund type inequality and obtain

k:kiDkLp2(p1)k:kiDkLp2,p2.\|\sum_{k:k\neq i}D_{k}\|_{L^{p}}^{2}\leq(p-1)\sum_{k:k\neq i}\|D_{k}\|_{L^{p}}^{2},\quad\forall p\geq 2.
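To illustrate the Rio-type inequality, here is a minimal numerical sketch (not part of the proof) for the special case of i.i.d. Rademacher differences, a genuine martingale-difference sequence, comparing empirical LpL^{p} norms for p=4:

```python
import numpy as np

# Illustrate the Marcinkiewicz-Zygmund / Rio-type bound
#   || sum_k D_k ||_{L^p}^2 <= (p - 1) * sum_k || D_k ||_{L^p}^2
# for i.i.d. Rademacher differences (a special martingale-difference case).
rng = np.random.default_rng(4)
n, p, n_samples = 100, 4, 50_000
D = rng.choice([-1.0, 1.0], size=(n_samples, n))

S = D.sum(axis=1)
lhs = np.mean(np.abs(S) ** p) ** (2.0 / p)  # empirical ||sum_k D_k||_{L^p}^2
rhs = (p - 1) * n                           # each ||D_k||_{L^p}^2 equals 1
print(lhs < rhs)  # True; lhs is near sqrt(3) * n, well below 3 * n
```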

Moreover, for each kik\neq i, define the sequence

Bjk=Ai,k(t)Ai,j(t),j<k,ji.B_{j}^{k}=A_{i,k}(t)\cdot A_{i,j}(t),\quad j<k,j\neq i.

Clearly, Dk=j:j<k,jiBjkD_{k}=\sum_{j:j<k,j\neq i}B_{j}^{k}, (Bjk)j(B_{j}^{k})_{j} is a sequence of LpL^{p}-martingale differences (p2p\geq 2) with respect to the filtration ^j:=σ(X¯1(t),X¯j(t);X¯k(t),X¯i(t))\hat{\mathcal{F}}_{j}:=\sigma\left(\bar{X}_{1}(t),\dots\bar{X}_{j}(t);\bar{X}_{k}(t),\bar{X}_{i}(t)\right), and 𝔼[Bjk^j1]=0\mathbb{E}\left[B_{j}^{k}\mid\hat{\mathcal{F}}_{j-1}\right]=0. Using the Marcinkiewicz-Zygmund type inequality again, one obtains

DkLp2(p1)j:j<k,jiBjkLp2.\|D_{k}\|_{L^{p}}^{2}\leq(p-1)\sum_{j:j<k,j\neq i}\|B^{k}_{j}\|_{L^{p}}^{2}.

Now Taylor’s expansion gives

𝔼[exp(2ηN1k:kiDk).\displaystyle\mathbb{E}\Biggl{[}\exp\biggl{(}\frac{2\eta}{N-1}\sum_{k:k\neq i}D_{k}\biggr{)}\mid\Biggr{.} .X¯i(t)]1+p=2(2η)pp!(N1)pk:kiDkLpp\displaystyle\Biggl{.}\bar{X}_{i}(t)\Biggr{]}\leq 1+\sum_{p=2}^{\infty}\frac{(2\eta)^{p}}{p!(N-1)^{p}}\|\sum_{k:k\neq i}D_{k}\|_{L^{p}}^{p}
1+p=2(2η)p(p1)p2p!(N1)p(k:kiDkLp2)p2\displaystyle\leq 1+\sum_{p=2}^{\infty}\frac{(2\eta)^{p}(p-1)^{\frac{p}{2}}}{p!(N-1)^{p}}\left(\sum_{k:k\neq i}\|D_{k}\|_{L^{p}}^{2}\right)^{\frac{p}{2}}
1+p=2(2η)p(p1)p2p!(N1)p(k:ki(p1)j:j<k,jiBjkLp2)p2\displaystyle\leq 1+\sum_{p=2}^{\infty}\frac{(2\eta)^{p}(p-1)^{\frac{p}{2}}}{p!(N-1)^{p}}\left(\sum_{k:k\neq i}(p-1)\sum_{j:j<k,j\neq i}\|B^{k}_{j}\|_{L^{p}}^{2}\right)^{\frac{p}{2}}
1+p=2(42K2η)p(p1)pp!(N2N1)p2.\displaystyle\leq 1+\sum_{p=2}^{\infty}\left(4\sqrt{2}\|K\|_{\infty}^{2}\eta\right)^{p}\frac{(p-1)^{p}}{p!}\left(\frac{N-2}{N-1}\right)^{\frac{p}{2}}.

Note that all LpL^{p} norms above are taken with respect to the conditional expectation 𝔼[X¯i(t)]\mathbb{E}\left[\cdot\mid\bar{X}_{i}(t)\right]. For N2N\geq 2, N2N1<1\frac{N-2}{N-1}<1. Moreover, by Stirling’s formula, there exists θp(0,1)\theta_{p}\in(0,1) such that

(p1)pp!=(p1)pepeθp12ppp2πpep,p2.\frac{(p-1)^{p}}{p!}=\frac{(p-1)^{p}e^{p}e^{-\frac{\theta_{p}}{12p}}}{p^{p}\sqrt{2\pi p}}\leq e^{p},\quad\forall p\geq 2.
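The Stirling-based bound (p1)pp!ep\frac{(p-1)^{p}}{p!}\leq e^{p} can also be verified directly in logarithms:

```python
import math

# Verify (p - 1)^p / p! <= e^p for p >= 2, working in logarithms:
# p * log(p - 1) - log(p!) <= p, with log(p!) = lgamma(p + 1).
for p in range(2, 200):
    assert p * math.log(p - 1) - math.lgamma(p + 1) <= p
print("bound holds for p = 2, ..., 199")
```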

Hence, if we choose η(0,1/(42eK2))\eta\in(0,1/(4\sqrt{2}e\|K\|_{\infty}^{2})),

𝔼[exp(2ηN1k:kiDk)X¯i(t)]1+p=2(42eK2η)p1142eK2η<+.\mathbb{E}\left[\exp\left(\frac{2\eta}{N-1}\sum_{k:k\neq i}D_{k}\right)\mid\bar{X}_{i}(t)\right]\leq 1+\sum_{p=2}^{\infty}\left(4\sqrt{2}e\|K\|_{\infty}^{2}\eta\right)^{p}\leq\frac{1}{1-4\sqrt{2}e\|K\|_{\infty}^{2}\eta}<+\infty.

References

  • [1] G Ben Arous and Ofer Zeitouni. Increasing propagation of chaos for mean field models. In Annales de l’institut Henri Poincare (B) Probability and Statistics, volume 35, pages 85–102. Elsevier, 1999.
  • [2] Gérard Ben Arous and Marc Brunaud. Méthode de Laplace: étude variationnelle des fluctuations de diffusions de type «champ moyen». Stochastics and Stochastics Reports, 31(1-4):79–144, 1990.
  • [3] Werner Braun and Klaus Hepp. The Vlasov dynamics and its fluctuations in the 1/N limit of interacting classical particles. Communications in mathematical physics, 56(2):101–113, 1977.
  • [4] José A Carrillo and Young-Pil Choi. Mean-field limits: from particle descriptions to macroscopic equations. Archive for Rational Mechanics and Analysis, 241:1529–1573, 2021.
  • [5] Patrick Cattiaux. Singular diffusion processes and applications. 2013.
  • [6] Patrick Cattiaux. Entropy on the path space and application to singular diffusions and mean-field models. arXiv preprint arXiv:2404.09552, 2024.
  • [7] Louis-Pierre Chaintron and Antoine Diez. Propagation of chaos: a review of models, methods and applications. I. Models and methods. arXiv preprint arXiv:2203.00446, 2022.
  • [8] Louis-Pierre Chaintron and Antoine Diez. Propagation of chaos: a review of models, methods and applications. II. Applications. arXiv preprint arXiv:2106.14812, 2022.
  • [9] Thomas M. Cover and Joy A. Thomas. Entropy, Relative Entropy, and Mutual Information, chapter 2, pages 13–55. John Wiley & Sons, Ltd, 2005.
  • [10] Imre Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. The Annals of Probability, pages 768–793, 1984.
  • [11] Felipe Cucker and Steve Smale. Emergent behavior in flocks. IEEE Transactions on automatic control, 52(5):852–862, 2007.
  • [12] François Delarue and Alvin Tse. Uniform in time weak propagation of chaos on the torus. arXiv preprint arXiv:2104.14973, 2021.
  • [13] Roland L’vovich Dobrushin. Vlasov equations. Funktsional’nyi Analiz i ego Prilozheniya, 13(2):48–58, 1979.
  • [14] Kai Du and Lei Li. A collision-oriented interacting particle system for Landau-type equations and the molecular chaos. arXiv preprint arXiv:2408.16252, 2024.
  • [15] Antoine Georges, Gabriel Kotliar, Werner Krauth, and Marcelo J Rozenberg. Dynamical mean-field theory of strongly correlated fermion systems and the limit of infinite dimensions. Reviews of Modern Physics, 68(1):13, 1996.
  • [16] J Willard Gibbs. On the fundamental formulae of dynamics. American Journal of Mathematics, 2(1):49–64, 1879.
  • [17] Josiah Willard Gibbs. Elementary principles in statistical mechanics: developed with especial reference to the rational foundations of thermodynamics. C. Scribner’s sons, 1902.
  • [18] François Golse, Clément Mouhot, and Thierry Paul. On the mean field and classical limits of quantum mechanics. Communications in Mathematical Physics, 343:165–205, 2016.
  • [19] Carl Graham, Thomas G Kurtz, Sylvie Méléard, Philip E Protter, Mario Pulvirenti, and Denis Talay. Asymptotic behaviour of some interacting particle systems; McKean-Vlasov and Boltzmann models. Probabilistic Models for Nonlinear Partial Differential Equations: Lectures given at the 1st Session of the Centro Internazionale Matematico Estivo (CIME) held in Montecatini Terme, Italy, May 22–30, 1995, pages 42–95, 1996.
  • [20] Arnaud Guillin, Pierre Le Bris, and Pierre Monmarché. Uniform in time propagation of chaos for the 2D vortex model and other singular stochastic systems. Journal of the European Mathematical Society, 2024.
  • [21] Arnaud Guillin, Wei Liu, Liming Wu, and Chaoen Zhang. The kinetic Fokker-Planck equation with mean field interaction. Journal de Mathématiques Pures et Appliquées, 150:1–23, 2021.
  • [22] Zimo Hao, Michael Röckner, and Xicheng Zhang. Strong convergence of propagation of chaos for McKean-Vlasov SDEs with singular interactions. SIAM Journal on Mathematical Analysis, 56(2):2661–2713, 2024.
  • [23] Dirk Horstmann. From 1970 until present : the Keller-Segel model in chemotaxis and its consequences I. Jahresbericht der Deutschen Mathematiker-Vereinigung, 105(3):103–165, 2003.
  • [24] Pierre-Emmanuel Jabin. A review of the mean field limits for Vlasov equations. Kinetic and Related models, 7(4):661–711, 2014.
  • [25] Pierre-Emmanuel Jabin and Zhenfu Wang. Mean field limit for stochastic particle systems. Active Particles, Volume 1: Advances in Theory, Models, and Applications, pages 379–402, 2017.
  • [26] Pierre-Emmanuel Jabin and Zhenfu Wang. Quantitative estimates of propagation of chaos for stochastic systems with W1,W^{-1,\infty} kernels. Inventiones mathematicae, 214:523–591, 2018.
  • [27] Jean-Francois Jabir, Denis Talay, and Milica Tomašević. Mean-field limit of a particle approximation of the one-dimensional parabolic–parabolic Keller-Segel model without smoothing. Electronic Communications in Probability, 23(84):14, 2018.
  • [28] James H Jeans. On the theory of star-streaming and the structure of the universe. Monthly Notices of the Royal Astronomical Society, 76:70–84, 1915.
  • [29] Nicolai V Krylov and Michael Röckner. Strong solutions of stochastic equations with singular time dependent drift. Probability theory and related fields, 131:154–196, 2005.
  • [30] Daniel Lacker. On a strong form of propagation of chaos for McKean-Vlasov equations. Electronic Communications in Probability, 23(none):1 – 11, 2018.
  • [31] Daniel Lacker. Hierarchies, entropy, and quantitative propagation of chaos for mean field diffusions. Probability and Mathematical Physics, 4(2):377–432, 2023.
  • [32] Daniel Lacker and Luc Le Flem. Sharp uniform-in-time propagation of chaos. Probability Theory and Related Fields, 187(1-2):443–480, 2023.
  • [33] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese journal of mathematics, 2(1):229–260, 2007.
  • [34] Christian Léonard. Girsanov theory under a finite entropy condition. In Séminaire de Probabilités XLIV, pages 429–465. Springer, 2012.
  • [35] Lei Li, Yijia Tang, and Jingtong Zhang. Solving stationary nonlinear Fokker-Planck equations via sampling. arXiv preprint arXiv:2310.00544, 2023.
  • [36] Lei Li and Yuliang Wang. A sharp uniform-in-time error estimate for Stochastic Gradient Langevin Dynamics. arXiv preprint arXiv:2207.09304, 2022.
  • [37] Tau Shean Lim, Yulong Lu, and James H Nolen. Quantitative propagation of chaos in a bimolecular chemical reaction-diffusion model. SIAM Journal on Mathematical Analysis, 52(2):2098–2133, 2020.
  • [38] Yang Liu, Eunice Jun, Qisheng Li, and Jeffrey Heer. Latent space cartography: Visual analysis of vector space embeddings. In Computer graphics forum, volume 38, pages 67–78. Wiley Online Library, 2019.
  • [39] Yulong Lu. Two-scale gradient descent ascent dynamics finds mixed Nash equilibria of continuous games: A mean-field perspective. In International Conference on Machine Learning, pages 22790–22811. PMLR, 2023.
  • [40] Laurent Miclo and Pierre Del Moral. Genealogies and increasing propagation of chaos for Feynman-Kac and genetic models. The Annals of Applied Probability, 11(4):1166–1198, 2001.
  • [41] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus. SIAM review, 56(4):577–621, 2014.
  • [42] Adrian Muntean, Jens Rademacher, and Antonios Zagaris. Macroscopic and large scale phenomena: coarse graining, mean field limits and ergodicity. Springer, 2016.
  • [43] Roberto Natalini and Thierry Paul. On the mean field limit for Cucker-Smale models. arXiv preprint arXiv:2011.12584, 2020.
  • [44] Helmut Neunzert and Joachim Wick. Die approximation der lösung von integro-differentialgleichungen durch endliche punktmengen. In Numerische Behandlung nichtlinearer Integrodifferential-und Differentialgleichungen: Vorträge einer Tagung im Mathematischen Forschungsinstitut Oberwolfach, 2. 12.–7. 12. 1973, pages 275–290. Springer, 2006.
  • [45] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
  • [46] Mark S Pinsker. Information and information stability of random variables and processes. Holden-Day, 1964.
  • [47] Emmanuel Rio. Moment inequalities for sums of dependent random variables under projective conditions. Journal of Theoretical Probability, 22(1):146–163, 2009.
  • [48] Michael Röckner and Xicheng Zhang. Weak uniqueness of Fokker-Planck equations with degenerate and bounded coefficients. Comptes Rendus. Mathématique, 348(7-8):435–438, 2010.
  • [49] L Chris G Rogers and David Williams. Diffusions, Markov processes, and martingales: Itô calculus, volume 2. Cambridge university press, 2000.
  • [50] Alain-Sol Sznitman. Topics in propagation of chaos. Lecture notes in mathematics, pages 165–251, 1991.
  • [51] Milica Tomašević. Propagation of chaos for stochastic particle systems with singular mean-field interaction of LqL^{q}-LpL^{p} type. Electronic Communications in Probability, 28:1–13, 2023.
  • [52] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [53] Wei Wang, Guangying Lv, and Jinglong Wei. Small mass limit in mean field theory for stochastic n particle system. Journal of Mathematical Physics, 63(8), 2022.
  • [54] Yuelin Wang, Kai Yi, Xinliang Liu, Yu Guang Wang, and Shi Jin. ACMP: Allen-cahn message passing with attractive and repulsive forces for graph neural networks. In ICLR, 2023.
  • [55] Zibo Wang, Li Lv, Yanjie Zhang, Jinqiao Duan, and Wei Wang. Small mass limit for stochastic interacting particle systems with Lévy noise and linear alignment force. Chaos: An Interdisciplinary Journal of Nonlinear Science, 34(2), 2024.
  • [56] Xicheng Zhang. Stochastic Volterra equations in Banach spaces and stochastic partial differential equation. Journal of Functional Analysis, 258(4):1361–1425, 2010.