
Sampling from the Mean-Field Stationary Distribution

​​​​​ Yunbum Kook    Matthew S. Zhang
Sinho Chewi    Murat A. Erdogdu    Mufan (Bill) Li
School of Computer Science, Georgia Institute of Technology, yb.kook@gatech.edu. Department of Computer Science, University of Toronto, and Vector Institute, matthew.zhang@mail.utoronto.ca. School of Mathematics, Institute for Advanced Study, schewi@ias.edu. Department of Computer Science, University of Toronto, and Vector Institute, erdogdu@cs.toronto.edu. Department of Operations Research and Financial Engineering, Princeton University, mufan.li@princeton.edu.
Abstract

We study the complexity of sampling from the stationary distribution of a mean-field SDE, or equivalently, the complexity of minimizing a functional over the space of probability measures which includes an interaction term. Our main insight is to decouple the two key aspects of this problem: (1) approximation of the mean-field SDE by a finite-particle system, controlled via uniform-in-time propagation of chaos, and (2) sampling from the finite-particle stationary distribution, via standard log-concave samplers. Our approach is conceptually simpler, and its flexibility allows for incorporating the state-of-the-art for both algorithms and theory. This leads to improved guarantees in numerous settings, including better guarantees for optimizing certain two-layer neural networks in the mean-field regime. A key technical contribution is to establish a new uniform-in-$N$ log-Sobolev inequality for the stationary distribution of the mean-field Langevin dynamics.

1 Introduction

The minimization of energy functionals $\mathcal{E}$ over the Wasserstein space $\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$ of probability measures has attracted substantial research activity in recent years, encompassing numerous application domains, including distributionally robust optimization [Kuh+19, YKW22], sampling [JKO98, Wib18, Che24], and variational inference [LW16, Lam+22, Dia+23, JCP23, Lac23a, YY23].

A canonical example of such a functional is $\mathcal{E}(\mu)=\int V\,\mathrm{d}\mu+\int\log\mu\,\mathrm{d}\mu$, where $V:\mathbb{R}^{d}\to\mathbb{R}$ is called the potential. Up to an additive constant, which is irrelevant for the optimization, this energy functional equals the KL divergence $\operatorname{\mathsf{KL}}(\mu\mathbin{\|}\pi)$ with respect to the density $\pi\propto\exp(-V)$, and the celebrated result of [JKO98] identifies the Wasserstein gradient flow of $\mathcal{E}$ with the Langevin diffusion. This link has inspired a well-developed theory for log-concave sampling, with applications to Bayesian inference and randomized algorithms; see [Che24] for an exposition.

The energy functional above contains two terms, corresponding to two of the fundamental examples of functionals considered in Villani’s well-known treatise on optimal transport [Vil03]. Namely, they are the “potential energy” and the entropy, the latter being a special case of the “internal energy.” However, Villani identifies a third fundamental functional—the “interaction energy”—with the pairwise form given by

\mathcal{E}(\mu)\coloneqq\int V(x)\,\mu(\mathrm{d}x)+\iint W(x-y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y)+\frac{\sigma^{2}}{2}\int\log\mu(x)\,\mu(\mathrm{d}x)\,. \qquad (\mathsf{pE})

More generally, in this work we consider minimizing the generic entropy-regularized energy

\mathcal{E}(\mu)\coloneqq\mathcal{F}(\mu)+\frac{\sigma^{2}}{2}\int\log\mu\,\mathrm{d}\mu\,, \qquad (\mathsf{gE})

where $\mathcal{F}:\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})\to\mathbb{R}$ is a known functional. The minimization of the energy ($\mathsf{gE}$) has recently been of interest due to its role in analysing neural network training dynamics in the mean-field regime, including with [SNW22] and without [CB18, MMN18] entropic regularization, as well as with Fisher regularization [Cla+23].

For the sake of exposition, let us first focus on minimizing the pairwise energy ($\mathsf{pE}$). A priori, this question is more difficult than log-concave sampling; for instance, $\pi$ does not admit a closed form but rather is the solution to a non-linear equation

\pi(x)\propto\exp\Bigl(-\frac{2}{\sigma^{2}}\,V(x)-\frac{2}{\sigma^{2}}\int W(x-\cdot)\,\mathrm{d}\pi\Bigr)\,. \qquad (1.1)

However, here too there is a well-developed mathematical theory which suggests a principled algorithmic approach. Just as the Wasserstein gradient flow of ($\mathsf{pE}$) in the case $W=0$ can be identified with the Langevin diffusion, the Wasserstein gradient flow of ($\mathsf{pE}$) in the case $W\neq 0$ corresponds to a (pairwise) McKean–Vlasov SDE, i.e., an SDE whose coefficients depend on the marginal law of the process, given below as

\mathrm{d}X_{t}=-\Bigl(\nabla V(X_{t})+\int\nabla W(X_{t}-\cdot)\,\mathrm{d}\pi_{t}\Bigr)\,\mathrm{d}t+\sigma\,\mathrm{d}B_{t}\,, \qquad (\mathsf{pMV})

where $\pi_{t}=\operatorname{\mathsf{law}}(X_{t})$, $W$ is even, and $\{B_{t}\}_{t\geq 0}$ is a standard Brownian motion on $\mathbb{R}^{d}$. Since the McKean–Vlasov SDE is the so-called mean-field limit of interacting particle systems, we can approximately sample from the minimizer $\pi$ by numerically discretizing a system of SDEs, which describe the evolution of $N$ particles $\{X_{t}^{1:N}\}_{t\geq 0}\coloneqq\{(X^{1}_{t},\dotsc,X^{N}_{t})\}_{t\geq 0}$ as:

\mathrm{d}X_{t}^{i}=-\Bigl(\nabla V(X_{t}^{i})+\frac{1}{N-1}\sum_{j\in[N]\setminus i}\nabla W(X_{t}^{i}-X_{t}^{j})\Bigr)\,\mathrm{d}t+\sigma\,\mathrm{d}B_{t}^{i}\,,\quad\forall\,i\in[N]\,, \qquad (\mathsf{pMV}_{N})

where $\{B^{i}:i\in[N]\}$ is a collection of independent Brownian motions. Moreover, the error from approximating the mean-field limit via this finite-particle system has been studied in the literature on propagation of chaos [Szn91]. Similarly, the Wasserstein gradient flow for ($\mathsf{gE}$) corresponds to the mean-field Langevin dynamics and admits an analogous particle approximation.
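As a point of reference for what follows, here is a minimal sketch of the standard baseline, the Euler–Maruyama discretization of ($\mathsf{pMV}_{N}$), in Python; the quadratic potentials, the step size h, and the function names are illustrative assumptions rather than choices made in this paper.

import numpy as np

def euler_maruyama_pmv_n(grad_V, grad_W, N, d, sigma, h, n_steps, rng):
    # Simulate the N-particle system (pMV_N) with Euler-Maruyama steps.
    X = rng.standard_normal((N, d))                      # initial particles X_0^{1:N}
    for _ in range(n_steps):
        diffs = X[:, None, :] - X[None, :, :]            # pairwise differences X^i - X^j
        gW = grad_W(diffs)                               # grad W applied entrywise, shape (N, N, d)
        gW[np.arange(N), np.arange(N)] = 0.0             # drop the j = i terms
        drift = -(grad_V(X) + gW.sum(axis=1) / (N - 1))  # -grad V(X^i) - (1/(N-1)) sum_{j != i} grad W(X^i - X^j)
        X = X + h * drift + sigma * np.sqrt(h) * rng.standard_normal((N, d))
    return X

# Illustrative choices: V(x) = ||x||^2 / 2 and W(x) = 0.1 ||x||^2 / 2, so grad V(x) = x and grad W(z) = 0.1 z.
X = euler_maruyama_pmv_n(lambda x: x, lambda z: 0.1 * z, N=200, d=2,
                         sigma=1.0, h=0.01, n_steps=2000, rng=np.random.default_rng(0))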

The bounds for propagation of chaos have been refined over time, with [LL23] recently establishing a tight error dependence $\mathcal{O}(1/N)$ on the total number of particles $N$. These bounds, however, do not translate immediately into algorithmic guarantees. Existing sampling analyses treat the propagation of chaos and the discretization as a single entangled problem, and thus far they have only been able to use the weaker $\mathcal{O}(\sqrt{1/N})$ rates for the former. Furthermore, there has been recent interest in using more sophisticated particle-based algorithms, e.g., “non-linear” Hamiltonian Monte Carlo [BS23] and the mean-field underdamped Langevin dynamics [FW23], to reduce the discretization error. Currently, this requires repeatedly carrying out the propagation of chaos and time-discretization analyses from the ground up for each instance.

This motivates us to pose the following questions: (1) Can we incorporate improvements in the propagation of chaos literature, such as the $\mathcal{O}(1/N)$ error dependence shown in [LL23], to improve existing theoretical guarantees? (2) Can we leverage recent advances in the theory of log-concave sampling to design better algorithms?

Our main proposal in this work is to decouple the error into two terms, representing the propagation of chaos and discretization errors respectively. This simple and modular approach immediately allows us to answer both questions in the affirmative. Namely, we show how to combine established propagation of chaos bounds in various settings (including the sharp rate of [LL23]) with a large class of sophisticated off-the-shelf log-concave samplers, such as interacting versions of the randomized midpoint discretization of the underdamped Langevin dynamics [SL19, HBE20], Metropolis-adjusted algorithms [Che+21, WSC22, AC23], and the proximal sampler [LST21, Che+22, FYC23]. Our framework yields improvements upon the prior state-of-the-art, such as [BS23, FW23], and provides a clear path for future ones.

1.1 Contributions and Organization

Propagation of chaos at stationarity.

We provide three propagation of chaos results which hold in the $\mathcal{W}_{2}$, $\sqrt{\operatorname{\mathsf{KL}}}$, and $\sqrt{\operatorname{\mathsf{FI}}}$ “metrics”; the rates reflect the distance of the $k$-particle marginal of the finite-particle system from $\pi^{\otimes k}$: (1) In the setting of ($\mathsf{pE}$), under strong displacement convexity, we obtain a $\mathcal{O}(\sqrt{k/N})$ rate by adapting techniques from [Szn91, Mal01]; (2) without assuming displacement convexity, but assuming a weaker interaction, we obtain the sharp rate of $\widetilde{\mathcal{O}}(k/N)$ following [LL23]; (3) finally, in the general setting of ($\mathsf{gE}$), and assuming $\mathcal{F}$ is convex along linear interpolations, we obtain a $\mathcal{O}(\sqrt{k/N})$ rate following [CRW22].

Unlike prior works, our proofs are carried out at stationarity; thus, they are self-contained, streamlined, and include various improvements (e.g., weaker assumptions and explicit bounds). As a result, our work also serves as a helpful exposition of the mathematics of propagation of chaos.

Discretization.

Once the error due to particle approximation is controlled, we then obtain improved complexity guarantees by applying recent advances in the theory of log-concave sampling to the finite-particle stationary distribution. See Table 1 for a summary of our results, and the discussion in §4 for comparisons with prior works and an application to neural network training.

Once again, the importance of our framework is its modularity, which allows for any combination of uniform-in-time propagation of chaos bounds and log-concave sampler, provided that the finite-particle stationary distribution satisfies certain isoperimetric properties needed for the sampling guarantees. Toward this end, we also provide tools for verifying these isoperimetric properties with constants that hold independently of the number of particles (see §3.2.1).

1.2 Related Work

Mean-field equations.

The McKean–Vlasov SDE was first formulated in the works [McK66, Fun84, Mél96], with origins dating to much earlier [Bol72]. It has applications in many domains, from fluid dynamics [Vil02] to game theory [LL07, CD18]; see [CD22, CD22a] for a comprehensive survey. The kinetic version of this equation is known as the Boltzmann equation, and propagation of chaos has similarly been studied under a variety of assumptions [BGM10, Mon17, GM21, GLM22]. One prominent application within machine learning is the study of infinitely wide two-layer neural networks in the mean-field regime (see §4.2).

Propagation of chaos and sampling for (𝗉𝖤\mathsf{pE}).

The original propagation of chaos arguments of [Szn91] were first made uniform in time in [Mal01, Mal03], in both entropy and $\mathcal{W}_{2}$. The aforementioned works all achieve an error of order $\widetilde{\mathcal{O}}(\sqrt{k/N})$ and require a strong convexity assumption on $V$ and $W$. These were later adapted for non-smooth potentials [JW17, JW18, BJW23]. Finally, [CRW22] obtained an entropic propagation of chaos bound under a higher-order smoothness assumption. See [CD22] for a more complete bibliography.

The breakthrough result of [Lac23] obtained the sharp bound of $\widetilde{\mathcal{O}}(k/N)$ when the interaction is sufficiently weak, and this bound was made uniform in time in [LL23]. Their approach differs significantly from previous proofs by considering a local analysis based on the recursive BBGKY hierarchy. These results have been extended to other divergences, e.g., the $\chi^{2}$ divergence, but without a uniform-in-time guarantee [HR23]. In addition, [MRW24] showed an extension of this result under a “convexity at infinity” assumption.

The question of sampling from minimizers of ($\mathsf{pE}$) was first studied in [Tal96, BT97, AK02]. These works focused on the Euler–Maruyama discretization of the finite-particle system ($\mathsf{pMV}_{N}$), under $L^{\infty}$-boundedness of the gradients. Subsequently, the convergence of the Euler–Maruyama scheme has been studied in many works, including but not limited to [BH22, RES22, Li+23]. The strategy of disentangling the finite-particle error from the time discretization also appears in [KHK24], which approaches the problem from the perspective of stochastic approximation; this work, however, is not focused on obtaining quantitative guarantees. Finally, [BS23] considered a non-linear version of Hamiltonian Monte Carlo; we give a detailed comparison with their work in §4.

Propagation of chaos and sampling for (𝗀𝖤\mathsf{gE}).

The mean-field (underdamped) Langevin algorithm for minimizing ($\mathsf{gE}$) was proposed and studied in [CRW22, Che+24]. Under alternative assumptions (see §3.1.2), they established propagation of chaos with a $\mathcal{O}(\sqrt{k/N})$ rate, for both the overdamped and the underdamped finite-particle approximations. Recent works from the machine learning community [NWS22, SNW22, FW23, SWN23] studied the application of these algorithms for optimizing two-layer neural networks and obtained sampling guarantees. We provide a detailed comparison with their works in §4.2.

2 Background and Notation

Let $\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$ be the set of probability measures on $\mathbb{R}^{d}$ that admit a density with respect to the Lebesgue measure and have finite second moment. We will also abuse notation and use the same symbol for a measure and its density when there is no confusion. We use superscripts for the particle index, and subscripts for the time variable. We will use $\mathcal{O},\widetilde{\mathcal{O}}$ to signify upper bounds up to numeric constants and polylogarithms respectively. We recall the definitions of convexity and smoothness:

Definition 1.

A function $U:\mathbb{R}^{d}\to\mathbb{R}$ is $\alpha$-uniformly convex (allowing for $\alpha\leq 0$) and $\beta$-smooth if the following hold respectively:

\langle\nabla U(x)-\nabla U(y),x-y\rangle \geq \alpha\,\lVert x-y\rVert^{2} \qquad\text{for all } x,y\in\mathbb{R}^{d}\,,
\lVert\nabla U(x)-\nabla U(y)\rVert \leq \beta\,\lVert x-y\rVert \qquad\text{for all } x,y\in\mathbb{R}^{d}\,.

For two probability measures $\mu,\nu\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$, we define the $\mathsf{KL}$ divergence and the (relative) Fisher information by

\mathsf{KL}(\mu\mathbin{\|}\nu)\coloneqq\operatorname{\mathbb{E}}_{\mu}\Bigl[\log\frac{\mu}{\nu}\Bigr]\qquad\text{and}\qquad\mathsf{FI}(\mu\mathbin{\|}\nu)\coloneqq\operatorname{\mathbb{E}}_{\mu}\Bigl[\bigl\lVert\nabla\log\frac{\mu}{\nu}\bigr\rVert^{2}\Bigr]\,,

with the convention $\mathsf{KL}(\mu\mathbin{\|}\nu)=\mathsf{FI}(\mu\mathbin{\|}\nu)=\infty$ whenever $\mu\not\ll\nu$.

We recall the definition of the log-Sobolev inequality, which is used both for propagation of chaos arguments as well as mixing time bounds.

Definition 2 (Log-Sobolev Inequality).

A measure $\pi$ satisfies a log-Sobolev inequality with parameter $C_{\mathsf{LSI}}$ if for all $\mu\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$,

\mathsf{KL}(\mu\mathbin{\|}\pi)\leq\frac{C_{\mathsf{LSI}}}{2}\,\mathsf{FI}(\mu\mathbin{\|}\pi)\,. \qquad (\mathsf{LSI})

When $\log(1/\pi)$ is $\alpha$-uniformly convex for $\alpha>0$, it follows from the Bakry–Émery condition that $\pi$ satisfies (LSI) with constant $C_{\mathsf{LSI}}\leq 1/\alpha$ [BGL14, Proposition 5.7.1].

We can also define the $p$-Wasserstein distance $\mathcal{W}_{p}(\mu,\pi)$, $p\geq 1$, between $\mu,\pi$ as

\mathcal{W}_{p}^{p}(\mu,\pi)=\inf_{\gamma\in\Gamma(\mu,\pi)}\int\lVert x-y\rVert^{p}\,\gamma(\mathrm{d}x,\mathrm{d}y)\,,

where $\Gamma(\mu,\pi)$ is the set of all joint probability measures on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ with marginals $\mu,\pi$ respectively.

Lastly, we recall that the celebrated Otto calculus interprets the space $\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$, equipped with the $\mathcal{W}_{2}$ metric, as a formal Riemannian manifold [Ott01]. In particular, the Wasserstein gradient of a functional $\mathcal{L}:\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})\to\mathbb{R}\cup\{\infty\}$ is given as $\nabla_{\mathcal{W}_{2}}\mathcal{L}=\nabla\delta\mathcal{L}$. Here, $\delta\mathcal{L}$ is the first variation, defined as follows: for all $\nu_{0},\nu_{1}\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$, $\delta\mathcal{L}(\nu_{0}):\mathbb{R}^{d}\to\mathbb{R}$ satisfies

\lim_{t\searrow 0}\frac{\mathcal{L}((1-t)\,\nu_{0}+t\,\nu_{1})-\mathcal{L}(\nu_{0})}{t}=\langle\delta\mathcal{L}(\nu_{0}),\nu_{1}-\nu_{0}\rangle\coloneqq\int\delta\mathcal{L}(\nu_{0})\,\mathrm{d}(\nu_{1}-\nu_{0})\,.

The first variation is defined up to an additive constant, but the Wasserstein gradient is unambiguous. See [AGS08] for a rigorous development. As a shorthand, we will write $\delta\mathcal{L}(\nu_{0},x)\coloneqq\delta\mathcal{L}(\nu_{0})(x)$ and similarly $\nabla_{\mathcal{W}_{2}}\mathcal{L}(\nu_{0},x)\coloneqq\nabla_{\mathcal{W}_{2}}\mathcal{L}(\nu_{0})(x)$.
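As a quick illustration of these definitions (a routine computation, included here only for concreteness), consider the three building blocks of ($\mathsf{pE}$) with $W$ even:

\delta\Bigl(\int V\,\mathrm{d}\mu\Bigr)(\mu,x)=V(x)\,,\qquad\delta\Bigl(\int\log\mu\,\mathrm{d}\mu\Bigr)(\mu,x)=\log\mu(x)+1\,,\qquad\delta\Bigl(\iint W(\cdot-\cdot)\,\mathrm{d}\mu\,\mathrm{d}\mu\Bigr)(\mu,x)=2\int W(x-y)\,\mu(\mathrm{d}y)\,,

so that, for instance, $\nabla_{\mathcal{W}_{2}}(\int V\,\mathrm{d}\mu)(\mu,x)=\nabla V(x)$ and $\nabla_{\mathcal{W}_{2}}(\frac{\sigma^{2}}{2}\int\log\mu\,\mathrm{d}\mu)(\mu,x)=\frac{\sigma^{2}}{2}\,\nabla\log\mu(x)$. In particular, the continuity equation $\partial_{t}\pi_{t}=\nabla\cdot(\pi_{t}\,\nabla_{\mathcal{W}_{2}}\mathcal{E}(\pi_{t},\cdot))$ for the energy ($\mathsf{gE}$) coincides with the Fokker–Planck equation of the general McKean–Vlasov SDE introduced below.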

2.1 SDE Systems and Their Stationary Distributions

2.1.1 The Pairwise McKean–Vlasov Setting

In the formalism introduced in the previous section, we note that ($\mathsf{pMV}$) can be interpreted as the Wasserstein gradient flow for ($\mathsf{pE}$). In this paper, we refer to ($\mathsf{pMV}$) as the pairwise McKean–Vlasov process. As noted in the introduction, it has the stationary distribution (1.1), which minimizes ($\mathsf{pE}$).

Recall also that the equation ($\mathsf{pMV}$) is the mean-field limit of the finite-particle system ($\mathsf{pMV}_{N}$). This $N$-particle system has the following stationary distribution: for $x^{1:N}=[x^{1},\dotsc,x^{N}]\in\mathbb{R}^{d\times N}$,

\mu^{1:N}(x^{1:N})\propto\exp\Bigl(-\frac{2}{\sigma^{2}}\sum_{i\in[N]}V(x^{i})-\frac{1}{\sigma^{2}\,(N-1)}\sum_{i\in[N]}\sum_{j\in[N]\setminus i}W(x^{i}-x^{j})\Bigr)\,. \qquad (2.1)

The system ($\mathsf{pMV}_{N}$) can be viewed as an approximation to ($\mathsf{pMV}$), with the expectation term in the drift replaced by an empirical average. Note that the measure $\mu^{1:N}$ is exchangeable.¹ While the standard approach is to apply an Euler–Maruyama discretization to ($\mathsf{pMV}_{N}$) in order to sample from ($\mathsf{pMV}$), our perspective is to apply more sophisticated samplers to $\mu^{1:N}$ directly. Indeed, unlike (1.1), the finite-particle stationary distribution (2.1) is explicit and amenable to sampling methods.

¹Exchangeability refers to the property that the law of $[x^{1},\ldots,x^{N}]$ equals the law of $[x^{\sigma(1)},\ldots,x^{\sigma(N)}]$ for any permutation $\sigma$ of $\{1,\ldots,N\}$.
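Since our samplers only interact with $\mu^{1:N}$ through its unnormalized log-density and gradient, here is a minimal sketch of such a first-order oracle for (2.1); the vectorized interfaces for $V$, $W$ and their gradients are illustrative assumptions.

import numpy as np

def log_mu_oracle(X, V, grad_V, W, grad_W, sigma):
    # First-order oracle for the N-particle stationary density (2.1).
    # X has shape (N, d); V, W map (..., d) -> (...); grad_V, grad_W map (..., d) -> (..., d); W is even.
    N = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                # X^i - X^j, shape (N, N, d)
    off_diag = ~np.eye(N, dtype=bool)
    pair_sum = W(diffs)[off_diag].sum()                  # sum_{i != j} W(X^i - X^j)
    log_density = -(2 / sigma**2) * V(X).sum() - pair_sum / (sigma**2 * (N - 1))  # up to an additive constant

    gW = grad_W(diffs)                                   # shape (N, N, d)
    gW[np.arange(N), np.arange(N)] = 0.0                 # drop the j = i terms
    # Differentiating the pair sum in x^i gives 2 * sum_{j != i} grad W(X^i - X^j), since W is even.
    grad = -(2 / sigma**2) * (grad_V(X) + gW.sum(axis=1) / (N - 1))  # shape (N, d)
    return log_density, grad

In particular, a single oracle call costs $N$ evaluations of $\nabla V$ and on the order of $N^{2}/2$ evaluations of $\nabla W$, which is the per-iteration cost accounted for in §4.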

2.1.2 The General McKean–Vlasov Setting

More generally, we consider the functional ($\mathsf{gE}$) where $\mathcal{F}$ is of the form $\mathcal{F}(\mu)=\mathcal{F}_{0}(\mu)+\frac{\lambda}{2}\int\lVert\cdot\rVert^{2}\,\mathrm{d}\mu$ with $\lambda\geq 0$. The second term acts as regularization and is common in the literature [FW23, SWN23]. We can describe its Wasserstein gradient flow as the marginal law of a particle trajectory satisfying the following SDE, which we call the general McKean–Vlasov equation:

\mathrm{d}X_{t}=\{-\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\pi_{t},X_{t})-\lambda X_{t}\}\,\mathrm{d}t+\sigma\,\mathrm{d}B_{t}\,, \qquad (\mathsf{gMV})

where $\pi_{t}=\operatorname{\mathsf{law}}(X_{t})$, and $\{B_{t}\}_{t\geq 0}$ is a standard Brownian motion on $\mathbb{R}^{d}$. The stationary distribution $\pi$ of ($\mathsf{gMV}$), and its linearization $\pi_{\mu}$ around a measure $\mu\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$, satisfy the following equations:

\pi(x)\propto\exp\Bigl(-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\pi,x)-\frac{\lambda\,\lVert x\rVert^{2}}{\sigma^{2}}\Bigr)\qquad\text{and}\qquad\pi_{\mu}(x)\propto\exp\Bigl(-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\mu,x)-\frac{\lambda\,\lVert x\rVert^{2}}{\sigma^{2}}\Bigr)\,. \qquad (2.2)

The latter is called the proximal Gibbs distribution with respect to $\mu$. The general dynamics corresponds to the mean-field limit of the following finite-particle system, described by an $N$-tuple of stochastic processes $\{X_{t}^{1:N}\}_{t\geq 0}\coloneqq\{(X^{1}_{t},\dotsc,X^{N}_{t})\}_{t\geq 0}$:

\mathrm{d}X_{t}^{i}=\{-\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{X_{t}^{1:N}},X_{t}^{i})-\lambda X_{t}^{i}\}\,\mathrm{d}t+\sigma\,\mathrm{d}B_{t}^{i}\,, \qquad (\mathsf{gMV}_{N})

where $\rho_{x^{1:N}}=\frac{1}{N}\sum_{i=1}^{N}\delta_{x^{i}}$ is the empirical measure of the particle system. The stationary distribution for ($\mathsf{gMV}_{N}$) is given as follows [CRW22, (2.16)]: for $x^{1:N}=[x^{1},\dotsc,x^{N}]\in\mathbb{R}^{d\times N}$,

\mu^{1:N}(x^{1:N})\propto\exp\Bigl(-\frac{2N}{\sigma^{2}}\,\mathcal{F}_{0}(\rho_{x^{1:N}})-\frac{\lambda}{\sigma^{2}}\,\lVert x^{1:N}\rVert^{2}\Bigr)\,. \qquad (2.3)

One can show that $\nabla_{x^{i}}\mathcal{F}_{0}(\rho_{x^{1:N}})=\frac{1}{N}\,\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})$, and hence ($\mathsf{gMV}_{N}$) is simply the Langevin diffusion corresponding to the stationary measure (2.3). Moreover, when $\lambda=0$ and choosing $\mathcal{F}_{0}(\mu)=\int V(x)\,\mu(\mathrm{d}x)+\iint W(x-y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y)$, the equations ($\mathsf{gMV}$), (2.2), ($\mathsf{gMV}_{N}$), and (2.3) reduce to ($\mathsf{pMV}$), (1.1), ($\mathsf{pMV}_{N}$), and (2.1), respectively.
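To illustrate the identity $\nabla_{x^{i}}\mathcal{F}_{0}(\rho_{x^{1:N}})=\frac{1}{N}\,\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})$, here is the (routine) check for the two building blocks of the pairwise case:

\mathcal{F}_{0}(\mu)=\int V\,\mathrm{d}\mu:\qquad\nabla_{x^{i}}\Bigl(\frac{1}{N}\sum_{j}V(x^{j})\Bigr)=\frac{1}{N}\,\nabla V(x^{i})=\frac{1}{N}\,\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})\,;

\mathcal{F}_{0}(\mu)=\iint W(x-y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y):\qquad\nabla_{x^{i}}\Bigl(\frac{1}{N^{2}}\sum_{k,l}W(x^{k}-x^{l})\Bigr)=\frac{2}{N^{2}}\sum_{j}\nabla W(x^{i}-x^{j})=\frac{1}{N}\,\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})\,,

where the last equality uses $\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\mu,x)=2\int\nabla W(x-y)\,\mu(\mathrm{d}y)$ and the evenness of $W$ (so that $\nabla W$ is odd).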

3 Technical Ingredients

Our general approach for sampling from the stationary distribution $\pi$ in either (1.1) or (2.2) is to directly apply an off-the-shelf sampler for the finite-particle stationary distribution $\mu^{1:N}$. The theoretical guarantees for this procedure require two main ingredients: (1) control of the “bias”, i.e., the error incurred by approximating $\pi$ by the $1$-particle marginal of $\mu^{1:N}$; and (2) verification of isoperimetric properties which allow for fast sampling from the measure $\mu^{1:N}$.

3.1 Bias Control via Uniform-in-Time Propagation of Chaos

In this section, we focus on the first ingredient, namely, obtaining control of the bias via uniform-in-time propagation of chaos results. Proofs for this section are given in §A.

3.1.1 Pairwise McKean–Vlasov Setting

We first consider the pairwise McKean–Vlasov setting described in §2.1.1. Our first propagation of chaos result uses the following three assumptions.

Assumption 1.

The potentials $V,W$ are $\beta_{V},\beta_{W}$-smooth, respectively.

Assumption 2.

The distribution $\pi$ satisfies (LSI) with parameter $C_{\mathsf{LSI}}(\pi)$.

Assumption 3.

The ratio $\rho\coloneqq\sigma^{4}/(8\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi))$ is at least $3$.

Remark. Note that from (1.1), we typically expect $C_{\mathsf{LSI}}^{2}(\pi)$ to also scale as $\sigma^{4}$ (e.g., in the case when $V$ and $W$ are $\alpha$-uniformly convex for $\alpha>0$). Therefore, Assumption 3 is typically invariant to the scaling of $\sigma$ and can be satisfied even for $\sigma\searrow 0$. Under these assumptions, we obtain a sharp propagation of chaos result via a similar argument as [Lac22, LL23]; we note that the former reference is more permissive regarding the constant in Assumption 3 than this work.

Theorem 3 (Sharp Propagation of Chaos).

Under Assumptions 1, 2 and 3, for any $N\geq 100$ and $k\in[N]$, it holds that $\mathsf{KL}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})=\widetilde{\mathcal{O}}(dk^{2}/N^{2})$. Thus, $\mathsf{KL}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})<\varepsilon^{2}$ if

N\geq 100\vee\widetilde{\Omega}\bigl(k\sqrt{d}\,\varepsilon^{-1}\bigr)\,. \qquad (3.1)

We note that the rate in Theorem 3 is sharp; see the Gaussian case in Example 10. A condition such as Assumption 3 is in general necessary, since otherwise the minimizer of ($\mathsf{pE}$) may not even be unique (see the example and discussion in [LL23]). However, it can be restrictive, as it requires the interaction to be sufficiently weak. With the following convexity assumption, we can obtain a propagation of chaos result without Assumption 3.

Assumption 4.

The potentials $V,W$ are $\alpha_{V},\alpha_{W}$-uniformly convex with $\alpha_{V}+\alpha_{W}^{-}>0$, where $\alpha_{W}^{-}\coloneqq\alpha_{W}\wedge 0$ denotes the negative part of $\alpha_{W}$.

The following weaker result consists of two parts. The first, a Wasserstein propagation of chaos result, is based on [Szn91]. The second, building on the first, is a uniform-in-time entropic propagation of chaos bound following from a Fisher information bound. The arguments are similar to those in [Mal01, Mal03], albeit simplified (since we work at stationarity) and presented here with explicit constants.

Theorem 4 (Weak Propagation of Chaos).

Under Assumptions 1 and 4, for any $N\geq\frac{\alpha_{V}-\alpha_{W}^{-}}{\alpha_{V}+\alpha_{W}^{-}}\vee 2$, if we denote $\alpha\coloneqq\alpha_{V}+\alpha_{W}^{-}$, then

\mathcal{W}_{2}^{2}(\mu^{1:k},\pi^{\otimes k}) \leq \frac{4\beta_{W}^{2}\sigma^{2}d}{\alpha^{3}}\,\frac{k}{N}\,, \qquad (3.2)
\mathsf{KL}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})\leq\frac{\sigma^{2}}{4\alpha}\,\operatorname{\mathsf{FI}}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k}) \leq \frac{132\beta_{W}^{2}\,(\beta_{V}+\beta_{W})^{2}\,d}{\alpha^{4}}\,\frac{k}{N}\,. \qquad (3.3)

3.1.2 General McKean–Vlasov Setting

In the more general case where we aim to minimize ($\mathsf{gE}$) for a generic functional $\mathcal{F}$ of the form $\mathcal{F}(\mu)=\mathcal{F}_{0}(\mu)+\frac{\lambda}{2}\int\lVert\cdot\rVert^{2}\,\mathrm{d}\mu$, we impose the following assumptions. They can largely be seen as generalizations of the conditions for the pairwise case, and they are inherited from [CRW22, SWN23]. There is an additional convexity condition (Assumption 5), which in the pairwise McKean–Vlasov setting amounts to positive semidefiniteness of the kernel $(x,y)\mapsto W(x-y)$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$; thus, in general, the following assumptions are incomparable with the ones in §3.1.1.

Assumption 5.

The functional $\mathcal{F}_{0}$ is convex in the usual sense: for all $\nu_{0},\nu_{1}\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$ and $t\in[0,1]$,

\mathcal{F}_{0}((1-t)\,\nu_{0}+t\,\nu_{1})\leq(1-t)\,\mathcal{F}_{0}(\nu_{0})+t\,\mathcal{F}_{0}(\nu_{1})\,.
Assumption 6.

The functional $\mathcal{F}_{0}$ is smooth in the sense that for all $x,y\in\mathbb{R}^{d}$ and $\nu,\nu^{\prime}\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$, there is a uniform constant $\beta$ such that

\lVert\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\nu,x)-\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\nu^{\prime},y)\rVert\leq\beta\,(\lVert x-y\rVert+\mathcal{W}_{1}(\nu,\nu^{\prime}))\,.
Assumption 7.

The proximal Gibbs measures satisfy (LSI) with a uniform constant: namely, it holds that $C_{\mathsf{LSI}}(\pi)\vee\sup_{\mu\in\mathcal{P}_{2}(\mathbb{R}^{d})}C_{\mathsf{LSI}}(\pi_{\mu})\leq\overline{C}_{\mathsf{LSI}}$.

Remark. These assumptions taken together cover settings not covered in the preceding sections, including optimization of two-layer neural networks. See [CRW22, Remark 3.1] and §4.2.

Under these assumptions, we can derive an entropic propagation of chaos bound by following the proof of [CRW22]. Through a tighter analysis, we are able to reduce the dependence on the condition number $\kappa\coloneqq\overline{C}_{\mathsf{LSI}}\beta/\sigma^{2}$ from $\kappa^{2}$ to $\kappa$.

Theorem 5 (Propagation of Chaos for General Functionals).

Under Assumptions 5, 6, and 7, for $N\geq 160\beta\overline{C}_{\mathsf{LSI}}/\sigma^{2}$, we have

\frac{1}{2\overline{C}_{\mathsf{LSI}}}\,\mathcal{W}_{2}^{2}(\mu^{1:k},\pi^{\otimes k})\leq\operatorname{\mathsf{KL}}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})\leq\frac{33\beta\overline{C}_{\mathsf{LSI}}dk}{\sigma^{2}N}\,.

Among these assumptions, the hardest to verify is the uniform LSI of Assumption 7. Following [SWN23], we introduce the following sufficient condition for the validity of Assumption 7; see Lemma 22 for a more precise statement.

Assumption 8.

There exists a uniform bound on the Wasserstein gradient of the interaction term $\mathcal{F}_{0}$: for some constant $B<\infty$ and all $\mu\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d})$, $x\in\mathbb{R}^{d}$, $\lVert\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\mu,x)\rVert\leq B$.

Lemma 6 (Informal).

Assumptions 6 and 8 imply Assumption 7 with an explicit constant $\overline{C}_{\mathsf{LSI}}$, given in terms of $B$, $\beta$, $\lambda$, and $\sigma$.

3.2 Isoperimetric Properties of the Stationary Distributions

In this section, we verify the isoperimetric properties of $\pi,\mu^{1:N}$ in the ($\mathsf{pMV}$) setting, with proofs provided in §B.

3.2.1 Pairwise McKean–Vlasov Setting

If $V$, $W$ satisfy Assumptions 1 and 4 (i.e., $V$ and $W$ have bounded Hessians), then the potential for (1.1), i.e., $\log(1/\pi)$, is $\frac{2}{\sigma^{2}}\,(\alpha_{V}+\alpha_{W})$-convex and $\frac{2}{\sigma^{2}}\,(\beta_{V}+\beta_{W})$-smooth. By the Bakry–Émery condition, $\pi$ satisfies (LSI) with parameter $C_{\mathsf{LSI}}(\pi)\leq\sigma^{2}/(2\,(\alpha_{V}+\alpha_{W}))$.

Similarly, for the invariant measure $\mu^{1:N}$ in (2.1), we can prove the following.

Lemma 7.

If $V$ and $W$ satisfy Assumption 1, then $\log(1/\mu^{1:N})$ is $\frac{2}{\sigma^{2}}\,(\beta_{V}+\frac{N}{N-1}\,\beta_{W})$-smooth.

If $V$ and $W$ satisfy Assumption 4, then $\log(1/\mu^{1:N})$ is $\frac{2}{\sigma^{2}}\,(\alpha_{V}+\frac{N}{N-1}\,\alpha_{W}^{-})$-convex.²

²Only the negative part of $\alpha_{W}$ contributes to the strong log-concavity of $\mu^{1:N}$. This is consistent with [Vil03, Theorem 5.15], which asserts that when $\alpha_{W}>0$, the interaction energy $\mu\mapsto\iint W(x-y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y)$ is $\alpha_{W}$-strongly displacement convex over the subspace of probability measures with fixed mean, but only weakly convex over the full Wasserstein space.

We now consider the non-log-concave case. It is standard in the sampling literature that the assumption of (LSI) for the stationary distribution yields mixing time guarantees. Since our strategy is to sample from (2.1), we therefore seek an LSI for $\mu^{1:N}$, formalized as the following assumption.

Assumption 9.

The distribution $\mu^{1:N}$ satisfies (LSI) with parameter $C_{\mathsf{LSI}}(\mu^{1:N})$.

In this section, we provide an easily verifiable condition, combining a Holley–Stroock condition [HS87] with a weak interaction condition, for this assumption to hold with an $N$-independent constant.

Assumption 10.

The potentials $V$ and $W$ can be decomposed as $V=V_{0}+V_{1}$ and $W=W_{0}+W_{1}$ such that $V_{0}$, $W_{0}$ satisfy Assumption 4 and $\operatorname{\mathsf{osc}}(V_{1}),\operatorname{\mathsf{osc}}(W_{1})<\infty$, where for a function $U:\mathbb{R}^{d}\to\mathbb{R}$ we define $\operatorname{\mathsf{osc}}(U)\coloneqq\sup U-\inf U$. Furthermore, the following weak interaction condition holds:

\frac{\sigma^{2}}{\beta_{W}\overline{C}_{\mathsf{LSI}}}\geq\sqrt{6}\,,\qquad\text{where}\qquad\overline{C}_{\mathsf{LSI}}\coloneqq\frac{\sigma^{2}}{\alpha_{V_{0}}+\frac{N}{N-1}\,\alpha_{W_{0}}^{-}}\exp\Bigl(\frac{2}{\sigma^{2}}\,\bigl(\operatorname{\mathsf{osc}}(V_{1})+\operatorname{\mathsf{osc}}(W_{1})\bigr)\Bigr)\,.

A careful application of the Holley–Stroock perturbation principle yields the following lemma.

Lemma 8.

Under Assumption 10, $\pi$ and $\mu^{1:N}$ satisfy (LSI) with parameters

C_{\mathsf{LSI}}(\pi) \leq \frac{\sigma^{2}}{2\,(\alpha_{V_{0}}+\alpha_{W_{0}})}\exp\Bigl(\frac{2}{\sigma^{2}}\,\bigl(\operatorname{\mathsf{osc}}(V_{1})+\operatorname{\mathsf{osc}}(W_{1})\bigr)\Bigr)\leq\frac{1}{2}\,\overline{C}_{\mathsf{LSI}}\,, \qquad (3.4)
C_{\mathsf{LSI}}(\mu^{1:N}) \leq \overline{C}_{\mathsf{LSI}}\,.

In particular, Assumption 3 holds.

3.2.2 General McKean–Vlasov Setting

In the setting ($\mathsf{gE}$) with $\mathcal{F}(\mu)=\mathcal{F}_{0}(\mu)+\frac{\lambda}{2}\int\lVert\cdot\rVert^{2}\,\mathrm{d}\mu$, we verify that Assumption 8 yields (LSI) for $\pi$. See Corollary 25 for a more precise statement.

Lemma 9 (Informal).

In the mean-field Langevin setting of §2.1.2, suppose that Assumption 8 holds. Then, Assumption 9 holds with $C_{\mathsf{LSI}}(\mu^{1:N})$ depending on $d$, $B$, $\beta$, $\lambda$, and $\sigma$, but not on $N$.

Obtaining Lemma 9 is not straightforward, and we rely on a novel argument combining heat flow estimates from [BP24] with a generalized version of the propagation of chaos result in Theorem 5; see §B.3 for details.

4 Sampling from the Mean-Field Target

In this section, we present results for sampling from $\pi$. As outlined in Algorithm 1, we use off-the-shelf log-concave samplers to sample from $\mu^{1:N}$, during which we access a first-order oracle for $\mu^{1:N}$, i.e., an oracle for evaluation of $\log\mu^{1:N}$ up to an additive constant, and for evaluation of $\nabla\log\mu^{1:N}$. (For our results involving the proximal sampler, we also assume access to a proximal oracle for simplicity.) For $N$ sufficiently large, the first particle given by Algorithm 1 is approximately distributed according to $\pi$: writing $\hat{\mu}^{1:N}$ for the law of the output of the log-concave sampler and $\hat{\mu}^{1}$ for its $1$-particle marginal distribution,

\mathcal{W}_{2}(\hat{\mu}^{1},\pi)\leq\mathcal{W}_{2}(\hat{\mu}^{1},\mu^{1})+\mathcal{W}_{2}(\mu^{1},\pi)\leq\sqrt{\frac{1}{N}}\,\mathcal{W}_{2}(\hat{\mu}^{1:N},\mu^{1:N})+\mathcal{W}_{2}(\mu^{1},\pi)\,,

where the first inequality is the triangle inequality and the second follows from exchangeability (Lemma 27). A similar decomposition also holds for $\operatorname{\mathsf{KL}}$, although the argument is more technical; we defer its presentation to §E.
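For completeness, here is a short sketch of the exchangeability step (the precise statement is Lemma 27); it is a standard argument, recorded here under the assumption that $\hat{\mu}^{1:N}$ is itself exchangeable, which holds when the sampler treats the particles symmetrically. For any coupling $\gamma$ of $(\hat{\mu}^{1:N},\mu^{1:N})$, the $i$-th coordinate marginal of $\gamma$ couples $\hat{\mu}^{i}$ with $\mu^{i}$, so that

N\,\mathcal{W}_{2}^{2}(\hat{\mu}^{1},\mu^{1})=\sum_{i=1}^{N}\mathcal{W}_{2}^{2}(\hat{\mu}^{i},\mu^{i})\leq\operatorname{\mathbb{E}}_{\gamma}\sum_{i=1}^{N}\lVert\hat{X}^{i}-X^{i}\rVert^{2}\,,

where the first equality uses exchangeability of both measures. Taking the infimum over $\gamma$ yields $\mathcal{W}_{2}(\hat{\mu}^{1},\mu^{1})\leq N^{-1/2}\,\mathcal{W}_{2}(\hat{\mu}^{1:N},\mu^{1:N})$.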

Algorithm 1: Sampling from the Mean-Field Stationary Distribution
Input: the total number of particles $N$, a log-concave sampler $\mathsf{LC\text{-}Sampler}$.
Output: $k$ particles $\hat{X}^{1:k}$.

1: Sample $\hat{X}^{1:N}\sim\hat{\mu}^{1:N}$ via $\mathsf{LC\text{-}Sampler}$, so that $\hat{\mu}^{1:N}\approx\mu^{1:N}$, e.g., in $\mathcal{W}_{2}$ or $\sqrt{\operatorname{\mathsf{KL}}}$.
2: Output the first $k$ particles $\hat{X}^{1:k}$.

To bound the second term by $\varepsilon$, it suffices to choose $N$ according to the propagation of chaos results in §3.1. Our results are summarized in Table 1, in which we record the total number $M$ of oracle calls to $\mu^{1:N}$ made by the sampler and the number of particles $N$ needed to achieve $\varepsilon$ error in the desired metric, hiding polylogarithmic factors. Note that in the pairwise McKean–Vlasov setting, each oracle call to $\mu^{1:N}$ requires $N$ calls to an oracle for $V$, and $\binom{N}{2}$ calls to an oracle for $W$.

Algorithm | “Metric” | Assumptions | $M$ | $N$
LMC | $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}$ | 1, 2, 3, 9 | $\kappa^{2}d/\varepsilon^{2}$ | $d^{1/2}/\varepsilon$
MALA–PS | $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}$ | 1, 2, 3, 9 | $\kappa d^{3/4}/\varepsilon^{1/2}$ | $d^{1/2}/\varepsilon$
ULMC–PS | $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}$ | 1, 2, 3, 9 | $\kappa^{3/2}d^{1/2}/\varepsilon$ | $d^{1/2}/\varepsilon$
ULMC+ | $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}$ | 1, 3, 4 | $\kappa d^{1/3}/\varepsilon^{2/3}$ | $d^{1/2}/\varepsilon$
LMC | $\sqrt{\operatorname{\mathsf{KL}}}$ | 1, 4 | $\kappa^{2}d/\varepsilon^{2}$ | $\kappa^{4}d/\varepsilon^{2}$
ULMC | $\sqrt{\operatorname{\mathsf{KL}}}$ | 1, 4 | $\kappa^{3/2}d^{1/2}/\varepsilon$ | $\kappa^{4}d/\varepsilon^{2}$
LMC | $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}$ | 1, 4 | $\kappa d/\varepsilon^{2}$ | $\kappa^{2}d/\varepsilon^{2}$
ULMC+ | $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}$ | 1, 4 | $\kappa d^{1/3}/\varepsilon^{2/3}$ | $\kappa^{2}d/\varepsilon^{2}$
LMC | $\sqrt{\operatorname{\mathsf{KL}}}$ | 5, 6, 7, 9 | $\kappa^{2}d/\varepsilon^{2}$ | $\kappa d/\varepsilon^{2}$
ULMC–PS | $\sqrt{\operatorname{\mathsf{KL}}}$ | 5, 6, 7, 9 | $\kappa^{3/2}d^{1/2}/\varepsilon$ | $\kappa d/\varepsilon^{2}$
Table 1: For each sampler and set of assumptions in force, we record $M$, the total number of oracle queries to $\log\mu^{1:N}$ made by the log-concave sampler, and $N$, the number of particles.

The algorithms in the table refer to: Langevin Monte Carlo (LMC); underdamped Langevin Monte Carlo (ULMC); discretizations of the underdamped Langevin diffusion via the randomized midpoint method [SL19] or the shifted ODE method [FLO21] (ULMC+); and implementation of the proximal sampler [LST21, Che+22] via the Metropolis-adjusted Langevin algorithm or via ULMC (MALA–PS and ULMC–PS respectively). Note that LMC applied to sample from $\mu^{1:N}$ is simply the Euler–Maruyama discretization of ($\mathsf{pMV}_{N}$), and likewise ULMC is the algorithm considered in [FW23]. We refer to §E for proofs and references.

To streamline the rates, we simplify the notation by defining $\beta=\beta_{V}+\beta_{W}$ if Assumption 1 holds; otherwise we use the value from Assumption 6. We let $\alpha=\alpha_{V}+\alpha_{W}^{-}$ under Assumption 4, $\alpha=\sigma^{2}/(2\max\{C_{\mathsf{LSI}}(\mu^{1:N}),C_{\mathsf{LSI}}(\pi)\})$ under Assumptions 2 and 9, and $\alpha=\sigma^{2}/(2\max\{C_{\mathsf{LSI}}(\mu^{1:N}),\overline{C}_{\mathsf{LSI}}\})$ in the general McKean–Vlasov setting.

Finally, we let $\kappa\coloneqq\beta/\alpha$ denote the condition number. We briefly justify this terminology. If the potentials of the target and of all proximal Gibbs measures are strongly convex with parameter $\alpha/\sigma^{2}$, then the Bakry–Émery condition implies that $\overline{C}_{\mathsf{LSI}}\leq\sigma^{2}/\alpha$. Hence, the scale-invariant ratio $\overline{C}_{\mathsf{LSI}}\beta/\sigma^{2}$ reduces to the classical condition number $\beta/\alpha$, the ratio of the largest to smallest eigenvalues of the Hessian matrices of $V$ and $W$. Therefore, $\overline{C}_{\mathsf{LSI}}\beta/\sigma^{2}$ is a generalization of the condition number to settings beyond uniform strong convexity, which allows us to state more interpretable bounds. The additional assumption $\kappa\leq\sqrt{d}/\varepsilon$ will be used to simplify some of the rates.

In the following subsections, we discuss some of the results in greater detail.

4.1 Pairwise McKean–Vlasov Setting

Example 10 (Gaussian Case).

Consider a quadratic confinement and interaction,

V(x)=\frac{1}{2}\,x^{\mathsf{T}}Ax=\frac{1}{2}\,\lVert x\rVert_{A}^{2}\,,\qquad W(x)=\frac{\lambda}{2}\,\lVert x\rVert^{2}\,,

for some matrix $A\in\mathbb{R}^{d\times d}$ with $A\succ 0$ and $\lambda\geq 0$. The resulting stationary distributions can be calculated explicitly and are Gaussians. We show in §C that for large $N$, $\operatorname{\mathsf{KL}}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})=\widetilde{\Theta}(dk^{2}/N^{2})$. This shows that the rate in Theorem 3 is sharp.
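For concreteness, here is the mean-field side of this computation (the covariance of $\mu^{1:N}$ and the resulting $\operatorname{\mathsf{KL}}$ expansion are worked out in §C): since $\int W(x-y)\,\pi(\mathrm{d}y)=\frac{\lambda}{2}\,(\lVert x\rVert^{2}-2\,\langle x,\operatorname{\mathbb{E}}_{\pi}Y\rangle+\operatorname{\mathbb{E}}_{\pi}[\lVert Y\rVert^{2}])$, the centered solution of the fixed-point equation (1.1) satisfies

\pi(x)\propto\exp\Bigl(-\frac{1}{\sigma^{2}}\,\bigl(\langle x,Ax\rangle+\lambda\,\lVert x\rVert^{2}\bigr)\Bigr)\,,\qquad\text{i.e.,}\qquad\pi=\mathcal{N}\Bigl(0,\;\frac{\sigma^{2}}{2}\,(A+\lambda I)^{-1}\Bigr)\,.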

Example 11 (Strongly Convex Case).

Consider the strongly convex case where $\alpha=\alpha_{V}+\alpha_{W}^{-}>0$. The prior work [BS23] also considered the problem of sampling from the mean-field stationary distribution $\pi$, with $\sigma^{2}=2$. If we count the number of calls to a gradient oracle for $V$, their complexity bound reads $\widetilde{\mathcal{O}}(\kappa^{5/3}d^{4/3}/\varepsilon^{8/3})$ to achieve $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{1}(\hat{\mu}^{1},\pi)\leq\varepsilon$. We note that their assumptions are not strictly comparable to ours. They require the interaction $W$ to be sufficiently weak, in the sense that $\beta_{W}\lesssim\alpha$, which is similar⁴ to our Assumption 3; on the other hand, they only assume $\alpha_{V}>0$, rather than $\alpha_{V}+\alpha_{W}^{-}>0$. Nevertheless, we attempt to make some comparisons with their work below.

⁴See eq. (2.24) therein; note that they have a scaling factor of $\varepsilon$ in front of their interaction term, so that our parameter $\beta_{W}$ is equivalent to their $\varepsilon\tilde{L}$.

Without Assumption 3, ULMC+ achieves $\frac{\sqrt{\alpha}}{\sigma}\,\mathcal{W}_{2}(\hat{\mu}^{1},\pi)\leq\varepsilon$ with complexity $\widetilde{\mathcal{O}}(\kappa^{3}d^{4/3}/\varepsilon^{8/3})$, which matches the guarantee of [BS23] up to the dependence on $\kappa$. We can also obtain guarantees in $\sqrt{\operatorname{\mathsf{KL}}}$, at the cost of an extra factor of $\kappa^{2}$.

With Assumption 3, MALA–PS has complexity $\widetilde{\mathcal{O}}(\kappa d^{5/4}/\varepsilon^{3/2})$ and ULMC+ has complexity $\widetilde{\mathcal{O}}(\kappa d^{5/6}/\varepsilon^{5/3})$, which improve substantially upon [BS23].

To summarize, in the strongly convex case, we have obtained numerous improvements: (i) we can obtain results even without the weak interaction condition (Assumption 3); (ii) when we assume the weak interaction condition, we obtain improved complexities; (iii) our results hold in stronger metrics; (iv) our approach is generic, allowing for the consideration of numerous different samplers without needing to establish new propagation of chaos results (by way of comparison, [BS23] developed a tailored propagation of chaos argument for their non-linear Hamiltonian Monte Carlo algorithm).

Example 12 (Bounded Perturbations).

Both the results of [BS23] as well as our own allow for non-convex potentials, albeit under different assumptions: [BS23] require strong convexity at infinity, whereas we require (LSI) for the stationary measures $\mu^{1:N}$ and $\pi$. In order to obtain sampling guarantees with low complexity, it is important for the LSI constant of $\mu^{1:N}$ to be independent of $N$. We have provided a sufficient condition for this to hold: $V$ and $W$ are bounded perturbations of $V_{0}$ and $W_{0}$ respectively, where $\alpha_{V_{0}}+\alpha_{W_{0}}^{-}>0$; see Lemma 8.

We also note that in this setting, both of our works require a weak interaction condition. This is in general necessary in order to ensure uniqueness of the mean-field stationary distribution, see the discussion in §3.1.1.

4.2 General McKean–Vlasov Setting

Example 13 (General Functionals).

In the general setting, under Assumptions 5, 6, and 7, the work of [SWN23] provided the first discretization bounds. They impose further assumptions and their resulting complexity bound is rather complicated, but it reads roughly $MN=\widetilde{\mathcal{O}}(\operatorname{poly}(\kappa)\,d^{2}/\varepsilon^{4})$ for the discretization of ($\mathsf{gMV}_{N}$). Subsequently, [FW23] obtained an improved complexity of $MN=\widetilde{\mathcal{O}}(\kappa^{4}d^{3/2}/\varepsilon^{3})$ via ULMC in the averaged $\mathsf{TV}$ distance. In comparison, we can improve this complexity guarantee to $\widetilde{\mathcal{O}}(\kappa^{5/2}d^{3/2}/\varepsilon^{3})$, and the guarantee even holds in $\sqrt{\operatorname{\mathsf{KL}}}$ if we combine ULMC with the proximal sampler. It appears that we gain one factor of $\sqrt{\kappa}$ through a sharper discretization analysis (via [Zha+23], or via the error analysis of the proximal sampler in [AC23]), and one factor of $\kappa$ via a sharper propagation of chaos result (Theorem 5).

We also note that the result of [FW23] is based on a kinetic version of the propagation of chaos argument from [Che+24], whereas our approach uses the original “non-kinetic” argument from [CRW22] in the form of Theorem 5.

Application to Two-Layer Neural Networks.

Let us consider the problem of learning a two-layer neural network in the mean-field regime. Let $f_{\theta}:\mathbb{R}^{d}\to\mathbb{R}$ be a function parameterized by $\theta\in\mathbb{R}^{p}$, and for any probability measure $\mu$ over $\mathbb{R}^{p}$, let $f_{\mu}\coloneqq\int f_{\theta}\,\mu(\mathrm{d}\theta)$. For example, in a standard two-layer neural network, we take $\theta=(a,w)\in\mathbb{R}\times\mathbb{R}^{d}$ and $f_{a,w}(x)=a\,\mathsf{ReLU}(\langle w,x\rangle)$. When $\mu=\frac{1}{m}\sum_{j=1}^{m}\delta_{(a_{j},w_{j})}$ is an empirical measure, $f_{\mu}$ is the function computed by a two-layer neural network with $m$ hidden neurons. In this formulation, however, we can take $\mu$ to be any probability measure, corresponding to the mean-field limit $m\to\infty$ [CB18, MMN18, Chi22, RV22, SS20].

Given a dataset $\{(x_{i},y_{i})\}_{i=1}^{n}$ in $\mathbb{R}^{d}\times\mathbb{R}$ and a loss function $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, we can formulate neural network training as the problem of minimizing the loss $\mu\mapsto\sum_{i=1}^{n}\ell(f_{\mu}(x_{i}),y_{i})$. To place this within the general McKean–Vlasov framework, we add two regularization terms: (1) $\frac{\lambda}{2}\int\lVert\cdot\rVert^{2}\,\mathrm{d}\mu$ corresponds to weight decay; and (2) $\frac{\sigma^{2}}{2}\int\log\mu\,\mathrm{d}\mu$ is entropic regularization. We are now in the setting of §2.1.2, with $\mathcal{F}_{0}(\mu)=\sum_{i=1}^{n}\ell(f_{\mu}(x_{i}),y_{i})$.

To minimize this energy, it is natural to consider the Euler–Maruyama discretization of ($\mathsf{gMV}_{N}$), which corresponds to learning the neural network via noisy GD and was considered in [SWN23]. Recent works [FW23, Che+24] also considered the underdamped version of ($\mathsf{gMV}$) and its discretization. Under the assumptions common to those works as well as our own, our results yield improved algorithmic guarantees for this task (see Example 13).

Unfortunately, the assumptions used for the analysis of the general McKean–Vlasov setting are restrictive and limit the applicability to neural network training. For example, it suffices for $\ell$ to be convex in its first argument (to satisfy Assumption 5), to have two bounded derivatives (w.r.t. its first argument), and for $\theta\mapsto f_{\theta}(x_{i})$ to have two bounded derivatives for each $x_{i}$. The last condition is satisfied, e.g., for $f_{\theta}(x)=\tanh(\langle\theta,x\rangle)$. For a genuinely two-layer example, we can take $f_{\theta}(x)=\tanh(a)\tanh(\langle w,x\rangle)$ for $\theta=(a,w)\in\mathbb{R}\times\mathbb{R}^{d}$. Under these conditions, Assumptions 6 and 8 hold, which in turn furnish log-Sobolev inequalities via Lemmas 6 and 9.
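As an illustration of how these pieces fit together, here is a minimal numpy sketch of one noisy-GD step on the particles, i.e., one Euler–Maruyama step of ($\mathsf{gMV}_{N}$), under the squared loss and the parameterization $f_{\theta}(x)=\tanh(a)\tanh(\langle w,x\rangle)$ just mentioned; the step size $h$ and the function names are illustrative assumptions, not the exact implementations of the cited works.

import numpy as np

def noisy_gd_step(a, w, X_data, y_data, lam, sigma, h, rng):
    # One Euler-Maruyama / noisy-GD step for (gMV_N) with f_theta(x) = tanh(a) tanh(<w, x>)
    # and squared loss l(u, y) = (u - y)^2 / 2; a: (N,), w: (N, d), X_data: (n, d), y_data: (n,).
    ta = np.tanh(a)                                   # (N,)
    tw = np.tanh(w @ X_data.T)                        # (N, n), entries tanh(<w^j, x_i>)
    f_rho = (ta[:, None] * tw).mean(axis=0)           # (n,), mean-field output f_rho(x_i)
    resid = f_rho - y_data                            # (n,), equals l'(f_rho(x_i), y_i)
    # Wasserstein gradient of F_0 evaluated at each particle theta^j = (a_j, w_j):
    grad_a = (1 - ta**2) * (tw @ resid)                       # (N,)
    grad_w = ta[:, None] * (((1 - tw**2) * resid) @ X_data)   # (N, d)
    a_new = a - h * (grad_a + lam * a) + sigma * np.sqrt(h) * rng.standard_normal(a.shape)
    w_new = w - h * (grad_w + lam * w) + sigma * np.sqrt(h) * rng.standard_normal(w.shape)
    return a_new, w_new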

Limitations.

However, we note that there is a substantial limitation of our framework when applied to the mean-field Langevin dynamics. Although we are able to establish a uniform-in-$N$ LSI for the stationary distribution $\mu^{1:N}$ under appropriate assumptions (see Corollary 25 for a precise statement), the LSI constant scales poorly (in fact, doubly exponentially) in the problem parameters. To fully benefit from the modularity of our approach, it is desirable to obtain a uniform-in-$N$ LSI with better scaling, and we leave this question open for future research.

5 Conclusion

In this work, we propose a framework for obtaining sampling guarantees for the minimizers of ($\mathsf{pE}$) and ($\mathsf{gE}$), based on decoupling the problem into (i) particle approximation via propagation of chaos, and (ii) time discretization via log-concave sampling theory. Our approach leads to simpler proofs and improved guarantees compared to previous works, and our results readily benefit from any improvements in either (i) or (ii).

We conclude by listing some future directions of study. As discussed in §4.2, our uniform-in-$N$ LSI for the mean-field Langevin dynamics currently scales poorly in the problem parameters, and it is important to improve it. We also believe there is further room for improvement in the propagation of chaos results. For example, can the sharp rate in Theorem 3 be extended to stronger metrics such as Rényi divergences, as well as to situations when the weak interaction condition (Assumption 3) fails, e.g., in the strongly displacement convex case or in the setting of §3.1.2? For the sampling guarantees, the prior works [BS23, SWN23] considered different settings, such as potentials satisfying convexity at infinity or the use of stochastic gradients; these extensions are compatible with our approach and could possibly lead to improvements in these cases, as well as others. Finally, consider the case where $\int\nabla W(X_{t}-\cdot)\,\mathrm{d}\pi_{t}$ in ($\mathsf{pMV}$) is replaced with a generic function $\int b_{t}(X_{t},\cdot)\,\mathrm{d}\pi_{t}$, $b_{t}:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}^{n}$. It would be interesting to extend our analysis to this setting, as it arises in many applications [AD20].

Acknowledgements

We thank Zhenjie Ren for pointing out an error in a previous draft of this paper, and Daniel Lacker, Atsushi Nitanda, and Taiji Suzuki for important discussions and references. YK was supported in part by NSF awards CCF-2007443 and CCF-2134105. MSZ was supported by NSERC through the CGS-D program. SC was supported by the Eric and Wendy Schmidt Fund at the Institute for Advanced Study. MAE was supported by NSERC Grant [2019-06167] and CIFAR AI Chairs program at the Vector Institute. MBL was supported by NSF grant DMS-2133806.


References

  • [AC23] Jason M. Altschuler and Sinho Chewi “Faster high-accuracy log-concave sampling via algorithmic warm starts” In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023, pp. 2169–2176
  • [AGS08] Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré “Gradient flows in metric spaces and in the space of probability measures”, Lectures in Mathematics ETH Zürich Birkhäuser Verlag, Basel, 2008, pp. x+334
  • [AK02] Fabio Antonelli and Arturo Kohatsu-Higa “Rate of convergence of a particle method to the solution of the McKean–Vlasov equation” In The Annals of Applied Probability 12.2 Institute of Mathematical Statistics, 2002, pp. 423–476
  • [AD20] Marc Arnaudon and Pierre Del Moral “A second order analysis of McKean–Vlasov semigroups” In The Annals of Applied Probability 30.6 JSTOR, 2020, pp. 2613–2664
  • [BGL14] Dominique Bakry, Ivan Gentil and Michel Ledoux “Analysis and geometry of Markov diffusion operators” Springer, 2014
  • [BH22] Jianhai Bao and Xing Huang “Approximations of McKean–Vlasov stochastic differential equations with irregular coefficients” In Journal of Theoretical Probability Springer, 2022, pp. 1–29
  • [BGM10] François Bolley, Arnaud Guillin and Florent Malrieu “Trend to equilibrium and particle approximation for a weakly self consistent Vlasov–Fokker–Planck equation” In ESAIM: Mathematical Modelling and Numerical Analysis 44.5 EDP Sciences, 2010, pp. 867–884
  • [Bol72] Ludwig Boltzmann “Further studies on the heat balance among gas molecules” In History of Modern Physical Sciences 1, 1872, pp. 262–349
  • [BT97] Mireille Bossy and Denis Talay “A stochastic particle method for the McKean–Vlasov and the Burgers equation” In Mathematics of Computation 66.217, 1997, pp. 157–192
  • [BS23] Nawaf Bou-Rabee and Katharina Schuh “Nonlinear Hamiltonian Monte Carlo & its particle approximation” In arXiv preprint arXiv:2308.11491, 2023
  • [BL76] Herm J. Brascamp and Elliott H. Lieb “On extensions of the Brunn–Minkowski and Prékopa–Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation” In J. Functional Analysis 22.4, 1976, pp. 366–389
  • [BJW23] Didier Bresch, Pierre-Emmanuel Jabin and Zhenfu Wang “Mean field limit and quantitative estimates with singular attractive kernels” In Duke Mathematical Journal 172.13 Duke University Press, 2023, pp. 2591–2641
  • [BP24] Giovanni Brigati and Francesco Pedrotti “Heat flow, log-concavity, and Lipschitz transport maps” In arXiv preprint 2404.15205, 2024
  • [CD18] René Carmona and François Delarue “Probabilistic theory of mean field games with applications I–II” Springer, 2018
  • [CD22] Louis-Pierre Chaintron and Antoine Diez “Propagation of chaos: a review of models, methods and applications. I. Models and methods” In Kinet. Relat. Models 15.6, 2022, pp. 895–1015
  • [CD22a] Louis-Pierre Chaintron and Antoine Diez “Propagation of chaos: a review of models, methods and applications. II. Applications” In Kinet. Relat. Models 15.6, 2022, pp. 1017–1173
  • [Che+24] Fan Chen, Yiqing Lin, Zhenjie Ren and Songbo Wang “Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics” In Electronic Journal of Probability 29 Institute of Mathematical Statistics and Bernoulli Society, 2024, pp. 1–43
  • [CRW22] Fan Chen, Zhenjie Ren and Songbo Wang “Uniform-in-time propagation of chaos for mean field Langevin dynamics” In arXiv preprint arXiv:2212.03050, 2022
  • [Che+22] Yongxin Chen, Sinho Chewi, Adil Salim and Andre Wibisono “Improved analysis for a proximal algorithm for sampling” In Proceedings of Thirty Fifth Conference on Learning Theory (COLT) 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2984–3014
  • [Che24] Sinho Chewi “Log-concave sampling” Book draft available at https://chewisinho.github.io., 2024
  • [Che+22a] Sinho Chewi et al. “Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev” In Proceedings of Thirty Fifth Conference on Learning Theory (COLT) 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 1–2
  • [Che+21] Sinho Chewi et al. “Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm” In Conference on Learning Theory (COLT), 2021, pp. 1260–1300 PMLR
  • [Chi22] Lénaïc Chizat “Mean-field Langevin dynamics: exponential convergence and annealing” In Transactions on Machine Learning Research, 2022
  • [CB18] Lénaïc Chizat and Francis Bach “On the global convergence of gradient descent for over-parameterized models using optimal transport” In Advances in Neural Information Processing Systems (NeurIPS) 31, 2018
  • [Cla+23] Julien Claisse, Giovanni Conforti, Zhenjie Ren and Songbo Wang “Mean field optimization problem regularized by Fisher information” In arXiv preprint 2302.05938, 2023
  • [Csi84] Imre Csiszár “Sanov property, generalized I-projection and a conditional limit theorem” In The Annals of Probability JSTOR, 1984, pp. 768–793
  • [DKR22] Arnak S. Dalalyan, Avetik Karagulyan and Lionel Riou-Durand “Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets” In J. Mach. Learn. Res. 23, 2022, Paper No. 235, 38 pp.
  • [Dia+23] Michael Z. Diao, Krishnakumar Balasubramanian, Sinho Chewi and Adil Salim “Forward-backward Gaussian variational inference via JKO in the Bures–Wasserstein space” In Proceedings of the 40th International Conference on Machine Learning 202, Proceedings of Machine Learning Research PMLR, 2023, pp. 7960–7991
  • [DMM19] Alain Durmus, Szymon Majewski and Błażej Miasojedow “Analysis of Langevin Monte Carlo via convex optimization” In J. Mach. Learn. Res. 20, 2019, Paper No. 73, 46 pp.
  • [FYC23] Jiaojiao Fan, Bo Yuan and Yongxin Chen “Improved dimension dependence of a proximal algorithm for sampling” In Proceedings of Thirty Sixth Conference on Learning Theory (COLT) 195, Proceedings of Machine Learning Research PMLR, 2023, pp. 1473–1521
  • [FLO21] James Foster, Terry Lyons and Harald Oberhauser “The shifted ODE method for underdamped Langevin MCMC” In arXiv preprint 2101.03446, 2021
  • [FW23] Qiang Fu and Ashia Wilson “Mean-field underdamped Langevin dynamics and its space-time discretization” In arXiv preprint arXiv:2312.16360, 2023
  • [Fun84] Tadahisa Funaki “A certain class of diffusion processes associated with nonlinear parabolic equations” In Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 67.3 Springer, 1984, pp. 331–348
  • [GLM22] Arnaud Guillin, Pierre Le Bris and Pierre Monmarché “Convergence rates for the Vlasov–Fokker–Planck equation and uniform in time propagation of chaos in non convex cases” In Electronic Journal of Probability 27 The Institute of Mathematical Statistics and the Bernoulli Society, 2022, pp. 1–44
  • [GM21] Arnaud Guillin and Pierre Monmarché “Uniform long-time and propagation of chaos estimates for mean field kinetic particles in non-convex landscapes” In Journal of Statistical Physics 185 Springer, 2021, pp. 1–20
  • [HBE20] Ye He, Krishnakumar Balasubramanian and Murat A. Erdogdu “On the ergodicity, bias and asymptotic normality of randomized midpoint sampling method” In Advances in Neural Information Processing Systems 33, 2020, pp. 7366–7376
  • [HR23] Elias Hess-Childs and Keefer Rowan “Higher-order propagation of chaos in L^{2} for interacting diffusions” In arXiv preprint arXiv:2310.09654, 2023
  • [HS87] Richard Holley and Daniel Stroock “Logarithmic Sobolev inequalities and stochastic Ising models” In J. Statist. Phys. 46.5-6, 1987, pp. 1159–1194
  • [JW17] Pierre-Emmanuel Jabin and Zhenfu Wang “Mean field limit for stochastic particle systems” In Active Particles, Volume 1: Advances in Theory, Models, and Applications Springer, 2017, pp. 379–402
  • [JW18] Pierre-Emmanuel Jabin and Zhenfu Wang “Quantitative estimates of propagation of chaos for stochastic systems with W^{-1,\infty} kernels” In Invent. Math. 214.1, 2018, pp. 523–591
  • [JCP23] Yiheng Jiang, Sinho Chewi and Aram-Alexandre Pooladian “Algorithms for mean-field variational inference via polyhedral optimization in the Wasserstein space” In arXiv preprint 2312.02849, 2023
  • [JKO98] Richard Jordan, David Kinderlehrer and Felix Otto “The variational formulation of the Fokker–Planck equation” In SIAM J. Math. Anal. 29.1, 1998, pp. 1–17
  • [KHK24] Mohammad Reza Karimi Jaghargh, Ya-Ping Hsieh and Andreas Krause “Stochastic Approximation Algorithms for Systems of Interacting Particles” In Advances in Neural Information Processing Systems 36, 2024
  • [KMP24] Ksenia A. Khudiakova, Jan Maas and Francesco Pedrotti “L^{\infty}-optimal transport of anisotropic log-concave measures and exponential convergence in Fisher’s infinitesimal model” In arXiv preprint 2402.04151, 2024
  • [KM12] Young-Heon Kim and Emanuel Milman “A generalization of Caffarelli’s contraction theorem via (reverse) heat flow” In Math. Ann. 354.3, 2012, pp. 827–862
  • [Kuh+19] Daniel Kuhn, Peyman M. Esfahani, Viet Anh Nguyen and Soroosh Shafieezadeh-Abadeh “Wasserstein distributionally robust optimization: theory and applications in machine learning” In Operations Research & Management Science in the Age of Analytics, 2019, pp. 130–166
  • [Lac22] Daniel Lacker “Quantitative approximate independence for continuous mean field Gibbs measures” In Electronic Journal of Probability 27 The Institute of Mathematical Statistics and the Bernoulli Society, 2022, pp. 1–21
  • [Lac23] Daniel Lacker “Hierarchies, entropy, and quantitative propagation of chaos for mean field diffusions” In Probability and Mathematical Physics 4.2 Mathematical Sciences Publishers, 2023, pp. 377–432
  • [Lac23a] Daniel Lacker “Independent projections of diffusions: gradient flows for variational inference and optimal mean field approximations” In arXiv preprint 2309.13332, 2023
  • [LL23] Daniel Lacker and Luc Le Flem “Sharp uniform-in-time propagation of chaos” In Probability Theory and Related Fields Springer, 2023, pp. 1–38
  • [Lam+22] Marc Lambert et al. “Variational inference via Wasserstein gradient flows” In Advances in Neural Information Processing Systems, 2022
  • [LL07] Jean-Michel Lasry and Pierre-Louis Lions “Mean field games” In Jpn. J. Math. 2.1, 2007, pp. 229–260
  • [LST21] Yin Tat Lee, Ruoqi Shen and Kevin Tian “Structured logconcave sampling with a restricted Gaussian oracle” In Proceedings of Thirty Fourth Conference on Learning Theory 134, Proceedings of Machine Learning Research PMLR, 2021, pp. 2993–3050
  • [Li+23] Yun Li et al. “Strong convergence of Euler–Maruyama schemes for McKean–Vlasov stochastic differential equations under local Lipschitz conditions of state variables” In IMA Journal of Numerical Analysis 43.2 Oxford University Press, 2023, pp. 1001–1035
  • [LW16] Qiang Liu and Dilin Wang “Stein variational gradient descent: a general purpose Bayesian inference algorithm” In Advances in Neural Information Processing Systems 29 Curran Associates, Inc., 2016
  • [Mal01] Florent Malrieu “Logarithmic Sobolev inequalities for some nonlinear PDE’s” In Stochastic Process. Appl. 95.1, 2001, pp. 109–132
  • [Mal03] Florent Malrieu “Convergence to equilibrium for granular media equations and their Euler schemes” In The Annals of Applied Probability 13.2 Institute of Mathematical Statistics, 2003, pp. 540–560
  • [McK66] Henry P. McKean “A class of Markov processes associated with nonlinear parabolic equations” In Proc. Nat. Acad. Sci. U.S.A. 56, 1966, pp. 1907–1911
  • [MMN18] Song Mei, Andrea Montanari and Phan-Minh Nguyen “A mean field view of the landscape of two-layer neural networks” In Proceedings of the National Academy of Sciences 115.33 National Acad Sciences, 2018, pp. E7665–E7671
  • [Mél96] Sylvie Méléard “Asymptotic behaviour of some interacting particle systems; McKean–Vlasov and Boltzmann models” In Probabilistic Models for Nonlinear Partial Differential Equations: Lectures given at the 1st Session of the Centro Internazionale Matematico Estivo (C.I.M.E.) held in Montecatini Terme, Italy, May 22–30, 1995 Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 42–95
  • [Mon17] Pierre Monmarché “Long-time behaviour and propagation of chaos for mean field kinetic particles” In Stochastic Processes and their Applications 127.6 Elsevier, 2017, pp. 1721–1737
  • [MRW24] Pierre Monmarché, Zhenjie Ren and Songbo Wang “Time-uniform log-Sobolev inequalities and applications to propagation of chaos” In arXiv preprint arXiv:2401.07966, 2024
  • [NWS22] Atsushi Nitanda, Denny Wu and Taiji Suzuki “Convex analysis of the mean field Langevin dynamics” In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022, pp. 9741–9757 PMLR
  • [Ott01] Felix Otto “The geometry of dissipative evolution equations: the porous medium equation” In Comm. Partial Differential Equations 26.1-2, 2001, pp. 101–174
  • [OR07] Felix Otto and Maria G. Reznikoff “A new criterion for the logarithmic Sobolev inequality and two applications” In Journal of Functional Analysis 243.1 Elsevier, 2007, pp. 121–157
  • [OV00] Felix Otto and Cédric Villani “Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality” In Journal of Functional Analysis 173.2 Elsevier, 2000, pp. 361–400
  • [RES22] Gonçalo Reis, Stefan Engelhardt and Greig Smith “Simulation of McKean–Vlasov SDEs with super-linear growth” In IMA Journal of Numerical Analysis 42.1 Oxford University Press, 2022, pp. 874–922
  • [RV22] Grant M. Rotskoff and Eric Vanden-Eijnden “Trainability and accuracy of artificial neural networks: an interacting particle system approach” In Comm. Pure Appl. Math. 75.9, 2022, pp. 1889–1935
  • [SL19] Ruoqi Shen and Yin Tat Lee “The randomized midpoint method for log-concave sampling” In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019
  • [SS20] Justin Sirignano and Konstantinos Spiliopoulos “Mean field analysis of neural networks: a law of large numbers” In SIAM Journal on Applied Mathematics 80.2 SIAM, 2020, pp. 725–752
  • [SNW22] Taiji Suzuki, Atsushi Nitanda and Denny Wu “Uniform-in-time propagation of chaos for the mean-field gradient Langevin dynamics” In The Eleventh International Conference on Learning Representations (ICLR), 2022
  • [SWN23] Taiji Suzuki, Denny Wu and Atsushi Nitanda “Convergence of mean-field Langevin dynamics: time and space discretization, stochastic gradient, and variance reduction” In Advances in Neural Information Processing Systems (NeurIPS), 2023
  • [Szn91] Alain-Sol Sznitman “Topics in propagation of chaos” In École d’Été de Probabilités de Saint-Flour XIX—1989 1464, Lecture Notes in Math. Springer, Berlin, 1991, pp. 165–251
  • [Tal96] Denis Talay “Probabilistic numerical methods for partial differential equations: elements of analysis” In Probabilistic models for nonlinear partial differential equations (Montecatini Terme, 1995) 1627, Lecture Notes in Math. Springer, Berlin, 1996, pp. 148–196
  • [VW19] Santosh Vempala and Andre Wibisono “Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices” In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019
  • [Vil02] Cédric Villani “A review of mathematical topics in collisional kinetic theory” In Handbook of mathematical fluid dynamics, Vol. I North-Holland, Amsterdam, 2002, pp. 71–305
  • [Vil03] Cédric Villani “Topics in optimal transportation” 58, Graduate Studies in Mathematics American Mathematical Society, Providence, RI, 2003, pp. xvi+370
  • [Wib18] Andre Wibisono “Sampling as optimization in the space of measures: the Langevin dynamics as a composite optimization problem” In Proceedings of the 31st Conference on Learning Theory 75, Proceedings of Machine Learning Research PMLR, 2018, pp. 2093–3027
  • [WSC22] Keru Wu, Scott Schmidler and Yuansi Chen “Minimax mixing time of the Metropolis-adjusted Langevin algorithm for log-concave sampling” In The Journal of Machine Learning Research (JMLR) 23.1 JMLR.org, 2022, pp. 12348–12410
  • [YY23] Rentian Yao and Yun Yang “Mean-field variational inference via Wasserstein gradient flow” In arXiv preprint 2207.08074, 2023
  • [YKW22] Man-Chung Yue, Daniel Kuhn and Wolfram Wiesemann “On linear optimization over Wasserstein balls” In Math. Program. 195.1-2, 2022, pp. 1107–1122
  • [Zha+23] Matthew S. Zhang et al. “Improved discretization analysis for underdamped Langevin Monte Carlo” In Conference on Learning Theory (COLT), 2023, pp. 36–71 PMLR

Appendix A Control of the Finite-Particle Error

In this section, we prove the results in §3.1 on the finite-particle error. We will make extensive use of the following transport inequality, which arises as a consequence of (LSI).

Lemma 14 (Talagrand’s Transport Inequality, [OV00]).

If a measure π\pi satisfies (LSI) with constant C𝖫𝖲𝖨C_{\mathsf{LSI}}, then for all measures μ𝒫2,ac(d)\mu\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d}),

𝒲22(μ,π)2C𝖫𝖲𝖨𝖪𝖫(μπ).\displaystyle\mathcal{W}_{2}^{2}(\mu,\pi)\leq 2C_{\mathsf{LSI}}\,\mathsf{KL}(\mu\mathbin{\|}\pi)\,. (TI)
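As a quick numerical sanity check of (TI) (not part of any proof), one can test it on one-dimensional Gaussians, where every quantity has a closed form: in the convention 𝖪𝖫 ≤ (C_𝖫𝖲𝖨/2) 𝖥𝖨 used throughout, the measure π = N(0, s²) has C_𝖫𝖲𝖨 = s². The sketch below assumes only NumPy; all parameter ranges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def check_talagrand_gaussian(trials=10_000):
    """Check W2^2(mu, pi) <= 2 * C_LSI * KL(mu || pi) for 1D Gaussians,
    where pi = N(0, s^2) has C_LSI = s^2 and mu = N(m, t^2)."""
    for _ in range(trials):
        s = rng.uniform(0.1, 3.0)      # standard deviation of pi
        m = rng.uniform(-5.0, 5.0)     # mean of mu
        t = rng.uniform(0.1, 3.0)      # standard deviation of mu
        kl = np.log(s / t) + (t**2 + m**2) / (2 * s**2) - 0.5
        w2_sq = m**2 + (t - s)**2      # closed-form W2^2 between 1D Gaussians
        assert w2_sq <= 2 * s**2 * kl + 1e-12, (s, m, t)
    print("(TI) verified on", trials, "random 1D Gaussian pairs.")

check_talagrand_gaussian()
```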

A.1 LSI Case

We provide the proof of Theorem 3 under the assumption of (LSI) for the invariant measures of (𝗉𝖬𝖵\mathsf{pMV}) and (𝗉𝖬𝖵N\mathsf{pMV}_{N}). This relies on a BBGKY hierarchy based on the arguments of [LL23].

Recall that μ1:k\mu^{1:k} is the kk-particle distribution of the finite-particle system. Explicitly,

logμ1:k(x1:k)\displaystyle\log\mu^{1:k}(x^{1:k}) =logexp(2σ2i=1NV(xi)1σ2(N1)i,j=1ijNW(xixj))dxk+1:N+const.\displaystyle=\log\int\exp\Bigl{(}-\frac{2}{\sigma^{2}}\sum_{i=1}^{N}V(x^{i})-\frac{1}{\sigma^{2}\,(N-1)}\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{N}W(x^{i}-x^{j})\Bigr{)}\,\mathrm{d}x^{k+1:N}+\text{const.}

Using exchangeability, we can then compute the gradient of the potential for this measure as

σ22xilogμ1:k(x1:k)\displaystyle-\frac{\sigma^{2}}{2}\,\nabla_{x^{i}}\log\mu^{1:k}(x^{1:k})
=V(xi)+1N1j=1ijkW(xixj)+NkN1𝔼μk+1|1:k(x1:k)W(xi).\displaystyle\qquad=\nabla V(x^{i})+\frac{1}{N-1}\sum_{\begin{subarray}{c}j=1\\ i\neq j\end{subarray}}^{k}\nabla W(x^{i}-x^{j})+\frac{N-k}{N-1}\operatorname{\mathbb{E}}_{\mu^{k+1|1:k}(\cdot\mid x^{1:k})}\nabla W(x^{i}-\cdot)\,.

Let X1:kμ1:kX^{1:k}\sim\mu^{1:k} and introduce the notation

𝖪k𝖪𝖫(μ1:kπk).\displaystyle\mathsf{K}_{k}\coloneqq\mathsf{KL}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})\,.

Invoking (LSI) of the mean-field invariant measure (and tensorizing) leads to

𝖪k\displaystyle\mathsf{K}_{k} C𝖫𝖲𝖨(π)2𝖥𝖨(μ1:kπk)\displaystyle\leq\frac{C_{\mathsf{LSI}}(\pi)}{2}\,\mathsf{FI}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})
=2C𝖫𝖲𝖨(π)σ4i=1k𝔼[1N1j=1jikW(XiXj)W(Xi)dπ\displaystyle=\frac{2C_{\mathsf{LSI}}(\pi)}{\sigma^{4}}\sum_{i=1}^{k}\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\frac{1}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{k}\nabla W(X^{i}-X^{j})-\int\nabla W(X^{i}-\cdot)\,\mathrm{d}\pi
+NkN1W(Xi)dμk+1|1:k(X1:k)2]\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad{}+\frac{N-k}{N-1}\int\nabla W(X^{i}-\cdot)\,\mathrm{d}\mu^{k+1|1:k}(\cdot\mid X^{1:k})\Bigr{\rVert}^{2}\Bigr{]}
4kC𝖫𝖲𝖨(π)σ4(N1)2𝔼[j=2k(W(X1Xj)W(X1)dπ)2]𝖠\displaystyle\leq\frac{4k\,C_{\mathsf{LSI}}(\pi)}{\sigma^{4}\,{(N-1)}^{2}}\,\underset{\mathsf{A}}{\underbrace{\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\sum_{\begin{subarray}{c}j=2\end{subarray}}^{k}\Bigl{(}\nabla W(X^{1}-X^{j})-\int\nabla W(X^{1}-\cdot)\,\mathrm{d}\pi\Bigr{)}\Bigr{\rVert}^{2}\Bigr{]}}}
+4kC𝖫𝖲𝖨(π)(Nk)2σ4(N1)2𝔼[W(X1)(dμk+1|1:k(X1:k)dπ)2]𝖡,\displaystyle\qquad{}+\frac{4k\,C_{\mathsf{LSI}}(\pi)\,{(N-k)}^{2}}{\sigma^{4}\,{(N-1)}^{2}}\,\underset{\mathsf{B}}{\underbrace{\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\int\nabla W(X^{1}-\cdot)\,\bigl{(}\mathrm{d}\mu^{k+1|1:k}(\cdot\mid X^{1:k})-\mathrm{d}\pi\bigr{)}\Bigr{\rVert}^{2}\Bigr{]}}}\,,

where the last line follows from exchangeability and a+b22(a2+b2)\lVert a+b\rVert^{2}\leq 2\,(\lVert a\rVert^{2}+\lVert b\rVert^{2}) for vectors a,bda,b\in\mathbb{R}^{d}.

A.1.1 Bounding the Error Terms

We now handle terms 𝖠,𝖡\mathsf{A},\mathsf{B} separately.

𝖠\displaystyle\mathsf{A} =j=2k𝔼[W(X1Xj)𝔼πW(X1)2]\displaystyle=\sum_{j=2}^{k}\operatorname{\mathbb{E}}[\lVert\nabla W(X^{1}-X^{j})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\rVert^{2}]
+i,j=2ijk𝔼W(X1Xi)𝔼πW(X1),W(X1Xj)𝔼πW(X1)\displaystyle\qquad+\sum_{\begin{subarray}{c}i,j=2\\ i\neq j\end{subarray}}^{k}\operatorname{\mathbb{E}}\bigl{\langle}\nabla W(X^{1}-X^{i})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot),\,\nabla W(X^{1}-X^{j})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\bigr{\rangle}
=(i)(k1)𝔼[W(X1X2)𝔼πW(X1)2]\displaystyle\underset{\text{(i)}}{=}(k-1)\operatorname{\mathbb{E}}[\lVert\nabla W(X^{1}-X^{2})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\rVert^{2}]
+(k1)(k2)𝔼W(X1X2)𝔼πW(X1),\displaystyle\qquad+(k-1)\,(k-2)\operatorname{\mathbb{E}}\bigl{\langle}\nabla W(X^{1}-X^{2})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot),
W(X1X3)𝔼πW(X1)\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\nabla W(X^{1}-X^{3})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\bigr{\rangle}
(ii)(k1)βW2𝔼[XY2]\displaystyle\underset{\text{(ii)}}{\leq}(k-1)\,\beta_{W}^{2}\operatorname{\mathbb{E}}[\lVert X-Y\rVert^{2}]
+(k1)2𝔼W(X1X2)𝔼πW(X1),W(X1X3)𝔼πW(X1),\displaystyle\quad+(k-1)^{2}\operatorname{\mathbb{E}}\bigl{\langle}\nabla W(X^{1}-X^{2})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot),\,\nabla W(X^{1}-X^{3})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\bigr{\rangle}\,,

where we used the exchangeability of the particles in (i) and the smoothness of WW in (ii). Here, Xμ1X\sim\mu^{1} and YπY\sim\pi are independent.

Let us deal with these two terms separately. For the first term, let Y¯π\bar{Y}\sim\pi be optimally coupled with XX. Then, by independence and sub-Gaussian concentration (implied by (LSI)),

𝔼[XY2]\displaystyle\operatorname{\mathbb{E}}[\lVert X-Y\rVert^{2}] 2𝔼[XY¯2]+2𝔼[YY¯2]=2𝒲22(μ1,π)+4𝔼[Y𝔼Y2]\displaystyle\leq 2\operatorname{\mathbb{E}}[\lVert X-\bar{Y}\rVert^{2}]+2\operatorname{\mathbb{E}}[\lVert Y-\bar{Y}\rVert^{2}]=2\,\mathcal{W}_{2}^{2}(\mu^{1},\pi)+4\operatorname{\mathbb{E}}[\lVert Y-\operatorname{\mathbb{E}}Y\rVert^{2}]
4C𝖫𝖲𝖨(π)𝖪𝖫(μ1π)+4dC𝖫𝖲𝖨(π)4C𝖫𝖲𝖨(π)(𝖪3+d),\displaystyle\leq 4C_{\mathsf{LSI}}(\pi)\operatorname{\mathsf{KL}}(\mu^{1}\mathbin{\|}\pi)+4dC_{\mathsf{LSI}}(\pi)\leq 4C_{\mathsf{LSI}}(\pi)\,(\mathsf{K}_{3}+d)\,, (A.1)

where the second inequality follows from (TI), and the last one follows from the data-processing inequality for the 𝖪𝖫\operatorname{\mathsf{KL}} divergence. For the second term, the Cauchy–Schwarz inequality leads to

𝔼W(X1X2)𝔼πW(X1),W(X1X3)𝔼πW(X1)\displaystyle\operatorname{\mathbb{E}}\bigl{\langle}\nabla W(X^{1}-X^{2})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot),\,\nabla W(X^{1}-X^{3})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\bigr{\rangle}
=𝔼W(X1X2)𝔼πW(X1),𝔼μ31:2(X1:2)W(X1)𝔼πW(X1)\displaystyle\qquad=\operatorname{\mathbb{E}}\bigl{\langle}\nabla W(X^{1}-X^{2})-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot),\,\operatorname{\mathbb{E}}_{\mu^{3\mid 1:2}(\cdot\mid X^{1:2})}\nabla W(X^{1}-\cdot)-\operatorname{\mathbb{E}}_{\pi}\nabla W(X^{1}-\cdot)\bigr{\rangle}
βW2𝔼[XY2]𝔼𝒲22(μ31:2(X1:2),π)\displaystyle\qquad\leq\beta_{W}^{2}\sqrt{\operatorname{\mathbb{E}}[\lVert X-Y\rVert^{2}]}\sqrt{\operatorname{\mathbb{E}}\mathcal{W}_{2}^{2}\bigl{(}\mu^{3\mid 1:2}(\cdot\mid X^{1:2}),\,\pi\bigr{)}}
(i)βW24C𝖫𝖲𝖨(π)(𝖪3+d)2C𝖫𝖲𝖨(π)𝔼𝖪𝖫(μ31:2(X1:2)π)\displaystyle\qquad\underset{\text{(i)}}{\leq}\beta_{W}^{2}\sqrt{4C_{\mathsf{LSI}}(\pi)\,(\mathsf{K}_{3}+d)}\sqrt{2C_{\mathsf{LSI}}(\pi)\operatorname{\mathbb{E}}\operatorname{\mathsf{KL}}\bigl{(}\mu^{3\mid 1:2}(\cdot\mid X^{1:2})\bigm{\|}\pi\bigr{)}}
(ii)3βW2C𝖫𝖲𝖨(π)𝖪3+d𝖪3\displaystyle\qquad\underset{\text{(ii)}}{\leq}3\beta_{W}^{2}C_{\mathsf{LSI}}(\pi)\sqrt{\mathsf{K}_{3}+d}\sqrt{\mathsf{K}_{3}} (A.2)
3βW2C𝖫𝖲𝖨(π)(𝖪3+d),\displaystyle\qquad\leq 3\beta_{W}^{2}C_{\mathsf{LSI}}(\pi)\,(\mathsf{K}_{3}+d)\,,

where in (i) we applied the bound (A.1) as well as (TI), and in (ii) we used the chain rule for the 𝖪𝖫\mathsf{KL} divergence.

We return to the analysis of the term 𝖡\mathsf{B}. In a similar way, we obtain

𝖡\displaystyle\mathsf{B} =𝔼[W(X1)(dμk+1|1:k(X1:k)dπ)2]βW2𝔼𝒲22(μk+1|1:k(X1:k),π)\displaystyle=\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\int\nabla W(X^{1}-\cdot)\,\bigl{(}\mathrm{d}\mu^{k+1|1:k}(\cdot\mid X^{1:k})-\mathrm{d}\pi\bigr{)}\Bigr{\rVert}^{2}\Bigr{]}\leq\beta_{W}^{2}\operatorname{\mathbb{E}}\mathcal{W}_{2}^{2}\bigl{(}\mu^{k+1|1:k}(\cdot\mid X^{1:k}),\,\pi\bigr{)}
2βW2C𝖫𝖲𝖨(π)(𝖪k+1𝖪k).\displaystyle\leq 2\beta_{W}^{2}C_{\mathsf{LSI}}(\pi)\,(\mathsf{K}_{k+1}-\mathsf{K}_{k})\,.

A.1.2 Induction

Putting our bounds on 𝖠\mathsf{A} and 𝖡\mathsf{B} together, we obtain for N30N\geq 30,

𝖪k30k3βW2C𝖫𝖲𝖨2(π)σ4N2(𝖪3+d)+8kβW2C𝖫𝖲𝖨2(π)σ4(𝖪k+1𝖪k).\mathsf{K}_{k}\leq\frac{30k^{3}\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}N^{2}}\,(\mathsf{K}_{3}+d)+\frac{8k\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}}\,(\mathsf{K}_{k+1}-\mathsf{K}_{k})\,. (A.3)

In particular, the case of k=Nk=N involves our bounds only on 𝖠\mathsf{A}, leading to

𝖪N30NβW2C𝖫𝖲𝖨2(π)σ4(𝖪3+d).\mathsf{K}_{N}\leq\frac{30N\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}}\,(\mathsf{K}_{3}+d)\,.

By grouping together the 𝖪k\mathsf{K}_{k} terms in (A.3),

𝖪k8kβW2C𝖫𝖲𝖨2(π)/σ41+8kβW2C𝖫𝖲𝖨2(π)/σ4=:𝒞k(𝖪k+1+(2kN)2(𝖪3+d)).\displaystyle\mathsf{K}_{k}\leq\underset{=:\mathcal{C}_{k}}{\underbrace{\frac{8k\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)/\sigma^{4}}{1+8k\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)/\sigma^{4}}}}\,\Bigl{(}\mathsf{K}_{k+1}+\bigl{(}\frac{2k}{N}\bigr{)}^{2}\,(\mathsf{K}_{3}+d)\Bigr{)}\,. (A.4)

Iterating this inequality down to k=3k=3, for ρσ4/8βW2C𝖫𝖲𝖨2(π)\rho\coloneqq\nicefrac{{\sigma^{4}}}{{8\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}},

𝖪3\displaystyle\mathsf{K}_{3} (k=3N1𝒞k)30NβW2C𝖫𝖲𝖨2(π)σ4(𝖪3+d)+k=3N1(=3k𝒞)(2kN)2(𝖪3+d)\displaystyle\leq\Bigl{(}\prod_{k=3}^{N-1}\mathcal{C}_{k}\Bigr{)}\,\frac{30N\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}}\,(\mathsf{K}_{3}+d)+\sum_{k=3}^{N-1}\Bigl{(}\prod_{\ell=3}^{k}\mathcal{C}_{\ell}\Bigr{)}\,\bigl{(}\frac{2k}{N}\bigr{)}^{2}\,(\mathsf{K}_{3}+d)
[(k=3N1𝒞k)4Nρ+k=3N1(=3k𝒞)(2kN)2]cN(𝖪3+d).\displaystyle\leq\underbrace{\biggl{[}\Bigl{(}\prod_{k=3}^{N-1}\mathcal{C}_{k}\Bigr{)}\,\frac{4N}{\rho}+\sum_{k=3}^{N-1}\Bigl{(}\prod_{\ell=3}^{k}\mathcal{C}_{\ell}\Bigr{)}\,\bigl{(}\frac{2k}{N}\bigr{)}^{2}\biggr{]}}_{\eqqcolon c_{N}}\,(\mathsf{K}_{3}+d)\,.

Now we show cN<1/2c_{N}<1/2, which implies 𝖪32cNd\mathsf{K}_{3}\leq 2c_{N}d. We require the following lemma.

Lemma 15.

For 3ikN3\leq i\leq k\leq N,

=ik𝒞(i+ρk+1+ρ)ρ.\prod_{\ell=i}^{k}\mathcal{C}_{\ell}\leq\Bigl{(}\frac{i+\rho}{k+1+\rho}\Bigr{)}^{\rho}\,.

Proof.  For 𝒞=ρ11+ρ1\mathcal{C}_{\ell}=\frac{\ell\rho^{-1}}{1+\ell\rho^{-1}}, we have

Clog=ik𝒞==iklog(111+ρ1)=ik11+ρ1.C\coloneqq\log\prod_{\ell=i}^{k}\mathcal{C}_{\ell}=\sum_{\ell=i}^{k}\log\Bigl{(}1-\frac{1}{1+\ell\rho^{-1}}\Bigr{)}\leq-\sum_{\ell=i}^{k}\frac{1}{1+\ell\rho^{-1}}\,.

As the summand is decreasing in \ell, it follows that

C=ik+111+xρ1dx=ik+111+xρ1dx=ρlogk+1+ρi+ρ.C\leq-\sum_{\ell=i}^{k}\int_{\ell}^{\ell+1}\frac{1}{1+x\rho^{-1}}\,\mathrm{d}x=-\int_{i}^{k+1}\frac{1}{1+x\rho^{-1}}\,\mathrm{d}x=-\rho\log\frac{k+1+\rho}{i+\rho}\,.

Therefore,

=ik𝒞=expC(k+1+ρi+ρ)ρ,\displaystyle\prod_{\ell=i}^{k}\mathcal{C}_{\ell}=\exp C\leq\Bigl{(}\frac{k+1+\rho}{i+\rho}\Bigr{)}^{-\rho}\,,

which proves the lemma. ∎
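As a sanity check (not needed for the argument), the bound of Lemma 15 is easy to verify numerically over a range of indices and values of ρ; the following NumPy sketch compares the logarithm of the exact product of the C_ℓ's with the logarithm of the claimed bound.

```python
import numpy as np

def check_lemma15(N=200, rhos=(1.0, 3.0, 7.5)):
    """Verify prod_{l=i}^{k} C_l <= ((i + rho) / (k + 1 + rho))^rho for 3 <= i <= k <= N."""
    for rho in rhos:
        ells = np.arange(1, N + 1)
        log_C = np.log((ells / rho) / (1.0 + ells / rho))      # log C_ell, as in (A.4)
        cum = np.concatenate(([0.0], np.cumsum(log_C)))        # cum[m] = sum_{ell <= m} log C_ell
        for i in range(3, N + 1):
            for k in range(i, N + 1):
                log_prod = cum[k] - cum[i - 1]
                log_bound = rho * (np.log(i + rho) - np.log(k + 1 + rho))
                assert log_prod <= log_bound + 1e-10, (rho, i, k)
    print(f"Lemma 15 bound verified for all 3 <= i <= k <= {N}.")

check_lemma15()
```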

Using Lemma 15, we obtain

cN4(3+ρ)ρ(N1ρρ+1N2k=3N1k2ρ).c_{N}\leq 4\,{(3+\rho)}^{\rho}\,\Bigl{(}\frac{N^{1-\rho}}{\rho}+\frac{1}{N^{2}}\sum_{k=3}^{N-1}k^{2-\rho}\Bigr{)}\,.

Under Assumption 3, i.e., ρ3\rho\geq 3, we may assume ρ=3\rho=3 without loss of generality, since we can always replace βW\beta_{W} by a larger (worse) constant so that ρ=3\rho=3. As seen shortly, the rate does not improve even if ρ>3\rho>3. (Alternatively, one can show that the bound in Lemma 15 is decreasing in ρ\rho, so we may simply substitute ρ=3\rho=3 therein.) For ρ=3\rho=3 and N100N\geq 100, we therefore obtain

cN864(13N2+1N2k=3N11k)12,c_{N}\leq 864\,\Bigl{(}\frac{1}{3N^{2}}+\frac{1}{N^{2}}\sum_{k=3}^{N-1}\frac{1}{k}\Bigr{)}\leq\frac{1}{2}\,,

and thus

𝖪3dlogNN2.\displaystyle\mathsf{K}_{3}\lesssim\frac{d\log N}{N^{2}}\,. (A.5)
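The numerical claim used above, namely that 864(1/(3N²) + N^{-2} Σ_{k=3}^{N−1} 1/k) ≤ 1/2 for N ≥ 100, is easy to confirm directly; the following NumPy sketch evaluates the bound on c_N for several values of N.

```python
import numpy as np

def c_N_bound(N, rho=3.0):
    """Evaluate 4 (3 + rho)^rho (N^(1-rho)/rho + N^(-2) sum_{k=3}^{N-1} k^(2-rho)),
    the upper bound on c_N obtained from Lemma 15."""
    ks = np.arange(3, N, dtype=float)
    return 4 * (3 + rho)**rho * (N**(1 - rho) / rho + (ks**(2.0 - rho)).sum() / N**2)

for N in (100, 300, 1_000, 10_000):
    val = c_N_bound(N)
    print(f"N = {N:6d}:  bound on c_N = {val:.4f}")
    assert val < 0.5
```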

A.1.3 Bootstrapping

Substituting the bound (A.5) for 𝖪3\mathsf{K}_{3} into the recursive inequality (A.4), we end up with a suboptimal rate of 𝒪~(k3/N2)\widetilde{\mathcal{O}}(k^{3}/N^{2}) for 𝖪k\mathsf{K}_{k}. To improve the bound, we substitute our established bound (A.5) into (A.2), which results in an improved recursive inequality. Indeed,

𝖠 kβW2C𝖫𝖲𝖨(π)(𝖪3+d)+k2βW2C𝖫𝖲𝖨(π)𝖪3+d𝖪3dkβW2C𝖫𝖲𝖨(π)logN\displaystyle\mathsf{A} \lesssim k\beta_{W}^{2}C_{\mathsf{LSI}}(\pi)\,(\mathsf{K}_{3}+d)+k^{2}\beta_{W}^{2}C_{\mathsf{LSI}}(\pi)\sqrt{\mathsf{K}_{3}+d}\sqrt{\mathsf{K}_{3}}\lesssim dk\beta_{W}^{2}C_{\mathsf{LSI}}(\pi)\sqrt{\log N}

and therefore

𝖪k\displaystyle\mathsf{K}_{k} 𝒪~(dk2βW2C𝖫𝖲𝖨2(π)σ4N2)+8kβW2C𝖫𝖲𝖨2(π)σ4(𝖪k+1𝖪k).\displaystyle\leq\widetilde{\mathcal{O}}\Bigl{(}\frac{dk^{2}\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}N^{2}}\Bigr{)}+\frac{8k\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}}\,(\mathsf{K}_{k+1}-\mathsf{K}_{k})\,.

For k=Nk=N this yields

𝖪N\displaystyle\mathsf{K}_{N} 𝒪~(dβW2C𝖫𝖲𝖨2(π)σ4).\displaystyle\leq\widetilde{\mathcal{O}}\Bigl{(}\frac{d\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}}\Bigr{)}\,.

Regrouping 𝖪k\mathsf{K}_{k} as before, we obtain

𝖪k𝒞k(𝖪k+1+𝒪~(dkN2)).\mathsf{K}_{k}\leq\mathcal{C}_{k}\,\Bigl{(}\mathsf{K}_{k+1}+\widetilde{\mathcal{O}}\bigl{(}\frac{dk}{N^{2}}\bigr{)}\Bigr{)}\,.

Iterating this until reaching k=Nk=N,

𝖪k\displaystyle\mathsf{K}_{k} (=kN1𝒞)𝖪N+=kN1(j=k𝒞j)𝒪~(dN2)\displaystyle\leq\Bigl{(}\prod_{\ell=k}^{N-1}\mathcal{C}_{\ell}\Bigr{)}\,\mathsf{K}_{N}+\sum_{\ell=k}^{N-1}\Bigl{(}\prod_{j=k}^{\ell}\mathcal{C}_{j}\Bigr{)}\,\widetilde{\mathcal{O}}\bigl{(}\frac{d\ell}{N^{2}}\bigr{)}
(i)𝒪~(k3N3dβW2C𝖫𝖲𝖨2(π)σ4+=kN1k33dN2)(ii)𝒪~(dk2N2),\displaystyle\underset{\text{(i)}}{\leq}\widetilde{\mathcal{O}}\Bigl{(}\frac{k^{3}}{N^{3}}\,\frac{d\beta_{W}^{2}C_{\mathsf{LSI}}^{2}(\pi)}{\sigma^{4}}+\sum_{\ell=k}^{N-1}\frac{k^{3}}{\ell^{3}}\,\frac{d\ell}{N^{2}}\Bigr{)}\underset{\text{(ii)}}{\leq}\widetilde{\mathcal{O}}\bigl{(}\frac{dk^{2}}{N^{2}}\bigr{)}\,,

where in (i) we used Lemma 15 with ρ=3\rho=3, and (ii) follows from ρ3\rho\geq 3 and k2(k1)1k1\sum_{\ell\geq k}\ell^{-2}\leq{(k-1)}^{-1}\lesssim k^{-1}. Therefore, for a fixed kk, it suffices to take N=100Ω~(kd/ε)N=100\vee\widetilde{\Omega}(k\sqrt{d}/\varepsilon) to achieve ε2\varepsilon^{2}-bias in 𝖪𝖫\mathsf{KL}, completing the proof of Theorem 3.

A.2 Strongly Convex Case

The following propagation of chaos argument for the strongly log-concave case is based on [Szn91]. Let (Xt1:N)t0(X^{1:N}_{t})_{t\geq 0} denote the stochastic process following the finite-particle stochastic differential equation (𝗉𝖬𝖵N\mathsf{pMV}_{N}). Let the corresponding semigroup be denoted (𝒯t)t0{(\mathcal{T}_{t})}_{t\geq 0}, defined as follows. For any test function f:d×Nf:\mathbb{R}^{d\times N}\to\mathbb{R},

𝒯tf(x1:N)=𝔼[f(Xt1:N)X01:N=x1:N].\displaystyle\mathcal{T}_{t}f(x^{1:N})=\operatorname{\mathbb{E}}[f(X^{1:N}_{t})\mid X_{0}^{1:N}=x^{1:N}]\,.

Then, the following simple lemma proves Wasserstein contraction for the finite-particle system.

Lemma 16.

Under Assumption 4 and for NαVαWαV+αWN\geq\frac{\alpha_{V}-{\alpha_{W}^{-}}}{\alpha_{V}+{\alpha_{W}^{-}}}, (𝒯t)t0{(\mathcal{T}_{t})}_{t\geq 0} is a contraction in the 22-Wasserstein distance with exponential rate at least α/2\alpha/2, where ααV+αW\alpha\coloneqq\alpha_{V}+{\alpha_{W}^{-}}. In other words, for any measures μ01:N\mu_{0}^{1:N}, ν01:N\nu_{0}^{1:N} in 𝒫2(d×N)\mathcal{P}_{2}(\mathbb{R}^{d\times N}),

𝒲2(μ01:N𝒯t,ν01:N𝒯t)\displaystyle\mathcal{W}_{2}(\mu_{0}^{1:N}\mathcal{T}_{t},\,\nu_{0}^{1:N}\mathcal{T}_{t}) exp(αt/2)𝒲2(μ01:N,ν01:N).\displaystyle\leq\exp(-\alpha t/2)\,\mathcal{W}_{2}(\mu_{0}^{1:N},\nu_{0}^{1:N})\,.

Proof.  Note that (𝒯t)t0{(\mathcal{T}_{t})}_{t\geq 0} corresponds to the time-scaled (by factor σ2/2\sigma^{2}/2) Langevin diffusion with stationary distribution μ1:N\mu^{1:N}, which is 2σ2(αV+NN1αW)\frac{2}{\sigma^{2}}\,(\alpha_{V}+\frac{N}{N-1}\,{\alpha_{W}^{-}})-strongly log-concave by Lemma 7. The condition on NN ensures that this is at least α/σ2\alpha/\sigma^{2}. Consequently, it is well-known (e.g., via synchronous coupling) that the diffusion is a contraction in the Wasserstein distance with rate at least α/2\alpha/2. ∎
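To illustrate the contraction of Lemma 16 (a sketch only, not part of the proof), one can simulate two copies of the time-scaled Langevin diffusion for μ^{1:N} under a synchronous coupling and track the distance between them. The example below takes the quadratic choices V(x) = α_V‖x‖²/2 and W(x) = α_W‖x‖²/2 with α_W ≥ 0 (so α_W^− = 0 and α = α_V); all parameter values are illustrative, and the observed decay rate should be at least α/2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (not taken from the paper).
d, N = 2, 8
alpha_V, alpha_W, sigma = 1.0, 0.5, 1.0
alpha = alpha_V                      # alpha_W >= 0 here, so alpha_W^- = 0 and alpha = alpha_V

def drift(x):
    """(sigma^2/2) * grad log mu^{1:N}(x) for quadratic V, W; x has shape (N, d)."""
    pairwise = x[:, None, :] - x[None, :, :]                  # pairwise[i, j] = x^i - x^j
    interaction = alpha_W * pairwise.sum(axis=1) / (N - 1)    # (1/(N-1)) sum_j grad W(x^i - x^j)
    return -(alpha_V * x + interaction)

h, T = 1e-3, 5.0
X = rng.normal(size=(N, d)) + 3.0    # two arbitrary initial configurations
Y = rng.normal(size=(N, d)) - 3.0
d0 = np.linalg.norm(X - Y)

for _ in range(int(T / h)):
    dB = np.sqrt(h) * rng.normal(size=(N, d))   # shared Brownian increments (synchronous coupling)
    X = X + h * drift(X) + sigma * dB
    Y = Y + h * drift(Y) + sigma * dB

rate = -np.log(np.linalg.norm(X - Y) / d0) / T
print(f"observed contraction rate ~ {rate:.3f}; guaranteed lower bound alpha/2 = {alpha / 2:.3f}")
```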

We next bound the error incurred in one step from applying the finite-particle semigroup to πN\pi^{\otimes N}.

Lemma 17.

Under Assumptions 1 and 4, for any λ>0\lambda>0, 𝒯h\mathcal{T}_{h} induces the following error in Wasserstein distance:

𝒲22(πN𝒯h,πN)\displaystyle\mathcal{W}_{2}^{2}(\pi^{\otimes N}\mathcal{T}_{h},\,\pi^{\otimes N}) (1+λ1)βW2σ2dh2αexp((1+λ)βW2h22).\displaystyle\leq\frac{(1+\lambda^{-1})\,\beta_{W}^{2}\sigma^{2}dh^{2}}{\alpha}\exp\Bigl{(}\frac{(1+\lambda)\,\beta_{W}^{2}h^{2}}{2}\Bigr{)}\,.

Proof.  We resort to a coupling argument, noting that π\pi is stationary under (𝗉𝖬𝖵\mathsf{pMV}). Starting with πN\pi^{\otimes N}, we evolve (Xt1:N)t0{(X^{1:N}_{t})}_{t\geq 0} and (Yt1:N)t0{(Y^{1:N}_{t})}_{t\geq 0} according to (𝗉𝖬𝖵N\mathsf{pMV}_{N}) and (𝗉𝖬𝖵\mathsf{pMV}) respectively, i.e. Xt1:NπN𝒯tX^{1:N}_{t}\sim\pi^{\otimes N}\mathcal{T}_{t} and Yt1:NπNY^{1:N}_{t}\sim\pi^{\otimes N}. This argument is adapted from the original propagation of chaos proof by [Szn91].

We can compute the evolution under a synchronous coupling as:

d(XtiYti)\displaystyle\mathrm{d}(X^{i}_{t}-Y^{i}_{t}) =(V(Xti)V(Yti))dt1N1j=1jiN(W(XtiXtj)𝔼πW(Yti))dt\displaystyle=-\bigl{(}\nabla V(X^{i}_{t})-\nabla V(Y^{i}_{t})\bigr{)}\,\mathrm{d}t-\frac{1}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\bigl{(}\nabla W(X^{i}_{t}-X^{j}_{t})-\operatorname{\mathbb{E}}_{\pi}\nabla W(Y^{i}_{t}-\cdot)\bigr{)}\,\mathrm{d}t
=(V(Xti)V(Yti))dt1N1j=1jiN(W(XtiXtj)W(YtiXtj))dt\displaystyle=-\bigl{(}\nabla V(X^{i}_{t})-\nabla V(Y^{i}_{t})\bigr{)}\,\mathrm{d}t-\frac{1}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\bigl{(}\nabla W(X^{i}_{t}-X^{j}_{t})-\nabla W(Y^{i}_{t}-X^{j}_{t})\bigr{)}\,\mathrm{d}t
1N1j=1jiN(W(YtiXtj)W(YtiYtj))dt\displaystyle\qquad{}-\frac{1}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\bigl{(}\nabla W(Y^{i}_{t}-X^{j}_{t})-\nabla W(Y^{i}_{t}-Y^{j}_{t})\bigr{)}\,\mathrm{d}t
1N1j=1jiN(W(YtiYtj)𝔼πW(Yti))dt.\displaystyle\qquad{}-\frac{1}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\bigl{(}\nabla W(Y^{i}_{t}-Y^{j}_{t})-\operatorname{\mathbb{E}}_{\pi}\nabla W(Y^{i}_{t}-\cdot)\bigr{)}\,\mathrm{d}t\,.

Now let us denote by W¯(x,y)W(xy)𝔼πW(x)\overline{\nabla W}(x,y)\coloneqq\nabla W(x-y)-\operatorname{\mathbb{E}}_{\pi}\nabla W(x-\cdot) the centered gradient (with respect to π\pi). By Itô’s formula and Assumption 4,

dXtiYti2\displaystyle\mathrm{d}\lVert X_{t}^{i}-Y_{t}^{i}\rVert^{2} =2XtiYti,d(XtiYti)\displaystyle=2\,\langle X_{t}^{i}-Y_{t}^{i},\mathrm{d}(X_{t}^{i}-Y_{t}^{i})\rangle
2(αV+αW)XtiYti2dt\displaystyle\leq-2\,(\alpha_{V}+\alpha_{W})\,\lVert X_{t}^{i}-Y_{t}^{i}\rVert^{2}\,\mathrm{d}t
2N1j=1jiNXtiYti,W(YtiXtj)W(YtiYtj)dt\displaystyle\qquad{}-\frac{2}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\langle X_{t}^{i}-Y_{t}^{i},\nabla W(Y^{i}_{t}-X^{j}_{t})-\nabla W(Y^{i}_{t}-Y^{j}_{t})\rangle\,\mathrm{d}t
2N1j=1jiNXtiYti,W(YtiYtj)𝔼πW(Yti)dt\displaystyle\qquad{}-\frac{2}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\langle X_{t}^{i}-Y_{t}^{i},\nabla W(Y^{i}_{t}-Y^{j}_{t})-\operatorname{\mathbb{E}}_{\pi}\nabla W(Y^{i}_{t}-\cdot)\rangle\,\mathrm{d}t
2βWXtiYtiN1j=1jiNXtjYtjdt+2XtiYtiN1j=1jiNW¯(Yti,Ytj)dt\displaystyle\leq\frac{2\beta_{W}\,\lVert X_{t}^{i}-Y_{t}^{i}\rVert}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}{\lVert X_{t}^{j}-Y_{t}^{j}\rVert}\,\mathrm{d}t+\frac{2\,\lVert X_{t}^{i}-Y_{t}^{i}\rVert}{N-1}\,\Bigl{\lVert}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\overline{\nabla W}(Y_{t}^{i},Y_{t}^{j})\Bigr{\rVert}\,\mathrm{d}t

or

dXtiYti\displaystyle\mathrm{d}\lVert X_{t}^{i}-Y_{t}^{i}\rVert βWN1j=1jiNXtjYtjdt+1N1j=1jiNW¯(Yti,Ytj)dt.\displaystyle\leq\frac{\beta_{W}}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}{\lVert X_{t}^{j}-Y_{t}^{j}\rVert}\,\mathrm{d}t+\frac{1}{N-1}\,\Bigl{\lVert}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\overline{\nabla W}(Y_{t}^{i},Y_{t}^{j})\Bigr{\rVert}\,\mathrm{d}t\,.

Integrating and squaring,

XtiYti2\displaystyle\lVert X_{t}^{i}-Y_{t}^{i}\rVert^{2} |0t(βWN1j=1jiNXsjYsj+1N1j=1jiNW¯(Ysi,Ysj))ds|2\displaystyle\leq\Bigl{\lvert}\int_{0}^{t}\Bigl{(}\frac{\beta_{W}}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}{\lVert X_{s}^{j}-Y_{s}^{j}\rVert}+\frac{1}{N-1}\,\Bigl{\lVert}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\overline{\nabla W}(Y_{s}^{i},Y_{s}^{j})\Bigr{\rVert}\Bigr{)}\,\mathrm{d}s\Bigr{\rvert}^{2}
(1+λ)βW2tN1j=1jiN0tXsjYsj2ds+(1+λ1)t(N1)20tj=1jiNW¯(Ysi,Ysj)2ds,\displaystyle\leq\frac{(1+\lambda)\,\beta_{W}^{2}t}{N-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\int_{0}^{t}\lVert X_{s}^{j}-Y_{s}^{j}\rVert^{2}\,\mathrm{d}s+\frac{(1+\lambda^{-1})\,t}{{(N-1)}^{2}}\int_{0}^{t}\Bigl{\lVert}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\overline{\nabla W}(Y_{s}^{i},Y_{s}^{j})\Bigr{\rVert}^{2}\,\mathrm{d}s\,,

where the last line follows from Young’s inequality.

Next, we take expectations. Note that W¯(,)\overline{\nabla W}(\cdot,\cdot) is centered in its second variable, so for any jkj\neq k,

𝔼W¯(Yti,Ytj),W¯(Yti,Ytk)=0.\displaystyle\operatorname{\mathbb{E}}\langle\overline{\nabla W}(Y^{i}_{t},Y^{j}_{t}),\overline{\nabla W}(Y^{i}_{t},Y^{k}_{t})\rangle=0\,.

Otherwise, we can bound the terms via

𝔼[W¯(Yti,Ytj)2]βW2𝔼YtjπZπ[YtjZ2]βW2σ2dα.\displaystyle\operatorname{\mathbb{E}}[\lVert\overline{\nabla W}(Y^{i}_{t},Y^{j}_{t})\rVert^{2}]\leq\beta_{W}^{2}\operatorname{\mathbb{E}}_{\begin{subarray}{c}Y^{j}_{t}\sim\pi\\ Z\sim\pi\end{subarray}}[\lVert Y^{j}_{t}-Z\rVert^{2}]\leq\frac{\beta_{W}^{2}\sigma^{2}d}{\alpha}\,.

Here, ZZ is an independent draw from π\pi, so this term cannot be reduced via coupling. The second inequality follows from a standard bound on the centered second moment of a strongly log-concave measure, using the fact that π\pi is 2α/σ22\alpha/\sigma^{2}-strongly log-concave (cf. [DKR22]).
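As a sanity check on this bound (Gaussian case only, where it is tight): for π = N(0, (σ²/2α)I_d), which is exactly 2α/σ²-strongly log-concave, independent draws Y, Z ∼ π satisfy 𝔼‖Y − Z‖² = σ²d/α. The Monte Carlo sketch below (NumPy only, illustrative parameters) confirms this.

```python
import numpy as np

rng = np.random.default_rng(2)

d, sigma, alpha = 5, 1.3, 0.7
var = sigma**2 / (2 * alpha)          # pi = N(0, (sigma^2 / (2 alpha)) I_d)

Y = rng.normal(scale=np.sqrt(var), size=(200_000, d))
Z = rng.normal(scale=np.sqrt(var), size=(200_000, d))
mc = np.mean(np.sum((Y - Z)**2, axis=1))     # Monte Carlo estimate of E ||Y - Z||^2
print(f"Monte Carlo estimate: {mc:.3f}   vs   sigma^2 d / alpha = {sigma**2 * d / alpha:.3f}")
```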

Therefore, taking expectations and summing over the particles,

𝔼[Xt1:NYt1:N2] (1+λ)βW2t0t𝔼[Xs1:NYs1:N2]ds+(1+λ1)βW2σ2dt2α.\displaystyle\operatorname{\mathbb{E}}[\lVert X_{t}^{1:N}-Y_{t}^{1:N}\rVert^{2}] \leq(1+\lambda)\,\beta_{W}^{2}t\int_{0}^{t}\operatorname{\mathbb{E}}[\lVert X_{s}^{1:N}-Y_{s}^{1:N}\rVert^{2}]\,\mathrm{d}s+\frac{(1+\lambda^{-1})\,\beta_{W}^{2}\sigma^{2}dt^{2}}{\alpha}\,.

By Grönwall’s inequality below,

𝔼[Xh1:NYh1:N2]\displaystyle\operatorname{\mathbb{E}}[\lVert X_{h}^{1:N}-Y_{h}^{1:N}\rVert^{2}] (1+λ1)βW2σ2dh2αexp((1+λ)βW2h22).\displaystyle\leq\frac{(1+\lambda^{-1})\,\beta_{W}^{2}\sigma^{2}dh^{2}}{\alpha}\exp\Bigl{(}\frac{(1+\lambda)\,\beta_{W}^{2}h^{2}}{2}\Bigr{)}\,.

This concludes the proof. ∎

Lemma 18 (Grönwall’s Inequality).

For T>0T>0, let f:[0,T]0f:[0,T]\to\mathbb{R}_{\geq 0} be bounded. Suppose that the following holds pointwise for some functions a,b:[0,T]a,b:[0,T]\to\mathbb{R}, where aa is increasing and bb is nonnegative:

f(t)a(t)+0tb(s)f(s)ds.f(t)\leq a(t)+\int_{0}^{t}b(s)f(s)\,\mathrm{d}s\,.

Then,

f(t)a(t)exp(0tb(s)ds).f(t)\leq a(t)\exp\Bigl{(}\int_{0}^{t}b(s)\,\mathrm{d}s\Bigr{)}\,.
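As a numerical illustration of the lemma (not part of the argument), the sketch below solves the integral relation f(t) = a(t) + ∫₀ᵗ b(s)f(s) ds on a fine grid by a forward recursion, for an increasing a and a nonnegative b, and checks that f stays below a(t)exp(∫₀ᵗ b).

```python
import numpy as np

# Equality case of the integral relation, solved by a forward recursion on a fine grid.
T, n = 2.0, 50_000
dt = T / n
t = np.linspace(0.0, T, n + 1)
a = 0.3 * t**2 + 0.1            # increasing
b = 1.0 + np.sin(t)**2          # nonnegative

f = np.empty(n + 1)
running_integral = 0.0
for k in range(n + 1):
    f[k] = a[k] + running_integral          # f(t_k) = a(t_k) + int_0^{t_k} b f ds (left Riemann sum)
    running_integral += b[k] * f[k] * dt

bound = a * np.exp(np.cumsum(b) * dt)       # a(t) * exp(int_0^t b ds)
print("max of f - bound:", (f - bound).max())   # nonpositive, up to discretization error
```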

Composing Lemmas 16 and 17, we now prove our propagation of chaos results.

Proof of Theorem 4     Indeed, we have

𝒲2(μ1:N,πN)=𝒲2(μ1:N𝒯h,πN)𝒲2(μ1:N𝒯h,πN𝒯h)+𝒲2(πN𝒯h,πN)\displaystyle\mathcal{W}_{2}(\mu^{1:N},\pi^{\otimes N})=\mathcal{W}_{2}(\mu^{1:N}\mathcal{T}_{h},\,\pi^{\otimes N})\leq\mathcal{W}_{2}(\mu^{1:N}\mathcal{T}_{h},\,\pi^{\otimes N}\mathcal{T}_{h})+\mathcal{W}_{2}(\pi^{\otimes N}\mathcal{T}_{h},\,\pi^{\otimes N})
exp(αh/2)𝒲2(μ1:N,πN)+(1+λ1)βW2σ2dh2αexp((1+λ)βW2h24).\displaystyle\qquad\leq\exp(-\alpha h/2)\,\mathcal{W}_{2}(\mu^{1:N},\pi^{\otimes N})+\sqrt{\frac{(1+\lambda^{-1})\,\beta_{W}^{2}\sigma^{2}dh^{2}}{\alpha}}\exp\Bigl{(}\frac{(1+\lambda)\,\beta_{W}^{2}h^{2}}{4}\Bigr{)}\,.

Rearranging,

𝒲2(μ1:N,πN)\displaystyle\mathcal{W}_{2}(\mu^{1:N},\pi^{\otimes N}) 11exp(αh/2)(1+λ1)βW2σ2dh2αexp((1+λ)βW2h24).\displaystyle\leq\frac{1}{1-\exp(-\alpha h/2)}\sqrt{\frac{(1+\lambda^{-1})\,\beta_{W}^{2}\sigma^{2}dh^{2}}{\alpha}}\exp\Bigl{(}\frac{(1+\lambda)\,\beta_{W}^{2}h^{2}}{4}\Bigr{)}\,.

Let h0h\searrow 0 first and then λ\lambda\nearrow\infty to obtain

𝒲22(μ1:N,πN)\displaystyle\mathcal{W}_{2}^{2}(\mu^{1:N},\pi^{\otimes N}) 4βW2σ2dα3.\displaystyle\leq\frac{4\beta_{W}^{2}\sigma^{2}d}{\alpha^{3}}\,.

Finally, when k<Nk<N, we use exchangeability (see Lemma 27 below) to conclude the proof of (3.2).

For (3.3), by the Bakry–Émery condition we have C𝖫𝖲𝖨(π)σ2/2αC_{\mathsf{LSI}}(\pi)\leq\nicefrac{{\sigma^{2}}}{{2\alpha}}, and tensorization (cf. [BGL14, Proposition 5.2.7]) leads to C𝖫𝖲𝖨(πN)σ2/2αC_{\mathsf{LSI}}(\pi^{\otimes N})\leq\nicefrac{{\sigma^{2}}}{{2\alpha}}. Thus, (LSI) gives

𝖪𝖫(μ1:NπN)σ24α𝖥𝖨(μ1:NπN).\mathsf{KL}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})\leq\frac{\sigma^{2}}{4\alpha}\,\mathsf{FI}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})\,.

However, one notes that the density of μ1:N\mu^{1:N} is log-smooth with parameter 2σ2(βV+NN1βW)\frac{2}{\sigma^{2}}\,(\beta_{V}+\frac{N}{N-1}\,\beta_{W}) (Lemma 7). Likewise, πN\pi^{\otimes N} is log-smooth with parameter 2σ2(βV+βW)\frac{2}{\sigma^{2}}\,(\beta_{V}+\beta_{W}). Now consider a functional \mathcal{F} on the space of probability measures on 𝒫2,ac(d×N)\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d\times N}) given by :ν𝔼ν[logμ1:NπN2]\mathcal{F}:\nu\mapsto\operatorname{\mathbb{E}}_{\nu}[\lVert\nabla\log\frac{\mu^{1:N}}{\pi^{\otimes N}}\rVert^{2}]. Note that log(μ1:N/πN)\log(\mu^{1:N}/\pi^{\otimes N}) is smooth with parameter at most 4σ2(βV+NN1βW)8σ2(βV+βW)\frac{4}{\sigma^{2}}\,(\beta_{V}+\frac{N}{N-1}\,\beta_{W})\leq\frac{8}{\sigma^{2}}\,(\beta_{V}+\beta_{W}), for N2N\geq 2.

Next, note that for Y1:NπNY^{1:N}\sim\pi^{\otimes N},

(πN)\displaystyle\mathcal{F}(\pi^{\otimes N}) =𝔼πN[logμ1:NlogπN2]\displaystyle=\operatorname{\mathbb{E}}_{\pi^{\otimes N}}[\lVert\nabla\log\mu^{1:N}-\nabla\log\pi^{\otimes N}\rVert^{2}]
=4Nσ4(N1)2𝔼[j=2N(W(Y1Yj)W(Y1)dπ)2]\displaystyle=\frac{4N}{\sigma^{4}\,{(N-1)}^{2}}\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\sum_{j=2}^{N}\Bigl{(}\nabla W(Y^{1}-Y^{j})-\int\nabla W(Y^{1}-\cdot)\,\mathrm{d}\pi\Bigr{)}\Bigr{\rVert}^{2}\Bigr{]}
=4Nσ4(N1)2𝔼[j=2NW¯(Y1,Yj)2],\displaystyle=\frac{4N}{\sigma^{4}\,{(N-1)}^{2}}\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\sum_{j=2}^{N}\overline{\nabla W}(Y^{1},Y^{j})\Bigr{\rVert}^{2}\Bigr{]}\,,

by using exchangeability and the definition of W¯\overline{\nabla W}.

Subsequently, one derives the following inequality using the Wasserstein distance bound:

(μ1:N)\displaystyle\mathcal{F}(\mu^{1:N}) 128σ4(βV+βW)2𝒲22(μ1:N,πN)+2(πN)\displaystyle\leq\frac{128}{\sigma^{4}}\,{(\beta_{V}+\beta_{W})}^{2}\,\mathcal{W}_{2}^{2}(\mu^{1:N},\pi^{\otimes N})+2\mathcal{F}(\pi^{\otimes N})
512βW2dα3σ2(βV+βW)2+8Nσ4(N1)2𝔼[j=2NW¯(Y1,Yj)2]\displaystyle\leq\frac{512\beta_{W}^{2}d}{\alpha^{3}\sigma^{2}}\,{(\beta_{V}+\beta_{W})}^{2}+\frac{8N}{\sigma^{4}\,{(N-1)}^{2}}\operatorname{\mathbb{E}}\Bigl{[}\Bigl{\lVert}\sum_{j=2}^{N}\overline{\nabla W}(Y^{1},Y^{j})\Bigr{\rVert}^{2}\Bigr{]}
512βW2dα3σ2(βV+βW)2+16βW2Nσ4(N1)𝔼[Y1𝔼Y12]\displaystyle\leq\frac{512\beta_{W}^{2}d}{\alpha^{3}\sigma^{2}}\,{(\beta_{V}+\beta_{W})}^{2}+\frac{16\beta_{W}^{2}N}{\sigma^{4}\,(N-1)}\operatorname{\mathbb{E}}[\lVert Y^{1}-\operatorname{\mathbb{E}}Y^{1}\rVert^{2}]
512βW2dα3σ2(βV+βW)2+16βW2dασ2,\displaystyle\leq\frac{512\beta_{W}^{2}d}{\alpha^{3}\sigma^{2}}\,{(\beta_{V}+\beta_{W})}^{2}+\frac{16\beta_{W}^{2}d}{\alpha\sigma^{2}}\,,

by using (3.2) and the fact that W¯(,)\overline{\nabla W}(\cdot,\cdot) is a centered random variable in its second argument. This concludes the proof for k=Nk=N, and as in the 𝒲22\mathcal{W}_{2}^{2} bound, Lemma 28 will conclude the proof for k<Nk<N. ∎

A.3 General Functional Case

For any measure μ\mu, define its entropy as 𝖾𝗇𝗍(μ)=logμdμ\operatorname{\mathsf{ent}}(\mu)=\int\log\mu\,\mathrm{d}\mu. We now provide a self-contained propagation of chaos argument in the general McKean–Vlasov setting, following [CRW22]. We begin with the following entropy toast inequality, i.e. half of the entropy sandwich inequality from [CRW22].

Lemma 19 (Entropy Toast Inequality).

Define the empirical total energy for an NN-particle system as follows. Given a measure ν1:N𝒫2,ac(d×N)\nu^{1:N}\in\mathcal{P}_{2,\rm{ac}}(\mathbb{R}^{d\times N}),

N(ν1:N)=N(ρx1:N)ν1:N(dx1:N)+σ22𝖾𝗇𝗍(ν1:N).\displaystyle\mathcal{E}^{N}(\nu^{1:N})=N\int\mathcal{F}(\rho_{x^{1:N}})\,\nu^{1:N}(\mathrm{d}x^{1:N})+\frac{\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\nu^{1:N})\,.

Under Assumption 5, it holds for all measures ν1:N𝒫2,ac(d×N)\nu^{1:N}\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d\times N})

σ22𝖪𝖫(ν1:NπN)N(ν1:N)N(π),\frac{\sigma^{2}}{2}\operatorname{\mathsf{KL}}(\nu^{1:N}\mathbin{\|}\pi^{\otimes N})\leq\mathcal{E}^{N}(\nu^{1:N})-N\mathcal{E}(\pi)\,,

where \mathcal{E} is the total energy (𝗀𝖤\mathsf{gE}) and π\pi is the stationary measure (2.2).

Proof.  By Assumption 5, we have

N(ν1:N)N(π)=N𝔼x1:Nν1:N[(ρx1:N)(π)]+σ22(𝖾𝗇𝗍(ν1:N)N𝖾𝗇𝗍(π))\displaystyle\mathcal{E}^{N}(\nu^{1:N})-N\mathcal{E}(\pi)=N\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}[\mathcal{F}(\rho_{x^{1:N}})-\mathcal{F}(\pi)]+\frac{\sigma^{2}}{2}\,\bigl{(}\operatorname{\mathsf{ent}}(\nu^{1:N})-N\operatorname{\mathsf{ent}}(\pi)\bigr{)}
𝔼x1:Nν1:N[Nδ(π,z)(ρx1:N(dz)π(dz))]+σ22(𝖾𝗇𝗍(ν1:N)N𝖾𝗇𝗍(π))\displaystyle\qquad\geq\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}\Bigl{[}N\int\delta\mathcal{F}(\pi,z)\,(\rho_{x^{1:N}}(\mathrm{d}z)-\pi(\mathrm{d}z))\Bigr{]}+\frac{\sigma^{2}}{2}\,\bigl{(}\operatorname{\mathsf{ent}}(\nu^{1:N})-N\operatorname{\mathsf{ent}}(\pi)\bigr{)}
=σ22𝔼x1:Nν1:N[Nlogπ(z)(ρx1:N(dz)π(dz))]+σ22(𝖾𝗇𝗍(ν1:N)N𝖾𝗇𝗍(π))\displaystyle\qquad=-\frac{\sigma^{2}}{2}\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}\Bigl{[}N\int\log\pi(z)\,(\rho_{x^{1:N}}(\mathrm{d}z)-\pi(\mathrm{d}z))\Bigr{]}+\frac{\sigma^{2}}{2}\,\bigl{(}\operatorname{\mathsf{ent}}(\nu^{1:N})-N\operatorname{\mathsf{ent}}(\pi)\bigr{)}
=σ22𝔼x1:Nν1:N[Nlogπ(z)ρx1:N(dz)]+σ22𝖾𝗇𝗍(ν1:N)\displaystyle\qquad=-\frac{\sigma^{2}}{2}\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}\Bigl{[}N\int\log\pi(z)\,\rho_{x^{1:N}}(\mathrm{d}z)\Bigr{]}+\frac{\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\nu^{1:N})
=σ22i=1Nlogπ(xi)ν1:N(dx1:N)+σ22𝖾𝗇𝗍(ν1:N).\displaystyle\qquad=-\frac{\sigma^{2}}{2}\int\sum_{i=1}^{N}\log\pi(x^{i})\,\nu^{1:N}(\mathrm{d}x^{1:N})+\frac{\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\nu^{1:N})\,.

However, this is just σ22𝖪𝖫(ν1:NπN)\frac{\sigma^{2}}{2}\operatorname{\mathsf{KL}}(\nu^{1:N}\mathbin{\|}\pi^{\otimes N}), so we are done. ∎

Proof of Theorem 5     We bound N(μ1:N)N(π)\mathcal{E}^{N}(\mu^{1:N})-N\mathcal{E}(\pi) via the following argument. First, define the finite-particle mean-field functional as N(ν1:N)=N(ρx1:N)ν1:N(dx1:N)\mathcal{F}^{N}(\nu^{1:N})=N\int\mathcal{F}(\rho_{x^{1:N}})\,\nu^{1:N}(\mathrm{d}x^{1:N}). In the sequel, we also use the following notation for conditional measures: if xi(x1:i1,xi+1:N)d×(N1)x^{-i}\coloneqq(x^{1:i-1},x^{i+1:N})\in\mathbb{R}^{d\times(N-1)},

μ1:N(x1:N)=μi|i(xixi)×μi(xi).\mu^{1:N}(x^{1:N})=\mu^{i|-i}(x^{i}\mid x^{-i})\times\mu^{-i}(x^{-i})\,.

We know that

N(μ1:N)N(π)=N(μ1:N)N(π)+σ22𝖾𝗇𝗍(μ1:N)Nσ22𝖾𝗇𝗍(π).\displaystyle\mathcal{E}^{N}(\mu^{1:N})-N\mathcal{E}(\pi)=\mathcal{F}^{N}(\mu^{1:N})-N\mathcal{F}(\pi)+\frac{\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\mu^{1:N})-\frac{N\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\pi)\,.

Furthermore, by Assumption 5,

N(μ1:N)N(π)N𝔼x1:Nμ1:Nδ(ρx1:N,z)(ρx1:N(dz)π(dz)).\displaystyle\mathcal{F}^{N}(\mu^{1:N})-N\mathcal{F}(\pi)\leq N\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\int\delta\mathcal{F}(\rho_{x^{1:N}},z)\,(\rho_{x^{1:N}}(\mathrm{d}z)-\pi(\mathrm{d}z))\,.

Using the subadditivity of entropy, we can therefore write

N(μ1:N)N(π)i=1N𝔼x1:Nμ1:N[\displaystyle\mathcal{E}^{N}(\mu^{1:N})-N\mathcal{E}(\pi)\leq\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[} δ(ρx1:N,xi)δ(ρx1:N,)dπ\displaystyle\delta\mathcal{F}(\rho_{x^{1:N}},x^{i})-\int\delta\mathcal{F}(\rho_{x^{1:N}},\cdot)\,\mathrm{d}\pi
+σ22(𝖾𝗇𝗍(μii(xi))𝖾𝗇𝗍(π))].\displaystyle\qquad{}+\frac{\sigma^{2}}{2}\,\bigl{(}\operatorname{\mathsf{ent}}(\mu^{i\mid-i}(\cdot\mid x^{-i}))-\operatorname{\mathsf{ent}}(\pi)\bigr{)}\Bigr{]}\,.

To decouple the terms, we now replace each δ(ρx1:N,)\delta\mathcal{F}(\rho_{x^{1:N}},\cdot) term with δ(ρxi,)\delta\mathcal{F}(\rho_{x^{-i}},\cdot):

N(μ1:N)N(π)\displaystyle\mathcal{E}^{N}(\mu^{1:N})-N\mathcal{E}(\pi)
i=1N𝔼x1:Nμ1:N[δ(ρxi,xi)δ(ρxi,)dπ+σ22(𝖾𝗇𝗍(μii(xi))𝖾𝗇𝗍(π))]𝖠\displaystyle\qquad\leq\underbrace{\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\delta\mathcal{F}(\rho_{x^{-i}},x^{i})-\int\delta\mathcal{F}(\rho_{x^{-i}},\cdot)\,\mathrm{d}\pi+\frac{\sigma^{2}}{2}\,\bigl{(}\operatorname{\mathsf{ent}}(\mu^{i\mid-i}(\cdot\mid x^{-i}))-\operatorname{\mathsf{ent}}(\pi)\bigr{)}\Bigr{]}}_{\mathsf{A}}
+i=1N𝔼x1:Nμ1:N[δ(ρx1:N,xi)δ(ρxi,xi)(δ(ρx1:N,)δ(ρxi,))dπ]𝖡.\displaystyle\qquad\quad{}+{\underbrace{\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\delta\mathcal{F}(\rho_{x^{1:N}},x^{i})-\delta\mathcal{F}(\rho_{x^{-i}},x^{i})-\int\bigl{(}\delta\mathcal{F}(\rho_{x^{1:N}},\cdot)-\delta\mathcal{F}(\rho_{x^{-i}},\cdot)\bigr{)}\,\mathrm{d}\pi\Bigr{]}}_{\mathsf{B}}}\,.

We consider the two terms in turn, beginning with the first.

Note that by Fubini’s theorem,

𝔼x1:Nμ1:Nδ(ρxi,xi)\displaystyle\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\delta\mathcal{F}(\rho_{x^{-i}},x^{i}) =𝔼xiμiδ(ρxi,)dμii(xi).\displaystyle=\operatorname{\mathbb{E}}_{x^{-i}\sim\mu^{-i}}\int\delta\mathcal{F}(\rho_{x^{-i}},\cdot)\,\mathrm{d}\mu^{i\mid-i}(\cdot\mid x^{-i})\,.

In order to relate the first term 𝖠\mathsf{A} to a KL divergence, for each xid×(N1)x^{-i}\in\mathbb{R}^{d\times(N-1)} we introduce the probability measure τxi𝒫2,ac(d)\tau_{x^{-i}}\in\mathcal{P}_{2,\rm ac}(\mathbb{R}^{d}) via

τxiexp(2σ2δ(ρxi,)).\displaystyle\tau_{x^{-i}}\propto\exp\Bigl{(}-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}(\rho_{x^{-i}},\cdot)\Bigr{)}\,.

We can compute

𝖪𝖫(μii(xi)τxi)\displaystyle\operatorname{\mathsf{KL}}\bigl{(}\mu^{i\mid-i}(\cdot\mid x^{-i})\bigm{\|}\tau_{x^{-i}}\bigr{)}
=(2σ2δ(ρxi,)+logμii(xi))dμii(xi)+logZ(τxi),\displaystyle\qquad=\int\Bigl{(}\frac{2}{\sigma^{2}}\,\delta\mathcal{F}(\rho_{x^{-i}},\cdot)+\log\mu^{i\mid-i}(\cdot\mid x^{-i})\Bigr{)}\,\mathrm{d}\mu^{i\mid-i}(\cdot\mid x^{-i})+\log Z(\tau_{x^{-i}})\,,

where Z(τxi)Z(\tau_{x^{-i}}) is the normalization constant for τxi\tau_{x^{-i}},

logZ(τxi)\displaystyle\log Z(\tau_{x^{-i}}) =logexp(2σ2δ(ρxi,z))dz\displaystyle=\log\int\exp\Bigl{(}-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}(\rho_{x^{-i}},z)\Bigr{)}\,\mathrm{d}z
=logexp(2σ2(δ(π,z)δ(ρxi,z)))π(dz)+logZ(π)\displaystyle=\log\int\exp\Bigl{(}\frac{2}{\sigma^{2}}\,\bigl{(}\delta\mathcal{F}(\pi,z)-\delta\mathcal{F}(\rho_{x^{-i}},z)\bigr{)}\Bigr{)}\,\pi(\mathrm{d}z)+\log Z(\pi)
2σ2δ(ρxi,)dπ𝖾𝗇𝗍(π).\displaystyle\geq-\frac{2}{\sigma^{2}}\int\delta\mathcal{F}(\rho_{x^{-i}},\cdot)\,\mathrm{d}\pi-\operatorname{\mathsf{ent}}(\pi)\,.

Upon taking expectations, we obtain

𝖠\displaystyle\mathsf{A} σ22i=1N𝔼xiμi𝖪𝖫(μii(xi)τxi).\displaystyle\leq\frac{\sigma^{2}}{2}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{-i}\sim\mu^{-i}}\operatorname{\mathsf{KL}}\bigl{(}\mu^{i\mid-i}(\cdot\mid x^{-i})\bigm{\|}\tau_{x^{-i}}\bigr{)}\,.

Moreover, we can recognize that τxi\tau_{x^{-i}} is a proximal Gibbs measure. By Assumptions 6 and 7,

𝖠\displaystyle\mathsf{A} C¯𝖫𝖲𝖨σ24i=1N𝔼xiμi𝖥𝖨(μii(xi)τxi)\displaystyle\leq\frac{\overline{C}_{\mathsf{LSI}}\,\sigma^{2}}{4}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{-i}\sim\mu^{-i}}\operatorname{\mathsf{FI}}\bigl{(}\mu^{i\mid-i}(\cdot\mid x^{-i})\bigm{\|}\tau_{x^{-i}}\bigr{)}
=C¯𝖫𝖲𝖨σ24i=1N𝔼x1:Nμ1:N[xilogμii(xixi)+2σ2𝒲2(ρxi,xi)2]\displaystyle=\frac{\overline{C}_{\mathsf{LSI}}\,\sigma^{2}}{4}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\Bigl{\lVert}\nabla_{x^{i}}\log\mu^{i\mid-i}(x^{i}\mid x^{-i})+\frac{2}{\sigma^{2}}\,\nabla_{\mathcal{W}_{2}}\mathcal{F}(\rho_{x^{-i}},x^{i})\Bigr{\rVert}^{2}\Bigr{]}
=C¯𝖫𝖲𝖨σ24i=1N𝔼x1:Nμ1:N[xilogμ1:N(x1:N)+2σ2𝒲2(ρxi,xi)2]\displaystyle=\frac{\overline{C}_{\mathsf{LSI}}\,\sigma^{2}}{4}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\Bigl{\lVert}\nabla_{x^{i}}\log\mu^{1:N}(x^{1:N})+\frac{2}{\sigma^{2}}\,\nabla_{\mathcal{W}_{2}}\mathcal{F}(\rho_{x^{-i}},x^{i})\Bigr{\rVert}^{2}\Bigr{]}
=C¯𝖫𝖲𝖨σ2i=1N𝔼x1:Nμ1:N[𝒲2(ρx1:N,xi)𝒲2(ρxi,xi)2]\displaystyle=\frac{\overline{C}_{\mathsf{LSI}}}{\sigma^{2}}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}[\lVert\nabla_{\mathcal{W}_{2}}\mathcal{F}(\rho_{x^{1:N}},x^{i})-\nabla_{\mathcal{W}_{2}}\mathcal{F}(\rho_{x^{-i}},x^{i})\rVert^{2}]
=C¯𝖫𝖲𝖨σ2i=1N𝔼x1:Nμ1:N[𝒲20(ρx1:N,xi)𝒲20(ρxi,xi)2]\displaystyle=\frac{\overline{C}_{\mathsf{LSI}}}{\sigma^{2}}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}[\lVert\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})-\nabla_{\mathcal{W}_{2}}\mathcal{F}_{0}(\rho_{x^{-i}},x^{i})\rVert^{2}]
β2C¯𝖫𝖲𝖨σ2i=1N𝔼x1:Nμ1:N𝒲12(ρx1:N,ρxi).\displaystyle\leq\frac{\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\mathcal{W}_{1}^{2}(\rho_{x^{1:N}},\rho_{x^{-i}})\,.

To transport the mass from ρx1:N\rho_{x^{1:N}} to ρxi\rho_{x^{-i}}, we take the transport plan which moves 1N(N1)\frac{1}{N\,(N-1)} of the mass from xix^{i} to each xjx^{j}, jij\neq i. It yields

𝒲1(ρx1:N,ρxi)\displaystyle\mathcal{W}_{1}(\rho_{x^{1:N}},\rho_{x^{-i}}) 1N(N1)j=1jiNxixj.\displaystyle\leq\frac{1}{N\,(N-1)}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}{\lVert x^{i}-x^{j}\rVert}\,. (A.6)

Hence,

𝖠\displaystyle\mathsf{A} β2C¯𝖫𝖲𝖨σ2N2(N1)2𝔼x1:Nμ1:Ni=1N(j=1jiNxixj)2\displaystyle\leq\frac{\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}N^{2}\,{(N-1)}^{2}}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\sum_{i=1}^{N}\Bigl{(}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}{\lVert x^{i}-x^{j}\rVert}\Bigr{)}^{2}
β2C¯𝖫𝖲𝖨σ2N2(N1)𝔼x1:Nμ1:Nijxixj2=β2C¯𝖫𝖲𝖨σ2N𝔼x1:2μ1:2[x1x22].\displaystyle\leq\frac{\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}N^{2}\,(N-1)}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\sum_{i\neq j}{\lVert x^{i}-x^{j}\rVert^{2}}=\frac{\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}N}\operatorname{\mathbb{E}}_{x^{1:2}\sim\mu^{1:2}}[\lVert x^{1}-x^{2}\rVert^{2}]\,.
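As a quick sanity check of the coupling bound (A.6) used above (one-dimensional case only, where the exact 𝒲₁ distance is available in SciPy), the sketch below compares 𝒲₁(ρ_{x^{1:N}}, ρ_{x^{−i}}) with the right-hand side of (A.6) for random particle configurations.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)

N = 12
for _ in range(1_000):
    x = rng.normal(size=N)                 # 1D particle configuration x^1, ..., x^N
    i = rng.integers(N)
    x_minus_i = np.delete(x, i)
    w1 = wasserstein_distance(x, x_minus_i)                   # exact 1D W1 between empirical measures
    bound = np.abs(x[i] - x_minus_i).sum() / (N * (N - 1))    # right-hand side of (A.6)
    assert w1 <= bound + 1e-12
print("Coupling bound (A.6) verified on random 1D configurations.")
```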

We then use the inequality

12𝔼x1:2μ1:2[x1x22]\displaystyle\frac{1}{2}\operatorname{\mathbb{E}}_{x^{1:2}\sim\mu^{1:2}}[\lVert x^{1}-x^{2}\rVert^{2}] 2𝒲22(μ1:2,π2)+𝔼x1:2π2[x1x22]\displaystyle\leq 2\mathcal{W}_{2}^{2}(\mu^{1:2},\pi^{\otimes 2})+\operatorname{\mathbb{E}}_{x^{1:2}\sim\pi^{\otimes 2}}[\lVert x^{1}-x^{2}\rVert^{2}]
4N𝒲22(μ1:N,πN)+2𝔼xπ[x𝔼x2]\displaystyle\leq\frac{4}{N}\,\mathcal{W}_{2}^{2}(\mu^{1:N},\pi^{\otimes N})+2\operatorname{\mathbb{E}}_{x\sim\pi}[\lVert x-\operatorname{\mathbb{E}}x\rVert^{2}]
8C¯𝖫𝖲𝖨N𝖪𝖫(μ1:NπN)+2dC¯𝖫𝖲𝖨,\displaystyle\leq\frac{8\overline{C}_{\mathsf{LSI}}}{N}\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})+2d\overline{C}_{\mathsf{LSI}}\,, (A.7)

where we used Lemma 27 and the Poincaré inequality for π\pi. Hence,

𝖠\displaystyle\mathsf{A} 2β2C¯𝖫𝖲𝖨σ2N(8C¯𝖫𝖲𝖨N𝖪𝖫(μ1:NπN)+2dC¯𝖫𝖲𝖨).\displaystyle\leq\frac{2\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}N}\,\Bigl{(}\frac{8\overline{C}_{\mathsf{LSI}}}{N}\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})+2d\overline{C}_{\mathsf{LSI}}\Bigr{)}\,.

Next, we turn toward term 𝖡\mathsf{B}. First, define a function ζx1:Ni:d\zeta_{x^{1:N}}^{i}:\mathbb{R}^{d}\to\mathbb{R} by

ζx1:Ni(y)δ(ρxi,y)δ(ρx1:N,y)=δ0(ρxi,y)δ0(ρx1:N,y).\displaystyle\zeta_{x^{1:N}}^{i}(y)\coloneqq\delta\mathcal{F}(\rho_{x^{-i}},y)-\delta\mathcal{F}(\rho_{x^{1:N}},y)=\delta\mathcal{F}_{0}(\rho_{x^{-i}},y)-\delta\mathcal{F}_{0}(\rho_{x^{1:N}},y)\,.

It is clear from Assumption 6 that this function is Lipschitz with constant 2β𝒲1(ρx1:N,ρxi)2\beta\mathcal{W}_{1}(\rho_{x^{1:N}},\rho_{x^{-i}}). Thus, we obtain using this Lipschitzness, (A.6), and Young’s inequality,

𝖡\displaystyle\mathsf{B} =i=1N𝔼x1:Nμ1:N(ζx1:Ni(xi)ζx1:Ni(z))π(dz)\displaystyle=\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\int\bigl{(}\zeta_{x^{1:N}}^{i}(x^{i})-\zeta_{x^{1:N}}^{i}(z)\bigr{)}\,\pi(\mathrm{d}z)
i=1N𝔼x1:Nμ1:N2βN(N1)j=1jiNxjxixizπ(dz)\displaystyle\leq\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\int\frac{2\beta}{N\,(N-1)}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}{\lVert x^{j}-x^{i}\rVert\,\lVert x^{i}-z\rVert}\,\pi(\mathrm{d}z)
βN(N1)𝔼x1:Nμ1:Nijxixj2+βNi=1N𝔼(xi,z)μ1π[xiz2]\displaystyle\leq\frac{\beta}{N\,(N-1)}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\sum_{i\neq j}{\lVert x^{i}-x^{j}\rVert^{2}}+\frac{\beta}{N}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{(x^{i},z)\sim\mu^{1}\otimes\pi}[\lVert x^{i}-z\rVert^{2}]
=β𝔼x1:2μ1:2[x1x22]+β𝔼(x,z)μ1π[xz2].\displaystyle=\beta\operatorname{\mathbb{E}}_{x^{1:2}\sim\mu^{1:2}}[{\lVert x^{1}-x^{2}\rVert^{2}}]+\beta\operatorname{\mathbb{E}}_{(x,z)\sim\mu^{1}\otimes\pi}[\lVert x-z\rVert^{2}]\,.

For the first term, we can apply (A.7), and for the second term, we can apply (A.1). It yields

𝖡\displaystyle\mathsf{B} 20βC¯𝖫𝖲𝖨N𝖪𝖫(μ1:NπN)+8βC¯𝖫𝖲𝖨d.\displaystyle\leq\frac{20\beta\overline{C}_{\mathsf{LSI}}}{N}\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})+8\beta\overline{C}_{\mathsf{LSI}}d\,.

Putting the bounds together with Lemma 19,

𝖪𝖫(μ1:NπN)\displaystyle\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N}) 33βC¯𝖫𝖲𝖨dσ2\displaystyle\leq\frac{33\beta\overline{C}_{\mathsf{LSI}}d}{\sigma^{2}}

for all N160βC¯𝖫𝖲𝖨/σ2N\geq 160\beta\overline{C}_{\mathsf{LSI}}/\sigma^{2}. The result for kNk\leq N follows from Lemma 28. ∎

Appendix B Isoperimetric Results for the Stationary Distributions

B.1 Convexity and Smoothness

Here, we verify the convexity and smoothness properties of μ1:N\mu^{1:N} in the pairwise McKean–Vlasov setting.

Proof of Lemma 7     For x1:N=[x1,,xN]d×Nx^{1:N}=[x^{1},\dotsc,x^{N}]\in\mathbb{R}^{d\times N}, the Hessian of log(1/μ1:N)\log(1/\mu^{1:N}) can be explicitly computed as

σ222logμ1:N(x1:N)=[2V(x1)0002V(x2)0002V(xN)]\displaystyle-\frac{\sigma^{2}}{2}\,\nabla^{2}\log\mu^{1:N}(x^{1:N})=\begin{bmatrix}\nabla^{2}V(x^{1})&0&\cdots&0\\ 0&\nabla^{2}V(x^{2})&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\nabla^{2}V(x^{N})\end{bmatrix}
+1N1[j=2N2W(x1xj)2W(x1x2)2W(x1xN)2W(x2x1)j=1j2N2W(x2xj)2W(x2xN)2W(xNx1)2W(xNx2)j=1N12W(xNxj)]=𝖡.\displaystyle\quad+\frac{1}{N-1}\,\underbrace{\begin{bmatrix}\sum_{j=2}^{N}\nabla^{2}W(x^{1}-x^{j})&-\nabla^{2}W(x^{1}-x^{2})&\cdots&-\nabla^{2}W(x^{1}-x^{N})\\ -\nabla^{2}W(x^{2}-x^{1})&\sum_{\begin{subarray}{c}j=1\\ j\neq 2\end{subarray}}^{N}\nabla^{2}W(x^{2}-x^{j})&\cdots&-\nabla^{2}W(x^{2}-x^{N})\\ \vdots&\vdots&\ddots&\vdots\\ -\nabla^{2}W(x^{N}-x^{1})&-\nabla^{2}W(x^{N}-x^{2})&\cdots&\sum_{j=1}^{N-1}\nabla^{2}W(x^{N}-x^{j})\end{bmatrix}}_{=\mathsf{B}}\,.

Clearly, the first block matrix has eigenvalues between αV\alpha_{V} and βV\beta_{V}. For the second block matrix 𝖡\mathsf{B}, let us denote Ai,j2W(xixj)A_{i,j}\coloneqq\nabla^{2}W(x^{i}-x^{j}) for i,j[N]i,j\in[N]. Note that Ai,j=Aj,iA_{i,j}=A_{j,i} since WW is even, and each Ai,jA_{i,j} is clearly symmetric.

For 𝖡dN×dN\mathsf{B}\in\mathbb{R}^{dN\times dN} the second matrix and y=[y1,,yN]dNy=[y^{1},\dotsc,y^{N}]\in\mathbb{R}^{dN}, we have

y𝖳𝖡y\displaystyle y^{\mathsf{T}}\mathsf{B}y =iNyi𝖳(j[N]\iAi,j)yii,jNijyi𝖳Ai,jyj\displaystyle=\sum_{i\leq N}y_{i}^{\mathsf{T}}\,\Bigl{(}\sum_{j\in[N]\backslash i}A_{i,j}\Bigr{)}\,y_{i}-\sum_{\begin{subarray}{c}i,j\leq N\\ i\neq j\end{subarray}}y_{i}^{\mathsf{T}}A_{i,j}y_{j}
=i,jNi<j(yi𝖳Ai,jyi+yj𝖳Aj,iyjyi𝖳Ai,jyjyj𝖳Aj,iyi)=i,jNi<j(yiyj)𝖳Ai,j(yiyj).\displaystyle=\sum_{\begin{subarray}{c}i,j\leq N\\ i<j\end{subarray}}\bigl{(}y_{i}^{\mathsf{T}}A_{i,j}y_{i}+y_{j}^{\mathsf{T}}A_{j,i}y_{j}-y_{i}^{\mathsf{T}}A_{i,j}y_{j}-y_{j}^{\mathsf{T}}A_{j,i}y_{i}\bigr{)}=\sum_{\begin{subarray}{c}i,j\leq N\\ i<j\end{subarray}}(y_{i}-y_{j})^{\mathsf{T}}A_{i,j}(y_{i}-y_{j})\,.

Using αWIdAi,jβWId\alpha_{W}I_{d}\preceq A_{i,j}\preceq\beta_{W}I_{d} and

𝖬y2i,jNi<jyiyj2\displaystyle\mathsf{M}\coloneqq\nabla^{2}_{y}\sum_{\begin{subarray}{c}i,j\leq N\\ i<j\end{subarray}}\lVert y_{i}-y_{j}\rVert^{2} =2[N1111N1111N1]Id,\displaystyle=2\,\begin{bmatrix}N-1&-1&\cdots&-1\\ -1&N-1&\cdots&-1\\ \vdots&\vdots&\ddots&\vdots\\ -1&-1&\cdots&N-1\end{bmatrix}\otimes I_{d}\,,

we have 12αW𝖬𝖡12βW𝖬\frac{1}{2}\,\alpha_{W}\mathsf{M}\preceq\mathsf{B}\preceq\frac{1}{2}\,\beta_{W}\mathsf{M}. Since the circulant matrix in 𝖬\mathsf{M} is PSD due to diagonal dominance and its largest eigenvalue is at most NN, it follows that the eigenvalues of 𝖬\mathsf{M} lie between 0 and 2N2N. Hence, the eigenvalues of 𝖡\mathsf{B} lie in [NαW,NβW][N{\alpha_{W}^{-}},N\beta_{W}], and those of 𝖡/(N1)\mathsf{B}/(N-1) lie in the interval [NN1αW,NN1βW][\frac{N}{N-1}\,{\alpha_{W}^{-}},\frac{N}{N-1}\,\beta_{W}]. ∎
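The two linear-algebra facts used in this proof, the quadratic-form identity for 𝖡 and the fact that the eigenvalues of 𝖬 lie between 0 and 2N, are easy to verify numerically; the following NumPy sketch does so with random symmetric blocks A_{i,j} (illustrative dimensions only).

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 3, 6

# Random symmetric blocks A_{i,j} = A_{j,i} (A_{i,i} is left as zero and never used).
A = np.zeros((N, N, d, d))
for i in range(N):
    for j in range(i + 1, N):
        S = rng.normal(size=(d, d))
        A[i, j] = A[j, i] = (S + S.T) / 2

# Assemble B: diagonal blocks sum_{j != i} A_{i,j}, off-diagonal blocks -A_{i,j}.
B = np.zeros((N * d, N * d))
for i in range(N):
    for j in range(N):
        blk = A[i].sum(axis=0) if i == j else -A[i, j]
        B[i*d:(i+1)*d, j*d:(j+1)*d] = blk

y = rng.normal(size=(N, d))
quad = y.reshape(-1) @ B @ y.reshape(-1)
pairwise = sum((y[i] - y[j]) @ A[i, j] @ (y[i] - y[j]) for i in range(N) for j in range(i + 1, N))
print("quadratic-form identity error:", abs(quad - pairwise))

# M = 2 (N I - 11^T) (x) I_d has eigenvalues 0 and 2N only.
M = 2 * np.kron(N * np.eye(N) - np.ones((N, N)), np.eye(d))
eigs = np.linalg.eigvalsh(M)
print("eigenvalues of M lie in [0, 2N]:", eigs.min() >= -1e-8 and eigs.max() <= 2 * N + 1e-8)
```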

B.2 Bounded Perturbations

In this section, we prove the isoperimetric results from §3.2.1. We again introduce the conditional measure: if xi(x1:i1,xi+1:N)d×(N1)x^{-i}\coloneqq(x^{1:i-1},x^{i+1:N})\in\mathbb{R}^{d\times(N-1)} we define

μ1:N(x1:N)=μi|i(xixi)×μi(xi)\mu^{1:N}(x^{1:N})=\mu^{i|-i}(x^{i}\mid x^{-i})\times\mu^{-i}(x^{-i})

for the conditional distribution of the ii-th particle and the distribution of an NN-particle system with the ii-th particle marginalized out.

Before proceeding to the proof of Lemma 8, we first state a result on log-Sobolev inequalities under weak interactions due to [OR07].

Lemma 20 ([OR07, Theorem 1]).

Consider a measure μ1:N\mu^{1:N} on d×N\mathbb{R}^{d\times N}, with conditional measures μi|i\mu^{i|-i}. Assume that

C𝖫𝖲𝖨(μi|i(xi))\displaystyle C_{\mathsf{LSI}}(\mu^{i|-i}(\cdot\mid x^{-i})) 1τi,for all i[N]xid×(N1),\displaystyle\leq\frac{1}{\tau_{i}}\,,\qquad\text{for all $i\in[N]$, $x^{-i}\in\mathbb{R}^{d\times(N-1)}$}\,,
xixjlogμ1:N(x1:N)\displaystyle\lVert\nabla_{x^{i}}\nabla_{x^{j}}\log\mu^{1:N}(x^{1:N})\rVert βi,j,for all x1:Nd×Ni,j[N]ij.\displaystyle\leq\beta_{i,j}\,,\qquad\text{for all $x^{1:N}\in\mathbb{R}^{d\times N}$, $i,j\in[N]$, $i\neq j$}\,.

Then, consider the matrix AN×NA\in\mathbb{R}^{N\times N} with entries Ai,i=τiA_{i,i}=\tau_{i}, Ai,j=βi,jA_{i,j}=-\beta_{i,j} for iji\neq j. If AρINA\succeq\rho I_{N}, then μ1:N\mu^{1:N} satisfies (LSI) with constant 1/ρ1/\rho.

Proof of Lemma 8     We begin by proving the statement about π\pi. The potential of the invariant measure π\pi can also be written as

log1π(x)\displaystyle\log\frac{1}{\pi(x)} =2σ2(V0(x)+V1(x)+(W0(x)+W1(x))dπ)\displaystyle=\frac{2}{\sigma^{2}}\,\Bigl{(}V_{0}(x)+V_{1}(x)+\int\bigl{(}W_{0}(x-\cdot)+W_{1}(x-\cdot)\bigr{)}\,\mathrm{d}\pi\Bigr{)}
=2σ2(V0(x)+W0(x)dπ)+2σ2(V1(x)+W1(x)dπ).\displaystyle=\frac{2}{\sigma^{2}}\,\Bigl{(}V_{0}(x)+\int W_{0}(x-\cdot)\,\mathrm{d}\pi\Bigr{)}+\frac{2}{\sigma^{2}}\,\Bigl{(}V_{1}(x)+\int W_{1}(x-\cdot)\,\mathrm{d}\pi\Bigr{)}\,.

This is the sum of a 2σ2(αV0+αW0)\frac{2}{\sigma^{2}}\,(\alpha_{V_{0}}+\alpha_{W_{0}})-convex function with a 2σ2(𝗈𝗌𝖼(V1)+𝗈𝗌𝖼(W1))\frac{2}{\sigma^{2}}\,(\operatorname{\mathsf{osc}}(V_{1})+\operatorname{\mathsf{osc}}(W_{1}))-bounded perturbation. Thus, π\pi satisfies (LSI) with the claimed parameter.

We now prove the statement about μ1:N\mu^{1:N}. Each conditional measure has a density of the form

μi|i(xi)exp(2σ2V()2σ2(N1)j[N]\iW(xj)),\displaystyle\mu^{i|-i}(\cdot\mid x^{-i})\propto\exp\Bigl{(}-\frac{2}{\sigma^{2}}\,V(\cdot)-\frac{2}{\sigma^{2}\,(N-1)}\sum_{j\in[N]\backslash i}W(\cdot-x^{j})\Bigr{)}\,,

where VV and WW are bounded perturbations of αV0\alpha_{V_{0}}- and αW0\alpha_{W_{0}}-strongly convex functions respectively, uniformly over the conditioning variables xix^{-i}. Thus, by Holley–Stroock perturbation and the Bakry–Émery condition, each μi|i(xi)\mu^{i|-i}(\cdot\mid x^{-i}) satisfies (LSI) with parameter

τi1τ1σ22(αV0+NN1αW0)1exp(2σ2(𝗈𝗌𝖼(V1)+𝗈𝗌𝖼(W1))).\displaystyle\tau_{i}^{-1}\leq\tau^{-1}\coloneqq\frac{\sigma^{2}}{2}\,\bigl{(}\alpha_{V_{0}}+\frac{N}{N-1}\,{\alpha_{W_{0}}^{-}}\bigr{)}^{-1}\exp\Bigl{(}\frac{2}{\sigma^{2}}\,\bigl{(}\operatorname{\mathsf{osc}}(V_{1})+\operatorname{\mathsf{osc}}(W_{1})\bigr{)}\Bigr{)}\,.

Secondly, we note that from (2.1), we have

2σ2(N1)supzd2W(z)𝗈𝗉2βWσ2(N1)βi,j.\displaystyle\frac{2}{\sigma^{2}\,(N-1)}\sup_{z\in\mathbb{R}^{d}}\lVert\nabla^{2}W(z)\rVert_{\mathsf{op}}\leq\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}\eqqcolon\beta_{i,j}\,.

Thus, we have

A=[τ2βWσ2(N1)2βWσ2(N1)2βWσ2(N1)τ2βWσ2(N1)2βWσ2(N1)2βWσ2(N1)τ].\displaystyle A=\begin{bmatrix}\tau&-\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}&\cdots&-\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}\\[5.0pt] -\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}&\tau&\cdots&-\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}\\[5.0pt] \vdots&\vdots&\ddots&\vdots\\[5.0pt] -\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}&-\frac{2\beta_{W}}{\sigma^{2}\,(N-1)}&\cdots&\tau\end{bmatrix}\,.

Under Assumption 10, this matrix is strictly diagonally dominant and has a minimum eigenvalue of at least τ/2\tau/2. We can now apply Lemma 20 to complete the proof. ∎
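The diagonal-dominance step can also be made explicit: AA equals (\tau+c)\,I_{N}-c\,J_{N} with c=2\beta_{W}/(\sigma^{2}\,(N-1)), so its eigenvalues are \tau+c (multiplicity N-1) and \tau-2\beta_{W}/\sigma^{2}, and the latter is at least \tau/2 as soon as 4\beta_{W}/\sigma^{2}\leq\tau. A quick numerical sketch, with illustrative parameter values rather than the constants appearing in Assumption 10:

```python
import numpy as np

# Illustrative values only; tau is not computed from Assumption 10 here.
sigma2, beta_W, N = 1.0, 0.5, 8
tau = 3.0                                   # any tau >= 4*beta_W/sigma2
c = 2 * beta_W / (sigma2 * (N - 1))
A = tau * np.eye(N) - c * (np.ones((N, N)) - np.eye(N))

eigs = np.linalg.eigvalsh(A)
# spectrum: tau + c (multiplicity N-1) and tau - 2*beta_W/sigma2
assert np.isclose(eigs.min(), tau - 2 * beta_W / sigma2)
assert eigs.min() >= tau / 2                # the bound used in the proof
```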

B.3 Logarithmic Sobolev Inequalities via Perturbations

In this section, we state log-Sobolev inequalities for Lipschitz perturbations of strongly log-concave measures, which are used for the general McKean–Vlasov setting in §2.1.2.

Lemma 21 (LSI under Lipschitz Perturbations [BP24, Theorem 1.4]).

Let μexp(HV)\mu\propto\exp(-H-V) for an α\alpha-strongly convex function V:dV:\mathbb{R}^{d}\to\mathbb{R} and an LL-Lipschitz function H:dH:\mathbb{R}^{d}\to\mathbb{R}. Then, μ\mu satisfies (LSI) with constant C𝖫𝖲𝖨(μ)C_{\mathsf{LSI}}(\mu) given by

C𝖫𝖲𝖨(μ)1αexp(L2α+4Lα).\displaystyle C_{\mathsf{LSI}}(\mu)\leq\frac{1}{\alpha}\exp\Bigl{(}\frac{L^{2}}{\alpha}+\frac{4L}{\sqrt{\alpha}}\Bigr{)}\,.

From this one derives the following log-Sobolev inequality for the proximal Gibbs measure.

Lemma 22 (Uniform LSI for the Proximal Gibbs Measure [SWN23, Theorem 1]).

For the proximal Gibbs measure (2.3) in the setting of §2.1.2, under Assumption 8, we have that

supμ𝒫2(d)C𝖫𝖲𝖨(πμ)C¯𝖫𝖲𝖨,\displaystyle\sup_{\mu\in\mathcal{P}_{2}(\mathbb{R}^{d})}C_{\mathsf{LSI}}(\pi_{\mu})\leq\overline{C}_{\mathsf{LSI}}\,,

where the constant C¯𝖫𝖲𝖨\overline{C}_{\mathsf{LSI}} can be bounded by

C¯𝖫𝖲𝖨\displaystyle\overline{C}_{\mathsf{LSI}} σ22λexp(2B2λσ2+8B2λσ).\displaystyle\leq\frac{\sigma^{2}}{2\lambda}\exp\Bigl{(}\frac{2B^{2}}{\lambda\sigma^{2}}+\frac{8B}{\sqrt{2\lambda}\,\sigma}\Bigr{)}\,.

Obtaining a uniform-in-NN LSI for μ1:N\mu^{1:N} under Assumption 8 is more difficult, and we rely on the recent heat flow estimates of [BP24]. In their work, the authors showed the existence of an LL-Lipschitz transport map—the Kim–Milman map [KM12]—from the standard Gaussian measure γ\gamma to a measure μ\mu, under suitable assumptions on μ\mu. By [BGL14, Proposition 5.4.3], this immediately implies that μ\mu satisfies C𝖫𝖲𝖨(μ)L2C_{\mathsf{LSI}}(\mu)\leq L^{2}. The existence of the Lipschitz transport map is based on estimates along the heat flow, and we summarize the computation in a convenient form based on bounding the operator norm of the covariance matrix of Gaussian tilts of the measure. The latter property is sometimes called tilt stability in the literature. Note that we do not attempt to optimize constants here.
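For completeness, the transfer step behind the claim C_{\mathsf{LSI}}(\mu)\leq L^{2} is the following standard computation (written here, as a sketch, in the normalization \operatorname{Ent}_{\nu}(f^{2})\leq 2C_{\mathsf{LSI}}(\nu)\int\lVert\nabla f\rVert^{2}\,\mathrm{d}\nu; the choice of normalization does not affect the factor L^{2}): for smooth ff,

\operatorname{Ent}_{T_{\#}\gamma}(f^{2})=\operatorname{Ent}_{\gamma}\bigl{(}(f\circ T)^{2}\bigr{)}\leq 2\,C_{\mathsf{LSI}}(\gamma)\int\lVert\nabla(f\circ T)\rVert^{2}\,\mathrm{d}\gamma\leq 2L^{2}\,C_{\mathsf{LSI}}(\gamma)\int\lVert\nabla f\rVert^{2}\,\mathrm{d}T_{\#}\gamma\,,

using \lVert\nabla(f\circ T)\rVert\leq L\,(\lVert\nabla f\rVert\circ T) almost everywhere, so that C_{\mathsf{LSI}}(T_{\#}\gamma)\leq L^{2}\,C_{\mathsf{LSI}}(\gamma)=L^{2}, which is the bound used above.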

Lemma 23 (Lipschitz Transport Maps via Reverse OU).

Let μ\mu be a probability measure over d\mathbb{R}^{d} and for any t>0t>0, ydy\in\mathbb{R}^{d}, let μt,y\mu_{t,y} denote the Gaussian tilt,

μt,y(dx)exp(yx22t+x22)μ(dx),\displaystyle\mu_{t,y}(\mathrm{d}x)\propto\exp\Bigl{(}-\frac{\lVert y-x\rVert^{2}}{2t}+\frac{\lVert x\rVert^{2}}{2}\Bigr{)}\,\mu(\mathrm{d}x)\,, (B.1)

where we assume that this defines a valid probability measure for all t>0t>0 and ydy\in\mathbb{R}^{d}. Suppose there exist a,C>0a,C>0 such that the following “tilt stability” property holds:

covμt,y𝗈𝗉\displaystyle\lVert\operatorname{cov}_{\mu_{t,y}}\rVert_{\mathsf{op}} (1a+1/t+Ca+1/t)2,for allt>0,yd.\displaystyle\leq\Bigl{(}\frac{1}{\sqrt{a+1/t}}+\frac{C}{a+1/t}\Bigr{)}^{2}\,,\qquad\text{for all}~{}t>0\,,\;y\in\mathbb{R}^{d}\,.

Then, there exists an LL-Lipschitz transport map T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} such that T#γ=μT_{\#}\gamma=\mu, where γ\gamma is the standard Gaussian measure and LL can be estimated by

L\displaystyle L 11+aexp(C22(1+a)+2C1+a).\displaystyle\leq\frac{1}{\sqrt{1+a}}\exp\Bigl{(}\frac{C^{2}}{2\,(1+a)}+\frac{2C}{\sqrt{1+a}}\Bigr{)}\,.

Proof.  We follow the calculations of [BP24]. Let (Pt)t0{(P_{t})}_{t\geq 0} denote the heat semigroup, and (Qt)t0{(Q_{t})}_{t\geq 0} the Ornstein–Uhlenbeck semigroup. Then, with γ\gamma the standard Gaussian measure, using the identity Qtf=P1exp(2t)f(exp(t))Q_{t}f=P_{1-\exp(-2t)}f(\exp(-t)\,\cdot) together with the tilt stability assumption, we obtain

Iexp(2t)1\displaystyle-\frac{I}{\exp(2t)-1} 2logQt(μγ)=exp(2t)[2logP1exp(2t)(μγ)](exp(t))\displaystyle\preceq\nabla^{2}\log Q_{t}\bigl{(}\frac{\mu}{\gamma}\bigr{)}=\exp(-2t)\,\Bigl{[}\nabla^{2}\log P_{1-\exp(-2t)}\bigl{(}\frac{\mu}{\gamma}\bigr{)}\Bigr{]}(\exp(-t)\,\cdot)
=1exp(2t)1(covμ1exp(2t),exp(t)1exp(2t)I)\displaystyle=\frac{1}{\exp(2t)-1}\,\Bigl{(}\frac{\operatorname{cov}_{\mu_{1-\exp(-2t),\,\exp(-t)\,\cdot}}}{1-\exp(-2t)}-I\Bigr{)}
Iexp(2t)1[(1/1exp(2t)a+1/(1exp(2t))+C/1exp(2t)a+1/(1exp(2t)))21]\displaystyle\preceq\frac{I}{\exp(2t)-1}\,\Bigl{[}\Bigl{(}\frac{1/\sqrt{1-\exp(-2t)}}{\sqrt{a+1/(1-\exp(-2t))}}+\frac{C/\sqrt{1-\exp(-2t)}}{a+1/(1-\exp(-2t))}\Bigr{)}^{2}-1\Bigr{]}
[1αα(exp(2t)1)+1+C2exp(2t)(α(exp(2t)1)+1)2\displaystyle\preceq\Bigl{[}\frac{1-\alpha}{\alpha\,(\exp(2t)-1)+1}+\frac{C^{2}\exp(2t)}{(\alpha\,(\exp(2t)-1)+1)^{2}}
+2Cexp(2t)(exp(2t)1)1/2(α(exp(2t)1)+1)3/2]I\displaystyle\qquad\qquad\qquad\qquad\qquad{}+\frac{2C\exp(2t)}{(\exp(2t)-1)^{1/2}\,(\alpha\,(\exp(2t)-1)+1)^{3/2}}\Bigr{]}\,I

where a=α1a=\alpha-1. Note that we can identify this with the bound of [BP24, Corollary 3.2] with C=LC=L. Thus, by following the calculations in the proof of [BP24, Theorem 1.4], we obtain the desired result. ∎

We next verify the tilt stability condition, leveraging the propagation of chaos result in Theorem 5.

Lemma 24 (Tilt Stability).

For any t>0t>0 and y1:Nd×Ny^{1:N}\in\mathbb{R}^{d\times N}, let μt,y1:N\mu_{t,y^{1:N}} be the Gaussian tilt of μ1:N\mu^{1:N} as defined in (B.1) with μ1:N\mu^{1:N} in place of μ\mu. Here, μ1:N\mu^{1:N} is the stationary distribution (1.1) of the mean-field Langevin diffusion. Then, under Assumptions 5, 6, and 8, and assuming that 2λ>σ22\lambda>\sigma^{2} and

Nβd2λσ2exp(8B2/σ42λ/σ21)\displaystyle N\gtrsim\frac{\beta d}{2\lambda-\sigma^{2}}\exp\Bigl{(}\frac{8B^{2}/\sigma^{4}}{2\lambda/\sigma^{2}-1}\Bigr{)} (B.2)

for a sufficiently large implied (but universal) constant, we have

covμt,y1:N𝗈𝗉\displaystyle\lVert\operatorname{cov}_{\mu_{t,y^{1:N}}}\rVert_{\mathsf{op}} [12λ/σ21+1/t+12λ/σ21+1/t𝒪(Bσ2+βdσexp8B2/σ42λ/σ21)]2.\displaystyle\leq\Bigl{[}\frac{1}{\sqrt{2\lambda/\sigma^{2}-1+1/t}}+\frac{1}{2\lambda/\sigma^{2}-1+1/t}\,\mathcal{O}\Bigl{(}\frac{B}{\sigma^{2}}+\frac{\sqrt{\beta d}}{\sigma}\exp\frac{8B^{2}/\sigma^{4}}{2\lambda/\sigma^{2}-1}\Bigr{)}\Bigr{]}^{2}\,.

Proof.  We introduce the auxiliary measures

πt,yi(xi)exp(yixi22t(λσ212)xi22σ2δ0(π¯t,y1:N,xi)),i[N],\displaystyle\pi_{t,y^{i}}(x^{i})\propto\exp\Bigl{(}-\frac{\lVert y^{i}-x^{i}\rVert^{2}}{2t}-\bigl{(}\frac{\lambda}{\sigma^{2}}-\frac{1}{2}\bigr{)}\,\lVert x^{i}\rVert^{2}-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\bar{\pi}_{t,y^{1:N}},x^{i})\Bigr{)}\,,\quad i\in[N]\,, (B.3)

where π¯t,y1:N1Ni=1Nπt,yi\bar{\pi}_{t,y^{1:N}}\coloneqq\frac{1}{N}\sum_{i=1}^{N}\pi_{t,y^{i}}. To see that these auxiliary measures are well-defined, note that any minimizer of the functional

(π1,,πN)\displaystyle(\pi^{1},\dotsc,\pi^{N})
i=1N[σ2yixi24t(λ2σ24)xi2]πi(dxi)+N0(1Ni=1Nπi)+σ22i=1N𝖾𝗇𝗍(πi)\displaystyle\qquad{}\mapsto\sum_{i=1}^{N}\int\Bigl{[}\frac{\sigma^{2}\,\lVert y^{i}-x^{i}\rVert^{2}}{4t}+\bigl{(}\frac{\lambda}{2}-\frac{\sigma^{2}}{4}\bigr{)}\,\lVert x^{i}\rVert^{2}\Bigr{]}\,\pi^{i}(\mathrm{d}x^{i})+N\mathcal{F}_{0}\bigl{(}\frac{1}{N}\sum_{i=1}^{N}\pi^{i}\bigr{)}+\frac{\sigma^{2}}{2}\sum_{i=1}^{N}\operatorname{\mathsf{ent}}(\pi^{i})

satisfies the system of equations (B.3), and that the minimizer is unique because the functional is strictly convex. We also let πt,y1:Ni=1Nπt,yi\pi_{t,y^{1:N}}\coloneqq\bigotimes_{i=1}^{N}\pi_{t,y^{i}}.

Then, for any unit vector θ1:Nd×N\theta^{1:N}\in\mathbb{R}^{d\times N},

θ1:N,covμt,y1:Nθ1:N\displaystyle\langle\theta^{1:N},\operatorname{cov}_{\mu_{t,y^{1:N}}}\theta^{1:N}\rangle 𝔼x1:Nμt,y1:N[θ1:N,x1:N𝔼x¯1:Nπt,y1:Nx¯1:N2]\displaystyle\leq\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu_{t,y^{1:N}}}[\langle\theta^{1:N},x^{1:N}-\operatorname{\mathbb{E}}_{\bar{x}^{1:N}\sim\pi_{t,y^{1:N}}}\bar{x}^{1:N}\rangle^{2}]
(𝒲2(μt,y1:N,πt,y1:N)+θ1:N,covπt,y1:Nθ1:N)2\displaystyle\leq\Bigl{(}\mathcal{W}_{2}(\mu_{t,y^{1:N}},\pi_{t,y^{1:N}})+\sqrt{\langle\theta^{1:N},\operatorname{cov}_{\pi_{t,y^{1:N}}}\theta^{1:N}\rangle}\Bigr{)}^{2}
(𝒲2(μt,y1:N,πt,y1:N)+maxi[N]covπt,yi𝗈𝗉)2,\displaystyle\leq\Bigl{(}\mathcal{W}_{2}(\mu_{t,y^{1:N}},\pi_{t,y^{1:N}})+\max_{i\in[N]}\sqrt{\lVert\operatorname{cov}_{\pi_{t,y^{i}}}\rVert_{\mathsf{op}}}\Bigr{)}^{2}\,,

where we used the fact that πt,y1:N\pi_{t,y^{1:N}} is a product measure. Also, introduce

π˘t,yi(xi)exp(yixi22t(λσ212)xi2),\displaystyle\breve{\pi}_{t,y^{i}}(x^{i})\propto\exp\Bigl{(}-\frac{\lVert y^{i}-x^{i}\rVert^{2}}{2t}-\bigl{(}\frac{\lambda}{\sigma^{2}}-\frac{1}{2}\bigr{)}\,\lVert x^{i}\rVert^{2}\Bigr{)}\,,

so that πt,yiexp(2σ2δ0(π¯t,y1:N,))π˘t,yi\pi_{t,y^{i}}\propto\exp(-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\bar{\pi}_{t,y^{1:N}},\cdot))\,\breve{\pi}_{t,y^{i}}. The same argument as above then yields

θi,covπt,yiθi(𝒲2(πt,yi,π˘t,yi)+θi,covπ˘t,yiθi)2.\displaystyle\langle\theta^{i},\operatorname{cov}_{\pi_{t,y^{i}}}\theta^{i}\rangle\leq\Bigl{(}\mathcal{W}_{2}(\pi_{t,y^{i}},\breve{\pi}_{t,y^{i}})+\sqrt{\langle\theta^{i},\operatorname{cov}_{\breve{\pi}_{t,y^{i}}}\theta^{i}\rangle}\Bigr{)}^{2}\,.

Since π˘t,yi\breve{\pi}_{t,y^{i}} is (2λ/σ21+1/t)(2\lambda/\sigma^{2}-1+1/t)-strongly log-concave, covπ˘t,yi𝗈𝗉1/(2λ/σ21+1/t)\lVert\operatorname{cov}_{\breve{\pi}_{t,y^{i}}}\rVert_{\mathsf{op}}\leq 1/(2\lambda/\sigma^{2}-1+1/t) by the Brascamp–Lieb inequality [BL76]. Also, since 2σ2δ0(π¯t,y1:N,)\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\bar{\pi}_{t,y^{1:N}},\cdot) is 2B/σ22B/\sigma^{2}-Lipschitz under Assumption 8, then [KMP24, Corollary 2.4] yields

𝒲2(πt,yi,π˘t,yi)\displaystyle\mathcal{W}_{2}(\pi_{t,y^{i}},\breve{\pi}_{t,y^{i}}) 𝒲(πt,yi,π˘t,yi)2B/σ22λ/σ21+1/t.\displaystyle\leq\mathcal{W}_{\infty}(\pi_{t,y^{i}},\breve{\pi}_{t,y^{i}})\leq\frac{2B/\sigma^{2}}{2\lambda/\sigma^{2}-1+1/t}\,.

Hence,

covπt,yi𝗈𝗉\displaystyle\lVert\operatorname{cov}_{\pi_{t,y^{i}}}\rVert_{\mathsf{op}} (12λ/σ21+1/t+2B/σ22λ/σ21+1/t)2.\displaystyle\leq\Bigl{(}\frac{1}{\sqrt{2\lambda/\sigma^{2}-1+1/t}}+\frac{2B/\sigma^{2}}{2\lambda/\sigma^{2}-1+1/t}\Bigr{)}^{2}\,.

Finally, it remains to control 𝒲2(μt,y1:N,πt,y1:N)\mathcal{W}_{2}(\mu_{t,y^{1:N}},\pi_{t,y^{1:N}}). When t=t=\infty, this essentially reduces to Theorem 5, so the task is to prove a generalization thereof (Theorem 26 below). By Lemma 21, in Theorem 26 we can take

C¯𝖫𝖲𝖨\displaystyle\overline{C}_{\mathsf{LSI}} 12λ/σ21+1/texp(4B2/σ42λ/σ21+1/t+8B/σ22λ/σ21+1/t)\displaystyle\leq\frac{1}{2\lambda/\sigma^{2}-1+1/t}\exp\Bigl{(}\frac{4B^{2}/\sigma^{4}}{2\lambda/\sigma^{2}-1+1/t}+\frac{8B/\sigma^{2}}{\sqrt{2\lambda/\sigma^{2}-1+1/t}}\Bigr{)}
32λ/σ21+1/texp(8B2/σ42λ/σ21+1/t)32λ/σ21exp(8B2/σ42λ/σ21).\displaystyle\leq\frac{3}{2\lambda/\sigma^{2}-1+1/t}\exp\Bigl{(}\frac{8B^{2}/\sigma^{4}}{2\lambda/\sigma^{2}-1+1/t}\Bigr{)}\leq\frac{3}{2\lambda/\sigma^{2}-1}\exp\Bigl{(}\frac{8B^{2}/\sigma^{4}}{2\lambda/\sigma^{2}-1}\Bigr{)}\,.

In particular, under the assumption (B.2) for NN, the preconditions of Theorem 26 are met and it yields the bound

𝒲2(μt,y1:N,πt,y1:N)\displaystyle\mathcal{W}_{2}(\mu_{t,y^{1:N}},\pi_{t,y^{1:N}}) C¯𝖫𝖲𝖨βdσβdσ(2λ/σ21+1/t)exp(8B2/σ42λ/σ21).\displaystyle\lesssim\frac{\overline{C}_{\mathsf{LSI}}\sqrt{\beta d}}{\sigma}\lesssim\frac{\sqrt{\beta d}}{\sigma\,(2\lambda/\sigma^{2}-1+1/t)}\exp\Bigl{(}\frac{8B^{2}/\sigma^{4}}{2\lambda/\sigma^{2}-1}\Bigr{)}\,.

Putting everything together completes the proof. ∎

Corollary 25 (Uniform LSI for the Stationary Measure).

Under Assumptions 5, 6, and 8, if

Nβdλexp(8B2λσ2)\displaystyle N\gtrsim\frac{\beta d}{\lambda}\exp\Bigl{(}\frac{8B^{2}}{\lambda\sigma^{2}}\Bigr{)} (B.4)

for a sufficiently large implied (but universal) constant, then

C𝖫𝖲𝖨(μ1:N)\displaystyle C_{\mathsf{LSI}}(\mu^{1:N}) σ2λexp(𝒪(B2λσ2+βdλexp16B2λσ2)).\displaystyle\lesssim\frac{\sigma^{2}}{\lambda}\exp\Bigl{(}\mathcal{O}\Bigl{(}\frac{B^{2}}{\lambda\sigma^{2}}+\frac{\beta d}{\lambda}\exp\frac{16B^{2}}{\lambda\sigma^{2}}\Bigr{)}\Bigr{)}\,.

Proof.  To meet the conditions of Lemma 24, we perform a rescaling. We abuse notation and denote by η:dd\eta:\mathbb{R}^{d}\to\mathbb{R}^{d} the scaling map xηxx\mapsto\eta x. Then,

μη1:N(x1:N)\displaystyle\mu_{\eta}^{1:N}(x^{1:N}) η#μ1:N(x1:N)μ1:N(η1x1:N)exp(2Nσ20η(ρx1:N)λη2σ2x1:N2),\displaystyle\coloneqq\eta_{\#}\mu^{1:N}(x^{1:N})\propto\mu^{1:N}(\eta^{-1}x^{1:N})\propto\exp\Bigl{(}-\frac{2N}{\sigma^{2}}\,\mathcal{F}_{0}^{\eta}(\rho_{x^{1:N}})-\frac{\lambda}{\eta^{2}\sigma^{2}}\,\lVert x^{1:N}\rVert^{2}\Bigr{)}\,,

where 0η(ν)0((η1)#ν)\mathcal{F}_{0}^{\eta}(\nu)\coloneqq\mathcal{F}_{0}((\eta^{-1})_{\#}\nu). We see that μη1:N\mu_{\eta}^{1:N} is also the stationary measure for mean-field Langevin dynamics, with the following new parameters: ββ/η2\beta\leftarrow\beta/\eta^{2}; λλ/η2\lambda\leftarrow\lambda/\eta^{2}; BB/ηB\leftarrow B/\eta. In particular, if we take η2=λ/σ2\eta^{2}=\lambda/\sigma^{2}, then Lemma 24 applies to μη1:N\mu_{\eta}^{1:N} provided that (B.4) holds. Together with Lemma 23 with a=1a=1 and CB/(λ1/2σ)+(βd/λ)1/2exp(8B2/(λσ2))C\lesssim B/(\lambda^{1/2}\sigma)+{(\beta d/\lambda)}^{1/2}\exp(8B^{2}/(\lambda\sigma^{2})), it implies

C𝖫𝖲𝖨(μη1:N)\displaystyle C_{\mathsf{LSI}}(\mu^{1:N}_{\eta}) exp(𝒪(C2))=exp(𝒪(B2λσ2+βdλexp16B2λσ2)).\displaystyle\lesssim\exp\bigl{(}\mathcal{O}(C^{2})\bigr{)}=\exp\Bigl{(}\mathcal{O}\Bigl{(}\frac{B^{2}}{\lambda\sigma^{2}}+\frac{\beta d}{\lambda}\exp\frac{16B^{2}}{\lambda\sigma^{2}}\Bigr{)}\Bigr{)}\,.

The result for μ1:N\mu^{1:N} follows since μ1:N=(η1)#μη1:N\mu^{1:N}=(\eta^{-1})_{\#}\mu_{\eta}^{1:N}, and pushing forward by the η1\eta^{-1}-Lipschitz scaling map changes the LSI constant by at most a factor of η2=σ2/λ\eta^{-2}=\sigma^{2}/\lambda [BGL14, Proposition 5.4.3]. ∎

It remains to prove the following generalized propagation of chaos result.

Theorem 26 (Generalized Propagation of Chaos).

For each i[N]i\in[N], let Vi:dV_{i}:\mathbb{R}^{d}\to\mathbb{R} and let 0:𝒫2(d)\mathcal{F}_{0}:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}. Define probability measures

μ1:N(x1:N)\displaystyle\mu^{1:N}(x^{1:N}) exp(2σ2i=1NVi(xi)2Nσ20(ρx1:N)),\displaystyle\propto\exp\Bigl{(}-\frac{2}{\sigma^{2}}\sum_{i=1}^{N}V_{i}(x^{i})-\frac{2N}{\sigma^{2}}\,\mathcal{F}_{0}(\rho_{x^{1:N}})\Bigr{)}\,,
πi(xi)\displaystyle\pi^{i}(x^{i}) exp(2σ2Vi(xi)2σ2δ0(π¯,xi)),\displaystyle\propto\exp\Bigl{(}-\frac{2}{\sigma^{2}}\,V_{i}(x^{i})-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\bar{\pi},x^{i})\Bigr{)}\,,

where π¯1Ni=1Nπi\bar{\pi}\coloneqq\frac{1}{N}\sum_{i=1}^{N}\pi_{i}, π1:Ni=1Nπi\pi^{1:N}\coloneqq\bigotimes_{i=1}^{N}\pi^{i}, and we assume that these measures are all well-defined. Furthermore, assume that 0\mathcal{F}_{0} satisfies Assumptions 5 and 6, and that for all i[N]i\in[N] and all ν𝒫2(d)\nu\in\mathcal{P}_{2}(\mathbb{R}^{d}), the proximal Gibbs measure

πνiexp(2σ2(Vi+δ0(ν,)))\displaystyle\pi_{\nu}^{i}\propto\exp\Bigl{(}-\frac{2}{\sigma^{2}}\,\bigl{(}V_{i}+\delta\mathcal{F}_{0}(\nu,\cdot)\bigr{)}\Bigr{)}

satisfies (LSI) with constant C¯𝖫𝖲𝖨\overline{C}_{\mathsf{LSI}}. Then, for all NβC¯𝖫𝖲𝖨d/σ2N\gtrsim\beta\overline{C}_{\mathsf{LSI}}d/\sigma^{2},

12C¯𝖫𝖲𝖨𝒲22(μ1:N,π1:N)𝖪𝖫(μ1:Nπ1:N)βC¯𝖫𝖲𝖨dσ2.\displaystyle\frac{1}{2\overline{C}_{\mathsf{LSI}}}\,\mathcal{W}_{2}^{2}(\mu^{1:N},\pi^{1:N})\leq\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{1:N})\lesssim\frac{\beta\overline{C}_{\mathsf{LSI}}d}{\sigma^{2}}\,.

Proof.  Since the proof closely follows the proof of Theorem 5, to avoid repetition we only highlight the main changes. We define the following energy functionals:

N(ν1:N)\displaystyle\mathcal{E}^{N}(\nu^{1:N}) i=1NVidνi+N0(ρx1:N)ν1:N(dx1:N)+σ22𝖾𝗇𝗍(ν1:N),\displaystyle\coloneqq\sum_{i=1}^{N}\int V_{i}\,\mathrm{d}\nu^{i}+N\int\mathcal{F}_{0}(\rho_{x^{1:N}})\,\nu^{1:N}(\mathrm{d}x^{1:N})+\frac{\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\nu^{1:N})\,,
(ν1:N)\displaystyle\mathcal{E}(\nu^{1:N}) i=1NVidνi+N0(ν¯)+σ22𝖾𝗇𝗍(ν1:N),\displaystyle\coloneqq\sum_{i=1}^{N}\int V_{i}\,\mathrm{d}\nu^{i}+N\mathcal{F}_{0}(\bar{\nu})+\frac{\sigma^{2}}{2}\operatorname{\mathsf{ent}}(\nu^{1:N})\,,

where for a measure ν1:N\nu^{1:N} on d×N\mathbb{R}^{d\times N}, we use the notation ν¯1Ni=1Nνi\bar{\nu}\coloneqq\frac{1}{N}\sum_{i=1}^{N}\nu^{i} for the average of the marginals. The first step is to establish the analogue of the entropy toast inequality (Lemma 19). Letting Zi=1Nexp(2σ2Vi2σ2δ0(π¯,))Z\coloneqq\prod_{i=1}^{N}\int\exp(-\frac{2}{\sigma^{2}}\,V_{i}-\frac{2}{\sigma^{2}}\,\delta\mathcal{F}_{0}(\bar{\pi},\cdot)) denote the normalizing constant for π1:N\pi^{1:N},

𝖪𝖫(ν1:Nπ1:N)\displaystyle\operatorname{\mathsf{KL}}(\nu^{1:N}\mathbin{\|}\pi^{1:N}) =2σ2i=1NVidνi+2Nσ2δ0(π¯,z)ρx1:N(dz)ν1:N(dx1:N)+𝖾𝗇𝗍(ν1:N)\displaystyle=\frac{2}{\sigma^{2}}\sum_{i=1}^{N}\int V_{i}\,\mathrm{d}\nu^{i}+\frac{2N}{\sigma^{2}}\iint\delta\mathcal{F}_{0}(\bar{\pi},z)\,\rho_{x^{1:N}}(\mathrm{d}z)\,\nu^{1:N}(\mathrm{d}x^{1:N})+\operatorname{\mathsf{ent}}(\nu^{1:N})
+logZ\displaystyle\qquad{}+\log Z
=2σ2i=1NVid(νiπi)+2Nσ2𝔼x1:Nν1:Nδ0(π¯,)d(ρx1:Nπ¯)\displaystyle=\frac{2}{\sigma^{2}}\sum_{i=1}^{N}\int V_{i}\,\mathrm{d}(\nu^{i}-\pi^{i})+\frac{2N}{\sigma^{2}}\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}\int\delta\mathcal{F}_{0}(\bar{\pi},\cdot)\,\mathrm{d}(\rho_{x^{1:N}}-\bar{\pi})
+𝖾𝗇𝗍(ν1:N)𝖾𝗇𝗍(π1:N)\displaystyle\qquad{}+\operatorname{\mathsf{ent}}(\nu^{1:N})-\operatorname{\mathsf{ent}}(\pi^{1:N})
2σ2i=1NVid(νiπi)+2Nσ2𝔼x1:Nν1:N[0(ρx1:N)0(π¯)]\displaystyle\leq\frac{2}{\sigma^{2}}\sum_{i=1}^{N}\int V_{i}\,\mathrm{d}(\nu^{i}-\pi^{i})+\frac{2N}{\sigma^{2}}\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}[\mathcal{F}_{0}(\rho_{x^{1:N}})-\mathcal{F}_{0}(\bar{\pi})]
+𝖾𝗇𝗍(ν1:N)𝖾𝗇𝗍(π1:N)\displaystyle\qquad{}+\operatorname{\mathsf{ent}}(\nu^{1:N})-\operatorname{\mathsf{ent}}(\pi^{1:N})
2σ2(N(ν1:N)(π1:N)).\displaystyle\leq\frac{2}{\sigma^{2}}\,\bigl{(}\mathcal{E}^{N}(\nu^{1:N})-\mathcal{E}(\pi^{1:N})\bigr{)}\,.

From here, we find that

𝔼x1:Nν1:N[0(ρx1:N)0(π¯)]𝔼x1:Nν1:Nδ0(ρx1:N,)d(ρx1:Nπ¯).\displaystyle\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}[\mathcal{F}_{0}(\rho_{x^{1:N}})-\mathcal{F}_{0}(\bar{\pi})]\leq\operatorname{\mathbb{E}}_{x^{1:N}\sim\nu^{1:N}}\int\delta\mathcal{F}_{0}(\rho_{x^{1:N}},\cdot)\,\mathrm{d}(\rho_{x^{1:N}}-\bar{\pi})\,.

Now, moving on to the propagation of chaos part of this argument, we have

N(μ1:N)(π1:N)\displaystyle\mathcal{E}^{N}(\mu^{1:N})-\mathcal{E}(\pi^{1:N})
i=1N𝔼x1:Nμ1:N[Vid(μii(xi)πi)+δ0(ρx1:N,)d(ρx1:Nπi)]\displaystyle\qquad\leq\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\int V_{i}\,\mathrm{d}(\mu^{i\mid-i}(\cdot\mid x^{-i})-\pi^{i})+\int\delta\mathcal{F}_{0}(\rho_{x^{1:N}},\cdot)\,\mathrm{d}(\rho_{x^{1:N}}-\pi^{i})\Bigr{]}
+σ22i=1N𝔼xiμi[𝖾𝗇𝗍(μii(xi))𝖾𝗇𝗍(πi)]\displaystyle\qquad\qquad{}+\frac{\sigma^{2}}{2}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{-i}\sim\mu^{-i}}\bigl{[}\operatorname{\mathsf{ent}}(\mu^{i\mid-i}(\cdot\mid x^{-i}))-\operatorname{\mathsf{ent}}(\pi^{i})\bigr{]}
=i=1N𝔼x1:Nμ1:N[Vid(μii(xi)πi)+δ0(ρx1:N,xi)δ0(ρx1:N,)dπi]\displaystyle\qquad=\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\int V_{i}\,\mathrm{d}(\mu^{i\mid-i}(\cdot\mid x^{-i})-\pi^{i})+\delta\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})-\int\delta\mathcal{F}_{0}(\rho_{x^{1:N}},\cdot)\,\mathrm{d}\pi^{i}\Bigr{]}
+σ22i=1N𝔼xiμi[𝖾𝗇𝗍(μii(xi))𝖾𝗇𝗍(πi)].\displaystyle\qquad\qquad{}+\frac{\sigma^{2}}{2}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{-i}\sim\mu^{-i}}\bigl{[}\operatorname{\mathsf{ent}}(\mu^{i\mid-i}(\cdot\mid x^{-i}))-\operatorname{\mathsf{ent}}(\pi^{i})\bigr{]}\,.

To decouple, introduce a new variable ziμiz^{i}\sim\mu^{i} independent of all the others, and write x~i1:N\tilde{x}_{i}^{1:N} as a shorthand for the vector (x1,,xi1,zi,xi+1,,xN)(x^{1},\ldots,x^{i-1},z^{i},x^{i+1},\ldots,x^{N}). Then

N(μ1:N)(π1:N)\displaystyle\mathcal{E}^{N}(\mu^{1:N})-\mathcal{E}(\pi^{1:N})
i=1N𝔼x1:Nμ1:N[Vid(μii(xi)πi)+σ22(𝖾𝗇𝗍(μii(xi))𝖾𝗇𝗍(πi))]\displaystyle\qquad\leq\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\int V_{i}\,\mathrm{d}(\mu^{i\mid-i}(\cdot\mid x^{-i})-\pi^{i})+\frac{\sigma^{2}}{2}\,\bigl{(}\operatorname{\mathsf{ent}}(\mu^{i\mid-i}(\cdot\mid x^{-i}))-\operatorname{\mathsf{ent}}(\pi^{i})\bigr{)}\Bigr{]}
+i=1N𝔼x1:Nμ1:N[𝔼ziμiδ0(ρx~i1:N,xi)𝔼ziμiδ0(ρx~i1:N,)dπi]\displaystyle\qquad\qquad{}+\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\operatorname{\mathbb{E}}_{z^{i}\sim\mu^{i}}\delta\mathcal{F}_{0}(\rho_{\tilde{x}^{1:N}_{i}},x^{i})-\int\operatorname{\mathbb{E}}_{z^{i}\sim\mu^{i}}\delta\mathcal{F}_{0}(\rho_{\tilde{x}^{1:N}_{i}},\cdot)\,\mathrm{d}\pi^{i}\Bigr{]}
+i=1N𝔼x1:Nμ1:N[δ0(ρx1:N,xi)𝔼ziμiδ0(ρx~i1:N,xi)\displaystyle\qquad{}+\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\Bigl{[}\delta\mathcal{F}_{0}(\rho_{x^{1:N}},x^{i})-\operatorname{\mathbb{E}}_{z^{i}\sim\mu^{i}}\delta\mathcal{F}_{0}(\rho_{\tilde{x}^{1:N}_{i}},x^{i})
(δ0(ρx1:N,)𝔼ziμiδ0(ρx~i1:N,))dπi].\displaystyle\qquad\qquad\qquad\qquad\qquad{}-\int\bigl{(}\delta\mathcal{F}_{0}(\rho_{x^{1:N}},\cdot)-\operatorname{\mathbb{E}}_{z^{i}\sim\mu^{i}}\delta\mathcal{F}_{0}(\rho_{\tilde{x}^{1:N}_{i}},\cdot)\bigr{)}\,\mathrm{d}\pi^{i}\Bigr{]}\,.

We group the terms in the first two lines as 𝖠\mathsf{A}, and the terms in the last two lines as 𝖡\mathsf{B}.

Let us first look at 𝖠\mathsf{A}. If we introduce the proximal Gibbs measure for ρx~i1:N\rho_{\tilde{x}^{1:N}_{i}} (in the ii-th coordinate), defined via

τx~i1:Niexp(2σ2(Vi+δ0(ρx~i1:N,))),\displaystyle\tau_{\tilde{x}^{1:N}_{i}}^{i}\propto\exp\Bigl{(}-\frac{2}{\sigma^{2}}\,\bigl{(}V_{i}+\delta\mathcal{F}_{0}(\rho_{\tilde{x}_{i}^{1:N}},\cdot)\bigr{)}\Bigr{)}\,,

one obtains as before

𝖠σ22i=1N𝔼x1:Nμ1:N𝔼ziμi𝖪𝖫(μii(xi)τx~i1:Ni).\displaystyle\mathsf{A}\leq\frac{\sigma^{2}}{2}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\operatorname{\mathbb{E}}_{z^{i}\sim\mu^{i}}\operatorname{\mathsf{KL}}\bigl{(}\mu^{i\mid-i}(\cdot\mid x^{-i})\bigm{\|}\tau_{\tilde{x}^{1:N}_{i}}^{i}\bigr{)}\,.

Applying the log-Sobolev inequality, we obtain

𝖠\displaystyle\mathsf{A} β2C¯𝖫𝖲𝖨σ2i=1N𝔼x1:Nμ1:N𝔼ziμi𝒲12(ρx1:N,ρx~i1:N).\displaystyle\leq\frac{\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}}\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:N}\sim\mu^{1:N}}\operatorname{\mathbb{E}}_{z^{i}\sim\mu^{i}}\mathcal{W}_{1}^{2}(\rho_{x^{1:N}},\rho_{\tilde{x}_{i}^{1:N}})\,.

The Wasserstein distance is bounded by

𝒲1(ρx1:N,ρx~i1:N)\displaystyle\mathcal{W}_{1}(\rho_{x^{1:N}},\rho_{\tilde{x}_{i}^{1:N}}) 1Nxizi.\displaystyle\leq\frac{1}{N}\,{\lVert x^{i}-z^{i}\rVert}\,.

This eventually yields, as before,

𝖠\displaystyle\mathsf{A} β2C¯𝖫𝖲𝖨σ2N(C¯𝖫𝖲𝖨N𝖪𝖫(μ1:Nπ1:N)+dC¯𝖫𝖲𝖨).\displaystyle\lesssim\frac{\beta^{2}\overline{C}_{\mathsf{LSI}}}{\sigma^{2}N}\,\Bigl{(}\frac{\overline{C}_{\mathsf{LSI}}}{N}\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{1:N})+d\overline{C}_{\mathsf{LSI}}\Bigr{)}\,.

As for the term 𝖡\mathsf{B}, a straightforward modification of the proof of Theorem 5 readily yields

𝖡βC¯𝖫𝖲𝖨N𝖪𝖫(μ1:Nπ1:N)+βC¯𝖫𝖲𝖨d.\displaystyle\mathsf{B}\lesssim\frac{\beta\overline{C}_{\mathsf{LSI}}}{N}\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{1:N})+\beta\overline{C}_{\mathsf{LSI}}d\,.

Putting everything together yields the result. ∎

Appendix C Explicit Calculations for the Gaussian Case

Here we provide complete details for Example 10: for any kNk\leq N,

dk2N2𝖪𝖫(μ1:kπk)dk2N2logN.\frac{dk^{2}}{N^{2}}\lesssim\operatorname{\mathsf{KL}}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})\lesssim\frac{dk^{2}}{N^{2}}\log N\,.

Note that for 𝖢N×N\mathsf{C}\in\mathbb{R}^{N\times N} with 𝖢i,i=N1\mathsf{C}_{i,i}=N-1 and 𝖢i,j=1\mathsf{C}_{i,j}=-1 if iji\neq j,

μ1:N=𝒩(0,σ22(INA+λN1𝖢Id)1Σ1)andπ=𝒩(0,σ22(A+λId)1Σ2).\mu^{1:N}=\mathcal{N}\Bigl{(}0,\underbrace{\frac{\sigma^{2}}{2}\,\bigl{(}I_{N}\otimes A+\frac{\lambda}{N-1}\,\mathsf{C}\otimes I_{d}\bigr{)}^{-1}}_{\eqqcolon\Sigma_{1}}\Bigr{)}\qquad\text{and}\qquad\pi=\mathcal{N}\Bigl{(}0,\underbrace{\frac{\sigma^{2}}{2}\,{(A+\lambda I_{d})}^{-1}}_{\eqqcolon\Sigma_{2}}\Bigr{)}\,.

The kk-particle marginal μ1:k\mu^{1:k} is a Gaussian with zero mean and covariance being the upper-left (dk×dk)(dk\times dk)-block matrix of Σ1\Sigma_{1}, which we denote by Σ1,k\Sigma_{1,k}. Clearly, πk\pi^{\otimes k} is also a Gaussian with zero mean and covariance Σ2,kIkΣ2\Sigma_{2,k}\coloneqq I_{k}\otimes\Sigma_{2}. From a well-known formula for the 𝖪𝖫\operatorname{\mathsf{KL}} divergence between two Gaussian distributions,

𝖪𝖫(μ1:kπk)=12(logdet(Σ2,k1Σ1,k)dk+tr(Σ2,k1Σ1,k)).\operatorname{\mathsf{KL}}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k})=\frac{1}{2}\,\bigl{(}-\log\det(\Sigma_{2,k}^{-1}\Sigma_{1,k})-dk+\operatorname{tr}(\Sigma_{2,k}^{-1}\Sigma_{1,k})\bigr{)}\,. (C.1)

Let 1pp1_{p}\in\mathbb{R}^{p} be the pp-dimensional vector with all entries 11. From 𝖢=NIN1N1N𝖳\mathsf{C}=NI_{N}-1_{N}1_{N}^{\mathsf{T}},

2σ2Σ1\displaystyle\frac{2}{\sigma^{2}}\,\Sigma_{1} =(IN(A+λNN1IdAλ)λN1(1N1N𝖳)Id)1\displaystyle=\Bigl{(}I_{N}\otimes\bigl{(}\underbrace{A+\frac{\lambda N}{N-1}\,I_{d}}_{\eqqcolon A_{\lambda}}\bigr{)}-\frac{\lambda}{N-1}\,(1_{N}1_{N}^{\mathsf{T}})\otimes I_{d}\Bigr{)}^{-1}
=(i)(INAλλN1(1NId)(1N𝖳Id))1\displaystyle\underset{\text{(i)}}{=}\Bigl{(}I_{N}\otimes A_{\lambda}-\frac{\lambda}{N-1}\,(1_{N}\otimes I_{d})(1_{N}^{\mathsf{T}}\otimes I_{d})\Bigr{)}^{-1}
=(ii)(INAλ)1\displaystyle\underset{\text{(ii)}}{=}\mathopen{}\mathclose{{}\left(I_{N}\otimes A_{\lambda}}\right)^{-1}
(INAλ)1(1NId)\displaystyle\qquad{}-{(I_{N}\otimes A_{\lambda})}^{-1}(1_{N}\otimes I_{d})
×(Id+(1N𝖳Id)(INAλ)1(1NId))1(1N𝖳Id)(INAλ)1\displaystyle\qquad\qquad{}\times\bigl{(}I_{d}+(1_{N}^{\mathsf{T}}\otimes I_{d})\mathopen{}\mathclose{{}\left(I_{N}\otimes A_{\lambda}}\right)^{-1}(1_{N}\otimes I_{d})\bigr{)}^{-1}(1_{N}^{\mathsf{T}}\otimes I_{d})\mathopen{}\mathclose{{}\left(I_{N}\otimes A_{\lambda}}\right)^{-1}
=(iii)INAλ1(1NAλ1)(Id+(1N𝖳1N)Aλ1)1(1N𝖳Aλ1)\displaystyle\underset{\text{(iii)}}{=}I_{N}\otimes A_{\lambda}^{-1}-(1_{N}\otimes A_{\lambda}^{-1})\bigl{(}I_{d}+(1_{N}^{\mathsf{T}}1_{N})\otimes A_{\lambda}^{-1}\bigr{)}^{-1}(1_{N}^{\mathsf{T}}\otimes A_{\lambda}^{-1})
=INAλ1(1NAλ1)(Id+NAλ1)1(1N𝖳Aλ1)\displaystyle=I_{N}\otimes A_{\lambda}^{-1}-(1_{N}\otimes A_{\lambda}^{-1}){(I_{d}+NA_{\lambda}^{-1})}^{-1}(1_{N}^{\mathsf{T}}\otimes A_{\lambda}^{-1})
=INAλ1(1N1N𝖳)(Aλ2+NAλ)1,\displaystyle=I_{N}\otimes A_{\lambda}^{-1}-(1_{N}1_{N}^{\mathsf{T}})\otimes{(A_{\lambda}^{2}+NA_{\lambda})}^{-1}\,,

where in (i) we used (AB)(CD)=(AC)(BD)(A\otimes B)(C\otimes D)=(AC)\otimes(BD), (ii) follows from the Woodbury matrix identity, and (iii) used (AB)1=A1B1(A\otimes B)^{-1}=A^{-1}\otimes B^{-1}. Hence it follows that

2σ2Σ1,k=IkAλ1(1k1k𝖳)(Aλ2+NAλ)1.\frac{2}{\sigma^{2}}\,\Sigma_{1,k}=I_{k}\otimes A_{\lambda}^{-1}-(1_{k}1_{k}^{\mathsf{T}})\otimes{(A_{\lambda}^{2}+NA_{\lambda})}^{-1}\,.

By the spectral decomposition of AA, we can write A=UDU𝖳A=UDU^{\mathsf{T}} for a diagonal Dd×dD\in\mathbb{R}^{d\times d} and an orthogonal matrix Ud×dU\in\mathbb{R}^{d\times d} such that {σiDi,i}i[d]\{\sigma_{i}\coloneqq D_{i,i}\}_{i\in[d]} correspond to the eigenvalues of AA. Since logdet()\log\det(\cdot) and tr()\operatorname{tr}(\cdot) in (C.1) are orthogonally invariant, let us look at the orthogonal conjugate of Σ2,k1Σ1,k\Sigma_{2,k}^{-1}\Sigma_{1,k} by IkU𝖳dk×dkI_{k}\otimes U^{\mathsf{T}}\in\mathbb{R}^{dk\times dk}. Using (AB)𝖳=A𝖳B𝖳(A\otimes B)^{\mathsf{T}}=A^{\mathsf{T}}\otimes B^{\mathsf{T}} and denoting DλD+λNN1IdD_{\lambda}\coloneqq D+\frac{\lambda N}{N-1}\,I_{d},

(IkU𝖳)Σ2,k1Σ1,k(IkU)\displaystyle(I_{k}\otimes U^{\mathsf{T}})\Sigma_{2,k}^{-1}\Sigma_{1,k}(I_{k}\otimes U)
=(IkU𝖳)(Ik(A+λId))(IkAλ1(1k1k𝖳)(Aλ2+NAλ)1)(IkU)\displaystyle\qquad=(I_{k}\otimes U^{\mathsf{T}})\bigl{(}I_{k}\otimes(A+\lambda I_{d})\bigr{)}\bigl{(}I_{k}\otimes A_{\lambda}^{-1}-(1_{k}1_{k}^{\mathsf{T}})\otimes{(A_{\lambda}^{2}+NA_{\lambda})}^{-1}\bigr{)}(I_{k}\otimes U)
=(Ik(D+λId))(IkDλ1(1k1k𝖳)(Dλ2+NDλ)1)\displaystyle\qquad=\bigl{(}I_{k}\otimes(D+\lambda I_{d})\bigr{)}\bigl{(}I_{k}\otimes D_{\lambda}^{-1}-(1_{k}1_{k}^{\mathsf{T}})\otimes{(D_{\lambda}^{2}+ND_{\lambda})}^{-1}\bigr{)}
=Ik((D+λId)Dλ1)(1k1k𝖳)Jk((D+λId)(Dλ2+NDλ)1)Sλ\displaystyle\qquad=I_{k}\otimes\mathopen{}\mathclose{{}\left((D+\lambda I_{d})D_{\lambda}^{-1}}\right)-\underbrace{(1_{k}1_{k}^{\mathsf{T}})}_{\eqqcolon J_{k}}\otimes\underbrace{\bigl{(}(D+\lambda I_{d}){(D_{\lambda}^{2}+ND_{\lambda})}^{-1}\bigr{)}}_{\eqqcolon S_{\lambda}}
=Idk(λN1IkDλ1+JkSλ𝖬).\displaystyle\qquad=I_{dk}-\Bigl{(}\underbrace{\frac{\lambda}{N-1}\,I_{k}\otimes D_{\lambda}^{-1}+J_{k}\otimes S_{\lambda}}_{\eqqcolon\mathsf{M}}\Bigr{)}\,.

For σdmini[d]σi\sigma_{d}\coloneqq\min_{i\in[d]}\sigma_{i}, ασd+λ\alpha\coloneqq\sigma_{d}+\lambda, and ελ/(N1)\varepsilon\coloneqq\lambda/(N-1), we have Dλ11αIdD_{\lambda}^{-1}\precsim\frac{1}{\alpha}\,I_{d} and Sλ1α+NIdS_{\lambda}\precsim\frac{1}{\alpha+N}\,I_{d} due to

((Dλ)1)i,i1α+εand(Sλ)i,iα(α+ε)(α+ε+N).((D_{\lambda})^{-1})_{i,i}\leq\frac{1}{\alpha+\varepsilon}\qquad\text{and}\qquad(S_{\lambda})_{i,i}\leq\frac{\alpha}{(\alpha+\varepsilon)\,(\alpha+\varepsilon+N)}\,.

Since the eigenvalues of ABA\otimes B are the pairwise products of the eigenvalues of AA and of BB, bounding each Kronecker term of 𝖬\mathsf{M} by its operator norm shows that the largest eigenvalue η1\eta_{1} of 𝖬\mathsf{M} is less than 11:

η1λN1Dλ1+kSλεα+ε+αN(α+ε)(α+ε+N)=ε+Nα+ε+N.\eta_{1}\leq\frac{\lambda}{N-1}\,\lVert D_{\lambda}^{-1}\rVert+k\,\lVert S_{\lambda}\rVert\leq\frac{\varepsilon}{\alpha+\varepsilon}+\frac{\alpha N}{(\alpha+\varepsilon)\,(\alpha+\varepsilon+N)}=\frac{\varepsilon+N}{\alpha+\varepsilon+N}\,.

Denoting the eigenvalues of 𝖬\mathsf{M} by ηi\eta_{i}, it follows from (C.1) that

2𝖪𝖫(μ1:kπk)\displaystyle 2\operatorname{\mathsf{KL}}(\mu^{1:k}\mathbin{\|}\pi^{\otimes k}) =(logdet(Idk𝖬)+dktr(Idk𝖬))=i=1dk(log(1ηi)+ηi)\displaystyle=-\bigl{(}\log\det(I_{dk}-\mathsf{M})+dk-\operatorname{tr}(I_{dk}-\mathsf{M})\bigr{)}=-\sum_{i=1}^{dk}\bigl{(}\log(1-\eta_{i})+\eta_{i}\bigr{)}
=i=1dkn2ηinn.\displaystyle=\sum_{i=1}^{dk}\sum_{n\geq 2}\frac{\eta_{i}^{n}}{n}\,.

Then, we have a trivial lower bound of 12i=1dkηi2\frac{1}{2}\sum_{i=1}^{dk}\eta_{i}^{2}, and as for the upper bound,

i=1dkn2ηinni=1dk(ηi2+n1ηin+2n)=i=1dkηi2log(e1ηi)(1logNα)tr(𝖬2),\sum_{i=1}^{dk}\sum_{n\geq 2}\frac{\eta_{i}^{n}}{n}\leq\sum_{i=1}^{dk}\Bigl{(}\eta_{i}^{2}+\sum_{n\geq 1}\frac{\eta_{i}^{n+2}}{n}\Bigr{)}=\sum_{i=1}^{dk}\eta_{i}^{2}\log\bigl{(}\frac{e}{1-\eta_{i}}\bigr{)}\lesssim\bigl{(}1\vee\log\frac{N}{\alpha}\bigr{)}\operatorname{tr}(\mathsf{M}^{2})\,,

where the last inequality follows from (1ηi)1(1η1)1(1-\eta_{i})^{-1}\leq(1-\eta_{1})^{-1}.

Using tr(AB)=tr(A)tr(B)\operatorname{tr}(A\otimes B)=\operatorname{tr}(A)\cdot\operatorname{tr}(B), and Dλ11αIdD_{\lambda}^{-1}\precsim\frac{1}{\alpha}\,I_{d} and Sλ1α+NIdS_{\lambda}\precsim\frac{1}{\alpha+N}\,I_{d}, we have

tr(𝖬2)\displaystyle\operatorname{tr}(\mathsf{M}^{2}) λ2(N1)2tr(Ik)tr(Dλ2)+tr(Jk2)tr(Sλ2)\displaystyle\lesssim\frac{\lambda^{2}}{{(N-1)}^{2}}\operatorname{tr}(I_{k})\operatorname{tr}(D_{\lambda}^{-2})+\operatorname{tr}(J_{k}^{2})\operatorname{tr}(S_{\lambda}^{2})
λ2α2dkN2+dk2(α+N)2dk2N2.\displaystyle\lesssim\frac{\lambda^{2}}{\alpha^{2}}\,\frac{dk}{N^{2}}+\frac{dk^{2}}{{(\alpha+N)}^{2}}\lesssim\frac{dk^{2}}{N^{2}}\,.

As for the lower bound, since (Sλ)i,i1N(S_{\lambda})_{i,i}\sim\frac{1}{N} for large NN, we have

tr(𝖬2)dk2N2,\operatorname{tr}(\mathsf{M}^{2})\gtrsim\frac{dk^{2}}{N^{2}}\,,

which completes the proof.
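As a numerical sanity check of the dk^{2}/N^{2} scaling (not of the constants), one can evaluate (C.1) directly from the covariances displayed at the start of this appendix. In the sketch below, the spectrum of AA and the values of \lambda and \sigma^{2} are arbitrary illustrative choices; the printed ratio \operatorname{\mathsf{KL}}\cdot N^{2}/(dk^{2}) should stay bounded, up to the \log N factor, as NN grows.

```python
import numpy as np

def kl_gaussian(S1, S2):
    # KL(N(0,S1) || N(0,S2)) = (1/2)(-logdet(S2^{-1} S1) - dim + tr(S2^{-1} S1)), cf. (C.1)
    M = np.linalg.solve(S2, S1)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (-logdet - M.shape[0] + np.trace(M))

d, k, lam, sigma2 = 3, 2, 1.0, 1.0                 # illustrative choices
A = np.diag(np.linspace(1.0, 2.0, d))              # arbitrary positive-definite A
Sigma2 = sigma2 / 2 * np.linalg.inv(A + lam * np.eye(d))
for N in [20, 40, 80, 160]:
    C = N * np.eye(N) - np.ones((N, N))            # C_{ii} = N-1, C_{ij} = -1
    prec = np.kron(np.eye(N), A) + lam / (N - 1) * np.kron(C, np.eye(d))
    Sigma1 = sigma2 / 2 * np.linalg.inv(prec)      # covariance of mu^{1:N}
    Sigma1k = Sigma1[:d * k, :d * k]               # k-particle marginal: upper-left dk x dk block
    Sigma2k = np.kron(np.eye(k), Sigma2)           # covariance of the product measure pi^{tensor k}
    print(N, kl_gaussian(Sigma1k, Sigma2k) * N**2 / (d * k**2))
```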

Appendix D Additional Technical Lemmas

In our proofs, we used the following general lemmas on exchangeability.

Lemma 27.

Let μ1:N\mu^{1:N}, ν1:N\nu^{1:N} be two exchangeable measures over d×N\mathbb{R}^{d\times N}. For any kNk\leq N,

𝒲22(μ1:k,ν1:k)\displaystyle\mathcal{W}_{2}^{2}(\mu^{1:k},\nu^{1:k}) kN𝒲22(μ1:N,ν1:N).\displaystyle\leq\frac{k}{N}\,\mathcal{W}_{2}^{2}(\mu^{1:N},\nu^{1:N})\,.

Proof.  Let (X1:N,Y1:N)(X^{1:N},Y^{1:N}) be optimally coupled for μ1:N\mu^{1:N} and ν1:N\nu^{1:N}. For each subset S[N]S\subseteq[N] of size kk, it induces a coupling (XS,YS)(X^{S},Y^{S}) of μ1:k\mu^{1:k} and ν1:k\nu^{1:k} (by exchangeability). In particular, the law of (X𝖲,Y𝖲)(X^{\mathsf{S}},Y^{\mathsf{S}}), where 𝖲\mathsf{S} is an independent and uniformly random subset of size kk, is also a coupling of μ1:k\mu^{1:k} and ν1:k\nu^{1:k}. Hence,

𝒲22(μ1:k,ν1:k)\displaystyle\mathcal{W}_{2}^{2}(\mu^{1:k},\nu^{1:k}) 𝔼[X𝖲Y𝖲2]=1(Nk)|S|=k𝔼[XSYS2]\displaystyle\leq\operatorname{\mathbb{E}}[\lVert X^{\mathsf{S}}-Y^{\mathsf{S}}\rVert^{2}]=\frac{1}{\binom{N}{k}}\sum_{\lvert S\rvert=k}\operatorname{\mathbb{E}}[\lVert X^{S}-Y^{S}\rVert^{2}]
=1(Nk)i=1N|S|=k:iS𝔼[XiYi2]=(N1k1)(Nk)i=1N𝔼[XiYi2]\displaystyle=\frac{1}{\binom{N}{k}}\sum_{i=1}^{N}\sum_{\lvert S\rvert=k\,:\,i\in S}\operatorname{\mathbb{E}}[\lVert X^{i}-Y^{i}\rVert^{2}]=\frac{\binom{N-1}{k-1}}{\binom{N}{k}}\sum_{i=1}^{N}\operatorname{\mathbb{E}}[\lVert X^{i}-Y^{i}\rVert^{2}]
=kN𝒲22(μ1:N,ν1:N),\displaystyle=\frac{k}{N}\,\mathcal{W}_{2}^{2}(\mu^{1:N},\nu^{1:N})\,,

which completes the proof. ∎
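The counting step in the display above—each index ii lies in a \binom{N-1}{k-1}/\binom{N}{k}=k/N fraction of the size-kk subsets—can be checked directly; in this throwaway sketch, random numbers stand in for the coupling costs \operatorname{\mathbb{E}}[\lVert X^{i}-Y^{i}\rVert^{2}].

```python
import itertools
import numpy as np

N, k = 6, 3
cost = np.random.default_rng(0).random(N)             # stand-ins for E[||X^i - Y^i||^2]
subsets = list(itertools.combinations(range(N), k))
avg = np.mean([cost[list(S)].sum() for S in subsets])  # average cost over k-subsets
assert np.isclose(avg, (k / N) * cost.sum())           # the k/N factor of Lemma 27
```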

Lemma 28 (Information Inequality [Csi84]).

If 𝒳1,,𝒳N\mathcal{X}^{1},\ldots,\mathcal{X}^{N} are Polish spaces and μ1:N\mu^{1:N}, ν1:N\nu^{1:N} are probability measures on 𝒳1××𝒳N\mathcal{X}^{1}\times\cdots\times\mathcal{X}^{N}, where ν1:N=ν1νN\nu^{1:N}=\nu^{1}\otimes\cdots\otimes\nu^{N} is a product measure, then for the marginals μi\mu^{i} of μ\mu, it holds that

i=1N𝖪𝖫(μiνi)𝖪𝖫(μ1:Nν1:N).\sum_{i=1}^{N}\operatorname{\mathsf{KL}}(\mu^{i}\mathbin{\|}\nu^{i})\leq\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\nu^{1:N})\,.

In particular when μ1:N\mu^{1:N}, ν1:N\nu^{1:N} are both exchangeable, this states that 𝖪𝖫(μ1ν1)1N𝖪𝖫(μ1:Nν1:N)\operatorname{\mathsf{KL}}(\mu^{1}\mathbin{\|}\nu^{1})\leq\frac{1}{N}\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\nu^{1:N}).

Note that Lemma 28 follows from the chain rule and convexity of the 𝖪𝖫\operatorname{\mathsf{KL}} divergence.
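Spelling this out: by the chain rule for the 𝖪𝖫\operatorname{\mathsf{KL}} divergence (using that ν1:N\nu^{1:N} is a product measure), followed by Jensen's inequality and convexity of 𝖪𝖫\operatorname{\mathsf{KL}} in its first argument,

\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\nu^{1:N})=\sum_{i=1}^{N}\operatorname{\mathbb{E}}_{x^{1:i-1}\sim\mu^{1:i-1}}\operatorname{\mathsf{KL}}\bigl{(}\mu^{i\mid 1:i-1}(\cdot\mid x^{1:i-1})\bigm{\|}\nu^{i}\bigr{)}\geq\sum_{i=1}^{N}\operatorname{\mathsf{KL}}(\mu^{i}\mathbin{\|}\nu^{i})\,,

since averaging the conditional μi1:i1(x1:i1)\mu^{i\mid 1:i-1}(\cdot\mid x^{1:i-1}) over x1:i1μ1:i1x^{1:i-1}\sim\mu^{1:i-1} recovers the marginal μi\mu^{i}.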

Appendix E Sampling Guarantees

Here, we show how to obtain the claimed rates in §4. We begin with some preliminary facts.

𝖪𝖫\operatorname{\mathsf{KL}} divergence guarantees.

To obtain our guarantees in 𝖪𝖫\operatorname{\mathsf{KL}} divergence, we use the following lemma.

Lemma 29 ([Zha+23, Proof of Theorem 6]).

Let μ^\hat{\mu}, μ\mu, and π\pi be three probability measures, and assume that μ\mu satisfies (LSI) with constant C𝖫𝖲𝖨(μ)C_{\mathsf{LSI}}(\mu). Then,

𝖪𝖫(μ^π)\displaystyle\operatorname{\mathsf{KL}}(\hat{\mu}\mathbin{\|}\pi) 2χ2(μ^μ)+𝖪𝖫(μπ)+C𝖫𝖲𝖨(μ)4𝖥𝖨(μπ).\displaystyle\leq 2\,\chi^{2}(\hat{\mu}\mathbin{\|}\mu)+\operatorname{\mathsf{KL}}(\mu\mathbin{\|}\pi)+\frac{C_{\mathsf{LSI}}(\mu)}{4}\operatorname{\mathsf{FI}}(\mu\mathbin{\|}\pi)\,.

We instantiate the lemma with μ^1:N\hat{\mu}^{1:N}, μ1:N\mu^{1:N}, and πN\pi^{\otimes N} respectively. In the setting of Theorem 4, it is seen that 𝖪𝖫(μ1:NπN)\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N}) and C𝖫𝖲𝖨(μ1:N)𝖥𝖨(μ1:NπN)C_{\mathsf{LSI}}(\mu^{1:N})\operatorname{\mathsf{FI}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N}) are of the same order and can be made at most Nε2N\varepsilon^{2} if we take Nκ4d/ε2N\asymp\kappa^{4}d/\varepsilon^{2}. Thus, if we have a sampler that achieves χ2(μ^1:Nμ1:N)Nε2\chi^{2}(\hat{\mu}^{1:N}\mathbin{\|}\mu^{1:N})\leq N\varepsilon^{2}, it follows from exchangeability (Lemma 28) that 𝖪𝖫(μ^1π)ε2\operatorname{\mathsf{KL}}(\hat{\mu}^{1}\mathbin{\|}\pi)\lesssim\varepsilon^{2}.
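Written out, the composition described in this paragraph is the chain

\operatorname{\mathsf{KL}}(\hat{\mu}^{1}\mathbin{\|}\pi)\leq\frac{1}{N}\operatorname{\mathsf{KL}}(\hat{\mu}^{1:N}\mathbin{\|}\pi^{\otimes N})\leq\frac{1}{N}\,\Bigl{(}2\,\chi^{2}(\hat{\mu}^{1:N}\mathbin{\|}\mu^{1:N})+\operatorname{\mathsf{KL}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})+\frac{C_{\mathsf{LSI}}(\mu^{1:N})}{4}\operatorname{\mathsf{FI}}(\mu^{1:N}\mathbin{\|}\pi^{\otimes N})\Bigr{)}\lesssim\frac{1}{N}\,(N\varepsilon^{2}+N\varepsilon^{2})\lesssim\varepsilon^{2}\,,

where the first inequality is Lemma 28, the second is Lemma 29, and the last step uses the choice N\asymp\kappa^{4}d/\varepsilon^{2} together with the sampler guarantee \chi^{2}(\hat{\mu}^{1:N}\mathbin{\|}\mu^{1:N})\leq N\varepsilon^{2}.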

Guarantees using the sharp propagation of chaos bound.

Here, we impose Assumptions 1, 2, 3, and 9. It follows from Theorem 3 that N=Θ~(d/ε)N=\widetilde{\Theta}(\sqrt{d}/\varepsilon) suffices in order to make α/σ𝒲2(μ1,π)ε\nicefrac{{\sqrt{\alpha}}}{{\sigma}}\,\mathcal{W}_{2}(\mu^{1},\pi)\leq\varepsilon. For the remaining term 𝒲2(μ^1,μ1)\mathcal{W}_{2}(\hat{\mu}^{1},\mu^{1}) in the triangle inequality, we use exchangeability (Lemma 27) to argue that

𝒲2(μ^1,μ1)N1/2𝒲2(μ^1:N,μ1:N),\displaystyle\mathcal{W}_{2}(\hat{\mu}^{1},\mu^{1})\leq N^{-1/2}\,\mathcal{W}_{2}(\hat{\mu}^{1:N},\mu^{1:N})\,,

and hence we invoke sampling guarantees to ensure that α/σ𝒲2(μ^1:N,μ1:N)N1/2ε\nicefrac{{\sqrt{\alpha}}}{{\sigma}}\,\mathcal{W}_{2}(\hat{\mu}^{1:N},\mu^{1:N})\leq N^{1/2}\varepsilon under (LSI).

  • LMC: We use the guarantee for Langevin Monte Carlo from [VW19].

  • MALA–PS: We use the guarantee for the Metropolis-adjusted Langevin algorithm together with the proximal sampler from [AC23]. Note that the iteration complexity is 𝒪~(κd1/2N1/2)\widetilde{\mathcal{O}}(\kappa d^{1/2}N^{1/2}), and we substitute in the chosen value for NN.

  • ULMC–PS: Here, we use underdamped Langevin Monte Carlo to implement the proximal sampler. To justify the sampling guarantee, note that since logμ1:N\log\mu^{1:N} is β\beta-smooth, if we choose step size h=12βh=\frac{1}{2\beta} for the proximal sampler, then the RGO is β\beta-strongly log-concave and 3β3\beta-log-smooth. According to [AC23, Proof of Theorem 5.3], it suffices to implement the RGO in each iteration to accuracy N1/2ε/κ1/2N^{1/2}\varepsilon/\kappa^{1/2} in 𝖪𝖫\sqrt{\operatorname{\mathsf{KL}}}. Then, from [Zha+23], this can be done via ULMC with complexity 𝒪~(κ1/2d1/2/ε)\widetilde{\mathcal{O}}(\kappa^{1/2}d^{1/2}/\varepsilon). Finally, since the number of outer iterations of the proximal sampler is 𝒪~(κ)\widetilde{\mathcal{O}}(\kappa), we obtain the claim.

  • ULMC+: Here, we use either the randomized midpoint discretization [SL19] or the shifted ODE discretization [FLO21] of the underdamped Langevin diffusion. We also replace the LSI assumptions (Assumptions 2 and 9) with strong convexity (Assumption 4).
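To make the finite-particle pipeline concrete, here is a minimal sketch that runs plain LMC on the NN-particle Gibbs measure \mu^{1:N}\propto\exp(-\frac{2}{\sigma^{2}}(\sum_{i}V(x^{i})+\frac{1}{2(N-1)}\sum_{i\neq j}W(x^{i}-x^{j}))) in the pairwise setting and returns the first particle as an approximate sample from π\pi. The quadratic choices of VV and WW, the step size, and the iteration count are illustrative assumptions, not the tuned parameters from the guarantees above.

```python
import numpy as np

def grad_log_mu(X, sigma2, grad_V, grad_W):
    # gradient of log mu^{1:N} at X (shape (N, d)):
    #   -(2/sigma^2) * [grad V(x^i) + (1/(N-1)) * sum_{j != i} grad W(x^i - x^j)]
    N = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                # diffs[i, j] = x^i - x^j
    gW = grad_W(diffs)
    gW[np.arange(N), np.arange(N)] = 0.0                 # drop the i = j terms
    interaction = gW.sum(axis=1) / (N - 1)
    return -(2.0 / sigma2) * (grad_V(X) + interaction)

def lmc_first_particle(N, d, n_iters, h, sigma2, grad_V, grad_W, rng):
    X = rng.standard_normal((N, d))                      # arbitrary initialization
    for _ in range(n_iters):
        drift = grad_log_mu(X, sigma2, grad_V, grad_W)
        X = X + h * drift + np.sqrt(2.0 * h) * rng.standard_normal((N, d))
    return X[0]                                          # by exchangeability, any particle works

# illustrative run: V(x) = ||x||^2/2 and W(z) = 0.05 ||z||^2, with an ad hoc
# step size and iteration count (not chosen via the complexity bounds above)
rng = np.random.default_rng(0)
sample = lmc_first_particle(
    N=64, d=2, n_iters=2000, h=1e-2, sigma2=1.0,
    grad_V=lambda X: X,
    grad_W=lambda D: 0.1 * D,
    rng=rng,
)
```

Any of the refined samplers listed above (MALA with the proximal sampler, ULMC, randomized midpoint or shifted ODE discretizations) can be substituted for the inner update without changing this outer structure.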

Guarantees under strong displacement convexity.

Here, we impose Assumptions 1 and 4. As discussed above, to obtain 𝖪𝖫\operatorname{\mathsf{KL}} guarantees, we require log-concave samplers that can achieve χ2(μ^1:Nμ1:N)Nε2\chi^{2}(\hat{\mu}^{1:N}\mathbin{\|}\mu^{1:N})\leq N\varepsilon^{2}. For 𝒲2\mathcal{W}_{2} guarantees, by Theorem 4 we take Nκ2d/ε2N\asymp\kappa^{2}d/\varepsilon^{2} and we require log-concave samplers that can achieve α/σ𝒲2(μ^1:N,μ1:N)N1/2ε\nicefrac{{\sqrt{\alpha}}}{{\sigma}}\,\mathcal{W}_{2}(\hat{\mu}^{1:N},\mu^{1:N})\leq N^{1/2}\varepsilon.

  • LMC: For Langevin Monte Carlo, we use the χ2\chi^{2} guarantee from [Che+22a] and the 𝒲2\mathcal{W}_{2} guarantee from [DMM19].

  • ULMC: For underdamped Langevin Monte Carlo, we use the χ2\chi^{2} guarantee from [AC23].

  • ULMC+: Here, we use the 𝒲2\mathcal{W}_{2} guarantees for either the randomized midpoint discretization [SL19] or the shifted ODE discretization [FLO21] of the underdamped Langevin diffusion.

Guarantees in the general McKean–Vlasov setting.

In the setting of Theorem 5, we take Nκd/ε2N\asymp\kappa d/\varepsilon^{2}. We use the same sampling guarantees under (LSI) as in the prior settings.

We also note that in order to apply the log-concave sampling guarantees, we must check that μ1:N\mu^{1:N} is log-smooth. This follows from Assumption 6. Indeed,

logμ1:N(x1:N)logμ1:N(y1:N)\displaystyle\lVert\nabla\log\mu^{1:N}(x^{1:N})-\nabla\log\mu^{1:N}(y^{1:N})\rVert =2σ2i=1N𝒲2(ρx1:N,xi)𝒲2(ρy1:N,yi)2\displaystyle=\frac{2}{\sigma^{2}}\sqrt{\sum_{i=1}^{N}{\lVert\nabla_{\mathcal{W}_{2}}\mathcal{F}(\rho_{x^{1:N}},x^{i})-\nabla_{\mathcal{W}_{2}}\mathcal{F}(\rho_{y^{1:N}},y^{i})\rVert^{2}}}
22βσ2i=1N(xiyi2+𝒲12(ρx1:N,ρy1:N))\displaystyle\leq\frac{2\sqrt{2}\,\beta}{\sigma^{2}}\sqrt{\sum_{i=1}^{N}\bigl{(}\lVert x^{i}-y^{i}\rVert^{2}+\mathcal{W}_{1}^{2}(\rho_{x^{1:N}},\rho_{y^{1:N}})\bigr{)}}
4βσ2x1:Ny1:N.\displaystyle\leq\frac{4\beta}{\sigma^{2}}\,\lVert x^{1:N}-y^{1:N}\rVert\,.

The last inequality uses \mathcal{W}_{1}^{2}(\rho_{x^{1:N}},\rho_{y^{1:N}})\leq\frac{1}{N}\sum_{i=1}^{N}\lVert x^{i}-y^{i}\rVert^{2}, so that the sum under the square root is at most 2\,\lVert x^{1:N}-y^{1:N}\rVert^{2}.