Sampling from the Mean-Field Stationary Distribution
Abstract
We study the complexity of sampling from the stationary distribution of a mean-field SDE, or equivalently, the complexity of minimizing a functional over the space of probability measures which includes an interaction term. Our main insight is to decouple the two key aspects of this problem: (1) approximation of the mean-field SDE via a finite-particle system, via uniform-in-time propagation of chaos, and (2) sampling from the finite-particle stationary distribution, via standard log-concave samplers. Our approach is conceptually simpler and its flexibility allows for incorporating the state-of-the-art for both algorithms and theory. This leads to improved guarantees in numerous settings, including better guarantees for optimizing certain two-layer neural networks in the mean-field regime. A key technical contribution is to establish a new uniform-in-$N$ log-Sobolev inequality for the stationary distribution of the mean-field Langevin dynamics.
1 Introduction
The minimization of energy functionals over the Wasserstein space of probability measures has attracted substantial research activity in recent years, encompassing numerous application domains, including distributionally robust optimization [Kuh+19, YKW22], sampling [JKO98, Wib18, Che24], and variational inference [LW16, Lam+22, Dia+23, JCP23, Lac23a, YY23].
A canonical example of such a functional is $\mathcal{E}(\mu) = \int V \,\mathrm{d}\mu + \int \mu \log \mu$, where $V$ is called the potential. Up to an additive constant, which is irrelevant for the optimization, this energy functional equals the KL divergence with respect to the density $\pi \propto \exp(-V)$, and the celebrated result of [JKO98] identifies the Wasserstein gradient flow of this functional with the Langevin diffusion. This link has inspired a well-developed theory for log-concave sampling, with applications to Bayesian inference and randomized algorithms; see [Che24] for an exposition.
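To make the algorithmic link concrete, here is a minimal sketch of the unadjusted Langevin algorithm, i.e. the Euler–Maruyama discretization of the Langevin diffusion; the quadratic potential (a standard Gaussian target) and all step-size choices are hypothetical, chosen purely for illustration.

```python
import numpy as np

def ula(grad_V, x0, h=0.01, n_iters=1000, seed=0):
    """Unadjusted Langevin algorithm: Euler-Maruyama discretization of
    dX_t = -grad_V(X_t) dt + sqrt(2) dB_t, whose stationary law is prop. to exp(-V)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - h * grad_V(x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)
    return x

# 500 independent chains targeting the toy potential V(x) = |x|^2 / 2,
# i.e. a standard Gaussian; grad_V acts coordinate-wise, so the chains do not interact.
samples = ula(lambda x: x, np.zeros((500, 2)))
```

For this toy target the empirical mean and per-coordinate variance of the chains should approach $0$ and $1$ respectively, up to the $O(h)$ discretization bias.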
The energy functional above contains two terms, corresponding to two of the fundamental examples of functionals considered in Villani’s well-known treatise on optimal transport [Vil03]. Namely, they are the “potential energy” and the entropy, the latter being a special case of the “internal energy.” However, Villani identifies a third fundamental functional—the “interaction energy”—with the pairwise form given by
$$\mathcal{W}(\mu) = \frac{1}{2} \iint W(x - y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y).$$
More generally, in this work we consider minimizing the generic entropy-regularized energy
$$\mathcal{E}(\mu) = \mathcal{E}_0(\mu) + \int \mu \log \mu,$$
where $\mathcal{E}_0$ is a known functional. The minimization of this energy has recently been of interest due to its role in analysing neural network training dynamics in the mean-field regime, including with [SNW22] and without [CB18, MMN18] entropic regularization, as well as with Fisher regularization [Cla+23].
For the sake of exposition, let us first focus on minimizing the pairwise energy. A priori, this question is more difficult than log-concave sampling; for instance, the minimizer $\pi$ does not admit a closed form but rather is the solution to the non-linear equation
$$\pi \propto \exp(-V - W * \pi). \tag{1.1}$$
However, here too there is a well-developed mathematical theory which suggests a principled algorithmic approach. Just as the Wasserstein gradient flow of the energy can be identified with the Langevin diffusion in the absence of an interaction term, in the pairwise case it corresponds to a (pairwise) McKean–Vlasov SDE, i.e. an SDE whose coefficients depend on the marginal law of the process, given below as
$$\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t - (\nabla W * \mu_t)(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,$$
where $\mu_t := \operatorname{law}(X_t)$, $W$ is even, and $(B_t)_{t \ge 0}$ is a standard Brownian motion on $\mathbb{R}^d$. Since the McKean–Vlasov SDE is the so-called mean-field limit of interacting particle systems, we can approximately sample from the minimizer by numerically discretizing a system of $N$ SDEs, which describe the evolution of $N$ particles as:
$$\mathrm{d}X_t^i = -\nabla V(X_t^i)\,\mathrm{d}t - \frac{1}{N} \sum_{j=1}^N \nabla W(X_t^i - X_t^j)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t^i, \qquad i = 1, \dots, N,$$
where $(B^1, \dots, B^N)$ is a collection of independent Brownian motions. Moreover, the error from approximating the mean-field limit via this finite particle system has been studied in the literature on propagation of chaos [Szn91]. Similarly, the Wasserstein gradient flow for the general energy corresponds to the mean-field Langevin dynamics and admits an analogous particle approximation.
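As a numerical illustration, the finite-particle system can be simulated by Euler–Maruyama; the quadratic choices $\nabla V(x) = x$ and $\nabla W(x) = 0.1\,x$ below are hypothetical, made only so that the mean-field stationary law is an explicit Gaussian.

```python
import numpy as np

def simulate_particles(grad_V, grad_W, N=256, d=2, T=5.0, h=0.01, seed=0):
    """Euler-Maruyama discretization of the pairwise finite-particle system:
    dX^i = -grad_V(X^i) dt - (1/N) sum_j grad_W(X^i - X^j) dt + sqrt(2) dB^i."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, d))
    for _ in range(int(T / h)):
        diffs = X[:, None, :] - X[None, :, :]             # pairwise X^i - X^j, (N, N, d)
        interaction = grad_W(diffs).mean(axis=1)          # (1/N) sum_j grad_W(X^i - X^j)
        X = X - h * (grad_V(X) + interaction) + np.sqrt(2 * h) * rng.standard_normal((N, d))
    return X

# Hypothetical quadratic example: V(x) = |x|^2/2, W(x) = 0.05 |x|^2 (even, weak interaction).
# The mean-field stationary law is then the Gaussian prop. to exp(-1.1 |x|^2 / 2).
X_particles = simulate_particles(lambda x: x, lambda x: 0.1 * x)
```

For these choices the empirical per-coordinate variance of the particles should be close to $1/1.1 \approx 0.91$, up to propagation-of-chaos and discretization errors.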
The bounds for propagation of chaos have been refined over time, with [LL23] recently establishing a tight error dependence on the total number of particles $N$. These bounds, however, do not translate immediately into algorithmic guarantees. Existing sampling analyses treat the propagation of chaos and the discretization as a single entangled problem, and have thus far only been able to use weaker rates for the former. Furthermore, there has been recent interest in using more sophisticated particle-based algorithms, e.g., “non-linear” Hamiltonian Monte Carlo [BS23] and the mean-field underdamped Langevin dynamics [FW23], to reduce the discretization error. Currently, this requires repeatedly carrying out the propagation of chaos and time discretization analyses from the ground up for each instance.
This motivates us to pose the following questions: (1) Can we incorporate improvements in the propagation of chaos literature, such as the error dependence shown in [LL23], to improve existing theoretical guarantees? (2) Can we leverage recent advances in the theory of log-concave sampling to design better algorithms?
Our main proposal in this work is to decouple the error into two terms, representing the propagation of chaos and discretization errors respectively. This simple and modular approach immediately allows us to answer both questions in the affirmative. Namely, we show how to combine established propagation of chaos bounds in various settings (including the sharp rate of [Lac23, LL23]) with a large class of sophisticated off-the-shelf log-concave samplers, such as interacting versions of the randomized midpoint discretization of the underdamped Langevin dynamics [SL19, HBE20], Metropolis-adjusted algorithms [Che+21, WSC22, AC23], and the proximal sampler [LST21, Che+22, FYC23]. Our framework yields improvements upon the prior state of the art, such as [BS23, FW23], and provides a clear path for future ones.
1.1 Contributions and Organization
Propagation of chaos at stationarity.
We provide three propagation of chaos results which hold in the , , and “metrics”; the rates reflect the distance of the -particle marginal of the finite-particle system from : (1) In the setting of (), under strong displacement convexity, we obtain a rate by adapting techniques from [Szn91, Mal01]; (2) without assuming displacement convexity, but assuming a weaker interaction, we obtain the sharp rate of following [LL23]; (3) finally, in the general setting of (), and assuming is convex along linear interpolations, we obtain a rate following [CRW22].
Unlike prior works, our proofs are carried out at stationarity; thus, our proofs are self-contained, streamlined, and include various improvements (e.g., weaker assumptions and explicit bounds). As a result, our work also serves as a helpful exposition to the mathematics of propagation of chaos.
Discretization.
Once the error due to particle approximation is controlled, we then obtain improved complexity guarantees by applying recent advances in the theory of log-concave sampling to the finite-particle stationary distribution. See Table 1 for a summary of our results, and the discussion in §4 for comparisons with prior works and an application to neural network training.
Once again, the importance of our framework is its modularity, which allows for any combination of uniform-in-time propagation of chaos bounds and log-concave sampler, provided that the finite-particle stationary distribution satisfies certain isoperimetric properties needed for the sampling guarantees. Toward this end, we also provide tools for verifying these isoperimetric properties with constants that hold independently of the number of particles (see §3.2.1).
1.2 Related Work
Mean-field equations.
The McKean–Vlasov SDE was first formulated in the works [McK66, Fun84, Mél96], with origins dating to much earlier [Bol72]. It has applications in many domains, from fluid dynamics [Vil02] to game theory [LL07, CD18]; see [CD22, CD22a] for a comprehensive survey. The kinetic version of this equation is known as the Boltzmann equation, and propagation of chaos has similarly been studied under a variety of assumptions [BGM10, Mon17, GM21, GLM22]. One prominent application within machine learning is the study of infinitely wide two-layer neural networks in the mean-field regime (see §4.2).
Propagation of chaos and sampling for ().
The original propagation of chaos arguments of [Szn91] were first made uniform in time in [Mal01, Mal03] in both entropy and . The aforementioned works all achieve an error of order , and require a strong convexity assumption on and . These were later adapted for non-smooth potentials [JW17, JW18, BJW23]. Finally, [CRW22] obtained an entropic propagation of chaos bound under a higher-order smoothness assumption. See [CD22] for a more complete bibliography.
The breakthrough result of [Lac23] obtained the sharp bound of when the interaction is sufficiently weak, and this bound was made uniform in time in [LL23]. Their approach differs significantly from previous proofs by considering a local analysis based on the recursive BBGKY hierarchy. These results have been extended to other divergences, e.g., the divergence, but without a uniform-in-time guarantee [HR23]. In addition, [MRW24] showed an extension of this result under a “convexity at infinity” assumption.
The question of sampling from minimizers of () was first studied in [Tal96, BT97, AK02]. These works focused on the Euler–Maruyama discretization of the finite-particle system (), under -boundedness of the gradients. Subsequently, the convergence of the Euler–Maruyama scheme has been studied in many works, including but not limited to [BH22, RES22, Li+23]. The strategy of disentangling finite particle error from time discretization also appears in [KHK24], which approaches the problem from the perspective of stochastic approximation. This work, however, is not focused on obtaining quantitative guarantees. Finally, [BS23] considered a non-linear version of Hamiltonian Monte Carlo; we give a detailed comparison with their work in §4.
Propagation of chaos and sampling for ().
The mean-field (underdamped) Langevin algorithm for minimizing () was proposed and studied in [CRW22, Che+24]. Under alternative assumptions (see §3.1.2), they established propagation of chaos with a rate, for both the overdamped and the underdamped finite-particle approximations. Recent works from the machine learning community [NWS22, SNW22, FW23, SWN23] studied the application of these algorithms for optimizing two-layer neural networks and obtained sampling guarantees. We provide a detailed comparison with their works in §4.2.
2 Background and Notation
Let $\mathcal{P}_2(\mathbb{R}^d)$ be the set of probability measures on $\mathbb{R}^d$ that admit a density with respect to the Lebesgue measure and have finite second moment. We will also abuse notation and use the same symbol for a measure and its density when there is no confusion. We use superscripts for the particle index, and subscripts for the time variable. We will use $\lesssim$ and $\widetilde{O}$ to signify upper bounds up to numeric constants and polylogarithmic factors, respectively. We recall the definitions of convexity and smoothness:
Definition 1.
A function $F : \mathbb{R}^d \to \mathbb{R}$ is $\alpha$-uniformly convex (allowing for $\alpha \le 0$) and $\beta$-smooth if for all $x, y \in \mathbb{R}^d$, the following hold respectively:
$$F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\alpha}{2}\,\|y - x\|^2, \qquad F(y) \le F(x) + \langle \nabla F(x), y - x \rangle + \frac{\beta}{2}\,\|y - x\|^2.$$
For two probability measures $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$, we define the KL divergence and the (relative) Fisher information by
$$\mathsf{KL}(\mu \parallel \nu) := \int \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}\,\mathrm{d}\mu, \qquad \mathsf{FI}(\mu \parallel \nu) := \int \Bigl\| \nabla \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \Bigr\|^2\,\mathrm{d}\mu,$$
with the convention $\mathsf{KL}(\mu \parallel \nu) = \mathsf{FI}(\mu \parallel \nu) = \infty$ whenever $\mu \not\ll \nu$.
We recall the definition of the log-Sobolev inequality, which is used both for propagation of chaos arguments as well as mixing time bounds.
Definition 2 (Log-Sobolev Inequality).
A measure $\mu$ satisfies a log-Sobolev inequality with parameter $\alpha > 0$ if for all smooth, compactly supported $f : \mathbb{R}^d \to \mathbb{R}$,
$$\operatorname{Ent}_\mu(f^2) \le \frac{2}{\alpha}\,\mathbb{E}_\mu[\|\nabla f\|^2], \qquad \text{where } \operatorname{Ent}_\mu(f^2) := \mathbb{E}_\mu[f^2 \log f^2] - \mathbb{E}_\mu[f^2] \log \mathbb{E}_\mu[f^2]. \tag{LSI}$$
When the potential $V$ is $\alpha$-uniformly convex for $\alpha > 0$, it follows from the Bakry–Émery condition that $\pi \propto \exp(-V)$ satisfies (LSI) with parameter $\alpha$ [BGL14, Proposition 5.7.1].
We can also define the $p$-Wasserstein distance $W_p$, $p \ge 1$, between $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ as
$$W_p(\mu, \nu) := \inf_{\gamma \in \Gamma(\mu, \nu)} \Bigl( \int \|x - y\|^p\,\gamma(\mathrm{d}x, \mathrm{d}y) \Bigr)^{1/p},$$
where $\Gamma(\mu, \nu)$ is the set of all joint probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$ respectively.
Lastly, we recall that the celebrated Otto calculus interprets the space $\mathcal{P}_2(\mathbb{R}^d)$, equipped with the $W_2$ metric, as a formal Riemannian manifold [Ott01]. In particular, the Wasserstein gradient of a functional $\mathcal{F}$ is given as $\nabla_{W_2} \mathcal{F}(\mu) = \nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu)$. Here, $\frac{\delta \mathcal{F}}{\delta \mu}$ is the first variation, defined as follows: for all $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$, it satisfies
$$\lim_{t \searrow 0} \frac{\mathcal{F}(\mu + t\,(\nu - \mu)) - \mathcal{F}(\mu)}{t} = \int \frac{\delta \mathcal{F}}{\delta \mu}(\mu)\,\mathrm{d}(\nu - \mu).$$
The first variation is defined up to an additive constant, but the Wasserstein gradient is unambiguous. See [AGS08] for a rigorous development. As a shorthand, we will write and similarly .
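As a concrete check of these definitions, the first variations of the three canonical functionals from §1 can be computed directly (a sketch consistent with the conventions above):

```latex
% First variations, up to additive constants, of the three canonical functionals:
\frac{\delta}{\delta\mu}\Bigl(\int V \,\mathrm{d}\mu\Bigr) = V, \qquad
\frac{\delta}{\delta\mu}\Bigl(\int \mu \log \mu\Bigr) = \log\mu + 1, \qquad
\frac{\delta}{\delta\mu}\Bigl(\frac{1}{2}\iint W(x-y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y)\Bigr) = W * \mu,
% where the last identity uses that W is even. Taking Euclidean gradients of
% these first variations gives the Wasserstein gradients, so a stationary point
% of the pairwise energy satisfies
\nabla \log \pi + \nabla V + \nabla(W * \pi) = 0
\quad\Longleftrightarrow\quad
\pi \propto \exp(-V - W * \pi).
```

Setting the Wasserstein gradient of the pairwise energy to zero in this way recovers the fixed-point equation (1.1).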
2.1 SDE Systems and Their Stationary Distributions
2.1.1 The Pairwise McKean–Vlasov Setting
In the formalism introduced in the previous section, we note that the pairwise McKean–Vlasov SDE can be interpreted as the Wasserstein gradient flow of the pairwise energy. As noted in the introduction, it has the stationary distribution (1.1), which minimizes this energy.
Recall also that the pairwise McKean–Vlasov equation is the mean-field limit of the finite-particle system. This $N$-particle system has the following stationary distribution: for $x^{1:N} = (x^1, \dots, x^N) \in (\mathbb{R}^d)^N$,
$$\mu^N(x^{1:N}) \propto \exp\Bigl( -\sum_{i=1}^N V(x^i) - \frac{1}{2N} \sum_{i,j=1}^N W(x^i - x^j) \Bigr). \tag{2.1}$$
The finite-particle system can be viewed as an approximation to the McKean–Vlasov SDE, with the expectation term in the drift replaced by an empirical average. Note that the measure $\mu^N$ is exchangeable, i.e., the law of $(X^1, \dots, X^N)$ equals the law of $(X^{\sigma(1)}, \dots, X^{\sigma(N)})$ for any permutation $\sigma$ of $\{1, \dots, N\}$. While the standard approach is to apply an Euler–Maruyama discretization to the particle system in order to sample from its stationary distribution, our perspective is to apply more sophisticated samplers to $\mu^N$ directly. Indeed, unlike (1.1), the finite-particle stationary distribution (2.1) is explicit and amenable to sampling methods.
2.1.2 The General McKean–Vlasov Setting
More generally, we consider the functional $\mathcal{E}(\mu) = \mathcal{E}_0(\mu) + \int \mu \log \mu$, where $\mathcal{E}_0$ is of the form $\mathcal{E}_0(\mu) = \mathcal{F}_0(\mu) + \frac{\lambda}{2} \int \|x\|^2\,\mu(\mathrm{d}x)$ with $\lambda \ge 0$. The second term acts as regularization and is common in the literature [FW23, SWN23]. We can describe its Wasserstein gradient flow as the marginal law of a particle trajectory satisfying the following SDE, which we call the general McKean–Vlasov equation:
$$\mathrm{d}X_t = -\nabla \frac{\delta \mathcal{E}_0}{\delta \mu}(\mu_t)(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,$$
where $\mu_t := \operatorname{law}(X_t)$, and $(B_t)_{t \ge 0}$ is a standard Brownian motion on $\mathbb{R}^d$. The stationary distribution $\pi$ of the general dynamics, and its linearization around a measure $\mu$, satisfy the following equations:
$$\pi \propto \exp\Bigl( -\frac{\delta \mathcal{E}_0}{\delta \mu}(\pi) \Bigr), \qquad \hat\pi_\mu \propto \exp\Bigl( -\frac{\delta \mathcal{E}_0}{\delta \mu}(\mu) \Bigr). \tag{2.2}$$
The latter is called the proximal Gibbs distribution with respect to $\mu$. The general dynamics corresponds to the mean-field limit of the following finite-particle system, described by an $N$-tuple of stochastic processes $(X^1, \dots, X^N)$:
$$\mathrm{d}X_t^i = -\nabla \frac{\delta \mathcal{E}_0}{\delta \mu}(\rho_{X_t})(X_t^i)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t^i, \qquad i = 1, \dots, N,$$
where $\rho_{X_t} := \frac{1}{N} \sum_{i=1}^N \delta_{X_t^i}$ is the empirical measure of the particle system. The stationary distribution for this system is given as follows [CRW22, (2.16)]: for $x^{1:N} \in (\mathbb{R}^d)^N$,
$$\mu^N(x^{1:N}) \propto \exp\bigl( -N\,\mathcal{E}_0(\rho_{x^{1:N}}) \bigr), \qquad \rho_{x^{1:N}} := \frac{1}{N} \sum_{i=1}^N \delta_{x^i}. \tag{2.3}$$
One can show that $\nabla_{x^i} [N\,\mathcal{E}_0(\rho_{x^{1:N}})] = \nabla \frac{\delta \mathcal{E}_0}{\delta \mu}(\rho_{x^{1:N}})(x^i)$, and hence the finite-particle system is simply the Langevin diffusion corresponding to the stationary measure (2.3). Moreover, when $\mathcal{E}_0(\mu) = \int V \,\mathrm{d}\mu + \frac{1}{2} \iint W(x - y)\,\mu(\mathrm{d}x)\,\mu(\mathrm{d}y)$, the general McKean–Vlasov equation, its stationary equation (2.2), the general finite-particle system, and (2.3) reduce to the pairwise McKean–Vlasov equation, (1.1), the pairwise particle system, and (2.1), respectively.
3 Technical Ingredients
Our general approach for sampling from the stationary distribution in either (1.1) or (2.2) is to directly apply an off-the-shelf sampler to the finite-particle stationary distribution $\mu^N$. The theoretical guarantees for this procedure require two main ingredients: (1) control of the “bias”—i.e. the error incurred by approximating $\pi$ by the marginals of $\mu^N$—and (2) verification of isoperimetric properties which allow for fast sampling from the measure $\mu^N$.
3.1 Bias Control via Uniform-in-Time Propagation of Chaos
In this section, we focus on the first ingredient, namely, obtaining control of the bias via uniform-in-time propagation of chaos results. Proofs for this section are given in §A.
3.1.1 Pairwise McKean–Vlasov Setting
We first consider the pairwise McKean–Vlasov setting described in §2.1.1. Our first propagation of chaos result uses the following three assumptions.
Assumption 1.
The potentials $V$ and $W$ are $\beta_V$-smooth and $\beta_W$-smooth, respectively.
Assumption 2.
The mean-field stationary distribution $\pi$ satisfies (LSI) with parameter $\alpha > 0$.
Assumption 3.
The ratio is at least .
Remark. Note that from (1.1), we typically would expect to also scale as (e.g., in the case when and are -uniformly convex for ). Therefore, Assumption 3 is typically invariant to the scaling of and can be satisfied even for . Under these assumptions, we obtain a sharp propagation of chaos result via a similar argument as [Lac22, LL23]. We note that the former is more permissive regarding the constant in Assumption 3 as compared to this work.
Theorem 3 (Sharp Propagation of Chaos).
We note that the rate in Theorem 3 is sharp; see the Gaussian case in Example 10. A condition such as Assumption 3 is in general necessary, since otherwise the minimizer of the pairwise energy may not even be unique (see the example and discussion in [Lac23]). However, it can be restrictive, as it requires the interaction to be sufficiently weak. With the following convexity assumption, we can obtain a propagation of chaos result without Assumption 3.
Assumption 4.
The potentials are -uniformly convex with . Here, denotes the negative part of .
The following weaker result consists of two parts. The first, a Wasserstein propagation of chaos result, is based on [Szn91]. The second, building on the first, is a uniform-in-time entropic propagation of chaos bound following from a Fisher information bound. The arguments are similar to those in [Mal01, Mal03], albeit simplified (since we work at stationarity) and presented here with explicit constants.
3.1.2 General McKean–Vlasov Setting
In the more general case where we aim to minimize () for a generic functional of the form , we impose the following assumptions. They can be largely seen as generalizations of the conditions for the pairwise case, and they are inherited from [CRW22, SWN23]. There is an additional convexity condition (Assumption 5), which in the pairwise McKean–Vlasov setting amounts to positive semidefiniteness of the kernel on ; thus, in general, the following assumptions are incomparable with the ones in §3.1.1.
Assumption 5.
The functional is convex in the usual sense. For all , ,
Assumption 6.
The functional is smooth in the sense that for all , , there is a uniform constant such that
Assumption 7.
The proximal Gibbs measures satisfy (LSI) with a uniform constant: namely, it holds that .
Remark. These assumptions taken together cover settings not covered in the preceding sections, including optimization of two-layer neural networks. See [CRW22, Remark 3.1] and §4.2.
Under these assumptions, we can derive an entropic propagation of chaos bound by following the proof of [CRW22]. Through a tighter analysis, we are able to reduce the dependence on the condition number from to .
Theorem 5 (Propagation of Chaos for General Functionals).
Among these assumptions, the hardest to verify is the uniform LSI of Assumption 7. Following [SWN23], we introduce the following sufficient condition for the validity of Assumption 7; see Lemma 22 for a more precise statement.
Assumption 8.
There exists a uniform bound on the Wasserstein gradient of the interaction term : for some constant and all , ,
3.2 Isoperimetric Properties of the Stationary Distributions
In this section, we verify the isoperimetric properties of in the () setting, with proofs provided in §B.
3.2.1 Pairwise McKean–Vlasov Setting
If $V$, $W$ satisfy Assumptions 1 and 4 (i.e. $V$ and $W$ have bounded Hessians), then the potential for (1.1), i.e. $V + W * \pi$, is uniformly convex and smooth. By the Bakry–Émery condition, $\pi$ satisfies (LSI).
Similarly, for the invariant measure in (2.1), we can prove the following.
Lemma 7.
If $V$ and $W$ satisfy Assumption 1, then the potential of $\mu^N$ is smooth.
If $V$ and $W$ satisfy Assumption 4, then the potential of $\mu^N$ is uniformly convex. (Only the negative part of the convexity of $W$ contributes to the strong log-concavity of $\mu^N$. This is consistent with [Vil03, Theorem 5.15], which asserts that when $W$ is convex, the interaction energy is strongly displacement convex over the subspace of probability measures with fixed mean, but only weakly convex over the full Wasserstein space.)
We now consider the non-log-concave case. It is standard in the sampling literature that the assumption of (LSI) for the stationary distribution yields mixing time guarantees. Since our strategy is to sample from (2.1), we therefore seek an LSI for , formalized as the following assumption.
Assumption 9.
The distribution satisfies (LSI) with parameter .
In this section, we provide an easily verifiable condition, combining a Holley–Stroock condition [HS87] with a weak interaction condition, for this assumption to hold with an -independent constant.
Assumption 10.
The potentials $V$ and $W$ can be decomposed as $V = V_0 + V_1$ and $W = W_0 + W_1$ such that $V_0$, $W_0$ satisfy Assumption 4 and $\operatorname{osc}(V_1), \operatorname{osc}(W_1) < \infty$, where for a function $f$ we define $\operatorname{osc}(f) := \sup f - \inf f$. Furthermore, the following weak interaction condition holds:
A careful application of the Holley–Stroock perturbation principle yields the following lemma.
3.2.2 General McKean–Vlasov Setting
In the setting () with , we verify that Assumption 8 yields (LSI) for . See Corollary 25 for a more precise statement.
Lemma 9 (Informal).
4 Sampling from the Mean-Field Target
In this section, we present results for sampling from $\pi$. As outlined in Algorithm 1, we use off-the-shelf log-concave samplers to sample from $\mu^N$, during which we access a first-order oracle for the potential of $\mu^N$, i.e. an oracle for evaluation of the potential up to an additive constant, and for evaluation of its gradient. (For our results involving the proximal sampler, we also assume access to a proximal oracle for simplicity.) For $N$ sufficiently large, the first particle given by Algorithm 1 is approximately distributed according to $\pi$: for the law of the output of the log-concave sampler and its one-particle marginal distribution,
where the inequality follows from exchangeability (Lemma 27). A similar decomposition also holds for , although the argument is more technical. We defer its presentation to §E.
Input: the number of total particles, a log-concave sampler
Output: particles
To bound the second term by $\varepsilon$, it suffices to choose $N$ according to the propagation of chaos results in §3.1. Our results are summarized in Table 1, in which we record the total number of oracle calls for $\mu^N$ made by the sampler and the number of particles $N$ needed to achieve error $\varepsilon$ in the desired metric, hiding polylogarithmic factors. Note that in the pairwise McKean–Vlasov setting, each oracle call to the potential of $\mu^N$ requires $N$ calls to an oracle for $V$, and $N^2$ calls to an oracle for $W$.
Algorithm | “Metric” | Assumptions | ||
LMC | 1, 2, 3, 9 | |||
MALA–PS | ||||
ULMC–PS | ||||
ULMC+ | 1, 3, 4 | |||
LMC | 1, 4 | |||
ULMC | ||||
LMC | ||||
ULMC+ | ||||
LMC | 5, 6, 7, 9 | |||
ULMC–PS |
The algorithms in the table refer to: Langevin Monte Carlo (LMC); underdamped Langevin Monte Carlo (ULMC); discretizations of the underdamped Langevin diffusion via the randomized midpoint method [SL19] or the shifted ODE method [FLO21] (ULMC+); and implementations of the proximal sampler [LST21, Che+22] via the Metropolis-adjusted Langevin algorithm or via ULMC (MALA–PS and ULMC–PS respectively). Note that LMC applied to sample from $\mu^N$ is simply the Euler–Maruyama discretization of the finite-particle system, and likewise ULMC is the algorithm considered in [FW23]. We refer to §E for proofs and references.
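The decoupled framework can be sketched end to end as follows; the propagation-of-chaos constant `C_poc`, the quadratic oracles, and the step-size choices are hypothetical placeholders, with LMC as the plug-in sampler (any algorithm from the table could be substituted).

```python
import numpy as np

def lmc_particles(grad_V, grad_W, N, d, h=0.01, n_iters=500, seed=0):
    """LMC on mu^N prop. to exp(-sum_i V(x^i) - (1/2N) sum_{i,j} W(x^i - x^j));
    this coincides with Euler-Maruyama for the finite-particle SDE system.
    Each gradient evaluation costs N calls to grad_V and N^2 calls to grad_W."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, d))
    for _ in range(n_iters):
        diffs = X[:, None, :] - X[None, :, :]
        grad_U = grad_V(X) + grad_W(diffs).mean(axis=1)
        X = X - h * grad_U + np.sqrt(2 * h) * rng.standard_normal((N, d))
    return X

def sample_mean_field(grad_V, grad_W, d, eps, C_poc=1.0, sampler=lmc_particles):
    """Decoupled pipeline: pick N so that a (hypothetical) propagation-of-chaos
    bias bound C_poc * d / N falls below eps^2, run the plug-in log-concave
    sampler on mu^N, and return the first particle as an approximate draw from pi."""
    N = int(np.ceil(C_poc * d / eps**2))
    X = sampler(grad_V, grad_W, N, d)
    return X[0]

# Example usage with hypothetical quadratic oracles
x_out = sample_mean_field(lambda x: x, lambda x: 0.1 * x, d=2, eps=0.5)
```

The modularity claimed in the text corresponds here to the `sampler` argument: improving either the propagation-of-chaos bound (the choice of `N`) or the sampler leaves the rest of the pipeline unchanged.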
To streamline the rates, we simplify the notation by defining if Assumption 1 holds, otherwise we use the value from Assumption 6. We let under Assumption 4, under Assumptions 2 and 9, and in the general McKean–Vlasov setting.
Finally, we let $\kappa$ denote the condition number. We briefly justify this terminology. If the target and all proximal Gibbs measures are strongly log-concave with parameter $\alpha$, then the Bakry–Émery condition implies that they satisfy (LSI) with parameter $\alpha$. Hence, the scale-invariant ratio reduces to the classical condition number $\beta/\alpha$, the ratio of the largest to smallest eigenvalues of the Hessian matrices of the potentials. Therefore, $\kappa$ is a generalization of the condition number to settings beyond uniform strong convexity which allows us to state more interpretable bounds. An additional assumption will be used to simplify some of the rates.
In the following subsections, we discuss some of the results in greater detail.
4.1 Pairwise McKean–Vlasov Setting
Example 10 (Gaussian Case).
Example 11 (Strongly Convex Case).
Consider the strongly convex case where . The prior work [BS23] also considered the problem of sampling from the mean-field stationary distribution , with . If we count the number of calls to a gradient oracle for , their complexity bound reads to achieve . We note that their assumptions are not strictly comparable to ours. They require the interaction to be sufficiently weak, in the sense that , which is similar to our Assumption 3 (see eq. (2.24) therein; note that they have a scaling factor of in front of their interaction term, so that our parameter is equivalent to their ); on the other hand, they only assume , rather than . Nevertheless, we attempt to make some comparisons with their work below.
Without Assumption 3, ULMC+ achieves with complexity , which matches the guarantee of [BS23] up to the dependence on . We can also obtain guarantees in , at the cost of an extra factor of .
With Assumption 3, MALA–PS has complexity and ULMC+ has complexity , which improve substantially upon [BS23].
To summarize, in the strongly convex case, we have obtained numerous improvements: (i) we can obtain results even without the weak interaction condition (Assumption 3); (ii) when we assume the weak interaction condition, we obtain improved complexities; (iii) our results hold in stronger metrics; (iv) our approach is generic, allowing for the consideration of numerous different samplers without needing to establish new propagation of chaos results (by way of comparison, [BS23] developed a tailored propagation of chaos argument for their non-linear Hamiltonian Monte Carlo algorithm).
Example 12 (Bounded Perturbations).
Both the results of [BS23] as well as our own allow for non-convex potentials, albeit under different assumptions—[BS23] require strong convexity at infinity, whereas we require (LSI) for the stationary measures and . In order to obtain sampling guarantees with low complexity, it is important for the LSI constant of to be independent of . We have provided a sufficient condition for this to hold: and are bounded perturbations of and respectively, where ; see Lemma 8.
We also note that in this setting, both of our works require a weak interaction condition. This is in general necessary in order to ensure uniqueness of the mean-field stationary distribution, see the discussion in §3.1.1.
4.2 General McKean–Vlasov Setting
Example 13 (General Functionals).
In the general setting, under Assumptions 5, 6, and 7, the work of [SWN23] provided the first discretization bounds. They impose further assumptions and their resulting complexity bound is rather complicated, but it reads roughly for the discretization of (). Subsequently, [FW23] obtained an improved complexity of via ULMC in the averaged distance. In comparison, we can improve this complexity guarantee to , and the guarantee even holds in if we combine ULMC with the proximal sampler. It appears that we gain one factor of through sharper discretization analysis (via [Zha+23], or via the error analysis of the proximal sampler in [AC23]), and one factor of via a sharper propagation of chaos result (Theorem 5).
Application to Two-Layer Neural Networks.
Let us consider the problem of learning a two-layer neural network in the mean-field regime. Let $\Phi(x; \cdot)$ be a function parameterized by $x \in \mathbb{R}^p$, and for any probability measure $\mu$ over $\mathbb{R}^p$, let $f_\mu := \int \Phi(x; \cdot)\,\mu(\mathrm{d}x)$. For example, in a standard two-layer neural network, we take $x = (a, w)$ and $\Phi(x; z) = a\,\sigma(\langle w, z \rangle)$ for an activation function $\sigma$. When $\mu$ is an empirical measure over $N$ atoms, then $f_\mu$ is the function computed by a two-layer neural network with $N$ hidden neurons. In this formulation, however, we can take $\mu$ to be any probability measure, corresponding to the mean-field limit [CB18, MMN18, Chi22, RV22, SS20].
Given a dataset $\{(z_k, y_k)\}_{k=1}^n$ and a loss function $\ell$, we can formulate neural network training as the problem of minimizing the empirical loss $\mu \mapsto \frac{1}{n} \sum_{k=1}^n \ell(f_\mu(z_k), y_k)$. To place this within the general McKean–Vlasov framework, we add two regularization terms: (1) $\lambda \int \|x\|^2\,\mu(\mathrm{d}x)$ corresponds to weight decay; and (2) the entropic regularization $\int \mu \log \mu$. We are now in the setting of §2.1.2.
To minimize this energy, it is natural to consider the Euler–Maruyama discretization of (), which corresponds to learning the neural network via noisy GD, and was considered in [SWN23]. Recent works [FW23, Che+24] also considered the underdamped version of () and its discretization. Under the assumptions common to those works as well as our own, our results yield improved algorithmic guarantees for this task (see Example 13).
Unfortunately, the assumptions used for the analysis of the general McKean–Vlasov setting are restrictive and limit the applicability to neural network training. For example, it suffices for $\ell$ to be convex in its first argument (to satisfy Assumption 5) and to have two bounded derivatives (w.r.t. its first argument), and for $x \mapsto \Phi(x; z)$ to have two bounded derivatives for each $z$. The last condition is satisfied, e.g., for $\Phi(x; z) = \tanh(\langle x, z \rangle)$. For a genuinely two-layer example, we can take $\Phi(x; z) = \tanh(a)\,\tanh(\langle w, z \rangle)$ for $x = (a, w)$. Under these conditions, Assumptions 6 and 8 hold, which in turn furnish log-Sobolev inequalities via Lemmas 6 and 9.
Limitations.
However, we note that there is a substantial limitation of our framework when applied to the mean-field Langevin dynamics. Although we are able to establish a uniform-in-$N$ LSI for the stationary distribution under appropriate assumptions (see Corollary 25 for a precise statement), the LSI constant scales poorly (in fact, doubly exponentially) in the problem parameters. To fully benefit from the modularity of our approach, it is desirable to obtain a uniform-in-$N$ LSI with better scaling, and we leave this question open for future research.
5 Conclusion
In this work, we propose a framework for obtaining sampling guarantees for the minimizers of () and (), based on decoupling the problem into (i) particle approximation via propagation of chaos, and (ii) time-discretization via log-concave sampling theory. Our approach leads to simpler proofs and improved guarantees compared to previous works, and our results readily benefit from any improvements in either (i) or (ii).
We conclude by listing some future directions of study. As discussed in §4.2, our uniform-in-$N$ LSI for the mean-field Langevin dynamics currently scales poorly in the problem parameters, and it is important to improve it. We also believe there is further room for improvement in the propagation of chaos results. For example, can the sharp rate in Theorem 3 be extended to stronger metrics such as Rényi divergences, as well as to situations when the weak interaction condition (Assumption 3) fails, e.g., in the strongly displacement convex case or in the setting of §3.1.2? For the sampling guarantees, the prior works [BS23, SWN23] considered different settings, such as potentials satisfying convexity at infinity or the use of stochastic gradients; these extensions are compatible with our approach and could possibly lead to improvements in these cases, as well as others. Finally, consider the case where the interaction term $W(x - y)$ in the pairwise energy is replaced with a generic function $W(x, y)$. It would be interesting to extend our analysis to this setting, as it arises in many applications [AD20].
Acknowledgements
We thank Zhenjie Ren for pointing out an error in a previous draft of this paper, and Daniel Lacker, Atsushi Nitanda, and Taiji Suzuki for important discussions and references. YK was supported in part by NSF awards CCF-2007443 and CCF-2134105. MSZ was supported by NSERC through the CGS-D program. SC was supported by the Eric and Wendy Schmidt Fund at the Institute for Advanced Study. MAE was supported by NSERC Grant [2019-06167] and CIFAR AI Chairs program at the Vector Institute. MBL was supported by NSF grant DMS-2133806.
References
- [AC23] Jason M. Altschuler and Sinho Chewi “Faster high-accuracy log-concave sampling via algorithmic warm starts” In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023, pp. 2169–2176
- [AGS08] Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré “Gradient flows in metric spaces and in the space of probability measures”, Lectures in Mathematics ETH Zürich Birkhäuser Verlag, Basel, 2008, pp. x+334
- [AK02] Fabio Antonelli and Arturo Kohatsu-Higa “Rate of convergence of a particle method to the solution of the McKean–Vlasov equation” In The Annals of Applied Probability 12.2 Institute of Mathematical Statistics, 2002, pp. 423–476
- [AD20] Marc Arnaudon and Pierre Del Moral “A second order analysis of McKean–Vlasov semigroups” In The Annals of Applied Probability 30.6 JSTOR, 2020, pp. 2613–2664
- [BGL14] Dominique Bakry, Ivan Gentil and Michel Ledoux “Analysis and geometry of Markov diffusion operators” Springer, 2014
- [BH22] Jianhai Bao and Xing Huang “Approximations of McKean–Vlasov stochastic differential equations with irregular coefficients” In Journal of Theoretical Probability Springer, 2022, pp. 1–29
- [BGM10] François Bolley, Arnaud Guillin and Florent Malrieu “Trend to equilibrium and particle approximation for a weakly self consistent Vlasov–Fokker–Planck equation” In ESAIM: Mathematical Modelling and Numerical Analysis 44.5 EDP Sciences, 2010, pp. 867–884
- [Bol72] Ludwig Boltzmann “Further studies on the heat balance among gas molecules” In History of Modern Physical Sciences 1, 1872, pp. 262–349
- [BT97] Mireille Bossy and Denis Talay “A stochastic particle method for the McKean–Vlasov and the Burgers equation” In Mathematics of Computation 66.217, 1997, pp. 157–192
- [BS23] Nawaf Bou-Rabee and Katharina Schuh “Nonlinear Hamiltonian Monte Carlo & its particle approximation” In arXiv preprint arXiv:2308.11491, 2023
- [BL76] Herm J. Brascamp and Elliott H. Lieb “On extensions of the Brunn–Minkowski and Prékopa–Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation” In J. Functional Analysis 22.4, 1976, pp. 366–389
- [BJW23] Didier Bresch, Pierre-Emmanuel Jabin and Zhenfu Wang “Mean field limit and quantitative estimates with singular attractive kernels” In Duke Mathematical Journal 172.13 Duke University Press, 2023, pp. 2591–2641
- [BP24] Giovanni Brigati and Francesco Pedrotti “Heat flow, log-concavity, and Lipschitz transport maps” In arXiv preprint 2404.15205, 2024
- [CD18] René Carmona and François Delarue “Probabilistic theory of mean field games with applications I–II” Springer, 2018
- [CD22] Louis-Pierre Chaintron and Antoine Diez “Propagation of chaos: a review of models, methods and applications. I. Models and methods” In Kinet. Relat. Models 15.6, 2022, pp. 895–1015
- [CD22a] Louis-Pierre Chaintron and Antoine Diez “Propagation of chaos: a review of models, methods and applications. II. Applications” In Kinet. Relat. Models 15.6, 2022, pp. 1017–1173
- [Che+24] Fan Chen, Yiqing Lin, Zhenjie Ren and Songbo Wang “Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics” In Electronic Journal of Probability 29 Institute of Mathematical Statistics and Bernoulli Society, 2024, pp. 1–43
- [CRW22] Fan Chen, Zhenjie Ren and Songbo Wang “Uniform-in-time propagation of chaos for mean field Langevin dynamics” In arXiv preprint arXiv:2212.03050, 2022
- [Che+22] Yongxin Chen, Sinho Chewi, Adil Salim and Andre Wibisono “Improved analysis for a proximal algorithm for sampling” In Proceedings of Thirty Fifth Conference on Learning Theory (COLT) 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2984–3014
- [Che24] Sinho Chewi “Log-concave sampling” Book draft available at https://chewisinho.github.io., 2024
- [Che+22a] Sinho Chewi et al. “Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev” In Proceedings of Thirty Fifth Conference on Learning Theory (COLT) 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 1–2
- [Che+21] Sinho Chewi et al. “Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm” In Conference on Learning Theory (COLT), 2021, pp. 1260–1300 PMLR
- [Chi22] Lénaı̈c Chizat “Mean-field Langevin dynamics: exponential convergence and annealing” In Transactions on Machine Learning Research, 2022
- [CB18] Lénaı̈c Chizat and Francis Bach “On the global convergence of gradient descent for over-parameterized models using optimal transport” In Advances in Neural Information Processing Systems (NeurIPS) 31, 2018
- [Cla+23] Julien Claisse, Giovanni Conforti, Zhenjie Ren and Songbo Wang “Mean field optimization problem regularized by Fisher information” In arXiv preprint 2302.05938, 2023
- [Csi84] Imre Csiszár “Sanov property, generalized I-projection and a conditional limit theorem” In The Annals of Probability JSTOR, 1984, pp. 768–793
- [DKR22] Arnak S. Dalalyan, Avetik Karagulyan and Lionel Riou-Durand “Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets” In J. Mach. Learn. Res. 23, 2022, Paper No. 235, 38 pp.
- [Dia+23] Michael Z. Diao, Krishnakumar Balasubramanian, Sinho Chewi and Adil Salim “Forward-backward Gaussian variational inference via JKO in the Bures–Wasserstein space” In Proceedings of the 40th International Conference on Machine Learning 202, Proceedings of Machine Learning Research PMLR, 2023, pp. 7960–7991
- [DMM19] Alain Durmus, Szymon Majewski and Błażej Miasojedow “Analysis of Langevin Monte Carlo via convex optimization” In J. Mach. Learn. Res. 20, 2019, Paper No. 73, 46 pp.
- [FYC23] Jiaojiao Fan, Bo Yuan and Yongxin Chen “Improved dimension dependence of a proximal algorithm for sampling” In Proceedings of Thirty Sixth Conference on Learning Theory (COLT) 195, Proceedings of Machine Learning Research PMLR, 2023, pp. 1473–1521
- [FLO21] James Foster, Terry Lyons and Harald Oberhauser “The shifted ODE method for underdamped Langevin MCMC” In arXiv preprint 2101.03446, 2021
- [FW23] Qiang Fu and Ashia Wilson “Mean-field underdamped Langevin dynamics and its space-time discretization” In arXiv preprint arXiv:2312.16360, 2023
- [Fun84] Tadahisa Funaki “A certain class of diffusion processes associated with nonlinear parabolic equations” In Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 67.3 Springer, 1984, pp. 331–348
- [GLM22] Arnaud Guillin, Pierre Le Bris and Pierre Monmarché “Convergence rates for the Vlasov–Fokker–Planck equation and uniform in time propagation of chaos in non convex cases” In Electronic Journal of Probability 27 Institute of Mathematical Statistics and Bernoulli Society, 2022, pp. 1–44
- [GM21] Arnaud Guillin and Pierre Monmarché “Uniform long-time and propagation of chaos estimates for mean field kinetic particles in non-convex landscapes” In Journal of Statistical Physics 185 Springer, 2021, pp. 1–20
- [HBE20] Ye He, Krishnakumar Balasubramanian and Murat A. Erdogdu “On the ergodicity, bias and asymptotic normality of randomized midpoint sampling method” In Advances in Neural Information Processing Systems 33, 2020, pp. 7366–7376
- [HR23] Elias Hess-Childs and Keefer Rowan “Higher-order propagation of chaos in L² for interacting diffusions” In arXiv preprint arXiv:2310.09654, 2023
- [HS87] Richard Holley and Daniel Stroock “Logarithmic Sobolev inequalities and stochastic Ising models” In J. Statist. Phys. 46.5-6, 1987, pp. 1159–1194
- [JW17] Pierre-Emmanuel Jabin and Zhenfu Wang “Mean field limit for stochastic particle systems” In Active Particles, Volume 1: Advances in Theory, Models, and Applications Springer, 2017, pp. 379–402
- [JW18] Pierre-Emmanuel Jabin and Zhenfu Wang “Quantitative estimates of propagation of chaos for stochastic systems with kernels” In Invent. Math. 214.1, 2018, pp. 523–591
- [JCP23] Yiheng Jiang, Sinho Chewi and Aram-Alexandre Pooladian “Algorithms for mean-field variational inference via polyhedral optimization in the Wasserstein space” In arXiv preprint 2312.02849, 2023
- [JKO98] Richard Jordan, David Kinderlehrer and Felix Otto “The variational formulation of the Fokker–Planck equation” In SIAM J. Math. Anal. 29.1, 1998, pp. 1–17
- [KHK24] Mohammad Reza Karimi Jaghargh, Ya-Ping Hsieh and Andreas Krause “Stochastic Approximation Algorithms for Systems of Interacting Particles” In Advances in Neural Information Processing Systems 36, 2024
- [KMP24] Ksenia A. Khudiakova, Jan Maas and Francesco Pedrotti “L∞-optimal transport of anisotropic log-concave measures and exponential convergence in Fisher’s infinitesimal model” In arXiv preprint 2402.04151, 2024
- [KM12] Young-Heon Kim and Emanuel Milman “A generalization of Caffarelli’s contraction theorem via (reverse) heat flow” In Math. Ann. 354.3, 2012, pp. 827–862
- [Kuh+19] Daniel Kuhn, Peyman M. Esfahani, Viet Anh Nguyen and Soroosh Shafieezadeh-Abadeh “Wasserstein distributionally robust optimization: theory and applications in machine learning” In Operations Research & Management Science in the Age of Analytics, 2019, pp. 130–166
- [Lac22] Daniel Lacker “Quantitative approximate independence for continuous mean field Gibbs measures” In Electronic Journal of Probability 27 Institute of Mathematical Statistics and Bernoulli Society, 2022, pp. 1–21
- [Lac23] Daniel Lacker “Hierarchies, entropy, and quantitative propagation of chaos for mean field diffusions” In Probability and Mathematical Physics 4.2 Mathematical Sciences Publishers, 2023, pp. 377–432
- [Lac23a] Daniel Lacker “Independent projections of diffusions: gradient flows for variational inference and optimal mean field approximations” In arXiv preprint 2309.13332, 2023
- [LL23] Daniel Lacker and Luc Le Flem “Sharp uniform-in-time propagation of chaos” In Probability Theory and Related Fields Springer, 2023, pp. 1–38
- [Lam+22] Marc Lambert et al. “Variational inference via Wasserstein gradient flows” In Advances in Neural Information Processing Systems, 2022
- [LL07] Jean-Michel Lasry and Pierre-Louis Lions “Mean field games” In Jpn. J. Math. 2.1, 2007, pp. 229–260
- [LST21] Yin Tat Lee, Ruoqi Shen and Kevin Tian “Structured logconcave sampling with a restricted Gaussian oracle” In Proceedings of Thirty Fourth Conference on Learning Theory 134, Proceedings of Machine Learning Research PMLR, 2021, pp. 2993–3050
- [Li+23] Yun Li et al. “Strong convergence of Euler–Maruyama schemes for McKean–Vlasov stochastic differential equations under local Lipschitz conditions of state variables” In IMA Journal of Numerical Analysis 43.2 Oxford University Press, 2023, pp. 1001–1035
- [LW16] Qiang Liu and Dilin Wang “Stein variational gradient descent: a general purpose Bayesian inference algorithm” In Advances in Neural Information Processing Systems 29 Curran Associates, Inc., 2016
- [Mal01] Florent Malrieu “Logarithmic Sobolev inequalities for some nonlinear PDE’s” In Stochastic Process. Appl. 95.1, 2001, pp. 109–132
- [Mal03] Florent Malrieu “Convergence to equilibrium for granular media equations and their Euler schemes” In The Annals of Applied Probability 13.2 Institute of Mathematical Statistics, 2003, pp. 540–560
- [McK66] Henry P. McKean “A class of Markov processes associated with nonlinear parabolic equations” In Proc. Nat. Acad. Sci. U.S.A. 56, 1966, pp. 1907–1911
- [MMN18] Song Mei, Andrea Montanari and Phan-Minh Nguyen “A mean field view of the landscape of two-layer neural networks” In Proceedings of the National Academy of Sciences 115.33 National Acad Sciences, 2018, pp. E7665–E7671
- [Mél96] Sylvie Méléard “Asymptotic behaviour of some interacting particle systems; McKean–Vlasov and Boltzmann models” In Probabilistic Models for Nonlinear Partial Differential Equations: Lectures given at the 1st Session of the Centro Internazionale Matematico Estivo (C.I.M.E.) held in Montecatini Terme, Italy, May 22–30, 1995 Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 42–95
- [Mon17] Pierre Monmarché “Long-time behaviour and propagation of chaos for mean field kinetic particles” In Stochastic Processes and their Applications 127.6 Elsevier, 2017, pp. 1721–1737
- [MRW24] Pierre Monmarché, Zhenjie Ren and Songbo Wang “Time-uniform log-Sobolev inequalities and applications to propagation of chaos” In arXiv preprint arXiv:2401.07966, 2024
- [NWS22] Atsushi Nitanda, Denny Wu and Taiji Suzuki “Convex analysis of the mean field Langevin dynamics” In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022, pp. 9741–9757 PMLR
- [Ott01] Felix Otto “The geometry of dissipative evolution equations: the porous medium equation” In Comm. Partial Differential Equations 26.1-2, 2001, pp. 101–174
- [OR07] Felix Otto and Maria G. Reznikoff “A new criterion for the logarithmic Sobolev inequality and two applications” In Journal of Functional Analysis 243.1 Elsevier, 2007, pp. 121–157
- [OV00] Felix Otto and Cédric Villani “Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality” In Journal of Functional Analysis 173.2 Elsevier, 2000, pp. 361–400
- [RES22] Gonçalo Reis, Stefan Engelhardt and Greig Smith “Simulation of McKean–Vlasov SDEs with super-linear growth” In IMA Journal of Numerical Analysis 42.1 Oxford University Press, 2022, pp. 874–922
- [RV22] Grant M. Rotskoff and Eric Vanden-Eijnden “Trainability and accuracy of artificial neural networks: an interacting particle system approach” In Comm. Pure Appl. Math. 75.9, 2022, pp. 1889–1935
- [SL19] Ruoqi Shen and Yin Tat Lee “The randomized midpoint method for log-concave sampling” In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019
- [SS20] Justin Sirignano and Konstantinos Spiliopoulos “Mean field analysis of neural networks: a law of large numbers” In SIAM Journal on Applied Mathematics 80.2 SIAM, 2020, pp. 725–752
- [SNW22] Taiji Suzuki, Atsushi Nitanda and Denny Wu “Uniform-in-time propagation of chaos for the mean-field gradient Langevin dynamics” In The Eleventh International Conference on Learning Representations (ICLR), 2022
- [SWN23] Taiji Suzuki, Denny Wu and Atsushi Nitanda “Convergence of mean-field Langevin dynamics: time and space discretization, stochastic gradient, and variance reduction” In Advances in Neural Information Processing Systems (NeurIPS), 2023
- [Szn91] Alain-Sol Sznitman “Topics in propagation of chaos” In École d’Été de Probabilités de Saint-Flour XIX—1989 1464, Lecture Notes in Math. Springer, Berlin, 1991, pp. 165–251
- [Tal96] Denis Talay “Probabilistic numerical methods for partial differential equations: elements of analysis” In Probabilistic models for nonlinear partial differential equations (Montecatini Terme, 1995) 1627, Lecture Notes in Math. Springer, Berlin, 1996, pp. 148–196
- [VW19] Santosh Vempala and Andre Wibisono “Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices” In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019
- [Vil02] Cédric Villani “A review of mathematical topics in collisional kinetic theory” In Handbook of mathematical fluid dynamics, Vol. I North-Holland, Amsterdam, 2002, pp. 71–305
- [Vil03] Cédric Villani “Topics in optimal transportation” 58, Graduate Studies in Mathematics American Mathematical Society, Providence, RI, 2003, pp. xvi+370
- [Wib18] Andre Wibisono “Sampling as optimization in the space of measures: the Langevin dynamics as a composite optimization problem” In Proceedings of the 31st Conference on Learning Theory 75, Proceedings of Machine Learning Research PMLR, 2018, pp. 2093–3027
- [WSC22] Keru Wu, Scott Schmidler and Yuansi Chen “Minimax mixing time of the Metropolis-adjusted Langevin algorithm for log-concave sampling” In The Journal of Machine Learning Research (JMLR) 23.1 JMLR.org, 2022, pp. 12348–12410
- [YY23] Rentian Yao and Yun Yang “Mean-field variational inference via Wasserstein gradient flow” In arXiv preprint 2207.08074, 2023
- [YKW22] Man-Chung Yue, Daniel Kuhn and Wolfram Wiesemann “On linear optimization over Wasserstein balls” In Math. Program. 195.1-2, 2022, pp. 1107–1122
- [Zha+23] Matthew S. Zhang et al. “Improved discretization analysis for underdamped Langevin Monte Carlo” In Conference on Learning Theory (COLT), 2023, pp. 36–71 PMLR
Appendix A Control of the Finite-Particle Error
In this section, we prove the results in §3.1 on the finite-particle error. We will make extensive use of the following transport inequality, which arises as a consequence of (LSI).
Lemma 14 (Talagrand’s Transport Inequality, [OV00]).
If a measure satisfies (LSI) with constant , then for all measures ,
(TI)
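For the reader's convenience, we record (TI) in one standard normalization; the constant conventions here may differ from ours by absolute factors:

```latex
% Talagrand's transport inequality (Otto--Villani), in the normalization
% where (LSI) with constant C_{\mathsf{LSI}} means the entropy bound below:
\text{if } \operatorname{Ent}_\pi(f^2) \le 2\, C_{\mathsf{LSI}} \, \mathbb{E}_\pi\bigl[\|\nabla f\|^2\bigr]
\ \text{ for all smooth } f,
\quad \text{then} \quad
W_2^2(\mu, \pi) \le 2\, C_{\mathsf{LSI}} \, \mathsf{KL}(\mu \,\|\, \pi)
\ \text{ for all } \mu .
```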
A.1 LSI Case
We provide the proof of Theorem 3 under the assumption of (LSI) for the invariant measures of and . This relies on a BBGKY hierarchy based on the arguments of [LL23].
Recall that is the -particle distribution of the finite-particle system. Explicitly,
Using exchangeability, we can then compute the gradient of the potential for this measure as
Let and introduce the notation
Invoking (LSI) of the mean-field invariant measure (and tensorizing) leads to
where the last line follows from exchangeability and for vectors .
A.1.1 Bounding the Error Terms
We now handle terms separately.
where we used the exchangeability of the particles in (i) and the smoothness of in (ii). Here, and are independent.
Let us deal with these two terms separately. For the first term, let be optimally coupled with . Then, by independence and sub-Gaussian concentration (implied by (LSI)),
(A.1)
where the second inequality follows from (TI), and the last one follows from the data-processing inequality for the divergence. For the second term, the Cauchy–Schwarz inequality leads to
(A.2)
where in (i) we applied the bound (A.1) as well as (TI), and in (ii) we used the chain rule for the divergence.
We return to the analysis of the term . In a similar way, we obtain
A.1.2 Induction
Putting our bounds on and together, we obtain for ,
(A.3)
In particular, the case of involves our bounds only on , leading to
By grouping together the terms in (A.3),
(A.4)
Iterating this inequality down to , for ,
Now we show , which implies . We require the following lemma.
Lemma 15.
For ,
Proof. For , we have
As the summand is decreasing in , it follows that
Therefore,
which proves the lemma. ∎
Using Lemma 15, we obtain
Under Assumption 3, i.e. , we may assume since we can always take a worse bound on the constants so that . As seen shortly, the rate does not improve even if . (Alternatively, one can show the bound in Lemma 15 decreases in , so we can just substitute therein.) For and , we therefore obtain
and thus
(A.5)
A.1.3 Bootstrapping
Substituting the bound (A.5) for into the recursive inequality (A.4), we end up with a suboptimal rate of for . To improve the bound, we substitute our established bound (A.5) into (A.2), which results in an improved recursive inequality. Indeed,
and therefore
For this yields
Regrouping as before, we obtain
Iterating this down to ,
where in (i) we used Lemma 15 with , and (ii) follows from and . Therefore, for some fixed it suffices to take to achieve -bias in , completing the proof of Theorem 3.
A.2 Strongly Convex Case
The following propagation of chaos argument for the strongly log-concave case is based on [Szn91]. Let denote the stochastic process following the finite-particle stochastic differential equation (). Let the corresponding semigroup be denoted , defined as follows. For any test function ,
Then, the following simple lemma proves Wasserstein contraction for the finite-particle system.
Lemma 16.
Under Assumption 4 and for , is a contraction in the -Wasserstein distance with exponential rate at least , where . In other words, for any measures , in ,
Proof. Note that corresponds to the time-scaled (by factor ) Langevin diffusion with stationary distribution , which is -strongly log-concave by Lemma 7. The condition on ensures that this is at least . Consequently, it is well-known (e.g., via synchronous coupling) that the diffusion is a contraction in the Wasserstein distance with rate at least . ∎
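The synchronous-coupling argument behind Lemma 16 can be illustrated numerically: two copies of the finite-particle system driven by the same Brownian increments contract toward each other at the strong-convexity rate. The quadratic potentials below are hypothetical stand-ins chosen only to make the sketch self-contained.

```python
import numpy as np

def drift(X, eps=0.1):
    """Drift of the finite-particle Langevin system for the toy choice
    V(x) = |x|^2 / 2 and pairwise interaction W(x - y) = (eps/2) |x - y|^2."""
    return -X - eps * (X - X.mean(axis=0))

def synchronous_coupling(X, Y, T=1.0, h=1e-3, eps=0.1, seed=0):
    """Run two copies of the particle system with shared Gaussian increments
    (Euler--Maruyama discretization of the coupled diffusions)."""
    rng = np.random.default_rng(seed)
    for _ in range(int(T / h)):
        xi = np.sqrt(2 * h) * rng.standard_normal(X.shape)  # shared noise
        X = X + h * drift(X, eps) + xi
        Y = Y + h * drift(Y, eps) + xi
    return X, Y
```

Since the shared noise cancels in the difference process, the distance between the two copies decays at roughly the strong-convexity rate, mirroring the Wasserstein contraction asserted by the lemma.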
We next bound the error incurred in one step from applying the finite-particle semigroup to .
Proof. We resort to a coupling argument, noting that is stationary under (). Starting with , we evolve and according to () and () respectively, i.e. and . This argument is adapted from the original propagation of chaos proof by [Szn91].
We can compute the evolution under a synchronous coupling as:
Now let us denote by the centered gradient (with respect to ). By Itô’s formula and Assumption 4,
or
Integrating and squaring,
where the last line follows from Young’s inequality.
Next, we take expectations. Note that is centered in its second variable, so for any ,
Otherwise, we can bound the terms via
Here, is an independent draw from and so cannot be reduced via coupling. The second inequality follows from a standard bound on the centered second moment of a strongly log-concave measure, using the fact that is -strongly log-concave [cf. DKR22].
Therefore, taking expectations and summing over the particles,
By Grönwall’s inequality below,
This concludes the proof. ∎
Lemma 18 (Grönwall’s Inequality).
For , let be bounded. Suppose that the following holds pointwise for some functions , where is increasing:
Then,
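For the reader's convenience, the integral form of Grönwall's inequality used here is, in standard notation:

```latex
% If u is bounded on [0, T], b \ge 0, a is increasing, and
u(t) \le a(t) + \int_0^t b(s)\, u(s)\, \mathrm{d}s
\quad \text{for all } t \in [0, T],
% then
u(t) \le a(t) \exp\Bigl(\int_0^t b(s)\, \mathrm{d}s\Bigr).
```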
Proof of Theorem 4. Indeed, we have
Rearranging,
Let first and then to obtain
Finally, when , we use exchangeability (see Lemma 27 below) to conclude the proof of (3.2).
For (3.3), by the Bakry–Émery condition we have , and tensorization [cf. BGL14, Proposition 5.2.7] leads to . Thus, (TI) leads to
However, one notes that the density of is log-smooth with parameter (Lemma 7). Likewise, is log-smooth with parameter . Now consider a functional on the space of probability measures on given by . Note that is smooth with parameter at most , for .
Next, note that for ,
by using exchangeability and the definition of .
A.3 General Functional Case
For any measure , define its entropy as . We now provide a self-contained propagation of chaos argument in the general McKean–Vlasov setting, following [CRW22]. We begin with the following entropy toast inequality, i.e. half of the entropy sandwich inequality from [CRW22].
Lemma 19 (Entropy Toast Inequality).
Define the empirical total energy for an -finite particle system as follows. Given a measure ,
Proof of Theorem 5. We bound via the following argument. First, define the finite-particle mean-field functional as . In the sequel, we also use the following notation for conditional measures: if ,
We know that
Furthermore, by Assumption 5,
Using the subadditivity of entropy, we can therefore write
To decouple the terms, we now replace each term with :
We consider the two terms in turn, beginning with the first.
Note that by Fubini’s theorem,
In order to relate the first term to a KL divergence, for each we introduce the probability measure via
We can compute
where is the normalization constant for ,
Upon taking expectations, we obtain
Moreover, we can recognize that is a proximal Gibbs measure. By Assumptions 6 and 7,
To transport the mass from to , we take the transport plan which moves of the mass from to each , . It yields
(A.6)
Hence,
We then use the inequality
(A.7)
where we used Lemma 27 and the Poincaré inequality for . Hence,
Next, we turn toward term . First, define a function by
It is clear from Assumption 6 that this function is Lipschitz with constant . Thus, we obtain using this Lipschitzness, (A.6), and Young’s inequality,
For the first term, we can apply (A.7), and for the second term, we can apply (A.1). It yields
Putting the bounds together with Lemma 19,
for all .
The result for follows from Lemma 28.
∎
Appendix B Isoperimetric Results for the Stationary Distributions
B.1 Convexity and Smoothness
Here, we verify the convexity and smoothness properties of in the pairwise McKean–Vlasov setting.
Proof of Lemma 7. For , the Hessian of can be explicitly computed as
Clearly, the first block matrix has eigenvalues between and . For the second block matrix , let us denote for . Note that since is even, and each is clearly symmetric.
For the second matrix and , we have
Using and
we have .
Since the circulant matrix in is PSD due to diagonal dominance and its largest eigenvalue is at most , it follows that the eigenvalues of lie between and .
Hence, the eigenvalues of lie in the interval .
∎
B.2 Bounded Perturbations
In this section, we prove the isoperimetric results from §3.2.1. We again introduce the conditional measure: if we define
for the conditional distribution of the -th particle and the distribution of an -particle system with the -th particle marginalized out.
Before proceeding to the proof of Lemma 8, we first state a result on log-Sobolev inequalities under weak interactions due to [OR07].
Lemma 20 ([OR07, Theorem 1]).
Consider a measure on , with conditional measures . Assume that
Then, consider the matrix with entries , for . If , then satisfies (LSI) with constant .
Proof of Lemma 8. We begin by proving the statement about . The potential of the invariant measure can also be written as
This is the sum of a -convex function with a -bounded perturbation. Thus, satisfies (LSI) with the claimed parameter.
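For the reader's convenience, the two perturbation principles invoked throughout this proof read as follows in a standard normalization (the constants may differ from ours by absolute factors):

```latex
% Bakry--\'Emery: if \pi \propto e^{-U} with \nabla^2 U \succeq \alpha I,
% \alpha > 0, then \pi satisfies (LSI) with constant 1/\alpha.
% Holley--Stroock: if \mathrm{d}\mu \propto e^{-h}\,\mathrm{d}\pi with
% \operatorname{osc}(h) := \sup h - \inf h < \infty, and \pi satisfies (LSI),
% then so does \mu, with
C_{\mathsf{LSI}}(\mu) \le e^{\operatorname{osc}(h)}\, C_{\mathsf{LSI}}(\pi),
\qquad
C_{\mathsf{LSI}}(\pi) \le \frac{1}{\alpha}.
```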
We now prove the statement about . Each conditional measure has a density of the form
where both and are bounded perturbations of , -strongly convex functions respectively, irrespective of the conditional variables. Thus, by Holley–Stroock perturbation and the Bakry–Émery condition, each satisfies (LSI) with parameter
Secondly, we note that from (2.1), we have
Thus, we have
Under Assumption 10, this matrix is strictly diagonally dominant and has a minimum eigenvalue of at least . We can now apply Lemma 20 to complete the proof.
∎
B.3 Logarithmic Sobolev Inequalities via Perturbations
In this section, we state log-Sobolev inequalities for Lipschitz perturbations of strongly log-concave measures, which is used for the general McKean–Vlasov setting in §2.1.2.
Lemma 21 (LSI under Lipschitz Perturbations [BP24, Theorem 1.4]).
Let for an -strongly convex function and an -Lipschitz function . Then, satisfies (LSI) with constant given by
From this one derives the following log-Sobolev inequality for the proximal Gibbs measure.
Lemma 22 (Uniform LSI for the Proximal Gibbs Measure [SWN23, Theorem 1]).
Obtaining a uniform-in-N LSI for under Assumption 8 is more difficult, and we rely on the recent heat flow estimates of [BP24]. In their work, the authors showed the existence of an -Lipschitz transport map—the Kim–Milman map [KM12]—from the standard Gaussian measure to a measure , under suitable assumptions on . By [BGL14, Proposition 5.4.3], this immediately implies that satisfies . The existence of the Lipschitz transport map is based on estimates along the heat flow, and we summarize the computation in a convenient form based on bounding the operator norm of the covariance matrix of Gaussian tilts of the measure. The latter property is sometimes called tilt stability in the literature. Note that we do not attempt to optimize constants here.
Lemma 23 (Lipschitz Transport Maps via Reverse OU).
Let be a probability measure over and for any , , let denote the Gaussian tilt,
(B.1)
where we assume that this defines a valid probability measure for all and . Suppose there exist such that the following “tilt stability” property holds:
Then, there exists an -Lipschitz transport map such that , where is the standard Gaussian measure and can be estimated by
Proof. We follow the calculations of [BP24]. Let denote the heat semigroup, and the Ornstein–Uhlenbeck semigroup. Then, if denotes the standard Gaussian measure, the identity and
where . Note that we can identify this with the bound of [BP24, Corollary 3.2] with . Thus, by following the calculations in the proof of [BP24, Theorem 1.4], we obtain the desired result. ∎
We next verify the tilt stability condition, leveraging the propagation of chaos result in Theorem 5.
Lemma 24 (Tilt Stability).
Proof. We introduce the auxiliary measures
(B.3)
where . To see that these auxiliary measures are well-defined, note that any minimizer of the functional
satisfies the system of equations (B.3), and that the minimizer is unique because the functional is strictly convex. We also let .
Then, for any unit vector ,
where we used the fact that is a product measure. Also, introduce
so that . The same argument as above then yields
Since is -strongly log-concave, by the Brascamp–Lieb inequality [BL76]. Also, since is -Lipschitz under Assumption 8, then [KMP24, Corollary 2.4] yields
Hence,
Finally, it remains to control . Note that when , this essentially reduces to Theorem 5, so the task is to prove a generalization thereof. Note that by Lemma 21, in Theorem 26 below, we can take
In particular, under the assumption (B.2) for , the preconditions of Theorem 26 are met and it yields the bound
Putting everything together completes the proof. ∎
Corollary 25 (Uniform LSI for the Stationary Measure).
Proof. To meet the conditions of Lemma 24, we perform a rescaling. We abuse notation and denote by the scaling map . Then,
where . We see that is also the stationary measure for mean-field Langevin dynamics, with the following new parameters: ; ; . In particular, if we take , then Lemma 24 applies to provided that (B.4) holds. Together with Lemma 23 with and , it implies
The result for follows from contraction mapping [BGL14, Proposition 5.4.3]. ∎
It remains to prove the following generalized propagation of chaos result.
Theorem 26 (Generalized Propagation of Chaos).
Proof. Since the proof closely follows the proof of Theorem 5, to avoid repetition we only highlight the main changes. We define the following energy functionals:
where for a measure on , we use the notation for the average of the marginals. The first step is to establish the analogue of the entropy toast inequality (Lemma 19). Letting denote the normalizing constant for ,
From here, we find that
Now, moving on to the propagation of chaos part of this argument, we have
To decouple, introduce a new variable independent of all the others and define as a shorthand for the vector .
We group the terms in the first two lines as , and the terms in the last two lines as .
Let us first look at . If we introduce the proximal Gibbs measure for (in the -th coordinate), defined via
one obtains as before
Applying the log-Sobolev inequality, it yields
The Wasserstein distance is bounded by
It eventually yields, as before,
As for the term , a straightforward modification of the proof of Theorem 5 readily yields
Putting everything together yields the result. ∎
Appendix C Explicit Calculations for the Gaussian Case
Here we provide complete details for Example 10: for any ,
Note that for with and if ,
The -particle marginal is a Gaussian with zero mean and covariance being the upper-left -block matrix of , which we denote by . Clearly, is also a Gaussian with zero mean and covariance . From a well-known formula for the divergence between two Gaussian distributions,
(C.1)
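For the reader's convenience, the standard formula for the KL divergence between centered Gaussians in dimension k is:

```latex
\mathsf{KL}\bigl(\mathcal{N}(0, \Sigma_0) \,\big\|\, \mathcal{N}(0, \Sigma_1)\bigr)
= \frac{1}{2} \Bigl( \operatorname{tr}(\Sigma_1^{-1} \Sigma_0) - k
+ \ln \frac{\det \Sigma_1}{\det \Sigma_0} \Bigr).
```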
Let be the -dimensional vector with all entries . From ,
where in (i) we used , (ii) follows from the Woodbury matrix identity, and (iii) used . Hence it follows that
By the spectral decomposition of , we can write for a diagonal and an orthogonal matrix such that correspond to the eigenvalues of . Since and in (C.1) are orthogonally invariant, let us look at the orthogonal conjugate of by . Using and denoting ,
For , , and , we have and due to
Since the eigenvalues of consist of all possible combinations arising from the product of eigenvalues, one from and one from , the largest eigenvalue of is less than :
Denoting the eigenvalues of by , it follows from (C.1) that
Then, we have a trivial lower bound of , and as for the upper bound,
where the last inequality follows from .
Using , and and , we have
As for the lower bound, since for large , we have
which completes the proof.
Appendix D Additional Technical Lemmas
In our proofs, we used the following general lemmas on exchangeability.
Lemma 27.
Let , be two exchangeable measures over . For any ,
Proof. Let be optimally coupled for and . For each subset of size , it induces a coupling of and (by exchangeability). In particular, the law of , where is an independent and uniformly random subset of size , is also a coupling of and . Hence,
which completes the proof. ∎
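In notation we introduce only for this remark (writing $\mu^{1:k}$, $\nu^{1:k}$ for the $k$-particle marginals of exchangeable measures $\mu, \nu$ on the $N$-fold product space), the coupling argument above yields the standard marginal bound:

```latex
W_2^2(\mu^{1:k}, \nu^{1:k})
  \;\le\; \mathbb{E}\,\lVert X_S - Y_S \rVert^2
  \;=\; \frac{k}{N}\,\mathbb{E}\,\lVert X - Y \rVert^2
  \;=\; \frac{k}{N}\, W_2^2(\mu, \nu)\,,
```

where $(X, Y)$ is the optimal coupling and $S$ the uniformly random subset of size $k$ from the proof; the middle equality uses exchangeability of the coordinates.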
Lemma 28 (Information Inequality [Csi84]).
If are Polish spaces and , are probability measures on , where is a product measure, then for the marginals of , it holds that
In particular when , are both exchangeable, this states that .
Note that Lemma 28 follows from the chain rule and convexity of the divergence.
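To spell out this one-line derivation: writing $\nu = \bigotimes_{i=1}^n \nu_i$ for the product measure and $\mu_i$ for the $i$-th marginal of $\mu$ (generic notation for this remark), the chain rule for the KL divergence and its convexity in the first argument give

```latex
\mathsf{KL}(\mu \,\|\, \nu)
  \;=\; \sum_{i=1}^n \mathbb{E}_{x_{<i} \sim \mu}\,
      \mathsf{KL}\bigl(\mu(\cdot \mid x_{<i}) \,\big\|\, \nu_i\bigr)
  \;\ge\; \sum_{i=1}^n \mathsf{KL}(\mu_i \,\|\, \nu_i)\,,
```

since $\mu_i = \mathbb{E}_{x_{<i} \sim \mu}\,\mu(\cdot \mid x_{<i})$ and the product structure of $\nu$ makes each conditional of $\nu$ equal to $\nu_i$.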
Appendix E Sampling Guarantees
Here, we show how to obtain the claimed rates in §4. We begin with some preliminary facts.
divergence guarantees.
To obtain our guarantees in divergence, we use the following lemma.
Guarantees using the sharp propagation of chaos bound.
Here, we impose Assumptions 1, 2, 3, and 9. It follows from Theorem 3 that suffices in order to make . For the first term, we use exchangeability (Lemma 27) to argue that
and hence we invoke sampling guarantees to ensure that under (LSI).
• LMC: We use the guarantee for Langevin Monte Carlo from [VW19].
• MALA–PS: We use the guarantee for the Metropolis-adjusted Langevin algorithm together with the proximal sampler from [AC23]. Note that the iteration complexity is , and we substitute in the chosen value for .
• ULMC–PS: Here, we use underdamped Langevin Monte Carlo to implement the proximal sampler. To justify the sampling guarantee, note that since is -smooth, if we choose step size for the proximal sampler, then the RGO is -strongly log-concave and -log-smooth. According to [AC23, Proof of Theorem 5.3], it suffices to implement the RGO in each iteration to accuracy in . Then, from [Zha+23], this can be done via ULMC with complexity . Finally, since the number of outer iterations of the proximal sampler is , we obtain the claim.
Guarantees under strong displacement convexity.
Here, we impose Assumptions 1 and 4. As discussed above, to obtain guarantees, we require log-concave samplers that can achieve . For guarantees, by Theorem 4 we take and we require log-concave samplers that can achieve .
• ULMC: For underdamped Langevin Monte Carlo, we use the guarantee from [AC23].
Guarantees in the general McKean–Vlasov setting.
In the setting of Theorem 5, we take . We use the same sampling guarantees under (LSI) as in the prior settings.
We also note that in order to apply the log-concave sampling guarantees, we must check that is log-smooth. This follows from Assumption 6. Indeed,