
Elucidating Flow Matching ODE Dynamics with Respect to Data Geometries

Zhengchao Wan (co-first author), zwan@missouri.edu, Department of Mathematics, University of Missouri; Qingsong Wang (co-first author), qswang92@gmail.com, Halıcıoğlu Data Science Institute, University of California San Diego; Gal Mishne, gmishne@ucsd.edu, Halıcıoğlu Data Science Institute, University of California San Diego; Yusu Wang, yusuwang@ucsd.edu, Halıcıoğlu Data Science Institute, University of California San Diego
Abstract

Diffusion-based generative models have become the standard for image generation. ODE-based samplers and flow matching models improve efficiency over diffusion models by reducing the number of sampling steps through learned vector fields. However, the theoretical foundations of flow matching models remain limited, particularly regarding the convergence of individual sample trajectories at terminal time, a property that directly impacts sample quality and is a key assumption for models such as the consistency model. In this paper, we advance the theory of flow matching models through a comprehensive analysis of sample trajectories, centered on the denoiser that drives the ODE dynamics. We establish the existence, uniqueness, and convergence of ODE trajectories at terminal time, ensuring stable sampling outcomes under minimal assumptions. Our analysis reveals how trajectories evolve from capturing global data features to local structures, providing a geometric characterization of per-sample behavior in flow matching models. We also explain the memorization phenomenon in diffusion-based training through our terminal time analysis. These findings bridge critical gaps in understanding flow matching models, with practical implications for sampling stability and model design.

1 Introduction

Diffusion-based generative models have become the de facto standard for image generation [SDWMG15, HJA20, SE19]. Compared to previous generative models (e.g., GANs [GPAM+14]), diffusion models are easier to train but suffer from long sampling times due to the sequential nature of the sampling process. To address this limitation, (deterministic) ODE-based samplers were introduced, where sampling is done by integrating an ODE trajectory. This approach has been shown to be more efficient than traditional sampling, with a significantly reduced number of sampling steps [SME21, LZB+22, KAAL22]. Recently, a new class of models, known as flow matching models [LCBH+22, LGL23], has been developed to use an ODE flow map to interpolate between a prior and a target data distribution, generalizing diffusion models with ODE samplers. By learning a vector field u_{t}, a new sample x_{1} can be generated by integrating the ODE below from some initial random sample x_{0}:

\frac{dx_{t}}{dt}=u_{t}(x_{t}),\quad t\in[0,1].

Various versions of the flow matching model have gained popularity, such as the rectified flow model [LGL23], which is used in recent commercial image generation software [EKB+24]. Furthermore, the succinct and deterministic formulation of the flow matching model also makes theoretical analysis potentially easier.

However, despite their empirical success, the theoretical understanding of flow models remains incomplete, even for the well-posedness of the ODE trajectories, i.e., existence, uniqueness, and convergence as t\to 1. These questions are not thoroughly addressed in the pioneering papers [LCBH+22, LGL23] and are only partially tackled in recent studies [GHJ24, LHH+24]. Specifically, in [LHH+24], well-posedness is established only on the open interval [0,1). Recent work [GHJ24] extends well-posedness to [0,1], but under restrictive assumptions on the data distribution, excluding cases where data is supported on low-dimensional manifolds, an arguably crucial scenario in generative modeling [LGRH+24]. In both works, the challenge arises from the singularity in the vector field u_{t} as the terminal time t=1 is approached, influenced by both the ODE formulation and the data geometry. The convergence of ODE trajectories lies within the broader context of sample trajectory analysis, which is less explored in the literature, as most work focuses on distribution-level analysis [BDD23, LC24].

(a) An ODE trajectory that winds towards the data distribution without converging as t\to 1.
(b) An ODE trajectory smoothly converging to the data distribution as t\to 1.
Figure 1: Comparing two ODE trajectory behaviors. We show two types of ODE trajectories for the same data distribution supported on the unit circle: (a) winding instability, and (b) smooth convergence. Our convergence analysis at t=1 (cf. Theorem 5.3) guarantees that flow matching ODE trajectories converge smoothly onto the data distribution, ensuring that the undesirable winding phenomenon does not occur.

We also highlight that pushing well-posedness to t=1 and performing sample trajectory analysis is not only theoretically interesting but also of practical importance. Suppose we know that the ODE trajectory exists on [0,1) and that the intermediate distribution following the ODE flow gets closer to the data distribution as t increases. If the ODE trajectory does not converge at t=1, however, the final sample obtained by following a single ODE trajectory can be very sensitive to the chosen ODE discretization scheme: a slight change in the sampling schedule can result in a drastically different outcome. This can happen if individual ODE trajectories wind towards the data distribution but never converge at t=1, as shown in Figure 1(a). A more desirable behavior is that of Figure 1(b), where the ODE trajectory converges to the data distribution as t\to 1. Empirical studies find that ODE trajectories tend to stabilize as t\to 1, consistent with Figure 1(b), and this observation has been used to guide the choice of ODE discretization scheme [KAAL22, EKB+24]. Furthermore, the convergence of ODE trajectories at t=1 is a critical assumption for the consistency model [SDCS23], which utilizes the limit of the ODE trajectory to design a one-step diffusion model. These considerations motivate us to investigate the convergence of ODE trajectories at t=1 and to perform per-sample trajectory analysis.

In this paper, our approach focuses on the denoiser, a key component of flow matching models. The denoiser, defined as the mean of the posterior distribution of the data given a noisy observation, governs the evolution of ODE trajectories. Initially, the posterior distribution is broad, and the denoiser approximates the global mean of the data. As time progresses, the posterior measure becomes more concentrated around nearby data points, and the denoiser captures finer details of the data distribution.

Figure 2: Illustration of the three stages of a flow matching ODE trajectory. The three colored regions (yellow, blue, and pink) represent three local clusters in the support of the data distribution. The trajectory first moves towards the mean (brown point) of the data distribution (cf. Proposition 6.3; note how the initial vector points towards the data mean) and is absorbed by the convex hull of the data distribution (cf. Proposition 6.4), shown as the shaded region. Later, it is attracted by a local cluster of the data distribution (cf. Proposition 6.9) and is guaranteed to converge onto the data support (green point) at terminal time (cf. Theorem 5.3).

We show that when approaching the terminal time, the posterior distribution concentrates on the nearest data points, and the denoiser converges to the projection of the current sample onto the data support. This extends previous partial results [PY24, SBDS24] to a general setting beyond a small neighborhood of the data support. By carefully leveraging this denoiser convergence and integrating it with the analysis of the ODE dynamics, we find that near the terminal time, the ODE trajectories stay in a region away from the singularities and converge to the data support. As a result, we establish the convergence of the entire sampling process with minimal assumptions and validate that the ODE trajectories in flow matching models are well-behaved (as in Figure 1(b)).

Notably, the terminal time analysis of the denoiser provides a new perspective on the memorization phenomenon observed in diffusion-based training, where models merely repeat the training samples [CHN+23, WLCL24] rather than generate new unseen data points. We identify that the denoiser's terminal time convergence to the data support precisely leads to this memorization behavior. In particular, if a neural-network denoiser is trained to (asymptotic) optimality on empirical data, it can only sample memorized training data.

Finally, our analysis significantly strengthens the understanding of flow matching ODE dynamics, extending beyond the terminal time. By carefully examining the progressive concentration of the posterior distribution, we derive comprehensive insights into the behavior of ODE trajectories. Specifically, we establish a unified result demonstrating that trajectories are systematically attracted to, and ultimately absorbed by, sets associated with the data distribution. Initially, sample trajectories move towards the mean of the data distribution and are absorbed into the convex hull of the overall data distribution, capturing its global structure. As the process evolves, they are increasingly drawn towards local clusters within the data, reflecting a transition from global to local convergence. Ultimately, the trajectories converge to the data support at the terminal time, providing a geometric characterization of how samples naturally focus on local features; see Figure 2 for an illustration. In this way, we provide a per-sample trajectory analysis that complements the empirical, distribution-level discussions of flow matching/diffusion model dynamics [BBDBM24, LC24].

[Figure 3 diagram. Theoretical foundations: well-posedness on [0,1) (Thm. 5.1); absorbing and attracting via the denoiser (Thm. 3.6, 3.8, 3.11, Prop. 3.9); concentration of the posterior (Thm. 4.2, 4.9, Prop. 4.11). ODE trajectory stages: initial stage (Prop. 6.3, 6.4, Cor. 6.6, 6.7); intermediate stage (Prop. 6.9); terminal stage with well-posedness at 1 (Thm. 5.3, 5.4, Prop. 6.12). Equivariance of flow maps (Thm. 5.7, 5.8).]
Figure 3: Illustration of main theoretical results. Our findings are divided into two main parts: the first focuses on the theoretical foundations of flow matching ODEs, emphasizing properties of denoisers and posterior distributions; the second builds on these insights to characterize the three stages of flow matching ODE trajectories. In the first part, we highlight that the well-posedness result serves as the cornerstone for understanding the absorbing and attracting properties of denoisers. Furthermore, leveraging the well-posedness at t=1, we establish the existence of the flow map \Psi_{t} at t=1, which enables us to derive key properties of flow maps.
Contributions and Organization.

In summary, we provide a comprehensive theoretical analysis of flow matching models by examining the behavior of denoisers and posterior distributions. We establish the convergence of ODE trajectories and characterize their geometric evolution, bridging key gaps in the theoretical understanding of these models. Our results address critical challenges related to terminal time singularities and offer new insights into the memorization phenomenon in diffusion-based training. The geometric perspective we develop—showing how trajectories evolve from global to local features—provides a unified framework for understanding flow matching dynamics, laying a foundation for future theoretical developments. The paper is structured as follows, and Figure 3 provides an illustration of the main theoretical results and their inter-dependencies.

  • In Section 2, we introduce the background on flow matching models, their training objectives, and the unification of different scheduling functions via the noise-to-signal ratio \sigma.

  • In Section 3, we discuss key properties of the denoiser and highlight its role as the guiding component of the flow matching ODE dynamics, illustrating the terminal time singularity and general attracting and absorbing properties. These will be fundamental tools for our theoretical analysis.

  • In Section 4, we study the concentration and convergence of the posterior distribution, establishing its initial closeness to the data distribution and its convergence to the data support at the terminal time. We also derive specific convergence rates under different data geometries.

  • In Section 5, we present the well-posedness of ODE trajectories, first on [0,1) and then extended to [0,1] with minimal assumptions on the data distribution. We then obtain refined convergence results based on data geometry. Lastly, with the existence of flow maps established, we analyze their equivariance under data transformations.

  • In Section 6, we analyze the ODE trajectories across different stages, demonstrating their initial attraction to the data mean and eventual absorption into the convex hull of the data distribution; later, these trajectories are attracted to local clusters. We also provide quantitative results on the terminal time behavior for discrete data distributions and showcase the importance of terminal time behavior for the memorization phenomenon.

All technical proofs are deferred to the appendix.

2 Background on Flow Matching Models

In this section, we provide a brief overview of flow matching models, their training objectives, and how the noise-to-signal ratio \sigma can be used to unify different scheduling functions.

2.1 Preliminaries and notations

We use \mathbb{R}^{d} to denote the d-dimensional Euclidean space, and \|\cdot\| to denote the Euclidean norm. We use \Omega to denote a general closed subset \Omega\subset\mathbb{R}^{d} and usually use M to denote a manifold in \mathbb{R}^{d}. We let d_{\Omega}:\mathbb{R}^{d}\to\mathbb{R} denote the distance function to \Omega, i.e.,

d_{\Omega}(x):=\min_{y\in\Omega}\|x-y\|.

The medial axis of \Omega is defined as

\Sigma_{\Omega}:=\left\{x\in\mathbb{R}^{d}:\#\left\{\mathrm{argmin}_{y\in\Omega}\|x-y\|\right\}>1\right\}.

The reach of \Omega is defined as \tau_{\Omega}:=\inf_{x\in\Omega}\mathrm{dist}(x,\Sigma_{\Omega}). For example, when \Omega is the unit circle in \mathbb{R}^{2}, the medial axis is its center, \Sigma_{\Omega}=\{0\}, and \tau_{\Omega}=1. For any x\in\mathbb{R}^{d}\backslash\Sigma_{\Omega}, the nearest point on \Omega is unique and we denote it by \mathrm{proj}_{\Omega}(x):=\mathrm{argmin}_{y\in\Omega}\|x-y\|. We let \mathrm{diam}(\Omega) denote the diameter of \Omega, i.e., \mathrm{diam}(\Omega):=\sup_{x,y\in\Omega}\|x-y\|. A set K is said to be convex if for any x,y\in K and t\in[0,1], we have tx+(1-t)y\in K. For any set \Omega, we let \mathrm{conv}(\Omega) denote the convex hull of \Omega, i.e., the smallest convex set containing \Omega.

For any x\in\mathbb{R}^{d} and r>0, we use B_{r}(x) to denote the open ball centered at x with radius r, i.e., B_{r}(x):=\{y\in\mathbb{R}^{d}:\|x-y\|<r\}. More generally, for any set \Omega, we use B_{r}(\Omega) to denote the open r-neighborhood of \Omega, i.e., B_{r}(\Omega):=\{x\in\mathbb{R}^{d}:\exists y\in\Omega,\|x-y\|<r\}.

We let p denote the target data distribution in \mathbb{R}^{d} and p_{\mathrm{prior}} denote the prior, which is often chosen to be the standard Gaussian \mathcal{N}(0,I). We say p has a finite 2-moment, denoted by \mathsf{M}_{2}(p), if \mathsf{M}_{2}(p):=\int_{\mathbb{R}^{d}}\|x\|^{2}p(dx)<\infty. We use a capitalized bold letter \bm{X}\sim p to denote a random variable with law p.

For any two probability measures \mu and \nu, the q-Wasserstein distance for q\geq 1 is defined as

d_{\mathrm{W},q}(\mu,\nu):=\inf_{\gamma\in\mathcal{C}(\mu,\nu)}\left(\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\|x-y\|^{q}\gamma(dx\times dy)\right)^{1/q},

where \mathcal{C}(\mu,\nu) denotes the set of all couplings of \mu and \nu, i.e., the set of all probability measures on \mathbb{R}^{d}\times\mathbb{R}^{d} with marginals \mu and \nu. The distance is finite if \mu and \nu have finite q-moments, i.e., \int_{\mathbb{R}^{d}}\|x\|^{q}\mu(dx)<\infty and \int_{\mathbb{R}^{d}}\|x\|^{q}\nu(dx)<\infty.

For any probability measures \mu and \nu on \mathbb{R}^{d}, we use \mu*\nu to denote their convolution.

2.2 Conditional flow matching

Conditional flow matching models [LCBH+22, LGL23] are a class of generative models whose training process consists of learning a vector field u_{t} that generates a probability path (p_{t})_{t\in[0,1]} interpolating between a prior p_{\mathrm{prior}} and a target data distribution p, and whose sampling process consists of integrating an ODE trajectory from an initial point \bm{Z}\sim p_{\mathrm{prior}} to a terminal point \bm{X}\sim p. For simplicity, we will call these models flow models.

More specifically, the flow model assumes a probability path (p_{t})_{t\in[0,1]} interpolating p_{0}=p_{\mathrm{prior}} and p_{1}=p, and aims at finding a time-dependent vector field (u_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d})_{t\in[0,1)} whose corresponding ODE

\frac{dx_{t}}{dt}=u_{t}(x_{t}) (1)

has an integration flow map \Psi_{t} that sends any initial point x_{0} to x_{t} along the ODE trajectory, satisfying p_{t}=(\Psi_{t})_{\#}p_{0}. Within this framework, the probability path (p_{t})_{t\in[0,1]} is constructed conditionally as

p_{t}(dx_{t}):=\int p_{t}(dx_{t}|\bm{X}=x)p(dx),

where the conditional distribution p_{t}(\cdot|\bm{X}=x) satisfies p_{0}(\cdot|\bm{X}=x)=p_{\mathrm{prior}} and p_{1}(\cdot|\bm{X}=x)=\delta_{x}, the Dirac delta measure at x. When the prior p_{\mathrm{prior}} is the standard Gaussian \mathcal{N}(0,I), the conditional distributions p_{t}(\cdot|\bm{X}=x) are often specified as p_{t}(\cdot|\bm{X}=x):=\mathcal{N}(\alpha_{t}x,\beta_{t}^{2}I), where \alpha_{t} and \beta_{t} are scheduling functions satisfying \alpha_{0}=\beta_{1}=0 and \alpha_{1}=\beta_{0}=1, and are often monotonic. Common choices include the linear scheduling \alpha_{t}=t and \beta_{t}=1-t used in the rectified flow model [LGL23, EKB+24], as well as those arising from the original diffusion models, like DDPM [HJA20].

It turns out that the vector field u_{t} satisfying the above properties can be explicitly expressed in terms of the data distribution p. We first define the conditional vector field u_{t}(x|x_{1}), which enjoys a closed form

u_{t}(x|x_{1})=\frac{\dot{\beta}_{t}}{\beta_{t}}x+\frac{\dot{\alpha}_{t}\beta_{t}-\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}x_{1},

where the dot denotes the derivative with respect to t. Then the desired vector field u_{t} satisfies the following equation for any t\in[0,1):

u_{t}(x)=\int u_{t}(x|x_{1})p(dx_{1}|\bm{X}_{t}=x) (2)

where \bm{X}_{t}\sim p_{t} (we will use similar notation later) and p(\cdot|\bm{X}_{t}=x) denotes the posterior distribution (which is different from the conditional distribution p_{t}(\cdot|\bm{X}=x) defined earlier):

p(dx_{1}|\bm{X}_{t}=x)=\frac{\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)}{\int\exp\left(-\frac{\|x-\alpha_{t}x_{1}^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dx_{1}^{\prime})}p(dx_{1}).

The guarantee that the vector field u_{t} generates the probability path p_{t} is stated in [LGL23, Theorem 3.3] and [LCBH+22, Theorem 1] under the explicit or implicit assumption that the ODE trajectory exists on [0,1]. This result was further rigorously proved in [GHJ24] under specific assumptions. We point out that those assumptions are very restrictive and exclude the case where p is supported on a low-dimensional manifold. See our results in Section 5 for a more general statement.

Ultimately, specifying the conditional distributions p_{t}(\cdot|\bm{X}=x) as Gaussian distributions allows one to train a neural network to learn the vector field u_{t} by minimizing the following loss function, whose unique minimizer is u_{t}(x) [LCBH+22]:

\mathcal{L}_{v}=\mathbb{E}_{t\in[0,1),\bm{Z}\sim p_{\mathrm{prior}},\bm{X}\sim p}\left\|u_{t}^{\theta}(\alpha_{t}\bm{X}+\beta_{t}\bm{Z})-u_{t}(\alpha_{t}\bm{X}+\beta_{t}\bm{Z}|\bm{X})\right\|^{2}
=\mathbb{E}_{t\in[0,1),\bm{Z}\sim p_{\mathrm{prior}},\bm{X}\sim p}\left\|u_{t}^{\theta}(\alpha_{t}\bm{X}+\beta_{t}\bm{Z})-\dot{\alpha}_{t}\bm{X}-\dot{\beta}_{t}\bm{Z}\right\|^{2} (3)
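For concreteness, the loss \mathcal{L}_{v} admits a direct Monte Carlo estimate. The following is a minimal sketch under the linear schedule \alpha_{t}=t, \beta_{t}=1-t, for which the regression target \dot{\alpha}_{t}\bm{X}+\dot{\beta}_{t}\bm{Z} is simply \bm{X}-\bm{Z}; the network u_theta is a placeholder, not part of the paper.

```python
import torch

def cfm_loss(u_theta, x):
    """One Monte Carlo estimate of the conditional flow matching loss (Eq. 3)
    under the linear schedule alpha_t = t, beta_t = 1 - t."""
    t = torch.rand(x.shape[0], 1)      # t ~ Uniform[0, 1)
    z = torch.randn_like(x)            # Z ~ N(0, I), the prior sample
    x_t = t * x + (1.0 - t) * z        # alpha_t X + beta_t Z
    target = x - z                     # dot(alpha)_t X + dot(beta)_t Z
    return ((u_theta(x_t, t) - target) ** 2).sum(dim=1).mean()
```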

2.3 Unifying scheduling functions through noise-to-signal ratio

In this section, we describe a formulation that unifies different scheduling functions \alpha_{t} and \beta_{t}. This is done by introducing the noise-to-signal ratio, which has been discussed in [SPC+24, CZW+24, KG24]. We find it useful in our analysis as it simplifies the ODE dynamics and allows us to present our results more cleanly.

When the scheduling functions \alpha_{t} and \beta_{t} are assumed to be strictly monotonic, the noise-to-signal ratio is defined as

\sigma_{t}:=\frac{\beta_{t}}{\alpha_{t}},\quad\text{for }t\in(0,1].
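For instance, under the linear schedule \alpha_{t}=t and \beta_{t}=1-t, one obtains

\sigma_{t}=\frac{1-t}{t},\qquad t(\sigma)=\frac{1}{1+\sigma},

so \sigma\to\infty as t\to 0 and \sigma\to 0 as t\to 1.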

By monotonicity, \sigma as a function of t is invertible and we let t(\sigma) denote the inverse function. As t goes from 0 to 1, the noise-to-signal ratio \sigma_{t} goes from \infty to 0. For any \sigma\in(0,\infty), we define q_{\sigma} as the convolution of p with a Gaussian kernel of variance \sigma^{2}I:

q_{\sigma}:=p*\mathcal{N}(0,\sigma^{2}I)=\int\mathcal{N}(\cdot|y,\sigma^{2}I)p(dy). (4)

We also let q_{0}:=p. Then, we have the following result.

Proposition 2.1.

For any t\in(0,1], define A_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d} by sending x to x/\alpha_{t}. Then, q_{\sigma_{t}}=(A_{t})_{\#}p_{t}.

Under the map A_{t}, the probability path q_{\sigma_{t}} satisfies an ODE with respect to \sigma.

Proposition 2.2.

For any [a,b)\subset(0,1], let (x_{t})_{t\in[a,b)} denote an ODE trajectory of Equation 1. Then, (x_{\sigma}:=x_{t(\sigma)}/\alpha_{t(\sigma)})_{\sigma\in(\sigma_{b},\sigma_{a}]} satisfies the following ODE:

\frac{dx_{\sigma}}{d\sigma}=-\sigma\nabla\log q_{\sigma}(x_{\sigma}), (5)

where q_{\sigma}(x) denotes the probability density of q_{\sigma}.

During sampling, the ODE in \sigma is integrated backwards: for any \sigma_{T}\in(0,\infty) and any x\in\mathbb{R}^{d}, the corresponding ODE trajectory x_{\sigma} is the solution of Equation 5 with \sigma\in[0,\sigma_{T}] that satisfies x_{\sigma_{T}}=x. We thus have an ODE model, given by Equation 5, generating a probability path q_{\sigma} for \sigma\in(0,\infty). This parametrization of the ODE and its variants have been studied extensively in the diffusion model literature [SME21, KAAL22].
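To make the backward integration concrete, below is a minimal Euler sketch of Equation 5, written via the denoiser form \frac{dx_{\sigma}}{d\sigma}=\frac{1}{\sigma}(x_{\sigma}-m_{\sigma}(x_{\sigma})) derived later as Equation 8; the callable denoiser, the value of sigma_T, and the geometric step grid are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def sample_backward(denoiser, x_T, sigma_T=80.0, n_steps=200):
    """Integrate the sigma-ODE backwards from sigma_T towards 0 with Euler steps.
    denoiser(x, sigma) should approximate m_sigma(x)."""
    sigmas = np.geomspace(sigma_T, 1e-3, n_steps)  # decreasing geometric grid
    x = np.array(x_T, dtype=float)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, s_cur)) / s_cur       # dx/dsigma at (x, s_cur)
        x = x + (s_next - s_cur) * d               # Euler step (s_next < s_cur)
    return x
```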

The above transition from the parameter t to the parameter \sigma works for any (monotonic) scheduling functions \alpha_{t} and \beta_{t}. As a consequence, different choices of scheduling functions are equivalent up to reparametrization. This is also discussed to some extent in [SPC+24] for creating a sampling schedule different from the one used in training.

Remark 2.3 (Subtlety at the endpoints).

When t=0, \alpha_{0}=0 and the division by zero prevents the above transition from being valid. Furthermore, as \sigma goes to \infty, the corresponding probability measure q_{\sigma} does not have a well-defined limit. Hence, one often uses a large \sigma_{T} instead of \infty as the starting time of Equation 5, i.e., one samples z\sim\mathcal{N}(0,\sigma_{T}^{2}I) as the initial point for the backward ODE. This may introduce an extra error term in the sampling process since, at \sigma_{T}, one should have sampled from q_{\sigma_{T}}, which may deviate from \mathcal{N}(0,\sigma_{T}^{2}I). Indeed, it is empirically observed in [ZYLX24b] that this error is related to the average brightness issue in generated images. The validity of the above transition property suggests an easy correction though: first sample z\sim\mathcal{N}(0,I), perform an ODE step in t to obtain z_{t_{\epsilon}}, scale it by \sigma_{t_{\epsilon}}, and then continue the backward ODE in the noise-to-signal ratio \sigma. This is termed SingDiffusion in [ZYLX24b].

In this paper, it is much cleaner and easier to state and prove most theoretical results using the noise-to-signal ratio \sigma rather than t. Therefore, we will present these results in terms of \sigma (or, alternatively, \lambda:=-\log(\sigma)). However, note the equivalence between the two models; results stated in terms of \sigma can be translated to the flow matching model in a straightforward manner. For results regarding flow maps \Psi_{t}, we will use the parameter t due to the inherent obstacles to defining a flow map discussed in Remark 2.3, as well as the reverse process from \infty to 0 in Equation 5.

3 Denoiser: the Guiding Component in Flow Matching Models

At the core of the flow matching model is the vector field u_{t} that generates the probability path p_{t}. It turns out that u_{t} is fully determined by the denoiser, i.e., the mean of the posterior distribution p(\cdot|\bm{X}_{t}=x) [KAAL22]. In this section, we first discuss basic properties of the denoiser, then describe certain difficulties in handling it, and eventually showcase how the ODE dynamics are guided by the denoiser and the implications for the flow model.

3.1 Basics of the denoiser

Recall that the vector field u_{t} is given by u_{t}(x)=\int u_{t}(x|x_{1})p(dx_{1}|\bm{X}_{t}=x); by plugging in the explicit formula u_{t}(x|x_{1})=\frac{\dot{\beta}_{t}}{\beta_{t}}x+\frac{\dot{\alpha}_{t}\beta_{t}-\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}x_{1}, we have that

u_{t}(x)=\frac{\dot{\beta}_{t}}{\beta_{t}}x+\frac{\dot{\alpha}_{t}\beta_{t}-\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}\,\mathbb{E}[\bm{X}|\bm{X}_{t}=x],\text{ where }\bm{X}\sim p, (6)

a formulation that is also given in [GHJ24, Equation (3.7)]. Here \mathbb{E}[\bm{X}|\bm{X}_{t}=x] can be explicitly written as follows:

\mathbb{E}[\bm{X}|\bm{X}_{t}=x]=\int\frac{\exp\left(-\frac{\|x-\alpha_{t}y\|^{2}}{2\beta_{t}^{2}}\right)y}{\int\exp\left(-\frac{\|x-\alpha_{t}y^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dy^{\prime})}p(dy). (7)
Remark 3.1 (Well-definedness of the denoiser).

In order for the above integral to converge, one only needs the mild condition that p has a finite 1-moment. In this case, it follows that for any t\in[0,1), the posterior distribution p(\cdot|\bm{X}_{t}=x) has a finite 1-moment and hence the denoiser m_{t} is well-defined. The same conclusion holds for \bm{X}_{\sigma} defined later.

Notice that the denoiser \mathbb{E}[\bm{X}|\bm{X}_{t}=x] is the only part of u_{t} that depends on the data distribution p and hence fully determines the vector field u_{t}. For simplicity, we use the notation m_{t}(x):=\mathbb{E}[\bm{X}|\bm{X}_{t}=x] to emphasize that it is the mean of the posterior distribution p(\cdot|\bm{X}_{t}=x).

Noise-to-signal ratio formulation.

Likewise, the ODE in \sigma can also be expressed in terms of the denoiser m_{\sigma}(x):=\mathbb{E}[\bm{X}|\bm{X}_{\sigma}=x], as follows from Tweedie's formula:

\frac{dx_{\sigma}}{d\sigma}=-\sigma\nabla\log q_{\sigma}(x_{\sigma})=-\frac{1}{\sigma}\left(m_{\sigma}(x_{\sigma})-x_{\sigma}\right). (8)
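Here Tweedie's formula [Efr11] for the Gaussian-smoothed density q_{\sigma}=p*\mathcal{N}(0,\sigma^{2}I) reads

m_{\sigma}(x)=x+\sigma^{2}\nabla\log q_{\sigma}(x),

which gives the second equality above.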

Notably, the ODE in \sigma can be simply interpreted as moving towards the denoiser m_{\sigma}(x) at a speed inversely proportional to \sigma. Similarly, one can explicitly write m_{\sigma}(x) as follows.

m_{\sigma}(x)=\int\frac{\exp\left(-\frac{\|x-y\|^{2}}{2\sigma^{2}}\right)y}{\int\exp\left(-\frac{\|x-y^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dy^{\prime})}p(dy). (9)
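When p is an empirical measure over finitely many points, the integrals in Equation 9 become finite sums and m_{\sigma}(x) is a softmax-weighted average of the data points. A minimal NumPy sketch, assuming p is uniform over the rows of data; such a denoiser can be plugged directly into the Euler sampler sketched in Section 2.3:

```python
import numpy as np

def empirical_denoiser(x, sigma, data):
    """m_sigma(x) for p uniform over the rows of `data` (Eq. 9 with Diracs):
    a softmax-weighted average of the data points."""
    logits = -np.sum((data - x) ** 2, axis=1) / (2.0 * sigma ** 2)
    logits -= logits.max()                 # for numerical stability
    w = np.exp(logits)
    w /= w.sum()                           # posterior weights p(y_i | X_sigma = x)
    return w @ data
```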
An alternative parametrization for \sigma.

This backward integration in \sigma can be cumbersome in analysis, so we alternatively use the parameter \lambda=-\log(\sigma), and we let \sigma(\lambda) denote the inverse function. As \sigma decreases from \infty to 0, \lambda increases from -\infty to \infty. For an ODE trajectory (x_{\sigma})_{\sigma\in(0,\infty)}, we define (x_{\lambda}:=x_{\sigma(\lambda)})_{\lambda\in(-\infty,\infty)}. For any x\in\mathbb{R}^{d}, we define m_{\lambda}(x):=m_{\sigma(\lambda)}(x). Then, the ODE in \lambda has a concise form:

\frac{dx_{\lambda}}{d\lambda}=m_{\lambda}(x_{\lambda})-x_{\lambda}. (10)
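Since Equation 10 is a linear ODE in x_{\lambda} driven by the denoiser, variation of constants gives the integral form (a short derivation we add for intuition):

x_{\lambda}=e^{-(\lambda-\lambda_{1})}x_{\lambda_{1}}+\int_{\lambda_{1}}^{\lambda}e^{-(\lambda-s)}m_{s}(x_{s})\,ds,

i.e., x_{\lambda} is an exponentially weighted average of the initial point and past denoiser values, which makes the contraction towards the denoiser explicit (compare Remark 3.7 below).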
Jacobians of the denoiser and data covariance.

We point out that the denoiser, under some mild conditions on the data distribution p, is differentiable, and its Jacobian is inherently connected to the covariance matrix of the posterior distribution p(\cdot|\bm{X}_{t}=x) (or p(\cdot|\bm{X}_{\sigma}=x)). Similar formulas for computing the Jacobian have been utilized before for various purposes; see, for example, [ZYLX24a, Lemma B.2.1] and [GHJ24, Lemma 4.1]. Moreover, the covariance formula in Proposition 3.2 is a direct consequence of a higher-order generalization of Tweedie's formula, which has been studied in previous works (see, e.g., [Efr11, MSLE21]).

Proposition 3.2.

Assume that p has a finite 2-moment. For any t\in[0,1), m_{t} is differentiable. In particular, the Jacobian \nabla_{x}m_{t} can be explicitly expressed as follows for any x\in\mathbb{R}^{d}:

\nabla_{x}m_{t}(x)=\frac{\alpha_{t}}{2\beta_{t}^{2}}\iint(z-z^{\prime})(z-z^{\prime})^{T}p(dz|\bm{X}_{t}=x)p(dz^{\prime}|\bm{X}_{t}=x)=\frac{\alpha_{t}}{\beta_{t}^{2}}\mathrm{Cov}[\bm{X}|\bm{X}_{t}=x].

Furthermore, if we let \sigma=\sigma_{t}, then

\nabla_{x}m_{\sigma}(x)=\frac{1}{2\sigma^{2}}\iint(z-z^{\prime})(z-z^{\prime})^{T}p(dz|\bm{X}_{\sigma}=x)p(dz^{\prime}|\bm{X}_{\sigma}=x)=\frac{1}{\sigma^{2}}\mathrm{Cov}[\bm{X}|\bm{X}_{\sigma}=x].
A note on the training process.

As m_{t} is the only part of u_{t} that depends on p, instead of training the vector field u_{t} directly, one can train a neural network m_{t}^{\theta} to learn the denoiser m_{t}. By plugging Equation 6 into the loss in Equation 3, we obtain the following training loss for m_{t}, which turns out to be equivalent to the training loss for the vector field:

\mathcal{L}_{m}=\mathbb{E}_{t\in[0,1),\bm{Z}\sim p_{\mathrm{prior}},\bm{X}\sim p}\left\|m_{t}^{\theta}(\alpha_{t}\bm{X}+\beta_{t}\bm{Z})-\bm{X}\right\|^{2}. (11)

The mean m_{t} always stays bounded, in contrast to the vector field u_{t}, which can blow up as t\to 1 as discussed in Proposition 3.4. This makes the denoiser m_{t}(x) much easier to train than the vector field u_{t}(x). Meanwhile, training the denoiser as an alternative to the vector field is also utilized in the diffusion model literature [KAAL22] (with \sigma) and shows promising results.
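Mirroring the earlier sketch of \mathcal{L}_{v}, a minimal Monte Carlo sketch of \mathcal{L}_{m} in Equation 11 under the same illustrative linear schedule; only the regression target changes, and it stays bounded:

```python
import torch

def denoiser_loss(m_theta, x):
    """One Monte Carlo estimate of the denoiser loss (Eq. 11),
    linear schedule alpha_t = t, beta_t = 1 - t."""
    t = torch.rand(x.shape[0], 1)
    z = torch.randn_like(x)
    x_t = t * x + (1.0 - t) * z
    return ((m_theta(x_t, t) - x) ** 2).sum(dim=1).mean()  # target is X itself
```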

Remark 3.3.

In the literature on diffusion models, the main object is the score function, i.e., the gradient of the log density of the probability distribution p_{t}. The score function view and the denoiser view are equivalent via Tweedie's formula [Efr11]; see, e.g., [GHJ24, Remark 3.5]. As the mean of the posterior distribution, the denoiser m_{t}(x) is more interpretable, and we will show its importance in the remainder of the paper.

3.2 Denoiser and ODE dynamics: terminal time singularity

The terminal time refers to t=1 (equivalently \sigma=0, or \lambda\to\infty) in the flow model. The convergence of the ODE trajectory at the terminal time relies on the terminal time regularity of the vector field. We now elucidate two types of singularities of the vector field that arise at the terminal time in the flow model, one due to the ODE formulation and the other due to the data geometry.

Singularity due to the ODE formulation.

Recall that the vector field u_{t} is given by u_{t}(x)=\frac{\dot{\beta}_{t}}{\beta_{t}}x+\frac{\dot{\alpha}_{t}\beta_{t}-\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}\,\mathbb{E}[\bm{X}|\bm{X}_{t}=x]. Since the denominator \beta_{t} approaches 0 as t\to 1, the vector field u_{t} faces an issue of division by zero when approaching the terminal time. This singularity is intrinsic to the flow matching ODE formulation. In the following proposition, we show that when the data distribution is not fully supported, \lim_{t\to 1}\|u_{t}(x)\|=\infty for almost all x, while the vector field remains bounded when the data distribution is fully supported.

Proposition 3.4.

Assume that \alpha,\beta:[0,1]\to\mathbb{R} are smooth, and that \dot{\alpha}_{1},\dot{\beta}_{1} exist and are nonzero. Let \Omega:=\mathrm{supp}(p) and let \Sigma_{\Omega} denote its medial axis. Then, we have the following properties:

  • If p is fully supported, i.e., \Omega=\mathbb{R}^{d}, and has a Lipschitz density, then for any x\in\mathbb{R}^{d}, the vector field u_{t}(x) is uniformly bounded for all t\in[0,1).

  • If p is not fully supported, i.e., \Omega\neq\mathbb{R}^{d}, then for any x\notin\Omega\cup\Sigma_{\Omega}, \lim_{t\to 1}\|u_{t}(x)\|=\infty.

The deciding factor for whether the vector field u_{t} blows up at a given x is whether \lim_{t\to 1}m_{t}(x)=x. We will show in Corollary 4.4 that \lim_{t\to 1}m_{t}(x)=\mathrm{proj}_{\Omega}(x) for any x\notin\Sigma_{\Omega}; thus, when x\notin\Omega\cup\Sigma_{\Omega}, \lim_{t\to 1}m_{t}(x)\neq x and the vector field u_{t} blows up. Another way to interpret the singularity of the ODE formulation is via the ODE in \sigma in Equation 8, where the singularity corresponds to the 1/\sigma factor. This singularity can be addressed by using the parameter \lambda as in Equation 10, which results in the ODE:

\frac{dx_{\lambda}}{d\lambda}=m_{\lambda}(x_{\lambda})-x_{\lambda},\quad\lambda\in(-\infty,\infty),

with the trade-off of turning a finite-time ODE into an infinite-time ODE.

Singularity due to the data geometry.

When the data is not fully supported, the medial axis \Sigma_{\Omega} of the data support plays a crucial role in the singularity of the denoiser m_{\sigma}, which may result in discontinuity of the limit \lim_{\sigma\to 0}m_{\sigma}(x). In this case, when the ODE is transformed into the \lambda parameter, the vector field u_{\lambda} does not have a uniform Lipschitz bound for all \lambda\in(a,\infty), and hence classical ODE theory such as the Picard-Lindelöf theorem cannot be directly applied to analyze the flow matching ODEs.

Figure 4: The denoiser m_{\sigma}(x) for the two-point example with various \sigma.

This discontinuity can be illustrated by the following simple example of a two-point data distribution, which easily extends to higher dimensions. We specifically use \sigma=e^{-\lambda} and consider \lim_{\sigma\to 0}m_{\sigma}(x) for illustration, as the sampling process is often done in \sigma, e.g., as in [SME21, KAAL22], and the \sigma values are more interpretable.

Example 3.5.

Let p=\frac{1}{2}\delta_{-1}+\frac{1}{2}\delta_{1} be a probability measure on \mathbb{R}^{1}. Then, the support \Omega:=\mathrm{supp}(p)=\{-1,1\} is just a two-point set. The medial axis is the singleton \Sigma_{\Omega}=\{0\}, whose distance to either point is 1. Now, we can explicitly write down the denoiser m_{\sigma} as follows:

m_{\sigma}(x)=\frac{-\exp\left(-\frac{(x+1)^{2}}{2\sigma^{2}}\right)+\exp\left(-\frac{(x-1)^{2}}{2\sigma^{2}}\right)}{\exp\left(-\frac{(x+1)^{2}}{2\sigma^{2}}\right)+\exp\left(-\frac{(x-1)^{2}}{2\sigma^{2}}\right)}. (12)

Notice that as \sigma approaches 0:

  • The denoiser m_{\sigma} converges to the function f:\mathbb{R}^{1}\to\{-1,0,1\} with f(x)=1 for x>0, f(x)=-1 for x<0, and f(0)=0.

  • A singularity of m_{\sigma} emerges at \Sigma_{\Omega}=\{0\}: the derivative (Jacobian in higher dimensions) \frac{dm_{\sigma}(0)}{dx} blows up as \sigma\to 0.
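A quick numerical check of Example 3.5 (a minimal sketch; note that Equation 12 simplifies to \tanh(x/\sigma^{2}) after factoring out the common Gaussian term):

```python
import numpy as np

def m(x, sigma):
    """Denoiser of Example 3.5; Equation 12 simplifies to tanh(x / sigma^2)."""
    return np.tanh(x / sigma ** 2)

for sigma in [1.0, 0.5, 0.1]:
    # m_sigma(0.2) approaches f(0.2) = 1, while the derivative at the
    # medial axis point 0, which equals 1/sigma^2, blows up:
    print(sigma, m(0.2, sigma), 1.0 / sigma ** 2)
```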

A full characterization of the limit \lim_{\sigma\to 0}m_{\sigma}(x) for discrete data distributions will be given in Corollary 4.12, where the discontinuity typically arises at the medial axis of the data support. All these singularities pose challenges for the theoretical analysis of flow matching ODEs, particularly for the convergence of ODE trajectories approaching the terminal time. The data geometry singularity is more challenging to handle, especially the discontinuous behavior of the limit of the denoiser near the medial axis.

We will address these challenges in the convergence analysis of the flow matching ODEs in Section 5 by showing that even when geometric singularities exist, the ODE trajectory avoids them and converges to the data support. This is done by identifying an absorbing and attracting property of the ODE dynamics, discussed in the next section.

3.3 Denoiser and ODE dynamics: attracting and absorbing

Assume that the ODE trajectory exists and is unique up to the terminal time: on [0,1) for t, and on (0,\infty) (resp. (-\infty,\infty)) for \sigma (resp. \lambda). In Section 5.1, we rigorously establish this well-posedness for any data distribution p with a finite 2-moment. Equation 8 suggests that the sample simply moves towards the denoiser m_{\sigma} from its current position x_{\sigma}. The main challenge, however, is that the denoiser m_{\sigma} itself evolves along the trajectory.

Nevertheless, when certain coarse geometric information is available, one can provide preliminary insights into the ODE dynamics. Intuitively, as discussed in the introduction, the flow matching ODE initially points towards the data mean and later towards local clusters. These concepts can be described in terms of the distance to certain sets. Specifically, the flow matching ODE exhibits two key properties when the denoiser m_{\sigma} meets certain conditions that we elaborate on later.

  • Attracting Property: Trajectories are drawn towards a specific closed set, i.e., the distance to the set decreases along the trajectory.

  • Absorbing Property: Trajectories remain confined within neighborhoods of a certain closed set.

The closed sets that we work with can be a point (e.g., the data mean or a single training point), a convex set (e.g., a certain convex hull of data), or a more general set (e.g., the data support). These properties are formalized in the following two meta-theorems. We state them with respect to \sigma, but they can be translated to t or \lambda in a straightforward manner.

Attracting towards sets.

Let \Omega be a closed set in \mathbb{R}^{d}. We want to examine the distance to \Omega along an ODE trajectory (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]} with \sigma_{1}>\sigma_{2}. Assume that the trajectory avoids the medial axis of \Omega. Then the distance of x_{\sigma} to \Omega decreases as \sigma decreases if the trajectory direction \frac{1}{\sigma}\left(m_{\sigma}(x_{\sigma})-x_{\sigma}\right) forms an acute angle with the direction pointing towards \Omega, that is,

\langle m_{\sigma}(x_{\sigma})-x_{\sigma},\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle>0;

see Figure 5 for an illustration.

Figure 5: Illustration of the acute angle condition.

Notice that whenever x_{\sigma}\notin\Omega, one has

\langle m_{\sigma}(x_{\sigma})-x_{\sigma},\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle=\langle m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma}),\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle+\|\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\|^{2}. (13)

Hence, as long as the term \langle m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma}),\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle is small enough, the acute angle condition is satisfied and the distance to \Omega decreases along the trajectory. This intuition is formalized in the following theorem.

Theorem 3.6 (Attracting towards sets).

Let (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]} be an ODE trajectory of Equation 8 starting from some x_{\sigma_{1}}. Assume that the trajectory avoids the medial axis of a closed set \Omega. Then we have the following results.

  1. If \langle m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma}),\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle\leq\zeta\|x_{\sigma}-\mathrm{proj}_{\Omega}(x_{\sigma})\|^{2} for some 0\leq\zeta<1 along the trajectory, then d_{\Omega}(x_{\sigma}) decreases along the trajectory at the rate:

d_{\Omega}(x_{\sigma})\leq\frac{\sigma^{1-\zeta}}{\sigma_{1}^{1-\zeta}}d_{\Omega}(x_{\sigma_{1}}).

In particular, if \sigma_{2}=0, then d_{\Omega}(x_{\sigma}) decreases to zero as \sigma\to 0.

  2. If \sigma_{2}=0 and |\langle m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma}),\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle|\leq\phi(\sigma) for some function \phi(\sigma) along the trajectory with \lim_{\sigma\to 0}\phi(\sigma)=0, then

\lim_{\sigma\to 0}d_{\Omega}(x_{\sigma})=0.
Remark 3.7.

In fact, when considering the parameter \lambda=-\log(\sigma) and the trajectory z_{\lambda}:=x_{\sigma(\lambda)}, we obtain the following convergence rate for Item 2 in the above theorem:

d_{\Omega}(z_{\lambda})\leq e^{-(\lambda-\lambda_{1})}d_{\Omega}(z_{\lambda_{1}})+e^{-\lambda}\sqrt{\int_{\lambda_{1}}^{\lambda}2e^{2t}\phi(t)dt}.

The above theorem requires the trajectory to avoid the medial axis of \Omega. There are two situations in which this condition is satisfied: (1) when \Omega is a convex set, as its medial axis is empty; it turns out one can identify specific convex sets in different stages of the flow matching ODE dynamics where the denoiser satisfies the above estimates, so that the ODE trajectory is attracted towards these convex sets (we discuss these in Sections 6.1 and 6.2); (2) when we can show that the trajectory is confined to a region outside the medial axis of \Omega. The latter is in particular related to the terminal behavior of the flow model sampling process, including the convergence of the flow matching ODE to the data support, discussed in Section 5.2 and Section 6.3. Furthermore, the estimates in the conditions of the above theorem are related to the concentration properties of the posterior distribution p(\cdot|\bm{X}_{\sigma}=x), which we discuss in Section 4.

We now describe a general absorbing property in the next paragraph.

Absorbing by sets.

Let \Omega be any set in \mathbb{R}^{d}. For any 0\leq\sigma_{2}<\sigma_{1}, we say \Omega is absorbing for the flow matching ODE in (\sigma_{2},\sigma_{1}] if, for any x\in\Omega, the ODE trajectory (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]} of Equation 8 starting at x_{\sigma_{1}}=x remains in \Omega for all \sigma\in(\sigma_{2},\sigma_{1}].

Now, we assume that \Omega is closed. It turns out that the acute angle condition

\langle m_{\sigma}(x)-x,\mathrm{proj}_{\Omega}(x)-x\rangle>0

also guarantees that the trajectory remains confined within a neighborhood of \Omega. This is formalized in the following theorem.

Theorem 3.8 (Absorbing by sets).

For any closed set \Omega and r>0, we consider the open neighborhood B_{r}(\Omega).

  1. If \overline{B_{r}(\Omega)}\cap\Sigma_{\Omega}=\emptyset and for any x\in\partial B_{r}(\Omega) and any \sigma\in(\sigma_{2},\sigma_{1}), one has

\langle m_{\sigma}(x)-x,\mathrm{proj}_{\Omega}(x)-x\rangle>0,

then B_{r}(\Omega) is absorbing in (\sigma_{2},\sigma_{1}].

  2. If there exists some r_{0}>0 such that B_{r}(\Omega) is absorbing in (\sigma_{2},\sigma_{1}] for all r\in(0,r_{0}), then \Omega is absorbing in (\sigma_{2},\sigma_{1}] as well.

Note that the absorbing property only requires denoiser information on a fixed region, rather than a priori knowledge of how the denoiser evolves along the trajectory. This versatility allows the absorbing property to be used as a first step in controlling the ODE dynamics in many of our analyses.

We now describe how we will use the above absorbing property for convex sets, which appears often in Section 6, and how its generalization is used to analyze the convergence of the flow matching ODEs in Section 5.2.

For a convex set K, its medial axis \Sigma_{K} is empty, and if we assume that the denoiser m_{\sigma}(x) lies in K for any x\in\partial B_{r}(K), then the neighborhood B_{r}(K) of K will be absorbing for the ODE trajectory. We also obtain a stronger result on when the set K itself is absorbing.

Proposition 3.9 (Absorbing of convex sets).

Let K be a closed convex set in \mathbb{R}^{d}. Let (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]} be an ODE trajectory of Equation 8. Then, we have the following results.

  1. For any r>0, if m_{\sigma}(x)\in K for any x\in\partial B_{r}(K) and any \sigma\in(\sigma_{2},\sigma_{1}], then B_{r}(K) is absorbing for (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]}.

  2. If the interior K^{\circ} of K is not empty and m_{\sigma}(x)\in K^{\circ} for any x\in\partial K and any \sigma\in(\sigma_{2},\sigma_{1}], then K is absorbing for (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]}.

Remark 3.10.

When \Omega is not convex, a typical way to establish the acute angle condition is to require \|m_{\sigma}(x)-\mathrm{proj}_{\Omega}(x)\| to be small enough on \partial B_{r}(\Omega) for all \sigma\in(\sigma_{2},\sigma_{1}]. This can be seen from the following computation:

\langle m_{\sigma}(x_{\sigma})-x_{\sigma},\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle=\langle m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma}),\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle+\|\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\|^{2}
\geq-\|m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma})\|\|\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\|+\|\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\|^{2}.

Furthermore, once the absorbing property is established, it guarantees that d_{\Omega}(x_{\sigma})=\|\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\| is bounded along the trajectory in consideration. In this case, the condition in Item 2 of the attracting theorem (Theorem 3.6) can be derived from the decay of \|m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma})\|, as

|\langle m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma}),\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\rangle|\leq\|m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma})\|\|\mathrm{proj}_{\Omega}(x_{\sigma})-x_{\sigma}\|.

These types of arguments will be utilized in Section 6.2 and Section 6.3 when discussing the dynamics of the flow matching ODEs.

When \Omega is unbounded, e.g., the support of a general distribution, controlling the term \|m_{\sigma}(x_{\sigma})-\mathrm{proj}_{\Omega}(x_{\sigma})\| uniformly on the boundary would require many more assumptions. To circumvent this issue, we consider the intersection of B_{r}(\Omega) with a bounded set and establish the absorbing property of this bounded subset. This is formalized in the following result.

Theorem 3.11 (Absorbing of data support).

Fix any small 0<\delta<\tau_{\Omega}/4 and any 0<\zeta<1. Assume that there exists a constant \sigma_{\Omega}>0 such that for any R>\frac{1}{2}\tau_{\Omega} and any z\in B_{R}(0)\cap B_{\tau_{\Omega}/2}(\Omega), one has

\|m_{\sigma}(z)-\mathrm{proj}_{\Omega}(z)\|\leq C_{\zeta,\tau,R}\cdot\sigma^{\zeta},\text{ for all }0<\sigma<\sigma_{\Omega},

where C_{\zeta,\tau,R} is a constant depending only on \zeta, \tau, and R.

Then, there exists \sigma_{\delta}\leq\sigma_{\Omega}, dependent on \delta, \zeta, and C_{\zeta,\tau,R}, satisfying the following property for any R>2\delta: the trajectory (x_{\sigma})_{\sigma\in(0,\sigma_{\delta}]} of the ODE in Equation 8 starting at any initial point x_{\sigma_{\delta}}\in B_{R}(0)\cap B_{\delta}(\Omega) is absorbed in a slightly larger set: x_{\sigma}\in B_{2R}(0)\cap B_{2\delta}(\Omega) for all \sigma\leq\sigma_{\delta}.

Note that this absorbing result is slightly different from Theorem 3.8: the neighborhood of the data support \Omega must be enlarged from \delta to 2\delta (and R to 2R) to guarantee the absorbing property. This subtle difference arises from the additional treatment in the proof needed to account for the bounded ball B_{R}(0).

We will then be able to show in Section 5.2 that near the terminal time, an ODE trajectory starting from a point x near the data support \Omega never leaves a neighborhood of \Omega, thereby avoiding the geometric singularities (cf. Section 3.2), and is eventually attracted to \Omega.

4 Concentration and Convergence of the Posterior Distribution

In the previous section, we established how the properties of denoisers influence the behavior of flow matching ODEs. Since denoisers are defined as expectations of posterior distributions, this section provides a detailed analysis of the concentration and convergence of the posterior distribution for a general data distribution p. Such an analysis not only validates the assumptions on denoisers made in the meta-theorems on attraction (Theorem 3.6) and absorption (Theorem 3.8) but also forms the foundation for the ODE analysis in Sections 5 and 6. Additionally, we discuss the cases where p is supported either on a low-dimensional manifold or on a discrete set, to explore the impact of different data geometries. As a natural consequence of the convergence of the posterior distribution, we also establish the convergence of the denoisers.

4.1 Concentration and convergence of the posterior distribution

Intuitively, for the posterior distribution p(\cdot|\bm{X}_{t}=x), as t goes to 1, the noise level \beta_{t} goes to 0 and hence the posterior distribution should concentrate around data points near x. We establish this rigorously utilizing the parameter \sigma.

Initially, for large \sigma, the posterior distribution p(\cdot|\bm{X}_{\sigma}=x) is close to the data distribution p, and hence the ODE trajectory x_{\sigma} moves towards the mean of the data distribution. The following proposition quantifies the initial stability of the posterior distribution.

Proposition 4.1 (Initial stability of posterior measure).

Let p be a probability measure on \mathbb{R}^{d} with bounded support \Omega:=\mathrm{supp}(p). Let x be a point and consider the posterior measure p(\cdot|\bm{X}_{\sigma}=x). We then have the following Wasserstein distance bound:

d_{\mathrm{W},1}(p(\cdot|\bm{X}_{\sigma}=x),p)<\left(\exp\left(\frac{2\|x-\mathbb{E}[\bm{X}]\|\mathrm{diam}(\Omega)}{\sigma^{2}}\right)-1\right)\mathrm{diam}(\Omega).

Note that the bound above is only meaningful when \sigma is large and blows up to infinity as \sigma decreases to 0. This makes sense, as the posterior distribution p(\cdot|\bm{X}_{\sigma}=x) gradually departs from p, concentrates around the data points near x, and eventually converges to the delta distribution at the nearest data point. The following theorem quantifies the concentration and convergence of the posterior distribution.

Theorem 4.2.

Let \Omega:=\mathrm{supp}(p). Assume that p has a finite 2-moment, i.e., \mathsf{M}_{2}(p):=\int_{\mathbb{R}^{d}}\|x\|^{2}p(dx)<\infty. For all x\in\mathbb{R}^{d}\backslash\Sigma_{\Omega}, we let x_{\Omega}:=\mathrm{proj}_{\Omega}(x). Then, we have that

\lim_{\sigma\to 0}d_{\mathrm{W},2}\left(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}}\right)=0.
Remark 4.3.

The proof strategy of Theorem 4.2 is similar to that used in [SBDS24, Theorem 4.1], where the integral under consideration is split into two parts to analyze the concentration. However, it is important to note that our result is more general than that in [SBDS24] in two aspects. First, we do not require the data distribution to be supported on a manifold. Second, we do not require the points to be sufficiently close to the data support.

As a direct consequence of the convergence of the posterior distribution, we have the following convergence of the denoiser.

Corollary 4.4.

Under the same assumptions as in Theorem 4.2, for any x\in\mathbb{R}^{d}\backslash\Sigma_{\Omega}, we have that

\lim_{\sigma\to 0}m_{\sigma}(x)=\mathrm{proj}_{\Omega}(x).

When we turn back to the parameter t, we have the following corollary. The proof turns out to be rather technical instead of being a direct consequence of the above theorem; the main difficulty lies in the scaling \alpha_{t} within the exponential term. This is another instance where the parameter \sigma is more convenient for theoretical analysis.

Corollary 4.5.

Let \Omega:=\mathrm{supp}(p). Assume that p has a finite 2-moment. For all x\in\mathbb{R}^{d}\backslash\Sigma_{\Omega}, we let x_{\Omega}:=\mathrm{proj}_{\Omega}(x). Then, we have that

\lim_{t\to 1}m_{t}(x)=\mathrm{proj}_{\Omega}(x).

Finally, we establish the following convergence rate for the posterior distribution when more assumptions are made on the data distribution p.

Theorem 4.6.

Assume that the reach \tau=\tau_{\Omega}>0 is positive. Consider any x\in\mathbb{R}^{d} with d_{\Omega}(x)<\frac{1}{2}\tau, and assume that there exist k\geq 0 and constants C,c>0 such that for any small radius 0<r<c, one has p(B_{r}(x_{\Omega}))\geq Cr^{k}. Then, for any 0<\zeta<1, we have the following convergence rate for any 0<\sigma<c^{1/\zeta}:

dW,2(p(|𝑿σ=x),δxΩ)σ2ζ+10k(2𝖬2(p)+xΩ2)max{τk,σkζ}Cσ2kζexp(18σ2(ζ1))Cζ,τσζ,d_{\mathrm{W},2}\left(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}}\right)\leq\sqrt{\sigma^{2\zeta}+\frac{10^{k}(2\mathsf{M}_{2}(p)+\|x_{\Omega}\|^{2})\max\{\tau^{k},\sigma^{k\zeta}\}}{C\sigma^{2k\zeta}}\exp\left(-\frac{1}{8}\sigma^{2(\zeta-1)}\right)}\leq C_{\zeta,\tau}\sigma^{\zeta},

where Cζ,τC_{\zeta,\tau} is a constant depending only on ζ\zeta and τΩ\tau_{\Omega}.

This convergence rate result plays a central role in the subsequent analysis of the convergence of flow matching ODE trajectories (cf. Theorem 5.3). The assumptions underlying this theorem serve as a primary basis for those outlined in Assumption 1, which are critical for our analysis of ODEs. Consequently, any refinement of this theorem could potentially sharpen the subsequent analysis of ODEs.

4.2 Distinct behaviors for posterior convergence rates under data geometry

In this subsection, we will revisit Theorem 4.2 under different data geometries. In particular, we focus on two cases: (1) pp is supported on a low-dimensional manifold MdM\subset\mathbb{R}^{d} with smooth density with respect to the Hausdorff measure (note that we do not exclude the case when M=dM=\mathbb{R}^{d}); (2) pp is a discrete distribution supported on a finite set.

The first case is motivated by the fact that many real-world data distributions are supported on low-dimensional manifolds. However, in practice, one usually has no access to the manifold M but only to some samples drawn from p. Let M_{N}=\{x^{(1)},\ldots,x^{(N)}\} denote N independent samples drawn from p, and let p^{N} denote the corresponding empirical probability measure. When training the objective function in Section 2.2 with the empirical measure p_{1}^{N}, one unfortunately obtains the unique empirical optimal solution u_{t}^{N}(x) in closed form:

u_{t}^{N}(x)=\frac{\dot{\beta}_{t}}{\beta_{t}}x+\frac{\dot{\alpha}_{t}\beta_{t}-\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}\underbrace{\sum_{i=1}^{N}\frac{\exp\left(-\frac{\|x-\alpha_{t}x^{(i)}\|^{2}}{2\beta_{t}^{2}}\right)x^{(i)}}{\sum_{j=1}^{N}\exp\left(-\frac{\|x-\alpha_{t}x^{(j)}\|^{2}}{2\beta_{t}^{2}}\right)}}_{m_{t}^{N}(x)}. (14)

Solving the ODE \frac{dx_{t}}{dt}=u_{t}^{N}(x_{t}) with x_{0}\sim p_{0} will almost surely result in x_{1}\in M_{N}. This is an extremely undesirable effect, as no new samples can be generated from the model. To avoid this issue, we believe certain modifications of the training objective must be made. To achieve this goal, we first need to theoretically contrast the ground truth solution u_{t}(x) with the empirical solution u_{t}^{N}(x) under the manifold hypothesis mentioned above, which is why we focus on the convergence rates of the posterior distribution in these two geometries in this section. We also point out that the comparison of posterior distributions is necessary, as one can hardly tell the difference between the empirical solution and the ground truth solution at the level of the overall distributions.
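To make Equation 14 concrete, here is a minimal NumPy sketch of the empirical denoiser m_{t}^{N} and the empirical vector field u_{t}^{N}. The rectified-flow schedule \alpha_{t}=t, \beta_{t}=1-t is an assumed choice for illustration, and the function names are ours:

```python
import numpy as np

def empirical_denoiser(x, data, alpha_t, beta_t):
    """m_t^N(x) from Eq. (14): a softmax-weighted average of the training
    points with Gaussian weights exp(-||x - alpha_t x_i||^2 / (2 beta_t^2))."""
    logits = -np.sum((x - alpha_t * data) ** 2, axis=1) / (2.0 * beta_t**2)
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    w /= w.sum()
    return w @ data

def empirical_velocity(x, data, t):
    """u_t^N(x) from Eq. (14), assuming alpha_t = t and beta_t = 1 - t."""
    alpha_t, beta_t = t, 1.0 - t
    alpha_dot, beta_dot = 1.0, -1.0
    m = empirical_denoiser(x, data, alpha_t, beta_t)
    return (beta_dot / beta_t) * x + ((alpha_dot * beta_t - alpha_t * beta_dot) / beta_t) * m
```

Note that for this schedule u_{t}^{N}(x) simplifies to (m_{t}^{N}(x)-x)/(1-t): the field always points from x towards a weighted average of the training points, which is why every trajectory is eventually pulled onto M_{N}.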

Proposition 4.7.

Assume that MNM_{N} is a good sample of MM such that dW,2(pN,p)=ϵd_{\mathrm{W},2}(p^{N},p)=\epsilon. Then, for any t[0,1]t\in[0,1], one has that dW,2(pt,ptN)ϵd_{\mathrm{W},2}(p_{t},p_{t}^{N})\leq\epsilon.

This result follows directly from the fact that convolution does not increase the Wasserstein distance. We point out, however, that there is no simple way of bounding the difference between the posterior distributions d_{\mathrm{W},2}\left(p(\cdot|\bm{X}_{t}=x),p^{N}(\cdot|\bm{X}_{t}=x)\right) using d_{\mathrm{W},2}(p^{N},p). See Theorem 4.3 and Remark 4.4 in [MSW19] for some relevant results along this line.

In this section, we first tackle the convergence rates for the posterior distributions and will complete the convergence analysis for ODEs in Section 5.3.

An intuitive observation regarding the convergence rates.

We start with an observation regarding the Jacobian of the denoisers mσ(x)m_{\sigma}(x) and mσN(x)m_{\sigma}^{N}(x). By Corollary 4.4 we have that limσ0mσ(x)=projM(x)\lim_{\sigma\to 0}m_{\sigma}(x)=\mathrm{proj}_{M}(x) and limσ0mσN(x)=projMN(x)\lim_{\sigma\to 0}m_{\sigma}^{N}(x)=\mathrm{proj}_{M_{N}}(x). Now, we consider the Jacobians of the two projection maps.

Proposition 4.8.

For almost every x\in\mathbb{R}^{d}, we have that

  1. \nabla\mathrm{proj}_{M}(x)=\Pi_{T_{x_{M}}M}, where x_{M}:=\mathrm{proj}_{M}(x) and \Pi_{T_{x_{M}}M} is the orthogonal projection onto the tangent space T_{x_{M}}M;

  2. \nabla\mathrm{proj}_{M_{N}}(x)=0.

Then, Proposition 3.2 and Proposition 4.8 imply the following: if we assume that the convergence of m_{\sigma} to \mathrm{proj}_{M} (resp. of m_{\sigma}^{N} to \mathrm{proj}_{M_{N}}) holds up to first-order derivatives as \sigma\to 0 (this assumption is made for illustrative purposes only and is not rigorously justified here, serving instead to motivate the later results), then one has the following convergence as \sigma\to 0:

  • 12σ2(zz)(zz)Tp(dz|𝑿σ=x)p(dz|𝑿σ=x)ΠTxMM\frac{1}{2\sigma^{2}}\iint(z-z^{\prime})(z-z^{\prime})^{T}p(dz|\bm{X}_{\sigma}=x)p(dz^{\prime}|\bm{X}_{\sigma}=x)\to\Pi_{T_{x_{M}}M}

  • 12σ2(zz)(zz)TpN(dz|𝑿σ=x)pN(dz|𝑿σ=x)0\frac{1}{2\sigma^{2}}\iint(z-z^{\prime})(z-z^{\prime})^{T}p^{N}(dz|\bm{X}_{\sigma}=x)p^{N}(dz^{\prime}|\bm{X}_{\sigma}=x)\to 0.

Note the 1/\sigma^{2} factor in front of the integrals above. The existence of the limits above immediately implies the following rate estimates (otherwise, the 1/\sigma^{2} factor would result in blowup instead of the convergence above):

  • (zz)(zz)Tp(dz|𝑿σ=x)p(dz|𝑿σ=x)=O(σ2)\iint(z-z^{\prime})(z-z^{\prime})^{T}p(dz|\bm{X}_{\sigma}=x)p(dz^{\prime}|\bm{X}_{\sigma}=x)=O(\sigma^{2})

  • (zz)(zz)TpN(dz|𝑿σ=x)pN(dz|𝑿σ=x)=o(σ2)\iint(z-z^{\prime})(z-z^{\prime})^{T}p^{N}(dz|\bm{X}_{\sigma}=x)p^{N}(dz^{\prime}|\bm{X}_{\sigma}=x)=o(\sigma^{2}).

From this, we see that the convergence rate of the posterior must differ significantly depending on whether p is supported on a manifold or on discrete points. The illustrative calculation above suggests that (1) the convergence rate for the manifold case should be O(\sigma), whereas (2) the rate for the discrete case should be o(\sigma). These rates can be seen by roughly taking a square root of the integrals above, as one can loosely regard these integrals as certain squares of the posterior spread.
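As a quick numerical sanity check of this dichotomy, the following sketch (the data and query point are our own toy choices) compares d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}}) for a distribution on a line in \mathbb{R}^{2}, approximated by a fine grid, against a two-point discrete distribution:

```python
import numpy as np

x = np.array([0.3, 0.5])                       # query point off the support

# Manifold-like case: p = N(0, 1) supported on the x-axis in R^2,
# approximated by a fine grid along the line.
z = np.linspace(-4.0, 4.0, 40001)
pts_m = np.stack([z, np.zeros_like(z)], axis=1)
logp_m = -z**2 / 2.0                           # log-density along the line

# Discrete case: p = 0.5 delta_{(-1,0)} + 0.5 delta_{(1,0)}.
pts_d = np.array([[-1.0, 0.0], [1.0, 0.0]])
logp_d = np.log(np.array([0.5, 0.5]))

def w2_to_delta(pts, logp, x, sigma):
    """sqrt of the posterior second moment about the nearest support point,
    i.e. d_{W,2}(p(.|X_sigma = x), delta_{x_Omega})."""
    sq = np.sum((x - pts) ** 2, axis=1)
    logw = logp - sq / (2.0 * sigma**2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    x_omega = pts[np.argmin(sq)]
    return np.sqrt(w @ np.sum((pts - x_omega) ** 2, axis=1))

for sigma in [0.2, 0.1, 0.05]:
    dm = w2_to_delta(pts_m, logp_m, x, sigma)
    dd = w2_to_delta(pts_d, logp_d, x, sigma)  # underflows to 0 for tiny sigma
    print(f"sigma={sigma:5.2f}  manifold: {dm/sigma:5.3f}*sigma  discrete: {dd:.2e}")
```

The manifold column stays close to \sqrt{m}\sigma with m=1, while the discrete column collapses exponentially, matching the O(\sigma) versus o(\sigma) prediction. We now establish our rigorous results below.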

Convergence rates for the manifold case.

We first consider the case when pp is supported on a submanifold MdM\subset\mathbb{R}^{d}. Note that this does not exclude the case when M=dM=\mathbb{R}^{d}. Under some mild conditions, we have the following convergence rate for the posterior distribution.

Theorem 4.9.

Let M\subset\mathbb{R}^{d} be an m-dimensional closed submanifold (without self-intersection) with positive reach \tau_{M}. Assume that p(dx)=\rho(x)\mathrm{vol}_{M}(dx) has a smooth non-vanishing density \rho:M\to\mathbb{R}. For any x\in\mathbb{R}^{d} and \sigma\in(0,\infty), we let x_{M}:=\mathrm{proj}_{M}(x). If x\in\mathbb{R}^{d} satisfies d_{M}(x)<\frac{1}{2}\tau_{M}, then we have that

dW,2(p(|𝑿σ=x),δxM)=mσ+O(σ2).d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{M}})=\sqrt{m}\sigma+O(\sigma^{2}).

Note that the first-order convergence rate depends solely on the dimension of the submanifold, while higher-order information, such as the second fundamental form, is encapsulated within the big O notation. Specifying these higher-order contributions in detail would be an interesting direction for future work.

As a direct consequence of the convergence of the posterior distribution, we have the following convergence of the denoiser.

Corollary 4.10.

Under the same assumptions as in Theorem 4.9, for any xdx\in\mathbb{R}^{d} satisfying dM(x)<12τMd_{M}(x)<\frac{1}{2}\tau_{M}, we have that

mσ(x)xM=O(σ).\|m_{\sigma}(x)-x_{M}\|=O(\sigma).
Convergence rates for the discrete case.

Let the data distribution p=i=1Naiδxip=\sum_{i=1}^{N}a_{i}\,\delta_{x_{i}} be a general discrete distribution with x1,,xNdx_{1},\ldots,x_{N}\in\mathbb{R}^{d} and a1,,aN>0a_{1},\ldots,a_{N}>0. We use Ω={x1,,xN}\Omega=\{x_{1},\ldots,x_{N}\} to denote the support of pp. We study the concentration and convergence of the posterior measure p(|𝑿σ=x)p(\cdot|\bm{X}_{\sigma}=x) for each xdx\in\mathbb{R}^{d}, including those xx on ΣΩ\Sigma_{\Omega}, the medial axis of Ω\Omega. To this end, we introduce the following notations.

For each point xdx\in\mathbb{R}^{d}, we denote the set of distance values from xx to each point in Ω\Omega as follows:

DVΩ(x):={xxi:xiΩ}.\mathrm{DV}_{\Omega}(x):=\{\|x-x_{i}\|:x_{i}\in\Omega\}. (15)

We use dΩ(x;i)d_{\Omega}(x;i) to denote the ii-th smallest distance value in DVΩ(x)\mathrm{DV}_{\Omega}(x). A useful geometric notion will be the gap between the squares of the two smallest distances which we denote by

ΔΩ(x):=dΩ2(x;2)dΩ2(x;1).\Delta_{\Omega}(x):=d_{\Omega}^{2}(x;2)-d_{\Omega}^{2}(x;1). (16)

We further let

\mathrm{NN}_{\Omega}(x):=\mathrm{argmin}_{x_{i}\in\Omega}\|x-x_{i}\|=\left\{x_{i}\in\Omega:\|x-x_{i}\|=d_{\Omega}(x;1)=\min_{x_{j}\in\Omega}\|x-x_{j}\|\right\}.

We use the notation p^NN(x)\widehat{p}_{\mathrm{NN}(x)} to denote the normalized measure restricted to the points in Ω\Omega that are closest to xx:

p^NN(x):=1xiNNΩ(x)aixiNNΩ(x)aiδxi.\widehat{p}_{\mathrm{NN}(x)}:=\frac{1}{\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x)}a_{i}}\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x)}a_{i}\,\delta_{x_{i}}. (17)

Whenever xx is not on ΣΩ\Sigma_{\Omega}, we have NNΩ(x)={projΩ(x)}\mathrm{NN}_{\Omega}(x)=\{\mathrm{proj}_{\Omega}(x)\} and p^NN(x)=δprojΩ(x)\widehat{p}_{\mathrm{NN}(x)}=\delta_{\mathrm{proj}_{\Omega}(x)}. With the above notation, we have the following convergence result for the posterior measure p(|𝑿σ=x)p(\cdot|\bm{X}_{\sigma}=x) as σ0\sigma\to 0.

Theorem 4.11.

Let p=i=1Naiδxip=\sum_{i=1}^{N}a_{i}\,\delta_{x_{i}} be a discrete distribution. For any xdx\in\mathbb{R}^{d}, we have the following convergence of the posterior measure p(|𝐗σ=x)p(\cdot|\bm{X}_{\sigma}=x) towards p^NN(x)\widehat{p}_{\mathrm{NN}(x)}:

dW,2(p(|𝑿σ=x),p^NN(x))diam(Ω)1p(NNΩ(x))p(NNΩ(x))exp(ΔΩ(x)4σ2).d_{\mathrm{W},2}\left(p(\cdot|\bm{X}_{\sigma}=x),\widehat{p}_{\mathrm{NN}(x)}\right)\leq\mathrm{diam}(\Omega)\sqrt{\frac{1-p(\mathrm{NN}_{\Omega}(x))}{p(\mathrm{NN}_{\Omega}(x))}}\exp\left(-\frac{\Delta_{\Omega}(x)}{4\sigma^{2}}\right).

As a corollary, we have the following convergence of the denoiser.

Corollary 4.12.

Let p=i=1Naiδxip=\sum_{i=1}^{N}a_{i}\,\delta_{x_{i}} be a discrete distribution. For any xdx\in\mathbb{R}^{d}, we have the following convergence of the denoiser mσ(x)m_{\sigma}(x) towards the mean of p^NN(x)\widehat{p}_{\mathrm{NN}(x)}:

mσ(x)𝔼(p^NN(x))diam(Ω)1p(NNΩ(x))p(NNΩ(x))exp(ΔΩ(x)4σ2).\|m_{\sigma}(x)-\mathbb{E}(\widehat{p}_{\mathrm{NN}(x)})\|\leq\mathrm{diam}(\Omega)\sqrt{\frac{1-p(\mathrm{NN}_{\Omega}(x))}{p(\mathrm{NN}_{\Omega}(x))}}\exp\left(-\frac{\Delta_{\Omega}(x)}{4\sigma^{2}}\right).

In particular, when x\notin\Sigma_{\Omega}, writing x_{i}=\mathrm{proj}_{\Omega}(x), we have that

mσ(x)xidiam(Ω)1aiaiexp(ΔΩ(x)4σ2).\|m_{\sigma}(x)-x_{i}\|\leq\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{\Delta_{\Omega}(x)}{4\sigma^{2}}\right).

Compared to the linear convergence rate in Corollary 4.10, the discrete setting exhibits an exponential convergence rate. Interestingly, this significant difference does not lead to any notable variation in the flow map convergence rate described in Theorem 5.4. Nevertheless, this discrepancy remains intriguing and could provide valuable insights into understanding the interplay between memorization and generalization in the future. Later in Section 6.3, we will utilize the above results to analyze memorization behavior.

5 Well-Posedness of Flow Matching ODEs

With the concentration and convergence of the posterior distribution established, we now establish the well-posedness of the flow matching ODEs within the interval t[0,1)t\in[0,1) as well as the convergence property when t1t\to 1, in a general setting, including when the target data distribution pp is not fully supported such as those concentrated on a submanifold. This is the first theoretical guarantee for such cases, addressing gaps in prior works [LHH+24, GHJ24] that impose restrictive assumptions excluding manifold-supported data.

5.1 Well-posedness of flow matching ODEs at t[0,1)t\in[0,1)

First of all, we establish the existence of the flow map \Psi_{t} for t\in[0,1), i.e., the existence and uniqueness of the solution of Equation 1 for t\in[0,1). The following result was stated in the original flow matching paper [LCBH+22] under unspecified conditions and was later proved in [GHJ24] in the case when p satisfies strong regularity assumptions such as absolute continuity w.r.t. the Lebesgue measure. In our case, we only require that p has finite 2-moment, which significantly expands the applicability of the flow model.

Theorem 5.1.

As long as pp has finite 2-moment, for any initial point x0dx_{0}\in\mathbb{R}^{d}, there exists a unique solution (xt)t[0,1)(x_{t})_{t\in[0,1)} for the ODE Equation 1. Furthermore, the corresponding flow map Ψt\Psi_{t} is continuous and satisfies that (Ψt)#pprior=pt(\Psi_{t})_{\#}p_{\mathrm{prior}}=p_{t} for all t[0,1)t\in[0,1).

The proof involves carefully analyzing \mathrm{Cov}[\bm{X}|\bm{X}_{t}=x] to establish the locally Lipschitz property of the vector field u_{t}(x), as well as the integrability of the vector field, so that we can apply the classical mass conservation result (see [Vil09]).
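As an illustration of Theorem 5.1, here is a hedged sketch that numerically integrates the ODE using the closed-form empirical vector field of Equation 14, again assuming the schedule \alpha_{t}=t, \beta_{t}=1-t; the toy data are ours. Forward Euler on [0,1) is stable here, and the trajectory approaches one of the training points as t\to 1:

```python
import numpy as np

def u_t(x, data, t):
    """Empirical vector field of Eq. (14) with alpha_t = t, beta_t = 1 - t;
    for this schedule it simplifies to (m_t^N(x) - x) / (1 - t)."""
    logits = -np.sum((x - t * data) ** 2, axis=1) / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logits - logits.max()); w /= w.sum()
    return (w @ data - x) / (1.0 - t)

data = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # toy discrete support
rng = np.random.default_rng(1)
x = rng.standard_normal(2)                                 # x_0 ~ p_prior = N(0, I)

ts = np.linspace(0.0, 0.999, 1000)                         # forward Euler on [0, 1)
for t0, t1 in zip(ts[:-1], ts[1:]):
    x = x + (t1 - t0) * u_t(x, data, t0)
print("x_t near t = 1:", x)    # close to one of the rows of `data`
```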

5.2 Well-posedness of flow matching ODEs at t=1t=1

Now, we establish the convergence of the flow map \Psi_{t} as t\to 1 under mild assumptions, and hence rule out the potential risk of divergence, which would be disastrous for the flow model as shown in Figure 1(a).

Assumption 1 (Regularity assumptions for the data distribution).

Let p be a probability measure on \mathbb{R}^{d}. We assume that p satisfies the following properties:

  1. p has a finite 2-moment, i.e., \int_{\mathbb{R}^{d}}\|x\|^{2}p(dx)<\infty;

  2. The support \Omega:=\mathrm{supp}(p) has a positive reach \tau_{\Omega}>0 (the support \Omega may be the whole space \mathbb{R}^{d}, in which case the reach \tau_{\Omega}=\infty);

  3. There exist k\geq 0 and c>0 such that for any radius R>0, there exists a constant C_{R}>0 so that for any small radius 0\leq r<c and any x\in B_{R}(0)\cap\Omega, one has p(B_{r}(x))\geq C_{R}r^{k}.

Many common data distributions satisfy the above assumptions and we list some of them below:

Example 5.2.
  1. Any data distribution fully supported on \mathbb{R}^{d} with finite 2-moment and non-vanishing density, e.g., the normal distribution, satisfies all the assumptions with k=d.

  2. Any data distribution supported on an m-dimensional linear subspace with finite 2-moment and non-vanishing density, e.g., a projected normal distribution, satisfies all the assumptions with k=m.

  3. Any data distribution supported on an m-dimensional compact submanifold M with positive reach, with a non-vanishing density with respect to the volume measure on M, satisfies all the assumptions with k=m (cf. Lemma B.8).

  4. Any empirical data distribution with a finite number of samples satisfies all the assumptions with k=0.

Now, we establish the main result of this section below. The major reason that we require the reach \tau_{\Omega} to be positive is to handle the geometric singularity mentioned in Section 3.2 through the absorbing result in Theorem 3.11. In this way, we are able to ensure that an ODE trajectory will stay away from the medial axis and hence avoid the potential singularity issue. We need item 3 of Assumption 1 to ensure that, at least locally, we can apply the posterior convergence rate result in Theorem 4.6 in a uniform way.

Theorem 5.3.

Let p be a probability measure on \mathbb{R}^{d} satisfying the regularity assumptions in Assumption 1. Then,

  1. The limit \Psi_{1}(x):=\lim_{t\to 1}\Psi_{t}(x) exists almost everywhere with respect to the Lebesgue measure.

  2. \Psi_{1} is a measurable map and (\Psi_{1})_{\#}p_{\mathrm{prior}}=p.

Furthermore, we have the following estimate of the convergence rate of the flow map. Recall that \sigma_{t}:=\beta_{t}/\alpha_{t}; then, for any fixed 0<\zeta<1,

Ψ1(x)Ψt(x)=O(σtζ/2).\|\Psi_{1}(x)-\Psi_{t}(x)\|=O\left(\sigma_{t}^{{\zeta/2}}\right).

The ζ\zeta in the theorem above is inherited from Theorem 4.6. As such, we chose to retain the original formulation, avoiding the simplification of σtζ/2\sigma_{t}^{{\zeta/2}} to σtζ\sigma_{t}^{{\zeta}}, which would have required redefining ζ(0,1/2)\zeta\in(0,1/2).

Notice that this convergence rate holds for very general data distributions p. In fact, if we further assume that p is supported on a compact manifold or a discrete set, then the convergence rate can be improved; see Section 5.3 below for more details.

5.3 Refined convergence rates for manifolds and discrete cases

Now, we return to the manifold hypothesis and give a detailed convergence analysis for the manifold case and the discrete (empirical) case. Based on the convergence rates established in Section 4.2, we develop the following ODE convergence result, which is a refined version of Theorem 5.3. Unlike the vast difference in posterior convergence and in denoiser convergence, the ODE convergence rates of the manifold case and the discrete case are rather similar. This is somewhat surprising, but we will explain intuitively why this should be the case at the end of this subsection.

Theorem 5.4.

When pp is supported on a submanifold or a discrete set, we have the following convergence rates for the flow map Ψt\Psi_{t} as t1t\to 1.

  • Let M be an m-dimensional closed submanifold (without self-intersection) with positive reach \tau_{M}. Assume that p has finite 2-moment and that p(dx)=\rho(x)\mathrm{vol}_{M}(dx) has a smooth non-vanishing density \rho:M\to\mathbb{R}. We further assume that the second fundamental form \mathbf{I\!I} and its covariant derivative \nabla\mathbf{I\!I} are bounded on M. Then, for almost every x\in\mathbb{R}^{d}, we have that

    \|\Psi_{1}(x)-\Psi_{t}(x)\|=O(\sqrt{\sigma_{t}});

  • If p=\sum_{i=1}^{N}a_{i}\delta_{x_{i}} is a discrete probability measure, then for almost every x\in\mathbb{R}^{d}, we have that

    \|\Psi_{1}(x)-\Psi_{t}(x)\|=O(\sigma_{t}).

Note that the geometric conditions for manifolds in the theorem are mild: they only guarantee that the submanifold has no “sharp turns” in the ambient space \mathbb{R}^{d}. For example, they cover the case M=\mathbb{R}^{d}, any low-dimensional subspace, and any compact submanifold such as a sphere or a torus.

The rate for the discrete case is optimal as shown in the example below.

Example 5.5.

Consider the one point set M1={x1}M_{1}=\{x_{1}\} with p:=δx1p:=\delta_{x_{1}}. Then, when choosing αt=t\alpha_{t}=t and βt=1t\beta_{t}=1-t, one has that starting with any x0dx_{0}\in\mathbb{R}^{d}, its ODE trajectory is given by (xt=(1t)x0+tx1)t[0,1)(x_{t}=(1-t)x_{0}+tx_{1})_{t\in[0,1)}. Therefore, x1xt=(1t)x0x1=Θ(1t)=Θ(σt)\|x_{1}-x_{t}\|=(1-t)\|x_{0}-x_{1}\|=\Theta(1-t)=\Theta(\sigma_{t}). Hence, the flow map Ψt\Psi_{t} has a linear convergence rate.

However, we do not know whether the rate for the manifold case is optimal. In fact, we conjecture that the rate can be improved to the linear rate \sigma_{t} in the manifold case as well, so that the ODE convergence rate does not differ between the manifold case and the discrete case. This conjecture is motivated by the fact that the distribution q_{\sigma}\to p at the linear rate O(\sigma) (see the proposition below), and hence the associated ODE should intuitively converge at the same rate.

Proposition 5.6.

For any probability measure pp with finite 2-moment, we have that

dW,2(qσ,p)=O(σ).d_{\mathrm{W},2}(q_{\sigma},p)=O(\sigma).

We note that the above proposition is relatively straightforward compared to the posterior convergence established in Theorem 4.2. This provides further evidence (in addition to Proposition 4.7 and the subsequent discussion) for the importance of focusing on analyzing the posterior rather than the entire data distribution.

5.4 Equivariance of the flow maps under data transformations

We have established in Theorem 5.3 the existence of the flow map \Psi_{1}:\mathbb{R}^{d}\to\mathbb{R}^{d} sending any sampled point x_{0}\sim p_{\mathrm{prior}} directly to a point x_{1} following the target data distribution p under Assumption 1. In this regard, it is of interest to establish theoretical properties of the flow map. In this subsection, we examine how the flow map \Psi_{1} behaves under certain transformations T:\mathbb{R}^{d}\to\mathbb{R}^{d} of the data distribution p. More precisely, after such a transformation one obtains a new data distribution \widetilde{p}:=T_{\#}p. We then consider applying the flow model to the new data distribution \widetilde{p} and examine the corresponding flow map. In this paper, we assume that p satisfies Assumption 1 and focus on rigid transformations and scaling (which preserve the assumptions), leaving the investigation of other transformations for future work.

Rigid transformations

Consider the rigid transformation y=𝑶x+by={\bm{O}}x+b where 𝑶d×d{\bm{O}}\in\mathbb{R}^{d\times d} is an orthogonal matrix and bdb\in\mathbb{R}^{d} is any vector. It turns out that if we apply the flow model to such a transformed data distribution p~\widetilde{p} with the same scheduling functions αt\alpha_{t} and βt\beta_{t}, the flow map Ψ~1\widetilde{\Psi}_{1} is also a rigid transformation of the original flow map Ψ1\Psi_{1}, i.e.,

Ψ~1(𝑶x)=𝑶Ψ1(x)+b\widetilde{\Psi}_{1}({\bm{O}}x)={\bm{O}}\Psi_{1}(x)+b

In fact, if we let m~t,u~t,Ψ~t\widetilde{m}_{t},\widetilde{u}_{t},\widetilde{\Psi}_{t} denote the corresponding denoiser, vector field, and flow map associated with p~\widetilde{p}, we establish the following result.

Theorem 5.7 (Equivariance under rigid transformations).

For any xdx\in\mathbb{R}^{d} and t[0,1)t\in[0,1), if we let y=𝐎x+αtby={\bm{O}}x+\alpha_{t}b, then

m~t(y)=𝑶mt(x)+b,u~t(y)=𝑶ut(x)+α˙tb\widetilde{m}_{t}(y)={\bm{O}}m_{t}(x)+b,\,\,\widetilde{u}_{t}(y)={\bm{O}}u_{t}(x)+\dot{\alpha}_{t}b

where α˙t\dot{\alpha}_{t} represents the derivative of αt\alpha_{t} with respect to tt. Then, we have that

Ψ~t(𝑶x)=𝑶Ψt(x)+αtb,\widetilde{\Psi}_{t}({\bm{O}}x)={\bm{O}}\Psi_{t}(x)+\alpha_{t}b,

in particular, by convergence of Ψt\Psi_{t} as t1t\to 1, we have that Ψ~1(𝐎x)=𝐎Ψ1(x)+b\widetilde{\Psi}_{1}({\bm{O}}x)={\bm{O}}\Psi_{1}(x)+b.
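For a discrete p, the identity \widetilde{m}_{t}(y)={\bm{O}}m_{t}(x)+b from Theorem 5.7 can be checked directly from the closed-form denoiser. The following is a small numerical sketch of this check; the schedule \alpha_{t}=t, \beta_{t}=1-t and the toy data are our assumptions:

```python
import numpy as np

def denoiser(x, data, t):
    """Closed-form denoiser m_t for a discrete p (cf. Eq. (14)),
    with the assumed schedule alpha_t = t, beta_t = 1 - t."""
    logits = -np.sum((x - t * data) ** 2, axis=1) / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logits - logits.max()); w /= w.sum()
    return w @ data

rng = np.random.default_rng(0)
data = rng.standard_normal((5, 2))
theta = 0.7                                    # rigid motion T(x) = Ox + b
O = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([2.0, -1.0])

t, x = 0.6, rng.standard_normal(2)
y = O @ x + t * b                              # y = Ox + alpha_t b
lhs = denoiser(y, data @ O.T + b, t)           # denoiser of transformed data at y
rhs = O @ denoiser(x, data, t) + b
print(np.allclose(lhs, rhs))                   # True: m~_t(y) = O m_t(x) + b
```

The key point is that \|y-\alpha_{t}({\bm{O}}x_{i}+b)\|=\|x-\alpha_{t}x_{i}\|, so the softmax weights are unchanged by the rigid motion.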

Scaling

We now consider the effect on the flow map when the data is scaled by a factor k>0, i.e., y=kx. This case is more intricate than the rigid case, as one needs to change the scheduling functions to obtain the equivariance property of the flow map.

Assume that an initial flow model is applied to pp with scheduling functions αt\alpha_{t} and βt\beta_{t}. Let st:[0,1]s_{t}:[0,1]\to\mathbb{R} denote any positive smooth function so that s0=1s_{0}=1 and s1=ks_{1}=k. Then, consider

α¯t:=stαt/k and β¯t:=stβt.\bar{\alpha}_{t}:=s_{t}\alpha_{t}/k\,\text{ and }\,\bar{\beta}_{t}:=s_{t}\beta_{t}.

These are still legitimate scheduling functions satisfying α¯0=β¯1=0\bar{\alpha}_{0}=\bar{\beta}_{1}=0 and α¯1=β¯0=1\bar{\alpha}_{1}=\bar{\beta}_{0}=1. We let m¯t,u¯t,Ψ¯t\bar{m}_{t},\bar{u}_{t},\bar{\Psi}_{t} denote the corresponding denoiser, vector field, and flow map associated with the flow matching model with the scaled data distribution kpk\cdot p and the new scheduling functions α¯t\bar{\alpha}_{t} and β¯t\bar{\beta}_{t}. We establish the following result.

Theorem 5.8 (Equivariance under scaling transformations).

For any xdx\in\mathbb{R}^{d} and t[0,1)t\in[0,1), if we let y=stxy=s_{t}x, then

m¯t(y)=kmt(x).\bar{m}_{t}(y)=k\cdot m_{t}(x).

Therefore,

Ψ¯t(x)=stΨt(x)\bar{\Psi}_{t}(x)=s_{t}\cdot\Psi_{t}(x)

and in particular, by convergence of Ψ¯t\bar{\Psi}_{t} as t1t\to 1, we have that Ψ¯1(x)=kΨ1(x)\bar{\Psi}_{1}(x)=k\cdot\Psi_{1}(x).
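As with the rigid case, the scaling identity can be checked directly from the closed-form denoiser when p is discrete. Below is a small numerical sketch of \bar{m}_{t}(s_{t}x)=k\cdot m_{t}(x); the schedule values, the scaling factor, the toy data, and the helper name are our assumptions:

```python
import numpy as np

def denoiser(x, data, alpha_t, beta_t):
    """Closed-form denoiser for a uniform discrete p on `data` (cf. Eq. (14))."""
    logits = -np.sum((x - alpha_t * data) ** 2, axis=1) / (2.0 * beta_t**2)
    w = np.exp(logits - logits.max()); w /= w.sum()
    return w @ data

rng = np.random.default_rng(2)
data = rng.standard_normal((4, 3))
alpha_t, beta_t = 0.7, 0.3                 # generic schedule values at some t
k = 2.5                                    # scaling factor
s_t = 1.6                                  # any positive value of s_t (s_0 = 1, s_1 = k)

x = rng.standard_normal(3)
lhs = denoiser(s_t * x, k * data, s_t * alpha_t / k, s_t * beta_t)
rhs = k * denoiser(x, data, alpha_t, beta_t)
print(np.allclose(lhs, rhs))               # True: m-bar_t(s_t x) = k m_t(x)
```

Here \|s_{t}x-\bar{\alpha}_{t}\,kx_{i}\|=s_{t}\|x-\alpha_{t}x_{i}\| and \bar{\beta}_{t}=s_{t}\beta_{t}, so the softmax weights again coincide and the denoiser output is simply rescaled by k.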

6 Attraction Dynamics of Flow Matching ODEs

In this section, we utilize the concentration and convergence results of the posterior distribution to analyze the attraction dynamics of the flow matching ODEs.

Following the flow matching ODE parametrized by \sigma, the ODE trajectory can simply be interpreted as moving towards the denoiser at each time step. Intuitively, the denoiser m_{\sigma}(x) is initially close to the data mean and then progressively moves closer to nearby data points as the ODE trajectory evolves. Eventually, as \sigma goes to zero, the denoiser converges to the nearest data point almost surely. We will make this intuition rigorous below. We first analyze the initial stage of the sampling process and then discuss the attraction dynamics towards a local cluster of data points in the intermediate stage. Finally, we discuss the terminal behavior of the ODE trajectory under a discrete measure, along with its connection to memorization.

6.1 Initial stage of the sampling process

We will first work with a data distribution p with bounded support and later extend to a more general setting where p is a convolution of a compactly supported distribution and a Gaussian distribution. We assume the data distribution p has finite 2-moment, so that the ODE trajectories of the flow model are well-defined (cf. Section 5.1). We now analyze the initial stage of the sampling process in the flow model.

The initial stage of the sampling process corresponds to \sigma taking a large value T, for which the sample x_{\sigma=T} is obtained as x_{t_{\sigma=T}}/\alpha_{t_{\sigma=T}}. When T is large, t_{\sigma=T} is close to 0, and hence x_{t_{\sigma=T}}/\alpha_{t_{\sigma=T}} has a very large norm. In this case, x_{\sigma=T} typically lies outside the support of p. The results below describe this initial stage of the sampling process.

Attracting toward data mean

By the initial stability of the posterior distribution, the denoiser will be close to the mean of the data distribution, as summarized in the following proposition.

Proposition 6.1.

Let p be a probability measure on \mathbb{R}^{d} with finite 2-moment and bounded support \Omega:=\mathrm{supp}(p). Fix any \zeta\in(0,1). Then, for any x_{0}\in\mathbb{R}^{d} with initial distance R_{0}:=\|x_{0}-\mathbb{E}[\bm{X}]\|, where \bm{X}\sim p, there exists a constant

σinit(Ω,ζ,R0):=2R0diam(Ω)log(1+ζR0diam(Ω))\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}):=\sqrt{\frac{2R_{0}\mathrm{diam}(\Omega)}{\log\left(1+\frac{\zeta R_{0}}{\mathrm{diam}(\Omega)}\right)}}

such that for all \sigma>\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}) and all x\in\overline{B_{R_{0}}(\mathbb{E}[\bm{X}])}, the denoiser m_{\sigma}(x) is close to the mean of the data distribution p in the sense of the following estimate:

mσ(x)𝔼[𝑿]<ζx𝔼[𝑿].\|m_{\sigma}(x)-\mathbb{E}[\bm{X}]\|<\zeta\|x-\mathbb{E}[\bm{X}]\|.

With the above proposition, we can then apply the meta absorbing result in Section 3.3 to show that the ODE trajectory will be absorbed in a ball centered at the mean of the data distribution.

Proposition 6.2.

Under the same assumptions as in Proposition 6.1, the closed neighborhood BR0(𝔼[𝐗])¯\overline{B_{R_{0}}(\mathbb{E}[\bm{X}])} is an absorbing set for the ODE trajectory (xσ)σ(σinit(Ω,ζ,R0),σ1](x_{\sigma})_{\sigma\in(\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}),\sigma_{1}]}.

The absorbing property guarantees that the trajectory remains in B_{R_{0}}(\mathbb{E}[\bm{X}]), and hence the estimate in Proposition 6.1 holds along the trajectory as well. We can then apply the meta attracting result to show that the ODE trajectory moves toward the mean of the data distribution in the initial stage.

Proposition 6.3 (Initial stage: Moving towards the data mean).

With the same assumptions as in Proposition 6.1, the ODE trajectory (x_{\sigma})_{\sigma\in(\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}),\sigma_{1}]} starting from x_{\sigma_{1}}=x_{0} will move towards the mean of the data distribution p, as shown in the following estimate:

xσ𝔼[𝑿]<σ1ζσ11ζxσ1𝔼[𝑿],\|x_{\sigma}-\mathbb{E}[\bm{X}]\|<\frac{\sigma^{{1-\zeta}}}{\sigma_{1}^{{1-\zeta}}}\|x_{\sigma_{1}}-\mathbb{E}[\bm{X}]\|,

where 𝐗p\bm{X}\sim p.

Note that a \zeta close to 1 makes \sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}) smaller, so the guarantee applies over a longer portion of the sampling process; however, the decay guarantee becomes weaker as \zeta approaches 1.
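For intuition, here is a sketch of where the \sigma^{1-\zeta} decay comes from; it is a heuristic outline, assuming the denoiser estimate of Proposition 6.1 holds along the whole trajectory (which Proposition 6.2 guarantees). Write r(\sigma):=\|x_{\sigma}-\mathbb{E}[\bm{X}]\| and recall the \sigma-parametrized ODE \frac{dx_{\sigma}}{d\sigma}=-\frac{1}{\sigma}\left(m_{\sigma}(x_{\sigma})-x_{\sigma}\right) (Equation 8). Then

\frac{dr}{d\sigma}=\frac{\langle x_{\sigma}-\mathbb{E}[\bm{X}],\,x_{\sigma}-m_{\sigma}(x_{\sigma})\rangle}{\sigma\,r(\sigma)}\geq\frac{r(\sigma)-\|m_{\sigma}(x_{\sigma})-\mathbb{E}[\bm{X}]\|}{\sigma}\geq(1-\zeta)\,\frac{r(\sigma)}{\sigma}.

Integrating d\log r/d\log\sigma\geq 1-\zeta from \sigma to \sigma_{1} yields r(\sigma)\leq(\sigma/\sigma_{1})^{1-\zeta}\,r(\sigma_{1}), which is exactly the estimate above.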

Attracting and absorbing to the convex hull of the data support

How close the trajectory gets to the mean of the data distribution depends on the initial distance and on how fast the posterior distribution concentrates. However, the trajectory will necessarily approach the convex hull of the support of p as \sigma goes to zero, as shown in the following proposition, which is obtained by applying the meta attracting and absorbing theorems of Section 3.3 to the convex hull of the support of p.

Proposition 6.4.

Given the data distribution p defined above, for any \sigma_{1}>0, let (x_{\sigma})_{\sigma\in(0,\sigma_{1}]} be a flow matching ODE trajectory following Equation 8. Then we have the following results.

  1. If x_{\sigma_{1}}\in\mathrm{conv}(\mathrm{supp}(p)), then x_{\sigma}\in\mathrm{conv}(\mathrm{supp}(p)) for any \sigma\in(0,\sigma_{1}];

  2. If x_{\sigma_{1}}\notin\mathrm{conv}(\mathrm{supp}(p)), then x_{\sigma} moves towards \mathrm{conv}(\mathrm{supp}(p)) with the following decay guarantee:

    d_{\mathrm{conv}(\mathrm{supp}(p))}(x_{\sigma})\leq\frac{\sigma}{\sigma_{1}}d_{\mathrm{conv}(\mathrm{supp}(p))}(x_{\sigma_{1}}),

    for any \sigma\in(0,\sigma_{1}].

The above propositions require the data distribution to have bounded support. We now extend these results to a more general setting where the data distribution is p=p_{b}\ast\mathcal{N}(0,\delta^{2}I), i.e., a convolution of a compactly supported distribution and a Gaussian distribution. This is done by noting that the ODE trajectory with data distribution p can be derived from the ODE trajectory with data distribution p_{b} by a simple transformation.

Lemma 6.5.

Let pbp_{b} be a distribution and let p:=pb𝒩(0,δ2I)p:=p_{b}\ast{\mathcal{N}}(0,\delta^{2}I). Let (yσb)σb(0,)(y_{\sigma_{b}})_{\sigma_{b}\in(0,\infty)} be an ODE trajectory of Equation 5 with data distribution pbp_{b}. We define (xσ:=yσb=σ2+δ2)σ(0,)(x_{\sigma}:=y_{\sigma_{b}=\sqrt{\sigma^{2}+\delta^{2}}})_{\sigma\in(0,\infty)}. Then, (xσ)(x_{\sigma}) is an ODE trajectory of Equation 5 with data distribution pp.

Corollary 6.6.

Let p=p_{b}\ast\mathcal{N}(0,\delta^{2}I) be a probability measure on \mathbb{R}^{d}, where p_{b} has bounded support \Omega:=\mathrm{supp}(p_{b}). Let x_{0} be a point and denote R_{0}:=\|x_{0}-\mathbb{E}[\bm{X}]\|, where \bm{X}\sim p. Let \zeta be a parameter such that 0<\zeta<1. Then there exists a constant

σinit(Ω,ζ,R0):=2R0diam(Ω)log(1+ζR0diam(Ω))\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}):=\sqrt{\frac{2R_{0}\mathrm{diam}(\Omega)}{\log\left(1+\frac{\zeta R_{0}}{\mathrm{diam}(\Omega)}\right)}}

such that for all σ1>σinit(Ω,ζ,R0)2+δ2\sigma_{1}>\sqrt{\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0})^{2}+\delta^{2}}, a trajectory (xσ)σ(σinit(Ω,ζ,R0)2+δ2,σ1](x_{\sigma})_{\sigma\in(\sqrt{\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0})^{2}+\delta^{2}},\sigma_{1}]} starting from xσ1=x0x_{\sigma_{1}}=x_{0} will move towards the mean of the data distribution pp. Additionally, we have the following decay result:

xσ𝔼[𝑿]<(σ2+δ2)1ζ2(σ12+δ2)1ζ2R0.\|x_{\sigma}-\mathbb{E}[\bm{X}]\|<\frac{(\sigma^{2}+\delta^{2})^{\frac{1-\zeta}{2}}}{(\sigma_{1}^{2}+\delta^{2})^{\frac{1-\zeta}{2}}}R_{0}.
Corollary 6.7.

Given the data distribution pp defined above, for any σ1>0\sigma_{1}>0, let xσx_{\sigma} be an ODE trajectory of Equation 8 from σ=σ1\sigma=\sigma_{1} to σ=0\sigma=0. Then, we have that

  1. If x_{\sigma_{1}}\in\mathrm{conv}(\mathrm{supp}(p_{b})), then x_{\sigma}\in\mathrm{conv}(\mathrm{supp}(p_{b})) for any \sigma\in(0,\sigma_{1}];

  2. If x_{\sigma_{1}}\notin\mathrm{conv}(\mathrm{supp}(p_{b})), then x_{\sigma} moves towards \mathrm{conv}(\mathrm{supp}(p_{b})) with the following decay guarantee:

    d_{\mathrm{conv}(\mathrm{supp}(p_{b}))}(x_{\sigma})\leq\frac{\sqrt{\sigma^{2}+\delta^{2}}}{\sqrt{\sigma_{1}^{2}+\delta^{2}}}d_{\mathrm{conv}(\mathrm{supp}(p_{b}))}(x_{\sigma_{1}}),

    for any \sigma\in(0,\sigma_{1}].

6.2 Intermediate stage of the sampling process

So far we have described the flow matching ODE dynamics in the initial stage. In this subsection, we shed light on the intermediate stage by showing that the ODE trajectory will be attracted towards the convex hull of a local cluster of data under some assumptions.

We start by assuming that the data distribution pp contains a “local cluster” or “local mode”. More precisely, we consider the following assumptions.

Assumption 2.

Let p be a probability measure on \mathbb{R}^{d}. We assume that p has a well-separated local cluster S; specifically, we assume that p satisfies the following properties:

  1. \mathrm{supp}(p)=S\cup(\Omega\backslash S), where S is a closed bounded set with diameter D:=\mathrm{diam}(S)<\infty.

  2. \Omega\backslash S satisfies d_{\mathrm{conv}(S)}(x)>2D for all x\in\Omega\backslash S.

It then turns out that for any point x that is close to the convex hull of S, the denoiser m_{\sigma}(x) will be attracted to the convex hull of S as \sigma goes to zero.

Proposition 6.8.

Assume that p satisfies Assumption 2. Then, for any 0<\epsilon<D/2 and any x\in\mathbb{R}^{d} such that d_{\mathrm{conv}(S)}(x)\leq D/2-\epsilon, we have that

dconv(S)(mσ(x))diam(supp(p))1aSaSexp(3Dϵ2σ2),d_{\mathrm{conv}(S)}(m_{\sigma}(x))\leq\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}\exp\left(-\frac{3D\epsilon}{2\sigma^{2}}\right),

where a_{S}:=p(S)>0.

Following the strategy illustrated in Remark 3.10, the above proposition can be used to show that the ODE trajectory will be absorbed into a certain neighborhood of the convex hull of S and eventually be attracted to the convex hull of S.

Proposition 6.9.

Assume that p satisfies Assumption 2. Let C^{S}_{\epsilon}=\frac{D/2-\epsilon}{\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}}, and define

σ0(S,ϵ):={exp(12log(23Dϵlog(CϵS))),if CϵS<1,,otherwise.\displaystyle\sigma_{0}(S,\epsilon):=\begin{cases}\exp\left(-\frac{1}{2}\log\left(-\frac{2}{3D\epsilon}\log\left(C^{S}_{\epsilon}\right)\right)\right),&\text{if }C^{S}_{\epsilon}<1,\\ \infty,&\text{otherwise}.\end{cases}

Then, the set BD/2ϵ(conv(S))¯\overline{B_{D/2-\epsilon}(\mathrm{conv}(S))} is an absorbing set for the ODE trajectory (xσ)σ(0,σ1](x_{\sigma})_{\sigma\in(0,\sigma_{1}]} with σ1<σ0(S,ϵ)\sigma_{1}<\sigma_{0}(S,\epsilon). Additionally, an ODE trajectory starting from xσ1BD/2ϵ(conv(S))¯x_{\sigma_{1}}\in\overline{B_{D/2-\epsilon}(\mathrm{conv}(S))} will converge to the convex hull of SS as σ0\sigma\to 0.

Remark 6.10.

A well-separated local cluster can be regarded as a local mode of the data distribution. The above proposition gives a quantitative estimate of when the ODE trajectory will be attracted to such a local mode.

6.3 Terminal stage of the sampling process and memorization behavior

The local cluster dynamics described in the previous subsection manifest themselves in the terminal stage for a discrete data distribution, where each data point can eventually be regarded as a local cluster. In this subsection, we show that the ODE trajectory will be attracted to the nearest data point as \sigma goes to zero, and we discuss the memorization behavior of the flow model. The results in this subsection complement those in Section 5.3 by specifying a precise range of the parameter \sigma for terminal convergence, rather than focusing solely on the convergence rate as in Section 5.3.

We will first introduce some necessary notations and definitions. Throughout this subsection, pp denotes a discrete distribution p=i=1naiδxip=\sum_{i=1}^{n}a_{i}\,\delta_{x_{i}}. We let Ω={x1,,xn}\Omega=\{x_{1},\ldots,x_{n}\} denote its support. The Voronoi diagram of Ω\Omega gives a partition of d\mathbb{R}^{d} into regions ViV_{i}’s based on the closeness to the data points:

Vi:={x:xxixxj,xjΩ}.V_{i}:=\{x:\|x-x_{i}\|\leq\|x-x_{j}\|,\ \forall x_{j}\in\Omega\}.

In particular, each V_{i} is a connected closed convex region, and the union of all the V_{i}'s covers the whole space \mathbb{R}^{d}. The boundaries of the regions V_{i} consist exactly of the points on \Sigma_{\Omega}, the medial axis of \Omega. We will also consider a family of regions V_{i}^{\epsilon} that lie inside V_{i} and cover all interior points of V_{i} as \epsilon goes to zero. Specifically, let 0<\epsilon<\mathrm{sep}(x_{i}) be a small positive number, where \mathrm{sep}(x_{i}) denotes the minimum distance from x_{i} to the other data points in \Omega; we use V_{i}^{\epsilon} to denote the region

Viϵ:={x:xxi2xxj2ϵ2,xjΩ,ji}.V_{i}^{\epsilon}:=\left\{x:\|x-x_{i}\|^{2}\leq\|x-x_{j}\|^{2}-\epsilon^{2},\ \forall x_{j}\in\Omega,\,j\neq i\right\}. (18)

As with V_{i}, each V_{i}^{\epsilon} is the intersection of finitely many half-spaces and hence is a convex region.

For each y\in V_{i}^{\epsilon}, x_{i} is the unique nearest data point to y, and the posterior measure p(\cdot|\bm{X}_{\sigma}=y) will be fully concentrated at x_{i} as \sigma goes to zero, as shown in Theorem 4.11. As a consequence, we can identify a time \sigma_{0}(V_{i}^{\epsilon}) as follows such that the ODE trajectory never leaves the region V_{i}^{\epsilon} for \sigma<\sigma_{0}(V_{i}^{\epsilon}). Recall that \mathrm{sep}(x_{i}) denotes the minimum distance from x_{i} to the other data points in \Omega, and introduce a constant C^{\Omega}_{i,\epsilon}=\frac{2\,\mathrm{sep}(x_{i})}{\mathrm{sep}^{2}(x_{i})-\epsilon^{2}}\cdot\sqrt{\frac{1-a_{i}}{a_{i}}}\cdot\mathrm{diam}(\Omega) that depends on the data support \Omega, the specific point x_{i}, and the parameter \epsilon\in(0,\mathrm{sep}(x_{i})/2). Then the time \sigma_{0}(V_{i}^{\epsilon}) is defined as

σ0(Viϵ)={,if Ci,ϵΩ1,ϵ2(log(Ci,ϵΩ))1/2,if Ci,ϵΩ>1,\sigma_{0}(V_{i}^{\epsilon})=\begin{cases}\infty,&\text{if }C^{\Omega}_{i,\epsilon}\leq 1,\\ \frac{\epsilon}{2}\left(\log(C^{\Omega}_{i,\epsilon})\right)^{-1/2},&\text{if }C^{\Omega}_{i,\epsilon}>1,\end{cases}

We have the following proposition.

Proposition 6.11 (Terminal absorbing behavior under discrete distribution).

Fix an arbitrary 0<\sigma_{1}<\sigma_{0}(V_{i}^{\epsilon}). Then, for any y\in V_{i}^{\epsilon}, the ODE trajectory (x_{\sigma})_{\sigma\in(0,\sigma_{1}]} starting from x_{\sigma_{1}}=y will stay inside V_{i}^{\epsilon}, i.e., x_{\sigma}\in V_{i}^{\epsilon} for all \sigma\in(0,\sigma_{1}].

With the above proposition, we can show that an ODE trajectory inside V_{i}^{\epsilon} will be attracted to x_{i} as \sigma goes to zero.

Proposition 6.12.

Fix an arbitrary 0<\sigma_{1}<\sigma_{0}(V_{i}^{\epsilon}). Then, for any y\in V_{i}^{\epsilon}, the ODE trajectory (x_{\sigma})_{\sigma\in(0,\sigma_{1}]} starting from x_{\sigma_{1}}=y will converge to x_{i} as \sigma goes to zero.

Note that this result is complementary to the discrete part of Theorem 5.4 in the sense that we can identify a fairly explicit time σ0(Viϵ)\sigma_{0}(V_{i}^{\epsilon}) such that the current nearest data point is what the ODE trajectory will converge to. This is in contrast to the general convergence rate of the flow map as t1t\to 1 in Theorem 5.4.

Discussion on Memorization

The constant \sigma_{0}(V_{i}^{\epsilon}) in Proposition 6.12 can be regarded as a time below which a trajectory near the data point x_{i} will eventually converge to x_{i} under the flow matching ODE for the empirical data distribution. In particular, the empirical optimal solution can only reproduce training data points in the terminal stage of the sampling process, a phenomenon known as memorization. The constant \sigma_{0}(V_{i}^{\epsilon}) can thus be used as a measure of the memorization time of the empirical optimal solution. For the CIFAR-10 dataset with \epsilon=1.0, the average value of \sigma_{0}(V_{i}^{\epsilon}) over all data points is around 0.17, which covers the last quarter of the sampling process in the popular sampling schedule of [KAAL22].
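For concreteness, the memorization time can be computed directly from pairwise distances. Below is a hedged sketch of this computation for a uniform empirical distribution (a_{i}=1/n) on synthetic data; the dataset and the helper name are ours, not the CIFAR-10 setup described above:

```python
import numpy as np

def average_memorization_time(data, eps):
    """Average sigma_0(V_i^eps) over all data points, following the formula
    above with a_i = 1/n; requires eps < sep(x_i)/2 for every i."""
    n = len(data)
    diff = data[:, None, :] - data[None, :, :]
    D = np.linalg.norm(diff, axis=-1)          # pairwise distance matrix
    diam = D.max()
    np.fill_diagonal(D, np.inf)
    sep = D.min(axis=1)                        # sep(x_i): nearest-neighbor distance
    a = 1.0 / n
    C = 2.0 * sep / (sep**2 - eps**2) * np.sqrt((1.0 - a) / a) * diam
    sigma0 = np.where(C <= 1.0, np.inf,
                      eps / 2.0 / np.sqrt(np.log(np.maximum(C, 1.0 + 1e-12))))
    return sigma0.mean()

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 32))          # synthetic stand-in dataset
print(average_memorization_time(data, eps=0.5))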

Furthermore, we want to emphasize that memorization in the flow model depends crucially on the denoiser near the terminal time. In particular, if the trained denoiser is asymptotically close to the empirical denoiser, the flow model will only memorize the empirical data. This is formalized in the following proposition.

In the following, we consider a neural-network-trained denoiser m_{\sigma}^{\theta} and its corresponding ODE trajectory (x_{\sigma}^{\theta})_{\sigma\in(0,\sigma_{0}]} following the flow matching ODE with m_{\sigma} replaced by m_{\sigma}^{\theta}, that is,

\frac{dx^{\theta}_{\sigma}}{d\sigma}=-\frac{1}{\sigma}\left(m_{\sigma}^{\theta}(x^{\theta}_{\sigma})-x^{\theta}_{\sigma}\right). (19)
Proposition 6.13 (Memorization of trained denoiser).

Let pp denote a discrete probability measure and let mσm_{\sigma} denote its denoiser. Let mσθ:ddm^{\theta}_{\sigma}:\mathbb{R}^{d}\to\mathbb{R}^{d} be any smooth map (which should be regarded as any neural network trained denoiser). Assume that there exists a function ϕ(σ)\phi(\sigma) such that limσ0ϕ(σ)=0\lim_{\sigma\to 0}\phi(\sigma)=0 and

mσθ(x)mσ(x)ϕ(σ), for all xd.\|m^{\theta}_{\sigma}(x)-m_{\sigma}(x)\|\leq\phi(\sigma),\text{ for all }x\in\mathbb{R}^{d}.

Then, there exists a parameter \sigma_{0}(V_{i}^{\epsilon},\phi)>0 such that for all \sigma_{0}<\sigma_{0}(V_{i}^{\epsilon},\phi) and any y\in V_{i}^{\epsilon}, the ODE trajectory (x_{\sigma}^{\theta})_{\sigma\in(0,\sigma_{0}]} of Equation 19 starting from x_{\sigma_{0}}=y will converge to x_{i} as \sigma\to 0.

Note that the above proposition relies on the denoiser m_{\sigma}^{\theta} being uniformly close to the exact denoiser; without such a bound, convergence of x^{\theta}_{\sigma} is not guaranteed in general. However, whenever the trajectory x^{\theta}_{\sigma} converges and the limit of the denoiser m_{\sigma}^{\theta}(x^{\theta}_{\sigma}) exists, the two limits must coincide. This is formalized in the following proposition.

Proposition 6.14.

Let m_{\sigma}^{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{d} be any smooth map (which should be regarded as a neural-network-trained denoiser). Assume that the ODE trajectory (x_{\sigma}^{\theta})_{\sigma\in(0,\sigma_{0}]} of Equation 19 has a limit as \sigma\to 0 and that the limit \lim_{\sigma\to 0}m_{\sigma}^{\theta}(x_{\sigma}^{\theta}) also exists. Then, we must have that

limσ0mσθ(xσθ)xσθ=0.\lim_{\sigma\to 0}\|m^{\theta}_{\sigma}(x^{\theta}_{\sigma})-x^{\theta}_{\sigma}\|=0.

This proposition shows that the final output of the flow model is fully determined by the behavior of the denoiser near the terminal time, unlike a generic ODE, where the behavior near the terminal time only partially influences the final output. Therefore, for a flow model to have good generalization ability, it is crucial to ensure that the trained denoiser is diverse near the terminal time and covers the entire data distribution, which is precisely where the empirical optimal solution fails.

7 Discussion

Our study lays a solid theoretical foundation for the flow matching model, especially in the case when the data distribution is supported on a low-dimensional manifold. Our analysis of the ODE dynamics, as well as the contrast between the manifold and empirical (discrete) scenarios, helps shed light on designing better variants of the flow matching model that avoid memorization and generalize better.

In particular, we point out the following two possible future directions. Suppose the submanifold M has intrinsic dimension k. Then \mathrm{tr}\nabla_{x}m_{1}\equiv k, whereas \mathrm{tr}\nabla_{x}m_{1}^{N}\equiv 0 (cf. Proposition 4.8). From this perspective, one can already see a huge difference between the empirical solution and the ground truth solution; a numerical sketch of this contrast is given below. This motivates us to consider incorporating the Jacobian of the denoiser into the training of the flow matching model to mitigate memorization. Our analysis of the attraction region of the flow matching ODE in the discrete case also suggests that one should be careful when training the denoiser near the terminal time to avoid memorization.
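The following small numerical sketch (our own toy setup) estimates \mathrm{tr}\nabla m_{\sigma} by central finite differences for the closed-form denoiser, using a dense sample of the unit circle as a proxy for the manifold case versus a sparse 5-point sample for the empirical case:

```python
import numpy as np

def m_sigma(x, data, sigma):
    """Closed-form denoiser of the empirical measure on `data` at noise level sigma."""
    logits = -np.sum((x - data) ** 2, axis=1) / (2.0 * sigma**2)
    w = np.exp(logits - logits.max()); w /= w.sum()
    return w @ data

def trace_jacobian(x, data, sigma, h=1e-4):
    """Central finite-difference estimate of tr(nabla m_sigma)(x)."""
    tr = 0.0
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = h
        tr += (m_sigma(x + e, data, sigma)[j] - m_sigma(x - e, data, sigma)[j]) / (2.0 * h)
    return tr

theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # dense sample of S^1
sparse = circle[::400]                                     # N = 5 training points
x = np.array([0.8, 0.1])                                   # query near the circle
for sigma in [0.1, 0.03, 0.01]:
    print(sigma, trace_jacobian(x, circle, sigma), trace_jacobian(x, sparse, sigma))
```

At moderate \sigma, the dense-sample trace is of order the intrinsic dimension (here 1), while the sparse-sample trace is already nearly 0; this suggests that a trace or Jacobian penalty on the learned denoiser is a plausible regularizer against memorization.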

Our general convergence result contains a coefficient that depends on the reach of the data support, which is not stable under perturbations. However, some of our results can potentially be refined using local variants of the reach, such as the local feature size, which could yield more robust and stronger theoretical guarantees.

Acknowledgements.

This work is partially supported by NSF grants CCF-2112665, CCF-2217058, CCF-2310411 and CCF-2403452.

References

  • [AB98] Nina Amenta and Marshall Bern. Surface reconstruction by Voronoi filtering. In Proceedings of the Fourteenth Annual Symposium on Computational Geometry, pages 39–48, 1998.
  • [AB06] Stephanie B Alexander and Richard L Bishop. Gauss equation and injectivity radii for subspaces in spaces of curvature bounded above. Geometriae Dedicata, 117:65–84, 2006.
  • [AL19] Eddie Aamari and Clément Levrard. Nonasymptotic rates for manifold, tangent space and curvature estimation. The Annals of Statistics, 47(1):177–204, 2019.
  • [BBDBM24] Giulio Biroli, Tony Bonnaire, Valentin De Bortoli, and Marc Mézard. Dynamical regimes of diffusion models. Nature Communications, 15(1):9957, 2024.
  • [BDD23] Joe Benton, George Deligiannidis, and Arnaud Doucet. Error bounds for flow matching methods. arXiv preprint arXiv:2305.16860, 2023.
  • [BHHS22] Clément Berenfeld, John Harvey, Marc Hoffmann, and Krishnan Shankar. Estimating the reach of a manifold via its convexity defect function. Discrete & Computational Geometry, 67(2):403–438, 2022.
  • [Bia23] Adam Białożyt. The tangent cone, the dimension and the frontier of the medial axis. Nonlinear Differential Equations and Applications NoDEA, 30(2):27, 2023.
  • [CHN+23] Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, 2023.
  • [CLN23] Bennett Chow, Peng Lu, and Lei Ni. Hamilton’s Ricci flow, volume 77. American Mathematical Society, Science Press, 2023.
  • [CZW+24] Defang Chen, Zhenyu Zhou, Can Wang, Chunhua Shen, and Siwei Lyu. On the trajectory regularity of ode-based diffusion sampling. In Forty-first International Conference on Machine Learning, 2024.
  • [dM37] MR de Mises. La base géométrique du théoreme de m. mandelbrojt sur les points singuliers d’une fonction analytique. CR Acad. Sci. Paris Sér. I Math, 205:1353–1355, 1937.
  • [DZ11] Michel C Delfour and J-P Zolésio. Shapes and geometries: metrics, analysis, differential calculus, and optimization. SIAM, 2011.
  • [Efr11] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
  • [EKB+24] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Dominik Lorenz, Yam Levi, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  • [Fed59] Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–491, 1959.
  • [GHJ24] Yuan Gao, Jian Huang, and Yuling Jiao. Gaussian interpolation flows. Journal of Machine Learning Research, 25(253):1–52, 2024.
  • [GHL90] Sylvestre Gallot, Dominique Hulin, and Jacques Lafontaine. Riemannian geometry, volume 2. Springer, 1990.
  • [GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • [HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [HW20] Daniel Hug and Wolfgang Weil. Lectures on convex geometry, volume 286. Springer, 2020.
  • [KAAL22] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.
  • [KG24] Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36, 2024.
  • [LC24] Marvin Li and Sitan Chen. Critical windows: non-asymptotic theory for feature emergence in diffusion models. arXiv preprint arXiv:2403.01633, 2024.
  • [LCBH+22] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.
  • [LGL23] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International conference on learning representations (ICLR), 2023.
  • [LGRH+24] Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. arXiv preprint arXiv:2404.02954, 2024.
  • [LHH+24] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.
  • [LZB+22] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • [MMASC14] Maria G Monera, A Montesinos-Amilibia, and Esther Sanabria-Codesal. The taylor expansion of the exponential map and geometric applications. Revista de la Real Academia de Ciencias Exactas, Fisicas y Naturales. Serie A. Matematicas, 108:881–906, 2014.
  • [MMWW23] Facundo Mémoli, Axel Munk, Zhengchao Wan, and Christoph Weitkamp. The ultrametric Gromov–Wasserstein distance. Discrete & Computational Geometry, 70(4):1378–1450, 2023.
  • [MSLE21] Chenlin Meng, Yang Song, Wenzhe Li, and Stefano Ermon. Estimating high order gradients of the data distribution by denoising. Advances in Neural Information Processing Systems, 34:25359–25369, 2021.
  • [MSW19] Facundo Memoli, Zane Smith, and Zhengchao Wan. The Wasserstein transform. In International Conference on Machine Learning, pages 4496–4504. PMLR, 2019.
  • [PY24] Frank Permenter and Chenyang Yuan. Interpreting and improving diffusion models from an optimization perspective. In Forty-first International Conference on Machine Learning, 2024.
  • [RTG98] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In Sixth international conference on computer vision (IEEE Cat. No. 98CH36271), pages 59–66. IEEE, 1998.
  • [SBDS24] Jan Pawel Stanczuk, Georgios Batzolis, Teo Deveney, and Carola-Bibiane Schönlieb. Diffusion models encode the intrinsic dimension of data manifolds. In Forty-first International Conference on Machine Learning, 2024.
  • [SDCS23] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023.
  • [SDWMG15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [SE19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [SME21] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • [SPC+24] Neta Shaul, Juan Perez, Ricky TQ Chen, Ali Thabet, Albert Pumarola, and Yaron Lipman. Bespoke solvers for generative flow models. In The Twelfth International Conference on Learning Representations, 2024.
  • [Vil03] Cédric Villani. Topics in Optimal Transportation. Number 58. American Mathematical Soc., 2003.
  • [Vil09] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
  • [WLCL24] Yuxin Wen, Yuchen Liu, Chen Chen, and Lingjuan Lyu. Detecting, explaining, and mitigating memorization in diffusion models. In The Twelfth International Conference on Learning Representations, 2024.
  • [ZYLX24a] Pengze Zhang, Hubery Yin, Chen Li, and Xiaohua Xie. Formulating discrete probability flow through optimal transport. Advances in Neural Information Processing Systems, 36, 2024.
  • [ZYLX24b] Pengze Zhang, Hubery Yin, Chen Li, and Xiaohua Xie. Tackling the singularities at the endpoints of time intervals in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6945–6954, 2024.

Appendix A Geometric Notions and Results

In this section, we recall some basic geometric notions and establish some results that are used in the proofs. The geometric results are of independent interest and can be used in other contexts as well.

A.1 Convex geometry notions and results

In this subsection, we collect some basic notions and results in convex geometry that are used in the proofs. Our main reference is the book [HW20].

We first introduce the definition of a convex set and convex function.

Definition 1 (Convex set).

A set KdK\subset\mathbb{R}^{d} is called a convex set if for any x1,,xnKx_{1},\ldots,x_{n}\in K and 0α1,,αn10\leq\alpha_{1},\ldots,\alpha_{n}\leq 1 such that i=1nαi=1\sum_{i=1}^{n}\alpha_{i}=1, we have that i=1nαixiK\sum_{i=1}^{n}\alpha_{i}x_{i}\in K.

Definition 2 (Convex function).

A function f:df:\mathbb{R}^{d}\to\mathbb{R} is called a convex function if for any x,ydx,y\in\mathbb{R}^{d} and 0α10\leq\alpha\leq 1, we have that f(αx+(1α)y)αf(x)+(1α)f(y)f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y).

An immediate consequence is that the sublevel sets \{f<c\} and \{f\leq c\} of a convex function f are convex sets; see [HW20, Remark 2.6].

We now introduce the definition of the convex hull of a set.

Definition 3 (Convex hull [HW20, Definition 1.3, Theorem 1.2]).

The convex hull of a set Ωd\Omega\subset\mathbb{R}^{d} is the smallest convex set that contains Ω\Omega and is denoted by conv(Ω)\mathrm{conv}(\Omega). Additionally, we have that

\mathrm{conv}(\Omega)=\left\{\sum_{i=1}^{n}\alpha_{i}x_{i}:n\in\mathbb{N},x_{i}\in\Omega,\alpha_{i}\in[0,1],\sum_{i=1}^{n}\alpha_{i}=1\right\}.

Let \Omega be a set and x\in\Omega. We say a hyperplane given by an affine function H is a supporting hyperplane of \Omega at x if the following holds:

  1. H(x)=0.

  2. H(y)\geq 0 for all y\in\Omega.

For a closed convex set Ω\Omega, every boundary point of Ω\Omega has a supporting hyperplane.

Proposition A.1 (Supporting hyperplane [HW20, Theorem 1.16]).

Let KK be a closed convex set in d\mathbb{R}^{d} and xKx\in\partial K. Then, there exists a supporting hyperplane of KK at xx.

We collect some basic properties regarding convex sets and the distance function.

Proposition A.2.

Let K be a closed convex set in \mathbb{R}^{d} and \Omega be a set in \mathbb{R}^{d}. Then, we have that

  1.

    For each xdx\in\mathbb{R}^{d}, there exists a unique point projK(x)K\mathrm{proj}_{K}(x)\in K such that xprojK(x)=dK(x)\|x-\mathrm{proj}_{K}(x)\|=d_{K}(x).

  2.

    The distance function dK(x)d_{K}(x) is a convex function.

  3.

    The thickening of KK by a distance rr is a convex set, that is Br(K):={xd:dK(x)r}B_{r}(K):=\{x\in\mathbb{R}^{d}:d_{K}(x)\leq r\} is a convex set.

  4.

    The diameter of Ω\Omega is the same as the diameter of its convex hull, that is diam(Ω)=diam(conv(Ω))\mathrm{diam}(\Omega)=\mathrm{diam}(\mathrm{conv}(\Omega)).

  5.

    Let a>0a>0, then a set Ω\Omega is convex if and only if aΩa\cdot\Omega is convex.
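These properties can be sanity-checked numerically. The following minimal sketch (our own illustration, not part of the original development; the choice K=B_{1}(0)\subset\mathbb{R}^{2}, the sample counts, and the tolerances are ours) verifies Items 1–3 for the closed unit ball, whose projection map is x\mapsto x/\max\{1,\|x\|\}.

```python
import numpy as np

rng = np.random.default_rng(0)

def proj_ball(x):
    # Projection onto the closed unit ball: identity inside, radial rescaling outside.
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def d_ball(x):
    # Distance to the closed unit ball.
    return max(np.linalg.norm(x) - 1.0, 0.0)

r = 0.5
for _ in range(1000):
    x = 3 * rng.normal(size=2)
    y = 3 * rng.normal(size=2)
    a = rng.uniform()
    # Item 1: proj_ball(x) realizes the distance d_K(x).
    assert abs(np.linalg.norm(x - proj_ball(x)) - d_ball(x)) < 1e-12
    # Item 2: convexity of d_K along a random segment.
    assert d_ball(a * x + (1 - a) * y) <= a * d_ball(x) + (1 - a) * d_ball(y) + 1e-12
    # Item 3: the r-thickening {d_K <= r} is the (convex) ball of radius 1 + r.
    assert (d_ball(x) <= r) == (np.linalg.norm(x) <= 1.0 + r)

print("all checks passed")
```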

A.2 Metric geometry notions and results

Let \Omega\subset\mathbb{R}^{d} be a closed subset and \Sigma_{\Omega} be the medial axis of \Omega. We let d_{\Omega}:\mathbb{R}^{d}\to\mathbb{R} be defined by d_{\Omega}(x):=\inf_{y\in\Omega}\|x-y\|. In this subsection, we consider certain properties of the projection function \mathrm{proj}_{\Omega}:\Sigma_{\Omega}^{c}\to\Omega.

We first recall the definition of the local feature size from [AB98], with a slight generalization: we consider all points in \mathbb{R}^{d} instead of only points in \Omega.

Definition 4 ([AB98]).

For any xdx\in\mathbb{R}^{d}, we define the local feature size lfs(x)\mathrm{lfs}(x) of xx as the distance from xx to the medial axis of Ω\Omega.

Let xx be a point in d\mathbb{R}^{d} such that it has a unique projection to Ω\Omega, denoted by xΩ:=projΩ(x)x_{\Omega}:=\mathrm{proj}_{\Omega}(x). We first consider the following set for any xΣΩcx\in\Sigma_{\Omega}^{c}:

T_{\Omega}(x):=\left\{t\geq 0:\mathrm{proj}_{\Omega}\left(x_{\Omega}+t\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\right)=x_{\Omega}\right\}.
Lemma A.3.

We have the following characterizations of T_{\Omega}(x).

  • TΩ(x)T_{\Omega}(x) is an interval.

  • For any tTΩ(x)t\in T_{\Omega}(x), we have that xΩ+txxΩxxΩΣΩx_{\Omega}+t\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\notin{\Sigma_{\Omega}}.

  • We let RΩ(x):=supTΩ(x)R_{\Omega}(x):=\sup T_{\Omega}(x). If 0<RΩ(x)<0<R_{\Omega}(x)<\infty, we have that xΩ+RΩ(x)xxΩxxΩΣΩ¯x_{\Omega}+R_{\Omega}(x)\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\in\overline{\Sigma_{\Omega}}.

  • [0,dΩ(x)+lfs(x))TΩ(x)[0,d_{\Omega}(x)+\mathrm{lfs}(x))\subset T_{\Omega}(x).

Proof of Lemma A.3.

The first three items follow from the pioneering work [Fed59, Theorem 4.8]; see also [DZ11, Theorem 6.2] for more details. We now prove the last item. First of all, it is straightforward to see that [0,d_{\Omega}(x)]\subset T_{\Omega}(x). Now, assume for contradiction that r:=R_{\Omega}(x)\in[d_{\Omega}(x),d_{\Omega}(x)+\mathrm{lfs}(x)). This implies that

xΩ+rxxΩxxΩ=x+(rdΩ(x))xxΩxxΩΣΩ¯x_{\Omega}+r\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}=x+(r-d_{\Omega}(x))\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\in\overline{\Sigma_{\Omega}}

and hence

lfs(x)=dΣΩ(x)=dΣΩ¯(x)rdΩ(x)<lfs(x).\mathrm{lfs}(x)=d_{\Sigma_{\Omega}}(x)=d_{\overline{\Sigma_{\Omega}}}(x)\leq r-d_{\Omega}(x)<\mathrm{lfs}(x).

This is a contradiction and hence RΩ(x)dΩ(x)+lfs(x)R_{\Omega}(x)\geq d_{\Omega}(x)+\mathrm{lfs}(x). This concludes the proof. ∎

Next, we analyze the angle \angle xx_{\Omega}b for any point b\in\Omega. The following lemma is a slight variant of [Fed59, Theorem 4.8 (7)].

Lemma A.4.

For any xΣΩcx\in\Sigma_{\Omega}^{c} and any t>0t>0 such that tTΩ(x)t\in T_{\Omega}(x), the following holds for any bΩb\in\Omega:

xxΩ,xΩbxΩb2xxΩ2t\left\langle x-x_{\Omega},x_{\Omega}-b\right\rangle\geq-\frac{\|x_{\Omega}-b\|^{2}\|x-x_{\Omega}\|}{2t}
Proof.

By definition of TΩ(x)T_{\Omega}(x), we have that projΩ(xΩ+txxΩxxΩ)=xΩ\mathrm{proj}_{\Omega}\left(x_{\Omega}+t\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\right)=x_{\Omega}. Therefore, we have that

xΩ+txxΩxxΩb2\displaystyle\left\|x_{\Omega}+t\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}-b\right\|^{2} dΩ2(xΩ+txxΩxxΩ)=t2\displaystyle\geq d_{\Omega}^{2}\left(x_{\Omega}+t\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\right)=t^{2}
xΩb2+2txΩb,xxΩxxΩ+t2\displaystyle\|x_{\Omega}-b\|^{2}+2t\left\langle x_{\Omega}-b,\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\right\rangle+t^{2} t2\displaystyle\geq t^{2}
2txΩb,xxΩ\displaystyle 2t\left\langle x_{\Omega}-b,x-x_{\Omega}\right\rangle xΩb2xxΩ\displaystyle\geq-\|x_{\Omega}-b\|^{2}\|x-x_{\Omega}\|
xxΩ,xΩb\displaystyle\left\langle x-x_{\Omega},x_{\Omega}-b\right\rangle xΩb2xxΩ2t\displaystyle\geq-\frac{\|x_{\Omega}-b\|^{2}\|x-x_{\Omega}\|}{2t}

When \Omega is convex, T_{\Omega}(x)=[0,\infty) for every x. Letting t\to\infty in Lemma A.4, we obtain the following corollary.

Corollary A.5.

If Ω\Omega is convex, then for any bΩb\in\Omega and any xdx\in\mathbb{R}^{d} we have that xxΩ,xΩb0\langle x-x_{\Omega},x_{\Omega}-b\rangle\geq 0.
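As a quick illustration of Corollary A.5 (again a sketch of ours, with K the closed unit disk in \mathbb{R}^{2} and all sample counts arbitrary), the inner product below is non-negative for random x and random b\in K:

```python
import numpy as np

rng = np.random.default_rng(1)

def proj_disk(x):
    # Projection onto the closed unit disk.
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

for _ in range(10000):
    x = 3 * rng.normal(size=2)                 # arbitrary point of the plane
    b = rng.normal(size=2)
    b = rng.uniform() * b / np.linalg.norm(b)  # random point of the disk
    x_K = proj_disk(x)
    # Corollary A.5: the angle at proj_K(x) between x and any b in K is non-acute.
    assert np.dot(x - x_K, x_K - b) >= -1e-12

print("obtuse-angle criterion verified")
```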

This control of the angle xxΩb\angle xx_{\Omega}b allows us to derive the following result that bounds the distance between bΩb\in\Omega and the projection projΩ(x)\mathrm{proj}_{\Omega}(x) in terms of the distance between bb and xx.

Lemma A.6.

For any xΣΩcx\in\Sigma_{\Omega}^{c} and any t>0t>0 such that tTΩ(x)t\in T_{\Omega}(x), we have that for any bΩb\in\Omega,

xb2dΩ(x)2+bxΩ2(1dΩ(x)t).\|x-b\|^{2}\geq d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}\left(1-\frac{d_{\Omega}(x)}{t}\right).
Proof.

The case when xΩx\in\Omega trivially holds. Below we consider the case when xΣΩcΩcx\in\Sigma_{\Omega}^{c}\cap\Omega^{c}. In this case, dΩ(x)>0d_{\Omega}(x)>0 as Ω\Omega is a closed set.

By the law of cosines, we have that

cos(xxΩb)=dΩ(x)2+bxΩ2xb22dΩ(x)bxΩ\cos(\angle xx_{\Omega}b)=\frac{d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}-\|x-b\|^{2}}{2d_{\Omega}(x)\|b-x_{\Omega}\|}

Suppose xb2<dΩ(x)2+bxΩ2(1dΩ(x)t)\|x-b\|^{2}<d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}(1-\frac{d_{\Omega}(x)}{t}), then we have that

\cos(\angle xx_{\Omega}b) =\frac{d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}-\|x-b\|^{2}}{2d_{\Omega}(x)\|b-x_{\Omega}\|}
>\frac{d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}-d_{\Omega}(x)^{2}-\|b-x_{\Omega}\|^{2}+\|b-x_{\Omega}\|^{2}\frac{d_{\Omega}(x)}{t}}{2d_{\Omega}(x)\|b-x_{\Omega}\|}
=\frac{\|b-x_{\Omega}\|}{2t}.

By applying Lemma A.4 to bb and xx, we have that

xxΩ,xΩbxΩb2xxΩ2t.\left\langle x-x_{\Omega},x_{\Omega}-b\right\rangle\geq-\frac{\|x_{\Omega}-b\|^{2}\|x-x_{\Omega}\|}{2t}.

This implies the following estimate for the cosine of the angle xxΩb\angle xx_{\Omega}b:

\cos(\angle xx_{\Omega}b)=\frac{\left\langle x-x_{\Omega},b-x_{\Omega}\right\rangle}{\|x-x_{\Omega}\|\|b-x_{\Omega}\|}\leq\frac{\|x_{\Omega}-b\|}{2t}.

This contradicts the inequality above and hence we must have that

xb2dΩ(x)2+bxΩ2(1dΩ(x)t).\|x-b\|^{2}\geq d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}\left(1-\frac{d_{\Omega}(x)}{t}\right).

We then have the following corollary which will be used in the proof of Theorem 4.2.

Corollary A.7.

Fix any xΣΩcx\in\Sigma_{\Omega}^{c}, any t[dΩ(x),dΩ(x)+lfs(x))t\in[d_{\Omega}(x),d_{\Omega}(x)+\mathrm{lfs}(x)) and any ϵ>0\epsilon>0. Then, we have that

BdΩ(x)2+ϵ2(1dΩ(x)/t)(x)ΩBϵ(xΩ)ΩB_{\sqrt{d_{\Omega}(x)^{2}+\epsilon^{2}\left(1-d_{\Omega}(x)/t\right)}}\left(x\right)\cap\Omega\subseteq B_{\epsilon}(x_{\Omega})\cap\Omega
Proof.

For any bBdΩ(x)2+ϵ2(1dΩ(x)/t)(x)Ωb\in B_{\sqrt{d_{\Omega}(x)^{2}+\epsilon^{2}\left(1-d_{\Omega}(x)/t\right)}}\left(x\right)\cap\Omega, we have that

xb2<dΩ(x)2+ϵ2(1dΩ(x)t).\|x-b\|^{2}<d_{\Omega}(x)^{2}+\epsilon^{2}\left(1-\frac{d_{\Omega}(x)}{t}\right).

By Lemma A.6, we have that

xb2dΩ(x)2+bxΩ2(1dΩ(x)t).\|x-b\|^{2}\geq d_{\Omega}(x)^{2}+\|b-x_{\Omega}\|^{2}\left(1-\frac{d_{\Omega}(x)}{t}\right).

Combining the two inequalities, we have that bxΩ2<ϵ2\|b-x_{\Omega}\|^{2}<\epsilon^{2} and hence bBϵ(xΩ)Ωb\in B_{\epsilon}(x_{\Omega})\cap\Omega. ∎

Finally, we derive the local Lipschitz continuity of the projection function.

Lemma A.8 (Local Lipschitz continuity of the projection).

For any ϵ>0\epsilon>0 and for any x,ydx,y\in\mathbb{R}^{d} such that lfs(x)>ϵ\mathrm{lfs}(x)>\epsilon and lfs(y)>ϵ\mathrm{lfs}(y)>\epsilon, we have that

xΩyΩ(max{dΩ(x),dΩ(y)}ϵ+1)xy.\|x_{\Omega}-y_{\Omega}\|\leq\left(\frac{\max\{d_{\Omega}(x),d_{\Omega}(y)\}}{\epsilon}+1\right)\|x-y\|.
Proof.

Since lfs(x)>ϵ\mathrm{lfs}(x)>\epsilon and lfs(y)>ϵ\mathrm{lfs}(y)>\epsilon, by Lemma A.3 we have that

projΩ(xΩ+(ϵ+dΩ(x))xxΩxxΩ)=xΩ\mathrm{proj}_{\Omega}\left(x_{\Omega}+(\epsilon+d_{\Omega}(x))\frac{x-x_{\Omega}}{\|x-x_{\Omega}\|}\right)=x_{\Omega}

and

projΩ(yΩ+(ϵ+dΩ(y))yyΩyyΩ)=yΩ.\mathrm{proj}_{\Omega}\left(y_{\Omega}+(\epsilon+d_{\Omega}(y))\frac{y-y_{\Omega}}{\|y-y_{\Omega}\|}\right)=y_{\Omega}.

By applying Lemma A.4 to x,xΩ,yΩx,x_{\Omega},y_{\Omega} and separately to y,yΩ,xΩy,y_{\Omega},x_{\Omega}, we have that

xΩyΩ,xxΩ\displaystyle\left\langle x_{\Omega}-y_{\Omega},x-x_{\Omega}\right\rangle xΩyΩ22(ϵ+dΩ(x))dΩ(x)\displaystyle\geq-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(x)\right)}d_{\Omega}(x)
yΩxΩ,yyΩ\displaystyle\left\langle y_{\Omega}-x_{\Omega},y-y_{\Omega}\right\rangle xΩyΩ22(ϵ+dΩ(y))dΩ(y)\displaystyle\geq-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(y)\right)}d_{\Omega}(y)

By adding the two inequalities about the inner products, we have that

xΩyΩ,xxΩy+yΩ\displaystyle\left\langle x_{\Omega}-y_{\Omega},x-x_{\Omega}-y+y_{\Omega}\right\rangle xΩyΩ22(ϵ+dΩ(x))dΩ(x)xΩyΩ22(ϵ+dΩ(y))dΩ(y)\displaystyle\geq-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(x)\right)}d_{\Omega}(x)-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(y)\right)}d_{\Omega}(y)
xΩyΩ,xyxΩyΩ2\displaystyle\left\langle x_{\Omega}-y_{\Omega},x-y\right\rangle-\|x_{\Omega}-y_{\Omega}\|^{2} xΩyΩ22(ϵ+dΩ(x))dΩ(x)xΩyΩ22(ϵ+dΩ(y))dΩ(y)\displaystyle\geq-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(x)\right)}d_{\Omega}(x)-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(y)\right)}d_{\Omega}(y)
xΩyΩ,xy\displaystyle\left\langle x_{\Omega}-y_{\Omega},x-y\right\rangle xΩyΩ2xΩyΩ22(ϵ+dΩ(x))dΩ(x)xΩyΩ22(ϵ+dΩ(y))dΩ(y)\displaystyle\geq\|x_{\Omega}-y_{\Omega}\|^{2}-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(x)\right)}d_{\Omega}(x)-\frac{\|x_{\Omega}-y_{\Omega}\|^{2}}{2\left(\epsilon+d_{\Omega}(y)\right)}d_{\Omega}(y)

Let m=max{dΩ(x),dΩ(y)}m=\max\{d_{\Omega}(x),d_{\Omega}(y)\}, then we have that

0<1mm+ϵ(1dΩ(x)2(ϵ+dΩ(x))dΩ(y)2(ϵ+dΩ(y)))0<1-\frac{m}{m+\epsilon}\leq\left(1-\frac{d_{\Omega}(x)}{2\left(\epsilon+d_{\Omega}(x)\right)}-\frac{d_{\Omega}(y)}{2\left(\epsilon+d_{\Omega}(y)\right)}\right)

Therefore, by Cauchy-Schwarz inequality, we have that

xΩyΩ2(1mm+ϵ)xΩyΩ,xyxΩyΩxy\|x_{\Omega}-y_{\Omega}\|^{2}\left(1-\frac{m}{m+\epsilon}\right)\leq\left\langle x_{\Omega}-y_{\Omega},x-y\right\rangle\leq\|x_{\Omega}-y_{\Omega}\|\|x-y\|

This implies xΩyΩ(m+ϵϵ)xy\|x_{\Omega}-y_{\Omega}\|\leq\left(\frac{m+\epsilon}{\epsilon}\right)\|x-y\|. ∎
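The bound of Lemma A.8 can be checked numerically in a case where everything is explicit. In the sketch below (ours; sample counts and tolerances are arbitrary), \Omega is the unit circle in \mathbb{R}^{2}, so \mathrm{proj}_{\Omega}(x)=x/\|x\|, the medial axis is \{0\}, and \mathrm{lfs}(x)=\|x\|.

```python
import numpy as np

rng = np.random.default_rng(2)

def proj_circle(x):
    return x / np.linalg.norm(x)

def d_circle(x):
    return abs(np.linalg.norm(x) - 1.0)

eps = 0.3
for _ in range(10000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    if np.linalg.norm(x) <= eps or np.linalg.norm(y) <= eps:
        continue  # Lemma A.8 requires lfs > eps at both points
    lhs = np.linalg.norm(proj_circle(x) - proj_circle(y))
    m = max(d_circle(x), d_circle(y))
    assert lhs <= (m / eps + 1.0) * np.linalg.norm(x - y) + 1e-10

print("local Lipschitz bound verified")
```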

Differentiability of the distance function

Let Ωd\Omega\subset\mathbb{R}^{d} be any closed subset. We let Σ(x):={yΩ:xy=dΩ(x)}\Sigma(x):=\{y\in\Omega:\|x-y\|=d_{\Omega}(x)\} denote the set of points in Ω\Omega that achieve the infimum. When xx is not in the medial axis of Ω\Omega, the set Σ(x)\Sigma(x) is a singleton and we denote it by projΩ(x)\mathrm{proj}_{\Omega}(x).

Interestingly, there is the following result by [dM37] showing the existence of one-sided directional derivatives of d_{\Omega} (see also [Bia23] for a more recent treatment in English).

Lemma A.9 ([dM37]).

For any vector vdv\in\mathbb{R}^{d}, the one-sided directional derivative of dΩd_{\Omega} at xd\Ωx\in\mathbb{R}^{d}\backslash\Omega in the direction vv exists and is given by

DvdΩ(x)=inf{v,yxyx:yΣ(x)}D_{v}d_{\Omega}(x)=\inf\left\{-\frac{\langle v,y-x\rangle}{\|y-x\|}:y\in\Sigma(x)\right\}

Motivated by this result, we mimic the proof and establish the following result for the squared distance function d_{\Omega}^{2}. Notice that we can remove the constraint x\notin\Omega.

Lemma A.10.

For any xd\ΣΩx\in\mathbb{R}^{d}\backslash\Sigma_{\Omega}, the directional derivative of dΩ2d_{\Omega}^{2} at xx exists and is given by

DvdΩ2(x)=2v,projΩ(x)xD_{v}d_{\Omega}^{2}(x)=-2\langle v,\mathrm{proj}_{\Omega}(x)-x\rangle
Proof.

Without loss of generality, we assume that x=0x=0 is the origin and v=(c,0,,0)v=(c,0,\ldots,0) for some c>0c>0 (one can achieve these by applying rigid transformations).

We let xΩ=projΩ(x)x_{\Omega}=\mathrm{proj}_{\Omega}(x), xt=tvx_{t}=tv and xt=projΩ(xt)x_{t}^{*}=\mathrm{proj}_{\Omega}(x_{t}). Since xtxtxΩxt\|x_{t}^{*}-x_{t}\|\leq\|x_{\Omega}-x_{t}\|, we have that

xt2xΩ22ct(xt,(1)x0,(1)),\|x_{t}^{*}\|^{2}-\|x_{\Omega}\|^{2}\leq 2ct(x_{t}^{*,(1)}-x_{0}^{*,(1)}),

where xt,(1)x_{t}^{*,(1)} denotes the first coordinate of xtx_{t}^{*}. Note that xt=xtxxΩx=xΩ\|x_{t}^{*}\|=\|x_{t}^{*}-x\|\geq\|x_{\Omega}-x\|=\|x_{\Omega}\|. So we have that

0xt,(1)x0,(1).0\leq x_{t}^{*,(1)}-x_{0}^{*,(1)}.

Now, by the law of cosines applied to the triangle formed by x_{t}=tv, x=0 and x_{t}^{*}, we have that

dΩ2(xt)=xt2+t2c22tv,xtx.d^{2}_{\Omega}(x_{t})=\|x_{t}^{*}\|^{2}+t^{2}c^{2}-2t\langle v,x_{t}^{*}-x\rangle.

Therefore, we have that

\frac{d_{\Omega}^{2}(x_{t})-d_{\Omega}^{2}(x)}{t}=\frac{\|x_{t}^{*}\|^{2}-\|x_{\Omega}\|^{2}}{t}+tc^{2}-2\langle v,x_{t}^{*}-x\rangle

As discussed above, we have that

0xt2xΩ2t2c(xt,(1)x0,(1))0\leq\frac{\|x_{t}^{*}\|^{2}-\|x_{\Omega}\|^{2}}{t}\leq 2c(x_{t}^{*,(1)}-x_{0}^{*,(1)})

By continuity of \mathrm{proj}_{\Omega} outside \Sigma_{\Omega}, we have that x_{t}^{*}\to x_{\Omega} as t\to 0. Since the term tc^{2} also vanishes, we have that

\lim_{t\to 0}\frac{d_{\Omega}^{2}(x_{t})-d_{\Omega}^{2}(x)}{t}=-2\langle v,x_{\Omega}-x\rangle=-2\langle v,\mathrm{proj}_{\Omega}(x)-x\rangle. ∎
Corollary A.11.

Let (x_{t})_{t} be a differentiable curve in \mathbb{R}^{d}\backslash\Sigma_{\Omega}. Then d_{\Omega}^{2}(x_{t}) is differentiable with respect to t and we have that

ddtdΩ2(xt)=Dx˙tdΩ2(xt)=2x˙t,projΩ(xt)xt\frac{d}{dt}d_{\Omega}^{2}(x_{t})=D_{\dot{x}_{t}}d_{\Omega}^{2}(x_{t})=-2\langle\dot{x}_{t},\mathrm{proj}_{\Omega}(x_{t})-x_{t}\rangle
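This chain rule admits a direct finite-difference check. The following sketch (ours; the curve and step sizes are arbitrary choices) compares both sides of Corollary A.11 for \Omega the unit circle, along a curve that avoids the medial axis \{0\}.

```python
import numpy as np

def d2(x):
    # Squared distance to the unit circle: (|x| - 1)^2.
    return (np.linalg.norm(x) - 1.0) ** 2

def proj(x):
    return x / np.linalg.norm(x)

def curve(t):
    # A differentiable curve staying away from the medial axis {0}.
    return np.array([2.0 + np.sin(t), 1.0 + 0.5 * np.cos(t)])

h = 1e-6
for t in np.linspace(0.0, 3.0, 7):
    x = curve(t)
    xdot = (curve(t + h) - curve(t - h)) / (2 * h)
    fd = (d2(curve(t + h)) - d2(curve(t - h))) / (2 * h)  # finite difference
    formula = -2.0 * np.dot(xdot, proj(x) - x)            # Corollary A.11
    assert abs(fd - formula) < 1e-4

print("chain rule for the squared distance verified")
```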

Appendix B Proofs

B.1 Proofs in Section 2

Proof of Proposition 2.1.

For any x\in\mathbb{R}^{d} and t\in(0,1], we have \alpha_{t}>0, so the map F_{t} is well-defined. Then,

(Ft)#pt\displaystyle(F_{t})_{\#}p_{t} =(Ft)#(𝒩(|αtx0,βt2I)p(dx0))\displaystyle=(F_{t})_{\#}\left(\int{\mathcal{N}}(\cdot|\alpha_{t}x_{0},\beta_{t}^{2}I)p(dx_{0})\right)
=𝒩(|x0,(βt/αt)2I)p(dx0)\displaystyle=\int{\mathcal{N}}(\cdot|x_{0},(\beta_{t}/\alpha_{t})^{2}I)p(dx_{0})
=𝒩(|x0,σt2I)p(dx0).\displaystyle=\int{\mathcal{N}}(\cdot|x_{0},\sigma_{t}^{2}I)p(dx_{0}).
Proof of Proposition 2.2.

We first write down the density of qσq_{\sigma} explicitly as follows:

q_{\sigma}(x)=\frac{1}{(2\pi\sigma^{2})^{d/2}}\int\exp\left(-\frac{\|x-x_{1}\|^{2}}{2\sigma^{2}}\right)p(dx_{1}).

Then, we have that

logqσ(x)=1σ2(xyexp(xy22σ2)p(dy)exp(xy22σ2)p(dy)).\nabla\log q_{\sigma}(x)=-\frac{1}{\sigma^{2}}\left(x-\frac{\int y\exp\left(-\frac{\|x-y\|^{2}}{2\sigma^{2}}\right)p(dy)}{\int\exp\left(-\frac{\|x-y^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dy^{\prime})}\right).

Note that dtdσ=αt2β˙tαtβtα˙t\frac{dt}{d\sigma}=\frac{\alpha_{t}^{2}}{\dot{\beta}_{t}\alpha_{t}-\beta_{t}\dot{\alpha}_{t}}. Therefore with the reparametrization xσ:=xt/αt{x}_{\sigma}:=x_{t}/\alpha_{t}, we have that

dxσdσ\displaystyle\frac{dx_{\sigma}}{d\sigma} =dxσdtdtdσ=1αt2(αtdxtdtα˙txt)dtdσ\displaystyle=\frac{dx_{\sigma}}{dt}\frac{dt}{d\sigma}=\frac{1}{\alpha_{t}^{2}}\left(\alpha_{t}\frac{dx_{t}}{dt}-\dot{\alpha}_{t}x_{t}\right)\frac{dt}{d\sigma}
=αtβt(xtαtexp(xtαty22βt2)yexp(xtαty22βt2)p(dy)p(dy))\displaystyle=\frac{\alpha_{t}}{\beta_{t}}\left(\frac{x_{t}}{\alpha_{t}}-\int\frac{\exp\left(-\frac{\|x_{t}-\alpha_{t}y\|^{2}}{2\beta_{t}^{2}}\right)y}{\int\exp\left(-\frac{\|x_{t}-\alpha_{t}y^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dy^{\prime})}p(dy)\right)
=σlogqσ(xσ).\displaystyle=-{\sigma}\nabla\log q_{\sigma}(x_{\sigma}).
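The identity \nabla\log q_{\sigma}(x)=-\frac{1}{\sigma^{2}}\left(x-m_{\sigma}(x)\right) underlying this computation can be verified numerically. The sketch below (ours; the atoms, \sigma, and tolerances are arbitrary) compares a finite-difference gradient of \log q_{\sigma} with the closed form, for an empirical data distribution p.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(5, 2))   # atoms of an empirical data distribution p

def log_q(x, sigma):
    # log q_sigma up to the Gaussian normalizing constant (irrelevant for gradients).
    e = -np.sum((x - Y) ** 2, axis=1) / (2 * sigma**2)
    return np.log(np.mean(np.exp(e)))

def denoiser(x, sigma):
    # Posterior mean m_sigma(x): softmax-weighted average of the atoms.
    e = -np.sum((x - Y) ** 2, axis=1) / (2 * sigma**2)
    w = np.exp(e - e.max())
    return (w / w.sum()) @ Y

x, sigma, h = np.array([0.7, -0.3]), 0.5, 1e-6
grad_fd = np.array([(log_q(x + h * np.eye(2)[i], sigma)
                     - log_q(x - h * np.eye(2)[i], sigma)) / (2 * h) for i in range(2)])
grad_formula = -(x - denoiser(x, sigma)) / sigma**2
assert np.allclose(grad_fd, grad_formula, atol=1e-5)
print("score identity verified:", grad_formula)
```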

B.2 Proofs in Section 3

Proof of Remark 3.1.

For the normalizing factor Z=exp(xαty22βt2)p(dy)Z=\int\exp\left(-\frac{\|x-\alpha_{t}y^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dy^{\prime}), we note that 0<exp(xαty22βt2)10<\exp\left(-\frac{\|x-\alpha_{t}y^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)\leq 1. Hence, the factor ZZ must be positive and bounded.

Now, we consider the following integral

exp(xαty22βt2)yp(dy)yp(dy)<.\displaystyle\int{\exp\left(-\frac{\|x-\alpha_{t}y\|^{2}}{2\beta_{t}^{2}}\right)\|y\|}p(dy)\leq\int\|y\|p(dy)<\infty.

The last inequality follows from the fact that pp has a finite 1-moment. Hence, the posterior distribution p(|𝑿t=x)p(\cdot|\bm{X}_{t}=x) has a finite 1-moment and the denoiser mtm_{t} is well-defined. ∎

Proof of Equation 10.

Since λ=logσ\lambda=-\log\sigma, we have that

dλdσ=1σ\frac{d\lambda}{d\sigma}=-\frac{1}{\sigma}

and thus the ODE in Equation 8 becomes

dxλdλ\displaystyle\frac{d{x}_{\lambda}}{d\lambda} =dxσ(λ)dσdσdλ=σlogqσ(xσ)(σ)\displaystyle=\frac{d{x}_{\sigma(\lambda)}}{d\sigma}\frac{d\sigma}{d\lambda}=-\sigma\nabla\log q_{\sigma}(x_{\sigma})\cdot\left(-{\sigma}\right)
=xσ+yexp(xσy22σ2)p(dy)exp(xσy22σ2)p(dy)=mλ(xλ)xλ.\displaystyle=-x_{\sigma}+\frac{\int y\exp\left(-\frac{\|x_{\sigma}-y\|^{2}}{2\sigma^{2}}\right)p(dy)}{\int\exp\left(-\frac{\|x_{\sigma}-y^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dy^{\prime})}=m_{\lambda}(x_{\lambda})-x_{\lambda}.
Proof of Proposition 3.2.

Recall that

mt(x)=exp(xαtz22βt2)zp(dz)exp(xαtz22βt2)p(dz) and p(dz|𝑿t=x)=exp(xαtz22βt2)p(dz)exp(xαtz22βt2)p(dz).m_{t}(x)=\frac{\int\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)z\,p(dz)}{\int\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)\,p(dz)}\text{ and }p(dz|\bm{X}_{t}=x)=\frac{\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)p(dz)}{\int\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)\,p(dz)}.

We let wt(x,z):=exp(xαtz22βt2)w_{t}(x,z):=\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right). Then,

xmt(x)=\displaystyle\nabla_{x}m_{t}(x)= zx(exp(xαtz22βt2)exp(xαtz22βt2)p(dz))p(dz)\displaystyle\int z\nabla_{x}\left(\frac{\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)}{\int\exp\left(-\frac{\|x-\alpha_{t}z^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dz^{\prime})}\right)p(dz)
=(wt(x,z)p(dz))2z(wt(x,z)(xαtzβt2)Twt(x,z)p(dz))p(dz)\displaystyle=\left(\int w_{t}(x,z^{\prime})p(dz^{\prime})\right)^{-2}\int z\left(w_{t}(x,z)\left(-\frac{x-\alpha_{t}z}{\beta_{t}^{2}}\right)^{T}\int w_{t}(x,z^{\prime})p(dz^{\prime})\right)p(dz)
(wt(x,z)p(dz))2z(wt(x,z)wt(x,z)(xαtzβt2)Tp(dz))p(dz)\displaystyle-\left(\int w_{t}(x,z^{\prime})p(dz^{\prime})\right)^{-2}\int z\left(w_{t}(x,z)\int w_{t}(x,z^{\prime})\left(-\frac{x-\alpha_{t}z^{\prime}}{\beta_{t}^{2}}\right)^{T}p(dz^{\prime})\right)p(dz)
=(wt(x,z)p(dz))2wt(x,z)wt(x,z)z(xαtzβt2+xαtzβt2)Tp(dz)p(dz)\displaystyle=\left(\int w_{t}(x,z^{\prime})p(dz^{\prime})\right)^{-2}\iint w_{t}(x,z)w_{t}(x,z^{\prime})z\left(-\frac{x-\alpha_{t}z}{\beta_{t}^{2}}+\frac{x-\alpha_{t}z^{\prime}}{\beta_{t}^{2}}\right)^{T}p(dz)p(dz^{\prime})
=αtβt2(wt(x,z)p(dz))2wt(x,z)wt(x,z)z(zz)Tp(dz)p(dz)\displaystyle=\frac{\alpha_{t}}{\beta_{t}^{2}}\left(\int w_{t}(x,z^{\prime})p(dz^{\prime})\right)^{-2}\iint w_{t}(x,z)w_{t}(x,z^{\prime})z\left(z-z^{\prime}\right)^{T}p(dz)p(dz^{\prime})
=αt2βt2(wt(x,z)p(dz))2wt(x,z)wt(x,z)(zz)(zz)Tp(dz)p(dz)\displaystyle=\frac{\alpha_{t}}{2\beta_{t}^{2}}\left(\int w_{t}(x,z^{\prime})p(dz^{\prime})\right)^{-2}\iint w_{t}(x,z)w_{t}(x,z^{\prime})(z-z^{\prime})\left(z-z^{\prime}\right)^{T}p(dz)p(dz^{\prime})
=αt2βt2(zz)(zz)Tp(dz|𝑿t=x)p(dz|𝑿t=x).\displaystyle=\frac{\alpha_{t}}{2\beta_{t}^{2}}\iint(z-z^{\prime})(z-z^{\prime})^{T}p(dz|\bm{X}_{t}=x)p(dz^{\prime}|\bm{X}_{t}=x).

The second equality follows from the quotient rule. ∎
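For an empirical p, this covariance identity can be tested against a finite-difference Jacobian. A minimal sketch (ours; the atoms, the schedule values \alpha_{t}=0.8, \beta_{t}=0.4, and the tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(6, 2))   # data atoms
alpha, beta = 0.8, 0.4

def weights(x):
    # Posterior weights p(z_i | X_t = x): softmax over -|x - alpha z_i|^2 / (2 beta^2).
    e = -np.sum((x - alpha * Z) ** 2, axis=1) / (2 * beta**2)
    w = np.exp(e - e.max())
    return w / w.sum()

def m(x):
    return weights(x) @ Z

x, h = np.array([0.2, -0.1]), 1e-6
J_fd = np.stack([(m(x + h * np.eye(2)[i]) - m(x - h * np.eye(2)[i])) / (2 * h)
                 for i in range(2)], axis=1)

# Proposition 3.2: the Jacobian equals (alpha_t / beta_t^2) times the posterior covariance.
w = weights(x)
mean = w @ Z
cov = (Z - mean).T @ (w[:, None] * (Z - mean))
assert np.allclose(J_fd, (alpha / beta**2) * cov, atol=1e-5)
print("Jacobian identity verified")
```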

Proof of Proposition 3.4.

Let Ω:=supp(p)\Omega:=\mathrm{supp}(p). Recall that

ut(x)\displaystyle u_{t}(x) =(logβt)x+βt(αt/βt)mt(x)\displaystyle=(\log\beta_{t})^{\prime}x+\beta_{t}\left(\alpha_{t}/\beta_{t}\right)^{\prime}m_{t}(x)
=β˙txαtmt(x)βt+α˙tmt(x).\displaystyle=\dot{\beta}_{t}\frac{x-\alpha_{t}m_{t}(x)}{\beta_{t}}+\dot{\alpha}_{t}m_{t}(x).

Then, we have that

mt(x)projΩ(x)\displaystyle\|m_{t}(x)-\mathrm{proj}_{\Omega}(x)\| mσt(x/αt)projΩ(x/αt)+projΩ(x/αt)projΩ(x).\displaystyle\leq\|m_{\sigma_{t}}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x/\alpha_{t})\|+\|\mathrm{proj}_{\Omega}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x)\|.

By Lemma A.8, there exists a positive constant CxC_{x} such that projΩ(x)projΩ(y)Cxxy\|\mathrm{proj}_{\Omega}(x)-\mathrm{proj}_{\Omega}(y)\|\leq C_{x}\|x-y\| for any yy close to xx. Therefore,

projΩ(x/αt)projΩ(x)Cxxαtx=O(1αt).\|\mathrm{proj}_{\Omega}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x)\|\leq C_{x}\left\|\frac{x}{\alpha_{t}}-x\right\|=O(1-\alpha_{t}).

Now, when Ω=d\Omega=\mathbb{R}^{d}, by Corollary 4.10, we have that

mσt(x/αt)projΩ(x/αt)=O(σt)=O(βt).\|m_{\sigma_{t}}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x/\alpha_{t})\|=O(\sigma_{t})=O(\beta_{t}).

In this case, projΩ(x)=x\mathrm{proj}_{\Omega}(x)=x. This implies that for tt close to 1, we have that

xαtmt(x)βt\displaystyle\left\|\frac{x-\alpha_{t}m_{t}(x)}{\beta_{t}}\right\| O(1αtβt)+O(1).\displaystyle\leq O\left(\frac{1-\alpha_{t}}{\beta_{t}}\right)+O(1).

The right hand side is bounded since limt11αtβt=limt1α˙tβ˙t=α˙1β˙1\lim_{t\to 1}\frac{1-\alpha_{t}}{\beta_{t}}=\lim_{t\to 1}\frac{-\dot{\alpha}_{t}}{\dot{\beta}_{t}}=\frac{-\dot{\alpha}_{1}}{\dot{\beta}_{1}} exists. Therefore, the vector field ut(x)u_{t}(x) is uniformly bounded for all t[0,1)t\in[0,1).

If Ωd\Omega\neq\mathbb{R}^{d} and xΩΣΩx\notin\Omega\cup\Sigma_{\Omega}, then projΩ(x)x\mathrm{proj}_{\Omega}(x)\neq x. By Corollary 4.5, we have that mt(x)projΩ(x)m_{t}(x)\to\mathrm{proj}_{\Omega}(x) as t1t\to 1. This implies that xαtmt(x)xprojΩ(x)>0\|x-\alpha_{t}m_{t}(x)\|\to\|x-\mathrm{proj}_{\Omega}(x)\|>0. Then, as the denominator βt0\beta_{t}\to 0, we have that limt1ut(x)=\lim_{t\to 1}\|u_{t}(x)\|=\infty. ∎
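The blow-up of u_{t} near t=1 is easy to observe numerically. The sketch below (ours) uses the rectified-flow schedule \alpha_{t}=t, \beta_{t}=1-t and an empirical p supported on two points; for x off the support, \|u_{t}(x)\| grows like 1/(1-t).

```python
import numpy as np

Z = np.array([[1.0, 0.0], [-1.0, 0.0]])   # data support: two points

def u(x, t):
    # u_t = beta_t' (x - alpha_t m_t(x)) / beta_t + alpha_t' m_t(x) with alpha_t = t, beta_t = 1 - t.
    e = -np.sum((x - t * Z) ** 2, axis=1) / (2 * (1 - t) ** 2)
    w = np.exp(e - e.max())
    m = (w / w.sum()) @ Z                 # denoiser m_t(x)
    return -(x - t * m) / (1 - t) + m

x = np.array([0.5, 0.8])                  # off the support and off the medial axis
for t in [0.9, 0.99, 0.999, 0.9999]:
    print(t, np.linalg.norm(u(x, t)))     # norms grow roughly like 1/(1 - t)
```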

The following result on the convergence of ODEs under asymptotically vanishing perturbations will be used often in the proofs of this section.

Lemma B.1.

Let (y_{t})_{t\in[t_{1},\infty)} be a differentiable trajectory of non-negative real numbers satisfying the following differential inequality

dytdtkyt+ϕ(t),\frac{dy_{t}}{dt}\leq-k\,y_{t}+\phi(t),

where k>0k>0 and ϕ:\phi:\mathbb{R}\to\mathbb{R}. Then, we have:

  1.

    For any t2>t1t_{2}>t_{1} we have that

    yt2ek(t2t1)yt1+t1t2ek(t2t)ϕ(t)𝑑t.y_{t_{2}}\leq e^{-k(t_{2}-t_{1})}y_{t_{1}}+\int_{t_{1}}^{t_{2}}e^{-k(t_{2}-t)}\phi(t)dt.
  2.

    If limtϕ(t)=0\lim_{t\to\infty}\phi(t)=0, then limtyt=0\lim_{t\to\infty}y_{t}=0;

Proof of Lemma B.1.

By multiplying the integrating factor ekte^{kt}, we have that

dektytdt\displaystyle\frac{d\,e^{kt}y_{t}}{dt} =ektdytdt+kektytϕ(t)ekt.\displaystyle=e^{kt}\frac{d\,y_{t}}{dt}+ke^{kt}y_{t}\leq\phi(t)e^{kt}.

Then for all t2>t1t_{2}>t_{1}, we have

ekt2yt2\displaystyle e^{kt_{2}}y_{t_{2}} ekt1yt1+t1t2ektϕ(t)𝑑t,\displaystyle\leq e^{kt_{1}}y_{t_{1}}+\int_{t_{1}}^{t_{2}}e^{kt}\phi(t)dt,
yt2\displaystyle y_{t_{2}} ek(t2t1)yt1+t1t2ek(t2t)ϕ(t)𝑑t.\displaystyle\leq e^{-k(t_{2}-t_{1})}y_{t_{1}}+\int_{t_{1}}^{t_{2}}e^{-k(t_{2}-t)}\phi(t)dt.

This proves Item 1.

For Item 2, we will first show that y_{t} is bounded. Since \phi(t) decays to zero as t goes to infinity, |\phi(t)| is bounded: there exists some constant C>0 such that |\phi(t)|\leq C for all t\geq t_{1}. Then, we have that

yt2\displaystyle y_{t_{2}} ek(t2t1)yt1+Ct1t2ek(t2t)𝑑t,\displaystyle\leq e^{-k(t_{2}-t_{1})}y_{t_{1}}+C\int_{t_{1}}^{t_{2}}e^{-k(t_{2}-t)}dt,
ek(t2t1)yt1+Ck(1ek(t2t1)),\displaystyle\leq e^{-k(t_{2}-t_{1})}y_{t_{1}}+\frac{C}{k}(1-e^{-k(t_{2}-t_{1})}),
ek(t2t1)yt1+Ck.\displaystyle\leq e^{-k(t_{2}-t_{1})}y_{t_{1}}+\frac{C}{k}.

Hence, yty_{t} is bounded for all tt1t\geq t_{1}, and we denote by Cy>0C_{y}>0 any bound for yty_{t}.

Next, we show that yty_{t} converges to zero as tt goes to infinity. For any ϵ>0\epsilon>0, there is a sufficient large tϵ>t1t_{\epsilon}>t_{1} such that

  1.

    ektϵCy<ϵ/2e^{-kt_{\epsilon}}C_{y}<\epsilon/2.

  2.

    |ϕ(t)|<ϵ/2|\phi(t)|<\epsilon/2 for all ttϵt\geq t_{\epsilon}.

Then for all t2>2tϵt_{2}>2t_{\epsilon}, by integrating the inequality from tϵt_{\epsilon} to t2t_{2}, we have that

yt2\displaystyle y_{t_{2}} ek(t2tϵ)ytϵ+tϵt2ek(t2t)ϕ(t)𝑑t,\displaystyle\leq e^{-k(t_{2}-t_{\epsilon})}y_{t_{\epsilon}}+\int_{t_{\epsilon}}^{t_{2}}e^{-k(t_{2}-t)}\phi(t)dt,
ek(tϵ)Cy+ϵ/2tϵt2ek(t2t)𝑑t,\displaystyle\leq e^{-k(t_{\epsilon})}C_{y}+\epsilon/2\int_{t_{\epsilon}}^{t_{2}}e^{-k(t_{2}-t)}dt,
ϵ/2+ϵ/2(1ek(t2tϵ)),\displaystyle\leq\epsilon/2+\epsilon/2(1-e^{-k(t_{2}-t_{\epsilon})}),
ϵ.\displaystyle\leq\epsilon.

This implies that yty_{t} converges to zero as tt goes to infinity. ∎
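A short numerical sketch (ours; k, \phi, and the step size are arbitrary) illustrating both items: we integrate dy/dt=-ky+\frac{1}{2}\phi(t) with \phi(t)=e^{-t/2} by forward Euler, so the trajectory satisfies the differential inequality with \phi, and we check the bound of Item 1 and the decay of Item 2.

```python
import numpy as np

k, dt, T = 2.0, 1e-3, 20.0
phi = lambda t: np.exp(-0.5 * t)     # vanishing perturbation

# Forward-Euler trajectory of dy/dt = -k y + 0.5 * phi(t) <= -k y + phi(t).
t, y = 0.0, 5.0
ts, ys = [t], [y]
while t < T:
    y += dt * (-k * y + 0.5 * phi(t))
    t += dt
    ts.append(t)
    ys.append(y)

s, t2 = np.array(ts), ts[-1]
# Item 1: y_{t2} <= e^{-k (t2 - t1)} y_{t1} + int e^{-k (t2 - s)} phi(s) ds, here t1 = 0.
integral = np.sum(np.exp(-k * (t2 - s[:-1])) * phi(s[:-1]) * np.diff(s))
assert ys[-1] <= np.exp(-k * t2) * ys[0] + integral

# Item 2: y_t -> 0 as t -> infinity.
print("y at t = 20:", ys[-1])
```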

Proof of Theorem 3.6 and Remark 3.7.

We consider the change of variables \lambda=-\log(\sigma). We let z_{\lambda}:=x_{\sigma(\lambda)} for all \lambda\in[\lambda_{1}:=\lambda(\sigma_{1}),\lambda_{2}:=\lambda(\sigma_{2})]. Then, (z_{\lambda})_{\lambda\in[\lambda_{1},\lambda_{2}]} satisfies the ODE in Equation 10. By assumption, z_{\lambda_{1}} lies outside the convex set \Omega. For any \lambda\in[\lambda_{1},\lambda_{2}], by Corollary A.11 we have that

ddλdΩ2(zλ)\displaystyle\frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda}) =2zλprojΩ(zλ),zλmλ(zλ)\displaystyle=-2\langle z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda}),z_{\lambda}-m_{\lambda}(z_{\lambda})\rangle
=2(zλprojΩ(zλ),zλprojΩ(zλ)+zλprojΩ(zλ),projΩ(zλ)mλ(zλ))\displaystyle=-2\left(\langle z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda}),z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda})\rangle+\langle z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda}),\mathrm{proj}_{\Omega}(z_{\lambda})-m_{\lambda}(z_{\lambda})\rangle\right)

For the first item, we have that ddλdΩ2(zλ)2dΩ2(zλ)\frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda})\leq-2d_{\Omega}^{2}(z_{\lambda}). Multiplying with the exponential integrator e2(1ζ)λe^{2(1-\zeta)\lambda}, we have that

ddλ(e2(1ζ)λdΩ2(zλ))\displaystyle\frac{d}{d\lambda}(e^{2(1-\zeta)\lambda}d_{\Omega}^{2}(z_{\lambda})) 0\displaystyle\leq 0

Then for any λ[λ1,λ2]\lambda\in[\lambda_{1},\lambda_{2}], we have that

dΩ2(zλ)\displaystyle d^{2}_{\Omega}(z_{\lambda}) e2(1ζ)(λλ1)dΩ2(zλ1)\displaystyle\leq e^{-2(1-\zeta)(\lambda-\lambda_{1})}d^{2}_{\Omega}(z_{\lambda_{1}})

Using the change of variable, we have that

dΩ(zλ)σ1ζσ11ζdΩ(zλ1).d_{\Omega}(z_{\lambda})\leq\frac{\sigma^{1-\zeta}}{\sigma_{1}^{1-\zeta}}d_{\Omega}(z_{\lambda_{1}}).

For the second item, we have that

ddλdΩ2(zλ)\displaystyle\frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda}) 2dΩ2(zλ)+2ϕ(eλ)\displaystyle\leq-2d_{\Omega}^{2}(z_{\lambda})+2\phi(e^{-\lambda})

Then we can apply Item 2 of Lemma B.1 to obtain limλdΩ2(zλ)=0\lim_{\lambda\to\infty}d_{\Omega}^{2}(z_{\lambda})=0. This implies that limσ0dΩ2(xσ)=0\lim_{\sigma\to 0}d_{\Omega}^{2}(x_{\sigma})=0.

We then apply Item 1 of Lemma B.1 to obtain that for any λλ1\lambda\geq\lambda_{1}:

dΩ2(zλ)e2(λλ1)dΩ2(zλ1)+2λ1λe2(λt)ϕ(et)𝑑t.d_{\Omega}^{2}(z_{\lambda})\leq e^{-2(\lambda-\lambda_{1})}d_{\Omega}^{2}(z_{\lambda_{1}})+2\int_{\lambda_{1}}^{\lambda}e^{-2(\lambda-t)}\phi(e^{-t})dt.

Using the fact that a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for any a,b0a,b\geq 0, we have that

d_{\Omega}(z_{\lambda})\leq e^{-(\lambda-\lambda_{1})}d_{\Omega}(z_{\lambda_{1}})+e^{-\lambda}\sqrt{\int_{\lambda_{1}}^{\lambda}2e^{2t}\phi(e^{-t})\,dt}.

This proves Remark 3.7. ∎

Proof of Theorem 3.8.

We will utilize the parameter λ\lambda for this proof with which we consider the ODE trajectory (zλ:=xσ(λ))λ[λ1,λ2)(z_{\lambda}:=x_{\sigma(\lambda)})_{\lambda\in[\lambda_{1},\lambda_{2})} of Equation 10 with λ1=log(σ1)\lambda_{1}=-\log(\sigma_{1}) and λ2=log(σ2)\lambda_{2}=-\log(\sigma_{2}).

For Item 1, we will show that the trajectory z_{\lambda} must stay inside B_{r}(\Omega) for all \lambda\in[\lambda_{1},\lambda_{2}) whenever z_{\lambda_{1}}\in B_{r}(\Omega). Suppose otherwise; then there exists some first time \lambda_{o}>\lambda_{1} such that z_{\lambda_{o}}\in\partial B_{r}(\Omega). By the assumption that \overline{B_{r}(\Omega)}\cap\Sigma_{\Omega}=\emptyset, we can apply Lemma A.10 to obtain the derivative of the squared distance function along the trajectory for any \lambda\in[\lambda_{1},\lambda_{o}]:

\frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda})=-2\langle m_{\lambda}(z_{\lambda})-z_{\lambda},\mathrm{proj}_{\Omega}(z_{\lambda})-z_{\lambda}\rangle.

This implies that \frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda})|_{\lambda=\lambda_{o}}<0. By the continuity of the derivative above, we have that \frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda})<0 for all \lambda\in[\lambda_{o}-\epsilon,\lambda_{o}] for some \epsilon>0. Note that d_{\Omega}^{2}(z_{\lambda_{o}})=r^{2} and d_{\Omega}^{2}(z_{\lambda_{o}-\epsilon})<r^{2}. Then by the mean value theorem, there exists some \lambda_{i}\in(\lambda_{o}-\epsilon,\lambda_{o}) such that \frac{d}{d\lambda}d_{\Omega}^{2}(z_{\lambda_{i}})>0. This leads to a contradiction and hence the trajectory z_{\lambda} must stay inside B_{r}(\Omega) for all \lambda\in[\lambda_{1},\lambda_{2}), which concludes the proof of Item 1.

The second part of the theorem follows straightforwardly from the fact that Ω\Omega is a closed set. If a trajectory starts from Ω\Omega but leaves Ω\Omega at some point, it must also leave a neighborhood of Ω\Omega, which would contradict the given assumption. This completes the proof. ∎

Proof of Proposition 3.9.

For the first item, for any xBr(K)x\in\partial B_{r}(K), since mσ(x)m_{\sigma}(x) lies in the convex set KK, we apply Corollary A.5 to conclude

mσ(x)projK(x),projK(x)x0.\langle m_{\sigma}(x)-\mathrm{proj}_{K}(x),\mathrm{proj}_{K}(x)-x\rangle\geq 0.

Then we have that

mσ(x)x,projK(x)x=mσ(x)projK(x),projK(x)x+projK(x)x2r2>0.\langle m_{\sigma}(x)-x,\mathrm{proj}_{K}(x)-x\rangle=\langle m_{\sigma}(x)-\mathrm{proj}_{K}(x),\mathrm{proj}_{K}(x)-x\rangle+\|\mathrm{proj}_{K}(x)-x\|^{2}\geq r^{2}>0.

We then apply Theorem 3.8 to conclude that Br(K)B_{r}(K) is absorbing for (xσ)σ(σ2,σ1](x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]}.

For the second item, the distance function d_{K}^{2}(x) does not distinguish between the interior and the boundary of K, so we utilize a supporting hyperplane instead. Assume the trajectory (x_{\sigma})_{\sigma\in(\sigma_{2},\sigma_{1}]} leaves K at some first time \sigma_{o}\in(\sigma_{2},\sigma_{1}]. Then, in particular, x_{\sigma}\in K for all \sigma\in[\sigma_{o},\sigma_{1}] and x_{\sigma_{o}}\in\partial K. In terms of the parameter \lambda=-\log(\sigma), we have z_{\lambda}\in K for all \lambda\in[\lambda_{1}:=-\log(\sigma_{1}),\lambda_{o}:=-\log(\sigma_{o})] and z_{\lambda_{o}}\in\partial K. Since K is a closed convex set, there exists a supporting hyperplane H at z_{\lambda_{o}} such that H(z_{\lambda_{o}})=0 and H(y)\geq 0 for all y\in K. In particular, we can write H(y)=\langle y-z_{\lambda_{o}},n\rangle, where n is the unit normal vector of the hyperplane. Since z_{\lambda}\in K for all \lambda\in[\lambda_{1},\lambda_{o}] and H(z_{\lambda_{o}})\leq H(z_{\lambda}) for all \lambda\in[\lambda_{1},\lambda_{o}], we must have that \frac{dH(z_{\lambda})}{d\lambda}|_{\lambda=\lambda_{o}^{-}}\leq 0. Therefore, we have that

dH(zλ)dλ|λ=λo=dzλdλ|λ=λo,n=mλo(zλo)zλo,n0.\frac{dH(z_{\lambda})}{d\lambda}|_{\lambda=\lambda_{o}^{-}}=\left\langle\frac{dz_{\lambda}}{d\lambda}|_{\lambda=\lambda_{o}},n\right\rangle=\langle m_{\lambda_{o}}(z_{\lambda_{o}})-z_{\lambda_{o}},n\rangle\leq 0.

Since m_{\lambda_{o}}(z_{\lambda_{o}}) lies in the interior of K, we must have that \langle m_{\lambda_{o}}(z_{\lambda_{o}})-z_{\lambda_{o}},n\rangle>0. This leads to a contradiction and hence the trajectory z_{\lambda} must stay inside K for all \lambda\in[\lambda_{1},\lambda_{2}:=-\log(\sigma_{2})). ∎
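The absorbing behavior can be observed by integrating the \lambda-parametrized ODE (Equation 10) for an empirical p. In the sketch below (ours; the disk-supported data and step sizes are arbitrary), a trajectory started outside the convex support is drawn into it and its distance to the support decays.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
rad = np.sqrt(rng.uniform(0.0, 1.0, 200))
Z = np.c_[rad * np.cos(theta), rad * np.sin(theta)]   # samples in the unit disk K

def m(z, sigma):
    # Denoiser for the empirical distribution: softmax-weighted mean.
    e = -np.sum((z - Z) ** 2, axis=1) / (2 * sigma**2)
    w = np.exp(e - e.max())
    return (w / w.sum()) @ Z

# Integrate dz/dlambda = m_lambda(z) - z with sigma = e^{-lambda} (forward Euler).
lam, dlam = 0.0, 1e-2
z = np.array([2.0, 1.0])                              # starts outside K
d_start = max(np.linalg.norm(z) - 1.0, 0.0)
for _ in range(800):
    z += dlam * (m(z, np.exp(-lam)) - z)
    lam += dlam
d_end = max(np.linalg.norm(z) - 1.0, 0.0)
print("d_K at start: %.3f, after integration: %.2e" % (d_start, d_end))
```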

Proof of Theorem 3.11.

Similarly as in the proofs of other theorems in this section, we consider the change of variable λ=log(σ)\lambda=-\log(\sigma) and zλ:=xσ(λ)z_{\lambda}:=x_{\sigma}(\lambda).

We let C:=C_{\zeta,\tau,2R} and define (the definition is motivated by the following proof)

λδ:=max{1ζmin{log(δζ2C),2log(ζ8(2ζ)δC),log(δ(2ζ)4C)},Λ}.\lambda_{\delta}:=\max\left\{-\frac{1}{\zeta}\cdot\min\left\{\log\left(\frac{\delta\zeta}{2C}\right),2\log\left(\frac{\zeta}{8}\sqrt{\frac{(2-\zeta)\delta}{C}}\right),\log\left(\frac{\delta(2-\zeta)}{4C}\right)\right\},\Lambda\right\}.

Then, σδ:=eλδ\sigma_{\delta}:=e^{-\lambda_{\delta}}.

By continuity of the ODE path, there exists a maximal interval I=[\lambda_{\delta},\Lambda_{\delta}] so that for any \lambda\in I, z_{\lambda}\in\overline{B_{2R}(0)}\cap\overline{B_{2\delta}(\Omega)}, where the overline indicates the closure of the underlying set. Now, we show that \Lambda_{\delta} must be infinite by showing that otherwise z_{\Lambda_{\delta}} lies in the interior, i.e., z_{\Lambda_{\delta}}\in B_{2R}(0)\cap B_{2\delta}(\Omega), and hence the trajectory can be extended within B_{2R}(0)\cap B_{2\delta}(\Omega) to \Lambda_{\delta}+\epsilon for some small \epsilon>0. In this way, we can extend the interval I to [\lambda_{\delta},\Lambda_{\delta}+\epsilon], which contradicts the maximality of I.

Now, assume that Λδ<\Lambda_{\delta}<\infty. Then, by Corollary A.11, we have that for any λI\lambda\in I,

d(dΩ2(zλ))dλ\displaystyle\frac{d(d_{\Omega}^{2}(z_{\lambda}))}{d\lambda} =2zλprojΩ(zλ),zλmλ(zλ)\displaystyle=-2\langle z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda}),z_{\lambda}-m_{\lambda}(z_{\lambda})\rangle
=2(zλprojΩ(zλ),zλprojΩ(zλ)+zλprojΩ(zλ),projΩ(zλ)mλ(zλ))\displaystyle=-2\left(\langle z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda}),z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda})\rangle+\langle z_{\lambda}-\mathrm{proj}_{\Omega}(z_{\lambda}),\mathrm{proj}_{\Omega}(z_{\lambda})-m_{\lambda}(z_{\lambda})\rangle\right)
2dΩ2(zλ)+2dΩ(zλ)mλ(zλ)projΩ(zλ).\displaystyle\leq-2d_{\Omega}^{2}(z_{\lambda})+{2}d_{\Omega}(z_{\lambda})\|m_{\lambda}(z_{\lambda})-\mathrm{proj}_{\Omega}(z_{\lambda})\|.

By our assumption on zλz_{\lambda}, we have that dΩ(zλ)<2δd_{\Omega}(z_{\lambda})<2\delta and for any λ>Λ\lambda>\Lambda, mλ(zλ)projΩ(zλ)Ceζλ\|m_{\lambda}(z_{\lambda})-\mathrm{proj}_{\Omega}(z_{\lambda})\|\leq C\cdot e^{-\zeta\lambda}. Then, by Remark 3.7, we have that

dΩ(zλ)\displaystyle d_{\Omega}(z_{\lambda}) e(λλδ)dΩ(zλδ)+eλλδλ4Cδe2teζt𝑑t\displaystyle\leq e^{-(\lambda-\lambda_{\delta})}d_{\Omega}(z_{\lambda_{\delta}})+e^{-\lambda}\sqrt{\int_{\lambda_{\delta}}^{\lambda}4C\delta e^{2t}e^{-\zeta t}dt}
e(λλδ)dΩ(zλδ)+4δC2ζ(eζλe(2ζ)λδ2λ)4δC2ζeζλδ<2δ,\displaystyle\leq e^{-(\lambda-\lambda_{\delta})}d_{\Omega}(z_{\lambda_{\delta}})+\underbrace{\sqrt{\frac{{4}\delta C}{2-\zeta}\left(e^{-\zeta\lambda}-e^{(2-\zeta)\lambda_{\delta}-2\lambda}\right)}}_{\leq\sqrt{\frac{{4}\delta C}{2-\zeta}e^{-\zeta\lambda}}\leq\delta}<2\delta,

where the inequality under the brace bracket follows from the definition of λδ\lambda_{\delta}. Hence, zΛδB2δ(Ω)z_{\Lambda_{\delta}}\in{B_{2\delta}(\Omega)}.

Now we examine zλ\|z_{\lambda}\| along the integral path. We have that

zszλδ\displaystyle\|z_{s}-z_{\lambda_{\delta}}\| λδsmλ(zλ)zλ𝑑λ\displaystyle\leq\int_{\lambda_{\delta}}^{s}\|m_{\lambda}(z_{\lambda})-z_{\lambda}\|d\lambda
λδsmλ(zλ)projΩ(zλ)+projΩ(zλ)zλdλ\displaystyle\leq\int_{\lambda_{\delta}}^{s}\|m_{\lambda}(z_{\lambda})-\mathrm{proj}_{\Omega}(z_{\lambda})\|+\|\mathrm{proj}_{\Omega}(z_{\lambda})-z_{\lambda}\|d\lambda
λδsCeζλ+dΩ(zλ)dλ.\displaystyle\leq\int_{\lambda_{\delta}}^{s}Ce^{-\zeta\lambda}+d_{\Omega}(z_{\lambda})d\lambda.

Now, for the integral of dΩ(zλ)d_{\Omega}(z_{\lambda}), we have that

λδsdΩ(zλ)𝑑λ\displaystyle\int_{\lambda_{\delta}}^{s}d_{\Omega}(z_{\lambda})d\lambda λδse(λλδ)dΩ(zλδ)+4δC2ζ(eζλe(2ζ)λδ2λ)dλ\displaystyle\leq\int_{\lambda_{\delta}}^{s}e^{-(\lambda-\lambda_{\delta})}d_{\Omega}(z_{\lambda_{\delta}})+{\sqrt{\frac{{4}\delta C}{2-\zeta}\left(e^{-\zeta\lambda}-e^{(2-\zeta)\lambda_{\delta}-2\lambda}\right)}}d\lambda
λδse(λλδ)δ+4δC2ζeζλdλ\displaystyle\leq\int_{\lambda_{\delta}}^{s}e^{-(\lambda-\lambda_{\delta})}\delta+{\sqrt{\frac{{4}\delta C}{2-\zeta}e^{-\zeta\lambda}}}d\lambda
=δ(1es+λδ)+4δC2ζ2ζ(eζλδ/2eζs/2).\displaystyle=\delta(1-e^{-s+\lambda_{\delta}})+\sqrt{\frac{{4}\delta C}{2-\zeta}}\cdot\frac{2}{\zeta}\left(e^{-\zeta\lambda_{\delta}/2}-e^{-\zeta s/2}\right).

Therefore,

zszλδ\displaystyle\|z_{s}-z_{\lambda_{\delta}}\| Cζ(eζs+eζλδ)Cζeζλδδ/2+δ(1es+λδ)+4δC2ζ2ζ(eζλδ/2eζs/2)4δC2ζ2ζeζλδ/2δ/22δ\displaystyle\leq\underbrace{\frac{C}{\zeta}(-e^{-\zeta s}+e^{-\zeta\lambda_{\delta}})}_{\leq\frac{C}{\zeta}e^{-\zeta\lambda_{\delta}}\leq\delta/2}+\delta(1-e^{-s+\lambda_{\delta}})+\underbrace{\sqrt{\frac{{4}\delta C}{2-\zeta}}\cdot\frac{2}{\zeta}\left(e^{-\zeta\lambda_{\delta}/2}-e^{-\zeta s/2}\right)}_{\leq\sqrt{\frac{{4}\delta C}{2-\zeta}}\cdot\frac{2}{\zeta}e^{-\zeta\lambda_{\delta}/2}\leq\delta/2}\leq 2\delta (20)

where we used the definition of λδ\lambda_{\delta} again to control all the exponential terms. This implies that zsR+2δ<2R\|z_{s}\|\leq R+2\delta<2R for all s[λδ,Λδ]s\in[\lambda_{\delta},\Lambda_{\delta}] (recall that we assumed that R>2δR>2\delta) and hence zΛδB2R(0)z_{\Lambda_{\delta}}\in B_{2R}(0). This concludes the proof. ∎

B.3 Proofs in Section 4

Proof of Proposition 4.1.

Let R_{1}=\|x-\mathbb{E}[\bm{X}]\|. Consider the function g_{\sigma}(y)=\exp\left(-\frac{\|x-y\|^{2}}{2\sigma^{2}}\right). By the fact that \|y-\mathbb{E}[\bm{X}]\|\leq\mathrm{diam}(\Omega) for any y\in\Omega, we have that:

exp((R1+diam(Ω))22σ2)gσ(y)exp((R1diam(Ω))22σ2)\exp\left(-\frac{(R_{1}+\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)\leq g_{\sigma}(y)\leq\exp\left(-\frac{(R_{1}-\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)

for all yΩy\in\Omega. Then for any Borel measurable set AΩA\subseteq\Omega, we can bound the ratio of the posterior and prior measures as:

p(A|𝑿σ=x)p(A)\displaystyle\frac{p(A|\bm{X}_{\sigma}=x)}{p(A)} =Agσ(y)p(dy)p(A)Ωgσ(y)p(dy)\displaystyle=\frac{\int_{A}g_{\sigma}(y)p(dy)}{p(A)\int_{\Omega}g_{\sigma}(y)p(dy)}
exp((R1diam(Ω))22σ2)Ap(dy)p(A)exp((R1+diam(Ω))22σ2)Ωp(dy)\displaystyle\leq\frac{\exp\left(-\frac{(R_{1}-\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)\int_{A}p(dy)}{p(A)\exp\left(-\frac{(R_{1}+\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)\int_{\Omega}p(dy)}
=exp((R1+diam(Ω))22σ2(R1diam(Ω))22σ2)a.\displaystyle=\underbrace{\exp\left(\frac{(R_{1}+\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}-\frac{(R_{1}-\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)}_{a}.

Similarly, we can bound the ratio from below:

p(A|𝑿σ=x)p(A)\displaystyle\frac{p(A|\bm{X}_{\sigma}=x)}{p(A)} exp((R1+diam(Ω))22σ2)Ap(dy)p(A)exp((R1diam(Ω))22σ2)Ωp(dy)\displaystyle\geq\frac{\exp\left(-\frac{(R_{1}+\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)\int_{A}p(dy)}{p(A)\exp\left(-\frac{(R_{1}-\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)\int_{\Omega}p(dy)}
=exp((R1diam(Ω))22σ2(R1+diam(Ω))22σ2)\displaystyle=\exp\left(\frac{(R_{1}-\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}-\frac{(R_{1}+\mathrm{diam}(\Omega))^{2}}{2\sigma^{2}}\right)
=1/a\displaystyle=1/a

Since a>1a>1, we have |1/a1|a1|1/a-1|\leq a-1. Therefore, we have:

|p(A|𝑿σ=x)p(A)1|a1\left|\frac{p(A|\bm{X}_{\sigma}=x)}{p(A)}-1\right|\leq a-1

In particular, we have |p(A|\bm{X}_{\sigma}=x)-p(A)|\leq a-1 for all A\subseteq\Omega, which by definition bounds the total variation distance between p(\cdot|\bm{X}_{\sigma}=x) and p by a-1. Then the Wasserstein distance d_{W,1}(p(\cdot|\bm{X}_{\sigma}=x),p) can be bounded by (a-1)\mathrm{diam}(\Omega); see e.g. [Vil03, Proposition 7.10]. ∎

Proof of Theorem 4.2.

We let Φx:=dΣΩ(x)/2>0\Phi_{x}:=d_{\Sigma_{\Omega}}(x)/2>0 and Δx:=dΩ(x)0\Delta_{x}:=d_{\Omega}(x)\geq 0. We let xΩ:=projΩ(x)x_{\Omega}:=\mathrm{proj}_{\Omega}(x). According to Corollary A.7, for any ϵ>0\epsilon>0, if we define the radius

rx,ϵ:=Δx2+ϵ2(1ΔxΔx+Φx)r_{x,\epsilon}:=\sqrt{\Delta_{x}^{2}+\epsilon^{2}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)}

then we have the following inclusion

Brx,ϵ(x)ΩBϵ(xΩ)Ω.B_{r_{x,\epsilon}}(x)\cap\Omega\subseteq B_{\epsilon}(x_{\Omega})\cap\Omega.

For the other direction, we have that

Brx,ϵ(xΩ)ΩBrx,ϵ(x)ΩB_{r^{*}_{x,\epsilon}}(x_{\Omega})\cap\Omega\subseteq B_{r_{x,\epsilon}}(x)\cap\Omega

where

rx,ϵ:=rx,ϵΔx=Φx(Φx+Δx)(rx,ϵ+Δx)ϵ2.r^{*}_{x,\epsilon}:=r_{x,\epsilon}-\Delta_{x}=\frac{\Phi_{x}}{(\Phi_{x}+\Delta_{x})\left(r_{x,\epsilon}+\Delta_{x}\right)}\epsilon^{2}.

Here rx,ϵr^{*}_{x,\epsilon} satisfies that as dΩ(x)0d_{\Omega}(x)\to 0, rx,ϵϵr^{*}_{x,\epsilon}\to\epsilon. So Brx,ϵ(xΩ)B_{r^{*}_{x,\epsilon}}(x_{\Omega}) can be thought of as an approximation of Bϵ(xΩ)B_{\epsilon}(x_{\Omega}).

Then, we estimate the two terms below separately:

dW,2(p(|𝑿σ=x),δxΩ)2=Ωx0xΩ2exp(xx022σ2)Ωexp(xx022σ2)p(dx0)p(dx0)\displaystyle d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}})^{2}=\int_{\Omega}\left\|x_{0}-x_{\Omega}\right\|^{2}\frac{\exp\left(-\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}\right)}{\int_{\Omega}\exp\left(-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})
=Brx,ϵ(x)Ωx0xΩ2exp(xx022σ2)Ωexp(xx022σ2)p(dx0)p(dx0)I1\displaystyle=\underbrace{\int_{B_{r_{x,\epsilon}}(x)\cap\Omega}\left\|x_{0}-x_{\Omega}\right\|^{2}\frac{\exp\left(-\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}\right)}{\int_{\Omega}\exp\left(-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})}_{I_{1}}
+\underbrace{\int_{B_{r_{x,\epsilon}}(x)^{c}\cap\Omega}\left\|x_{0}-x_{\Omega}\right\|^{2}\frac{\exp\left(-\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}\right)}{\int_{\Omega}\exp\left(-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})}_{I_{2}}

For I1I_{1}, we have the following estimate:

I1Bϵ(xΩ)Ωx0xΩ2exp(xx022σ2)Ωexp(xx022σ2)p(dx0)p(dx0)ϵ2\displaystyle I_{1}\leq\int_{B_{\epsilon}(x_{\Omega})\cap\Omega}\|x_{0}-x_{\Omega}\|^{2}\frac{\exp\left(-\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}\right)}{\int_{\Omega}\exp\left(-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})\leq\epsilon^{2}

For the second term I2I_{2}, we have the following estimate:

I_{2} \leq\int_{B_{r_{x,\epsilon}}(x)^{c}\cap\Omega}(2\|x_{0}\|^{2}+2\|x_{\Omega}\|^{2})\frac{\exp\left(-\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}\right)}{\int_{\Omega}\exp\left(-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})
=\int_{B_{r_{x,\epsilon}}(x)^{c}\cap\Omega}\frac{2\|x_{0}\|^{2}+2\|x_{\Omega}\|^{2}}{\int_{\Omega}\exp\left(\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})
\leq\int_{B_{r_{x,\epsilon}}(x)^{c}\cap\Omega}\frac{2\|x_{0}\|^{2}+2\|x_{\Omega}\|^{2}}{\int_{B_{r_{x,\epsilon/\sqrt{2}}}(x)\cap\Omega}\exp\left(\frac{\|x-x_{0}\|^{2}}{2\sigma^{2}}-\frac{\|x-x_{0}^{\prime}\|^{2}}{2\sigma^{2}}\right)p(dx_{0}^{\prime})}p(dx_{0})
\leq\int_{B_{r_{x,\epsilon}}(x)^{c}\cap\Omega}\frac{2\|x_{0}\|^{2}+2\|x_{\Omega}\|^{2}}{\int_{B_{r_{x,\epsilon/\sqrt{2}}}(x)\cap\Omega}\exp\left(\frac{\epsilon^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)p(dx_{0}^{\prime})}p(dx_{0})
=\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)\frac{2\|x_{\Omega}\|^{2}+2\mathsf{M}_{2}(p)}{p\left(B_{r_{x,\epsilon/\sqrt{2}}}(x)\cap\Omega\right)}
\leq\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)\frac{2\|x_{\Omega}\|^{2}+2\mathsf{M}_{2}(p)}{p\left(B_{r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right)}

Since Ω\Omega is the support of pp, we have that p(Brx,ϵ/2(xΩ))>0p\left(B_{r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right)>0. Hence, for any ϵ>0\epsilon>0, we have

I1+I2\displaystyle I_{1}+I_{2} ϵ2+2xΩ2+2𝖬2(p)p(Brx,ϵ/2(xΩ))exp(ϵ24σ2(1ΔxΔx+Φx))\displaystyle\leq\epsilon^{2}+\frac{2\|x_{\Omega}\|^{2}+2\mathsf{M}_{2}(p)}{p\left(B_{r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right)}\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)

By letting \sigma\to 0, the exponential term vanishes and hence \limsup_{\sigma\to 0}(I_{1}+I_{2})\leq\epsilon^{2}. Since \epsilon>0 is arbitrary, we have that

limσ0dW,2(p(|𝑿σ=x),δxΩ)=0.\lim_{\sigma\to 0}d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}})=0.
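This concentration is visible numerically. The sketch below (ours) discretizes the uniform distribution on the unit circle by Monte Carlo samples and evaluates the posterior second moment about x_{\Omega}, which is exactly d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}})^{2} for a Dirac target; it decreases toward the Monte Carlo resolution as \sigma\to 0.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = rng.uniform(0.0, 2.0 * np.pi, 5000)
Z = np.c_[np.cos(theta), np.sin(theta)]   # Monte Carlo stand-in for p on the circle

x = np.array([1.2, 0.9])
x_proj = x / np.linalg.norm(x)            # proj_Omega(x) for the circle

def w2_to_delta(sigma):
    # For a Dirac target, d_{W,2} is the root posterior second moment about x_proj.
    e = -np.sum((x - Z) ** 2, axis=1) / (2 * sigma**2)
    w = np.exp(e - e.max())
    w /= w.sum()
    return np.sqrt(w @ np.sum((Z - x_proj) ** 2, axis=1))

for sigma in [1.0, 0.3, 0.1, 0.03, 0.01]:
    print(sigma, w2_to_delta(sigma))      # decreases toward 0 as sigma -> 0
```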
Proof of Corollary 4.4.

This is a direct consequence of the following property of the Wasserstein distance, where the case q=1 was proved in [RTG98] and the other cases follow from the fact that d_{\mathrm{W},q} is increasing with respect to q [Vil09].

Lemma B.2.

For any probability measures μ,ν\mu,\nu on d\mathbb{R}^{d} and any 1q1\leq q\leq\infty, we have that

dW,q(μ,ν)mean(μ)mean(ν).d_{\mathrm{W},q}(\mu,\nu)\geq\|\mathrm{mean}(\mu)-\mathrm{mean}(\nu)\|.

Proof of Corollary 4.5.

In the proof of Theorem 4.2, we end up with

dW,2(p(|𝑿σ=x),δxΩ)2ϵ2+exp(ϵ24σ2(1ΔxΔx+Φx))2xΩ2+2𝖬2(p)p(Brx,ϵ/2(xΩ))d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\delta_{x_{\Omega}})^{2}\leq\epsilon^{2}+\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)\frac{2\|x_{\Omega}\|^{2}+2\mathsf{M}_{2}(p)}{p\left(B_{r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right)}

for any arbitrarily chosen small ϵ\epsilon.

By Lemma A.8, there exists a small ξ>0\xi>0 and a constant Cξ>0C_{\xi}>0 such that for any zx<ξ\|z-x\|<\xi, one has zΩxΩ<Cξzx\|z_{\Omega}-x_{\Omega}\|<C_{\xi}\|z-x\|. As both Δx\Delta_{x} and Φx\Phi_{x} are continuous in a local neighborhood of xx, there exists a small ξ1>0\xi_{1}>0 such that for any zz with zx<ξ1\|z-x\|<\xi_{1}, one has

  • rz,ϵ/2>12rx,ϵ/2;r^{*}_{z,\epsilon/\sqrt{2}}>\frac{1}{2}r^{*}_{x,\epsilon/\sqrt{2}};

  • 1-\frac{\Delta_{z}}{\Delta_{z}+\Phi_{z}}>\frac{1}{2}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right);

  • zΩ2xΩ.\|z_{\Omega}\|\leq 2\|x_{\Omega}\|.

Now, take \Xi:=\min\left\{\xi,\xi_{1},\frac{1}{4C_{\xi}}r^{*}_{x,\epsilon/\sqrt{2}}\right\}. Then, it is easy to check that for any z such that \|z-x\|<\Xi, one has

B14rx,ϵ/2(xΩ)Brz,ϵ/2(zΩ).B_{\frac{1}{4}r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\subset B_{r^{*}_{z,\epsilon/\sqrt{2}}}(z_{\Omega}).

This implies that for any zz such that zx<Ξ\|z-x\|<\Xi, one has

\|m_{\sigma}(z)-z_{\Omega}\|^{2}\leq d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=z),\delta_{z_{\Omega}})^{2}\leq\epsilon^{2}+\exp\left(-\frac{\epsilon^{2}}{8\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)\frac{8\|x_{\Omega}\|^{2}+2\mathsf{M}_{2}(p)}{p\left(B_{\frac{1}{4}r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right)}.

Hence, when σ\sigma is small enough (dependent on ϵ\epsilon), we have that mσ(z)zΩ2ϵ\|m_{\sigma}(z)-z_{\Omega}\|\leq 2\epsilon for any zz such that zx<Ξ\|z-x\|<\Xi.

Now, we have that

mt(x)xΩmσt(x/αt)projΩ(x/αt)+projΩ(x/αt)projΩ(x).\|m_{t}(x)-x_{\Omega}\|\leq\|m_{\sigma_{t}}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x/\alpha_{t})\|+\|\mathrm{proj}_{\Omega}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x)\|.

As \alpha_{t}\to 1, there exists t_{0} such that for t>t_{0}, one has z=x/\alpha_{t}\in B_{\Xi}(x) and \|\mathrm{proj}_{\Omega}(x/\alpha_{t})-\mathrm{proj}_{\Omega}(x)\|\leq\epsilon. By the analysis above, we can further increase t_{0} so that \sigma_{t} is small enough that \|m_{\sigma_{t}}(z)-z_{\Omega}\|\leq 2\epsilon. Therefore, for any t>t_{0}, we have that \|m_{t}(x)-x_{\Omega}\|\leq 3\epsilon. Since \epsilon is arbitrary, we have that \lim_{t\to 1}m_{t}(x)=x_{\Omega}. ∎

Proof of Theorem 4.6.

Recall from the proof of Theorem 4.2 that for any ϵ>0\epsilon>0, we have that

I1+I2\displaystyle I_{1}+I_{2} ϵ2+2xΩ2+2𝖬2(p)p(Brx,ϵ/2(xΩ))exp(ϵ24σ2(1ΔxΔx+Φx)).\displaystyle\leq\epsilon^{2}+\frac{2\|x_{\Omega}\|^{2}+2\mathsf{M}_{2}(p)}{p\left(B_{r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right)}\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right).

Then, we have that 2ΦxτΩΔx2\Phi_{x}\geq\tau_{\Omega}-\Delta_{x}. So

rx,ϵ\displaystyle r_{x,\epsilon}^{*} =Φx(Φx+Δx)(rx,ϵ+Δx)ϵ2\displaystyle=\frac{\Phi_{x}}{(\Phi_{x}+\Delta_{x})\left(r_{x,\epsilon}+\Delta_{x}\right)}\epsilon^{2} (21)
\geq\frac{\tau_{\Omega}-\Delta_{x}}{2\tau_{\Omega}}\cdot\frac{\epsilon^{2}}{\Delta_{x}+\sqrt{\Delta^{2}_{x}+\epsilon^{2}}} (22)
14min{ϵ(2+1)Δx,12+1}ϵ\displaystyle\geq\frac{1}{4}\cdot\min\left\{\frac{\epsilon}{(\sqrt{2}+1)\Delta_{x}},\frac{1}{\sqrt{2}+1}\right\}\epsilon (23)
ϵ10min{ϵτΩ,1}.\displaystyle\geq\frac{\epsilon}{10}\min\left\{\frac{\epsilon}{\tau_{\Omega}},1\right\}. (24)

On the other hand, we have that

rx,ϵ\displaystyle r^{*}_{x,\epsilon} Φx(Φx+Δx)ϵΦxΔx+Φxϵ2\displaystyle\leq\frac{\Phi_{x}}{(\Phi_{x}+\Delta_{x})\epsilon\sqrt{\frac{\Phi_{x}}{\Delta_{x}+\Phi_{x}}}}\epsilon^{2}
=ϵΦxΔx+Φxϵ.\displaystyle=\epsilon\sqrt{\frac{\Phi_{x}}{\Delta_{x}+\Phi_{x}}}\leq\epsilon.

Therefore, if one lets \epsilon=\sigma^{\zeta}, then for \sigma<c^{1/\zeta} we have that r_{x,\epsilon}^{*}<c and then

p(Brx,ϵ/2(xΩ))\displaystyle p\left(B_{r^{*}_{x,\epsilon/\sqrt{2}}}(x_{\Omega})\right) CΩ(ϵ10min{ϵτΩ,1})k\displaystyle\geq C_{\Omega}\cdot\left(\frac{\epsilon}{10}\min\left\{\frac{\epsilon}{\tau_{\Omega}},1\right\}\right)^{k}
=CΩσkζ10kmin{σkζτΩk,1}.\displaystyle=\frac{C_{\Omega}\sigma^{k\zeta}}{10^{k}}\min\left\{\frac{\sigma^{k\zeta}}{\tau_{\Omega}^{k}},1\right\}.

Noticing that 1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\geq\frac{1}{2}, we eventually have that

I_{1}+I_{2} \leq\sigma^{2\zeta}+\frac{2\mathsf{M}_{2}(p)+2\|x_{\Omega}\|^{2}}{\frac{C_{\Omega}\sigma^{k\zeta}}{10^{k}}\min\left\{\frac{\sigma^{k\zeta}}{\tau_{\Omega}^{k}},1\right\}}\exp\left(-\frac{1}{8\sigma^{2(1-\zeta)}}\right)
=\sigma^{2\zeta}+\frac{10^{k}(2\mathsf{M}_{2}(p)+2\|x_{\Omega}\|^{2})\max\{\tau_{\Omega}^{k},\sigma^{k\zeta}\}}{C_{\Omega}\sigma^{2k\zeta}}\exp\left(-\frac{1}{8}\sigma^{2(\zeta-1)}\right)

The rightmost inequality in the theorem follows from the fact that the rightmost summand above decays exponentially, which is faster than any polynomial rate. ∎

Proof of Proposition 4.7.

The case when t=0t=0 is trivial. When t(0,1]t\in(0,1], the statement follows from the following two results and from the fact that pt=(At1)#qσtp_{t}=(A_{t}^{-1})_{\#}q_{\sigma_{t}} where At(x)=αtxA_{t}(x)=\alpha_{t}x.

Lemma B.3 ([MMWW23, Lemma B.11]).

Let Ψ:dd\Psi:\mathbb{R}^{d}\to\mathbb{R}^{d} be a measurable map. Then, for any probability measures μ,ν\mu,\nu with finite 2-moment, we have that dW,2(μ,ν)=dW,2(Ψ#μ,Ψ#ν)d_{\mathrm{W},2}(\mu,\nu)=d_{\mathrm{W},2}(\Psi_{\#}\mu,\Psi_{\#}\nu)

Lemma B.4 ([Vil03, Proposition 7.17]).

Let μ,ν\mu,\nu be two probability measures on d\mathbb{R}^{d}. We let Gσ=𝒩(0,σ2I)G_{\sigma}=\mathcal{N}(0,\sigma^{2}I) and μσ:=μGσ\mu_{\sigma}:=\mu*G_{\sigma} as well as νσ:=νGσ\nu_{\sigma}:=\nu*G_{\sigma}. Then, for any σ>0\sigma>0, one has that

dW,2(μσ,νσ)dW,2(μ,ν).d_{\mathrm{W},2}(\mu_{\sigma},\nu_{\sigma})\leq d_{\mathrm{W},2}(\mu,\nu).

Proof of Theorem 4.11.

When \Delta_{\Omega}(x)=0, \mathrm{NN}_{\Omega}(x)=\Omega and p(\cdot|\bm{X}_{\sigma}=x)=p, so the Wasserstein distance is 0, which proves the statement.

Now, we will focus on the case when ΔΩ(x)>0\Delta_{\Omega}(x)>0. The posterior measure p(|𝑿σ=x)p(\cdot|\bm{X}_{\sigma}=x) is given by

p(|𝑿σ=x)\displaystyle p(\cdot|\bm{X}_{\sigma}=x) =i=1Naiexp(12σ2xix2)j=1Najexp(12σ2xjx2)δxi\displaystyle=\sum_{i=1}^{N}\frac{a_{i}\exp\left(-\frac{1}{2\sigma^{2}}\|x_{i}-x\|^{2}\right)}{\sum_{j=1}^{N}a_{j}\exp\left(-\frac{1}{2\sigma^{2}}\|x_{j}-x\|^{2}\right)}\,\delta_{x_{i}}

For ease of notation, we use B_{\sigma}:=\sum_{j=1}^{N}a_{j}\exp\left(-\frac{1}{2\sigma^{2}}\|x_{j}-x\|^{2}\right) to denote the normalization constant in p(\cdot|\bm{X}_{\sigma}=x) and B_{\sigma,x_{i}}:=a_{i}\exp\left(-\frac{1}{2\sigma^{2}}\|x_{i}-x\|^{2}\right) to denote the i-th term in B_{\sigma}. We use A=p(\mathrm{NN}_{\Omega}(x)) to denote the normalization constant in \widehat{p}_{\mathrm{NN}(x)}.

Observe that for any xi,xjNNΩ(x)x_{i},x_{j}\in\mathrm{NN}_{\Omega}(x), one has

p(xi|𝑿σ=x)/p(xj|𝑿σ=x)=p^NN(x)(xi)/p^NN(x)(xj).p(x_{i}|\bm{X}_{\sigma}=x)/p(x_{j}|\bm{X}_{\sigma}=x)=\widehat{p}_{\mathrm{NN}(x)}(x_{i})/\widehat{p}_{\mathrm{NN}(x)}(x_{j}).

This motivates the following construction of a coupling \gamma between p(\cdot|\bm{X}_{\sigma}=x) and \widehat{p}_{\mathrm{NN}(x)}:

\gamma=\sum_{x_{j}\in\mathrm{NN}_{\Omega}(x)}\frac{B_{\sigma,x_{j}}}{B_{\sigma}}\cdot\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x)}\frac{a_{i}}{A}\delta_{(x_{i},x_{i})}+\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x),x_{j}\in\Omega\backslash\mathrm{NN}_{\Omega}(x)}\frac{B_{\sigma,x_{j}}}{B_{\sigma}}\,\frac{a_{i}}{A}\,\delta_{(x_{i},x_{j})}.

Then, we bound the Wasserstein distance between p(|𝑿σ=x)p(\cdot|\bm{X}_{\sigma}=x) and p^NN(x)\widehat{p}_{\mathrm{NN}(x)} as follows:

dW,2(p(|𝑿σ=x),p^NN(x))2\displaystyle d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\sigma}=x),\widehat{p}_{\mathrm{NN}(x)})^{2} xy2γ(dx,dy),\displaystyle\leq\int\|x-y\|^{2}\gamma(dx,dy),
=\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x),x_{j}\in\Omega\backslash\mathrm{NN}_{\Omega}(x)}\frac{B_{\sigma,x_{j}}}{B_{\sigma}}\,\frac{a_{i}}{A}\|x_{i}-x_{j}\|^{2},
\leq\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x),x_{j}\in\Omega\backslash\mathrm{NN}_{\Omega}(x)}\frac{B_{\sigma,x_{j}}}{B_{\sigma}}\,\frac{a_{i}}{A}\mathrm{diam}(\Omega)^{2},
=diam(Ω)2xiNNΩ(x),xjΩ\NNΩ(x)ajexp(12σ2xjx2)j=1Najexp(12σ2xjx2)aiA,\displaystyle=\mathrm{diam}(\Omega)^{2}\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x),x_{j}\in\Omega\backslash\mathrm{NN}_{\Omega}(x)}\frac{a_{j}\exp\left(-\frac{1}{2\sigma^{2}}\|x_{j}-x\|^{2}\right)}{\sum_{j=1}^{N}a_{j}\exp\left(-\frac{1}{2\sigma^{2}}\|x_{j}-x\|^{2}\right)}\,\frac{a_{i}}{A},
diam(Ω)2xiNNΩ(x),xjΩ\NNΩ(x)ajexp(12e2λdΩ2(x,2))Aexp(12e2λdΩ2(x,1))aiA,\displaystyle\leq\mathrm{diam}(\Omega)^{2}\sum_{x_{i}\in\mathrm{NN}_{\Omega}(x),x_{j}\in\Omega\backslash\mathrm{NN}_{\Omega}(x)}\frac{a_{j}\exp\left(-\frac{1}{2}e^{2\lambda}d_{\Omega}^{2}(x,2)\right)}{A\exp\left(-\frac{1}{2}e^{2\lambda}d_{\Omega}^{2}(x,1)\right)}\,\frac{a_{i}}{A},
=diam(Ω)21AAexp(12e2λdΩ2(x,2))exp(12e2λdΩ2(x,1)),\displaystyle=\mathrm{diam}(\Omega)^{2}\frac{1-A}{A}\frac{\exp\left(-\frac{1}{2}e^{2\lambda}d_{\Omega}^{2}(x,2)\right)}{\exp\left(-\frac{1}{2}e^{2\lambda}d_{\Omega}^{2}(x,1)\right)},
=diam(Ω)21AAexp(12e2λ(dΩ2(x,2)dΩ2(x,1))),\displaystyle=\mathrm{diam}(\Omega)^{2}\frac{1-A}{A}\exp\left(-\frac{1}{2}e^{2\lambda}(d_{\Omega}^{2}(x,2)-d_{\Omega}^{2}(x,1))\right),
diam(Ω)21AAexp(12e2λΔΩ(x)).\displaystyle\leq\mathrm{diam}(\Omega)^{2}{\frac{1-A}{A}}\exp\left(-\frac{1}{2}e^{2\lambda}\Delta_{\Omega}(x)\right).

By taking the square roots on both sides, we conclude the proof. ∎
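As an illustrative numerical sanity check (not part of the formal argument), the ratio identity used in constructing the coupling above can be verified directly: the posterior p(\cdot|\bm{X}_{\sigma}=x) is a softmax reweighting of the data points, and on \mathrm{NN}_{\Omega}(x) the common Gaussian factor cancels. The configuration and all variable names in the Python sketch below are our own illustrative choices.

```python
import numpy as np

# Discrete data: two nearest neighbors of x at equal distance 1, one far point.
pts = np.array([[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
a = np.array([0.5, 0.3, 0.2])   # weights a_i
x = np.zeros(2)
sigma = 0.4

# Posterior weights p(x_i | X_sigma = x): softmax of -||x_i - x||^2/(2 sigma^2) + log a_i.
logw = -np.sum((pts - x) ** 2, axis=1) / (2 * sigma**2) + np.log(a)
post = np.exp(logw - logw.max())
post /= post.sum()

# On NN_Omega(x) = {pts[0], pts[1]}, the posterior ratio equals the ratio in the
# renormalized restriction \hat{p}_{NN(x)}, i.e. a_0 / a_1.
print(post[0] / post[1], a[0] / a[1])   # both 5/3
```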

Proof of Corollary 4.12.

The result follows from Theorem 4.11 and the stability of the mean operation under the Wasserstein distance (Lemma B.2). ∎
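To see the resulting denoiser bound of Corollary 4.12 in action, the following sketch (again our own illustration, not part of the proof) computes the exact denoiser m_{\sigma}(x)=\sum_{i}p(x_{i}|\bm{X}_{\sigma}=x)\,x_{i} for a two-point distribution and compares its distance to the nearest data point with the factor \exp(-\Delta_{\Omega}(x)/4\sigma^{2}), where we take \Delta_{\Omega}(x) to be the gap between the squared distances to the second-nearest and nearest data points, as in the proof above. The observed decay is in fact faster than the stated bound, which is conservative.

```python
import numpy as np

# Two well-separated points; the denoiser snaps to the nearer one as sigma -> 0.
pts = np.array([[0.0, 0.0], [3.0, 0.0]])
a = np.array([0.5, 0.5])
x = np.array([1.0, 0.2])   # closer to pts[0]

d2 = np.sum((pts - x) ** 2, axis=1)       # squared distances to data points
delta = np.sort(d2)[1] - np.sort(d2)[0]   # Delta_Omega(x)

for sigma in [1.0, 0.5, 0.25, 0.125]:
    logw = -d2 / (2 * sigma**2) + np.log(a)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    m = w @ pts                           # denoiser m_sigma(x)
    err = np.linalg.norm(m - pts[0])
    print(f"sigma={sigma:6.3f}  ||m - nearest||={err:.3e}  "
          f"exp(-Delta/(4 sigma^2))={np.exp(-delta / (4 * sigma**2)):.3e}")
```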

B.3.1 Proof of Theorem 4.9

As before, let \Delta_{x}:=d_{M}(x)<\frac{1}{2}\tau_{M}. We write q_{\sigma}^{x}:=p(\cdot|\bm{X}_{\sigma}=x), so that

qσx(dx1):=exp(xx122σ2)ρ(x1)volM(dx1)Mexp(xx122σ2)ρ(x1)volM(dx1),for x1M.q_{\sigma}^{x}(dx_{1}):=\frac{\exp\left(-\frac{\|x-x_{1}\|^{2}}{2\sigma^{2}}\right)\rho(x_{1})\mathrm{vol}_{M}(dx_{1})}{\int_{M}\exp\left(-\frac{\|x-x_{1}^{\prime}\|^{2}}{2\sigma^{2}}\right)\rho(x_{1}^{\prime})\mathrm{vol}_{M}(dx_{1}^{\prime})},\quad\text{for }x_{1}\in M.

As σ0\sigma\to 0, qσxq_{\sigma}^{x} is concentrated around xM=projM(x)x_{M}=\mathrm{proj}_{M}(x). For any r0>0r_{0}>0, we have that

dW,2(qσx,δxM)2\displaystyle d_{\mathrm{W},2}(q_{\sigma}^{x},\delta_{x_{M}})^{2} =Mx1xM2qσx(dx1)\displaystyle=\int_{M}\|x_{1}-x_{M}\|^{2}q_{\sigma}^{x}(dx_{1})
=Br0(xM)x1xM2qσx(dx1)I1+M\Br0(xM)x1xM2qσx(dx1)I2\displaystyle=\underbrace{\int_{B_{r_{0}}(x_{M})}\|x_{1}-x_{M}\|^{2}q_{\sigma}^{x}(dx_{1})}_{I_{1}}+\underbrace{\int_{M\backslash B_{r_{0}}(x_{M})}\|x_{1}-x_{M}\|^{2}q_{\sigma}^{x}(dx_{1})}_{I_{2}}

We first consider the term I2I_{2}.

I2M\Br0(xM)(2x12+2xM2)qσx(dx1).I_{2}\leq\int_{M\backslash B_{r_{0}}(x_{M})}(2\|x_{1}\|^{2}+2\|x_{M}\|^{2})q_{\sigma}^{x}(dx_{1}).

As shown in the proof of Theorem 4.2, we have B_{r_{x,r_{0}}}(x)\cap M\subset B_{r_{0}}(x_{M})\cap M. So, we have that

I2\displaystyle I_{2} M\Brx,r0(x)(2x12+2xM2)qσx(dx1)\displaystyle\leq\int_{M\backslash B_{r_{x,r_{0}}}(x)}(2\|x_{1}\|^{2}+2\|x_{M}\|^{2})q_{\sigma}^{x}(dx_{1}) (25)
exp(r024σ2(1ΔxΔx+Φx))2𝖬2(p)+2xM2p(Brx,r0/2(xM))=O(exp(r028σ2)).\displaystyle\leq\exp\left(-\frac{r_{0}^{2}}{4\sigma^{2}}\left(1-\frac{\Delta_{x}}{\Delta_{x}+\Phi_{x}}\right)\right)\frac{2\mathsf{M}_{2}(p)+2\|x_{M}\|^{2}}{p\left(B_{r^{*}_{x,r_{0}/\sqrt{2}}}(x_{M})\right)}=O\left(\exp\left(-\frac{r_{0}^{2}}{8\sigma^{2}}\right)\right). (26)

Now, we consider the term I_{1}. Since M is a submanifold of \mathbb{R}^{d}, its tangent space T_{x_{M}}M can be identified with a subspace of \mathbb{R}^{d}. We use \iota:T_{x_{M}}M\to\mathbb{R}^{d} to denote the inclusion map. In particular, for any u\in T_{x_{M}}M, we have that \langle x-x_{M},\iota(u)\rangle=0. Let \mathrm{Inj}(M) denote the injectivity radius of M. Note that, by [AB06, Corollary 1.4], \mathrm{Inj}(M)\geq\tau_{M}/4. So, the exponential map \exp_{x_{M}}:T_{x_{M}}M\to M is a diffeomorphism on a small ball B_{\tau_{M}/4}^{T_{x_{M}}}(\mathbf{0})\subset T_{x_{M}}M. Furthermore, we have the following inclusion relation between geodesic balls and Euclidean balls.

Lemma B.5.

For any 0<h3τM250<h\leq\frac{3\tau_{M}}{25}, we have that

MBh(xM)expxM(B5h/3TxM(𝟎))MB5h/3(xM).M\cap B_{h}(x_{M})\subset\exp_{x_{M}}\left(B_{5h/3}^{T_{x_{M}}}(\mathbf{0})\right)\subset M\cap B_{5h/3}(x_{M}).
Proof of Lemma B.5.

The left-hand inclusion follows from Lemma A.1 and Lemma A.2 (ii) of [AL19]. Although Lemma A.2 (ii) is stated for compact manifolds M, compactness is not needed in its proof. The right-hand inclusion follows from the fact that \|x-y\|\leq d_{M}(x,y) for any x,y\in M, where d_{M} denotes the geodesic distance. ∎

Now, we fix some r03τM/25r_{0}\leq 3\tau_{M}/25. Then, there exists an open neighborhood Ur0B5r0/3TxM(𝟎)U_{r_{0}}\subset B_{5r_{0}/3}^{T_{x_{M}}}(\mathbf{0}) around 𝟎TxMM\mathbf{0}\in T_{x_{M}}M so that we have the following diffeomorphism (which gives rise to the normal coordinates):

expxM:Ur0TxMMBr0(xM)M.\exp_{x_{M}}:U_{r_{0}}\subseteq T_{x_{M}}M\to B_{r_{0}}(x_{M})\cap M.

In particular, x_{M}=\exp_{x_{M}}(\bm{0}) and each x_{1}\in B_{r_{0}}(x_{M})\cap M can be written as x_{1}=\exp_{x_{M}}(u) for some u\in U_{r_{0}}.

Let 𝐈𝐈:TxMM×TxMM(TxMM)\mathbf{I\!I}:T_{x_{M}}M\times T_{x_{M}}M\to(T_{x_{M}}M)^{\perp} denote the second fundamental form of MM at xMx_{M}. Then, the exponential map expxM\exp_{x_{M}} in Ur0U_{r_{0}} has the following Taylor expansion [MMASC14]:

expxM(u)=xM+ι(u)+12𝐈𝐈(u,u)+O(u3),\exp_{x_{M}}(u)=x_{M}+\iota(u)+\frac{1}{2}\mathbf{I\!I}(u,u)+O(\|u\|^{3}), (27)

where O(u3)C(𝐈𝐈,𝐈𝐈)u3O(\|u\|^{3})\leq C(\mathbf{I\!I},\nabla\mathbf{I\!I})\|u\|^{3} for some constant C(𝐈𝐈,𝐈𝐈)>0C(\mathbf{I\!I},\nabla\mathbf{I\!I})>0 only dependent on 𝐈𝐈\mathbf{I\!I} and its derivatives 𝐈𝐈\nabla\mathbf{I\!I}.

For u\in U_{r_{0}}, we let g(u) denote the metric tensor in these normal coordinates; in particular, g(\bm{0}) is the identity matrix. The volume form is given by \sqrt{\det(g(u))}\,du^{1}\wedge\ldots\wedge du^{m} and it admits the following Taylor expansion around \bm{0}\in T_{x_{M}}M:

det(g(u))=116Rijuiuj+O(u3),\sqrt{\det(g(u))}=1-\frac{1}{6}R_{ij}u^{i}u^{j}+O(\|u\|^{3}),

where RijR_{ij} is the Ricci curvature tensor of MM at xMx_{M}; see e.g. [CLN23, Exercise 1.83].

We can write

ρ(u)det(g(u))=ρ(𝟎)+R1(u),|R1(u)|C1u,\rho(u)\sqrt{\det(g(u))}=\rho(\bm{0})+R_{1}(u),\quad|R_{1}(u)|\leq C_{1}\|u\|, (28)

where C_{1} depends on the Ricci curvature tensor R_{ij} and on the gradient \nabla\rho(x_{M}).

Now, we define

fσ(u):=exp(xexpxM(u)22σ2)ρ(u)det(g(u)).f_{\sigma}(u):=\exp\left(-\frac{\|x-\exp_{x_{M}}(u)\|^{2}}{2\sigma^{2}}\right)\rho(u)\sqrt{\det(g(u))}.

Let MBr0:=Br0(xM)MMB_{r_{0}}:=B_{r_{0}}(x_{M})\cap M. Then, we have that

qσx(du)=fσ(u)duUr0fσ(u)𝑑u+M\MBr0exp(xx122σ2)ρ(x1)volM(dx1).q_{\sigma}^{x}(du)=\frac{f_{\sigma}(u)du}{\int_{U_{r_{0}}}f_{\sigma}(u^{\prime})du^{\prime}+\int_{M\backslash MB_{r_{0}}}\exp\left(-\frac{\|x-x_{1}^{\prime}\|^{2}}{2\sigma^{2}}\right)\rho(x_{1}^{\prime})\mathrm{vol}_{M}(dx_{1}^{\prime})}.

Using the same argument as the one used for controlling I2I_{2} in the proof of Theorem 4.2, we have

M\MBr0exp(xx122σ2)ρ(x1)volM(dx1)MBr0exp(xx122σ2)ρ(x1)volM(dx1)=O(exp(r028σ2))\frac{\int_{M\backslash MB_{r_{0}}}\exp\left(-\frac{\|x-x_{1}^{\prime}\|^{2}}{2\sigma^{2}}\right)\rho(x_{1}^{\prime})\mathrm{vol}_{M}(dx_{1}^{\prime})}{\int_{MB_{r_{0}}}\exp\left(-\frac{\|x-x_{1}^{\prime}\|^{2}}{2\sigma^{2}}\right)\rho(x_{1}^{\prime})\mathrm{vol}_{M}(dx_{1}^{\prime})}=O\left(\exp\left(-\frac{r_{0}^{2}}{8\sigma^{2}}\right)\right) (29)

So,

MBr0x1xM2qσx(dx1)=Ur0expxM(u)xM2fσ(u)𝑑uUr0fσ(u)𝑑u(1+O(exp(r028σ2))).\int_{MB_{r_{0}}}\|x_{1}-x_{M}\|^{2}q_{\sigma}^{x}(dx_{1})=\frac{\int_{U_{r_{0}}}\|\exp_{x_{M}}(u)-x_{M}\|^{2}f_{\sigma}(u)du}{\int_{U_{r_{0}}}f_{\sigma}(u^{\prime})du^{\prime}}\left(1+O\left(\exp\left(-\frac{r_{0}^{2}}{8\sigma^{2}}\right)\right)\right). (30)

Next, we derive a Taylor expansion of the squared distance from xx to expxM(u)\exp_{x_{M}}(u) for uu around 𝟎\bm{0}:

Lemma B.6.

Let v:=xxMv:=x-x_{M}. Then, we have that

xexpxMu2=v2+u2v,𝐈𝐈(u,u)+R2(u)=v2+uT(𝑰𝐈𝐈v)u+R2(u)\|x-\exp_{x_{M}}u\|^{2}=\|v\|^{2}+\|u\|^{2}-\langle v,\mathbf{I\!I}(u,u)\rangle+R_{2}(u)=\|v\|^{2}+u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)

where \mathbf{I\!I}_{v}:=(\langle v,\mathbf{I\!I}_{ij}\rangle)_{i,j=1,\ldots,m} and |R_{2}(u)|\leq C_{2}\|u\|^{3} for some positive constant C_{2}=C_{2}(\mathbf{I\!I},\nabla\mathbf{I\!I})>0, and \bm{I} is the identity matrix.

Proof of Lemma B.6.

We first note that

xexpxMu2\displaystyle\|x-\exp_{x_{M}}u\|^{2} =xxM2+xMexpxM(u)2+2xxM,xMexpxM(u).\displaystyle=\|x-x_{M}\|^{2}+\|x_{M}-\exp_{x_{M}}(u)\|^{2}+2\langle x-x_{M},x_{M}-\exp_{x_{M}}(u)\rangle.

By equation (27), we have that

xexpxMu2\displaystyle\|x-\exp_{x_{M}}u\|^{2} =v2+u2+u,𝐈𝐈(u,u)2v,ι(u)v,𝐈𝐈(u,u)+O(u3)\displaystyle=\|v\|^{2}+\|u\|^{2}+\langle u,\mathbf{I\!I}(u,u)\rangle-2\langle v,\iota(u)\rangle-\langle v,\mathbf{I\!I}(u,u)\rangle+O(\|u\|^{3})
=v2+u2v,𝐈𝐈(u,u)+O(u3)\displaystyle=\|v\|^{2}+\|u\|^{2}-\langle v,\mathbf{I\!I}(u,u)\rangle+O(\|u\|^{3})

where we used that v belongs to the normal space (T_{x_{M}}M)^{\perp} of M at x_{M}, so that \langle v,\iota(u)\rangle=0, and that \mathbf{I\!I}(u,u) is normal while \iota(u) is tangent, so that \langle u,\mathbf{I\!I}(u,u)\rangle=0. ∎

By Lemma B.6 and Equation 30, we have that

Ur0expxM(u)xM2fσ(u)𝑑u\displaystyle\int_{U_{r_{0}}}\|\exp_{x_{M}}(u)-x_{M}\|^{2}f_{\sigma}(u)du
=exp(v22σ2)Ur0(uT(𝑰𝐈𝐈v)u+R2(u))exp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u.\displaystyle=\exp\left(-\frac{\|v\|^{2}}{2\sigma^{2}}\right)\int_{U_{r_{0}}}{(u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u))}\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du.

and

Ur0fσ(u)𝑑u=exp(v22σ2)Ur0exp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u.\displaystyle\int_{U_{r_{0}}}f_{\sigma}(u)du=\exp\left(-\frac{\|v\|^{2}}{2\sigma^{2}}\right)\int_{U_{r_{0}}}\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du.

We will establish the following claims:

Claim 1.
Ur0(uT(𝑰𝐈𝐈v)u+R2(u))exp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u\displaystyle\int_{U_{r_{0}}}{(u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u))}\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du (31)
=σm+2ρ(𝟎)mzT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σm+3).\displaystyle=\sigma^{m+2}\rho(\bm{0})\int_{\mathbb{R}^{m}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma^{m+3}).
Claim 2.
Ur0exp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u\displaystyle\int_{U_{r_{0}}}\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du (32)
=σmρ(𝟎)mexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σm+1).\displaystyle=\sigma^{m}\rho(\bm{0})\int_{\mathbb{R}^{m}}\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma^{m+1}).

Given the above two claims, we have

MBr0\displaystyle\int_{MB_{r_{0}}} x1xM2qσx(dx1)=Ur0expxM(u)xM2fσ(u)𝑑uUr0fσ(u)𝑑u(1+O(exp(r028σ2)))\displaystyle\|x_{1}-x_{M}\|^{2}q_{\sigma}^{x}(dx_{1})=\frac{\int_{U_{r_{0}}}\|\exp_{x_{M}}(u)-x_{M}\|^{2}f_{\sigma}(u)du}{\int_{U_{r_{0}}}f_{\sigma}(u^{\prime})du^{\prime}}\left(1+O\left(\exp\left(-\frac{r_{0}^{2}}{8\sigma^{2}}\right)\right)\right)
=σ2mzT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σ3)mexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σ)(1+O(exp(r028σ2)))\displaystyle=\frac{\sigma^{2}\int_{\mathbb{R}^{m}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma^{3})}{\int_{\mathbb{R}^{m}}\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma)}\left(1+O\left(\exp\left(-\frac{r_{0}^{2}}{8\sigma^{2}}\right)\right)\right)
=σ2(mzT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)𝑑zmexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σ))(1+O(exp(r028σ2)))\displaystyle=\sigma^{2}\left(\frac{\int_{\mathbb{R}^{m}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz}{\int_{\mathbb{R}^{m}}\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz}+O(\sigma)\right)\left(1+O\left(\exp\left(-\frac{r_{0}^{2}}{8\sigma^{2}}\right)\right)\right)
=mσ2+O(σ3)\displaystyle=m\sigma^{2}+O(\sigma^{3})

where in the last equality we used the fact that \mathbb{E}[\bm{X}^{T}A\bm{X}]=\mathrm{tr}(A\Sigma) for a Gaussian random variable \bm{X}\sim\mathcal{N}(0,\Sigma), applied with A=\bm{I}-\mathbf{I\!I}_{v} and \Sigma=(\bm{I}-\mathbf{I\!I}_{v})^{-1}, so that the ratio of the two integrals equals \mathrm{tr}(\bm{I}_{m})=m.

Therefore, we have that

dW,2(qσx,δxM)=mσ+O(σ2),d_{\mathrm{W},2}(q_{\sigma}^{x},\delta_{x_{M}})=\sqrt{m}\sigma+O(\sigma^{2}),

which concludes the proof modulo Claims 1 and 2.

Now we prove the two claims. For the first claim, note that since \|v\|=d_{M}(x)<\tau_{M}/2, we can use [BHHS22, Theorem 2.1] to conclude

𝐈𝐈vvmaxi,j𝐈𝐈ij<12τM1/τM=12.\|\mathbf{I\!I}_{v}\|\leq\|v\|\cdot\max_{i,j}\|\mathbf{I\!I}_{ij}\|<\frac{1}{2}\tau_{M}\cdot 1/\tau_{M}=\frac{1}{2}.

So the matrix \bm{I}-\mathbf{I\!I}_{v} is positive definite. We let \lambda_{\min}>0 denote its smallest eigenvalue; note that \lambda_{\min}\geq 1-\|\mathbf{I\!I}_{v}\|\geq\frac{1}{2}. Hence, we can choose r_{0} small enough at the beginning so that there exists a constant C_{3}>0 such that for all u\in U_{r_{0}},

uT(𝑰𝐈𝐈v)u+R2(u)λminu2C2u3>C3u2.u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)\geq\lambda_{\min}\|u\|^{2}-C_{2}\|u\|^{3}>C_{3}\|u\|^{2}.

Consider the transformation u=\sigma z. We let \sigma^{\prime}:=2r_{0}(C_{2}\sigma)^{\frac{1}{3}}, which goes to 0 as \sigma\to 0; for sufficiently small \sigma we have \sigma^{\prime}>\sigma, and hence

1σUr01σUr0.\frac{1}{\sigma^{\prime}}U_{r_{0}}\subset\frac{1}{\sigma}U_{r_{0}}.

Then, we have that

Ur0uT(𝑰𝐈𝐈v)uexp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u\displaystyle\int_{U_{r_{0}}}u^{T}(\bm{I}-\mathbf{I\!I}_{v})u\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du
=σm+21σUr0zT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2R2(σz)2σ2)(ρ(𝟎)+R1(σz))𝑑z\displaystyle=\sigma^{m+2}\int_{\frac{1}{\sigma}U_{r_{0}}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}-\frac{R_{2}(\sigma z)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(\sigma z))dz
=σm+21σUr0zT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2R2(σz)2σ2)(ρ(𝟎)+R1(σz))𝑑zJ1\displaystyle=\sigma^{m+2}\underbrace{\int_{\frac{1}{\sigma^{\prime}}U_{r_{0}}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}-\frac{R_{2}(\sigma z)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(\sigma z))dz}_{J_{1}}
+σm+21σUr0\1σUr0zT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2R2(σz)2σ2)(ρ(𝟎)+R1(σz))𝑑zJ2\displaystyle+\sigma^{m+2}\underbrace{\int_{\frac{1}{\sigma}U_{r_{0}}\backslash\frac{1}{\sigma^{\prime}}U_{r_{0}}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}-\frac{R_{2}(\sigma z)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(\sigma z))dz}_{J_{2}}

Now, for J1J_{1}, recall that Ur0B5r0/3TxMM(𝟎)U_{r_{0}}\subset B_{5r_{0}/3}^{T_{x_{M}}M}(\mathbf{0}). Hence, for any z1σUr0z\in\frac{1}{\sigma^{\prime}}U_{r_{0}}, we have that z1σ5r0/3<1σ2r0\|z\|\leq\frac{1}{\sigma^{\prime}}5r_{0}/3<\frac{1}{\sigma^{\prime}}2r_{0} and hence

|R2(σz)|2σ2C2(σz)32σ2<C2σ3(2r0)32σ2(2r0)3C2σ=12.\frac{|R_{2}(\sigma z)|}{2\sigma^{2}}\leq\frac{C_{2}(\sigma\|z\|)^{3}}{2\sigma^{2}}<\frac{C_{2}\sigma^{3}(2r_{0})^{3}}{2\sigma^{2}\cdot(2r_{0})^{3}C_{2}\sigma}=\frac{1}{2}.

This implies the expansion \exp\left(-\frac{R_{2}(\sigma z)}{2\sigma^{2}}\right)=1+O(\sigma\|z\|^{3}). Therefore, we have that

J1=1σUr0zT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)(1+O(σz3))(ρ(𝟎)+O(σz))𝑑z\displaystyle J_{1}=\int_{\frac{1}{\sigma^{\prime}}U_{r_{0}}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)(1+O(\sigma\|z\|^{3}))(\rho(\bm{0})+O(\|\sigma z\|))dz
=ρ(𝟎)1σUr0zT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σ)\displaystyle=\rho(\bm{0})\int_{\frac{1}{\sigma^{\prime}}U_{r_{0}}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma)
=ρ(𝟎)mzT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σ)\displaystyle=\rho(\bm{0})\int_{\mathbb{R}^{m}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma)

where in the last step we used that the Gaussian tail outside \frac{1}{\sigma^{\prime}}U_{r_{0}} decays exponentially as \sigma\to 0.

For J2J_{2}, we similarly have that

|J2|1σUr0\1σUr0zT(𝑰𝐈𝐈v)zexp(C3z2)(ρ(𝟎)+O(σz))𝑑z\displaystyle|J_{2}|\leq\int_{\frac{1}{\sigma}U_{r_{0}}\backslash\frac{1}{\sigma^{\prime}}U_{r_{0}}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-C_{3}\|z\|^{2}\right)(\rho(\bm{0})+O(\|\sigma z\|))dz
=O(σ).\displaystyle=O(\sigma).

Therefore,

Ur0uT(𝑰𝐈𝐈v)uexp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u\displaystyle\int_{U_{r_{0}}}u^{T}(\bm{I}-\mathbf{I\!I}_{v})u\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du (33)
=σm+2ρ(𝟎)mzT(𝑰𝐈𝐈v)zexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σm+3).\displaystyle=\sigma^{m+2}\rho(\bm{0})\int_{\mathbb{R}^{m}}z^{T}(\bm{I}-\mathbf{I\!I}_{v})z\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma^{m+3}).

Similarly, we have that

Ur0exp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u\displaystyle\int_{U_{r_{0}}}\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du (34)
=σmρ(𝟎)mexp(zT(𝑰𝐈𝐈v)z2)𝑑z+O(σm+1).\displaystyle=\sigma^{m}\rho(\bm{0})\int_{\mathbb{R}^{m}}\exp\left(-\frac{z^{T}(\bm{I}-\mathbf{I\!I}_{v})z}{2}\right)dz+O(\sigma^{m+1}).

Additionally, by |R2(u)|C2u3|R_{2}(u)|\leq C_{2}\|u\|^{3}, we have that

Ur0R2(u)exp(uT(𝑰𝐈𝐈v)u+R2(u)2σ2)(ρ(𝟎)+R1(u))𝑑u=O(σm+3).\int_{U_{r_{0}}}R_{2}(u)\exp\left(-\frac{u^{T}(\bm{I}-\mathbf{I\!I}_{v})u+R_{2}(u)}{2\sigma^{2}}\right)(\rho(\bm{0})+R_{1}(u))du=O(\sigma^{m+3}). (35)

Then Claim 1 follows from Equations 33 and 35, and Claim 2 follows from Equation 34.
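As a quick numerical check of the Gaussian quadratic-form identity invoked above (the source of the factor m\sigma^{2}): for \bm{X}\sim\mathcal{N}(0,\Sigma) with \Sigma=A^{-1}, \mathbb{E}[\bm{X}^{T}A\bm{X}]=\mathrm{tr}(AA^{-1})=m, independently of the positive definite matrix A, which here plays the role of \bm{I}-\mathbf{I\!I}_{v}. The matrix below is an arbitrary positive definite choice of ours, not derived from any particular manifold.

```python
import numpy as np

rng = np.random.default_rng(0)

# E[X^T A X] = tr(A Sigma) for X ~ N(0, Sigma); with Sigma = A^{-1} this is tr(I_m) = m.
A = np.array([[1.0, 0.2, 0.0],
              [0.2, 0.8, 0.1],
              [0.0, 0.1, 1.2]])   # positive definite stand-in for I - II_v
m = A.shape[0]
X = rng.multivariate_normal(np.zeros(m), np.linalg.inv(A), size=200_000)
quad = np.einsum('ni,ij,nj->n', X, A, X)
print(quad.mean(), m)   # both approximately 3
```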

B.4 Proofs in Section 5

Proof of Theorem 5.1.

We prove that the vector field u:[0,1)×ddu:[0,1)\times\mathbb{R}^{d}\to\mathbb{R}^{d} is locally Lipschitz for any fixed tt and that the integral 01dut(x)pt(dx)𝑑t<\int_{0}^{1}\int_{\mathbb{R}^{d}}\|u_{t}(x)\|p_{t}(dx)dt<\infty. Then, by the mass conservation formula (see for example [Vil09]), we conclude the proof.

To prove the local Lipschitz property, by Proposition 3.2, it suffices to show that the covariance matrix Σt(x)\Sigma_{t}(x) of the posterior distribution p(|𝑿t=x)p(\cdot|\bm{X}_{t}=x) is locally (w.r.t. xx) uniformly bounded (w.r.t. tt). We establish the following lemma for this purpose.

Lemma B.7.

Let pp be a probability measure on d\mathbb{R}^{d} with a finite 2-moment 𝖬2(p)\mathsf{M}_{2}(p). For any xdx\in\mathbb{R}^{d}, consider the posterior distribution:

p(dz|𝑿t=x)=exp(xαtz22βt2)p(dz)Ωexp(xαtz22βt2)p(dz).p(dz|\bm{X}_{t}=x)=\frac{\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)p(dz)}{\int_{\Omega}\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)p(dz)}.

We let Nt(x):=exp(xαtz22βt2)p(dz)N_{t}(x):=\int\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)p(dz). Then, the covariance matrix Σt(x)\Sigma_{t}(x) of p(|𝐗t=x)p(\cdot|\bm{X}_{t}=x) satisfies:

Σt(x)(2mt(x)2+2𝖬2(p)Nt(x))I.\Sigma_{t}(x)\preceq\left(2\|m_{t}(x)\|^{2}+\frac{2\mathsf{M}_{2}(p)}{N_{t}(x)}\right)I.
Proof of Lemma B.7.

Fix a unit vector vv. Then, we have that

vΣt(x)v\displaystyle v^{\top}\Sigma_{t}(x)v =zmt(x),v2exp(xαtz22βt2)p(dz)exp(xαtz22βt2)p(dz)\displaystyle=\int\langle z-m_{t}(x),v\rangle^{2}\frac{\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)p(dz)}{\int\exp\left(-\frac{\|x-\alpha_{t}z^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dz^{\prime})}

For zdz\in\mathbb{R}^{d}, we have that

zmt(x),v2\displaystyle\langle z-m_{t}(x),v\rangle^{2} zmt(x)22z2+2mt(x)2.\displaystyle\leq\|z-m_{t}(x)\|^{2}\leq 2\|z\|^{2}+2\|m_{t}(x)\|^{2}.

Therefore, one has that

vΣt(x)v\displaystyle v^{\top}\Sigma_{t}(x)v 2mt(x)2+2z2exp(xαtz22βt2)p(dz)exp(xαtz22βt2)p(dz)\displaystyle\leq 2\|m_{t}(x)\|^{2}+2\int\|z\|^{2}\frac{\exp\left(-\frac{\|x-\alpha_{t}z\|^{2}}{2\beta_{t}^{2}}\right)p(dz)}{\int\exp\left(-\frac{\|x-\alpha_{t}z^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dz^{\prime})}
=2mt(x)2+2z2p(dz)exp(xαtz22βt2)p(dz)\displaystyle=2\|m_{t}(x)\|^{2}+\frac{2\int\|z\|^{2}p(dz)}{\int\exp\left(-\frac{\|x-\alpha_{t}z^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dz^{\prime})}
=2mt(x)2+2𝖬2(p)Nt(x)\displaystyle=2\|m_{t}(x)\|^{2}+\frac{2\mathsf{M}_{2}(p)}{N_{t}(x)}

Since vv is arbitrary, this concludes the proof. ∎

By the dominated convergence theorem, it is straightforward to check that N:[0,1)\times\mathbb{R}^{d}\to\mathbb{R} is continuous. Hence, for any x\in\mathbb{R}^{d} and any compact neighborhood of x, N is uniformly bounded below by some positive constant. Furthermore, as m_{t}(x) is differentiable (cf. Proposition 3.2), we have that m_{t}(x) is locally uniformly bounded as well. These together with Proposition 3.2 and Equation 6 imply that the vector field u:[0,1)\times\mathbb{R}^{d}\to\mathbb{R}^{d} is locally Lipschitz for any t\in[0,1). This concludes the proof of the well-posedness of the flow map.

Now we verify the integrability of u_{t}.

01ut(x)pt(dx)𝑑t\displaystyle\int_{0}^{1}\int\left\|u_{t}(x)\right\|p_{t}(dx)dt
01ut(x|x1)pt(dx|𝑿=x1)p(dx1)dt\displaystyle\leq\int_{0}^{1}\int\int\|u_{t}(x|x_{1})\|p_{t}(dx|\bm{X}=x_{1})\,p(dx_{1})dt
=01β˙tβtx+α˙tβtαtβ˙tβtx11(2πβt2)d/2exp(xαtx122βt2)𝑑xp(dx1)𝑑t\displaystyle=\int_{0}^{1}\int\int\left\|\frac{\dot{\beta}_{t}}{\beta_{t}}x+\frac{\dot{\alpha}_{t}\beta_{t}-\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}x_{1}\right\|\frac{1}{(2\pi\beta_{t}^{2})^{d/2}}\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)dx\,p(dx_{1})dt
01(β˙tβtxαtβ˙tβtx1+αt˙x1)1(2πβt2)d/2exp(xαtx122βt2)𝑑xp(dx1)𝑑t\displaystyle\leq\int_{0}^{1}\int\int\left(\left\|\frac{\dot{\beta}_{t}}{\beta_{t}}x-\frac{\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}x_{1}\right\|+\left\|\dot{\alpha_{t}}x_{1}\right\|\right)\frac{1}{(2\pi\beta_{t}^{2})^{d/2}}\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)dx\,p(dx_{1})dt

We split the integral according to the two terms in the sum \left\|\frac{\dot{\beta}_{t}}{\beta_{t}}x-\frac{\alpha_{t}\dot{\beta}_{t}}{\beta_{t}}x_{1}\right\|+\|\dot{\alpha}_{t}x_{1}\|. For the first term, we use the substitution \tilde{x}=x-\alpha_{t}x_{1} and the fact that the expected norm of a d-dimensional Gaussian random vector with covariance \sigma^{2}I is \sigma\sqrt{2}\,\frac{\Gamma((d+1)/2)}{\Gamma(d/2)}.

01|β˙tβt|xαtx11(2πβt2)d/2exp(xαtx122βt2)𝑑xp(dx1)𝑑t\displaystyle\int_{0}^{1}\int\int\left|\frac{\dot{\beta}_{t}}{\beta_{t}}\right|\left\|x-\alpha_{t}x_{1}\right\|\frac{1}{(2\pi\beta_{t}^{2})^{d/2}}\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)dx\,p(dx_{1})dt
=\int_{0}^{1}\int\left|\frac{\dot{\beta}_{t}}{\beta_{t}}\right|\beta_{t}\sqrt{2}\,\frac{\Gamma((d+1)/2)}{\Gamma(d/2)}\,p(dx_{1})dt
=\sqrt{2}\,\frac{\Gamma((d+1)/2)}{\Gamma(d/2)}\int_{0}^{1}-\dot{\beta}_{t}\,dt
=\sqrt{2}\,\frac{\Gamma((d+1)/2)}{\Gamma(d/2)},

where we use the assumption that βt\beta_{t} is a non-increasing function of tt and hence β˙t0\dot{\beta}_{t}\leq 0.

For the second term, we have that

01αt˙x11(2πβt2)d/2exp(xαtx122βt2)𝑑xp(dx1)𝑑t\displaystyle\int_{0}^{1}\int\int\left\|\dot{\alpha_{t}}x_{1}\right\|\frac{1}{(2\pi\beta_{t}^{2})^{d/2}}\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)dx\,p(dx_{1})dt
01αt˙x1p(dx1)𝑑t\displaystyle\leq\int_{0}^{1}\int\dot{\alpha_{t}}\|x_{1}\|p(dx_{1})dt
x1p(dx1)<.\displaystyle\leq\int\|x_{1}\|p(dx_{1})<\infty.

The last step follows from the fact that pp has finite second moment and hence finite first moment. We also use the assumption that αt\alpha_{t} is a non-decreasing function of tt and hence α˙t0\dot{\alpha}_{t}\geq 0. ∎
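The expected-norm constant used for the first term above is the mean of a scaled chi distribution: for \bm{Z}\sim\mathcal{N}(0,\sigma^{2}I_{d}), \mathbb{E}\|\bm{Z}\|=\sigma\sqrt{2}\,\Gamma((d+1)/2)/\Gamma(d/2). A short Monte Carlo check (illustrative only, assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

# Mean norm of Z ~ N(0, sigma^2 I_d): sigma * sqrt(2) * Gamma((d+1)/2) / Gamma(d/2).
d, sigma = 5, 0.7
exact = sigma * np.sqrt(2.0) * np.exp(gammaln((d + 1) / 2) - gammaln(d / 2))
mc = np.linalg.norm(rng.normal(scale=sigma, size=(200_000, d)), axis=1).mean()
print(exact, mc)   # agree to about three decimal places
```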

Proof of Proposition 5.6.

For independent random variables \bm{X}\sim p and \bm{Z}\sim N(0,\sigma^{2}I), we have that \bm{X}+\bm{Z}\sim q_{\sigma}=p*N(0,\sigma^{2}I). Hence,

dW,2(qσ,p)2𝔼[𝑿+𝒁𝑿2]=𝔼[𝒁2]=O(σ2).d_{\mathrm{W},2}(q_{\sigma},p)^{2}\leq\mathbb{E}\left[\|\bm{X}+\bm{Z}-\bm{X}\|^{2}\right]=\mathbb{E}\left[\|\bm{Z}\|^{2}\right]=O(\sigma^{2}).

The result follows. ∎

Proof of Theorem 5.7.
m~t(y)=\displaystyle\widetilde{m}_{t}(y)= exp(yαty122βt2)y1exp(yαty122βt2)p~(dy1)p~(dy1)\displaystyle\int\frac{\exp\left(-\frac{\|y-\alpha_{t}y_{1}\|^{2}}{2\beta_{t}^{2}}\right)y_{1}}{\int\exp\left(-\frac{\|y-\alpha_{t}y_{1}^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)\tilde{p}(dy_{1}^{\prime})}\tilde{p}(dy_{1}) (36)
=\int\frac{\exp\left(-\frac{\|{\bm{O}}x+\alpha_{t}b-\alpha_{t}({\bm{O}}x_{1}+b)\|^{2}}{2\beta_{t}^{2}}\right)({\bm{O}}x_{1}+b)}{\int\exp\left(-\frac{\|{\bm{O}}x+\alpha_{t}b-\alpha_{t}({\bm{O}}x_{1}^{\prime}+b)\|^{2}}{2\beta_{t}^{2}}\right)p(dx_{1}^{\prime})}\,p(dx_{1})
=\displaystyle= exp(xαtx122βt2)(𝑶x1+b)exp(xαtx122βt2)p(dx1)p(dx1)\displaystyle\int\frac{\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)({\bm{O}}x_{1}+b)}{\int\exp\left(-\frac{\|x-\alpha_{t}x_{1}^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dx_{1}^{\prime})}p(dx_{1})
=\displaystyle= 𝑶mt(x)+b.\displaystyle{\bm{O}}m_{t}(x)+b.

where in the second equality we used the change of variable y1=𝑶x1+by_{1}={\bm{O}}x_{1}+b and the fact that |det(𝑶)|=1|\det({\bm{O}})|=1.

Then, we have that

u~t(y)\displaystyle\widetilde{u}_{t}(y) =u~t(𝑶x+αtb)=(logβt)(𝑶x+αtb)+βt(αtβt)m~t(𝑶x+αtb)\displaystyle=\widetilde{u}_{t}({\bm{O}}x+\alpha_{t}b)=\left(\log\beta_{t}\right)^{\prime}({\bm{O}}x+\alpha_{t}b)+\beta_{t}\left(\frac{\alpha_{t}}{\beta_{t}}\right)^{\prime}\widetilde{m}_{t}({\bm{O}}x+\alpha_{t}b)
=𝑶((logβt)x+βt(αtβt)mt(x))+α˙tb\displaystyle={\bm{O}}\left(\left(\log\beta_{t}\right)^{\prime}x+\beta_{t}\left(\frac{\alpha_{t}}{\beta_{t}}\right)^{\prime}{m}_{t}(x)\right)+\dot{\alpha}_{t}b
=𝑶ut(x)+α˙tb.\displaystyle={\bm{O}}u_{t}(x)+\dot{\alpha}_{t}b.

Now, let x_{t} be an integral path w.r.t. u_{t} and consider the path y_{t}={\bm{O}}x_{t}+\alpha_{t}b. Notice that y_{0}={\bm{O}}x_{0}. Then we have that

dytdt=𝑶dxtdt+α˙tb=𝑶ut(xt)+α˙tb=u~t(yt).\frac{dy_{t}}{dt}={\bm{O}}\frac{dx_{t}}{dt}+\dot{\alpha}_{t}b={\bm{O}}u_{t}(x_{t})+\dot{\alpha}_{t}b=\widetilde{u}_{t}(y_{t}).

Therefore, Ψ~t(𝑶x)=𝑶Ψt(x)+αtb\widetilde{\Psi}_{t}({\bm{O}}x)={\bm{O}}\Psi_{t}(x)+\alpha_{t}b. ∎
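The equivariance identity just proved is easy to check numerically for a discrete data distribution. The sketch below (our own illustration; \alpha_{t}, \beta_{t}, and the data are arbitrary choices) verifies \widetilde{m}_{t}({\bm{O}}x+\alpha_{t}b)={\bm{O}}m_{t}(x)+b.

```python
import numpy as np

rng = np.random.default_rng(2)

d, N = 3, 7
pts = rng.normal(size=(N, d))          # data points x_1, ..., x_N
a = rng.dirichlet(np.ones(N))          # weights a_i
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal matrix O
b = rng.normal(size=d)
alpha_t, beta_t = 0.8, 0.5
x = rng.normal(size=d)

def denoiser(query, data):
    # posterior mean of the data given X_t = query
    logw = -np.sum((query - alpha_t * data) ** 2, axis=1) / (2 * beta_t**2)
    logw += np.log(a)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ data

m = denoiser(x, pts)
m_tilde = denoiser(Q @ x + alpha_t * b, pts @ Q.T + b)   # transformed data O x_i + b
print(np.abs(m_tilde - (Q @ m + b)).max())   # ~ 0 up to rounding error
```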

Proof of Theorem 5.8.

Consider y=stxy=s_{t}x. The transformed denoiser m¯t(y)\bar{m}_{t}(y) w.r.t. α¯t\bar{\alpha}_{t} and β¯t\bar{\beta}_{t} is given by

m¯t(y)=\displaystyle\bar{m}_{t}(y)= exp(yα¯ty122β¯t2)y1exp(yα¯ty122β¯t2)p(dy1)p(dy1)\displaystyle\int\frac{\exp\left(-\frac{\|y-\bar{\alpha}_{t}y_{1}\|^{2}}{2\bar{\beta}_{t}^{2}}\right)y_{1}}{\int\exp\left(-\frac{\|y-\bar{\alpha}_{t}y_{1}^{\prime}\|^{2}}{2\bar{\beta}_{t}^{2}}\right)p(dy_{1}^{\prime})}p(dy_{1}) (37)
=\displaystyle= exp(stxαtstx122st2βt2)kx1exp(stxαtstx122st2βt2)p(dx1)p(dx1)\displaystyle\int\frac{\exp\left(-\frac{\|s_{t}x-\alpha_{t}s_{t}x_{1}\|^{2}}{2s_{t}^{2}\beta_{t}^{2}}\right)kx_{1}}{\int\exp\left(-\frac{\|s_{t}x-\alpha_{t}s_{t}x_{1}^{\prime}\|^{2}}{2s_{t}^{2}\beta_{t}^{2}}\right)p(dx_{1}^{\prime})}p(dx_{1})
=\displaystyle= exp(xαtx122βt2)kx1exp(xαtx122βt2)p(dx1)p(dx1)=kmt(x),\displaystyle\int\frac{\exp\left(-\frac{\|x-\alpha_{t}x_{1}\|^{2}}{2\beta_{t}^{2}}\right)kx_{1}}{\int\exp\left(-\frac{\|x-\alpha_{t}x_{1}^{\prime}\|^{2}}{2\beta_{t}^{2}}\right)p(dx_{1}^{\prime})}p(dx_{1})=k\cdot m_{t}(x),

where we used the change of variable y_{1}=kx_{1} in the second equality.

Let xtx_{t} denote an ODE path for dxt/dt=ut(xt)dx_{t}/dt=u_{t}(x_{t}). Then we consider the path yt=stxty_{t}=s_{t}x_{t}. We need to check that dyt/dt=u¯t(yt)dy_{t}/dt=\bar{u}_{t}(y_{t}).

dytdt\displaystyle\frac{dy_{t}}{dt} =stxt+stdxtdt=stxt+st((logβt)xt+βt(αt/βt)mt(xt))\displaystyle=s_{t}^{\prime}x_{t}+s_{t}\frac{dx_{t}}{dt}=s_{t}^{\prime}x_{t}+s_{t}((\log\beta_{t})^{\prime}x_{t}+\beta_{t}(\alpha_{t}/\beta_{t})^{\prime}m_{t}(x_{t}))
=(st/st+(logβt))stxt+βt(αt/βt)stmt(xt)\displaystyle=(s_{t}^{\prime}/s_{t}+(\log\beta_{t})^{\prime})s_{t}x_{t}+\beta_{t}(\alpha_{t}/\beta_{t})^{\prime}s_{t}m_{t}(x_{t})
=(logβ¯t)yt+stβt(αtkβt)kmt(xt)\displaystyle=(\log\bar{\beta}_{t})^{\prime}y_{t}+s_{t}\beta_{t}(\frac{\alpha_{t}}{k\beta_{t}})^{\prime}km_{t}(x_{t})
=(logβ¯t)yt+β¯t(α¯tβ¯t)m¯t(yt)=u¯t(yt)\displaystyle=(\log\bar{\beta}_{t})^{\prime}y_{t}+\bar{\beta}_{t}(\frac{\bar{\alpha}_{t}}{\bar{\beta}_{t}})^{\prime}\bar{m}_{t}(y_{t})=\bar{u}_{t}(y_{t})

Note that y0=x0y_{0}=x_{0}, hence we conclude the proof. ∎

B.4.1 Proof of Theorem 5.3

We utilize the change of variable λ(t)=logαtβt\lambda(t)=\log\frac{\alpha_{t}}{\beta_{t}} for t(0,1)t\in(0,1). We also let t(λ)t(\lambda) denote the inverse function of λ(t)\lambda(t).

Next, we consider z_{\lambda}:=\frac{x_{t(\lambda)}}{\alpha_{t(\lambda)}}. Then, we have that z_{\lambda} satisfies the ODE in Equation 10: dz_{\lambda}/d\lambda=m_{\lambda}(z_{\lambda})-z_{\lambda}. Recall the transformation A_{t} sending x to x/\alpha_{t}. Then, we define q_{\lambda}:=(A_{t(\lambda)})_{\#}p_{t(\lambda)}=p*\mathcal{N}(0,e^{-2\lambda}I).

By Theorem 4.6, we have the following convergence rate for mλ(zλ)m_{\lambda}(z_{\lambda}):

Claim 3.

Fix 0<ζ<10<\zeta<1. Then, there exists Λ>\Lambda>-\infty such that for any radius R>12τΩR>\frac{1}{2}\tau_{\Omega} and all zBR(0)B12τΩ(Ω)z\in B_{R}(0)\cap B_{\frac{1}{2}\tau_{\Omega}}(\Omega), one has

mλ(z)projΩ(z)Cζ,τ,Reζλ for all λ>Λ\|m_{\lambda}(z)-\mathrm{proj}_{\Omega}(z)\|\leq C_{\zeta,\tau,R}\cdot e^{-\zeta\lambda}\text{ for all }\lambda>\Lambda

where C_{\zeta,\tau,R} is a constant depending only on \zeta, \tau_{\Omega}, and R.

Proof of Claim 3.

We let z_{\Omega}:=\mathrm{proj}_{\Omega}(z). Note that \|z_{\Omega}\|\leq\|z\|+d_{\Omega}(z)\leq 2R. Since p satisfies Assumption 1, we have that p(B_{r}(z_{\Omega}))\geq C_{2R}r^{k} for small 0<r<c. Now, we let \Lambda:=-\log(c)/\zeta. By Theorem 4.6, we conclude the proof. ∎

The following claim establishes an absorbing property for points z_{\lambda_{\delta}}\in B_{R_{\delta}}(0)\cap B_{\delta}(\Omega).

Claim 4.

Consider δ>0\delta>0 small such that δ<τΩ4\delta<\frac{\tau_{\Omega}}{4}. Fix any Rδ>0R_{\delta}>0 such that Rδ>2δR_{\delta}>2\delta. Then, there exists λδΛ\lambda_{\delta}\geq\Lambda satisfying the following property: the trajectory (zλ)λ[λδ,)(z_{\lambda})_{\lambda\in[\lambda_{\delta},\infty)} starting at any initial point zλδBRδ(0)Bδ(Ω)z_{\lambda_{\delta}}\in B_{R_{\delta}}(0)\cap B_{\delta}(\Omega) of the ODE in Equation 10 satisfies that for any λλδ\lambda\geq\lambda_{\delta}: zλB2Rδ(0)B2δ(Ω)z_{\lambda}\in B_{2R_{\delta}}(0)\cap B_{2\delta}(\Omega).

Proof of Claim 4.

This follows from Claim 3 and Theorem 3.11 (by letting \sigma_{\Omega}:=e^{-\Lambda}). ∎

Next we establish a concentration result for qλq_{\lambda} when λ\lambda is large.

Claim 5.

For any small δ>0\delta>0, there are Rδ,λδ>0R_{\delta},\lambda_{\delta}>0 large enough such that for any λλδ\lambda\geq\lambda_{\delta} and R>RδR>R_{\delta}, we have that

qλ(BR(0)Bδ(Ω))>1δ.q_{\lambda}\left(B_{R}(0)\cap B_{\delta}(\Omega)\right)>1-\delta.
Proof of Claim 5.

Consider the random variable \bm{X}=\bm{Y}+e^{-\lambda}\bm{Z} where \bm{Y}\sim p and \bm{Z}\sim p_{0}=\mathcal{N}(0,I) are independent and defined on a common probability space. Then, \bm{X} has q_{\lambda} as its law. We have that d_{\Omega}(\bm{X})\leq\|\bm{Y}+e^{-\lambda}\bm{Z}-\bm{Y}\|=e^{-\lambda}\|\bm{Z}\|. For any R>\delta, we have that

\mathbb{P}(\|\bm{Y}+e^{-\lambda}\bm{Z}\|\leq 2R,\;e^{-\lambda}\|\bm{Z}\|\leq\delta)\geq\mathbb{P}(\|\bm{Y}\|\leq R,\;e^{-\lambda}\|\bm{Z}\|\leq\delta)=\mathbb{P}(\|\bm{Y}\|\leq R)\,\mathbb{P}(\|\bm{Z}\|\leq e^{\lambda}\delta)

Since 𝒁\bm{Z} follows the standard Gaussian, for any δ\delta, there exists λδ>0\lambda_{\delta}>0 such that for all λλδ\lambda\geq\lambda_{\delta}, we have that

(𝒁eλδ)(𝒁eλδδ)>1δ2.\mathbb{P}(\|\bm{Z}\|\leq e^{\lambda}\delta)\geq\mathbb{P}(\|\bm{Z}\|\leq e^{\lambda_{\delta}}\delta)>1-\frac{\delta}{2}.

Now, since p has finite 2-moment and hence finite 1-moment, there exists R_{\delta}>2\delta such that \mathbb{P}\left(\|\bm{Y}\|\leq\frac{R_{\delta}}{2}\right)>1-\frac{\delta}{2}.

Therefore, for all λλδ\lambda\geq\lambda_{\delta}, we have that

q_{\lambda}\left(B_{R_{\delta}}(0)\cap B_{\delta}(\Omega)\right)\geq\mathbb{P}(\|\bm{Y}+e^{-\lambda}\bm{Z}\|\leq R_{\delta},\;e^{-\lambda}\|\bm{Z}\|\leq\delta)
\geq\mathbb{P}\left(\|\bm{Y}\|\leq\tfrac{R_{\delta}}{2}\right)\mathbb{P}(\|\bm{Z}\|\leq e^{\lambda}\delta)\geq\left(1-\frac{\delta}{2}\right)\left(1-\frac{\delta}{2}\right)>1-\delta. ∎
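Claim 5 can also be visualized with a quick Monte Carlo estimate. The sketch below (uniform p on four points; R, \delta, and the sample size are arbitrary choices of ours) estimates q_{\lambda}(B_{R}(0)\cap B_{\delta}(\Omega)) and shows it approaching 1 as \lambda grows.

```python
import numpy as np

rng = np.random.default_rng(3)

# q_lambda is the law of X = Y + e^{-lambda} Z with Y ~ p, Z ~ N(0, I).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R, delta, n = 3.0, 0.1, 200_000

for lam in [1.0, 2.0, 3.0, 4.0]:
    Y = pts[rng.integers(0, len(pts), size=n)]
    X = Y + np.exp(-lam) * rng.normal(size=(n, 2))
    d_omega = np.min(np.linalg.norm(X[:, None, :] - pts[None], axis=2), axis=1)
    mass = np.mean((np.linalg.norm(X, axis=1) <= R) & (d_omega <= delta))
    print(f"lambda={lam:.0f}  estimated mass of B_R(0) cap B_delta(Omega): {mass:.4f}")
```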

Now, we establish the desired convergence results in the \lambda scale.

Claim 6.

For any small δ>0\delta>0, there exist large enough λδ\lambda_{\delta}, such that with probability at least 1δ1-\delta, we have that zλδqλδz_{\lambda_{\delta}}\sim q_{\lambda_{\delta}} satisfies the following properties:

  1. z_{\lambda} converges along the ODE trajectory starting from z_{\lambda_{\delta}} as \lambda\to\infty;

  2. the convergence rate is given by \|z_{\infty}-z_{\lambda}\|=O(e^{-\frac{\zeta\lambda}{2}}), where z_{\infty} denotes the limit.

Proof of Claim 6.

For any \delta>0, by Claims 4 and 5, there exist large enough \lambda_{\delta} and R_{\delta}, such that

  • with probability at least 1δ1-\delta, we have that zλδqλδz_{\lambda_{\delta}}\sim q_{\lambda_{\delta}} lies in BRδ(0)Bδ(Ω)B_{R_{\delta}}(0)\cap B_{\delta}(\Omega);

  • the ODE trajectory (zλ)λ[λδ,)(z_{\lambda})_{\lambda\in[\lambda_{\delta},\infty)} starting at zλδBRδ(0)Bδ(Ω)z_{\lambda_{\delta}}\in B_{R_{\delta}}(0)\cap B_{\delta}(\Omega) satisfies that for all λλδ\lambda\geq\lambda_{\delta}, zλz_{\lambda} lies in B2Rδ(0)B_{2R_{\delta}}(0) and dΩ(zλ)2δd_{\Omega}(z_{\lambda})\leq 2\delta.

This implies that one can apply the convergence rate of the denoiser in Claim 3 to the entire trajectory with C:=C_{\zeta,\tau,R} where R:=2R_{\delta}. Then, for any \lambda_{1}<\lambda_{2} in the interval [\lambda_{\delta},\infty), we have that

zλ2zλ1\displaystyle\|z_{\lambda_{2}}-z_{\lambda_{1}}\| λ1λ2mλ(zλ)zλ𝑑λ\displaystyle\leq\int_{\lambda_{1}}^{\lambda_{2}}\|m_{\lambda}(z_{\lambda})-z_{\lambda}\|d\lambda
λ1λ2mλ(zλ)projΩ(zλ)+projΩ(zλ)zλdλ\displaystyle\leq\int_{\lambda_{1}}^{\lambda_{2}}\|m_{\lambda}(z_{\lambda})-\mathrm{proj}_{\Omega}(z_{\lambda})\|+\|\mathrm{proj}_{\Omega}(z_{\lambda})-z_{\lambda}\|d\lambda (38)
Cζ(eζλ2+eζλ1)+δeλδ(eλ1eλ2)+4δC2ζ2ζ(eζλ1/2eζλ2/2),\displaystyle\leq\frac{C}{\zeta}(-e^{-\zeta\lambda_{2}}+e^{-\zeta\lambda_{1}})+\delta e^{\lambda_{\delta}}(e^{-\lambda_{1}}-e^{-\lambda_{2}})+\sqrt{\frac{{4}\delta C}{2-\zeta}}\cdot\frac{2}{\zeta}\left(e^{-\zeta\lambda_{1}/2}-e^{-\zeta\lambda_{2}/2}\right),

where the bound is obtained in a way similar to the preceding estimates.

This implies that the solution z_{\lambda} is Cauchy as \lambda\to\infty and hence converges to a limit z_{\infty}.

Now, we let \lambda_{2} approach \infty in the above inequality and replace \lambda_{1} with \lambda to obtain the following convergence rate:

zzλ=O(eζλ2).\|z_{\infty}-z_{\lambda}\|=O(e^{-\frac{\zeta\lambda}{2}}). (39)

This concludes the proof of Claim 6. ∎

Now, we change the coordinate back to t\in[0,1) to obtain the following result. Since \delta>0 is arbitrary, one can let \delta approach 0 in the result below to conclude the proof of Theorem 5.3.

Claim 7.

For any small \delta>0, with probability at least 1-\delta over the sample x_{0}\sim p_{0}, the flow map \Psi_{t}(x_{0}) converges to a limit \Psi_{1}(x_{0}) as t\to 1.

Proof of Claim 7.

Let λδ\lambda_{\delta} be the one given in 6. Then, we let tδ:=t(λδ)t_{\delta}:=t(\lambda_{\delta}). Consider the map Atδ:ddA_{t_{\delta}}:\mathbb{R}^{d}\to\mathbb{R}^{d} sending xx to x/αtδx/\alpha_{t_{\delta}}. Then, we have that

(Atδ)#ptδ=qλδ.(A_{t_{\delta}})_{\#}p_{t_{\delta}}=q_{\lambda_{\delta}}. (40)

Both maps AtδA_{t_{\delta}} and Ψtδ\Psi_{t_{\delta}} (whose existence follows from Theorem 5.1) are continuous bijections. It is then easy to see that if an ODE trajectory of Equation 10 converges starting with some zλδqλδz_{\lambda_{\delta}}\sim q_{\lambda_{\delta}} and has convergence rate O(eζλ2)O(e^{-\frac{\zeta\lambda}{2}}), then the corresponding trajectory of the ODE Equation 1 starting with x0:=(AtδΨtδ)1(zλδ)x_{0}:=(A_{t_{\delta}}\circ\Psi_{t_{\delta}})^{-1}(z_{\lambda_{\delta}}) also converges to a limit as t1t\to 1 with the same convergence rate up to the change of variable λt\lambda\to t. Finally, by Equation 40 we also have that

p0(x0 with the desired properties)=qλδ(zλδ with the desired properties)>1δ,p_{0}(x_{0}\text{ with the desired properties})=q_{\lambda_{\delta}}(z_{\lambda_{\delta}}\text{ with the desired properties})>1-\delta,

which concludes the proof. ∎

Item 2 in the theorem is a direct consequence of item 1. As Ψ1\Psi_{1} is the pointwise limit of continuous maps Ψt\Psi_{t} (cf. Theorem 5.1), it is also measurable. We know that ptp_{t} weakly converges to p1=pp_{1}=p as t1t\to 1. Now, we just verify that (Ψ1)#p0(\Psi_{1})_{\#}p_{0} is also a weak limit of pt=(Ψt)#p0p_{t}=(\Psi_{t})_{\#}p_{0} to show that (Ψ1)#p0=p1(\Psi_{1})_{\#}p_{0}=p_{1}.

For any continuous and bounded function ff, we have that

limt1f(x)(Ψt)#p0(dx)\displaystyle\lim_{t\to 1}\int f(x)(\Psi_{t})_{\#}p_{0}(dx) =limt1f(Ψt(x))p0(dx)\displaystyle=\lim_{t\to 1}\int f(\Psi_{t}(x))p_{0}(dx)
=f(Ψ1(x))p0(dx)\displaystyle=\int f(\Psi_{1}(x))p_{0}(dx)

where we used the bounded convergence theorem in the last step. Therefore, we have that (\Psi_{1})_{\#}p_{0}=p_{1} and hence \Psi_{1} is the flow map associated with the ODE dx_{t}/dt=u_{t}(x_{t}). ∎

B.4.2 Proof of Theorem 5.4

Part 1.

We first establish the following volume growth condition for a manifold M satisfying the assumptions of the theorem. Notice that by [AB06, Corollary 1.4], \mathrm{Inj}(M)\geq\tau_{M}/4>0.

Lemma B.8.

There exists 0<r_{M}^{\prime}<\mathrm{Inj}(M) sufficiently small, where \mathrm{Inj}(M) denotes the injectivity radius, such that the following holds. For any R>0, there exists a constant C_{R}>0 so that for any radius 0<r<r_{M}^{\prime}, and for any x\in M\cap B_{R}(0), one has p(B_{r}(x))\geq C_{R}r^{m}.

Proof.

Notice that within the compact region M_{R+\mathrm{Inj}(M)}:=\overline{B_{R+\mathrm{Inj}(M)}(0)}\cap M, the density \rho is lower bounded by a positive constant \rho_{R}>0. Additionally, the sectional curvature of M is upper bounded by some constant \kappa>0 due to the boundedness of the second fundamental form. Then by Gunther's volume comparison theorem [GHL90, 3.101 Theorem], for 0<r<\mathrm{Inj}(M) we have

volM(BrM(x))volMκm(BrMκm(x)),\mathrm{vol}_{M}(B^{M}_{r}(x))\geq\mathrm{vol}_{M_{\kappa}^{m}}(B^{M_{\kappa}^{m}}_{r}(x)),

where B_{r}^{M}(x):=\{y\in M:\,d_{M}(x,y)<r\}, d_{M} denotes the geodesic distance, and M_{\kappa}^{m} denotes the m-dimensional sphere of constant sectional curvature \kappa.

We then choose rMr_{M}^{\prime} to be small enough such that

volMκm(BrMκm(x))cκrm\mathrm{vol}_{M_{\kappa}^{m}}(B^{M_{\kappa}^{m}}_{r}(x))\geq c_{\kappa}r^{m}

for some constant cκ>0c_{\kappa}>0 and all r<rMr<r_{M}^{\prime}.

Since xydM(x,y)\|x-y\|\leq d_{M}(x,y) for any x,yMx,y\in M, we have that BrM(x)Br(x)B_{r}^{M}(x)\subset B_{r}(x). Hence, we have that for all xBR(0)Mx\in B_{R}(0)\cap M and r<rMr<r_{M}^{\prime},

p(Br(x))p(BrM(x))=BrM(x)ρ(y)volM(dy)ρRvolM(BrM(x))ρRcκrm.p(B_{r}(x))\geq p(B_{r}^{M}(x))=\int_{B_{r}^{M}(x)}\rho(y)\mathrm{vol}_{M}(dy)\geq\rho_{R}\mathrm{vol}_{M}(B_{r}^{M}(x))\geq\rho_{R}\,c_{\kappa}\cdot r^{m}.

By letting CR:=ρRcκC_{R}:=\rho_{R}\,c_{\kappa}, we conclude the proof. ∎

Then, we can replicate the proof of Theorem 5.3, with only a small change to Claim 3 (using \zeta=1), as follows:

Claim 8.

Under the assumptions of Theorem 5.4, there exists \Lambda>-\infty such that for any radius R>\tau_{M}/2 and all z\in B_{R}(0)\cap B_{\tau_{M}/2}(M), we have that

\|m_{\lambda}(z)-\mathrm{proj}_{M}(z)\|\leq C_{M,R}\cdot e^{-\lambda},\text{ for }\lambda>\Lambda

where C_{M,R} is a constant depending only on R and the geometry of M.

Proof of Claim 8.

This can be proved by carefully examining all bounds involved in the proof of Theorem 4.9.

  • Equation 25: Since z\in B_{R}(0)\cap B_{\tau_{M}/2}(M), we have that z_{M}\in B_{R+\tau_{M}/2}(0), and hence \|z_{M}\| in the numerator can be bounded by R+\tau_{M}/2. The denominator is lower bounded by some polynomial of r_{0} by Equation 24 and the volume growth condition of p established in Lemma B.8. We also notice that r_{0} can be chosen uniformly for all z\in B_{R}(0)\cap B_{\tau_{M}/2}(M), as it depends only on the reach. Therefore, the big-O term is bounded above by some function of R.

  • Equation 28: C1C_{1} can be bounded by RR and the bounds on the local Lipschitz constant of the density and the second fundamental form (which bounds the Ricci tensor).

  • Equation 29: this is bounded, similarly to Equation 25, by some function of R.

  • Lemma B.6: the bound C2C_{2} is bounded by the bounds on the second fundamental form and its covariant derivatives.

In conclusion,

mλ(z)projM(z)dW,2(p(|𝑿λ=z),δzM)=meλ+G(z)\|m_{\lambda}(z)-\mathrm{proj}_{M}(z)\|\leq d_{\mathrm{W},2}(p(\cdot|\bm{X}_{\lambda}=z),\delta_{z_{M}})=\sqrt{m}e^{-\lambda}+G(z)

where |G(z)|CM,Re2λ|G(z)|\leq C^{\prime}_{M,R}e^{-2\lambda} for all zBR(0)BτM/2(M)z\in B_{R}(0)\cap B_{\tau_{M}/2}(M) and CM,RC^{\prime}_{M,R} depends only on RR and geometry bounds of MM such as reach and second fundamental form bounds. Then, by combining the two exponential terms, one concludes the proof. ∎

Part 2.

In the discrete case, we let \Omega:=\{x_{1},\ldots,x_{N}\}. We can improve the convergence rate O(e^{-\frac{\zeta\lambda}{2}}) in Theorem 5.3 to O(e^{-\lambda}) by considering a direct consequence of Corollary 4.12:

Claim 9.

For all zdz\in\mathbb{R}^{d} such that dΩ(z)<14sepΩd_{\Omega}(z)<\frac{1}{4}\mathrm{sep}_{\Omega}, we have that

mλ(z)projΩ(z)C2exp(C1e2λ),\|m_{\lambda}(z)-\mathrm{proj}_{\Omega}(z)\|\leq C_{2}\cdot\exp({-C_{1}e^{2\lambda}}),

where C_{1},C_{2} depend only on \mathrm{sep}_{\Omega}:=\min_{x_{i}\neq x_{j}\in\Omega}\|x_{i}-x_{j}\|, the minimal separation between the points of \Omega.

Therefore, when \lambda_{\delta} is large enough, we have that for any \lambda\geq\lambda_{\delta},

d(e2λdΩ2(zλ))dλ\displaystyle\frac{d(e^{2\lambda}d_{\Omega}^{2}(z_{\lambda}))}{d\lambda} 2e2λdΩ(zλ)mλ(zλ)projΩ(zλ)\displaystyle\leq{2}e^{2\lambda}d_{\Omega}(z_{\lambda})\|m_{\lambda}(z_{\lambda})-\mathrm{proj}_{\Omega}(z_{\lambda})\|
2C2e2λC1e2λdΩ(zλ)\displaystyle\leq{2}C_{2}e^{2\lambda-C_{1}e^{2\lambda}}d_{\Omega}(z_{\lambda})
C3eλdΩ(zλ),\displaystyle\leq C_{3}e^{-\lambda}d_{\Omega}(z_{\lambda}),

where C_{3} is a constant depending only on C_{1} and C_{2}, since e^{3\lambda-C_{1}e^{2\lambda}} is bounded over all \lambda.

Integrating from \lambda_{\delta} to \lambda, we obtain

e2λdΩ2(zλ)\displaystyle e^{2\lambda}d_{\Omega}^{2}(z_{\lambda}) e2λδdΩ2(zλδ)+C3λδλeλdΩ(zλ)𝑑λ,\displaystyle\leq e^{2\lambda_{\delta}}d_{\Omega}^{2}(z_{\lambda_{\delta}})+C_{3}\int_{\lambda_{\delta}}^{\lambda}e^{-\lambda^{\prime}}d_{\Omega}(z_{\lambda^{\prime}})d\lambda^{\prime},
e2λδC5+C4λδλeλ𝑑λ\displaystyle\leq e^{2\lambda_{\delta}}C_{5}+C_{4}\int_{\lambda_{\delta}}^{\lambda}e^{-\lambda^{\prime}}d\lambda^{\prime}

where C_{5}:=d_{\Omega}^{2}(z_{\lambda_{\delta}}) and we use the fact that d_{\Omega}(z_{\lambda^{\prime}}) is bounded along the trajectory by the absorbing result in Theorem 3.11, so that the integrand is bounded and the constant C_{4} absorbs C_{3} together with this bound.

This implies that d_{\Omega}^{2}(z_{\lambda})\leq e^{-2(\lambda-\lambda_{\delta})}C_{5}+C_{4}e^{-2\lambda}\leq C_{6}e^{-2\lambda} for some constant C_{6} depending on C_{4},C_{5} and \lambda_{\delta}, that is, d_{\Omega}(z_{\lambda})=O(e^{-\lambda}). Now, following the rest of the estimations in the proof of Theorem 5.3, we can replace the rate O(e^{-\frac{\zeta\lambda}{2}}) in Claim 6 with O(e^{-\lambda}) and conclude the proof. ∎
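The improved O(e^{-\lambda}) rate can be observed by integrating Equation 10 directly for a discrete target. The following forward-Euler sketch (step size, data, and initial point are arbitrary illustrative choices of ours, not a prescription) tracks d_{\Omega}(z_{\lambda}) against e^{-\lambda}.

```python
import numpy as np

rng = np.random.default_rng(4)

pts = rng.normal(size=(5, 2))   # discrete support Omega
a = np.full(5, 0.2)             # uniform weights

def m(z, lam):
    # exact denoiser m_lambda(z) with sigma = e^{-lambda}
    logw = -0.5 * np.exp(2 * lam) * np.sum((pts - z) ** 2, axis=1) + np.log(a)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ pts

z, lam, h = np.array([0.3, -0.2]), 0.0, 1e-3
for step in range(1, 5001):
    z = z + h * (m(z, lam) - z)    # Euler step for dz/dlambda = m_lambda(z) - z
    lam += h
    if step % 1000 == 0:
        d = np.min(np.linalg.norm(pts - z, axis=1))
        print(f"lambda={lam:4.1f}  d_Omega(z)={d:.2e}  e^-lambda={np.exp(-lam):.2e}")
```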

B.5 Proofs in Section 6

Proof of Proposition 6.1.

By the estimate in Proposition 4.1 and the fact that f(s)=\frac{s}{\log(1+s)} is increasing for s>0, one can check that for any \sigma>\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}) and for all z with \|z-\mathbb{E}[\bm{X}]\|<R_{0}, where \bm{X}\sim p, one has

dW,1(p(|𝒀σ=z),p)<ζz𝔼[𝑿],d_{W,1}(p(\cdot|\bm{Y}_{\sigma}=z),p)<\zeta\|z-\mathbb{E}[\bm{X}]\|,

where 𝒀σqσ\bm{Y}_{\sigma}\sim q_{\sigma}. Then the result follows from Lemma B.2. ∎

Proof of Proposition 6.2.

The estimate in Proposition 6.1 shows that for all \sigma\in(\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}),\sigma_{1}] and for all z\in\partial B_{R_{0}}(\mathbb{E}[\bm{X}]), the denoiser m_{\sigma}(z) lies in the ball B_{\zeta R_{0}}(\mathbb{E}[\bm{X}]) and hence in the interior of \overline{B_{R_{0}}(\mathbb{E}[\bm{X}])}. The closed ball \overline{B_{R_{0}}(\mathbb{E}[\bm{X}])} is clearly convex and we can then apply Item 2 of Proposition 3.9 to obtain the desired result. ∎

Proof of Proposition 6.3.

According to Propositions 6.1 and 6.2, the estimate

mσ(xσ)𝔼[𝑿]<ζxσ𝔼[𝑿]\|m_{\sigma}(x_{\sigma})-\mathbb{E}[\bm{X}]\|<\zeta\|x_{\sigma}-\mathbb{E}[\bm{X}]\|

holds for the entire trajectory (x_{\sigma})_{\sigma\in(\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}),\sigma_{1}]}. We then obtain the result by Item 1 of the meta attracting result (Theorem 3.6). ∎

Proof of Proposition 6.4.

The key observation is that the posterior distribution p(\cdot|\bm{X}_{\sigma}=x) is also supported on \mathrm{supp}(p), and hence its expectation, the denoiser m_{\sigma}(x), must lie in the convex hull \mathrm{conv}(\mathrm{supp}(p)).

For the first part, the first Item in Proposition 3.9 applies to \mathrm{conv}(\mathrm{supp}(p)), and hence any neighborhood of \mathrm{conv}(\mathrm{supp}(p)) is absorbing for (x_{\sigma})_{\sigma\in(\sigma_{\mathrm{init}}(\Omega,\zeta,R_{0}),\sigma_{1}]}. We then obtain the desired result by Item 2 in Theorem 3.8.

For the second part, we use Corollary A.5 to obtain

mσ(x)projconv(supp(p))(x),projconv(supp(p))(x)x0\langle m_{\sigma}(x)-\mathrm{proj}_{\mathrm{conv}(\mathrm{supp}(p))}(x),\mathrm{proj}_{\mathrm{conv}(\mathrm{supp}(p))}(x)-x\rangle\leq 0

for all xdx\in\mathbb{R}^{d} and all σ\sigma. Then we can apply Item 1 of the meta attracting result Theorem 3.6 with ζ=0\zeta=0 to conclude the proof. ∎

Proof of Lemma 6.5.

Let qσb:=pb𝒩(0,σb2I)q_{\sigma_{b}}:=p_{b}*\mathcal{N}(0,\sigma_{b}^{2}I) be the probability path with data distribution pbp_{b}. Then, we have that yσby_{\sigma_{b}} satisfies

dyσbdσb\displaystyle\frac{d\,y_{\sigma_{b}}}{d\sigma_{b}} =σblogqσb(yσb),\displaystyle=-\sigma_{b}\nabla\log q_{\sigma_{b}}(y_{\sigma_{b}}),

With the change of variable σb=σ2+δ2\sigma_{b}=\sqrt{\sigma^{2}+\delta^{2}}, we have that

dyσbdσ\displaystyle\frac{d\,y_{\sigma_{b}}}{d\sigma} =dσbdσσblogqσb(yσb),\displaystyle=-\frac{d\,\sigma_{b}}{d\,\sigma}\sigma_{b}\nabla\log q_{\sigma_{b}}(y_{\sigma_{b}}),
=σσbσblogqσb(yσb),\displaystyle=-\frac{\sigma}{\sigma_{b}}\sigma_{b}\nabla\log q_{\sigma_{b}}(y_{\sigma_{b}}),
=σlogqσb(yσb),\displaystyle=-\sigma\nabla\log q_{\sigma_{b}}(y_{\sigma_{b}}),

Writing x_{\sigma}:=y_{\sigma_{b}}, one has that

qσb(yσb)\displaystyle q_{\sigma_{b}}(y_{\sigma_{b}}) =1(2πσb2)d/2exp(yσbx22σb2)pb(dx)\displaystyle=\frac{1}{{(2\pi\sigma_{b}^{2})^{d/2}}}\int\exp\left(-\frac{\|y_{\sigma_{b}}-x\|^{2}}{2\sigma_{b}^{2}}\right)p_{b}(dx)
=1(2π(σ2+δ2))d/2exp(xσx22(δ2+σ2))pb(dx)\displaystyle=\frac{1}{{(2\pi(\sigma^{2}+\delta^{2}))^{d/2}}}\int\exp\left(-\frac{\|x_{\sigma}-x\|^{2}}{2(\delta^{2}+\sigma^{2})}\right)p_{b}(dx)
=qσ(xσ),\displaystyle=q_{\sigma}(x_{\sigma}),

where q_{\sigma}(x) denotes the density of q_{\sigma}:=p*\mathcal{N}(0,\sigma^{2}I) and the last equality uses p=p_{b}*\mathcal{N}(0,\delta^{2}I). Therefore, the trajectory x_{\sigma}=y_{\sigma_{b}} satisfies

\frac{d\,x_{\sigma}}{d\sigma}=-\sigma\nabla\log q_{\sigma}(x_{\sigma}). ∎
Proof of Corollary 6.6.

This is a direct consequence of Proposition 6.3 and Lemma 6.5. ∎

Proof of Corollary 6.7.

This is a direct consequence of Proposition 6.4 and Lemma 6.5. ∎

Proof of Proposition 6.8.

For simplicity, we use \eta_{\sigma}:=p(\cdot|\bm{X}_{\sigma}=x) to denote the posterior measure at x. We restrict \eta_{\sigma} to S to obtain the following probability measure

ησS:=1ησ(S)ησ|S.\eta_{\sigma}^{S}:=\frac{1}{\eta_{\sigma}(S)}\eta_{\sigma}|_{S}.

It is easy to see that 𝔼(ησS)\mathbb{E}(\eta_{\sigma}^{S}) lies in the convex hull of SS. We can then bound the distance from the denoiser mσ(x)m_{\sigma}(x) to the convex hull of SS by the Wasserstein distance between ησ\eta_{\sigma} and ησS\eta_{\sigma}^{S}.

Similarly to the proof of Theorem 4.11, we construct the following coupling γ\gamma between ησ\eta_{\sigma} and ησS\eta_{\sigma}^{S}:

γ=Δ#(ησ|S)+ησS(ησ|(Ω\S)),\displaystyle\gamma=\Delta_{\#}(\eta_{\sigma}|_{S})+\eta_{\sigma}^{S}\otimes(\eta_{\sigma}|_{(\Omega\backslash S)}),

where Δ:dd×d\Delta:\mathbb{R}^{d}\to\mathbb{R}^{d}\times\mathbb{R}^{d} is the diagonal map sending xx to (x,x)(x,x).

We then have the following estimate:

dW,22(ησ,ησS)\displaystyle d_{\mathrm{W},2}^{2}\left(\eta_{\sigma},\eta_{\sigma}^{S}\right) y1y22ησ|Sησ(S)(dy1)ησ|(Ω\S)(dy2),\displaystyle\leq\iint\|y_{1}-y_{2}\|^{2}\frac{\eta_{\sigma}|_{S}}{\eta_{\sigma}(S)}(dy_{1})\eta_{\sigma}|_{(\Omega\backslash S)}(dy_{2}),
ησ((Ω\S))(diam(supp(p)))2\displaystyle\leq\eta_{\sigma}((\Omega\backslash S))\,\left(\mathrm{diam}(\mathrm{supp}(p))\right)^{2}
=(diam(supp(p)))2(Ω\S)exp(12σ2xx12)p(dx1)S(Ω\S)exp(12σ2xx12)p(dx1),\displaystyle=\left(\mathrm{diam}(\mathrm{supp}(p))\right)^{2}\frac{\int_{(\Omega\backslash S)}\exp{\left(-\frac{1}{2\sigma^{2}}\|x-x_{1}\|^{2}\right)}p(dx_{1})}{\int_{S\cup(\Omega\backslash S)}\exp{\left(-\frac{1}{2\sigma^{2}}\|x-x_{1}^{\prime}\|^{2}\right)}p(dx_{1}^{\prime})},
(diam(supp(p)))2(Ω\S)exp(12σ2xx12)p(dx1)Sexp(12σ2xx12)p(dx1),\displaystyle\leq\left(\mathrm{diam}(\mathrm{supp}(p))\right)^{2}\frac{\int_{(\Omega\backslash S)}\exp{\left(-\frac{1}{2\sigma^{2}}\|x-x_{1}\|^{2}\right)}p(dx_{1})}{\int_{S}\exp{\left(-\frac{1}{2\sigma^{2}}\|x-x_{1}\|^{2}\right)}p(dx_{1})},

By the assumption that dconv(S)(x)D/2ϵd_{\mathrm{conv}(S)}(x)\leq D/2-\epsilon, we have that for all x1Sx_{1}\in S

xx1D/2ϵ+D=3D/2ϵ,\|x-x_{1}\|\leq D/2-\epsilon+D=3D/2-\epsilon,

and for all x_{1}\in\Omega\backslash S

xx12D(D/2ϵ)=3D/2+ϵ.\|x-x_{1}\|\geq 2D-(D/2-\epsilon)=3D/2+\epsilon.

We then have that

dW,2(ησ,ησS)2\displaystyle d_{\mathrm{W},2}\left(\eta_{\sigma},\eta_{\sigma}^{S}\right)^{2} (diam(supp(p)))2(Ω\S)exp(12σ2(3D/2+ϵ)2)p(dx1)Sexp(12σ2(3D/2ϵ)2)p(dx1),\displaystyle\leq\left(\mathrm{diam}(\mathrm{supp}(p))\right)^{2}\frac{\int_{(\Omega\backslash S)}\exp{\left(-\frac{1}{2\sigma^{2}}(3D/2+\epsilon)^{2}\right)}p(dx_{1})}{\int_{S}\exp{\left(-\frac{1}{2\sigma^{2}}(3D/2-\epsilon)^{2}\right)}p(dx_{1})},
=(diam(supp(p)))21aSaSexp(12σ2((3D/2+ϵ)2(3D/2ϵ)2)),\displaystyle=\left(\mathrm{diam}(\mathrm{supp}(p))\right)^{2}\frac{1-a_{S}}{a_{S}}\exp\left(-\frac{1}{2\sigma^{2}}((3D/2+\epsilon)^{2}-(3D/2-\epsilon)^{2})\right),
(diam(supp(p)))21aSaSexp(3Dϵσ2).\displaystyle\leq\left(\mathrm{diam}(\mathrm{supp}(p))\right)^{2}\frac{1-a_{S}}{a_{S}}\exp\left(-\frac{3D\epsilon}{\sigma^{2}}\right).

Then by applying Lemma B.2 and the fact that \mathbb{E}(\eta_{\sigma}^{S})\in\mathrm{conv}(S), we have that

d_{\mathrm{conv}(S)}(m_{\sigma}(x))\leq\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}\exp\left(-\frac{3D\epsilon}{2\sigma^{2}}\right). ∎
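The cluster-absorption estimate can likewise be probed numerically. In the sketch below (a hypothetical two-point cluster S spanning a segment of length D, with one far point at distance greater than 2D from \mathrm{conv}(S); all parameters are our own choices), the denoiser's distance to \mathrm{conv}(S) collapses rapidly as \sigma decreases.

```python
import numpy as np

# Cluster S with diam(S) <= D; the remaining support is far from conv(S).
D, eps = 1.0, 0.1
pts = np.array([[0.0, 0.0], [D, 0.0], [3.5 * D, 0.0]])   # last point is far
a = np.array([0.4, 0.4, 0.2])
x = np.array([D / 2, D / 2 - eps])   # d_conv(S)(x) = D/2 - eps

def dist_to_conv_S(q):
    # conv(S) is the segment from (0, 0) to (D, 0)
    t = np.clip(q[0] / D, 0.0, 1.0)
    return np.linalg.norm(q - np.array([t * D, 0.0]))

for sigma in [0.8, 0.4, 0.2]:
    logw = -np.sum((pts - x) ** 2, axis=1) / (2 * sigma**2) + np.log(a)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    m_sigma = w @ pts   # denoiser m_sigma(x)
    print(f"sigma={sigma:4.2f}  d_conv(S)(m_sigma(x)) = {dist_to_conv_S(m_sigma):.2e}")
```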
Proof of Proposition 6.9.

Note that the set B_{D/2-\epsilon}(\mathrm{conv}(S)) is itself convex, since the distance function to a convex set is a convex function. For the first part, according to Item 2 of Proposition 3.9, it suffices to show that for all x\in\partial\overline{B_{D/2-\epsilon}(\mathrm{conv}(S))}, the denoiser m_{\sigma}(x) lies in the interior of B_{D/2-\epsilon}(\mathrm{conv}(S)) for all \sigma\leq\sigma_{0}(S,\epsilon). By Proposition 6.8, we need to show

d_{\mathrm{conv}(S)}(m_{\sigma}(x))\leq\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}\exp\left(-\frac{3D\epsilon}{2\sigma^{2}}\right)<D/2-\epsilon.

That is, we verify that when \sigma\leq\sigma_{0}(S,\epsilon), the following inequality holds.

\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}\exp\left(-\frac{3D\epsilon}{2\sigma^{2}}\right) <D/2-\epsilon,
\exp\left(-\frac{3D\epsilon}{2\sigma^{2}}\right) <C^{S}_{\epsilon},

where CϵS=D/2ϵdiam(supp(p))1aSaSC^{S}_{\epsilon}=\frac{D/2-\epsilon}{\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}}. Then, when CϵS1C^{S}_{\epsilon}\geq 1, the above inequality holds for all σ<\sigma<\infty and when CϵS<1C^{S}_{\epsilon}<1, it is straightforward to verify the above inequality holds for all σ<σ0(S,ϵ)\sigma<\sigma_{0}(S,\epsilon). This concludes the proof of the first part.

For the second part, we use \lambda=-\log\sigma as the reparametrization and consider the corresponding trajectory z_{\lambda}:=x_{\sigma(\lambda)}. We want to show that the distance d_{\mathrm{conv}(S)}(z_{\lambda}) goes to zero as \lambda goes to infinity. Taking the derivative of d_{\mathrm{conv}(S)}^{2}(z_{\lambda}) and using the notation

v:=𝔼[𝑿|𝒁λ=zλ and 𝑿S] where 𝒁λqσ(λ),v:=\mathbb{E}\left[\bm{X}|\bm{Z}_{\lambda}=z_{\lambda}\text{ and }\bm{X}\in S\right]\text{ where }\bm{Z}_{\lambda}\sim q_{\sigma(\lambda)},

we have

ddconv(S)2(zλ)dλ\displaystyle\frac{d\,d_{\mathrm{conv}(S)}^{2}(z_{\lambda})}{d\lambda} =2zλprojconv(S)(zλ),zλmλ(zλ),\displaystyle=-2\langle z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda}),z_{\lambda}-m_{\lambda}(z_{\lambda})\rangle,
=2zλprojconv(S)(zλ),zλprojconv(S)(zλ)\displaystyle=-2\langle z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda}),z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda})\rangle
2zλprojconv(S)(zλ),projconv(S)(zλ)v\displaystyle\quad\quad-2\langle z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda}),\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda})-v\rangle
2zλprojconv(S)(zλ),vmλ(zλ),\displaystyle\quad\quad-2\langle z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda}),v-m_{\lambda}(z_{\lambda})\rangle,
\leq-2d_{\mathrm{conv}(S)}^{2}(z_{\lambda})+2\|z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda})\|\,\|v-m_{\lambda}(z_{\lambda})\|,
\leq-2d_{\mathrm{conv}(S)}^{2}(z_{\lambda})+2(D/2-\epsilon)\|v-m_{\lambda}(z_{\lambda})\|.

where in the second to last inequality we use the fact that vconv(S)v\in\mathrm{conv}(S) and hence

zλprojconv(S)(zλ),projconv(S)(zλ)v0.\langle z_{\lambda}-\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda}),\mathrm{proj}_{\mathrm{conv}(S)}(z_{\lambda})-v\rangle\geq 0.

For the last inequality we also used that d_{\mathrm{conv}(S)}(z_{\lambda})\leq D/2-\epsilon, which holds by the absorbing property established in the first part of the proof. Note that v is exactly the quantity \mathbb{E}(\eta_{\sigma}^{S}) from the proof of Proposition 6.8, so we can apply that estimate (in terms of \lambda) to z_{\lambda} and obtain

vmλ(zλ)diam(supp(p))1aSaSexp(3Dϵ2e2λ).\|v-m_{\lambda}(z_{\lambda})\|\leq\mathrm{diam}(\mathrm{supp}(p))\sqrt{\frac{1-a_{S}}{a_{S}}}\exp\left(-\frac{3D\epsilon}{2}\,e^{2\lambda}\right).

Then by Lemma B.1, we have that dconv(S)(zλ)0d_{\mathrm{conv}(S)}(z_{\lambda})\to 0 as λ\lambda\to\infty and hence xσ=zλ(σ)x_{\sigma}=z_{\lambda(\sigma)} converges to the convex hull of SS as σ0\sigma\to 0. ∎

Proof of Proposition 6.11.

Since each V_{i}^{\epsilon} is a convex set (in fact, the closure of an open convex set), by Item 2 of Proposition 3.9 it suffices to prove that for all x\in\partial V_{i}^{\epsilon} (the boundary of V_{i}^{\epsilon}) and all \sigma<\sigma_{0}(V_{i}^{\epsilon}), one has that m_{\sigma}(x)\in\mathrm{int}\,V_{i}^{\epsilon} (the interior of V_{i}^{\epsilon}).

First of all, we define

ri,ϵ:=sep2(xi)ϵ22sep(xi).r_{i,\epsilon}:=\frac{\mathrm{sep}^{2}(x_{i})-\epsilon^{2}}{2\mathrm{sep}(x_{i})}.

It is straightforward to check that B_{r_{i,\epsilon}}(x_{i})\subseteq V_{i}^{\epsilon}. In fact, r_{i,\epsilon}=\max\{r>0:B_{r}(x_{i})\subseteq V_{i}^{\epsilon}\}.

Hence, by Corollary 4.12, for any xViϵx\in\partial V_{i}^{\epsilon} we have that

mσ(x)xi\displaystyle\|m_{\sigma}(x)-x_{i}\| diam(Ω)1aiaiexp(ΔΩ(x)4σ2)\displaystyle\leq\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{\Delta_{\Omega}(x)}{4\sigma^{2}}\right)
diam(Ω)1aiaiexp(ϵ24σ2)\displaystyle\leq\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\right)

Therefore, we need to identify when the above bound is less than r_{i,\epsilon}, that is,

diam(Ω)1aiaiexp(ϵ24σ2)\displaystyle\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\right) <ri,ϵ,\displaystyle<r_{i,\epsilon},
exp(ϵ24σ2)\displaystyle\exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}}\right) <ri,ϵdiam(Ω)1aiai,\displaystyle<\frac{r_{i,\epsilon}}{\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}},

Recall that C_{i,\epsilon}=\frac{r_{i,\epsilon}}{\mathrm{diam}(\Omega)}\sqrt{\frac{a_{i}}{1-a_{i}}}. Then, it is direct to check that when C_{i,\epsilon}\geq 1, the above inequality holds for all 0\leq\sigma<\infty, and when C_{i,\epsilon}<1, it holds for all \sigma<\sigma_{0}(V_{i}^{\epsilon})=\frac{\epsilon}{2}\left(\log(1/C_{i,\epsilon})\right)^{-1/2}. In summary, this implies that m_{\sigma}(x)\in\mathrm{int}\,B_{r_{i,\epsilon}}(x_{i})\subset\mathrm{int}\,V_{i}^{\epsilon} when \sigma<\sigma_{0}(V_{i}^{\epsilon}). ∎

Proof of Proposition 6.12.

We first argue in the λ parameter; the translation back to the σ parameter is then straightforward. Consider the trajectory z_λ := x_{σ(λ)} for λ ∈ [λ(σ_0), ∞), recalling that λ(0) = ∞. We first show that d_Ω(z_λ) → 0 as λ → ∞.

Along the trajectory z_λ, by Corollary A.11 we have

\displaystyle\frac{d\,d_{\Omega}^{2}(z_{\lambda})}{d\lambda}=-2\langle z_{\lambda}-x_{i},z_{\lambda}-m_{\lambda}(z_{\lambda})\rangle,
\displaystyle=-2\langle z_{\lambda}-x_{i},z_{\lambda}-x_{i}\rangle+2\langle z_{\lambda}-x_{i},x_{i}-m_{\lambda}(z_{\lambda})\rangle,
\displaystyle\leq-2d_{\Omega}^{2}(z_{\lambda})+2d_{\Omega}(z_{\lambda})\|x_{i}-m_{\lambda}(z_{\lambda})\|.

Applying Corollary 4.12 to points in V_i^ϵ, we have the following uniform bound for all y ∈ V_i^ϵ:

\|m_{\lambda}(y)-x_{i}\|\leq\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{1}{4}e^{2\lambda}\epsilon^{2}\right).

By Proposition 6.11 we have that z_λ ∈ V_i^ϵ. Therefore,

\displaystyle\frac{d\,d^{2}_{\Omega}(z_{\lambda})}{d\lambda}\leq-2d_{\Omega}^{2}(z_{\lambda})+\mathrm{sep}(x_{i})\,\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{1}{4}e^{2\lambda}\epsilon^{2}\right).

Then by Lemma B.1, we have that d_Ω(z_λ) → 0 as λ → ∞. Since Ω is discrete and x_i is the nearest point of Ω to z_λ for all large λ, we must have z_λ → x_i as λ → ∞. ∎
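As with the convex-hull sketch above, this local convergence can be observed numerically. Below is a minimal Python sketch, again assuming Ω is a finite set with uniform weights a_i; the three points and the starting point (chosen inside the cell of x_1 = (0, 0)) are arbitrary illustrative choices.

import numpy as np

# Assumption: Omega is a finite set with uniform weights a_i.
Omega = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.5]])

def denoiser(z, lam):
    sigma2 = np.exp(-2.0 * lam)
    logw = -np.sum((z - Omega) ** 2, axis=1) / (2.0 * sigma2)
    w = np.exp(logw - logw.max())  # numerically stabilized weights
    return (w[:, None] * Omega).sum(axis=0) / w.sum()

# Forward-Euler integration of dz/dlambda = m_lambda(z) - z.
z, dlam = np.array([0.3, 0.2]), 1e-3
for step in range(int(8.0 / dlam)):
    z += dlam * (denoiser(z, step * dlam) - z)
print("terminal point:", z)  # close to (0, 0), the nearest point of Omega

Early on the trajectory drifts toward the mean of Ω, but as λ grows the denoiser concentrates on the nearest point x_i and the trajectory decays to it, exactly as the differential inequality above predicts.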

Proof of Proposition 6.13.

The proof follows the same lines as the proofs of Propositions 6.11 and 6.12. As in the case of the exact denoiser, we consider the change of variable λ = −log(σ) for handling m_σ^θ. We let z_λ^θ := x_{σ(λ)}^θ and m_λ^θ(x) := m_{σ(λ)}^θ(x) for any x ∈ ℝ^d. Then it is straightforward to check that

\frac{dz_{\lambda}^{\theta}}{d\lambda}=m_{\lambda}^{\theta}(z_{\lambda}^{\theta})-z_{\lambda}^{\theta} \qquad (41)

and converting everything back to σ is straightforward.
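For completeness, here is the short computation behind (41), assuming (as for the exact denoiser) that the σ-parametrized trajectory solves dx_σ^θ/dσ = (x_σ^θ − m_σ^θ(x_σ^θ))/σ: since σ(λ) = e^{−λ} gives dσ/dλ = −σ, the chain rule yields

\frac{dz_{\lambda}^{\theta}}{d\lambda}=\frac{dx_{\sigma}^{\theta}}{d\sigma}\cdot\frac{d\sigma}{d\lambda}=\frac{x_{\sigma}^{\theta}-m_{\sigma}^{\theta}(x_{\sigma}^{\theta})}{\sigma}\cdot(-\sigma)=m_{\lambda}^{\theta}(z_{\lambda}^{\theta})-z_{\lambda}^{\theta}.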

We first identify a parameter λ_0(V_i^ϵ, ϕ) such that the denoiser m_λ^θ(x) always lies in the interior of V_i^ϵ for all x ∈ ∂V_i^ϵ and all λ > λ_0(V_i^ϵ, ϕ).

By the triangle inequality, for any y ∈ V_i^ϵ we have that

\displaystyle\|m^{\theta}_{\lambda}(y)-x_{i}\|\leq\|m^{\theta}_{\lambda}(y)-m_{\lambda}^{N}(y)\|+\|m_{\lambda}^{N}(y)-x_{i}\|
\displaystyle\leq\phi(\lambda)+\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{1}{4}e^{2\lambda}\epsilon^{2}\right).

Since ϕ(λ) goes to zero as λ goes to infinity, there exists a parameter λ_0(V_i^ϵ, ϕ) such that for all λ > λ_0(V_i^ϵ, ϕ) and all y ∈ ∂V_i^ϵ, ‖m_λ^θ(y) − x_i‖ < (sep²(x_i) − ϵ²)/(2 sep(x_i)) = r_{i,ϵ}. Then, by the same argument as in the proof of Proposition 6.11, the trajectory z_λ^θ never leaves V_i^ϵ for λ > λ_0(V_i^ϵ, ϕ).

Since the trajectory z_λ^θ never leaves V_i^ϵ for λ > λ_0(V_i^ϵ, ϕ), we can then apply the uniform decay of ‖m_λ^θ(y) − x_i‖ to the differential inequality for d_Ω²(z_λ^θ) as follows:

\displaystyle\frac{d\,d_{\Omega}^{2}(z_{\lambda}^{\theta})}{d\lambda}\leq-2d_{\Omega}^{2}(z_{\lambda}^{\theta})+2d_{\Omega}(z_{\lambda}^{\theta})\left\|x_{i}-m_{\lambda}^{\theta}(z_{\lambda}^{\theta})\right\|
\displaystyle\leq-2d_{\Omega}^{2}(z_{\lambda}^{\theta})+2d_{\Omega}(z_{\lambda}^{\theta})\left(\phi(\lambda)+\mathrm{diam}(\Omega)\sqrt{\frac{1-a_{i}}{a_{i}}}\exp\left(-\frac{1}{4}e^{2\lambda}\epsilon^{2}\right)\right).

Now, applying Lemma B.1 again as in the proof of Proposition 6.12, we conclude that d_Ω(z_λ^θ) goes to zero as λ goes to infinity, and hence z_λ^θ converges to x_i. ∎
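The robustness to a vanishing denoiser error can also be checked numerically. Below is a minimal Python sketch, reusing the uniform-weight denoiser from the sketch above and adding a hypothetical error of norm ϕ(λ) = 0.1 e^{−λ/2}; both the error schedule and its direction are illustrative stand-ins for the deviation of m_λ^θ from the exact denoiser.

import numpy as np

Omega = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.5]])

def exact_denoiser(z, lam):
    sigma2 = np.exp(-2.0 * lam)
    logw = -np.sum((z - Omega) ** 2, axis=1) / (2.0 * sigma2)
    w = np.exp(logw - logw.max())
    return (w[:, None] * Omega).sum(axis=0) / w.sum()

def learned_denoiser(z, lam):
    # Exact denoiser plus an error of norm phi(lambda) = 0.1 * exp(-lambda / 2),
    # a hypothetical stand-in for the approximation error of m^theta.
    return exact_denoiser(z, lam) + 0.1 * np.exp(-0.5 * lam) * np.array([1.0, 0.0])

z, dlam = np.array([0.3, 0.2]), 1e-3
for step in range(int(10.0 / dlam)):
    z += dlam * (learned_denoiser(z, step * dlam) - z)
print("terminal point:", z)  # still close to (0, 0) despite the perturbation

Since ϕ(λ) → 0, Lemma B.1 absorbs the perturbation exactly as in the display above, and the trajectory still terminates at x_i.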

Proof of Proposition 6.14.

As in the proof of Proposition 6.13, we consider the λ parameter and the corresponding trajectory z_λ^θ := x_{σ(λ)}^θ for λ ∈ [λ(σ_0), ∞). Then, by assumption, the limits z_∞^θ := lim_{λ→∞} z_λ^θ and lim_{λ→∞} m_λ^θ(z_λ^θ) exist. In particular, the limit of the derivative lim_{λ→∞} dz_λ^θ/dλ = lim_{λ→∞} (m_λ^θ(z_λ^θ) − z_λ^θ) exists, and we will show that it must be zero.

Suppose lim_{λ→∞} dz_λ^θ/dλ ≠ 0. Then there is a coordinate j whose limit v_j is nonzero, and hence there exists T > 0 such that the j-th coordinate of dz_λ^θ/dλ has absolute value at least |v_j|/2 (with a fixed sign) for all λ > T. By the mean value theorem, |z_{λ₂,j}^θ − z_{λ₁,j}^θ| ≥ (|v_j|/2)(λ₂ − λ₁) for any λ₂ > λ₁ > T. However, due to the convergence z_λ^θ → z_∞^θ, we can find λ₁, λ₂ > T with λ₂ − λ₁ ≥ 1 and |z_{λ₁,j}^θ − z_{λ₂,j}^θ| < |v_j|/2, a contradiction. Therefore, the limit of the derivative dz_λ^θ/dλ must be zero, which implies that

\lim_{\lambda\to\infty}\|m^{\theta}_{\lambda}(z^{\theta}_{\lambda})-z^{\theta}_{\lambda}\|=0.