
Non-asymptotic convergence bounds for
Wasserstein approximation using point clouds

Quentin Mérigot
Université Paris-Saclay, CNRS,
Laboratoire de mathématiques d’Orsay,
91405, Orsay, France
Institut Universitaire de France
Filippo Santambrogio
Univ Lyon,
Université Claude Bernard Lyon 1, CNRS
UMR 5208, Institut Camille Jordan
F-69622 Villeurbanne
Institut Universitaire de France
Clément Sarrazin
Université Paris-Saclay, CNRS,
Laboratoire de mathématiques d’Orsay
91405, Orsay, France
Abstract

Several issues in machine learning and inverse problems require generating discrete data, as if sampled from a model probability distribution. A common way to do so relies on the construction of a uniform probability distribution over a set of N points which minimizes the Wasserstein distance to the model distribution. This minimization problem, where the unknowns are the positions of the atoms, is non-convex. Yet, in most cases, a suitably adjusted version of Lloyd's algorithm, in which Voronoi cells are replaced by Power cells, leads to configurations with small Wasserstein error. This is surprising because of the non-convex nature of the problem and the existence of spurious critical points. We provide explicit upper bounds for the convergence speed of this Lloyd-type algorithm, starting from a cloud of points sufficiently far from each other. These bounds already hold after one step of the iteration procedure, and similar bounds can be deduced for the corresponding gradient descent. They naturally lead to a modified Polyak-Łojasiewicz inequality for the Wasserstein distance cost, with an error term depending on the distances between Dirac masses in the discrete distribution.

1 Introduction

In recent years, the theory of optimal transport has been the source of stimulating ideas in machine learning and in inverse problems. Optimal transport can be used to define distances, called Wasserstein or earth-mover distances, between probability distributions over a metric space. These distances allow one to measure the closeness between a generated distribution and a model distribution, and they have been used with success as data attachment terms in inverse problems. In practice, it has been observed for several different inverse problems that replacing usual loss functions with Wasserstein distances tends to increase the basin of convergence of the methods towards a good solution of the problem, or even to convexify the landscape of the minimized energy [7, 6]. This good behaviour is not fully understood, but one may attribute it partly to the fact that Wasserstein distances encode the geometry of the underlying space. A notable use of Wasserstein distances in machine learning is in the field of generative adversarial networks, where one seeks to design a neural network able to produce random examples whose distribution is close to a prescribed model distribution [2].

Wasserstein distance and Wasserstein regression

Given two probability distributions ρ, μ on ℝ^d, the Wasserstein distance of exponent p between ρ and μ measures the total cost of moving the mass distribution described by ρ onto μ, knowing that moving a unit mass from x to y costs ‖x−y‖^p. Formally, it is defined as the value of an optimal transport problem between ρ and μ:

W_{p}(\rho,\mu)=\left(\min_{\pi\in\Pi(\rho,\mu)}\int\|x-y\|^{p}\,\mathrm{d}\pi(x,y)\right)^{1/p}, (1)

where the minimum is taken over the set Π(ρ, μ) of transport plans between ρ and μ, i.e. probability distributions over ℝ^d × ℝ^d with marginals ρ and μ. Standard references on the theory of optimal transport include the books by Villani and by Santambrogio [19, 20, 18], while the computational and statistical aspects are discussed in the survey by Peyré and Cuturi [16].
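As a concrete illustration (not part of the original paper), the quantity W_p^p between two finitely supported measures can be computed by solving the linear program (1) directly; the following Python sketch uses SciPy's linear programming solver, and the function name and the small test clouds are illustrative only.

```python
# Minimal sketch: W_p(sum_i a_i delta_{x_i}, sum_j b_j delta_{y_j})^p via the
# Kantorovich linear program (1). Illustrative code, not the authors' implementation.
import numpy as np
from scipy.optimize import linprog

def wasserstein_pp(x, y, a, b, p=2):
    n, m = len(x), len(y)
    # cost matrix c_{ij} = ||x_i - y_j||^p, flattened row-major: variable k = i*m + j
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                      # row marginals: sum_j pi_{ij} = a_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                      # column marginals: sum_i pi_{ij} = b_j
        A_eq[n + j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun                          # optimal transport cost, i.e. W_p^p

rng = np.random.default_rng(0)
x, y = rng.random((5, 2)), rng.random((7, 2))
w22 = wasserstein_pp(x, y, np.full(5, 1 / 5), np.full(7, 1 / 7), p=2)
```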

In this article, we consider regression problems with respect to the Wasserstein metric, which can be put in the following form:

\min_{\theta\in\Theta}\mathrm{W}_{p}^{p}(T_{\theta\#}\mu,\rho), (2)

where μ is the reference distribution, a probability measure on [0,1]^ℓ, ρ is the model distribution, a probability measure on ℝ^d, and where T_θ : [0,1]^ℓ → ℝ^d is a family of maps indexed by a parameter θ ∈ Θ. In the previous formula, T_{θ#}μ denotes the image of the measure μ under the map T_θ, also called the pushforward of μ under T_θ. This image measure is defined by T_{θ#}μ(B) := μ(T_θ^{-1}(B)) for any measurable set B in the codomain of T_θ. In this work, we concentrate on the quadratic Wasserstein distance W_2. Several problems related to the design of generative models can be put under the form (2), see for instance [8, 2]. Solving (2) numerically is challenging for several reasons, but in this article we concentrate on one of them: the non-convexity of the Wasserstein distance under displacement of the measures.

Non-convexity of the Wasserstein distance under displacements.

It is well known that the Wasserstein distance is convex for the standard (linear) structure of the space of probability measures, meaning that if ν_0 and ν_1 are two probability measures and ν_t = (1−t)ν_0 + tν_1, then the map t ∈ [0,1] ↦ W_p^p(ν_t, ρ) is convex. Using a terminology from physics, we may say that the Wasserstein distance is convex for the Eulerian structure of the space of probability measures, i.e. when one interpolates linearly between the densities. However, in the regression problem (2), the perturbations are Lagrangian rather than Eulerian, in the sense that modifications of the parameter θ lead to a displacement of the support of the measure T_{θ#}μ. This appears very clearly when μ is the uniform measure over a set X = (x_1, …, x_N) of N points in [0,1]^d, i.e. μ = δ_X with

\delta_{X}\stackrel{\text{def}}{=}\frac{1}{N}\sum_{i=1}^{N}\delta_{x_{i}}.

In this case T_{θ#}μ is the uniform measure over the set T_θ(X) = (T_θ(x_1), …, T_θ(x_N)), i.e. T_{θ#}μ = δ_{T_θ(X)}. In this article, we will therefore be interested in the function

F_{N}:Y\in(\mathbb{R}^{d})^{N}\mapsto\frac{1}{2}\mathrm{W}^{2}_{2}(\rho,\delta_{Y}). (3)

This function F_N is not convex, and actually exhibits (semi-)concavity properties. This was first observed in [1] (Theorem 7.3.2), and is related to the positive curvature of the Wasserstein space. A precise statement in the context considered here may also be found as Proposition 21 in [14]. A practical consequence of the lack of convexity of F_N is that critical points of F_N are not necessarily global minimizers. It is actually easy to construct families of critical points Y_N of F_N such that F_N(Y_N) is bounded from below by a positive constant, while lim_{N→∞} min F_N = 0, so that the ratio between F_N(Y_N) and min F_N is arbitrarily large as N → +∞. This can be done by concentrating the points Y_N on lower-dimensional subspaces of ℝ^d, as in Remarks 2 and 3.

When applying gradient descent to the nonconvex optimization problem (2), it is in principle possible to end up at local minima corresponding to high energy critical points of the Wasserstein distance, regardless of the non-linearity of the map θ ↦ T_{θ#}μ. Our main theorem, or rather its Corollary 6, shows that if the points of Y are at distance at least ε > 0 from one another, then

F_{N}(Y)-C\frac{1}{N\varepsilon^{d-1}}\leq N\|\nabla F_{N}(Y)\|^{2}.

In the previous equation, ‖∇F_N(Y)‖ denotes the Euclidean norm of the vector in ℝ^{Nd} obtained by concatenating the gradients of F_N with respect to the positions of the atoms y_i. We note that due to the weights 1/N in the atomic measure δ_Y, the components of this vector are in general of order 1/N, see Proposition 1. This inequality resembles the Polyak-Łojasiewicz inequality, and shows in particular that if the quantization error W_2^2(ρ, δ_Y) = 2F_N(Y) is large, i.e. larger than ε^{1-d}/N, then the point cloud Y is not critical for F_N. From this, we deduce in Theorem 7 that if the points in the initial cloud are not too close to each other at initialization, then the iterates of fixed-step gradient descent converge to points with low energy F_N, despite the non-convexity of F_N.

Figure 1: From left to right: a point cloud Y^0 in the square Ω = [0,1]×[0,1]; the associated Power cells P_i(Y^0) in the optimal transport to the Lebesgue measure on Ω; the vectors −N∇F_N(Y^0) = B_N(Y^0) − Y^0 followed during the Lloyd step; and the positions of the barycenters Y^1 = B_N(Y^0).

Relation to optimal quantization

Our main result also has implications in terms of the uniform optimal quantization problem, where one seeks a point cloud Y = (y_1, …, y_N) in (ℝ^d)^N such that the uniform measure supported on Y, denoted δ_Y, is as close as possible to the model distribution ρ with respect to the 2-Wasserstein distance:

\min_{Y\in\Omega^{N}}F_{N}(Y). (4)

The uniform optimal quantization problem (4) is a natural variant of the (standard) optimal quantization problem, where one does not impose that the measure supported on Y is uniform:

\min_{Y\in(\mathbb{R}^{d})^{N}}G_{N}(Y),\quad\hbox{ where }G_{N}(Y)=\min_{\mu\in\Delta_{N}}\mathrm{W}^{2}_{2}\Big(\rho,\sum_{i=1}^{N}\mu_{i}\delta_{y_{i}}\Big), (5)

and where Δ_N ⊆ ℝ^N is the probability simplex. This standard optimal quantization problem is a cornerstone of sampling theory, and we refer the reader to the book of Graf and Luschgy [10] and to the survey by Pagès [15]. The uniform quantization problem (4) is less common, but also very natural. It has been used in imaging to produce stipplings of an image [4, 3] and for meshing purposes [9]. A common difficulty in solving (5) and (4) numerically is that the minimized functionals F_N and G_N are non-convex and have many critical points with high energy. However, in practice, simple fixed-point or gradient descent strategies behave well when the initial point cloud is not chosen adversely. Our second contribution is a quantitative explanation for this good behaviour in the case of the uniform optimal quantization problem.

Lloyd's algorithm [12] is a fixed-point algorithm for approximately solving the standard optimal quantization problem (5). Starting from a point cloud Y^k = (y_1^k, …, y_N^k) ∈ (ℝ^d)^N with distinct points, one defines the next iterate Y^{k+1} in two steps. First, one computes the Voronoi diagram of Y^k, a tessellation of the space into convex polyhedra (V_i(Y^k))_{1≤i≤N}, where

V_{i}(Y)=\{x\in\Omega\mid\forall j\in\{1,\ldots,N\},\ \|x-y_{i}\|\leq\|x-y_{j}\|\}. (6)

In the second step, one moves every point y_i^k to the barycenter, with respect to ρ, of the corresponding cell V_i(Y^k). This algorithm can also be interpreted as a fixed-point algorithm for solving the first-order optimality condition of (5), i.e. ∇G_N(Y) = 0. One can show that the energy (G_N(Y^k))_{k≥0} is decreasing in k. The convergence of Y^k towards a critical point of G_N as k → +∞ has been studied in [5], but the energy of this limit critical point is not guaranteed to be small.
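For concreteness, one step of this classical algorithm can be sketched as follows when ρ is represented by a weighted cloud of sample points (illustrative code, not the authors' implementation; empty cells are simply left untouched).

```python
# Minimal sketch: one step of Lloyd's algorithm for (5), with rho approximated by
# weighted samples. Each y_i moves to the rho-barycenter of its Voronoi cell V_i(Y).
import numpy as np

def lloyd_step_voronoi(samples, weights, Y):
    d2 = ((samples[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    cells = np.argmin(d2, axis=1)          # Voronoi assignment (6) of each sample
    Y_next = Y.copy()
    for i in range(len(Y)):
        mask = cells == i
        if weights[mask].sum() > 0:
            Y_next[i] = np.average(samples[mask], axis=0, weights=weights[mask])
    return Y_next
```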

In the case of the uniform quantization problem (4), one can try to minimize the energy F_N by gradient descent, defining the iterates

Y^{k+1}=Y^{k}-\tau N\nabla F_{N}(Y^{k}), (7)

where τ > 0 is the time step. The factor N in front of ∇F_N compensates for the fact that, in general, ∇F_N(Y) = O(1/N). When τ = 1, one recovers a version of Lloyd's algorithm for the uniform quantization problem, involving the barycenters B_N(Y) of Power cells, rather than Voronoi cells, associated to Y. More precisely, Proposition 1 proves that ∇F_N(Y) = (Y − B_N(Y))/N, so that Y^{k+1} = B_N(Y^k) when τ = 1. Quite surprisingly, we prove in Corollary 4 that if the points in the initial cloud Y^0 are not too close to each other, then the uniform measure over the point cloud Y^1 = Y^0 − N∇F_N(Y^0), obtained after only one step of Lloyd's algorithm, is close to ρ. This is illustrated in Figure 1. We prove in particular the following statement.

Theorem (Particular case of Corollary 4).

Let ρ be a probability density over a compact convex set Ω ⊆ ℝ^d, let Y^0 = (y_1^0, …, y_N^0) ∈ Ω^N and assume that the points lie at some positive distance from one another: for some constant c,

\forall i\neq j,\quad\|y_{i}-y_{j}\|\geq cN^{-1/d},

corresponding for instance to a point cloud sampled on a regular grid. Then, the point cloud Y^1 = Y^0 − N∇F_N(Y^0) obtained after one step of Lloyd's algorithm satisfies

\mathrm{W}_{2}^{2}(\delta_{Y^{1}},\rho)\leq C_{c,d,\Omega}\,N^{-1/d},

where C_{c,d,Ω} is a constant depending on c, d and diam(Ω).

Outline

In Section 2, we start with a short review of background material on optimal transport and optimal uniform quantization. We then establish our main result (Theorem 3) on the approximation of a measure ρ by barycenters of Power cells. This theorem yields error estimates for one step of Lloyd's algorithm in deterministic and probabilistic settings (Corollaries 4 and 5). In Section 3, we establish a Polyak-Łojasiewicz-type inequality (Corollary 6) for the function F_N = (1/2)W_2^2(ρ, δ_Y) introduced in (3), and we study the convergence of a gradient descent algorithm for F_N (Theorem 7). Finally, in Section 4, we report numerical results on optimal uniform quantization in dimension d = 2.

2 Lloyd’s algorithm for optimal uniform quantization

Optimal transport and Kantorovich duality

In this section we briefly review Kantorovich duality and its relation to semidiscrete optimal transport. The cost is fixed to c(x,y) = ‖x−y‖², and we assume that ρ is a probability density over a compact convex domain Ω. In this setting, Brenier's theorem implies that given any probability measure μ supported on Ω, the optimal transport plan between ρ and μ, i.e. the minimizer π in the definition of the Wasserstein distance (1) with p = 2, is induced by a transport map T_μ : Ω → Ω, meaning π = (\mathrm{Id}, T_{\mu})_{\#}\rho. One can derive an alternative expression for the Wasserstein distance using Kantorovich duality, which leads to a more precise description of the optimal transport map [18, Theorem 1.39]:

\mathrm{W}_{2}^{2}(\rho,\mu)=\max_{\phi:Y\to\mathbb{R}}\int_{\mathbb{R}^{d}}\phi\,\mathrm{d}\mu+\int_{\Omega}\phi^{c}\,\mathrm{d}\rho, (8)

where ϕ^c(x) = min_i c(x, y_i) − ϕ_i. When μ = δ_Y is the uniform probability measure over a point cloud Y = (y_1, …, y_N) containing N distinct points, we set ϕ_i = ϕ(y_i) and we define the i-th Power cell associated to the couple (Y, ϕ) as

\mathrm{Pow}_{i}(Y,\phi)=\{x\in\mathbb{R}^{d}\mid\forall j\in\{1,\ldots,N\},\ \left\lVert x-y_{i}\right\rVert^{2}-\phi_{i}\leq\left\lVert x-y_{j}\right\rVert^{2}-\phi_{j}\}.

Then, the Kantorovich dual (8) of the optimal transport problem between ρ and δ_Y turns into a finite-dimensional concave maximization problem:

\mathrm{W}_{2}^{2}(\rho,\delta_{Y})=\max_{\phi\in\mathbb{R}^{N}}\sum_{i=1}^{N}\left(\frac{\phi_{i}}{N}+\int_{\mathrm{Pow}_{i}(Y,\phi)}\left(\left\lVert x-y_{i}\right\rVert^{2}-\phi_{i}\right)\mathrm{d}\rho(x)\right). (9)

By Corollary 1.2 in [11], a vector ϕ ∈ ℝ^N is optimal for this maximization problem if and only if each Power cell contains the same amount of mass, i.e. if

\forall i\in\{1,\ldots,N\},\quad\rho(\mathrm{Pow}_{i}(Y,\phi))=\frac{1}{N}. (10)

From now on, we denote P_i(Y) = Pow_i(Y, ϕ) ∩ Ω, where ϕ ∈ ℝ^N satisfies (10). The optimal transport map T_Y between ρ and δ_Y sends every Power cell P_i(Y) to the point y_i, i.e. it is defined ρ-almost everywhere by T_Y|_{P_i(Y)} = y_i. We refer again to the introduction of [11] for more details.
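To make this concrete, the optimality condition (10) can be enforced numerically; the sketch below assumes ρ is discretized as a weighted cloud of grid points and uses plain gradient ascent on the concave dual (9) (the damped Newton method of [11] is the method of choice in practice; the step size and iteration count here are illustrative only).

```python
# Minimal sketch: compute weights phi so that every power cell Pow_i(Y, phi)
# carries mass approximately 1/N, cf. condition (10). Illustrative code.
import numpy as np

def power_cell_assignment(grid, Y, phi):
    # index of the power cell containing each grid point: argmin_i ||x - y_i||^2 - phi_i
    d2 = ((grid[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.argmin(d2 - phi[None, :], axis=1)

def solve_dual(grid, rho_w, Y, n_iter=2000, step=1.0):
    # gradient ascent on the dual (9); the i-th partial derivative is 1/N - rho(Pow_i)
    N = len(Y)
    phi = np.zeros(N)
    for _ in range(n_iter):
        cells = power_cell_assignment(grid, Y, phi)
        masses = np.bincount(cells, weights=rho_w, minlength=N)
        phi += step * (1.0 / N - masses)
    return phi
```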

Optimal uniform quantization

In this article, we study the behaviour of the squared Wasserstein distance between the (fixed) probability density ρ and a uniform finitely supported measure δ_Y, where Y = (y_1, …, y_N) is a cloud of N points, in terms of variations of Y. As in equation (3), we denote F_N = (1/2) W_2^2(ρ, ·). Proposition 21 in [14] gives an expression for the gradient of F_N, and proves its semiconcavity. We recall that a function F is called α-semiconcave, with α ≥ 0, if the function F − (α/2)‖·‖² is concave. We denote 𝔻_N the generalized diagonal

\mathbb{D}_{N}=\{Y\in(\mathbb{R}^{d})^{N}\mid\exists i\neq j\hbox{ s.t. }y_{i}=y_{j}\}.
Proposition 1 (Gradient of F_N).

The function F_N is (1/N)-semiconcave on (ℝ^d)^N and is of class 𝒞¹ on (ℝ^d)^N \ 𝔻_N. In addition, one has

\forall Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N},\quad\nabla F_{N}(Y)=\frac{1}{N}(Y-B_{N}(Y)),\quad\hbox{where }B_{N}(Y)=(b_{1}(Y),\ldots,b_{N}(Y)), (11)

and where b_i(Y) is the barycenter of the i-th Power cell, i.e. b_i(Y) = N∫_{P_i(Y)} x dρ(x).
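Continuing the numerical sketch of the previous section (again with ρ replaced by weighted grid samples, and reusing the helper power_cell_assignment defined there), the barycenters B_N(Y) and the gradient formula (11) can be approximated as follows; one Lloyd step then simply returns B.

```python
# Minimal sketch: barycenters b_i(Y) of the power cells and the gradient of F_N
# from Proposition 1. Illustrative code, not the authors' implementation.
import numpy as np

def lloyd_step_power(grid, rho_w, Y, phi):
    N = len(Y)
    cells = power_cell_assignment(grid, Y, phi)   # helper from the previous sketch
    B = Y.copy()
    for i in range(N):
        mask = cells == i
        if rho_w[mask].sum() > 0:
            # b_i(Y) = N * int_{P_i(Y)} x drho(x); each cell carries mass ~ 1/N
            B[i] = np.average(grid[mask], axis=0, weights=rho_w[mask])
    grad_FN = (Y - B) / N                          # formula (11)
    return B, grad_FN                              # one Lloyd step (14): Y^{k+1} = B
```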

It is not difficult to prove that F_N admits at least one minimizer, and that any minimizer Y satisfies the first-order optimality condition Y = B_N(Y). A point cloud that satisfies this condition is called critical.

Remark 1 (Upper bound on the minimum of F_N).

We note from [14, Proposition 12] that when ρ is supported on a compact subset of ℝ^d, then

\min F_{N}=\min_{Y\in(\mathbb{R}^{d})^{N}}\frac{1}{2}\mathrm{W}_{2}^{2}(\rho,\delta_{Y})\lesssim\begin{cases}N^{-\frac{2}{d}}&\hbox{ if }d>2,\\ N^{-1}\log N&\hbox{ if }d=2,\\ N^{-1}&\hbox{ if }d=1.\end{cases} (12)

These upper bounds may not be tight, in particular when ρ is separable (see Appendix E).

Remark 2 (High energy critical points).

On the other hand, since F_N is not convex, this first-order condition is not sufficient for a point cloud to be a minimizer of F_N. For instance, if ρ ≡ 1 on the unit square Ω = [0,1]², one can check that the point cloud

Y_{N}=\left(\left(\frac{1}{2N},\frac{1}{2}\right),\left(\frac{3}{2N},\frac{1}{2}\right),\ldots,\left(\frac{2N-1}{2N},\frac{1}{2}\right)\right)

is a critical point of F_N but not a minimizer of F_N. In fact, this critical point becomes arbitrarily bad as N → +∞, in the sense that

\lim_{N\to+\infty}\frac{F_{N}(Y_{N})}{\min F_{N}}=+\infty.

On the other hand, we note that the point cloud Y_N is highly concentrated, in the sense that the distance between consecutive points in Y_N is 1/N, whereas in an evenly distributed point cloud one would expect the minimum distance between points to be of order N^{-1/d}.
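These claims can be verified directly from the definitions. By symmetry (taking ϕ = 0 in (10)), the Power cells of Y_N are the vertical strips P_i(Y_N) = [(i−1)/N, i/N] × [0,1], each of mass 1/N, and the barycenter of P_i(Y_N) is exactly y_i = ((2i−1)/(2N), 1/2), so that Y_N = B_N(Y_N) is critical. Moreover,

2F_{N}(Y_{N})=\sum_{i=1}^{N}\int_{P_{i}(Y_{N})}\|x-y_{i}\|^{2}\,\mathrm{d}x=\frac{1}{12N^{2}}+\frac{1}{12},

so F_N(Y_N) ≥ 1/24 for every N, while min F_N ≲ N^{-1}\log N by (12).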

Gradient descent and Lloyd’s algorithm

One can find a critical point of F_N by following the discrete gradient flow of F_N, defined in (7), starting from an initial position Y^0 ∈ (ℝ^d)^N \ 𝔻_N. Thanks to the expression of ∇F_N given in Proposition 1, the discrete gradient flow may be written as

Y^{k+1}=Y^{k}+\tau_{N}(B_{N}(Y^{k})-Y^{k}), (13)

where τ_N is a fixed time step. For τ_N = 1, one recovers a variant of Lloyd's algorithm, where at each iteration every point is moved to the barycenter of its Power cell P_i(Y^k):

Y^{k+1}=B_{N}(Y^{k}). (14)

We can state the following result about Lloyd’s algorithm for the uniform quantization problem, whose proof is postponed to the appendix.

Proposition 2.

Let N be a fixed integer and let (Y^k)_{k≥0} be the iterates of (14), with Y^0 ∉ 𝔻_N. Then the energy k ↦ F_N(Y^k) is decreasing and lim_{k→+∞} ‖∇F_N(Y^k)‖ = 0. Moreover, the sequence (Y^k)_{k≥0} belongs to a compact subset of (ℝ^d)^N \ 𝔻_N, and every limit point of a converging subsequence is a critical point of F_N.

Experiments suggest that following the discrete gradient flow of F_N does not lead to high energy critical points of F_N, such as those described in Remark 2, unless we start from an adversely chosen point cloud. The following theorem and its corollaries, the main results of this article, back up this experimental evidence. They show that if the point cloud Y is not too concentrated, then the uniform measure over the barycenters of the Power cells, δ_{B_N(Y)}, is a good quantization of the probability density ρ, i.e. they bound the quantization error after one step of Lloyd's algorithm (14).

We will use the following notation for ε > 0:

I_{\varepsilon}(Y)=\{i\in\{1,\ldots,N\}\mid\forall j\neq i,\ \|y_{i}-y_{j}\|\geq\varepsilon\},
\mathbb{D}_{N,\varepsilon}=\{Y\in(\mathbb{R}^{d})^{N}\mid\exists i\neq j,\ \|y_{i}-y_{j}\|\leq\varepsilon\}.

Note that 𝔻_{N,ε} is an ε-neighborhood of the generalized diagonal 𝔻_N.

Theorem 3 (Quantization by barycenters).

Let Ω ⊆ ℝ^d be a compact convex set, let ρ be a probability density on Ω and consider a point cloud Y = (y_1, …, y_N) in Ω^N \ 𝔻_N. Then, for all 0 < ε ≤ 1,

\mathrm{W}^{2}_{2}\left(\rho,\delta_{B_{N}(Y)}\right)\leq C_{d,\Omega}\left(\frac{\varepsilon^{1-d}}{N}+1-\frac{\operatorname{Card}(I_{\varepsilon}(Y))}{N}\right), (15)

where C_{d,Ω} = \frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d+1} and where ω_{d-1} is the volume of the unit ball in ℝ^{d-1}.
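For instance, for d = 2 and Ω = [0,1]² one has ω_1 = 2 and diam(Ω) = √2, so that C_{2,Ω} = 4(√2+1)³ ≈ 56; if all the points of Y are ε-separated with ε of order N^{-1/2}, the right-hand side of (15) is then of order N^{-1/2}.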

The proof relies on arguments from convex geometry. In particular, we denote A ⊕ B the Minkowski sum of sets: A ⊕ B = {a + b | (a,b) ∈ A × B}.

Proof.

Let ϕ¹ ∈ ℝ^N be the solution of the dual Kantorovich problem between ρ and δ_Y, characterized by (10). We let ϕ^t = tϕ¹ and we denote P_i^t = Pow_i(Y, ϕ^t) ∩ Ω′ the i-th Power cell intersected with the slightly enlarged convex set Ω′ = Ω ⊕ B(0,1). This way, P_i^1 ⊇ P_i(Y), whereas P_i^0 is the intersection of the i-th Voronoi cell defined in (6) with Ω′.

We will now prove an upper bound on the sum of the diameters of the cells P_i(Y) whose index lies in I_ε(Y). First, we notice the following inclusion, which holds for any t ∈ [0,1]:

(1-t)P_{i}^{0}\oplus tP_{i}^{1}\subseteq P_{i}^{t}. (16)

Indeed, let x^0 ∈ P_i^0 and x^1 ∈ P_i^1, so that for all j ∈ {1, …, N} and k ∈ {0,1},

\|x^{k}-y_{i}\|^{2}-\phi^{k}_{i}\leq\|x^{k}-y_{j}\|^{2}-\phi^{k}_{j}.

Expanding the squares and subtracting ‖x^k‖² on both sides, these inequalities become affine in ϕ^k and x^k; taking their convex combination with weights (1−t, t) directly implies that x^t = (1−t)x^0 + tx^1 belongs to P_i^t, as desired.

For any index i ∈ I_ε(Y), the point y_i is at distance at least ε from the other points, implying that the ball B(y_i, ε/2) is contained in the Voronoi cell V_i(Y). Using that P_i^0 = V_i(Y) ∩ Ω′, that Ω′ = Ω ⊕ B(0,1) and that y_i ∈ Ω, we deduce that P_i^0 contains the same ball. On the other hand, P_i^1 contains a segment S_i of length diam(P_i^1), and the inclusion (16) with t = 1/2 gives

\frac{1}{2}(\mathrm{B}(y_{i},\varepsilon/2)\oplus S_{i})\subseteq P_{i}^{1/2}.

The Minkowski sum in the left-hand side contains in particular the product of a (d−1)-dimensional ball of radius ε/2 with an orthogonal segment of length diam(P_i^1) ≥ diam(P_i(Y)). Thus,

\frac{1}{2^{d}}\,\omega_{d-1}\frac{\varepsilon^{d-1}}{2^{d-1}}\operatorname{diam}(P_{i}(Y))\leq|P_{i}^{1/2}|.

Using that the Power cells P_i^{1/2} form a tessellation of the domain Ω′, we therefore obtain

\sum_{i\in I_{\varepsilon}(Y)}\operatorname{diam}(P_{i}(Y))\leq\frac{2^{2d-1}}{\omega_{d-1}}\left\lvert\Omega^{\prime}\right\rvert\varepsilon^{1-d}\leq\frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d}\varepsilon^{1-d}. (17)

We now estimate the transport cost between δ_B and the density ρ, where B = B_N(Y). The transport cost due to the points whose indices do not belong to I_ε(Y) can be bounded in a crude way by

\sum_{i\not\in I_{\varepsilon}(Y)}\int_{P_{i}(Y)}\|x-y_{i}\|^{2}\,\mathrm{d}\rho(x)\leq\left(1-\frac{\operatorname{Card}I_{\varepsilon}(Y)}{N}\right)\operatorname{diam}(\Omega)^{2}.

Note that we used ρ(P_i(Y)) = 1/N. On the other hand, the transport cost associated with indices in I_ε(Y) can be bounded using (17) and diam(P_i(Y)) ≤ diam(Ω):

\begin{aligned}\sum_{i\in I_{\varepsilon}(Y)}\int_{P_{i}(Y)}\|x-y_{i}\|^{2}\,\mathrm{d}\rho(x)&\leq\frac{1}{N}\sum_{i\in I_{\varepsilon}(Y)}\operatorname{diam}(P_{i}(Y))^{2}\\ &\leq\frac{1}{N}\operatorname{diam}(\Omega)\sum_{i\in I_{\varepsilon}(Y)}\operatorname{diam}(P_{i}(Y))\\ &\leq\frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d+1}\frac{\varepsilon^{1-d}}{N}.\end{aligned}

In conclusion, we obtain the desired estimate:

\mathrm{W}_{2}^{2}\left(\rho,\delta_{B_{N}(Y)}\right)\leq\frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d+1}\frac{\varepsilon^{1-d}}{N}+\operatorname{diam}(\Omega)^{2}\left(1-\frac{\operatorname{Card}I_{\varepsilon}(Y)}{N}\right).\qquad\qed

This theorem could be extended, mutatis mutandis, to the case where ρ is a general probability measure (i.e. not a density). However, this would entail some technical complications in the definition of the barycenters b_i, requiring a disintegration of ρ with respect to the transport plan π.

Consequence for Lloyd’s algorithm (14)

In the next corollary, we assume that the distance between any pair of distinct points in Y_N ∈ (ℝ^d)^N is bounded from below by ε_N ≥ C N^{-1/β}, implying that Card(I_{ε_N}(Y_N)) = N. This corresponds to the spacing one could expect for a point set uniformly sampled from a set with Minkowski dimension β. When β > d−1, the corollary asserts that one step of Lloyd's algorithm is enough to approximate ρ, in the sense that the uniform measure δ_{B_N(Y_N)} over the barycenters converges towards ρ as N → +∞.

Corollary 4 (Quantization by barycenters, asymptotic case).

Assume ε_N ≥ C·N^{-1/β} with C, β > 0. Then, with α = 1 − (d−1)/β,

\forall Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N,\varepsilon_{N}},\quad\mathrm{W}^{2}_{2}(\rho,\delta_{B_{N}(Y)})\leq\frac{C_{d,\Omega}}{C^{d-1}}N^{-\alpha}, (18)

and in particular, if β > d−1,

\lim_{N\to+\infty}\ \max_{Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N,\varepsilon_{N}}}\mathrm{W}^{2}_{2}(\rho,\delta_{B_{N}(Y)})=0. (19)
Remark 3 (Optimality of the exponent when β = d).

There is no reason to believe that the exponent in the upper bound (18) is optimal in general. However, it seems to be optimal in a "worst-case sense" when β = d. More precisely, we show the following result in Appendix E: for any δ ∈ (0,1) and for every N = n^d (with n ∈ ℕ) there exists a sequence of separable probability densities ρ_N over X = [−1,1]^d (ρ_N is a truncated Gaussian distribution, whose variance converges to zero slowly as N → +∞) such that, if Y_N is a uniform grid of size n × ⋯ × n = N in X, then

\mathrm{W}_{2}^{2}(\delta_{B_{N}(Y_{N})},\rho_{N})\geq CN^{-\frac{2-\delta}{d}},

where C is independent of N. On the other hand, in this setting every point in Y_N is at distance at least CN^{-1/d} from any other point in Y_N. Applying Corollary 4 with β = d, i.e. α = 1/d, we get

\mathrm{W}_{2}^{2}(\delta_{B_{N}(Y_{N})},\rho_{N})\leq C^{\prime}N^{-\frac{1}{d}}.

Comparing this upper bound on W_2^2(δ_{B_N(Y_N)}, ρ_N) with the lower bound above, one sees that it is not possible to improve the exponent.

Remark 4 (Optimality of (19)).

The assumption β > d−1 in (19) is tight: if ρ is the Lebesgue measure on [0,1]^d, it is possible to construct a point cloud Y_N with N points on the (d−1)-cube {1/2} × [0,1]^{d−1} such that distinct points in Y_N are at distance at least ε_N ≥ C·N^{-1/(d-1)}. Then, the barycenters B_N(Y_N) are also contained in this cube, so that W_2^2(ρ, δ_{B_N(Y_N)}) ≥ 1/12.

The next corollary is a probabilistic analogue of Corollary 4, assuming that the initial point cloud is drawn from a probability density σ on Ω. Note that σ can be distinct from ρ. The proof relies on McDiarmid's inequality to quantify the proportion of ε-isolated points in a point cloud drawn randomly and independently from σ; it is given in Appendix B.

Corollary 5 (Quantization by barycenters, probabilistic case).

Let σ ∈ L^∞(Ω) be a probability density on Ω and let X_1, …, X_N be i.i.d. random variables with distribution σ. Then, there exists a constant C > 0, depending only on ‖σ‖_{L^∞} and d, such that for N large enough,

\mathbb{P}\left(\mathrm{W}_{2}^{2}\left(\frac{1}{N}\sum_{i=1}^{N}\delta_{b_{i}^{X}},\rho\right)\lesssim N^{-\frac{1}{2d-1}}\right)\geq 1-\mathrm{e}^{-CN^{\frac{2d-3}{2d-1}}},

where b_i^X denotes the barycenter b_i(X) of the i-th Power cell associated with the cloud X = (X_1, …, X_N).
Figure 2: Optimal quantization of a Gaussian truncated to the unit square. On the left, the initial point cloud Y_N is drawn randomly and uniformly from [0,1]²; on the right, Y_N is a regular grid. The top row displays the point clouds obtained after one step of Lloyd's algorithm. The bottom row displays the quantization error F_N(B_N(Y_N)) after one step of Lloyd's algorithm as a function of the number of points. We get F_N(B_N(Y_N)) ≃ N^{-0.95} when Y_N is a random uniform point cloud in [0,1]² and F_N(B_N(Y_N)) ≃ N^{-0.8} when Y_N is a regular grid.

3 Gradient flow and a Polyak-Łojasiewicz-type inequality

Theorem 3 can be interpreted as a modified Polyak-Łojasiewicz-type (PŁ for short) inequality for the function F_N. The usual PŁ inequality for a differentiable function F : ℝ^D → ℝ is of the form

\forall Y\in\mathbb{R}^{D},\quad F(Y)-\min F\leq C\|\nabla F(Y)\|^{2},

where C is a positive constant. This inequality was originally used by Polyak to prove convergence of gradient descent towards the global minimum of F. Note in particular that such an inequality implies that any critical point of F is a global minimum of F. By Remark 2, F_N has critical points that are not minimizers, so that we cannot expect the standard PŁ inequality to hold. What we get is a similar inequality relating F_N(Y) and ‖∇F_N(Y)‖², but with a term involving the minimum distance between the points in place of min F_N.

Corollary 6 (Polyak-Łojasiewicz-type inequality).

Let Y ∈ (ℝ^d)^N \ 𝔻_{N,ε}. Then,

F_{N}(Y)-C_{d,\Omega}\frac{1}{N}\left(\frac{1}{\varepsilon}\right)^{d-1}\leq N\|\nabla F_{N}(Y)\|^{2}. (20)

We note that when ε ≃ (1/N)^{1/d}, the term \frac{1}{N}\left(\frac{1}{\varepsilon}\right)^{d-1} in (20) is of order (1/N)^{1/d}. On the other hand, as recalled in Remark 1, min F_N ≲ (1/N)^{2/d} when d > 2. Thus, we do not expect (20) to be tight.

Convergence of a discrete gradient flow

The modified Polyak-Łojasiewicz inequality (20) suggests that the discrete gradient flow (13) will bring us close to a point cloud with low Wasserstein distance to ρ, provided we can guarantee that the point clouds Y^k remain far from the generalized diagonal during the iterations. We prove in Lemma 3 in Appendix D that if Y^{k+1} = Y^k − τ_N N∇F_N(Y^k) and τ_N ∈ (0,1), then

\forall i\neq j,\quad\left\lVert y^{k+1}_{i}-y^{k+1}_{j}\right\rVert\geq(1-\tau_{N})\|y^{k}_{i}-y^{k}_{j}\|. (21)

We note that this inequality ensures that Y^k never touches the generalized diagonal 𝔻_N, so that the gradient ∇F_N(Y^k) is well-defined at each step. Combining this inequality with Theorem 3, one can actually prove that if the points in the initial cloud Y^0_N are not too close to each other, then a few steps of discrete gradient descent lead to a point cloud Y^k_N whose associated measure is close to the target ρ. Precisely, we arrive at the following theorem, proved in Appendix D.

Theorem 7.

Let 0 < α < \frac{1}{d-1} - \frac{1}{d}, let ε_N ≳ N^{-\frac{1}{d}-α}, and let Y^0_N ∈ Ω^N \ 𝔻_{N,ε_N}. Let (Y_N^k)_k be the iterates of (13) starting from Y_N^0 with time step 0 < τ_N < 1. We assume that lim_{N→∞} τ_N = 0 and we set

k_{N}=\left\lfloor\frac{1}{d\tau_{N}}\ln\left(F_{N}(Y_{N}^{0})\,N\varepsilon_{N}^{d-1}\right)\right\rfloor.

Then,

\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{k_{N}}}\right)=O_{N\to\infty}\left(\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)^{1-\frac{1}{d}}\cdot N^{-\frac{1}{d^{2}}+\alpha\left(1-\frac{1}{d}\right)}\right). (22)
Remark 5.

Note that the exponential behavior implied by (21) and Theorem 3 is coherent with the estimates known in the absolutely continuous setting for the continuous gradient flow. When transitioning from discrete measures to probability densities, lower bounds on the distance between points become upper bounds on the density. The gradient flow μ̇_t = −(1/2)∇_μ W_2^2(ρ, μ_t) has an explicit solution μ_t = σ_{1−e^{-t}}, where σ is a constant-speed geodesic in the Wasserstein space with σ_0 = μ_0 and σ_1 = ρ. In this case, a simple adaptation of the estimates in Theorem 2 of [17] shows the bound ‖μ_t‖_{L^∞} ≤ e^{td}‖μ_0‖_{L^∞}. Still in this absolutely continuous setting, it is possible to remove the exponential growth if the target density is also bounded, as a consequence of displacement convexity [13, Theorem 2.2]. There seems to be no discrete counterpart to this argument, which partly explains the discrepancy between the exponent of N in (22) and the one obtained in Corollary 4.

Figure 3: Optimal quantization of a density ρ corresponding to a gray-scale image (Wikimedia Commons, CC BY-SA 3.0). (Left) Point clouds obtained after one step of Lloyd's algorithm, starting from a regular grid of size N ∈ {3750, 7350, 15000, 43350}. (Right) Quantization error W_2^2(ρ, δ_{B_N}) as a function of the number of points N, showing that W_2^2(ρ, δ_{B_N}) ≃ N^{-1.00}.

4 Numerical results

In this section, we report some experimental results in dimension d = 2.

Gray-scale image

As mentioned in the introduction, uniform optimal quantization allows one to sparsely represent a gray-scale image by points, clustered more closely in areas where the image is darker [4, 3]. In Figure 3, we plotted the point clouds obtained after a single Lloyd step towards the density representing the image on the left (a puffin), starting from regular grids. The observed rate of convergence, N^{-1.00}, is coherent with the theoretical estimate log(N)/N of Remark 1.

Gaussian density with small variance

We now consider a toy model where we approximate a Gaussian density truncated to the unit square Ω = [0,1]², ρ(x,y) = \frac{1}{Z}e^{-8((x-\frac{1}{2})^{2}+(y-\frac{1}{2})^{2})}, where Z is a normalization constant. In the left column of Figure 2, the initial point clouds Y_N^0 are randomly distributed in [0,1]². The three point clouds represented in the top row are obtained after one step of Lloyd's algorithm (14). The red curve displays, in log-log scale, the mean values of F_N(B_N(Y_N)) over a hundred random point clouds, for N ∈ {400, 961, 1600, 2500}. In this case, we observe a decrease rate N^{-0.95} with respect to the number of points, similar to the case of the gray-scale picture.

However, an interesting phenomenon occurs when the initial point cloud Y_N^0 is an axis-aligned grid. The pictures in the right column of Figure 2 were computed starting from such a grid with N ∈ {400, 961, 1600, 2500} points. As in the randomly initialized case, we represented the values of F_N(B_N(Y_N)) in log-log scale. The corresponding discrete probability measure δ_{B_N(Y_N)} seems to converge to ρ as N → ∞, but with a much worse rate for these "low" values of N: F_N(B_N(Y_N)) ≃ N^{-0.8}. In this specific setting, with a separable density and an axis-aligned grid Y^0, the Power cells are rectangles and a single Lloyd step brings us to a critical point of F_N. Thanks to this remark, it is possible to estimate the approximation error from the one-dimensional case. In fact, Appendix E shows that for any δ ∈ (0,1), there exist variances σ_N = σ_N(δ) such that the approximation error W_2^2(ρ_{σ_N}, δ_{B_N}) is of order N^{-\frac{2-\delta}{2}}. On the other hand, for a fixed σ, the approximation error is of order N^{-1}, to be compared with the bound log(N)/N for general measures.
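A minimal sketch of this experiment is given below; it assumes the helpers solve_dual, power_cell_assignment and lloyd_step_power from the sketches of Section 2 are in scope, and the grid resolution, number of points and iteration parameters are illustrative rather than those used for Figure 2.

```python
# Minimal sketch: one Lloyd step (14) towards the truncated Gaussian on [0,1]^2,
# starting from a regular grid of N = n^2 points. Illustrative code.
import numpy as np

m = 200                                     # resolution of the discretization of rho
xs = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(xs, xs)
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
rho_w = np.exp(-8 * ((grid[:, 0] - 0.5) ** 2 + (grid[:, 1] - 0.5) ** 2))
rho_w /= rho_w.sum()

n = 20                                      # initial cloud Y^0: regular n x n grid
ys = (np.arange(n) + 0.5) / n
Y0 = np.stack(np.meshgrid(ys, ys), axis=-1).reshape(-1, 2)

phi = solve_dual(grid, rho_w, Y0)           # equal-mass power cells, condition (10)
Y1, _ = lloyd_step_power(grid, rho_w, Y0, phi)   # one step of Lloyd's algorithm
```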

5 Discussion

We have studied the problem of minimizing the Wasserstein distance between a fixed probability measure ρ and a uniform measure δ_Y over N points, parametrized by the positions of the points Y = (y_1, …, y_N). The main difficulty is the nonconvexity of the Wasserstein distance F_N : Y ∈ (ℝ^d)^N ↦ (1/2)W_2^2(ρ, δ_Y), which we tackled by means of a modified Polyak-Łojasiewicz inequality (20). One limitation of our work is that the term replacing min F_N in the Polyak-Łojasiewicz inequality (20) does not match the theoretical bounds recalled in Remark 1. Future work will concentrate on bridging that gap, but also on deriving consequences for the algorithmic resolution of Wasserstein regression problems min_θ W_2^2(ρ, T_{θ#}μ), starting with the case where θ ↦ T_θ is linear.

Acknowledgments and Disclosure of Funding

This work was supported by a grant from the French ANR (MAGA, ANR-16-CE40-0014).

References

  • [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [3] Michael Balzer, Thomas Schlömer, and Oliver Deussen. Capacity-constrained point distributions: a variant of Lloyd's method. ACM Transactions on Graphics (TOG), 28(3):1–8, 2009.
  • [4] Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Desbrun. Blue noise through optimal transport. ACM Transactions on Graphics (TOG), 31(6):1–11, 2012.
  • [5] Qiang Du, Maria Emelianenko, and Lili Ju. Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations. SIAM Journal on Numerical Analysis, 44(1):102–119, 2006.
  • [6] Bjorn Engquist, Brittany D Froese, and Yunan Yang. Optimal transport for seismic full waveform inversion. Communications in Mathematical Sciences, 14(8):2309–2330, 2016.
  • [7] Jean Feydy, Benjamin Charlier, François-Xavier Vialard, and Gabriel Peyré. Optimal transport for diffeomorphic registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 291–299. Springer, 2017.
  • [8] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617. PMLR, 2018.
  • [9] Fernando de Goes, Pooran Memari, Patrick Mullen, and Mathieu Desbrun. Weighted triangulations for geometry processing. ACM Transactions on Graphics (TOG), 33(3):1–13, 2014.
  • [10] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
  • [11] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semi-discrete optimal transport. Journal of the European Mathematical Society, 21(9):2603–2651, 2019.
  • [12] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [13] Robert J McCann. A convexity principle for interacting gases. Advances in mathematics, 128(1):153–179, 1997.
  • [14] Quentin Mérigot and Jean-Marie Mirebeau. Minimal geodesics along volume-preserving maps, through semidiscrete optimal transport. SIAM Journal on Numerical Analysis, 54(6):3465–3492, 2016.
  • [15] Gilles Pagès. Introduction to vector quantization and its applications for numerics. ESAIM: proceedings and surveys, 48:29–79, 2015.
  • [16] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [17] Filippo Santambrogio. Absolute continuity and summability of transport densities: simpler proofs and new estimates. Calculus of variations and partial differential equations, 36(3):343–354, 2009.
  • [18] Filippo Santambrogio. Optimal transport for applied mathematicians, volume 55. Springer, 2015.
  • [19] Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • [20] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Appendix A Proof of Proposition 2

Given Y = (y_1, …, y_N) ∈ (ℝ^d)^N \ 𝔻_N, one has for any i ∈ {1, …, N},

\begin{aligned}\int_{P_{i}(Y)}\left\lVert x-y_{i}\right\rVert^{2}\,\mathrm{d}\rho(x)&=\int_{P_{i}(Y)}\left\lVert x-b_{i}(Y)+b_{i}(Y)-y_{i}\right\rVert^{2}\,\mathrm{d}\rho(x)\\ &=\int_{P_{i}(Y)}\left\lVert x-b_{i}(Y)\right\rVert^{2}\,\mathrm{d}\rho(x)+\frac{1}{N}\|b_{i}(Y)-y_{i}\|^{2}.\end{aligned}

Summing these equalities over i and remarking that the map T_Y defined by T_Y|_{P_i(Y)} = y_i is an optimal transport map between ρ and δ_Y, we get

\begin{aligned}\frac{1}{N}\|B_{N}(Y)-Y\|^{2}&=\mathrm{W}_{2}^{2}(\rho,\delta_{Y})-\sum_{i}\int_{P_{i}(Y)}\left\lVert x-b_{i}(Y)\right\rVert^{2}\,\mathrm{d}\rho(x)\\ &\leq\mathrm{W}_{2}^{2}(\rho,\delta_{Y})-\mathrm{W}_{2}^{2}(\rho,\delta_{B_{N}(Y)}).\end{aligned}

Thus, with Y^{k+1} = B_N(Y^k), we have

N\|\nabla F_{N}(Y^{k})\|^{2}=\frac{1}{N}\|Y^{k+1}-Y^{k}\|^{2}\leq 2\left(F_{N}(Y^{k})-F_{N}(Y^{k+1})\right).

This implies that the values F_N(Y^k) are decreasing in k and, since they are bounded from below, that ‖∇F_N(Y^k)‖ → 0, since Σ_k ‖∇F_N(Y^k)‖² < +∞. The sequence (Y^k)_k is easily seen to be bounded, since F_N(Y^k) is bounded, which implies a bound on the second moment of δ_{Y^k}.

For fixed N, since all atoms of δ_{Y^k} have mass 1/N, this implies that all points y_i^k belong to the same fixed compact ball. If ρ itself is compactly supported, we can also prove that all points Y^{k+1} = B_N(Y^k) are contained in a compact subset of (ℝ^d)^N \ 𝔻_N, which amounts to obtaining a lower bound on the distances |b_i(Y) − b_j(Y)| for arbitrary Y. This lower bound can be obtained in the following way: since ρ is absolutely continuous, it is uniformly integrable, which means that for every ε > 0 there is δ = δ(ε) > 0 such that for any set A with Lebesgue measure |A| < δ we have ρ(A) < ε. We claim that |b_i(Y) − b_j(Y)| ≥ r := (2R)^{1−d} δ(1/(2N)), where R is such that ρ is supported in a ball B_R of radius R. Indeed, it is enough to prove that every barycenter b_i(Y) is at distance at least r/2 from each face of the convex polytope P_i(Y). Consider a face of such a polytope and suppose, for simplicity, that it lies on the hyperplane {x_d = 0} with the cell contained in {x_d ≥ 0}. Let s be such that ρ(P_i(Y) ∩ {x_d > s}) = ρ(P_i(Y) ∩ {x_d < s}) = 1/(2N). Then, since the diameter of P_i(Y) ∩ B_R is smaller than 2R, the Lebesgue measure of P_i(Y) ∩ {x_d < s} is bounded by (2R)^{d−1}s, which gives s ≥ r by the definition of r. Since at least half of the mass (according to ρ) of the cell P_i(Y) lies above the level x_d = s, the x_d-coordinate of the barycenter is at least s/2 ≥ r/2. This shows that the barycenter lies at distance at least r/2 from each of the faces of its cell.

As a consequence, the iterates Y^k of Lloyd's algorithm lie in a compact subset of (ℝ^d)^N \ 𝔻_N, on which F_N is 𝒞¹. This implies that any limit point must be a critical point.

We do not discuss here whether the whole sequence converges, which seems to be a delicate matter even for fixed N. It is anyway possible to prove (though we do not develop the details here) that the set of limit points is a closed connected subset of (ℝ^d)^N with empty interior, composed of critical points of F_N all lying on the same level set of F_N.

Appendix B Proof of Corollary 5

Given Y = (y_1, …, y_N) ∈ (ℝ^d)^N, we denote

I_{\varepsilon}(Y)=\{i\in\{1,\ldots,N\}\mid\forall j\neq i,\ \|y_{i}-y_{j}\|\geq\varepsilon\}.

We call points y_i such that i ∈ I_ε(Y) ε-isolated, and points y_i such that i ∉ I_ε(Y) ε-connected. In the following, κ(Y) = Card(I_ε(Y))/N denotes the proportion of ε-isolated points of the cloud Y.

Lemma 1.

Let X_1, …, X_N be independent, ℝ^d-valued random variables. Then, there is a constant C_d > 0 such that

\mathbb{P}\left(\left\lvert\kappa(X_{1},\ldots,X_{N})-\mathbb{E}(\kappa)\right\rvert\geq\eta\right)\leq\mathrm{e}^{-N\eta^{2}/C_{d}}.
Proof.

This lemma is a consequence of McDiarmid's inequality. To apply this inequality, we need to evaluate the amplitude of variation of the function κ when one of the points x_i changes. Denote c_d the maximal cardinality of a subset S of the ball B(0, ε) such that the distance between any two distinct points of S is at least ε. By a scaling argument, one can check that c_d does not, in fact, depend on ε. To evaluate

\left\lvert\kappa(x_{1},\ldots,x_{i},\ldots,x_{N})-\kappa(x_{1},\ldots,\tilde{x}_{i},\ldots,x_{N})\right\rvert,

we first note that at most c_d points may become ε-isolated when removing x_i. To prove this, we remark that if a point x_j becomes ε-isolated when x_i is removed, this means that ‖x_i − x_j‖ ≤ ε and ‖x_j − x_k‖ > ε for all k ∉ {i, j}. The number of such j is bounded by c_d. Symmetrically, at most c_d points may become ε-connected under the addition of x̃_i. Finally, the point x_i itself may change status from ε-isolated to ε-connected or conversely. To summarize, we obtain that with C_d = 2c_d + 1,

\left\lvert\kappa(x_{1},\ldots,x_{i},\ldots,x_{N})-\kappa(x_{1},\ldots,\tilde{x}_{i},\ldots,x_{N})\right\rvert\leq\frac{C_{d}}{N}.

The conclusion then directly follows from McDiarmid’s inequality. ∎

Lemma 2.

Let σ ∈ L^∞(ℝ^d) be a probability density and let X_1, …, X_N be i.i.d. random variables with distribution σ. Then,

\mathbb{E}(\kappa(X_{1},\ldots,X_{N}))\geq(1-\|{\sigma}\|_{\mathrm{L}^{\infty}}\omega_{d}\varepsilon^{d})^{N-1}.
Proof.

For a given index j ≠ i, the probability that the point X_i belongs to the ball B(X_j, ε) can be bounded from above by σ(B(X_j, ε)) ≤ ‖σ‖_{L^∞} ω_d ε^d, where ω_d is the volume of the d-dimensional unit ball. Thus, the probability that X_i is ε-isolated is at least

(1-\|{\sigma}\|_{\mathrm{L}^{\infty}}\omega_{d}\varepsilon^{d})^{N-1}.

We conclude by noting that

\mathbb{E}(\kappa(X_{1},\ldots,X_{N}))=\frac{1}{N}\sum_{1\leq i\leq N}\mathbb{P}(X_{i}\hbox{ is $\varepsilon$-isolated}).\qquad\qed
Proof of Corollary 5.

We apply Lemma 2 with ε_N = N^{-1/β} and β = d − 1/2. The expectation of κ(X_1, …, X_N) is bounded from below by

\mathbb{E}(\kappa(X_{1},\dots,X_{N}))\geq\left(1-N^{-\frac{d}{\beta}}\|{\sigma}\|_{\mathrm{L}^{\infty}}\omega_{d}\right)^{N-1}\geq 1-CN^{1-\frac{d}{\beta}}

for large N, since β < d. By Lemma 1, for any η > 0,

\mathbb{P}\left(\kappa(X_{1},\ldots,X_{N})\geq 1-CN^{1-\frac{d}{\beta}}-\eta\right)\geq 1-\mathrm{e}^{-KN\eta^{2}},

for constants C, K > 0 depending only on ‖σ‖_{L^∞} and d. We choose η = N^{-\frac{1}{2d-1}}, so that η is of the same order as N^{1-\frac{d}{\beta}}, since 1 − d/β = −1/(2d−1). Thus, for a slightly different constant C,

\mathbb{P}(\kappa(X_{1},\ldots,X_{N})\geq 1-C\eta)\geq 1-\mathrm{e}^{-KN\eta^{2}}.

Now, for any realization ω such that

\kappa(X_{1}(\omega),\ldots,X_{N}(\omega))\geq 1-C\eta,

Theorem 3 yields

\mathrm{W}_{2}^{2}\left(\delta_{B_{N}(X(\omega))},\rho\right)\lesssim\frac{N^{\frac{d-1}{\beta}}}{N}+\eta\lesssim N^{-\frac{1}{2d-1}},

and such a configuration occurs with probability at least

1-\mathrm{e}^{-KN\eta^{2}}=1-\mathrm{e}^{-KN^{\frac{2d-3}{2d-1}}}.\qquad\qed

Appendix C Proof of Corollary 6

We first note that by Proposition 1, we have ‖∇F_N(Y)‖² = (1/N²)‖B_N(Y) − Y‖². We then use W_2^2(δ_{B_N(Y)}, δ_Y) ≤ (1/N)‖B_N(Y) − Y‖² together with the triangle inequality, which give

\mathrm{W}_{2}^{2}(\rho,\delta_{Y})\leq 2\,\mathrm{W}_{2}^{2}(\rho,\delta_{B_{N}(Y)})+2N\|\nabla F_{N}(Y)\|^{2}.

Thus, using Theorem 3 (with Card(I_ε(Y)) = N, since Y ∉ 𝔻_{N,ε}) to bound W_2^2(ρ, δ_{B_N(Y)}) from above, we get the desired result.

Appendix D Proof of Theorem 7

Lemma 3.

Let Y^0 ∈ (ℝ^d)^N \ 𝔻_{N,ε_N} for some ε_N > 0. Then the iterates (Y^k)_{k≥0} of (13) satisfy, for every k ≥ 0 and every i ≠ j,

\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert\geq(1-\tau_{N})^{k}\varepsilon_{N}. (23)
Proof.

We consider the distance between two trajectories after kk iterations: ek=yikyjk.e_{k}=\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert. Assuming that ek>0e_{k}>0, the convexity of the norm immediately gives us:

ek+1ek\displaystyle e_{k+1}-e_{k}\geq (yikyjkyikyjk)(yik+1yjk+1(yikyjk))\displaystyle\left(\frac{y_{i}^{k}-y_{j}^{k}}{\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert}\right)\cdot\left(y_{i}^{k+1}-y_{j}^{k+1}-\left(y_{i}^{k}-y_{j}^{k}\right)\right)
=\displaystyle= τN(yikyjkyikyjk)(bikbjk)τNyikyjk\displaystyle\tau_{N}\left(\frac{y_{i}^{k}-y_{j}^{k}}{\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert}\right)\cdot\left(b_{i}^{k}-b_{j}^{k}\right)-\tau_{N}\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert

where we denoted bik:=bi(YNk)b_{i}^{k}:=b_{i}(Y_{N}^{k}) the barycenter of the iith Power cell Pi(YNk)P_{i}(Y_{N}^{k}) in the tesselation associated with the point cloud YNkY_{N}^{k}. Since each barycenter bikb_{i}^{k} lies in its corresponding Power cell, the scalar product (yikyjk)(bikbjk)\left(y_{i}^{k}-y_{j}^{k}\right)\cdot\left(b_{i}^{k}-b_{j}^{k}\right) is non-negative: Indeed, for any iji\neq j,

yikbik2yjkbik2ϕikϕjk\left\lVert y_{i}^{k}-b_{i}^{k}\right\rVert^{2}-\left\lVert y_{j}^{k}-b_{i}^{k}\right\rVert^{2}\leq\phi_{i}^{k}-\phi_{j}^{k}

Summing this inequality with the same inequality with the roles of ii and jj reversed, we obtain:

\left(y_{i}^{k}-y_{j}^{k}\right)\cdot\left(b_{i}^{k}-b_{j}^{k}\right)\geq 0

thus giving us the geometric inequality $e_{k+1}\geq(1-\tau_{N})e_{k}$. Since $Y_{N}^{0}$ was chosen in $\Omega^{N}\setminus\mathbb{D}_{N,\varepsilon_{N}}$, this yields $e_{k}\geq(1-\tau_{N})^{k}e_{0}\geq(1-\tau_{N})^{k}\varepsilon_{N}$, which is inequality (23). ∎
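This contraction can be observed numerically in dimension one, where (as recalled in Appendix E) the Power cells of an ordered cloud are the quantile intervals of $\rho$, so that one step of (13) reads $y\leftarrow(1-\tau_{N})y+\tau_{N}b$ with $b_{i}$ the barycenter of the $i$th quantile cell. The following sketch is ours, with $\rho$ uniform on $[0,1]$, for which these barycenters are the midpoints $(i+\tfrac12)/N$; it checks inequality (23) along the iterations.

import numpy as np

N, tau, n_steps = 50, 0.3, 40
rng = np.random.default_rng(1)
y = np.sort(rng.random(N))               # ordered initial cloud in [0, 1]
eps0 = np.diff(y).min()                  # separation eps_N of the initial cloud
b = (np.arange(N) + 0.5) / N             # barycenters of the quantile cells of U([0,1])
for k in range(1, n_steps + 1):
    y = (1 - tau) * y + tau * b          # one step of the iteration (13)
    assert np.diff(y).min() >= (1 - tau) ** k * eps0 - 1e-12
print("separation bound (23) verified for", n_steps, "steps")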

Lemma 4.

For any $k\geq 0$,

F_{N}(Y_{N}^{k})\leq F_{N}(Y_{N}^{0})\eta_{N}^{k}+2C_{d,\Omega}(1-\eta_{N})\frac{\varepsilon_{N}^{1-d}}{N}\frac{A_{N}^{k}-\eta_{N}^{k}}{A_{N}-\eta_{N}}, (24)

where we denote $\eta_{N}=1-\frac{\tau_{N}}{2}(2-\tau_{N})$ and $A_{N}=(1-\tau_{N})^{1-d}$.

Proof.

This is obtained in a very similar fashion to 3. For any $k\geq 0$, the semi-concavity of $F_{N}$ yields the inequality:

F_{N}(Y_{N}^{k+1})-\frac{\left\lVert Y_{N}^{k+1}\right\rVert^{2}}{2N}-\left(F_{N}(Y_{N}^{k})-\frac{\left\lVert Y_{N}^{k}\right\rVert^{2}}{2N}\right)\leq\left(-\frac{B_{N}^{k}}{N}\right)\cdot\left(Y_{N}^{k+1}-Y_{N}^{k}\right)

with $B_{N}^{k}:=B_{N}(Y_{N}^{k})$, in accordance with the previous proof.

Rearranging the terms and using $Y_{N}^{k+1}-Y_{N}^{k}=\tau_{N}(B_{N}^{k}-Y_{N}^{k})$,

\begin{split}F_{N}(Y_{N}^{k+1})-F_{N}(Y_{N}^{k})\leq&-\tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\frac{\left\lVert B_{N}^{k}-Y_{N}^{k}\right\rVert^{2}}{N}\\ \leq&-\tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\mathrm{W}_{2}^{2}(\delta_{B_{N}^{k}},\delta_{Y_{N}^{k}})\\ \leq&\ \tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\left(-\frac{1}{2}\mathrm{W}_{2}^{2}(\delta_{Y_{N}^{k}},\rho)+\mathrm{W}_{2}^{2}(\rho,\delta_{B_{N}^{k}})\right)\end{split}

where the second inequality uses $\mathrm{W}_{2}^{2}(\delta_{B_{N}^{k}},\delta_{Y_{N}^{k}})\leq\frac{1}{N}\left\lVert B_{N}^{k}-Y_{N}^{k}\right\rVert^{2}$, and the last one follows by applying the triangle inequality to $\mathrm{W}_{2}(\delta_{B_{N}^{k}},\delta_{Y_{N}^{k}})$ followed by the Cauchy–Schwarz inequality. Using 3 together with (23), this yields:

\begin{split}F_{N}(Y_{N}^{k+1})\leq&\left(1-\frac{\tau_{N}}{2}(2-\tau_{N})\right)F_{N}(Y_{N}^{k})+2C_{d,\Omega}\tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\frac{\varepsilon_{N}^{1-d}}{N}(1-\tau_{N})^{k(1-d)}\\ \leq&\ \eta_{N}F_{N}(Y_{N}^{k})+2C_{d,\Omega}(1-\eta_{N})\frac{\varepsilon_{N}^{1-d}}{N}A_{N}^{k}.\end{split}

and we simply iterate over $k$ to end up with the bound claimed in 4. ∎
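As a sanity check of this last step, the short sketch below (ours, with arbitrary values of the constants) verifies that iterating the one-step inequality $F_{k+1}\leq\eta F_{k}+c\,A^{k}$, with $c=2C_{d,\Omega}(1-\eta_{N})\varepsilon_{N}^{1-d}/N$, indeed produces the closed form (24).

def closed_form(F0, eta, c, A, k):
    # right-hand side of (24): F0 * eta^k + c * (A^k - eta^k) / (A - eta)
    return F0 * eta ** k + c * (A ** k - eta ** k) / (A - eta)

F0, eta, c, A = 1.0, 0.9, 0.05, 1.2      # arbitrary values with 0 < eta < 1 < A
F = F0
for k in range(1, 30):
    F = eta * F + c * A ** (k - 1)       # equality case of the one-step inequality
    assert abs(F - closed_form(F0, eta, c, A, k)) < 1e-9
print("iterating the one-step bound reproduces (24)")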

Proof of 7.

To conclude, we simply make first-order expansions of the terms in (24). The definition of $k_{N}$ in 7, although convoluted, was chosen so that both terms in the right-hand side of this inequality, $F_{N}(Y_{N}^{0})\eta_{N}^{k_{N}}$ and $(1-\eta_{N})\frac{\varepsilon_{N}^{1-d}}{N}\frac{A_{N}^{k_{N}}-\eta_{N}^{k_{N}}}{A_{N}-\eta_{N}}$, have the same asymptotic decay to $0$ (as $N\to+\infty$). With the notations of the previous proposition, we have for fixed $N$:

\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{k_{N}}}\right)\leq\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)\eta_{N}^{k_{N}}+2C_{d,\Omega}\frac{(1-\eta_{N})}{A-\eta_{N}}\frac{A^{k_{N}}-\eta_{N}^{k_{N}}}{N\varepsilon_{N}^{d-1}} (25)

We make use here of the notation from Section 3:

T_{N}=k_{N}\tau_{N}=\left\lfloor\frac{1}{d}\ln(F_{N}(Y_{N}^{0})N\varepsilon_{N}^{d-1})\right\rfloor

to lighten this expression a bit, and, because of the assumption $\lim_{N\to\infty}\tau_{N}=0$, we may write:

\frac{A^{k_{N}}-\eta^{k_{N}}}{N\varepsilon_{N}^{d-1}}=\frac{\mathrm{e}^{(d-1)T_{N}}}{N\varepsilon_{N}^{d-1}}+o_{N\to\infty}\left(\frac{T_{N}}{(N\varepsilon_{N}^{d-1})^{\frac{1}{d}}}\right)

as well as $\eta^{k_{N}}=\mathrm{e}^{-T_{N}}+o_{N\to\infty}\left(\frac{T_{N}}{(N\varepsilon_{N}^{d-1})^{\frac{1}{d}}}\right)$, and, substituting $T_{N}$,

\begin{split}\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{k_{N}}}\right)\lesssim&\frac{\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)^{\frac{d-1}{d}}}{\left(N\varepsilon_{N}^{d-1}\right)^{\frac{1}{d}}}+o_{N\to\infty}\left(\frac{T_{N}}{(N\varepsilon_{N}^{d-1})^{\frac{1}{d}}}\right)\\ \lesssim&\ \mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)^{1-\frac{1}{d}}N^{-\frac{1}{d^{2}}+\alpha\left(1-\frac{1}{d}\right)}\qed\end{split}
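The two expansions used above can be checked numerically: with $k=T/\tau$ steps, $A^{k}=(1-\tau)^{k(1-d)}\to\mathrm{e}^{(d-1)T}$ and $\eta^{k}\to\mathrm{e}^{-T}$ as $\tau\to 0$. A tiny sketch of ours, with arbitrary values of $d$ and $T$:

import math

d, T = 3, 2.0
for tau in [1e-1, 1e-2, 1e-3]:
    k = int(round(T / tau))
    A = (1 - tau) ** (1 - d)                 # A_N for step size tau
    eta = 1 - tau / 2 * (2 - tau)            # eta_N for step size tau
    # the second and fourth columns converge to the third and fifth as tau decreases
    print(tau, A ** k, math.exp((d - 1) * T), eta ** k, math.exp(-T))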

Appendix E Case of a low variance Gaussian in Section 4

Here, we consider $\rho_{\sigma}$, the probability measure obtained by truncating to the segment $[-1,1]$ and renormalizing a centered normal distribution with standard deviation $\sigma$. We first show that for any $N\in\mathbb{N}$ and $\delta\in(0,1)$, we can find a small $\sigma_{N,\delta}$ such that the squared Wasserstein distance between $\rho_{\sigma_{N,\delta}}$ and its best $N$-point approximation is at least $CN^{-(2-\delta)}$.

Proposition 8.

For any $\sigma>0$, consider $\rho_{\sigma}\stackrel{\text{def}}{=}m_{\sigma}\mathrm{e}^{-\frac{\left\lvert x\right\rvert^{2}}{2\sigma^{2}}}\mathds{1}_{[-1;1]}dx$ the truncated centered Gaussian density, where $m_{\sigma}$ is taken so that $\rho_{\sigma}$ has unit mass. Then, for every $\delta\in(0,1)$, there exists a constant $C>0$ and a sequence of parameters $(\sigma_{N})_{N\in\mathbb{N}}$ such that

\forall Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N},\quad\mathrm{W}_{2}^{2}\left(\delta_{B_{N}(Y)},\rho_{\sigma_{N}}\right)\geq CN^{-(2-\delta)}

From the proof, one can see that the dependence of $\sigma_{N}$ on $N$ is logarithmic: indeed, $\sigma_{N}$ is of order $(\log N)^{-1/2}$.

Proof.

We denote by $g:x\in\mathbb{R}\mapsto\frac{1}{\sqrt{2\pi}}\mathrm{e}^{-\frac{\left\lvert x\right\rvert^{2}}{2}}$ the density of the standard centered Gaussian distribution and by $F_{g}$ its cumulative distribution function, so that

m_{\sigma}^{-1}=\int_{-1}^{1}\mathrm{e}^{-\frac{\left\lvert x\right\rvert^{2}}{2\sigma^{2}}}dx=\sigma\sqrt{2\pi}\int_{-1/\sigma}^{1/\sigma}g(y)dy=\sqrt{2\pi}\sigma(F_{g}(1/\sigma)-F_{g}(-1/\sigma)) (26)

Note that, whenever $\sigma\to 0$, we have $\sqrt{2\pi}\,\sigma m_{\sigma}\to 1$. We denote by $F_{\sigma}:[-1,1]\to[0,1]$ the cumulative distribution function of $\rho_{\sigma}$. Given any point cloud $Y=(y_{1},\ldots,y_{N})$ such that $y_{1}\leq\ldots\leq y_{N}$, the Power cell $P_{i}(Y)$ is simply the segment

P_{i}(Y)=[F_{\sigma}^{-1}(i/N),F_{\sigma}^{-1}((i+1)/N)],\qquad i=0,\ldots,N-1.

Since these segments do not depend on $Y$, we will denote them $(P_{i})_{0\leq i\leq N-1}$. Finally, defining $b_{i}=N\int_{P_{i}}x\,\mathrm{d}\rho_{\sigma}(x)$ as the barycenter of the $i$th Power cell and $\delta_{B}=\frac{1}{N}\sum_{i}\delta_{b_{i}}$, we have

\begin{split}\mathrm{W}_{2}^{2}(\delta_{B},\rho_{\sigma})=&\sum_{i=0}^{N-1}\int_{P_{i}}(x-b_{i})^{2}\mathrm{d}\rho_{\sigma}(x)\\ \geq&\ \rho_{\sigma}(-1)\sum_{i=0}^{N-1}\int_{P_{i}}(x-b_{i})^{2}\mathrm{d}x\\ \geq&\ C\rho_{\sigma}(-1)\sum_{i=0}^{N-1}(F_{\sigma}^{-1}((i+1)/N)-F_{\sigma}^{-1}(i/N))^{3},\end{split} (27)

where we used that $\rho_{\sigma}$ attains its minimum at $\pm 1$ to get the first inequality. We now wish to provide an approximation for $F_{\sigma}^{-1}(t)$, $t\in[0,1]$. We first note, using Taylor’s formula, that we have

\begin{split}F_{\sigma}^{-1}(t)=&\ \sigma F_{g}^{-1}\left(F_{g}\left(\frac{-1}{\sigma}\right)+t\left[F_{g}\left(\frac{1}{\sigma}\right)-F_{g}\left(\frac{-1}{\sigma}\right)\right]\right)\\ =&\ \sigma F_{g}^{-1}\left(F_{g}\left(\frac{-1}{\sigma}\right)+\frac{t}{\sqrt{2\pi}\sigma m_{\sigma}}\right)\\ =&\ -1+\sigma(F_{g}^{-1})^{\prime}\left(F_{g}\left(\frac{-1}{\sigma}\right)\right)\frac{t}{\sqrt{2\pi}\sigma m_{\sigma}}+\frac{\sigma}{2}(F_{g}^{-1})^{\prime\prime}(s)\frac{t^{2}}{2\pi\sigma^{2}m_{\sigma}^{2}}\end{split}

for some $s\in[F_{g}(-\frac{1}{\sigma}),F_{g}(-\frac{1}{\sigma})+t(F_{g}(\frac{1}{\sigma})-F_{g}(-\frac{1}{\sigma}))]$. But,

(F_{g}^{-1})^{\prime}(t)=\frac{1}{g\circ F_{g}^{-1}(t)}=\sqrt{2\pi}\mathrm{e}^{\frac{\left\lvert F_{g}^{-1}(t)\right\rvert^{2}}{2}},
(F_{g}^{-1})^{\prime\prime}(t)=-\frac{g^{\prime}\circ F_{g}^{-1}(t)}{\left(g\circ F_{g}^{-1}(t)\right)^{3}}=2\pi F_{g}^{-1}(t)\mathrm{e}^{\left\lvert F_{g}^{-1}(t)\right\rvert^{2}},

and we see that

\left\lvert F_{\sigma}^{-1}(t)-\left(-1+\frac{t}{m_{\sigma}}\mathrm{e}^{\frac{1}{2\sigma^{2}}}\right)\right\rvert\leq\mathrm{e}^{\frac{1}{\sigma^{2}}}\frac{t^{2}}{2\sigma^{2}m_{\sigma}^{2}}.

Therefore, if we denote by $\varepsilon(\sigma,t)$ the second-order error in the above formula, i.e. $\varepsilon(\sigma,t)=\mathrm{e}^{\frac{1}{\sigma^{2}}}\frac{t^{2}}{2\sigma^{2}m_{\sigma}^{2}}$, the size of the first Power cell $P_{0}(Y)$ is of order:

F_{\sigma}^{-1}(1/N)-F_{\sigma}^{-1}(0)=\frac{1}{Nm_{\sigma}}\mathrm{e}^{\frac{1}{2\sigma^{2}}}+O\left(\varepsilon\left(\sigma,\frac{1}{N}\right)\right).

We will choose $\sigma_{N}$ depending on $N$ in order for the first term in the right-hand side to dominate the second one:

\varepsilon\left(\sigma_{N},\frac{1}{N}\right)=o\left(\frac{1}{Nm_{\sigma_{N}}}\mathrm{e}^{\frac{1}{2\sigma_{N}^{2}}}\right). (28)

In this way, we have

\begin{split}(F_{\sigma}^{-1}(1/N)-F_{\sigma}^{-1}(0))^{3}\rho_{\sigma}(-1)\geq&\ c\frac{1}{N^{3}m_{\sigma}^{3}}\mathrm{e}^{\frac{3}{2\sigma^{2}}}m_{\sigma}\mathrm{e}^{-\frac{1}{2\sigma^{2}}}\\ =&\ c\frac{1}{N^{3}m_{\sigma}^{2}}\mathrm{e}^{\frac{1}{\sigma^{2}}}.\end{split} (29)

We now choose $\sigma=\sigma_{N}$ such that $\mathrm{e}^{\frac{1}{2\sigma^{2}}}=N^{\alpha}$ for an exponent $\alpha$ to be chosen. We need $\alpha>0$ so that $\sigma_{N}\to 0$. This last condition and (26) imply that $m_{\sigma_{N}}$ is of order $\sqrt{\log N}$. This means that the condition (28) is satisfied if $\alpha<1$ and $N$ is large enough.

The sum in (27) is bounded from below by its first term, estimated in (29), and we get

\mathrm{W}_{2}^{2}(\delta_{B},\rho_{\sigma})\geq c\frac{1}{N^{3}m_{\sigma_{N}}^{2}}\mathrm{e}^{\frac{1}{\sigma_{N}^{2}}}\geq C\left(\frac{N^{2\alpha-3}}{\ln(N)}\right)

for some constant $C>0$, since $m_{\sigma_{N}}^{2}$ is of order $\ln N$. Finally, if we want this last expression to be larger than $N^{-(2-\delta)}$, we can take for instance $2\alpha>1+\delta$ (which is compatible with $\alpha<1$ since $\delta<1$) and $N$ large enough. ∎
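To illustrate Proposition 8 numerically, the following sketch (ours) evaluates the lower bound used in the proof, namely $\rho_{\sigma}(-1)\sum_{i}|P_{i}|^{3}/12$ from (27), with the quantile cells computed from the Gaussian quantile function and with $\sigma_{N}$ chosen so that $\mathrm{e}^{\frac{1}{2\sigma_{N}^{2}}}=N^{\alpha}$. The helper names and the use of scipy.stats.norm are our own choices; the printed lower bound decays markedly more slowly than the generic rate $N^{-2}$, in line with the proposition.

import math
from scipy.stats import norm

def lower_bound_w2_sq(N, sigma):
    # rho_sigma(-1) * sum_i |P_i|^3 / 12, the lower bound appearing in (27)
    lo, hi = norm.cdf(-1.0 / sigma), norm.cdf(1.0 / sigma)
    m_sigma = 1.0 / (math.sqrt(2.0 * math.pi) * sigma * (hi - lo))   # normalization (26)
    rho_at_boundary = m_sigma * math.exp(-1.0 / (2.0 * sigma ** 2))  # rho_sigma(-1)
    def F_inv(t):                                                    # quantile function F_sigma^{-1}
        return sigma * norm.ppf(lo + t * (hi - lo))
    lengths = [F_inv((i + 1) / N) - F_inv(i / N) for i in range(N)]
    return rho_at_boundary * sum(L ** 3 for L in lengths) / 12.0

alpha = 0.9                                      # 2*alpha > 1 + delta for delta = 0.5, and alpha < 1
for N in [50, 100, 200, 400, 800]:
    sigma_N = 1.0 / math.sqrt(2.0 * alpha * math.log(N))             # e^{1/(2 sigma_N^2)} = N^alpha
    print(N, lower_bound_w2_sq(N, sigma_N), N ** (-2.0))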

The following corollary, whose proof can be obtained by adapting the above argument to a simple multi-dimensional setting in which measures and cells “factorize” along the coordinates, confirms the facts observed in the numerical section (Section 4), and the sharpness of our result (Remark 4).

Corollary 9.

Fix $\delta\in(0,1)$. Given any $n\in\mathbb{N}$, consider an axis-aligned discrete grid of the form $Z_{N}=Y_{1}\times\ldots\times Y_{d}$ in $\mathbb{R}^{d}$, with $N=\operatorname{Card}(Z_{N})=n^{d}$, where each $Y_{j}$ is a subset of $\mathbb{R}$ with cardinality $n$. Finally, define $\sigma_{N}:=\sigma_{n,\delta}$ as in 8. Then we have

\mathrm{W}_{2}^{2}(\delta_{B_{N}(Z_{N})},\rho_{\sigma_{N}}\otimes\dots\otimes\rho_{\sigma_{N}})\geq CN^{-\frac{(2-\delta)}{d}},

where the constant $C$ is independent of $N$.