
Non-asymptotic convergence bounds for
Wasserstein approximation using point clouds

Quentin Mérigot
Université Paris-Saclay, CNRS,
Laboratoire de mathématiques d’Orsay,
91405, Orsay, France
Institut Universitaire de France
Filippo Santambrogio
Univ Lyon,
Université Claude Bernard Lyon 1, CNRS
UMR 5208, Institut Camille Jordan
F-69622 Villeurbanne
Institut Universitaire de France
Clément Sarrazin
Université Paris-Saclay, CNRS,
Laboratoire de mathématiques d’Orsay
91405, Orsay, France
Abstract

Several issues in machine learning and inverse problems require generating discrete data, as if sampled from a model probability distribution. A common way to do so relies on the construction of a uniform probability distribution over a set of N points which minimizes the Wasserstein distance to the model distribution. This minimization problem, where the unknowns are the positions of the atoms, is non-convex. Yet, in most cases, a suitably adjusted version of Lloyd's algorithm, in which Voronoi cells are replaced by Power cells, leads to configurations with small Wasserstein error. This is surprising because of the non-convex nature of the problem and the existence of spurious critical points. We provide explicit upper bounds for the convergence speed of this Lloyd-type algorithm, starting from a cloud of points sufficiently far from each other. These bounds already hold after one step of the iteration procedure, and similar bounds can be deduced for the corresponding gradient descent. They naturally lead to a modified Polyak-Łojasiewicz inequality for the Wasserstein distance cost, with an error term depending on the distances between Dirac masses in the discrete distribution.

1 Introduction

In recent years, the theory of optimal transport has been the source of stimulating ideas in machine learning and in inverse problems. Optimal transport can be used to define distances, called Wasserstein or earth-mover distances, between probability distributions over a metric space. These distances allow one to measure the closeness between a generated distribution and a model distribution, and they have been used with success as data attachment terms in inverse problems. In practice, it has been observed for several different inverse problems that replacing usual loss functions with Wasserstein distances tends to increase the basin of convergence of the methods towards a good solution of the problem, or even to convexify the landscape of the minimized energy [7, 6]. This good behaviour is not fully understood, but one may attribute it partly to the fact that Wasserstein distances encode the geometry of the underlying space. A notable use of Wasserstein distances in machine learning is in the field of generative adversarial networks, where one seeks to design a neural network able to produce random examples whose distribution is close to a prescribed model distribution [2].

Wasserstein distance and Wasserstein regression

Given two probability distributions ρ, μ on ℝ^d, the Wasserstein distance of exponent p between ρ and μ measures the total cost of moving the mass distribution described by ρ onto μ, knowing that moving a unit mass from x to y costs ‖x−y‖^p. Formally, it is defined as the value of an optimal transport problem between ρ and μ:

W_{p}(\rho,\mu)=\left(\min_{\pi\in\Pi(\rho,\mu)}\int\|x-y\|^{p}\,\mathrm{d}\pi(x,y)\right)^{1/p}, (1)

where the minimum is taken over the set Π(ρ, μ) of transport plans between ρ and μ, i.e. probability distributions over ℝ^d × ℝ^d with marginals ρ and μ. Standard references on the theory of optimal transport include the books by Villani and by Santambrogio [19, 20, 18], while the computational and statistical aspects are discussed in the survey by Peyré and Cuturi [16].
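As a concrete illustration (not part of the original paper), the quantity W_p^p between two finitely supported measures can be computed by solving the linear program (1) directly; the following Python sketch uses SciPy's linear programming solver, and the function name and the small test clouds are illustrative only.

```python
# Minimal sketch: W_p(sum_i a_i delta_{x_i}, sum_j b_j delta_{y_j})^p via the
# Kantorovich linear program (1). Illustrative code, not the authors' implementation.
import numpy as np
from scipy.optimize import linprog

def wasserstein_pp(x, y, a, b, p=2):
    n, m = len(x), len(y)
    # cost matrix c_{ij} = ||x_i - y_j||^p, flattened row-major: variable k = i*m + j
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                      # row marginals: sum_j pi_{ij} = a_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                      # column marginals: sum_i pi_{ij} = b_j
        A_eq[n + j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun                          # optimal transport cost, i.e. W_p^p

rng = np.random.default_rng(0)
x, y = rng.random((5, 2)), rng.random((7, 2))
w22 = wasserstein_pp(x, y, np.full(5, 1 / 5), np.full(7, 1 / 7), p=2)
```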

In this article, we consider regression problems with respect to the Wasserstein metric, which can be put in the following form:

\min_{\theta\in\Theta}\mathrm{W}_{p}^{p}(T_{\theta\#}\mu,\rho), (2)

where μ is the reference distribution, a probability measure on [0,1]^ℓ, ρ is the model distribution, a probability measure on ℝ^d, and where T_θ : [0,1]^ℓ → ℝ^d is a family of maps indexed by a parameter θ ∈ Θ. In the previous formula, T_{θ#}μ denotes the image of the measure μ under the map T_θ, also called the pushforward of μ under T_θ. This image measure is defined by T_{θ#}μ(B) := μ(T_θ^{-1}(B)) for any measurable set B in the codomain of T_θ. In this work, we concentrate on the quadratic Wasserstein distance W_2. Several problems related to the design of generative models can be put under the form (2), see for instance [8, 2]. Solving (2) numerically is challenging for several reasons, but in this article we concentrate on one of them: the non-convexity of the Wasserstein distance under displacement of the measures.

Non-convexity of the Wasserstein distance under displacements.

It is well known that the Wasserstein distance is convex for the standard (linear) structure of the space of probability measures, meaning that if ν_0 and ν_1 are two probability measures and ν_t = (1−t)ν_0 + tν_1, then the map t ∈ [0,1] ↦ W_p^p(ν_t, ρ) is convex. Using a terminology from physics, we may say that the Wasserstein distance is convex for the Eulerian structure of the space of probability measures, i.e. when one interpolates linearly between the densities. However, in the regression problem (2), the perturbations are Lagrangian rather than Eulerian, in the sense that modifications of the parameter θ lead to a displacement of the support of the measure T_{θ#}μ. This appears very clearly when μ is the uniform measure over a set X = (x_1, …, x_N) of N points in [0,1]^d, i.e. μ = δ_X with

\delta_{X}\stackrel{\text{def}}{=}\frac{1}{N}\sum_{i=1}^{N}\delta_{x_{i}}.

In this case T_{θ#}μ is the uniform measure over the set T_θ(X) = (T_θ(x_1), …, T_θ(x_N)), i.e. T_{θ#}μ = δ_{T_θ(X)}. In this article, we will therefore be interested in the function

F_{N}:Y\in(\mathbb{R}^{d})^{N}\mapsto\frac{1}{2}\mathrm{W}^{2}_{2}(\rho,\delta_{Y}). (3)

This function F_N is not convex, and actually exhibits (semi-)concavity properties. This was first observed in [1] (Theorem 7.3.2), and is related to the positive curvature of the Wasserstein space. A precise statement in the context considered here may also be found as Proposition 21 in [14]. A practical consequence of the lack of convexity of F_N is that critical points of F_N are not necessarily global minimizers. It is actually easy to construct families of critical points Y_N of F_N such that F_N(Y_N) is bounded from below by a positive constant, while lim_{N→∞} min F_N = 0, so that the ratio between F_N(Y_N) and min F_N is arbitrarily large as N → +∞. This can be done by concentrating the points Y_N on lower-dimensional subspaces of ℝ^d, as in Remarks 2 and 3.

When applying gradient descent to the nonconvex optimization problem (2), it is in principle possible to end up at local minima corresponding to high energy critical points of the Wasserstein distance, regardless of the non-linearity of the map θ ↦ T_{θ#}μ. Our main theorem, or rather its Corollary 6, shows that if the points of Y are at distance at least ε > 0 from one another, then

F_{N}(Y)-C\frac{1}{N\varepsilon^{d-1}}\leq N\|\nabla F_{N}(Y)\|^{2}.

In the previous equation, ‖∇F_N(Y)‖ denotes the Euclidean norm of the vector in ℝ^{Nd} obtained by concatenating the gradients of F_N with respect to the positions of the atoms y_i. We note that due to the weights 1/N in the atomic measure δ_Y, the components of this vector are in general of order 1/N, see Proposition 1. This inequality resembles the Polyak-Łojasiewicz inequality, and shows in particular that if the quantization error W_2^2(ρ, δ_Y) = 2F_N(Y) is large, i.e. larger than ε^{1-d}/N, then the point cloud Y is not critical for F_N. From this, we deduce in Theorem 7 that if the points in the initial cloud are not too close to each other at initialization, then the iterates of fixed-step gradient descent converge to points with low energy F_N, despite the non-convexity of F_N.

Figure 1: From left to right: a point cloud Y^0 in the square Ω = [0,1]×[0,1]; the associated Power cells P_i(Y^0) in the optimal transport to the Lebesgue measure on Ω; the vectors −N∇F_N(Y^0) = B_N(Y^0) − Y^0 followed during the Lloyd step; and the positions of the barycenters Y^1 = B_N(Y^0).

Relation to optimal quantization

Our main result also has implications in terms of the uniform optimal quantization problem, where one seeks a point cloud Y = (y_1, …, y_N) in (ℝ^d)^N such that the uniform measure supported on Y, denoted δ_Y, is as close as possible to the model distribution ρ with respect to the 2-Wasserstein distance:

\min_{Y\in\Omega^{N}}F_{N}(Y). (4)

The uniform optimal quantization problem (4) is a natural variant of the (standard) optimal quantization problem, where one does not impose that the measure supported on Y is uniform:

\min_{Y\in(\mathbb{R}^{d})^{N}}G_{N}(Y),\quad\hbox{ where }G_{N}(Y)=\min_{\mu\in\Delta_{N}}\mathrm{W}^{2}_{2}\Big(\rho,\sum_{i=1}^{N}\mu_{i}\delta_{y_{i}}\Big), (5)

and where Δ_N ⊆ ℝ^N is the probability simplex. This standard optimal quantization problem is a cornerstone of sampling theory, and we refer the reader to the book of Graf and Luschgy [10] and to the survey by Pagès [15]. The uniform quantization problem (4) is less common, but also very natural. It has been used in imaging to produce stipplings of an image [4, 3] and for meshing purposes [9]. A common difficulty in solving (5) and (4) numerically is that the minimized functionals F_N and G_N are non-convex and have many critical points with high energy. However, in practice, simple fixed-point or gradient descent strategies behave well when the initial point cloud is not chosen adversely. Our second contribution is a quantitative explanation for this good behaviour in the case of the uniform optimal quantization problem.

Lloyd's algorithm [12] is a fixed-point algorithm for approximately solving the standard optimal quantization problem (5). Starting from a point cloud Y^k = (y_1^k, …, y_N^k) ∈ (ℝ^d)^N with distinct points, one defines the next iterate Y^{k+1} in two steps. First, one computes the Voronoi diagram of Y^k, a tessellation of the space into convex polyhedra (V_i(Y^k))_{1≤i≤N}, where

V_{i}(Y)=\{x\in\Omega\mid\forall j\in\{1,\ldots,N\},\ \|x-y_{i}\|\leq\|x-y_{j}\|\}. (6)

In the second step, one moves every point y_i^k to the barycenter, with respect to ρ, of the corresponding cell V_i(Y^k). This algorithm can also be interpreted as a fixed-point algorithm for solving the first-order optimality condition of (5), i.e. ∇G_N(Y) = 0. One can show that the energy (G_N(Y^k))_{k≥0} is decreasing in k. The convergence of Y^k towards a critical point of G_N as k → +∞ has been studied in [5], but the energy of this limit critical point is not guaranteed to be small.
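For concreteness, one step of this classical algorithm can be sketched as follows when ρ is represented by a weighted cloud of sample points (illustrative code, not the authors' implementation; empty cells are simply left untouched).

```python
# Minimal sketch: one step of Lloyd's algorithm for (5), with rho approximated by
# weighted samples. Each y_i moves to the rho-barycenter of its Voronoi cell V_i(Y).
import numpy as np

def lloyd_step_voronoi(samples, weights, Y):
    d2 = ((samples[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    cells = np.argmin(d2, axis=1)          # Voronoi assignment (6) of each sample
    Y_next = Y.copy()
    for i in range(len(Y)):
        mask = cells == i
        if weights[mask].sum() > 0:
            Y_next[i] = np.average(samples[mask], axis=0, weights=weights[mask])
    return Y_next
```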

In the case of the uniform quantization problem (4), one can try to minimize the energy F_N by gradient descent, defining the iterates

Y^{k+1}=Y^{k}-\tau N\nabla F_{N}(Y^{k}), (7)

where τ > 0 is the time step. The factor N in front of ∇F_N compensates for the fact that, in general, ∇F_N(Y) = O(1/N). When τ = 1, one recovers a version of Lloyd's algorithm for the uniform quantization problem, involving the barycenters B_N(Y) of Power cells, rather than Voronoi cells, associated to Y. More precisely, Proposition 1 proves that ∇F_N(Y) = (Y − B_N(Y))/N, so that Y^{k+1} = B_N(Y^k) when τ = 1. Quite surprisingly, we prove in Corollary 4 that if the points in the initial cloud Y^0 are not too close to each other, then the uniform measure over the point cloud Y^1 = Y^0 − N∇F_N(Y^0), obtained after only one step of Lloyd's algorithm, is close to ρ. This is illustrated in Figure 1. We prove in particular the following statement.

Theorem (Particular case of Corollary 4).

Let ρ be a probability density over a compact convex set Ω ⊆ ℝ^d, let Y^0 = (y_1^0, …, y_N^0) ∈ Ω^N and assume that the points lie at some positive distance from one another: for some constant c,

\forall i\neq j,\quad\|y_{i}-y_{j}\|\geq cN^{-1/d},

corresponding for instance to a point cloud sampled on a regular grid. Then, the point cloud Y^1 = Y^0 − N∇F_N(Y^0) obtained after one step of Lloyd's algorithm satisfies

\mathrm{W}_{2}^{2}(\delta_{Y^{1}},\rho)\leq C_{c,d,\Omega}\,N^{-1/d},

where C_{c,d,Ω} is a constant depending on c, d and diam(Ω).

Outline

In Section 2, we start with a short review of background material on optimal transport and optimal uniform quantization. We then establish our main result (Theorem 3) on the approximation of a measure ρ by barycenters of Power cells. This theorem yields error estimates for one step of Lloyd's algorithm in deterministic and probabilistic settings (Corollaries 4 and 5). In Section 3, we establish a Polyak-Łojasiewicz-type inequality (Corollary 6) for the function F_N = (1/2)W_2^2(ρ, δ_Y) introduced in (3), and we study the convergence of a gradient descent algorithm for F_N (Theorem 7). Finally, in Section 4, we report numerical results on optimal uniform quantization in dimension d = 2.

2 Lloyd’s algorithm for optimal uniform quantization

Optimal transport and Kantorovich duality

In this section we briefly review Kantorovich duality and its relation to semidiscrete optimal transport. The cost is fixed to c(x,y) = ‖x−y‖², and we assume that ρ is a probability density over a compact convex domain Ω. In this setting, Brenier's theorem implies that given any probability measure μ supported on Ω, the optimal transport plan between ρ and μ, i.e. the minimizer π in the definition of the Wasserstein distance (1) with p = 2, is induced by a transport map T_μ : Ω → Ω, meaning π = (\mathrm{Id}, T_{\mu})_{\#}\rho. One can derive an alternative expression for the Wasserstein distance using Kantorovich duality, which leads to a more precise description of the optimal transport map [18, Theorem 1.39]:

\mathrm{W}_{2}^{2}(\rho,\mu)=\max_{\phi:Y\to\mathbb{R}}\int_{\mathbb{R}^{d}}\phi\,\mathrm{d}\mu+\int_{\Omega}\phi^{c}\,\mathrm{d}\rho, (8)

where ϕ^c(x) = min_i c(x, y_i) − ϕ_i. When μ = δ_Y is the uniform probability measure over a point cloud Y = (y_1, …, y_N) containing N distinct points, we set ϕ_i = ϕ(y_i) and we define the i-th Power cell associated to the couple (Y, ϕ) as

\mathrm{Pow}_{i}(Y,\phi)=\{x\in\mathbb{R}^{d}\mid\forall j\in\{1,\ldots,N\},\ \left\lVert x-y_{i}\right\rVert^{2}-\phi_{i}\leq\left\lVert x-y_{j}\right\rVert^{2}-\phi_{j}\}.

Then, the Kantorovich dual (8) of the optimal transport problem between ρ and δ_Y turns into a finite-dimensional concave maximization problem:

\mathrm{W}_{2}^{2}(\rho,\delta_{Y})=\max_{\phi\in\mathbb{R}^{N}}\sum_{i=1}^{N}\left(\frac{\phi_{i}}{N}+\int_{\mathrm{Pow}_{i}(Y,\phi)}\left(\left\lVert x-y_{i}\right\rVert^{2}-\phi_{i}\right)\mathrm{d}\rho(x)\right). (9)

By Corollary 1.2 in [11], a vector ϕ ∈ ℝ^N is optimal for this maximization problem if and only if each Power cell contains the same amount of mass, i.e. if

\forall i\in\{1,\ldots,N\},\quad\rho(\mathrm{Pow}_{i}(Y,\phi))=\frac{1}{N}. (10)

From now on, we denote P_i(Y) = Pow_i(Y, ϕ) ∩ Ω, where ϕ ∈ ℝ^N satisfies (10). The optimal transport map T_Y between ρ and δ_Y sends every Power cell P_i(Y) to the point y_i, i.e. it is defined ρ-almost everywhere by T_Y|_{P_i(Y)} = y_i. We refer again to the introduction of [11] for more details.
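To make this concrete, the optimality condition (10) can be enforced numerically; the sketch below assumes ρ is discretized as a weighted cloud of grid points and uses plain gradient ascent on the concave dual (9) (the damped Newton method of [11] is the method of choice in practice; the step size and iteration count here are illustrative only).

```python
# Minimal sketch: compute weights phi so that every power cell Pow_i(Y, phi)
# carries mass approximately 1/N, cf. condition (10). Illustrative code.
import numpy as np

def power_cell_assignment(grid, Y, phi):
    # index of the power cell containing each grid point: argmin_i ||x - y_i||^2 - phi_i
    d2 = ((grid[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.argmin(d2 - phi[None, :], axis=1)

def solve_dual(grid, rho_w, Y, n_iter=2000, step=1.0):
    # gradient ascent on the dual (9); the i-th partial derivative is 1/N - rho(Pow_i)
    N = len(Y)
    phi = np.zeros(N)
    for _ in range(n_iter):
        cells = power_cell_assignment(grid, Y, phi)
        masses = np.bincount(cells, weights=rho_w, minlength=N)
        phi += step * (1.0 / N - masses)
    return phi
```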

Optimal uniform quantization

In this article, we study the behaviour of the squared Wasserstein distance between the (fixed) probability density ρ and a uniform finitely supported measure δ_Y, where Y = (y_1, …, y_N) is a cloud of N points, in terms of variations of Y. As in equation (3), we denote F_N = (1/2) W_2^2(ρ, ·). Proposition 21 in [14] gives an expression for the gradient of F_N, and proves its semiconcavity. We recall that a function F is called α-semiconcave, with α ≥ 0, if the function F − (α/2)‖·‖² is concave. We denote 𝔻_N the generalized diagonal

\mathbb{D}_{N}=\{Y\in(\mathbb{R}^{d})^{N}\mid\exists i\neq j\hbox{ s.t. }y_{i}=y_{j}\}.
Proposition 1 (Gradient of F_N).

The function F_N is (1/N)-semiconcave on (ℝ^d)^N and is of class 𝒞¹ on (ℝ^d)^N \ 𝔻_N. In addition, one has

\forall Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N},\quad\nabla F_{N}(Y)=\frac{1}{N}(Y-B_{N}(Y)),\quad\hbox{where }B_{N}(Y)=(b_{1}(Y),\ldots,b_{N}(Y)), (11)

and where b_i(Y) is the barycenter of the i-th Power cell, i.e. b_i(Y) = N∫_{P_i(Y)} x dρ(x).
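Continuing the numerical sketch of the previous section (again with ρ replaced by weighted grid samples, and reusing the helper power_cell_assignment defined there), the barycenters B_N(Y) and the gradient formula (11) can be approximated as follows; one Lloyd step then simply returns B.

```python
# Minimal sketch: barycenters b_i(Y) of the power cells and the gradient of F_N
# from Proposition 1. Illustrative code, not the authors' implementation.
import numpy as np

def lloyd_step_power(grid, rho_w, Y, phi):
    N = len(Y)
    cells = power_cell_assignment(grid, Y, phi)   # helper from the previous sketch
    B = Y.copy()
    for i in range(N):
        mask = cells == i
        if rho_w[mask].sum() > 0:
            # b_i(Y) = N * int_{P_i(Y)} x drho(x); each cell carries mass ~ 1/N
            B[i] = np.average(grid[mask], axis=0, weights=rho_w[mask])
    grad_FN = (Y - B) / N                          # formula (11)
    return B, grad_FN                              # one Lloyd step (14): Y^{k+1} = B
```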

It is not difficult to prove that F_N admits at least one minimizer, and that any minimizer Y satisfies the first-order optimality condition Y = B_N(Y). A point cloud that satisfies this condition is called critical.

Remark 1 (Upper bound on the minimum of F_N).

We note from [14, Proposition 12] that when ρ is supported on a compact subset of ℝ^d, then

\min F_{N}=\min_{Y\in(\mathbb{R}^{d})^{N}}\frac{1}{2}\mathrm{W}_{2}^{2}(\rho,\delta_{Y})\lesssim\begin{cases}N^{-\frac{2}{d}}&\hbox{ if }d>2,\\ N^{-1}\log N&\hbox{ if }d=2,\\ N^{-1}&\hbox{ if }d=1.\end{cases} (12)

These upper bounds may not be tight, in particular when ρ is separable (see Appendix E).

Remark 2 (High energy critical points).

On the other hand, since F_N is not convex, this first-order condition is not sufficient for a point cloud to be a minimizer of F_N. For instance, if ρ ≡ 1 on the unit square Ω = [0,1]², one can check that the point cloud

Y_{N}=\left(\left(\frac{1}{2N},\frac{1}{2}\right),\left(\frac{3}{2N},\frac{1}{2}\right),\ldots,\left(\frac{2N-1}{2N},\frac{1}{2}\right)\right)

is a critical point of F_N but not a minimizer of F_N. In fact, this critical point becomes arbitrarily bad as N → +∞, in the sense that

\lim_{N\to+\infty}\frac{F_{N}(Y_{N})}{\min F_{N}}=+\infty.

On the other hand, we note that the point cloud Y_N is highly concentrated, in the sense that the distance between consecutive points in Y_N is 1/N, whereas in an evenly distributed point cloud one would expect the minimum distance between points to be of order N^{-1/d}.
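These claims can be verified directly from the definitions. By symmetry (taking ϕ = 0 in (10)), the Power cells of Y_N are the vertical strips P_i(Y_N) = [(i−1)/N, i/N] × [0,1], each of mass 1/N, and the barycenter of P_i(Y_N) is exactly y_i = ((2i−1)/(2N), 1/2), so that Y_N = B_N(Y_N) is critical. Moreover,

2F_{N}(Y_{N})=\sum_{i=1}^{N}\int_{P_{i}(Y_{N})}\|x-y_{i}\|^{2}\,\mathrm{d}x=\frac{1}{12N^{2}}+\frac{1}{12},

so F_N(Y_N) ≥ 1/24 for every N, while min F_N ≲ N^{-1}\log N by (12).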

Gradient descent and Lloyd’s algorithm

One can find a critical point of F_N by following the discrete gradient flow of F_N, defined in (7), starting from an initial position Y^0 ∈ (ℝ^d)^N \ 𝔻_N. Thanks to the expression of ∇F_N given in Proposition 1, the discrete gradient flow may be written as

Y^{k+1}=Y^{k}+\tau_{N}(B_{N}(Y^{k})-Y^{k}), (13)

where τ_N is a fixed time step. For τ_N = 1, one recovers a variant of Lloyd's algorithm, where at each iteration every point is moved to the barycenter of its Power cell P_i(Y^k):

Y^{k+1}=B_{N}(Y^{k}). (14)

We can state the following result about Lloyd’s algorithm for the uniform quantization problem, whose proof is postponed to the appendix.

Proposition 2.

Let N be a fixed integer and let (Y^k)_{k≥0} be the iterates of (14), with Y^0 ∉ 𝔻_N. Then the energy k ↦ F_N(Y^k) is decreasing and lim_{k→+∞} ‖∇F_N(Y^k)‖ = 0. Moreover, the sequence (Y^k)_{k≥0} belongs to a compact subset of (ℝ^d)^N \ 𝔻_N, and every limit point of a converging subsequence is a critical point of F_N.

Experiments suggest that following the discrete gradient flow of F_N does not lead to high energy critical points of F_N, such as those described in Remark 2, unless we start from an adversely chosen point cloud. The following theorem and its corollaries, the main results of this article, back up this experimental evidence. They show that if the point cloud Y is not too concentrated, then the uniform measure over the barycenters of the Power cells, δ_{B_N(Y)}, is a good quantization of the probability density ρ, i.e. they bound the quantization error after one step of Lloyd's algorithm (14).

We will use the following notation for ε > 0:

I_{\varepsilon}(Y)=\{i\in\{1,\ldots,N\}\mid\forall j\neq i,\ \|y_{i}-y_{j}\|\geq\varepsilon\},
\mathbb{D}_{N,\varepsilon}=\{Y\in(\mathbb{R}^{d})^{N}\mid\exists i\neq j,\ \|y_{i}-y_{j}\|\leq\varepsilon\}.

Note that 𝔻_{N,ε} is an ε-neighborhood of the generalized diagonal 𝔻_N.

Theorem 3 (Quantization by barycenters).

Let Ω ⊆ ℝ^d be a compact convex set, let ρ be a probability density on Ω and consider a point cloud Y = (y_1, …, y_N) in Ω^N \ 𝔻_N. Then, for all 0 < ε ≤ 1,

\mathrm{W}^{2}_{2}\left(\rho,\delta_{B_{N}(Y)}\right)\leq C_{d,\Omega}\left(\frac{\varepsilon^{1-d}}{N}+1-\frac{\operatorname{Card}(I_{\varepsilon}(Y))}{N}\right), (15)

where C_{d,Ω} = \frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d+1} and where ω_{d-1} is the volume of the unit ball in ℝ^{d-1}.
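For instance, for d = 2 and Ω = [0,1]² one has ω_1 = 2 and diam(Ω) = √2, so that C_{2,Ω} = 4(√2+1)³ ≈ 56; if all the points of Y are ε-separated with ε of order N^{-1/2}, the right-hand side of (15) is then of order N^{-1/2}.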

The proof relies on arguments from convex geometry. In particular, we denote A ⊕ B the Minkowski sum of sets: A ⊕ B = {a + b | (a,b) ∈ A × B}.

Proof.

Let ϕ¹ ∈ ℝ^N be the solution of the dual Kantorovich problem between ρ and δ_Y, characterized by (10). We let ϕ^t = tϕ¹ and we denote P_i^t = Pow_i(Y, ϕ^t) ∩ Ω′ the i-th Power cell intersected with the slightly enlarged convex set Ω′ = Ω ⊕ B(0,1). This way, P_i^1 ⊇ P_i(Y), whereas P_i^0 is the intersection of the i-th Voronoi cell defined in (6) with Ω′.

We will now prove an upper bound on the sum of the diameters of the cells P_i(Y) whose index lies in I_ε(Y). First, we notice the following inclusion, which holds for any t ∈ [0,1]:

(1-t)P_{i}^{0}\oplus tP_{i}^{1}\subseteq P_{i}^{t}. (16)

Indeed, let x^0 ∈ P_i^0 and x^1 ∈ P_i^1, so that for all j ∈ {1, …, N} and k ∈ {0,1},

\|x^{k}-y_{i}\|^{2}-\phi^{k}_{i}\leq\|x^{k}-y_{j}\|^{2}-\phi^{k}_{j}.

Expanding the squares and subtracting ‖x^k‖² on both sides, these inequalities become affine in ϕ^k and x^k; taking their convex combination with weights (1−t, t) directly implies that x^t = (1−t)x^0 + tx^1 belongs to P_i^t, as desired.

For any index i ∈ I_ε(Y), the point y_i is at distance at least ε from the other points, implying that the ball B(y_i, ε/2) is contained in the Voronoi cell V_i(Y). Using that P_i^0 = V_i(Y) ∩ Ω′, that Ω′ = Ω ⊕ B(0,1) and that y_i ∈ Ω, we deduce that P_i^0 contains the same ball. On the other hand, P_i^1 contains a segment S_i of length diam(P_i^1), and the inclusion (16) with t = 1/2 gives

\frac{1}{2}(\mathrm{B}(y_{i},\varepsilon/2)\oplus S_{i})\subseteq P_{i}^{1/2}.

The Minkowski sum in the left-hand side contains in particular the product of a (d−1)-dimensional ball of radius ε/2 with an orthogonal segment of length diam(P_i^1) ≥ diam(P_i(Y)). Thus,

\frac{1}{2^{d}}\,\omega_{d-1}\frac{\varepsilon^{d-1}}{2^{d-1}}\operatorname{diam}(P_{i}(Y))\leq|P_{i}^{1/2}|.

Using that the Power cells P_i^{1/2} form a tessellation of the domain Ω′, we therefore obtain

\sum_{i\in I_{\varepsilon}(Y)}\operatorname{diam}(P_{i}(Y))\leq\frac{2^{2d-1}}{\omega_{d-1}}\left\lvert\Omega^{\prime}\right\rvert\varepsilon^{1-d}\leq\frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d}\varepsilon^{1-d}. (17)

We now estimate the transport cost between δ_B and the density ρ, where B = B_N(Y). The transport cost due to the points whose indices do not belong to I_ε(Y) can be bounded in a crude way by

\sum_{i\not\in I_{\varepsilon}(Y)}\int_{P_{i}(Y)}\|x-y_{i}\|^{2}\,\mathrm{d}\rho(x)\leq\left(1-\frac{\operatorname{Card}I_{\varepsilon}(Y)}{N}\right)\operatorname{diam}(\Omega)^{2}.

Note that we used ρ(P_i(Y)) = 1/N. On the other hand, the transport cost associated with indices in I_ε(Y) can be bounded using (17) and diam(P_i(Y)) ≤ diam(Ω):

\begin{aligned}\sum_{i\in I_{\varepsilon}(Y)}\int_{P_{i}(Y)}\|x-y_{i}\|^{2}\,\mathrm{d}\rho(x)&\leq\frac{1}{N}\sum_{i\in I_{\varepsilon}(Y)}\operatorname{diam}(P_{i}(Y))^{2}\\ &\leq\frac{1}{N}\operatorname{diam}(\Omega)\sum_{i\in I_{\varepsilon}(Y)}\operatorname{diam}(P_{i}(Y))\\ &\leq\frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d+1}\frac{\varepsilon^{1-d}}{N}.\end{aligned}

In conclusion, we obtain the desired estimate:

\mathrm{W}_{2}^{2}\left(\rho,\delta_{B_{N}(Y)}\right)\leq\frac{2^{2d-1}}{\omega_{d-1}}(\operatorname{diam}(\Omega)+1)^{d+1}\frac{\varepsilon^{1-d}}{N}+\operatorname{diam}(\Omega)^{2}\left(1-\frac{\operatorname{Card}I_{\varepsilon}(Y)}{N}\right).\qquad\qed

This theorem could be extended, mutatis mutandis, to the case where ρ is a general probability measure (i.e. not a density). However, this would entail some technical complications in the definition of the barycenters b_i, requiring a disintegration of ρ with respect to the transport plan π.

Consequence for Lloyd’s algorithm (14)

In the next corollary, we assume that the distance between any pair of distinct points in Y_N ∈ (ℝ^d)^N is bounded from below by ε_N ≥ C N^{-1/β}, implying that Card(I_{ε_N}(Y_N)) = N. This corresponds to the spacing one could expect for a point set uniformly sampled from a set with Minkowski dimension β. When β > d−1, the corollary asserts that one step of Lloyd's algorithm is enough to approximate ρ, in the sense that the uniform measure δ_{B_N(Y_N)} over the barycenters converges towards ρ as N → +∞.

Corollary 4 (Quantization by barycenters, asymptotic case).

Assume ε_N ≥ C·N^{-1/β} with C, β > 0. Then, with α = 1 − (d−1)/β,

\forall Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N,\varepsilon_{N}},\quad\mathrm{W}^{2}_{2}(\rho,\delta_{B_{N}(Y)})\leq\frac{C_{d,\Omega}}{C^{d-1}}N^{-\alpha}, (18)

and in particular, if β > d−1,

\lim_{N\to+\infty}\ \max_{Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N,\varepsilon_{N}}}\mathrm{W}^{2}_{2}(\rho,\delta_{B_{N}(Y)})=0. (19)
Remark 3 (Optimality of the exponent when β = d).

There is no reason to believe that the exponent in the upper bound (18) is optimal in general. However, it seems to be optimal in a "worst-case sense" when β = d. More precisely, we show the following result in Appendix E: for any δ ∈ (0,1) and for every N = n^d (with n ∈ ℕ) there exists a sequence of separable probability densities ρ_N over X = [−1,1]^d (ρ_N is a truncated Gaussian distribution, whose variance converges to zero slowly as N → +∞) such that, if Y_N is a uniform grid of size n × ⋯ × n = N in X, then

\mathrm{W}_{2}^{2}(\delta_{B_{N}(Y_{N})},\rho_{N})\geq CN^{-\frac{2-\delta}{d}},

where C is independent of N. On the other hand, in this setting every point in Y_N is at distance at least CN^{-1/d} from any other point in Y_N. Applying Corollary 4 with β = d, i.e. α = 1/d, we get

\mathrm{W}_{2}^{2}(\delta_{B_{N}(Y_{N})},\rho_{N})\leq C^{\prime}N^{-\frac{1}{d}}.

Comparing this upper bound on W_2^2(δ_{B_N(Y_N)}, ρ_N) with the lower bound above, one sees that it is not possible to improve the exponent.

Remark 4 (Optimality of (19)).

The assumption β > d−1 in (19) is tight: if ρ is the Lebesgue measure on [0,1]^d, it is possible to construct a point cloud Y_N with N points on the (d−1)-cube {1/2} × [0,1]^{d−1} such that distinct points in Y_N are at distance at least ε_N ≥ C·N^{-1/(d-1)}. Then, the barycenters B_N(Y_N) are also contained in this cube, so that W_2^2(ρ, δ_{B_N(Y_N)}) ≥ 1/12.

The next corollary is a probabilistic analogue of Corollary 4, assuming that the initial point cloud is drawn from a probability density σ on Ω. Note that σ can be distinct from ρ. The proof relies on McDiarmid's inequality to quantify the proportion of ε-isolated points in a point cloud drawn randomly and independently from σ; it is given in Appendix B.

Corollary 5 (Quantization by barycenters, probabilistic case).

Let σ ∈ L^∞(Ω) be a probability density on Ω and let X_1, …, X_N be i.i.d. random variables with distribution σ. Then, there exists a constant C > 0, depending only on ‖σ‖_{L^∞} and d, such that for N large enough,

\mathbb{P}\left(\mathrm{W}_{2}^{2}\left(\frac{1}{N}\sum_{i=1}^{N}\delta_{b_{i}^{X}},\rho\right)\lesssim N^{-\frac{1}{2d-1}}\right)\geq 1-\mathrm{e}^{-CN^{\frac{2d-3}{2d-1}}},

where b_i^X denotes the barycenter b_i(X) of the i-th Power cell associated with the cloud X = (X_1, …, X_N).
Figure 2: Optimal quantization of a Gaussian truncated to the unit square. On the left, the initial point cloud Y_N is drawn randomly and uniformly from [0,1]²; on the right, Y_N is a regular grid. The top row displays the point clouds obtained after one step of Lloyd's algorithm. The bottom row displays the quantization error F_N(B_N(Y_N)) after one step of Lloyd's algorithm as a function of the number of points. We get F_N(B_N(Y_N)) ≃ N^{-0.95} when Y_N is a random uniform point cloud in [0,1]² and F_N(B_N(Y_N)) ≃ N^{-0.8} when Y_N is a regular grid.

3 Gradient flow and a Polyak-Łojasiewicz-type inequality

Theorem 3 can be interpreted as a modified Polyak-Łojasiewicz-type (PŁ for short) inequality for the function F_N. The usual PŁ inequality for a differentiable function F : ℝ^D → ℝ is of the form

\forall Y\in\mathbb{R}^{D},\quad F(Y)-\min F\leq C\|\nabla F(Y)\|^{2},

where C is a positive constant. This inequality was originally used by Polyak to prove convergence of gradient descent towards the global minimum of F. Note in particular that such an inequality implies that any critical point of F is a global minimum of F. By Remark 2, F_N has critical points that are not minimizers, so that we cannot expect the standard PŁ inequality to hold. What we get is a similar inequality relating F_N(Y) and ‖∇F_N(Y)‖², but with a term involving the minimum distance between the points in place of min F_N.

Corollary 6 (Polyak-Łojasiewicz-type inequality).

Let Y ∈ (ℝ^d)^N \ 𝔻_{N,ε}. Then,

F_{N}(Y)-C_{d,\Omega}\frac{1}{N}\left(\frac{1}{\varepsilon}\right)^{d-1}\leq N\|\nabla F_{N}(Y)\|^{2}. (20)

We note that when ε ≃ (1/N)^{1/d}, the term \frac{1}{N}\left(\frac{1}{\varepsilon}\right)^{d-1} in (20) is of order (1/N)^{1/d}. On the other hand, as recalled in Remark 1, min F_N ≲ (1/N)^{2/d} when d > 2. Thus, we do not expect (20) to be tight.

Convergence of a discrete gradient flow

The modified Polyak-Łojasiewicz inequality (20) suggests that the discrete gradient flow (13) will bring us close to a point cloud with low Wasserstein distance to ρ, provided we can guarantee that the point clouds Y^k remain far from the generalized diagonal during the iterations. We prove in Lemma 3 in Appendix D that if Y^{k+1} = Y^k − τ_N N∇F_N(Y^k) and τ_N ∈ (0,1), then

\forall i\neq j,\quad\left\lVert y^{k+1}_{i}-y^{k+1}_{j}\right\rVert\geq(1-\tau_{N})\|y^{k}_{i}-y^{k}_{j}\|. (21)

We note that this inequality ensures that Y^k never touches the generalized diagonal 𝔻_N, so that the gradient ∇F_N(Y^k) is well-defined at each step. Combining this inequality with Theorem 3, one can actually prove that if the points in the initial cloud Y^0_N are not too close to each other, then a few steps of discrete gradient descent lead to a point cloud Y^k_N whose associated measure is close to the target ρ. Precisely, we arrive at the following theorem, proved in Appendix D.

Theorem 7.

Let 0 < α < \frac{1}{d-1} - \frac{1}{d}, let ε_N ≳ N^{-\frac{1}{d}-α}, and let Y^0_N ∈ Ω^N \ 𝔻_{N,ε_N}. Let (Y_N^k)_k be the iterates of (13) starting from Y_N^0 with time step 0 < τ_N < 1. We assume that lim_{N→∞} τ_N = 0 and we set

k_{N}=\left\lfloor\frac{1}{d\tau_{N}}\ln\left(F_{N}(Y_{N}^{0})\,N\varepsilon_{N}^{d-1}\right)\right\rfloor.

Then,

\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{k_{N}}}\right)=O_{N\to\infty}\left(\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)^{1-\frac{1}{d}}\cdot N^{-\frac{1}{d^{2}}+\alpha\left(1-\frac{1}{d}\right)}\right). (22)
Remark 5.

Note that the exponential behavior implied by (21) and Theorem 3 is coherent with the estimates known in the absolutely continuous setting for the continuous gradient flow. When transitioning from discrete measures to probability densities, lower bounds on the distance between points become upper bounds on the density. The gradient flow μ̇_t = −(1/2)∇_μ W_2^2(ρ, μ_t) has an explicit solution μ_t = σ_{1−e^{-t}}, where σ is a constant-speed geodesic in the Wasserstein space with σ_0 = μ_0 and σ_1 = ρ. In this case, a simple adaptation of the estimates in Theorem 2 of [17] shows the bound ‖μ_t‖_{L^∞} ≤ e^{td}‖μ_0‖_{L^∞}. Still in this absolutely continuous setting, it is possible to remove the exponential growth if the target density is also bounded, as a consequence of displacement convexity [13, Theorem 2.2]. There seems to be no discrete counterpart to this argument, which partly explains the discrepancy between the exponent of N in (22) and the one obtained in Corollary 4.

Figure 3: Optimal quantization of a density ρ corresponding to a gray-scale image (Wikimedia Commons, CC BY-SA 3.0). (Left) Point clouds obtained after one step of Lloyd's algorithm, starting from a regular grid of size N ∈ {3750, 7350, 15000, 43350}. (Right) Quantization error W_2^2(ρ, δ_{B_N}) as a function of the number of points N, showing that W_2^2(ρ, δ_{B_N}) ≃ N^{-1.00}.

4 Numerical results

In this section, we report some experimental results in dimension d = 2.

Gray-scale image

As mentioned in the introduction, uniform optimal quantization allows one to sparsely represent a gray-scale image by points, clustered more closely in areas where the image is darker [4, 3]. In Figure 3, we plotted the point clouds obtained after a single Lloyd step towards the density representing the image on the left (a puffin), starting from regular grids. The observed rate of convergence, N^{-1.00}, is coherent with the theoretical estimate log(N)/N of Remark 1.

Gaussian density with small variance

We now consider a toy model where we approximate a Gaussian density truncated to the unit square Ω = [0,1]², ρ(x,y) = \frac{1}{Z}e^{-8((x-\frac{1}{2})^{2}+(y-\frac{1}{2})^{2})}, where Z is a normalization constant. In the left column of Figure 2, the initial point clouds Y_N^0 are randomly distributed in [0,1]². The three point clouds represented in the top row are obtained after one step of Lloyd's algorithm (14). The red curve displays, in log-log scale, the mean values of F_N(B_N(Y_N)) over a hundred random point clouds, for N ∈ {400, 961, 1600, 2500}. In this case, we observe a decrease rate N^{-0.95} with respect to the number of points, similar to the case of the gray-scale picture.

However, an interesting phenomenon occurs when the initial point cloud Y_N^0 is an axis-aligned grid. The pictures in the right column of Figure 2 were computed starting from such a grid with N ∈ {400, 961, 1600, 2500} points. As in the randomly initialized case, we represented the values of F_N(B_N(Y_N)) in log-log scale. The corresponding discrete probability measure δ_{B_N(Y_N)} seems to converge to ρ as N → ∞, but with a much worse rate for these "low" values of N: F_N(B_N(Y_N)) ≃ N^{-0.8}. In this specific setting, with a separable density and an axis-aligned grid Y^0, the Power cells are rectangles and a single Lloyd step brings us to a critical point of F_N. Thanks to this remark, it is possible to estimate the approximation error from the one-dimensional case. In fact, Appendix E shows that for any δ ∈ (0,1), there exist variances σ_N = σ_N(δ) such that the approximation error W_2^2(ρ_{σ_N}, δ_{B_N}) is of order N^{-\frac{2-\delta}{2}}. On the other hand, for a fixed σ, the approximation error is of order N^{-1}, to be compared with the bound log(N)/N for general measures.
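A minimal sketch of this experiment is given below; it assumes the helpers solve_dual, power_cell_assignment and lloyd_step_power from the sketches of Section 2 are in scope, and the grid resolution, number of points and iteration parameters are illustrative rather than those used for Figure 2.

```python
# Minimal sketch: one Lloyd step (14) towards the truncated Gaussian on [0,1]^2,
# starting from a regular grid of N = n^2 points. Illustrative code.
import numpy as np

m = 200                                     # resolution of the discretization of rho
xs = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(xs, xs)
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
rho_w = np.exp(-8 * ((grid[:, 0] - 0.5) ** 2 + (grid[:, 1] - 0.5) ** 2))
rho_w /= rho_w.sum()

n = 20                                      # initial cloud Y^0: regular n x n grid
ys = (np.arange(n) + 0.5) / n
Y0 = np.stack(np.meshgrid(ys, ys), axis=-1).reshape(-1, 2)

phi = solve_dual(grid, rho_w, Y0)           # equal-mass power cells, condition (10)
Y1, _ = lloyd_step_power(grid, rho_w, Y0, phi)   # one step of Lloyd's algorithm
```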

5 Discussion

We have studied the problem of minimizing the Wasserstein distance between a fixed probability measure ρ and a uniform measure δ_Y over N points, parametrized by the positions of the points Y = (y_1, …, y_N). The main difficulty is the nonconvexity of the Wasserstein distance F_N : Y ∈ (ℝ^d)^N ↦ (1/2)W_2^2(ρ, δ_Y), which we tackled by means of a modified Polyak-Łojasiewicz inequality (20). One limitation of our work is that the term replacing min F_N in the Polyak-Łojasiewicz inequality (20) does not match the theoretical bounds recalled in Remark 1. Future work will concentrate on bridging that gap, but also on deriving consequences for the algorithmic resolution of Wasserstein regression problems min_θ W_2^2(ρ, T_{θ#}μ), starting with the case where θ ↦ T_θ is linear.

Acknowledgments and Disclosure of Funding

This work was supported by a grant from the French ANR (MAGA, ANR-16-CE40-0014).

References

  • [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [3] Michael Balzer, Thomas Schlömer, and Oliver Deussen. Capacity-constrained point distributions: a variant of Lloyd's method. ACM Transactions on Graphics (TOG), 28(3):1–8, 2009.
  • [4] Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Desbrun. Blue noise through optimal transport. ACM Transactions on Graphics (TOG), 31(6):1–11, 2012.
  • [5] Qiang Du, Maria Emelianenko, and Lili Ju. Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations. SIAM Journal on Numerical Analysis, 44(1):102–119, 2006.
  • [6] Bjorn Engquist, Brittany D Froese, and Yunan Yang. Optimal transport for seismic full waveform inversion. Communications in Mathematical Sciences, 14(8):2309–2330, 2016.
  • [7] Jean Feydy, Benjamin Charlier, François-Xavier Vialard, and Gabriel Peyré. Optimal transport for diffeomorphic registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 291–299. Springer, 2017.
  • [8] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617. PMLR, 2018.
  • [9] Fernando de Goes, Pooran Memari, Patrick Mullen, and Mathieu Desbrun. Weighted triangulations for geometry processing. ACM Transactions on Graphics (TOG), 33(3):1–13, 2014.
  • [10] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
  • [11] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semi-discrete optimal transport. Journal of the European Mathematical Society, 21(9):2603–2651, 2019.
  • [12] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [13] Robert J McCann. A convexity principle for interacting gases. Advances in mathematics, 128(1):153–179, 1997.
  • [14] Quentin Mérigot and Jean-Marie Mirebeau. Minimal geodesics along volume-preserving maps, through semidiscrete optimal transport. SIAM Journal on Numerical Analysis, 54(6):3465–3492, 2016.
  • [15] Gilles Pagès. Introduction to vector quantization and its applications for numerics. ESAIM: proceedings and surveys, 48:29–79, 2015.
  • [16] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [17] Filippo Santambrogio. Absolute continuity and summability of transport densities: simpler proofs and new estimates. Calculus of variations and partial differential equations, 36(3):343–354, 2009.
  • [18] Filippo Santambrogio. Optimal transport for applied mathematicians, volume 55. Springer, 2015.
  • [19] Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • [20] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Appendix A Proof of Proposition 2

Given Y = (y_1, …, y_N) ∈ (ℝ^d)^N \ 𝔻_N, one has for any i ∈ {1, …, N},

\begin{aligned}\int_{P_{i}(Y)}\left\lVert x-y_{i}\right\rVert^{2}\,\mathrm{d}\rho(x)&=\int_{P_{i}(Y)}\left\lVert x-b_{i}(Y)+b_{i}(Y)-y_{i}\right\rVert^{2}\,\mathrm{d}\rho(x)\\ &=\int_{P_{i}(Y)}\left\lVert x-b_{i}(Y)\right\rVert^{2}\,\mathrm{d}\rho(x)+\frac{1}{N}\|b_{i}(Y)-y_{i}\|^{2}.\end{aligned}

Summing these equalities over i and remarking that the map T_Y defined by T_Y|_{P_i(Y)} = y_i is an optimal transport map between ρ and δ_Y, we get

\begin{aligned}\frac{1}{N}\|B_{N}(Y)-Y\|^{2}&=\mathrm{W}_{2}^{2}(\rho,\delta_{Y})-\sum_{i}\int_{P_{i}(Y)}\left\lVert x-b_{i}(Y)\right\rVert^{2}\,\mathrm{d}\rho(x)\\ &\leq\mathrm{W}_{2}^{2}(\rho,\delta_{Y})-\mathrm{W}_{2}^{2}(\rho,\delta_{B_{N}(Y)}).\end{aligned}

Thus, with Y^{k+1} = B_N(Y^k), we have

N\|\nabla F_{N}(Y^{k})\|^{2}=\frac{1}{N}\|Y^{k+1}-Y^{k}\|^{2}\leq 2\left(F_{N}(Y^{k})-F_{N}(Y^{k+1})\right).

This implies that the values F_N(Y^k) are decreasing in k and, since they are bounded from below, that ‖∇F_N(Y^k)‖ → 0, since Σ_k ‖∇F_N(Y^k)‖² < +∞. The sequence (Y^k)_k is easily seen to be bounded, since F_N(Y^k) is bounded, which implies a bound on the second moment of δ_{Y^k}.

For fixed N, since all atoms of δ_{Y^k} have mass 1/N, this implies that all points y_i^k belong to the same fixed compact ball. If ρ itself is compactly supported, we can also prove that all points Y^{k+1} = B_N(Y^k) are contained in a compact subset of (ℝ^d)^N \ 𝔻_N, which amounts to obtaining a lower bound on the distances |b_i(Y) − b_j(Y)| for arbitrary Y. This lower bound can be obtained in the following way: since ρ is absolutely continuous, it is uniformly integrable, which means that for every ε > 0 there is δ = δ(ε) > 0 such that for any set A with Lebesgue measure |A| < δ we have ρ(A) < ε. We claim that |b_i(Y) − b_j(Y)| ≥ r := (2R)^{1−d} δ(1/(2N)), where R is such that ρ is supported in a ball B_R of radius R. Indeed, it is enough to prove that every barycenter b_i(Y) is at distance at least r/2 from each face of the convex polytope P_i(Y). Consider a face of such a polytope and suppose, for simplicity, that it lies on the hyperplane {x_d = 0} with the cell contained in {x_d ≥ 0}. Let s be such that ρ(P_i(Y) ∩ {x_d > s}) = ρ(P_i(Y) ∩ {x_d < s}) = 1/(2N). Then, since the diameter of P_i(Y) ∩ B_R is smaller than 2R, the Lebesgue measure of P_i(Y) ∩ {x_d < s} is bounded by (2R)^{d−1}s, which gives s ≥ r by the definition of r. Since at least half of the mass (according to ρ) of the cell P_i(Y) lies above the level x_d = s, the x_d-coordinate of the barycenter is at least s/2 ≥ r/2. This shows that the barycenter lies at distance at least r/2 from each of the faces of its cell.

As a consequence, the iterates Y^k of Lloyd's algorithm lie in a compact subset of (ℝ^d)^N \ 𝔻_N, on which F_N is 𝒞¹. This implies that any limit point must be a critical point.

We do not discuss here whether the whole sequence converges, which seems to be a delicate matter even for fixed N. It is anyway possible to prove (though we do not develop the details here) that the set of limit points is a closed connected subset of (ℝ^d)^N with empty interior, composed of critical points of F_N all lying on the same level set of F_N.

Appendix B Proof of Corollary 5

Given Y = (y_1, …, y_N) ∈ (ℝ^d)^N, we denote

I_{\varepsilon}(Y)=\{i\in\{1,\ldots,N\}\mid\forall j\neq i,\ \|y_{i}-y_{j}\|\geq\varepsilon\}.

We call points y_i such that i ∈ I_ε(Y) ε-isolated, and points y_i such that i ∉ I_ε(Y) ε-connected. In the following, κ(Y) = Card(I_ε(Y))/N denotes the proportion of ε-isolated points of the cloud Y.

Lemma 1.

Let X_1, …, X_N be independent, ℝ^d-valued random variables. Then, there is a constant C_d > 0 such that

\mathbb{P}\left(\left\lvert\kappa(X_{1},\ldots,X_{N})-\mathbb{E}(\kappa)\right\rvert\geq\eta\right)\leq\mathrm{e}^{-N\eta^{2}/C_{d}}.
Proof.

This lemma is a consequence of McDiarmid's inequality. To apply this inequality, we need to evaluate the amplitude of variation of the function κ when one of the points x_i changes. Denote c_d the maximal cardinality of a subset S of the ball B(0, ε) such that the distance between any two distinct points of S is at least ε. By a scaling argument, one can check that c_d does not, in fact, depend on ε. To evaluate

\left\lvert\kappa(x_{1},\ldots,x_{i},\ldots,x_{N})-\kappa(x_{1},\ldots,\tilde{x}_{i},\ldots,x_{N})\right\rvert,

we first note that at most c_d points may become ε-isolated when removing x_i. To prove this, we remark that if a point x_j becomes ε-isolated when x_i is removed, this means that ‖x_i − x_j‖ ≤ ε and ‖x_j − x_k‖ > ε for all k ∉ {i, j}. The number of such j is bounded by c_d. Symmetrically, at most c_d points may become ε-connected under the addition of x̃_i. Finally, the point x_i itself may change status from ε-isolated to ε-connected or conversely. To summarize, we obtain that with C_d = 2c_d + 1,

\left\lvert\kappa(x_{1},\ldots,x_{i},\ldots,x_{N})-\kappa(x_{1},\ldots,\tilde{x}_{i},\ldots,x_{N})\right\rvert\leq\frac{C_{d}}{N}.

The conclusion then directly follows from McDiarmid’s inequality. ∎

Lemma 2.

Let σ ∈ L^∞(ℝ^d) be a probability density and let X_1, …, X_N be i.i.d. random variables with distribution σ. Then,

\mathbb{E}(\kappa(X_{1},\ldots,X_{N}))\geq(1-\|{\sigma}\|_{\mathrm{L}^{\infty}}\omega_{d}\varepsilon^{d})^{N-1}.
Proof.

For a given index j ≠ i, the probability that the point X_i belongs to the ball B(X_j, ε) can be bounded from above by σ(B(X_j, ε)) ≤ ‖σ‖_{L^∞} ω_d ε^d, where ω_d is the volume of the d-dimensional unit ball. Thus, the probability that X_i is ε-isolated is at least

(1-\|{\sigma}\|_{\mathrm{L}^{\infty}}\omega_{d}\varepsilon^{d})^{N-1}.

We conclude by noting that

\mathbb{E}(\kappa(X_{1},\ldots,X_{N}))=\frac{1}{N}\sum_{1\leq i\leq N}\mathbb{P}(X_{i}\hbox{ is $\varepsilon$-isolated}).\qquad\qed
Proof of Corollary 5.

We apply Lemma 2 with ε_N = N^{-1/β} and β = d − 1/2. The expectation of κ(X_1, …, X_N) is bounded from below by

\mathbb{E}(\kappa(X_{1},\dots,X_{N}))\geq\left(1-N^{-\frac{d}{\beta}}\|{\sigma}\|_{\mathrm{L}^{\infty}}\omega_{d}\right)^{N-1}\geq 1-CN^{1-\frac{d}{\beta}}

for large N, since β < d. By Lemma 1, for any η > 0,

\mathbb{P}\left(\kappa(X_{1},\ldots,X_{N})\geq 1-CN^{1-\frac{d}{\beta}}-\eta\right)\geq 1-\mathrm{e}^{-KN\eta^{2}},

for constants C, K > 0 depending only on ‖σ‖_{L^∞} and d. We choose η = N^{-\frac{1}{2d-1}}, so that η is of the same order as N^{1-\frac{d}{\beta}}, since 1 − d/β = −1/(2d−1). Thus, for a slightly different constant C,

\mathbb{P}(\kappa(X_{1},\ldots,X_{N})\geq 1-C\eta)\geq 1-\mathrm{e}^{-KN\eta^{2}}.

Now, for any realization ω such that

\kappa(X_{1}(\omega),\ldots,X_{N}(\omega))\geq 1-C\eta,

Theorem 3 yields

\mathrm{W}_{2}^{2}\left(\delta_{B_{N}(X(\omega))},\rho\right)\lesssim\frac{N^{\frac{d-1}{\beta}}}{N}+\eta\lesssim N^{-\frac{1}{2d-1}},

and such a configuration occurs with probability at least

1-\mathrm{e}^{-KN\eta^{2}}=1-\mathrm{e}^{-KN^{\frac{2d-3}{2d-1}}}.\qquad\qed

Appendix C Proof of Corollary 6

We first note that by Proposition 1, we have ‖∇F_N(Y)‖² = (1/N²)‖B_N(Y) − Y‖². We then use W_2^2(δ_{B_N(Y)}, δ_Y) ≤ (1/N)‖B_N(Y) − Y‖² together with the triangle inequality, which give

\mathrm{W}_{2}^{2}(\rho,\delta_{Y})\leq 2\,\mathrm{W}_{2}^{2}(\rho,\delta_{B_{N}(Y)})+2N\|\nabla F_{N}(Y)\|^{2}.

Thus, using Theorem 3 (with Card(I_ε(Y)) = N, since Y ∉ 𝔻_{N,ε}) to bound W_2^2(ρ, δ_{B_N(Y)}) from above, we get the desired result.

Appendix D Proof of Theorem 7

Lemma 3.

Let Y^0 ∈ (ℝ^d)^N \ 𝔻_{N,ε_N} for some ε_N > 0. Then the iterates (Y^k)_{k≥0} of (13) satisfy, for every k ≥ 0 and every i ≠ j,

\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert\geq(1-\tau_{N})^{k}\varepsilon_{N}. (23)
Proof.

We consider the distance between two trajectories after kk iterations: ek=yikyjk.e_{k}=\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert. Assuming that ek>0e_{k}>0, the convexity of the norm immediately gives us:

ek+1ek\displaystyle e_{k+1}-e_{k}\geq (yikyjkyikyjk)(yik+1yjk+1(yikyjk))\displaystyle\left(\frac{y_{i}^{k}-y_{j}^{k}}{\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert}\right)\cdot\left(y_{i}^{k+1}-y_{j}^{k+1}-\left(y_{i}^{k}-y_{j}^{k}\right)\right)
=\displaystyle= τN(yikyjkyikyjk)(bikbjk)τNyikyjk\displaystyle\tau_{N}\left(\frac{y_{i}^{k}-y_{j}^{k}}{\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert}\right)\cdot\left(b_{i}^{k}-b_{j}^{k}\right)-\tau_{N}\left\lVert y_{i}^{k}-y_{j}^{k}\right\rVert

where we denoted bik:=bi(YNk)b_{i}^{k}:=b_{i}(Y_{N}^{k}) the barycenter of the iith Power cell Pi(YNk)P_{i}(Y_{N}^{k}) in the tesselation associated with the point cloud YNkY_{N}^{k}. Since each barycenter bikb_{i}^{k} lies in its corresponding Power cell, the scalar product (yikyjk)(bikbjk)\left(y_{i}^{k}-y_{j}^{k}\right)\cdot\left(b_{i}^{k}-b_{j}^{k}\right) is non-negative: Indeed, for any iji\neq j,

yikbik2yjkbik2ϕikϕjk\left\lVert y_{i}^{k}-b_{i}^{k}\right\rVert^{2}-\left\lVert y_{j}^{k}-b_{i}^{k}\right\rVert^{2}\leq\phi_{i}^{k}-\phi_{j}^{k}

Summing this inequality with the same inequality with the roles of ii and jj reversed, we obtain:

\left(y_{i}^{k}-y_{j}^{k}\right)\cdot\left(b_{i}^{k}-b_{j}^{k}\right)\geq 0

thus giving us the geometric inequality $e_{k+1}\geq(1-\tau_{N})e_{k}$. Since $Y_{N}^{0}$ was chosen in $\Omega^{N}\setminus\mathbb{D}_{N,\varepsilon_{N}}$, this yields $e_{k}\geq(1-\tau_{N})^{k}e_{0}\geq(1-\tau_{N})^{k}\varepsilon_{N}$, which is inequality (23). ∎
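This contraction can be observed numerically in dimension one, where (as recalled in Appendix E) the Power cells of an ordered cloud are the quantile intervals of $\rho$, so that one step of (13) reads $y\leftarrow(1-\tau_{N})y+\tau_{N}b$ with $b_{i}$ the barycenter of the $i$th quantile cell. The following sketch is ours, with $\rho$ uniform on $[0,1]$, for which these barycenters are the midpoints $(i+\tfrac12)/N$; it checks inequality (23) along the iterations.

import numpy as np

N, tau, n_steps = 50, 0.3, 40
rng = np.random.default_rng(1)
y = np.sort(rng.random(N))               # ordered initial cloud in [0, 1]
eps0 = np.diff(y).min()                  # separation eps_N of the initial cloud
b = (np.arange(N) + 0.5) / N             # barycenters of the quantile cells of U([0,1])
for k in range(1, n_steps + 1):
    y = (1 - tau) * y + tau * b          # one step of the iteration (13)
    assert np.diff(y).min() >= (1 - tau) ** k * eps0 - 1e-12
print("separation bound (23) verified for", n_steps, "steps")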

Lemma 4.

For any $k\geq 0$,

F_{N}(Y_{N}^{k})\leq F_{N}(Y_{N}^{0})\eta_{N}^{k}+2C_{d,\Omega}(1-\eta_{N})\frac{\varepsilon_{N}^{1-d}}{N}\frac{A_{N}^{k}-\eta_{N}^{k}}{A_{N}-\eta_{N}}, (24)

where we denote $\eta_{N}=1-\frac{\tau_{N}}{2}(2-\tau_{N})$ and $A_{N}=(1-\tau_{N})^{1-d}$.

Proof.

This is obtained in a very similar fashion to 3. For any $k\geq 0$, the semi-concavity of $F_{N}$ yields the inequality:

F_{N}(Y_{N}^{k+1})-\frac{\left\lVert Y_{N}^{k+1}\right\rVert^{2}}{2N}-\left(F_{N}(Y_{N}^{k})-\frac{\left\lVert Y_{N}^{k}\right\rVert^{2}}{2N}\right)\leq\left(-\frac{B_{N}^{k}}{N}\right)\cdot\left(Y_{N}^{k+1}-Y_{N}^{k}\right)

with $B_{N}^{k}:=B_{N}(Y_{N}^{k})$, in accordance with the previous proof.

Rearranging the terms and using $Y_{N}^{k+1}-Y_{N}^{k}=\tau_{N}(B_{N}^{k}-Y_{N}^{k})$,

\begin{split}F_{N}(Y_{N}^{k+1})-F_{N}(Y_{N}^{k})\leq&-\tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\frac{\left\lVert B_{N}^{k}-Y_{N}^{k}\right\rVert^{2}}{N}\\ \leq&-\tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\mathrm{W}_{2}^{2}(\delta_{B_{N}^{k}},\delta_{Y_{N}^{k}})\\ \leq&\ \tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\left(-\frac{1}{2}\mathrm{W}_{2}^{2}(\delta_{Y_{N}^{k}},\rho)+\mathrm{W}_{2}^{2}(\rho,\delta_{B_{N}^{k}})\right)\end{split}

where the second inequality uses $\mathrm{W}_{2}^{2}(\delta_{B_{N}^{k}},\delta_{Y_{N}^{k}})\leq\frac{1}{N}\left\lVert B_{N}^{k}-Y_{N}^{k}\right\rVert^{2}$, and the last one follows by applying the triangle inequality to $\mathrm{W}_{2}(\delta_{B_{N}^{k}},\delta_{Y_{N}^{k}})$ followed by the Cauchy–Schwarz inequality. Using 3 together with (23), this yields:

\begin{split}F_{N}(Y_{N}^{k+1})\leq&\left(1-\frac{\tau_{N}}{2}(2-\tau_{N})\right)F_{N}(Y_{N}^{k})+2C_{d,\Omega}\tau_{N}\left(1-\frac{\tau_{N}}{2}\right)\frac{\varepsilon_{N}^{1-d}}{N}(1-\tau_{N})^{k(1-d)}\\ \leq&\ \eta_{N}F_{N}(Y_{N}^{k})+2C_{d,\Omega}(1-\eta_{N})\frac{\varepsilon_{N}^{1-d}}{N}A_{N}^{k}.\end{split}

and we simply iterate over $k$ to end up with the bound claimed in 4. ∎
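As a sanity check of this last step, the short sketch below (ours, with arbitrary values of the constants) verifies that iterating the one-step inequality $F_{k+1}\leq\eta F_{k}+c\,A^{k}$, with $c=2C_{d,\Omega}(1-\eta_{N})\varepsilon_{N}^{1-d}/N$, indeed produces the closed form (24).

def closed_form(F0, eta, c, A, k):
    # right-hand side of (24): F0 * eta^k + c * (A^k - eta^k) / (A - eta)
    return F0 * eta ** k + c * (A ** k - eta ** k) / (A - eta)

F0, eta, c, A = 1.0, 0.9, 0.05, 1.2      # arbitrary values with 0 < eta < 1 < A
F = F0
for k in range(1, 30):
    F = eta * F + c * A ** (k - 1)       # equality case of the one-step inequality
    assert abs(F - closed_form(F0, eta, c, A, k)) < 1e-9
print("iterating the one-step bound reproduces (24)")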

Proof of 7.

To conclude, we simply make first-order expansions of the terms in (24). The definition of $k_{N}$ in 7, although convoluted, was chosen so that both terms in the right-hand side of this inequality, $F_{N}(Y_{N}^{0})\eta_{N}^{k_{N}}$ and $(1-\eta_{N})\frac{\varepsilon_{N}^{1-d}}{N}\frac{A_{N}^{k_{N}}-\eta_{N}^{k_{N}}}{A_{N}-\eta_{N}}$, have the same asymptotic decay to $0$ (as $N\to+\infty$). With the notations of the previous proposition, we have for fixed $N$:

\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{k_{N}}}\right)\leq\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)\eta_{N}^{k_{N}}+2C_{d,\Omega}\frac{(1-\eta_{N})}{A-\eta_{N}}\frac{A^{k_{N}}-\eta_{N}^{k_{N}}}{N\varepsilon_{N}^{d-1}} (25)

We make use here of the notation from Section 3:

T_{N}=k_{N}\tau_{N}=\left\lfloor\frac{1}{d}\ln(F_{N}(Y_{N}^{0})N\varepsilon_{N}^{d-1})\right\rfloor

to lighten this expression a bit, and, because of the assumption $\lim_{N\to\infty}\tau_{N}=0$, we may write:

\frac{A^{k_{N}}-\eta^{k_{N}}}{N\varepsilon_{N}^{d-1}}=\frac{\mathrm{e}^{(d-1)T_{N}}}{N\varepsilon_{N}^{d-1}}+o_{N\to\infty}\left(\frac{T_{N}}{(N\varepsilon_{N}^{d-1})^{\frac{1}{d}}}\right)

as well as $\eta^{k_{N}}=\mathrm{e}^{-T_{N}}+o_{N\to\infty}\left(\frac{T_{N}}{(N\varepsilon_{N}^{d-1})^{\frac{1}{d}}}\right)$, and, substituting $T_{N}$,

\begin{split}\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{k_{N}}}\right)\lesssim&\frac{\mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)^{\frac{d-1}{d}}}{\left(N\varepsilon_{N}^{d-1}\right)^{\frac{1}{d}}}+o_{N\to\infty}\left(\frac{T_{N}}{(N\varepsilon_{N}^{d-1})^{\frac{1}{d}}}\right)\\ \lesssim&\ \mathrm{W}_{2}^{2}\left(\rho,\delta_{Y_{N}^{0}}\right)^{1-\frac{1}{d}}N^{-\frac{1}{d^{2}}+\alpha\left(1-\frac{1}{d}\right)}\qed\end{split}
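The two expansions used above can be checked numerically: with $k=T/\tau$ steps, $A^{k}=(1-\tau)^{k(1-d)}\to\mathrm{e}^{(d-1)T}$ and $\eta^{k}\to\mathrm{e}^{-T}$ as $\tau\to 0$. A tiny sketch of ours, with arbitrary values of $d$ and $T$:

import math

d, T = 3, 2.0
for tau in [1e-1, 1e-2, 1e-3]:
    k = int(round(T / tau))
    A = (1 - tau) ** (1 - d)                 # A_N for step size tau
    eta = 1 - tau / 2 * (2 - tau)            # eta_N for step size tau
    # the second and fourth columns converge to the third and fifth as tau decreases
    print(tau, A ** k, math.exp((d - 1) * T), eta ** k, math.exp(-T))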

Appendix E Case of a low variance Gaussian in Section 4

Here, we consider $\rho_{\sigma}$, the probability measure obtained by truncating to the segment $[-1,1]$ and renormalizing a centered normal distribution with standard deviation $\sigma$. We first show that for any $N\in\mathbb{N}$ and $\delta\in(0,1)$, we can find a small $\sigma_{N,\delta}$ such that the squared Wasserstein distance between $\rho_{\sigma_{N,\delta}}$ and its best $N$-point approximation is at least $CN^{-(2-\delta)}$.

Proposition 8.

For any $\sigma>0$, consider $\rho_{\sigma}\stackrel{\text{def}}{=}m_{\sigma}\mathrm{e}^{-\frac{\left\lvert x\right\rvert^{2}}{2\sigma^{2}}}\mathds{1}_{[-1;1]}dx$ the truncated centered Gaussian density, where $m_{\sigma}$ is taken so that $\rho_{\sigma}$ has unit mass. Then, for every $\delta\in(0,1)$, there exists a constant $C>0$ and a sequence of parameters $(\sigma_{N})_{N\in\mathbb{N}}$ such that

\forall Y\in(\mathbb{R}^{d})^{N}\setminus\mathbb{D}_{N},\quad\mathrm{W}_{2}^{2}\left(\delta_{B_{N}(Y)},\rho_{\sigma_{N}}\right)\geq CN^{-(2-\delta)}

From the proof, one can see that the dependence of $\sigma_{N}$ on $N$ is logarithmic: indeed, $\sigma_{N}$ is of order $(\log N)^{-1/2}$.

Proof.

We denote by $g:x\in\mathbb{R}\mapsto\frac{1}{\sqrt{2\pi}}\mathrm{e}^{-\frac{\left\lvert x\right\rvert^{2}}{2}}$ the density of the standard centered Gaussian distribution and by $F_{g}$ its cumulative distribution function, so that

m_{\sigma}^{-1}=\int_{-1}^{1}\mathrm{e}^{-\frac{\left\lvert x\right\rvert^{2}}{2\sigma^{2}}}dx=\sigma\sqrt{2\pi}\int_{-1/\sigma}^{1/\sigma}g(y)dy=\sqrt{2\pi}\sigma(F_{g}(1/\sigma)-F_{g}(-1/\sigma)) (26)

Note that, whenever $\sigma\to 0$, we have $\sqrt{2\pi}\,\sigma m_{\sigma}\to 1$. We denote by $F_{\sigma}:[-1,1]\to[0,1]$ the cumulative distribution function of $\rho_{\sigma}$. Given any point cloud $Y=(y_{1},\ldots,y_{N})$ such that $y_{1}\leq\ldots\leq y_{N}$, the Power cell $P_{i}(Y)$ is simply the segment

P_{i}(Y)=[F_{\sigma}^{-1}(i/N),F_{\sigma}^{-1}((i+1)/N)],\qquad i=0,\ldots,N-1.

Since these segments do not depend on $Y$, we will denote them $(P_{i})_{0\leq i\leq N-1}$. Finally, defining $b_{i}=N\int_{P_{i}}x\,\mathrm{d}\rho_{\sigma}(x)$ as the barycenter of the $i$th Power cell and $\delta_{B}=\frac{1}{N}\sum_{i}\delta_{b_{i}}$, we have

\begin{split}\mathrm{W}_{2}^{2}(\delta_{B},\rho_{\sigma})=&\sum_{i=0}^{N-1}\int_{P_{i}}(x-b_{i})^{2}\mathrm{d}\rho_{\sigma}(x)\\ \geq&\ \rho_{\sigma}(-1)\sum_{i=0}^{N-1}\int_{P_{i}}(x-b_{i})^{2}\mathrm{d}x\\ \geq&\ C\rho_{\sigma}(-1)\sum_{i=0}^{N-1}(F_{\sigma}^{-1}((i+1)/N)-F_{\sigma}^{-1}(i/N))^{3},\end{split} (27)

where we used that $\rho_{\sigma}$ attains its minimum at $\pm 1$ to get the first inequality. We now wish to provide an approximation for $F_{\sigma}^{-1}(t)$, $t\in[0,1]$. We first note, using Taylor’s formula, that we have

\begin{split}F_{\sigma}^{-1}(t)=&\ \sigma F_{g}^{-1}\left(F_{g}\left(\frac{-1}{\sigma}\right)+t\left[F_{g}\left(\frac{1}{\sigma}\right)-F_{g}\left(\frac{-1}{\sigma}\right)\right]\right)\\ =&\ \sigma F_{g}^{-1}\left(F_{g}\left(\frac{-1}{\sigma}\right)+\frac{t}{\sqrt{2\pi}\sigma m_{\sigma}}\right)\\ =&\ -1+\sigma(F_{g}^{-1})^{\prime}\left(F_{g}\left(\frac{-1}{\sigma}\right)\right)\frac{t}{\sqrt{2\pi}\sigma m_{\sigma}}+\frac{\sigma}{2}(F_{g}^{-1})^{\prime\prime}(s)\frac{t^{2}}{2\pi\sigma^{2}m_{\sigma}^{2}}\end{split}

for some $s\in[F_{g}(-\frac{1}{\sigma}),F_{g}(-\frac{1}{\sigma})+t(F_{g}(\frac{1}{\sigma})-F_{g}(-\frac{1}{\sigma}))]$. But,

(F_{g}^{-1})^{\prime}(t)=\frac{1}{g\circ F_{g}^{-1}(t)}=\sqrt{2\pi}\mathrm{e}^{\frac{\left\lvert F_{g}^{-1}(t)\right\rvert^{2}}{2}},
(F_{g}^{-1})^{\prime\prime}(t)=-\frac{g^{\prime}\circ F_{g}^{-1}(t)}{\left(g\circ F_{g}^{-1}(t)\right)^{3}}=2\pi F_{g}^{-1}(t)\mathrm{e}^{\left\lvert F_{g}^{-1}(t)\right\rvert^{2}},

and we see that

\left\lvert F_{\sigma}^{-1}(t)-\left(-1+\frac{t}{m_{\sigma}}\mathrm{e}^{\frac{1}{2\sigma^{2}}}\right)\right\rvert\leq\mathrm{e}^{\frac{1}{\sigma^{2}}}\frac{t^{2}}{2\sigma^{2}m_{\sigma}^{2}}.

Therefore, if we denote by $\varepsilon(\sigma,t)$ the second-order error in the above formula, i.e. $\varepsilon(\sigma,t)=\mathrm{e}^{\frac{1}{\sigma^{2}}}\frac{t^{2}}{2\sigma^{2}m_{\sigma}^{2}}$, the size of the first Power cell $P_{0}(Y)$ is of order:

F_{\sigma}^{-1}(1/N)-F_{\sigma}^{-1}(0)=\frac{1}{Nm_{\sigma}}\mathrm{e}^{\frac{1}{2\sigma^{2}}}+O\left(\varepsilon\left(\sigma,\frac{1}{N}\right)\right).

We will choose $\sigma_{N}$ depending on $N$ in order for the first term in the right-hand side to dominate the second one:

\varepsilon\left(\sigma_{N},\frac{1}{N}\right)=o\left(\frac{1}{Nm_{\sigma_{N}}}\mathrm{e}^{\frac{1}{2\sigma_{N}^{2}}}\right). (28)

In this way, we have

\begin{split}(F_{\sigma}^{-1}(1/N)-F_{\sigma}^{-1}(0))^{3}\rho_{\sigma}(-1)\geq&\ c\frac{1}{N^{3}m_{\sigma}^{3}}\mathrm{e}^{\frac{3}{2\sigma^{2}}}m_{\sigma}\mathrm{e}^{-\frac{1}{2\sigma^{2}}}\\ =&\ c\frac{1}{N^{3}m_{\sigma}^{2}}\mathrm{e}^{\frac{1}{\sigma^{2}}}.\end{split} (29)

We now choose $\sigma=\sigma_{N}$ such that $\mathrm{e}^{\frac{1}{2\sigma^{2}}}=N^{\alpha}$ for an exponent $\alpha$ to be chosen. We need $\alpha>0$ so that $\sigma_{N}\to 0$. This last condition and (26) imply that $m_{\sigma_{N}}$ is of order $\sqrt{\log N}$. This means that the condition (28) is satisfied if $\alpha<1$ and $N$ is large enough.

The sum in (27) is bounded from below by its first term, estimated in (29), and we get

\mathrm{W}_{2}^{2}(\delta_{B},\rho_{\sigma})\geq c\frac{1}{N^{3}m_{\sigma_{N}}^{2}}\mathrm{e}^{\frac{1}{\sigma_{N}^{2}}}\geq C\left(\frac{N^{2\alpha-3}}{\ln(N)}\right)

for some constant $C>0$, since $m_{\sigma_{N}}^{2}$ is of order $\ln N$. Finally, if we want this last expression to be larger than $N^{-(2-\delta)}$, we can take for instance $2\alpha>1+\delta$ (which is compatible with $\alpha<1$ since $\delta<1$) and $N$ large enough. ∎
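To illustrate Proposition 8 numerically, the following sketch (ours) evaluates the lower bound used in the proof, namely $\rho_{\sigma}(-1)\sum_{i}|P_{i}|^{3}/12$ from (27), with the quantile cells computed from the Gaussian quantile function and with $\sigma_{N}$ chosen so that $\mathrm{e}^{\frac{1}{2\sigma_{N}^{2}}}=N^{\alpha}$. The helper names and the use of scipy.stats.norm are our own choices; the printed lower bound decays markedly more slowly than the generic rate $N^{-2}$, in line with the proposition.

import math
from scipy.stats import norm

def lower_bound_w2_sq(N, sigma):
    # rho_sigma(-1) * sum_i |P_i|^3 / 12, the lower bound appearing in (27)
    lo, hi = norm.cdf(-1.0 / sigma), norm.cdf(1.0 / sigma)
    m_sigma = 1.0 / (math.sqrt(2.0 * math.pi) * sigma * (hi - lo))   # normalization (26)
    rho_at_boundary = m_sigma * math.exp(-1.0 / (2.0 * sigma ** 2))  # rho_sigma(-1)
    def F_inv(t):                                                    # quantile function F_sigma^{-1}
        return sigma * norm.ppf(lo + t * (hi - lo))
    lengths = [F_inv((i + 1) / N) - F_inv(i / N) for i in range(N)]
    return rho_at_boundary * sum(L ** 3 for L in lengths) / 12.0

alpha = 0.9                                      # 2*alpha > 1 + delta for delta = 0.5, and alpha < 1
for N in [50, 100, 200, 400, 800]:
    sigma_N = 1.0 / math.sqrt(2.0 * alpha * math.log(N))             # e^{1/(2 sigma_N^2)} = N^alpha
    print(N, lower_bound_w2_sq(N, sigma_N), N ** (-2.0))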

The following corollary, whose proof can be obtained by adapting the above argument to a simple multi-dimensional setting in which measures and cells “factorize” along the coordinates, confirms the facts observed in the numerical section (Section 4), and the sharpness of our result (Remark 4).

Corollary 9.

Fix $\delta\in(0,1)$. Given any $n\in\mathbb{N}$, consider an axis-aligned discrete grid of the form $Z_{N}=Y_{1}\times\ldots\times Y_{d}$ in $\mathbb{R}^{d}$, with $N=\operatorname{Card}(Z_{N})=n^{d}$, where each $Y_{j}$ is a subset of $\mathbb{R}$ with cardinality $n$. Finally, define $\sigma_{N}:=\sigma_{n,\delta}$ as in 8. Then we have

\mathrm{W}_{2}^{2}(\delta_{B_{N}(Z_{N})},\rho_{\sigma_{N}}\otimes\dots\otimes\rho_{\sigma_{N}})\geq CN^{-\frac{(2-\delta)}{d}},

where the constant $C$ is independent of $N$.