
A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions

Yulong Lu
Department of Mathematics and Statistics
University of Massachusetts Amherst
Amherst, MA 01003
lu@math.umass.edu
Jianfeng Lu
Mathematics Department
Duke University
Durham, NC 27708
jianfeng@math.duke.edu
Abstract

This paper studies the universal approximation property of deep neural networks for representing probability distributions. Given a target distribution $\pi$ and a source distribution $p_z$ both defined on $\mathbb{R}^d$, we prove under some assumptions that there exists a deep neural network $g:\mathbb{R}^d\rightarrow\mathbb{R}$ with ReLU activation such that the push-forward measure $(\nabla g)_{\#}p_z$ of $p_z$ under the map $\nabla g$ is arbitrarily close to the target measure $\pi$. The closeness is measured by three classes of integral probability metrics between probability distributions: the $1$-Wasserstein distance, maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD). We prove upper bounds for the size (width and depth) of the deep neural network in terms of the dimension $d$ and the approximation error $\varepsilon$ with respect to the three discrepancies. In particular, the size of the neural network can grow exponentially in $d$ when the $1$-Wasserstein distance is used as the discrepancy, whereas for both MMD and KSD the size of the neural network depends on $d$ at most polynomially. Our proof relies on convergence estimates of empirical measures under the aforementioned discrepancies and on semi-discrete optimal transport.

1 Introduction

In recent years, deep learning has achieved unprecedented success in numerous machine learning problems [31, 54]. The success of deep learning is largely attributed to the usage of deep neural networks (DNNs) for representing and learning the unknown structures in machine learning tasks, which are usually modeled by some unknown function mappings or unknown probability distributions. The effectiveness of using neural networks (NNs) in approximating functions has been justified rigorously in the last three decades. Specifically, a series of early works [13, 18, 25, 7] on universal approximation theorems show that a continuous function defined on a bounded domain can be approximated by a sufficiently large shallow (two-layer) neural network. In particular, the result by [7] quantifies the approximation error of shallow neural networks in terms of the decay property of the Fourier transform of the function of interest. Recently, the expressive power of DNNs for approximating functions has received increasing attention starting from the works by [37] and [60]; see also [61, 48, 50, 44, 51, 15, 43] for more recent developments. The theoretical benefits of using deep neural networks over shallow neural networks have been demonstrated in a sequence of depth separation results; see e.g. [17, 55, 57, 14].

Compared to a vast number of theoretical results on neural networks for approximating functions, the use of neural networks for expressing distributions is far less understood on the theoretical side. The idea of using neural networks for modeling distributions underpins an important class of unsupervised learning techniques called generative models, where the goal is to approximate or learn complex probability distributions from the training samples drawn from the distributions. Typical generative models include Variational Autoencoders [29], Normalizing Flows [49] and Generative Adversarial Networks (GANs) [19], just to name a few. In these generative models, the probability distribution of interest can be very complex or computationally intractable, and is usually modelled by transforming a simple distribution using some map parameterized by a (deep) neural network. In particular, a GAN consists of a game between a generator and a discriminator which are represented by deep neural networks: the generator attempts to generate fake samples whose distribution is indistinguishable from the real distribution, and it generates samples by mapping samples from a simple input distribution (e.g. Gaussian) via a deep neural network; the discriminator attempts to learn how to tell the fake apart from the real. Despite the great empirical success of GANs in various applications, their theoretical analysis is far from complete. Existing theoretical works on GANs are mainly focused on the trade-off between the generator and the discriminator (see e.g. [42, 2, 3, 38, 5]). The key message from these works is that the discriminator family needs to be chosen appropriately according to the generator family in order to obtain a good generalization error.

Our contributions. In this work, we focus on an even more fundamental question on GANs and other generative models which is not yet fully addressed. Namely, how well can DNNs express probability distributions? We shall answer this question by making the following contributions.

  • Given a fairly general source distribution and a target distribution defined on $\mathbb{R}^d$ which satisfies certain integrability assumptions, we show that there is a ReLU DNN with $d$ inputs and one output such that the push-forward of the source distribution via the gradient of the output function defined by the DNN is arbitrarily close to the target. We measure the closeness between probability distributions by three integral probability metrics (IPMs): 1-Wasserstein metric, maximum mean discrepancy and kernelized Stein discrepancy.

  • Given a desired approximation error $\varepsilon$, we prove complexity upper bounds for the depth and width of the DNN needed to attain the given approximation error with respect to the three IPMs mentioned above; our complexity upper bounds are given with explicit dependence on the dimension $d$ of the target distribution and the approximation error $\varepsilon$.

It is also worth mentioning that the DNN constructed in the paper is explicit: the output function of the DNN is the maximum of finitely many (multivariate) affine functions, with the affine parameters determined explicitly in terms of the source measure and target measure.

Related work. Let us discuss some previous related work and compare the results there with ours. Kong and Chaudhuri [30] analyzed the expressive power of normalizing flow models under the $L^1$-norm of distributions and showed that the flow models have limited approximation capability in high dimensions. We show that feedforward DNNs can approximate a general class of distributions in high dimensions with respect to three IPMs. Lin and Jegelka [39] studied the universal approximation of certain ResNets for multivariate functions; the approximation result there however was not quantitative and did not consider the universal approximation of ResNets for distributions. The work by Bailey and Telgarsky [6] considered the approximation power of DNNs for expressing uniform and Gaussian distributions in the Wasserstein distance, whereas we prove quantitative approximation results for fairly general distributions under three IPMs. The work by Lee et al. [32] is closest to ours, where the authors considered a class of probability distributions that are given as push-forwards of a base distribution by a class of Barron functions, and showed that those distributions can be approximated in Wasserstein metrics by push-forwards of NNs, essentially relying on the ability of NNs for approximating Barron functions. It is however not clear what probability distributions are given by push-forward of a base one by Barron functions. In this work, we provide more explicit and direct criteria on the target distributions.

Notations. Let us introduce several definitions and notations to be used throughout the paper. We start with the definition of a fully connected and feed-forward neural network.

Definition 1.1.

A (fully connected and feed-forward) neural network of $L$ hidden layers takes an input vector $x\in\mathbb{R}^{N_0}$, outputs a vector $y\in\mathbb{R}^{N_{L+1}}$ and has $L$ hidden layers of sizes $N_1,N_2,\cdots,N_L$. The neural network is parametrized by the weight matrices $W^\ell\in\mathbb{R}^{N_\ell\times N_{\ell-1}}$ and bias vectors $b^\ell\in\mathbb{R}^{N_\ell}$ with $\ell=1,2,\cdots,L+1$. The output $y$ is defined from the input $x$ iteratively according to the following.

x^{0}=x, \qquad (1.1)
x^{\ell}=\sigma(W^{\ell}x^{\ell-1}+b^{\ell}),\quad 1\leq\ell\leq L,
y=W^{L+1}x^{L}+b^{L+1}.

Here $\sigma$ is a (nonlinear) activation function which acts on a vector $x$ component-wise, i.e. $[\sigma(x)]_i=\sigma(x_i)$. When $N_1=N_2=\cdots=N_L=N$, we say the network has width $N$ and depth $L$. The neural network is said to be a deep neural network (DNN) if $L\geq 2$. The function defined by the neural network is denoted by $\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$.

Popular choices of activation functions $\sigma$ include the rectified linear unit (ReLU) function $\mathrm{ReLU}(x)=\max(x,0)$ and the sigmoid function $\mathrm{Sigmoid}(x)=(1+e^{-x})^{-1}$.
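
To make Definition 1.1 concrete, the following is a minimal NumPy sketch of the forward pass of such a network with ReLU activation; the layer sizes and random parameters are placeholder choices, not the construction used later in the paper.

import numpy as np

def relu(x):
    # ReLU activation applied component-wise
    return np.maximum(x, 0.0)

def dnn_forward(x, weights, biases):
    """Evaluate DNN({W^l, b^l}_{l=1}^{L+1}) at input x (Definition 1.1).

    weights[l] has shape (N_{l+1}, N_l) and biases[l] has shape (N_{l+1},);
    the L hidden layers use ReLU and the last layer is purely affine.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # L hidden layers
        h = relu(W @ h + b)
    W, b = weights[-1], biases[-1]                 # output layer, no activation
    return W @ h + b

# Toy example: d = 3 inputs, two hidden layers of width 4, scalar output.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
print(dnn_forward(rng.standard_normal(3), weights, biases))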

Given a matrix $A$, let us denote its $n$-fold direct sum by $\oplus^n A=\overbrace{A\oplus A\oplus\cdots\oplus A}^{n\text{ times}}=\operatorname{diag}(A,\cdots,A)$. We denote by $\mathcal{P}_2(\mathbb{R}^d)$ the space of probability measures with finite second moment. Given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, a transport map $T$ between $\mu$ and $\nu$ is a measurable map $T:\mathbb{R}^d\rightarrow\mathbb{R}^d$ such that $\nu=T_{\#}\mu$, where $T_{\#}\mu$ denotes the push-forward of $\mu$ under the map $T$, i.e., for any measurable $A\subset\mathbb{R}^d$, $\nu(A)=\mu(T^{-1}(A))$. We denote by $\Gamma(\mu,\nu)$ the set of transport plans between $\mu$ and $\nu$, which consists of all coupling measures $\gamma$ of $\mu$ and $\nu$, i.e., $\gamma(A\times\mathbb{R}^d)=\mu(A)$ and $\gamma(\mathbb{R}^d\times B)=\nu(B)$ for any measurable $A,B\subset\mathbb{R}^d$. We may use $C,C_1,C_2$ to denote generic constants which do not depend on any quantities of interest (e.g. the dimension $d$).

The rest of the paper is organized as follows. We describe the problem and state the main result in Section 2. Section 3 and Section 4 are devoted to the two ingredients for proving the main result: convergence of empirical measures in IPMs, and the construction of neural-network-based maps between the source measure and empirical measures via semi-discrete optimal transport, respectively. Proofs of lemmas and intermediate results are provided in the appendices.

2 Problem Description and Main Result

Let $\pi(x)$ be the target probability distribution defined on $\mathbb{R}^d$ which one would like to learn or generate samples from. In the framework of GANs, one is interested in representing the distribution $\pi$ implicitly by a generative neural network. Specifically, let $\mathcal{G}_{\mathrm{NN}}\subset\{g:\mathbb{R}^d\rightarrow\mathbb{R}^d\}$ be a subset of generators (transformations), which are defined by neural networks. The concrete form of $\mathcal{G}_{\mathrm{NN}}$ is to be specified later. Let $p_z$ be a source distribution (e.g. standard normal). The push-forward of $p_z$ under the transformation $g\in\mathcal{G}_{\mathrm{NN}}$ is denoted by $p_x=g_{\#}p_z$. In a GAN problem, one aims to find $g\in\mathcal{G}_{\mathrm{NN}}$ such that $g_{\#}p_z\approx\pi$. In mathematical language, GANs can be formulated as the following minimization problem:

\inf_{g\in\mathcal{G}_{\mathrm{NN}}} D(g_{\#}p_z,\pi) \qquad (2.1)

where $D(p,\pi)$ is some discrepancy measure between probability measures $p$ and $\pi$, which typically takes the form of an integral probability metric (IPM) or adversarial loss defined by

D(p,\pi)=d_{\mathcal{F}_D}(p,\pi):=\sup_{f\in\mathcal{F}_D}\Big|\mathbf{E}_{X\sim p}f(X)-\mathbf{E}_{X\sim\pi}f(X)\Big|, \qquad (2.2)

where $\mathcal{F}_D$ is a certain class of test (or witness) functions. As a consequence, GANs can be formulated as the minimax problem

\inf_{g\in\mathcal{G}_{\mathrm{NN}}}\sup_{f\in\mathcal{F}_D}\Big|\mathbf{E}_{X\sim g_{\#}p_z}f(X)-\mathbf{E}_{X\sim\pi}f(X)\Big|.

The present paper aims to answer the following fundamental questions on GANs:

(1) Is there a neural-network-based generator $g\in\mathcal{G}_{\mathrm{NN}}$ such that $D(g_{\#}p_z,\pi)\approx 0$?

(2) How to quantify the complexity (e.g. depth and width) of the neural network?

As we shall see below, the answers to the questions above depend on the IPM $D$ used to measure the discrepancy between distributions. In this paper, we are interested in three IPMs which are commonly used in GANs, namely the $1$-Wasserstein distance [59, 1], maximum mean discrepancy [21, 16, 36] and kernelized Stein discrepancy [40, 12, 27].

Wasserstein Distance: When the witness class $\mathcal{F}_D$ is chosen as the class of 1-Lipschitz functions, i.e. $\mathcal{F}_D:=\{f:\mathbb{R}^d\rightarrow\mathbb{R}:\operatorname{Lip}(f)\leq 1\}$, the resulting IPM $d_{\mathcal{F}_D}$ becomes the 1-Wasserstein distance (also known as the Kantorovich-Rubinstein distance):

\mathcal{W}_1(p,\pi)=\inf_{\gamma\in\Gamma(p,\pi)}\int |x-y|\,\gamma(dx\,dy).

The Wasserstein-GAN proposed by [1] leverages the Wasserstein distance as the objective function to improve the stability of training of the original GAN based on the Jensen-Shannon divergence. Nevertheless, it has been shown that Wasserstein-GAN still suffers from the mode collapse issue [23] and does not generalize with any polynomial number of training samples [2].

Maximum Mean Discrepancy (MMD): When $\mathcal{F}_D$ is the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$, i.e. $\mathcal{F}_D:=\{f\in\mathcal{H}_k:\|f\|_{\mathcal{H}_k}\leq 1\}$, the resulting IPM $d_{\mathcal{F}_D}$ coincides with the maximum mean discrepancy (MMD) [21]:

\operatorname{MMD}(p,\pi)=\sup_{\|f\|_{\mathcal{H}_k}\leq 1}\Big|\mathbf{E}_{X\sim p}f(X)-\mathbf{E}_{X\sim\pi}f(X)\Big|.

GANs based on minimizing MMD as the loss function were first proposed in [16, 36]. Since MMD is a weaker metric than the $1$-Wasserstein distance, MMD-GANs also suffer from the mode collapse issue, but empirical results (see e.g. [8]) suggest that they require smaller discriminative networks and hence enable faster training than Wasserstein-GANs.
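
For intuition, MMD between two samples can be estimated in closed form from kernel matrices. Below is a short, illustrative NumPy sketch of the standard plug-in estimator of MMD between two empirical measures, using the Gaussian kernel $k(x,y)=e^{-\frac{1}{2}|x-y|^2}$; the sample sizes and data are placeholders.

import numpy as np

def gaussian_kernel(X, Y):
    # k(x, y) = exp(-|x - y|^2 / 2), evaluated on all pairs of rows of X and Y
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-0.5 * sq_dists)

def mmd(X, Y):
    """Plug-in estimate of MMD between the empirical measures of X and Y."""
    Kxx = gaussian_kernel(X, X)
    Kyy = gaussian_kernel(Y, Y)
    Kxy = gaussian_kernel(X, Y)
    mmd_sq = Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
    return np.sqrt(max(mmd_sq, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))            # samples from the source
Y = rng.standard_normal((500, 2)) + 1.0      # samples from a shifted target
print(mmd(X, Y))                             # noticeably larger than for two matched samples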

Kernelized Stein Discrepancy (KSD): If the witness class $\mathcal{F}_D$ is chosen as $\mathcal{F}_D:=\{\mathcal{T}_\pi f: f\in\mathcal{H}_k \text{ and } \|f\|_{\mathcal{H}_k}\leq 1\}$, where $\mathcal{T}_\pi$ is the Stein operator defined by

\mathcal{T}_\pi f:=\nabla\log\pi\cdot f+\nabla\cdot f, \qquad (2.3)

the associated IPM $d_{\mathcal{F}_D}$ becomes the kernelized Stein discrepancy (KSD) [40, 12]:

\operatorname{KSD}(p,\pi)=\sup_{\|f\|_{\mathcal{H}_k}\leq 1}\mathbf{E}_{X\sim p}[\mathcal{T}_\pi f(X)].

The KSD has gained great popularity in machine learning and statistics since the quantity $\operatorname{KSD}(p,\pi)$ is easy to compute and does not depend on the normalization constant of $\pi$, which makes it suitable for statistical computation, such as hypothesis testing [20] and statistical sampling [41, 11]. The recent paper [27] adopts the GAN formulation (2.1) with KSD as the training loss to construct a new sampling algorithm called the Stein Neural Sampler.
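
When $p$ is an empirical measure, $\operatorname{KSD}(P_n,\pi)$ has the closed form recalled in (C.1) of Appendix C in terms of the Stein kernel $u_\pi$. The following NumPy sketch evaluates that formula for a standard Gaussian target (so $s_\pi(x)=-x$) with the Gaussian kernel; the target and the samples are illustrative choices only.

import numpy as np

def ksd_gaussian_target(X):
    """KSD(P_n, pi) from (C.1) for pi = N(0, I_d) and k(x, y) = exp(-|x - y|^2 / 2).

    Uses s_pi(x) = grad log pi(x) = -x and the closed-form derivatives of k.
    """
    n, d = X.shape
    S = -X                                    # scores s_pi(X_i) for the Gaussian target
    diff = X[:, None, :] - X[None, :, :]      # diff[i, j] = X_i - X_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-0.5 * sq)                     # kernel matrix k(X_i, X_j)
    # u_pi(x, y) = s(x)^T k s(y) + s(x)^T grad_y k + s(y)^T grad_x k + Tr(grad_x grad_y k)
    term1 = (S @ S.T) * K
    term2 = np.einsum('id,ijd->ij', S, diff) * K     # grad_y k(x, y) = (x - y) k(x, y)
    term3 = np.einsum('jd,ijd->ij', S, -diff) * K    # grad_x k(x, y) = (y - x) k(x, y)
    term4 = (d - sq) * K                             # Tr(grad_x grad_y k) = (d - |x - y|^2) k
    U = term1 + term2 + term3 + term4
    return np.sqrt(max(U.mean(), 0.0))        # KSD(P_n, pi) = sqrt(n^{-2} sum_{i,j} u_pi(X_i, X_j))

rng = np.random.default_rng(0)
print(ksd_gaussian_target(rng.standard_normal((300, 5))))        # small: samples match pi
print(ksd_gaussian_target(rng.standard_normal((300, 5)) + 1.0))  # larger: shifted samples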

2.1 Main result

Throughout the paper, we consider the following assumptions on the reproducing kernel $k$:

Assumption K1.

The kernel $k$ is integrally strictly positive definite: for all finite non-zero signed Borel measures $\mu$ defined on $\mathbb{R}^d$,

\iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y)>0.
Assumption K2.

There exists a constant $K_0>0$ such that

\sup_{x\in\mathbb{R}^d}|k(x,x)|\leq K_0. \qquad (2.4)
Assumption K3.

The kernel function $k:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}$ is twice differentiable and there exists a constant $K_1>0$ such that

\max_{m+n\leq 1}\sup_{x,y}\|\nabla_x^m\nabla_y^n k(x,y)\|\leq K_1 \quad\text{ and }\quad \sup_{x,y}|\operatorname{Tr}(\nabla_x\nabla_y k(x,y))|\leq K_1(1+d). \qquad (2.5)

According to [54, Theorem 7], Assumption K1 is necessary and sufficient for the kernel being characteristic, i.e., $\operatorname{MMD}(\mu,\nu)=0$ implies $\mu=\nu$, which guarantees that MMD is a metric. In addition, thanks to [40, Proposition 3.3], KSD is a valid discrepancy measure under Assumption K1, namely $\operatorname{KSD}(p,\pi)\geq 0$ and $\operatorname{KSD}(p,\pi)=0$ if and only if $p=\pi$.

Assumption K2 will be used to obtain an error bound for $\operatorname{MMD}(P_n,\pi)$; see Proposition 3.2. Assumption K3 will be crucial for bounding $\operatorname{KSD}(P_n,\pi)$; see Proposition 3.3. Many commonly used kernel functions fulfill all three assumptions K1-K3, including for example the Gaussian kernel $k(x,y)=e^{-\frac{1}{2}|x-y|^2}$ and the inverse multiquadric (IMQ) kernel $k(x,y)=(c+|x-y|^2)^{\beta}$ with $c>0$ and $\beta<0$. Unfortunately, Matérn kernels [46] only satisfy Assumptions K1-K2, but not Assumption K3, since the second order derivatives of $k$ are singular on the diagonal, so the last estimate of (2.5) is violated.

In order to bound $\operatorname{KSD}(P_n,\pi)$, we need to assume further that the target measure $\pi$ satisfies the following regularity and integrability assumptions. We will use the shorthand notation $s_\pi(x)=\nabla\log\pi(x)$.

Assumption 1 ($L$-Lipschitz).

Assume that $s_\pi(x)$ is globally Lipschitz in $\mathbb{R}^d$, i.e. there exists a constant $\tilde{L}>0$ such that $|s_\pi(x)-s_\pi(y)|\leq\tilde{L}|x-y|$ for all $x,y\in\mathbb{R}^d$. As a result, there exists $L>0$ such that

|s_\pi(x)|\leq L(1+|x|)\quad\text{ for all }x\in\mathbb{R}^d. \qquad (2.6)
Assumption 2 (sub-Gaussian).

The probability measure $\pi$ is sub-Gaussian, i.e. there exist $m=(m_1,\cdots,m_d)\in\mathbb{R}^d$ and $\upsilon>0$ such that

\mathbf{E}_{X\sim\pi}[\exp(\alpha^{T}(X-m))]\leq\exp(|\alpha|^2\upsilon^2/2)\quad\text{ for all }\alpha\in\mathbb{R}^d.

Assume further that $\max_i|m_i|\leq m^{\ast}$ for some $m^{\ast}>0$.

Our main result is the universal approximation theorem for expressing probability distributions.

Theorem 2.1 (Main theorem).

Let $\pi$ and $p_z$ be the target and the source distributions respectively, both defined on $\mathbb{R}^d$. Assume that $p_z$ is absolutely continuous with respect to the Lebesgue measure. Then under certain assumptions on $\pi$ and the kernel $k$ to be specified below, it holds that for any given approximation error $\varepsilon$, there exists a positive integer $n$ and a fully connected and feed-forward deep neural network $u=\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$ of depth $L=\lceil\log_2 n\rceil$ and width $N=2^L=2^{\lceil\log_2 n\rceil}$, with $d$ inputs and a single output and with ReLU activation, such that $d_{\mathcal{F}_D}((\nabla u)_{\#}p_z,\pi)\leq\varepsilon$. The complexity parameter $n$ depends on the choice of the metric $d_{\mathcal{F}_D}$. Specifically,

1. Consider $d_{\mathcal{F}_D}=\mathcal{W}_1$. If $\pi$ satisfies $M_3=\mathbf{E}_{X\sim\pi}|X|^3<\infty$, it holds that

n\leq\begin{cases}\frac{C}{\varepsilon^2},&d=1,\\ \frac{C\log^2(\varepsilon)}{\varepsilon^2},&d=2,\\ \frac{C^d}{\varepsilon^d},&d\geq 3,\end{cases}

where the constant $C$ depends only on $M_3$.

2. Consider $d_{\mathcal{F}_D}=\mathrm{MMD}$ with kernel $k$. If $k$ satisfies Assumption K2, then

n\leq\frac{C}{\varepsilon^2}

with a constant $C$ depending only on the constant $K_0$ in (2.4).

3. Consider $d_{\mathcal{F}_D}=\mathrm{KSD}$ with kernel $k$. If $k$ satisfies Assumption K3 with constant $K_1$ and if $\pi$ satisfies Assumption 1 and Assumption 2 with parameters $L,m,\upsilon$, then

n\leq\frac{Cd}{\varepsilon^2},

where the constant $C$ depends only on $L,K_1,m^{\ast},\upsilon$, but not on $d$.

Theorem 2.1 states that a given probability measure $\pi$ (with certain integrability assumptions) can be approximated arbitrarily well by push-forwarding a source distribution with the gradient of a potential which can be parameterized by a finite DNN. The complexity bound in the Wasserstein case suffers from the curse of dimensionality, whereas this issue is eliminated in the cases of MMD and KSD. We remark that our complexity bound for the neural network is stated in terms of the width and depth, which leads to an estimate of the number of weights needed to achieve a given approximation error.

Proof strategy. The proof of Theorem 2.1 is given in Appendix F and relies on two ingredients: (1) one approximates the target $\pi$ by the empirical measure $P_n$; Propositions 3.1-3.3 give quantitative estimates w.r.t. the three IPMs defined above; (2) based on the theory of (semi-discrete) optimal transport, one can build an optimal transport map of the form $T=\nabla\varphi$ which pushes forward the source distribution $p_z$ to the empirical distribution $P_n$. Moreover, the potential $\varphi$ is explicit: it is the maximum of finitely many affine functions; it is this explicit structure that enables one to represent the function $\varphi$ with a finite deep neural network; see Theorem 4.1 for the precise statement.

It is interesting to remark that our strategy of proving Theorem 2.1 shares the same spirit as the one used to prove universal approximation theorems of DNNs for functions [60, 37]. Indeed, both the universal approximation theorems in those works and ours are proved by approximating the target function or distribution with a suitable dense subset (or sieves) on the space of functions or distributions which can be parametrized by deep neural networks. Specifically, in [60, 37] where the goal is to approximate continuous functions on a compact set, the dense sieves are polynomials which can be further approximated by the output functions of DNNs, whereas in our case we use empirical measures as the sieves for approximating distributions, and we show that empirical measures are exactly expressible by transporting a source distribution with neural-network-based transport maps.

We also remark that the push-forward map between probability measures constructed in Theorem 2.1 is the gradient of a potential function given by a neural network, i.e., the neural network is used to parametrize the potential function instead of the map itself, which is perhaps more commonly used in practice. The rationale for using such a gradient formulation lies in the fact that transport maps between two probability measures are discontinuous in general. This occurs particularly when the target measure has disjoint modes (or supports) and the input has a unique mode, which is ubiquitous in GAN applications where images are concentrated on disjoint modes while the input is chosen as Gaussian. In such a case it is impossible to generate the target measure using usual NNs, as the resulting functions are all continuous (in fact Lipschitz for ReLU activation and smooth for Sigmoid activation). Thus, from a theoretical viewpoint, it is advantageous to use NNs to parametrize the Brenier potential, since it is more regular (at least Lipschitz by optimal transport theory), and to use its gradient as the transport map. From a practical perspective, the idea of taking gradients of NNs has already been used and has received increasing attention in practical learning problems; see [34, 24, 45, 56] for applications of such a parameterization in learning optimal transport maps and improving the training of Wasserstein-GANs; see also [62] for learning interatomic potentials and forces for molecular dynamics simulations.

3 Convergence of Empirical Measures in Various IPMs

In this section, we consider the approximation of a given target measure $\pi$ by empirical measures. More specifically, let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence of random samples from the distribution $\pi$ and let $P_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}$ be the empirical measure associated to the samples $\{X_i\}_{i=1}^n$. Our goal is to derive quantitative error estimates of $d_{\mathcal{F}_D}(P_n,\pi)$ with respect to the three IPMs $d_{\mathcal{F}_D}$ described in the last section.

We first state an upper bound on $\mathcal{W}_1(P_n,\pi)$ in the average sense in the next proposition.

Proposition 3.1 (Convergence in $1$-Wasserstein distance).

Consider the IPM with $\mathcal{F}_D=\{f:\mathbb{R}^d\rightarrow\mathbb{R}:\operatorname{Lip}(f)\leq 1\}$. Assume that $\pi$ satisfies $M_3=\mathbb{E}_{X\sim\pi}|X|^3<\infty$. Then there exists a constant $C$ depending on $M_3$ such that

\mathbf{E}\,\mathcal{W}_1(P_n,\pi)\leq C\cdot\begin{cases}n^{-1/2},&d=1,\\ n^{-1/2}\log n,&d=2,\\ n^{-1/d},&d\geq 3.\end{cases}

The convergence rates of $\mathcal{W}_1(P_n,\pi)$ stated in Proposition 3.1 are well-known in the statistics literature. The statement in Proposition 3.1 is a combination of results from [9] and [33]; see Appendix A for a short proof. We remark that the prefactor constant $C$ in the estimate above can be made explicit. In fact, one can easily obtain from the moment bound in Proposition C.1 of Appendix C that if $\pi$ is sub-Gaussian with parameters $m$ and $\upsilon$, then the constant $C$ can be chosen as $C=C'\sqrt{d}$, with some constant $C'$ depending only on $\upsilon$ and $\|m\|_\infty$. Moreover, one can also obtain a high probability bound for $\mathcal{W}_1(P_n,\pi)$ if $\pi$ is sub-exponential (see e.g., [33, Corollary 5.2]). Here we content ourselves with the expectation result, as it comes with weaker assumptions and also suffices for our purpose of showing the existence of an empirical measure with the desired approximation rate.
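
As a quick illustration of the $d=1$ rate in Proposition 3.1, the sketch below approximates $\mathbf{E}\,\mathcal{W}_1(P_n,\pi)$ for a standard Gaussian target using the one-dimensional identity $\mathcal{W}_1(P_n,\pi)=\int|F_n(x)-F(x)|\,dx$, where $F_n$ and $F$ are the empirical and target CDFs; the grid, sample sizes and number of repetitions are arbitrary choices.

import math
import numpy as np

def w1_empirical_vs_gaussian(sample, grid):
    # W_1(P_n, pi) = \int |F_n(x) - F(x)| dx in d = 1, here with pi = N(0, 1)
    F_n = np.searchsorted(np.sort(sample), grid, side='right') / len(sample)
    F = np.array([0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in grid])
    dx = grid[1] - grid[0]
    return np.sum(np.abs(F_n - F)) * dx     # Riemann-sum approximation of the integral

rng = np.random.default_rng(0)
grid = np.linspace(-8.0, 8.0, 2001)
for n in [100, 400, 1600, 6400]:
    est = np.mean([w1_empirical_vs_gaussian(rng.standard_normal(n), grid)
                   for _ in range(50)])
    print(n, est, est * math.sqrt(n))       # last column stays roughly constant: the n^{-1/2} rate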

Moving on to approximation in MMD, the following proposition gives a high-probability non-asymptotic error bound on $\operatorname{MMD}(P_n,\pi)$.

Proposition 3.2 (Convergence in MMD).

Consider the IPM with $\mathcal{F}_D=\{f\in\mathcal{H}_k:\|f\|_{\mathcal{H}_k}\leq 1\}$. Assume that the kernel $k$ satisfies Assumption K2 with constant $K_0$. Then for every $\tau>0$, with probability at least $1-2e^{-\tau}$,

\operatorname{MMD}(P_n,\pi)\leq 2\sqrt{\frac{\sqrt{K_0}}{n}}+3\sqrt{\frac{2\sqrt{K_0}\tau}{n}}.

Proposition 3.2 can be viewed as a special case of [53, Theorem 3.3], where the kernel class is a singleton. Since its proof is short, we provide it in Appendix B for completeness.

In the next proposition, we consider the convergence of the empirical measure $P_n$ to $\pi$ in KSD. To the best of our knowledge, this is the first estimate on empirical measures under the KSD in the literature. This result can be useful for obtaining quantitative error bounds for the new GAN/sampler called the Stein Neural Sampler [27]. The proof relies on a Bernstein type inequality for the distribution of von Mises' statistics; the details are deferred to Appendix C.

Proposition 3.3 (Convergence in KSD).

Consider the IPM with $\mathcal{F}_D:=\{\mathcal{T}_\pi f: f\in\mathcal{H}_k \text{ and }\|f\|_{\mathcal{H}_k}\leq 1\}$, where $\mathcal{T}_\pi$ is the Stein operator defined in (2.3). Suppose that the kernel $k$ satisfies Assumption K3 with constant $K_1$. Suppose also that $\pi$ satisfies Assumption 1 and Assumption 2. Then for any $\delta>0$ there exists a constant $C=C(L,K_1,m^{\ast},\delta)$ such that with probability at least $1-\delta$,

\operatorname{KSD}(P_n,\pi)\leq C\sqrt{\frac{d}{n}}. \qquad (3.1)
Remark 3.1.

Proposition 3.3 provides a non-asymptotic high-probability error bound for the convergence of the empirical measure $P_n$ to $\pi$ in KSD. Our result implies in particular that $\operatorname{KSD}(P_n,\pi)\rightarrow 0$ with the asymptotic rate $\mathcal{O}(\sqrt{d/n})$. We also remark that the rate $\mathcal{O}(n^{-1/2})$ is optimal and is consistent with the asymptotic CLT result for the corresponding U-statistics of $\operatorname{KSD}^2(P_n,\pi)$ (see [40, Theorem 4.1 (2)]).

4 Constructing Transport Maps via Semi-discrete Optimal Transport

In this section, we aim to build a neural-network-based map which pushes forward a given source distribution to discrete probability measures, including in particular the empirical measures. The main result of this section is the following theorem.

Theorem 4.1.

Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ be absolutely continuous with respect to the Lebesgue measure with Radon-Nikodym density $\rho(x)$. Let $\nu=\sum_{i=1}^n\nu_i\delta_{y_i}$ for some $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$, $\nu_j\geq 0$ and $\sum_{j=1}^n\nu_j=1$. Then there exists a transport map of the form $T=\nabla u$ such that $T_{\#}\mu=\nu$, where $u$ is a fully connected deep neural network of depth $L=\lceil\log_2 n\rceil$ and width $N=2^L=2^{\lceil\log_2 n\rceil}$, with ReLU activation function and parameters $\{W^\ell,b^\ell\}_{\ell=1}^{L+1}$ such that $u=\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$.

As shown below, the transport map in Theorem 4.1 is chosen as the optimal transport map from the continuous distribution $\mu$ to the discrete distribution $\nu$, which turns out to be the gradient of a piecewise linear function, which in turn can be expressed by neural networks. We remark that the weights and biases of the constructed neural network can also be characterized explicitly in terms of $\mu$ and $\nu$ (see the proof of Proposition 4.1). Since semi-discrete optimal transport plays an essential role in the proof of Theorem 4.1, we first recall the set-up and some key results on optimal transport in both the general and the semi-discrete settings.

Optimal transport with quadratic cost. Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^d$ with finite second moments. Let $c(x,y)=\frac{1}{2}|x-y|^2$ be the quadratic cost. Then Monge's [47] optimal transportation problem is to transport the probability mass between $\mu$ and $\nu$ while minimizing the quadratic cost, i.e.

\inf_{T:\mathbb{R}^d\rightarrow\mathbb{R}^d}\int\frac{1}{2}|x-T(x)|^2\mu(dx)\quad\text{ s.t. } T_{\#}\mu=\nu. \qquad (4.1)

A map $T$ attaining the infimum above is called an optimal transport map. In general an optimal transport map may not exist, since Monge's formulation prevents splitting the mass, so that the set of transport maps may be empty. On the other hand, Kantorovich [28] relaxed the problem by minimizing the transportation cost over transport plans instead of transport maps:

\inf_{\gamma\in\Gamma(\mu,\nu)}\mathcal{K}(\gamma):=\inf_{\gamma\in\Gamma(\mu,\nu)}\int\frac{1}{2}|x-y|^2\gamma(dx\,dy). \qquad (4.2)

A coupling $\gamma$ achieving the infimum above is called an optimal coupling. Noting that problem (4.2) above is a linear program, Kantorovich proposed a dual formulation of (4.2):

\sup_{(\varphi,\psi)\in\Phi_c}\mathcal{J}(\varphi,\psi):=\sup_{(\varphi,\psi)\in\Phi_c}\int\varphi\,d\mu+\int\psi\,d\nu,

where $\Phi_c$ is the set of measurable functions $(\varphi,\psi)\in L^1(\mu)\times L^1(\nu)$ satisfying $\varphi(x)+\psi(y)\leq\frac{1}{2}|x-y|^2$. We also define the $c$-transform $\varphi^c:\mathbb{R}^d\rightarrow\mathbb{R}$ of a function $\varphi:\mathbb{R}^d\rightarrow\mathbb{R}$ by

\varphi^c(y)=\inf_{x\in\mathbb{R}^d} c(x,y)-\varphi(x)=\inf_{x\in\mathbb{R}^d}\frac{1}{2}|x-y|^2-\varphi(x).

Similarly, one can define $\psi^c$ associated to $\psi$. Kantorovich's duality theorem (see e.g. [59, Theorem 5.10]) states that

\inf_{\gamma\in\Gamma(\mu,\nu)}\mathcal{K}(\gamma)=\sup_{(\varphi,\psi)\in\Phi_c}\mathcal{J}(\varphi,\psi)=\sup_{\varphi\in L^1(d\mu)}\mathcal{J}(\varphi,\varphi^c)=\sup_{\psi\in L^1(d\nu)}\mathcal{J}(\psi^c,\psi). \qquad (4.3)

Moreover, if the source measure $\mu$ is absolutely continuous with respect to the Lebesgue measure, then the optimal transport map in Monge's problem is given by a gradient field, which is usually referred to as Brenier's map, and can be characterized explicitly in terms of the solution of the dual Kantorovich problem. A precise statement is included in Theorem E.1 in Appendix E.

Semi-discrete optimal transport. Let us now consider the optimal transport problem in the semi-discrete setting: the source measure $\mu$ is continuous and the target measure $\nu$ is discrete. Specifically, assume that $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ is absolutely continuous with respect to the Lebesgue measure, i.e. $\mu(dx)=\rho(x)dx$ for some probability density $\rho$, and $\nu$ is discrete, i.e. $\nu=\sum_{j=1}^n\nu_j\delta_{y_j}$ for some $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$, $\nu_j\geq 0$ and $\sum_{j=1}^n\nu_j=1$. In the semi-discrete setting, Monge's problem becomes

\inf_T\int\frac{1}{2}|x-T(x)|^2\mu(dx)\ \text{ s.t. }\ \int_{T^{-1}(y_j)}d\mu=\nu_j,\ j=1,\cdots,n. \qquad (4.4)

In this case the action of the transport map is clear: it assigns each point $x\in\mathbb{R}^d$ to one of the $y_j$. Moreover, by taking advantage of the discreteness of the measure $\nu$, one sees that the dual Kantorovich problem in the semi-discrete case becomes maximizing the following functional

\mathcal{F}(\psi)=\mathcal{F}(\psi_1,\cdots,\psi_n)=\int\inf_j\Big(\frac{1}{2}|x-y_j|^2-\psi_j\Big)\rho(x)dx+\sum_{j=1}^n\psi_j\nu_j. \qquad (4.5)

Similar to the continuum setting, the optimal transport map of Monge's problem (4.4) can be characterized by the maximizer of $\mathcal{F}$. To see this, let us introduce the important concept of a power diagram (or Laguerre diagram) [4, 52]. Given a finite set of points $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$ and scalars $\psi=\{\psi_j\}_{j=1}^n$, the power diagrams associated to the scalars $\psi$ and the points $\{y_j\}_{j=1}^n$ are the sets

P_j:=\Big\{x\in\mathbb{R}^d\ \Big|\ \frac{1}{2}|x-y_j|^2-\psi_j\leq\frac{1}{2}|x-y_k|^2-\psi_k,\ \forall k\neq j\Big\}. \qquad (4.6)

By grouping the points according to the power diagrams $P_j$, we have from (4.5) that

\mathcal{F}(\psi)=\sum_{j=1}^n\Big[\int_{P_j}\Big(\frac{1}{2}|x-y_j|^2-\psi_j\Big)\rho(x)dx+\psi_j\nu_j\Big]. \qquad (4.7)

The following theorem characterizes the optimal transport map of Monge's problem (4.4) in terms of the power diagrams $P_j$ associated to the points $\{y_j\}_{j=1}^n$ and the maximizer $\psi$ of $\mathcal{F}$.

Theorem 4.2.

Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ be absolutely continuous with respect to the Lebesgue measure with Radon-Nikodym density $\rho(x)$. Let $\nu=\sum_{i=1}^n\nu_i\delta_{y_i}$. Let $\psi=(\psi_1,\cdots,\psi_n)$ be a maximizer of $\mathcal{F}$ defined in (4.7). Denote by $\{P_j\}_{j=1}^n$ the power diagrams associated to $\{y_j\}_{j=1}^n$ and $\psi$. Then the optimal transport map $T$ solving the semi-discrete Monge's problem (4.4) is given by $T(x)=\nabla\bar{\varphi}(x)$, where $\bar{\varphi}(x)=\max_j\{x\cdot y_j+m_j\}$ for some $m_j\in\mathbb{R}$. Specifically, $T(x)=y_j$ if $x\in P_j(\psi)$.

Theorem 4.2 shows that the optimal transport map in the semi-discrete case is achieved by the gradient of a particular piecewise affine function, namely the maximum of finitely many affine functions. A similar result was proved by [22] in the case where $\mu$ is defined on a compact convex domain. We provide a proof of Theorem 4.2, which deals with measures on the whole space $\mathbb{R}^d$, in Appendix E.2.
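
To illustrate Theorem 4.2, the sketch below evaluates the semi-discrete transport map $T(x)=\nabla\bar{\varphi}(x)$ for given points $\{y_j\}$ and offsets $\{m_j\}$: each $x$ is sent to the atom $y_j$ whose affine function $x\cdot y_j+m_j$ attains the maximum, i.e. to the power cell containing $x$. The points, offsets and source samples here are placeholders; in the theorem the $m_j$ come from the maximizer $\psi$ of $\mathcal{F}$.

import numpy as np

def semidiscrete_transport(X, Y, m):
    """T(x) = gradient of max_j (x . y_j + m_j), i.e. T(x) = y_j for x in the j-th power cell."""
    scores = X @ Y.T + m[None, :]       # scores[i, j] = x_i . y_j + m_j
    j_star = np.argmax(scores, axis=1)  # index of the affine piece active at x_i
    return Y[j_star]                    # push x_i to the corresponding atom y_j

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 2))         # atoms y_1, ..., y_n of the discrete target
m = rng.standard_normal(8)              # offsets m_j (determined by psi in Theorem 4.2)
X = rng.standard_normal((10000, 2))     # samples from the source measure mu

TX = semidiscrete_transport(X, Y, m)
# Empirical weights of the push-forward measure T_# mu on the atoms y_j:
print(np.bincount(np.argmax(X @ Y.T + m, axis=1), minlength=8) / len(X))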

The next proposition shows that the piecewise linear function $\max_j\{x\cdot y_j+m_j\}$ defined in Theorem 4.2 can be expressed exactly by a deep neural network.

Proposition 4.1.

Let $\bar{\varphi}(x)=\max_j\{x\cdot y_j+m_j\}$ with $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$ and $\{m_j\}_{j=1}^n\subset\mathbb{R}$. Then there exists a fully connected deep neural network of depth $L=\lceil\log_2 n\rceil$ and width $N=2^L=2^{\lceil\log_2 n\rceil}$, with ReLU activation function and parameters $\{W^\ell,b^\ell\}_{\ell=1}^{L+1}$, such that $\bar{\varphi}=\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$.

The proof of Proposition 4.1 can be found in Appendix E.3. Theorem 4.1 is a direct consequence of Theorem 4.2 and Proposition 4.1.
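
A standard way to realize such a maximum with a ReLU network, presumably close to the construction in Appendix E.3, uses the identity $\max(a,b)=\mathrm{ReLU}(a-b)+b$ and reduces the maximum of $n$ affine functions by pairwise maxima over $\lceil\log_2 n\rceil$ levels. The sketch below checks this reduction numerically, applied to the values $x\cdot y_j+m_j$ rather than by assembling the weight matrices explicitly; the data are placeholders.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def max_of_affine_via_relu(x, Y, m):
    """Evaluate max_j (x . y_j + m_j) using only affine maps and ReLU.

    The n values are combined pairwise over ceil(log2 n) levels via
    max(a, b) = ReLU(a - b) + b, mirroring the depth bound of Proposition 4.1.
    """
    v = Y @ x + m                      # first affine layer: all n affine pieces at once
    while len(v) > 1:
        if len(v) % 2 == 1:            # pad with a duplicate so the pairs match up
            v = np.append(v, v[-1])
        a, b = v[0::2], v[1::2]
        v = relu(a - b) + b            # one ReLU layer computes all pairwise maxima
    return v[0]

rng = np.random.default_rng(0)
Y = rng.standard_normal((13, 4))       # n = 13 affine pieces in dimension d = 4
m = rng.standard_normal(13)
x = rng.standard_normal(4)
print(max_of_affine_via_relu(x, Y, m), np.max(Y @ x + m))   # the two values agree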

5 Conclusion

In this paper, we establish that certain general classes of target distributions can be expressed arbitrarily well, with respect to three types of IPMs, by transporting a source distribution with maps that can be parametrized by DNNs. We provide upper bounds for the depths and widths of the DNNs needed to achieve a given approximation error; the upper bounds are established with explicit dependence on the dimension of the underlying distributions and the approximation error.

Broader Impact

This work focuses on theoretical properties of neural networks for expressing probability distributions. Our work can help understand the theoretical benefit and limitations of neural networks in approximating probability distributions under various integral probability metrics. Our work and the proof technique improve our understanding of the theoretical underpinnings of Generative Adversarial Networks and other generative models used in machine learning, and may lead to better use of these techniques with possible benefits to the society.

Acknowledgments and Disclosure of Funding

The work of JL is supported in part by the National Science Foundation via grants DMS-2012286 and CCF-1934964 (Duke TRIPODS).

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223, 2017.
  • [2] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017.
  • [3] S. Arora and Y. Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
  • [4] F. Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.
  • [5] Y. Bai, T. Ma, and A. Risteski. Approximability of discriminators implies diversity in gans. arXiv preprint arXiv:1806.10586, 2018.
  • [6] B. Bailey and M. J. Telgarsky. Size-noise tradeoffs in generative networks. In Advances in Neural Information Processing Systems, pages 6489–6499, 2018.
  • [7] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
  • [8] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
  • [9] S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Mem. Amer. Math. Soc., 261(1259):0, 2019.
  • [10] I. Borisov. Approximation of distributions of von mises statistics with multidimensional kernels. Siberian Mathematical Journal, 32(4):554–566, 1991.
  • [11] W. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In International Conference on Machine Learning, pages 843–852, 2018.
  • [12] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 2606–2615. JMLR. org, 2016.
  • [13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • [14] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
  • [15] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) relu networks. arXiv preprint arXiv:1905.02199, 2019.
  • [16] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267. AUAI Press, 2015.
  • [17] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
  • [18] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [20] J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1292–1301. JMLR. org, 2017.
  • [21] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • [22] X. Gu, F. Luo, J. Sun, and S.-T. Yau. Variational principles for minkowski type problems, discrete optimal transport, and discrete monge–ampère equations. Asian Journal of Mathematics, 20(2):383–398, 2016.
  • [23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [24] Y. Guo, D. An, X. Qi, Z. Luo, S.-T. Yau, and X. Gu. Mode collapse and regularity of optimal transportation maps, 2019. arXiv preprint arXiv:1902.02934.
  • [25] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [26] D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
  • [27] T. Hu, Z. Chen, H. Sun, J. Bai, M. Ye, and G. Cheng. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.
  • [28] L. V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.
  • [29] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [30] Z. Kong and K. Chaudhuri. The expressive power of a class of normalizing flow models. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Volume 108, pages 3599–3609. JMLR.org, 2020.
  • [31] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [32] H. Lee, R. Ge, T. Ma, A. Risteski, and S. Arora. On the ability of neural nets to express distributions. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1271–1296, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
  • [33] J. Lei. Convergence and concentration of empirical measures under wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020.
  • [34] N. Lei, K. Su, L. Cui, S.-T. Yau, and X. D. Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
  • [35] B. Lévy and E. L. Schwindt. Notions of optimal transport theory and how to implement them on a computer. Computers & Graphics, 72:135–148, 2018.
  • [36] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • [37] S. Liang and R. Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.
  • [38] T. Liang. On how well generative adversarial networks learn densities: Nonparametric and parametric results. arXiv preprint arXiv:1811.03179, 2018.
  • [39] H. Lin and S. Jegelka. Resnet with one-neuron hidden layers is a universal approximator. In Advances in neural information processing systems, pages 6169–6178, 2018.
  • [40] Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284, 2016.
  • [41] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386, 2016.
  • [42] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
  • [43] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions, 2020. arXiv preprint arXiv:2001.03040.
  • [44] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in neural information processing systems, pages 6231–6239, 2017.
  • [45] A. V. Makkuva, A. Taghvaei, S. Oh, and J. D. Lee. Optimal transport mapping via input convex neural networks. arXiv preprint arXiv:1908.10962, 2019.
  • [46] B. Matérn. Spatial variation, volume 36. Springer Science & Business Media, 2013.
  • [47] G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
  • [48] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural Networks, 108:296–330, 2018.
  • [49] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
  • [50] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
  • [51] Z. Shen, H. Yang, and S. Zhang. Deep network approximation characterized by number of neurons, 2019. arXiv preprint arXiv:1906.05497.
  • [52] D. Siersma and M. Van Manen. Power diagrams and their applications. arXiv preprint math.MG/0508037, 2005.
  • [53] B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.
  • [54] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
  • [55] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [56] A. Taghvaei and A. Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.
  • [57] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
  • [58] C. Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • [59] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2009.
  • [60] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
  • [61] D. Yarotsky. Optimal approximation of continuous functions by very deep relu networks. arXiv preprint arXiv:1802.03620, 2018.
  • [62] L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical review letters, 120(14):143001, 2018.

Appendix

Appendix A Proof of Proposition 3.1

Proof.

The proof follows from some previous results by [9] and [33]. In fact, in the one dimensional case, according to [9, Theorem 3.2], we know that if $\pi$ satisfies

J_1(\pi)=\int_{-\infty}^{\infty}\sqrt{F(x)(1-F(x))}\,dx<\infty \qquad (A.1)

where $F$ is the cumulative distribution function of $\pi$, then for every $n\geq 1$,

\mathbf{E}\,\mathcal{W}_1(P_n,\pi)\leq\frac{J_1(\pi)}{\sqrt{n}}. \qquad (A.2)

The condition (A.1) is fulfilled if $\pi$ has finite third moment since

J_1(\pi)\leq\int_0^\infty\sqrt{\mathbf{P}(|X|\geq x)}\,dx\leq 1+\int_1^\infty\frac{\sqrt{\mathbf{E}|X|^3}}{x^{3/2}}\,dx=1+2\sqrt{M_3}.

In the case $d\geq 2$, it follows from [33, Theorem 3.1] that if $M_3=\mathbb{E}_{X\sim\pi}|X|^3<\infty$, then there exists a constant $c>0$ independent of $d$ such that

\mathbf{E}\,\mathcal{W}_1(P_n,\pi)\leq cM_3^{1/3}\cdot\begin{cases}\frac{\log n}{\sqrt{n}}&\text{ if }d=2,\\ \frac{1}{n^{1/d}}&\text{ if }d\geq 3.\end{cases} \qquad (A.3)

Appendix B Proof of Proposition 3.2

Proof.

Thanks to [53, Proposition 3.1], one has that

\operatorname{MMD}(P_n,\pi)=\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}.

Let us define $\varphi(X_1,X_2,\cdots,X_n):=\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\|_{\mathcal{H}_k}$. Then by definition $\varphi(X_1,X_2,\cdots,X_n)$ satisfies, for any $i\in\{1,\cdots,n\}$,

\big|\varphi(X_1,\cdots,X_{i-1},X_i,\cdots,X_n)-\varphi(X_1,\cdots,X_{i-1},X_i',\cdots,X_n)\big|
\leq\frac{2}{n}\sup_x\|k(\cdot,x)\|_{\mathcal{H}_k}
\leq\frac{2\sqrt{K_0}}{n},\qquad\forall\, X_i,X_i'\in\mathbb{R}^d,

where we have used that $\|k(\cdot,x)\|_{\mathcal{H}_k}=\sqrt{k(x,x)}\leq\sqrt{K_0}$ by assumption. It follows from the above and McDiarmid's inequality that for every $\tau>0$, with probability at least $1-e^{-\tau}$,

\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}\leq\mathbf{E}\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}}.

In addition, we have by the standard symmetrization argument that

\mathbf{E}\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}\leq 2\,\mathbf{E}\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k},

where $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. Rademacher variables and $\mathbf{E}_\varepsilon$ denotes the conditional expectation w.r.t. $\{\varepsilon_i\}_{i=1}^n$ given $\{X_i\}_{i=1}^n$. To bound the right hand side above, we apply McDiarmid's inequality again to obtain that with probability at least $1-e^{-\tau}$,

\mathbf{E}\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k}\leq\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}}
\leq\Big(\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k}^2\Big)^{1/2}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}}
\leq\sqrt{\frac{\sqrt{K_0}}{n}}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}},

where we have used Jensen’s inequality for expectation in the second inequality and the independence of εi{\varepsilon}_{i} and the definition of K0K_{0} in the last inequality. Combining the estimates above yields that with probability at least 12eτ1-2e^{-\tau},

\operatorname{MMD}(P_n,\pi)=\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}\leq 2\sqrt{\frac{\sqrt{K_0}}{n}}+3\sqrt{\frac{2\sqrt{K_0}\tau}{n}}.

Appendix C Proof of Proposition 3.3

Thanks to [40, Theorem 3.6], $\operatorname{KSD}(P_n,\pi)$ can be evaluated explicitly as

\operatorname{KSD}(P_n,\pi)=\sqrt{\mathbf{E}_{x,y\sim P_n}[u_\pi(x,y)]}=\sqrt{\frac{1}{n^2}\sum_{i,j=1}^n u_\pi(X_i,X_j)}, \qquad (C.1)

where $u_\pi$ is a new kernel defined by

u_\pi(x,y)=s_\pi(x)^{T}k(x,y)s_\pi(y)+s_\pi(x)^{T}\nabla_y k(x,y)+s_\pi(y)^{T}\nabla_x k(x,y)+\operatorname{Tr}(\nabla_x\nabla_y k(x,y))

with $s_\pi(x)=\nabla\log\pi(x)$. Moreover, according to [40, Proposition 3.3], if $k$ satisfies Assumption K1, then $\operatorname{KSD}(P_n,\pi)$ is non-negative.

Our proof of Proposition 3.3 relies on the fact that $\operatorname{KSD}^2(P_n,\pi)$ can be viewed as a von Mises' statistic ($V$-statistic) and on an important Bernstein type inequality due to [10] for the distribution of $V$-statistics, which gives a concentration bound of $\operatorname{KSD}^2(P_n,\pi)$ around its mean (which is zero). We recall this inequality in the theorem below, which is a restatement of [10, Theorem 1] for second order degenerate $V$-statistics.

C.1 Bernstein type inequality for von Mises’ statistics

Let $X_1,\cdots,X_n,\cdots$ be a sequence of i.i.d. random variables on $\mathbb{R}^d$. For a kernel $h(x,y):\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}$, we call

V_n=\sum_{i,j=1}^n h(X_i,X_j) \qquad (C.2)

a von-Mises’ statistic of order 22 with kernel hh. We say that the kernel hh is degenerate if the following holds:

\mathbf{E}[h(X_1,X_2)|X_1]=\mathbf{E}[h(X_1,X_2)|X_2]=0. \qquad (C.3)
Theorem C.1 ([10, Theorem 1]).

Consider the $V$-statistic $V_n$ defined by (C.2) with a degenerate kernel $h$. Assume the kernel satisfies that

|h(x,y)|\leq g(x)\cdot g(y) \qquad (C.4)

for all $x,y\in\mathbb{R}^d$ with a function $g:\mathbb{R}^d\rightarrow\mathbb{R}$ satisfying, for $\xi,J>0$,

\mathbf{E}[g(X_1)^k]\leq\xi^2 J^{k-2}k!/2, \qquad (C.5)

for all $k=2,3,\cdots$. Then there exist some generic constants $C_1,C_2>0$ independent of $k,h,l,\xi$ such that for any $t\geq 0$,

𝐏(|Vn|n2t)C1exp(C2ntξ2+Jt1/2).\displaystyle{\mathbf{P}}(|V_{n}|\geq n^{2}t)\leq C_{1}\exp\Big{(}-\frac{C_{2}nt}{\xi^{2}+Jt^{1/2}}\Big{)}. (C.6)
Remark C.1.

As noted in [10, Remark 1], the inequality (C.6) is to some extent optimal. Moreover, a straightforward calculation shows that inequality (C.6) implies that for any δ(0,1)\delta\in(0,1),

𝐏(1n2|Vn|𝒱n)1δ,{\mathbf{P}}\Big{(}\frac{1}{n^{2}}|V_{n}|\leq\frac{\mathscr{V}}{n}\Big{)}\geq 1-\delta, (C.7)

where

𝒱=(Jlog(C1δ)C2+log(C1δ)C2ξ)2.\mathscr{V}=\Big{(}\frac{J\log(\frac{C_{1}}{\delta})}{C_{2}}+\sqrt{\frac{\log(\frac{C_{1}}{\delta})}{C_{2}}}\xi\Big{)}^{2}.
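For completeness, we sketch the calculation behind (C.7); constants are tracked loosely. Writing L=log(C_1/δ)/C_2, the bound (C.6) gives P(|V_n|≥n^2 t)≤δ as soon as nt≥L(ξ^2+J√t), and solving this quadratic inequality in √t yields (C.7):

```latex
% Write L = \log(C_1/\delta)/C_2. The inequality n t \ge L(\xi^2 + J\sqrt{t})
% holds once \sqrt{t} exceeds the positive root of n s^2 - L J s - L\xi^2 = 0:
\[
  \sqrt{t}\;\ge\;\frac{LJ+\sqrt{L^{2}J^{2}+4nL\xi^{2}}}{2n}.
\]
% Since \sqrt{a+b} \le \sqrt{a}+\sqrt{b}, the right-hand side is at most
\[
  \frac{LJ}{n}+\sqrt{\frac{L}{n}}\,\xi
  \;\le\;\frac{LJ+\sqrt{L}\,\xi}{\sqrt{n}}\qquad(n\ge 1),
\]
% so the choice t = (LJ+\sqrt{L}\,\xi)^2/n = \mathscr{V}/n gives
% P(|V_n| \ge n^2 t) \le \delta, which is exactly (C.7).
```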

C.2 Moment bound of sub-Gaussian random vectors

Let us first recall a useful concentration result on sub-Gaussian random vectors.

Theorem C.2 ([26, Theorem 2.1]).

Let XdX\in{\mathbb{R}}^{d} be a sub-Gaussian random vector with parameters mdm\in{\mathbb{R}}^{d} and υ>0\upsilon>0. Then for any t>0t>0,

𝐏(|Xm|υd+2dt+2t)et.\displaystyle{\mathbf{P}}\Big{(}|X-m|\geq\upsilon\sqrt{d+2\sqrt{dt}+2t}\Big{)}\leq e^{-t}. (C.8)

Moreover, for any 0η<12υ20\leq\eta<\frac{1}{2\upsilon^{2}},

𝐄exp(η|Xm|2)exp(υ2dη+υ4dη212υ2η).\displaystyle{\mathbf{E}}\exp(\eta|X-m|^{2})\leq\exp(\upsilon^{2}d\eta+\frac{\upsilon^{4}d\eta^{2}}{1-2\upsilon^{2}\eta}). (C.9)

As a direct consequence of Theorem C.2, we have the following useful moment bound for sub-Gaussian random vectors.

Proposition C.1.

Let XdX\in{\mathbb{R}}^{d} be a sub-Gaussian random vector with parameters mdm\in{\mathbb{R}}^{d} and υ>0\upsilon>0. Then for any k2k\geq 2,

𝐄|Xm|kk((2υd)k+12(4υ2)kkk/2).\displaystyle{\mathbf{E}}|X-m|^{k}\leq k\big{(}(2\upsilon\sqrt{d})^{k}+\frac{1}{2}(\frac{4\upsilon}{\sqrt{2}})^{k}k^{k/2}\big{)}. (C.10)
Proof.

From the concentration bound (C.8) and the simple fact that

d+2dt+2t=2(t+d2)2+d24(t+d2)2,d+2\sqrt{dt}+2t=2\Big{(}\sqrt{t}+\frac{\sqrt{d}}{2}\Big{)}^{2}+\frac{d}{2}\leq 4\Big{(}\sqrt{t}+\frac{\sqrt{d}}{2}\Big{)}^{2},

one can obtain that

𝐏(|Xm|2υ(t+d2))𝐏(|Xm|υd+2dt+2t)et.{\mathbf{P}}\Big{(}|X-m|\geq 2\upsilon\big{(}\sqrt{t}+\frac{\sqrt{d}}{2}\big{)}\Big{)}\leq{\mathbf{P}}\Big{(}|X-m|\geq\upsilon\sqrt{d+2\sqrt{dt}+2t}\Big{)}\leq e^{-t}.

Therefore, for any sυds\geq\upsilon\sqrt{d}, taking t=(s2υd2)2t=\big{(}\frac{s}{2\upsilon}-\frac{\sqrt{d}}{2}\big{)}^{2} (equivalently, s=2υ(t+d/2)s=2\upsilon(\sqrt{t}+\sqrt{d}/2)) in the bound above yields

𝐏(|Xm|s)e(s2υd2)2.\displaystyle{\mathbf{P}}\Big{(}|X-m|\geq s\Big{)}\leq e^{-\big{(}\frac{s}{2\upsilon}-\frac{\sqrt{d}}{2}\big{)}^{2}}. (C.11)

As a result, for any k2k\geq 2,

𝐄|Xm|k\displaystyle{\mathbf{E}}|X-m|^{k} =0𝐏(|Xm|ks)𝑑s\displaystyle=\int_{0}^{\infty}{\mathbf{P}}(|X-m|^{k}\geq s)ds (C.12)
=0𝐏(|Xm|ksk)ksk1𝑑s\displaystyle=\int_{0}^{\infty}{\mathbf{P}}(|X-m|^{k}\geq s^{k})ks^{k-1}ds
=02υd𝐏(|Xm|s)ksk1𝑑s+2υd𝐏(|Xm|s)ksk1𝑑s\displaystyle=\int_{0}^{2\upsilon\sqrt{d}}{\mathbf{P}}(|X-m|\geq s)ks^{k-1}ds+\int_{2\upsilon\sqrt{d}}^{\infty}{\mathbf{P}}(|X-m|\geq s)ks^{k-1}ds
=:I1+I2,\displaystyle=:I_{1}+I_{2},

where we have used the change of variable ssks\mapsto s^{k} in the second line above. It is clear that the first term

I1k(2υd)k.I_{1}\leq k(2\upsilon\sqrt{d})^{k}.

For I2I_{2}, one first notices that if s2υds\geq 2\upsilon\sqrt{d}, then s/(2υ)d/2s/(4υ)s/(2\upsilon)-\sqrt{d}/2\geq s/(4\upsilon). Hence it follows from (C.11) that

I2\displaystyle I_{2} 2υde(s4υ)2ksk1𝑑s\displaystyle\leq\int_{2\upsilon\sqrt{d}}^{\infty}e^{-\big{(}\frac{s}{4\upsilon}\big{)}^{2}}ks^{k-1}ds
=k(4υ)k2d4ettk22𝑑t\displaystyle=\frac{k(4\upsilon)^{k}}{2}\int_{\frac{d}{4}}^{\infty}e^{-t}t^{\frac{k-2}{2}}dt
k(4υ)k2Γ(k2).\displaystyle\leq\frac{k(4\upsilon)^{k}}{2}\Gamma(\frac{k}{2}).

The last two estimates imply that

𝐄|Xm|k\displaystyle{\mathbf{E}}|X-m|^{k} k(2υd)k+k(4υ)k2Γ(k2)\displaystyle\leq k(2\upsilon\sqrt{d})^{k}+\frac{k(4\upsilon)^{k}}{2}\Gamma(\frac{k}{2})
k((2υd)k+12(4υ2)kkk/2),\displaystyle\leq k\big{(}(2\upsilon\sqrt{d})^{k}+\frac{1}{2}(\frac{4\upsilon}{\sqrt{2}})^{k}k^{k/2}\big{)},

where the second inequality above follows from Γ(k2)(k/2)k/2\Gamma(\frac{k}{2})\leq(k/2)^{k/2} for k2k\geq 2. ∎
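As a quick numerical sanity check of (C.10) (ours, purely illustrative and not part of the argument), one may compare both sides for a standard Gaussian vector, which is sub-Gaussian with m=0 and υ=1; the empirical moments stay well below the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, nsamples = 4, 200000
X = rng.standard_normal((nsamples, d))        # standard Gaussian: m = 0, upsilon = 1
norms = np.linalg.norm(X, axis=1)
for k in [2, 3, 4, 6]:
    lhs = (norms ** k).mean()                                                   # E |X - m|^k
    rhs = k * ((2 * np.sqrt(d)) ** k + 0.5 * (4 / np.sqrt(2)) ** k * k ** (k / 2))
    print(k, lhs, rhs)                                                          # lhs < rhs
```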

C.3 Proof of Proposition 3.3

Our goal is to invoke Theorem C.1 to obtain a concentration inequality for KSD. Recall that KSD(Pn,π)\textrm{KSD}(P_{n},\pi) is defined by

KSD2(Pn,π)=1n2i,j=1nuπ(Xi,Xj)\textrm{KSD}^{2}(P_{n},\pi)=\frac{1}{n^{2}}\sum_{i,j=1}^{n}u_{\pi}(X_{i},X_{j})

with the kernel

uπ(x,y)=sπ(x)Tk(x,y)sπ(y)+sπ(x)Tyk(x,y)+sπ(y)Txk(x,y)+Tr(xyk(x,y)).u_{\pi}(x,y)=s_{\pi}(x)^{\hbox{\it\tiny T}}k(x,y)s_{\pi}(y)+s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}k(x,y)+s_{\pi}(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)+\operatorname{Tr}(\nabla_{x}\nabla_{y}k(x,y)).

Let us first verify that the new kernel uπu_{\pi} satisfies the assumptions of Theorem C.1. In fact, since sπ(x)=log(π(x))s_{\pi}(x)=\nabla\log(\pi(x)), one obtains from integration by parts that

𝐄[uπ(X1,X2)|X1=x]=duπ(x,y)𝑑π(y)\displaystyle{\mathbf{E}}[u_{\pi}(X_{1},X_{2})|X_{1}=x]=\int_{{\mathbb{R}}^{d}}u_{\pi}(x,y)d\pi(y)
=dsπ(x)Tk(x,y)sπ(y)+sπ(x)Tyk(x,y)\displaystyle=\int_{{\mathbb{R}}^{d}}s_{\pi}(x)^{\hbox{\it\tiny T}}k(x,y)s_{\pi}(y)+s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}k(x,y)
+sπ(y)Txk(x,y)+Tr(xyk(x,y))dπ(y)\displaystyle\qquad+s_{\pi}(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)+\operatorname{Tr}(\nabla_{x}\nabla_{y}k(x,y))d\pi(y)
=dk(x,y)sπ(x)Tyπ(y)𝑑ydk(x,y)sπ(x)Tyπ(y)𝑑y\displaystyle=\int_{{\mathbb{R}}^{d}}k(x,y)s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}\pi(y)dy-\int_{{\mathbb{R}}^{d}}k(x,y)s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}\pi(y)dy
+dyπ(y)Txk(x,y)𝑑ydyπ(y)Txk(x,y)𝑑y\displaystyle\qquad+\int_{{\mathbb{R}}^{d}}\nabla_{y}\pi(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)dy-\int_{{\mathbb{R}}^{d}}\nabla_{y}\pi(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)dy
=0.\displaystyle=0.

Here the third equality follows from the identity sπ(y)π(y)=yπ(y)s_{\pi}(y)\pi(y)=\nabla_{y}\pi(y) together with integration by parts in yy applied to the second and fourth terms (the boundary terms vanish thanks to the decay of π\pi and the growth conditions on kk).

Similarly, one has

𝐄[uπ(X1,X2)|X2=y]=0.{\mathbf{E}}[u_{\pi}(X_{1},X_{2})|X_{2}=y]=0.

This shows that uπu_{\pi} satisfies the condition of degeneracy (C.3).
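The degeneracy can also be confirmed numerically. The snippet below (again for the illustrative standard-Gaussian target and RBF kernel used in the earlier sketch, neither of which is fixed by the assumptions) estimates 𝐄[u_π(x,X_2)] by Monte Carlo for a fixed x and returns a value close to zero, in agreement with (C.3).

```python
import numpy as np

def u_pi(x, Ys, bandwidth=1.0):
    # u_pi(x, y) for pi = N(0, I_d) (score s(z) = -z) and the RBF kernel,
    # evaluated at one fixed point x against a batch of points Ys.
    h2 = bandwidth ** 2
    diff = x[None, :] - Ys                       # x - y
    sq = (diff ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h2))
    d = x.shape[0]
    val = (x * Ys).sum(-1)                       # s(x)^T s(y) = (-x).(-y)
    val += (-x[None, :] * diff).sum(-1) / h2     # s(x)^T grad_y k / k
    val += (Ys * diff).sum(-1) / h2              # s(y)^T grad_x k / k
    val += d / h2 - sq / h2 ** 2                 # Tr(grad_x grad_y k) / k
    return K * val

rng = np.random.default_rng(3)
x = np.array([0.7, -1.2, 0.3])
Ys = rng.standard_normal((200000, 3))            # Y ~ pi = N(0, I_3)
print(u_pi(x, Ys).mean())                        # close to 0, consistent with (C.3)
```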

Next, we show that uπu_{\pi} satisfies the bound (C.4) with a function gg satisfying the moment condition (C.5). In fact, by Assumption K3 on the kernel kk and Assumption 1 on the target density π\pi,

|uπ(x,y)|\displaystyle|u_{\pi}(x,y)| L2K1(1+|x|)(1+|y|)+LK1(1+|x|+1+|y|)+K1(1+d)\displaystyle\leq L^{2}K_{1}(1+|x|)\cdot(1+|y|)+LK_{1}(1+|x|+1+|y|)+K_{1}(1+d)
K1(L+1)2(d+1+|x|)(d+1+|y|)\displaystyle\leq K_{1}(L+1)^{2}(\sqrt{d}+1+|x|)\cdot(\sqrt{d}+1+|y|)
=:g(x)g(y),\displaystyle=:g(x)\cdot g(y),

where g(x)=K1(L+1)(d+1+|x|)g(x)=\sqrt{K_{1}}(L+1)(\sqrt{d}+1+|x|) and the constant LL is defined in (2.6). To verify gg satisfies (C.5), we write

𝐄Xπ[g(X)k]=(K1(L+1))k𝐄Xπ[(d+1+|X|)k]\displaystyle{\mathbf{E}}_{X\sim\pi}[g(X)^{k}]=(\sqrt{K_{1}}(L+1))^{k}{\mathbf{E}}_{X\sim\pi}[(\sqrt{d}+1+|X|)^{k}] (C.13)
(K1(L+1))k𝐄Xπ[(d+1+|m|+|Xm|)k]\displaystyle\leq(\sqrt{K_{1}}(L+1))^{k}{\mathbf{E}}_{X\sim\pi}[(\sqrt{d}+1+|m|+|X-m|)^{k}]
=(K1(L+1))k((d+1+|m|)k+j=1k(kj)(d+1+|m|)kj𝐄Xπ|Xm|j).\displaystyle=(\sqrt{K_{1}}(L+1))^{k}\Big{(}(\sqrt{d}+1+|m|)^{k}+\sum_{j=1}^{k}\begin{pmatrix}k\\ j\end{pmatrix}(\sqrt{d}+1+|m|)^{k-j}{\mathbf{E}}_{X\sim\pi}|X-m|^{j}\Big{)}.

Thanks to Proposition C.1, we have for any j1j\geq 1,

𝐄Xπ|Xm|j\displaystyle{\mathbf{E}}_{X\sim\pi}|X-m|^{j} (𝐄Xπ|Xm|2j)1/2\displaystyle\leq\big{(}{\mathbf{E}}_{X\sim\pi}|X-m|^{2j}\big{)}^{1/2} (C.14)
(2j)1/2((2υd)2j+12(4υ2)2j(2j)j)1/2\displaystyle\leq(2j)^{1/2}\cdot\Big{(}(2\upsilon\sqrt{d})^{2j}+\frac{1}{2}\big{(}\frac{4\upsilon}{\sqrt{2}}\big{)}^{2j}\cdot(2j)^{j}\Big{)}^{1/2}
(2j)1/2(2max(4υ,1)dj)j\displaystyle\leq(2j)^{1/2}\cdot\Big{(}2\max(4\upsilon,1)\cdot\sqrt{d}\cdot\sqrt{j}\Big{)}^{j}
(2e1/emax(4υ,1)dj)j,\displaystyle\leq\Big{(}2e^{1/e}\max(4\upsilon,1)\cdot\sqrt{d}\cdot\sqrt{j}\Big{)}^{j},

where we have used the simple fact that (2j)1/(2j)e1/e(2j)^{1/(2j)}\leq e^{1/e} for any j1j\geq 1 in the last inequality. Plugging (C.14) into (C.13) yields that

𝐄Xπ[g(X)k]2(K1(L+1))k(d+1+|m|+2e1/emax(4υ,1)dk)k\displaystyle{\mathbf{E}}_{X\sim\pi}[g(X)^{k}]\leq 2(\sqrt{K_{1}}(L+1))^{k}\Big{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Big{)}^{k} (C.15)
=2(K1(L+1))kexp(klog(d+1+|m|+2e1/emax(4υ,1)dk)).\displaystyle=2(\sqrt{K_{1}}(L+1))^{k}\exp\Big{(}k\log\Big{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Big{)}\Big{)}.

Using the fact that log(a+b)log(a)=log(1+b/a)b/a\log(a+b)-\log(a)=\log(1+b/a)\leq b/a for all a,b1a,b\geq 1, one has

exp(klog(d+1+|m|+2e1/emax(4υ,1)dk))\displaystyle\exp\Big{(}k\log\Big{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Big{)}\Big{)} (C.16)
exp(klog(2e1/emax(4υ,1)d=:Ak))exp(kd+1+|m|2e1/emax(4υ,1)d=:B).\displaystyle\leq\exp\Big{(}k\log\Big{(}\underbrace{2e^{1/e}\max(4\upsilon,1)\sqrt{d}}_{=:A}\cdot\sqrt{k}\Big{)}\Big{)}\cdot\exp\Big{(}\sqrt{k}\cdot\underbrace{\frac{\sqrt{d}+1+|m|}{2e^{1/e}\max(4\upsilon,1)\sqrt{d}}}_{=:B}\Big{)}.

Since by assumption |m|md|m|\leq m^{\ast}\sqrt{d} and d1d\geq 1, we have

B2+m2e1/emax(4υ,1)=:B~.B\leq\frac{2+m^{\ast}}{2e^{1/e}\max(4\upsilon,1)}=:\tilde{B}.

As a consequence of the above and the fact that k!(k3)kk!\geq\Big{(}\frac{k}{3}\Big{)}^{k} for any k+k\in{\mathbb{N}}_{+},

exp(klog\displaystyle\exp\Bigl{(}k\log (d+1+|m|+2e1/emax(4υ,1)dk))\displaystyle\Bigl{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Bigr{)}\Bigr{)} (C.17)
exp(klogk/2)exp(k(logA+B~))\displaystyle\leq\exp(k\log k/2)\cdot\exp(k(\log A+\tilde{B}))
=(k3)k/2(3Aexp(B~+12))k\displaystyle=\big{(}\frac{k}{3}\big{)}^{k/2}\cdot\big{(}\sqrt{3}A\exp(\tilde{B}+\frac{1}{2})\big{)}^{k}
k!(3Aexp(B~+12))k.\displaystyle\leq k!\cdot\big{(}\sqrt{3}A\exp(\tilde{B}+\frac{1}{2})\big{)}^{k}.

Combining this with (C.15) implies that the moment bound assumption (C.5) holds with the constants

J=3K1(L+1)Aexp(B~+12) and ξ=2J.J=\sqrt{3K_{1}}(L+1)A\exp\Big{(}\tilde{B}+\frac{1}{2}\Big{)}\textrm{ and }\xi=2J.

Therefore it follows from the definition of KSD(Pn,π)\mathrm{KSD}(P_{n},\pi) in (C.1) and the concentration bound (C.7) implied by Theorem C.1 that, with probability at least 1δ1-\delta,

KSD(Pn,π)Cn,\textrm{KSD}(P_{n},\pi)\leq\frac{C}{\sqrt{n}},

with the constant

C=J(log(C1δ)C2+2log(C1δ)C2).C=J\Big{(}\frac{\log(\frac{C_{1}}{\delta})}{C_{2}}+2\sqrt{\frac{\log(\frac{C_{1}}{\delta})}{C_{2}}}\Big{)}.

Since by definition the constant A=𝒪(d)A=\mathcal{O}(\sqrt{d}) for large dd, we have that the constant C=𝒪(d)C=\mathcal{O}(\sqrt{d}). This completes the proof.

Appendix D Summarizing Propositions 3.1 - 3.3

The theorem below summarizes Propositions 3.1 - 3.3 above and serves as one of the ingredients in the proof of Theorem 2.1.

Theorem D.1.

Let π\pi be a probability measure on d{\mathbb{R}}^{d} and let Pn=1ni=1nδXiP_{n}=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}} be the empirical measure associated to the i.i.d. samples {Xi}i=1n\{X_{i}\}_{i=1}^{n} drawn from π\pi. Then we have the following:

  1. 1.

    If π\pi satisfies M3=𝐄Xπ|X|3<M_{3}={\mathbf{E}}_{X\sim\pi}|X|^{3}<\infty, then there exists a realization of empirical measure PnP_{n} such that

    𝒲1(Pn,π)C{n1/2,d=1,n1/2logn,d=2,n1/d,d3,{\mathcal{W}}_{1}(P_{n},\pi)\leq C\cdot\begin{cases}n^{-1/2},&d=1,\\ n^{-1/2}\log n,&d=2,\\ n^{-1/d},&d\geq 3,\end{cases}

    where the constant CC depends only on M3M_{3}.

  2. 2.

    If kk satisfies Assumption K2 with constant K0K_{0}, then there exists a realization of empirical measure PnP_{n} such that

    MMD(Pn,π)Cn,\operatorname{MMD}(P_{n},\pi)\leq\frac{C}{\sqrt{n}},

    where the constant CC depends only on K0K_{0}.

  3. 3.

    If π\pi satisfies Assumptions 1 and 2 and kk satisfies Assumption K3 with constant K1K_{1}, then there exists a realization of the empirical measure PnP_{n} such that

    KSD (Pn,π)Cdn,\textrm{KSD }(P_{n},\pi)\leq C\sqrt{\frac{d}{n}},

    where the constant CC depends only on L,K1,m,υL,K_{1},m^{\ast},\upsilon.
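For the first rate in Theorem D.1, the d=1 case is easy to examine numerically. The sketch below (an illustration only, with a standard Gaussian π approximated by a large reference sample; both choices are ours) exhibits the expected O(n^{-1/2}) decay of 𝒲_1(P_n,π).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(6)
reference = rng.standard_normal(200000)     # large sample standing in for pi = N(0, 1)
for n in [100, 1000, 10000]:
    Pn = rng.standard_normal(n)             # i.i.d. samples defining P_n
    print(n, wasserstein_distance(Pn, reference))   # decays roughly like n^{-1/2}
```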

Appendix E Semi-discrete optimal transport with quadratic cost

E.1 Structure theorem of optimal transport map

We recall the structure theorem of optimal transport map between μ\mu and ν\nu under the assumption that μ\mu does not give mass to null sets.

Theorem E.1 ([58, Theorem 2.9 and Theorem 2.12]).

Let μ\mu and ν\nu be two probability measures on d{\mathbb{R}}^{d} with finite second moments. Assume that μ\mu is absolutely continuous with respect to the Lebesgue measure. Consider the functionals 𝒦\mathcal{K} and 𝒥\mathcal{J} defined in Monge’s problem (4.1) and dual Kantorovich problem (4.2) with c=12|xy|2c=\frac{1}{2}|x-y|^{2}. Then

(i) there exists a unique solution π\pi to Kantorovich’s problem, which is given by π(dxdy)=(𝐈𝐝×T)#μ\pi(dxdy)=(\mathbf{Id}\times T)_{\#}\mu, where T(x)=φ¯(x)T(x)=\nabla\bar{\varphi}(x) for μ\mu-a.e. xx and some convex function φ¯:d\bar{\varphi}:{\mathbb{R}}^{d}{\rightarrow}{\mathbb{R}}. In other words, T(x)=φ¯(x)T(x)=\nabla\bar{\varphi}(x) is the unique solution to Monge’s problem.

(ii) there exists an optimal pair (φ(x),φc(y))(\varphi(x),\varphi^{c}(y)) or (ψc(x),ψ(y))(\psi^{c}(x),\psi(y)) solving the dual Kantorovich’s problem, i.e. sup(φ,ψ)Φc𝒥(φ,ψ)=𝒥(φ,φc)=𝒥(ψc,ψ)\sup_{(\varphi,\psi)\in\Phi_{c}}\mathcal{J}(\varphi,\psi)=\mathcal{J}(\varphi,\varphi^{c})=\mathcal{J}(\psi^{c},\psi);

(iii) the function φ¯(x)\bar{\varphi}(x) can be chosen as φ¯(x)=12|x|2φ(x)\bar{\varphi}(x)=\frac{1}{2}|x|^{2}-\varphi(x) (or φ¯(x)=12|x|2ψc(x)\bar{\varphi}(x)=\frac{1}{2}|x|^{2}-\psi^{c}(x)) where (φ(x),φc(y))(\varphi(x),\varphi^{c}(y)) (or (ψc(x),ψ(y))(\psi^{c}(x),\psi(y))) is an optimal pair which maximizes 𝒥\mathcal{J} within the set Φc\Phi_{c}.

E.2 Proof of Theorem 4.2

Recall that the dual Kantorovich problem in the semi-discrete case reduces to maximizing the following functional

(ψ)=infj(12|xyj|2ψj)ρ(x)dx+j=1nψjνj.\displaystyle{\mathcal{F}}(\psi)=\int\inf_{j}\left(\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\right)\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}. (E.1)

Proof of Theorem 4.2 relies on two useful lemmas on the functional {\mathcal{F}}. The first lemma below shows that the functional {\mathcal{F}} is concave, whose proof adapts that of [35, Theorem 2] for semi-discrete optimal transport with the quadratic cost.

Lemma E.1.

Let ρ\rho be a probability density on d{\mathbb{R}}^{d}. Let {yj}j=1nd\{y_{j}\}_{j=1}^{n}\subset{\mathbb{R}}^{d} and let {νj}j=1n[0,1]\{\nu_{j}\}_{j=1}^{n}\subset[0,1] be such that j=1nνj=1\sum_{j=1}^{n}\nu_{j}=1. Then the functional {\mathcal{F}} defined by (E.1) is concave.

Proof.

Let 𝒜:d{1,2,,n}\mathcal{A}:{\mathbb{R}}^{d}{\rightarrow}\{1,2,\cdots,n\} be an assignment function which assigns a point xdx\in{\mathbb{R}}^{d} to the index jj of some point yjy_{j}. Let us also define the function

~(𝒜,ψ)=(12|xy𝒜(x)|2ψ𝒜(x))ρ(x)𝑑x+j=1nψjνj.\widetilde{\mathcal{F}}(\mathcal{A},\psi)=\int\Big{(}\frac{1}{2}|x-y_{{\mathcal{A}}(x)}|^{2}-\psi_{{\mathcal{A}}(x)}\Big{)}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}.

Then by definition (ψ)=inf𝒜~(𝒜,ψ){\mathcal{F}}(\psi)=\inf_{{\mathcal{A}}}\widetilde{\mathcal{F}}(\mathcal{A},\psi). Denote 𝒜1(j)={xd|𝒜(x)=j}{\mathcal{A}}^{-1}(j)=\{x\in{\mathbb{R}}^{d}|{\mathcal{A}}(x)=j\}. Then

~(𝒜,ψ)\displaystyle\widetilde{\mathcal{F}}(\mathcal{A},\psi) =j=1n[𝒜1(j)(12|xyj|2ψj)ρ(x)𝑑x+ψjνj]\displaystyle=\sum_{j=1}^{n}\left[\int_{{\mathcal{A}}^{-1}(j)}\Big{(}\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\Big{)}\rho(x)dx+\psi_{j}\nu_{j}\right]
=j=1n𝒜1(j)12|xyj|2ρ(x)𝑑x+j=1nψj(νj𝒜1(j)ρ(x)𝑑x).\displaystyle=\sum_{j=1}^{n}\int_{{\mathcal{A}}^{-1}(j)}\frac{1}{2}|x-y_{j}|^{2}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\Big{(}\nu_{j}-\int_{{\mathcal{A}}^{-1}(j)}\rho(x)dx\Big{)}.

Since the function ~(𝒜,ψ)\widetilde{\mathcal{F}}(\mathcal{A},\psi) is affine in ψ\psi for every 𝒜{\mathcal{A}}, it follows that (ψ)=inf𝒜~(𝒜,ψ){\mathcal{F}}(\psi)=\inf_{{\mathcal{A}}}\widetilde{\mathcal{F}}(\mathcal{A},\psi) is concave. ∎

The next lemma computes the gradient of the concave function {\mathcal{F}}; see [35, Section 7.4] for the corresponding result with general transportation cost.

Lemma E.2.

Let ρ\rho be a probability density on d{\mathbb{R}}^{d}. Let {yj}j=1nd\{y_{j}\}_{j=1}^{n}\subset{\mathbb{R}}^{d} and let {νj}j=1n[0,1]\{\nu_{j}\}_{j=1}^{n}\subset[0,1] be such that j=1nνj=1\sum_{j=1}^{n}\nu_{j}=1. Denote by Pj(ψ)P_{j}(\psi) the power diagram associated to ψ\psi and yjy_{j}. Then

ψi(ψ)=νiμ(Pi(ψ))=νiPi(ψ)ρ(x)𝑑x.\partial_{\psi_{i}}{\mathcal{F}}(\psi)=\nu_{i}-\mu(P_{i}(\psi))=\nu_{i}-\int_{P_{i}(\psi)}\rho(x)dx. (E.2)
Proof.

By the definition of \mathcal{F} in (E.1), we rewrite \mathcal{F} as

(ψ)\displaystyle\mathcal{F}(\psi) =12|x|2ρ(dx)+infj{xyj+12|yj|2ψj}ρ(x)dx+j=1nψjνj\displaystyle=\int\frac{1}{2}|x|^{2}\rho(dx)+\int\inf_{j}\Big{\{}-x\cdot y_{j}+\frac{1}{2}|y_{j}|^{2}-\psi_{j}\Big{\}}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}
=12|x|2ρ(dx)supj{xyj+ψj12|yj|2}ρ(x)dx+j=1nψjνj\displaystyle=\int\frac{1}{2}|x|^{2}\rho(dx)-\int\sup_{j}\Big{\{}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}\Big{\}}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}

To prove (E.2), it suffices to prove that

ψi(supj{xyj+ψj12|yj|2}ρ(x)dx)=Pi(ψ)ρ(x)𝑑x.\partial_{\psi_{i}}\Big{(}\int\sup_{j}\Big{\{}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}\Big{\}}\rho(x)dx\Big{)}=\int_{P_{i}(\psi)}\rho(x)dx. (E.3)

Note that the partial derivative on the left-hand side of (E.3) makes sense, since g(x,ψ):=supj{xyj+ψj12|yj|2}g(x,\psi):=\sup_{j}\{x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}\} is convex with respect to (x,ψ)(x,\psi) on d×n{\mathbb{R}}^{d}\times{\mathbb{R}}^{n}, so that the resulting integral against the measure ρ\rho is also convex (and hence locally Lipschitz) in ψ\psi. To see (E.3), since g(x,ψ)g(x,\psi) is convex and piecewise linear in ψ\psi for any fixed xx, it is easy to observe that

ψig(x,ψ)=δij if x{xd|xyj+ψj12|yj|2=g(x,ψ)}.\partial_{\psi_{i}}g(x,\psi)=\delta_{ij}\text{ if }x\in\Big{\{}x\in{\mathbb{R}}^{d}\Big{|}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}=g(x,\psi)\Big{\}}.

On the other hand, by subtracting 12|x|2\frac{1}{2}|x|^{2} from both sides of the equation inside the braces and then flipping the sign, one sees that

{xd|xyj+ψj12|yj|2=g(x,ψ)}=Pj(ψ).\Big{\{}x\in{\mathbb{R}}^{d}\Big{|}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}=g(x,\psi)\Big{\}}=P_{j}(\psi).

Namely we have obtained that

ψig(x,ψ)=δi,j if xPj(ψ).\partial_{\psi_{i}}g(x,\psi)=\delta_{i,j}\text{ if }x\in P_{j}(\psi).

In particular, this implies that ψg(x,ψ)\psi{\rightarrow}g(x,\psi) is 1-Lipschitz in ψ\psi uniformly with respect to xx. Finally, since ρ\rho is a probability density, the desired identity (E.3), and hence (E.2), follows from the equation above and the dominated convergence theorem. This completes the proof of the lemma. ∎

With the lemmas above, we are ready to prove Theorem 4.2. In fact, according to Lemma E.1 and Lemma E.2, ψ=(ψ1,,ψn)\psi=(\psi_{1},\cdots,\psi_{n}) is a maximizer of the functional {\mathcal{F}} if and only if

ψi(ψ)=νiμ(Pi(ψ))=νiPi(ψ)ρ(x)𝑑x=0.\partial_{\psi_{i}}{\mathcal{F}}(\psi)=\nu_{i}-\mu(P_{i}(\psi))=\nu_{i}-\int_{P_{i}(\psi)}\rho(x)dx=0.

Since the dual Kantorovich problem in the semi-discrete setting reduces to the problem of maximizing {\mathcal{F}}, it follows from Theorem E.1 that the optimal transport map TT solving the semi-discrete Monge’s problem (4.4) is given by T(x)=φ¯(x)T(x)=\nabla\bar{\varphi}(x) where φ¯(x)=12|x|2φ(x)\bar{\varphi}(x)=\frac{1}{2}|x|^{2}-\varphi(x) and φ(x)=minj12|xyj|2ψj\varphi(x)=\min_{j}\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}. Consequently,

φ¯(x)\displaystyle\bar{\varphi}(x) =12|x|2φ(x)\displaystyle=\frac{1}{2}|x|^{2}-\varphi(x)
=12|x|2(minj{12|xyj|2ψj})\displaystyle=\frac{1}{2}|x|^{2}-\Big{(}\min_{j}\{\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\}\Big{)}
=maxj{xyj+mj}\displaystyle=\max_{j}\{x\cdot y_{j}+m_{j}\}

with mj=ψj12|yj|2m_{j}=\psi_{j}-\frac{1}{2}|y_{j}|^{2}. Moreover, noticing that φ(x)\varphi(x) can be rewritten as

φ(x)=12|xyj|2ψj if xPj(ψ),\varphi(x)=\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\text{ if }x\in P_{j}(\psi),

one obtains that T(x)=φ¯(x)=yjT(x)=\nabla\bar{\varphi}(x)=y_{j} if xPj(ψ)x\in P_{j}(\psi).
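The optimality condition above also suggests a simple computational scheme, sketched below purely for illustration (this is our own toy implementation, not part of the proof): ascend the concave functional ℱ by a stochastic gradient method, estimating the gradient formula of Lemma E.2 by Monte Carlo, and then read off the piecewise constant map T(x)=y_{j(x)} from the resulting power diagram. The Gaussian source density, the uniform target weights, and all function names are arbitrary choices made for this example.

```python
import numpy as np

def transport_indices(X, Y, psi):
    # For each row x of X, the index j(x) of the power cell P_j(psi) containing x,
    # i.e. argmin_j ( |x - y_j|^2 / 2 - psi_j ).
    cost = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1) - psi[None, :]
    return cost.argmin(axis=1)

def solve_semidiscrete_ot(Y, nu, sample_rho, iters=2000, batch=4000, lr=2.0):
    # Stochastic ascent on the concave functional F in (E.1), using the gradient
    # formula of Lemma E.2: dF/dpsi_i = nu_i - rho(P_i(psi)), with the cell
    # masses rho(P_i(psi)) estimated by Monte Carlo on samples of rho.
    psi = np.zeros(len(Y))
    for t in range(iters):
        X = sample_rho(batch)
        j = transport_indices(X, Y, psi)
        cell_mass = np.bincount(j, minlength=len(Y)) / batch
        psi += lr / np.sqrt(t + 1.0) * (nu - cell_mass)
    return psi

rng = np.random.default_rng(4)
d, n = 2, 8
Y = rng.standard_normal((n, d))            # support points y_j of the discrete target
nu = np.full(n, 1.0 / n)                   # target weights nu_j
psi = solve_semidiscrete_ot(Y, nu, lambda size: rng.standard_normal((size, d)))
# The map is T(x) = y_{j(x)}; the power-cell masses should now be close to nu:
j = transport_indices(rng.standard_normal((100000, d)), Y, psi)
print(np.bincount(j, minlength=n) / 100000.0)
```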

E.3 Proof of Proposition 4.1

Let us first consider the case that n=2kn=2^{k} for some kk\in{\mathbb{N}}. Then

φ¯(x)=maxj=1,,2k{xyj+mj}=maxj=1,,2k1maxi{2j1,2j}{xyi+mi}.\displaystyle\bar{\varphi}(x)=\max_{j=1,\cdots,2^{k}}\{x\cdot y_{j}+m_{j}\}=\max_{j=1,\cdots,2^{k-1}}\max_{i\in\{2j-1,2j\}}\{x\cdot y_{i}+m_{i}\}.

Let us define maps φn:nn/2\varphi_{n}:{\mathbb{R}}^{n}{\rightarrow}{\mathbb{R}}^{n/2} and ψ:dn\psi:{\mathbb{R}}^{d}{\rightarrow}{\mathbb{R}}^{n} by setting

[φn(z)]i=max{z2i1,z2i},i=1,,n/2 and [ψ(x)]j=xyj+mj,j=1,,n.[\varphi_{n}(z)]_{i}=\max\{z_{2i-1},z_{2i}\},i=1,\cdots,n/2\text{ and }[\psi(x)]_{j}=x\cdot y_{j}+m_{j},j=1,\cdots,n.

Then by definition it is straightforward that

φ¯(x)=(φ2φ4φn/2φnψ)(x).\bar{\varphi}(x)=(\varphi_{2}\circ\varphi_{4}\circ\cdots\circ\varphi_{n/2}\circ\varphi_{n}\circ\psi)(x). (E.4)

By defining

Y=(y1Ty2TynT),m=(m1m2mn),Y=\begin{pmatrix}y_{1}^{\hbox{\it\tiny T}}\\ y_{2}^{\hbox{\it\tiny T}}\\ \vdots\\ y_{n}^{\hbox{\it\tiny T}}\end{pmatrix},m=\begin{pmatrix}m_{1}\\ m_{2}\\ \vdots\\ m_{n}\end{pmatrix},

we can write the map ψ\psi as

ψ(x)=Yx+m.\psi(x)=Y\cdot x+m. (E.5)

Moreover, thanks to the following simple equivalent formulation of the maximum function:

max(a,b)\displaystyle\max(a,b) =ReLU(ab)+ReLU(b)ReLU(b)\displaystyle=\textrm{ReLU}(a-b)+\textrm{ReLU}(b)-\textrm{ReLU}(-b)
=hTReLU(A(ab)),\displaystyle=h^{\hbox{\it\tiny T}}\cdot\textrm{ReLU}\left(A\begin{pmatrix}a\\ b\end{pmatrix}\right),

where

A=(110101),h=(111),A=\begin{pmatrix}1&-1\\ 0&1\\ 0&-1\end{pmatrix},\quad h=\begin{pmatrix}1\\ 1\\ -1\end{pmatrix},

we can express the map φn\varphi_{n} in terms of a two-layer neural network as follows

φn(z)=Hn/2ReLU(Anz),\varphi_{n}(z)=H_{n/2}\cdot\textrm{ReLU}(A_{n}\cdot z), (E.6)

where An=n/2AA_{n}=\oplus^{n/2}A denotes the block-diagonal matrix consisting of n/2n/2 copies of AA (so that An:n3n/2A_{n}:{\mathbb{R}}^{n}{\rightarrow}{\mathbb{R}}^{3n/2}) and Hn=nhTH_{n}=\oplus^{n}h^{\hbox{\it\tiny T}} denotes the block-diagonal matrix consisting of nn copies of the row vector hTh^{\hbox{\it\tiny T}} (so that Hn:3nnH_{n}:{\mathbb{R}}^{3n}{\rightarrow}{\mathbb{R}}^{n}). Finally, by combining (E.4), (E.5) and (E.6), one sees that φ¯\bar{\varphi} can be expressed in terms of a DNN of width 𝒪(n)\mathcal{O}(n) and depth 𝒪(logn)\mathcal{O}(\log n) with parameters (Wℓ,bℓ)ℓ=0L(W^{\ell},b^{\ell})_{\ell=0}^{L}, L=log2nL=\log_{2}n, defined by

W0=AnY,b0=Anm,\displaystyle W^{0}=A_{n}\cdot Y,\quad b^{0}=A_{n}\cdot m,
W1=An/2Hn/2,b1=0,\displaystyle W^{1}=A_{n/2}\cdot H_{n/2},\quad b^{1}=0,
W2=An/4Hn/4,b2=0,\displaystyle W^{2}=A_{n/4}\cdot H_{n/4},\quad b^{2}=0,
,\displaystyle\cdots,
WL1=A2H2,bL1=0,\displaystyle W^{L-1}=A_{2}\cdot H_{2},\quad b^{L-1}=0,
WL=H1,bL=0.\displaystyle W_{L}=H_{1},\quad b^{L}=0.

In the general case where log2n\log_{2}n\notin{\mathbb{N}}, we set k=log2nk=\left\lceil\log_{2}n\right\rceil, so that kk is the smallest integer such that 2k>n2^{k}>n. By setting yj=yny_{j}=y_{n} and mj=mnm_{j}=m_{n} for j=n+1,,2kj=n+1,\cdots,2^{k} (so that the added affine pieces duplicate an existing one and the maximum is unchanged), we may still write φ¯(x)=maxj=1,,2k{xyj+mj}\bar{\varphi}(x)=\max_{j=1,\cdots,2^{k}}\{x\cdot y_{j}+m_{j}\}, and the analysis above directly applies.
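The construction above is easy to verify programmatically. The following sketch (our illustration; it assumes n is a power of two, consistent with the padding argument just described, and all function names are ours) assembles the weight matrices W^ℓ, b^ℓ from given {y_j} and {m_j} via the block-diagonal matrices A_n and H_n, and checks that the resulting ReLU network reproduces φ̄(x)=max_j{x·y_j+m_j}.

```python
import numpy as np
from scipy.linalg import block_diag

# Building blocks from (E.6): h . ReLU(A (a, b)^T) = max(a, b).
A = np.array([[1.0, -1.0],
              [0.0, 1.0],
              [0.0, -1.0]])
h = np.array([[1.0, 1.0, -1.0]])

def max_affine_relu_net(Y, m):
    # Weights (W^l, b^l) of the ReLU network representing
    # phi_bar(x) = max_j (x . y_j + m_j); assumes n = len(Y) is a power of two.
    n = len(Y)
    A_n = block_diag(*([A] * (n // 2)))              # A_n: n/2 copies of A
    weights, biases = [A_n @ Y], [A_n @ m]           # W^0 = A_n Y,  b^0 = A_n m
    width = n // 2
    while width >= 1:
        H = block_diag(*([h] * width))               # H_width: collapses ReLU triples to maxima
        if width > 1:
            W = block_diag(*([A] * (width // 2))) @ H    # W^l = A_width H_width
        else:
            W = H                                        # last layer W^L = H_1
        weights.append(W)
        biases.append(np.zeros(W.shape[0]))
        width //= 2
    return weights, biases

def forward(x, weights, biases):
    z = weights[0] @ x + biases[0]
    for W, b in zip(weights[1:], biases[1:]):
        z = W @ np.maximum(z, 0.0) + b               # ReLU between consecutive affine layers
    return z.item()

rng = np.random.default_rng(5)
d, n = 3, 8
Y, m = rng.standard_normal((n, d)), rng.standard_normal(n)
weights, biases = max_affine_relu_net(Y, m)
x = rng.standard_normal(d)
print(forward(x, weights, biases), np.max(Y @ x + m))   # the two values agree
```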

Appendix F Proof of Main Theorem 2.1

The proof follows directly from Theorem D.1 and Theorem 4.1. Indeed, on the one hand, the quantitative estimates for the convergence of the empirical measure PnP_{n} translate directly into the sample complexity bounds in Theorem 2.1 for a given error ε{\varepsilon}. On the other hand, Theorem 4.1 provides a push-forward map, given by the gradient of a DNN, from pzp_{z} to the empirical measure PnP_{n}.