
A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions

Yulong Lu
Department of Mathematics and Statistics
University of Massachusetts Amherst
Amherst, MA 01003
lu@math.umass.edu
Jianfeng Lu
Mathematics Department
Duke University
Durham, NC 27708
jianfeng@math.duke.edu
Abstract

This paper studies the universal approximation property of deep neural networks for representing probability distributions. Given a target distribution $\pi$ and a source distribution $p_z$ both defined on $\mathbb{R}^d$, we prove under some assumptions that there exists a deep neural network $g:\mathbb{R}^d\rightarrow\mathbb{R}$ with ReLU activation such that the push-forward measure $(\nabla g)_{\#}p_z$ of $p_z$ under the map $\nabla g$ is arbitrarily close to the target measure $\pi$. The closeness is measured by three classes of integral probability metrics between probability distributions: the $1$-Wasserstein distance, maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD). We prove upper bounds for the size (width and depth) of the deep neural network in terms of the dimension $d$ and the approximation error $\varepsilon$ with respect to the three discrepancies. In particular, the size of the neural network can grow exponentially in $d$ when the $1$-Wasserstein distance is used as the discrepancy, whereas for both MMD and KSD the size of the neural network depends on $d$ at most polynomially. Our proof relies on convergence estimates of empirical measures under the aforementioned discrepancies and on semi-discrete optimal transport.

1 Introduction

In recent years, deep learning has achieved unprecedented success in numerous machine learning problems [31, 54]. The success of deep learning is largely attributed to the usage of deep neural networks (DNNs) for representing and learning the unknown structures in machine learning tasks, which are usually modeled by some unknown function mappings or unknown probability distributions. The effectiveness of using neural networks (NNs) in approximating functions has been justified rigorously in the last three decades. Specifically, a series of early works [13, 18, 25, 7] on universal approximation theorems show that a continuous function defined on a bounded domain can be approximated by a sufficiently large shallow (two-layer) neural network. In particular, the result by [7] quantifies the approximation error of shallow neural networks in terms of the decay property of the Fourier transform of the function of interest. Recently, the expressive power of DNNs for approximating functions has received increasing attention starting from the works by [37] and [60]; see also [61, 48, 50, 44, 51, 15, 43] for more recent developments. The theoretical benefits of using deep neural networks over shallow neural networks have been demonstrated in a sequence of depth separation results; see e.g. [17, 55, 57, 14].

Compared to a vast number of theoretical results on neural networks for approximating functions, the use of neural networks for expressing distributions is far less understood on the theoretical side. The idea of using neural networks for modeling distributions underpins an important class of unsupervised learning techniques called generative models, where the goal is to approximate or learn complex probability distributions from the training samples drawn from the distributions. Typical generative models include Variational Autoencoders [29], Normalizing Flows [49] and Generative Adversarial Networks (GANs) [19], just to name a few. In these generative models, the probability distribution of interest can be very complex or computationally intractable, and is usually modelled by transforming a simple distribution using some map parameterized by a (deep) neural network. In particular, a GAN consists of a game between a generator and a discriminator which are represented by deep neural networks: the generator attempts to generate fake samples whose distribution is indistinguishable from the real distribution, and it generates samples by mapping samples from a simple input distribution (e.g. Gaussian) via a deep neural network; the discriminator attempts to learn how to tell the fake apart from the real. Despite the great empirical success of GANs in various applications, their theoretical analysis is far from complete. Existing theoretical works on GANs are mainly focused on the trade-off between the generator and the discriminator (see e.g. [42, 2, 3, 38, 5]). The key message from these works is that the discriminator family needs to be chosen appropriately according to the generator family in order to obtain a good generalization error.

Our contributions. In this work, we focus on an even more fundamental question on GANs and other generative models which is not yet fully addressed. Namely, how well can DNNs express probability distributions? We shall answer this question by making the following contributions.

  • Given a fairly general source distribution and a target distribution defined on $\mathbb{R}^d$ which satisfies certain integrability assumptions, we show that there is a ReLU DNN with $d$ inputs and one output such that the push-forward of the source distribution via the gradient of the output function defined by the DNN is arbitrarily close to the target. We measure the closeness between probability distributions by three integral probability metrics (IPMs): 1-Wasserstein metric, maximum mean discrepancy and kernelized Stein discrepancy.

  • Given a desired approximation error $\varepsilon$, we prove complexity upper bounds for the depth and width of the DNN needed to attain the given approximation error with respect to the three IPMs mentioned above; our complexity upper bounds are given with explicit dependence on the dimension $d$ of the target distribution and the approximation error $\varepsilon$.

It is also worth mentioning that the DNN constructed in the paper is explicit: the output function of the DNN is the maximum of finitely many (multivariate) affine functions, with the affine parameters determined explicitly in terms of the source measure and target measure.

Related work. Let us discuss some previous related work and compare the results there with ours. Kong and Chaudhuri [30] analyzed the expressive power of normalizing flow models under the $L^1$-norm of distributions and showed that the flow models have limited approximation capability in high dimensions. We show that feedforward DNNs can approximate a general class of distributions in high dimensions with respect to three IPMs. Lin and Jegelka [39] studied the universal approximation of certain ResNets for multivariate functions; the approximation result there however was not quantitative and did not consider the universal approximation of ResNets for distributions. The work by Bailey and Telgarsky [6] considered the approximation power of DNNs for expressing uniform and Gaussian distributions in the Wasserstein distance, whereas we prove quantitative approximation results for fairly general distributions under three IPMs. The work by Lee et al. [32] is closest to ours, where the authors considered a class of probability distributions that are given as push-forwards of a base distribution by a class of Barron functions, and showed that those distributions can be approximated in Wasserstein metrics by push-forwards of NNs, essentially relying on the ability of NNs for approximating Barron functions. It is however not clear what probability distributions are given by push-forward of a base one by Barron functions. In this work, we provide more explicit and direct criteria on the target distributions.

Notations. Let us introduce several definitions and notations to be used throughout the paper. We start with the definition of a fully connected and feed-forward neural network.

Definition 1.1.

A (fully connected and feed-forward) neural network of $L$ hidden layers takes an input vector $x\in\mathbb{R}^{N_0}$, outputs a vector $y\in\mathbb{R}^{N_{L+1}}$ and has $L$ hidden layers of sizes $N_1,N_2,\cdots,N_L$. The neural network is parametrized by the weight matrices $W^\ell\in\mathbb{R}^{N_\ell\times N_{\ell-1}}$ and bias vectors $b^\ell\in\mathbb{R}^{N_\ell}$ with $\ell=1,2,\cdots,L+1$. The output $y$ is defined from the input $x$ iteratively according to the following.

x^{0}=x, \qquad (1.1)
x^{\ell}=\sigma(W^{\ell}x^{\ell-1}+b^{\ell}),\quad 1\leq\ell\leq L,
y=W^{L+1}x^{L}+b^{L+1}.

Here $\sigma$ is a (nonlinear) activation function which acts on a vector $x$ component-wise, i.e. $[\sigma(x)]_i=\sigma(x_i)$. When $N_1=N_2=\cdots=N_L=N$, we say the network has width $N$ and depth $L$. The neural network is said to be a deep neural network (DNN) if $L\geq 2$. The function defined by the neural network is denoted by $\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$.

Popular choices of activation functions $\sigma$ include the rectified linear unit (ReLU) function $\mathrm{ReLU}(x)=\max(x,0)$ and the sigmoid function $\mathrm{Sigmoid}(x)=(1+e^{-x})^{-1}$.
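
To make Definition 1.1 concrete, the following is a minimal NumPy sketch of the forward pass of such a network with ReLU activation; the layer sizes and random parameters are placeholder choices, not the construction used later in the paper.

import numpy as np

def relu(x):
    # ReLU activation applied component-wise
    return np.maximum(x, 0.0)

def dnn_forward(x, weights, biases):
    """Evaluate DNN({W^l, b^l}_{l=1}^{L+1}) at input x (Definition 1.1).

    weights[l] has shape (N_{l+1}, N_l) and biases[l] has shape (N_{l+1},);
    the L hidden layers use ReLU and the last layer is purely affine.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # L hidden layers
        h = relu(W @ h + b)
    W, b = weights[-1], biases[-1]                 # output layer, no activation
    return W @ h + b

# Toy example: d = 3 inputs, two hidden layers of width 4, scalar output.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
print(dnn_forward(rng.standard_normal(3), weights, biases))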

Given a matrix $A$, let us denote its $n$-fold direct sum by $\oplus^n A=\overbrace{A\oplus A\oplus\cdots\oplus A}^{n\text{ times}}=\operatorname{diag}(A,\cdots,A)$. We denote by $\mathcal{P}_2(\mathbb{R}^d)$ the space of probability measures with finite second moment. Given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, a transport map $T$ between $\mu$ and $\nu$ is a measurable map $T:\mathbb{R}^d\rightarrow\mathbb{R}^d$ such that $\nu=T_{\#}\mu$, where $T_{\#}\mu$ denotes the push-forward of $\mu$ under the map $T$, i.e., for any measurable $A\subset\mathbb{R}^d$, $\nu(A)=\mu(T^{-1}(A))$. We denote by $\Gamma(\mu,\nu)$ the set of transport plans between $\mu$ and $\nu$, which consists of all coupling measures $\gamma$ of $\mu$ and $\nu$, i.e., $\gamma(A\times\mathbb{R}^d)=\mu(A)$ and $\gamma(\mathbb{R}^d\times B)=\nu(B)$ for any measurable $A,B\subset\mathbb{R}^d$. We may use $C,C_1,C_2$ to denote generic constants which do not depend on any quantities of interest (e.g. the dimension $d$).

The rest of the paper is organized as follows. We describe the problem and state the main result in Section 2. Section 3 and Section 4 are devoted to the two ingredients for proving the main result: convergence of empirical measures in IPMs, and the construction of neural-network-based maps between the source measure and empirical measures via semi-discrete optimal transport, respectively. Proofs of lemmas and intermediate results are provided in the appendices.

2 Problem Description and Main Result

Let $\pi(x)$ be the target probability distribution defined on $\mathbb{R}^d$ which one would like to learn or generate samples from. In the framework of GANs, one is interested in representing the distribution $\pi$ implicitly by a generative neural network. Specifically, let $\mathcal{G}_{\mathrm{NN}}\subset\{g:\mathbb{R}^d\rightarrow\mathbb{R}^d\}$ be a subset of generators (transformations), which are defined by neural networks. The concrete form of $\mathcal{G}_{\mathrm{NN}}$ is to be specified later. Let $p_z$ be a source distribution (e.g. standard normal). The push-forward of $p_z$ under the transformation $g\in\mathcal{G}_{\mathrm{NN}}$ is denoted by $p_x=g_{\#}p_z$. In a GAN problem, one aims to find $g\in\mathcal{G}_{\mathrm{NN}}$ such that $g_{\#}p_z\approx\pi$. In mathematical language, GANs can be formulated as the following minimization problem:

\inf_{g\in\mathcal{G}_{\mathrm{NN}}} D(g_{\#}p_z,\pi) \qquad (2.1)

where $D(p,\pi)$ is some discrepancy measure between probability measures $p$ and $\pi$, which typically takes the form of an integral probability metric (IPM) or adversarial loss defined by

D(p,\pi)=d_{\mathcal{F}_D}(p,\pi):=\sup_{f\in\mathcal{F}_D}\Big|\mathbf{E}_{X\sim p}f(X)-\mathbf{E}_{X\sim\pi}f(X)\Big|, \qquad (2.2)

where $\mathcal{F}_D$ is a certain class of test (or witness) functions. As a consequence, GANs can be formulated as the minimax problem

\inf_{g\in\mathcal{G}_{\mathrm{NN}}}\sup_{f\in\mathcal{F}_D}\Big|\mathbf{E}_{X\sim g_{\#}p_z}f(X)-\mathbf{E}_{X\sim\pi}f(X)\Big|.

The present paper aims to answer the following fundamental questions on GANs:

(1) Is there a neural-network-based generator $g\in\mathcal{G}_{\mathrm{NN}}$ such that $D(g_{\#}p_z,\pi)\approx 0$?

(2) How to quantify the complexity (e.g. depth and width) of the neural network?

As we shall see below, the answers to the questions above depend on the IPM $D$ used to measure the discrepancy between distributions. In this paper, we are interested in three IPMs which are commonly used in GANs, namely the $1$-Wasserstein distance [59, 1], maximum mean discrepancy [21, 16, 36] and kernelized Stein discrepancy [40, 12, 27].

Wasserstein Distance: When the witness class $\mathcal{F}_D$ is chosen as the class of 1-Lipschitz functions, i.e. $\mathcal{F}_D:=\{f:\mathbb{R}^d\rightarrow\mathbb{R}:\operatorname{Lip}(f)\leq 1\}$, the resulting IPM $d_{\mathcal{F}_D}$ becomes the 1-Wasserstein distance (also known as the Kantorovich-Rubinstein distance):

\mathcal{W}_1(p,\pi)=\inf_{\gamma\in\Gamma(p,\pi)}\int |x-y|\,\gamma(dx\,dy).

The Wasserstein-GAN proposed by [1] leverages the Wasserstein distance as the objective function to improve the stability of training of the original GAN based on the Jensen-Shannon divergence. Nevertheless, it has been shown that Wasserstein-GAN still suffers from the mode collapse issue [23] and does not generalize with any polynomial number of training samples [2].

Maximum Mean Discrepancy (MMD): When $\mathcal{F}_D$ is the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$, i.e. $\mathcal{F}_D:=\{f\in\mathcal{H}_k:\|f\|_{\mathcal{H}_k}\leq 1\}$, the resulting IPM $d_{\mathcal{F}_D}$ coincides with the maximum mean discrepancy (MMD) [21]:

\operatorname{MMD}(p,\pi)=\sup_{\|f\|_{\mathcal{H}_k}\leq 1}\Big|\mathbf{E}_{X\sim p}f(X)-\mathbf{E}_{X\sim\pi}f(X)\Big|.

GANs based on minimizing MMD as the loss function were first proposed in [16, 36]. Since MMD is a weaker metric than the $1$-Wasserstein distance, MMD-GANs also suffer from the mode collapse issue, but empirical results (see e.g. [8]) suggest that they require smaller discriminative networks and hence enable faster training than Wasserstein-GANs.
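
For intuition, MMD between two samples can be estimated in closed form from kernel matrices. Below is a short, illustrative NumPy sketch of the standard plug-in estimator of MMD between two empirical measures, using the Gaussian kernel $k(x,y)=e^{-\frac{1}{2}|x-y|^2}$; the sample sizes and data are placeholders.

import numpy as np

def gaussian_kernel(X, Y):
    # k(x, y) = exp(-|x - y|^2 / 2), evaluated on all pairs of rows of X and Y
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-0.5 * sq_dists)

def mmd(X, Y):
    """Plug-in estimate of MMD between the empirical measures of X and Y."""
    Kxx = gaussian_kernel(X, X)
    Kyy = gaussian_kernel(Y, Y)
    Kxy = gaussian_kernel(X, Y)
    mmd_sq = Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
    return np.sqrt(max(mmd_sq, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))            # samples from the source
Y = rng.standard_normal((500, 2)) + 1.0      # samples from a shifted target
print(mmd(X, Y))                             # noticeably larger than for two matched samples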

Kernelized Stein Discrepancy (KSD): If the witness class $\mathcal{F}_D$ is chosen as $\mathcal{F}_D:=\{\mathcal{T}_\pi f: f\in\mathcal{H}_k \text{ and } \|f\|_{\mathcal{H}_k}\leq 1\}$, where $\mathcal{T}_\pi$ is the Stein operator defined by

\mathcal{T}_\pi f:=\nabla\log\pi\cdot f+\nabla\cdot f, \qquad (2.3)

the associated IPM $d_{\mathcal{F}_D}$ becomes the kernelized Stein discrepancy (KSD) [40, 12]:

\operatorname{KSD}(p,\pi)=\sup_{\|f\|_{\mathcal{H}_k}\leq 1}\mathbf{E}_{X\sim p}[\mathcal{T}_\pi f(X)].

The KSD has gained great popularity in machine learning and statistics since the quantity $\operatorname{KSD}(p,\pi)$ is easy to compute and does not depend on the normalization constant of $\pi$, which makes it suitable for statistical computation, such as hypothesis testing [20] and statistical sampling [41, 11]. The recent paper [27] adopts the GAN formulation (2.1) with KSD as the training loss to construct a new sampling algorithm called the Stein Neural Sampler.
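
When $p$ is an empirical measure, $\operatorname{KSD}(P_n,\pi)$ has the closed form recalled in (C.1) of Appendix C in terms of the Stein kernel $u_\pi$. The following NumPy sketch evaluates that formula for a standard Gaussian target (so $s_\pi(x)=-x$) with the Gaussian kernel; the target and the samples are illustrative choices only.

import numpy as np

def ksd_gaussian_target(X):
    """KSD(P_n, pi) from (C.1) for pi = N(0, I_d) and k(x, y) = exp(-|x - y|^2 / 2).

    Uses s_pi(x) = grad log pi(x) = -x and the closed-form derivatives of k.
    """
    n, d = X.shape
    S = -X                                    # scores s_pi(X_i) for the Gaussian target
    diff = X[:, None, :] - X[None, :, :]      # diff[i, j] = X_i - X_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-0.5 * sq)                     # kernel matrix k(X_i, X_j)
    # u_pi(x, y) = s(x)^T k s(y) + s(x)^T grad_y k + s(y)^T grad_x k + Tr(grad_x grad_y k)
    term1 = (S @ S.T) * K
    term2 = np.einsum('id,ijd->ij', S, diff) * K     # grad_y k(x, y) = (x - y) k(x, y)
    term3 = np.einsum('jd,ijd->ij', S, -diff) * K    # grad_x k(x, y) = (y - x) k(x, y)
    term4 = (d - sq) * K                             # Tr(grad_x grad_y k) = (d - |x - y|^2) k
    U = term1 + term2 + term3 + term4
    return np.sqrt(max(U.mean(), 0.0))        # KSD(P_n, pi) = sqrt(n^{-2} sum_{i,j} u_pi(X_i, X_j))

rng = np.random.default_rng(0)
print(ksd_gaussian_target(rng.standard_normal((300, 5))))        # small: samples match pi
print(ksd_gaussian_target(rng.standard_normal((300, 5)) + 1.0))  # larger: shifted samples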

2.1 Main result

Throughout the paper, we consider the following assumptions on the reproducing kernel $k$:

Assumption K1.

The kernel $k$ is integrally strictly positive definite: for all finite non-zero signed Borel measures $\mu$ defined on $\mathbb{R}^d$,

\iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y)>0.
Assumption K2.

There exists a constant $K_0>0$ such that

\sup_{x\in\mathbb{R}^d}|k(x,x)|\leq K_0. \qquad (2.4)
Assumption K3.

The kernel function $k:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}$ is twice differentiable and there exists a constant $K_1>0$ such that

\max_{m+n\leq 1}\sup_{x,y}\|\nabla_x^m\nabla_y^n k(x,y)\|\leq K_1 \quad\text{ and }\quad \sup_{x,y}|\operatorname{Tr}(\nabla_x\nabla_y k(x,y))|\leq K_1(1+d). \qquad (2.5)

According to [54, Theorem 7], Assumption K1 is necessary and sufficient for the kernel being characteristic, i.e., $\operatorname{MMD}(\mu,\nu)=0$ implies $\mu=\nu$, which guarantees that MMD is a metric. In addition, thanks to [40, Proposition 3.3], KSD is a valid discrepancy measure under Assumption K1, namely $\operatorname{KSD}(p,\pi)\geq 0$ and $\operatorname{KSD}(p,\pi)=0$ if and only if $p=\pi$.

Assumption K2 will be used to obtain an error bound for $\operatorname{MMD}(P_n,\pi)$; see Proposition 3.2. Assumption K3 will be crucial for bounding $\operatorname{KSD}(P_n,\pi)$; see Proposition 3.3. Many commonly used kernel functions fulfill all three assumptions K1-K3, including for example the Gaussian kernel $k(x,y)=e^{-\frac{1}{2}|x-y|^2}$ and the inverse multiquadric (IMQ) kernel $k(x,y)=(c+|x-y|^2)^{\beta}$ with $c>0$ and $\beta<0$. Unfortunately, Matérn kernels [46] only satisfy Assumptions K1-K2, but not Assumption K3, since the second order derivatives of $k$ are singular on the diagonal, so the last estimate of (2.5) is violated.

In order to bound $\operatorname{KSD}(P_n,\pi)$, we need to assume further that the target measure $\pi$ satisfies the following regularity and integrability assumptions. We will use the shorthand notation $s_\pi(x)=\nabla\log\pi(x)$.

Assumption 1 ($L$-Lipschitz).

Assume that $s_\pi(x)$ is globally Lipschitz in $\mathbb{R}^d$, i.e. there exists a constant $\tilde{L}>0$ such that $|s_\pi(x)-s_\pi(y)|\leq\tilde{L}|x-y|$ for all $x,y\in\mathbb{R}^d$. As a result, there exists $L>0$ such that

|s_\pi(x)|\leq L(1+|x|)\quad\text{ for all }x\in\mathbb{R}^d. \qquad (2.6)
Assumption 2 (sub-Gaussian).

The probability measure $\pi$ is sub-Gaussian, i.e. there exist $m=(m_1,\cdots,m_d)\in\mathbb{R}^d$ and $\upsilon>0$ such that

\mathbf{E}_{X\sim\pi}[\exp(\alpha^{T}(X-m))]\leq\exp(|\alpha|^2\upsilon^2/2)\quad\text{ for all }\alpha\in\mathbb{R}^d.

Assume further that $\max_i|m_i|\leq m^{\ast}$ for some $m^{\ast}>0$.

Our main result is the universal approximation theorem for expressing probability distributions.

Theorem 2.1 (Main theorem).

Let $\pi$ and $p_z$ be the target and the source distributions respectively, both defined on $\mathbb{R}^d$. Assume that $p_z$ is absolutely continuous with respect to the Lebesgue measure. Then under certain assumptions on $\pi$ and the kernel $k$ to be specified below, it holds that for any given approximation error $\varepsilon$, there exists a positive integer $n$ and a fully connected and feed-forward deep neural network $u=\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$ of depth $L=\lceil\log_2 n\rceil$ and width $N=2^L=2^{\lceil\log_2 n\rceil}$, with $d$ inputs and a single output and with ReLU activation, such that $d_{\mathcal{F}_D}((\nabla u)_{\#}p_z,\pi)\leq\varepsilon$. The complexity parameter $n$ depends on the choice of the metric $d_{\mathcal{F}_D}$. Specifically,

1. Consider $d_{\mathcal{F}_D}=\mathcal{W}_1$. If $\pi$ satisfies $M_3=\mathbf{E}_{X\sim\pi}|X|^3<\infty$, it holds that

n\leq\begin{cases}\frac{C}{\varepsilon^2},&d=1,\\ \frac{C\log^2(\varepsilon)}{\varepsilon^2},&d=2,\\ \frac{C^d}{\varepsilon^d},&d\geq 3,\end{cases}

where the constant $C$ depends only on $M_3$.

2. Consider $d_{\mathcal{F}_D}=\mathrm{MMD}$ with kernel $k$. If $k$ satisfies Assumption K2, then

n\leq\frac{C}{\varepsilon^2}

with a constant $C$ depending only on the constant $K_0$ in (2.4).

3. Consider $d_{\mathcal{F}_D}=\mathrm{KSD}$ with kernel $k$. If $k$ satisfies Assumption K3 with constant $K_1$ and if $\pi$ satisfies Assumption 1 and Assumption 2 with parameters $L,m,\upsilon$, then

n\leq\frac{Cd}{\varepsilon^2},

where the constant $C$ depends only on $L,K_1,m^{\ast},\upsilon$, but not on $d$.

Theorem 2.1 states that a given probability measure $\pi$ (with certain integrability assumptions) can be approximated arbitrarily well by push-forwarding a source distribution with the gradient of a potential which can be parameterized by a finite DNN. The complexity bound in the Wasserstein case suffers from the curse of dimensionality, whereas this issue is eliminated in the cases of MMD and KSD. We remark that our complexity bound for the neural network is stated in terms of the width and depth, which leads to an estimate of the number of weights needed to achieve a given approximation error.

Proof strategy. The proof of Theorem 2.1 is given in Appendix F and relies on two ingredients: (1) one approximates the target $\pi$ by the empirical measure $P_n$; Propositions 3.1-3.3 give quantitative estimates w.r.t. the three IPMs defined above; (2) based on the theory of (semi-discrete) optimal transport, one can build an optimal transport map of the form $T=\nabla\varphi$ which pushes forward the source distribution $p_z$ to the empirical distribution $P_n$. Moreover, the potential $\varphi$ is explicit: it is the maximum of finitely many affine functions; it is this explicit structure that enables one to represent the function $\varphi$ with a finite deep neural network; see Theorem 4.1 for the precise statement.

It is interesting to remark that our strategy of proving Theorem 2.1 shares the same spirit as the one used to prove universal approximation theorems of DNNs for functions [60, 37]. Indeed, both the universal approximation theorems in those works and ours are proved by approximating the target function or distribution with a suitable dense subset (or sieves) on the space of functions or distributions which can be parametrized by deep neural networks. Specifically, in [60, 37] where the goal is to approximate continuous functions on a compact set, the dense sieves are polynomials which can be further approximated by the output functions of DNNs, whereas in our case we use empirical measures as the sieves for approximating distributions, and we show that empirical measures are exactly expressible by transporting a source distribution with neural-network-based transport maps.

We also remark that the push-forward map between probability measures constructed in Theorem 2.1 is the gradient of a potential function given by a neural network, i.e., the neural network is used to parametrize the potential function instead of the map itself, which is perhaps more commonly used in practice. The rationale for using such a gradient formulation lies in the fact that transport maps between two probability measures are discontinuous in general. This occurs particularly when the target measure has disjoint modes (or supports) and the input has a unique mode, which is ubiquitous in GAN applications where images are concentrated on disjoint modes while the input is chosen as Gaussian. In such a case it is impossible to generate the target measure using usual NNs, as the resulting functions are all continuous (in fact Lipschitz for ReLU activation and smooth for Sigmoid activation). Thus, from a theoretical viewpoint, it is advantageous to use NNs to parametrize the Brenier potential, since it is more regular (at least Lipschitz by optimal transport theory), and to use its gradient as the transport map. From a practical perspective, the idea of taking gradients of NNs has already been used and has received increasing attention in practical learning problems; see [34, 24, 45, 56] for applications of such a parameterization in learning optimal transport maps and improving the training of Wasserstein-GANs; see also [62] for learning interatomic potentials and forces for molecular dynamics simulations.

3 Convergence of Empirical Measures in Various IPMs

In this section, we consider the approximation of a given target measure $\pi$ by empirical measures. More specifically, let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence of random samples from the distribution $\pi$ and let $P_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}$ be the empirical measure associated to the samples $\{X_i\}_{i=1}^n$. Our goal is to derive quantitative error estimates of $d_{\mathcal{F}_D}(P_n,\pi)$ with respect to the three IPMs $d_{\mathcal{F}_D}$ described in the last section.

We first state an upper bound on $\mathcal{W}_1(P_n,\pi)$ in the average sense in the next proposition.

Proposition 3.1 (Convergence in $1$-Wasserstein distance).

Consider the IPM with $\mathcal{F}_D=\{f:\mathbb{R}^d\rightarrow\mathbb{R}:\operatorname{Lip}(f)\leq 1\}$. Assume that $\pi$ satisfies $M_3=\mathbb{E}_{X\sim\pi}|X|^3<\infty$. Then there exists a constant $C$ depending on $M_3$ such that

\mathbf{E}\,\mathcal{W}_1(P_n,\pi)\leq C\cdot\begin{cases}n^{-1/2},&d=1,\\ n^{-1/2}\log n,&d=2,\\ n^{-1/d},&d\geq 3.\end{cases}

The convergence rates of $\mathcal{W}_1(P_n,\pi)$ stated in Proposition 3.1 are well-known in the statistics literature. The statement in Proposition 3.1 is a combination of results from [9] and [33]; see Appendix A for a short proof. We remark that the prefactor constant $C$ in the estimate above can be made explicit. In fact, one can easily obtain from the moment bound in Proposition C.1 of Appendix C that if $\pi$ is sub-Gaussian with parameters $m$ and $\upsilon$, then the constant $C$ can be chosen as $C=C'\sqrt{d}$, with some constant $C'$ depending only on $\upsilon$ and $\|m\|_\infty$. Moreover, one can also obtain a high probability bound for $\mathcal{W}_1(P_n,\pi)$ if $\pi$ is sub-exponential (see e.g., [33, Corollary 5.2]). Here we content ourselves with the expectation result, as it comes with weaker assumptions and also suffices for our purpose of showing the existence of an empirical measure with the desired approximation rate.
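
As a quick illustration of the $d=1$ rate in Proposition 3.1, the sketch below approximates $\mathbf{E}\,\mathcal{W}_1(P_n,\pi)$ for a standard Gaussian target using the one-dimensional identity $\mathcal{W}_1(P_n,\pi)=\int|F_n(x)-F(x)|\,dx$, where $F_n$ and $F$ are the empirical and target CDFs; the grid, sample sizes and number of repetitions are arbitrary choices.

import math
import numpy as np

def w1_empirical_vs_gaussian(sample, grid):
    # W_1(P_n, pi) = \int |F_n(x) - F(x)| dx in d = 1, here with pi = N(0, 1)
    F_n = np.searchsorted(np.sort(sample), grid, side='right') / len(sample)
    F = np.array([0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in grid])
    dx = grid[1] - grid[0]
    return np.sum(np.abs(F_n - F)) * dx     # Riemann-sum approximation of the integral

rng = np.random.default_rng(0)
grid = np.linspace(-8.0, 8.0, 2001)
for n in [100, 400, 1600, 6400]:
    est = np.mean([w1_empirical_vs_gaussian(rng.standard_normal(n), grid)
                   for _ in range(50)])
    print(n, est, est * math.sqrt(n))       # last column stays roughly constant: the n^{-1/2} rate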

Moving on to approximation in MMD, the following proposition gives a high-probability non-asymptotic error bound on $\operatorname{MMD}(P_n,\pi)$.

Proposition 3.2 (Convergence in MMD).

Consider the IPM with $\mathcal{F}_D=\{f\in\mathcal{H}_k:\|f\|_{\mathcal{H}_k}\leq 1\}$. Assume that the kernel $k$ satisfies Assumption K2 with constant $K_0$. Then for every $\tau>0$, with probability at least $1-2e^{-\tau}$,

\operatorname{MMD}(P_n,\pi)\leq 2\sqrt{\frac{\sqrt{K_0}}{n}}+3\sqrt{\frac{2\sqrt{K_0}\tau}{n}}.

Proposition 3.2 can be viewed as a special case of [53, Theorem 3.3], where the kernel class is a singleton. Since its proof is short, we provide it in Appendix B for completeness.

In the next proposition, we consider the convergence of the empirical measure $P_n$ to $\pi$ in KSD. To the best of our knowledge, this is the first estimate on empirical measures under the KSD in the literature. This result can be useful for obtaining quantitative error bounds for the new GAN/sampler called the Stein Neural Sampler [27]. The proof relies on a Bernstein type inequality for the distribution of von Mises' statistics; the details are deferred to Appendix C.

Proposition 3.3 (Convergence in KSD).

Consider the IPM with $\mathcal{F}_D:=\{\mathcal{T}_\pi f: f\in\mathcal{H}_k \text{ and }\|f\|_{\mathcal{H}_k}\leq 1\}$, where $\mathcal{T}_\pi$ is the Stein operator defined in (2.3). Suppose that the kernel $k$ satisfies Assumption K3 with constant $K_1$. Suppose also that $\pi$ satisfies Assumption 1 and Assumption 2. Then for any $\delta>0$ there exists a constant $C=C(L,K_1,m^{\ast},\delta)$ such that with probability at least $1-\delta$,

\operatorname{KSD}(P_n,\pi)\leq C\sqrt{\frac{d}{n}}. \qquad (3.1)
Remark 3.1.

Proposition 3.3 provides a non-asymptotic high-probability error bound for the convergence of the empirical measure $P_n$ to $\pi$ in KSD. Our result implies in particular that $\operatorname{KSD}(P_n,\pi)\rightarrow 0$ with the asymptotic rate $\mathcal{O}(\sqrt{d/n})$. We also remark that the rate $\mathcal{O}(n^{-1/2})$ is optimal and is consistent with the asymptotic CLT result for the corresponding U-statistics of $\operatorname{KSD}^2(P_n,\pi)$ (see [40, Theorem 4.1 (2)]).

4 Constructing Transport Maps via Semi-discrete Optimal Transport

In this section, we aim to build a neural-network-based map which pushes forward a given source distribution to discrete probability measures, including in particular the empirical measures. The main result of this section is the following theorem.

Theorem 4.1.

Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ be absolutely continuous with respect to the Lebesgue measure with Radon-Nikodym density $\rho(x)$. Let $\nu=\sum_{i=1}^n\nu_i\delta_{y_i}$ for some $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$, $\nu_j\geq 0$ and $\sum_{j=1}^n\nu_j=1$. Then there exists a transport map of the form $T=\nabla u$ such that $T_{\#}\mu=\nu$, where $u$ is a fully connected deep neural network of depth $L=\lceil\log_2 n\rceil$ and width $N=2^L=2^{\lceil\log_2 n\rceil}$, with ReLU activation function and parameters $\{W^\ell,b^\ell\}_{\ell=1}^{L+1}$ such that $u=\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$.

As shown below, the transport map in Theorem 4.1 is chosen as the optimal transport map from the continuous distribution $\mu$ to the discrete distribution $\nu$, which turns out to be the gradient of a piecewise linear function, which in turn can be expressed by neural networks. We remark that the weights and biases of the constructed neural network can also be characterized explicitly in terms of $\mu$ and $\nu$ (see the proof of Proposition 4.1). Since semi-discrete optimal transport plays an essential role in the proof of Theorem 4.1, we first recall the set-up and some key results on optimal transport in both the general and the semi-discrete settings.

Optimal transport with quadratic cost. Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^d$ with finite second moments. Let $c(x,y)=\frac{1}{2}|x-y|^2$ be the quadratic cost. Then Monge's [47] optimal transportation problem is to transport the probability mass between $\mu$ and $\nu$ while minimizing the quadratic cost, i.e.

\inf_{T:\mathbb{R}^d\rightarrow\mathbb{R}^d}\int\frac{1}{2}|x-T(x)|^2\mu(dx)\quad\text{ s.t. } T_{\#}\mu=\nu. \qquad (4.1)

A map $T$ attaining the infimum above is called an optimal transport map. In general an optimal transport map may not exist, since Monge's formulation prevents splitting the mass, so that the set of transport maps may be empty. On the other hand, Kantorovich [28] relaxed the problem by minimizing the transportation cost over transport plans instead of transport maps:

\inf_{\gamma\in\Gamma(\mu,\nu)}\mathcal{K}(\gamma):=\inf_{\gamma\in\Gamma(\mu,\nu)}\int\frac{1}{2}|x-y|^2\gamma(dx\,dy). \qquad (4.2)

A coupling $\gamma$ achieving the infimum above is called an optimal coupling. Noting that problem (4.2) above is a linear program, Kantorovich proposed a dual formulation of (4.2):

\sup_{(\varphi,\psi)\in\Phi_c}\mathcal{J}(\varphi,\psi):=\sup_{(\varphi,\psi)\in\Phi_c}\int\varphi\,d\mu+\int\psi\,d\nu,

where $\Phi_c$ is the set of measurable functions $(\varphi,\psi)\in L^1(\mu)\times L^1(\nu)$ satisfying $\varphi(x)+\psi(y)\leq\frac{1}{2}|x-y|^2$. We also define the $c$-transform $\varphi^c:\mathbb{R}^d\rightarrow\mathbb{R}$ of a function $\varphi:\mathbb{R}^d\rightarrow\mathbb{R}$ by

\varphi^c(y)=\inf_{x\in\mathbb{R}^d} c(x,y)-\varphi(x)=\inf_{x\in\mathbb{R}^d}\frac{1}{2}|x-y|^2-\varphi(x).

Similarly, one can define $\psi^c$ associated to $\psi$. Kantorovich's duality theorem (see e.g. [59, Theorem 5.10]) states that

\inf_{\gamma\in\Gamma(\mu,\nu)}\mathcal{K}(\gamma)=\sup_{(\varphi,\psi)\in\Phi_c}\mathcal{J}(\varphi,\psi)=\sup_{\varphi\in L^1(d\mu)}\mathcal{J}(\varphi,\varphi^c)=\sup_{\psi\in L^1(d\nu)}\mathcal{J}(\psi^c,\psi). \qquad (4.3)

Moreover, if the source measure $\mu$ is absolutely continuous with respect to the Lebesgue measure, then the optimal transport map in Monge's problem is given by a gradient field, which is usually referred to as Brenier's map, and can be characterized explicitly in terms of the solution of the dual Kantorovich problem. A precise statement is included in Theorem E.1 in Appendix E.

Semi-discrete optimal transport. Let us now consider the optimal transport problem in the semi-discrete setting: the source measure $\mu$ is continuous and the target measure $\nu$ is discrete. Specifically, assume that $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ is absolutely continuous with respect to the Lebesgue measure, i.e. $\mu(dx)=\rho(x)dx$ for some probability density $\rho$, and $\nu$ is discrete, i.e. $\nu=\sum_{j=1}^n\nu_j\delta_{y_j}$ for some $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$, $\nu_j\geq 0$ and $\sum_{j=1}^n\nu_j=1$. In the semi-discrete setting, Monge's problem becomes

\inf_T\int\frac{1}{2}|x-T(x)|^2\mu(dx)\ \text{ s.t. }\ \int_{T^{-1}(y_j)}d\mu=\nu_j,\ j=1,\cdots,n. \qquad (4.4)

In this case the action of the transport map is clear: it assigns each point $x\in\mathbb{R}^d$ to one of the $y_j$. Moreover, by taking advantage of the discreteness of the measure $\nu$, one sees that the dual Kantorovich problem in the semi-discrete case becomes maximizing the following functional

\mathcal{F}(\psi)=\mathcal{F}(\psi_1,\cdots,\psi_n)=\int\inf_j\Big(\frac{1}{2}|x-y_j|^2-\psi_j\Big)\rho(x)dx+\sum_{j=1}^n\psi_j\nu_j. \qquad (4.5)

Similar to the continuum setting, the optimal transport map of Monge's problem (4.4) can be characterized by the maximizer of $\mathcal{F}$. To see this, let us introduce the important concept of a power diagram (or Laguerre diagram) [4, 52]. Given a finite set of points $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$ and scalars $\psi=\{\psi_j\}_{j=1}^n$, the power diagrams associated to the scalars $\psi$ and the points $\{y_j\}_{j=1}^n$ are the sets

P_j:=\Big\{x\in\mathbb{R}^d\ \Big|\ \frac{1}{2}|x-y_j|^2-\psi_j\leq\frac{1}{2}|x-y_k|^2-\psi_k,\ \forall k\neq j\Big\}. \qquad (4.6)

By grouping the points according to the power diagrams $P_j$, we have from (4.5) that

\mathcal{F}(\psi)=\sum_{j=1}^n\Big[\int_{P_j}\Big(\frac{1}{2}|x-y_j|^2-\psi_j\Big)\rho(x)dx+\psi_j\nu_j\Big]. \qquad (4.7)

The following theorem characterizes the optimal transport map of Monge's problem (4.4) in terms of the power diagrams $P_j$ associated to the points $\{y_j\}_{j=1}^n$ and the maximizer $\psi$ of $\mathcal{F}$.

Theorem 4.2.

Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ be absolutely continuous with respect to the Lebesgue measure with Radon-Nikodym density $\rho(x)$. Let $\nu=\sum_{i=1}^n\nu_i\delta_{y_i}$. Let $\psi=(\psi_1,\cdots,\psi_n)$ be a maximizer of $\mathcal{F}$ defined in (4.7). Denote by $\{P_j\}_{j=1}^n$ the power diagrams associated to $\{y_j\}_{j=1}^n$ and $\psi$. Then the optimal transport map $T$ solving the semi-discrete Monge's problem (4.4) is given by $T(x)=\nabla\bar{\varphi}(x)$, where $\bar{\varphi}(x)=\max_j\{x\cdot y_j+m_j\}$ for some $m_j\in\mathbb{R}$. Specifically, $T(x)=y_j$ if $x\in P_j(\psi)$.

Theorem 4.2 shows that the optimal transport map in the semi-discrete case is achieved by the gradient of a particular piecewise affine function, namely the maximum of finitely many affine functions. A similar result was proved by [22] in the case where $\mu$ is defined on a compact convex domain. We provide a proof of Theorem 4.2, which deals with measures on the whole space $\mathbb{R}^d$, in Appendix E.2.
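
To illustrate Theorem 4.2, the sketch below evaluates the semi-discrete transport map $T(x)=\nabla\bar{\varphi}(x)$ for given points $\{y_j\}$ and offsets $\{m_j\}$: each $x$ is sent to the atom $y_j$ whose affine function $x\cdot y_j+m_j$ attains the maximum, i.e. to the power cell containing $x$. The points, offsets and source samples here are placeholders; in the theorem the $m_j$ come from the maximizer $\psi$ of $\mathcal{F}$.

import numpy as np

def semidiscrete_transport(X, Y, m):
    """T(x) = gradient of max_j (x . y_j + m_j), i.e. T(x) = y_j for x in the j-th power cell."""
    scores = X @ Y.T + m[None, :]       # scores[i, j] = x_i . y_j + m_j
    j_star = np.argmax(scores, axis=1)  # index of the affine piece active at x_i
    return Y[j_star]                    # push x_i to the corresponding atom y_j

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 2))         # atoms y_1, ..., y_n of the discrete target
m = rng.standard_normal(8)              # offsets m_j (determined by psi in Theorem 4.2)
X = rng.standard_normal((10000, 2))     # samples from the source measure mu

TX = semidiscrete_transport(X, Y, m)
# Empirical weights of the push-forward measure T_# mu on the atoms y_j:
print(np.bincount(np.argmax(X @ Y.T + m, axis=1), minlength=8) / len(X))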

The next proposition shows that the piecewise linear function $\max_j\{x\cdot y_j+m_j\}$ defined in Theorem 4.2 can be expressed exactly by a deep neural network.

Proposition 4.1.

Let $\bar{\varphi}(x)=\max_j\{x\cdot y_j+m_j\}$ with $\{y_j\}_{j=1}^n\subset\mathbb{R}^d$ and $\{m_j\}_{j=1}^n\subset\mathbb{R}$. Then there exists a fully connected deep neural network of depth $L=\lceil\log_2 n\rceil$ and width $N=2^L=2^{\lceil\log_2 n\rceil}$, with ReLU activation function and parameters $\{W^\ell,b^\ell\}_{\ell=1}^{L+1}$, such that $\bar{\varphi}=\mathrm{DNN}(\{W^\ell,b^\ell\}_{\ell=1}^{L+1})$.

The proof of Proposition 4.1 can be found in Appendix E.3. Theorem 4.1 is a direct consequence of Theorem 4.2 and Proposition 4.1.
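
A standard way to realize such a maximum with a ReLU network, presumably close to the construction in Appendix E.3, uses the identity $\max(a,b)=\mathrm{ReLU}(a-b)+b$ and reduces the maximum of $n$ affine functions by pairwise maxima over $\lceil\log_2 n\rceil$ levels. The sketch below checks this reduction numerically, applied to the values $x\cdot y_j+m_j$ rather than by assembling the weight matrices explicitly; the data are placeholders.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def max_of_affine_via_relu(x, Y, m):
    """Evaluate max_j (x . y_j + m_j) using only affine maps and ReLU.

    The n values are combined pairwise over ceil(log2 n) levels via
    max(a, b) = ReLU(a - b) + b, mirroring the depth bound of Proposition 4.1.
    """
    v = Y @ x + m                      # first affine layer: all n affine pieces at once
    while len(v) > 1:
        if len(v) % 2 == 1:            # pad with a duplicate so the pairs match up
            v = np.append(v, v[-1])
        a, b = v[0::2], v[1::2]
        v = relu(a - b) + b            # one ReLU layer computes all pairwise maxima
    return v[0]

rng = np.random.default_rng(0)
Y = rng.standard_normal((13, 4))       # n = 13 affine pieces in dimension d = 4
m = rng.standard_normal(13)
x = rng.standard_normal(4)
print(max_of_affine_via_relu(x, Y, m), np.max(Y @ x + m))   # the two values agree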

5 Conclusion

In this paper, we establish that certain general classes of target distributions can be expressed arbitrarily well, with respect to three types of IPMs, by transporting a source distribution with maps that can be parametrized by DNNs. We provide upper bounds for the depths and widths of the DNNs needed to achieve a given approximation error; the upper bounds are established with explicit dependence on the dimension of the underlying distributions and the approximation error.

Broader Impact

This work focuses on theoretical properties of neural networks for expressing probability distributions. Our work can help understand the theoretical benefit and limitations of neural networks in approximating probability distributions under various integral probability metrics. Our work and the proof technique improve our understanding of the theoretical underpinnings of Generative Adversarial Networks and other generative models used in machine learning, and may lead to better use of these techniques with possible benefits to the society.

Acknowledgments and Disclosure of Funding

The work of JL is supported in part by the National Science Foundation via grants DMS-2012286 and CCF-1934964 (Duke TRIPODS).

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223, 2017.
  • [2] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017.
  • [3] S. Arora and Y. Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
  • [4] F. Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.
  • [5] Y. Bai, T. Ma, and A. Risteski. Approximability of discriminators implies diversity in gans. arXiv preprint arXiv:1806.10586, 2018.
  • [6] B. Bailey and M. J. Telgarsky. Size-noise tradeoffs in generative networks. In Advances in Neural Information Processing Systems, pages 6489–6499, 2018.
  • [7] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
  • [8] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
  • [9] S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Mem. Amer. Math. Soc., 261(1259):0, 2019.
  • [10] I. Borisov. Approximation of distributions of von mises statistics with multidimensional kernels. Siberian Mathematical Journal, 32(4):554–566, 1991.
  • [11] W. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In International Conference on Machine Learning, pages 843–852, 2018.
  • [12] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 2606–2615. JMLR. org, 2016.
  • [13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • [14] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
  • [15] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) relu networks. arXiv preprint arXiv:1905.02199, 2019.
  • [16] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267. AUAI Press, 2015.
  • [17] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
  • [18] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [20] J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1292–1301. JMLR. org, 2017.
  • [21] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • [22] X. Gu, F. Luo, J. Sun, and S.-T. Yau. Variational principles for minkowski type problems, discrete optimal transport, and discrete monge–ampère equations. Asian Journal of Mathematics, 20(2):383–398, 2016.
  • [23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [24] Y. Guo, D. An, X. Qi, Z. Luo, S.-T. Yau, and X. Gu. Mode collapse and regularity of optimal transportation maps, 2019. arXiv preprint arXiv:1902.02934.
  • [25] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [26] D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
  • [27] T. Hu, Z. Chen, H. Sun, J. Bai, M. Ye, and G. Cheng. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.
  • [28] L. V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.
  • [29] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [30] Z. Kong and K. Chaudhuri. The expressive power of a class of normalizing flow models. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Volume 108, pages 3599–3609. JMLR.org, 2020.
  • [31] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [32] H. Lee, R. Ge, T. Ma, A. Risteski, and S. Arora. On the ability of neural nets to express distributions. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1271–1296, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
  • [33] J. Lei. Convergence and concentration of empirical measures under wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020.
  • [34] N. Lei, K. Su, L. Cui, S.-T. Yau, and X. D. Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
  • [35] B. Lévy and E. L. Schwindt. Notions of optimal transport theory and how to implement them on a computer. Computers & Graphics, 72:135–148, 2018.
  • [36] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • [37] S. Liang and R. Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.
  • [38] T. Liang. On how well generative adversarial networks learn densities: Nonparametric and parametric results. arXiv preprint arXiv:1811.03179, 2018.
  • [39] H. Lin and S. Jegelka. Resnet with one-neuron hidden layers is a universal approximator. In Advances in neural information processing systems, pages 6169–6178, 2018.
  • [40] Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284, 2016.
  • [41] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386, 2016.
  • [42] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
  • [43] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions, 2020. arXiv preprint arXiv:2001.03040.
  • [44] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in neural information processing systems, pages 6231–6239, 2017.
  • [45] A. V. Makkuva, A. Taghvaei, S. Oh, and J. D. Lee. Optimal transport mapping via input convex neural networks. arXiv preprint arXiv:1908.10962, 2019.
  • [46] B. Matérn. Spatial variation, volume 36. Springer Science & Business Media, 2013.
  • [47] G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
  • [48] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural Networks, 108:296–330, 2018.
  • [49] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
  • [50] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
  • [51] Z. Shen, H. Yang, and S. Zhang. Deep network approximation characterized by number of neurons, 2019. arXiv preprint arXiv:1906.05497.
  • [52] D. Siersma and M. Van Manen. Power diagrams and their applications. arXiv preprint math.MG/0508037, 2005.
  • [53] B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.
  • [54] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
  • [55] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [56] A. Taghvaei and A. Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.
  • [57] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
  • [58] C. Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • [59] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2009.
  • [60] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
  • [61] D. Yarotsky. Optimal approximation of continuous functions by very deep relu networks. arXiv preprint arXiv:1802.03620, 2018.
  • [62] L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical review letters, 120(14):143001, 2018.

Appendix

Appendix A Proof of Proposition 3.1

Proof.

The proof follows from some previous results by [9] and [33]. In fact, in the one dimensional case, according to [9, Theorem 3.2], we know that if $\pi$ satisfies

J_1(\pi)=\int_{-\infty}^{\infty}\sqrt{F(x)(1-F(x))}\,dx<\infty \qquad (A.1)

where $F$ is the cumulative distribution function of $\pi$, then for every $n\geq 1$,

\mathbf{E}\,\mathcal{W}_1(P_n,\pi)\leq\frac{J_1(\pi)}{\sqrt{n}}. \qquad (A.2)

The condition (A.1) is fulfilled if $\pi$ has finite third moment since

J_1(\pi)\leq\int_0^\infty\sqrt{\mathbf{P}(|X|\geq x)}\,dx\leq 1+\int_1^\infty\frac{\sqrt{\mathbf{E}|X|^3}}{x^{3/2}}\,dx=1+2\sqrt{M_3}.

In the case $d\geq 2$, it follows from [33, Theorem 3.1] that if $M_3=\mathbb{E}_{X\sim\pi}|X|^3<\infty$, then there exists a constant $c>0$ independent of $d$ such that

\mathbf{E}\,\mathcal{W}_1(P_n,\pi)\leq cM_3^{1/3}\cdot\begin{cases}\frac{\log n}{\sqrt{n}}&\text{ if }d=2,\\ \frac{1}{n^{1/d}}&\text{ if }d\geq 3.\end{cases} \qquad (A.3)

Appendix B Proof of Proposition 3.2

Proof.

Thanks to [53, Proposition 3.1], one has that

\operatorname{MMD}(P_n,\pi)=\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}.

Let us define $\varphi(X_1,X_2,\cdots,X_n):=\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\|_{\mathcal{H}_k}$. Then by definition $\varphi(X_1,X_2,\cdots,X_n)$ satisfies, for any $i\in\{1,\cdots,n\}$,

\big|\varphi(X_1,\cdots,X_{i-1},X_i,\cdots,X_n)-\varphi(X_1,\cdots,X_{i-1},X_i',\cdots,X_n)\big|
\leq\frac{2}{n}\sup_x\|k(\cdot,x)\|_{\mathcal{H}_k}
\leq\frac{2\sqrt{K_0}}{n},\qquad\forall\, X_i,X_i'\in\mathbb{R}^d,

where we have used that $\|k(\cdot,x)\|_{\mathcal{H}_k}=\sqrt{k(x,x)}\leq\sqrt{K_0}$ by assumption. It follows from the above and McDiarmid's inequality that for every $\tau>0$, with probability at least $1-e^{-\tau}$,

\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}\leq\mathbf{E}\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}}.

In addition, we have by the standard symmetrization argument that

\mathbf{E}\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}\leq 2\,\mathbf{E}\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k},

where $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. Rademacher variables and $\mathbf{E}_\varepsilon$ denotes the conditional expectation w.r.t. $\{\varepsilon_i\}_{i=1}^n$ given $\{X_i\}_{i=1}^n$. To bound the right hand side above, we apply McDiarmid's inequality again to obtain that with probability at least $1-e^{-\tau}$,

\mathbf{E}\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k}\leq\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}}
\leq\Big(\mathbf{E}_\varepsilon\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_i k(\cdot,X_i)\Big\|_{\mathcal{H}_k}^2\Big)^{1/2}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}}
\leq\sqrt{\frac{\sqrt{K_0}}{n}}+\sqrt{\frac{2\sqrt{K_0}\tau}{n}},

where we have used Jensen’s inequality for expectation in the second inequality and the independence of εi{\varepsilon}_{i} and the definition of K0K_{0} in the last inequality. Combining the estimates above yields that with probability at least 12eτ1-2e^{-\tau},

\operatorname{MMD}(P_n,\pi)=\Big\|\int_{\mathbb{R}^d}k(\cdot,x)\,d(P_n-\pi)(x)\Big\|_{\mathcal{H}_k}\leq 2\sqrt{\frac{\sqrt{K_0}}{n}}+3\sqrt{\frac{2\sqrt{K_0}\tau}{n}}.

Appendix C Proof of Proposition 3.3

Thanks to [40, Theorem 3.6], $\operatorname{KSD}(P_n,\pi)$ can be evaluated explicitly as

\operatorname{KSD}(P_n,\pi)=\sqrt{\mathbf{E}_{x,y\sim P_n}[u_\pi(x,y)]}=\sqrt{\frac{1}{n^2}\sum_{i,j=1}^n u_\pi(X_i,X_j)}, \qquad (C.1)

where $u_\pi$ is a new kernel defined by

u_\pi(x,y)=s_\pi(x)^{T}k(x,y)s_\pi(y)+s_\pi(x)^{T}\nabla_y k(x,y)+s_\pi(y)^{T}\nabla_x k(x,y)+\operatorname{Tr}(\nabla_x\nabla_y k(x,y))

with $s_\pi(x)=\nabla\log\pi(x)$. Moreover, according to [40, Proposition 3.3], if $k$ satisfies Assumption K1, then $\operatorname{KSD}(P_n,\pi)$ is non-negative.

Our proof of Proposition 3.3 relies on the fact that $\operatorname{KSD}^2(P_n,\pi)$ can be viewed as a von Mises' statistic ($V$-statistic) and on an important Bernstein type inequality due to [10] for the distribution of $V$-statistics, which gives a concentration bound of $\operatorname{KSD}^2(P_n,\pi)$ around its mean (which is zero). We recall this inequality in the theorem below, which is a restatement of [10, Theorem 1] for second order degenerate $V$-statistics.

C.1 Bernstein type inequality for von Mises’ statistics

Let $X_1,\cdots,X_n,\cdots$ be a sequence of i.i.d. random variables on $\mathbb{R}^d$. For a kernel $h(x,y):\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}$, we call

V_n=\sum_{i,j=1}^n h(X_i,X_j) \qquad (C.2)

a von-Mises’ statistic of order 22 with kernel hh. We say that the kernel hh is degenerate if the following holds:

\mathbf{E}[h(X_1,X_2)|X_1]=\mathbf{E}[h(X_1,X_2)|X_2]=0. \qquad (C.3)
Theorem C.1 ([10, Theorem 1]).

Consider the $V$-statistic $V_n$ defined by (C.2) with a degenerate kernel $h$. Assume the kernel satisfies that

|h(x,y)|\leq g(x)\cdot g(y) \qquad (C.4)

for all $x,y\in\mathbb{R}^d$ with a function $g:\mathbb{R}^d\rightarrow\mathbb{R}$ satisfying, for $\xi,J>0$,

\mathbf{E}[g(X_1)^k]\leq\xi^2 J^{k-2}k!/2, \qquad (C.5)

for all $k=2,3,\cdots$. Then there exist some generic constants $C_1,C_2>0$ independent of $k,h,l,\xi$ such that for any $t\geq 0$,

𝐏(|Vn|n2t)C1exp(C2ntξ2+Jt1/2).\displaystyle{\mathbf{P}}(|V_{n}|\geq n^{2}t)\leq C_{1}\exp\Big{(}-\frac{C_{2}nt}{\xi^{2}+Jt^{1/2}}\Big{)}. (C.6)
Remark C.1.

As noted in [10, Remark 1], the inequality (C.6) is to some extent optimal. Moreover, a straightforward calculation shows that inequality (C.6) implies that for any δ(0,1)\delta\in(0,1),

𝐏(1n2|Vn|𝒱n)1δ,{\mathbf{P}}\Big{(}\frac{1}{n^{2}}|V_{n}|\leq\frac{\mathscr{V}}{n}\Big{)}\geq 1-\delta, (C.7)

where

𝒱=(Jlog(C1δ)C2+log(C1δ)C2ξ)2.\mathscr{V}=\Big{(}\frac{J\log(\frac{C_{1}}{\delta})}{C_{2}}+\sqrt{\frac{\log(\frac{C_{1}}{\delta})}{C_{2}}}\xi\Big{)}^{2}.
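For completeness, we sketch the calculation behind (C.7); constants are tracked loosely. Writing L=log(C_1/δ)/C_2, the bound (C.6) gives P(|V_n|≥n^2 t)≤δ as soon as nt≥L(ξ^2+J√t), and solving this quadratic inequality in √t yields (C.7):

```latex
% Write L = \log(C_1/\delta)/C_2. The inequality n t \ge L(\xi^2 + J\sqrt{t})
% holds once \sqrt{t} exceeds the positive root of n s^2 - L J s - L\xi^2 = 0:
\[
  \sqrt{t}\;\ge\;\frac{LJ+\sqrt{L^{2}J^{2}+4nL\xi^{2}}}{2n}.
\]
% Since \sqrt{a+b} \le \sqrt{a}+\sqrt{b}, the right-hand side is at most
\[
  \frac{LJ}{n}+\sqrt{\frac{L}{n}}\,\xi
  \;\le\;\frac{LJ+\sqrt{L}\,\xi}{\sqrt{n}}\qquad(n\ge 1),
\]
% so the choice t = (LJ+\sqrt{L}\,\xi)^2/n = \mathscr{V}/n gives
% P(|V_n| \ge n^2 t) \le \delta, which is exactly (C.7).
```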

C.2 Moment bound of sub-Gaussian random vectors

Let us first recall a useful concentration result on sub-Gaussian random vectors.

Theorem C.2 ([26, Theorem 2.1]).

Let XdX\in{\mathbb{R}}^{d} be a sub-Gaussian random vector with parameters mdm\in{\mathbb{R}}^{d} and υ>0\upsilon>0. Then for any t>0t>0,

𝐏(|Xm|υd+2dt+2t)et.\displaystyle{\mathbf{P}}\Big{(}|X-m|\geq\upsilon\sqrt{d+2\sqrt{dt}+2t}\Big{)}\leq e^{-t}. (C.8)

Moreover, for any 0η<12υ20\leq\eta<\frac{1}{2\upsilon^{2}},

𝐄exp(η|Xm|2)exp(υ2dη+υ4dη212υ2η).\displaystyle{\mathbf{E}}\exp(\eta|X-m|^{2})\leq\exp(\upsilon^{2}d\eta+\frac{\upsilon^{4}d\eta^{2}}{1-2\upsilon^{2}\eta}). (C.9)

As a direct consequence of Theorem C.2, we have the following useful moment bound for sub-Gaussian random vectors.

Proposition C.1.

Let XdX\in{\mathbb{R}}^{d} be a sub-Gaussian random vector with parameters mdm\in{\mathbb{R}}^{d} and υ>0\upsilon>0. Then for any k2k\geq 2,

𝐄|Xm|kk((2υd)k+12(4υ2)kkk/2).\displaystyle{\mathbf{E}}|X-m|^{k}\leq k\big{(}(2\upsilon\sqrt{d})^{k}+\frac{1}{2}(\frac{4\upsilon}{\sqrt{2}})^{k}k^{k/2}\big{)}. (C.10)
Proof.

From the concentration bound (C.8) and the simple fact that

d+2dt+2t=2(t+d2)2+d24(t+d2)2,d+2\sqrt{dt}+2t=2\Big{(}\sqrt{t}+\frac{\sqrt{d}}{2}\Big{)}^{2}+\frac{d}{2}\leq 4\Big{(}\sqrt{t}+\frac{\sqrt{d}}{2}\Big{)}^{2},

one can obtain that

𝐏(|Xm|2υ(t+d2))𝐏(|Xm|υd+2dt+2t)et.{\mathbf{P}}\Big{(}|X-m|\geq 2\upsilon\big{(}\sqrt{t}+\frac{\sqrt{d}}{2}\big{)}\Big{)}\leq{\mathbf{P}}\Big{(}|X-m|\geq\upsilon\sqrt{d+2\sqrt{dt}+2t}\Big{)}\leq e^{-t}.

Therefore, for any sυds\geq\upsilon\sqrt{d}, taking t=(s2υd2)2t=\big{(}\frac{s}{2\upsilon}-\frac{\sqrt{d}}{2}\big{)}^{2} (equivalently, s=2υ(t+d/2)s=2\upsilon(\sqrt{t}+\sqrt{d}/2)) in the bound above yields

𝐏(|Xm|s)e(s2υd2)2.\displaystyle{\mathbf{P}}\Big{(}|X-m|\geq s\Big{)}\leq e^{-\big{(}\frac{s}{2\upsilon}-\frac{\sqrt{d}}{2}\big{)}^{2}}. (C.11)

As a result, for any k2k\geq 2,

𝐄|Xm|k\displaystyle{\mathbf{E}}|X-m|^{k} =0𝐏(|Xm|ks)𝑑s\displaystyle=\int_{0}^{\infty}{\mathbf{P}}(|X-m|^{k}\geq s)ds (C.12)
=0𝐏(|Xm|ksk)ksk1𝑑s\displaystyle=\int_{0}^{\infty}{\mathbf{P}}(|X-m|^{k}\geq s^{k})ks^{k-1}ds
=02υd𝐏(|Xm|s)ksk1𝑑s+2υd𝐏(|Xm|s)ksk1𝑑s\displaystyle=\int_{0}^{2\upsilon\sqrt{d}}{\mathbf{P}}(|X-m|\geq s)ks^{k-1}ds+\int_{2\upsilon\sqrt{d}}^{\infty}{\mathbf{P}}(|X-m|\geq s)ks^{k-1}ds
=:I1+I2,\displaystyle=:I_{1}+I_{2},

where we have used the change of variable ssks\mapsto s^{k} in the second line above. It is clear that the first term

I1k(2υd)k.I_{1}\leq k(2\upsilon\sqrt{d})^{k}.

For I2I_{2}, one first notices that if s2υds\geq 2\upsilon\sqrt{d}, then s/(2υ)d/2s/(4υ)s/(2\upsilon)-\sqrt{d}/2\geq s/(4\upsilon). Hence it follows from (C.11) that

I2\displaystyle I_{2} 2υde(s4υ)2ksk1𝑑s\displaystyle\leq\int_{2\upsilon\sqrt{d}}^{\infty}e^{-\big{(}\frac{s}{4\upsilon}\big{)}^{2}}ks^{k-1}ds
=k(4υ)k2d4ettk22𝑑t\displaystyle=\frac{k(4\upsilon)^{k}}{2}\int_{\frac{d}{4}}^{\infty}e^{-t}t^{\frac{k-2}{2}}dt
k(4υ)k2Γ(k2).\displaystyle\leq\frac{k(4\upsilon)^{k}}{2}\Gamma(\frac{k}{2}).

The last two estimates imply that

𝐄|Xm|k\displaystyle{\mathbf{E}}|X-m|^{k} k(2υd)k+k(4υ)k2Γ(k2)\displaystyle\leq k(2\upsilon\sqrt{d})^{k}+\frac{k(4\upsilon)^{k}}{2}\Gamma(\frac{k}{2})
k((2υd)k+12(4υ2)kkk/2),\displaystyle\leq k\big{(}(2\upsilon\sqrt{d})^{k}+\frac{1}{2}(\frac{4\upsilon}{\sqrt{2}})^{k}k^{k/2}\big{)},

where the second inequality above follows from Γ(k2)(k/2)k/2\Gamma(\frac{k}{2})\leq(k/2)^{k/2} for k2k\geq 2. ∎
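As a quick numerical sanity check of (C.10) (ours, purely illustrative and not part of the argument), one may compare both sides for a standard Gaussian vector, which is sub-Gaussian with m=0 and υ=1; the empirical moments stay well below the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, nsamples = 4, 200000
X = rng.standard_normal((nsamples, d))        # standard Gaussian: m = 0, upsilon = 1
norms = np.linalg.norm(X, axis=1)
for k in [2, 3, 4, 6]:
    lhs = (norms ** k).mean()                                                   # E |X - m|^k
    rhs = k * ((2 * np.sqrt(d)) ** k + 0.5 * (4 / np.sqrt(2)) ** k * k ** (k / 2))
    print(k, lhs, rhs)                                                          # lhs < rhs
```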

C.3 Proof of Proposition 3.3

Our goal is to invoke Theorem C.1 to obtain a concentration inequality for KSD. Recall that KSD(Pn,π)\textrm{KSD}(P_{n},\pi) is defined by

KSD2(Pn,π)=1n2i,j=1nuπ(Xi,Xj)\textrm{KSD}^{2}(P_{n},\pi)=\frac{1}{n^{2}}\sum_{i,j=1}^{n}u_{\pi}(X_{i},X_{j})

with the kernel

uπ(x,y)=sπ(x)Tk(x,y)sπ(y)+sπ(x)Tyk(x,y)+sπ(y)Txk(x,y)+Tr(xyk(x,y)).u_{\pi}(x,y)=s_{\pi}(x)^{\hbox{\it\tiny T}}k(x,y)s_{\pi}(y)+s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}k(x,y)+s_{\pi}(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)+\operatorname{Tr}(\nabla_{x}\nabla_{y}k(x,y)).

Let us first verify that the new kernel uπu_{\pi} satisfies the assumptions of Theorem C.1. In fact, since sπ(x)=log(π(x))s_{\pi}(x)=\nabla\log(\pi(x)), one obtains from integration by parts that

𝐄[uπ(X1,X2)|X1=x]=duπ(x,y)𝑑π(y)\displaystyle{\mathbf{E}}[u_{\pi}(X_{1},X_{2})|X_{1}=x]=\int_{{\mathbb{R}}^{d}}u_{\pi}(x,y)d\pi(y)
=dsπ(x)Tk(x,y)sπ(y)+sπ(x)Tyk(x,y)\displaystyle=\int_{{\mathbb{R}}^{d}}s_{\pi}(x)^{\hbox{\it\tiny T}}k(x,y)s_{\pi}(y)+s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}k(x,y)
+sπ(y)Txk(x,y)+Tr(xyk(x,y))dπ(y)\displaystyle\qquad+s_{\pi}(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)+\operatorname{Tr}(\nabla_{x}\nabla_{y}k(x,y))d\pi(y)
=dk(x,y)sπ(x)Tyπ(y)𝑑ydk(x,y)sπ(x)Tyπ(y)𝑑y\displaystyle=\int_{{\mathbb{R}}^{d}}k(x,y)s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}\pi(y)dy-\int_{{\mathbb{R}}^{d}}k(x,y)s_{\pi}(x)^{\hbox{\it\tiny T}}\nabla_{y}\pi(y)dy
+dyπ(y)Txk(x,y)𝑑ydyπ(y)Txk(x,y)𝑑y\displaystyle\qquad+\int_{{\mathbb{R}}^{d}}\nabla_{y}\pi(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)dy-\int_{{\mathbb{R}}^{d}}\nabla_{y}\pi(y)^{\hbox{\it\tiny T}}\nabla_{x}k(x,y)dy
=0.\displaystyle=0.

Here the third equality follows from the identity sπ(y)π(y)=yπ(y)s_{\pi}(y)\pi(y)=\nabla_{y}\pi(y) together with integration by parts in yy applied to the second and fourth terms (the boundary terms vanish thanks to the decay of π\pi and the growth conditions on kk).

Similarly, one has

𝐄[uπ(X1,X2)|X2=y]=0.{\mathbf{E}}[u_{\pi}(X_{1},X_{2})|X_{2}=y]=0.

This shows that uπu_{\pi} satisfies the condition of degeneracy (C.3).
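The degeneracy can also be confirmed numerically. The snippet below (again for the illustrative standard-Gaussian target and RBF kernel used in the earlier sketch, neither of which is fixed by the assumptions) estimates 𝐄[u_π(x,X_2)] by Monte Carlo for a fixed x and returns a value close to zero, in agreement with (C.3).

```python
import numpy as np

def u_pi(x, Ys, bandwidth=1.0):
    # u_pi(x, y) for pi = N(0, I_d) (score s(z) = -z) and the RBF kernel,
    # evaluated at one fixed point x against a batch of points Ys.
    h2 = bandwidth ** 2
    diff = x[None, :] - Ys                       # x - y
    sq = (diff ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h2))
    d = x.shape[0]
    val = (x * Ys).sum(-1)                       # s(x)^T s(y) = (-x).(-y)
    val += (-x[None, :] * diff).sum(-1) / h2     # s(x)^T grad_y k / k
    val += (Ys * diff).sum(-1) / h2              # s(y)^T grad_x k / k
    val += d / h2 - sq / h2 ** 2                 # Tr(grad_x grad_y k) / k
    return K * val

rng = np.random.default_rng(3)
x = np.array([0.7, -1.2, 0.3])
Ys = rng.standard_normal((200000, 3))            # Y ~ pi = N(0, I_3)
print(u_pi(x, Ys).mean())                        # close to 0, consistent with (C.3)
```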

Next, we show that uπu_{\pi} satisfies the bound (C.4) with a function gg satisfying the moment condition (C.5). In fact, by Assumption K3 on the kernel kk and Assumption 1 on the target density π\pi,

|uπ(x,y)|\displaystyle|u_{\pi}(x,y)| L2K1(1+|x|)(1+|y|)+LK1(1+|x|+1+|y|)+K1(1+d)\displaystyle\leq L^{2}K_{1}(1+|x|)\cdot(1+|y|)+LK_{1}(1+|x|+1+|y|)+K_{1}(1+d)
K1(L+1)2(d+1+|x|)(d+1+|y|)\displaystyle\leq K_{1}(L+1)^{2}(\sqrt{d}+1+|x|)\cdot(\sqrt{d}+1+|y|)
=:g(x)g(y),\displaystyle=:g(x)\cdot g(y),

where g(x)=K1(L+1)(d+1+|x|)g(x)=\sqrt{K_{1}}(L+1)(\sqrt{d}+1+|x|) and the constant LL is defined in (2.6). To verify gg satisfies (C.5), we write

𝐄Xπ[g(X)k]=(K1(L+1))k𝐄Xπ[(d+1+|X|)k]\displaystyle{\mathbf{E}}_{X\sim\pi}[g(X)^{k}]=(\sqrt{K_{1}}(L+1))^{k}{\mathbf{E}}_{X\sim\pi}[(\sqrt{d}+1+|X|)^{k}] (C.13)
(K1(L+1))k𝐄Xπ[(d+1+|m|+|Xm|)k]\displaystyle\leq(\sqrt{K_{1}}(L+1))^{k}{\mathbf{E}}_{X\sim\pi}[(\sqrt{d}+1+|m|+|X-m|)^{k}]
=(K1(L+1))k((d+1+|m|)k+j=1k(kj)(d+1+|m|)kj𝐄Xπ|Xm|j).\displaystyle=(\sqrt{K_{1}}(L+1))^{k}\Big{(}(\sqrt{d}+1+|m|)^{k}+\sum_{j=1}^{k}\begin{pmatrix}k\\ j\end{pmatrix}(\sqrt{d}+1+|m|)^{k-j}{\mathbf{E}}_{X\sim\pi}|X-m|^{j}\Big{)}.

Thanks to Proposition C.1, we have for any j1j\geq 1,

𝐄Xπ|Xm|j\displaystyle{\mathbf{E}}_{X\sim\pi}|X-m|^{j} (𝐄Xπ|Xm|2j)1/2\displaystyle\leq\big{(}{\mathbf{E}}_{X\sim\pi}|X-m|^{2j}\big{)}^{1/2} (C.14)
(2j)1/2((2υd)2j+12(4υ2)2j(2j)j)1/2\displaystyle\leq(2j)^{1/2}\cdot\Big{(}(2\upsilon\sqrt{d})^{2j}+\frac{1}{2}\big{(}\frac{4\upsilon}{\sqrt{2}}\big{)}^{2j}\cdot(2j)^{j}\Big{)}^{1/2}
(2j)1/2(2max(4υ,1)dj)j\displaystyle\leq(2j)^{1/2}\cdot\Big{(}2\max(4\upsilon,1)\cdot\sqrt{d}\cdot\sqrt{j}\Big{)}^{j}
(2e1/emax(4υ,1)dj)j,\displaystyle\leq\Big{(}2e^{1/e}\max(4\upsilon,1)\cdot\sqrt{d}\cdot\sqrt{j}\Big{)}^{j},

where we have used the simple fact that (2j)1/(2j)e1/e(2j)^{1/(2j)}\leq e^{1/e} for any j1j\geq 1 in the last inequality. Plugging (C.14) into (C.13) yields that

𝐄Xπ[g(X)k]2(K1(L+1))k(d+1+|m|+2e1/emax(4υ,1)dk)k\displaystyle{\mathbf{E}}_{X\sim\pi}[g(X)^{k}]\leq 2(\sqrt{K_{1}}(L+1))^{k}\Big{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Big{)}^{k} (C.15)
=2(K1(L+1))kexp(klog(d+1+|m|+2e1/emax(4υ,1)dk)).\displaystyle=2(\sqrt{K_{1}}(L+1))^{k}\exp\Big{(}k\log\Big{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Big{)}\Big{)}.

Using the fact that log(a+b)log(a)=log(1+b/a)b/a\log(a+b)-\log(a)=\log(1+b/a)\leq b/a for all a,b1a,b\geq 1, one has

exp(klog(d+1+|m|+2e1/emax(4υ,1)dk))\displaystyle\exp\Big{(}k\log\Big{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Big{)}\Big{)} (C.16)
exp(klog(2e1/emax(4υ,1)d=:Ak))exp(kd+1+|m|2e1/emax(4υ,1)d=:B).\displaystyle\leq\exp\Big{(}k\log\Big{(}\underbrace{2e^{1/e}\max(4\upsilon,1)\sqrt{d}}_{=:A}\cdot\sqrt{k}\Big{)}\Big{)}\cdot\exp\Big{(}\sqrt{k}\cdot\underbrace{\frac{\sqrt{d}+1+|m|}{2e^{1/e}\max(4\upsilon,1)\sqrt{d}}}_{=:B}\Big{)}.

Since by assumption |m|md|m|\leq m^{\ast}\sqrt{d} and d1d\geq 1, we have

B2+m2e1/emax(4υ,1)=:B~.B\leq\frac{2+m^{\ast}}{2e^{1/e}\max(4\upsilon,1)}=:\tilde{B}.

As a consequence of the above and the fact that k!(k3)kk!\geq\Big{(}\frac{k}{3}\Big{)}^{k} for any k+k\in{\mathbb{N}}_{+},

exp(klog\displaystyle\exp\Bigl{(}k\log (d+1+|m|+2e1/emax(4υ,1)dk))\displaystyle\Bigl{(}\sqrt{d}+1+|m|+2e^{1/e}\max(4\upsilon,1)\sqrt{d}\cdot\sqrt{k}\Bigr{)}\Bigr{)} (C.17)
exp(klogk/2)exp(k(logA+B~))\displaystyle\leq\exp(k\log k/2)\cdot\exp(k(\log A+\tilde{B}))
=(k3)k/2(3Aexp(B~+12))k\displaystyle=\big{(}\frac{k}{3}\big{)}^{k/2}\cdot\big{(}\sqrt{3}A\exp(\tilde{B}+\frac{1}{2})\big{)}^{k}
k!(3Aexp(B~+12))k.\displaystyle\leq k!\cdot\big{(}\sqrt{3}A\exp(\tilde{B}+\frac{1}{2})\big{)}^{k}.

Combining this with (C.15) implies that the moment bound assumption (C.5) holds with the constants

J=3K1(L+1)Aexp(B~+12) and ξ=2J.J=\sqrt{3K_{1}}(L+1)A\exp\Big{(}\tilde{B}+\frac{1}{2}\Big{)}\textrm{ and }\xi=2J.

Therefore it follows from the definition of KSD(Pn,π)\mathrm{KSD}(P_{n},\pi) in (C.1) and the concentration bound (C.7) implied by Theorem C.1 that, with probability at least 1δ1-\delta,

KSD(Pn,π)Cn,\textrm{KSD}(P_{n},\pi)\leq\frac{C}{\sqrt{n}},

with the constant

C=J(log(C1δ)C2+2log(C1δ)C2).C=J\Big{(}\frac{\log(\frac{C_{1}}{\delta})}{C_{2}}+2\sqrt{\frac{\log(\frac{C_{1}}{\delta})}{C_{2}}}\Big{)}.

Since by definition the constant A=𝒪(d)A=\mathcal{O}(\sqrt{d}) for large dd, we have that the constant C=𝒪(d)C=\mathcal{O}(\sqrt{d}). This completes the proof.

Appendix D Summarizing Propositions 3.1 - 3.3

The theorem below summarizes Propositions 3.1 - 3.3 above and serves as one of the ingredients in the proof of Theorem 2.1.

Theorem D.1.

Let π\pi be a probability measure on d{\mathbb{R}}^{d} and let Pn=1ni=1nδXiP_{n}=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}} be the empirical measure associated to the i.i.d. samples {Xi}i=1n\{X_{i}\}_{i=1}^{n} drawn from π\pi. Then we have the following:

  1. 1.

    If π\pi satisfies M3=𝐄Xπ|X|3<M_{3}={\mathbf{E}}_{X\sim\pi}|X|^{3}<\infty, then there exists a realization of empirical measure PnP_{n} such that

    𝒲1(Pn,π)C{n1/2,d=1,n1/2logn,d=2,n1/d,d3,{\mathcal{W}}_{1}(P_{n},\pi)\leq C\cdot\begin{cases}n^{-1/2},&d=1,\\ n^{-1/2}\log n,&d=2,\\ n^{-1/d},&d\geq 3,\end{cases}

    where the constant CC depends only on M3M_{3}.

  2. 2.

    If kk satisfies Assumption K2 with constant K0K_{0}, then there exists a realization of empirical measure PnP_{n} such that

    MMD(Pn,π)Cn,\operatorname{MMD}(P_{n},\pi)\leq\frac{C}{\sqrt{n}},

    where the constant CC depends only on K0K_{0}.

  3. 3.

    If π\pi satisfies Assumptions 1 and 2 and kk satisfies Assumption K3 with constant K1K_{1}, then there exists a realization of the empirical measure PnP_{n} such that

    KSD (Pn,π)Cdn,\textrm{KSD }(P_{n},\pi)\leq C\sqrt{\frac{d}{n}},

    where the constant CC depends only on L,K1,m,υL,K_{1},m^{\ast},\upsilon.
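For the first rate in Theorem D.1, the d=1 case is easy to examine numerically. The sketch below (an illustration only, with a standard Gaussian π approximated by a large reference sample; both choices are ours) exhibits the expected O(n^{-1/2}) decay of 𝒲_1(P_n,π).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(6)
reference = rng.standard_normal(200000)     # large sample standing in for pi = N(0, 1)
for n in [100, 1000, 10000]:
    Pn = rng.standard_normal(n)             # i.i.d. samples defining P_n
    print(n, wasserstein_distance(Pn, reference))   # decays roughly like n^{-1/2}
```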

Appendix E Semi-discrete optimal transport with quadratic cost

E.1 Structure theorem of optimal transport map

We recall the structure theorem of optimal transport map between μ\mu and ν\nu under the assumption that μ\mu does not give mass to null sets.

Theorem E.1 ([58, Theorem 2.9 and Theorem 2.12]).

Let μ\mu and ν\nu be two probability measures on d{\mathbb{R}}^{d} with finite second moments. Assume that μ\mu is absolutely continuous with respect to the Lebesgue measure. Consider the functionals 𝒦\mathcal{K} and 𝒥\mathcal{J} defined in Monge’s problem (4.1) and dual Kantorovich problem (4.2) with c=12|xy|2c=\frac{1}{2}|x-y|^{2}. Then

(i) there exists a unique solution π\pi to Kantorovich’s problem, which is given by π(dxdy)=(𝐈𝐝×T)#μ\pi(dxdy)=(\mathbf{Id}\times T)_{\#}\mu, where T(x)=φ¯(x)T(x)=\nabla\bar{\varphi}(x) for μ\mu-a.e. xx and some convex function φ¯:d\bar{\varphi}:{\mathbb{R}}^{d}{\rightarrow}{\mathbb{R}}. In other words, T(x)=φ¯(x)T(x)=\nabla\bar{\varphi}(x) is the unique solution to Monge’s problem.

(ii) there exists an optimal pair (φ(x),φc(y))(\varphi(x),\varphi^{c}(y)) or (ψc(x),ψ(y))(\psi^{c}(x),\psi(y)) solving the dual Kantorovich’s problem, i.e. sup(φ,ψ)Φc𝒥(φ,ψ)=𝒥(φ,φc)=𝒥(ψc,ψ)\sup_{(\varphi,\psi)\in\Phi_{c}}\mathcal{J}(\varphi,\psi)=\mathcal{J}(\varphi,\varphi^{c})=\mathcal{J}(\psi^{c},\psi);

(iii) the function φ¯(x)\bar{\varphi}(x) can be chosen as φ¯(x)=12|x|2φ(x)\bar{\varphi}(x)=\frac{1}{2}|x|^{2}-\varphi(x) (or φ¯(x)=12|x|2ψc(x)\bar{\varphi}(x)=\frac{1}{2}|x|^{2}-\psi^{c}(x)) where (φ(x),φc(y))(\varphi(x),\varphi^{c}(y)) (or (ψc(x),ψ(y))(\psi^{c}(x),\psi(y))) is an optimal pair which maximizes 𝒥\mathcal{J} within the set Φc\Phi_{c}.

E.2 Proof of Theorem 4.2

Recall that the dual Kantorovich problem in the semi-discrete case reduces to maximizing the following functional

(ψ)=infj(12|xyj|2ψj)ρ(x)dx+j=1nψjνj.\displaystyle{\mathcal{F}}(\psi)=\int\inf_{j}\left(\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\right)\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}. (E.1)

Proof of Theorem 4.2 relies on two useful lemmas on the functional {\mathcal{F}}. The first lemma below shows that the functional {\mathcal{F}} is concave, whose proof adapts that of [35, Theorem 2] for semi-discrete optimal transport with the quadratic cost.

Lemma E.1.

Let ρ\rho be a probability density on d{\mathbb{R}}^{d}. Let {yj}j=1nd\{y_{j}\}_{j=1}^{n}\subset{\mathbb{R}}^{d} and let {νj}j=1n[0,1]\{\nu_{j}\}_{j=1}^{n}\subset[0,1] be such that j=1nνj=1\sum_{j=1}^{n}\nu_{j}=1. Then the functional {\mathcal{F}} defined by (E.1) is concave.

Proof.

Let 𝒜:d{1,2,,n}\mathcal{A}:{\mathbb{R}}^{d}{\rightarrow}\{1,2,\cdots,n\} be an assignment function which assigns a point xdx\in{\mathbb{R}}^{d} to the index jj of some point yjy_{j}. Let us also define the function

~(𝒜,ψ)=(12|xy𝒜(x)|2ψ𝒜(x))ρ(x)𝑑x+j=1nψjνj.\widetilde{\mathcal{F}}(\mathcal{A},\psi)=\int\Big{(}\frac{1}{2}|x-y_{{\mathcal{A}}(x)}|^{2}-\psi_{{\mathcal{A}}(x)}\Big{)}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}.

Then by definition (ψ)=inf𝒜~(𝒜,ψ){\mathcal{F}}(\psi)=\inf_{{\mathcal{A}}}\widetilde{\mathcal{F}}(\mathcal{A},\psi). Denote 𝒜1(j)={xd|𝒜(x)=j}{\mathcal{A}}^{-1}(j)=\{x\in{\mathbb{R}}^{d}|{\mathcal{A}}(x)=j\}. Then

~(𝒜,ψ)\displaystyle\widetilde{\mathcal{F}}(\mathcal{A},\psi) =j=1n[𝒜1(j)(12|xyj|2ψj)ρ(x)𝑑x+ψjνj]\displaystyle=\sum_{j=1}^{n}\left[\int_{{\mathcal{A}}^{-1}(j)}\Big{(}\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\Big{)}\rho(x)dx+\psi_{j}\nu_{j}\right]
=j=1n𝒜1(j)12|xyj|2ρ(x)𝑑x+j=1nψj(νj𝒜1(j)ρ(x)𝑑x).\displaystyle=\sum_{j=1}^{n}\int_{{\mathcal{A}}^{-1}(j)}\frac{1}{2}|x-y_{j}|^{2}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\Big{(}\nu_{j}-\int_{{\mathcal{A}}^{-1}(j)}\rho(x)dx\Big{)}.

Since the function ~(𝒜,ψ)\widetilde{\mathcal{F}}(\mathcal{A},\psi) is affine in ψ\psi for every 𝒜{\mathcal{A}}, it follows that (ψ)=inf𝒜~(𝒜,ψ){\mathcal{F}}(\psi)=\inf_{{\mathcal{A}}}\widetilde{\mathcal{F}}(\mathcal{A},\psi) is concave. ∎

The next lemma computes the gradient of the concave function {\mathcal{F}}; see [35, Section 7.4] for the corresponding result with general transportation cost.

Lemma E.2.

Let ρ\rho be a probability density on d{\mathbb{R}}^{d}. Let {yj}j=1nd\{y_{j}\}_{j=1}^{n}\subset{\mathbb{R}}^{d} and let {νj}j=1n[0,1]\{\nu_{j}\}_{j=1}^{n}\subset[0,1] be such that j=1nνj=1\sum_{j=1}^{n}\nu_{j}=1. Denote by Pj(ψ)P_{j}(\psi) the power diagram associated to ψ\psi and yjy_{j}. Then

ψi(ψ)=νiμ(Pi(ψ))=νiPi(ψ)ρ(x)𝑑x.\partial_{\psi_{i}}{\mathcal{F}}(\psi)=\nu_{i}-\mu(P_{i}(\psi))=\nu_{i}-\int_{P_{i}(\psi)}\rho(x)dx. (E.2)
Proof.

By the definition of \mathcal{F} in (E.1), we rewrite \mathcal{F} as

(ψ)\displaystyle\mathcal{F}(\psi) =12|x|2ρ(dx)+infj{xyj+12|yj|2ψj}ρ(x)dx+j=1nψjνj\displaystyle=\int\frac{1}{2}|x|^{2}\rho(dx)+\int\inf_{j}\Big{\{}-x\cdot y_{j}+\frac{1}{2}|y_{j}|^{2}-\psi_{j}\Big{\}}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}
=12|x|2ρ(dx)supj{xyj+ψj12|yj|2}ρ(x)dx+j=1nψjνj\displaystyle=\int\frac{1}{2}|x|^{2}\rho(dx)-\int\sup_{j}\Big{\{}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}\Big{\}}\rho(x)dx+\sum_{j=1}^{n}\psi_{j}\nu_{j}

To prove (E.2), it suffices to prove that

ψi(supj{xyj+ψj12|yj|2}ρ(x)dx)=Pi(ψ)ρ(x)𝑑x.\partial_{\psi_{i}}\Big{(}\int\sup_{j}\Big{\{}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}\Big{\}}\rho(x)dx\Big{)}=\int_{P_{i}(\psi)}\rho(x)dx. (E.3)

Note that the partial derivative on the left-hand side of (E.3) makes sense, since g(x,ψ):=supj{xyj+ψj12|yj|2}g(x,\psi):=\sup_{j}\{x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}\} is convex with respect to (x,ψ)(x,\psi) on d×n{\mathbb{R}}^{d}\times{\mathbb{R}}^{n}, so that the resulting integral against the measure ρ\rho is also convex (and hence locally Lipschitz) in ψ\psi. To see (E.3), since g(x,ψ)g(x,\psi) is convex and piecewise linear in ψ\psi for any fixed xx, it is easy to observe that

ψig(x,ψ)=δij if x{xd|xyj+ψj12|yj|2=g(x,ψ)}.\partial_{\psi_{i}}g(x,\psi)=\delta_{ij}\text{ if }x\in\Big{\{}x\in{\mathbb{R}}^{d}\Big{|}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}=g(x,\psi)\Big{\}}.

On the other hand, by subtracting 12|x|2\frac{1}{2}|x|^{2} from both sides of the equation inside the braces and then flipping the sign, one sees that

{xd|xyj+ψj12|yj|2=g(x,ψ)}=Pj(ψ).\Big{\{}x\in{\mathbb{R}}^{d}\Big{|}x\cdot y_{j}+\psi_{j}-\frac{1}{2}|y_{j}|^{2}=g(x,\psi)\Big{\}}=P_{j}(\psi).

Namely we have obtained that

ψig(x,ψ)=δi,j if xPj(ψ).\partial_{\psi_{i}}g(x,\psi)=\delta_{i,j}\text{ if }x\in P_{j}(\psi).

In particular, this implies that ψg(x,ψ)\psi{\rightarrow}g(x,\psi) is 1-Lipschitz in ψ\psi uniformly with respect to xx. Finally, since ρ\rho is a probability density, the desired identity (E.3), and hence (E.2), follows from the equation above and the dominated convergence theorem. This completes the proof of the lemma. ∎

With the lemmas above, we are ready to prove Theorem 4.2. In fact, according to Lemma E.1 and Lemma E.2, ψ=(ψ1,,ψn)\psi=(\psi_{1},\cdots,\psi_{n}) is a maximizer of the functional {\mathcal{F}} if and only if

ψi(ψ)=νiμ(Pi(ψ))=νiPi(ψ)ρ(x)𝑑x=0.\partial_{\psi_{i}}{\mathcal{F}}(\psi)=\nu_{i}-\mu(P_{i}(\psi))=\nu_{i}-\int_{P_{i}(\psi)}\rho(x)dx=0.

Since the dual Kantorovich problem in the semi-discrete setting reduces to the problem of maximizing {\mathcal{F}}, it follows from Theorem E.1 that the optimal transport map TT solving the semi-discrete Monge’s problem (4.4) is given by T(x)=φ¯(x)T(x)=\nabla\bar{\varphi}(x) where φ¯(x)=12|x|2φ(x)\bar{\varphi}(x)=\frac{1}{2}|x|^{2}-\varphi(x) and φ(x)=minj12|xyj|2ψj\varphi(x)=\min_{j}\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}. Consequently,

φ¯(x)\displaystyle\bar{\varphi}(x) =12|x|2φ(x)\displaystyle=\frac{1}{2}|x|^{2}-\varphi(x)
=12|x|2(minj{12|xyj|2ψj})\displaystyle=\frac{1}{2}|x|^{2}-\Big{(}\min_{j}\{\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\}\Big{)}
=maxj{xyj+mj}\displaystyle=\max_{j}\{x\cdot y_{j}+m_{j}\}

with mj=ψj12|yj|2m_{j}=\psi_{j}-\frac{1}{2}|y_{j}|^{2}. Moreover, noticing that φ(x)\varphi(x) can be rewritten as

φ(x)=12|xyj|2ψj if xPj(ψ),\varphi(x)=\frac{1}{2}|x-y_{j}|^{2}-\psi_{j}\text{ if }x\in P_{j}(\psi),

one obtains that T(x)=φ¯(x)=yjT(x)=\nabla\bar{\varphi}(x)=y_{j} if xPj(ψ)x\in P_{j}(\psi).
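The optimality condition above also suggests a simple computational scheme, sketched below purely for illustration (this is our own toy implementation, not part of the proof): ascend the concave functional ℱ by a stochastic gradient method, estimating the gradient formula of Lemma E.2 by Monte Carlo, and then read off the piecewise constant map T(x)=y_{j(x)} from the resulting power diagram. The Gaussian source density, the uniform target weights, and all function names are arbitrary choices made for this example.

```python
import numpy as np

def transport_indices(X, Y, psi):
    # For each row x of X, the index j(x) of the power cell P_j(psi) containing x,
    # i.e. argmin_j ( |x - y_j|^2 / 2 - psi_j ).
    cost = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1) - psi[None, :]
    return cost.argmin(axis=1)

def solve_semidiscrete_ot(Y, nu, sample_rho, iters=2000, batch=4000, lr=2.0):
    # Stochastic ascent on the concave functional F in (E.1), using the gradient
    # formula of Lemma E.2: dF/dpsi_i = nu_i - rho(P_i(psi)), with the cell
    # masses rho(P_i(psi)) estimated by Monte Carlo on samples of rho.
    psi = np.zeros(len(Y))
    for t in range(iters):
        X = sample_rho(batch)
        j = transport_indices(X, Y, psi)
        cell_mass = np.bincount(j, minlength=len(Y)) / batch
        psi += lr / np.sqrt(t + 1.0) * (nu - cell_mass)
    return psi

rng = np.random.default_rng(4)
d, n = 2, 8
Y = rng.standard_normal((n, d))            # support points y_j of the discrete target
nu = np.full(n, 1.0 / n)                   # target weights nu_j
psi = solve_semidiscrete_ot(Y, nu, lambda size: rng.standard_normal((size, d)))
# The map is T(x) = y_{j(x)}; the power-cell masses should now be close to nu:
j = transport_indices(rng.standard_normal((100000, d)), Y, psi)
print(np.bincount(j, minlength=n) / 100000.0)
```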

E.3 Proof of Proposition 4.1

Let us first consider the case that n=2kn=2^{k} for some kk\in{\mathbb{N}}. Then

φ¯(x)=maxj=1,,2k{xyj+mj}=maxj=1,,2k1maxi{2j1,2j}{xyi+mi}.\displaystyle\bar{\varphi}(x)=\max_{j=1,\cdots,2^{k}}\{x\cdot y_{j}+m_{j}\}=\max_{j=1,\cdots,2^{k-1}}\max_{i\in\{2j-1,2j\}}\{x\cdot y_{i}+m_{i}\}.

Let us define maps φn:nn/2\varphi_{n}:{\mathbb{R}}^{n}{\rightarrow}{\mathbb{R}}^{n/2} and ψ:dn\psi:{\mathbb{R}}^{d}{\rightarrow}{\mathbb{R}}^{n} by setting

[φn(z)]i=max{z2i1,z2i},i=1,,n/2 and [ψ(x)]j=xyj+mj,j=1,,n.[\varphi_{n}(z)]_{i}=\max\{z_{2i-1},z_{2i}\},i=1,\cdots,n/2\text{ and }[\psi(x)]_{j}=x\cdot y_{j}+m_{j},j=1,\cdots,n.

Then by definition it is straightforward that

φ¯(x)=(φ2φ4φn/2φnψ)(x).\bar{\varphi}(x)=(\varphi_{2}\circ\varphi_{4}\circ\cdots\circ\varphi_{n/2}\circ\varphi_{n}\circ\psi)(x). (E.4)

By defining

Y=(y1Ty2TynT),m=(m1m2mn),Y=\begin{pmatrix}y_{1}^{\hbox{\it\tiny T}}\\ y_{2}^{\hbox{\it\tiny T}}\\ \vdots\\ y_{n}^{\hbox{\it\tiny T}}\end{pmatrix},m=\begin{pmatrix}m_{1}\\ m_{2}\\ \vdots\\ m_{n}\end{pmatrix},

we can write the map ψ\psi as

ψ(x)=Yx+m.\psi(x)=Y\cdot x+m. (E.5)

Moreover, thanks to the following simple equivalent formulation of the maximum function:

max(a,b)\displaystyle\max(a,b) =ReLU(ab)+ReLU(b)ReLU(b)\displaystyle=\textrm{ReLU}(a-b)+\textrm{ReLU}(b)-\textrm{ReLU}(-b)
=hTReLU(A(ab)),\displaystyle=h^{\hbox{\it\tiny T}}\cdot\textrm{ReLU}\left(A\begin{pmatrix}a\\ b\end{pmatrix}\right),

where

A=(110101),h=(111),A=\begin{pmatrix}1&-1\\ 0&1\\ 0&-1\end{pmatrix},\quad h=\begin{pmatrix}1\\ 1\\ -1\end{pmatrix},

we can express the map φn\varphi_{n} in terms of a two-layer neural network as follows

φn(z)=Hn/2ReLU(Anz),\varphi_{n}(z)=H_{n/2}\cdot\textrm{ReLU}(A_{n}\cdot z), (E.6)

where An=n/2AA_{n}=\oplus^{n/2}A denotes the block-diagonal matrix consisting of n/2n/2 copies of AA (so that An:n3n/2A_{n}:{\mathbb{R}}^{n}{\rightarrow}{\mathbb{R}}^{3n/2}) and Hn=nhTH_{n}=\oplus^{n}h^{\hbox{\it\tiny T}} denotes the block-diagonal matrix consisting of nn copies of the row vector hTh^{\hbox{\it\tiny T}} (so that Hn:3nnH_{n}:{\mathbb{R}}^{3n}{\rightarrow}{\mathbb{R}}^{n}). Finally, by combining (E.4), (E.5) and (E.6), one sees that φ¯\bar{\varphi} can be expressed in terms of a DNN of width 𝒪(n)\mathcal{O}(n) and depth 𝒪(logn)\mathcal{O}(\log n) with parameters (Wℓ,bℓ)ℓ=0L(W^{\ell},b^{\ell})_{\ell=0}^{L}, L=log2nL=\log_{2}n, defined by

W0=AnY,b0=Anm,\displaystyle W^{0}=A_{n}\cdot Y,\quad b^{0}=A_{n}\cdot m,
W1=An/2Hn/2,b1=0,\displaystyle W^{1}=A_{n/2}\cdot H_{n/2},\quad b^{1}=0,
W2=An/4Hn/4,b2=0,\displaystyle W^{2}=A_{n/4}\cdot H_{n/4},\quad b^{2}=0,
,\displaystyle\cdots,
WL1=A2H2,bL1=0,\displaystyle W^{L-1}=A_{2}\cdot H_{2},\quad b^{L-1}=0,
WL=H1,bL=0.\displaystyle W_{L}=H_{1},\quad b^{L}=0.

In the general case where log2n\log_{2}n\notin{\mathbb{N}}, we set k=log2nk=\left\lceil\log_{2}n\right\rceil, so that kk is the smallest integer such that 2k>n2^{k}>n. By setting yj=yny_{j}=y_{n} and mj=mnm_{j}=m_{n} for j=n+1,,2kj=n+1,\cdots,2^{k} (so that the added affine pieces duplicate an existing one and the maximum is unchanged), we may still write φ¯(x)=maxj=1,,2k{xyj+mj}\bar{\varphi}(x)=\max_{j=1,\cdots,2^{k}}\{x\cdot y_{j}+m_{j}\}, and the analysis above directly applies.
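The construction above is easy to verify programmatically. The following sketch (our illustration; it assumes n is a power of two, consistent with the padding argument just described, and all function names are ours) assembles the weight matrices W^ℓ, b^ℓ from given {y_j} and {m_j} via the block-diagonal matrices A_n and H_n, and checks that the resulting ReLU network reproduces φ̄(x)=max_j{x·y_j+m_j}.

```python
import numpy as np
from scipy.linalg import block_diag

# Building blocks from (E.6): h . ReLU(A (a, b)^T) = max(a, b).
A = np.array([[1.0, -1.0],
              [0.0, 1.0],
              [0.0, -1.0]])
h = np.array([[1.0, 1.0, -1.0]])

def max_affine_relu_net(Y, m):
    # Weights (W^l, b^l) of the ReLU network representing
    # phi_bar(x) = max_j (x . y_j + m_j); assumes n = len(Y) is a power of two.
    n = len(Y)
    A_n = block_diag(*([A] * (n // 2)))              # A_n: n/2 copies of A
    weights, biases = [A_n @ Y], [A_n @ m]           # W^0 = A_n Y,  b^0 = A_n m
    width = n // 2
    while width >= 1:
        H = block_diag(*([h] * width))               # H_width: collapses ReLU triples to maxima
        if width > 1:
            W = block_diag(*([A] * (width // 2))) @ H    # W^l = A_width H_width
        else:
            W = H                                        # last layer W^L = H_1
        weights.append(W)
        biases.append(np.zeros(W.shape[0]))
        width //= 2
    return weights, biases

def forward(x, weights, biases):
    z = weights[0] @ x + biases[0]
    for W, b in zip(weights[1:], biases[1:]):
        z = W @ np.maximum(z, 0.0) + b               # ReLU between consecutive affine layers
    return z.item()

rng = np.random.default_rng(5)
d, n = 3, 8
Y, m = rng.standard_normal((n, d)), rng.standard_normal(n)
weights, biases = max_affine_relu_net(Y, m)
x = rng.standard_normal(d)
print(forward(x, weights, biases), np.max(Y @ x + m))   # the two values agree
```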

Appendix F Proof of Main Theorem 2.1

The proof follows directly from Theorem D.1 and Theorem 4.1. Indeed, on the one hand, the quantitative estimates for the convergence of the empirical measure PnP_{n} translate directly into the sample complexity bounds in Theorem 2.1 for a given error ε{\varepsilon}. On the other hand, Theorem 4.1 provides a push-forward map, given by the gradient of a DNN, from pzp_{z} to the empirical measure PnP_{n}.