A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions
Abstract
This paper studies the universal approximation property of deep neural networks for representing probability distributions. Given a target distribution $\pi$ and a source distribution $\mu$, both defined on $\mathbb{R}^d$, we prove under some assumptions that there exists a deep neural network $g:\mathbb{R}^d\to\mathbb{R}$ with ReLU activation such that the push-forward measure $(\nabla g)_\#\mu$ of $\mu$ under the map $\nabla g$ is arbitrarily close to the target measure $\pi$. The closeness is measured by three classes of integral probability metrics between probability distributions: the 1-Wasserstein distance, maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD). We prove upper bounds for the size (width and depth) of the deep neural network in terms of the dimension $d$ and the approximation error $\varepsilon$ with respect to the three discrepancies. In particular, the size of the neural network can grow exponentially in $d$ when the 1-Wasserstein distance is used as the discrepancy, whereas for both MMD and KSD the size of the neural network depends on $d$ at most polynomially. Our proof relies on convergence estimates of empirical measures under the aforementioned discrepancies and on semi-discrete optimal transport.
1 Introduction
In recent years, deep learning has achieved unprecedented success in numerous machine learning problems [31, 54]. The success of deep learning is largely attributed to the use of deep neural networks (DNNs) for representing and learning the unknown structures in machine learning tasks, which are usually modeled by unknown function mappings or unknown probability distributions. The effectiveness of using neural networks (NNs) to approximate functions has been justified rigorously over the last three decades. Specifically, a series of early works [13, 18, 25, 7] on universal approximation theorems show that a continuous function defined on a bounded domain can be approximated by a sufficiently large shallow (two-layer) neural network. In particular, the result of [7] quantifies the approximation error of shallow neural networks in terms of the decay of the Fourier transform of the function of interest. Recently, the expressive power of DNNs for approximating functions has received increasing attention, starting from the works of [37] and [60]; see also [61, 48, 50, 44, 51, 15, 43] for more recent developments. The theoretical benefits of deep neural networks over shallow neural networks have been demonstrated in a sequence of depth separation results; see e.g. [17, 55, 57, 14].
Compared to the vast number of theoretical results on neural networks for approximating functions, the use of neural networks for expressing distributions is far less understood on the theoretical side. The idea of using neural networks to model distributions underpins an important class of unsupervised learning techniques called generative models, where the goal is to approximate or learn complex probability distributions from training samples drawn from those distributions. Typical generative models include Variational Autoencoders [29], Normalizing Flows [49] and Generative Adversarial Networks (GANs) [19], to name a few. In these generative models, the probability distribution of interest can be very complex or computationally intractable, and is usually modelled by transforming a simple distribution using a map parameterized by a (deep) neural network. In particular, a GAN consists of a game between a generator and a discriminator, both represented by deep neural networks: the generator attempts to produce fake samples whose distribution is indistinguishable from the real distribution, and it generates samples by mapping samples from a simple input distribution (e.g. Gaussian) through a deep neural network; the discriminator attempts to learn how to tell the fake apart from the real. Despite the great empirical success of GANs in various applications, their theoretical analysis is far from complete. Existing theoretical works on GANs mainly focus on the trade-off between the generator and the discriminator (see e.g. [42, 2, 3, 38, 5]). The key message from these works is that the discriminator family needs to be chosen appropriately according to the generator family in order to obtain a good generalization error.
Our contributions. In this work, we focus on an even more fundamental question about GANs and other generative models which has not yet been fully addressed: namely, how well can DNNs express probability distributions? We answer this question by making the following contributions.
• Given a fairly general source distribution $\mu$ and a target distribution $\pi$ defined on $\mathbb{R}^d$ which satisfies certain integrability assumptions, we show that there is a ReLU DNN with $d$ inputs and one output such that the push-forward of the source distribution via the gradient of the output function defined by the DNN is arbitrarily close to the target. We measure the closeness between probability distributions by three integral probability metrics (IPMs): the 1-Wasserstein metric, maximum mean discrepancy and kernelized Stein discrepancy.
• Given a desired approximation error $\varepsilon$, we prove complexity upper bounds for the depth and width of the DNN needed to attain the given approximation error with respect to the three IPMs mentioned above; our complexity upper bounds are given with explicit dependence on the dimension $d$ of the target distribution and the approximation error $\varepsilon$.
It is also worth mentioning that the DNN constructed in the paper is explicit: the output function of the DNN is the maximum of finitely many (multivariate) affine functions, with the affine parameters determined explicitly in terms of the source measure and target measure.
Related work. Let us discuss some previous related work and compare the results there with ours. Kong and Chaudhuri [30] analyzed the expressive power of normalizing flow models under a norm between distributions and showed that such flow models have limited approximation capability in high dimensions. We show that feedforward DNNs can approximate a general class of distributions in high dimensions with respect to three IPMs. Lin and Jegelka [39] studied the universal approximation of certain ResNets for multivariate functions; the approximation result there, however, was not quantitative and did not consider the universal approximation of ResNets for distributions. The work by Bailey and Telgarsky [6] considered the approximation power of DNNs for expressing uniform and Gaussian distributions in the Wasserstein distance, whereas we prove quantitative approximation results for fairly general distributions under three IPMs. The work by Lee et al. [32] is closest to ours: the authors considered a class of probability distributions that are given as push-forwards of a base distribution by a class of Barron functions, and showed that those distributions can be approximated in Wasserstein metrics by push-forwards of NNs, essentially relying on the ability of NNs to approximate Barron functions. It is, however, not clear which probability distributions arise as push-forwards of a base distribution by Barron functions. In this work, we provide more explicit and direct criteria on the target distributions.
Notations. Let us introduce several definitions and notations to be used throughout the paper. We start with the definition of a fully connected and feed-forward neural network.
Definition 1.1.
A (fully connected and feed-forward) neural network with $L$ hidden layers takes an input vector $x \in \mathbb{R}^{n_0}$, outputs a vector $y \in \mathbb{R}^{n_{L+1}}$, and has hidden layers of sizes $n_1, n_2, \ldots, n_L$. The neural network is parametrized by the weight matrices $A_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and bias vectors $b_\ell \in \mathbb{R}^{n_\ell}$ with $\ell = 1, \ldots, L+1$. The output $y$ is defined from the input $x$ iteratively according to the following:
$$x_0 = x, \qquad x_\ell = \sigma(A_\ell x_{\ell-1} + b_\ell), \ \ \ell = 1, \ldots, L, \qquad y = A_{L+1} x_L + b_{L+1}. \tag{1.1}$$
Here $\sigma$ is a (nonlinear) activation function which acts on a vector componentwise, i.e. $\sigma(x) = (\sigma(x_1), \ldots, \sigma(x_n))$. When $n_1 = n_2 = \cdots = n_L = N$, we say the network has width $N$ and depth $L$. The neural network is said to be a deep neural network (DNN) if $L \geq 2$. The map from the input $x$ to the output $y$ is called the function defined by the deep neural network.
Popular choices of activation functions include the rectified linear unit (ReLU) function $\sigma(x) = \max(x, 0)$ and the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$.
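To make Definition 1.1 concrete, here is a minimal NumPy sketch of the forward map (1.1) with ReLU activation; the function names and the randomly drawn weights are illustrative only and are not part of the paper's construction.

```python
import numpy as np

def relu(z):
    # ReLU activation, applied componentwise
    return np.maximum(z, 0.0)

def nn_forward(x, weights, biases, activation=relu):
    """Forward map of a fully connected feed-forward network as in (1.1):
    the activation follows every hidden layer, the output layer is affine."""
    h = x
    for A, b in zip(weights[:-1], biases[:-1]):
        h = activation(A @ h + b)
    return weights[-1] @ h + biases[-1]

# toy example: 3 inputs, two hidden layers of width 5, a single output
rng = np.random.default_rng(0)
sizes = [3, 5, 5, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
print(nn_forward(rng.standard_normal(3), weights, biases))
```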
Given a matrix $A$, we denote its $k$-fold direct sum by $\bigoplus^{k} A = \operatorname{diag}(A, \ldots, A)$. We denote by $\mathcal{P}_2(\mathbb{R}^d)$ the space of probability measures on $\mathbb{R}^d$ with finite second moment. Given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, a transport map $T$ between $\mu$ and $\nu$ is a measurable map $T: \mathbb{R}^d \to \mathbb{R}^d$ such that $T_\# \mu = \nu$, where $T_\# \mu$ denotes the push-forward of $\mu$ under the map $T$, i.e., for any measurable set $B \subset \mathbb{R}^d$, $T_\# \mu(B) = \mu(T^{-1}(B))$. We denote by $\Gamma(\mu, \nu)$ the set of transport plans between $\mu$ and $\nu$, which consists of all couplings $\gamma$ of $\mu$ and $\nu$, i.e., probability measures $\gamma$ on $\mathbb{R}^d \times \mathbb{R}^d$ with $\gamma(B \times \mathbb{R}^d) = \mu(B)$ and $\gamma(\mathbb{R}^d \times B) = \nu(B)$ for any measurable set $B$. We may use $C$ to denote generic constants which do not depend on the quantities of interest (e.g. the dimension $d$).
The rest of the paper is organized as follows. We describe the problem and state the main result in Section 2. Sections 3 and 4 are devoted to the two ingredients for proving the main result: convergence of empirical measures in IPMs, and building neural-network-based maps between the source measure and empirical measures via semi-discrete optimal transport, respectively. Proofs of lemmas and intermediate results are provided in the appendices.
2 Problem Description and Main Result
Let $\pi$ be the target probability distribution on $\mathbb{R}^d$ which one would like to learn or generate samples from. In the framework of GANs, one is interested in representing $\pi$ implicitly by a generative neural network. Specifically, let $\mathcal{G}$ be a set of generators (transformations) $g: \mathbb{R}^d \to \mathbb{R}^d$ defined by neural networks; the concrete form of $\mathcal{G}$ is to be specified later. Let $\mu$ be a source distribution (e.g. standard normal). The push-forward of $\mu$ under the transformation $g$ is denoted by $g_\# \mu$. In a GAN problem, one aims to find $g \in \mathcal{G}$ such that $g_\# \mu \approx \pi$. In mathematical terms, GANs can be formulated as the following minimization problem:
$$\min_{g \in \mathcal{G}} \ d_{\mathcal{F}}(g_\# \mu, \pi), \tag{2.1}$$
where $d_{\mathcal{F}}(\cdot,\cdot)$ is a discrepancy measure between probability measures, which typically takes the form of an integral probability metric (IPM), or adversarial loss, defined by
$$d_{\mathcal{F}}(\mu, \nu) := \sup_{f \in \mathcal{F}} \Big\{ \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] \Big\}, \tag{2.2}$$
where $\mathcal{F}$ is a certain class of test (or witness) functions. As a consequence, GANs can be formulated as the minimax problem $\min_{g \in \mathcal{G}} \sup_{f \in \mathcal{F}} \big\{ \mathbb{E}_{X \sim \pi}[f(X)] - \mathbb{E}_{Z \sim \mu}[f(g(Z))] \big\}$.
The present paper aims to answer the following fundamental questions on GANs:
(1) Is there a neural-network-based generator $g$ such that $g_\# \mu \approx \pi$?
(2) How to quantify the complexity (e.g. depth and width) of the neural network?
As we shall see below, the answers to the questions above depend on the IPM used to measure the discrepancy between distributions. In this paper, we are interested in three IPMs which are commonly used in GANs, namely the 1-Wasserstein distance [59, 1], maximum mean discrepancy [21, 16, 36] and kernelized Stein discrepancy [40, 12, 27].
Wasserstein Distance: When the witness class is chosen as the class of 1-Lipschitz functions, i.e. $\mathcal{F} = \{f : \mathrm{Lip}(f) \le 1\}$, the resulting IPM becomes the 1-Wasserstein distance (also known as the Kantorovich–Rubinstein distance): $W_1(\mu, \nu) = \sup_{\mathrm{Lip}(f) \le 1} \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)]$.
The Wasserstein-GAN proposed by [1] leverages the Wasserstein distance as the objective function to improve the training stability of the original GAN, which is based on the Jensen-Shannon divergence. Nevertheless, it has been shown that the Wasserstein-GAN still suffers from mode collapse [23] and does not generalize with any polynomial number of training samples [2].
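As a small illustration (ours, not from the paper), the 1-Wasserstein distance between two empirical measures can be computed exactly in one dimension, e.g. with `scipy.stats.wasserstein_distance`; in higher dimensions it requires solving a discrete optimal transport problem. The sketch below also checks the order-statistics formula for equally weighted samples of the same size.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(0.0, 1.0, size=n)   # samples from one measure
y = rng.normal(0.5, 1.2, size=n)   # samples from another measure

# exact 1-Wasserstein distance between the two empirical measures (1D case)
w1 = wasserstein_distance(x, y)

# order-statistics formula for equally weighted samples of equal size:
# W1 = (1/n) * sum_i |x_(i) - y_(i)|
w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))
print(w1, w1_sorted)   # the two values agree up to floating-point error
```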
Maximum Mean Discrepancy (MMD): When $\mathcal{F}$ is the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with kernel $k$, i.e. $\mathcal{F} = \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1\}$, the resulting IPM coincides with the maximum mean discrepancy (MMD) [21]: $\mathrm{MMD}(\mu, \nu) = \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] = \big\| \int k(\cdot, x)\, \mu(dx) - \int k(\cdot, y)\, \nu(dy) \big\|_{\mathcal{H}}$.
GANs based on minimizing MMD as the loss function were first proposed in [16, 36]. Since MMD is a weaker metric than the 1-Wasserstein distance, MMD-GANs also suffer from the mode collapse issue, but empirical results (see e.g. [8]) suggest that they require smaller discriminative networks and hence enable faster training than Wasserstein-GANs.
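For concreteness, the following sketch (ours, not the paper's) computes the standard biased V-statistic estimator of the squared MMD between two samples with a Gaussian kernel, which is a bounded characteristic kernel; the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), evaluated on all pairs of rows
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_squared(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between the empirical measures of X and Y."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # samples from one distribution
Y = rng.normal(size=(500, 2)) + 0.3      # samples from a shifted distribution
print(np.sqrt(max(mmd_squared(X, Y), 0.0)))
```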
Kernelized Stein Discrepancy (KSD): If the witness class is chosen as $\mathcal{F} = \{\mathcal{A}_\pi f : f \in \mathcal{H}^d,\ \|f\|_{\mathcal{H}^d} \le 1\}$, where $\mathcal{A}_\pi$ is the Stein operator defined by
$$(\mathcal{A}_\pi f)(x) := \nabla \log \pi(x) \cdot f(x) + \nabla \cdot f(x), \tag{2.3}$$
the associated IPM becomes the kernelized Stein discrepancy (KSD) [40, 12]: $\mathrm{KSD}(\mu \,\|\, \pi) = \sup_{\|f\|_{\mathcal{H}^d} \le 1} \mathbb{E}_{X \sim \mu}\big[(\mathcal{A}_\pi f)(X)\big]$.
The KSD has gained great popularity in machine learning and statistics since it is easy to compute and does not depend on the normalization constant of $\pi$, which makes it suitable for statistical computation such as hypothesis testing [20] and statistical sampling [41, 11]. The recent paper [27] adopts the GAN formulation (2.1) with KSD as the training loss to construct a new sampling algorithm called the Stein Neural Sampler.
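The following sketch (ours, not the paper's) evaluates the standard V-statistic estimate of the squared KSD from [40, 12] for a Gaussian kernel, assuming the usual Stein kernel $u_\pi(x,y) = s_\pi(x)\cdot s_\pi(y)\,k(x,y) + s_\pi(x)\cdot\nabla_y k(x,y) + s_\pi(y)\cdot\nabla_x k(x,y) + \nabla_x\cdot\nabla_y k(x,y)$ with score $s_\pi = \nabla\log\pi$. The target here is a standard normal, and all names are illustrative.

```python
import numpy as np

def ksd_vstat(X, score, h=1.0):
    """V-statistic estimate of KSD(mu_n || pi)^2 with a Gaussian kernel of bandwidth h.

    X: (n, d) samples from the candidate measure mu_n.
    score: function returning grad log pi evaluated at each row of X, shape (n, d).
    """
    n, d = X.shape
    S = score(X)                                  # s(x_i), shape (n, d)
    diff = X[:, None, :] - X[None, :, :]          # x_i - x_j
    sq = np.sum(diff**2, axis=-1)                 # ||x_i - x_j||^2
    K = np.exp(-sq / (2 * h**2))                  # k(x_i, x_j)
    grad_x_k = -diff / h**2 * K[..., None]        # grad_x k(x_i, x_j)
    grad_y_k = -grad_x_k                          # grad_y k(x_i, x_j)
    trace_term = K * (d / h**2 - sq / h**4)       # trace of grad_x grad_y k

    u = ((S @ S.T) * K
         + np.einsum('id,ijd->ij', S, grad_y_k)
         + np.einsum('jd,ijd->ij', S, grad_x_k)
         + trace_term)
    return u.mean()                               # (1/n^2) sum_{i,j} u(x_i, x_j)

rng = np.random.default_rng(0)
score_std_normal = lambda x: -x          # grad log pi for pi = N(0, I)
X_good = rng.normal(size=(300, 2))       # samples close to the target
X_bad = rng.normal(size=(300, 2)) + 1.0  # shifted samples
print(ksd_vstat(X_good, score_std_normal), ksd_vstat(X_bad, score_std_normal))
```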
2.1 Main result
Throughout the paper, we consider the following assumptions on the reproducing kernel $k$:
Assumption K1.
The kernel $k$ is integrally strictly positive definite: for every finite non-zero signed Borel measure $\eta$ on $\mathbb{R}^d$, $\iint_{\mathbb{R}^d \times \mathbb{R}^d} k(x, y)\, d\eta(x)\, d\eta(y) > 0$.
Assumption K2.
There exists a constant such that
(2.4) |
Assumption K3.
The kernel function is twice differentiable and there exists a constant such that
(2.5) |
According to [54, Theorem 7], Assumption K1 is necessary and sufficient for the kernel $k$ to be characteristic, i.e., $\mathrm{MMD}(\mu, \nu) = 0$ implies $\mu = \nu$, which guarantees that MMD is a metric. In addition, thanks to [40, Proposition 3.3], KSD is a valid discrepancy measure under Assumption K1, namely $\mathrm{KSD}(\mu \,\|\, \pi) \ge 0$, and $\mathrm{KSD}(\mu \,\|\, \pi) = 0$ if and only if $\mu = \pi$.
Assumption K2 will be used to obtain an error bound in MMD; see Proposition 3.2. Assumption K3 will be crucial for bounding the KSD; see Proposition 3.3. Many commonly used kernel functions fulfill all three Assumptions K1-K3, including for example the Gaussian kernel $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$ and the inverse multiquadric (IMQ) kernel $k(x,y) = (c^2 + \|x-y\|^2)^{\beta}$ with $c > 0$ and $\beta \in (-1, 0)$. Unfortunately, Matérn kernels [46] only satisfy Assumptions K1-K2 but not Assumption K3, since their second order derivatives are singular on the diagonal, so the last estimate of (2.5) is violated.
In order to bound , we need to assume further that the target measure satisfies the following regularity and integrability assumptions. We will use the shorthand notation .
Assumption 1 (-Lipschitz).
Assume that is globally Lipschitz in , i.e. there exists a constant such that for all . As a result, there exists such that
(2.6) |
Assumption 2 (sub-Gaussian).
The probability measure is sub-Gaussian, i.e. there exist and such that
Assume further that for some .
Our main result is the universal approximation theorem for expressing probability distributions.
Theorem 2.1 (Main theorem).
Let $\pi$ and $\mu$ be the target and the source distributions respectively, both defined on $\mathbb{R}^d$. Assume that $\mu$ is absolutely continuous with respect to the Lebesgue measure. Then, under certain assumptions on $\pi$ and the kernel $k$ to be specified below, the following holds: for any given approximation error $\varepsilon > 0$, there exist a positive integer $n$ and a fully connected and feed-forward deep neural network $u$ of depth and width determined by $n$ and $d$, with $d$ inputs and a single output and with ReLU activation, such that the push-forward of $\mu$ under $\nabla u$ is within $\varepsilon$ of $\pi$ in the metric $d_{\mathcal{F}}$. The complexity parameter $n$ depends on the choice of the metric $d_{\mathcal{F}}$. Specifically,
1. Consider . If satisfies that , it holds that
where the constant depends only on .
Theorem 2.1 states that a given probability measure (satisfying certain integrability assumptions) can be approximated arbitrarily well by pushing forward a source distribution with the gradient of a potential which can be parameterized by a finite DNN. The complexity bound in the Wasserstein case suffers from the curse of dimensionality, whereas this issue is eliminated in the cases of MMD and KSD. We remark that our complexity bound for the neural network is stated in terms of the width and depth, which in turn leads to an estimate of the number of weights needed to achieve a given approximation error.
Proof strategy. The proof of Theorem 2.1 is given in Appendix F and relies on two ingredients: (1) one approximates the target measure by an empirical measure, for which Propositions 3.1-3.3 give quantitative estimates w.r.t. the three IPMs defined above; (2) based on the theory of (semi-discrete) optimal transport, one can build an optimal transport map, given as the gradient of a potential, which pushes the source distribution forward to the empirical distribution. Moreover, the potential is explicit: it is the maximum of finitely many affine functions, and it is this explicit structure that enables one to represent the function with a finite deep neural network; see Theorem 4.1 for the precise statement.
It is interesting to remark that our strategy of proving Theorem 2.1 shares the same spirit as the one used to prove universal approximation theorems of DNNs for functions [60, 37]. Indeed, both the universal approximation theorems in those works and ours are proved by approximating the target function or distribution with a suitable dense subset (or sieves) on the space of functions or distributions which can be parametrized by deep neural networks. Specifically, in [60, 37] where the goal is to approximate continuous functions on a compact set, the dense sieves are polynomials which can be further approximated by the output functions of DNNs, whereas in our case we use empirical measures as the sieves for approximating distributions, and we show that empirical measures are exactly expressible by transporting a source distribution with neural-network-based transport maps.
We also remark that the push-forward map between probability measures constructed in Theorem 2.1 is the gradient of a potential function given by a neural network, i.e., the neural network is used to parametrize the potential function instead of the map itself, which is perhaps more commonly used in practice. The rationale for using such a gradient formulation is that transport maps between two probability measures are in general discontinuous. This occurs in particular when the target measure has disjoint modes (or supports) while the input has a unique mode, which is ubiquitous in GAN applications where images concentrate on disjoint modes while the input is chosen to be Gaussian. In such cases it is impossible to generate the target measure using usual NNs, since the resulting functions are all continuous (in fact Lipschitz for ReLU activation and smooth for sigmoid activation). Thus, from a theoretical viewpoint, it is advantageous to use NNs to parametrize the Brenier potential, since it is more regular (at least Lipschitz by OT theory), and to use its gradient as the transport map. From a practical perspective, the idea of taking gradients of NNs has already been used and has received increasing attention in practical learning problems; see [34, 24, 45, 56] for applications of such a parameterization to learning optimal transport maps and improving the training of Wasserstein-GANs; see also [62] for learning interatomic potentials and forces for molecular dynamics simulations.
3 Convergence of Empirical Measures in Various IPMs
In this section, we consider the approximation of a given target measure by empirical measures. More specifically, let $X_1, \ldots, X_n$ be i.i.d. random samples from the target distribution $\pi$ and let $\hat{\pi}_n := \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ be the empirical measure associated to these samples. Our goal is to derive quantitative error estimates for the discrepancy between $\hat{\pi}_n$ and $\pi$ with respect to the three IPMs described in the last section.
We first state an upper bound, in the average (expectation) sense, in the next proposition.
Proposition 3.1 (Convergence in -Wasserstein distance).
Consider the IPM with . Assume that satisfies that . Then there exists a constant depending on such that
The convergence rates of as stated in Proposition 3.1 are well-known in the statistics literature. The statement in Proposition 3.1 is a combination of results from [9] and [33]; see Appendix A for a short proof. We remark that the prefactor constant in the estimate above can be made explicit. In fact, one can easily obtain from the moment bound in Proposition C.1 of Appendix C that if is sub-Gaussian with parameters and , then the constant can be chosen as with some constant depending only on and . Moreover, one can also obtain a high probability bound for if is sub-exponential (see e.g., [33, Corollary 5.2]). Here we content ourselves with the expectation result as it comes with weaker assumptions and also suffices for our purpose of showing the existence of an empirical measure with desired approximation rate.
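As a quick numerical sanity check (ours, not part of the paper), the one-dimensional rate can be observed directly: for a standard normal target, the empirical 1-Wasserstein error decays roughly like $n^{-1/2}$. In higher dimensions the rate deteriorates to roughly $n^{-1/d}$, but computing $W_1$ then requires a discrete OT solver, which we omit here.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(size=200_000)   # large-sample proxy for the target measure (d = 1)

for n in [100, 400, 1600, 6400]:
    errs = [wasserstein_distance(rng.normal(size=n), reference) for _ in range(20)]
    # the rescaled error sqrt(n) * E[W1] should stay roughly of the same order
    print(n, np.mean(errs), np.sqrt(n) * np.mean(errs))
```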
Moving on to approximation in MMD, the following proposition gives a high-probability non-asymptotic error bound of .
Proposition 3.2 (Convergence in MMD).
Consider the IPM with . Assume that the kernel satisfies Assumption K2 with constant . Then for every , with probability at least ,
Proposition 3.2 can be viewed as a special case of [53, Theorem 3.3], where the kernel class is a singleton. Since its proof is short, we provide it in Appendix B for completeness.
In the next proposition, we consider the convergence of empirical measures to the target in the KSD. To the best of our knowledge, this is the first estimate on empirical measures under the KSD in the literature. This result can be useful for obtaining quantitative error bounds for the new GAN/sampler called the Stein Neural Sampler [27]. The proof relies on a Bernstein-type inequality for the distribution of von Mises statistics; the details are deferred to Appendix C.
Proposition 3.3 (Convergence in KSD).
Remark 3.1.
Proposition 3.3 provides a non-asymptotic, high-probability error bound for the convergence of the empirical measure to the target in KSD. Our result implies in particular that the KSD of the empirical measure vanishes with an explicit asymptotic rate. We also remark that the rate is optimal and is consistent with the asymptotic CLT result for the corresponding U-statistic (see [40, Theorem 4.1 (2)]).
4 Constructing Transport Maps via Semi-discrete Optimal Transport
In this section, we aim to build a neural-network-based map which pushes a given source distribution forward to discrete probability measures, including in particular empirical measures. The main result of this section is the following theorem.
Theorem 4.1.
Let be absolutely continuous with respect to the Lebesgue measure with Radon–Nikodym density . Let for some and . Then there exists a transport map of the form such that where is a fully connected deep neural network of depth and width , and with ReLU activation function and parameters such that .
As shown below, the transport map in Theorem 4.1 is chosen as the optimal transport map from the continuous source distribution to the discrete target distribution, which turns out to be the gradient of a piece-wise linear function, which in turn can be expressed by neural networks. We remark that the weights and biases of the constructed neural network can also be characterized explicitly in terms of the source density and the target support points and weights (see the proof of Proposition 4.1). Since semi-discrete optimal transport plays an essential role in the proof of Theorem 4.1, we first recall the set-up and some key results on optimal transport in both the general and the semi-discrete settings.
Optimal transport with quadratic cost. Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^d$ with finite second moments and let $c(x, y) = |x - y|^2$ be the quadratic cost. Then Monge's [47] optimal transportation problem is to transport the probability mass between $\mu$ and $\nu$ while minimizing the quadratic cost, i.e.
$$\inf_{T :\, T_\# \mu = \nu} \int_{\mathbb{R}^d} |x - T(x)|^2 \, d\mu(x). \tag{4.1}$$
A map attaining the infimum above is called an optimal transport map. In general an optimal transport map may not exist, since Monge's formulation does not allow splitting mass and hence the set of transport maps may be empty. On the other hand, Kantorovich [28] relaxed the problem by minimizing the transportation cost over transport plans instead of transport maps:
$$\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, d\gamma(x, y). \tag{4.2}$$
A coupling achieving the infimum above is called an optimal coupling. Noting that problem (4.2) is a linear program, Kantorovich proposed a dual formulation for (4.2):
$$\sup_{(\varphi, \psi) \in \Phi_c} \int_{\mathbb{R}^d} \varphi \, d\mu + \int_{\mathbb{R}^d} \psi \, d\nu,$$
where $\Phi_c$ is the set of pairs of measurable functions $(\varphi, \psi)$ satisfying $\varphi(x) + \psi(y) \le c(x, y)$ for all $x, y \in \mathbb{R}^d$. We also define the $c$-transform of a function $\varphi$ by $\varphi^c(y) := \inf_{x \in \mathbb{R}^d} \big( c(x, y) - \varphi(x) \big)$.
Similarly, one can define $\psi^c$ associated to $\psi$. Kantorovich's duality theorem (see e.g. [59, Theorem 5.10]) states that
$$\inf_{\gamma \in \Gamma(\mu, \nu)} \int c \, d\gamma \;=\; \sup_{(\varphi, \psi) \in \Phi_c} \Big( \int \varphi \, d\mu + \int \psi \, d\nu \Big). \tag{4.3}$$
Moreover, if the source measure is absolutely continuous with respect to the Lebesgue measure, then the optimal transport map defined in Monge's problem is given by a gradient field, which is usually referred to as Brenier's map and can be characterized explicitly in terms of the solution of the dual Kantorovich problem. A precise statement is included in Theorem E.1 in Appendix E.
Semi-discrete optimal transport. Let us now consider the optimal transport problem in the semi-discrete setting: the source measure is continuous and the target measure is discrete. Specifically, assume that the source measure $\mu$ is absolutely continuous with respect to the Lebesgue measure, i.e. $d\mu = \rho(x)\,dx$ for some probability density $\rho$, and that the target is discrete, i.e. $\nu = \sum_{i=1}^{n} \nu_i \delta_{x_i}$ for some points $x_i \in \mathbb{R}^d$ and weights $\nu_i > 0$ with $\sum_{i} \nu_i = 1$. In the semi-discrete setting, Monge's problem becomes
$$\inf_{T :\, T_\# \mu = \nu} \int_{\mathbb{R}^d} |x - T(x)|^2 \rho(x)\, dx. \tag{4.4}$$
In this case the action of the transport map is clear: it assigns each point $x$ to one of the points $x_i$. Moreover, by taking advantage of the discreteness of the measure $\nu$, one sees that the dual Kantorovich problem in the semi-discrete case becomes maximizing the following functional
(4.5) |
Similar to the continuum setting, the optimal transport map of Monge’s problem (4.4) can be characterized by the maximizer of . To see this, let us introduce an important concept of power diagram (or Laguerre diagram) [4, 52]. Given a finite set of points and the scalars , the power diagrams associated to the scalars and the points are the sets
(4.6) |
By grouping the points according to the power diagrams , we have from (4.5) that
(4.7) |
The following theorem characterizes the optimal transport map of Monge’s problem (4.4) in terms of the power diagrams associated to the points and the maximizer of .
Theorem 4.2.
Let $\mu$ be absolutely continuous with respect to the Lebesgue measure with Radon–Nikodym density $\rho$. Let $\nu = \sum_{i=1}^{n} \nu_i \delta_{x_i}$. Let $\psi^{\ast}$ be a maximizer of the functional defined in (4.7). Denote by $\{P_i\}_{i=1}^{n}$ the power diagrams associated to $\psi^{\ast}$ and the points $\{x_i\}_{i=1}^{n}$. Then the optimal transport map solving the semi-discrete Monge's problem (4.4) is given by the gradient of the piece-wise affine function $u(x) = \max_{1 \le i \le n} \big( \langle x, x_i \rangle + \tilde{\psi}_i \big)$ for suitable constants $\tilde{\psi}_i$ determined by $\psi^{\ast}$. Specifically, the map sends $x$ to $x_i$ if $x \in P_i$.
Theorem 4.2 shows that the optimal transport map in the semi-discrete case is the gradient of a particular piece-wise affine function, namely the maximum of finitely many affine functions. A similar result was proved in [22] in the case where the source measure is defined on a compact convex domain. We provide a proof of Theorem 4.2, which deals with measures on the whole space $\mathbb{R}^d$, in Appendix E.2.
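Although the paper only needs the existence of a maximizer, such a maximizer can also be computed numerically. The sketch below (ours, with illustrative names and an ad hoc step size) runs stochastic gradient ascent on the concave semi-discrete dual written, under our assumptions, in the form $\mathcal{F}(\psi)=\sum_i \nu_i\psi_i + \int \min_i(|x-x_i|^2-\psi_i)\,d\mu(x)$, using the fact (cf. Lemma E.2) that $\partial\mathcal{F}/\partial\psi_i = \nu_i - \mu(P_i(\psi))$ and estimating the cell masses $\mu(P_i)$ by Monte Carlo.

```python
import numpy as np

def fit_semidiscrete_dual(sample_source, points, weights, steps=2000, lr=2.0, batch=4096):
    """Stochastic ascent on the concave semi-discrete dual in the variables psi.

    sample_source(m) returns m samples from the continuous source measure mu;
    points (n, d) and weights (n,) describe the discrete target measure.
    Uses dF/dpsi_i = weights_i - mu(P_i(psi)), with the power-cell masses
    mu(P_i) estimated by Monte Carlo on each step.
    """
    n = points.shape[0]
    psi = np.zeros(n)
    sq = np.sum(points**2, axis=1)
    for t in range(steps):
        x = sample_source(batch)
        # power cell of x: argmin_i ||x - x_i||^2 - psi_i  (the |x|^2 term is constant in i)
        cells = np.argmin(-2.0 * x @ points.T + sq[None, :] - psi[None, :], axis=1)
        cell_mass = np.bincount(cells, minlength=n) / batch
        psi += lr / np.sqrt(t + 1) * (weights - cell_mass)   # decaying ascent step
    return psi

def transport(x, points, psi):
    """The semi-discrete optimal map: send x to the point whose power cell contains it.
    Equivalently, x goes to the argmax of <x, x_i> + (psi_i - |x_i|^2)/2, i.e. to the
    gradient of a max-affine (Brenier-type) potential."""
    sq = np.sum(points**2, axis=1)
    return points[np.argmin(-2.0 * x @ points.T + sq[None, :] - psi[None, :], axis=1)]

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 2))                       # support of an empirical target measure
w = np.full(20, 1 / 20)                            # equal weights 1/n
src = np.random.default_rng(1)
psi = fit_semidiscrete_dual(lambda m: src.normal(size=(m, 2)), Y, w)

Z = np.random.default_rng(2).normal(size=(50_000, 2))
pushed = transport(Z, Y, psi)                      # push-forward of the source samples
counts = np.array([(pushed == y).all(axis=1).sum() for y in Y]) / len(Z)
print(counts)                                      # each entry should be roughly 1/20
```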
The next proposition shows that the piece-wise linear function defined in Theorem 4.2 can be expressed exactly by a deep neural network.
Proposition 4.1.
Let with and . Then there exists a fully connected deep neural network of depth and width , and with ReLU activation function and parameters such that .
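The following is a minimal sketch of the object behind Theorem 4.2 and Proposition 4.1, assuming the standard identity $\max(a,b) = b + \mathrm{ReLU}(a-b)$ used in Appendix E.3: the potential $u(x)=\max_i(\langle x, x_i\rangle + \psi_i)$ is evaluated by iterating pairwise ReLU maxima, and its gradient simply returns the maximizing point $x_i$, which is exactly the action of the transport map. All function names are ours, and this is an illustration rather than the paper's explicit network.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max_affine_potential(x, points, psi):
    """u(x) = max_i (<x, x_i> + psi_i), computed with pairwise ReLU maxima
    via max(a, b) = b + relu(a - b), iterated over about log2(n) rounds."""
    vals = points @ x + psi            # the n affine functions evaluated at x
    while vals.size > 1:
        if vals.size % 2 == 1:         # duplicate the last value so pairs match up
            vals = np.append(vals, vals[-1])
        a, b = vals[0::2], vals[1::2]
        vals = b + relu(a - b)         # pairwise maxima, expressible by a ReLU layer
    return vals[0]

def transport_map(x, points, psi):
    """The gradient of the max-affine potential: x is sent to the maximizing point x_i."""
    return points[np.argmax(points @ x + psi)]

rng = np.random.default_rng(0)
points = rng.normal(size=(7, 3))       # x_1, ..., x_n: support of the discrete target
psi = rng.normal(size=7)               # affine offsets (e.g. coming from the dual problem)
x = rng.normal(size=3)

assert np.isclose(max_affine_potential(x, points, psi), np.max(points @ x + psi))
print(transport_map(x, points, psi))
```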
5 Conclusion
In this paper, we establish that certain general classes of target distributions can be expressed arbitrarily well, with respect to three types of IPMs, by transporting a source distribution with maps which can be parametrized by DNNs. We provide upper bounds on the depth and width of the DNNs needed to achieve a given approximation error; the upper bounds are established with explicit dependence on the dimension of the underlying distributions and on the approximation error.
Broader Impact
This work focuses on theoretical properties of neural networks for expressing probability distributions. Our work can help in understanding the theoretical benefits and limitations of neural networks in approximating probability distributions under various integral probability metrics. Our work and the proof techniques improve our understanding of the theoretical underpinnings of Generative Adversarial Networks and other generative models used in machine learning, and may lead to better use of these techniques, with possible benefits to society.
Acknowledgments and Disclosure of Funding
The work of JL is supported in part by the National Science Foundation via grants DMS-2012286 and CCF-1934964 (Duke TRIPODS).
References
- [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223, 2017.
- [2] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017.
- [3] S. Arora and Y. Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
- [4] F. Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.
- [5] Y. Bai, T. Ma, and A. Risteski. Approximability of discriminators implies diversity in gans. arXiv preprint arXiv:1806.10586, 2018.
- [6] B. Bailey and M. J. Telgarsky. Size-noise tradeoffs in generative networks. In Advances in Neural Information Processing Systems, pages 6489–6499, 2018.
- [7] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
- [8] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
- [9] S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Mem. Amer. Math. Soc., 261(1259):0, 2019.
- [10] I. Borisov. Approximation of distributions of von mises statistics with multidimensional kernels. Siberian Mathematical Journal, 32(4):554–566, 1991.
- [11] W. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In International Conference on Machine Learning, pages 843–852, 2018.
- [12] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 2606–2615. JMLR. org, 2016.
- [13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
- [14] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
- [15] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) relu networks. arXiv preprint arXiv:1905.02199, 2019.
- [16] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267. AUAI Press, 2015.
- [17] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
- [18] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
- [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [20] J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1292–1301. JMLR. org, 2017.
- [21] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- [22] X. Gu, F. Luo, J. Sun, and S.-T. Yau. Variational principles for minkowski type problems, discrete optimal transport, and discrete monge–ampère equations. Asian Journal of Mathematics, 20(2):383–398, 2016.
- [23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
- [24] Y. Guo, D. An, X. Qi, Z. Luo, S.-T. Yau, and X. Gu. Mode collapse and regularity of optimal transportation maps, 2019. arXiv preprint arXiv:1902.02934.
- [25] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- [26] D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
- [27] T. Hu, Z. Chen, H. Sun, J. Bai, M. Ye, and G. Cheng. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.
- [28] L. V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.
- [29] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [30] Z. Kong and K. Chaudhuri. The expressive power of a class of normalizing flow models. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, pages 3599–3609, 2020.
- [31] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
- [32] H. Lee, R. Ge, T. Ma, A. Risteski, and S. Arora. On the ability of neural nets to express distributions. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1271–1296, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
- [33] J. Lei. Convergence and concentration of empirical measures under wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020.
- [34] N. Lei, K. Su, L. Cui, S.-T. Yau, and X. D. Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
- [35] B. Lévy and E. L. Schwindt. Notions of optimal transport theory and how to implement them on a computer. Computers & Graphics, 72:135–148, 2018.
- [36] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
- [37] S. Liang and R. Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.
- [38] T. Liang. On how well generative adversarial networks learn densities: Nonparametric and parametric results. arXiv preprint arXiv:1811.03179, 2018.
- [39] H. Lin and S. Jegelka. Resnet with one-neuron hidden layers is a universal approximator. In Advances in neural information processing systems, pages 6169–6178, 2018.
- [40] Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284, 2016.
- [41] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386, 2016.
- [42] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
- [43] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions, 2020. arXiv preprint arXiv:2001.03040.
- [44] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in neural information processing systems, pages 6231–6239, 2017.
- [45] A. V. Makkuva, A. Taghvaei, S. Oh, and J. D. Lee. Optimal transport mapping via input convex neural networks. arXiv preprint arXiv:1908.10962, 2019.
- [46] B. Matérn. Spatial variation, volume 36. Springer Science & Business Media, 2013.
- [47] G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
- [48] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural Networks, 108:296–330, 2018.
- [49] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
- [50] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
- [51] Z. Shen, H. Yang, and S. Zhang. Deep network approximation characterized by number of neurons, 2019. arXiv preprint arXiv:1906.05497.
- [52] D. Siersma and M. Van Manen. Power diagrams and their applications. arXiv preprint math.MG/0508037, 2005.
- [53] B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.
- [54] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
- [55] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [56] A. Taghvaei and A. Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.
- [57] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
- [58] C. Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
- [59] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2009.
- [60] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
- [61] D. Yarotsky. Optimal approximation of continuous functions by very deep relu networks. arXiv preprint arXiv:1802.03620, 2018.
- [62] L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical review letters, 120(14):143001, 2018.
Appendix
Appendix A Proof of Proposition 3.1
Proof.
The proof follows from some previous results by [9] and [33]. In fact, in the one dimensional case, according to [9, Theorem 3.2], we know that if satisfies that
(A.1) |
where is the cumulative distribution function of , then for every ,
(A.2) |
The condition (A.1) is fulfilled if has finite third moment since
In the case that , it follows from [33, Theorem 3.1] that if , then there exists a constant independent of such that
(A.3) |
∎
Appendix B Proof of Proposition 3.2
Proof.
Thanks to [53, Proposition 3.1], one has that
Let us define Then by definition satisfies that for any ,
where we have used that by assumption. It follows from the above and McDiarmid's inequality that for every , with probability ,
In addition, we have by the standard symmetrization argument that
where are i.i.d. Rademacher variables and represents the conditional expectation w.r.t. given . To bound the right hand side above, we can apply McDiarmid's inequality again to obtain that with probability at least ,
where we have used Jensen’s inequality for expectation in the second inequality and the independence of and the definition of in the last inequality. Combining the estimates above yields that with probability at least ,
∎
Appendix C Proof of Proposition 3.3
Thanks to [40, Theorem 3.6], is evaluated explicitly as
(C.1) |
where is a new kernel defined by
with . Moreover, according to [40, Proposition 3.3], if satisfies Assumption K1, then is non-negative.
Our proof of Proposition 3.3 relies on the fact that the quantity in (C.1) can be viewed as a von Mises statistic (V-statistic) and on an important Bernstein-type inequality due to [10] for the distribution of V-statistics, which gives a concentration bound of the statistic around its mean (which is zero). We recall this inequality in the theorem below, which is a restatement of [10, Theorem 1] for second-order degenerate V-statistics.
C.1 Bernstein type inequality for von Mises’ statistics
Let be a sequence of i.i.d. random variables on . For a kernel , we call
(C.2) |
a von-Mises’ statistic of order with kernel . We say that the kernel is degenerate if the following holds:
(C.3) |
C.2 Moment bound of sub-Gaussian random vectors
Let us first recall a useful concentration result on sub-Gaussian random vectors.
Theorem C.2 ([26, Theorem 2.1]).
Let be a sub-Gaussian random vector with parameters and . Then for any ,
(C.8) |
Moreover, for any ,
(C.9) |
As a direct consequence of Theorem C.2, we have the following useful moment bound for sub-Gaussian random vectors.
Proposition C.1.
Let be a sub-Gaussian random vector with parameters and . Then for any ,
(C.10) |
Proof.
From the concentration bound (C.8) and the simple fact that
one can obtain that
Therefore, for any , we obtain from above with that
(C.11) |
As a result, for any ,
(C.12) | ||||
where we have used the change of variable in the second line above. It is clear that the first term
For , one first notices that if , then . Hence it follows from (C.8) that
The last two estimates imply that
where the second inequality above follows from for . ∎
C.3 Proof of Proposition 3.3
Our goal is to invoke Theorem C.1 to obtain a concentration inequality for KSD. Recall that is defined by
with the kernel
Let us first verify that the new kernel satisfies the assumption of Theorem C.1. In fact, since , one obtains from integration by parts that
Similarly, one has
This shows that satisfies the condition of degeneracy (C.3).
Next, we show that satisfies the bound (C.4) with a function satisfying the moment condition (C.5). In fact, by Assumption K3 on the kernel and Assumption 1 on the target density ,
where and the constant is defined in (2.6). To verify satisfies (C.5), we write
(C.13) | ||||
Thanks to Proposition C.1, we have for any ,
(C.14) | ||||
where we have used the simple fact that for any in the last inequality. Plugging (C.14) into (C.13) yields that
(C.15) | ||||
Using the fact that for all , one has
(C.16) | ||||
Since by assumption and , we have
As a consequence of above and the fact that for any ,
(C.17) | ||||
Combining this with (C.15) implies that the moment bound assumption (C.5) holds with the constants
Therefore it follows from the definition of in (C.1) and the concentration bound (C.7) implied by Theorem C.1 that with probability at least ,
with the constant
Since by definition the constant for large , we have that the constant . This completes the proof.
Appendix D Summarizing Propositions 3.1 - 3.3
The theorem below summarizes Propositions 3.1 - 3.3 above, serving as one of the ingredients for proving Theorem 2.1.
Theorem D.1.
Let be a probability measure on and let be the empirical measure associated to the i.i.d. samples drawn from . Then we have the following:
1. If satisfies , then there exists a realization of the empirical measure such that , where the constant depends only on .
2. If satisfies Assumption K2 with constant , then there exists a realization of the empirical measure such that , where the constant depends only on .
3.
Appendix E Semi-discrete optimal transport with quadratic cost
E.1 Structure theorem of optimal transport map
We recall the structure theorem of optimal transport map between and under the assumption that does not give mass to null sets.
Theorem E.1 ([58, Theorem 2.9 and Theorem 2.12]).
Let and be two probability measures on with finite second moments. Assume that is absolutely continuous with respect to the Lebesgue measure. Consider the functionals and defined in Monge’s problem (4.1) and dual Kantorovich problem (4.2) with . Then
(i) there exists a unique solution to Kantorovich's problem, which is given by where -a.e. for some convex function . In other words, is the unique solution to Monge's problem.
(ii) there exists an optimal pair or solving the dual Kantorovich’s problem, i.e. ;
(iii) the function can be chosen as (or ) where (or ) is an optimal pair which maximizes within the set .
E.2 Proof of Theorem 4.2
Recall that the dual Kantorovich problem in the semi-discrete case reduces to maximizing the following functional
(E.1) |
The proof of Theorem 4.2 relies on two useful lemmas on the functional . The first lemma below shows that the functional is concave; its proof adapts that of [35, Theorem 2] for semi-discrete optimal transport with the quadratic cost.
Lemma E.1.
Let be a probability density on . Let and let be such that . Then the functional defined by (E.1) is concave.
Proof.
Let be an assignment function which assigns a point to the index of some point . Let us also define the function
Then by definition . Denote . Then
Since the function is affine in for every , it follows that is concave. ∎
The next lemma computes the gradient of the concave function ; see [35, Section 7.4] for the corresponding result with general transportation cost.
Lemma E.2.
Let be a probability density on . Let and let be such that . Denote by the power diagram associated to and . Then
(E.2) |
Proof.
By the definition of in (E.1), we rewrite as
To prove (E.2), it suffices to prove that
(E.3) |
Note that the partial derivative on the left-hand side of (E.3) makes sense since is convex with respect to on so that the resulting integral against the measure is also convex (and hence Lipschitz) in . To see (E.3), since is convex and piecewise linear in for any fixed x, it is easy to observe that
However, by subtracting on both sides of the equation inside the big parenthesis and then flipping the sign one sees that
Namely we have obtained that
In particular, this implies that is 1-Lipschitz in uniformly with respect to . Finally since is a probability measure, the desired identity (E.2) follows from the equation above and the dominated convergence theorem. This completes the proof of the lemma. ∎
With the lemmas above, we are ready to prove Theorem 4.2. In fact, according to Lemma E.1 and Lemma E.2, is a maximizer of the functional if and only if
Since the dual Kantorovich problem in the semi-discrete setting reduces to the problem of maximizing , it follows from Theorem E.1 that the optimal transport map solving the semi-discrete Monge’s problem (4.4) is given by where and . Consequently,
with . Moreover, noticing that can be rewritten as
one obtains that if .
E.3 Proof of Proposition 4.1
Let us first consider the case that for some . Then
Let us define maps and by setting
Then by definition it is straightforward that
(E.4) |
By defining
we can write the map as
(E.5) |
Moreover, thanks to the following simple equivalent formulation of the maximum function:
where
we can express the map in terms of a two-layer neural network as follows
(E.6) |
where and . Finally, by combining (E.4), (E.5) and (E.6), one sees that can be expressed in terms of a DNN of width and depth with parameters defined by
In the general case where , we set so that is the smallest integer such that . By redefining and for , we may still write so that the analysis above applies directly.
Appendix F Proof of Main Theorem 2.1
The proof follows directly from Theorem D.1 and Theorem 4.1. Indeed, on the one hand, the quantitative estimates for the convergence of the empirical measure translate directly into the sample complexity bounds in Theorem 2.1 for a given error . On the other hand, Theorem 4.1 provides a push-forward map from the source distribution to the empirical measure based on the gradient of a DNN.