
Rates of convergence for nonparametric estimation of singular distributions using generative adversarial networks

Minwoo Chae
Department of Industrial and Management Engineering
Pohang University of Science and Technology
Abstract

We consider generative adversarial networks (GAN) for estimating parameters in a deep generative model. The data-generating distribution is assumed to concentrate around some low-dimensional structure, making the target distribution singular with respect to the Lebesgue measure. Under this assumption, we obtain convergence rates of a GAN type estimator with respect to the Wasserstein metric. The convergence rate depends only on the noise level, intrinsic dimension and smoothness of the underlying structure. Furthermore, the rate is faster than that obtained by likelihood approaches, which provides insights into why GAN approaches perform better in many real problems. A lower bound for the minimax optimal rate is also investigated.

Keywords: Convergence rate, deep generative model, generative adversarial networks, nonparametric estimation, singular distribution, Wasserstein distance.

1 Introduction

Given $D$-dimensional observations ${\bf X}_{1},\ldots,{\bf X}_{n}$ following $P_{0}$, suppose that we are interested in inferring the underlying distribution $P_{0}$ or related quantities such as its density function or the manifold on which $P_{0}$ is supported. The inference of $P_{0}$ is fundamental in unsupervised learning problems, for which numerous inferential methods are available in the literature [25, 43, 9]. In this paper, we model ${\bf X}_{i}$ as ${\bf X}_{i}={\bf g}({\bf Z}_{i})+\bm{\epsilon}_{i}$ for some function ${\bf g}:\mathcal{Z}\to{\mathbb{R}}^{D}$. Here, ${\bf Z}_{i}$ is a latent variable following the known distribution $P_{Z}$ supported on $\mathcal{Z}\subset{\mathbb{R}}^{d}$, and $\bm{\epsilon}_{i}$ is an error vector following the normal distribution $\mathcal{N}({\bf 0}_{D},\sigma^{2}{\mathbb{I}}_{D})$, where ${\bf 0}_{D}$ and ${\mathbb{I}}_{D}$ denote the $D$-dimensional zero vector and identity matrix, respectively. The dimension $d$ of the latent variable is typically much smaller than $D$. The model is often called a (non-linear) factor model in the statistics literature [68, 34] and a generative model in the machine learning literature [22, 31]. Throughout the paper, we use the latter terminology. Accordingly, ${\bf g}$ will be referred to as a generator.
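
To fix ideas, the following minimal Python sketch draws data from the model ${\bf X}_{i}={\bf g}({\bf Z}_{i})+\bm{\epsilon}_{i}$; the particular generator, the dimensions and the noise level are illustrative choices, not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # A toy generator g: R^d -> R^D (here d = 2, D = 5) whose image is a
    # smooth 2-dimensional surface embedded in R^5; purely illustrative.
    z1, z2 = z[..., 0], z[..., 1]
    return np.stack([z1, z2, np.sin(2 * np.pi * z1),
                     np.cos(2 * np.pi * z2), z1 * z2], axis=-1)

def sample_data(n, d=2, D=5, sigma0=0.05):
    # Draw X_i = g(Z_i) + eps_i with Z_i ~ P_Z = Uniform[0,1]^d and
    # eps_i ~ N(0, sigma0^2 I_D), matching the model described above.
    Z = rng.uniform(0.0, 1.0, size=(n, d))
    eps = sigma0 * rng.standard_normal((n, D))
    return g(Z) + eps

X = sample_data(n=1000)
print(X.shape)  # (1000, 5): observations concentrated near a 2-dim. surface
```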

A fundamental issue in a generative model is to construct an estimator of ${\bf g}$ because inferences are mostly based on the estimation of the generator. Once we have an estimator $\hat{\bf g}$, for example, the distribution of $\hat{\bf g}({\bf Z}_{i})$ can serve as an estimator of $P_{0}$. While there are various nonparametric approaches for estimating $P_{0}$ [59, 24], the generative model approach does not provide a direct estimator due to an intractable integral. However, generative models are often more practical than direct estimation methods because it is easy to generate samples from the estimated distribution.

Recent advances in deep learning have brought great success to the generative model approach by modeling ${\bf g}$ with deep neural networks (DNN); we call the resulting model a deep generative model. Two learning approaches are popularly used in practice. The first is the likelihood approach; the variational autoencoder [31, 50] is perhaps the most well-known algorithm for estimating ${\bf g}$. The second is known as generative adversarial networks (GAN), originally developed by Goodfellow et al. [22] and generalized by several researchers. One line of extensions considers general integral probability metrics (IPM) as loss functions; Sobolev GAN [41], maximum mean discrepancy GAN [37] and Wasserstein GAN [3] are important examples. Another important direction of generalization is the development of novel architectures for generators and discriminators; deep convolutional GAN [49], progressive GAN [29] and style GAN [30] are successful architectures. In many real applications, GAN approaches tend to perform better than likelihood approaches, but training a GAN architecture is notorious for its difficulty. In particular, the estimator is very sensitive to the choice of the hyperparameters in the training algorithm.

In spite of the rapid development of GAN, its theoretical understanding remains largely incomplete. This paper studies the statistical properties of GAN from a nonparametric distribution estimation viewpoint. Specifically, we investigate convergence rates of a GAN type estimator under a structural assumption on the generator. Although GAN does not yield an explicit estimator for $P_{0}$, it is crucial to study the convergence rate of the estimator implicitly defined through $\hat{\bf g}$. A primary goal is to provide theoretical insights into why GAN performs well in many real-world applications. With regard to this goal, fundamental questions would be, "Which distributions can be efficiently estimated via GAN?" and "What is the main benefit of GAN compared to other methods for estimating these distributions?" The first question has recently been addressed by Chae et al. [10] to understand the benefit of deep generative models, although their results are limited to likelihood approaches. They considered a certain class of structured distributions and tried to explain how deep generative models can avoid the curse of dimensionality in nonparametric distribution estimation problems.

To set the scene and the notation, let $Q_{\bf g}$ be the distribution of ${\bf g}({\bf Z})$, where ${\bf Z}\sim P_{Z}$. In other words, $Q_{\bf g}$ is the pushforward measure of $P_{Z}$ under the map ${\bf g}$. Let $P_{{\bf g},\sigma}=Q_{\bf g}*\mathcal{N}({\bf 0}_{D},\sigma^{2}{\mathbb{I}}_{D})$, the convolution of $Q_{\bf g}$ and $\mathcal{N}({\bf 0}_{D},\sigma^{2}{\mathbb{I}}_{D})$. A fundamental assumption of the present paper is that there exist a true generator ${\bf g}_{0}$ and $\sigma_{0}\geq 0$ such that the ${\bf X}_{i}$'s are equal in distribution to ${\bf g}_{0}({\bf Z}_{i})+\bm{\epsilon}_{i}$, where $\bm{\epsilon}_{i}\sim\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ and $\bm{\epsilon}_{i}\perp\!\!\!\perp{\bf Z}_{i}$. With the above notation, this can be expressed as

P_{0}=P_{{\bf g}_{0},\sigma_{0}}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D}), (1.1)

where $Q_{0}=Q_{{\bf g}_{0}}$. Under this assumption, it would be more reasonable to set $Q_{0}$, rather than $P_{0}$, as the target distribution to be estimated. We further assume that $Q_{0}$ possesses a certain low-dimensional structure and that $\sigma_{0}\to 0$ as the sample size increases. That is, the data-generating distribution consists of the structured distribution $Q_{0}$ and small additional noise.

The above assumption has been investigated by Chae et al. [10], motivated by recent articles on structured distribution estimation [18, 17, 48, 1, 14]. Once the true generator ${\bf g}_{0}$ belongs to a class $\mathcal{G}_{0}$ possessing a low-dimensional structure that DNN can efficiently capture, deep generative models are highly appropriate for estimating $Q_{0}$. Chae et al. [10] considered a class of composite functions [26, 28] which have recently been studied in deep supervised learning [52, 6]. Some other structures have also been studied in the literature [27, 44, 12]. The corresponding class of distributions $\mathcal{Q}_{0}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}_{0}\}$ inherits the low-dimensional structures of $\mathcal{G}_{0}$. In particular, when $\mathcal{G}_{0}$ consists of composite functions, the corresponding class $\mathcal{Q}_{0}$ is sufficiently large to include various structured distributions such as product distributions, classical smooth distributions and distributions supported on submanifolds; see Section 4 of Chae et al. [10] for details.

The assumption on $\sigma_{0}$ is crucial for the efficient estimation of $Q_{0}$. Unless $\sigma_{0}$ is small enough, the minimax optimal rate is very slow, e.g. $1/\log n$. In the statistics literature, this problem is known as deconvolution [15, 40, 17, 45]. Mathematically, the assumption of small $\sigma_{0}$ can be expressed as $\sigma_{0}\to 0$ at a suitable rate.

Once we have an estimator $\hat{\bf g}$ for ${\bf g}_{0}$, $\hat{Q}=Q_{\hat{\bf g}}$ can serve as an estimator for $Q_{0}$. Under the assumption described above, we study convergence rates of a GAN type estimator $\hat{Q}$. Note that $Q_{0}$ is singular with respect to the Lebesgue measure on ${\mathbb{R}}^{D}$ because $d$ is smaller than $D$. Therefore, standard metrics between densities, such as the total variation and Hellinger distances, are not appropriate for evaluating the estimation performance. We instead consider the $L^{1}$-Wasserstein metric, which originated from the problem of optimal mass transportation and is frequently used in distribution estimation problems [62, 45, 66, 11].

When ${\bf g}_{0}$ possesses a composite structure with intrinsic dimension $t$ and smoothness $\beta$ (see Section 3 for the definition), Chae et al. [10] proved that a likelihood approach to deep generative models can achieve the rate $n^{-\beta/2(\beta+t)}+\sigma_{0}$ up to a logarithmic factor. Due to the singularity of the underlying distribution, perturbing the data with artificial noise plays a key role. That is, the rate is obtained by a sieve maximum likelihood estimator based on the perturbed data $\widetilde{\bf X}_{i}={\bf X}_{i}+\widetilde{\bm{\epsilon}}_{i}$, where $\widetilde{\bm{\epsilon}}_{i}\sim\mathcal{N}({\bf 0}_{D},\widetilde{\sigma}^{2}{\mathbb{I}}_{D})$ and $\widetilde{\sigma}$ is the degree of perturbation. Without suitable data perturbation, likelihood approaches can fail to estimate $Q_{0}$ consistently. Note that the rate depends on $\beta$ and $t$, but not on $D$ or $d$.

Interestingly, the GAN type estimator considered in this paper can achieve a strictly faster rate than that of the likelihood approach. Our main result (Theorem 3.2) guarantees that a GAN type estimator achieves the rate $n^{-\beta/(2\beta+t)}+\sigma_{0}$ under the above assumption. Although Chae et al. [10] obtained only an upper bound for the convergence rate of likelihood approaches, it is hard to expect that this rate can be improved by likelihood approaches, in view of classical nonparametric theory [8, 35, 36, 67]. In this sense, our results provide some insight into why GAN approaches often perform better than likelihood approaches in real data analysis.

In addition to the convergence rate of a GAN type estimator, we obtain a lower bound $n^{-\beta/(2\beta+t-2)}$ for the minimax convergence rate; see Theorem 4.1. When $\sigma_{0}$ is small enough, this lower bound is only slightly smaller than the convergence rate of the GAN type estimator.

It is worthwhile to mention the technical novelty of the present paper compared to the existing theory of GAN reviewed in Section 1.1. Firstly, most existing theories analyze GAN from a classical nonparametric density estimation viewpoint, rather than the distribution estimation viewpoint taken in this paper. Classical methods such as kernel density estimators and wavelets can also attain the minimax optimal convergence rate in that framework. Consequently, those results cannot explain why GAN outperforms classical approaches in density estimation problems.

Another notable difference lies in the discriminator architectures. While the discriminator architectures in the literature depend solely on the evaluation metric ($L^{1}$-Wasserstein in our case), in our approach the discriminator depends on the generator architecture as well. Although state-of-the-art GAN architectures such as progressive and style GANs are too complicated to be theoretically tractable, it is crucial for the success of these procedures that the discriminator architectures have structures quite similar to the generator architectures. In the proof of Theorem 3.2, we carefully construct the discriminator class using the generator class. In particular, the discriminator class is constructed so that its complexity, expressed through the metric entropy, is of the same order as that of the generator class. Consequently, the discriminator can be a much smaller class than the set of all functions with Lipschitz constant bounded by one. This reduction can significantly improve the rate; see the discussion after Theorem 3.1.

The construction of the discriminator class in the proof of Theorem 3.2 is artificial and serves only a theoretical purpose. In particular, the discriminator class is not a class of neural networks, and the computation of the considered estimator is intractable. In spite of this limitation, our theoretical results provide important insights into the success of GAN. Focusing on the Wasserstein GAN, note that many algorithms for Wasserstein GAN [3, 23] aim to find a minimizer, say $\hat{Q}^{W}$, of the Wasserstein distance from the empirical measure. However, even computing the Wasserstein distance between two simple distributions is very difficult; see Theorem 3 of Kuhn et al. [33]. In practice, a class of neural network functions is used as the discriminator class, and this has been understood as a technique for approximating $\hat{Q}^{W}$. However, our theory implies that $\hat{Q}^{W}$ might not be a decent estimator even when its exact computation is possible. Stanczuk et al. [57] empirically demonstrated this by showing that Wasserstein GAN does not approximate the Wasserstein distance.

Besides the Wasserstein distance, we also consider general integral probability metrics as evaluation metrics (Theorem 3.3). For example, $\alpha$-Hölder classes can be used to define the evaluation metric. Considering state-of-the-art architectures, neural network distances [4, 71, 5, 39] would also be natural choices. In that case, the corresponding GAN type estimator is much more natural than the one considered in the proof of Theorem 3.2.

The remainder of the paper is organized as follows. First, we review the literature on the theory of GAN and introduce some notation in the following subsections. Section 2 then provides the mathematical set-up, including a brief introduction to DNN and GAN. An upper bound for the convergence rate of a GAN type estimator and a lower bound for the minimax convergence rate are investigated in Sections 3 and 4, respectively. Concluding remarks follow in Section 5. All proofs are deferred to the Appendix.

1.1 Related statistical theory for GAN

The study of convergence rates in nonparametric generative models has been conducted in some earlier papers [34, 47] under the name of latent factor models. Rather than utilizing DNN, they considered a Bayesian approach with a Gaussian process prior on the generator function. Since the development of GAN [22], several researchers have studied rates of convergence in deep generative models, particularly focusing on GAN. To the best of our knowledge, an earlier version of Liang [38] is the first work studying the convergence rate within a GAN framework. A similar theory has been developed by Singh et al. [56], which was later generalized by Uppal et al. [60]. Slightly weaker results were obtained by Chen et al. [13] with explicit DNN architectures for the generator and discriminator classes. Convergence rates of the vanilla GAN with respect to the Jensen–Shannon divergence have recently been obtained by Belomestny et al. [7].

All the above works tried to understand GAN within a nonparametric density estimation framework. They used integral probability metrics as evaluation metrics, while classical approaches to nonparametric density estimation focused on other metrics such as the total variation, Hellinger and uniform metrics. Since the total variation can be viewed as an IPM, some results in the above papers are comparable with those of classical methods. In this case, both approaches achieve the same minimax optimal rate. Hence, the above results cannot explain why deep generative models outperform classical nonparametric methods. Schreuder et al. [54] considered generative models in which the target distribution does not possess a Lebesgue density. However, their result only guarantees that the convergence rate of GAN is not worse than that of the empirical measure [65]. We adopt the set-up of Chae et al. [10], who exclusively considered likelihood approaches.

1.2 Notations

The maximum and minimum of two real numbers $a$ and $b$ are denoted by $a\vee b$ and $a\wedge b$, respectively. For $1\leq p\leq\infty$, $|\cdot|_{p}$ denotes the $\ell^{p}$-norm. For a real-valued function $f$ and a probability measure $P$, let $Pf=\int f({\bf x})dP({\bf x})$. $\mathbb{E}$ denotes the expectation when the underlying probability measure is obvious. The equality $c=c(A_{1},\ldots,A_{k})$ means that $c$ depends only on $A_{1},\ldots,A_{k}$. Uppercase letters such as $P$ and $\hat{P}$ refer to the probability measures corresponding to the densities denoted by the lowercase letters $p$ and $\hat{p}$, respectively, and vice versa. The inequality $a\lesssim b$ means that $a$ is less than $b$ up to a constant multiple, where the constant is universal or at least contextually unimportant. Also, we write $a\asymp b$ if $a\lesssim b$ and $b\lesssim a$.

2 Generative adversarial networks

For a given class $\mathcal{F}$ of functions from ${\mathbb{R}}^{D}$ to ${\mathbb{R}}$, the $\mathcal{F}$-IPM [42] between two probability measures $P_{1}$ and $P_{2}$ is defined as

d_{\mathcal{F}}(P_{1},P_{2})=\sup_{f\in\mathcal{F}}|P_{1}f-P_{2}f|.

For example, if $\mathcal{F}=\mathcal{F}_{\rm Lip}$, the class of all functions $f:{\mathbb{R}}^{D}\to{\mathbb{R}}$ satisfying $|f({\bf x})-f({\bf y})|\leq|{\bf x}-{\bf y}|_{2}$ for all ${\bf x},{\bf y}\in{\mathbb{R}}^{D}$, then the corresponding IPM is the $L^{1}$-Wasserstein distance by the Kantorovich–Rubinstein duality theorem; see Theorem 1.14 of Villani [62]. Note that the $L^{p}$-Wasserstein distance (with respect to the Euclidean distance on ${\mathbb{R}}^{D}$) is defined as

W_{p}(P_{1},P_{2})=\left(\inf_{\pi}\int|{\bf x}-{\bf y}|_{2}^{p}\,d\pi({\bf x},{\bf y})\right)^{1/p},

where the infimum is taken over all couplings $\pi$ of $P_{1}$ and $P_{2}$.
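
As a concrete illustration of the $L^{1}$-Wasserstein distance (not part of the paper's results), the following Python sketch computes $W_{1}$ between two empirical measures with the same number of atoms; in that case the optimal coupling can be taken to be a permutation, so the infimum over couplings reduces to a linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w1_empirical(X, Y):
    # L^1-Wasserstein distance between two empirical measures with the same
    # number of atoms; the optimal coupling is a permutation, so the infimum
    # over couplings reduces to an assignment problem.
    C = cdist(X, Y, metric="euclidean")   # cost matrix |x_i - y_j|_2
    row, col = linear_sum_assignment(C)   # optimal one-to-one matching
    return C[row, col].mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
Y = rng.standard_normal((500, 3)) + 0.5  # a shifted copy of the same law
print(w1_empirical(X, Y))                # roughly |(0.5, 0.5, 0.5)|_2 = 0.87
```

For large or unequal sample sizes, dedicated optimal transport solvers are preferable, but the exact assignment above suffices to illustrate the definition.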

Let $\mathcal{G}$ be a class of functions from $\mathcal{Z}\subset{\mathbb{R}}^{d}$ to ${\mathbb{R}}^{D}$, and let $\mathcal{F}$ be a class of functions from ${\mathbb{R}}^{D}$ to ${\mathbb{R}}$. The two classes $\mathcal{G}$ and $\mathcal{F}$ are referred to as the generator and discriminator classes, respectively. Once $\mathcal{G}$ and $\mathcal{F}$ are given, a GAN type estimator is defined through a minimizer of $d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})$ over $\mathcal{G}$, where ${\mathbb{P}}_{n}$ is the empirical measure based on the $D$-dimensional observations ${\bf X}_{1},\ldots,{\bf X}_{n}$. More specifically, let $\hat{\bf g}\in\mathcal{G}$ be an estimator satisfying

d_{\mathcal{F}}(Q_{\hat{\bf g}},{\mathbb{P}}_{n})\leq\inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+\epsilon_{\rm opt}, (2.1)

and let $\hat{Q}=Q_{\hat{\bf g}}$. Here, $\epsilon_{\rm opt}\geq 0$ represents the optimization error. Although the vanilla GAN [22] is not of this form, the formulation (2.1) is general enough to include various GANs popularly used in practice [3, 37, 41]. At the population level, one may view (2.1) as a method for estimating the minimizer of the $\mathcal{F}$-IPM from the data-generating distribution.
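
In practice, (2.1) is approximated by alternating stochastic gradient steps on the generator and the discriminator. The PyTorch sketch below conveys the idea for an IPM-type (Wasserstein-style) GAN; the network sizes, optimizer, batch size and number of steps are arbitrary illustrative choices, and practical algorithms additionally constrain the Lipschitz constant of the discriminator, e.g. by weight clipping or a gradient penalty [3, 23].

```python
import torch
import torch.nn as nn

d, D = 2, 5  # latent and ambient dimensions (illustrative)
gen = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, D))
disc = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_f = torch.optim.Adam(disc.parameters(), lr=1e-3)

X = torch.randn(1000, D)  # placeholder for the observations X_1, ..., X_n

for step in range(200):
    x = X[torch.randint(0, X.shape[0], (256,))]  # minibatch from P_n
    z = torch.rand(256, d)                       # Z ~ P_Z = Uniform[0,1]^d
    # Inner step: move f toward the maximizer of Q_g f - P_n f over F.
    loss_f = -(disc(x).mean() - disc(gen(z).detach()).mean())
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    # Outer step: move g toward the minimizer of the estimated F-IPM.
    z = torch.rand(256, d)
    loss_g = disc(x).mean() - disc(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```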

In practice, both $\mathcal{G}$ and $\mathcal{F}$ are modelled as DNNs. To be specific, let $\rho(x)=x\vee 0$ be the ReLU activation function [21]. We focus on the ReLU in this paper, but other activation functions can also be used once a suitable approximation property holds [46]. For vectors ${\bf v}=(v_{1},\ldots,v_{r})$ and ${\bf z}=(z_{1},\ldots,z_{r})$, define $\rho_{\bf v}({\bf z})=(\rho(z_{1}-v_{1}),\ldots,\rho(z_{r}-v_{r}))$. For a nonnegative integer $L$ and ${\bf p}=(p_{0},\ldots,p_{L+1})\in{\mathbb{N}}^{L+2}$, a neural network function with network architecture $(L,{\bf p})$ is any function ${\bf f}:{\mathbb{R}}^{p_{0}}\to{\mathbb{R}}^{p_{L+1}}$ of the form

π³β†¦πŸβ€‹(𝐳)=WL​ρ𝐯L​WLβˆ’1​ρ𝐯Lβˆ’1​⋯​W1​ρ𝐯1​W0​𝐳,{\bf z}\mapsto{\bf f}({\bf z})=W_{L}\rho_{{\bf v}_{L}}W_{L-1}\rho_{{\bf v}_{L-1}}\cdots W_{1}\rho_{{\bf v}_{1}}W_{0}{\bf z}, (2.2)

where $W_{i}\in{\mathbb{R}}^{p_{i+1}\times p_{i}}$ and ${\bf v}_{i}\in{\mathbb{R}}^{p_{i}}$. Let $\mathcal{D}(L,{\bf p},s,F)$ be the collection of functions ${\bf f}$ of the form (2.2) satisfying

\max_{j=0,\ldots,L}|W_{j}|_{\infty}\vee|{\bf v}_{j}|_{\infty}\leq 1,\quad\sum_{j=1}^{L}\big(|W_{j}|_{0}+|{\bf v}_{j}|_{0}\big)\leq s\quad\text{and}\quad\|{\bf f}\|_{\infty}\leq F,

where $|W_{j}|_{\infty}$ and $|W_{j}|_{0}$ denote the maximum-entry norm and the number of nonzero elements of the matrix $W_{j}$, respectively, and $\|{\bf f}\|_{\infty}=\||{\bf f}({\bf z})|_{\infty}\|_{\infty}=\sup_{\bf z}|{\bf f}({\bf z})|_{\infty}$.
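
The following numpy sketch makes the class $\mathcal{D}(L,{\bf p},s,F)$ concrete: it evaluates a network of the form (2.2) and draws random parameters whose entries are clipped to $[-1,1]$ and sparsified, reporting the number of nonzero parameters entering the constraint $s$. The architecture and sparsity level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda x: np.maximum(x, 0.0)

def forward(Ws, vs, z, F=1.0):
    # Evaluate f(z) = W_L rho_{v_L} ... W_1 rho_{v_1} W_0 z as in (2.2);
    # the output is clipped to enforce the sup-norm bound ||f||_inf <= F.
    h = Ws[0] @ z
    for W, v in zip(Ws[1:], vs):  # vs = (v_1, ..., v_L)
        h = W @ relu(h - v)
    return np.clip(h, -F, F)

def random_network(p, keep=0.3):
    # Random parameters for architecture (L, p): entries clipped to [-1, 1],
    # with all but a fraction `keep` of the weight entries set to zero.
    L = len(p) - 2
    Ws = [np.clip(rng.standard_normal((p[j + 1], p[j])), -1, 1)
          * (rng.random((p[j + 1], p[j])) < keep) for j in range(L + 1)]
    vs = [np.clip(rng.standard_normal(p[j]), -1, 1) for j in range(1, L + 1)]
    # Sparsity as in the definition: sum over j = 1, ..., L of |W_j|_0 + |v_j|_0.
    s = sum(np.count_nonzero(W) for W in Ws[1:]) + sum(np.count_nonzero(v) for v in vs)
    return Ws, vs, s

p = (2, 16, 16, 5)  # (p_0, ..., p_{L+1}) with L = 2
Ws, vs, s = random_network(p)
print(forward(Ws, vs, np.array([0.3, 0.7])), "nonzero parameters:", s)
```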

When the generator class $\mathcal{G}$ consists of neural network functions, we call the corresponding class $\mathcal{Q}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}\}$ of distributions a deep generative model. In this sense, GAN can be viewed as a method for estimating the parameters in deep generative models. Likelihood approaches such as the variational autoencoder are other methods for inferring the model parameters. When a likelihood approach is taken into account, $\mathcal{P}=\{P_{{\bf g},\sigma}:{\bf g}\in\mathcal{G},\sigma\in[\sigma_{\min},\sigma_{\max}]\}$ is often called a deep generative model as well. Note that $P_{{\bf g},\sigma}$ always possesses a density regardless of whether $Q_{\bf g}$ is singular or not.

3 Convergence rate of GAN

Although the strict minimization of the map ${\bf g}\mapsto d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})$ is computationally intractable, several heuristic approaches are available to approximate the solution to (2.1). In this section, we investigate the convergence rate of $\hat{Q}=Q_{\hat{\bf g}}$ under the assumption that its computation is possible. To this end, we suppose that the data-generating distribution $P_{0}$ is of the form $P_{0}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ for some $Q_{0}$ and $\sigma_{0}\geq 0$; $Q_{0}$ will further be assumed to possess a low-dimensional structure. The goal is to find a sharp upper bound for $\mathbb{E}d_{\rm eval}(\hat{Q},Q_{0})$, where $d_{\rm eval}$ is the evaluation metric. In particular, we hope the rate to adapt to the structure of $Q_{0}$ and to be independent of $D$ and $d$. We consider an arbitrary evaluation metric $d_{\rm eval}$ for generality; the $L^{1}$-Wasserstein distance is of primary interest.

In the literature, the evaluation metric is often identified with $d_{\mathcal{F}}$. In this sense, when $d_{\rm eval}=W_{1}$, $\mathcal{F}_{\rm Lip}$ might be a natural candidate for the discriminator class. Indeed, the original motivation of the Wasserstein GAN is to find $\hat{Q}^{W}$, a minimizer of the map ${\bf g}\mapsto d_{\mathcal{F}_{\rm Lip}}(Q_{\bf g},{\mathbb{P}}_{n})$. Due to computational intractability, $\mathcal{F}_{\rm Lip}$ is replaced by a class $\mathcal{F}$ of neural network functions in practice. Although minimizing ${\bf g}\mapsto d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})$ is still challenging, several numerical algorithms can be used to approximate the solution. In the initial papers concerning Wasserstein GAN [3, 2], this replacement was regarded only as a technique for approximating $\hat{Q}^{W}$.

Theoretically, it is unclear whether $\hat{Q}^{W}$ is a decent estimator. If the generator class $\mathcal{G}$ is large enough, for example, $\hat{Q}^{W}$ would be arbitrarily close to the empirical measure. Consequently, the convergence rates of $\hat{Q}^{W}$ and ${\mathbb{P}}_{n}$ would be the same. Note that the convergence rate of the empirical measure with respect to the Wasserstein distance is well known. Specifically, it holds that [16]

\mathbb{E}W_{1}({\mathbb{P}}_{n},P_{0})\lesssim\begin{cases}n^{-1/2}&\text{if }D=1,\\ n^{-1/2}\log n&\text{if }D=2,\\ n^{-1/D}&\text{if }D>2.\end{cases} (3.1)

The rate becomes slower as $D$ increases, suffering from the curse of dimensionality. Although ${\mathbb{P}}_{n}$ adapts to a certain intrinsic dimension and achieves the minimax rate in some sense [65, 55], this does not guarantee that ${\mathbb{P}}_{n}$ is a decent estimator, particularly when the underlying distribution possesses some smooth structure. The convergence rate of the GAN type estimator obtained by Schreuder et al. [54] is nothing but the rate (3.1).

If the size of $\mathcal{G}$ is not too large, then $\hat{Q}^{W}$ may achieve a faster rate than ${\mathbb{P}}_{n}$ due to the regularization effect. However, studying the behavior of $\hat{Q}^{W}$, possibly depending on the complexity of $\mathcal{G}$, is quite tricky. Furthermore, Stanczuk et al. [57] empirically showed that $\hat{Q}^{W}$ performs poorly in a simple simulation. In particular, their experiments show that estimators constructed from practical algorithms can be fundamentally different from $\hat{Q}^{W}$.

From another viewpoint, it would not be desirable to study the convergence rate of $\hat{Q}^{W}$ because it does not take crucial features of state-of-the-art architectures into account. As mentioned in the introduction, the structures of the generator and discriminator architectures are quite similar in most successful GAN approaches. In particular, the complexities of the two architectures are closely related. On the other hand, for $\hat{Q}^{W}$, the corresponding discriminator class $\mathcal{F}_{\rm Lip}$ has no connection with the generator class. In this sense, $\hat{Q}^{W}$ cannot be viewed as a fundamental estimator possessing the essential properties of widely used GAN type estimators.

Nonetheless, $d_{\mathcal{F}}$ must be close to $d_{\rm eval}$ in some sense to guarantee a reasonable convergence rate, because $\mathcal{F}$ is the only way to take $d_{\rm eval}$ into account within the GAN approach (2.1). This is specified as condition (iv) of Theorem 3.1; $d_{\mathcal{F}}$ needs to be close to $d_{\rm eval}$ only on a relatively small class of distributions.

Theorem 3.1.

Suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ for some distribution $Q_{0}$ and $\sigma_{0}\geq 0$. For a given generator class $\mathcal{G}$ and discriminator class $\mathcal{F}$, suppose that an estimator $\hat{Q}=Q_{\hat{\bf g}}$ with $\hat{\bf g}\in\mathcal{G}$ satisfies

{\rm(i)} \inf_{{\bf g}\in\mathcal{G}}d_{\rm eval}(Q_{\bf g},Q_{0})\leq\epsilon_{1}
{\rm(ii)} d_{\mathcal{F}}(\hat{Q},{\mathbb{P}}_{n})\leq\inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+\epsilon_{2}
{\rm(iii)} \mathbb{E}d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})\leq\epsilon_{3}
{\rm(iv)} |d_{\rm eval}(Q_{1},Q_{2})-d_{\mathcal{F}}(Q_{1},Q_{2})|\leq\epsilon_{4}\quad\forall Q_{1},Q_{2}\in\mathcal{Q}\cup\{Q_{0}\}, (3.2)

where $\mathcal{Q}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}\}$ and $\epsilon_{j}\geq 0$. Then,

\mathbb{E}d_{\rm eval}(\hat{Q},Q_{0})\leq 2d_{\mathcal{F}}(P_{0},Q_{0})+5\epsilon_{1}+\epsilon_{2}+2\epsilon_{3}+2\epsilon_{4}.

The two quantities $\epsilon_{1}$ and $\epsilon_{3}$ are closely related to the complexities of $\mathcal{G}$ and $\mathcal{F}$, respectively. In particular, $\epsilon_{1}$ represents the error of approximating $Q_{0}$ by distributions of the form $Q_{\bf g}$ with ${\bf g}\in\mathcal{G}$ [69, 58, 46]; the larger the generator class $\mathcal{G}$ is, the smaller the approximation error is. Similarly, $\epsilon_{3}$ increases as the complexity of $\mathcal{F}$ increases. Techniques for bounding $\mathbb{E}d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})$ are well known in empirical process theory [61, 20]. The second error term $\epsilon_{2}$ is nothing but the optimization error. The fourth term $\epsilon_{4}$ is the deviation between the evaluation metric $d_{\rm eval}$ and the $\mathcal{F}$-IPM over $\mathcal{Q}\cup\{Q_{0}\}$, connecting $d_{\mathcal{F}}$ and $d_{\rm eval}$. Finally, the term $d_{\mathcal{F}}(P_{0},Q_{0})$ in the bound depends primarily on $\sigma_{0}$. One can easily prove that

d_{\mathcal{F}}(P_{0},Q_{0})\leq W_{1}(P_{0},Q_{0})\leq W_{2}(P_{0},Q_{0})\lesssim\sigma_{0}, (3.3)

provided that $\mathcal{F}\subset\mathcal{F}_{\rm Lip}$.

Ignoring the optimization error, suppose for a moment that $\mathcal{G}$ is given and we need to choose a suitable discriminator class to minimize $\epsilon_{3}+\epsilon_{4}$ in Theorem 3.1. We focus on the case $d_{\rm eval}=W_{1}$. One can easily make $\epsilon_{4}=0$ by taking $\mathcal{F}=\mathcal{F}_{\rm Lip}$. In this case, however, $\epsilon_{3}$ would be too large because $\mathbb{E}W_{1}({\mathbb{P}}_{n},P_{0})\asymp n^{-1/D}$. That is, $\mathcal{F}_{\rm Lip}$ is too large to be used as a discriminator class; $\mathcal{F}$ should be a much smaller class than $\mathcal{F}_{\rm Lip}$ to obtain a fast convergence rate. To achieve this goal, we construct $\mathcal{F}$ so that $\epsilon_{3}$ is small enough while $\epsilon_{4}$ remains small. For example, we may consider

\mathcal{F}=\big\{f_{Q_{1},Q_{2}}:Q_{1},Q_{2}\in\mathcal{Q}\cup\{Q_{0}\}\big\}, (3.4)

where $f_{Q_{1},Q_{2}}$ is an (approximate) maximizer of $|Q_{1}f-Q_{2}f|$ over $f\in\mathcal{F}_{\rm Lip}$. In this case, $\epsilon_{4}$ vanishes; hence the convergence rate of $\hat{Q}$ is determined solely by $\epsilon_{1}$, $\epsilon_{3}$ and $\sigma_{0}$. Furthermore, the complexity of $\mathcal{F}$ is roughly the same as that of $\mathcal{G}\times\mathcal{G}$. If the complexity of a function class is expressed through its metric entropy, the complexities of $\mathcal{G}$ and $\mathcal{F}$ are of the same order. The three quantities $\epsilon_{1}$, $\epsilon_{3}$ and $\sigma_{0}$ can roughly be interpreted as the approximation error, the estimation error and the noise level. While we cannot control $\sigma_{0}$, both the approximation and estimation errors depend on the complexity of $\mathcal{G}$, hence a suitable choice of $\mathcal{G}$ is important to achieve a fast convergence rate.
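
To convey the idea behind (3.4) — the theoretical construction itself is not a neural network — one could approximate $f_{Q_{1},Q_{2}}$ from samples of $Q_{1}$ and $Q_{2}$ by maximizing the mean difference over an approximately 1-Lipschitz network. The sketch below uses weight clipping, as in the original Wasserstein GAN [3], as a crude way of bounding the Lipschitz constant; all sizes and training choices are hypothetical.

```python
import torch
import torch.nn as nn

def approx_max_lipschitz(x1, x2, steps=300, clip=0.05):
    # Crude surrogate for f_{Q1,Q2}: maximize Q1 f - Q2 f over a small network,
    # keeping the Lipschitz constant bounded by clipping the weights.
    f = nn.Sequential(nn.Linear(x1.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(f.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = -(f(x1).mean() - f(x2).mean())  # maximize the mean difference
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            for w in f.parameters():
                w.clamp_(-clip, clip)
    return f
```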

To give a specific convergence rate, we consider the class of structured distributions considered by Chae et al. [10], for which deep generative models have benefits. For positive numbers $\beta$ and $K$, let $\mathcal{H}^{\beta}_{K}(A)$ be the class of all functions from $A$ to ${\mathbb{R}}$ with $\beta$-Hölder norm bounded by $K$ [61, 20]. We consider the composite structure with low-dimensional smooth component functions described in Section 3 of Schmidt-Hieber [52]. Specifically, we consider a function ${\bf g}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{D}$ of the form

{\bf g}={\bf h}_{q}\circ{\bf h}_{q-1}\circ\cdots\circ{\bf h}_{1}\circ{\bf h}_{0} (3.5)

with ${\bf h}_{i}:(a_{i},b_{i})^{d_{i}}\to(a_{i+1},b_{i+1})^{d_{i+1}}$. Here, $d_{0}=d$ and $d_{q+1}=D$. Denote by ${\bf h}_{i}=(h_{i1},\ldots,h_{id_{i+1}})$ the components of ${\bf h}_{i}$, and let $t_{i}$ be the maximal number of variables on which each of the $h_{ij}$ depends. Let $\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$ be the collection of functions of the form (3.5) satisfying $h_{ij}\in\mathcal{H}^{\beta_{i}}_{K}\big((a_{i},b_{i})^{t_{i}}\big)$ and $|a_{i}|\vee|b_{i}|\leq K$, where ${\bf d}=(d_{0},\ldots,d_{q+1})$, ${\bf t}=(t_{0},\ldots,t_{q})$ and $\bm{\beta}=(\beta_{0},\ldots,\beta_{q})$. Let

\widetilde{\beta}_{j}=\beta_{j}\prod_{l=j+1}^{q}(\beta_{l}\wedge 1),\quad j_{*}=\operatorname*{argmax}_{j\in\{0,\ldots,q\}}\frac{t_{j}}{\widetilde{\beta}_{j}},\quad\beta_{*}=\widetilde{\beta}_{j_{*}}\quad\text{and}\quad t_{*}=t_{j_{*}}.

We call $t_{*}$ and $\beta_{*}$ the intrinsic dimension and smoothness of ${\bf g}$ (or of the class $\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$), respectively. The class $\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$ has been extensively studied in recent articles on deep supervised learning to demonstrate the benefit of DNN in nonparametric function estimation [52, 6].
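
As a small worked example, the following Python function computes $(t_{*},\beta_{*})$ from $\bm{\beta}$ and ${\bf t}$ using the definitions above; the numbers in the example are arbitrary.

```python
import numpy as np

def intrinsic_dim_and_smoothness(beta, t):
    # beta = (beta_0, ..., beta_q), t = (t_0, ..., t_q) as in the definitions above.
    beta, t = np.asarray(beta, float), np.asarray(t, float)
    q = len(beta) - 1
    beta_tilde = np.array([beta[j] * np.prod(np.minimum(beta[j + 1:], 1.0))
                           for j in range(q + 1)])
    j_star = int(np.argmax(t / beta_tilde))
    return t[j_star], beta_tilde[j_star]

# Example with q = 1: h_0 is 3-smooth in t_0 = 2 variables and h_1 is 2-smooth in
# t_1 = 4 variables; then beta_tilde = (3, 2), j_* = 1, so (t_*, beta_*) = (4, 2)
# and the rate in Theorem 3.2 becomes n^{-2/8} = n^{-1/4} up to log factors.
print(intrinsic_dim_and_smoothness(beta=[3.0, 2.0], t=[2, 4]))
```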

Let

\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)=\big\{Q_{\bf g}:{\bf g}\in\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)\big\}.

The quantities $(q,{\bf d},{\bf t},\bm{\beta},K)$ are constants independent of $n$. In the forthcoming Theorem 3.2, we obtain a Wasserstein convergence rate for a GAN type estimator $\hat{Q}$ under the assumption that $Q_{0}\in\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$.

Theorem 3.2.

Suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$, where $\sigma_{0}\leq 1$ and $Q_{0}=Q_{{\bf g}_{0}}$ for some ${\bf g}_{0}\in\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$. Then, there exist a generator class $\mathcal{G}=\mathcal{D}(L,{\bf p},s,K\vee 1)$ and a discriminator class $\mathcal{F}\subset\mathcal{F}_{\rm Lip}$ such that for an estimator $\hat{Q}$ satisfying (2.1),

\sup_{Q_{0}\in\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)}\mathbb{E}W_{1}(\hat{Q},Q_{0})\leq C\bigg\{n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}+\sigma_{0}+\epsilon_{\rm opt}\bigg\}, (3.6)

where $C=C(q,{\bf d},{\bf t},\bm{\beta},K)$.

In Theorem 3.2, the network parameters $(L,{\bf p},s)$ of $\mathcal{G}$ depend on the sample size $n$. More specifically, it can be deduced from the proof that one can choose $L\lesssim\log n$, $|{\bf p}|_{\infty}\lesssim n^{t_{*}/(2\beta_{*}+t_{*})}\log n$ and $s\lesssim n^{t_{*}/(2\beta_{*}+t_{*})}\log n$. As illustrated above, the discriminator class $\mathcal{F}$ is carefully constructed using $\mathcal{G}$.

Ignoring the optimization error $\epsilon_{\rm opt}$, the rate (3.6) consists of two terms, $\sigma_{0}$ and $n^{-\beta_{*}/(2\beta_{*}+t_{*})}$ up to a logarithmic factor. If $\sigma_{0}\lesssim n^{-\beta_{*}/(2\beta_{*}+t_{*})}$, it can be absorbed into the polynomial term; hence when $\sigma_{0}$ is small enough, $\hat{Q}$ achieves the rate $n^{-\beta_{*}/(2\beta_{*}+t_{*})}$. Note that this rate appears in many nonparametric smooth function estimation problems.

The dependence on $\sigma_{0}$ comes from the term $d_{\mathcal{F}}(P_{0},Q_{0})$ in Theorem 3.1 and the inequality (3.3). Note that (3.3) holds because $\mathcal{F}$ is a subset of $\mathcal{F}_{\rm Lip}$. We do not know whether the term $\sigma_{0}$ in (3.6) can be improved in general. If we consider another evaluation metric, however, it is possible to improve this term. For example, if $\mathcal{F}$ consists of twice continuously differentiable functions, it would be possible to prove $d_{\mathcal{F}}(P_{0},Q_{0})\lesssim\sigma_{0}^{2}$. This is because, for a twice continuously differentiable $f$,

|P_{0}f-Q_{0}f|=\big|\mathbb{E}[f({\bf Y}+\bm{\epsilon})-f({\bf Y})]\big|\approx\Big|\mathbb{E}\Big[\bm{\epsilon}^{T}\nabla f({\bf Y})+\frac{1}{2}\bm{\epsilon}^{T}\nabla^{2}f({\bf Y})\bm{\epsilon}\Big]\Big|\asymp\sigma_{0}^{2}, (3.7)

where ${\bf Y}\sim Q_{0}$, $\bm{\epsilon}\sim\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ and ${\bf Y}\perp\!\!\!\perp\bm{\epsilon}$.
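
The second-order heuristic (3.7) is easy to check numerically. The snippet below uses a toy choice of $Q_{0}$ (uniform on a cube) and a smooth test function $f$; the ratio of the gap to $\sigma_{0}^{2}$ stays roughly constant as $\sigma_{0}$ decreases, consistent with the $\sigma_{0}^{2}$ scaling. This is only a sanity check under assumed toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sum(np.cos(x), axis=-1)   # a smooth (C^2) test function

n, D = 200_000, 3
Y = rng.uniform(-1.0, 1.0, size=(n, D))    # Y ~ Q_0 (toy choice)
for sigma0 in [0.2, 0.1, 0.05]:
    eps = sigma0 * rng.standard_normal((n, D))
    gap = abs(np.mean(f(Y + eps) - f(Y)))  # Monte Carlo estimate of |P_0 f - Q_0 f|
    # For this f, E[f(Y+eps) - f(Y)] is about -(sigma0^2 / 2) * D * E[cos(Y_1)],
    # so the printed ratio should be close to D * sin(1) / 2 for every sigma0.
    print(sigma0, gap / sigma0**2)
```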

Note that the sieve MLE considered by Chae et al. [10] achieves the rate $n^{-\beta_{*}/2(\beta_{*}+t_{*})}+\sigma_{0}$ under a slightly stronger assumption than that of Theorem 3.2. Hence, for a moderately small $\sigma_{0}$, the convergence rate of the GAN type estimator is strictly better than that of the sieve MLE. This result provides some insight into why GAN approaches perform better than likelihood approaches in many real applications. In particular, if GAN performs significantly better than likelihood approaches, it might be reasonable to infer that the noise level of the data is not too large. On the other hand, if the noise level is larger than a certain threshold, $P_{0}$ is no longer nearly singular. In this case, likelihood approaches would be preferable to the computationally much more demanding GAN.

In practice, the noise level $\sigma_{0}$ is unknown, so one may first try likelihood approaches with different levels of perturbation. As empirically demonstrated in Chae et al. [10], data perturbation significantly improves the quality of generated samples provided that $\sigma_{0}$ is small enough. Therefore, if data perturbation improves the performance of likelihood approaches, one may next try GAN to obtain a better estimator.

So far, we have focused on the case $d_{\rm eval}=W_{1}$. Note that the discriminator class considered in the proof of Theorem 3.2 is of the form (3.4), which is far from practical. In particular, it is unclear whether it is possible to achieve the rate (3.6) with neural network discriminators. If we consider a different evaluation metric, however, one can easily obtain a convergence rate using a neural network discriminator. When $\mathcal{F}_{0}$ consists of neural networks, the $\mathcal{F}_{0}$-IPM is often called a neural network distance. It is well known that, under mild assumptions, convergence of probability measures in a neural network distance guarantees weak convergence [71]. Therefore, neural network distances are also good candidates for evaluation metrics. In Theorem 3.3, more general integral probability metrics are taken into account.

Theorem 3.3.

Suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=P_{{\bf g}_{0},\sigma_{0}}$ for some ${\bf g}_{0}\in\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$ and $\sigma_{0}\leq 1$. Let $\mathcal{F}_{0}$ be a class of Lipschitz continuous functions from ${\mathbb{R}}^{D}$ to ${\mathbb{R}}$ with Lipschitz constants bounded by a constant $C_{1}>0$. Then, there exist a generator class $\mathcal{G}=\mathcal{D}(L,{\bf p},s,K\vee 1)$ and a discriminator class $\mathcal{F}$ such that $\hat{Q}$ defined as in (2.1) satisfies

\sup_{Q_{0}\in\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)}\mathbb{E}d_{\mathcal{F}_{0}}(\hat{Q},Q_{0})\leq C_{2}\bigg[\sigma_{0}+\epsilon_{\rm opt}+\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\wedge\Big\{n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}\Big\}\bigg], (3.8)

where $C_{2}=C_{2}(q,{\bf d},{\bf t},\bm{\beta},K,C_{1})$. In particular, one can identify the discriminator class $\mathcal{F}$ with $\mathcal{F}_{0}$ provided that

\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}. (3.9)

The proof of Theorem 3.3 is divided into two cases. First, if the complexity of $\mathcal{F}_{0}$ is small enough in the sense of (3.9), one can ignore the approximation error ($\epsilon_{1}$ in Theorem 3.1) by taking an arbitrarily large $\mathcal{G}$. Also, $\mathcal{F}=\mathcal{F}_{0}$ leads to $\epsilon_{4}=0$. Hence, the rate is determined by $\sigma_{0}$, $\epsilon_{\rm opt}$ and $\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})$. On the other hand, if (3.9) does not hold, we construct the discriminator as in (3.4). This leads to the same convergence rate as in Theorem 3.2.

If $\mathcal{F}_{0}$ is a subset of $\mathcal{D}(L_{0},{\bf p}_{0},s_{0},\infty)$, then it is not difficult to see that $\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim\sqrt{s_{0}/n}$ up to a logarithmic factor. This can be proved using well-known empirical process theory and the metric entropy of deep neural networks; see Lemma 5 of Schmidt-Hieber [52]. Note that functions in $\mathcal{F}_{0}$ are required to be Lipschitz continuous, and there are several regularization techniques for bounding the Lipschitz constants of DNN [2, 51].

Another important class of metrics is the Hölder IPM. When $\mathcal{F}_{0}=\mathcal{H}^{\alpha}_{1}([-K,K]^{D})$ for some $\alpha>0$, Schreuder [53] has shown that

\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim\begin{cases}n^{-\alpha/D}&\text{if }\alpha<D/2,\\ n^{-1/2}\log n&\text{if }\alpha=D/2,\\ n^{-1/2}&\text{if }\alpha>D/2.\end{cases}

Furthermore, for $\alpha>2$, the term $\sigma_{0}$ in (3.8) can be replaced by $\sigma_{0}^{2}$ using (3.7).

4 Lower bound of the minimax risk

In this section, we study a lower bound for the minimax convergence rate, focusing on the case $d_{\rm eval}=W_{1}$. As in the previous section, suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=P_{{\bf g}_{0},\sigma_{0}}$. We investigate the minimax rate over the class $\mathcal{Q}_{0}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}_{0}\}$, where $\mathcal{G}_{0}=\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$. For simplicity, we consider the case $q=0$, $d_{0}=t_{0}=d$ and $\beta_{0}=\beta$. In this case, we have $\mathcal{G}_{0}=\mathcal{H}_{K}^{\beta}([0,1]^{d})\times\cdots\times\mathcal{H}_{K}^{\beta}([0,1]^{d})$. An extension to general $q$ would be slightly more complicated but not difficult. We also assume that $P_{Z}$ is the uniform distribution on $[0,1]^{d}$.

For given $\mathcal{G}_{0}$ and $\sigma_{0}\geq 0$, the minimax risk is defined as

\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})=\inf_{\hat{Q}}\sup_{{\bf g}_{0}\in\mathcal{G}_{0}}\mathbb{E}W_{1}(\hat{Q},Q_{0}),

where the infimum ranges over all possible estimators. Several techniques are available for obtaining a lower bound on $\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})$ [59, 64]. We use Fano's method to prove the following theorem.
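
For completeness, the standard form of the Fano reduction we have in mind (cf. Theorem 2.5 of Tsybakov [59]) is recorded below, stated for the $W_{1}$ loss; the lower bound (4.1) then follows by constructing sufficiently many well-separated generators in $\mathcal{G}_{0}$. This is the textbook statement, not an excerpt from the proof in the Appendix.

```latex
% Standard Fano-type reduction (cf. Theorem 2.5 of Tsybakov [59]), W_1 loss.
Suppose $Q^{(0)},\ldots,Q^{(M)}\in\mathcal{Q}_{0}$ ($M\geq 2$) satisfy
$W_{1}(Q^{(j)},Q^{(k)})\geq 2s$ for all $j\neq k$, and let $P^{(j)}$ denote the
joint law of $({\bf X}_{1},\ldots,{\bf X}_{n})$ when $Q_{0}=Q^{(j)}$. If
\[
  \frac{1}{M}\sum_{j=1}^{M}\mathrm{KL}\big(P^{(j)},P^{(0)}\big)\leq\alpha\log M
  \qquad\text{for some } 0<\alpha<1/8,
\]
then
\[
  \inf_{\hat{Q}}\max_{0\leq j\leq M}
  P^{(j)}\big(W_{1}(\hat{Q},Q^{(j)})\geq s\big)
  \geq\frac{\sqrt{M}}{1+\sqrt{M}}
  \Big(1-2\alpha-\sqrt{\tfrac{2\alpha}{\log M}}\Big)>0,
\]
so that, by Markov's inequality, $\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\gtrsim s$.
```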

Theorem 4.1.

Suppose that $d\leq D$, $\beta>0$ and $\sigma_{0}\geq 0$. Let $P_{Z}$ be the uniform distribution on $[0,1]^{d}$ and $\mathcal{G}_{0}=\mathcal{H}_{K}^{\beta}([0,1]^{d})\times\cdots\times\mathcal{H}_{K}^{\beta}([0,1]^{d})$. If $K$ is large enough (depending on $\beta$ and $d$), the minimax risk satisfies

\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\geq Cn^{-\frac{\beta}{2\beta+d-2}} (4.1)

for some constant $C>0$.

Note that the lower bound (4.1) does not depend on $\sigma_{0}$. With a direct application of Le Cam's method, one can easily show that $\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\gtrsim\sigma_{0}/\sqrt{n}$. As discussed in the previous section, it would not be easy to obtain a sharp rate with respect to $\sigma_{0}$. Since we are more interested in the case of small $\sigma_{0}$ (i.e. the nearly singular case), our discussion focuses on the regime where $\sigma_{0}$ is negligible.

Note that the lower bound (4.1) is only slightly smaller than the rate $n^{-\beta/(2\beta+d)}$, the first term on the right hand side of (3.6). Hence, the convergence rate of the GAN type estimator is at least very close to the minimax optimal rate.

Regarding the gap between the upper and lower bounds, we conjecture that the lower bound is sharp and cannot be improved. This conjecture is based on results in Uppal et al. [60] and Liang [38]. They considered GAN for nonparametric density estimation, hence $D=d$ in their framework. For example, Theorem 4 of Liang [38] guarantees that, for $D=d\geq 2$ and $\sigma_{0}=0$,

\inf_{\hat{Q}}\sup_{Q_{0}\in\widetilde{\mathcal{Q}}_{0}}\mathbb{E}W_{1}(\hat{Q},Q_{0})\asymp n^{-\frac{\beta^{\prime}+1}{2\beta^{\prime}+d}}, (4.2)

where $\widetilde{\mathcal{Q}}_{0}=\{Q:q\in\mathcal{H}^{\beta^{\prime}}_{1}([0,1]^{D})\}$. (More precisely, he considered Sobolev classes instead of Hölder classes.) Interestingly, there is a close connection between the density model $\widetilde{\mathcal{Q}}_{0}$ in the literature and the generative model $\mathcal{Q}_{0}$ considered in our paper. This connection is based on the profound regularity theory of optimal transport, often phrased in terms of the Brenier map. Roughly speaking, for a $\beta^{\prime}$-Hölder density $q$, there exists a $(\beta^{\prime}+1)$-Hölder function ${\bf g}$ such that $Q=Q_{\bf g}$; see Theorem 12.50 of Villani [63] for a precise statement. We also refer to Lemma 4.1 of Chae et al. [10] for a concise statement. In this sense, the density model $\widetilde{\mathcal{Q}}_{0}$ matches the generative model $\mathcal{Q}_{0}$ when $\beta=\beta^{\prime}+1$. In this case, the two rates (4.1) and (4.2) coincide, and this is why we conjecture that the lower bound (4.1) cannot be improved. Unfortunately, the proof techniques in Uppal et al. [60] and Liang [38] for both the upper and lower bounds do not generalize to our case because $Q_{0}$ does not possess a Lebesgue density.

5 Conclusion

Under a structural assumption on the generator, we have investigated the convergence rate of a GAN type estimator and a lower bound for the minimax optimal rate. In particular, the rate is faster than that obtained by likelihood approaches, providing some insight into why GAN outperforms likelihood approaches. This result would be an important stepping stone toward a more advanced theory that can take into account fundamental properties of state-of-the-art GANs. We conclude the paper with some possible directions for future work.

Firstly, it would be worthwhile to reduce the gap between the upper and lower bounds of the convergence rate obtained in this paper. As discussed in Section 4, it will be crucial to construct an estimator that achieves the lower bound in Theorem 4.1; in particular, we wonder whether a GAN type estimator can do this. Next, when $d_{\rm eval}=W_{1}$, an important question is whether it is possible to choose $\mathcal{F}$ as a class of neural network functions. Perhaps we cannot obtain the rate in Theorem 3.2 in this way because a large network would be necessary to approximate an arbitrary Lipschitz function. Finally, based on the approximation properties of convolutional neural network (CNN) architectures [32, 70], studying the benefit of CNN-based GAN would be an intriguing problem.

References

  • Aamari and Levrard, [2019] Aamari, E. and Levrard, C. (2019). Nonasymptotic rates for manifold, tangent space and curvature estimation. Ann. Statist., 47(1):177–204.
  • Arjovsky and Bottou, [2017] Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In Proc. International Conference on Learning Representations, pages 1–17.
  • Arjovsky et al., [2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proc. International Conference on Machine Learning, pages 214–223.
  • Arora et al., [2017] Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proc. International Conference on Machine Learning, pages 224–232.
  • Bai et al., [2019] Bai, Y., Ma, T., and Risteski, A. (2019). Approximability of discriminators implies diversity in GANs. In Proc. International Conference on Learning Representations, pages 1–10.
  • Bauer and Kohler, [2019] Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285.
  • Belomestny et al., [2021] Belomestny, D., Moulines, E., Naumov, A., Puchkin, N., and Samsonov, S. (2021). Rates of convergence for density estimation with GANs. ArXiv:2102.00199.
  • Birgé, [1983] Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65(2):181–237.
  • Bishop, [2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New York.
  • Chae et al., [2021] Chae, M., Kim, D., Kim, Y., and Lin, L. (2021). A likelihood approach to nonparametric estimation of a singular distribution using deep generative models. ArXiv:2105.04046.
  • Chae and Walker, [2019] Chae, M. and Walker, S. G. (2019). Bayesian consistency for a nonparametric stationary Markov model. Bernoulli, 25(2):877–901.
  • Chen et al., [2019] Chen, M., Jiang, H., Liao, W., and Zhao, T. (2019). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In Proc. Neural Information Processing Systems, pages 8174–8184.
  • Chen et al., [2020] Chen, M., Liao, W., Zha, H., and Zhao, T. (2020). Statistical guarantees of generative adversarial networks for distribution estimation. ArXiv:2002.03938.
  • Divol, [2020] Divol, V. (2020). Minimax adaptive estimation in manifold inference. ArXiv:2001.04896.
  • Fan, [1991] Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist., 19(3):1257–1272.
  • Fournier and Guillin, [2015] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738.
  • [17] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2012a). Manifold estimation and singular deconvolution under Hausdorff loss. Ann. Statist., 40(2):941–963.
  • [18] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2012b). Minimax manifold estimation. J. Mach. Learn. Res., 13(1):1263–1291.
  • Ghosal and van der Vaart, [2017] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
  • Giné and Nickl, [2016] Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press.
  • Glorot et al., [2011] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proc. International Conference on Artificial Intelligence and Statistics, pages 315–323.
  • Goodfellow et al., [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proc. Neural Information Processing Systems, pages 2672–2680.
  • Gulrajani et al., [2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Proc. Neural Information Processing Systems, pages 5767–5777.
  • Györfi et al., [2006] Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
  • Hastie et al., [2009] Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
  • Horowitz and Mammen, [2007] Horowitz, J. L. and Mammen, E. (2007). Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. Ann. Statist., 35(6):2589–2619.
  • Imaizumi and Fukumizu, [2019] Imaizumi, M. and Fukumizu, K. (2019). Deep neural networks learn non-smooth functions effectively. In Proc. International Conference on Artificial Intelligence and Statistics, pages 869–878.
  • Juditsky et al., [2009] Juditsky, A. B., Lepski, O. V., and Tsybakov, A. B. (2009). Nonparametric estimation of composite functions. Ann. Statist., 37(3):1360–1404.
  • Karras et al., [2018] Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation. In Proc. International Conference on Learning Representations, pages 1–26.
  • Karras et al., [2019] Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proc. Conference on Computer Vision and Pattern Recognition, pages 4401–4410.
  • Kingma and Welling, [2014] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proc. International Conference on Learning Representations, pages 1–14.
  • Kohler et al., [2020] Kohler, M., Krzyzak, A., and Walter, B. (2020). On the rate of convergence of image classifiers based on convolutional neural networks. ArXiv:2003.01526.
  • Kuhn et al., [2019] Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Proc. Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS.
  • Kundu and Dunson, [2014] Kundu, S. and Dunson, D. B. (2014). Latent factor models for density estimation. Biometrika, 101(3):641–654.
  • Le Cam, [1973] Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist., 1(1):38–53.
  • Le Cam, [1986] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer.
  • Li et al., [2017] Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In Proc. Neural Information Processing Systems, pages 2203–2213.
  • Liang, [2021] Liang, T. (2021). How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41.
  • Liu et al., [2017] Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and convergence properties of generative adversarial learning. In Proc. Neural Information Processing Systems, pages 5545–5553.
  • Meister, [2009] Meister, A. (2009). Deconvolution Problems in Nonparametric Statistics. Springer, New York.
  • Mroueh etΒ al., [2017] Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. (2017). Sobolev GAN. ArXiv:1711.04894.
  • MΓΌller, [1997] MΓΌller, A. (1997). Integral probability metrics and their generating classes of functions. Adv. in Appl. Probab., 29(2):429–443.
  • Murphy, [2012] Murphy, K.Β P. (2012). Machine Learning: A Probabilistic Perspective. MIT press.
  • Nakada and Imaizumi, [2020] Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. J. Mach. Learn. Res., 21(174):1–38.
  • Nguyen, [2013] Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist., 41(1):370–400.
  • Ohn and Kim, [2019] Ohn, I. and Kim, Y. (2019). Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627.
  • Pati etΒ al., [2011] Pati, D., Bhattacharya, A., and Dunson, D.Β B. (2011). Posterior convergence rates in non-linear latent variable models. ArXiv:1109.5000.
  • Puchkin and Spokoiny, [2019] Puchkin, N. and Spokoiny, V. (2019). Structure-adaptive manifold estimation. ArXiv:1906.05014.
  • Radford etΒ al., [2016] Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. International Conference on Learning Representations, pages 1–16.
  • Rezende etΒ al., [2014] Rezende, D.Β J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proc. International Conference on Machine Learning, pages 1278–1286.
  • Scaman and Virmaux, [2018] Scaman, K. and Virmaux, A. (2018). Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Proc. Neural Information Processing Systems, volumeΒ 31, pages 1–10.
  • Schmidt-Hieber, [2020] Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. Ann. Statist., 48(4):1875–1897.
  • Schreuder, [2021] Schreuder, N. (2021). Bounding the expectation of the supremum of empirical processes indexed by HΓΆlder classes. Math. Methods Statist., 29:76–86.
  • Schreuder etΒ al., [2021] Schreuder, N., Brunel, V.-E., and Dalalyan, A. (2021). Statistical guarantees for generative models without domination. In Proc. Algorithmic Learning Theory, pages 1051–1071. PMLR.
  • Singh and Póczos, [2018] Singh, S. and Póczos, B. (2018). Minimax distribution estimation in Wasserstein distance. ArXiv:1802.08855.
  • Singh et al., [2018] Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., and Póczos, B. (2018). Nonparametric density estimation with adversarial losses. In Proc. Neural Information Processing Systems, pages 10246–10257.
  • Stanczuk et al., [2021] Stanczuk, J., Etmann, C., Kreusser, L. M., and Schönlieb, C.-B. (2021). Wasserstein GANs work because they fail (to approximate the Wasserstein distance). ArXiv:2103.01678.
  • Telgarsky, [2016] Telgarsky, M. (2016). Benefits of depth in neural networks. In Proc. Conference on Learning Theory, pages 1517–1539.
  • Tsybakov, [2008] Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer, New York.
  • Uppal et al., [2019] Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation and convergence of GANs under Besov IPM losses. In Proc. Neural Information Processing Systems, pages 9089–9100.
  • van der Vaart and Wellner, [1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
  • Villani, [2003] Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.
  • Villani, [2008] Villani, C. (2008). Optimal Transport: Old and New. Springer.
  • Wainwright, [2019] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
  • Weed and Bach, [2019] Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648.
  • Wei and Nguyen, [2022] Wei, Y. and Nguyen, X. (2022). Convergence of de Finetti's mixing measure in latent structure models for observed exchangeable sequences. To appear in Ann. Statist.
  • Wong and Shen, [1995] Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Statist., 23(2):339–362.
  • Yalcin and Amemiya, [2001] Yalcin, I. and Amemiya, Y. (2001). Nonlinear factor analysis as a statistical method. Statist. Sci., 16(3):275–294.
  • Yarotsky, [2017] Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114.
  • Yarotsky, [2021] Yarotsky, D. (2021). Universal approximations of invariant maps by neural networks. Constr. Approx., pages 1–68.
  • Zhang et al., [2018] Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2018). On the discrimination-generalization tradeoff in GANs. In Proc. International Conference on Learning Representations, pages 1–26.

Appendix A Proofs

A.1 Proof of Theorem 3.1

Choose ${\bf g}_{*}\in\mathcal{G}$ such that

\[
d_{\rm eval}(Q_{*},Q_{0})\leq\inf_{{\bf g}\in\mathcal{G}}d_{\rm eval}(Q_{\bf g},Q_{0})+\epsilon_{1}\stackrel{\rm(i)}{\leq}2\epsilon_{1},
\tag{A.1}
\]

where $Q_{*}=Q_{{\bf g}_{*}}$. Then,

\begin{align*}
d_{\rm eval}(\hat{Q},Q_{0})
&\leq d_{\rm eval}(\hat{Q},Q_{*})+d_{\rm eval}(Q_{*},Q_{0})
\stackrel{\rm(A.1)}{\leq} d_{\rm eval}(\hat{Q},Q_{*})+2\epsilon_{1}\\
&\stackrel{\rm(iv)}{\leq} d_{\mathcal{F}}(\hat{Q},Q_{*})+2\epsilon_{1}+\epsilon_{4}
\leq d_{\mathcal{F}}(\hat{Q},{\mathbb{P}}_{n})+d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{*})+2\epsilon_{1}+\epsilon_{4}\\
&\stackrel{\rm(ii)}{\leq} \inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{*})+2\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\leq \inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+2\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\leq \inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},Q_{0})+d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{0})+d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+2\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\stackrel{\rm(i)}{\leq} d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{0})+d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+3\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\leq 2d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+2d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+3\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\stackrel{\rm(iv)}{\leq} 2d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+2d_{\mathcal{F}}(P_{0},Q_{0})+d_{\rm eval}(Q_{0},Q_{*})+3\epsilon_{1}+\epsilon_{2}+2\epsilon_{4}\\
&\stackrel{\rm(A.1)}{\leq} 2d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+2d_{\mathcal{F}}(P_{0},Q_{0})+5\epsilon_{1}+\epsilon_{2}+2\epsilon_{4}.
\end{align*}

By taking the expectation, we complete the proof. ∎
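Throughout the proofs, $d_{\mathcal{F}}(P,Q)=\sup_{f\in\mathcal{F}}|Pf-Qf|$ is the IPM generated by a (here finite) discriminator class $\mathcal{F}$. The following minimal Python sketch computes this quantity on empirical measures and checks the triangle-type manipulations used above; the linear discriminators, sample sizes and distributions are illustrative assumptions only, not objects from the proof.

```python
import numpy as np

def ipm_finite(xs, ys, fs):
    """IPM d_F(P, Q) = max_{f in F} |P f - Q f|, with P and Q replaced by the
    empirical measures of the samples xs, ys and F a finite list of functions R^D -> R."""
    return max(abs(np.mean(f(xs)) - np.mean(f(ys))) for f in fs)

rng = np.random.default_rng(0)
D = 3
# Illustrative finite discriminator class: 1-Lipschitz linear functionals f_u(x) = <u, x>, |u|_2 = 1.
us = rng.normal(size=(5, D))
us /= np.linalg.norm(us, axis=1, keepdims=True)
fs = [lambda x, u=u: x @ u for u in us]

xs = rng.normal(size=(1000, D))          # stand-in for samples from P
ys = rng.normal(size=(1000, D)) + 0.1    # stand-in for samples from Q
zs = rng.normal(size=(1000, D)) + 0.05   # stand-in for an intermediate distribution R

d_xy = ipm_finite(xs, ys, fs)
# The triangle inequality d_F(P, Q) <= d_F(P, R) + d_F(R, Q) used repeatedly above.
assert d_xy <= ipm_finite(xs, zs, fs) + ipm_finite(zs, ys, fs) + 1e-12
print(d_xy)
```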

A.2 Proof of Theorem 3.2

We will construct a generator class $\mathcal{G}$ and a discriminator class $\mathcal{F}$ satisfying condition (3.2) of Theorem 3.1 with $d_{\rm eval}=W_{1}$. By the construction of the estimator $\hat{Q}$, condition (3.2)-(ii) is automatically satisfied with $\epsilon_{2}=\epsilon_{\rm opt}$ for any $\mathcal{G}$ and $\mathcal{F}$.

Let $\delta>0$ be given. Lemma 3.5 of Chae et al. [10] implies that there exists ${\bf g}_{*}\in\mathcal{D}(L,{\bf p},s,K\vee 1)$, with

\[
L\leq c_{1}\log\delta^{-1},\quad |{\bf p}|_{\infty}\leq c_{1}\delta^{-t_{*}/\beta_{*}},\quad s\leq c_{1}\delta^{-t_{*}/\beta_{*}}\log\delta^{-1}
\]

for some constant $c_{1}=c_{1}(q,{\bf d},{\bf t},\bm{\beta},K)$, such that $\|{\bf g}_{*}-{\bf g}_{0}\|_{\infty}<\delta$. Let $Q_{*}=Q_{{\bf g}_{*}}$ and $\mathcal{G}=\mathcal{D}(L,{\bf p},s,K\vee 1)$. Then, by the Kantorovich–Rubinstein duality (see Theorem 1.14 in Villani [62]),

\begin{align*}
W_{1}(Q_{*},Q_{0})
&= \sup_{f\in\mathcal{F}_{\rm Lip}}|Q_{*}f-Q_{0}f|
\leq \sup_{f\in\mathcal{F}_{\rm Lip}}\int\big|f\big({\bf g}_{*}({\bf z})\big)-f\big({\bf g}_{0}({\bf z})\big)\big|\,dP_{Z}({\bf z})\\
&\leq \int|{\bf g}_{*}({\bf z})-{\bf g}_{0}({\bf z})|_{2}\,dP_{Z}({\bf z})
\leq \sqrt{D}\,\|{\bf g}_{*}-{\bf g}_{0}\|_{\infty}\leq\sqrt{D}\,\delta.
\end{align*}

Hence, condition (3.2)-(i) holds with $\epsilon_{1}=\sqrt{D}\,\delta$.
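As a numerical sanity check of the chain $W_{1}(Q_{*},Q_{0})\leq\int|{\bf g}_{*}-{\bf g}_{0}|_{2}\,dP_{Z}\leq\sqrt{D}\,\|{\bf g}_{*}-{\bf g}_{0}\|_{\infty}$ (not part of the proof), one may compare the empirical $W_{1}$ distance between the two pushforward distributions with the sup-norm distance of the generators. The sketch below uses $d=D=1$ and arbitrary smooth generators; both choices are assumptions of this illustration only.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # exact W1 between one-dimensional samples

rng = np.random.default_rng(1)
z = rng.uniform(size=100_000)              # latent samples from P_Z = Uniform(0, 1)

g0 = lambda z: np.sin(2 * np.pi * z)       # "true" generator (illustrative choice)
gstar = lambda z: g0(z) + 0.01 * z         # approximation with ||g* - g0||_inf = 0.01

w1 = wasserstein_distance(gstar(z), g0(z))           # empirical W1 between the pushforwards
print(w1, np.max(np.abs(gstar(z) - g0(z))))          # w1 does not exceed the sup-norm gap
```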

Let $\epsilon>0$ be given. For two Borel probability measures $Q_{1}$ and $Q_{2}$ on ${\mathbb{R}}^{D}$, one can choose $f_{Q_{1},Q_{2}}\in\mathcal{F}_{\rm Lip}$ such that $f_{Q_{1},Q_{2}}({\bf 0}_{D})=0$ and

\[
W_{1}(Q_{1},Q_{2})=\sup_{f\in\mathcal{F}_{\rm Lip}}|Q_{1}f-Q_{2}f|\leq|Q_{1}f_{Q_{1},Q_{2}}-Q_{2}f_{Q_{1},Q_{2}}|+\epsilon.
\]

Then, by the Lipschitz continuity and $f_{Q_{1},Q_{2}}({\bf 0}_{D})=0$,

\[
\sup_{|{\bf x}|_{\infty}\leq K}|f_{Q_{1},Q_{2}}({\bf x})|\leq\sup_{|{\bf x}|_{\infty}\leq K}|{\bf x}|_{2}=\sqrt{D}K.
\]

Let ${\bf g}_{1},\ldots,{\bf g}_{N}$ be an $\epsilon$-cover of $\mathcal{G}\cup\{{\bf g}_{0}\}$ with respect to $\|\cdot\|_{P_{Z},2}$ and

\[
\mathcal{F}=\big\{f_{jk}:1\leq j,k\leq N\big\},
\]

where

\[
\|{\bf g}\|_{P_{Z},p}=\left(\int|{\bf g}({\bf z})|_{p}^{p}\,dP_{Z}({\bf z})\right)^{1/p}
\]

and $f_{jk}=f_{Q_{{\bf g}_{j}},Q_{{\bf g}_{k}}}$. Since $\|{\bf g}-\widetilde{\bf g}\|_{P_{Z},2}\leq\sqrt{D}\,\|{\bf g}-\widetilde{\bf g}\|_{\infty}$ for every ${\bf g},\widetilde{\bf g}\in\mathcal{G}\cup\{{\bf g}_{0}\}$ and

\[
\log N(\epsilon,\mathcal{G},\|\cdot\|_{\infty})\leq(s+1)\bigg\{\log 2+\log\epsilon^{-1}+\log(L+1)+2\sum_{l=0}^{L+1}\log(p_{l}+1)\bigg\}
\]

by Lemma 5 of Schmidt-Hieber [52], the number $N$ can be bounded as

\[
\begin{split}
\log N&\leq\log\big(N(\epsilon/\sqrt{D},\mathcal{G},\|\cdot\|_{\infty})+1\big)\leq c_{2}s\Big(\log D+\log\epsilon^{-1}+L\log\delta^{-1}\Big)\\
&\leq c_{3}\delta^{-t_{*}/\beta_{*}}\log\delta^{-1}\Big\{\log\epsilon^{-1}+(\log\delta^{-1})^{2}\Big\},
\end{split}
\tag{A.2}
\]

where $c_{2}=c_{2}(t_{*},\beta_{*})$ and $c_{3}=c_{3}(c_{1},c_{2},D)$. Here, $N(\epsilon,\mathcal{G},\|\cdot\|_{\infty})$ denotes the covering number of $\mathcal{G}$ with respect to $\|\cdot\|_{\infty}$.
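Since the metric entropy bound from Lemma 5 of Schmidt-Hieber [52] is fully explicit, it can be evaluated directly; the sketch below simply plugs network parameters into the displayed formula (the particular architecture is illustrative and not taken from the paper).

```python
import math

def log_covering_number(eps, L, widths, s):
    """Upper bound on log N(eps, G, ||.||_inf) for sparse ReLU networks of depth L,
    width vector `widths` = (p_0, ..., p_{L+1}) and at most s nonzero parameters,
    following the display above (Lemma 5 of Schmidt-Hieber)."""
    return (s + 1) * (math.log(2) + math.log(1 / eps) + math.log(L + 1)
                      + 2 * sum(math.log(p + 1) for p in widths))

# Illustrative values only: depth 10, widths p_0, ..., p_{L+1}, sparsity s, radius eps.
L, widths, s, eps = 10, [4] + [64] * 10 + [8], 5000, 1e-3
print(log_covering_number(eps, L, widths, s))
```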

Next, we will prove that condition (3.2)-(iv) is satisfied with $\epsilon_{4}=5\epsilon$. Note that $d_{\mathcal{F}}\leq W_{1}$ by the construction. For ${\bf g},\widetilde{\bf g}\in\mathcal{G}\cup\{{\bf g}_{0}\}$, we can choose ${\bf g}_{j}$ and ${\bf g}_{k}$ such that $\|{\bf g}-{\bf g}_{j}\|_{P_{Z},2}\leq\epsilon$ and $\|\widetilde{\bf g}-{\bf g}_{k}\|_{P_{Z},2}\leq\epsilon$. Then,

\[
\begin{split}
W_{1}(Q_{\bf g},Q_{\widetilde{\bf g}})&\leq W_{1}(Q_{\bf g},Q_{{\bf g}_{j}})+W_{1}(Q_{{\bf g}_{j}},Q_{{\bf g}_{k}})+W_{1}(Q_{{\bf g}_{k}},Q_{\widetilde{\bf g}})\\
&\leq W_{1}(Q_{\bf g},Q_{{\bf g}_{j}})+d_{\mathcal{F}}(Q_{{\bf g}_{j}},Q_{{\bf g}_{k}})+W_{1}(Q_{{\bf g}_{k}},Q_{\widetilde{\bf g}})+\epsilon.
\end{split}
\tag{A.3}
\]

Note that

\begin{align*}
W_{1}(Q_{\bf g},Q_{{\bf g}_{j}})
&= \sup_{f\in\mathcal{F}_{\rm Lip}}\left|\int f\big({\bf g}({\bf z})\big)\,dP_{Z}({\bf z})-\int f\big({\bf g}_{j}({\bf z})\big)\,dP_{Z}({\bf z})\right|\\
&\leq \int|{\bf g}({\bf z})-{\bf g}_{j}({\bf z})|_{2}\,dP_{Z}({\bf z})\leq\|{\bf g}-{\bf g}_{j}\|_{P_{Z},2}\leq\epsilon.
\end{align*}

Similarly, $W_{1}(Q_{{\bf g}_{k}},Q_{\widetilde{\bf g}})\leq\epsilon$, and therefore,

\[
d_{\mathcal{F}}(Q_{{\bf g}_{j}},Q_{{\bf g}_{k}})\leq d_{\mathcal{F}}(Q_{{\bf g}_{j}},Q_{\bf g})+d_{\mathcal{F}}(Q_{\bf g},Q_{\widetilde{\bf g}})+d_{\mathcal{F}}(Q_{\widetilde{\bf g}},Q_{{\bf g}_{k}})\leq d_{\mathcal{F}}(Q_{\bf g},Q_{\widetilde{\bf g}})+2\epsilon.
\]

Hence, the right-hand side of (A.3) is bounded by $d_{\mathcal{F}}(Q_{\bf g},Q_{\widetilde{\bf g}})+5\epsilon$. That is, condition (3.2)-(iv) holds with $\epsilon_{4}=5\epsilon$.

Next, note that ${\mathbb{P}}_{n}$ is the empirical measure based on i.i.d. samples from $P_{0}$. Let ${\bf Y}$ and $\bm{\epsilon}$ be independent random vectors following $Q_{0}$ and $\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$, respectively. For any $f\in\mathcal{F}$, by the Lipschitz continuity,

\[
|f({\bf Y}+\bm{\epsilon})|\leq|{\bf Y}+\bm{\epsilon}|_{2}\leq|{\bf Y}|_{2}+|\bm{\epsilon}|_{2}.
\]

Since ${\bf Y}$ is bounded almost surely and $\sigma_{0}\leq 1$, $f({\bf Y}+\bm{\epsilon})$ is a sub-Gaussian random variable with sub-Gaussian parameter $\sigma=\sigma(K,D)$. By Hoeffding's inequality,

\[
P_{0}\Big(\big|{\mathbb{P}}_{n}f-P_{0}f\big|>t\Big)\leq 2\exp\left[-\frac{nt^{2}}{2\sigma^{2}}\right]
\]

for every $f\in\mathcal{F}$ and $t\geq 0$; see Proposition 2.5 of Wainwright [64] for Hoeffding's inequality for unbounded sub-Gaussian random variables. Since $\mathcal{F}$ is a finite set with cardinality $N^{2}$,

\[
P_{0}\bigg(\sup_{f\in\mathcal{F}}\big|{\mathbb{P}}_{n}f-P_{0}f\big|>t\bigg)\leq 2N^{2}\exp\left[-\frac{nt^{2}}{2\sigma^{2}}\right].
\]

If $t\geq 2\sigma\sqrt{\{\log(2N^{2})\}/n}$, the right-hand side is bounded by $e^{-nt^{2}/(4\sigma^{2})}$. Therefore,

\begin{align*}
\mathbb{E}\,d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})
&= \int_{0}^{\infty}P_{0}\big(d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})>t\big)\,dt\\
&\leq 2\sigma\sqrt{\frac{\log(2N^{2})}{n}}+\int_{0}^{\infty}\exp\left[-\frac{nt^{2}}{4\sigma^{2}}\right]dt
\leq 2\sigma\sqrt{\frac{\log(2N^{2})}{n}}+\sigma\sqrt{\frac{\pi}{n}},
\end{align*}

and condition (3.2)-(iii) is also satisfied with $\epsilon_{3}$ equal to the right-hand side of the last display.
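The bound on $\mathbb{E}\,d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})$ rests on Hoeffding's inequality combined with a union bound over the $N^{2}$ elements of $\mathcal{F}$. A small simulation is sketched below; the random linear 1-Lipschitz discriminators, the uniform distribution for ${\bf Y}$ and the crude proxy for the sub-Gaussian parameter $\sigma(K,D)$ are assumptions of this sketch, not quantities from the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, sigma0, n, N = 3, 1.0, 0.1, 2000, 50

# Finite class of 1-Lipschitz linear functionals f_u(x) = <u, x> with |u|_2 = 1;
# the class has cardinality N^2 as in the proof (the random choice is an assumption).
us = rng.normal(size=(N * N, D))
us /= np.linalg.norm(us, axis=1, keepdims=True)

# X = Y + eps with Y uniform on [-K, K]^D (so that P_0 f = 0 by symmetry) plus Gaussian noise.
x = rng.uniform(-K, K, size=(n, D)) + sigma0 * rng.normal(size=(n, D))

sup_dev = np.max(np.abs((x @ us.T).mean(axis=0)))   # sup_{f in F} |P_n f - P_0 f|
sigma = np.sqrt(D) * (K + 1)                        # crude proxy for the sub-Gaussian parameter
print(sup_dev, 2 * sigma * np.sqrt(np.log(2 * N**2) / n))   # observed sup vs the threshold
```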

Note that

\[
d_{\mathcal{F}}(P_{0},Q_{0})\leq W_{1}(P_{0},Q_{0})\leq W_{2}(P_{0},Q_{0})\leq\sqrt{D}\,\sigma_{0},
\]

where the last inequality holds because $P_{0}$ is the convolution of $Q_{0}$ and $\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$. By Theorem 3.1, we have

\begin{align*}
\mathbb{E}\,W_{1}(\hat{Q},Q_{0})
&\leq 2\sqrt{D}\,\sigma_{0}+5\sqrt{D}\,\delta+\epsilon_{\rm opt}+4\sigma\sqrt{\frac{\log(2N^{2})}{n}}+2\sigma\sqrt{\frac{\pi}{n}}+10\epsilon\\
&\leq c_{4}\bigg\{\epsilon_{\rm opt}+\sigma_{0}+\delta+\sqrt{\frac{\log N}{n}}+\epsilon\bigg\},
\end{align*}

where $c_{4}=c_{4}(\sigma,D)$. Combining with (A.2), we have

\[
\mathbb{E}\,W_{1}(\hat{Q},Q_{0})\leq c_{5}\bigg\{\epsilon_{\rm opt}+\sigma_{0}+\delta+\frac{\sqrt{\log\delta^{-1}}\big(\sqrt{\log\epsilon^{-1}}+\log\delta^{-1}\big)}{\sqrt{n}\,\delta^{t_{*}/2\beta_{*}}}+\epsilon\bigg\},
\]

where $c_{5}=c_{5}(c_{3},c_{4})$. The proof is complete if we take

\[
\delta=n^{-\beta_{*}/(2\beta_{*}+t_{*})}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}
\]

and $\epsilon=n^{-\log n}$. ∎
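The choice of $\delta$ balances the approximation error $\delta$ against the stochastic term $\sqrt{\log N/n}$ appearing in the last display. A short numerical check of this balancing, with illustrative values of $\beta_{*}$ and $t_{*}$, is given below.

```python
import numpy as np

beta, t = 2.0, 4.0                          # illustrative values of beta_* and t_*
n = 10.0 ** np.arange(3, 8)                 # sample sizes 1e3, ..., 1e7

delta = n ** (-beta / (2 * beta + t)) * np.log(n) ** (3 * beta / (2 * beta + t))
# Stochastic term from the display above, with sqrt(log(1/eps)) = log(n) for eps = n^{-log n}.
stoch = (np.sqrt(np.log(1 / delta)) * (np.log(n) + np.log(1 / delta))
         / (np.sqrt(n) * delta ** (t / (2 * beta))))

print(np.column_stack([n, delta, stoch]))   # the two terms are of the same order in n
```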

A.3 Proof of Theorem 3.3

The proof is divided into two cases.

Case 1: Suppose that

\[
\mathbb{E}\,d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\gtrsim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}.
\]

In this case, we have

\[
\mathbb{E}\,W_{1}(\hat{Q},Q_{0})\lesssim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}+\sigma_{0}+\epsilon_{\rm opt},
\]

whose proof is the same as that of Theorem 3.2. The only difference is that some constants in the proof depend on the Lipschitz constant $C_{1}$.

Case 2: Suppose that

\[
\mathbb{E}\,d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}.
\]

We utilize Theorem 3.1 with $\mathcal{F}=\mathcal{F}_{0}$. Since $d_{\rm eval}=d_{\mathcal{F}}$, we have $\epsilon_{4}=0$. Also, for a large enough $\mathcal{G}$, i.e., large depth, width and number of nonzero parameters, $\epsilon_{1}$ can be made arbitrarily small. Since $\mathcal{F}$ consists of Lipschitz continuous functions, $d_{\mathcal{F}}(P_{0},Q_{0})\lesssim\sigma_{0}$. It follows by Theorem 3.1 that $\mathbb{E}\,d_{\mathcal{F}_{0}}(\hat{Q},Q_{0})\lesssim\sigma_{0}+\epsilon_{\rm opt}+\mathbb{E}\,d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})$. ∎

A.4 Proof of Theorem 4.1

Throughout the proof, we will assume that $D=d$; the extension to the case $D>d$ is straightforward. Our proof relies on Fano's method, for which we refer to Chapter 15 of Wainwright [64].

Let $\phi:{\mathbb{R}}\to[0,\infty)$ be a fixed function satisfying the following:

(i) $\phi$ is $[\beta+1]$-times continuously differentiable on ${\mathbb{R}}$;

(ii) $\phi$ is unimodal and symmetric about $1/2$; and

(iii) $\phi(z)>0$ if and only if $z\in(0,1)$,

where $[x]$ denotes the largest integer less than or equal to $x$. Figure 1 shows an illustration of $\phi$ and related functions. For a positive integer $m=m_{n}$, with $m_{n}\uparrow\infty$ as $n\to\infty$, let $z_{j}=j/m$, $I_{j}=[z_{j},z_{j+1}]$ for $j=0,\ldots,m-1$, $J=\{0,1,\ldots,m-1\}^{d}$ and $\phi_{j}(z)=\phi(m(z-z_{j}))$. For a multi-index ${\bf j}=(j_{1},\ldots,j_{d})\in J$ and $\alpha=(\alpha_{\bf j})_{{\bf j}\in J}\in\{-1,+1\}^{|J|}$, define ${\bf g}_{\alpha}:[0,1]^{d}\to{\mathbb{R}}^{d}$ as

\[
{\bf g}_{\alpha}({\bf z})=\left(z_{1}+\frac{c_{1}}{m^{\beta}}\sum_{{\bf j}\in J}\alpha_{\bf j}\,\phi_{j_{1}}(z_{1})\cdots\phi_{j_{d}}(z_{d}),\;z_{2},\ldots,z_{d}\right),
\]

where $c_{1}=c_{1}(\phi,d)$ is a small enough constant described below. Then, it is easy to check that ${\bf g}_{\alpha}$ is a one-to-one function from $[0,1]^{d}$ onto itself, and ${\bf g}_{\alpha}\in\mathcal{H}_{K}^{\beta}([0,1]^{d})\times\cdots\times\mathcal{H}_{K}^{\beta}([0,1]^{d})$ for large enough $K=K(\beta,c_{1})$.

Figure 1: An illustration of $\phi$ and related functions: (a) $\phi(z)$, (b) $\phi^{\prime}(z)$, (c) $\phi^{\prime}(z_{1})\phi(z_{2})$.
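For concreteness, a one-dimensional sketch of this construction is given below. The particular bump $\phi(z)=\exp\{-1/(z(1-z))\}$ on $(0,1)$ is only one function satisfying (i)–(iii) and is an assumption of this sketch; the code checks that, for a small $c_{1}$, the perturbed map is strictly increasing on $[0,1]$ and fixes the endpoints, hence is a bijection of $[0,1]$ onto itself.

```python
import numpy as np

def phi(z):
    """A smooth bump on (0, 1): phi(z) = exp(-1 / (z (1 - z))) for 0 < z < 1, else 0.
    One concrete choice satisfying (i)-(iii); an assumption of this sketch."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    inside = (z > 0) & (z < 1)
    out[inside] = np.exp(-1.0 / (z[inside] * (1.0 - z[inside])))
    return out

def g_alpha(z, alpha, c1, beta):
    """Perturbed generator in d = 1: g(z) = z + (c1 / m^beta) * sum_j alpha_j phi_j(z)."""
    m = len(alpha)
    bumps = sum(a * phi(m * z - j) for j, a in enumerate(alpha))
    return z + c1 / m ** beta * bumps

rng = np.random.default_rng(3)
m, beta, c1 = 8, 2.0, 0.05
alpha = rng.choice([-1.0, 1.0], size=m)

z = np.linspace(0.0, 1.0, 10_001)
gz = g_alpha(z, alpha, c1, beta)
print(np.all(np.diff(gz) > 0), gz[0], gz[-1])   # strictly increasing, g(0) = 0 and g(1) = 1
```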

Let ${\bf Z}=(Z_{1},\ldots,Z_{d})$ be a uniform random vector on $(0,1)^{d}$. Then, by the change of variables formula, the Lebesgue density $q_{\alpha}$ of ${\bf Y}={\bf g}_{\alpha}({\bf Z})$ is given as

\[
q_{\alpha}({\bf y})=\left|\frac{\partial{\bf z}}{\partial{\bf y}}\right|=\left(1+\frac{c_{1}}{m^{\beta}}\sum_{{\bf j}\in J}\alpha_{\bf j}\,\phi_{j_{1}}^{\prime}(z_{1})\phi_{j_{2}}(y_{2})\cdots\phi_{j_{d}}(y_{d})\right)^{-1}
\]

for ${\bf y}\in[0,1]^{d}$, where $\phi^{\prime}$ denotes the derivative of $\phi$. Here, $z_{1}=z_{1}(y_{1},\ldots,y_{d})$ is defined implicitly as the first coordinate of ${\bf g}_{\alpha}^{-1}({\bf y})$.

We first find an upper bound of $K(q_{\alpha},q_{\alpha^{\prime}})$ for $\alpha,\alpha^{\prime}\in\{-1,+1\}^{|J|}$, where $K(p,q)=\int\log(p/q)\,dP$ is the Kullback–Leibler divergence and $P$ is the probability measure with density $p$. Since ${\bf g}_{\alpha}(C_{\bf j})=C_{\bf j}$ and $q_{\alpha}$ is bounded from above and below for small enough $c_{1}$, where $C_{\bf j}=I_{j_{1}}\times\cdots\times I_{j_{d}}$, we have

\[
|q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})|\lesssim\left|\frac{1}{q_{\alpha}({\bf y})}-\frac{1}{q_{\alpha^{\prime}}({\bf y})}\right|\leq\frac{2c_{1}}{m^{\beta-1}}\|\phi^{\prime}\|_{\infty}\|\phi\|_{\infty}^{d-1}.
\]

Since the ratio $q_{\alpha}/q_{\alpha^{\prime}}$ is bounded from above and below, we can use a well-known inequality $K(q_{\alpha},q_{\alpha^{\prime}})\lesssim d_{H}^{2}(q_{\alpha},q_{\alpha^{\prime}})$, where $d_{H}$ denotes the Hellinger distance; see Lemma B.2 of Ghosal & van der Vaart [19]. Since $|\sqrt{q_{\alpha}}-\sqrt{q_{\alpha^{\prime}}}|\lesssim|q_{\alpha}-q_{\alpha^{\prime}}|$, we have

\[
K(q_{\alpha},q_{\alpha^{\prime}})\lesssim\int_{[0,1]^{d}}|q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})|^{2}\,d{\bf y}\lesssim\frac{c_{1}^{2}\|\phi^{\prime}\|_{\infty}^{2}\|\phi\|_{\infty}^{2(d-1)}}{m^{2(\beta-1)}}.
\]
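The inequality $K(q_{\alpha},q_{\alpha^{\prime}})\lesssim\int(q_{\alpha}-q_{\alpha^{\prime}})^{2}$ for densities bounded away from zero and infinity can also be illustrated numerically; the two densities in the sketch below are arbitrary and serve only as an illustration.

```python
import numpy as np

y = np.linspace(0.0, 1.0, 200_001)[1:-1]   # interior grid on (0, 1) for numerical integration
dy = y[1] - y[0]

q1 = 1.0 + 0.2 * np.sin(2 * np.pi * y)     # two densities on [0, 1], bounded away from 0
q2 = 1.0 + 0.2 * np.cos(2 * np.pi * y)
q1, q2 = q1 / (q1.sum() * dy), q2 / (q2.sum() * dy)

kl = np.sum(q1 * np.log(q1 / q2)) * dy     # Kullback-Leibler divergence K(q1, q2)
l2 = np.sum((q1 - q2) ** 2) * dy           # squared L2 distance
print(kl, l2)                              # kl is bounded by a constant multiple of l2
```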

Next, we derive a lower bound for $W_{1}(q_{\alpha},q_{\alpha^{\prime}})$. Suppose that $\alpha_{\bf j}\neq\alpha_{\bf j}^{\prime}$ for some ${\bf j}\in J$. Then, the excess mass of $Q_{\alpha}$ over $Q_{\alpha^{\prime}}$ on $C_{\bf j}$ is

\begin{align*}
&\int_{\{{\bf y}\in C_{\bf j}:q_{\alpha}({\bf y})>q_{\alpha^{\prime}}({\bf y})\}}\big\{q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})\big\}\,d{\bf y}=\frac{1}{2}\int_{C_{\bf j}}|q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})|\,d{\bf y}\\
&\gtrsim\int_{C_{\bf j}}\left|\frac{1}{q_{\alpha}({\bf y})}-\frac{1}{q_{\alpha^{\prime}}({\bf y})}\right|d{\bf y}=\frac{2c_{1}}{m^{\beta}}\int_{C_{\bf j}}|\phi_{j_{1}}^{\prime}(z_{1})\phi_{j_{2}}(y_{2})\cdots\phi_{j_{d}}(y_{d})|\,d{\bf y}\\
&=\frac{2c_{1}}{m^{\beta}}\int_{C_{\bf j}}|\phi_{j_{1}}^{\prime}(z_{1})\phi_{j_{2}}(z_{2})\cdots\phi_{j_{d}}(z_{d})|\left|\frac{\partial{\bf y}}{\partial{\bf z}}\right|d{\bf z}\gtrsim\frac{c_{1}}{m^{(\beta-1)+d}}\int_{(0,1)^{d}}|\phi^{\prime}(z_{1})\phi(z_{2})\cdots\phi(z_{d})|\,d{\bf z}.
\end{align*}

By virtue of Corollary 1.16 in Villani [62], under the (unique) optimal transport plan between $Q_{\alpha}$ and $Q_{\alpha^{\prime}}$, some portion $\gamma\in(0,1)$ of this excess mass must be transported at least the distance $c_{2}/m$, where the constants $\gamma$ and $c_{2}$ can be chosen so that they depend only on $d$ and $\phi$. Hence, for some constant $c_{3}=c_{3}(\phi,d)$,

\[
W_{1}(q_{\alpha},q_{\alpha^{\prime}})\geq\frac{c_{1}c_{3}}{m^{\beta+d}}H(\alpha,\alpha^{\prime}),
\]

where $H(\alpha,\alpha^{\prime})=\sum_{{\bf j}\in J}I(\alpha_{\bf j}\neq\alpha^{\prime}_{\bf j})$ denotes the Hamming distance between $\alpha$ and $\alpha^{\prime}$.

With the Hamming distance on $\{-1,+1\}^{|J|}$, it is well known (see, e.g., page 124 of Wainwright [64]) that there is a $|J|/4$-packing $\mathcal{A}$ of $\{-1,+1\}^{|J|}$ whose cardinality is at least $e^{|J|/16}$; a small empirical check of this packing bound is sketched at the end of the proof. Let $P_{\alpha}$ be the convolution of $Q_{\alpha}$ and $\mathcal{N}({\bf 0}_{d},\sigma_{0}^{2}{\mathbb{I}}_{d})$. Then, $K(p_{\alpha},p_{\alpha^{\prime}})\leq K(q_{\alpha},q_{\alpha^{\prime}})$ by Lemma B.11 of Ghosal & van der Vaart [19]. By Fano's method (Proposition 15.12 of Wainwright [64]), we have

\[
\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\gtrsim\frac{c_{1}c_{3}}{m^{\beta}}\left\{1-\frac{nc_{1}^{2}C(\phi,d)m^{-2(\beta-1)}+\log 2}{m^{d}/16}\right\}.
\]

If $n\asymp m^{d+2(\beta-1)}$ and $c_{1}$ is small enough, then $n\,m^{-2(\beta-1)}/m^{d}\asymp 1$ and $m^{-\beta}\asymp n^{-\beta/(2\beta+d-2)}$, so that the factor in braces is bounded away from zero and we have the desired result. ∎
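A brute-force check of the Varshamov–Gilbert-type packing bound invoked above is sketched below; it is feasible only for small $|J|$ and is purely an empirical verification, not the probabilistic argument given in Wainwright [64].

```python
import itertools
import math
import numpy as np

def greedy_packing(J, min_dist):
    """Greedily collect vectors in {-1,+1}^J whose pairwise Hamming distance is >= min_dist."""
    packing = np.empty((0, J), dtype=int)
    for point in itertools.product((-1, 1), repeat=J):
        v = np.array(point)
        if packing.shape[0] == 0 or np.min(np.sum(packing != v, axis=1)) >= min_dist:
            packing = np.vstack([packing, v])
    return packing

J = 16                                            # small |J| so that full enumeration is feasible
packing = greedy_packing(J, J // 4)               # |J|/4-packing in Hamming distance
print(len(packing), math.ceil(math.exp(J / 16)))  # greedy count vs the e^{|J|/16} lower bound
```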