
Rates of convergence for nonparametric estimation of singular distributions using generative adversarial networks

Minwoo Chae
Department of Industrial and Management Engineering
Pohang University of Science and Technology
Abstract

We consider generative adversarial networks (GAN) for estimating parameters in a deep generative model. The data-generating distribution is assumed to concentrate around some low-dimensional structure, making the target distribution singular with respect to the Lebesgue measure. Under this assumption, we obtain convergence rates of a GAN type estimator with respect to the Wasserstein metric. The convergence rate depends only on the noise level, intrinsic dimension and smoothness of the underlying structure. Furthermore, the rate is faster than that obtained by likelihood approaches, which provides insights into why GAN approaches perform better in many real problems. A lower bound for the minimax optimal rate is also investigated.

Keywords: Convergence rate, deep generative model, generative adversarial networks, nonparametric estimation, singular distribution, Wasserstein distance.

1 Introduction

Given $D$-dimensional observations ${\bf X}_{1},\ldots,{\bf X}_{n}$ following $P_{0}$, suppose that we are interested in inferring the underlying distribution $P_{0}$ or related quantities such as its density function or the manifold on which $P_{0}$ is supported. The inference of $P_{0}$ is fundamental in unsupervised learning problems, for which numerous inferential methods are available in the literature [25, 43, 9]. In this paper, we model ${\bf X}_{i}$ as ${\bf X}_{i}={\bf g}({\bf Z}_{i})+\bm{\epsilon}_{i}$ for some function ${\bf g}:\mathcal{Z}\to{\mathbb{R}}^{D}$. Here, ${\bf Z}_{i}$ is a latent variable following the known distribution $P_{Z}$ supported on $\mathcal{Z}\subset{\mathbb{R}}^{d}$, and $\bm{\epsilon}_{i}$ is an error vector following the normal distribution $\mathcal{N}({\bf 0}_{D},\sigma^{2}{\mathbb{I}}_{D})$, where ${\bf 0}_{D}$ and ${\mathbb{I}}_{D}$ denote the $D$-dimensional zero vector and identity matrix, respectively. The dimension $d$ of the latent variable is typically much smaller than $D$. The model is often called a (non-linear) factor model in the statistics literature [68, 34] and a generative model in the machine learning literature [22, 31]. Throughout the paper, we use the latter terminology. Accordingly, ${\bf g}$ will be referred to as a generator.
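
To fix ideas, the following minimal Python sketch draws data from the model ${\bf X}_{i}={\bf g}({\bf Z}_{i})+\bm{\epsilon}_{i}$; the particular generator, the dimensions and the noise level are illustrative choices, not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # A toy generator g: R^d -> R^D (here d = 2, D = 5) whose image is a
    # smooth 2-dimensional surface embedded in R^5; purely illustrative.
    z1, z2 = z[..., 0], z[..., 1]
    return np.stack([z1, z2, np.sin(2 * np.pi * z1),
                     np.cos(2 * np.pi * z2), z1 * z2], axis=-1)

def sample_data(n, d=2, D=5, sigma0=0.05):
    # Draw X_i = g(Z_i) + eps_i with Z_i ~ P_Z = Uniform[0,1]^d and
    # eps_i ~ N(0, sigma0^2 I_D), matching the model described above.
    Z = rng.uniform(0.0, 1.0, size=(n, d))
    eps = sigma0 * rng.standard_normal((n, D))
    return g(Z) + eps

X = sample_data(n=1000)
print(X.shape)  # (1000, 5): observations concentrated near a 2-dim. surface
```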

A fundamental issue in a generative model is to construct an estimator of ${\bf g}$ because inferences are mostly based on the estimation of the generator. Once we have an estimator $\hat{\bf g}$, for example, the distribution of $\hat{\bf g}({\bf Z}_{i})$ can serve as an estimator of $P_{0}$. While there are various nonparametric approaches for estimating $P_{0}$ [59, 24], the generative model approach does not provide a direct estimator due to an intractable integral. However, generative models are often more practical than direct estimation methods because it is easy to generate samples from the estimated distribution.

Recent advances in deep learning have brought great success to the generative model approach by modeling ${\bf g}$ with deep neural networks (DNN); we call the resulting model a deep generative model. Two learning approaches are popularly used in practice. The first is the likelihood approach; the variational autoencoder [31, 50] is perhaps the most well-known algorithm for estimating ${\bf g}$. The second is known as generative adversarial networks (GAN), originally developed by Goodfellow et al. [22] and generalized by several researchers. One line of extensions considers general integral probability metrics (IPM) as loss functions; Sobolev GAN [41], maximum mean discrepancy GAN [37] and Wasserstein GAN [3] are important examples. Another important direction of generalization is the development of novel architectures for generators and discriminators; deep convolutional GAN [49], progressive GAN [29] and style GAN [30] are successful architectures. In many real applications, GAN approaches tend to perform better than likelihood approaches, but training a GAN architecture is notorious for its difficulty. In particular, the estimator is very sensitive to the choice of the hyperparameters in the training algorithm.

In spite of the rapid development of GAN, its theoretical understanding remains largely incomplete. This paper studies the statistical properties of GAN from a nonparametric distribution estimation viewpoint. Specifically, we investigate convergence rates of a GAN type estimator under a structural assumption on the generator. Although GAN does not yield an explicit estimator for $P_{0}$, it is crucial to study the convergence rate of the estimator implicitly defined through $\hat{\bf g}$. A primary goal is to provide theoretical insights into why GAN performs well in many real-world applications. With regard to this goal, fundamental questions would be, "Which distributions can be efficiently estimated via GAN?" and "What is the main benefit of GAN compared to other methods for estimating these distributions?" The first question has recently been addressed by Chae et al. [10] to understand the benefit of deep generative models, although their results are limited to likelihood approaches. They considered a certain class of structured distributions and tried to explain how deep generative models can avoid the curse of dimensionality in nonparametric distribution estimation problems.

To set the scene and the notation, let $Q_{\bf g}$ be the distribution of ${\bf g}({\bf Z})$, where ${\bf Z}\sim P_{Z}$. In other words, $Q_{\bf g}$ is the pushforward measure of $P_{Z}$ under the map ${\bf g}$. Let $P_{{\bf g},\sigma}=Q_{\bf g}*\mathcal{N}({\bf 0}_{D},\sigma^{2}{\mathbb{I}}_{D})$, the convolution of $Q_{\bf g}$ and $\mathcal{N}({\bf 0}_{D},\sigma^{2}{\mathbb{I}}_{D})$. A fundamental assumption of the present paper is that there exist a true generator ${\bf g}_{0}$ and $\sigma_{0}\geq 0$ such that the ${\bf X}_{i}$'s are equal in distribution to ${\bf g}_{0}({\bf Z}_{i})+\bm{\epsilon}_{i}$, where $\bm{\epsilon}_{i}\sim\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ and $\bm{\epsilon}_{i}\perp\!\!\!\perp{\bf Z}_{i}$. With the above notation, this can be expressed as

P_{0}=P_{{\bf g}_{0},\sigma_{0}}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D}), (1.1)

where $Q_{0}=Q_{{\bf g}_{0}}$. Under this assumption, it would be more reasonable to set $Q_{0}$, rather than $P_{0}$, as the target distribution to be estimated. We further assume that $Q_{0}$ possesses a certain low-dimensional structure and that $\sigma_{0}\to 0$ as the sample size increases. That is, the data-generating distribution consists of the structured distribution $Q_{0}$ and small additional noise.

The above assumption has been investigated by Chae et al. [10], motivated by recent articles on structured distribution estimation [18, 17, 48, 1, 14]. Once the true generator ${\bf g}_{0}$ belongs to a class $\mathcal{G}_{0}$ possessing a low-dimensional structure that DNN can efficiently capture, deep generative models are highly appropriate for estimating $Q_{0}$. Chae et al. [10] considered a class of composite functions [26, 28] which have recently been studied in deep supervised learning [52, 6]. Some other structures have also been studied in the literature [27, 44, 12]. The corresponding class of distributions $\mathcal{Q}_{0}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}_{0}\}$ inherits the low-dimensional structures of $\mathcal{G}_{0}$. In particular, when $\mathcal{G}_{0}$ consists of composite functions, the corresponding class $\mathcal{Q}_{0}$ is sufficiently large to include various structured distributions such as product distributions, classical smooth distributions and distributions supported on submanifolds; see Section 4 of Chae et al. [10] for details.

The assumption on $\sigma_{0}$ is crucial for the efficient estimation of $Q_{0}$. Unless $\sigma_{0}$ is small enough, the minimax optimal rate is very slow, e.g. $1/\log n$. In the statistics literature, this problem is known as deconvolution [15, 40, 17, 45]. Mathematically, the assumption of small $\sigma_{0}$ can be expressed as $\sigma_{0}\to 0$ at a suitable rate.

Once we have an estimator $\hat{\bf g}$ for ${\bf g}_{0}$, $\hat{Q}=Q_{\hat{\bf g}}$ can serve as an estimator for $Q_{0}$. Under the assumption described above, we study convergence rates of a GAN type estimator $\hat{Q}$. Note that $Q_{0}$ is singular with respect to the Lebesgue measure on ${\mathbb{R}}^{D}$ because $d$ is smaller than $D$. Therefore, standard metrics between densities, such as the total variation and Hellinger distances, are not appropriate for evaluating the estimation performance. We instead consider the $L^{1}$-Wasserstein metric, which originated from the problem of optimal mass transportation and is frequently used in distribution estimation problems [62, 45, 66, 11].

When ${\bf g}_{0}$ possesses a composite structure with intrinsic dimension $t$ and smoothness $\beta$ (see Section 3 for the definition), Chae et al. [10] proved that a likelihood approach to deep generative models can achieve the rate $n^{-\beta/2(\beta+t)}+\sigma_{0}$ up to a logarithmic factor. Due to the singularity of the underlying distribution, perturbing the data with artificial noise plays a key role. That is, the rate is obtained by a sieve maximum likelihood estimator based on the perturbed data $\widetilde{\bf X}_{i}={\bf X}_{i}+\widetilde{\bm{\epsilon}}_{i}$, where $\widetilde{\bm{\epsilon}}_{i}\sim\mathcal{N}({\bf 0}_{D},\widetilde{\sigma}^{2}{\mathbb{I}}_{D})$ and $\widetilde{\sigma}$ is the degree of perturbation. Without suitable data perturbation, likelihood approaches can fail to estimate $Q_{0}$ consistently. Note that the rate depends on $\beta$ and $t$, but not on $D$ or $d$.

Interestingly, the GAN type estimator considered in this paper can achieve a strictly faster rate than that of the likelihood approach. Our main result (Theorem 3.2) guarantees that a GAN type estimator achieves the rate $n^{-\beta/(2\beta+t)}+\sigma_{0}$ under the above assumption. Although Chae et al. [10] obtained only an upper bound for the convergence rate of likelihood approaches, it is hard to expect that this rate can be improved by likelihood approaches, in view of classical nonparametric theory [8, 35, 36, 67]. In this sense, our results provide some insight into why GAN approaches often perform better than likelihood approaches in real data analysis.

In addition to the convergence rate of a GAN type estimator, we obtain a lower bound $n^{-\beta/(2\beta+t-2)}$ for the minimax convergence rate; see Theorem 4.1. When $\sigma_{0}$ is small enough, this lower bound is only slightly smaller than the convergence rate of the GAN type estimator.

It is worthwhile to mention the technical novelty of the present paper compared to the existing theory of GAN reviewed in Section 1.1. Firstly, most existing theories analyze GAN from a classical nonparametric density estimation viewpoint, rather than the distribution estimation viewpoint taken in this paper. Classical methods such as kernel density estimators and wavelets can also attain the minimax optimal convergence rate in that framework. Consequently, those results cannot explain why GAN outperforms classical approaches in density estimation problems.

Another notable difference lies in the discriminator architectures. While the discriminator architectures in the literature depend solely on the evaluation metric ($L^{1}$-Wasserstein in our case), in our approach the discriminator depends on the generator architecture as well. Although state-of-the-art GAN architectures such as progressive and style GANs are too complicated to be theoretically tractable, it is crucial for the success of these procedures that the discriminator architectures have structures quite similar to the generator architectures. In the proof of Theorem 3.2, we carefully construct the discriminator class using the generator class. In particular, the discriminator class is constructed so that its complexity, expressed through the metric entropy, is of the same order as that of the generator class. Consequently, the discriminator can be a much smaller class than the set of all functions with Lipschitz constant bounded by one. This reduction can significantly improve the rate; see the discussion after Theorem 3.1.

The construction of the discriminator class in the proof of Theorem 3.2 is artificial and serves only a theoretical purpose. In particular, the discriminator class is not a class of neural networks, and the computation of the considered estimator is intractable. In spite of this limitation, our theoretical results provide important insights into the success of GAN. Focusing on the Wasserstein GAN, note that many algorithms for Wasserstein GAN [3, 23] aim to find a minimizer, say $\hat{Q}^{W}$, of the Wasserstein distance from the empirical measure. However, even computing the Wasserstein distance between two simple distributions is very difficult; see Theorem 3 of Kuhn et al. [33]. In practice, a class of neural network functions is used as the discriminator class, and this has been understood as a technique for approximating $\hat{Q}^{W}$. However, our theory implies that $\hat{Q}^{W}$ might not be a decent estimator even when its exact computation is possible. Stanczuk et al. [57] empirically demonstrated this by showing that Wasserstein GAN does not approximate the Wasserstein distance.

Besides the Wasserstein distance, we also consider general integral probability metrics as evaluation metrics (Theorem 3.3). For example, $\alpha$-Hölder classes can be used to define the evaluation metric. Considering state-of-the-art architectures, neural network distances [4, 71, 5, 39] would also be natural choices. In that case, the corresponding GAN type estimator is much more natural than the one considered in the proof of Theorem 3.2.

The remainder of the paper is organized as follows. First, we review the literature on the theory of GAN and introduce some notation in the following subsections. Section 2 then provides the mathematical set-up, including a brief introduction to DNN and GAN. An upper bound for the convergence rate of a GAN type estimator and a lower bound for the minimax convergence rate are investigated in Sections 3 and 4, respectively. Concluding remarks follow in Section 5. All proofs are deferred to the Appendix.

1.1 Related statistical theory for GAN

The study of convergence rates in nonparametric generative models has been conducted in some earlier papers [34, 47] under the name of latent factor models. Rather than utilizing DNN, they considered a Bayesian approach with a Gaussian process prior on the generator function. Since the development of GAN [22], several researchers have studied rates of convergence in deep generative models, particularly focusing on GAN. To the best of our knowledge, an earlier version of Liang [38] is the first work studying the convergence rate within a GAN framework. A similar theory has been developed by Singh et al. [56], which was later generalized by Uppal et al. [60]. Slightly weaker results were obtained by Chen et al. [13] with explicit DNN architectures for the generator and discriminator classes. Convergence rates of the vanilla GAN with respect to the Jensen–Shannon divergence have recently been obtained by Belomestny et al. [7].

All the above works tried to understand GAN within a nonparametric density estimation framework. They used integral probability metrics as evaluation metrics, while classical approaches to nonparametric density estimation focused on other metrics such as the total variation, Hellinger and uniform metrics. Since the total variation can be viewed as an IPM, some results in the above papers are comparable with those of classical methods. In this case, both approaches achieve the same minimax optimal rate. Hence, the above results cannot explain why deep generative models outperform classical nonparametric methods. Schreuder et al. [54] considered generative models in which the target distribution does not possess a Lebesgue density. However, their result only guarantees that the convergence rate of GAN is not worse than that of the empirical measure [65]. We adopt the set-up of Chae et al. [10], who exclusively considered likelihood approaches.

1.2 Notations

The maximum and minimum of two real numbers $a$ and $b$ are denoted by $a\vee b$ and $a\wedge b$, respectively. For $1\leq p\leq\infty$, $|\cdot|_{p}$ denotes the $\ell^{p}$-norm. For a real-valued function $f$ and a probability measure $P$, let $Pf=\int f({\bf x})dP({\bf x})$. $\mathbb{E}$ denotes the expectation when the underlying probability measure is obvious. The equality $c=c(A_{1},\ldots,A_{k})$ means that $c$ depends only on $A_{1},\ldots,A_{k}$. Uppercase letters such as $P$ and $\hat{P}$ refer to the probability measures corresponding to the densities denoted by the lowercase letters $p$ and $\hat{p}$, respectively, and vice versa. The inequality $a\lesssim b$ means that $a$ is less than $b$ up to a constant multiple, where the constant is universal or at least contextually unimportant. Also, we write $a\asymp b$ if $a\lesssim b$ and $b\lesssim a$.

2 Generative adversarial networks

For a given class $\mathcal{F}$ of functions from ${\mathbb{R}}^{D}$ to ${\mathbb{R}}$, the $\mathcal{F}$-IPM [42] between two probability measures $P_{1}$ and $P_{2}$ is defined as

d_{\mathcal{F}}(P_{1},P_{2})=\sup_{f\in\mathcal{F}}|P_{1}f-P_{2}f|.

For example, if $\mathcal{F}=\mathcal{F}_{\rm Lip}$, the class of all functions $f:{\mathbb{R}}^{D}\to{\mathbb{R}}$ satisfying $|f({\bf x})-f({\bf y})|\leq|{\bf x}-{\bf y}|_{2}$ for all ${\bf x},{\bf y}\in{\mathbb{R}}^{D}$, then the corresponding IPM is the $L^{1}$-Wasserstein distance by the Kantorovich–Rubinstein duality theorem; see Theorem 1.14 of Villani [62]. Note that the $L^{p}$-Wasserstein distance (with respect to the Euclidean distance on ${\mathbb{R}}^{D}$) is defined as

W_{p}(P_{1},P_{2})=\left(\inf_{\pi}\int|{\bf x}-{\bf y}|_{2}^{p}\,d\pi({\bf x},{\bf y})\right)^{1/p},

where the infimum is taken over all couplings $\pi$ of $P_{1}$ and $P_{2}$.
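
As a concrete illustration of the $L^{1}$-Wasserstein distance (not part of the paper's results), the following Python sketch computes $W_{1}$ between two empirical measures with the same number of atoms; in that case the optimal coupling can be taken to be a permutation, so the infimum over couplings reduces to a linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w1_empirical(X, Y):
    # L^1-Wasserstein distance between two empirical measures with the same
    # number of atoms; the optimal coupling is a permutation, so the infimum
    # over couplings reduces to an assignment problem.
    C = cdist(X, Y, metric="euclidean")   # cost matrix |x_i - y_j|_2
    row, col = linear_sum_assignment(C)   # optimal one-to-one matching
    return C[row, col].mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
Y = rng.standard_normal((500, 3)) + 0.5  # a shifted copy of the same law
print(w1_empirical(X, Y))                # roughly |(0.5, 0.5, 0.5)|_2 = 0.87
```

For large or unequal sample sizes, dedicated optimal transport solvers are preferable, but the exact assignment above suffices to illustrate the definition.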

Let $\mathcal{G}$ be a class of functions from $\mathcal{Z}\subset{\mathbb{R}}^{d}$ to ${\mathbb{R}}^{D}$, and let $\mathcal{F}$ be a class of functions from ${\mathbb{R}}^{D}$ to ${\mathbb{R}}$. The two classes $\mathcal{G}$ and $\mathcal{F}$ are referred to as the generator and discriminator classes, respectively. Once $\mathcal{G}$ and $\mathcal{F}$ are given, a GAN type estimator is defined through a minimizer of $d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})$ over $\mathcal{G}$, where ${\mathbb{P}}_{n}$ is the empirical measure based on the $D$-dimensional observations ${\bf X}_{1},\ldots,{\bf X}_{n}$. More specifically, let $\hat{\bf g}\in\mathcal{G}$ be an estimator satisfying

d_{\mathcal{F}}(Q_{\hat{\bf g}},{\mathbb{P}}_{n})\leq\inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+\epsilon_{\rm opt}, (2.1)

and let $\hat{Q}=Q_{\hat{\bf g}}$. Here, $\epsilon_{\rm opt}\geq 0$ represents the optimization error. Although the vanilla GAN [22] is not of this form, the formulation (2.1) is general enough to include various GANs popularly used in practice [3, 37, 41]. At the population level, one may view (2.1) as a method for estimating the minimizer of the $\mathcal{F}$-IPM from the data-generating distribution.
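
In practice, (2.1) is approximated by alternating stochastic gradient steps on the generator and the discriminator. The PyTorch sketch below conveys the idea for an IPM-type (Wasserstein-style) GAN; the network sizes, optimizer, batch size and number of steps are arbitrary illustrative choices, and practical algorithms additionally constrain the Lipschitz constant of the discriminator, e.g. by weight clipping or a gradient penalty [3, 23].

```python
import torch
import torch.nn as nn

d, D = 2, 5  # latent and ambient dimensions (illustrative)
gen = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, D))
disc = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_f = torch.optim.Adam(disc.parameters(), lr=1e-3)

X = torch.randn(1000, D)  # placeholder for the observations X_1, ..., X_n

for step in range(200):
    x = X[torch.randint(0, X.shape[0], (256,))]  # minibatch from P_n
    z = torch.rand(256, d)                       # Z ~ P_Z = Uniform[0,1]^d
    # Inner step: move f toward the maximizer of Q_g f - P_n f over F.
    loss_f = -(disc(x).mean() - disc(gen(z).detach()).mean())
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    # Outer step: move g toward the minimizer of the estimated F-IPM.
    z = torch.rand(256, d)
    loss_g = disc(x).mean() - disc(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```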

In practice, both $\mathcal{G}$ and $\mathcal{F}$ are modelled as DNNs. To be specific, let $\rho(x)=x\vee 0$ be the ReLU activation function [21]. We focus on the ReLU in this paper, but other activation functions can also be used once a suitable approximation property holds [46]. For vectors ${\bf v}=(v_{1},\ldots,v_{r})$ and ${\bf z}=(z_{1},\ldots,z_{r})$, define $\rho_{\bf v}({\bf z})=(\rho(z_{1}-v_{1}),\ldots,\rho(z_{r}-v_{r}))$. For a nonnegative integer $L$ and ${\bf p}=(p_{0},\ldots,p_{L+1})\in{\mathbb{N}}^{L+2}$, a neural network function with network architecture $(L,{\bf p})$ is any function ${\bf f}:{\mathbb{R}}^{p_{0}}\to{\mathbb{R}}^{p_{L+1}}$ of the form

π³β†¦πŸβ€‹(𝐳)=WL​ρ𝐯L​WLβˆ’1​ρ𝐯Lβˆ’1​⋯​W1​ρ𝐯1​W0​𝐳,{\bf z}\mapsto{\bf f}({\bf z})=W_{L}\rho_{{\bf v}_{L}}W_{L-1}\rho_{{\bf v}_{L-1}}\cdots W_{1}\rho_{{\bf v}_{1}}W_{0}{\bf z}, (2.2)

where $W_{i}\in{\mathbb{R}}^{p_{i+1}\times p_{i}}$ and ${\bf v}_{i}\in{\mathbb{R}}^{p_{i}}$. Let $\mathcal{D}(L,{\bf p},s,F)$ be the collection of functions ${\bf f}$ of the form (2.2) satisfying

\max_{j=0,\ldots,L}|W_{j}|_{\infty}\vee|{\bf v}_{j}|_{\infty}\leq 1,\quad\sum_{j=1}^{L}\big(|W_{j}|_{0}+|{\bf v}_{j}|_{0}\big)\leq s\quad\text{and}\quad\|{\bf f}\|_{\infty}\leq F,

where $|W_{j}|_{\infty}$ and $|W_{j}|_{0}$ denote the maximum-entry norm and the number of nonzero elements of the matrix $W_{j}$, respectively, and $\|{\bf f}\|_{\infty}=\||{\bf f}({\bf z})|_{\infty}\|_{\infty}=\sup_{\bf z}|{\bf f}({\bf z})|_{\infty}$.
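
The following numpy sketch makes the class $\mathcal{D}(L,{\bf p},s,F)$ concrete: it evaluates a network of the form (2.2) and draws random parameters whose entries are clipped to $[-1,1]$ and sparsified, reporting the number of nonzero parameters entering the constraint $s$. The architecture and sparsity level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda x: np.maximum(x, 0.0)

def forward(Ws, vs, z, F=1.0):
    # Evaluate f(z) = W_L rho_{v_L} ... W_1 rho_{v_1} W_0 z as in (2.2);
    # the output is clipped to enforce the sup-norm bound ||f||_inf <= F.
    h = Ws[0] @ z
    for W, v in zip(Ws[1:], vs):  # vs = (v_1, ..., v_L)
        h = W @ relu(h - v)
    return np.clip(h, -F, F)

def random_network(p, keep=0.3):
    # Random parameters for architecture (L, p): entries clipped to [-1, 1],
    # with all but a fraction `keep` of the weight entries set to zero.
    L = len(p) - 2
    Ws = [np.clip(rng.standard_normal((p[j + 1], p[j])), -1, 1)
          * (rng.random((p[j + 1], p[j])) < keep) for j in range(L + 1)]
    vs = [np.clip(rng.standard_normal(p[j]), -1, 1) for j in range(1, L + 1)]
    # Sparsity as in the definition: sum over j = 1, ..., L of |W_j|_0 + |v_j|_0.
    s = sum(np.count_nonzero(W) for W in Ws[1:]) + sum(np.count_nonzero(v) for v in vs)
    return Ws, vs, s

p = (2, 16, 16, 5)  # (p_0, ..., p_{L+1}) with L = 2
Ws, vs, s = random_network(p)
print(forward(Ws, vs, np.array([0.3, 0.7])), "nonzero parameters:", s)
```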

When the generator class $\mathcal{G}$ consists of neural network functions, we call the corresponding class $\mathcal{Q}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}\}$ of distributions a deep generative model. In this sense, GAN can be viewed as a method for estimating the parameters in deep generative models. Likelihood approaches such as the variational autoencoder are other methods for inferring the model parameters. When a likelihood approach is taken into account, $\mathcal{P}=\{P_{{\bf g},\sigma}:{\bf g}\in\mathcal{G},\sigma\in[\sigma_{\min},\sigma_{\max}]\}$ is often called a deep generative model as well. Note that $P_{{\bf g},\sigma}$ always possesses a density regardless of whether $Q_{\bf g}$ is singular or not.

3 Convergence rate of GAN

Although the strict minimization of the map ${\bf g}\mapsto d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})$ is computationally intractable, several heuristic approaches are available to approximate the solution to (2.1). In this section, we investigate the convergence rate of $\hat{Q}=Q_{\hat{\bf g}}$ under the assumption that its computation is possible. To this end, we suppose that the data-generating distribution $P_{0}$ is of the form $P_{0}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ for some $Q_{0}$ and $\sigma_{0}\geq 0$; $Q_{0}$ will further be assumed to possess a low-dimensional structure. The goal is to find a sharp upper bound for $\mathbb{E}d_{\rm eval}(\hat{Q},Q_{0})$, where $d_{\rm eval}$ is the evaluation metric. In particular, we hope the rate to adapt to the structure of $Q_{0}$ and to be independent of $D$ and $d$. We consider an arbitrary evaluation metric $d_{\rm eval}$ for generality; the $L^{1}$-Wasserstein distance is of primary interest.

In the literature, the evaluation metric is often identified with $d_{\mathcal{F}}$. In this sense, when $d_{\rm eval}=W_{1}$, $\mathcal{F}_{\rm Lip}$ might be a natural candidate for the discriminator class. Indeed, the original motivation of the Wasserstein GAN is to find $\hat{Q}^{W}$, a minimizer of the map ${\bf g}\mapsto d_{\mathcal{F}_{\rm Lip}}(Q_{\bf g},{\mathbb{P}}_{n})$. Due to computational intractability, $\mathcal{F}_{\rm Lip}$ is replaced by a class $\mathcal{F}$ of neural network functions in practice. Although minimizing ${\bf g}\mapsto d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})$ is still challenging, several numerical algorithms can be used to approximate the solution. In the initial papers concerning Wasserstein GAN [3, 2], this replacement was regarded only as a technique for approximating $\hat{Q}^{W}$.

Theoretically, it is unclear whether $\hat{Q}^{W}$ is a decent estimator. If the generator class $\mathcal{G}$ is large enough, for example, $\hat{Q}^{W}$ would be arbitrarily close to the empirical measure. Consequently, the convergence rates of $\hat{Q}^{W}$ and ${\mathbb{P}}_{n}$ would be the same. Note that the convergence rate of the empirical measure with respect to the Wasserstein distance is well known. Specifically, it holds that [16]

\mathbb{E}W_{1}({\mathbb{P}}_{n},P_{0})\lesssim\begin{cases}n^{-1/2}&\text{if }D=1,\\ n^{-1/2}\log n&\text{if }D=2,\\ n^{-1/D}&\text{if }D>2.\end{cases} (3.1)

The rate becomes slower as $D$ increases, suffering from the curse of dimensionality. Although ${\mathbb{P}}_{n}$ adapts to a certain intrinsic dimension and achieves the minimax rate in some sense [65, 55], this does not guarantee that ${\mathbb{P}}_{n}$ is a decent estimator, particularly when the underlying distribution possesses some smooth structure. The convergence rate of the GAN type estimator obtained by Schreuder et al. [54] is nothing but the rate (3.1).

If the size of $\mathcal{G}$ is not too large, then $\hat{Q}^{W}$ may achieve a faster rate than ${\mathbb{P}}_{n}$ due to the regularization effect. However, studying the behavior of $\hat{Q}^{W}$, possibly depending on the complexity of $\mathcal{G}$, is quite tricky. Furthermore, Stanczuk et al. [57] empirically showed that $\hat{Q}^{W}$ performs poorly in a simple simulation. In particular, their experiments show that estimators constructed from practical algorithms can be fundamentally different from $\hat{Q}^{W}$.

From another viewpoint, it would not be desirable to study the convergence rate of $\hat{Q}^{W}$ because it does not take crucial features of state-of-the-art architectures into account. As mentioned in the introduction, the structures of the generator and discriminator architectures are quite similar in most successful GAN approaches. In particular, the complexities of the two architectures are closely related. On the other hand, for $\hat{Q}^{W}$, the corresponding discriminator class $\mathcal{F}_{\rm Lip}$ has no connection with the generator class. In this sense, $\hat{Q}^{W}$ cannot be viewed as a fundamental estimator possessing the essential properties of widely used GAN type estimators.

Nonetheless, $d_{\mathcal{F}}$ must be close to $d_{\rm eval}$ in some sense to guarantee a reasonable convergence rate, because $\mathcal{F}$ is the only way to take $d_{\rm eval}$ into account within the GAN approach (2.1). This is specified as condition (iv) of Theorem 3.1; $d_{\mathcal{F}}$ needs to be close to $d_{\rm eval}$ only on a relatively small class of distributions.

Theorem 3.1.

Suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ for some distribution $Q_{0}$ and $\sigma_{0}\geq 0$. For a given generator class $\mathcal{G}$ and discriminator class $\mathcal{F}$, suppose that an estimator $\hat{Q}=Q_{\hat{\bf g}}$ with $\hat{\bf g}\in\mathcal{G}$ satisfies

{\rm(i)} \inf_{{\bf g}\in\mathcal{G}}d_{\rm eval}(Q_{\bf g},Q_{0})\leq\epsilon_{1}
{\rm(ii)} d_{\mathcal{F}}(\hat{Q},{\mathbb{P}}_{n})\leq\inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+\epsilon_{2}
{\rm(iii)} \mathbb{E}d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})\leq\epsilon_{3}
{\rm(iv)} |d_{\rm eval}(Q_{1},Q_{2})-d_{\mathcal{F}}(Q_{1},Q_{2})|\leq\epsilon_{4}\quad\forall Q_{1},Q_{2}\in\mathcal{Q}\cup\{Q_{0}\}, (3.2)

where $\mathcal{Q}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}\}$ and $\epsilon_{j}\geq 0$. Then,

\mathbb{E}d_{\rm eval}(\hat{Q},Q_{0})\leq 2d_{\mathcal{F}}(P_{0},Q_{0})+5\epsilon_{1}+\epsilon_{2}+2\epsilon_{3}+2\epsilon_{4}.

The two quantities $\epsilon_{1}$ and $\epsilon_{3}$ are closely related to the complexities of $\mathcal{G}$ and $\mathcal{F}$, respectively. In particular, $\epsilon_{1}$ represents the error of approximating $Q_{0}$ by distributions of the form $Q_{\bf g}$ with ${\bf g}\in\mathcal{G}$ [69, 58, 46]; the larger the generator class $\mathcal{G}$ is, the smaller the approximation error is. Similarly, $\epsilon_{3}$ increases as the complexity of $\mathcal{F}$ increases. Techniques for bounding $\mathbb{E}d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})$ are well known in empirical process theory [61, 20]. The second error term $\epsilon_{2}$ is nothing but the optimization error. The fourth term $\epsilon_{4}$ is the deviation between the evaluation metric $d_{\rm eval}$ and the $\mathcal{F}$-IPM over $\mathcal{Q}\cup\{Q_{0}\}$, connecting $d_{\mathcal{F}}$ and $d_{\rm eval}$. Finally, the term $d_{\mathcal{F}}(P_{0},Q_{0})$ in the bound depends primarily on $\sigma_{0}$. One can easily prove that

d_{\mathcal{F}}(P_{0},Q_{0})\leq W_{1}(P_{0},Q_{0})\leq W_{2}(P_{0},Q_{0})\lesssim\sigma_{0}, (3.3)

provided that $\mathcal{F}\subset\mathcal{F}_{\rm Lip}$.

Ignoring the optimization error, suppose for a moment that $\mathcal{G}$ is given and we need to choose a suitable discriminator class to minimize $\epsilon_{3}+\epsilon_{4}$ in Theorem 3.1. We focus on the case $d_{\rm eval}=W_{1}$. One can easily make $\epsilon_{4}=0$ by taking $\mathcal{F}=\mathcal{F}_{\rm Lip}$. In this case, however, $\epsilon_{3}$ would be too large because $\mathbb{E}W_{1}({\mathbb{P}}_{n},P_{0})\asymp n^{-1/D}$. That is, $\mathcal{F}_{\rm Lip}$ is too large to be used as a discriminator class; $\mathcal{F}$ should be a much smaller class than $\mathcal{F}_{\rm Lip}$ to obtain a fast convergence rate. To achieve this goal, we construct $\mathcal{F}$ so that $\epsilon_{3}$ is small enough while $\epsilon_{4}$ remains small. For example, we may consider

\mathcal{F}=\big\{f_{Q_{1},Q_{2}}:Q_{1},Q_{2}\in\mathcal{Q}\cup\{Q_{0}\}\big\}, (3.4)

where $f_{Q_{1},Q_{2}}$ is an (approximate) maximizer of $|Q_{1}f-Q_{2}f|$ over $f\in\mathcal{F}_{\rm Lip}$. In this case, $\epsilon_{4}$ vanishes; hence the convergence rate of $\hat{Q}$ is determined solely by $\epsilon_{1}$, $\epsilon_{3}$ and $\sigma_{0}$. Furthermore, the complexity of $\mathcal{F}$ is roughly the same as that of $\mathcal{G}\times\mathcal{G}$. If the complexity of a function class is expressed through its metric entropy, the complexities of $\mathcal{G}$ and $\mathcal{F}$ are of the same order. The three quantities $\epsilon_{1}$, $\epsilon_{3}$ and $\sigma_{0}$ can roughly be interpreted as the approximation error, the estimation error and the noise level. While we cannot control $\sigma_{0}$, both the approximation and estimation errors depend on the complexity of $\mathcal{G}$, hence a suitable choice of $\mathcal{G}$ is important to achieve a fast convergence rate.
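
To convey the idea behind (3.4) — the theoretical construction itself is not a neural network — one could approximate $f_{Q_{1},Q_{2}}$ from samples of $Q_{1}$ and $Q_{2}$ by maximizing the mean difference over an approximately 1-Lipschitz network. The sketch below uses weight clipping, as in the original Wasserstein GAN [3], as a crude way of bounding the Lipschitz constant; all sizes and training choices are hypothetical.

```python
import torch
import torch.nn as nn

def approx_max_lipschitz(x1, x2, steps=300, clip=0.05):
    # Crude surrogate for f_{Q1,Q2}: maximize Q1 f - Q2 f over a small network,
    # keeping the Lipschitz constant bounded by clipping the weights.
    f = nn.Sequential(nn.Linear(x1.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(f.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = -(f(x1).mean() - f(x2).mean())  # maximize the mean difference
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            for w in f.parameters():
                w.clamp_(-clip, clip)
    return f
```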

To give a specific convergence rate, we consider the class of structured distributions considered by Chae et al. [10], for which deep generative models have benefits. For positive numbers $\beta$ and $K$, let $\mathcal{H}^{\beta}_{K}(A)$ be the class of all functions from $A$ to ${\mathbb{R}}$ with $\beta$-Hölder norm bounded by $K$ [61, 20]. We consider the composite structure with low-dimensional smooth component functions described in Section 3 of Schmidt-Hieber [52]. Specifically, we consider a function ${\bf g}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{D}$ of the form

{\bf g}={\bf h}_{q}\circ{\bf h}_{q-1}\circ\cdots\circ{\bf h}_{1}\circ{\bf h}_{0} (3.5)

with ${\bf h}_{i}:(a_{i},b_{i})^{d_{i}}\to(a_{i+1},b_{i+1})^{d_{i+1}}$. Here, $d_{0}=d$ and $d_{q+1}=D$. Denote by ${\bf h}_{i}=(h_{i1},\ldots,h_{id_{i+1}})$ the components of ${\bf h}_{i}$, and let $t_{i}$ be the maximal number of variables on which each of the $h_{ij}$ depends. Let $\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$ be the collection of functions of the form (3.5) satisfying $h_{ij}\in\mathcal{H}^{\beta_{i}}_{K}\big((a_{i},b_{i})^{t_{i}}\big)$ and $|a_{i}|\vee|b_{i}|\leq K$, where ${\bf d}=(d_{0},\ldots,d_{q+1})$, ${\bf t}=(t_{0},\ldots,t_{q})$ and $\bm{\beta}=(\beta_{0},\ldots,\beta_{q})$. Let

\widetilde{\beta}_{j}=\beta_{j}\prod_{l=j+1}^{q}(\beta_{l}\wedge 1),\quad j_{*}=\operatorname*{argmax}_{j\in\{0,\ldots,q\}}\frac{t_{j}}{\widetilde{\beta}_{j}},\quad\beta_{*}=\widetilde{\beta}_{j_{*}}\quad\text{and}\quad t_{*}=t_{j_{*}}.

We call $t_{*}$ and $\beta_{*}$ the intrinsic dimension and smoothness of ${\bf g}$ (or of the class $\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$), respectively. The class $\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$ has been extensively studied in recent articles on deep supervised learning to demonstrate the benefit of DNN in nonparametric function estimation [52, 6].
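
As a small worked example, the following Python function computes $(t_{*},\beta_{*})$ from $\bm{\beta}$ and ${\bf t}$ using the definitions above; the numbers in the example are arbitrary.

```python
import numpy as np

def intrinsic_dim_and_smoothness(beta, t):
    # beta = (beta_0, ..., beta_q), t = (t_0, ..., t_q) as in the definitions above.
    beta, t = np.asarray(beta, float), np.asarray(t, float)
    q = len(beta) - 1
    beta_tilde = np.array([beta[j] * np.prod(np.minimum(beta[j + 1:], 1.0))
                           for j in range(q + 1)])
    j_star = int(np.argmax(t / beta_tilde))
    return t[j_star], beta_tilde[j_star]

# Example with q = 1: h_0 is 3-smooth in t_0 = 2 variables and h_1 is 2-smooth in
# t_1 = 4 variables; then beta_tilde = (3, 2), j_* = 1, so (t_*, beta_*) = (4, 2)
# and the rate in Theorem 3.2 becomes n^{-2/8} = n^{-1/4} up to log factors.
print(intrinsic_dim_and_smoothness(beta=[3.0, 2.0], t=[2, 4]))
```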

Let

\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)=\big\{Q_{\bf g}:{\bf g}\in\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)\big\}.

The quantities $(q,{\bf d},{\bf t},\bm{\beta},K)$ are constants independent of $n$. In the forthcoming Theorem 3.2, we obtain a Wasserstein convergence rate for a GAN type estimator $\hat{Q}$ under the assumption that $Q_{0}\in\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$.

Theorem 3.2.

Suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=Q_{0}*\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$, where $\sigma_{0}\leq 1$ and $Q_{0}=Q_{{\bf g}_{0}}$ for some ${\bf g}_{0}\in\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$. Then, there exist a generator class $\mathcal{G}=\mathcal{D}(L,{\bf p},s,K\vee 1)$ and a discriminator class $\mathcal{F}\subset\mathcal{F}_{\rm Lip}$ such that for an estimator $\hat{Q}$ satisfying (2.1),

\sup_{Q_{0}\in\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)}\mathbb{E}W_{1}(\hat{Q},Q_{0})\leq C\bigg\{n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}+\sigma_{0}+\epsilon_{\rm opt}\bigg\}, (3.6)

where $C=C(q,{\bf d},{\bf t},\bm{\beta},K)$.

In Theorem 3.2, the network parameters $(L,{\bf p},s)$ of $\mathcal{G}$ depend on the sample size $n$. More specifically, it can be deduced from the proof that one can choose $L\lesssim\log n$, $|{\bf p}|_{\infty}\lesssim n^{t_{*}/(2\beta_{*}+t_{*})}\log n$ and $s\lesssim n^{t_{*}/(2\beta_{*}+t_{*})}\log n$. As illustrated above, the discriminator class $\mathcal{F}$ is carefully constructed using $\mathcal{G}$.

Ignoring the optimization error $\epsilon_{\rm opt}$, the rate (3.6) consists of two terms, $\sigma_{0}$ and $n^{-\beta_{*}/(2\beta_{*}+t_{*})}$ up to a logarithmic factor. If $\sigma_{0}\lesssim n^{-\beta_{*}/(2\beta_{*}+t_{*})}$, it can be absorbed into the polynomial term; hence when $\sigma_{0}$ is small enough, $\hat{Q}$ achieves the rate $n^{-\beta_{*}/(2\beta_{*}+t_{*})}$. Note that this rate appears in many nonparametric smooth function estimation problems.

The dependence on $\sigma_{0}$ comes from the term $d_{\mathcal{F}}(P_{0},Q_{0})$ in Theorem 3.1 and the inequality (3.3). Note that (3.3) holds because $\mathcal{F}$ is a subset of $\mathcal{F}_{\rm Lip}$. We do not know whether the term $\sigma_{0}$ in (3.6) can be improved in general. If we consider another evaluation metric, however, it is possible to improve this term. For example, if $\mathcal{F}$ consists of twice continuously differentiable functions, it would be possible to prove $d_{\mathcal{F}}(P_{0},Q_{0})\lesssim\sigma_{0}^{2}$. This is because, for a twice continuously differentiable $f$,

|P_{0}f-Q_{0}f|=\big|\mathbb{E}[f({\bf Y}+\bm{\epsilon})-f({\bf Y})]\big|\approx\Big|\mathbb{E}\Big[\bm{\epsilon}^{T}\nabla f({\bf Y})+\frac{1}{2}\bm{\epsilon}^{T}\nabla^{2}f({\bf Y})\bm{\epsilon}\Big]\Big|\asymp\sigma_{0}^{2}, (3.7)

where ${\bf Y}\sim Q_{0}$, $\bm{\epsilon}\sim\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$ and ${\bf Y}\perp\!\!\!\perp\bm{\epsilon}$.
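
The second-order heuristic (3.7) is easy to check numerically. The snippet below uses a toy choice of $Q_{0}$ (uniform on a cube) and a smooth test function $f$; the ratio of the gap to $\sigma_{0}^{2}$ stays roughly constant as $\sigma_{0}$ decreases, consistent with the $\sigma_{0}^{2}$ scaling. This is only a sanity check under assumed toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sum(np.cos(x), axis=-1)   # a smooth (C^2) test function

n, D = 200_000, 3
Y = rng.uniform(-1.0, 1.0, size=(n, D))    # Y ~ Q_0 (toy choice)
for sigma0 in [0.2, 0.1, 0.05]:
    eps = sigma0 * rng.standard_normal((n, D))
    gap = abs(np.mean(f(Y + eps) - f(Y)))  # Monte Carlo estimate of |P_0 f - Q_0 f|
    # For this f, E[f(Y+eps) - f(Y)] is about -(sigma0^2 / 2) * D * E[cos(Y_1)],
    # so the printed ratio should be close to D * sin(1) / 2 for every sigma0.
    print(sigma0, gap / sigma0**2)
```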

Note that the sieve MLE considered by Chae et al. [10] achieves the rate $n^{-\beta_{*}/2(\beta_{*}+t_{*})}+\sigma_{0}$ under a slightly stronger assumption than that of Theorem 3.2. Hence, for a moderately small $\sigma_{0}$, the convergence rate of the GAN type estimator is strictly better than that of the sieve MLE. This result provides some insight into why GAN approaches perform better than likelihood approaches in many real applications. In particular, if GAN performs significantly better than likelihood approaches, it might be reasonable to infer that the noise level of the data is not too large. On the other hand, if the noise level is larger than a certain threshold, $P_{0}$ is no longer nearly singular. In this case, likelihood approaches would be preferable to the computationally much more demanding GAN.

In practice, the noise level $\sigma_{0}$ is unknown, so one may first try likelihood approaches with different levels of perturbation. As empirically demonstrated in Chae et al. [10], data perturbation significantly improves the quality of generated samples provided that $\sigma_{0}$ is small enough. Therefore, if data perturbation improves the performance of likelihood approaches, one may next try GAN to obtain a better estimator.

So far, we have focused on the case $d_{\rm eval}=W_{1}$. Note that the discriminator class considered in the proof of Theorem 3.2 is of the form (3.4), which is far from practical. In particular, it is unclear whether it is possible to achieve the rate (3.6) with neural network discriminators. If we consider a different evaluation metric, however, one can easily obtain a convergence rate using a neural network discriminator. When $\mathcal{F}_{0}$ consists of neural networks, the $\mathcal{F}_{0}$-IPM is often called a neural network distance. It is well known that, under mild assumptions, convergence of probability measures in a neural network distance guarantees weak convergence [71]. Therefore, neural network distances are also good candidates for evaluation metrics. In Theorem 3.3, more general integral probability metrics are taken into account.

Theorem 3.3.

Suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=P_{{\bf g}_{0},\sigma_{0}}$ for some ${\bf g}_{0}\in\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$ and $\sigma_{0}\leq 1$. Let $\mathcal{F}_{0}$ be a class of Lipschitz continuous functions from ${\mathbb{R}}^{D}$ to ${\mathbb{R}}$ with Lipschitz constants bounded by a constant $C_{1}>0$. Then, there exist a generator class $\mathcal{G}=\mathcal{D}(L,{\bf p},s,K\vee 1)$ and a discriminator class $\mathcal{F}$ such that $\hat{Q}$ defined as in (2.1) satisfies

\sup_{Q_{0}\in\mathcal{Q}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)}\mathbb{E}d_{\mathcal{F}_{0}}(\hat{Q},Q_{0})\leq C_{2}\bigg[\sigma_{0}+\epsilon_{\rm opt}+\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\wedge\Big\{n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}\Big\}\bigg], (3.8)

where $C_{2}=C_{2}(q,{\bf d},{\bf t},\bm{\beta},K,C_{1})$. In particular, one can identify the discriminator class $\mathcal{F}$ with $\mathcal{F}_{0}$ provided that

\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}. (3.9)

The proof of Theorem 3.3 is divided into two cases. First, if the complexity of $\mathcal{F}_{0}$ is small enough in the sense of (3.9), one can ignore the approximation error ($\epsilon_{1}$ in Theorem 3.1) by taking an arbitrarily large $\mathcal{G}$. Also, $\mathcal{F}=\mathcal{F}_{0}$ leads to $\epsilon_{4}=0$. Hence, the rate is determined by $\sigma_{0}$, $\epsilon_{\rm opt}$ and $\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})$. On the other hand, if (3.9) does not hold, we construct the discriminator as in (3.4). This leads to the same convergence rate as in Theorem 3.2.

If $\mathcal{F}_{0}$ is a subset of $\mathcal{D}(L_{0},{\bf p}_{0},s_{0},\infty)$, then it is not difficult to see that $\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim\sqrt{s_{0}/n}$ up to a logarithmic factor. This can be proved using well-known empirical process theory and the metric entropy of deep neural networks; see Lemma 5 of Schmidt-Hieber [52]. Note that functions in $\mathcal{F}_{0}$ are required to be Lipschitz continuous, and there are several regularization techniques for bounding the Lipschitz constants of DNN [2, 51].

Another important class of metrics is the Hölder IPM. When $\mathcal{F}_{0}=\mathcal{H}^{\alpha}_{1}([-K,K]^{D})$ for some $\alpha>0$, Schreuder [53] has shown that

\mathbb{E}d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim\begin{cases}n^{-\alpha/D}&\text{if }\alpha<D/2,\\ n^{-1/2}\log n&\text{if }\alpha=D/2,\\ n^{-1/2}&\text{if }\alpha>D/2.\end{cases}

Furthermore, for $\alpha>2$, the term $\sigma_{0}$ in (3.8) can be replaced by $\sigma_{0}^{2}$ using (3.7).

4 Lower bound of the minimax risk

In this section, we study a lower bound for the minimax convergence rate, focusing on the case $d_{\rm eval}=W_{1}$. As in the previous section, suppose that ${\bf X}_{1},\ldots,{\bf X}_{n}$ are i.i.d. random vectors following $P_{0}=P_{{\bf g}_{0},\sigma_{0}}$. We investigate the minimax rate over the class $\mathcal{Q}_{0}=\{Q_{\bf g}:{\bf g}\in\mathcal{G}_{0}\}$, where $\mathcal{G}_{0}=\mathcal{G}_{0}(q,{\bf d},{\bf t},\bm{\beta},K)$. For simplicity, we consider the case $q=0$, $d_{0}=t_{0}=d$ and $\beta_{0}=\beta$. In this case, we have $\mathcal{G}_{0}=\mathcal{H}_{K}^{\beta}([0,1]^{d})\times\cdots\times\mathcal{H}_{K}^{\beta}([0,1]^{d})$. An extension to general $q$ would be slightly more complicated but not difficult. We also assume that $P_{Z}$ is the uniform distribution on $[0,1]^{d}$.

For given $\mathcal{G}_{0}$ and $\sigma_{0}\geq 0$, the minimax risk is defined as

\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})=\inf_{\hat{Q}}\sup_{{\bf g}_{0}\in\mathcal{G}_{0}}\mathbb{E}W_{1}(\hat{Q},Q_{0}),

where the infimum ranges over all possible estimators. Several techniques are available for obtaining a lower bound on $\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})$ [59, 64]. We use Fano's method to prove the following theorem.
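
For completeness, the standard form of the Fano reduction we have in mind (cf. Theorem 2.5 of Tsybakov [59]) is recorded below, stated for the $W_{1}$ loss; the lower bound (4.1) then follows by constructing sufficiently many well-separated generators in $\mathcal{G}_{0}$. This is the textbook statement, not an excerpt from the proof in the Appendix.

```latex
% Standard Fano-type reduction (cf. Theorem 2.5 of Tsybakov [59]), W_1 loss.
Suppose $Q^{(0)},\ldots,Q^{(M)}\in\mathcal{Q}_{0}$ ($M\geq 2$) satisfy
$W_{1}(Q^{(j)},Q^{(k)})\geq 2s$ for all $j\neq k$, and let $P^{(j)}$ denote the
joint law of $({\bf X}_{1},\ldots,{\bf X}_{n})$ when $Q_{0}=Q^{(j)}$. If
\[
  \frac{1}{M}\sum_{j=1}^{M}\mathrm{KL}\big(P^{(j)},P^{(0)}\big)\leq\alpha\log M
  \qquad\text{for some } 0<\alpha<1/8,
\]
then
\[
  \inf_{\hat{Q}}\max_{0\leq j\leq M}
  P^{(j)}\big(W_{1}(\hat{Q},Q^{(j)})\geq s\big)
  \geq\frac{\sqrt{M}}{1+\sqrt{M}}
  \Big(1-2\alpha-\sqrt{\tfrac{2\alpha}{\log M}}\Big)>0,
\]
so that, by Markov's inequality, $\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\gtrsim s$.
```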

Theorem 4.1.

Suppose that $d\leq D$, $\beta>0$ and $\sigma_{0}\geq 0$. Let $P_{Z}$ be the uniform distribution on $[0,1]^{d}$ and $\mathcal{G}_{0}=\mathcal{H}_{K}^{\beta}([0,1]^{d})\times\cdots\times\mathcal{H}_{K}^{\beta}([0,1]^{d})$. If $K$ is large enough (depending on $\beta$ and $d$), the minimax risk satisfies

\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\geq Cn^{-\frac{\beta}{2\beta+d-2}} (4.1)

for some constant $C>0$.

Note that the lower bound (4.1) does not depend on $\sigma_{0}$. With a direct application of Le Cam's method, one can easily show that $\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\gtrsim\sigma_{0}/\sqrt{n}$. As discussed in the previous section, it would not be easy to obtain a sharp rate with respect to $\sigma_{0}$. Since we are more interested in the case of small $\sigma_{0}$ (i.e. the nearly singular case), our discussion focuses on the regime where $\sigma_{0}$ is negligible.

Note that the lower bound (4.1) is only slightly smaller than the rate $n^{-\beta/(2\beta+d)}$, the first term on the right hand side of (3.6). Hence, the convergence rate of the GAN type estimator is at least very close to the minimax optimal rate.

Regarding the gap between the upper and lower bounds, we conjecture that the lower bound is sharp and cannot be improved. This conjecture is based on results in Uppal et al. [60] and Liang [38]. They considered GAN for nonparametric density estimation, hence $D=d$ in their framework. For example, Theorem 4 of Liang [38] guarantees that, for $D=d\geq 2$ and $\sigma_{0}=0$,

\inf_{\hat{Q}}\sup_{Q_{0}\in\widetilde{\mathcal{Q}}_{0}}\mathbb{E}W_{1}(\hat{Q},Q_{0})\asymp n^{-\frac{\beta^{\prime}+1}{2\beta^{\prime}+d}}, (4.2)

where $\widetilde{\mathcal{Q}}_{0}=\{Q:q\in\mathcal{H}^{\beta^{\prime}}_{1}([0,1]^{D})\}$. (More precisely, he considered Sobolev classes instead of Hölder classes.) Interestingly, there is a close connection between the density model $\widetilde{\mathcal{Q}}_{0}$ in the literature and the generative model $\mathcal{Q}_{0}$ considered in our paper. This connection is based on the profound regularity theory of optimal transport, often phrased in terms of the Brenier map. Roughly speaking, for a $\beta^{\prime}$-Hölder density $q$, there exists a $(\beta^{\prime}+1)$-Hölder function ${\bf g}$ such that $Q=Q_{\bf g}$; see Theorem 12.50 of Villani [63] for a precise statement. We also refer to Lemma 4.1 of Chae et al. [10] for a concise statement. In this sense, the density model $\widetilde{\mathcal{Q}}_{0}$ matches the generative model $\mathcal{Q}_{0}$ when $\beta=\beta^{\prime}+1$. In this case, the two rates (4.1) and (4.2) coincide, and this is why we conjecture that the lower bound (4.1) cannot be improved. Unfortunately, the proof techniques in Uppal et al. [60] and Liang [38] for both the upper and lower bounds do not generalize to our case because $Q_{0}$ does not possess a Lebesgue density.

5 Conclusion

Under a structural assumption on the generator, we have investigated the convergence rate of a GAN type estimator and a lower bound for the minimax optimal rate. In particular, the rate is faster than that obtained by likelihood approaches, providing some insight into why GAN outperforms likelihood approaches. This result would be an important stepping stone toward a more advanced theory that can take into account fundamental properties of state-of-the-art GANs. We conclude the paper with some possible directions for future work.

Firstly, it would be worthwhile to reduce the gap between the upper and lower bounds of the convergence rate obtained in this paper. As discussed in Section 4, it will be crucial to construct an estimator that achieves the lower bound in Theorem 4.1; in particular, we wonder whether a GAN type estimator can do this. Next, when $d_{\rm eval}=W_{1}$, an important question is whether it is possible to choose $\mathcal{F}$ as a class of neural network functions. Perhaps we cannot obtain the rate in Theorem 3.2 in this way because a large network would be necessary to approximate an arbitrary Lipschitz function. Finally, based on the approximation properties of convolutional neural network (CNN) architectures [32, 70], studying the benefit of CNN-based GAN would be an intriguing problem.

References

  • Aamari and Levrard, [2019] Aamari, E. and Levrard, C. (2019). Nonasymptotic rates for manifold, tangent space and curvature estimation. Ann. Statist., 47(1):177–204.
  • Arjovsky and Bottou, [2017] Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In Proc. International Conference on Learning Representations, pages 1–17.
  • Arjovsky et al., [2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proc. International Conference on Machine Learning, pages 214–223.
  • Arora et al., [2017] Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proc. International Conference on Machine Learning, pages 224–232.
  • Bai et al., [2019] Bai, Y., Ma, T., and Risteski, A. (2019). Approximability of discriminators implies diversity in GANs. In Proc. International Conference on Learning Representations, pages 1–10.
  • Bauer and Kohler, [2019] Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285.
  • Belomestny et al., [2021] Belomestny, D., Moulines, E., Naumov, A., Puchkin, N., and Samsonov, S. (2021). Rates of convergence for density estimation with GANs. ArXiv:2102.00199.
  • Birgé, [1983] Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65(2):181–237.
  • Bishop, [2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New York.
  • Chae et al., [2021] Chae, M., Kim, D., Kim, Y., and Lin, L. (2021). A likelihood approach to nonparametric estimation of a singular distribution using deep generative models. ArXiv:2105.04046.
  • Chae and Walker, [2019] Chae, M. and Walker, S. G. (2019). Bayesian consistency for a nonparametric stationary Markov model. Bernoulli, 25(2):877–901.
  • Chen et al., [2019] Chen, M., Jiang, H., Liao, W., and Zhao, T. (2019). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In Proc. Neural Information Processing Systems, pages 8174–8184.
  • Chen et al., [2020] Chen, M., Liao, W., Zha, H., and Zhao, T. (2020). Statistical guarantees of generative adversarial networks for distribution estimation. ArXiv:2002.03938.
  • Divol, [2020] Divol, V. (2020). Minimax adaptive estimation in manifold inference. ArXiv:2001.04896.
  • Fan, [1991] Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist., 19(3):1257–1272.
  • Fournier and Guillin, [2015] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738.
  • [17] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2012a). Manifold estimation and singular deconvolution under Hausdorff loss. Ann. Statist., 40(2):941–963.
  • [18] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2012b). Minimax manifold estimation. J. Mach. Learn. Res., 13(1):1263–1291.
  • Ghosal and van der Vaart, [2017] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
  • Giné and Nickl, [2016] Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press.
  • Glorot et al., [2011] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proc. International Conference on Artificial Intelligence and Statistics, pages 315–323.
  • Goodfellow et al., [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proc. Neural Information Processing Systems, pages 2672–2680.
  • Gulrajani et al., [2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Proc. Neural Information Processing Systems, pages 5767–5777.
  • Györfi et al., [2006] Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
  • Hastie et al., [2009] Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
  • Horowitz and Mammen, [2007] Horowitz, J. L. and Mammen, E. (2007). Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. Ann. Statist., 35(6):2589–2619.
  • Imaizumi and Fukumizu, [2019] Imaizumi, M. and Fukumizu, K. (2019). Deep neural networks learn non-smooth functions effectively. In Proc. International Conference on Artificial Intelligence and Statistics, pages 869–878.
  • Juditsky et al., [2009] Juditsky, A. B., Lepski, O. V., and Tsybakov, A. B. (2009). Nonparametric estimation of composite functions. Ann. Statist., 37(3):1360–1404.
  • Karras et al., [2018] Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation. In Proc. International Conference on Learning Representations, pages 1–26.
  • Karras et al., [2019] Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proc. Conference on Computer Vision and Pattern Recognition, pages 4401–4410.
  • Kingma and Welling, [2014] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proc. International Conference on Learning Representations, pages 1–14.
  • Kohler et al., [2020] Kohler, M., Krzyzak, A., and Walter, B. (2020). On the rate of convergence of image classifiers based on convolutional neural networks. ArXiv:2003.01526.
  • Kuhn et al., [2019] Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Proc. Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS.
  • Kundu and Dunson, [2014] Kundu, S. and Dunson, D. B. (2014). Latent factor models for density estimation. Biometrika, 101(3):641–654.
  • Le Cam, [1973] Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist., 1(1):38–53.
  • Le Cam, [1986] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer.
  • Li et al., [2017] Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In Proc. Neural Information Processing Systems, pages 2203–2213.
  • Liang, [2021] Liang, T. (2021). How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41.
  • Liu et al., [2017] Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and convergence properties of generative adversarial learning. In Proc. Neural Information Processing Systems, pages 5545–5553.
  • Meister, [2009] Meister, A. (2009). Deconvolution Problems in Nonparametric Statistics. Springer, New York.
  • Mroueh etΒ al., [2017] Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. (2017). Sobolev GAN. ArXiv:1711.04894.
  • MΓΌller, [1997] MΓΌller, A. (1997). Integral probability metrics and their generating classes of functions. Adv. in Appl. Probab., 29(2):429–443.
  • Murphy, [2012] Murphy, K.Β P. (2012). Machine Learning: A Probabilistic Perspective. MIT press.
  • Nakada and Imaizumi, [2020] Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. J. Mach. Learn. Res., 21(174):1–38.
  • Nguyen, [2013] Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist., 41(1):370–400.
  • Ohn and Kim, [2019] Ohn, I. and Kim, Y. (2019). Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627.
  • Pati etΒ al., [2011] Pati, D., Bhattacharya, A., and Dunson, D.Β B. (2011). Posterior convergence rates in non-linear latent variable models. ArXiv:1109.5000.
  • Puchkin and Spokoiny, [2019] Puchkin, N. and Spokoiny, V. (2019). Structure-adaptive manifold estimation. ArXiv:1906.05014.
  • Radford etΒ al., [2016] Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. International Conference on Learning Representations, pages 1–16.
  • Rezende etΒ al., [2014] Rezende, D.Β J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proc. International Conference on Machine Learning, pages 1278–1286.
  • Scaman and Virmaux, [2018] Scaman, K. and Virmaux, A. (2018). Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Proc. Neural Information Processing Systems, volumeΒ 31, pages 1–10.
  • Schmidt-Hieber, [2020] Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. Ann. Statist., 48(4):1875–1897.
  • Schreuder, [2021] Schreuder, N. (2021). Bounding the expectation of the supremum of empirical processes indexed by HΓΆlder classes. Math. Methods Statist., 29:76–86.
  • Schreuder etΒ al., [2021] Schreuder, N., Brunel, V.-E., and Dalalyan, A. (2021). Statistical guarantees for generative models without domination. In Proc. Algorithmic Learning Theory, pages 1051–1071. PMLR.
  • Singh and Póczos, [2018] Singh, S. and Póczos, B. (2018). Minimax distribution estimation in Wasserstein distance. ArXiv:1802.08855.
  • Singh et al., [2018] Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., and Póczos, B. (2018). Nonparametric density estimation with adversarial losses. In Proc. Neural Information Processing Systems, pages 10246–10257.
  • Stanczuk et al., [2021] Stanczuk, J., Etmann, C., Kreusser, L. M., and Schönlieb, C.-B. (2021). Wasserstein GANs work because they fail (to approximate the Wasserstein distance). ArXiv:2103.01678.
  • Telgarsky, [2016] Telgarsky, M. (2016). Benefits of depth in neural networks. In Proc. Conference on Learning Theory, pages 1517–1539.
  • Tsybakov, [2008] Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer, New York.
  • Uppal et al., [2019] Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation and convergence of GANs under Besov IPM losses. In Proc. Neural Information Processing Systems, pages 9089–9100.
  • van der Vaart and Wellner, [1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
  • Villani, [2003] Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.
  • Villani, [2008] Villani, C. (2008). Optimal Transport: Old and New. Springer.
  • Wainwright, [2019] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
  • Weed and Bach, [2019] Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648.
  • Wei and Nguyen, [2022] Wei, Y. and Nguyen, X. (2022). Convergence of de Finetti's mixing measure in latent structure models for observed exchangeable sequences. To appear in Ann. Statist.
  • Wong and Shen, [1995] Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Statist., 23(2):339–362.
  • Yalcin and Amemiya, [2001] Yalcin, I. and Amemiya, Y. (2001). Nonlinear factor analysis as a statistical method. Statist. Sci., 16(3):275–294.
  • Yarotsky, [2017] Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114.
  • Yarotsky, [2021] Yarotsky, D. (2021). Universal approximations of invariant maps by neural networks. Constr. Approx., pages 1–68.
  • Zhang et al., [2018] Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2018). On the discrimination-generalization tradeoff in GANs. In Proc. International Conference on Learning Representations, pages 1–26.

Appendix A Proofs

A.1 Proof of Theorem 3.1

Choose ${\bf g}_{*}\in\mathcal{G}$ such that

\[
d_{\rm eval}(Q_{*},Q_{0})\leq\inf_{{\bf g}\in\mathcal{G}}d_{\rm eval}(Q_{\bf g},Q_{0})+\epsilon_{1}\stackrel{\rm(i)}{\leq}2\epsilon_{1},
\tag{A.1}
\]

where $Q_{*}=Q_{{\bf g}_{*}}$. Then,

\begin{align*}
d_{\rm eval}(\hat{Q},Q_{0})
&\leq d_{\rm eval}(\hat{Q},Q_{*})+d_{\rm eval}(Q_{*},Q_{0})
\stackrel{\rm(A.1)}{\leq} d_{\rm eval}(\hat{Q},Q_{*})+2\epsilon_{1}\\
&\stackrel{\rm(iv)}{\leq} d_{\mathcal{F}}(\hat{Q},Q_{*})+2\epsilon_{1}+\epsilon_{4}
\leq d_{\mathcal{F}}(\hat{Q},{\mathbb{P}}_{n})+d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{*})+2\epsilon_{1}+\epsilon_{4}\\
&\stackrel{\rm(ii)}{\leq} \inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{*})+2\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\leq \inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},{\mathbb{P}}_{n})+d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+2\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\leq \inf_{{\bf g}\in\mathcal{G}}d_{\mathcal{F}}(Q_{\bf g},Q_{0})+d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{0})+d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+2\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\stackrel{\rm(i)}{\leq} d_{\mathcal{F}}({\mathbb{P}}_{n},Q_{0})+d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+3\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\leq 2d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+2d_{\mathcal{F}}(P_{0},Q_{0})+d_{\mathcal{F}}(Q_{0},Q_{*})+3\epsilon_{1}+\epsilon_{2}+\epsilon_{4}\\
&\stackrel{\rm(iv)}{\leq} 2d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+2d_{\mathcal{F}}(P_{0},Q_{0})+d_{\rm eval}(Q_{0},Q_{*})+3\epsilon_{1}+\epsilon_{2}+2\epsilon_{4}\\
&\stackrel{\rm(A.1)}{\leq} 2d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})+2d_{\mathcal{F}}(P_{0},Q_{0})+5\epsilon_{1}+\epsilon_{2}+2\epsilon_{4}.
\end{align*}

By taking the expectation, we complete the proof. ∎
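Throughout the proofs, $d_{\mathcal{F}}(P,Q)=\sup_{f\in\mathcal{F}}|Pf-Qf|$ is the IPM generated by a (here finite) discriminator class $\mathcal{F}$. The following minimal Python sketch computes this quantity on empirical measures and checks the triangle-type manipulations used above; the linear discriminators, sample sizes and distributions are illustrative assumptions only, not objects from the proof.

```python
import numpy as np

def ipm_finite(xs, ys, fs):
    """IPM d_F(P, Q) = max_{f in F} |P f - Q f|, with P and Q replaced by the
    empirical measures of the samples xs, ys and F a finite list of functions R^D -> R."""
    return max(abs(np.mean(f(xs)) - np.mean(f(ys))) for f in fs)

rng = np.random.default_rng(0)
D = 3
# Illustrative finite discriminator class: 1-Lipschitz linear functionals f_u(x) = <u, x>, |u|_2 = 1.
us = rng.normal(size=(5, D))
us /= np.linalg.norm(us, axis=1, keepdims=True)
fs = [lambda x, u=u: x @ u for u in us]

xs = rng.normal(size=(1000, D))          # stand-in for samples from P
ys = rng.normal(size=(1000, D)) + 0.1    # stand-in for samples from Q
zs = rng.normal(size=(1000, D)) + 0.05   # stand-in for an intermediate distribution R

d_xy = ipm_finite(xs, ys, fs)
# The triangle inequality d_F(P, Q) <= d_F(P, R) + d_F(R, Q) used repeatedly above.
assert d_xy <= ipm_finite(xs, zs, fs) + ipm_finite(zs, ys, fs) + 1e-12
print(d_xy)
```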

A.2 Proof of Theorem 3.2

We will construct a generator class $\mathcal{G}$ and a discriminator class $\mathcal{F}$ satisfying condition (3.2) of Theorem 3.1 with $d_{\rm eval}=W_{1}$. By the construction of the estimator $\hat{Q}$, condition (3.2)-(ii) is automatically satisfied with $\epsilon_{2}=\epsilon_{\rm opt}$ for any $\mathcal{G}$ and $\mathcal{F}$.

Let $\delta>0$ be given. Lemma 3.5 of Chae et al. [10] implies that there exists ${\bf g}_{*}\in\mathcal{D}(L,{\bf p},s,K\vee 1)$, with

\[
L\leq c_{1}\log\delta^{-1},\quad |{\bf p}|_{\infty}\leq c_{1}\delta^{-t_{*}/\beta_{*}},\quad s\leq c_{1}\delta^{-t_{*}/\beta_{*}}\log\delta^{-1}
\]

for some constant $c_{1}=c_{1}(q,{\bf d},{\bf t},\bm{\beta},K)$, such that $\|{\bf g}_{*}-{\bf g}_{0}\|_{\infty}<\delta$. Let $Q_{*}=Q_{{\bf g}_{*}}$ and $\mathcal{G}=\mathcal{D}(L,{\bf p},s,K\vee 1)$. Then, by the Kantorovich–Rubinstein duality (see Theorem 1.14 in Villani [62]),

\begin{align*}
W_{1}(Q_{*},Q_{0})
&= \sup_{f\in\mathcal{F}_{\rm Lip}}|Q_{*}f-Q_{0}f|
\leq \sup_{f\in\mathcal{F}_{\rm Lip}}\int\big|f\big({\bf g}_{*}({\bf z})\big)-f\big({\bf g}_{0}({\bf z})\big)\big|\,dP_{Z}({\bf z})\\
&\leq \int|{\bf g}_{*}({\bf z})-{\bf g}_{0}({\bf z})|_{2}\,dP_{Z}({\bf z})
\leq \sqrt{D}\,\|{\bf g}_{*}-{\bf g}_{0}\|_{\infty}\leq\sqrt{D}\,\delta.
\end{align*}

Hence, condition (3.2)-(i) holds with $\epsilon_{1}=\sqrt{D}\,\delta$.
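As a numerical sanity check of the chain $W_{1}(Q_{*},Q_{0})\leq\int|{\bf g}_{*}-{\bf g}_{0}|_{2}\,dP_{Z}\leq\sqrt{D}\,\|{\bf g}_{*}-{\bf g}_{0}\|_{\infty}$ (not part of the proof), one may compare the empirical $W_{1}$ distance between the two pushforward distributions with the sup-norm distance of the generators. The sketch below uses $d=D=1$ and arbitrary smooth generators; both choices are assumptions of this illustration only.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # exact W1 between one-dimensional samples

rng = np.random.default_rng(1)
z = rng.uniform(size=100_000)              # latent samples from P_Z = Uniform(0, 1)

g0 = lambda z: np.sin(2 * np.pi * z)       # "true" generator (illustrative choice)
gstar = lambda z: g0(z) + 0.01 * z         # approximation with ||g* - g0||_inf = 0.01

w1 = wasserstein_distance(gstar(z), g0(z))           # empirical W1 between the pushforwards
print(w1, np.max(np.abs(gstar(z) - g0(z))))          # w1 does not exceed the sup-norm gap
```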

Let $\epsilon>0$ be given. For two Borel probability measures $Q_{1}$ and $Q_{2}$ on ${\mathbb{R}}^{D}$, one can choose $f_{Q_{1},Q_{2}}\in\mathcal{F}_{\rm Lip}$ such that $f_{Q_{1},Q_{2}}({\bf 0}_{D})=0$ and

\[
W_{1}(Q_{1},Q_{2})=\sup_{f\in\mathcal{F}_{\rm Lip}}|Q_{1}f-Q_{2}f|\leq|Q_{1}f_{Q_{1},Q_{2}}-Q_{2}f_{Q_{1},Q_{2}}|+\epsilon.
\]

Then, by the Lipschitz continuity and $f_{Q_{1},Q_{2}}({\bf 0}_{D})=0$,

\[
\sup_{|{\bf x}|_{\infty}\leq K}|f_{Q_{1},Q_{2}}({\bf x})|\leq\sup_{|{\bf x}|_{\infty}\leq K}|{\bf x}|_{2}=\sqrt{D}K.
\]

Let ${\bf g}_{1},\ldots,{\bf g}_{N}$ be an $\epsilon$-cover of $\mathcal{G}\cup\{{\bf g}_{0}\}$ with respect to $\|\cdot\|_{P_{Z},2}$ and

\[
\mathcal{F}=\big\{f_{jk}:1\leq j,k\leq N\big\},
\]

where

\[
\|{\bf g}\|_{P_{Z},p}=\left(\int|{\bf g}({\bf z})|_{p}^{p}\,dP_{Z}({\bf z})\right)^{1/p}
\]

and $f_{jk}=f_{Q_{{\bf g}_{j}},Q_{{\bf g}_{k}}}$. Since $\|{\bf g}-\widetilde{\bf g}\|_{P_{Z},2}\leq\sqrt{D}\,\|{\bf g}-\widetilde{\bf g}\|_{\infty}$ for every ${\bf g},\widetilde{\bf g}\in\mathcal{G}\cup\{{\bf g}_{0}\}$ and

\[
\log N(\epsilon,\mathcal{G},\|\cdot\|_{\infty})\leq(s+1)\bigg\{\log 2+\log\epsilon^{-1}+\log(L+1)+2\sum_{l=0}^{L+1}\log(p_{l}+1)\bigg\}
\]

by Lemma 5 of Schmidt-Hieber [52], the number $N$ can be bounded as

\[
\begin{split}
\log N&\leq\log\big(N(\epsilon/\sqrt{D},\mathcal{G},\|\cdot\|_{\infty})+1\big)\leq c_{2}s\Big(\log D+\log\epsilon^{-1}+L\log\delta^{-1}\Big)\\
&\leq c_{3}\delta^{-t_{*}/\beta_{*}}\log\delta^{-1}\Big\{\log\epsilon^{-1}+(\log\delta^{-1})^{2}\Big\},
\end{split}
\tag{A.2}
\]

where $c_{2}=c_{2}(t_{*},\beta_{*})$ and $c_{3}=c_{3}(c_{1},c_{2},D)$. Here, $N(\epsilon,\mathcal{G},\|\cdot\|_{\infty})$ denotes the covering number of $\mathcal{G}$ with respect to $\|\cdot\|_{\infty}$.
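Since the metric entropy bound from Lemma 5 of Schmidt-Hieber [52] is fully explicit, it can be evaluated directly; the sketch below simply plugs network parameters into the displayed formula (the particular architecture is illustrative and not taken from the paper).

```python
import math

def log_covering_number(eps, L, widths, s):
    """Upper bound on log N(eps, G, ||.||_inf) for sparse ReLU networks of depth L,
    width vector `widths` = (p_0, ..., p_{L+1}) and at most s nonzero parameters,
    following the display above (Lemma 5 of Schmidt-Hieber)."""
    return (s + 1) * (math.log(2) + math.log(1 / eps) + math.log(L + 1)
                      + 2 * sum(math.log(p + 1) for p in widths))

# Illustrative values only: depth 10, widths p_0, ..., p_{L+1}, sparsity s, radius eps.
L, widths, s, eps = 10, [4] + [64] * 10 + [8], 5000, 1e-3
print(log_covering_number(eps, L, widths, s))
```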

Next, we will prove that condition (3.2)-(iv) is satisfied with $\epsilon_{4}=5\epsilon$. Note that $d_{\mathcal{F}}\leq W_{1}$ by the construction. For ${\bf g},\widetilde{\bf g}\in\mathcal{G}\cup\{{\bf g}_{0}\}$, we can choose ${\bf g}_{j}$ and ${\bf g}_{k}$ such that $\|{\bf g}-{\bf g}_{j}\|_{P_{Z},2}\leq\epsilon$ and $\|\widetilde{\bf g}-{\bf g}_{k}\|_{P_{Z},2}\leq\epsilon$. Then,

\[
\begin{split}
W_{1}(Q_{\bf g},Q_{\widetilde{\bf g}})&\leq W_{1}(Q_{\bf g},Q_{{\bf g}_{j}})+W_{1}(Q_{{\bf g}_{j}},Q_{{\bf g}_{k}})+W_{1}(Q_{{\bf g}_{k}},Q_{\widetilde{\bf g}})\\
&\leq W_{1}(Q_{\bf g},Q_{{\bf g}_{j}})+d_{\mathcal{F}}(Q_{{\bf g}_{j}},Q_{{\bf g}_{k}})+W_{1}(Q_{{\bf g}_{k}},Q_{\widetilde{\bf g}})+\epsilon.
\end{split}
\tag{A.3}
\]

Note that

\begin{align*}
W_{1}(Q_{\bf g},Q_{{\bf g}_{j}})
&= \sup_{f\in\mathcal{F}_{\rm Lip}}\left|\int f\big({\bf g}({\bf z})\big)\,dP_{Z}({\bf z})-\int f\big({\bf g}_{j}({\bf z})\big)\,dP_{Z}({\bf z})\right|\\
&\leq \int|{\bf g}({\bf z})-{\bf g}_{j}({\bf z})|_{2}\,dP_{Z}({\bf z})\leq\|{\bf g}-{\bf g}_{j}\|_{P_{Z},2}\leq\epsilon.
\end{align*}

Similarly, $W_{1}(Q_{{\bf g}_{k}},Q_{\widetilde{\bf g}})\leq\epsilon$, and therefore,

\[
d_{\mathcal{F}}(Q_{{\bf g}_{j}},Q_{{\bf g}_{k}})\leq d_{\mathcal{F}}(Q_{{\bf g}_{j}},Q_{\bf g})+d_{\mathcal{F}}(Q_{\bf g},Q_{\widetilde{\bf g}})+d_{\mathcal{F}}(Q_{\widetilde{\bf g}},Q_{{\bf g}_{k}})\leq d_{\mathcal{F}}(Q_{\bf g},Q_{\widetilde{\bf g}})+2\epsilon.
\]

Hence, the right-hand side of (A.3) is bounded by $d_{\mathcal{F}}(Q_{\bf g},Q_{\widetilde{\bf g}})+5\epsilon$. That is, condition (3.2)-(iv) holds with $\epsilon_{4}=5\epsilon$.

Next, note that ${\mathbb{P}}_{n}$ is the empirical measure based on i.i.d. samples from $P_{0}$. Let ${\bf Y}$ and $\bm{\epsilon}$ be independent random vectors following $Q_{0}$ and $\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$, respectively. For any $f\in\mathcal{F}$, by the Lipschitz continuity,

\[
|f({\bf Y}+\bm{\epsilon})|\leq|{\bf Y}+\bm{\epsilon}|_{2}\leq|{\bf Y}|_{2}+|\bm{\epsilon}|_{2}.
\]

Since ${\bf Y}$ is bounded almost surely and $\sigma_{0}\leq 1$, $f({\bf Y}+\bm{\epsilon})$ is a sub-Gaussian random variable with sub-Gaussian parameter $\sigma=\sigma(K,D)$. By Hoeffding's inequality,

\[
P_{0}\Big(\big|{\mathbb{P}}_{n}f-P_{0}f\big|>t\Big)\leq 2\exp\left[-\frac{nt^{2}}{2\sigma^{2}}\right]
\]

for every $f\in\mathcal{F}$ and $t\geq 0$; see Proposition 2.5 of Wainwright [64] for Hoeffding's inequality for unbounded sub-Gaussian random variables. Since $\mathcal{F}$ is a finite set with cardinality $N^{2}$,

\[
P_{0}\bigg(\sup_{f\in\mathcal{F}}\big|{\mathbb{P}}_{n}f-P_{0}f\big|>t\bigg)\leq 2N^{2}\exp\left[-\frac{nt^{2}}{2\sigma^{2}}\right].
\]

If $t\geq 2\sigma\sqrt{\{\log(2N^{2})\}/n}$, the right-hand side is bounded by $e^{-nt^{2}/(4\sigma^{2})}$. Therefore,

\begin{align*}
\mathbb{E}\,d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})
&= \int_{0}^{\infty}P_{0}\big(d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})>t\big)\,dt\\
&\leq 2\sigma\sqrt{\frac{\log(2N^{2})}{n}}+\int_{0}^{\infty}\exp\left[-\frac{nt^{2}}{4\sigma^{2}}\right]dt
\leq 2\sigma\sqrt{\frac{\log(2N^{2})}{n}}+\sigma\sqrt{\frac{\pi}{n}},
\end{align*}

and condition (3.2)-(iii) is also satisfied with $\epsilon_{3}$ equal to the right-hand side of the last display.
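The bound on $\mathbb{E}\,d_{\mathcal{F}}({\mathbb{P}}_{n},P_{0})$ rests on Hoeffding's inequality combined with a union bound over the $N^{2}$ elements of $\mathcal{F}$. A small simulation is sketched below; the random linear 1-Lipschitz discriminators, the uniform distribution for ${\bf Y}$ and the crude proxy for the sub-Gaussian parameter $\sigma(K,D)$ are assumptions of this sketch, not quantities from the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, sigma0, n, N = 3, 1.0, 0.1, 2000, 50

# Finite class of 1-Lipschitz linear functionals f_u(x) = <u, x> with |u|_2 = 1;
# the class has cardinality N^2 as in the proof (the random choice is an assumption).
us = rng.normal(size=(N * N, D))
us /= np.linalg.norm(us, axis=1, keepdims=True)

# X = Y + eps with Y uniform on [-K, K]^D (so that P_0 f = 0 by symmetry) plus Gaussian noise.
x = rng.uniform(-K, K, size=(n, D)) + sigma0 * rng.normal(size=(n, D))

sup_dev = np.max(np.abs((x @ us.T).mean(axis=0)))   # sup_{f in F} |P_n f - P_0 f|
sigma = np.sqrt(D) * (K + 1)                        # crude proxy for the sub-Gaussian parameter
print(sup_dev, 2 * sigma * np.sqrt(np.log(2 * N**2) / n))   # observed sup vs the threshold
```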

Note that

\[
d_{\mathcal{F}}(P_{0},Q_{0})\leq W_{1}(P_{0},Q_{0})\leq W_{2}(P_{0},Q_{0})\leq\sqrt{D}\,\sigma_{0},
\]

where the last inequality holds because $P_{0}$ is the convolution of $Q_{0}$ and $\mathcal{N}({\bf 0}_{D},\sigma_{0}^{2}{\mathbb{I}}_{D})$. By Theorem 3.1, we have

\begin{align*}
\mathbb{E}\,W_{1}(\hat{Q},Q_{0})
&\leq 2\sqrt{D}\,\sigma_{0}+5\sqrt{D}\,\delta+\epsilon_{\rm opt}+4\sigma\sqrt{\frac{\log(2N^{2})}{n}}+2\sigma\sqrt{\frac{\pi}{n}}+10\epsilon\\
&\leq c_{4}\bigg\{\epsilon_{\rm opt}+\sigma_{0}+\delta+\sqrt{\frac{\log N}{n}}+\epsilon\bigg\},
\end{align*}

where $c_{4}=c_{4}(\sigma,D)$. Combining with (A.2), we have

\[
\mathbb{E}\,W_{1}(\hat{Q},Q_{0})\leq c_{5}\bigg\{\epsilon_{\rm opt}+\sigma_{0}+\delta+\frac{\sqrt{\log\delta^{-1}}\big(\sqrt{\log\epsilon^{-1}}+\log\delta^{-1}\big)}{\sqrt{n}\,\delta^{t_{*}/2\beta_{*}}}+\epsilon\bigg\},
\]

where $c_{5}=c_{5}(c_{3},c_{4})$. The proof is complete if we take

\[
\delta=n^{-\beta_{*}/(2\beta_{*}+t_{*})}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}
\]

and $\epsilon=n^{-\log n}$. ∎
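The choice of $\delta$ balances the approximation error $\delta$ against the stochastic term $\sqrt{\log N/n}$ appearing in the last display. A short numerical check of this balancing, with illustrative values of $\beta_{*}$ and $t_{*}$, is given below.

```python
import numpy as np

beta, t = 2.0, 4.0                          # illustrative values of beta_* and t_*
n = 10.0 ** np.arange(3, 8)                 # sample sizes 1e3, ..., 1e7

delta = n ** (-beta / (2 * beta + t)) * np.log(n) ** (3 * beta / (2 * beta + t))
# Stochastic term from the display above, with sqrt(log(1/eps)) = log(n) for eps = n^{-log n}.
stoch = (np.sqrt(np.log(1 / delta)) * (np.log(n) + np.log(1 / delta))
         / (np.sqrt(n) * delta ** (t / (2 * beta))))

print(np.column_stack([n, delta, stoch]))   # the two terms are of the same order in n
```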

A.3 Proof of Theorem 3.3

The proof is divided into two cases.

Case 1: Suppose that

\[
\mathbb{E}\,d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\gtrsim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}.
\]

In this case, we have

\[
\mathbb{E}\,W_{1}(\hat{Q},Q_{0})\lesssim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}+\sigma_{0}+\epsilon_{\rm opt},
\]

whose proof is the same as that of Theorem 3.2. The only difference is that some constants in the proof depend on the Lipschitz constant $C_{1}$.

Case 2: Suppose that

\[
\mathbb{E}\,d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})\lesssim n^{-\frac{\beta_{*}}{2\beta_{*}+t_{*}}}(\log n)^{\frac{3\beta_{*}}{2\beta_{*}+t_{*}}}.
\]

We utilize Theorem 3.1 with $\mathcal{F}=\mathcal{F}_{0}$. Since $d_{\rm eval}=d_{\mathcal{F}}$, we have $\epsilon_{4}=0$. Also, for a large enough $\mathcal{G}$, i.e., large depth, width and number of nonzero parameters, $\epsilon_{1}$ can be made arbitrarily small. Since $\mathcal{F}$ consists of Lipschitz continuous functions, $d_{\mathcal{F}}(P_{0},Q_{0})\lesssim\sigma_{0}$. It follows by Theorem 3.1 that $\mathbb{E}\,d_{\mathcal{F}_{0}}(\hat{Q},Q_{0})\lesssim\sigma_{0}+\epsilon_{\rm opt}+\mathbb{E}\,d_{\mathcal{F}_{0}}({\mathbb{P}}_{n},P_{0})$. ∎

A.4 Proof of Theorem 4.1

Throughout the proof, we will assume that $D=d$; the extension to the case $D>d$ is straightforward. Our proof relies on Fano's method, for which we refer to Chapter 15 of Wainwright [64].

Let $\phi:{\mathbb{R}}\to[0,\infty)$ be a fixed function satisfying the following:

(i) $\phi$ is $[\beta+1]$-times continuously differentiable on ${\mathbb{R}}$;

(ii) $\phi$ is unimodal and symmetric about $1/2$; and

(iii) $\phi(z)>0$ if and only if $z\in(0,1)$,

where $[x]$ denotes the largest integer less than or equal to $x$. Figure 1 shows an illustration of $\phi$ and related functions. For a positive integer $m=m_{n}$, with $m_{n}\uparrow\infty$ as $n\to\infty$, let $z_{j}=j/m$, $I_{j}=[z_{j},z_{j+1}]$ for $j=0,\ldots,m-1$, $J=\{0,1,\ldots,m-1\}^{d}$ and $\phi_{j}(z)=\phi(m(z-z_{j}))$. For a multi-index ${\bf j}=(j_{1},\ldots,j_{d})\in J$ and $\alpha=(\alpha_{\bf j})_{{\bf j}\in J}\in\{-1,+1\}^{|J|}$, define ${\bf g}_{\alpha}:[0,1]^{d}\to{\mathbb{R}}^{d}$ as

\[
{\bf g}_{\alpha}({\bf z})=\left(z_{1}+\frac{c_{1}}{m^{\beta}}\sum_{{\bf j}\in J}\alpha_{\bf j}\,\phi_{j_{1}}(z_{1})\cdots\phi_{j_{d}}(z_{d}),\;z_{2},\ldots,z_{d}\right),
\]

where $c_{1}=c_{1}(\phi,d)$ is a small enough constant described below. Then, it is easy to check that ${\bf g}_{\alpha}$ is a one-to-one function from $[0,1]^{d}$ onto itself, and ${\bf g}_{\alpha}\in\mathcal{H}_{K}^{\beta}([0,1]^{d})\times\cdots\times\mathcal{H}_{K}^{\beta}([0,1]^{d})$ for large enough $K=K(\beta,c_{1})$.

Figure 1: An illustration of $\phi$ and related functions: (a) $\phi(z)$, (b) $\phi^{\prime}(z)$, (c) $\phi^{\prime}(z_{1})\phi(z_{2})$.
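For concreteness, a one-dimensional sketch of this construction is given below. The particular bump $\phi(z)=\exp\{-1/(z(1-z))\}$ on $(0,1)$ is only one function satisfying (i)–(iii) and is an assumption of this sketch; the code checks that, for a small $c_{1}$, the perturbed map is strictly increasing on $[0,1]$ and fixes the endpoints, hence is a bijection of $[0,1]$ onto itself.

```python
import numpy as np

def phi(z):
    """A smooth bump on (0, 1): phi(z) = exp(-1 / (z (1 - z))) for 0 < z < 1, else 0.
    One concrete choice satisfying (i)-(iii); an assumption of this sketch."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    inside = (z > 0) & (z < 1)
    out[inside] = np.exp(-1.0 / (z[inside] * (1.0 - z[inside])))
    return out

def g_alpha(z, alpha, c1, beta):
    """Perturbed generator in d = 1: g(z) = z + (c1 / m^beta) * sum_j alpha_j phi_j(z)."""
    m = len(alpha)
    bumps = sum(a * phi(m * z - j) for j, a in enumerate(alpha))
    return z + c1 / m ** beta * bumps

rng = np.random.default_rng(3)
m, beta, c1 = 8, 2.0, 0.05
alpha = rng.choice([-1.0, 1.0], size=m)

z = np.linspace(0.0, 1.0, 10_001)
gz = g_alpha(z, alpha, c1, beta)
print(np.all(np.diff(gz) > 0), gz[0], gz[-1])   # strictly increasing, g(0) = 0 and g(1) = 1
```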

Let ${\bf Z}=(Z_{1},\ldots,Z_{d})$ be a uniform random vector on $(0,1)^{d}$. Then, by the change of variables formula, the Lebesgue density $q_{\alpha}$ of ${\bf Y}={\bf g}_{\alpha}({\bf Z})$ is given as

\[
q_{\alpha}({\bf y})=\left|\frac{\partial{\bf z}}{\partial{\bf y}}\right|=\left(1+\frac{c_{1}}{m^{\beta}}\sum_{{\bf j}\in J}\alpha_{\bf j}\,\phi_{j_{1}}^{\prime}(z_{1})\phi_{j_{2}}(y_{2})\cdots\phi_{j_{d}}(y_{d})\right)^{-1}
\]

for ${\bf y}\in[0,1]^{d}$, where $\phi^{\prime}$ denotes the derivative of $\phi$. Here, $z_{1}=z_{1}(y_{1},\ldots,y_{d})$ is defined implicitly as the first coordinate of ${\bf g}_{\alpha}^{-1}({\bf y})$.

We first find an upper bound of $K(q_{\alpha},q_{\alpha^{\prime}})$ for $\alpha,\alpha^{\prime}\in\{-1,+1\}^{|J|}$, where $K(p,q)=\int\log(p/q)\,dP$ is the Kullback–Leibler divergence and $P$ is the probability measure with density $p$. Since ${\bf g}_{\alpha}(C_{\bf j})=C_{\bf j}$ and $q_{\alpha}$ is bounded from above and below for small enough $c_{1}$, where $C_{\bf j}=I_{j_{1}}\times\cdots\times I_{j_{d}}$, we have

\[
|q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})|\lesssim\left|\frac{1}{q_{\alpha}({\bf y})}-\frac{1}{q_{\alpha^{\prime}}({\bf y})}\right|\leq\frac{2c_{1}}{m^{\beta-1}}\|\phi^{\prime}\|_{\infty}\|\phi\|_{\infty}^{d-1}.
\]

Since the ratio $q_{\alpha}/q_{\alpha^{\prime}}$ is bounded from above and below, we can use a well-known inequality $K(q_{\alpha},q_{\alpha^{\prime}})\lesssim d_{H}^{2}(q_{\alpha},q_{\alpha^{\prime}})$, where $d_{H}$ denotes the Hellinger distance; see Lemma B.2 of Ghosal & van der Vaart [19]. Since $|\sqrt{q_{\alpha}}-\sqrt{q_{\alpha^{\prime}}}|\lesssim|q_{\alpha}-q_{\alpha^{\prime}}|$, we have

\[
K(q_{\alpha},q_{\alpha^{\prime}})\lesssim\int_{[0,1]^{d}}|q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})|^{2}\,d{\bf y}\lesssim\frac{c_{1}^{2}\|\phi^{\prime}\|_{\infty}^{2}\|\phi\|_{\infty}^{2(d-1)}}{m^{2(\beta-1)}}.
\]
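The inequality $K(q_{\alpha},q_{\alpha^{\prime}})\lesssim\int(q_{\alpha}-q_{\alpha^{\prime}})^{2}$ for densities bounded away from zero and infinity can also be illustrated numerically; the two densities in the sketch below are arbitrary and serve only as an illustration.

```python
import numpy as np

y = np.linspace(0.0, 1.0, 200_001)[1:-1]   # interior grid on (0, 1) for numerical integration
dy = y[1] - y[0]

q1 = 1.0 + 0.2 * np.sin(2 * np.pi * y)     # two densities on [0, 1], bounded away from 0
q2 = 1.0 + 0.2 * np.cos(2 * np.pi * y)
q1, q2 = q1 / (q1.sum() * dy), q2 / (q2.sum() * dy)

kl = np.sum(q1 * np.log(q1 / q2)) * dy     # Kullback-Leibler divergence K(q1, q2)
l2 = np.sum((q1 - q2) ** 2) * dy           # squared L2 distance
print(kl, l2)                              # kl is bounded by a constant multiple of l2
```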

Next, we derive a lower bound for $W_{1}(q_{\alpha},q_{\alpha^{\prime}})$. Suppose that $\alpha_{\bf j}\neq\alpha_{\bf j}^{\prime}$ for some ${\bf j}\in J$. Then, the excess mass of $Q_{\alpha}$ over $Q_{\alpha^{\prime}}$ on $C_{\bf j}$ is

\begin{align*}
&\int_{\{{\bf y}\in C_{\bf j}:q_{\alpha}({\bf y})>q_{\alpha^{\prime}}({\bf y})\}}\big\{q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})\big\}\,d{\bf y}=\frac{1}{2}\int_{C_{\bf j}}|q_{\alpha}({\bf y})-q_{\alpha^{\prime}}({\bf y})|\,d{\bf y}\\
&\gtrsim\int_{C_{\bf j}}\left|\frac{1}{q_{\alpha}({\bf y})}-\frac{1}{q_{\alpha^{\prime}}({\bf y})}\right|d{\bf y}=\frac{2c_{1}}{m^{\beta}}\int_{C_{\bf j}}|\phi_{j_{1}}^{\prime}(z_{1})\phi_{j_{2}}(y_{2})\cdots\phi_{j_{d}}(y_{d})|\,d{\bf y}\\
&=\frac{2c_{1}}{m^{\beta}}\int_{C_{\bf j}}|\phi_{j_{1}}^{\prime}(z_{1})\phi_{j_{2}}(z_{2})\cdots\phi_{j_{d}}(z_{d})|\left|\frac{\partial{\bf y}}{\partial{\bf z}}\right|d{\bf z}\gtrsim\frac{c_{1}}{m^{(\beta-1)+d}}\int_{(0,1)^{d}}|\phi^{\prime}(z_{1})\phi(z_{2})\cdots\phi(z_{d})|\,d{\bf z}.
\end{align*}

By virtue of Corollary 1.16 in Villani [62], under the (unique) optimal transport plan between $Q_{\alpha}$ and $Q_{\alpha^{\prime}}$, some portion $\gamma\in(0,1)$ of this excess mass must be transported at least the distance $c_{2}/m$, where the constants $\gamma$ and $c_{2}$ can be chosen so that they depend only on $d$ and $\phi$. Hence, for some constant $c_{3}=c_{3}(\phi,d)$,

\[
W_{1}(q_{\alpha},q_{\alpha^{\prime}})\geq\frac{c_{1}c_{3}}{m^{\beta+d}}H(\alpha,\alpha^{\prime}),
\]

where $H(\alpha,\alpha^{\prime})=\sum_{{\bf j}\in J}I(\alpha_{\bf j}\neq\alpha^{\prime}_{\bf j})$ denotes the Hamming distance between $\alpha$ and $\alpha^{\prime}$.

With the Hamming distance on $\{-1,+1\}^{|J|}$, it is well known (see, e.g., page 124 of Wainwright [64]) that there is a $|J|/4$-packing $\mathcal{A}$ of $\{-1,+1\}^{|J|}$ whose cardinality is at least $e^{|J|/16}$; a small empirical check of this packing bound is sketched at the end of the proof. Let $P_{\alpha}$ be the convolution of $Q_{\alpha}$ and $\mathcal{N}({\bf 0}_{d},\sigma_{0}^{2}{\mathbb{I}}_{d})$. Then, $K(p_{\alpha},p_{\alpha^{\prime}})\leq K(q_{\alpha},q_{\alpha^{\prime}})$ by Lemma B.11 of Ghosal & van der Vaart [19]. By Fano's method (Proposition 15.12 of Wainwright [64]), we have

\[
\mathfrak{M}(\mathcal{G}_{0},\sigma_{0})\gtrsim\frac{c_{1}c_{3}}{m^{\beta}}\left\{1-\frac{nc_{1}^{2}C(\phi,d)m^{-2(\beta-1)}+\log 2}{m^{d}/16}\right\}.
\]

If $n\asymp m^{d+2(\beta-1)}$ and $c_{1}$ is small enough, then $n\,m^{-2(\beta-1)}/m^{d}\asymp 1$ and $m^{-\beta}\asymp n^{-\beta/(2\beta+d-2)}$, so that the factor in braces is bounded away from zero and we have the desired result. ∎
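A brute-force check of the Varshamov–Gilbert-type packing bound invoked above is sketched below; it is feasible only for small $|J|$ and is purely an empirical verification, not the probabilistic argument given in Wainwright [64].

```python
import itertools
import math
import numpy as np

def greedy_packing(J, min_dist):
    """Greedily collect vectors in {-1,+1}^J whose pairwise Hamming distance is >= min_dist."""
    packing = np.empty((0, J), dtype=int)
    for point in itertools.product((-1, 1), repeat=J):
        v = np.array(point)
        if packing.shape[0] == 0 or np.min(np.sum(packing != v, axis=1)) >= min_dist:
            packing = np.vstack([packing, v])
    return packing

J = 16                                            # small |J| so that full enumeration is feasible
packing = greedy_packing(J, J // 4)               # |J|/4-packing in Hamming distance
print(len(packing), math.ceil(math.exp(J / 16)))  # greedy count vs the e^{|J|/16} lower bound
```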