
Non-Asymptotic Error Bounds for
Bidirectional GANs

Shiao Liu
Department of Statistics and Actuarial Science, University of Iowa
Iowa City, IA 52242, USA
shiao-liu@uiowa.edu
Yunfei Yang
Department of Mathematics, The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong, China
yyangdc@connect.ust.hk
Jian Huang
Department of Statistics and Actuarial Science, University of Iowa
Iowa City, IA 52242, USA
jian-huang@uiowa.edu
Yuling Jiao
School of Mathematics and Statistics, Wuhan University
Wuhan, Hubei, China 430072
yulingjiaomath@whu.edu.cn
Yang Wang
Department of Mathematics, The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong, China
yangwang@ust.hk
Corresponding authors
Abstract

We derive nearly sharp bounds for the bidirectional GAN (BiGAN) estimation error under the Dudley distance between the latent joint distribution and the data joint distribution, with appropriately specified architectures of the neural networks used in the model. To the best of our knowledge, this is the first theoretical guarantee for the bidirectional GAN learning approach. An appealing feature of our results is that they do not assume the reference and the data distributions to have the same dimension or to have bounded support. These assumptions are commonly made in the existing convergence analyses of unidirectional GANs but may not be satisfied in practice. Our results are also applicable to the Wasserstein bidirectional GAN if the target distribution has bounded support. To prove these results, we construct neural network functions that push forward an empirical distribution to another arbitrary empirical distribution on a possibly different-dimensional space. We also develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. These basic theoretical results are of independent interest and can be applied to other related learning problems.

1 Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are an important approach to implicitly learning and sampling from high-dimensional complex distributions. GANs have been shown to achieve impressive performance in many machine learning tasks (Radford et al., 2016; Reed et al., 2016; Zhu et al., 2017; Karras et al., 2018, 2019; Brock et al., 2019). Several recent studies have generalized GANs to bidirectional generative learning, which simultaneously learns an encoder mapping the data distribution to the reference distribution and a generator mapping in the reverse direction. These studies include the adversarial autoencoder (AAE) (Makhzani et al., 2015), bidirectional GAN (BiGAN) (Donahue et al., 2016), adversarially learned inference (ALI) (Dumoulin et al., 2016), and bidirectional generative modeling using adversarial gradient estimation (AGES) (Shen et al., 2020). A common feature of these methods is that they generalize the basic adversarial training framework of the original GAN from unidirectional to bidirectional. Dumoulin et al. (2016) showed that BiGANs make use of the joint distribution of data and latent representations, which can better capture the information in the data than vanilla GANs. Compared with unidirectional GANs, the joint distribution matching in the training of bidirectional GANs alleviates mode dropping and encourages cycle consistency (Shen et al., 2020).

Several elegant and stimulating papers have analyzed the theoretical properties of unidirectional GANs. Arora et al. (2017) considered the generalization error of GANs under the neural net distance. Zhang et al. (2018) improved the generalization error bound in Arora et al. (2017). Liang (2020) studied the minimax optimal rates for learning distributions from empirical samples under Sobolev evaluation and density classes. The minimax rate is $O(n^{-1/2}\vee n^{-(\alpha+\beta)/(2\alpha+\beta)})$, where $\alpha$ and $\beta$ are the regularity parameters of the Sobolev density and evaluation classes, respectively. Bai et al. (2019) analyzed the estimation error of GANs under the Wasserstein distance for a special class of distributions implemented by a generator, with the discriminator designed to guarantee zero bias. Chen et al. (2020) studied the convergence properties of GANs when both the evaluation class and the target density class are Hölder classes and derived an $O(n^{-\beta/(2\beta+d)}\log^{2}n)$ bound, where $d$ is the dimension of the data distribution and $\alpha$ and $\beta$ are the regularity parameters of the Hölder density and evaluation classes, respectively. While impressive progress has been made on the theoretical understanding of GANs, there are still some drawbacks in the existing results. For example,

(a) The reference distribution and the target data distribution are assumed to have the same dimension, which is not the actual setting in GAN training.

(b) The reference and the target data distributions are assumed to be supported on bounded sets.

(c) The prefactors in the convergence rates may depend exponentially on the dimension $d$ of the data distribution.

In practice, GANs are usually trained using a reference distribution with a lower dimension than that of the target data distribution. Indeed, an important strength of GANs is that they can model low-dimensional latent structures by using a low-dimensional reference distribution. The bounded support assumption excludes some commonly used distributions, such as the Gaussian, as the reference. Therefore, strictly speaking, the existing convergence analyses do not apply to what is done in practice. In addition, there has been no theoretical analysis of bidirectional GANs in the literature.

1.1 Contributions

We derive nearly sharp non-asymptotic bounds for the GAN estimation error under the Dudley distance between the reference joint distribution and the data joint distribution. To the best of our knowledge, this is the first result providing theoretical guarantees for the bidirectional GAN estimation error rate. We do not assume that the reference and the target data distributions have the same dimension or that these distributions have bounded support. Also, our results are applicable to the Wasserstein distance if the target data distribution has bounded support.

The main novel aspects of our work are as follows.

(1) We allow the dimension of the reference distribution to differ from the dimension of the target distribution; in particular, it can be much lower than that of the target distribution.

(2) We allow unbounded support for the reference and target distributions under mild conditions on the tail probabilities of the target distribution.

(3) We explicitly establish that the prefactors in the error bounds depend on the square root of the dimension of the target distribution. This is a significant improvement over the exponential dependence on $d$ in existing works.

Moreover, we develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. We also show that the pushforward of an empirical distribution by a neural network can perfectly approximate another arbitrary empirical distribution, as long as the numbers of discrete points are the same.

Notation. We use $\sigma$ to denote the ReLU activation function in neural networks, $\sigma(x)=\max\{x,0\}$, $x\in\mathbb{R}$. We use $I$ to denote the identity map. Unless otherwise indicated, $\|\cdot\|$ denotes the $L_{2}$ norm. For any function $g$, let $\|g\|_{\infty}=\sup_{x}\|g(x)\|$. We use the notations $O(\cdot)$ and $\tilde{O}(\cdot)$ to express the order of a function in slightly different ways: $O(\cdot)$ omits a universal constant independent of $d$, while $\tilde{O}(\cdot)$ omits a constant depending on $d$. We use $B_{2}^{d}(a)$ to denote the $L_{2}$ ball in $\mathbb{R}^{d}$ centered at $\mathbf{0}$ with radius $a$. Let $g_{\#}\nu$ be the pushforward distribution of $\nu$ by a function $g$, in the sense that $g_{\#}\nu(A)=\nu(g^{-1}(A))$ for any measurable set $A$. We use $\hat{\mathbb{E}}$ to denote expectation with respect to the empirical distribution.

2 Bidirectional generative learning

We describe the setup of the bidirectional GAN estimation problem and present the assumptions we need in our analysis.

2.1 Bidirectional GAN estimators

Let $\mu$ be the target data distribution supported on $\mathbb{R}^{d}$ for $d\geq 1$, and let $\nu$ be a reference distribution that is easy to sample from. We first consider the case when $\nu$ is supported on $\mathbb{R}$, and then extend it to $\mathbb{R}^{k}$, where $k\geq 1$ can be different from $d$. Usually, $k\ll d$ in practical machine learning tasks such as image generation. The goal is to learn functions $g:\mathbb{R}\to\mathbb{R}^{d}$ and $e:\mathbb{R}^{d}\to\mathbb{R}$ such that $\tilde{g}_{\#}\nu=\tilde{e}_{\#}\mu$, where $\tilde{g}:=(g,I)$ and $\tilde{e}:=(I,e)$, $\tilde{g}_{\#}\nu$ is the pushforward distribution of $\nu$ under $\tilde{g}$, and $\tilde{e}_{\#}\mu$ is the pushforward distribution of $\mu$ under $\tilde{e}$. We call $\tilde{g}_{\#}\nu$ the joint latent distribution or joint reference distribution, and $\tilde{e}_{\#}\mu$ the joint data distribution or joint target distribution. At the population level, the bidirectional GAN solves the minimax problem:

\[
(g^{*},e^{*},f^{*})\in\arg\min_{g\in\mathcal{G},e\in\mathcal{E}}\max_{f\in\mathcal{F}}\mathbb{E}_{Z\sim\nu}[f(g(Z),Z)]-\mathbb{E}_{X\sim\mu}[f(X,e(X))],
\]

where $\mathcal{G},\mathcal{E},\mathcal{F}$ are referred to as the generator class, the encoder class, and the discriminator class, respectively. Suppose we have two independent random samples $Z_{1},\ldots,Z_{n}\overset{\mathrm{i.i.d.}}{\sim}\nu$ and $X_{1},\ldots,X_{n}\overset{\mathrm{i.i.d.}}{\sim}\mu$. At the sample level, the bidirectional GAN solves the empirical version of the above minimax problem:

\[
(\hat{g}_{\theta},\hat{e}_{\varphi},\hat{f}_{\omega})=\arg\min_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\max_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}f_{\omega}(g_{\theta}(Z_{i}),Z_{i})-\frac{1}{n}\sum_{j=1}^{n}f_{\omega}(X_{j},e_{\varphi}(X_{j})),
\tag{2.1}
\]

where $\mathcal{G}_{NN}$ and $\mathcal{E}_{NN}$ are two classes of neural networks approximating the generator class $\mathcal{G}$ and the encoder class $\mathcal{E}$, respectively, and $\mathcal{F}_{NN}$ is a class of neural networks approximating the discriminator class $\mathcal{F}$.
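To make the sample-level criterion (2.1) concrete, here is a minimal Python sketch (ours, not from the paper) that evaluates the empirical BiGAN objective for given generator, encoder, and discriminator callables; the placeholder maps `g`, `e`, `f` below are purely hypothetical, and the optimization over the network classes is omitted.

```python
import numpy as np

def bigan_objective(f, g, e, z, x):
    """Empirical BiGAN criterion from (2.1).

    f : discriminator, maps (d+1)-dim joint samples to scalars
    g : generator, maps 1-dim latent samples to d-dim samples
    e : encoder, maps d-dim data samples to 1-dim latent codes
    z : array of shape (n,), sample from the reference nu
    x : array of shape (n, d), sample from the target mu
    """
    joint_latent = np.column_stack([g(z), z])   # (g(Z_i), Z_i)
    joint_data = np.column_stack([x, e(x)])     # (X_j, e(X_j))
    return np.mean(f(joint_latent)) - np.mean(f(joint_data))

# toy usage with hypothetical placeholder maps (d = 3)
rng = np.random.default_rng(0)
z = rng.normal(size=100)                    # reference sample
x = rng.normal(size=(100, 3))               # data sample
g = lambda z: np.tile(z[:, None], (1, 3))   # placeholder generator
e = lambda x: x[:, 0]                       # placeholder encoder
f = lambda u: np.tanh(u).sum(axis=1)        # placeholder discriminator
print(bigan_objective(f, g, e, z, x))
```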

2.2 Assumptions

We assume that the target $\mu$ and the reference $\nu$ satisfy the following assumptions.

Assumption 1 (Subexponential tail).

For large $n$, the target distribution $\mu$ on $\mathbb{R}^{d}$ and the reference distribution $\nu$ on $\mathbb{R}$ satisfy the following first-moment tail condition for some $\delta>0$:

\[
\max\big\{\mathbb{E}_{\nu}\|Z\|\mathbbm{1}_{\{\|Z\|>\log n\}},\ \mathbb{E}_{\mu}\|X\|\mathbbm{1}_{\{\|X\|>\log n\}}\big\}=O\big(n^{-\frac{(\log n)^{\delta}}{d}}\big).
\]
Assumption 2 (Absolute continuity).

Both the target distribution $\mu$ on $\mathbb{R}^{d}$ and the reference distribution $\nu$ on $\mathbb{R}$ are absolutely continuous with respect to the Lebesgue measure $\lambda$.

Assumption 1 is a technical condition for dealing with the case when $\mu$ and $\nu$ are supported on $\mathbb{R}^{d}$ and $\mathbb{R}$ rather than on compact subsets. For distributions with bounded support, this assumption is automatically satisfied. The factor $(\log n)^{\delta}$ ensures that the tails of $\mu$ and $\nu$ are subexponential, and the condition is easily satisfied if the distributions are sub-gaussian. For the reference distribution, Assumptions 1 and 2 are easily satisfied by specifying $\nu$ as a common distribution with an easy-to-sample density, such as a Gaussian or uniform distribution, as is usually done in applications of GANs. For the target distribution, Assumptions 1 and 2 specify the type of distributions that are learnable by bidirectional GANs with our theoretical guarantees. Note that Assumption 1 is also necessary in our proof for bounding the generator and encoder approximation error, in the sense that the results will not hold if we replace $(\log n)^{\delta}$ with 1. Assumption 2 is also necessary for Theorem 4.3 on mapping between empirical samples, which is essential in bounding the generator and encoder approximation error.
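For instance (our own check, not part of the original text), take the standard Gaussian reference $\nu=N(0,1)$ on $\mathbb{R}$. A direct calculation gives

\[
\mathbb{E}_{\nu}|Z|\,\mathbbm{1}_{\{|Z|>\log n\}}
=2\int_{\log n}^{\infty}\frac{z}{\sqrt{2\pi}}e^{-z^{2}/2}\,dz
=\sqrt{\frac{2}{\pi}}\,e^{-(\log n)^{2}/2}
=\sqrt{\frac{2}{\pi}}\,n^{-\frac{\log n}{2}},
\]

which is $O(n^{-(\log n)^{\delta}/d})$ for any $\delta\leq 1$ when $d\geq 2$, so the tail condition in Assumption 1 holds; a similar calculation applies to sub-gaussian targets.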

2.3 Generator, encoder and discriminator classes

Let $\mathcal{F}_{NN}:=\mathcal{NN}(W_{1},L_{1})$ be the discriminator class consisting of feedforward ReLU neural networks $f_{\omega}:\mathbb{R}^{d+1}\mapsto\mathbb{R}$ with width $W_{1}$ and depth $L_{1}$. Similarly, let $\mathcal{G}_{NN}:=\mathcal{NN}(W_{2},L_{2})$ be the generator class consisting of feedforward ReLU neural networks $g_{\theta}:\mathbb{R}\mapsto\mathbb{R}^{d}$ with width $W_{2}$ and depth $L_{2}$, and $\mathcal{E}_{NN}:=\mathcal{NN}(W_{3},L_{3})$ the encoder class consisting of feedforward ReLU neural networks $e_{\varphi}:\mathbb{R}^{d}\mapsto\mathbb{R}$ with width $W_{3}$ and depth $L_{3}$.

The functions $f_{\omega}\in\mathcal{F}_{NN}$ have the form

\[
f_{\omega}(x)=A_{L_{1}}\cdot\sigma(A_{L_{1}-1}\cdots\sigma(A_{1}x+b_{1})\cdots+b_{L_{1}-1})+b_{L_{1}},
\]

where the $A_{i}$ are weight matrices whose numbers of rows and columns are no larger than the width $W_{1}$, the $b_{i}$ are bias vectors with compatible dimensions, and $\sigma$ is the ReLU activation function $\sigma(x)=x\vee 0$. Similarly, the functions $g_{\theta}\in\mathcal{G}_{NN}$ and $e_{\varphi}\in\mathcal{E}_{NN}$ have the form

\begin{align*}
g_{\theta}(x)&=A^{\prime}_{L_{2}}\cdot\sigma(A^{\prime}_{L_{2}-1}\cdots\sigma(A^{\prime}_{1}x+b^{\prime}_{1})\cdots+b^{\prime}_{L_{2}-1})+b^{\prime}_{L_{2}},\\
e_{\varphi}(x)&=A^{\prime\prime}_{L_{3}}\cdot\sigma(A^{\prime\prime}_{L_{3}-1}\cdots\sigma(A^{\prime\prime}_{1}x+b^{\prime\prime}_{1})\cdots+b^{\prime\prime}_{L_{3}-1})+b^{\prime\prime}_{L_{3}},
\end{align*}

where the $A_{i}^{\prime}$ and $A_{i}^{\prime\prime}$ are weight matrices whose numbers of rows and columns are no larger than $W_{2}$ and $W_{3}$, respectively, and the $b_{i}^{\prime}$ and $b_{i}^{\prime\prime}$ are bias vectors with compatible dimensions.
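For illustration only, the following numpy sketch instantiates and evaluates a random network of the above form with a given width and depth (here `depth` counts the affine layers $A_{1},\dots,A_{L}$); it is a hypothetical stand-in for a member of $\mathcal{NN}(W,L)$, not the fitted estimator.

```python
import numpy as np

def init_relu_net(d_in, d_out, width, depth, rng):
    """Random parameters (A_1, b_1), ..., (A_L, b_L) with L = depth affine layers."""
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    return [(rng.normal(size=(m, k)) / np.sqrt(k), np.zeros(m))
            for k, m in zip(dims[:-1], dims[1:])]

def relu_net(params, x):
    """Forward pass: A_L sigma(A_{L-1} ... sigma(A_1 x + b_1) ... + b_{L-1}) + b_L."""
    h = x
    for A, b in params[:-1]:
        h = np.maximum(A @ h + b, 0.0)   # ReLU activation sigma
    A_last, b_last = params[-1]
    return A_last @ h + b_last

rng = np.random.default_rng(0)
d = 3
f_params = init_relu_net(d + 1, 1, width=16, depth=4, rng=rng)  # discriminator-shaped net
print(relu_net(f_params, np.ones(d + 1)))
```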

We impose the following conditions on $\mathcal{G}_{NN}$, $\mathcal{E}_{NN}$, and $\mathcal{F}_{NN}$.

Condition 1.

For any $g_{\theta}\in\mathcal{G}_{NN}$ and $e_{\varphi}\in\mathcal{E}_{NN}$, we have $\max\{\|g_{\theta}\|_{\infty},\|e_{\varphi}\|_{\infty}\}\leq\log n$.

Condition 1 on $\mathcal{G}_{NN}$ can be easily satisfied by adding an additional clipping layer $\ell$ after the original output layer, where, with $c_{n,d}\equiv(\log n)/\sqrt{d}$,

\[
\ell(a)=a\wedge c_{n,d}\vee(-c_{n,d})=\sigma(a+c_{n,d})-\sigma(a-c_{n,d})-c_{n,d}.
\tag{2.2}
\]

We truncate each coordinate of the output of $g_{\theta}$ to $[-c_{n,d},c_{n,d}]$, which guarantees $\|g_{\theta}\|_{\infty}\leq\log n$; the truncation range $[-\log n,\log n]$ grows with $n$, so the whole support in $\mathbb{R}^{d}$ is eventually covered for the evaluation function class. Condition 1 on $\mathcal{E}_{NN}$ can be satisfied in the same manner. This condition is technically necessary in our proof (see Appendix).
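As a quick numerical check of the ReLU identity in (2.2) (a sketch of ours, not the paper's code):

```python
import numpy as np

def clip_via_relu(a, c):
    """Clipping layer (2.2): a ∧ c ∨ (-c) = sigma(a + c) - sigma(a - c) - c."""
    relu = lambda t: np.maximum(t, 0.0)
    return relu(a + c) - relu(a - c) - c

a = np.linspace(-5.0, 5.0, 101)
c = 2.0                                   # plays the role of c_{n,d} = (log n) / sqrt(d)
assert np.allclose(clip_via_relu(a, c), np.clip(a, -c, c))
```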

3 Non-asymptotic error bounds

We characterize the bidirectional GAN solutions based on minimizing the integral probability metric (IPM, Müller (1997)) between two distributions $\mu$ and $\nu$ with respect to a symmetric evaluation function class $\mathcal{F}$, defined by

\[
d_{\mathcal{F}}(\mu,\nu)=\sup_{f\in\mathcal{F}}[\mathbb{E}_{\mu}f-\mathbb{E}_{\nu}f].
\tag{3.1}
\]

By specifying the evaluation function class $\mathcal{F}$ differently, we can obtain many commonly used metrics (Liu et al., 2017). Here we focus on the following two:

  • $\mathcal{F}=$ the bounded Lipschitz function class: $d_{\mathcal{F}}=d_{BL}$, the bounded Lipschitz (or Dudley) metric, which metrizes weak convergence (Dudley, 2018);

  • $\mathcal{F}=$ the Lipschitz function class: $d_{\mathcal{F}}=W_{1}$, the 1-Wasserstein distance (Wasserstein GAN, Arjovsky et al. (2017)).

We consider the estimation error under the Dudley metric $d_{BL}$. Note that when $\mu$ and $\nu$ have bounded support, the Dudley metric $d_{BL}$ is equivalent to the 1-Wasserstein metric $W_{1}$. Therefore, under the bounded support condition for $\mu$ and $\nu$, all our convergence results also hold under the Wasserstein distance $W_{1}$. Even if the supports of $\mu$ and $\nu$ are unbounded, we can still apply the result of Lu and Lu (2020) to avoid empirical process theory and obtain a stochastic error bound under the Wasserstein distance $W_{1}$. However, the result of Lu and Lu (2020) requires sub-gaussianity to obtain the $\sqrt{d}$ prefactor. To make our result more general, we use empirical process theory to obtain the explicit prefactor. Also, the discriminator approximation error would be unbounded if we considered the Wasserstein distance $W_{1}$. Hence, we only consider $d_{BL}$ in the unbounded support case.

The bidirectional GAN solution $(\hat{g}_{\theta},\hat{e}_{\varphi})$ in (2.1) also minimizes the distance between $(\tilde{g}_{\theta})_{\#}\hat{\nu}_{n}$ and $(\tilde{e}_{\varphi})_{\#}\hat{\mu}_{n}$ under $d_{\mathcal{F}_{NN}}$:

\[
\min_{g_{\theta}\in\mathcal{G}_{NN},\,e_{\varphi}\in\mathcal{E}_{NN}}d_{\mathcal{F}_{NN}}\big((\tilde{g}_{\theta})_{\#}\hat{\nu}_{n},(\tilde{e}_{\varphi})_{\#}\hat{\mu}_{n}\big).
\]

However, even if two distributions are close with respect to $d_{\mathcal{F}_{NN}}$, there is no automatic guarantee that they are close under other metrics, for example, the Dudley or the Wasserstein distance (Arora et al., 2017). Therefore, it is natural to ask:

  • How close are the two bidirectional GAN estimators $\hat{\boldsymbol{\nu}}:=(\hat{g}_{\theta},I)_{\#}\nu$ and $\hat{\boldsymbol{\mu}}:=(I,\hat{e}_{\varphi})_{\#}\mu$ under some other, stronger metric?

We consider the IPM with the uniformly bounded 1-Lipschitz function class on $\mathbb{R}^{d+1}$ as the evaluation class, which is defined, for some finite $B>0$, as

\[
\mathcal{F}^{1}:=\big\{f:\mathbb{R}^{d+1}\mapsto\mathbb{R}\ \big|\ |f(x)-f(y)|\leq\|x-y\|\ \text{for all }x,y\in\mathbb{R}^{d+1}\text{ and }\|f\|_{\infty}\leq B\big\}.
\tag{3.2}
\]

In Theorem 3.1, we consider the bounded support case, where $d_{\mathcal{F}}=W_{1}$; in Theorem 3.2, we extend the result to the unbounded support case; and in Theorem 3.3, we further extend the result to the case where the dimension of the reference distribution is arbitrary.

We first present a result for the case where $\mu$ is supported on a compact subset $[-M,M]^{d}\subset\mathbb{R}^{d}$ and $\nu$ is supported on $[-M,M]\subset\mathbb{R}$ for a finite $M>0$.

Theorem 3.1.

Suppose that the target $\mu$ is supported on $[-M,M]^{d}\subset\mathbb{R}^{d}$ and the reference $\nu$ is supported on $[-M,M]\subset\mathbb{R}$ for a finite $M>0$, and that Assumption 2 holds. Let the outputs of $g_{\theta}$ and $e_{\varphi}$ lie in $[-M,M]^{d}$ and $[-M,M]$ for all $g_{\theta}\in\mathcal{G}_{NN}$ and $e_{\varphi}\in\mathcal{E}_{NN}$, respectively. By specifying the three network structures as $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}=C_{1}dn$, and $W_{3}^{2}L_{3}=C_{2}n$ for some constants $12\leq C_{1},C_{2}\leq 384$ and properly choosing the parameters, we have

\[
\mathbb{E}\,d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq C_{0}\sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}},
\]

where $C_{0}>0$ is a constant independent of $d$ and $n$.

The prefactor $C_{0}\sqrt{d}$ in the error bound depends on the dimension only through $d^{1/2}$. This is different from existing works, where the dependence of the prefactor on $d$ is either not clearly described or is exponential. In high-dimensional settings with large $d$, this makes a substantial difference in the quality of the error bounds. These remarks apply to all the results stated below.

The next theorem deals with the case of unbounded support.

Theorem 3.2.

Suppose Assumptions 1 and 2 hold and Condition 1 is satisfied. By specifying the structures of the three network classes as $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}=C_{1}dn$, and $W_{3}^{2}L_{3}=C_{2}n$ for some constants $12\leq C_{1},C_{2}\leq 384$ and properly choosing the parameters, we have

\[
\mathbb{E}\,d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq\min\big\{C_{0}\sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}},\ C_{d}\,n^{-\frac{1}{d+1}}\log n\big\},
\]

where $C_{0}$ is a constant independent of $d$ and $n$, while $C_{d}$ depends on $d$.

Note that two methods are used to bound the stochastic errors (see Appendix), which leads to two different bounds: one has an explicit $\sqrt{d}$ prefactor at the cost of an additional $\log n$ factor; the other has an implicit prefactor but a lower power of $\log n$. Hence there is a tradeoff between the explicitness of the prefactor and the order of $\log n$.

Our next result generalizes the above to the case when the reference distribution $\nu$ is supported on $\mathbb{R}^{k}$ for $k\in\mathbb{N}_{+}$.

Assumption 3.

The target distribution $\mu$ on $\mathbb{R}^{d}$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{d}$, the reference distribution $\nu$ on $\mathbb{R}^{k}$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{k}$, and $k\ll d$.

With the above assumption, we have the following theorem, which provides theoretical guarantees for a reference $\nu$ of any dimension.

Theorem 3.3.

Suppose Assumptions 1 and 3 hold and Condition 1 is satisfied. By specifying the three network structures as $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}=C_{1}dn$, and $W_{3}^{2}L_{3}=C_{2}kn$ for some constants $12\leq C_{1},C_{2}\leq 384$ and properly choosing the parameters, we have

\[
\mathbb{E}\,d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq\min\big\{C_{0}\sqrt{d}\,n^{-\frac{1}{d+k}}(\log n)^{1+\frac{1}{d+k}},\ C_{d}\,n^{-\frac{1}{d+k}}\log n\big\},
\]

where $C_{0}$ is a constant independent of $d$ and $n$, while $C_{d}$ depends on $d$.

Note that the error bounds established in Theorems 3.1-3.3 are tight up to a logarithmic factor, since the minimax rate, measured in the Wasserstein distance, for learning distributions when the Lipschitz evaluation class is defined on $\mathbb{R}^{d}$ is $\tilde{O}(n^{-\frac{1}{d}})$ (Liang, 2020).

4 Approximation and stochastic errors

In this section we present a novel inequality for decomposing the total error into approximation and stochastic errors and establish bounds on these errors.

4.1 Decomposition of the estimation error

Define the approximation error of a function class $\mathcal{F}$ to another function class $\mathcal{H}$ by

\[
\mathcal{E}(\mathcal{H},\mathcal{F}):=\sup_{h\in\mathcal{H}}\inf_{f\in\mathcal{F}}\|h-f\|_{\infty}.
\]

We decompose the Dudley distance $d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})$ between the latent joint distribution and the data joint distribution into four error terms:

  • the approximation error of the discriminator class $\mathcal{F}_{NN}$ to $\mathcal{F}^{1}$:
\[
\mathcal{E}_{1}=\mathcal{E}(\mathcal{F}^{1},\mathcal{F}_{NN}),
\]
  • the approximation error of the generator and encoder classes:
\[
\mathcal{E}_{2}=\inf_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big),
\]
  • the stochastic error for the latent joint distribution $\hat{\boldsymbol{\nu}}$:
\[
\mathcal{E}_{3}=\sup_{f_{\omega}\in\mathcal{F}^{1}}\mathbb{E}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z),
\]
  • the stochastic error for the data joint distribution $\hat{\boldsymbol{\mu}}$:
\[
\mathcal{E}_{4}=\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))-\mathbb{E}f_{\omega}(x,\hat{e}(x)).
\]
Lemma 4.1.

Let $(\hat{g}_{\theta},\hat{e}_{\varphi})$ be the bidirectional GAN solution in (2.1) and $\mathcal{F}^{1}$ the uniformly bounded 1-Lipschitz function class defined in (3.2). Then the Dudley distance between the latent joint distribution $\hat{\boldsymbol{\nu}}=(\hat{g}_{\theta},I)_{\#}\nu$ and the data joint distribution $\hat{\boldsymbol{\mu}}=(I,\hat{e}_{\varphi})_{\#}\mu$ can be decomposed as follows:

\[
d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq 2\mathcal{E}_{1}+\mathcal{E}_{2}+\mathcal{E}_{3}+\mathcal{E}_{4}.
\tag{4.1}
\]

The novel decomposition (4.1) is fundamental to our error analysis. Based on (4.1), we bound each error term on its right-hand side and balance the bounds to obtain an overall bound for the bidirectional GAN estimation error.

To prove Lemma 4.1, we introduce the following useful inequality: for any two probability distributions, the difference between the IPMs with two distinct evaluation classes does not exceed twice the approximation error between the two evaluation classes. That is, for any probability distributions $\mu$ and $\nu$ and symmetric function classes $\mathcal{F}$ and $\mathcal{H}$,

\[
d_{\mathcal{H}}(\mu,\nu)-d_{\mathcal{F}}(\mu,\nu)\leq 2\mathcal{E}(\mathcal{H},\mathcal{F}).
\tag{4.2}
\]

It is easy to check that (4.2) still holds if we replace $d_{\mathcal{H}}(\mu,\nu)$ by its empirical version $\hat{d}_{\mathcal{H}}(\mu,\nu):=\sup_{h\in\mathcal{H}}[\hat{\mathbb{E}}_{\mu}h-\hat{\mathbb{E}}_{\nu}h]$.
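For completeness, here is a one-line verification of (4.2) (our own, following the definitions above): for any $h\in\mathcal{H}$ and $\varepsilon>0$, choose $f\in\mathcal{F}$ with $\|h-f\|_{\infty}\leq\mathcal{E}(\mathcal{H},\mathcal{F})+\varepsilon$; then

\[
\mathbb{E}_{\mu}h-\mathbb{E}_{\nu}h
\leq\mathbb{E}_{\mu}f-\mathbb{E}_{\nu}f+2\|h-f\|_{\infty}
\leq d_{\mathcal{F}}(\mu,\nu)+2\mathcal{E}(\mathcal{H},\mathcal{F})+2\varepsilon.
\]

Taking the supremum over $h\in\mathcal{H}$ and letting $\varepsilon\to 0$ gives (4.2); the same argument works with expectations replaced by empirical expectations.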

Proof of Lemma 4.1.

We have

\begin{align*}
d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})=&\ \sup_{f_{\omega}\in\mathcal{F}^{1}}\mathbb{E}f_{\omega}(\hat{g}(z),z)-\mathbb{E}f_{\omega}(x,\hat{e}(x))\\
\leq&\ \sup_{f_{\omega}\in\mathcal{F}^{1}}\mathbb{E}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)+\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))\\
&\ +\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))-\mathbb{E}f_{\omega}(x,\hat{e}(x))\\
=&\ \mathcal{E}_{3}+\mathcal{E}_{4}+\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x)).
\end{align*}

Denote $A:=\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))$. By (4.2) and the optimality of the bidirectional GAN solution, $A$ satisfies

\begin{align*}
A&=\sup_{f_{\omega}\in\mathcal{F}^{1}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(\hat{g}(z_{i}),z_{i})-f_{\omega}(x_{i},\hat{e}(x_{i}))\Big)\\
&\leq\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(\hat{g}(z_{i}),z_{i})-f_{\omega}(x_{i},\hat{e}(x_{i}))\Big)+2\mathcal{E}(\mathcal{F}^{1},\mathcal{F}_{NN})\\
&=\inf_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big)+2\mathcal{E}_{1}\\
&=2\mathcal{E}_{1}+\mathcal{E}_{2}.
\end{align*}

Note that we cannot directly apply the symmetrization technique (see Appendix) to $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$, since $e^{*}$ and $g^{*}$ are correlated with the $x_{i}$ and $z_{i}$. However, this problem can be solved by replacing the samples $(x_{i},z_{i})$ in the empirical terms in $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ with ghost samples $(x^{\prime}_{i},z^{\prime}_{i})$ independent of $(x_{i},z_{i})$, and by replacing $g^{*}$ and $e^{*}$ with $g^{**}$ and $e^{**}$ obtained from the ghost samples, respectively. That is, we replace $\hat{\mathbb{E}}f_{\omega}(g^{*}(z),z)$ and $\hat{\mathbb{E}}f_{\omega}(x,e^{*}(x))$ with $\hat{\mathbb{E}}f_{\omega}(g^{**}(z^{\prime}),z^{\prime})$ and $\hat{\mathbb{E}}f_{\omega}(x^{\prime},e^{**}(x^{\prime}))$ in $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$, respectively. Then we can proceed with the same proof of Lemma 4.1 and apply the symmetrization technique to $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$, since $(g^{*}(z_{i}),z_{i})$ and $(g^{**}(z^{\prime}_{i}),z^{\prime}_{i})$ have the same distribution. To simplify the notation, we write $\hat{\mathbb{E}}f_{\omega}(g^{*}(z),z)$ and $\hat{\mathbb{E}}f_{\omega}(x,e^{*}(x))$ for $\hat{\mathbb{E}}f_{\omega}(g^{**}(z^{\prime}),z^{\prime})$ and $\hat{\mathbb{E}}f_{\omega}(x^{\prime},e^{**}(x^{\prime}))$, respectively.

4.2 Approximation errors

We now discuss the errors due to the discriminator approximation and the generator and encoder approximation.

4.2.1 The discriminator approximation error $\mathcal{E}_{1}$

The discriminator approximation error $\mathcal{E}_{1}$ describes how well the discriminator neural network class approximates functions in the Lipschitz class $\mathcal{F}^{1}$. Lemma 4.2 below can be applied to obtain the neural network approximation error for Lipschitz functions. It leads to a quantitative, non-asymptotic approximation rate in terms of the width and depth of the neural networks when bounding $\mathcal{E}_{1}$.

Lemma 4.2 (Shen et al. (2021)).

Let $f$ be a Lipschitz continuous function defined on $[-R,R]^{d}$. For arbitrary $W,L\in\mathbb{N}_{+}$, there exists a function $\psi$ implemented by a ReLU feedforward neural network with width $W$ and depth $L$ such that

\[
\|f-\psi\|_{\infty}=O\big(\sqrt{d}\,R(WL)^{-\frac{2}{d}}\big).
\]

By Lemma 4.2 and our choice of the architecture of the discriminator class $\mathcal{F}_{NN}$ in the theorems, we have $\mathcal{E}_{1}=O\big(\sqrt{d}(W_{1}L_{1})^{-\frac{2}{d+1}}\log n\big)$. Lemma 4.2 also informs us how to choose the architecture of the discriminator network based on how small we want the approximation error $\mathcal{E}_{1}$ to be. By setting $(W_{1}L_{1})^{2}\geq n$, $\mathcal{E}_{1}$ is dominated by the stochastic terms $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$.
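As an illustration of these sizing rules, the following Python sketch returns one admissible width/depth choice satisfying $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}\gtrsim C_{1}dn$, and $W_{3}^{2}L_{3}\gtrsim C_{2}n$; the particular depth choice is arbitrary and not prescribed by the paper.

```python
import math

def size_networks(n, d, c1=12, c2=12):
    """One admissible (width, depth) choice for discriminator, generator, encoder."""
    L1 = max(2, math.ceil(math.log2(n)))             # arbitrary depth choice
    W1 = math.ceil(math.ceil(math.sqrt(n)) / L1)     # ensures W1 * L1 >= ceil(sqrt(n))

    L2 = L3 = max(2, math.ceil(math.log2(n)))
    W2 = math.ceil(math.sqrt(c1 * d * n / L2))       # W2^2 * L2 >= c1 * d * n
    W3 = math.ceil(math.sqrt(c2 * n / L3))           # W3^2 * L3 >= c2 * n
    return {"discriminator": (W1, L1), "generator": (W2, L2), "encoder": (W3, L3)}

print(size_networks(n=10_000, d=32))
```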

4.2.2 The generator and encoder approximation error $\mathcal{E}_{2}$

The generator and encoder approximation error $\mathcal{E}_{2}$ describes how powerful the generator and encoder classes are in pushing the empirical distributions $\hat{\mu}_{n}$ and $\hat{\nu}_{n}$ to each other. A natural question is:

  • Can we find generator and encoder neural network functions such that $\mathcal{E}_{2}=0$?

Most of the current literature on the error analysis of GANs applies optimal transport theory (Villani, 2008) to bound an error term similar to $\mathcal{E}_{2}$; see, for example, Chen et al. (2020). However, the existence of an optimal transport map from $\mathbb{R}$ to $\mathbb{R}^{d}$ is not guaranteed. Therefore, the existing analyses of GANs can only handle the scenario in which the reference and the target data distributions are assumed to have the same dimension. This equal-dimensionality assumption is not satisfied in the actual training of GANs or bidirectional GANs in many applications. Here, instead of using optimal transport theory, we establish the approximation result in Theorem 4.3, which enables us to forgo the equal-dimensionality assumption.

Theorem 4.3.

Suppose that $\nu$, supported on $\mathbb{R}$, and $\mu$, supported on $\mathbb{R}^{d}$, are both absolutely continuous with respect to the Lebesgue measure, and that $z_{i}$, $1\leq i\leq n$, and $x_{i}$, $1\leq i\leq n$, are i.i.d. samples from $\nu$ and $\mu$, respectively. Then there exist generator and encoder neural network functions $g:\mathbb{R}\mapsto\mathbb{R}^{d}$ and $e:\mathbb{R}^{d}\mapsto\mathbb{R}$ such that $g$ and $e$ are inverse bijections of each other between $\{z_{i}:1\leq i\leq n\}$ and $\{x_{i}:1\leq i\leq n\}$, up to a permutation. Moreover, such neural network functions $g$ and $e$ can be obtained by properly specifying $W_{2}^{2}L_{2}=c_{2}dn$ and $W_{3}^{2}L_{3}=c_{3}n$ for some constants $12\leq c_{2},c_{3}\leq 384$.

Proof.

By the absolute continuity of $\nu$ and $\mu$, the $z_{i}$'s and $x_{i}$'s are all distinct almost surely. We can reorder the $z_{i}$'s from smallest to largest, so that $z_{1}<z_{2}<\ldots<z_{n}$. Let $z_{i+1/2}$ be any point between $z_{i}$ and $z_{i+1}$ for $i\in\{1,2,\ldots,n-1\}$. We define the continuous piecewise linear function $g:\mathbb{R}\mapsto\mathbb{R}^{d}$ by

\[
g(z)=\begin{cases}
x_{1} & z\leq z_{1},\\
\frac{z-z_{i+1/2}}{z_{i}-z_{i+1/2}}\,x_{i}+\frac{z-z_{i}}{z_{i+1/2}-z_{i}}\,x_{i+1} & z\in(z_{i},z_{i+1/2}),\ i=1,\ldots,n-1,\\
x_{i+1} & z\in[z_{i+1/2},z_{i+1}],\ i=1,\ldots,n-2,\\
x_{n} & z\geq z_{n-1+1/2}.
\end{cases}
\]

By Yang et al. (2021, Lemma 3.1), $g\in\mathcal{NN}(W_{2},L_{2})$ if $n\leq(W_{2}-d-1)\left\lfloor\frac{W_{2}-d-1}{6d}\right\rfloor\left\lfloor\frac{L_{2}}{2}\right\rfloor$. Taking $n=(W_{2}-d-1)\left\lfloor\frac{W_{2}-d-1}{6d}\right\rfloor\left\lfloor\frac{L_{2}}{2}\right\rfloor$, a simple calculation shows that $W^{2}_{2}L_{2}=cdn$ for some constant $12\leq c\leq 384$. The neural network function $e$ can be constructed in the same way, using the fact that the first coordinates of the $x_{i}$'s are distinct almost surely. ∎
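The following numpy sketch (ours, for illustration) constructs the piecewise linear map $g$ from the proof and checks that it sends each $z_{i}$ to its paired $x_{i}$; it does not implement the ReLU network realization from Yang et al. (2021, Lemma 3.1).

```python
import numpy as np

def build_piecewise_g(z, x):
    """Piecewise linear g: R -> R^d with g(z_i) = x_i for all sample points."""
    order = np.argsort(z)
    zs, xs = z[order], x[order]                # z_(1) < ... < z_(n) almost surely
    z_half = (zs[:-1] + zs[1:]) / 2.0          # intermediate points z_{i+1/2}

    def g(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        out = np.empty((t.size, x.shape[1]))
        for j, s in enumerate(t):
            if s <= zs[0]:
                out[j] = xs[0]                 # constant piece before z_1
            elif s >= z_half[-1]:
                out[j] = xs[-1]                # constant piece after z_{n-1+1/2}
            else:
                i = np.searchsorted(zs, s, side="right") - 1   # z_i <= s < z_{i+1}
                if s >= z_half[i]:
                    out[j] = xs[i + 1]         # flat piece [z_{i+1/2}, z_{i+1}]
                else:                          # linear piece (z_i, z_{i+1/2})
                    w = (s - zs[i]) / (z_half[i] - zs[i])
                    out[j] = (1.0 - w) * xs[i] + w * xs[i + 1]
        return out

    return g

rng = np.random.default_rng(0)
z = rng.normal(size=10)
x = rng.normal(size=(10, 3))
g = build_piecewise_g(z, x)
assert np.allclose(g(z), x)   # g matches every empirical pair (z_i, x_i) exactly
```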

When the number of point masses of an empirical distribution is moderate relative to the size of the neural networks, its pushforward by a neural network can exactly match any other empirical distribution with the same number of point masses.

Theorem 4.3 provides an effective way to specify the architecture of the generator and encoder classes. According to this theorem, we can take $n=\frac{W_{2}-d}{2}\left\lfloor\frac{W_{2}-d}{6d}\right\rfloor\left\lfloor\frac{L_{2}}{2}\right\rfloor+2=\frac{W_{3}-1}{2}\left\lfloor\frac{W_{3}-1}{6}\right\rfloor\left\lfloor\frac{L_{3}}{2}\right\rfloor+2$, which gives rise to $W_{2}^{2}L_{2}/d\asymp W_{3}^{2}L_{3}\asymp n$. More importantly, Theorem 4.3 can be applied to bound $\mathcal{E}_{2}$ as follows:

\begin{align*}
\mathcal{E}_{2}=&\ \inf_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big)\\
\leq&\ \inf_{g_{\theta}\in\mathcal{G}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},z_{i})\Big)\\
&\ +\inf_{e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(x_{i},z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big)\\
=&\ 0,
\end{align*}

where in the last step we simply reorder the $z_{i}$'s and $x_{i}$'s as in the proof of Theorem 4.3. Therefore, this error term is eliminated exactly.

4.3 Stochastic errors

The stochastic error $\mathcal{E}_{3}$ (respectively $\mathcal{E}_{4}$) quantifies how close the empirical distribution and the true latent joint distribution (respectively data joint distribution) are, with the Lipschitz class $\mathcal{F}^{1}$ as the evaluation class under the IPM. We apply the refined Dudley inequality (Schreuder, 2020), stated in Lemma 4.4 below (see also Lemma C.1 in the Appendix), to bound $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$.

Lemma 4.4 (Refined Dudley Inequality).

For a symmetric function class $\mathcal{F}$ with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq M$, we have

\[
\mathbb{E}[d_{\mathcal{F}}(\hat{\mu}_{n},\mu)]\leq\inf_{0<\delta<M}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{M}\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})}\,d\epsilon\right).
\]

The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) has the drawback that if the covering number $\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})$ increases too fast as $\epsilon$ goes to 0, the resulting upper bound is infinite and hence uninformative. The refined Dudley inequality circumvents this problem by integrating $\epsilon$ only from some $\delta>0$, as shown in Lemma 4.4, which also indicates that $\mathbb{E}\mathcal{E}_{3}$ scales with the covering number $\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})$.

By calculating the covering number of $\mathcal{F}^{1}$ and applying the refined Dudley inequality, we obtain the upper bound

\[
\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\}=O\left(C_{d}\,n^{-\frac{1}{d+1}}\log n\ \wedge\ \sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right).
\tag{4.3}
\]

5 Related work

Recently, several impressive works have studied the challenging problem of the convergence properties of unidirectional GANs. Arora et al. (2017) noted that the training of GANs may not generalize well, in the sense that even if training appears successful, the trained distribution may be far from the target distribution in standard metrics. On the other hand, Bai et al. (2019) showed that GANs can learn distributions in Wasserstein distance with polynomial sample complexity. Liang (2020) studied the rates of convergence of a class of GANs, including Wasserstein, Sobolev, and MMD GANs, and also established the nonparametric minimax optimal rate under the Sobolev IPM. The results of Bai et al. (2019) and Liang (2020) require invertible generator networks, meaning all the weight matrices need to be full-rank and the activation function needs to be the invertible leaky ReLU. Chen et al. (2020) established an upper bound for the estimation error rate under Hölder evaluation and target density classes, where $\mathcal{H}^{\beta}$ is the Hölder class with regularity $\beta$ and the density of the target $\mu$ is assumed to belong to $\mathcal{H}^{\alpha}$. They assumed that the reference distribution has the same dimension as the target distribution and applied optimal transport theory to control the generator approximation error. However, how the prefactor in the error bounds depends on the dimension $d$ in the existing results (Liang, 2020; Chen et al., 2020) is either not clearly described or is exponential. In high-dimensional settings with large $d$, this makes a substantial difference in the quality of the error bounds.

Singh et al. (2019) studied minimax convergence rates of nonparametric density estimation under a class of adversarial losses and investigated how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate; they also discussed connections to learning generative models in a minimax statistical sense. Uppal et al. (2019) generalized the idea of the Sobolev IPM to the Besov IPM, where both the target density class and the evaluation class are Besov classes, and showed how their results imply bounds on the statistical error of a GAN.

These results provide important insights into the understanding of GANs. However, as mentioned earlier, some of the assumptions made in these results, including equal dimensions of the reference and target distributions and bounded support of the distributions, are not satisfied in the training of GANs in practice. Our results avoid these assumptions. Moreover, the prefactors in our error bounds are explicitly shown to depend on the square root of the dimension $d$. Finally, the aforementioned results only dealt with unidirectional GANs; our work is the first to address the convergence properties of bidirectional GANs.

6 Conclusion

This paper derives error bounds for bidirectional GANs under the Dudley distance between the latent joint distribution and the data joint distribution. The results are established without two crucial conditions commonly assumed in the existing literature: equal dimensionality of the reference and target distributions and bounded support for these distributions. Additionally, this work contributes to neural network approximation theory by constructing neural network functions such that the pushforward of an empirical distribution can perfectly approximate another arbitrary empirical distribution of a different dimension, as long as their numbers of point masses are equal. A novel decomposition of the integral probability metric is also developed for the error analysis of bidirectional GANs, which can be useful in other generative learning problems.

A limitation of our results, as well as of all the existing results on the convergence properties of GANs, is that they suffer from the curse of dimensionality, which cannot be circumvented merely by imposing smoothness assumptions. In many applications, high-dimensional complex data such as images, texts, and natural languages tend to be supported on approximate lower-dimensional manifolds. It is desirable to take such structure into account in the theoretical analysis. An important extension of the present results is to show that bidirectional GANs can circumvent the curse of dimensionality if the target distribution is assumed to be supported on an approximate lower-dimensional manifold. This appears to be a technically challenging problem and will be pursued in our future work.

Acknowledgements

The authors wish to thank the three anonymous reviewers for their insightful comments and constructive suggestions that helped improve the paper significantly.

The work of J. Huang is partially supported by the U.S. NSF grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE. The work of Y. Wang is supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416 and HK Innovation Technology Fund ITS/044/18FX, as well as Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications.

References

  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In ICML.
  • Arora et al. (2017) Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (gans). In International Conference on Machine Learning, pages 224–232. PMLR.
  • Bai et al. (2019) Bai, Y., Ma, T., and Risteski, A. (2019). Approximability of discriminators implies diversity in GANs. In International Conference on Learning Representations.
  • Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale gan training for high fidelity natural image synthesis.
  • Chen et al. (2020) Chen, M., Liao, W., Zha, H., and Zhao, T. (2020). Statistical guarantees of generative adversarial networks for distribution estimation. arXiv preprint arXiv:2002.03938.
  • Donahue et al. (2016) Donahue, J., Krähenbühl, P., and Darrell, T. (2016). Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • Dudley (1967) Dudley, R. (1967). The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330.
  • Dudley (2018) Dudley, R. M. (2018). Real Analysis and Probability. CRC Press.
  • Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
  • Gottlieb et al. (2013) Gottlieb, L.-A., Kontorovich, A., and Krauthgamer, R. (2013). Efficient regression in metric spaces via approximate lipschitz extension. In International Workshop on Similarity-Based Pattern Recognition, pages 43–58. Springer.
  • Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of gans for improved quality, stability, and variation.
  • Karras et al. (2019) Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks.
  • Liang (2020) Liang, T. (2020). How well generative adversarial networks learn distributions.
  • Liu et al. (2017) Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991.
  • Lu and Lu (2020) Lu, Y. and Lu, J. (2020). A universal approximation theorem of deep neural networks for expressing distributions. arXiv preprint arXiv:2004.08867.
  • Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
  • Mohri et al. (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
  • Müller (1997) Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pages 429–443.
  • Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks.
  • Reed et al. (2016) Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative adversarial text to image synthesis. In ICML.
  • Schreuder (2020) Schreuder, N. (2020). Bounding the expectation of the supremum of empirical processes indexed by hölder classes.
  • Shen et al. (2020) Shen, X., Zhang, T., and Chen, K. (2020). Bidirectional generative modeling using adversarial gradient estimation. arXiv preprint arXiv:2002.09161.
  • Shen et al. (2019) Shen, Z., Yang, H., and Zhang, S. (2019). Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497.
  • Singh et al. (2019) Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., and Póczos, B. (2019). Nonparametric density estimation with adversarial losses. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10246–10257.
  • Srebro and Sridharan (2010) Srebro, N. and Sridharan, K. (2010). Note on refined Dudley integral covering number bound. Unpublished results. http://ttic.uchicago.edu/karthik/dudley.pdf.
  • Uppal et al. (2019) Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation & convergence rates for gans under besov ipm losses. arXiv preprint arXiv:1902.03511.
  • Van der Vaart and Wellner (1996) Van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes. Springer.
  • Villani (2008) Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
  • Yang et al. (2021) Yang, Y., Li, Z., and Wang, Y. (2021). On the capacity of deep generative networks for approximating distributions. arXiv preprint arXiv:2101.12353.
  • Zhang et al. (2018) Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2018). On the discrimination-generalization tradeoff in GANs. In International Conference on Learning Representations.
  • Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.

Appendix

In the appendix, we first prove Theorem 3.2, and then Theorems 3.1 and 3.3.

Appendix A Notations and Preliminaries

We use $\sigma$ to denote the ReLU activation function in neural networks, $\sigma(x)=\max\{x,0\}$. Unless otherwise indicated, $\|\cdot\|$ denotes the $L_{2}$ norm. For any function $g$, let $\|g\|_{\infty}=\sup_{x}\|g(x)\|$. We use the notations $O(\cdot)$ and $\tilde{O}(\cdot)$ to express the order of a function in slightly different ways: $O(\cdot)$ omits a universal constant not depending on $d$, while $\tilde{O}(\cdot)$ omits a constant depending on $d$. We use $B_{2}^{d}(a)$ to denote the $L_{2}$ ball in $\mathbb{R}^{d}$ centered at $\mathbf{0}$ with radius $a$. Let $g_{\#}\nu$ be the pushforward distribution of $\nu$ by a function $g$, in the sense that $g_{\#}\nu(A)=\nu(g^{-1}(A))$ for any measurable set $A$.

The $r$-covering number of a class $\mathcal{F}$ with respect to a norm $\|\cdot\|$ is the minimum number of $\|\cdot\|$-balls of radius $r$ needed to cover $\mathcal{F}$, denoted by $\mathcal{N}(r,\mathcal{F},\|\cdot\|)$. We denote by $\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))$ the covering number of $\mathcal{F}$ with respect to the $L_{2}(P_{n})$ norm, defined by $\|f\|^{2}_{L_{2}(P_{n})}=\frac{1}{n}\sum_{i=1}^{n}\|f(X_{i})\|^{2}$, where $X_{1},\ldots,X_{n}$ are the empirical samples. We denote by $\mathcal{N}(r,\mathcal{F},L_{\infty}(P_{n}))$ the covering number of $\mathcal{F}$ with respect to the $L_{\infty}(P_{n})$ norm, defined by $\|f\|_{L_{\infty}(P_{n})}=\max_{1\leq i\leq n}\|f(X_{i})\|$. It is easy to check that

\[
\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},L_{\infty}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},\|\cdot\|_{\infty}).
\]

Appendix B Restriction on the domain of the uniformly bounded Lipschitz function class $\mathcal{F}^{1}$

So far, most related works assume that the target distribution $\mu$ is supported on a compact set; see, for example, Chen et al. (2020) and Liang (2020). To remove the compact support assumption, we need Assumption 1, i.e., the tails of the target $\mu$ and the reference $\nu$ are subexponential. Define $\mathcal{F}_{n}^{1}:=\{f|_{B_{2}^{d+1}(\sqrt{2}\log n)}:f\in\mathcal{F}^{1}\}$. In this section, we show that proving Theorem 3.2 is equivalent to establishing the same convergence rate with the domain-restricted function class $\mathcal{F}_{n}^{1}$ as the evaluation class.

Under Assumption 1 and by the Markov inequality, we have

\[
P_{\nu}(\|z\|>\log n)\leq\frac{\mathbb{E}_{\nu}\|z\|\mathbbm{1}_{\{\|z\|>\log n\}}}{\log n}=O\big(n^{-\frac{(\log n)^{\delta}}{d}}/\log n\big).
\tag{B.1}
\]

The Dudley distance between the latent joint distribution $\hat{\boldsymbol{\nu}}$ and the data joint distribution $\hat{\boldsymbol{\mu}}$ is

\[
d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})=\sup_{f\in\mathcal{F}^{1}}\mathbb{E}f(\hat{g}(z),z)-\mathbb{E}f(x,\hat{e}(x)).
\tag{B.2}
\]

The first term above can be decomposed as

\[
\mathbb{E}f(\hat{g}(z),z)=\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|\leq\log n}+\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|>\log n}.
\tag{B.3}
\]

For any $f\in\mathcal{F}^{1}$ and a fixed point $z_{0}$ with $\|z_{0}\|\leq\log n$, the Lipschitzness of $f$ implies that the second term above satisfies

\begin{align*}
|\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|>\log n}|\leq&\ |\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|>\log n}-\mathbb{E}f(\hat{g}(z_{0}),z_{0})\mathbbm{1}_{\|z\|>\log n}|+|\mathbb{E}f(\hat{g}(z_{0}),z_{0})\mathbbm{1}_{\|z\|>\log n}|\\
\leq&\ \mathbb{E}\|(\hat{g}(z)-\hat{g}(z_{0}),z-z_{0})\|\mathbbm{1}_{\|z\|>\log n}+BP_{\nu}(\|z\|>\log n)\\
\leq&\ \mathbb{E}(\|\hat{g}(z)-\hat{g}(z_{0})\|+\|z-z_{0}\|)\mathbbm{1}_{\|z\|>\log n}+BP_{\nu}(\|z\|>\log n)\\
\leq&\ 2(\log n)P_{\nu}(\|z\|>\log n)+\mathbb{E}\|z-z_{0}\|\mathbbm{1}_{\|z\|>\log n}+BP_{\nu}(\|z\|>\log n)\\
=&\ O\big(n^{-\frac{(\log n)^{\delta}}{d}}\big),
\end{align*}

where the second inequality is due to the Lipschitzness and boundedness of $f$, and the last equality is due to Assumption 1, (B.1), and the boundedness condition on $\hat{g}$. In the first term in (B.3), $f$ only acts on the increasing $L_{2}$ ball $B^{d+1}_{2}(\sqrt{2}\log n)$ because of Condition 1 and the indicator function $\mathbbm{1}_{\{\|z\|\leq\log n\}}$. We can apply the same argument to the second term in (B.2). Therefore, restricting the domain of $\mathcal{F}^{1}$ to $B^{d+1}_{2}(\sqrt{2}\log n)$ yields an equivalent problem. Hence, in order to prove the estimation error rate in Theorem 3.2, we only need to show that for the restricted evaluation function class $\mathcal{F}^{1}_{n}$,

\[
\mathbb{E}\,d_{\mathcal{F}_{n}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq C_{0}\sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}\,n^{-\frac{1}{d+1}}\log n.
\]

Given this fact, and to keep the notation simple, we write $\mathcal{F}^{1}$ for $\mathcal{F}_{n}^{1}$ in the following sections.

Remark 1.

The restriction on $\mathcal{F}^{1}$ is technically necessary for calculating the covering number of $\mathcal{F}^{1}$; we will see its use when bounding the stochastic errors $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ below.

Appendix C Stochastic errors

C.1 Bounding $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$

The stochastic errors $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ quantify how close the empirical distributions and the true latent joint distribution (data joint distribution) are, with the Lipschitz class $\mathcal{F}^{1}$ as the evaluation class under the IPM. We apply Lemma C.1 to bound $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$. We introduce two methods to bound $\max\{\mathcal{E}_{3},\mathcal{E}_{4}\}$, which give two different upper bounds. Both utilize the following lemma, which we prove later. A more detailed description of the refined Dudley inequality can be found in Srebro and Sridharan (2010) and Schreuder (2020).

Lemma C.1 (Refined Dudley Inequality).

For a symmetric function class $\mathcal{F}$ with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq M$, we have

\[
\mathbb{E}[d_{\mathcal{F}}(\hat{\mu}_{n},\mu)]\leq\inf_{0<\delta<M}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{M}\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})}\,d\epsilon\right).
\]
Remark 2.

The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) suffers from the problem that if the covering number $\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})$ increases too fast as $\epsilon$ goes to 0, the upper bound can be infinite. The refined Dudley inequality circumvents this problem by integrating $\epsilon$ only from some $\delta>0$, which also indicates that $\mathbb{E}\mathcal{E}_{3}$ scales with the covering number $\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})$.

C.1.1 The first method (explicit constant)

The first method provides an explicit constant depending on d, at the expense of a higher order of \log n in the upper bound. It relies on the next lemma (Gottlieb et al., 2013, Lemma 6), which reduces the problem of bounding the covering number of a Lipschitz function class to that of bounding the covering number of its domain.

Lemma C.2 (Gottlieb et al. (2013)).

Let \mathcal{F}^{L} be the collection of L-Lipschitz functions mapping the metric space (\mathcal{X},\rho) to [0,1]. Then the covering number of \mathcal{F}^{L} can be bounded in terms of the covering number of \mathcal{X} with respect to \rho as follows:

\displaystyle\mathcal{N}(\epsilon,\mathcal{F}^{L},\|\cdot\|_{\infty})\leq\left(\frac{8}{\epsilon}\right)^{\mathcal{N}(\epsilon/8L,\mathcal{X},\rho)}.

Now we apply Lemma C.2 to bound the covering number for the 1-Lipschitz class 𝒩(ϵ,1,)\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty}) by bounding the covering number for its domain 𝒩(ϵ,B2d+1(2logn),2)\mathcal{N}(\epsilon,B^{d+1}_{2}(\sqrt{2}\log n),\|\cdot\|_{2}). Define a new function class 12B\mathcal{F}^{\frac{1}{2B}} as

12B:={f+B2B:f1}.\displaystyle\mathcal{F}^{\frac{1}{2B}}:=\{\frac{f+B}{2B}:f\in\mathcal{F}^{1}\}.

Recall that \mathcal{F}^{1} is restricted to B^{d+1}_{2}(\sqrt{2}\log n). Obviously, \mathcal{F}^{\frac{1}{2B}} is a \frac{1}{2B}-Lipschitz function class mapping B^{d+1}_{2}(\sqrt{2}\log n) to [0,1]. A direct application of Lemma C.2 shows that

𝒩(ϵ,12B,)(8ϵ)𝒩(ϵB/4,B2d+1(2logn),2).\displaystyle\mathcal{N}(\epsilon,\mathcal{F}^{\frac{1}{2B}},\|\cdot\|_{\infty})\leq\left(\frac{8}{\epsilon}\right)^{\mathcal{N}(\epsilon B/4,B_{2}^{d+1}(\sqrt{2}\log n),\|\cdot\|_{2})}. (C.1)

By the definition of 12B\mathcal{F}^{\frac{1}{2B}}, the covering numbers satisfy

𝒩(2Bϵ,1,)=𝒩(ϵ,12B,).\displaystyle\mathcal{N}(2B\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})=\mathcal{N}(\epsilon,\mathcal{F}^{\frac{1}{2B}},\|\cdot\|_{\infty}). (C.2)

Note that B^{d+1}_{2}(\sqrt{2}\log n) is a subset of the cube [-\sqrt{2}\log n,\sqrt{2}\log n]^{d+1}, and this cube can be covered by finitely many \epsilon-balls in \mathbb{R}^{d+1}, each circumscribing a subcube of side length 2\epsilon/\sqrt{d+1}; at most \sqrt{2(d+1)}\log n/\epsilon such subcubes are needed along each coordinate. It follows that

\displaystyle\mathcal{N}(\epsilon,B^{d+1}_{2}(\sqrt{2}\log n),\|\cdot\|_{2})\leq\left(\frac{\sqrt{2(d+1)}\log n}{\epsilon}\right)^{d+1}. (C.3)

Combining (C.1), (C.2) and (C.3), we obtain an upper bound for the covering number of the 1-Lipschitz class 1\mathcal{F}^{1}

log𝒩(ϵ,1,)(82(d+1)lognϵ)d+1log16Bϵ.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq\left(\frac{8\sqrt{2(d+1)}\log n}{\epsilon}\right)^{d+1}\log\frac{16B}{\epsilon}. (C.4)

With the upper bound for the covering entropy in (C.4), a direct application of Lemma C.1 (see Section E for details) by taking δ=82(d+1)n1d+1(logn)1+1d+1\delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}} leads to

max{𝔼3,𝔼4}\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\} =O(dn1d+1(logn)1+1d+1+n1d+1(logn)1+1d+1)\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}+n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right) (C.5)
=O(dn1d+1(logn)1+1d+1).\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right). (C.6)

C.1.2 The second method (better order of logn\log n)

We now consider the second method, which leads to a better order for the \log n term in the upper bound, at the expense of the explicitness of the constant depending on d. The next lemma directly provides an upper bound for the covering number of the Lipschitz class, but with an implicit constant depending on d. It is a straightforward corollary of Van der Vaart and Wellner (1996, Theorem 2.7.1).

Lemma C.3.

Let 𝒳\mathcal{X} be a bounded, convex subset of d\mathbb{R}^{d} with nonempty interior. There exists a constant cdc_{d} depending only on dd such that

log𝒩(ϵ,1(𝒳),)cdλ(𝒳1)(1ϵ)d\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1}(\mathcal{X}),\|\cdot\|_{\infty})\leq c_{d}\lambda(\mathcal{X}^{1})\left(\frac{1}{\epsilon}\right)^{d}

for every ϵ>0\epsilon>0, where 1(𝒳)\mathcal{F}^{1}(\mathcal{X}) is the 1-Lipschitz function class defined on 𝒳\mathcal{X}, and λ(𝒳1)\lambda(\mathcal{X}^{1}) is the Lebesgue measure of the set {x:x𝒳<1}\{x:\|x-\mathcal{X}\|<1\}.

Applying Lemmas C.1 and C.3 (see Section E for details) by taking δ=n1d+1logn\delta=n^{-\frac{1}{d+1}}\log n yields

max{𝔼3,𝔼4}\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\} =O(Cdn1d+1logn),\displaystyle=O\left(C_{d}n^{-\frac{1}{d+1}}\log n\right), (C.7)

where CdC_{d} is some constant depending on dd. Combining (C.6) and (C.7), we get

max{𝔼3,𝔼4}\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\} =O(Cdn1d+1logndn1d+1(logn)1+1d+1).\displaystyle=O\left(C_{d}n^{-\frac{1}{d+1}}\log n\wedge\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right). (C.8)
Remark 3.

Here, we have a tradeoff between the logarithmic factor logn\log n and the explicitness of the constant depending on dd. If we want an explicit constant depending on dd, then we have the factor (logn)1+1d+1(\log n)^{1+\frac{1}{d+1}} in the upper bound. Later we will see that 𝔼3\mathbb{E}\mathcal{E}_{3} and 𝔼4\mathbb{E}\mathcal{E}_{4} are the dominating terms in the four error terms, hence the explicitness of the corresponding constant becomes important. Therefore, we list two different methods here to bound 𝔼3\mathbb{E}\mathcal{E}_{3} and 𝔼4\mathbb{E}\mathcal{E}_{4}.

C.2 Combination of the four error terms

With the upper bounds for the four error terms obtained above, we now consider \mathcal{E}_{1}-\mathcal{E}_{4} simultaneously to obtain the overall convergence rate. First, recall how we bound \mathcal{E}_{1} and \mathcal{E}_{2}. With Lemma 4.2, we have

1=O(d(W1L1)2d+1logn).\displaystyle\mathcal{E}_{1}=O\left(\sqrt{d}(W_{1}L_{1})^{-\frac{2}{d+1}}\log n\right). (C.9)

To control \mathcal{E}_{1} while keeping the architecture of the discriminator class \mathcal{F}_{NN} as small as possible, we let W_{1}L_{1}=\left\lceil\sqrt{n}\right\rceil, so that \mathcal{E}_{1}=O\left(\sqrt{d}n^{-\frac{1}{d+1}}\log n\right), which is dominated by \mathcal{E}_{3} and \mathcal{E}_{4}.

By Theorem 4.3, we can choose the architectures of the generator and encoder classes so that \mathcal{E}_{2} is perfectly controlled, i.e., \mathcal{E}_{2}=0.

We note that, because we imposed Condition 1 on both the generator and encoder classes, Theorem 4.3 cannot be applied if some \|x_{i}\| or \|z_{i}\| is greater than \log n, in which case \mathcal{E}_{2} cannot be perfectly controlled. We can still handle this case by considering the probability of the bad set.

Under Condition 1, on the nice set A:=\{\max_{1\leq i\leq n}\|x_{i}\|\leq\log n\}\cap\{\max_{1\leq i\leq n}\|z_{i}\|\leq\log n\}, we have \mathcal{E}_{2}=0. The probability of the nice set A has the following lower bound.

\displaystyle P(A) =P_{\mu}(\|x_{i}\|\leq\log n)^{n}\cdot P_{\nu}(\|z_{i}\|\leq\log n)^{n}
\displaystyle\geq(1-Cn^{-\frac{(\log n)^{\delta}}{d}})^{2n},\ \text{ for some constant $C>0$, by Assumption 1,}
\displaystyle\geq 1-Cn^{-\frac{(\log n)^{\delta}}{d}}\cdot(2n),\ \text{ for large $n$, by Bernoulli's inequality.}

The bad set A^{c} is the event on which \mathcal{E}_{2} may be positive; its probability has the following upper bound:

\displaystyle P(A^{c}) \leq Cn^{-\frac{(\log n)^{\delta}}{d}}\cdot(2n)
\displaystyle=O\left(n^{-\frac{(\log n)^{\delta^{\prime}}}{d}}\right),\ \text{ for any }\delta^{\prime}<\delta.

In Assumption 1, the (\log n)^{\delta} factor makes the tail of the target \mu strictly sub-exponential, which yields P(A^{c})\to 0; an exponential or heavier tail would lead to the undesired result P(A^{c})\to 1.

Now we are ready to obtain the desired result in Theorem 3.2. Recall that \mathcal{E}_{2}=0 on the nice set A. Combining the results above, we have

\displaystyle\mathbb{E}d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}}) \leq 2\mathcal{E}_{1}+\mathbb{E}[\mathcal{E}_{2}\mathbbm{1}_{A}]+\mathbb{E}[\mathcal{E}_{2}\mathbbm{1}_{A^{c}}]+\mathbb{E}\mathcal{E}_{3}+\mathbb{E}\mathcal{E}_{4}
\displaystyle\leq O\left(\sqrt{d}n^{-\frac{1}{d+1}}\log n+0+2BP(A^{c})+\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}n^{-\frac{1}{d+1}}\log n\right)
\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}n^{-\frac{1}{d+1}}\log n+n^{-\frac{(\log n)^{\delta^{\prime}}}{d}}\right)
\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}n^{-\frac{1}{d+1}}\log n\right),

which completes the proof of Theorem 3.2.

Appendix D Proof of Inequality (4.2)

For ease of reference, we restate inequality (4.2) as the following lemma.

Lemma 4.2.

For any symmetric function classes \mathcal{F} and \mathcal{H}, denote the approximation error (,)\mathcal{E}(\mathcal{H},\mathcal{F}) as

(,):=suphinffhf,\displaystyle\mathcal{E}(\mathcal{H},\mathcal{F}):=\underset{h\in\mathcal{H}}{\sup}\underset{f\in\mathcal{F}}{\inf}\|h-f\|_{\infty},

then for any probability distributions μ\mu and ν\nu,

d(μ,ν)d(μ,ν)2(,).\displaystyle d_{\mathcal{H}}(\mu,\nu)-d_{\mathcal{F}}(\mu,\nu)\leq 2\mathcal{E}(\mathcal{H},\mathcal{F}).
Proof of Lemma 4.2.

By the definition of supremum, for any ϵ>0\epsilon>0, there exists hϵh_{\epsilon}\in\mathcal{H} such that

d(μ,ν):\displaystyle d_{\mathcal{H}}(\mu,\nu): =suph[𝔼μh𝔼νh]\displaystyle=\underset{h\in\mathcal{H}}{\sup}[\mathbb{E}_{\mu}h-\mathbb{E}_{\nu}h]
𝔼μhϵ𝔼νhϵ+ϵ\displaystyle\leq\mathbb{E}_{\mu}h_{\epsilon}-\mathbb{E}_{\nu}h_{\epsilon}+\epsilon
=inff[𝔼μ(hϵf)𝔼ν(hϵf)+𝔼μ(f)𝔼ν(f)]+ϵ\displaystyle=\underset{f\in\mathcal{F}}{\inf}[\mathbb{E}_{\mu}(h_{\epsilon}-f)-\mathbb{E}_{\nu}(h_{\epsilon}-f)+\mathbb{E}_{\mu}(f)-\mathbb{E}_{\nu}(f)]+\epsilon
2inffhϵf+d(μ,ν)+ϵ\displaystyle\leq 2\underset{f\in\mathcal{F}}{\inf}\|h_{\epsilon}-f\|_{\infty}+d_{\mathcal{F}}(\mu,\nu)+\epsilon
2(,)+d(μ,ν)+ϵ,\displaystyle\leq 2\mathcal{E}(\mathcal{H},\mathcal{F})+d_{\mathcal{F}}(\mu,\nu)+\epsilon,

where the last line is due to the definition of \mathcal{E}(\mathcal{H},\mathcal{F}). Since \epsilon>0 is arbitrary, the claimed inequality follows. ∎

It is easy to check that if we replace d(μ,ν)d_{\mathcal{H}}(\mu,\nu) by d^(μ,ν):=suph[𝔼^μh𝔼^νh]\hat{d}_{\mathcal{H}}(\mu,\nu):=\underset{h\in\mathcal{H}}{\sup}[\hat{\mathbb{E}}_{\mu}h-\hat{\mathbb{E}}_{\nu}h], Lemma 4.2 still holds.
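Lemma 4.2 can also be checked numerically with small finite function classes represented by their value vectors on a common discrete support; the Python script below is only a sanity check of the inequality, with the classes and distributions chosen arbitrarily for illustration.

import numpy as np

rng = np.random.default_rng(0)
support_size = 21

# Two discrete distributions mu and nu on a common support (illustrative only).
mu = rng.dirichlet(np.ones(support_size))
nu = rng.dirichlet(np.ones(support_size))

# A symmetric class H (f in H implies -f in H) and a smaller symmetric class F inside it.
H_half = rng.uniform(-1.0, 1.0, size=(50, support_size))
H = np.vstack([H_half, -H_half])
F = np.vstack([H_half[:10], -H_half[:10]])

def ipm(fclass, p, q):
    # d_F(p, q) = sup_{f in fclass} [ E_p f - E_q f ]
    return np.max(fclass @ (p - q))

def approx_error(H, F):
    # E(H, F) = sup_{h in H} inf_{f in F} ||h - f||_infty on the common support
    return np.max([np.min(np.max(np.abs(F - h), axis=1)) for h in H])

lhs = ipm(H, mu, nu) - ipm(F, mu, nu)
rhs = 2 * approx_error(H, F)
print(f"d_H - d_F = {lhs:.4f} <= 2 E(H, F) = {rhs:.4f}")
assert lhs <= rhs + 1e-12

Here functions are identified with their value vectors on the support, so the sup-norm is simply a coordinate-wise maximum.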

Appendix E Bounding 𝔼3\mathbb{E}\mathcal{E}_{3} and 𝔼4\mathbb{E}\mathcal{E}_{4}

E.1 Method One

With the upper bound for the covering entropy (C.4), i.e.

log𝒩(ϵ,1,)(82(d+1)lognϵ)d+1log16Bϵ\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq\left(\frac{8\sqrt{2(d+1)}\log n}{\epsilon}\right)^{d+1}\log\frac{16B}{\epsilon}

and δ=82(d+1)n1d+1(logn)1+1d+1\delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}, applying Lemma C.1 we have

𝔼3\displaystyle\mathbb{E}\mathcal{E}_{3} =O(δ+n12δB(82(d+1)lognϵ)d+12(log16Bϵ)12𝑑ϵ)\displaystyle=O\left(\delta+n^{-\frac{1}{2}}\int_{\delta}^{B}\left(\frac{8\sqrt{2(d+1)}\log n}{\epsilon}\right)^{\frac{d+1}{2}}\left(\log\frac{16B}{\epsilon}\right)^{\frac{1}{2}}d\epsilon\right)
=O(δ+n12(82(d+1)logn)d+12(lognd+1)12δ1d+12)\displaystyle=O\left(\delta+n^{-\frac{1}{2}}(8\sqrt{2(d+1)}\log n)^{\frac{d+1}{2}}(\frac{\log n}{d+1})^{\frac{1}{2}}\delta^{1-\frac{d+1}{2}}\right)
=O(dn1d+1(logn)1+1d+1+n1d+1(logn)1+1d+1)\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}+n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right)
=O(dn1d+1(logn)1+1d+1),\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right),

where the second equality is due to

log16Bϵ\displaystyle\log\frac{16B}{\epsilon} =O(log1ϵ)=O(log(n1d+182(d+1)(logn)1+1d+1))=O(logn1d+1),\displaystyle=O\left(\log\frac{1}{\epsilon}\right)=O\left(\log\left(\frac{n^{\frac{1}{d+1}}}{8\sqrt{2(d+1)}(\log n)^{1+\frac{1}{d+1}}}\right)\right)=O\left(\log n^{\frac{1}{d+1}}\right),

and the third equality follows from simple algebra.
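For completeness, the simple algebra amounts to substituting \delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}} into the second term and collecting the exponents of n and \log n:

\displaystyle n^{-\frac{1}{2}}\left(8\sqrt{2(d+1)}\log n\right)^{\frac{d+1}{2}}\left(\frac{\log n}{d+1}\right)^{\frac{1}{2}}\delta^{1-\frac{d+1}{2}}=8\sqrt{2}\,n^{-\frac{1}{2}+\frac{d-1}{2(d+1)}}(\log n)^{\frac{d+2}{2}+\frac{(d+2)(1-d)}{2(d+1)}}=8\sqrt{2}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}},

so the integral term contributes O(n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}) with an absolute constant, while the term 4\delta contributes the O(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}) part.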

E.2 Method Two

By Lemma C.3, we have

log𝒩(ϵ,1,)cd(lognϵ)d+1.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq c_{d}\left(\frac{\log n}{\epsilon}\right)^{d+1}.

Taking δ=n1d+1logn\delta=n^{-\frac{1}{d+1}}\log n and applying Lemma C.1, we obtain

𝔼3\displaystyle\mathbb{E}\mathcal{E}_{3} =O(δ+(cdn)12(logn)d+12δM(1ϵ)d+12𝑑ϵ)\displaystyle=O\left(\delta+(\frac{c_{d}}{n})^{\frac{1}{2}}(\log n)^{\frac{d+1}{2}}\int_{\delta}^{M}(\frac{1}{\epsilon})^{\frac{d+1}{2}}d\epsilon\right)
=O~(δ+n12(logn)d+12δ1d+12)\displaystyle=\tilde{O}\left(\delta+n^{-\frac{1}{2}}(\log n)^{\frac{d+1}{2}}\delta^{1-\frac{d+1}{2}}\right)
=O~(n1d+1logn),\displaystyle=\tilde{O}\left(n^{-\frac{1}{d+1}}\log n\right),

where \tilde{O}(\cdot) omits constants depending on d.
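Spelled out, the last equality substitutes \delta=n^{-\frac{1}{d+1}}\log n into the second term:

\displaystyle n^{-\frac{1}{2}}(\log n)^{\frac{d+1}{2}}\delta^{1-\frac{d+1}{2}}=n^{-\frac{1}{2}+\frac{d-1}{2(d+1)}}(\log n)^{\frac{d+1}{2}+\frac{1-d}{2}}=n^{-\frac{1}{d+1}}\log n,

which matches the order of \delta itself, so both terms are of order n^{-\frac{1}{d+1}}\log n up to constants depending on d.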

Appendix F Proof of Lemma C.1

For completeness, we provide a proof of the refined Dudley inequality in Lemma C.1. We apply the standard symmetrization and chaining techniques; see, for example, Van der Vaart and Wellner (1996).

Proof.

Let Y_{1},\ldots,Y_{n} be random samples from \mu that are independent of the X_{i}'s. Then we have

\displaystyle\mathbb{E}d_{\mathcal{F}}(\hat{\mu}_{n},\mu) =\mathbb{E}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}f(X_{i})]
\displaystyle=\mathbb{E}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}\frac{1}{n}\sum_{i=1}^{n}f(Y_{i})]
\displaystyle\leq\mathbb{E}_{X,Y}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\frac{1}{n}\sum_{i=1}^{n}f(Y_{i})]
\displaystyle=\mathbb{E}_{X,Y,\epsilon}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f(X_{i})-f(Y_{i}))]
\displaystyle\leq 2\mathbb{E}\hat{\mathcal{R}}_{n}(\mathcal{F}),

where the first inequality is due to Jensen's inequality, and the third equality holds because f(X_{i})-f(Y_{i}) has a symmetric distribution.

Let \alpha_{0}=M and for any j\in\mathbb{N}_{+} let \alpha_{j}=2^{-j}M. For each j, let T_{j} be an \alpha_{j}-cover of \mathcal{F} w.r.t. L_{2}(P_{n}) such that |T_{j}|=\mathcal{N}(\alpha_{j},\mathcal{F},L_{2}(P_{n})). For each f\in\mathcal{F} and j, pick a function \hat{f}_{j}\in T_{j} such that \|\hat{f}_{j}-f\|_{L_{2}(P_{n})}<\alpha_{j}. Let \hat{f}_{0}=0; then for any N, we can express f by chaining as

\displaystyle f=f-\hat{f}_{N}+\sum_{j=1}^{N}(\hat{f}_{j}-\hat{f}_{j-1}).

Hence for any NN, we can express the empirical Rademacher complexity as

\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) =\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(f(X_{i})-\hat{f}_{N}(X_{i})+\sum_{j=1}^{N}(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i}))\right)
\displaystyle\leq\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(f(X_{i})-\hat{f}_{N}(X_{i})\right)+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i})\right)
\displaystyle\leq\|\epsilon\|_{L_{2}(P_{n})}\sup_{f\in\mathcal{F}}\|f-\hat{f}_{N}\|_{L_{2}(P_{n})}+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i})\right)
\displaystyle\leq\alpha_{N}+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i})\right),

where \epsilon=(\epsilon_{1},\ldots,\epsilon_{n}) and the second-to-last inequality is due to the Cauchy–Schwarz inequality. The second term is the sum of the empirical Rademacher complexities of the function classes \{f^{\prime}-f^{\prime\prime}:f^{\prime}\in T_{j},f^{\prime\prime}\in T_{j-1}\}, j=1,\ldots,N. Note that

\displaystyle\|\hat{f}_{j}-\hat{f}_{j-1}\|^{2}_{L_{2}(P_{n})} \leq\left(\|\hat{f}_{j}-f\|_{L_{2}(P_{n})}+\|f-\hat{f}_{j-1}\|_{L_{2}(P_{n})}\right)^{2}
\displaystyle\leq(\alpha_{j}+\alpha_{j-1})^{2}
\displaystyle=(3\alpha_{j})^{2}.

Massart's lemma (Mohri et al., 2018, Theorem 3.7) states that for any finite function class \mathcal{F} with \sup_{f\in\mathcal{F}}\|f\|_{L_{2}(P_{n})}\leq M, we have

\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F})\leq\sqrt{\frac{2M^{2}\log(|\mathcal{F}|)}{n}}.

Applying Massart’s lemma to the function classes {ff′′:fTj,f′′Tj1}\{f^{\prime}-f^{\prime\prime}:f^{\prime}\in T_{j},f^{\prime\prime}\in T_{j-1}\}, j=1,,Nj=1,\ldots,N, we get that for any NN,

^n()\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) αN+j=1N3αj2log(|Tj||Tj1|)n\displaystyle\leq\alpha_{N}+\sum_{j=1}^{N}3\alpha_{j}\sqrt{\frac{2\log(|T_{j}|\cdot|T_{j-1}|)}{n}}
αN+6j=1Nαjlog(|Tj|)n\displaystyle\leq\alpha_{N}+6\sum_{j=1}^{N}\alpha_{j}\sqrt{\frac{\log(|T_{j}|)}{n}}
αN+12j=1N(αjαj+1)log𝒩(αj,,L2(Pn))n\displaystyle\leq\alpha_{N}+12\sum_{j=1}^{N}(\alpha_{j}-\alpha_{j+1})\sqrt{\frac{\log\mathcal{N}(\alpha_{j},\mathcal{F},L_{2}(P_{n}))}{n}}
αN+12αN+1α0log𝒩(r,,L2(Pn))n𝑑r,\displaystyle\leq\alpha_{N}+12\int_{\alpha_{N+1}}^{\alpha_{0}}\sqrt{\frac{\log\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))}{n}}dr,

where the third inequality is due to 2(αjαj+1)=αj2(\alpha_{j}-\alpha_{j+1})=\alpha_{j}. Now for any small δ>0\delta>0 we can choose NN such that αN+1δ<αN\alpha_{N+1}\leq\delta<\alpha_{N}. Hence,

^n()\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) 2δ+12δ/2Mlog𝒩(r,,L2(Pn))n𝑑r.\displaystyle\leq 2\delta+12\int_{\delta/2}^{M}\sqrt{\frac{\log\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))}{n}}dr.

Since \delta>0 is arbitrary, we can replace \delta by 2\delta and take the infimum over \delta to get

\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) \leq\inf_{0<\delta<M}\left(4\delta+12\int_{\delta}^{M}\sqrt{\frac{\log\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))}{n}}dr\right).

The result then follows from the fact that

\displaystyle\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},L_{\infty}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},\|\cdot\|_{\infty}). ∎
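As a quick numerical sanity check of Massart's lemma used in the chaining step above, the following Python sketch estimates the empirical Rademacher complexity of a random finite function class by Monte Carlo and compares it with \sqrt{2M^{2}\log(|\mathcal{F}|)/n}; the class, sample size, and scaling are arbitrary and purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, num_funcs, M = 200, 32, 1.0

# A finite function class, represented by its values on a fixed sample of size n and
# rescaled so that each function has empirical L2(P_n) norm exactly M.
vals = rng.normal(size=(num_funcs, n))
vals *= M / np.sqrt(np.mean(vals ** 2, axis=1, keepdims=True))

# Monte Carlo estimate of R_hat_n(F) = E_eps sup_f (1/n) sum_i eps_i f(X_i).
reps = 2000
eps = rng.choice([-1.0, 1.0], size=(reps, n))
rademacher = np.mean(np.max(eps @ vals.T / n, axis=1))

massart = np.sqrt(2 * M ** 2 * np.log(num_funcs) / n)
print(f"empirical Rademacher complexity ~ {rademacher:.4f} <= Massart bound {massart:.4f}")

The Monte Carlo estimate should fall below the bound; in the chaining argument above, the lemma is applied with M=3\alpha_{j} to each difference class built from T_{j} and T_{j-1}.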

Appendix G Proof of Theorem 3.1

Proof.

Taking W_{1}L_{1}=\left\lceil\sqrt{n}\right\rceil, we obtain from Shen et al. (2019, Theorem 4.3) that \mathcal{E}_{1}=O(\sqrt{d}n^{-\frac{1}{d+1}}). The ranges of g and e cover the supports of \mu and \nu, respectively, hence Theorem 4.3 leads to \mathcal{E}_{2}=0. By Lemma C.2, we have

log𝒩(ϵ,1,)(82(d+1)Mϵ)d+1log16Bϵ.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq\left(\frac{8\sqrt{2(d+1)}M}{\epsilon}\right)^{d+1}\log\frac{16B}{\epsilon}.

Now following the same procedure as in Section E by taking δ=82(d+1)n1d+1(logn)1d+1\delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}}, we have

max{𝔼3,𝔼4}=O(dn1d+1(logn)1d+1).\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\}=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}}\right).

Finally, we consider all four error terms simultaneously:

\displaystyle\mathbb{E}d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}}) \leq\mathcal{E}_{1}+\mathcal{E}_{2}+\mathbb{E}\mathcal{E}_{3}+\mathbb{E}\mathcal{E}_{4}
\displaystyle=O(\sqrt{d}n^{-\frac{1}{d+1}}+0+\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}})
\displaystyle=O(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}}). ∎

Appendix H Proof of Theorem 3.3

Following the same line of proof as for Theorem 4.3, we have the following theorem.

Theorem H.1.

Suppose that \nu, supported on \mathbb{R}^{k}, and \mu, supported on \mathbb{R}^{d}, are both absolutely continuous w.r.t. the Lebesgue measure, and that the z_{i}'s and x_{i}'s are i.i.d. samples from \nu and \mu, respectively, for 1\leq i\leq n. Then there exist generator and encoder neural network functions g:\mathbb{R}^{k}\mapsto\mathbb{R}^{d} and e:\mathbb{R}^{d}\mapsto\mathbb{R}^{k} such that g and e are inverse bijections of each other between \{z_{i}:1\leq i\leq n\} and \{x_{i}:1\leq i\leq n\}. Moreover, such neural network functions g and e can be obtained by properly specifying W_{2}^{2}L_{2}=c_{2}dn and W_{3}^{2}L_{3}=c_{3}kn for some constants 12\leq c_{2},c_{3}\leq 384.

Since \mu and \nu are absolutely continuous by assumption, their coordinate marginals are also absolutely continuous. Hence the proof reduces to the one-dimensional case.
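The one-dimensional mechanism can be illustrated with a minimal Python sketch (not the construction used in the proof): sorting the two samples and interpolating piecewise linearly between them produces monotone maps g and e that are inverse bijections between the samples; these are exactly the kind of continuous piecewise linear functions handled by Lemma I.1 below.

import numpy as np

rng = np.random.default_rng(2)
n = 8
z = np.sort(rng.normal(size=n))           # latent sample (1-D for illustration)
x = np.sort(rng.uniform(-3, 3, size=n))   # data sample (1-D for illustration)

# Monotone piecewise linear map g with g(z_i) = x_i, and its inverse e with e(x_i) = z_i.
# np.interp is constant outside the range of its breakpoints.
def g(t):
    return np.interp(t, z, x)

def e(t):
    return np.interp(t, x, z)

assert np.allclose(g(z), x) and np.allclose(e(x), z)
assert np.allclose(e(g(z)), z)            # inverse bijections on the samples
print("g and e are inverse bijections between {z_i} and {x_i}.")

Because the samples come from absolutely continuous distributions, the sorted values are almost surely distinct, so the interpolation is well defined and strictly increasing between the breakpoints.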

Appendix I Additional Lemma

Denote 𝒮d(z0,,zN+1)\mathcal{S}^{d}(z_{0},\ldots,z_{N+1}) as the set of all continuous piecewise linear functions f:df:\mathbb{R}\mapsto\mathbb{R}^{d} which have breakpoints only at z0<z1<<zN<zN+1z_{0}<z_{1}<\cdots<z_{N}<z_{N+1} and are constant on (,z0)(-\infty,z_{0}) and (zN+1,)(z_{N+1},\infty). The following lemma is a result in Yang et al. (2021).

Lemma I.1.

Suppose that W\geq 7d+1, L\geq 2 and N\leq(W-d-1)\left\lfloor\frac{W-d-1}{6d}\right\rfloor\left\lfloor\frac{L}{2}\right\rfloor. Then for any z_{0}<z_{1}<\cdots<z_{N}<z_{N+1}, every function in \mathcal{S}^{d}(z_{0},\ldots,z_{N+1}) can be represented by a ReLU FNN with width and depth no larger than W and L, respectively.

This result characterizes the expressive capacity of ReLU FNNs for piecewise linear functions. If we choose N=(W-d-1)\left\lfloor\frac{W-d-1}{6d}\right\rfloor\left\lfloor\frac{L}{2}\right\rfloor, a simple calculation shows that cW^{2}L/d\leq N\leq CW^{2}L/d with c=1/384 and C=1/12. This means that when the number of breakpoints is moderate relative to the network size, such piecewise linear functions are expressible by feedforward ReLU networks.
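To illustrate the expressiveness claim, the Python sketch below realizes a member of \mathcal{S}^{1}(z_{0},\ldots,z_{N+1}) exactly with a single hidden layer of ReLU units, one unit per breakpoint; this is the classical shallow construction, not the width/depth-efficient network of Lemma I.1, and the breakpoints and values are arbitrary.

import numpy as np

# Breakpoints z_0 < ... < z_{N+1} and values of a continuous piecewise linear
# f: R -> R that is constant outside [z_0, z_{N+1}] (d = 1, illustrative only).
z = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([0.0, 1.0, -0.5, 2.0, 2.0])

def relu(t):
    return np.maximum(t, 0.0)

# Shallow ReLU representation  f(x) = y_0 + sum_j c_j * relu(x - z_j),
# where c_j is the change of slope at breakpoint z_j.
slopes = np.concatenate([[0.0], np.diff(y) / np.diff(z), [0.0]])  # slopes, incl. the two constant tails
coef = np.diff(slopes)                                            # slope changes, one per breakpoint

def f_relu(x):
    return y[0] + relu(np.subtract.outer(x, z)) @ coef

# Compare with direct piecewise linear interpolation (constant outside the range).
x_grid = np.linspace(-3.0, 3.0, 601)
assert np.allclose(f_relu(x_grid), np.interp(x_grid, z, y))
print("The one-hidden-layer ReLU network reproduces the piecewise linear function exactly.")

For d>1 the same idea can be applied coordinatewise, with one scalar ReLU construction per output coordinate.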