
Non-Asymptotic Error Bounds for
Bidirectional GANs

Shiao Liu
Department of Statistics and Actuarial Science, University of Iowa
Iowa City, IA 52242, USA
shiao-liu@uiowa.edu
Yunfei Yang
Department of Mathematics, The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong, China
yyangdc@connect.ust.hk
Jian Huang
Department of Statistics and Actuarial Science, University of Iowa
Iowa City, IA 52242, USA
jian-huang@uiowa.edu
Yuling Jiao
School of Mathematics and Statistics, Wuhan University
Wuhan, Hubei, China 430072
yulingjiaomath@whu.edu.cn
Yang Wang
Department of Mathematics, The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong, China
yangwang@ust.hk
Corresponding authors
Abstract

We derive nearly sharp bounds for the bidirectional GAN (BiGAN) estimation error under the Dudley distance between the latent joint distribution and the data joint distribution, with appropriately specified architectures of the neural networks used in the model. To the best of our knowledge, this is the first theoretical guarantee for the bidirectional GAN learning approach. An appealing feature of our results is that they do not assume the reference and the data distributions to have the same dimension or to have bounded support. These assumptions are commonly made in the existing convergence analyses of unidirectional GANs but may not be satisfied in practice. Our results are also applicable to the Wasserstein bidirectional GAN if the target distribution has bounded support. To prove these results, we construct neural network functions that push forward an empirical distribution to another arbitrary empirical distribution on a possibly different-dimensional space. We also develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. These basic theoretical results are of independent interest and can be applied to other related learning problems.

1 Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are an important approach to implicitly learning and sampling from high-dimensional complex distributions. GANs have been shown to achieve impressive performance in many machine learning tasks (Radford et al., 2016; Reed et al., 2016; Zhu et al., 2017; Karras et al., 2018, 2019; Brock et al., 2019). Several recent studies have generalized GANs to bidirectional generative learning, which simultaneously learns an encoder mapping the data distribution to the reference distribution and a generator mapping in the reverse direction. These studies include the adversarial autoencoder (AAE) (Makhzani et al., 2015), bidirectional GAN (BiGAN) (Donahue et al., 2016), adversarially learned inference (ALI) (Dumoulin et al., 2016), and bidirectional generative modeling using adversarial gradient estimation (AGES) (Shen et al., 2020). A common feature of these methods is that they generalize the basic adversarial training framework of the original GAN from unidirectional to bidirectional. Dumoulin et al. (2016) showed that BiGANs make use of the joint distribution of data and latent representations, which can better capture the information in the data than vanilla GANs. Compared with unidirectional GANs, the joint distribution matching in the training of bidirectional GANs alleviates mode dropping and encourages cycle consistency (Shen et al., 2020).

Several elegant and stimulating papers have analyzed the theoretical properties of unidirectional GANs. Arora et al. (2017) considered the generalization error of GANs under the neural net distance. Zhang et al. (2018) improved the generalization error bound in Arora et al. (2017). Liang (2020) studied the minimax optimal rates for learning distributions from empirical samples under Sobolev evaluation and density classes. The minimax rate is $O(n^{-1/2}\vee n^{-(\alpha+\beta)/(2\alpha+\beta)})$, where $\alpha$ and $\beta$ are the regularity parameters of the Sobolev density and evaluation classes, respectively. Bai et al. (2019) analyzed the estimation error of GANs under the Wasserstein distance for a special class of distributions implemented by a generator, with the discriminator designed to guarantee zero bias. Chen et al. (2020) studied the convergence properties of GANs when both the evaluation class and the target density class are Hölder classes and derived an $O(n^{-\beta/(2\beta+d)}\log^{2}n)$ bound, where $d$ is the dimension of the data distribution and $\alpha$ and $\beta$ are the regularity parameters of the Hölder density and evaluation classes, respectively. While impressive progress has been made on the theoretical understanding of GANs, there are still some drawbacks in the existing results. For example,

(a) The reference distribution and the target data distribution are assumed to have the same dimension, which is not the actual setting in GAN training.

(b) The reference and the target data distributions are assumed to be supported on bounded sets.

(c) The prefactors in the convergence rates may depend exponentially on the dimension $d$ of the data distribution.

In practice, GANs are usually trained using a reference distribution with a lower dimension than that of the target data distribution. Indeed, an important strength of GANs is that they can model low-dimensional latent structures by using a low-dimensional reference distribution. The bounded support assumption excludes some commonly used distributions, such as the Gaussian, as the reference. Therefore, strictly speaking, the existing convergence analyses do not apply to what is done in practice. In addition, there has been no theoretical analysis of bidirectional GANs in the literature.

1.1 Contributions

We derive nearly sharp non-asymptotic bounds for the GAN estimation error under the Dudley distance between the reference joint distribution and the data joint distribution. To the best of our knowledge, this is the first result providing theoretical guarantees for the bidirectional GAN estimation error rate. We do not assume that the reference and the target data distributions have the same dimension or that these distributions have bounded support. Also, our results are applicable to the Wasserstein distance if the target data distribution has bounded support.

The main novel aspects of our work are as follows.

(1) We allow the dimension of the reference distribution to differ from the dimension of the target distribution; in particular, it can be much lower than that of the target distribution.

(2) We allow unbounded support for the reference and target distributions under mild conditions on the tail probabilities of the target distribution.

(3) We explicitly establish that the prefactors in the error bounds depend on the square root of the dimension of the target distribution. This is a significant improvement over the exponential dependence on $d$ in existing works.

Moreover, we develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. We also show that the pushforward of an empirical distribution by a neural network can perfectly approximate another arbitrary empirical distribution, as long as the numbers of discrete points are the same.

Notation. We use $\sigma$ to denote the ReLU activation function in neural networks, $\sigma(x)=\max\{x,0\}$, $x\in\mathbb{R}$. We use $I$ to denote the identity map. Unless otherwise indicated, $\|\cdot\|$ denotes the $L_{2}$ norm. For any function $g$, let $\|g\|_{\infty}=\sup_{x}\|g(x)\|$. We use the notations $O(\cdot)$ and $\tilde{O}(\cdot)$ to express the order of a function in slightly different ways: $O(\cdot)$ omits a universal constant independent of $d$, while $\tilde{O}(\cdot)$ omits a constant depending on $d$. We use $B_{2}^{d}(a)$ to denote the $L_{2}$ ball in $\mathbb{R}^{d}$ centered at $\mathbf{0}$ with radius $a$. Let $g_{\#}\nu$ be the pushforward distribution of $\nu$ by a function $g$, in the sense that $g_{\#}\nu(A)=\nu(g^{-1}(A))$ for any measurable set $A$. We use $\hat{\mathbb{E}}$ to denote expectation with respect to the empirical distribution.

2 Bidirectional generative learning

We describe the setup of the bidirectional GAN estimation problem and present the assumptions we need in our analysis.

2.1 Bidirectional GAN estimators

Let $\mu$ be the target data distribution supported on $\mathbb{R}^{d}$ for $d\geq 1$, and let $\nu$ be a reference distribution that is easy to sample from. We first consider the case when $\nu$ is supported on $\mathbb{R}$, and then extend it to $\mathbb{R}^{k}$, where $k\geq 1$ can be different from $d$. Usually, $k\ll d$ in practical machine learning tasks such as image generation. The goal is to learn functions $g:\mathbb{R}\to\mathbb{R}^{d}$ and $e:\mathbb{R}^{d}\to\mathbb{R}$ such that $\tilde{g}_{\#}\nu=\tilde{e}_{\#}\mu$, where $\tilde{g}:=(g,I)$ and $\tilde{e}:=(I,e)$, $\tilde{g}_{\#}\nu$ is the pushforward distribution of $\nu$ under $\tilde{g}$, and $\tilde{e}_{\#}\mu$ is the pushforward distribution of $\mu$ under $\tilde{e}$. We call $\tilde{g}_{\#}\nu$ the joint latent distribution or joint reference distribution, and $\tilde{e}_{\#}\mu$ the joint data distribution or joint target distribution. At the population level, the bidirectional GAN solves the minimax problem:

\[
(g^{*},e^{*},f^{*})\in\arg\min_{g\in\mathcal{G},e\in\mathcal{E}}\max_{f\in\mathcal{F}}\mathbb{E}_{Z\sim\nu}[f(g(Z),Z)]-\mathbb{E}_{X\sim\mu}[f(X,e(X))],
\]

where $\mathcal{G},\mathcal{E},\mathcal{F}$ are referred to as the generator class, the encoder class, and the discriminator class, respectively. Suppose we have two independent random samples $Z_{1},\ldots,Z_{n}\overset{\mathrm{i.i.d.}}{\sim}\nu$ and $X_{1},\ldots,X_{n}\overset{\mathrm{i.i.d.}}{\sim}\mu$. At the sample level, the bidirectional GAN solves the empirical version of the above minimax problem:

\[
(\hat{g}_{\theta},\hat{e}_{\varphi},\hat{f}_{\omega})=\arg\min_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\max_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}f_{\omega}(g_{\theta}(Z_{i}),Z_{i})-\frac{1}{n}\sum_{j=1}^{n}f_{\omega}(X_{j},e_{\varphi}(X_{j})),
\tag{2.1}
\]

where $\mathcal{G}_{NN}$ and $\mathcal{E}_{NN}$ are two classes of neural networks approximating the generator class $\mathcal{G}$ and the encoder class $\mathcal{E}$, respectively, and $\mathcal{F}_{NN}$ is a class of neural networks approximating the discriminator class $\mathcal{F}$.
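To make the sample-level criterion (2.1) concrete, here is a minimal Python sketch (ours, not from the paper) that evaluates the empirical BiGAN objective for given generator, encoder, and discriminator callables; the placeholder maps `g`, `e`, `f` below are purely hypothetical, and the optimization over the network classes is omitted.

```python
import numpy as np

def bigan_objective(f, g, e, z, x):
    """Empirical BiGAN criterion from (2.1).

    f : discriminator, maps (d+1)-dim joint samples to scalars
    g : generator, maps 1-dim latent samples to d-dim samples
    e : encoder, maps d-dim data samples to 1-dim latent codes
    z : array of shape (n,), sample from the reference nu
    x : array of shape (n, d), sample from the target mu
    """
    joint_latent = np.column_stack([g(z), z])   # (g(Z_i), Z_i)
    joint_data = np.column_stack([x, e(x)])     # (X_j, e(X_j))
    return np.mean(f(joint_latent)) - np.mean(f(joint_data))

# toy usage with hypothetical placeholder maps (d = 3)
rng = np.random.default_rng(0)
z = rng.normal(size=100)                    # reference sample
x = rng.normal(size=(100, 3))               # data sample
g = lambda z: np.tile(z[:, None], (1, 3))   # placeholder generator
e = lambda x: x[:, 0]                       # placeholder encoder
f = lambda u: np.tanh(u).sum(axis=1)        # placeholder discriminator
print(bigan_objective(f, g, e, z, x))
```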

2.2 Assumptions

We assume that the target $\mu$ and the reference $\nu$ satisfy the following assumptions.

Assumption 1 (Subexponential tail).

For large $n$, the target distribution $\mu$ on $\mathbb{R}^{d}$ and the reference distribution $\nu$ on $\mathbb{R}$ satisfy the following first-moment tail condition for some $\delta>0$:

\[
\max\big\{\mathbb{E}_{\nu}\|Z\|\mathbbm{1}_{\{\|Z\|>\log n\}},\ \mathbb{E}_{\mu}\|X\|\mathbbm{1}_{\{\|X\|>\log n\}}\big\}=O\big(n^{-\frac{(\log n)^{\delta}}{d}}\big).
\]
Assumption 2 (Absolute continuity).

Both the target distribution $\mu$ on $\mathbb{R}^{d}$ and the reference distribution $\nu$ on $\mathbb{R}$ are absolutely continuous with respect to the Lebesgue measure $\lambda$.

Assumption 1 is a technical condition for dealing with the case when $\mu$ and $\nu$ are supported on $\mathbb{R}^{d}$ and $\mathbb{R}$ rather than on compact subsets. For distributions with bounded support, this assumption is automatically satisfied. The factor $(\log n)^{\delta}$ ensures that the tails of $\mu$ and $\nu$ are subexponential, and the condition is easily satisfied if the distributions are sub-gaussian. For the reference distribution, Assumptions 1 and 2 are easily satisfied by specifying $\nu$ as a common distribution with an easy-to-sample density, such as a Gaussian or uniform distribution, as is usually done in applications of GANs. For the target distribution, Assumptions 1 and 2 specify the type of distributions that are learnable by bidirectional GANs with our theoretical guarantees. Note that Assumption 1 is also necessary in our proof for bounding the generator and encoder approximation error, in the sense that the results will not hold if we replace $(\log n)^{\delta}$ with 1. Assumption 2 is also necessary for Theorem 4.3 on mapping between empirical samples, which is essential in bounding the generator and encoder approximation error.
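For instance (our own check, not part of the original text), take the standard Gaussian reference $\nu=N(0,1)$ on $\mathbb{R}$. A direct calculation gives

\[
\mathbb{E}_{\nu}|Z|\,\mathbbm{1}_{\{|Z|>\log n\}}
=2\int_{\log n}^{\infty}\frac{z}{\sqrt{2\pi}}e^{-z^{2}/2}\,dz
=\sqrt{\frac{2}{\pi}}\,e^{-(\log n)^{2}/2}
=\sqrt{\frac{2}{\pi}}\,n^{-\frac{\log n}{2}},
\]

which is $O(n^{-(\log n)^{\delta}/d})$ for any $\delta\leq 1$ when $d\geq 2$, so the tail condition in Assumption 1 holds; a similar calculation applies to sub-gaussian targets.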

2.3 Generator, encoder and discriminator classes

Let $\mathcal{F}_{NN}:=\mathcal{NN}(W_{1},L_{1})$ be the discriminator class consisting of feedforward ReLU neural networks $f_{\omega}:\mathbb{R}^{d+1}\mapsto\mathbb{R}$ with width $W_{1}$ and depth $L_{1}$. Similarly, let $\mathcal{G}_{NN}:=\mathcal{NN}(W_{2},L_{2})$ be the generator class consisting of feedforward ReLU neural networks $g_{\theta}:\mathbb{R}\mapsto\mathbb{R}^{d}$ with width $W_{2}$ and depth $L_{2}$, and $\mathcal{E}_{NN}:=\mathcal{NN}(W_{3},L_{3})$ the encoder class consisting of feedforward ReLU neural networks $e_{\varphi}:\mathbb{R}^{d}\mapsto\mathbb{R}$ with width $W_{3}$ and depth $L_{3}$.

The functions $f_{\omega}\in\mathcal{F}_{NN}$ have the form

\[
f_{\omega}(x)=A_{L_{1}}\cdot\sigma(A_{L_{1}-1}\cdots\sigma(A_{1}x+b_{1})\cdots+b_{L_{1}-1})+b_{L_{1}},
\]

where the $A_{i}$ are weight matrices whose numbers of rows and columns are no larger than the width $W_{1}$, the $b_{i}$ are bias vectors with compatible dimensions, and $\sigma$ is the ReLU activation function $\sigma(x)=x\vee 0$. Similarly, the functions $g_{\theta}\in\mathcal{G}_{NN}$ and $e_{\varphi}\in\mathcal{E}_{NN}$ have the form

\begin{align*}
g_{\theta}(x)&=A^{\prime}_{L_{2}}\cdot\sigma(A^{\prime}_{L_{2}-1}\cdots\sigma(A^{\prime}_{1}x+b^{\prime}_{1})\cdots+b^{\prime}_{L_{2}-1})+b^{\prime}_{L_{2}},\\
e_{\varphi}(x)&=A^{\prime\prime}_{L_{3}}\cdot\sigma(A^{\prime\prime}_{L_{3}-1}\cdots\sigma(A^{\prime\prime}_{1}x+b^{\prime\prime}_{1})\cdots+b^{\prime\prime}_{L_{3}-1})+b^{\prime\prime}_{L_{3}},
\end{align*}

where the $A_{i}^{\prime}$ and $A_{i}^{\prime\prime}$ are weight matrices whose numbers of rows and columns are no larger than $W_{2}$ and $W_{3}$, respectively, and the $b_{i}^{\prime}$ and $b_{i}^{\prime\prime}$ are bias vectors with compatible dimensions.
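For illustration only, the following numpy sketch instantiates and evaluates a random network of the above form with a given width and depth (here `depth` counts the affine layers $A_{1},\dots,A_{L}$); it is a hypothetical stand-in for a member of $\mathcal{NN}(W,L)$, not the fitted estimator.

```python
import numpy as np

def init_relu_net(d_in, d_out, width, depth, rng):
    """Random parameters (A_1, b_1), ..., (A_L, b_L) with L = depth affine layers."""
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    return [(rng.normal(size=(m, k)) / np.sqrt(k), np.zeros(m))
            for k, m in zip(dims[:-1], dims[1:])]

def relu_net(params, x):
    """Forward pass: A_L sigma(A_{L-1} ... sigma(A_1 x + b_1) ... + b_{L-1}) + b_L."""
    h = x
    for A, b in params[:-1]:
        h = np.maximum(A @ h + b, 0.0)   # ReLU activation sigma
    A_last, b_last = params[-1]
    return A_last @ h + b_last

rng = np.random.default_rng(0)
d = 3
f_params = init_relu_net(d + 1, 1, width=16, depth=4, rng=rng)  # discriminator-shaped net
print(relu_net(f_params, np.ones(d + 1)))
```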

We impose the following conditions on $\mathcal{G}_{NN}$, $\mathcal{E}_{NN}$, and $\mathcal{F}_{NN}$.

Condition 1.

For any $g_{\theta}\in\mathcal{G}_{NN}$ and $e_{\varphi}\in\mathcal{E}_{NN}$, we have $\max\{\|g_{\theta}\|_{\infty},\|e_{\varphi}\|_{\infty}\}\leq\log n$.

Condition 1 on $\mathcal{G}_{NN}$ can be easily satisfied by adding an additional clipping layer $\ell$ after the original output layer, where, with $c_{n,d}\equiv(\log n)/\sqrt{d}$,

\[
\ell(a)=a\wedge c_{n,d}\vee(-c_{n,d})=\sigma(a+c_{n,d})-\sigma(a-c_{n,d})-c_{n,d}.
\tag{2.2}
\]

We truncate each coordinate of the output of $g_{\theta}$ to $[-c_{n,d},c_{n,d}]$, which guarantees $\|g_{\theta}\|_{\infty}\leq\log n$; the truncation range $[-\log n,\log n]$ grows with $n$, so the whole support in $\mathbb{R}^{d}$ is eventually covered for the evaluation function class. Condition 1 on $\mathcal{E}_{NN}$ can be satisfied in the same manner. This condition is technically necessary in our proof (see Appendix).
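As a quick numerical check of the ReLU identity in (2.2) (a sketch of ours, not the paper's code):

```python
import numpy as np

def clip_via_relu(a, c):
    """Clipping layer (2.2): a ∧ c ∨ (-c) = sigma(a + c) - sigma(a - c) - c."""
    relu = lambda t: np.maximum(t, 0.0)
    return relu(a + c) - relu(a - c) - c

a = np.linspace(-5.0, 5.0, 101)
c = 2.0                                   # plays the role of c_{n,d} = (log n) / sqrt(d)
assert np.allclose(clip_via_relu(a, c), np.clip(a, -c, c))
```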

3 Non-asymptotic error bounds

We characterize the bidirectional GAN solutions based on minimizing the integral probability metric (IPM, Müller (1997)) between two distributions $\mu$ and $\nu$ with respect to a symmetric evaluation function class $\mathcal{F}$, defined by

\[
d_{\mathcal{F}}(\mu,\nu)=\sup_{f\in\mathcal{F}}[\mathbb{E}_{\mu}f-\mathbb{E}_{\nu}f].
\tag{3.1}
\]

By specifying the evaluation function class $\mathcal{F}$ differently, we can obtain many commonly used metrics (Liu et al., 2017). Here we focus on the following two:

  • $\mathcal{F}=$ the bounded Lipschitz function class: $d_{\mathcal{F}}=d_{BL}$, the bounded Lipschitz (or Dudley) metric, which metrizes weak convergence (Dudley, 2018);

  • $\mathcal{F}=$ the Lipschitz function class: $d_{\mathcal{F}}=W_{1}$, the 1-Wasserstein distance (Wasserstein GAN, Arjovsky et al. (2017)).

We consider the estimation error under the Dudley metric $d_{BL}$. Note that when $\mu$ and $\nu$ have bounded support, the Dudley metric $d_{BL}$ is equivalent to the 1-Wasserstein metric $W_{1}$. Therefore, under the bounded support condition for $\mu$ and $\nu$, all our convergence results also hold under the Wasserstein distance $W_{1}$. Even if the supports of $\mu$ and $\nu$ are unbounded, we can still apply the result of Lu and Lu (2020) to avoid empirical process theory and obtain a stochastic error bound under the Wasserstein distance $W_{1}$. However, the result of Lu and Lu (2020) requires sub-gaussianity to obtain the $\sqrt{d}$ prefactor. To make our result more general, we use empirical process theory to obtain the explicit prefactor. Also, the discriminator approximation error would be unbounded if we considered the Wasserstein distance $W_{1}$. Hence, we only consider $d_{BL}$ in the unbounded support case.

The bidirectional GAN solution $(\hat{g}_{\theta},\hat{e}_{\varphi})$ in (2.1) also minimizes the distance between $(\tilde{g}_{\theta})_{\#}\hat{\nu}_{n}$ and $(\tilde{e}_{\varphi})_{\#}\hat{\mu}_{n}$ under $d_{\mathcal{F}_{NN}}$:

\[
\min_{g_{\theta}\in\mathcal{G}_{NN},\,e_{\varphi}\in\mathcal{E}_{NN}}d_{\mathcal{F}_{NN}}\big((\tilde{g}_{\theta})_{\#}\hat{\nu}_{n},(\tilde{e}_{\varphi})_{\#}\hat{\mu}_{n}\big).
\]

However, even if two distributions are close with respect to $d_{\mathcal{F}_{NN}}$, there is no automatic guarantee that they are close under other metrics, for example, the Dudley or the Wasserstein distance (Arora et al., 2017). Therefore, it is natural to ask:

  • How close are the two bidirectional GAN estimators $\hat{\boldsymbol{\nu}}:=(\hat{g}_{\theta},I)_{\#}\nu$ and $\hat{\boldsymbol{\mu}}:=(I,\hat{e}_{\varphi})_{\#}\mu$ under some other, stronger metric?

We consider the IPM with the uniformly bounded 1-Lipschitz function class on $\mathbb{R}^{d+1}$ as the evaluation class, which is defined, for some finite $B>0$, as

\[
\mathcal{F}^{1}:=\big\{f:\mathbb{R}^{d+1}\mapsto\mathbb{R}\ \big|\ |f(x)-f(y)|\leq\|x-y\|\ \text{for all }x,y\in\mathbb{R}^{d+1}\text{ and }\|f\|_{\infty}\leq B\big\}.
\tag{3.2}
\]

In Theorem 3.1, we consider the bounded support case, where $d_{\mathcal{F}}=W_{1}$; in Theorem 3.2, we extend the result to the unbounded support case; and in Theorem 3.3, we further extend the result to the case where the dimension of the reference distribution is arbitrary.

We first present a result for the case where $\mu$ is supported on a compact subset $[-M,M]^{d}\subset\mathbb{R}^{d}$ and $\nu$ is supported on $[-M,M]\subset\mathbb{R}$ for a finite $M>0$.

Theorem 3.1.

Suppose that the target $\mu$ is supported on $[-M,M]^{d}\subset\mathbb{R}^{d}$ and the reference $\nu$ is supported on $[-M,M]\subset\mathbb{R}$ for a finite $M>0$, and that Assumption 2 holds. Let the outputs of $g_{\theta}$ and $e_{\varphi}$ lie in $[-M,M]^{d}$ and $[-M,M]$ for all $g_{\theta}\in\mathcal{G}_{NN}$ and $e_{\varphi}\in\mathcal{E}_{NN}$, respectively. By specifying the three network structures as $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}=C_{1}dn$, and $W_{3}^{2}L_{3}=C_{2}n$ for some constants $12\leq C_{1},C_{2}\leq 384$ and properly choosing the parameters, we have

\[
\mathbb{E}\,d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq C_{0}\sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}},
\]

where $C_{0}>0$ is a constant independent of $d$ and $n$.

The prefactor $C_{0}\sqrt{d}$ in the error bound depends on the dimension only through $d^{1/2}$. This is different from existing works, where the dependence of the prefactor on $d$ is either not clearly described or is exponential. In high-dimensional settings with large $d$, this makes a substantial difference in the quality of the error bounds. These remarks apply to all the results stated below.

The next theorem deals with the case of unbounded support.

Theorem 3.2.

Suppose Assumptions 1 and 2 hold and Condition 1 is satisfied. By specifying the structures of the three network classes as $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}=C_{1}dn$, and $W_{3}^{2}L_{3}=C_{2}n$ for some constants $12\leq C_{1},C_{2}\leq 384$ and properly choosing the parameters, we have

\[
\mathbb{E}\,d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq\min\big\{C_{0}\sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}},\ C_{d}\,n^{-\frac{1}{d+1}}\log n\big\},
\]

where $C_{0}$ is a constant independent of $d$ and $n$, while $C_{d}$ depends on $d$.

Note that two methods are used to bound the stochastic errors (see Appendix), which leads to two different bounds: one has an explicit $\sqrt{d}$ prefactor at the cost of an additional $\log n$ factor; the other has an implicit prefactor but a lower power of $\log n$. Hence there is a tradeoff between the explicitness of the prefactor and the order of $\log n$.

Our next result generalizes the above to the case when the reference distribution $\nu$ is supported on $\mathbb{R}^{k}$ for $k\in\mathbb{N}_{+}$.

Assumption 3.

The target distribution $\mu$ on $\mathbb{R}^{d}$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{d}$, the reference distribution $\nu$ on $\mathbb{R}^{k}$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{k}$, and $k\ll d$.

With the above assumption, we have the following theorem, which provides theoretical guarantees for a reference $\nu$ of any dimension.

Theorem 3.3.

Suppose Assumptions 1 and 3 hold and Condition 1 is satisfied. By specifying the three network structures as $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}=C_{1}dn$, and $W_{3}^{2}L_{3}=C_{2}kn$ for some constants $12\leq C_{1},C_{2}\leq 384$ and properly choosing the parameters, we have

\[
\mathbb{E}\,d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq\min\big\{C_{0}\sqrt{d}\,n^{-\frac{1}{d+k}}(\log n)^{1+\frac{1}{d+k}},\ C_{d}\,n^{-\frac{1}{d+k}}\log n\big\},
\]

where $C_{0}$ is a constant independent of $d$ and $n$, while $C_{d}$ depends on $d$.

Note that the error bounds established in Theorems 3.1-3.3 are tight up to a logarithmic factor, since the minimax rate, measured in the Wasserstein distance, for learning distributions when the Lipschitz evaluation class is defined on $\mathbb{R}^{d}$ is $\tilde{O}(n^{-\frac{1}{d}})$ (Liang, 2020).

4 Approximation and stochastic errors

In this section we present a novel inequality for decomposing the total error into approximation and stochastic errors and establish bounds on these errors.

4.1 Decomposition of the estimation error

Define the approximation error of a function class $\mathcal{F}$ to another function class $\mathcal{H}$ by

\[
\mathcal{E}(\mathcal{H},\mathcal{F}):=\sup_{h\in\mathcal{H}}\inf_{f\in\mathcal{F}}\|h-f\|_{\infty}.
\]

We decompose the Dudley distance $d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})$ between the latent joint distribution and the data joint distribution into four error terms:

  • the approximation error of the discriminator class $\mathcal{F}_{NN}$ to $\mathcal{F}^{1}$:
\[
\mathcal{E}_{1}=\mathcal{E}(\mathcal{F}^{1},\mathcal{F}_{NN}),
\]
  • the approximation error of the generator and encoder classes:
\[
\mathcal{E}_{2}=\inf_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big),
\]
  • the stochastic error for the latent joint distribution $\hat{\boldsymbol{\nu}}$:
\[
\mathcal{E}_{3}=\sup_{f_{\omega}\in\mathcal{F}^{1}}\mathbb{E}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z),
\]
  • the stochastic error for the data joint distribution $\hat{\boldsymbol{\mu}}$:
\[
\mathcal{E}_{4}=\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))-\mathbb{E}f_{\omega}(x,\hat{e}(x)).
\]
Lemma 4.1.

Let $(\hat{g}_{\theta},\hat{e}_{\varphi})$ be the bidirectional GAN solution in (2.1) and $\mathcal{F}^{1}$ the uniformly bounded 1-Lipschitz function class defined in (3.2). Then the Dudley distance between the latent joint distribution $\hat{\boldsymbol{\nu}}=(\hat{g}_{\theta},I)_{\#}\nu$ and the data joint distribution $\hat{\boldsymbol{\mu}}=(I,\hat{e}_{\varphi})_{\#}\mu$ can be decomposed as follows:

\[
d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq 2\mathcal{E}_{1}+\mathcal{E}_{2}+\mathcal{E}_{3}+\mathcal{E}_{4}.
\tag{4.1}
\]

The novel decomposition (4.1) is fundamental to our error analysis. Based on (4.1), we bound each error term on its right-hand side and balance the bounds to obtain an overall bound for the bidirectional GAN estimation error.

To prove Lemma 4.1, we introduce the following useful inequality: for any two probability distributions, the difference between the IPMs with two distinct evaluation classes does not exceed twice the approximation error between the two evaluation classes. That is, for any probability distributions $\mu$ and $\nu$ and symmetric function classes $\mathcal{F}$ and $\mathcal{H}$,

\[
d_{\mathcal{H}}(\mu,\nu)-d_{\mathcal{F}}(\mu,\nu)\leq 2\mathcal{E}(\mathcal{H},\mathcal{F}).
\tag{4.2}
\]

It is easy to check that (4.2) still holds if we replace $d_{\mathcal{H}}(\mu,\nu)$ by its empirical version $\hat{d}_{\mathcal{H}}(\mu,\nu):=\sup_{h\in\mathcal{H}}[\hat{\mathbb{E}}_{\mu}h-\hat{\mathbb{E}}_{\nu}h]$.
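For completeness, here is a one-line verification of (4.2) (our own, following the definitions above): for any $h\in\mathcal{H}$ and $\varepsilon>0$, choose $f\in\mathcal{F}$ with $\|h-f\|_{\infty}\leq\mathcal{E}(\mathcal{H},\mathcal{F})+\varepsilon$; then

\[
\mathbb{E}_{\mu}h-\mathbb{E}_{\nu}h
\leq\mathbb{E}_{\mu}f-\mathbb{E}_{\nu}f+2\|h-f\|_{\infty}
\leq d_{\mathcal{F}}(\mu,\nu)+2\mathcal{E}(\mathcal{H},\mathcal{F})+2\varepsilon.
\]

Taking the supremum over $h\in\mathcal{H}$ and letting $\varepsilon\to 0$ gives (4.2); the same argument works with expectations replaced by empirical expectations.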

Proof of Lemma 4.1.

We have

\begin{align*}
d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})=&\ \sup_{f_{\omega}\in\mathcal{F}^{1}}\mathbb{E}f_{\omega}(\hat{g}(z),z)-\mathbb{E}f_{\omega}(x,\hat{e}(x))\\
\leq&\ \sup_{f_{\omega}\in\mathcal{F}^{1}}\mathbb{E}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)+\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))\\
&\ +\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))-\mathbb{E}f_{\omega}(x,\hat{e}(x))\\
=&\ \mathcal{E}_{3}+\mathcal{E}_{4}+\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x)).
\end{align*}

Denote $A:=\sup_{f_{\omega}\in\mathcal{F}^{1}}\hat{\mathbb{E}}f_{\omega}(\hat{g}(z),z)-\hat{\mathbb{E}}f_{\omega}(x,\hat{e}(x))$. By (4.2) and the optimality of the bidirectional GAN solution, $A$ satisfies

\begin{align*}
A&=\sup_{f_{\omega}\in\mathcal{F}^{1}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(\hat{g}(z_{i}),z_{i})-f_{\omega}(x_{i},\hat{e}(x_{i}))\Big)\\
&\leq\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(\hat{g}(z_{i}),z_{i})-f_{\omega}(x_{i},\hat{e}(x_{i}))\Big)+2\mathcal{E}(\mathcal{F}^{1},\mathcal{F}_{NN})\\
&=\inf_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big)+2\mathcal{E}_{1}\\
&=2\mathcal{E}_{1}+\mathcal{E}_{2}.
\end{align*}

Note that we cannot directly apply the symmetrization technique (see Appendix) to $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$, since $e^{*}$ and $g^{*}$ are correlated with the $x_{i}$ and $z_{i}$. However, this problem can be solved by replacing the samples $(x_{i},z_{i})$ in the empirical terms in $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ with ghost samples $(x^{\prime}_{i},z^{\prime}_{i})$ independent of $(x_{i},z_{i})$, and by replacing $g^{*}$ and $e^{*}$ with $g^{**}$ and $e^{**}$ obtained from the ghost samples, respectively. That is, we replace $\hat{\mathbb{E}}f_{\omega}(g^{*}(z),z)$ and $\hat{\mathbb{E}}f_{\omega}(x,e^{*}(x))$ with $\hat{\mathbb{E}}f_{\omega}(g^{**}(z^{\prime}),z^{\prime})$ and $\hat{\mathbb{E}}f_{\omega}(x^{\prime},e^{**}(x^{\prime}))$ in $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$, respectively. Then we can proceed with the same proof of Lemma 4.1 and apply the symmetrization technique to $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$, since $(g^{*}(z_{i}),z_{i})$ and $(g^{**}(z^{\prime}_{i}),z^{\prime}_{i})$ have the same distribution. To simplify the notation, we write $\hat{\mathbb{E}}f_{\omega}(g^{*}(z),z)$ and $\hat{\mathbb{E}}f_{\omega}(x,e^{*}(x))$ for $\hat{\mathbb{E}}f_{\omega}(g^{**}(z^{\prime}),z^{\prime})$ and $\hat{\mathbb{E}}f_{\omega}(x^{\prime},e^{**}(x^{\prime}))$, respectively.

4.2 Approximation errors

We now discuss the errors due to the discriminator approximation and the generator and encoder approximation.

4.2.1 The discriminator approximation error $\mathcal{E}_{1}$

The discriminator approximation error $\mathcal{E}_{1}$ describes how well the discriminator neural network class approximates functions in the Lipschitz class $\mathcal{F}^{1}$. Lemma 4.2 below can be applied to obtain the neural network approximation error for Lipschitz functions. It leads to a quantitative, non-asymptotic approximation rate in terms of the width and depth of the neural networks when bounding $\mathcal{E}_{1}$.

Lemma 4.2 (Shen et al. (2021)).

Let $f$ be a Lipschitz continuous function defined on $[-R,R]^{d}$. For arbitrary $W,L\in\mathbb{N}_{+}$, there exists a function $\psi$ implemented by a ReLU feedforward neural network with width $W$ and depth $L$ such that

\[
\|f-\psi\|_{\infty}=O\big(\sqrt{d}\,R(WL)^{-\frac{2}{d}}\big).
\]

By Lemma 4.2 and our choice of the architecture of the discriminator class $\mathcal{F}_{NN}$ in the theorems, we have $\mathcal{E}_{1}=O\big(\sqrt{d}(W_{1}L_{1})^{-\frac{2}{d+1}}\log n\big)$. Lemma 4.2 also informs us how to choose the architecture of the discriminator network based on how small we want the approximation error $\mathcal{E}_{1}$ to be. By setting $(W_{1}L_{1})^{2}\geq n$, $\mathcal{E}_{1}$ is dominated by the stochastic terms $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$.
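As an illustration of these sizing rules, the following Python sketch returns one admissible width/depth choice satisfying $W_{1}L_{1}\geq\lceil\sqrt{n}\rceil$, $W_{2}^{2}L_{2}\gtrsim C_{1}dn$, and $W_{3}^{2}L_{3}\gtrsim C_{2}n$; the particular depth choice is arbitrary and not prescribed by the paper.

```python
import math

def size_networks(n, d, c1=12, c2=12):
    """One admissible (width, depth) choice for discriminator, generator, encoder."""
    L1 = max(2, math.ceil(math.log2(n)))             # arbitrary depth choice
    W1 = math.ceil(math.ceil(math.sqrt(n)) / L1)     # ensures W1 * L1 >= ceil(sqrt(n))

    L2 = L3 = max(2, math.ceil(math.log2(n)))
    W2 = math.ceil(math.sqrt(c1 * d * n / L2))       # W2^2 * L2 >= c1 * d * n
    W3 = math.ceil(math.sqrt(c2 * n / L3))           # W3^2 * L3 >= c2 * n
    return {"discriminator": (W1, L1), "generator": (W2, L2), "encoder": (W3, L3)}

print(size_networks(n=10_000, d=32))
```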

4.2.2 The generator and encoder approximation error $\mathcal{E}_{2}$

The generator and encoder approximation error $\mathcal{E}_{2}$ describes how powerful the generator and encoder classes are in pushing the empirical distributions $\hat{\mu}_{n}$ and $\hat{\nu}_{n}$ to each other. A natural question is:

  • Can we find generator and encoder neural network functions such that $\mathcal{E}_{2}=0$?

Most of the current literature on the error analysis of GANs applies optimal transport theory (Villani, 2008) to bound an error term similar to $\mathcal{E}_{2}$; see, for example, Chen et al. (2020). However, the existence of an optimal transport map from $\mathbb{R}$ to $\mathbb{R}^{d}$ is not guaranteed. Therefore, the existing analyses of GANs can only handle the scenario in which the reference and the target data distributions are assumed to have the same dimension. This equal-dimensionality assumption is not satisfied in the actual training of GANs or bidirectional GANs in many applications. Here, instead of using optimal transport theory, we establish the approximation result in Theorem 4.3, which enables us to forgo the equal-dimensionality assumption.

Theorem 4.3.

Suppose that $\nu$, supported on $\mathbb{R}$, and $\mu$, supported on $\mathbb{R}^{d}$, are both absolutely continuous with respect to the Lebesgue measure, and that $z_{i}$, $1\leq i\leq n$, and $x_{i}$, $1\leq i\leq n$, are i.i.d. samples from $\nu$ and $\mu$, respectively. Then there exist generator and encoder neural network functions $g:\mathbb{R}\mapsto\mathbb{R}^{d}$ and $e:\mathbb{R}^{d}\mapsto\mathbb{R}$ such that $g$ and $e$ are inverse bijections of each other between $\{z_{i}:1\leq i\leq n\}$ and $\{x_{i}:1\leq i\leq n\}$, up to a permutation. Moreover, such neural network functions $g$ and $e$ can be obtained by properly specifying $W_{2}^{2}L_{2}=c_{2}dn$ and $W_{3}^{2}L_{3}=c_{3}n$ for some constants $12\leq c_{2},c_{3}\leq 384$.

Proof.

By the absolute continuity of $\nu$ and $\mu$, the $z_{i}$'s and $x_{i}$'s are all distinct almost surely. We can reorder the $z_{i}$'s from smallest to largest, so that $z_{1}<z_{2}<\ldots<z_{n}$. Let $z_{i+1/2}$ be any point between $z_{i}$ and $z_{i+1}$ for $i\in\{1,2,\ldots,n-1\}$. We define the continuous piecewise linear function $g:\mathbb{R}\mapsto\mathbb{R}^{d}$ by

\[
g(z)=\begin{cases}
x_{1} & z\leq z_{1},\\
\frac{z-z_{i+1/2}}{z_{i}-z_{i+1/2}}\,x_{i}+\frac{z-z_{i}}{z_{i+1/2}-z_{i}}\,x_{i+1} & z\in(z_{i},z_{i+1/2}),\ i=1,\ldots,n-1,\\
x_{i+1} & z\in[z_{i+1/2},z_{i+1}],\ i=1,\ldots,n-2,\\
x_{n} & z\geq z_{n-1+1/2}.
\end{cases}
\]

By Yang et al. (2021, Lemma 3.1), $g\in\mathcal{NN}(W_{2},L_{2})$ if $n\leq(W_{2}-d-1)\left\lfloor\frac{W_{2}-d-1}{6d}\right\rfloor\left\lfloor\frac{L_{2}}{2}\right\rfloor$. Taking $n=(W_{2}-d-1)\left\lfloor\frac{W_{2}-d-1}{6d}\right\rfloor\left\lfloor\frac{L_{2}}{2}\right\rfloor$, a simple calculation shows that $W^{2}_{2}L_{2}=cdn$ for some constant $12\leq c\leq 384$. The neural network function $e$ can be constructed in the same way, using the fact that the first coordinates of the $x_{i}$'s are distinct almost surely. ∎
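The following numpy sketch (ours, for illustration) constructs the piecewise linear map $g$ from the proof and checks that it sends each $z_{i}$ to its paired $x_{i}$; it does not implement the ReLU network realization from Yang et al. (2021, Lemma 3.1).

```python
import numpy as np

def build_piecewise_g(z, x):
    """Piecewise linear g: R -> R^d with g(z_i) = x_i for all sample points."""
    order = np.argsort(z)
    zs, xs = z[order], x[order]                # z_(1) < ... < z_(n) almost surely
    z_half = (zs[:-1] + zs[1:]) / 2.0          # intermediate points z_{i+1/2}

    def g(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        out = np.empty((t.size, x.shape[1]))
        for j, s in enumerate(t):
            if s <= zs[0]:
                out[j] = xs[0]                 # constant piece before z_1
            elif s >= z_half[-1]:
                out[j] = xs[-1]                # constant piece after z_{n-1+1/2}
            else:
                i = np.searchsorted(zs, s, side="right") - 1   # z_i <= s < z_{i+1}
                if s >= z_half[i]:
                    out[j] = xs[i + 1]         # flat piece [z_{i+1/2}, z_{i+1}]
                else:                          # linear piece (z_i, z_{i+1/2})
                    w = (s - zs[i]) / (z_half[i] - zs[i])
                    out[j] = (1.0 - w) * xs[i] + w * xs[i + 1]
        return out

    return g

rng = np.random.default_rng(0)
z = rng.normal(size=10)
x = rng.normal(size=(10, 3))
g = build_piecewise_g(z, x)
assert np.allclose(g(z), x)   # g matches every empirical pair (z_i, x_i) exactly
```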

When the number of point masses of an empirical distribution is moderate relative to the size of the neural networks, its pushforward by a neural network can exactly match any other empirical distribution with the same number of point masses.

Theorem 4.3 provides an effective way to specify the architecture of the generator and encoder classes. According to this theorem, we can take $n=\frac{W_{2}-d}{2}\left\lfloor\frac{W_{2}-d}{6d}\right\rfloor\left\lfloor\frac{L_{2}}{2}\right\rfloor+2=\frac{W_{3}-1}{2}\left\lfloor\frac{W_{3}-1}{6}\right\rfloor\left\lfloor\frac{L_{3}}{2}\right\rfloor+2$, which gives rise to $W_{2}^{2}L_{2}/d\asymp W_{3}^{2}L_{3}\asymp n$. More importantly, Theorem 4.3 can be applied to bound $\mathcal{E}_{2}$ as follows:

\begin{align*}
\mathcal{E}_{2}=&\ \inf_{g_{\theta}\in\mathcal{G}_{NN},e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big)\\
\leq&\ \inf_{g_{\theta}\in\mathcal{G}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(g_{\theta}(z_{i}),z_{i})-f_{\omega}(x_{i},z_{i})\Big)\\
&\ +\inf_{e_{\varphi}\in\mathcal{E}_{NN}}\sup_{f_{\omega}\in\mathcal{F}_{NN}}\frac{1}{n}\sum_{i=1}^{n}\Big(f_{\omega}(x_{i},z_{i})-f_{\omega}(x_{i},e_{\varphi}(x_{i}))\Big)\\
=&\ 0,
\end{align*}

where in the last step we simply reorder the $z_{i}$'s and $x_{i}$'s as in the proof of Theorem 4.3. Therefore, this error term is eliminated exactly.

4.3 Stochastic errors

The stochastic error $\mathcal{E}_{3}$ (respectively $\mathcal{E}_{4}$) quantifies how close the empirical distribution and the true latent joint distribution (respectively data joint distribution) are, with the Lipschitz class $\mathcal{F}^{1}$ as the evaluation class under the IPM. We apply the refined Dudley inequality (Schreuder, 2020), stated in Lemma 4.4 below (see also Lemma C.1 in the Appendix), to bound $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$.

Lemma 4.4 (Refined Dudley Inequality).

For a symmetric function class $\mathcal{F}$ with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq M$, we have

\[
\mathbb{E}[d_{\mathcal{F}}(\hat{\mu}_{n},\mu)]\leq\inf_{0<\delta<M}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{M}\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})}\,d\epsilon\right).
\]

The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) has the drawback that if the covering number $\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})$ increases too fast as $\epsilon$ goes to 0, the resulting upper bound is infinite and hence uninformative. The refined Dudley inequality circumvents this problem by integrating $\epsilon$ only from some $\delta>0$, as shown in Lemma 4.4, which also indicates that $\mathbb{E}\mathcal{E}_{3}$ scales with the covering number $\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})$.

By calculating the covering number of $\mathcal{F}^{1}$ and applying the refined Dudley inequality, we obtain the upper bound

\[
\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\}=O\left(C_{d}\,n^{-\frac{1}{d+1}}\log n\ \wedge\ \sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right).
\tag{4.3}
\]

5 Related work

Recently, several impressive works have studied the challenging problem of the convergence properties of unidirectional GANs. Arora et al. (2017) noted that the training of GANs may not generalize well, in the sense that even if training appears successful, the trained distribution may be far from the target distribution in standard metrics. On the other hand, Bai et al. (2019) showed that GANs can learn distributions in Wasserstein distance with polynomial sample complexity. Liang (2020) studied the rates of convergence of a class of GANs, including Wasserstein, Sobolev, and MMD GANs, and also established the nonparametric minimax optimal rate under the Sobolev IPM. The results of Bai et al. (2019) and Liang (2020) require invertible generator networks, meaning all the weight matrices need to be full-rank and the activation function needs to be the invertible leaky ReLU. Chen et al. (2020) established an upper bound for the estimation error rate under Hölder evaluation and target density classes, where $\mathcal{H}^{\beta}$ is the Hölder class with regularity $\beta$ and the density of the target $\mu$ is assumed to belong to $\mathcal{H}^{\alpha}$. They assumed that the reference distribution has the same dimension as the target distribution and applied optimal transport theory to control the generator approximation error. However, how the prefactor in the error bounds depends on the dimension $d$ in the existing results (Liang, 2020; Chen et al., 2020) is either not clearly described or is exponential. In high-dimensional settings with large $d$, this makes a substantial difference in the quality of the error bounds.

Singh et al. (2019) studied minimax convergence rates of nonparametric density estimation under a class of adversarial losses and investigated how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate; they also discussed connections to learning generative models in a minimax statistical sense. Uppal et al. (2019) generalized the idea of the Sobolev IPM to the Besov IPM, where both the target density class and the evaluation class are Besov classes, and showed how their results imply bounds on the statistical error of a GAN.

These results provide important insights into the understanding of GANs. However, as mentioned earlier, some of the assumptions made in these results, including equal dimensions of the reference and target distributions and bounded support of the distributions, are not satisfied in the training of GANs in practice. Our results avoid these assumptions. Moreover, the prefactors in our error bounds are explicitly shown to depend on the square root of the dimension $d$. Finally, the aforementioned results only dealt with unidirectional GANs; our work is the first to address the convergence properties of bidirectional GANs.

6 Conclusion

This paper derives error bounds for bidirectional GANs under the Dudley distance between the latent joint distribution and the data joint distribution. The results are established without two crucial conditions commonly assumed in the existing literature: equal dimensionality of the reference and target distributions and bounded support for these distributions. Additionally, this work contributes to neural network approximation theory by constructing neural network functions such that the pushforward of an empirical distribution can perfectly approximate another arbitrary empirical distribution of a different dimension, as long as their numbers of point masses are equal. A novel decomposition of the integral probability metric is also developed for the error analysis of bidirectional GANs, which can be useful in other generative learning problems.

A limitation of our results, as well as of all the existing results on the convergence properties of GANs, is that they suffer from the curse of dimensionality, which cannot be circumvented merely by imposing smoothness assumptions. In many applications, high-dimensional complex data such as images, texts, and natural languages tend to be supported on approximate lower-dimensional manifolds. It is desirable to take such structure into account in the theoretical analysis. An important extension of the present results is to show that bidirectional GANs can circumvent the curse of dimensionality if the target distribution is assumed to be supported on an approximate lower-dimensional manifold. This appears to be a technically challenging problem and will be pursued in our future work.

Acknowledgements

The authors wish to thank the three anonymous reviewers for their insightful comments and constructive suggestions that helped improve the paper significantly.

The work of J. Huang is partially supported by the U.S. NSF grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE. The work of Y. Wang is supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416 and HK Innovation Technology Fund ITS/044/18FX, as well as Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications.

References

  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In ICML.
  • Arora et al. (2017) Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (gans). In International Conference on Machine Learning, pages 224–232. PMLR.
  • Bai et al. (2019) Bai, Y., Ma, T., and Risteski, A. (2019). Approximability of discriminators implies diversity in GANs. In International Conference on Learning Representations.
  • Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale gan training for high fidelity natural image synthesis.
  • Chen et al. (2020) Chen, M., Liao, W., Zha, H., and Zhao, T. (2020). Statistical guarantees of generative adversarial networks for distribution estimation. arXiv preprint arXiv:2002.03938.
  • Donahue et al. (2016) Donahue, J., Krähenbühl, P., and Darrell, T. (2016). Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • Dudley (1967) Dudley, R. (1967). The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330.
  • Dudley (2018) Dudley, R. M. (2018). Real Analysis and Probability. CRC Press.
  • Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
  • Gottlieb et al. (2013) Gottlieb, L.-A., Kontorovich, A., and Krauthgamer, R. (2013). Efficient regression in metric spaces via approximate lipschitz extension. In International Workshop on Similarity-Based Pattern Recognition, pages 43–58. Springer.
  • Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of gans for improved quality, stability, and variation.
  • Karras et al. (2019) Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks.
  • Liang (2020) Liang, T. (2020). How well generative adversarial networks learn distributions.
  • Liu et al. (2017) Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991.
  • Lu and Lu (2020) Lu, Y. and Lu, J. (2020). A universal approximation theorem of deep neural networks for expressing distributions. arXiv preprint arXiv:2004.08867.
  • Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
  • Mohri et al. (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
  • Müller (1997) Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pages 429–443.
  • Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks.
  • Reed et al. (2016) Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative adversarial text to image synthesis. In ICML.
  • Schreuder (2020) Schreuder, N. (2020). Bounding the expectation of the supremum of empirical processes indexed by hölder classes.
  • Shen et al. (2020) Shen, X., Zhang, T., and Chen, K. (2020). Bidirectional generative modeling using adversarial gradient estimation. arXiv preprint arXiv:2002.09161.
  • Shen et al. (2019) Shen, Z., Yang, H., and Zhang, S. (2019). Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497.
  • Singh et al. (2019) Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., and Póczos, B. (2019). Nonparametric density estimation with adversarial losses. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10246–10257.
  • Srebro and Sridharan (2010) Srebro, N. and Sridharan, K. (2010). Note on refined Dudley integral covering number bound. Unpublished results. http://ttic.uchicago.edu/karthik/dudley.pdf.
  • Uppal et al. (2019) Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation & convergence rates for gans under besov ipm losses. arXiv preprint arXiv:1902.03511.
  • Van der Vaart and Wellner (1996) Van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes. Springer.
  • Villani (2008) Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
  • Yang et al. (2021) Yang, Y., Li, Z., and Wang, Y. (2021). On the capacity of deep generative networks for approximating distributions. arXiv preprint arXiv:2101.12353.
  • Zhang et al. (2018) Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2018). On the discrimination-generalization tradeoff in GANs. In International Conference on Learning Representations.
  • Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.

Appendix

In the appendix, we first prove Theorem 3.2, and then Theorems 3.1 and 3.3.

Appendix A Notations and Preliminaries

We use $\sigma$ to denote the ReLU activation function in neural networks, $\sigma(x)=\max\{x,0\}$. Unless otherwise indicated, $\|\cdot\|$ denotes the $L_{2}$ norm. For any function $g$, let $\|g\|_{\infty}=\sup_{x}\|g(x)\|$. We use the notations $O(\cdot)$ and $\tilde{O}(\cdot)$ to express the order of a function in slightly different ways: $O(\cdot)$ omits a universal constant not depending on $d$, while $\tilde{O}(\cdot)$ omits a constant depending on $d$. We use $B_{2}^{d}(a)$ to denote the $L_{2}$ ball in $\mathbb{R}^{d}$ centered at $\mathbf{0}$ with radius $a$. Let $g_{\#}\nu$ be the pushforward distribution of $\nu$ by a function $g$, in the sense that $g_{\#}\nu(A)=\nu(g^{-1}(A))$ for any measurable set $A$.

The $r$-covering number of a class $\mathcal{F}$ with respect to a norm $\|\cdot\|$ is the minimum number of $\|\cdot\|$-balls of radius $r$ needed to cover $\mathcal{F}$, denoted by $\mathcal{N}(r,\mathcal{F},\|\cdot\|)$. We denote by $\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))$ the covering number of $\mathcal{F}$ with respect to the $L_{2}(P_{n})$ norm, defined by $\|f\|^{2}_{L_{2}(P_{n})}=\frac{1}{n}\sum_{i=1}^{n}\|f(X_{i})\|^{2}$, where $X_{1},\ldots,X_{n}$ are the empirical samples. We denote by $\mathcal{N}(r,\mathcal{F},L_{\infty}(P_{n}))$ the covering number of $\mathcal{F}$ with respect to the $L_{\infty}(P_{n})$ norm, defined by $\|f\|_{L_{\infty}(P_{n})}=\max_{1\leq i\leq n}\|f(X_{i})\|$. It is easy to check that

\[
\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},L_{\infty}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},\|\cdot\|_{\infty}).
\]

Appendix B Restriction on the domain of the uniformly bounded Lipschitz function class $\mathcal{F}^{1}$

So far, most related works assume that the target distribution $\mu$ is supported on a compact set; see, for example, Chen et al. (2020) and Liang (2020). To remove the compact support assumption, we need Assumption 1, i.e., the tails of the target $\mu$ and the reference $\nu$ are subexponential. Define $\mathcal{F}_{n}^{1}:=\{f|_{B_{2}^{d+1}(\sqrt{2}\log n)}:f\in\mathcal{F}^{1}\}$. In this section, we show that proving Theorem 3.2 is equivalent to establishing the same convergence rate with the domain-restricted function class $\mathcal{F}_{n}^{1}$ as the evaluation class.

Under Assumption 1 and by the Markov inequality, we have

\[
P_{\nu}(\|z\|>\log n)\leq\frac{\mathbb{E}_{\nu}\|z\|\mathbbm{1}_{\{\|z\|>\log n\}}}{\log n}=O\big(n^{-\frac{(\log n)^{\delta}}{d}}/\log n\big).
\tag{B.1}
\]

The Dudley distance between the latent joint distribution $\hat{\boldsymbol{\nu}}$ and the data joint distribution $\hat{\boldsymbol{\mu}}$ is

\[
d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})=\sup_{f\in\mathcal{F}^{1}}\mathbb{E}f(\hat{g}(z),z)-\mathbb{E}f(x,\hat{e}(x)).
\tag{B.2}
\]

The first term above can be decomposed as

\[
\mathbb{E}f(\hat{g}(z),z)=\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|\leq\log n}+\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|>\log n}.
\tag{B.3}
\]

For any $f\in\mathcal{F}^{1}$ and a fixed point $z_{0}$ with $\|z_{0}\|\leq\log n$, the Lipschitzness of $f$ implies that the second term above satisfies

\begin{align*}
|\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|>\log n}|\leq&\ |\mathbb{E}f(\hat{g}(z),z)\mathbbm{1}_{\|z\|>\log n}-\mathbb{E}f(\hat{g}(z_{0}),z_{0})\mathbbm{1}_{\|z\|>\log n}|+|\mathbb{E}f(\hat{g}(z_{0}),z_{0})\mathbbm{1}_{\|z\|>\log n}|\\
\leq&\ \mathbb{E}\|(\hat{g}(z)-\hat{g}(z_{0}),z-z_{0})\|\mathbbm{1}_{\|z\|>\log n}+BP_{\nu}(\|z\|>\log n)\\
\leq&\ \mathbb{E}(\|\hat{g}(z)-\hat{g}(z_{0})\|+\|z-z_{0}\|)\mathbbm{1}_{\|z\|>\log n}+BP_{\nu}(\|z\|>\log n)\\
\leq&\ 2(\log n)P_{\nu}(\|z\|>\log n)+\mathbb{E}\|z-z_{0}\|\mathbbm{1}_{\|z\|>\log n}+BP_{\nu}(\|z\|>\log n)\\
=&\ O\big(n^{-\frac{(\log n)^{\delta}}{d}}\big),
\end{align*}

where the second inequality is due to the Lipschitzness and boundedness of $f$, and the last equality is due to Assumption 1, (B.1), and the boundedness condition on $\hat{g}$. In the first term in (B.3), $f$ only acts on the increasing $L_{2}$ ball $B^{d+1}_{2}(\sqrt{2}\log n)$ because of Condition 1 and the indicator function $\mathbbm{1}_{\{\|z\|\leq\log n\}}$. We can apply the same argument to the second term in (B.2). Therefore, restricting the domain of $\mathcal{F}^{1}$ to $B^{d+1}_{2}(\sqrt{2}\log n)$ yields an equivalent problem. Hence, in order to prove the estimation error rate in Theorem 3.2, we only need to show that for the restricted evaluation function class $\mathcal{F}^{1}_{n}$,

\[
\mathbb{E}\,d_{\mathcal{F}_{n}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}})\leq C_{0}\sqrt{d}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}\,n^{-\frac{1}{d+1}}\log n.
\]

Given this fact, and to keep the notation simple, we write $\mathcal{F}^{1}$ for $\mathcal{F}_{n}^{1}$ in the following sections.

Remark 1.

The restriction on $\mathcal{F}^{1}$ is technically necessary for calculating the covering number of $\mathcal{F}^{1}$; we will see its use when bounding the stochastic errors $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ below.

Appendix C Stochastic errors

C.1 Bounding $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$

The stochastic errors $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ quantify how close the empirical distributions and the true latent joint distribution (data joint distribution) are, with the Lipschitz class $\mathcal{F}^{1}$ as the evaluation class under the IPM. We apply Lemma C.1 to bound $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$. We introduce two methods to bound $\max\{\mathcal{E}_{3},\mathcal{E}_{4}\}$, which give two different upper bounds. Both utilize the following lemma, which we prove later. A more detailed description of the refined Dudley inequality can be found in Srebro and Sridharan (2010) and Schreuder (2020).

Lemma C.1 (Refined Dudley Inequality).

For a symmetric function class $\mathcal{F}$ with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq M$, we have

\[
\mathbb{E}[d_{\mathcal{F}}(\hat{\mu}_{n},\mu)]\leq\inf_{0<\delta<M}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{M}\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})}\,d\epsilon\right).
\]
Remark 2.

The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) suffers from the problem that if the covering number $\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{\infty})$ increases too fast as $\epsilon$ goes to 0, the upper bound can be infinite. The refined Dudley inequality circumvents this problem by integrating $\epsilon$ only from some $\delta>0$, which also indicates that $\mathbb{E}\mathcal{E}_{3}$ scales with the covering number $\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})$.

C.1.1 The first method (explicit constant)

The first method provides an explicit constant depending on d, at the expense of a higher order of \log n in the upper bound. It relies on the next lemma (Gottlieb et al., 2013, Lemma 6), which reduces the problem of bounding the covering number of a Lipschitz function class to that of bounding the covering number of its domain.

Lemma C.2 (Gottlieb et al. (2013)).

Let \mathcal{F}^{L} be the collection of L-Lipschitz functions mapping the metric space (\mathcal{X},\rho) to [0,1]. Then the covering number of \mathcal{F}^{L} can be bounded in terms of the covering number of \mathcal{X} with respect to \rho as follows:

\displaystyle\mathcal{N}(\epsilon,\mathcal{F}^{L},\|\cdot\|_{\infty})\leq\left(\frac{8}{\epsilon}\right)^{\mathcal{N}(\epsilon/8L,\mathcal{X},\rho)}.

Now we apply Lemma C.2 to bound the covering number for the 1-Lipschitz class 𝒩(ϵ,1,)\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty}) by bounding the covering number for its domain 𝒩(ϵ,B2d+1(2logn),2)\mathcal{N}(\epsilon,B^{d+1}_{2}(\sqrt{2}\log n),\|\cdot\|_{2}). Define a new function class 12B\mathcal{F}^{\frac{1}{2B}} as

12B:={f+B2B:f1}.\displaystyle\mathcal{F}^{\frac{1}{2B}}:=\{\frac{f+B}{2B}:f\in\mathcal{F}^{1}\}.

Recall that \mathcal{F}^{1} is restricted to B^{d+1}_{2}(\sqrt{2}\log n). Obviously, \mathcal{F}^{\frac{1}{2B}} is a \frac{1}{2B}-Lipschitz function class mapping B^{d+1}_{2}(\sqrt{2}\log n) to [0,1]. A direct application of Lemma C.2 shows that

𝒩(ϵ,12B,)(8ϵ)𝒩(ϵB/4,B2d+1(2logn),2).\displaystyle\mathcal{N}(\epsilon,\mathcal{F}^{\frac{1}{2B}},\|\cdot\|_{\infty})\leq\left(\frac{8}{\epsilon}\right)^{\mathcal{N}(\epsilon B/4,B_{2}^{d+1}(\sqrt{2}\log n),\|\cdot\|_{2})}. (C.1)

By the definition of 12B\mathcal{F}^{\frac{1}{2B}}, the covering numbers satisfy

𝒩(2Bϵ,1,)=𝒩(ϵ,12B,).\displaystyle\mathcal{N}(2B\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})=\mathcal{N}(\epsilon,\mathcal{F}^{\frac{1}{2B}},\|\cdot\|_{\infty}). (C.2)

Note that B^{d+1}_{2}(\sqrt{2}\log n) is a subset of the cube [-\sqrt{2}\log n,\sqrt{2}\log n]^{d+1}, and this cube can be covered by finitely many \epsilon-balls in \mathbb{R}^{d+1}, each circumscribing a subcube of side length 2\epsilon/\sqrt{d+1}; at most \sqrt{2(d+1)}\log n/\epsilon such subcubes are needed along each coordinate. It follows that

\displaystyle\mathcal{N}(\epsilon,B^{d+1}_{2}(\sqrt{2}\log n),\|\cdot\|_{2})\leq\left(\frac{\sqrt{2(d+1)}\log n}{\epsilon}\right)^{d+1}. (C.3)

Combining (C.1), (C.2) and (C.3), we obtain an upper bound for the covering number of the 1-Lipschitz class 1\mathcal{F}^{1}

log𝒩(ϵ,1,)(82(d+1)lognϵ)d+1log16Bϵ.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq\left(\frac{8\sqrt{2(d+1)}\log n}{\epsilon}\right)^{d+1}\log\frac{16B}{\epsilon}. (C.4)

With the upper bound for the covering entropy in (C.4), a direct application of Lemma C.1 (see Section E for details) by taking δ=82(d+1)n1d+1(logn)1+1d+1\delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}} leads to

max{𝔼3,𝔼4}\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\} =O(dn1d+1(logn)1+1d+1+n1d+1(logn)1+1d+1)\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}+n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right) (C.5)
=O(dn1d+1(logn)1+1d+1).\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right). (C.6)

C.1.2 The second method (better order of logn\log n)

We now consider the second method, which leads to a better order for the \log n term in the upper bound, at the expense of the explicitness of the constant depending on d. The next lemma directly provides an upper bound for the covering number of the Lipschitz class, but with an implicit constant depending on d. It is a straightforward corollary of Van der Vaart and Wellner (1996, Theorem 2.7.1).

Lemma C.3.

Let 𝒳\mathcal{X} be a bounded, convex subset of d\mathbb{R}^{d} with nonempty interior. There exists a constant cdc_{d} depending only on dd such that

log𝒩(ϵ,1(𝒳),)cdλ(𝒳1)(1ϵ)d\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1}(\mathcal{X}),\|\cdot\|_{\infty})\leq c_{d}\lambda(\mathcal{X}^{1})\left(\frac{1}{\epsilon}\right)^{d}

for every ϵ>0\epsilon>0, where 1(𝒳)\mathcal{F}^{1}(\mathcal{X}) is the 1-Lipschitz function class defined on 𝒳\mathcal{X}, and λ(𝒳1)\lambda(\mathcal{X}^{1}) is the Lebesgue measure of the set {x:x𝒳<1}\{x:\|x-\mathcal{X}\|<1\}.

Applying Lemmas C.1 and C.3 (see Section E for details) by taking δ=n1d+1logn\delta=n^{-\frac{1}{d+1}}\log n yields

max{𝔼3,𝔼4}\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\} =O(Cdn1d+1logn),\displaystyle=O\left(C_{d}n^{-\frac{1}{d+1}}\log n\right), (C.7)

where CdC_{d} is some constant depending on dd. Combining (C.6) and (C.7), we get

max{𝔼3,𝔼4}\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\} =O(Cdn1d+1logndn1d+1(logn)1+1d+1).\displaystyle=O\left(C_{d}n^{-\frac{1}{d+1}}\log n\wedge\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right). (C.8)
Remark 3.

Here, we have a tradeoff between the logarithmic factor logn\log n and the explicitness of the constant depending on dd. If we want an explicit constant depending on dd, then we have the factor (logn)1+1d+1(\log n)^{1+\frac{1}{d+1}} in the upper bound. Later we will see that 𝔼3\mathbb{E}\mathcal{E}_{3} and 𝔼4\mathbb{E}\mathcal{E}_{4} are the dominating terms in the four error terms, hence the explicitness of the corresponding constant becomes important. Therefore, we list two different methods here to bound 𝔼3\mathbb{E}\mathcal{E}_{3} and 𝔼4\mathbb{E}\mathcal{E}_{4}.

C.2 Combination of the four error terms

With the upper bounds for the four error terms obtained above, we now consider \mathcal{E}_{1}-\mathcal{E}_{4} simultaneously to obtain the overall convergence rate. First, recall how we bound \mathcal{E}_{1} and \mathcal{E}_{2}. With Lemma 4.2, we have

1=O(d(W1L1)2d+1logn).\displaystyle\mathcal{E}_{1}=O\left(\sqrt{d}(W_{1}L_{1})^{-\frac{2}{d+1}}\log n\right). (C.9)

To control \mathcal{E}_{1} while keeping the architecture of the discriminator class \mathcal{F}_{NN} as small as possible, we let W_{1}L_{1}=\left\lceil\sqrt{n}\right\rceil, so that \mathcal{E}_{1}=O\left(\sqrt{d}n^{-\frac{1}{d+1}}\log n\right), which is dominated by \mathcal{E}_{3} and \mathcal{E}_{4}.

By Theorem 4.3, we can choose the architectures of the generator and encoder classes so that \mathcal{E}_{2} is perfectly controlled, i.e., \mathcal{E}_{2}=0.

We note that, because we imposed Condition 1 on both the generator and encoder classes, Theorem 4.3 cannot be applied if some \|x_{i}\| or \|z_{i}\| is greater than \log n, in which case \mathcal{E}_{2} cannot be perfectly controlled. We can still handle this case by considering the probability of the bad set.

Under Condition 1, on the nice set A:=\{\max_{1\leq i\leq n}\|x_{i}\|\leq\log n\}\cap\{\max_{1\leq i\leq n}\|z_{i}\|\leq\log n\}, we have \mathcal{E}_{2}=0. The probability of the nice set A has the following lower bound.

\displaystyle P(A) =P_{\mu}(\|x_{i}\|\leq\log n)^{n}\cdot P_{\nu}(\|z_{i}\|\leq\log n)^{n}
\displaystyle\geq(1-Cn^{-\frac{(\log n)^{\delta}}{d}})^{2n},\ \text{ for some constant $C>0$, by Assumption 1,}
\displaystyle\geq 1-Cn^{-\frac{(\log n)^{\delta}}{d}}\cdot(2n),\ \text{ for large $n$, by Bernoulli's inequality.}

The bad set A^{c} is the event on which \mathcal{E}_{2} may be positive; its probability has the following upper bound:

\displaystyle P(A^{c}) \leq Cn^{-\frac{(\log n)^{\delta}}{d}}\cdot(2n)
\displaystyle=O\left(n^{-\frac{(\log n)^{\delta^{\prime}}}{d}}\right),\ \text{ for any }\delta^{\prime}<\delta.

In Assumption 1, the (\log n)^{\delta} factor makes the tail of the target \mu strictly sub-exponential, which yields P(A^{c})\to 0; an exponential or heavier tail would lead to the undesired result P(A^{c})\to 1.

Now we are ready to obtain the desired result in Theorem 3.2. Recall that \mathcal{E}_{2}=0 on the nice set A. Combining the results above, we have

\displaystyle\mathbb{E}d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}}) \leq 2\mathcal{E}_{1}+\mathbb{E}[\mathcal{E}_{2}\mathbbm{1}_{A}]+\mathbb{E}[\mathcal{E}_{2}\mathbbm{1}_{A^{c}}]+\mathbb{E}\mathcal{E}_{3}+\mathbb{E}\mathcal{E}_{4}
\displaystyle\leq O\left(\sqrt{d}n^{-\frac{1}{d+1}}\log n+0+2BP(A^{c})+\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}n^{-\frac{1}{d+1}}\log n\right)
\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}n^{-\frac{1}{d+1}}\log n+n^{-\frac{(\log n)^{\delta^{\prime}}}{d}}\right)
\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\wedge C_{d}n^{-\frac{1}{d+1}}\log n\right),

which completes the proof of Theorem 3.2.

Appendix D Proof of Inequality (4.2)

For ease of reference, we restate inequality (4.2) as the following lemma.

Lemma 4.2.

For any symmetric function classes \mathcal{F} and \mathcal{H}, denote the approximation error (,)\mathcal{E}(\mathcal{H},\mathcal{F}) as

(,):=suphinffhf,\displaystyle\mathcal{E}(\mathcal{H},\mathcal{F}):=\underset{h\in\mathcal{H}}{\sup}\underset{f\in\mathcal{F}}{\inf}\|h-f\|_{\infty},

then for any probability distributions μ\mu and ν\nu,

d(μ,ν)d(μ,ν)2(,).\displaystyle d_{\mathcal{H}}(\mu,\nu)-d_{\mathcal{F}}(\mu,\nu)\leq 2\mathcal{E}(\mathcal{H},\mathcal{F}).
Proof of Lemma 4.2.

By the definition of supremum, for any ϵ>0\epsilon>0, there exists hϵh_{\epsilon}\in\mathcal{H} such that

d(μ,ν):\displaystyle d_{\mathcal{H}}(\mu,\nu): =suph[𝔼μh𝔼νh]\displaystyle=\underset{h\in\mathcal{H}}{\sup}[\mathbb{E}_{\mu}h-\mathbb{E}_{\nu}h]
𝔼μhϵ𝔼νhϵ+ϵ\displaystyle\leq\mathbb{E}_{\mu}h_{\epsilon}-\mathbb{E}_{\nu}h_{\epsilon}+\epsilon
=inff[𝔼μ(hϵf)𝔼ν(hϵf)+𝔼μ(f)𝔼ν(f)]+ϵ\displaystyle=\underset{f\in\mathcal{F}}{\inf}[\mathbb{E}_{\mu}(h_{\epsilon}-f)-\mathbb{E}_{\nu}(h_{\epsilon}-f)+\mathbb{E}_{\mu}(f)-\mathbb{E}_{\nu}(f)]+\epsilon
2inffhϵf+d(μ,ν)+ϵ\displaystyle\leq 2\underset{f\in\mathcal{F}}{\inf}\|h_{\epsilon}-f\|_{\infty}+d_{\mathcal{F}}(\mu,\nu)+\epsilon
2(,)+d(μ,ν)+ϵ,\displaystyle\leq 2\mathcal{E}(\mathcal{H},\mathcal{F})+d_{\mathcal{F}}(\mu,\nu)+\epsilon,

where the last line is due to the definition of \mathcal{E}(\mathcal{H},\mathcal{F}). Since \epsilon>0 is arbitrary, the claimed inequality follows. ∎

It is easy to check that if we replace d(μ,ν)d_{\mathcal{H}}(\mu,\nu) by d^(μ,ν):=suph[𝔼^μh𝔼^νh]\hat{d}_{\mathcal{H}}(\mu,\nu):=\underset{h\in\mathcal{H}}{\sup}[\hat{\mathbb{E}}_{\mu}h-\hat{\mathbb{E}}_{\nu}h], Lemma 4.2 still holds.
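Lemma 4.2 can also be checked numerically with small finite function classes represented by their value vectors on a common discrete support; the Python script below is only a sanity check of the inequality, with the classes and distributions chosen arbitrarily for illustration.

import numpy as np

rng = np.random.default_rng(0)
support_size = 21

# Two discrete distributions mu and nu on a common support (illustrative only).
mu = rng.dirichlet(np.ones(support_size))
nu = rng.dirichlet(np.ones(support_size))

# A symmetric class H (f in H implies -f in H) and a smaller symmetric class F inside it.
H_half = rng.uniform(-1.0, 1.0, size=(50, support_size))
H = np.vstack([H_half, -H_half])
F = np.vstack([H_half[:10], -H_half[:10]])

def ipm(fclass, p, q):
    # d_F(p, q) = sup_{f in fclass} [ E_p f - E_q f ]
    return np.max(fclass @ (p - q))

def approx_error(H, F):
    # E(H, F) = sup_{h in H} inf_{f in F} ||h - f||_infty on the common support
    return np.max([np.min(np.max(np.abs(F - h), axis=1)) for h in H])

lhs = ipm(H, mu, nu) - ipm(F, mu, nu)
rhs = 2 * approx_error(H, F)
print(f"d_H - d_F = {lhs:.4f} <= 2 E(H, F) = {rhs:.4f}")
assert lhs <= rhs + 1e-12

Here functions are identified with their value vectors on the support, so the sup-norm is simply a coordinate-wise maximum.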

Appendix E Bounding 𝔼3\mathbb{E}\mathcal{E}_{3} and 𝔼4\mathbb{E}\mathcal{E}_{4}

E.1 Method One

With the upper bound for the covering entropy (C.4), i.e.

log𝒩(ϵ,1,)(82(d+1)lognϵ)d+1log16Bϵ\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq\left(\frac{8\sqrt{2(d+1)}\log n}{\epsilon}\right)^{d+1}\log\frac{16B}{\epsilon}

and δ=82(d+1)n1d+1(logn)1+1d+1\delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}, applying Lemma C.1 we have

𝔼3\displaystyle\mathbb{E}\mathcal{E}_{3} =O(δ+n12δB(82(d+1)lognϵ)d+12(log16Bϵ)12𝑑ϵ)\displaystyle=O\left(\delta+n^{-\frac{1}{2}}\int_{\delta}^{B}\left(\frac{8\sqrt{2(d+1)}\log n}{\epsilon}\right)^{\frac{d+1}{2}}\left(\log\frac{16B}{\epsilon}\right)^{\frac{1}{2}}d\epsilon\right)
=O(δ+n12(82(d+1)logn)d+12(lognd+1)12δ1d+12)\displaystyle=O\left(\delta+n^{-\frac{1}{2}}(8\sqrt{2(d+1)}\log n)^{\frac{d+1}{2}}(\frac{\log n}{d+1})^{\frac{1}{2}}\delta^{1-\frac{d+1}{2}}\right)
=O(dn1d+1(logn)1+1d+1+n1d+1(logn)1+1d+1)\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}+n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right)
=O(dn1d+1(logn)1+1d+1),\displaystyle=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}\right),

where the second equality is due to

log16Bϵ\displaystyle\log\frac{16B}{\epsilon} =O(log1ϵ)=O(log(n1d+182(d+1)(logn)1+1d+1))=O(logn1d+1),\displaystyle=O\left(\log\frac{1}{\epsilon}\right)=O\left(\log\left(\frac{n^{\frac{1}{d+1}}}{8\sqrt{2(d+1)}(\log n)^{1+\frac{1}{d+1}}}\right)\right)=O\left(\log n^{\frac{1}{d+1}}\right),

and the third equality follows from simple algebra.
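For completeness, the simple algebra amounts to substituting \delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}} into the second term and collecting the exponents of n and \log n:

\displaystyle n^{-\frac{1}{2}}\left(8\sqrt{2(d+1)}\log n\right)^{\frac{d+1}{2}}\left(\frac{\log n}{d+1}\right)^{\frac{1}{2}}\delta^{1-\frac{d+1}{2}}=8\sqrt{2}\,n^{-\frac{1}{2}+\frac{d-1}{2(d+1)}}(\log n)^{\frac{d+2}{2}+\frac{(d+2)(1-d)}{2(d+1)}}=8\sqrt{2}\,n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}},

so the integral term contributes O(n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}) with an absolute constant, while the term 4\delta contributes the O(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{1+\frac{1}{d+1}}) part.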

E.2 Method Two

By Lemma C.3, we have

log𝒩(ϵ,1,)cd(lognϵ)d+1.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq c_{d}\left(\frac{\log n}{\epsilon}\right)^{d+1}.

Taking δ=n1d+1logn\delta=n^{-\frac{1}{d+1}}\log n and applying Lemma C.1, we obtain

𝔼3\displaystyle\mathbb{E}\mathcal{E}_{3} =O(δ+(cdn)12(logn)d+12δM(1ϵ)d+12𝑑ϵ)\displaystyle=O\left(\delta+(\frac{c_{d}}{n})^{\frac{1}{2}}(\log n)^{\frac{d+1}{2}}\int_{\delta}^{M}(\frac{1}{\epsilon})^{\frac{d+1}{2}}d\epsilon\right)
=O~(δ+n12(logn)d+12δ1d+12)\displaystyle=\tilde{O}\left(\delta+n^{-\frac{1}{2}}(\log n)^{\frac{d+1}{2}}\delta^{1-\frac{d+1}{2}}\right)
=O~(n1d+1logn),\displaystyle=\tilde{O}\left(n^{-\frac{1}{d+1}}\log n\right),

where \tilde{O}(\cdot) omits constants depending on d.
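Spelled out, the last equality substitutes \delta=n^{-\frac{1}{d+1}}\log n into the second term:

\displaystyle n^{-\frac{1}{2}}(\log n)^{\frac{d+1}{2}}\delta^{1-\frac{d+1}{2}}=n^{-\frac{1}{2}+\frac{d-1}{2(d+1)}}(\log n)^{\frac{d+1}{2}+\frac{1-d}{2}}=n^{-\frac{1}{d+1}}\log n,

which matches the order of \delta itself, so both terms are of order n^{-\frac{1}{d+1}}\log n up to constants depending on d.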

Appendix F Proof of Lemma C.1

For completeness, we provide a proof of the refined Dudley inequality in Lemma C.1. We apply the standard symmetrization and chaining techniques; see, for example, Van der Vaart and Wellner (1996).

Proof.

Let Y_{1},\ldots,Y_{n} be random samples from \mu that are independent of the X_{i}'s. Then we have

\displaystyle\mathbb{E}d_{\mathcal{F}}(\hat{\mu}_{n},\mu) =\mathbb{E}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}f(X_{i})]
\displaystyle=\mathbb{E}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}\frac{1}{n}\sum_{i=1}^{n}f(Y_{i})]
\displaystyle\leq\mathbb{E}_{X,Y}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\frac{1}{n}\sum_{i=1}^{n}f(Y_{i})]
\displaystyle=\mathbb{E}_{X,Y,\epsilon}\sup_{f\in\mathcal{F}}[\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f(X_{i})-f(Y_{i}))]
\displaystyle\leq 2\mathbb{E}\hat{\mathcal{R}}_{n}(\mathcal{F}),

where the first inequality is due to Jensen's inequality, and the third equality holds because f(X_{i})-f(Y_{i}) has a symmetric distribution.

Let \alpha_{0}=M and for any j\in\mathbb{N}_{+} let \alpha_{j}=2^{-j}M. For each j, let T_{j} be an \alpha_{j}-cover of \mathcal{F} w.r.t. L_{2}(P_{n}) such that |T_{j}|=\mathcal{N}(\alpha_{j},\mathcal{F},L_{2}(P_{n})). For each f\in\mathcal{F} and j, pick a function \hat{f}_{j}\in T_{j} such that \|\hat{f}_{j}-f\|_{L_{2}(P_{n})}<\alpha_{j}. Let \hat{f}_{0}=0; then for any N, we can express f by chaining as

\displaystyle f=f-\hat{f}_{N}+\sum_{j=1}^{N}(\hat{f}_{j}-\hat{f}_{j-1}).

Hence for any NN, we can express the empirical Rademacher complexity as

\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) =\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(f(X_{i})-\hat{f}_{N}(X_{i})+\sum_{j=1}^{N}(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i}))\right)
\displaystyle\leq\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(f(X_{i})-\hat{f}_{N}(X_{i})\right)+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i})\right)
\displaystyle\leq\|\epsilon\|_{L_{2}(P_{n})}\sup_{f\in\mathcal{F}}\|f-\hat{f}_{N}\|_{L_{2}(P_{n})}+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i})\right)
\displaystyle\leq\alpha_{N}+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{j}(X_{i})-\hat{f}_{j-1}(X_{i})\right),

where \epsilon=(\epsilon_{1},\ldots,\epsilon_{n}) and the second-to-last inequality is due to the Cauchy–Schwarz inequality. The second term is the sum of the empirical Rademacher complexities of the function classes \{f^{\prime}-f^{\prime\prime}:f^{\prime}\in T_{j},f^{\prime\prime}\in T_{j-1}\}, j=1,\ldots,N. Note that

\displaystyle\|\hat{f}_{j}-\hat{f}_{j-1}\|^{2}_{L_{2}(P_{n})} \leq\left(\|\hat{f}_{j}-f\|_{L_{2}(P_{n})}+\|f-\hat{f}_{j-1}\|_{L_{2}(P_{n})}\right)^{2}
\displaystyle\leq(\alpha_{j}+\alpha_{j-1})^{2}
\displaystyle=(3\alpha_{j})^{2}.

Massart's lemma (Mohri et al., 2018, Theorem 3.7) states that for any finite function class \mathcal{F} with \sup_{f\in\mathcal{F}}\|f\|_{L_{2}(P_{n})}\leq M, we have

\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F})\leq\sqrt{\frac{2M^{2}\log(|\mathcal{F}|)}{n}}.

Applying Massart’s lemma to the function classes {ff′′:fTj,f′′Tj1}\{f^{\prime}-f^{\prime\prime}:f^{\prime}\in T_{j},f^{\prime\prime}\in T_{j-1}\}, j=1,,Nj=1,\ldots,N, we get that for any NN,

^n()\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) αN+j=1N3αj2log(|Tj||Tj1|)n\displaystyle\leq\alpha_{N}+\sum_{j=1}^{N}3\alpha_{j}\sqrt{\frac{2\log(|T_{j}|\cdot|T_{j-1}|)}{n}}
αN+6j=1Nαjlog(|Tj|)n\displaystyle\leq\alpha_{N}+6\sum_{j=1}^{N}\alpha_{j}\sqrt{\frac{\log(|T_{j}|)}{n}}
αN+12j=1N(αjαj+1)log𝒩(αj,,L2(Pn))n\displaystyle\leq\alpha_{N}+12\sum_{j=1}^{N}(\alpha_{j}-\alpha_{j+1})\sqrt{\frac{\log\mathcal{N}(\alpha_{j},\mathcal{F},L_{2}(P_{n}))}{n}}
αN+12αN+1α0log𝒩(r,,L2(Pn))n𝑑r,\displaystyle\leq\alpha_{N}+12\int_{\alpha_{N+1}}^{\alpha_{0}}\sqrt{\frac{\log\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))}{n}}dr,

where the third inequality is due to 2(αjαj+1)=αj2(\alpha_{j}-\alpha_{j+1})=\alpha_{j}. Now for any small δ>0\delta>0 we can choose NN such that αN+1δ<αN\alpha_{N+1}\leq\delta<\alpha_{N}. Hence,

^n()\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) 2δ+12δ/2Mlog𝒩(r,,L2(Pn))n𝑑r.\displaystyle\leq 2\delta+12\int_{\delta/2}^{M}\sqrt{\frac{\log\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))}{n}}dr.

Since \delta>0 is arbitrary, we can replace \delta by 2\delta and take the infimum over \delta to get

\displaystyle\hat{\mathcal{R}}_{n}(\mathcal{F}) \leq\inf_{0<\delta<M}\left(4\delta+12\int_{\delta}^{M}\sqrt{\frac{\log\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))}{n}}dr\right).

The result then follows from the fact that

\displaystyle\mathcal{N}(r,\mathcal{F},L_{2}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},L_{\infty}(P_{n}))\leq\mathcal{N}(r,\mathcal{F},\|\cdot\|_{\infty}). ∎
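As a quick numerical sanity check of Massart's lemma used in the chaining step above, the following Python sketch estimates the empirical Rademacher complexity of a random finite function class by Monte Carlo and compares it with \sqrt{2M^{2}\log(|\mathcal{F}|)/n}; the class, sample size, and scaling are arbitrary and purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, num_funcs, M = 200, 32, 1.0

# A finite function class, represented by its values on a fixed sample of size n and
# rescaled so that each function has empirical L2(P_n) norm exactly M.
vals = rng.normal(size=(num_funcs, n))
vals *= M / np.sqrt(np.mean(vals ** 2, axis=1, keepdims=True))

# Monte Carlo estimate of R_hat_n(F) = E_eps sup_f (1/n) sum_i eps_i f(X_i).
reps = 2000
eps = rng.choice([-1.0, 1.0], size=(reps, n))
rademacher = np.mean(np.max(eps @ vals.T / n, axis=1))

massart = np.sqrt(2 * M ** 2 * np.log(num_funcs) / n)
print(f"empirical Rademacher complexity ~ {rademacher:.4f} <= Massart bound {massart:.4f}")

The Monte Carlo estimate should fall below the bound; in the chaining argument above, the lemma is applied with M=3\alpha_{j} to each difference class built from T_{j} and T_{j-1}.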

Appendix G Proof of Theorem 3.1

Proof.

Taking W_{1}L_{1}=\left\lceil\sqrt{n}\right\rceil, we obtain from Shen et al. (2019, Theorem 4.3) that \mathcal{E}_{1}=O(\sqrt{d}n^{-\frac{1}{d+1}}). The ranges of g and e cover the supports of \mu and \nu, respectively, hence Theorem 4.3 leads to \mathcal{E}_{2}=0. By Lemma C.2, we have

log𝒩(ϵ,1,)(82(d+1)Mϵ)d+1log16Bϵ.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}^{1},\|\cdot\|_{\infty})\leq\left(\frac{8\sqrt{2(d+1)}M}{\epsilon}\right)^{d+1}\log\frac{16B}{\epsilon}.

Now following the same procedure as in Section E by taking δ=82(d+1)n1d+1(logn)1d+1\delta=8\sqrt{2(d+1)}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}}, we have

max{𝔼3,𝔼4}=O(dn1d+1(logn)1d+1).\displaystyle\max\{\mathbb{E}\mathcal{E}_{3},\mathbb{E}\mathcal{E}_{4}\}=O\left(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}}\right).

Finally, we consider all four error terms simultaneously:

\displaystyle\mathbb{E}d_{\mathcal{F}^{1}}(\hat{\boldsymbol{\nu}},\hat{\boldsymbol{\mu}}) \leq\mathcal{E}_{1}+\mathcal{E}_{2}+\mathbb{E}\mathcal{E}_{3}+\mathbb{E}\mathcal{E}_{4}
\displaystyle=O(\sqrt{d}n^{-\frac{1}{d+1}}+0+\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}})
\displaystyle=O(\sqrt{d}n^{-\frac{1}{d+1}}(\log n)^{\frac{1}{d+1}}). ∎

Appendix H Proof of Theorem 3.3

Following the same line of proof as for Theorem 4.3, we have the following theorem.

Theorem H.1.

Suppose that \nu, supported on \mathbb{R}^{k}, and \mu, supported on \mathbb{R}^{d}, are both absolutely continuous w.r.t. the Lebesgue measure, and that the z_{i}'s and x_{i}'s are i.i.d. samples from \nu and \mu, respectively, for 1\leq i\leq n. Then there exist generator and encoder neural network functions g:\mathbb{R}^{k}\mapsto\mathbb{R}^{d} and e:\mathbb{R}^{d}\mapsto\mathbb{R}^{k} such that g and e are inverse bijections of each other between \{z_{i}:1\leq i\leq n\} and \{x_{i}:1\leq i\leq n\}. Moreover, such neural network functions g and e can be obtained by properly specifying W_{2}^{2}L_{2}=c_{2}dn and W_{3}^{2}L_{3}=c_{3}kn for some constants 12\leq c_{2},c_{3}\leq 384.

Since \mu and \nu are absolutely continuous by assumption, their coordinate marginals are also absolutely continuous. Hence the proof reduces to the one-dimensional case.
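The one-dimensional mechanism can be illustrated with a minimal Python sketch (not the construction used in the proof): sorting the two samples and interpolating piecewise linearly between them produces monotone maps g and e that are inverse bijections between the samples; these are exactly the kind of continuous piecewise linear functions handled by Lemma I.1 below.

import numpy as np

rng = np.random.default_rng(2)
n = 8
z = np.sort(rng.normal(size=n))           # latent sample (1-D for illustration)
x = np.sort(rng.uniform(-3, 3, size=n))   # data sample (1-D for illustration)

# Monotone piecewise linear map g with g(z_i) = x_i, and its inverse e with e(x_i) = z_i.
# np.interp is constant outside the range of its breakpoints.
def g(t):
    return np.interp(t, z, x)

def e(t):
    return np.interp(t, x, z)

assert np.allclose(g(z), x) and np.allclose(e(x), z)
assert np.allclose(e(g(z)), z)            # inverse bijections on the samples
print("g and e are inverse bijections between {z_i} and {x_i}.")

Because the samples come from absolutely continuous distributions, the sorted values are almost surely distinct, so the interpolation is well defined and strictly increasing between the breakpoints.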

Appendix I Additional Lemma

Denote 𝒮d(z0,,zN+1)\mathcal{S}^{d}(z_{0},\ldots,z_{N+1}) as the set of all continuous piecewise linear functions f:df:\mathbb{R}\mapsto\mathbb{R}^{d} which have breakpoints only at z0<z1<<zN<zN+1z_{0}<z_{1}<\cdots<z_{N}<z_{N+1} and are constant on (,z0)(-\infty,z_{0}) and (zN+1,)(z_{N+1},\infty). The following lemma is a result in Yang et al. (2021).

Lemma I.1.

Suppose that W\geq 7d+1, L\geq 2 and N\leq(W-d-1)\left\lfloor\frac{W-d-1}{6d}\right\rfloor\left\lfloor\frac{L}{2}\right\rfloor. Then for any z_{0}<z_{1}<\cdots<z_{N}<z_{N+1}, every function in \mathcal{S}^{d}(z_{0},\ldots,z_{N+1}) can be represented by a ReLU FNN with width and depth no larger than W and L, respectively.

This result characterizes the expressive capacity of ReLU FNNs for piecewise linear functions. If we choose N=(W-d-1)\left\lfloor\frac{W-d-1}{6d}\right\rfloor\left\lfloor\frac{L}{2}\right\rfloor, a simple calculation shows that cW^{2}L/d\leq N\leq CW^{2}L/d with c=1/384 and C=1/12. This means that when the number of breakpoints is moderate relative to the network size, such piecewise linear functions are expressible by feedforward ReLU networks.
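To illustrate the expressiveness claim, the Python sketch below realizes a member of \mathcal{S}^{1}(z_{0},\ldots,z_{N+1}) exactly with a single hidden layer of ReLU units, one unit per breakpoint; this is the classical shallow construction, not the width/depth-efficient network of Lemma I.1, and the breakpoints and values are arbitrary.

import numpy as np

# Breakpoints z_0 < ... < z_{N+1} and values of a continuous piecewise linear
# f: R -> R that is constant outside [z_0, z_{N+1}] (d = 1, illustrative only).
z = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([0.0, 1.0, -0.5, 2.0, 2.0])

def relu(t):
    return np.maximum(t, 0.0)

# Shallow ReLU representation  f(x) = y_0 + sum_j c_j * relu(x - z_j),
# where c_j is the change of slope at breakpoint z_j.
slopes = np.concatenate([[0.0], np.diff(y) / np.diff(z), [0.0]])  # slopes, incl. the two constant tails
coef = np.diff(slopes)                                            # slope changes, one per breakpoint

def f_relu(x):
    return y[0] + relu(np.subtract.outer(x, z)) @ coef

# Compare with direct piecewise linear interpolation (constant outside the range).
x_grid = np.linspace(-3.0, 3.0, 601)
assert np.allclose(f_relu(x_grid), np.interp(x_grid, z, y))
print("The one-hidden-layer ReLU network reproduces the piecewise linear function exactly.")

For d>1 the same idea can be applied coordinatewise, with one scalar ReLU construction per output coordinate.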