Non-Asymptotic Error Bounds for
Bidirectional GANs
Abstract
We derive nearly sharp bounds for the bidirectional GAN (BiGAN) estimation error under the Dudley distance between the latent joint distribution and the data joint distribution with appropriately specified architectures of the neural networks used in the model. To the best of our knowledge, this is the first theoretical guarantee for the bidirectional GAN learning approach. An appealing feature of our results is that they do not require the reference and the data distributions to have the same dimension or to have bounded support. These assumptions are commonly imposed in the existing convergence analyses of unidirectional GANs but may not be satisfied in practice. Our results are also applicable to the Wasserstein bidirectional GAN if the target distribution is assumed to have bounded support. To prove these results, we construct neural network functions that push forward an empirical distribution to another arbitrary empirical distribution on a possibly different-dimensional space. We also develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. These basic theoretical results are of independent interest and can be applied to other related learning problems.
1 Introduction
Generative adversarial networks (GANs) (Goodfellow et al., 2014) are an important approach to implicitly learning and sampling from complex high-dimensional distributions. GANs have been shown to achieve impressive performance in many machine learning tasks (Radford et al., 2016; Reed et al., 2016; Zhu et al., 2017; Karras et al., 2018, 2019; Brock et al., 2019). Several recent studies have generalized GANs to bidirectional generative learning, which learns an encoder mapping the data distribution to the reference distribution jointly with a generator mapping in the reverse direction. These studies include the adversarial autoencoder (AAE) (Makhzani et al., 2015), bidirectional GAN (BiGAN) (Donahue et al., 2016), adversarially learned inference (ALI) (Dumoulin et al., 2016), and bidirectional generative modeling using adversarial gradient estimation (AGES) (Shen et al., 2020). A common feature of these methods is that they generalize the basic adversarial training framework of the original GAN from unidirectional to bidirectional. Dumoulin et al. (2016) showed that BiGANs make use of the joint distribution of data and latent representations, which can capture the information in the data better than vanilla GANs. Compared with unidirectional GANs, the joint distribution matching in the training of bidirectional GANs alleviates mode dropping and encourages cycle consistency (Shen et al., 2020).
Several elegant and stimulating papers have analyzed the theoretical properties of unidirectional GANs. Arora et al. (2017) considered the generalization error of GANs under the neural net distance. Zhang et al. (2018) improved the generalization error bound of Arora et al. (2017). Liang (2020) studied the minimax optimal rates for learning distributions from empirical samples under Sobolev evaluation and density classes; the minimax rate is expressed in terms of the regularity parameters of the Sobolev density and evaluation classes. Bai et al. (2019) analyzed the estimation error of GANs under the Wasserstein distance for a special class of distributions implemented by a generator, while the discriminator is designed to guarantee zero bias. Chen et al. (2020) studied the convergence properties of GANs when both the evaluation class and the target density class are Hölder classes and derived an error bound that depends on the dimension of the data distribution and on the regularity parameters of the Hölder density and evaluation classes. While impressive progress has been made on the theoretical understanding of GANs, there are still some drawbacks in the existing results. For example,
- (a) The reference distribution and the target data distribution are assumed to have the same dimension, which is not the actual setting in GAN training.
- (b) The reference and the target data distributions are assumed to be supported on bounded sets.
- (c) The prefactors in the convergence rates may depend exponentially on the dimension of the data distribution.
In practice, GANs are usually trained using a reference distribution with a lower dimension than that of the target data distribution. Indeed, an important strength of GANs is that they can model low-dimensional latent structures by using a low-dimensional reference distribution. The bounded support assumption excludes some commonly used Gaussian distributions as the reference. Therefore, strictly speaking, the existing convergence analysis results do not apply to how GANs are actually trained in practice. In addition, there has been no theoretical analysis of bidirectional GANs in the literature.
1.1 Contributions
We derive nearly sharp non-asymptotic bounds for the GAN estimation error under the Dudley distance between the reference joint distribution and the data joint distribution. To the best of our knowledge, this is the first result providing theoretical guarantees for the bidirectional GAN estimation error rate. We do not assume that the reference and the target data distributions have the same dimension or that these distributions have bounded support. Also, our results are applicable to the Wasserstein distance if the target data distribution is assumed to have bounded support.
The main novel aspects of our work are as follows.
- (1) We allow the dimension of the reference distribution to be different from the dimension of the target distribution; in particular, it can be much lower than that of the target distribution.
- (2) We allow unbounded support for the reference distribution and the target distribution under mild conditions on the tail probabilities of the target distribution.
- (3) We explicitly establish that the prefactors in the error bounds depend on the square root of the dimension of the target distribution. This is a significant improvement over the exponential dependence on the dimension in the existing works.
Moreover, we develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. We also show that the pushforward distribution of an empirical distribution based on neural networks can perfectly approximate another arbitrary empirical distribution as long as the numbers of discrete points are the same.
Notation We use $\sigma$ to denote the ReLU activation function in neural networks, $\sigma(x) = \max\{x, 0\}$. We use $\mathrm{Id}$ to denote the identity map. Unless otherwise indicated, $\|\cdot\|$ denotes the Euclidean norm. For any function $f$, let $\|f\|_\infty = \sup_x |f(x)|$. We use the notations $O(\cdot)$ and $O_d(\cdot)$ to express the order of a function slightly differently, where $O(\cdot)$ omits a universal constant independent of the dimension $d$ while $O_d(\cdot)$ omits a constant depending on $d$. We use $B_r(x)$ to denote the ball in Euclidean space with center $x$ and radius $r$. Let $T_\# \nu$ be the pushforward distribution of $\nu$ by a function $T$, in the sense that $T_\# \nu(A) = \nu(T^{-1}(A))$ for any measurable set $A$. We use $\mathbb{E}_n$ to denote expectation with respect to the empirical distribution.
2 Bidirectional generative learning
We describe the setup of the bidirectional GAN estimation problem and present the assumptions we need in our analysis.
2.1 Bidirectional GAN estimators
Let the target data distribution be supported on a $d$-dimensional Euclidean space, and let the reference distribution be a distribution that is easy to sample from. We first consider the case when the reference is supported on a low-dimensional Euclidean space, and then extend the results to a reference whose dimension is arbitrary and can be different from $d$. Usually, the reference dimension is much smaller than $d$ in practical machine learning tasks such as image generation. The goal is to learn a generator $G$ and an encoder $E$ such that the joint distribution of a reference sample and its generated output matches the joint distribution of a data sample and its encoded representation. We call the former the joint latent distribution or joint reference distribution and the latter the joint data distribution or joint target distribution. At the population level, the bidirectional GAN solves the minimax problem:
where the three function classes involved are referred to as the generator class, the encoder class, and the discriminator class, respectively. Suppose we have two independent random samples, one from the target data distribution and one from the reference distribution. At the sample level, the bidirectional GAN solves the empirical version of the above minimax problem:
(2.1) |
where the first two are classes of neural networks approximating the generator class and the encoder class, respectively, and the third is a class of neural networks approximating the discriminator class.
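To make the empirical minimax problem (2.1) concrete, the following sketch shows one alternating gradient update of a bidirectional GAN in PyTorch. This is a minimal illustration under several assumptions not made in the paper: the network sizes, the optimizer, the learning rates, and the logistic (original BiGAN) loss used as a surrogate for the discriminator objective are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

d, m, width = 784, 64, 256  # data dimension, latent dimension, hidden width (illustrative)
G = nn.Sequential(nn.Linear(m, width), nn.ReLU(), nn.Linear(width, d))      # generator
E = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, m))      # encoder
D = nn.Sequential(nn.Linear(d + m, width), nn.ReLU(), nn.Linear(width, 1))  # discriminator on (x, z) pairs

opt_ge = torch.optim.Adam(list(G.parameters()) + list(E.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x):
    """One alternating update on a data batch x of shape [n, d]."""
    n = x.size(0)
    z = torch.randn(n, m)                       # sample from the reference distribution
    joint_latent = torch.cat([G(z), z], dim=1)  # (G(z), z): joint latent (reference) pair
    joint_data = torch.cat([x, E(x)], dim=1)    # (x, E(x)): joint data (target) pair

    # Discriminator ascent: try to separate the two joint distributions.
    d_loss = bce(D(joint_data.detach()), torch.ones(n, 1)) + \
             bce(D(joint_latent.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator/encoder descent: make the two joint distributions indistinguishable.
    ge_loss = bce(D(joint_latent), torch.ones(n, 1)) + \
              bce(D(joint_data), torch.zeros(n, 1))
    opt_ge.zero_grad(); ge_loss.backward(); opt_ge.step()

# usage (illustrative): train_step(torch.randn(128, d))
```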
2.2 Assumptions
We assume the target and the reference satisfy the following assumptions.
Assumption 1 (Subexponential tail).
For a sufficiently large truncation level, the target distribution and the reference distribution satisfy a first moment tail condition that makes their tails sub-exponential.
Assumption 2 (Absolute continuity).
Both the target distribution and the reference distribution are absolutely continuous with respect to the Lebesgue measure.
Assumption 1 is a technical condition for dealing with the case when the target and the reference are supported on unbounded Euclidean spaces instead of compact subsets. For distributions with bounded supports, this assumption is automatically satisfied. The extra factor in the tail condition ensures that the tails of the target and the reference are sub-exponential, and the condition is easily satisfied if the distributions are sub-Gaussian. For the reference distribution, Assumptions 1 and 2 can be easily satisfied by specifying it as a common distribution with an easy-to-sample density, such as a Gaussian or uniform distribution, which is usually done in applications of GANs. For the target distribution, Assumptions 1 and 2 specify the type of distributions that are learnable by bidirectional GANs with our theoretical guarantees. Note that Assumption 1 is also necessary in our proof for bounding the generator and encoder approximation error, in the sense that the results will not hold if the extra factor is replaced with 1. Assumption 2 is also necessary for Theorem 4.3 on mapping between empirical samples, which is essential in bounding the generator and encoder approximation error.
2.3 Generator, encoder and discriminator classes
Let the discriminator class consist of feedforward ReLU neural networks with a prescribed width and depth. Similarly, let the generator class consist of feedforward ReLU neural networks with a prescribed width and depth, and the encoder class consist of feedforward ReLU neural networks with its own prescribed width and depth.
The discriminator functions have the following form:
$f(x) = W_L \sigma(W_{L-1} \sigma(\cdots \sigma(W_0 x + b_0) \cdots) + b_{L-1}) + b_L,$
where $W_0, \ldots, W_L$ are the weight matrices with numbers of rows and columns no larger than the width, $b_0, \ldots, b_L$ are the bias vectors with compatible dimensions, and $\sigma$ is the ReLU activation function $\sigma(x) = \max\{x, 0\}$ applied elementwise. The generator and encoder functions have the same compositional form, where the weight matrices have numbers of rows and columns no larger than the respective widths of the generator and encoder networks, and the bias vectors have compatible dimensions.
We impose the following conditions on , , and .
Condition 1.
For any generator and encoder network in the specified classes, the outputs are uniformly bounded in the supremum norm by a prescribed, possibly growing, constant.
Condition 1 on the generator class can be easily satisfied by adding an additional clipping layer after the original output layer, namely
(2.2) |
We truncate the output of the generator to an increasing interval so as to cover the whole support relevant to the evaluation function class. Condition 1 on the encoder class can be satisfied in the same manner. This condition is technically necessary in our proof (see the appendix).
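As a concrete illustration of the clipping layer in (2.2), the following sketch appends an output truncation to a ReLU network so that Condition 1 holds by construction. The layer sizes and the clipping bound are illustrative assumptions, not the values used in the theorems; note that the clipping operation is itself realizable by two extra ReLU units per output coordinate, so the clipped network remains a ReLU network.

```python
import torch
import torch.nn as nn

class ClippedReLUNet(nn.Module):
    """ReLU MLP whose outputs are truncated to [-clip_bound, clip_bound]."""
    def __init__(self, in_dim=64, out_dim=784, width=256, depth=3, clip_bound=10.0):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, width), nn.ReLU()]
            dim = width
        layers.append(nn.Linear(dim, out_dim))
        self.body = nn.Sequential(*layers)
        self.clip_bound = clip_bound  # may be chosen to grow so the truncation interval covers the support

    def forward(self, z):
        out = self.body(z)
        # clamp(x, -B, B) = relu(x + B) - relu(x - B) - B, so this extra
        # clipping layer is itself a small ReLU network.
        return torch.clamp(out, -self.clip_bound, self.clip_bound)

# usage (illustrative): ClippedReLUNet()(torch.randn(8, 64)) has entries in [-10, 10]
```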
3 Non-asymptotic error bounds
We characterize the bidirectional GAN solutions based on minimizing the integral probability metric (IPM, Müller (1997)) between two distributions $\mu$ and $\gamma$ with respect to a symmetric evaluation function class $\mathcal{F}$, defined by
$d_{\mathcal{F}}(\mu, \gamma) = \sup_{f \in \mathcal{F}} \big[ \mathbb{E}_{X \sim \mu} f(X) - \mathbb{E}_{Y \sim \gamma} f(Y) \big].$ (3.1)
By specifying the evaluation function class differently, we can obtain many commonly-used metrics (Liu et al., 2017). Here we focus on the following two: the Dudley metric, induced by the bounded Lipschitz evaluation class, and the 1-Wasserstein metric, induced by the 1-Lipschitz evaluation class.
We consider the estimation error under the Dudley metric. Note that when the two joint distributions have bounded supports, the Dudley metric is equivalent to the 1-Wasserstein metric. Therefore, under the bounded support condition, all our convergence results also hold under the Wasserstein distance. Even if the supports are unbounded, we can still apply the result of Lu and Lu (2020) to avoid empirical process theory and obtain a stochastic error bound under the Wasserstein distance. However, the result of Lu and Lu (2020) requires sub-Gaussianity to obtain the prefactor. In order to be more general, we use empirical process theory to get an explicit prefactor. Also, the discriminator approximation error would be unbounded if we considered the Wasserstein distance. Hence, we only consider the Dudley metric in the unbounded support case.
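For intuition, the IPM in (3.1) with the bounded Lipschitz evaluation class can be approximated from samples by maximizing the mean difference over a small neural discriminator. This is a rough numerical illustration only: weight clipping (borrowed from Wasserstein GAN practice) enforces the Lipschitz constraint only loosely, and the network size, training steps, and clipping thresholds below are assumptions.

```python
import torch
import torch.nn as nn

def bounded_lipschitz_ipm(x_p, x_q, bound=1.0, steps=500, lr=1e-3, clip=0.05):
    """Crude lower bound on the bounded-Lipschitz (Dudley-type) IPM between two samples.

    Maximizes E_P[f] - E_Q[f] over a small ReLU network f whose outputs are clamped
    to [-bound, bound]; weight clipping only roughly controls the Lipschitz constant.
    """
    f = nn.Sequential(nn.Linear(x_p.size(1), 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(steps):
        gap = torch.clamp(f(x_p), -bound, bound).mean() - \
              torch.clamp(f(x_q), -bound, bound).mean()
        opt.zero_grad(); (-gap).backward(); opt.step()
        with torch.no_grad():  # crude Lipschitz control via weight clipping
            for w in f.parameters():
                w.clamp_(-clip, clip)
    with torch.no_grad():
        return (torch.clamp(f(x_p), -bound, bound).mean()
                - torch.clamp(f(x_q), -bound, bound).mean()).item()

# example: two Gaussian samples with shifted means
print(bounded_lipschitz_ipm(torch.randn(2000, 2) + 1.0, torch.randn(2000, 2)))
```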
The bidirectional GAN solution in (2.1) also minimizes the distance between the empirical joint latent distribution and the empirical joint data distribution under the neural net distance induced by the discriminator class.
However, even if two distributions are close with respect to the neural net distance, there is no automatic guarantee that they will still be close under other metrics, for example, the Dudley or the Wasserstein distance (Arora et al., 2017). Therefore, it is natural to ask the question:
• How close are the two joint distributions produced by the bidirectional GAN estimators under some other, stronger metrics?
We consider the IPM with the uniformly bounded 1-Lipschitz function class as the evaluation class, which is defined, for some finite $B > 0$, as
$\mathcal{F}^B = \big\{ f : \|f\|_\infty \le B,\ |f(x) - f(y)| \le \|x - y\| \text{ for all } x, y \big\}.$ (3.2)
In Theorem 3.1, we consider the bounded support case; in Theorem 3.2, we extend the result to the unbounded support case; and in Theorem 3.3, we extend the result to the case where the dimension of the reference distribution is arbitrary.
We first present a result for the case when the target is supported on a compact subset and the reference is supported on a bounded set.
Theorem 3.1.
Suppose that the target is supported on and the reference is supported on for a finite , and Assumption 2 holds. Let the outputs of and be within and for and , respectively. By specifying the three network structures as , , and for some constants and properly choosing parameters, we have
where is a constant independent of and .
The prefactor in the error bound depends only linearly on the square root of the dimension of the target distribution. This is different from the existing works, where the dependence of the prefactor on the dimension is either not clearly described or is exponential. In high-dimensional settings, this makes a substantial difference in the quality of the error bounds. These remarks apply to all the results stated below.
The next theorem deals with the case of unbounded support.
Theorem 3.2.
Note that two methods are used in bounding the stochastic errors (see the appendix), which leads to two different bounds: one with an explicit prefactor, at the cost of an additional logarithmic factor, and another with an implicit prefactor but a better order in the sample size. Hence, there is a tradeoff between the explicitness of the prefactor and the order of the rate.
Our next result generalizes the above to the case when the reference distribution is supported on a Euclidean space of arbitrary dimension.
Assumption 3.
The target distribution is absolutely continuous with respect to the Lebesgue measure on its ambient space, and the reference distribution is absolutely continuous with respect to the Lebesgue measure on its own, possibly different-dimensional, ambient space.
With the above assumption, we have the following theorem providing theoretical guarantees for the validity of a reference distribution of any dimension.
Theorem 3.3.
4 Approximation and stochastic errors
In this section we present a novel inequality for decomposing the total error into approximation and stochastic errors and establish bounds on these errors.
4.1 Decomposition of the estimation error
Define the approximation error of a function class $\mathcal{F}_1$ to another function class $\mathcal{F}_2$ by
$\mathcal{E}(\mathcal{F}_1, \mathcal{F}_2) = \sup_{f \in \mathcal{F}_1} \inf_{g \in \mathcal{F}_2} \|f - g\|_\infty.$
We decompose the Dudley distance between the latent joint distribution and the data joint distribution into four error terms:
• the approximation error of the discriminator class to the bounded Lipschitz evaluation class;
• the approximation error of the generator and encoder classes;
• the stochastic error for the latent joint distribution;
• the stochastic error for the data joint distribution.
Lemma 4.1.
The novel decomposition (4.1) is fundamental to our error analysis. Based on (4.1), we bound each error term on its right side and balance the bounds to obtain an overall bound for the bidirectional GAN estimation error.
To prove Lemma 4.1, we introduce the following useful inequality, which states that for any two probability distributions, the difference between the IPMs with two distinct evaluation classes does not exceed twice the approximation error between the two evaluation classes; that is, for any probability distributions and any symmetric function classes,
(4.2) |
It is easy to check that if we replace by , (4.2) still holds.
Proof of Lemma 4.1.
∎
Note that we cannot directly apply the symmetrization technique (see the appendix) to the two stochastic error terms, since the estimated generator and encoder are correlated with the samples. However, this problem can be solved by replacing the samples in the empirical terms with ghost samples that are independent of the original ones, and replacing the estimated generator and encoder with the ones obtained from the ghost samples. Then we can proceed with the same proof of Lemma 4.1 and apply the symmetrization technique, since the estimators obtained from the original samples and from the ghost samples have the same distribution. To simplify the notation, we use the same symbols for both.
4.2 Approximation errors
We now discuss the errors due to the discriminator approximation and the generator and encoder approximation.
4.2.1 The discriminator approximation error
The discriminator approximation error describes how well the discriminator neural network class approximates functions from the Lipschitz evaluation class. Lemma 4.2 below can be applied to obtain the neural network approximation error for Lipschitz functions. It leads to a quantitative and non-asymptotic approximation rate in terms of the width and depth of the neural networks when bounding the discriminator approximation error.
Lemma 4.2 (Shen et al. (2021)).
Let be a Lipschitz continuous function defined on . For arbitrary , there exists a function implemented by a ReLU feedforward neural network with width and depth such that
By Lemma 4.2 and our choice of the architecture of the discriminator class in the theorems, the discriminator approximation error can be controlled explicitly. Lemma 4.2 also informs us how to choose the architecture of the discriminator networks based on how small we want the approximation error to be. By choosing the approximation accuracy appropriately, the discriminator approximation error is dominated by the stochastic error terms.
4.2.2 The generator and encoder approximation error
The generator and encoder approximation error describes how powerful the generator and encoder classes are in pushing the two empirical distributions to each other. A natural question is:
• Can we find generator and encoder neural network functions such that this approximation error is exactly zero?
Most of the current literature concerning the error analysis of GANs applies optimal transport theory (Villani, 2008) to bound an error term similar to the generator approximation error; see, for example, Chen et al. (2020). However, the existence of an optimal transport map from the reference to the target is not guaranteed in general. Therefore, the existing analyses of GANs can only deal with the scenario where the reference and the target data distribution are assumed to have the same dimension. This equal dimensionality assumption is not satisfied in the actual training of GANs or bidirectional GANs in many applications. Here, instead of using optimal transport theory, we establish the following approximation results in Theorem 4.3, which enables us to forgo the equal dimensionality assumption.
Theorem 4.3.
Suppose that two distributions, each supported on its own Euclidean space, are both absolutely continuous with respect to the Lebesgue measure, and that we observe i.i.d. samples of equal size from each of them. Then there exist generator and encoder neural network functions such that the generator and the encoder are inverse bijections of each other between the two sets of sample points, up to a permutation. Moreover, such neural network functions can be obtained by properly specifying the network widths and depths in terms of the sample size, for some constant.
Proof.
By the absolute continuity of the two distributions, all the sample points are distinct almost surely. We can reorder the one-dimensional reference sample from the smallest to the largest and pick an arbitrary point between each pair of consecutive order statistics. We then define a continuous piecewise linear function that maps each ordered reference point to the corresponding data sample point.
By Yang et al. (2021, Lemma 3.1), such a continuous piecewise linear function can be represented by a ReLU network once the width and depth are large enough relative to the number of breakpoints; a simple calculation shows that the network sizes specified in the theorem suffice for some constant. The encoder neural network function can be constructed in the same way, using the fact that the first coordinates of the data sample points are distinct almost surely. ∎
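The construction in the proof can be mimicked numerically: sort the one-dimensional reference sample, reorder the data sample by its first coordinate, and connect the paired points by continuous piecewise-linear interpolation, which pushes one empirical distribution exactly onto the other. The numpy sketch below illustrates this; the particular distributions and sample sizes are assumptions, and np.interp stands in for the ReLU-network realization guaranteed by Yang et al. (2021).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
z = rng.normal(size=n)                                # reference sample in R
x = rng.normal(size=(n, d)) * [1.0, 2.0, 0.5] + 1.0   # data sample in R^d (illustrative)

# Pair the samples as in the proof: sort z, and sort x by its first coordinate.
knots_z = np.sort(z)                 # z_(1) < ... < z_(n), distinct a.s.
targets = x[np.argsort(x[:, 0])]     # data points reordered by their first coordinate

# G: R -> R^d, continuous and piecewise linear in z, mapping each sorted z to its paired data point.
def G(z_new):
    z_new = np.atleast_1d(z_new)
    return np.stack([np.interp(z_new, knots_z, targets[:, j]) for j in range(d)], axis=1)

# E: R^d -> R, inverts the pairing through the (a.s. distinct) first coordinate.
def E(x_new):
    x_new = np.atleast_2d(x_new)
    return np.interp(x_new[:, 0], targets[:, 0], knots_z)

# The pushforward of the empirical reference equals the empirical data distribution,
# and E recovers the reference points: the two maps are inverse bijections on the samples.
assert np.allclose(G(knots_z), targets)
assert np.allclose(E(targets), knots_z)
```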
When the number of point masses of the empirical distributions is moderate relative to the size of the neural networks, we can approximate an empirical distribution arbitrarily well by the pushforward, under the networks, of any empirical distribution with the same number of point masses.
Theorem 4.3 provides an effective way to specify the architectures of the generator and encoder classes. According to this result, we can choose the network sizes in terms of the sample size. More importantly, Theorem 4.3 can be applied to bound the generator and encoder approximation error as follows.
We simply reorder the samples as in the proof. Therefore, this error term can be completely eliminated.
4.3 Stochastic errors
The stochastic errors quantify how close the empirical distributions are to the true latent joint distribution and the data joint distribution, with the Lipschitz class as the evaluation class under the IPM. We apply the refined Dudley inequality (Schreuder, 2020), stated in Lemma C.1, to bound these two terms.
Lemma 4.4 (Refined Dudley Inequality).
For a symmetric function class with , we have
The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) suffers from the problem that if the covering number increases too fast as the covering radius goes to zero, then the upper bound becomes infinite and is therefore meaningless. The refined Dudley inequality circumvents this problem by integrating only from a positive lower limit, as shown in Lemma C.1, which also indicates how the bound scales with the covering number.
By calculating the covering number of the evaluation class and applying the refined Dudley inequality, we obtain the upper bound
(4.3) |
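As a small numerical illustration of the stochastic error terms, the following sketch estimates the 1-Wasserstein distance between an empirical sample and the underlying distribution (proxied here by a much larger independent sample) in one dimension. The standard normal distribution and the sample sizes are assumptions made only for illustration; in one dimension the distance decays at roughly the parametric rate, whereas in higher dimensions the decay slows down, reflecting the curse of dimensionality discussed later.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
proxy_truth = rng.normal(size=200_000)   # large sample standing in for the true distribution
for n in [100, 400, 1600, 6400]:
    sample = rng.normal(size=n)
    w1 = wasserstein_distance(sample, proxy_truth)
    print(n, round(w1, 4))               # in 1D this decays roughly like n**(-1/2)
```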
5 Related work
Recently, several impressive works have studied the challenging problem of the convergence properties of unidirectional GANs. Arora et al. (2017) noted that the training of GANs may not have good generalization properties, in the sense that even when training appears successful, the trained distribution may be far from the target distribution in standard metrics. On the other hand, Bai et al. (2019) showed that GANs can learn distributions in Wasserstein distance with polynomial sample complexity. Liang (2020) studied the rates of convergence of a class of GANs, including Wasserstein, Sobolev, and MMD GANs; this work also established the nonparametric minimax optimal rate under the Sobolev IPM. The results of Bai et al. (2019) and Liang (2020) require invertible generator networks, meaning that all the weight matrices need to be full-rank and the activation function needs to be the invertible leaky ReLU activation. Chen et al. (2020) established an upper bound on the estimation error rate under Hölder evaluation and target density classes, where the evaluation class is a Hölder class with a given regularity and the density of the target is assumed to belong to a Hölder class as well. They assumed that the reference distribution has the same dimension as the target distribution and applied optimal transport theory to control the generator approximation error. However, in the existing results (Liang, 2020; Chen et al., 2020), how the prefactor in the error bounds depends on the dimension is either not clearly described or the dependence is exponential. In high-dimensional settings, this makes a substantial difference in the quality of the error bounds.
Singh et al. (2019) studied minimax convergence rates of nonparametric density estimation under a class of adversarial losses and investigated how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate; they also discussed connections to learning generative models in a minimax statistical sense. Uppal et al. (2019) generalized the idea of the Sobolev IPM to the Besov IPM, where both the target density class and the evaluation class are Besov classes. They also showed how their results imply bounds on the statistical error of a GAN.
These results provide important insights into the understanding of GANs. However, as we mentioned earlier, some of the assumptions made in these results, including equal dimensionality of the reference and target distributions and bounded support of the distributions, are not satisfied in the training of GANs in practice. Our results avoid these assumptions. Moreover, the prefactors in our error bounds are clearly described as depending on the square root of the dimension. Finally, the aforementioned results only dealt with unidirectional GANs. Our work is the first to address the convergence properties of bidirectional GANs.
6 Conclusion
This paper derives error bounds for bidirectional GANs under the Dudley distance between the latent joint distribution and the data joint distribution. The results are established without two crucial conditions that are commonly assumed in the existing literature: equal dimensionality of the reference and the target distributions and bounded support for these distributions. Additionally, this work contributes to neural network approximation theory by constructing neural network functions such that the pushforward distribution of an empirical distribution can perfectly approximate another arbitrary empirical distribution of a different dimension, as long as their numbers of point masses are equal. A novel decomposition of the integral probability metric is also developed for the error analysis of bidirectional GANs, which can be useful in other generative learning problems.
A limitation of our results, as well as all the existing results on the convergence properties of GANs, is that they suffer from the curse of dimensionality, which cannot be circumvented merely by imposing smoothness assumptions. In many applications, high-dimensional complex data such as images, texts, and natural languages tend to be supported on approximate lower-dimensional manifolds. It is desirable to take such structure into account in the theoretical analysis. An important extension of the present results is to show that bidirectional GANs can circumvent the curse of dimensionality if the target distribution is assumed to be supported on an approximate lower-dimensional manifold. This appears to be a technically challenging problem and will be pursued in our future work.
Acknowledgements
The authors wish to thank the three anonymous reviewers for their insightful comments and constructive suggestions that helped improve the paper significantly.
The work of J. Huang is partially supported by the U.S. NSF grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE. The work of Y. Wang is supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416 and HK Innovation Technology Fund ITS/044/18FX, as well as Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications.
References
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In ICML.
- Arora et al. (2017) Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (gans). In International Conference on Machine Learning, pages 224–232. PMLR.
- Bai et al. (2019) Bai, Y., Ma, T., and Risteski, A. (2019). Approximability of discriminators implies diversity in GANs. In International Conference on Learning Representations.
- Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale gan training for high fidelity natural image synthesis.
- Chen et al. (2020) Chen, M., Liao, W., Zha, H., and Zhao, T. (2020). Statistical guarantees of generative adversarial networks for distribution estimation. arXiv preprint arXiv:2002.03938.
- Donahue et al. (2016) Donahue, J., Krähenbühl, P., and Darrell, T. (2016). Adversarial feature learning. arXiv preprint arXiv:1605.09782.
- Dudley (1967) Dudley, R. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330.
- Dudley (2018) Dudley, R. M. (2018). Real Analysis and Probability. CRC Press.
- Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704.
- Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
- Gottlieb et al. (2013) Gottlieb, L.-A., Kontorovich, A., and Krauthgamer, R. (2013). Efficient regression in metric spaces via approximate lipschitz extension. In International Workshop on Similarity-Based Pattern Recognition, pages 43–58. Springer.
- Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of gans for improved quality, stability, and variation.
- Karras et al. (2019) Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks.
- Liang (2020) Liang, T. (2020). How well generative adversarial networks learn distributions.
- Liu et al. (2017) Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991.
- Lu and Lu (2020) Lu, Y. and Lu, J. (2020). A universal approximation theorem of deep neural networks for expressing distributions. arXiv preprint arXiv:2004.08867.
- Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
- Mohri et al. (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
- Müller (1997) Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pages 429–443.
- Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks.
- Reed et al. (2016) Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative adversarial text to image synthesis. In ICML.
- Schreuder (2020) Schreuder, N. (2020). Bounding the expectation of the supremum of empirical processes indexed by hölder classes.
- Shen et al. (2020) Shen, X., Zhang, T., and Chen, K. (2020). Bidirectional generative modeling using adversarial gradient estimation. arXiv preprint arXiv:2002.09161.
- Shen et al. (2019) Shen, Z., Yang, H., and Zhang, S. (2019). Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497.
- Singh et al. (2019) Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., and Póczos, B. (2019). Nonparametric density estimation with adversarial losses. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10246–10257.
- Srebro and Sridharan (2010) Srebro, N. and Sridharan, K. (2010). Note on refined Dudley integral covering number bound. Unpublished results. http://ttic.uchicago.edu/karthik/dudley.pdf.
- Uppal et al. (2019) Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation & convergence rates for gans under besov ipm losses. arXiv preprint arXiv:1902.03511.
- Van der Vaart and Wellner (1996) Van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes. Springer.
- Villani (2008) Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
- Yang et al. (2021) Yang, Y., Li, Z., and Wang, Y. (2021). On the capacity of deep generative networks for approximating distributions. arXiv preprint arXiv:2101.12353.
- Zhang et al. (2018) Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2018). On the discrimination-generalization tradeoff in GANs. In International Conference on Learning Representations.
- Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
Appendix
Appendix A Notations and Preliminaries
We use $\sigma$ to denote the ReLU activation function in neural networks, $\sigma(x) = \max\{x, 0\}$. Unless otherwise indicated, $\|\cdot\|$ denotes the Euclidean norm. For any function $f$, let $\|f\|_\infty = \sup_x |f(x)|$. We use the notations $O(\cdot)$ and $O_d(\cdot)$ to express the order of a function slightly differently, where $O(\cdot)$ omits a universal constant not depending on the dimension $d$ while $O_d(\cdot)$ omits a constant depending on $d$. We use $B_r(x)$ to denote the ball in Euclidean space with center $x$ and radius $r$. Let $T_\# \nu$ be the pushforward distribution of $\nu$ by a function $T$, in the sense that $T_\# \nu(A) = \nu(T^{-1}(A))$ for any measurable set $A$.
The $\epsilon$-covering number of a function class with respect to a given norm is the minimum number of $\epsilon$-radius balls needed to cover the class. We also consider the covering number with respect to the empirical norm, defined through the empirical samples, and the covering number with respect to the uniform norm. It is easy to check that the covering number with respect to the empirical norm is bounded by the covering number with respect to the uniform norm.
Appendix B Restriction on the domain of uniformly bounded Lipschitz function class
So far, most of the related works assume that the target distribution is supported on a compact set, for example Chen et al. (2020) and Liang (2020). To remove the compact support assumption, we need Assumption 1, i.e., the tails of the target and the reference are sub-exponential. We define an increasing ball whose radius grows with the sample size. In this section, we show that proving Theorem 3.2 is equivalent to establishing the same convergence rate with the function class restricted to this domain as the evaluation class.
Under Assumption 1 and by the Markov inequality, we have
(B.1) |
The Dudley distance between the latent joint distribution and the data joint distribution is defined as
(B.2) |
The first term above can be decomposed as
(B.3) |
For any and fixed point such that , due to the Lipschitzness of , the second term above satisfies
where the second inequality is due to the Lipschitzness and boundedness of the evaluation function, and the last inequality is due to Assumption 1, (B.1), and the boundedness condition. In the first term in (B.3), the evaluation function acts only on the increasing ball because of Condition 1 and the indicator function. Similarly, we can apply the same argument to the second term in (B.2). Therefore, the problem is equivalent if we restrict the domain of the evaluation class to the increasing ball. Hence, in order to prove the estimation error rate in Theorem 3.2, we only need to show that the corresponding bound holds for the restricted evaluation function class.
In view of this fact, and to keep the notation simple, we use the same symbol to denote the restricted evaluation function class in the following sections.
Remark 1.
The restriction of the domain is technically necessary for calculating the covering number of the evaluation class; we will see its use when bounding the stochastic errors below.
Appendix C Stochastic errors
C.1 Bounding and
The stochastic errors quantify how close the empirical distributions are to the true latent joint distribution and the data joint distribution, with the Lipschitz class as the evaluation class under the IPM. We apply the results in Lemma C.1 to bound these errors. We introduce two methods, which give two different upper bounds. Both utilize the following lemma, which we prove later. A more detailed description of the refined Dudley inequality can be found in Srebro and Sridharan (2010) and Schreuder (2020).
Lemma C.1 (Refined Dudley Inequality).
For a symmetric and uniformly bounded function class, we have
Remark 2.
The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) suffers from the problem that if the covering number increases too fast as the covering radius goes to zero, then the upper bound can be infinite. The refined Dudley inequality circumvents this problem by integrating only from a positive lower limit, which also indicates how the bound scales with the covering number.
C.1.1 The first method (explicit constant)
The first method provides an explicit dimension-dependent constant at the expense of a higher-order factor in the upper bounds. It utilizes the next lemma (Gottlieb et al., 2013, Lemma 6), which turns the problem of bounding the covering number of a Lipschitz function class into that of bounding the covering number of the domain on which the function class is defined.
Lemma C.2 (Gottlieb et al. (2013)).
Let be the collection of Lipschitz functions mapping the metric space to . Then the covering number of can be estimated in terms of the covering number of with respect to as follows.
Now we apply Lemma C.2 to bound the covering number of the 1-Lipschitz class by bounding the covering number of its domain. Define a new function class as follows.
Recall that the evaluation class is restricted to the increasing ball, so the new class is obviously a Lipschitz function class. A direct application of Lemma C.2 shows that
(C.1) |
By the definition of , the covering numbers satisfy
(C.2) |
Note that is a subset of , and can be covered with finite -balls in that cover the small hypercube with side length . It follows that
(C.3) |
Combining (C.1), (C.2) and (C.3), we obtain an upper bound for the covering number of the 1-Lipschitz class
(C.4) |
With the upper bound for the covering entropy in (C.4), a direct application of Lemma C.1 (see Section E for details) by taking leads to
(C.5)
(C.6) |
C.1.2 The second method (better order of )
We now consider the second method, which leads to a better order in the sample size in the upper bound at the expense of the explicitness of the dimension-dependent constant. The next lemma directly provides an upper bound for the covering number of the Lipschitz class, but with an implicit dimension-dependent constant. It is a straightforward corollary of Van der Vaart and Wellner (1996, Theorem 2.7.1).
Lemma C.3.
Let be a bounded, convex subset of with nonempty interior. There exists a constant depending only on such that
for every , where is the 1-Lipschitz function class defined on , and is the Lebesgue measure of the set .
Applying Lemmas C.1 and C.3 (see Section E for details) by taking yields
(C.7) |
where is some constant depending on . Combining (C.6) and (C.7), we get
(C.8) |
Remark 3.
Here, we have a tradeoff between the logarithmic factor and the explicitness of the dimension-dependent constant. If we want an explicit constant, then an additional logarithmic factor appears in the upper bound. Later we will see that the stochastic errors are the dominating terms among the four error terms, hence the explicitness of the corresponding constant becomes important. Therefore, we present two different methods for bounding the stochastic errors.
C.2 Combination of the four error terms
With all the upper bounds for the four different error terms obtained above, we next consider them simultaneously to obtain an overall convergence rate. First, recall how we bound the approximation errors. With Lemma 4.2, we have
(C.9) |
To control the discriminator approximation error while keeping the architecture of the discriminator class as small as possible, we choose the approximation accuracy so that this term is dominated by the stochastic errors.
By Theorem 4.3, we can choose the architectures of the generator and encoder classes accordingly so that the generator and encoder approximation error is perfectly controlled, i.e., it equals zero.
We note that, because we imposed Condition 1 on both the generator and encoder classes, Theorem 4.3 cannot be applied if some sample points fall outside the truncation range, in which case the generator and encoder approximation error cannot be perfectly controlled. But we can still handle this case by considering the probability of the bad set.
Under Condition 1, on the nice set where all sample points fall within the truncation range, the generator and encoder approximation error vanishes. The probability of the nice set has the following lower bound.
The bad set is the one on which some sample point falls outside the truncation range, and its probability has the following upper bound.
In Assumption 1, the extra factor was introduced to make the tail of the target strictly sub-exponential, which keeps the contribution of the bad set asymptotically negligible, while an exponential or heavier tail would lead to an undesired bound that does not vanish.
Appendix D Proof of Inequality (4.2)
For ease of reference, we restate inequality (4.2) as the following lemma.
Lemma 4.2.
For any symmetric function classes and , denote the approximation error as
then for any probability distributions and ,
Proof of Lemma 4.2.
By the definition of supremum, for any , there exists such that
where the last line is due to the definition of . ∎
It is easy to check that if we replace by , Lemma 4.2 still holds.
Appendix E Bounding and
E.1 Method One
E.2 Method Two
Appendix F Proof of Lemma C.1
For completeness we provide a proof of the refined Dudley inequality in Lemma C.1. We apply the standard symmetrization and chaining techniques in the proof; see, for example, Van der Vaart and Wellner (1996).
Proof.
Let be random samples from which are independent of . Then we have
where the first inequality is due to Jensen's inequality, and the third equality holds because the symmetrized differences have a symmetric distribution.
Let and for any let . For each , let be a -cover of w.r.t. such that . For each and , pick a function such that . Let and for any , we can express by chaining as
Hence for any , we can express the empirical Rademacher complexity as
where and the second-to-last inequality is due to Cauchy–Schwarz. Now the second term is the summation of empirical Rademacher complexity w.r.t. the function classes , . Note that
Massart’s lemma (Mohri et al., 2018, Theorem 3.7) states that if for any finite function class , , then we have
Applying Massart’s lemma to the function classes , , we get that for any ,
where the third inequality is due to . Now for any small we can choose such that . Hence,
Since is arbitrary, we can take w.r.t. to get
The result follows due to the fact that
∎
Appendix G Proof of Theorem 3.1
Proof.
Taking the parameters as specified, Shen et al. (2019, Theorem 4.3) gives the discriminator approximation bound. The ranges of the generator and encoder outputs cover the supports of the reference and the target distributions, respectively, hence Theorem 4.3 makes the generator and encoder approximation error vanish. By Lemma C.2, we have
Now following the same procedure as in Section E by taking , we have
Finally, we consider all four error terms simultaneously.
∎
Appendix H Proof of Theorem 3.3
Following the same proof as Theorem 4.3, we have the following theorem.
Theorem H.1.
Suppose that two distributions, each supported on its own Euclidean space, are both absolutely continuous with respect to the Lebesgue measure, and that we observe i.i.d. samples of equal size from each of them. Then there exist generator and encoder neural network functions such that the generator and the encoder are inverse bijections of each other between the two sets of sample points. Moreover, such neural network functions can be obtained by properly specifying the network widths and depths in terms of the sample size, for some constant.
Since the two distributions are absolutely continuous by assumption, each of their one-dimensional coordinate marginals is also absolutely continuous. Hence the proof reduces to the one-dimensional case.
Appendix I Additional Lemma
Consider the set of all continuous piecewise linear functions that have breakpoints only at a prescribed finite set of points and are constant on the two unbounded tail intervals. The following lemma is a result from Yang et al. (2021).
Lemma I.1.
Suppose the width, depth, and number of breakpoints satisfy the relation stated in Yang et al. (2021). Then any such continuous piecewise linear function can be represented by a ReLU FNN with width and depth no larger than the specified values, respectively.
This result indicates the expressive capacity of ReLU FNNs for piecewise linear functions. A simple calculation shows that, with a suitable choice of parameters, the required width and depth are achievable. This means that when the number of breakpoints is moderate compared with the network structure, such piecewise linear functions are exactly expressible by feedforward ReLU networks.
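The constant-tail piecewise linear functions considered in Lemma I.1 admit a simple explicit one-hidden-layer ReLU representation, which the following numpy sketch verifies numerically. This direct construction (using two ReLU units per segment) is given only for illustration; it is not the width/depth-optimized representation of Yang et al. (2021), and the particular breakpoints and values are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cpwl_as_relu_net(breakpoints, values):
    """One-hidden-layer ReLU representation of the continuous piecewise-linear function
    interpolating (breakpoints, values) and constant outside their range."""
    t, v = np.asarray(breakpoints, float), np.asarray(values, float)
    slopes = np.diff(v) / np.diff(t)                 # slope on each segment [t_k, t_{k+1}]
    def f(x):
        x = np.asarray(x, float)[..., None]
        # slope_k * (relu(x - t_k) - relu(x - t_{k+1})) contributes exactly segment k's increment
        increments = slopes * (relu(x - t[:-1]) - relu(x - t[1:]))
        return v[0] + increments.sum(axis=-1)
    return f

# sanity check against direct interpolation (np.interp is also constant outside the range)
t = np.array([-1.0, 0.0, 0.5, 2.0])
v = np.array([1.0, -1.0, 0.0, 3.0])
f = cpwl_as_relu_net(t, v)
x = np.linspace(-3, 4, 101)
assert np.allclose(f(x), np.interp(x, t, v))
```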