Inferential Wasserstein Generative Adversarial Networks

Yao Chen Department of Statistics, Purdue University Qingyi Gao Department of Statistics, Purdue University Xiao Wang Department of Statistics, Purdue University

Abstract

Generative Adversarial Networks (GANs) have been impactful on many problems and applications but suffer from unstable training. The Wasserstein GAN (WGAN) leverages the Wasserstein distance to avoid the caveats in the minmax two-player training of GANs but has other defects such as mode collapse and lack of metric to detect the convergence. We introduce a novel inferential Wasserstein GAN (iWGAN) model, which is a principled framework to fuse auto-encoders and WGANs. The iWGAN model jointly learns an encoder network and a generator network motivated by the iterative primal dual optimization process. The encoder network maps the observed samples to the latent space and the generator network maps the samples from the latent space to the data space. We establish the generalization error bound of the iWGAN to theoretically justify its performance. We further provide a rigorous probabilistic interpretation of our model under the framework of maximum likelihood estimation. The iWGAN, with a clear stopping criteria, has many advantages over other autoencoder GANs. The empirical experiments show that the iWGAN greatly mitigates the symptom of mode collapse, speeds up the convergence, and is able to provide a measurement of quality check for each individual sample. We illustrate the ability of the iWGAN by obtaining competitive and stable performances for benchmark datasets.

Keywords: generalization error, generative adversarial networks, latent variable models, primal dual optimization, Wasserstein distance.

1 Introduction

One of the goals of generative modeling is to match the model distribution $P_{\theta}(x)$ with parameters $\theta$ to the true data distribution $P_{X}$ for a random variable $X\in{\cal X}$ . For latent variable models, the data point $X$ is generated from a latent variable $Z\in{\cal Z}$ through a conditional distribution $p(x|z)$ . Here ${\cal X}$ denotes the support for $P_{X}$ and ${\cal Z}$ denotes the support for $P_{Z}$ . In this paper, we consider models with $Z\sim{\cal N}(0,I)$ . There has been a surge of research on deep generative networks in recent years and the literature is too vast to summarize here (Kingma and Welling, 2014; Goodfellow et al., 2014; Li et al., 2015; Gao et al., 2020; Qiu and Wang, 2020). These models have provided a powerful framework for modeling complex high dimensional datasets.

We start introducing two main approaches for generative modeling. The first one is called variational auto-encoders (VAEs) (Kingma and Welling, 2014), which use variational inference (Blei et al., 2017) to learn a model by maximizing the lower bound of the likelihood function. Specifically, let the latent variable $Z$ be drawn from a prior $p(z)$ and the data $X$ have a likelihood $p(x|z)$ that is conditioned on $Z=z$ . Unfortunately obtaining the marginal distribution of $X$ requires computing an intractable integral $p(x)=\int p(x|z)p(z)dz$ . Variational inference approximates the posterior $p(z|x)$ by a family of distribution $q_{\eta}(z|x)$ with the parameter $\eta$ . The objective is to maximize the lower bound of the log-likelihood function known as the evidence lower bound (ELBO). The ELBO is given by

\mbox{ELBO}=\mathbb{E}_{z\sim q_{\eta}(z|x)}\big{[}\log p(x|z)\big{]}-\mathbb{E}_{z\sim q_{\eta}(z|x)}\big{[}\log{q_{\eta}(z|x)\over p(z)}\big{]}.

Note that the difference between the log-likelihood $\log p(x)$ and the ELBO is the Kullback-Leibler divergence between $q_{\eta}(z|x)$ and the true posterior $p(z|x)$ . Usually $q_{\eta}(z|x)$ is a normal density where both the conditional mean and the conditional covariance are modeled by deep neural networks (DNNs), so that the first term of the ELBO can be approximated efficiently by Monte Carlo methods and the second term can be calculated explicitly. Therefore, the ELBO allows us to do approximate posterior inference with tractable computation. VAEs have elegant theoretical foundations but the drawback is that they tend to produce blurry images. The second approach is called generative adversarial networks (GANs) (Goodfellow et al., 2014), which learn a model by using a powerful discriminator to distinguish between real data samples and generative data samples. Specifically, we define a generator $G:{\cal Z}\rightarrow{\cal X}$ and a discriminator $D:{\cal X}\rightarrow\mathbb{[}0,1]$ . The generator and discriminator play a two-player minimax game by being alternatively updated, such that the generator tries to produce real-looking images and the discriminator tries to distinguish between generated images and observed images. The GAN objective can be written as $\min_{G}\max_{D}V(G,D)$ , where

V(G,D)=\mathbb{E}_{x\sim P_{X}}\log D(x)+\mathbb{E}_{z\sim P_{Z}}\log(1-D(G(z))).

GANs produce more visually realistic images but suffer from the unstable training and the mode collapse problem. Although there are many variants of generative models trying to take advantages of both VAEs and GANs (Tolstikhin et al., 2018; Rosca et al., 2017), to the best of our knowledge, the model which provides a unifying framework combining the best of VAEs and GANs in a principled way is yet to be discovered.

1.1 Related work

In this section, we provide a brief introduction of different variants of generative models.

Wasserstein GAN. The Wasserstein GAN (WGAN) (Arjovsky et al., 2017) is an extension to the GAN that improves the stability of the training by introducing a new loss function motivated from the Waseerstein distance between two probability measures (Villani, 2008). Let $P_{G(Z)}$ denote the generative model distribution through the generator $G$ and the latent variable $Z\sim P_{Z}$ . Both the vanilla GAN (Goodfellow et al., 2014) and the WGAN can be viewed as minimizing certain divergence between the data distribution $P_{X}$ and the generative distribution $P_{G(Z)}$ . For example, the Jensen-Shannon (JS) divergence is implicitly used in vanilla GANs (Goodfellow et al., 2014), while the $1$ -Wasserstein distance is employed in WGANs. Empirical experiments suggest that the Wasserstein distance is a more sensible measure to differentiate probability measures supported in low-dimensional manifold. In terms of training, it turns out that it is hard or even impossible to compute these standard divergences in probability, especially when $P_{X}$ is unknown and $P_{G(Z)}$ is parameterized by DNNs. Instead, the training of WGANs is to study its dual problem because of the elegant form of Kantorovich-Rubinstein duality (Villani, 2008).

Autoencoder GANs. The main difference between autoencoder GANs and standard GANs is that, besides the generator $G$ , there is an encoder $Q:{\cal X}\rightarrow{\cal Z}$ which maps the data points into the latent space. This deterministic encoder is to approximate the conditional distribution $p(z|x)$ of the latent variable $Z$ given the data point $X$ . Larsen et al. (2016) first introduced the VAE-GAN, which is a hybrid of VAEs and GANs and uses a GAN discriminator to replace a VAE’s decoder to learn the loss function. For both the Adversarially Learned Inference (ALI) (Dumoulin et al., 2017) and the Bidirectional Generative Adversarial Network (BiGAN) (Donahue et al., 2017), the objective is to match two joint distributions, $(X,Q(X))$ and $(G(Z),Z)$ , under the framework of vanilla GANs. When the algorithm achieves equilibrium, these two joint distributions roughly match. It is expected to obtain more meaningful latent codes by $Q(X)$ , and this should improve the quality of the generator as well. For other VAE-GAN variants, please see Rosca et al. (2017); Mescheder et al. (2017); Hu et al. (2018); Ulyanov et al. (2018).

Energy-Based GANs. Energy-based Generative Adversarial Networks (EBGANs) (Zhao et al., 2017) view the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions, and the generator as being trained to produce contrastive samples with minimal energies. Han et al. (2019) presented the joint training of generator model, energy-based model, and inference model, which introduces a new objective function called divergence triangle that makes the processes of sampling, inference, energy evaluation readily available without the need for costly Markov chain Monte Carlo methods.

Duality in GANs. Regarding the optimization perspectives of GANs, (Chen et al., 2018; Zhao et al., 2018) studied duality-based methods for improving algorithm performance for training. Farnia and Tse (2018) developed a convex duality framework to address the case when the discriminator is constrained into a smaller class. Grnarova et al. (2018) developed an evaluation metric to detect the non-convergence behavior of vanilla GANs, which is the duality gap defined as the difference between the primal and the dual objective functions.

1.2 Our Contributions

Although there are many interesting works on autoencoder GANs, it remains unclear what the principles are underlying the fusion of auto-encoders and GANs. For example, do there even exist these two mappings, the encoder $Q$ and the decoder $G$ , for any high-dimensional random variable $X$ , such that $Q(X)$ has the same distribution as $Z$ and $G(Z)$ has the same distribution as $X$ ? Is there any probabilistic interpretation such as the maximum likelihood principle on autoencoder GANs? What is the generalization performance of autoencoder GANs? In this paper, we introduce inferential WGANs (iWGANs), which provide satisfying answers for these questions. We will mainly focus on the 1-Wasserstein distance, instead of the Kullback-Leibler divergence. We borrow the strength from both the primal and the dual problems and demonstrate the synergistic effect between these two optimizations. The encoder component turns out to be a natural consequence from our algorithm. The iWGAN learns both an encoder and a decoder simultaneously. We prove the existence of meaningful encoder and decoder, establish an equivalence between the WGAN and iWGAN, and develop the generalization error bound for the iWGAN. Furthermore, the iWGAN has a natural probabilistic interpretation under the maximum likelihood principle. Our learning algorithm is equivalent to the maximum likelihood estimation motivated from the variational approach when our model is defined as an energy-based model based on an autoencoder. As a byproduct, this interpretation allows us to perform the quality check at the individual sample level. In addition, we demonstrate the natural use of the duality gap as a measure of convergence for the iWGAN, and show its effectiveness for various numerical settings. Our experiments do not experience any mode collapse problem.

The rest of the paper is organized as follows. Section 2 presents the new iWGAN framework, and its extension to general inferential f-GANs. Section 3 establishes the generalization error bound and introduces the algorithm for the iWGAN. The probabilistic interpretation and the connection with the maximum likelihood estimation are introduced in Section 4. Extensive numerical experiments are demonstrated in Section 5 to show the advantages of the iWGAN framework. Proofs of theorems and additional numerical results are provided in the Appendix.

2 The iWGAN Model

The autoencoder generative model consists of two parts: an encoder $Q$ and a generator $G$ . The encoder $Q$ maps a data sample $x\in{\cal X}$ to a latent variable $z\in{\cal Z}$ , and the generator $G$ takes a latent variable $z\in{\cal Z}$ to produce a sample $G(z)$ . In general, the autoencoder generative model should satisfy the following three conditions simultaneously: (a) The generator can generate images which have a similar distribution with observed images, i.e., the distribution of $G(Z)$ is similar to that of $P_{X}$ ; (b) The encoder can produce meaningful encodings in the latent space, i.e., $Q(X)$ has a similar distribution with $Z$ ; (c) The reconstruction errors of this model based on these meaningful encodings are small, i.e., the difference between $X$ and $G(Q(X))$ is small.

We emphasize that the benefit of using an autoencoder is to encourage the model to better represent all the data it is trained with, so that it discourages mode collapse. We first show that, for any distribution residing on a compact smooth Riemannian manifold ¹¹1A smooth manifold ${\cal X}$ is a manifold with a $C^{\infty}$ atlas on ${\cal X}$ . A $C^{\infty}$ atlas is a collection of charts $\{\varphi_{\alpha}:U_{\alpha}\rightarrow\mathbb{R}^{d}\}$ such that $\{U_{\alpha}\}$ covers ${\cal X}$ , and for all $\alpha$ and $\beta$ , the transition map $\varphi_{\alpha}\cdot\varphi_{\beta}^{-1}$ is a $C^{\infty}$ map. Here $U_{\alpha}$ is an open subset of ${\cal X}$ . For any point $p\in{\cal X}$ , let $T_{p}{\cal X}$ be the tangent space of ${\cal X}$ at $p$ . A Riemannian metric assigns to each $p$ a positive definite inner product $g_{p}:T_{p}{\cal X}\times T_{p}{\cal X}\rightarrow\mathbb{R}$ , along with which comes a norm $\|\cdot\|_{p}:T_{p}{\cal X}\rightarrow\mathbb{R}$ defined by $\|v\|_{p}=\sqrt{g_{p}(v,v)}$ . The smooth manifold ${\cal X}$ endowed with this metric $g$ is called a smooth Reimannian manifold., there always exists an encoder $Q^{*}:{\cal X}\rightarrow{\cal Z}$ which guarantees meaningful encodings and exists a generator $G^{*}:{\cal Z}\rightarrow{\cal X}$ which generates samples with the same distribution as data points by using these meaningful codes.

Theorem 1.

Consider a continuous random variable $X\in{\cal X}$ , where ${\cal X}$ is a $d$ -dimensional compact smooth Riemannian manifold. Then, there exist two mappings $Q^{*}:{\cal X}\rightarrow\mathbb{R}^{p}$ and $G^{*}:\mathbb{R}^{p}\rightarrow{\cal X}$ , with $p=\max\{d(d+5)/2,d(d+3)/2+5\}$ , such that $Q^{*}(X)$ follows a multivariate normal distribution with zero mean and identity covariance matrix and $G^{*}\circ Q^{*}$ is an identity mapping, i.e., $X=G^{*}(Q^{*}(X))$ .

Theorem 1 is a natural consequence of the Nash embedding theorem (Nash, 1956; Günther, 1991) and the probability integral transformation (Rosenblatt, 1952). In Theorem 1, we have proved the existence of $Q^{*}$ and $G^{*}$ , however, learning $Q^{*}$ and $G^{*}$ from the data points is still a challenging task. Consider a general $f$ -GAN model (Nowozin et al., 2016). Let $h:\mathbb{R}\rightarrow(-\infty,\infty]$ be a convex function with $h(1)=0$ . The $f$ -GAN defines the $f$ -divergence between the data distribution $P_{X}$ and the generative model distribution $P_{G(Z)}$ for the generator $G$ as:

\text{GAN}_{h}(P_{X},P_{G(Z)})=\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X}\left\{f(X)\right\}-\mathbb{E}_{Z}\left\{h^{*}(f(G(Z))\right\}\Big{]},

where $h^{*}(x)=\sup_{y}\{x\cdot y-h(y)\}$ is the convex conjugate of $h$ and ${\cal F}=\{f|f:\cal X\rightarrow\mathbb{R}\}$ is a class of functions whose output range is the domain of $h^{*}$ . When $f$ is approximated by a DNN, its output range can be achieved by choosing an appropriate activation function specific to the f-divergence used. For example, if $h(x)=x\log(x)-(x+1)\log(x+1)$ , then the corresponding convex conjugate $h^{*}(x)=-\log(1-\exp(x))$ . To satisfy the above condition, we select the output activation function of the DNN $f$ to be $\sigma(v)=-\log(1+\exp(-v))$ such that the $f$ -GAN can recover the original vanilla GAN (Goodfellow et al., 2014). If $h(x)=0$ when $x=1$ and $h(x)=\infty$ otherwise, we have $h^{*}(x)=x$ . With the property that $\cal F$ is 1-Lipschitz function class, the $f$ -GAN turns to be the WGAN.

For ease of presentation, we illustrate our methodology by mainly focusing on the Wasserstein distance and the inferential WGAN (iWGAN) model. The extension to general inferential f-GANs (ifGANs) is straightforward and will be presented in Section 2.3.

2.1 iWGAN

Recall that the $1$ -Wasserstein distance between $P_{X}$ and $P_{G(Z)}$ is defined as

W_{1}(P_{X},P_{G(Z)})=\inf_{\pi\in\Pi(P_{X},P_{Z})}\mathbb{E}_{(X,Z)\sim\pi}\big{\|}X-G(Z)\big{\|},

(1)

where $\|\cdot\|$ represents the $L_{2}$ -norm and $\Pi(P_{X},P_{Z})$ is the set of all joint distributions of $(X,Z)$ with marginal measures $P_{X}$ and $P_{Z}$ , respectively. The main difficulty in (1) is to find the optimal coupling $\pi$ , and this is a constrained optimization because the joint distribution $\pi$ needs to match these two marginal distributions $P_{X}$ and $P_{Z}$ .

Based on the Kantorovich-Rubinstein duality (Villani, 2008), the WGAN studies the $1$ -Wasserstein distance (1) through its dual format

W_{1}(P_{X},P_{G(Z)})=\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X\sim P_{X}}\big{\{}f(X)\big{\}}-\mathbb{E}_{Z\sim P_{Z}}\big{\{}f(G(Z))\big{\}}\Big{]},

(2)

where ${\cal F}$ is the set of all bounded $1$ -Lipschitz functions. This is also a constrained optimization due to the Lipschitz constraint on $f$ such that $f(x)-f(y)\leq\|x-y\|$ for all $x,y\in{\cal X}$ . Weight clipping (Arjovsky et al., 2017) and gradient penalty (Gulrajani et al., 2017) have been used to satisfy the constraint of Lipschitz continuity. Arjovsky et al. (2017) used a clipping parameter $c$ to clamp each weight parameter to a fixed interval $[-c,c]$ after each gradient update is set. However, this method is very sensitive to the choice of clipping parameter $c$ . Instead, Gulrajani et al. (2017) introduced a gradient penalty, $\mathbb{E}_{\hat{x}}\big{\{}(\|\nabla_{\hat{x}}f(\hat{x})\|_{2}-1)^{2}\big{\}}$ , in the loss function to enforce the Lipschitz constraint, where $\hat{x}$ is sampled uniformly along straight lines between pairs of points sampled from $P_{X}$ and $P_{G(Z)}$ . This is motivated by the fact that the optimal critic contains straight lines with gradient norm $1$ connecting coupled points from $P_{X}$ and $P_{G(Z)}$ . The experiment of (Arjovsky et al., 2017) showed that the WGAN can avoid the problem of gradient vanishment. However, the WGAN does not produce meaningful encodings and many experiments still display the problem of mode collapse (Arjovsky et al., 2017; Gulrajani et al., 2017).

On the other hand, the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018), after introducing an encoder $Q:{\cal X}\rightarrow{\cal Z}$ to approximate the conditional distribution of $Z$ given $X$ , minimizes the reconstruction error $\inf_{Q\in{\cal Q}}\mathbb{E}_{X}\big{\|}X-G(Q(X))\big{\|}$ , where ${\cal Q}$ is a set of encoder mappings whose elements satisfies $P_{Q(X)}=P_{Z}$ . The penalty, such as ${\cal D}(P_{Q(X)},P_{Z})$ , is added to the objective to satisfy this constraint, where ${\cal D}$ is an arbitrary divergence between $P_{Q(X)}$ and $P_{Z}$ . The WAE can produce meaningful encodings and has controlled reconstruction error. However, the WAE defines a generative model in an implicit way and does not model the generator through $G(Z)$ with $Z\sim P_{Z}$ directly.

To take the advantages of both the WGAN and WAE, we propose a new autoencoder GAN model, called the iWGAN, which defines the divergence between $P_{X}$ and $P_{G(Z)}$ by

\overline{W}_{1}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X}\|X-G(Q(X))\|+\mathbb{E}_{X}\big{\{}f(G(Q(X)))\big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}}\Big{]}.

(3)

Our goal is to find the tuple $(G,Q,f)$ which minimizes $\overline{W}_{1}(P_{X},P_{G(Z)})$ . The motivation and explanation of this objective function are provided in Section 2.2 in detail. The term $\|X-G(Q(X))\|$ can be treated as the autoencoder reconstruction error as well as a loss to match the distributions between $X$ and $G(Q(X))$ . We note that the $L_{1}$ -norm $\|\cdot\|_{1}$ has been used for the reconstruction term by the $\alpha$ -GAN (Rosca et al., 2017) and CycleGAN (Zhu et al., 2017). Another term $\mathbb{E}_{X\sim P_{X}}\{f(G(Q(X)))\}-\mathbb{E}_{Z\sim P_{Z}}\{f(G(Z))\}$ can be treated as a loss for the generator as well as a loss to match the distribution between $G(Q(X))$ and $G(Z)$ . We emphasize that this term is different with the objective function of the WGAN in (2). The properties of (3) will be discussed in Theorem 2, and the primal and dual explanation of (3) will be presented in Section 2.2.

Furthermore, it is challenging for practitioners to determine when to stop training GANs. Most of the GAN algorithms do not provide any explicit standard for the convergence of the model. However, the measure of convergence for the iWGAN becomes very natural and we use the duality gap as the measure. For a given tuple $(G,Q,f)$ , the duality gap is defined as

\mbox{DualGap}(G,Q,f)=\sup_{\overline{f}\in{\cal F}}L(G,Q,\overline{f})-\inf_{\overline{G}\in{\cal G},\overline{Q}\in{\cal Q}}L(\overline{G},\overline{Q},f),

(4)

where $L(G,Q,f)$ is

L(G,Q,f)=\mathbb{E}_{X}\|X-G(Q(X))\|+\mathbb{E}_{X}\{f(G(Q(X)))\}-\mathbb{E}_{Z}\{f(G(Z))\}.

In practice, the function spaces ${\cal G}$ , ${\cal Q}$ , and ${\cal F}$ are modeled by spaces containing deep neural networks with specific architectures. The architecture hyperparameters usually include number of channels, number of layers, and width of each layer. The architectures for our numerical experiments are provided in the appendix. We assume that these network spaces are large enough to include the true encoder $Q^{*}$ , generator $G^{*}$ , and the optimal discriminator $f$ in (2). This is not a strong assumption due to the universal approximation theorem of DNNs (Hornik, 1991).

Theorem 2.

(a). The iWGAN objective (3) is equivalent to

\displaystyle\overline{W}_{1}(P_{X},P_{G(Z)})=\inf_{Q\in\mathcal{Q}}\Big{\{}W_{1}(P_{X},P_{G(Q(X))})+W_{1}(P_{G(Q(X))},P_{G(Z)})\Big{\}}.

(5)

Therefore, $W_{1}(P_{X},P_{G(Z)})\leq\overline{W}_{1}(P_{X},P_{G(Z)})$ . If there exists a $Q^{*}\in{\cal Q}$ such that $Q^{*}(X)$ has the same distribution with $Z$ , then $W_{1}(P_{X},P_{G(Z)})=\overline{W}_{1}(P_{X},P_{G(Z)})$ .
(b). Let $(\widetilde{Q},\widetilde{G},\widetilde{f})$ be a fixed solution. Then

\mbox{DualGap}(\tilde{G},\tilde{Q},\tilde{f})\geq W_{1}(P_{X},P_{\widetilde{G}(\widetilde{Q}(X))})+W_{1}(P_{\widetilde{G}(\widetilde{Q}(X))},P_{\widetilde{G}(Z)}).

Moreover, if $\widetilde{G}$ outputs the same distribution as $X$ and $\widetilde{Q}$ outputs the same distribution as $Z$ , both the duality gap and $\overline{W}_{1}(P_{X},P_{\widetilde{G}(Z)})$ are zeros and $X=\widetilde{G}(\widetilde{Q}(X))$ for $X\sim P_{X}$ .

According to Theorem 2, the iWGAN objective is in general the upper bound of $W_{1}(P_{X},P_{G(Z)})$ . However, this upper bound is tight. When the space ${\cal Q}$ includes a special encoder $Q^{*}$ such that $Q^{*}(X)$ has the same distribution as $Z$ , the iWGAN objective is exactly the same as $W_{1}(P_{X},P_{G(Z)})$ . Theorem 2 also provides an appealing property from a practical point of view. The values of both the duality gap and $\overline{W}_{1}(P_{X},P_{\widetilde{G}(Z)})$ give us a natural criteria to justify the algorithm convergence.

2.2 A Primal-Dual Explanation

We explain the iWGAN objective function (3) from the view of primal and dual problems. Note that both the primal problem (1) and the dual problem (2) are constrained optimization problems. First, for the primal problem (1), two constraints on $\pi$ are $\int\pi(x,z)dz-p_{X}(x)=0$ for all $x\in{\cal X}$ , and $\int\pi(x,z)dz-p_{Z}(z)=0$ for all $z\in{\cal Z}$ . Recall that the primal variable $f$ for the dual problem (2) is also a dual variable for the primal problem (1). From the Lagrange multiplier perspective, we can write the primal problem (1) as


	$\displaystyle=$	$\displaystyle\inf_{Q\in{\cal Q}}\mathbb{E}_{X}\Big{\{}\\|X-G(Q(X))\\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}},$

where we use the encoder $Q$ to approximate the conditional distribution of $Z$ given $X$ , and the Lagrange multipliers for two constraints are $f(x)$ and $-f(G(z))$ respectively. Second, for the dual problem (2), the $1$ -Lipschitz constraint on $f$ is $f(x)-f(G(z))\leq\|x-G(z)\|$ for all $x\in{\cal X}$ and $z\in{\cal Z}$ . Recall that the primal variable $\pi$ for the primal problem (1) is also a dual variable for the dual problem (2). Similarly, we can write the dual problem (2) as

		$\displaystyle\sup_{f\in{\cal F}}\mathbb{E}_{X}\big{\{}f(X)\big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}}-\int_{{\cal X}\times\cal Z}\pi(x,z)\Big{(}f(x)-f(G(z))-\\|x-G(z)\\|\Big{)}dxdz$
	$\displaystyle=$	$\displaystyle\sup_{f\in{\cal F}}\mathbb{E}_{X}\Big{\{}\\|X-G(Q(X))\\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}},$

where the Lagrange multiplier for the $1$ -Lipschitz constraint is $\pi(x,z)$ . When we solve primal and dual problems iteratively, this turns out to be exactly the same as our iWGAN algorithm.

In addition, the optimal value of the primal problem (1) satisfies

\displaystyle\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\mathbb{E}_{X}\Big{\{}\|X-G(Q(X))\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}},

and the optimal value of the dual problem (2) satisfies

\displaystyle\sup_{f\in{\cal F}}\inf_{Q\in{\cal Q}}\mathbb{E}_{X}\Big{\{}\|X-G(Q(X))\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}}.

The difference between the optimal primal and dual values is exactly the duality gap in (4).

2.3 Extension to f-GANs

This framework can be easily extended to other types of GANs. Assume that $\cal F$ is the 1-Lipschitz function class. We extend the iWGAN framework to the inferential f-GAN (ifGAN) framework. Define the ifGAN objective function as follows:

$\overline{W}_{1,h}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X}\|X-G(Q(X))\|+\mathbb{E}_{X}\big{\{}f(G(Q(X)))\big{\}}-\mathbb{E}_{Z}\big{\{}h^{*}(f(G(Z)))\big{\}}\Big{]}$ .

(6)

Following this definition, we have

\displaystyle\overline{W}_{1,h}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\Big{\{}W_{1}(P_{X},P_{G(Q(X))})+\text{GAN}_{h}(P_{G(Q(X))},P_{G(Z)})\Big{\}}.

We show $\text{GAN}_{h}(P_{X},P_{G(Z)})\leq\overline{W}_{1,h}(P_{X},P_{G(Z)})$ . This is because

		$\displaystyle\text{GAN}_{h}(P_{X},P_{G(Z)})=\sup_{f\in\cal F}\mathbb{E}_{X}\left\{f(X)\right\}-\mathbb{E}_{Z}\left\{h^{*}(f(G(Z))\right\}$
	$\displaystyle\leq$	$\displaystyle\inf_{Q\in\cal Q}\Big{[}\sup_{f\in\cal F}\mathbb{E}_{X}\left\{f(X)\right\}-\mathbb{E}_{X}\left\{f(G(Q(X)))\right\}+\sup_{f\in\cal F}\mathbb{E}_{X}\left\{f(G(Q(X)))\right\}-\mathbb{E}_{Z}\left\{h^{*}(f(G(Z))\right\}\Big{]}$
	$\displaystyle=$	$\displaystyle\overline{W}_{1,h}(P_{X},P_{G(Z)}).$

This indicates that the ifGAN objective (6) is an upper bound of the f-GAN objective.

3 Generalization Error Bound and the Algorithm

Suppose that we observe $n$ samples $\{x_{1},\ldots,x_{n}\}$ . In practice, we minimize the empirical version, denoted by $\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})$ , of $\overline{W}_{1}(P_{X},P_{G(Z)})$ to learn both the encoder and the generator, where,

\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\Big{[}\hat{\mathbb{E}}_{obs}\|x-G(Q(x))\|+\hat{\mathbb{E}}_{obs}\big{\{}f(G(Q(x)))\big{\}}-\hat{\mathbb{E}}_{z}\big{\{}f(G(z))\big{\}}\Big{]}.

(7)

Here $\hat{\mathbb{E}}_{obs}\{\cdot\}$ denotes the empirical average on the observed data $\{x_{i}\}$ and $\hat{\mathbb{E}}_{z}$ denotes the empirical average on a random sample of standard normal random variables. Before we present the details of the algorithm, we first establish the generalization error bound for the iWGAN in this section.

In the context of supervised learning, generalization error is defined as the gap between the empirical risk and the expected risk. The empirical risk is corresponding to the training error, and the expected risk is corresponding to the testing error. Mathematically, the difference between the expected risk and the empirical risk, i.e. the generalization error, is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. However, in the context of GANs, neither the training error nor the test error is well defined. But we can define the generalization error in a similar way. Explicitly, we define the “training error” as $\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})$ in (7), which is minimized based on observed samples. Define the “test error” as $W_{1}(P_{X},P_{G(Z)})$ in (1), which is the true 1-Wasserstein distance between $P_{X}$ and $P_{G(Z)}$ . The generalization error for the iWGAN is defined as the gap between these two “errors”. In other words, for an iWGAN model with the parameter $(G,Q,f)$ , the generalization error is defined as $\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})-W_{1}(P_{X},P_{G(Z)})$ . For discussions of generalization performance of classical GANs, see Arora et al. (2017) and Jiang et al. (2019).

Theorem 3.

Given a generator $G\in\mathcal{G}$ , and $n$ samples $(x_{1},\ldots,x_{n})$ from $\mathcal{X}=\{x:\|x\|\leq B\}$ , with probability at least $1-\delta$ for any $\delta\in(0,1)$ , we have

\displaystyle W_{1}(

\displaystyle P_{X},P_{G(Z)})\leq\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})+2\widehat{\mathfrak{R}}_{n}(\mathcal{F})+3B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)},

(8)

where $\widehat{\mathfrak{R}}_{n}(\mathcal{F})=\mathbb{E}_{\epsilon}\left\{\sup_{f\in\mathcal{F}}n^{-1}\sum_{i=1}^{n}\epsilon_{i}f(x_{i})\right\}$ is the empirical Rademacher complexity of the 1-Lipschitz function set $\mathcal{F}$ , in which $\epsilon_{i}$ is the Rademacher variable.

For a fixed generator $G$ , Theorem 3 holds uniformly for any discriminator $f\in\mathcal{F}$ . It indicates that the 1-Wasserstein distance between $P_{X}$ and $P_{G(Z)}$ can be dominantly upper bounded by the empirical $\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})$ and Rademacher complexity of $\mathcal{F}$ . Since $\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})\leq\widehat{W}_{1}(P_{X},P_{G(Q(X))})+\widehat{W}_{1}(P_{G(Q(X))},P_{G(Z)})$ for any $Q\in{\cal Q}$ , the capacity of $\mathcal{Q}$ determines the value of $\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})$ . In learning theory, Rademacher complexity, named after Hans Rademacher, measures richness of a class of real-valued functions with respect to a probability distribution. There are several existing results on the empirical Rademacher complexity of neural networks. For example, when $\mathcal{F}$ is a set of 1-Lipschitz neural networks, we can apply the conclusion from Bartlett et al. (2017) to $\widehat{\mathfrak{R}}_{n}(\mathcal{F})$ , which produces an upper bound scaling as $\mathcal{O}(B\sqrt{L^{3}/n})$ . Here $L$ denotes the depth of network $f\in\mathcal{F}$ . Similar upper bound with an order of $\mathcal{O}(B\sqrt{Ld^{2}/n})$ can be obtained by utilizing the results from Li et al. (2019), where $d$ is the width of the network.

Algorithm 1 The training algorithm of iWGAN

1:The regularization coefficients

\lambda_{1}

and

\lambda_{2}

, tolerance for duality gap

\epsilon_{1}

, tolerance for loss

\epsilon_{2}

, and running steps

n

2:Initialization

(G^{0},Q^{0},f^{0})

3:while

\mbox{DualGap}(G^{i},Q^{i},f^{i})>\epsilon_{1}

L(G^{i},Q^{i},f^{i})>\epsilon_{2}

4: for

t=1

, …,

n

5: Sample real data

\{x^{i}_{k}\}_{k=1}^{n}\sim{P}_{X}

, latent variable

\{z^{i}_{k}\}_{k=1}^{n}\sim{P}_{Z}

and

\{\epsilon_{k}\}_{k=1}^{n}\sim U[0,1]

6: Set

\hat{x}^{i}_{k}\leftarrow\epsilon_{k}x^{i}_{k}+(1-\epsilon_{k})G^{i}(z^{i}_{k})

i=1,...,n

for the calculation of gradient penalty

7: Calculate:

L^{i}=L(G^{i},Q^{i},f^{i})

J_{1}(f^{i})=(\|\nabla_{\hat{x}^{i}}f^{i}(\hat{x}^{i})\|_{2}-1)^{2}

, and

\displaystyle-\nabla_{f}L^{i}

\displaystyle=\nabla_{f}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}f^{i}(G^{i}(z_{k}^{i}))-f^{i}(G^{i}(Q^{i}(x_{k}^{i})))+\lambda_{1}J_{1}(f^{i})\Big{)}\Big{]}

8: Update

f

by Adam:

f^{i+1}\leftarrow f^{i}+Adam(-\nabla_{f}L^{i})

9: end for

10: for

t=1

, …,

n

11: Sample real data

\{x^{i}_{k}\}_{k=1}^{n}\sim{P}_{X}

, latent variable

\{z^{i}_{k}\}_{k=1}^{n}\sim{P}_{Z}

12: Calculate:

L^{\prime i}=L(G^{i},Q^{i},f^{i+1})

J_{2}(Q^{i})

, and

\displaystyle\nabla_{G,Q}L^{\prime i}

\displaystyle=\nabla_{G,Q}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}\|x_{k}^{i}-G^{i}(Q^{i}(x_{k}^{i}))\|+f^{i+1}(G^{i}(Q^{i}(x_{k}^{i})))-f^{i+1}(G^{i}(z_{k}^{i}))+\lambda_{2}J_{2}(Q^{i})\Big{)}\Big{]}

13: Update

G

Q

by Adam:

(G^{i+1},Q^{i+1})\leftarrow(G^{i},Q^{i})+Adam(\nabla_{G,Q}L^{\prime i})

14: end for

15: DualGap

(G^{i+1},Q^{i+1},f^{i+1})=L(G^{i},Q^{i},f^{i+1})-L(G^{i+1},Q^{i+1},f^{i+1})

16:

i\leftarrow i+1

17:end while

Next, we introduce the details of the algorithm. Our target is to solve the following optimization problem:

$\min\limits_{G\in{\cal G},Q\in{\cal Q}}\max\limits_{f\in{\cal F}}\Big{[}\hat{\mathbb{E}}_{obs}\|x-G(Q(x))\|+\hat{\mathbb{E}}_{obs}\big{\{}f(G(Q(x)))\big{\}}-\hat{\mathbb{E}}_{z}\big{\{}f(G(z))\big{\}}-\lambda_{1}J_{1}(f)+\lambda_{2}J_{2}(Q)\Big{]}$ ,

(9)

where $J_{1}(f)$ and $J_{2}(Q)$ are regularization terms for $f$ and $Q$ respectively. We approximate $G,Q,f$ by three neural networks with pre-specified architectures.

Since $f$ is assumed to be 1-Lipschitz, we adopt the gradient penalty defined as $J_{1}(f)=\mathbb{E}_{\hat{x}}\big{\{}(\mathbb{\|}\nabla_{\hat{x}}f(\hat{x})\|_{2}-1)^{2}\big{\}}$ in (Gulrajani et al., 2017) to enforce the 1-Lipschitz constraint on $f\in{\cal F}$ . Furthermore, since we expect $Q(X)$ follows approximately standard normal, we use the maximum mean discrepancy (MMD) penalty (Gretton et al., 2012), denoted by $J_{2}(Q)=\mbox{MMD}_{k}(P_{Q(X)},P_{Z})$ , to enforce $Q(X)$ to converge to $P_{Z}$ . In particular,

J_{2}(Q)=\dfrac{1}{n(n-1)}\sum_{l\neq j}k(z^{i}_{l},z^{i}_{j})+\dfrac{1}{n(n-1)}\sum_{l\neq j}k(Q(x^{i}_{l}),Q(x^{i}_{j}))-\dfrac{2}{n^{2}}\sum_{l,j}k(z^{i}_{l},Q(x^{i}_{j})),

where $k$ is set to be the Gaussian radial kernel function $k(x,y)=\exp(\frac{-\|x-y\|^{2}}{2})$ .

We have adopted the stochastic gradient descent algorithm called the ADAM (Kingma and Ba, 2015) to estimate the unknown parameters in neural networks. The ADAM is an algorithm for first-order gradient-based optimization of stochastic objection functions, based on adaptive estimates of lower-order moments. Given the current tuple $(G^{i},Q^{i},f^{i})$ at the $i$ th iteration, we sample a batch of observations $\{x^{i}_{k}\}_{k=1}^{n}\sim{P}_{X}$ , latent variable $\{z^{i}_{k}\}_{k=1}^{n}\sim{P}_{Z}$ , and $\{\epsilon_{k}\}_{k=1}^{n}\sim U[0,1]$ . Then we construct $\hat{x}^{i}_{k}\leftarrow\epsilon_{k}x^{i}_{k}+(1-\epsilon_{k})G^{i}(z^{i}_{k})$ , $i=1,\ldots,n$ , for computing the gradient penalty. Let $L^{i}=L(G^{i},Q^{i},f^{i})$ and $J_{1}(f^{i})=(\|\nabla_{\hat{x}^{i}}f^{i}(\hat{x}^{i})\|_{2}-1)^{2}$ . We can evaluate the gradient with respect to the parameters in $f$ , which is denoted by

\displaystyle-\nabla_{f}L^{i}=\nabla_{f}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}f^{i}(G^{i}(z_{k}^{i}))-f^{i}(G^{i}(Q^{i}(x_{k}^{i})))+\lambda_{1}J_{1}(f^{i})\Big{)}\Big{]}.

Then we can update $f^{i}$ by the ADAM using this gradient. Similarly, we can evaluate the gradient with respect to the parameters in $G$ and $Q$ , which is denoted by

\nabla_{G,Q}L^{i}=\nabla_{G,Q}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}\|x_{k}^{i}-G^{i}(Q^{i}(x_{k}^{i}))\|+f^{i+1}(G^{i}(Q^{i}(x_{k}^{i})))-f^{i+1}(G^{i}(z_{k}^{i}))+\lambda_{2}J_{2}(Q^{i})\Big{)}\Big{]}.

Then we can update $(G^{i},Q^{i})$ by the ADAM using this gradient. The stopping criteria are both the DualGap $(G^{i},Q^{i},f^{i})$ in (4) and the objective function $L(G^{i},Q^{i},f^{i})$ are less than pre-specified error tolerances $\epsilon_{1}$ and $\epsilon_{2}$ , respectively. Specifically, based on the definition of the duality gap in (4), we approximate DualGap $(G^{i},Q^{i},f^{i})$ by the difference between $L(G^{i},Q^{i},f^{i+1})$ and The optimization (9) consists of two tuning parameters $\lambda_{1}$ and $\lambda_{2}$ . We pre-specify some values for $\lambda_{1}$ and $\lambda_{2}$ and select the optimal tuning parameters by grid search using cross validation. The details of the algorithm are presented in Algorithm 1.

4 Probabilistic Interpretation and the MLE

The iWGAN has proposed an efficient framework to stably and automatically estimate both the encoder and the generator. In this section, we provide a probabilistic interpretation of the iWGAN under the framework of maximum likelihood estimation.

Maximum likelihood estimator (MLE) is a fundamental statistical framework for learning models from data. However, for complex models, MLE can be computationally prohibitive due to the intractable normalization constant. MCMC has been used to approximate the intractable likelihood function but do not work efficiently in practice, since running MCMC till convergence to obtain a sample can be computationally expensive. For example, to reduce the computational complexity, Hinton (2002) proposed a simple and fast algorithm, called the contrastive divergence (CD). The basic idea of CD is to truncate MCMC at the $k$ -th step, where $k$ is a fixed integer as small as one. The simplicity and computational efficiency of CD makes it widely used in many popular energy-based models. However, the success of CD also raised a lot of questions regarding its convergence property. Both theoretical and empirical results show that CD in general does not converge to a local minimum of the likelihood function (Carreira-Perpinan and Hinton, 2005; Qiu et al., 2020), and diverges even in some simple models (Schulz et al., 2010; Fischer and Igel, 2010). The iWGAN can be treated as an adaptive method for the MLE training, which not only provides computational advantages but also allows us to generate more realistic-looking images. Furthermore, this probabilistic interpretation enables other novel applications such as image quality checking and outlier detection.

Let $X$ denote the image. Define the density of $X$ by an energy-based model based on an autoencoder (Gu and Zhu, 2001; Zhao et al., 2017; Berthelot et al., 2017):

p(x|\theta)=\exp\big{(}-\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}-V(\theta)\big{)},

(10)

where

V(\theta)=\log\int\exp(-\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|})dx,

and $\theta\in\Theta$ is the unknown parameter and $V(\theta)$ is the log normalization constant. The major difficulty for the likelihood inference is due to the intractable function $V(\theta)$ . Suppose that we have the observed data $\{x_{i}:i=1,\ldots,n\}$ . The log-likelihood function of $\theta\in\Theta$ is $\ell(\theta)=n^{-1}\sum_{i=1}^{n}\log~{}p(x_{i}|\theta)$ , whose gradient is

\nabla_{\theta}\ell(\theta)=-\hat{\mathbb{E}}_{obs}\big{\{}\partial_{\theta}\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}\big{\}}+\mathbb{E}_{\theta}\big{\{}\partial_{\theta}\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}\big{\}},

(11)

where $\hat{\mathbb{E}}_{obs}[\cdot]$ denotes the empirical average on the observed data $\{x_{i}\}$ and $\mathbb{E}_{\theta}[\cdot]$ denotes the expectation under model $p(x|\theta)$ . The key computational obstacle lies in the approximations of the model expectation $\mathbb{E}_{\theta}[\cdot]$ .

To address this problem, we can rewrite the log-likelihood function by introducing a variational distribution $q(x)$ . This leads to

$\displaystyle\hat{\mathbb{E}}_{obs}\log p(x\|\theta)$	$\displaystyle=-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|-V(\theta)$
	$\displaystyle=-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|-\log\int q(x){e^{-\\|x-G_{\theta}(Q_{\theta}(x))\\|}\over q(x)}dx$
	$\displaystyle\leq-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|-\int q(x)\log{e^{-\\|x-G_{\theta}(Q_{\theta}(x))\\|}\over q(x)}dx$
	$\displaystyle=-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|+\mathbb{E}_{q(x)}\\|x-G_{\theta}(Q_{\theta}(x))\\|-H(q),$	(12)

where $H(q)=-\int q\log q$ denotes the entropy of $q(x)$ and the inequality is due to Jensens’s inequality. Equation (4) provides an upper bound for the log-likelihood function. We expect to choose $q(x)$ so that (4) is closer to the log-likelihood function, and then maximize (4) as a surrogate for maximizing log-likelihood. We choose $q(x)=p(x|\tilde{\theta})$ and define the surrogate log-likelihood as

{\cal L}(\theta;\tilde{\theta})=-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|+\mathbb{E}_{\tilde{\theta}}\|x-G_{\theta}(Q_{\theta}(x))\|-H(p(x|\tilde{\theta})).

(13)

Theorem 4.

(a). For any $\theta,\tilde{\theta}\in\Theta$ , we have $\ell(\theta)\leq{\cal L}(\theta;\tilde{\theta})$ . In addition, $\ell(\theta)={\cal L}(\theta;\theta)$ .
(b). Consider the following algorithm and the $(t+1)$ th iteration is obtained by $\theta^{(t+1)}=\arg\max_{\theta\in\Theta}{\cal L}(\theta;\theta^{(t)})$ , for $t=0,1,\cdots$ . If $\theta^{(t)}\rightarrow\hat{\theta}$ as $t\rightarrow\infty$ , then $\hat{\theta}$ is the MLE.

Theorem 4 shows that, if we maximize the surrogate log-likelihood function and the algorithm converges, the solution is exactly the same as the MLE. The additional identity $\ell(\theta)={\cal L}(\theta;\theta)$ is the key to our algorithm to obtain the MLE, which is different with the ELBO in VAEs. The ELBO is in general not a tight lower bound of the log-likelihood function.

In terms of training, by Theorem 5.10 of Villani (2008), for any random variables $X$ and $Y$ , there exists an optimal $f^{*}$ such that

\mathbb{P}_{\pi}\big{(}f^{*}(Y)-f^{*}(X)=\|Y-X\|\big{)}=1

(14)

for the optimal coupling $\pi$ which is the joint distribution of $X$ and $Y$ . Therefore, there exists a $f^{*}$ such that

f^{*}(X)-f^{*}(G_{\theta}(Q_{\theta}(X)))=\|X-G_{\theta}(Q_{\theta}(X))\|

(15)

with probability one. Because $f^{*}$ needs to be learned as well, we approximate $f^{*}$ by a neural network $f_{\eta}$ with an unknown parameter $\eta$ . This amounts to using the following max-min objective

\max_{\theta}\min_{\eta}-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|+\mathbb{E}_{\theta^{(t)}}f_{\eta}(x)-\mathbb{E}_{\theta^{(t)}}f_{\eta}(G_{\theta}(Q_{\theta}(x))).

(16)

Note that, for the gradient update, the expectation in (16) is taken under the current estimated $\theta^{(t)}$ . Since we require $G_{\theta}$ to be a good generator and the distributions of $G_{\theta}(z)$ is close the distribution $p(x|\theta^{(t)})$ , we replace $\mathbb{E}_{\theta^{(t)}}f_{\eta}(x)$ by $\mathbb{E}_{z}f_{\eta}(G_{\theta}(z)))$ . Since an additional regularization is added to enforce $Q_{\theta}(X)$ to follow a normal distribution, we use the expectation under the data distribution to replace the second expectation of (16). This yields a gradient update for $\theta$ of form $\theta\leftarrow\theta+\epsilon\hat{\nabla}_{\theta}\ell(\theta)$ , where

\hat{\nabla}_{\theta}\ell(\theta)=-\hat{\mathbb{E}}_{obs}\big{\{}\partial_{\theta}\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}\big{\}}+\big{\{}{\mathbb{E}}_{z}\partial_{\theta}f_{\eta}(G_{\theta}(z))-\mathbb{E}_{x}\partial_{\theta}f_{\eta}(G_{\theta}(Q_{\theta}(x))\big{\}}.

(17)

A gradient update for $\eta$ is given by

\eta\leftarrow\eta+\epsilon~{}\Big{\{}{\mathbb{E}}_{z}\partial_{\eta}f_{\eta}(G_{\theta}(z))-\mathbb{E}_{x}\partial_{\eta}f_{\eta}(G_{\theta}(Q_{\theta}(x))\Big{\}}.

(18)

The above iterative updating process is exactly the same as in Algorithm 1. Therefore, the training of the iWGAN is to seek the MLE. This probabilistic interpretation provides a novel alternative method to tackle problems with the intractable normalization constant in latent variable models. The MLE gradient update of $p(x|\theta)$ decreases the energy of the training data and increases the dual objective. Compare with original GANs or WGANs, our method gives much faster convergence and simultaneously provides a higher quality generated images.

The probabilistic modeling opens a door for many interesting applications. Next, we present a completely new approach for determining a highest density region (HDR) estimate for the distribution of $X$ . What makes HDR distinct from other statistical methods is that it finds the smallest region, denoted by $U(\alpha)$ , in the high dimensional space with a given probability coverage $1-\alpha$ , i.e., $\mathbb{P}(X\in U(\alpha))=1-\alpha$ . We can use $U(\alpha)$ to assess each individual sample quality. Note that commonly used inception scores (IS) and Fréchet inception distances (FID) are to measure the whole sample quality, not at the individual sample level. More introductions of IS and FID are given in Appendix G. Let $\hat{\theta}$ be the MLE. The density ratio at $x_{1}$ and $x_{2}$ is

\frac{p(x_{1}|\hat{\theta})}{p(x_{2}|\hat{\theta})}=\exp\big{\{}-(\|x_{1}-G_{\hat{\theta}}(Q_{\hat{\theta}}(x_{1}))\|-\|x_{2}-G_{\hat{\theta}}(Q_{\hat{\theta}}(x_{2}))\|)\big{\}}.

The smaller the reconstruction error is, the larger the density value is. We can define the HDR for $x$ through the HDR for the reconstruction error $e_{x}:=\|x-G_{\hat{\theta}}(Q_{\hat{\theta}}(x))\|$ , which is simple because it is a one-dimensional problem. Let $\tilde{U}(\alpha)$ be the HDR for $e_{x}$ . Then, $U(\alpha)=\{x:e_{x}\in\tilde{U}(\alpha)\}$ . Here $Q_{\hat{\theta}}(U(\alpha))$ defines the corresponding region in the latent space, which can be used to generate better quality samples.

5 Experimental Results

The goal of our numerical experiments is to demonstrate that the iWGAN can achieve the following three objectives simultaneously: high-quality generative samples, meaningful latent codes, and small reconstruction errors. We also compare the iWGAN with other well-known GAN models such as the Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018), the Adversarial Learned Inference (ALI) (Dumoulin et al., 2017), and the CycleGAN (Zhu et al., 2017) to illustrate a competitive and stable performance for benchmark datasets.

5.1 Mixture of Gaussians

We first train our iWGAN model on three datasets from the mixture of Gaussians with an increasing difficulty shown in the Figure 1: a). RING: a mixture of 8 Gaussians with means $\{(2\cdot\cos{\dfrac{2\pi i}{8}},2\cdot\sin{\dfrac{2\pi i}{8}})|i=0,\dots 7\}$ and standard deviation $0.02$ , b). SPIRAL: a mixture of 20 Gaussians with means $\{(0.1+0.1\cdot\dfrac{2\pi}{20}\cdot\cos{\dfrac{2\pi i}{20}},0.1+0.1\cdot\dfrac{2\pi}{20}\cdot\sin{\dfrac{2\pi i}{20}})|i=0,\dots,19\}$ and standard deviation $0.02$ and c). GRID: a mixture of 25 Gaussians with means $\{(2\cdot i,2\cdot j)|i=-2,-1,\dots,2,j=-2,-1,\dots,2\}$ and standard deviation $0.02$ . As the true data distributions are known, this setting allows for tracking of convergence and mode dropping.

Duality gap and convergence. We illustrate that as the duality gap converges to $0$ , our model converges to the generated samples from the true distribution. We keep track of the generated samples using $G(z)$ and record the duality gap at each iteration to check the corresponding generated samples. We compare our method with the WGAN-GP and CycleGAN in Figure 2. All methods adopt the same structure, learning rate, number of critical steps, and other hyper-parameters.

Figure 2 shows that the iWGAN converges quickly in terms of both the duality gap and the true distributions learning. Duality gap has also been a good indicator of whether the model has generated the desired distribution. When comparing with the WGAN model, the iWGAN surpasses the performance of the WGAN-GP at very early stage and avoids the appearance of mode collapse. We have further tested the CycleGAN on these distributions. The CycleGAN objective function is the sum of two parts. The first part includes two vanilla GAN objectives (Goodfellow et al., 2014) to differentiate between $X$ and $G(Z)$ , and $Z$ and $Q(X)$ . The second part is the cycle consistency loss given by $\mathbb{E}_{Z}\big{\|}Z-Q(G(Z))\|_{1}+\mathbb{E}_{X}\big{\|}X-G(Q(X))\|_{1}$ , where $\|\cdot\|_{1}$ is the $L_{1}$ -norm of a vector. Unfortunately, Figure 2 shows that the CycleGAN fails on all three distributions and experienced the mode collapse problem.

Latent space. We choose the latent distribution to be a 5-dimensional standard multivariate normal distribution $Z\sim N(0,I_{5})$ . During the training each batch size is chosen to be 512. After training, the distribution of $Q(X)$ is expected to be close to the distribution of $Z$ . To demonstrate the latent distribution visually, we plot the $i$ th compoment of $Q(X)$ , $Q(X)_{i}$ , against the $j$ th compoment of $Q(X)$ , $Q(X)_{j}$ , for all $i\neq j$ in Figure 3. We can tell that the joint distribution of any two dimensions of $Q(X)$ is close to a bivariate normal distribution.

Mode collapse. We investigate the mode collapse problem for the iWGAN. If we draw two random samples in the latent space $z_{1},z_{2}\sim N({0},{I_{5}})$ , the interpolation, $G(\lambda z_{1}+(1-\lambda)z_{2})$ , $0\leq\lambda\leq 1$ , should fall around the mode to represent a reasonable sample. In Figure 4, we select $\lambda\in\{0,0.05,0.10,\dots,0.95,1.0\}$ , and do interpolations on two random samples. We repeat this procedure several times on 3 datasets as demonstrated in Figure 4. No matter where the interpolations start and end, the interpolations would fall around the modes other than the locations where true distribution has a low density. There may still be some samples that appears in the middle of two modes. This may be because the generator $G$ is not able to approximate a step function well.

Individual sample quality check. From the probability interpretation of the iWGAN, we naturally adopt the reconstruction error $\|X-G(Q(X))\|$ , or the quality score

\mbox{Quality Score}=\exp{(-\|X-G(Q(X))\|)}

as the metric of the quality of any individual sample. The larger the quality score is, the better quality the sample has. Figure 5 shows their quality scores for different samples. The quality scores of samples near the modes of the true distribution are close to 1, and become smaller as the sample draw is away from the modes. This indicates that the iWGAN converges and learns the distribution well, and the quality score is a reliable metric for the individual sample quality.

5.2 CelebA

We experimentally demonstrate our model’s ability on two well-known benchmark datasets, MNIST and CelebA. We present the performance of the iWGAN on CelebA in this section and the performance on MNIST in the Appendix. CelebA (CelebFaces Attributes Dataset) is a large-scale face attributes dataset with $202,599$ $64\times 64$ colored celebrity face images, which cover large pose variations and diverse people. This dataset is ideal for training models to generate synthetic images. The MNIST database (Modified National Institute of Standards and Technology database) is another large database of handwritten digits $0\sim 9$ that is commonly used for training various image processing systems. The MNIST database contains $70,000$ $28\times 28$ grey images. CelebA is a more complex dataset than MNIST. The CelebA dataset is available at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html and the MNIST dataset is available at http://yann.lecun.com/exdb/mnist/.

The first result by the iWGAN on CelebA is shown in Figure 6(c). The dimension of the latent space is chosen to be $64$ . For each panel, Figure 6(c) respectively shows the generated samples from $G(Z)$ , the reconstructed samples from $G(Q(X))$ , and the latent space interpolation between two randomly chosen images. In particular, we perform latent space interpolations between CelebA validation set examples. We sample pairs of validation set examples $x_{1}$ and $x_{2}$ and project them into $z_{1}$ and $z_{2}$ by the encoder $Q$ . We then linearly interpolate between $z_{1}$ and $z_{2}$ and pass the intermediate points through the decoder to plot the input-space interpolations. In addition, Figure 7 shows the first 8 dimensions of the latent space calculated by $Q(x)$ on CelebA. Figures 6(c) and 7 visually demonstrate that the iWGAN can simultaneously generate high quality samples, produce small reconstruction errors, and have meaningful latent codes. Figure 8 also displays images with high and low quality scores selected from CelebA. The images with low quality scores are quite different with other images in the dataset and these images usually contain lighter background with masks or glasses.

We compare the iWGAN, both visually and numerically, with the WGAN-GP, WAE, ALI, and CycleGAN. Figures 9(a)—9(e) display the random generated samples from the iWGAN, WGAN-GP, WAE, ALI, and CycleGAN, respectively. The generated faces by the iWGAN demonstrate higher qualities than other four methods. The top panel of Figure 10 shows the comparison between real images and reconstructed images among four methods, the iWGAN, WAE, ALI, and CycleGAN. Note that the WGAN-GP cannot provide reconstructed images since it does not produce the latent codes. The bottom panel of Figure 10 shows the interpolated images by the iWGAN, WAE, ALI, and CycleGAN.

We numerically compare these five methods, the iWGAN, ALI, WAE, CycleGAN, and WGAN-GP. Four performance measures are chosen, which are inception scores (IS), Fréchet inception distances (FID), reconstruction errors (RE), and maximum mean discrepancy (MMD) between encodings and standard normal random variables. The details of these comparison metrics are given in Appendix G. Proposed by Salimans et al. (2016), the IS involves using a pre-trained Inception v3 model to predict the class probabilities for each generated image. Higher scores are better, corresponding to a larger KL-divergence between the two distributions. The FID is proposed by Heusel et al. (2017) to improve the IS by actually comparing the statistics of generated samples to real samples. For the FID, the lower the better. However, as discussed in Barratt and Sharma (2018), IS is not a reliable metric for the wellness of generated samples. This is also consistent with our experiments. Although the WAE delivers the best inception scores among five methods, the WAE also has the worst FID scores. The generated samples (Figure 9(c)) show that the WAE is not the best generative model compared with other four methods. Furthermore, The reconstruction error (RE) is used to measure if the method has generated meaningful latent encodings. Smaller reconstruction errors indicate a more meaningful latent space which can be decoded into the original samples. The MMD is used to measure the difference between distribution of latent encodings and standard normal random variables. Smaller MMD indicates that the distribution of encodings is close to the standard normal distribution.

From Table 1, in terms of generative models, the iWGAN and ALI are better models, where the WGAN-GP and CycleGAN come after, but the WAE is suffering from generating clear pictures. In terms of RE and MMD, the iWGAN and WAE are better choices, where the ALI and CycleGAN cannot always reconstruct the sample to itself (see Figure 10(a)). In general, Table 1 shows that the iWGAN has successfully produced both meaningful encodings and reliable generator simultaneously.

Table 1: Comparison of the iWGAN, ALI, WAE, CyclaGAN, WGAN-GP

Methods	IS	FID	RE	MMD
True	1.96(0.019)	18.63	–	–
iWGAN	1.51(0.017)	51.20	13.55(2.41)	$\mathbf{6\times 10^{-3}}$
ALI	1.50(0.014)	51.12	34.49(8.23)	$0.39$
WAE	1.71(0.029)	77.53	9.88(1.42)	$\mathbf{4\times 10^{-3}}$
CycleGAN	1.41(0.011)	61.78	31.90(0.84)	$0.30$
WGAN-GP	1.54(0.016)	61.39	–	–

6 Conclusion

We have developed a novel iWGAN model, which fuses auto-encoders and GANs in a principled way. We have established the generalization error bound for the iWGAN. We have provided a solid probabilistic interpretation on the iWGAN using the maximum likelihood principle. Our training algorithm with an iterative primal and dual optimization has demonstrated an efficient and stable learning. We have proposed a stopping criteria for our algorithm and a metric for individual sample quality checking. The empirical results on both synthetic and benchmark datasets are state-of-the-art.

We now mention several future directions for research on the iWGAN. First, in this paper, we assume the conditional distribution of $Z$ given $X$ is modeled by a point mass $q(z|x)=\delta(x-Q(x))$ . It is interesting to extend this to a more flexible inference model. In addition, it is desirable to make the latent distribution more flexible, and consider a more general latent distribution such as the energy-based model (Gao et al., 2020). Second, we have ignored approximation errors in our analysis by assuming the unknown mappings belong to the neural network spaces. It is interesting to incorporate the approximation errors to analyze the behavior of the iWGAN divergence. Third, one might be interested in applying the iWGAN into image-to-image translation, as the extension should be straightforward. A fourth direction is to develop a formal hypothesis testing procedure to test whether the samples generated from the iWAGN is the same as the data distribution. We are also working on incorporating the iWGAN into the recent GAN modules such as the BigGAN (Brock et al., 2019), which can produce high-resolution and high-fidelity images. As its name suggests, the BigGAN focuses on scaling up the GAN models including more model parameters, larger batch sizes, and architectural changes. Instead, the iWGAN is able to stabilize its training, and it is a promising idea to fuse these two frameworks together.

Appendix

A. Proof of Theorem 1

According to the Nash embedding theorem (Nash, 1956; Günther, 1991), every $d$ -dimensional smooth Riemannian manifold ${\cal X}$ possesses a smooth isometric embedding into $\mathbb{R}^{p}$ with $p=\max\{d(d+5)/2,d(d+3)/2+5\}$ . Therefore, there exists an injective mapping $u:{\cal X}\rightarrow\mathbb{R}^{p}$ which preserves the metric in the sense that the manifold metric on ${\cal X}$ is equal to the pullback of the usual Euclidean metric on $\mathbb{R}^{p}$ by $u$ . The mapping $u$ is injective so that we can define the inverse mapping $u^{-1}:u(\cal X)\rightarrow{\cal X}$ .

Let $\tilde{X}=u(X)\in{\mathbb{R}}^{p}$ , and write $\tilde{X}=(\tilde{X}_{1},\ldots,\tilde{X}_{p})$ . Let $F_{i}(x)=\mathbb{P}(\tilde{X}_{i}\leq x)$ , $i=1,\ldots,p$ , be the marginal cdfs. By applying the probability integral transformation to each component, the random vector

\big{(}U_{1},U_{2},\ldots,U_{p}\big{)}:=\big{(}F_{1}(\tilde{X}_{1}),F_{2}(\tilde{X}_{2}),\ldots,F_{p}(\tilde{X}_{p})\big{)}

has uniformly distributed marginals. Let $C:[0,1]^{p}\rightarrow[0,1]$ be the copula of $\tilde{X}$ , which is defined as the joint cdf of $(U_{1},\ldots,U_{p})$ :

C(u_{1},u_{2},\ldots,u_{p})=\mathbb{P}\big{(}U_{1}\leq u_{1},U_{2}\leq u_{2},\ldots,U_{p}\leq u_{p}\big{)}.

The copula $C$ contains all information on the dependence structure among the components of $\tilde{X}$ , while the marginal cumulative distribution functions $F_{i}$ contain all information on the marginal distributions. Therefore, the joint cdf of $\tilde{X}$ is

H(\tilde{x}_{1},\tilde{x}_{2},\ldots,\tilde{x}_{p})=C\big{(}F_{1}(\tilde{x}_{1}),F_{2}(\tilde{x}_{2}),\ldots,F_{p}(\tilde{x}_{p})\big{)}.

Denote the conditional distribution of $U_{k}$ , given $U_{1},\ldots,U_{k-1}$ , by

\displaystyle C_{k}(u_{k}|u_{1},\ldots,u_{k-1})

\displaystyle=\mathbb{P}\big{(}U_{k}\leq u_{k}|U_{1}=u_{1},\ldots,U_{k-1}=u_{k-1}\big{)}

for $k=2,\ldots,p$ .

We will construct $Q^{*}$ as follows. First, we obtain $\tilde{X}\in\mathbb{R}^{p}$ by $\tilde{X}=u(X)$ . Second, we transform $\tilde{X}$ into a random vector with uniformly distributed marginals $(U_{1},\ldots,U_{p})$ by the marginal cdf $F_{i}$ . Then, define $\tilde{U}_{1}=U_{1}$ and

\tilde{U}_{k}=C_{k}\big{(}U_{k}|U_{1},\ldots,U_{k-1}\big{)},~{}~{}~{}k=2,\ldots,p.

One can readily show that $\tilde{U}_{1},\ldots,\tilde{U}_{p}$ are independent uniform random variables. This is because

	$\displaystyle\mathbb{P}(\tilde{U}_{k}\leq\tilde{u}_{k}:k=1,\ldots,p)$	$\displaystyle=\int_{C_{1}\big{(}v_{1}\big{)}\leq\tilde{u}_{1}}\cdots\int_{C_{p}\big{(}v_{p}\|v_{1},\ldots,v_{p-1}\big{)}\leq\tilde{u}_{p}}dC_{p}\big{(}v_{p}\|v_{1},\ldots,v_{p-1}\big{)}\cdots dC_{1}\big{(}v_{1}\big{)}$
		$\displaystyle=\int_{0}^{\tilde{u}_{1}}\cdots\int_{0}^{\tilde{u}_{p}}dz_{p}\cdots dz_{1}=\prod_{k=1}^{p}\tilde{u}_{k}.$

In fact, this transformation is the well-known Rosenblatt transform (Rosenblatt, 1952). Finally, let $Z_{i}=\Phi^{-1}(\tilde{U}_{i})$ for $i=1,\ldots,p$ , where $\Phi^{-1}$ is the inverse cdf of a standard normal random variable. This completes the transformation $Q^{*}$ from $X$ to $Z=(Z_{1},\ldots,Z_{p})$ .

The above process can be inverted to obtain $G^{*}$ . First, we transform $Z$ into independent uniform random variables by $\tilde{U}_{i}=\Phi(Z_{i})$ for $i=1,\ldots,p$ . Next, let $U_{1}=\tilde{U}_{1}$ . Define

U_{k}=C_{k}^{-1}(\tilde{U}_{k}|\tilde{U}_{1},\ldots,\tilde{U}_{k-1}),~{}~{}~{}i=2,\ldots,p,

where $C_{k}^{-1}(\cdot|u_{1},\ldots,u_{k})$ is the inverse of $C_{k}$ and can be obtained by numerical root finding. Finally, let $\tilde{X}_{i}=F_{i}^{-1}(U_{i})$ for $i=1,\ldots,p$ and $X=u^{-1}(\tilde{X})$ , where $u^{-1}:u(\cal X)\rightarrow{\cal X}$ is the inverse mapping of $u$ . This completes the transformation $G^{*}$ from $Z$ to $X$ .

B. Proof of Theorem 2

(a). By the iWGAN objective (3), (5) holds. Since $W_{1}$ is a distance between two probability measures, $W_{1}(P_{X},P_{G(Z)})\leq\overline{W}_{1}(P_{X},P_{G(Z)})$ . If there exists a $Q^{*}\in{\cal Q}$ such that $Q^{*}(X)$ has the same distribution as $P_{Z}$ , we have

\displaystyle\overline{W}_{1}(P_{X},P_{G(Z)})\leq W_{1}(P_{X},P_{G(Q^{*}(X))})+W_{1}(P_{G(Q^{*}(X))},P_{G(Z)})=W_{1}(P_{X},P_{G(Z)}).

Hence, $W_{1}(P_{X},P_{G(Z)})=\overline{W}_{1}(P_{X},P_{G(Z)})$ .
(b). We observe that

\sup_{f}L(\widetilde{G},\widetilde{Q},f)=W_{1}(P_{X},P_{\widetilde{G}(\widetilde{Q}(X))})+W_{1}(P_{\widetilde{G}(\widetilde{Q}(X))},P_{\widetilde{G}(Z)}).

By Theorem 1, we have $\inf_{G,Q}L(G,Q,\widetilde{f})\leq L(G^{*},Q^{*},\widetilde{f})=0$ when the encoder and the decoder have enough capacities. Therefore, the duality gap is larger than $W_{1}(P_{X},P_{\widetilde{G}(\widetilde{Q}(X))})+W_{1}(P_{\widetilde{G}(\widetilde{Q}(X))},P_{\widetilde{G}(Z)})$ . It is easy to see that, if $\widetilde{G}$ outputs the same distribution as $X$ and $\widetilde{Q}$ outputs the same distribution as $Z$ , both the duality gap and $\overline{W}_{1}(P_{X},P_{G(Z)})$ are zeros and $X=\widetilde{G}(\widetilde{Q}(X))$ for $X\sim P_{X}$ .

C. Proof of Theorem 3

We first consider the difference between population $W_{1}(P_{X},P_{G(Z)})$ and empirical $\widehat{W}_{1}(P_{X},P_{G(Z)})$ given $n$ samples $S=\{x_{1},\ldots,x_{n}\}$ . Let $f_{1}$ and $f_{2}$ be their witness function respectively. Using the dual form of 1-Wassertein distance, we have

		$\displaystyle W_{1}(P_{X},P_{G(Z)})-\widehat{W}_{1}(P_{X},P_{G(Z)})$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{X\sim P_{X}}\{f_{1}(X)\}-\mathbb{E}_{Z\sim P_{Z}}\{f_{1}(G(Z))\}-\frac{1}{n}\sum_{i=1}^{n}f_{2}(x_{i})+\mathbb{E}_{Z\sim P_{Z}}\{f_{2}(G(Z))\}$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{X\sim P_{X}}\{f_{1}(X)\}-\mathbb{E}_{Z\sim P_{Z}}\{f_{1}(G(Z))\}-\frac{1}{n}\sum_{i=1}^{n}f_{1}(x_{i})+\mathbb{E}_{Z\sim P_{Z}}\{f_{1}(G(Z))\}$
	$\displaystyle\leq$	$\displaystyle\sup_{f}\mathbb{E}_{X\sim P_{X}}\{f(X)\}-\frac{1}{n}\sum_{i=1}^{n}f(x_{i})\triangleq\Phi(S).$

Given another sample set $S^{\prime}=\{x_{1},\ldots,x_{i}^{\prime},\ldots,x_{n}\}$ , it is clear that

\displaystyle\Phi(S)-\Phi(S^{\prime})\leq\sup_{f}\frac{|f(x_{i})-f(x_{i}^{\prime})|}{n}\leq\frac{\|x_{i}-x_{i}^{\prime}\|}{n}\leq\frac{2B}{n},

where the second inequality is obtained since $f$ is 1-Lipschitz continuous function. Applying McDiamond’s Inequality, with probability at least $1-\delta/2$ for any $\delta\in(0,1)$ , we have

\Phi(S)\leq\mathbb{E}\{\Phi(S)\}+B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.

(19)

By the standard technique of symmetrization in Mohri et al. (2018), we have

\mathbb{E}\{\Phi(S)\}=\mathbb{E}\left\{\sup_{f}\mathbb{E}_{X\sim P_{X}}\{f(X)\}-\frac{1}{n}\sum_{i=1}^{n}f(x_{i})\right\}\leq 2\mathfrak{R}_{n}(\mathcal{F}).

(20)

It has been proved in Mohri et al. (2018) that with probability at least $1-\delta/2$ for any $\delta\in(0,1)$ ,

\mathfrak{R}_{n}(\mathcal{F})\leq\widehat{\mathfrak{R}}_{n}(\mathcal{F})+B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.

(21)

Combining Equation (19), Equation (20) and Equation (21), we have

W_{1}(P_{X},P_{G(Z)})\leq\widehat{W}_{1}(P_{X},P_{G(Z)})+2\widehat{\mathfrak{R}}_{n}(\mathcal{F})+3B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.

By Theorem 2, we have $\widehat{W}_{1}(P_{X},P_{G(Z)})\leq\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})$ . Thus,

W_{1}(P_{X},P_{G(Z)})\leq\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})+2\widehat{\mathfrak{R}}_{n}(\mathcal{F})+3B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.

D. Proof of Theorem 4

(a) It is obvious that $\ell(\theta)\leq{\cal L}(\theta;\tilde{\theta})$ from Equation (4). When $q(x)=p(x|\theta)$ ,

\int q(x)\log{e^{-\|x-G_{\theta}(Q_{\theta}(x))\|}\over q(x)}dx=\int p(x|\theta)\log{e^{-\|x-G_{\theta}(Q_{\theta}(x))\|}\over e^{-\|x-G_{\theta}(Q_{\theta}(x))\|-V(\theta)}}dx=V(\theta).

Therefore, $\ell(\theta)={\cal L}(\theta;\theta)$ .

(b) Since $\theta^{(t)}\rightarrow\hat{\theta}$ as $t\rightarrow\infty$ , we have ${\cal L}(\theta^{(t+1)};\theta^{(t)})\rightarrow{\cal L}(\hat{\theta};\hat{\theta})=\ell(\hat{\theta})$ . This implies $\hat{\theta}$ is the MLE.

E. Experimental Results on MNIST

E.1. Latent Space

Figure 11 shows the latent space of MNIST, i.e. $Q(X)_{i}$ against $Q(X)_{j}$ for all $i\neq j$ .

E.2. Generated Samples

Figure 12 shows the comparison of random generated samples between the WGAN-GP and iWGAN. Figure 13 shows examples of interpolations of two random generated samples.

E.3. Reconstruction

Figure 14(b) shows, based on the samples from validation dataset, the distribution of reconstruction error. Figure 14(a) shows examples of reconstructed samples. Figure 15 shows the best and worst samples based on quality scores from the validation dataset.

F. Architectures

The codes and examples used for this paper is available at: https://drive.google.com/drive/folders/1-_vIrbOYwf2BH1lOrVEcEPJUxkyV5CiB?usp=sharing. In this section, we present the architectures used for each experiment.

Mixture of Guassians

For Mixture Guassians, the latent space $Z\in\mathbb{R}^{5}$ , for each batch, the sample size is 256.

Encoder architecture:

	$\displaystyle x\in\mathbb{R}^{2}$	$\displaystyle\rightarrow FC_{1024}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{512}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{256}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{128}\rightarrow RELU\rightarrow FC_{5}$

Generator architecture:

	$\displaystyle z\in\mathbb{R}^{5}$	$\displaystyle\rightarrow FC_{512}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{512}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{512}\rightarrow RELU\rightarrow FC_{2}$

Discriminator architecture:

	$\displaystyle x\in\mathbb{R}^{2}$	$\displaystyle\rightarrow FC_{512}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{512}\rightarrow RELU$
		$\displaystyle\rightarrow FC_{512}\rightarrow RELU\rightarrow FC_{1}$

MNIST

For MNIST, the latent space $Z\in\mathbb{R}^{8}$ and batch size is 250.

Encoder architecture:

	$\displaystyle x\in\mathbb{R}^{28\times 28}$	$\displaystyle\rightarrow Conv_{128}\rightarrow RELU$
		$\displaystyle\rightarrow Conv_{256}\rightarrow RELU$
		$\displaystyle\rightarrow Conv_{512}\rightarrow RELU\rightarrow FC_{8}$

Generator architecture:

	$\displaystyle z\in\mathbb{R}^{8}$	$\displaystyle\rightarrow FC_{4\times 4\times 512}\rightarrow RELU$
		$\displaystyle\rightarrow ConvTrans_{256}\rightarrow RELU$
		$\displaystyle\rightarrow ConvTrans_{128}\rightarrow RELU\rightarrow ConvTrans_{1}$

Discriminator architecture:

	$\displaystyle x\in\mathbb{R}^{28\times 28}$	$\displaystyle\rightarrow Conv_{128}\rightarrow RELU$
		$\displaystyle\rightarrow Conv_{256}\rightarrow RELU$
		$\displaystyle\rightarrow Conv_{512}\rightarrow RELU\rightarrow FC_{1}$

CelebA

For CelebA, the latent space $Z\in\mathbb{R}^{64}$ and batch size is 64.

Encoder architecture:

	$\displaystyle x\in\mathbb{R}^{64\times 64\times 3}$	$\displaystyle\rightarrow Conv_{128}\rightarrow LeakyRELU$
		$\displaystyle\rightarrow Conv_{256}\rightarrow InstanceNorm\rightarrow LeakyRELU$
		$\displaystyle\rightarrow Conv_{512}\rightarrow InstanceNorm\rightarrow LeakyRELU\rightarrow Conv_{1}$

Generator architecture:

	$\displaystyle z\in\mathbb{R}^{64}$	$\displaystyle\rightarrow FC_{4\times 4\times 1024}$
		$\displaystyle\rightarrow ConvTrans_{512}\rightarrow BN\rightarrow RELU$
		$\displaystyle\rightarrow ConvTrans_{256}\rightarrow BN\rightarrow RELU$
		$\displaystyle\rightarrow ConvTrans_{128}\rightarrow BN\rightarrow RELU\rightarrow ConvTrans_{3}$

Discriminator architecture:

	$\displaystyle x\in\mathbb{R}^{64\times 64\times 3}$	$\displaystyle\rightarrow Conv_{128}\rightarrow LeakyRELU$
		$\displaystyle\rightarrow Conv_{256}\rightarrow InstanceNorm\rightarrow LeakyRELU$
		$\displaystyle\rightarrow Conv_{512}\rightarrow InstanceNorm\rightarrow LeakyRELU\rightarrow Conv_{1}$

G. Comparison Metrics

Four performance measures, such as inception scores (IS), Fréchet inception distances (FID), reconstruction errors (RE), and maximum mean discrepancy (MMD) between encodings and standard normal random variables, are used to compare different models.

Proposed by Salimans et al. (2016), the IS involves using a pre-trained Inception v3 model to predict the class probabilities for each generated image. These predictions are then summarized into the IS by the KL divergence as following,

\text{IS}=\exp\left(\mathbb{E}_{x\sim P_{G(Z)}}D_{KL}\left(p(y|x)\|p(y)\right)\right),

(22)

where $p(y|x)$ is the predicted probabilities conditioning on the generated images, and $p(y)$ is the corresponding marginal distribution. Higher scores are better, corresponding to a larger KL-divergence between the two distributions. The FID is proposed by Heusel et al. (2017) to improve the IS by actually comparing the statistics of generated samples to real samples. It is defined as the Fréchet distance between two multivariate Gaussians,

\text{FID}=\|\mu_{r}-\mu_{G}\|^{2}+\text{Tr}\left(\Sigma_{r}+\Sigma_{G}-2(\Sigma_{r}\Sigma_{G})^{1/2}\right),

(23)

where $X_{r}\sim N(\mu_{r},\Sigma_{r})$ and $X_{G}\sim N(\mu_{G},\Sigma_{G})$ are the 2048-dimensional activations of the Inception-v3 pool-3 layer for real and generated samples respectively. For the FID, the lower the better. Furthermore, the reconstruction error (RE) is defined as

\mbox{RE}=\dfrac{1}{N}\sum_{i=1}^{N}\|\hat{X}_{i}-X_{i}\|_{2},

(24)

where $\hat{X}_{i}$ is the reconstructed sample for $X_{i}$ . RE is used to measure if the method has generated meaningful latent encodings. Smaller reconstruction errors indicate a more meaningful latent space which can be decoded into the original samples. The maximum mean discrepancy (MMD) is defined as

\mbox{MMD}=\dfrac{1}{N(N-1)}\sum_{l\neq j}k(z_{l},z_{j})+\dfrac{1}{N(N-1)}\sum_{l\neq j}k(\tilde{z}_{l},\tilde{z}_{j})-\dfrac{2}{N^{2}}\sum_{l,j}k(z_{l},\tilde{z}_{j})

(25)

where $k$ is a positive-definite reproducing kernel, $z_{i}$ ’s are drawn from prior distribution ${P}_{Z}$ , and $\tilde{z}_{i}=Q(x_{i})$ are the latent encodings of real samples. MMD is used to measure the difference between distribution of latent encodings and standard normal random variables. Smaller MMD indicates that the distribution of encodings is close to the standard normal distribution.

References

Arjovsky et al. (2017) Arjovsky, M., S. Chintala, and L. Bottou (2017). Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. PMLR.
Arora et al. (2017) Arora, S., R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017). Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 224–232. JMLR. org.
Barratt and Sharma (2018) Barratt, S. and R. Sharma (2018). A note on the inception score. arXiv preprint arXiv:1801.01973.
Bartlett et al. (2017) Bartlett, P. L., D. J. Foster, and M. J. Telgarsky (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249.
Berthelot et al. (2017) Berthelot, D., T. Schumm, and L. Metz (2017). Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
Blei et al. (2017) Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational inference: A review for statisticians. Journal of the American statistical Association 112(518), 859–877.
Brock et al. (2019) Brock, A., J. Donahue, and K. Simonyan (2019). Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
Carreira-Perpinan and Hinton (2005) Carreira-Perpinan, M. A. and G. E. Hinton (2005). On contrastive divergence learning. In Aistats, Volume 10, pp. 33–40. Citeseer.
Chen et al. (2018) Chen, X., J. Wang, and H. Ge (2018). Training generative adversarial networks via primal-dual subgradient methods: A lagrangian perspective on GAN. In International Conference on Learning Representations.
Donahue et al. (2017) Donahue, J., P. Krähenbühl, and T. Darrell (2017). Adversarial feature learning. In International Conference on Learning Representations (ICLR).
Dumoulin et al. (2017) Dumoulin, V., I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2017). Adversarially learned inference. In International Conference on Learning Representations (ICLR).
Farnia and Tse (2018) Farnia, F. and D. Tse (2018). A convex duality framework for gans. In Advances in Neural Information Processing Systems, pp. 5248–5258.
Fischer and Igel (2010) Fischer, A. and C. Igel (2010). Empirical analysis of the divergence of gibbs sampling based learning algorithms for restricted boltzmann machines. In International Conference on Artificial Neural Networks, pp. 208–217. Springer.
Gao et al. (2020) Gao, R., R. Nijkamp, D. Kingma, Z. Xu, A. Dai, and Y. Wu (2020). Flow contrastive estimation of energy-based models. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7515–7525.
Goodfellow et al. (2014) Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
Gretton et al. (2012) Gretton, A., K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012). A kernel two-sample test. Journal of Machine Learning Research 13(Mar), 723–773.
Grnarova et al. (2018) Grnarova, P., K. Y. Levy, A. Lucchi, N. Perraudin, T. Hofmann, and A. Krause (2018). Evaluating gans via duality. arXiv preprint arXiv:1811.05512.
Gu and Zhu (2001) Gu, M. G. and H. Zhu (2001). Maximum likelihood estimation for spatial models by markov chain monte carlo stochastic approximation. Journal of the Royal Statistical Society B 63(2), 339–355.
Gulrajani et al. (2017) Gulrajani, I., F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017). Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777.
Günther (1991) Günther, M. (1991). Isometric embeddings of riemannian manifolds. In Proceedings of the International Congress of Mathematicians, pp. 1137–1143.
Han et al. (2019) Han, T., E. Nijkamp, X. Fang, M. Hill, S.-C. Zhu, and Y. N. Wu (2019). Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8670–8679.
Heusel et al. (2017) Heusel, M., H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
Hinton (2002) Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural computation 14(8), 1771–1800.
Hornik (1991) Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4(2), 251–257.
Hu et al. (2018) Hu, Z., Z. Yang, R. Salakhutdinov, and E. P. Xing (2018). On unifying deep generative models. In International Conference on Learning Representations.
Jiang et al. (2019) Jiang, H., Z. Chen, M. Chen, F. Liu, D. Wang, and T. Zhao (2019). On computation and generalization of generative adversarial networks under spectrum control. In International Conference on Learning Representations.
Kingma and Ba (2015) Kingma, D. P. and J. Ba (2015). Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations, San Diego, 2015.
Kingma and Welling (2014) Kingma, D. P. and M. Welling (2014). Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
Larsen et al. (2016) Larsen, A. B. L., S. K. Sønderby, H. Larochelle, and O. Winther (2016). Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pp. 1558–1566.
Li et al. (2019) Li, X., J. Lu, Z. Wang, J. Haupt, and T. Zhao (2019). On tighter generalization bounds for deep neural networks: CNNs, resnets, and beyond.
Li et al. (2015) Li, Y., K. Swersky, and R. Zemel (2015). Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. PMLR.
Mescheder et al. (2017) Mescheder, L., S. Nowozin, and A. Geiger (2017). Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2391–2400. JMLR. org.
Mohri et al. (2018) Mohri, M., A. Rostamizadeh, and A. Talwalkar (2018). Foundations of machine learning. MIT press.
Nash (1956) Nash, J. (1956). The imbedding problem for riemannian manifolds. Annals of mathematics 63(1), 20–63.
Nowozin et al. (2016) Nowozin, S., B. Cseke, and R. Tomioka (2016). f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279.
Qiu and Wang (2020) Qiu, Y. and X. Wang (2020). Almond: Adaptive latent modeling and optimization via neural networks and langevin diffusion. Journal of the American Statistical Association 0(0), 1–13.
Qiu et al. (2020) Qiu, Y., L. Zhang, and X. Wang (2020). Unbiased contrastive divergence algorithm for training energy-based latent variable models. In International Conference on Learning Representations (ICLR).
Rosca et al. (2017) Rosca, M., B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed (2017). Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987.
Rosenblatt (1952) Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics 23(3), 470–472.
Salimans et al. (2016) Salimans, T., I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016). Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242.
Schulz et al. (2010) Schulz, H., A. Müller, and S. Behnke (2010). Investigating convergence of restricted boltzmann machine learning. In NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning.
Tolstikhin et al. (2018) Tolstikhin, I., O. Bousquet, S. Gelly, and B. Schoelkopf (2018). Wasserstein auto-encoders. In International Conference on Learning Representations.
Ulyanov et al. (2018) Ulyanov, D., A. Vedaldi, and V. Lempitsky (2018). It takes (only) two: Adversarial generator-encoder networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
Villani (2008) Villani, C. (2008). Optimal transport: old and new, Volume 338. Springer Science & Business Media.
Zhao et al. (2017) Zhao, J., M. Mathieu, and Y. LeCun (2017). Energy-based generative adversarial network. In International Conference on Learning Representations (ICLR).
Zhao et al. (2018) Zhao, S., J. Song, and S. Ermon (2018). The information-autoencoding family: A lagrangian perspective on latent variable generative modeling.
Zhu et al. (2017) Zhu, J. Y., T. Park, P. Isola, and A. A. Efros (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.

$\displaystyle\hat{\mathbb{E}}_{obs}\log p(x\|\theta)$	$\displaystyle=-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|-V(\theta)$
	$\displaystyle=-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|-\log\int q(x){e^{-\\|x-G_{\theta}(Q_{\theta}(x))\\|}\over q(x)}dx$
	$\displaystyle\leq-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|-\int q(x)\log{e^{-\\|x-G_{\theta}(Q_{\theta}(x))\\|}\over q(x)}dx$
	$\displaystyle=-\hat{\mathbb{E}}_{obs}\\|x-G_{\theta}(Q_{\theta}(x))\\|+\mathbb{E}_{q(x)}\\|x-G_{\theta}(Q_{\theta}(x))\\|-H(q),$	(12)