
Inferential Wasserstein Generative Adversarial Networks

Yao Chen, Department of Statistics, Purdue University; Qingyi Gao, Department of Statistics, Purdue University; Xiao Wang, Department of Statistics, Purdue University
Abstract

Generative Adversarial Networks (GANs) have been impactful on many problems and applications but suffer from unstable training. The Wasserstein GAN (WGAN) leverages the Wasserstein distance to avoid the caveats in the minimax two-player training of GANs but has other defects such as mode collapse and the lack of a metric to detect convergence. We introduce a novel inferential Wasserstein GAN (iWGAN) model, which is a principled framework to fuse auto-encoders and WGANs. The iWGAN model jointly learns an encoder network and a generator network motivated by the iterative primal-dual optimization process. The encoder network maps the observed samples to the latent space and the generator network maps the samples from the latent space to the data space. We establish the generalization error bound of the iWGAN to theoretically justify its performance. We further provide a rigorous probabilistic interpretation of our model under the framework of maximum likelihood estimation. The iWGAN, with a clear stopping criterion, has many advantages over other autoencoder GANs. The empirical experiments show that the iWGAN greatly mitigates mode collapse, speeds up convergence, and provides a quality-check measure for each individual sample. We illustrate the ability of the iWGAN by obtaining competitive and stable performance on benchmark datasets.

Keywords: generalization error, generative adversarial networks, latent variable models, primal dual optimization, Wasserstein distance.

1 Introduction

One of the goals of generative modeling is to match the model distribution Pθ(x)P_{\theta}(x) with parameters θ\theta to the true data distribution PXP_{X} for a random variable X𝒳X\in{\cal X}. For latent variable models, the data point XX is generated from a latent variable Z𝒵Z\in{\cal Z} through a conditional distribution p(x|z)p(x|z). Here 𝒳{\cal X} denotes the support for PXP_{X} and 𝒵{\cal Z} denotes the support for PZP_{Z}. In this paper, we consider models with Z𝒩(0,I)Z\sim{\cal N}(0,I). There has been a surge of research on deep generative networks in recent years and the literature is too vast to summarize here (Kingma and Welling, 2014; Goodfellow et al., 2014; Li et al., 2015; Gao et al., 2020; Qiu and Wang, 2020). These models have provided a powerful framework for modeling complex high dimensional datasets.

We start by introducing two main approaches for generative modeling. The first one is called variational auto-encoders (VAEs) (Kingma and Welling, 2014), which use variational inference (Blei et al., 2017) to learn a model by maximizing a lower bound of the likelihood function. Specifically, let the latent variable ZZ be drawn from a prior p(z)p(z) and the data XX have a likelihood p(x|z)p(x|z) that is conditioned on Z=zZ=z. Unfortunately, obtaining the marginal distribution of XX requires computing an intractable integral p(x)=p(x|z)p(z)𝑑zp(x)=\int p(x|z)p(z)dz. Variational inference approximates the posterior p(z|x)p(z|x) by a family of distributions qη(z|x)q_{\eta}(z|x) with parameter η\eta. The objective is to maximize a lower bound of the log-likelihood function known as the evidence lower bound (ELBO). The ELBO is given by

ELBO=𝔼zqη(z|x)[logp(x|z)]𝔼zqη(z|x)[logqη(z|x)p(z)].\mbox{ELBO}=\mathbb{E}_{z\sim q_{\eta}(z|x)}\big{[}\log p(x|z)\big{]}-\mathbb{E}_{z\sim q_{\eta}(z|x)}\big{[}\log{q_{\eta}(z|x)\over p(z)}\big{]}.

Note that the difference between the log-likelihood logp(x)\log p(x) and the ELBO is the Kullback-Leibler divergence between qη(z|x)q_{\eta}(z|x) and the true posterior p(z|x)p(z|x). Usually qη(z|x)q_{\eta}(z|x) is a normal density where both the conditional mean and the conditional covariance are modeled by deep neural networks (DNNs), so that the first term of the ELBO can be approximated efficiently by Monte Carlo methods and the second term can be calculated explicitly. Therefore, the ELBO allows us to do approximate posterior inference with tractable computation. VAEs have elegant theoretical foundations but the drawback is that they tend to produce blurry images. The second approach is called generative adversarial networks (GANs) (Goodfellow et al., 2014), which learn a model by using a powerful discriminator to distinguish between real data samples and generated data samples. Specifically, we define a generator G:𝒵𝒳G:{\cal Z}\rightarrow{\cal X} and a discriminator D:𝒳[0,1]D:{\cal X}\rightarrow[0,1]. The generator and discriminator play a two-player minimax game by being updated alternately, such that the generator tries to produce real-looking images and the discriminator tries to distinguish between generated images and observed images. The GAN objective can be written as minGmaxDV(G,D)\min_{G}\max_{D}V(G,D), where

V(G,D)=𝔼xPXlogD(x)+𝔼zPZlog(1D(G(z))).V(G,D)=\mathbb{E}_{x\sim P_{X}}\log D(x)+\mathbb{E}_{z\sim P_{Z}}\log(1-D(G(z))).

GANs produce more visually realistic images but suffer from unstable training and the mode collapse problem. Although many variants of generative models try to take advantage of both VAEs and GANs (Tolstikhin et al., 2018; Rosca et al., 2017), to the best of our knowledge, a model that provides a unifying framework combining the best of VAEs and GANs in a principled way is yet to be discovered.

1.1 Related work

In this section, we provide a brief introduction of different variants of generative models.

Wasserstein GAN. The Wasserstein GAN (WGAN) (Arjovsky et al., 2017) is an extension of the GAN that improves the stability of training by introducing a new loss function motivated by the Wasserstein distance between two probability measures (Villani, 2008). Let PG(Z)P_{G(Z)} denote the generative model distribution induced by the generator GG and the latent variable ZPZZ\sim P_{Z}. Both the vanilla GAN (Goodfellow et al., 2014) and the WGAN can be viewed as minimizing a certain divergence between the data distribution PXP_{X} and the generative distribution PG(Z)P_{G(Z)}. For example, the Jensen-Shannon (JS) divergence is implicitly used in vanilla GANs (Goodfellow et al., 2014), while the 11-Wasserstein distance is employed in WGANs. Empirical experiments suggest that the Wasserstein distance is a more sensible measure for differentiating probability measures supported on low-dimensional manifolds. In terms of training, it is hard or even impossible to compute these standard divergences exactly, especially when PXP_{X} is unknown and PG(Z)P_{G(Z)} is parameterized by DNNs. Instead, WGANs are trained through the dual problem, exploiting the elegant form of the Kantorovich-Rubinstein duality (Villani, 2008).

Autoencoder GANs. The main difference between autoencoder GANs and standard GANs is that, besides the generator GG, there is an encoder Q:𝒳𝒵Q:{\cal X}\rightarrow{\cal Z} which maps the data points into the latent space. This deterministic encoder approximates the conditional distribution p(z|x)p(z|x) of the latent variable ZZ given the data point XX. Larsen et al. (2016) first introduced the VAE-GAN, which is a hybrid of VAEs and GANs and uses a GAN discriminator to replace a VAE’s decoder to learn the loss function. For both the Adversarially Learned Inference (ALI) (Dumoulin et al., 2017) and the Bidirectional Generative Adversarial Network (BiGAN) (Donahue et al., 2017), the objective is to match two joint distributions, (X,Q(X))(X,Q(X)) and (G(Z),Z)(G(Z),Z), under the framework of vanilla GANs. When the algorithm achieves equilibrium, these two joint distributions roughly match. The encoder Q(X)Q(X) is expected to produce more meaningful latent codes, which should improve the quality of the generator as well. For other VAE-GAN variants, please see Rosca et al. (2017); Mescheder et al. (2017); Hu et al. (2018); Ulyanov et al. (2018).

Energy-Based GANs. Energy-based Generative Adversarial Networks (EBGANs) (Zhao et al., 2017) view the discriminator as an energy function that assigns low energies to regions near the data manifold and higher energies to other regions, and the generator as being trained to produce contrastive samples with minimal energies. Han et al. (2019) presented the joint training of a generator model, an energy-based model, and an inference model, introducing a new objective function called the divergence triangle that makes sampling, inference, and energy evaluation readily available without the need for costly Markov chain Monte Carlo methods.

Duality in GANs. Regarding the optimization perspective of GANs, Chen et al. (2018) and Zhao et al. (2018) studied duality-based methods for improving training performance. Farnia and Tse (2018) developed a convex duality framework to address the case when the discriminator is constrained into a smaller class. Grnarova et al. (2018) developed an evaluation metric to detect the non-convergence behavior of vanilla GANs, which is the duality gap defined as the difference between the primal and the dual objective functions.

1.2 Our Contributions

Although there are many interesting works on autoencoder GANs, it remains unclear what the principles are underlying the fusion of auto-encoders and GANs. For example, do there even exist these two mappings, the encoder QQ and the decoder GG, for any high-dimensional random variable XX, such that Q(X)Q(X) has the same distribution as ZZ and G(Z)G(Z) has the same distribution as XX? Is there any probabilistic interpretation, such as the maximum likelihood principle, for autoencoder GANs? What is the generalization performance of autoencoder GANs? In this paper, we introduce inferential WGANs (iWGANs), which provide satisfying answers for these questions. We will mainly focus on the 1-Wasserstein distance, instead of the Kullback-Leibler divergence. We borrow strength from both the primal and the dual problems and demonstrate the synergistic effect between these two optimizations. The encoder component turns out to be a natural consequence of our algorithm. The iWGAN learns both an encoder and a decoder simultaneously. We prove the existence of a meaningful encoder and decoder, establish an equivalence between the WGAN and iWGAN, and develop the generalization error bound for the iWGAN. Furthermore, the iWGAN has a natural probabilistic interpretation under the maximum likelihood principle. Our learning algorithm is equivalent to the maximum likelihood estimation motivated by the variational approach when our model is defined as an energy-based model based on an autoencoder. As a byproduct, this interpretation allows us to perform quality checks at the individual sample level. In addition, we demonstrate the natural use of the duality gap as a measure of convergence for the iWGAN, and show its effectiveness for various numerical settings. In our experiments, we do not observe any mode collapse.

The rest of the paper is organized as follows. Section 2 presents the new iWGAN framework, and its extension to general inferential f-GANs. Section 3 establishes the generalization error bound and introduces the algorithm for the iWGAN. The probabilistic interpretation and the connection with the maximum likelihood estimation are introduced in Section 4. Extensive numerical experiments are demonstrated in Section 5 to show the advantages of the iWGAN framework. Proofs of theorems and additional numerical results are provided in the Appendix.

2 The iWGAN Model

The autoencoder generative model consists of two parts: an encoder QQ and a generator GG. The encoder QQ maps a data sample x𝒳x\in{\cal X} to a latent variable z𝒵z\in{\cal Z}, and the generator GG takes a latent variable z𝒵z\in{\cal Z} to produce a sample G(z)G(z). In general, the autoencoder generative model should satisfy the following three conditions simultaneously: (a) The generator can generate images whose distribution is similar to that of the observed images, i.e., the distribution of G(Z)G(Z) is similar to that of PXP_{X}; (b) The encoder can produce meaningful encodings in the latent space, i.e., Q(X)Q(X) has a distribution similar to that of ZZ; (c) The reconstruction errors of this model based on these meaningful encodings are small, i.e., the difference between XX and G(Q(X))G(Q(X)) is small.

We emphasize that the benefit of using an autoencoder is to encourage the model to better represent all the data it is trained with, so that it discourages mode collapse. We first show that, for any distribution residing on a compact smooth Riemannian manifold (footnote 1: A smooth manifold 𝒳{\cal X} is a manifold with a CC^{\infty} atlas on 𝒳{\cal X}. A CC^{\infty} atlas is a collection of charts {φα:Uαd}\{\varphi_{\alpha}:U_{\alpha}\rightarrow\mathbb{R}^{d}\} such that {Uα}\{U_{\alpha}\} covers 𝒳{\cal X}, and for all α\alpha and β\beta, the transition map φαφβ1\varphi_{\alpha}\cdot\varphi_{\beta}^{-1} is a CC^{\infty} map. Here UαU_{\alpha} is an open subset of 𝒳{\cal X}. For any point p𝒳p\in{\cal X}, let Tp𝒳T_{p}{\cal X} be the tangent space of 𝒳{\cal X} at pp. A Riemannian metric assigns to each pp a positive definite inner product gp:Tp𝒳×Tp𝒳g_{p}:T_{p}{\cal X}\times T_{p}{\cal X}\rightarrow\mathbb{R}, along with which comes a norm p:Tp𝒳\|\cdot\|_{p}:T_{p}{\cal X}\rightarrow\mathbb{R} defined by vp=gp(v,v)\|v\|_{p}=\sqrt{g_{p}(v,v)}. The smooth manifold 𝒳{\cal X} endowed with this metric gg is called a smooth Riemannian manifold.), there always exists an encoder Q:𝒳𝒵Q^{*}:{\cal X}\rightarrow{\cal Z} which guarantees meaningful encodings and a generator G:𝒵𝒳G^{*}:{\cal Z}\rightarrow{\cal X} which generates samples with the same distribution as the data points by using these meaningful codes.

Theorem 1.

Consider a continuous random variable X𝒳X\in{\cal X}, where 𝒳{\cal X} is a dd-dimensional compact smooth Riemannian manifold. Then, there exist two mappings Q:𝒳pQ^{*}:{\cal X}\rightarrow\mathbb{R}^{p} and G:p𝒳G^{*}:\mathbb{R}^{p}\rightarrow{\cal X}, with p=max{d(d+5)/2,d(d+3)/2+5}p=\max\{d(d+5)/2,d(d+3)/2+5\}, such that Q(X)Q^{*}(X) follows a multivariate normal distribution with zero mean and identity covariance matrix and GQG^{*}\circ Q^{*} is an identity mapping, i.e., X=G(Q(X))X=G^{*}(Q^{*}(X)).

Theorem 1 is a natural consequence of the Nash embedding theorem (Nash, 1956; Günther, 1991) and the probability integral transformation (Rosenblatt, 1952). In Theorem 1, we have proved the existence of QQ^{*} and GG^{*}; however, learning QQ^{*} and GG^{*} from the data points is still a challenging task. Consider a general ff-GAN model (Nowozin et al., 2016). Let h:(,]h:\mathbb{R}\rightarrow(-\infty,\infty] be a convex function with h(1)=0h(1)=0. The ff-GAN defines the ff-divergence between the data distribution PXP_{X} and the generative model distribution PG(Z)P_{G(Z)} for the generator GG as:

GANh(PX,PG(Z))=supf[𝔼X{f(X)}𝔼Z{h(f(G(Z))}],\text{GAN}_{h}(P_{X},P_{G(Z)})=\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X}\left\{f(X)\right\}-\mathbb{E}_{Z}\left\{h^{*}(f(G(Z))\right\}\Big{]},

where h(x)=supy{xyh(y)}h^{*}(x)=\sup_{y}\{x\cdot y-h(y)\} is the convex conjugate of hh and ={f|f:𝒳}{\cal F}=\{f|f:\cal X\rightarrow\mathbb{R}\} is a class of functions whose output range is the domain of hh^{*}. When ff is approximated by a DNN, its output range can be achieved by choosing an appropriate activation function specific to the f-divergence used. For example, if h(x)=xlog(x)(x+1)log(x+1)h(x)=x\log(x)-(x+1)\log(x+1), then the corresponding convex conjugate h(x)=log(1exp(x))h^{*}(x)=-\log(1-\exp(x)). To satisfy the above condition, we select the output activation function of the DNN ff to be σ(v)=log(1+exp(v))\sigma(v)=-\log(1+\exp(-v)) such that the ff-GAN can recover the original vanilla GAN (Goodfellow et al., 2014). If h(x)=0h(x)=0 when x=1x=1 and h(x)=h(x)=\infty otherwise, we have h(x)=xh^{*}(x)=x. When \cal F is further restricted to the 1-Lipschitz function class, the ff-GAN becomes the WGAN.
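As a quick check of the conjugate pair quoted above (a standard computation, included for completeness), for h(y)=y\log y-(y+1)\log(y+1) the supremum in h^{*}(x)=\sup_{y>0}\{xy-h(y)\} is attained where the derivative vanishes:

x-\log\frac{y}{y+1}=0\;\Longrightarrow\;y=\frac{e^{x}}{1-e^{x}}\quad(x<0),\qquad h^{*}(x)=\log(y+1)=-\log\left(1-e^{x}\right),

which matches h^{*}(x)=-\log(1-\exp(x)) and shows that the domain of h^{*} is (-\infty,0), consistent with the choice of output activation \sigma(v)=-\log(1+\exp(-v)).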

For ease of presentation, we illustrate our methodology by mainly focusing on the Wasserstein distance and the inferential WGAN (iWGAN) model. The extension to general inferential f-GANs (ifGANs) is straightforward and will be presented in Section 2.3.

2.1 iWGAN

Recall that the 11-Wasserstein distance between PXP_{X} and PG(Z)P_{G(Z)} is defined as

W1(PX,PG(Z))=infπΠ(PX,PZ)𝔼(X,Z)πXG(Z),W_{1}(P_{X},P_{G(Z)})=\inf_{\pi\in\Pi(P_{X},P_{Z})}\mathbb{E}_{(X,Z)\sim\pi}\big{\|}X-G(Z)\big{\|}, (1)

where \|\cdot\| represents the L2L_{2}-norm and Π(PX,PZ)\Pi(P_{X},P_{Z}) is the set of all joint distributions of (X,Z)(X,Z) with marginal measures PXP_{X} and PZP_{Z}, respectively. The main difficulty in (1) is to find the optimal coupling π\pi, and this is a constrained optimization because the joint distribution π\pi needs to match these two marginal distributions PXP_{X} and PZP_{Z}.
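For intuition about the primal definition (1), the following one-dimensional illustration compares two empirical samples; in one dimension the optimal coupling is the monotone (sorted) rearrangement, so the distance is the average gap between order statistics. This is only a sanity check of the definition, not part of the iWGAN training; the sample sizes and distributions are arbitrary placeholders, and scipy.stats.wasserstein_distance is used as an off-the-shelf reference implementation.

```python
# 1-D illustration of the primal 1-Wasserstein distance between empirical measures.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=5000)   # "data" samples
y = rng.normal(loc=1.0, scale=1.0, size=5000)   # "generated" samples

# Closed-form W1 between N(0,1) and N(1,1) is |0 - 1| = 1.
print("scipy estimate:", wasserstein_distance(x, y))
print("sorted-sample estimate:", np.mean(np.abs(np.sort(x) - np.sort(y))))
```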

Based on the Kantorovich-Rubinstein duality (Villani, 2008), the WGAN studies the 11-Wasserstein distance (1) through its dual format

W1(PX,PG(Z))=supf[𝔼XPX{f(X)}𝔼ZPZ{f(G(Z))}],W_{1}(P_{X},P_{G(Z)})=\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X\sim P_{X}}\big{\{}f(X)\big{\}}-\mathbb{E}_{Z\sim P_{Z}}\big{\{}f(G(Z))\big{\}}\Big{]}, (2)

where {\cal F} is the set of all bounded 11-Lipschitz functions. This is also a constrained optimization due to the Lipschitz constraint on ff such that f(x)f(y)xyf(x)-f(y)\leq\|x-y\| for all x,y𝒳x,y\in{\cal X}. Weight clipping (Arjovsky et al., 2017) and gradient penalty (Gulrajani et al., 2017) have been used to satisfy the constraint of Lipschitz continuity. Arjovsky et al. (2017) used a clipping parameter cc to clamp each weight parameter to a fixed interval [c,c][-c,c] after each gradient update. However, this method is very sensitive to the choice of the clipping parameter cc. Instead, Gulrajani et al. (2017) introduced a gradient penalty, 𝔼x^{(x^f(x^)21)2}\mathbb{E}_{\hat{x}}\big{\{}(\|\nabla_{\hat{x}}f(\hat{x})\|_{2}-1)^{2}\big{\}}, in the loss function to enforce the Lipschitz constraint, where x^\hat{x} is sampled uniformly along straight lines between pairs of points sampled from PXP_{X} and PG(Z)P_{G(Z)}. This is motivated by the fact that the optimal critic contains straight lines with gradient norm 11 connecting coupled points from PXP_{X} and PG(Z)P_{G(Z)}. The experiments of Arjovsky et al. (2017) showed that the WGAN can avoid the vanishing gradient problem. However, the WGAN does not produce meaningful encodings and many experiments still display the problem of mode collapse (Arjovsky et al., 2017; Gulrajani et al., 2017).
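The gradient penalty above is straightforward to implement with automatic differentiation. The following is a minimal PyTorch-style sketch; the critic f, the batch tensors, and the coefficient lambda_gp are placeholders rather than the exact settings used in our experiments.

```python
import torch

def gradient_penalty(f, x_real, x_fake, lambda_gp=10.0):
    """Samples uniformly along straight lines between real and generated points and
    penalizes critics whose gradient norm deviates from 1 (Gulrajani et al., 2017)."""
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)
    # Gradient of the critic output with respect to the interpolated inputs.
    grad = torch.autograd.grad(outputs=f(x_hat).sum(), inputs=x_hat,
                               create_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```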

On the other hand, the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018), after introducing an encoder Q:𝒳𝒵Q:{\cal X}\rightarrow{\cal Z} to approximate the conditional distribution of ZZ given XX, minimizes the reconstruction error infQ𝒬𝔼XXG(Q(X))\inf_{Q\in{\cal Q}}\mathbb{E}_{X}\big{\|}X-G(Q(X))\big{\|}, where 𝒬{\cal Q} is a set of encoder mappings whose elements satisfy PQ(X)=PZP_{Q(X)}=P_{Z}. A penalty, such as 𝒟(PQ(X),PZ){\cal D}(P_{Q(X)},P_{Z}), is added to the objective to satisfy this constraint, where 𝒟{\cal D} is an arbitrary divergence between PQ(X)P_{Q(X)} and PZP_{Z}. The WAE can produce meaningful encodings and has a controlled reconstruction error. However, the WAE defines a generative model in an implicit way and does not model the generator through G(Z)G(Z) with ZPZZ\sim P_{Z} directly.

To take advantage of both the WGAN and the WAE, we propose a new autoencoder GAN model, called the iWGAN, which defines the divergence between PXP_{X} and PG(Z)P_{G(Z)} by

W¯1(PX,PG(Z))=infQ𝒬supf[𝔼XXG(Q(X))+𝔼X{f(G(Q(X)))}𝔼Z{f(G(Z))}].\overline{W}_{1}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X}\|X-G(Q(X))\|+\mathbb{E}_{X}\big{\{}f(G(Q(X)))\big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}}\Big{]}. (3)

Our goal is to find the tuple (G,Q,f)(G,Q,f) which minimizes W¯1(PX,PG(Z))\overline{W}_{1}(P_{X},P_{G(Z)}). The motivation and explanation of this objective function are provided in Section 2.2 in detail. The term XG(Q(X))\|X-G(Q(X))\| can be treated as the autoencoder reconstruction error as well as a loss to match the distributions between XX and G(Q(X))G(Q(X)). We note that the L1L_{1}-norm 1\|\cdot\|_{1} has been used for the reconstruction term by the α\alpha-GAN (Rosca et al., 2017) and CycleGAN (Zhu et al., 2017). The other term 𝔼XPX{f(G(Q(X)))}𝔼ZPZ{f(G(Z))}\mathbb{E}_{X\sim P_{X}}\{f(G(Q(X)))\}-\mathbb{E}_{Z\sim P_{Z}}\{f(G(Z))\} can be treated as a loss for the generator as well as a loss to match the distribution between G(Q(X))G(Q(X)) and G(Z)G(Z). We emphasize that this term is different from the objective function of the WGAN in (2). The properties of (3) will be discussed in Theorem 2, and the primal and dual explanation of (3) will be presented in Section 2.2.

Furthermore, it is challenging for practitioners to determine when to stop training GANs. Most of the GAN algorithms do not provide any explicit standard for the convergence of the model. However, the measure of convergence for the iWGAN becomes very natural and we use the duality gap as the measure. For a given tuple (G,Q,f)(G,Q,f), the duality gap is defined as

DualGap(G,Q,f)=supf¯L(G,Q,f¯)infG¯𝒢,Q¯𝒬L(G¯,Q¯,f),\mbox{DualGap}(G,Q,f)=\sup_{\overline{f}\in{\cal F}}L(G,Q,\overline{f})-\inf_{\overline{G}\in{\cal G},\overline{Q}\in{\cal Q}}L(\overline{G},\overline{Q},f), (4)

where L(G,Q,f)L(G,Q,f) is

L(G,Q,f)=𝔼XXG(Q(X))+𝔼X{f(G(Q(X)))}𝔼Z{f(G(Z))}.L(G,Q,f)=\mathbb{E}_{X}\|X-G(Q(X))\|+\mathbb{E}_{X}\{f(G(Q(X)))\}-\mathbb{E}_{Z}\{f(G(Z))\}.

In practice, the function spaces 𝒢{\cal G}, 𝒬{\cal Q}, and {\cal F} are modeled by spaces containing deep neural networks with specific architectures. The architecture hyperparameters usually include number of channels, number of layers, and width of each layer. The architectures for our numerical experiments are provided in the appendix. We assume that these network spaces are large enough to include the true encoder QQ^{*}, generator GG^{*}, and the optimal discriminator ff in (2). This is not a strong assumption due to the universal approximation theorem of DNNs (Hornik, 1991).
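For concreteness, the following is a minimal PyTorch-style sketch of a minibatch estimate of L(G,Q,f)L(G,Q,f) defined above; the networks G, Q, f and the batch tensors are placeholders, and the Lipschitz constraint on ff (handled by a gradient penalty in Section 3) is not enforced here.

```python
import torch

def iwgan_loss(G, Q, f, x_batch, z_batch):
    """Minibatch estimate of L(G, Q, f): reconstruction term plus the critic gap
    between encoded-then-reconstructed samples G(Q(x)) and generated samples G(z)."""
    x_rec = G(Q(x_batch))                                            # G(Q(x))
    recon = (x_batch - x_rec).flatten(start_dim=1).norm(2, dim=1).mean()
    critic_gap = f(x_rec).mean() - f(G(z_batch)).mean()
    return recon + critic_gap
```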

Theorem 2.

(a). The iWGAN objective (3) is equivalent to

W¯1(PX,PG(Z))=infQ𝒬{W1(PX,PG(Q(X)))+W1(PG(Q(X)),PG(Z))}.\displaystyle\overline{W}_{1}(P_{X},P_{G(Z)})=\inf_{Q\in\mathcal{Q}}\Big{\{}W_{1}(P_{X},P_{G(Q(X))})+W_{1}(P_{G(Q(X))},P_{G(Z)})\Big{\}}. (5)

Therefore, W1(PX,PG(Z))W¯1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)})\leq\overline{W}_{1}(P_{X},P_{G(Z)}). If there exists a Q𝒬Q^{*}\in{\cal Q} such that Q(X)Q^{*}(X) has the same distribution as ZZ, then W1(PX,PG(Z))=W¯1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)})=\overline{W}_{1}(P_{X},P_{G(Z)}).
(b). Let (Q~,G~,f~)(\widetilde{Q},\widetilde{G},\widetilde{f}) be a fixed solution. Then

DualGap(G~,Q~,f~)W1(PX,PG~(Q~(X)))+W1(PG~(Q~(X)),PG~(Z)).\mbox{DualGap}(\tilde{G},\tilde{Q},\tilde{f})\geq W_{1}(P_{X},P_{\widetilde{G}(\widetilde{Q}(X))})+W_{1}(P_{\widetilde{G}(\widetilde{Q}(X))},P_{\widetilde{G}(Z)}).

Moreover, if G~\widetilde{G} outputs the same distribution as XX and Q~\widetilde{Q} outputs the same distribution as ZZ, both the duality gap and W¯1(PX,PG~(Z))\overline{W}_{1}(P_{X},P_{\widetilde{G}(Z)}) are zero and X=G~(Q~(X))X=\widetilde{G}(\widetilde{Q}(X)) for XPXX\sim P_{X}.

According to Theorem 2, the iWGAN objective is in general an upper bound of W1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)}). However, this upper bound is tight. When the space 𝒬{\cal Q} includes a special encoder QQ^{*} such that Q(X)Q^{*}(X) has the same distribution as ZZ, the iWGAN objective is exactly the same as W1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)}). Theorem 2 also provides an appealing property from a practical point of view. The values of both the duality gap and W¯1(PX,PG~(Z))\overline{W}_{1}(P_{X},P_{\widetilde{G}(Z)}) give us a natural criterion to assess the convergence of the algorithm.

2.2 A Primal-Dual Explanation

We explain the iWGAN objective function (3) from the view of primal and dual problems. Note that both the primal problem (1) and the dual problem (2) are constrained optimization problems. First, for the primal problem (1), two constraints on π\pi are π(x,z)𝑑zpX(x)=0\int\pi(x,z)dz-p_{X}(x)=0 for all x𝒳x\in{\cal X}, and π(x,z)𝑑xpZ(z)=0\int\pi(x,z)dx-p_{Z}(z)=0 for all z𝒵z\in{\cal Z}. Recall that the primal variable ff for the dual problem (2) is also a dual variable for the primal problem (1). From the Lagrange multiplier perspective, we can write the primal problem (1) as

infQ𝒬𝔼X{XG(Q(X))+f(G(Q(X)))}𝔼Z{f(G(Z))},\displaystyle\inf_{Q\in{\cal Q}}\mathbb{E}_{X}\Big{\{}\|X-G(Q(X))\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}},

where we use the encoder QQ to approximate the conditional distribution of ZZ given XX, and the Lagrange multipliers for two constraints are f(x)f(x) and f(G(z))-f(G(z)) respectively. Second, for the dual problem (2), the 11-Lipschitz constraint on ff is f(x)f(G(z))xG(z)f(x)-f(G(z))\leq\|x-G(z)\| for all x𝒳x\in{\cal X} and z𝒵z\in{\cal Z}. Recall that the primal variable π\pi for the primal problem (1) is also a dual variable for the dual problem (2). Similarly, we can write the dual problem (2) as

supf𝔼X{f(X)}𝔼Z{f(G(Z))}𝒳×𝒵π(x,z)(f(x)f(G(z))xG(z))𝑑x𝑑z\displaystyle\sup_{f\in{\cal F}}\mathbb{E}_{X}\big{\{}f(X)\big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}}-\int_{{\cal X}\times\cal Z}\pi(x,z)\Big{(}f(x)-f(G(z))-\|x-G(z)\|\Big{)}dxdz
=\displaystyle= supf𝔼X{XG(Q(X))+f(G(Q(X)))}𝔼Z{f(G(Z))},\displaystyle\sup_{f\in{\cal F}}\mathbb{E}_{X}\Big{\{}\|X-G(Q(X))\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}},

where the Lagrange multiplier for the 11-Lipschitz constraint is π(x,z)\pi(x,z). When we solve primal and dual problems iteratively, this turns out to be exactly the same as our iWGAN algorithm.

In addition, the optimal value of the primal problem (1) satisfies

infQ𝒬supf𝔼X{XG(Q(X))+f(G(Q(X)))}𝔼Z{f(G(Z))},\displaystyle\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\mathbb{E}_{X}\Big{\{}\|X-G(Q(X))\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}},

and the optimal value of the dual problem (2) satisfies

supfinfQ𝒬𝔼X{XG(Q(X))+f(G(Q(X)))}𝔼Z{f(G(Z))}.\displaystyle\sup_{f\in{\cal F}}\inf_{Q\in{\cal Q}}\mathbb{E}_{X}\Big{\{}\|X-G(Q(X))\|+f(G(Q(X)))\Big{\}}-\mathbb{E}_{Z}\big{\{}f(G(Z))\big{\}}.

The difference between the optimal primal and dual values is exactly the duality gap in (4).

2.3 Extension to f-GANs

This framework can be easily extended to other types of GANs. Assume that \cal F is the 1-Lipschitz function class. We extend the iWGAN framework to the inferential f-GAN (ifGAN) framework. Define the ifGAN objective function as follows:

W¯1,h(PX,PG(Z))=infQ𝒬supf[𝔼XXG(Q(X))+𝔼X{f(G(Q(X)))}𝔼Z{h(f(G(Z)))}]\overline{W}_{1,h}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\Big{[}\mathbb{E}_{X}\|X-G(Q(X))\|+\mathbb{E}_{X}\big{\{}f(G(Q(X)))\big{\}}-\mathbb{E}_{Z}\big{\{}h^{*}(f(G(Z)))\big{\}}\Big{]}. (6)

Following this definition, we have

W¯1,h(PX,PG(Z))=infQ𝒬{W1(PX,PG(Q(X)))+GANh(PG(Q(X)),PG(Z))}.\displaystyle\overline{W}_{1,h}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\Big{\{}W_{1}(P_{X},P_{G(Q(X))})+\text{GAN}_{h}(P_{G(Q(X))},P_{G(Z)})\Big{\}}.

We show GANh(PX,PG(Z))W¯1,h(PX,PG(Z))\text{GAN}_{h}(P_{X},P_{G(Z)})\leq\overline{W}_{1,h}(P_{X},P_{G(Z)}). This is because

GANh(PX,PG(Z))=supf𝔼X{f(X)}𝔼Z{h(f(G(Z))}\displaystyle\text{GAN}_{h}(P_{X},P_{G(Z)})=\sup_{f\in\cal F}\mathbb{E}_{X}\left\{f(X)\right\}-\mathbb{E}_{Z}\left\{h^{*}(f(G(Z))\right\}
\displaystyle\leq infQ𝒬[supf𝔼X{f(X)}𝔼X{f(G(Q(X)))}+supf𝔼X{f(G(Q(X)))}𝔼Z{h(f(G(Z))}]\displaystyle\inf_{Q\in\cal Q}\Big{[}\sup_{f\in\cal F}\mathbb{E}_{X}\left\{f(X)\right\}-\mathbb{E}_{X}\left\{f(G(Q(X)))\right\}+\sup_{f\in\cal F}\mathbb{E}_{X}\left\{f(G(Q(X)))\right\}-\mathbb{E}_{Z}\left\{h^{*}(f(G(Z))\right\}\Big{]}
=\displaystyle= W¯1,h(PX,PG(Z)).\displaystyle\overline{W}_{1,h}(P_{X},P_{G(Z)}).

This indicates that the ifGAN objective (6) is an upper bound of the f-GAN objective.

3 Generalization Error Bound and the Algorithm

Suppose that we observe nn samples {x1,,xn}\{x_{1},\ldots,x_{n}\}. In practice, we minimize the empirical version, denoted by W¯^1(PX,PG(Z))\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)}), of W¯1(PX,PG(Z))\overline{W}_{1}(P_{X},P_{G(Z)}) to learn both the encoder and the generator, where,

W¯^1(PX,PG(Z))=infQ𝒬supf[𝔼^obsxG(Q(x))+𝔼^obs{f(G(Q(x)))}𝔼^z{f(G(z))}].\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})=\inf_{Q\in{\cal Q}}\sup_{f\in{\cal F}}\Big{[}\hat{\mathbb{E}}_{obs}\|x-G(Q(x))\|+\hat{\mathbb{E}}_{obs}\big{\{}f(G(Q(x)))\big{\}}-\hat{\mathbb{E}}_{z}\big{\{}f(G(z))\big{\}}\Big{]}. (7)

Here 𝔼^obs{}\hat{\mathbb{E}}_{obs}\{\cdot\} denotes the empirical average on the observed data {xi}\{x_{i}\} and 𝔼^z\hat{\mathbb{E}}_{z} denotes the empirical average on a random sample of standard normal random variables. Before we present the details of the algorithm, we first establish the generalization error bound for the iWGAN in this section.

In the context of supervised learning, generalization error is defined as the gap between the empirical risk and the expected risk. The empirical risk corresponds to the training error, and the expected risk corresponds to the testing error. Mathematically, the difference between the expected risk and the empirical risk, i.e., the generalization error, is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. However, in the context of GANs, neither the training error nor the test error is well defined. But we can define the generalization error in a similar way. Explicitly, we define the “training error” as W¯^1(PX,PG(Z))\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)}) in (7), which is minimized based on observed samples. Define the “test error” as W1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)}) in (1), which is the true 1-Wasserstein distance between PXP_{X} and PG(Z)P_{G(Z)}. The generalization error for the iWGAN is defined as the gap between these two “errors”. In other words, for an iWGAN model with the parameter (G,Q,f)(G,Q,f), the generalization error is defined as W¯^1(PX,PG(Z))W1(PX,PG(Z))\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})-W_{1}(P_{X},P_{G(Z)}). For discussions of generalization performance of classical GANs, see Arora et al. (2017) and Jiang et al. (2019).

Theorem 3.

Given a generator G𝒢G\in\mathcal{G}, and nn samples (x1,,xn)(x_{1},\ldots,x_{n}) from 𝒳={x:xB}\mathcal{X}=\{x:\|x\|\leq B\}, with probability at least 1δ1-\delta for any δ(0,1)\delta\in(0,1), we have

W_{1}(P_{X},P_{G(Z)})\leq\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})+2\widehat{\mathfrak{R}}_{n}(\mathcal{F})+3B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}, (8)

where ^n()=𝔼ϵ{supfn1i=1nϵif(xi)}\widehat{\mathfrak{R}}_{n}(\mathcal{F})=\mathbb{E}_{\epsilon}\left\{\sup_{f\in\mathcal{F}}n^{-1}\sum_{i=1}^{n}\epsilon_{i}f(x_{i})\right\} is the empirical Rademacher complexity of the 1-Lipschitz function set \mathcal{F}, in which ϵi\epsilon_{i} is the Rademacher variable.

For a fixed generator GG, Theorem 3 holds uniformly for any discriminator ff\in\mathcal{F}. It indicates that the 1-Wasserstein distance between PXP_{X} and PG(Z)P_{G(Z)} is upper bounded, up to a concentration term, by the empirical objective W¯^1(PX,PG(Z))\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)}) and the Rademacher complexity of \mathcal{F}. Since W¯^1(PX,PG(Z))W^1(PX,PG(Q(X)))+W^1(PG(Q(X)),PG(Z))\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})\leq\widehat{W}_{1}(P_{X},P_{G(Q(X))})+\widehat{W}_{1}(P_{G(Q(X))},P_{G(Z)}) for any Q𝒬Q\in{\cal Q}, the capacity of 𝒬\mathcal{Q} determines the value of W¯^1(PX,PG(Z))\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)}). In learning theory, Rademacher complexity, named after Hans Rademacher, measures the richness of a class of real-valued functions with respect to a probability distribution. There are several existing results on the empirical Rademacher complexity of neural networks. For example, when \mathcal{F} is a set of 1-Lipschitz neural networks, we can apply the conclusion from Bartlett et al. (2017) to ^n()\widehat{\mathfrak{R}}_{n}(\mathcal{F}), which produces an upper bound scaling as 𝒪(BL3/n)\mathcal{O}(B\sqrt{L^{3}/n}). Here LL denotes the depth of the network ff\in\mathcal{F}. A similar upper bound of order 𝒪(BLd2/n)\mathcal{O}(B\sqrt{Ld^{2}/n}) can be obtained by utilizing the results from Li et al. (2019), where dd is the width of the network.
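The empirical Rademacher complexity itself can be approximated by Monte Carlo over the random signs once the supremum can be evaluated. The sketch below does this for the linear subclass {x -> <w, x> : ||w||_2 <= 1}, whose members are 1-Lipschitz and for which the supremum has a closed form; since this subclass is contained in the full 1-Lipschitz class, the estimate is only a lower bound on the complexity appearing in Theorem 3. The data matrix and the number of draws are placeholders.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of the linear
# 1-Lipschitz subclass: sup_{||w||<=1} (1/n) sum_i eps_i <w, x_i>
#                       = || (1/n) sum_i eps_i x_i ||_2   (Cauchy-Schwarz).
import numpy as np

def rademacher_linear(x, n_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))       # Rademacher signs
    return np.mean(np.linalg.norm(eps @ x / n, axis=1))    # E_eps ||(1/n) sum eps_i x_i||

x = np.random.default_rng(1).normal(size=(500, 64))        # toy "data" matrix
print(rademacher_linear(x))                                 # decays roughly like 1/sqrt(n)
```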

Algorithm 1 The training algorithm of iWGAN
1:The regularization coefficients λ1\lambda_{1} and λ2\lambda_{2}, tolerance for duality gap ϵ1\epsilon_{1}, tolerance for loss ϵ2\epsilon_{2}, and running steps nn
2:Initialization (G0,Q0,f0)(G^{0},Q^{0},f^{0})
3:while DualGap(Gi,Qi,fi)>ϵ1\mbox{DualGap}(G^{i},Q^{i},f^{i})>\epsilon_{1} or L(Gi,Qi,fi)>ϵ2L(G^{i},Q^{i},f^{i})>\epsilon_{2} do
4:    for t=1t=1, …, nn do
5:         Sample real data {xki}k=1nPX\{x^{i}_{k}\}_{k=1}^{n}\sim{P}_{X}, latent variable {zki}k=1nPZ\{z^{i}_{k}\}_{k=1}^{n}\sim{P}_{Z} and {ϵk}k=1nU[0,1]\{\epsilon_{k}\}_{k=1}^{n}\sim U[0,1]
6:         Set x^kiϵkxki+(1ϵk)Gi(zki)\hat{x}^{i}_{k}\leftarrow\epsilon_{k}x^{i}_{k}+(1-\epsilon_{k})G^{i}(z^{i}_{k}), k=1,,nk=1,...,n for the calculation of the gradient penalty
7:         Calculate: Li=L(Gi,Qi,fi)L^{i}=L(G^{i},Q^{i},f^{i}), J1(fi)=(x^ifi(x^i)21)2J_{1}(f^{i})=(\|\nabla_{\hat{x}^{i}}f^{i}(\hat{x}^{i})\|_{2}-1)^{2}, and
fLi\displaystyle-\nabla_{f}L^{i} =f[1nk=1n(fi(Gi(zki))fi(Gi(Qi(xki)))+λ1J1(fi))]\displaystyle=\nabla_{f}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}f^{i}(G^{i}(z_{k}^{i}))-f^{i}(G^{i}(Q^{i}(x_{k}^{i})))+\lambda_{1}J_{1}(f^{i})\Big{)}\Big{]}
8:         Update ff by Adam: fi+1fi+Adam(fLi)f^{i+1}\leftarrow f^{i}+Adam(-\nabla_{f}L^{i})
9:    end for
10:    for t=1t=1, …, nn do
11:         Sample real data {xki}k=1nPX\{x^{i}_{k}\}_{k=1}^{n}\sim{P}_{X}, latent variable {zki}k=1nPZ\{z^{i}_{k}\}_{k=1}^{n}\sim{P}_{Z}
12:         Calculate: Li=L(Gi,Qi,fi+1)L^{\prime i}=L(G^{i},Q^{i},f^{i+1}), J2(Qi)J_{2}(Q^{i}), and
G,QLi\displaystyle\nabla_{G,Q}L^{\prime i} =G,Q[1nk=1n(xkiGi(Qi(xki))+fi+1(Gi(Qi(xki)))fi+1(Gi(zki))+λ2J2(Qi))]\displaystyle=\nabla_{G,Q}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}\|x_{k}^{i}-G^{i}(Q^{i}(x_{k}^{i}))\|+f^{i+1}(G^{i}(Q^{i}(x_{k}^{i})))-f^{i+1}(G^{i}(z_{k}^{i}))+\lambda_{2}J_{2}(Q^{i})\Big{)}\Big{]}
13:         Update GG, QQ by Adam: (Gi+1,Qi+1)(Gi,Qi)+Adam(G,QLi)(G^{i+1},Q^{i+1})\leftarrow(G^{i},Q^{i})+Adam(\nabla_{G,Q}L^{\prime i})
14:    end for
15:    DualGap(Gi+1,Qi+1,fi+1)=L(Gi,Qi,fi+1)L(Gi+1,Qi+1,fi+1)(G^{i+1},Q^{i+1},f^{i+1})=L(G^{i},Q^{i},f^{i+1})-L(G^{i+1},Q^{i+1},f^{i+1})
16:    ii+1i\leftarrow i+1
17:end while

Next, we introduce the details of the algorithm. Our target is to solve the following optimization problem:

minG𝒢,Q𝒬maxf[𝔼^obsxG(Q(x))+𝔼^obs{f(G(Q(x)))}𝔼^z{f(G(z))}λ1J1(f)+λ2J2(Q)]\min\limits_{G\in{\cal G},Q\in{\cal Q}}\max\limits_{f\in{\cal F}}\Big{[}\hat{\mathbb{E}}_{obs}\|x-G(Q(x))\|+\hat{\mathbb{E}}_{obs}\big{\{}f(G(Q(x)))\big{\}}-\hat{\mathbb{E}}_{z}\big{\{}f(G(z))\big{\}}-\lambda_{1}J_{1}(f)+\lambda_{2}J_{2}(Q)\Big{]}, (9)

where J1(f)J_{1}(f) and J2(Q)J_{2}(Q) are regularization terms for ff and QQ respectively. We approximate G,Q,fG,Q,f by three neural networks with pre-specified architectures.

Since ff is assumed to be 1-Lipschitz, we adopt the gradient penalty defined as J1(f)=𝔼x^{(x^f(x^)21)2}J_{1}(f)=\mathbb{E}_{\hat{x}}\big{\{}(\|\nabla_{\hat{x}}f(\hat{x})\|_{2}-1)^{2}\big{\}} of Gulrajani et al. (2017) to enforce the 1-Lipschitz constraint on ff\in{\cal F}. Furthermore, since we expect Q(X)Q(X) to approximately follow a standard normal distribution, we use the maximum mean discrepancy (MMD) penalty (Gretton et al., 2012), denoted by J2(Q)=MMDk(PQ(X),PZ)J_{2}(Q)=\mbox{MMD}_{k}(P_{Q(X)},P_{Z}), to encourage the distribution of Q(X)Q(X) to converge to PZP_{Z}. In particular,

J2(Q)=1n(n1)ljk(zli,zji)+1n(n1)ljk(Q(xli),Q(xji))2n2l,jk(zli,Q(xji)),J_{2}(Q)=\dfrac{1}{n(n-1)}\sum_{l\neq j}k(z^{i}_{l},z^{i}_{j})+\dfrac{1}{n(n-1)}\sum_{l\neq j}k(Q(x^{i}_{l}),Q(x^{i}_{j}))-\dfrac{2}{n^{2}}\sum_{l,j}k(z^{i}_{l},Q(x^{i}_{j})),

where kk is set to be the Gaussian radial kernel function k(x,y)=exp(xy22)k(x,y)=\exp(\frac{-\|x-y\|^{2}}{2}).
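A minimal PyTorch-style sketch of this empirical MMD penalty, following the formula above with the Gaussian kernel; the tensor names are placeholders for a batch of prior draws and a batch of encodings.

```python
import torch

def mmd_penalty(z_prior, z_enc):
    """Empirical MMD estimate between prior draws z ~ N(0, I) and encodings Q(x),
    using the Gaussian kernel k(a, b) = exp(-||a - b||^2 / 2)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-d2 / 2.0)

    n = z_prior.size(0)
    k_zz = kernel(z_prior, z_prior)
    k_qq = kernel(z_enc, z_enc)
    k_zq = kernel(z_prior, z_enc)
    # Off-diagonal averages for the two within-sample terms (l != j in the formula).
    off_zz = (k_zz.sum() - k_zz.diagonal().sum()) / (n * (n - 1))
    off_qq = (k_qq.sum() - k_qq.diagonal().sum()) / (n * (n - 1))
    return off_zz + off_qq - 2.0 * k_zq.mean()
```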

We adopt the stochastic gradient descent algorithm ADAM (Kingma and Ba, 2015) to estimate the unknown parameters in the neural networks. ADAM is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. Given the current tuple (Gi,Qi,fi)(G^{i},Q^{i},f^{i}) at the iith iteration, we sample a batch of observations {xki}k=1nPX\{x^{i}_{k}\}_{k=1}^{n}\sim{P}_{X}, latent variables {zki}k=1nPZ\{z^{i}_{k}\}_{k=1}^{n}\sim{P}_{Z}, and {ϵk}k=1nU[0,1]\{\epsilon_{k}\}_{k=1}^{n}\sim U[0,1]. Then we construct x^kiϵkxki+(1ϵk)Gi(zki)\hat{x}^{i}_{k}\leftarrow\epsilon_{k}x^{i}_{k}+(1-\epsilon_{k})G^{i}(z^{i}_{k}), k=1,,nk=1,\ldots,n, for computing the gradient penalty. Let Li=L(Gi,Qi,fi)L^{i}=L(G^{i},Q^{i},f^{i}) and J1(fi)=(x^ifi(x^i)21)2J_{1}(f^{i})=(\|\nabla_{\hat{x}^{i}}f^{i}(\hat{x}^{i})\|_{2}-1)^{2}. We can evaluate the gradient with respect to the parameters in ff, which is denoted by

fLi=f[1nk=1n(fi(Gi(zki))fi(Gi(Qi(xki)))+λ1J1(fi))].\displaystyle-\nabla_{f}L^{i}=\nabla_{f}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}f^{i}(G^{i}(z_{k}^{i}))-f^{i}(G^{i}(Q^{i}(x_{k}^{i})))+\lambda_{1}J_{1}(f^{i})\Big{)}\Big{]}.

Then we can update fif^{i} by the ADAM using this gradient. Similarly, we can evaluate the gradient with respect to the parameters in GG and QQ, which is denoted by

G,QLi=G,Q[1nk=1n(xkiGi(Qi(xki))+fi+1(Gi(Qi(xki)))fi+1(Gi(zki))+λ2J2(Qi))].\nabla_{G,Q}L^{i}=\nabla_{G,Q}\Big{[}\dfrac{1}{n}\sum_{k=1}^{n}\Big{(}\|x_{k}^{i}-G^{i}(Q^{i}(x_{k}^{i}))\|+f^{i+1}(G^{i}(Q^{i}(x_{k}^{i})))-f^{i+1}(G^{i}(z_{k}^{i}))+\lambda_{2}J_{2}(Q^{i})\Big{)}\Big{]}.

Then we can update (Gi,Qi)(G^{i},Q^{i}) by the ADAM using this gradient. The stopping criterion is that both the DualGap(Gi,Qi,fi)(G^{i},Q^{i},f^{i}) in (4) and the objective function L(Gi,Qi,fi)L(G^{i},Q^{i},f^{i}) are less than pre-specified error tolerances ϵ1\epsilon_{1} and ϵ2\epsilon_{2}, respectively. Specifically, based on the definition of the duality gap in (4), we approximate DualGap(Gi,Qi,fi)(G^{i},Q^{i},f^{i}) by the difference between L(Gi,Qi,fi+1)L(G^{i},Q^{i},f^{i+1}) and L(Gi+1,Qi+1,fi+1)L(G^{i+1},Q^{i+1},f^{i+1}), as computed in Algorithm 1. The optimization (9) involves two tuning parameters λ1\lambda_{1} and λ2\lambda_{2}. We pre-specify candidate values for λ1\lambda_{1} and λ2\lambda_{2} and select the optimal tuning parameters by grid search using cross validation. The details of the algorithm are presented in Algorithm 1.
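To make the alternating updates concrete, the following is a compact PyTorch-style sketch of Algorithm 1 on a toy two-dimensional dataset. The architectures, the toy data source, and the hyperparameter values are illustrative placeholders rather than the settings used in our experiments; the MMD term uses a simplified biased estimator, and the duality-gap stopping rule is omitted for brevity.

```python
import math
import torch
from torch import nn, optim

latent_dim, data_dim, batch = 8, 2, 256
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
Q = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
f = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_f = optim.Adam(f.parameters(), lr=1e-4)
opt_gq = optim.Adam(list(G.parameters()) + list(Q.parameters()), lr=1e-4)
lam1, lam2 = 10.0, 1.0

def sample_data(n):
    # Placeholder data source: a ring of 8 Gaussian modes (cf. Section 5.1).
    ang = 2 * math.pi * torch.randint(0, 8, (n, 1)).float() / 8
    return torch.cat([2 * ang.cos(), 2 * ang.sin()], dim=1) + 0.02 * torch.randn(n, 2)

def loss_L(x, z):
    # Empirical L(G, Q, f) of equation (3).
    x_rec = G(Q(x))
    return (x - x_rec).norm(dim=1).mean() + f(x_rec).mean() - f(G(z)).mean()

for step in range(5000):
    # Critic (dual) step: maximize L, i.e. minimize -L plus the gradient penalty.
    x, z = sample_data(batch), torch.randn(batch, latent_dim)
    eps = torch.rand(batch, 1)
    x_hat = (eps * x + (1 - eps) * G(z)).detach().requires_grad_(True)
    grad = torch.autograd.grad(f(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grad.norm(dim=1) - 1) ** 2).mean()
    f_loss = -loss_L(x, z) + lam1 * gp
    opt_f.zero_grad()
    f_loss.backward()
    opt_f.step()

    # Generator/encoder (primal) step: minimize L plus the MMD penalty on Q(x).
    x, z = sample_data(batch), torch.randn(batch, latent_dim)
    zq = Q(x)
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / 2)
    mmd = k(z, z).mean() + k(zq, zq).mean() - 2 * k(z, zq).mean()   # biased variant
    gq_loss = loss_L(x, z) + lam2 * mmd
    opt_gq.zero_grad()
    gq_loss.backward()
    opt_gq.step()
```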

4 Probabilistic Interpretation and the MLE

The iWGAN provides an efficient framework to stably and automatically estimate both the encoder and the generator. In this section, we provide a probabilistic interpretation of the iWGAN under the framework of maximum likelihood estimation.

Maximum likelihood estimation (MLE) is a fundamental statistical framework for learning models from data. However, for complex models, MLE can be computationally prohibitive due to the intractable normalization constant. MCMC has been used to approximate the intractable likelihood function but does not work efficiently in practice, since running MCMC till convergence to obtain a sample can be computationally expensive. For example, to reduce the computational complexity, Hinton (2002) proposed a simple and fast algorithm, called the contrastive divergence (CD). The basic idea of CD is to truncate MCMC at the kk-th step, where kk is a fixed integer as small as one. The simplicity and computational efficiency of CD make it widely used in many popular energy-based models. However, the success of CD also raised a lot of questions regarding its convergence properties. Both theoretical and empirical results show that CD in general does not converge to a local minimum of the likelihood function (Carreira-Perpinan and Hinton, 2005; Qiu et al., 2020), and diverges even in some simple models (Schulz et al., 2010; Fischer and Igel, 2010). The iWGAN can be treated as an adaptive method for the MLE training, which not only provides computational advantages but also allows us to generate more realistic-looking images. Furthermore, this probabilistic interpretation enables other novel applications such as image quality checking and outlier detection.

Let XX denote the image. Define the density of XX by an energy-based model based on an autoencoder (Gu and Zhu, 2001; Zhao et al., 2017; Berthelot et al., 2017):

p(x|θ)=exp(xGθ(Qθ(x))V(θ)),p(x|\theta)=\exp\big{(}-\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}-V(\theta)\big{)}, (10)

where

V(θ)=logexp(xGθ(Qθ(x)))𝑑x,V(\theta)=\log\int\exp(-\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|})dx,

and θΘ\theta\in\Theta is the unknown parameter and V(θ)V(\theta) is the log normalization constant. The major difficulty for the likelihood inference is due to the intractable function V(θ)V(\theta). Suppose that we have the observed data {xi:i=1,,n}\{x_{i}:i=1,\ldots,n\}. The log-likelihood function of θΘ\theta\in\Theta is (θ)=n1i=1nlogp(xi|θ)\ell(\theta)=n^{-1}\sum_{i=1}^{n}\log~{}p(x_{i}|\theta), whose gradient is

θ(θ)=𝔼^obs{θxGθ(Qθ(x))}+𝔼θ{θxGθ(Qθ(x))},\nabla_{\theta}\ell(\theta)=-\hat{\mathbb{E}}_{obs}\big{\{}\partial_{\theta}\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}\big{\}}+\mathbb{E}_{\theta}\big{\{}\partial_{\theta}\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}\big{\}}, (11)

where 𝔼^obs[]\hat{\mathbb{E}}_{obs}[\cdot] denotes the empirical average on the observed data {xi}\{x_{i}\} and 𝔼θ[]\mathbb{E}_{\theta}[\cdot] denotes the expectation under model p(x|θ)p(x|\theta). The key computational obstacle lies in the approximations of the model expectation 𝔼θ[]\mathbb{E}_{\theta}[\cdot].

To address this problem, we can rewrite the log-likelihood function by introducing a variational distribution q(x)q(x). This leads to

𝔼^obslogp(x|θ)\displaystyle\hat{\mathbb{E}}_{obs}\log p(x|\theta) =𝔼^obsxGθ(Qθ(x))V(θ)\displaystyle=-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|-V(\theta)
=𝔼^obsxGθ(Qθ(x))logq(x)exGθ(Qθ(x))q(x)𝑑x\displaystyle=-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|-\log\int q(x){e^{-\|x-G_{\theta}(Q_{\theta}(x))\|}\over q(x)}dx
𝔼^obsxGθ(Qθ(x))q(x)logexGθ(Qθ(x))q(x)dx\displaystyle\leq-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|-\int q(x)\log{e^{-\|x-G_{\theta}(Q_{\theta}(x))\|}\over q(x)}dx
=𝔼^obsxGθ(Qθ(x))+𝔼q(x)xGθ(Qθ(x))H(q),\displaystyle=-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|+\mathbb{E}_{q(x)}\|x-G_{\theta}(Q_{\theta}(x))\|-H(q), (12)

where H(q)=qlogqH(q)=-\int q\log q denotes the entropy of q(x)q(x) and the inequality is due to Jensen’s inequality. Equation (12) provides an upper bound for the log-likelihood function. We expect to choose q(x)q(x) so that (12) is close to the log-likelihood function, and then maximize (12) as a surrogate for maximizing the log-likelihood. We choose q(x)=p(x|θ~)q(x)=p(x|\tilde{\theta}) and define the surrogate log-likelihood as

(θ;θ~)=𝔼^obsxGθ(Qθ(x))+𝔼θ~xGθ(Qθ(x))H(p(x|θ~)).{\cal L}(\theta;\tilde{\theta})=-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|+\mathbb{E}_{\tilde{\theta}}\|x-G_{\theta}(Q_{\theta}(x))\|-H(p(x|\tilde{\theta})). (13)
Theorem 4.

(a). For any θ,θ~Θ\theta,\tilde{\theta}\in\Theta, we have (θ)(θ;θ~)\ell(\theta)\leq{\cal L}(\theta;\tilde{\theta}). In addition, (θ)=(θ;θ)\ell(\theta)={\cal L}(\theta;\theta).
(b). Consider the following algorithm, where the (t+1)(t+1)th iterate is obtained by θ(t+1)=argmaxθΘ(θ;θ(t))\theta^{(t+1)}=\arg\max_{\theta\in\Theta}{\cal L}(\theta;\theta^{(t)}), for t=0,1,t=0,1,\cdots. If θ(t)θ^\theta^{(t)}\rightarrow\hat{\theta} as tt\rightarrow\infty, then θ^\hat{\theta} is the MLE.

Theorem 4 shows that, if we maximize the surrogate log-likelihood function and the algorithm converges, the solution is exactly the same as the MLE. The additional identity (θ)=(θ;θ)\ell(\theta)={\cal L}(\theta;\theta) is the key to our algorithm to obtain the MLE, which is different from the ELBO in VAEs. The ELBO is in general not a tight lower bound of the log-likelihood function.
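The identity \ell(\theta)={\cal L}(\theta;\theta) can be verified directly from the definitions (a short calculation included for completeness). Plugging q(x)=p(x|\theta) into (13) and using the form of the density (10),

H(p(\cdot|\theta))=-\mathbb{E}_{\theta}\log p(x|\theta)=\mathbb{E}_{\theta}\big\{\|x-G_{\theta}(Q_{\theta}(x))\|\big\}+V(\theta),

so that

{\cal L}(\theta;\theta)=-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|+\mathbb{E}_{\theta}\|x-G_{\theta}(Q_{\theta}(x))\|-\mathbb{E}_{\theta}\|x-G_{\theta}(Q_{\theta}(x))\|-V(\theta)=\ell(\theta).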

In terms of training, by Theorem 5.10 of Villani (2008), for any random variables XX and YY, there exists an optimal ff^{*} such that

π(f(Y)f(X)=YX)=1\mathbb{P}_{\pi}\big{(}f^{*}(Y)-f^{*}(X)=\|Y-X\|\big{)}=1 (14)

for the optimal coupling π\pi, which is the joint distribution of XX and YY. Therefore, there exists an ff^{*} such that

f(X)f(Gθ(Qθ(X)))=XGθ(Qθ(X))f^{*}(X)-f^{*}(G_{\theta}(Q_{\theta}(X)))=\|X-G_{\theta}(Q_{\theta}(X))\| (15)

with probability one. Because ff^{*} needs to be learned as well, we approximate ff^{*} by a neural network fηf_{\eta} with an unknown parameter η\eta. This amounts to using the following max-min objective

maxθminη𝔼^obsxGθ(Qθ(x))+𝔼θ(t)fη(x)𝔼θ(t)fη(Gθ(Qθ(x))).\max_{\theta}\min_{\eta}-\hat{\mathbb{E}}_{obs}\|x-G_{\theta}(Q_{\theta}(x))\|+\mathbb{E}_{\theta^{(t)}}f_{\eta}(x)-\mathbb{E}_{\theta^{(t)}}f_{\eta}(G_{\theta}(Q_{\theta}(x))). (16)

Note that, for the gradient update, the expectation in (16) is taken under the current estimated θ(t)\theta^{(t)}. Since we require GθG_{\theta} to be a good generator and the distribution of Gθ(z)G_{\theta}(z) is close to the distribution p(x|θ(t))p(x|\theta^{(t)}), we replace 𝔼θ(t)fη(x)\mathbb{E}_{\theta^{(t)}}f_{\eta}(x) by 𝔼zfη(Gθ(z))\mathbb{E}_{z}f_{\eta}(G_{\theta}(z)). Since an additional regularization is added to enforce Qθ(X)Q_{\theta}(X) to follow a normal distribution, we use the expectation under the data distribution to replace the second expectation of (16). This yields a gradient update for θ\theta of the form θθ+ϵ^θ(θ)\theta\leftarrow\theta+\epsilon\hat{\nabla}_{\theta}\ell(\theta), where

^θ(θ)=𝔼^obs{θxGθ(Qθ(x))}+{𝔼zθfη(Gθ(z))𝔼xθfη(Gθ(Qθ(x))}.\hat{\nabla}_{\theta}\ell(\theta)=-\hat{\mathbb{E}}_{obs}\big{\{}\partial_{\theta}\big{\|}x-G_{\theta}(Q_{\theta}(x))\big{\|}\big{\}}+\big{\{}{\mathbb{E}}_{z}\partial_{\theta}f_{\eta}(G_{\theta}(z))-\mathbb{E}_{x}\partial_{\theta}f_{\eta}(G_{\theta}(Q_{\theta}(x))\big{\}}. (17)

A gradient update for η\eta is given by

ηη+ϵ{𝔼zηfη(Gθ(z))𝔼xηfη(Gθ(Qθ(x))}.\eta\leftarrow\eta+\epsilon~{}\Big{\{}{\mathbb{E}}_{z}\partial_{\eta}f_{\eta}(G_{\theta}(z))-\mathbb{E}_{x}\partial_{\eta}f_{\eta}(G_{\theta}(Q_{\theta}(x))\Big{\}}. (18)

The above iterative updating process is exactly the same as in Algorithm 1. Therefore, training the iWGAN amounts to seeking the MLE. This probabilistic interpretation provides a novel alternative method to tackle problems with an intractable normalization constant in latent variable models. The MLE gradient update of p(x|θ)p(x|\theta) decreases the energy of the training data and increases the dual objective. Compared with original GANs or WGANs, our method converges much faster and simultaneously provides higher quality generated images.

The probabilistic modeling opens a door for many interesting applications. Next, we present a completely new approach for determining a highest density region (HDR) estimate for the distribution of XX. What makes HDR distinct from other statistical methods is that it finds the smallest region, denoted by U(α)U(\alpha), in the high dimensional space with a given probability coverage 1α1-\alpha, i.e., (XU(α))=1α\mathbb{P}(X\in U(\alpha))=1-\alpha. We can use U(α)U(\alpha) to assess the quality of each individual sample. Note that the commonly used inception scores (IS) and Fréchet inception distances (FID) measure the quality of the whole sample, not the quality of individual samples. More details on IS and FID are given in Appendix G. Let θ^\hat{\theta} be the MLE. The density ratio at x1x_{1} and x2x_{2} is

p(x1|θ^)p(x2|θ^)=exp{(x1Gθ^(Qθ^(x1))x2Gθ^(Qθ^(x2)))}.\frac{p(x_{1}|\hat{\theta})}{p(x_{2}|\hat{\theta})}=\exp\big{\{}-(\|x_{1}-G_{\hat{\theta}}(Q_{\hat{\theta}}(x_{1}))\|-\|x_{2}-G_{\hat{\theta}}(Q_{\hat{\theta}}(x_{2}))\|)\big{\}}.

The smaller the reconstruction error is, the larger the density value is. We can define the HDR for xx through the HDR for the reconstruction error ex:=xGθ^(Qθ^(x))e_{x}:=\|x-G_{\hat{\theta}}(Q_{\hat{\theta}}(x))\|, which is simple because it is a one-dimensional problem. Let U~(α)\tilde{U}(\alpha) be the HDR for exe_{x}. Then, U(α)={x:exU~(α)}U(\alpha)=\{x:e_{x}\in\tilde{U}(\alpha)\}. Here Qθ^(U(α))Q_{\hat{\theta}}(U(\alpha)) defines the corresponding region in the latent space, which can be used to generate better quality samples.
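Given a fitted pair (Gθ^,Qθ^)(G_{\hat{\theta}},Q_{\hat{\theta}}), the HDR construction above reduces to thresholding the one-dimensional reconstruction error. The following is a minimal numpy-style sketch under the assumption that the reconstruction errors have already been computed for a reference set of samples; the array of errors is a placeholder.

```python
import numpy as np

def hdr_threshold(errors, alpha=0.05):
    """Since the model density (10) is decreasing in the reconstruction error,
    the HDR with coverage 1 - alpha is {x : e_x <= (1 - alpha) quantile of e_x}."""
    return np.quantile(errors, 1.0 - alpha)

def quality_score(errors):
    """Per-sample quality score exp(-||x - G(Q(x))||), as used in Section 5."""
    return np.exp(-np.asarray(errors))

# errors_ref: reconstruction errors ||x - G(Q(x))|| on a held-out reference set.
errors_ref = np.abs(np.random.default_rng(0).normal(0.3, 0.1, size=1000))  # placeholder
tau = hdr_threshold(errors_ref, alpha=0.05)
in_hdr = errors_ref <= tau        # membership in U(alpha) for each reference sample
```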

5 Experimental Results

The goal of our numerical experiments is to demonstrate that the iWGAN can achieve the following three objectives simultaneously: high-quality generated samples, meaningful latent codes, and small reconstruction errors. We also compare the iWGAN with other well-known GAN models such as the Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018), the Adversarially Learned Inference (ALI) (Dumoulin et al., 2017), and the CycleGAN (Zhu et al., 2017) to illustrate its competitive and stable performance on benchmark datasets.

5.1 Mixture of Gaussians

We first train our iWGAN model on three datasets from the mixture of Gaussians of increasing difficulty, shown in Figure 1: a). RING: a mixture of 8 Gaussians with means {(2cos2πi8,2sin2πi8)|i=0,7}\{(2\cdot\cos{\dfrac{2\pi i}{8}},2\cdot\sin{\dfrac{2\pi i}{8}})|i=0,\dots 7\} and standard deviation 0.020.02; b). SPIRAL: a mixture of 20 Gaussians with means {(0.1+0.12π20cos2πi20,0.1+0.12π20sin2πi20)|i=0,,19}\{(0.1+0.1\cdot\dfrac{2\pi}{20}\cdot\cos{\dfrac{2\pi i}{20}},0.1+0.1\cdot\dfrac{2\pi}{20}\cdot\sin{\dfrac{2\pi i}{20}})|i=0,\dots,19\} and standard deviation 0.020.02; and c). GRID: a mixture of 25 Gaussians with means {(2i,2j)|i=2,1,,2,j=2,1,,2}\{(2\cdot i,2\cdot j)|i=-2,-1,\dots,2,j=-2,-1,\dots,2\} and standard deviation 0.020.02. As the true data distributions are known, this setting allows for tracking of convergence and mode dropping.
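For reference, a minimal numpy-style sketch of how two of these mixtures can be sampled (the RING and GRID cases; the means and standard deviations follow the description above, while the sample size is an arbitrary placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ring(n, modes=8, radius=2.0, std=0.02):
    """Mixture of `modes` Gaussians placed uniformly on a circle of given radius."""
    k = rng.integers(0, modes, size=n)
    centers = radius * np.stack([np.cos(2 * np.pi * k / modes),
                                 np.sin(2 * np.pi * k / modes)], axis=1)
    return centers + std * rng.normal(size=(n, 2))

def sample_grid(n, side=5, spacing=2.0, std=0.02):
    """Mixture of side*side Gaussians on a regular grid centered at the origin."""
    idx = rng.integers(0, side, size=(n, 2)) - side // 2    # grid indices -2..2
    return spacing * idx + std * rng.normal(size=(n, 2))

x_ring, x_grid = sample_ring(10000), sample_grid(10000)
```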

Figure 1: Samples of the mixtures of Gaussians: (a) RING, (b) Swiss Roll, (c) GRID.

Duality gap and convergence. We illustrate that, as the duality gap converges to 0, our model converges to generating samples from the true distribution. We keep track of the generated samples using G(z)G(z) and record the duality gap at each iteration to check the corresponding generated samples. We compare our method with the WGAN-GP and CycleGAN in Figure 2. All methods adopt the same structure, learning rate, number of critic steps, and other hyper-parameters.

Figure 2: Duality gap and generated samples from the iWGAN, WGAN-GP, and CycleGAN on mixtures of Gaussians at 1000, 2000, …, 5000 epochs. First row: the duality gaps of the iWGAN in the 3 experiments, which converge to 0 very fast. Second row: generated samples of the iWGAN, which has successfully generated samples following all target distributions. Third row: generated samples of the WGAN-GP, which has successfully generated samples for Ring and Swiss Roll but has failed for Grid. Fourth row: generated samples of the CycleGAN, which has failed on all 3 distributions and experienced the mode collapse problem.

Figure 2 shows that the iWGAN converges quickly in terms of both the duality gap and learning the true distributions. The duality gap is also a good indicator of whether the model has generated the desired distribution. Compared with the WGAN-GP, the iWGAN surpasses its performance at a very early stage and avoids mode collapse. We have further tested the CycleGAN on these distributions. The CycleGAN objective function is the sum of two parts. The first part includes two vanilla GAN objectives (Goodfellow et al., 2014) to differentiate between XX and G(Z)G(Z), and ZZ and Q(X)Q(X). The second part is the cycle consistency loss given by 𝔼ZZQ(G(Z))1+𝔼XXG(Q(X))1\mathbb{E}_{Z}\big{\|}Z-Q(G(Z))\|_{1}+\mathbb{E}_{X}\big{\|}X-G(Q(X))\|_{1}, where 1\|\cdot\|_{1} is the L1L_{1}-norm of a vector. Unfortunately, Figure 2 shows that the CycleGAN fails on all three distributions and experiences the mode collapse problem.

Figure 3: Latent spaces of the mixtures of Gaussians for (a) RING, (b) Swiss Roll, and (c) GRID, i.e. Q(x)iQ(x)_{i} against Q(x)jQ(x)_{j} for iji\neq j. The joint distribution of any two dimensions of Q(X)Q(X) is close to a bivariate normal distribution.

Latent space. We choose the latent distribution to be a 5-dimensional standard multivariate normal distribution ZN(0,I5)Z\sim N(0,I_{5}). During training, the batch size is chosen to be 512. After training, the distribution of Q(X)Q(X) is expected to be close to the distribution of ZZ. To demonstrate the latent distribution visually, we plot the iith component of Q(X)Q(X), Q(X)iQ(X)_{i}, against the jjth component of Q(X)Q(X), Q(X)jQ(X)_{j}, for all iji\neq j in Figure 3. The joint distribution of any two dimensions of Q(X)Q(X) is close to a bivariate normal distribution.

Figure 4: Interpolations: \blacktriangledown and \blacktriangle indicate the first and last samples in the interpolations; the other colored samples are the interpolations. The first sample is generated by G(z1)G(z_{1}) and the last sample is generated by G(z2)G(z_{2}), and the other colored dots are generated by G(λz1+(1λ)z2)G(\lambda z_{1}+(1-\lambda)z_{2}), where λ(0,1)\lambda\in(0,1). Almost all interpolated samples fall around one of the modes and successfully avoid the gaps between modes.

Mode collapse. We investigate the mode collapse problem for the iWGAN. If we draw two random samples in the latent space z1,z2N(0,I5)z_{1},z_{2}\sim N({0},{I_{5}}), the interpolation, G(λz1+(1λ)z2)G(\lambda z_{1}+(1-\lambda)z_{2}), 0λ10\leq\lambda\leq 1, should fall around one of the modes to represent a reasonable sample. In Figure 4, we select λ{0,0.05,0.10,,0.95,1.0}\lambda\in\{0,0.05,0.10,\dots,0.95,1.0\}, and interpolate between two random samples. We repeat this procedure several times on the 3 datasets as demonstrated in Figure 4. No matter where the interpolations start and end, they fall around the modes rather than in the regions where the true distribution has low density. There may still be some samples that appear in the middle of two modes, possibly because the generator GG is not able to approximate a step function well.

Figure 5: Quality check with the heatmap of quality scores over the space of XX.

Individual sample quality check. From the probability interpretation of the iWGAN, we naturally adopt the reconstruction error XG(Q(X))\|X-G(Q(X))\|, or the quality score

Quality Score=exp(XG(Q(X)))\mbox{Quality Score}=\exp{(-\|X-G(Q(X))\|)}

as the metric of the quality of any individual sample. The larger the quality score is, the better quality the sample has. Figure 5 shows the quality scores for different samples. The quality scores of samples near the modes of the true distribution are close to 1, and become smaller as the sample moves away from the modes. This indicates that the iWGAN converges and learns the distribution well, and the quality score is a reliable metric for individual sample quality.

5.2 CelebA

We experimentally demonstrate our model’s ability on two well-known benchmark datasets, MNIST and CelebA. We present the performance of the iWGAN on CelebA in this section and the performance on MNIST in the Appendix. CelebA (CelebFaces Attributes Dataset) is a large-scale face attributes dataset with 202,599202,599 64×6464\times 64 colored celebrity face images, which cover large pose variations and diverse people. This dataset is ideal for training models to generate synthetic images. The MNIST database (Modified National Institute of Standards and Technology database) is another large database of handwritten digits 090\sim 9 that is commonly used for training various image processing systems. The MNIST database contains 70,00070,000 28×2828\times 28 grey images. CelebA is a more complex dataset than MNIST. The CelebA dataset is available at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html and the MNIST dataset is available at http://yann.lecun.com/exdb/mnist/.

Figure 6: The iWGAN on CelebA. (a) Samples generated by G(z)G(z), where zN(0,I64)z\sim N(\textbf{0},I_{64}); (b) the first and third columns are the original images and the second and fourth columns are the corresponding reconstructed images generated by G(Q(x))G(Q(x)); (c) four examples of interpolation between two random samples z1z_{1} and z2z_{2} from the latent space; the interpolations are generated by G((1λ)z1+λz2)G((1-\lambda)z_{1}+\lambda z_{2}).

The results of the iWGAN on CelebA are shown in Figure 6. The dimension of the latent space is chosen to be 6464. The three panels of Figure 6 respectively show the generated samples from G(Z)G(Z), the reconstructed samples from G(Q(X))G(Q(X)), and latent space interpolations between two randomly chosen images. In particular, we perform latent space interpolations between CelebA validation set examples. We sample pairs of validation set examples x1x_{1} and x2x_{2} and project them into z1z_{1} and z2z_{2} by the encoder QQ. We then linearly interpolate between z1z_{1} and z2z_{2} and pass the intermediate points through the decoder to plot the input-space interpolations. In addition, Figure 7 shows the first 8 dimensions of the latent space calculated by Q(x)Q(x) on CelebA. Figures 6 and 7 visually demonstrate that the iWGAN can simultaneously generate high quality samples, produce small reconstruction errors, and learn meaningful latent codes. Figure 8 also displays images with high and low quality scores selected from CelebA. The images with low quality scores are quite different from the other images in the dataset; they usually contain lighter backgrounds, masks, or glasses.

Refer to caption
Figure 7: Latent space (first 8 dimensions) of CelebA, i.e. Q(x)iQ(x)_{i} against Q(x)jQ(x)_{j} for iji\neq j. The joint distribution of any two dimensions of Q(X)Q(X) is close to a bivariate normal distribution.
Refer to caption
Refer to caption
Figure 8: Images from CelebA with high (left) and low (right) quality scores by the iWGAN.
Refer to caption
(a) iWGAN
Refer to caption
(b) WGAN-GP
Refer to caption
(c) WAE
Refer to caption
(d) ALI
Refer to caption
(e) CycleGAN
Figure 9: Comparison of generated samples by different models.

We compare the iWGAN, both visually and numerically, with the WGAN-GP, WAE, ALI, and CycleGAN. Figures 9(a)–9(e) display the randomly generated samples from the iWGAN, WGAN-GP, WAE, ALI, and CycleGAN, respectively. The faces generated by the iWGAN demonstrate higher quality than those of the other four methods. The top panel of Figure 10 shows the comparison between real images and reconstructed images among four methods: the iWGAN, WAE, ALI, and CycleGAN. Note that the WGAN-GP cannot provide reconstructed images since it does not produce latent codes. The bottom panel of Figure 10 shows the interpolated images by the iWGAN, WAE, ALI, and CycleGAN.

Refer to caption
(a) Reconstructions
Refer to caption
(b) Interpolations
Figure 10: Reconstructions and interpolations by different models.

We numerically compare these five methods: the iWGAN, ALI, WAE, CycleGAN, and WGAN-GP. Four performance measures are chosen: inception scores (IS), Fréchet inception distances (FID), reconstruction errors (RE), and the maximum mean discrepancy (MMD) between encodings and standard normal random variables. The details of these comparison metrics are given in Appendix G. Proposed by Salimans et al. (2016), the IS uses a pre-trained Inception v3 model to predict the class probabilities for each generated image. Higher scores are better, corresponding to a larger KL-divergence between the predicted conditional and marginal class distributions. The FID was proposed by Heusel et al. (2017) to improve on the IS by directly comparing the statistics of generated samples to those of real samples. For the FID, the lower the better. However, as discussed in Barratt and Sharma (2018), the IS is not a reliable metric for the quality of generated samples. This is also consistent with our experiments. Although the WAE delivers the best inception scores among the five methods, it also has the worst FID scores. The generated samples (Figure 9(c)) show that the WAE is not the best generative model compared with the other four methods. Furthermore, the reconstruction error (RE) is used to measure whether the method has generated meaningful latent encodings. Smaller reconstruction errors indicate a more meaningful latent space which can be decoded into the original samples. The MMD is used to measure the difference between the distribution of latent encodings and standard normal random variables. A smaller MMD indicates that the distribution of encodings is closer to the standard normal distribution.

From Table 1, in terms of generation quality, the iWGAN and ALI are the better models, the WGAN-GP and CycleGAN come after, and the WAE struggles to generate clear images. In terms of RE and MMD, the iWGAN and WAE are the better choices, whereas the ALI and CycleGAN cannot always reconstruct a sample to itself (see Figure 10(a)). In general, Table 1 shows that the iWGAN successfully produces both meaningful encodings and a reliable generator simultaneously.

Table 1: Comparison of the iWGAN, ALI, WAE, CycleGAN, and WGAN-GP. A dash indicates that the metric is not available for that method (the true data and the WGAN-GP have no encoder).

Methods     IS             FID      RE              MMD
True        1.96 (0.019)   18.63    –               –
iWGAN       1.51 (0.017)   51.20    13.55 (2.41)    6×10^{-3}
ALI         1.50 (0.014)   51.12    34.49 (8.23)    0.39
WAE         1.71 (0.029)   77.53    9.88 (1.42)     4×10^{-3}
CycleGAN    1.41 (0.011)   61.78    31.90 (0.84)    0.30
WGAN-GP     1.54 (0.016)   61.39    –               –

6 Conclusion

We have developed a novel iWGAN model, which fuses auto-encoders and GANs in a principled way. We have established the generalization error bound for the iWGAN. We have provided a solid probabilistic interpretation of the iWGAN using the maximum likelihood principle. Our training algorithm, with an iterative primal and dual optimization, has demonstrated efficient and stable learning. We have proposed a stopping criterion for our algorithm and a metric for checking the quality of individual samples. The empirical results on both synthetic and benchmark datasets are state-of-the-art.

We now mention several future directions for research on the iWGAN. First, in this paper, we assume the conditional distribution of ZZ given XX is modeled by a point mass q(z|x)=δ(zQ(x))q(z|x)=\delta(z-Q(x)). It is interesting to extend this to a more flexible inference model. In addition, it is desirable to make the latent distribution more flexible and to consider a more general latent distribution such as the energy-based model (Gao et al., 2020). Second, we have ignored approximation errors in our analysis by assuming the unknown mappings belong to the neural network spaces. It is interesting to incorporate the approximation errors to analyze the behavior of the iWGAN divergence. Third, one might be interested in applying the iWGAN to image-to-image translation, as the extension should be straightforward. A fourth direction is to develop a formal hypothesis testing procedure to test whether the distribution of samples generated from the iWGAN is the same as the data distribution. We are also working on incorporating the iWGAN into recent GAN models such as the BigGAN (Brock et al., 2019), which can produce high-resolution and high-fidelity images. As its name suggests, the BigGAN focuses on scaling up GAN models, including more model parameters, larger batch sizes, and architectural changes. The iWGAN, in contrast, is able to stabilize training, and it is a promising idea to fuse these two frameworks together.

Appendix

A. Proof of Theorem 1

According to the Nash embedding theorem (Nash, 1956; Günther, 1991), every dd-dimensional smooth Riemannian manifold 𝒳{\cal X} possesses a smooth isometric embedding into p\mathbb{R}^{p} with p=max{d(d+5)/2,d(d+3)/2+5}p=\max\{d(d+5)/2,d(d+3)/2+5\}. Therefore, there exists an injective mapping u:𝒳pu:{\cal X}\rightarrow\mathbb{R}^{p} which preserves the metric in the sense that the manifold metric on 𝒳{\cal X} is equal to the pullback of the usual Euclidean metric on p\mathbb{R}^{p} by uu. The mapping uu is injective so that we can define the inverse mapping u1:u(𝒳)𝒳u^{-1}:u(\cal X)\rightarrow{\cal X}.

Let X~=u(X)p\tilde{X}=u(X)\in{\mathbb{R}}^{p}, and write X~=(X~1,,X~p)\tilde{X}=(\tilde{X}_{1},\ldots,\tilde{X}_{p}). Let Fi(x)=(X~ix)F_{i}(x)=\mathbb{P}(\tilde{X}_{i}\leq x), i=1,,pi=1,\ldots,p, be the marginal cdfs. By applying the probability integral transformation to each component, the random vector

(U1,U2,,Up):=(F1(X~1),F2(X~2),,Fp(X~p))\big{(}U_{1},U_{2},\ldots,U_{p}\big{)}:=\big{(}F_{1}(\tilde{X}_{1}),F_{2}(\tilde{X}_{2}),\ldots,F_{p}(\tilde{X}_{p})\big{)}

has uniformly distributed marginals. Let C:[0,1]p[0,1]C:[0,1]^{p}\rightarrow[0,1] be the copula of X~\tilde{X}, which is defined as the joint cdf of (U1,,Up)(U_{1},\ldots,U_{p}):

C(u1,u2,,up)=(U1u1,U2u2,,Upup).C(u_{1},u_{2},\ldots,u_{p})=\mathbb{P}\big{(}U_{1}\leq u_{1},U_{2}\leq u_{2},\ldots,U_{p}\leq u_{p}\big{)}.

The copula CC contains all information on the dependence structure among the components of X~\tilde{X}, while the marginal cumulative distribution functions FiF_{i} contain all information on the marginal distributions. Therefore, the joint cdf of X~\tilde{X} is

H(x~1,x~2,,x~p)=C(F1(x~1),F2(x~2),,Fp(x~p)).H(\tilde{x}_{1},\tilde{x}_{2},\ldots,\tilde{x}_{p})=C\big{(}F_{1}(\tilde{x}_{1}),F_{2}(\tilde{x}_{2}),\ldots,F_{p}(\tilde{x}_{p})\big{)}.

Denote the conditional distribution of UkU_{k}, given U1,,Uk1U_{1},\ldots,U_{k-1}, by

Ck(uk|u1,,uk1)\displaystyle C_{k}(u_{k}|u_{1},\ldots,u_{k-1}) =(Ukuk|U1=u1,,Uk1=uk1)\displaystyle=\mathbb{P}\big{(}U_{k}\leq u_{k}|U_{1}=u_{1},\ldots,U_{k-1}=u_{k-1}\big{)}

for k=2,,pk=2,\ldots,p.

We will construct QQ^{*} as follows. First, we obtain X~p\tilde{X}\in\mathbb{R}^{p} by X~=u(X)\tilde{X}=u(X). Second, we transform X~\tilde{X} into a random vector with uniformly distributed marginals (U1,,Up)(U_{1},\ldots,U_{p}) by the marginal cdf FiF_{i}. Then, define U~1=U1\tilde{U}_{1}=U_{1} and

U~k=Ck(Uk|U1,,Uk1),k=2,,p.\tilde{U}_{k}=C_{k}\big{(}U_{k}|U_{1},\ldots,U_{k-1}\big{)},~{}~{}~{}k=2,\ldots,p.

One can readily show that U~1,,U~p\tilde{U}_{1},\ldots,\tilde{U}_{p} are independent uniform random variables. This is because

(U~ku~k:k=1,,p)\displaystyle\mathbb{P}(\tilde{U}_{k}\leq\tilde{u}_{k}:k=1,\ldots,p) =C1(v1)u~1Cp(vp|v1,,vp1)u~p𝑑Cp(vp|v1,,vp1)𝑑C1(v1)\displaystyle=\int_{C_{1}\big{(}v_{1}\big{)}\leq\tilde{u}_{1}}\cdots\int_{C_{p}\big{(}v_{p}|v_{1},\ldots,v_{p-1}\big{)}\leq\tilde{u}_{p}}dC_{p}\big{(}v_{p}|v_{1},\ldots,v_{p-1}\big{)}\cdots dC_{1}\big{(}v_{1}\big{)}
=0u~10u~p𝑑zp𝑑z1=k=1pu~k.\displaystyle=\int_{0}^{\tilde{u}_{1}}\cdots\int_{0}^{\tilde{u}_{p}}dz_{p}\cdots dz_{1}=\prod_{k=1}^{p}\tilde{u}_{k}.

In fact, this transformation is the well-known Rosenblatt transform (Rosenblatt, 1952). Finally, let Zi=Φ1(U~i)Z_{i}=\Phi^{-1}(\tilde{U}_{i}) for i=1,,pi=1,\ldots,p, where Φ1\Phi^{-1} is the inverse cdf of a standard normal random variable. This completes the transformation QQ^{*} from XX to Z=(Z1,,Zp)Z=(Z_{1},\ldots,Z_{p}).

The above process can be inverted to obtain GG^{*}. First, we transform ZZ into independent uniform random variables by U~i=Φ(Zi)\tilde{U}_{i}=\Phi(Z_{i}) for i=1,,pi=1,\ldots,p. Next, let U1=U~1U_{1}=\tilde{U}_{1}. Define

Uk=Ck1(U~k|U~1,,U~k1),k=2,,p,U_{k}=C_{k}^{-1}(\tilde{U}_{k}|\tilde{U}_{1},\ldots,\tilde{U}_{k-1}),~{}~{}~{}k=2,\ldots,p,

where Ck1(|u1,,uk1)C_{k}^{-1}(\cdot|u_{1},\ldots,u_{k-1}) is the inverse of CkC_{k} and can be obtained by numerical root finding. Finally, let X~i=Fi1(Ui)\tilde{X}_{i}=F_{i}^{-1}(U_{i}) for i=1,,pi=1,\ldots,p and X=u1(X~)X=u^{-1}(\tilde{X}), where u1:u(𝒳)𝒳u^{-1}:u(\cal X)\rightarrow{\cal X} is the inverse mapping of uu. This completes the transformation GG^{*} from ZZ to XX.
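For intuition, the following sketch (ours, purely illustrative) carries out the constructions of Q* and G* for a bivariate Gaussian with correlation rho, where the copula conditionals are available in closed form and the round trip G*(Q*(x)) recovers x.

import numpy as np
from scipy.stats import norm

rho = 0.7  # illustrative correlation; (X1, X2) is bivariate normal with standard margins

def Q_star(x1, x2):
    """Rosenblatt transform: map (X1, X2) to independent standard normals (Z1, Z2)."""
    u1, u2 = norm.cdf(x1), norm.cdf(x2)            # probability integral transform of each margin
    u1_tilde = u1                                  # U~1 = U1
    # Gaussian copula conditional C2(u2 | u1) in closed form
    u2_tilde = norm.cdf((norm.ppf(u2) - rho * norm.ppf(u1)) / np.sqrt(1 - rho**2))
    return norm.ppf(u1_tilde), norm.ppf(u2_tilde)  # Z_i = Phi^{-1}(U~_i)

def G_star(z1, z2):
    """Inverse transform: map independent standard normals (Z1, Z2) back to (X1, X2)."""
    u1_tilde, u2_tilde = norm.cdf(z1), norm.cdf(z2)
    u1 = u1_tilde
    u2 = norm.cdf(rho * norm.ppf(u1) + np.sqrt(1 - rho**2) * norm.ppf(u2_tilde))  # C2^{-1}
    return norm.ppf(u1), norm.ppf(u2)              # X_i = F_i^{-1}(U_i); here F_i = Phi

x = (0.3, -1.2)
print(G_star(*Q_star(*x)))   # recovers (0.3, -1.2) up to numerical error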

B. Proof of Theorem 2

(a). By the iWGAN objective (3), (5) holds. Since W1W_{1} is a distance between two probability measures, W1(PX,PG(Z))W¯1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)})\leq\overline{W}_{1}(P_{X},P_{G(Z)}). If there exists a Q𝒬Q^{*}\in{\cal Q} such that Q(X)Q^{*}(X) has the same distribution as PZP_{Z}, we have

W¯1(PX,PG(Z))W1(PX,PG(Q(X)))+W1(PG(Q(X)),PG(Z))=W1(PX,PG(Z)).\displaystyle\overline{W}_{1}(P_{X},P_{G(Z)})\leq W_{1}(P_{X},P_{G(Q^{*}(X))})+W_{1}(P_{G(Q^{*}(X))},P_{G(Z)})=W_{1}(P_{X},P_{G(Z)}).

Hence, W1(PX,PG(Z))=W¯1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)})=\overline{W}_{1}(P_{X},P_{G(Z)}).
(b). We observe that

supfL(G~,Q~,f)=W1(PX,PG~(Q~(X)))+W1(PG~(Q~(X)),PG~(Z)).\sup_{f}L(\widetilde{G},\widetilde{Q},f)=W_{1}(P_{X},P_{\widetilde{G}(\widetilde{Q}(X))})+W_{1}(P_{\widetilde{G}(\widetilde{Q}(X))},P_{\widetilde{G}(Z)}).

By Theorem 1, we have infG,QL(G,Q,f~)L(G,Q,f~)=0\inf_{G,Q}L(G,Q,\widetilde{f})\leq L(G^{*},Q^{*},\widetilde{f})=0 when the encoder and the decoder have enough capacities. Therefore, the duality gap is at least W1(PX,PG~(Q~(X)))+W1(PG~(Q~(X)),PG~(Z))W_{1}(P_{X},P_{\widetilde{G}(\widetilde{Q}(X))})+W_{1}(P_{\widetilde{G}(\widetilde{Q}(X))},P_{\widetilde{G}(Z)}). It is easy to see that, if G~\widetilde{G} outputs the same distribution as XX and Q~\widetilde{Q} outputs the same distribution as ZZ, both the duality gap and W¯1(PX,PG(Z))\overline{W}_{1}(P_{X},P_{G(Z)}) are zero and X=G~(Q~(X))X=\widetilde{G}(\widetilde{Q}(X)) for XPXX\sim P_{X}.

C. Proof of Theorem 3

We first consider the difference between the population W1(PX,PG(Z))W_{1}(P_{X},P_{G(Z)}) and the empirical W^1(PX,PG(Z))\widehat{W}_{1}(P_{X},P_{G(Z)}) given nn samples S={x1,,xn}S=\{x_{1},\ldots,x_{n}\}. Let f1f_{1} and f2f_{2} be their witness functions, respectively. Using the dual form of the 1-Wasserstein distance, we have

W1(PX,PG(Z))W^1(PX,PG(Z))\displaystyle W_{1}(P_{X},P_{G(Z)})-\widehat{W}_{1}(P_{X},P_{G(Z)})
=\displaystyle= 𝔼XPX{f1(X)}𝔼ZPZ{f1(G(Z))}1ni=1nf2(xi)+𝔼ZPZ{f2(G(Z))}\displaystyle\mathbb{E}_{X\sim P_{X}}\{f_{1}(X)\}-\mathbb{E}_{Z\sim P_{Z}}\{f_{1}(G(Z))\}-\frac{1}{n}\sum_{i=1}^{n}f_{2}(x_{i})+\mathbb{E}_{Z\sim P_{Z}}\{f_{2}(G(Z))\}
\displaystyle\leq 𝔼XPX{f1(X)}𝔼ZPZ{f1(G(Z))}1ni=1nf1(xi)+𝔼ZPZ{f1(G(Z))}\displaystyle\mathbb{E}_{X\sim P_{X}}\{f_{1}(X)\}-\mathbb{E}_{Z\sim P_{Z}}\{f_{1}(G(Z))\}-\frac{1}{n}\sum_{i=1}^{n}f_{1}(x_{i})+\mathbb{E}_{Z\sim P_{Z}}\{f_{1}(G(Z))\}
\displaystyle\leq supf𝔼XPX{f(X)}1ni=1nf(xi)Φ(S).\displaystyle\sup_{f}\mathbb{E}_{X\sim P_{X}}\{f(X)\}-\frac{1}{n}\sum_{i=1}^{n}f(x_{i})\triangleq\Phi(S).

Given another sample set S={x1,,xi,,xn}S^{\prime}=\{x_{1},\ldots,x_{i}^{\prime},\ldots,x_{n}\}, it is clear that

Φ(S)Φ(S)supf|f(xi)f(xi)|nxixin2Bn,\displaystyle\Phi(S)-\Phi(S^{\prime})\leq\sup_{f}\frac{|f(x_{i})-f(x_{i}^{\prime})|}{n}\leq\frac{\|x_{i}-x_{i}^{\prime}\|}{n}\leq\frac{2B}{n},

where the second inequality holds since ff is a 1-Lipschitz continuous function. Applying McDiarmid's inequality, with probability at least 1δ/21-\delta/2 for any δ(0,1)\delta\in(0,1), we have

Φ(S)𝔼{Φ(S)}+B2nlog(2δ).\Phi(S)\leq\mathbb{E}\{\Phi(S)\}+B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}. (19)
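For completeness, recall McDiarmid's inequality in the form used here: if changing any single sample in SS changes Φ(S)\Phi(S) by at most c=2B/nc=2B/n, as shown above, then for any t>0t>0,

\mathbb{P}\big(\Phi(S)-\mathbb{E}\{\Phi(S)\}\geq t\big)\leq\exp\left(-\frac{2t^{2}}{nc^{2}}\right).

Setting the right-hand side to δ/2\delta/2 and solving for tt gives t=c\sqrt{(n/2)\log(2/\delta)}=B\sqrt{(2/n)\log(2/\delta)}, which yields Equation (19).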

By the standard technique of symmetrization in Mohri et al. (2018), we have

𝔼{Φ(S)}=𝔼{supf𝔼XPX{f(X)}1ni=1nf(xi)}2n().\mathbb{E}\{\Phi(S)\}=\mathbb{E}\left\{\sup_{f}\mathbb{E}_{X\sim P_{X}}\{f(X)\}-\frac{1}{n}\sum_{i=1}^{n}f(x_{i})\right\}\leq 2\mathfrak{R}_{n}(\mathcal{F}). (20)

It has been proved in Mohri et al. (2018) that with probability at least 1δ/21-\delta/2 for any δ(0,1)\delta\in(0,1),

n()^n()+B2nlog(2δ).\mathfrak{R}_{n}(\mathcal{F})\leq\widehat{\mathfrak{R}}_{n}(\mathcal{F})+B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}. (21)

Combining Equation (19), Equation (20) and Equation (21), we have

W1(PX,PG(Z))W^1(PX,PG(Z))+2^n()+3B2nlog(2δ).W_{1}(P_{X},P_{G(Z)})\leq\widehat{W}_{1}(P_{X},P_{G(Z)})+2\widehat{\mathfrak{R}}_{n}(\mathcal{F})+3B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.

By Theorem 2, we have W^1(PX,PG(Z))W¯^1(PX,PG(Z))\widehat{W}_{1}(P_{X},P_{G(Z)})\leq\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)}). Thus,

W1(PX,PG(Z))W¯^1(PX,PG(Z))+2^n()+3B2nlog(2δ).W_{1}(P_{X},P_{G(Z)})\leq\widehat{\overline{W}}_{1}(P_{X},P_{G(Z)})+2\widehat{\mathfrak{R}}_{n}(\mathcal{F})+3B\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.

D. Proof of Theorem 4

(a) It is obvious that (θ)(θ;θ~)\ell(\theta)\leq{\cal L}(\theta;\tilde{\theta}) from Equation (4). When q(x)=p(x|θ)q(x)=p(x|\theta),

q(x)logexGθ(Qθ(x))q(x)dx=p(x|θ)logexGθ(Qθ(x))exGθ(Qθ(x))V(θ)dx=V(θ).\int q(x)\log{e^{-\|x-G_{\theta}(Q_{\theta}(x))\|}\over q(x)}dx=\int p(x|\theta)\log{e^{-\|x-G_{\theta}(Q_{\theta}(x))\|}\over e^{-\|x-G_{\theta}(Q_{\theta}(x))\|-V(\theta)}}dx=V(\theta).

Therefore, (θ)=(θ;θ)\ell(\theta)={\cal L}(\theta;\theta).

(b) Since θ(t)θ^\theta^{(t)}\rightarrow\hat{\theta} as tt\rightarrow\infty, we have (θ(t+1);θ(t))(θ^;θ^)=(θ^){\cal L}(\theta^{(t+1)};\theta^{(t)})\rightarrow{\cal L}(\hat{\theta};\hat{\theta})=\ell(\hat{\theta}). This implies θ^\hat{\theta} is the MLE.

E. Experimental Results on MNIST

E.1. Latent Space

Figure 11 shows the latent space of MNIST, i.e. Q(X)iQ(X)_{i} against Q(X)jQ(X)_{j} for all iji\neq j.

Refer to caption
Figure 11: Latent Space of MNIST dataset

E.2. Generated Samples

Figure 12 shows the comparison of randomly generated samples between the WGAN-GP and the iWGAN. Figure 13 shows examples of interpolations between two randomly generated samples.

Refer to caption
(a) WGAN-GP
Refer to caption
(b) iWGAN
Figure 12: Generated samples on MNIST
Refer to caption
Figure 13: Interpolations by the iWGAN on MNIST

E.3. Reconstruction

Figure 14(b) shows the distribution of reconstruction errors based on the samples from the validation dataset. Figure 14(a) shows examples of reconstructed samples. Figure 15 shows the best and worst samples, based on quality scores, from the validation dataset.

Refer to caption
(a) Reconstructions
Refer to caption
(b) Histogram of RE
Figure 14: Reconstructions on MNIST
Refer to caption
(a) Samples with high quality scores
Refer to caption
(b) Samples with low quality scores
Figure 15: Sample quality check by the iWGAN on the validation dataset of MNIST

F. Architectures

The code and examples used for this paper are available at: https://drive.google.com/drive/folders/1-_vIrbOYwf2BH1lOrVEcEPJUxkyV5CiB?usp=sharing. In this section, we present the architectures used for each experiment.

Mixture of Gaussians

For the mixture of Gaussians, the latent space is Z5Z\in\mathbb{R}^{5} and the batch size is 256.

Encoder architecture:

x2\displaystyle x\in\mathbb{R}^{2} FC1024RELU\displaystyle\rightarrow FC_{1024}\rightarrow RELU
FC512RELU\displaystyle\rightarrow FC_{512}\rightarrow RELU
FC256RELU\displaystyle\rightarrow FC_{256}\rightarrow RELU
FC128RELUFC5\displaystyle\rightarrow FC_{128}\rightarrow RELU\rightarrow FC_{5}

Generator architecture:

z5\displaystyle z\in\mathbb{R}^{5} FC512RELU\displaystyle\rightarrow FC_{512}\rightarrow RELU
FC512RELU\displaystyle\rightarrow FC_{512}\rightarrow RELU
FC512RELUFC2\displaystyle\rightarrow FC_{512}\rightarrow RELU\rightarrow FC_{2}

Discriminator architecture:

x2\displaystyle x\in\mathbb{R}^{2} FC512RELU\displaystyle\rightarrow FC_{512}\rightarrow RELU
FC512RELU\displaystyle\rightarrow FC_{512}\rightarrow RELU
FC512RELUFC1\displaystyle\rightarrow FC_{512}\rightarrow RELU\rightarrow FC_{1}
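As a sketch (our PyTorch rendering of the fully connected stacks above, not the released code; the helper name is illustrative), the three networks for the mixture-of-Gaussians experiment can be defined as follows.

import torch.nn as nn

def mlp(sizes):
    """Fully connected stack with ReLU between hidden layers, matching FC_k -> RELU -> ... -> FC_out."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

encoder       = mlp([2, 1024, 512, 256, 128, 5])  # x in R^2 -> z in R^5
generator     = mlp([5, 512, 512, 512, 2])        # z in R^5 -> x in R^2
discriminator = mlp([2, 512, 512, 512, 1])        # x in R^2 -> scalar critic value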

MNIST

For MNIST, the latent space is Z8Z\in\mathbb{R}^{8} and the batch size is 250.

Encoder architecture:

x28×28\displaystyle x\in\mathbb{R}^{28\times 28} Conv128RELU\displaystyle\rightarrow Conv_{128}\rightarrow RELU
Conv256RELU\displaystyle\rightarrow Conv_{256}\rightarrow RELU
Conv512RELUFC8\displaystyle\rightarrow Conv_{512}\rightarrow RELU\rightarrow FC_{8}

Generator architecture:

z8\displaystyle z\in\mathbb{R}^{8} FC4×4×512RELU\displaystyle\rightarrow FC_{4\times 4\times 512}\rightarrow RELU
ConvTrans256RELU\displaystyle\rightarrow ConvTrans_{256}\rightarrow RELU
ConvTrans128RELUConvTrans1\displaystyle\rightarrow ConvTrans_{128}\rightarrow RELU\rightarrow ConvTrans_{1}

Discriminator architecture:

x28×28\displaystyle x\in\mathbb{R}^{28\times 28} Conv128RELU\displaystyle\rightarrow Conv_{128}\rightarrow RELU
Conv256RELU\displaystyle\rightarrow Conv_{256}\rightarrow RELU
Conv512RELUFC1\displaystyle\rightarrow Conv_{512}\rightarrow RELU\rightarrow FC_{1}

CelebA

For CelebA, the latent space is Z64Z\in\mathbb{R}^{64} and the batch size is 64.

Encoder architecture:

x64×64×3\displaystyle x\in\mathbb{R}^{64\times 64\times 3} Conv128LeakyRELU\displaystyle\rightarrow Conv_{128}\rightarrow LeakyRELU
Conv256InstanceNormLeakyRELU\displaystyle\rightarrow Conv_{256}\rightarrow InstanceNorm\rightarrow LeakyRELU
Conv512InstanceNormLeakyRELUConv1\displaystyle\rightarrow Conv_{512}\rightarrow InstanceNorm\rightarrow LeakyRELU\rightarrow Conv_{1}

Generator architecture:

z64\displaystyle z\in\mathbb{R}^{64} FC4×4×1024\displaystyle\rightarrow FC_{4\times 4\times 1024}
ConvTrans512BNRELU\displaystyle\rightarrow ConvTrans_{512}\rightarrow BN\rightarrow RELU
ConvTrans256BNRELU\displaystyle\rightarrow ConvTrans_{256}\rightarrow BN\rightarrow RELU
ConvTrans128BNRELUConvTrans3\displaystyle\rightarrow ConvTrans_{128}\rightarrow BN\rightarrow RELU\rightarrow ConvTrans_{3}

Discriminator architecture:

x64×64×3\displaystyle x\in\mathbb{R}^{64\times 64\times 3} Conv128LeakyRELU\displaystyle\rightarrow Conv_{128}\rightarrow LeakyRELU
Conv256InstanceNormLeakyRELU\displaystyle\rightarrow Conv_{256}\rightarrow InstanceNorm\rightarrow LeakyRELU
Conv512InstanceNormLeakyRELUConv1\displaystyle\rightarrow Conv_{512}\rightarrow InstanceNorm\rightarrow LeakyRELU\rightarrow Conv_{1}
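As an illustrative PyTorch sketch of the CelebA generator above (the kernel size 4, stride 2, padding 1, and the final tanh activation are our DCGAN-style assumptions, since they are not specified in the table), one possible rendering is the following.

import torch.nn as nn

class CelebAGenerator(nn.Module):
    """z in R^64 -> 64x64x3 image; layer widths follow the table above, other choices are assumptions."""
    def __init__(self, dim_z=64):
        super().__init__()
        self.fc = nn.Linear(dim_z, 4 * 4 * 1024)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(1024, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),   # 8x8 -> 16x16
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),   # 16x16 -> 32x32
            nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh(),                          # 32x32 -> 64x64
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 1024, 4, 4)   # reshape FC_{4x4x1024} output to a feature map
        return self.deconv(h)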

G. Comparison Metrics

Four performance measures, namely inception scores (IS), Fréchet inception distances (FID), reconstruction errors (RE), and the maximum mean discrepancy (MMD) between encodings and standard normal random variables, are used to compare different models.

Proposed by Salimans et al. (2016), the IS involves using a pre-trained Inception v3 model to predict the class probabilities for each generated image. These predictions are then summarized into the IS by the KL divergence as follows:

IS=exp(𝔼xPG(Z)DKL(p(y|x)p(y))),\text{IS}=\exp\left(\mathbb{E}_{x\sim P_{G(Z)}}D_{KL}\left(p(y|x)\|p(y)\right)\right), (22)

where p(y|x)p(y|x) denotes the predicted class probabilities conditional on a generated image, and p(y)p(y) is the corresponding marginal distribution. Higher scores are better, corresponding to a larger KL-divergence between the two distributions. The FID was proposed by Heusel et al. (2017) to improve on the IS by directly comparing the statistics of generated samples to those of real samples. It is defined as the Fréchet distance between two multivariate Gaussians,

FID=μrμG2+Tr(Σr+ΣG2(ΣrΣG)1/2),\text{FID}=\|\mu_{r}-\mu_{G}\|^{2}+\text{Tr}\left(\Sigma_{r}+\Sigma_{G}-2(\Sigma_{r}\Sigma_{G})^{1/2}\right), (23)

where XrN(μr,Σr)X_{r}\sim N(\mu_{r},\Sigma_{r}) and XGN(μG,ΣG)X_{G}\sim N(\mu_{G},\Sigma_{G}) are the 2048-dimensional activations of the Inception-v3 pool-3 layer for real and generated samples respectively. For the FID, the lower the better. Furthermore, the reconstruction error (RE) is defined as

RE=1Ni=1NX^iXi2,\mbox{RE}=\dfrac{1}{N}\sum_{i=1}^{N}\|\hat{X}_{i}-X_{i}\|_{2}, (24)

where X^i\hat{X}_{i} is the reconstructed sample for XiX_{i}. RE is used to measure if the method has generated meaningful latent encodings. Smaller reconstruction errors indicate a more meaningful latent space which can be decoded into the original samples. The maximum mean discrepancy (MMD) is defined as

MMD=1N(N1)ljk(zl,zj)+1N(N1)ljk(z~l,z~j)2N2l,jk(zl,z~j)\mbox{MMD}=\dfrac{1}{N(N-1)}\sum_{l\neq j}k(z_{l},z_{j})+\dfrac{1}{N(N-1)}\sum_{l\neq j}k(\tilde{z}_{l},\tilde{z}_{j})-\dfrac{2}{N^{2}}\sum_{l,j}k(z_{l},\tilde{z}_{j}) (25)

where kk is a positive-definite reproducing kernel, the ziz_{i}’s are drawn from the prior distribution PZ{P}_{Z}, and z~i=Q(xi)\tilde{z}_{i}=Q(x_{i}) are the latent encodings of real samples. The MMD is used to measure the difference between the distribution of latent encodings and standard normal random variables. A smaller MMD indicates that the distribution of encodings is closer to the standard normal distribution.
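As a sketch (ours; it assumes the Inception pool-3 activations and the latent encodings are already available as NumPy arrays with equal sample sizes, and it uses a Gaussian kernel with an illustrative bandwidth for the MMD), the FID and MMD estimators above can be implemented as follows.

import numpy as np
from scipy.linalg import sqrtm

def fid(act_real, act_gen):
    """Frechet inception distance between two sets of Inception pool-3 activations (n x 2048)."""
    mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_g = np.cov(act_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g).real            # matrix square root; drop tiny imaginary parts
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)

def mmd(z_prior, z_enc, bandwidth=1.0):
    """MMD estimate of Equation (25) with a Gaussian kernel between prior draws and encodings (n x d)."""
    def gram(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-d2 / (2 * bandwidth**2))
    n = z_prior.shape[0]
    k_zz, k_ee, k_ze = gram(z_prior, z_prior), gram(z_enc, z_enc), gram(z_prior, z_enc)
    term1 = (k_zz.sum() - np.trace(k_zz)) / (n * (n - 1))   # off-diagonal average over prior draws
    term2 = (k_ee.sum() - np.trace(k_ee)) / (n * (n - 1))   # off-diagonal average over encodings
    term3 = 2 * k_ze.sum() / n**2
    return term1 + term2 - term3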

References

  • Arjovsky et al. (2017) Arjovsky, M., S. Chintala, and L. Bottou (2017). Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. PMLR.
  • Arora et al. (2017) Arora, S., R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017). Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  224–232. JMLR. org.
  • Barratt and Sharma (2018) Barratt, S. and R. Sharma (2018). A note on the inception score. arXiv preprint arXiv:1801.01973.
  • Bartlett et al. (2017) Bartlett, P. L., D. J. Foster, and M. J. Telgarsky (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249.
  • Berthelot et al. (2017) Berthelot, D., T. Schumm, and L. Metz (2017). Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
  • Blei et al. (2017) Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational inference: A review for statisticians. Journal of the American statistical Association 112(518), 859–877.
  • Brock et al. (2019) Brock, A., J. Donahue, and K. Simonyan (2019). Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
  • Carreira-Perpinan and Hinton (2005) Carreira-Perpinan, M. A. and G. E. Hinton (2005). On contrastive divergence learning. In Aistats, Volume 10, pp.  33–40. Citeseer.
  • Chen et al. (2018) Chen, X., J. Wang, and H. Ge (2018). Training generative adversarial networks via primal-dual subgradient methods: A lagrangian perspective on GAN. In International Conference on Learning Representations.
  • Donahue et al. (2017) Donahue, J., P. Krähenbühl, and T. Darrell (2017). Adversarial feature learning. In International Conference on Learning Representations (ICLR).
  • Dumoulin et al. (2017) Dumoulin, V., I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2017). Adversarially learned inference. In International Conference on Learning Representations (ICLR).
  • Farnia and Tse (2018) Farnia, F. and D. Tse (2018). A convex duality framework for gans. In Advances in Neural Information Processing Systems, pp. 5248–5258.
  • Fischer and Igel (2010) Fischer, A. and C. Igel (2010). Empirical analysis of the divergence of gibbs sampling based learning algorithms for restricted boltzmann machines. In International Conference on Artificial Neural Networks, pp. 208–217. Springer.
  • Gao et al. (2020) Gao, R., R. Nijkamp, D. Kingma, Z. Xu, A. Dai, and Y. Wu (2020). Flow contrastive estimation of energy-based models. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  7515–7525.
  • Goodfellow et al. (2014) Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
  • Gretton et al. (2012) Gretton, A., K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012). A kernel two-sample test. Journal of Machine Learning Research 13(Mar), 723–773.
  • Grnarova et al. (2018) Grnarova, P., K. Y. Levy, A. Lucchi, N. Perraudin, T. Hofmann, and A. Krause (2018). Evaluating gans via duality. arXiv preprint arXiv:1811.05512.
  • Gu and Zhu (2001) Gu, M. G. and H. Zhu (2001). Maximum likelihood estimation for spatial models by markov chain monte carlo stochastic approximation. Journal of the Royal Statistical Society B 63(2), 339–355.
  • Gulrajani et al. (2017) Gulrajani, I., F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017). Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777.
  • Günther (1991) Günther, M. (1991). Isometric embeddings of riemannian manifolds. In Proceedings of the International Congress of Mathematicians, pp.  1137–1143.
  • Han et al. (2019) Han, T., E. Nijkamp, X. Fang, M. Hill, S.-C. Zhu, and Y. N. Wu (2019). Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  8670–8679.
  • Heusel et al. (2017) Heusel, M., H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  • Hinton (2002) Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural computation 14(8), 1771–1800.
  • Hornik (1991) Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4(2), 251–257.
  • Hu et al. (2018) Hu, Z., Z. Yang, R. Salakhutdinov, and E. P. Xing (2018). On unifying deep generative models. In International Conference on Learning Representations.
  • Jiang et al. (2019) Jiang, H., Z. Chen, M. Chen, F. Liu, D. Wang, and T. Zhao (2019). On computation and generalization of generative adversarial networks under spectrum control. In International Conference on Learning Representations.
  • Kingma and Ba (2015) Kingma, D. P. and J. Ba (2015). Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations, San Diego, 2015.
  • Kingma and Welling (2014) Kingma, D. P. and M. Welling (2014). Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • Larsen et al. (2016) Larsen, A. B. L., S. K. Sønderby, H. Larochelle, and O. Winther (2016). Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pp. 1558–1566.
  • Li et al. (2019) Li, X., J. Lu, Z. Wang, J. Haupt, and T. Zhao (2019). On tighter generalization bounds for deep neural networks: CNNs, resnets, and beyond.
  • Li et al. (2015) Li, Y., K. Swersky, and R. Zemel (2015). Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. PMLR.
  • Mescheder et al. (2017) Mescheder, L., S. Nowozin, and A. Geiger (2017). Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  2391–2400. JMLR. org.
  • Mohri et al. (2018) Mohri, M., A. Rostamizadeh, and A. Talwalkar (2018). Foundations of machine learning. MIT press.
  • Nash (1956) Nash, J. (1956). The imbedding problem for riemannian manifolds. Annals of mathematics 63(1), 20–63.
  • Nowozin et al. (2016) Nowozin, S., B. Cseke, and R. Tomioka (2016). f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279.
  • Qiu and Wang (2020) Qiu, Y. and X. Wang (2020). Almond: Adaptive latent modeling and optimization via neural networks and langevin diffusion. Journal of the American Statistical Association 0(0), 1–13.
  • Qiu et al. (2020) Qiu, Y., L. Zhang, and X. Wang (2020). Unbiased contrastive divergence algorithm for training energy-based latent variable models. In International Conference on Learning Representations (ICLR).
  • Rosca et al. (2017) Rosca, M., B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed (2017). Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987.
  • Rosenblatt (1952) Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics 23(3), 470–472.
  • Salimans et al. (2016) Salimans, T., I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016). Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242.
  • Schulz et al. (2010) Schulz, H., A. Müller, and S. Behnke (2010). Investigating convergence of restricted boltzmann machine learning. In NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning.
  • Tolstikhin et al. (2018) Tolstikhin, I., O. Bousquet, S. Gelly, and B. Schoelkopf (2018). Wasserstein auto-encoders. In International Conference on Learning Representations.
  • Ulyanov et al. (2018) Ulyanov, D., A. Vedaldi, and V. Lempitsky (2018). It takes (only) two: Adversarial generator-encoder networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Villani (2008) Villani, C. (2008). Optimal transport: old and new, Volume 338. Springer Science & Business Media.
  • Zhao et al. (2017) Zhao, J., M. Mathieu, and Y. LeCun (2017). Energy-based generative adversarial network. In International Conference on Learning Representations (ICLR).
  • Zhao et al. (2018) Zhao, S., J. Song, and S. Ermon (2018). The information-autoencoding family: A lagrangian perspective on latent variable generative modeling.
  • Zhu et al. (2017) Zhu, J. Y., T. Park, P. Isola, and A. A. Efros (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp.  2223–2232.