
GANs May Have No Nash Equilibria


Farzan Farnia, Asuman Ozdaglar
{farnia,asuman}@mit.edu

Massachusetts Institute of Technology
Abstract

Generative adversarial networks (GANs) represent a zero-sum game between two machine players, a generator and a discriminator, designed to learn the distribution of data. While GANs have achieved state-of-the-art performance in several benchmark learning tasks, GAN minimax optimization still poses great theoretical and empirical challenges. GANs trained using first-order optimization methods commonly fail to converge to a stable solution where the players cannot improve their objective, i.e., the Nash equilibrium of the underlying game. Such issues raise the question of the existence of Nash equilibrium solutions in the GAN zero-sum game. In this work, we show through several theoretical and numerical results that indeed GAN zero-sum games may not have any local Nash equilibria. To characterize an equilibrium notion applicable to GANs, we consider the equilibrium of a new zero-sum game with an objective function given by a proximal operator applied to the original objective, a solution we call the proximal equilibrium. Unlike the Nash equilibrium, the proximal equilibrium captures the sequential nature of GANs, in which the generator moves first followed by the discriminator. We prove that the optimal generative model in Wasserstein GAN problems provides a proximal equilibrium. Inspired by these results, we propose a new approach, which we call proximal training, for solving GAN problems. We discuss several numerical experiments demonstrating the existence of proximal equilibrium solutions in GAN minimax problems.

1 Introduction

Since their introduction in [1], generative adversarial networks (GANs) have achieved great success in many tasks of learning the distribution of observed samples. Unlike traditional approaches to distribution learning, GANs view the learning problem as a zero-sum game between the following two players: 1) a generator $G$ aiming to generate real-like samples from a random noise input, and 2) a discriminator $D$ trying to distinguish $G$’s generated samples from real training data. This game is commonly formulated through a minimax optimization problem as follows:

\[ \min_{G\in\mathcal{G}}\;\max_{D\in\mathcal{D}}\;V(G,D). \tag{1.1} \]

Here, $\mathcal{G}$ and $\mathcal{D}$ are respectively the generator and discriminator function sets, commonly chosen as two deep neural nets, and $V(G,D)$ denotes the minimax objective for generator $G$ and discriminator $D$, capturing how dissimilar the generated samples and training data are.

GAN optimization problems are commonly solved by alternating gradient methods, which under proper regularization have resulted in state-of-the-art generative models for various benchmark datasets. However, GAN minimax optimization has led to several theoretical and empirical challenges in the machine learning literature. Training GANs is widely known as a challenging optimization task requiring an exhaustive hyper-parameter and architecture search and demonstrating an unstable behavior. While a few regularization schemes have achieved empirical success in training GANs [2, 3, 4, 5], still little is known about the conditions under which GAN minimax optimization can be successfully solved by first-order optimization methods.

To understand the minimax optimization in GANs, one needs to first answer the following question: What is the proper notion of equilibrium in the GAN zero-sum game? In other words, what are the optimality criteria in the GAN’s minimax optimization problem? A classical notion of equilibrium in the game theory literature is the Nash equilibrium, a state in which no player can raise its individual gain by choosing a different strategy. According to this definition, a Nash equilibrium $(G^{*},D^{*})$ for the GAN minimax problem (1.1) must satisfy the following for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:

\[ V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,V(G,D^{*}). \tag{1.2} \]

As a well-known result, for a generator $G$ expressive enough to reproduce the distribution of observed samples, a Nash equilibrium exists for the generator producing the data distribution [6]. However, such a Nash equilibrium would be of little interest from a learning perspective, since the trained generator merely overfits the empirical distribution of training samples [7]. More importantly, state-of-the-art GAN architectures [4, 5, 8, 9] commonly restrict the generator function through various means of regularization such as batch or spectral normalization. Such regularization mechanisms do not allow the generator to produce the empirical distribution of observed data points. Since the realizability assumption does not apply to such regularized GANs, the existence of Nash equilibria is not guaranteed in their minimax problems.

The above discussion motivates studying the equilibrium of GAN zero-sum games in the non-realizable settings where the generator cannot express the empirical distribution of training data. Here, a natural question is whether a Nash equilibrium still exists for the GAN minimax problem. In this work, we focus on this question and demonstrate through several theoretical and numerical results that:

  • Nash equilibrium may not exist in GAN zero-sum games.

We provide theoretical examples of well-known GAN formulations including the vanilla GAN [1], Wasserstein GAN (WGAN) [3], $f$-GAN [10], and the second-order Wasserstein GAN (W2GAN) [11] where no local Nash equilibria exist in their minimax optimization problems. We further perform numerical experiments on widely-used GAN architectures which suggest that an empirically successful GAN training may converge to non-Nash equilibrium solutions.

Next, we focus on characterizing a new notion of equilibrium for GAN problems. To achieve this goal, we consider the Nash equilibrium of a new zero-sum game where the objective function is given by the following proximal operator applied to the minimax objective $V(G,D)$ with respect to a norm on discriminator functions:

\[ V^{\operatorname{prox}}(G,D)\coloneqq\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\bigl\|\widetilde{D}-D\bigr\|^{2}. \tag{1.3} \]

We refer to the Nash equilibrium of the new zero-sum game as the proximal equilibrium. Given the inherent sequential nature of GAN problems where the generator moves first followed by the discriminator, we consider a Stackelberg game for its representation and focus on the subgame perfect equilibrium (SPE) of the game as the right notion of equilibrium for such problems [12]. We prove that the proximal equilibrium of Wasserstein GANs provides an SPE for the GAN problem. This result applies to both the first-order and second-order Wasserstein GANs. In these cases, we show a proximal equilibrium exists for the optimal generator minimizing the distance to the data distribution.

Inspired by these theoretical results, we propose a proximal approach for training GANs, which we call proximal training, by changing the original minimax objective to the proximal objective in (1.3). In addition to preserving the optimal solution to the GAN minimax problem, proximal training can further enjoy the existence of Nash equilibrium solutions in the new minimax objective. We discuss numerical results supporting the proximal training approach and the role of proximal equilibrium solutions in various GAN problems.

2 Related Work

Understanding the minimax optimization in modern machine learning applications including GANs has been a subject of great interest in the machine learning literature. A large body of recent work [13, 14, 15, 16, 17, 18, 19, 20, 21] has analyzed the convergence properties of first-order optimization methods in solving different classes of minimax games.

In a related work, [12] proposes a new notion of local optimality, called local minimax, designed for general sequential machine learning games. Compared to the notion of local minimax, the proximal equilibrium proposed in our work gives a notion of global optimality, which as we show directly applies to Wasserstein GANs. [12] also provides examples of minimax problems where Nash equilibria do not exist; however, the examples do not represent GAN minimax problems. Some recent works [21, 22, 23] have analyzed the convergence of different optimization methods to local minimax solutions.

In another related work, [24] analyzes the stable points of the gradient descent ascent (GDA) and optimistic GDA [13] algorithms, proving that they will give strict supersets of the local saddle points. Regarding the stability of GAN algorithms, [25] proves that the GDA algorithm will be locally stable for the vanilla and regularized Wasserstein GAN problems. [11] shows the GDA algorithm is globally stable for W2GANs with linear generator and quadratic discriminator functions.

Regarding the equilibrium in GANs, [7] studies the Nash equilibrium of GAN minimax games in realizable settings. Also, [7, 26] develop methods for finding mixed strategy Nash equilibria. On the other hand, our results focus on the pure strategies in non-realizable settings. [27] empirically studies the equilibrium of GAN problems regularized via the gradient penalty, reporting positive results on the stability of regularized GANs. However, our focus is on the existence of pure Nash equilibrium solutions. [28] suggests a moment matching GAN formulation using the Sobolev norm. As a different direction, we use the Sobolev norm to analyze equilibrium in GANs. Finally, developing GAN architectures with improved equilibrium and stability properties has been studied in several recent works [2, 29, 30, 31, 32, 33, 34, 35].

3 An Initial Experiment on Equilibrium in GANs

To examine whether the Nash equilibrium exists in GAN problems empirically, we performed a simple numerical experiment. In this experiment, we applied three standard GAN implementations including the Wasserstein GAN with weight-clipping (WGAN-WC) [3], the improved Wasserstein GAN with gradient penalty (WGAN-GP) [4], and the spectrally-normalized vanilla GAN (SN-GAN) [5], to the two benchmark MNIST [36] and CelebA [37] databases. We used the convolutional architecture of the DC-GAN [38] optimized with the Adam [39] or RMSprop [40] (only for WGAN-WC) optimizers.

We ran each of the GAN experiments for 200,000 generator iterations to reach $(G_{\bm{\theta}_{\operatorname{final}}},D_{\mathbf{w}_{\operatorname{final}}})$, with $\bm{\theta}_{\operatorname{final}}$ and $\mathbf{w}_{\operatorname{final}}$ denoting the trained generator and discriminator parameters at the end of the 200,000 iterations. Our goal is to examine whether the solution pair $(G_{\bm{\theta}_{\operatorname{final}}},D_{\mathbf{w}_{\operatorname{final}}})$ represents a Nash equilibrium. To do this, we fixed the trained discriminator $D_{\mathbf{w}_{\operatorname{final}}}$ and continued optimizing the generator $G_{\bm{\theta}}$. Specifically, we solved the following optimization problem, initialized at $\bm{\theta}^{(0)}=\bm{\theta}_{\operatorname{final}}$, using the default first-order optimizer for the generator function for 10,000 iterations:

\[ \min_{\bm{\theta}}\>V(G_{\bm{\theta}},D_{\mathbf{w}_{\operatorname{final}}}). \tag{3.1} \]

If the pair $(G_{\bm{\theta}_{\operatorname{final}}},D_{\mathbf{w}_{\operatorname{final}}})$ were in fact a Nash equilibrium, it would give a local saddle point to the minimax optimization and the above optimization could not make the objective any smaller than its initial value. Also, the image samples generated by the generator $G_{\bm{\theta}}$ should have improved or at least preserved their initial quality during this optimization, since the discriminator $D_{\mathbf{w}_{\operatorname{final}}}$ would be the optimal discriminator against all generator functions.

Figure 1: Optimizing the SN-GAN’s generator objective without changing the trained discriminator on (a) the MNIST data and (b) the CelebA data. Both the SN-GAN objective and the samples’ quality decreased during the optimization.

Contrary to these predictions, neither statement held in any of the six experiments with the three standard GAN implementations and the two datasets. The optimization objective decreased rapidly from the beginning of the optimization, and the pictures sampled from the generator completely lost their quality over this optimization. Figures 1(a) and 1(b) show the objective for the SN-GAN experiments over the 10,000 steps of the above optimization. These figures also show the SN-GAN generated samples before and during the optimization, which reveal a significant drop in the quality of the generated pictures. We defer the results for the WGAN-WC and WGAN-GP problems to the Appendix.

The results of the above experiments show that practical GAN experiments may not converge to local Nash equilibrium solutions. After fixing the trained discriminator, the trained generator can be further optimized using a first-order optimization method to reach smaller values of the generator objective. More importantly, this optimization not only fails to improve the quality of the generator’s output samples, but also severely degrades the trained generator. As demonstrated in these experiments, simultaneous optimization of the two players is in fact necessary for proper convergence and stability in GAN minimax optimization. The above experiments suggest that practical GAN solutions are not local Nash equilibria. In the upcoming sections, we review some standard GAN formulations and then show that there are examples of GAN minimax problems for which no Nash equilibrium exists. Those theoretical results further support our observations in the above experiments.
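The freeze-the-discriminator check used in this section can be illustrated on a much smaller problem. The sketch below is not a GAN: the quadratic objective `V`, the interval constraint on `g`, and the helper `continue_generator` are hypothetical choices made only for illustration. The minimax solution of this toy game is $(g,d)=(0,0)$, yet freezing `d` there and continuing projected gradient descent on `g` from a slightly perturbed start still lowers the objective, which is the signature of a non-Nash solution.

```python
def V(g, d):
    """Hypothetical toy objective (not a GAN). max_d V(g, d) = 3 g^2, so g = 0 is minimax-optimal."""
    return -g * g + 4.0 * g * d - d * d

def continue_generator(g, d, lr=0.05, steps=50):
    """Projected gradient descent on g over [-1, 1] while d stays frozen."""
    history = [V(g, d)]
    for _ in range(steps):
        grad_g = -2.0 * g + 4.0 * d          # dV/dg evaluated at the frozen d
        g = min(1.0, max(-1.0, g - lr * grad_g))
        history.append(V(g, d))
    return g, history

# Start next to the minimax solution (0, 0), as an approximately trained model would be.
g_final, history = continue_generator(g=0.01, d=0.0)
print(history[0], history[-1], g_final)
```

In this toy game the generator drifts to the boundary of its feasible set and the objective keeps dropping, even though the starting point was (almost) optimal for the sequential min-max problem.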

4 Review of GAN Formulations

4.1 Vanilla GAN & $f$-GAN

Consider samples $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ observed independently from distribution $P_{\mathbf{X}}$. Our goal is to find a generator function $G\in\mathcal{G}$ that maps a random noise input $\mathbf{Z}$ from a known $P_{\mathbf{Z}}$ to an output $G(\mathbf{Z})$ distributed as $P_{\mathbf{X}}$, i.e., we aim to match the probability distributions $P_{G(\mathbf{Z})}$ and $P_{\mathbf{X}}$. To find such a generator function, [1] proposes the following minimax problem, which is commonly referred to as the vanilla GAN problem:

\[ \min_{G\in\mathcal{G}}\>\max_{D\in\mathcal{D}}\>\mathbb{E}\bigl[\log(D(\mathbf{X}))\bigr]+\mathbb{E}\bigl[\log(1-D(G(\mathbf{Z})))\bigr]. \tag{4.1} \]

Here $\mathcal{G}$ and $\mathcal{D}$ represent the set of generator and discriminator functions, respectively. In this formulation, the discriminator is optimized to map real samples from $P_{\mathbf{X}}$ to larger values than the values assigned to generated samples from $P_{G(\mathbf{Z})}$.

As shown in [1], the above minimax problem for an unconstrained $\mathcal{D}$ containing all real-valued functions reduces to the following divergence minimization problem:

\[ \min_{G\in\mathcal{G}}\>\operatorname{JSD}\bigl(P_{\mathbf{X}},P_{G(\mathbf{Z})}\bigr), \tag{4.2} \]

where $\operatorname{JSD}$ denotes the Jensen-Shannon (JS) divergence defined in terms of KL-divergence as

\[ \operatorname{JSD}(P,Q)\coloneqq\frac{1}{2}\operatorname{KL}\Bigl(P,\frac{P+Q}{2}\Bigr)\,+\,\frac{1}{2}\operatorname{KL}\Bigl(Q,\frac{P+Q}{2}\Bigr). \]
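As a small numerical illustration of this definition, the sketch below evaluates the JS divergence for discrete distributions given as lists of probabilities. The helper names `kl` and `jsd` and the example distributions are assumptions introduced here; natural logarithms are used, so the divergence of two distributions with disjoint supports equals $\log 2$.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd(p, p))                                  # identical distributions
print(jsd(p, q))                                  # partially overlapping supports
print(jsd([1.0, 0.0], [0.0, 1.0]), math.log(2))   # disjoint supports vs. log 2
```

Because the divergence saturates at $\log 2$ as soon as the supports are disjoint, it carries no information about how far apart the two distributions are, which is one motivation for the optimal transport costs reviewed in Section 4.2.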

$f$-GANs extend the vanilla GAN problem by generalizing the JS-divergence to a general $f$-divergence. For a convex function $f:\mathbb{R}\rightarrow\mathbb{R}$ with $f(1)=0$, the $f$-divergence $d_{f}$ corresponding to $f$ is defined as

\[ d_{f}(P,Q)\coloneqq\int p(\mathbf{x})f\Bigl(\frac{q(\mathbf{x})}{p(\mathbf{x})}\Bigr)\mathop{}\!\mathrm{d}\mathbf{x}. \tag{4.3} \]

Notice that the JS-divergence is a special case of $f$-divergence with $f_{\operatorname{JSD}}(t)=t\log t-(t+1)\log\frac{t+1}{2}$. [10] shows that generalizing the divergence minimization (4.2) to minimizing an $f$-divergence results in the following minimax problem called $f$-GAN:

\[ \min_{G\in\mathcal{G}}\>\max_{D\in\mathcal{D}}\>\mathbb{E}\bigl[D(\mathbf{X})\bigr]-\mathbb{E}\bigl[f^{*}\bigl(D(G(\mathbf{Z}))\bigr)\bigr], \tag{4.4} \]

where $f^{*}$ denotes the Fenchel conjugate of $f$, defined as $f^{*}(u)=\sup_{t}\,\{ut-f(t)\}$. The space $\mathcal{D}$ implied by the $f$-divergence minimization will be the set of all functions, but a similar interpretation further applies to a constrained $\mathcal{D}$ [41, 42]. Several examples of $f$-GANs have been formulated and discussed in [10].
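The Fenchel conjugate in (4.4) can be checked numerically for a concrete $f$. The sketch below is a minimal illustration under stated assumptions: it uses $f(t)=t\log t$ (the KL case), restricts the supremum to a finite grid on $[0, t_{\max}]$, and compares the result with the closed form $f^{*}(u)=e^{u-1}$. The helper names `f_kl` and `conjugate` and the grid parameters are hypothetical.

```python
import math

def f_kl(t):
    """f(t) = t log t, extended by continuity with f(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0

def conjugate(f, u, t_max=10.0, n=100_000):
    """Grid approximation of f*(u) = sup_t (u t - f(t)) over t in [0, t_max]."""
    return max(u * t - f(t) for t in (t_max * i / n for i in range(n + 1)))

for u in (0.0, 1.0, 2.0):
    # Grid value next to the closed form exp(u - 1) for this particular f.
    print(u, conjugate(f_kl, u), math.exp(u - 1))
```

The grid search only works when the maximizer $t=e^{u-1}$ lies inside $[0, t_{\max}]$, so larger values of $u$ would need a wider grid.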

4.2 Wasserstein GANs

To resolve GAN training issues, [3] proposes to formulate a GAN problem by minimizing optimal transport costs, which unlike $f$-divergences change continuously with the input distributions. Given a transportation cost $c(\mathbf{x},\mathbf{x}^{\prime})$ for transporting $\mathbf{x}$ to $\mathbf{x}^{\prime}$, the optimal transport cost $W_{c}$ is defined as

\[ W_{c}(P,Q)=\inf_{M\in\Pi(P,Q)}\>\mathbb{E}_{M}\bigl[c(\mathbf{X},\mathbf{X}^{\prime})\bigr] \tag{4.5} \]

where $\Pi(P,Q)$ denotes the set of all joint distributions on $(\mathbf{X},\mathbf{X}^{\prime})$ with $\mathbf{X},\,\mathbf{X}^{\prime}$ marginally distributed as $P,\,Q$, respectively. An important special case is the first-order Wasserstein distance ($W_{1}$-distance) corresponding to $c(\mathbf{X},\mathbf{X}^{\prime})=\|\mathbf{X}-\mathbf{X}^{\prime}\|$. In this special case, the Kantorovich-Rubinstein duality shows

\[ W_{1}(P,Q)=\max_{\|D\|_{\operatorname{Lip}}\leq 1}\,\mathbb{E}_{P}[D(\mathbf{X})]-\mathbb{E}_{Q}[D(\mathbf{X})]. \tag{4.6} \]

Here $\mathbb{E}_{P}[\cdot]$ denotes the expected value with respect to distribution $P$ and $\|D\|_{\operatorname{Lip}}$ denotes the Lipschitz constant of function $D$, which is defined as the smallest $L$ satisfying $|D(\mathbf{x})-D(\mathbf{x}^{\prime})|\leq L\,\|\mathbf{x}-\mathbf{x}^{\prime}\|$ for every $\mathbf{x},\mathbf{x}^{\prime}$. Formulating a GAN problem minimizing the $W_{1}$-distance, [3] states the Wasserstein GAN (WGAN) problem as follows:

\[ \min_{G\in\mathcal{G}}\>\max_{\|D\|_{\operatorname{Lip}}\leq 1}\>\mathbb{E}\bigl[D(\mathbf{X})\bigr]-\mathbb{E}\bigl[D(G(\mathbf{Z}))\bigr]. \tag{4.7} \]
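The duality (4.6) can be sanity-checked in one dimension, where the $W_{1}$-distance between two empirical distributions with the same number of samples is obtained by matching sorted samples. The sketch below is only an illustration: the helper names `w1_empirical_1d` and `dual_value` and the sample values are assumptions, and the candidate discriminators are hand-picked 1-Lipschitz functions rather than trained networks.

```python
def w1_empirical_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions: match sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_value(d, xs, ys):
    """Kantorovich-Rubinstein objective E_P[D(X)] - E_Q[D(X)] for a candidate D."""
    return sum(map(d, xs)) / len(xs) - sum(map(d, ys)) / len(ys)

xs = [0.0, 1.0, 2.0, 3.0]        # samples from P
ys = [x + 0.5 for x in xs]       # samples from Q: P shifted by 0.5

print(w1_empirical_1d(xs, ys))                # primal value
print(dual_value(lambda x: -x, xs, ys))       # D(x) = -x is 1-Lipschitz and attains the primal value
print(dual_value(lambda x: 0.3 * x, xs, ys))  # another 1-Lipschitz D gives a smaller dual value
```

Any 1-Lipschitz discriminator yields a lower bound on the distance, and the bound is tight for the optimal one, which is what the inner maximization in (4.7) searches for.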

The WGAN problem (4.7) can be generalized to a general optimal transport cost with arbitrary cost function $c(\mathbf{x},\mathbf{x}^{\prime})$. The generalization is as follows:

\[ \min_{G\in\mathcal{G}}\>\max_{D\,\operatorname{c-concave}}\>\mathbb{E}\bigl[D(\mathbf{X})\bigr]-\mathbb{E}\bigl[D^{c}(G(\mathbf{Z}))\bigr], \tag{4.8} \]

where the $c$-transform is defined as $D^{c}(\mathbf{x})=\sup_{\mathbf{x}^{\prime}}\,D(\mathbf{x}^{\prime})-c(\mathbf{x},\mathbf{x}^{\prime})$ and a function $D$ is called $c$-concave if it is the $c$-transform of some valid function. In particular, the optimal transport GAN formulation with the quadratic cost $c(\mathbf{x},\mathbf{x}^{\prime})=\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2}$ results in the second-order Wasserstein GAN (W2GAN) problem, which has been studied in several recent works [11, 43, 44, 45].
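To make the $c$-transform concrete, the sketch below evaluates it on a one-dimensional grid for the quadratic cost. Everything in it is an illustrative assumption: the helper `c_transform`, the grid range, and the test function $D(x^{\prime})=-x^{\prime 2}$, for which a short calculation gives $D^{c}(x)=-x^{2}/2$ (the supremum is attained at $x^{\prime}=x/2$).

```python
def c_transform(D, x, cost, grid):
    """Grid approximation of D^c(x) = sup_{x'} D(x') - c(x, x')."""
    return max(D(xp) - cost(x, xp) for xp in grid)

quadratic_cost = lambda x, xp: (x - xp) ** 2
D = lambda xp: -xp ** 2                           # hypothetical test function
grid = [-5.0 + 0.001 * i for i in range(10001)]   # candidate x' values in [-5, 5]

for x in (0.0, 1.0, 2.0):
    # Grid value next to the closed form -x^2 / 2 for this particular D.
    print(x, c_transform(D, x, quadratic_cost, grid), -x ** 2 / 2)
```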

5 Existence of Nash Equilibrium Solutions in GANs

Consider a general GAN minimax problem (1.1) with a minimax objective $V(G,D)$. As discussed in the previous section, the optimal generator $G^{*}\in\mathcal{G}$ is defined to minimize the GAN’s target divergence to the data distribution. The following proposition is a well-known result regarding the Nash equilibrium of the GAN game in realizable settings where there exists a generator $G\in\mathcal{G}$ producing the data distribution.

Proposition 1.

Assume that generator $G^{*}\in\mathcal{G}$ results in the distribution of data, i.e., we have $P_{G^{*}(\mathbf{Z})}=P_{\mathbf{X}}$. Then, for each of the GAN problems discussed in Section 4 there exists a constant discriminator function $D_{\operatorname{constant}}$ which together with $G^{*}$ results in a Nash equilibrium to the GAN game, and hence satisfies the following for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:

\[ V(G^{*},D)\,\leq\,V(G^{*},D_{\operatorname{constant}})\,\leq\,V(G,D_{\operatorname{constant}}). \]

Proof.

This proposition is well-known for the vanilla GAN [46]. In the Appendix, we provide a proof for general $f$-GANs and Wasserstein GANs. ∎

The above proposition shows that in a realizable setting with a generator function generating the distribution of observed samples, a Nash equilibrium exists for that optimal generator. However, the realizability assumption in this proposition does not always hold in real GAN experiments. For example, in the GAN experiments discussed in Section 3, we observed that the divergence estimate never reached the zero value because of regularizing the generator function. Therefore, the Nash equilibrium described in Proposition 1 does not apply to the trained generator and discriminator in such GAN experiments.

Here, we address the question of the existence of Nash equilibrium solutions for non-realizable settings, where no generator $G\in\mathcal{G}$ can produce the data distribution. Do Nash equilibria always exist in non-realizable GAN zero-sum games? The following theorem shows that the answer is in general no. Note that $\sigma_{\max}(\cdot)$ in this theorem denotes the maximum singular value, i.e., the spectral norm.

Theorem 1.

Consider a GAN minimax problem for learning a normally distributed $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$ with zero mean and scalar covariance matrix where $\sigma>1$. In the GAN formulation, we use a linear generator function $G(\mathbf{z})=\mathbf{W}\mathbf{z}+\mathbf{u}$ where the weight matrix $\mathbf{W}$ is spectrally regularized to satisfy $\sigma_{\max}(\mathbf{W})\leq 1$. Suppose that the latent vector is normally distributed as $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},I)$ with zero mean and identity covariance matrix. Then,

  • For the $f$-GAN problem corresponding to an $f$ with non-decreasing $t^{2}f^{\prime\prime}(t)$ over $t\in(0,+\infty)$ and an unconstrained discriminator $D$ where the dimensions of data $\mathbf{X}$ and latent $\mathbf{Z}$ match, the $f$-GAN minimax problem has no Nash equilibrium solutions.

  • For the W2GAN problem with discriminator $D$ trained over $c$-concave functions, where $c$ is the quadratic cost, the W2GAN minimax problem has no Nash equilibrium solutions. Also, given a quadratic discriminator $D(\mathbf{x})=\mathbf{x}^{T}A\mathbf{x}+\mathbf{b}^{T}\mathbf{x}$ parameterized by $A,\mathbf{b}$, the W2GAN problem has no local Nash equilibria.

  • For the WGAN problem with one-dimensional $X,Z$ and a discriminator $D$ trained over 1-Lipschitz functions, the WGAN minimax problem has no Nash equilibria.

Proof.

We defer the proof to the Appendix. Note that the condition on the $f$-GAN holds for all $f$-GAN examples in [10] including the vanilla GAN. ∎
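The setting of Theorem 1 is non-realizable because a generator with $\sigma_{\max}(\mathbf{W})\leq 1$ cannot reach the data variance $\sigma^{2}>1$. The sketch below illustrates only this point, in one dimension, using the standard closed form for the squared second-order Wasserstein distance between two univariate Gaussians, $(m_{1}-m_{2})^{2}+(s_{1}-s_{2})^{2}$. The helper name, the value $\sigma=2$, and the search grids are assumptions; the sketch does not reproduce the non-existence argument itself, which is in the Appendix.

```python
def w2_squared_gaussian_1d(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) in one dimension."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

sigma = 2.0                                   # data standard deviation, sigma > 1
ws = [i / 100 for i in range(-100, 101)]      # |w| <= 1: the spectral constraint in 1-D
us = [i / 100 for i in range(-100, 101)]      # hypothetical search range for the offset u

# G(z) = w z + u with Z ~ N(0, 1) produces N(u, w^2), whose standard deviation is |w|.
best = min((w2_squared_gaussian_1d(0.0, sigma, u, abs(w)), w, u) for w in ws for u in us)
print(best)   # (smallest squared distance, w, u)
```

The best feasible generator sits on the boundary $|w|=1$ with $u=0$ and still leaves a strictly positive distance $(\sigma-1)^{2}$ to the data distribution.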

Theorem 1 shows that under the stated assumptions the GAN zero-sum game does not have Nash equilibrium solutions. Consequently, the optimal divergence-minimizing generative model does not result in a Nash equilibrium. In contrast, the following remark shows that the GAN zero-sum game in a non-realizable case may still have Nash equilibrium solutions when Theorem 1’s assumptions do not hold.

Remark 1.

Consider the same setting as in Theorem 1. However, unlike Theorem 1, suppose that $\sigma<1$ and $\sigma_{\min}(\mathbf{W})\geq 1$, where $\sigma_{\min}$ stands for the minimum singular value. Then, for the WGAN and W2GAN problems described in Theorem 1, the Wasserstein distance-minimizing generator results in a Nash equilibrium.

Proof.

We defer the proof to the Appendix. ∎

The above remark explains that the phenomenon shown in Theorem 1 does not always hold in non-realizable GAN settings. As a result, we need other notions of equilibrium which consistently explain optimality in GAN games.

6 Proximal Equilibrium: A Relaxation of Nash Equilibrium

To define a proper notion of equilibrium for GANs, note that due to the sequential nature of GAN games the equilibrium notion should be flexible enough to allow, to some extent, the optimization of the discriminator around the equilibrium solution. This property is in fact consistent with the stability observed for first-order GAN training methods, where the alternating first-order method stabilizes around a certain solution. To this end, we consider the following objective for a GAN problem with minimax objective $V(G,D)$:

\[ V_{\lambda}^{\operatorname{prox}}(G,D)\,\coloneqq\,\max_{\widetilde{D}\in\mathcal{D}}\,V(G,\widetilde{D})-\frac{\lambda}{2}\bigl\|\widetilde{D}-D\bigr\|^{2}. \tag{6.1} \]

The above definition represents the application of a proximal operator to $V(G,D)$, which further optimizes the original objective in the proximity of discriminator $D$. To keep the function variable $\widetilde{D}$ close to $D$, we penalize the distance between the two functions in the proximal optimization. Here the distance is measured using a norm $\|\cdot\|$ on the discriminator function space.
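As a minimal sketch of how the proximal objective (6.1) can be evaluated, the code below uses a scalar toy problem in which both players are real numbers and the norm is the absolute value. The quadratic objective `V` is a hypothetical stand-in and not a GAN objective; the inner maximization is solved by plain gradient ascent started at the current discriminator, and the closed-form maximizer available for this particular quadratic is used only as a cross-check.

```python
def V(g, d):
    """Hypothetical toy objective (not a GAN): concave in d for every g."""
    return -g * g + 4.0 * g * d - d * d

def prox_objective(g, d, lam, lr=0.1, steps=500):
    """Approximate max over dt of V(g, dt) - lam/2 * (dt - d)^2 by gradient ascent."""
    dt = d                                        # start at the current discriminator
    for _ in range(steps):
        grad = 4.0 * g - 2.0 * dt - lam * (dt - d)
        dt += lr * grad
    return V(g, dt) - 0.5 * lam * (dt - d) ** 2, dt

def prox_closed_form(g, d, lam):
    """Closed-form maximizer for this quadratic toy, used only as a cross-check."""
    dt = (4.0 * g + lam * d) / (2.0 + lam)
    return V(g, dt) - 0.5 * lam * (dt - d) ** 2, dt

print(prox_objective(0.5, 0.3, lam=1.0))
print(prox_closed_form(0.5, 0.3, lam=1.0))
print(V(0.5, 0.3))   # the proximal value is never below the original objective
```

The step size in this sketch is tuned to the toy problem; for other values of `lam` it would need to be reduced so that the ascent remains stable.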

To extend the notion of Nash equilibrium to general minimax problems, we propose considering the Nash equilibria of the defined $V_{\lambda}^{\operatorname{prox}}(G,D)$.

Definition 1.

We call $(G^{*},D^{*})$ a $\lambda$-proximal equilibrium for $V(G,D)$ if it represents a Nash equilibrium for $V_{\lambda}^{\operatorname{prox}}(G,D)$, i.e., for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$

\[ V_{\lambda}^{\operatorname{prox}}(G^{*},D)\leq V_{\lambda}^{\operatorname{prox}}(G^{*},D^{*})\leq V_{\lambda}^{\operatorname{prox}}(G,D^{*}). \tag{6.2} \]

The next proposition provides necessary and sufficient conditions in terms of the original objective $V(G,D)$ for the proximal equilibrium solutions.

Proposition 2.

$(G^{*},D^{*})$ is a $\lambda$-proximal equilibrium if and only if for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ we have

\[ V(G^{*},D)\leq V(G^{*},D^{*})\leq\max_{\widetilde{D}\in\mathcal{D}}V(G,\widetilde{D})-\frac{\lambda}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}. \]

Therefore, if $(G^{*},D^{*})$ is a $\lambda$-proximal equilibrium it will give a global minimax solution, i.e., $G^{*}\in\mathcal{G}$ minimizes the worst-case objective, $\max_{D\in\mathcal{D}}V(G,D)$, with $D^{*}$ being its optimal solution.

Proof.

We defer the proof to the Appendix. ∎

The following result shows the proximal equilibria provide a hierarchy of equilibrium solutions for different $\lambda$ values.

Proposition 3.

Define $\operatorname{PE}_{\lambda}(V)$ to be the set of the $\lambda$-proximal equilibria for $V(G,D)$. Then, if $\lambda_{1}\leq\lambda_{2}$,

\[ \operatorname{PE}_{\lambda_{2}}(V)\subseteq\operatorname{PE}_{\lambda_{1}}(V). \tag{6.3} \]
Proof.

We defer the proof to the Appendix. ∎

Note that as $\lambda$ approaches infinity, $V_{\lambda}^{\operatorname{prox}}(G,D)$ tends to the original $V(G,D)$, implying that $\operatorname{PE}_{\lambda=+\infty}(V)$ is the set of $V(G,D)$’s Nash equilibria. In contrast, for $\lambda=0$ the proximal objective becomes the worst-case objective $\max_{D\in\mathcal{D}}V(G,D)$. As a result, $\operatorname{PE}_{\lambda=0}(V)$ is the set of global minimax solutions described in Proposition 2.
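The two limits and the nesting in Proposition 3 can be observed on a scalar toy problem. The sketch below is an illustration under stated assumptions, not a GAN: the quadratic `V` is hypothetical, its proximal objective has the closed form used in `V_prox`, and `is_proximal_equilibrium` only checks the two inequalities of Proposition 2 on a finite grid over $[-1,1]$. For this toy objective, the point $(0,0)$ passes the check for $\lambda\leq 6$ and fails it for larger $\lambda$.

```python
def V(g, d):
    """Hypothetical toy objective (not a GAN), concave in d."""
    return -g * g + 4.0 * g * d - d * d

def V_prox(g, d, lam):
    """Closed-form proximal objective for this toy: V + (dV/dd)^2 / (2 (2 + lam))."""
    return V(g, d) + (4.0 * g - 2.0 * d) ** 2 / (2.0 * (2.0 + lam))

def is_proximal_equilibrium(g_star, d_star, lam, tol=1e-12):
    """Check the two inequalities of Proposition 2 on a grid over [-1, 1]."""
    grid = [i / 50 for i in range(-50, 51)]
    v_star = V(g_star, d_star)
    d_ok = all(V(g_star, d) <= v_star + tol for d in grid)
    g_ok = all(V_prox(g, d_star, lam) >= v_star - tol for g in grid)
    return d_ok and g_ok

g, d = 0.5, 0.3
print(V_prox(g, d, 0.0), 3.0 * g * g)   # lam = 0: the worst-case objective max_d V(g, d)
print(V_prox(g, d, 1e9), V(g, d))       # very large lam: close to the original objective
print([(lam, is_proximal_equilibrium(0.0, 0.0, lam)) for lam in (0.0, 1.0, 6.0, 7.0, 100.0)])
```

In this toy game the set of $\lambda$ values for which $(0,0)$ is a proximal equilibrium is an interval starting at zero, consistent with the nesting in (6.3), and the point is a global minimax solution without being a Nash equilibrium of `V` itself.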

Concerning the proximal optimization problem in (6.1), the following proposition shows that if the original minimax objective is a smooth function of the discriminator parameters, the proximal optimization can be solved efficiently and therefore one can efficiently compute the gradient of the proximal objective.

Proposition 4.

Consider the maximization problem in the definition of the proximal objective (6.1), where generator $G_{\bm{\theta}}$ and discriminator $D_{\mathbf{w}}$ are parameterized by vectors $\bm{\theta},\,\mathbf{w}$, respectively. Suppose that

  • For the considered discriminator norm $\|\cdot\|$, $\|D_{\mathbf{w}}-D\|^{2}$ is $\eta_{1}$-strongly convex in $\mathbf{w}$ for any function $D$, i.e., for any $\mathbf{w},\mathbf{w}^{\prime},D$:

    \[ \bigl\|\nabla_{\mathbf{w}}\|D_{\mathbf{w}}-D\|^{2}\,-\,\nabla_{\mathbf{w}}\|D_{\mathbf{w}^{\prime}}-D\|^{2}\bigr\|_{2}\,\geq\,\eta_{1}\bigl\|\mathbf{w}-\mathbf{w}^{\prime}\bigr\|_{2}, \]

  • For every $G_{\bm{\theta}}$, the GAN minimax objective $V(G_{\bm{\theta}},D_{\mathbf{w}})$ is $\eta_{2}$-smooth in $\mathbf{w}$, i.e., for any $\mathbf{w},\mathbf{w}^{\prime},\bm{\theta}$:

    \[ \bigl\|\nabla_{\mathbf{w}}V(G_{\bm{\theta}},D_{\mathbf{w}})\,-\,\nabla_{\mathbf{w}}V(G_{\bm{\theta}},D_{\mathbf{w}^{\prime}})\bigr\|_{2}\,\leq\,\eta_{2}\|\mathbf{w}-\mathbf{w}^{\prime}\|_{2}. \]

Under the above assumptions, if $\eta_{2}<\frac{\lambda\eta_{1}}{2}$, the maximization objective in (6.1) is $\bigl(\frac{\lambda\eta_{1}}{2}-\eta_{2}\bigr)$-strongly concave. Then, the maximization problem has a unique solution $\mathbf{w}^{*}$, and if $V(G_{\bm{\theta}},D_{\mathbf{w}})$ is differentiable with respect to $\bm{\theta}$ we have

\[ \nabla_{\bm{\theta}}V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}})=\nabla_{\bm{\theta}}V(G_{\bm{\theta}},D_{\mathbf{w}^{*}}). \tag{6.4} \]
Proof.

We defer the proof to the Appendix. ∎

The above proposition suggests that under the mentioned assumptions, one can efficiently compute the optimal solution to the proximal maximization through a first-order optimization method. The assumptions require the smoothness of the GAN minimax objective with respect to the discriminator parameters, which can be imposed by applying norm-based regularization tools to neural network discriminators.
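The gradient identity (6.4) can be checked numerically on a scalar toy problem. The sketch below is a hedged illustration, not the training procedure of this paper: `V` is a hypothetical quadratic, both players are scalar parameters, and the inner maximizer is found by gradient ascent. For this toy problem $\eta_{1}=\eta_{2}=2$, so the choice $\lambda=4$ satisfies the condition $\eta_{2}<\lambda\eta_{1}/2$ of Proposition 4.

```python
def V(g, w):
    """Hypothetical toy objective (not a GAN); g and w are scalar parameters."""
    return -g * g + 4.0 * g * w - w * w

def inner_argmax(g, w, lam, lr=0.1, steps=200):
    """Gradient ascent on wt -> V(g, wt) - lam/2 * (wt - w)^2, started at w."""
    wt = w
    for _ in range(steps):
        wt += lr * (4.0 * g - 2.0 * wt - lam * (wt - w))
    return wt

def V_prox(g, w, lam):
    """Proximal objective evaluated at the inner maximizer."""
    wt = inner_argmax(g, w, lam)
    return V(g, wt) - 0.5 * lam * (wt - w) ** 2

def grad_g_V(g, w):
    """Partial derivative of V with respect to the generator parameter g."""
    return -2.0 * g + 4.0 * w

g, w, lam, h = 0.5, 0.3, 4.0, 1e-5
envelope = grad_g_V(g, inner_argmax(g, w, lam))                            # right-hand side of (6.4)
finite_diff = (V_prox(g + h, w, lam) - V_prox(g - h, w, lam)) / (2 * h)    # left-hand side, numerically
print(envelope, finite_diff)
```

In practice this means one generator update needs only the inner maximizer and a single gradient of the original objective, with no differentiation through the inner loop.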

7 Proximal Equilibrium in Wasserstein GANs

As shown earlier, GAN minimax games may not have any Nash equilibria in non-realizable settings. As a result, we seek a different notion of equilibrium which remains applicable to GAN problems. Here, we show the proposed proximal equilibrium provides such an equilibrium notion for Wasserstein GAN problems.

To define a proper proximal operator for defining proximal equilibria in Wasserstein GAN problems, we use the second-order Sobolev semi-norm averaged over the underlying distribution of data. Given the underlying distribution $P_{\mathbf{X}}$, we define the Sobolev semi-norm as

\[ \bigl\|D\bigr\|_{\dot{H}^{1}}\,\coloneqq\,\sqrt{\,\mathbb{E}_{P_{\mathbf{X}}}\Bigl[\bigl\|\nabla_{\mathbf{x}}D(\mathbf{X})\bigr\|^{2}_{2}\,\Bigr]}\,. \tag{7.1} \]

The above semi-norm is induced by the following semi-inner product and therefore leads to a semi-Hilbert space of functions:

\[ \langle D_{1},D_{2}\rangle_{\dot{H}^{1}}\coloneqq\mathbb{E}_{P_{\mathbf{X}}}\bigl[\,\nabla D_{1}(\mathbf{X})^{T}\nabla D_{2}(\mathbf{X})\,\bigr]. \tag{7.2} \]
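The semi-norm (7.1) and semi-inner product (7.2) can be estimated by Monte Carlo. The sketch below is a one-dimensional illustration under stated assumptions: the data distribution is taken to be $\mathcal{N}(0,\sigma^{2})$ with $\sigma=2$, gradients are supplied analytically, and the helper names are hypothetical. For $D(x)=0.75x^{2}$ the exact value is $1.5\sigma$, and any constant function has zero semi-norm, which is why (7.1) is only a semi-norm.

```python
import math
import random

def sobolev_inner(grad_d1, grad_d2, samples):
    """Monte Carlo estimate of E_P[grad D1(X) * grad D2(X)] for scalar X."""
    return sum(grad_d1(x) * grad_d2(x) for x in samples) / len(samples)

def sobolev_seminorm(grad_d, samples):
    """Semi-norm induced by the semi-inner product above."""
    return math.sqrt(sobolev_inner(grad_d, grad_d, samples))

random.seed(0)
sigma = 2.0
samples = [random.gauss(0.0, sigma) for _ in range(200_000)]

grad_quadratic = lambda x: 1.5 * x    # gradient of D(x) = 0.75 x^2 (and of 0.75 x^2 + 7)
grad_constant = lambda x: 0.0         # gradient of any constant function

print(sobolev_seminorm(grad_quadratic, samples), 1.5 * sigma)   # estimate next to the exact value
print(sobolev_seminorm(grad_constant, samples))                 # constants have zero semi-norm
```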

Throughout our discussion, we consider a parameterized set of generators $\mathcal{G}=\{G_{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$. For a GAN minimax objective $V(G,D)$, we define $D^{\bm{\theta}}$ to be the optimal discriminator function for the parameterized generator $G_{\bm{\theta}}$:

\[ D^{\bm{\theta}}\coloneqq\underset{D\in\mathcal{D}}{\arg\!\max}\>V(G_{\bm{\theta}},D). \tag{7.3} \]

The following theorem shows that the Wasserstein distance-minimizing generator function in the second-order Wasserstein GAN problem satisfies the conditions of a proximal equilibrium based on the Sobolev semi-norm defined in (7.1).

Theorem 2.

Consider the second-order Wasserstein GAN problem (4.8) with a quadratic cost $c(\mathbf{x},\mathbf{x}^{\prime})=\frac{\eta}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2}$. Suppose that the set of optimal discriminators $\{D^{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$ is convex. Then, $(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})$ for the Wasserstein distance-minimizing generator $G_{\bm{\theta}^{*}}\in\mathcal{G}$ will provide a $\frac{1}{\eta}$-proximal equilibrium with respect to the Sobolev norm in (7.1).

Proof.

We defer the proof to the Appendix. ∎

The above theorem shows that while, as demonstrated in Theorem 1, the W2GAN problem may have no local Nash equilibrium solutions, the proximal equilibrium exists for the W2GAN problem and holds at the Wasserstein-distance minimizing generator $G_{\bm{\theta}^{*}}$. The next theorem extends this result to the first-order Wasserstein GAN (WGAN) problem.

Theorem 3.

Consider the WGAN problem (4.7) minimizing the first-order Wasserstein distance. For each $G_{\bm{\theta}}$, define $\alpha_{\bm{\theta}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{\geq 0}$ to be the magnitude of the resulting optimal transport map from $\mathbf{X}$ to $G_{\bm{\theta}}(\mathbf{Z})$, i.e., $\mathbf{X}-\alpha^{2}_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})$ shares the same distribution with $G_{\bm{\theta}}(\mathbf{Z})$. (Footnote 1: As shown in the proof, such a mapping $\alpha_{\bm{\theta}}$ exists under mild regularity assumptions.) Given these definitions, assume that

  • $\{\alpha_{\bm{\theta}}(\cdot)\nabla D^{\bm{\theta}}(\cdot):\,\bm{\theta}\in\Theta\}$ is a convex set,

  • for every $\mathbf{x}$ and $\bm{\theta}$, $\frac{\eta}{2}\leq\alpha_{\bm{\theta}}^{2}(\mathbf{x})$ holds for a constant $\eta$.

Then, $(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})$ for the Wasserstein distance-minimizing generator function $G_{\bm{\theta}^{*}}$ provides an $\eta$-proximal equilibrium with respect to the Sobolev norm in (7.1).

Proof.

We defer the proof to the Appendix. ∎

The above theorem shows that if the squared magnitude $\alpha_{\bm{\theta}}^{2}$ of the optimal transport map is everywhere lower-bounded by $\frac{\eta}{2}$, then the Wasserstein distance-minimizing generator in the WGAN problem yields an $\eta$-proximal equilibrium.

8 Proximal Training

As shown for Wasserstein GAN problems, given the defined Sobolev norm and a small enough $\lambda$, the proximal objective $V_{\lambda}^{\operatorname{prox}}(G,D)$ will possess a Nash equilibrium solution. This result motivates performing the minimax optimization for the proximal objective $V_{\lambda}^{\operatorname{prox}}(G,D)$ instead of the original objective $V(G,D)$. Therefore, we propose proximal training in which we solve the following minimax optimization problem:

$$\min_{G_{\bm{\theta}}\in\mathcal{G}}\;\max_{D_{\mathbf{w}}\in\mathcal{D}}\;V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}}), \tag{8.1}$$

with the proximal operator defined according to the Sobolev norm in (7.1).

In order to take the gradient of $V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}})$ with respect to $\bm{\theta}$, Proposition 4 suggests solving the proximal optimization followed by computing the gradient of the original objective $V(G_{\bm{\theta}},D_{\mathbf{w}^{*}})$ where the discriminator is parameterized with the optimal solution $\mathbf{w}^{*}$ to the proximal optimization.

Algorithm 1 GAN Proximal Training
  Input: data $\mathbf{x}_{i}$, size $n$
  Initialize the parameters $\mathbf{w}^{(0)},\bm{\theta}^{(0)}$
  for $k=0$ to MAX_ITER do
     Update $\mathbf{w}^{(k+1)}\,=\,\underset{\mathbf{w}}{\arg\!\max}\>V(G_{\bm{\theta}^{(k)}},D_{\mathbf{w}})-\frac{\lambda}{2n}\sum_{i=1}^{n}\|\nabla_{\mathbf{x}}D_{\mathbf{w}}(\mathbf{x}_{i})-\nabla_{\mathbf{x}}D_{\mathbf{w}^{(k)}}(\mathbf{x}_{i})\|^{2}_{2}$.
     Update $\bm{\theta}^{(k+1)}=\bm{\theta}^{(k)}-\gamma_{k}\nabla_{\bm{\theta}}V(G_{\bm{\theta}^{(k)}},D_{\mathbf{w}^{(k+1)}})$.
  end for

Algorithm 1 summarizes the two main steps of proximal training. At every iteration, the discriminator is optimized with an additive Sobolev norm penalty that forces it to remain in the proximity of the current discriminator. Next, the generator is updated by a gradient descent step with the gradient evaluated at the optimal discriminator solving the proximal optimization. The stepsize parameter $\gamma_{k}$ can be adaptively selected at every iteration $k$. In practice, we can solve the proximal maximization problem via a first-order optimization method for a certain number of iterations. Assuming the conditions of Proposition 4 hold, the proximal optimization amounts to maximizing a strongly-concave objective, which first-order optimization methods solve at a linear convergence rate.
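The two steps of Algorithm 1 can be sketched on a toy problem. Everything below is an assumption made for illustration and is not the experimental setup of this paper: a shift generator $G_{\theta}(z)=z+\theta$, a linear discriminator $D_{w}(x)=wx$ (so that $\nabla_{x}D_{w}=w$ and the Sobolev penalty reduces to $\frac{\lambda}{2}(w-w^{(k)})^{2}$), an added $\frac{\rho}{2}w^{2}$ regularizer that keeps the objective concave in $w$, a constant stepsize, and plain gradient ascent for the inner maximization.

```python
# Toy, hypothetical setting (not the paper's networks):
#   V(theta, w) = mean(D_w(x_i)) - mean(D_w(G_theta(z_j))) - (rho / 2) * w**2
#   with G_theta(z) = z + theta and D_w(x) = w * x.

def mean(values):
    return sum(values) / len(values)


def V_grad_w(theta, w, xs, zs, rho):
    """dV/dw for the toy objective."""
    return mean(xs) - mean([z + theta for z in zs]) - rho * w


def V_grad_theta(theta, w):
    """dV/dtheta for the toy objective (the generator shifts every sample by theta)."""
    return -w


def proximal_max(theta, w_k, xs, zs, lam, rho, inner_steps=50, inner_lr=0.25):
    """Discriminator step: gradient ascent on V - (lam / 2) * (w - w_k)**2."""
    w = w_k
    for _ in range(inner_steps):
        # For D_w(x) = w * x, the penalty (lam / 2n) * sum_i |w - w_k|^2
        # equals (lam / 2) * (w - w_k)**2, whose derivative is lam * (w - w_k).
        grad = V_grad_w(theta, w, xs, zs, rho) - lam * (w - w_k)
        w += inner_lr * grad
    return w


def proximal_training(xs, zs, lam=1.0, rho=1.0, gamma=0.1, max_iter=300):
    theta, w = 0.0, 0.0
    for _ in range(max_iter):
        w = proximal_max(theta, w, xs, zs, lam, rho)    # first update of the loop
        theta -= gamma * V_grad_theta(theta, w)         # second update of the loop
    return theta, w


xs = [1.0, 2.0, 3.0]     # illustrative "data" with mean 2
zs = [-1.0, 0.0, 1.0]    # illustrative noise with mean 0
theta_final, w_final = proximal_training(xs, zs)
print(theta_final, w_final)
```

In this toy game the generator shift should approach the data mean and the discriminator slope should approach zero. With neural network players, the inner loop would instead run a fixed number of optimizer steps on mini-batches, as described in Section 9.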

9 Numerical Experiments

To evaluate the theoretical results of this work, we performed several experiments using the Wasserstein GAN implementation of [4], with the code available at that paper’s GitHub repository. In addition, we used the implementations of [5, 47] for applying spectral regularization to the discriminator network. In the experiments, we used the DC-GAN 4-layer CNN architecture [38] for both the discriminator and generator functions and ran each experiment for 200,000 generator iterations with 5 discriminator updates per generator update. We used the RMSprop optimizer [40] for WGAN experiments with weight clipping or spectral normalization and the Adam optimizer [39] for the other experiments.

9.1 Proximal Equilibrium in Wasserstein and Lipschitz GANs

(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 2: Optimizing the proximal objective over the generator with a fixed discriminator on MNIST and CelebA datasets. The SN-GAN’s objective and samples’ quality were preserved during the optimization.

We examined whether the solutions found by Wasserstein and Lipschitz vanilla GANs represent proximal equilibria. Toward this goal, we performed experiments similar to those of Section 3 for the WGAN-WC [3], WGAN-GP [4], and SN-GAN [5] problems over the MNIST and CelebA datasets. In Section 3, we observed that after fixing the trained discriminator $D_{\mathbf{w}_{\operatorname{final}}}$, the GAN’s minimax objective $V(G_{\bm{\theta}},D_{\mathbf{w}_{\operatorname{final}}})$ kept decreasing when we optimized only the generator $G_{\bm{\theta}}$. In the new experiments, we similarly fixed the trained discriminator $D_{\mathbf{w}_{\operatorname{final}}}$ resulting from the 200,000 training iterations, but instead of optimizing the GAN minimax objective we optimized the proximal objective defined by the norm (7.1) with $\lambda=0.1$. Thus, we solved the following optimization problem initialized at $\bm{\theta}_{\operatorname{final}}$, which denotes the parameters of the trained generator:

$$\min_{\bm{\theta}}\>V^{\operatorname{prox}}_{\lambda=0.1}(G_{\bm{\theta}},D_{\mathbf{w}_{\operatorname{final}}}). \tag{9.1}$$

We computed the gradient of the above proximal objective by applying the Adam optimizer for 50 steps to approximate the solution to the proximal optimization (6.1), which at every iteration was initialized at $\mathbf{w}_{\operatorname{final}}$. Figures 2(a) and 2(b) show that in the SN-GAN experiments the original minimax objective had only minor changes, compared to the results in Section 3, and the quality of generated samples did not change significantly during the optimization. We defer the similar numerical results of the WGAN-WC and WGAN-GP experiments to the Appendix. These numerical results suggest that while Wasserstein and Lipschitz GANs may not converge to local Nash equilibrium solutions as shown in Section 3, the solutions they find can still represent a local proximal equilibrium.

9.2 Proximal Training Improves Lipschitz GANs

Table 1: Inception scores for ordinary vs. proximal training

GAN Problem          Ordinary       Proximal
WGAN-WC (DIM=64)     4.16 ± 0.15    4.56 ± 0.19
WGAN-WC (DIM=128)    2.52 ± 0.12    4.23 ± 0.15
SN-GAN (DIM=64)      5.12 ± 0.25    5.72 ± 0.22
SN-GAN (DIM=128)     5.62 ± 0.23    6.12 ± 0.22

We applied the proximal training in Algorithm 1 to the WGAN-WC and SN-GAN problems. To compute the gradient of the proximal minimax objective, we solved the maximization problem in the first step of Algorithm 1’s for loop by applying 20 steps of Adam optimization initialized at the discriminator parameters at that iteration. Applying the proximal training to the MNIST, CIFAR-10, and CelebA datasets, we observed slightly better visual quality in the generated images. We postpone the generated samples to the Appendix.

To quantitatively compare the proximal and ordinary non-proximal GAN training, we measured the Inception scores of the samples generated in the CIFAR-10 experiments. As shown in Table 1, proximal training results in an improved Inception score. In this table, DIM stands for the dimension parameter of the DC-GAN’s CNN networks.

References

  • [1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [2] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [4] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [5] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • [6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.
  • [7] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232, 2017.
  • [8] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [10] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.
  • [11] Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding gans: the lqg setting. arXiv preprint arXiv:1710.10793, 2017.
  • [12] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
  • [13] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. arXiv preprint arXiv:1711.00141, 2017.
  • [14] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
  • [15] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
  • [16] Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12659–12670, 2019.
  • [17] Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, pages 11598–11610, 2019.
  • [18] Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
  • [19] Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of learning dynamics in stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
  • [20] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019.
  • [21] Tianyi Lin, Chi Jin, and Michael I Jordan. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, 2019.
  • [22] Qi Lei, Jason D Lee, Alexandros G Dimakis, and Constantinos Daskalakis. Sgd learns one-layer networks in wgans. arXiv preprint arXiv:1910.07030, 2019.
  • [23] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2020.
  • [24] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, 2018.
  • [25] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pages 5585–5595, 2017.
  • [26] Ya-Ping Hsieh, Chen Liu, and Volkan Cevher. Finding mixed nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002, 2018.
  • [27] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: Gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446, 2017.
  • [28] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev gan. arXiv preprint arXiv:1711.04894, 2017.
  • [29] David Berthelot, Thomas Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
  • [30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626–6637, 2017.
  • [31] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pages 1825–1835, 2017.
  • [32] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pages 2018–2028, 2017.
  • [33] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.
  • [34] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? arXiv preprint arXiv:1801.04406, 2018.
  • [35] Zhiming Zhou, Jiadong Liang, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Yong Yu, and Zhihua Zhang. Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687, 2019.
  • [36] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [37] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [38] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [40] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 14(8), 2012.
  • [41] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
  • [42] Farzan Farnia and David Tse. A convex duality framework for gans. In Advances in Neural Information Processing Systems, pages 5248–5258, 2018.
  • [43] Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving gans using optimal transport. arXiv preprint arXiv:1803.05573, 2018.
  • [44] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training gans with regularized optimal transport. In Advances in Neural Information Processing Systems, pages 7091–7101, 2018.
  • [45] Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.
  • [46] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
  • [47] Farzan Farnia, Jesse Zhang, and David Tse. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2019.
  • [48] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [49] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [50] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
  • [51] Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.

Appendix A Numerical Results for Section 3

Here, we provide the complete numerical results for the experiments discussed in Section 3 of the main text. In addition to the plots shown in Section 3 for the SN-GAN implementation, we present the same plots for the Wasserstein GAN with weight clipping (WGAN-WC) and with gradient penalty (WGAN-GP) problems. Figures 3(a)-4(b) repeat the experiments of Figure 1 in the main text for the WGAN-WC and WGAN-GP problems. These plots suggest that a similar result also holds for the WGAN-WC and WGAN-GP problems, where the objective and the generated samples’ quality decreased during the generator optimization. For a larger set of generated samples in the main text’s Figure 1 and Figures 3(a)-4(b), we refer the readers to Figures 5(a)-7(b).

(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 3: Optimizing the trained generator of WGAN-WC with a fixed discriminator on the MNIST and CelebA data. The GAN’s objective and samples’ quality were decreasing over the optimization.
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 4: Optimizing the trained generator of WGAN-GP with a fixed discriminator on the MNIST and CelebA data. The GAN’s objective and samples’ quality were decreasing over the optimization.
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 5: SN-GAN’s generated samples at the iterations marked in Figures 1(a) & 1(b)
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 6: WGAN-WC’s generated samples at the iterations marked in Figures 3(a) & 3(b)
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 7: WGAN-GP’s generated samples at the iterations marked in Figures 4(a) & 4(b)

Appendix B Numerical Results for Section 9

Here, we present the complete numerical results for the experiments of Section 9 in the main text. Figures 8(a)-9(b) repeat the experiments of the main text’s Figure 2 for the WGAN-WC and WGAN-GP problems. Here, except for the WGAN-GP experiment on the CelebA dataset, we observed that the objective and the generated samples’ quality did not significantly decrease over the generator optimization. Even for the WGAN-GP experiment on the CelebA data, the decrease in the objective value was three times smaller than when minimizing the original objective instead of the proximal objective. These experiments suggest that the Wasserstein and Lipschitz GAN problems can converge to local proximal equilibrium solutions. Figures 10(a)-12(b) show a larger group of generated samples at the first and last iterations of the main text’s Figure 2 and Figures 8(a)-9(b).

For the proximal training experiments, Figures 13-15(b) show the samples generated by the SN-GAN and WGAN-WC proximally trained on CIFAR-10 and CelebA data with the results for the baseline regular training on the top of the figure and the results for proximal training on the bottom. We observed a somewhat improved quality achieved by proximal training, which was further supported by the inception scores for the CIFAR-10 experiments reported in the main text.

(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 8: Optimizing the proximal objective for the trained generator in WGAN-WC with a fixed discriminator on the MNIST and CelebA data. The GAN’s objective and samples’ quality were preserved over the optimization.
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 9: Optimizing the proximal objective for the trained generator in WGAN-GP with a fixed discriminator on the MNIST and CelebA data. The GAN’s objective and samples’ quality were preserved over the optimization.
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 10: SN-GAN’s generated samples at the first and last iterations of Figures 2(a),2(b)
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 11: WGAN-WC’s generated samples at the first and last iterations of Figures 8(a),8(b)
(a) Results on the MNIST data
(b) Results on the CelebA data
Figure 12: WGAN-GP’s generated samples at the first and last iterations of Figures 9(a),9(b)
Figure 13: The images generated by the SN-GAN (DIM=128) trained on CIFAR-10 data with (top) ordinary and (bottom) proximal training.
(a) WGAN-WC with DCGAN-DIM=64
(b) WGAN-WC with DCGAN-DIM=128
Figure 14: The images generated by the WGAN-WC trained on CIFAR-10 data with (top) ordinary and (bottom) proximal training.
(a) The results for SN-GAN
(b) The results for WGAN-WC
Figure 15: The images generated by the GAN trained on CelebA data with (top) ordinary and (bottom) proximal training.

Appendix C Proofs

C.1 Proof of Proposition 1

Proof for $f$-GANs:

Consider the following $f$-GAN minimax problem corresponding to the convex function $f$:

$$\min_{G\in\mathcal{G}}\>\max_{D}\>\mathbb{E}\bigl[D(\mathbf{X})\bigr]-\mathbb{E}\bigl[f^{*}(D(G(\mathbf{Z})))\bigr]. \tag{C.1}$$

Due to the realizability assumption, given $G^{*}\in\mathcal{G}$ we assume that the data distribution and the generative model are identical, i.e., $P_{\mathbf{X}}=P_{G^{*}(\mathbf{Z})}$. Then, the minimax objective for $G^{*}$ reduces to

$$\mathbb{E}_{P_{\mathbf{X}}}\bigl[\,D(\mathbf{X})-f^{*}(D(\mathbf{X}))\,\bigr]. \tag{C.2}$$

The above objective decouples across $\mathbf{X}$ outcomes. As a result, the maximizing discriminator $D^{*}(\mathbf{x})=f^{\prime}(1)$ will be a constant function where the constant value $f^{\prime}(1)$ follows from the optimization problem:

$$f^{\prime}(1)=\underset{u\in\mathbb{R}}{\arg\!\max}\>u-f^{*}(u). \tag{C.3}$$

Note that the objective $u-f^{*}(u)$ is a concave function of $u$ whose derivative is zero at ${{f^{*}}^{\prime}}^{-1}(1)=f^{\prime}(1)$, because the Fenchel-conjugate of a convex $f$ satisfies ${{f^{*}}^{\prime}}^{-1}=f^{\prime}$.

So far we have proved that the constant function $D_{\operatorname{constant}}(\mathbf{x})=f^{\prime}(1)$ provides the optimal discriminator for generator $G^{*}$. Therefore, for every discriminator $D$ we have

$$V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}}), \tag{C.4}$$

where $V(G,D)$ denotes the $f$-GAN’s minimax objective. Moreover, note that for a constant $D$ the value of the minimax objective does not change with generator $G$. As a result, for every $G$

$$V(G,D_{\operatorname{constant}})=V(G^{*},D_{\operatorname{constant}}). \tag{C.5}$$

Then, (C.4) and (C.5) collectively prove that for every $G$ and $D$ we have

$$V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}})\leq V(G,D_{\operatorname{constant}}),$$

which completes the proof for $f$-GANs.
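As a worked instance of (C.3), take the KL-divergence generator $f(t)=t\log t$. This choice is ours for illustration only; the argument above does not depend on it.

```latex
\begin{aligned}
f(t) &= t\log t, \qquad f'(t) = \log t + 1, \qquad
f^{*}(u) = \sup_{t>0}\,\bigl\{ut - t\log t\bigr\} = e^{u-1}, \\
\frac{\mathrm{d}}{\mathrm{d}u}\bigl[u - f^{*}(u)\bigr] &= 1 - e^{u-1} = 0
\;\Longleftrightarrow\; u = 1 = f'(1), \\
V(G^{*}, D_{\operatorname{constant}}) &= 1 - f^{*}(1) = 1 - e^{0} = 0 .
\end{aligned}
```

The maximizer is the constant $f'(1)=1$, and the optimal value $0$ matches the fact that the KL divergence between identical distributions is zero.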

Proof for Wasserstein GANs:

Consider a general Wasserstein GAN problem with a cost function $c$ satisfying $c(\mathbf{x},\mathbf{x})=0$ for every $\mathbf{x}$. Notice that this property holds for all Wasserstein distance measures corresponding to cost function $\|\mathbf{x}-\mathbf{x}^{\prime}\|^{q}$ for $q\geq 1$. The generalized Wasserstein GAN minimax problem is as follows:

$$\min_{G\in\mathcal{G}}\>\max_{D\,\text{$c$-concave}}\>\mathbb{E}[D(\mathbf{X})]-\mathbb{E}\bigl[D^{c}(G(\mathbf{Z}))\bigr]. \tag{C.6}$$

Due to the realizability assumption, a generator function $G^{*}\in\mathcal{G}$ results in the data distribution such that $P_{G^{*}(\mathbf{Z})}=P_{\mathbf{X}}$. Then, the above minimax objective for $G^{*}$ reduces to

$$\mathbb{E}_{P_{\mathbf{X}}}\bigl[\,D(\mathbf{X})-D^{c}(\mathbf{X})\,\bigr]. \tag{C.7}$$

Since the cost is assumed to take a zero value given identical inputs, we have:

$$\begin{aligned}
D^{c}(\mathbf{x}) &:= \max_{\mathbf{x}^{\prime}}\,D(\mathbf{x}^{\prime})-c(\mathbf{x},\mathbf{x}^{\prime}) \\
&\geq D(\mathbf{x})-c(\mathbf{x},\mathbf{x}) \\
&= D(\mathbf{x}).
\end{aligned}$$

As a result, $D(\mathbf{x})-D^{c}(\mathbf{x})\leq 0$ holds for every $\mathbf{x}$. Hence, the objective in (C.7) will be non-positive and takes its maximum zero value for any constant function $D_{\operatorname{constant}}$, which by definition satisfies $c$-concavity. Therefore, letting $V(G,D)$ denote the GAN minimax objective, for every $D$ we have

$$V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}}). \tag{C.8}$$

We also know that for a constant discriminator $D_{\operatorname{constant}}$ the value of the minimax objective is independent from the generator function. Therefore, for every $G$ we have

$$V(G^{*},D_{\operatorname{constant}})=V(G,D_{\operatorname{constant}}). \tag{C.9}$$

As a result, (C.8) and (C.9) together show that for every $G$ and $D$

$$V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}})\leq V(G,D_{\operatorname{constant}}), \tag{C.10}$$

which makes the proof complete for Wasserstein GANs.
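The key inequality $D^{c}(\mathbf{x})\geq D(\mathbf{x})$ used in this proof can be checked numerically. The sketch below is an illustration under our own assumptions: the maximization in the $c$-transform is restricted to a finite one-dimensional grid, the cost is $|x-x'|$, and the function names and test discriminators are hypothetical.

```python
def c_transform(D, grid, cost):
    """D^c(x) = max over x' in the grid of D(x') - c(x, x'), evaluated at each grid point."""
    return [max(D(xp) - cost(x, xp) for xp in grid) for x in grid]


def cost(x, xp):
    """First-order Wasserstein cost; it satisfies c(x, x) = 0."""
    return abs(x - xp)


def D_quadratic(x):
    """An arbitrary non-constant test discriminator."""
    return 0.5 * x * x - x


def D_const(x):
    """A constant discriminator."""
    return 1.25


grid = [i / 10.0 for i in range(-30, 31)]   # illustrative grid on [-3, 3]

# Pointwise values of D(x) - D^c(x), the integrand of (C.7).
gap_quadratic = [D_quadratic(x) - dc
                 for x, dc in zip(grid, c_transform(D_quadratic, grid, cost))]
gap_const = [D_const(x) - dc
             for x, dc in zip(grid, c_transform(D_const, grid, cost))]

print(max(gap_quadratic), min(gap_quadratic), set(gap_const))
```

Every entry of `gap_quadratic` is non-positive because the maximization includes the choice $x'=x$, while the constant discriminator attains the upper bound of zero at every grid point.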

C.2 Proof of Theorem 1 & Remark 1

Proof for $f$-GANs:

Lemma 1.

Consider two random vectors $\mathbf{X},\widetilde{\mathbf{X}}$ with probability density functions $p,\,q$, respectively. Suppose that $p,q$ are non-zero everywhere. Then, considering the following variational representation of $d_{f}(P,Q)$,

$$d_{f}(P,Q)=\max_{D}\>\mathbb{E}[D(\mathbf{X})]-\mathbb{E}[f^{*}(D(\widetilde{\mathbf{X}}))], \tag{C.11}$$

the optimal solution $D^{*}$ will satisfy

$$D^{*}(\mathbf{x})=f^{\prime}\Bigl(\frac{p(\mathbf{x})}{q(\mathbf{x})}\Bigr). \tag{C.12}$$

Proof.

Let us rewrite the $f$-divergence’s variational representation as

$$\begin{aligned}
d_{f}(P,Q)\, &=\,\max_{D}\>\mathbb{E}[D(\mathbf{X})]-\mathbb{E}[f^{*}(D(\widetilde{\mathbf{X}}))] \\
&=\,\max_{D}\>\int\bigl[p(\mathbf{x})D(\mathbf{x})-q(\mathbf{x})f^{*}(D(\mathbf{x}))\bigr]\,\mathrm{d}\mathbf{x} \\
&=\,\int\max_{D(\mathbf{x})}\bigl[p(\mathbf{x})D(\mathbf{x})-q(\mathbf{x})f^{*}(D(\mathbf{x}))\bigr]\,\mathrm{d}\mathbf{x}
\end{aligned}$$

where the last equality holds, since the maximization objective decouples across $\mathbf{x}$ values. It can be seen that the inside optimization problem for each $D(\mathbf{x})$ is maximizing a concave objective in which by setting the derivative to zero we obtain

$${f^{*}}^{\prime}(D^{*}(\mathbf{x}))=\frac{p(\mathbf{x})}{q(\mathbf{x})}. \tag{C.13}$$

As a property of the Fenchel-conjugate of a convex $f$, we know ${{f^{*}}^{\prime}}^{-1}=f^{\prime}$ which combined with the above equation implies that

$$D^{*}(\mathbf{x})=f^{\prime}\Bigl(\frac{p(\mathbf{x})}{q(\mathbf{x})}\Bigr). \tag{C.14}$$

The above result completes Lemma 1’s proof. ∎
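Lemma 1 can be checked pointwise for a specific divergence. The following sketch assumes the KL generator $f(t)=t\log t$, for which $f'(t)=\log t+1$ and $f^{*}(u)=e^{u-1}$; the density values and the search grid are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def f_prime(t):
    """f'(t) for the KL generator f(t) = t * log(t)."""
    return math.log(t) + 1.0


def f_conj(u):
    """Fenchel conjugate f*(u) = exp(u - 1) of f(t) = t * log(t)."""
    return math.exp(u - 1.0)


def pointwise_objective(u, p, q):
    """Integrand p(x) * D(x) - q(x) * f*(D(x)) at a single x, with D(x) = u."""
    return p * u - q * f_conj(u)


p, q = 0.6, 0.2                       # illustrative density values at one point x
u_star = f_prime(p / q)               # candidate maximizer given by (C.12)

# Brute-force search over a grid of step 0.01 centered at the candidate.
candidates = [u_star + d / 100.0 for d in range(-300, 301)]
u_best = max(candidates, key=lambda u: pointwise_objective(u, p, q))

# Stationarity condition (C.13): f*'(u) = exp(u - 1) should equal p / q at u_star.
stationarity_gap = math.exp(u_star - 1.0) - p / q
print(u_star, u_best, stationarity_gap)
```

On this grid the brute-force maximizer should coincide with $f'(p/q)$, and the stationarity gap should vanish up to rounding.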

Consider the $f$-GAN problem with the generator function specified in the theorem:

$$\min_{\mathbf{W},\mathbf{u}:\,\|\mathbf{W}\|_{2}\leq 1}\>\max_{D}\>\mathbb{E}\bigl[D(\mathbf{X})\bigr]-\mathbb{E}\bigl[f^{*}\bigl(D(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr)\bigr]. \tag{C.15}$$

Note that $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$ and $\mathbf{W}\mathbf{Z}+\mathbf{u}\sim\mathcal{N}(\mathbf{u},\mathbf{W}\mathbf{W}^{T})$. Notice that if $\mathbf{W}$ were not full-rank, the maximized discriminator objective would be $+\infty$, achieved by a $D$ assigning an infinite value to the points not included in the rank-constrained support set of the generator $\mathbf{W}\mathbf{z}+\mathbf{u}$. This will not result in a solution to the $f$-GAN problem, because we assume that the dimensions of $\mathbf{X}$ and $\mathbf{Z}$ match each other and hence there exists a full-rank $\mathbf{W}$ with a finite maximized objective, i.e., a finite $f$-divergence value. Therefore, in a Nash equilibrium of the $f$-GAN problem, the solution $\mathbf{W}$ must be full-rank and invertible.

Lemma 1 results in the following equation for the optimal discriminator D𝐖,𝐮D^{*}_{\mathbf{W},\mathbf{u}} given generator parameters 𝐖,𝐮\mathbf{W},\mathbf{u}:

D𝐖,𝐮(𝐱)\displaystyle D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x})\, =f(1(2πσ2)kexp{12σ2𝐱22}1(2π)kdet(𝐖𝐖T)exp{12(𝐖𝐖T)1/2(𝐱𝐮)22})\displaystyle=\,f^{\prime}\bigl{(}\frac{\frac{1}{\sqrt{(2\pi\sigma^{2})^{k}}}\exp\big{\{}-\frac{1}{2\sigma^{2}}\big{\|}\mathbf{x}\big{\|}_{2}^{2}\big{\}}}{\frac{1}{\sqrt{(2\pi)^{k}\det(\mathbf{W}\mathbf{W}^{T})}}\exp\big{\{}-\frac{1}{2}\big{\|}(\mathbf{W}\mathbf{W}^{T})^{-1/2}\mathbf{(x-u)}\big{\|}_{2}^{2}\big{\}}}\bigl{)}
=f(det(𝐖𝐖T)σ2kexp{12𝐱T((𝐖𝐖T)1σ2I)𝐱𝐮T(𝐖𝐖T)1/2𝐱+𝐮T(𝐖𝐖T)1𝐮}).\displaystyle=\,f^{\prime}\bigl{(}\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}\exp\biggl{\{}\frac{1}{2}\mathbf{x}^{T}((\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I)\mathbf{x}-\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1/2}\mathbf{x}+\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{u}\biggr{\}}\,\bigr{)}.

As a result, the function $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\cdot))$ appearing in the $f$-GAN’s minimax objective will be

$$f^{*}\bigl(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x})\bigr) = f^{*}\biggl(f'\biggl(\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}\exp\Bigl\{\frac{1}{2}\mathbf{x}^{T}\bigl((\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I\bigr)\mathbf{x}-\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1/2}\mathbf{x}+\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{u}\Bigr\}\biggr)\biggr).$$

Claim: $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))$ is a strictly convex function of $\mathbf{x}$.

To show this claim, note that the following expression is a strongly convex quadratic function of $\mathbf{x}$, since we have assumed that the spectral norm of $\mathbf{W}$ is bounded as $\sigma_{\max}(\mathbf{W})\leq 1<\sigma$, so that $(\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I\succ 0$:

$$\frac{1}{2}\mathbf{x}^{T}\bigl((\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I\bigr)\mathbf{x}-\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1/2}\mathbf{x}+\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{u}.$$

For simplicity, we denote the above strongly convex function by $g(\mathbf{x})$ and define the function $h:\mathbb{R}\rightarrow\mathbb{R}$ as

$$h(y):=f^{*}\Bigl(f'\Bigl(\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}\,e^{y}\Bigr)\Bigr).$$

According to the above definitions, $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))=h(g(\mathbf{x}))$ is the composition of $h$ and the strongly convex $g$. Note that $h$ is a monotonically increasing function, since defining $c=\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}>0$ we have

$$h'(y)=(c\,e^{y})^{2}\,f''(c\,e^{y})\geq 0, \tag{C.16}$$

which follows from the equality

$$f^{*}(f'(z)):=\sup_{u}\{uf'(z)-f(u)\}\,=\,zf'(z)-f(z),$$

a consequence of the definition of the Fenchel conjugate. This equality implies that $\frac{\mathrm{d}f^{*}(f'(z))}{\mathrm{d}z}=zf''(z)$ for the convex $f$, and applying the chain rule with $z=c\,e^{y}$ and $\frac{\mathrm{d}z}{\mathrm{d}y}=z$ gives (C.16). In fact, $h'(y)>0$ holds everywhere, because $f$ is assumed to be strictly convex, which proves that $h$ is strictly increasing. Furthermore, $h$ is a convex function, because $h'(y)$ is non-decreasing in $y$ due to the assumption that $t^{2}f''(t)$ is non-decreasing over $t\in(0,+\infty)$. As a result, $h$ is an increasing convex function.
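The identity $h'(y)=(c\,e^{y})^{2}f''(c\,e^{y})$ can be checked numerically on a concrete case. The sketch below is an illustration only: it assumes the KL generator $f(t)=t\log t$ (for which $t^{2}f''(t)=t$ is non-decreasing) and a hypothetical constant `C` standing in for $c$; neither choice comes from the proof itself.

```python
import math

# Assumed example: KL generator f(t) = t*log(t), with conjugate f*(s) = exp(s - 1).
def f_prime(t):
    return math.log(t) + 1.0

def f_second(t):
    return 1.0 / t

def f_conj(s):
    return math.exp(s - 1.0)

C = 0.5  # hypothetical value of c = sqrt(det(W W^T) / sigma^(2k))

def h(y):
    # h(y) = f*(f'(c * e^y))
    return f_conj(f_prime(C * math.exp(y)))

def h_prime_formula(y):
    # Closed form from (C.16): (c e^y)^2 * f''(c e^y)
    z = C * math.exp(y)
    return z * z * f_second(z)

def h_prime_numeric(y, eps=1e-6):
    # Central finite difference of h
    return (h(y + eps) - h(y - eps)) / (2.0 * eps)

ys = [-2.0, -1.0, 0.0, 1.0, 2.0]
errors = [abs(h_prime_formula(y) - h_prime_numeric(y)) for y in ys]
slopes = [h_prime_formula(y) for y in ys]
```

For this choice of $f$, the slopes are positive and increasing in $y$, which matches the statement that $h$ is increasing and convex.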

Therefore, $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))=h(g(\mathbf{x}))$ is a composition of a strongly convex $g$ and an increasing convex $h:\mathbb{R}\rightarrow\mathbb{R}$. By a well-known result in convex optimization [48], the claim is true and $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))$ is a strictly convex function of $\mathbf{x}$.

We showed that the claim is true for every feasible $\mathbf{W},\mathbf{u}$. Now, we prove that the pair $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ will not be a local Nash equilibrium for any feasible $\mathbf{W},\mathbf{u}$. If the pair $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ were a local Nash equilibrium, $\mathbf{W},\mathbf{u}$ would be a local minimum of the following minimax objective, where $D^{*}$ is fixed to be $D^{*}_{\mathbf{W},\mathbf{u}}$:

$$\mathbb{E}\bigl[D^{*}(\mathbf{X})\bigr]-\mathbb{E}\bigl[f^{*}\bigl(D^{*}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr)\bigr]. \tag{C.17}$$

However, as shown earlier, for any feasible $\mathbf{W},\mathbf{u}$, $f^{*}(D^{*}(\mathbf{x}))$ is a strictly convex function of $\mathbf{x}$, which in turn shows that (C.17) is a strictly concave function of the variable $\mathbf{u}$. A strictly concave function has no local minimum over the unconstrained variable $\mathbf{u}$, which contradicts the assumed local optimality. Hence, a pair of the form $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ cannot be a local Nash equilibrium in parameters $\mathbf{W},\mathbf{u}$. Consequently, the minimax problem has no pure Nash equilibrium solutions, since in a pure Nash equilibrium the discriminator is by definition optimal against the choice of generator.
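The concavity argument can be illustrated numerically in one dimension. The following is a minimal sketch under stated assumptions, not part of the proof: it takes the KL generator $f(t)=t\log t$, for which $f^{*}(D^{*}(x))$ equals the density ratio $p_X(x)/p_G(x)$, and the hypothetical values $\sigma=2$, $w=1$, $u=0$. With the discriminator frozen at $D^{*}_{w,u}$, it evaluates the generator-dependent term of the objective for shifted means by trapezoidal integration.

```python
import math

SIGMA = 2.0      # assumed data standard deviation, sigma > 1
W, U = 1.0, 0.0  # assumed generator parameters at which D* is computed

def log_normal_pdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)

def generator_term(u_new, lo=-40.0, hi=40.0, n=8000):
    """Trapezoid estimate of E_{x ~ N(u_new, W^2)}[p_X(x) / p_G(x)],
    i.e. E[f*(D*(W Z + u_new))] for the KL generator with D* fixed at (W, U)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        # Combine the three log-densities before exponentiating to avoid overflow.
        log_val = (log_normal_pdf(x, 0.0, SIGMA)
                   - log_normal_pdf(x, U, W)
                   + log_normal_pdf(x, u_new, W))
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(log_val)
    return total * step

def objective_up_to_constant(u_new):
    # The term E[D*(X)] does not depend on u_new, so it is dropped.
    return -generator_term(u_new)

values = {u: objective_up_to_constant(u) for u in (-0.5, 0.0, 0.5)}
```

In this example the objective at the candidate point $u=0$ is larger than at $u=\pm 0.5$, so the generator can decrease its objective by moving $u$ in either direction while the discriminator stays fixed.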

Proof for W2GANs:

Consider the W2GAN problem with the assumed generator function:

$$\min_{\mathbf{W},\mathbf{u}:\,\|\mathbf{W}\|_{2}\leq 1}\;\max_{D\ c\text{-concave}}\;\mathbb{E}\bigl[D(\mathbf{X})\bigr]-\mathbb{E}\bigl[D^{c}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr], \tag{C.18}$$

where the $c$-transform is defined for the quadratic cost function $c(\mathbf{x},\mathbf{x}')=\frac{1}{2}\|\mathbf{x}-\mathbf{x}'\|_{2}^{2}$. Similar to the $f$-GAN case, define $D^{*}_{\mathbf{W},\mathbf{u}}$ to be the optimal discriminator for the generator function parameterized by $\mathbf{W},\mathbf{u}$. Note that $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$ and $\mathbf{W}\mathbf{Z}+\mathbf{u}\sim\mathcal{N}(\mathbf{u},\mathbf{W}\mathbf{W}^{T})$.

According to Brenier’s theorem [49], the optimal transport map from the Gaussian data distribution to the Gaussian generative model will be

$$\psi^{\operatorname{opt}}(\mathbf{x})=\mathbf{x}-\nabla_{\mathbf{x}}D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}).$$

As a well-known result regarding the second-order optimal transport map between two Gaussian distributions, the optimal transport map will be the linear transformation $\psi^{\operatorname{opt}}(\mathbf{x})=\frac{1}{\sigma}(\mathbf{W}\mathbf{W}^{T})^{1/2}\mathbf{x}+\mathbf{u}$. This result shows that

$$\nabla_{\mathbf{x}}D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x})=\Bigl(I-\frac{1}{\sigma}(\mathbf{W}\mathbf{W}^{T})^{1/2}\Bigr)\mathbf{x}-\mathbf{u}. \tag{C.19}$$

Note that the $c$-transform for the cost $c(\mathbf{x},\mathbf{x}')=\frac{1}{2}\|\mathbf{x}-\mathbf{x}'\|^{2}_{2}$ satisfies $D^{c}(\mathbf{x})=\bigl(\frac{1}{2}\|\mathbf{x}\|^{2}-D(\mathbf{x})\bigr)^{*}-\frac{1}{2}\|\mathbf{x}\|^{2}$, where $g^{*}$ denotes $g$’s Fenchel conjugate. For a general convex quadratic function $g(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}A\mathbf{x}+\mathbf{b}^{T}\mathbf{x}$ we have $g^{*}(\mathbf{x})=\frac{1}{2}(\mathbf{x}-\mathbf{b})^{T}A^{\dagger}(\mathbf{x}-\mathbf{b})$, where $A^{\dagger}$ denotes $A$’s Moore-Penrose pseudoinverse. Therefore, for the $c$-transform of the optimal discriminator we will have

$$\nabla_{\mathbf{x}}D^{*^{c}}_{\mathbf{W},\mathbf{u}}(\mathbf{x})=\bigl(\sigma((\mathbf{W}\mathbf{W}^{T})^{1/2})^{\dagger}-I\bigr)\mathbf{x}-\sigma((\mathbf{W}\mathbf{W}^{T})^{1/2})^{\dagger}\mathbf{u}.$$

Since every feasible $\mathbf{W}$ satisfies the bounded spectral norm condition $\sigma_{\max}(\mathbf{W})\leq 1<\sigma$, the optimal $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$ will be a quadratic function whose Hessian has at least one strictly positive eigenvalue, attained along the principal eigenvector of $\mathbf{W}\mathbf{W}^{T}$. This positive eigenvalue exists in the general case where $\mathbf{Z}$’s dimension can be even smaller than $\mathbf{X}$’s dimension. Under the stronger assumption that the two dimensions exactly match, as in the $f$-GAN problem considered above, the pseudoinverse $A^{\dagger}$ would coincide with the inverse $A^{-1}$, resulting in a strongly convex quadratic $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$. Nevertheless, as we prove here, the theorem’s result on W2GAN holds in the general case and does not require $\mathbf{X}$ and $\mathbf{Z}$ to have the same dimension.
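As an illustration that is not part of the original proof, the formulas for the optimal discriminator and its $c$-transform can be specialized to the scalar case $k=1$, writing $\mathbf{W}=w$ with $0<|w|\leq 1<\sigma$ (the restriction $w\neq 0$ is an assumption made here so that the pseudoinverse is an ordinary inverse):

```latex
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}x}D^{*}_{w,u}(x) &= \Bigl(1-\frac{|w|}{\sigma}\Bigr)x-u, \\
\frac{\mathrm{d}}{\mathrm{d}x}D^{*^{c}}_{w,u}(x) &= \Bigl(\frac{\sigma}{|w|}-1\Bigr)x-\frac{\sigma}{|w|}\,u, \\
\frac{\mathrm{d}^{2}}{\mathrm{d}x^{2}}D^{*^{c}}_{w,u}(x) &= \frac{\sigma}{|w|}-1 \;>\; 0 .
\end{aligned}
```

The last line shows the strictly positive curvature directly: $\sigma/|w|>1$ whenever $|w|\leq 1<\sigma$.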

Consider the W2GAN minimax objective for the pair $(G_{\mathbf{W},\mathbf{u}},D^{*})$, where $D^{*}$ is fixed to be the optimal $D^{*}_{\mathbf{W},\mathbf{u}}$:

$$\mathbb{E}\bigl[D^{*}(\mathbf{X})\bigr]-\mathbb{E}\bigl[{D^{*}}^{c}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr]. \tag{C.20}$$

If $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ were a local Nash equilibrium, the variables $\mathbf{W},\mathbf{u}$ would provide a local minimum of the above objective. However, since $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$ is shown to be a quadratic function whose Hessian possesses a strictly positive eigenvalue, the above objective is strictly concave in $\mathbf{u}$ along the corresponding eigenvector and therefore has no local minimum in the unconstrained variable $\mathbf{u}$. Therefore, the minimax problem possesses no local Nash equilibrium solutions of the form $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ and hence no pure Nash equilibrium solutions.

For the parameterized case with a quadratic discriminator $D_{A,\mathbf{b}}(\mathbf{x})=\mathbf{x}^{T}A\mathbf{x}+\mathbf{b}^{T}\mathbf{x}$, first note that, as shown in the proof, the optimal discriminator $D^{*}_{\mathbf{W},\mathbf{u}}$ for any generator parameters $\mathbf{W},\mathbf{u}$ is a $c$-concave quadratic function. Therefore, the optimal solution for the discriminator does not change under the new quadratic constraint. Furthermore, the discriminator optimization problem has a concave objective in the parameters $A,\mathbf{b}$. This is because the discriminator $D_{A,\mathbf{b}}(\mathbf{x})$ is linear in $A,\mathbf{b}$, and $D^{c}_{A,\mathbf{b}}(\mathbf{x})=\sup_{\mathbf{x}'}D_{A,\mathbf{b}}(\mathbf{x}')-c(\mathbf{x}',\mathbf{x})$ is a convex function of $A,\mathbf{b}$, as the supremum of affine functions is convex.

As a result, the discriminator optimization reduces to maximizing a concave objective of $A,\mathbf{b}$ over the convex set $\{A:\,I-A\succcurlyeq 0\}$, which is equivalent to the $c$-concave constraint on the quadratic $D_{A,\mathbf{b}}$. Hence, any local solution to this optimization problem will also be a global solution. This result implies that any local Nash equilibrium of the new parameterized minimax problem has the form $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$, which, as we have already shown, does not exist under the theorem’s assumptions.

Proof for the 1-dimensional WGAN:

Consider the 1-dimensional Wasserstein GAN problem for the assumed linear generator function:

$$\min_{w,u:\,|w|\leq 1}\;\max_{\|D\|_{\operatorname{Lip}}\leq 1}\;\mathbb{E}[D(X)]-\mathbb{E}[D(wZ+u)]. \tag{C.21}$$

The inner maximization problem can be rewritten as

$$\max_{\|D\|_{\operatorname{Lip}}\leq 1}\;\int\bigl(p_{X}(x)-p_{wZ+u}(x)\bigr)\,D(x)\,\mathrm{d}x. \tag{C.22}$$

Here we have

$$p_{X}(x)-p_{wZ+u}(x)\,=\,\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\Bigl\{\frac{-1}{2\sigma^{2}}x^{2}\Bigr\}-\frac{1}{\sqrt{2\pi w^{2}}}\exp\Bigl\{\frac{-1}{2w^{2}}(x-u)^{2}\Bigr\}.$$

Since $|w|\leq 1<\sigma$, it can be seen that the above difference will be positive everywhere except over an interval $(a_{1},a_{2})$, where $a_{1},a_{2}$ are the two solutions to the quadratic equation obtained by equating the logarithms of the two densities:

$$\Bigl(\frac{1}{w^{2}}-\frac{1}{\sigma^{2}}\Bigr)x^{2}-2\frac{u}{w^{2}}x+\Bigl(\frac{u^{2}}{w^{2}}-2\log\Bigl(\frac{\sigma}{|w|}\Bigr)\Bigr)=0. \tag{C.23}$$

Note that the above quadratic equation has two distinct solutions $a_{1}<a_{2}$, since $|w|<\sigma$ and $\log(\frac{\sigma}{|w|})>0$ lead to the positive discriminant:

$$4\frac{u^{2}}{w^{2}\sigma^{2}}+8\log\Bigl(\frac{\sigma}{|w|}\Bigr)\Bigl(\frac{1}{w^{2}}-\frac{1}{\sigma^{2}}\Bigr)\,>\,0. \tag{C.24}$$
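The sign pattern of $p_{X}-p_{wZ+u}$ can be checked numerically. The sketch below is illustrative only: the parameter values are hypothetical, and the quadratic coefficients are derived here directly by taking logarithms of the two Gaussian densities and setting them equal.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-(x - mean) ** 2 / (2.0 * std * std)) / math.sqrt(2.0 * math.pi * std * std)

def crossing_points(w, u, sigma):
    """Roots a1 < a2 of p_X(x) = p_{wZ+u}(x), from equating log-densities:
    (x - u)^2 / w^2 - x^2 / sigma^2 - 2 log(sigma / |w|) = 0."""
    a = 1.0 / w ** 2 - 1.0 / sigma ** 2
    b = -2.0 * u / w ** 2
    c = u ** 2 / w ** 2 - 2.0 * math.log(sigma / abs(w))
    root = math.sqrt(b * b - 4.0 * a * c)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)

def density_gap(x, w, u, sigma):
    # p_X(x) - p_{wZ+u}(x) with X ~ N(0, sigma^2) and wZ + u ~ N(u, w^2)
    return normal_pdf(x, 0.0, sigma) - normal_pdf(x, u, w)

# Hypothetical parameters satisfying |w| <= 1 < sigma
w, u, sigma = 0.8, 0.5, 1.5
a1, a2 = crossing_points(w, u, sigma)
mid = 0.5 * (a1 + a2)
```

For these values the gap is negative at the midpoint of $(a_{1},a_{2})$ and positive one unit outside either endpoint, matching the description of the interval.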

As the function $D$ in the maximization problem (C.22) is only constrained to be 1-Lipschitz, the optimal $D^{*}_{w,u}$’s slope must be equal to $-1$ over $(-\infty,a_{1}]$ and equal to $+1$ over $[a_{2},\infty)$, in order to allow the maximum increase in the maximization objective. Over the interval $(a_{1},a_{2})$, we claim that the optimal $D$ is a convex function, because otherwise its double Fenchel conjugate $D^{**}$, which is by definition convex, achieves a higher value.

First of all, note that the double Fenchel conjugate $D^{**}$ will not differ from $D$ outside the $(a_{1},a_{2})$ interval, because $D^{**}$ is the largest convex function satisfying $D^{**}\leq D$, and $D$ is $1$-Lipschitz, taking its minimum derivative on $(-\infty,a_{1}]$ and its maximum derivative over $[a_{2},\infty)$. Next, since $D^{**}$ lower-bounds $D$, it results in a non-smaller integral value over the interval $(a_{1},a_{2})$, as $p_{X}(x)-p_{wZ+u}(x)$ takes negative values over $(a_{1},a_{2})$. If $D$ is not convex, then $D^{**}$ is strictly smaller than $D$ somewhere in $(a_{1},a_{2})$ while matching $D$ over $(-\infty,a_{1}]\cup[a_{2},\infty)$. Therefore, the convex $1$-Lipschitz $D^{**}$ results in a greater objective, which contradicts $D$’s optimality. This contradiction proves that the optimal discriminator $D^{*}_{w,u}$ is a convex function.

Therefore, for every feasible $|w|\leq 1$, there exists an optimal solution $D^{*}_{w,u}$ for (C.22) that is a non-constant convex function. This result proves that the WGAN problem has no local Nash equilibria of the form $(G_{w,u},D^{*}_{w,u})$, because if $(G_{w,u},D^{*}_{w,u})$ were a local Nash equilibrium then $w,u$ would be a local minimum of the following objective, where $D^{*}$ is fixed to be $D^{*}_{w,u}$:

$$\mathbb{E}[D^{*}(X)]-\mathbb{E}[D^{*}(wZ+u)]. \tag{C.25}$$

However, the above objective is a non-constant concave function of the unconstrained variable $u$ and hence does not have a local minimum in $u$. This shows that the WGAN problem does not have a Nash equilibrium and completes the proof for the WGAN case.

Remark.

Consider the same setting as in Theorem 1. However, unlike Theorem 1, suppose that $\sigma<1$ and $\sigma_{\min}(\mathbf{W})\geq 1$, where $\sigma_{\min}$ stands for the minimum singular value. Then, for the WGAN and W2GAN problems described in Theorem 1, the Wasserstein distance-minimizing generator results in a Nash equilibrium.

Proof.

Proof for the W2GAN:

For the W2GAN case, note that if we repeat the same steps as in the proof of Theorem 1, we can show

$$\nabla_{\mathbf{x}}D^{*^{c}}_{\mathbf{W},\mathbf{u}}(\mathbf{x})=\bigl(\sigma(\mathbf{W}\mathbf{W}^{T})^{-1/2}-I\bigr)\mathbf{x}-\sigma(\mathbf{W}\mathbf{W}^{T})^{-1/2}\mathbf{u},$$

which is the gradient of a concave quadratic function of $\mathbf{x}$, since the assumptions imply that $(\mathbf{W}\mathbf{W}^{T})^{-1}\preccurlyeq\sigma^{-2}I$. Here $\mathbf{W}$ is a full-rank square matrix, as its minimum singular value is assumed to be positive and $\mathbf{Z}$ has the same dimension as $\mathbf{X}$.

We claim that for the feasible choice $\mathbf{W}^{*}=I$ and $\mathbf{u}^{*}=\mathbf{0}$, the pair $(G_{\mathbf{W}^{*},\mathbf{u}^{*}},D^{*}_{\mathbf{W}^{*},\mathbf{u}^{*}})$ results in a Nash equilibrium of the minimax problem. The optimality of $D^{*}_{\mathbf{W}^{*},\mathbf{u}^{*}}$ for $G_{\mathbf{W}^{*},\mathbf{u}^{*}}$ follows directly from the definition of the optimal discriminator. Moreover, (C.2) implies that

$$D^{*^{c}}_{\mathbf{W}^{*},\mathbf{u}^{*}}(\mathbf{x})=\frac{\sigma-1}{2}\|\mathbf{x}\|_{2}^{2}. \tag{C.26}$$

As a result, fixing the above discriminator function, the minimax objective will be

$$\begin{aligned}
&\mathbb{E}[D^{*}(\mathbf{X})]-\mathbb{E}\Bigl[\frac{\sigma-1}{2}\,\|\mathbf{W}\mathbf{Z}+\mathbf{u}\|_{2}^{2}\Bigr] \\
=\;&\mathbb{E}[D^{*}(\mathbf{X})]+\frac{1-\sigma}{2}\,\bigl(\|\mathbf{W}\|^{2}_{F}+\|\mathbf{u}\|^{2}_{2}\bigr),
\end{aligned}$$

which is minimized at $\mathbf{W}=I$ and $\mathbf{u}=\mathbf{0}$ over the specified feasible set, since the squared Frobenius norm $\|\mathbf{W}\|^{2}_{F}$ is the sum of the squares of $\mathbf{W}$’s singular values, each of which is at least $1$ on the feasible set. Therefore, the claim holds and the choice $\mathbf{W}^{*}=I$ and $\mathbf{u}^{*}=\mathbf{0}$ results in the optimal solution and a Nash equilibrium.

Proof for the 1-dimensional Wasserstein GAN:

Here we select the parameters $w^{*}=1$, $u^{*}=0$. We claim that the optimal discriminator function for this choice is the negative absolute value function $D^{*}(x)=-|x|$. Note that the optimal 1-Lipschitz $D^{*}$ solves the following problem:

$$\max_{\|D\|_{\operatorname{Lip}}\leq 1}\;\int\Bigl(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{x^{2}}{2\sigma^{2}}}-\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}\Bigr)D(x)\,\mathrm{d}x. \tag{C.27}$$

In the above objective, given $\eta=\sqrt{\frac{2\sigma^{2}\log(1/\sigma)}{1-\sigma^{2}}}$, the function $\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{x^{2}}{2\sigma^{2}}}-\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}$ is positive over $(-\eta,\eta)$ and negative elsewhere. Therefore, the optimal $D^{*}$ should have the maximum derivative $+1$ over $(-\infty,-\eta]$ and the minimum derivative $-1$ over $[+\eta,+\infty)$. Because this density difference is an even function, there exists an even optimal $D^{*}$: for any optimal $1$-Lipschitz discriminator $D^{*}$, the symmetrized function $\frac{D^{*}(x)+D^{*}(-x)}{2}$ remains $1$-Lipschitz and optimal. The optimal even $D^{*}$ is further continuous as a $1$-Lipschitz function, implying that such a $D^{*}$ is decreasing over $(0,\eta]$ and increasing over $[-\eta,0)$. Enforcing the maximum derivative magnitude over these two intervals results in the optimal $D^{*}(x)=-|x|$.

Therefore, $D^{*}(x)=-|x|$ provides an optimal discriminator for $w^{*}=1,\,u^{*}=0$. Also, for this $D^{*}$ the minimax objective of the Wasserstein GAN will be

$$\mathbb{E}[-|X|]-\mathbb{E}[-|wZ+u|]\,=\,-\mathbb{E}[|X|]+\mathbb{E}[|wZ+u|].$$

In the above equation, $wZ+u\sim\mathcal{N}(u,w^{2})$, showing that the above objective is minimized at $w^{*}=1,u^{*}=0$ over the assumed feasible set where $|w|\geq 1$. As a result, the pair $(G_{w^{*},u^{*}},D^{*})$ provides a Nash equilibrium to the WGAN minimax game. ∎
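The minimization over $(w,u)$ can be illustrated with the closed-form mean of a folded normal distribution, $\mathbb{E}|wZ+u| = |w|\sqrt{2/\pi}\,e^{-u^{2}/(2w^{2})}+u\,\operatorname{erf}\bigl(u/(|w|\sqrt{2})\bigr)$. The sketch below is a numerical illustration only; the grid ranges are hypothetical choices, and only the $(w,u)$-dependent term of the objective is evaluated.

```python
import math

def mean_abs_normal(u, w):
    """E|wZ + u| for Z ~ N(0, 1), i.e. the mean of a folded normal distribution."""
    s = abs(w)
    return (s * math.sqrt(2.0 / math.pi) * math.exp(-u * u / (2.0 * s * s))
            + u * math.erf(u / (s * math.sqrt(2.0))))

# Hypothetical grid over the feasible set |w| >= 1 (positive w only) and u in [-2, 2]
grid_w = [1.0 + i / 10.0 for i in range(21)]
grid_u = [(j - 20) / 10.0 for j in range(41)]

# Smallest value of E|wZ + u| on the grid, together with its (w, u)
best_value, best_w, best_u = min(
    (mean_abs_normal(u, w), w, u) for w in grid_w for u in grid_u
)
```

On this grid the minimizer is $w=1$, $u=0$, consistent with the choice $w^{*}=1$, $u^{*}=0$ in the proof.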

C.3 Proof of Proposition 2

Proof of the $\Rightarrow$ direction:

Assume that $(G^{*},D^{*})$ is a $\lambda$-proximal equilibrium. According to the definition of the proximal equilibrium, the following holds for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:

$$V^{\operatorname{prox}}_{\lambda}(G^{*},D)\,\leq\,V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})\,\leq\,V^{\operatorname{prox}}_{\lambda}(G,D^{*}). \tag{C.28}$$

Claim: $V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})=V(G^{*},D^{*})$.

To show this claim, note that

$$V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*}):=\max_{\widetilde{D}\in\mathcal{D}}\;V(G^{*},\widetilde{D})-\frac{\lambda}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}. \tag{C.29}$$

In this optimization, the optimal solution $\widetilde{D}$ is $D^{*}$ itself. Otherwise, for the optimal $\widetilde{D}\in\mathcal{D}$ we have $\|\widetilde{D}-D^{*}\|>0$ and as a result

$$V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})<V(G^{*},\widetilde{D})\leq V^{\operatorname{prox}}_{\lambda}(G^{*},\widetilde{D}), \tag{C.30}$$

which is a contradiction given that $(G^{*},D^{*})$ is a $\lambda$-proximal equilibrium. Therefore, $D^{*}$ optimizes the proximal optimization, which shows the claim is valid and we have $V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})=V(G^{*},D^{*})$. Knowing that $V(G,D)\leq V^{\operatorname{prox}}_{\lambda}(G,D)$ holds for every $G\in\mathcal{G},D\in\mathcal{D}$, we have

$$\begin{aligned}
V(G^{*},D) &\leq V^{\operatorname{prox}}_{\lambda}(G^{*},D) \\
&\leq V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*}) \\
&= V(G^{*},D^{*}).
\end{aligned}$$

Furthermore,

\begin{align}
V(G^{*},D^{*}) &= V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*}) \tag{C.31}\\
&\leq V^{\operatorname{prox}}_{\lambda}(G,D^{*}) \tag{C.32}\\
&= \max_{\widetilde{D}\in\mathcal{D}}\;V(G,\widetilde{D})-\frac{\lambda}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}. \nonumber
\end{align}

Therefore, the proof is complete.

Proof of the $\Leftarrow$ direction:

Suppose that for $(G^{*},D^{*})$ the following holds for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:

$$V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,\max_{\widetilde{D}\in\mathcal{D}}\;V(G,\widetilde{D})-\frac{\lambda}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}. \tag{C.33}$$

We claim that $V(G^{*},D^{*})=V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})$. To show this claim, consider the definition of the $\lambda$-proximal objective:

$$V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*}):=\max_{\widetilde{D}\in\mathcal{D}}\,V(G^{*},\widetilde{D})-\frac{\lambda}{2}\bigl\|D^{*}-\widetilde{D}\bigr\|^{2}. \tag{C.34}$$

Here $D^{*}$ maximizes the objective because we have assumed that $V(G^{*},D)\leq V(G^{*},D^{*})$ holds for every $D\in\mathcal{D}$. Therefore, the claim is valid and $V(G^{*},D^{*})=V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})$.

Also, note that for every $D$ the solution $\widetilde{D}$ in the proximal optimization satisfies $V^{\operatorname{prox}}_{\lambda}(G^{*},D)\leq V(G^{*},\widetilde{D})\leq V(G^{*},D^{*})$. Combining these results with (C.33), we obtain the following inequalities, which hold for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:

$$V^{\operatorname{prox}}_{\lambda}(G^{*},D)\,\leq\,V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})\,\leq\,V^{\operatorname{prox}}_{\lambda}(G,D^{*}). \tag{C.35}$$

The above inequalities show that the pair $(G^{*},D^{*})$ is a $\lambda$-proximal equilibrium.

C.4 Proof of Proposition 3

Consider a $\lambda_{2}$-proximal equilibrium $(G^{*},D^{*})$. As a result of Proposition 2, for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ we have

$$V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,\max_{\widetilde{D}\in\mathcal{D}}\;V(G,\widetilde{D})-\frac{\lambda_{2}}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}.$$

Since $\lambda_{1}\leq\lambda_{2}$, the following holds

$$\max_{\widetilde{D}\in\mathcal{D}}\;V(G,\widetilde{D})-\frac{\lambda_{2}}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}\,\leq\,\max_{\widetilde{D}\in\mathcal{D}}\;V(G,\widetilde{D})-\frac{\lambda_{1}}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2},$$

which shows that

$$V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,\max_{\widetilde{D}\in\mathcal{D}}\;V(G,\widetilde{D})-\frac{\lambda_{1}}{2}\bigl\|\widetilde{D}-D^{*}\bigr\|^{2}.$$

Due to Proposition 2, $(G^{*},D^{*})$ will be a $\lambda_{1}$-proximal equilibrium as well. Hence, the proof is complete and we have

$$\operatorname{PE}_{\lambda_{2}}(V)\subseteq\operatorname{PE}_{\lambda_{1}}(V). \tag{C.36}$$
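The two inequalities driving this argument, $V\leq V^{\operatorname{prox}}_{\lambda}$ and the monotonicity of the proximal objective in $\lambda$, can be checked on a toy problem. The sketch below is purely illustrative: it assumes scalar players, a hypothetical objective $V(g,d)=gd-\frac{1}{2}d^{2}$, the absolute value as the norm, and a grid search in place of the exact maximization.

```python
def V(g, d):
    # Hypothetical toy objective on scalars, concave in the "discriminator" d
    return g * d - 0.5 * d * d

# Grid standing in for the discriminator set (assumed interval [-5, 5])
GRID = [i / 100.0 for i in range(-500, 501)]

def V_prox(g, d, lam):
    """Grid approximation of max over d_tilde of V(g, d_tilde) - (lam/2) (d_tilde - d)^2."""
    return max(V(g, dt) - 0.5 * lam * (dt - d) ** 2 for dt in GRID)

LAM_1, LAM_2 = 0.5, 2.0  # lambda_1 <= lambda_2
points = [(g, d) for g in (-1.0, 0.0, 0.5, 2.0) for d in (-1.0, 0.0, 1.5)]
checks = [(V(g, d), V_prox(g, d, LAM_2), V_prox(g, d, LAM_1)) for g, d in points]
```

At every test point the three values are ordered as $V\leq V^{\operatorname{prox}}_{\lambda_{2}}\leq V^{\operatorname{prox}}_{\lambda_{1}}$, which is the chain used above to pass from a $\lambda_{2}$-proximal equilibrium to a $\lambda_{1}$-proximal equilibrium.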

C.5 Proof of Proposition 4

Consider the definition of the $\lambda$-proximal objective in the parameterized space:

$$V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}}):=\max_{\widetilde{\mathbf{w}}}\,V(G_{\bm{\theta}},D_{\widetilde{\mathbf{w}}})-\frac{\lambda}{2}\bigl\|D_{\widetilde{\mathbf{w}}}-D_{\mathbf{w}}\bigr\|^{2}. \tag{C.37}$$

In the above optimization problem, the first term $V(G_{\bm{\theta}},D_{\widetilde{\mathbf{w}}})$ is assumed to be $\eta_{2}$-smooth in $\widetilde{\mathbf{w}}$, while the second term $\frac{\lambda}{2}\|D_{\widetilde{\mathbf{w}}}-D_{\mathbf{w}}\|^{2}$ will be $\frac{\lambda}{2}\eta_{1}$-strongly convex in $\widetilde{\mathbf{w}}$. As a result, the objective, i.e., the first term minus the second, will be $(\frac{\lambda\eta_{1}}{2}-\eta_{2})$-strongly concave if $\eta_{2}<\frac{\lambda\eta_{1}}{2}$ holds. Since the objective function is strongly concave in $\widetilde{\mathbf{w}}$, it will be maximized by a unique solution $\mathbf{w}^{*}$. Moreover, applying Danskin’s theorem [50] implies that the following holds at the optimal $\mathbf{w}^{*}$:

$$\nabla_{\bm{\theta}}V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}})=\nabla_{\bm{\theta}}V(G_{\bm{\theta}},D_{\mathbf{w}^{*}}). \tag{C.38}$$
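The gradient identity can be verified on a small example where the inner maximizer is available in closed form. This is a sketch under assumptions that are not in the proposition: scalar parameters, a hypothetical objective $V(\theta,w)=\theta w-\frac{1}{2}w^{2}$, and a proximal penalty taken directly on the parameter, $\frac{\lambda}{2}(\widetilde{w}-w)^{2}$.

```python
def V(theta, w):
    # Hypothetical toy objective, concave in the discriminator parameter w
    return theta * w - 0.5 * w * w

def prox_maximizer(theta, w, lam):
    # Stationarity of V(theta, wt) - (lam/2)(wt - w)^2 in wt:
    # theta - wt - lam * (wt - w) = 0
    return (theta + lam * w) / (1.0 + lam)

def V_prox(theta, w, lam):
    wt = prox_maximizer(theta, w, lam)
    return V(theta, wt) - 0.5 * lam * (wt - w) ** 2

def grad_theta_V(theta, w):
    # Partial derivative of V with respect to theta at a fixed w
    return w

theta, w, lam, eps = 0.7, -0.3, 2.0, 1e-6
# Finite-difference gradient of the proximal objective with respect to theta
finite_diff = (V_prox(theta + eps, w, lam) - V_prox(theta - eps, w, lam)) / (2.0 * eps)
# Gradient of V evaluated at the inner maximizer w*
at_maximizer = grad_theta_V(theta, prox_maximizer(theta, w, lam))
```

The two quantities agree up to finite-difference error, so in this toy case the gradient of the proximal objective is obtained by differentiating $V$ at the inner maximizer without differentiating through the maximization.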

C.6 Proof of Theorem 2

Lemma 2.

Suppose that $f$ is a $\gamma$-strongly convex function according to norm $\|\cdot\|$, i.e., for any $x,y\in\operatorname{dom}(f)$ and $\lambda\in[0,1]$ we have

$$f(\lambda x+(1-\lambda)y)\,\leq\,\lambda f(x)+(1-\lambda)f(y)-\frac{\gamma\lambda(1-\lambda)}{2}\|x-y\|^{2}. \tag{C.39}$$

Consider the following optimization problem, where the feasible set $\mathcal{X}$ is a convex set and $x^{*}$ is the optimal solution,

$$\min_{x\in\mathcal{X}}\,f(x). \tag{C.40}$$

Then, for every $x\in\mathcal{X}$ we have

$$f(x)-f(x^{*})\geq\frac{\gamma}{2}\|x-x^{*}\|^{2}. \tag{C.41}$$
Proof.

If we apply the strong-convexity definition (C.39) to $x\in\mathcal{X}$ and $x^{*}$ we obtain

$$f(\lambda x+(1-\lambda)x^{*})\,\leq\,\lambda f(x)+(1-\lambda)f(x^{*})-\frac{\gamma\lambda(1-\lambda)}{2}\|x-x^{*}\|^{2}. \tag{C.42}$$

The above inequality results in

$$f(\lambda x+(1-\lambda)x^{*})-f(x^{*})\,\leq\,\lambda(f(x)-f(x^{*}))-\frac{\gamma\lambda(1-\lambda)}{2}\|x-x^{*}\|^{2}. \tag{C.43}$$

Note that $\mathcal{X}$ is assumed to be a convex set and therefore $\lambda x+(1-\lambda)x^{*}\in\mathcal{X}$, implying $f(x^{*})\leq f(\lambda x+(1-\lambda)x^{*})$. As a result, for every $0\leq\lambda\leq 1$

$$\frac{\gamma\lambda(1-\lambda)}{2}\|x-x^{*}\|^{2}\,\leq\,\lambda(f(x)-f(x^{*})). \tag{C.44}$$

Dividing both sides by $\lambda$, for every $0<\lambda\leq 1$ we have

$$\frac{\gamma(1-\lambda)}{2}\|x-x^{*}\|^{2}\,\leq\,f(x)-f(x^{*}), \tag{C.45}$$

and letting $\lambda\rightarrow 0^{+}$ proves that $\frac{\gamma}{2}\|x-x^{*}\|^{2}\leq f(x)-f(x^{*})$, which completes Lemma 2’s proof. ∎
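A small numerical instance may help make Lemma 2 concrete. The sketch below assumes a hypothetical one-dimensional example: $f(x)=(x-3)^{2}$, which is $2$-strongly convex, minimized over the convex set $\mathcal{X}=[-1,1]$ (discretized on a grid), so the constrained minimizer sits on the boundary at $x^{*}=1$.

```python
GAMMA = 2.0  # strong-convexity constant of f below, since f'' = 2

def f(x):
    # Hypothetical 2-strongly convex objective
    return (x - 3.0) ** 2

# Grid on the assumed convex feasible set X = [-1, 1]
FEASIBLE = [i / 100.0 for i in range(-100, 101)]

# Constrained minimizer over the grid
x_star = min(FEASIBLE, key=f)

# Slack in the bound f(x) - f(x*) >= (gamma / 2) * (x - x*)^2 at every grid point
gaps = [f(x) - f(x_star) - 0.5 * GAMMA * (x - x_star) ** 2 for x in FEASIBLE]
```

Here the slack equals $4(1-x)\geq 0$ on the feasible set, so the quadratic growth bound holds even though the gradient of $f$ does not vanish at $x^{*}$.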

Based on Proposition 2 and the definition of $D^{\bm{\theta}}$, we only need to show that for the W2GAN’s objective, which we denote by $V(G,D)$, the following holds for every $\bm{\theta}$:

$$V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq\max_{D\in\mathcal{D}}\;V(G_{\bm{\theta}},D)-\frac{1}{2\eta}\bigl\|D-D^{\bm{\theta}^{*}}\bigr\|^{2}_{\dot{H}^{1}}. \tag{C.46}$$

To show the above inequality, it suffices to prove the following inequality

$$V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq V(G_{\bm{\theta}},D^{\bm{\theta}})-\frac{1}{2\eta}\bigl\|D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\bigr\|^{2}_{\dot{H}^{1}}. \tag{C.47}$$

Claim: For the W2GAN problem, we have

$$V(G_{\bm{\theta}},D^{\bm{\theta}})=\frac{1}{2\eta}\mathbb{E}\bigl[\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\bigr].$$

To show this claim, note that according to the W2GAN’s formulation we have $V(G_{\bm{\theta}},D^{\bm{\theta}}):=W_{c}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})$, where $c(\mathbf{x},\mathbf{x}')=\frac{\eta}{2}\|\mathbf{x}-\mathbf{x}'\|_{2}^{2}$ is the second-order cost function specified in the theorem. We start by proving this result for $\eta=1$. In this case, the Brenier theorem [51] proves that the optimal transport map from the data variable $\mathbf{X}$ to the generative model $G_{\bm{\theta}}(\mathbf{Z})$ can be derived from the gradient of the optimal $D^{\bm{\theta}}$ as follows

$$\psi^{\operatorname{opt}}(\mathbf{x})=\mathbf{x}-\nabla D^{\bm{\theta}}(\mathbf{x}), \tag{C.48}$$

which, plugged into the optimal transport objective $W_{c}(P,Q):=\inf_{\Pi(P,Q)}\mathbb{E}[c(\mathbf{X},\mathbf{X}')]$, proves that

$$V(G_{\bm{\theta}},D^{\bm{\theta}}):=W_{c}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})=\mathbb{E}\Bigl[\frac{1}{2}\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\Bigr].$$

The above equation proves that the result holds for $\eta=1$. For a general $\eta>0$, note that applying a simple change of variable in the Kantorovich duality representation and solving the dual problem for $\widetilde{D}(\mathbf{x})=\eta D(\mathbf{x})$ shows that $\psi^{\operatorname{opt}}(\mathbf{x})=\mathbf{x}-\frac{1}{\eta}\nabla\widetilde{D}^{\bm{\theta}}(\mathbf{x})$ transports samples from the data domain to the generative model. This is due to the fact that after applying this change of variable the Kantorovich dual problem reduces to $\eta$ multiplied by the dual problem for $\eta=1$. As a result, applying the transport map to the definition of the optimal transport cost shows that

$$V(G_{\bm{\theta}},D^{\bm{\theta}}):=W_{c}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})=\frac{1}{2\eta}\mathbb{E}\bigl[\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\bigr],$$

proving the claim holds for any $\eta>0$.
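As a sanity check of the claim (an illustration that is not part of the original proof), consider the one-dimensional Gaussian case $P_{X}=\mathcal{N}(0,\sigma^{2})$ and $G_{\bm{\theta}}(Z)=wZ+u$ with $w>0$. The check assumes that $D^{\bm{\theta}}$ denotes the optimal potential for the cost $c$, so that the transport map is $\psi^{\operatorname{opt}}(x)=x-\frac{1}{\eta}\nabla D^{\bm{\theta}}(x)$, and uses the monotone map $\psi^{\operatorname{opt}}(x)=\frac{w}{\sigma}x+u$, which is optimal in one dimension. Both sides of the claim then evaluate to the same quantity:

```latex
\begin{aligned}
\nabla D^{\bm{\theta}}(x) &= \eta\bigl(x-\psi^{\operatorname{opt}}(x)\bigr)
  = \eta\Bigl[\Bigl(1-\frac{w}{\sigma}\Bigr)x-u\Bigr], \\
\frac{1}{2\eta}\,\mathbb{E}\bigl[|\nabla D^{\bm{\theta}}(X)|^{2}\bigr]
  &= \frac{\eta}{2}\Bigl[\Bigl(1-\frac{w}{\sigma}\Bigr)^{2}\sigma^{2}+u^{2}\Bigr]
  = \frac{\eta}{2}\bigl[(\sigma-w)^{2}+u^{2}\bigr], \\
W_{c}\bigl(P_{X},P_{wZ+u}\bigr)
  &= \mathbb{E}\Bigl[\frac{\eta}{2}\bigl(X-\psi^{\operatorname{opt}}(X)\bigr)^{2}\Bigr]
  = \frac{\eta}{2}\bigl[(\sigma-w)^{2}+u^{2}\bigr].
\end{aligned}
```

The second line uses $\mathbb{E}[X]=0$ and $\mathbb{E}[X^{2}]=\sigma^{2}$.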

Substituting the discriminator maximization with the result in the above claim, the W2GAN problem reduces to the following problem:

$$\min_{\bm{\theta}}\;\frac{1}{2\eta}\mathbb{E}\bigl[\bigl\|\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|^{2}_{2}\bigr]. \tag{C.49}$$

Here we can equivalently optimize over $D^{\bm{\theta}}\in\{D^{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$ instead of minimizing over the variable $\bm{\theta}$, obtaining

$$\min_{D^{\bm{\theta}}}\;\frac{1}{2\eta}\mathbb{E}\bigl[\bigl\|\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|^{2}_{2}\bigr]. \tag{C.50}$$

Note that the term $\frac{1}{2}\mathbb{E}\bigl[\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\bigr]=\frac{1}{2}\|D^{\bm{\theta}}\|^{2}_{\dot{H}^{1}}$ is half the square of the defined Sobolev norm in a semi-Hilbert space, which is a $1$-strongly convex function according to $\|\cdot\|_{\dot{H}^{1}}$, with strong convexity defined as in (C.39). As a result, the objective in (C.50) is $\frac{1}{\eta}$-strongly convex according to the Sobolev norm $\|\cdot\|_{\dot{H}^{1}}$. In addition, this objective is minimized over a convex feasible set $\{D^{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$, due to the theorem’s assumption. Therefore, Lemma 2 shows that the optimal $D^{\bm{\theta}^{*}}$ satisfies the following inequality for every $\bm{\theta}$:

$$\frac{1}{2\eta}\mathbb{E}\bigl[\bigl\|\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|^{2}_{2}\bigr]-\frac{1}{2\eta}\mathbb{E}\bigl[\bigl\|\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\bigr\|^{2}_{2}\bigr]\;\geq\;\frac{1}{2\eta}\bigl\|D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\bigr\|^{2}_{\dot{H}^{1}}.$$

The above result implies that

$$V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq V(G_{\bm{\theta}},D^{\bm{\theta}})-\frac{1}{2\eta}\bigl\|D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\bigr\|^{2}_{\dot{H}^{1}}, \tag{C.51}$$

which completes the proof.

C.7 Proof of Theorem 3

Lemma 3.

Consider two vectors $\mathbf{x},\mathbf{y}$ with equal Euclidean norms $\|\mathbf{x}\|_{2}=\|\mathbf{y}\|_{2}$. Then for every $0\leq a\leq b$, we have

$$a\|\mathbf{x}-\mathbf{y}\|_{2}\leq\|a\mathbf{x}-b\mathbf{y}\|_{2}. \tag{C.52}$$
Proof.

Note that

$$\begin{aligned}
\|a\mathbf{x}-b\mathbf{y}\|^{2}_{2}-a^{2}\|\mathbf{x}-\mathbf{y}\|^{2}_{2}\;&=\;(b^{2}-a^{2})\|\mathbf{y}\|^{2}_{2}-2a(b-a)\mathbf{x}^{T}\mathbf{y}\\
&=\;(b-a)\bigl((b+a)\|\mathbf{y}\|^{2}_{2}-2a\,\mathbf{x}^{T}\mathbf{y}\bigr)\\
&\geq\;0.
\end{aligned}$$

The above holds because the assumption $0\leq a\leq b$ implies $0\leq 2a\leq b+a$, and because the two vectors $\mathbf{x},\mathbf{y}$ share the same Euclidean norm, so that

$$|\mathbf{x}^{T}\mathbf{y}|\leq\frac{\|\mathbf{x}\|_{2}^{2}+\|\mathbf{y}\|_{2}^{2}}{2}=\|\mathbf{y}\|_{2}^{2}.$$

Hence, Lemma 3’s proof is complete. ∎
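As a sanity check on Lemma 3, the following sketch samples random vector pairs with equal norms and random scalars $0\leq a\leq b$, and records the smallest value of $\|a\mathbf{x}-b\mathbf{y}\|_{2}-a\|\mathbf{x}-\mathbf{y}\|_{2}$. The helper names, dimensions, and sampling ranges are assumptions made for illustration only. A random search of this kind cannot replace the proof; it only fails to find a counterexample.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def lemma3_gap(x, y, a, b):
    """Return ||a*x - b*y||_2 - a*||x - y||_2, which Lemma 3 says is >= 0."""
    lhs = a * norm([xi - yi for xi, yi in zip(x, y)])
    rhs = norm([a * xi - b * yi for xi, yi in zip(x, y)])
    return rhs - lhs


def random_equal_norm_pair(dim, rng):
    """Draw two Gaussian vectors and rescale y so that ||y||_2 == ||x||_2."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = norm(x) / norm(y)
    return x, [scale * yi for yi in y]


def min_gap(trials=10_000, dim=5, seed=0):
    """Smallest observed gap over random instances with 0 <= a <= b."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x, y = random_equal_norm_pair(dim, rng)
        a = rng.uniform(0.0, 2.0)
        b = a + rng.uniform(0.0, 2.0)  # enforces a <= b
        worst = min(worst, lemma3_gap(x, y, a, b))
    return worst
```

The ordering $a\leq b$ matters: for the orthogonal unit vectors `[1.0, 0.0]` and `[0.0, 1.0]`, calling `lemma3_gap` with `a=2.0, b=1.0` gives a negative gap, so the inequality can fail when $a>b$.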

As shown by the Kantorovich duality [49], for the optimal $D^{\bm{\theta}}$ and the optimal coupling $\pi_{\bm{\theta}}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})$, the following holds with probability $1$ for every joint sample $(\mathbf{X},\mathbf{X}^{\prime})$ drawn from the optimal coupling $\pi_{\bm{\theta}}$:

$$D^{\bm{\theta}}(\mathbf{X})-D^{\bm{\theta}}(\mathbf{X}^{\prime})=\|\mathbf{X}-\mathbf{X}^{\prime}\|_{2}. \tag{C.53}$$

Knowing that $D^{\bm{\theta}}$ is 1-Lipschitz, for every convex combination $\beta\mathbf{X}+(1-\beta)\mathbf{X}^{\prime}$ we must have

$$\nabla D^{\bm{\theta}}\bigl(\beta\mathbf{X}+(1-\beta)\mathbf{X}^{\prime}\bigr)=\frac{\mathbf{X}-\mathbf{X}^{\prime}}{\|\mathbf{X}-\mathbf{X}^{\prime}\|}.$$

This implies that there exists $\alpha_{\bm{\theta}}$ such that the transport map described in the theorem maps the data distribution to the generative model. Plugging this transport map into the definition of the first-order Wasserstein distance, we obtain

$$\begin{aligned}
V(G_{\bm{\theta}},D^{\bm{\theta}}) :&= W_{1}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})\\
&=\mathbb{E}\bigl[\bigl\|\alpha^{2}_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|_{2}\bigr]\\
&=\mathbb{E}\bigl[\bigl\|\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|_{2}^{2}\bigr],
\end{aligned}$$

where the last equality holds since $\nabla D^{\bm{\theta}}(\mathbf{X})$ has unit Euclidean norm with probability $1$ over the data distribution $P_{\mathbf{X}}$: as we proved, $\nabla D^{\bm{\theta}}(\beta\mathbf{X}+(1-\beta)\mathbf{X}^{\prime})=\frac{\mathbf{X}-\mathbf{X}^{\prime}}{\|\mathbf{X}-\mathbf{X}^{\prime}\|}$ holds for every $0\leq\beta\leq 1$, including $\beta=1$.
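The role of the unit-norm gradient in that last equality can be seen pointwise: for a unit vector $\mathbf{g}$ and a scalar $\alpha$, both $\|\alpha^{2}\mathbf{g}\|_{2}$ and $\|\alpha\mathbf{g}\|_{2}^{2}$ equal $\alpha^{2}$. The short sketch below illustrates this with hypothetical helper names, where `g` stands in for the gradient and `alpha` for the scalar factor; it is not taken from the paper.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and normalize it to unit Euclidean norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [c / n for c in v]


def both_costs(alpha, g):
    """Return (||alpha^2 * g||_2, ||alpha * g||_2^2) for a scalar alpha and vector g."""
    first = norm([alpha ** 2 * gi for gi in g])
    second = norm([alpha * gi for gi in g]) ** 2
    return first, second
```

For a vector from `random_unit_vector`, the two returned values agree up to rounding. For a vector that is not unit-norm, such as `[2.0, 0.0]`, they generally differ, which is why the unit-norm property of the gradient is needed in the argument.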

As a result, the Wasserstein GAN problem reduces to the following optimization problem

$$\min_{\bm{\theta}}\;\mathbb{E}\bigl[\bigl\|\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|_{2}^{2}\bigr]. \tag{C.54}$$

Defining $h_{\bm{\theta}}(\mathbf{X}):=\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})$, the function $\frac{1}{2}\mathbb{E}\bigl[\|h_{\bm{\theta}}(\mathbf{X})\|_{2}^{2}\bigr]$ is $1$-strongly convex with respect to the norm function

$$\|h\|_{\dot{H}^{0}}=\sqrt{\mathbb{E}\bigl[\,h^{2}(\mathbf{X})\,\bigr]}$$

that is induced by the following inner product and results in a Hilbert space

$$\langle D_{1},D_{2}\rangle_{\dot{H}^{0}}:=\mathbb{E}_{P_{\mathbf{X}}}\bigl[D_{1}(\mathbf{X})D_{2}(\mathbf{X})\bigr].$$

Therefore, for the $\bm{\theta}^{*}$ minimizing the objective in (C.54) over the assumed convex set $\{h_{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$, Lemma 2 implies that

$$\begin{aligned}
\mathbb{E}\bigl[\bigl\|\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\bigr\|_{2}^{2}\bigr]-\mathbb{E}\bigl[\bigl\|\alpha_{\bm{\theta}^{*}}(\mathbf{X})\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\bigr\|_{2}^{2}\bigr]\;&=\;\mathbb{E}\bigl[\bigl\|h_{\bm{\theta}}(\mathbf{X})\bigr\|_{2}^{2}\bigr]-\mathbb{E}\bigl[\bigl\|h_{\bm{\theta}^{*}}(\mathbf{X})\bigr\|_{2}^{2}\bigr]\\
&\geq\;\|h_{\bm{\theta}}-h_{\bm{\theta}^{*}}\|_{\dot{H}^{0}}^{2}\\
&=\;\mathbb{E}\bigl[\|h_{\bm{\theta}}(\mathbf{X})-h_{\bm{\theta}^{*}}(\mathbf{X})\|_{2}^{2}\bigr]\\
&=\;\mathbb{E}\bigl[\|\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})-\alpha_{\bm{\theta}^{*}}(\mathbf{X})\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\|_{2}^{2}\bigr]\\
&\geq\;\frac{\eta}{2}\,\mathbb{E}\bigl[\|\nabla D^{\bm{\theta}}(\mathbf{X})-\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\|_{2}^{2}\bigr]\\
&=\;\frac{\eta}{2}\,\|D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}.
\end{aligned}$$

Here the last inequality follows from Lemma 3, since every $D^{\bm{\theta}}$ has a unit-norm gradient with probability $1$ according to the data distribution $P_{\mathbf{X}}$. Therefore, we have proved that

$$V(G_{\bm{\theta}},D^{\bm{\theta}})-V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\geq\frac{\eta}{2}\|D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}. \tag{C.55}$$

The above inequality results in the following for every feasible $\bm{\theta}$:

$$V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq\max_{D\in\mathcal{D}}V(G_{\bm{\theta}},D)-\frac{\eta}{2}\|D-D^{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}. \tag{C.56}$$
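The step from (C.55) to (C.56) can be spelled out as follows. This is a sketch of the intended reasoning, under two assumptions: that the maximum in (C.56) is taken over the whole penalized expression, and that the optimal discriminator $D^{\bm{\theta}}$ belongs to the feasible set $\mathcal{D}$, so that it is one admissible choice in the maximization.

```latex
\begin{aligned}
\max_{D\in\mathcal{D}}\Bigl\{V(G_{\bm{\theta}},D)-\frac{\eta}{2}\|D-D^{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}\Bigr\}
 &\geq V(G_{\bm{\theta}},D^{\bm{\theta}})-\frac{\eta}{2}\|D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}
 && \text{(evaluate at } D=D^{\bm{\theta}}\text{)}\\
 &\geq V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})
 && \text{(rearranging (C.55))}.
\end{aligned}
```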

Hence, according to Proposition 2, we have shown that the pair $(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})$ is an $\eta$-proximal equilibrium with respect to the Sobolev norm $\|\cdot\|_{\dot{H}^{1}}$.