GANs May Have No Nash Equilibria

Farzan Farnia, Asuman Ozdaglar
{farnia,asuman}@mit.edu

Massachusetts Institute of Technology

Abstract

Generative adversarial networks (GANs) represent a zero-sum game between two machine players, a generator and a discriminator, designed to learn the distribution of data. While GANs have achieved state-of-the-art performance in several benchmark learning tasks, GAN minimax optimization still poses great theoretical and empirical challenges. GANs trained using first-order optimization methods commonly fail to converge to a stable solution where the players cannot improve their objective, i.e., the Nash equilibrium of the underlying game. Such issues raise the question of the existence of Nash equilibrium solutions in the GAN zero-sum game. In this work, we show through several theoretical and numerical results that indeed GAN zero-sum games may not have any local Nash equilibria. To characterize an equilibrium notion applicable to GANs, we consider the equilibrium of a new zero-sum game with an objective function given by a proximal operator applied to the original objective, a solution we call the proximal equilibrium. Unlike the Nash equilibrium, the proximal equilibrium captures the sequential nature of GANs, in which the generator moves first followed by the discriminator. We prove that the optimal generative model in Wasserstein GAN problems provides a proximal equilibrium. Inspired by these results, we propose a new approach, which we call proximal training, for solving GAN problems. We discuss several numerical experiments demonstrating the existence of proximal equilibrium solutions in GAN minimax problems.

1 Introduction

Since their introduction in [1], generative adversarial networks (GANs) have achieved great success in many tasks of learning the distribution of observed samples. Unlike traditional approaches to distribution learning, GANs view the learning problem as a zero-sum game between two players: 1) a generator $G$ aiming to generate realistic samples from a random noise input, and 2) a discriminator $D$ trying to distinguish $G$’s generated samples from real training data. This game is commonly formulated through a minimax optimization problem as follows:

\min_{G\in\mathcal{G}}\;\max_{D\in\mathcal{D}}\;V(G,D).

(1.1)

Here, $\mathcal{G}$ and $\mathcal{D}$ are respectively the generator and discriminator function sets, commonly chosen as two deep neural nets, and $V(G,D)$ denotes the minimax objective for generator $G$ and discriminator $D$ capturing how dissimilar the generated samples and training data are.

GAN optimization problems are commonly solved by alternating gradient methods, which under proper regularization have resulted in state-of-the-art generative models for various benchmark datasets. However, GAN minimax optimization has led to several theoretical and empirical challenges in the machine learning literature. Training GANs is widely known as a challenging optimization task requiring an exhaustive hyper-parameter and architecture search and demonstrating an unstable behavior. While a few regularization schemes have achieved empirical success in training GANs [2, 3, 4, 5], still little is known about the conditions under which GAN minimax optimization can be successfully solved by first-order optimization methods.

To understand the minimax optimization in GANs, one needs to first answer the following question: What is the proper notion of equilibrium in the GAN zero-sum game? In other words, what are the optimality criteria in the GAN’s minimax optimization problem? A classical notion of equilibrium in the game theory literature is the Nash equilibrium, a state in which no player can raise its individual gain by choosing a different strategy. According to this definition, a Nash equilibrium $(G^{*},D^{*})$ for the GAN minimax problem (1.1) must satisfy the following for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ :

V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,V(G,D^{*}).

(1.2)

As a well-known result, for a generator $G$ expressive enough to reproduce the distribution of observed samples, a Nash equilibrium exists at the generator producing the data distribution [6]. However, such a Nash equilibrium would be of little interest from a learning perspective, since the trained generator merely overfits the empirical distribution of training samples [7]. More importantly, state-of-the-art GAN architectures [4, 5, 8, 9] commonly restrict the generator function through various means of regularization such as batch or spectral normalization. Such regularization mechanisms do not allow the generator to produce the empirical distribution of observed data-points. Since the realizability assumption does not apply to such regularized GANs, the existence of Nash equilibria is not guaranteed in their minimax problems.

The above discussion motivates studying the equilibrium of GAN zero-sum games in the non-realizable settings where the generator cannot express the empirical distribution of training data. Here, a natural question is whether a Nash equilibrium still exists for the GAN minimax problem. In this work, we focus on this question and demonstrate through several theoretical and numerical results that:

Nash equilibrium may not exist in GAN zero-sum games.

We provide theoretical examples of well-known GAN formulations including the vanilla GAN [1], Wasserstein GAN (WGAN) [3], $f$ -GAN [10], and the second-order Wasserstein GAN (W2GAN) [11] where no local Nash equilibria exist in their minimax optimization problems. We further perform numerical experiments on widely-used GAN architectures which suggest that an empirically successful GAN training may converge to non-Nash equilibrium solutions.

Next, we focus on characterizing a new notion of equilibrium for GAN problems. To achieve this goal, we consider the Nash equilibrium of a new zero-sum game where the objective function is given by the following proximal operator applied to the minimax objective $V(G,D)$ with respect to a norm on discriminator functions:

V^{\operatorname{prox}}(G,D)\coloneqq\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\bigl{\|}\widetilde{D}-D\bigr{\|}^{2}.

(1.3)

We refer to the Nash equilibrium of the new zero-sum game as the proximal equilibrium. Given the inherent sequential nature of GAN problems where the generator moves first followed by the discriminator, we consider a Stackelberg game for its representation and focus on the subgame perfect equilibrium (SPE) of the game as the right notion of equilibrium for such problems [12]. We prove that the proximal equilibrium of Wasserstein GANs provides an SPE for the GAN problem. This result applies to both the first-order and second-order Wasserstein GANs. In these cases, we show a proximal equilibrium exists for the optimal generator minimizing the distance to the data distribution.

Inspired by these theoretical results, we propose a proximal approach for training GANs, which we call proximal training, by changing the original minimax objective to the proximal objective in (1.3). In addition to preserving the optimal solution to the GAN minimax problem, proximal training can further enjoy the existence of Nash equilibrium solutions in the new minimax objective. We discuss numerical results supporting the proximal training approach and the role of proximal equilibrium solutions in various GAN problems.

2 Related Work

Understanding the minimax optimization in modern machine learning applications including GANs has been a subject of great interest in the machine learning literature. A large body of recent works [13, 14, 15, 16, 17, 18, 19, 20, 21] have analyzed the convergence properties of first-order optimization methods in solving different classes of minimax games.

In a related work, [12] proposes a new notion of local optimality, called local minimax, designed for general sequential machine learning games. Compared to the notion of local minimax, the proximal equilibrium proposed in our work gives a notion of global optimality, which as we show directly applies to Wasserstein GANs. [12] also provides examples of minimax problems where Nash equilibria do not exist; however, the examples do not represent GAN minimax problems. Some recent works [21, 22, 23] have analyzed the convergence of different optimization methods to local minimax solutions.

In another related work, [24] analyzes the stable points of the gradient descent ascent (GDA) and optimistic GDA [13] algorithms, proving that they will give strict supersets of the local saddle points. Regarding the stability of GAN algorithms, [25] proves that the GDA algorithm will be locally stable for the vanilla and regularized Wasserstein GAN problems. [11] shows the GDA algorithm is globally stable for W2GANs with linear generator and quadratic discriminator functions.

Regarding the equilibrium in GANs, [7] studies the Nash equilibrium of GAN minimax games in realizable settings. Also, [7, 26] develop methods for finding mixed strategy Nash equilibria. On the other hand, our results focus on the pure strategies in non-realizable settings. [27] empirically studies the equilibrium of GAN problems regularized via the gradient penalty, reporting positive results on the stability of regularized GANs. However, our focus is on the existence of pure Nash equilibrium solutions. [28] suggests a moment matching GAN formulation using the Sobolev norm. As a different direction, we use the Sobolev norm to analyze equilibrium in GANs. Finally, developing GAN architectures with improved equilibrium and stability properties has been studied in several recent works [2, 29, 30, 31, 32, 33, 34, 35].

3 An Initial Experiment on Equilibrium in GANs

To examine empirically whether a Nash equilibrium exists in GAN problems, we performed a simple numerical experiment. In this experiment, we applied three standard GAN implementations, namely the Wasserstein GAN with weight-clipping (WGAN-WC) [3], the improved Wasserstein GAN with gradient penalty (WGAN-GP) [4], and the spectrally-normalized vanilla GAN (SN-GAN) [5], to two benchmark datasets, MNIST [36] and CelebA [37]. We used the convolutional architecture of the DC-GAN [38], optimized with the Adam [39] optimizer, or with RMSprop [40] for WGAN-WC only.

We ran each of the GAN experiments for 200,000 generator iterations to reach $(G_{{\bm{\theta}}_{\operatorname{final}}},D_{{\mathbf{w}}_{\operatorname{final}}})$ with ${\bm{\theta}}_{\operatorname{final}}$ and ${\mathbf{w}}_{\operatorname{final}}$ denoting the trained generator and discriminator parameters at the end of the 200,000 iterations. Our goal is to examine whether the solution pair $(G_{{\bm{\theta}}_{\operatorname{final}}},D_{{\mathbf{w}}_{\operatorname{final}}})$ represents a Nash equilibrium. To do this, we fixed the trained discriminator $D_{{\mathbf{w}}_{\operatorname{final}}}$ and continued optimizing only the generator $G_{\bm{\theta}}$. Specifically, we solved the following optimization problem, initialized at $\bm{\theta}^{(0)}={\bm{\theta}}_{\operatorname{final}}$, using the default first-order optimizer for the generator function for 10,000 iterations:

\min_{\bm{\theta}}\>V(G_{\bm{\theta}},D_{{\mathbf{w}}_{\operatorname{final}}}).

(3.1)

If the pair $(G_{{\bm{\theta}}_{\operatorname{final}}},D_{{\mathbf{w}}_{\operatorname{final}}})$ was in fact a Nash equilibrium, it would give a local saddle point to the minimax optimization and the above optimization could not make the objective any smaller than its initial value. Also, the image samples generated by the generator $G_{\bm{\theta}}$ should have improved or at least preserved their initial quality during this optimization, since the discriminator $D_{{\mathbf{w}}_{\operatorname{final}}}$ would be the optimal discriminator against all generator functions.

[Figure 1(a): Results on the MNIST data]

Despite the above predictions, we observed that neither statement held in any of the six experiments with the three standard GAN implementations and the two datasets. The optimization objective decreased rapidly from the beginning of the optimization, and the pictures sampled from the generator completely lost their quality over the course of this optimization. Figures 1(a), 1(b) show the objective for the SN-GAN experiments over the 10,000 steps of the above optimization. These figures also show the SN-GAN generated samples before and during the optimization, which reveal a significant drop in the quality of the generated pictures. We defer the results for the WGAN-WC and WGAN-GP problems to the Appendix.

The results of the above experiments show that practical GAN experiments may not converge to local Nash equilibrium solutions. After fixing the trained discriminator, the trained generator can be further optimized using a first-order optimization method to reach smaller values of the generator objective. More importantly, this optimization does not improve the quality of the generator’s output samples; on the contrary, it severely degrades the trained generator. As demonstrated in these experiments, simultaneous optimization of the two players is in fact necessary for proper convergence and stability in GAN minimax optimization. The above experiments suggest that practical GAN solutions are not local Nash equilibria. In the upcoming sections, we review some standard GAN formulations and then show that there are examples of GAN minimax problems for which no Nash equilibrium exists. Those theoretical results further support our observations in the above experiments.
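The diagnostic used in this section can be sketched on a toy problem. The snippet below is only an illustration and not the code used for the reported experiments: the scalar bilinear objective `toy_objective`, the helper name `continue_generator`, the finite-difference gradient, and the step size are all assumptions chosen for brevity. It freezes the discriminator parameter and keeps running gradient descent on the generator parameter, as in (3.1), and records the objective along the way.

```python
def toy_objective(theta, w):
    """Hypothetical scalar stand-in for V(G_theta, D_w); not a GAN objective."""
    return theta * w


def continue_generator(objective, theta_final, w_final, steps=100, lr=0.1, eps=1e-6):
    """Freeze w_final and keep running gradient descent on theta, as in (3.1).

    Returns the objective value before the first step and after each step.
    The gradient is approximated by central finite differences.
    """
    theta = theta_final
    values = [objective(theta, w_final)]
    for _ in range(steps):
        grad = (objective(theta + eps, w_final)
                - objective(theta - eps, w_final)) / (2.0 * eps)
        theta -= lr * grad
        values.append(objective(theta, w_final))
    return values


# (theta, w) = (0, 1) is not a saddle point of theta * w: the objective keeps dropping.
not_nash = continue_generator(toy_objective, theta_final=0.0, w_final=1.0)
# (theta, w) = (0, 0) is a saddle point of theta * w: the objective does not move.
nash = continue_generator(toy_objective, theta_final=0.0, w_final=0.0)

print(not_nash[0], not_nash[-1])
print(nash[0], nash[-1])
```

If the starting pair were a local saddle point, the recorded values could not fall below the initial one; a steadily decreasing sequence is the signature observed in the experiments above.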

4 Review of GAN Formulations

4.1 Vanilla GAN & $f$ -GAN

Consider samples $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ observed independently from distribution $P_{\mathbf{X}}$ . Our goal is to find a generator function $G\in\mathcal{G}$ where $G(\mathbf{Z})$ maps a random noise input $\mathbf{Z}$ from a known $P_{\mathbf{Z}}$ to an output $G(\mathbf{Z})$ distributed as $P_{\mathbf{X}}$ , i.e., we aim to match the probability distributions $P_{G(\mathbf{Z})}$ and $P_{\mathbf{X}}$ . To find such a generator function, [1] proposes the following minimax problem which is commonly referred to as the vanilla GAN problem:

\min_{G\in\mathcal{G}}\>\max_{D\in\mathcal{D}}\>\mathbb{E}\bigl{[}\log(D(\mathbf{X}))\bigr{]}+\mathbb{E}\bigl{[}\log(1-D(G(\mathbf{Z})))\bigr{]}.

(4.1)

Here $\mathcal{G}$ and $\mathcal{D}$ represent the set of generator and discriminator functions, respectively. In this formulation, the discriminator is optimized to map real samples from $P_{\mathbf{X}}$ to larger values than the values assigned to generated samples from $P_{G(\mathbf{Z})}$ .

As shown in [1], the above minimax problem for an unconstrained $\mathcal{D}$ containing all real-valued functions reduces to the following divergence minimization problem:

\min_{G\in\mathcal{G}}\>\operatorname{JSD}\bigl{(}P_{\mathbf{X}},P_{G(\mathbf{Z})}\bigr{)},

(4.2)

where $\operatorname{JSD}$ denotes the Jensen-Shannon (JS) divergence defined in terms of KL-divergence as

\operatorname{JSD}(P,Q)\coloneqq\frac{1}{2}\operatorname{KL}\bigl{(}P,\frac{P+Q}{2}\bigr{)}\,+\,\frac{1}{2}\operatorname{KL}\bigl{(}Q,\frac{P+Q}{2}\bigr{)}.
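As a small numerical illustration of this definition, the following sketch evaluates the JS-divergence for discrete distributions given as probability vectors. The function names `kl` and `jsd` and the list-based representation are assumptions made for this example only.

```python
import math


def kl(p, q):
    """KL(P, Q) for discrete distributions; terms with p_i = 0 contribute zero.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """JSD(P, Q) = 0.5 * KL(P, M) + 0.5 * KL(Q, M) with M = (P + Q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


print(jsd([0.5, 0.5], [0.5, 0.5]))   # identical distributions
print(jsd([1.0, 0.0], [0.0, 1.0]))   # disjoint supports
print(jsd([0.9, 0.1], [0.1, 0.9]))
```

With the natural logarithm, the divergence is zero for identical distributions and equals $\log 2$ for distributions with disjoint supports, so it stays bounded even when the two distributions do not overlap.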

$f$ -GANs extend the vanilla GAN problem by generalizing the JS-divergence to a general $f$ -divergence. For a convex function $f:\mathbb{R}\rightarrow\mathbb{R}$ with $f(1)=0$ , the $f$ -divergence $d_{f}$ corresponding to $f$ is defined as

d_{f}(P,Q)\coloneqq\int p(\mathbf{x})f\bigl{(}\frac{q(\mathbf{x})}{p(\mathbf{x})}\bigr{)}\mathop{}\!\mathrm{d}\mathbf{x}.

(4.3)

Notice that the JS-divergence is a special case of $f$ -divergence with $f_{\operatorname{JSD}}(t)=t\log t-(t+1)\log\frac{t+1}{2}$ . [10] shows that generalizing the divergence minimization (4.2) to minimizing a $f$ -divergence results in the following minimax problem called $f$ -GAN:

\min_{G\in\mathcal{G}}\>\max_{D\in\mathcal{D}}\>\mathbb{E}\bigl{[}D(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}f^{*}\bigl{(}D(G(\mathbf{Z}))\bigr{)}\bigr{]},

(4.4)

where $f^{*}$ denotes the Fenchel-conjugate to $f$ defined as $f^{*}(u)=\sup_{t}ut-f(t)$ . The space $\mathcal{D}$ implied by the f-divergence minimization will be the set of all functions, but a similar interpretation further applies to a constrained $\mathcal{D}$ [41, 42]. Several examples of $f$ -GANs have been formulated and discussed in [10].
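The Fenchel conjugate appearing in (4.4) can be checked numerically for a simple choice of $f$. The sketch below uses $f(t)=t\log t$, which under the convention of (4.3) corresponds to $\operatorname{KL}(Q,P)$ and whose conjugate is $f^{*}(u)=e^{u-1}$. The grid search, its range, and the function names are assumptions made for illustration.

```python
import math


def f_kl(t):
    """f(t) = t log t, a convex function with f(1) = 0."""
    return t * math.log(t)


def conjugate_on_grid(f, u, t_max=20.0, n=20000):
    """Approximate f*(u) = sup_t { u t - f(t) } by a grid search over (0, t_max]."""
    step = t_max / n
    return max(u * (i * step) - f(i * step) for i in range(1, n + 1))


# Compare the grid approximation with the closed form exp(u - 1).
for u in (0.0, 1.0, 2.0):
    print(u, conjugate_on_grid(f_kl, u), math.exp(u - 1.0))
```

The grid value slightly underestimates the supremum, and the approximation is only valid while the maximizer $t=e^{u-1}$ lies inside the assumed range $(0, 20]$.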

4.2 Wasserstein GANs

To resolve GAN training issues, [3] proposes to formulate a GAN problem by minimizing the optimal transport costs which unlike $f$ -divergences change continuously with the input distributions. Given a transportation cost $c(\mathbf{x},\mathbf{x}^{\prime})$ for transporting $\mathbf{x}$ to $\mathbf{x}^{\prime}$ , the optimal transport cost $W_{c}$ is defined as

W_{c}(P,Q)=\inf_{M\in\Pi(P,Q)}\>\mathbb{E}_{M}\bigl{[}c(\mathbf{X},\mathbf{X}^{\prime})\bigr{]}

(4.5)

where $\Pi(P,Q)$ denotes the set of all joint distributions on $(\mathbf{X},\mathbf{X}^{\prime})$ with $\mathbf{X},\,\mathbf{X}^{\prime}$ marginally distributed as $P,\,Q$ , respectively. An important special case is the first-order Wasserstein distance ( $W_{1}$ -distance) corresponding to $c(\mathbf{X},\mathbf{X}^{\prime})=\|\mathbf{X}-\mathbf{X}^{\prime}\|$ . In this special case, the Kantorovich-Rubinstein duality shows

W_{1}(P,Q)=\max_{\|D\|_{\operatorname{Lip}}\leq 1}\,\mathbb{E}_{P}[D(\mathbf{X})]-\mathbb{E}_{Q}[D(\mathbf{X})].

(4.6)

Here $\mathbb{E}_{P}[\cdot]$ denotes the expected value with respect to distribution $P$ and $\|D\|_{\operatorname{Lip}}$ denotes the Lipschitz constant of function $D$ which is defined as the smallest $L$ satisfying $|D(\mathbf{x})-D(\mathbf{x}^{\prime})|\leq L\,\|\mathbf{x}-\mathbf{x}^{\prime}\|$ for every $\mathbf{x},\mathbf{x}^{\prime}$ . Formulating a GAN problem minimizing the $W_{1}$ -distance, [3] states the Wasserstein GAN (WGAN) problem as follows:

\min_{G\in\mathcal{G}}\>\max_{\|D\|_{\operatorname{Lip}}\leq 1}\>\mathbb{E}\bigl{[}D(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}D(G(\mathbf{Z}))\bigr{]}.

(4.7)
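The duality (4.6) behind this formulation can be checked numerically in one dimension. The sketch below relies on the standard fact that, for two 1-D empirical distributions with the same number of samples, the $W_{1}$-distance is the average absolute difference between sorted samples. The sample size, the shift of 2, and the function names are assumptions made for illustration.

```python
import random


def w1_empirical_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions via sorted matching."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def dual_value(D, xs, ys):
    """E_P[D(X)] - E_Q[D(X)] for the empirical distributions P (xs) and Q (ys)."""
    return sum(D(x) for x in xs) / len(xs) - sum(D(y) for y in ys) / len(ys)


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]
ys = [x + 2.0 for x in xs]                 # Q is P shifted to the right by 2

primal = w1_empirical_1d(xs, ys)
dual = dual_value(lambda x: -x, xs, ys)    # D(x) = -x is 1-Lipschitz
print(primal, dual)
```

In this shifted example the 1-Lipschitz function $D(x)=-x$ attains the primal transport cost, so the maximum in (4.6) is reached by a linear discriminator.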

The above Wasserstein GAN problem can be generalized to a general optimal transport cost with arbitrary cost function $c(\mathbf{x},\mathbf{x}^{\prime})$ . The generalization is as follows:

\min_{G\in\mathcal{G}}\>\max_{D\,\operatorname{c-concave}}\>\mathbb{E}\bigl{[}D(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}D^{c}(G(\mathbf{Z}))\bigr{]},

(4.8)

where the c-transform is defined as $D^{c}(\mathbf{x})=\sup_{\mathbf{x}^{\prime}}\,D(\mathbf{x}^{\prime})-c(\mathbf{x},\mathbf{x}^{\prime})$ and a function $D$ is called c-concave if it is the c-transform of some valid function. In particular, the optimal transport GAN formulation with the quadratic cost $c(\mathbf{x},\mathbf{x}^{\prime})=\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2}$ results in the second-order Wasserstein GAN (W2GAN) problem which has been studied in several recent works [11, 43, 44, 45].
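To make the c-transform concrete, the following sketch evaluates $D^{c}$ on a grid for a scalar example with the quadratic cost $c(x,x^{\prime})=(x-x^{\prime})^{2}$ and $D(x)=-a\,x^{2}$, for which the supremum can be computed by hand as $D^{c}(x)=-\frac{a}{1+a}x^{2}$. The scalar setting, the grid range, and all names in the code are assumptions made for illustration.

```python
def c_transform_on_grid(D, cost, x, lo=-5.0, hi=5.0, n=10001):
    """Approximate D^c(x) = sup_{x'} { D(x') - c(x, x') } by a grid search over [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max(D(lo + i * step) - cost(x, lo + i * step) for i in range(n))


a = 1.0


def D(x):
    """Hypothetical concave quadratic discriminator D(x) = -a x^2."""
    return -a * x * x


def quadratic_cost(x, x_prime):
    """Quadratic transportation cost c(x, x') = (x - x')^2."""
    return (x - x_prime) ** 2


# Compare the grid approximation with the closed form -(a / (1 + a)) x^2.
for x in (0.0, 1.0, 2.0):
    print(x, c_transform_on_grid(D, quadratic_cost, x), -a / (1.0 + a) * x * x)
```

The grid search is only a stand-in for the exact supremum and is accurate here because the maximizer $x^{\prime}=x/(1+a)$ lies well inside the assumed interval.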

5 Existence of Nash Equilibrium Solutions in GANs

Consider a general GAN minimax problem (1.1) with a minimax objective $V(G,D)$ . As discussed in the previous section, the optimal generator $G^{*}\in\mathcal{G}$ is defined to minimize the GAN’s target divergence to the data distribution. The following proposition is a well-known result regarding the Nash equilibrium of the GAN game in realizable settings where there exists a generator $G\in\mathcal{G}$ producing the data distribution.

Proposition 1.

Assume that generator $G^{*}\in\mathcal{G}$ results in the distribution of data, i.e., we have $P_{G^{*}(\mathbf{Z})}=P_{\mathbf{X}}$ . Then, for each of the GAN problems discussed in Section 4 there exists a constant discriminator function $D_{\operatorname{constant}}$ which together with $G^{*}$ results in a Nash equilibrium to the GAN game, and hence satisfies the following for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ :

V(G^{*},D)\,\leq\,V(G^{*},D_{\operatorname{constant}})\,\leq\,V(G,D_{\operatorname{constant}}).

Proof.

This proposition is well-known for the vanilla GAN [46]. In the Appendix, we provide a proof for general $f$ -GANs and Wasserstein GANs. ∎

The above proposition shows that in a realizable setting with a generator function generating the distribution of observed samples, a Nash equilibrium exists for that optimal generator. However, the realizability assumption in this proposition does not always hold in real GAN experiments. For example, in the GAN experiments discussed in Section 3, we observed that the divergence estimate never reached the zero value because of regularizing the generator function. Therefore, the Nash equilibrium described in Proposition 1 does not apply to the trained generator and discriminator in such GAN experiments.

Here, we address the question of the existence of Nash equilibrium solutions for non-realizable settings, where no generator $G\in\mathcal{G}$ can produce the data distribution. Do Nash equilibria always exist in non-realizable GAN zero-sum games? The following theorem shows that the answer is in general no. Note that $\sigma_{\max}(\cdot)$ in this theorem denotes the maximum singular value, i.e., the spectral norm.

Theorem 1.

Consider a GAN minimax problem for learning a normally distributed $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$ with zero mean and scalar covariance matrix where $\sigma>1$. In the GAN formulation, we use a linear generator function $G(\mathbf{z})=\mathbf{W}\mathbf{z}+\mathbf{u}$ where the weight matrix $\mathbf{W}$ is spectrally-regularized to satisfy $\sigma_{\max}(\mathbf{W})\leq 1$. Suppose that the latent vector is normally distributed as $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},I)$ with zero mean and identity covariance matrix. Then,

• For the $f$-GAN problem corresponding to an $f$ with non-decreasing $t^{2}f^{\prime\prime}(t)$ over $t\in(0,+\infty)$ and an unconstrained discriminator $D$, where the dimensions of data $\mathbf{X}$ and latent $\mathbf{Z}$ match, the $f$-GAN minimax problem has no Nash equilibrium solutions.

• For the W2GAN problem with discriminator $D$ trained over $c$-concave functions, where $c$ is the quadratic cost, the W2GAN minimax problem has no Nash equilibrium solutions. Also, given a quadratic discriminator $D(\mathbf{x})=\mathbf{x}^{T}A\mathbf{x}+\mathbf{b}^{T}\mathbf{x}$ parameterized by $A,\mathbf{b}$, the W2GAN problem has no local Nash equilibria.

• For the WGAN problem with $1$-dimensional $X,Z$ and a discriminator $D$ trained over 1-Lipschitz functions, the WGAN minimax problem has no Nash equilibria.

Proof.

We defer the proof to the Appendix. Note that the condition on the $f$ -GAN holds for all $f$ -GAN examples in [10] including the vanilla GAN. ∎
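The 1-dimensional WGAN case of Theorem 1 can be illustrated with a closed-form check of one candidate pair; this is a sketch under our own choice of pair and does not replace the proof. Take $X\sim\mathcal{N}(0,\sigma^{2})$ with $\sigma>1$, the generator $G^{*}(z)=z$ (that is, $W=1$, $u=0$), and the 1-Lipschitz discriminator $D(x)=|x|$. This $D$ attains $\mathbb{E}|X|-\mathbb{E}|Z|=(\sigma-1)\sqrt{2/\pi}$, which equals the $W_{1}$-distance between $\mathcal{N}(0,\sigma^{2})$ and $\mathcal{N}(0,1)$, so it is an optimal discriminator for $G^{*}$. The code then freezes $D$ and shifts the generator to $G(z)=z+u$, using the folded-normal mean for $\mathbb{E}|Z+u|$. The function names and the value $\sigma=2$ are assumptions.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def mean_abs_shifted_normal(u):
    """E|Z + u| for Z ~ N(0, 1), i.e. the mean of a folded normal distribution."""
    return SQRT_2_OVER_PI * math.exp(-0.5 * u * u) + u * math.erf(u / math.sqrt(2.0))


def wgan_value(u, sigma):
    """V(G, D) = E[D(X)] - E[D(G(Z))] with D(x) = |x|, X ~ N(0, sigma^2), G(z) = z + u."""
    return sigma * SQRT_2_OVER_PI - mean_abs_shifted_normal(u)


sigma = 2.0
w1_optimum = (sigma - 1.0) * SQRT_2_OVER_PI   # W1 distance attained by G*(z) = z

# With D frozen, shifting the generator mean away from 0 lowers the objective.
for u in (0.0, 0.5, 1.0, 2.0):
    print(u, wgan_value(u, sigma))
```

At $u=0$ the objective equals the optimal $W_{1}$ value, and it decreases as $|u|$ grows. The generator can therefore improve its objective against the frozen discriminator by moving away from the data mean, so this particular pair is not a Nash equilibrium even though $G^{*}$ minimizes the Wasserstein distance.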

The above theorem shows that under the stated assumptions the GAN zero-sum game does not have Nash equilibrium solutions. Consequently, the optimal divergence-minimizing generative model does not result in a Nash equilibrium. In contrast, the following remark shows that the GAN zero-sum game in a non-realizable case may still have Nash equilibrium solutions when the assumptions of Theorem 1 do not hold.

Remark 1.

Consider the same setting as in Theorem 1. However, unlike Theorem 1 suppose that $\sigma<1$ and $\sigma_{\min}(\mathbf{W})\geq 1$ where $\sigma_{\min}$ stands for the minimum singular value. Then, for the WGAN and W2GAN problems described in Theorem 1, the Wasserstein distance-minimizing generator results in a Nash equilibrium.

Proof.

We defer the proof to the Appendix. ∎

The above remark explains that the phenomenon shown in Theorem 1 does not always hold in non-realizable GAN settings. As a result, we need other notions of equilibrium which consistently explain optimality in GAN games.

6 Proximal Equilibrium: A Relaxation of Nash Equilibrium

To define a proper notion of equilibrium for GANs, note that due to the sequential nature of GAN games the equilibrium notion should be flexible enough to allow, to some extent, optimization of the discriminator around the equilibrium solution. This property is in fact consistent with the stability observed for first-order GAN training methods, where the alternating first-order method stabilizes around a certain solution. To this end, we consider the following objective for a GAN problem with minimax objective $V(G,D)$:

V_{\lambda}^{\operatorname{prox}}(G,D)\,\coloneqq\,\max_{\widetilde{D}\in\mathcal{D}}\,V(G,\widetilde{D})-\frac{\lambda}{2}\bigl{\|}\widetilde{D}-D\bigr{\|}^{2}.

(6.1)

The above definition represents the application of a proximal operator to $V(G,D)$, which further optimizes the original objective in the proximity of discriminator $D$. To keep the function variable $\widetilde{D}$ close to $D$, we penalize the distance between the two functions in the proximal optimization. Here the distance is measured using a norm $\|\cdot\|$ on the discriminator function space.
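As a worked illustration of (6.1), consider a scalar bilinear toy game, an illustrative assumption rather than one of the GAN formulations of Section 4: take $V(g,d)=g\,d$ with $g,d\in\mathbb{R}$ and measure the distance between discriminators by $|\widetilde{d}-d|$. The inner maximization is then a concave quadratic in $\widetilde{d}$ with maximizer $\widetilde{d}=d+g/\lambda$, which gives

V_{\lambda}^{\operatorname{prox}}(g,d)\,=\,\max_{\widetilde{d}\in\mathbb{R}}\,g\,\widetilde{d}-\frac{\lambda}{2}\bigl(\widetilde{d}-d\bigr)^{2}\,=\,g\,d+\frac{g^{2}}{2\lambda}.

Under these assumptions, the correction term $\frac{g^{2}}{2\lambda}$ vanishes as $\lambda$ grows, so the proximal objective approaches $V(g,d)$, whereas for small $\lambda$ it grows without bound for every $g\neq 0$, mirroring the fact that $\max_{d}\,g\,d$ is infinite unless $g=0$.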

To extend the notion of Nash equilibrium to general minimax problems, we propose considering the Nash equilibria of the defined $V_{\lambda}^{\operatorname{prox}}(G,D)$ .

Definition 1.

We call $(G^{*},D^{*})$ a $\lambda$ -proximal equilibrium for $V(G,D)$ if it represents a Nash equilibrium for $V_{\lambda}^{\operatorname{prox}}(G,D)$ , i.e. for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$

V_{\lambda}^{\operatorname{prox}}(G^{*},D)\leq V_{\lambda}^{\operatorname{prox}}(G^{*},D^{*})\leq V_{\lambda}^{\operatorname{prox}}(G,D^{*}).

(6.2)

The next proposition provides necessary and sufficient conditions in terms of the original objective $V(G,D)$ for the proximal equilibrium solutions.

Proposition 2.

$(G^{*},D^{*})$ is a $\lambda$ -proximal equilibrium if and only if for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ we have

\displaystyle V(G^{*},D)\leq V(G^{*},D^{*})\leq\max_{\widetilde{D}\in\mathcal{D}}V(G,\widetilde{D})-\frac{\lambda}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}

Therefore, if $(G^{*},D^{*})$ is a $\lambda$ -proximal equilibrium it will give a global minimax solution, i.e., $G^{*}\in\mathcal{G}$ minimizes the worst-case objective, $\max_{D\in\mathcal{D}}V(G,D)$ , with $D^{*}$ being its optimal solution.

Proof.

We defer the proof to the Appendix. ∎
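The conditions of Proposition 2 can be checked by brute force on a toy problem. The sketch below uses the hypothetical scalar objective $V(g,d)=-\frac{1}{2}g^{2}+2gd-d^{2}$, which is not a GAN objective, with the distance $|\widetilde{d}-d|$ and finite grids standing in for $\mathcal{G}$ and $\mathcal{D}$. The grids, tolerance, and function names are assumptions. For this toy a direct calculation gives $\max_{\widetilde{d}}V(g,\widetilde{d})-\frac{\lambda}{2}\widetilde{d}^{2}=-\frac{1}{2}g^{2}+\frac{g^{2}}{1+\lambda/2}$, which is nonnegative for all $g$ exactly when $\lambda\leq 2$.

```python
def V(g, d):
    """Hypothetical toy objective: concave in d, but also concave in g."""
    return -0.5 * g * g + 2.0 * g * d - d * d


def grid(lo, hi, n):
    """n evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


GS = grid(-1.0, 1.0, 41)     # candidate generators
DS = grid(-2.0, 2.0, 401)    # candidate discriminators


def is_nash(g_star, d_star, tol=1e-9):
    """Check V(g*, d) <= V(g*, d*) <= V(g, d*) over the grids."""
    v = V(g_star, d_star)
    d_ok = all(V(g_star, d) <= v + tol for d in DS)
    g_ok = all(v <= V(g, d_star) + tol for g in GS)
    return d_ok and g_ok


def is_proximal_equilibrium(g_star, d_star, lam, tol=1e-9):
    """Check the two inequalities of Proposition 2 over the grids."""
    v = V(g_star, d_star)
    d_ok = all(V(g_star, d) <= v + tol for d in DS)
    g_ok = all(
        v <= max(V(g, dt) - 0.5 * lam * (dt - d_star) ** 2 for dt in DS) + tol
        for g in GS
    )
    return d_ok and g_ok


print(is_nash(0.0, 0.0))
print(is_proximal_equilibrium(0.0, 0.0, lam=1.0))
print(is_proximal_equilibrium(0.0, 0.0, lam=4.0))
```

On these grids the pair $(0,0)$ fails the Nash check, passes the proximal check for $\lambda=1$, and fails it for $\lambda=4$, in line with the threshold $\lambda\leq 2$ computed above. Values of $\lambda$ very close to the threshold are not reliable here, since the grid maximum slightly underestimates the true maximum.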

The following result shows the proximal equilibria provide a hierarchy of equilibrium solutions for different $\lambda$ values.

Proposition 3.

Define $\operatorname{PE}_{\lambda}(V)$ to be the set of the $\lambda$ -proximal equilibria for $V(G,D)$ . Then, if $\lambda_{1}\leq\lambda_{2}$ ,

\operatorname{PE}_{\lambda_{2}}(V)\subseteq\operatorname{PE}_{\lambda_{1}}(V).

(6.3)

Proof.

We defer the proof to the Appendix. ∎

Note that as $\lambda$ approaches infinity, $V_{\lambda}^{\operatorname{prox}}(G,D)$ tends to the original $V(G,D)$ , implying that $\operatorname{PE}_{\lambda=+\infty}(V)$ is the set of $V(G,D)$ ’s Nash equilibria. In contrast, for $\lambda=0$ the proximal objective becomes the worst-case objective $\max_{D\in\mathcal{D}}V(G,D)$ . As a result, $\operatorname{PE}_{\lambda=0}(V)$ is the set of global minimax solutions described in Proposition 2.

Concerning the proximal optimization problem in (6.1), the following proposition shows that if the original minimax objective is a smooth function of the discriminator parameters, the proximal optimization can be solved efficiently and therefore one can efficiently compute the gradient of the proximal objective.

Proposition 4.

Consider the maximization problem in the definition of proximal objective (6.1) where generator $G_{\bm{\theta}}$ and discriminator $D_{\mathbf{w}}$ are parameterized by vectors $\bm{\theta},\,\mathbf{w}$ , respectively. Suppose that

• For the considered discriminator norm $\|\cdot\|$, $\|D_{\mathbf{w}}-D\|^{2}$ is $\eta_{1}$-strongly convex in $\mathbf{w}$ for any function $D$, i.e., for any $\mathbf{w},\mathbf{w}^{\prime},D$:

\bigl{\|}\nabla_{\mathbf{w}}\|D_{\mathbf{w}}-D\|^{2}\,-\,\nabla_{\mathbf{w}}\|D_{\mathbf{w}^{\prime}}-D\|^{2}\bigr{\|}_{2}\,\geq\,\eta_{1}\bigl{\|}\mathbf{w}-\mathbf{w}^{\prime}\bigr{\|}_{2},

• For every $G_{\bm{\theta}}$, the GAN minimax objective $V(G_{\bm{\theta}},D_{\mathbf{w}})$ is $\eta_{2}$-smooth in $\mathbf{w}$, i.e., for any $\mathbf{w},\mathbf{w}^{\prime},\bm{\theta}$:

\bigl{\|}\nabla_{\mathbf{w}}V(G_{\bm{\theta}},D_{\mathbf{w}})\,-\,\nabla_{\mathbf{w}}V(G_{\bm{\theta}},D_{\mathbf{w}^{\prime}})\bigr{\|}_{2}\,\leq\,\eta_{2}\|\mathbf{w}-\mathbf{w}^{\prime}\|_{2}.

Under the above assumptions, if $\eta_{2}<\frac{\lambda\eta_{1}}{2}$, the maximization objective in (6.1) is $\bigl(\frac{\lambda\eta_{1}}{2}-\eta_{2}\bigr)$-strongly concave. Then, the maximization problem has a unique solution $\mathbf{w}^{*}$, and if $V(G_{\bm{\theta}},D_{\mathbf{w}})$ is differentiable with respect to $\bm{\theta}$ we have

\nabla_{\bm{\theta}}V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}})=\nabla_{\bm{\theta}}V(G_{\bm{\theta}},D_{\mathbf{w}^{*}}).

(6.4)

Proof.

We defer the proof to the Appendix. ∎

The above proposition suggests that under the mentioned assumptions, one can efficiently compute the optimal solution to the proximal maximization through a first-order optimization method. The assumptions require the smoothness of the GAN minimax objective with respect to the discriminator parameters, which can be imposed by applying norm-based regularization tools to neural network discriminators.
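The gradient identity (6.4) can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration and not the training procedure of this paper: it uses the hypothetical scalar objective $V(\theta,w)=\theta w+\frac{a}{2}w^{2}$ with the distance $|\widetilde{w}-w|$, so that $\eta_{1}=2$ and $\eta_{2}=a$, and it picks $\lambda=4>a=1$ so that the condition $\eta_{2}<\frac{\lambda\eta_{1}}{2}$ holds. The inner maximization is solved by plain gradient ascent, and the resulting gradient is compared with a finite-difference derivative of the proximal objective.

```python
A = 1.0      # curvature of the toy objective in w (plays the role of eta_2)
LAM = 4.0    # proximal coefficient lambda; the toy needs LAM > A


def V(theta, w):
    """Hypothetical toy objective, smooth in w but not concave in w."""
    return theta * w + 0.5 * A * w * w


def dV_dw(theta, w):
    return theta + A * w


def dV_dtheta(theta, w):
    return w


def prox_argmax(theta, w, steps=200, lr=0.2):
    """Gradient ascent on w_tilde -> V(theta, w_tilde) - LAM / 2 * (w_tilde - w)^2."""
    w_tilde = w
    for _ in range(steps):
        grad = dV_dw(theta, w_tilde) - LAM * (w_tilde - w)
        w_tilde += lr * grad
    return w_tilde


def V_prox(theta, w):
    """Proximal objective (6.1) evaluated at the numerically computed maximizer."""
    w_star = prox_argmax(theta, w)
    return V(theta, w_star) - 0.5 * LAM * (w_star - w) ** 2


theta, w = 0.5, 1.0
w_star = prox_argmax(theta, w)

# Left-hand side of (6.4) by central finite differences, right-hand side in closed form.
h = 1e-5
fd_grad = (V_prox(theta + h, w) - V_prox(theta - h, w)) / (2.0 * h)
print(w_star, fd_grad, dV_dtheta(theta, w_star))
```

For this toy the inner maximizer has the closed form $w^{*}=\frac{\theta+\lambda w}{\lambda-a}$, and the finite-difference gradient of the proximal objective agrees with $\nabla_{\theta}V$ evaluated at $w^{*}$, as (6.4) predicts. In a GAN setting, the inner loop would be replaced by a few ascent steps on the discriminator parameters.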

7 Proximal Equilibrium in Wasserstein GANs

As shown earlier, GAN minimax games may not have any Nash equilibria in non-realizable settings. As a result, we seek a different notion of equilibrium which remains applicable to GAN problems. Here, we show that the proposed proximal equilibrium provides such an equilibrium notion for Wasserstein GAN problems.

To define a proper proximal operator for defining proximal equilibria in Wasserstein GAN problems, we use the second-order Sobolev semi-norm averaged over the underlying distribution of data. Given the underlying distribution $P_{\mathbf{X}}$ , we define the Sobolev semi-norm as

\displaystyle\big{\|}D\big{\|}_{\dot{H}^{1}}\,\coloneqq\,\sqrt{\,\mathbb{E}_{P_{\mathbf{X}}}\left[\big{\|}\nabla_{\mathbf{x}}D(\mathbf{X})\big{\|}^{2}_{2}\,\right]}\,.

(7.1)

The above semi-norm is induced by the following semi-inner product and therefore leads to a semi-Hilbert space of functions:

\langle D_{1},D_{2}{\rangle}_{\dot{H}^{1}}\coloneqq\mathbb{E}_{P_{\mathbf{X}}}\left[\,\nabla D_{1}(\mathbf{X})^{T}\nabla D_{2}(\mathbf{X})\,\right].

(7.2)
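The semi-norm (7.1) can be estimated from samples. The following sketch is a minimal Monte Carlo illustration in one dimension, assuming $P_{\mathbf{X}}=\mathcal{N}(0,1)$ and the hypothetical discriminator $D(x)=\frac{a}{2}x^{2}$, for which $\|D\|_{\dot{H}^{1}}=a$. The finite-difference gradient, the sample size, and the function names are assumptions.

```python
import math
import random


def sobolev_seminorm_1d(D, samples, h=1e-5):
    """Monte Carlo estimate of sqrt(E[|D'(X)|^2]) using central finite differences."""
    total = 0.0
    for x in samples:
        grad = (D(x + h) - D(x - h)) / (2.0 * h)
        total += grad * grad
    return math.sqrt(total / len(samples))


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(200000)]

a = 3.0


def D(x):
    """Hypothetical quadratic discriminator with D'(x) = a x."""
    return 0.5 * a * x * x


def D_shifted(x):
    """Same discriminator plus a constant offset."""
    return D(x) + 5.0


print(sobolev_seminorm_1d(D, samples))
print(sobolev_seminorm_1d(D_shifted, samples))
```

Both estimates are close to $a=3$ and agree with each other: adding a constant to $D$ does not change the value, which is why (7.1) is a semi-norm rather than a norm.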

Throughout our discussion, we consider a parameterized set of generators $\mathcal{G}=\{G_{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$ . For a GAN minimax objective $V(G,D)$ , we define $D^{\bm{\theta}}$ to be the optimal discriminator function for the parameterized generator $G_{\bm{\theta}}$ :

D^{\bm{\theta}}\coloneqq\underset{D\in\mathcal{D}}{\arg\!\max}\>V(G_{\bm{\theta}},D).

(7.3)

The following theorem shows that the Wasserstein distance-minimizing generator function in the second-order Wasserstein GAN problem satisfies the conditions of a proximal equilibrium based on the Sobolev semi-norm defined in (7.1).

Theorem 2.

Consider the second-order Wasserstein GAN problem (4.8) with a quadratic cost $c(\mathbf{x},\mathbf{x}^{\prime})=\frac{\eta}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2}$ . Suppose that the set of optimal discriminators $\{D^{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$ is convex. Then, $(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})$ for the Wasserstein distance-minimizing generator $G_{\bm{\theta}^{*}}\in\mathcal{G}$ will provide a $\frac{1}{\eta}$ -proximal equilibrium with respect to the Sobolev norm in (7.1).

Proof.

We defer the proof to the Appendix. ∎

The above theorem shows that while, as demonstrated in Theorem 1, the W2GAN problem may have no local Nash equilibrium solutions, the proximal equilibrium exists for the W2GAN problem and holds at the Wasserstein-distance minimizing generator $G_{\bm{\theta}^{*}}$ . The next theorem extends this result to the first-order Wasserstein GAN (WGAN) problem.

Theorem 3.

Consider the WGAN problem (4.7) minimizing the first-order Wasserstein distance. For each $G_{\bm{\theta}}$ , define $\alpha_{\bm{\theta}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{\geq 0}$ to be the magnitude of the resulting optimal transport map from $\mathbf{X}$ to $G_{\bm{\theta}}(\mathbf{Z})$ , i.e., $\mathbf{X}-{\alpha}^{2}_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})$ shares the same distribution with $G_{\bm{\theta}}(\mathbf{Z})$ . (As shown in the proof, such a mapping $\alpha_{\bm{\theta}}$ exists under mild regularity assumptions.) Given these definitions, assume that

• $\{{\alpha}_{\bm{\theta}}(\cdot)\nabla D^{\bm{\theta}}(\cdot):\,\bm{\theta}\in\Theta\}$ is a convex set,

• for every $\mathbf{x}$ and $\bm{\theta}$ , $\frac{\eta}{2}\leq\alpha_{\bm{\theta}}^{2}(\mathbf{x})$ holds for a constant $\eta$ .

Then, $(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})$ for the Wasserstein distance-minimizing generator function $G_{\bm{\theta}^{*}}$ provides an $\eta$ -proximal equilibrium with respect to the Sobolev norm in (7.1).

Proof.

We defer the proof to the Appendix. ∎

The above theorem shows that if $\alpha_{\bm{\theta}}^{2}$ is everywhere lower-bounded by $\frac{\eta}{2}$ , then the Wasserstein distance-minimizing generator in the WGAN problem yields an $\eta$ -proximal equilibrium.

8 Proximal Training

As shown for Wasserstein GAN problems, given the defined Sobolev norm and a small enough $\lambda$ the proximal objective $V_{\lambda}^{\operatorname{prox}}(G,D)$ will possess a Nash equilibrium solution. This result motivates performing the minimax optimization for the proximal objective $V_{\lambda}^{\operatorname{prox}}(G,D)$ instead of the original objective $V(G,D)$ . Therefore, we propose proximal training in which we solve the following minimax optimization problem:

\min_{G_{\bm{\theta}}\in\mathcal{G}}\;\max_{D_{\mathbf{w}}\in\mathcal{D}}\;V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}}),

(8.1)

with the proximal operator defined according to the Sobolev norm in (7.1).

In order to take the gradient of $V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}})$ with respect to $\bm{\theta}$ , Proposition 4 suggests solving the proximal optimization followed by computing the gradient of the original objective $V(G_{\bm{\theta}},D_{\mathbf{w}^{*}})$ where the discriminator is parameterized with the optimal solution $\mathbf{w}^{*}$ to the proximal optimization.

Algorithm 1 GAN Proximal Training

Input: data $\mathbf{x}_{i}$ , size $n$
Initialize the parameters $\mathbf{w}^{(0)},\bm{\theta}^{(0)}$
for $k=0$ to MAX_ITER do
    Update $\mathbf{w}^{(k+1)}\,=\,\underset{\mathbf{w}}{\arg\!\max}\>V(G_{\bm{\theta}^{(k)}},D_{\mathbf{w}})-\frac{\lambda}{2n}\sum_{i=1}^{n}\|\nabla_{\mathbf{x}}D_{\mathbf{w}}(\mathbf{x}_{i})-\nabla_{\mathbf{x}}D_{\mathbf{w}^{(k)}}(\mathbf{x}_{i})\|^{2}_{2}$
    Update ${\bm{\theta}}^{(k+1)}={\bm{\theta}}^{(k)}-\gamma_{k}\nabla_{\bm{\theta}}V(G_{\bm{\theta}^{(k)}},D_{\mathbf{w}^{(k+1)}})$
end for

Algorithm 1 summarizes the two main steps of proximal training. At every iteration, the discriminator is optimized with an additive Sobolev norm penalty that forces it to remain in the proximity of the current discriminator. Next, the generator is updated by a gradient descent step, with the gradient evaluated at the optimal discriminator solving the proximal optimization. The stepsize parameter $\gamma_{k}$ can be adaptively selected at every iteration $k$ . In practice, we can solve the proximal maximization problem approximately by running a first-order optimization method for a fixed number of iterations. Assuming the conditions of Proposition 4 hold, the proximal optimization is the maximization of a strongly-concave objective, for which first-order optimization methods converge at a linear rate.
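The two steps of Algorithm 1 can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical one-dimensional objective $V(\theta,w)=w\,(\operatorname{mean}(x)-\operatorname{mean}(z+\theta))-w^{2}/2$ with a linear discriminator $D_{w}(x)=wx$ and a shift generator $G_{\theta}(z)=z+\theta$ , so that $\nabla_{x}D_{w}=w$ and the Sobolev penalty reduces to $\frac{\lambda}{2}(w-w^{(k)})^{2}$ . Plain gradient ascent stands in for the inner solver, and all hyperparameter values are arbitrary.

```python
import random

def mean(values):
    return sum(values) / len(values)

def proximal_training(x, z, lam=0.1, gamma=0.1, outer_iters=200,
                      inner_iters=50, inner_lr=0.5):
    """Toy version of Algorithm 1 for V(theta, w) = w * gap(theta) - w**2 / 2."""
    theta, w = 0.0, 0.0
    mean_x, mean_z = mean(x), mean(z)
    for _ in range(outer_iters):
        gap = mean_x - (mean_z + theta)   # mean(x) - mean(G_theta(z))
        w_prev, w_new = w, w
        # Proximal step: ascend V(theta, w) - (lam / 2) * (w - w_prev)**2 in w.
        for _ in range(inner_iters):
            grad_w = gap - w_new - lam * (w_new - w_prev)
            w_new += inner_lr * grad_w
        w = w_new
        # Generator step: dV/dtheta = -w, evaluated at the updated discriminator.
        grad_theta = -w
        theta -= gamma * grad_theta
    return theta, w

random.seed(0)
x = [random.gauss(2.0, 1.0) for _ in range(1000)]   # stand-in for real data
z = [random.gauss(0.0, 1.0) for _ in range(1000)]   # latent noise
theta, w = proximal_training(x, z)
```

In this toy game the iterates approach $\theta=\operatorname{mean}(x)-\operatorname{mean}(z)$ and $w=0$ . The inner objective is strongly concave in $w$ , so a fixed number of ascent steps solves the proximal problem to high accuracy.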

9 Numerical Experiments

To evaluate the theoretical results of this work, we performed several experiments using the implementation of Wasserstein GANs from [4], with the code available at that paper’s GitHub repository. In addition, we used the implementations of [5, 47] for applying spectral regularization to the discriminator network. In the experiments, we used the DC-GAN 4-layer CNN architecture for both the discriminator and generator functions [38] and ran each experiment for 200,000 generator iterations with 5 discriminator updates per generator update. We used the RMSprop optimizer [40] for WGAN experiments with weight clipping or spectral normalization and the Adam optimizer [39] for the other experiments.

9.1 Proximal Equilibrium in Wasserstein and Lipschitz GANs

We examined whether the solutions found by Wasserstein and Lipschitz vanilla GANs represent proximal equilibria. To this end, we repeated the experiments of Section 3 for the WGAN-WC [3], WGAN-GP [4], and SN-GAN [5] problems on the MNIST and CelebA datasets. In Section 3, we observed that after fixing the trained discriminator $D_{\mathbf{w}_{\operatorname{final}}}$ the GAN’s minimax objective $V(G_{\bm{\theta}},D_{\mathbf{w}_{\operatorname{final}}})$ kept decreasing when we optimized only the generator $G_{\bm{\theta}}$ . In the new experiments, we similarly fixed the trained discriminator $D_{\mathbf{w}_{\operatorname{final}}}$ resulting from the 200,000 training iterations, but instead of optimizing the GAN minimax objective we optimized the proximal objective defined by the norm (7.1) with $\lambda=0.1$ . Thus, we solved the following optimization problem initialized at ${\bm{\theta}}_{\operatorname{final}}$ , which denotes the parameters of the trained generator:

\min_{\bm{\theta}}\>V^{\operatorname{prox}}_{\lambda=0.1}(G_{\bm{\theta}},D_{\mathbf{w}_{\operatorname{final}}}).

(9.1)

We computed the gradient of the above proximal objective by applying the Adam optimizer for $50$ steps to approximate the solution to the proximal optimization (6.1), which at every iteration was initialized at $\mathbf{w}_{\operatorname{final}}$ . Figures 2(a) and 2(b) show that in the SN-GAN experiments the original minimax objective changed only slightly compared to the results in Section 3, and the quality of generated samples did not change significantly during the optimization. We defer the similar numerical results of the WGAN-WC and WGAN-GP experiments to the Appendix. These numerical results suggest that while Wasserstein and Lipschitz GANs may not converge to local Nash equilibrium solutions, as shown in Section 3, the solutions they find can still represent a local proximal equilibrium.

9.2 Proximal Training Improves Lipschitz GANs

Table 1: Inception scores for ordinary vs. proximal training

GAN Problem	Ordinary	Proximal
WGAN-WC (DIM=64)	$4.16\pm 0.15$	$4.56\pm 0.19$
WGAN-WC (DIM=128)	$2.52\pm 0.12$	$4.23\pm 0.15$
SN-GAN (DIM=64)	$5.12\pm 0.25$	$5.72\pm 0.22$
SN-GAN (DIM=128)	$5.62\pm 0.23$	$6.12\pm 0.22$

We applied the proximal training in Algorithm 1 to the WGAN-WC and SN-GAN problems. To compute the gradient of the proximal minimax objective, we solved the maximization problem in the first step of Algorithm 1’s for loop by applying $20$ steps of Adam optimization initialized at the discriminator parameters at that iteration. Applying the proximal training to the MNIST, CIFAR-10, and CelebA datasets, we qualitatively observed slightly better visual quality in the generated images. We postpone the generated samples to the Appendix.

To quantitatively compare proximal and ordinary (non-proximal) GAN training, we measured the Inception scores of the samples generated in the CIFAR-10 experiments. As shown in Table 1, proximal training results in an improved Inception score in all four settings. In this table, DIM stands for the dimension parameter of the DC-GAN’s CNN networks.

References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[2] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[4] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
[5] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.
[7] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232, 2017.
[8] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[10] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.
[11] Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding gans: the lqg setting. arXiv preprint arXiv:1710.10793, 2017.
[12] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[13] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. arXiv preprint arXiv:1711.00141, 2017.
[14] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
[15] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
[16] Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12659–12670, 2019.
[17] Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, pages 11598–11610, 2019.
[18] Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
[19] Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of learning dynamics in stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
[20] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019.
[21] Tianyi Lin, Chi Jin, and Michael I Jordan. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, 2019.
[22] Qi Lei, Jason D Lee, Alexandros G Dimakis, and Constantinos Daskalakis. Sgd learns one-layer networks in wgans. arXiv preprint arXiv:1910.07030, 2019.
[23] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2020.
[24] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, 2018.
[25] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pages 5585–5595, 2017.
[26] Ya-Ping Hsieh, Chen Liu, and Volkan Cevher. Finding mixed nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002, 2018.
[27] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: Gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446, 2017.
[28] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev gan. arXiv preprint arXiv:1711.04894, 2017.
[29] David Berthelot, Thomas Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626–6637, 2017.
[31] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pages 1825–1835, 2017.
[32] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pages 2018–2028, 2017.
[33] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.
[34] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? arXiv preprint arXiv:1801.04406, 2018.
[35] Zhiming Zhou, Jiadong Liang, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Yong Yu, and Zhihua Zhang. Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687, 2019.
[36] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[37] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[38] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[40] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 14(8), 2012.
[41] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
[42] Farzan Farnia and David Tse. A convex duality framework for gans. In Advances in Neural Information Processing Systems, pages 5248–5258, 2018.
[43] Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving gans using optimal transport. arXiv preprint arXiv:1803.05573, 2018.
[44] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training gans with regularized optimal transport. In Advances in Neural Information Processing Systems, pages 7091–7101, 2018.
[45] Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.
[46] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[47] Farzan Farnia, Jesse Zhang, and David Tse. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2019.
[48] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[49] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[50] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
[51] Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.

Appendix A Numerical Results for Section 3

Here, we provide the complete numerical results for the experiments discussed in Section 3 of the main text. Regarding the plots shown in Section 3 for the SN-GAN implementation, here we present the same plots for the Wasserstein GAN with weight clipping (WGAN-WC) and with gradient penalty (WGAN-GP) problems. Figures 3(a)-4(b) repeat the experiments of Figures 1,2 in the main text for the WGAN-WC and WGAN-GP problems. These plots suggest that a similar result also holds for the WGAN-WC and WGAN-GP problems, where the objective and the generated samples’ quality were decreasing during the generator optimization. For a larger set of generated samples in the main text’s Figures 1,2 and Figures 3(a)-4(b), we refer the readers to Figures 5(a)-7(b).

Appendix B Numerical Results for Section 9

Here, we present the complete numerical results for the experiments of Section 9 in the main text. Figures 8(a)-9(b) demonstrate the results of the main text’s Figures 3,4 for the WGAN-WC and WGAN-GP problems. Except for the WGAN-GP experiment on the CelebA dataset, we observed that the objective and the generated samples’ quality did not significantly decrease over the generator optimization. Even for the WGAN-GP experiment on the CelebA data, the decrease in the objective value was three times smaller when minimizing the proximal objective than when minimizing the original objective. These experiments suggest that the Wasserstein and Lipschitz GAN problems can converge to local proximal equilibrium solutions. We also show a larger group of generated samples at the beginning and final iterations of Figures 3,4 in the main text and Figures 8(a)-9(b) in Figures 10(a)-12(b).

For the proximal training experiments, Figures 13-15(b) show the samples generated by the SN-GAN and WGAN-WC proximally trained on CIFAR-10 and CelebA data with the results for the baseline regular training on the top of the figure and the results for proximal training on the bottom. We observed a somewhat improved quality achieved by proximal training, which was further supported by the inception scores for the CIFAR-10 experiments reported in the main text.

Appendix C Proofs

C.1 Proof of Proposition 1

Proof for $f$ -GANs:

Consider the following $f$ -GAN minimax problem corresponding to the convex function $f$ :

\min_{G\in\mathcal{G}}\>\max_{D}\>\mathbb{E}\bigl{[}D(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}f^{*}(D(G(\mathbf{Z})))\bigr{]}.

(C.1)

Due to the realizability assumption, given $G^{*}\in\mathcal{G}$ we assume that the data distribution and the generative model are identical, i.e., $P_{\mathbf{X}}=P_{G^{*}(\mathbf{Z})}$ . Then, the minimax objective for $G^{*}$ reduces to

\mathbb{E}_{P_{\mathbf{X}}}\bigl{[}\,D(\mathbf{X})-f^{*}(D(\mathbf{X}))\,\bigr{]}.

(C.2)

The above objective decouples across $\mathbf{X}$ outcomes. As a result, the maximizing discriminator $D^{*}(\mathbf{x})=f^{\prime}(1)$ will be a constant function where the constant value $f^{\prime}(1)$ follows from the optimization problem:

f^{\prime}(1)=\underset{u\in\mathbb{R}}{\arg\!\max}\>u-f^{*}(u).

(C.3)

Note that the objective $u-f^{*}(u)$ is a concave function of $u$ whose derivative is zero at ${{f^{*}}^{\prime}}^{-1}(1)=f^{\prime}(1)$ , because the Fenchel-conjugate of a convex $f$ satisfies ${{f^{*}}^{\prime}}^{-1}=f^{\prime}$ .

So far we have proved that the constant function $D_{\operatorname{constant}}(\mathbf{x})=f^{\prime}(1)$ provides the optimal discriminator for generator $G^{*}$ . Therefore, for every discriminator $D$ we have

V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}}),

(C.4)

where $V(G,D)$ denotes the $f$ -GAN’s minimax objective. Moreover, note that for a constant $D$ the value of the minimax objective does not change with generator $G$ . As a result, for every $G$

V(G,D_{\operatorname{constant}})=V(G^{*},D_{\operatorname{constant}}).

(C.5)

Then, (C.4) and (C.5) collectively prove that for every $G$ and $D$ we have

V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}})\leq V(G,D_{\operatorname{constant}}),

which completes the proof for $f$ -GANs.

Proof for Wasserstein GANs:

Consider a general Wasserstein GAN problem with a cost function $c$ satisfying $c(\mathbf{x},\mathbf{x})=0$ for every $\mathbf{x}$ . Notice that this property holds for all Wasserstein distance measures corresponding to cost function $\|\mathbf{x}-\mathbf{x}^{\prime}\|^{q}$ for $q\geq 1$ . The generalized Wasserstein GAN minimax problem is as follows:

\min_{G\in\mathcal{G}}\>\max_{D\,\text{\rm$c$-concave}}\>\mathbb{E}[D(\mathbf{X})]-\mathbb{E}\bigl{[}D^{c}(G(\mathbf{Z}))\bigr{]}.

(C.6)

Due to the realizability assumption, a generator function $G^{*}\in\mathcal{G}$ results in the data distribution such that $P_{G^{*}(\mathbf{Z})}=P_{\mathbf{X}}$ . Then, the above minimax objective for $G^{*}$ reduces to

\mathbb{E}_{P_{\mathbf{X}}}\bigl{[}\,D(\mathbf{X})-D^{c}(\mathbf{X})\,\bigr{]}.

(C.7)

Since the cost is assumed to take a zero value given identical inputs, we have:

\begin{aligned}D^{c}(\mathbf{x})&\coloneqq\max_{\mathbf{x}^{\prime}}\,D(\mathbf{x}^{\prime})-c(\mathbf{x},\mathbf{x}^{\prime})\\&\geq D(\mathbf{x})-c(\mathbf{x},\mathbf{x})\\&=D(\mathbf{x}).\end{aligned}

As a result, $D(\mathbf{x})-D^{c}(\mathbf{x})\leq 0$ holds for every $\mathbf{x}$ . Hence, the objective in (C.7) will be non-positive and takes its maximum zero value for any constant function $D_{\operatorname{constant}}$ , which by definition satisfies $c$ -concavity. Therefore, letting $V(G,D)$ denote the GAN minimax objective, for every $D$ we have

V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}}).

(C.8)
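The inequality $D^{c}(\mathbf{x})\geq D(\mathbf{x})$ behind (C.8) can be checked numerically. The sketch below is a hypothetical one-dimensional illustration: the grid, the test discriminator, and the helper name `c_transform` are assumptions, and the maximization over $\mathbf{x}^{\prime}$ is restricted to the grid. The inequality itself uses only $c(\mathbf{x},\mathbf{x})=0$ , so the test function does not need to be $c$ -concave.

```python
def c_transform(D, cost, grid):
    """Grid approximation of D^c(x) = max over x' of D(x') - c(x, x')."""
    return [max(D(xp) - cost(x, xp) for xp in grid) for x in grid]

grid = [i / 10.0 for i in range(-30, 31)]
cost = lambda x, xp: 0.5 * (x - xp) ** 2        # quadratic cost, c(x, x) = 0

D = lambda x: -abs(x) + 0.3 * x                 # arbitrary test discriminator
D_c = c_transform(D, cost, grid)
gaps = [D(x) - dc for x, dc in zip(grid, D_c)]  # D(x) - D^c(x), never positive

const = lambda x: 1.7                           # constant discriminator
const_c = c_transform(const, cost, grid)        # equals the same constant
```

For the constant discriminator the gap is identically zero, which matches the statement that the objective (C.7) attains its maximum value of zero at any constant function.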

We also know that for a constant discriminator $D_{\operatorname{constant}}$ the value of the minimax objective is independent from the generator function. Therefore, for every $G$ we have

V(G^{*},D_{\operatorname{constant}})=V(G,D_{\operatorname{constant}}).

(C.9)

As a result, (C.8) and (C.9) together show that for every $G$ and $D$

V(G^{*},D)\leq V(G^{*},D_{\operatorname{constant}})\leq V(G,D_{\operatorname{constant}}),

(C.10)

which makes the proof complete for Wasserstein GANs.

C.2 Proof of Theorem 1 & Remark 1

Proof for $f$ -GANs:

Lemma 1.

Consider two random vectors $\mathbf{X},\widetilde{\mathbf{X}}$ with probability density functions $p,\,q$ , respectively. Suppose that $p,q$ are non-zero everywhere. Then, considering the following variational representation of $d_{f}(P,Q)$ ,

d_{f}(P,Q)=\max_{D}\>\mathbb{E}[D(\mathbf{X})]-\mathbb{E}[f^{*}(D(\widetilde{\mathbf{X}}))],

(C.11)

the optimal solution $D^{*}$ will satisfy

D^{*}(\mathbf{x})=f^{\prime}\bigl{(}\frac{p(\mathbf{x})}{q(\mathbf{x})}\bigr{)}.

(C.12)

Proof.

Let us rewrite the $f$ -divergence’s variational representation as

\begin{aligned}d_{f}(P,Q)&=\max_{D}\>\mathbb{E}[D(\mathbf{X})]-\mathbb{E}[f^{*}(D(\widetilde{\mathbf{X}}))]\\&=\max_{D}\>\int\bigl[p(\mathbf{x})D(\mathbf{x})-q(\mathbf{x})f^{*}(D(\mathbf{x}))\bigr]\,\mathrm{d}\mathbf{x}\\&=\int\max_{D(\mathbf{x})}\bigl[p(\mathbf{x})D(\mathbf{x})-q(\mathbf{x})f^{*}(D(\mathbf{x}))\bigr]\,\mathrm{d}\mathbf{x},\end{aligned}

where the last equality holds, since the maximization objective decouples across $\mathbf{x}$ values. It can be seen that the inside optimization problem for each $D(\mathbf{x})$ is maximizing a concave objective in which by setting the derivative to zero we obtain

{f^{*}}^{\prime}(D^{*}(\mathbf{x}))=\frac{p(\mathbf{x})}{q(\mathbf{x})}.

(C.13)

As a property of the Fenchel-conjugate of a convex $f$ , we know ${{f^{*}}^{\prime}}^{-1}=f^{\prime}$ which combined with the above equation implies that

D^{*}(\mathbf{x})=f^{\prime}\bigl{(}\frac{p(\mathbf{x})}{q(\mathbf{x})}\bigr{)}.

(C.14)

The above result completes Lemma 1’s proof. ∎
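Lemma 1 can be sanity-checked for a concrete divergence. The sketch below assumes the KL generator $f(t)=t\log t$ , for which $f^{*}(u)=e^{u-1}$ and $f^{\prime}(t)=\log t+1$ . It maximizes the pointwise objective $p\,d-q\,f^{*}(d)$ by ternary search and compares the maximizer with $f^{\prime}(p/q)$ . The density values and helper names are hypothetical.

```python
import math

def f_star(u):
    """Fenchel conjugate of f(t) = t * log(t)."""
    return math.exp(u - 1.0)

def f_prime(t):
    """Derivative of f(t) = t * log(t)."""
    return math.log(t) + 1.0

def argmax_concave(objective, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def optimal_d(p, q):
    """Pointwise maximizer of p * d - q * f_star(d), as in the proof of Lemma 1."""
    return argmax_concave(lambda d: p * d - q * f_star(d), -20.0, 20.0)

# Hypothetical density values (p(x), q(x)) at three points x.
density_pairs = [(0.2, 0.5), (0.5, 0.5), (1.3, 0.4)]
checks = [(optimal_d(p, q), f_prime(p / q)) for p, q in density_pairs]
```

The middle pair has $p=q$ , where the optimal value is $f^{\prime}(1)$ . This is the constant discriminator that appears in the proof of Proposition 1.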

Consider the $f$ -GAN problem with the generator function specified in the theorem:

\min_{\mathbf{W},\mathbf{u}:\,\|\mathbf{W}\|_{2}\leq 1}\>\max_{D}\>\mathbb{E}\bigl{[}D(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}f^{*}\bigl{(}D(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr{)}\bigr{]}.

(C.15)

Note that $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$ and $\mathbf{W}\mathbf{Z}+\mathbf{u}\sim\mathcal{N}(\mathbf{u},\mathbf{W}\mathbf{W}^{T})$ . If $\mathbf{W}$ were not full-rank, the maximized discriminator objective would be $+\infty$ , achieved by a $D$ assigning an infinite value to the points outside the rank-constrained support set of the generator $\mathbf{W}\mathbf{z}+\mathbf{u}$ . Such a $\mathbf{W}$ cannot be a solution to the $f$ -GAN problem, because we assume that the dimensions of $\mathbf{X}$ and $\mathbf{Z}$ match each other and hence there exists a full-rank $\mathbf{W}$ with a finite maximized objective, i.e., a finite $f$ -divergence value. Therefore, in a Nash equilibrium of the $f$ -GAN problem, the solution $\mathbf{W}$ must be full-rank and invertible.

Lemma 1 results in the following equation for the optimal discriminator $D^{*}_{\mathbf{W},\mathbf{u}}$ given generator parameters $\mathbf{W},\mathbf{u}$ :

\begin{aligned}D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x})&=f^{\prime}\biggl(\frac{\frac{1}{\sqrt{(2\pi\sigma^{2})^{k}}}\exp\bigl\{-\frac{1}{2\sigma^{2}}\|\mathbf{x}\|_{2}^{2}\bigr\}}{\frac{1}{\sqrt{(2\pi)^{k}\det(\mathbf{W}\mathbf{W}^{T})}}\exp\bigl\{-\frac{1}{2}\|(\mathbf{W}\mathbf{W}^{T})^{-1/2}(\mathbf{x}-\mathbf{u})\|_{2}^{2}\bigr\}}\biggr)\\&=f^{\prime}\biggl(\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}\exp\biggl\{\frac{1}{2}\mathbf{x}^{T}((\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I)\mathbf{x}-\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{x}+\frac{1}{2}\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{u}\biggr\}\biggr).\end{aligned}

As a result, the function $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\cdot))$ appearing in the $f$ -GAN’s minimax objective will be

f^{*}\bigl(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x})\bigr)\,=\,f^{*}\biggl(f^{\prime}\biggl(\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}\exp\biggl\{\frac{1}{2}\mathbf{x}^{T}((\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I)\mathbf{x}-\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{x}+\frac{1}{2}\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{u}\biggr\}\biggr)\biggr).

Claim: $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))$ is a strictly convex function of $\mathbf{x}$ .

To show this claim, note that the following expression is a strongly-convex quadratic function of $\mathbf{x}$ , since we have assumed that the spectral norm of $\mathbf{W}$ is bounded as ${\sigma}_{\max}(\mathbf{W})\leq 1<\sigma$ , so that every eigenvalue of $(\mathbf{W}\mathbf{W}^{T})^{-1}$ is at least $1>\sigma^{-2}$ :

\displaystyle\frac{1}{2}\mathbf{x}^{T}((\mathbf{W}\mathbf{W}^{T})^{-1}-\sigma^{-2}I)\mathbf{x}-\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{x}+\frac{1}{2}\mathbf{u}^{T}(\mathbf{W}\mathbf{W}^{T})^{-1}\mathbf{u}.

For simplicity, we denote the above strongly-convex function with $g(\mathbf{x})$ and define the function $h:\mathbb{R}\rightarrow\mathbb{R}$ as

\displaystyle h(y):={f^{*}}\bigl{(}{{f}^{\prime}}\bigl{(}\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}\times e^{y}\bigr{)}\bigr{)}.

According to the above definitions, $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))=h(g(\mathbf{x}))$ is the composition of $h$ and strongly-convex $g$ . Note that $h$ is a monotonically increasing function, since defining $c=\sqrt{\frac{\det(\mathbf{W}\mathbf{W}^{T})}{\sigma^{2k}}}>0$ we have

h^{\prime}(y)=(c\,e^{y})^{2}\,f^{\prime\prime}(c\,e^{y})\geq 0,

(C.16)

which follows from the equality

{f^{*}}(f^{\prime}(z)):=\sup_{u}\{uf^{\prime}(z)-f(u)\}\,=\,zf^{\prime}(z)-f(z)

that is a consequence of the definition of Fenchel-conjugate, implying that $\frac{\mathop{}\!\mathrm{d}{}{f^{*}}(f^{\prime}(z))}{\mathop{}\!\mathrm{d}{z}}=zf^{\prime\prime}(z)$ for the convex $f$ . Note that $h^{\prime}(y)>0$ holds everywhere, because $f$ is assumed to be strictly convex. This proves that $h$ is strictly increasing. Furthermore, $h$ is a convex function, because $h^{\prime}(y)$ is non-decreasing due to the assumption that $t^{2}f^{\prime\prime}(t)$ is non-decreasing over $t\in(0,+\infty)$ . As a result, $h$ is an increasing convex function.
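The derivative formula (C.16) can be checked numerically for specific generators. The sketch below is an illustration under assumptions: it uses the KL generator $f(t)=t\log t$ and the Pearson $\chi^{2}$ generator $f(t)=(t-1)^{2}$ , both of which satisfy the condition that $t^{2}f^{\prime\prime}(t)$ is non-decreasing on $(0,+\infty)$ . The dictionary `GENERATORS`, the helper names, and the value of $c$ are hypothetical.

```python
import math

# Each entry holds (f_prime, f_second, f_star) for a convex generator f.
GENERATORS = {
    # f(t) = t * log(t), with f*(u) = exp(u - 1)
    "kl": (lambda t: math.log(t) + 1.0,
           lambda t: 1.0 / t,
           lambda u: math.exp(u - 1.0)),
    # f(t) = (t - 1)**2, with f*(u) = u + u**2 / 4
    "pearson": (lambda t: 2.0 * (t - 1.0),
                lambda t: 2.0,
                lambda u: u + 0.25 * u * u),
}

def h(name, c, y):
    """h(y) = f*(f'(c * exp(y))), as defined in the proof."""
    f_prime, _, f_star = GENERATORS[name]
    return f_star(f_prime(c * math.exp(y)))

def h_prime_formula(name, c, y):
    """Right-hand side of (C.16): (c e^y)^2 * f''(c e^y)."""
    _, f_second, _ = GENERATORS[name]
    t = c * math.exp(y)
    return t * t * f_second(t)

def h_prime_numeric(name, c, y, eps=1e-5):
    """Central finite-difference derivative of h."""
    return (h(name, c, y + eps) - h(name, c, y - eps)) / (2.0 * eps)
```

For the KL generator, $h(y)=c\,e^{y}$ , and for the Pearson $\chi^{2}$ generator, $h(y)=c^{2}e^{2y}-1$ . Both are increasing and convex in $y$ , as the argument requires.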

Therefore, $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))=h(g(\mathbf{x}))$ is a composition of a strongly-convex $g$ and an increasing convex $h:\mathbb{R}\rightarrow\mathbb{R}$ . Therefore, as a well-known result in convex optimization [48], the claim is true and $f^{*}(D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}))$ is a strictly convex function of $\mathbf{x}$ .

We showed that the claim is true for every feasible $\mathbf{W},\mathbf{u}$ . Now, we prove that the pair $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ will not be a local Nash equilibrium for any feasible $\mathbf{W},\mathbf{u}$ . If the pair $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ was a local Nash equilibrium, $\mathbf{W},\mathbf{u}$ would be a local minimum for the following minimax objective where $D^{*}$ is fixed to be $D^{*}_{\mathbf{W},\mathbf{u}}$ :

\mathbb{E}\bigl{[}D^{*}(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}f^{*}\bigl{(}D^{*}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr{)}\bigr{]}.

(C.17)

However, as shown earlier, for any feasible $\mathbf{W},\mathbf{u}$ , $f^{*}\bigl{(}D^{*}(\mathbf{x})\bigr{)}$ is a strictly-convex function of $\mathbf{x}$ . Since $\mathbf{W}\mathbf{Z}+\mathbf{u}$ is affine in $\mathbf{u}$ , the expectation $\mathbb{E}\bigl{[}f^{*}\bigl{(}D^{*}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr{)}\bigr{]}$ is strictly convex in $\mathbf{u}$ , and hence (C.17) is a strictly-concave function of the variable $\mathbf{u}$ . Consequently, the objective has no local minima in the unconstrained variable $\mathbf{u}$ , which contradicts the local Nash equilibrium assumption. Therefore, a pair of the form $(G_{\mathbf{W},\mathbf{u}},D^{*}_{\mathbf{W},\mathbf{u}})$ cannot be a local Nash equilibrium in parameters $\mathbf{W},\mathbf{u}$ . It follows that the minimax problem has no pure Nash equilibrium solutions, since in a pure Nash equilibrium the discriminator is by definition optimal against the choice of generator.

Proof for W2GANs:

Consider the W2GAN problem with the assumed generator function:

\min_{\mathbf{W},\mathbf{u}:\,\|\mathbf{W}\|_{2}\leq 1}\>\max_{D\,\text{\rm$c$-concave}}\>\mathbb{E}\bigl{[}D(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}D^{c}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr{]},

(C.18)

where the c-transform is defined for the quadratic cost function $c(\mathbf{x},\mathbf{x}^{\prime})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2}$ . Similar to the $f$ -GAN case, define $D^{*}_{\mathbf{W},\mathbf{u}}$ to be the optimal discriminator for the generator function parameterized by $\mathbf{W},\mathbf{u}$ . Note that $\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$ and $\mathbf{W}\mathbf{Z}+\mathbf{u}\sim\mathcal{N}(\mathbf{u},\mathbf{W}\mathbf{W}^{T})$ .

According to Brenier’s theorem [49], the optimal transport from the Gaussian data distribution to the Gaussian generative model will be

\psi^{\operatorname{opt}}(\mathbf{x})=\mathbf{x}-\nabla_{\mathbf{x}}D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x}).

As a well-known result regarding the second-order optimal transport map between two Gaussian distributions, the optimal transport will be a linear transformation as $\psi^{\operatorname{opt}}(\mathbf{x})=\frac{1}{\sigma}(\mathbf{W}\mathbf{W}^{T})^{1/2}\mathbf{x}+\mathbf{u}$ . This result shows that

\nabla_{\mathbf{x}}D^{*}_{\mathbf{W},\mathbf{u}}(\mathbf{x})=\bigl{(}I-\frac{1}{\sigma}(\mathbf{W}\mathbf{W}^{T})^{1/2}\bigr{)}\mathbf{x}-\mathbf{u}.

(C.19)

Note that the c-transform for cost $c(\mathbf{x},\mathbf{x}^{\prime})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}_{2}$ satisfies $D^{c}(\mathbf{x})=(\frac{1}{2}\|\mathbf{x}\|^{2}-D(\mathbf{x}))^{*}-\frac{1}{2}\|\mathbf{x}\|^{2}$ where $g^{*}$ denotes $g$ ’s Fenchel-conjugate. For a general convex quadratic function $g(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}A\mathbf{x}+\mathbf{b}^{T}\mathbf{x}$ we have $g^{*}(\mathbf{x})=\frac{1}{2}(\mathbf{x}-\mathbf{b})^{T}A^{\dagger}(\mathbf{x}-\mathbf{b})$ where $A^{\dagger}$ denotes $A$ ’s Moore-Penrose pseudoinverse. Therefore, for the c-transform of the optimal discriminator we will have

\displaystyle\nabla_{\mathbf{x}}D^{*^{c}}_{\mathbf{W},\mathbf{u}}(\mathbf{x})=\,\bigl{(}\sigma((\mathbf{W}\mathbf{W}^{T})^{1/2})^{\dagger}-I\bigr{)}\mathbf{x}-\sigma((\mathbf{W}\mathbf{W}^{T})^{1/2})^{\dagger}\mathbf{u}.

Since every feasible $\mathbf{W}$ satisfies the bounded spectral norm condition as ${\sigma}_{\max}(\mathbf{W})\leq 1<\sigma$ , the optimal $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$ will be a quadratic function whose Hessian has at least one strictly positive eigenvalue along the principal eigenvector of $\mathbf{W}\mathbf{W}^{T}$ . The positive eigenvalue exists in general case where $\mathbf{Z}$ ’s dimension can be even smaller than $\mathbf{X}$ ’s dimension. If we had the stronger assumption that the two dimensions exactly match, similar to the f-GAN problem considered, then the pseudo-inverse $A^{\dagger}$ would be the same as the inverse $A^{-1}$ resulting in a strongly-convex quadratic $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$ . Nevertheless, as we prove here, the theorem’s result on W2GAN holds in the general case and does not necessarily require the same dimension between $\mathbf{X}$ and $\mathbf{Z}$ .
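The spectral claim above can be checked numerically. The following is a minimal sketch under stated assumptions: the helper names `pinv_sqrt` and `dc_hessian` and the example matrix are hypothetical and not part of the paper. It builds the Hessian $\sigma((\mathbf{W}\mathbf{W}^{T})^{1/2})^{\dagger}-I$ for a rank-deficient $\mathbf{W}$ with $\sigma_{\max}(\mathbf{W})\leq 1<\sigma$ and computes its eigenvalues.

```python
import numpy as np

def pinv_sqrt(M, tol=1e-10):
    """Pseudo-inverse of the PSD square root of M, via an eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    inv_root = np.zeros_like(vals)
    keep = vals > tol * vals.max()           # treat tiny eigenvalues as exact zeros
    inv_root[keep] = 1.0 / np.sqrt(vals[keep])
    return (vecs * inv_root) @ vecs.T

def dc_hessian(W, sigma):
    """Hessian sigma * ((W W^T)^{1/2})^+ - I of the c-transform of the optimal discriminator."""
    return sigma * pinv_sqrt(W @ W.T) - np.eye(W.shape[0])

# Hypothetical example: X is 3-dimensional and Z is 2-dimensional, so W W^T is singular.
W = np.array([[0.9, 0.0],
              [0.0, 0.5],
              [0.0, 0.0]])
sigma = 1.5                                   # sigma_max(W) = 0.9 <= 1 < sigma
hessian_eigs = np.linalg.eigvalsh(dc_hessian(W, sigma))
print(hessian_eigs)
```

In this example the direction outside the range of $\mathbf{W}$ contributes the eigenvalue $-1$ , while the principal eigenvector of $\mathbf{W}\mathbf{W}^{T}$ contributes the strictly positive eigenvalue $\sigma/0.9-1$ .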

Consider the W2GAN minimax objective for the pair $(G_{\mathbf{W,u}},D^{*})$ where $D^{*}$ is fixed to be the optimal $D^{*}_{\mathbf{W,u}}$ :

\mathbb{E}\bigl{[}D^{*}(\mathbf{X})\bigr{]}-\mathbb{E}\bigl{[}{D^{*}}^{c}(\mathbf{W}\mathbf{Z}+\mathbf{u})\bigr{]}.

(C.20)

If $(G_{\mathbf{W,u}},D^{*}_{\mathbf{W,u}})$ were a local Nash equilibrium, the variables $\mathbf{W,u}$ would provide a local minimum of the above objective. However, since $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$ was shown to be a quadratic function whose Hessian has a strictly positive eigenvalue, the above objective is strictly concave in $\mathbf{u}$ along the corresponding eigenvector and hence has no local minimum in the unconstrained variable $\mathbf{u}$ . Therefore, the minimax problem possesses no local Nash equilibrium of the form $(G_{\mathbf{W,u}},D^{*}_{\mathbf{W,u}})$ and therefore no pure Nash equilibrium solutions.

For the parameterized case with a quadratic discriminator $D_{A,\mathbf{b}}(\mathbf{x})=\mathbf{x}^{T}A\mathbf{x}+\mathbf{b}^{T}\mathbf{x}$ , first of all note that as shown in the proof the optimal discriminator $D^{*}_{\mathbf{W,u}}$ for any generator parameter $\mathbf{W,u}$ will be a $c$ -concave quadratic function. Therefore, the optimal solution for the discriminator does not change because of the new quadratic constraint. Furthermore, the discriminator optimization problem has a concave objective in parameters $A,\mathbf{b}$ . This is because the discriminator $D_{A,\mathbf{b}}(\mathbf{x})$ is a linear function in terms of $A,\mathbf{b}$ , and $D^{c}_{A,\mathbf{b}}(\mathbf{x})=\sup_{\mathbf{x}^{\prime}}D_{A,\mathbf{b}}(\mathbf{x}^{\prime})-c(\mathbf{x}^{\prime},\mathbf{x})$ is a convex function of $A,\mathbf{b}$ as the supremum of some affine functions is convex.

As a result, the discriminator optimization reduces to maximizing a concave objective of $A,\mathbf{b}$ constrained to a convex set $\{A:\,I-A\succcurlyeq 0\}$ which is equivalent to the $c$ -concave constraint on the quadratic $D_{A,\mathbf{b}}$ . Hence, any local solution to this optimization problem will also be a global solution. This result implies that any local Nash equilibrium for the new parameterized minimax problem will have the form $(G_{\mathbf{W,u}},D^{*}_{\mathbf{W,u}})$ , which as we have already shown does not exist under the theorem’s assumptions.

Proof for the 1-dimensional WGAN:

Consider the 1-dimensional Wasserstein GAN problem for the assumed linear generator function:

\min_{w,u:\,|w|\leq 1}\>\max_{\|D\|_{\operatorname{Lip}}\leq 1}\>\mathbb{E}[D(X)]-\mathbb{E}[D(wZ+u)].

(C.21)

The inner maximization problem can be rewritten as

\max_{\|D\|_{\operatorname{Lip}}\leq 1}\>\int\bigl{(}p_{X}(x)-p_{wZ+u}(x)\bigr{)}\,D(x)\mathop{}\!\mathrm{d}{x}.

(C.22)

Here we have

\displaystyle p_{X}(x)-p_{wZ+u}(x)\,=\,\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\bigl{\{}\frac{-1}{2\sigma^{2}}x^{2}\bigr{\}}-\frac{1}{\sqrt{2\pi w^{2}}}\exp\bigl{\{}\frac{-1}{2w^{2}}(x-u)^{2}\bigr{\}}

Since $|w|\leq 1<\sigma$ , it can be seen that the above difference will be positive everywhere except over an interval $(a_{1},a_{2})$ , where $a_{1},a_{2}$ are the two solutions to the quadratic equation:

\bigl{(}\frac{1}{w^{2}}-\frac{1}{{\sigma}^{2}}\bigr{)}x^{2}-2\frac{u}{w^{2}}x+\bigl{(}\frac{u^{2}}{w^{2}}-2\log(\frac{\sigma}{|w|})\bigr{)}=0.

(C.23)

Note that the above quadratic equation, obtained by equating the logarithms of the two densities, has two distinct solutions $a_{1}<a_{2}$ , since $|w|<\sigma$ and $\log(\frac{\sigma}{|w|})>0$ lead to the positive discriminant:

4\frac{u^{2}}{w^{2}{\sigma}^{2}}+8\log(\frac{\sigma}{|w|})\bigl{(}\frac{1}{w^{2}}-\frac{1}{{\sigma}^{2}}\bigr{)}\,>\,0.

(C.24)
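The sign pattern of $p_{X}(x)-p_{wZ+u}(x)$ can be illustrated with a short numerical sketch. The function names and the parameter values below are assumptions chosen for illustration. The coefficients are obtained by equating the logarithms of the two Gaussian densities directly, which gives the constant term $\frac{u^{2}}{w^{2}}-2\log(\frac{\sigma}{|w|})$ .

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def crossing_points(w, u, sigma):
    """Roots a1 < a2 of the quadratic obtained by equating the two log-densities."""
    a = 1.0 / w**2 - 1.0 / sigma**2
    b = -2.0 * u / w**2
    c = u**2 / w**2 - 2.0 * math.log(sigma / abs(w))
    root = math.sqrt(b * b - 4.0 * a * c)     # discriminant is positive when |w| < sigma
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)

def density_gap(x, w, u, sigma):
    """p_X(x) - p_{wZ+u}(x) for X ~ N(0, sigma^2) and wZ + u ~ N(u, w^2)."""
    return gaussian_pdf(x, 0.0, sigma) - gaussian_pdf(x, u, abs(w))

w, u, sigma = 0.8, 0.3, 1.5                   # hypothetical values with |w| <= 1 < sigma
a1, a2 = crossing_points(w, u, sigma)
print(a1, a2)
print(density_gap(0.5 * (a1 + a2), w, u, sigma))   # negative inside (a1, a2)
print(density_gap(a1 - 1.0, w, u, sigma))          # positive outside
```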

As the function $D$ in the maximization problem (C.22) is only constrained to be 1-Lipschitz, the optimal $D^{*}_{w,u}$ ’s slope must be equal to $-1$ over $(-\infty,a_{1}]$ and equal to $+1$ over $[a_{2},\infty)$ , in order to allow the maximum increase in the maximization objective. Over the interval $(a_{1},a_{2})$ , we claim that the optimal $D$ is a convex function, because otherwise its double Fenchel-conjugate $D^{**}$ , which is by definition convex, achieves a higher value.

First of all, note that the double Fenchel-conjugate $D^{**}$ does not differ from $D$ outside the $(a_{1},a_{2})$ interval, because $D^{**}$ is the largest convex function satisfying $D^{**}\leq D$ , and $D$ is $1$ -Lipschitz, taking its minimum derivative on $(-\infty,a_{1}]$ and its maximum derivative on $[a_{2},\infty)$ . Next, since $D^{**}$ lower-bounds $D$ , it yields an integral over $(a_{1},a_{2})$ that is no smaller, as $p_{X}(x)-p_{wZ+u}(x)$ takes negative values over $(a_{1},a_{2})$ . If $D$ is not convex, then $D^{**}$ is strictly smaller than $D$ somewhere in $(a_{1},a_{2})$ while matching $D$ over $(-\infty,a_{1}]\cup[a_{2},\infty)$ . Therefore, the convex $1$ -Lipschitz $D^{**}$ attains a strictly greater objective, which contradicts $D$ ’s optimality. This contradiction proves that the optimal discriminator $D^{*}_{w,u}$ is a convex function.

Therefore, for every feasible $|w|\leq 1$ , there exists an optimal solution $D^{*}_{w,u}$ for (C.22) that is a non-constant convex function. This result proves that the WGAN problem has no local Nash equilibria of the form $(G_{w,u},D^{*}_{w,u})$ , because if $(G_{w,u},D^{*}_{w,u})$ were a local Nash equilibrium then $w,u$ would be a local minimum of the following objective where $D^{*}$ is fixed to be $D^{*}_{w,u}$ :

\mathbb{E}[D^{*}(X)]-\mathbb{E}[D^{*}(wZ+u)].

(C.25)

However, the above objective is a non-constant concave function of the unconstrained variable $u$ and hence does not have a local minimum in $u$ . This shows that the WGAN problem does not have a Nash equilibrium and completes the proof for the WGAN case.

Remark.

Proof.

Proof for the W2GAN:

For the W2GAN case, note that if we repeat the same steps as in the proof of Theorem 1, we can show

\displaystyle\nabla_{\mathbf{x}}D^{*^{c}}_{\mathbf{W},\mathbf{u}}(\mathbf{x})=\,\bigl{(}\sigma(\mathbf{W}\mathbf{W}^{T})^{-1/2}-I\bigr{)}\mathbf{x}-\sigma(\mathbf{W}\mathbf{W}^{T})^{-1/2}\mathbf{u},

which shows that $D^{*^{c}}_{\mathbf{W},\mathbf{u}}$ is a concave quadratic function of $\mathbf{x}$ , since the assumptions imply that $(\mathbf{W}\mathbf{W}^{T})^{-1}\preccurlyeq\sigma^{-2}I$ . Here $\mathbf{W}$ is a full-rank square matrix, as its minimum singular value is assumed to be positive and $\mathbf{Z}$ has the same dimension as $\mathbf{X}$ .

We claim that for the feasible choice $\mathbf{W}^{*}=I$ and $\mathbf{u}^{*}=\mathbf{0}$ , the pair $(G_{\mathbf{W}^{*},\mathbf{u}^{*}},D^{*}_{\mathbf{W}^{*},\mathbf{u}^{*}})$ results in a Nash equilibrium of the minimax problem. Considering the definition of the optimal discriminator $D^{*}_{\mathbf{W}^{*},\mathbf{u}^{*}}$ , its optimality for $G_{\mathbf{W}^{*},\mathbf{u}^{*}}$ directly follows. Moreover, (C.2) implies that

D^{*^{c}}_{\mathbf{W}^{*},\mathbf{u}^{*}}(\mathbf{x})=\frac{\sigma-1}{2}\|\mathbf{x}\|_{2}^{2}.

(C.26)

As a result, fixing the above discriminator function the minimax objective will be

\mathbb{E}[D^{*}(\mathbf{X})]-\mathbb{E}\bigl{[}\,\frac{\sigma-1}{2}\,\|\mathbf{W}\mathbf{Z}+\mathbf{u}\|_{2}^{2}\bigr{]}\;=\;\mathbb{E}[D^{*}(\mathbf{X})]+\frac{1-\sigma}{2}\,\bigl{(}\|\mathbf{W}\|^{2}_{F}+\|\mathbf{u}\|^{2}_{2}\bigr{)},

which is minimized at $\mathbf{W}=I$ and $\mathbf{u}=\mathbf{0}$ over the specified feasible set, since the squared Frobenius norm $\|\mathbf{W}\|^{2}_{F}$ is the sum of the squares of $\mathbf{W}$ ’s singular values. Therefore, the claim holds and the choice $\mathbf{W}^{*}=I$ and $\mathbf{u}^{*}=\mathbf{0}$ results in the optimal solution and a Nash equilibrium.
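The identity $\mathbb{E}\bigl{[}\|\mathbf{W}\mathbf{Z}+\mathbf{u}\|_{2}^{2}\bigr{]}=\|\mathbf{W}\|^{2}_{F}+\|\mathbf{u}\|^{2}_{2}$ used in this step can be sanity-checked by simulation. The sketch below assumes a standard Gaussian $\mathbf{Z}$ , as in the theorem's setting; the matrix, the shift vector, the sample size, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def second_moment_mc(W, u, n_samples=400_000, seed=0):
    """Monte Carlo estimate of E||W Z + u||_2^2 for Z ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, W.shape[1]))
    samples = Z @ W.T + u
    return np.mean(np.sum(samples**2, axis=1))

def second_moment_closed_form(W, u):
    """||W||_F^2 + ||u||_2^2."""
    return np.sum(W**2) + np.sum(u**2)

W = np.array([[1.2, 0.3],
              [0.0, 1.5]])
u = np.array([0.5, -0.2])
mc_value = second_moment_mc(W, u)
exact_value = second_moment_closed_form(W, u)
print(mc_value, exact_value)
```

The two printed values should agree up to Monte Carlo error, which shrinks as the sample size grows.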

Proof for the 1-dimensional Wasserstein GAN:

Here we select the parameters $w^{*}=1$ , $u^{*}=0$ . We claim that the optimal discriminator function for this choice is the negative absolute value function $D^{*}(x)=-|x|$ . Note that the optimal 1-Lipschitz $D^{*}$ solves the following problem:

\max_{\|D\|_{\operatorname{Lip}}\leq 1}\>\int\bigl{(}\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{x^{2}}{2\sigma^{2}}}-\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}\bigr{)}D(x)\mathop{}\!\mathrm{d}{x}.

(C.27)

In the above objective given $\eta=\sqrt{\frac{2\sigma^{2}\log(1/\sigma)}{1-\sigma^{2}}}$ , the function $\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{x^{2}}{2\sigma^{2}}}-\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}$ is positive over $(-\eta,\eta)$ and negative elsewhere. Therefore, the optimal $D^{*}$ should get the maximum $+1$ derivative over $(-\infty,-\eta]$ and the minimum $-1$ derivative over $[+\eta,+\infty)$ . Because of the even structure of $\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{x^{2}}{2\sigma^{2}}}-\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}$ , there exists an even optimal $D^{*}$ because $\frac{D^{*}(x)+D^{*}(-x)}{2}$ remains $1$ -Lipschitz and optimal for any optimal $1$ -Lipschitz discriminator $D^{*}$ . The optimal even $D^{*}$ should further be continuous as a $1$ -Lipschitz function, implying that such a $D^{*}$ is decreasing over $(0,\eta]$ and increasing over $[-\eta,0)$ . Enforcing the maximum derivative over the two intervals results in the optimal $D^{*}(x)=-|x|$ .

Therefore, $D^{*}(x)=-|x|$ provides an optimal discriminator for $w^{*}=1,\,u^{*}=0$ . Also, for this $D^{*}$ the minimax objective of the Wasserstein GAN will be

\displaystyle\mathbb{E}[-|X|]-\mathbb{E}[-|wZ+u|]\,\,=-\mathbb{E}[|X|]+\mathbb{E}[|wZ+u|]

In the above equation, $wZ+u\sim\mathcal{N}(u,w^{2})$ , showing that the above objective is minimized at $w^{*}=1,u=0$ considering the assumed feasible set where $|w|\geq 1$ . As a result, the pair $(G_{w^{*},u^{*}},D^{*})$ provides a Nash equilibrium to the WGAN minimax game. ∎
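The minimization of $\mathbb{E}[|wZ+u|]$ over the feasible set can be illustrated with the closed-form mean of a folded normal distribution. This is a sketch under stated assumptions: the grid ranges and the function name `mean_abs_gaussian` are hypothetical, and the search only covers $w\in[1,3]$ and $u\in[-2,2]$ .

```python
import math

def mean_abs_gaussian(u, w):
    """E|wZ + u| for Z ~ N(0, 1): the mean of a folded normal distribution."""
    s = abs(w)
    tail = math.exp(-u * u / (2.0 * s * s))
    return s * math.sqrt(2.0 / math.pi) * tail + u * math.erf(u / (s * math.sqrt(2.0)))

# Hypothetical grid over part of the feasible set |w| >= 1.
w_grid = [1.0 + 0.1 * i for i in range(21)]
u_grid = [-2.0 + 0.1 * j for j in range(41)]
best_value, best_w, best_u = min(
    (mean_abs_gaussian(u, w), w, u) for w in w_grid for u in u_grid
)
print(best_value, best_w, best_u)
```

On this grid the smallest value is attained at $w=1$ and $u=0$ , where the objective equals $\sqrt{2/\pi}$ .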

C.3 Proof of Proposition 2

Proof of the $\Rightarrow$ direction:

Assume that $(G^{*},D^{*})$ is a $\lambda$ -proximal equilibrium. According to the definition of the proximal equilibrium, the following holds for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ :

V^{\operatorname{prox}}_{\lambda}(G^{*},D)\,\leq\,V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})\,\leq\,V^{\operatorname{prox}}_{\lambda}(G,D^{*}).

(C.28)

Claim: $V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})=V(G^{*},D^{*})$ .

To show this claim, note that

V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*}):=\max_{\widetilde{D}\in\mathcal{D}}\>V(G^{*},\widetilde{D})-\frac{\lambda}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}.

(C.29)

In this optimization, the optimal solution $\tilde{D}$ is $D^{*}$ itself. Otherwise, for the optimal $\tilde{D}\in\mathcal{D}$ we have $\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}>0$ and as a result

V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})<V(G^{*},\tilde{D})\leq V^{\operatorname{prox}}_{\lambda}(G^{*},\tilde{D}),

(C.30)

which is a contradiction given that $(G^{*},D^{*})$ is a $\lambda$ -proximal equilibrium. Therefore, $D^{*}$ optimizes the proximal optimization, which shows the claim is valid and we have $V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})=V(G^{*},D^{*})$ . Knowing that $V(G,D)\leq V^{\operatorname{prox}}_{\lambda}(G,D)$ holds for every $G\in\mathcal{G},D\in\mathcal{D}$ , we have

V(G^{*},D)\,\leq\,V^{\operatorname{prox}}_{\lambda}(G^{*},D)\,\leq\,V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})\,=\,V(G^{*},D^{*}).

Furthermore,

\begin{aligned}
V(G^{*},D^{*})\,&=\,V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})&&\text{(C.31)}\\
&\leq\,V^{\operatorname{prox}}_{\lambda}(G,D^{*})&&\text{(C.32)}\\
&=\,\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\frac{\lambda}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}.
\end{aligned}

Therefore, the proof is complete.

Proof of the $\Leftarrow$ direction:

Suppose that for $(G^{*},D^{*})$ the following holds for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ :

\displaystyle V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\frac{\lambda}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}.

(C.33)

We claim that $V(G^{*},D^{*})=V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})$ . To show this claim, consider the definition of the $\lambda$ -proximal equilibrium:

V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*}):=\max_{\tilde{D}\in\mathcal{D}}\,V(G^{*},\tilde{D})-\frac{\lambda}{2}\big{\|}D^{*}-\tilde{D}\big{\|}^{2}.

(C.34)

Here $D^{*}$ maximizes the objective because we have assumed that $V(G^{*},D)\leq V(G^{*},D^{*})$ holds for every $D\in\mathcal{D}$ . Therefore, the claim is valid and $V(G^{*},D^{*})=V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})$ .

Also, note that for every $D$ the solution $\tilde{D}$ in the proximal optimization satisfies $V^{\operatorname{prox}}_{\lambda}(G^{*},D)\leq V(G^{*},\tilde{D})$ . Combining these results with (C.33), we obtain the following inequalities which hold for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ :

V^{\operatorname{prox}}_{\lambda}(G^{*},D)\,\leq\,V^{\operatorname{prox}}_{\lambda}(G^{*},D^{*})\,\leq\,V^{\operatorname{prox}}_{\lambda}(G,D^{*}).

(C.35)

The above inequalities show that the pair $(G^{*},D^{*})$ is a $\lambda$ -proximal equilibrium.

C.4 Proof of Proposition 3

Consider a $\lambda_{2}$ -proximal equilibrium $(G^{*},D^{*})$ . As a result of Proposition 2, for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$ we have

\displaystyle V(G^{*},D)\,\leq\,V(G^{*},D^{*})\leq\,\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\frac{\lambda_{2}}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}.

Since $\lambda_{1}\leq\lambda_{2}$ , the following holds

\displaystyle\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\frac{\lambda_{2}}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}\,\leq\,\max_{\widetilde{D}\in\mathcal{D}}\>V(G,\widetilde{D})-\frac{\lambda_{1}}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2},

which shows that

\displaystyle V(G^{*},D)\,\leq\,V(G^{*},D^{*})\,\leq\,\max_{\widetilde{D}}\>V(G,\widetilde{D})-\frac{\lambda_{1}}{2}\bigl{\|}\widetilde{D}-D^{*}\bigr{\|}^{2}.

Due to Proposition 2, $(G^{*},D^{*})$ will be a $\lambda_{1}$ -proximal equilibrium as well. Hence, the proof is complete and we have

\operatorname{PE}_{\lambda_{2}}(V)\subseteq\operatorname{PE}_{\lambda_{1}}(V).

(C.36)
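The inclusion (C.36) can be illustrated on a small discretized game. The sketch below is only an illustration: the toy objective, the grids, and the tolerance are assumptions. It uses the characterization from Proposition 2, with the proximal term taken as the squared distance between scalar discriminator parameters, and enumerates the grid points that satisfy it for two values of $\lambda$ .

```python
import numpy as np

g_grid = np.linspace(-1.0, 1.0, 21)   # hypothetical generator parameters
d_grid = np.linspace(-2.0, 2.0, 81)   # hypothetical discriminator parameters
TOL = 1e-12

def V(g, d):
    """Toy objective (an assumption): concave in g, so g has no interior local minimum."""
    return -0.5 * g**2 + 2.0 * g * d - 0.5 * d**2

def V_prox(g, d, lam):
    """Grid version of max over d_tilde of V(g, d_tilde) - lam/2 * (d_tilde - d)^2."""
    return np.max(V(g, d_grid) - 0.5 * lam * (d_grid - d) ** 2)

def proximal_equilibria(lam):
    """Index pairs (i, j) with V(g_i, d) <= V(g_i, d_j) <= V_prox(g, d_j, lam) on the grid."""
    found = set()
    for i, g_star in enumerate(g_grid):
        for j, d_star in enumerate(d_grid):
            v = V(g_star, d_star)
            if np.any(V(g_star, d_grid) > v + TOL):
                continue                      # d_star is not a best response to g_star
            if all(v <= V_prox(g, d_star, lam) + TOL for g in g_grid):
                found.add((i, j))
    return found

pe_small = proximal_equilibria(1.0)           # lambda_1
pe_large = proximal_equilibria(5.0)           # lambda_2 >= lambda_1
origin = (int(np.argmin(np.abs(g_grid))), int(np.argmin(np.abs(d_grid))))
print(pe_large <= pe_small, origin in pe_small, origin in pe_large)
```

In this toy game the origin is not a Nash equilibrium, since $V(g,0)=-\frac{1}{2}g^{2}$ is maximized rather than minimized at $g=0$ . On the grid it satisfies the proximal condition for $\lambda=1$ but not for $\lambda=5$ , so the inclusion is strict here.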

C.5 Proof of Proposition 4

Consider the definition of a $\lambda$ -proximal equilibrium in the parameterized space:

V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}}):=\max_{\widetilde{\mathbf{w}}}\,V(G_{\bm{\theta}},D_{\widetilde{\mathbf{w}}})-\frac{\lambda}{2}\bigl{\|}D_{\widetilde{\mathbf{w}}}-D_{\mathbf{w}}\bigr{\|}^{2}.

(C.37)

In the above optimization problem, the first term $V(G_{\bm{\theta}},D_{\widetilde{\mathbf{w}}})$ is assumed to be $\eta_{2}$ -smooth in $\widetilde{\mathbf{w}}$ , while the second term $\frac{\lambda}{2}\|D_{\widetilde{\mathbf{w}}}-D_{\mathbf{w}}\|^{2}$ is $\frac{\lambda}{2}\eta_{1}$ -strongly convex in $\widetilde{\mathbf{w}}$ . As a result, the objective, i.e., the first term minus the second, is $(\frac{\lambda\eta_{1}}{2}-\eta_{2})$ -strongly concave if $\eta_{2}<\frac{\lambda\eta_{1}}{2}$ holds. Since the objective function is strongly concave in $\widetilde{\mathbf{w}}$ , it is maximized by a unique solution $\mathbf{w}^{*}$ . Moreover, Danskin’s theorem [50] implies that the following holds at the optimal $\mathbf{w}^{*}$ :

\nabla_{\theta}V_{\lambda}^{\operatorname{prox}}(G_{\bm{\theta}},D_{\mathbf{w}})=\nabla_{\theta}V(G_{\bm{\theta}},D_{\mathbf{w}^{*}}).

(C.38)
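The gradient identity (C.38) can be checked on a scalar toy problem. In the sketch below, the objective $V(\theta,w)=\theta w+\frac{1}{2}w^{2}$ , the proximal term $\frac{\lambda}{2}(\widetilde{w}-w)^{2}$ , and all function names are assumptions made for illustration. For this choice $\eta_{2}=1$ and $\eta_{1}=2$ , so the strong-concavity condition reads $\lambda>1$ .

```python
def V(theta, w):
    """Toy objective (an assumption): 1-smooth and convex in w."""
    return theta * w + 0.5 * w**2

def prox_argmax(theta, w, lam):
    """Unique maximizer over wt of V(theta, wt) - lam/2 * (wt - w)^2; needs lam > 1."""
    if lam <= 1.0:
        raise ValueError("the proximal objective is strongly concave only for lam > 1")
    return (theta + lam * w) / (lam - 1.0)

def V_prox(theta, w, lam):
    """Proximal objective evaluated at its maximizer."""
    wt = prox_argmax(theta, w, lam)
    return V(theta, wt) - 0.5 * lam * (wt - w) ** 2

def grad_theta_fd(theta, w, lam, h=1e-5):
    """Central finite difference of V_prox with respect to theta."""
    return (V_prox(theta + h, w, lam) - V_prox(theta - h, w, lam)) / (2.0 * h)

theta, w, lam = 0.7, -0.4, 3.0
danskin_grad = prox_argmax(theta, w, lam)     # d/dtheta V(theta, w*) = w* for this V
fd_grad = grad_theta_fd(theta, w, lam)
print(danskin_grad, fd_grad)
```

The finite-difference gradient of the proximal objective matches the partial gradient of $V$ evaluated at the inner maximizer, so no differentiation through the inner maximization is needed.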

C.6 Proof of Theorem 2

Lemma 2.

Suppose that $f$ is a $\gamma$ -strongly convex function according to norm $\|\cdot\|$ , i.e. for any $x,y\in\operatorname{dom}(f)$ and $\lambda\in[0,1]$ we have

\displaystyle f(\lambda x+(1-\lambda)y)\,\leq\,\lambda f(x)+(1-\lambda)f(y)-\frac{\gamma\lambda(1-\lambda)}{2}\|x-y\|^{2}.

(C.39)

Consider the following optimization problem where the feasible set $\mathcal{X}$ is a convex set and $x^{*}$ is the optimal solution,

\min_{x\in\mathcal{X}}\,f(x).

(C.40)

Then, for every $x\in\mathcal{X}$ we have

f(x)-f(x^{*})\geq\frac{\gamma}{2}\|x-x^{*}\|^{2}.

(C.41)

Proof.

If we apply the strong-convexity definition (C.39) to $x\in\mathcal{X},\,x^{*}$ we obtain

\displaystyle f(\lambda x+(1-\lambda)x^{*})\,\leq\,\lambda f(x)+(1-\lambda)f(x^{*})-\frac{\gamma\lambda(1-\lambda)}{2}\|x-x^{*}\|^{2}.

(C.42)

The above inequality results in

\displaystyle f(\lambda x+(1-\lambda)x^{*})-f(x^{*})\,\leq\,\lambda(f(x)-f(x^{*}))-\frac{\gamma\lambda(1-\lambda)}{2}\|x-x^{*}\|^{2}.

(C.43)

Note that $\mathcal{X}$ is assumed to be a convex set and therefore $\lambda x+(1-\lambda)x^{*}\in\mathcal{X}$ implying $f(x^{*})\leq f(\lambda x+(1-\lambda)x^{*})$ . As a result, for every $0\leq\lambda\leq 1$

\frac{\gamma\lambda(1-\lambda)}{2}\|x-x^{*}\|^{2}\,\leq\,\lambda(f(x)-f(x^{*})).

(C.44)

Thus for every $0<\lambda\leq 1$ we have

\frac{\gamma(1-\lambda)}{2}\|x-x^{*}\|^{2}\,\leq\,f(x)-f(x^{*}),

(C.45)

which, in the limit $\lambda\rightarrow 0^{+}$ , proves that $\frac{\gamma}{2}\|x-x^{*}\|^{2}\leq f(x)-f(x^{*})$ and completes Lemma 2’s proof. ∎
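A one-dimensional instance shows why Lemma 2 holds even when the constrained minimizer lies on the boundary of $\mathcal{X}$ , where the gradient does not vanish. The example below is an assumption chosen for illustration: $f(x)=(x-2)^{2}$ , which is $2$ -strongly convex, minimized over a discretization of $\mathcal{X}=[-1,1]$ .

```python
import numpy as np

gamma = 2.0                                # f(x) = (x - 2)^2 has second derivative 2
x_grid = np.linspace(-1.0, 1.0, 2001)      # feasible set X = [-1, 1], discretized

def f(x):
    return (x - 2.0) ** 2

x_star = x_grid[np.argmin(f(x_grid))]      # constrained minimizer, on the boundary
# Slack in the inequality f(x) - f(x*) >= gamma/2 * (x - x*)^2 at every grid point.
gap = f(x_grid) - f(x_star) - 0.5 * gamma * (x_grid - x_star) ** 2
print(x_star, gap.min())
```

Here the slack equals $2(1-x)$ , which is nonnegative on $[-1,1]$ and vanishes only at $x^{*}=1$ .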

Based on Proposition 2 and the definition of $D^{\bm{\theta}}$ , we only need to show that for the W2GAN’s objective, which we denote by $V(G,D)$ , the following holds for every $\bm{\theta}$ :

V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq\max_{D\in\mathcal{D}}\>V(G_{\bm{\theta}},D)-\frac{1}{2\eta}\bigl{\|}D-D^{\bm{\theta}^{*}}\bigr{\|}^{2}_{\dot{H}^{1}}.

(C.46)

To show the above inequality, it suffices to prove the following inequality

V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq V(G_{\bm{\theta}},D^{\bm{\theta}})-\frac{1}{2\eta}\bigl{\|}D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\bigr{\|}^{2}_{\dot{H}^{1}}.

(C.47)

Claim: For the W2GAN problem, we have

V(G_{\bm{\theta}},D^{\bm{\theta}})=\frac{1}{2\eta}\mathbb{E}\bigl{[}\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\bigr{]}.

To show this claim, note that according to the W2GAN’s formulation we have $V(G_{\bm{\theta}},D^{\bm{\theta}}):=W_{c}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})$ where $c(\mathbf{x},\mathbf{x}^{\prime})=\frac{\eta}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2}$ is the second-order cost function specified in the theorem. We start by proving this result for $\eta=1$ . In this case, the Brenier theorem [51] proves that the optimal transport map from the data variable $\mathbf{X}$ to the generative model $G_{\bm{\theta}}(\mathbf{Z})$ can be derived from the gradient of the optimal $D^{\bm{\theta}}$ as follows

\psi^{\operatorname{opt}}(\mathbf{x})=\mathbf{x}-\nabla D^{\bm{\theta}}(\mathbf{x}).

(C.48)

which plugged into the optimal transport objective $W_{c}(P,Q):=\inf_{\Pi(P,Q)}\mathbb{E}[c(\mathbf{X},\mathbf{X}^{\prime})]$ proves that

V(G_{\bm{\theta}},D^{\bm{\theta}}):=W_{c}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})=\mathbb{E}\bigl{[}\frac{1}{2}\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\bigr{]}.

The above equation proves the result holds for $\eta=1$ . For a general $\eta>0$ , note that applying a simple change of variable in the Kantorovich duality representation and solving the dual problem for $\tilde{D}(\mathbf{x})=\eta D(\mathbf{x})$ shows that $\psi^{\operatorname{opt}}(\mathbf{x})=\mathbf{x}-\frac{1}{\eta}\nabla\tilde{D}^{\bm{\theta}}(\mathbf{x})$ transports samples from the data domain to the generative model. This is due to the fact that after applying this change of variable the Kantorovich duality reduces to $\eta$ times the dual problem for $\eta=1$ . As a result, applying the transport map to the definition of the optimal transport cost shows that

V(G_{\bm{\theta}},D^{\bm{\theta}}):=W_{c}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})=\frac{1}{2\eta}\mathbb{E}\bigl{[}\|\nabla D^{\bm{\theta}}(\mathbf{X})\|^{2}_{2}\bigr{]},

proving the claim holds for any $\eta>0$ .
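The claim can be sanity-checked in one dimension, where the optimal map between Gaussians is known in closed form. The sketch below rests on assumptions that are not part of the proof: it takes $X\sim\mathcal{N}(0,\sigma^{2})$ and a generator output distributed as $\mathcal{N}(u,w^{2})$ , uses the monotone map $\psi(x)=\frac{|w|}{\sigma}x+u$ , and relies on the standard formula $W_{2}^{2}=u^{2}+(\sigma-|w|)^{2}$ for one-dimensional Gaussians. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(func, std, order=20):
    """E[func(X)] for X ~ N(0, std^2), by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(order)       # weight function exp(-x^2 / 2)
    return np.sum(weights * func(std * nodes)) / np.sqrt(2.0 * np.pi)

def w2gan_value_from_gradient(sigma, w, u, eta):
    """(1 / (2 eta)) * E|grad D(X)|^2 with grad D(x) = eta * (x - psi(x))."""
    psi = lambda x: (abs(w) / sigma) * x + u     # monotone map N(0, sigma^2) -> N(u, w^2)
    grad_D = lambda x: eta * (x - psi(x))
    return gaussian_expectation(lambda x: grad_D(x) ** 2, sigma) / (2.0 * eta)

def transport_cost_closed_form(sigma, w, u, eta):
    """(eta / 2) * W_2^2 between N(0, sigma^2) and N(u, w^2)."""
    return 0.5 * eta * (u**2 + (sigma - abs(w)) ** 2)

sigma, w, u, eta = 1.5, 0.8, 0.3, 2.0            # hypothetical values
value_from_gradient = w2gan_value_from_gradient(sigma, w, u, eta)
value_closed_form = transport_cost_closed_form(sigma, w, u, eta)
print(value_from_gradient, value_closed_form)
```

Both expressions give the same transport cost for the cost function $c(x,x^{\prime})=\frac{\eta}{2}|x-x^{\prime}|^{2}$ in this Gaussian example.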

Substituting the discriminator maximization with the result in the above claim, the W2GAN problem reduces to the following problem:

\min_{\bm{\theta}}\>\frac{1}{2\eta}\mathbb{E}\bigl{[}\big{\|}\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}^{2}_{2}\bigr{]}.

(C.49)

Here we can equivalently optimize for $D^{\bm{\theta}}\in\{D^{\bm{\theta}}:\,\theta\in\Theta\}$ instead of minimizing over the variable $\bm{\theta}$ , obtaining

\min_{D^{\bm{\theta}}}\>\frac{1}{2\eta}\mathbb{E}\bigl{[}\big{\|}\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}^{2}_{2}\bigr{]}.

(C.50)

Note that the term $\frac{1}{2}\mathbb{E}\bigl{[}\big{\|}\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}^{2}_{2}\bigr{]}=\frac{1}{2}\big{\|}D^{\bm{\theta}}\big{\|}^{2}_{\dot{H}^{1}}$ is half the squared Sobolev norm defined in a semi-Hilbert space, and is therefore a $1$ -strongly convex function with respect to $\|\cdot\|_{\dot{H}^{1}}$ , with strong convexity defined as in (C.39). As a result, the objective in (C.50) is $\frac{1}{\eta}$ -strongly convex according to the Sobolev norm $\|\cdot\|_{\dot{H}^{1}}$ . In addition, this objective is minimized over a convex feasible set $\{D^{\bm{\theta}}:\,\theta\in\Theta\}$ , due to the theorem’s assumption. Therefore, Lemma 2 shows that the optimal $D^{\bm{\theta}^{*}}$ satisfies the following inequality for every $\bm{\theta}$ :

\displaystyle\frac{1}{2\eta}\mathbb{E}\bigl{[}\big{\|}\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}^{2}_{2}\bigr{]}-\frac{1}{2\eta}\mathbb{E}\bigl{[}\big{\|}\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\big{\|}^{2}_{2}\bigr{]}\>\geq\>\frac{1}{2\eta}\big{\|}D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\big{\|}^{2}_{\dot{H}^{1}}.

The above result implies that

V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq V(G_{\bm{\theta}},D^{\bm{\theta}})-\frac{1}{2\eta}\bigl{\|}D^{\bm{\theta}}-D^{\bm{\theta}^{*}}\bigr{\|}^{2}_{\dot{H}^{1}},

(C.51)

which completes the proof.

C.7 Proof of Theorem 3

Lemma 3.

Consider two vectors $\mathbf{x},\mathbf{y}$ with equal Euclidean norms $\|\mathbf{x}\|_{2}=\|\mathbf{y}\|_{2}$ . Then for every $0\leq a\leq b$ , we have

a\|\mathbf{x}-\mathbf{y}\|_{2}\leq\|a\mathbf{x}-b\mathbf{y}\|_{2}.

(C.52)

Proof.

Note that

\begin{aligned}
\|a\mathbf{x}-b\mathbf{y}\|^{2}_{2}-a^{2}\|\mathbf{x}-\mathbf{y}\|^{2}_{2}\>&=\>(b^{2}-a^{2})\|\mathbf{y}\|^{2}_{2}-2a(b-a)\mathbf{x}^{T}\mathbf{y}\\
&=\>(b-a)((b+a)\|\mathbf{y}\|^{2}_{2}-2a\mathbf{x}^{T}\mathbf{y})\\
&\geq\;0.
\end{aligned}

The above holds as we have assumed that $0\leq a\leq b$ implying $0\leq 2a\leq b+a$ and since the two vectors $\mathbf{x},\mathbf{y}$ share the same Euclidean norm

|\mathbf{x}^{T}\mathbf{y}|\leq\frac{\|\mathbf{x}\|_{2}^{2}+\|\mathbf{y}\|_{2}^{2}}{2}=\|\mathbf{y}\|_{2}^{2}.

Hence, Lemma 3’s proof is complete. ∎
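Lemma 3 can also be checked numerically on random inputs. The sketch below is illustrative only; the dimension, the sampling ranges, and the function name `lemma3_gap` are assumptions.

```python
import numpy as np

def lemma3_gap(x, y, a, b):
    """||a x - b y||_2 - a ||x - y||_2, which Lemma 3 states is nonnegative."""
    return np.linalg.norm(a * x - b * y) - a * np.linalg.norm(x - y)

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    y *= np.linalg.norm(x) / np.linalg.norm(y)       # enforce equal Euclidean norms
    a, b = np.sort(rng.uniform(0.0, 3.0, size=2))    # draw 0 <= a <= b
    gaps.append(lemma3_gap(x, y, a, b))
min_gap = min(gaps)
print(min_gap)
```

The equal-norm assumption matters: without it, taking $\mathbf{y}=\mathbf{0}$ and $a<b$ still satisfies the bound, but a short $\mathbf{y}$ nearly parallel to $\mathbf{x}$ can violate it.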

As shown by the Kantorovich duality [49], for the optimal $D^{\bm{\theta}}$ and the optimal coupling $\pi_{\bm{\theta}}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})$ the following holds with probability $1$ for every joint sample $(\mathbf{X},\mathbf{X}^{\prime})$ drawn from the optimal coupling $\pi_{\bm{\theta}}$ ,

D^{\bm{\theta}}(\mathbf{X})-D^{\bm{\theta}}(\mathbf{X}^{\prime})=\|\mathbf{X}-\mathbf{X}^{\prime}\|_{2}.

(C.53)

Knowing that $D^{\bm{\theta}}$ is 1-Lipschitz, for every convex combination $\beta\mathbf{X}+(1-\beta)\mathbf{X}^{\prime}$ we must have

\nabla D^{\bm{\theta}}\bigl{(}\beta\mathbf{X}+(1-\beta)\mathbf{X}^{\prime}\bigr{)}=\frac{\mathbf{X}-\mathbf{X}^{\prime}}{\|\mathbf{X}-\mathbf{X}^{\prime}\|}.

This will imply that there definitely exists $\alpha_{\bm{\theta}}$ such that the transport map described in the theorem maps the data distribution to the generative model. Plugging this transport map into the definition of the first-order Wasserstein distance, we obtain

\begin{aligned}
V(G_{\bm{\theta}},D^{\bm{\theta}}):&=W_{1}(P_{\mathbf{X}},P_{G_{\bm{\theta}}(\mathbf{Z})})\\
&=\mathbb{E}\bigl{[}\big{\|}{\alpha}^{2}_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}_{2}\bigr{]}\\
&=\mathbb{E}\bigl{[}\big{\|}\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}_{2}^{2}\bigr{]},
\end{aligned}

where the last equality holds since $\nabla D^{\bm{\theta}}(\mathbf{X})$ has unit Euclidean norm with probability $1$ over the data distribution $P_{\mathbf{X}}$ , as we proved that $\nabla D^{\bm{\theta}}(\beta\mathbf{X}+(1-\beta)\mathbf{X}^{\prime})=\frac{\mathbf{X}-\mathbf{X}^{\prime}}{\|\mathbf{X}-\mathbf{X}^{\prime}\|}$ holds for every $0\leq\beta\leq 1$ , including $\beta=1$ .

As a result, the Wasserstein GAN problem reduces to the following optimization problem

\min_{\bm{\theta}}\>\mathbb{E}\bigl{[}\big{\|}\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}_{2}^{2}\bigr{]}

(C.54)

Defining $h_{\bm{\theta}}(\mathbf{X}):=\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})$ , $\frac{1}{2}\mathbb{E}\bigl{[}\|h_{\bm{\theta}}(\mathbf{X})\|_{2}^{2}\bigr{]}$ is $1$ -strongly convex with respect to the norm function

\|h\|_{\dot{H}^{0}}=\sqrt{\mathbb{E}\bigl{[}\,h^{2}(\mathbf{X})\,\bigr{]}}

that is induced by the following inner product and results in a Hilbert space

\langle D_{1},D_{2}\rangle_{\dot{H}^{0}}:=\mathbb{E}_{P_{\mathbf{X}}}[D_{1}(\mathbf{X})D_{2}(\mathbf{X})].

Therefore, for the $\bm{\theta}^{*}$ minimizing the objective in (C.54) over the assumed convex set $\{h_{\bm{\theta}}:\,\bm{\theta}\in\Theta\}$ , Lemma 2 implies that

\begin{aligned}
\mathbb{E}\bigl{[}\big{\|}\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})\big{\|}_{2}^{2}\bigr{]}-\mathbb{E}\bigl{[}\big{\|}\alpha_{\bm{\theta}^{*}}(\mathbf{X})\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\big{\|}_{2}^{2}\bigr{]}\>&=\>\mathbb{E}\bigl{[}\big{\|}h_{\bm{\theta}}(\mathbf{X})\big{\|}_{2}^{2}\bigr{]}-\mathbb{E}\bigl{[}\big{\|}h_{\bm{\theta}^{*}}(\mathbf{X})\big{\|}_{2}^{2}\bigr{]}\\
&\geq\>\|h_{\bm{\theta}}-h_{\bm{\theta}^{*}}\|_{\dot{H}^{0}}^{2}\\
&=\>\mathbb{E}\bigl{[}\|h_{\bm{\theta}}(\mathbf{X})-h_{\bm{\theta}^{*}}(\mathbf{X})\|_{2}^{2}\bigr{]}\\
&=\>\mathbb{E}\bigl{[}\|\alpha_{\bm{\theta}}(\mathbf{X})\nabla D^{\bm{\theta}}(\mathbf{X})-\alpha_{\bm{\theta}^{*}}(\mathbf{X})\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\|_{2}^{2}\bigr{]}\\
&\geq\>\frac{\eta}{2}\mathbb{E}\bigl{[}\|\nabla D^{\bm{\theta}}(\mathbf{X})-\nabla D^{\bm{\theta}^{*}}(\mathbf{X})\|_{2}^{2}\bigr{]}\\
&=\>\frac{\eta}{2}\|D_{\bm{\theta}}-D_{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}.
\end{aligned}

Here the last inequality follows from Lemma 3 since every $D^{\bm{\theta}}$ has a unit-norm gradient with probability $1$ according to the data distribution $P_{\mathbf{X}}$ . Therefore, we have proved that

V(G_{\bm{\theta}},D^{\bm{\theta}})-V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\geq\frac{\eta}{2}\|D_{\bm{\theta}}-D_{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}.

(C.55)

The above inequality results in the following for every feasible $\bm{\theta}$

V(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})\leq\max_{D\in\mathcal{D}}V(G_{\bm{\theta}},D)-\frac{\eta}{2}\|D-D_{\bm{\theta}^{*}}\|^{2}_{\dot{H}^{1}}.

(C.56)

Hence, according to Proposition 2, we have shown that the pair $(G_{\bm{\theta}^{*}},D^{\bm{\theta}^{*}})$ is an $\eta$ -proximal equilibrium with respect to the Sobolev norm $\|\cdot\|_{\dot{H}^{1}}$ .

