
Properties of f-divergences and f-GAN training

Matt Shannon
Google Research
mattshannon@google.com
(August 2020)
Abstract

In this technical report we describe some properties of f-divergences and f-GAN training. We present an elementary derivation of the f-divergence lower bounds which form the basis of f-GAN training. We derive informative but perhaps underappreciated properties of f-divergences and f-GAN training, including a gradient matching property and the fact that all f-divergences agree up to an overall scale factor on the divergence between nearby distributions. We provide detailed expressions for computing various common f-divergences and their variational lower bounds. Finally, based on our reformulation, we slightly generalize f-GAN training in a way that may improve its stability.

1 The family of f-divergences

We start by reviewing the definition of an f-divergence (or $\phi$-divergence) (Csiszár, 1967; Ali and Silvey, 1966) and establishing some basic properties.

1.1 Definition

Given a strictly convex, twice continuously differentiable function $f: \mathbb{R}_{>0} \to \mathbb{R}$, the $f$-divergence between probability distributions with densities $p$ and $q$ over $\mathbb{R}^{K}$ (most results also hold for discrete probability distributions) is defined as

D_{f}(p,q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx    (1)

For simplicity, we assume the probability distributions $p$ and $q$ are suitably nice, e.g. absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{K}$, with $p(x), q(x) > 0$ for all $x \in \mathbb{R}^{K}$, and with $p$ and $q$ continuously differentiable. These assumptions are discussed in §3.4. We refer to $f$ as the defining function of the divergence $D_{f}$.

There is some redundancy in the above definition. Adding an affine-linear term to the defining function just adds a constant to the divergence: if $g(u) = f(u) + a + bu$ for $a, b \in \mathbb{R}$ then $D_{g}(p,q) = D_{f}(p,q) + a + b$ for all $p$ and $q$. Typically we do not care about an overall additive shift and would regard $D_{f}$ and $D_{g}$ as essentially the same divergence. Thus we have identified two unnecessary degrees of freedom in the specification of the defining function, and these may be removed by fixing $f(1) = f'(1) = 0$. This results in no loss of generality since any defining function can be put in this form by adding a suitable affine-linear term. (This affine-linear term also does not affect the various bounds and finite sample approximations derived below, as long as the reparameterization trick is used for the generator gradient, as is standard practice. In the discrete case, or if another finite sample approximation such as naive REINFORCE is used, adding an affine-linear term to $f$ may affect the variance of the finite sample approximation of the generator gradient.) From here on we assume $f(1) = f'(1) = 0$ as part of our definition of an f-divergence. The choice $f(1) = 0$ ensures $D_{f}(p,q) = 0$ when $p = q$. The choice $f'(1) = 0$ has the ancillary benefit of making (1) non-negative even if $p$ and $q$ are positive functions that do not integrate to one.

There is also a multiplicative degree of freedom in the defining function that is often irrelevant: multiplying the defining function by a constant $k > 0$ just multiplies the divergence by the same constant. Removing this superficial source of variation makes different f-divergences easier to compare. We refer to an f-divergence as being canonical if $f''(1) = 1$. Any f-divergence can be made canonical by scaling appropriately. Intuitively, fixing $f''(1)$ corresponds to fixing the behavior of the f-divergence in the region $u \approx 1$, corresponding to $p \approx q$. This will be made precise below.
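To make the canonicalization concrete, here is a minimal numerical sketch (the helper names are ours, not from the report): it checks $f(1) = f'(1) = 0$ by finite differences and rescales a defining function so that $f''(1) = 1$, using the conventional Jensen-Shannon defining function as an example.

```python
import numpy as np

def canonicalize(f, eps=1e-5):
    """Check f(1) = f'(1) = 0 by finite differences and rescale so f''(1) = 1."""
    f1 = f(1.0)
    fp1 = (f(1.0 + eps) - f(1.0 - eps)) / (2 * eps)
    fpp1 = (f(1.0 + eps) - 2 * f(1.0) + f(1.0 - eps)) / eps**2
    assert abs(f1) < 1e-8 and abs(fp1) < 1e-6, "expected f(1) = f'(1) = 0"
    return lambda u: f(u) / fpp1   # same divergence up to scale, now canonical

# Example: the conventional Jensen-Shannon defining function has f''(1) = 1/4,
# so its canonical version is 4 times larger (the "4 JS" of section 2.3).
f_js = lambda u: (u * np.log(u) - (u + 1) * np.log(u + 1) + (u + 1) * np.log(2)) / 2
f_js_canonical = canonicalize(f_js)
print(f_js_canonical(2.0))   # approximately 4 * f_js(2.0)
```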

1.2 Properties

f-divergences satisfy several mathematical properties:

  • $D_{f}$ is linear in $f$.

  • $D_{f}(p,q) \geq 0$ for all distributions $p$ and $q$, with equality iff $p = q$. This justifies referring to $D_{f}$ as a divergence.

  • $D_{f}$ uniquely determines $f$.

  • All f-divergences agree up to an overall scale factor on the divergence between nearby distributions: if $p \approx q$ then $D_{f}(p,q) \approx f''(1) \operatorname{KL}(p \,\|\, q)$ to second order. In particular all canonical f-divergences agree when $p \approx q$.

  • For many common f-divergences, $f''$ has a simpler algebraic form than $f$ and is easier to work with.

Linearity is straightforward to verify. If $D_{f}$ and $D_{g}$ are two f-divergences and $k > 0$ then $D_{f+g} = D_{f} + D_{g}$ and $D_{kf} = k D_{f}$. If $f$ and $g$ are strictly convex then so are $f + g$ and $kf$, and $(f+g)(1) = f(1) + g(1) = 0$ and $(kf)(1) = k f(1) = 0$, and similarly for $f'$, so $D_{f+g}$ and $D_{kf}$ are valid f-divergences.

The non-negativity of $D_{f}$ follows from the convexity of $f$. Since $f(1) = f'(1) = 0$ and $f$ is strictly convex, $f(u) \geq 0$ with equality iff $u = 1$. Thus

\int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \geq 0    (2)

Thus $D_{f}(p,q) \geq 0$ for all $p$ and $q$. In general, if $\int g(x)\, dx = 0$ for a continuous, integrable, non-negative function $g: \mathbb{R}^{K} \to \mathbb{R}$ then $g(x) = 0$ for all $x$. Applying this to $g(x) = q(x) f(p(x)/q(x))$, we see that we have equality in (2) iff $p(x)/q(x) = 1$ for all $x$. Thus $D_{f}(p,q) = 0$ implies $p = q$. The non-negativity of $D_{f}$ can also be seen by plugging the constant function $u(x) = 1$ into (14).

We now show that $D_{f}$ completely determines $f$. Suppose $D_{f} = D_{g}$. We wish to show that $f = g$. Consider first the discrete case where $p$ and $q$ are distributions over a two-point set $\{0, 1\}$. Given $u > 1$, choose $p_{u}$ and $q_{u}$ such that $p_{u}(0)/q_{u}(0) = u$ and $p_{u}(1)/q_{u}(1) = \tfrac{1}{2}$. It is straightforward to show that such $p_{u}$ and $q_{u}$ exist and are given by

p_{u}(0) = \frac{u}{2u-1} \qquad q_{u}(0) = \frac{1}{2u-1}    (3)
p_{u}(1) = \frac{u-1}{2u-1} \qquad q_{u}(1) = \frac{2(u-1)}{2u-1}    (4)

Thus

(2u-1)\, D_{f}(p_{u}, q_{u}) = f(u) + 2(u-1)\, f(\tfrac{1}{2})    (5)

Subtracting the equivalent equation for $g$ and using $D_{f} = D_{g}$, we see that $f$ and $g$ differ only by an affine-linear term: $g(u) = f(u) + a(u-1)$ for all $u > 1$, for some $a \in \mathbb{R}$. Therefore $g'(u) = f'(u) + a$ for $u > 1$, and taking the limit as $u \to 1$ we have $g'(1) = f'(1) + a$. But $f'(1) = g'(1) = 0$, so $a = 0$, so $f(u) = g(u)$ for $u > 1$. A similar argument applies for $0 < u < 1$. In this case we choose $p_{u}(0)/q_{u}(0) = u$ and $p_{u}(1)/q_{u}(1) = 2$, and we have

p_{u}(0) = \frac{u}{2-u} \qquad q_{u}(0) = \frac{1}{2-u}    (6)
p_{u}(1) = \frac{2(1-u)}{2-u} \qquad q_{u}(1) = \frac{1-u}{2-u}    (7)

and

(2-u)\, D_{f}(p_{u}, q_{u}) = f(u) + (1-u)\, f(2)    (8)

From here we apply the same argument as above. Thus $f(u) = g(u)$ for $0 < u < 1$. Combining this with the above result for $u > 1$ and the fact that $f(1) = g(1)$, we have $f = g$ as desired. For the continuous case where $p$ and $q$ are densities over $\mathbb{R}^{K}$, we reduce to the discrete case by considering mixtures of Gaussians with shrinking covariances. Consider the $u > 1$ case. Fix any two distinct points $\mu_{0}, \mu_{1} \in \mathbb{R}^{K}$. Let $\tilde{p}_{u\sigma}(x) = p_{u}(0)\, \mathcal{N}(x; \mu_{0}, \sigma^{2} I) + p_{u}(1)\, \mathcal{N}(x; \mu_{1}, \sigma^{2} I)$ and $\tilde{q}_{u\sigma}(x) = q_{u}(0)\, \mathcal{N}(x; \mu_{0}, \sigma^{2} I) + q_{u}(1)\, \mathcal{N}(x; \mu_{1}, \sigma^{2} I)$, where $p_{u}$ and $q_{u}$ are as specified for the discrete $u > 1$ case above. As $\sigma \to 0$, $D_{f}(\tilde{p}_{u\sigma}, \tilde{q}_{u\sigma}) \to D_{f}(p_{u}, q_{u})$, so $(2u-1)\, D_{f}(\tilde{p}_{u\sigma}, \tilde{q}_{u\sigma}) \to f(u) + 2(u-1) f(\tfrac{1}{2})$, and similarly for $g$. But $D_{f} = D_{g}$, so we have the same quantity tending to two limits, so the limits must be equal. Thus $g(u) = f(u) + a(u-1)$ for some $a \in \mathbb{R}$, and the remainder of the argument is as above. A similar argument applies for $0 < u < 1$. Thus $f = g$ also holds in the continuous case.
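As a quick numerical sanity check of the two-point construction (3)-(5), the following sketch (our own illustration) evaluates both sides of (5) for the KL defining function $f(u) = u \log u - u + 1$:

```python
import numpy as np

# Numerical check of equation (5) for the KL defining function
# f(u) = u log u - u + 1, using the two-point distributions (3)-(4).
f = lambda u: u * np.log(u) - u + 1

def D_f_two_point(p, q):
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

for u in [1.5, 2.0, 5.0]:
    p = [u / (2 * u - 1), (u - 1) / (2 * u - 1)]
    q = [1 / (2 * u - 1), 2 * (u - 1) / (2 * u - 1)]
    lhs = (2 * u - 1) * D_f_two_point(p, q)
    rhs = f(u) + 2 * (u - 1) * f(0.5)
    print(u, lhs, rhs)   # lhs and rhs agree for each u
```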

Different f-divergences may behave very differently when $p$ and $q$ are far apart but are essentially identical when $q \approx p$. One way to make this precise is to consider a parametric family $\{q_{\lambda} : \lambda \in \Lambda\}$ of densities. A Taylor expansion of $D_{f}(q_{\lambda}, q_{\lambda + \varepsilon v})$ in $\varepsilon$ shows that

D_{f}(q_{\lambda}, q_{\lambda + \varepsilon v}) = \tfrac{1}{2} \varepsilon^{2} f''(1)\, v^{\mathsf{T}} F(\lambda)\, v + O(\varepsilon^{3})    (9)

where $\varepsilon \in \mathbb{R}$, $v$ is a direction in parameter space, and

F_{ij}(\lambda) = \int q_{\lambda}(x) \left(\frac{\partial}{\partial \lambda_{i}} \log q_{\lambda}(x)\right) \left(\frac{\partial}{\partial \lambda_{j}} \log q_{\lambda}(x)\right) dx    (10)

is the Fisher information matrix of the parametric family, also known as the Fisher metric. Thus all f-divergences agree up to a constant factor on the divergence between two nearby distributions, and in this regime they are all just scaled versions of the Fisher distance. Alternatively this may be stated in the non-parametric form

D_{f}(q, q + \varepsilon v) = \tfrac{1}{2} \varepsilon^{2} f''(1) \int \frac{(v(x))^{2}}{q(x)}\, dx + O(\varepsilon^{3})    (11)

where $v: \mathbb{R}^{K} \to \mathbb{R}$ satisfies $\int v(x)\, dx = 0$. Informally we may state this as

D_{f}(p,q) \approx \tfrac{1}{2} f''(1) \int \frac{(p(x) - q(x))^{2}}{p(x)}\, dx    (12)

Thus all f-divergences agree up to a constant factor on the divergence between nearby distributions.
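The following small sketch (our own illustration, for discrete distributions) shows several canonical f-divergences collapsing onto the common quadratic form of (11) as the perturbation becomes small:

```python
import numpy as np

# For nearby discrete distributions q and p = q + eps * v (with v summing to
# zero), every canonical f-divergence approaches the same quadratic form (11).
fs = {
    "KL":            lambda u: u * np.log(u) - u + 1,
    "reverse KL":    lambda u: -np.log(u) + u - 1,
    "Pearson chi2":  lambda u: 0.5 * (u - 1) ** 2,
    "sq. Hellinger": lambda u: 2 * (1 - np.sqrt(u)) ** 2,
}
q = np.array([0.2, 0.3, 0.5])
v = np.array([0.1, -0.05, -0.05])   # perturbation direction, sums to zero
eps = 1e-3
p = q + eps * v
quadratic = 0.5 * eps**2 * np.sum(v**2 / q)
for name, f in fs.items():
    D = np.sum(q * f(p / q))
    print(f"{name:14s} D = {D:.4e}   quadratic approximation = {quadratic:.4e}")
```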

The fact that $f''$ has a simpler algebraic form than $f$ for many common f-divergences will be seen when we consider specific f-divergences below. Considering $f''$ is natural for a number of reasons. Adding an affine-linear term to $f$ does not change $f''$. Thus even without the constraints $f(1) = f'(1) = 0$, $f''$ would not have the two unnecessary degrees of freedom mentioned above, and $D_{f}$ would uniquely determine $f''$. Strict convexity of $f$ corresponds to the simple condition $f''(u) > 0$ for all $u$. We will also see below that various gradients of $D_{f}$ and of its lower bound $E_{f}$ depend on $f$ only through $f''$. Considering $f''$ also provides a simple view of how various f-divergences are related. For example, three common symmetric divergences are the Jensen-Shannon, squared Hellinger and Jeffreys divergences. In the $f''$ domain, these may all be viewed as forms of average of KL and reverse KL, specifically as the harmonic mean, geometric mean and arithmetic mean respectively.

2 Variational divergence estimation

f-GANs are based on an elegant way to estimate the f-divergence between two distributions given only samples from those distributions (Nguyen et al., 2010). In this section we review this approach to variational divergence estimation. We provide a simple, easy-to-understand derivation involving elementary facts about convex functions. At the end of this section we discuss how our derivation and notation relate to those of the original f-GAN paper (Nowozin et al., 2016).

2.1 Variational lower bound

Figure 1: A convex function $f: \mathbb{R}_{>0} \to \mathbb{R}$ and a tangent line. The variational bound used by f-GANs is based on the fact that a convex function $f$ lies at or above its tangent lines.

We first derive a variational lower bound on the f-divergence $D_{f}$. Since $f$ is strictly convex, its graph lies at or above any of its tangent lines and touches each tangent line at only one point. That is, for $k, u > 0$,

f(k) \geq f(u) + (k-u)\, f'(u) = k f'(u) - \left[u f'(u) - f(u)\right]    (13)

with equality iff $k = u$. This inequality is illustrated in Figure 1. Substituting $p(x)/q(x)$ for $k$ and $u(x)$ for $u$, for any continuously differentiable function $u: \mathbb{R}^{K} \to \mathbb{R}_{>0}$ we obtain

D_{f}(p,q) \geq \int p(x)\, f'(u(x))\, dx - \int q(x) \left[u(x)\, f'(u(x)) - f(u(x))\right] dx    (14)

with equality iff $u = u^{*}$, where $u^{*}(x) = p(x)/q(x)$. The function $u$ is referred to as the critic. It will be helpful to have a concise notation for this bound. Writing $u(x) = \exp(d(x))$ without loss of generality, for any continuously differentiable function $d: \mathbb{R}^{K} \to \mathbb{R}$ we have

D_{f}(p,q) \geq E_{f}(p,q,d)    (15)

with equality iff $d = d^{*}$, where

E_{f}(p,q,d) = \int p(x)\, a_{f}(d(x))\, dx - \int q(x)\, b_{f}(d(x))\, dx    (16)
a_{f}(d) = f'(\exp(d))    (17)
b_{f}(d) = \exp(d)\, f'(\exp(d)) - f(\exp(d))    (18)
d^{*}(x) = \log p(x) - \log q(x)    (19)

Note that both $a_{f}$ and $b_{f}$ are linear in $f$. Their derivatives $a_{f}'(\log u) = u f''(u)$ and $b_{f}'(\log u) = u^{2} f''(u)$ depend on $f$ only through $f''$. Note also that $b_{f}'(d) = a_{f}'(d) \exp(d)$.
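As a numerical illustration that the bound (15) is tight at $d = d^{*}$, the following sketch (our own; it uses KL and two Gaussians, with the integrals approximated on a grid) compares $D_{f}$ computed from (1) with $E_{f}$ computed from (16):

```python
import numpy as np
from scipy.stats import norm

# Numerical check that the bound (15) is tight at d = d*, using KL
# (a_f(d) = d, b_f(d) = exp(d) - 1) and two 1D Gaussians p and q.
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=1.0, scale=1.5)
d_star = np.log(p) - np.log(q)       # the optimal critic (19)

f_kl = lambda u: u * np.log(u) - u + 1
a_f = lambda d: d
b_f = lambda d: np.exp(d) - 1.0

D_f = np.sum(q * f_kl(p / q)) * dx                                # definition (1)
E_f = (np.sum(p * a_f(d_star)) - np.sum(q * b_f(d_star))) * dx    # bound (16) at d*
print(D_f, E_f)   # both approx 0.3499, the closed-form KL between these Gaussians
```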

2.2 Formulation of variational divergence estimation

The bound (15) leads naturally to variational divergence estimation. The $f$-divergence between $p$ and $q$ can be estimated by maximizing $E_{f}$ with respect to $d$ (Nguyen et al., 2010). Conveniently, $E_{f}$ is expressed in terms of expectations and so may be approximately computed and maximized with respect to $d$ using only samples from $p$ and $q$. Ultimately this property derives from the fact that the tangent lines of $f(u)$ are affine-linear in $u$: the constant term leads to expectations with respect to $q$ and the linear term leads to expectations with respect to $p$. If we parameterize $d$ as a neural net $d_{\nu}$ with parameters $\nu$ then we can approximate the divergence by maximizing $E_{f}(p, q, d_{\nu})$ with respect to $\nu$. This does not compute the exact divergence, for several reasons: there is no guarantee that $d^{*}$ lies in the family $\{d_{\nu} : \nu\}$ of functions representable by the neural net; gradient-based optimization may find a local but not global maximum; and we have to be careful not to overfit given a finite set of samples from $p$ (typically $q$ is a model and so we have access to arbitrarily many samples). However, with sufficiently flexible neural nets and careful optimization we hope the approximation will be close.
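The following toy sketch (our own illustration; all names are ours) estimates $\operatorname{KL}(p \,\|\, q)$ for two Gaussians from samples alone, by gradient ascent on a Monte Carlo estimate of $E_{f}$ over a quadratic critic, a family which happens to contain the exact log density ratio in this case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples only: p = N(0, 1), q = N(0.5, 1); the true KL(p || q) is 0.5**2 / 2 = 0.125.
xp = rng.normal(0.0, 1.0, size=50_000)
xq = rng.normal(0.5, 1.0, size=50_000)

phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=1)   # critic features
w = np.zeros(3)                                                # critic d_w(x) = w . phi(x)
Pp, Pq = phi(xp), phi(xq)

# Gradient ascent on the Monte Carlo estimate of E_f for KL:
#   E_f ~= mean_p[d(x)] - mean_q[exp(d(x)) - 1]
for _ in range(5_000):
    grad = Pp.mean(axis=0) - (np.exp(Pq @ w)[:, None] * Pq).mean(axis=0)
    w += 0.05 * grad

E_f = (Pp @ w).mean() - (np.exp(Pq @ w) - 1.0).mean()
print(E_f)   # close to the true divergence 0.125
```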

2.3 Expressions for common f-divergences

In this section we express several common divergences in terms of (1) and the variational lower bound (16). For each f-divergence, we give explicit expressions for $f$, $f''$, $D_{f}$, $E_{f}$, $a_{f}$, $b_{f}$, $a_{f}'$ and $b_{f}'$. We also list the tail weights, which are related to the asymptotic behavior of $f''(u)$ as $u \to 0$ and $u \to \infty$ and which determine the key qualitative properties of an f-divergence (Shannon et al., 2020). We mention below how some of the divergences are related to each other by softening (Shannon et al., 2020). The p-softening of a divergence $D$ is the divergence $\tilde{D}$ given by $\tilde{D}(p,q) = 4 D(\tfrac{1}{2} p + \tfrac{1}{2} q,\, q)$. Similarly the q-softening is given by $\tilde{D}(p,q) = 4 D(p,\, \tfrac{1}{2} p + \tfrac{1}{2} q)$. The factor of $4$ here ensures that $\tilde{D}$ remains canonical (Shannon et al., 2020).

The Kullback-Leibler (KL) divergence (or I divergence or relative entropy) satisfies:

D_{f}(p,q) = \operatorname{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx    (20)
E_{f}(p,q,d) = 1 + \int p(x)\, d(x)\, dx - \int q(x) \exp(d(x))\, dx    (21)
f(u) = u \log u - u + 1    (22)
f''(u) = u^{-1}    (23)
a_{f}(d) = d    (24)
b_{f}(d) = \exp(d) - 1    (25)
a_{f}'(d) = 1    (26)
b_{f}'(d) = \exp(d)    (27)

The KL divergence has $(1, 2)$ tail weights. The KL divergence defining function is sometimes given as $f(u) = u \log u$. The additional affine-linear term in our expression for $f$ is due to the constraints $f(1) = f'(1) = 0$. We note in passing that this precisely corresponds to what is sometimes referred to as the generalized KL divergence $D(p,q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx - \int p(x)\, dx + \int q(x)\, dx$. This has the property that $D(p,q) \geq 0$ with equality iff $p = q$, even if we remove the constraint that $p$ and $q$ be valid densities that integrate to one.

The reverse KL divergence satisfies:

D_{f}(p,q) = \operatorname{KL}(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx    (28)
E_{f}(p,q,d) = 1 - \int p(x) \exp(-d(x))\, dx - \int q(x)\, d(x)\, dx    (29)
f(u) = -\log u + u - 1    (30)
f''(u) = u^{-2}    (31)
a_{f}(d) = 1 - \exp(-d)    (32)
b_{f}(d) = d    (33)
a_{f}'(d) = \exp(-d)    (34)
b_{f}'(d) = 1    (35)

The reverse KL divergence has $(2, 1)$ tail weights. Note the explicit symmetry between the representations of KL and reverse KL in terms of $a_{f}$ and $b_{f}$. Their symmetric relationship is less apparent from $f$ and $f''$. In general, if $D_{g}(p,q) = D_{f}(q,p)$ then $a_{g}(d) = -b_{f}(-d)$ and $b_{g}(d) = -a_{f}(-d)$.

The canonicalized Jensen-Shannon divergence (Lin, 1991) (or Jensen difference (Burbea and Rao, 1982) or capacitory discrimination (Topsoe, 2000)) satisfies:

D_{f}(p,q) = 4 \operatorname{JS}(p,q)    (36)
 = 2 \operatorname{KL}(p \,\|\, \tfrac{1}{2} p + \tfrac{1}{2} q) + 2 \operatorname{KL}(q \,\|\, \tfrac{1}{2} p + \tfrac{1}{2} q)    (37)
 = 4 \log 2 + 2 \int p(x) \log \frac{p(x)}{p(x)+q(x)}\, dx + 2 \int q(x) \log \frac{q(x)}{p(x)+q(x)}\, dx    (38)
E_{f}(p,q,d) = 4 \log 2 + 2 \int p(x) \log \sigma(d(x))\, dx + 2 \int q(x) \log \sigma(-d(x))\, dx    (39)
f(u) = 2 u \log u - 2 (u+1) \log(u+1) + 2 u \log 2 + 2 \log 2    (40)
f''(u) = \frac{2}{u(u+1)}    (41)
a_{f}(d) = 2 \log \sigma(d) + 2 \log 2    (42)
b_{f}(d) = -2 \log \sigma(-d) - 2 \log 2    (43)
a_{f}'(d) = 2 \sigma(-d)    (44)
b_{f}'(d) = 2 \sigma(d)    (45)

Here $\operatorname{JS}$ is the conventional, non-canonical Jensen-Shannon divergence, which has $f''(1) = \tfrac{1}{4}$, and $\sigma$ denotes the logistic sigmoid function. The Jensen-Shannon divergence has $(1, 1)$ tail weights. Its $f''(u)$ is the harmonic mean of $f''(u)$ for KL and $f''(u)$ for reverse KL. The square root of the Jensen-Shannon divergence defines a metric on the space of probability distributions (Endres and Schindelin, 2003; Österreicher and Vajda, 2003).

The canonicalized squared Hellinger distance (closely related to the Freeman-Tukey statistic for hypothesis testing) satisfies:

D_{f}(p,q) = 2 \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2} dx    (46)
 = 4 - 4 \int \sqrt{p(x) q(x)}\, dx    (47)
E_{f}(p,q,d) = 4 - 2 \int p(x) \exp\left(-\tfrac{1}{2} d(x)\right) dx - 2 \int q(x) \exp\left(\tfrac{1}{2} d(x)\right) dx    (48)
f(u) = 2 (1 - \sqrt{u})^{2}    (49)
f''(u) = u^{-\frac{3}{2}}    (50)
a_{f}(d) = 2 - 2 \exp\left(-\tfrac{1}{2} d\right)    (51)
b_{f}(d) = 2 \exp\left(\tfrac{1}{2} d\right) - 2    (52)
a_{f}'(d) = \exp\left(-\tfrac{1}{2} d\right)    (53)
b_{f}'(d) = \exp\left(\tfrac{1}{2} d\right)    (54)

Here $\int \sqrt{p(x) q(x)}\, dx$ is known as the Bhattacharyya coefficient. The squared Hellinger distance has $(\tfrac{3}{2}, \tfrac{3}{2})$ tail weights. Its $f''(u)$ is the geometric mean of $f''(u)$ for KL and $f''(u)$ for reverse KL. The Hellinger distance defines a metric on the space of probability distributions (Vajda, 2009).

The Jeffreys divergence (or J divergence) $(p,q) \mapsto \tfrac{1}{2} \operatorname{KL}(p \,\|\, q) + \tfrac{1}{2} \operatorname{KL}(q \,\|\, p)$ is the arithmetic mean of the KL divergence and the reverse KL divergence. Since $f$, $f''$, $D_{f}$, $E_{f}$, $a_{f}$, $b_{f}$, $a_{f}'$ and $b_{f}'$ are all linear in $f$, they are all just the arithmetic mean of the corresponding quantities for KL and reverse KL, so we do not list them separately. The Jeffreys divergence has $(2, 2)$ tail weights.

The canonicalized squared Le Cam distance (or squared Puri-Vincze distance or triangular discrimination) satisfies:

D_{f}(p,q) = 2 \Delta(p,q)    (55)
 = 2 \chi^{2}(p,\, \tfrac{1}{2} p + \tfrac{1}{2} q)    (56)
 = 2 \chi^{2}(q,\, \tfrac{1}{2} p + \tfrac{1}{2} q)    (57)
 = \int \frac{\left(p(x) - q(x)\right)^{2}}{p(x) + q(x)}\, dx    (58)
E_{f}(p,q,d) = 2 - 4 \int p(x) \left(\sigma(-d(x))\right)^{2} dx - 4 \int q(x) \left(\sigma(d(x))\right)^{2} dx    (59)
f(u) = \frac{(u-1)^{2}}{1+u}    (60)
f''(u) = \frac{8}{(1+u)^{3}}    (61)
a_{f}(d) = 1 - 4 \left(\sigma(-d)\right)^{2}    (62)
b_{f}(d) = 4 \left(\sigma(d)\right)^{2} - 1    (63)
a_{f}'(d) = 8 \sigma(d) \left(\sigma(-d)\right)^{2}    (64)
b_{f}'(d) = 8 \left(\sigma(d)\right)^{2} \sigma(-d)    (65)

Here $\Delta$ is the conventional, non-canonical squared Le Cam distance, which has $f''(1) = \tfrac{1}{2}$, and $\chi^{2}$ is the Pearson $\chi^{2}$ divergence defined below. The squared Le Cam distance has $(0, 0)$ tail weights. It is symmetric with respect to $p$ and $q$. The canonicalized squared Le Cam distance may be obtained by q-softening the canonicalized Pearson $\chi^{2}$ divergence or p-softening the canonicalized Neyman divergence. The Le Cam distance defines a metric on the space of probability distributions (Vajda, 2009).

The canonicalized Pearson $\chi^{2}$ divergence (or Kagan divergence) satisfies:

D_{f}(p,q) = \tfrac{1}{2} \chi^{2}(p,q)    (66)
 = \tfrac{1}{2} \int \frac{\left(p(x) - q(x)\right)^{2}}{q(x)}\, dx    (67)
E_{f}(p,q,d) = -\tfrac{1}{2} + \int p(x) \exp(d(x))\, dx - \tfrac{1}{2} \int q(x) \exp(2 d(x))\, dx    (68)
f(u) = \tfrac{1}{2} (u-1)^{2}    (69)
f''(u) = 1    (70)
a_{f}(d) = \exp(d) - 1    (71)
b_{f}(d) = \tfrac{1}{2} \exp(2d) - \tfrac{1}{2}    (72)
a_{f}'(d) = \exp(d)    (73)
b_{f}'(d) = \exp(2d)    (74)

Here $\chi^{2}$ is the conventional, non-canonical Pearson $\chi^{2}$ divergence, which has $f''(1) = 2$. The Pearson $\chi^{2}$ divergence has $(0, 3)$ tail weights. The p-softened canonicalized Pearson $\chi^{2}$ divergence is itself. The q-softened canonicalized Pearson $\chi^{2}$ divergence is the canonicalized squared Le Cam distance.

The canonicalized Neyman divergence satisfies:

D_{f}(p,q) = \tfrac{1}{2} \chi^{2}(q,p)    (75)
 = \tfrac{1}{2} \int \frac{\left(p(x) - q(x)\right)^{2}}{p(x)}\, dx    (76)
E_{f}(p,q,d) = -\tfrac{1}{2} - \tfrac{1}{2} \int p(x) \exp(-2 d(x))\, dx + \int q(x) \exp(-d(x))\, dx    (77)
f(u) = \frac{(u-1)^{2}}{2u}    (78)
f''(u) = u^{-3}    (79)
a_{f}(d) = \tfrac{1}{2} - \tfrac{1}{2} \exp(-2d)    (80)
b_{f}(d) = 1 - \exp(-d)    (81)
a_{f}'(d) = \exp(-2d)    (82)
b_{f}'(d) = \exp(-d)    (83)

The Neyman divergence has $(3, 0)$ tail weights. It is the reverse of the Pearson $\chi^{2}$ divergence. The p-softened canonicalized Neyman divergence is the canonicalized squared Le Cam distance. The q-softened canonicalized Neyman divergence is itself.

The softened reverse KL divergence (Shannon et al., 2020) satisfies:

D_{f}(p,q) = 4 \operatorname{KL}(\tfrac{1}{2} p + \tfrac{1}{2} q \,\|\, p)    (84)
E_{f}(p,q,d) = 2 - 4 \log 2 + 2 \int p(x) \left[-\exp(-d(x)) - \log \sigma(d(x))\right] dx - 2 \int q(x) \log \sigma(d(x))\, dx    (85)
f(u) = 2 (u+1) \log \frac{u+1}{2u} + 2 (u-1)    (86)
f''(u) = \frac{2}{u^{2}(u+1)}    (87)
a_{f}(d) = -2 \exp(-d) - 2 \log \sigma(d) + 2 - 2 \log 2    (88)
b_{f}(d) = 2 \log \sigma(d) + 2 \log 2    (89)
a_{f}'(d) = 2 \exp(-d)\, \sigma(-d)    (90)
b_{f}'(d) = 2 \sigma(-d)    (91)

The softened reverse KL divergence has $(2, 0)$ tail weights. It is obtained by q-softening the reverse KL divergence. This is the divergence approximately minimized by conventional non-saturating GAN training (Shannon et al., 2020).

divergence             f''(u)              (left, right) tail weights
KL                     u^{-1}              (1, 2)
reverse KL             u^{-2}              (2, 1)
Jensen-Shannon         2 / (u(1+u))        (1, 1)      (harmonic mean of KL and RKL)
squared Hellinger      u^{-3/2}            (3/2, 3/2)  (geometric mean of KL and RKL)
Jeffreys               (1+u) / (2u^2)      (2, 2)      (arithmetic mean of KL and RKL)
squared Le Cam         8 / (1+u)^3         (0, 0)
Pearson chi-squared    1                   (0, 3)
Neyman                 u^{-3}              (3, 0)
softened reverse KL    2 / (u^2 (1+u))     (2, 0)
Table 1: Concise specification of various f-divergences in terms of $f''$. The divergences are scaled to make them canonical ($f''(1) = 1$). All are low-degree rational functions of $u$ (or $\sqrt{u}$). Tail weights, which determine the most important qualitative properties of an f-divergence, are also shown (Shannon et al., 2020).

The $f''$ for various f-divergences is summarized in Table 1. We see that $f''$ provides a particularly simple and concise way to define many common f-divergences. These are all rational functions of $\sqrt{u}$.
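Table 1 translates directly into code. The following sketch (names are ours) collects the canonical $f''$ functions and derives $a_{f}'$ and $b_{f}'$ from them via $a_{f}'(d) = u f''(u)$ and $b_{f}'(d) = u^{2} f''(u)$ with $u = \exp(d)$, which is enough, in principle, to implement the gradient updates discussed in §3:

```python
import numpy as np

# The canonical f'' functions from Table 1 as code, together with
# a_f'(d) = u f''(u) and b_f'(d) = u**2 f''(u) where u = exp(d).
f_pp = {
    "KL":                  lambda u: 1 / u,
    "reverse KL":          lambda u: u**-2.0,
    "Jensen-Shannon":      lambda u: 2 / (u * (1 + u)),
    "squared Hellinger":   lambda u: u**-1.5,
    "Jeffreys":            lambda u: (1 + u) / (2 * u**2),
    "squared Le Cam":      lambda u: 8 / (1 + u)**3,
    "Pearson chi-squared": lambda u: np.ones_like(u),
    "Neyman":              lambda u: u**-3.0,
    "softened reverse KL": lambda u: 2 / (u**2 * (1 + u)),
}

def a_f_prime(fpp, d):
    u = np.exp(d)
    return u * fpp(u)

def b_f_prime(fpp, d):
    u = np.exp(d)
    return u**2 * fpp(u)

for name, fpp in f_pp.items():
    assert np.isclose(fpp(np.array(1.0)), 1.0), name   # all entries are canonical
```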

2.4 Relationship to original f-GAN formulation

The original f-GAN paper (Nowozin et al., 2016) phrases the results presented in §2 in terms of the Legendre transform or Fenchel conjugate $f^{*}$ of $f$. The two descriptions are equivalent (assuming $f$ is differentiable, which is also assumed in practice in the original f-GAN paper; if it were not, it would not be possible to train the critic using gradient-based optimization), as can be seen by setting $T(x) = f'(u(x))$ and using the result $f^{*}(f'(u)) = u f'(u) - f(u)$. We find our description helpful since it avoids having to explicitly match the domain of $f^{*}$, it ensures the optimal $d$ is the same for all $f$-divergences, and the Legendre transform is complicated for one of the divergences we consider. An "output activation" was used in the original f-GAN paper to adapt the output $d$ of the neural net to the domain of $f^{*}$. This is equal to $f'(\exp(d))$, up to irrelevant additive constants, for all the divergences we consider, and so our description also matches the original description in this respect.

3 Variational divergence minimization

f-GANs (Nowozin et al., 2016) generalize classic GANs to allow approximately minimizing any f-divergence. In this section we review and discuss the f-GAN formulation.

Consider the task of estimating a probabilistic model from data using an f-divergence. Here $p$ is the true distribution and the goal is to minimize $l(\lambda) = D_{f}(p, q_{\lambda})$ with respect to $\lambda$, where $\lambda \mapsto q_{\lambda}$ is a parametric family of densities over $\mathbb{R}^{K}$. We refer to $q_{\lambda}$ as the generator. For implicit generative models such as typical GAN generators, the distribution $q_{\lambda}$ is the result of a deterministic transform $g_{\lambda}(z)$ of a stochastic latent variable $z$. However we do not need to assume this specific form for most of our discussion.

3.1 Gradient matching property

We first note that the variational divergence bound $E_{f}$ satisfies a convenient gradient matching property. This is not made explicit in the original f-GAN paper. Denote the optimal $d$ given $p$ and $q_{\lambda}$ by $d^{*}_{\lambda}$. We saw above that $D_{f}(p, q_{\lambda})$ and $E_{f}(p, q_{\lambda}, d)$ match values at $d = d^{*}_{\lambda}$. They also match gradients with respect to the generator parameters $\lambda$:

\frac{\partial}{\partial\lambda} D_{f}(p, q_{\lambda}) = \left.\frac{\partial}{\partial\lambda} E_{f}(p, q_{\lambda}, d)\right|_{d = d^{*}_{\lambda}} = -\int \left[\frac{\partial}{\partial\lambda} q_{\lambda}(x)\right] b_{f}(d^{*}_{\lambda}(x))\, dx    (92)

This follows from the fact that $E_{f}$ is a tight lower bound on $D_{f}$, similarly to the one-dimensional result that any differentiable function $h: \mathbb{R} \to \mathbb{R}$ with $h(x) \geq 0$ for all $x$ and $h(0) = 0$ has $h'(0) = 0$. We can also verify this property directly from the definitions of $D_{f}$ and $E_{f}$.
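A direct numerical check of (92) is possible for a simple parametric family. The following sketch (our own; KL with a one-dimensional Gaussian location family, integrals approximated on a grid) compares a finite-difference estimate of $\partial D_{f}/\partial\lambda$ with the right-hand side of (92):

```python
import numpy as np
from scipy.stats import norm

# Check of (92) for f = KL and the location family q_lambda = N(lambda, 1),
# with p = N(0, 1). Here dq_lambda/dlambda = q_lambda(x) * (x - lambda) and
# b_f(d*) = exp(d*) - 1 = p/q_lambda - 1.
x = np.linspace(-12.0, 12.0, 40001)
dx = x[1] - x[0]
p = norm.pdf(x, 0.0, 1.0)
lam, eps = 0.7, 1e-4

q = lambda l: norm.pdf(x, l, 1.0)
D_kl = lambda l: np.sum(p * (np.log(p) - np.log(q(l)))) * dx   # D_f(p, q_lambda) by quadrature

lhs = (D_kl(lam + eps) - D_kl(lam - eps)) / (2 * eps)          # finite-difference gradient of D_f
rhs = -np.sum(q(lam) * (x - lam) * (p / q(lam) - 1.0)) * dx    # right-hand side of (92)
print(lhs, rhs)   # both approximately lam = 0.7, since here D_f = lam**2 / 2
```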

3.2 Formulation of variational divergence minimization

We can minimize $D_{f}(p, q_{\lambda})$ using variational divergence minimization: maximizing $E_{f}(p, q_{\lambda}, d_{\nu})$ with respect to $\nu$ while minimizing it with respect to $\lambda$. Adversarial optimization such as this lies at the heart of all flavors of GAN training. Define $\overline{\lambda}$ and $\overline{\nu}$ as

\overline{\lambda} = \frac{\partial}{\partial\lambda} E_{f}(p, q_{\lambda}, d_{\nu}) = -\int \left[\frac{\partial}{\partial\lambda} q_{\lambda}(x)\right] b_{f}(d_{\nu}(x))\, dx    (93)
\overline{\nu} = -\frac{\partial}{\partial\nu} E_{f}(p, q_{\lambda}, d_{\nu})    (94)

To perform the adversarial optimization, we can feed $\overline{\lambda}$ and $\overline{\nu}$ (or in practice, stochastic approximations to them) as the gradients into any gradient-based optimizer designed for minimization, e.g. stochastic gradient descent or ADAM.
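For concreteness, here is a minimal f-GAN training sketch in PyTorch (a hypothetical setup of ours, not from the original papers): it alternates the critic and generator updates described above, with the $(a_{f}, b_{f})$ pair for KL plugged in; any pair from §2.3 can be substituted.

```python
import torch
import torch.nn as nn

# Minimal f-GAN training sketch (architecture, sizes and data are made up for
# illustration). The critic ascends E_f and the generator descends it, here
# with the KL pair a_f(d) = d, b_f(d) = exp(d) - 1; note that exp(d) can be
# numerically brittle, and a bounded pair such as Jensen-Shannon (section 2.3)
# can be swapped in without changing anything else.
a_f = lambda d: d
b_f = lambda d: torch.exp(d) - 1.0

generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(critic.parameters(), lr=1e-4)

def sample_data(n):  # stand-in for samples from the true distribution p
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(10_000):
    x_real = sample_data(128)
    x_fake = generator(torch.randn(128, 8))

    # Critic update: maximize E_f = E_p[a_f(d)] - E_q[b_f(d)], i.e. minimize -E_f.
    d_loss = -(a_f(critic(x_real)).mean() - b_f(critic(x_fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: minimize E_f. Only the -E_q[b_f(d)] term depends on the
    # generator parameters, so this amounts to maximizing E_q[b_f(d)].
    g_loss = -b_f(critic(generator(torch.randn(128, 8)))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```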

The gradient matching property shows that performing very many critic updates followed by a single generator update is a sensible learning strategy which, assuming the critic is sufficiently flexible and amenable to optimization, essentially performs very slow gradient-based optimization of the true divergence $D_{f}$ with respect to $\lambda$. However, in practice performing a few critic updates for each generator update, or simultaneous generator and critic updates, works well, and it is easy to see that these approaches at least have the correct fixed points in terms of Nash equilibria of $E_{f}$ and optima of $D_{f}$, subject as always to the assumption that the critic is sufficiently richly parameterized. Convergence properties of these schemes are investigated much more thoroughly elsewhere, for example (Nagarajan and Kolter, 2017; Gulrajani et al., 2017; Mescheder et al., 2017, 2018; Balduzzi et al., 2018; Peng et al., 2019), and are not the main focus here.

3.3 Hybrid training schemes

There is a simple generalization of the above training procedure, which is to base the generator gradients on $E_{f}$ but the critic gradients on $E_{h}$ for a possibly different defining function $h$ (Poole et al., 2016, Section 2.3). We refer to this as using hybrid $(f, h)$ gradients. This also approximately minimizes $D_{f}$. Subject as always to the assumption of a richly parameterized critic, if we perform very many critic updates for each generator update, then the $d$ used to compute the generator gradient will still be close to $d^{*}$, and so the generator gradient will be close to the gradient of $D_{f}$, even though the path $d$ took to approach $d^{*}$ was governed by $h$ rather than $f$. The fixed points of the two gradients are also still correct, and so it seems reasonable to again use more general update schemes, and we might hope for similar convergence results (not analyzed here).

Hybrid schemes may potentially be useful for stabilizing training. For example the reverse KL generator gradient depends on $f$ only through $b_{f}'(d) = 1$, so is likely to be stable with respect to minor inaccuracies in the critic, but the reverse KL critic gradient involves $a_{f}'(d) = \exp(-d)$, which may lead to very large updates if $d(x)$ is ever large and negative for real $x$. A hybrid (reverse KL, Jensen-Shannon) scheme uses a stable update for both the generator and the critic while still approximately minimizing reverse KL.
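A sketch of such a hybrid scheme, written on top of the training loop above (so the generator, critic, optimizers and sample_data defined there are assumed), might look as follows; only the two loss expressions change:

```python
import math
import torch
import torch.nn.functional as F

# Hybrid (reverse KL, Jensen-Shannon) gradients, assuming the generator, critic,
# optimizers and sample_data from the sketch above. The critic ascends E_h for
# canonical Jensen-Shannon (whose a_h' and b_h' are bounded, avoiding the
# exp(-d) blow-up of the reverse KL critic update), while the generator descends
# E_f for reverse KL (b_f(d) = d), so reverse KL is still the divergence being
# approximately minimized.
a_h = lambda d: 2 * math.log(2.0) - 2 * F.softplus(-d)  # 2 log sigma(d) + 2 log 2
b_h = lambda d: 2 * F.softplus(d) - 2 * math.log(2.0)   # -2 log sigma(-d) - 2 log 2

x_real = sample_data(128)
x_fake = generator(torch.randn(128, 8))

d_loss = -(a_h(critic(x_real)).mean() - b_h(critic(x_fake.detach())).mean())  # critic: Jensen-Shannon
g_loss = -critic(x_fake).mean()    # generator: reverse KL, since b_f(d) = d
# The optimizer steps are exactly as in the loop above.
```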

3.4 Low-dimensional generator support

Many GAN generators used in practice have low-dimensional support and do not satisfy the condition $q(x) > 0$ for all $x$ which we assumed for simplicity. In this section we briefly discuss the implications of this for GAN training. We argue that using generators with $q(x) > 0$ everywhere has both theoretical and practical benefits, and so it is not unreasonable to restrict attention to this case.

The vast majority of GAN generators consist of a deterministic neural net applied to a fixed source of noise. Often the noise is far lower-dimensional than the output space, meaning that the set of possible generator outputs for a given trained generator (its support as a probability distribution) is a low-dimensional manifold in output space. For example, the progressive GAN generator (Karras et al., 2018) used a $512$-dimensional noise source and an output space with roughly $3$ million dimensions. The natural data is often assumed to also lie on a low-dimensional manifold in output space, but we would argue that this is almost never exactly the case in practice. Due to sensor noise and dithering if nothing else, it is difficult to say that any image, say, is literally impossible in natural data. If desired, we can even guarantee that this is the case by adding a perceptually insignificant amount of white noise to the data. It seems more accurate to say that the natural data lies close to a low-dimensional manifold, but that no output is impossible, i.e. $p(x) > 0$ everywhere. The low-dimensional generator support combined with high-dimensional data distribution support leads to several pathologies:

  • The set of all possible generator outputs has probability zero under the data distribution.

  • With probability $1$, the generator assigns a natural image a probability density of zero.

  • The KL divergence between the data distribution and the generator is infinite.

  • The true log likelihood of natural data under the model is $-\infty$ (despite approaches based on Parzen windows which produce finite estimates for the log likelihood).

  • Essentially all f-divergences are either undefined or completely saturate with gradient precisely zero.

  • The optimal critic $d^{*}(x) = \log p(x) - \log q(x)$ is $\infty$ almost everywhere.

  • A sufficiently powerful critic can learn to distinguish generator output essentially perfectly by outputting “fake” for a tiny sliver around the low-dimensional generator support (a region which has vanishingly small probability under the data distribution) without learning anything about the natural data distribution. In general the critic is incentivized to focus on tiny details relevant to detecting the current generator support but potentially imperceptible to a human, and the generator is incentivized to change these tiny details slightly to move the current support.

These pathologies seem highly undesirable for a probabilistic model. Much attention has been devoted to this issue, especially under the assumption that the data support is also low-dimensional (Arjovsky et al., 2017; Mescheder et al., 2018, for example), and it is one of the motivating scenarios for Wasserstein GANs.

This issue is extremely easy to fix by injecting noise at all levels of the generator network, including the output (with a learned variance parameter). If a given injection of noise is not useful then it is easy enough for the generator to learn to ignore it. This injected noise (indeed, even just the output noise) is enough to formally make the generator support cover all of output space and eliminate all of the above theoretical pathologies. The dimensionality argument above no longer applies since the dimensionality of all noise sources is now greater than the output dimensionality (the output noise alone ensures this). The injected noise can also have advantages in practice (Karras et al., 2019).
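As a minimal sketch of the output-noise part of this fix (our own illustration, not a prescription from the cited works), a deterministic generator can be wrapped so that its output receives additive Gaussian noise with a learned scale:

```python
import torch
import torch.nn as nn

# A sketch of output-noise injection (our own illustration): wrapping any
# deterministic generator with additive Gaussian output noise whose scale is a
# learned parameter makes the model density q(x) positive everywhere in output
# space, removing the low-dimensional-support pathologies listed above.
class NoisyOutputGenerator(nn.Module):
    def __init__(self, base_generator, output_dim):
        super().__init__()
        self.base = base_generator
        self.log_sigma = nn.Parameter(torch.full((output_dim,), -2.0))  # learned log noise scale

    def forward(self, z):
        x = self.base(z)
        return x + torch.exp(self.log_sigma) * torch.randn_like(x)
```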

References

  • Ali and Silvey (1966) S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proc. ICML, pages 214–223, 2017.
  • Balduzzi et al. (2018) D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In Proc. ICML, 2018.
  • Burbea and Rao (1982) J. Burbea and C. Rao. On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489–495, 1982.
  • Csiszár (1967) I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967. Available at http://real-j.mtak.hu/id/eprint/5453.
  • Endres and Schindelin (2003) D. M. Endres and J. E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.
  • Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • Karras et al. (2018) T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proc. ICLR, 2018.
  • Karras et al. (2019) T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • Lin (1991) J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
  • Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems, pages 1825–1835, 2017.
  • Mescheder et al. (2018) L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In Proc. ICML, 2018.
  • Nagarajan and Kolter (2017) V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5585–5595, 2017.
  • Nguyen et al. (2010) X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • Nowozin et al. (2016) S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
  • Österreicher and Vajda (2003) F. Österreicher and I. Vajda. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.
  • Peng et al. (2019) W. Peng, Y. Dai, H. Zhang, and L. Cheng. Training GANs with centripetal acceleration. arXiv preprint arXiv:1902.08949, 2019.
  • Poole et al. (2016) B. Poole, A. A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for GANs. In Proc. NIPS Workshop on Adversarial Training, 2016.
  • Shannon et al. (2020) M. Shannon, S. Mariooryad, B. Poole, T. Bagby, E. Battenberg, D. Kao, D. Stanton, and R. Skerry-Ryan. The divergences minimized by non-saturating GAN training. arXiv preprint arXiv:, 2020.
  • Topsoe (2000) F. Topsoe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.
  • Vajda (2009) I. Vajda. On metric divergences of probability measures. Kybernetika, 45(6):885–900, 2009.