
Geometry of Score Based Generative Models

Sandesh Ghimire    Jinyang Liu    Armand Comas    Davin Hill    Aria Masoomi    Octavia Camps    Jennifer Dy
Abstract

In this work, we examine score-based generative models (also called diffusion generative models) from a geometric perspective. From this new viewpoint, we prove that both the forward process of adding noise and the backward process of generating from noise are Wasserstein gradient flows in the space of probability measures; to the best of our knowledge, we are the first to prove this connection. Our collective understanding of score-based (and diffusion) generative models has matured by drawing ideas from different fields such as Bayesian inference, control theory, stochastic differential equations, and the Schrodinger bridge, yet many open questions and challenges remain; one problem, for example, is how to decrease the sampling time. We demonstrate that the geometric perspective enables us to answer many of these questions and provides new interpretations of some known results. Furthermore, it enables us to devise an intuitive geometric solution to the problem of faster sampling: by augmenting traditional score-based generative models with a projection step, we show that we can generate high-quality images with significantly fewer sampling steps.


1 Introduction


Figure 1: Both the forward diffusion process and reverse generation process in diffusion generative models correspond to Wasserstein gradient flow along the same gradient-flow-path.

Score-based (or diffusion) models are a new class of generative models in computer vision and machine learning, achieving state-of-the-art results in image synthesis (Dhariwal & Nichol, 2021) and log-likelihood (Kingma et al., 2021). They have recently gained popularity due to interesting applications such as text-to-image generation (DALL-E (Ramesh et al., 2022), (Rombach et al., 2022), and Imagen (Saharia et al., 2022)), image super-resolution, and image editing (Meng et al., 2022). Score-based generative models have enjoyed diverse perspectives from different fields. Originally, diffusion models (DDPM) were derived by maximizing an evidence lower bound (ELBO) on the data log-likelihood (Ho et al., 2020). Song & Ermon (2019) showed that we can learn the gradient of the log-likelihood (called the score function) and use it to generate images. Song et al. (2021) showed that the epsilon function in DDPM is in fact a scaled version of the score function. They further generalized these models to a continuous-time setting as stochastic differential equations (Song et al., 2021). More recent works have connected score-based generative models with the Schrodinger bridge problem (De Bortoli et al., 2021) and control-theoretic perspectives (Chen et al., 2022; Huang et al., 2021).

In this work, we present a completely different viewpoint on score-based generative models: the geometric perspective. To the best of our knowledge, we are the first to explore the geometric connection of these generative models. Applying the mathematical framework of Wasserstein gradient flows (Jordan et al., 1998; Ambrosio et al., 2005; Wibisono, 2018; Salim et al., 2020; Korba et al., 2020), we show that the forward and backward processes of adding noise and generating images from noise are in fact equivalent to moving along a gradient-flow-path in a metric space of probability distributions following the Wasserstein gradient flow equation.

While our understanding of score-based generative models has matured over time, a few important questions remain unanswered. For example, why is it a good idea to choose the forward and reverse variances to be the same? Can we choose the reverse variance differently? Are score-based generative models the same as energy-based models (Xie et al., 2016; Gao et al., 2020; Du et al., 2021)? Furthermore, new models have been proposed, such as WaveFit (Koizumi et al., 2022), which generalizes diffusion sampling to a proximal-gradient-type update. How can we explain this type of algorithm? In this work, we demonstrate that the geometric connection investigated here helps answer these questions from a geometric point of view.

In addition to conceptual advantages and novel perspectives, the geometric framework enables us to design practical algorithms with faster sampling. Score-based generative models work remarkably well when the number of sampling steps is large (i.e., the step size is small). However, the sampling time is also large for such fine discretizations. As we decrease the number of sampling steps, the samples move away from the gradient-flow-path, incurring error in each step and resulting in high overall error. To minimize this error and achieve high-quality samples even with a small number of sampling steps, we propose to project the intermediate samples back onto the gradient-flow-path after every step. To achieve this, we propose an efficient estimation of the Wasserstein gradient used to descend towards the flow path. As demonstrated in the results section, our proposed method significantly reduces error for smaller numbers of sampling steps. Below we summarize our contributions. All complete proofs are included in the appendix.

  1. To the best of our knowledge, this is the first work to theoretically prove the connection between score-based generative models and Wasserstein gradient flow. We establish this relationship through Theorems 1 and 2.

  2. This connection sheds light on several interesting questions: 1) the choice of the reverse variance in score-based generative models, 2) the connection between score-based and energy-based models, and 3) the use of proximal gradient algorithms as proposed in recent works.

  3. Based on these insights, we propose a new algorithm which generalizes the score-based model and allows for significantly faster sampling, which would otherwise be very difficult to achieve. To this end, we also propose an efficient Wasserstein gradient estimation algorithm.

2 Related Works

Early works on diffusion models were based on matching the forward and reverse joint distributions through bounds on the log-likelihood (Ho et al., 2020; Sohl-Dickstein et al., 2015). Song & Ermon (2019) proposed a score-based generative model motivated by Langevin dynamics, estimating the score function. Later, Song et al. (2021) showed that the two approaches are equivalent and can be further generalized to the continuous-time setting through stochastic differential equations. In a more theoretical direction, score-based optimization has been shown to be equivalent to likelihood maximization through the Feynman-Kac theorem (Chen et al., 2022; Huang et al., 2021). Other notable works interpret forward diffusion and generation as solving the Schrodinger bridge problem (De Bortoli et al., 2021). Many approaches have been proposed to speed up the sampling process through clever ways of solving the differential equations (Lu et al., 2022).

In their seminal work, Jordan, Kinderlehrer, and Otto (JKO) proved the connection between Wasserstein gradient flow and diffusion systems governed by Fokker-Planck equations (Jordan et al., 1998). This result has been vastly generalized and formalized by Villani (2003; 2009) and Ambrosio et al. (2005), giving birth to the theory of Wasserstein gradient flow and optimization on the space of probability measures. Several notable works have followed in machine learning (Wibisono, 2018; Korba et al., 2020; Salim et al., 2020), for example for sampling and generative modeling.

3 Preliminaries

3.1 Notations

Let $\mathcal{B}(\mathcal{X})$ denote the Borel $\sigma$-algebra over $\mathcal{X}$, and let $\mu$ denote a probability measure on $\mathcal{X}$. $\mathcal{P}_{2}(\mathcal{X})$ denotes the space of probability measures $\mu$ on $\mathcal{X}$ with finite second-order moment. For any $\mu\in\mathcal{P}_{2}(\mathcal{X})$, $L^{2}(\mu)$ is the space of functions $f:\mathcal{X}\to\mathcal{X}$ such that $\int\|f\|^{2}d\mu<\infty$ (Ambrosio et al., 2005; Korba et al., 2020). Let $T:\mathcal{X}\to\mathcal{X}$; then $T_{\#}\mu$ denotes the pushforward measure of $\mu$ by $T$, such that the transfer lemma $\int\phi(T(x))d\mu(x)=\int\phi(y)dT_{\#}\mu(y)$ holds for any measurable bounded function $\phi$. We use the Wasserstein-2 distance as a metric on the space of probability measures. It is defined as $W_{2}^{2}(\mu,\nu)=\inf_{s\in\mathcal{S}(\mu,\nu)}\int\|x-y\|^{2}ds(x,y)$, where $\mu,\nu\in\mathcal{P}_{2}(\mathcal{X})$ and $\mathcal{S}(\mu,\nu)$ is the set of couplings between $\mu$ and $\nu$, i.e. the set of nonnegative measures $s$ over $\mathcal{X}\times\mathcal{X}$ whose projections onto the first and second components are $P_{\#}s=\mu$ and $Q_{\#}s=\nu$, where $P:(x,y)\mapsto x$ and $Q:(x,y)\mapsto y$ (Villani, 2003).
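As a concrete illustration of the Wasserstein-2 distance, the short sketch below estimates $W_2$ between two one-dimensional empirical measures; in one dimension the optimal coupling simply pairs sorted samples. This is an illustrative aside; the function name and the Gaussian test case are our own, not from the paper.

```python
import numpy as np

def w2_distance_1d(x, y):
    """Estimate the Wasserstein-2 distance between two 1-D empirical measures.
    In one dimension the optimal coupling pairs sorted samples, so W_2^2
    reduces to the mean squared gap between order statistics."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

# Sanity check: W_2 between N(0,1) and N(2,1) is exactly 2.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)
y = rng.normal(2.0, 1.0, size=100_000)
print(w2_distance_1d(x, y))  # ~2.0
```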

3.2 Wasserstein Gradient Flow

Let $(\mu_{t})_{t\in(0,T)}$ denote a family of probability measures. This family satisfies a continuity equation if there exists a family of velocity fields $(v_{t})_{t\in(0,T)}$ such that

$\frac{\partial\mu_{t}}{\partial t}+\mathrm{div}(\mu_{t}v_{t})=0 \qquad (1)$

in a distributional sense. The family is absolutely continuous if $\|v_{t}\|_{L^{2}(\mu_{t})}$ is integrable over $(0,T)$. Among all possible $v_{t}$, there is one with minimum $L^{2}(\mu_{t})$ norm; it lies in the tangent space of $\mathcal{P}_{2}(\mathcal{X})$ and is called the tangent vector field (Ambrosio et al., 2005, Chapter 8).

We define a functional on the space of probability measures, $\mathcal{F}:\mathcal{P}_{2}(\mathcal{X})\to(-\infty,\infty)$. The Wasserstein gradient of a functional on $\mathcal{P}_{2}(\mathcal{X})$ measures the change in the value of the functional under a small perturbation of the probability measure. It can be expressed in the following form (Ambrosio et al., 2005, Chapter 10):

$\nabla_{W_{2}}\mathcal{F}=\nabla\mathcal{F}^{\prime}(\mu) \qquad (2)$

where $\mathcal{F}^{\prime}(\mu)$ denotes the first variation of $\mathcal{F}$ at $\mu$. Consider the KL divergence $\text{KL}(\mu\,\|\,\pi)$ between a measure $\mu$ and a base measure $\pi$. One can show that the Wasserstein gradient of the functional $\text{KL}(\cdot\,\|\,\pi)$ at $\mu$ is

$\nabla_{W_{2}}\text{KL}(\cdot\,\|\,\pi)=\nabla\log\Big(\frac{\mu}{\pi}\Big) \qquad (3)$

In the family $(\mu_{t})_{t\in(0,T)}$, let the initial measure be $\mu_{0}=\rho$ and the final measure be $\mu_{T}=\pi$. There exists a geodesic between the two probability measures $\rho$ and $\pi$ with respect to the Wasserstein metric. If we choose the velocity field equal to the negative of the Wasserstein gradient (i.e. $v_{t}=-\nabla_{W_{2}}\text{KL}(\cdot\,\|\,\pi)$), then one can show that the path traced by the probability measures is the geodesic between $\rho$ and $\pi$ (Ambrosio et al., 2005, Chapter 7), and the flow is known as the Wasserstein gradient flow. Using the functional $\text{KL}(\cdot\,\|\,\pi)$ and the continuity equation, we obtain the equation of the Wasserstein gradient flow:

$\frac{\partial\mu_{t}}{\partial t}=\mathrm{div}\Big[\mu_{t}\nabla\log\Big(\frac{\mu_{t}}{\pi}\Big)\Big] \qquad (4)$

The Wasserstein gradient flow is a differential equation on probability measures. Consider a Wasserstein gradient flow with initial measure $\mu_{0}$ satisfying the continuity equation (1). Let $x_{0}\sim\mu_{0}$ be a sample from the initial measure. The differential equation for the samples can be derived from the continuity equation as follows (Ambrosio et al., 2005):

$\dot{x}=v_{t}(x) \qquad (5)$
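As a sanity check of eq.(5), the toy sketch below pushes particles with the velocity field $v_{t}=-\nabla\log(\mu_{t}/\pi)$ for $\pi=\mathcal{N}(0,I)$ and a Gaussian initial measure $\mu_{0}=\mathcal{N}(m_{0},I)$, in which case $v_{t}(x)=-m_{t}$ and the mean decays as $m_{t}=m_{0}e^{-t}$. The example and all names in it are ours, not part of the paper.

```python
import numpy as np

# Particles following dx/dt = v_t(x) with v_t = -grad log(mu_t / pi), pi = N(0, I).
# If mu_0 = N(m_0, I), then mu_t stays N(m_t, I), v_t(x) = -m_t (constant in x),
# and the mean obeys dm/dt = -m_t, i.e. m_t = m_0 * exp(-t).
rng = np.random.default_rng(0)
m0, dt, T = 3.0, 1e-3, 2.0
x = rng.normal(m0, 1.0, size=50_000)   # samples from mu_0
for _ in range(int(T / dt)):
    v = -x.mean()                      # estimate v_t = -m_t from the particles
    x = x + dt * v                     # Euler step of the sample ODE in eq.(5)
print(x.mean(), m0 * np.exp(-T))       # both approximately 0.406
```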

3.3 Score Based Generative Model

The score-based generative model (Song et al., 2021) extends diffusion models to the continuous-time setting using stochastic differential equations (SDEs). The forward and reverse processes of adding noise and generating images are interpreted as forward and reverse diffusion processes with the following differential equations:

$\text{FOR}:\; dx=f(x,t)\,dt+g_{t}\,dw, \qquad t:0\to T,\; x_{0}\sim\rho \qquad (6)$
$\text{REV}:\; dx=[f(x,t)-g_{t}^{2}\nabla_{x}\log\mu_{t}(x)]\,dt+g_{t}\,d\bar{w}, \qquad t:T\to 0,\; x_{T}\sim\pi \qquad (7)$

where $f$ is the forward drift function and $dw$ is the Brownian motion. Note that the flow of time in the two SDEs is different: time flows from $0$ to $T$ in the forward process, whose initial distribution is $\rho$, while time flows from $T$ to $0$ in the reverse process. The time direction is crucial in a stochastic differential equation because in the forward process $x_{t}$ is independent of the future $t^{\prime}>t$, while in the reverse direction it is independent of the past (Anderson, 1982). To make things simpler, so that time always flows in the positive direction, we can equivalently use a positive time notation indexed by $\tau$ (the following is equivalent to eq.(7)):

$dx=[-f(x,\tau)+g_{\tau}^{2}\nabla_{x}\log\mu_{\tau}(x)]\,d\tau+g_{\tau}\,d\bar{w}, \qquad \tau:0\to T,\; x_{\tau=0}\sim\pi,\; \tau=T-t \qquad (8)$

Note that we use $t$ for the forward flow of time and $\tau=T-t$ for the backward flow of time, so that $\tau$ now flows from $0$ to $T$. With this notation, $\mu_{t=0}=\mu_{\tau=T}=\rho$, $\mu_{t=T}=\mu_{\tau=0}=\pi$, and $\beta_{t}=\beta_{T-\tau}$. Euler-Maruyama discretization of the reverse SDE yields:

$x_{\tau+\delta\tau}=x_{\tau}+\big(g_{\tau}^{2}\nabla_{x}\log\mu_{\tau}(x_{\tau})-f(x_{\tau},\tau)\big)\delta\tau+g_{\tau}\sqrt{\delta\tau}\,z \qquad (9)$

where $z\sim\mathcal{N}(0,I)$ is a standard normal sample.
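For concreteness, a minimal sketch of the Euler-Maruyama sampler in eq.(9) is given below for the VP case $f(x,\tau)=-\beta_{\tau}x$, $g_{\tau}=\sqrt{2\beta_{\tau}}$ introduced later in Remark 1.1. The score function `score_fn` and the schedule `beta` are assumed to be given and are not defined here.

```python
import numpy as np

def reverse_sde_sample(score_fn, beta, n_steps, dim, rng):
    """Euler-Maruyama discretization of the reverse SDE, eq.(9), assuming the
    VP choices f(x, tau) = -beta_tau * x and g_tau^2 = 2 * beta_tau.
    `score_fn(x, tau)` should return an estimate of grad_x log mu_tau(x)."""
    dt = 1.0 / n_steps
    x = rng.normal(size=dim)                      # x_{tau=0} ~ pi = N(0, I)
    for k in range(n_steps):
        tau = k * dt
        b = beta(tau)
        drift = 2 * b * score_fn(x, tau) + b * x  # g^2 * score - f, with f = -b*x
        x = x + drift * dt + np.sqrt(2 * b * dt) * rng.normal(size=dim)
    return x
```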

4 Forward Diffusion as Gradient Flow

Instead of taking the velocity vector to be the negative of the Wasserstein gradient, we consider an accelerated flow where, at any time $t$, the velocity equals the negative Wasserstein gradient scaled by a time-varying factor $\beta_{t}$.

Proposition 1 (Accelerated Wasserstein Gradient Flow).

We define the accelerated gradient flow with respect to the functional $\mathcal{F}$ as the gradient flow whose velocity vector is $v_{t}=-\beta_{t}\nabla_{W_{2}}\mathcal{F}$. Consequently, the continuity equation is given by:

$\frac{\partial\mu_{t}}{\partial t}=\mathrm{div}\big[\mu_{t}\beta_{t}\nabla_{W_{2}}\mathcal{F}\big] \qquad (10)$

Using this accelerated Wasserstein gradient flow, we can establish a connection with the forward process of the score-based generative model. We start from the Fokker-Planck equation corresponding to the stochastic differential equation of the forward diffusion process in eq.(6):

$\frac{\partial\mu_{t}}{\partial t}=-\mathrm{div}(\mu_{t}f)+\frac{1}{2}\mathrm{div}\big(\nabla(g_{t}^{2}\mu_{t})\big) \qquad (11)$

where the initial measure is $\mu_{0}=\rho$. Following this SDE, the process ends up in the final measure $\mu_{T}=\pi$. The next theorem shows that the forward Fokker-Planck equation and the accelerated Wasserstein gradient flow are equivalent.

Theorem 1.

Consider the accelerated gradient flow in eq.(10) with initial measure $\mu_{0}=\rho$, target measure $\mu_{T}=\pi$, and the functional on the Wasserstein space defined by $\mathcal{F}(\mu)=\text{KL}(\mu\,\|\,\pi)$. The family of measures corresponding to this gradient flow is equivalent to the family of measures corresponding to the forward Fokker-Planck equation in eq.(11), provided that $f$ and $\beta_{t}$ take the following form: $f=\beta_{t}\nabla\log\pi$, $\beta_{t}=\frac{g_{t}^{2}}{2}$.

Remark 1.1.

Consider the special case with final measure $\mu_{T}=\pi=\mathcal{N}(0,I)=\exp(-\frac{\|x\|^{2}}{2})/Z$. We get $f=-\beta_{t}x$, and the forward diffusion equation is given by the following SDE:

$dx=-\beta_{t}x\,dt+\sqrt{2\beta_{t}}\,dw \qquad (12)$

which is exactly the forward flow of the DDPM model (Ho et al., 2020; Song et al., 2021).

This implies that the forward diffusion process considered in the diffusion generative model DDPM (Ho et al., 2020; Song et al., 2021) can equivalently be thought of as an accelerated Wasserstein gradient flow starting from an initial measure $\mu_{0}=\rho$ corresponding to the data distribution and following the negative gradient towards the target measure $\mu_{T}=\mathcal{N}(0,I)$.
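The short simulation below illustrates this: integrating the forward SDE of eq.(12) with Euler-Maruyama drives a bimodal toy "data" distribution towards $\mathcal{N}(0,I)$. The schedule and all constants are illustrative assumptions, not values used in the paper.

```python
import numpy as np

# Simulate the forward SDE dx = -beta_t * x * dt + sqrt(2 * beta_t) * dw of eq.(12).
rng = np.random.default_rng(0)
n, n_steps, T = 100_000, 1_000, 1.0
dt = T / n_steps
beta = np.linspace(0.1, 20.0, n_steps)              # assumed beta_t schedule
x = rng.choice([-2.0, 2.0], size=n) + 0.1 * rng.normal(size=n)  # bimodal "data"
for k in range(n_steps):
    x = x - beta[k] * x * dt + np.sqrt(2 * beta[k] * dt) * rng.normal(size=n)
print(x.mean(), x.std())                            # approximately 0 and 1, i.e. N(0, 1)
```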

Remark 1.2.

We can also think of the accelerated Wasserstein gradient flow as a regular Wasserstein gradient flow with non-uniform discretization, i.e., the step at time $t$ is scaled by $\beta_{t}$.

Next we investigate the geometric interpretation of the generation process or the reverse SDE.

5 Generation as Reverse Gradient Flow

The next theorem establishes the equivalence between the reverse SDE and the Wasserstein gradient flow.

Theorem 2.

The reverse SDE in eq.(8) is equivalent to the Wasserstein gradient flow in the space of probability measures with respect to the functional $\mathcal{F}(\mu)=-\text{KL}(\mu\,\|\,\pi)$, starting from the initial measure $\mu_{\tau=0}=\pi$ towards the target measure $\mu_{\tau=T}=\rho$.

Proof.
$\mathcal{F}(\mu)=-\text{KL}(\mu\,\|\,\pi) \qquad (13)$
$=\int-\log\mu\,d\mu+\int\log\pi\,d\mu \qquad (14)$
$=\underbrace{\int(-2\log\mu+\log\pi)\,d\mu}_{\mathcal{G}}+\underbrace{\int\log\mu\,d\mu}_{\mathcal{H}} \qquad (15)$

Here, we apply the forward-backward splitting scheme due to Wibisono (2018) and Salim et al. (2020):

$\nu_{\tau}=\big(I-\beta_{\tau}\delta\tau\,\nabla_{W_{2}}\mathcal{G}(\mu_{\tau})\big)_{\#}\mu_{\tau} \qquad (16)$
$\mu_{\tau+\delta\tau}=\text{JKO}_{\beta_{\tau}\delta\tau\,\mathcal{H}}(\nu_{\tau}) \qquad (17)$

where $\nabla_{W_{2}}\mathcal{G}(\mu)$ is the Wasserstein gradient, whose expression can be obtained as:

$\nabla_{W_{2}}\mathcal{G}=\nabla\mathcal{G}^{\prime}(\mu)=\nabla(-2\log\mu+\log\pi) \qquad (18)$

In eq.(16), we move along the negative Wasserstein gradient of $\mathcal{G}$. Let $x_{\tau}\sim\mu_{\tau}$ be samples from the distribution $\mu_{\tau}$. Transforming the differential equation in measure space to sample space, as in eq.(5), yields:

$y_{\tau}=x_{\tau}-\beta_{\tau}\nabla\big(-2\log\mu_{\tau}(x_{\tau})+\log\pi(x_{\tau})\big)\delta\tau \qquad (19)$

In eq.(17), we use the JKO operator as the solution operator for the negative entropy functional $\mathcal{H}$, where the JKO operator is defined as:

$\text{JKO}_{\beta\mathcal{H}}(\nu)=\underset{\zeta\in\mathcal{P}_{2}(\mathcal{X})}{\text{argmin}}\;\mathcal{H}(\zeta)+\frac{1}{2\beta}W_{2}^{2}(\zeta,\nu)$

For the negative entropy functional, the exact solution is given by Brownian motion (Jordan et al., 1998; Wibisono, 2018; Salim et al., 2020). Letting $y_{\tau}\sim\nu_{\tau}$, we obtain

$y_{\tau}=x_{\tau}-\beta_{\tau}\nabla\big(-2\log\mu_{\tau}(x_{\tau})+\log\pi(x_{\tau})\big)\delta\tau \qquad (20)$
$x_{\tau+\delta\tau}=y_{\tau}+\sqrt{2\beta_{\tau}\delta\tau}\,z_{\tau} \qquad (21)$

Combining both, we obtain

$x_{\tau+\delta\tau}=x_{\tau}+\big(2\beta_{\tau}\nabla\log\mu_{\tau}(x_{\tau})-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)\delta\tau+\sqrt{2\beta_{\tau}\delta\tau}\,z_{\tau} \qquad (22)$

In the limiting case as $\delta\tau\to 0$, we obtain

$dx=\big(2\beta_{\tau}\nabla\log\mu_{\tau}(x_{\tau})-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)d\tau+\sqrt{2\beta_{\tau}}\,dw$

which coincides exactly with the reverse SDE in eq.(8) for $g_{\tau}^{2}=2\beta_{\tau}$ and $f=\beta_{\tau}\nabla\log\pi$. ∎

The reverse SDE, i.e. the score-based model, reverses the forward process by tracing the path followed in the forward process in the opposite direction. One important implication of this theorem is that, since we move towards the target measure $\mu_{T}=\mathcal{N}(0,I)$ in the forward process, the reverse process is simply moving away from $\mathcal{N}(0,I)$, which is realized as the accelerated Wasserstein gradient flow with the functional $-\text{KL}(\cdot\,\|\,\pi)$. The gradient flow path with constant velocity is the geodesic; since we consider a gradient flow path with acceleration, it is not exactly the geodesic, but a similar path traced by the gradient flow. We call it the gradient-flow-path in the rest of the paper.

6 Insights, Connections, Discussion

We have shown that both the forward and reverse diffusion processes involved in score-based generative models are gradient flows on the space of probability measures. This geometric interpretation yields several insights, which we discuss below.

6.1 Alternative Interpretation of Reverse SDE equation

The score-based generative model uses the fact that for every forward SDE of the form in eq.(6), there exists a reverse SDE as in eq.(8), a remarkable result due to Anderson (1982). Theorem 2 provides an interesting interpretation of this result from a completely different perspective. In eq.(15), we added and subtracted the negative entropy term $\int\log\mu\,d\mu$ in the diffusion and drift terms, respectively. This allowed us to design a forward-backward algorithm instead of a forward-only Wasserstein gradient flow algorithm. The backward term essentially added the Brownian motion, yielding a reverse stochastic differential equation. Note that if we had not added and subtracted the term $\int\log\mu\,d\mu$, we would have obtained the following iterative scheme:

$x_{\tau+\delta\tau}=x_{\tau}+\big(\beta_{\tau}\nabla\log\mu_{\tau}(x_{\tau})-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)\delta\tau \qquad (23)$

Note that this is a discretized version of the following ODE.

$\frac{dx}{d\tau}=-f(x,\tau)+\frac{1}{2}g_{\tau}^{2}\nabla\log\mu_{\tau}(x), \qquad \tau:0\to T \qquad (24)$

Comparing this equation with eq.(8), observe that eq.(8) has an additional $\frac{1}{2}g_{\tau}^{2}\nabla\log\mu_{\tau}(x)$ in the drift part, which is compensated by the Brownian motion term $g_{\tau}\,d\bar{w}$. Eq.(8) and eq.(24) yield the same family of marginal distributions for $\tau\in[0,T]$, even though the former is a stochastic differential equation and the latter is deterministic. Perhaps the advantage of score-based models is that the stochasticity helps generate diverse samples when the number of samples is small.
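A deterministic counterpart of the Euler-Maruyama sampler sketched in Section 3.3 follows by discretizing eq.(24) with the same assumed VP choices; per the discussion above, it targets the same marginals without injecting noise. As before, `score_fn` and `beta` are assumed to be given.

```python
def probability_flow_sample(score_fn, beta, n_steps, dim, rng):
    """Euler discretization of the ODE in eq.(24) with f(x, tau) = -beta_tau * x
    and g_tau^2 = 2 * beta_tau, so dx/dtau = beta_tau * (x + score)."""
    dt = 1.0 / n_steps
    x = rng.normal(size=dim)                      # x_{tau=0} ~ pi = N(0, I)
    for k in range(n_steps):
        tau = k * dt
        b = beta(tau)
        x = x + b * (x + score_fn(x, tau)) * dt   # -f + (1/2) g^2 * score
        # no Brownian increment: the trajectory is deterministic given x_0
    return x
```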

6.2 Why is the reverse variance the same as the forward variance?

In the DDPM model (Ho et al., 2020), it was not clear how to choose the variance of the reverse differential equation, nor why choosing the reverse variance to be the same as the forward one is a good strategy. From the previous analysis, we see that the reverse variance must equal the forward variance because we added and subtracted the same negative entropy term in the drift and the diffusion. However, it is possible to choose the reverse-time variance differently. For example, we can add $\alpha\int\log\mu\,d\mu$ to both the drift and diffusion terms in eq.(15). Then the reverse SDE variance becomes $\sqrt{2\alpha\beta_{t}}$, but the drift term in eq.(8) is also modified to $\frac{1+\alpha}{2}g_{t}^{2}\nabla\log\mu_{t}(x)$ instead of $g_{t}^{2}\nabla\log\mu_{t}(x)$.
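Concretely, repeating the splitting used in the proof of Theorem 2 with $\alpha\int\log\mu\,d\mu$ moved between the two terms gives (a sketch following the same steps, not a separate result of the paper):

$\mathcal{F}(\mu)=\underbrace{\int\big(-(1+\alpha)\log\mu+\log\pi\big)\,d\mu}_{\mathcal{G}}+\underbrace{\alpha\int\log\mu\,d\mu}_{\mathcal{H}},\qquad dx=\big[(1+\alpha)\beta_{\tau}\nabla\log\mu_{\tau}(x)-\beta_{\tau}\nabla\log\pi(x)\big]d\tau+\sqrt{2\alpha\beta_{\tau}}\,dw$

so that $\alpha=1$ recovers the reverse SDE in eq.(8), while $\alpha=0$ recovers the deterministic flow of eq.(24).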

6.3 Contrasting Score-based with Energy-based Models

Let us assume that the probability measure of the data can be written in the form $\rho\propto\exp(-V)$. Consider Wasserstein gradient descent with the functional $\text{KL}(\cdot\,\|\,\rho)$:

$\mathcal{F}(\mu)=\text{KL}(\mu\,\|\,\rho) \qquad (25)$
$=\int V\,d\mu+\int\log\mu\,d\mu \qquad (26)$

We can use the same forward-backward splitting scheme as in the proof of Theorem 2, and with similar reasoning we recover the Langevin dynamics:

$x_{\tau+\delta\tau}=x_{\tau}-\beta_{\tau}\nabla V(x_{\tau})\,\delta\tau+\sqrt{2\beta_{\tau}\delta\tau}\,z \qquad (27)$

This demonstrates the critical difference between energy-based models and score-based models: while the energy-based model moves towards the data distribution $\mu_{0}=\rho$ with the functional $\text{KL}(\cdot\,\|\,\rho)$, the score-based model moves away from the isotropic Gaussian distribution $\pi$ with the functional $-\text{KL}(\cdot\,\|\,\pi)$. The score-based generative model traces the forward diffusion path in the reverse direction, thereby avoiding the need to work with the data distribution $\rho$. In the energy-based model, however, we need to estimate either an energy function such as $V$ (Gao et al., 2020; Du et al., 2021) or the KL divergence with respect to the data distribution $\rho$.
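For reference, the sketch below runs the Langevin iteration of eq.(27) on a toy energy $V(x)=\|x\|^{2}/2$, for which $\rho=\mathcal{N}(0,I)$; the energy, schedule, and helper names are illustrative assumptions, not the paper's models.

```python
import numpy as np

def langevin_sample(grad_V, beta, n_steps, dim, rng):
    """Langevin iteration of eq.(27):
    x <- x - beta * grad_V(x) * dt + sqrt(2 * beta * dt) * z."""
    x = rng.normal(size=dim)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        b = beta(k * dt)
        x = x - b * grad_V(x) * dt + np.sqrt(2 * b * dt) * rng.normal(size=dim)
    return x

# Toy energy V(x) = ||x||^2 / 2, so grad_V(x) = x and the target is N(0, I).
rng = np.random.default_rng(0)
samples = np.stack([langevin_sample(lambda x: x, lambda t: 1.0, 500, 2, rng)
                    for _ in range(1_000)])
print(samples.mean(axis=0), samples.std(axis=0))   # approximately [0, 0] and [1, 1]
```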

6.4 Proximal Algorithms in Diffusion models

WaveFit (Koizumi et al., 2022) generalizes the iteration in diffusion models to a proximal algorithm. Motivated by fixed-point iteration, it improves upon the DDPM model by drawing ideas from GANs and proposes a proximal-algorithm-type approach that generates samples faster than DDPM without losing quality. Here, we show that, starting from the geometric perspective, we arrive at a proximal algorithm as a way to perform Wasserstein gradient descent. Consider a functional, say $\mathcal{F}(\mu)=\text{KL}(\mu\,\|\,\rho)$, where $\rho$ is the data distribution. Implicit (backward) discretization of the Wasserstein gradient flow yields the iteration (Jordan et al., 1998; Salim et al., 2020):

$\mu_{\tau^{\prime}}=\underset{\nu\in\mathcal{P}_{2}(\mathcal{X})}{\text{argmin}}\;\mathcal{F}(\nu)+\frac{1}{2\gamma}W_{2}^{2}(\nu,\mu) \qquad (28)$

which is a proximal algorithm in the space of probability measures. This justifies why proximal algorithms make sense in the context of diffusion or score-based generative models: we are trying to reach the data distribution by descending in the direction of the Wasserstein gradient. Jordan et al. (1998); Wibisono (2018); Salim et al. (2020) have shown that this proximal algorithm converges to the target distribution $\rho$. As for the choice of functional $\mathcal{F}$, it can be any convex functional that decreases as we descend towards the target measure $\rho$. WaveFit (Koizumi et al., 2022) shows that using a much stronger GAN-type objective as the functional yields good results.
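For readers more familiar with Euclidean optimization, eq.(28) is the exact analogue of the ordinary proximal step, with the squared Euclidean distance replaced by the squared Wasserstein-2 distance (a standard observation, stated here only for intuition):

$\text{prox}_{\gamma F}(x)=\underset{y}{\text{argmin}}\;F(y)+\frac{1}{2\gamma}\|y-x\|^{2} \qquad\longleftrightarrow\qquad \mu_{\tau^{\prime}}=\underset{\nu\in\mathcal{P}_{2}(\mathcal{X})}{\text{argmin}}\;\mathcal{F}(\nu)+\frac{1}{2\gamma}W_{2}^{2}(\nu,\mu).$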

7 Challenges with Faster Sampling


Figure 2: Qualitative comparison of Celeb-A samples generated by the score-based model and our model at different numbers of sampling steps (N). Our model maintains reasonable quality even when N decreases to 20.

Once the connection between gradient flow and score-based generative models is established, we can interpret generation as a process walking on the gradient-flow-path. If we follow the flow path with small steps, we can reliably reach the initial data distribution, as demonstrated by the success of score-based and diffusion models. However, this is no longer effective if we increase the step size: the score-based increment is a linear approximation and therefore accrues more error as the step size grows. It has been experimentally observed that samples get poorer as we increase the step size in score-based models (Dhariwal & Nichol, 2021; Ho et al., 2020; Song et al., 2021). From a geometric point of view, we are taking Wasserstein gradient steps using a forward-backward strategy. While this strategy works well when the step size is small, it converges to a biased measure for large step sizes. The bias associated with the forward-backward strategy for large step sizes has been studied in the context of Wasserstein gradient flow (Wibisono, 2018). In our case, this issue is further exacerbated by the fact that the functional $-\text{KL}(\mu\,\|\,\pi)$ we are minimizing is actually concave in $\mu$.


Figure 3: During generation, samples move tangentially to the gradient-flow-path in the reverse direction. If the step size is large, this incurs error, which we mitigate by projecting back onto the gradient-flow-path.

To mitigate this issue, we propose an intuitive, geometric idea: projection. As shown in Fig. 3, when we sample from score-based models with a large step size, the error grows and the trajectory deviates from the gradient-flow-path. We propose to resolve this problem by projecting back onto the gradient-flow-path before taking the next step.

8 Projection to Gradient-flow-path

Algorithm 1 Predict-Project Sampling
1:  Inputs: $N$, $\delta\mu$, $s_{\theta}^{*}$, $T_{\theta}^{*}$
2:  $x\sim\mathcal{N}(0,I)$, $\delta\tau=1/N$
3:  for $\tau=0$ to $1-1/N$ do
4:     $x^{pred}=x+\big(2\beta_{\tau}s_{\theta}^{*}(x,\tau)-\beta_{\tau}\nabla\log\pi(x)\big)\delta\tau$
5:     $x^{proj}=x^{pred}+T_{\theta}^{*}(x^{pred},\tau+\delta\tau)\cdot\sqrt{\beta_{\tau}}\cdot\delta\mu\cdot\delta\tau$
6:     $z\sim\mathcal{N}(0,I)$
7:     $x=x^{proj}+\sqrt{2\beta_{\tau}}\,z$
8:  end for
9:  RETURN $x$
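A minimal sketch of Algorithm 1 is given below, assuming $\pi=\mathcal{N}(0,I)$ (so $\nabla\log\pi(x)=-x$) and trained networks `score_fn` $\approx s_{\theta}^{*}$ and `proj_fn` $\approx T_{\theta}^{*}$, which are not defined here. The diffusion step is written in Euler-Maruyama form with an explicit $\sqrt{\delta\tau}$ factor, which we assume is otherwise absorbed into $\beta_{\tau}$ in the pseudo-code.

```python
import numpy as np

def predict_project_sample(score_fn, proj_fn, beta, N, delta_mu, dim, rng):
    """Sketch of Algorithm 1 (Predict-Project sampling)."""
    dt = 1.0 / N
    x = rng.normal(size=dim)                                    # x ~ N(0, I)
    for k in range(N):
        tau = k * dt
        b = beta(tau)
        # Predict: one score-based step, eq.(29), with grad log pi(x) = -x.
        x_pred = x + (2 * b * score_fn(x, tau) + b * x) * dt
        # Project: move along the estimated Wasserstein gradient, eq.(36).
        x_proj = x_pred + proj_fn(x_pred, tau + dt) * np.sqrt(b) * delta_mu * dt
        # Diffuse: add the Brownian increment.
        x = x_proj + np.sqrt(2 * b * dt) * rng.normal(size=dim)
    return x
```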

The score-based generative model first trains a score model $s_{\theta}$ such that $s_{\theta}^{*}(x_{\tau},\tau)\approx\nabla\log\mu_{\tau}(x_{\tau})$ using a score-matching strategy. Once the score model is trained, the discretized Euler-Maruyama step (eq.(9)) is used to generate samples, where the score function $s_{\theta}^{*}$ replaces $\nabla\log\mu_{\tau}(x_{\tau})$:

$x_{\tau+\delta\tau}=x_{\tau}+\big(2\beta_{\tau}s_{\theta}^{*}(x_{\tau},\tau)-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)\,\delta\tau+\sqrt{2\beta_{\tau}}\,z=x_{\tau+\delta\tau}^{pred}+\sqrt{2\beta_{\tau}}\,z_{\tau} \qquad (29)$

This can also be interpreted as predict and diffuse steps, where the predict step is $x_{\tau+\delta\tau}^{pred}=x_{\tau}+\big(2\beta_{\tau}s_{\theta}^{*}(x_{\tau},\tau)-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)\delta\tau$. Since generation tries to trace the gradient-flow-path, after each predict-diffuse step we should obtain samples from the measure $\mu_{\tau+\delta\tau}$ on the gradient-flow-path. Because of discretization error and bias, these samples do not lie on the gradient-flow-path; in fact they deviate away from it. To pull these samples towards the measure $\mu_{\tau+\delta\tau}$ on the gradient-flow-path, we use the fact that we have a way to sample from the measures on the gradient-flow-path. Using the SDE, we can write the closed-form conditional distribution of $\mu_{\tau+\delta\tau}$ as follows:

$\mu_{\tau+\delta\tau|T}(x_{\tau+\delta\tau}|x_{T})=\mathcal{N}\big(x_{\tau+\delta\tau};\,x_{\tau+\delta\tau}^{mean}(x_{\tau=T}),\,2\beta_{\tau+\delta\tau}I\big)$

Sampling from this conditional distribution is given by the following equation:

$x_{\tau+\delta\tau}=x_{\tau+\delta\tau}^{mean}(x_{\tau=T})+\sqrt{2\beta_{\tau+\delta\tau}}\,z_{\tau} \qquad (30)$

Comparing eq.(30) with eq.(29), we note that pulling $x_{\tau+\delta\tau}^{pred}$ close to $x_{\tau+\delta\tau}^{mean}$ may be enough to pull the samples $x_{\tau+\delta\tau}$ in eq.(29) towards the gradient-flow-path, assuming that $\beta_{\tau+\delta\tau}$ is close to $\beta_{\tau}$. In terms of measures, we consider the measure associated with the samples $x_{\tau+\delta\tau}^{pred}$ and with the target samples $x_{\tau+\delta\tau}^{mean}$: let $\mu_{\tau+\delta\tau}^{pred}$ denote the measure corresponding to the samples $x_{\tau+\delta\tau}^{pred}$ and $\mu_{\tau+\delta\tau}^{mean}$ the measure corresponding to the means $x_{\tau+\delta\tau}^{mean}$. We can sample from $\mu_{\tau+\delta\tau}^{mean}$ by first sampling $x_{\tau=T}\sim\mu_{\tau=T}$ and passing it through $x_{\tau+\delta\tau}^{mean}(\cdot)$. Our strategy to project the samples in eq.(29) onto the gradient-flow-path is to project the predicted measure $\mu_{\tau+\delta\tau}^{pred}$ onto the mean measure $\mu_{\tau+\delta\tau}^{mean}$. We achieve this through Wasserstein gradient descent in the space of probability measures. For that, we need an efficient way to estimate the Wasserstein gradient, which we describe in the next subsection.

8.1 Efficient Estimation of Wasserstein Gradient

Suppose we want to estimate the Wasserstein gradient of a functional $\mathcal{J}(\mu)$, i.e., $\nabla_{W_{2}}\mathcal{J}(\mu)$. We can use the following Taylor expansion:

$\mathcal{J}\big((I+hT)_{\#}\mu\big)=\mathcal{J}(\mu)+h\langle\nabla_{W_{2}}\mathcal{J}(\mu),T\rangle_{\mu}+o(h) \qquad (31)$

where $\nabla_{W_{2}}\mathcal{J}(\mu)\in L^{2}(\mu)$ is the Wasserstein gradient of $\mathcal{J}$ at $\mu$. To estimate the Wasserstein gradient, consider the following optimization problem:

$T^{*}=\underset{T}{\text{argmin}}\;\langle\nabla_{W_{2}}\mathcal{J}(\mu),T\rangle_{\mu}+\frac{1}{2h}\|T\|^{2}_{\mu} \qquad (32)$

It is easy to see that $T^{*}=-h\nabla_{W_{2}}\mathcal{J}(\mu)$ is the solution of this problem. Plugging this into eq.(31) and parameterizing the map $T$ by a neural network with parameters $\theta$, we solve the following optimization problem:

$\underset{\theta}{\min}\;\mathcal{J}\big((I+T_{\theta})_{\#}\mu\big)+\frac{1}{2h}\|T_{\theta}\|^{2}_{\mu} \qquad (33)$

This optimization is efficient and amenable to parallel processing because: 1) it only requires samples from the measure $\mu$, and 2) we can use minibatches from $\mu$ to update the neural network parameters. This removes the need to obtain all samples at once, leading to stochastic gradient descent optimization of $\theta$.
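A minibatch training sketch of eq.(33) is given below in PyTorch. The network `T_theta`, the sampler `sample_mu`, and the functional evaluator `J_of_pushforward` (which must estimate $\mathcal{J}((I+T_{\theta})_{\#}\mu)$ from a transported minibatch) are assumptions for illustration, not the paper's implementation.

```python
import torch

def estimate_wasserstein_gradient(sample_mu, J_of_pushforward, T_theta, h,
                                  n_iters=1_000, batch_size=256, lr=1e-4):
    """Fit T_theta so that T_theta ~ -h * grad_{W2} J(mu), per eq.(32)-(33),
    using only minibatches of samples x ~ mu."""
    opt = torch.optim.Adam(T_theta.parameters(), lr=lr)
    for _ in range(n_iters):
        x = sample_mu(batch_size)              # minibatch x ~ mu
        t = T_theta(x)                         # candidate displacement T_theta(x)
        loss = J_of_pushforward(x + t) + (0.5 / h) * (t ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return T_theta                             # now T_theta ~ -h * grad_{W2} J(mu)
```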

8.2 Predict-Project Algorithm

Table 1: Comparison of FID score (lower is better) as a measure of generated image quality between the score-based model and our generative model at different numbers of sampling steps N.

Dataset   Method            N = 1000   N = 100   N = 40    N = 20
Celeb-A   Score Model          6.331     35.14   149.42    222.71
          Predict-Project          –     20.54    68.23    121.12
LSUN      Score Model         15.12      34.62   122.23    246.17
          Predict-Project          –     25.35    66.61    164.32
SVHN      Score Model         18.95     146.63   183.30    285.61
          Predict-Project          –    149.56   174.34    152.94


Figure 4: Comparison of generated samples on three datasets when the number of sampling steps is decreased to as low as N = 40.

With the Wasserstein gradient estimation method in hand, we now project $\mu_{\tau+\delta\tau}^{pred}$ onto $\mu_{\tau+\delta\tau}^{mean}$. We define the functional $\mathcal{J}$ in the following way:

$\mathcal{J}_{\tau+\delta\tau}\big((I+T_{\theta,\tau+\delta\tau})_{\#}\mu_{\tau+\delta\tau}^{pred}\big)=\int\|x_{\tau+\delta\tau}^{pred}+T_{\theta}(x_{\tau+\delta\tau}^{pred})-x_{\tau+\delta\tau}^{mean}\|^{2}\,d\mu_{\tau+\delta\tau}^{pred}(x_{\tau+\delta\tau}^{pred}) \qquad (34)$

Note that $T_{\theta,\tau+\delta\tau}$ is indexed by time. Instead of learning a different $T_{\theta}$ for each time, we condition it on time as $T_{\theta}(\cdot,\tau)$, as is done for the score function (Song et al., 2021). Similarly, we choose $h=1/\sqrt{\beta_{\tau}}$ to be different for each $\tau$ in eq.(33). To train the projection function $T_{\theta}$, we sample $\tau$ from the uniform distribution on the interval $(0,1]$ and $x_{\tau=T}\sim\mu_{\tau=T}$, and solve the following optimization problem:

$\underset{\theta}{\min}\;E_{\tau,x_{\tau=T}}\Big[\|x_{\tau+\delta\tau}^{pred}+T_{\theta}(x_{\tau+\delta\tau}^{pred},\tau+\delta\tau)-x_{\tau+\delta\tau}^{mean}\|^{2}+\frac{\sqrt{\beta_{\tau}}}{2}\|T_{\theta}(x_{\tau+\delta\tau}^{pred},\tau+\delta\tau)\|^{2}\Big]$

After training, we have $\sqrt{\beta_{\tau}}\,T_{\theta}^{*}=-\nabla_{W_{2}}\mathcal{J}(\mu^{pred})$. Using this relation, we update the sample as

$x^{proj}=x^{pred}-\nabla_{W_{2}}\mathcal{J}(\mu^{pred})(x^{pred})\cdot\delta\mu\cdot\delta\tau \qquad (35)$
$=x^{pred}+\sqrt{\beta_{\tau}}\,T_{\theta}^{*}(x^{pred},\tau)\cdot\delta\mu\cdot\delta\tau \qquad (36)$

where $\delta\mu$ is a small scalar controlling how far we move in the direction of the Wasserstein gradient, and $\delta\tau$ is present because a Wasserstein gradient flow with velocity field $v_{\tau}$ corresponds to the sample dynamics $\dot{x}_{\tau}=v_{\tau}(x_{\tau})$ (see eq.(5)). See Algorithm 1 for the full sampling algorithm.
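For completeness, a sketch of the projection-network loss above (with $h=1/\sqrt{\beta_{\tau}}$) is given below; `x_pred`, `x_mean`, and the time-conditioned network `T_theta` are assumed to be produced per minibatch from the same $x_{\tau=T}$, and all names are illustrative.

```python
import torch

def projection_loss(T_theta, x_pred, x_mean, tau, beta_tau):
    """Minibatch estimate of the training objective above: fit T_theta so that
    x_pred + T_theta(x_pred, tau) matches x_mean, with an L2 penalty of weight
    sqrt(beta_tau)/2 on the displacement (i.e. h = 1/sqrt(beta_tau) in eq.(33))."""
    t = T_theta(x_pred, tau)
    fit = ((x_pred + t - x_mean) ** 2).sum(dim=-1).mean()
    reg = 0.5 * (beta_tau ** 0.5) * (t ** 2).sum(dim=-1).mean()
    return fit + reg
```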

9 Experimental Results

To demonstrate the efficacy of our algorithm, we train and generate samples on three datasets: 1) Celeb-A, 2) LSUN-church, and 3) SVHN, where all images are of size $64\times 64$. Our neural network architecture for both the score model and the projection model uses the standard U-Net architecture with attention due to Dhariwal & Nichol (2021). We use the publicly available code from Song et al. (2021) as the score-based model. In these experiments, we demonstrate that as we decrease the number of sampling steps, the quality of samples from the score-based generative model degrades, while our method maintains reasonable quality even when the number of sampling steps is reduced to as low as 20. We use the FID metric (Heusel et al., 2017) to measure sample quality.

In Table 1, we compare the FID scores of images generated by the score-based method and our Predict-Project method. We outperform the score-based method by a large margin in all cases except SVHN (N=100). This underperformance could be because our model is not trained to convergence on SVHN (see appendix) due to lack of time. For qualitative comparisons, please see Fig. 4 and Fig. 2. These results support our claim that projecting onto the gradient-flow-path improves sample quality, especially when the number of sampling steps is low.

10 Conclusion

We presented a novel geometric perspective on score-based generative models (also called diffusion generative models) by showing that they are in fact gradient flows in the space of probability measures. The geometric insight gained from this connection helped us answer and clarify some critical open questions. We also demonstrated that it can help us design faster sampling algorithms. We believe this connection will help diffuse knowledge between the fields of Wasserstein gradient flow and score-based generative models, inspiring solutions to problems in both areas. Similarly, the connections with energy-based models, proximal algorithms, and the reverse SDE could help design better algorithms in general and generative models in particular. Energy-based models, for example, can be combined with score-based models in the light of this geometric understanding.

References

  • Ambrosio et al. (2005) Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
  • Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Chen et al. (2022) Chen, T., Liu, G.-H., and Theodorou, E. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nioAdKCEdXB.
  • De Bortoli et al. (2021) De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  8780–8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.
  • Du et al. (2021) Du, Y., Li, S., Sharma, Y., Tenenbaum, J., and Mordatch, I. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34:15608–15620, 2021.
  • Gao et al. (2020) Gao, R., Nijkamp, E., Kingma, D. P., Xu, Z., Dai, A. M., and Wu, Y. N. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7518–7528, 2020.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
  • Huang et al. (2021) Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34:22863–22876, 2021.
  • Jordan et al. (1998) Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • Kingma et al. (2021) Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • Koizumi et al. (2022) Koizumi, Y., Yatabe, K., Zen, H., and Bacchiani, M. Wavefit: An iterative and non-autoregressive neural vocoder based on fixed-point iteration. arXiv preprint arXiv:2210.01029, 2022.
  • Korba et al. (2020) Korba, A., Salim, A., Arbel, M., Luise, G., and Gretton, A. A non-asymptotic analysis for stein variational gradient descent. Advances in Neural Information Processing Systems, 33:4672–4682, 2020.
  • Lu et al. (2022) Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
  • Meng et al. (2022) Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  • Ramesh et al. (2022) Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  • Saharia et al. (2022) Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • Salim et al. (2020) Salim, A., Korba, A., and Luise, G. The wasserstein proximal gradient algorithm. Advances in Neural Information Processing Systems, 33:12356–12366, 2020.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  • Song et al. (2021) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
  • Villani (2003) Villani, C. Topics in optimal transportation, volume 58 of Graduate Studies in Mathematics. American Mathematical Society, 2003.
  • Villani (2009) Villani, C. Optimal transport: old and new, volume 338. Springer, 2009.
  • Wibisono (2018) Wibisono, A. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pp.  2093–3027. PMLR, 2018.
  • Xie et al. (2016) Xie, J., Lu, Y., Zhu, S.-C., and Wu, Y. A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644. PMLR, 2016.

Appendix A Proof of Theorems

A.1 Forward Diffusion as Gradient Flow

Theorem 3.

Consider the accelerated gradient flow in eq.(10) with initial measure $\mu_{0}=\rho$, target measure $\mu_{T}=\pi$, and the functional on the Wasserstein space defined by $\mathcal{F}(\mu)=\text{KL}(\mu\,\|\,\pi)$. The family of measures corresponding to this gradient flow is equivalent to the family of measures corresponding to the forward Fokker-Planck equation in eq.(11), provided that $f$ and $\beta_{t}$ take the following form: $f=\beta_{t}\nabla\log\pi$, $\beta_{t}=\frac{g_{t}^{2}}{2}$.

Proof.

The Fokker-Planck equation in eq.(11) is

$\frac{\partial\mu_{t}}{\partial t}=-\mathrm{div}(\mu_{t}f)+\frac{1}{2}\mathrm{div}\big(\nabla(g_{t}^{2}\mu_{t})\big) \qquad (37)$

We can rewrite this equation as:

$\frac{\partial\mu_{t}}{\partial t}=-\mathrm{div}(\mu_{t}f)+\mathrm{div}\big(\mu_{t}\tfrac{1}{2}g_{t}^{2}\nabla\log\mu_{t}\big) \qquad (38)$
$=-\mathrm{div}\big(\mu_{t}(f-\tfrac{1}{2}g_{t}^{2}\nabla\log\mu_{t})\big) \qquad (39)$

On the other hand, the accelerated Wasserstein gradient flow is given by eq.(10):

$\frac{\partial\mu_{t}}{\partial t}=\mathrm{div}\big[\mu_{t}\,\beta_{t}\nabla\log\big(\tfrac{\mu_{t}}{\pi}\big)\big] \qquad (40)$
$=-\mathrm{div}\big(\mu_{t}\,(\beta_{t}\nabla\log\pi-\beta_{t}\nabla\log\mu_{t})\big) \qquad (41)$

Comparing eq.(39) and eq.(41), it is clear that the accelerated Wasserstein gradient flow is the same as the forward stochastic differential equation if we choose:

$\beta_{t}=\frac{1}{2}g_{t}^{2} \qquad (42)$
$f=\beta_{t}\nabla\log\pi \qquad (43)$

Remark 3.1.

Consider the special case with final measure $\mu_{T}=\pi=\mathcal{N}(0,I)=\exp(-\frac{\|x\|^{2}}{2})/Z$. We get $f=-\beta_{t}x$, and the forward diffusion equation is given by the following SDE:

$dx=-\beta_{t}x\,dt+\sqrt{2\beta_{t}}\,dw \qquad (44)$

which is exactly the forward flow of the DDPM model (Ho et al., 2020; Song et al., 2021).

A.2 Generation as Reverse Gradient Flow

Theorem 4.

The reverse SDE in eq.(8) is equivalent to the Wasserstein gradient flow in the space of probability measures with respect to the functional $\mathcal{F}(\mu)=-\text{KL}(\mu\,\|\,\pi)$, starting from the initial measure $\mu_{\tau=0}=\pi$ towards the target measure $\mu_{\tau=T}=\rho$.

Proof.
$\mathcal{F}(\mu)=-\text{KL}(\mu\,\|\,\pi) \qquad (45)$
$=\int-\log\mu\,d\mu+\int\log\pi\,d\mu \qquad (46)$
$=\underbrace{\int(-2\log\mu+\log\pi)\,d\mu}_{\mathcal{G}}+\underbrace{\int\log\mu\,d\mu}_{\mathcal{H}} \qquad (47)$

Here, we apply the forward-backward splitting scheme due to Wibisono (2018) and Salim et al. (2020):

$\nu_{\tau}=\big(I-\beta_{\tau}\delta\tau\,\nabla_{W_{2}}\mathcal{G}(\mu_{\tau})\big)_{\#}\mu_{\tau} \qquad (48)$
$\mu_{\tau+\delta\tau}=\text{JKO}_{\beta_{\tau}\delta\tau\,\mathcal{H}}(\nu_{\tau}) \qquad (49)$

where $\nabla_{W_{2}}\mathcal{G}(\mu)$ is the Wasserstein gradient. We can compute its expression using the following relation:

$\nabla_{W_{2}}\mathcal{G}=\nabla\mathcal{G}^{\prime}(\mu) \qquad (50)$

First, the derivative is given by

$\mathcal{G}^{\prime}(\mu)=-2\log\mu+\log\pi \qquad (51)$

Therefore,

$\nabla_{W_{2}}\mathcal{G}(\mu)=\nabla\mathcal{G}^{\prime}(\mu)=\nabla(-2\log\mu+\log\pi) \qquad (52)$

In eq.(48), we move along the negative Wasserstein gradient of $\mathcal{G}$. Let $x_{\tau}\sim\mu_{\tau}$ be samples from the distribution $\mu_{\tau}$. Transforming the differential equation in measure space to sample space, as in eq.(5), yields:

$y_{\tau}=x_{\tau}-\beta_{\tau}\nabla\big(-2\log\mu_{\tau}(x_{\tau})+\log\pi(x_{\tau})\big)\delta\tau \qquad (53)$

In eq.(49), we use the JKO operator as the solution operator for the negative entropy functional $\mathcal{H}$, where the JKO operator is defined as:

$\text{JKO}_{\beta\mathcal{H}}(\nu)=\underset{\zeta\in\mathcal{P}_{2}(\mathcal{X})}{\text{argmin}}\;\mathcal{H}(\zeta)+\frac{1}{2\beta}W_{2}^{2}(\zeta,\nu)$

For the negative entropy functional, the exact solution is given by Brownian motion (Jordan et al., 1998; Wibisono, 2018; Salim et al., 2020). Letting $y_{\tau}\sim\nu_{\tau}$, we obtain

$y_{\tau}=x_{\tau}-\beta_{\tau}\nabla\big(-2\log\mu_{\tau}(x_{\tau})+\log\pi(x_{\tau})\big)\delta\tau \qquad (54)$
$x_{\tau+\delta\tau}=y_{\tau}+\sqrt{2\beta_{\tau}\delta\tau}\,z_{\tau} \qquad (55)$

Combining both, we obtain

$x_{\tau+\delta\tau}=x_{\tau}+\big(2\beta_{\tau}\nabla\log\mu_{\tau}(x_{\tau})-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)\delta\tau+\sqrt{2\beta_{\tau}\delta\tau}\,z_{\tau} \qquad (56)$

In the limiting case as $\delta\tau\to 0$, we obtain

$dx=\big(2\beta_{\tau}\nabla\log\mu_{\tau}(x_{\tau})-\beta_{\tau}\nabla\log\pi(x_{\tau})\big)d\tau+\sqrt{2\beta_{\tau}}\,dw$

which coincides exactly with the reverse SDE in eq.(8) for $g_{\tau}^{2}=2\beta_{\tau}$ and $f=\beta_{\tau}\nabla\log\pi$. ∎

Appendix B Experimental Details

We jointly train the score model $s_{\theta}$ and the projection model $T_{\theta}$. They share the same U-Net-with-attention architecture following Dhariwal & Nichol (2021). We apply minibatch optimization to train both the score model and the projection model, which keeps the computational burden low. In terms of parameters, since we have an additional projection model, the parameter count is twice that of a regular score model.

We train the Celeb-A model up to 450K iterations and LSUN up to 300K iterations with a batch size of 32, and report the FID scores. We had to terminate SVHN training early, at 50K iterations (batch size 32), due to lack of time. We will continue to train this model and update the score later if we get the chance.

B.1 Hyperparameter δμ\delta\mu

While projecting measures onto the gradient-flow-path, we scale the Wasserstein gradient towards the path by a factor $\delta\mu$. This factor intuitively represents how much error is likely to be present in the prediction step of the score model. The error is large for $N=20$ and small for $N=100$, so we choose $\delta\mu\in[0,1]$ to be large for $N=20$ and small for $N=100$. At the moment, it is a hyperparameter, which we tune keeping in mind that it should correspond to the level of error in the score model prediction. In future work, we will estimate this hyperparameter from the training loss $\|T_{\theta}\|_{\mu}^{2}$. The current best hyperparameters for $\delta\mu$ are:

$N=20,\; \delta\mu=1$
$N=40,\; \delta\mu=0.4$
$N=100,\; \delta\mu=0.1$

We are cleaning up the code and will make it publicly available.