
1 State Key Laboratory for Novel Software Technology, Nanjing University   2 Alibaba Group   3 Shanghai AI Lab
https://subpath-linear-approx-model.github.io/

Accelerating Image Generation with
Sub-Path Linear Approximation Model

Chen Xu (1,2), Tianhui Song (1,2), Weixin Feng (2), Xubin Li (2), Tiezheng Ge (2), Bo Zheng (2), Limin Wang (1,3)
Abstract

Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks. However, their applications in practical scenarios are hindered by slow inference speed. Drawing inspiration from consistency models, we propose the Sub-Path Linear Approximation Model (SPLAM), which can accelerate diffusion models while maintaining high-quality image generation. SPLAM treats the PF-ODE trajectory as a series of PF-ODE sub-paths divided by sampled points, and harnesses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual PF-ODE sub-path. The optimization on such SL-ODEs allows SPLAM to construct denoising mappings with smaller cumulative approximated error. An efficient distillation method is also developed to facilitate the incorporation of pre-trained diffusion models, such as latent diffusion models. Extensive experimental results demonstrate that SPLAM achieves remarkable training efficiency, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2- to 4-step generation. Comprehensive evaluations on the LAION, MS COCO 2014, and MS COCO 2017 datasets also illustrate that SPLAM surpasses the existing acceleration methods in few-step generation tasks, achieving state-of-the-art performance both on FID and on the quality of the generated images.

Keywords:
Diffusion Models · Accelerating Diffusion Models · Diffusion Model Distillation · Consistency Models
Equal contributions. Interns at Alibaba Group. Corresponding author.

1 Introduction

Diffusion models, also known as score-based generative models, have emerged as a potent paradigm in generative computer vision, enabling the synthesis of highly realistic images by progressively refining random noise into structured visual content [9, 46, 29, 32, 45]. Despite their impressive ability, one of the primary challenges associated with diffusion models lies in their computational intensity, often requiring hundreds of iteration steps to produce a single image. This has spurred a surge of research focused on accelerating diffusion models to retain high-quality outputs while significantly reducing the computation cost during the inference phase [42, 23, 24, 36, 50, 49, 21, 44, 26, 20].

Within the spectrum of acceleration techniques, consistency models [44, 26] have garnered attention as they forge a consistent denoising mapping across points on Probability Flow (PF) ODE trajectories. This learning strategy endows consistency models with a notable consistency property and allows the overall prediction error to be estimated as a summation of incremental errors, each computed as the difference between the predicted results of adjacent trajectory points. In this paper, we recognize that the approximation of denoising mappings by consistency models is essentially a minimization process targeting the endpoints of sub-paths along ODE trajectories. We observe that the approximation performance is currently limited by the accumulation of errors that arise either from an overabundance of approximation operations, or from the heightened challenge of optimizing individual sub-path errors as the skipping step size expands.

To address these challenges, we propose a novel approach in this paper, designated as the Sub-Path Linear Approximation Model (SPLAM). SPLAM adheres to the foundational concept of cumulative approximation of PF-ODE trajectories but innovates through its sustained learning from Sub-Path Linear (SL) ODEs. Specifically, we dissect the sub-path learning objective based on the noise prediction design [9, 14] into two interrelated aspects, and establish the SL-ODEs to give respective progressive or continuous estimation for each component, by a carefully designed linear interpolation between the endpoints of sub-paths. We then utilize the SL-ODEs to approximate the complete PF-ODE trajectories which allows a more nuanced optimization. Consequently, the prediction error of our approach is assessed through iterative solutions of all SL-ODEs, enabling a reduction of cumulative errors and an enhancement in image generation quality. Furthermore, we also develop an efficient distillation procedure for our SPLAM which enables the incorporation with pre-trained latent diffusion models [34] (e.g., Stable Diffusion). Our contributions can be summarized as below:

  1. 1.

    We identify that the optimization process for consistency models essentially minimizes the cumulative approximated error along PF-ODE sub-path endpoints, and observe that the performance of such approximations is hindered by the proliferating number of approximations or by the amplified difficulty of optimizing single sub-path errors as the skipping step size increases.

  2. 2.

    To address these challenges, we propose a novel approach, the Sub-Path Linear Approximation Model (SPLAM). SPLAM employs Sub-Path Linear (SL) ODEs to continuously approximate the complete PF-ODE trajectories and progressively optimize the sub-path learning objectives, which constructs the denoising mappings with smaller cumulative approximated errors.

  3. 3.

    Leveraging the proposed SPLAM and SL-ODE framework, we put forth an efficient distillation method. When integrated with powerful pre-trained models like Stable Diffusion, our approach allows more efficient training and attains FIDs of 10.09, 10.06, and 20.77 on the LAION, MS COCO 2014, and MS COCO 2017 datasets, respectively, achieving better performance than all previous acceleration approaches at comparable inference latency.

Figure 1: Our Sub-Path Linear Approximation Model employs Sub-Path Linear ODEs to approximate the sub-paths on the PF-ODE trajectories, each determined by the linear interpolation of the corresponding endpoints. SPLAM is then trained based on the consistent mapping along SL-ODEs to minimize the approximated errors.

2 Related Work

Diffusion Models [9, 1, 46, 40, 14, 30, 34] have solidified their status as a cornerstone in the realm of generative models, outshining previous approaches in creating rich and detailed images. Song et al. [46] model this process from a continuous-time perspective with a stochastic differential equation (SDE), which iteratively denoises an initial noise distribution by leveraging the learned score of the data distribution to steer the process towards data points [9, 45, 46]. This reverse diffusion process has proven particularly adept at capturing the intricate structures and variations inherent in complex datasets. They also demonstrate that there exists an ordinary differential equation (ODE), dubbed the Probability Flow (PF) ODE, which shares the marginal probability densities with the reverse-time SDE and thus yields a deterministic sampling trajectory [14, 46]. In contrast to other generative models like VAEs [15, 41] and GANs [6], diffusion models demonstrate remarkable robustness in training and excel in producing samples with substantial diversity and high fidelity, thereby offering a robust solution for modeling complex distributions in an ever-expanding array of generative tasks.

Accelerating Diffusion Models. While diffusion models have demonstrated their superiority in generating high-quality samples, the generation speed remains a major hindrance because hundreds to thousands of sampling steps are required, which poses difficulties for practical and efficient applications. To address this issue, a surge of advancements has emerged aiming to accelerate the inference process. Some works concentrate on designing training-free fast diffusion samplers [23, 24, 46, 14, 19, 2, 55, 12], cutting down the steps from one thousand to a modest 20-50. In the realm of distillation [8], efforts have been undertaken [36, 55, 25, 28, 3, 7, 53] to condense the inference steps of pre-trained diffusion models to fewer than 10. Progressive distillation (PD) [36] amortizes the integration of the PF-ODE into a new sampler that takes half as many sampling steps, displaying efficacy with as few as 2/4 steps. Consistency models [44, 26, 27, 43], as a nascent class of models, offer the promise of high-quality one-step generation by mapping any point along the PF-ODE trajectory back to the origin. Representing flow-based approaches [21, 20, 18, 47], InstaFlow [21, 20] proposes a reflow technique to straighten the trajectories of probability flows and refine the coupling between noises and images, which achieves a one-step SD model. Concurrently, some strategies explore the inclusion of GAN-like objectives into diffusion models to afford fast generative capabilities [49, 37, 17, 50]. DMD [50] additionally proposes a distribution matching method that enables one-step high-quality image generation.

3 Preliminaries

Diffusion Models are a class of generative models that gradually transform data into a noisy state through Gaussian perturbations and subsequently learn to reverse this process to reconstruct the original data by progressively denoising it. Denote $\bm{x}_0$ as the data sampled from the original distribution $\bm{x}_0\sim p_{data}(\bm{x})$ and $\alpha(t),\sigma(t)$ as functions that define a noise schedule. Diffusion models transition the data to a noise-corrupted marginal distribution, which can be expressed as:

$p_t(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\bm{x}_t\mid\alpha(t)\bm{x}_0,\sigma(t)^2 I)$,   (1)

for any time step $t\in[0,T]$.
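As a concrete illustration of Eq. 1, the sketch below samples a noisy $\bm{x}_t$ from a clean $\bm{x}_0$; it is a minimal PyTorch example that assumes a toy cosine VP schedule, not the exact schedule used by Stable Diffusion.

```python
import math
import torch

def vp_schedule(t: torch.Tensor):
    """Toy cosine VP schedule with alpha(t)^2 + sigma(t)^2 = 1, t in [0, 1].
    (An illustrative assumption; in practice the teacher model's schedule is used.)"""
    alpha = torch.cos(0.5 * math.pi * t)
    sigma = torch.sin(0.5 * math.pi * t)
    return alpha, sigma

def perturb(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ N(alpha(t) x0, sigma(t)^2 I) as in Eq. 1."""
    alpha, sigma = vp_schedule(t)
    noise = torch.randn_like(x0)
    # broadcast the per-sample coefficients over the remaining dimensions
    while alpha.dim() < x0.dim():
        alpha, sigma = alpha.unsqueeze(-1), sigma.unsqueeze(-1)
    return alpha * x0 + sigma * noise, noise

x0 = torch.randn(4, 3, 64, 64)   # stand-in batch of clean samples
t = torch.rand(4)                # continuous time steps in [0, 1]
xt, eps = perturb(x0, t)
```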

Song et al. [46] describe the diffusion process using a stochastic differential equation (SDE):

$d\bm{x}_t=\bm{f}(\bm{x}_t,t)\,dt+g(t)\,d\bm{w}_t$,   (2)

where $\bm{f}(\cdot,\cdot)$ and $g(\cdot)$ denote the drift and diffusion coefficients, respectively, and $\bm{w}_t$ signifies the standard Brownian motion at time $t$. They also derive an ordinary differential equation (ODE) corresponding to this SDE, which defines the trajectories of solutions sampled at time $t$ according to $p_t(\bm{x}_t)$:

$d\bm{x}_t=\left[\bm{f}(\bm{x}_t,t)-\frac{1}{2}g(t)^2\nabla_{\bm{x}}\log p_t(\bm{x}_t)\right]dt$,   (3)

referred to as the Probability Flow (PF) ODE. In the reverse denoising process, models learn a score function $\bm{s}_{\bm{\theta}}(\bm{x}_t,t)\approx\nabla\log p_t(\bm{x}_t)$, adhering to the PF-ODE; therefore, diffusion models are also recognized as score-based generative models. Built on the diffusion process, latent diffusion models (LDMs) additionally employ a VAE encoder $\mathcal{E}(\cdot)$ and decoder $\mathcal{D}(\cdot)$ to compress the image $\bm{x}$ into latent space as $\bm{z}=\mathcal{E}(\bm{x})$ and reconstruct it by the decoder $\hat{\bm{x}}=\mathcal{D}(\bm{z})$, implementing the diffusion process on the compressed vector $\bm{z}$ in latent space [34]. With the latent diffusion process, pre-trained large-scale LDMs like Stable Diffusion (SD) models can achieve more precise PF-ODE solutions and thus generate high-quality images.
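For intuition, a naive Euler discretization of the PF-ODE in Eq. 3 might look like the following sketch, assuming VP-SDE coefficients $\bm{f}(\bm{x},t)=-\tfrac{1}{2}\beta(t)\bm{x}$ and $g(t)^2=\beta(t)$, and a placeholder `score_fn` standing in for $\bm{s}_{\bm{\theta}}$; actual samplers such as DDIM or DPM-Solver are considerably more sophisticated.

```python
import torch

def pf_ode_euler(score_fn, x_T, betas):
    """Integrate dx = [f(x,t) - 0.5 g(t)^2 score(x,t)] dt backwards from t=T to t=0
    with a plain Euler scheme. `betas[t]` plays the role of beta(t) on a discrete
    grid (an illustrative assumption; real solvers are higher order)."""
    x = x_T
    for t in reversed(range(len(betas))):
        beta_t = betas[t]
        drift = -0.5 * beta_t * x                              # f(x, t) for the VP-SDE
        score = score_fn(x, torch.full((x.shape[0],), t))
        dx_dt = drift - 0.5 * beta_t * score                   # PF-ODE right-hand side
        x = x - dx_dt                                          # Euler step with dt = -1
    return x

# usage sketch: a dummy score function and a linear beta schedule
score_fn = lambda x, t: -x                                     # placeholder score
betas = torch.linspace(1e-4, 2e-2, 1000)
samples = pf_ode_euler(score_fn, torch.randn(2, 3, 32, 32), betas)
```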

Consistency Models have been proposed by Song et al. [44] as a novel paradigm within the family of generative models. Considering a solution trajectory of the PF-ODE $\{(\bm{x}_t,t)\}_{t\in[\epsilon,T]}$, consistency models comply with a consistency function that projects every pair $(\bm{x}_t,t)$ along the trajectory back to the starting point, $\bm{F}(\bm{x}_t,t)\mapsto\bm{x}_\epsilon$ for any $t\in[\epsilon,T]$, to obtain a one-step generator. Here, $\epsilon$ represents a small positive constant, thereby making $\bm{x}_\epsilon$ a viable surrogate for $\bm{x}_0$. An important characteristic of consistency models is the self-consistency property:

$\bm{F}(\bm{x}_t,t)=\bm{F}(\bm{x}_{t'},t'),\quad\forall\, t,t'\in[\epsilon,T]$,   (4)

which is leveraged as the training constraint for consistency models, whether distilling knowledge from a pre-trained model or training from scratch. The model is parameterized as follows:

$\bm{F}_{\bm{\theta}}(\bm{x}_t,t)=c_{\mathrm{skip}}(t)\bm{x}_t+c_{\mathrm{out}}(t)\bm{f}_{\bm{\theta}}(\bm{x}_t,t)$,   (5)

where $c_{\mathrm{skip}}(t)$ and $c_{\mathrm{out}}(t)$ are differentiable functions ensuring that $c_{\mathrm{skip}}(\epsilon)=1$ and $c_{\mathrm{out}}(\epsilon)=0$, guaranteeing that $\bm{F}_{\bm{\theta}}(\bm{x}_\epsilon,\epsilon)\equiv\bm{x}_\epsilon$, and $\bm{f}_{\bm{\theta}}(\bm{x}_t,t)$ is a deep neural network. For the distillation approach, called Consistency Distillation, the training objective is formulated as:

$\mathcal{L}_{CD}(\bm{\theta},\bm{\theta}^-;\phi)=\mathbb{E}\left[d\!\left(\bm{F}_{\bm{\theta}}(\bm{x}_{t_{n+1}},t_{n+1}),\,\bm{F}_{\bm{\theta}^-}(\hat{\bm{x}}^{\Phi}_{t_n},t_n)\right)\right]$,   (6)

where $\hat{\bm{x}}^{\Phi}_{t_n}=\bm{x}_{t_{n+1}}+(t_n-t_{n+1})\Phi(\bm{x}_{t_{n+1}},t_{n+1};\phi)$ serves as a one-step estimation of $\bm{x}_{t_n}$ from $\bm{x}_{t_{n+1}}$, with $\Phi(\cdots;\phi)$ the update function of a one-step ODE solver and $d(\cdot,\cdot)$ a chosen distance metric. Consistency models also utilize the EMA strategy to stabilize training, where $\bm{\theta}^-$ is the running average of $\bm{\theta}$. Latent Consistency Models (LCMs) [26] introduce consistency models into the distillation of latent diffusion models. To accelerate the training of consistency models, LCM employs a skipping step size $k$ to enforce consistency between the current timestep and the timestep $k$ steps away. With a conditional input $c$ and a guidance scale $w$ to achieve the CFG strategy [10], the modified learning objective for latent consistency distillation is formulated as:

$\mathcal{L}_{LCD}(\bm{\theta},\bm{\theta}^-;\phi)=\mathbb{E}\left[d\!\left(\bm{F}_{\bm{\theta}}(\bm{x}_{t_{n+k}},w,c,t_{n+k}),\,\bm{F}_{\bm{\theta}^-}(\hat{\bm{x}}^{\Phi}_{t_n},w,c,t_n)\right)\right]$.   (7)
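To make the objective of Eq. 6 concrete, the following is a minimal sketch of how the consistency-distillation loss might be computed for one batch, assuming placeholder `student`, `ema_student`, and `solver_step` callables (none of these names come from the paper; LCM additionally conditions on the prompt $c$ and guidance scale $w$ as in Eq. 7).

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, ema_student, solver_step,
                                  x_tnp1, t_np1, t_n):
    """One-batch sketch of Eq. 6.
    student(x, t)      -> F_theta(x, t), the parameterized consistency function
    ema_student(x, t)  -> F_{theta^-}(x, t), the EMA copy of the student
    solver_step(x, t_from, t_to) -> one-step ODE-solver estimate of x_{t_to}
    All three are assumed interfaces, not a specific library API."""
    with torch.no_grad():
        x_tn_hat = solver_step(x_tnp1, t_np1, t_n)   # teacher/solver estimate of x_{t_n}
        target = ema_student(x_tn_hat, t_n)          # F_{theta^-}(x_hat_{t_n}, t_n)
    pred = student(x_tnp1, t_np1)                    # F_theta(x_{t_{n+1}}, t_{n+1})
    return F.mse_loss(pred, target)                  # d(.,.) chosen as L2 here
```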

4 Methodology

4.1 Approximation Strategy for Denoiser

One-step Denoiser Parameterization. To synthesize an image from a sampled input $\bm{x}_t$ at a large time step $t$ in one step, a natural approach is to adopt the strategy from [9] that employs a neural network $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)$ to predict a standard Gaussian noise, which implements the denoising mapping parameterized as $\bm{f}_{\bm{\theta}}(\bm{x}_t,t)=\frac{\bm{x}_t-\sigma(t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)}{\alpha(t)}$. By redefining the target distribution for $(\bm{x}_t,t)$ as $\bm{x}^t_0=\alpha(t)\bm{x}_0\sim p_{data,t}(\alpha(t)\bm{x})$ and letting $\bm{D}_{\bm{\theta}}(\bm{x}_t,t)=\alpha(t)\,\bm{f}_{\bm{\theta}}(\bm{x}_t,t)=\bm{x}_t-\sigma(t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)$, this predictive formulation can be recast into the canonical denoiser function defined in [14], which aims to minimize the denoising error:

$\mathcal{L}_{\bm{D}}(\bm{\theta})=\mathbb{E}_{\bm{x}^t_0\sim p_{data,t},\,\bm{x}_t\sim\mathcal{N}(\bm{x}^t_0,\sigma(t)^2 I)}\left[\left|\bm{D}_{\bm{\theta}}(\bm{x}_t,t)-\alpha(t)\bm{x}_0\right|\right]$,   (8)

where $|\cdot|$ is an estimation of the error vector (e.g., an L2 distance). However, Eq. 8 is hard to optimize in practice. For instance, when $\alpha(t)$ decreases over the time step $t$, which implies $\alpha(t)\bm{x}_0\to\mathbf{0}$, the training is likely to collapse and the denoiser learns to output zeros in general.
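The parameterization above can be written compactly as follows; this is a hedged sketch with a placeholder `eps_net` standing in for $\bm{\epsilon}_{\bm{\theta}}$, simply illustrating the algebra of $\bm{D}_{\bm{\theta}}$ and $\bm{f}_{\bm{\theta}}$ rather than the paper's exact training code.

```python
def denoiser_D(eps_net, x_t, t, sigma_t):
    """D_theta(x_t, t) = x_t - sigma(t) * eps_theta(x_t, t)  (canonical denoiser)."""
    return x_t - sigma_t * eps_net(x_t, t)

def one_step_f(eps_net, x_t, t, alpha_t, sigma_t):
    """f_theta(x_t, t) = D_theta(x_t, t) / alpha(t): the one-step prediction of x_0.
    As alpha(t) -> 0 at large t, the target alpha(t) * x_0 in Eq. 8 shrinks towards
    zero, which is the optimization collapse discussed above."""
    return denoiser_D(eps_net, x_t, t, sigma_t) / alpha_t
```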

Approximation Strategy in Consistency Models. We observe that consistency models [44, 26] provide a solution to the aforementioned issue by leveraging the consistency property. Presuming that we have already obtained a good prediction result $\bm{f}_{\bm{\theta}}(\bm{x}_{t-k},t-k)\approx\bm{x}_0$ from a time step $t-k$, i.e., $k$ steps ahead of $t$, this property yields an approximated error estimation of Eq. 8 as:

$\mathbb{E}\left[\left|\bm{D}_{\bm{\theta}}(\bm{x}_t,t)-\alpha(t)\bm{f}_{\bm{\theta}}(\bm{x}_{t-k},t-k)\right|\right]$.   (9)

By incorporating the expressions for $\bm{f}_{\bm{\theta}}(\bm{x}_{t-k},t-k)$ and $\bm{D}_{\bm{\theta}}(\bm{x}_t,t)$, we derive the approximated error estimation based on $\bm{\epsilon}_{\bm{\theta}}(\cdot,\cdot)$ as:

$\mathcal{L}_{\text{Approx}}(\bm{\theta})=\mathbb{E}\left[\left|\bm{x}_t-\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t-k},t-k)-\sigma(t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)\right|\right]$,   (10)

where the aforementioned impact on optimization is reduced, as the coefficient is amplified by $\alpha(t-k)$. When $k$ is limited to 1, the error between the mapping result $\bm{f}_{\bm{\theta}}(\bm{x}_t,t)$ and the trajectory origin $\bm{x}_0$ can be quantified by the accumulation of incremental approximated errors [44]: $|\bm{x}_0-\bm{f}_{\bm{\theta}}(\bm{x}_t,t)|\leq\sum_{1\leq t'\leq t}|\bm{f}_{\bm{\theta}}(\bm{x}_{t'},t')-\bm{f}_{\bm{\theta}}(\bm{x}_{t'-1},t'-1)|$. Ideally, if the error of one single approximation can be bounded, we can reduce the cumulative error by decreasing the number of approximations. This technique, also called Skipping-Step in LCM [26], extends to optimizing the error between skipping sampled points on the trajectories, $|\bm{f}_{\bm{\theta}}(\bm{x}_{t'},t')-\bm{f}_{\bm{\theta}}(\bm{x}_{t'-k},t'-k)|$, for a fixed skipping step size $k$. However, our insights reveal that this precondition does not hold in extended situations. Denoting $\{\bm{x}_{t'}\}_{t'\in[t-k,t]}$ as the sub-path between $\bm{x}_{t-k}$ and $\bm{x}_t$ on the original PF-ODE trajectory, we discern that the learning objective in Eq. 10 for $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)$ can be decomposed into two complementary components: 1) $dist_{\Delta}(\bm{x}_{t-k},\bm{x}_t,t)=\bm{x}_t-\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}$, which gauges the incremental distance from $\bm{x}_{t-k}$ to $\bm{x}_t$ attributable to the drift and diffusion processes, and 2) $dist_{0,\bm{\theta}}(\bm{x}_{t-k},t-k,t)=\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t-k},t-k)$, which captures the denoising contribution from previous time steps that should be coherently propagated to the subsequent time step $t$. Thus we rewrite Eq. 10 as a sub-path learning objective:

$\mathcal{L}_{\text{Sub-p}}(\bm{\theta},k)=\mathbb{E}\left[\left|dist_{\Delta}(\bm{x}_t,\bm{x}_{t-k},t)+dist_{0,\bm{\theta}}(\bm{x}_{t-k},t-k,t)-\sigma(t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)\right|\right]$.   (11)

In Eq. 11, the learning of $dist_{\Delta}$ equates to modeling the denoising distribution $p(\bm{x}_{t-k}|\bm{x}_t)$, which deviates from a Gaussian for larger skipping step sizes and has been found intractable to estimate [48, 23, 24, 49, 14]. Consequently, the approximated error escalates uncontrollably as $k$ increases due to the reliance on this flawed learning. Although LCM sets an empirical $k$ of 20 to balance the pros and cons, the fundamental issue remains unaddressed and unexplored.
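To make the decomposition concrete, the sketch below computes the two terms of Eq. 11 for paired points $(\bm{x}_{t-k},\bm{x}_t)$; it is again an illustrative sketch with placeholder schedule callables and an `eps_net` stand-in for $\bm{\epsilon}_{\bm{\theta}}$.

```python
def subpath_terms(eps_net, x_t, x_tmk, t, tmk, alpha, sigma):
    """Return (dist_Delta, dist_0) from Eq. 11.
    alpha(.) and sigma(.) are callables giving the noise schedule."""
    ratio = alpha(t) / alpha(tmk)                       # alpha(t) / alpha(t-k)
    dist_delta = x_t - ratio * x_tmk                    # drift + diffusion increment
    dist_0 = ratio * sigma(tmk) * eps_net(x_tmk, tmk)   # propagated denoising term
    return dist_delta, dist_0

def subpath_residual(eps_net, x_t, x_tmk, t, tmk, alpha, sigma):
    """Residual inside |.| of Eq. 11; its expected magnitude is L_Sub-p."""
    dist_delta, dist_0 = subpath_terms(eps_net, x_t, x_tmk, t, tmk, alpha, sigma)
    return dist_delta + dist_0 - sigma(t) * eps_net(x_t, t)
```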

4.2 Sub-Path Linear Approximation Model

To improve the learning objective in Eq. 11, in this paper we introduce a new approach for accelerating diffusion models, termed the Sub-Path Linear Approximation Model (SPLAM). SPLAM introduces Sub-Path Linear (SL) ODEs to approximate the sub-paths on the PF-ODE trajectories as a linear interpolation between the corresponding sub-path endpoints. As the optimization based on such SL-ODEs gives a progressive and a continuous estimation, respectively, for the two decomposed terms in Eq. 11, our SPLAM is trained with the constructed SL-ODE learning objectives and achieves smaller overall prediction errors and better generation quality. We also develop an efficient distillation procedure for latent diffusion models [34], with a Multiple Estimation strategy that improves the estimated results of teacher models.

4.2.1 Sub-Path Linear ODE

Based on the above analysis, in this paper we introduce Sub-Path Linear (SL) ODEs to model approximated sub-paths of the original PF-ODE trajectories, which gives a progressive estimation for $dist_{\Delta}$. For a sampled sub-path $\{\bm{x}_{t'}\}_{t'\in[t-k,t]}$ on a solution trajectory dictated by Eq. 3, we interpolate a linear path from $(\bm{x}_{t-k},t-k)$ to $(\bm{x}_t,t)$, guided by the vector direction of $dist_{\Delta}(\bm{x}_t,\bm{x}_{t-k},t)$. To distinguish the impacts of $dist_{\Delta}$ and $dist_{0,\bm{\theta}}$, we account for the drift component in the linearly approximated path, causing a shift in the coefficient from $(\bm{x}_{t-k},t-k)$ to $(\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k},t-k)$. The points on the approximated path $\{\bm{x}_{\gamma,t}\}_{\gamma\in[0,1]}$ are thus computed as:

$\bm{x}_{\gamma,t}=\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\gamma\, dist_{\Delta}(\bm{x}_t,\bm{x}_{t-k},t)=(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\gamma\bm{x}_t$,   (12)

for a sampled pair $(\bm{x}_{t-k},t-k)$ and $(\bm{x}_t,t)$.
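In code, the interpolation of Eq. 12 reduces to a convex combination after the drift shift; below is a minimal sketch with toy tensors and schedule values (illustrative assumptions, not the actual SD schedule).

```python
import torch

def sl_ode_point(x_t, x_tmk, gamma, alpha_t, alpha_tmk):
    """Eq. 12: x_{gamma,t} = (1-gamma) * (alpha(t)/alpha(t-k)) * x_{t-k} + gamma * x_t.
    gamma in [0, 1]; gamma=0 gives the drift-shifted earlier endpoint,
    gamma=1 recovers x_t exactly."""
    ratio = alpha_t / alpha_tmk
    return (1.0 - gamma) * ratio * x_tmk + gamma * x_t

# usage sketch with toy tensors and schedule values
x_t, x_tmk = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
x_mid = sl_ode_point(x_t, x_tmk, gamma=0.5, alpha_t=0.60, alpha_tmk=0.75)
```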

Since $\bm{x}_t$ and $\bm{x}_{t-k}$ conform to distributions governed by the PF-ODE, our linear transformation effectively defines a linear ODE over $\gamma$ from the distribution $\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}\sim p_{t-k,k}(\bm{x}_{t-k})$ to $\bm{x}_t\sim p_t(\bm{x}_t)$, where $p_{t,k}(\bm{x}_t)$ has the property $p_{t,k}(\bm{x}_t|\bm{x}_0)=\mathcal{N}\!\left(\alpha(t+k)\bm{x}_0,\left[\frac{\alpha(t+k)\sigma(t)}{\alpha(t)}\right]^2 I\right)$:

$d\bm{x}_{\gamma,t}=\left[\gamma\, dist_{\Delta}(\bm{x}_t,\bm{x}_{t-k},t)\right]d\gamma$.   (13)

We denote this as the Sub-Path Linear (SL) ODE. To apply the approximation strategy to the SL-ODE, the denoiser and generation functions, with $\bm{x}_t$ replaced by $\bm{x}_{\gamma,t}$, are given by:

$\bm{D}_{\bm{\theta}}(\bm{x}_{\gamma,t},\gamma,t)=\bm{x}_{\gamma,t}-\sigma(\gamma,t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{\gamma,t},\gamma,t),\qquad \bm{f}_{\bm{\theta}}(\bm{x}_{\gamma,t},\gamma,t)=\frac{\bm{D}_{\bm{\theta}}(\bm{x}_{\gamma,t},\gamma,t)}{\alpha(t)}$.   (14)

Incorporating these into Eq. 11, we derive the sub-path learning objective for our SPLAM model as:

$\mathcal{L}_{\text{SPLAM}}(\bm{\theta},k)=\mathbb{E}\left[\left|\gamma\, dist_{\Delta}(\bm{x}_t,\bm{x}_{t-k},t)+dist_{0,\bm{\theta}}(\bm{x}_{t-k},t-k,t)-\sigma(\gamma,t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{\gamma,t},\gamma,t)\right|\right]$,   (15)

which gives a progressive estimation for the otherwise intractable $dist_{\Delta}$ objective. The value of $\sigma(\gamma,t)$ can be precisely derived from the distributions $p_t(\bm{x}_t)$ and $p_{t-k}(\bm{x}_{t-k})$ but has a complex expression; empirically we use the approximation $\sigma(\gamma,t)=(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)+\gamma\,\sigma(t)$. Compared to consistency models, which adopt Eq. 10 or Eq. 11, our $\mathcal{L}_{\text{SPLAM}}$ maintains a progressive estimation for $dist_{\Delta}$ and a consistent estimation for $dist_{0,\bm{\theta}}$, which enables learning with a large skipping step size. The overall prediction error can still be assessed by the aggregate of approximated errors between the sub-path endpoints, and the approximated error between these points is continuously optimized through the SL-ODEs. Consequently, the optimization of the approximated errors in our SPLAM can be significantly improved. Our approach further benefits from an increased skipping step size, allowing it to generate images of higher quality with fewer sampling steps and more efficient training.
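Putting Eqs. 12, 14, and 15 together, the residual minimized by SPLAM at a sampled $\gamma$ can be sketched as below; `eps_net` and the scalar schedule values are placeholders, and the real model is additionally conditioned on the prompt $c$ (see Sec. 4.2.2), with the target branch produced by the EMA model during distillation.

```python
def sigma_gamma(gamma, alpha_t, alpha_tmk, sigma_t, sigma_tmk):
    """Empirical interpolation used in the paper:
    sigma(gamma,t) = (1-gamma) * (alpha(t)/alpha(t-k)) * sigma(t-k) + gamma * sigma(t)."""
    return (1.0 - gamma) * (alpha_t / alpha_tmk) * sigma_tmk + gamma * sigma_t

def splam_residual(eps_net, x_t, x_tmk, gamma, t, tmk,
                   alpha_t, alpha_tmk, sigma_t, sigma_tmk):
    """Residual inside |.| of Eq. 15 at a single (gamma, t).
    eps_net(x, gamma, t) is an assumed interface for the gamma-conditioned network."""
    ratio = alpha_t / alpha_tmk
    dist_delta = x_t - ratio * x_tmk                        # as in Eq. 11
    dist_0 = ratio * sigma_tmk * eps_net(x_tmk, 1.0, tmk)   # endpoint uses gamma = 1
    x_gamma = (1.0 - gamma) * ratio * x_tmk + gamma * x_t   # Eq. 12
    sig = sigma_gamma(gamma, alpha_t, alpha_tmk, sigma_t, sigma_tmk)
    return gamma * dist_delta + dist_0 - sig * eps_net(x_gamma, gamma, t)
```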

4.2.2 Sub-Path Linear Approximation Distillation

In this paper, we adopt pre-trained Stable Diffusion (SD) models [34] to obtain the solution PF-ODE trajectories upon which we build our SL-ODEs, and we call the approach Sub-Path Linear Approximation Distillation (SPLAD). To achieve conditional generation with a conditional input $c$, the noise prediction model is parameterized as $\bm{\epsilon}_{\bm{\theta}}(\bm{z}_t,c,t)$ [46, 23]. We also introduce $\gamma$ into the prediction model for solving our SL-ODEs, and leverage $\gamma$-conditioned training where $\gamma$ is converted to Fourier embeddings and fed into the model as an input. Specifically, to predict $\bm{z}_0$ in the latent space, the generation function for SPLAM is defined as:

$\bm{F}_{\bm{\theta}}(\bm{z}_{\gamma,t},c,\gamma,t)=c_{\text{skip}}(t)\bm{z}_{\gamma,t}+c_{\text{out}}(t)\bm{f}_{\bm{\theta}}(\bm{z}_{\gamma,t},c,\gamma,t)$,   (16)

where $\bm{f}_{\bm{\theta}}(\bm{z}_{\gamma,t},c,\gamma,t)$ mirrors Eq. 14, replacing $\bm{\epsilon}_{\bm{\theta}}(\bm{z}_{\gamma,t},\gamma,t)$ with the conditional form $\bm{\epsilon}_{\bm{\theta}}(\bm{z}_{\gamma,t},c,\gamma,t)$. The functions $c_{\text{skip}}$ and $c_{\text{out}}$ ensure that $\bm{F}_{\bm{\theta}}(\bm{z}_{1,0},c,1,0)\equiv\bm{z}_0$ (we regard $\bm{F}_{\bm{\theta}}$ as equivalent to $\bm{f}_{\bm{\theta}}$ since $c_{\text{skip}}(t)\ll c_{\text{out}}(t)$ for most time steps). Integrating this with Eq. 9, our SPLAD approach minimizes the following objective:

$\mathcal{L}_{\text{SPLAD}}(\bm{\theta},\bm{\theta}^-;\phi)=\mathbb{E}_{\bm{z}_0\sim p_{data},\,t\sim\mathcal{U}[k,T],\,\gamma\sim\mathcal{U}[0,1]}\left[\left|\bm{F}_{\bm{\theta}}(\bm{z}_{\gamma,t},c,\gamma,t)-\bm{F}_{\bm{\theta}^-}(\hat{\bm{z}}^{\Phi}_{1,t-k},c,1,t-k)\right|\right]$,   (17)

where $\mathcal{U}$ denotes the uniform distribution and $k$ is a pre-determined skipping step size. The $\alpha(t)$ in Eq. 9 is omitted due to its negligible effect on optimization in practice. The term $\hat{\bm{z}}^{\Phi}_{1,t-k}=\hat{\bm{z}}^{\Phi}_{t-k}$ is estimated using an ODE solver $\Phi(\cdots;\phi)$ derived from the teacher model; in this paper we choose DDIM [42] among the advanced solvers for LDMs. Moreover, to improve the estimation of $\hat{\bm{z}}^{\Phi}_{t-k}$, we apply Multiple Estimation, which executes the solver $\Phi(\cdots;\phi)$ multiple times with a reduced skipping step size $k_\phi$. Denoting $t_{\phi,i}=t-i\,k_\phi$ and initializing $\hat{\bm{z}}^{\Phi}_{t_{\phi,0}}=\bm{z}_t$, the multiple estimation is iteratively executed as:

$\hat{\bm{z}}^{\Phi}_{t_{\phi,i+1}}=\hat{\bm{z}}^{\Phi}_{t_{\phi,i}}+w\,\Phi(\hat{\bm{z}}^{\Phi}_{t_{\phi,i}},t_{\phi,i},t_{\phi,i+1},c;\phi)+(1-w)\,\Phi(\hat{\bm{z}}^{\Phi}_{t_{\phi,i}},t_{\phi,i},t_{\phi,i+1},\emptyset;\phi)$,   (18)

for $i=0,1,2,\dots$ with $i\leq\frac{k}{k_\phi}-1$, where $\emptyset$ denotes no conditional input and $w$ is a fixed guidance scale which controls the effect of conditional generation [10] from the conditional input $c$.

Algorithm 1 Sub-Path Linear Approximation Distillation (SPLAD)
Input: dataset $\mathcal{D}$, initial model parameters $\bm{\theta}$, learning rate $\eta$, EMA decay rate $\mu$, ODE solver $\Phi(\cdot,\cdot;\phi)$, distance estimation $|\cdot|$, a fixed guidance scale $w$, skipping step size $k$, VAE encoder $\mathcal{E}(\cdot)$, noise schedule $\alpha(t),\sigma(t)$
$\bm{\theta}^-\leftarrow\bm{\theta}$
repeat
    sample $(x,c)\sim\mathcal{D}$, $t\sim\mathcal{U}[k,T]$ and $\gamma\sim\mathcal{U}[0,1]$
    convert $x$ into latent space: $z=\mathcal{E}(x)$
    sample $\bm{z}_t\sim\mathcal{N}(\alpha(t)z,\sigma(t)^2 I)$
    $\hat{\bm{z}}^{\Phi}_{t_{\phi,0}}\leftarrow\bm{z}_t$, $i\leftarrow 0$
    repeat
         $\hat{\bm{z}}^{\Phi}_{t_{\phi,i+1}}\leftarrow\hat{\bm{z}}^{\Phi}_{t_{\phi,i}}+w\,\Phi(\hat{\bm{z}}^{\Phi}_{t_{\phi,i}},t_{\phi,i},t_{\phi,i+1},c;\phi)+(1-w)\,\Phi(\hat{\bm{z}}^{\Phi}_{t_{\phi,i}},t_{\phi,i},t_{\phi,i+1},\emptyset;\phi)$
         $i\leftarrow i+1$
    until $k=i\,k_\phi$
    $\bm{z}_{\gamma,t}\leftarrow(1-\gamma)\,\frac{\alpha(t)}{\alpha(t-k)}\hat{\bm{z}}^{\Phi}_{t-k}+\gamma\,\bm{z}_t$ ▷ Sample a point on the SL-ODE.
    $\mathcal{L}(\bm{\theta},\bm{\theta}^-;\phi)\leftarrow|\bm{F}_{\bm{\theta}}(\bm{z}_{\gamma,t},c,\gamma,t)-\bm{F}_{\bm{\theta}^-}(\hat{\bm{z}}^{\Phi}_{1,t-k},c,1,t-k)|$
    $\bm{\theta}\leftarrow\bm{\theta}-\eta\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta},\bm{\theta}^-;\phi)$
    $\bm{\theta}^-\leftarrow\mathrm{stopgrad}(\mu\bm{\theta}^-+(1-\mu)\bm{\theta})$
until convergence

The pseudo-code for SPLAD is presented in Algorithm 1. SPLAD shares a similar training pipeline with consistency models [44, 26], but is distinguished in that it optimizes the sub-path learning objectives based on the SL-ODEs and utilizes $\gamma$-conditioned training. For a pair of input noise and time step $(\bm{z}_t,t)$, SPLAM gives the prediction of the denoised latent $\hat{\bm{z}}_0$ as:

$\hat{\bm{z}}_0=\bm{F}_{\bm{\theta}^-}(\bm{z}_{1,t},c,1,t)$,   (19)

for one-step generation, adhering strictly to the $\gamma=1$ condition. We also use the same iterative sampling strategy as illustrated in [44], which improves the quality of the generated images. In practice, we set the $\gamma$-embedding to $\mathbf{0}$ for $\gamma=1$, allowing the weights associated with the trained $\gamma$-embeddings to be discarded after training. Thus our Sub-Path Linear Approximation Model (SPLAM) requires no additional parameters beyond the training phase and can be used in exactly the same way as the teacher model.
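A few-step sampling loop in the style of [44] could then be sketched as follows; `splam` and `perturb` are hypothetical interfaces for this illustration, while the released models plug into the standard Stable Diffusion pipeline instead.

```python
import torch

@torch.no_grad()
def splam_sample(splam, perturb, prompt_emb, timesteps, latent_shape):
    """Few-step sampling: predict z0 with gamma=1, then re-noise to the next
    (smaller) timestep and repeat, following the iterative strategy of [44].
    splam(z, c, gamma, t) -> predicted z0 (Eq. 19); perturb(z0, t) -> z_t.
    Both callables are placeholders for this sketch."""
    z = torch.randn(latent_shape)                  # start from pure noise at t = T
    for i, t in enumerate(timesteps):              # e.g. 4 decreasing timesteps
        z0_hat = splam(z, prompt_emb, 1.0, t)      # one network evaluation per step
        if i + 1 < len(timesteps):
            z = perturb(z0_hat, timesteps[i + 1])  # re-noise for the next iteration
        else:
            z = z0_hat
    return z                                       # decode with the VAE afterwards
```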

5 Experiments

In this section, we conduct experiments to examine the performance of our proposed Sub-Path Linear Approximation Model (SPLAM). First, we describe the experimental configuration and implementation details, and evaluate our models comprehensively on the text-to-image task (Sec. 5.1). Second, we verify the effectiveness of our algorithm design through detailed ablation studies (Sec. 5.2). Finally, we present the qualitative results of our SPLAM (Sec. 5.3).

5.1 Text-to-Image Generation

5.1.1 Experimental Configuration

On the text-to-image generation task, we train two models with pre-trained Stable Diffusion-V1.5 (SDv1.5) and Stable Diffusion-V2.1-base (SDv2.1-base) as teacher models, respectively. Following the setting of [26], the training dataset is a subset of LAION-5B [39]: LAION-Aesthetics-6+. We choose the DDIM solver as the ODE solver $\Phi$ with solver skipping step $k_\phi=20$.

For evaluation, we adopt the commonly used FID and CLIP Score metrics. The results are reported on both the SDv1.5 and SDv2.1-base backbones, thus verifying the generalizability of our method. For the experiment of distilling SDv2.1-base, we benchmark our model on two test sets: LAION-Aesthetics-6+ as used in LCM [26], and MSCOCO2014-30k for zero-shot generalization. We also reproduce an SDv2.1-base LCM according to the training configuration outlined in [26] while replacing the $w$-condition with a fixed guidance scale, which also improves its performance. We generally set the guidance scale for distilling SDv2.1-base to 8 and the skipping step size to 20, consistent with [26]. For the experiment of distilling SDv1.5, we compare our model with state-of-the-art generative models including foundation diffusion models, GANs, and accelerated diffusion models. The guidance scale is set to 3 to obtain the optimal FID, and we adopt the Huber loss [43] for our SPLAD metric. The skipping step size is set to 100 for SPLAM, which shows fast convergence. We examine our method on two commonly used benchmarks, MSCOCO2014-30k and MSCOCO2017-5k. More implementation details are provided in the supplementary materials.

Table 1: Quantitative results for SDv2.1-base with $w=8$. The results of DDIM, DPM, DPM++ and LCM on the LAION test set are derived from [26]. LCM (fix $w$) is our reproduction conducted as stated in the paper. The results on COCO-30k are evaluated by us.
Methods | LAION-Aesthetics-6+ FID (↓), 1/2/4 steps | LAION-Aesthetics-6+ CLIP-Score (↑), 1/2/4 steps | COCO-30k FID (↓), 1/2/4 steps | COCO-30k CLIP-Score (↑), 1/2/4 steps
DDIM [42] | 183.29 / 81.05 / 22.38 | 6.03 / 14.13 / 25.89 | 431.26 / 229.44 / 32.77 | 2.88 / 7.72 / 28.76
DPM Solver [23] | 185.78 / 72.81 / 18.53 | 6.35 / 15.10 / 26.64 | 206.37 / 73.87 / 22.04 | 10.56 / 22.87 / 31.18
DPM Solver++ [24] | 185.78 / 72.81 / 18.43 | 6.35 / 15.10 / 26.64 | 206.35 / 73.82 / 22.11 | 10.57 / 22.87 / 31.16
LCM [26] | 35.36 / 13.31 / 11.10 | 24.14 / 27.83 / 28.69 | - / - / - | - / - / -
LCM (fix w) [26] | 32.41 / 12.17 / 10.43 | 26.99 / 30.13 / 30.76 | 43.87 / 15.71 / 14.88 | 27.66 / 31.07 / 31.52
SPLAM | 32.64 / 12.06 / 10.09 | 27.13 / 30.18 / 30.76 | 40.52 / 14.59 / 13.81 | 27.83 / 31.00 / 31.45

5.1.2 Main Results

Table 2: Quantitative results for SDv1.5. Baseline numbers are cited from [50] and [49]. All LCM results are from our reproduction, whose performance is aligned with that stated in the original paper; the remaining baselines are evaluated by us using the released models.
(a) Results on MSCOCO2014-30k, $w=3$.
Family | Methods | Latency (↓) | FID (↓)
Unaccelerated | DALL-E [33] | - | 27.5
Unaccelerated | DALL-E2 [32] | - | 10.39
Unaccelerated | Parti-750M [51] | - | 10.71
Unaccelerated | Parti-3B [51] | 6.4s | 8.10
Unaccelerated | Parti-20B [51] | - | 7.23
Unaccelerated | Make-A-Scene [5] | 25.0s | 11.84
Unaccelerated | Muse-3B [4] | 1.3s | 7.88
Unaccelerated | GLIDE [29] | 15.0s | 12.24
Unaccelerated | LDM [34] | 3.7s | 12.63
Unaccelerated | Imagen [35] | 9.1s | 7.27
Unaccelerated | eDiff-I [1] | 32.0s | 6.95
GANs | LAFITE [54] | 0.02s | 26.94
GANs | StyleGAN-T [38] | 0.10s | 13.90
GANs | GigaGAN [13] | 0.13s | 9.09
Accelerated Diffusion | DPM++ (4step) [24] | 0.26s | 22.36
Accelerated Diffusion | UniPC (4step) [52] | 0.26s | 19.57
Accelerated Diffusion | LCM-LoRA (4step) [27] | 0.19s | 23.62
Accelerated Diffusion | InstaFlow-0.9B [21] | 0.09s | 13.10
Accelerated Diffusion | InstaFlow-1.7B [21] | 0.12s | 11.83
Accelerated Diffusion | UFOGen [49] | 0.09s | 12.78
Accelerated Diffusion | DMD [50] | 0.09s | 11.49
Accelerated Diffusion | LCM (2step) [26] | 0.12s | 14.29
Accelerated Diffusion | SPLAM (2step) | 0.12s | 12.31
Accelerated Diffusion | LCM (4step) [26] | 0.19s | 10.68
Accelerated Diffusion | SPLAM (4step) | 0.19s | 10.06
Teacher | SDv1.5 [34] | 2.59s | 8.03
(b) Results on MSCOCO2017-5k, $w=3$.
Methods | #Steps | Latency (↓) | FID (↓)
DPM Solver++ [24] | 4 | 0.21s | 35.0
DPM Solver++ [24] | 8 | 0.34s | 21.0
Progressive Distillation [36] | 1 | 0.09s | 37.2
Progressive Distillation [36] | 2 | 0.13s | 26.0
Progressive Distillation [36] | 4 | 0.21s | 26.4
CFG-Aware Distillation [16] | 8 | 0.34s | 24.2
InstaFlow-0.9B [21] | 1 | 0.09s | 23.4
InstaFlow-1.7B [21] | 1 | 0.12s | 22.4
UFOGen [49] | 1 | 0.09s | 22.5
LCM [26] | 2 | 0.12s | 25.22
LCM [26] | 4 | 0.19s | 21.41
SPLAM | 2 | 0.12s | 23.07
SPLAM | 4 | 0.19s | 20.77
(c) Results on MSCOCO2014-30k, $w=8$.
Family | Methods | Latency (↓) | FID (↓)
Accelerated Diffusion | DPM++ (4step) [24] | 0.26s | 22.44
Accelerated Diffusion | UniPC (4step) [52] | 0.26s | 23.30
Accelerated Diffusion | LCM-LoRA (4step) [27] | 0.19s | 23.62
Accelerated Diffusion | DMD [50] | 0.09s | 14.93
Accelerated Diffusion | LCM (2step) [26] | 0.12s | 15.56
Accelerated Diffusion | SPLAM (2step) | 0.12s | 14.50
Accelerated Diffusion | LCM (4step) [26] | 0.19s | 14.53
Accelerated Diffusion | SPLAM (4step) | 0.19s | 13.39
Teacher | SDv1.5 [34] | 2.59s | 13.05

The results for SDv2.1-base are presented in Tab. 1, where we use DDIM [42], DPM [23], DPM++ [24] and LCM [26] as baselines. Our SPLAM surpasses the baseline methods across nearly all settings on both test sets, at each step count, and on both the FID and CLIP Score metrics. We attribute the close results on LAION to overfitting, since the test set and training set are sourced from the same data collection. For SDv1.5 under the guidance scale $w=3$, the quantitative results are shown in Tab. 2(a) and Tab. 2(b). Our model with 4 steps attains an FID-30k of 10.06 and an FID-5k of 20.77, outperforming all other accelerated diffusion models, including the flow-based method InstaFlow [21] and techniques that introduce GAN objectives such as UFOGen [49] and DMD [50]. Furthermore, SPLAM shows results commensurate with state-of-the-art foundation generative models such as DALL-E2 [32]. Even with two steps, SPLAM achieves a competitive FID-30k of 12.31 against parallel algorithms. In practical scenarios, a higher guidance scale $w$ is typically favored to enhance image quality. Accordingly, we train our SPLAM with $w$ set to 8 and benchmark it against a range of advanced diffusion methodologies, as delineated in Tab. 2(c). In this regime, SPLAM also demonstrates significant advantages, achieving state-of-the-art performance with a four-step FID-30k of 13.39, which exceeds the other models by a large margin and is close to the teacher model. Notably, the FID-30k of our model with only two steps reaches 14.50, surpassing four-step LCM and DMD, while DMD training consumes over one hundred A100 GPU days, more than 16 times our training duration.

5.2 Ablation Study

Figure 2: (a) Ablations on skipping step size and the skipping mechanism; ME denotes our Multiple Estimation strategy. (b) Training curves comparing LCM and SPLAM; our SPLAM with step size 100 is trained with ME, which brings faster convergence. (c) Estimation of the error δ between the consistency mapping values of two adjacent points along the PF-ODE; SPLAM consistently outperforms LCM in terms of this error.

Figure 3: (a) Visualization of different guidance scales $w$ on SPLAM. (b) The trade-off curve when applying different guidance scales; $w$ increases over $\{3.0, 5.0, 8.0, 12.0\}$.

Skipping Step Size & Training Cost. Fig. 2(a) ablates the skipping step size during training, where we compare SPLAM with or without the Multiple Estimation strategy (Sec. 4.2.2) and LCM. We observe that: 1) Without multiple estimation, as the skipping step size $k$ increases, LCM suffers a drastic decline in performance due to the heightened optimization challenge of sub-path learning; by leveraging our proposed Sub-Path Linear ODE, SPLAM can progressively learn $dist_{\Delta}$ and effectively alleviate this collapse. 2) Equipped with the Multiple Estimation strategy, SPLAM stably maintains high image fidelity at large step sizes. Moreover, we compare the convergence trends of our method and LCM during training, as depicted in Fig. 2(b). When $k=20$, although our metrics initially converge more slowly during the early stages, the performance of our method gradually surpasses LCM by a large margin. This indicates that our training strategy provides a more effective learning objective, enabling SPLAM to achieve a better result, while LCM quickly overfits. As $k$ is raised to 100, the larger skipping step size brings SPLAM faster convergence, needing just 2K to 6K iterations and only about 6 A100 GPU days of training, facilitating practical applications with fewer resources. Note that LCM needs 10k+ iterations for optimal performance, which costs about 16 A100 GPU days, and cannot be applied to larger skipping step sizes due to the serious performance gap.
Approximated Error Estimation for SPLAM. To illustrate the efficacy of our approach, we directly estimate the denoising mapping error between two adjacent samples on the PF-ODE, $\delta(t,k)=\mathbb{E}[|\bm{f}_{\bm{\theta}}(\bm{x}_{t_{n+k}},t_{n+k})-\bm{f}_{\bm{\theta}}(\bm{x}_{t_n},t_n)|]$, following the distance first defined in Eq. 6. The results are shown in Fig. 2(c). We randomly select 1000 samples from the COCO dataset and simulate adjacent points on the ODE by adding the same noise at adjacent timesteps. We utilize $k=20$ and the corresponding 50 timesteps of the DDIM scheduler, disregarding steps smaller than 100 due to their relatively large simulation deviation. It can be seen that, especially at larger timesteps, the error $\delta$ of our SPLAM is further reduced (by about $10\%$ at $t=800$). This observation substantiates that SPLAM indeed minimizes the approximated errors, boosting the model's capacity for high-quality image generation.

The Effect of Guidance Scale $w$. The guidance scale $w$ is a critical hyper-parameter in Stable Diffusion [34, 10]; adjusting it allows users to alter the semantic alignment and the quality of the generated image. In this study, we also examine the impact of varying the guidance scale $w$ for our SPLAM based on SDv1.5, which is visualized in Fig. 3. As with vanilla Stable Diffusion, a higher $w$ value contributes to better sample quality as reflected by CLIP Scores, but it concurrently leads to a degradation in FID performance and to oversaturation.

Figure 4: Comparison of our SPLAM and LCM [26] in 1-, 2- and 4-step generation. The results of LCM are based on our reproduction as described in Sec. 5.1. SPLAM generates consistently higher-quality images that are clearer and more detailed. Noteworthy is the remarkable performance of SPLAM in 2-step generation, which aligns closely with the 4-step generation results of LCM, highlighting the efficiency and effectiveness of our approach in producing high-fidelity images with fewer generation steps.
Figure 5: Qualitative results. The text prompts are selected from DMD [50] in (a) and UFOGen [49] in (b), and the results of those two methods are also cited from the respective papers. SPLAM demonstrates the best 4-step generation quality apart from the SD models. When decreasing the sampling steps to 2, SPLAM still maintains comparable performance, generating even better results than 4-step LCM [26].

5.3 Qualitative Results

To emphasize the boosted generation quality of our SPLAM, we display the 1-, 2- and 4-step generation results in comparison with LCM [26] in Fig. 4. Moreover, we compare our SPLAM distilled from SDv1.5 [34] with the most advanced accelerated diffusion models in Fig. 5, which demonstrates that our SPLAM achieves the best generation quality among the existing methods.

6 Conclusion

In this paper, we propose a novel approach, the Sub-Path Linear Approximation Model (SPLAM), for accelerating diffusion models. SPLAM leverages the approximation strategy of consistency models and considers the PF-ODE trajectories as a series of interconnected sub-paths delineated by sampled points. Guided by the optimization direction charted by each sub-path, Sub-Path Linear (SL) ODEs enable our approach to progressively and continuously optimize the approximated learning objectives and thus construct denoising mappings with smaller cumulative errors. We also develop an efficient distillation procedure for SPLAM to enable the incorporation of latent diffusion models. Extensive experiments on the LAION, MS COCO 2014 and MS COCO 2017 datasets consistently demonstrate the superiority of our method over existing diffusion acceleration approaches in few-step generation, together with fast training convergence.

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119, No. 61921006), the Fundamental Research Funds for the Central Universities (No. 020214380119), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Appendix

Appendix 0.A Implementation Details

Common Settings: On the text-to-image generation task, we train two models with pre-trained Stable Diffusion-V1.5 (SDv1.5) and Stable Diffusion-V2.1-base (SDv2.1-base) as teacher models, respectively. Following the setting of [26], the training dataset is a subset of LAION-5B [39], LAION-Aesthetics-6+, which comprises 12M text-image pairs with predicted aesthetics scores higher than 6. We choose the DDIM solver [42] as the ODE solver $\Phi$ and uniformly set the solver skipping step size $k_\phi$ to 20 across all experiments. The training process is optimized using the AdamW optimizer [22], with a learning rate of 8e-6 and a weight decay of 0. To ensure the stability of gradient updates, we implement gradient clipping with an L2 norm threshold of 10, and apply an Exponential Moving Average (EMA) with a decay rate of 0.95. The models are trained with a total batch size of 1024 distributed across 16 A100 GPUs. For evaluation, we follow common practice and use the Fréchet Inception Distance (FID), which measures image quality, and the CLIP Score [31], which measures text-to-image alignment and which we compute with LAION OpenCLIP ViT-G/14 [11]. For LAION, as stated in [26], we generate 30k images from 10k text prompts as the evaluation set. For MSCOCO2014-30k (zero-shot FID-30k) and MSCOCO2017-5k, we follow the evaluation protocols described in [35] and [28], respectively.

For SDv2.1-base: To make a fair comparison between LCM and our approach, we train our SPLAM with skipping step size $k=20$ and the L2 distance metric, consistent with [26] (the Multiple Estimation strategy is disabled since $k\leq k_\phi$). The difference is that we train our models with a fixed guidance scale $w=8$. We also reproduce an SDv2.1-base LCM according to the training configuration outlined in [26] while replacing the $w$-condition with the fixed guidance scale, which also improves its performance. Regarding training cost, one GPU day can process 1.6K iterations with a batch size of 1024 and without the Multiple Estimation (ME) strategy. We train SPLAM on SDv2.1-base for 80k iterations, which costs about 60 A100 GPU days.

For SDv1.5: The guidance scale $w$ is set to 3 to obtain the optimal FIDs, and we adopt the Huber loss from [43], as we find that it provides faster convergence and better performance in two-step and one-step generation. A guidance scale of $w=8$ is also used, and all ablation studies are conducted on SDv1.5 with $w=8$ by default. Regarding training cost, we train SPLAM on SDv1.5 with the Multiple Estimation (ME) described in Sec. 4.2 of the main paper, which executes the solver $\Phi$ for $\frac{k}{k_\phi}$ times in one iteration. On our platform, when $k=100$ and $k_\phi=20$, one GPU day can process nearly half the amount of data compared to before. As the skipping step size of $k=100$ shows fast convergence, training costs just 6k iterations, requiring only 6 GPU days.

Appendix 0.B Derivation of $\alpha(\gamma,t)$ and $\sigma(\gamma,t)$

The perturbation process of DDPM [9], which corresponds to the Variance Preserving (VP) SDE [46], can be given by the following Markov chain:

$\bm{x}_t=\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\sqrt{1-\frac{\alpha(t)^2}{\alpha(t-k)^2}}\,\bm{\epsilon}$,   (20)

where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$. For the VP formulation, $\alpha(t)$ is the controlling schedule with $\alpha(0)=1$, and $\sigma(t)$ is defined as $\sigma(t)=\sqrt{1-\alpha(t)^2}$.

A nice property of the above process is that we can sample $\bm{x}_t$ at any arbitrary time step $t$ in closed form:

$\bm{x}_t=\alpha(t)\bm{x}_0+\sigma(t)\bm{\epsilon}=\alpha(t)\bm{x}_0+\sqrt{1-\alpha(t)^2}\,\bm{\epsilon}$.   (21)

Accordingly, we have the posterior:

$p(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\bm{x}_t;\alpha(t)\bm{x}_0,\sigma(t)^2\bm{I})$.   (22)

To obtain the posterior for $\bm{x}_{\gamma,t}$, where $p(\bm{x}_{\gamma,t}|\bm{x}_0)=\mathcal{N}(\bm{x}_{\gamma,t};\alpha(\gamma,t)\bm{x}_0,\sigma(\gamma,t)^2\bm{I})$, we compute $\bm{x}_{\gamma,t}$ as described in Eq. (12) of the main paper and substitute Eq. 20 and Eq. 21 into it:

$\bm{x}_{\gamma,t}=(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\gamma\bm{x}_t=(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\gamma\left(\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\sqrt{1-\frac{\alpha(t)^2}{\alpha(t-k)^2}}\,\bm{\epsilon}\right)=\frac{\alpha(t)}{\alpha(t-k)}\bm{x}_{t-k}+\gamma\sqrt{1-\frac{\alpha(t)^2}{\alpha(t-k)^2}}\,\bm{\epsilon}$.   (23)

Thus $\bm{x}_{\gamma,t}$ can be represented as a combination of two independent variables. Since we have $p_t(\bm{x}_{t-k}|\bm{x}_0)=\mathcal{N}(\bm{x}_{t-k};\alpha(t-k)\bm{x}_0,\sigma(t-k)^2\bm{I})$, the closed-form expressions for the mean and variance in $p_t(\bm{x}_{\gamma,t}|\bm{x}_0)$ are:

$\alpha(\gamma,t)=\frac{\alpha(t)}{\alpha(t-k)}\,\alpha(t-k)+0=\alpha(t),\qquad \sigma(\gamma,t)^2=\frac{\alpha(t)^2}{\alpha(t-k)^2}\sigma(t-k)^2+\gamma^2\left(1-\frac{\alpha(t)^2}{\alpha(t-k)^2}\right)$,   (24)

where $\sigma(t)=\sqrt{1-\alpha(t)^2}$ as defined in this noise schedule.
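As a quick numerical sanity check of Eq. 24 under a toy VP schedule (an illustrative assumption, not the exact SD schedule), one can verify that the interpolated point of Eq. 12 indeed has mean coefficient $\alpha(t)$ and the stated variance:

```python
import math
import torch

torch.manual_seed(0)
alpha = lambda t: math.cos(0.5 * math.pi * t)   # toy VP schedule (assumption)
sigma = lambda t: math.sin(0.5 * math.pi * t)

t, k, gamma, n = 0.6, 0.2, 0.4, 200_000
x0 = torch.tensor(1.0)                           # scalar data point
eps1, eps2 = torch.randn(n), torch.randn(n)
x_tmk = alpha(t - k) * x0 + sigma(t - k) * eps1  # Eq. 21 at t-k
x_t = (alpha(t) / alpha(t - k)) * x_tmk \
      + math.sqrt(1 - alpha(t)**2 / alpha(t - k)**2) * eps2          # Eq. 20
x_gamma = (1 - gamma) * (alpha(t) / alpha(t - k)) * x_tmk + gamma * x_t  # Eq. 12

var_pred = (alpha(t)**2 / alpha(t - k)**2) * sigma(t - k)**2 \
           + gamma**2 * (1 - alpha(t)**2 / alpha(t - k)**2)          # Eq. 24
print(x_gamma.mean().item(), alpha(t) * x0.item())  # empirical mean vs. alpha(t)*x0
print(x_gamma.var().item(), var_pred)               # empirical variance vs. Eq. 24
```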

Subsequently, we can compute the error $e^{\text{VP}}_\sigma$ between the derived result and the empirical value $\sigma'(\gamma,t)=(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)+\gamma\,\sigma(t)$ that we employ in the main paper:

$e^{\text{VP}}_\sigma=\sigma(\gamma,t)^2-\sigma'(\gamma,t)^2=\frac{\alpha(t)^2}{\alpha(t-k)^2}\left(1-\alpha(t-k)^2\right)+\gamma^2\left(1-\frac{\alpha(t)^2}{\alpha(t-k)^2}\right)-\left(\gamma\sigma(t)+(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)\right)^2=2\gamma(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)\left(\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)-\sigma(t)\right)=2\gamma(1-\gamma)\frac{\alpha(t)}{\alpha(t-k)}\sigma(t-k)\left(\sqrt{\frac{\alpha(t)^2}{\alpha(t-k)^2}-\alpha(t)^2}-\sqrt{1-\alpha(t)^2}\right)$.   (25)

As shown in Eq. 25, the estimated error $e^{\text{VP}}_\sigma\equiv 0$ at the endpoints of the PF-ODE sub-paths, i.e., when $\gamma=1$ or $\gamma=0$, and thus fidelity is preserved for generation. One would typically expect the error to rise as $k$ increases. However, in our experimental analysis, we found that even for $k=100$, training with $\sigma(\gamma,t)$ or $\sigma'(\gamma,t)$ yields nearly the same results, indicating that training on SL-ODEs tolerates such small errors at the intermediate sampled points.
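The following sketch (reusing the toy schedule above, again an illustrative assumption) evaluates $e^{\text{VP}}_\sigma$ over $\gamma$ and confirms it vanishes at $\gamma\in\{0,1\}$ and stays small in between:

```python
import math

alpha = lambda t: math.cos(0.5 * math.pi * t)   # same toy VP schedule (assumption)
sigma = lambda t: math.sin(0.5 * math.pi * t)

def e_sigma_vp(gamma, t, k):
    """Eq. 25: gap between the exact sigma(gamma,t)^2 of Eq. 24 and the square of
    the empirical sigma'(gamma,t) used in Eq. 15."""
    r = alpha(t) / alpha(t - k)
    exact_var = r**2 * sigma(t - k)**2 + gamma**2 * (1 - r**2)
    empirical = (1 - gamma) * r * sigma(t - k) + gamma * sigma(t)
    return exact_var - empirical**2

t, k = 0.6, 0.2
for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(gamma, e_sigma_vp(gamma, t, k))       # zero at gamma = 0 and gamma = 1
```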

Appendix 0.C Strategies for Constructing Sub-path ODE

To construct approximated paths for the sub-paths on the PF-ODE trajectories, an intuitive alternative to our SL-ODEs is to directly connect the corresponding endpoints $(\bm{x}_{t-k},t-k)$ and $(\bm{x}_t,t)$; we call this the Direct Linking (DL) ODE. The DL-ODE can be formulated as:

$\bm{x}_{\gamma,t}=\bm{x}_{t-k}+\gamma(\bm{x}_t-\bm{x}_{t-k}),\qquad d\bm{x}_{\gamma,t}=\left[\gamma\,(\bm{x}_t-\bm{x}_{t-k})\right]d\gamma$.   (26)

The approximated learning objective for DL-ODEs is then:

$\mathcal{L}_{DL}(\bm{\theta},k)=\mathbb{E}\left[\left|\frac{\bm{x}_{\gamma,t}-\sigma_{DL}(\gamma,t)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{\gamma,t},\gamma,t)}{\alpha_{DL}(\gamma,t)}-\frac{\bm{x}_{t-k}-\sigma(t-k)\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t-k},t-k)}{\alpha(t-k)}\right|\right]$,   (27)

where the noise schedule $\alpha_{DL},\sigma_{DL}$ is defined by the marginal distribution of $\bm{x}_{\gamma,t}$: $p(\bm{x}_{\gamma,t}|\bm{x}_0)=\mathcal{N}(\bm{x}_{\gamma,t};\alpha_{DL}(\gamma,t)\bm{x}_0,\sigma_{DL}^2(\gamma,t)\bm{I})$. We derive the expressions of $\alpha_{DL},\sigma_{DL}$, which are:

$\alpha_{DL}(\gamma,t)=\gamma\,\alpha(t)+(1-\gamma)\,\alpha(t-k),\qquad \sigma_{DL}(\gamma,t)^2=\left(1-\gamma+\frac{\alpha(t)}{\alpha(t-k)}\gamma\right)^2\sigma(t-k)^2+\gamma^2\left(1-\frac{\alpha(t)^2}{\alpha(t-k)^2}\right)$.   (28)

Comparing Eq. 28 with Eq. (15) in the main paper, learning on SPLAM separates the optimization processes for $dist_{\Delta}$ and $dist_{0,\bm{\theta}}$, whereas the DL-ODE provides a blended estimation. We apply both strategies to the distillation process from SDv1.5, and denote the optimized models as Opt$_{\text{DL}}$ for training on DL-ODEs and Opt$_{\text{SL}}$ for our SPLAM learning. The tested FIDs are shown in Tab. 3. While the DL-ODE also makes a continuous estimation for the sub-path learning objective and indeed outperforms LCM, the entangled optimization of $dist_{\Delta}$ and $dist_{0,\bm{\theta}}$ still restricts its optimized performance. In contrast, our approach consistently yields optimal FIDs, reaffirming its superiority in generating high-quality images.

Table 3: Comparison between the optimized performance based on DL-ODEs and SL-ODEs. We report the 4-step generation results for all methods with a guidance scale $w=8$.
Methods FID-5k FID-30k
LCM [26] 24.68 14.53
Opt$_{\text{DL}}$ 24.20 13.89
Opt$_{\text{SL}}$ (SPLAM) 23.76 13.39

Appendix 0.D Additional Generated Results

0.D.1 Comparison to LCM with pre-trained DreamShaper.

Since LCM has only released one SDv1.5-based model (https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7), which is distilled from DreamShaper-v7, we train a SPLAM on DreamShaper-v7 as well and compare it with LCM, as shown in Fig. 6 and Fig. 7.

Figure 6: 2-step generated samples from our SPLAM and LCM, both distilled from DreamShaper-v7. We use the LCM-Dreamshaper-v7 checkpoint hosted on Hugging Face.
Figure 7: 4-step generated samples from our SPLAM and LCM, both distilled from DreamShaper-v7. We use the LCM-Dreamshaper-v7 checkpoint hosted on Hugging Face.
Figure 8: Generated samples from SPLAM (our method) and LCM.
Figure 9: Generated samples from SPLAM (our method) and LCM.

0.D.2 More Generated images with SD pre-trained Models.

Here we provide more images generated by our SPLAM, compared with LCM [26], the newly released one-step generation approach InstaFlow [21], and Stable Diffusion [34]. Unless otherwise specified, our SPLAM results are based on the model distilled from the open-source SDv1.5, and the results for LCM are generated from our reproduction. The images are shown in Figs. 8, 9, 10 and 11.

Figure 10: Comparison of our method SPLAM with one-step generation (InstaFlow [21]) and Stable Diffusion [34].
Figure 11: Comparison of our method SPLAM with one-step generation (InstaFlow [21]) and Stable Diffusion [34].

References

  • [1] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
  • [2] Bao, F., Li, C., Zhu, J., Zhang, B.: Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503 (2022)
  • [3] Berthelot, D., Autef, A., Lin, J., Yap, D.A., Zhai, S., Hu, S., Zheng, D., Talbot, W., Gu, E.: Tract: Denoising diffusion models with transitive closure time-distillation. arXiv preprint arXiv:2303.04248 (2023)
  • [4] Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., et al.: Muse: Text-to-image generation via masked generative transformers. In: ICML (2023)
  • [5] Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., Taigman, Y.: Make-a-scene: Scene-based text-to-image generation with human priors. In: ECCV (2022)
  • [6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Nets. In: NIPS (2014)
  • [7] Gu, J., Zhai, S., Zhang, Y., Liu, L., Susskind, J.M.: Boot: Data-free distillation of denoising diffusion models with bootstrapping. In: ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling (2023)
  • [8] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS 2014 Deep Learning Workshop (2015)
  • [9] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
  • [10] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
  • [11] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (Jul 2021). https://doi.org/10.5281/zenodo.5143773
  • [12] Jolicoeur-Martineau, A., Li, K., Piché-Taillefer, R., Kachman, T., Mitliagkas, I.: Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080 (2021)
  • [13] Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.: Scaling up gans for text-to-image synthesis. In: CVPR (2023)
  • [14] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. In: NeurIPS (2022)
  • [15] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014)
  • [16] Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. In: NeurIPS (2023)
  • [17] Lin, S., Wang, A., Yang, X.: Sdxl-lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929 (2024)
  • [18] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
  • [19] Liu, L., Ren, Y., Lin, Z., Zhao, Z.: Pseudo numerical methods for diffusion models on manifolds. In: ICLR (2022)
  • [20] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2023)
  • [21] Liu, X., Zhang, X., Ma, J., Peng, J., Liu, Q.: Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380 (2023)
  • [22] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  • [23] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: NeurIPS (2022)
  • [24] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095 (2022)
  • [25] Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)
  • [26] Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
  • [27] Luo, S., Tan, Y., Patil, S., Gu, D., von Platen, P., Passos, A., Huang, L., Li, J., Zhao, H.: Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556 (2023)
  • [28] Meng, C., Rombach, R., Gao, R., Kingma, D., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: CVPR (2023)
  • [29] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
  • [30] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: ICML (2021)
  • [31] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  • [32] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
  • [33] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: ICML (2021)
  • [34] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  • [35] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
  • [36] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)
  • [37] Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042 (2023)
  • [38] Sauer, A., Schwarz, K., Geiger, A.: Stylegan-xl: Scaling stylegan to large diverse datasets. In: SIGGRAPH (2022)
  • [39] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. In: NeurIPS (2022)
  • [40] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)
  • [41] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: NIPS (2015)
  • [42] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
  • [43] Song, Y., Dhariwal, P.: Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189 (2023)
  • [44] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: ICML (2023)
  • [45] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: NeurIPS (2019)
  • [46] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: ICLR (2021)
  • [47] Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023)
  • [48] Xiao, Z., Kreis, K., Vahdat, A.: Tackling the generative learning trilemma with denoising diffusion gans. In: ICLR (2022)
  • [49] Xu, Y., Zhao, Y., Xiao, Z., Hou, T.: Ufogen: You forward once large scale text-to-image generation via diffusion gans. arXiv preprint arXiv:2311.09257 (2023)
  • [50] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828 (2023)
  • [51] Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)
  • [52] Zhao, W., Bai, L., Rao, Y., Zhou, J., Lu, J.: Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. arXiv preprint arXiv:2302.04867 (2023)
  • [53] Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., Anandkumar, A.: Fast sampling of diffusion models via operator learning. In: ICML (2023)
  • [54] Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., Sun, T.: Towards language-free training for text-to-image generation. In: CVPR (2022)
  • [55] Zhou, Z., Chen, D., Wang, C., Chen, C.: Fast ode-based sampling for diffusion models in around 5 steps. arXiv preprint arXiv:2312.00094 (2023)