
DreamWalk: Style Space Exploration using Diffusion Guidance

Michelle Shu*1 and Charles Herrmann*3 and Richard S. Bowen2 and Forrester Cole3 and Ramin Zabih1,3
1Cornell University, 2Geomagical Labs, 3Google Research
Abstract.

Text-conditioned diffusion models can generate impressive images, but fall short when it comes to fine-grained control. Unlike direct-editing tools like Photoshop, text-conditioned models require the artist to perform "prompt engineering," constructing special text sentences to control the style or amount of a particular subject present in the output image. Our goal is to provide fine-grained control over the style and substance specified by the prompt, for example to adjust the intensity of styles in different regions of the image (see the teaser figure). Our approach is to decompose the text prompt into conceptual elements and apply a separate guidance term for each element in a single diffusion process. We introduce guidance scale functions to control when in the diffusion process and where in the image to intervene. Since the method is based solely on adjusting diffusion guidance, it does not require fine-tuning or manipulating the internal layers of the diffusion model's neural network, and can be used in conjunction with LoRA- or DreamBooth-trained models (Figure 2). Project page: https://mshu1.github.io/dreamwalk.github.io/

1. Introduction

While the quality of text-to-image generation has dramatically improved in the last couple of years thanks to diffusion models, the best way to interact with and control these methods remains unclear. Standard tools like Photoshop or Procreate can apply a wide range of effects, ranging from brushes that change the image in precise, local ways to image filters that alter it globally. These effects come with a variety of user-adjustable parameters that provide fine-grained control. For example, a sepia filter can be applied with adjustable magnitude or tint.

Generative models offer new and powerful capabilities, and can create a completely new image from a simple text prompt such as "a river flows under a bridge with clear sky". For a given text prompt, sampling different seeds allows for exploring new compositions or scenes with elements from that prompt. Styles can be applied to these scenes by adding a style prompt – "a river flows under a bridge with clear sky in the style of van Gogh". However, changes to the prompt often come with large changes in the generated image's composition or layout, and it is difficult to use a prompt to granularly control the intensity or location of an applied style.

Our goal is to support user control over the amount, location, and type of style being applied, and thus use diffusion models to explore the style space for a given generated image.

We generalize the multiple guidance formulation of (Liu et al., 2022) in order to decompose prompts into different conceptual elements, such as a subject prompt and a style prompt; these can be granularly emphasized or de-emphasized. We introduce guidance scale functions, which allow the user to control when in the diffusion process and where on the image each guidance term is applied. The control is fine-grained in that the emphasis can be modified on a real-valued scale, which provides more control than just adding language to the prompt. It is flexible in that emphasis can be applied spatially (to all or just part of the image) and temporally: emphasizing a prompt early in the diffusion process causes it to affect overall composition or layout more strongly, whereas applying it later causes it to affect texture or image details (see Section 3.4.1).

Unlike many other approaches to control, such as ControlNet (Zhang et al., 2023) or StyleDrop (Sohn et al., 2023), our approach does not require fine-tuning or training of any kind. Unlike techniques such as Prompt-to-Prompt (Hertz et al., 2022) or GLIGEN (Li et al., 2023), we also do not access the internal components of the network, so our technique does not need to be adapted or changed based on the architecture. Our method comes directly out of the diffusion formulation and as such can be applied to any diffusion network trained with text guidance (Ho and Salimans, 2022).

Our technique can control the application of style, as shown in the teaser figure, as well as the degree of personalization, as shown in Figure 2. We provide results on both Stable Diffusion 1.5 and Stable Diffusion XL, two diffusion models with different architectures, and demonstrate that our technique works with standard text-to-image models as well as DreamBooth and LoRA fine-tunings.

Figure 2. Controllable subject / prompt emphasis. Our formulation can explore adherence to a DreamBooth subject or adherence to the text prompt. Generated with SD1.5.

2. Background and Related Work

2.1. Diffusion Models and Guidance Functions

Denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Yang et al., 2022) are a class of generative models trained to transform a sample of Gaussian noise into a sample of a data distribution. To generate a sample from the data distribution, we start with Gaussian noise $x_1 \sim N(0, I)$. As the time step $t$ runs from $1$ to $0$, we alternate between denoising and adding back a smaller amount of Gaussian noise. Over the course of the sampling process, from $t=1$ to $t=0$, the image becomes less noisy and more realistic. An update step using a DDPM (Ho et al., 2020) sampler might be:

(1) $x_{t-1} = (x_t - f_\theta(x_t, t)) + \epsilon_t$

where $f_\theta(x_t, t)$ is the guidance function, which is responsible for the denoising stage of the diffusion process. The guidance function may combine multiple forward passes of the diffusion model $f_\theta(x_t, t, c)$, a network trained to estimate the noise of the image given a condition $c$. For example, classifier-free guidance (Ho and Salimans, 2022) combines conditional and unconditional outputs to improve adherence to the conditioning vector:

(2) $f_\theta(x_t, t) = f_\theta(x_t, t, \emptyset) + s \cdot \left[ f_\theta(x_t, t, c) - f_\theta(x_t, t, \emptyset) \right]$

Liu et al. (Liu et al., 2022) demonstrated that multiple diffusion forward passes may be composed to allow the image generator to "build" a scene from various elements. For example, they combine phrases like "a photo of cherry blossom trees", "sun dog", and "green grass" in order to generate an image with components from each of these phrases. Given a series of conditions $c_1, \ldots, c_k$ defined on a single model $\theta$, they compute a single overall update step as follows:

(3) $f_\theta(x_t, t) = f_\theta(x_t, t, \emptyset) + \sum_{i=1}^{k} s_i \cdot \left[ f_\theta(x_t, t, c_i) - f_\theta(x_t, t, \emptyset) \right]$

where the $s_i$ are the guidance scales or coefficients (see Equation 15 in (Liu et al., 2022)). Note that if $k=1$, $c_1$ is a text-conditioning vector, and $\emptyset$ is an empty conditioning vector ("null text"), Eq. 3 is equivalent to classifier-free guidance (Ho and Salimans, 2022).
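As a minimal sketch of Eq. 3 (assuming the denoiser is available as a callable `eps_model(x, t, cond)` returning a noise estimate; that interface is illustrative, not the API of any particular library):

```python
def composed_cfg(eps_model, x_t, t, conds, scales, null_cond):
    """Eq. 3: one unconditional pass plus k conditional passes.

    eps_model : callable (x, t, cond) -> predicted noise tensor (hypothetical interface)
    conds     : list of k conditioning vectors c_1, ..., c_k
    scales    : list of k guidance scales s_1, ..., s_k
    null_cond : the empty ("null text") conditioning
    """
    eps_uncond = eps_model(x_t, t, null_cond)
    eps = eps_uncond.clone()
    for c_i, s_i in zip(conds, scales):
        # Each condition contributes s_i * [f(x_t, t, c_i) - f(x_t, t, null)].
        eps = eps + s_i * (eps_model(x_t, t, c_i) - eps_uncond)
    return eps  # with k = 1 this reduces to standard classifier-free guidance
```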

We build upon (Liu et al., 2022), while addressing a different task and providing a more general formulation. (Liu et al., 2022) focuses on compositionality, with the goal of producing a single sample containing all the elements present in the multiple text prompts (e.g. a tree, a road, a blue sky). In an image they produce, a particular element of the text is either present or absent; a success is defined as having all the elements of the multiple text prompts present in the generated image. In contrast to this true/false task, we focus on fine-grained control for stylization and personalization, producing multiple samples which offer a choice of target distributions. Solving this very different task requires a generalization of the multiple guidance framework of (Liu et al., 2022) through the introduction of guidance scale functions; this is described in Section 3, where the relationship with (Liu et al., 2022) is discussed in more detail.

Multidiffusion (Bar-Tal et al., 2023) and SyncDiffusion (Lee et al., 2023) use a similar formulation to create panoramas or images at a higher resolution than can be naturally generated by the diffusion model, which typically has a fixed generation resolution. Both methods do this by partitioning an image into overlapping windows and then running a single diffusion process where each window provides guidance to pixels in its regions. In both techniques, all the pixels go through a single, joint diffusion process.

2.2. Style for Image Creation

Stylization or style transfer, taking a real photo and applying a style, is a long-explored field, with early techniques involving hand-designed algorithms that mimic a certain artistic style (Haeberli, 1990; Salisbury et al., 2023; Litwinowicz, 1997; Hertzmann, 2003), and parametric (Heeger and Bergen, 1995; Portilla and Simoncelli, 2000) or non-parametric (Efros and Leung, 1999; Barnes et al., 2009; Hertzmann et al., 2023) techniques for matching image statistics to either new texture patches or stylized images. Recently, as deep learning has become more popular, the field has shifted towards neural style transfer and texture synthesis (Gatys et al., 2015; Jing et al., 2019; Azadi et al., 2018). Unlike these techniques, we do not aim to transfer image statistics from a style exemplar image. Instead, we focus on gaining fine-grained control over a generative model's pre-existing ability to apply style and to move between different styles.

(Karras et al., 2019) showed that modifying the generator architecture of a GAN can induce a latent space with coarse and fine features disentangled. A number of publications followed, focusing on controllability by disentangling the latent space, producing interpretable paths or directions through the latent space that allow for continuous control of pose, zoom, scene composition, hair length, and many others: (Wu et al., 2021; Tewari et al., 2020; Voynov and Babenko, 2020; Härkönen et al., 2020; Shen et al., 2020). Other methods tried to identify existing interpretable paths in the model  (Jahanian et al., 2019; Härkönen et al., 2020; Voynov and Babenko, 2020). (Patashnik et al., 2021) learned to use CLIP embeddings to control StyleGAN latents with text prompts. More recently, similar techniques have been applied to diffusion models: (Park et al., 2023) introduces a pullback metric and uses this to find latent-space basis vectors with particular semantics; (Preechakul et al., 2022) learn to encode directly into the diffusion model’s latent space, allowing interpolation along axes like age and hair color. (Wu et al., 2023a) use a soft weighting of base and attribute prompts, with the weighting factor chosen to try to limit the impact of the attribute prompt only to appearance, and not the layout, of the generated image. This weighting factor is learned through an optimization problem and is done per image, though the weights can be applied to any image. Their main focus is taking a generated image and then applying the style once, without fine-grained control over how much the style is applied.

2.3. Personalized Text-to-Image Generation

Given the difficulty of capturing fine-grained concepts using text, considerable attention has been given to methods for personalizing image generation models.

One approach is to train a new model that is conditioned on both images and text (Chen et al., 2023; Ma et al., 2023; Shi et al., 2023), which requires gathering a large dataset and retraining. An alternative approach is to finetune an existing model, either by finetuning the text embeddings (Gal et al., 2022), the weights of the generator network (Ruiz et al., 2023; Sohn et al., 2023; Kumari et al., 2023), or both (Avrahami et al., 2023). Most commonly, personalized finetuning takes 3-5 images of a specific object or style and finetunes the entire text-to-image network to reconstruct these images when conditioned on text containing a rare token $[\mathcal{V}]$ placed in the context of its noun, e.g. "a photo of $[\mathcal{V}]$ dog". DreamBooth (Ruiz et al., 2023) also uses a class-preservation prior in order to prevent the noun's conditioning from drifting away from its original meaning and collapsing to only representing the 3-5 images used for fine-tuning.

Parameter-efficient finetuning (Hu et al., 2021; Houlsby et al., 2019) such as LoRA is commonly used to speed up finetuning and reduce the size of the weight modifications, allowing users to transfer only the update weights instead of the entire network weights.
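As a generic illustration of the LoRA mechanism (a sketch, not the specific recipe used in this work), the adapter adds a trainable low-rank update to a frozen linear layer, so only the small matrices need to be stored and shared:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer (illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        # Trainable low-rank factors: update = B @ A, scaled by alpha / rank.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Only the small update (lora_b @ lora_a) is trained and transferred.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```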

3. Our Approach

Our overall goal is to be able to "walk" or explore style space, providing fine-grained control over different aspects of style, such as location, intensity, and type. To do this, we need three things: (1) the ability to aim for a specific target distribution based on multiple conditions (Sections 3.1 and 3.3); (2) a way to create the appropriate conditions (Section 3.2); and (3) a mechanism to provide fine-grained control over how we guide towards these conditions and their relative location, intensity, and type (Section 3.4).

Our full framework allows for the robust application of various styles. In Figures 10 and 11, we show a simple application of the framework, applying a single style to a given prompt. In later sections, we demonstrate more complex capabilities, including controlling how style is applied in the image formation process and walking in style space.

3.1. Multiple Guidance Formulation

Building on the multiple guidance terms introduced by (Liu et al., 2022), we propose guidance scale functions $s_i(t, u, v)$ which allow for granular control over when, in the diffusion noise schedule, and where, in the image space, a forward pass with a specific condition is used. Specifically, we consider the following formulation for composition of diffusion models:

$f_\theta(x_t, t) = f_\theta(x_t, t, \emptyset) + \sum_{i=1}^{k} s_i(t, u, v)\, g_\theta(x_t, t, c_i)$
$g_\theta(x_t, t, c_i) = f_\theta(x_t, t, c_i) - f_\theta(x_t, t, \emptyset)$

where for simplicity we define $g_\theta(x_t, t, c_i)$ as the guidance term for condition $c_i$. The guidance term includes the forward passes of the diffusion model $f_\theta(x_t, t, c_i)$ and $f_\theta(x_t, t, \emptyset)$, and indicates a direction and magnitude towards a condition. This magnitude can be emphasized or de-emphasized by the guidance scale function $s_i(t, u, v)$, which depends on time $t$ and spatial location $(u, v)$.
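Relative to the composed-guidance sketch after Eq. 3, the only change is that each scalar $s_i$ is replaced by the value of a scale function, which may be a scalar or a per-pixel map broadcast over the latent. A minimal sketch of this formulation (the `eps_model` interface and the `(B, C, H, W)` latent layout are assumptions, not the paper's actual implementation):

```python
def guided_step(eps_model, x_t, t, conds, scale_fns, null_cond):
    """Composed guidance with guidance scale functions s_i(t, u, v)."""
    eps_uncond = eps_model(x_t, t, null_cond)
    eps = eps_uncond.clone()
    for c_i, s_fn in zip(conds, scale_fns):
        g_i = eps_model(x_t, t, c_i) - eps_uncond  # guidance term g_theta(x_t, t, c_i)
        s_i = s_fn(t, x_t.shape[-2:])              # scalar or (H, W) map at this timestep
        eps = eps + s_i * g_i                      # broadcasts over batch and channels
    return eps
```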

3.2. Creating multiple guidance terms from text prompts

Now that we have the ability to target a specific distribution based on multiple conditions, we need a way to create the appropriate conditions for our task of controlling stylization and personalization.

Decomposing prompts for stylization. Numerous works (Haeberli, 1990; Salisbury et al., 2023; Hertzmann, 2003; Rashtchian et al., 2023; Wu et al., 2023b) consider the style and substance of an image to be two separate, decomposable concepts: e.g., any subject can be painted in watercolor or created as a woodblock print. As a result, we explore decomposing the prompt into several constituent parts, primarily focusing on a "base" prompt consisting of a description of the scene (subject, action, setting, etc.) and a style component, often indicated by an art medium, artistic period, or artist (watercolor, Impressionism, Hokusai, etc.). For example, the prompt "a river flows under a bridge with a clear sky in the style of van Gogh" might decompose into the base "a river flows under a bridge with a clear sky" and the style "van Gogh".

Separating personalized subjects from their context. The same technique can be used for subject personalization (Ruiz et al., 2023). DreamBooth introduces a rare token as an adjective that modifies an existing subject, e.g. "a $[\mathcal{V}]$ vase". This adjective is then trained to produce a specific vase. The phrase is then placed in the context of a complete text prompt, often containing other adjectives, verbs (actions), or other context (background, style, etc.), e.g. "a $[\mathcal{V}]$ vase with a colorful bouquet". This complete sentence is then used as the single text conditioning.

In the multiple guidance formulation, this sentence can be decomposed into its component parts, each guided towards separately, allowing control over which of these concepts is most important to the user. If there is a desired subject ("a $[\mathcal{V}]$ vase") and a desired state ("with a colorful bouquet"), the user can decide which of these to emphasize. Figure 2 shows examples emphasizing adherence to either the text prompt (background, action) or the subject (the DreamBooth token and subject). This control is useful since one common failure is the entanglement between a subject's pose or actions and the subject's appearance. If the images used for personalization share a similar pose, e.g. all images of a dog are front-facing with the dog sitting, then the generator can struggle to change the pose of the dog. Our method allows the user to emphasize the action separately from the token, reducing this bias.

Figure 3. Applying a style without specifying a subject can, depending on the biases of the prompt's distribution, lead to changes in the subjects in the generated image. E.g., the "Hokusai style" distribution has a strong bias towards showing Hokusai's famous waves, even when they overwhelm the original content of the image. Adding a subject to the prompt, like "house", can mitigate this. Generated with SDXL.

3.2.1. Mitigating subject drift through prompting.

While these decompositions make sense in theory, in practice we observe that guiding towards a prompt which only contains a style phrase “Hokusai style” can induce unexpected results, as seen in Figure 3. Since text-to-image generative networks have been trained with pairs of images and text, style and substance have not been totally disentangled. This can produce “subject drift”: while the user may like their content and just want it to look more like woodblock print, introducing “Hokusai” may also drive the content towards subjects that Hokusai depicted such as Mount Fuji, ocean waves, or Japanese style houses.

In practice, as shown in Figure 3, we find that this can be mitigated by introducing a small amount of the subject matter into the style prompt. While this prompt engineering technique prevents widespread subject drift, smaller drifts remain that the user may or may not want; e.g. when applying "Hokusai style" to a building, the generation tends to produce Japanese-style houses. In later sections, we show that this effect can be mitigated, if desired, through the use of temporally-varying guidance scale functions.

Figure 4. Controlling the emphasis. Left: the standard way of using text-to-image models is to specify everything in a single prompt and then guide towards it with the guidance term $g_\theta$ multiplied by a scale $s$. However, this does not provide a natural way to emphasize individual parts of a single prompt. We note that prompts can be naturally decomposed, the simplest decomposition being a base prompt (the objects, verbs, and setting; "a river flows under a bridge with clear sky") and any style application ("van Gogh"). Right: each decomposed prompt receives its own guidance term and scale function. By varying the scale functions, we can walk towards a linear combination of terms, as denoted by the transparent arrows. This allows us to aim for any of the distributions represented by circles. This example depicts style interpolation with a layout from a base prompt. Note the unconditional distribution is not shown above.

3.3. Controllable walks using the guidance scale

As described in (Ho and Salimans, 2022; Liu et al., 2022), each guidance term pushes the final result towards its own distribution on the image manifold, with the magnitude of the guidance scale determining how strong the push is. With a single condition, this term is usually thought of as simply controlling how closely the image adheres to the single text prompt versus a generic image. With the same seed, different magnitudes generate similar images that increasingly adhere to the prompt (left of Fig. 4).

This same concept applies in the multiple guidance formulation, with each guidance term's scale controlling how much the generated image adheres to that condition. In the compositionality problem that (Liu et al., 2022) addresses, the goal is to generate a successful sample, defined as an image in which all the text elements are present. In contrast, our goal is to sample from a user-controlled choice of distributions, selected to explore style space. Since we can control the guidance scale differently for each condition, this provides a choice of distributions to target, as shown on the right of Figure 4. For example, by choosing to target different distributions, our formulation can walk between two different prompts (e.g. "base" and "style"), interpolating between them, moving from a style-focused distribution (red in the figure) to a base-focused distribution (blue). Interpolating between multiple style distributions produces a style interpolation; emphasizing both produces a mixture. The relative scales of the base and style distributions provide control over text adherence versus style. The addition of time-varying GSFs enables the base to control the layout of the final generation (see Sec. 3.4). Time-varying GSFs applied to DB tokens versus text allow us to control the importance of subject personalization versus text adherence, as shown in Fig. 2.
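For instance, a one-parameter walk between a base-focused and a style-focused distribution can be obtained by tying the two scales to a single interpolation factor (a sketch; the total scale of 7.5 is an illustrative constant, not a value from the paper):

```python
def walk_scales(lam, s_total=7.5):
    """Interpolate emphasis between a base prompt and a style prompt.

    lam = 0.0 targets the base-focused distribution,
    lam = 1.0 targets the style-focused distribution;
    intermediate values target mixtures in between.
    """
    s_base = (1.0 - lam) * s_total
    s_style = lam * s_total
    return [s_base, s_style]

# e.g. five points along the walk, reusing the same seed for each sample:
scale_schedule = [walk_scales(lam) for lam in (0.0, 0.25, 0.5, 0.75, 1.0)]
```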

3.4. Using GSFs for fine-grained style control

Once the prompts are decomposed as above, each guidance scale function controls the magnitude of the diffusion step for a single component. Critically, this modulation can depend on location within the image or on the timestep. Here we show examples of using GSFs to control how a style is applied to an image.

Figure 5. Examining the norm of the denoising predictions at different timesteps $t$ (panels at $t=1$, $t=0.8$, $t=0.6$, and $t=0$) suggests that the image is formed coarse-to-fine. Near $t=1$, the network edits the global image layout (low frequencies), whereas near $t=0$ the network seems to focus on texture (high frequencies). Note: the top row shows the latent at different timesteps decoded through the VAE; the bottom row shows the norm of the denoising prediction for the text-conditioned pass of the denoising network.

3.4.1. Timestep dependence: disentangling style from content

As noted by several prior works, including (Rissanen et al., 2022; Meng et al., 2021), the progressive denoising approach of diffusion models means that they implicitly create images in a coarse-to-fine manner. While this analysis does not directly apply to Stable Diffusion (Rombach et al., 2022), since it is a latent diffusion model which operates on the latent space (as opposed to pixel space), we find empirically that Stable Diffusion does form images from coarse to fine. Fig. 5 shows the norm of the guidance term, $\lVert g_\theta(x_t, t, \textrm{text}) \rVert_2$, as a heat map, indicating which parts of the image are being edited at different timesteps. We observe that at $t=1$ the entire image is edited; at $t=0.8$, some of the edges and shapes are edited; and at $t=0$ only the high-frequency textures are edited. This can be roughly observed in the decoded latent as well: first the rough location of the dog is visible, and later details are filled in.
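Heat maps in the spirit of Fig. 5 can be computed by taking the channel-wise norm of the guidance term at each step (a sketch; the `eps_model` interface and the `(B, C, H, W)` latent layout are assumptions):

```python
def guidance_norm_map(eps_model, x_t, t, cond, null_cond):
    """Per-pixel L2 norm of g_theta(x_t, t, cond), usable as an edit heat map."""
    g = eps_model(x_t, t, cond) - eps_model(x_t, t, null_cond)  # (B, C, H, W)
    return g.norm(dim=1)                                        # (B, H, W)
```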

Figure 6. Temporally-varying guidance scale functions. Using different temporally-varying GSFs for the same style guidance ("van Gogh") leads to different effects. A higher guidance scale at $t=1$ leads to layout changes relative to the Base generation; roughly speaking, this acts as a knob that lets the user control how much the style is allowed to affect the overall structure and composition of the image. Specifically, "late start" applies the higher-frequency components of a style, whereas "early start" changes both low- and high-frequency components. Base prompt: "A cat walking in a field of flowers." The exact functions used are in the supplemental. Generated with SDXL.

This observation motivates the technique shown in Figure 6. The base generation is "a cat in a field of flowers". Typically, applying a van Gogh style results in a change to the coarse layout of the generated image. This is true both when editing the prompt and when decomposing the prompt and applying a uniform guidance scale to the "van Gogh" style. Our general goal is to explore granular application of style while keeping the rough layout of the base generated image; put simply: "this composition, but as though van Gogh painted it". Our observations regarding coarse-to-fine image generation directly motivate our proposed solution. Early in generation, we guide the image primarily towards the base prompt with little to no emphasis on the style prompt; later in generation, we guide the image primarily towards the style prompt with little to no emphasis on the base prompt. This results in a composition very similar to the base but with van Gogh's signature swirly paint.

Concretely, for the style term we propose a simple parameterized GSF which increases linearly over the diffusion process, and we discuss the effects of its different parameter settings:

$s_{\textrm{style}}(t, u, v) = \max\!\left(0, \frac{m}{a}(a - t)\right)$

where conceptually $m$ is the magnitude of the scale and $a$ determines when the guidance term becomes non-zero. Since $t$ typically starts at $1$, if $a > 1$ then the guidance is active at all stages of the diffusion process and this guidance term has an "early start", non-trivially affecting the layout. If $a < 1$, then the guidance term becomes non-zero later in the diffusion process; the guidance term has a "late start" and does not affect layout or medium-frequency elements. "Base start" is $a = 1$ and acts as an intermediate or default option for this type of time-varying GSF. For the base guidance term, we simply use a linearly decreasing GSF, $s_i(t) = m \cdot t$. We experimented with other GSFs for both the base and style guidance terms, but found that for variably applying different levels of style these very simple GSFs tend to work.
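In code, these two GSFs are a few lines each (a sketch of the functions described above; the default magnitude of 7.5 is illustrative, not a value reported in the paper):

```python
def style_gsf(t, m=7.5, a=1.0):
    """s_style(t) = max(0, (m / a) * (a - t)).

    a > 1: "early start" (non-zero for all t, affects layout),
    a = 1: "base start"  (the default),
    a < 1: "late start"  (only active late, mostly high-frequency detail).
    """
    return max(0.0, (m / a) * (a - t))


def base_gsf(t, m=7.5):
    """Linearly decreasing scale for the base prompt: s_base(t) = m * t."""
    return m * t
```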

In Fig. 6, "late start" affects the layout the least and introduces relatively subtle high-frequency style elements like smooth oil paint texture. "Base start" allows changes to the low-frequency components and causes style changes to occur in the pose and the sky, where stars begin to emerge. We can let the style prompt influence the layout even more through "early start"; while this results in a bigger change in the layout of the image, e.g. the cat's pose, it also adds several classical van Gogh elements, e.g. emphasizing the stars in the background.

Additionally, our "base start" GSF behaves very differently from constant guidance when the magnitude of the scale is increased. As shown in Figure 8, applying guidance with a constant scale can lead to large image changes as the scale increases. With our default setting for the GSF, the style guidance term is applied primarily to the high-frequency components, weakening the guidance term's ability to impact the layout or composition while preserving its ability to change the texture and middle-frequency components necessary for stylization. This makes the "walk" induced by changing the magnitude of the guidance scale perceptually smoother.

3.4.2. Painting on style with spatial masks

Figure 7. Spatially-varying guidance functions. Based on the Base image, style guidance can be applied to different parts of the image. This allows users to define their own masks manually or with computed signals like bounding boxes. Generated with SDXL.

Just as we can modulate the guidance scale functions in time, we can also modulate them in pixel space, by writing $s_i = s_i(t, u, v)$. Examples are shown in the teaser figure and Figure 7. The technique flexibly allows us to generate various effects, such as fading left-to-right from Hokusai to Cyberpunk; doing the same fade radially; or using a bounding box of a semantic object (the person at the bottom right) and changing just this person into cyberpunk.
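Spatial modulation only requires the scale functions to return per-pixel maps. A left-to-right fade and a radial fade might look like the following (a sketch; the latent resolution H x W and the combination with a temporal schedule are assumptions):

```python
import torch

def left_to_right_fade(h, w):
    """Mask that fades a style in from 0 (left edge) to 1 (right edge)."""
    return torch.linspace(0.0, 1.0, w).expand(h, w)

def radial_fade(h, w):
    """Mask that is 1 at the image center and falls off towards the corners."""
    ys = torch.linspace(-1.0, 1.0, h).view(h, 1)
    xs = torch.linspace(-1.0, 1.0, w).view(1, w)
    r = torch.sqrt(ys**2 + xs**2)
    return (1.0 - r / r.max()).clamp(0.0, 1.0)

# A spatially-varying GSF then combines a temporal schedule with such a mask, e.g.
# s_style(t, u, v) = style_gsf(t) * left_to_right_fade(H, W)[u, v].
```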

4. Experiments

Figure 8. Changing the intensity of a style with different guidance scale functions. We can also select the intensity of the applied style. Changing the magnitude of the guidance scale has different effects depending on the temporal component of the GSF. Increasing the magnitude of a guidance scale without time dependence (bottom row) can change the layout of the image. Our default setting for temporal dependence preserves the rough layout of the base generation while still increasing the amount of style being added (top row). This is an additional degree of freedom that the GSFs provide. Base prompt: "A dog running on a beach". Added style prompt: "Hokusai style dog". Generated with SDXL.

4.1. Experimental settings

For our experiments, we use both Stable Diffusion 1.5 (SD1.5) and Stable Diffusion XL (SDXL), with the default DPM+ sampler and 50 denoising steps. For LoRA learning, we train using 300 screenshots capturing diverse scenes from "Breath of the Wild," with an instance prompt of "a photo of $[\mathcal{V}]$ style." The learning rate was set to $10^{-4}$ for 500 steps.

For DB training, we follow the same experimental procedure as the (Ruiz et al., 2023) paper, with some minor hyperparameter adjustments to account for using SD1.5 rather than Imagen (Saharia et al., 2022). Specifically, we train our models on 30 objects from the DreamBooth dataset (Ruiz et al., 2023) using SD1.5. The training process generates 1,000 samples from the "a photo of [class noun]" class prior, and uses a batch size of 1, $\lambda = 1$, a learning rate of $10^{-6}$, and a total of 1,500 iterations.
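For reference, the fine-tuning settings above can be summarized as follows (values taken from the text; the dictionary layout itself is only for illustration):

```python
FINETUNE_SETTINGS = {
    "lora_style": {
        "base_models": ["SD1.5", "SDXL"],
        "train_images": 300,                # "Breath of the Wild" screenshots
        "instance_prompt": "a photo of [V] style",
        "learning_rate": 1e-4,
        "steps": 500,
    },
    "dreambooth": {
        "base_model": "SD1.5",
        "subjects": 30,                     # objects from the DreamBooth dataset
        "prior_samples": 1000,              # from "a photo of [class noun]"
        "batch_size": 1,
        "prior_loss_weight": 1.0,           # lambda
        "learning_rate": 1e-6,
        "steps": 1500,
    },
}
```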

4.2. Walks in Style and Personalization

In Figures 10 and 11, we show straightforward applications of style to a given layout. These examples use different architectures (SD1.5 and SDXL) and a diverse set of styles, including a LoRA-trained style for the popular video game "Breath of the Wild". We also demonstrate in Figure 14 that our technique is not limited to one or even two styles and can be used with an arbitrary number of guidance terms, mixing 4 distinct styles globally in the same image, taking some aspects from each.

Figure 13 shows that we can control the amount of style applied to a base image while preserving the rough layout, even when using a style trained with LoRA on SD1.5. This example applies the style guidance term over the full image but uses the "default" option for the temporal aspect of the guidance scale function. The top-left image is the base generation without any stylization applied; images lower in the grid have more "Breath of the Wild" style applied, and images further to the right have more "Qi-Bashi calligraphy style". Figure 2 shows the same capability for personalization on SD1.5, amplifying either the subject (the specific type of dog) or the action and setting. Note that amplifying both still preserves the subject while forcing a change in pose / action. In Figure 15, we show that both of these terms can be applied at once: we can take a DB subject and incrementally add style. Despite the addition of the style, the guidance term for personalization preserves the shape and color of the subject.

Our method also allows users to interpolate between multiple styles. In Figures 9 and 12, we show smooth interpolation between two different styles for models with very different architectures, SD1.5 and SDXL. We also report a user study on interpolation in the following section.

4.3. Baselines

While stylization is a popular topic in Computer Vision, much of the prior work focused on walking in the latent space of GANs.

In the process of interpolating styles, we benchmark against two widely adopted techniques in the community, employing the compel library (Stewart, 2023): prompt blending and prompt weighting. To achieve blending, such as between Picasso and Hokusai, we assign weights $\lambda$ and $1-\lambda$ to the prompts "Artist painting vivid sunset on beach canvas in the style of Picasso" and "Artist painting vivid sunset on beach canvas in the style of Hokusai," respectively, with $\lambda \in [0, 1]$. For prompt weighting, we consolidate the two prompts into a single prompt, "Artist painting vivid sunset on beach canvas in the style of Picasso and Hokusai," and apply weights $\lambda$ and $1-\lambda$ to the terms "Picasso" and "Hokusai," respectively, where $\lambda \in [0, k]$. We observe that when $k=1$ the change is nearly imperceptible, prompting our choice of $k=1.5$ for a more distinct transition. The outcomes, displayed in Figure 9, reveal that both baseline methods yield more abrupt transitions. Notably, in certain instances, prompt blending leads to the complete loss of style, a phenomenon we term the "style-free valley," which is not observed with DreamWalk.
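Conceptually, the prompt-blending baseline interpolates the two prompt conditionings before a single guided diffusion pass. A generic sketch of that idea (not the compel API; `encode_prompt` is a hypothetical text-encoder wrapper):

```python
def blended_condition(encode_prompt, prompt_a, prompt_b, lam):
    """Weighted blend of two prompt embeddings, with lam in [0, 1]."""
    emb_a = encode_prompt(prompt_a)  # e.g. "... in the style of Picasso"
    emb_b = encode_prompt(prompt_b)  # e.g. "... in the style of Hokusai"
    return lam * emb_a + (1.0 - lam) * emb_b
```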

4.4. User study

Since there is no clear source of ground truth for our task, we conducted a user study to determine whether the results we generate are viewed as aesthetically pleasing. Users were presented with a start image and an end image, along with two possible series of images showing the transition from the start to the end, similar to Figure 9. One series was generated by the Compel baseline and one by DreamWalk; we randomly picked the layout in which the two series were displayed. Users were asked which series produced a "more pleasingly smooth transition". Our user study closely followed (Ruiz et al., 2023). Users preferred the DreamWalk output 68% of the time.

5. Limitations and Ethics

One of the main limitations of this technique is global entanglement, as discussed in Section 3.2. While we show that this can be mitigated through prompt engineering and guidance scale functions, not all correlations can be prevented. Additionally, at inference time our method requires one forward pass per prompt component, unlike fine-tuning methods (although the latter of course require the actual fine-tuning time). We also acknowledge that this work, along with most generative media, comes with a degree of ethical concern. Better control over generated images can increase the risk of these tools being used by bad actors.

6. Conclusion

We have presented DreamWalk, a generalized guidance formulation designed for fine-grained control and personalization of text-to-image generation. This method allows granular control over the amount of style applied, or over adherence to a DB token or LoRA. We have empirically shown the effectiveness of this method on several tasks, including style interpolation, DB sampling, changing materials, and granularly manipulating the texture and layout of generated images.

6.1. Acknowledgments

The portion of this work performed at Cornell has been supported by a gift from the Simons Foundation. We thank Yuanzhen Li for coming up with the project name, and Miki Rubinstein and Deqing Sun for helpful discussions.

Figure 9. Baseline comparisons. In each panel, prompt blending (Face, [n. d.]) is shown at the top and DreamWalk below. Prompt blending makes an abrupt transition while DreamWalk changes smoothly; prompt blending also sometimes drops both styles altogether. Generated with SD1.5.
Figure 10. Our style application works across a wide variety of styles, including those already defined in the model's text space and those trained using LoRA. Panel styles, top row: Breath of the Wild LoRA, Hokusai, Pixel Art, Watercolor; bottom row: Magritte, Monet, Van Gogh, Qi Bashi. Prompt: "A macaron in Banff National Park". Generated with SD1.5.
Figure 11. Our style application works on any diffusion model regardless of architecture differences. Panel styles, top row: Pixel Art, Monet, Qi Bashi, Hokusai; bottom row: Picasso, van Gogh, Magritte, Watercolor. Prompt: "A peaceful riverside village with charming old cottages". Generated with SDXL.
Figure 12. Our method can interpolate between two different styles while maintaining the rough layout of the scene. The top row interpolates from Van Gogh to Monet ("a campsite with a fire at night"); the bottom row interpolates from Van Gogh to Picasso ("a dog running on a beach"). Generated with SDXL.

References

  • Avrahami et al. (2023) Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. 2023. Break-A-Scene: Extracting Multiple Concepts from a Single Image. arXiv preprint arXiv:2305.16311 (2023).
  • Azadi et al. (2018) Samaneh Azadi, Matthew Fisher, Vladimir G Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-content gan for few-shot font style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7564–7573.
  • Bar-Tal et al. (2023) Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. 2023. Multidiffusion: Fusing diffusion paths for controlled image generation. (2023).
  • Barnes et al. (2009) Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 28, 3 (2009), 24.
  • Chen et al. (2023) Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. 2023. Subject-driven Text-to-Image Generation via Apprenticeship Learning. arXiv preprint arXiv:2304.00186 (2023).
  • Efros and Leung (1999) Alexei A Efros and Thomas K Leung. 1999. Texture synthesis by non-parametric sampling. In Proceedings of the seventh IEEE international conference on computer vision, Vol. 2. IEEE, 1033–1038.
  • Face ([n. d.]) Hugging Face. [n. d.]. Prompt weighting. https://huggingface.co/docs/diffusers/using-diffusers/weighted_prompts.
  • Gal et al. (2022) Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022).
  • Gatys et al. (2015) Leon Gatys, Alexander S Ecker, and Matthias Bethge. 2015. Texture synthesis using convolutional neural networks. Advances in neural information processing systems 28 (2015).
  • Haeberli (1990) Paul Haeberli. 1990. Paint by numbers: Abstract image representations. In Proceedings of the 17th annual conference on Computer graphics and interactive techniques. 207–214.
  • Härkönen et al. (2020) Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. 2020. Ganspace: Discovering interpretable gan controls. Advances in neural information processing systems 33 (2020), 9841–9850.
  • Heeger and Bergen (1995) David J Heeger and James R Bergen. 1995. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. 229–238.
  • Hertz et al. (2022) Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
  • Hertzmann (2003) Aaron Hertzmann. 2003. A survey of stroke-based rendering. Institute of Electrical and Electronics Engineers.
  • Hertzmann et al. (2023) Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. 2023. Image analogies. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2. 557–570.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning. PMLR, 2790–2799.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Jahanian et al. (2019) Ali Jahanian, Lucy Chai, and Phillip Isola. 2019. On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171 (2019).
  • Jing et al. (2019) Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. 2019. Neural style transfer: A review. IEEE transactions on visualization and computer graphics 26, 11 (2019), 3365–3385.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4396–4405. https://doi.org/10.1109/CVPR.2019.00453
  • Kumari et al. (2023) Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. 2023. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1931–1941.
  • Lee et al. (2023) Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. 2023. Syncdiffusion: Coherent montage via synchronized joint diffusions. arXiv preprint arXiv:2306.05178 (2023).
  • Li et al. (2023) Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 2023. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22511–22521.
  • Litwinowicz (1997) Peter Litwinowicz. 1997. Processing images and video for an impressionist effect. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques. 407–414.
  • Liu et al. (2022) Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. 2022. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision. Springer, 423–439.
  • Ma et al. (2023) Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. 2023. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410 (2023).
  • Meng et al. (2021) Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2021. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021).
  • Park et al. (2023) Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. 2023. Understanding the latent space of diffusion models through the lens of riemannian geometry. arXiv preprint arXiv:2307.12868 (2023).
  • Patashnik et al. (2021) Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2085–2094.
  • Portilla and Simoncelli (2000) Javier Portilla and Eero P Simoncelli. 2000. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision 40 (2000), 49–70.
  • Preechakul et al. (2022) Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. 2022. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10619–10629.
  • Rashtchian et al. (2023) Cyrus Rashtchian, Charles Herrmann, Chun-Sung Ferng, Ayan Chakrabarti, Dilip Krishnan, Deqing Sun, Da-Cheng Juan, and Andrew Tomkins. 2023. Substance or Style: What Does Your Image Embedding Know? arXiv preprint arXiv:2307.05610 (2023).
  • Rissanen et al. (2022) Severi Rissanen, Markus Heinonen, and Arno Solin. 2022. Generative modelling with inverse heat dissipation. arXiv preprint arXiv:2206.13397 (2022).
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  • Ruiz et al. (2023) Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22500–22510.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479–36494.
  • Salisbury et al. (2023) Michael P Salisbury, Sean E Anderson, Ronen Barzel, and David H Salesin. 2023. Interactive pen-and-ink illustration. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2. 393–400.
  • Shen et al. (2020) Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. 2020. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9243–9252.
  • Shi et al. (2023) Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. 2023. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411 (2023).
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256–2265.
  • Sohn et al. (2023) Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, et al. 2023. StyleDrop: Text-to-Image Generation in Any Style. arXiv preprint arXiv:2306.00983 (2023).
  • Stewart (2023) Damian Stewart. 2023. Compel. https://github.com/damian0815/compel.
  • Tewari et al. (2020) Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. 2020. Stylerig: Rigging stylegan for 3d control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6142–6151.
  • Voynov and Babenko (2020) Andrey Voynov and Artem Babenko. 2020. Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning. PMLR, 9786–9796.
  • Wu et al. (2023a) Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. 2023a. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1900–1910.
  • Wu et al. (2023b) Yankun Wu, Yuta Nakashima, and Noa Garcia. 2023b. Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. 199–208.
  • Wu et al. (2021) Zongze Wu, Dani Lischinski, and Eli Shechtman. 2021. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12863–12872.
  • Yang et al. (2022) Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Yingxia Shao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2022. Diffusion models: A comprehensive survey of methods and applications. arXiv preprint arXiv:2209.00796 (2022).
  • Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models.