StyleWallfacer: Are We Really Meant to Confine the Imagination in Style Transfer?
Abstract
In the realm of image style transfer, existing algorithms that rely on a single reference style image face formidable challenges: severe semantic drift, overfitting, a limited color domain, and the lack of a unified framework. These issues impede the generation of high-quality, diverse, and semantically accurate images. In this study, we introduce StyleWallfacer, a unified training and inference framework that not only addresses the issues encountered by traditional style transfer methods but also unifies different style transfer tasks, enabling artist-level style transfer and text-driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. A large language model then removes the style-related phrases from these descriptions, creating a semantic gap that is used to fine-tune the model and inject style knowledge efficiently and without semantic drift. Second, we propose a data augmentation strategy based on human feedback, which adds high-quality samples generated early in fine-tuning to the training set to enable progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process built on the fine-tuned model, which manipulates the features of the self-attention layers in a manner similar to the cross-attention mechanism: during generation, the key and value of the content-related process are replaced with those of the style-related process to inject style while retaining text control over the model. We also introduce query preservation to mitigate disruption of the original content. Under this design, we achieve high-quality image-driven style transfer and text-driven stylization, delivering artist-level results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time, further pushing the boundaries of controllable image generation and editing and breaking the limitations imposed by reference images on style transfer. Experimental results demonstrate that our method outperforms state-of-the-art approaches.

1 Introduction
Art encapsulates the essence of human civilization, epitomizing our imagination and creativity, and has yielded innumerable masterpieces. Online, you may encounter a painting that profoundly affects you, yet find it hard to describe the artist's unique style or to locate similar works. This highlights a key problem in image generation: style transfer.
Recently, numerous works have studied this problem, and they fall into three main categories: text-driven style transfer rout2025rbmodulation ; DBLP:conf/nips/SohnJBLRKCLERHE23 , image-driven style transfer DBLP:conf/cvpr/ChungHH24 ; DBLP:conf/iccv/LiuLHLWLSLD21 ; DBLP:conf/iccv/HuangB17 , and text-driven stylization jiang2024artist ; DBLP:conf/cvpr/BrooksHE23 ; Tumanyan_2023_CVPR . The mainstream approach to image-driven style transfer is to decouple the style and content information of a reference style image and inject the style information as an additional condition into the model's generation process DBLP:conf/cvpr/ChenTH24 ; DBLP:journals/corr/abs-2407-00788 , enabling the model to generate new content that resembles the reference style image in texture and color. Alternatively, a unique identifier can be used to characterize the style of the style image, and the model can be fine-tuned to learn the new stylistic knowledge for text-driven style transfer DBLP:conf/cvpr/RuizLJPRA23 ; DBLP:conf/iccv/HanLZMMY23 , allowing the model to recognize and generate images of the corresponding style via the identifier. For text-driven stylization, most methods blend the pre-trained model's prior style knowledge with the texture of the target image to obtain the final result jiang2024artist . However, as shown in Figure 1 (c), these models generally suffer from the following issues:
Limited color domain: Both the style-content disentanglement-based approaches and the identifier-based fine-tuning approaches commonly suffer from a restricted color domain: the color distribution of the generated images is entirely tied to that of the single reference style image. Yet, taking Van Gogh's paintings DBLP:conf/cvpr/OjhaLLELS021 in Figure 1 (a) as an example, great artists are by no means limited to the color palette of a single artwork (Figure 1 (b)), so such generation results are unreasonable. For more detailed visualization results and discussions, please refer to Appendix D.1.
Failure of text guidance: Due to architectural flaws in the style-content disentanglement-based methods and the mismatch between text and image during style injection in the identifier-based fine-tuning methods, models exhibit significant semantic drift, i.e., the generated image becomes inconsistent with or deviates from the input text prompt of the T2I model. This not only leads to chaotic generation but also costs the model its ability to handle complex text prompts. For more detailed visualization results and discussions, please refer to Appendix LABEL:FTG.
Risk of overfitting: Due to the extremely limited number of training samples, traditional approaches are generally prone to overfitting, which results in a loss of structural diversity in the generated content. For more detailed visualization results and discussions, please refer to Appendix D.2.
Lack of a unified framework: Due to the significant differences between various style transfer tasks, most existing style transfer methods are only capable of handling one specific task, and there is a lack of a unified framework to integrate these tasks.
These problems, much like the "sophons" in "The Three-Body Problem" TheThreeBodyProblem that restrict human technological progress, confine people's imagination of style transfer. In fact, truly good style transfer lets the model learn and imitate the artist's creative style rather than mechanically copying the textures of reference images, thus achieving true artistic creation. Just as humanity in "The Three-Body Problem" uses the "Wallfacer Plan" to break the technological blockade imposed by the "sophons", this paper proposes a novel unified style transfer framework, called StyleWallfacer, to break the corresponding limitations in style transfer. It consists of three main components:
Firstly, a style knowledge injection method based on semantic differences is proposed (Figure 2 (a)). By using BLIP DBLP:conf/icml/0001LXH22 to generate text descriptions that are strictly aligned with the target style image in CLIP space DBLP:conf/icml/RadfordKHRGASAM21 , and then leveraging an LLM to remove the style-related descriptions, a semantic gap is created. This gap allows the model to preserve its prior knowledge as much as possible during training and to focus solely on learning the style information. As a result, the model captures the most fundamental stylistic elements of the style image (e.g., the artist's brushstrokes). As shown in Figure 1 (d), this not only enables the generation of new samples with rich and diverse colors but also preserves the model's ability to handle complex text prompts.
Secondly, a progressive learning method based on human feedback (HF) is employed (Figure 5). At the beginning of model training, the model is trained using a single sample. During the training process, users are allowed to select high-quality samples generated by the model and add them to the training set. This effectively expands the single-sample dataset and significantly mitigates overfitting of the model.
Thirdly, we propose a brand-new training-free triple "style-structure" diffusion process (Figure 2 (b) and (b1)). It exploits the impact of different noise thresholds on the model's generation: the diffusion process with a smaller noise threshold serves as the main process to preserve the content information of the original image, while the diffusion process with a larger noise threshold serves as the style guidance process, from whose self-attention layers the Key and Value are extracted to replace the Key and Value in the main diffusion process. Meanwhile, the initial noise of the image to be transferred is obtained through DDIM inversion DBLP:conf/iclr/SongME21 , and the Query from the diffusion process of this inverted noise is fused with the Query in the main diffusion process, serving as structural guidance. The pre-trained style LoRA DBLP:conf/iclr/HuSWALWWC22 is used as a style guide to direct the model to conduct image-driven style transfer, achieving the artist-level results shown in Figure 1 (e) and (f). During generation, the text prompt is employed as a condition and, combined with the aforementioned structure, also enables color editing during the style transfer process, as shown in Figure 1 (g).
Our main contributions are summarized as follows:
(1) We propose the first unified style transfer framework that achieves high-quality style transfer across different tasks. Meanwhile, for the first time, it enables text-based color editing during the style transfer process.
(2) We propose a style knowledge injection method based on semantic differences, which achieves efficient style knowledge injection without affecting the model’s semantic space and suppresses semantic confusion during the style injection process.
(3) We propose a progressive learning method based on human feedback for few-shot datasets, which alleviates the model overfitting caused by insufficient data and significantly improves the generation quality after model training.
(4) We propose a novel training-free triple diffusion process that achieves artist-level style transfer results while retaining the control ability of text prompts over the generation results, and for the first time enables color editing during the style transfer process.
(5) Our experiments demonstrate that the proposed method in this paper addresses many issues encountered by traditional methods during style transfer, achieving artist-level style generation results rather than merely texture blending, and delivering state-of-the-art performance.
2 StyleWallfacer

2.1 Overall Architecture of StyleWallfacer
As shown in Figure 2, StyleWallfacer mainly consists of two parts: First is the semantic-based style learning strategy, which aims to guide the model to learn the most essential style features in artworks based on the semantic differences between images and their text descriptions during the model fine-tuning process, truly helping the model understand the artist’s creative style. It also employs a data augmentation method based on human feedback to suppress overfitting when the model is fine-tuned on a single image, thereby achieving realistic text-driven style transfer.
The second part is the training-free triple diffusion process, which is built on the previously fine-tuned LoRA weights. It comprises three newly designed pipelines tailored to different style transfer problems. By adjusting the self-attention layers of three denoising networks that share weights, it achieves high-quality style control, yielding artist-level style transfer and, for the first time, allowing text prompts to control image colors during the style transfer process. This resolves the traditional methods' shortcomings of monochromatic colors, simple textures, and lack of text control when transferring styles from a single image.
2.2 Semantic-based Style Learning Strategy
The semantic-based style learning strategy primarily aims to fine-tune text-to-image (T2I) models using their native "language", so that they better comprehend the knowledge humans intend them to learn during fine-tuning. Taking Stable Diffusion as an example, there is a significant discrepancy between the image semantics understood by the pre-trained CLIP and the intuitive human understanding of image semantics. Therefore, to better "communicate" with the pre-trained T2I model during fine-tuning, this paper reverse-engineers the semantic information of an image in the CLIP space through BLIP DBLP:conf/icml/0001LXH22 :
$p = \mathrm{BLIP}(x)$  (1)
where $x$ is the style image and $p$ denotes the image prompt derived through BLIP.
Although such methods enable us to obtain the semantic information corresponding to an image in the CLIP space, this text description cannot be directly employed in the fine-tuning process. This is because the description encompasses all information pertaining to the image, including content, style, and other details understood by CLIP. Utilizing this comprehensive description for fine-tuning still results in the model’s inability to comprehend human fine-tuning intentions, thereby preventing it from learning the stylistic information of the dataset.
Therefore, our StyleWallfacer transforms $p$ by creating a semantic discrepancy among the descriptions. A large language model is employed to perform subtle semantic edits on $p$, selectively removing the descriptions related to image style:
$\hat{p} = \mathrm{LLM}(p)$  (2)
where $\hat{p}$ denotes the text description after removing the style information, and LLM stands for the large language model.
After such processing, we obtain the image and its corresponding text description in the CLIP space, from which stylistic descriptions have been removed. As shown in Figure 2 (a), fine-tuning a pre-trained T2I model using these image-text pairs enables it to focus more effectively on understanding stylistic information, thereby circumventing unnecessary semantic drift.
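As a concrete illustration, the snippet below sketches how the caption-then-strip pipeline of Eqs. (1) and (2) could be implemented with off-the-shelf components. The BLIP checkpoint name, the STRIP_PROMPT wording, and the strip_style_terms helper are our own illustrative assumptions rather than the released implementation.

```python
# Hypothetical sketch of the caption -> style-stripped caption pipeline of Section 2.2.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def blip_caption(image_path: str) -> str:
    """Eq. (1): reverse-engineer the image's semantics in CLIP space via BLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

STRIP_PROMPT = (
    "Remove every word or phrase that describes artistic style, medium, "
    "brushwork, or color treatment from this caption, keeping only the "
    "content description. Caption: '{caption}'"
)

def strip_style_terms(caption: str) -> str:
    """Eq. (2): ask an instruction-tuned LLM to delete style-related phrases.
    Placeholder: call whatever LLM client you use with STRIP_PROMPT."""
    raise NotImplementedError(STRIP_PROMPT.format(caption=caption))

# Example usage (the image path is hypothetical):
#   p = blip_caption("vangogh_house.jpg")   # e.g. "an oil painting of a house ..."
#   p_hat = strip_style_terms(p)            # style words removed; used for fine-tuning
```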
2.3 Training-free Triple Diffusion Process
After fine-tuning, the model has essentially learned the most fundamental style knowledge from the reference style image. Therefore, how to activate this knowledge so that it can be utilized for image-driven style transfer has become an extremely critical issue.
Unlike traditional one-shot style transfer algorithms that require the reference style image as input during style transfer, we aim to rely solely on the pre-trained style LoRA obtained in Section 2.2 for style transfer. Therefore, we cannot adopt a method similar to StyleID DBLP:conf/cvpr/ChungHH24 to manipulate the features in the self-attention layer as if they were cross-attention features, with the features from the style image serving as the condition for style injection.

However, as shown in Figure 3, we observe that when the noisy latent is initialized from the original image and denoised with the U-Net, a larger noise schedule threshold $T$ yields a more stylized image that loses the original content and retains only its basic semantics, whereas a smaller threshold $T$ makes the generation preserve more of the original content at the cost of diversity and stylization.
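The following minimal sketch, our own illustration rather than the paper's code, shows how such a threshold can be realized in practice: the clean latent is noised up to a chosen depth of the schedule before denoising begins, much like the strength knob of common img2img pipelines. The scheduler checkpoint name and the threshold values are assumptions.

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")  # assumed SD 1.5 base
scheduler.set_timesteps(50)

def noisy_start(z0: torch.Tensor, threshold: float):
    """Noise the clean latent z0 up to a fraction `threshold` of the schedule
    (0 = almost no noise, 1 = pure Gaussian) and return the noised latent
    together with the timesteps that remain to be denoised."""
    t_start = int(len(scheduler.timesteps) * (1.0 - threshold))
    timesteps = scheduler.timesteps[t_start:]            # remaining denoising steps
    noise = torch.randn_like(z0)
    zT = scheduler.add_noise(z0, noise, timesteps[:1])   # jump straight to depth T
    return zT, timesteps

# z0 would come from the VAE encoder of the fine-tuned SD model, e.g.
#   z0 = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
z0 = torch.randn(1, 4, 64, 64)                           # stand-in latent for illustration
z_content, steps_c = noisy_start(z0, threshold=0.3)      # smaller T_c: keeps content
z_style,   steps_s = noisy_start(z0, threshold=0.8)      # larger  T_s: more stylized
```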
We therefore ask: can we fully exploit this characteristic by employing a diffusion process with a smaller threshold $T_c$ as the main diffusion process and a diffusion process with a larger threshold $T_s$ as the style guidance process? Meanwhile, we can use the latent obtained through DDIM inversion as the noisy latent of a third diffusion process and harness the residual information from its denoising as content guidance. In this way, we aim to achieve high-quality style transfer while preserving the image content.
To this end, as shown in Figure 2 (b), we first use the VAE encoder to map the image to be transferred from pixel space to latent space, obtaining the clean latent $z_0$. With the larger noise schedule threshold $T_s$, we add noise to $z_0$ to obtain $z_{T_s}$; with the smaller threshold $T_c$, we similarly obtain $z_{T_c}$. In addition, we use DDIM inversion to invert $z_0$ into Gaussian noise $z_{inv}$. Then, using the same denoising U-Net, we denoise $z_{T_s}$, $z_{T_c}$, and $z_{inv}$ respectively. As shown in Figure 2 (b1), throughout the denoising of the main latent $z_{T_c}$, we transfer style by injecting the Key $K^{s}$ and Value $V^{s}$ collected from the denoising of $z_{T_s}$ into the self-attention layers in place of the original Key $K^{c}$ and Value $V^{c}$. However, this substitution alone can disrupt the content, since the content of the representation is progressively altered as the attended values change.
Consequently, we propose a query preservation mechanism to retain the original content. As shown in Figure 2 (b1), we fuse the Query $Q^{inv}$ from the denoising of the DDIM-inverted latent with the original Query $Q^{c}$ to obtain $\tilde{Q}$, and inject $\tilde{Q}$ into the main denoising process in place of $Q^{c}$. The style injection, query preservation, and structural residual injection at time step $t$ are expressed as follows:
$\tilde{Q}_t = \gamma\, Q^{c}_t + (1-\gamma)\, Q^{inv}_t, \qquad \tilde{K}_t = K^{s}_t, \qquad \tilde{V}_t = V^{s}_t$  (3)
$\phi^{cs}_t = \sigma(\phi^{s}_t)\,\frac{\phi_t - \mu(\phi_t)}{\sigma(\phi_t)} + \mu(\phi^{s}_t), \qquad \phi_t = \mathrm{Attn}(\tilde{Q}_t, \tilde{K}_t, \tilde{V}_t)$  (4)
where $\gamma \in [0,1]$ is the query fusion weight. $\mu(\cdot)$, $\sigma(\cdot)$ and $\phi_t$ denote the channel-wise mean, the channel-wise standard deviation, and the result of the self-attention calculation after replacement, respectively, while $\phi^{s}_t$ is the corresponding self-attention output of the style guidance process. In addition, we apply these operations on the decoder of the U-Net in SD. We also highlight that the proposed method can adjust the degree of style transfer by changing the noise schedule thresholds $T_c$ and $T_s$: lower $T_c$ and $T_s$ preserve more content, while higher $T_c$ and $T_s$ strengthen the style transfer effect.
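The PyTorch fragment below is a minimal sketch of the style injection and query preservation steps in Eqs. (3) and (4), meant only to make the attention manipulation concrete. The tensor shapes, the default gamma, and the caching of the style branch's attention output are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def styled_self_attention(
    q_c: torch.Tensor,    # Query of the main (content) branch, shape (B, heads, N, d)
    q_inv: torch.Tensor,  # Query cached from the DDIM-inversion branch, same shape
    k_s: torch.Tensor,    # Key cached from the style branch (larger threshold T_s)
    v_s: torch.Tensor,    # Value cached from the style branch
    phi_s: torch.Tensor,  # self-attention output of the style branch at this layer/step
    gamma: float = 0.7,   # query-preservation weight (assumed value)
) -> torch.Tensor:
    # Eq. (3): query preservation and Key/Value replacement
    q_tilde = gamma * q_c + (1.0 - gamma) * q_inv
    phi = F.scaled_dot_product_attention(q_tilde, k_s, v_s)  # attention after replacement

    # Eq. (4): renormalize the replaced attention output toward the style branch's
    # channel-wise statistics (AdaIN-like), computed over the token dimension.
    mu = lambda x: x.mean(dim=-2, keepdim=True)
    sd = lambda x: x.std(dim=-2, keepdim=True) + 1e-6
    return sd(phi_s) * (phi - mu(phi)) / sd(phi) + mu(phi_s)
```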
2.4 Data Augmentation for Small Scale Datasets Based on Human Feedback
Although this paper proposes a more robust style knowledge injection method than DreamBooth DBLP:conf/cvpr/RuizLJPRA23 in Section 2.2, fine-tuning models with a single sample remains challenging. Therefore, inspired by reinforcement learning from human feedback (RLHF) shen2025exploringdatascalingtrends , this paper proposes a human feedback-based data augmentation method for small-scale datasets to compensate for the insufficiency of the dataset and mitigate overfitting.

Specifically, when the model is first trained on a single style image, as shown in Figure 4, the injected style knowledge does not generalize well to all of the model's prior knowledge in the early stages of training, and some generated results do not consistently match the reference style, i.e., the injection of style knowledge is asynchronous. Nevertheless, many excellent samples still appear among the model's outputs. These samples emerge because the prior knowledge they represent is close to the training style image in the CLIP space, which makes it easier for the model to transfer style knowledge to them during training. It is therefore possible to select samples that meet the style requirements from the large number of generated samples and add them to the training set for further fine-tuning.

To this end, as shown in Figure 5, we divide the fine-tuning process into three stages. In the first stage, the single-sample fine-tuning stage, after training the model we generate a large number of new samples using a text prompt that reflects the basic semantics of the reference image, manually select the 50 samples that best match the stylistic features of the reference image, and add them to the training set for the second stage of fine-tuning. The second stage follows the same idea and expands the training set from 50 to 100 images. Finally, we fine-tune the model on these 100 samples to obtain the final style LoRA.
Through this data augmentation strategy, the overfitting phenomenon of the model during the fine-tuning process is greatly alleviated. The fine-tuned model is able to generate more diverse results and generalize the style knowledge to all the prior knowledge of the model, rather than being limited to a single sample.
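Schematically, the three-stage loop can be summarized as below. finetune_lora, generate_samples, and human_select are hypothetical placeholders for the actual LoRA training script, sampling script, and manual review step, and the candidate count is illustrative.

```python
def finetune_lora(train_set, prompt):
    """Placeholder: run LoRA fine-tuning on `train_set` with caption `prompt`."""
    raise NotImplementedError

def generate_samples(lora, prompt, n):
    """Placeholder: sample n images from the model with the style LoRA attached."""
    raise NotImplementedError

def human_select(candidates, k):
    """Placeholder: a human picks the k samples closest to the reference style."""
    raise NotImplementedError

def progressive_hf_finetuning(style_image, prompt):
    train_set = [style_image]                       # stage 1: single-sample fine-tuning
    for target_size in (50, 100):                   # stages 2 and 3 enlarge the set
        lora = finetune_lora(train_set, prompt)
        candidates = generate_samples(lora, prompt, n=1000)   # assumed sample budget
        train_set += human_select(candidates, k=target_size - len(train_set))
    return finetune_lora(train_set, prompt)         # final style LoRA
```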
3 Experiments
3.1 Experimental Setting
Baselines Our baselines for one-shot text-driven style transfer include DreamBooth DBLP:conf/cvpr/RuizLJPRA23 , a LoRA DBLP:conf/iclr/HuSWALWWC22 version of DreamBooth, Textual Inversion DBLP:conf/iclr/GalAAPBCC23 , and SVDiff DBLP:conf/iccv/HanLZMMY23 . For the text-driven stylization task, the selected baselines include Artist jiang2024artist , InstructPix2Pix DBLP:conf/cvpr/BrooksHE23 , and Plug-and-play (PnP) Tumanyan_2023_CVPR . The baselines for one-shot image-driven style transfer include StyleID DBLP:conf/cvpr/ChungHH24 , AdaAttn DBLP:conf/iccv/LiuLHLWLSLD21 , AdaIN DBLP:conf/iccv/HuangB17 , AesPA-Net DBLP:conf/iccv/HongJLAKLKUB23 , and InstantStyle-Plus DBLP:journals/corr/abs-2407-00788 .
Datasets We selected one image from each of three widely used 10-shot datasets, namely landscapes XiaogangWang_XiaoouTang_2009 , Van Gogh houses DBLP:conf/cvpr/OjhaLLELS021 , and watercolor dogs DBLP:journals/corr/abs-2306-00763 , to form our one-shot datasets for quantitative evaluation. To test our model, we first used FLUX flux2024 to generate 1,000 images of houses, 1,000 images of dogs, and 1,000 images of mountains from the prompts "a photo of a house", "a photo of a dog", and "a photo of a mountain", respectively. These images served as the style-free images to be transferred.
Metric For image style similarity, we compute CLIP-FID DBLP:conf/cvpr/Parmar0Z22 , the CLIP-I score, the CLIP-T score, and the DINO score DBLP:conf/iclr/0097LL000NS23 between 1,000 generated samples and the full few-shot datasets. For image content similarity, we compute LPIPS DBLP:conf/cvpr/Parmar0Z22 between the 1,000 samples and the source images to evaluate how well the style-transferred images preserve the original content. The intra-cluster LPIPS Ojha_Li_Lu_Efros_JaeLee_Shechtman_Zhang_2021 ; DBLP:conf/cvpr/ZhangIESW18 of the 1,000 samples is also reported as a standalone diversity metric.
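For reference, the following sketch shows one common way such metrics are computed; we assume it resembles, but need not match, the paper's evaluation script. It computes CLIP-I as the mean cosine similarity of CLIP image embeddings and LPIPS as the perceptual distance between a transferred image and its source.

```python
import torch
import lpips
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance network

@torch.no_grad()
def clip_i(generated: list, references: list) -> float:
    """Mean pairwise cosine similarity between CLIP embeddings of two image sets."""
    g = clip.get_image_features(**clip_proc(images=generated, return_tensors="pt"))
    r = clip.get_image_features(**clip_proc(images=references, return_tensors="pt"))
    g = g / g.norm(dim=-1, keepdim=True)
    r = r / r.norm(dim=-1, keepdim=True)
    return (g @ r.T).mean().item()

to_tensor = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])

@torch.no_grad()
def lpips_content(result: Image.Image, source: Image.Image) -> float:
    """LPIPS between a style-transferred image and its source (inputs in [-1, 1])."""
    x = to_tensor(result).unsqueeze(0) * 2 - 1
    y = to_tensor(source).unsqueeze(0) * 2 - 1
    return lpips_fn(x, y).item()
```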
Detail For other details of the experiments and StyleWallfacer, please refer to Appendix B.
3.2 Qualitative Comparison


One-shot Text-driven Style Transfer Experimental Qualitative Results. As depicted in Figure 6, StyleWallfacer outperforms other methods in generating diverse and semantically accurate images. Unlike other methods that suffer from overfitting and semantic drift when trained on single-style images, StyleWallfacer employs multi-stage progressive learning with human feedback to reduce overfitting and enhance diversity. It also avoids identifiers for style injection, minimizing semantic drift and enabling precise style generation based on prompts.
Text-driven Stylization Experimental Qualitative Results. As shown in Figure 7, although the other methods complete the style transfer task, their results are far from the painter's authentic style and fall short of the expected level. In contrast, StyleWallfacer achieves the best balance between style transfer and content preservation: the transferred images not only closely match the painter's authentic style but also exhibit finer details and a high degree of fidelity to the original image content.
One-shot Image-driven Style Transfer Experimental Qualitative Results.

As shown in Figure 8 (a), we present a visual comparison between the style transfer results of StyleWallfacer and those of other methods. Clearly, StyleWallfacer achieves the best results in terms of both image structure preservation and style transfer. Compared with the other methods, StyleWallfacer truly realizes style transfer, as if the painter himself had redrawn the original image in his own style, rather than simply blending the textures and colors of the original and reference images. Moreover, the results generated by StyleWallfacer feature more refined texture details, whereas other methods generally suffer from noise and artifacts.
One-shot Image-driven Style Transfer & Color Edit Experimental Qualitative Results. As depicted in Figure 8 (b), the visualization results of image-driven style transfer and color editing are presented. Analysis of the figure reveals that the proposed method in this paper not only accomplishes style transfer but also retains the model’s controllability via text prompts. This enables synchronous guidance of the model’s generation process by both "text and style", thereby enhancing controllability. Moreover, the images obtained after style transfer maintain a high degree of content consistency with the original images, achieving a better balance between generation diversity and controllability.
3.3 Quantitative Comparison
Method | Landscapes (one-shot) | Van Gogh Houses (one-shot) | Watercolor Dogs (one-shot) | ||||||||||||
CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | I-LPIPS ↑ | CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | I-LPIPS ↑ | CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | I-LPIPS ↑ | ||||
DreamBooth* DBLP:conf/cvpr/RuizLJPRA23 | 29.25 | 0.8565 | 0.8611 | 0.7878 | 28.95 | 0.8480 | 0.8224 | 0.7553 | 35.31 | 0.8224 | 0.7648 | 0.6570 | |||
DreamBooth+LoRA* DBLP:conf/iclr/HuSWALWWC22 | 29.54 | 0.8489 | 0.8628 | 0.7200 | 31.08 | 0.8316 | 0.8000 | 0.6611 | 37.78 | 0.8510 | 0.8124 | 0.7145 | |||
SVDiff* DBLP:conf/iccv/HanLZMMY23 | 29.53 | 0.8406 | 0.8648 | 0.7301 | 27.76 | 0.8641 | 0.8642 | 0.7435 | 45.09 | 0.7670 | 0.7854 | 0.6815 | |||
Text Inversion* DBLP:conf/iclr/GalAAPBCC23 | 30.58 | 0.8425 | 0.8513 | 0.6947 | 29.35 | 0.8488 | 0.8245 | 0.7616 | 27.77 | 0.8393 | 0.7964 | 0.6941 | |||
Ours | 28.34 | 0.8649 | 0.8712 | 0.8388 | 26.44 | 0.8649 | 0.8732 | 0.7084 | 26.64 | 0.8608 | 0.8540 | 0.7205 | |||
Method | House → Van Gogh style (text-driven stylization) | House → Monet style (text-driven stylization) | House → Cezanne style (text-driven stylization)
CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ | LPIPS ↓ | CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ | LPIPS ↓ | CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ | LPIPS ↓ | ||||
Artist jiang2024artist | 70.75 | 0.7925 | 0.6260 | 0.2989 | 0.8060 | 68.74 | 0.6699 | 0.4910 | 0.2755 | 0.7815 | 79.27 | 0.6587 | 0.5302 | 0.2830 | 0.7494 | |||
InstructPix2Pix DBLP:conf/cvpr/BrooksHE23 | 72.36 | 0.7464 | 0.5680 | 0.2378 | 0.3677 | 85.05 | 0.6499 | 0.4696 | 0.2446 | 0.4135 | 77.23 | 0.6424 | 0.5336 | 0.2693 | 0.4151 | |||
Plug-and-play Tumanyan_2023_CVPR | 57.96 | 0.7977 | 0.6776 | 0.3086 | 0.4295 | 79.87 | 0.6638 | 0.5024 | 0.2545 | 0.2800 | 73.86 | 0.6576 | 0.5506 | 0.2777 | 0.3295 | |||
Ours | 45.69 | 0.8075 | 0.6870 | 0.3117 | 0.7444 | 57.69 | 0.7049 | 0.5788 | 0.2859 | 0.7268 | 63.83 | 0.6761 | 0.5816 | 0.3145 | 0.7042 | |||
Method | Mountain → Landscapes (one-shot) | Houses → Van Gogh Houses (one-shot) | Dogs → Watercolor Dogs (one-shot)
CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | LPIPS ↓ | CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | LPIPS ↓ | CLIP-FID ↓ | DINO ↑ | CLIP-I ↑ | LPIPS ↓ | ||||
AdaAttn DBLP:conf/iccv/LiuLHLWLSLD21 | 60.92 | 0.7444 | 0.7200 | 0.7613 | 70.43 | 0.7839 | 0.5825 | 0.7046 | 40.05 | 0.7455 | 0.7556 | 0.6995 | |||
AdaIN DBLP:conf/iccv/HuangB17 | 64.03 | 0.7590 | 0.6942 | 0.7005 | 73.09 | 0.7892 | 0.5516 | 0.7504 | 37.82 | 0.7708 | 0.7530 | 0.7170 |
AesPA-Net DBLP:conf/iccv/HongJLAKLKUB23 | 61.71 | 0.7554 | 0.6996 | 0.6592 | 65.65 | 0.7887 | 0.5979 | 0.7380 | 39.92 | 0.7438 | 0.7645 | 0.6677 | |||
StyleID DBLP:conf/cvpr/ChungHH24 | 47.45 | 0.7518 | 0.7564 | 0.6062 | 55.79 | 0.7996 | 0.6501 | 0.7183 | 36.83 | 0.7615 | 0.7613 | 0.6859 | |||
InstantStyle-Plus DBLP:journals/corr/abs-2407-00788 | 59.04 | 0.7595 | 0.7381 | 0.3909 | 64.32 | 0.7582 | 0.6077 | 0.2903 | 41.04 | 0.7437 | 0.7633 | 0.3132 | |||
Ours | 45.14 | 0.8124 | 0.8210 | 0.5917 | 37.19 | 0.8346 | 0.7309 | 0.7437 | 35.40 | 0.8041 | 0.7852 | 0.6848 | |||
3.4 Ablation Study
To verify that the proposed techniques indeed improve the performance of StyleWallfacer in various generation scenarios, we conduct extensive ablation studies on these techniques; due to the page limit, they are presented in Appendix LABEL:abla. We also analyze the source of StyleWallfacer's superiority from a mathematical perspective in Appendix LABEL:math.
4 Conclusion
In this work, we focus on building a unified framework for style transfer by analyzing semantic drift, overfitting, and the true meaning of style transfer, issues that previous works have failed to settle, and accordingly propose a new method named StyleWallfacer. StyleWallfacer comprises a fine-tuning process and a training-free inference framework designed to solve these issues, namely the semantic-based style learning strategy, the data augmentation method for small-scale datasets based on human feedback, and the training-free triple diffusion process. With these designs tailored to style transfer, StyleWallfacer achieves convincing performance on text-driven and image-driven style transfer, text-driven stylization, and image-driven style transfer with color editing, while solving the aforementioned problems. In Appendix G and H, we discuss the limitations and potential future work of StyleWallfacer.
References
- [1] Meta AI. Llama-3.2-1b, 2024. Accessed: 2025-03-06.
- [2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2 edition, 2008.
- [3] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 18392–18402. IEEE, 2023.
- [4] Yancheng Cai, Ali Bozorgian, Maliha Ashraf, Robert Wanat, and K Rafał Mantiuk. elatcsf: A temporal contrast sensitivity function for flicker detection and modeling variable refresh rate flicker. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
- [5] Yancheng Cai, Fei Yin, Dounia Hammou, and Rafal Mantiuk. Do computer vision foundation models learn the low-level characteristics of the human visual system? arXiv preprint arXiv:2502.20256, 2025.
- [6] Yancheng Cai, Bo Zhang, Baopu Li, Tao Chen, Hongliang Yan, Jingdong Zhang, and Jiahao Xu. Rethinking cross-domain pedestrian detection: A background-focused distribution alignment framework for instance-free one-stage detectors. IEEE transactions on image processing, 32:4935–4950, 2023.
- [7] Dar-Yen Chen, Hamish Tennent, and Ching-Wen Hsu. Artadapter: Text-to-image style transfer using multi-level style encoder and explicit adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 8619–8628. IEEE, 2024.
- [8] Jingwen Chen, Yingwei Pan, Ting Yao, and Tao Mei. Controlstyle: Text-driven stylized image generation using diffusion priors. In Abdulmotaleb El-Saddik, Tao Mei, Rita Cucchiara, Marco Bertini, Diana Patricia Tobon Vallejo, Pradeep K. Atrey, and M. Shamim Hossain, editors, Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pages 7540–7548. ACM, 2023.
- [9] Jiwoo Chung, Sangeek Hyun, and Jae-Pil Heo. Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 8795–8805. IEEE, 2024.
- [10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [11] Rinon Gal, Or Patashnik, Haggai Maron, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Trans. Graph., 41(4):141:1–141:13, 2022.
- [12] Junyao Gao, Yanchen Liu, Yanan Sun, Yinhao Tang, Yanhong Zeng, Kai Chen, and Cairong Zhao. Styleshot: A snapshot on any style. CoRR, abs/2407.01414, 2024.
- [13] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris N. Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 7289–7300. IEEE, 2023.
- [14] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 2328–2337. IEEE, 2023.
- [15] Amir Hertz, Andrey Voynov, Shlomi Fruchter, and Daniel Cohen-Or. Style aligned image generation via shared attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 4775–4785. IEEE, 2024.
- [16] Kibeom Hong, Seogkyu Jeon, Junsoo Lee, Namhyuk Ahn, Kunhee Kim, Pilhyeon Lee, Daesik Kim, Youngjung Uh, and Hyeran Byun. Aespa-net: Aesthetic pattern-aware style transfer networks. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 22701–22710. IEEE, 2023.
- [17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
- [18] Nisha Huang, Yuxin Zhang, Fan Tang, Chongyang Ma, Haibin Huang, Weiming Dong, and Changsheng Xu. Diffstyler: Controllable dual diffusion for text-driven image stylization. IEEE Trans. Neural Networks Learn. Syst., 36(2):3370–3383, 2025.
- [19] Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1510–1519. IEEE Computer Society, 2017.
- [20] Ruixiang Jiang and Changwen Chen. Artist: Aesthetically controllable text-driven stylization without training, 2024.
- [21] Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Avatarcraft: Transforming text into neural human avatars with parameterized shape and pose control. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 14325–14336. IEEE, 2023.
- [22] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 6007–6017. IEEE, 2023.
- [23] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 2416–2425. IEEE, 2022.
- [24] Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 18041–18050. IEEE, 2022.
- [25] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
- [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR, 2022.
- [27] Cixin Liu. The Three-Body Problem. Tor Books, 2014.
- [28] Songhua Liu, Tianwei Lin, Dongliang He, Fu Li, Meiling Wang, Xin Li, Zhengxing Sun, Qian Li, and Errui Ding. Adaattn: Revisit attention mechanism in arbitrary neural style transfer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 6629–6638. IEEE, 2021.
- [29] Wu-Qin Liu, Minxuan Lin, Haibin Huang, Chongyang Ma, and Weiming Dong. Freestyler: A free-form stylization method via multimodal vector quantization. In Fang-Lue Zhang and Andrei Sharf, editors, Computational Visual Media - 12th International Conference, CVM 2024, Wellington, New Zealand, April 10-12, 2024, Proceedings, Part II, volume 14593 of Lecture Notes in Computer Science, pages 259–278. Springer, 2024.
- [30] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 13482–13492. IEEE, 2022.
- [31] Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, and Richard Zhang. Few-shot image generation via cross-domain correspondence. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
- [32] Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, and Richard Zhang. Few-shot image generation via cross-domain correspondence. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 10743–10752. Computer Vision Foundation / IEEE, 2021.
- [33] OpenAI. Gpt-4o. https://openai.com/chatgpt/overview/, 2024. Accessed: 2024-10-05.
- [34] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11400–11410. IEEE, 2022.
- [35] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [36] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
- [38] Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. RB-modulation: Training-free stylization using reference-based modulation. In The Thirteenth International Conference on Learning Representations, 2025.
- [39] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 22500–22510. IEEE, 2023.
- [40] Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Birkhäuser Cham, 1 edition, 2015.
- [41] Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, and Lin Yan. Exploring data scaling trends and effects in reinforcement learning from human feedback, 2025.
- [42] Kihyuk Sohn, Lu Jiang, Jarred Barber, Kimin Lee, Nataniel Ruiz, Dilip Krishnan, Huiwen Chang, Yuanzhen Li, Irfan Essa, Michael Rubinstein, Yuan Hao, Glenn Entis, Irina Blok, and Daniel Castro Chin. Styledrop: Text-to-image synthesis of any style. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [43] Kihyuk Sohn, Albert E. Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, and Irfan Essa. Learning disentangled prompts for compositional image synthesis. CoRR, abs/2306.00763, 2023.
- [44] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- [45] Stephen Tian, Yancheng Cai, Hong-Xing Yu, Sergey Zakharov, Katherine Liu, Adrien Gaidon, Yunzhu Li, and Jiajun Wu. Multi-object manipulation via object-centric neural scattering functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9021–9031, 2023.
- [46] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, June 2023.
- [47] Can Wang, Ruixiang Jiang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Nerf-art: Text-driven neural radiance fields stylization. IEEE Trans. Vis. Comput. Graph., 30(8):4983–4996, 2024.
- [48] Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui Qin, and Anthony Chen. Instantstyle: Free lunch towards style-preserving in text-to-image generation. CoRR, abs/2404.02733, 2024.
- [49] Haofan Wang, Peng Xing, Renyuan Huang, Hao Ai, Qixun Wang, and Xu Bai. Instantstyle-plus: Style transfer with content-preserving in text-to-image generation. CoRR, abs/2407.00788, 2024.
- [50] Xiaogang Wang and Xiaoou Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1955–1967, Nov 2009.
- [51] Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. CoRR, abs/2312.12148, 2023.
- [52] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [53] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3813–3824. IEEE, 2023.
- [54] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–595. Computer Vision Foundation / IEEE Computer Society, 2018.
Appendix
Overview
This supplementary material provides the related works and additional experiments and results to further support our main findings and the proposed StyleWallfacer. These were not included in the main paper due to space limitations. The supplementary material is organized as follows: