POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation

First Author
Institution1
Institution1 address
firstauthor@i1.org Second Author
Institution2
First line of institution2 address
secondauthor@i2.org

Abstract

This paper targets to enhance the diffusion-based text-to-video generation by improving the two input prompts, including the noise and the text. Accommodated with this goal, we propose POS, a Prompt Optimization Suite to boost text-to-video models. POS is motivated by two observations: (1) Video generation shows instability in terms of noise. Given the same text, different noises lead to videos that differ significantly in terms of both frame quality and temporal consistency. This observation implies that there exists an optimal noise matched to each textual input; To capture the potential noise, we propose an optimal noise approximator to approach the potential optimal noise. Particularly, the optimal noise approximator initially searches a video that closely relates to the text prompt and then inverts it into the noise space to serve as an improved noise prompt for the textual input. (2) Improving the text prompt via LLMs often causes semantic deviation. Many existing text-to-vision works have utilized LLMs to improve the text prompts for generation enhancement. However, existing methods often neglect the semantic alignment between the original text and the rewritten one. In response to this issue, we design a semantic-preserving rewriter to impose constraints in both rewriting and denoising phrases to preserve semantic consistency. Extensive experiments on popular benchmarks show that our POS can improve the text-to-video models with a clear margin. The code will be open-sourced.

1 Introduction

Text-to-video generation has emerged as a valuable approach in automated video production, offering a human-friendly method with wide applications in diverse industries such as media, gaming, and film. Recently, diffusion-based text-to-video generation has witnessed remarkable strides [11, 30, 10, 46]. Following the paradigm of conditional diffusion models [28], the text-to-video generation models typically sample a Gaussian noise and a text condition to synthesize the videos. However, during our practice, we found that the noise and the text prompt seriously affected the video quality of existing text-to-video models.

Refer to caption — Figure 1: Different noises can yield significantly different videos in terms of quality. With this observation, we posit there exists a potential optimal noise (orange circle), randomly sampled noise close to the optimal noise can synthesize high-quality results, while the noise far away leads to poor quality. Both videos are produced by ModelScope [38] with the same prompt “*A dog wearing a Superhero outfit with red cape flying through the sky*”.

Different noises can yield significantly varied videos in terms of frame quality and temporal consistency, just as shown in Figure 1. This observation highlights the existence of an optimal noise sample for a given text prompt, but the strategy of random sampling noise is hard to hit or get close to it every time. As a result, video quality varies, with higher quality achieved when the noise sample aligns closely to the optimal point, and poorer quality observed when it deviates further away. Given this consideration, we target to approach the optimal noise for a given text prompt to consistently generate high-quality videos. Notably, the inversion and denoising procedures within the diffusion model establish a bidirectional mapping between the noise space and the video space. Based on the fact that the inversion of a sample can almost reconstruct itself through denoising [20, 18, 37], we think that the inversion of the groundtruth video of the given text prompt is close enough to the optimal noise. Nevertheless, the groundtruth video is not available during inference. To tackle this challenge, we propose to leverage the neighboring counterpart of the desired video to approximate the optimal noise.

To be specific, we first search a neighbor video for the text input, and then apply the inversion procedure on it to locate a point in the noise space as shown in Figure 2. Motivated by this concept, we initially assemble a pool of text-video pairs. For a given input text prompt, we select the video from the pair whose text is similar to the text prompt as the neighbor video. Subsequently, this chosen video is inverted into the noise space as an approximation for the optimal noise. Finally, the approximated noise is input to the forward procedure in the diffusion model, resulting in the synthesis of higher-quality videos. To get rid of the storage pressure of the retrieval pool, we also take the text-inversion noise pairs as the training samples, and train an optimal noise prediction network to output the optimal noise directly. However, we note that the fixed noise for every text would lead to poor diversity, to remedy this issue, we further augment the noise initialization by introducing the random noise via Gaussian Mixture to maintain the diversity.

In addition to noise inputs, text prompts also play a pivotal role in influencing the quality of generated videos. Detailed descriptions tend to produce superior results, and many works have utilized large-scale language models (LLMs) to enhance the text prompts [47, 5, 12, 21]. For the text-to-video generation, we can straightforwardly enhance the text descriptions using LLMs such as ChatGPT [22] and Llama2 [33]. However,this naïve strategy presents two issues: (a) The potential information gained through simple rewriting is often limited, resulting in incremental improvements to the text prompts. (b) Without appropriate guidance, the introduction of unexpected content may lead to the generation of videos that deviate from the user’s original intention. In light of these issues, we propose a semantic-preserving rewriter, formed by a Reference-Guided Rewriting mechanism and Denoising with Hybrid-Semantics to respectively remedy the above two challenges.

Particularly, Reference-Guided Rewriting (RGR) searches several texts as the reference for the rewriting, serving as the information pool to aid the LLMs in compensating reasonable details for input text. To preserve the semantics, we design a Denoising with a Hybrid-Semantics (DHS) strategy, the rewritten text acts as the contextual condition in the early diffusion steps for quality enhancement, while the original text is introduced in later steps of the denoising process to align the semantics with the original text. By strategically including the original text, DHS ensures that the final video closely adheres to the original text, maintaining the intended narrative consistency. Through advancements in enhancing the two inputs of text-to-video models, we have successfully developed an optimization-free and (diffusion) model-agnostic method.

In summary, we contribute a prompts optimization suite (POS) for text-to-video generation, featured by two components: 1) An optimal noise approximator (ONA) that offers the noise initialization close to the optimal noise for each text prompt. 2) A semantic-preserving rewriter (SPR) characterized by the RGR and DHS to provide detail-rich text prompts while maintaining the semantics of the final video to keep consistent with the original text.

2 Related work

Text-to-video generation within open domains, using diffusion models, is an active and compelling research area. The prevailing approaches [11, 30, 10, 46, 2, 42, 4, 19, 39] typically extend text-to-image models to facilitate video generation. This strategy leverages the knowledge from text-to-image models. Broadly, existing methods enhance text-to-video generation through two primary avenues: architecture-oriented optimization [11, 30, 10] and noise-oriented optimization [19, 4]. Architecture-oriented optimization typically designs modules for temporal relation enhancement. For example, VideoFactory [39] proposes a novel spatiotemporal cross-attention to reinforce the interaction between spatial and temporal features. While noise-oriented optimization seeks to provide precise noise initialization to maintain video fluency more effectively. VideoFusion [19] and Preserve Your Own Correlation [4] both target to maintain the frame coherence from the noise input. Instead of designing the noise empirically, this work targets to directly approximate the optimal noise for the text prompt. What’s more, we also enrich the text prompts via the LLMs. Notably, the proposed POS is optimization-free and model-agnostic, which are also important features that distinguish our method from existing models.

3 Preliminaries

Before diving into the details of our method, we first revisit two crucial modules in Denoising Diffusion Implicit Models (DDIMs) [31], i.e., the Denoising and the Inversion procedures, to ease the understanding of our method. Denoising in diffusion model builds a mapping from normal noise to samples (image, video etc). Compared with Denoising Diffusion Probabilistic Models (DDPMs) [9], DDIMs [31] are more efficient, allowing for the use of a smaller number of steps to synthesize a sample $z_{0}$ from a Gaussian noise $z_{T}$ using the iterative procedure:

z_{t-1}=\sqrt{\alpha_{t-1}}(\frac{z_{t}}{\sqrt{\alpha_{t}}}+(\sqrt{\frac{1-\alpha_{t-1}}{\alpha_{t-1}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}})\epsilon_{\theta}(z_{t},t,c)),

(1)

where $\alpha_{t}=\prod_{t=1}^{T}(1-\beta_{t})$ , $\beta_{t}$ is a predefined hyperparameter in DDPMs, $c$ is the condition to control the generation, $\epsilon_{\theta}$ is a noise prediction network. We define function $\text{DN}(z_{t},t,c)$ as the right term of the above equation to simplify the subsequent description: $z_{t-1}=\text{DN}(z_{t},t,c)$ .

Inversion defines a projection from data space to noise space. Given a sufficiently large $T$ , Eq.1 approaches an ordinary differential equation (ODE). With the assumption that the ODE process can be reversed in the limit of small steps [20], we can acquire a noise from a data $z_{0}$ :

z_{t+1}=\sqrt{\alpha_{t+1}}(\frac{z_{t}}{\sqrt{\alpha_{t}}}+(\sqrt{\frac{1-\alpha_{t+1}}{\alpha_{t+1}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}})\epsilon_{\theta}(z_{t},t,c)).

(2)

Analogously, we define the right term of the above equation as $\text{INV}(z_{t},t,c)$ to convince the subsequent elaboration: $z_{t+1}=\text{INV}(z_{t},t,c)$ . With the above preliminaries, we next elaborate on the key components of our POS, i.e., optimal noise approximator and semantic-preserving rewriter.

4 Methodology

Figure 3 depicts our framework, POS, formed by ONA and SPR, which can be seamlessly integrated into the existing text-to-video generation frameworks. Given a trained text-to-video model and the text prompt, ONA produces an improved noise by video inversion or directly generating with the noise prediction network, which is then combined with random noise to form the final noise prompt. SPR locates textual references for LLMs rewriting and performs denoising with hybrid semantics for semantics preservation.

4.1 Optimal Noise Approximator

Video Retrieval seeks to find a video $v*$ that aligns with the text prompt in semantics. We first prepare a pool of $N$ text-video pairs $\mathcal{X}=\{S_{i},V_{i}\}_{i=1}^{N}$ , where $S$ and $V$ represent text and video, respectively. Given a text prompt $\mathcal{C}$ , we initially estimate the similarity between the text prompt and the text within $\mathcal{X}$ and then select the video accordingly:

v*=\{V_{i}|\arg\max\limits_{i}\{\text{sim}(E_{t}(\mathcal{C}),E_{t}(S_{i}))|\{S_{i},V_{i}\}\in\mathcal{X}\}\}

(3)

where $\text{sim}(\cdot,\cdot)$ is the cosine similarity, and $E_{t}$ can be a off-the-shelf language model like [27, 3], to extract the text feature. The reasons we anchor on the text-similarity for the video selection are three-fold: first, text similarity is a more reliable clue than the text-video relevance; second, the computation efficiency for text feature extraction is superior; third, due to the available text-video dataset like WebVid-10M [1], InvernVid [40], collecting the text-video pairs is not effort-intensive as well.

Guided Noise Inversion module takes the searched $v*$ as the inversion source to approximate the optimal noise. Formally, we follow the efficient latent diffusion architecture [28], where the raw video $v*$ is first encoded via the encoder of VQ-VAE [26] and then the inversion is recurrently acted on the latent feature with $T$ steps: $\epsilon_{\text{inv}}=\text{INV}(\mathcal{E}(v*),t,\varnothing)|_{t=0}^{T-1}$ , where $\epsilon_{\text{inv}}$ is the inverted noise, $\varnothing$ represents the empty text and $\mathcal{E}$ is the VQ-VAE encoder. With the video as the inversion source, $\epsilon_{\text{inv}}$ can also inherit the coherence of $v*$ , thereby benefiting the temporal consistency as well.

As shown in Figure 2, this simple strategy can help locate a point close to the optimal noise. On the other hand, the prompts that share the inversion source video will use the same noise; especially, multiple inferences of the same prompt will yield the same result, severely sacrificing the diversity. In response to this shortcoming, we introduce the Gaussian noise and integrate it with the inverted noise $\epsilon_{\text{inv}}$ via Gaussian mixture to maintain the randomness:

\epsilon_{\text{mix}}(v*,\eta)=\frac{1}{\sqrt{1+\eta^{2}}}\cdot\epsilon+\frac{\eta}{\sqrt{1+\eta^{2}}}\cdot\epsilon_{\text{inv}},

(4)

where $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ , $\eta$ is the hyperparameter to balance the noises. Noise mixture in Eq. 4 is inspired by PYoCo[4]. Unlike PYoCo, which performs a mixture between noise frames to achieve smoothness, the noise mixture here is a tool to introduce random noise for diversity preservation, which is not the paper’s focus.

Video Synthesize with Improved Noise. Given the improved noise $\epsilon_{\text{mix}}$ and text prompt $\mathcal{C}$ , the latent code $z_{0}$ of the video is calculated by the following formula:

z_{t-1}=\left\{\begin{array}[]{lcl}\text{DN}(\epsilon_{\text{mix}}(v*,\eta),t,\mathcal{C})&&\text{if}\;{t=T}\\ \text{DN}(z_{t},t,\mathcal{C})&&\text{if}\;{0<t\leq T-1},\end{array}\right.

(5)

where $T$ is the total DDIM sampling steps.

4.2 Noise Prediction Network

Following subsection 4.1, we can acquire $\epsilon_{\text{inv}}$ in a training-free manner. However, it requires a paired video-text pool as the prerequisite. To circumvent this limitation, we train a Noise Prediction Network (NPNet) to generate the optimal noise directly. As illustrated in Figure 4, NPNet follows the video diffusion paradigm. It comprises two parts: 1) Text encoder CLIP [25] extracts features from the input text as conditions. 2) Sparse Spatial-Temporal attention network for denoising, which is built upon the Sparse Spatial-Temporal (SS-T) Attention. Figure 4 (b) shows the detailed architecture of SS-T attention, The input feature undergoes processing through consecutive layers including a 2D convolution (conv2D), a 3D convolution (conv3D), Sparse Causal Attention (SCAttn) [42], spatial cross attention, and two temporal attention modules. During training, we leverage inversion noises and paired texts as the training data, and train NPNet using the noise reconstruction loss [9]. Post-training, we obtain the optimal noise by feeding the text prompt through NPNet, which replaces $\epsilon_{\text{inv}}$ in Eq. 4.

NPNet is primarily intended as an alternative to bypass the text-video search pool in our standard ONA framework; however, its architectural design is not the central focus of this study. Therefore, we omit the architectural details in the main paper and instead include them in the supplementary materials.

4.3 Semantic-Preserving Rewriter

SPR targets to improve another crucial input for text-to-video generation, i.e., the text prompt, while keeping the semantics aligned with the original text.

Reference-Guided Rewriting aims to provide descriptions to guide the rewriting, allowing the LLMs to “imagine” reasonable textual details. In particular, we first use the Sentence-BERT model [27] to encode the text prompt. Consider a text prompt $\mathcal{C}$ , its top- $k$ references are selected using the cosine similarity:

\{S^{r}_{i}\}_{1}^{k}=\mathop{\mathrm{arg\,top}\,k}\{\text{sim}(E_{t}(\mathcal{C}),E_{t}(S_{i}))|S_{i}\in\mathcal{X}\}.

(6)

Subsequently, we integrate the references $\{S^{r}_{i}\}_{1}^{k}$ and the text prompt $\mathcal{C}$ into a designed instruction template IT, which is fed into a LLM to rewrite the prompt:

\mathcal{C^{\text{r}}}=\text{LLM}(\texttt{IT}\{\{S^{r}_{i}\}_{1}^{k},\mathcal{C}\}),

(7)

Denoising with Hybrid Semantics. Reference sentences provide valuable guidance for the reasonable details compensation, however, we found this strategy cannot perfectly maintain the original semantics of the text prompt due to the excellent association ability of LLMs. To remedy this issue, we propose to introduce the original text prompt into the denoising process. Specifically, we apply the rewritten sentence as the condition in the early stage to boost the content quality, while the original text prompt is employed in the latter denoising steps to pull the semantics close to the original prompt. As a result, we evolve the video synthesis of Eq. 5 as follows:

z_{i-1}=\left\{\begin{array}[]{lll}\text{DN}(\epsilon_{\text{mix}}(v*,\eta),t,\mathcal{C}^{r}),&\text{if}\;{t=T}\\ \text{DN}(z_{t},t,\mathcal{C^{\text{r}}})&\text{if}\;{T-m<t\leq T-1}\\ \text{DN}(z_{t},t,\mathcal{C})&\text{if}\;{0<t\leq T-m},\end{array}\right.

(8)

where $m=\lfloor T\times\gamma\rfloor,\gamma\in(0,1).$ $m$ indicates how many steps of rewritten text is performed, $\lfloor\cdot\rfloor$ is the floor operation. Following the latent diffusion model [28], we finally synthesize the video by feeding the final latent feature $z_{0}$ into the VQ-VAE decoder: $\hat{v}=\mathcal{D}(z_{0})$ .

4.4 Discussion

ONA vs Inversion in Video Editing. Video Editing (VE) approaches often invert the video for editing to initialize the noise [42, 17, 24]. ONA differs from VE methods in many aspects. (1) Different goals. VE inverts the video to preserve more information about the original video, thereby better supporting the subsequent editing. While ONA uses the inversion to approach the potential optimal noise, which is the core idea of this paper. (2) Video diversity. Video Editing (VE) does not require much consideration of diversity. While our focus is on the general text-to-video generation, consequently, ONA has to maintain the diversity, which is accomplished via the Gaussian mixture. (3) Inversion source. The video for editing is the straightforward inversion source in VE. In contrast, the video for the optimal noise is not available during inference, therefore, ONA resorts to searching a neighbor video as the inversion source.

SPR vs Text Prompt-Enhancement. Prior research has employed LLMs to enhance text prompts for improving text-to-vision generation [47, 5, 12, 21]. However, there approaches either introduce extra training burden [5], or simply utilize the frozen LLMs to provide additional textual details [21], but rarely study how to better unlock the potential of LLMs when rewriting and pay rare attention to the alignment between the original text and the videos condition on the rewritten text [47, 12]. While SPR is optimization-free, and RGR in our SPR assists LLMs by showing several sentences as references, and DHS in SPR presents a hybrid denoising strategy to keep the semantics of the final video consistent with the original text. These features set our SPR apart from existing text prompt-rewriting methods.

5 Experiment

5.1 Standard Evaluation Setup.

During our practice, we found that the evaluation settings in existing text-to-video works are either varied or ambiguous, posing the risk of unfair comparison. To remedy this issue, we present our detailed evaluation configuration and hope to standardize the future evaluation of text-to-video models.

Datasets & Metrics. We follow the previous works [11, 30, 4, 2, 39] and adopt the widely-used MSR-VTT [43] and UCF101 [32] datasets for performance evaluation. MSR-VTT provides 2,990 video clips for testing, each accompanied by around 20 captions. To evaluate the performance, we randomly select one caption per video to create a set of 2,990 text-video evaluation pairs. FID [8]¹¹1FID: https://github.com/GaParmar/clean-fid and CLIP-FID [15]²²2CLIP-FID: https://github.com/GaParmar/clean-fid is used to assess the video quality, along with CLIP-SIM [41]³³3CLIPSIM: https://github.com/openai/CLIP metric to measure semantic consistency between videos and text prompts. UCF101 contains 13,320 video clips of 101 human action categories. We evaluate performance using 3,783 test videos. As there are no captions, we use the text prompts from PYoCo [4] for video generation. IS [29] and FVD [35, 44] serve as the evaluation criteria. Video IS [30, 13] uses C3D[34] pretrained on UCF101 as the feature extractor⁴⁴4IS: https://github.com/pfnet-research/tgan2, while FVD utilizes I3D [36] pretrained on Kinetics-400 [14] for video feature encoding⁵⁵5FVD: https://github.com/wilson1yan/VideoGPT.

Resolution. Following previous works [30, 46, 4, 6], we generate videos of size $16\times 256\times 256$ for performance evaluation. This choice is motivated by two main factors: First, the training dataset WebVid-10M [1] for most text-to-video models is dominated by 360P videos; Second, both the MSR-VTT and UCF101 datasets consist of videos with a resolution of around 240P. Consequently, $256\times 256$ ensures a consistent and comparable evaluation setting.

	MSR-VTT				UCF101
	FID (inception v3)↓	CLIP-FID ↓	CLIPSIM ↑	Hyperparamters	FVD ↓	IS ↑	Hyperparameters
$\text{Tune-A-Video}^{{\dagger}}$	51.365	17.093	0.276	-	1321.45	22.752	-
+ ONA	${48.623}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.743}}$	${16.393}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.700}}$	$\textbf{0.278}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.002}}$	$\eta=0.1$	${1156.91}_{{\color[rgb]{0.484375,0.8046875,0.484375}164.54}}$	${25.129}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.377}}$	$\eta=0.1$
+ SPR	${47.389}_{{\color[rgb]{0.484375,0.8046875,0.484375}3.976}}$	${16.436}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.657}}$	${0.273}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.003}}$	$\gamma=0.5$	${1276.67}_{{\color[rgb]{0.484375,0.8046875,0.484375}44.78}}$	${24.960}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.208}}$	$\gamma=0.5$
+ POS	${\textbf{45.270}}_{{\color[rgb]{0.484375,0.8046875,0.484375}6.095}}$	${\textbf{16.164}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.929}}$	${0.274}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.002}}$	$\{\eta=0.1,\gamma=0.5\}$	${\textbf{1078.54}}_{{\color[rgb]{0.484375,0.8046875,0.484375}242.91}}$	$\textbf{27.317}_{{\color[rgb]{0.484375,0.8046875,0.484375}4.565}}$	$\{\eta=0.1,\gamma=0.5\}$
VideoCrafter	51.349	20.899	0.282	-	777.02	31.211	-
+ ONA	${51.488}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.139}}$	${\textbf{20.808}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.091}}$	${\textbf{0.285}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.003}}$	$\eta=0.3$	${764.86}_{{\color[rgb]{0.484375,0.8046875,0.484375}12.16}}$	$\textbf{32.439}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.228}}$	$\eta=0.3$
+ SPR	${\textbf{50.221}}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.128}}$	${20.870}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.029}}$	${0.283}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.001}}$	$\gamma=0.1$	${\textbf{746.91}}_{{\color[rgb]{0.484375,0.8046875,0.484375}30.11}}$	${31.242}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.031}}$	$\gamma=0.1$
+ POS	${50.882}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.467}}$	${20.886}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.013}}$	${0.284}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.002}}$	$\{\eta=0.3,\gamma=0.1\}$	${747.32}_{{\color[rgb]{0.484375,0.8046875,0.484375}29.70}}$	${32.372}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.161}}$	$\{\eta=0.3,\gamma=0.1\}$
ModelScope	45.378	13.677	0.296	-	774.14	32.337
+ ONA	${43.092}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.287}}$	${\textbf{13.572}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.105}}$	$\text{0.299}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.003}}$	$\eta=0.2$	${607.11}_{{\color[rgb]{0.484375,0.8046875,0.484375}167.03}}$	$\textbf{38.992}_{{\color[rgb]{0.484375,0.8046875,0.484375}6.655}}$	$\eta=0.2$
+ SPR	${43.585}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.793}}$	${\textbf{13.572}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.105}}$	${0.297}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.001}}$	$\gamma=0.04$	${684.90}_{{\color[rgb]{0.484375,0.8046875,0.484375}89.24}}$	${33.136}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.799}}$	$\gamma=1.0$
+ POS	${\textbf{42.755}}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.623}}$	${13.674}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.003}}$	${\textbf{0.300}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.004}}$	$\{\eta=0.2,\gamma=0.04\}$	${\textbf{566.68}}_{{\color[rgb]{0.484375,0.8046875,0.484375}207.46}}$	${38.190}_{{\color[rgb]{0.484375,0.8046875,0.484375}5.853}}$	$\{\eta=0.2,\gamma=1.0\}$
SCVideo	48.331	14.802	0.283	-	750.74	23.224	-
+ ONA	$43.585_{{\color[rgb]{0.484375,0.8046875,0.484375}4.746}}$	${14.287}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.515}}$	${\textbf{0.289}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.006}}$	$\eta=0.5$	${558.51}_{{\color[rgb]{0.484375,0.8046875,0.484375}192.23}}$	${31.538}_{{\color[rgb]{0.484375,0.8046875,0.484375}8.314}}$	$\eta=0.5$
+ SPR	${43.760}_{{\color[rgb]{0.484375,0.8046875,0.484375}4.571}}$	${14.291}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.511}}$	${0.284}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.001}}$	$\gamma=0.1$	$\text{712.88}_{{\color[rgb]{0.484375,0.8046875,0.484375}37.86}}$	${25.648}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.424}}$	$\gamma=1.0$
+ POS	${\textbf{42.902}}_{{\color[rgb]{0.484375,0.8046875,0.484375}5.429}}$	${\textbf{14.155}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.647}}$	${0.288}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.005}}$	$\{\eta=0.5,\gamma=0.1\}$	${\textbf{541.43}}_{{\color[rgb]{0.484375,0.8046875,0.484375}209.31}}$	$\hskip 2.84544pt{\textbf{33.245}}_{{\color[rgb]{0.484375,0.8046875,0.484375}10.021}}$	$\{\eta=0.5,\gamma=1.0\}$

Table 1: Quantitative results on MSR-VTT and UFC101 datasets, our POS, and its components ONA and SPR can improve the performance with a clear margin, where “+ ONA” means equipping ONA to the models, and Bold highlights the best performance. Green and Red numbers means performance improvement and decline, respectively

	MSR-VTT				UCF101
	FID (inception v3)↓	CLIP-FID ↓	CLIPSIM ↑	Hyperparamters	FVD ↓	IS ↑	Hyperparameters
ModelScope	45.378	13.677	0.296	-	774.14	32.337
+ POS	${42.755}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.623}}$	${\textbf{13.674}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.003}}$	${\textbf{0.300}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.004}}$	$\{\eta=0.2,\gamma=0.04\}$	${\textbf{566.68}}_{{\color[rgb]{0.484375,0.8046875,0.484375}207.46}}$	${38.190}_{{\color[rgb]{0.484375,0.8046875,0.484375}5.853}}$	$\{\eta=0.2,\gamma=1.0\}$
+ POS^∗	${\textbf{42.296}}_{{\color[rgb]{0.484375,0.8046875,0.484375}3.082}}$	${13.881}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.204}}$	${0.299}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.003}}$	$\{\eta=0.2,\gamma=0.1\}$	${570.27}_{{\color[rgb]{0.484375,0.8046875,0.484375}203.87}}$	${\textbf{44.255}}_{{\color[rgb]{0.484375,0.8046875,0.484375}11.918}}$	$\{\eta=0.5,\gamma=0.1\}$
SCVideo	48.331	14.802	0.283	-	750.74	23.224	-
+ POS	${42.902}_{{\color[rgb]{0.484375,0.8046875,0.484375}5.429}}$	${\textbf{14.155}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.647}}$	${0.288}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.005}}$	$\{\eta=0.5,\gamma=0.1\}$	${\textbf{541.43}}_{{\color[rgb]{0.484375,0.8046875,0.484375}209.31}}$	$\hskip 2.84544pt{33.245}_{{\color[rgb]{0.484375,0.8046875,0.484375}10.021}}$	$\{\eta=0.5,\gamma=1.0\}$
+ POS^∗	${\textbf{42.838}}_{{\color[rgb]{0.484375,0.8046875,0.484375}5.493}}$	${14.517}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.285}}$	${\textbf{0.292}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.009}}$	$\{\eta=0.5,\gamma=0.1\}$	${543.73}_{{\color[rgb]{0.484375,0.8046875,0.484375}207.01}}$	$\hskip 2.84544pt{\textbf{38.575}}_{{\color[rgb]{0.484375,0.8046875,0.484375}15.31}}$	$\{\eta=0.5,\gamma=0.5\}$

Table 2: Quantitative results on MSR-VTT and UFC101 datasets, our POS^∗ is capable of achieving a performance similar to that of POS, with both significantly outperforming the baseline model.

Sampling Strategy. For FID and CLIP-FID, we sample 14,950 images from 2,990 MSR-VTT videos and generated videos respectively. Concretely, we sample 5 images for each video in both real and generated videos. For the real video, sample an image every 12 frames. For the generated video, sample an image every 4 frames. Regarding CLIP-SMI, we calculate the clip similarity between each image of the generated video and the corresponding prompt and then average it. All 3,783 UCF101 test videos are used to calculate FVD and IS. For the real video, we sample an image every 5 frames, for a total of 16 frames, to calculate FVD.

5.2 Implementation Details

We sample 100k image-text pairs from WebVid-10M to form our pre-defined candidate pool, i.e., $N=100$ k. In Guided Noise Inversion, we search for the most similar video from the pool for each text prompt according to Eq. 3. In Noise Prediction Network, we randomly sample 600K video-text pairs from WebVid-10M and utilize DDIM inversion to convert the videos to noise. More inverted videos harvests limited benefits in our practice. Sentence-BERT is taken to extract text features⁶⁶6Sentence-BERT: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. In Semantic-Preserving Rewriter, we pick the top-5 ( $k=5$ ) reference sentences for the text prompt and adopt ChatGPT [22] to perform rewriting, which imitates the adjectives, adverbs, or sentence patterns of these 5 sentences. The instruction template IT is “Let me give you 5 examples:[Ref1],[Ref2],[Ref3],[Ref4],[Ref5], rewrite the sentence [Input Text Prompt] without changing the meaning of the original sentence to a maximum of 20 words, imitating/combining the adjectives, adverbs or sentence patterns from the 5 examples above”. POS can benefit many trained text-to-video models. To support this claim, we equip the proposed modules on three open-sourced text2video models, including ModelScope [38], VideoCrafter [7], Tune-A-Video [42] and our trained SCVideo to study the performance gain. Tune-A-Video is a well-known one-shot video generation model, to perform the general video generation, we re-train this model by optimizing more parameters of spatiotemporal attention and 3D convolution module on WebVid-10M by referring to many text-to-video works[2, 4, 6]. The re-trained Tune-A-Video is marked as $\text{Tune-A-Video}^{{\dagger}}$ . SCVideo is our designed model extended from text-to-image models by equipping temporal modules, including temporal convolution, Sparse Causal (SC) Attention, and temporal attention.

Training. Both SCVideo and the Noise Prediction Network employ the same UNet architecture and identical loss functions. The difference lies in the initialization of network parameters. SCVideo is initialized using the pretrained weights of SD 1.4, and the newly introduced weights and bias are initialized with identity matrix and zero, respectively. NPNet is initialized from the parameters of a well-trained SCVideo. Additionally, we only optimize the 3D temporal convolution and all the attention modules of the former, whereas the latter undergoes training for all parameters. SCVideo is trained with 4 Nvidia A100-80G GPUS with batch size 5, and the total number of training steps per GPU is 80K. The learning rate is set as $3\times 10^{-5}$ with $10^{-2}$ weight decay. We train our model using AdamW optimizer, and the hyperparameters are configured as $\beta_{1}$ = 0.9, $\beta_{2}$ = 0.999 and $\epsilon$ = $10^{-8}$ . In addition, a cosine schedule with a linear warmup of 1000 steps is adopted during the training process. NPNet is trained with 2 Nvidia A100-80G GPUS with batch size 5, and the total number of training steps per GPU is 40K. Other hyperparameter settings are the same as SCVideo.

Next, we extensively evaluate our approach on multiple baseline models. ONA and POS denote the use of retrieved videos and DDIM Inversion, while POS^∗ represents the utilization of Noise Prediction Network

	Stable Diffusion 1.5				Stable Diffusion-XL
	SD 1.5	+ ONA	+ SPR	+ POS	SD-XL	+ ONA	+ SPR	+ POS
FID (inception v3)↓	24.824	${24.641}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.183}}$	${23.861}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.963}}$	${\textbf{23.623}}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.201}}$	25.608	${24.129}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.479}}$	${25.307}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.301}}$	${\textbf{24.011}}_{{\color[rgb]{0.484375,0.8046875,0.484375}1.597}}$
CLIP-FID ↓	${13.545}$	${13.352}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.193}}$	${13.212}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.333}}$	${\textbf{13.194}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.351}}$	14.940	${12.885}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.055}}$	${14.717}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.223}}$	${\textbf{12.866}}_{{\color[rgb]{0.484375,0.8046875,0.484375}2.074}}$
CLIPSIM ↑	0.324	${\textbf{0.325}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.001}}$	${0.320}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.004}}$	${0.320}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.004}}$	0.336	${\textbf{0.337}}_{{\color[rgb]{0.484375,0.8046875,0.484375}0.001}}$	${0.330}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.006}}$	${0.331}_{{\color[rgb]{0.94140625,0.59765625,0.55078125}-0.005}}$

Table 3: Quantitative results for text-to-image generation, the results are evaluated on MS-COCO dataset

	MSR-VTT			UCF101
	FID↓	CLIP-FID↓	CLIPSIM↑	FVD↓	IS ↑
$N$ = 10K	43.236	14.352	0.288	591.79	31.735
$N$ = 100K	43.585	14.287	0.289	558.51	31.538
$N$ = 1M	42.587	14.334	0.288	620.29	33.047
$N$ = 10M	42.072	14.285	0.289	571.27	33.232

(a)

	MSR-VTT			UCF101
	FID↓	CLIP-FID↓	CLIPSIM↑	FVD↓	IS↑
$\gamma$ =0	48.331	14.802	0.283	750.74	23.224
$\gamma$ =0.1	43.760	14.291	0.284	769.35	25.006
$\gamma$ =0.2	43.170	14.384	0.283	731.48	25.508
$\gamma$ =0.5	42.992	14.755	0.278	729.23	25.453
$\gamma$ =1.0	43.531	15.484	0.275	712.88	25.648

(b)

	MSR-VTT			UCF101
	FID↓	CLIP-FID↓	CLIPSIM↑	FVD↓	IS↑
K=0	45.435	15.504	0.278	808.16	23.849
K=1	45.410	15.386	0.273	745.07	24.604
K=2	44.458	15.385	0.275	824.99	25.361
K=5	43.531	15.484	0.275	712.88	25.648
K=10	43.125	15.257	0.277	755.55	26.598

(c)

	MSR-VTT			UCF101
	FID↓	CLIP-FID↓	CLIPSIM↑	FVD↓	IS↑
Llama2-7B	43.811	14.395	0.287	551.87	31.305
ChatGPT	42.901	14.155	0.288	541.43	33.245

(d)

Table 4: Ablation experiments. SCVideo is taken as the baseline to study the key hyperparameters in our ONA and SPR.

5.3 Quantitative Results.

Table 1 reports the main results on four benchmarks, the hyperparameter configurations are also presented to make everything clear. We can observe that the POS can bring consistent performance improvement of four models on both datasets. Particularly, ONA and SPR can improve the FID of ModelScope from 45.378 to 43.092 and 43.585, respectively. Equipping both modules (POS) makes the performance step further, achieving a 42.755 FID. Furthermore, POS also maintains a good semantic consistency, 0.296 vs 0.3. On UCF101 dataset, ONA shows great effectiveness, boosts the FVD of ModelScope from 774.14 to 607.11, and IS is also clearly improved. The contributions on our SCVideo are more remarkable, we can harvest an improvement of 5.429 FID on MSR-VTT and 209.31 of FVD on UCF101. $\text{Tune-A-Video}^{{\dagger}}$ is also clearly improved by our POS, whose FID and FVD can be boosted by a margin of 6.095 and 242.91, respectively. In comparison, VideoCrafter only harvests incremental performance gains from ONA, the reason mainly stems from the poor inversion quality. In our practice, we found that the noise from VideoCrafter’s inversion function cannot well reconstruct the video, which indicates that the inverted noise fails to locate the optimal noise, thereby providing a not good guidance for ONA. Figure 5 shows two groups of qualitative results from two SOTA models, ModelScope and SCVideo, from which we can intuitively observe the effectiveness of our POS.

As shown in Table 2, with the assistance of NPNet, POS* significantly enhances the performance of the baseline models, achieving comparable results to POS utilizing the retrieval pool. This provides us with two options: 1) Employing the training-free POS with a retrieval pool. 2) Leveraging POS^∗ with a pre-trained NPNet to overcome storage issues.

5.4 Ablations

Can POS Benefit Image Generation? To answer this question, we augment the Stable Diffusion (SD) v1.5 [28] and Stable Diffusion-XL (SD-XL) [23] with our POS and evaluate their performance on MS-COCO test set [16]. We employ the test set of Flickr30k [45] as the candidate pool instead of the MS-COCO training set to verify the generalization of POS, and $\eta$ and $\gamma$ are set as 0.05 and 0.4, the other hyperparameters remain the same as video experiments.

Table 3 compares the results, we have the following observations: (a) ONA yields a more substantial performance improvement for SD-XL compared to SD 1.5. This discrepancy arises from the fact that SD-XL samples noise from a larger space (128×128) in comparison to SD 1.5 (64×64), thereby rendering the task of achieving or approximating optimal noise more challenging for SD-XL. Consequently, ONA can leverage its strengths more effectively in the case of SD-XL. Conversely, the noise space of SD 1.5 is relatively constrained, and the extensive training with a voluminous dataset has effectively aligned the noise and image spaces. Consequently, the efficacy of ONA is less pronounced in this context. (b) SPR exhibits superior performance on SD 1.5 in contrast to SD-XL. This phenomenon can be attributed to SD-XL training a more potent text encoder, which is adept at capturing finer textual details.

Size of Candidate Pool. To investigate the effect of candidate pool size, we randomly sample 10K, 100K, 1M, and 10M samples from the Webvid dataset to study the performance trend. From Table LABEL:tab:poolDis, we can observe that the larger the retrieval pool typically results in better performance. For example, enlarging the pool size from 10K to 10M can promote the FID on MSR-VTT from 43.236 to 42.072. Notably, although our default set is $N=$ 100K, reducing the scale to 10K does not severely hinder the performance. We take 100K to pursue a better trade-off between the scale of the candidate pool and the performance.

Proportion of Guided Noise in ONA. Hyperparameter $\eta$ in Eq. 4 determines the final composition of the noise, this part discusses the performance impact of how to hybridze the two types of noises by varying $\eta$ . The results are reported in Figure 6, We can observe a clear tendency from the figure, that using the pure random noise ( $\eta=0$ ) or guided noise ( $\eta=\infty$ ) can not yield satisfactory performance. Instead, an appropriate fusion of the two noise types is a more effective manner, we can harvest the best performance when $\eta=0.5$ .

Number of Dominated Steps of Rewritten Text in DHS. The hyperparameter $\gamma$ controls how many steps of the rewritten text are performed in DHS. We vary $\gamma$ in this part and study its effect on performance, and do not equip the ONA to observe a clearer trend. As shown in Table LABEL:tab:diffStep, the rewritten sentence ( $\gamma=1$ ) can boost the quality compared to the raw prompt. Specifically, compared with the baseline( $\gamma=0$ without rewritten text), applying the rewritten sentence reduces FID from 48.331 to 43.531 and promotes the IS score from 23.224 to 25.648. However, the CLIP-SIM drops from 0.283 to 0.275, indicating the semantic consistency is deteriorated. Our DHS strategy can remedy the issue, for example, introducing the original text in the last 90% denoising steps ( $\gamma=0.1$ ) can boost the CLIP-SIM from 0.275 to 0.284.

Number of Reference Text for Rewriting. By default, we employ 5 reference sentences for the text rewriting, this subsection examines the impact of varying the number of references for text rewriting. Table LABEL:tab:numTextDis presents the performance results across different reference numbers, highlighting the positive impact of incorporating references (we focus solely on the rewriting strategy, excluding the ONA and DHS techniques). For instance, the FID score is significantly improved from 45.435 without references to 43.531 with five references. Despite consistent improvements in other metrics, CLIP-SIM shows a decline due to the introduction of new content or objects. However, this issue can be effectively mitigated by our DHS mechanism, as evidenced in Table 1 and Table LABEL:tab:diffStep.

Effect of Large Language Models (LLMs). ChatGPT is taken as the rewriting engine due to its excellent performance, we also present a discussion regarding the LLMs to further validate our methods. Table LABEL:tab:llms compares the results of using ChatGPT (GPT-3.5-Turbo) and Llama2-7B in SPR. The findings indicate that ChatGPT consistently outperforms Llama2-7B across five key metrics, exhibiting a substantial advantage. Moreover, ChatGPT also shows a higher level of intelligence in our practice, as it does not need regularization to filter out irrelevant content.

6 Conclusion

This work presents POS, a model-agnostic suite for enhancing diffusion-based text-to-video generation by improving two crucial inputs: the noise and the text prompt. To approximate the optimal noise for a given text prompt, we propose an optimal noise approximator. This module involves a two-stage process, starting with the search for a video neighbor closely related to the text prompt, and subsequently performing DDIM inversion on the selected video. Alternatively, we also present another solution: training a Noise Prediction Network to overcome storage and search issues associated with the candidate pool, while achieving comparable performance. Additionally, we devise a semantic-preserving rewriter to enrich the details in the text prompt, aiming to augment the original text input by providing more comprehensive information. To allow a reasonable detail compensation and maintain semantic consistency, we propose a reference-guided rewriting approach and incorporate hybrid semantics during the denoising stage for semantics preserving. To evaluate the effectiveness of our method, we integrate the proposed POS into four backbones and conduct extensive experiments using the widely used benchmarks. The experimental results demonstrate the efficacy of our approach in enhancing text-to-video models.

References

Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
Ge et al. [2023] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. arXiv preprint arXiv:2305.10474, 2023.
Hao et al. [2022] Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.
He et al. [2022a] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022a.
He et al. [2022b] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. 2022b.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022b.
Hong et al. [2023] Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, and Seungryong Kim. Large language models are frame-level directors for zero-shot text-to-video generation, 2023.
Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
Kay et al. [2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
Kynkäänniemi et al. [2022] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of imagenet classes in fr $\backslash$ ’echet inception distance. arXiv preprint arXiv:2203.06026, 2022.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Liu et al. [2023] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control, 2023.
Lu et al. [2023] Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. Tf-icon: Diffusion-based training-free cross-domain image composition, 2023.
Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[21] OpenAI. Dall-e3. https://cdn.openai.com/papers/DALL_E_3_System_Card.pdf.
OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
Qi et al. [2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
Razavi et al. [2019] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.
Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Tran et al. [2015] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
Unterthiner et al. [2018] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[36] Quo Vadis, Joao Carreira, and Andrew Zisserman. Action recognition? a new model and the kinetics dataset. Joao Carreira, Andrew Zisserman.
Wallace et al. [2022] Bram Wallace, Akash Gokul, and Nikhil Naik. Edict: Exact diffusion inversion via coupled transformations, 2022.
Wang et al. [2023a] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report, 2023a.
Wang et al. [2023b] Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023b.
Wang et al. [2023c] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023c.
Wu et al. [2021] Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021.
Wu et al. [2022] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
Xu et al. [2016] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
Yan et al. [2021] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
Young et al. [2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
Zhou et al. [2022] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
Zhu et al. [2023] Junchen Zhu, Huan Yang, Huiguo He, Wenjing Wang, Zixi Tuo, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, and Jianlong Fu. Moviefactory: Automatic movie creation from text using large generative models for language and images, 2023.