OmniVDiff: Omni Controllable Video Diffusion
for Generation and Understanding

Dianbing Xi^1,2,∗, Jiepeng Wang^2,∗,‡, Yuanzhi Liang², Xi Qiu², Yuchi Huo¹,
Rui Wang^1,†, Chi Zhang^2,†, Xuelong Li^2,†
¹Zhejiang University, ²Institute of Artificial Intelligence, China Telecom (TeleAI)

Abstract

In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff , aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality’s role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentaion) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications. Our project page: https://tele-ai.github.io/OmniVDiff/.

\begin{overpic}[width=433.62pt]{fig/fig1_teaser3.png} \put(2.0,20.0){\rotatebox{90.0}{\small{(a) Text to multi-modal video generation}}} \put(5.0,17.0){\rotatebox{90.0}{\small canny}} \put(5.0,27.5){\rotatebox{90.0}{\small segmentation}} \put(5.0,43.0){\rotatebox{90.0}{\small depth}} \put(5.0,56.0){\rotatebox{90.0}{\small rgb}} \par\put(95.0,36.0){\rotatebox{-90.0}{\small{(b) Segmentation-conditioned video generation}}} \put(90.0,35.0){\rotatebox{-90.0}{\small seg}} \put(90.0,6.0){\rotatebox{-90.0}{\small rgb}} \par\end{overpic}

Figure 1: Omni controllable video generation. Given a text prompt, (a) OmniVDiff generates high-quality RGB videos while simultaneously producing aligned multi-modal visual understanding outputs (i.e., depth, segmentation and canny). Additionally, (b) OmniVDiff supports X-conditioned video generation within a unified framework, such as segmentation-conditioned video generation.

¹¹footnotetext: Equal contribution, ^†Corresponding author, ^‡Project lead.

1 Introduction

Diffusion models have made significant strides in image [41] and video generation [1, 32, 25, 58], achieving remarkable results by training on large-scale datasets and demonstrating remarkable improvements in downstream applications such as conditional controllable generation. For controllable video generation, models typically leverage conditions such as depth [19, 35, 21, 57, 14, 6], segmentation [67, 31, 28], or canny edges [37] to guide the diffusion process. By fine-tuning a pretrained text-to-video (T2V) model [1, 58], these models achieve significant results of controllable video generations. Despite these achievements, most existing models require extensive task-specific fine-tuning, limiting their generalization across different tasks and making them computationally expensive. Additionally, many studies have explored leveraging diffusion models for joint modality generation [61, 7], such as [61] for joint rgb and depth prediction. However, these approaches primarily focus on joint generation and lack support for generative understanding and conditional generation, limiting their applicability to broader tasks. Overall, these works highlight the impressive capabilities of video diffusion models, but the lack of adaptability presents a key challenge in developing a unified and efficient video diffusion framework capable of addressing diverse video-related tasks.

Recently, several concurrent studies in the image domain explored unifying multiple tasks within a single diffusion framework, by treating image-level tasks as a sequence of image views [33, 13, 65] (analogous to video generation). For example, the depth-conditioned generation can be regarded as a two-view (depth and rgb) diffusion task. While this approach has been effective for image-based tasks, extending it to video generation presents significant challenges. Unlike images, videos introduce an additional temporal dimension. Treating modalities as distinct video sequences would significantly increase the token length and computation cost in the transformer-based diffusion process, especially considering the quadratic computational complexity in the attention mechanism [47]. The challenge of extending such approaches into a unified video diffusion framework that can handle both conditioned and unconditioned generation remains largely unexplored.

In this work, we present OmniVDiff , a unified framework to support various video generation and understanding tasks. Our approach consists of two key components: (1) a multi-modal video diffusion architecture and (2) an adaptive modality control strategy to support generation and understanding of multiple visual modalities efficiently. Specifically, (1) for the diffusion network architecture, we extend the dimensionality of the input noise to match the number of modalities, allowing our model to process multiple visual inputs seamlessly. Meanwhile, we adopt distinct visual decoding heads on the output side, each responsible for generating the corresponding visual signals, ensuring modality-specific outputs while maintaining a unified framework. (2) To enhance adaptability, we introduce a flexible controlling strategy that dynamically determines the role of each visual modality as either a generation or conditioning modality. This is achieved using learnable embeddings that guide the behavior of the model based on the task. For instance, in text-to-video generation, all modalities utilize the generation embedding, enabling the model to synthesize video purely from noise conditioned on textual descriptions. In contrast, for depth-conditioned video generation, the depth modality is not blended with noise and assigned a conditioning embedding, while other modalities retain the generation embedding, ensuring the depth information effectively guides the generation process. Through this strategy, our method provides fine-grained control over different modalities, achieving a unified yet adaptable framework capable of handling a diverse range of video generation and understanding tasks.

In this paper, we focus on four representative visual modalities: rgb, depth, segmentation, and canny edges. To effectively train our model, we require a sufficient dataset containing paired modalities. However, due to the scarcity of such data, we first filter a subset videos from Koala-36M [51] and then utilize expert models to generate pseudo labeled data for each modality, ensuring high-quality supervision and broad coverage across diverse scenarios.

We evaluate our approach through extensive experiments across various tasks, including text-to-video generation, X-conditioned video generation, and video multi-modality understanding. Our results demonstrate the model’s ability for controllable video generation via a single diffusion model, without the need of external expert models and finetuning on different conditions independently. Additionally, we assess the extensibility of our method in adapting it to various downstream applications and new modalities, showing that it can be easily generalized to novel tasks such as video-to-video style control and video super-resolution. These results highlight the robustness and versatility of our framework, showcasing its potential for a wide range of video-related applications.

2 Related Works

2.1 Text-to-Video Diffusion

Text-to-video (T2V) diffusion models have made significant progress in generating realistic and temporally consistent videos from text prompts [3, 70, 32, 68, 39]. SVD [1], VDM [24] and following works [23, 42, 20, 48, 25] explore extending image diffusion models [41] for video synthesis with spatial and temporal attention [69, 18, 2, 8, 9, 50, 17, 52, 60, 59, 49, 44, 31, 53, 1, 56, 55, 62, 54, 15]. Recent methods also introduce 3D Variational Autoencoder (VAE) to compress videos across spatial and temporal dimensions, improving compression efficiency and video quality[58, 66, 32, 68].

However, these approaches primarily focus on text-conditioned video generation and lack fine-grained control over video attributes. Tasks such as depth-guided or segmentation-conditioned video generation remain challenging, as text-to-video diffusion models do not explicitly support these controls. Meanwhile, all these methods mainly focus on the RGB modality output, without considering the generative capability of other visual modalities.

2.2 Controllable Video Diffusion

To address controllable video generation, many methods try to introduce additional conditioning signals to guide the diffusion process. Depth maps can provide accurate geometric and structural information, ensuring realistic spatial consistency across frames [19, 35, 21, 57, 14, 6, 12, 63, 37, 28, 71, 26]. Pose conditioning ensures accurate human motion synthesis by constraining body articulation and joint movements[30, 27, 38, 64, 16, 28, 45, 71]. Optical flow constrains motion trajectories by capturing temporal coherence and movement patterns, enhancing dynamic realism [11, 29, 35].

However, these existing methods face two major challenges: (1) Fine-tuning for each task: incorporating new control signals typically requires task-specific fine-tuning on large-scale diffusion architectures, making these models computationally expensive and difficult to scale across diverse control modalities. (2) Dependency on external expert models: most approaches rely on pre-extracted conditioning signals from external expert models. For example, in depth-conditioned video generation, a separate depth estimation model is first applied to a reference video, and the estimated depth is then fed into a distinct video diffusion model for generation. This results in a multi-step, non-end-to-end pipeline where each component is trained separately, potentially causing inconsistencies across models and complex operations.

Some efforts have attempted to unify multi-modal generation within a single diffusion model, such as joint video-depth generation [61]. A concurrent work, VideoJAM [7], extends this idea by simultaneously predicting RGB and optical flow. However, these methods primarily focus on the joint prediction of two modalities, while generalizing to more modalities and enabling flexible modality-conditioned generation and understanding remain largely under-explored. In this paper, our method addresses these challenges by introducing a unified framework that allows fine-grained adaptive modality control. Unlike prior works, we do not require separate fine-tuning for each control modality and eliminate the reliance on external expert models by integrating multi-modal understanding and generation into a single pipeline. This enables more efficient, end-to-end controllable video synthesis, significantly improving scalability and coherence across video generation tasks.

3 Method

\begin{overpic}[width=433.62pt]{fig/fig2_method5.png} \put(4.0,36.0){\small rgb} \put(7.0,36.0){\small depth} \put(12.0,36.0){\small seg} \put(16.0,36.0){\small canny} \par\put(26.0,27.0){\large$\mathcal{E}$ } \put(74.0,27.0){\large$\mathcal{D}$ } \put(41.5,36.5){\small Text prompt } \par\put(46.0,27.0){ \emph{OmniVDiff} } \par\put(21.7,7.1){\small\emph{OVDiff} } \put(72.2,7.1){ \small\emph{OVDiff} } \par\put(10.0,39.0){\small(a) Video encoding} \put(45.0,39.0){\small(b) \emph{OmniVDiff} } \put(75.0,39.0){\small(c) Video decoding} \par\put(35.0,21.0){\small$e_{m}$} \par\put(42.0,21.0){\small$x_{i}$} \put(58.0,21.0){\small$x_{o}$} \par\put(10.0,-1.5){\small(d) Multi-modal video generation} \put(60.0,-1.5){\small(e) X-conditioned generation/understanding} \par\put(3.0,16.0){\small[rgb} \put(7.0,16.0){\small depth} \put(12.0,16.0){\small seg} \put(16.0,16.0){\small canny] + noise} \par\put(31.0,16.0){\small rgb} \put(35.0,16.0){\small depth} \put(40.0,16.0){\small seg} \put(44.0,16.0){\small canny} \par\par\put(54.0,16.0){\small{{\color[rgb]{0,0,1}rgb}}} \put(57.0,16.0){\small[depth} \put(63.0,16.0){\small seg} \put(66.0,16.0){\small canny] + noise} \par\put(82.0,16.0){\small{\color[rgb]{0,0,1}rgb}} \put(86.0,16.0){\small depth} \put(91.0,16.0){\small seg} \put(95.0,16.0){\small canny} \par\par\end{overpic}

Figure 2: Method overview. (a) Given a video with four paired modalities, we first encode it into latents using a shared 3D-VAE encoder; (b) Then, concatenate them along the channel dimension and apply noise for video diffusion, where the denoised latents are then decoded into their respective modalities via modality-specific decoding heads; (c) Finally, each modality can be reconstructed into color space by the 3D-VAE decoder. During inference, the model enables various tasks by dynamically adjusting the role of each modality: (d) Text-to-video generation, where all modalities are denoised from pure noise and (e) X-conditioned generation, where the condition X is given and other modalities are denoised from pure noise. If X is RGB modality, the model will perform generative understanding.

In this section, we introduce OmniVDiff , a unified framework for video generation and understanding, extending video diffusion models to support multi-modal video synthesis and analysis. We begin with a preliminary introduction to video diffusion models (Sec. 3.1). Then, in Sec. 3.2, we detail our network design and adaptive control strategy, which enable seamless handling of text-to-video generation, modality-conditioned video generation, and multi-modal video understanding. Finally, in Sec. 3.3, we describe our training data preparation and strategy. Fig. 2 provides an overview of our framework.

3.1 Preliminary

Video diffusion models generate videos by progressively refining noisy inputs through a denoising process, following a learned data distribution. CogVideoX [58], a state-of-the-art text-to-video diffusion model, incorporates a 3D Variational Autoencoder (3D-VAE) to efficiently compress video data along both spatial and temporal dimensions, significantly reducing computational costs while preserving motion consistency.

Given an input video $V\in\mathbb{R}^{f\times h\times w\times c}$ , where $f,h,w,c$ denote the number of frames, height, width, and channels, respectively, the 3D-VAE encoder downsamples it using a spatiotemporal downsampling factor of (8,8,4) along the height, width, and frame dimensions: $F=\frac{f}{4},\quad H=\frac{h}{8},\quad W=\frac{w}{8}$ . This process captures both appearance and motion features while significantly reducing the memory and computational requirements of the diffusion process. The video diffusion model operates in this latent space, iteratively denoising $\mathbf{x}_{t}$ through a learned reverse process. The training objective minimizes the mean squared error (MSE) loss for noise prediction:

\mathcal{L}_{\text{denoise}}=\mathbb{E}_{\mathbf{x}_{0},t,\epsilon}\left[\|\epsilon-\epsilon_{\theta}(\mathbf{x}_{t},t)\|^{2}\right]

where $\epsilon_{\theta}$ is the noise prediction model, $\mathbf{x}_{t}$ is the noisy latent at timestep $t$ , and $\epsilon$ is the added noise.

3.2 Omni Video Diffusion

Multi-Modal video diffusion architecture

To achieve omni-controllable video diffusion, we design a novel video diffusion architecture that learns a joint distribution over multiple visual modalities. Building upon the pretrained text-to-video diffusion model CogVideoX, we extend the input space to accommodate multiple modalities. On the output side, we introduce modality-specific decoding heads to recover each modality separately. This design enables our architecture to seamlessly support multi-modal inputs and outputs, ensuring flexible and controllable video generation.

Given a video sequence and its paired visual modalities $V=\{V_{r},V_{d},V_{s},V_{e}\}$ , where $V_{r}$ , $V_{d}$ , $V_{s}$ , and $V_{e}$ represent rgb, depth, segmentation, and canny edges, respectively, we first encode them into a latent space using a pretrained 3D-causal VAE encoder $\mathcal{E}$ [58]. Each modality is mapped to latent patches:

x_{m}=\mathcal{E}(V_{m}),\quad m\in\{r,d,s,c\},

where $x_{m}\in\mathbb{R}^{F\times H\times W\times C}$ and $F,H,W,C$ denote the number of frames, height, width, and latent channels, respectively.

Next, we concatenate these latents along the channel dimension, forming a unified representation as the diffusion transformer input $x_{i}=\text{Concat}(x_{r},x_{d},x_{s},x_{c})$ . This multi-modal latent representation is then blended with noise and processed through the video diffusion model to learn a joint distribution across different modalities.

On the output side, we employ modality-specific decoding heads $H_{m}$ , where each head is responsible for reconstructing the noise output $\epsilon_{m}$ of a specific modality from the diffusion transformer output $x_{o}$ :

\epsilon_{m}=H_{m}(x_{o})

Each decoding head $H_{m}$ is independently optimized to reconstruct its corresponding modality while maintaining temporal consistency across frames. Finally, the denoised latents are decoded back into the color space using the pretrained 3D-VAE decoder $\mathcal{D}$ [58], producing high-fidelity multi-modal video outputs.

Adaptive modality control strategy

A key challenge in unified video generation is determining the role of each modality—whether it serves as a generation signal or a conditioning input. To address this, we introduce an adaptive control strategy that dynamically assigns roles to different modalities based on the task.

During training, generation modalities are blended with noise before being fed into the diffusion model, while conditioning modalities remain unchanged and are concatenated with the noisy inputs of other modalities to serve as conditioning signals. This mechanism ensures flexible and adaptive control over different modalities, allowing the model to seamlessly handle diverse tasks within a unified framework. Specifically, in a text-to-video generation task, all modalities are generated from pure noise, meaning they act as generation signals. In an $X$ -conditioned generation task, where $X$ represents depth, segmentation, or canny edges, the conditioning modality $X$ is provided as input directly without blending with noise and concatenated with the noisy latent representations of other modalities. Notably, if $X$ represents the rgb modality, the model instead performs a video understanding task and predicts corresponding multi-modal outputs.

To further enhance the diffusion model’s ability to distinguish modality roles, we introduce a modality embedding $\mathbf{e}_{m}$ that differentiates between generation ( $\mathbf{e}_{g}$ ) and conditioning ( $\mathbf{e}_{c}$ ) roles, which can be directly added to the diffusion model input.

\mathbf{e}_{m}=\begin{cases}\mathbf{e}_{g},&\text{if }m\text{ is for generation}\\ \mathbf{e}_{c},&\text{if }m\text{ is for conditioning}\end{cases}

This strategy enables flexible and efficient control, allowing the model to seamlessly adapt to different tasks without requiring separate architectures for each modality.

3.3 Training

Training data

To train our model, we prepared a large training set consisting of paired data from various modalities, including segmentation, depth, and canny edge information. However, due to the inherent scarcity of high-quality labeled video data, expert models play a crucial role in our training pipeline. These models generate pseudo labels for unannotated video data, which are then used as part of the training set. This approach allows us to scale the dataset and improve the model’s generalization without the need for extensive manual labeling.

For video depth, we utilize Video Depth Anything [10], which is specifically designed for generating accurate depth maps for video sequences. This method ensures depth consistency throughout the video and aids in generating realistic depth-aware video outputs. For segmentation, we first use SemanticSAM [34] to perform instance segmentation on the first frame of each video, which is then used as a prompt to guide SAM2 [40] for segmenting subsequent frames, ensuring semantic consistency across the video. For canny edges, we employ the canny algorithm [4] implemented in OpenCV library to perform edge detection.

In total, we processed 400K video samples, randomly sampled from the Koala-36M [51] dataset. The training process for video depth estimation required approximately 3 days to complete, while video segmentation took 5 days using 8 H100 GPUs for parallel processing.

Training strategy

We adopt a two-stage training strategy to enable a smooth transition from single-modality video generation to multi-modal controllable video synthesis. In the first stage, we fine-tune a pre-trained video generation model, CogVideoX, to generate multiple visual modalities (e.g., depth, segmentation, canny edges) from text conditions. This step helps the model learn a strong prior for multi-modal video synthesis, ensuring alignment between different modalities within a unified latent space. In the second stage, we introduce our adaptive modality control strategy, allowing the model to perform X-conditioned video generation and multi-modal video understanding. This stage enables the model to distinguish between generation and conditioning roles for different modalities, enhancing its flexibility across tasks.

Through this progressive learning approach, the model evolves from single rgb video generation to multi-modal controllable video generation, enabling seamless transition between various video generation and understanding tasks.

Training loss

We optimize our unified video generation and understanding framework using a multi-modality diffusion loss, ensuring high-quality generation while maintaining flexibility across different modalities.

For each modality, we apply an independent denoising loss. If a modality serves as a conditioning input, the denoising loss is skipped for that modality, ensuring it only guides the generation process without being explicitly optimized. The final objective is:

\mathcal{L}=\sum_{m,m\notin Cond}\lambda_{m}\mathbb{E}_{\mathbf{x}_{m},t,\epsilon,m}\left[\|\epsilon-\epsilon_{\theta}(\mathbf{x}_{m}^{t},t,e_{m})\|^{2}\right]

where $\lambda_{m}$ controls the supervision strength for each modality, $\epsilon_{\theta}$ is the noise prediction model, and $x_{m}^{t}=(1-t)*\epsilon+t*x_{m}+e_{m}$ is the noisy video representation at timestep $t$ .

This approach provides adaptive supervision, enabling flexible role assignments for modalities and allowing the model to seamlessly transition between generation and conditioning tasks.

4 Experiments

4.1 Implementation details

We fine-tune our model based on CogVideoX-5B [58], a large-scale text-to-video diffusion model. The fine-tuning process follows a two-stage training strategy, progressively adapting the model from multi-modality video generation to multi-modal controllable video synthesis with the support of X-conditioned video generation and video visual understanding. We train the model using a learning rate of 2e-5 on 8 H100 GPUs for 600K steps. The model is optimized using a batch size of 1, with each training stage consisting of 300K steps. To evaluate the performance of video generation, we follow [36] and report two commonly used metrics: Fréchet Video Distance (FVD) [46] and Kernel Video Distance (KVD) [46] on UCF101[43].

4.2 Omni controllable video generation

To evaluate the effectiveness of our approach, we compare it with state-of-the-art methods on text-to-video generation and X-conditioned video generation. Meanwhile, our model’s video understanding capacity is also evaluated for controllable video generation.

Text-conditioned Multi-modal Video Generation

Given a text prompt, OmniVDiff generates multi-modal video sequences simultaneously within a single diffusion process. To provide a comprehensive evaluation of our generation performance, we compare our method with state-of-the-art video diffusion models on rgb video generation and assess the quality of the generated rgb videos.

Table 1 presents a quantitative comparison, where our model achieves a comparable FVD with all baselines, demonstrating superior generation quality. Notably, our multi-modality training scheme improves performance than CogVideoX[58]. We attribute this to the additional visual modalities, which provide stronger regularization compared to using only the rgb space, leading to more coherent and consistent predictions, where the concurrent work VideoJAM [7] shares similar observations. Fig. 3 showcases qualitative results, illustrating that our method produces more plausible and coherent videos with better alignment to text prompts. Please refer to the supplementary file for more video visualizations.

\begin{overpic}[width=411.93767pt]{fig/fig3_t2mm.png} \put(-6.0,26.0){ \rotatebox{90.0}{\small Ours}} \put(-6.0,0.0){ \rotatebox{90.0}{\small CogVideoX}} \end{overpic}

Figure 3: Qualitative comparison of text-to-video generation. Our method generates more coherent and temporally consistent video sequences. Additionally, our method also produces multiple aligned visual modalities, which are not displayed here due to space constraints.

Method	Condition	FVD $\downarrow$	KVD $\downarrow$
CogVideo (Chinese)[25]	Text	751.34	–
CogVideo (English)[25]	Text	701.59	–
CogVideoX[58]	Text	584.74	70.20
MagicVideo[69]	Text	699.00	–
Video LDM[2]	Text	550.61	–
Ours	Text	527.12	60.79

Table 1: Video generation quality comparison of FVD and KVD with baseline methods.

X-conditioned video generation

For modality-conditioned video generation, we compare our model against baselines, which synthesize videos guided by additional visual inputs such as depth, segmentation, or edge maps. Unlike these baselines, which require separately fine-tuned models for a specified condition, our framework unifies different conditioning modalities within a single architecture using adaptive modality control.

As shown in Table 2, our model achieves better performance in depth-conditioned generation than methods specifically designed for depth-based conditioning, demonstrating improved structural preservation and stronger adherence to the conditioning signal. We additionally report the performance of our method on segmentation- and canny-conditioned generation to provide a comprehensive evaluation. Fig. 4 and Fig. 5 present qualitative comparisons of depth-conditioned generation and visualizations of segmentation- and canny-conditioned generation, showing that our method faithfully adheres to the given conditions.

These results validate the effectiveness of our unified video diffusion framework, demonstrating that it not only excels in text-to-video generation but also surpasses existing approaches in modality-conditioned controllable video synthesis within a single model.

\begin{overpic}[width=433.62pt]{fig/fig6_compare_depth.png} \put(-5.0,40.0){\small\rotatebox{90.0}{(a) depth cond}} \put(-5.0,25.0){\small\rotatebox{90.0}{(b) ours}} \put(-5.0,3.0){\small\rotatebox{90.0}{(c) MYV}} \end{overpic}

Figure 4: Depth-conditioned video generation comparison. Given a depth condition (a), our method (b) can produce more consistent and coherent video sequence than Make-Your-Video (c).

\begin{overpic}[width=433.62pt]{fig/fig4_x-cond.png} \put(-5.0,45.0){\small\rotatebox{90.0}{(a) canny-cond gen}} \put(-5.0,5.0){\small\rotatebox{90.0}{(b) seg-cond gen}} \par\end{overpic}

Figure 5: X-conditioned video generation. OmniVDiff can robustly perform fine-grain-conditioned generation from canny (a) and segmentation (b) signals.

Method	Condition	FVD $\downarrow$	KVD $\downarrow$
T2V-z+Ctrl [31]	Text+Depth	951.38	115.55
LVDM[22]	Text+Depth	537.85	85.47
MYV[36]	Text+Depth	330.49	29.52
Ours	Text+Depth	326.99	27.00
Ours	Text+Canny	302.77	30.25
Ours	Text+Segment	254.09	20.05

Table 2: Comparison of FVD and KVD for different methods and conditions.

RGB-conditioned generative video understanding

Beyond its generation capability, our model also possesses generative video understanding ability. When using rgb as the conditioning modality, the model can estimate corresponding visual properties, including depth, segmentation, and canny edges. Fig. 6 presents an example of our visual understanding results, showing that our method produces consistent and plausible predictions across frames.

To further evaluate the advantage of our framework’s understanding capacity for controllable video generation, we use the depth, segmentation, and canny maps predicted by our model as conditioning signals to generate new video sequences. We then compare the generation quality with results obtained using conditioning signals from individual expert models. Specifically, given an input video, we first use RGB sequences as input conditions to generate the corresponding multi-modal understanding outputs. Then, for each estimated modality, we use it as a conditioning signal and feed it back into our generation model to synthesize a new video sequence.

Table 3 summarizes the comparison results, showing that our self-estimated modality conditions lead to better generation quality than conditions obtained from separate expert models. This demonstrates that our model not only enhances video understanding but also provides stronger and more coherent guidance for controllable video generation.

\begin{overpic}[width=433.62pt]{fig/fig5_percep.png} \put(-5.0,48.0){\small\rotatebox{90.0}{(a) rgb input}} \put(-5.0,31.0){\small\rotatebox{90.0}{(b) depth}} \put(-5.0,16.0){\small\rotatebox{90.0}{(c) canny}} \put(-5.0,0.0){\small\rotatebox{90.0}{(d) seg}} \end{overpic}

Figure 6: Visualization of multi-modal video understanding. Given a reference video (a), OmniVDiff can estimate aligned multiple visual understanding predictions in one diffusion process (b,c,d).

Method	Condition	FVD $\downarrow$	KVD $\downarrow$
Ours	Text+Depth	326.99	27.00
VDA[10]	Text+Depth	337.01	28.18
Ours	Text+Canny	302.77	30.25
Canny[4]	Text+Canny	235.65	19.34
Ours	Text+Segment	254.09	20.05
SAM[34, 40]	Text+Segment	256.32	21.33

Table 3: Video Understanding for controllable video generation. Using predicted conditioning signals from our model leads to better generation quality compared to conditioning on outputs from individual expert models.

4.3 Ablation

To analyze the significance of our model’s design choices, we conduct an ablation study by training variants of our model under different configurations. We systematically examine the impact of multi-modal output and adaptive modality control by fine-tuning on CogVideo-5B with different training durations. We use CogVideo-5B as the baseline and evaluate the generation quality of rgb videos from our model on UCF101 [43] (for training) and Kinetics-700 [5] (for testing). Our experiments are structured as follows: 1) Baseline (CogVideo-5B) – standard text-to-video model with only rgb output. 2) Multi-modal video generation (Ours-MM) – fine-tuning for 300K steps, enabling simultaneous multi-modal outputs. 3) Adaptive modality control (Ours-AMC, full) – further fine-tuning for 300K steps, allowing the model to dynamically adapt to different combinations of generation and conditioning modalities.

Table 4 presents quantitative comparisons across different configurations, showing that our full setting achieves the best generation results. Compared to the vanilla CogVideoX, fine-tuning with multi-modal outputs helps regularize the latent space, leading to improved generation quality. Additionally, applying the adaptive control strategy enables the model to better capture the relationships between modalities, further enhancing performance and producing the most consistent results.

Method	Condition	FVD $\downarrow$
CogVideoX	Text	853.57
Ours-MM	Text	531.79
Ours-AMC (full)	Text	518.11

Table 4: Ablation study of different training settings. “Ours-MM” refers to CogVideoX fine-tuned with multi-modal output, while “Ours-AMC (full)” includes both multi-modal output and adaptive modality control.

4.4 Applications

Video-to-video style control

OmniVDiff can be seamlessly applied to the downstream task of video-to-video style control. It enables controllable video generation by preserving the structural layout of a reference video while flexibly modifying its appearance through user-defined text prompts. We first use OmniVDiff to estimate the depth (Fig. 7(b)) of a reference video (Fig. 7(a)). Then, leveraging this structural understanding, OmniVDiff can generate videos with diverse scene styles, such as winter, autumn, and sunset, while faithfully maintaining the original scene structure, as shown in Fig. 7.

\begin{overpic}[width=390.25534pt]{fig/fig7_style.png} \put(-5.0,65.0){\rotatebox{90.0}{\small(a)}} \put(-5.0,50.0){\rotatebox{90.0}{\small(b)}} \put(-5.0,32.0){\rotatebox{90.0}{\small(c)winter}} \put(-5.0,14.0){\rotatebox{90.0}{\small(d)autumn}} \put(-5.0,-3.0){\rotatebox{90.0}{\small(e)sunset}} \end{overpic}

Figure 7: Video2Video Style Control. Given a reference video (a), OmniVDiff first estimates the corresponding depth (b) and uses it as a bridge to control the scene structure, enabling the generation of videos with diverse scene styles through text-based control (c,d,e).

3D scene reconstruction

Additionally, since OmniVDiff estimates the corresponding depth, which captures the 3D geometric structure of the scene, we can directly reproject the depth video into a 3D space to reconstruct the scene. This reconstructed 3D world can then be rendered over time from novel viewpoints, enabling dynamic scene visualization from different perspectives, as illustrated in Fig. 8.

\begin{overpic}[width=411.93767pt]{fig/fig9_4d_2.png} \put(-5.0,37.0){\small(a)} \put(-5.0,22.0){\small(b)} \put(-5.0,7.0){\small(c)} \par\end{overpic}

Figure 8: Scene reconstruction. Given a reference video (a), our model estimates the corresponding depth (b). The depth video can be reprojected into the 3D world and rendered from novel viewpoints.

Adaptability to new modalities/tasks

To assess the adaptability of our model to new modalities and applications, we conduct experiments on video deblurring and video super-resolution. Specifically, we further finetune OmniVDiff for 2k steps and repurpose an existing modality slot (canny) to accommodate a new input modality—either blurred rgb videos or low-resolution rgb videos in the training stage. During inference, the model uses these inputs as conditioning signals and generates high-quality outputs, producing deblurred or high-resolution rgb videos accordingly. Fig. 9 shows two examples for these two tasks. This setup demonstrates the model’s flexibility in extending to previously unseen modalities and tasks with minimal adjustments.

\begin{overpic}[width=390.25534pt]{fig/fig8_deblur+sr2.png} \put(-5.0,85.0){\small(a)} \put(-5.0,70.0){\small(b)} \put(-5.0,55.0){\small(c)} \put(-5.0,37.0){\small(d)} \put(-5.0,22.0){\small(e)} \put(-5.0,7.0){\small(f)} \end{overpic}

Figure 9: Adaptation to new applications. (1) Video super-resolution: (a) a reference video; (b) a low-resolution video from (a) as model input; (c) OmniVDiff generates a corresponding high-resolution video. (2) Video deblurring: Similarly, OmniVDiff can be fine-tuned for the video deblurring task, producing a sharp video (f) from a blurred input (e). The reference video is shown in (d).

4.5 Limitations

Despite its effectiveness, OmniVDiff has several limitations. First, due to the scarcity of high-quality real-world data with paired modalities, we adopt a strategy that uses pseudo labels generated by expert models. However, these pseudo labels can introduce artifacts—especially in segmentation, where results may lack consistency, as shown in Fig. 10. Impoving the quality of training data or leveraging high-quality synthetic data is a promising direction for enhancing training robustness. Second, in this work, we primarily focus on demonstrating the effectiveness and practicality of our approach for video generation using four commonly used modalities: rgb, depth, segmentation, and canny edges. Extending OmniVDiff to support a broader range of modalities is another promising direction, which we leave for future work. Lastly, due to the large computational cost and multiple denoising steps required by diffusion models, generating high-quality videos remains time-consuming. Exploring ways to accelerate inference, is an important avenue for future research.

\begin{overpic}[width=390.25534pt]{fig/fig10_limitations.png} \put(8.5,-3.5){\small(a)} \put(35.0,-3.5){\small(b)} \put(74.0,-3.5){\small(c)} \end{overpic}

Figure 10: Issues in training data (segmentation). (a) Inconsistency: the vehicle region is incorrectly segmented into multiple classes due to occlusion by a person; (b) Ambiguity: the segmentation granularity is not consistent; (c) Flickering: the segmentation labels of the same object (e.g., a tree) vary drastically across frames, leading to temporal inconsistency.

5 Conclusion

In this paper, we introduced OmniVDiff , a unified framework for multi-modal video generation and understanding, extending video diffusion models to support both text-to-video generation and X-conditioned video generation. Our approach enables the simultaneous generation of multiple visual modalities, including rgb, depth, segmentation, and edges, and incorporates an adaptive modality control strategy to distinguish between generation and conditioning inputs flexibly. Our unified video understanding results further highlight the efficiency of our approach, eliminating the need for sequential processing pipelines that rely on separate expert models. Our work provides a scalable, adaptable solution for video generation, allowing easy integration of new modalities while maintaining high performance across diverse video tasks. Future research can explore expanding modality support, improving long-term consistency, and enhancing real-time efficiency, further advancing the capabilities of unified video diffusion models.

References

Blattmann et al. [2023a] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
Blattmann et al. [2023b] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023b.
Brooks et al. [2024] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024. URL https://openai. com/research/video-generation-models-as-world-simulators, 3:1, 2024.
Canny [1986] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
Carreira et al. [2019] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
Ceylan et al. [2023] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
Chefer et al. [2025] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models, 2025.
Chen et al. [2023a] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
Chen et al. [2024a] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024a.
Chen et al. [2025] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. arXiv:2501.12375, 2025.
Chen et al. [2023b] Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023b.
Chen et al. [2023c] Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning. arXiv preprint arXiv:2305.13840, 2023c.
Chen et al. [2024b] Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, and Hengshuang. Unireal: Universal image generation and editing via learning real-world dynamics. arXiv preprint arXiv:2412.07774, 2024b.
Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7346–7356, 2023.
Feng et al. [2024] Ruoyu Feng, Wenming Weng, Yanhui Wang, Yuhui Yuan, Jianmin Bao, Chong Luo, Zhibo Chen, and Baining Guo. Ccedit: Creative and controllable video editing via diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6712–6722, 2024.
Gan et al. [2025] Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Humandit: Pose-guided diffusion transformer for long-form human motion video generation. arXiv preprint arXiv:2502.04847, 2025.
Ge et al. [2023] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023.
Guo et al. [2023] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
Guo et al. [2024] Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. In European Conference on Computer Vision, pages 330–348. Springer, 2024.
He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
He et al. [2023a] Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940, 2023a.
He et al. [2023b] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation, 2023b.
Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022a.
Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
Hou et al. [2024] Chen Hou, Guoqiang Wei, Yan Zeng, and Zhibo Chen. Training-free camera control for video generation. arXiv preprint arXiv:2406.10126, 2024.
Hu [2024] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.
Hu et al. [2025] Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance. arXiv preprint arXiv:2502.06145, 2025.
Hu and Xu [2023] Zhihao Hu and Dong Xu. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073, 2023.
Karras et al. [2023] Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. Dreampose: Fashion image-to-video synthesis via stable diffusion. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22623–22633. IEEE, 2023.
Khachatryan et al. [2023] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
Kong et al. [2024] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
Le et al. [2024] Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, and Jiasen Lu. One diffusion to generate them all, 2024.
Li et al. [2023] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023.
Liu et al. [2024a] Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, and Dong Liu. Stablev2v: Stablizing shape consistency in video-to-video editing. arXiv preprint arXiv:2411.11045, 2024a.
Liu et al. [2024b] Fangfu Liu, Hanyang Wang, Weiliang Chen, Haowen Sun, and Yueqi Duan. Make-your-3d: Fast and consistent subject-driven 3d content generation, 2024b.
Lv et al. [2024] Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2024.
Ma et al. [2024] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4117–4125, 2024.
Polyak et al. [2025] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du. Movie gen: A cast of media foundation models, 2025.
Ravi et al. [2024] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Tang et al. [2023] Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. Any-to-any generation via composable diffusion. Advances in Neural Information Processing Systems, 36:16083–16099, 2023.
Tu et al. [2024] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-quality identity-preserving human image animation. arXiv preprint arXiv:2411.17697, 2024.
Unterthiner et al. [2019] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. In DGS@ICLR, 2019.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Villegas et al. [2022] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
Wang et al. [2023a] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023a.
Wang et al. [2023b] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023b.
Wang et al. [2024a] Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. arXiv preprint arXiv:2410.08260, 2024a.
Wang et al. [2023c] Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. 2023c.
Wang et al. [2023d] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36:7594–7611, 2023d.
Wang et al. [2024b] Xingrui Wang, Xin Li, Yaosi Hu, Hanxin Zhu, Chen Hou, Cuiling Lan, and Zhibo Chen. Tiv-diffusion: Towards object-centric movement for text-driven image to video generation. arXiv preprint arXiv:2412.10275, 2024b.
Wang et al. [2024c] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, pages 1–20, 2024c.
Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
Xing et al. [2024] Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 2024.
Yang et al. [2024] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
Yin et al. [2023] Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation. arXiv preprint arXiv:2303.12346, 2023.
Yu et al. [2023] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18456–18466, 2023.
Zhai et al. [2024] Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, and Lijuan Wang. Idol: Unified dual-modal latent diffusion for human-centric joint video-depth generation. In European Conference on Computer Vision, pages 134–152. Springer, 2024.
Zhang et al. [2024a] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, pages 1–15, 2024a.
Zhang et al. [2023] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023.
Zhang et al. [2024b] Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680, 2024b.
Zhao et al. [2025] Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, and Chunhua Shen. Diception: A generalist diffusion model for visual perceptual tasks. arXiv preprint arXiv:2502.17157, 2025.
Zhao et al. [2024] Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, and Ying Shan. Cv-vae: A compatible video vae for latent generative video models. arXiv preprint arXiv:2405.20279, 2024.
Zhao et al. [2023] Yuyang Zhao, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Make-a-protagonist: Generic video editing with an ensemble of experts. arXiv preprint arXiv:2305.08850, 2023.
Zheng et al. [2024] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
Zhou et al. [2022] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
Zhu et al. [2024a] Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, and Jiang Bian. Compositional 3d-aware video generation with llm director. arXiv preprint arXiv:2409.00558, 2024a.
Zhu et al. [2024b] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In European Conference on Computer Vision, pages 145–162. Springer, 2024b.

OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding