OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first unified video diffusion model to jointly address controllable video generation and understanding. To tackle the challenge of co-modeling these divergent tasks, we construct a multimodal joint latent space that simultaneously encodes RGB, depth, Canny edge, and semantic segmentation modalities. We propose an adaptive modality-role control mechanism that dynamically assigns each modality a generation or conditioning role, enabling seamless task switching, e.g., from text-to-video generation to video-to-depth/segmentation understanding, within one diffusion process. We further incorporate color-space-driven joint distribution modeling, dynamic conditional gating, and cross-modal consistency constraints. Our method achieves state-of-the-art performance across multiple benchmarks, supporting zero-shot video translation, text-guided multimodal video synthesis, and real-time frame-level depth/segmentation estimation, and it significantly improves controllability, generalization, and cross-task compatibility without requiring task-specific architectures or fine-tuning.
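As a rough illustration of the adaptive modality-role control described above, the sketch below shows one joint denoising step in which each modality's latent is either noised and denoised ("generate") or kept clean as guidance ("condition"). This is not the authors' code; the function names, tensor layout, and simplified noising formula are assumptions made for illustration.

```python
import torch

MODALITIES = ["rgb", "depth", "canny", "segmentation"]

def joint_denoise_step(clean_latents, roles, sigma_t, denoiser, text_emb):
    """One sketched diffusion step with per-modality roles (illustrative only).

    clean_latents: dict modality -> latent tensor [B, C, T, H, W]
    roles:         dict modality -> "generate" or "condition"
    sigma_t:       noise level at the current timestep (simplified schedule)
    denoiser:      joint network predicting noise for all modalities at once
    """
    noisy = {}
    for m in MODALITIES:
        if roles[m] == "generate":
            # modalities being generated follow the usual forward noising
            noisy[m] = clean_latents[m] + sigma_t * torch.randn_like(clean_latents[m])
        else:
            # conditioning modalities stay clean and only steer the others
            noisy[m] = clean_latents[m]

    # a single network denoises the concatenated multimodal latent jointly
    x = torch.cat([noisy[m] for m in MODALITIES], dim=1)
    eps_pred = denoiser(x, sigma_t, text_emb)
    eps_per_mod = torch.chunk(eps_pred, len(MODALITIES), dim=1)

    out = {}
    for m, eps in zip(MODALITIES, eps_per_mod):
        # only the "generate" modalities are actually updated
        out[m] = noisy[m] - sigma_t * eps if roles[m] == "generate" else clean_latents[m]
    return out
```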

📝 Abstract
In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple types of video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., RGB, depth, Canny edge, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, Canny map, and semantic segmentation across the input RGB frames while ensuring coherence with the RGB input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability of controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
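The three functionalities above differ only in which modalities are generated and which serve as conditions. A hypothetical role table (the mode names and dictionary layout are illustrative assumptions, not the paper's API) could look like:

```python
# Illustrative role assignments for the three task modes (not the paper's API).
TASK_ROLES = {
    # (1) text-conditioned generation: every modality is synthesized jointly
    "text_to_multimodal_video": {
        "rgb": "generate", "depth": "generate",
        "canny": "generate", "segmentation": "generate",
    },
    # (2) video understanding: rgb is the condition, the rest are estimated
    "video_understanding": {
        "rgb": "condition", "depth": "generate",
        "canny": "generate", "segmentation": "generate",
    },
    # (3) X-conditioned generation, e.g. depth-guided video synthesis
    "depth_to_video": {
        "rgb": "generate", "depth": "condition",
        "canny": "generate", "segmentation": "generate",
    },
}
```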
Problem

Research questions and friction points this paper is trying to address.

How to unify controllable video generation and video understanding in a single framework
How to adaptively control the role of multiple visual modalities within one diffusion process
How to support both text-conditioned and attribute-conditioned (e.g., depth- or segmentation-conditioned) video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint distribution learning of all visual modalities in color space
Adaptive control strategy that switches each modality between generation and conditioning roles
Unified diffusion framework covering diverse video generation and understanding tasks
👥 Authors
Dianbing Xi, Zhejiang University
Jiepeng Wang, The University of Hong Kong
Yuanzhi Liang, UTS
Xi Qiu, Institute of Artificial Intelligence, China Telecom (TeleAI)
Yuchi Huo, Zhejiang University
Rui Wang, Institute of Artificial Intelligence, China Telecom (TeleAI)
Chi Zhang, Institute of Artificial Intelligence, China Telecom (TeleAI)
Xuelong Li, Institute of Artificial Intelligence, China Telecom (TeleAI)