CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

📅 2025-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing frameworks struggle to unify video understanding and controllable video generation, and relying solely on geometric cues (e.g., depth, edges) leads to distortions and temporal drift in physically grounded editing such as relighting or material replacement. To address this, we propose CtrlVDiff: the first unified multimodal diffusion framework integrating geometric, semantic, and graphics-intrinsic modalities, including surface normals, segmentation masks, albedo, and roughness. Its core innovation is the Hybrid Modality Control Strategy (HMCS), which enables robust feature routing and fusion under arbitrary subsets of input modalities. We further introduce MMVideo, a large-scale multimodal video dataset aligned across real and synthetic sources. Experiments demonstrate that CtrlVDiff significantly outperforms state-of-the-art methods across diverse video understanding and generation benchmarks, and that it achieves high fidelity and long-term temporal consistency in hierarchical controllable tasks, including illumination editing, material replacement, and object insertion.

📝 Abstract
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We therefore propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
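Since HMCS is the paper's core mechanism, a compact sketch may help make "accept any subset of modalities and remain robust to missing inputs" concrete. The paper's implementation is not shown on this page, so the module below is an illustrative assumption: per-modality encoders, learned placeholder embeddings for absent inputs, and fusion by summation. All names and shapes are hypothetical, and CtrlVDiff's actual routing is more involved.

```python
import torch
import torch.nn as nn

MODALITIES = ["depth", "normal", "segmentation", "edge",
              "albedo", "roughness", "metallic"]

class HybridModalityControlSketch(nn.Module):
    """Illustrative stand-in for an HMCS-style control block (not the paper's code).

    Each modality gets its own lightweight encoder; a missing modality is
    replaced by a learned placeholder embedding so the fused control signal
    keeps a fixed shape no matter which subset of inputs is provided.
    """

    def __init__(self, in_channels: int = 3, dim: int = 320):
        super().__init__()
        self.encoders = nn.ModuleDict({
            m: nn.Conv3d(in_channels, dim, kernel_size=3, padding=1)
            for m in MODALITIES
        })
        # One learned "modality absent" token per modality, broadcast over
        # batch, time, and space at forward time.
        self.null_tokens = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(1, dim, 1, 1, 1)) for m in MODALITIES
        })

    def forward(self, inputs: dict[str, torch.Tensor],
                shape: tuple[int, int, int, int]) -> torch.Tensor:
        """inputs maps modality name -> (B, C, T, H, W); any subset may appear.
        shape = (B, T, H, W) of the target latent, used for placeholders."""
        b, t, h, w = shape
        feats = []
        for m in MODALITIES:
            if m in inputs:
                feats.append(self.encoders[m](inputs[m]))
            else:
                feats.append(self.null_tokens[m].expand(b, -1, t, h, w))
        # Sum-fusion keeps the sketch short; the fused map would then be
        # injected into the video diffusion backbone as a control signal.
        return torch.stack(feats, dim=0).sum(dim=0)
```

The learned placeholder tokens (rather than zero-filled inputs) are one common way to let a model distinguish "modality absent" from "modality present but dark/empty", which matters when any subset of controls may be supplied at inference time.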
Problem

Research questions and friction points this paper is trying to address.

Addresses controllable video generation challenges using unified diffusion framework
Overcomes limitations of geometry-only cues for appearance and material edits
Solves multimodal integration difficulties while maintaining temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion model with multimodal control strategy
Hybrid dataset enabling aligned multimodal supervision
Robust generation from any subset of input modalities (see the training-time sketch below)
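The robustness claimed in the last bullet is typically trained in by randomly dropping control modalities per sample. Below is a minimal sketch of such a sampler; the dropout probabilities and the guarantee of at least one surviving modality are assumptions for illustration, not details reported by the paper.

```python
import random

def sample_modality_subset(available: list[str],
                           keep_prob: float = 0.5,
                           min_keep: int = 1) -> list[str]:
    """Keep each available control modality with probability `keep_prob`,
    guaranteeing at least `min_keep` survive so every training step still
    sees some conditioning. All values here are illustrative."""
    kept = [m for m in available if random.random() < keep_prob]
    if len(kept) < min_keep:
        kept = random.sample(available, k=min_keep)
    return kept

# Example: a given training step might condition only on depth and albedo,
# teaching the model to re-render plausibly from partial control signals.
subset = sample_modality_subset(["depth", "normal", "albedo", "roughness"])
```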
Authors

Dianbing Xi · State Key Laboratory of CAD&CG, Zhejiang University
Jiepeng Wang · The University of Hong Kong · 3D Vision, AIGC, Robotics
Yuanzhi Liang · UTS
Xi Qiu · Institute of Artificial Intelligence, China Telecom (TeleAI)
Jialun Liu · Baidu | JLU · long-tailed data learning, metric learning, 3D generation
Hao Pan · Tsinghua University
Yuchi Huo · State Key Laboratory of CAD&CG, Zhejiang University
Rui Wang · State Key Laboratory of CAD&CG, Zhejiang University
Haibin Huang · Principal Research Scientist at TeleAI · Computer Graphics, Computer Vision, Geometric Modeling, 3D Deep Learning
Chi Zhang · Institute of Artificial Intelligence, China Telecom (TeleAI)
Xuelong Li · Institute of Artificial Intelligence, China Telecom (TeleAI)