Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

📅 2025-01-07
🤖 AI Summary
This work addresses key challenges in 3D-aware video generation: multi-task controllability, temporal inconsistency, and the lack of native 3D awareness. We propose the first multimodal controllable video diffusion framework conditioned on 3D tracking videos. Methodologically, we inject sparse 3D trajectories as a universal control signal into a latent video diffusion (LVD) model, enabling unified modeling of camera manipulation, motion transfer, mesh-driven animation, and object editing; additionally, we introduce a temporally consistent implicit feature propagation mechanism to ensure inter-frame geometric and motion coherence. Our contributions are: (i) the first use of 3D tracking videos as a generic control input, endowing diffusion models with intrinsic 3D awareness; and (ii) breaking the single-task paradigm by achieving a shared architecture and strong temporal consistency across diverse control tasks. Trained on fewer than 10k videos for three days on eight H800 GPUs, our method achieves high-quality, high-fidelity generation across all four tasks.

📝 Abstract
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
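The abstract's core idea can be made concrete with a minimal NumPy sketch: tracked 3D points, each carrying a fixed color across all frames, are rasterized into a "3D tracking video", which is then fed to the diffusion backbone alongside the noisy video latent. The function names, resolution, and the channel-concatenation injection below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def render_tracking_video(points_3d, colors, K, H=64, W=64):
    """Rasterize persistently colored 3D track points into a tracking video.

    Because each point keeps one color across every frame, the same scene
    element lands on corresponding pixels in all frames, which is what
    links frames together for temporal consistency.

    points_3d: (T, N, 3) per-frame 3D point positions in camera coordinates
    colors:    (N, 3)    one fixed RGB color per tracked point
    K:         (3, 3)    pinhole camera intrinsics
    """
    T, N, _ = points_3d.shape
    video = np.zeros((T, H, W, 3), dtype=np.float32)
    for t in range(T):
        uvw = points_3d[t] @ K.T                 # perspective projection, (N, 3)
        z = uvw[:, 2]
        valid = z > 1e-6                         # keep points in front of camera
        uv = (uvw[valid, :2] / z[valid, None]).astype(int)
        inb = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        video[t, uv[inb, 1], uv[inb, 0]] = colors[valid][inb]
    return video

def condition_on_tracking(video_latent, tracking_latent):
    """Concatenate the encoded tracking video with the noisy video latent
    along the channel axis -- one common way to inject a dense control
    signal into a diffusion backbone (hypothetical injection scheme)."""
    return np.concatenate([video_latent, tracking_latent], axis=1)
```

Under this view, every control task reduces to editing the tracking video before rendering: moving the camera re-projects the same points, swapping in mesh vertices gives mesh-driven animation, and displacing a subset of points edits an object.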
Problem

Research questions and friction points this paper is trying to address.

- 3D video generation
- Control precision
- Unified architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

- 3D-aware video diffusion
- Unified architecture
- Versatile video control
Authors

Zekai Gu, National University of Singapore (Generative AI · Autonomous Driving · Robotics)
Rui Yan, Zhejiang University, China
Jiahao Lu, Hong Kong University of Science and Technology, China
Peng Li, Hong Kong University of Science and Technology, China
Zhiyang Dou, The University of Hong Kong, China
Chenyang Si, Nanyang Technological University, Singapore
Zhen Dong, Wuhan University, China
Qifeng Liu, Hong Kong University of Science and Technology, China
Cheng Lin, The University of Hong Kong, China
Ziwei Liu, Associate Professor, Nanyang Technological University (Computer Vision · Machine Learning · Computer Graphics)
Wenping Wang, Texas A&M University (Computer Graphics · Geometric Computing)
Yuan Liu, Hong Kong University of Science and Technology, China