VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

📅 2024-07-17
🏛️ arXiv.org
📈 Citations: 28
Influential: 2
🤖 AI Summary
Existing video diffusion models struggle with fine-grained 3D camera motion control, limiting their applicability in content creation and 3D vision tasks. To address this, we introduce the first framework that integrates controllable camera pose generation into a spatiotemporal video transformer architecture for diffusion-based video synthesis. The method conditions the model on a novel signal, Plücker-coordinate-based spatiotemporal camera embeddings, injected through ControlNet-style conditional guidance, and fine-tunes a video diffusion transformer end-to-end on RealEstate10K. Experiments demonstrate substantial improvements in camera pose control accuracy and state-of-the-art performance on controllable video generation. To our knowledge, this is the first work enabling precise 3D camera control for large transformer-based video diffusion models.
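
The conditioning signal can be made concrete. Below is a minimal sketch, not the authors' released code, of how per-pixel Plücker embeddings might be computed from camera parameters: each pixel's viewing ray is represented by its direction d and moment o × d, yielding a (T, 6, H, W) spatiotemporal conditioning tensor. The function name and tensor conventions are illustrative assumptions.

```python
# Minimal sketch (assumed conventions, not the paper's code) of
# Plücker-coordinate camera embeddings: every pixel of every frame is
# mapped to the 6D Plücker representation (d, o x d) of its viewing ray.
import torch

def plucker_embedding(K, c2w, H, W):
    """K: (T, 3, 3) intrinsics; c2w: (T, 4, 4) camera-to-world poses."""
    T = K.shape[0]
    # Pixel grid in homogeneous image coordinates (pixel centers).
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)

    # Back-project pixels to camera-space ray directions, then rotate
    # into world space: d = R @ K^{-1} @ [u, v, 1]^T.
    dirs_cam = torch.einsum("tij,hwj->thwi", torch.inverse(K), pix)
    dirs_world = torch.einsum("tij,thwj->thwi", c2w[:, :3, :3], dirs_cam)
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # Ray origins are the camera centers; the moment is o x d.
    origins = c2w[:, :3, 3].view(T, 1, 1, 3).expand_as(dirs_world)
    moments = torch.cross(origins, dirs_world, dim=-1)

    # Concatenate direction and moment -> 6-channel map per frame.
    return torch.cat([dirs_world, moments], dim=-1).permute(0, 3, 1, 2)
```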

📝 Abstract
Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods have demonstrated the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.
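
The abstract's ControlNet-like mechanism suggests a zero-initialized conditioning branch. The sketch below is an illustration under assumptions, not the paper's implementation: it patchifies the Plücker embedding, projects it with zero-initialized weights, and adds it to the transformer's video tokens, so training starts from the unmodified pretrained model. The `CameraConditioner` class and all dimensions are hypothetical.

```python
# Sketch of ControlNet-style conditioning for a spatiotemporal
# transformer (hypothetical names and shapes, not the paper's code).
import torch
import torch.nn as nn

class CameraConditioner(nn.Module):
    def __init__(self, cam_channels=6, patch=2, dim=1024):
        super().__init__()
        # Patchify the (6, H, W) camera embedding like the video latents.
        self.proj = nn.Conv2d(cam_channels, dim, kernel_size=patch, stride=patch)
        # Zero-init so the conditioning is a no-op at the start of
        # fine-tuning, preserving the pretrained model's behavior.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, tokens, cam_emb):
        """tokens: (B, T*N, dim) video tokens; cam_emb: (B*T, 6, H, W)."""
        cond = self.proj(cam_emb)                   # (B*T, dim, H/p, W/p)
        cond = cond.flatten(2).transpose(1, 2)      # (B*T, N, dim)
        cond = cond.reshape(tokens.shape[0], -1, cond.shape[-1])
        return tokens + cond                        # residual injection
```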
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained camera control in text-to-video models
No camera control for transformer-based video diffusion models
Need for 3D camera control in joint spatiotemporal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ControlNet-like conditioning for camera control
Spacetime camera embeddings with Plücker coordinates
Fine-tuning transformer-based video diffusion models end-to-end (a training-step sketch follows this list)
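
As referenced in the last item, here is a minimal sketch of a standard epsilon-prediction denoising fine-tuning step with camera conditioning. The `model` signature, the cosine noise schedule, and all names are assumptions for illustration, not the paper's training code.

```python
# Hypothetical denoising fine-tuning step: the model learns to predict
# the added noise given text and camera-embedding conditioning.
import torch
import torch.nn.functional as F

def finetune_step(model, optimizer, latents, text_emb, cam_emb):
    """latents: (B, T, C, H, W) clean video latents."""
    B = latents.shape[0]
    t = torch.randint(0, 1000, (B,), device=latents.device)  # timesteps
    noise = torch.randn_like(latents)

    # Assumed cosine schedule for the DDPM-style forward process.
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2
    a = alpha_bar.view(B, 1, 1, 1, 1)
    x_t = a.sqrt() * latents + (1 - a).sqrt() * noise

    pred = model(x_t, t, text_emb, cam_emb)  # assumed model signature
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```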