3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing subject-driven video generation methods predominantly rely on 2D representations and lack explicit 3D geometric priors, making it difficult to preserve identity and structural consistency under novel viewpoints. This work proposes 3DreamBooth, a framework for 3D-aware subject-customized video generation that requires no multi-view training videos. It decouples spatial geometry from temporal motion through single-frame optimization, injecting strong 3D priors by updating only the spatial representation. A companion module, 3Dapter, acts as a dynamic selective router that extracts geometric cues from a few reference views and enables multi-view joint optimization via asymmetric conditioning. Experiments demonstrate that the generated videos exhibit high fidelity and geometric consistency under novel views, significantly outperforming existing 2D-based approaches while effectively mitigating temporal overfitting.
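As described, single-frame optimization amounts to fine-tuning only the spatial parameters of a video model while keeping temporal parameters frozen. A minimal sketch of that parameter split, assuming hypothetical layer names such as `spatial_attn` and `temporal_attn` (the paper's actual module naming is not given here):

```python
# Illustrative sketch, not the paper's implementation: partition a video
# model's parameters so only spatial layers are updated during subject
# fine-tuning, leaving temporal layers frozen. Layer names are assumptions.

def split_trainable_params(param_names):
    """Return (trainable, frozen) name lists: spatial layers train, rest freeze."""
    trainable = [n for n in param_names if "spatial" in n]
    frozen = [n for n in param_names if "spatial" not in n]
    return trainable, frozen

params = [
    "blocks.0.spatial_attn.qkv",
    "blocks.0.temporal_attn.qkv",
    "blocks.0.mlp.fc1",
    "blocks.1.spatial_attn.qkv",
]
trainable, frozen = split_trainable_params(params)
```

In a real training loop this split would drive which parameters receive gradient updates (e.g. setting `requires_grad = False` on the frozen set), which is one way to "bake in" a 3D prior from a single frame without touching motion dynamics.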

📝 Abstract
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
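The "dynamic selective router" behavior the abstract attributes to 3Dapter can be illustrated as softmax attention over a small set of reference-view features: each query pulls a view-weighted hint from the references. The shapes, names, and scoring rule below are illustrative assumptions, not the paper's actual 3Dapter implementation:

```python
import numpy as np

# Hypothetical sketch: attention-based routing over a minimal reference set.
# A query feature attends to per-view features and receives a weighted
# combination as its "view-specific geometric hint".

def route_reference_views(query, ref_feats, temperature=1.0):
    """query: (d,); ref_feats: (n_views, d).
    Returns (hint, weights): a weighted hint vector and per-view weights."""
    d = query.shape[0]
    scores = ref_feats @ query / (np.sqrt(d) * temperature)  # similarity per view
    weights = np.exp(scores - scores.max())                  # numerically stable softmax
    weights = weights / weights.sum()
    hint = weights @ ref_feats                               # weighted combination of views
    return hint, weights

rng = np.random.default_rng(0)
refs = rng.normal(size=(4, 8))   # 4 reference views, 8-dim features (toy sizes)
q = rng.normal(size=8)           # query from the generation branch
hint, w = route_reference_views(q, refs)
```

The `temperature` knob controls how selective the routing is: lower values push the weights toward a single reference view, higher values blend views more evenly.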
Problem

Research questions and friction points this paper is trying to address.

3D-aware video generation
subject-driven customization
view-consistent synthesis
spatial priors
multi-view video
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware generation
subject-driven video
geometry-texture disentanglement
single-view pre-training
multi-view optimization