AI Summary
Existing dynamic 3D video (4D) generation methods rely heavily on Score Distillation Sampling (SDS), leading to limited content diversity, spatiotemporal inconsistency, and poor alignment with text/video prompts. This work proposes the first SDS-free autoregressive 4D generation framework tailored for monocular video input, enabling high-fidelity dynamic 3D content synthesis. Our core contributions are: (1) a progressive multi-view sampling strategy coupled with global deformation refinement, jointly optimizing geometric accuracy, motion coherence, and appearance stability; and (2) synergistic integration of pre-trained 3D expert models, inter-frame 3D representation autoregression, and global deformation-field-driven appearance refinement. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches across all key metrics: content diversity, spatiotemporal consistency, and alignment fidelity with both textual and video prompts.
Abstract
Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency, and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. First, for a monocular video that is either generated or captured, we utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Second, motivated by the fact that videos unfold naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on the previous frame's representation, as this autoregressive scheme facilitates more accurate geometry and motion estimation. To prevent overfitting during this process, we introduce a progressive view sampling strategy that exploits priors from pre-trained large-scale 3D reconstruction models. Third, to counteract the appearance drift introduced by autoregressive generation, we incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments demonstrate that AR4D achieves state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
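The three-stage pipeline above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the point-cloud representation, the random "expert model" initializer, the noise-based motion step, and the mean-offset "global deformation" are placeholders for the pre-trained models, per-frame optimization, and deformation field the paper actually uses. The sketch only shows the control flow: canonical first frame, frame-by-frame autoregression with progressively widening views, then a global drift-correcting refinement.

```python
import numpy as np

def init_canonical(frame0, n_points=1024, seed=0):
    # Stage 1 (placeholder): a real system would run pre-trained expert
    # models on the first frame and fine-tune the result as canonical space.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_points, 3))

def progressive_views(step, total_steps, max_azimuth=180.0):
    # Progressive view sampling: gradually widen the sampled azimuth range
    # so early optimization is not overfit to extreme novel views.
    frac = (step + 1) / total_steps
    angle = frac * max_azimuth
    return np.linspace(-angle, angle, num=5)

def autoregressive_step(prev_points, motion_scale=0.01, seed=0):
    # Stage 2 (placeholder): initialize frame t from frame t-1's
    # representation; small noise stands in for estimated motion.
    rng = np.random.default_rng(seed)
    return prev_points + motion_scale * rng.standard_normal(prev_points.shape)

def refine_with_global_deformation(all_points, canonical):
    # Stage 3 (placeholder): pull every frame back toward the canonical
    # space via a global offset, countering accumulated drift.
    refined = []
    for pts in all_points:
        offset = canonical.mean(axis=0) - pts.mean(axis=0)
        refined.append(pts + offset)
    return refined

def ar4d_pipeline(video_frames, steps_per_frame=3):
    canonical = init_canonical(video_frames[0])
    per_frame = [canonical]
    for t in range(1, len(video_frames)):
        pts = autoregressive_step(per_frame[-1], seed=t)
        for s in range(steps_per_frame):
            views = progressive_views(s, steps_per_frame)
            # A real system would render `pts` from `views` and
            # optimize against reconstruction-model priors here.
        per_frame.append(pts)
    return refine_with_global_deformation(per_frame, canonical)
```

Note how the refinement stage operates on all frames jointly, after autoregression: per-frame errors accumulate during stage 2, so drift can only be corrected with a global view of the sequence.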