4DNeX: Feed-Forward 4D Generative Modeling Made Easy

📅 2025-08-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work introduces the first feed-forward framework for generating 4D (dynamic 3D) scenes from a single image, addressing two limitations of prior approaches: reliance on multi-frame video inputs and computationally expensive optimization. The method has three parts: (1) 4DNeX-10M, a large-scale 4D dataset whose annotations are produced with reconstruction approaches, built to offset the scarcity of 4D data; (2) a unified 6D video representation that jointly models RGB and XYZ sequences, so that appearance and geometry are learned together; and (3) a set of simple adaptation strategies for fine-tuning a pretrained video diffusion model end to end for image-to-4D generation. The model outputs dynamic point clouds that support novel-view video synthesis. Experiments show that it outperforms existing 4D generation methods in efficiency and generalizability, positioning it as a scalable basis for generative 4D world models.

📝 Abstract

We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, (1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches; (2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry; and (3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
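To make the 6D video representation and its relation to dynamic point clouds concrete, here is a minimal sketch. It assumes the RGB frames and per-pixel XYZ maps share the same resolution and are simply concatenated along the channel axis; the abstract does not say how the two modalities are actually fused, so this layout is an assumption. The function names `pack_6d_video` and `to_dynamic_point_cloud` are hypothetical and do not come from the paper or its code.

```python
import numpy as np


def pack_6d_video(rgb, xyz):
    """Stack RGB frames and XYZ coordinate maps into one 6-channel video.

    rgb: (T, H, W, 3) array of colors.
    xyz: (T, H, W, 3) array of per-pixel 3D coordinates.
    Channel-wise concatenation is an illustrative assumption, not
    necessarily the fusion used by 4DNeX.
    """
    rgb = np.asarray(rgb, dtype=np.float32)
    xyz = np.asarray(xyz, dtype=np.float32)
    if rgb.ndim != 4 or rgb.shape[-1] != 3 or rgb.shape != xyz.shape:
        raise ValueError("rgb and xyz must both have shape (T, H, W, 3)")
    return np.concatenate([rgb, xyz], axis=-1)  # (T, H, W, 6)


def to_dynamic_point_cloud(video6d, valid_mask=None):
    """Turn a (T, H, W, 6) video into one colored point set per frame.

    valid_mask: optional (T, H, W) boolean array; False pixels are dropped.
    Returns a list of (points, colors) tuples, each array of shape (N, 3).
    """
    frames = []
    for t in range(video6d.shape[0]):
        flat = video6d[t].reshape(-1, 6)  # one row per pixel
        if valid_mask is not None:
            flat = flat[np.asarray(valid_mask[t], dtype=bool).reshape(-1)]
        colors, points = flat[:, :3], flat[:, 3:]
        frames.append((points, colors))
    return frames
```

Each frame's point set can then be rendered from a new camera pose, which is how a dynamic point cloud supports novel-view video synthesis.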

Problem

Research questions and friction points this paper is trying to address.

Generating 4D scene representations from a single image

Overcoming the scarcity of 4D data with a large-scale dataset

Adapting pretrained video diffusion models for efficient 4D modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes a pretrained video diffusion model for image-to-4D generation

Introduces a unified 6D video representation that jointly models RGB and XYZ sequences

Constructs 4DNeX-10M, a large-scale dataset with 4D annotations

🔎 Similar Papers

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency