🤖 AI Summary
This work introduces 4DNeX, the first feed-forward framework for single-image-to-4D (dynamic 3D) scene generation, addressing two limitations of prior approaches: reliance on multi-frame video inputs and computationally expensive optimization. The method has three parts: (1) 4DNeX-10M, a large-scale 4D dataset whose annotations are generated with advanced reconstruction approaches; (2) a unified 6D video representation that jointly models RGB and XYZ sequences, so that appearance and geometry are learned together; and (3) a set of simple adaptation strategies for fine-tuning a pretrained video diffusion model end to end, so that it generates dynamic point clouds directly from a single image. The resulting point clouds enable novel-view video synthesis, and experiments show that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability. The approach offers a scalable basis for generative 4D world models.
📝 Abstract
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically: 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches; 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry; and 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
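To make the 6D video representation concrete, the sketch below shows one way an RGB sequence and an XYZ sequence could be stored together and then read back as a dynamic point cloud. It is an illustration only: the abstract does not say how 4DNeX fuses the two modalities, so the channel-wise concatenation, the `(T, H, W, 3)` array layout, and the function names `make_6d_video`, `to_dynamic_point_cloud`, and `project_points` are all assumptions, not the authors' implementation.

```python
import numpy as np


def make_6d_video(rgb, xyz):
    """Stack RGB (T, H, W, 3) and XYZ (T, H, W, 3) into one (T, H, W, 6) array.

    Assumption: channel-wise concatenation, chosen here for illustration.
    """
    rgb = np.asarray(rgb, dtype=np.float32)
    xyz = np.asarray(xyz, dtype=np.float32)
    if rgb.ndim != 4 or rgb.shape[-1] != 3 or rgb.shape != xyz.shape:
        raise ValueError("rgb and xyz must both have shape (T, H, W, 3)")
    return np.concatenate([rgb, xyz], axis=-1)


def to_dynamic_point_cloud(video_6d):
    """Split a (T, H, W, 6) array into per-frame (points, colors) pairs.

    Each pixel contributes one 3D point, so both arrays have shape (H*W, 3).
    """
    frames = []
    for frame in video_6d:
        colors = frame[..., :3].reshape(-1, 3)  # first three channels: RGB
        points = frame[..., 3:].reshape(-1, 3)  # last three channels: XYZ
        frames.append((points, colors))
    return frames


def project_points(points, focal, cx, cy):
    """Pinhole projection of camera-space points (N, 3) to pixels (N, 2).

    Hypothetical helper: assumes every point has z > 0 in the target camera.
    """
    z = points[:, 2]
    u = focal * points[:, 0] / z + cx
    v = focal * points[:, 1] / z + cy
    return np.stack([u, v], axis=-1)


# Toy input: 2 frames of 4x5 pixels, black colors, every point at (1, 1, 1).
T, H, W = 2, 4, 5
rgb = np.zeros((T, H, W, 3), dtype=np.float32)
xyz = np.ones((T, H, W, 3), dtype=np.float32)

video = make_6d_video(rgb, xyz)
cloud = to_dynamic_point_cloud(video)
print(video.shape, len(cloud), cloud[0][0].shape)  # (2, 4, 5, 6) 2 (20, 3)
```

Because each frame carries its own XYZ map, the output is one colored point set per time step. Rendering a novel view then amounts to transforming those points into the target camera and projecting them, which `project_points` sketches for the simplest pinhole case.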