🤖 AI Summary
Current 4D world modeling is hindered by the scarcity of high-quality, highly dynamic, multi-domain data. Existing benchmarks suffer from limited spatiotemporal complexity, insufficient modality diversity, and inadequate support for key tasks, including 4D geometric reconstruction, future prediction, and camera-controlled video generation. To address this, we introduce OmniWorld: a large-scale, multi-domain, multimodal dataset for 4D world modeling. It combines a newly collected, interaction-rich, photorealistic sub-dataset, OmniWorld-Game, featuring fine-grained spatiotemporal annotations, with several curated public datasets spanning diverse domains. On this foundation, we establish a challenging benchmark that exposes the limitations of current state-of-the-art approaches in complex 4D environments. Fine-tuning these approaches on OmniWorld yields substantial improvements in both 4D reconstruction and video generation, validating the critical role of data-driven paradigms in advancing general-purpose 4D understanding.
📝 Abstract
The field of 4D world modeling, which aims to jointly capture spatial geometry and temporal dynamics, has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatiotemporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-controlled video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multimodal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.