DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular video datasets suffer from simulator-based origins, ambiguous up-to-scale annotations from structure-from-motion (SfM), and insufficient semantic descriptions, hindering the ability of embodied agents to perceive physical scale in dynamic real-world environments. To address this, we propose the first 4D multimodal world modeling paradigm integrating metric geometry, realistic motion modeling, and global optimization, establishing an end-to-end mapping from large-scale internet videos to 4D semantic scenes that encompass 3D geometry, instance masks, camera poses, depth, motion trajectories, and textual/image descriptions. Our method synergistically combines large vision models, geometric priors, and multimodal understanding, employing windowed bundle adjustment coupled with global optimization for high-accuracy reconstruction and joint perception on long videos. We release a benchmark dataset comprising over 100K videos, 800K+ instance masks, and 10M+ frames. Experiments demonstrate significant improvements over state-of-the-art methods in video depth, pose, and intrinsic parameter estimation, achieving, for the first time, physically consistent and globally accurate 4D semantic reconstruction.
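
Concretely, the 4D multimodal format bundles several aligned modalities per video. The dataclass below is a minimal sketch of what one such record could look like; the field names and array shapes are illustrative assumptions, not the released schema.

```python
# A minimal sketch of a per-video 4D record; names/shapes are assumptions.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FourDRecord:
    """Illustrative container for one video's 4D multimodal annotations."""
    video_id: str
    intrinsics: np.ndarray        # (3, 3) camera matrix K, recovered per video
    poses_w2c: np.ndarray         # (T, 4, 4) per-frame world-to-camera extrinsics
    depth: np.ndarray             # (T, H, W) metric-scale depth in meters
    instance_masks: np.ndarray    # (T, H, W) integer instance IDs, 0 = background
    trajectories: np.ndarray      # (N, T, 3) world-space motion tracks for N points
    captions: list[str] = field(default_factory=list)  # holistic text descriptions
```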

📝 Abstract
Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
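
For the video depth benchmark mentioned above, a standard headline metric is absolute relative error (AbsRel). The sketch below is the common formulation, not code from this paper; whether scores use per-sequence median scale alignment or raw metric-scale predictions is an assumption to verify, though a metric-scale method would typically be scored without alignment.

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, median_align: bool = False) -> float:
    """Absolute relative depth error over pixels with valid ground truth.

    median_align applies the usual per-sequence scale alignment; a
    metric-scale method would typically be scored with it disabled.
    """
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    if median_align:
        pred = pred * (np.median(gt) / np.median(pred))
    return float(np.mean(np.abs(pred - gt) / gt))
```
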
Problem

Research questions and friction points this paper is trying to address.

Bridges gaps in dynamic 4D world modeling from monocular videos
Enables accurate physical-scale interpretation of real-world video dynamics
Provides a comprehensive multimodal dataset for foundation model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal 4D world modeling framework
Integrates window-based Bundle Adjustment with global optimization (see the sketch after this list)
Large-scale dataset from internet videos with annotations
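
The window-based Bundle Adjustment plus global optimization pattern referenced above splits a long video into overlapping windows, solves each locally, and then reconciles the windows jointly. The toy below is a minimal, self-contained Python sketch of that fusion structure, not the paper's pipeline: per-window bundle adjustment is mocked as a noisy local estimate of a per-frame quantity, and split_windows/fuse_windows are hypothetical helpers.

```python
import numpy as np

def split_windows(n_frames: int, window: int = 32, overlap: int = 8):
    """Yield (start, end) index pairs for overlapping windows over a video."""
    step = window - overlap
    for start in range(0, n_frames, step):
        end = min(start + window, n_frames)
        yield start, end
        if end == n_frames:
            break

def fuse_windows(window_estimates, n_frames: int) -> np.ndarray:
    """Toy 'global optimization': average overlapping per-frame estimates so
    neighboring windows agree; a stand-in for jointly refining poses, depth,
    and trajectories across all windows."""
    acc = np.zeros(n_frames)
    cnt = np.zeros(n_frames)
    for (start, end), est in window_estimates:
        acc[start:end] += est
        cnt[start:end] += 1
    return acc / np.maximum(cnt, 1)

# Usage: per-window "bundle adjustment" is mocked as a noisy local estimate
# of a per-frame quantity (here, a synthetic depth-scale curve).
rng = np.random.default_rng(0)
truth = np.linspace(1.0, 2.0, 100)
local = [((s, e), truth[s:e] + rng.normal(0.0, 0.05, e - s))
         for s, e in split_windows(100)]
print(float(np.mean(np.abs(fuse_windows(local, 100) - truth))))
```

Averaging over the overlaps is only a stand-in; an actual global pass would jointly optimize camera poses, depth, and trajectories, using the overlaps as consistency constraints.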