🤖 AI Summary
Existing video datasets tend to serve either 3D geometric perception or controllable video generation, but not both at scale, because few of them pair rich semantics with spatiotemporally consistent geometry. This work introduces SceneScribe-1M, a multimodal dataset of one million in-the-wild videos, each annotated with camera parameters, dense depth maps, temporally consistent 3D point tracks, and detailed textual descriptions. With video, geometry, and text aligned in a single resource, the authors establish benchmarks for monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video generation with or without camera control. The dataset is open-sourced to support models that both perceive the dynamic 3D world and generate controllable video.
📝 Abstract
The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatiotemporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To close this gap, we introduce SceneScribe-1M, a new large-scale, multimodal video dataset. It comprises one million in-the-wild videos, each annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across perception tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, and across generative tasks such as text-to-video synthesis with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.