SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

📅 2026-04-09
🤖 AI Summary
Existing datasets struggle to support both large-scale 3D geometric perception and controllable video generation because no unified video resource combines rich semantics with spatiotemporal consistency. To address this gap, this work introduces SceneScribe-1M, a multimodal dataset of one million in-the-wild videos, each annotated with camera parameters, dense depth maps, temporally consistent 3D point tracks, and textual descriptions, so that video, geometry, and semantics are aligned within a single resource. The dataset is used to establish benchmarks for monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video generation with or without camera control, with the aim of supporting models that both perceive the dynamic 3D world and generate controllable video.
📝 Abstract
The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this gap, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
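To make the four annotation types in the abstract concrete, here is a minimal sketch of how one annotated video could be represented and how camera parameters, depth, and 3D points relate under a standard pinhole model. Everything in it is an assumption for illustration: the class names (`Intrinsics`, `Frame`, `Sample`), the field names, the camera-space convention for tracks, and the omission of camera poses are hypothetical and are not taken from the dataset's actual release format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole intrinsics (hypothetical field names): focal lengths and principal point, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


@dataclass
class Frame:
    """Per-frame geometry: camera intrinsics plus a dense depth map indexed as depth[v][u]."""
    intrinsics: Intrinsics
    depth: List[List[float]]


@dataclass
class Sample:
    """One annotated video (illustrative layout, not the dataset's real schema)."""
    caption: str
    frames: List[Frame]
    # tracks[i][t] is the 3D position of point i at frame t; camera coordinates are assumed here.
    tracks: List[List[Point3D]]


def unproject(k: Intrinsics, u: float, v: float, z: float) -> Point3D:
    """Back-project pixel (u, v) with depth z to a 3D point in camera coordinates."""
    return ((u - k.cx) * z / k.fx, (v - k.cy) * z / k.fy, z)


def project(k: Intrinsics, p: Point3D) -> Tuple[float, float]:
    """Project a camera-space 3D point to pixel coordinates (requires z > 0)."""
    x, y, z = p
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


# Tiny usage example with made-up values.
k = Intrinsics(fx=100.0, fy=100.0, cx=2.0, cy=1.0)
frame = Frame(intrinsics=k, depth=[[2.0] * 4 for _ in range(3)])
point = unproject(k, 3, 1, frame.depth[1][3])
sample = Sample(caption="a placeholder caption", frames=[frame], tracks=[[point]])
pixel = project(k, sample.tracks[0][0])
```

Projecting a tracked 3D point back into its frame and comparing the result against the pixel it came from is one simple way to check that depth, intrinsics, and tracks are mutually consistent.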
Problem

Research questions and friction points this paper is trying to address.

large-scale video dataset
3D geometric perception
video synthesis
semantic annotations
spatio-temporal information
Innovation

Methods, ideas, or system contributions that make the work stand out.

large-scale video dataset
geometric annotations
semantic annotations
3D point tracks
camera-aware video synthesis