🤖 AI Summary
Existing camera-controllable video generation methods are constrained by static-scene datasets (e.g., RealEstate10K) and relative-scale camera poses, which limits their ability to model realistic object motion and precise camera trajectories and leaves them without metric-scale geometric consistency. To address this, we introduce the first open-source, high-resolution dynamic-scene video dataset, comprising over 1,200 4K (3840×2160) video sequences, each annotated with centimeter-accurate, pixel-aligned, metric-scale camera trajectories. Our hybrid annotation pipeline integrates multi-view geometric calibration, SLAM-based optimization, and manual refinement, enabling pixel-aligned metric-scale camera motion annotation in dynamic scenes for the first time. In evaluations with state-of-the-art video generation models, the dataset significantly improves geometric fidelity, the physical plausibility of object motion, and the accuracy of synthesized camera trajectories.
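To make "pixel-aligned, metric-scale camera trajectories" concrete, here is a minimal loading sketch. The file layout and key names (`intrinsics`, `frames`, `w2c`) are hypothetical illustrations for exposition, not the dataset's actual schema:

```python
import json
import numpy as np

def load_camera_trajectory(annotation_path):
    """Load per-frame metric-scale camera poses from a JSON annotation.

    NOTE: the keys "intrinsics", "frames", and "w2c" are assumptions
    for illustration, not the dataset's published schema.
    """
    with open(annotation_path) as f:
        ann = json.load(f)

    # 3x3 pinhole intrinsics in pixels, shared across the sequence.
    K = np.asarray(ann["intrinsics"], dtype=np.float64).reshape(3, 3)

    # Per-frame 4x4 world-to-camera extrinsics. Because the trajectory
    # is metric-scale, translations are in meters, so camera speed and
    # scene dimensions can be read off the poses directly.
    w2c = np.stack([np.asarray(fr["w2c"], dtype=np.float64).reshape(4, 4)
                    for fr in ann["frames"]])
    return K, w2c
```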
📝 Abstract
Recent advances in camera-controllable video generation have been constrained by reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack the metric-scale geometric consistency that is critical for synthesizing realistic object motion and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations, available at https://github.com/ZGCTroy/RealCam-Vid.
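The practical gap between relative-scale and metric-scale annotations is a single unknown global factor: structure-from-motion pipelines of the kind behind RealEstate10K recover camera trajectories only up to scale. The sketch below, a hypothetical illustration rather than code from this repository, shows the closed-form least-squares estimate of that factor when a metric reference trajectory is available, assuming the two trajectories are already rotationally aligned and frame-synchronized:

```python
import numpy as np

def align_scale(t_rel, t_metric):
    """Estimate the global scale s minimizing ||s * t_rel - t_metric||^2.

    t_rel:    (N, 3) camera positions from an up-to-scale reconstruction.
    t_metric: (N, 3) corresponding metric-scale positions in meters.
    Both trajectories are centered first so translation offsets cancel;
    rotational alignment is assumed to have been handled already.
    """
    a = t_rel - t_rel.mean(axis=0)
    b = t_metric - t_metric.mean(axis=0)
    # Closed-form least-squares solution: s = <a, b> / <a, a>.
    return float((a * b).sum() / (a * a).sum())
```

With metric-scale annotations, this calibration step becomes unnecessary: generated camera motion can be conditioned on and evaluated against trajectories whose units are physically meaningful.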