🤖 AI Summary
Robust recovery of camera intrinsics, ego-motion trajectories, and dense near-metric depth maps from unconstrained raw videos remains challenging. This paper introduces ViPE, an end-to-end video processing engine that operates directly on uncalibrated input and supports multiple camera models: pinhole, wide-angle, and 360° panoramas. The method combines multi-view geometry, structure-from-motion (SfM), and monocular depth estimation within a single framework, and GPU acceleration lets it run at 3–5 FPS on a single GPU at standard input resolutions. On the TUM and KITTI benchmarks, it outperforms existing uncalibrated pose estimation baselines by 18% and 50%, respectively, in pose accuracy. To support training and evaluation, the authors construct and publicly release a large-scale annotated collection: around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96 million annotated frames. All code, pretrained models, and the dataset are open-sourced.
📝 Abstract
Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We evaluate ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3–5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos comprising around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos -- approximately 96M frames in total, all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
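To illustrate what the released annotations (per-frame pinhole intrinsics plus dense depth) enable downstream, here is a minimal NumPy sketch that back-projects a depth map into a camera-frame point cloud. The function name, toy intrinsics, and array layout are our own assumptions for illustration, not ViPE's actual API or data format:

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a dense depth map (H x W, meters) into an
    H x W x 3 camera-frame point cloud using pinhole intrinsics K.
    Illustrative sketch; not part of the ViPE codebase."""
    h, w = depth.shape
    # Pixel grid in homogeneous coordinates, shape 3 x (H*W)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Rays = K^-1 @ pixels; scaling each ray by its depth gives
    # the 3D point in the camera frame (z equals the depth value)
    rays = np.linalg.inv(K) @ pix.astype(np.float64)
    pts = rays * depth.reshape(1, -1)
    return pts.T.reshape(h, w, 3)

# Toy example: constant 2 m depth, hypothetical intrinsics with
# focal length 500 px and principal point at the image center
K = np.array([[500.0,   0.0, 32.0],
              [  0.0, 500.0, 24.0],
              [  0.0,   0.0,  1.0]])
depth = np.full((48, 64), 2.0)
cloud = backproject_depth(depth, K)
# The pixel at the principal point lies on the optical axis: (0, 0, 2)
```

With ViPE's annotations, the same operation applied per frame and chained with the estimated camera poses would fuse frames into a consistent metric-scale scene reconstruction.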