ViPE: Video Pose Engine for 3D Geometric Perception

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robust recovery of camera intrinsics, ego-motion trajectories, and dense near-metric depth maps from unconstrained raw videos remains challenging. This paper introduces ViPE, an end-to-end video pose estimation engine that supports multiple camera models (pinhole, wide-angle, and 360° panoramas) and operates directly on uncalibrated input, with no prior calibration required. The method unifies multi-view geometry, structure-from-motion (SfM), and monocular depth estimation within a single GPU-accelerated framework, running at 3–5 FPS on a single GPU for standard input resolutions. On the TUM and KITTI benchmarks, it outperforms existing uncalibrated pose estimation baselines by 18% and 50%, respectively. To support training and evaluation, the authors construct and publicly release a large-scale annotated collection of approximately 96 million frames, drawn from around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos. All code, pretrained models, and the dataset are open-sourced.

📝 Abstract
Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We evaluate ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3–5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
Problem

Research questions and friction points this paper is trying to address.

Accurate 3D geometric perception from unconstrained videos
Robust estimation of camera intrinsics and motion
Large-scale annotation of videos with depth and poses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates camera intrinsics and motion
Generates dense near-metric depth maps
Supports diverse camera models and scenarios
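The outputs listed above compose naturally: given estimated pinhole intrinsics K and a dense depth map, every pixel can be back-projected into a 3D point cloud. A minimal sketch of that standard operation is below; the intrinsics and depth values are hypothetical placeholders, not ViPE outputs or its actual API.

```python
import numpy as np

def backproject(depth, K):
    """Lift a dense depth map (H x W, meters) to a 3D point cloud
    using pinhole intrinsics K (3 x 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, shape 3 x N
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix            # unit-depth camera rays
    pts = rays * depth.reshape(1, -1)        # scale each ray by its depth
    return pts.T.reshape(h, w, 3)            # per-pixel 3D points

# Hypothetical intrinsics (focal 500 px, principal point at image center)
K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)               # flat scene 2 m away
cloud = backproject(depth, K)                # shape (48, 64, 3)
```

The pixel at the principal point back-projects straight along the optical axis, so its 3D point is (0, 0, depth).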