🤖 AI Summary
Monocular visual odometry (MVO) suffers from severe long-term scale drift because a single camera provides no absolute scale information. To address this, we propose an end-to-end differentiable pose estimation framework built on a bird's-eye view (BEV) representation. Leveraging the ground-plane assumption, our method projects image features into BEV space, reducing pose estimation from 6-DoF to 3-DoF. Within the BEV space, we jointly perform keypoint detection, matching, and differentiable weighted Procrustes solving, eliminating the need for depth prediction or relative-scale constraints. This work is the first to tightly integrate BEV representation with a differentiable weighted Procrustes solver, enabling end-to-end training with pose-level supervision alone. Evaluated on the long-sequence benchmarks NCLT, Oxford RobotCar, and KITTI, our approach significantly mitigates scale drift and achieves state-of-the-art performance on most metrics.
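The ground-plane assumption is what makes the BEV projection metrically scaled: every pixel that lies on the ground can be mapped to a metric (X, Y) grid cell through a fixed homography. Below is a minimal sketch of this inverse perspective mapping, assuming known camera intrinsics `K` and ground-to-camera extrinsics `(R, t)` (the paper lifts learned image features rather than raw pixels into BEV, but the underlying geometry is the same):

```python
import numpy as np

def ipm_homography(K, R, t):
    """3x3 homography mapping ground-plane points (X, Y, 0) to image pixels.

    A ground point [X, Y, 0, 1] projects to K @ (R @ [X, Y, 0] + t), which
    only involves the first two columns of R and the translation t.
    """
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def pixels_to_bev(uv, K, R, t):
    """Back-project pixel coordinates (N, 2) to metric ground-plane (X, Y).

    Inverting the ground-plane homography recovers the metric BEV location
    of any pixel that actually lies on the flat ground.
    """
    H_inv = np.linalg.inv(ipm_homography(K, R, t))
    uv1 = np.column_stack([uv, np.ones(len(uv))])  # homogeneous pixels
    g = uv1 @ H_inv.T
    return g[:, :2] / g[:, 2:3]                    # dehomogenize to (X, Y)
```

Rasterizing these (X, Y) coordinates onto a regular grid yields a BEV map whose cells all share one scale, which is why the downstream pose problem collapses to planar 3-DoF (x, y, yaw).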
📝 Abstract
Monocular Visual Odometry (MVO) provides a cost-effective, real-time positioning solution for autonomous vehicles. However, MVO systems share a common limitation: monocular cameras provide no inherent scale information. Traditional methods offer good interpretability but recover only relative scale and suffer severe scale drift over long distances. Learning-based methods operating in the perspective view leverage large amounts of training data to acquire prior knowledge and estimate absolute scale by predicting depth values; however, their generalization is limited because they must accurately estimate the depth of every point. In contrast, we propose a novel MVO system called BEV-DWPVO. Our approach leverages the common ground-plane assumption, using Bird's-Eye View (BEV) feature maps to represent the environment in a grid-based structure with a unified scale. This reduces pose estimation from 6 Degrees of Freedom (DoF) to 3-DoF. Keypoints are extracted and matched in the BEV space, and the pose is then estimated with a differentiable weighted Procrustes solver. The entire system is fully differentiable, supporting end-to-end training with only pose supervision and no auxiliary tasks. We validate BEV-DWPVO on the challenging long-sequence datasets NCLT, Oxford RobotCar, and KITTI, achieving superior results over existing MVO methods on most evaluation metrics.
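The weighted Procrustes step admits a closed-form SVD solution, which is what makes the whole pipeline differentiable: gradients of a pose loss flow through the SVD back into the match weights and keypoint locations. A minimal numpy sketch of the 3-DoF (planar) solver is below; the paper's version runs on learned per-match confidence weights inside the network, whereas here the inputs are plain arrays for illustration:

```python
import numpy as np

def weighted_procrustes_2d(P, Q, w):
    """Closed-form weighted Procrustes (Kabsch) alignment in SE(2).

    Finds the rotation R (2x2) and translation t (2,) minimizing
        sum_i w[i] * || R @ P[i] + t - Q[i] ||^2
    for matched BEV keypoints P, Q of shape (N, 2) and weights w of shape (N,).
    """
    w = w / w.sum()
    p_bar = w @ P                      # weighted centroid of source keypoints
    q_bar = w @ Q                      # weighted centroid of target keypoints
    Pc, Qc = P - p_bar, Q - q_bar      # center both point sets
    H = (Pc * w[:, None]).T @ Qc       # 2x2 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = q_bar - R @ p_bar
    return R, t

def yaw_from_rotation(R):
    """Extract the planar yaw angle (radians) from a 2x2 rotation matrix."""
    return np.arctan2(R[1, 0], R[0, 0])
```

Because every operation (weighted means, matrix products, SVD) has a well-defined derivative, the same solver written in an autodiff framework such as PyTorch or JAX yields the end-to-end trainable pose head described above, supervised only by the pose loss.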