🤖 AI Summary
To address the sparse supervision and information loss inherent in bird's-eye view (BEV) representations for monocular visual odometry—particularly during perspective-to-BEV projection—this paper proposes a novel PV-BEV dual-branch network. By introducing dense BEV optical flow supervision, the method enables pixel-level training using only pose labels. A PV-BEV feature fusion module computes correlation in the perspective view before projection, preserving full 6-DoF motion cues, while an enhanced rotation sampling strategy improves robustness across diverse motion patterns. The framework additionally employs multi-level supervision—dense BEV flow, a 5-DoF perspective-view (PV) branch, and a 3-DoF output—all derived from pose ground truth alone, and the authors construct the multi-scale ZJH-VO dataset. Extensive experiments on KITTI, NCLT, Oxford RobotCar, and ZJH-VO demonstrate significant improvements over existing BEV-based methods, including a 40% reduction in relative translation error (RTE). Both the ZJH-VO dataset and source code are publicly released.
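The summary's key supervision signal—a dense BEV optical flow field constructed purely from 3-DoF pose ground truth—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the grid size, metric resolution, and ego-centred parameterization are assumptions.

```python
import numpy as np

def bev_flow_from_pose(yaw, tx, ty, H=128, W=128, res=0.5):
    """Dense BEV flow implied by a 3-DoF ego-motion (yaw + planar
    translation). Assumed layout: the BEV grid is centred on the ego
    vehicle, with `res` metres per pixel (hypothetical values)."""
    # Metric (x, y) coordinates of every BEV cell, origin at grid centre.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    px = (xs - W / 2) * res
    py = (ys - H / 2) * res
    # Rigid planar motion: p' = R(yaw) @ p + t
    c, s = np.cos(yaw), np.sin(yaw)
    qx = c * px - s * py + tx
    qy = s * px + c * py + ty
    # Per-pixel displacement, converted back to pixel units.
    return np.stack([(qx - px) / res, (qy - py) / res], axis=-1)  # (H, W, 2)
```

Because every BEV cell receives a flow target, a single pose label yields H×W supervision signals instead of one, which is the "pixel-level training using only pose labels" the summary refers to.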
📝 Abstract
Bird's-Eye-View (BEV) representation offers a metric-scaled planar workspace, facilitating the simplification of 6-DoF ego-motion to a more robust 3-DoF model for monocular visual odometry (MVO) in intelligent transportation systems. However, existing BEV methods suffer from sparse supervision signals and information loss during perspective-to-BEV projection. We present BEV-ODOM2, an enhanced framework addressing both limitations without additional annotations. Our approach introduces: (1) dense BEV optical flow supervision constructed from 3-DoF pose ground truth for pixel-level guidance; (2) PV-BEV fusion that computes correlation volumes before projection to preserve 6-DoF motion cues while maintaining scale consistency. The framework employs three supervision levels derived solely from pose data: dense BEV flow, 5-DoF for the PV branch, and the final 3-DoF output. Enhanced rotation sampling further balances diverse motion patterns during training. Extensive evaluation on KITTI, NCLT, Oxford RobotCar, and our newly collected multi-scale ZJH-VO dataset demonstrates state-of-the-art performance, achieving a 40% improvement in RTE over previous BEV methods. The ZJH-VO dataset, covering diverse ground vehicle scenarios from underground parking to outdoor plazas, is publicly available to facilitate future research.
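The abstract's second contribution—computing correlation volumes in the perspective view *before* BEV projection—can be illustrated with a standard local correlation between feature maps of consecutive frames, in the spirit of flow networks such as RAFT. This is a generic sketch under assumed shapes, not the paper's fusion module; the search radius and normalization are illustrative choices.

```python
import numpy as np

def local_correlation(f1, f2, radius=3):
    """Local correlation volume between two PV feature maps.

    f1, f2 : (C, H, W) features from consecutive frames (assumed layout).
    Returns a ((2r+1)^2, H, W) volume: for each pixel of f1, the
    normalized dot product with f2 features in a (2r+1)x(2r+1) window.
    """
    C, H, W = f1.shape
    d = 2 * radius + 1
    # Zero-pad f2 spatially so every shift stays in bounds.
    pad = np.pad(f2, ((0, 0), (radius, radius), (radius, radius)))
    vol = np.empty((d * d, H, W), dtype=f1.dtype)
    k = 0
    for dy in range(d):
        for dx in range(d):
            shifted = pad[:, dy:dy + H, dx:dx + W]
            vol[k] = (f1 * shifted).sum(axis=0) / np.sqrt(C)
            k += 1
    return vol
```

Building this volume while the features are still in the perspective view is what lets the matching retain out-of-plane (6-DoF) motion cues that a flat BEV projection would otherwise discard.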