BEV-ODOM2: Enhanced BEV-based Monocular Visual Odometry with PV-BEV Fusion and Dense Flow Supervision for Ground Robots

📅 2025-09-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the sparse supervision and the information loss inherent in bird's-eye-view (BEV) representations for monocular visual odometry, particularly during perspective-to-BEV projection, this paper proposes BEV-ODOM2, a PV-BEV dual-branch network. Dense BEV optical-flow supervision, constructed from 3-DoF pose ground truth alone, enables pixel-level training without extra annotations, while a PV-BEV feature fusion module computes correlation before projection to preserve full 6-DoF motion cues; an enhanced rotation sampling strategy further improves robustness across diverse motion patterns. The framework is trained with three levels of supervision derived solely from pose data: dense BEV flow, a 5-DoF target for the perspective-view (PV) branch, and the final 3-DoF output. Extensive experiments on KITTI, NCLT, Oxford RobotCar, and the newly constructed multi-scale ZJH-VO dataset demonstrate significant improvements over existing BEV-based methods, including a 40% reduction in relative translation error (RTE). Both the ZJH-VO dataset and the source code are publicly released.
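The dense flow target can be understood concretely: on a metric BEV grid, a 3-DoF relative pose induces a flow vector at every cell, so pixel-level supervision falls out of the pose label for free. Below is a minimal sketch of deriving such a target, not the authors' released code; the grid size, metres-per-pixel resolution, and function name are illustrative assumptions.

```python
# Sketch (assumed, not from the paper): dense BEV optical-flow target
# induced by a planar rigid motion (dx, dy, dtheta) between two frames.
import torch

def bev_flow_from_pose(dx, dy, dtheta, h=256, w=256, res=0.2):
    """Dense BEV flow (2, h, w) from a 3-DoF relative pose.

    dx, dy  -- translation in metres (BEV frame)
    dtheta  -- yaw rotation in radians
    res     -- metres per BEV pixel (assumed resolution)
    """
    # Pixel-centre grid in metric BEV coordinates, origin at the map centre.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    px = (xs - w / 2 + 0.5) * res
    py = (ys - h / 2 + 0.5) * res

    # Apply the SE(2) transform to every BEV point.
    c, s = torch.cos(torch.tensor(dtheta)), torch.sin(torch.tensor(dtheta))
    qx = c * px - s * py + dx
    qy = s * px + c * py + dy

    # Flow in pixels: where each BEV cell moves between the two frames.
    flow_u = (qx - px) / res
    flow_v = (qy - py) / res
    return torch.stack([flow_u, flow_v])  # (2, h, w)
```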

📝 Abstract
Bird's-Eye-View (BEV) representation offers a metric-scaled planar workspace, facilitating the simplification of 6-DoF ego-motion to a more robust 3-DoF model for monocular visual odometry (MVO) in intelligent transportation systems. However, existing BEV methods suffer from sparse supervision signals and information loss during perspective-to-BEV projection. We present BEV-ODOM2, an enhanced framework addressing both limitations without additional annotations. Our approach introduces: (1) dense BEV optical flow supervision constructed from 3-DoF pose ground truth for pixel-level guidance; (2) PV-BEV fusion that computes correlation volumes before projection to preserve 6-DoF motion cues while maintaining scale consistency. The framework employs three supervision levels derived solely from pose data: dense BEV flow, 5-DoF for the PV branch, and final 3-DoF output. Enhanced rotation sampling further balances diverse motion patterns in training. Extensive evaluation on KITTI, NCLT, Oxford, and our newly collected ZJH-VO multi-scale dataset demonstrates state-of-the-art performance, achieving a 40% improvement in RTE compared to previous BEV methods. The ZJH-VO dataset, covering diverse ground vehicle scenarios from underground parking to outdoor plazas, is publicly available to facilitate future research.
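Point (2) hinges on computing feature correlation in perspective view before any BEV projection, so that out-of-plane motion (pitch, roll, height change) still leaves a measurable trace in the matching volume. The following is a hedged sketch of an all-pairs correlation volume in the RAFT style; the shapes, normalisation, and function name are assumptions, not details taken from the paper.

```python
# Sketch (assumed): all-pairs correlation between two perspective-view (PV)
# feature maps, computed before BEV projection so 6-DoF cues survive.
import torch

def pv_correlation_volume(f1, f2):
    """Correlate every pixel of frame 1 with every pixel of frame 2.

    f1, f2 -- (B, C, H, W) feature maps from consecutive frames
    returns   (B, H, W, H, W) similarity volume in PV space
    """
    b, c, h, w = f1.shape
    v1 = f1.reshape(b, c, h * w)
    v2 = f2.reshape(b, c, h * w)
    # Dot-product similarity, scaled by sqrt(C) for numerical stability.
    corr = torch.einsum("bci,bcj->bij", v1, v2) / c**0.5
    return corr.reshape(b, h, w, h, w)
```

A volume built this way can then be projected or pooled into the BEV branch, which is one plausible reading of how the fusion module retains motion evidence that a BEV-only pipeline would discard.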
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse supervision in BEV visual odometry
Reduces information loss during perspective-to-BEV projection
Enhances monocular odometry accuracy for ground robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense BEV optical flow supervision from pose
PV-BEV fusion preserving 6-DoF motion cues
Enhanced rotation sampling for diverse motion (see the sketch after this list)
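The rotation sampling strategy is described only at a high level; one plausible reading is resampling training pairs so that rare, high-yaw-rate motions appear as often as the straight driving that dominates odometry datasets. The sketch below implements that reading under stated assumptions; the binning scheme, variable names, and the inverse-frequency weighting are hypothetical.

```python
# Sketch (assumed): rotation-balanced sampling weights that flatten the
# yaw-rate histogram of training pairs, so sharp turns are not drowned
# out by straight-line motion.
import numpy as np

def rotation_balanced_weights(yaw_rates, n_bins=20):
    """Per-sample weights inversely proportional to yaw-rate bin frequency.

    yaw_rates -- 1-D array of per-pair yaw magnitudes (hypothetical input)
    """
    edges = np.histogram_bin_edges(yaw_rates, bins=n_bins)
    idx = np.clip(np.digitize(yaw_rates, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins).astype(float)
    weights = 1.0 / counts[idx]      # rare bins get larger weight
    return weights / weights.sum()   # normalise for np.random.choice

# Usage: batch_ids = np.random.choice(len(yaw_rates), size=32,
#                                     p=rotation_balanced_weights(yaw_rates))
```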
👥 Authors
Yufei Wei¹, Wangtao Lu¹, Sha Lu (Zhejiang University), Chenxiao Hu¹, Fuzhang Han², Rong Xiong (Zhejiang University), Yue Wang¹