🤖 AI Summary
To address the sparse supervision and information loss inherent in bird's-eye view (BEV) representations for monocular visual odometry—particularly during perspective-to-BEV projection—this paper proposes a novel PV-BEV dual-branch network. By introducing dense BEV optical flow supervision, the method enables pixel-level training using only pose labels. A PV-BEV feature fusion module computes correlation in the perspective view before projection, preserving full 6-DoF motion cues, while an enhanced rotation sampling strategy improves robustness across diverse motion patterns. The framework additionally employs multi-level supervision—dense BEV flow, a 5-DoF perspective-view (PV) branch, and a 3-DoF output—all derived from pose ground truth alone, and the authors construct the multi-scale ZJH-VO dataset. Extensive experiments on KITTI, NCLT, Oxford RobotCar, and ZJH-VO demonstrate significant improvements over existing BEV-based methods, including a 40% reduction in relative translation error (RTE). Both the ZJH-VO dataset and source code are publicly released.
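The summary's key supervision signal—a dense BEV optical flow field constructed purely from 3-DoF pose ground truth—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the grid size, metric resolution, and ego-centred parameterization are assumptions.

```python
import numpy as np

def bev_flow_from_pose(yaw, tx, ty, H=128, W=128, res=0.5):
    """Dense BEV flow implied by a 3-DoF ego-motion (yaw + planar
    translation). Assumed layout: the BEV grid is centred on the ego
    vehicle, with `res` metres per pixel (hypothetical values)."""
    # Metric (x, y) coordinates of every BEV cell, origin at grid centre.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    px = (xs - W / 2) * res
    py = (ys - H / 2) * res
    # Rigid planar motion: p' = R(yaw) @ p + t
    c, s = np.cos(yaw), np.sin(yaw)
    qx = c * px - s * py + tx
    qy = s * px + c * py + ty
    # Per-pixel displacement, converted back to pixel units.
    return np.stack([(qx - px) / res, (qy - py) / res], axis=-1)  # (H, W, 2)
```

Because every BEV cell receives a flow target, a single pose label yields H×W supervision signals instead of one, which is the "pixel-level training using only pose labels" the summary refers to.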
📝 Abstract
Bird's-Eye-View (BEV) representation offers a metric-scaled planar workspace, facilitating the simplification of 6-DoF ego-motion to a more robust 3-DoF model for monocular visual odometry (MVO) in intelligent transportation systems. However, existing BEV methods suffer from sparse supervision signals and information loss during perspective-to-BEV projection. We present BEV-ODOM2, an enhanced framework addressing both limitations without additional annotations. Our approach introduces: (1) dense BEV optical flow supervision constructed from 3-DoF pose ground truth for pixel-level guidance; (2) PV-BEV fusion that computes correlation volumes before projection to preserve 6-DoF motion cues while maintaining scale consistency. The framework employs three supervision levels derived solely from pose data: dense BEV flow, 5-DoF for the PV branch, and the final 3-DoF output. Enhanced rotation sampling further balances diverse motion patterns during training. Extensive evaluation on KITTI, NCLT, Oxford RobotCar, and our newly collected multi-scale ZJH-VO dataset demonstrates state-of-the-art performance, achieving a 40% improvement in RTE over previous BEV methods. The ZJH-VO dataset, covering diverse ground vehicle scenarios from underground parking to outdoor plazas, is publicly available to facilitate future research.
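The abstract's second contribution—computing correlation volumes in the perspective view *before* BEV projection—can be illustrated with a standard local correlation between feature maps of consecutive frames, in the spirit of flow networks such as RAFT. This is a generic sketch under assumed shapes, not the paper's fusion module; the search radius and normalization are illustrative choices.

```python
import numpy as np

def local_correlation(f1, f2, radius=3):
    """Local correlation volume between two PV feature maps.

    f1, f2 : (C, H, W) features from consecutive frames (assumed layout).
    Returns a ((2r+1)^2, H, W) volume: for each pixel of f1, the
    normalized dot product with f2 features in a (2r+1)x(2r+1) window.
    """
    C, H, W = f1.shape
    d = 2 * radius + 1
    # Zero-pad f2 spatially so every shift stays in bounds.
    pad = np.pad(f2, ((0, 0), (radius, radius), (radius, radius)))
    vol = np.empty((d * d, H, W), dtype=f1.dtype)
    k = 0
    for dy in range(d):
        for dx in range(d):
            shifted = pad[:, dy:dy + H, dx:dx + W]
            vol[k] = (f1 * shifted).sum(axis=0) / np.sqrt(C)
            k += 1
    return vol
```

Building this volume while the features are still in the perspective view is what lets the matching retain out-of-plane (6-DoF) motion cues that a flat BEV projection would otherwise discard.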