DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

📅 2025-07-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the robustness, generalization, and real-time limitations of learning-based monocular visual odometry (VO), this paper proposes DINO-VO, presented as the first end-to-end framework integrating the vision foundation model DINOv2 into feature-based VO. To reconcile DINOv2's coarse-grained features with VO's fine-grained geometric requirements, the authors design a lightweight salient keypoint detector and fuse semantically robust features with differentiable geometric descriptors. A Transformer-based matching network and a differentiable pose estimation layer then enable high-accuracy camera motion estimation. Evaluated on TartanAir and KITTI, DINO-VO significantly outperforms existing frame-to-frame VO methods, and it achieves competitive accuracy on EuRoC. The system runs at 72 FPS on a single GPU with memory consumption under 1 GB, and delivers outdoor localization accuracy comparable to state-of-the-art SLAM systems.

πŸ“ Abstract
Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1 GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing robustness and generalization in monocular visual odometry
Integrating DINOv2's coarse features for precise sparse matching
Combining semantic and geometric features for better localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Salient keypoints detector for DINOv2 features
Combines robust-semantic and fine-grained geometric features
Transformer-based matcher with differentiable pose estimation
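The front-end contributions listed above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the authors' implementation: the learned saliency head is replaced by random scores, descriptor fusion is shown as simple concatenation, and the paper's transformer-based matcher is stood in for by mutual nearest-neighbor matching on cosine similarity.

```python
import numpy as np

def select_salient_keypoints(saliency, k):
    """Pick the k locations with the highest saliency on the coarse
    patch grid (a learned detector head in the paper; random here)."""
    flat = saliency.ravel()
    idx = np.argpartition(-flat, k - 1)[:k]       # indices of the k largest scores
    rows, cols = np.unravel_index(idx, saliency.shape)
    return np.stack([rows, cols], axis=1)         # (k, 2) patch coordinates

def fuse_descriptors(semantic, geometric, keypoints):
    """Concatenate coarse semantic (DINOv2-like) and fine geometric
    features at each keypoint, then L2-normalize."""
    sem = semantic[keypoints[:, 0], keypoints[:, 1]]   # (k, Ds)
    geo = geometric[keypoints[:, 0], keypoints[:, 1]]  # (k, Dg)
    desc = np.concatenate([sem, geo], axis=1)
    return desc / (np.linalg.norm(desc, axis=1, keepdims=True) + 1e-8)

def mutual_nearest_matches(desc_a, desc_b):
    """Mutual nearest-neighbor matching by cosine similarity
    (the paper learns matches with a transformer instead)."""
    sim = desc_a @ desc_b.T
    ab = sim.argmax(axis=1)                       # best match in B for each A
    ba = sim.argmax(axis=0)                       # best match in A for each B
    return [(i, j) for i, j in enumerate(ab) if ba[j] == i]

# Toy data: a 16x16 coarse patch grid, 32-d semantic + 16-d geometric features.
rng = np.random.default_rng(0)
H, W, Ds, Dg, K = 16, 16, 32, 16, 50
kps = select_salient_keypoints(rng.random((H, W)), K)
desc_a = fuse_descriptors(rng.random((H, W, Ds)), rng.random((H, W, Dg)), kps)
desc_b = fuse_descriptors(rng.random((H, W, Ds)), rng.random((H, W, Dg)), kps)
matches = mutual_nearest_matches(desc_a, desc_b)
```

The resulting matched keypoint pairs would feed the differentiable pose estimation layer, which the paper uses to recover frame-to-frame camera motion end-to-end.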