DINO-VO: Learning Where to Focus for Enhanced State Estimation

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DINO-VO, an end-to-end monocular visual odometry system that targets the limited generalization of methods built on heuristic feature extraction, particularly in large-scale outdoor environments. An attention mechanism guided by a pre-trained DINO model drives data-driven, adaptive selection of image patches. The framework combines multi-task feature extraction, inverse depth priors, and differentiable bundle adjustment so that feature learning and geometric state estimation are optimized jointly in a single pipeline. Experiments on TartanAir, KITTI, EuRoC, and TUM, which together span synthetic, indoor, and outdoor scenes, report state-of-the-art tracking accuracy and strong cross-domain generalization.
📝 Abstract
We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, EuRoC, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.
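To make the inverse-depth idea in the abstract concrete, here is a minimal NumPy sketch of the standard reprojection residual that bundle adjustment minimizes when a patch is parameterized by its pixel location and inverse depth. It is an illustration under stated assumptions, not the paper's implementation: the function names `reproject_inverse_depth` and `reprojection_residual`, the pinhole intrinsics `K`, and the relative pose `(R, t)` are all hypothetical, and how DINO-VO feeds its inverse depth priors into BA is not described here.

```python
import numpy as np


def reproject_inverse_depth(uv, inv_depth, K, R, t):
    """Map a pixel in frame i, with a given inverse depth, to its pixel in frame j.

    Hypothetical helper: (R, t) is assumed to take frame-i camera coordinates
    to frame-j camera coordinates, and K is a 3x3 pinhole intrinsics matrix.
    """
    # Back-project the pixel to a bearing ray in frame i.
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    # The 3D point is ray / inv_depth. Multiplying the transformed point by
    # inv_depth does not change its projection, so we use R @ ray + inv_depth * t,
    # which stays well defined as inv_depth approaches 0 (points at infinity).
    p = R @ ray + inv_depth * t
    # Pinhole projection into frame j.
    q = K @ p
    return q[:2] / q[2]


def reprojection_residual(uv_observed, uv, inv_depth, K, R, t):
    """2D residual between an observed pixel in frame j and the reprojection."""
    return np.asarray(uv_observed, dtype=float) - reproject_inverse_depth(
        uv, inv_depth, K, R, t
    )


# Assumed example values: a 640x480 camera with a 500-pixel focal length.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

# Principal-point pixel at depth 2 (inverse depth 0.5), after a 0.1 shift in x.
uv_j = reproject_inverse_depth((320.0, 240.0), 0.5, K, R, t)
```

With these assumed values the point at depth 2 lands at approximately (345, 240) in the second frame, while setting `inv_depth` to 0 leaves the pixel unchanged under pure translation. That insensitivity is one commonly cited reason for preferring inverse depth over depth in large-scale outdoor scenes, where many points are far away.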
Problem

Research questions and friction points this paper is trying to address.

Visual Odometry
Feature Extraction
Scene Generalization
Monocular
State Estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Patch Selection
Differentiable Bundle Adjustment
End-to-End Visual Odometry
Inverse Depth Priors
Scene Generalization