Keyframe-Based Feed-Forward Visual Odometry

📅 2026-01-22
🤖 AI Summary
This work addresses two limitations of existing feed-forward visual odometry systems built on vision foundation models: computational redundancy and sensitivity to low-parallax frames, both of which stem from the absence of a keyframe mechanism. To overcome these limitations, we propose a novel keyframe selection strategy that integrates reinforcement learning with geometric heuristics, replacing handcrafted rules with a data-driven approach for the first time. Our method enables end-to-end co-optimization with the underlying vision foundation model by leveraging its high-dimensional implicit representations. Trained on TartanAir and evaluated across multiple real-world datasets, the proposed approach significantly outperforms state-of-the-art feed-forward visual odometry methods and achieves a better balance between efficiency and accuracy.

📝 Abstract
The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and to degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO method. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
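
For context, the kind of hand-crafted keyframe rule that the abstract contrasts with a learned policy can be sketched as a simple parallax threshold. This is a minimal illustration, not the paper's method: the function names, the input layout (matched 2D feature positions per frame), and the 20-pixel threshold are all assumptions made for the example.

```python
from math import hypot
from statistics import median


def median_parallax(pts_a, pts_b):
    """Median pixel displacement between matched 2D points in two frames."""
    return median(
        hypot(xb - xa, yb - ya) for (xa, ya), (xb, yb) in zip(pts_a, pts_b)
    )


def select_keyframes(tracks, min_parallax_px=20.0):
    """Return indices of frames kept as keyframes.

    tracks[i] holds the (x, y) positions of the same features in frame i.
    This input layout is hypothetical; a real pipeline would obtain it
    from a feature tracker. The threshold is an illustrative assumption.
    """
    if not tracks:
        return []
    keyframes = [0]  # the first frame is always kept
    for i in range(1, len(tracks)):
        reference = tracks[keyframes[-1]]
        # Keep the frame only if it has moved far enough from the last keyframe.
        if median_parallax(reference, tracks[i]) >= min_parallax_px:
            keyframes.append(i)
    return keyframes


# Synthetic example: three features drifting 5 pixels per frame along x.
base = [(10.0, 10.0), (50.0, 20.0), (80.0, 60.0)]
tracks = [[(x + 5.0 * i, y) for (x, y) in base] for i in range(10)]
print(select_keyframes(tracks))
```

A fixed threshold like this is defined on an explicit geometric quantity, which is the mismatch the abstract points to: a foundation-model-based VO system depends on latent representations, so a rule tuned on pixel parallax need not select the frames that help it most.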

Problem

Research questions and friction points this paper is trying to address.

visual odometry
foundation models
keyframe selection
computational redundancy
low parallax

Innovation

Methods, ideas, or system contributions that make the work stand out.

keyframe selection
visual odometry
foundation models
reinforcement learning
feed-forward network

Weichen Dai
Hangzhou Dianzi University
3D Vision, SLAM, Brain-inspired intelligence

Wenhan Su
Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, School of Computer Science, Hangzhou Dianzi University, Hangzhou, China

Da Kong
Technion - Israel Institute of Technology
Robotics, POMDP, SLAM, Navigation, 3D Vision

Yuhang Ming
Lecturer at Hangzhou Dianzi University
SLAM, VPR, Computer Vision, Robotics, Spatial AI

Wanzeng Kong
Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, School of Computer Science, Hangzhou Dianzi University, Hangzhou, China