Visual Odometry with Transformers

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular visual odometry (VO) methods rely on hand-crafted modules such as feature matching and bundle adjustment, and require precise camera calibration and extensive hyperparameter tuning, which limits their generalization. Large 3D models have been explored to alleviate these issues, but they struggle to model long-term temporal dependencies in video sequences and do not provide the accurate per-frame pose estimates that VO requires. This paper proposes VoT, an end-to-end Transformer-based monocular VO framework. VoT applies temporal and spatial self-attention to capture global dependencies across image sequences and is trained using camera poses as the only supervision, without dense 3D reconstruction or geometric priors. It supports plug-and-play integration of pre-trained encoders as feature extractors, enabling strong cross-scene generalization. VoT outperforms traditional methods on multiple benchmarks, runs more than 3× faster, and shows consistent gains with larger training data and stronger backbone networks.

📝 Abstract
Modern monocular visual odometry methods typically combine pre-trained deep learning components with optimization modules, resulting in complex pipelines that rely heavily on camera calibration and hyperparameter tuning, and often struggle in unseen real-world scenarios. Recent large-scale 3D models trained on massive amounts of multi-modal data have partially alleviated these challenges, providing generalizable dense reconstruction and camera pose estimation. Still, they remain limited in handling long videos and providing accurate per-frame estimates, which are required for visual odometry. In this work, we demonstrate that monocular visual odometry can be addressed effectively in an end-to-end manner, thereby eliminating the need for handcrafted components such as bundle adjustment, feature matching, camera calibration, or dense 3D reconstruction. We introduce VoT, short for Visual odometry Transformer, which processes sequences of monocular frames by extracting features and modeling global relationships through temporal and spatial attention. Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision. The framework is modular and flexible, allowing seamless integration of various pre-trained encoders as feature extractors. Experimental results demonstrate that VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster. The code will be released.
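The abstract describes a pipeline in which per-frame features from a pre-trained encoder are related through spatial attention (within a frame) and temporal attention (across frames), then mapped directly to per-frame camera motion. The paper's actual architecture is not specified here, so the following is a minimal NumPy sketch of that idea only; all function names, shapes, and the identity Q/K/V projections are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n_tokens, dim). Single-head scaled dot-product attention with
    # identity Q/K/V projections for brevity; a real block would use
    # learned weight matrices.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x

def vot_sketch(frames, w_pose):
    # frames: (T, N, D) -- T frames, N patch tokens each, D-dim features,
    # assumed to come from a frozen pre-trained encoder.
    T, N, D = frames.shape
    # Spatial attention: tokens attend to each other within one frame.
    spatial = np.stack([self_attention(frames[t]) for t in range(T)])
    # Temporal attention: each token position attends across all frames.
    temporal = np.stack([self_attention(spatial[:, n]) for n in range(N)],
                        axis=1)
    # Mean-pool tokens per frame, then a linear head to a 6-DoF motion
    # vector per frame (translation + rotation parameters).
    pooled = temporal.mean(axis=1)   # (T, D)
    return pooled @ w_pose           # (T, 6)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 16))   # 4 frames, 8 tokens, 16 dims
poses = vot_sketch(feats, rng.standard_normal((16, 6)))
print(poses.shape)  # (4, 6): one motion estimate per frame
```

Note how supervision in this setup needs only ground-truth poses to compare against the `(T, 6)` output, matching the abstract's claim that no dense geometry or 3D reconstruction is involved.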
Problem

Research questions and friction points this paper is trying to address.

Pipelines built from handcrafted components such as bundle adjustment and feature matching are complex and brittle
Reliance on camera calibration, dense geometry estimation, and hyperparameter tuning limits generalization to unseen scenes
Existing large 3D models struggle with long videos and accurate per-frame pose estimates
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end transformer for monocular visual odometry
Direct camera motion prediction without dense geometry
Modular framework with pre-trained feature encoders
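The modularity claim above is that the feature extractor is interchangeable: any backbone producing token features can feed the same attention and pose head. A toy sketch of that interface is below; the backbone names and the linear stand-in encoders are hypothetical, used only to show that downstream shapes stay fixed when the encoder is swapped.

```python
import numpy as np

def make_linear_encoder(in_dim, n_tokens=8, dim=16, seed=0):
    # Toy stand-in for a pre-trained backbone: a fixed random linear map
    # from a flattened frame to (n_tokens, dim) token features.
    w = np.random.default_rng(seed).standard_normal((in_dim, n_tokens * dim))
    return lambda frame: (frame.ravel() @ w).reshape(n_tokens, dim)

# Hypothetical registry of interchangeable encoders (names illustrative).
ENCODERS = {
    "backbone_a": make_linear_encoder(3 * 32 * 32, seed=1),
    "backbone_b": make_linear_encoder(3 * 32 * 32, seed=2),
}

def extract(frames, name):
    # frames: (T, 3, 32, 32) -> (T, n_tokens, dim). The attention and
    # pose head downstream are unchanged regardless of the chosen encoder.
    enc = ENCODERS[name]
    return np.stack([enc(f) for f in frames])

rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 3, 32, 32))
a = extract(frames, "backbone_a")
b = extract(frames, "backbone_b")
print(a.shape, b.shape)  # both (4, 8, 16)
```

The design point is that only the encoder's output contract (tokens × dims per frame) is fixed, which is what makes plug-and-play use of stronger pre-trained backbones possible.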