🤖 AI Summary
Existing visual odometry (VO) methods rely on predefined camera calibration, generalize poorly, and cannot be deployed zero-shot across diverse cameras and environments. This paper introduces ZeroVO, the first calibration-free, fine-tuning-free zero-shot VO framework. The approach rests on three core innovations: (1) a calibration-free, geometry-aware network that implicitly models camera geometry and tolerates noise in estimated depth and camera parameters; (2) language-prior-guided semantic feature extraction that improves cross-domain semantic consistency; and (3) a semi-supervised, iterative self-adaptation training paradigm that exploits unlabeled data from new scenes. Evaluated on multi-source benchmarks (KITTI, nuScenes, Argoverse 2, and a GTA-derived synthetic dataset), ZeroVO reduces absolute trajectory error by over 30% on average, significantly improving cross-domain localization robustness and practical deployability.
📝 Abstract
We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations of existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the model's ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement over prior methods on three standard benchmarks (KITTI, nuScenes, and Argoverse 2) as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By requiring neither fine-tuning nor camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.
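The headline result is a reduction in absolute trajectory error (ATE), the standard VO metric that compares estimated camera positions against ground truth after trajectory alignment. As a minimal sketch of how the metric itself is computed (the `ate_rmse` helper and the toy trajectories below are illustrative, not part of the paper; real evaluations typically also align the trajectories first, e.g. with a similarity transform):

```python
import numpy as np

def ate_rmse(gt: np.ndarray, est: np.ndarray) -> float:
    """RMSE of absolute trajectory error between two already-aligned
    trajectories, each given as an (N, 3) array of camera positions."""
    # Per-pose Euclidean position error, then root-mean-square over poses.
    err = np.linalg.norm(gt - est, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy example: an estimate offset by 1 m along x at every pose.
gt = np.zeros((5, 3))
est = gt + np.array([1.0, 0.0, 0.0])
print(ate_rmse(gt, est))  # → 1.0
```

A "30% average reduction" then means the ATE RMSE of ZeroVO is at least 30% lower than that of the baseline methods, averaged across benchmark sequences.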