ZeroVO: Visual Odometry with Minimal Assumptions

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual odometry (VO) methods depend on predefined camera calibration, generalize poorly, and cannot be deployed zero-shot across diverse cameras and environments. This paper introduces ZeroVO, a calibration-free, fine-tuning-free zero-shot VO framework built on three core innovations: (1) a calibration-free, geometry-aware network that tolerates noise in estimated depth and camera parameters; (2) a language-based semantic prior that strengthens feature extraction and generalization to unseen domains; and (3) a flexible semi-supervised training paradigm that iteratively self-adapts to new scenes using unlabeled data. Evaluated on KITTI, nuScenes, Argoverse 2, and a newly introduced high-fidelity synthetic benchmark derived from GTA, ZeroVO improves on prior methods by over 30%, substantially strengthening cross-domain localization robustness and practical deployability.

📝 Abstract
We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the model's ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement against prior methods on three standard benchmarks, KITTI, nuScenes, and Argoverse 2, as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By not requiring fine-tuning or camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.
Problem

Research questions and friction points this paper is trying to address.

Existing VO methods fail to generalize zero-shot across diverse cameras and environments
Dependence on predefined or static camera calibration setups
Brittle feature extraction and poor generalization to previously unseen domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibration-free geometry-aware network structure
Language-based prior for semantic enhancement
Flexible semi-supervised training paradigm
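The third innovation, semi-supervised iterative adaptation to new scenes from unlabeled data, follows the general shape of a pseudo-labeling self-training loop. The sketch below is a toy illustration of that loop, not the paper's actual method: `ToyVO` is a hypothetical one-parameter stand-in for the VO network, and the magnitude-based confidence gate is an assumed placeholder.

```python
class ToyVO:
    """Hypothetical one-parameter stand-in for a VO network that maps an
    input x (standing in for a frame pair) to a scalar pose estimate w*x."""
    def __init__(self, w=1.0):
        self.w = w

    def predict(self, x):
        return self.w * x

    def fit(self, pairs):
        # Closed-form least squares for y ≈ w*x over (x, y) pairs.
        num = sum(x * y for x, y in pairs)
        den = sum(x * x for x, _ in pairs) or 1.0
        self.w = num / den


def self_adapt(model, labeled, unlabeled, rounds=3, thresh=0.5):
    """Pseudo-labeling loop (assumed sketch): each round, label the
    unlabeled inputs the model is 'confident' about, then refit on the
    union of labeled and pseudo-labeled data."""
    for _ in range(rounds):
        pseudo = [(x, model.predict(x)) for x in unlabeled
                  if abs(model.predict(x)) > thresh]  # placeholder confidence gate
        model.fit(list(labeled) + pseudo)
    return model
```

With a single labeled pair `(1.0, 2.0)` and one unlabeled input `3.0`, successive rounds pull the parameter from its initial value toward the supervised target, illustrating how pseudo-labels and ground-truth labels jointly shape the fit.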