Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of current visual navigation evaluation, which overemphasizes success rate while neglecting trajectory quality, collision behavior, and environmental robustness. The authors conduct a zero-shot benchmark of five state-of-the-art models—GNM, ViNT, NoMaD, NaviBridger, and CrossFormer—in real-world indoor and outdoor environments. By integrating path-quality metrics, visual goal recognition scores, and controlled image perturbations (e.g., motion blur and sun glare), they systematically uncover common failure modes: inadequate geometric understanding, confusion in visually similar scenes, and sensitivity to distribution shifts. The work introduces a reproducible, multi-dimensional evaluation framework, revealing that prevailing models frequently collide with obstacles, misjudge repeated environments, and suffer performance degradation under out-of-distribution conditions. Code and datasets are publicly released to foster standardized benchmarking.
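The summary mentions controlled image perturbations such as motion blur and sun glare. The paper's exact perturbation pipeline is not given here; the following is a minimal sketch of how such perturbations could be implemented with NumPy (function names and parameters are my own, not from the paper):

```python
import numpy as np

def motion_blur(img: np.ndarray, k: int = 5) -> np.ndarray:
    """Horizontal motion blur: convolve each row with a width-k averaging kernel.
    `img` is a 2-D grayscale array with values in [0, 1]."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, img
    )

def sun_flare(img: np.ndarray, center: tuple, radius: float,
              intensity: float = 0.8) -> np.ndarray:
    """Crude sun-glare stand-in: additively brighten a circular region,
    fading linearly from `center` out to `radius`, then clip to [0, 1]."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - center[0]) ** 2 + (xx - center[1]) ** 2)
    flare = intensity * np.clip(1.0 - dist / radius, 0.0, 1.0)
    return np.clip(img + flare, 0.0, 1.0)
```

Applying such perturbations to the robot's camera stream at inference time, while holding the environment fixed, isolates the models' sensitivity to visual distribution shift from their underlying navigation competence.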
📝 Abstract
Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion- and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar, even when semantic differences are present, causing goal-prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
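The abstract argues that success rate alone conceals trajectory quality. A widely used path-based complement (an assumption here; the paper does not specify its exact metrics in this summary) is Success weighted by Path Length (SPL), which discounts each success by how much longer the executed path was than the shortest one:

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length over a set of episodes.

    successes        -- 1 if the episode reached the goal, else 0
    shortest_lengths -- shortest-path length l_i for each episode
    actual_lengths   -- length p_i of the path the robot actually took

    SPL = (1/N) * sum_i  s_i * l_i / max(p_i, l_i)
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

A robot that always reaches the goal but takes a path twice as long as necessary scores 1.0 on success rate yet only 0.5 on SPL, which is exactly the gap between "reached the goal" and "navigated well" that the paper's multi-dimensional evaluation is designed to expose.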
Problem

Research questions and friction points this paper is trying to address.

Visual Navigation Models
real-world evaluation
trajectory quality
collision behavior
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Navigation Models
zero-shot evaluation
real-world benchmarking
robustness to perturbations
trajectory quality