🤖 AI Summary
This study addresses the limitations of current visual navigation evaluation, which overemphasizes success rate while neglecting trajectory quality, collision behavior, and environmental robustness. The authors conduct a zero-shot benchmark of five state-of-the-art models—GNM, ViNT, NoMaD, NaviBridger, and CrossFormer—in real-world indoor and outdoor environments. By integrating path-quality metrics, visual goal recognition scores, and controlled image perturbations (e.g., motion blur and sun glare), they systematically uncover common failure modes: inadequate geometric understanding, confusion between visually similar locations, and sensitivity to distribution shifts. The work introduces a reproducible, multi-dimensional evaluation framework, revealing that prevailing models frequently collide with obstacles, mistake one place for another in repetitive environments, and suffer performance degradation under out-of-distribution conditions. Code and datasets are publicly released to foster standardized benchmarking.
📝 Abstract
Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion- and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar despite semantic differences, causing goal-prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
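The abstract mentions controlled image perturbations (motion blur, sun flare) for probing robustness. The paper's exact perturbation pipeline is not shown here, but a minimal NumPy-only sketch of what such perturbations typically look like is below; the function names, kernel sizes, and intensity parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def motion_blur(image: np.ndarray, kernel_size: int = 9) -> np.ndarray:
    """Horizontal motion blur: convolve each row with a uniform 1-D kernel.
    (Illustrative stand-in for the paper's blur perturbation.)"""
    kernel = np.ones(kernel_size) / kernel_size
    out = np.empty_like(image, dtype=float)
    for c in range(image.shape[2]):          # per channel
        for r in range(image.shape[0]):      # per row
            out[r, :, c] = np.convolve(image[r, :, c], kernel, mode="same")
    return out.astype(image.dtype)

def sun_flare(image: np.ndarray, center: tuple[int, int],
              radius: float, intensity: float = 200.0) -> np.ndarray:
    """Additive Gaussian glow centered at `center` to mimic lens glare.
    (Illustrative stand-in for the paper's sun-flare perturbation.)"""
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    glow = intensity * np.exp(-dist2 / (2.0 * radius ** 2))
    out = image.astype(float) + glow[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```

In a robustness sweep, each perturbation would be applied to the robot's camera observations (or goal images) before they reach the model, and success/path metrics recorded per perturbation level.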