🤖 AI Summary
To address high evaluation costs, oversimplified simulation, and scarce benchmarks in embodied navigation, this paper introduces NaviTrace—the first vision-language navigation (VLN) benchmark spanning multiple embodiments (e.g., wheeled and legged robots) and enabling fine-grained evaluation from natural language instructions to 2D trajectories. The paper proposes a semantic-aware trajectory scoring mechanism that integrates Dynamic Time Warping (DTW), goal endpoint error, and embodiment-aware penalties, improving alignment with human preferences. Leveraging per-pixel semantics and embodiment-type modeling, it establishes a scalable, reproducible evaluation framework. Eight state-of-the-art vision-language models (VLMs) are systematically evaluated across 1,000 scenarios, and the benchmark, evaluation tools, and leaderboard are publicly released—establishing a standardized assessment paradigm for real-world robotic navigation.
📝 Abstract
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark in which a model receives an instruction and an embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1,000 scenarios and more than 3,000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance, caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
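To make the metric concrete, the abstract's semantic-aware trace score can be sketched as a weighted sum of three terms: a length-normalized DTW distance between predicted and expert traces, the Euclidean error at the goal endpoint, and an embodiment-conditioned penalty accumulated over the pixels the predicted trace visits. The function names, the weighting scheme, and the `penalty_map` representation below are illustrative assumptions, not the paper's actual implementation:

```python
import math

def dtw_distance(trace_a, trace_b):
    """Classic dynamic-programming DTW over 2D points with Euclidean local cost,
    normalized by the longer trace length. Traces are lists of (x, y) tuples."""
    n, m = len(trace_a), len(trace_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(trace_a[i - 1], trace_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / max(n, m)

def trace_score(pred, expert, penalty_map, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined score (lower is better).
    penalty_map: dict mapping integer (x, y) pixels to an embodiment-specific
    penalty, e.g. high values on terrain a wheeled robot cannot traverse."""
    w_dtw, w_goal, w_sem = weights
    path_term = dtw_distance(pred, expert)
    goal_term = math.dist(pred[-1], expert[-1])  # endpoint error
    sem_term = sum(penalty_map.get((round(x), round(y)), 0.0)
                   for x, y in pred) / len(pred)
    return w_dtw * path_term + w_goal * goal_term + w_sem * sem_term
```

A trace identical to the expert's with no semantic violations scores 0, while deviating paths, missed goals, or steps onto penalized pixels each raise the score, which is what lets a single scalar rank model outputs against human preference.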