NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high evaluation costs, oversimplified simulations, and scarce benchmarks in embodied navigation, this paper introduces NaviTrace, a vision-language navigation benchmark that supports multiple embodiment types (human, legged robot, wheeled robot, bicycle) and enables fine-grained evaluation from natural-language instructions to 2D trajectories in image space. The authors propose a semantic-aware trace scoring mechanism that combines Dynamic Time Warping (DTW) distance, goal endpoint error, and embodiment-conditioned penalties, improving alignment with human preferences. Leveraging per-pixel semantic analysis and embodiment-type modeling, the benchmark provides a scalable, reproducible evaluation framework. The paper systematically evaluates eight state-of-the-art vision-language models (VLMs) across 1,000 photorealistic scenarios and publicly releases the benchmark, evaluation tools, and leaderboard, establishing a standardized assessment paradigm for real-world robotic navigation.

📝 Abstract
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating embodied navigation capabilities of vision-language models
Addressing limitations of costly real-world trials and simplified simulations
Measuring spatial grounding and goal localization gaps in navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces NaviTrace benchmark for navigation evaluation
Uses semantic-aware trace score combining multiple metrics
Evaluates VLMs on embodiment-conditioned navigation traces
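The semantic-aware trace score described above combines three terms: DTW distance between predicted and expert traces, goal endpoint error, and an embodiment-conditioned penalty from per-pixel semantics. The sketch below illustrates one plausible way to combine them; the weights, function names, and the scalar `semantic_penalty` input are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def dtw_distance(trace_a, trace_b):
    """Classic dynamic-time-warping distance between two 2D traces
    (lists of (x, y) pixel coordinates), using Euclidean point cost."""
    a, b = np.asarray(trace_a, float), np.asarray(trace_b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def trace_score(pred, expert, semantic_penalty,
                w_dtw=1.0, w_goal=1.0, w_sem=1.0):
    """Combine DTW distance, goal endpoint error, and an
    embodiment-conditioned semantic penalty (e.g. accumulated cost of
    traversing pixels untraversable for the given embodiment) into a
    single score where lower is better. Weights are illustrative."""
    dtw = dtw_distance(pred, expert)
    goal_err = np.linalg.norm(np.asarray(pred[-1], float)
                              - np.asarray(expert[-1], float))
    return w_dtw * dtw + w_goal * goal_err + w_sem * semantic_penalty
```

In this sketch a perfect prediction (identical trace, zero semantic penalty) scores 0, and the score grows with path deviation, endpoint error, or traversal of semantically forbidden pixels for the given embodiment.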