NavBench: Probing Multimodal Large Language Models for Embodied Navigation

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates the zero-shot embodied navigation capabilities of multimodal large language models (MLLMs). Addressing the lack of systematic benchmarks, we introduce NavBench, the first zero-shot embodied navigation benchmark, comprising navigation understanding tasks (instruction alignment, temporal progress estimation, and observation-action reasoning) and step-by-step execution across 432 episodes in 72 indoor scenes. We propose a pipeline that maps MLLM outputs to robot actions, and decouple understanding from execution via cognitively stratified tasks and complexity-graded environments. A map-context augmentation mechanism is further introduced to strengthen spatial reasoning. Experiments reveal: (1) GPT-4o achieves the best overall performance; (2) lightweight open-source MLLMs attain 72% accuracy on simple tasks; (3) map context improves decision accuracy by 19% on medium-complexity tasks; and (4) temporal progress estimation remains the weakest capability, at only 41% accuracy.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark to evaluate the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs' outputs into robotic actions. We evaluate both proprietary and open-source models, finding that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases. Results also show that models with higher comprehension scores tend to achieve better execution performance. Providing map-based context improves decision accuracy, especially in medium-difficulty scenarios. However, most models struggle with temporal understanding, particularly in estimating progress during navigation, which may pose a key challenge.
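The listing does not detail how the output-to-action pipeline works. As a rough illustration of the idea only (not NavBench's actual implementation), the sketch below parses an MLLM's free-text reply into one of the discrete actions typical of indoor navigation simulators; the action names and keyword rules are assumptions.

```python
import re
from enum import Enum

class NavAction(Enum):
    """Discrete navigation actions; a common low-level action space in
    indoor navigation simulators (assumed here, not NavBench's exact spec)."""
    MOVE_FORWARD = "move_forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"

# Keyword patterns mapped to actions; checked in order, first match wins.
_ACTION_PATTERNS = [
    (re.compile(r"\b(stop|done|finished|arrived)\b", re.I), NavAction.STOP),
    (re.compile(r"\bturn\s+left\b|\bleft\b", re.I), NavAction.TURN_LEFT),
    (re.compile(r"\bturn\s+right\b|\bright\b", re.I), NavAction.TURN_RIGHT),
    (re.compile(r"\b(forward|ahead|straight|proceed)\b", re.I), NavAction.MOVE_FORWARD),
]

def parse_mllm_action(response: str) -> NavAction:
    """Map an MLLM's free-text answer onto a discrete robot action.
    Falls back to STOP when nothing matches, so the robot fails safe."""
    for pattern, action in _ACTION_PATTERNS:
        if pattern.search(response):
            return action
    return NavAction.STOP

print(parse_mllm_action("I should turn left toward the kitchen."))  # NavAction.TURN_LEFT
```

In practice such a parser would sit between the model's response and the robot's controller, with the fail-safe STOP default guarding against unparseable replies.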
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' embodied navigation in zero-shot settings
Assessing navigation comprehension and step-by-step execution capabilities
Converting MLLMs' outputs into real-world robotic actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for MLLM navigation in zero-shot settings
Pipeline converts MLLM outputs to robotic actions
Map-based context enhances decision accuracy
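For intuition about the map-based context mechanism, here is a minimal sketch assuming the map is serialized as a short textual summary prepended to the decision prompt; the field names (visited rooms, nearby landmarks) are illustrative assumptions, not the paper's actual encoding.

```python
def build_map_augmented_prompt(instruction: str,
                               visited: list[str],
                               nearby: list[str]) -> str:
    """Prepend a lightweight textual map summary to the navigation prompt.
    The visited-rooms and nearby-landmarks fields are illustrative
    assumptions, not NavBench's actual map encoding."""
    map_context = (
        "Map context:\n"
        f"- Rooms visited so far: {', '.join(visited) or 'none'}\n"
        f"- Landmarks visible nearby: {', '.join(nearby) or 'none'}\n"
    )
    question = (
        f"Instruction: {instruction}\n"
        "Given the current observation and the map context above, "
        "choose the next action: move_forward, turn_left, turn_right, or stop."
    )
    return map_context + "\n" + question

# Example usage: the augmented prompt is what gets sent to the MLLM.
prompt = build_map_augmented_prompt(
    "Walk past the sofa and stop at the kitchen door.",
    visited=["living room"],
    nearby=["sofa", "dining table"],
)
print(prompt)
```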
Yanyuan Qiao
Postdoctoral Research Fellow, EPFL
Embodied-AI, Vision and Language, Multi-modal Learning
Haodong Hong
The University of Queensland
Wenqi Lyu
The University of Adelaide
Embodied-AI
Dong An
Mohamed bin Zayed University of Artificial Intelligence
Siqi Zhang
Tongji University
Yutong Xie
Mohamed bin Zayed University of Artificial Intelligence
Xinyu Wang
The University of Adelaide
Qi Wu
The University of Adelaide