How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study presents the first systematic evaluation of large multimodal models (LMMs) on goal-directed embodied navigation in urban 3D airspace. Using a high-quality dataset of 5,037 samples, the authors benchmark 17 state-of-the-art models on their ability to execute vertical spatial actions and interpret urban semantic cues, revealing a pronounced nonlinear error divergence at critical decision points. The work identifies four key directions for improvement (geometric awareness, cross-view understanding, spatial imagination, and long-term memory) and validates the efficacy of corresponding enhancement strategies. Findings indicate that while current LMMs exhibit rudimentary navigation competence, they remain substantially inferior to human performance. This research establishes a foundational benchmark and outlines actionable pathways toward embodied intelligence in complex 3D environments.
📝 Abstract
Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. We investigate the limitations of LMMs by analyzing their behavior at these critical bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.
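The bifurcation phenomenon is straightforward to operationalize. Below is a minimal sketch, not the authors' released code: `env`, `query_lmm`, and the action names are hypothetical placeholders for a flight simulator, an LMM wrapper, and a discrete action space with explicit vertical moves.

```python
import math

# Illustrative discrete action space with explicit vertical moves,
# reflecting the paper's emphasis on 3D urban airspace.
ACTIONS = ["forward", "turn_left", "turn_right", "ascend", "descend", "stop"]

def distance_to_goal(pos, goal):
    """Euclidean distance between two (x, y, altitude) points."""
    return math.dist(pos, goal)

def run_episode(env, query_lmm, goal, max_steps=50):
    """Roll out one goal-oriented episode, logging distance to goal per step."""
    obs, pos = env.reset()                      # hypothetical simulator API
    errors = [distance_to_goal(pos, goal)]
    for _ in range(max_steps):
        action = query_lmm(obs, goal, ACTIONS)  # LMM picks one discrete action
        if action == "stop":
            break
        obs, pos = env.step(action)
        errors.append(distance_to_goal(pos, goal))
    return errors

def find_bifurcation(errors, patience=3):
    """Return the first step at which distance to the goal rises for
    `patience` consecutive steps, i.e. where the trajectory starts
    diverging from the destination instead of converging."""
    for t in range(1, len(errors) - patience + 1):
        if all(errors[t + k] > errors[t + k - 1] for k in range(patience)):
            return t
    return None  # no persistent divergence detected
```

Before the bifurcation step the logged distance shrinks roughly monotonically; after it, each decision carries the agent further from the destination, which is what makes these points diagnostic of the models' limitations.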
Problem

Research questions and friction points this paper is trying to address.

Large Multimodal Models
Embodied Navigation
Spatial Action
Urban Airspace
Goal-Oriented Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied navigation
large multimodal models
3D urban airspace
spatial action
decision bifurcation
Baining Zhao
Tsinghua University
Ziyou Wang
Northeastern University
Jianjie Fang
Master's student, Tsinghua University
Embodied AI, LLMs
Zile Zhou
Shenzhen International Graduate School, Tsinghua University
Yanggang Xu
Shenzhen International Graduate School, Tsinghua University
Yatai Ji
National University of Defense Technology
Jiacheng Xu
Nanyang Technological University
Reinforcement Learning, Large Language Model
Qian Zhang
Shenzhen International Graduate School, Tsinghua University
Weichen Zhang
PhD, University of Sydney
Computer Vision, Deep Learning, Transfer Learning, Domain Adaptation
Chen Gao
BNRist, Tsinghua University
Data Mining, LLM Agent, Embodied AI
Xinlei Chen
Associate Professor, Tsinghua University
AIoT, Cyber Physical System, Ubiquitous Computing, BCI