🤖 AI Summary
This study presents the first systematic evaluation of the embodied, goal-directed navigation capabilities of large multimodal models (LMMs) in urban 3D airspace. By constructing a high-quality dataset of 5,037 samples, the authors benchmark 17 state-of-the-art models on their ability to execute vertical spatial actions and interpret semantic cues, revealing a pronounced nonlinear error divergence at critical decision points. The work identifies four key directions for improvement—geometric awareness, cross-view understanding, spatial imagination, and long-term memory—and validates the efficacy of corresponding enhancement strategies. Findings indicate that while current LMMs exhibit rudimentary navigation competence, they remain substantially inferior to humans. This research establishes a foundational benchmark and outlines actionable pathways toward advancing embodied intelligence in complex 3D environments.
📝 Abstract
Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve human-like embodied spatial action through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset of 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. We then comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. We investigate the limitations of LMMs by analyzing their behavior at these critical bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.