🤖 AI Summary
Existing studies lack systematic evaluation of large language models’ (LLMs) spatial navigation capabilities in the absence of visual cues, which hinders reliable real-world deployment. Method: We introduce MazeEval, a benchmark dedicated to assessing pure spatial reasoning. Models navigate grids of varying size (5×5 to 15×15) through a function-calling interface that returns only coordinate feedback and distance-to-wall information, and eight LLMs are evaluated on identical mazes in both English and Icelandic. Contribution/Results: The language used significantly affects spatial reasoning: in Icelandic, models solve mazes 3–4 sizes smaller than in English. OpenAI o3 navigates perfectly on mazes up to 30×30, whereas the other models fail beyond 9×9, with every failure attributed to looping (revisiting a cell at least 10 times). These findings suggest that spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms, and they provide an empirically grounded way to evaluate LLMs’ spatial reasoning capabilities.
📝 Abstract
As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ($5 \times 5$ to $15 \times 15$ grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI's O3 achieves perfect navigation for mazes up to size $30 \times 30$, other models exhibit catastrophic failure beyond $9 \times 9$ mazes, with 100% of failures attributed to excessive looping behavior where models revisit a cell at least 10 times. We document a significant performance degradation in Icelandic, with models solving mazes 3-4 sizes smaller than in English, suggesting spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results have important implications for global deployment of LLM-powered autonomous systems, showing spatial intelligence remains fundamentally constrained by training data availability and highlighting the need for architectural innovations to achieve reliable navigation across linguistic contexts.
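To make the interaction described in the abstract concrete, the following is a minimal sketch of what a coordinate-only navigation tool with loop detection could look like. It is not the benchmark's actual interface: the class name `MazeEnv`, the `move` signature, the wall encoding, the shape of the returned feedback, and the choice to count only successful entries into a cell as visits are all assumptions made for illustration. Only the ingredients (coordinate feedback, distance-to-wall information, and a failure rule triggered when a cell is visited at least 10 times) come from the abstract.

```python
from collections import Counter

# Hypothetical direction names; y grows southward in this sketch.
DIRS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}
LOOP_LIMIT = 10  # assumed reading of "revisit a cell at least 10 times"


class MazeEnv:
    """Illustrative stand-in for a coordinate-only maze tool (not the paper's code)."""

    def __init__(self, size, walls, start, goal):
        self.size = size
        # Assumed encoding: each wall is a frozenset of the two cells it separates.
        self.walls = walls
        self.pos = start
        self.goal = goal
        self.visits = Counter({start: 1})

    def _step(self, cell, direction):
        dx, dy = DIRS[direction]
        return (cell[0] + dx, cell[1] + dy)

    def _blocked(self, cell, direction):
        nxt = self._step(cell, direction)
        outside = not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size)
        return outside or frozenset((cell, nxt)) in self.walls

    def distance_to_wall(self, direction):
        """Count free steps from the current cell until a wall or the border."""
        cell, steps = self.pos, 0
        while not self._blocked(cell, direction):
            cell = self._step(cell, direction)
            steps += 1
        return steps

    def move(self, direction):
        """The single function a model would call; the feedback contains no visual input."""
        moved = not self._blocked(self.pos, direction)
        if moved:
            self.pos = self._step(self.pos, direction)
            self.visits[self.pos] += 1  # only successful entries count as visits
        return {
            "moved": moved,
            "position": self.pos,
            "distances": {d: self.distance_to_wall(d) for d in DIRS},
            "solved": self.pos == self.goal,
            "looping": self.visits[self.pos] >= LOOP_LIMIT,
        }


if __name__ == "__main__":
    env = MazeEnv(size=3, walls=set(), start=(0, 0), goal=(2, 2))
    print(env.move("east"))
```

In an evaluation loop under these assumptions, the harness would keep forwarding the model's `move` calls until the feedback reports either `solved` or `looping`, and would record a looping episode as a failure.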