MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing studies lack systematic evaluation of large language models’ (LLMs) spatial navigation capabilities in the absence of visual cues, hindering reliable real-world deployment. Method: We introduce MazeEval—the first benchmark dedicated to pure spatial reasoning assessment—employing coordinate-based feedback and distance-to-wall signals to evaluate abstract spatial navigation across multi-scale grids (5×5 to 15×15), with the first controlled bilingual (English/Icelandic) experiments. Contribution/Results: Language modality significantly impacts spatial reasoning: models solve mazes 3–4 sizes smaller in Icelandic than in English. OpenAI o3 achieves error-free navigation in mazes up to 30×30, whereas other models frequently fall into loops beyond 9×9. These findings reveal the strong influence of linguistic structure on spatial cognition and establish a novel, empirically grounded paradigm for evaluating LLMs’ spatial reasoning capabilities.

📝 Abstract
As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ($5 \times 5$ to $15 \times 15$ grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI's o3 achieves perfect navigation for mazes up to size $30 \times 30$, other models exhibit catastrophic failure beyond $9 \times 9$ mazes, with 100% of failures attributed to excessive looping behavior where models revisit a cell at least 10 times. We document a significant performance degradation in Icelandic, with models solving mazes 3-4 sizes smaller than in English, suggesting spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results have important implications for global deployment of LLM-powered autonomous systems, showing spatial intelligence remains fundamentally constrained by training data availability and highlighting the need for architectural innovations to achieve reliable navigation across linguistic contexts.
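The evaluation protocol described above — coordinate feedback, distance-to-wall signals, and the revisit-a-cell-10-times looping criterion — can be sketched as a small harness. This is a minimal illustration, not the authors' actual code: the grid encoding, function names, and step limit are assumptions, and `policy` stands in for the LLM's function call.

```python
from collections import Counter

# Illustrative sketch of a MazeEval-style feedback loop (all names are
# hypothetical). The policy only ever sees its (x, y) coordinates and
# per-direction distance-to-wall values, never a visual rendering.

WALL, OPEN = 1, 0
DIRS = {"N": (0, -1), "S": (0, 1), "W": (-1, 0), "E": (1, 0)}

def wall_distances(maze, x, y):
    """Distance (in open cells) to the nearest wall in each direction.
    Assumes the maze has a solid border of walls."""
    dists = {}
    for name, (dx, dy) in DIRS.items():
        d = 0
        while maze[y + (d + 1) * dy][x + (d + 1) * dx] == OPEN:
            d += 1
        dists[name] = d
    return dists

def run_episode(maze, start, goal, policy, max_steps=500, loop_limit=10):
    """Drive a text-only policy; flag failure once any cell has been
    revisited loop_limit times (the paper's looping criterion)."""
    x, y = start
    visits = Counter()
    for _ in range(max_steps):
        if (x, y) == goal:
            return "solved"
        visits[(x, y)] += 1
        if visits[(x, y)] >= loop_limit:
            return "looping"
        move = policy((x, y), wall_distances(maze, x, y))
        dx, dy = DIRS[move]
        if maze[y + dy][x + dx] == OPEN:  # blocked moves leave position unchanged
            x, y = x + dx, y + dy
    return "timeout"
```

A policy that keeps choosing blocked or repeating moves accumulates visits on the same cell and is flagged as `"looping"`, matching the failure mode the paper reports for most models beyond 9×9.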
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' spatial reasoning without visual cues
Assessing cross-linguistic transfer in spatial navigation tasks
Identifying performance gaps in maze-solving across LLM architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Function-calling interface for maze navigation
Coordinate-based tasks without visual input
Cross-linguistic spatial reasoning evaluation
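The function-calling interface listed above might look something like the following tool definition in the widely used OpenAI-style JSON schema. The paper does not publish its exact schema, so the `move` name, description, and parameter set here are assumptions for illustration.

```python
import json

# Hypothetical tool definition a MazeEval-style harness could expose to
# the model (the paper's actual schema is not shown in this summary).
MOVE_TOOL = {
    "type": "function",
    "function": {
        "name": "move",
        "description": (
            "Move one cell in the chosen direction. Returns the new "
            "(x, y) coordinates and the distance to the nearest wall "
            "in each of the four directions."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "direction": {
                    "type": "string",
                    "enum": ["north", "south", "east", "west"],
                }
            },
            "required": ["direction"],
        },
    },
}

print(json.dumps(MOVE_TOOL, indent=2))
```

Constraining the model's only action to a single enumerated-direction tool is what isolates spatial reasoning: the model must build and maintain its own map of the maze from the textual feedback alone.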