Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models (LLMs) construct internal spatial world models that support reasoning and planning. Using grid-based maze tasks, the authors evaluate Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat across two input representations (tokenized adjacency lists versus visual grids) and several prompting strategies, including chain-of-thought, to assess multi-step planning and spatial abstraction. The findings show that LLM spatial reasoning is highly sensitive to input format and prompting method, with no consistent evidence of a robust or cumulative internal spatial world model. For instance, with chain-of-thought prompting, Gemini achieves 80–86% accuracy on small mazes (5x5 to 7x7) using adjacency-based inputs but drops to 16–34% on visual grid formats. Separately, in sequential proximity and compositional distance probes, reasoning traces reach 96–99% semantic coverage, yet the models do not turn that coverage into consistent spatial computations.

📝 Abstract

Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, which is a 2-5x difference, suggesting representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings based on the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but rather exhibit representation-specific and prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.
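To make the two input representations contrasted in the abstract concrete, here is a minimal sketch that serializes one maze both ways. The 3x3 maze, the `(row,col) -> neighbors` token format, the `S`/`G`/`#`/`.` grid symbols, and all function names are illustrative assumptions; the abstract does not specify the paper's exact prompt formats. A breadth-first search supplies the ground-truth path length that a model's answer could be scored against.

```python
from collections import deque

# Hypothetical 3x3 maze (not from the paper): '#' wall, '.' open, 'S' start, 'G' goal.
GRID = [
    "S.#",
    "#..",
    "#.G",
]

def to_adjacency(grid):
    """Map each open cell (row, col) to its open 4-connected neighbors."""
    rows, cols = len(grid), len(grid[0])
    adj = {}
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "#":
                continue
            nbrs = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                    nbrs.append((nr, nc))
            adj[(r, c)] = nbrs
    return adj

def adjacency_prompt(adj):
    """Assumed adjacency-style serialization: one '(r,c) -> neighbors' line per cell."""
    lines = []
    for cell in sorted(adj):
        targets = " ".join(f"({r},{c})" for r, c in adj[cell])
        lines.append(f"({cell[0]},{cell[1]}) -> {targets}")
    return "\n".join(lines)

def grid_prompt(grid):
    """Assumed visual-grid serialization: the maze drawn as rows of characters."""
    return "\n".join(grid)

def shortest_path_length(adj, start, goal):
    """Breadth-first search; returns the number of moves, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        cell, dist = queue.popleft()
        if cell == goal:
            return dist
        for nxt in adj[cell]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

ADJ = to_adjacency(GRID)
ADJ_TEXT = adjacency_prompt(ADJ)    # adjacency-style input
GRID_TEXT = grid_prompt(GRID)       # visual-grid input
PATH_LEN = shortest_path_length(ADJ, (0, 0), (2, 2))
```

Both strings describe the same maze, so a format-invariant spatial model should solve either one equally well. The accuracy gap the abstract reports between the two formats is what motivates its claim of representation-dependent reasoning.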

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

world models

large language models

maze tasks

spatial abstraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial world models

maze tasks

representation dependence