Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

📅 2026-05-27
🤖 AI Summary
This study investigates whether large language models can construct internal spatial world models from purely textual input, and whether that ability transfers across languages. To this end, the authors introduce MentalMap, a multilingual, multi-level diagnostic benchmark for spatial reasoning with six hierarchical capability levels (L0-L5) and four diagnostic axes, built from structured multilingual textual tasks generated from ProcTHOR scenes. Across diverse prompting strategies, and in a human evaluation run under the same pure-text protocol, every model shows a pronounced performance drop on viewpoint reasoning tasks (the Level 3 bottleneck), and the drop persists across languages, model scales, and prompting methods. Because human participants reproduce the same failure pattern, the results suggest that the bottleneck stems from working memory limits under purely textual conditions rather than from anything specific to current LLM architectures.
📝 Abstract
Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.
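The L3 cliff criterion stated in the abstract can be read as a simple retention check: a model shows the cliff if its L0 (atomic fact) accuracy exceeds 40% and its L3 (viewpoint reasoning) accuracy is less than half of that L0 accuracy. The sketch below is one possible reading of that criterion, not the paper's evaluation code. The function names, the strict inequalities, and the accuracy values in the usage lines are all assumptions made for illustration; none of the numbers are results from the paper.

```python
from typing import Optional

# Thresholds taken from the abstract's wording; how the paper treats the
# exact boundary cases (== 40%, == 50%) is an assumption here.
BASELINE_FLOOR = 0.40   # L0 accuracy must exceed this for the check to apply
RETENTION_CAP = 0.50    # "retains even half of its L0 performance"


def l3_retention(l0_acc: float, l3_acc: float) -> float:
    """Fraction of L0 accuracy that survives at L3 (hypothetical helper)."""
    if l0_acc <= 0:
        raise ValueError("L0 accuracy must be positive to compute retention")
    return l3_acc / l0_acc


def shows_l3_cliff(l0_acc: float, l3_acc: float) -> Optional[bool]:
    """Return True/False for the cliff, or None if the model's L0 accuracy
    is too low for the criterion to apply."""
    if l0_acc <= BASELINE_FLOOR:
        return None
    return l3_retention(l0_acc, l3_acc) < RETENTION_CAP


# Illustrative values only, not taken from the paper.
print(shows_l3_cliff(0.80, 0.30))  # retention 0.375, below one half
print(shows_l3_cliff(0.90, 0.60))  # retention about 0.667, no cliff
print(shows_l3_cliff(0.30, 0.20))  # L0 at or below 40%, criterion not applicable
```

Expressing the cliff as a ratio rather than an absolute gap keeps the comparison fair across models with different atomic-fact baselines, which appears to be why the abstract conditions the claim on L0 accuracy above 40%.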
Problem

Research questions and friction points this paper is trying to address.

world models
spatial reasoning
large language models
multilingual
text-only
Innovation

Methods, ideas, or system contributions that make the work stand out.

world modeling
spatial reasoning
multilingual benchmark
MentalMap
reasoning cliff