🤖 AI Summary
Existing spatial reasoning benchmarks predominantly rely on one-shot generation of complete solutions, which fails to capture models’ capabilities in interactive decision-making. This work proposes Spatial-Gym, a step-by-step, interactive evaluation environment built on Gymnasium that formulates pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking and provides a framework for improving spatial reasoning through reinforcement learning. Experiments show that even the strongest model, GPT-OSS 120B, achieves only a 16.0% solve rate, far below the human baseline of 98.0%. The step-by-step format improves weaker models’ performance by removing formatting errors but hurts stronger models by constraining their global planning. Giving vision models images of the environment reduces solve rate by 73%, whereas extended chain-of-thought reasoning retains a 3–5× accuracy advantage over standard inference, even in the step-by-step setting. The study exposes critical limitations of large language models in spatially constrained reasoning tasks.
📝 Abstract
Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting that requires generating the full solution in a single response, unlike humans, who work step by step in interactive environments. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, and step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 percentage points below the human baseline (98.0%). The step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to -5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments yield three key findings: (1) models fail to scale reasoning effort with difficulty, (2) giving vision models images of the spatial environment reduces solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
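To make the step-by-step formulation with optional backtracking concrete, the following is a minimal sketch of a toy grid puzzle that mirrors the Gymnasium calling convention (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, info). It is not the Spatial-Gym implementation: the class name `ToyGridPathEnv`, the action encoding, the self-avoiding path rule, the treatment of invalid moves as no-ops, and the sparse reward are all assumptions made for illustration, and the class deliberately avoids importing `gymnasium` so it runs on its own.

```python
from dataclasses import dataclass, field

# Hypothetical action encoding; the real Spatial-Gym action space may differ.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
BACKTRACK = 4  # assumed extra action that undoes the most recent move


@dataclass
class ToyGridPathEnv:
    """Toy grid puzzle with a Gymnasium-style reset/step interface."""

    grid: list   # rows of strings: '.' is a free cell, '#' is a wall
    start: tuple
    goal: tuple
    max_steps: int = 50
    path: list = field(default_factory=list)
    steps: int = 0

    def reset(self):
        self.path = [self.start]
        self.steps = 0
        return self._obs(), {}

    def _obs(self):
        return {"position": self.path[-1], "path": tuple(self.path)}

    def _free(self, cell):
        r, c = cell
        in_bounds = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        # Assumption: the path may not cross walls or revisit its own cells.
        return in_bounds and self.grid[r][c] != "#" and cell not in self.path

    def step(self, action):
        self.steps += 1
        if action == BACKTRACK:
            if len(self.path) > 1:  # never pop the start cell
                self.path.pop()
        else:
            dr, dc = MOVES[action]
            r, c = self.path[-1]
            nxt = (r + dr, c + dc)
            if self._free(nxt):     # invalid moves are ignored but still cost a step
                self.path.append(nxt)
        terminated = self.path[-1] == self.goal
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0  # assumed sparse success reward
        return self._obs(), reward, terminated, truncated, {}


env = ToyGridPathEnv(grid=["..#", "...", "#.."], start=(0, 0), goal=(2, 2))
obs, info = env.reset()
# right, right (blocked by a wall), down, down, backtrack, right, down
for action in [3, 3, 1, 1, BACKTRACK, 3, 1]:
    obs, reward, terminated, truncated, info = env.step(action)
print(obs["path"], terminated)
```

In the one-shot setting the whole action list would be produced in a single response; in the step-by-step settings the agent sees each observation before choosing the next action, and the backtracking variant additionally exposes an undo action like the assumed `BACKTRACK` above.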