Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

📅 2026-04-10

🤖 AI Summary
Existing spatial reasoning benchmarks mostly require one-shot generation of a complete solution, which does not capture how agents perform in interactive decision-making. This work proposes Spatial-Gym, a step-by-step evaluation environment built on Gymnasium that formulates pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking and that can also serve as a framework for reinforcement learning. Across eight models and 500 episodes, even the strongest model, GPT-OSS 120B, reaches only a 16.0% solve rate, far below the human baseline of 98.0%. The step-by-step format helps weaker models by removing formatting errors but hurts stronger models by constraining global planning. Giving vision models images of the environment reduces solve rate by 73%, while extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. The results expose clear limitations of large language models on spatially constrained reasoning tasks.
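To make the "sequential decision task with optional backtracking" concrete, here is a minimal sketch of what such an environment loop could look like. It is not the Spatial-Gym implementation: the class `GridPathEnv`, the action encoding, the wall-only grid, and the rule that a path may not revisit a cell are all assumptions made for illustration. To stay dependency-free it does not subclass `gymnasium.Env`; it only mirrors Gymnasium's `reset()` and `step()` return conventions.

```python
import random

# Hypothetical action encoding (an assumption, not taken from the paper).
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
BACKTRACK = 4


class GridPathEnv:
    """Toy step-by-step pathfinding environment (illustrative only)."""

    def __init__(self, grid, start, goal, allow_backtrack=True, max_steps=50):
        self.grid = grid              # list of strings: '#' is a wall, '.' is free
        self.start, self.goal = start, goal
        self.allow_backtrack = allow_backtrack
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.path = [self.start]      # cells visited so far, in order
        self.steps = 0
        return self.path[-1], {}      # (observation, info), as in Gymnasium

    def step(self, action):
        self.steps += 1
        r, c = self.path[-1]
        if action == BACKTRACK:
            # Undo the last move if backtracking is enabled; otherwise a no-op.
            if self.allow_backtrack and len(self.path) > 1:
                self.path.pop()
        else:
            dr, dc = MOVES[action]
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(self.grid) and 0 <= nc < len(self.grid[0])
            # Assumed rule: invalid moves (walls, revisits, off-grid) leave the state unchanged.
            if inside and self.grid[nr][nc] != "#" and (nr, nc) not in self.path:
                self.path.append((nr, nc))
        terminated = self.path[-1] == self.goal
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.path[-1], reward, terminated, truncated, {}


def random_rollout(env, seed=0):
    """Play uniformly random actions until the episode ends; return True if solved."""
    rng = random.Random(seed)
    env.reset()
    n_actions = 5 if env.allow_backtrack else 4
    while True:
        _, _, terminated, truncated, _ = env.step(rng.randrange(n_actions))
        if terminated or truncated:
            return terminated
```

A language-model agent would replace `random_rollout` with a loop that renders the observation as text, queries the model for one action, and feeds the result back through `step()`. The one-shot setting, by contrast, would ask for the entire path in a single response.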

📝 Abstract
Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting that requires generating the full solution in a single response, unlike humans, who work step-by-step in interactive environments. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). The step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (by up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments yield three key findings: (1) models fail to scale reasoning effort with difficulty, (2) giving vision models images of the spatial environment reduces solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
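The abstract lists an A* baseline without describing it. As a rough illustration of what an A* reference solver on a 2D grid looks like, the sketch below computes the shortest 4-connected path length with a Manhattan-distance heuristic. The function name `astar_path_length` and the wall-only grid encoding are assumptions; the paper's puzzles may carry additional constraints that this plain shortest-path search does not model.

```python
import heapq


def astar_path_length(grid, start, goal):
    """Return the number of moves on a shortest path from start to goal, or None.

    grid is a list of strings where '#' marks a wall (assumed encoding).
    """
    def heuristic(cell):
        # Manhattan distance: admissible and consistent for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]   # entries are (f = g + h, g, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue                             # stale queue entry, already improved
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == "#":
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get((nr, nc), float("inf")):
                best_cost[(nr, nc)] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)))
    return None                                  # goal unreachable
```

A search baseline of this kind gives an upper reference point for solvability, while a random policy gives a lower one, so model and human solve rates can be read against both.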
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
sequential decision-making
benchmarking
navigation
interactive evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial reasoning
sequential decision-making
step-by-step evaluation
backtracking
Spatial-Gym