🤖 AI Summary
This study investigates the limits of systematic generalization in large language models, focusing on performance under unseen spatial structures and longer reasoning paths. The authors introduce a controllable synthetic benchmark based on shortest-path planning that separates confounding factors such as training data coverage, learning paradigms, and inference strategies. Experiments show that models generalize well across spatial configurations but consistently fail to scale to longer reasoning sequences. The upper bound of generalization is set by the range of sequence lengths observed during training; reinforcement learning improves training stability but does not raise this bound, and test-time scaling improves performance but does not close the length generalization gap. The findings attribute this failure to instability in recursive reasoning.
📝 Abstract
Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
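To make the two generalization axes concrete, the following is a minimal sketch of how such a benchmark could be split. It is not the authors' implementation: the grid map format, the breadth-first solver, the function names (`make_map`, `shortest_path`, `build_splits`), and every threshold (`max_train_len`, the 20% held-out maps) are assumptions chosen for illustration.

```python
import random
from collections import deque


def make_map(size, wall_prob, rng):
    """Random square grid; True marks a wall cell (hypothetical map format)."""
    return [[rng.random() < wall_prob for _ in range(size)] for _ in range(size)]


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid.

    Returns the list of cells from start to goal, or None if unreachable.
    """
    size = len(grid)
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < size and 0 <= nc < size
            if inside and not grid[nr][nc] and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def build_splits(num_maps=20, size=8, wall_prob=0.2, max_train_len=6, seed=0):
    """Split examples along two axes (all thresholds are illustrative).

    train:        seen maps,   path length <= max_train_len
    spatial_test: unseen maps, path length <= max_train_len
    length_test:  seen maps,   path length >  max_train_len
    """
    rng = random.Random(seed)
    maps = [make_map(size, wall_prob, rng) for _ in range(num_maps)]
    # Assumption: the last 20% of maps are never shown during training.
    held_out = set(range(num_maps - num_maps // 5, num_maps))
    splits = {"train": [], "spatial_test": [], "length_test": []}
    for map_id, grid in enumerate(maps):
        free = [(r, c) for r in range(size) for c in range(size) if not grid[r][c]]
        if len(free) < 2:
            continue
        for _ in range(30):
            start, goal = rng.sample(free, 2)
            path = shortest_path(grid, start, goal)
            if path is None:
                continue
            steps = len(path) - 1
            example = (map_id, start, goal, path)
            if map_id in held_out:
                if steps <= max_train_len:
                    splits["spatial_test"].append(example)
            elif steps <= max_train_len:
                splits["train"].append(example)
            else:
                splits["length_test"].append(example)
    return splits
```

Keeping `spatial_test` within the training length range and `length_test` on already-seen maps is one way to keep the two axes from confounding each other: a failure on `length_test` cannot then be blamed on an unfamiliar map.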