🤖 AI Summary
This work investigates why large language models (LLMs) fail to generalize on implicit reasoning, particularly in multi-step mathematical tasks. We train GPT-2 from scratch on a custom multi-step reasoning dataset and systematically analyze its learning dynamics. We find that implicit reasoning degenerates into shortcut learning: models trained on fixed input-output patterns achieve high in-domain and out-of-domain accuracy (>90%) yet collapse under minor pattern perturbations, whereas models trained on non-fixed patterns overfit severely to a specific pattern. Cross-model validation indicates that this shortcut dependence also appears in state-of-the-art LLMs. The key contributions are threefold: (1) the first explicit identification of shortcut learning as the core mechanism underlying implicit reasoning failure; (2) a causal link between pattern stability and generalization capability; and (3) a novel evaluation paradigm for implicit reasoning grounded in controllable data pattern design.
📝 Abstract
Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. This raises a question: why does advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability emerges only when they are trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities that emerge from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, which enables strong performance on tasks with similar patterns but does not generalize beyond them.
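To make the fixed-pattern versus unfixed-pattern distinction concrete, the following is a minimal Python sketch of how such multi-step problems might be generated. The abstract does not specify the dataset format, so everything here is an assumption made for illustration: the chain-of-assignments prompt format, the function names `make_problem` and `solve`, and the two specific perturbations (shuffled premise order and swapped operand order). It should not be read as the authors' actual data pipeline.

```python
import random

def make_problem(num_steps, rng, fixed_pattern=True):
    """Build one multi-step problem as (prompt, answer).

    Hypothetical format (an assumption, not the paper's): a chain of
    single-letter variables where each step adds or subtracts a one-digit
    constant from the previous variable, followed by a query.
    """
    names = [chr(ord("a") + i) for i in range(num_steps + 1)]
    value = rng.randint(0, 9)
    premises = [f"{names[0]}={value}"]
    for prev, cur in zip(names, names[1:]):
        const = rng.randint(1, 9)
        op = rng.choice("+-")
        value = value + const if op == "+" else value - const
        # Fixed pattern: the variable is always the first operand.
        # Unfixed pattern: for additions, operand order may be swapped.
        if fixed_pattern or op == "-" or rng.random() < 0.5:
            premises.append(f"{cur}={prev}{op}{const}")
        else:
            premises.append(f"{cur}={const}{op}{prev}")
    if not fixed_pattern:
        # Unfixed pattern: premises no longer appear in dependency order.
        rng.shuffle(premises)
    prompt = ",".join(premises) + f",{names[-1]}=?"
    return prompt, value

def solve(prompt):
    """Reference solver that resolves premises regardless of their order."""
    *premises, query = prompt.split(",")
    target = query.split("=")[0]
    env = {}
    while target not in env:
        for premise in premises:
            name, expr = premise.split("=")
            if expr.isdigit():
                env[name] = int(expr)
                continue
            left, op, right = expr[0], expr[1], expr[2]
            var, const = (left, right) if left.isalpha() else (right, left)
            if var in env:
                step = int(const)
                env[name] = env[var] + step if op == "+" else env[var] - step
    return env[target]

if __name__ == "__main__":
    rng = random.Random(0)
    for fixed in (True, False):
        prompt, answer = make_problem(4, rng, fixed_pattern=fixed)
        print("fixed" if fixed else "unfixed", prompt, "->", answer)
```

In this sketch a fixed-pattern prompt always lists premises in dependency order with the variable as the first operand, so a model could in principle answer by following a surface template. The unfixed variant keeps the underlying computation identical while changing only the surface form, which is the kind of perturbation under which shortcut-based solutions would be expected to break.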