🤖 AI Summary
This work investigates whether language models can perform implicit k-hop reasoning (k = 2, 3, 4), solving multi-hop tasks in a single forward pass without chain-of-thought prompting. Method: We train GPT-2-style models from scratch on controlled synthetic datasets to systematically assess feasibility and scaling behavior. Contribution/Results: We find that implicit k-hop reasoning requires training data that grows exponentially with hop count (∝cᵏ) and model depth that grows linearly (∝k). Curriculum learning reduces the sample requirement for 4-hop tasks by roughly 40%, but does not remove the exponential data bottleneck. A theoretical analysis attributes this to the coupling of combinatorial path explosion in multi-hop reasoning with the limited inter-layer information propagation of Transformers. Our results confirm that implicit multi-hop reasoning is learnable in principle, yet subject to an inherent trade-off between data efficiency and model scale, which imposes hard limits on practical deployment.
📝 Abstract
Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT-2-style language models trained from scratch on controlled $k$-hop reasoning datasets ($k = 2, 3, 4$). We show that while such models can indeed learn implicit $k$-hop reasoning, the required amount of training data grows exponentially in $k$, and the required number of transformer layers grows linearly in $k$. We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.
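The abstract does not spell out how the controlled $k$-hop datasets are built. As a rough illustration of the general idea, a $k$-hop sample can be constructed by composing $k$ random mappings over a small entity set, so that answering correctly requires chaining all $k$ lookups internally. This is a minimal sketch under that assumption; `make_khop_dataset` and its parameter names are hypothetical, not the paper's actual data pipeline:

```python
import random

def make_khop_dataset(num_entities, k, num_samples, seed=0):
    """Generate synthetic k-hop composition samples.

    Each hop is a random function f_i over a small entity set.
    A sample pairs a start entity with the k-fold composition
    f_k(...f_2(f_1(start))...), which a model must resolve in
    one forward pass to answer without chain of thought.
    """
    rng = random.Random(seed)
    entities = list(range(num_entities))
    # One random total mapping per hop.
    hops = [{e: rng.choice(entities) for e in entities} for _ in range(k)]
    samples = []
    for _ in range(num_samples):
        start = rng.choice(entities)
        x = start
        for f in hops:  # apply the hops left to right
            x = f[x]
        samples.append((start, x))
    return hops, samples
```

With `num_entities` entities and `k` hops there are on the order of `num_entities**k` distinct reasoning paths, which is one intuition for why the training-data requirement can scale exponentially in `k`.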