🤖 AI Summary
This study investigates how to measure and strengthen the capabilities that would let large language models (LLMs) reach superintelligence through test-time search, focusing on the per-step success probability (γ) in out-of-distribution logical reasoning tasks. Building on the Diligent Learner framework, we introduce the first out-of-distribution (OOD) logical reasoning benchmark based on GF(2) circuit reconstruction, featuring a quantifiable γ metric for evaluating deep reasoning capabilities. Experiments reveal that γ for smaller models declines superlinearly with reasoning depth, whereas state-of-the-art models exhibit notable robustness. Moreover, successful large-scale reasoning depends critically on precise tool use. Our work underscores that constructing accurate tools is essential for achieving superintelligence, and it establishes a new paradigm for evaluating test-time search and reasoning performance.
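To make the role of $\gamma$ concrete, here is a short illustrative calculation (not from the paper) under the simplifying assumption that the $d$ reasoning steps succeed independently, each with probability $\gamma$:

$$
\Pr[\text{all } d \text{ steps succeed}] \;=\; \gamma^{d},
\qquad\text{e.g. } \gamma = 0.95,\; d = 50 \;\Rightarrow\; 0.95^{50} \approx 0.077 .
$$

Even a high per-step success rate compounds into a low end-to-end success rate at depth, which is why the benchmark tracks how $\gamma$ degrades as reasoning chains lengthen.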
📝 Abstract
The Diligent Learner framework suggests that LLMs can achieve superintelligence via test-time search, provided a sufficiently high step-success probability $\gamma$. In this work, we design a benchmark to measure $\gamma$ on out-of-distribution logical inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step and that are, from an information-theoretic standpoint, impossible to solve reliably unless the LLM carefully integrates all of the information provided. Our analysis shows that while $\gamma$ for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent on precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
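Since the abstract only sketches the task family, the following is a minimal, hypothetical illustration of what a GF(2) circuit-reconstruction instance could look like: a hidden layered XOR circuit is sampled, and a solver is shown input-output pairs that must all be integrated to pin down the circuit, with difficulty growing with circuit depth. The function names and the exact task format below are assumptions made for illustration, not the paper's implementation.

```python
import random

def sample_gf2_circuit(n_inputs: int, depth: int, seed: int = 0):
    """Sample a random layered circuit over GF(2): each gate XORs two earlier wires.

    Hypothetical construction for illustration; the paper's exact task
    generator is not specified in the abstract.
    """
    rng = random.Random(seed)
    wires = list(range(n_inputs))            # wires 0..n_inputs-1 are the inputs
    gates = []                               # each gate is (left_wire, right_wire)
    for _ in range(depth):
        a, b = rng.sample(wires, 2)
        gates.append((a, b))
        wires.append(n_inputs + len(gates) - 1)   # index of the newly created wire
    return gates

def evaluate(gates, inputs):
    """Evaluate the circuit on a 0/1 input vector; returns the values of all wires."""
    values = list(inputs)
    for a, b in gates:
        values.append(values[a] ^ values[b])      # XOR is addition in GF(2)
    return values

def make_examples(gates, n_inputs, n_examples, seed=1):
    """Produce input/output pairs; only the final wire's value is revealed."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        x = [rng.randint(0, 1) for _ in range(n_inputs)]
        examples.append((x, evaluate(gates, x)[-1]))
    return examples

if __name__ == "__main__":
    gates = sample_gf2_circuit(n_inputs=6, depth=4)
    for x, y in make_examples(gates, n_inputs=6, n_examples=8):
        print(x, "->", y)
```

In a sketch like this, deeper circuits require longer chains of consistent deductions to reconstruct, and no single example suffices on its own, mirroring the abstract's claim that the tasks are unsolvable unless all provided information is integrated.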