🤖 AI Summary
This work investigates how test-time computation in large language models (LLMs) should be scaled, identifying an inherent suboptimality in verifier-free chain-of-thought distillation. The authors formally prove that the performance loss from omitting verification grows as test-time compute increases, and introduce an “anti-concentration” condition characterizing when heterogeneity in the base model’s distribution over correct solutions causes verifier-free scaling to degrade. Methodologically, they systematically compare verifier-guided reinforcement learning and search against verifier-free distillation, combining theoretical analysis with mathematical-reasoning evaluations across a multi-scale model family (3B–32B). Results show that verifier-based methods enjoy a consistently widening performance advantage in long-output and large-data regimes, whereas verifier-free approaches suffer pronounced scaling inefficiency. These findings underscore the critical role of verifiers in enabling efficient expansion of test-time compute.
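As a toy illustration of why this gap widens (this is not the paper’s experiment; the success probabilities, function names, and the crude model of verifier-free selection below are all made-up assumptions), consider best-of-n sampling with a perfect 0/1 verifier versus picking a sample blindly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-problem correctness probabilities: a heterogeneous base
# model succeeds often on some problems and rarely on others.
p_correct = rng.beta(0.5, 2.0, size=1000)

def verifier_based(n):
    # Best-of-n with a perfect verifier: succeed if ANY of n samples is correct.
    return np.mean(1 - (1 - p_correct) ** n)

def verifier_free(n):
    # Crude stand-in for the verifier-free regime: without verification we
    # cannot tell the n candidates apart, so picking one uniformly succeeds
    # at the single-sample rate regardless of n.
    return np.mean(p_correct)

for n in [1, 4, 16, 64, 256]:
    print(f"n={n:4d}  VB={verifier_based(n):.3f}  VF={verifier_free(n):.3f}")
```

Under these assumptions the verifier-based success rate climbs toward 1 as n grows while the verifier-free rate stays flat, mirroring (in caricature) the widening gap the paper proves.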
📝 Abstract
Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
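For readers unfamiliar with the term, one plausible reading of the anti-concentration condition (our paraphrase; the paper’s exact statement, constants, and notation may differ) is a lower bound on the probability that a trace sampled from the base policy beats the average reward:

```latex
% Hedged sketch, not the paper's exact statement.
% r(\tau): reward of a trace \tau sampled from the base policy \pi.
\exists\, \delta, c_0 > 0:\quad
\Pr_{\tau \sim \pi}\!\left[\, r(\tau) \;\ge\; \mathbb{E}_{\tau' \sim \pi}\!\left[r(\tau')\right] + \delta \,\right] \;\ge\; c_0 .
```

In words, the reward distribution of sampled traces is not sharply peaked, so a verifier that filters samples can reliably surface traces that beat the average; verifier-free cloning has no mechanism to exploit this signal.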