🤖 AI Summary
This work systematically investigates the sample efficiency and representational capacity of test-time strategies -- including self-consistency, best-of-n sampling, and self-correction -- under realistic inference constraints. Method: Using probabilistic gap analysis and computational expressivity theory, we formally characterize the sample-complexity separation among these strategies and propose a verifier-driven test-time inference framework. Contribution/Results: We establish the first theoretical separation: self-consistency requires Θ(1/Δ²) samples, whereas best-of-n needs only Θ(1/Δ), a quadratic improvement. Moreover, self-correction with verifier feedback enables a single Transformer to simulate online learning over a pool of experts at test time, even without prior knowledge of the task, thereby extending Transformer representation theory from single-task to multi-task settings. Empirically, our verifier-guided framework yields significant gains in multi-task zero-shot generalization. These results reveal fundamental efficiency limits of test-time scaling and clarify the mechanisms governing generalization in adaptive inference.
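To make the two repeated-sampling strategies concrete, here is a toy simulation (not taken from the paper) contrasting them. It assumes a two-answer world where the correct answer leads the runner-up by a probability gap `delta`, and that best-of-n has access to a perfect verifier oracle; all function names and parameters are illustrative.

```python
import random
from collections import Counter

# Toy answer distribution (an assumption, not the paper's setup):
# two candidate answers, correct "A" leads "B" by probability gap delta.
def sample_answer(delta, rng):
    p_correct = 0.5 + delta / 2
    return "A" if rng.random() < p_correct else "B"

def self_consistency(n, delta, rng):
    """Majority vote over n independent samples."""
    votes = Counter(sample_answer(delta, rng) for _ in range(n))
    return votes.most_common(1)[0][0]

def best_of_n(n, delta, rng):
    """Return any sampled answer a perfect verifier accepts (oracle assumption)."""
    samples = [sample_answer(delta, rng) for _ in range(n)]
    return "A" if "A" in samples else samples[0]

def success_rate(strategy, n, delta, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(strategy(n, delta, rng) == "A" for _ in range(trials)) / trials
```

With a small gap (e.g. `delta=0.1`) and a small sample budget, `best_of_n` succeeds far more often than `self_consistency`, qualitatively matching the Θ(1/Δ) versus Θ(1/Δ²) separation: majority voting must overcome the variance of near-tied votes, while verifier selection only needs the correct answer to appear once.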
📝 Abstract
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta<1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
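The online-learning procedure the abstract refers to can be sketched with the classic Hedge (multiplicative-weights) algorithm over a pool of experts, which is what self-correction with verifier feedback is said to simulate. The expert pool, loss values, and learning rate below are toy assumptions for illustration, not the paper's construction.

```python
import math

# Hedge / multiplicative weights over k experts: at each round, down-weight
# experts that incur loss (here, loss plays the role of verifier feedback).
def hedge(num_experts, losses, eta=0.5):
    """losses[t][i] is the loss of expert i at round t; returns picks and weights."""
    weights = [1.0] * num_experts
    picks = []
    for round_losses in losses:
        # follow the currently most-weighted expert (deterministic variant)
        picks.append(max(range(num_experts), key=lambda i: weights[i]))
        for i in range(num_experts):
            weights[i] *= math.exp(-eta * round_losses[i])
    return picks, weights
```

For example, with two experts where expert 1 always incurs zero loss on the current task and expert 0 always incurs loss 1, the weights concentrate on expert 1 within a few rounds, mirroring how iterative verifier feedback lets a single model lock onto the right task-specific behavior without knowing the task in advance.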