🤖 AI Summary
This work systematically investigates the sample efficiency and representational capacity of test-time strategies -- including self-consistency, best-of-n sampling, and self-correction -- under realistic inference constraints. Method: Using probabilistic gap analysis and computational expressivity theory, we formally characterize the sample-complexity separation among these strategies and propose a verifier-driven test-time inference framework. Contribution/Results: We establish the first theoretical separation: self-consistency requires Θ(1/Δ²) samples, whereas best-of-n needs only Θ(1/Δ), a quadratic improvement. Moreover, self-correction with verifier feedback enables a single Transformer to simulate online learning over a pool of experts at test time, even without prior knowledge of the task, thereby extending Transformer representation theory from single-task to multi-task settings. Empirically, our verifier-guided framework yields significant gains in multi-task zero-shot generalization. These results reveal fundamental efficiency limits of test-time scaling and clarify the mechanisms governing generalization in adaptive inference.
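To make the two repeated-sampling strategies concrete, here is a toy simulation (not taken from the paper) contrasting them. It assumes a two-answer world where the correct answer leads the runner-up by a probability gap `delta`, and that best-of-n has access to a perfect verifier oracle; all function names and parameters are illustrative.

```python
import random
from collections import Counter

# Toy answer distribution (an assumption, not the paper's setup):
# two candidate answers, correct "A" leads "B" by probability gap delta.
def sample_answer(delta, rng):
    p_correct = 0.5 + delta / 2
    return "A" if rng.random() < p_correct else "B"

def self_consistency(n, delta, rng):
    """Majority vote over n independent samples."""
    votes = Counter(sample_answer(delta, rng) for _ in range(n))
    return votes.most_common(1)[0][0]

def best_of_n(n, delta, rng):
    """Return any sampled answer a perfect verifier accepts (oracle assumption)."""
    samples = [sample_answer(delta, rng) for _ in range(n)]
    return "A" if "A" in samples else samples[0]

def success_rate(strategy, n, delta, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(strategy(n, delta, rng) == "A" for _ in range(trials)) / trials
```

With a small gap (e.g. `delta=0.1`) and a small sample budget, `best_of_n` succeeds far more often than `self_consistency`, qualitatively matching the Θ(1/Δ) versus Θ(1/Δ²) separation: majority voting must overcome the variance of near-tied votes, while verifier selection only needs the correct answer to appear once.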
📝 Abstract
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta<1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
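The online-learning procedure the abstract refers to can be sketched with the classic Hedge (multiplicative-weights) algorithm over a pool of experts, which is what self-correction with verifier feedback is said to simulate. The expert pool, loss values, and learning rate below are toy assumptions for illustration, not the paper's construction.

```python
import math

# Hedge / multiplicative weights over k experts: at each round, down-weight
# experts that incur loss (here, loss plays the role of verifier feedback).
def hedge(num_experts, losses, eta=0.5):
    """losses[t][i] is the loss of expert i at round t; returns picks and weights."""
    weights = [1.0] * num_experts
    picks = []
    for round_losses in losses:
        # follow the currently most-weighted expert (deterministic variant)
        picks.append(max(range(num_experts), key=lambda i: weights[i]))
        for i in range(num_experts):
            weights[i] *= math.exp(-eta * round_losses[i])
    return picks, weights
```

For example, with two experts where expert 1 always incurs zero loss on the current task and expert 0 always incurs loss 1, the weights concentrate on expert 1 within a few rounds, mirroring how iterative verifier feedback lets a single model lock onto the right task-specific behavior without knowing the task in advance.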