🤖 AI Summary
Existing test-time scaling methods increase computation uniformly across all samples and generation steps, ignoring the local difficulty of individual instances and wasting compute. To address this, we propose a locally adaptive test-time scaling framework that estimates the local difficulty of each generation step with a verifier model and, based on that estimate, triggers fine-grained control actions (resampling, backtracking, restarting, or early termination) to allocate computation adaptively. Our core contribution is a local difficulty awareness mechanism that departs from the conventional paradigm of global, uniform computation expansion. Extensive experiments across multiple tasks show that our method matches or improves accuracy while significantly reducing average computational cost, achieving a superior accuracy–efficiency trade-off.
📝 Abstract
One common strategy for improving the performance of Large Language Models (LLMs) on downstream tasks is to use a *verifier model* either to select the best answer from a pool of candidates or to steer the auto-regressive generation process towards better outputs. This class of methods typically improves accuracy at the cost of increased computation at test time, a paradigm known as *test-time scaling*. However, most existing approaches increase computation uniformly across all samples and generation steps, without considering the complexity of individual instances, leading to inefficient resource use. We address this limitation with an approach, called *Locally Adaptive Test-Time Scaling (LATTS)*, that allocates variable compute across generation steps. Specifically, at each generation step, LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. This criterion adjusts the per-step computational effort based on a precise notion of *local difficulty* derived from the verifier model. Empirical results show that LATTS achieves significantly better accuracy–compute tradeoffs than standard verifier-based methods.
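The resample/backtrack/restart/stop loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the paper's implementation: `generate_step`, `verifier_score`, the acceptance threshold, and all budget parameters are hypothetical stand-ins for the LLM, the verifier, and the tuning knobs the method would expose.

```python
import random


def generate_step(prefix, rng):
    # Hypothetical stand-in for sampling one auto-regressive step from the LLM.
    return prefix + [rng.random()]


def verifier_score(prefix):
    # Hypothetical verifier: mean step score; higher means more promising.
    return sum(prefix) / len(prefix)


def latts_generate(max_len=8, accept_threshold=0.4, max_resamples=4,
                   max_restarts=2, step_budget=200, seed=0):
    """Locally adaptive loop: resample hard steps, backtrack when
    resampling fails, restart when backtracking bottoms out."""
    rng = random.Random(seed)
    steps = 0
    prefix = []
    for _ in range(max_restarts + 1):          # restart: wipe the prefix
        prefix = []
        while len(prefix) < max_len and steps < step_budget:
            accepted = False
            for _ in range(max_resamples):     # resample: retry this step
                steps += 1
                candidate = generate_step(prefix, rng)
                if verifier_score(candidate) >= accept_threshold:
                    prefix, accepted = candidate, True
                    break
            if not accepted:
                if prefix:
                    prefix = prefix[:-1]       # backtrack: drop last step
                else:
                    break                      # abandon attempt; restart
        if len(prefix) == max_len:
            return prefix                      # stop: generation accepted
    return prefix                              # budgets exhausted


if __name__ == "__main__":
    print(latts_generate())
```

The point of the sketch is the compute allocation: steps the verifier keeps rejecting consume extra samples (and may trigger backtracks or restarts), so computation concentrates where local difficulty is high, while easy steps are accepted on the first draw.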