🤖 AI Summary
This work proposes a verifier-guided adaptive inference framework that overcomes the inefficiencies of static computation allocation in conventional test-time reasoning. By modeling inference as an iterative process of trajectory generation and selection, the method dynamically plans, selects tools, and adjusts computational strategies at each step, all under the unified guidance of a Process Reward Model (PRM). This approach achieves, for the first time, fine-grained, cross-iteration adaptive computation allocation based on PRM signals, transcending the limitations of fixed sampling and post-hoc reranking. Evaluated on challenging benchmarks—including MATH-500, AIME24, and AMO-Bench—the framework significantly outperforms existing test-time scaling methods, delivering higher accuracy while reducing wasteful generations and tool invocation overhead.
📝 Abstract
Conventional test-time compute scaling allocates inference computation uniformly, relies on fixed sampling strategies, and applies verification only for post-hoc reranking. In contrast, we propose a verifier-guided adaptive framework that treats reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric that penalizes wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
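The control loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `prm_score`, `generate_trajectory`, and `adaptive_inference` are hypothetical stand-ins, the PRM is stubbed with a deterministic pseudo-random scorer, step-level aggregation is assumed to be a simple mean, and the "adaptive compute" policy is reduced to widening the branching factor each iteration.

```python
import random

def prm_score(step: str) -> float:
    """Stub process reward model: deterministic per step, in [0, 1].
    A real PRM would be a learned model scoring partial trajectories."""
    rng = random.Random(step)
    return rng.random()

def generate_trajectory(problem: str, branch: int, max_steps: int,
                        prune_below: float) -> tuple[list[str], float]:
    """One inference iteration: at each step, expand `branch` candidate
    continuations, keep the highest-PRM one, and prune (stop early)
    when even the best candidate scores below `prune_below`."""
    steps, scores = [], []
    for t in range(max_steps):
        candidates = [f"{problem}|step{t}.{b}" for b in range(branch)]
        best = max(candidates, key=prm_score)
        score = prm_score(best)
        if score < prune_below:   # step-level pruning under PRM guidance
            break
        steps.append(best)
        scores.append(score)
    # Aggregate step-level scores into a trajectory-level reward (mean here).
    reward = sum(scores) / len(scores) if scores else 0.0
    return steps, reward

def adaptive_inference(problem: str, iterations: int = 4) -> tuple[list[str], float]:
    """Cross-iteration loop: each iteration adapts its compute strategy
    (here: a wider branching factor), and the final response is the
    trajectory with the best aggregated PRM reward."""
    best_traj, best_reward = [], float("-inf")
    for i in range(iterations):
        branch = 2 + i            # crude adaptive-compute schedule
        traj, reward = generate_trajectory(problem, branch,
                                           max_steps=5, prune_below=0.2)
        if reward > best_reward:  # select across iterations by reward
            best_traj, best_reward = traj, reward
    return best_traj, best_reward

trajectory, reward = adaptive_inference("demo-problem")
```

The key structural point the sketch preserves is that the same PRM signal is used twice: within an iteration to gate expansion, and across iterations to pick the returned trajectory.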