🤖 AI Summary
This work proposes a verifier-guided adaptive inference framework that overcomes the inefficiencies of static computation allocation in conventional test-time reasoning. By modeling inference as an iterative process of trajectory generation and selection, the method dynamically plans, selects tools, and adjusts computational strategies at each step, all under the unified guidance of a Process Reward Model (PRM). This approach achieves, for the first time, fine-grained, cross-iteration adaptive computation allocation based on PRM signals, transcending the limitations of fixed sampling and post-hoc reranking. Evaluated on challenging benchmarks—including MATH-500, AIME24, and AMO-Bench—the framework significantly outperforms existing test-time scaling methods, delivering higher accuracy while reducing wasteful generations and tool invocation overhead.
📝 Abstract
Conventional test-time compute scaling allocates inference computation uniformly, relies on fixed sampling strategies, and applies verification only for post-hoc reranking. In contrast, we propose a verifier-guided adaptive framework that treats reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric that penalizes wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
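The control loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `prm_score`, `generate_trajectory`, and `adaptive_inference` are hypothetical stand-ins, the PRM is stubbed with a deterministic pseudo-random scorer, step-level aggregation is assumed to be a simple mean, and the "adaptive compute" policy is reduced to widening the branching factor each iteration.

```python
import random

def prm_score(step: str) -> float:
    """Stub process reward model: deterministic per step, in [0, 1].
    A real PRM would be a learned model scoring partial trajectories."""
    rng = random.Random(step)
    return rng.random()

def generate_trajectory(problem: str, branch: int, max_steps: int,
                        prune_below: float) -> tuple[list[str], float]:
    """One inference iteration: at each step, expand `branch` candidate
    continuations, keep the highest-PRM one, and prune (stop early)
    when even the best candidate scores below `prune_below`."""
    steps, scores = [], []
    for t in range(max_steps):
        candidates = [f"{problem}|step{t}.{b}" for b in range(branch)]
        best = max(candidates, key=prm_score)
        score = prm_score(best)
        if score < prune_below:   # step-level pruning under PRM guidance
            break
        steps.append(best)
        scores.append(score)
    # Aggregate step-level scores into a trajectory-level reward (mean here).
    reward = sum(scores) / len(scores) if scores else 0.0
    return steps, reward

def adaptive_inference(problem: str, iterations: int = 4) -> tuple[list[str], float]:
    """Cross-iteration loop: each iteration adapts its compute strategy
    (here: a wider branching factor), and the final response is the
    trajectory with the best aggregated PRM reward."""
    best_traj, best_reward = [], float("-inf")
    for i in range(iterations):
        branch = 2 + i            # crude adaptive-compute schedule
        traj, reward = generate_trajectory(problem, branch,
                                           max_steps=5, prune_below=0.2)
        if reward > best_reward:  # select across iterations by reward
            best_traj, best_reward = traj, reward
    return best_traj, best_reward

trajectory, reward = adaptive_inference("demo-problem")
```

The key structural point the sketch preserves is that the same PRM signal is used twice: within an iteration to gate expansion, and across iterations to pick the returned trajectory.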