Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

2026-04-16

AI Summary
This work addresses how to allocate computation across inputs under a limited inference budget so as to maximize the accuracy of large language models (LLMs). It formalizes test-time compute allocation as a constrained optimization problem: maximize expected accuracy subject to an average compute budget. Lagrangian relaxation yields a closed-form, per-instance oracle action, and because the induced cost is monotone in the dual variable, the budget can be matched by binary search. The authors then reduce constrained inference to supervised learning by training a lightweight classifier to imitate this oracle policy from cheap input features. The theoretical analysis bounds the regret of the learned policy by its imitation error times the worst-case per-instance gap. Empirically, with three LLMs on MATH and GSM8K, the method outperforms uniform and heuristic allocation baselines, achieving up to a 12.8% relative accuracy improvement on MATH under matched budgets, while the imitator reaches over 91% imitation accuracy and closely tracks the Lagrangian oracle upper bound.
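
The formulation summarized above can be written out as follows. This is a sketch under assumed notation, not the paper's own statement: $\pi$ is the allocation policy, $p(x,a)$ the expected accuracy of compute action $a$ on input $x$, $c(a)$ its cost, $B$ the average budget, $\lambda \ge 0$ the dual variable, $\varepsilon$ the imitation error, and $\Delta_{\max}$ the worst-case per-instance gap. None of these symbols appear in the listing itself.

```latex
% Constrained problem: maximize expected accuracy under an average budget
\max_{\pi}\; \mathbb{E}_{x}\big[\,p(x,\pi(x))\,\big]
\quad \text{s.t.} \quad \mathbb{E}_{x}\big[\,c(\pi(x))\,\big] \le B

% Lagrangian relaxation: the budget constraint becomes a per-instance price
L(\pi,\lambda) = \mathbb{E}_{x}\big[\,p(x,\pi(x)) - \lambda\, c(\pi(x))\,\big] + \lambda B

% Per-instance oracle action for a fixed dual variable
a^{*}_{\lambda}(x) = \arg\max_{a}\; \big\{\, p(x,a) - \lambda\, c(a) \,\big\}

% Regret of the learned policy \hat{\pi}, as described in words in the summary
\mathrm{Regret}(\hat{\pi}) \;\le\; \varepsilon \cdot \Delta_{\max}
```

The last line is only a symbolic reading of "imitation error times the worst-case per-instance gap"; the exact constants and conditions are in the paper.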

Abstract
Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.
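
The solve stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the accuracy matrix `acc` (one row per input, one column per compute action), and the cost vector `cost` are all hypothetical, and the sketch assumes per-action accuracy estimates are already available and that actions are ordered from cheapest to most expensive.

```python
import numpy as np

def oracle_actions(acc, cost, lam):
    """Per-instance oracle: pick the action maximizing accuracy - lam * cost.

    acc  : (n, k) array, assumed estimate of accuracy for each input/action
    cost : (k,) array of action costs, sorted ascending (assumption)
    np.argmax returns the first maximizer, so ties go to the cheaper action.
    """
    scores = acc - lam * cost[None, :]
    return np.argmax(scores, axis=1)

def mean_cost(acc, cost, lam):
    """Average compute spent when every input uses its oracle action."""
    return float(cost[oracle_actions(acc, cost, lam)].mean())

def solve_dual(acc, cost, budget, iters=60):
    """Binary search for the smallest dual variable meeting the budget.

    Relies on mean_cost being non-increasing in lam.
    """
    if budget < cost.min():
        raise ValueError("budget is below the cheapest action; infeasible")
    if mean_cost(acc, cost, 0.0) <= budget:
        return 0.0  # constraint is slack, no price on compute needed
    lo, hi = 0.0, 1.0
    while mean_cost(acc, cost, hi) > budget:  # grow until feasible
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_cost(acc, cost, mid) > budget:
            lo = mid
        else:
            hi = mid
    return hi

# Toy example: three inputs, three compute levels with costs 1, 2, 4.
acc = np.array([[0.9, 0.9, 0.9],   # easy input: extra compute does not help
                [0.2, 0.5, 0.8],   # benefits steadily from more compute
                [0.1, 0.1, 0.6]])  # only the largest budget helps
cost = np.array([1.0, 2.0, 4.0])
lam = solve_dual(acc, cost, budget=2.0)
actions = oracle_actions(acc, cost, lam)
```

With a finite set of inputs and discrete actions, the average cost changes in steps as the dual variable grows, so this sketch returns the smallest price whose cost does not exceed the budget rather than matching it exactly. In the learn stage, the resulting `actions` would serve as labels for a classifier over cheap input features.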
Problem

Research questions and friction points this paper is trying to address.

test-time compute allocation
constrained optimization
reasoning LLMs
inference budget
adaptive computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time compute allocation
constrained policy optimization
Lagrangian relaxation
Solve-then-Learn pipeline
adaptive inference