OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational redundancy caused by fixed-budget majority voting in test-time policy optimization for large language models, this paper proposes a dynamic reasoning-budget allocation framework. The method introduces the Bayesian sequential probability ratio test (SPRT) to test-time learning for the first time, enabling label-free adaptive sampling termination and online policy updates. By tightly coupling the dynamic stopping mechanism with policy optimization algorithms such as PPO or GRPO, the framework supports on-demand rollout allocation and cross-step rollout reuse. Evaluated across diverse reasoning tasks, the approach reduces rollout overhead by an average of 37% while maintaining or improving accuracy, demonstrating a superior trade-off between inference efficiency and task performance and validating both the theoretical soundness and practical effectiveness of the proposed framework.
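The adaptive stopping idea can be sketched as a classical Wald-style SPRT over votes for the current leading answer: sample rollouts one at a time and stop as soon as the evidence that the leader is a reliable consensus crosses a threshold. The hypotheses `p0`/`p1`, the error rates, and the `sample_answer` callable below are illustrative assumptions, not the paper's exact Bayesian formulation:

```python
import math

def sprt_majority_vote(sample_answer, p0=0.5, p1=0.7,
                       alpha=0.05, beta=0.05, max_rollouts=32):
    """Sequentially sample rollout answers, stopping early once a Wald
    SPRT accepts that the leading answer is a reliable consensus.

    sample_answer: zero-argument callable returning one rollout's final answer.
    H0: the leader wins each vote with probability p0 (chance-level);
    H1: it wins with probability p1 (genuine consensus).
    """
    upper = math.log((1 - beta) / alpha)  # accept-H1 boundary
    counts, answers = {}, []
    for n in range(1, max_rollouts + 1):
        a = sample_answer()
        answers.append(a)
        counts[a] = counts.get(a, 0) + 1
        leader, k = max(counts.items(), key=lambda kv: kv[1])
        # Log-likelihood ratio: each vote either matches the leader or not.
        llr = (k * math.log(p1 / p0)
               + (n - k) * math.log((1 - p1) / (1 - p0)))
        if llr >= upper:
            return leader, answers  # confident consensus: halt sampling
    # Budget exhausted: fall back to plain majority vote.
    return max(counts.items(), key=lambda kv: kv[1])[0], answers
```

With these illustrative settings, a unanimous sampler triggers the stop after 9 rollouts instead of spending the full budget of 32; the retained `answers` list is exactly what the paper reuses for on-policy updates.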

📝 Abstract
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
Problem

Research questions and friction points this paper is trying to address.

How to adapt LLM policies at test time under distribution shift
How to cut the computational redundancy of fixed-budget majority voting
How to combine a Bayesian stopping rule with on-policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive inference budget allocation for test-time optimization
Bayesian sequential probability ratio test for dynamic sampling
On-policy updates using retained rollouts without ground-truth labels