Optimizing Anytime Reasoning via Budget Relative Policy Optimization

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the low inference efficiency and inflexibility of large language models (LLMs) under dynamic token budgets. The authors propose AnytimeReasoner, a framework built on Budget Relative Policy Optimization (BRPO), which introduces budget-aware, verifiable dense rewards into the reasoning process. The method decouples chain-of-thought (CoT) generation from answer summarization and combines CoT truncation, budget-aware sampling, and variance reduction for end-to-end optimization. Crucially, it requires no architectural changes to the base LLM, supports anytime truncation and response generation, and improves token efficiency in both training and deployment. On mathematical reasoning benchmarks, AnytimeReasoner consistently outperforms GRPO, converging faster during training and achieving higher inference token efficiency across diverse budget prior distributions.

📝 Abstract
Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
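The core mechanism in the abstract — sample token budgets from a prior, truncate the full thinking trace at each budget, and verify a summary produced from each truncation — can be sketched as follows. This is a minimal illustration, not the paper's implementation; `policy.think` and `policy.summarize` are hypothetical stand-ins for the thinking and summary policies.

```python
def anytime_reward(question, answer, budgets, policy):
    """Score one reasoning trace at several truncation budgets.

    budgets: token budgets sampled from a prior distribution.
    policy:  object with hypothetical .think() and .summarize() methods
             (stand-ins for the decoupled thinking and summary policies).
    Returns one verifiable reward per budget; their average corresponds
    to the cumulative anytime objective described in the abstract.
    """
    thinking = policy.think(question)  # full chain-of-thought token list
    rewards = []
    for b in sorted(budgets):
        truncated = thinking[:b]       # cut the trace to fit the budget
        summary = policy.summarize(question, truncated)
        # Verifiable dense reward: 1 if the summarized answer checks out.
        rewards.append(1.0 if summary == answer else 0.0)
    return rewards
```

Because every sampled budget yields its own verifiable reward, the reward signal becomes dense along the trace rather than arriving only at the end.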
Problem

Research questions and friction points this paper is trying to address.

Optimizing reasoning performance under varying token budgets
Improving token efficiency in large language models
Enhancing training and deployment flexibility via dense rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Truncates thinking process for budget constraints
Uses verifiable dense rewards in RL
Decouples optimization of the thinking and summary policies
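The variance-reduction idea behind BRPO can be illustrated with a small sketch. This is a simplification of my own, not the paper's exact estimator: for each budget, a trace's advantage is its reward minus the group's mean reward at that same budget, so variance caused by some budgets being inherently harder is removed.

```python
def budget_relative_advantages(group_rewards):
    """Compute budget-wise group-relative advantages.

    group_rewards[i][j]: reward of sampled trace i truncated to budget j.
    Each advantage is centered by the group's mean reward at the same
    budget (a simplified, budget-wise baseline in the spirit of BRPO).
    """
    n = len(group_rewards)       # number of traces in the group
    m = len(group_rewards[0])    # number of sampled budgets
    # Per-budget baseline: mean reward across the group at that budget.
    baselines = [sum(group_rewards[i][j] for i in range(n)) / n
                 for j in range(m)]
    return [[group_rewards[i][j] - baselines[j] for j in range(m)]
            for i in range(n)]
```

Compared with a single trajectory-level baseline as in GRPO, centering per budget keeps the advantage estimate low-variance even when rewards differ systematically across budgets.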