CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

πŸ“… 2026-02-03
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of fixed or instance-level rollout budget allocation in reinforcement learning-based post-training of large language models, where uniform strategies fail to adapt to the model's dynamic learning state and waste computational resources. To overcome this, the authors propose a capability-aware adaptive budget allocation mechanism that dynamically estimates the potential training utility of each task by combining a capability-oriented value function with a heap-structured greedy policy, optimizing the trade-off between exploration and exploitation during training. Implemented within the RLVR and GRPO frameworks, the method delivers consistent generalization improvements across multiple complex reasoning benchmarks while enhancing both computational efficiency and overall training effectiveness.

πŸ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM reasoning. However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, failing to capture the model's dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources to samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.
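The heap-based greedy allocation described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the task values, the diminishing-returns marginal score, and the function name `allocate_rollouts` are all assumptions standing in for the paper's capability-oriented value function.

```python
import heapq

def allocate_rollouts(task_values, total_budget, min_per_task=1):
    """Greedy heap-based rollout budget allocation (illustrative sketch).

    `task_values` maps task ids to an estimated training value, a
    hypothetical stand-in for the paper's Capability-Oriented Value
    function. Each task gets `min_per_task` rollouts up front; the
    remaining budget is handed out one rollout at a time to the task
    with the highest marginal value, assumed here to decay as more
    rollouts are assigned to the same task.
    """
    alloc = {t: min_per_task for t in task_values}
    budget = total_budget - min_per_task * len(task_values)
    # heapq is a min-heap, so negate values to pop the max first.
    heap = [(-v / (alloc[t] + 1), t) for t, v in task_values.items()]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, t = heapq.heappop(heap)
        alloc[t] += 1
        budget -= 1
        # Re-insert with the decayed marginal value of one more rollout.
        heapq.heappush(heap, (-task_values[t] / (alloc[t] + 1), t))
    return alloc
```

Under this sketch, tasks with higher estimated training value receive more of the rollout budget, while every task keeps a minimum allocation so no sample is starved of exploration.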
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Budget Allocation
Large Language Models
Resource Efficiency
Training Dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capability-Oriented Budget Allocation
Reinforcement Learning
LLM Post-Training
Adaptive Rollout Budget
Training Value Estimation
πŸ”Ž Similar Papers
No similar papers found.