🤖 AI Summary
This work addresses the inefficiency of traditional on-policy reinforcement learning in large language model (LLM) post-training, where wasted experience and reward homogenization hinder effective learning from challenging samples. To overcome these limitations, the authors propose Batch Adaptation Policy Optimization (BAPO), a framework that introduces off-policy reinforcement learning into LLM post-training. BAPO employs a replay-buffer mechanism to dynamically re-evaluate and reuse high-value historical trajectories, improving data efficiency while providing a theoretical lower bound on policy improvement. Experimental results show that BAPO outperforms GRPO by an average of 12.5% across mathematical reasoning, planning, and visual reasoning tasks, and solves 40.7% of problems that the base model consistently fails on.
📝 Abstract
Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which hinders learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. BAPO dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO resolves 40.7% of problems that base models consistently fail to solve.
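The batch-selection mechanism described in the abstract (buffering high-value historical trajectories and mixing them back into training batches) could be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names, the value threshold, and the reuse fraction are all hypothetical assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Illustrative sketch of BAPO-style trajectory reuse.

    All names and parameters here are hypothetical, not from the paper.
    """

    def __init__(self, capacity=1000, value_threshold=0.5):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first
        self.value_threshold = value_threshold

    def add(self, trajectory, reward):
        # Keep only trajectories judged high-value under a verifiable reward.
        if reward >= self.value_threshold:
            self.buffer.append((trajectory, reward))

    def sample(self, k):
        # Re-use up to k stored high-value trajectories for the next batch.
        k = min(k, len(self.buffer))
        return random.sample(list(self.buffer), k)


def build_batch(fresh_rollouts, buffer, reuse_fraction=0.25):
    """Mix fresh on-policy rollouts with replayed off-policy trajectories."""
    n_reuse = int(len(fresh_rollouts) * reuse_fraction)
    replayed = [traj for traj, _ in buffer.sample(n_reuse)]
    return fresh_rollouts + replayed
```

In this sketch, the off-policy character comes from `build_batch` returning trajectories generated by earlier policy snapshots alongside fresh rollouts; the paper's actual re-evaluation and selection criteria are more involved than the simple reward threshold used here.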