Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

πŸ“… 2025-07-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing reinforcement finetuning (RFT) methods are predominantly online and on-policy, which prevents effective reuse of historical data and incurs high computational overhead for continual training. To address this, the paper proposes ReMix, a framework that enables mainstream on-policy algorithms such as PPO and GRPO to effectively exploit off-policy historical trajectories. ReMix combines a mix-policy proximal policy gradient with an increased update-to-data (UTD) ratio for efficient training, a KL-Convex policy constraint for stable policy updates, and a policy reincarnation mechanism that transitions from efficient early-stage learning to steady asymptotic improvement. Evaluated on five mathematical reasoning benchmarks, ReMix attains state-of-the-art-level performance with minimal data: a 1.5B model reaches 52.10% average Pass@1 accuracy using only 79K response rollouts, and a 7B model reaches 64.39%, cutting training cost by 30–450× in rollout data volume relative to conventional on-policy RFT.
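The page does not reproduce the mix-policy objective itself; the snippet below is only a minimal Python sketch of what mixing fresh on-policy rollouts with replayed historical rollouts under a PPO-style clipped surrogate could look like. Every name here (`mix_policy_ppo_loss`, the `policy.logp` accessor, the batch field names) is hypothetical, not the authors' implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate over a batch of token log-probs."""
    ratio = torch.exp(logp_new - logp_behavior)  # importance weight vs. the behavior policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def mix_policy_ppo_loss(on_batch, off_batch, policy, mix_ratio=0.5, clip_eps=0.2):
    """Convex mix of an on-policy loss and a replayed off-policy loss.

    `on_batch` holds fresh rollouts from the current policy; `off_batch` holds
    historical rollouts from a replay buffer. The stored behavior log-probs
    keep the importance ratios well-defined in both terms.
    """
    on_loss = ppo_clip_loss(policy.logp(on_batch), on_batch["logp_behavior"],
                            on_batch["advantages"], clip_eps)
    off_loss = ppo_clip_loss(policy.logp(off_batch), off_batch["logp_behavior"],
                             off_batch["advantages"], clip_eps)
    return mix_ratio * on_loss + (1.0 - mix_ratio) * off_loss
```

With `mix_ratio = 1.0` this reduces to plain PPO; smaller values lean harder on the replay buffer, which is where the off-policy data reuse (and the reported cost savings) would come from.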

πŸ“ Abstract
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck to continued economical and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach that enables on-policy RFT methods such as PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) a KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models built on PPO and GRPO with 1.5B and 7B base models. ReMix achieves an average Pass@1 accuracy of 52.10% (for the 1.5B model) with 0.079M response rollouts and 350 training steps, and 63.27%/64.39% (for the 7B model) with 0.007M/0.011M response rollouts and 50/75 training steps, on five math reasoning benchmarks (AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy and the collapse mode of self-reflection behavior in the presence of severe off-policyness.
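The abstract names the KL-Convex constraint but does not give it in closed form. One plausible reading, written here purely as an assumption, is a KL penalty toward a convex combination of the recent behavior policy (for stability) and a reference policy (for flexibility); the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def kl_convex_penalty(logits_new, logits_old, logits_ref, lam=0.7, beta=0.1):
    """Hypothetical 'KL-Convex' penalty: KL(pi_new || lam*pi_old + (1-lam)*pi_ref).

    lam trades off staying near the recent behavior policy (stability)
    against staying anchored to a reference policy (flexibility).
    """
    logp_new = F.log_softmax(logits_new, dim=-1)
    mix = lam * F.softmax(logits_old, dim=-1) + (1.0 - lam) * F.softmax(logits_ref, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), computed per token
    kl = (logp_new.exp() * (logp_new - torch.log(mix + 1e-8))).sum(dim=-1)
    return beta * kl.mean()
```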
Problem

Research questions and friction points this paper is trying to address.

Efficient off-policy RL for LLM finetuning
Reducing training cost with off-policy data
Balancing stability and flexibility in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy RL for efficient LLM finetuning
Mix-policy proximal policy gradient
KL-Convex policy constraint for stability (a schematic sketch combining these pieces follows this list)
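As flagged in the list above, here is a schematic of how the three components might fit together in one training loop. It reuses the hypothetical `mix_policy_ppo_loss` and `kl_convex_penalty` sketched earlier; the `rollout`, `buffer`, and `policy.update` interfaces are likewise invented for illustration, and the reincarnation rule here (switching to pure on-policy updates after a fixed step) is only a guess at the paper's mechanism.

```python
def train(policy, env, buffer, steps=350, utd_ratio=4, reincarnate_at=200):
    """Schematic mix-policy loop with an increased update-to-data (UTD) ratio."""
    for step in range(steps):
        on_batch = env.rollout(policy)       # fresh on-policy response rollouts
        buffer.add(on_batch)                 # keep history for off-policy reuse
        # Hypothetical policy reincarnation: after the efficient early phase,
        # drop the off-policy term (mix_ratio=1.0) for steady asymptotic gains.
        mix = 0.5 if step < reincarnate_at else 1.0
        for _ in range(utd_ratio):           # several gradient updates per rollout batch
            off_batch = buffer.sample()
            loss = mix_policy_ppo_loss(on_batch, off_batch, policy, mix_ratio=mix)
            loss = loss + kl_convex_penalty(policy.logits(on_batch),
                                            on_batch["logits_old"],
                                            policy.ref_logits(on_batch))
            policy.update(loss)              # backprop + optimizer step
```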
πŸ”Ž Similar Papers
No similar papers found.