LLMs Can Learn to Reason Via Off-Policy RL

📅 2026-02-22
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current reinforcement learning methods for large language models rely on the on-policy assumption, yet distributed training inherently produces off-policy data due to policy lag, limiting both performance and efficiency. This work proposes OAPL, an off-policy reinforcement learning algorithm that explicitly trains against the lagged inference policy, without requiring importance sampling or modifications to the inference engine. OAPL supports extreme off-policy settings with gradient lags of more than 400 steps and outperforms importance-sampling-based GRPO on competition mathematics tasks. On LiveCodeBench, it matches the performance of DeepCoder while using only one-third of the training generations, and it improves test-time scaling under the Pass@k metric.

📝 Abstract
Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break this assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS), or by more closely aligning the training and inference policies by explicitly modifying the inference engine. In this work, we embrace off-policyness and propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL). We show that OAPL outperforms GRPO with importance sampling on competition math benchmarks, and can match the performance of a publicly available coding model, DeepCoder, on LiveCodeBench, while using 3x fewer generations during training. We further empirically demonstrate that models trained via OAPL have improved test time scaling under the Pass@k metric. OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.
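The abstract contrasts OAPL with prior work that corrects off-policy data via importance sampling (IS). As background, a generic PPO/GRPO-style IS correction can be sketched as below: each token's advantage is reweighted by the ratio between the current training policy and the stale inference (behavior) policy, with clipping to bound the ratio when the lag is large. This is a hedged illustration of the IS baseline only, not of OAPL itself; the function name, clipping value, and per-token averaging are illustrative assumptions, not details from the paper.

```python
import math

def is_corrected_loss(logp_train, logp_behavior, advantages, clip=0.2):
    """Per-token clipped surrogate loss with an importance-sampling ratio.

    logp_train    -- token log-probs under the current training policy
    logp_behavior -- token log-probs under the lagged inference policy
    advantages    -- per-token advantage estimates
    clip          -- PPO-style clipping range for the IS ratio (illustrative)
    """
    loss = 0.0
    for lt, lb, adv in zip(logp_train, logp_behavior, advantages):
        ratio = math.exp(lt - lb)                      # pi_train / pi_behavior
        unclipped = ratio * adv
        clipped = max(min(ratio, 1.0 + clip), 1.0 - clip) * adv
        loss -= min(unclipped, clipped)                # negate: minimize loss
    return loss / len(advantages)
```

With no policy lag (identical log-probs) the ratio is 1 and the loss reduces to the negative mean advantage; under a 400-step lag the ratios can become extreme, which is exactly the regime where clipping truncates the gradient signal and, per the abstract, where OAPL avoids IS altogether.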
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Reinforcement Learning
Off-policy
Policy Lag
Training-Inference Mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-Policy Reinforcement Learning
OAPL
Policy Lag
Large Language Models
Post-Training