🤖 AI Summary
This work addresses the trade-off between sample efficiency and asymptotic performance in reinforcement learning for large language models: on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that conservative trust-region methods contain only at the cost of underused training signals. The authors propose Extreme Region Policy Distillation (ERPD), a two-stage framework. The first stage runs weakly constrained off-policy optimization on fixed data to extract as much training signal as possible; the second distills the resulting token-level supervision back into the base policy under trust-region constraints, keeping useful signals while filtering harmful drift. This decouples sample efficiency from KL efficiency and works with both strong and weak teacher policies. On mathematical reasoning tasks, the distilled policy achieves comparable or better performance with substantially smaller KL divergence.
📝 Abstract
Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.
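The two-stage idea can be illustrated with a toy example. The sketch below is not the authors' algorithm: it assumes a single categorical "policy" over four tokens, a hypothetical reward that favors token 0, and a simple logit-space interpolation with bisection as a stand-in for trust-region distillation. All function names and hyperparameters are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage1_offpolicy(base_logits, data, steps=200, lr=0.5):
    """Stage 1 (toy): many importance-weighted policy-gradient steps
    on one fixed batch, with no KL constraint at all."""
    base = softmax(base_logits)
    logits = list(base_logits)
    for _ in range(steps):
        probs = softmax(logits)
        grad = [0.0] * len(logits)
        for token, advantage in data:
            w = probs[token] / base[token]  # importance ratio vs. base policy
            for k in range(len(logits)):
                # d log pi(token) / d logit_k = 1[k == token] - pi(k)
                indicator = 1.0 if k == token else 0.0
                grad[k] += w * advantage * (indicator - probs[k])
        logits = [l + lr * g / len(data) for l, g in zip(logits, grad)]
    return logits

def stage2_distill(base_logits, teacher_logits, max_kl, iters=50):
    """Stage 2 (toy): move from base toward teacher in logit space and
    keep the largest step with KL(student || base) <= max_kl."""
    base = softmax(base_logits)

    def student(alpha):
        return [b + alpha * (t - b) for b, t in zip(base_logits, teacher_logits)]

    if kl(softmax(student(1.0)), base) <= max_kl:
        return student(1.0)
    lo, hi = 0.0, 1.0
    for _ in range(iters):  # bisection on the interpolation weight
        mid = (lo + hi) / 2
        if kl(softmax(student(mid)), base) <= max_kl:
            lo = mid
        else:
            hi = mid
    return student(lo)

# Hypothetical fixed batch: each token seen 16 times, token 0 is "correct".
base_logits = [0.0, 0.0, 0.0, 0.0]
rewards = [1.0, 0.0, 0.0, 0.0]
data = [(t, rewards[t] - 0.25) for t in range(4) for _ in range(16)]

teacher_logits = stage1_offpolicy(base_logits, data)
student_logits = stage2_distill(base_logits, teacher_logits, max_kl=0.05)

base = softmax(base_logits)
print("teacher KL from base:", kl(softmax(teacher_logits), base))
print("student KL from base:", kl(softmax(student_logits), base))
print("student P(token 0):", softmax(student_logits)[0])
```

In this toy setting the unconstrained teacher drifts far from the base policy, while the student keeps the direction of improvement within a fixed KL budget. The real method operates on token-level supervision over full trajectories, which this sketch does not attempt to model.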