🤖 AI Summary
Existing zero-shot reinforcement learning (zero-RL) methods are constrained by the on-policy paradigm, limiting their ability to transcend a model's initial capability boundaries. To address this, the authors propose LUFFY (Learning to reason Under oFF-policY guidance), a framework that introduces high-quality off-policy reasoning trajectories into zero-RL. LUFFY dynamically mixes off-policy demonstrations with on-policy rollouts during training and applies policy shaping via regularized importance sampling, balancing demonstration exploitation against exploratory generalization while avoiding rigid imitation. Across six mathematical reasoning benchmarks, LUFFY achieves an average improvement of over +7.0 points; on out-of-distribution (OOD) tasks it outperforms baselines by over +6.2 points, substantially surpassing supervised fine-tuning. This work establishes a scalable off-policy zero-RL paradigm for advancing large language models' reasoning capabilities beyond their initial limits.
📝 Abstract
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.
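The core idea — mixing on-policy rollouts with off-policy demonstration traces, while reshaping the importance weights on the off-policy term so low-probability tokens still contribute gradient — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the shaping function `f(x) = x / (x + gamma)` and the simple additive combination of the two loss terms are assumptions chosen to illustrate regularized importance sampling, and all function names are hypothetical.

```python
import torch

def shaped_weight(pi_theta: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    # Regularized shaping f(x) = x / (x + gamma): keeps the weight on
    # low-probability tokens from vanishing, so hard demonstration tokens
    # still receive learning signal instead of being ignored.
    # (Illustrative form; a stand-in for the paper's exact shaping.)
    return pi_theta / (pi_theta + gamma)

def mixed_policy_loss(logp_on: torch.Tensor, adv_on: torch.Tensor,
                      logp_off: torch.Tensor, adv_off: torch.Tensor) -> torch.Tensor:
    # On-policy rollouts: a standard policy-gradient term.
    loss_on = -(adv_on * logp_on).mean()
    # Off-policy demonstration tokens: each is weighted by the shaped
    # probability the current policy assigns to it, so the model is
    # guided toward the demonstrations rather than forced to imitate them.
    w = shaped_weight(logp_off.exp()).detach()
    loss_off = -(w * adv_off * logp_off).mean()
    return loss_on + loss_off
```

Note that without the shaping, a token the policy assigns near-zero probability would also get a near-zero importance weight, and the demonstration would never pull the policy toward it; the regularizer `gamma` is what prevents that collapse into superficial imitation of only the easy tokens.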