Selective Off-Policy Reference Tuning with Plan Guidance

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a key limitation of existing GRPO-style methods: they receive no useful learning signal on prompts where every sampled trajectory fails. To overcome this, the authors propose SORT, described as the first approach to introduce a plan-guided selective off-policy update mechanism. Without altering trajectory generation, SORT extracts a plan from the reference solution and computes the difference in token probabilities with and without plan conditioning. Tokens that become more predictable under the plan are assigned higher weights, so prompts on which all rollouts fail become structured, plan-aware learning signals. By combining reference planning, conditional probability contrast, and selective reweighting, SORT outperforms GRPO and other guidance-based methods across three backbone models and eight reasoning benchmarks, with the most pronounced gains on weaker models.
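To see why all-fail prompts are a problem for GRPO-style training, the sketch below standardizes rewards within one prompt's group of rollouts, which is the usual group-relative advantage. It assumes binary verifiable rewards (1.0 for a correct rollout, 0.0 otherwise); the function name and the `eps` constant are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One rollout succeeds: the success gets a positive advantage,
# the failures get negative ones, so the policy gradient is non-zero.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: all rewards equal the mean, so every advantage
# is exactly zero and the prompt contributes no gradient.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

With all advantages at zero, the hardest prompts are effectively skipped, which is the gap the repair update is meant to fill.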
📝 Abstract
Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with the largest gains on weaker models.
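The plan-contrast weighting can be illustrated with a minimal sketch. The abstract only states that tokens which become more predictable under plan conditioning receive higher weight; the sigmoid mapping, the `temperature` parameter, the weighted negative log-likelihood, and all function names below are assumptions made for illustration, not the paper's actual formulation.

```python
import math

def plan_contrast_weights(logp_with_plan, logp_without_plan, temperature=1.0):
    """Map per-token log-probability gains under the plan to weights in (0, 1).

    Assumption: a sigmoid of the gap; the source does not specify the mapping.
    """
    if len(logp_with_plan) != len(logp_without_plan):
        raise ValueError("log-probability sequences must have equal length")
    gaps = [a - b for a, b in zip(logp_with_plan, logp_without_plan)]
    return [1.0 / (1.0 + math.exp(-g / temperature)) for g in gaps]

def weighted_nll(logp_policy, weights):
    """Weighted negative log-likelihood over reference-solution tokens.

    Assumption: weights are normalized by their sum.
    """
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, logp_policy)) / total

# Hypothetical log-probabilities for three reference-solution tokens.
logp_with = [-0.1, -2.0, -0.5]     # conditioned on the extracted plan
logp_without = [-2.0, -2.0, -0.4]  # no plan in the context

weights = plan_contrast_weights(logp_with, logp_without)
loss = weighted_nll(logp_without, weights)
```

In this toy input the first token becomes much more likely once the plan is given, so it receives the largest weight; the second token is unaffected by the plan and sits at the neutral weight of 0.5. Uniform imitation corresponds to setting every weight to the same value.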
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
reasoning
hard prompts
rollout failure
off-policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Off-Policy Learning
Plan Guidance
Token-Level Weighting
Reference Solution Conditioning
Reasoning Reinforcement Learning
🔎 Similar Papers
2024-10-04, Conference on Empirical Methods in Natural Language Processing, Citations: 5