RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning large language models to domain-specific data without compromising their general capabilities. Supervised fine-tuning often degrades general performance, and existing reinforcement learning approaches struggle to balance hard-example utilization with training stability. This paper proposes Rephrasing Policy Optimization (RePO), which guides the policy model to rephrase off-policy knowledge into high-quality trajectories that conform to its own distribution and dynamically replaces low-reward trajectories with them. By integrating trajectory rephrasing with dynamic replacement, RePO avoids forced distributional shifts and substantially improves the utilization of hard examples while preserving on-policy training stability. Experimental results demonstrate that RePO outperforms existing methods across multiple benchmarks, achieving state-of-the-art performance.

📝 Abstract
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
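The core mechanism in the abstract — swapping low-reward on-policy rollouts for rephrased, high-quality trajectories before the policy update — can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the reward threshold, and the list-based trajectory representation are all assumptions made for illustration.

```python
def replace_low_reward_rollouts(rollouts, rewards, rephrased, reward_threshold=0.0):
    """Hypothetical sketch of RePO's dynamic replacement step.

    rollouts:  on-policy trajectories sampled from the current model.
    rewards:   scalar reward per rollout.
    rephrased: high-quality trajectories the policy produced by rephrasing
               off-policy knowledge into its own distribution.

    Rollouts at or below the threshold are swapped, in order, for rephrased
    trajectories; the rest are kept so training stays close to on-policy.
    """
    batch, n_replaced = [], 0
    for traj, reward in zip(rollouts, rewards):
        if reward <= reward_threshold and n_replaced < len(rephrased):
            batch.append(rephrased[n_replaced])  # inject guidance trajectory
            n_replaced += 1
        else:
            batch.append(traj)  # keep the on-policy rollout
    return batch, n_replaced
```

In a full pipeline, the returned batch would then feed a standard on-policy update (e.g. a PPO/GRPO-style step), so the replacement happens at the data level rather than by shifting the training objective toward off-policy targets.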
Problem

Research questions and friction points this paper is trying to address.

on-policy learning
off-policy knowledge
hard sample utilization
training instability
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rephrasing Policy Optimization
on-policy learning
off-policy knowledge
reinforcement learning
large language models
👥 Authors
Linxuan Xia
State Key Laboratory of CAD&CG, the College of Computer Science and Technology, Zhejiang University
Xiaolong Yang
FiT, Tencent, Shenzhen, China
Yongyuan Chen
State Key Laboratory of CAD&CG, the College of Computer Science and Technology, Zhejiang University
Enyue Zhao
State Key Laboratory of CAD&CG, the College of Computer Science and Technology, Zhejiang University
Deng Cai
Professor of Computer Science, Zhejiang University
Machine learning · Computer vision · Data mining · Information retrieval
Yasheng Wang
Tencent
Natural Language Processing
Boxi Wu
School of Software Technology, Zhejiang University