RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning large language models to domain-specific data without compromising their general capabilities. Supervised fine-tuning often degrades general performance, and existing reinforcement learning approaches struggle to balance hard-example utilization with training stability. This paper proposes Rephrasing Policy Optimization (RePO), which guides the policy model to rephrase off-policy knowledge into high-quality trajectories that conform to its own distribution and dynamically replaces low-reward trajectories with them. By integrating trajectory rephrasing with dynamic replacement, RePO avoids forced distributional shifts and substantially improves the utilization of hard examples while preserving on-policy training stability. Experimental results demonstrate that RePO outperforms existing methods across multiple benchmarks, achieving state-of-the-art performance.

📝 Abstract
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
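The core mechanism in the abstract — swapping low-reward on-policy rollouts for rephrased, high-quality trajectories before the policy update — can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the reward threshold, and the list-based trajectory representation are all assumptions made for illustration.

```python
def replace_low_reward_rollouts(rollouts, rewards, rephrased, reward_threshold=0.0):
    """Hypothetical sketch of RePO's dynamic replacement step.

    rollouts:  on-policy trajectories sampled from the current model.
    rewards:   scalar reward per rollout.
    rephrased: high-quality trajectories the policy produced by rephrasing
               off-policy knowledge into its own distribution.

    Rollouts at or below the threshold are swapped, in order, for rephrased
    trajectories; the rest are kept so training stays close to on-policy.
    """
    batch, n_replaced = [], 0
    for traj, reward in zip(rollouts, rewards):
        if reward <= reward_threshold and n_replaced < len(rephrased):
            batch.append(rephrased[n_replaced])  # inject guidance trajectory
            n_replaced += 1
        else:
            batch.append(traj)  # keep the on-policy rollout
    return batch, n_replaced
```

In a full pipeline, the returned batch would then feed a standard on-policy update (e.g. a PPO/GRPO-style step), so the replacement happens at the data level rather than by shifting the training objective toward off-policy targets.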
Problem

Research questions and friction points this paper is trying to address.

on-policy learning
off-policy knowledge
hard sample utilization
training instability
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rephrasing Policy Optimization
on-policy learning
off-policy knowledge
reinforcement learning
large language models
👥 Authors
Linxuan Xia
State Key Laboratory of CAD&CG, the College of Computer Science and Technology, Zhejiang University
Xiaolong Yang
FiT, Tencent, Shenzhen, China
Yongyuan Chen
State Key Laboratory of CAD&CG, the College of Computer Science and Technology, Zhejiang University
Enyue Zhao
State Key Laboratory of CAD&CG, the College of Computer Science and Technology, Zhejiang University
Deng Cai
Professor of Computer Science, Zhejiang University
Machine learning · Computer vision · Data mining · Information retrieval
Yasheng Wang
Tencent
Natural Language Processing
Boxi Wu
School of Software Technology, Zhejiang University