On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in post-training large language models, namely policy degradation and overfitting to expert demonstrations, this paper proposes CHORD. Methodologically, CHORD reformulates SFT and RL from a unified on-policy perspective, treating SFT as a dynamically weighted auxiliary objective within the RL framework. It introduces a dual-control mechanism: a global transition coefficient that governs the trade-off between imitation and exploration, and a token-wise weighting function that modulates how expert knowledge is injected. Implemented as an end-to-end on-policy RL optimizer, CHORD requires neither distillation nor staged training. Experiments on widely used benchmarks show a stable and efficient learning process that mitigates overfitting to expert data, with CHORD delivering significant improvements over SFT+RL baselines.
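As a minimal sketch of the dual-control objective described above, the training loss can be written as a convex mix of the on-policy RL loss and a token-wise weighted SFT loss on expert data. The function names, the `p * (1 - p)` weighting, and the exact mixing form below are illustrative assumptions, not the paper's formulation:

```python
import math

def entropy_like_weight(logprob):
    """Token-wise weight: emphasize expert tokens where the policy is
    uncertain (probability near 0.5), damp tokens it already predicts
    confidently or finds implausible. p * (1 - p) is one plausible
    choice of weighting function, assumed here for illustration."""
    p = math.exp(logprob)
    return p * (1.0 - p)

def chord_loss(rl_loss, expert_token_logprobs, mu, weight_fn=entropy_like_weight):
    """Mix the on-policy RL loss with a token-wise weighted SFT loss.

    mu: global transition coefficient in [0, 1]; high mu favors
        imitation of expert data, low mu favors on-policy exploration.
    expert_token_logprobs: the policy's log-probabilities of the
        expert demonstration tokens.
    """
    # weighted negative log-likelihood of each expert token
    sft_terms = [weight_fn(lp) * (-lp) for lp in expert_token_logprobs]
    sft_loss = sum(sft_terms) / len(sft_terms)
    return (1.0 - mu) * rl_loss + mu * sft_loss
```

With `mu = 0` the objective reduces to pure on-policy RL; with `mu = 1` it reduces to token-weighted SFT on expert data, so a single coefficient controls the blend without staged training.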

📝 Abstract
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
Problem

Research questions and friction points this paper is trying to address.

Harmonizing supervised fine-tuning and reinforcement learning dynamically
Mitigating overfitting to expert data in RL integration
Balancing off-policy imitation with on-policy exploration effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic weighting of SFT and RL objectives
Dual-control mechanism for transition and granular learning
Token-wise weighting to preserve on-policy exploration
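The "transition" half of the dual-control mechanism can be sketched as a schedule on the global coefficient that decays from imitation-heavy to exploration-heavy over training. The linear decay and the endpoint values below are assumptions for illustration, not the paper's actual schedule:

```python
def global_coefficient(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Hypothetical linear decay of the global transition coefficient:
    high mu early in training (off-policy imitation of expert data),
    low mu late in training (on-policy exploration)."""
    frac = min(step / max(total_steps, 1), 1.0)  # training progress in [0, 1]
    return mu_start + (mu_end - mu_start) * frac
```

Any monotone decay (exponential, cosine) would serve the same purpose; the key design point is that a single scalar schedule holistically steers the learner from expert imitation toward its own exploration.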