Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

📅 2026-04-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a challenge in reinforcement learning–based post-training of large language models: coarse credit assignment and late-stage instability often prevent rapid improvement and sustained optimization from being achieved at the same time. To overcome this, the authors propose Sample-Routed Policy Optimization (SRPO), which integrates Group Relative Policy Optimization (GRPO) with Self-Distillation Policy Optimization (SDPO). SRPO employs a dynamic routing mechanism that directs correct responses to GRPO and failed ones to SDPO for fine-grained correction, while an entropy-aware dynamic weighting scheme mitigates signal degradation and ambiguity in self-distillation. Experiments across five benchmarks and two model scales demonstrate that SRPO consistently outperforms both GRPO and SDPO, yielding average performance gains of 3.4%–6.3% on Qwen3-8B, reducing per-step computational overhead by up to 17.2%, and maintaining moderate response lengths.
๐Ÿ“ Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
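The routing and weighting ideas described in the abstract can be sketched compactly. The snippet below is a minimal illustration under our own assumptions, not the paper's implementation: the `rollouts` field names, helper functions, and the linear entropy schedule are all hypothetical. Correct rollouts receive GRPO-style group-relative advantages; failed rollouts receive self-distillation weights that shrink as the self-teacher's token distribution becomes higher-entropy (less reliable).

```python
import math

def entropy(probs):
    """Shannon entropy of a token-level probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_samples(rollouts):
    """Route each rollout by its verifiable reward:
    correct -> GRPO branch, failed -> SDPO branch."""
    correct = [r for r in rollouts if r["reward"] == 1.0]
    failed = [r for r in rollouts if r["reward"] < 1.0]
    return correct, failed

def grpo_advantages(correct, group_rewards):
    """Group-relative advantage: (reward - group mean) / group std."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((x - mean) ** 2 for x in group_rewards) / len(group_rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    return [(r["reward"] - mean) / std for r in correct]

def sdpo_weights(failed, max_entropy):
    """Entropy-aware weight: confident (low-entropy) teacher targets keep
    weight near 1; high-entropy, unreliable targets are suppressed toward 0."""
    return [max(0.0, 1.0 - entropy(r["teacher_probs"]) / max_entropy)
            for r in failed]

# Hypothetical batch: two correct rollouts and two failed rollouts, the
# latter carrying a (4-way) self-teacher token distribution.
rollouts = [
    {"reward": 1.0},
    {"reward": 1.0},
    {"reward": 0.0, "teacher_probs": [1.0, 0.0, 0.0, 0.0]},   # confident
    {"reward": 0.0, "teacher_probs": [0.25, 0.25, 0.25, 0.25]},  # unreliable
]
correct, failed = route_samples(rollouts)
advs = grpo_advantages(correct, [r["reward"] for r in rollouts])
weights = sdpo_weights(failed, max_entropy=math.log(4))
```

With this toy batch, both correct rollouts get a positive group-relative advantage, the confident failed rollout gets full distillation weight, and the uniform (maximum-entropy) one is suppressed to zero — the qualitative behavior the entropy-aware weighting aims for.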
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
Group Relative Policy Optimization
Self-Distillation Policy Optimization
Credit Assignment
Training Instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sample Routing
Policy Optimization
Self-Distillation
Reinforcement Learning with Verifiable Rewards
Entropy-Aware Weighting
Gengsheng Li
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Tianyu Yang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Junfeng Fang
National University of Singapore
Model Editing · AI Safety · LLM Explainability · AI4Science
Mingyang Song
Tencent Inc.
NLP · IR · LLMs
Mao Zheng
Tencent
Haiyun Guo
Rice University ECE Ph.D.
optical imaging · computational photography · Metalens
Dan Zhang
National University of Singapore
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Tat-Seng Chua
National University of Singapore
Multimedia Information Retrieval · Live Social Media Analysis