ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

📅 2026-05-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing on-policy self-distillation methods suffer from overfitting and poor generalization, particularly in cross-domain scenarios, due to their indiscriminate imitation of reference solutions and full-sequence distillation. This work proposes a reflective on-policy self-distillation framework that incorporates a self-reflection mechanism to identify the first erroneous segment in a model's reasoning trajectory and applies localized, token-level distillation exclusively to the detected error regions. By preserving correct reasoning prefixes and providing targeted corrective supervision, this approach uniquely integrates error localization with local distillation. The method substantially enhances the model's capacity for reasoning correction, improves in-domain performance, and boosts cross-domain generalization, consistently outperforming current self-distillation techniques across multiple benchmarks.
📝 Abstract
On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.
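To make the idea of error-localized distillation concrete, here is a minimal NumPy sketch of a token-level distillation loss restricted to a detected error span. It is an illustration under stated assumptions, not the authors' implementation: the function names (`span_mask`, `localized_distill_loss`), the half-open span convention `[start, end)`, the choice of KL(teacher || student) as the per-token divergence, and the toy logits are all hypothetical. The abstract does not specify the loss form or how the span is encoded.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def span_mask(seq_len, start, end):
    # 1.0 for tokens inside the (assumed) error span [start, end), 0.0 elsewhere.
    mask = np.zeros(seq_len, dtype=float)
    mask[start:end] = 1.0
    return mask

def localized_distill_loss(student_logits, teacher_logits, mask):
    # Per-token KL(teacher || student); the divergence direction is an assumption.
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    kl = (np.exp(t) * (t - s)).sum(axis=-1)
    # Average only over masked tokens, so the valid prefix contributes nothing.
    return float((kl * mask).sum() / max(mask.sum(), 1.0))

# Toy rollout: 6 tokens, vocabulary of 5. Teacher and student agree on the
# prefix (tokens 0-2) and disagree on the hypothetical error span (tokens 3-4).
rng = np.random.default_rng(0)
student = rng.normal(size=(6, 5))
teacher = student.copy()
teacher[3:5] += rng.normal(size=(2, 5))

loss = localized_distill_loss(student, teacher, span_mask(6, 3, 5))
prefix_loss = localized_distill_loss(student, teacher, span_mask(6, 0, 3))
```

In this sketch, `prefix_loss` is zero because the mask covers only tokens where teacher and student agree, while `loss` is positive because it covers the span where they differ. Full-response distillation would correspond to a mask of all ones, which is the setting the abstract argues can overwrite valid reasoning prefixes.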
Problem

Research questions and friction points this paper is trying to address.

on-policy self-distillation
reasoning generalization
error correction
language model reasoning
out-of-domain generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy self-distillation
reflection-guided correction
error-localized distillation
reasoning generalization
large language models