OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

📅 2026-05-12

🤖 AI Summary
This work addresses a mismatch between teacher and student responses in on-policy self-distillation (OPSD): reflection-induced bias and rigid response templates shift the teacher's self-reflected responses, which makes the token-level supervision miscalibrated. To mitigate this, the authors propose a logit-steering framework that combines trajectory-level outcome rewards with token-level guidance. The method contrasts successful and failed on-policy reasoning trajectories to calibrate the teacher logits, so the supervision carries a verifiable signal of trajectory correctness while remaining dense at the token level. According to the authors, this combination stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse reasoning benchmarks.
📝 Abstract
We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, OGLS-SD stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
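The abstract does not give the steering rule, so the following is only a minimal sketch of what contrasting successful and failed trajectories to calibrate teacher logits could look like at a single token position. Everything specific here is an assumption made for illustration and not taken from the paper: the additive update `teacher + alpha * (success - failure)`, the function names, the steering strength `alpha`, the direction of the KL term, and the toy logit values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def steer_teacher_logits(teacher, success_ctx, failure_ctx, alpha=1.0):
    """Shift teacher logits along a success-minus-failure direction.

    ASSUMPTION: an additive contrast rule chosen for illustration only;
    the paper's actual calibration rule may differ.
    """
    return [t + alpha * (s - f)
            for t, s, f in zip(teacher, success_ctx, failure_ctx)]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two categorical distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
teacher = [2.0, 1.0, 0.0]      # teacher logits before steering
success_ctx = [0.5, 1.5, 0.0]  # logits associated with a successful trajectory
failure_ctx = [1.5, 0.5, 0.0]  # logits associated with a failed trajectory
student = [1.0, 1.0, 0.0]      # student logits on its own on-policy prefix

steered = steer_teacher_logits(teacher, success_ctx, failure_ctx, alpha=0.5)

# Token-level distillation signal against the steered teacher.
# ASSUMPTION: KL(steered teacher || student); the paper may use another divergence.
loss = kl_divergence(steered, student)
print(steered, loss)
```

In this toy setup the contrast lowers the logit of the token favored by the failed trajectory and raises the one favored by the successful trajectory, while setting `alpha=0` recovers the unsteered teacher, that is, plain distillation under the same assumptions.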
Problem

Research questions and friction points this paper is trying to address.

on-policy self-distillation
reasoning
logit steering
teacher-student mismatch
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy self-distillation
logit steering
outcome-guided calibration
reasoning enhancement
LLM distillation