🤖 AI Summary
Existing reinforcement learning (RL) post-training methods rely mainly on sparse outcome-level rewards and overlook the predictive signals encoded in intermediate representations. To address this, the paper introduces on-policy internal self-distillation (OISD), in which the policy's own final layer serves as a detached internal teacher during rollout and optimization. By aligning logits and attention patterns between the final layer and selected intermediate layers, the method propagates high-level predictive signals downward without external supervision, improving intermediate representations while preserving policy consistency. Combined with Group Relative Policy Optimization (GRPO) and a signed advantage-weighted Jensen–Shannon alignment mechanism, the approach reports substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning benchmarks.
📝 Abstract
Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers. These layers are guided to align with the final layer through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look). Neither mechanism requires external privileged information. Combined with GRPO, OISD employs signed advantage-weighted Jensen–Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD
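The abstract does not spell out the alignment objective, so the following is only a minimal sketch of one plausible reading of "signed advantage-weighted Jensen–Shannon alignment" for the logit branch. It assumes that each sampled response receives a GRPO-style group-standardized advantage, and that this signed advantage multiplies the per-token Jensen–Shannon divergence between the final-layer (teacher) and intermediate-layer (student) token distributions. All function names are hypothetical, and the exact weighting used by OISD may differ. The sketch is plain Python with no autograd; in a training framework the teacher distribution would additionally be detached from the gradient graph.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def weighted_alignment_loss(teacher_logits, student_logits, advantages):
    """Assumed form of the logit-alignment term (not taken from the paper).

    teacher_logits[i][t] and student_logits[i][t] are vocabulary logits for
    token t of response i, read from the final layer and from a selected
    intermediate layer respectively. Each token's JS divergence is scaled by
    the signed advantage of its response and averaged over all tokens.
    """
    total, count = 0.0, 0
    for t_seq, s_seq, adv in zip(teacher_logits, student_logits, advantages):
        for t_logit, s_logit in zip(t_seq, s_seq):
            # In a real framework the teacher side would be stop-gradiented.
            total += adv * js_divergence(softmax(t_logit), softmax(s_logit))
            count += 1
    return total / max(count, 1)

# Toy usage: a group of four responses with binary outcome rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
teacher = [[[2.0, 0.0]], [[2.0, 0.0]], [[0.0, 2.0]], [[0.0, 2.0]]]
student = [[[1.0, 1.0]], [[1.0, 1.0]], [[1.0, 1.0]], [[1.0, 1.0]]]
loss = weighted_alignment_loss(teacher, student, advantages)
```

Under this assumed form, minimizing the loss pulls the intermediate layer toward the final layer on responses with positive advantage and pushes it away on responses with negative advantage. An attention-alignment term could be sketched the same way by replacing the token distributions with attention rows.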