From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses ambiguous credit assignment in existing on-policy self-distillation methods, whose token-level rewards do not distinguish input-specific reasoning from input-generic shortcuts. Under a posterior-compatibility interpretation of feedback conditioning, the authors show that the token reward is a Bayesian filtering increment whose sum over the trajectory equals the pointwise mutual information (pMI) between the response and the feedback given the input. To resolve the ambiguity, they decompose the teacher log-probability along the input axis and propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT acts as a teacher-side surrogate for a contrastive pMI objective that also penalizes responses that remain likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
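The idea of a batch-contrastive baseline can be illustrated with a minimal Python sketch. This is not the paper's exact formula, which this page does not give. It assumes a hypothetical array `teacher_logp[i][j][t]` holding the teacher log-probability of token `t` of response `i` when the teacher is conditioned on input `j` from the same batch (feedback held fixed), and it assumes the baseline is the arithmetic mean over the mismatched inputs. The function name and the choice of mean are illustrative assumptions.

```python
from typing import List


def contrastive_token_credit(
    teacher_logp: List[List[List[float]]],
) -> List[List[float]]:
    """Illustrative input-specific credit (assumed form, not the paper's).

    teacher_logp[i][j][t] is a hypothetical teacher log-probability of
    token t of response i when the teacher sees input j; j == i is the
    matched input, every other j is an unrelated input from the batch.
    """
    batch = len(teacher_logp)
    if batch < 2:
        raise ValueError("need at least two inputs to form a contrastive baseline")

    credit: List[List[float]] = []
    for i, per_input in enumerate(teacher_logp):
        matched = per_input[i]
        others = [per_input[j] for j in range(batch) if j != i]
        row = []
        for t, logp in enumerate(matched):
            # Assumed baseline: mean log-prob of the same token under
            # the other inputs in the batch.
            baseline = sum(o[t] for o in others) / len(others)
            # A token that is equally likely under unrelated inputs
            # receives zero credit.
            row.append(logp - baseline)
        credit.append(row)
    return credit


# Toy batch of two inputs with two-token responses.
example = [
    [[-0.5, -1.0], [-0.5, -3.0]],  # response 0 under input 0, input 1
    [[-2.0, -2.0], [-1.0, -0.5]],  # response 1 under input 0, input 1
]
credit = contrastive_token_credit(example)
```

In the toy batch, the first token of response 0 is equally likely under both inputs, so it gets zero credit, while the second token is much more likely under its own input and gets positive credit. That is the intended distinction between input-generic and input-specific tokens.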

📝 Abstract

On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
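The telescoping claim in the abstract can be reconstructed as a short derivation. The notation is an assumption, since this page gives no formulas: $x$ is the input, $y = (y_1, \dots, y_T)$ the response, $f$ the feedback, and the token reward is taken to be the usual teacher-minus-student log-ratio. Posterior compatibility is read here as the feedback-conditioned model matching the Bayes posterior of the unconditioned model.

```latex
\begin{align*}
r_t &= \log \pi(y_t \mid x, f, y_{<t}) - \log \pi(y_t \mid x, y_{<t})
  && \text{(assumed token reward)} \\
\pi(y_t \mid x, f, y_{<t})
  &= \frac{\pi(y_t \mid x, y_{<t})\, p(f \mid x, y_{\le t})}{p(f \mid x, y_{<t})}
  && \text{(posterior compatibility, Bayes' rule)} \\
\Rightarrow\quad r_t &= \log p(f \mid x, y_{\le t}) - \log p(f \mid x, y_{<t})
  && \text{(filtering increment)} \\
\sum_{t=1}^{T} r_t &= \log p(f \mid x, y) - \log p(f \mid x)
  && \text{(telescoping sum)} \\
  &= \log \frac{p(y, f \mid x)}{p(y \mid x)\, p(f \mid x)}
  = \mathrm{pMI}(y; f \mid x).
\end{align*}
```

Under these assumptions each token reward measures how much that token changes the predicted likelihood of the feedback. Nothing in the sum indicates whether the change depends on $x$ specifically, which is the ambiguity the input-axis decomposition targets.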

Problem

Research questions and friction points this paper is trying to address.

on-policy self-distillation

credit assignment

input-specific reasoning

pointwise mutual information

reward interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy self-distillation

pointwise mutual information

input-specific credit