🤖 AI Summary
This work addresses ambiguous credit assignment in existing on-policy self-distillation methods, whose token-level rewards fail to distinguish input-specific reasoning from input-generic shortcuts. Under a posterior-compatibility interpretation, the authors show that these token rewards sum over a trajectory to the pointwise mutual information (pMI) between the response and the feedback given the input. To resolve the ambiguity, they propose CREDIT, which decomposes the teacher log-probability along the input axis and uses a batch-contrastive baseline to isolate the input-specific component. At the sequence level, CREDIT acts as a teacher-side surrogate for a contrastive pMI objective that also penalizes responses that remain likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
📝 Abstract
On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information (pMI) between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
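The two ideas in the abstract can be illustrated with a small numeric sketch. It is not the authors' implementation: it assumes the token reward takes the common log-ratio form (feedback-conditioned teacher log-probability minus unconditioned student log-probability), and it assumes the batch-contrastive baseline is a plain mean of the teacher log-probabilities obtained by conditioning on the other inputs in the batch. The abstract does not specify either form, and the function names `token_rewards` and `contrastive_credit` are hypothetical.

```python
from typing import List


def token_rewards(teacher_lp: List[float], student_lp: List[float]) -> List[float]:
    """Per-token reward, ASSUMED to be the log-ratio
    log pi(y_t | x, f, y_<t) - log pi(y_t | x, y_<t)."""
    return [t - s for t, s in zip(teacher_lp, student_lp)]


def contrastive_credit(teacher_lp_by_input: List[List[float]], own: int) -> List[float]:
    """Input-specific credit per token (hypothetical form).

    teacher_lp_by_input[j][t] is the teacher log-prob of token t of ONE
    response when the teacher is conditioned on batch input j; `own` is the
    index of the input that actually produced the response.
    The baseline is ASSUMED to be the mean over the other inputs.
    """
    others = [row for j, row in enumerate(teacher_lp_by_input) if j != own]
    if not others:
        raise ValueError("need at least one other input in the batch")
    own_row = teacher_lp_by_input[own]
    return [
        own_row[t] - sum(row[t] for row in others) / len(others)
        for t in range(len(own_row))
    ]


# Toy three-token response.
teacher = [-0.5, -1.0, -0.25]   # conditioned on input and feedback
student = [-1.0, -1.5, -0.25]   # conditioned on input only
rewards = token_rewards(teacher, student)
# The per-token log-ratios sum to the sequence-level log-ratio
# log pi(y | x, f) - log pi(y | x), which is the pMI in this setup.
sequence_pmi = sum(rewards)

# Same response scored by the teacher under three batch inputs (row 0 is its own).
by_input = [
    [-0.5, -1.0, -0.25],
    [-2.5, -1.0, -0.25],
    [-1.5, -1.0, -0.25],
]
credit = contrastive_credit(by_input, own=0)
```

In this toy batch only the first token becomes less likely when the input is swapped, so it is the only token that receives nonzero contrastive credit. The last two tokens are equally likely under every input and are treated as input-generic.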