Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing self-distillation methods that leverage privileged context perform inconsistently on mathematical reasoning and struggle to improve multi-step reasoning. Through a pointwise mutual information analysis, this work finds that privileged context distorts the teacher model's confidence: it inflates confidence on tokens already implied by the solution and deflates it on the deliberation tokens that drive multi-step search. To address this, the authors propose Anti-Self-Distillation (AntiSD), which ascends rather than descends a divergence between the student and teacher output distributions, coupled with an entropy-triggered gate that disables the term once teacher entropy collapses. Requiring no stronger external teacher, the method reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps across five models from 4B to 30B parameters and improves final accuracy by up to 11.5 points.

📝 Abstract

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
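
The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the per-token pointwise mutual information is the teacher's log-probability (conditioned on privileged context) minus the student's log-probability for the sampled token. The function names, the `entropy_floor` threshold, and the `clip` bound are hypothetical; in particular, the hard clip is only a stand-in for the "naturally bounded" advantage the abstract mentions, whose exact form is not given here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_pmi(teacher_logp, student_logp):
    """PMI between the sampled token and the privileged context:
    log p(token | prefix, context) - log p(token | prefix)."""
    return teacher_logp - student_logp


def antisd_token_signal(teacher_probs, student_probs, token_id,
                        entropy_floor=0.05, clip=1.0):
    """Hypothetical per-token shaping term (illustration only).

    Default self-distillation would reward tokens with positive PMI,
    pulling the student toward the teacher. This sketch flips the sign,
    so tokens the privileged context makes *less* likely are rewarded.
    Both distributions must assign nonzero probability to token_id.
    """
    # Entropy-triggered gate: drop the term once the teacher's
    # distribution has collapsed (assumed threshold, in nats).
    if entropy(teacher_probs) < entropy_floor:
        return 0.0
    pmi = token_pmi(math.log(teacher_probs[token_id]),
                    math.log(student_probs[token_id]))
    # Sign reversal, then an assumed hard clip to keep the term bounded.
    return max(-clip, min(clip, -pmi))


# Toy vocabulary of three tokens; index 2 plays the role of a
# deliberation token whose probability the privileged context deflates.
student = [0.5, 0.3, 0.2]
teacher = [0.8, 0.15, 0.05]
print(antisd_token_signal(teacher, student, 0))  # negative: teacher-inflated token
print(antisd_token_signal(teacher, student, 2))  # positive: teacher-deflated token
```

In this toy setting the token the teacher inflates receives a negative term and the token it deflates receives a positive one, which matches the sign reversal the abstract describes. How this term combines with the GRPO reward is not specified here and is left out of the sketch.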

Problem

Research questions and friction points this paper is trying to address.

self-distillation

mathematical reasoning

pointwise mutual information

reasoning RL

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Anti-Self-Distillation

Pointwise Mutual Information

Reasoning Reinforcement Learning