Skill-Conditioned Gated Self-Distillation for LLM Reasoning

2026-05-27
Citations: 0
Influential: 0
AI Summary
This work proposes Skill-Conditioned Gated Self-Distillation (SGSD), which uses a noisy, experience-derived skill bank as the source of privileged information instead of assuming trusted reference answers or successful traces. SGSD retrieves skill-mistake pairs, builds a multi-teacher pool whose skill-conditioned teachers all score the same student rollout, and uses the verifier to validate each teacher's polarity. A gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average across AIME24, AIME25, and HMMT25, while relying on a weaker privileged-information assumption than answer-conditioned OPSD.
Abstract
On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
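The polarity validation and gating described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of per-token log-probability gaps, the rule that a teacher "supports" a rollout when its summed gap is positive, and the thresholds `low` and `high` are all assumptions made for illustration only.

```python
from typing import List


def polarity_sign(teacher_supports: bool, rollout_correct: bool) -> int:
    """Return +1 when the teacher's stance agrees with the verifier
    (supports a success or suppresses a failure), otherwise -1."""
    return 1 if teacher_supports == rollout_correct else -1


def gated_token_signals(
    teacher_logps: List[float],
    student_logps: List[float],
    rollout_correct: bool,
    low: float = 0.1,   # hypothetical threshold: smaller gaps count as uncertain
    high: float = 5.0,  # hypothetical threshold: larger gaps count as extreme
) -> List[float]:
    """Per-token supervision for one skill-conditioned teacher (illustrative)."""
    # Teacher-student disagreement on each token of the same student rollout.
    gaps = [t - s for t, s in zip(teacher_logps, student_logps)]
    # Assumption: the teacher supports the rollout if it rates it higher overall.
    teacher_supports = sum(gaps) > 0
    sign = polarity_sign(teacher_supports, rollout_correct)
    # Gate: keep only gaps inside [low, high]; zero out the rest.
    return [sign * g if low <= abs(g) <= high else 0.0 for g in gaps]


teacher = [-1.0, -0.5, -2.0]
student = [-1.5, -0.52, -10.0]
print(gated_token_signals(teacher, student, rollout_correct=True))
print(gated_token_signals(teacher, student, rollout_correct=False))
```

In this toy input the teacher rates the rollout higher than the student does. When the verifier marks the rollout correct, the one informative gap is kept with its sign; when the verifier marks it incorrect, the same gap is reversed. The near-zero gap and the very large gap are gated out in both cases.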
Problem

Research questions and friction points this paper is trying to address.

self-distillation
privileged information
skill bank
LLM reasoning
teacher-student learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Skill-Conditioned Self-Distillation
Privileged Information
Gated Distillation
LLM Reasoning
Teacher Hypothesis Validation