UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of self-distillation in autoregressive large language models: self-generated trajectories are free-form, correctness is task-dependent, and the resulting supervision signal can be unstable. Existing approaches study design choices in isolation and lack a systematic account of how the key components work and interact. The authors propose UniSD, a unified self-distillation framework that integrates multi-teacher agreement, EMA-based teacher stabilization, token-level contrastive learning, hidden-feature matching, and KL divergence clipping to improve supervision reliability, representation alignment, and training stability. Experiments on six benchmarks and six models from three model families show that UniSDfull improves over the base model by 5.4 points and over the strongest baseline by 2.8 points.

📝 Abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
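
To make two of the listed mechanisms concrete, the following is a minimal sketch of an EMA teacher update and a clipped per-token divergence. It is illustrative only: the abstract does not give the formulas, so the function names (`ema_update`, `clipped_kl`), the decay and clip values, the KL direction (teacher to student), and the choice to cap the per-token value are all assumptions rather than the authors' method.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """Assumed EMA rule: move each teacher weight a small step toward the student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def clipped_kl(teacher_logits, student_logits, clip=2.0):
    """KL(teacher || student) at one token position, capped at `clip` (assumed form)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return min(kl, clip)

# Hypothetical toy values, not taken from the paper.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
loss_same = clipped_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_far = clipped_kl([10.0, 0.0, 0.0], [0.0, 0.0, 10.0], clip=2.0)
```

In this sketch the cap limits how much any single token with a large teacher-student disagreement can contribute to the loss, which is one plausible reading of how clipping supports training stability.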

Problem

Research questions and friction points this paper is trying to address.

self-distillation

large language models

supervision reliability

training stability

autoregressive models

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation

large language models

multi-teacher agreement