MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the catastrophic forgetting that supervised fine-tuning (SFT) induces when injecting new knowledge, which can severely degrade a model's reasoning and general-domain capabilities. To mitigate this, the authors propose MixSD, a self-distillation approach that requires no external teacher model. MixSD constructs supervision sequences dynamically by mixing tokens from two conditional distributions of the base model itself: an expert conditional that sees the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting targets keep the factual learning signal while staying close to the base model's own generation distribution. Across multiple model scales and settings, MixSD achieves a better memorization-retention trade-off than standard SFT and on-policy self-distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1% in some cases. The authors further show that MixSD targets have substantially lower NLL under the base model and that training on them reduces harmful movement along Fisher-sensitive parameter directions.

📝 Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
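
The token-mixing idea in the abstract can be illustrated with a small toy sketch. The abstract does not state the actual mixing rule, so the per-position coin flip with probability `mix_prob` below is an assumption, not the authors' procedure. The two conditionals are hand-written stand-ins rather than a real language model, and the names `expert_next`, `naive_next`, `mixed_target`, and `nll`, as well as the made-up fact "Zorvia", are hypothetical. The sketch builds one supervision sequence and scores it by its negative log-likelihood (NLL) under the naive conditional, the quantity the abstract reports as lower for MixSD targets.

```python
import math
import random

EOS = "<eos>"

def naive_next(prefix):
    """Toy stand-in for the base model WITHOUT the injected fact in context."""
    if len(prefix) == 0:
        return {"is": 1.0}
    if len(prefix) == 1:
        return {"Paris": 0.95, "Zorvia": 0.05}  # prior favors the old answer
    return {EOS: 1.0}

def expert_next(prefix):
    """Toy stand-in for the same model WITH the injected fact in context."""
    if len(prefix) == 0:
        return {"is": 1.0}
    if len(prefix) == 1:
        return {"Paris": 0.05, "Zorvia": 0.95}  # context favors the new fact
    return {EOS: 1.0}

def sample(dist, rng):
    """Draw one token from a {token: probability} dict."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def mixed_target(expert, naive, mix_prob, rng, max_len=8):
    """Build one target, picking a source conditional per position.

    ASSUMPTION: an independent coin flip per token; the paper's real
    mixing rule may differ.
    """
    prefix = []
    for _ in range(max_len):
        source = expert if rng.random() < mix_prob else naive
        token = sample(source(tuple(prefix)), rng)
        if token == EOS:
            break
        prefix.append(token)
    return prefix

def nll(seq, next_dist):
    """Negative log-likelihood of seq (EOS excluded) under a conditional."""
    return -sum(math.log(next_dist(tuple(seq[:i]))[tok])
                for i, tok in enumerate(seq))

rng = random.Random(0)
target = mixed_target(expert_next, naive_next, mix_prob=0.5, rng=rng)
print(target, nll(target, naive_next))
```

In this toy setup, `mix_prob=1.0` draws every token from the expert conditional and `mix_prob=0.0` draws every token from the naive one. Intermediate values trade off how often the injected fact appears against how likely the sequence is under the model's prior.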

Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting

knowledge injection

supervised fine-tuning

language models

distribution alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation

knowledge injection

catastrophic forgetting