Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Feature-based knowledge distillation (KD) fails in Vision Transformers (ViTs) because of a representational paradigm mismatch between teacher and student models along ViTs' intrinsic U-shaped information-processing path, inducing negative transfer. Method: the authors propose a "distillation dynamics" analytical framework that integrates spectral analysis, information entropy, and activation-magnitude tracking to quantify teacher–student representation discrepancies in the frequency domain, the first such framework for ViTs. Contribution/Results: empirical analysis shows that naive feature imitation ignores ViTs' architectural capacity constraints and dynamic representation evolution, underperforming even logit-level KD. Based on these insights, the authors derive principled, architecture-aware distillation design guidelines. The work provides both theoretical grounding and practical optimization strategies for efficient ViT compression, bridging the gap between KD theory and ViT-specific practice.

📝 Abstract
While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided in the supplementary material.
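The abstract's three diagnostics (frequency spectrum, information entropy, activation magnitude) can be illustrated on a single ViT feature map. The sketch below is my own NumPy reconstruction under simple assumptions, not the authors' released code; the function names and the binning/frequency-split choices are hypothetical.

```python
import numpy as np

def spectral_highfreq_ratio(feat):
    """Fraction of spectral energy in the upper half of the token-axis
    spectrum. feat: (tokens, channels) feature map from one ViT layer.
    A rising ratio in late layers would reflect the distributed,
    high-frequency encoding the paper attributes to large teachers."""
    spec = np.abs(np.fft.rfft(feat, axis=0)) ** 2  # per-channel power spectrum
    half = spec.shape[0] // 2
    return float(spec[half:].sum() / spec.sum())

def feature_entropy(feat, bins=32):
    """Shannon entropy (bits) of the activation-value histogram."""
    hist, _ = np.histogram(feat, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def activation_magnitude(feat):
    """Mean L2 norm of token embeddings."""
    return float(np.linalg.norm(feat, axis=1).mean())

# Toy feature map shaped like ViT-S: 196 patch tokens + CLS, 384 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((197, 384))
print(spectral_highfreq_ratio(feat), feature_entropy(feat), activation_magnitude(feat))
```

Tracking these three scalars per layer for both teacher and student is one way to visualize the U-shaped compression-then-expansion pattern the paper describes.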
Problem

Research questions and friction points this paper is trying to address.

Analyzes why feature distillation fails in Vision Transformers despite succeeding in CNNs
Identifies a representational mismatch between teacher and student ViT models
Proposes distillation methods that respect representational constraints for effective ViT compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing ViT distillation via frequency spectrum and entropy
Identifying U-shaped information processing pattern in ViTs
Addressing representational mismatch through constraint-aware distillation
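The contrast the paper draws between naive feature mimicry and logit-level KD can be sketched as two losses. This is a generic illustration, not the paper's formulation: the linear projection `W` bridging student and teacher channel widths and the temperature value are my assumptions.

```python
import numpy as np

def feature_mimic_loss(f_s, f_t, W):
    """Naive feature imitation: MSE after projecting student features up to
    the teacher's channel width. This is the late-layer alignment the paper
    argues can cause negative transfer when capacities differ."""
    return float(((f_s @ W - f_t) ** 2).mean())

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_kd_loss(z_s, z_t, T=4.0):
    """Logit-level KD: KL(teacher || student) on temperature-softened
    logits, scaled by T^2 as in standard distillation practice."""
    p_t, p_s = softmax(z_t / T), softmax(z_s / T)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(1)
f_s = rng.standard_normal((197, 384))             # student features (ViT-S width)
f_t = rng.standard_normal((197, 768))             # teacher features (ViT-B width)
W = rng.standard_normal((384, 768)) / np.sqrt(384)
z_s, z_t = rng.standard_normal((8, 1000)), rng.standard_normal((8, 1000))
print(feature_mimic_loss(f_s, f_t, W), logit_kd_loss(z_s, z_t))
```

The feature loss forces the narrower student to match every coordinate of the teacher's wider encoding, whereas the logit loss only constrains the output distribution, which is one intuition for why the latter can transfer better under a capacity mismatch.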
Huiyuan Tian
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Bonan Xu
Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University
Shijian Li
Zhejiang University
Research interests: pervasive computing, human-computer interaction, artificial intelligence