Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Feature-based knowledge distillation (KD) fails in Vision Transformers (ViTs) because of a representational paradigm mismatch between teacher and student models along ViTs' intrinsic U-shaped information-processing path, inducing negative transfer. Method: the authors propose a "distillation dynamics" analytical framework that integrates spectral analysis, information entropy, and activation-magnitude tracking to quantify teacher–student representation discrepancies in the frequency domain, the first such framework for ViTs. Contribution/Results: empirical analysis shows that naive feature imitation ignores ViTs' architectural capacity constraints and dynamic representation evolution, underperforming even logit-level KD. Based on these insights, the authors derive principled, architecture-aware distillation design guidelines. The work provides both theoretical grounding and practical optimization strategies for efficient ViT compression, bridging the gap between KD theory and ViT-specific practice.

📝 Abstract
While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided in the supplementary material.
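The abstract's three diagnostics (frequency spectrum, information entropy, activation magnitude) can be illustrated on a single ViT feature map. The sketch below is my own NumPy reconstruction under simple assumptions, not the authors' released code; the function names and the binning/frequency-split choices are hypothetical.

```python
import numpy as np

def spectral_highfreq_ratio(feat):
    """Fraction of spectral energy in the upper half of the token-axis
    spectrum. feat: (tokens, channels) feature map from one ViT layer.
    A rising ratio in late layers would reflect the distributed,
    high-frequency encoding the paper attributes to large teachers."""
    spec = np.abs(np.fft.rfft(feat, axis=0)) ** 2  # per-channel power spectrum
    half = spec.shape[0] // 2
    return float(spec[half:].sum() / spec.sum())

def feature_entropy(feat, bins=32):
    """Shannon entropy (bits) of the activation-value histogram."""
    hist, _ = np.histogram(feat, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def activation_magnitude(feat):
    """Mean L2 norm of token embeddings."""
    return float(np.linalg.norm(feat, axis=1).mean())

# Toy feature map shaped like ViT-S: 196 patch tokens + CLS, 384 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((197, 384))
print(spectral_highfreq_ratio(feat), feature_entropy(feat), activation_magnitude(feat))
```

Tracking these three scalars per layer for both teacher and student is one way to visualize the U-shaped compression-then-expansion pattern the paper describes.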
Problem

Research questions and friction points this paper is trying to address.

Analyzes why feature distillation fails in Vision Transformers despite succeeding in CNNs
Identifies a representational mismatch between teacher and student ViT models
Proposes distillation methods that respect representational constraints for effective ViT compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing ViT distillation via frequency spectrum and entropy
Identifying U-shaped information processing pattern in ViTs
Addressing representational mismatch through constraint-aware distillation
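The contrast the paper draws between naive feature mimicry and logit-level KD can be sketched as two losses. This is a generic illustration, not the paper's formulation: the linear projection `W` bridging student and teacher channel widths and the temperature value are my assumptions.

```python
import numpy as np

def feature_mimic_loss(f_s, f_t, W):
    """Naive feature imitation: MSE after projecting student features up to
    the teacher's channel width. This is the late-layer alignment the paper
    argues can cause negative transfer when capacities differ."""
    return float(((f_s @ W - f_t) ** 2).mean())

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_kd_loss(z_s, z_t, T=4.0):
    """Logit-level KD: KL(teacher || student) on temperature-softened
    logits, scaled by T^2 as in standard distillation practice."""
    p_t, p_s = softmax(z_t / T), softmax(z_s / T)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(1)
f_s = rng.standard_normal((197, 384))             # student features (ViT-S width)
f_t = rng.standard_normal((197, 768))             # teacher features (ViT-B width)
W = rng.standard_normal((384, 768)) / np.sqrt(384)
z_s, z_t = rng.standard_normal((8, 1000)), rng.standard_normal((8, 1000))
print(feature_mimic_loss(f_s, f_t, W), logit_kd_loss(z_s, z_t))
```

The feature loss forces the narrower student to match every coordinate of the teacher's wider encoding, whereas the logit loss only constrains the output distribution, which is one intuition for why the latter can transfer better under a capacity mismatch.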
Huiyuan Tian
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Bonan Xu
Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University
Shijian Li
Zhejiang University
Research interests: pervasive computing, human-computer interaction, artificial intelligence