🤖 AI Summary
This work identifies the fundamental cause of feature-map knowledge distillation failure in Vision Transformers (ViTs): although teacher models exhibit globally low-rank feature representations, individual tokens encode information across most channels, creating a channel-capacity mismatch between wide (teacher) and narrow (student) models. We introduce a token-level spectral energy analysis that reveals, for the first time, this coexistence of low-rankness and high-bandwidth encoding in ViT features. Building on this insight, we propose two lightweight mismatch-mitigation strategies: (i) a lightweight linear projector that lifts student features to the teacher's width and is retained during inference, and (ii) native width alignment that widens only the student's final block to the teacher's width. Our approach relies solely on layer-wise singular value decomposition (SVD) and spectral analysis, requiring no additional complex modules. On ImageNet-1K, DeiT-Tiny distilled from CaiT-S24 gains substantially in accuracy, from 74.86% to 78.23%, while standalone students trained without a teacher also improve significantly. This reestablishes the effectiveness and practicality of simple feature distillation for ViTs.
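The first mitigation strategy, a lightweight projector kept at inference, might be sketched as follows. This is a minimal illustrative reading, not the paper's implementation: the class name `FeatureLiftProjector`, the random initialization, and the plain MSE feature loss are assumptions; only the idea of a single linear map from student width to teacher width, retained after training, comes from the text.

```python
import numpy as np

class FeatureLiftProjector:
    """Hypothetical sketch: one linear map lifting narrow student
    features (d_student channels) to the teacher's width (d_teacher),
    used for the feature-distillation loss and kept at inference."""

    def __init__(self, d_student, d_teacher, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled random init; in practice this matrix would be learned.
        self.W = rng.standard_normal((d_student, d_teacher)) / np.sqrt(d_student)

    def __call__(self, feats):
        # feats: (tokens, d_student) -> (tokens, d_teacher)
        return feats @ self.W

def feature_kd_loss(student_feats, teacher_feats, projector):
    """Plain MSE between lifted student features and teacher features
    (an assumed, standard choice of feature-alignment loss)."""
    lifted = projector(student_feats)
    return float(np.mean((lifted - teacher_feats) ** 2))

# Example widths: DeiT-Tiny uses 192 channels, CaiT-S24 uses 384.
proj = FeatureLiftProjector(d_student=192, d_teacher=384)
student = np.zeros((197, 192))   # 197 tokens (CLS + 14x14 patches)
teacher = np.zeros((197, 384))
print(proj(student).shape)        # (197, 384)
print(feature_kd_loss(student, teacher, proj))
```

Because the projector is a single matrix multiply, retaining it at inference adds negligible cost relative to the transformer blocks themselves.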
📝 Abstract
Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
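The two diagnostic views above can be sketched in a few lines of NumPy. This is a plausible reading under stated assumptions, not the paper's code: `rank_for_energy` reproduces the global SVD energy-rank measurement, while `token_bandwidth` is one assumed formalization of the SEP idea (counting, per token, how many channels hold a given fraction of that token's squared-magnitude energy).

```python
import numpy as np

def rank_for_energy(F, fractions=(0.99, 0.95, 0.90, 0.80)):
    """Global view: for a feature matrix F (tokens x channels), the
    number of singular directions needed to capture each cumulative
    energy fraction (energy = squared singular values)."""
    s = np.linalg.svd(F, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return {f: int(np.searchsorted(cum, f) + 1) for f in fractions}

def token_bandwidth(F, threshold=0.90):
    """Token view (an assumed SEP variant): per token, how many
    channels are needed to hold `threshold` of that token's energy.
    Large counts indicate high-bandwidth encoding."""
    e = np.sort(F**2, axis=1)[:, ::-1]               # per-token energies, descending
    cum = np.cumsum(e, axis=1) / e.sum(axis=1, keepdims=True)
    return (cum < threshold).sum(axis=1) + 1          # channels per token

# Toy demo: a globally rank-32 feature map with 384 channels. The
# global rank is small, yet each token can still spread its energy
# over many channels -- the "low-rank yet high-bandwidth" pattern.
rng = np.random.default_rng(0)
F = rng.standard_normal((197, 32)) @ rng.standard_normal((32, 384))
print(rank_for_energy(F))     # all values bounded by 32
print(token_bandwidth(F).mean())
```

Run on real final-layer features of CaiT-S24, `rank_for_energy` would be expected to recover the 121/61/34/14 figures quoted above, while `token_bandwidth` would stay high, exposing the mismatch that motivates the two remedies.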