From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies the fundamental cause of feature-map knowledge distillation failure in Vision Transformers (ViTs): although teacher models exhibit globally low-rank feature representations, their token-level high-bandwidth encoding induces a channel-capacity mismatch between wide (teacher) and narrow (student) models. To expose this, the authors introduce a token-level spectral energy analysis, revealing for the first time the coexistence of low-rankness and high-bandwidth encoding in ViT features. Building on this insight, they propose two lightweight mismatch-mitigation strategies: (i) a lightweight projector that lifts student features and is retained at inference time, and (ii) native width alignment at the student's final layer. The approach relies solely on layer-wise singular value decomposition (SVD) and spectral analysis, requiring no complex auxiliary modules. On ImageNet-1K, DeiT-Tiny achieves a substantial accuracy gain, from 74.86% to 78.23%, while teacher-free student performance also improves significantly. This reestablishes the effectiveness and practicality of simple feature distillation for ViTs.

📝 Abstract
Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only 121/61/34/14 dimensions suffice to capture 99%/95%/90%/80% of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from 74.86% to 77.53% and 78.23% when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
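The layer-wise SVD energy analysis in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the synthetic low-rank feature matrix, the token/channel shapes, and the `effective_rank` helper are assumptions chosen only to show how "k dimensions capture a given fraction of the energy" would be computed.

```python
import numpy as np

def effective_rank(features: np.ndarray, energy_fraction: float) -> int:
    """Smallest number of singular directions whose squared singular
    values capture at least `energy_fraction` of the total energy."""
    s = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy_fraction) + 1)

# Illustrative feature matrix (tokens x channels) with a planted low rank.
rng = np.random.default_rng(0)
tokens, channels, true_rank = 197, 384, 32
features = (rng.standard_normal((tokens, true_rank))
            @ rng.standard_normal((true_rank, channels)))

for frac in (0.99, 0.95, 0.90, 0.80):
    print(f"{frac:.0%} of energy needs {effective_rank(features, frac)} dims")
```

On real ViT features this kind of sweep is what yields counts like 121/61/34/14 dimensions for 99%/95%/90%/80% of the energy.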
Problem

Research questions and friction points this paper is trying to address.

Analyzing why feature distillation fails in Vision Transformers despite low-rank structure
Identifying encoding mismatch between wide teachers and narrow students as key issue
Proposing minimal strategies to reactivate feature distillation for compact ViTs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses token-level spectral energy pattern analysis
Proposes post-hoc feature lifting with lightweight projector
Implements native width alignment in last block
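The post-hoc feature-lifting idea above can be sketched in a few lines. This is a hedged illustration under assumed widths (192 for DeiT-Tiny, 384 for CaiT-S24): the closed-form least-squares fit stands in for the projector, which in the actual method would be trained jointly with the student and kept at inference time.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens, d_student, d_teacher = 197, 192, 384  # assumed DeiT-Tiny / CaiT-S24 widths

student_feats = rng.standard_normal((tokens, d_student))
teacher_feats = rng.standard_normal((tokens, d_teacher))

# Lightweight linear projector lifting student features to teacher width.
# Fitted here in closed form for illustration only.
W, *_ = np.linalg.lstsq(student_feats, teacher_feats, rcond=None)
lifted = student_feats @ W  # (tokens, d_teacher)

# Feature-alignment objective: mean squared error against teacher features.
mse = np.mean((lifted - teacher_feats) ** 2)
print(lifted.shape, mse)
```

The alternative strategy, native width alignment, would instead widen only the student's last block to `d_teacher`, removing the need for any projector at all.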
Huiyuan Tian
Zhejiang University
Bonan Xu
The Hong Kong Polytechnic University
Shijian Li
Zhejiang University
Pervasive computing, human-computer interaction, artificial intelligence
Xin Jin
GenPi Inc.