Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing feature-based knowledge distillation (KD) methods still rely on logit-level losses (e.g., cross-entropy), hindering effective transfer of intermediate-layer feature knowledge. Method: We propose the first purely feature-driven KD framework that completely eliminates logit supervision. Instead, it trains student backbone networks via intermediate-feature alignment and geometric analysis of latent-space representations. We introduce a novel metric to quantitatively assess feature knowledge quality, enabling adaptive selection of optimal teacher layers; additionally, we design a distribution-aware alignment loss grounded in feature geometry to enhance representation consistency. Contribution/Results: Our method achieves significant improvements over state-of-the-art approaches on three image classification benchmarks, with up to 15% absolute Top-1 accuracy gain. It is the first work to empirically validate both the feasibility and superiority of high-fidelity feature knowledge transfer without any logit-level supervision.
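
To make the training setup concrete, below is a minimal sketch of what one purely feature-driven distillation step could look like. The linear projector, the MSE alignment term, and the choice of a single teacher layer are assumptions made for illustration; they are not the paper's distribution-aware alignment loss or its layer-selection procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignKD(nn.Module):
    """Logit-free distillation sketch: align student features to teacher features."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # A linear projector maps student features into the teacher's feature space
        # (the projector choice is an assumption of this sketch, not the paper's design).
        self.projector = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # Only a feature-alignment term is optimized; no cross-entropy or other
        # logit-based loss appears anywhere in the objective.
        projected = self.projector(student_feats)
        return F.mse_loss(projected, teacher_feats.detach())

# Usage sketch: the student backbone is trained from the feature loss alone, e.g.
# loss = FeatureAlignKD(512, 768)(student_backbone(x), teacher_backbone(x)); loss.backward()
```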

📝 Abstract
Knowledge distillation (KD) methods can transfer the knowledge of a parameter-heavy teacher model to a lightweight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate-layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross-entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate that our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
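
The paper defines its knowledge quality metric from the geometry of latent representations. As a rough, hypothetical stand-in, the sketch below scores candidate teacher layers by the effective rank of their centered feature matrix and picks the highest-scoring layer; this effective-rank proxy is an assumption made only to illustrate the idea of ranking layers, not the paper's actual metric.

```python
import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> float:
    # features: (num_samples, feature_dim) activations collected from one teacher layer.
    centered = features - features.mean(dim=0, keepdim=True)
    singular_values = torch.linalg.svdvals(centered)
    p = singular_values / (singular_values.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))  # higher = geometrically richer representation

def select_teacher_layer(layer_features: dict) -> str:
    # Rank candidate teacher layers by the proxy score and return the best one.
    return max(layer_features, key=lambda name: effective_rank(layer_features[name]))
```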
Problem

Research questions and friction points this paper is trying to address.

Proposes logit-free feature distillation to overcome limitations of logit-based losses
Introduces knowledge quality metric to identify optimal teacher layers for transfer
Achieves state-of-the-art accuracy improvements across diverse neural network architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exclusive feature-based losses replace logit losses
Knowledge quality metric identifies optimal teacher layers
Framework achieves state-of-the-art distillation performance
Nicholas Cooper
University of Colorado Boulder
Lijun Chen
University of Colorado at Boulder
Optimization and control of networked systems, Computer networks, Power networks, Optimization, Game theory
Sailesh Dwivedy
University of Colorado Boulder
D. Gurari
University of Colorado Boulder