Discriminative and Consistent Representation Distillation

📅 2024-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient discriminability of student models and the structural inconsistency between teacher and student representations in knowledge distillation, this paper proposes a joint distillation framework that optimizes discriminability and consistency together. Methodologically, it introduces a contrastive loss to enhance inter-class separability, couples it with a distributional consistency regularizer that aligns latent-space structures, and adds learnable temperature and bias parameters that dynamically balance the two objectives, removing the reliance on fixed hyperparameters. Evaluated on CIFAR-100 and ImageNet, the method achieves state-of-the-art performance, with student models sometimes surpassing teacher accuracy. Generalization is further validated via cross-dataset transfer to Tiny ImageNet and STL-10. The core contribution is the first unified formulation of discriminative modeling and structural-consistency modeling under a learnable trade-off mechanism.

📝 Abstract
Knowledge Distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. While contrastive learning has shown promise in self-supervised learning by creating discriminative representations, its application in knowledge distillation remains limited and focuses primarily on discrimination, neglecting the structural relationships captured by the teacher model. To address this limitation, we propose Discriminative and Consistent Distillation (DCD), which employs a contrastive loss along with a consistency regularization to minimize the discrepancy between the distributions of teacher and student representations. Our method introduces learnable temperature and bias parameters that adapt during training to balance these complementary objectives, replacing the fixed hyperparameters commonly used in contrastive learning approaches. Through extensive experiments on CIFAR-100 and ImageNet ILSVRC-2012, we demonstrate that DCD achieves state-of-the-art performance, with the student model sometimes surpassing the teacher's accuracy. Furthermore, we show that DCD's learned representations exhibit superior cross-dataset generalization when transferred to Tiny ImageNet and STL-10.
Problem

Research questions and friction points this paper is trying to address.

Enhancing knowledge transfer efficiency
Balancing discriminative and structural learning
Improving cross-dataset generalization capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive loss with consistency regularization
Learnable temperature and bias parameters
Superior cross-dataset generalization capability
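The combination named above (a contrastive term plus a distributional consistency term, balanced by learnable temperature and bias) can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the paper's actual implementation: the function name `dcd_loss_sketch`, the InfoNCE-style contrastive term, and the KL-based consistency term are assumptions; in DCD the `log_temp` and `bias` scalars would be trained jointly with the student, whereas here they are fixed inputs.

```python
import numpy as np

def _softmax_rows(x):
    # numerically stable row-wise softmax
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def dcd_loss_sketch(student, teacher, log_temp=0.0, bias=0.0):
    """Hypothetical sketch of a DCD-style objective.

    student, teacher: (N, D) L2-normalized embeddings for the same batch.
    log_temp, bias: stand-ins for the paper's learnable scalars
    (parameterized via log_temp so the temperature stays positive).
    """
    temp = np.exp(log_temp)

    # Contrastive (discriminative) term: each student embedding should
    # match its own teacher embedding against the rest of the batch.
    logits = student @ teacher.T / temp + bias            # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(np.diag(log_probs))

    # Consistency term: align the student's intra-batch similarity
    # distribution with the teacher's (KL divergence per row).
    p_t = _softmax_rows(teacher @ teacher.T / temp)
    p_s = _softmax_rows(student @ student.T / temp)
    consistency = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1))

    return contrastive + consistency
```

A perfectly aligned student (identical embeddings) should score lower than a misaligned one, since the diagonal dominates the contrastive term and the consistency KL vanishes; in training, gradients through `log_temp` and `bias` would adjust the trade-off between the two terms automatically.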