🤖 AI Summary
Existing knowledge distillation methods neglect intra-class and inter-class structural relationships among samples, hindering the student model's acquisition of fine-grained knowledge representations from the teacher. To address this, we propose a context-aware knowledge distillation framework: it constructs a feature memory bank from the teacher and retrieves class-consistent (positive) and class-inconsistent (negative) in-context samples for each query, and, building on the view of KD as learned label smoothing regularization, analyzes how teacher knowledge from in-context samples regularizes student training. We design a dual-path mechanism, Positive In-Context Distillation (PICD) and Negative In-Context Distillation (NICD), to explicitly optimize intra-class compactness and inter-class separability in the logit space. Theoretically grounded and empirically robust, our method achieves state-of-the-art performance across offline, online, and teacher-free distillation settings on CIFAR-100 and ImageNet.
📝 Abstract
Conventional knowledge distillation (KD) approaches train the student model to predict outputs similar to the teacher model's for each individual sample. Unfortunately, the relationships among samples of the same class are often neglected. In this paper, we redefine the knowledge to be distilled, capturing the relationship between each sample and its corresponding in-context samples (a group of similar samples with the same or different classes), and perform KD from an in-context sample retrieval perspective. Since KD is a type of learned label smoothing regularization (LSR), we first conduct a theoretical analysis showing that the teacher's knowledge from in-context samples is a crucial contributor to regularizing the student's training on the corresponding samples. Supported by this analysis, we propose a novel in-context knowledge distillation (IC-KD) framework that shows its superiority across diverse KD paradigms (offline, online, and teacher-free KD). First, we construct a feature memory bank from the teacher model and retrieve in-context samples for each query sample through retrieval-based learning. We then introduce Positive In-Context Distillation (PICD) to reduce the discrepancy, in the logit space, between a sample from the student and the aggregated same-class in-context samples from the teacher. Moreover, Negative In-Context Distillation (NICD) is introduced to separate, in the logit space, a sample from the student and the different-class in-context samples from the teacher. Extensive experiments demonstrate that IC-KD is effective across various types of KD and consistently achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets.
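To make the retrieval-then-distill pipeline concrete, here is a minimal NumPy sketch of the two steps the abstract describes: retrieving in-context samples from a teacher feature memory bank by similarity, then computing a PICD-style pull term toward aggregated same-class teacher predictions and an NICD-style push term away from different-class ones. The similarity measure (cosine), the aggregation (mean of positive logits), the temperature, and the exact form of the negative term are all our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def retrieve_in_context(query_feat, bank_feats, bank_labels, k=4):
    """Retrieve the k most similar teacher features for one query.

    Cosine similarity is an assumed retrieval metric for this sketch.
    Returns the bank indices and their class labels.
    """
    q = query_feat / np.linalg.norm(query_feat)
    b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = b @ q
    idx = np.argsort(-sims)[:k]
    return idx, bank_labels[idx]

def ic_kd_losses(student_logits, teacher_bank_logits, idx, ctx_labels, label, T=4.0):
    """Hypothetical PICD/NICD terms for a single query sample.

    PICD pulls the student's softened prediction toward the aggregated
    same-class teacher distribution (a KL term); NICD penalizes affinity
    between the student's prediction and different-class teacher
    distributions, pushing them apart in the logit space.
    """
    s = softmax(student_logits, T)
    log_s = np.log(s + 1e-12)
    pos = idx[ctx_labels == label]   # same-class in-context samples
    neg = idx[ctx_labels != label]   # different-class in-context samples
    picd = nicd = 0.0
    if len(pos):
        # Aggregate positive in-context teacher logits, then KL(teacher || student).
        t_pos = softmax(teacher_bank_logits[pos].mean(axis=0), T)
        picd = np.sum(t_pos * (np.log(t_pos + 1e-12) - log_s)) * T * T
    if len(neg):
        # Minimizing this affinity drives the student away from negative contexts.
        t_neg = softmax(teacher_bank_logits[neg], T)
        nicd = np.mean(np.sum(t_neg * log_s, axis=-1))
    return picd, nicd
```

In a full training loop, both terms would be added (with weighting coefficients) to the usual task loss; the memory bank would be populated once offline from the frozen teacher in offline KD, or refreshed on the fly in the online and teacher-free settings.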