Revisiting Knowledge Distillation under Distribution Shift

📅 2023-12-25

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work investigates the robustness of knowledge distillation (KD) under distribution shift and reports a "teaching failure" phenomenon, in which teacher models transfer knowledge poorly under diversity and correlation shifts. To study it, the authors introduce a systematic evaluation framework tailored to these two shift types, covering five benchmark datasets and more than thirty KD methods. The methods are examined from three perspectives (algorithmic design, data augmentation, and optimization strategy) and assessed with a cross-distribution generalization protocol. Empirical results show that most sophisticated distillation algorithms and augmentation techniques yield only marginal gains under distribution shift, whereas lightweight distillation schemes are more robust. The study offers both theoretical insights and practical benchmarks for deploying KD reliably in real-world applications with distributional mismatch.

📝 Abstract

Knowledge distillation transfers knowledge from large models into small models and has recently made remarkable achievements. However, few studies have investigated the mechanism of knowledge distillation against distribution shift. Distribution shift refers to drift in the data distribution between the training and testing phases. In this paper, we reconsider the paradigm of knowledge distillation by reformulating the objective function in shift situations. Under real scenarios, we propose a unified and systematic framework to benchmark knowledge distillation against two general distribution shifts: diversity shift and correlation shift. The evaluation benchmark covers more than 30 methods from algorithmic, data-driven, and optimization perspectives on five benchmark datasets. Overall, we conduct extensive experiments on the student model. We reveal intriguing observations of poor teaching performance under distribution shifts; in particular, complex algorithms and data augmentation offer limited gains in many cases.
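
For orientation, the objective that distillation methods typically start from can be sketched as below. This is the classic soft-label formulation (cross-entropy on the true label plus a temperature-scaled KL term toward the teacher), not the reformulated shift-aware objective from the paper, which this page does not spell out. The function names and the `alpha` and `temperature` parameters are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """Classic KD loss for a single example (assumed form, not the paper's).

    loss = (1 - alpha) * CE(student, label)
           + alpha * T^2 * KL(teacher_T || student_T)
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label term: KL divergence between softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    return (1 - alpha) * ce + alpha * temperature ** 2 * kl
```

Under distribution shift, the teacher logits fed into a loss like this are computed on training-distribution inputs, so a student that matches them closely may still generalize poorly to shifted test data. That gap is what the benchmark measures.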

Problem

Research questions and friction points this paper is trying to address.

Evaluating Knowledge Distillation reliability under distribution shift

Benchmarking KD methods against diversity and correlation shifts

Analyzing key factors affecting student model training robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for benchmarking KD under shift

Evaluates 30+ methods across diverse datasets

Analyzes key training factors for robustness
