Revisiting Knowledge Distillation under Distribution Shift

2023-12-25
arXiv.org
Citations: 1
Influential: 0
AI Summary
This work investigates the robustness of knowledge distillation (KD) under distribution shift, systematically uncovering for the first time the "teaching failure" phenomenon of teacher models under diversity and correlation shifts. To address this, we introduce the first systematic evaluation framework tailored to these two shift types, encompassing five benchmark datasets and over thirty KD methods. Our methodological contributions include a multi-perspective distillation approach integrating algorithmic design, data augmentation, and optimization strategies, along with a novel cross-distribution generalization assessment protocol. Empirical results demonstrate that most sophisticated distillation algorithms and augmentation techniques yield marginal gains under distribution shift, whereas lightweight distillation schemes exhibit superior robustness. This study provides both theoretical insights and practical benchmarks for reliable KD deployment in real-world applications characterized by distributional mismatch.
Abstract
Knowledge distillation transfers knowledge from large models into small models and has recently made remarkable achievements. However, few studies have investigated the mechanism of knowledge distillation under distribution shift. Distribution shift refers to drift in the data distribution between the training and testing phases. In this paper, we reconsider the paradigm of knowledge distillation by reformulating the objective function in shift situations. For real-world scenarios, we propose a unified and systematic framework to benchmark knowledge distillation against two general distribution shifts: diversity shift and correlation shift. The evaluation benchmark covers more than 30 methods from algorithmic, data-driven, and optimization perspectives across five benchmark datasets. Overall, we conduct extensive experiments on the student model. We reveal intriguing observations of poor teaching performance under distribution shifts; in particular, complex algorithms and data augmentation offer limited gains in many cases.
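For readers unfamiliar with the objective the abstract refers to, the following is a minimal sketch of the classic temperature-scaled KD loss for a single example. It is not the paper's reformulated objective for shift situations, which this page does not spell out. The function names `softmax` and `kd_loss` and the defaults `temperature=4.0` and `alpha=0.5` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Classic KD loss for one example (assumed form, not the paper's variant).

    Combines cross-entropy on the hard label with KL(teacher || student)
    computed on temperature-softened distributions.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label])

    # Soft-label term: KL divergence between softened teacher and student outputs.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft) if t > 0)

    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return (1 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


if __name__ == "__main__":
    # Hypothetical logits for a 3-class problem.
    print(kd_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0))
```

Under distribution shift, the teacher logits are computed on test-like inputs that differ from the training distribution, which is where the abstract reports that teaching performance degrades.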
Problem

Research questions and friction points this paper is trying to address.

Evaluating Knowledge Distillation reliability under distribution shift
Benchmarking KD methods against diversity and correlation shifts
Analyzing key factors affecting student model training robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for benchmarking KD under shift
Evaluates 30+ methods across diverse datasets
Analyzes key training factors for robustness
Songming Zhang
Beijing Jiaotong University
natural language processing, text generation, machine translation
Ziyu Lyu
Sun Yat-sen University
Xiaofeng Chen
Chongqing Jiaotong University