Adaptive Group Robust Ensemble Knowledge Distillation

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Neural networks often capture spurious correlations in data, leading to degraded performance on underrepresented subgroups; knowledge distillation—particularly ensemble distillation—exacerbates this issue, causing significant deterioration in worst-group accuracy even when teacher models are debiased. This paper proposes Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a framework that identifies and mitigates the implicit harm conventional ensemble distillation inflicts on disadvantaged subgroups. Its core innovation is a bias-model-guided teacher selection mechanism: by dynamically weighting teacher knowledge according to how each teacher's gradient direction deviates from that of an auxiliary biased model, AGRE-KD enables adaptive, worst-group-aware knowledge fusion. Evaluated on multiple benchmark datasets, AGRE-KD substantially improves worst-group accuracy over baselines—including majority-voting model ensembles—while enhancing the out-of-distribution robustness and generalization of student models.

📝 Abstract
Neural networks can learn spurious correlations in the data, often leading to performance degradation for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex teacher model to a relatively "simple" student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear if this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly drop the performance of the worst-case subgroups in the distilled student model even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy to ensure that the student model receives knowledge beneficial for unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively chooses teachers whose knowledge would better improve the worst-performing subgroups by upweighting the teachers with gradient directions deviating from the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting. Our source code is available at https://github.com/patrikken/AGRE-KD
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation for underrepresented subgroups in knowledge distillation
Overcomes traditional ensemble distillation's failure on worst-case subgroups
Ensures student models receive knowledge beneficial for unknown underrepresented groups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive group-robust ensemble knowledge distillation strategy
Upweights teachers whose gradient directions deviate from an auxiliary biased model
Improves worst-case subgroup performance of the distilled student
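The teacher-selection idea above can be sketched in PyTorch. This is a hypothetical simplification, not the authors' implementation: it assumes teacher weights are derived from the deviation (here, one minus cosine similarity) between the student's distillation gradient under each teacher and the gradient induced by the biased model, then normalized. Function names (`teacher_weights`, `flat_grad`) and the temperature parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    # Flatten the gradients of `loss` w.r.t. `params` into one vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def teacher_weights(student, teachers, bias_model, x, T=2.0):
    """Illustrative sketch: weight each teacher by how much the
    distillation gradient it induces on the student deviates from the
    gradient induced by the biased model (assumed weighting rule)."""
    params = [p for p in student.parameters() if p.requires_grad]

    def distill_grad(source):
        # Gradient of the KD loss (student vs. `source` soft labels)
        # w.r.t. the student parameters.
        with torch.no_grad():
            soft = F.softmax(source(x) / T, dim=-1)
        log_probs = F.log_softmax(student(x) / T, dim=-1)
        loss = F.kl_div(log_probs, soft, reduction="batchmean")
        return flat_grad(loss, params)

    g_bias = distill_grad(bias_model)
    deviations = []
    for t in teachers:
        g_t = distill_grad(t)
        # Larger deviation from the bias direction -> larger weight.
        deviations.append(1.0 - F.cosine_similarity(g_t, g_bias, dim=0))
    w = torch.stack(deviations).clamp_min(1e-8)
    return w / w.sum()  # normalized teacher weights
```

In this sketch a teacher aligned with the biased model (cosine similarity near 1) receives weight near zero, so the fused soft labels are dominated by teachers whose knowledge differs from the spurious-feature shortcut.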