Adaptive Group Robust Ensemble Knowledge Distillation

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Neural networks often capture spurious correlations in data, leading to degraded performance on underrepresented subgroups; knowledge distillation—particularly ensemble distillation—exacerbates this issue, causing significant deterioration in worst-group accuracy even when teacher models are debiased. This paper proposes Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a framework that identifies and mitigates the implicit harm conventional ensemble distillation inflicts on disadvantaged subgroups. Its core innovation is a bias-model-guided teacher selection mechanism: by dynamically weighting teacher knowledge according to how each teacher's gradient direction deviates from that of an auxiliary biased model, AGRE-KD enables adaptive, worst-group-aware knowledge fusion. Evaluated on multiple benchmark datasets, AGRE-KD substantially improves worst-group accuracy over baselines—including majority-voting model ensembles—while enhancing the out-of-distribution robustness and generalization of student models.

📝 Abstract
Neural networks can learn spurious correlations in the data, often leading to performance degradation for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex teacher model to a relatively "simple" student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear if this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly drop the performance of the worst-case subgroups in the distilled student model even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy to ensure that the student model receives knowledge beneficial for unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively chooses teachers whose knowledge would better improve the worst-performing subgroups by upweighting the teachers with gradient directions deviating from the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting. Our source code is available at https://github.com/patrikken/AGRE-KD
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation for underrepresented subgroups in knowledge distillation
Overcomes traditional ensemble distillation's failure on worst-case subgroups
Ensures student models receive knowledge beneficial for unknown underrepresented groups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive group-robust ensemble knowledge distillation strategy
Upweights teachers whose gradient directions deviate from an auxiliary biased model
Improves worst-case subgroup performance of the distilled student
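The teacher-selection idea above can be sketched in PyTorch. This is a hypothetical simplification, not the authors' implementation: it assumes teacher weights are derived from the deviation (here, one minus cosine similarity) between the student's distillation gradient under each teacher and the gradient induced by the biased model, then normalized. Function names (`teacher_weights`, `flat_grad`) and the temperature parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    # Flatten the gradients of `loss` w.r.t. `params` into one vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def teacher_weights(student, teachers, bias_model, x, T=2.0):
    """Illustrative sketch: weight each teacher by how much the
    distillation gradient it induces on the student deviates from the
    gradient induced by the biased model (assumed weighting rule)."""
    params = [p for p in student.parameters() if p.requires_grad]

    def distill_grad(source):
        # Gradient of the KD loss (student vs. `source` soft labels)
        # w.r.t. the student parameters.
        with torch.no_grad():
            soft = F.softmax(source(x) / T, dim=-1)
        log_probs = F.log_softmax(student(x) / T, dim=-1)
        loss = F.kl_div(log_probs, soft, reduction="batchmean")
        return flat_grad(loss, params)

    g_bias = distill_grad(bias_model)
    deviations = []
    for t in teachers:
        g_t = distill_grad(t)
        # Larger deviation from the bias direction -> larger weight.
        deviations.append(1.0 - F.cosine_similarity(g_t, g_bias, dim=0))
    w = torch.stack(deviations).clamp_min(1e-8)
    return w / w.sum()  # normalized teacher weights
```

In this sketch a teacher aligned with the biased model (cosine similarity near 1) receives weight near zero, so the fused soft labels are dominated by teachers whose knowledge differs from the spurious-feature shortcut.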