🤖 AI Summary
This study investigates whether knowledge distillation (KD) effectively transfers debiasing capabilities from teacher to student models, focusing on robustness against spurious correlations in natural language inference and image classification. Through attention pattern analysis and circuit probing, we find that debiasing ability consistently and significantly degrades during distillation, with heterogeneous attenuation across bias types. We identify, for the first time, that this degradation stems from KD-induced compression weakening critical debiasing attention pathways and deteriorating discriminative circuits. To address this, we propose three targeted interventions: bias-aware high-quality data augmentation, multi-round iterative distillation, and teacher-weight-guided student initialization. Extensive experiments demonstrate that their synergistic application substantially improves debiasing knowledge transfer, restoring, and in several cases exceeding, the teacher's robustness on multiple benchmarks.
📝 Abstract
Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations, which degrade performance on out-of-distribution data, remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we establish several key findings: (i) the debiasing capability of a model is generally undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention patterns and circuits that cause the distinct behavior post-KD. Given these findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study at scale on the effect of KD on debiasing and its internal mechanism. Our findings provide insight into how KD works and how to design better debiasing methods.
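For readers unfamiliar with the KD setup the abstract refers to, here is a minimal sketch of the standard logit-matching objective (assuming the common Hinton-style formulation with temperature `T` and mixing weight `alpha`; the paper's exact loss and hyperparameters may differ):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style KD objective (illustrative, not the paper's exact loss):
    alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher_T || student_T).
    """
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # KL divergence between temperature-softened teacher and student distributions.
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    # Hard-label cross-entropy at T = 1.
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

A higher temperature softens the teacher's distribution, exposing the relative probabilities it assigns to incorrect classes; it is precisely this "dark knowledge" channel through which debiasing behavior would, or as the study finds often would not, transfer to the student.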