Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether knowledge distillation (KD) effectively transfers debiasing capabilities from teacher to student models, focusing on robustness against spurious correlations in natural language inference and image classification. Through attention pattern analysis and circuit probing, we find that debiasing ability degrades consistently and significantly during distillation, with heterogeneous attenuation across bias types. We identify, for the first time, that this degradation stems from KD-induced compression weakening critical debiasing attention pathways and deteriorating discriminative circuits. To address this, we propose three targeted interventions: bias-aware high-quality data augmentation, multi-round iterative distillation, and teacher-weight-guided student initialization. Extensive experiments demonstrate that their synergistic application substantially improves debiasing knowledge transfer, restoring, and in several cases exceeding, the teacher's robustness on multiple benchmarks.

📝 Abstract
Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall, the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention pattern and circuit that cause the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its internal mechanism at scale. Our findings provide insight into how KD works and how to design better debiasing methods.
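For readers unfamiliar with the KD setup the abstract assumes, the standard distillation objective combines a hard-label cross-entropy term with a temperature-softened KL term against the teacher's logits. The sketch below is a minimal NumPy rendering of that standard objective; the function name `kd_loss`, the temperature `T`, and the mixing weight `alpha` are illustrative conventions, not this paper's exact formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Numerically stable softmax at temperature T.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: alpha * soft-target KL + (1 - alpha) * hard CE."""
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2
    # (the usual gradient-magnitude correction from Hinton et al.).
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    soft = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1)) * T * T
    # Hard term: cross-entropy against the ground-truth labels.
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard = -np.mean(log_p[np.arange(len(labels)), labels])
    return alpha * soft + (1 - alpha) * hard
```

The abstract's finding (i) is that a student trained through exactly this kind of objective tends to lose debiasing ability the teacher had, even though the soft term nominally transfers the teacher's full output distribution.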
Problem

Research questions and friction points this paper is trying to address.

Investigates knowledge distillation effects on debiasing capability transfer
Explores how distillation undermines model robustness against spurious correlations
Identifies attention patterns causing bias mitigation failures post-distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bias-aware, high-quality data augmentation for bias mitigation
Iterative knowledge distillation to enhance debiasing
Teacher weight initialization for student models
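The third intervention, initializing the student from teacher weights, can be sketched as copying an evenly spaced subset of the teacher's layers into a shallower student. This is a common layer-selection heuristic for transformer distillation and is offered here only as a plausible illustration; the paper's exact mapping may differ, and `init_student_from_teacher` is a hypothetical helper name.

```python
import numpy as np

def init_student_from_teacher(teacher_layers, num_student_layers):
    """Initialize a shallower student by copying an evenly spaced subset
    of the teacher's layer weights (illustrative heuristic, not the
    paper's confirmed procedure)."""
    # Pick num_student_layers indices spread evenly over the teacher depth,
    # always including the first and last teacher layers.
    idx = np.linspace(0, len(teacher_layers) - 1, num_student_layers)
    return [teacher_layers[int(round(i))] for i in idx]
```

For example, mapping a 12-layer teacher onto a 6-layer student selects teacher layers 0, 2, 4, 7, 9, and 11, so the student starts from weights that already encode some of the teacher's (debiased) behavior rather than from a random initialization.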