🤖 AI Summary
This study investigates the performance–fairness trade-off that moral alignment induces in language models, focusing on the debiasing of gender stereotypes. Existing fairness objectives struggle to preserve task performance while eliminating bias during fine-tuning. To address this limitation, we propose an analytical framework that integrates model editing with selective fine-tuning to systematically characterize knowledge forgetting patterns and their downstream impacts. Experiments reveal that performance degradation is driven primarily by global forgetting rather than by the targeted removal of stereotypical knowledge, and that prevailing fairness objectives lack explicit control over forgetting dynamics. General anti-forgetting methods, moreover, fail to mitigate global forgetting or restore performance. Crucially, we uncover the counterintuitive phenomenon that *selective forgetting of stereotypes tends to increase overall forgetting and thereby exacerbates performance loss*, challenging conventional assumptions. Our work provides both theoretical grounding and empirical evidence for designing alignment objectives that simultaneously ensure fairness and robustness.
📝 Abstract
Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning or model editing on curated datasets. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to balance fairness against performance by using carefully designed fairness objectives that encourage PLMs to selectively forget stereotypical knowledge while preserving their helpfulness. In this short paper, we investigate the mechanisms underlying this performance trade-off in the context of mitigating gender stereotypes, through the lens of forgetting and the fairness objective. Our analysis reveals the limitations of current fairness objectives in achieving this trade-off by demonstrating that: (1) downstream task performance is primarily driven by the overall forgetting level; (2) selective forgetting of stereotypes tends to increase overall forgetting; and (3) general solutions for mitigating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.