Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes

📅 2025-09-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the performance–fairness trade-off that moral alignment induces in language models, focusing on the debiasing of gender stereotypes. Existing fairness objectives struggle to preserve task performance while eliminating bias during fine-tuning. To address this limitation, we propose an analytical framework that integrates model editing with selective fine-tuning to systematically characterize knowledge-forgetting patterns and their downstream impacts. Experiments reveal that performance degradation is driven primarily by global forgetting rather than by targeted removal of stereotypical knowledge, and that prevailing fairness objectives lack explicit control over forgetting dynamics. Moreover, general anti-forgetting methods fail to mitigate global forgetting or restore performance. Crucially, we uncover the counterintuitive phenomenon that *selective forgetting exacerbates performance loss*, challenging conventional assumptions. Our work provides both theoretical grounding and empirical evidence for designing alignment objectives that simultaneously ensure fairness and robustness.

📝 Abstract

Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning or model editing on curated datasets. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget stereotypical knowledge through carefully designed fairness objectives, while preserving their helpfulness. In this short paper, we investigate the underlying mechanisms of the performance trade-off in the context of mitigating gender stereotypes, through the lens of forgetting and the fairness objective. Our analysis reveals the limitations of current fairness objectives in achieving this trade-off by demonstrating that: (1) downstream task performance is primarily driven by the overall forgetting level; (2) selective forgetting of stereotypes tends to increase overall forgetting; and (3) general solutions for mitigating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.
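The distinction between selective and overall forgetting can be made concrete with a small sketch. This summary does not state how the paper measures forgetting, so the proxy below is an assumption: forgetting is taken as the mean increase in per-example negative log-likelihood (NLL) after alignment, computed separately on a stereotype probe set and a general-knowledge probe set. The function names, the `selectivity` quantity, and the toy numbers are all hypothetical.

```python
from statistics import mean


def forgetting(nll_before, nll_after):
    """Assumed proxy: mean per-example increase in NLL after alignment."""
    if len(nll_before) != len(nll_after):
        raise ValueError("before/after lists must have the same length")
    return mean(after - before for before, after in zip(nll_before, nll_after))


def forgetting_report(stereo_before, stereo_after, general_before, general_after):
    """Compare forgetting on stereotype probes with forgetting on general probes."""
    stereo = forgetting(stereo_before, stereo_after)
    overall = forgetting(general_before, general_after)
    return {
        "stereotype_forgetting": stereo,
        "overall_forgetting": overall,
        # Hypothetical quantity: how much more the stereotype set was forgotten.
        "selectivity": stereo - overall,
    }


# Toy NLL values, invented purely for illustration.
report = forgetting_report(
    stereo_before=[2.0, 3.0], stereo_after=[4.0, 5.0],
    general_before=[1.0, 2.0], general_after=[1.5, 2.5],
)
print(report)
```

Under this proxy, finding (2) in the abstract would correspond to runs where a higher `stereotype_forgetting` comes with a higher `overall_forgetting`, rather than with the general set being left intact.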

Problem

Research questions and friction points this paper is trying to address.

Investigating performance trade-offs in moral alignment of language models

Analyzing how fairness objectives affect gender stereotype mitigation

Demonstrating limitations of selective forgetting in preserving model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes the performance trade-off in moral alignment

Investigates forgetting mechanisms in gender stereotypes

Reveals limitations of current fairness objectives

🔎 Similar Papers

A Survey on Moral Foundation Theory and Pre-Trained Language Models: Current Advances and Challenges