🤖 AI Summary
AI language models widely exhibit and amplify gender bias, undermining both fairness and practical utility. A key challenge is that gender information is tightly entangled across neurons within Transformer architectures.
Method: We propose a gradient-driven encoder-decoder framework that, for the first time, enables neuron-level localization of, separation of, and intervention on monosemantic gender features. Leveraging a differentiable architecture design and gradient-sensitivity analysis, our approach achieves identifiable and editable disentanglement of gender representations at neuron granularity.
Contribution/Results: The method offers strong interpretability with minimal performance degradation. Evaluated on multiple encoder-only models, it significantly reduces bias scores on the SEAT and BOLD benchmarks while incurring less than a 1% average drop on GLUE—demonstrating both effective debiasing and robust capability preservation.
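The gradient-sensitivity step can be illustrated in a toy setting. The sketch below is purely hypothetical (the linear probe, data, and all names are invented for illustration, not the paper's implementation): each hidden neuron is scored by the average magnitude of the gradient of a gender-classification loss with respect to its activation, and high-scoring neurons are candidates for carrying gender information.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16   # toy hidden width
N = 200       # toy number of examples

# Synthetic hidden states: neuron 3 carries the "gender" signal, others are noise.
labels = rng.integers(0, 2, size=N)           # 0/1 gender label per example
h = rng.normal(size=(N, HIDDEN))
h[:, 3] += 3.0 * (labels - 0.5)               # inject the signal into neuron 3

# Cheap linear probe p(y=1|h) = sigmoid(h @ w), fit by least squares.
w = np.linalg.lstsq(h, labels - 0.5, rcond=None)[0]

# For cross-entropy loss, dL/dh = (sigmoid(h @ w) - y) * w per example.
z = h @ w
p = 1.0 / (1.0 + np.exp(-z))
grad_h = (p - labels)[:, None] * w[None, :]

# Sensitivity score per neuron: mean absolute gradient over the dataset.
sensitivity = np.abs(grad_h).mean(axis=0)
top_neuron = int(np.argmax(sensitivity))
print(top_neuron)  # the injected-signal neuron ranks highest
```

In this toy setup the gradient magnitude is largest for the neuron whose activation most strongly moves the gender prediction, which is the intuition behind locating a candidate monosemantic gender neuron before separating or editing it.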
📝 Abstract
AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models while preserving their other capabilities. We demonstrate the effectiveness of our approach across multiple encoder-only models and highlight its potential for broader applications.