GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
AI language models widely exhibit and amplify gender bias, undermining both fairness and practical utility. A key challenge lies in the excessive coupling of gender information at the neuron level within Transformer architectures. Method: We propose a gradient-driven encoder-decoder framework that, for the first time, enables neuron-level localization, separation, and intervention on monosemantic gender features. Leveraging differentiable architecture design and gradient-sensitivity analysis, our approach achieves identifiable and editable disentanglement of gender representations at neuron granularity. Contribution/Results: The method ensures strong interpretability with minimal performance degradation. Evaluated on multiple encoder-only models, it significantly reduces bias scores on the SEAT and BOLD benchmarks while incurring an average GLUE performance drop of under 1%, demonstrating both effective debiasing and robust capability preservation.
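The core idea, learning a single monosemantic feature neuron from model gradients with an encoder-decoder, can be illustrated as a rank-1 linear autoencoder over per-example gradient vectors. The sketch below is an illustrative toy, not the authors' implementation: in the paper the inputs would be gradients of a Transformer's weights on gender-contrastive examples, while here synthetic 2-D vectors with variance concentrated along one axis stand in for a gender-aligned gradient direction.

```python
import random

# Toy sketch (assumption: NOT the GRADIEND code) of the core idea:
# encoder maps a gradient vector g to one scalar neuron h = w . g,
# decoder reconstructs g_hat = h * v. Trained by SGD on reconstruction
# error, the pair (w, v) recovers the dominant direction of the
# gradient data, i.e. a single "feature neuron" for that direction.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_rank1_autoencoder(grads, lr=0.02, epochs=200, seed=0):
    rng = random.Random(seed)
    d = len(grads[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(d)]  # encoder weights
    v = [rng.uniform(-0.1, 0.1) for _ in range(d)]  # decoder weights
    for _ in range(epochs):
        for g in grads:
            h = dot(w, g)                               # single-neuron encoding
            err = [h * v[i] - g[i] for i in range(d)]   # g_hat - g
            # gradients of 0.5 * ||g_hat - g||^2 w.r.t. w and v
            grad_w = [dot(err, v) * g[i] for i in range(d)]
            grad_v = [err[i] * h for i in range(d)]
            w = [w[i] - lr * grad_w[i] for i in range(d)]
            v = [v[i] - lr * grad_v[i] for i in range(d)]
    return w, v

# Synthetic "gradients": variance concentrated along axis 0,
# standing in for a gender-aligned direction in gradient space.
rng = random.Random(1)
grads = [[rng.gauss(0, 1.0), rng.gauss(0, 0.05)] for _ in range(100)]
w, v = train_rank1_autoencoder(grads)

# The decoder direction should align with the high-variance axis.
alignment = abs(v[0]) / (abs(v[0]) + abs(v[1]))
```

Once such a direction is learned, debiasing amounts to intervening on the single neuron `h` (e.g. zeroing or rescaling it) before decoding the update back into weight space, which is what keeps the edit localized and the rest of the model's capabilities largely intact.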

📝 Abstract
AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models while maintaining their other capabilities. We demonstrate the effectiveness of our approach across multiple encoder-only models and highlight its potential for broader applications.
Problem

Research questions and friction points this paper is trying to address.

Gender Bias
AI Models
Language Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRADIEND method
gender bias correction
broad applicability and potential for wider use