Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing

📅 2025-01-24

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

Gender bias is pervasive in large language models (LLMs), and existing debiasing methods struggle to balance fairness preservation with model capability retention. Method: Grounded in neural mechanisms, this work constructs the CommonWords dataset for systematic bias evaluation, uncovering—for the first time—the hierarchical interaction between gender-specific and general-purpose neurons; it further reveals that editing general-purpose neurons often induces capability degradation. To address this, we propose an interpretable hybrid neuron editing method integrating logit-space perturbation with causal intervention. Contribution/Results: Our approach achieves precise debiasing while maintaining functional stability. Evaluated on five mainstream LLMs, it reduces gender bias by 42.6% on average, with task accuracy degradation of less than 0.8%, significantly outperforming state-of-the-art fine-tuning and model-editing baselines.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) often exhibit gender bias, posing challenges for their safe deployment. Existing methods to mitigate bias lack a comprehensive understanding of its mechanisms or compromise the model's core capabilities. To address these issues, we propose the CommonWords dataset, to systematically evaluate gender bias in LLMs. Our analysis reveals pervasive bias across models and identifies specific neuron circuits, including gender neurons and general neurons, responsible for this behavior. Notably, editing even a small number of general neurons can disrupt the model's overall capabilities due to hierarchical neuron interactions. Based on these insights, we propose an interpretable neuron editing method that combines logit-based and causal-based strategies to selectively target biased neurons. Experiments on five LLMs demonstrate that our method effectively reduces gender bias while preserving the model's original capabilities, outperforming existing fine-tuning and editing approaches. Our findings contribute a novel dataset, a detailed analysis of bias mechanisms, and a practical solution for mitigating gender bias in LLMs.

Problem

Research questions and friction points this paper is trying to address.

Gender Bias

Large Language Models

Fairness

Innovation

Methods, ideas, or system contributions that make the work stand out.

CommonWords Dataset

Neuron Fine-Tuning

Bias Reduction

🔎 Similar Papers

Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias