Gender Encoding Patterns in Pretrained Language Model Representations

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
The internal mechanisms by which pretrained language models (PLMs) represent gender bias remain poorly understood, and the efficacy of existing debiasing methods is questionable. Method: This study systematically investigates how gender information is encoded and propagated across encoder-based models (e.g., BERT, RoBERTa, DeBERTa), combining mutual information analysis, feature attribution, and controllable bias probing into a multi-model, multi-layer debiasing evaluation framework. Contribution/Results: We identify, for the first time, a gender representation structure that is consistent across models and deeply embedded across layers. Crucially, we find that mainstream output-layer debiasing techniques reduce output-level bias by only ~12% on average, yet inadvertently increase gender separability in intermediate layers by up to 37%. This reveals a severe decoupling between representation-level and output-level bias mitigation. Our findings provide both theoretical insight into bias mechanisms and empirical grounding for layer-aware, hierarchical debiasing strategies.
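The layer-wise probing methodology summarized above can be sketched as follows. This is an illustrative, non-authoritative example, not the paper's exact procedure: a logistic-regression probe is trained on per-layer hidden states, and its cross-validated accuracy serves as a proxy (a lower-bound signal) for the mutual information between representations and gender labels. Synthetic vectors with a controllable signal strength stand in for hidden states that would normally be extracted from BERT, RoBERTa, or DeBERTa layers; the function name `layer_gender_separability` is invented for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def layer_gender_separability(hidden_states, labels, folds=5):
    """Cross-validated probe accuracy for one layer's representations.

    Accuracy above chance indicates that gender information is linearly
    decodable from (i.e., encoded in) the layer's hidden states.
    """
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, hidden_states, labels, cv=folds).mean()

# Simulate three "layers" with increasing gender signal strength.
# (In the real setting, hidden_states would come from a PLM's layers.)
n, dim = 400, 64
labels = rng.integers(0, 2, size=n)
for layer, signal in enumerate([0.0, 0.5, 1.5]):
    # Shift one class's mean along a fixed direction to mimic encoding.
    shift = np.zeros(dim)
    shift[0] = signal
    states = rng.normal(size=(n, dim)) + np.outer(labels, shift)
    acc = layer_gender_separability(states, labels)
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

Comparing such per-layer separability scores before and after applying a debiasing method is what would reveal the decoupling the paper reports: output-level bias can drop while intermediate-layer separability rises.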

📝 Abstract
Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.
Problem

Research questions and friction points this paper is trying to address.

How is gender information, and the bias it carries, encoded in pretrained language models?
How effective are bias mitigation techniques and fine-tuning at removing internally encoded bias?
How do differences in model design influence the encoding of gender bias?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic analysis of gender encoding
Evaluation of bias mitigation techniques' effectiveness
Impact of model design on bias encoding