Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models

πŸ“… 2026-01-09
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Pre-trained transformer models are widely adopted in software engineering but may encode and amplify societal biases. This work introduces the concept of "bias neurons," hypothesizing that social stereotypes are concentrated in a small subset of internal model neurons. To investigate this, we construct a dataset encompassing nine categories of bias-related associations and leverage knowledge neuron theory combined with neuron attribution methods to identify and suppress these bias neurons in BERT. Our experiments demonstrate that biased knowledge is indeed highly localized in a limited number of neurons, and targeted suppression substantially reduces model bias while preserving performance on downstream software engineering tasks. This approach enables interpretable, fine-grained debiasing interventions without compromising task utility.

πŸ“ Abstract
The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons, i.e., neurons that encode factual information, we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models. We then assess the impact of suppression on SE tasks. Our findings show that biased knowledge is localized within small neuron subsets, and suppressing them substantially reduces bias with minimal performance loss. This demonstrates that bias in transformers can be traced and mitigated at the neuron level, offering an interpretable approach to fairness in SE.
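The attribution-then-suppression pipeline the abstract describes can be illustrated with a minimal toy sketch. This is not the paper's implementation: it stands in for a BERT FFN layer with a simple linear scoring function, uses integrated gradients (the attribution method behind knowledge-neuron work) approximated by finite differences, and the probing objective, weights, and top-k threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def integrated_gradients(f, activation, steps=50):
    """Approximate per-neuron attributions for scalar output f by
    integrating finite-difference gradients along a straight path
    from a zero baseline to the observed activation vector."""
    baseline = np.zeros_like(activation)
    grad_sum = np.zeros_like(activation)
    eps = 1e-5
    for step in range(1, steps + 1):
        point = baseline + (step / steps) * (activation - baseline)
        # finite-difference gradient of f at `point`, one neuron at a time
        for i in range(len(point)):
            bumped = point.copy()
            bumped[i] += eps
            grad_sum[i] += (f(bumped) - f(point)) / eps
    return (activation - baseline) * grad_sum / steps

# Hypothetical stand-in for a stereotype-probing objective: the model's
# score for a biased completion, driven mostly by neurons 2 and 5.
weights = np.array([0.1, 0.0, 3.0, 0.2, 0.0, 2.5, 0.1, 0.0])
score = lambda h: float(weights @ h)

h = np.ones(8)                       # observed FFN activations
attr = integrated_gradients(score, h)
biased = np.argsort(attr)[-2:]       # top-k candidate "biased neurons"

h_suppressed = h.copy()
h_suppressed[biased] = 0.0           # suppress by zeroing their activations
print(sorted(biased.tolist()), score(h), score(h_suppressed))
```

In the toy linear case the attribution for neuron i reduces to w_i * h_i, so the two high-weight neurons are flagged and zeroing them collapses the biased score while leaving the other neurons untouched, mirroring the paper's claim that suppression of a small subset is enough.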
Problem

Research questions and friction points this paper is trying to address.

bias
transformers
fairness
stereotypes
pre-trained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

biased neurons
neuron editing
transformer fairness
knowledge neurons
bias mitigation