Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically investigates the structured neural representations of social, demographic, and gender biases in large language models (GPT-2, Llama2) from a mechanistic interpretability perspective. Method: We propose an integrated framework combining causal mediation analysis, directed edge importance evaluation, cross-layer systematic ablation, and multi-task stability testing. Contribution/Results: We provide the first empirical validation that bias computation is highly localized and dynamically migrates across layers during fine-tuning. Biases are driven by structured neural components concentrated in a few critical layers, whose computational pathways significantly overlap with general linguistic capabilities (e.g., named entity recognition, syntactic judgment). Targeted ablation of these components substantially reduces biased outputs but incurs quantifiable trade-offs in downstream task performance. These findings establish an interpretable, neuron-level foundation for bias localization, intervention, and robust alignment—bridging mechanistic understanding with practical debiasing strategies.
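The cross-layer ablation idea described above can be illustrated with a minimal toy sketch (all names and the "bias direction" are hypothetical, not the paper's actual setup): zero out one layer's contribution to a residual stream at a time, and rank layers by how much the bias score changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer's residual stream: three "layers",
# each adding a component's output to the hidden state.
def forward(x, weights, ablate_layer=None):
    h = x
    for i, W in enumerate(weights):
        out = np.tanh(W @ h)
        if i == ablate_layer:        # zero-ablation: remove this layer's contribution
            out = np.zeros_like(out)
        h = h + out                  # residual connection
    return h

def bias_score(h, v_bias):
    # Projection of the final state onto a hypothetical "bias direction".
    return float(h @ v_bias)

d = 8
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(3)]
x = rng.normal(size=d)
v_bias = rng.normal(size=d)
v_bias /= np.linalg.norm(v_bias)

baseline = bias_score(forward(x, weights), v_bias)
# Ablate each layer in turn; the layer whose removal changes the score
# the most is the most "bias-localized" in this toy setting.
effects = [abs(baseline - bias_score(forward(x, weights, ablate_layer=i), v_bias))
           for i in range(len(weights))]
print(effects)
```

In the paper's actual experiments the ablated units are learned model components (edges/heads) in GPT-2 and Llama2, and the bias score comes from model outputs on bias benchmarks rather than a fixed projection.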

📝 Abstract
Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also degrades other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because these tasks share important components with the bias computation.
Problem

Research questions and friction points this paper is trying to address.

Analyzing structural representation of social and gender biases in LLMs
Identifying internal edges causing biased behavior in models
Assessing impact of bias removal on other NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mechanistic interpretability analyzes bias in LLMs
Identify bias-localized layers via systematic ablations
Component removal reduces bias but degrades other tasks
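A common way to quantify the biased behavior being ablated is a probability-gap metric over paired continuations. The sketch below shows the general idea with made-up logits over a tiny illustrative vocabulary (the prompt, vocabulary, and values are hypothetical, not taken from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical next-token logits for a prompt like "The nurse said that ..."
vocab = ["he", "she", "they"]
logits = np.array([1.2, 2.1, 0.4])
p = softmax(logits)

# Gap between the stereotype-consistent and stereotype-inconsistent
# continuation; a gap near zero indicates less biased preference.
gap = p[vocab.index("she")] - p[vocab.index("he")]
print(round(float(gap), 3))
```

Comparing this gap before and after ablating candidate components gives a per-component measure of how much each one contributes to the biased output.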