DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit unfair outputs whose underlying mechanisms remain poorly understood, limiting the efficacy and robustness of existing debiasing methods. This paper introduces DiffHeads, the first framework to establish a causal link between prompting strategy and bias generation: by computing differential activations of attention heads under Direct-Answer (DA) versus Chain-of-Thought (CoT) prompting, it identifies and dynamically masks bias-inducing clusters of attention heads. The method requires no fine-tuning and operates as a lightweight, plug-and-play intervention at inference time. Extensive experiments across multiple mainstream LLMs show average unfairness reductions of 49.4% (DA) and 40.3% (CoT) while strictly preserving original task performance. Key contributions include (i) uncovering the attention-level mechanism of bias generation, (ii) proposing a causally grounded, head-level intervention paradigm, and (iii) enabling efficient, lossless real-time debiasing.
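The differential-activation idea can be sketched roughly as follows. Everything here is an illustrative assumption, not the paper's actual procedure: the `(layer, head)` keys, the per-head activation statistics, and the selection threshold are all made up for the example.

```python
# Minimal sketch (not the paper's implementation): flag "bias heads" whose
# mean activation under Direct-Answer (DA) prompting exceeds their activation
# under Chain-of-Thought (CoT) prompting by an illustrative threshold.

def find_bias_heads(act_da, act_cot, threshold=0.5):
    """act_da / act_cot map (layer, head) -> mean activation magnitude."""
    return sorted(
        key for key in act_da
        if act_da[key] - act_cot.get(key, 0.0) > threshold
    )

# Toy numbers: head (0, 3) fires strongly only under DA prompting.
da  = {(0, 0): 0.20, (0, 3): 1.40, (1, 1): 0.60}
cot = {(0, 0): 0.25, (0, 3): 0.30, (1, 1): 0.55}
print(find_bias_heads(da, cot))  # [(0, 3)]
```

In a real model the activation statistics would be collected by running the same evaluation prompts twice (once per prompting style) and averaging per-head output magnitudes.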

📝 Abstract
Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear but gives little insight into the mechanisms that generate them, leaving existing mitigations fragile. In this paper, we conduct a systematic investigation of LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA prompting triggers the model's latent bias and increases measured unfairness by 391.9%-534.5% across one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token's influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads, which identifies bias heads through differential activation analysis between DA and CoT and selectively masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under DA and CoT, respectively, without harming model utility.
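The abstract describes the token-to-head contribution score only in one sentence, so the scoring rule below is an assumption for illustration, not the paper's definition: it approximates a token's contribution to a head as the attention mass directed at that token, scaled by the norm of the token's value vector.

```python
def token_to_head_contribution(attn, value_norms, token_idx):
    """Rough proxy (assumed, not the paper's formula) for how much token
    `token_idx` contributes to each attention head's output.

    attn[h][i][j]     -- weight head h assigns from query position i to token j
    value_norms[h][j] -- norm of token j's value vector at head h
    """
    return [
        sum(row[token_idx] for row in attn[h]) * value_norms[h][token_idx]
        for h in range(len(attn))
    ]

# Two heads over two tokens: head 1 attends far more to token 1 than head 0 does.
attn = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.5, 0.5], [0.4, 0.6]],
]
value_norms = [[1.0, 2.0], [1.0, 2.0]]
scores = token_to_head_contribution(attn, value_norms, token_idx=1)
```

Aggregating such scores over a bias-probing dataset would rank heads by how strongly demographic tokens drive them, which is the kind of tracing the abstract describes.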
Problem

Research questions and friction points this paper is trying to address.

Identifying mechanisms causing bias in large language models
Developing differential analysis to detect bias-inducing attention heads
Selectively masking bias heads to reduce unfairness while preserving utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies bias heads via differential activation analysis
Selectively masks bias heads at inference time
Reduces unfairness without harming model utility
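Masking at inference time amounts to zeroing the flagged heads' outputs before the attention layer's output projection. The toy version below is illustrative only; a real implementation would hook into the model's attention modules rather than operate on plain lists.

```python
def mask_heads(head_outputs, bias_heads):
    """Zero the per-head outputs flagged as bias heads; other heads pass
    through unchanged. head_outputs[h] is head h's output vector for one
    position (toy representation; real code would act on tensors)."""
    return [
        [0.0] * len(out) if h in bias_heads else out
        for h, out in enumerate(head_outputs)
    ]

# Three heads; head 1 was flagged by the differential analysis.
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked = mask_heads(outputs, bias_heads={1})
print(masked)  # [[1.0, 2.0], [0.0, 0.0], [5.0, 6.0]]
```

Because only the flagged heads are zeroed and no weights change, the intervention is plug-and-play and fully reversible, which matches the "no fine-tuning, utility-preserving" framing above.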