Robustly Improving LLM Fairness in Realistic Settings via Interpretability

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work shows that leading large language models (LLMs) exhibit significant racial and gender bias—up to a 12% disparity in interview invitation rates—when evaluated in realistic hiring scenarios that include company names, culture descriptions, and selective hiring constraints; conventional anti-bias prompts fail entirely in these settings. To address this, the authors propose an *internal bias mitigation method* grounded in activation-space interpretability: a *prompt-free, inference-time affine concept editing technique* that identifies directions correlated with sensitive attributes in hidden-layer activations and neutralizes them—without fine-tuning and with strong cross-model generalizability. Evaluated on GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, and open-weight models (Gemma, Mistral), the method reduces bias from up to 12% to below 2.5% (typically under 1%), with negligible performance degradation, robustly and transparently suppressing implicit contextual bias in LLM-based hiring decisions.

📝 Abstract
Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model's chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race- and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
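The headline figures (12% bias before, under 2.5% after) are gaps in interview-invitation rates between demographic groups on otherwise-matched resumes. A minimal sketch of how such a gap metric can be computed (the function name and toy data are illustrative, not taken from the paper):

```python
def invite_rate_gap(decisions, groups):
    """Largest pairwise gap in interview-invitation rates across groups.

    decisions: list of bools (True = candidate invited to interview)
    groups:    parallel list of demographic group labels
    """
    by_group = {}
    for invited, group in zip(decisions, groups):
        by_group.setdefault(group, []).append(invited)
    rates = {g: sum(v) / len(v) for g, v in by_group.items()}
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: eight matched resumes differing only in group label.
decisions = [True, True, True, False, True, False, False, False]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = invite_rate_gap(decisions, groups)  # 0.75 - 0.25 = 0.5
```

In the paper's setup, this gap would be measured per scenario (with and without realistic context) rather than on pooled decisions.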
Problem

Research questions and friction points this paper is trying to address.

Addressing LLM biases in hiring decisions with realistic contexts
Reducing racial and gender biases in commercial and open-source LLMs
Developing internal bias mitigation for equitable hiring outcomes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal bias mitigation via sensitive attribute neutralization
Affine concept editing at inference time
Robust bias reduction maintaining model performance
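The innovation above can be sketched in two steps: fit a sensitive-attribute direction from a small synthetic contrast set, then, at inference time, fix every hidden activation's component along that direction to a constant (an affine edit: projection removal plus translation). The NumPy sketch below assumes a difference-of-means direction and a target component of zero; the paper's exact fitting procedure and target may differ.

```python
import numpy as np

def fit_bias_direction(acts_a, acts_b):
    # Difference-of-means direction between activations from two attribute
    # groups (e.g., prompts that differ only in a demographic cue).
    v = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return v / np.linalg.norm(v)

def affine_concept_edit(h, v, target=0.0):
    # Affine edit: remove each activation's component along unit vector v,
    # then set that component to a fixed target:
    #   h' = h - (h @ v) v + target * v
    proj = h @ v  # (n,) projections onto v
    return h - np.outer(proj, v) + target * v

# Synthetic demo: activations whose first coordinate encodes the attribute.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))
shift = np.eye(16)[0]  # ground-truth bias axis for the demo
acts_a, acts_b = base[:100] + 1.5 * shift, base[100:] - 1.5 * shift

v = fit_bias_direction(acts_a, acts_b)
edited = affine_concept_edit(np.vstack([acts_a, acts_b]), v)
# After editing, every activation has the same component along v, so the
# attribute is no longer linearly readable from that direction.
```

In practice the edit would be applied inside the model via a forward hook on selected hidden layers; the math per activation vector is the same.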