Silenced Biases: The Dark Side LLMs Learned to Refuse

📅 2025-11-05

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Existing LLM fairness evaluations overlook “silence bias”—a distortion arising when safety-aligned models refuse sensitive queries, thereby concealing latent unfairness in their hidden representations and yielding misleading fairness assessments. Method: We formally define silence bias and propose activation-guided steering to suppress refusal behavior, enabling the first prompt-free, bias-unmanipulated benchmark for silence bias evaluation (SBB). Leveraging latent-space analysis and a controllable question-answering framework, we achieve automated, scalable, and unbiased deep-fairness assessment. Results: Experiments across multiple state-of-the-art LLMs reveal pervasive latent bias beneath surface-level “fair” refusals, substantially exposing genuine fairness gaps and challenging prevailing assumptions about the efficacy of current alignment techniques.

Technology Category

Application Category

📝 Abstract

Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models'latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models'direct responses and their underlying fairness issues.

Problem

Research questions and friction points this paper is trying to address.

Uncovering hidden unfair preferences in safety-aligned large language models

Addressing limitations of standard fairness evaluation methods in AI systems

Developing scalable framework to detect biases masked by refusal responses

Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation steering reduces model refusals

Benchmark uncovers biases in latent space

Scalable framework evaluates fairness beyond alignment

🔎 Similar Papers

Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation