Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment

📅 2026-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the limitations of current fairness alignment approaches for large language models, which typically focus on a single sensitive attribute and neglect the multidimensionality and context dependence of bias, potentially leading to bias spillover. The authors apply gender-targeted Direct Preference Optimization (DPO) alignment on three mainstream models and systematically evaluate fairness shifts across nine sensitive attributes using the BBQ benchmark in both explicit and ambiguous contexts. Their analysis reveals, for the first time, that while gender alignment improves overall fairness, it significantly degrades fairness concerning attributes such as physical appearance, sexual orientation, and disability status in ambiguous contexts (p<0.001). This finding provides empirical evidence of cross-attribute bias spillover and underscores the necessity of developing context-aware, multi-attribute fairness evaluation frameworks.

Technology Category

Application Category

📝 Abstract
Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguous contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p< 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
Problem

Research questions and friction points this paper is trying to address.

bias spillover
fairness alignment
sensitive attributes
context-aware fairness
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

bias spillover
context-aware fairness
multi-attribute fairness
LLM alignment
Direct Preference Optimization
🔎 Similar Papers
No similar papers found.
Eva Paraschou
Eva Paraschou
PhD Student, Denmark Technical University, Department of Applied Mathematics and Computer Science
L
Line Harder Clemmensen
Department of Mathematical Sciences, University of Copenhagen
S
Sneha Das
Department of Applied Mathematics and Computer Science, Technical University of Denmark