HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

📅 2025-10-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing fairness evaluation methods are disconnected from high-stakes application scenarios and overlook the differential severity of bias consequences, e.g., diagnostic errors in healthcare versus stylistic biases in text generation. This paper introduces HALF, the first harm-aware fairness evaluation framework. It establishes a three-tier severity taxonomy across nine application domains based on potential societal harm and proposes a five-stage deployment-aligned evaluation pipeline. Leveraging scenario modeling, multidimensional bias measurement, cross-model comparison, and domain adaptability analysis, the authors conduct empirical studies across eight large language models (LLMs). Results reveal pronounced domain heterogeneity in model fairness; neither parameter count nor general-purpose performance guarantees fairness; and while reasoning-oriented LLMs outperform others in healthcare, they underperform in education. Crucially, HALF achieves the first quantitative alignment between fairness metrics and real-world deployment risk.

๐Ÿ“ Abstract
Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM fairness with real-world harm severity considerations
Addressing bias evaluation gaps in high-impact deployment domains
Assessing fairness discrepancies across domains using harm-aware framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Harm-aware fairness framework for LLMs
Three-tier severity classification system
Five-stage evaluation pipeline for deployment
Ali Mekky
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

Omar El Herraoui
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

Preslav Nakov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computational Linguistics, Large Language Models, Fact-checking, Fake News

Yuxia Wang
MBZUAI
Natural Language Processing