AI Summary
Existing fairness evaluation methods are disconnected from high-stakes application scenarios and overlook the differential severity of bias consequences, e.g., diagnostic errors in healthcare versus stylistic biases in text generation. This paper introduces HALF, the first harm-aware fairness evaluation framework. It establishes a three-tier severity taxonomy across nine application domains based on potential societal harm and proposes a five-stage deployment-aligned evaluation pipeline. Leveraging scenario modeling, multidimensional bias measurement, cross-model comparison, and domain adaptability analysis, we conduct empirical studies across eight large language models (LLMs). Results reveal pronounced domain heterogeneity in model fairness; neither parameter count nor general-purpose performance guarantees fairness; and while reasoning-oriented LLMs outperform others in healthcare, they underperform in education. Crucially, HALF achieves the first quantitative alignment between fairness metrics and real-world deployment risk.
Abstract
Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity; e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) and evaluates them through a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) neither model size nor general-purpose performance guarantees fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
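The core idea of weighing bias outcomes by harm severity can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual scoring function: the tier weights, domain names, and bias values below are invented for demonstration, assuming a simple linear weighting over the three severity tiers.

```python
# Hypothetical severity weights per tier; the paper does not specify
# these values, they are illustrative only.
SEVERITY_WEIGHTS = {"Severe": 3.0, "Moderate": 2.0, "Mild": 1.0}

def harm_aware_score(domain_bias, domain_tier):
    """Aggregate per-domain bias scores into a single deployment-risk
    score, weighting each domain by the severity tier of its potential
    harm. Lower is better (less harm-weighted bias)."""
    total_weight = sum(SEVERITY_WEIGHTS[domain_tier[d]] for d in domain_bias)
    weighted = sum(domain_bias[d] * SEVERITY_WEIGHTS[domain_tier[d]]
                   for d in domain_bias)
    return weighted / total_weight

# Example: three made-up domains, one per tier. A large bias in a Mild
# domain (summarization) contributes less than a small bias in a
# Severe one (clinical).
bias = {"clinical": 0.10, "hiring": 0.30, "summarization": 0.50}
tier = {"clinical": "Severe", "hiring": "Moderate", "summarization": "Mild"}
print(round(harm_aware_score(bias, tier), 3))  # 0.233
```

Under a naive unweighted average these scores would yield 0.3; the severity weighting pulls the score down because the worst bias sits in the lowest-harm tier, which is precisely the distinction HALF is designed to capture.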