AI Summary
Existing fairness evaluation methods are disconnected from high-stakes application scenarios and overlook the differential severity of bias consequences, e.g., diagnostic errors in healthcare versus stylistic biases in text generation. This paper introduces HALF, the first harm-aware fairness evaluation framework. It establishes a three-tier severity taxonomy across nine application domains based on potential societal harm and proposes a five-stage deployment-aligned evaluation pipeline. Leveraging scenario modeling, multidimensional bias measurement, cross-model comparison, and domain adaptability analysis, we conduct empirical studies across eight large language models (LLMs). Results reveal pronounced domain heterogeneity in model fairness; neither parameter count nor general-purpose performance guarantees fairness; and while reasoning-oriented LLMs outperform others in healthcare, they underperform in education. Crucially, HALF achieves the first quantitative alignment between fairness metrics and real-world deployment risk.
Abstract
Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity; e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) and evaluates them through a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) neither model size nor general-purpose performance guarantees fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
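The core idea of weighing bias outcomes by harm severity can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual scoring function: the tier weights, domain names, and bias values below are invented for demonstration, assuming a simple linear weighting over the three severity tiers.

```python
# Hypothetical severity weights per tier; the paper does not specify
# these values, they are illustrative only.
SEVERITY_WEIGHTS = {"Severe": 3.0, "Moderate": 2.0, "Mild": 1.0}

def harm_aware_score(domain_bias, domain_tier):
    """Aggregate per-domain bias scores into a single deployment-risk
    score, weighting each domain by the severity tier of its potential
    harm. Lower is better (less harm-weighted bias)."""
    total_weight = sum(SEVERITY_WEIGHTS[domain_tier[d]] for d in domain_bias)
    weighted = sum(domain_bias[d] * SEVERITY_WEIGHTS[domain_tier[d]]
                   for d in domain_bias)
    return weighted / total_weight

# Example: three made-up domains, one per tier. A large bias in a Mild
# domain (summarization) contributes less than a small bias in a
# Severe one (clinical).
bias = {"clinical": 0.10, "hiring": 0.30, "summarization": 0.50}
tier = {"clinical": "Severe", "hiring": "Moderate", "summarization": "Mild"}
print(round(harm_aware_score(bias, tier), 3))  # 0.233
```

Under a naive unweighted average these scores would yield 0.3; the severity weighting pulls the score down because the worst bias sits in the lowest-harm tier, which is precisely the distinction HALF is designed to capture.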