🤖 AI Summary
This study addresses the underexplored bias risks of large language models (LLMs) in high-stakes police dispatch scenarios, particularly in cross-lingual contexts. The authors propose a cross-lingual fairness auditing framework that models emergency call prioritization as a five-level ordinal classification task. Using controlled minimal-pair experiments, they evaluate how 11 frontier LLMs respond to religious-appearance, gender, and racial cues embedded in incident descriptions in English and Mandarin Chinese. The analysis of 19,800 model outputs shows that bias emerges systematically when incident severity is ambiguous and largely disappears when call content clearly determines the priority. Bias patterns are asymmetric: religious appearance exerts the strongest influence overall, gender bias is more pronounced in Mandarin Chinese, and racial bias is more salient in English. In several scenarios demographic cues produce counter-directional effects, which challenges simple stereotype-amplification accounts. The framework supports pre-deployment fairness assessments on jurisdiction-relevant scenarios.
📝 Abstract
Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.
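To make the minimal-pair idea concrete, the sketch below shows one way a per-language, per-axis bias score could be computed from paired outputs. It is an illustration under stated assumptions, not the authors' implementation: the function name `mean_priority_shift`, the record fields, the 1-to-5 integer encoding of the five priority levels, and the toy values are all hypothetical, and the abstract does not specify which statistic the study uses.

```python
from collections import defaultdict

# Hypothetical sketch: names, record layout, and the 1-5 encoding are
# assumptions for illustration, not taken from the paper.


def mean_priority_shift(records):
    """Return the mean signed shift (cued minus baseline) per (language, axis).

    Each record is a dict with keys 'language', 'axis', 'baseline', 'cued'.
    'baseline' and 'cued' are the priority levels a model assigned to the two
    halves of a minimal pair, which differ only in the demographic cue.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        for key in ("baseline", "cued"):
            if not 1 <= record[key] <= 5:
                raise ValueError(f"{key} must be in 1..5, got {record[key]}")
        group = (record["language"], record["axis"])
        sums[group] += record["cued"] - record["baseline"]
        counts[group] += 1
    # A value of 0 means the cue did not move the assigned priority on average.
    return {group: sums[group] / counts[group] for group in sums}


# Toy records with invented values, only to show the call shape.
toy_records = [
    {"language": "lang_a", "axis": "axis_x", "baseline": 3, "cued": 3},
    {"language": "lang_a", "axis": "axis_x", "baseline": 3, "cued": 2},
    {"language": "lang_b", "axis": "axis_x", "baseline": 4, "cued": 4},
]

for group, shift in sorted(mean_priority_shift(toy_records).items()):
    print(group, shift)
```

Grouping by language before averaging matters here because, as the abstract notes, effects that differ in size or direction between languages can be obscured in an aggregate analysis. A signed mean also keeps counter-directional effects visible, whereas an absolute difference would hide their direction.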