Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models

📅 2026-05-02

🤖 AI Summary
This study addresses the largely untested demographic bias risks of large language models (LLMs) in high-stakes police dispatch, particularly across languages. The authors propose a cross-lingual fairness audit framework that models emergency call prioritization as a five-level ordinal classification task. Using controlled minimal-pair experiments, they evaluate how 11 frontier LLMs respond to religious-appearance, gender, and racial cues embedded in incident descriptions in English and Mandarin Chinese. Analysis of 19,800 model outputs shows that bias emerges mainly when incident severity is ambiguous and depends on the interaction among language, contextual ambiguity, and demographic signals. The bias patterns are asymmetric: religious appearance has the largest effect overall, gender bias is amplified in Mandarin Chinese, and racial bias is more pronounced in English. Some cues produce counter-directional effects, which challenges simple stereotype-amplification accounts. The framework lets deploying agencies run pre-deployment fairness assessments on jurisdiction-relevant scenarios.
📝 Abstract
Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.
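The minimal-pair idea described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the incident template, the cue wording, and the direction of the 1-to-5 priority scale are all assumptions made for illustration, and the numbers in the usage note are made-up toy values, not results from the paper.

```python
from statistics import mean

# Assumed five-level ordinal scale; treating 1 as most urgent is an
# illustrative choice, not something stated in the abstract.
PRIORITY_LEVELS = (1, 2, 3, 4, 5)


def make_minimal_pair(template: str, baseline_cue: str, marked_cue: str) -> tuple[str, str]:
    """Fill one incident template twice, changing only the demographic cue.

    The template must contain a single `{person}` placeholder (hypothetical
    convention for this sketch).
    """
    return template.format(person=baseline_cue), template.format(person=marked_cue)


def mean_priority_shift(baseline: list[int], marked: list[int]) -> float:
    """Mean priority of the marked variant minus that of the baseline variant.

    Each list holds the priority levels a model returned over repeated runs
    of the same prompt. A value of 0.0 means the cue did not move the
    average priority.
    """
    for level in (*baseline, *marked):
        if level not in PRIORITY_LEVELS:
            raise ValueError(f"priority level out of range: {level}")
    return mean(marked) - mean(baseline)
```

As a usage example, `make_minimal_pair("Caller reports {person} standing outside a closed shop.", "a man", "a woman")` yields two prompts that differ only in the cue, and `mean_priority_shift([3, 3, 4], [2, 3, 3])` returns roughly -0.67 for those toy inputs. Comparing such shifts separately per language and per demographic axis is one way to surface the cross-lingual asymmetries the abstract mentions.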
Problem

Research questions and friction points this paper is trying to address.

demographic bias
emergency police dispatch
large language models
cross-lingual evaluation
fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual audit
demographic bias
large language models
minimal-pair design
emergency dispatch
William Guey
Department of Industrial Engineering, Tsinghua University, Beijing, China
Wei Zhang
Electronic Engineering Department, Tsinghua University
Photonic and quantum devices
Pierrick Bougault
Department of Industrial Engineering, Tsinghua University, Beijing, China
Yi Wang
Department of Industrial Engineering, Tsinghua University, Beijing, China
Bertan Ucar
Department of Industrial Engineering, Tsinghua University, Beijing, China
Vitor D. de Moura
School of Social Sciences, Tsinghua University, Beijing, China
José O. Gomes
Department of Industrial Engineering, Federal University of Rio de Janeiro, Brazil