Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP

📅 2025-10-26
🤖 AI Summary
Humanitarian organizations face a trade-off between costly commercial APIs and less-validated open-weight models for multilingual human rights monitoring, especially for low-resource languages such as Lingala and Burmese. This paper systematically evaluates six large language models across seven languages on human rights violation detection. We propose four novel cross-lingual reliability metrics, Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS), and apply them in a quantitative analysis over 78,000 inferences. Results show that instruction alignment, not model scale, primarily governs cross-lingual stability: aligned models achieve language-agnostic reasoning, sustaining high accuracy and well-calibrated predictions even in low-resource settings, whereas open-weight models exhibit marked prompt-language sensitivity and calibration drift. The study provides resource-constrained organizations with empirically grounded model-selection criteria and practical deployment guidelines.

📝 Abstract
Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.
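The abstract names four reliability metrics (CD, B, LRS, LSS) without giving their formulas. As a rough illustration of the kind of cross-lingual stability measurement involved, the following sketch computes a per-language accuracy spread and two toy scores; the function names mirror the paper's metrics, but the formulas and numbers here are assumptions for illustration, not the paper's definitions or results:

```python
from statistics import mean, pstdev

# Hypothetical per-language accuracies for one model on the
# violation-detection task (illustrative numbers, not from the paper).
accuracy_by_language = {
    "English": 0.91, "French": 0.90, "Spanish": 0.89,
    "Arabic": 0.88, "Swahili": 0.87, "Lingala": 0.86, "Burmese": 0.85,
}

def language_stability_score(acc: dict[str, float]) -> float:
    """Toy stability score: 1 minus the population std-dev of
    per-language accuracy, so identical accuracy in every language
    yields 1.0 (an assumed form, not the paper's LSS definition)."""
    return 1.0 - pstdev(acc.values())

def language_robustness_score(acc: dict[str, float]) -> float:
    """Toy robustness score: worst-language accuracy relative to the
    cross-language mean (an assumed form, not the paper's LRS)."""
    return min(acc.values()) / mean(acc.values())

print(f"LSS ~ {language_stability_score(accuracy_by_language):.3f}")
print(f"LRS ~ {language_robustness_score(accuracy_by_language):.3f}")
```

A model whose accuracy is flat across all seven languages would score near 1.0 on both toy measures, which is the qualitative behavior the paper attributes to instruction-aligned models.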
Problem

Research questions and friction points this paper is trying to address.

Comparing commercial and open-weight LLMs for human-rights-violation detection
Evaluating cost-reliability trade-off for multilingual humanitarian monitoring
Assessing cross-lingual stability across seven languages including low-resource ones
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic comparison of commercial and open-weight LLMs
Multilingual alignment ensures cross-lingual stability and reliability
New metrics measure calibration deviation and language robustness
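The calibration-deviation idea above can be illustrated with a minimal expected-calibration-style gap: the absolute difference between a model's mean confidence and its accuracy within one language. This is a hypothetical stand-in for the paper's CD metric, with made-up numbers showing the overconfidence pattern the paper reports for open-weight models in low-resource languages:

```python
from statistics import mean

def calibration_gap(confidences: list[float], correct: list[int]) -> float:
    """Toy calibration gap for one language: |mean confidence - accuracy|.
    An assumed stand-in, not the paper's Calibration Deviation formula."""
    return abs(mean(confidences) - mean(correct))

# Illustrative: high-confidence predictions but only half correct,
# i.e. an overconfident, miscalibrated model.
conf = [0.95, 0.90, 0.92, 0.88]
hit = [1, 1, 0, 0]
print(f"gap = {calibration_gap(conf, hit):.4f}")
```

A well-calibrated model would keep this gap near zero in every prompt language; calibration drift shows up as the gap widening for low-resource languages.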
Poli Nemkova
University of North Texas, Denton, TX, USA
Amrit Adhikari
University of North Texas, Denton, TX, USA
Matthew Pearson
Davidson College, Davidson, NC, USA
Vamsi Krishna Sadu
University of North Texas, Denton, TX, USA
Mark V. Albert
Associate Professor of Computer Science
machine learning · wearable sensors · biomedical applications