🤖 AI Summary
Humanitarian organizations face a trade-off between costly commercial APIs and unreliable open-weight models for multilingual human rights monitoring, especially for low-resource languages such as Lingala and Burmese. This paper systematically evaluates six large language models across seven languages on human rights violation detection. We propose four novel cross-lingual reliability metrics -- Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS) -- and conduct quantitative analysis over 78,000 inferences. Results show that instruction alignment, not model scale, primarily governs cross-lingual stability: aligned models achieve language-agnostic reasoning, sustaining high accuracy and well-calibrated predictions even in low-resource settings, while open-weight models exhibit marked prompt-language sensitivity and calibration drift. The study provides resource-constrained organizations with empirically grounded model selection criteria and practical deployment guidelines.
📝 Abstract
Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and four new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning, and they provide practical guidance for humanitarian organizations balancing budget constraints against reliability in multilingual deployment.