Improving Methodologies for LLM Evaluations Across Global Languages

📅 2026-01-22

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This study addresses the lack of systematic evaluation of large language models’ (LLMs) safety in multilingual settings and the inconsistent performance of existing safeguards across diverse linguistic and cultural contexts. Conducted by the International Network for Advanced AI Measurement and Evaluation, the research evaluates two open-source LLMs across ten high- and low-resource languages, covering five categories of harmful content. The assessment integrates both LLM-as-a-judge and human annotation, enhanced by culturally contextualized translation, refined evaluation prompts, and detailed annotation guidelines. This work introduces a reusable framework for multilingual AI safety evaluation and demonstrates significant variations in model safety performance across languages and harm types, while also highlighting the critical influence of evaluation methodology on result reliability.

Technology Category

Application Category

📝 Abstract

As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high and low resourced groups: Cantonese English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages. These include differences in safeguard robustness across languages and harm types and variation in evaluator reliability (LLM-as-judge vs. human review). Further, it also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.

Problem

Research questions and friction points this paper is trying to address.

multilingual evaluation

LLM safety

cross-lingual robustness

AI safety

language fairness

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual evaluation

LLM safety

cross-lingual robustness