Improving Methodologies for LLM Evaluations Across Global Languages

πŸ“… 2026-01-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the lack of systematic evaluation of large language model (LLM) safety in multilingual settings and the inconsistent performance of existing safeguards across diverse linguistic and cultural contexts. Conducted by the International Network for Advanced AI Measurement and Evaluation, the research evaluates two open-source LLMs across ten high- and low-resource languages, covering five categories of harmful content. The assessment integrates both LLM-as-a-judge and human annotation, enhanced by culturally contextualized translation, refined evaluation prompts, and detailed annotation guidelines. This work introduces a reusable framework for multilingual AI safety evaluation and demonstrates significant variations in model safety performance across languages and harm types, while also highlighting the critical influence of evaluation methodology on result reliability.

πŸ“ Abstract
As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages, including differences in safeguard robustness across languages and harm types, and variation in evaluator reliability (LLM-as-a-judge vs. human review). It also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.
Problem

Research questions and friction points this paper is trying to address.

multilingual evaluation
LLM safety
cross-lingual robustness
AI safety
language fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual evaluation
LLM safety
cross-lingual robustness
culturally contextualized translation
LLM-as-a-judge
πŸ”Ž Similar Papers
No similar papers found.
Authors
Akriti Vij
Benjamin Chua
Darshini Ramiah
En Qi Ng
Mahran Morsidi
Naga Nikshith Gangarapu
Sharmini Johnson
Vanessa Wilfred
V. Kumaran
Wan Sie Lee
Wenzhuo Yang
Yongsen Zheng (Nanyang Technological University / Sun Yat-sen University)
Bill Black
Boming Xia (Responsible AI Research (RAIR) Centre, Adelaide University)
Frank Sun
Hao Zhang
Qinghua Lu (CSIRO's Data61)
Suyu Ma (CSIRO's Data61)
Yue Liu
Chi-kiu Lo
Fatemeh Azadi
Isar Nejadgholi (National Research Council Canada, University of Ottawa)
Sowmya Vajjala (National Research Council, Canada)
Agnès Delaborde
Nicolas Rolin
Tom Seimandi
Akiko Murakami (IBM Research Tokyo)
Haruto Ishi
Satoshi Sekine (LLMC, NII)
Takayuki Semitsu
Tasuku Sasaki
Angela Kinuthia
Jean Wangari
Michael Michie
Stephanie Kasaon
Hankyul Baek
Jaewon Noh
Kihyuk Nam
Sang Seo
Sungpil Shin
Taewhi Lee
Yongsu Kim
Daisy Newbold-Harrop
Jessica Wang
Mahmoud Ghanem
Vy Hong