MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
Existing LLM-based evaluators exhibit language bias and unfairness when assessing non-English outputs, undermining the reliability of multilingual LLM evaluation. Method: We introduce MM-Eval, the first meta-evaluation benchmark explicitly designed for multilingual settings, covering a core set of 18 languages and a consistency set of 122 languages. Rather than relying on English translation, it builds native multilingual meta-evaluation tasks and proposes a dual-dimensional framework that measures language consistency and cross-lingual fairness, moving beyond single-metric ranking accuracy. Techniques include multilingual prompt engineering, cross-lingual consistency modeling, and absolute score distribution analysis. Contribution/Results: We empirically demonstrate that state-of-the-art English-centric evaluators suffer significant accuracy degradation and unfairness on low-resource languages. MM-Eval achieves significantly stronger Best-of-N ranking correlation than existing benchmarks. All data and code are publicly released.

📝 Abstract
As Large Language Models (LLMs) are now capable of producing fluent and coherent content in languages other than English, it is now imperative to precisely evaluate these non-English outputs. However, when assessing the outputs from multilingual LLMs, prior works often employed LLM-based evaluators that excel at assessing English outputs, without a thorough examination of whether these evaluators could effectively assess non-English text as well. Moreover, existing benchmarks to test evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine whether evaluator LLMs can reliably assess the outputs of multilingual LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages. A core attribute of MM-Eval is that, instead of merely translating existing English meta-evaluation benchmarks, it is designed with multilingual-specific challenges in mind. Additionally, unlike existing meta-evaluation benchmarks that focus solely on ranking accuracy over pairwise data, MM-Eval also evaluates the consistency and fairness of absolute score values across a wide range of languages. Our results show that existing evaluator LLMs that excel in English contexts have considerable room for improvement when assessing non-English outputs. Furthermore, we find that evaluators are unfair and inconsistent when evaluating lower-resourced languages. Finally, we validate MM-Eval by measuring its correlation with Best-of-N rankings, finding a significantly stronger correlation compared to other meta-evaluation benchmarks. We publicly release our benchmark and code.
Problem

Research questions and friction points this paper is trying to address.

Evaluating non-English outputs of multilingual LLMs effectively
Assessing fairness and consistency of evaluator LLMs across languages
Addressing lack of multilingual meta-evaluation benchmarks for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual meta-evaluation benchmark MM-Eval
Covers 18 core and 122 consistency languages
Evaluates ranking accuracy and score fairness
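The two measurement types the benchmark combines can be illustrated with a minimal sketch. The data and metric definitions below are hypothetical simplifications for illustration, not the paper's exact formulas: ranking accuracy is the fraction of pairwise cases where the judge agrees with the gold preference, and consistency is taken here as the mean absolute deviation of a judge's absolute scores for the same content across languages.

```python
def ranking_accuracy(judgments):
    """Fraction of pairwise cases where the judge picks the
    gold-preferred response. Each item: (judge_pick, gold_pick)."""
    correct = sum(1 for judge, gold in judgments if judge == gold)
    return correct / len(judgments)

def score_consistency(scores_by_lang):
    """Mean absolute deviation of a judge's absolute scores for the
    same content across languages; lower means more consistent.
    (A simplified stand-in for the paper's consistency measure.)"""
    mean = sum(scores_by_lang.values()) / len(scores_by_lang)
    return sum(abs(s - mean) for s in scores_by_lang.values()) / len(scores_by_lang)

# Toy example: the judge agrees with the gold preference on 3 of 4 pairs.
pairs = [("A", "A"), ("B", "B"), ("A", "B"), ("B", "B")]
print(ranking_accuracy(pairs))  # 0.75

# Toy example: the same response scored in three languages; a large
# spread suggests language bias in the evaluator.
scores = {"en": 4.5, "ko": 3.0, "sw": 2.5}
print(round(score_consistency(scores), 3))  # 0.778
```

A real harness would aggregate both metrics per language, which is what lets MM-Eval surface the accuracy degradation and unfairness on lower-resourced languages reported in the results.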
Guijin Son
Undergraduate, Yonsei University
Natural Language Processing, Large Language Models

Dongkeun Yoon
KAIST

Juyoung Suk
KAIST
Large Language Models

Javier Aula-Blasco
Barcelona Supercomputing Center

Mano Aslan
Artful Media

Vu Trong Kim
KAIST

Shayekh Bin Islam
Bangladesh University of Engineering and Technology

Jaume Prats-Cristia
Barcelona Supercomputing Center

Lucía Tormo-Bañuelos
Barcelona Supercomputing Center

Seungone Kim
Carnegie Mellon University
Large Language Models, Natural Language Processing