MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic evaluation benchmarks for RAG rely predominantly on English or machine-translated data, which fails to capture linguistic and cultural nuances and leads to inaccurate multilingual assessment. Method: the authors introduce the first natively multilingual, end-to-end RAG meta-evaluation benchmark. Built on MIRACL, it features real user queries, responses generated by multiple LLMs, and fine-grained expert annotations for faithfulness and relevance. Crucially, it adopts a purely native multilingual paradigm, eschewing translation, and pairs a high-agreement annotation protocol (Cohen's κ > 0.85) with a cross-lingual LLM-as-a-judge evaluation framework. Contribution/Results: experiments show that the benchmark reliably discriminates performance differences across multilingual RAG systems and is sensitive to improvements from prompt engineering and model upgrades, establishing a dependable new standard for multilingual RAG evaluation.
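The inter-annotator agreement figure quoted above (Cohen's κ > 0.85) can be illustrated with a minimal, self-contained computation. This is a generic sketch of the κ statistic, not the paper's annotation tooling; the labels below are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labeled the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical faithfulness labels from two annotators on five responses.
ann_a = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful"]
ann_b = ["faithful", "faithful", "unfaithful", "unfaithful", "unfaithful"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # → 0.615
```

A κ above 0.85, as reported for MEMERAG, indicates near-perfect agreement on standard interpretive scales.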

📝 Abstract
Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally, we apply the dataset to our main use case, which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. We release our benchmark to support the community developing accurate evaluation methods for multilingual RAG systems.
Problem

Research questions and friction points this paper is trying to address.

Develop multilingual RAG evaluation benchmark.
Assess faithfulness and relevance across languages.
Improve automatic evaluators using native datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Meta-Evaluation RAG benchmark
Native-language questions for accuracy
LLM-as-a-judge for multilingual evaluators
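The LLM-as-a-judge setup named above can be sketched as a prompt builder plus a verdict parser. This is a generic illustration, not the paper's actual prompts or evaluation pipeline; the prompt wording and label vocabulary here are assumptions.

```python
def build_judge_prompt(question, context, answer):
    """Assemble a hypothetical faithfulness-judging prompt for an LLM judge.
    The instruction text is illustrative, not MEMERAG's official prompt."""
    return (
        "You are evaluating a RAG system's answer.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n"
        "Is every claim in the answer supported by the context? "
        "Reply with exactly one word: FAITHFUL or UNFAITHFUL."
    )

def parse_verdict(model_output):
    """Map the judge model's raw text output to a boolean faithfulness label."""
    return model_output.strip().upper() == "FAITHFUL"
```

The prompt string would be sent to whichever judge LLM is being benchmarked; the parsed boolean verdicts can then be correlated with the benchmark's expert human annotations to score the evaluator itself.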