🤖 AI Summary
Existing RAG evaluation relies either on heuristic metrics that need human-annotated ground truth or on costly LLM-based judges, compromising efficiency, scalability, and multilingual support.
Method: We introduce the first synthetic RAG evaluation arena covering 18 languages and propose a lightweight surrogate judge model. It takes heuristic features (e.g., ROUGE, F1) as input and is trained via supervised learning under the Bradley–Terry pairwise-comparison framework to replace expensive LLM judges.
Contribution/Results: Evaluated on a synthetic multilingual QA benchmark built from Wikipedia corpora, our surrogate judge achieves high rank correlation with LLM judges (Kendall’s τ = 0.909), significantly improving evaluation efficiency and reproducibility. Experiments across 19 multilingual LLMs show that proprietary and large open-source models currently lead. All code and datasets are publicly released.
📝 Abstract
Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems using heuristic-based metrics, but these require human preferences as the ground-truth reference. In contrast, arena-based benchmarks, where systems compete against each other, require an expensive large language model (LLM) as a judge for reliable evaluation. We present a simple and efficient technique that combines the best of both worlds. The idea is to train a surrogate judge that takes heuristic metrics as input and outputs the LLM-as-a-judge prediction. In our work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark built on Wikipedia for 18 diverse languages, focused on multilingual answer-generation evaluation. It extensively couples heuristic features with an LLM judge for evaluation. We benchmark 19 multilingual LLMs and observe a high rank correlation (Kendall's τ = 0.909) between our surrogate judge and GPT-4o as a teacher under the Bradley–Terry framework. Our results show that proprietary and large open-source LLMs currently dominate on MIRAGE-Bench. Our code and datasets are made publicly available here: https://github.com/vectara/mirage-bench.
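To make the surrogate-judge idea concrete, here is a minimal, self-contained sketch (not the authors' implementation; all names, features, and toy data are illustrative assumptions). Under the Bradley–Terry model, the probability that system A beats system B is a logistic function of the difference of their scores; if each score is a learned linear function of heuristic features (e.g., ROUGE and F1), the surrogate judge can be fit by logistic regression on LLM-judged pairwise outcomes:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_surrogate(pairs, dim, lr=0.5, epochs=200):
    """Fit Bradley-Terry weights w so that
    P(A beats B) = sigmoid(w . (x_A - x_B)).
    pairs: list of (features_of_winner, features_of_loser),
    where the winner was chosen by the LLM judge (teacher)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x_win, x_lose in pairs:
            diff = [a - b for a, b in zip(x_win, x_lose)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            # gradient ascent on the log-likelihood of the observed win
            g = 1.0 - p
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w

# Hypothetical toy data: two heuristic features per answer, [ROUGE, F1].
systems = {"A": [0.9, 0.8], "B": [0.6, 0.5], "C": [0.3, 0.2]}
# Simulated teacher verdicts (winner listed first in each pair).
pairs = [(systems["A"], systems["B"]),
         (systems["A"], systems["C"]),
         (systems["B"], systems["C"])]

w = train_surrogate(pairs, dim=2)
scores = {name: sum(wi * xi for wi, xi in zip(w, x))
          for name, x in systems.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # leaderboard order induced by the surrogate judge
```

Once trained, the surrogate scores new systems from heuristic features alone, so the expensive LLM judge is only needed to produce training verdicts; agreement with the teacher's ranking can then be measured with Kendall's τ, as in the paper.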