Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

📅 2026-05-27
🤖 AI Summary
This study addresses the scarcity of in-domain data in low-resource language settings by systematically investigating strategies for building multilingual LLM-as-a-Judge evaluators, that is, large language models (LLMs) used as automatic evaluators. The authors run experiments on English, Spanish, and Basque, representing high-, mid-, and low-resource languages, and examine the effects of instruction translation, monolingual versus multilingual supervised fine-tuning, model size, and zero-shot versus fine-tuned setups. Their findings show that when in-domain data is available, fine-tuned smaller models can reach performance comparable to proprietary models; when it is not, zero-shot evaluation with larger models is more effective; and fine-tuning on out-of-domain data can even degrade evaluation performance. The work extends two existing meta-evaluation datasets to Spanish and Basque and releases the data and code.
📝 Abstract
Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, both when in-domain data is available for fine-tuning and when it is not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages respectively, and examine instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: when in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.
Problem

Research questions and friction points this paper is trying to address.

multilingual LLMs
LLMs-as-a-Judge
low-resource languages
automatic evaluation
in-domain data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual LLMs-as-a-Judge
low-resource languages
fine-tuning strategies
zero-shot evaluation
meta-evaluation