Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Evaluating multilingual text generation in NLP has long been hindered by the scarcity of non-English reference answers and of human-annotated resources. To address this, the authors propose CIA, the first cross-lingual LLM-based automatic evaluation framework. It introduces (1) a cross-lingual zero-shot evaluation paradigm that assesses multilingual responses using English-only reference answers; (2) Recon, a human-annotated multilingual test set of 500 instructions with human judgment scores across six languages; and (3) Hercule, a cross-lingual evaluator LLM trained via instruction tuning to score target-language responses against English references. Experiments show that Hercule aligns more closely with human judgments than proprietary evaluators across the six languages and generalizes zero-shot to unseen languages. All code, data, and models are publicly released.

πŸ“ Abstract
Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
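The core idea above can be sketched in code: an evaluator is shown a target-language instruction and response alongside an English-only reference answer, and returns a score. This is a minimal illustrative sketch, not the paper's actual prompt template; the prompt wording, 1-5 rubric, and helper names are assumptions made for illustration.

```python
def build_evaluation_prompt(instruction: str, response: str,
                            english_reference: str) -> str:
    """Assemble a single evaluator prompt that pairs a target-language
    response with an English-only reference answer (hypothetical format)."""
    return (
        "You are an impartial evaluator. Score the response on a 1-5 scale.\n"
        f"Instruction (target language): {instruction}\n"
        f"Response (target language): {response}\n"
        f"Reference answer (English): {english_reference}\n"
        "Return brief feedback followed by an integer score from 1 to 5."
    )

def parse_score(evaluator_output: str) -> int:
    """Extract the final integer score (1-5) from the evaluator's output."""
    digits = [int(tok) for tok in evaluator_output.split() if tok.isdigit()]
    if not digits or not 1 <= digits[-1] <= 5:
        raise ValueError("no valid 1-5 score found in evaluator output")
    return digits[-1]

if __name__ == "__main__":
    prompt = build_evaluation_prompt(
        instruction="Bharat ki rajdhani kya hai?",      # Hindi (romanized)
        response="Bharat ki rajdhani Nai Dilli hai.",
        english_reference="The capital of India is New Delhi.",
    )
    # A trained evaluator such as Hercule would generate the judgment;
    # here we only parse a sample output string.
    print(parse_score("Feedback: correct and fluent. Score: 5"))
```

The point of the sketch is the asymmetry: only the reference answer is in English, which is what lets a single English-annotated reference set serve every target language.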
Problem

Research questions and friction points this paper is trying to address.

Lack of multilingual evaluation frameworks for machine-generated text
Scarcity of reference answers in non-English languages for assessment
Need for scalable cross-lingual evaluation methods for low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual evaluation model Hercule
Multilingual test set Recon
Zero-shot evaluation on unseen languages