LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes a novel approach to validating natural language generation evaluation metrics by leveraging large language models as meta-evaluators, circumventing the need for costly and time-consuming human annotations, which are typically limited to English. The method generates multilingual synthetic evaluation data through controlled semantic degradation, enabling efficient, human-free construction of benchmark datasets across languages. It achieves strong meta-correlations with human judgments on machine translation, question answering, and summarization tasks, exceeding 0.9 in multilingual question answering, demonstrating both its effectiveness and cross-lingual generalizability. The authors present this as the first framework to enable fully automated, multilingual synthetic data generation for metric validation without any human intervention.
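
As a rough illustration of the controlled semantic degradation step described above, the sketch below prompts an LLM to corrupt a reference text at graded severity levels, yielding synthetic candidates with known relative quality. The prompt wording and the call_llm hook are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of controlled semantic degradation: ask an LLM to rewrite
# a reference text at a requested severity level, producing synthetic
# candidates whose relative quality is known by construction.
# NOTE: `call_llm` and the prompt template are hypothetical placeholders,
# not the paper's actual setup.
DEGRADATION_PROMPT = (
    "Rewrite the following {lang} text so that its meaning is degraded to "
    "severity level {level} on a scale of 0 (unchanged) to 3 (severely "
    "distorted). Keep the output fluent.\n\nText: {text}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (assumption, not the paper's code)."""
    raise NotImplementedError

def degrade(text: str, lang: str = "English", levels=(0, 1, 2, 3)):
    """Return (severity level, degraded text) pairs for one reference text."""
    return [
        (level, call_llm(DEGRADATION_PROMPT.format(lang=lang, level=level, text=text)))
        for level in levels
    ]
```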

📝 Abstract
Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proving to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will be made publicly available upon paper acceptance.
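
The abstract's meta-correlation measure can be made concrete with a short sketch. Assuming per-example human judgments, synthetic degradation labels, and a set of candidate metrics scored on the same examples (the function and variable names below are illustrative, not the paper's code), meta-correlation reduces to correlating the metric ranking induced by synthetic data with the ranking induced by human benchmarks:

```python
# A minimal sketch of meta-correlation, assuming each candidate metric is
# scored against both human judgments and synthetic labels on the same
# examples. Names and data shapes here are assumptions for illustration.
from scipy.stats import pearsonr, spearmanr

def meta_correlation(human_scores, synthetic_scores, metric_outputs):
    """
    human_scores:     list of human quality judgments, one per example
    synthetic_scores: list of synthetic quality labels, one per example
                      (e.g., degradation severity mapped to a score)
    metric_outputs:   dict mapping metric name -> list of that metric's
                      scores on the same examples
    """
    # How well does each metric track each reference signal?
    vs_human = {m: pearsonr(s, human_scores)[0] for m, s in metric_outputs.items()}
    vs_synth = {m: pearsonr(s, synthetic_scores)[0] for m, s in metric_outputs.items()}

    # Meta-correlation: do the two signals rank the metrics the same way?
    metrics = sorted(metric_outputs)
    rho, _ = spearmanr([vs_human[m] for m in metrics],
                       [vs_synth[m] for m in metrics])
    return rho
```

A meta-correlation near 1.0 would indicate that ranking metrics on synthetic data recovers the same ordering as ranking them on human benchmarks, which is the paper's criterion for synthetic validation being a reliable proxy.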
Problem

Research questions and friction points this paper is trying to address.

evaluation metrics
human annotations
natural language generation
multilingual evaluation
synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM as a Meta-Judge
synthetic data
evaluation metric validation
meta-correlation
controlled semantic degradation