🤖 AI Summary
This work proposes a novel approach to validating natural language generation evaluation metrics by using large language models as meta-evaluators, circumventing the need for costly, time-consuming human annotations that typically exist only for English. The method generates multilingual synthetic evaluation data through controlled semantic degradation, enabling efficient, human-free construction of benchmark datasets across languages. Evaluated on machine translation, question answering, and summarization, the approach achieves meta-correlations with human judgments exceeding 0.9 in multilingual QA, demonstrating both its effectiveness and its cross-lingual generalizability. The authors present this as the first framework to enable fully automated, multilingual synthetic data generation for metric validation without any human intervention.
📝 Abstract
Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose \textit{LLM as a Meta-Judge}, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using \textit{meta-correlation}, which measures the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will be made publicly available upon acceptance.
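The meta-correlation described above can be sketched as a rank correlation between per-metric scores computed on synthetic data and the same metrics' scores on a human-annotated benchmark. A minimal illustration follows; the choice of Spearman correlation, the metric names, and all score values are assumptions for the example, not details taken from the paper.

```python
# Sketch of "meta-correlation": correlate the ranking of evaluation
# metrics induced by synthetic validation data with the ranking induced
# by a human-annotated benchmark. All numbers here are hypothetical.

def rankdata(scores):
    # Assign rank 1 to the highest score; no tie handling for simplicity.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman(xs, ys):
    # Spearman's rho via the classic formula, valid for untied ranks.
    n = len(xs)
    rx, ry = rankdata(xs), rankdata(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical correlation-with-gold scores for four metrics
# (e.g. BLEU, chrF, BERTScore, COMET), measured two ways:
synthetic_scores = [0.81, 0.74, 0.66, 0.59]  # on LLM-degraded synthetic data
human_scores     = [0.78, 0.69, 0.70, 0.55]  # on a human-judged benchmark

meta_corr = spearman(synthetic_scores, human_scores)
print(round(meta_corr, 2))  # → 0.8
```

A high value here means that ranking candidate metrics on cheap synthetic data agrees with ranking them on expensive human annotations, which is precisely the property the framework's meta-correlation results (above 0.9 in multilingual QA) are meant to establish.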