🤖 AI Summary
Automated evaluation of open-ended question answering remains challenging, as conventional shallow metrics (e.g., Exact Match (EM) and token-level F1) fail to capture semantic completeness and contextual coherence.
Method: We propose a reference-guided, multi-LLM collaborative adjudication framework: leveraging high-quality reference answers as anchors, we design fine-grained semantic alignment prompts; multiple large language models independently score semantic consistency, and their outputs are aggregated via confidence-weighted fusion to yield robust judgments.
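The confidence-weighted fusion step can be illustrated with a minimal sketch. The function name, the `(score, confidence)` verdict format, and the example values below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of confidence-weighted fusion of judge verdicts,
# assuming each LLM judge returns a (score, confidence) pair with
# score in [0, 1] and confidence in (0, 1].
def fuse_verdicts(verdicts):
    """Return the confidence-weighted mean of per-judge scores."""
    total_conf = sum(conf for _, conf in verdicts)
    if total_conf == 0:
        raise ValueError("at least one judge must report nonzero confidence")
    return sum(score * conf for score, conf in verdicts) / total_conf

# Example: three judges score a candidate answer against the reference;
# the more confident judges pull the consensus toward their scores.
verdicts = [(0.9, 0.8), (0.7, 0.5), (0.85, 0.9)]
print(round(fuse_verdicts(verdicts), 3))  # → 0.834
```

A weighted mean is only one plausible aggregation rule; majority voting or rank-based fusion would slot into the same interface.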
Contribution/Results: This work departs from the dominant single-model evaluation paradigm and introduces the first “reference-guided, multi-adjudicator, weighted-consensus” evaluation mechanism. On multiple open-ended QA benchmarks, our method achieves Pearson correlation coefficients of ≥0.89 with human judgments, significantly outperforming single-LLM baselines, and establishes a reproducible, highly consistent automated evaluation paradigm for open-ended generation.
📝 Abstract
The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.