LLM-as-a-qualitative-judge: automating error analysis in natural language generation

📅 2025-06-10
📈 Citations: 0
✹ Influential: 0
đŸ€– AI Summary
Current NLG evaluation lacks interpretable, actionable qualitative analysis. Method: We propose the first LLM-driven qualitative error analysis paradigm, treating the LLM as a structured error diagnostician rather than a mere scorer. The approach combines open-ended per-instance issue generation, clustering-based error type induction, validation against human annotations, and a standardized evaluation protocol. Contribution/Results: The method moves automatically from individual error instances to semantically coherent error categories, helping developers identify root causes. Evaluated on 12 mainstream NLG datasets, it achieves 67% accuracy in per-instance issue identification and an F1-score of 0.82 between its induced error taxonomy and human annotations, outperforming heuristic classification baselines. The result is a practical, deployable diagnostic tool for iterative NLG system improvement.
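The first step of the pipeline (open-ended per-instance issue analysis) can be sketched as a prompt that asks the judge LLM to describe, in free text, what is wrong with a single system output. The prompt wording and the `build_issue_prompt` helper below are illustrative assumptions, not taken from the paper:

```python
def build_issue_prompt(source: str, output: str) -> str:
    """Assemble a hypothetical open-ended issue-analysis prompt for one instance."""
    return (
        "You are analyzing the output of a natural language generation system.\n"
        f"Input:\n{source}\n\n"
        f"System output:\n{output}\n\n"
        "Describe in one or two sentences the main issue with the output, "
        "or reply 'no issue' if it is acceptable."
    )

# Usage: send the prompt to any chat-completion API; the free-text
# answer becomes one "singleton" issue to be clustered later.
prompt = build_issue_prompt("Translate to French: Good morning.", "Bonsoir.")
```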

📝 Abstract
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
Problem

Research questions and friction points this paper is trying to address.

Automating error analysis in NLG using LLMs
Providing structured reports for NLG system improvements
Evaluating LLM-based qualitative judgment accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based structured error report generation
Open-ended per-instance issue analysis
Intuitive cumulative clustering algorithm
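The cumulative clustering step above can be sketched as follows: each newly discovered issue is compared against the clusters built so far, joining the first matching cluster or opening a new one. In the paper the matching decision is made by an LLM; the `same_issue_type` word-overlap matcher below is a crude stand-in for that call, and all names are illustrative:

```python
def same_issue_type(issue: str, cluster_examples: list[str]) -> bool:
    """Stand-in for an LLM judgment: crude word-overlap with the cluster's
    first example. The actual approach would query an LLM instead."""
    a = set(issue.lower().split())
    b = set(cluster_examples[0].lower().split())
    return len(a & b) / max(len(a | b), 1) > 0.3

def cluster_issues(issues: list[str]) -> list[list[str]]:
    """Cumulative clustering: assign each issue to the first matching
    cluster, or open a new cluster if none matches."""
    clusters: list[list[str]] = []
    for issue in issues:
        for cluster in clusters:
            if same_issue_type(issue, cluster):
                cluster.append(issue)
                break
        else:
            clusters.append([issue])
    return clusters

issues = [
    "output omits the second sentence of the source",
    "output omits the second sentence entirely",
    "numbers in the answer are hallucinated",
]
report = cluster_issues(issues)  # two clusters: omissions vs. hallucinations
```

Because the algorithm is cumulative, clusters form in a single pass over the issues, which keeps the number of LLM matching calls linear in the number of discovered issues.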