Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the validity and reliability of large language models as judges (LLMs-as-judges, LLJs) in natural language generation (NLG) evaluation, exposing unvalidated assumptions implicit in current practice. Drawing on measurement theory from the social sciences, it critically examines four foundational assumptions: that LLJs can serve as proxies for human judgment, that they are capable evaluators, that they scale, and that they are cost-effective. Each assumption is weighed against the inherent limitations of LLMs, of LLJs, and of current NLG evaluation practice, grounded in three applications: text summarization, data annotation, and safety alignment. The analysis finds that biases, context sensitivity, and construct misalignment undermine these assumptions, so uncritically substituting LLJs for human evaluation risks misguiding NLG development. By bringing measurement-theoretic standards to LLJ assessment, the paper argues for more responsible evaluation practices, so that the growing role of LLJs supports, rather than undermines, progress in NLG.
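The context sensitivity flagged above suggests a concrete reliability probe. As a minimal, hypothetical sketch (not from the paper): re-score the same output under paraphrased judge prompts and inspect the spread of scores. `score_with_judge` is a made-up stand-in, simulated with noise here so the snippet runs.

```python
# Hypothetical stability probe for an LLM judge (illustrative only).
# A reliable judge should give near-identical scores to the same output
# under semantically equivalent prompt phrasings.
import random
import statistics

def score_with_judge(output: str, prompt_template: str) -> int:
    """Stand-in for a real LLJ API call; simulated with noise here.
    In practice, send prompt_template.format(output=output) to the
    judge model and parse the returned 1-5 score."""
    return max(1, min(5, 4 + random.choice([-1, 0, 0, 1])))

PROMPT_PARAPHRASES = [
    "Rate this summary's quality from 1 to 5: {output}",
    "On a 1-5 scale, how good is the following summary? {output}",
    "Assign the summary below a quality score (1=worst, 5=best): {output}",
]

def stability_probe(output: str) -> float:
    """Return the standard deviation of scores across prompt paraphrases.
    High variance signals the prompt sensitivity the paper warns about."""
    scores = [score_with_judge(output, t) for t in PROMPT_PARAPHRASES]
    return statistics.stdev(scores)

print(stability_probe("The model summarizes the article in two sentences."))
```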

📝 Abstract
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
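The first assumption, that LLJs can act as proxies for human judgment, is at least partly testable with standard agreement statistics. A minimal sketch follows; the ratings are fabricated purely for illustration (none of these numbers come from the paper).

```python
# Illustrative sketch (not from the paper): measure agreement between
# judge scores and human ratings on the same generated outputs.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 quality ratings for the same ten generated summaries.
human_scores = np.array([4, 2, 5, 3, 4, 1, 5, 2, 3, 4])
judge_scores = np.array([5, 2, 5, 4, 4, 2, 5, 3, 3, 5])

# Rank correlation: does the judge order outputs the way humans do?
rho, p_value = spearmanr(human_scores, judge_scores)

# Chance-corrected agreement on the discrete rating scale.
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")

print(f"Spearman rho: {rho:.2f} (p={p_value:.3f})")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
```

High correlation alone would not validate a judge: as the paper argues, agreement on one task or dataset says nothing about stability across contexts or about whether the intended construct is being measured.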
Problem

Research questions and friction points this paper is trying to address.

Investigating the validity of large language models as evaluation judges
Challenging reliability assumptions in natural language generation assessment
Examining the premature adoption of LLMs for automated text evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Critically assesses four core assumptions underlying LLM evaluators
Examines LLJ limitations across summarization, data annotation, and safety alignment
Advocates responsible, measurement-grounded evaluation practices