🤖 AI Summary
This work addresses the self-inconsistency problem inherent in large language models (LLMs) when employed as natural language generation (NLG) evaluators. Empirical analysis reveals substantial intra-rater unreliability: LLM judges exhibit high variance across repeated evaluations of identical NLG outputs, with scores approaching randomness in certain settings, severely undermining their credibility as “referees.” To tackle this, we conduct the first systematic, quantitative characterization of self-inconsistency across diverse NLG tasks and benchmarks under the LLM-as-a-judge paradigm. We then propose a stability-enhancing framework grounded in structured prompting and explicit evaluation guidelines. Experimental results demonstrate that carefully designed assessment protocols significantly improve both inter-run consistency and alignment with human preferences. This study establishes foundational principles for modeling and improving the reliability of LLM-based evaluation, offering both theoretical insights and practical, deployable strategies for trustworthy NLG assessment.
📝 Abstract
As Natural Language Generation (NLG) systems see increasingly wide adoption, properly assessing their outputs has become challenging. Recently, using large language models (LLMs) to evaluate these generations has gained traction, since their judgments tend to align more closely with human preferences than conventional n-gram- or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability: the scores they assign to the same output vary substantially across repeated runs. This variance makes their ratings inconsistent, and in the worst case almost arbitrary, which makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks, and examine whether LLM judges can still be useful when proper evaluation guidelines are followed.
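The intra-rater reliability problem described above can be illustrated with a minimal sketch. The metric below (mean per-item standard deviation over repeated judge runs) is an assumption for illustration, not necessarily the measure used in the paper; the score values are likewise hypothetical.

```python
import statistics

def intra_rater_inconsistency(runs_per_item):
    """Given one list of repeated judge scores per evaluated output,
    return the mean per-item standard deviation of those scores.
    0.0 means the judge is perfectly self-consistent; larger values
    mean the same output receives noisier scores across runs."""
    return statistics.mean(
        statistics.pstdev(scores) for scores in runs_per_item
    )

# Hypothetical 1-5 scores: one LLM judge rating the same 3 outputs, 4 runs each.
consistent_judge = [[4, 4, 4, 4], [2, 2, 2, 2], [5, 5, 5, 5]]
noisy_judge      = [[1, 5, 3, 2], [4, 1, 5, 2], [3, 5, 1, 4]]

print(intra_rater_inconsistency(consistent_judge))  # 0.0
print(intra_rater_inconsistency(noisy_judge))       # noticeably larger
```

A near-zero value indicates stable judgments; a value comparable to the score scale's spread indicates the near-arbitrary ratings the abstract warns about.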