🤖 AI Summary
This work addresses the self-inconsistency problem inherent in large language models (LLMs) when employed as natural language generation (NLG) evaluators. Empirical analysis reveals substantial intra-rater unreliability: LLM judges exhibit high variance across repeated evaluations of identical NLG outputs, with scores approaching randomness in certain settings, severely undermining their credibility as “referees.” To tackle this, we conduct the first systematic, quantitative characterization of self-inconsistency across diverse NLG tasks and benchmarks under the LLM-as-a-judge paradigm. We then propose a stability-enhancing framework grounded in structured prompting and explicit evaluation guidelines. Experimental results demonstrate that carefully designed assessment protocols significantly improve both inter-run consistency and alignment with human preferences. This study establishes foundational principles for modeling and improving the reliability of LLM-based evaluation, offering both theoretical insights and practical, deployable strategies for trustworthy NLG assessment.
📝 Abstract
As Natural Language Generation (NLG) systems see increasingly wide adoption, properly assessing their outputs has become challenging. Recently, using large language models (LLMs) to evaluate these generations has gained traction, since their judgments tend to align more closely with human preferences than conventional n-gram- or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability: the scores they assign to the same output vary substantially across repeated runs. This variance makes their ratings inconsistent, and in the worst case almost arbitrary, which makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks, and examine whether LLM judges can still be useful when proper evaluation guidelines are followed.
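The intra-rater reliability problem described above can be illustrated with a minimal sketch. The metric below (mean per-item standard deviation over repeated judge runs) is an assumption for illustration, not necessarily the measure used in the paper; the score values are likewise hypothetical.

```python
import statistics

def intra_rater_inconsistency(runs_per_item):
    """Given one list of repeated judge scores per evaluated output,
    return the mean per-item standard deviation of those scores.
    0.0 means the judge is perfectly self-consistent; larger values
    mean the same output receives noisier scores across runs."""
    return statistics.mean(
        statistics.pstdev(scores) for scores in runs_per_item
    )

# Hypothetical 1-5 scores: one LLM judge rating the same 3 outputs, 4 runs each.
consistent_judge = [[4, 4, 4, 4], [2, 2, 2, 2], [5, 5, 5, 5]]
noisy_judge      = [[1, 5, 3, 2], [4, 1, 5, 2], [3, 5, 1, 4]]

print(intra_rater_inconsistency(consistent_judge))  # 0.0
print(intra_rater_inconsistency(noisy_judge))       # noticeably larger
```

A near-zero value indicates stable judgments; a value comparable to the score scale's spread indicates the near-arbitrary ratings the abstract warns about.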