🤖 AI Summary
LLM-as-a-judge suffers from high uncertainty and low reliability in NLG evaluation. To address this, we introduce conformal prediction to LLM-based assessment (the first such application), proposing an interval-scoring framework that yields continuous prediction intervals with statistically guaranteed coverage from a single inference pass. For discrete rating tasks, we design an ordinal boundary adjustment mechanism and use the interval midpoint as a low-bias score estimator. We further integrate prompt optimization to improve cross-prompt consistency. Experiments across multiple benchmarks show that our method significantly improves the reliability and stability of LLM judgments while achieving well-calibrated uncertainty quantification, establishing a new paradigm for trustworthy automated evaluation of NLG systems.
📝 Abstract
LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluations remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze this uncertainty by offering a prediction interval for LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also propose a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. Extensive experiments and analysis show that conformal prediction provides valid prediction intervals with coverage guarantees. We also explore the usefulness of the interval midpoint and of judge reprompting for better judgment.
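To make the idea concrete, here is a rough sketch of how split conformal prediction could turn pointwise judge scores into intervals with a marginal coverage guarantee, plus a hypothetical ordinal boundary adjustment and midpoint score. This is an illustration under our own assumptions (absolute-error nonconformity, a 1-to-5 rating grid), not the paper's exact procedure.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal interval around a judge's predicted score.

    cal_pred / cal_true: predicted and reference scores on a held-out
    calibration set. Under exchangeability, the returned interval covers
    the true score with probability >= 1 - alpha (marginally).
    """
    # nonconformity score: absolute error on the calibration set
    scores = np.abs(np.asarray(cal_pred, float) - np.asarray(cal_true, float))
    n = len(scores)
    # finite-sample corrected quantile level ceil((n+1)(1-alpha))/n
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    return test_pred - q, test_pred + q

def ordinal_midpoint(lo, hi, lo_bound=1, hi_bound=5):
    """Hypothetical ordinal boundary adjustment for discrete ratings:
    snap the interval ends outward to the rating grid, clip to the
    scale bounds, and return the midpoint as a low-bias point score."""
    lo_adj = max(lo_bound, np.floor(lo))
    hi_adj = min(hi_bound, np.ceil(hi))
    return (lo_adj + hi_adj) / 2.0
```

For example, with ten calibration pairs and alpha = 0.1 the corrected quantile level reaches 1.0, so the interval half-width is simply the largest calibration error; larger calibration sets give tighter, better-calibrated intervals.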