🤖 AI Summary
LLM-as-a-judge suffers from high uncertainty and low reliability in NLG evaluation. To address this, we introduce conformal prediction to LLM-based assessment (the first such application), proposing an interval-scoring framework that yields continuous prediction intervals with statistically guaranteed coverage from a single inference pass. For discrete rating tasks, we design an ordinal boundary adjustment mechanism and use the interval midpoint as a low-bias score estimator. We further integrate prompt optimization to improve cross-prompt consistency. Experiments across multiple benchmarks show that our method significantly improves the reliability and stability of LLM judgments while achieving well-calibrated uncertainty quantification, establishing a new paradigm for trustworthy automated evaluation of NLG systems.
📝 Abstract
LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluations remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze this uncertainty by offering a prediction interval for LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also propose a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. Extensive experiments and analysis show that conformal prediction provides valid prediction intervals with coverage guarantees. We also explore the usefulness of the interval midpoint and of judge reprompting for better judgment.
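To make the idea concrete, here is a rough sketch of how split conformal prediction could turn pointwise judge scores into intervals with a marginal coverage guarantee, plus a hypothetical ordinal boundary adjustment and midpoint score. This is an illustration under our own assumptions (absolute-error nonconformity, a 1-to-5 rating grid), not the paper's exact procedure.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal interval around a judge's predicted score.

    cal_pred / cal_true: predicted and reference scores on a held-out
    calibration set. Under exchangeability, the returned interval covers
    the true score with probability >= 1 - alpha (marginally).
    """
    # nonconformity score: absolute error on the calibration set
    scores = np.abs(np.asarray(cal_pred, float) - np.asarray(cal_true, float))
    n = len(scores)
    # finite-sample corrected quantile level ceil((n+1)(1-alpha))/n
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    return test_pred - q, test_pred + q

def ordinal_midpoint(lo, hi, lo_bound=1, hi_bound=5):
    """Hypothetical ordinal boundary adjustment for discrete ratings:
    snap the interval ends outward to the rating grid, clip to the
    scale bounds, and return the midpoint as a low-bias point score."""
    lo_adj = max(lo_bound, np.floor(lo))
    hi_adj = min(hi_bound, np.ceil(hi))
    return (lo_adj + hi_adj) / 2.0
```

For example, with ten calibration pairs and alpha = 0.1 the corrected quantile level reaches 1.0, so the interval half-width is simply the largest calibration error; larger calibration sets give tighter, better-calibrated intervals.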