🤖 AI Summary
This paper identifies a pervasive response-length bias confounding problem in uncertainty quantification (UQ) evaluation for large language models (LLMs): mainstream correctness metrics (e.g., ROUGE-L) and UQ methods jointly exhibit length preferences, inducing spurious correlations that systematically inflate UQ performance estimates. We formally define the “length-bias coupling effect” between UQ methods and correctness functions and propose LLM-as-a-judge as a robust, low-length-sensitivity evaluation paradigm. Through a comprehensive multidimensional benchmark—spanning 7 correctness functions, 4 datasets, 4 LLMs, and 6 UQ methods—we empirically demonstrate that length bias can shift AUROC by over 0.15. Our proposed paradigm reduces length sensitivity by 62%, significantly enhancing the accuracy and reliability of UQ assessment.
📝 Abstract
Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions -- from lexical-based and embedding-based metrics to LLM-as-a-judge approaches -- across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.