Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

📅 2025-04-18

📈 Citations: 0

✨ Influential: 0

career value

139K/year

🤖 AI Summary

This paper identifies a pervasive response-length bias confounding problem in uncertainty quantification (UQ) evaluation for large language models (LLMs): mainstream correctness metrics (e.g., ROUGE-L) and UQ methods jointly exhibit length preferences, inducing spurious correlations that systematically inflate UQ performance estimates. We formally define the “length-bias coupling effect” between UQ methods and correctness functions and propose LLM-as-a-judge as a robust, low-length-sensitivity evaluation paradigm. Through a comprehensive multidimensional benchmark—spanning 7 correctness functions, 4 datasets, 4 LLMs, and 6 UQ methods—we empirically demonstrate that length bias can shift AUROC by over 0.15. Our proposed paradigm reduces length sensitivity by 62%, significantly enhancing the accuracy and reliability of UQ assessment.

Technology Category

Application Category

📝 Abstract

Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions -- from lexical-based and embedding-based metrics to LLM-as-a-judge approaches -- across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.

Problem

Research questions and friction points this paper is trying to address.

Evaluating UQ methods in LMs faces bias from correctness functions

Length biases in errors distort UQ assessments in language models

LLM-as-a-judge approaches reduce length bias in UQ evaluations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates 7 correctness functions across models

Identifies length biases in UQ assessments

Recommends LLM-as-a-judge to reduce biases

🔎 Similar Papers

Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation