Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic consistency metrics, such as those based on response probabilities or internal model states, are substantially misaligned with human judgments of LLM response consistency. Method: This work presents a large-scale user study (n = 2,976) to quantify the gap between prevailing automated metrics and human perception of LLM consistency, and proposes a logit-based ensemble method that estimates consistency via response resampling and logit-space modeling. Contribution/Results: In Spearman and Kendall rank correlation with human consistency ratings, the proposed method matches the performance of the best-performing existing metric. The study finds that automated consistency measures, when decoupled from human evaluation, are imperfect enough that evaluation anchored in human input should be more broadly used.

📝 Abstract
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users' perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that evaluation with human input should be more broadly used.
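The paper does not detail its method here, but as an illustration only, here is a minimal sketch of how response resampling and logit-derived confidence could be combined into one consistency score. The function name, the geometric-mean probability signal, and the equal weighting are all assumptions for illustration, not the paper's actual method:

```python
import math
from collections import Counter

def consistency_score(samples):
    """Illustrative consistency estimate from resampled LLM responses.

    `samples` is a list of (text, token_logprobs) pairs, e.g. obtained by
    sampling the same prompt several times at nonzero temperature.
    Combines (a) how often the modal answer recurs across samples with
    (b) the average per-token log-probability of the sampled responses.
    """
    texts = [text for text, _ in samples]
    # (a) agreement rate: fraction of samples matching the most common answer
    _, count = Counter(texts).most_common(1)[0]
    agreement = count / len(texts)
    # (b) mean per-token log-probability, mapped to (0, 1] via exp
    mean_lp = sum(sum(lps) / len(lps) for _, lps in samples) / len(samples)
    confidence = math.exp(mean_lp)  # geometric-mean token probability
    # simple ensemble: equal-weight average of the two signals (a guess)
    return 0.5 * agreement + 0.5 * confidence
```

A run where all resamples agree with high token probability scores close to 1, while disagreeing or low-probability resamples pull the score down.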
Problem

Research questions and friction points this paper is trying to address.

Assessing how well surrogate metrics match human perception of LLM consistency
Evaluating current methods for measuring LLM response consistency
Proposing a logit-based ensemble method to better estimate human-rated consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Logit-based ensemble method for estimating LLM consistency
Large-scale user study (n = 2,976) comparing human perceptions with surrogate metrics
Advocates incorporating human input into LLM consistency evaluation
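Agreement between metric scores and human ratings is reported via Spearman and Kendall rank correlation. As a self-contained illustration (not the paper's evaluation code), Spearman's rho for tie-free data can be computed from rank differences:

```python
def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length sequences (no ties).

    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of x_i and y_i.
    """
    def ranks(values):
        # rank 1 = smallest value; assumes no ties
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A metric whose scores rank responses in the same order as human ratings yields rho = 1; a fully reversed ordering yields rho = -1.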