🤖 AI Summary
This paper addresses the poor calibration of large language models (LLMs), in particular their tendency to give confidently incorrect answers, by introducing GRACE, a fine-grained calibration benchmark that compares model calibration with human calibration. GRACE uses question-answer pairs in which each question reveals a series of progressively easier clues that all point to the same answer, so calibration can be measured along three dimensions: how early, how accurately, and how confidently a respondent answers. Live human vs. model competitions yield 1,749 data points on the timing, accuracy, and confidence of human and model teams. The paper proposes CalScore, a metric that uses GRACE to analyze model calibration errors and to identify types of miscalibration that differ from human behavior. Although models are more accurate than humans, humans are generally better calibrated, and state-of-the-art models still struggle on GRACE, which makes it a useful measure of progress on model calibration.
📝 Abstract
Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
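To make the incremental setting concrete, the following minimal sketch plays one progressive-clue question and then scores a batch of answers. It is an illustration under stated assumptions, not the paper's method: the names `Buzz`, `play_question`, and `threshold` are hypothetical, the rule of answering once confidence crosses a fixed threshold is an assumed policy, and the score shown is the standard expected calibration error (ECE), not CalScore, whose definition the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a guesser maps the clues revealed so far
# to a (guess, confidence in [0, 1]) pair.
Guesser = Callable[[Sequence[str]], Tuple[str, float]]


@dataclass
class Buzz:
    clue_index: int    # 0-based index of the clue at which the answer was given
    guess: str
    confidence: float
    correct: bool


def play_question(clues: Sequence[str], answer: str, guesser: Guesser,
                  threshold: float = 0.8) -> Buzz:
    """Reveal clues one at a time and answer at the first clue where
    confidence reaches the threshold (assumed policy). An answer is
    forced at the final clue."""
    for i in range(len(clues)):
        guess, confidence = guesser(clues[: i + 1])
        is_last = i == len(clues) - 1
        if confidence >= threshold or is_last:
            correct = guess.strip().lower() == answer.strip().lower()
            return Buzz(i, guess, confidence, correct)
    raise ValueError("question has no clues")


def expected_calibration_error(confidences: Sequence[float],
                               corrects: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Standard ECE: bin answers by confidence, then take the
    size-weighted mean of |average confidence - accuracy| over bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def toy_guesser(revealed: Sequence[str]) -> Tuple[str, float]:
    """Toy stand-in for a model: more clues give a better guess and higher confidence."""
    n = len(revealed)
    return ("paris" if n >= 2 else "lyon", min(1.0, 0.3 * n))
```

A respondent who answers early and confidently but wrongly is penalized by a calibration score of this kind, while one who waits for easier clues trades timing for accuracy. ECE captures only the confidence-versus-accuracy part of that trade-off; the timing dimension described in the abstract would need a separate term.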