GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

📅 2025-02-27
🤖 AI Summary
This paper addresses the poor calibration of large language models (LLMs)—specifically, their tendency to assign excessively high confidence to incorrect answers—by introducing GRACE, the first fine-grained calibration benchmark. GRACE employs progressive-clue question-answering tasks to jointly evaluate model and human calibration across three dimensions: answer timing, accuracy, and confidence. It further introduces a human-model real-time adversarial paradigm to collect 1,749 contrastive data instances. The paper proposes CalScore, a novel metric that quantifies model-specific calibration biases, and conducts calibration error decomposition to reveal that state-of-the-art LLMs, despite higher accuracy, exhibit significant under-confidence. GRACE establishes a new, interpretable, multidimensional, and human-AI collaborative benchmark for modeling, diagnosing, and optimizing calibration capabilities.

📝 Abstract
Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
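The paper's CalScore metric is not specified in this summary, so as a rough illustration of what "measuring calibration" means in a progressive-clue setting, the sketch below computes a standard expected calibration error (ECE) over a model's per-clue confidences. The data and function name are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch only: GRACE's CalScore is not given here, so this uses
# standard expected calibration error (ECE) as a stand-in. It bins answers by
# stated confidence and averages the gap between confidence and accuracy.

def expected_calibration_error(confidences, correct, n_bins=5):
    """Bin predictions by confidence in [0, 1]; weight each bin's
    |accuracy - mean confidence| by its share of the predictions."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Hypothetical per-clue answers for one question: confidence should rise as
# easier clues are revealed, but here the earliest answer is confidently wrong.
confs = [0.9, 0.6, 0.7, 0.85, 0.95]
correct = [False, False, True, True, True]
print(round(expected_calibration_error(confs, correct), 3))  # prints 0.2
```

A well-calibrated answerer, human or model, would score near zero: when it says 90%, it is right about 90% of the time. The benchmark's actual metric additionally accounts for how early in the clue sequence the answer comes.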
Problem

Research questions and friction points this paper is trying to address.

How well calibrated are language models' confidence estimates?
How does model calibration compare with human calibration?
What types of model miscalibration can GRACE identify?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces GRACE, a granular calibration benchmark
Collects paired human and model calibration data via live competitions
Proposes CalScore for analyzing calibration errors
👥 Authors
Yoo Yeon Sung
University of Maryland
Natural Language Processing
Eve Fleisig
UC Berkeley
Natural Language Processing, Deep Learning, Ethical AI, Fairness in ML
Yu Hou
University of Maryland
Ishan Upadhyay
IIT Bombay
J. Boyd-Graber
University of Maryland