GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

📅 2025-02-27

🤖 AI Summary

This paper addresses the poor calibration of large language models (LLMs), specifically their tendency to give confidently incorrect answers, by introducing GRACE, a fine-grained calibration benchmark that compares models against humans. GRACE uses question-answering tasks with progressively easier clues to evaluate model and human calibration along three dimensions: how early, how accurately, and how confidently an answer is given. Live human vs. model competitions yield 1,749 data points on the timing, accuracy, and confidence of human and model teams. The paper proposes CalScore, a metric that uses GRACE to analyze model calibration errors and identify types of miscalibration that differ from human behavior. The results show that although models are more accurate than humans, humans are generally better calibrated. Because state-of-the-art models struggle on GRACE, it serves as a benchmark for tracking progress on model calibration.

📝 Abstract

Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
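To make the three measured quantities (timing, accuracy, confidence) concrete, here is a minimal sketch of how guesses on a progressive-clue question might be recorded and scored. It is not CalScore, whose definition is not given in this summary; it uses the standard expected calibration error (ECE) as a stand-in, and the `Guess` record, its field names, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Guess:
    """One guess on a progressive-clue question (hypothetical record layout)."""
    clue_index: int    # 0-based index of the last clue revealed before guessing
    confidence: float  # stated probability in [0, 1] that the guess is correct
    correct: bool      # whether the guess matched the reference answer


def expected_calibration_error(guesses: List[Guess], n_bins: int = 10) -> float:
    """Standard binned ECE: weighted mean of |confidence - accuracy| per bin.

    This is a generic calibration measure, not the paper's CalScore.
    """
    if not guesses:
        return 0.0
    bins: List[List[Guess]] = [[] for _ in range(n_bins)]
    for g in guesses:
        if not 0.0 <= g.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(g.confidence * n_bins), n_bins - 1)
        bins[idx].append(g)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(g.confidence for g in b) / len(b)
        accuracy = sum(1 for g in b if g.correct) / len(b)
        ece += (len(b) / len(guesses)) * abs(avg_conf - accuracy)
    return ece


def first_correct_clue(guesses: List[Guess]) -> Optional[int]:
    """Earliest clue index at which a correct guess was made, or None."""
    correct = [g.clue_index for g in guesses if g.correct]
    return min(correct) if correct else None


# Toy data (invented for illustration): always 90% confident, right half the time.
overconfident = [
    Guess(0, 0.9, False),
    Guess(1, 0.9, False),
    Guess(2, 0.9, True),
    Guess(3, 0.9, True),
]
```

On the toy data, the guesser states 90% confidence but is right only half the time, so the ECE comes out near 0.4, and the first correct guess arrives at clue index 2. A benchmark in the spirit of GRACE would additionally reward answering earlier, which plain ECE does not capture.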

Problem

Research questions and friction points this paper is trying to address.

Evaluates how well language models are calibrated

Compares model and human calibration performance

Identifies types of model miscalibration that differ from human behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces GRACE, a calibration benchmark built on questions with progressively easier clues

Compares human and model calibration through live competitions (1,749 data points)

Proposes CalScore, a metric for analyzing model calibration errors
