🤖 AI Summary
This work investigates confidence miscalibration in large language models (LLMs) on question-answering tasks: specifically, whether LLMs exhibit human-like difficulty sensitivity—underconfidence on easy items and overconfidence on hard ones—and whether socially grounded identity cues (e.g., expert vs. layperson, race, gender, age) induce systematic, accuracy-irrelevant confidence biases. To isolate confidence estimation from answer generation, we propose Answer-Free Confidence Estimation (AFCE), a two-stage prompting framework. Evaluating Llama-3-70B, Claude-3-Sonnet, and GPT-4o on MMLU and GPQA benchmarks, we find that LLMs’ confidence is largely insensitive to item difficulty and significantly distorted by identity prompts. AFCE achieves the first consistent calibration improvement across models and tasks: reducing mean calibration error by 38%, enhancing difficulty sensitivity, and aligning confidence distributions more closely with empirically observed human cognitive patterns.
📝 Abstract
Psychology research has shown that humans are poor at estimating their own performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult ones. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: they are less sensitive to task difficulty, and when prompted to answer as different personas -- e.g., expert vs. layperson, or different races, genders, and ages -- they respond with stereotypically biased confidence estimates even though their underlying answer accuracy remains the same. Based on these observations, we propose Answer-Free Confidence Estimation (AFCE) to improve confidence calibration and LLM interpretability in these settings. AFCE is a self-assessment method that employs two stages of prompting: first eliciting only a confidence score on a question, then asking separately for the answer. Experiments on the MMLU and GPQA datasets, spanning a range of subjects and difficulty levels, show that this separation of tasks significantly reduces overconfidence and yields more human-like sensitivity to task difficulty.
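The two-stage prompting described above can be sketched as follows. This is a minimal illustration, not the authors' released code: `call_llm` is a hypothetical stand-in (here a stub) for any chat-completion client, and the prompt wording is an assumption rather than the exact phrasing used in the paper.

```python
def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call (e.g. an OpenAI or
    # Anthropic client); replies depend only on which stage is asked.
    if "confidence" in prompt.lower():
        return "40"
    return "B"

def afce(question: str, options: list[str]) -> tuple[int, str]:
    """Answer-Free Confidence Estimation: elicit confidence first,
    then request the answer in a separate, independent prompt."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))

    # Stage 1: confidence only. The model never sees (or generates) an
    # answer here, so its self-assessment cannot anchor on one.
    conf_prompt = (
        f"Question:\n{question}\n{opts}\n\n"
        "Without answering, state your confidence (0-100) that you could "
        "answer this question correctly. Reply with a number only."
    )
    confidence = int(call_llm(conf_prompt))

    # Stage 2: the answer itself, asked separately.
    ans_prompt = (
        f"Question:\n{question}\n{opts}\n\n"
        "Reply with the letter of the best option only."
    )
    answer = call_llm(ans_prompt)
    return confidence, answer

conf, ans = afce("Which planet is largest?",
                 ["Mars", "Jupiter", "Venus", "Mercury"])
print(conf, ans)  # prints "40 B" with this stub
```

The key design choice is that the two prompts are independent contexts, so the reported confidence reflects the question's perceived difficulty rather than post-hoc justification of a generated answer.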