Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

📅 2026-05-12

🤖 AI Summary

This study addresses the pervasive miscalibration of large language models (LLMs) in social science measurement, where predicted confidence levels often fail to reflect true accuracy, thereby undermining measurement validity. The work presents the first systematic evaluation of calibration bias in LLMs for such tasks and introduces a novel soft-label distillation approach that incorporates linguistic expressions of confidence: LLM outputs and their self-reported confidence are transformed into soft target distributions to train a lightweight discriminative encoder. Experiments across 14 social constructs demonstrate that this method reduces expected calibration error (ECE) by 43.2% and Brier score by 34.0% on average, substantially improving the alignment between confidence and correctness. These findings advocate for integrating calibration as a core component of social science measurement pipelines.
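For readers less familiar with the two metrics reported here, the following is a minimal sketch of how expected calibration error (ECE) and the Brier score are commonly computed from per-item confidences and 0/1 correctness labels. The function names, the equal-width binning, and the choice of 10 bins are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 correctness outcome."""
    n = len(confidences)
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / n


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted average of |mean confidence - accuracy| per bin.

    The binning scheme is an assumption for illustration, not the paper's stated setup.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example: the model always reports 0.9 but is right only half the time.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [1, 1, 0, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_brier = brier_score(toy_conf, toy_correct)
```

In the toy example the model is overconfident: it claims 90% confidence while being correct 50% of the time, which is the kind of gap both metrics penalize.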

📝 Abstract

Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it also requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs, covering both proprietary and open-source models, including GPT-5-mini and DeepSeek-V3.2. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft-label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative classifier, built on encoder models, on these targets. Averaged across datasets, this approach reduces ECE by 43.2% and Brier score by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
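The soft-target step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's actual mapping: it assumes the LLM score is an integer level on a discrete scale, places the verbalized confidence on that level, and spreads the remaining mass uniformly over the other levels. The names `soft_target` and `soft_cross_entropy` are hypothetical.

```python
import math
from typing import List, Sequence


def soft_target(score: int, confidence: float, num_levels: int) -> List[float]:
    """Build a soft label from an LLM score and its verbalized confidence.

    Assumption: `confidence` mass goes on the LLM's chosen level and the rest
    is split uniformly; the paper may use a different smoothing scheme.
    """
    if num_levels < 2:
        raise ValueError("need at least two levels")
    if not 0 <= score < num_levels:
        raise ValueError("score must be a valid level index")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    rest = (1.0 - confidence) / (num_levels - 1)
    return [confidence if k == score else rest for k in range(num_levels)]


def soft_cross_entropy(logits: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy between a soft target and softmax(logits).

    This is the kind of loss a student encoder classifier could minimize.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target, logits))


# Hypothetical item: the LLM assigns level 2 on a 5-point scale with 70% confidence.
target = soft_target(score=2, confidence=0.7, num_levels=5)
# Loss for an untrained student that outputs identical logits for every level.
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], target)
```

In a real pipeline the logits would come from the encoder-based student classifier, and the loss would be averaged over a batch and minimized with a standard optimizer.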

Problem

Research questions and friction points this paper is trying to address.

miscalibration

LLM-based measurement

social science

confidence calibration

measurement validity

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM calibration

soft label distillation

social science measurement