LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps

📅 2026-04-29

🤖 AI Summary
This study addresses the common practice of collapsing human emotion annotations into a single gold label when evaluating large language models (LLMs), which discards inter-annotator disagreement and the distributional information it encodes. The authors compare four zero-shot LLMs and a fine-tuned RoBERTa baseline against human emotion judgment distributions on GoEmotions and EmoBank, using 640,000 LLM responses in total. They introduce a transparency score that predicts per-category human-LLM agreement and find that models reliably capture emotions with explicit lexical markers but systematically fail on those requiring contextual inference. Three lightweight post-hoc calibration methods reduce the human-LLM distributional gap by up to 14%. Model scale alone does not close this gap, whereas in-domain fine-tuning does, and the paper offers guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
📝 Abstract
Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human-LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
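To make the idea of a "distributional gap" concrete, here is a minimal Python sketch of how human and LLM label distributions for a single text might be compared and then adjusted after the fact. Everything in it is an illustrative assumption: the abstract does not name the divergence measure or the three calibration methods, so Jensen-Shannon divergence and temperature scaling stand in here as common choices, and the label set, votes, and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical label set; the real benchmarks use their own emotion inventories.
LABELS = ["joy", "anger", "sadness", "neutral"]

def to_distribution(votes, labels=LABELS):
    """Turn a list of label votes into a probability distribution over labels."""
    counts = Counter(votes)
    total = sum(counts[label] for label in labels)
    return [counts[label] / total for label in labels]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def temperature_scale(p, temperature):
    """Flatten (temperature > 1) or sharpen (temperature < 1) a distribution.

    Assumed stand-in for a post-hoc calibration step, not the paper's method.
    """
    powered = [a ** (1.0 / temperature) for a in p]
    z = sum(powered)
    return [a / z for a in powered]

# Hypothetical votes: five human annotators vs. ten sampled LLM responses.
human_votes = ["joy", "joy", "neutral", "sadness", "joy"]
llm_votes = ["joy"] * 9 + ["neutral"]

human = to_distribution(human_votes)
llm = to_distribution(llm_votes)

gap_before = js_divergence(human, llm)
gap_after = js_divergence(human, temperature_scale(llm, 2.0))
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

In this toy case the LLM is overconfident in the majority label, so flattening its distribution moves it closer to the human one. In practice the temperature would be fit on held-out annotated data and not fixed by hand, and a label the LLM never produces (here "sadness") stays at zero under this kind of rescaling.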
Problem

Research questions and friction points this paper is trying to address.

emotion annotation
label uncertainty
human disagreement
distributional judgment
LLM calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion uncertainty
distributional calibration
lexical grounding
human-LLM judgment gap
post-hoc calibration