Can Large Language Models Capture Human Annotator Disagreements?

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can effectively model human annotation disagreement—a critical signal of task subjectivity and instance ambiguity. Current evaluation paradigms predominantly assess accuracy against majority-voted labels, neglecting models' capacity to capture annotation uncertainty. To address this gap, we propose the first systematic evaluation framework for disagreement prediction grounded in single-annotator labels, and use it to quantify LLMs' fidelity to empirical annotation distributions, both with and without RLVR-style reasoning. Our experiments reveal three key findings: (1) mainstream LLMs exhibit poor calibration in predicting human disagreement; (2) majority-label accuracy substantially obscures this limitation; and (3) incorporating reinforcement learning–based reasoning degrades disagreement prediction performance, exposing a misalignment between standard optimization objectives and uncertainty modeling. We publicly release our code and datasets to advance more holistic, human-centered LLM evaluation.
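To make the evaluation idea concrete, a minimal sketch of disagreement measurement follows. This is an illustrative example, not the paper's actual metric or code: the function names, the toxicity labels, and the choice of entropy and total-variation distance are all assumptions. It shows how an empirical label distribution from several annotators can be compared against a model's predicted distribution, where an overconfident prediction reveals poor calibration even when the majority label matches.

```python
from collections import Counter
import math

def label_distribution(labels, classes):
    """Empirical label distribution from per-annotator labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in classes]

def entropy(p):
    """Shannon entropy in bits: higher means more annotator disagreement."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def total_variation(p, q):
    """Total variation distance between two distributions over the same classes."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Five annotators label one instance of a binary task (hypothetical data).
human_labels = ["toxic", "toxic", "toxic", "not_toxic", "not_toxic"]
classes = ["toxic", "not_toxic"]

empirical = label_distribution(human_labels, classes)  # [0.6, 0.4]
model_pred = [0.95, 0.05]  # an overconfident model distribution

print(entropy(empirical))                       # ≈ 0.971 bits of disagreement
print(total_variation(empirical, model_pred))   # ≈ 0.35 calibration gap
```

Note that the model's argmax label ("toxic") agrees with the majority vote, so majority-label accuracy would score this prediction as perfect while the distributional gap of 0.35 goes unnoticed — exactly the blind spot the paper argues against.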

📝 Abstract
Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation often focuses on predicting the majority-voted "ground truth" labels. It is still unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs' ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle with modeling disagreements, which can be overlooked by majority label-based evaluations. Notably, while RLVR-style (Reinforcement Learning with Verifiable Rewards) reasoning generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need for evaluating and improving LLM annotators in disagreement modeling. Code and data at https://github.com/EdisonNi-hku/Disagreement_Prediction.
Problem

Research questions and friction points this paper is trying to address.

Can LLMs accurately predict human annotation disagreements?
Do LLMs capture task subjectivity and sample ambiguity?
Does RLVR-style reasoning degrade disagreement prediction performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A systematic framework for evaluating LLMs' ability to predict annotation disagreements from single-annotator labels
Evidence that RLVR-style reasoning degrades disagreement prediction performance
Highlighting the need to evaluate and improve LLM annotators in disagreement modeling