Do Large Language Models Judge Error Severity Like Humans?

📅 2025-06-05

🤖 AI Summary
This study investigates how closely large language models (LLMs) align with human judgment when assessing the severity of semantic errors in image captions, specifically errors in age, gender, clothing type, and clothing color. Extending the experimental framework of van Miltenburg et al. (2020), it injects controlled errors into image descriptions and compares human and LLM ratings in unimodal (text-only) and multimodal (text + image) settings. The key findings are threefold: (1) Most LLMs assign low scores to gender errors but disproportionately high scores to color errors, whereas humans judge both as highly severe; this suggests the models have internalized social norms around gender but lack the perceptual grounding behind human sensitivity to color. (2) Visual context significantly amplifies the severity humans perceive in color and clothing-type errors. (3) Doubao is the only evaluated LLM that reproduces the human ranking of error severity, though it separates error types less clearly than humans do, while the unimodal DeepSeek-V3 achieves the highest alignment with human judgments in both unimodal and multimodal conditions, outperforming state-of-the-art multimodal models.

📝 Abstract
Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.
Problem

Research questions and friction points this paper is trying to address.

Assessing if LLMs replicate human error severity judgments
Comparing human and LLM evaluations of semantic errors
Exploring LLM biases in gender versus color error severity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically compare human and LLM assessments
Extend framework to unimodal and multimodal settings
Evaluate four error types with controlled errors
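
The controlled-error setup and the human-versus-LLM comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word pairs in `ERROR_SWAPS`, the function names, and the choice of Spearman rank correlation as the alignment measure are all assumptions made for this example.

```python
import re

# Hypothetical substitution table: one controlled swap per error type.
# These word pairs are illustrative only and are not taken from the paper.
ERROR_SWAPS = {
    "age": ("girl", "woman"),
    "gender": ("girl", "boy"),
    "clothing_type": ("dress", "coat"),
    "clothing_colour": ("red", "green"),
}


def inject_error(caption: str, error_type: str) -> str:
    """Return a copy of the caption with exactly one word swapped."""
    original, replacement = ERROR_SWAPS[error_type]
    pattern = r"\b" + re.escape(original) + r"\b"
    corrupted, count = re.subn(pattern, replacement, caption, count=1)
    if count == 0:
        raise ValueError(f"{original!r} not found in caption")
    return corrupted


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    caption = "A girl in a red dress"
    for error_type in ERROR_SWAPS:
        print(error_type, "->", inject_error(caption, error_type))
```

To compare a model with human raters under this sketch, one would collect a mean rating per error type from each group and pass the two lists, in the same error-type order, to `spearman`; a value near 1 would indicate that the model ranks the error types the way humans do.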
Diege Sun
School of Psychology
Guanyi Chen
Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, National Language Resources Monitor and Research Center for Network Media, School of Computer Science, Central China Normal University
Fan Zhao
Cue Biopharma
T Cell Recognition, Antigen Processing and Presentation, Immunotherapy, Chemistry & Biochemistry
Xiaorong Cheng
School of Psychology
Tingting He
Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, National Language Resources Monitor and Research Center for Network Media, School of Computer Science, Central China Normal University