Language Models reach higher Agreement than Humans in Historical Interpretation

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates interpretive consistency, cultural bias, and hallucination patterns in historical text annotation by humans versus large language models (LLMs). Using multi-model experiments (GPT, Claude, Llama), we quantify inter-annotator agreement via Cohen’s κ and Krippendorff’s α, and manually annotate bias and hallucination types. Results show LLMs achieve significantly higher group-level consistency than humans in short-text historical fact interpretation (κ = 0.72 vs. 0.48). Our key contributions are threefold: (1) the first empirical demonstration that LLMs can attain superior consensus in historical interpretation tasks; (2) a clear conceptual and empirical distinction between cultural bias and information omission/hallucination as distinct sources of interpretive divergence; and (3) a reproducible, cross-model evaluation framework for historical annotation—providing both methodological grounding and a quantifiable bias analysis toolkit for digital humanities.
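As a rough illustration of the group-level agreement metric the summary reports, here is a minimal Cohen's κ computation for two annotators' label sequences. This is a generic sketch, not the paper's actual pipeline; the labels and the `cohens_kappa` helper are hypothetical.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical example: two annotators labelling six short historical statements.
ann1 = ["bias", "fact", "fact", "bias", "fact", "fact"]
ann2 = ["bias", "fact", "bias", "bias", "fact", "fact"]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.67
```

Krippendorff's α, which the study also uses, generalizes this idea to more than two annotators and to missing data; in practice one would reach for an existing implementation (e.g. scikit-learn's `cohen_kappa_score` or the `krippendorff` package) rather than hand-rolling the metric.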

📝 Abstract
This paper compares historical annotations produced by humans and by Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models reach higher consensus when interpreting historical facts from short texts. While humans tend to disagree on the basis of personal bias, Large Language Models disagree when they omit information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. They also open new educational and research opportunities to compare historical interpretations across Language Models, fostering critical thinking about bias.
Problem

Research questions and friction points this paper is trying to address.

Compare human and LLM historical annotation biases
Assess LLM consensus on historical fact interpretation
Explore bias implications for digital humanities research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models achieve higher historical consensus
Models enable large-scale historical data annotation
Quantitative analysis of cultural bias in interpretations
🔎 Similar Papers
2024-09-23 · Annual Meeting of the Association for Computational Linguistics · Citations: 0
Fabio Celli
Senior Data Scientist, R&D Maggioli.it
personality recognition · personality computing · cliodynamics · impact of AI
Georgios Spathulas
Department of Information Security and Communication Technology, Norwegian University of Science and Technology - NTNU, Gjøvik, Norway