Language Models reach higher Agreement than Humans in Historical Interpretation

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates interpretive consistency, cultural bias, and hallucination patterns in historical text annotation by humans versus large language models (LLMs). Using multi-model experiments (GPT, Claude, Llama), we quantify inter-annotator agreement via Cohen’s κ and Krippendorff’s α, and manually annotate bias and hallucination types. Results show LLMs achieve significantly higher group-level consistency than humans in short-text historical fact interpretation (κ = 0.72 vs. 0.48). Our key contributions are threefold: (1) the first empirical demonstration that LLMs can attain superior consensus in historical interpretation tasks; (2) a clear conceptual and empirical distinction between cultural bias and information omission/hallucination as distinct sources of interpretive divergence; and (3) a reproducible, cross-model evaluation framework for historical annotation—providing both methodological grounding and a quantifiable bias analysis toolkit for digital humanities.
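As a rough illustration of the group-level agreement metric the summary reports, here is a minimal Cohen's κ computation for two annotators' label sequences. This is a generic sketch, not the paper's actual pipeline; the labels and the `cohens_kappa` helper are hypothetical.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical example: two annotators labelling six short historical statements.
ann1 = ["bias", "fact", "fact", "bias", "fact", "fact"]
ann2 = ["bias", "fact", "bias", "bias", "fact", "fact"]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.67
```

Krippendorff's α, which the study also uses, generalizes this idea to more than two annotators and to missing data; in practice one would reach for an existing implementation (e.g. scikit-learn's `cohen_kappa_score` or the `krippendorff` package) rather than hand-rolling the metric.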

📝 Abstract
This paper compares historical annotations produced by humans and by Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models reach higher consensus when interpreting historical facts from short texts. While humans tend to disagree on the basis of personal bias, Large Language Models disagree when they omit information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. They also open new educational and research opportunities to compare historical interpretations across Language Models, fostering critical thinking about bias.
Problem

Research questions and friction points this paper is trying to address.

Compare human and LLM historical annotation biases
Assess LLM consensus on historical fact interpretation
Explore bias implications for digital humanities research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models achieve higher historical consensus
Models enable large-scale historical data annotation
Quantitative analysis of cultural bias in interpretations
🔎 Similar Papers
2024-09-23 · Annual Meeting of the Association for Computational Linguistics · Citations: 0
Fabio Celli
Senior Data Scientist, R&D Maggioli.it
personality recognition · personality computing · cliodynamics · impact of AI
Georgios Spathulas
Department of Information Security and Communication Technology, Norwegian University of Science and Technology - NTNU, Gjøvik, Norway