On Path to Multimodal Historical Reasoning: HistBench and HistAgent

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Historical reasoning poses unique challenges for large language models (LLMs), including multimodal source interpretation, temporal inference, and cross-lingual analysis, capabilities that general-purpose agents address inadequately. To bridge this gap, we introduce HistBench, a new multimodal benchmark tailored to history, comprising 414 questions spanning 29 languages, multiple historical periods, and diverse geographical regions; it systematically defines historical reasoning as a multi-dimensional competency. We further propose HistAgent, a domain-specific agent integrating OCR, cross-lingual translation, visual understanding, and structured archival retrieval tools. Built upon GPT-4o and a customized tool-calling framework, HistAgent achieves 27.54% pass@1 on HistBench, outperforming GPT-4o (18.60%) and DeepSeek-R1 (14.49%). The results indicate that domain-specialized agents improve primary-source comprehension, temporal reasoning, and cross-lingual historical analysis.

📝 Abstract
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions, authored by more than 40 expert contributors and designed to evaluate AI's capacity for historical reasoning. The tasks span a wide range of historical problems, from factual retrieval based on primary sources, to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having found that LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
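The pass@1 and pass@2 figures in the abstract can be read as the share of questions answered correctly within the first one or two attempts. The sketch below illustrates that reading only; it assumes pass@k means "at least one correct answer among the first k attempts per question", which the abstract does not spell out. The function name `pass_at_k` and the toy data are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the first k attempts.

    Assumption (not confirmed by the source): this is the pass@k definition in use.
    `attempts[i][j]` is True when attempt j on question i was judged correct.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("at least one question is required")
    solved = sum(1 for per_question in attempts if any(per_question[:k]))
    return solved / len(attempts)


# Toy data for four hypothetical questions with two attempts each.
toy_results = [
    [True, False],   # solved on the first attempt
    [False, True],   # solved only on the second attempt
    [False, False],  # never solved
    [True, True],    # solved on both attempts
]

pass_at_1 = pass_at_k(toy_results, 1)
pass_at_2 = pass_at_k(toy_results, 2)
print(f"pass@1 = {pass_at_1:.2%}, pass@2 = {pass_at_2:.2%}")
```

Under this reading, pass@2 can never be lower than pass@1 for the same system, which matches the ordering of the reported figures.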
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI's historical reasoning with multimodal challenges
Addressing lack of domain expertise in historical analysis
Improving AI performance on diverse historical tasks globally
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces HistBench for historical reasoning evaluation
Develops HistAgent with specialized history tools
Combines OCR, translation, and image understanding
Jiahao Qiu
Princeton University
LLM, AI Agents, AI for X
Fulian Xiao
Department of History, Fudan University
Yimin Wang
University of Michigan
Yuchen Mao
Zhejiang University
Theoretical Computer Science
Yijia Chen
Shanghai Jiao Tong University
Xinzhe Juan
University of Michigan
AI Agent, AI4Science
Siran Wang
Department of History, Fudan University
Xuan Qi
Undergraduate, Tsinghua University
Natural language processing, Multi-modal language model
Tongcheng Zhang
Shanghai Jiao Tong University
Zixin Yao
Department of Philosophy, Columbia University
Jiacheng Guo
AI Lab, Princeton University
Yifu Lu
Undergraduate, University of Michigan
Computer Science
Charles Argon
Department of History, Princeton University
Jundi Cui
Department of History, Fudan University
Daixin Chen
School of Philosophy, Fudan University
Junran Zhou
Department of History, Fudan University
Shuyao Zhou
Princeton University
Human-Computer Interaction
Zhanpeng Zhou
Shanghai Jiao Tong University
Deep Learning Theory
Ling Yang
Postdoc@Princeton University, PhD@Peking University
LLM, Diffusion Models, Reinforcement Learning, Complex Data Modeling
Shilong Liu
RS@ByteDance, PhD@THU
Computer Vision, Object Detection, Visual Grounding, Multi-Modality, Multimodal Agent
Hongru Wang
The Chinese University of Hong Kong
Kaixuan Huang
AI Lab, Princeton University
Xun Jiang
Tianqiao and Chrissy Chen Institute, Theta Health Inc.
Xi Gao
Department of History, Fudan University
Mengdi Wang
AI Lab, Princeton University