Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of benchmarks for evaluating higher-order historical reasoning, such as evidentiary reasoning, in current large language models (LLMs). To bridge this gap, the authors introduce ProHist-Bench, a specialized benchmark grounded in China’s imperial examination (Keju) system, comprising 400 expert-crafted questions spanning eight dynasties and accompanied by 10,891 fine-grained scoring rubrics. The benchmark combines historical scholarship with rubric-based assessment of model answers. An evaluation of 18 LLMs reveals a significant proficiency gap: even state-of-the-art models struggle with complex historical research questions. The authors hope the benchmark will support the development of LLMs with stronger domain-specific reasoning and advance computational historical research.

📝 Abstract

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
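
To make the scale of the rubric set concrete, the sketch below computes the average number of rubrics per question from the two figures in the abstract, and shows one plausible way a rubric-based score could be aggregated. The `RubricItem` fields, the point weights, the example criteria, and the `rubric_score` function are all hypothetical illustrations; the abstract does not describe the benchmark's actual scoring scheme or data format.

```python
from __future__ import annotations

from dataclasses import dataclass

# Figures stated in the abstract.
NUM_QUESTIONS = 400
NUM_RUBRICS = 10_891


@dataclass
class RubricItem:
    """One rubric criterion. Field names are assumptions, not the benchmark's schema."""
    description: str
    points: float
    satisfied: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Return the fraction of available rubric points earned, in [0, 1]."""
    total = sum(item.points for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.points for item in items if item.satisfied)
    return earned / total


# Average rubric density implied by the abstract: 10,891 / 400.
avg_rubrics_per_question = NUM_RUBRICS / NUM_QUESTIONS

# Hypothetical graded answer with three made-up criteria.
example = [
    RubricItem("Identifies the relevant primary source", 2.0, True),
    RubricItem("Weighs conflicting evidence", 3.0, False),
    RubricItem("States a supported conclusion", 1.0, True),
]
example_score = rubric_score(example)  # 3.0 earned out of 6.0 available
```

The average works out to roughly 27 rubrics per question, which suggests each answer is judged against many separate criteria rather than a single correct-or-incorrect label.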

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Historical Reasoning

Evidentiary Reasoning

Benchmarking

Chinese Imperial Examination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Historical Reasoning

Large Language Models

Benchmarking