🤖 AI Summary
Existing temporal reasoning benchmarks rely heavily on rule-based generation and lack historical depth, cultural context, and diversity of temporal entities, so they inadequately evaluate large language models' (LLMs') temporal cognition. Method: We introduce CTM (Chinese Time Reasoning), the first benchmark dedicated to Chinese dynastic chronology. CTM integrates historical semantic modeling, multi-granularity temporal alignment, and dynasty-specific cultural constraints into a dynamic evaluation framework covering cross-entity temporal inference, pairwise temporal alignment, and culturally contextualized reasoning. It leverages authoritative chronologies and historical texts to construct a high-quality annotated dataset, incorporating adversarial question design, temporal-logic validation, and expert-in-the-loop evaluation. Contribution/Results: Experiments reveal critical weaknesses in mainstream LLMs, including poor long-span dynastic reasoning, inaccurate parsing of ambiguous calendrical systems, and failure in event-relative positioning. CTM establishes a reproducible, highly diagnostic, history-domain benchmark for temporal cognitive modeling.
📝 Abstract
Temporal reasoning is fundamental to human cognition and crucial for many real-world applications. While recent advances in Large Language Models (LLMs) have demonstrated promising temporal reasoning capabilities, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized, culturally grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
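To make the pairwise temporal alignment task concrete, here is a minimal illustrative sketch (not code from the paper): given two dynasties with widely accepted reign spans, decide whether one precedes, follows, or overlaps the other. This is the kind of relation CTM asks models to infer in natural language; the dynasty table and function names below are assumptions for illustration only.

```python
# Illustrative only: pairwise temporal alignment over Chinese
# dynastic chronology. Spans use widely accepted dates;
# negative years denote BCE.
DYNASTIES = {
    "Han":  (-202, 220),
    "Tang": (618, 907),
    "Song": (960, 1279),
    "Ming": (1368, 1644),
}

def temporal_relation(a: str, b: str) -> str:
    """Return the temporal relation of dynasty a with respect to dynasty b."""
    a_start, a_end = DYNASTIES[a]
    b_start, b_end = DYNASTIES[b]
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    return "overlaps"

print(temporal_relation("Tang", "Song"))  # -> before
print(temporal_relation("Ming", "Han"))   # -> after
```

CTM's cross-entity questions go further, chaining such comparisons through people and events (e.g., whether two historical figures could have met), which is where the reported long-span reasoning failures appear.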