🤖 AI Summary
Existing temporal reasoning benchmarks rely heavily on rule-based generation and lack historical depth, cultural context, and diversity of temporal entities, so they inadequately evaluate large language models' (LLMs') temporal cognition. Method: We introduce CTM (Chinese Time Reasoning), the first benchmark dedicated to Chinese dynastic chronology. CTM integrates historical semantic modeling, multi-granularity temporal alignment, and dynasty-specific cultural constraints into a dynamic evaluation framework covering cross-entity temporal inference, pairwise temporal alignment, and culturally contextualized reasoning. It leverages authoritative chronologies and historical texts to construct a high-quality annotated dataset, incorporating adversarial question design, temporal-logic validation, and expert-in-the-loop evaluation. Contribution/Results: Experiments reveal critical weaknesses in mainstream LLMs, including poor long-span dynastic reasoning, inaccurate parsing of ambiguous calendrical systems, and failure in event-relative positioning. CTM establishes a reproducible, highly diagnostic, history-domain benchmark for temporal cognitive modeling.
📝 Abstract
Temporal reasoning is fundamental to human cognition and crucial for many real-world applications. While recent advances in Large Language Models (LLMs) have demonstrated promising temporal reasoning capabilities, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized, culturally grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
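To make the pairwise temporal alignment task concrete, here is a minimal illustrative sketch (not code from the paper): given two dynasties with widely accepted reign spans, decide whether one precedes, follows, or overlaps the other. This is the kind of relation CTM asks models to infer in natural language; the dynasty table and function names below are assumptions for illustration only.

```python
# Illustrative only: pairwise temporal alignment over Chinese
# dynastic chronology. Spans use widely accepted dates;
# negative years denote BCE.
DYNASTIES = {
    "Han":  (-202, 220),
    "Tang": (618, 907),
    "Song": (960, 1279),
    "Ming": (1368, 1644),
}

def temporal_relation(a: str, b: str) -> str:
    """Return the temporal relation of dynasty a with respect to dynasty b."""
    a_start, a_end = DYNASTIES[a]
    b_start, b_end = DYNASTIES[b]
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    return "overlaps"

print(temporal_relation("Tang", "Song"))  # -> before
print(temporal_relation("Ming", "Han"))   # -> after
```

CTM's cross-entity questions go further, chaining such comparisons through people and events (e.g., whether two historical figures could have met), which is where the reported long-span reasoning failures appear.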