A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation

📅 2025-08-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: Existing RAG systems lack standardized benchmarks for evaluating temporal reasoning, particularly in Chinese. Method: This paper introduces ChronoQA, the first high-quality, multi-scenario question-answering benchmark tailored for temporally sensitive Chinese RAG. Built from over 300,000 news articles published between 2019 and 2024, ChronoQA uses a hybrid construction pipeline that combines rule-based filtering, large language model generation, and multi-stage human verification. It comprises 5,176 questions covering absolute, relative, and aggregate temporal expressions, and supports evaluation of temporal alignment and logical consistency in both single- and multi-document settings. Each question is annotated with explicit or implicit temporal semantics and a structured reasoning chain. Contribution/Results: ChronoQA provides a scalable, dynamically updatable, and semantically precise evaluation standard for temporally aware RAG, filling the gap left by the absence of dedicated Chinese benchmarks for temporal reasoning.

📝 Abstract
We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.
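
The annotation dimensions described above (temporal type, explicit versus implicit time expression, single- versus multi-document scope) lend themselves to per-category scoring. The following is a minimal sketch of that idea, not the dataset's actual schema or the authors' evaluation protocol: the `TemporalQA` class, its field names, the exact-match metric, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TemporalQA:
    """Hypothetical record layout; field names are illustrative assumptions."""
    question: str
    answer: str
    temporal_type: str    # "absolute", "aggregate", or "relative"
    time_expression: str  # "explicit" or "implicit"
    doc_scope: str        # "single" or "multi"


def accuracy_by(records, predictions, key):
    """Exact-match accuracy grouped by one annotation field (assumed metric)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec, pred in zip(records, predictions):
        group = getattr(rec, key)
        totals[group] += 1
        hits[group] += int(pred.strip() == rec.answer.strip())
    return {group: hits[group] / totals[group] for group in totals}


# Toy examples invented for illustration; they are not drawn from ChronoQA.
records = [
    TemporalQA("Which city hosted the event in 2021?", "Tokyo",
               "absolute", "explicit", "single"),
    TemporalQA("Who led the team the year before the merger?", "Li",
               "relative", "implicit", "multi"),
    TemporalQA("How many launches occurred from 2019 to 2021?", "3",
               "aggregate", "explicit", "multi"),
    TemporalQA("Which company released the product in 2020?", "Acme",
               "absolute", "explicit", "single"),
]
predictions = ["Tokyo", "Wang", "3", "Apex"]

print(accuracy_by(records, predictions, "temporal_type"))
```

Passing a different field name, such as `"doc_scope"` or `"time_expression"`, groups the same predictions along another annotation axis, which is one way a breakdown by temporal task could be reported.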

Problem

Research questions and friction points this paper is trying to address.

Evaluating temporal reasoning in Chinese QA systems
Assessing time-sensitive retrieval-augmented generation tasks
Validating temporal alignment and logical consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Chinese QA dataset for temporal reasoning
Multi-stage validation ensuring high data quality
Supports single- and multi-document temporal scenarios

Authors

Ziyang Chen
Peking University
Quantum key distribution, Quantum random number generation

Erxue Min
University of Manchester, Baidu Inc.
Information Retrieval, Large Language Model

Xiang Zhao
Laboratory for Big Data and Decision, National University of Defense Technology, Changsha, China

Yunxin Li
Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China

Xin Jia
Baidu Inc., Beijing, China

Jinzhi Liao
Laboratory for Big Data and Decision, National University of Defense Technology, Changsha, China

Jichao Li
Laboratory for Big Data and Decision, National University of Defense Technology, Changsha, China

Shuaiqiang Wang
Principal Architect of Search Strategy, Baidu Inc.
Large language models, Information retrieval

Baotian Hu
Harbin Institute of Technology (Shenzhen)
LLM, MLLM, NLP

Dawei Yin
Senior Director, Head of Search Science at Baidu
Machine Learning, Web Mining, Data Mining