A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation

📅 2025-08-17

🤖 AI Summary

Problem: Existing retrieval-augmented generation (RAG) systems lack standardized benchmarks for evaluating temporal reasoning, particularly in Chinese. Method: This paper introduces ChronoQA, the first high-quality, multi-scenario question-answering benchmark tailored to temporally sensitive Chinese RAG. Built from over 300,000 news articles published between 2019 and 2024, ChronoQA uses a hybrid construction pipeline that combines rule-based filtering, large language model generation, and multi-stage human verification. It comprises 5,176 questions covering absolute, relative, and aggregate temporal types, and supports both single- and multi-document evaluation of temporal alignment and logical consistency. Each question is annotated with explicit or implicit temporal semantics and a structured reasoning chain. Contribution/Results: ChronoQA provides a scalable, dynamically updatable evaluation standard for temporally aware RAG, addressing the absence of dedicated Chinese benchmarks for temporal reasoning.

📝 Abstract

We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.
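The abstract describes questions labelled by temporal type (absolute, aggregate, relative), by explicit or implicit time expression, and by single- or multi-document scenario. A minimal sketch of how such annotations could drive a per-category evaluation is shown below. The record layout, the field names, the toy English items, and the exact-match metric are all assumptions made for illustration; they are not taken from the released dataset or the paper's evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout: field names are assumptions, not the released schema.
@dataclass
class ChronoItem:
    question: str
    answer: str
    temporal_type: str    # "absolute", "relative", or "aggregate"
    time_expression: str  # "explicit" or "implicit"
    num_documents: int    # 1 = single-document, >1 = multi-document


def accuracy_by_category(items, predictions):
    """Exact-match accuracy grouped by (temporal_type, time_expression, scenario).

    Exact match is an assumed metric chosen for simplicity.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item, pred in zip(items, predictions):
        scenario = "single" if item.num_documents == 1 else "multi"
        key = (item.temporal_type, item.time_expression, scenario)
        totals[key] += 1
        hits[key] += int(pred.strip() == item.answer.strip())
    return {key: hits[key] / totals[key] for key in totals}


# Toy items invented for this sketch (the real questions are in Chinese).
items = [
    ChronoItem("In which year did event X take place?", "2021",
               "absolute", "explicit", 1),
    ChronoItem("What happened the year after event X?", "event Y",
               "relative", "implicit", 2),
    ChronoItem("How many launches occurred between 2020 and 2022?", "3",
               "aggregate", "explicit", 2),
]
predictions = ["2021", "event Z", "3"]

scores = accuracy_by_category(items, predictions)
for key, value in sorted(scores.items()):
    print(key, value)
```

Grouping scores by annotation rather than reporting a single number makes it visible whether a RAG system fails mainly on, say, implicit relative questions that span several documents.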

Problem

Research questions and friction points this paper is trying to address.

Evaluating temporal reasoning in Chinese QA systems

Assessing time-sensitive retrieval-augmented generation tasks

Validating temporal alignment and logical consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Chinese QA dataset for temporal reasoning

Multi-stage validation ensuring high data quality

Supports single- and multi-document temporal scenarios

🔎 Similar Papers

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time