Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

πŸ“… 2025-08-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing time-sensitive question answering (TSQA) benchmarks rely on manual construction or fixed templates, limiting scalability, fine-grained temporal resolution, and cross-temporal automation. To address this, we propose TDBenchβ€”the first systematic TSQA evaluation framework grounded in temporal databases. Its core innovation lies in automatically generating multi-hop, domain-adapted time-sensitive question-answer pairs via Temporal SQL queries and functional dependencies, eliminating the need for manual annotation. Additionally, we introduce a time-precision evaluation metric enabling fact verification across granularities from milliseconds to years. Experiments demonstrate that TDBench substantially enhances evaluation scale and diversity, delivering comprehensive and reliable assessment of temporal knowledge capabilities across multiple state-of-the-art LLMs. By providing a reproducible, extensible benchmark infrastructure, TDBench advances research in time-sensitive factual reasoning.

πŸ“ Abstract
Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
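The time-accuracy metric the abstract describes, which checks whether a model's cited time reference is valid rather than only whether its answer string matches, could be sketched roughly as an interval-containment check at a chosen granularity. The function name, signature, and granularity encoding below are hypothetical illustrations, not the paper's actual implementation:

```python
from datetime import datetime

def time_accurate(pred_start, pred_end, gold_start, gold_end,
                  granularity="%Y-%m-%d"):
    """Hypothetical sketch: a predicted time interval is 'time accurate'
    if, after truncating to the chosen granularity (e.g. '%Y' for years),
    it falls within the gold validity interval from the temporal database."""
    trunc = lambda d: d.strftime(granularity)  # ISO-style strings compare lexically
    return (trunc(gold_start) <= trunc(pred_start)
            and trunc(pred_end) <= trunc(gold_end))

# A model explanation citing 2010 is valid against a 2009-2017 gold interval.
ok = time_accurate(datetime(2010, 1, 1), datetime(2010, 12, 31),
                   datetime(2009, 1, 20), datetime(2017, 1, 20))
```

Varying the `granularity` format string is one way a single check could cover the range of temporal resolutions (years down to finer units) that the framework evaluates.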
Problem

Research questions and friction points this paper is trying to address.

Evaluate time-sensitive factual accuracy in large language models
Overcome limitations of manual or template-based TSQA benchmarks
Develop scalable TSQA evaluation using temporal databases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes temporal databases for TSQA pair construction
Introduces time accuracy metric for evaluation
Reduces human labor with scalable TSQA evaluation
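The construction idea above, deriving question-answer pairs from a temporal relation via time-qualified ("AS OF") queries, can be illustrated with a minimal sketch. The table schema, sample rows, and question template here are invented for illustration and are not the paper's actual data or code; the functional dependency (country, time → name) is what guarantees each generated question has a unique gold answer:

```python
import sqlite3

# Hypothetical temporal relation: each fact carries a validity interval.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE president
                (country TEXT, name TEXT, valid_from TEXT, valid_to TEXT)""")
conn.executemany("INSERT INTO president VALUES (?, ?, ?, ?)", [
    ("USA", "Barack Obama", "2009-01-20", "2017-01-20"),
    ("USA", "Donald Trump", "2017-01-20", "2021-01-20"),
])

def generate_qa(as_of):
    # Time-qualified query: fetch the row whose validity interval
    # contains the given timestamp (half-open interval [from, to)).
    row = conn.execute(
        "SELECT country, name FROM president "
        "WHERE valid_from <= ? AND ? < valid_to",
        (as_of, as_of)).fetchone()
    if row is None:
        return None
    country, name = row
    # Template-free in spirit: the question text comes from the schema,
    # the gold answer from the row -- no manual annotation needed.
    return f"Who was the president of {country} on {as_of}?", name

question, answer = generate_qa("2018-06-01")
```

Joining several such temporal relations with overlapping validity intervals is the natural route to the multi-hop questions the summary mentions, since each join step adds one more time-constrained fact to resolve.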
πŸ”Ž Similar Papers
No similar papers found.