AI Summary
Existing time-sensitive question answering (TSQA) benchmarks rely on manual construction or fixed templates, which limits scalability, fine-grained temporal resolution, and cross-temporal automation. To address this, we propose TDBench, the first systematic TSQA evaluation framework grounded in temporal databases. Its core innovation is the automatic generation of multi-hop, domain-adapted time-sensitive question-answer pairs via temporal SQL queries and functional dependencies, eliminating the need for manual annotation. Additionally, we introduce a time-precision evaluation metric that enables fact verification across granularities from milliseconds to years. Experiments demonstrate that TDBench substantially enhances evaluation scale and diversity, delivering comprehensive and reliable assessment of temporal knowledge capabilities across multiple state-of-the-art LLMs. By providing a reproducible, extensible benchmark infrastructure, TDBench advances research in time-sensitive factual reasoning.
Abstract
Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy, enabling a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show that TDBench enables scalable and comprehensive TSQA evaluation while reducing reliance on human labor. It complements existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
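To make the temporal-SQL idea concrete, here is a minimal, hypothetical sketch of how a time-sensitive QA pair can be derived from a valid-time table with an "as-of" query. The table schema, sample data, and question template are illustrative assumptions, not the actual TDBench implementation.

```python
# Hypothetical sketch: derive a time-sensitive QA pair from a valid-time
# table via a temporal ("as of") SQL selection. Schema, data, and the
# question template are illustrative assumptions, not TDBench's own code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE position (
        person     TEXT,
        role       TEXT,
        valid_from TEXT,  -- ISO-8601 start of validity interval
        valid_to   TEXT   -- ISO-8601 end of validity interval
    )
""")
conn.executemany(
    "INSERT INTO position VALUES (?, ?, ?, ?)",
    [
        ("Alice", "CEO", "2015-03-01", "2019-06-30"),
        ("Bob",   "CEO", "2019-07-01", "2023-01-15"),
    ],
)

def make_qa(as_of: str):
    """Build a (question, gold answer) pair valid at timestamp `as_of`."""
    # Temporal selection: the fact whose validity interval contains `as_of`.
    # ISO-8601 dates compare correctly as strings.
    person, role = conn.execute(
        "SELECT person, role FROM position "
        "WHERE valid_from <= ? AND ? <= valid_to",
        (as_of, as_of),
    ).fetchone()
    question = f"Who held the role of {role} on {as_of}?"
    return question, person

q, a = make_qa("2020-05-01")
print(q)  # Who held the role of CEO on 2020-05-01?
print(a)  # Bob
```

Because the gold answer comes directly from the database, shifting `as_of` across intervals yields new QA pairs automatically, with no manual annotation.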