Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

πŸ“… 2025-08-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing time-sensitive question answering (TSQA) benchmarks rely on manual construction or fixed templates, limiting scalability, fine-grained temporal resolution, and cross-temporal automation. To address this, we propose TDBenchβ€”the first systematic TSQA evaluation framework grounded in temporal databases. Its core innovation lies in automatically generating multi-hop, domain-adapted time-sensitive question-answer pairs via Temporal SQL queries and functional dependencies, eliminating the need for manual annotation. Additionally, we introduce a time-precision evaluation metric enabling fact verification across granularities from milliseconds to years. Experiments demonstrate that TDBench substantially enhances evaluation scale and diversity, delivering comprehensive and reliable assessment of temporal knowledge capabilities across multiple state-of-the-art LLMs. By providing a reproducible, extensible benchmark infrastructure, TDBench advances research in time-sensitive factual reasoning.

πŸ“ Abstract
Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
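The time-accuracy metric the abstract describes, which checks whether a model's cited time reference is valid rather than only whether its answer string matches, could be sketched roughly as an interval-containment check at a chosen granularity. The function name, signature, and granularity encoding below are hypothetical illustrations, not the paper's actual implementation:

```python
from datetime import datetime

def time_accurate(pred_start, pred_end, gold_start, gold_end,
                  granularity="%Y-%m-%d"):
    """Hypothetical sketch: a predicted time interval is 'time accurate'
    if, after truncating to the chosen granularity (e.g. '%Y' for years),
    it falls within the gold validity interval from the temporal database."""
    trunc = lambda d: d.strftime(granularity)  # ISO-style strings compare lexically
    return (trunc(gold_start) <= trunc(pred_start)
            and trunc(pred_end) <= trunc(gold_end))

# A model explanation citing 2010 is valid against a 2009-2017 gold interval.
ok = time_accurate(datetime(2010, 1, 1), datetime(2010, 12, 31),
                   datetime(2009, 1, 20), datetime(2017, 1, 20))
```

Varying the `granularity` format string is one way a single check could cover the range of temporal resolutions (years down to finer units) that the framework evaluates.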
Problem

Research questions and friction points this paper is trying to address.

Evaluate time-sensitive factual accuracy in large language models
Overcome limitations of manual or template-based TSQA benchmarks
Develop scalable TSQA evaluation using temporal databases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes temporal databases for TSQA pair construction
Introduces time accuracy metric for evaluation
Reduces human labor with scalable TSQA evaluation
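The construction idea above, deriving question-answer pairs from a temporal relation via time-qualified ("AS OF") queries, can be illustrated with a minimal sketch. The table schema, sample rows, and question template here are invented for illustration and are not the paper's actual data or code; the functional dependency (country, time → name) is what guarantees each generated question has a unique gold answer:

```python
import sqlite3

# Hypothetical temporal relation: each fact carries a validity interval.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE president
                (country TEXT, name TEXT, valid_from TEXT, valid_to TEXT)""")
conn.executemany("INSERT INTO president VALUES (?, ?, ?, ?)", [
    ("USA", "Barack Obama", "2009-01-20", "2017-01-20"),
    ("USA", "Donald Trump", "2017-01-20", "2021-01-20"),
])

def generate_qa(as_of):
    # Time-qualified query: fetch the row whose validity interval
    # contains the given timestamp (half-open interval [from, to)).
    row = conn.execute(
        "SELECT country, name FROM president "
        "WHERE valid_from <= ? AND ? < valid_to",
        (as_of, as_of)).fetchone()
    if row is None:
        return None
    country, name = row
    # Template-free in spirit: the question text comes from the schema,
    # the gold answer from the row -- no manual annotation needed.
    return f"Who was the president of {country} on {as_of}?", name

question, answer = generate_qa("2018-06-01")
```

Joining several such temporal relations with overlapping validity intervals is the natural route to the multi-hop questions the summary mentions, since each join step adds one more time-constrained fact to resolve.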
πŸ”Ž Similar Papers
No similar papers found.