TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing temporal question answering benchmarks are largely confined to simple factual queries and struggle to support the integration of complex temporal reasoning with multi-hop retrieval. To address this gap, this work proposes TEMPO, the first multi-domain benchmark that combines deep temporal reasoning—such as trend identification and cross-period comparison—with multi-hop retrieval. TEMPO encompasses 13 domains and 1,730 queries, accompanied by step-by-step retrieval plans and gold documents. The study introduces novel evaluation metrics, including Temporal Coverage@k and Temporal Precision@k, enabling fine-grained assessment of retrieval systems’ temporal reasoning capabilities. Experiments on 12 state-of-the-art systems reveal that even the best-performing model, DiVeR, achieves only 32.0 NDCG@10 and 71.4% Temporal Coverage@10, highlighting the significant challenge of retrieving temporally complete evidence.

📝 Abstract
Existing temporal QA benchmarks focus on simple fact-seeking queries from news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. However, real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. TEMPO features: (1) 1,730 complex queries requiring deep temporal reasoning such as tracking changes, identifying trends, or comparing cross-period evidence; (2) step-wise retrieval planning with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics including Temporal Coverage@k and Temporal Precision@k measuring whether results span required time periods. Evaluation of 12 retrieval systems reveals substantial challenges: the best model (DiVeR) achieves only 32.0 NDCG@10 and 71.4% Temporal Coverage@10, demonstrating difficulty in retrieving temporally complete evidence. We believe TEMPO provides a challenging benchmark for improving temporal reasoning in retrieval and RAG systems. Our code and data are available at https://github.com/tempo-bench/Tempo. See also our official website: https://tempo-bench.github.io/.
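The abstract describes Temporal Coverage@k as measuring whether the top-k results span the time periods a query requires, and Temporal Precision@k as its precision-oriented counterpart. The paper's exact formulas are not reproduced on this page, so the following is a minimal sketch of one plausible reading, assuming each document is annotated with the set of time periods it covers; the function names and signatures are illustrative, not the benchmark's official implementation:

```python
from typing import Dict, List, Set


def temporal_coverage_at_k(
    retrieved: List[str],
    doc_periods: Dict[str, Set[str]],  # doc id -> time periods it covers
    required_periods: Set[str],        # periods the query needs evidence from
    k: int = 10,
) -> float:
    """Fraction of required time periods covered by the top-k results."""
    covered: Set[str] = set()
    for doc_id in retrieved[:k]:
        covered |= doc_periods.get(doc_id, set())
    return len(covered & required_periods) / len(required_periods)


def temporal_precision_at_k(
    retrieved: List[str],
    doc_periods: Dict[str, Set[str]],
    required_periods: Set[str],
    k: int = 10,
) -> float:
    """Fraction of top-k results that touch at least one required period."""
    top_k = retrieved[:k]
    hits = sum(1 for d in top_k if doc_periods.get(d, set()) & required_periods)
    return hits / len(top_k) if top_k else 0.0
```

Under this reading, a system can score high precision (every retrieved document is temporally relevant) while still missing coverage (whole required periods absent from the top-k), which matches the gap the abstract reports between NDCG@10 and Temporal Coverage@10.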
Problem

Research questions and friction points this paper is trying to address.

temporal reasoning
reasoning-intensive retrieval
multi-domain benchmark
temporal evolution
evidence synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal reasoning
multi-hop retrieval
temporal benchmark
retrieval evaluation metrics
reasoning-intensive QA