Evaluating List Construction and Temporal Understanding capabilities of Large Language Models

📅 2025-06-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) suffer from hallucination, incomplete answers, and temporal misalignment in multi-entity, multi-temporal reasoning tasks; moreover, no systematic benchmark evaluates their joint capability for implicit/explicit temporal reasoning and structured list-based answer generation. Method: We introduce the Time referenced List based Question Answering (TLQA) benchmark, the first explicitly designed to assess LLMs' ability to simultaneously enumerate multiple entities and perform temporally grounded reasoning within structured list outputs. TLQA features a temporally sensitive QA dataset, supports both closed-book and open-domain settings, and employs a rigorous evaluation protocol. Contribution/Results: Experiments reveal severe deficits in factual completeness and temporal alignment for mainstream models under closed-book conditions; in open-domain settings, retrieval quality emerges as the primary bottleneck. This work fills a critical gap in temporal reasoning evaluation, establishing a principled benchmark and providing concrete directions for advancing temporally aware language modeling.

📝 Abstract
Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors, particularly on temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers, or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of models to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering (TLQA) benchmark, which requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup, and the need to improve retrieval in the open-domain setup, providing clear future directions for research on TLQA. The benchmark and code are available at https://github.com/elixir-research-group/TLQA.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' list construction and temporal understanding capabilities
Assessing models' accuracy in aligning entities with time intervals
Identifying shortcomings in closed-book and open-domain temporal QA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes TLQA benchmark for temporal list QA
Evaluates list construction and temporal alignment
Tests models in closed-book and open-domain settings
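Evaluating list construction and temporal alignment jointly can be pictured as scoring sets of (entity, time-interval) tuples against a gold list. The sketch below is illustrative only, not the paper's actual evaluation protocol; the `list_temporal_f1` helper and the example question data are hypothetical, and exact-match interval scoring is one of several plausible design choices.

```python
# Illustrative scoring for time-referenced list answers (NOT the TLQA paper's
# official metric): each answer is a list of (entity, start_year, end_year)
# tuples, and a predicted tuple counts as correct only if both the entity
# and its full time interval match a gold tuple.

def list_temporal_f1(pred, gold):
    """Set-level F1 over (entity, start, end) tuples; entity names lowercased."""
    norm = lambda items: {(e.strip().lower(), s, t) for e, s, t in items}
    p, g = norm(pred), norm(gold)
    if not p or not g:
        return 0.0
    tp = len(p & g)                      # tuples with entity AND interval correct
    precision, recall = tp / len(p), tp / len(g)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Toy question: "Who coached the Germany national football team since 2006?"
gold = [("Joachim Löw", 2006, 2021), ("Hansi Flick", 2021, 2023)]
pred = [("Joachim Löw", 2006, 2021), ("Hansi Flick", 2019, 2023)]  # wrong interval
print(list_temporal_f1(pred, gold))  # 0.5: one of two tuples fully aligned
```

A softer variant could award partial credit for overlapping intervals; exact match, as here, penalizes the temporal misalignment errors the benchmark is designed to surface.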