🤖 AI Summary
This work addresses inherent limitations of large language models (LLMs) in temporal reasoning tasks, particularly event ordering, duration estimation, and cross-temporal relational inference. To this end, we propose TISER, a novel framework featuring: (1) an explicit, structured timeline that anchors the model's reasoning; (2) a multi-turn introspective reasoning mechanism that combines test-time chain-of-thought expansion with prompt-guided temporal structuring; and (3) a test-time scaling strategy that improves generalization robustness. Evaluated across multiple temporal reasoning benchmarks, TISER achieves state-of-the-art performance and significantly surpasses existing methods on out-of-distribution generalization tasks. Notably, using only a 7B open-weight LLM, TISER outperforms proprietary models such as GPT-4, demonstrating the viability of lightweight, efficient temporal modeling.
📝 Abstract
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.
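The multi-stage process described above (timeline construction followed by iterative self-reflection, with test-time scaling via extended reasoning traces) can be sketched as a simple prompting loop. This is a minimal illustration only: the `llm` stub, the prompt wording, and the function names (`tiser_answer`, `max_rounds`) are our assumptions for exposition, not TISER's actual API or prompts.

```python
# Hedged sketch of a TISER-style pipeline: build an explicit timeline,
# reason over it, then iteratively self-reflect on the answer at test time.
# All prompts and names are illustrative assumptions, not the paper's API.

def llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; swap in any chat model."""
    if prompt.startswith("TIMELINE:"):
        return "1. Event A (2001)\n2. Event B (2005)"
    if prompt.startswith("REFLECT:"):
        return "OK"  # stub finds no error, so no revision is needed
    return "Event A happened before Event B."

def tiser_answer(question: str, context: str, max_rounds: int = 2) -> str:
    # Stage 1: extract an explicit, structured timeline to anchor reasoning.
    timeline = llm(f"TIMELINE: extract the events and dates from: {context}")
    # Stage 2: answer the question grounded in that timeline.
    answer = llm(f"ANSWER: using this timeline:\n{timeline}\nQuestion: {question}")
    # Stage 3: iterative self-reflection; each round extends the reasoning
    # trace, which is where test-time scaling enters the picture.
    for _ in range(max_rounds):
        critique = llm(f"REFLECT: check this answer for temporal errors: {answer}")
        if critique.strip() == "OK":
            break
        answer = llm(f"ANSWER: revise given this critique:\n{critique}")
    return answer

print(tiser_answer("Which event came first?", "A occurred in 2001; B in 2005."))
```

With a real model in place of the stub, each reflection round appends to the reasoning trace, so accuracy and traceability can be traded off against inference cost via `max_rounds`.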