🤖 AI Summary
This work investigates the temporal reasoning capabilities of large language models (LLMs) for time-sensitive event forecasting, particularly under joint graph-text inputs, and examines inherent challenges such as popularity bias and the long-tail distribution. To address the lack of a domain-specific benchmark, the authors construct MidEast-TE-mini, a joint graph-text benchmark for geopolitical event forecasting in the Middle East. They propose a multi-format input framework and a retrieval-augmented generation (RAG) baseline that explicitly models temporal dependencies among historical events. Experiments yield three key findings: (1) directly feeding raw event texts to LLMs does not improve zero-shot extrapolation, whereas fine-tuning on those texts does; (2) retrieval augmentation helps LLMs capture sequential dependencies hidden in historical events; and (3) popularity bias and the long-tail problem substantially degrade predictive performance, especially for the RAG method. The study delivers a reproducible forecasting benchmark, a systematic analytical framework, and concrete directions for improvement, laying a foundation for research on LLM-driven dynamic event reasoning.
📝 Abstract
Recently, Large Language Models (LLMs) have demonstrated great potential in various data mining tasks, such as knowledge question answering, mathematical reasoning, and commonsense reasoning. However, the reasoning capability of LLMs on temporal event forecasting has been under-explored. To systematically investigate this ability, we conduct a comprehensive evaluation of LLM-based methods for temporal event forecasting. Due to the lack of a high-quality dataset that involves both graph and textual data, we first construct a benchmark dataset named MidEast-TE-mini. Based on this dataset, we design a series of baseline methods, characterized by various input formats and retrieval-augmented generation (RAG) modules. From extensive experiments, we find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, fine-tuning LLMs with raw texts can significantly improve performance. Additionally, LLMs enhanced with retrieval modules can effectively capture temporal relational patterns hidden in historical events. However, issues such as popularity bias and the long-tail problem persist in LLMs, particularly in the RAG-based method. These findings not only deepen our understanding of LLM-based event forecasting methods but also highlight several promising research directions. We believe that this comprehensive evaluation, along with the identified research opportunities, will significantly contribute to future research on temporal event forecasting through LLMs.
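To make the retrieval-augmented setup concrete, here is a minimal sketch of what such a pipeline might look like: retrieve the historical events most similar to a query, keep them in chronological order, and assemble a forecasting prompt for an LLM. This is an illustrative toy, not the paper's implementation; the `Event` structure, the bag-of-words overlap retriever, and the function names are all assumptions made for this example (the actual system presumably uses a stronger retriever and the benchmark's own event schema).

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Event:
    date: str   # e.g. "2023-01-05" (illustrative schema)
    text: str   # short textual event description

def _tokens(s: str) -> Counter:
    # Trivial bag-of-words representation; a real retriever
    # would use embeddings or a learned scorer.
    return Counter(s.lower().split())

def retrieve(history: list[Event], query: str, k: int = 2) -> list[Event]:
    """Rank past events by word overlap with the query, keep the top-k,
    then re-sort chronologically so the prompt preserves temporal order."""
    q = _tokens(query)
    by_score = sorted(history, key=lambda e: -sum((q & _tokens(e.text)).values()))
    return sorted(by_score[:k], key=lambda e: e.date)

def build_prompt(history: list[Event], query: str, k: int = 2) -> str:
    """Concatenate the retrieved events into a forecasting prompt."""
    lines = [f"[{e.date}] {e.text}" for e in retrieve(history, query, k)]
    return ("Historical events:\n" + "\n".join(lines)
            + f"\n\nPredict the next event related to: {query}")

history = [
    Event("2023-01-05", "country A holds talks with country B"),
    Event("2023-02-10", "country A imposes sanctions on country C"),
    Event("2023-03-15", "country B signs trade deal with country C"),
]
print(build_prompt(history, "country A talks country B"))
```

The chronological re-sort after retrieval reflects the point the paper stresses: for temporal event forecasting, the order of the retrieved context matters, not just its relevance.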