🤖 AI Summary
This work studies natural language inference of latent events in time series, i.e., producing human-readable descriptions of "what happened" in the measured environment from sequential data. To this end, the authors introduce GAMETime, a benchmark for event inference built from win probabilities in 4,200 real-world basketball and American football games, comprising 1.7 million time steps paired with corresponding natural language events. Using GAMETime, they evaluate 16 large language models (LLMs) and find that the open-weights DeepSeek-R1 32B model outperforms proprietary models such as GPT-4o. While LLMs demonstrate promising event-inference ability, they show clear failures when the provided context, event sequence length, or evaluation strategy is varied. This work establishes a new benchmark and an empirical understanding of how well current LLMs interpret events from time series.
📝 Abstract
Time series data measure how environments change over time and drive decision-making in critical domains such as finance and healthcare. When analyzing time series, we often seek to understand the underlying events occurring in the measured environment. For example, one might ask: what caused a sharp drop in a stock price? Events are often described in natural language, so we conduct the first study of whether Large Language Models (LLMs) can infer natural language events from time series. We curate a new benchmark of win probabilities collected from 4,200 basketball and American football games, comprising 1.7M timesteps of real-valued data and corresponding natural language events. Building on the recent wave of applying LLMs to time series, we evaluate 16 LLMs and find that they demonstrate promising abilities to infer events from time series data. The open-weights DeepSeek-R1 32B model outperforms proprietary models like GPT-4o. Despite this impressive initial performance, we also find clear avenues for improvement, identifying failures when we alter the provided context, event sequence lengths, and evaluation strategy. (All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime)