🤖 AI Summary
This work studies natural language inference of latent events in time series, i.e., producing human-readable descriptions of "what happened" in the measured environment from sequential data. To this end, the authors introduce GAMETime, a benchmark for event inference built from win probabilities in 4,200 real-world basketball and American football games, comprising 1.7 million time steps paired with corresponding natural language events. Using GAMETime, they evaluate 16 large language models (LLMs) and find that the open-weights DeepSeek-R1 32B model outperforms proprietary models such as GPT-4o. While LLMs demonstrate promising event-inference ability, they show clear failures when the provided context, event sequence length, or evaluation strategy is varied. This work establishes a new benchmark and an empirical understanding of how well current LLMs interpret events from time series.
📝 Abstract
Time series data measure how environments change over time and drive decision-making in critical domains such as finance and healthcare. When analyzing time series, we often seek to understand the underlying events occurring in the measured environment. For example, one might ask: what caused a sharp drop in a stock price? Events are often described in natural language, so we conduct the first study of whether Large Language Models (LLMs) can infer natural language events from time series. We curate a new benchmark of win probabilities collected from 4,200 basketball and American football games, comprising 1.7M timesteps of real-valued data and corresponding natural language events. Building on the recent wave of applying LLMs to time series, we evaluate 16 LLMs and find that they demonstrate promising abilities to infer events from time series data. The open-weights DeepSeek-R1 32B model outperforms proprietary models like GPT-4o. Despite this impressive initial performance, we also find clear avenues for improvement, identifying failures when we alter the provided context, event sequence lengths, and evaluation strategy. (All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime)