CaTS-Bench: Can Language Models Describe Numeric Time Series?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing time-series captioning benchmarks predominantly rely on synthetic data, neglect metadata and visual modalities, and lack realistic, large-scale, context-aware evaluation scenarios. This work introduces CaTS-Bench, the first large-scale, real-world benchmark for context-aware time series captioning, comprising roughly 465k training and 105k test timestamps drawn from 11 diverse datasets. It unifies numeric time series, structured metadata, and chart images to support both natural-language caption generation and multiple-choice question answering. The authors propose a scalable LLM-driven annotation pipeline backed by factuality checks and human verification, including a human-revisited subset of 579 test captions refined for accuracy and human-like style. They further design time-series-specific evaluation metrics and a set of 460 multiple-choice questions targeting temporal reasoning. A comprehensive evaluation of leading vision-language models reveals persistent limitations in trend interpretation and numerical reasoning. CaTS-Bench thus establishes a scalable, reproducible foundation for research at the intersection of time-series analysis and foundation models.

📝 Abstract
Time series captioning, the task of describing numeric time series in natural language, requires numerical reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on synthetic data or overly simplistic captions, and typically neglect metadata and visual representations. To close this gap, we introduce CaTS-Bench, the first large-scale, real-world benchmark for Context-aware Time Series captioning. CaTS-Bench is derived from 11 diverse datasets reframed as captioning and Q&A tasks, comprising roughly 465k training and 105k test timestamps. Each sample includes a numeric series segment, contextual metadata, a line-chart image, and a caption. A key contribution of this work is the scalable pipeline used to generate reference captions: while most references are produced by an oracle LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, we also provide a human-revisited subset of 579 test captions, refined from LLM outputs to ensure accuracy and human-like style. Beyond captioning, CaTS-Bench offers 460 multiple-choice questions targeting deeper aspects of time series reasoning. We further propose new tailored evaluation metrics and benchmark leading VLMs, highlighting both their strengths and persistent limitations. Together, these contributions establish CaTS-Bench and its captioning pipeline as a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating language models' ability to describe numeric time series data
Addressing limitations of existing benchmarks with synthetic or simplistic captions
Providing context-aware time series captioning with metadata and visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world benchmark with contextual metadata
Scalable pipeline using oracle LLM for captions
Human-refined subset ensuring accuracy and style
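To make the benchmark's structure concrete, here is a minimal sketch of what a single CaTS-Bench sample might contain, based on the abstract's description (a numeric series segment, contextual metadata, a line-chart image, and a reference caption). The class and field names are illustrative assumptions, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaTSSample:
    """Hypothetical record mirroring the sample components named in the abstract."""
    series: list[float]               # numeric time-series segment
    metadata: dict[str, str]          # contextual metadata (domain, frequency, ...)
    chart_path: Optional[str] = None  # path to the line-chart image for the segment
    caption: str = ""                 # reference caption (oracle LLM or human-revisited)

# Toy instance with made-up values, for illustration only
sample = CaTSSample(
    series=[12.0, 13.5, 15.2, 14.8, 16.1],
    metadata={"domain": "energy", "frequency": "daily"},
    caption="Demand rises steadily over the week with a brief dip midweek.",
)
print(len(sample.series))  # → 5
```

A real loader would also carry the multiple-choice Q&A items; this sketch only shows the captioning side.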
👥 Authors
Luca Zhou
Sapienza University of Rome
Pratham Yashwante
University of California San Diego
Marshall Fisher
University of California San Diego
Alessio Sampieri
ItalAI
Human Motion · Embodied AI · Computer Vision
Zihao Zhou
University of California San Diego
Fabio Galasso
Sapienza University of Rome
Computer vision · Machine learning · Pattern recognition · Sequence modelling · Meta-learning
Rose Yu
Associate Professor, University of California, San Diego
Machine Learning · Computational Sustainability