CaTS-Bench: Can Language Models Describe Numeric Time Series?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing time-series captioning benchmarks predominantly rely on synthetic data, neglect metadata and visual modalities, and lack realistic, large-scale, context-aware evaluation scenarios. This work introduces CaTS-Bench, the first large-scale, real-world benchmark for context-aware time series captioning, comprising roughly 465k training and 105k test timestamps drawn from 11 diverse datasets. It unifies numeric time series, structured metadata, and chart images to support both natural-language caption generation and multiple-choice question answering. The authors propose a scalable LLM-driven annotation pipeline backed by factuality checks and human verification, including a human-revisited subset of 579 test captions refined for accuracy and human-like style. They further design time-series-specific evaluation metrics and a set of 460 multiple-choice questions targeting temporal reasoning. A comprehensive evaluation of leading vision-language models reveals persistent limitations in trend interpretation and numerical reasoning. CaTS-Bench thus establishes a scalable, reproducible foundation for research at the intersection of time-series analysis and foundation models.

📝 Abstract
Time series captioning, the task of describing numeric time series in natural language, requires numerical reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on synthetic data or overly simplistic captions, and typically neglect metadata and visual representations. To close this gap, we introduce CaTS-Bench, the first large-scale, real-world benchmark for Context-aware Time Series captioning. CaTS-Bench is derived from 11 diverse datasets reframed as captioning and Q&A tasks, comprising roughly 465k training and 105k test timestamps. Each sample includes a numeric series segment, contextual metadata, a line-chart image, and a caption. A key contribution of this work is the scalable pipeline used to generate reference captions: while most references are produced by an oracle LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, we also provide a human-revisited subset of 579 test captions, refined from LLM outputs to ensure accuracy and human-like style. Beyond captioning, CaTS-Bench offers 460 multiple-choice questions targeting deeper aspects of time series reasoning. We further propose new tailored evaluation metrics and benchmark leading VLMs, highlighting both their strengths and persistent limitations. Together, these contributions establish CaTS-Bench and its captioning pipeline as a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating language models' ability to describe numeric time series data
Addressing limitations of existing benchmarks with synthetic or simplistic captions
Providing context-aware time series captioning with metadata and visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world benchmark with contextual metadata
Scalable pipeline using oracle LLM for captions
Human-refined subset ensuring accuracy and style
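To make the benchmark's structure concrete, here is a minimal sketch of what a single CaTS-Bench sample might contain, based on the abstract's description (a numeric series segment, contextual metadata, a line-chart image, and a reference caption). The class and field names are illustrative assumptions, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaTSSample:
    """Hypothetical record mirroring the sample components named in the abstract."""
    series: list[float]               # numeric time-series segment
    metadata: dict[str, str]          # contextual metadata (domain, frequency, ...)
    chart_path: Optional[str] = None  # path to the line-chart image for the segment
    caption: str = ""                 # reference caption (oracle LLM or human-revisited)

# Toy instance with made-up values, for illustration only
sample = CaTSSample(
    series=[12.0, 13.5, 15.2, 14.8, 16.1],
    metadata={"domain": "energy", "frequency": "daily"},
    caption="Demand rises steadily over the week with a brief dip midweek.",
)
print(len(sample.series))  # → 5
```

A real loader would also carry the multiple-choice Q&A items; this sketch only shows the captioning side.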
👥 Authors
Luca Zhou
Sapienza University of Rome
Pratham Yashwante
University of California San Diego
Marshall Fisher
University of California San Diego
Alessio Sampieri
ItalAI
Human Motion · Embodied AI · Computer Vision
Zihao Zhou
University of California San Diego
Fabio Galasso
Sapienza University of Rome
Computer vision · Machine learning · Pattern recognition · Sequence modelling · Meta-learning
Rose Yu
Associate Professor, University of California, San Diego
Machine Learning · Computational Sustainability