BEDTime: A Unified Benchmark for Automatically Describing Time Series

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current time-series language understanding lacks a unified, fine-grained evaluation benchmark. Method: We propose BEDTime, the first multitask benchmark explicitly designed to assess how well models describe time series in natural language, covering recognition, differentiation, and generation tasks. It unifies four recent datasets under a standardized evaluation framework of true/false questions, multiple-choice items, and open-ended generation, formalizing these tasks for the first time. Contribution/Results: BEDTime enables direct, decoupled model comparisons. Experiments reveal that language-only models largely underperform; vision-language models perform significantly better; and pretrained multimodal time-series-language models, though strongest, still leave substantial room for improvement. Critically, all models prove fragile under perturbation, underscoring the need for time-series-specific architectures. BEDTime advances standardization and reproducibility in time-series understanding evaluation.
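To make the three task formats concrete, the sketch below shows how prompts for recognition, differentiation, and generation might be constructed from a raw series. The function names and prompt wording are illustrative assumptions, not BEDTime's actual implementation.

```python
# Hypothetical sketch of BEDTime's three task formats. Prompt wording and
# function names are assumptions for illustration, not the paper's code.
from typing import Sequence


def recognition_prompt(series: Sequence[float], description: str) -> str:
    """Recognition: True/False QA on whether a description matches the series."""
    values = ", ".join(f"{v:.2f}" for v in series)
    return (
        f"Time series: [{values}]\n"
        f"Description: {description}\n"
        "Does the description match the time series? Answer True or False."
    )


def differentiation_prompt(series: Sequence[float], candidates: Sequence[str]) -> str:
    """Differentiation: multiple-choice QA over candidate descriptions."""
    values = ", ".join(f"{v:.2f}" for v in series)
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return (
        f"Time series: [{values}]\n"
        "Which description best matches the time series?\n"
        f"{options}\n"
        "Answer with a single letter."
    )


def generation_prompt(series: Sequence[float]) -> str:
    """Generation: open-ended natural language description of the series."""
    values = ", ".join(f"{v:.2f}" for v in series)
    return f"Time series: [{values}]\nDescribe this time series in plain English."


if __name__ == "__main__":
    series = [0.1, 0.4, 0.9, 1.6, 2.5]  # a simple upward trend
    print(recognition_prompt(series, "The series increases over time."))
```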

📝 Abstract
Many recent studies have proposed general-purpose foundation models designed for a variety of time series analysis tasks. While several established datasets already exist for evaluating these models, previous works frequently introduce their models in conjunction with new datasets, limiting opportunities for direct, independent comparisons and obscuring insights into the relative strengths of different methods. Additionally, prior evaluations often cover numerous tasks simultaneously, assessing a broad range of model abilities without clearly pinpointing which capabilities contribute to overall performance. To address these gaps, we formalize and evaluate three tasks that test a model's ability to describe time series using generic natural language: (1) recognition (True/False question-answering), (2) differentiation (multiple-choice question-answering), and (3) generation (open-ended natural language description). We then unify four recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision-language, and time series-language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, demonstrating the value of vision models for these tasks, and (3) pretrained multimodal time series-language models successfully outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests. Overall, our benchmark provides a standardized evaluation on a task necessary for time series reasoning systems.
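The abstract's robustness finding concerns perturbed inputs. BEDTime's exact perturbation suite is not described on this page; the sketch below shows two generic perturbations of the kind such robustness tests commonly apply, assumed here purely for illustration.

```python
# Generic examples of input perturbations for robustness testing. These are
# assumptions for illustration; they are not BEDTime's documented suite.
import random


def add_gaussian_noise(series: list[float], sigma: float = 0.1) -> list[float]:
    """Perturb each point with zero-mean Gaussian noise."""
    return [x + random.gauss(0.0, sigma) for x in series]


def downsample(series: list[float], factor: int = 2) -> list[float]:
    """Keep every `factor`-th point, coarsening the series."""
    return series[::factor]


# A robust describer should answer consistently on a series and its perturbed
# variants; the paper reports that current models often do not.
series = [0.0, 0.2, 0.5, 0.9, 1.4, 2.0]
noisy = add_gaussian_noise(series, sigma=0.05)
coarse = downsample(series, factor=2)
```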
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized benchmark for time series description tasks
Insufficient head-to-head comparison of time series foundation models
Unclear which capabilities drive overall model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes three time series description tasks using natural language
Unifies four datasets for direct head-to-head model comparisons
Evaluates 13 state-of-the-art multimodal and language models
Authors
Medhasweta Sen, University of Virginia
Zachary Gottesman, University of Virginia
Jiaxing Qiu, University of Virginia
C. Bayan Bruss, Capital One
Topics: Graphs · NLP · Decision Theory · Explainability · AutoML
Nam Nguyen, Capital One
Tom Hartvigsen, Assistant Professor, University of Virginia
Topics: Machine Learning · NLP · Time Series · Data Mining · Healthcare