MMTS-BENCH: A Comprehensive Benchmark for Time Series Understanding and Reasoning

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models lack systematic benchmarks for time series tasks. This work proposes a structured, multimodal evaluation framework for time series understanding and reasoning, organized into five hierarchical dimensions: structural awareness, feature analysis, temporal reasoning, sequence matching, and cross-modal alignment. The framework combines real-world question answering with modular synthetic data construction to build a benchmark of 2,424 question-answer pairs. Systematic evaluation of open-source, closed-source, and specialized time series models reveals that dedicated time series models generalize worse across domains than general-purpose large language models, and that LLMs underperform on local tasks relative to global ones. Moreover, chain-of-thought reasoning and multimodal fusion significantly improve performance, and the capability of the backbone network proves more decisive than the design of the temporal encoder.

📝 Abstract
Time series data are central to domains such as finance, healthcare, and cloud computing, yet existing benchmarks for evaluating large language models (LLMs) on temporal tasks remain scattered and unsystematic. To bridge this gap, we introduce MMTS-BENCH, a comprehensive multimodal benchmark built upon a hierarchical taxonomy of time-series tasks spanning structural awareness, feature analysis, temporal reasoning, sequence matching, and cross-modal alignment. MMTS-BENCH comprises 2,424 time series question answering (TSQA) pairs across four subsets — Base, InWild, Match, and Align — generated through a progressive real-world QA framework and modular synthetic data construction. We conduct extensive evaluations on closed-source LLMs, open-source LLMs, and existing time-series-adapted large language models (TS-LLMs), revealing that: (1) TS-LLMs significantly lag behind general-purpose LLMs in cross-domain generalization, (2) LLMs show weaknesses in local tasks compared to global tasks, (3) chain-of-thought (CoT) reasoning and multimodal integration substantially improve performance, and (4) the dominant factor in existing TS-LLMs remains the capability of the backbone network rather than the design of the time series encoder. MMTS-BENCH not only provides a rigorous evaluation framework but also offers clear directions for advancing LLMs toward robust, interpretable, and generalizable time-series reasoning.
Problem

Research questions and friction points this paper is trying to address.

time series
benchmark
large language models
temporal reasoning
multimodal
Innovation

Methods, ideas, or system contributions that make the work stand out.

time series benchmark
multimodal reasoning
chain-of-thought
TS-LLM evaluation
hierarchical task taxonomy
Yao Yin
Tsinghua University, Beijing, China
Zhenyu Xiao
Tsinghua University, Beijing, China
Musheng Li
Tsinghua University, Beijing, China
Yiwen Liu
Technical University of Munich
Computer Vision · Robotics Vision · Multimodal Learning
Sutong Nan
Tsinghua University, Beijing, China
Yiting He
Tsinghua University, Beijing, China
Ruiqi Wang
Tsinghua University, Beijing, China
Zhenwei Zhang
Tsinghua University
AI · ML · Data Mining · Time Series
Qingmin Liao
Tsinghua University, Beijing, China
Yuantao Gu
Department of Electronic Engineering, Tsinghua University
Signal Processing · Sparse Recovery · Sparse Learning · Optimization · Graph Signal Processing