🤖 AI Summary
This work addresses the lack of systematic evaluation of temporal reasoning in general-purpose models by introducing TSRBench, the first multitask, multimodal benchmark for time series reasoning, encompassing 14 domains and 4,125 questions. The benchmark establishes a unified evaluation framework across four dimensions: perception, reasoning, prediction, and decision-making, and offers the first formal definition of time series reasoning. Empirical analysis reveals a significant disconnect between semantic understanding and numerical prediction in current models, as well as inadequate multimodal integration. Experiments show that while perception and reasoning abilities scale with model size, prediction performance does not follow scaling laws; strong reasoning capability does not guarantee high prediction accuracy; and multimodal inputs fail to deliver the expected performance gains.
📝 Abstract
Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multimodal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4,125 problems spanning 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks across these dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs on TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual representations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.