TSAQA: Time Series Analysis Question And Answering Benchmark

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing time series question answering (QA) benchmarks are largely confined to forecasting and anomaly detection, offering limited capacity to comprehensively evaluate a model's temporal reasoning capabilities. To address this gap, this work proposes TSAQA, a unified, multi-task time series QA benchmark encompassing six task categories: anomaly detection, classification, feature characterization, comparison, data transformation, and temporal relationship analysis. TSAQA comprises 210,000 structured QA pairs spanning 13 domains and introduces novel QA formats including true/false, multiple-choice, and puzzle-style questions. The benchmark supports both zero-shot and instruction-tuning evaluation paradigms and is compatible with standard large language model (LLM) testing pipelines. Experimental results reveal that even the strongest commercial model, Gemini-2.5-Flash, achieves an average score of only 65.08 under zero-shot settings, while fine-tuned open-source models such as LLaMA-3.1-8B still exhibit substantial room for improvement, underscoring the inherent challenges of temporal understanding for current LLMs.

📝 Abstract
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzle-style (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, the best-performing open-source model, LLaMA-3.1-8B, still shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.
Problem

Research questions and friction points this paper is trying to address.

time series analysis
question answering
benchmark
temporal reasoning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

time series question answering
unified benchmark
temporal analysis
large language models
zero-shot evaluation
Baoyu Jing
University of Illinois at Urbana-Champaign
Sanhorn Chen
University of Illinois at Urbana-Champaign
Lecheng Zheng
University of Illinois at Urbana-Champaign
Heterogeneous Learning, Graph Mining, Multi-modal Learning, Anomaly Detection, Multi-label Learning
Boyu Liu
Beihang University
Quantization, AIGC, 3D Vision
Zihao Li
University of Illinois at Urbana-Champaign
music, sports, video games, cooking
Jiaru Zou
University of Illinois Urbana-Champaign
LLM Reasoning, Agents, Reinforcement Learning
Tianxin Wei
University of Illinois Urbana-Champaign
Trustworthy Machine Learning, LLM, Information Retrieval
Zhining Liu
Ph.D. Candidate, UIUC
LLM, Data-centric AI, Responsible AI, Imbalanced Learning, Graph Mining
Zhichen Zeng
University of Illinois at Urbana-Champaign
Ruizhong Qiu
University of Illinois Urbana-Champaign
Large Language Models, Optimization, Graph Neural Networks
Xiao Lin
Ph.D., University of Illinois Urbana-Champaign
Machine Learning, Graph Learning, Time Series Analysis
Yuchen Yan
Amazon
Dongqi Fu
Research Scientist, Meta AI
Geometric Deep Learning, Sequence Modeling, Probabilistic Graphical Models
Jingchao Ni
University of Houston; Amazon Web Services; NEC Labs (Ph.D., Penn State)
Machine Learning, Data Science, Artificial Intelligence
Jingrui He
University of Illinois at Urbana-Champaign
Machine Learning, Data Mining, Social Networks, Medical Informatics, Semiconductor Manufacturing
Hanghang Tong
University of Illinois at Urbana-Champaign
Large Scale Data Mining, Graph Mining, Social Networks, Healthcare, Multimedia