Conversational Time Series Foundation Models: Towards Explainable and Effective Forecasting

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Time-series forecasting faces three core challenges: instability of foundation models, lack of interpretability in model ensembles, and the inability of large language models (LLMs) to directly capture temporal causality. To address these, we propose the first interpretable, conversational forecasting framework that reconfigures an LLM as a “dialogic arbiter” with explicit temporal causal reasoning capability, dynamically orchestrating multi-model ensembles via iterative causal inference. Our method introduces SHAP-guided R1-style fine-tuning, rendering ensemble weights interpretable as causal statements about time-varying dynamics. Evaluated on GIFT-Eval—a comprehensive benchmark spanning 23 datasets and 97 forecasting configurations—our approach achieves new state-of-the-art performance, significantly outperforming existing methods on both CRPS and MASE metrics.

📝 Abstract
The proliferation of time series foundation models has created a landscape in which no single method achieves consistent superiority, framing the central challenge not as finding the best model but as orchestrating an optimal, interpretable ensemble. While Large Language Models (LLMs) offer powerful reasoning capabilities, their direct application to time series forecasting has proven ineffective. We address this gap by repositioning the LLM as an intelligent judge that evaluates, explains, and strategically coordinates an ensemble of foundation models. To overcome the LLM's inherent lack of domain-specific knowledge of time series, we introduce an R1-style fine-tuning process, guided by SHAP-based faithfulness scores, which teaches the model to interpret ensemble weights as meaningful causal statements about temporal dynamics. The trained agent then engages in iterative, multi-turn conversations to perform forward-looking assessments, provide causally grounded explanations for its weighting decisions, and adaptively refine the optimization strategy. Validated on the GIFT-Eval benchmark across 23 datasets and 97 settings, our approach significantly outperforms leading time series foundation models on both CRPS and MASE metrics, establishing new state-of-the-art results.
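The two evaluation metrics named in the abstract, CRPS and MASE, have standard definitions that are easy to state in code. Below is a minimal NumPy sketch (not tied to the paper's implementation): `crps_ensemble` uses the empirical-ensemble form E|X - y| - 0.5 E|X - X'|, and `mase` scales the forecast error by the seasonal-naive error on the training series.

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error with seasonal period m (m=1: naive forecast)."""
    y_true, y_pred, y_train = (np.asarray(a, float) for a in (y_true, y_pred, y_train))
    # Scale: in-sample MAE of the seasonal-naive forecast y[t] = y[t-m].
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def crps_ensemble(samples, y):
    """Empirical CRPS of an ensemble against a scalar observation y."""
    samples = np.asarray(samples, float)
    term1 = np.mean(np.abs(samples - y))                           # E|X - y|
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))  # 0.5 E|X - X'|
    return term1 - term2
```

Both metrics are zero for a perfect (degenerate) forecast, and MASE is unit-free, which is why benchmarks such as GIFT-Eval can aggregate it across heterogeneous datasets.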
Problem

Research questions and friction points this paper is trying to address.

Orchestrating optimal ensembles of time series foundation models with interpretability
Repositioning LLMs as intelligent judges to evaluate and coordinate ensembles
Teaching LLMs domain-specific time series knowledge via R1-style finetuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM as intelligent judge coordinating ensemble models
R1-style finetuning with SHAP-based faithfulness scores
Iterative multi-turn conversations for adaptive optimization
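To make the "iterative re-weighting of ensemble members" idea concrete, here is a deliberately simple stand-in: a multiplicative-weights (Hedge-style) loop that down-weights models with high validation error. Everything in this sketch, including the function names and the choice of MAE as the loss, is illustrative; the paper's actual mechanism is an LLM judge, not this update rule.

```python
import numpy as np

def hedge_weights(forecasts, y_val, n_rounds=20, eta=1.0):
    """Hedge-style reweighting of ensemble members on a validation window.

    forecasts: array (n_models, horizon) of per-model predictions.
    y_val: array (horizon,) of observed values.
    Returns a weight vector summing to 1.
    """
    forecasts = np.asarray(forecasts, float)
    y_val = np.asarray(y_val, float)
    w = np.full(len(forecasts), 1.0 / len(forecasts))
    losses = np.mean(np.abs(forecasts - y_val), axis=1)  # per-model MAE
    for _ in range(n_rounds):
        w = w * np.exp(-eta * losses)  # exponentially down-weight high-error models
        w = w / w.sum()
    return w

def ensemble_forecast(forecasts, w):
    """Convex combination of member forecasts under weights w."""
    return np.tensordot(w, np.asarray(forecasts, float), axes=1)
```

The contrast with the paper's approach is the point: a Hedge update produces weights but no explanation, whereas the fine-tuned LLM judge is meant to emit the weights together with causal statements justifying them.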
Defu Cao
Peking University; MBZUAI; University of Southern California; Caltech
Time Series · Foundation Model · Machine Learning · Causal Inference · LLM
Michael Gee
University of Southern California
Jinbo Liu
University of Southern California
Hengxuan Wang
University of Southern California
Wei Yang
University of Southern California
Rui Wang
Amazon AWS
Yan Liu
University of Southern California