On Identifying Why and When Foundation Models Perform Well on Time-Series Forecasting Using Automated Explanations and Rating

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
The performance disparity of foundation models in time-series forecasting remains poorly understood, particularly regarding their applicability boundaries across diverse domains. Method: We propose a score-driven, explainable AI (XAI) framework integrating automated explanation techniques to systematically evaluate forecasting accuracy and interpretability of ARIMA, gradient-boosting models, Chronos, and Llama across financial, energy, and automotive parts domains. Contribution/Results: Experiments reveal that foundation models significantly outperform traditional methods only in stable, low-noise financial settings; in contrast, feature-engineering–based models (e.g., gradient boosting) achieve superior accuracy and stronger causal interpretability in volatile, sparse-data, or domain-knowledge–intensive scenarios—such as electricity load and automotive parts demand forecasting. This work provides the first empirical characterization of the operational boundaries of foundation models for time-series forecasting and establishes an interpretability-informed, evidence-based methodology for model selection.

📝 Abstract
Time-series forecasting models (TSFMs) have evolved from classical statistical methods to sophisticated foundation models, yet understanding why and when these models succeed or fail remains challenging. Despite this known limitation, TSFMs are increasingly used to generate information that informs real-world actions with equally real consequences. Understanding the complexity, performance variability, and opaque nature of these models is therefore a valuable endeavor, addressing serious concerns about how users should interact with and rely on these models' outputs. This work addresses these concerns by combining traditional explainable AI (XAI) methods with Rating Driven Explanations (RDE) to assess TSFM performance and interpretability across diverse domains and use cases. We evaluate four distinct model architectures: ARIMA, Gradient Boosting, Chronos (a time-series-specific foundation model), and Llama (general-purpose; both fine-tuned and base models) on four heterogeneous datasets spanning the finance, energy, transportation, and automotive sales domains. In doing so, we demonstrate that feature-engineered models (e.g., Gradient Boosting) consistently outperform foundation models (e.g., Chronos) in volatile or sparse domains (e.g., power, car parts) while providing more interpretable explanations, whereas foundation models excel only in stable or trend-driven contexts (e.g., finance).
Problem

Research questions and friction points this paper is trying to address.

Explaining why foundation models succeed or fail in time-series forecasting
Assessing model performance and interpretability across diverse domains
Comparing feature-engineered versus foundation models in volatile versus stable contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining XAI with RDE for model interpretability
Evaluating diverse model architectures across multiple domains
Identifying optimal model performance conditions via explanations
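The evaluation idea above can be sketched in miniature. The snippet below is a minimal illustration, not the authors' code: it scores each candidate forecaster's predictions against the actuals for every domain and selects the best model per domain by mean absolute error, which is one simple stand-in for the paper's score-driven, per-domain rating. All data, model names, and domain labels here are hypothetical toy values.

```python
# Minimal sketch (assumption: NOT the paper's implementation) of
# per-domain model rating: compute MAE for each model in each domain,
# then pick the best model per domain.

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rate_models(actuals_by_domain, forecasts_by_model):
    """Return {domain: (best_model, mae)} for illustrative model selection."""
    ratings = {}
    for domain, y_true in actuals_by_domain.items():
        scores = {
            model: mae(y_true, preds[domain])
            for model, preds in forecasts_by_model.items()
        }
        best = min(scores, key=scores.get)
        ratings[domain] = (best, scores[best])
    return ratings

# Toy example loosely mirroring two of the paper's domains;
# the numbers are made up for illustration only.
actuals = {"finance": [100, 102, 104], "power": [50, 70, 40]}
forecasts = {
    "chronos":  {"finance": [101, 102, 105], "power": [60, 60, 60]},
    "boosting": {"finance": [98, 100, 101],  "power": [52, 68, 42]},
}
print(rate_models(actuals, forecasts))
```

With these toy numbers the smoother "finance" series favors the foundation-model column and the volatile "power" series favors the feature-engineered column, echoing the paper's qualitative finding; the real framework additionally weighs interpretability, which a single error metric cannot capture.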
Michael Widener
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
Kausik Lakkaraju
University of South Carolina
Ethical AI, Causal Reasoning, Multimodal Systems
John Aydin
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
Biplav Srivastava
University of South Carolina
Artificial Intelligence (Automated Planning, Trust, Learning), Smarter Cities (Water), Services