Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite the growing adoption of time-series foundation models (TSFMs), the calibration of their probabilistic forecasts, i.e., how reliable their stated uncertainties are, remains poorly understood and is often overlooked in real-world deployment. Method: We conduct a systematic empirical evaluation of TSFM calibration across five recent models, covering different prediction-head architectures and long-horizon autoregressive forecasting. Using standard calibration metrics, including expected calibration error (ECE) and maximum calibration error (MCE), we quantify over- and under-confidence and benchmark the foundation models against competitive deep learning baselines. Results: TSFMs show no systematic over- or under-confidence and are consistently better calibrated than the baselines across multiple datasets, forecasting tasks, and long horizons, and their uncertainty estimates remain stable under autoregressive rollout. This work addresses a gap in the empirical evaluation of TSFM calibration and provides evidence and methodological guidance for trustworthy probabilistic time-series forecasting.
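The summary cites expected and maximum calibration error for probabilistic forecasts. As a rough illustration of how such metrics can be computed from quantile forecasts (the paper's exact formulation may differ), the sketch below measures the gap between nominal quantile levels and empirical coverage; the function name `calibration_errors` and the Gaussian toy forecaster are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the paper's exact protocol): quantile-based calibration
# errors for probabilistic forecasts, in the spirit of ECE/MCE.
import numpy as np

def calibration_errors(y_true, quantile_preds, quantile_levels):
    """Compute ECE/MCE-style calibration errors from quantile forecasts.

    y_true:          array of shape (n,) with observed values.
    quantile_preds:  array of shape (n, q); column j is the predicted
                     quantile at level quantile_levels[j] for each point.
    quantile_levels: array of shape (q,) with nominal levels in (0, 1).
    """
    y_true = np.asarray(y_true).reshape(-1, 1)            # (n, 1)
    quantile_preds = np.asarray(quantile_preds)           # (n, q)
    levels = np.asarray(quantile_levels)                  # (q,)

    # Empirical coverage: fraction of observations at or below each predicted
    # quantile. A well-calibrated forecaster has coverage close to the level.
    empirical = (y_true <= quantile_preds).mean(axis=0)   # (q,)

    gaps = np.abs(empirical - levels)
    ece = gaps.mean()   # expected calibration error (mean absolute gap)
    mce = gaps.max()    # maximum calibration error (worst-case gap)
    return ece, mce

# Toy usage: a forecaster that outputs Gaussian quantiles per time step.
if __name__ == "__main__":
    from scipy.stats import norm
    rng = np.random.default_rng(0)
    n, levels = 1000, np.linspace(0.05, 0.95, 19)
    mu, sigma = rng.normal(size=n), 1.0
    y = mu + rng.normal(scale=sigma, size=n)
    q_preds = mu[:, None] + sigma * norm.ppf(levels)[None, :]
    print(calibration_errors(y, q_preds, levels))
```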

📝 Abstract
The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
Problem

Research questions and friction points this paper is trying to address.

Investigating calibration properties of time series foundation models
Assessing model confidence levels and prediction reliability
Evaluating calibration performance in long-term autoregressive forecasting scenarios (see the sketch after this list)
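
One way to make the long-horizon question concrete is to track empirical interval coverage at each autoregressive step. The following is a minimal sketch under assumed interfaces: `model.predict_interval` and the point-forecast feedback loop are hypothetical, not the paper's evaluation protocol.

```python
# Hypothetical sketch: interval calibration across autoregressive horizons.
# `model.predict_interval` is an assumed interface, not a real TSFM library API.
import numpy as np

def horizon_coverage(model, series_list, horizon, level=0.8):
    """Empirical coverage of the central `level` prediction interval,
    tracked separately at each autoregressive step 1..horizon."""
    hits = np.zeros(horizon)
    counts = np.zeros(horizon)
    for series in series_list:
        context, future = series[:-horizon], series[-horizon:]
        history = list(context)
        for h in range(horizon):
            # Assumed interface: returns (lower, upper, point) for one step ahead.
            lower, upper, point = model.predict_interval(history, level=level)
            hits[h] += float(lower <= future[h] <= upper)
            counts[h] += 1
            # Autoregressive rollout: feed the model's own point forecast back in.
            history.append(point)
    coverage = hits / counts
    # Well-calibrated intervals keep coverage near `level` at every horizon;
    # systematic drift with h indicates calibration degrading under rollout.
    return coverage
```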
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated calibration properties of time series foundation models
Systematically assessed over- and under-confidence in predictions
Compared foundation models with baseline models on calibration