🤖 AI Summary
Despite the growing adoption of time-series foundation models (TSFMs), their calibration (i.e., the reliability of their probabilistic forecasts) is poorly understood and often overlooked in real-world deployment. Method: We conduct the first systematic empirical evaluation of TSFM calibration across five state-of-the-art models, varying prediction head architectures and long-horizon autoregressive settings. Using standard calibration metrics, including expected calibration error (ECE) and maximum calibration error (MCE), we quantitatively assess confidence calibration and benchmark against traditional deep learning baselines. Results: TSFMs exhibit no systematic over- or under-confidence and are consistently better calibrated than conventional models across multiple datasets, forecasting tasks, and long-range horizons. Their uncertainty estimates remain robust and consistent, indicating stable probabilistic reliability. This work fills a gap in the empirical evaluation of TSFM calibration and provides evidence and methodological guidance for trustworthy time-series forecasting.
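To make the metrics concrete, here is a minimal sketch of how ECE and MCE can be computed for probabilistic (quantile) forecasts. This is an illustrative implementation based on the standard definitions, not the paper's actual evaluation code: for each nominal quantile level, compare the empirical coverage against the level; ECE averages the absolute gaps and MCE takes the worst one. The function name and the toy uniform-forecaster example are our own for illustration.

```python
import numpy as np

def quantile_calibration_errors(y_true, quantile_preds, levels):
    """ECE and MCE for quantile forecasts.

    y_true:         (N,) observed values
    quantile_preds: (L, N) predicted q-quantiles, one row per level
    levels:         (L,) nominal quantile levels, e.g. 0.1 ... 0.9

    A perfectly calibrated forecaster has empirical coverage equal to
    each nominal level; ECE is the mean absolute gap, MCE the maximum.
    """
    levels = np.asarray(levels)
    # Fraction of observations at or below each predicted quantile.
    coverage = (y_true[None, :] <= quantile_preds).mean(axis=1)
    gaps = np.abs(coverage - levels)
    return gaps.mean(), gaps.max()  # (ECE, MCE)

# Toy example: outcomes are Uniform(0, 1), so the true q-quantile is q.
# A forecaster that predicts exactly q at each level is well calibrated.
rng = np.random.default_rng(0)
y = rng.uniform(size=10_000)
levels = np.arange(0.1, 1.0, 0.1)
preds = np.tile(levels[:, None], (1, y.size))
ece, mce = quantile_calibration_errors(y, preds, levels)
```

On this synthetic example both errors are close to zero; a systematically overconfident model (intervals too narrow) would instead show coverage below the nominal levels and correspondingly large gaps.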
📝 Abstract
The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, even though calibration can be critical for many practical applications. In this paper, we investigate the calibration properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), the effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.