🤖 AI Summary
This work systematically evaluates the cross-domain generalization of zero-shot time-series foundation models (e.g., TimesFM, PatchTST) on cloud monitoring data and finds pervasive failure: all tested models underperform simple linear baselines (ARIMA, Linear Regression) by substantial margins, with several exhibiting pathological outputs, including erratic, random-looking forecasts.
Method: To enable this evaluation, the authors introduce the first multi-model, zero-shot transfer benchmark designed specifically for cloud operational data, supporting rigorous, standardized assessment of time-series foundation models in production-grade monitoring settings.
Contribution/Results: The study provides the first empirical evidence of a structural deficiency in mainstream time-series foundation models when applied to cloud-system dynamics. Beyond exposing a serious reliability risk for such models in high-stakes industrial deployments, it delivers a reproducible benchmark and diagnostic methodology, establishing a reference point for developing and evaluating next-generation foundation models tailored to real-world system monitoring.
📝 Abstract
Time series foundation models (FMs) have emerged as a popular paradigm for zero-shot multi-domain forecasting. FMs are trained on numerous diverse datasets and claim to be effective forecasters across multiple different time series domains, including cloud data. In this work, we investigate this claim, exploring the effectiveness of FMs on cloud data. We demonstrate that many well-known FMs fail to generate meaningful or accurate zero-shot forecasts in this setting. We support this claim empirically, showing that FMs are consistently outperformed by simple linear baselines. We also illustrate a number of interesting pathologies, including instances where FMs suddenly output seemingly erratic, random-looking forecasts. Our results suggest a widespread failure of FMs to model cloud data.
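To make the baseline comparison concrete, here is a minimal sketch of the kind of simple linear baseline the paper reports FMs losing to: fit a least-squares line to a context window and extrapolate over the forecast horizon, scoring with MAE. The synthetic "cloud-like" series (trend plus daily seasonality plus noise) is an illustrative stand-in, not the paper's benchmark data, and the window/horizon sizes are arbitrary choices.

```python
import numpy as np

def linear_forecast(context: np.ndarray, horizon: int) -> np.ndarray:
    """Fit a least-squares line to the context window and extrapolate it."""
    t = np.arange(len(context))
    slope, intercept = np.polyfit(t, context, deg=1)
    future_t = np.arange(len(context), len(context) + horizon)
    return slope * future_t + intercept

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error between ground truth and forecast."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Synthetic stand-in for a cloud metric: linear trend + daily cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(512)
series = 0.05 * t + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.5, t.size)

# Split into context (what the forecaster sees) and the held-out horizon.
context, horizon = series[:480], 32
truth = series[480:]
pred = linear_forecast(context, horizon)
print(f"Linear baseline MAE: {mae(truth, pred):.3f}")
```

An FM evaluated zero-shot would receive the same `context` and be scored on the same `truth` window; the paper's finding is that even this trivial extrapolation beats the FMs on cloud data.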