🤖 AI Summary
This paper addresses the lack of robustness evaluation for Foundation Models for Time Series (FMTS) under input perturbations. It proposes the first interpretable, multi-dimensional rating framework integrating causal inference. Methodologically, the authors design causal sensitivity analysis and perturbation robustness quantification metrics to systematically compare multimodal versus unimodal FMTS, and FMTS pre-trained specifically for time series forecasting versus general-purpose pre-trained FMTS; practical utility is validated via user studies and interactive visualizations. Key contributions include: (1) the first application of causal inference to FMTS robustness assessment; (2) a standardized, actionable, and interpretable rating system; (3) empirical evidence, within the scope of the study, that multimodal FMTS and FMTS pre-trained on time series forecasting exhibit better robustness and accuracy; and (4) a user study indicating that the ratings make it easier to compare robustness across models, which supports trustworthiness in high-stakes domains such as finance.
📝 Abstract
Foundation Models (FMs) have improved time series forecasting in various sectors, such as finance, but their vulnerability to input disturbances can hinder their adoption by stakeholders, such as investors and analysts. To address this, we propose a causally grounded rating framework to study the robustness of Foundation Models for Time Series (FMTS) with respect to input perturbations. We evaluate our approach on the stock price prediction problem, a well-studied problem with easily accessible public data, using six state-of-the-art FMTS (some of them multi-modal) across six prominent stocks spanning three industries. The ratings proposed by our framework effectively assess the robustness of FMTS and also offer actionable insights for model selection and deployment. Within the scope of our study, we find that (1) multi-modal FMTS exhibit better robustness and accuracy than their uni-modal versions, and (2) FMTS pre-trained on the time series forecasting task exhibit better robustness and forecasting accuracy than general-purpose FMTS pre-trained across diverse settings. Further, to validate our framework's usability, we conduct a user study showcasing FMTS prediction errors along with our computed ratings. The study confirmed that our ratings made it easier for users to compare the robustness of different systems.
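To make the idea of robustness to input perturbations concrete, the following is a minimal sketch and not the paper's causal rating method. It assumes Gaussian noise as the perturbation, mean absolute error as the accuracy metric, and two toy forecasters (`last_value_forecast`, `mean_forecast`) as hypothetical stand-ins for FMTS; the price values are made up for illustration. It measures how much the forecast error grows when the input window is perturbed.

```python
import numpy as np


def last_value_forecast(history, horizon):
    """Hypothetical stand-in for an FMTS: repeat the last observed value."""
    return np.full(horizon, history[-1], dtype=float)


def mean_forecast(history, horizon):
    """Second hypothetical stand-in: repeat the mean of the input window."""
    return np.full(horizon, np.mean(history), dtype=float)


def mae(pred, target):
    """Mean absolute error between a forecast and the true values."""
    return float(np.mean(np.abs(pred - target)))


def robustness_gap(model, history, target, noise_std, n_trials=200, seed=0):
    """Average change in MAE when Gaussian noise is added to the input.

    This is an illustrative metric chosen for this sketch, not the
    rating defined in the paper.
    """
    rng = np.random.default_rng(seed)
    clean_error = mae(model(history, len(target)), target)
    perturbed_errors = []
    for _ in range(n_trials):
        noisy = history + rng.normal(0.0, noise_std, size=history.shape)
        perturbed_errors.append(mae(model(noisy, len(target)), target))
    return float(np.mean(perturbed_errors)) - clean_error


# Made-up price series: the first six points are input, the last two are targets.
prices = np.array([100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0])
history, target = prices[:6], prices[6:]

gaps = {
    "last_value": robustness_gap(last_value_forecast, history, target, noise_std=1.0),
    "mean": robustness_gap(mean_forecast, history, target, noise_std=1.0),
}
print(gaps)
```

A smaller gap suggests that a model's error is less sensitive to this particular perturbation. A rating framework such as the one described above would go further, for example by accounting for confounding factors, which this sketch does not attempt.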