🤖 AI Summary
This work investigates the scaling relationship between model size and performance for large-scale time series models, establishing, for the first time in time series forecasting, a power-law scaling law relating loss to parameter count ($N$), dataset size ($D$), and compute budget ($C$): $L \propto N^{-\alpha} D^{-\eta} C^{-\gamma}$. Methodologically, we train decoder-only Transformer architectures on a large-scale heterogeneous time series corpus, following standardized scaling-experiment protocols and fitting empirical losses with power-law regressions spanning five orders of magnitude. Key results show that the scaling law is highly robust, with minimal sensitivity to architectural details (e.g., width-to-depth ratio, number of attention heads), and that forecasting loss can be accurately predicted from $N$, $D$, and $C$ jointly. This study provides the first quantifiable, reproducible engineering framework to guide the design, training, and resource allocation of time-series foundation models.
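To make the power-law fitting step concrete, here is a minimal sketch of how one scaling exponent (e.g., $\alpha$ for parameter count $N$) can be estimated by linear regression in log-log space. The data here are synthetic and the constants (`A_true`, `alpha_true`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Synthetic losses following L = A * N^{-alpha} with small multiplicative
# noise; A_true and alpha_true are assumed for illustration only.
rng = np.random.default_rng(0)
A_true, alpha_true = 5.0, 0.07
N = np.logspace(5, 10, 12)  # parameter counts spanning five orders of magnitude
L = A_true * N ** (-alpha_true) * np.exp(rng.normal(0.0, 0.01, N.size))

# log L = log A - alpha * log N, so ordinary least squares on the logs
# recovers the exponent as the negated slope.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_hat, A_hat = -slope, np.exp(intercept)

print(f"alpha ~ {alpha_hat:.3f}, A ~ {A_hat:.2f}")
```

The same log-linear fit applies to the dataset-size and compute exponents; a joint fit over $(N, D, C)$ would simply regress $\log L$ on all three log-covariates at once.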
📝 Abstract
Scaling laws for large language models (LLMs) have provided useful guidance for training ever-larger models with predictable performance gains. Time series forecasting shares a sequential structure similar to that of language and is amenable to large-scale transformer architectures. Here we show that foundational decoder-only time series transformer models exhibit scaling behavior analogous to that of LLMs, with architectural details (aspect ratio and number of heads) having a minimal effect over broad ranges. We assemble a large corpus of heterogeneous time series data on which to train, and establish for the first time power-law scaling with parameter count, dataset size, and training compute, spanning five orders of magnitude.