Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

📅 2024-09-24
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Current time-series forecasting models suffer from limited scale, high inference cost, and poor generalization, hindering the development of large-scale foundation models. To address this, we propose a unified foundation model framework for ultra-large-scale time-series forecasting that, for the first time, brings sparse Mixture of Experts (MoE) into time-series modeling, yielding a decoder-only Transformer with 2.4 billion parameters. We conduct large-scale autoregressive pretraining on Time-300B, a diverse dataset spanning nine domains and 300 billion time points, empirically validating scaling laws for time-series modeling. Experiments demonstrate that our model significantly outperforms dense baselines under equivalent computational budgets, supports flexible context lengths and prediction horizons, and achieves new state-of-the-art accuracy across multiple benchmarks.

📝 Abstract
Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at high cost, hindering the development of larger, more capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale dataset Time-300B, which spans 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
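The abstract's key efficiency idea is that a router activates only a few experts per prediction, so compute per token stays roughly constant as total capacity grows. A minimal NumPy sketch of top-k MoE routing for a single token is shown below; the function and shape choices are illustrative assumptions, not the paper's exact layer.

```python
import numpy as np

def topk_moe_forward(x, gate_w, expert_ws, k=2):
    """Sparse top-k mixture-of-experts feed-forward for one token.

    Illustrative sketch of the routing idea: only k of the experts run.
    x:         (d,) hidden state of one token
    gate_w:    (num_experts, d) router weights (hypothetical shapes)
    expert_ws: list of (d, d) expert weight matrices
    """
    logits = gate_w @ x                  # one router score per expert
    top = np.argsort(logits)[-k:]        # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        out += w * (expert_ws[idx] @ x)  # only k expert matmuls execute
    return out
```

Note that the cost of the forward pass depends on k, not on the total number of experts, which is what lets capacity scale without a matching increase in inference cost.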
Problem

Research questions and friction points this paper is trying to address.

Pre-trained time series models remain limited in scale and generalization.
High inference cost hinders deploying large forecasting models in practice.
It is unclear whether scaling laws for training tokens and model size hold for time series forecasting.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse mixture-of-experts design
Decoder-only transformer models
Pre-trained on Time-300B dataset
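The decoder-only, auto-regressive design listed above is what gives flexible forecasting horizons: the model predicts one step, appends it to the context, and repeats for any horizon. A small sketch of that rollout loop follows; `model_step` is a hypothetical stand-in for the trained network's next-value prediction, not an implementation of it.

```python
import numpy as np

def autoregressive_forecast(model_step, context, horizon):
    """Roll a one-step predictor forward to an arbitrary horizon.

    model_step: callable taking the history array and returning the
                next value (placeholder for the actual transformer).
    context:    iterable of observed values.
    horizon:    number of future steps to generate.
    """
    history = list(context)
    preds = []
    for _ in range(horizon):
        nxt = model_step(np.asarray(history))  # predict one step ahead
        preds.append(nxt)
        history.append(nxt)                    # feed prediction back in
    return preds
```

With a naive persistence predictor such as `lambda h: float(h[-1])`, the loop simply repeats the last observation, which illustrates the mechanism without any learned weights.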
Xiaoming Shi
Princeton University, Squirrel Ai Learning, Griffith University
Shiyu Wang
Princeton University, Squirrel Ai Learning, Griffith University
Yuqi Nie
Princeton University
Dianqi Li
University of Washington
Deep Learning, Natural Language Processing
Zhou Ye
Max-Planck Institute for Intelligent Systems
Microrobotics, Micro-manipulation, Microfluidics, Magnetic Robots
Qingsong Wen
Squirrel Ai Learning
Ming Jin
Griffith University