🤖 AI Summary
This work addresses the challenge of ineffective cross-modal alignment in multimodal time series forecasting, where existing approaches relying on token-level fusion struggle when text–time series pairs are scarce or temporal features differ substantially across series. To overcome this limitation, the paper proposes a text-guided expert modulation mechanism that, for the first time, leverages textual signals to simultaneously govern both routing decisions and expert computations within a Mixture-of-Experts (MoE) architecture, thereby enabling direct cross-modal control over expert behavior. This approach departs from conventional fusion paradigms and improves the efficiency and robustness of cross-modal alignment. Extensive experiments across multiple multimodal time series forecasting benchmarks demonstrate consistent and substantial performance improvements, validating the method's effectiveness and generalization capability.
📝 Abstract
Real-world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi-modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token-level fusion that mixes temporal patches with language tokens in a shared embedding space. Such fusion can be ill-suited when high-quality time–text pairs are scarce and when time series vary substantially in scale and characteristics, complicating cross-modal alignment. In parallel, Mixture-of-Experts (MoE) architectures have proven effective for both time series modeling and multi-modal learning, yet many existing MoE-based modality integration methods still depend on token-level fusion. To address this, we propose Expert Modulation, a new paradigm for multi-modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross-modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi-modal time series prediction. Code is available at https://github.com/BruceZhangReve/MoME
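To make the expert-modulation idea concrete, here is a minimal NumPy sketch of an MoE layer in which a text embedding conditions both the routing logits and each expert's computation. All names, dimensions, and the FiLM-style scale/shift modulation are illustrative assumptions for exposition, not the paper's exact formulation (see the repository linked above for the actual implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

D, E, T_DIM = 16, 4, 8  # hidden dim, number of experts, text-embedding dim

# Hypothetical parameters: a router conditioned on both the time-series token
# and the text embedding, plus per-expert scale/shift generated from the text.
W_router_x = rng.normal(0, 0.1, (D, E))       # token -> routing logits
W_router_t = rng.normal(0, 0.1, (T_DIM, E))   # text  -> routing logits
W_experts  = rng.normal(0, 0.1, (E, D, D))    # one weight matrix per expert
W_scale    = rng.normal(0, 0.1, (T_DIM, E * D))
W_shift    = rng.normal(0, 0.1, (T_DIM, E * D))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def text_modulated_moe(x, text_emb, top_k=2):
    """x: (D,) time-series token; text_emb: (T_DIM,) text embedding."""
    # 1) Text conditions the routing decision, not only the token itself.
    logits = x @ W_router_x + text_emb @ W_router_t      # (E,)
    top = np.argsort(logits)[-top_k:]                    # indices of top-k experts
    gates = softmax(logits[top])                         # renormalized gate weights
    # 2) Text also modulates each expert's computation via scale & shift.
    scale = 1.0 + (text_emb @ W_scale).reshape(E, D)
    shift = (text_emb @ W_shift).reshape(E, D)
    out = np.zeros(D)
    for g, e in zip(gates, top):
        h = np.tanh(x @ W_experts[e])                    # expert forward pass
        out += g * (scale[e] * h + shift[e])             # text-guided modulation
    return out

y = text_modulated_moe(rng.normal(size=D), rng.normal(size=T_DIM))
print(y.shape)
```

Note the contrast with token-level fusion: the text never enters the shared token sequence; it only steers which experts fire and how they transform the temporal representation.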