🤖 AI Summary
This paper addresses the challenge of fusing interleaved text and time-series modalities for financial forecasting. It proposes a unified multimodal neural architecture in which modality-specific experts model news semantics and stock price dynamics, while still allowing joint reasoning across modalities and preserving pretrained language understanding. A cross-modal alignment framework with salient token weighting aligns representations across the two modalities, focusing on the most informative tokens, and an interpretability method reveals the value of time-series context for the forecast. The approach achieves state-of-the-art performance on a large-scale financial forecasting task against strong unimodal and multimodal baselines, and investment simulations indicate that these improvements translate into meaningful economic gains.
📝 Abstract
Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance against a wide variety of strong unimodal and multimodal baselines. We also develop an interpretability method that reveals insights into the value of time series context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.
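To make the idea of salient token weighting concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each text token and the time series segment already have embeddings of the same dimension, that salience is a per-token score turned into weights by a softmax, and that alignment is measured by cosine distance. The abstract specifies none of these choices, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def salient_alignment_loss(text_tokens, ts_embedding, salience_scores):
    """Salience-weighted alignment loss (illustrative assumption).

    text_tokens:     list of token embeddings from the text modality
    ts_embedding:    one embedding summarizing the time series segment
    salience_scores: one unnormalized score per text token; in a real
                     model these would be learned, here they are given
    Returns the weighted mean cosine distance and the token weights.
    """
    weights = softmax(salience_scores)
    distances = [1.0 - cosine(tok, ts_embedding) for tok in text_tokens]
    loss = sum(w * d for w, d in zip(weights, distances))
    return loss, weights


# Toy example: token 0 points in the same direction as the time series
# embedding and also receives the highest salience score.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ts_embedding = [1.0, 0.0]
salience_scores = [2.0, 0.0, 0.0]

loss, weights = salient_alignment_loss(text_tokens, ts_embedding, salience_scores)
print("token weights:", weights)
print("alignment loss:", loss)
```

In this toy setup the loss is lower than it would be with uniform weights, because most of the weight falls on the token that is already aligned with the time series embedding. In a trained model the salience scores and embeddings would be learned jointly with the forecasting objective rather than fixed by hand.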