Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of fusing interleaved text and time-series modalities for financial forecasting. Methodologically, it proposes a unified multimodal neural architecture featuring modality-specific expert networks to independently model news semantics and stock price dynamics, augmented by a saliency-guided token-level cross-modal alignment mechanism implemented via cross-attention for fine-grained semantic correspondence; an integrated interpretability module further enables decision attribution. The key contribution lies in the first integration of saliency-driven token-level alignment with a mixture-of-experts design, jointly optimizing temporal modeling accuracy, linguistic understanding depth, and model transparency. The approach achieves state-of-the-art performance across multiple large-scale financial forecasting benchmarks, and investment backtesting confirms statistically significant improvements in risk-adjusted returns.

Technology Category

Application Category

📝 Abstract
Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, despite their complementary nature, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns, while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance across a wide variety of strong unimodal and multimodal baselines. We develop an interpretability method that reveals insights into the value of time series-context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.
Problem

Research questions and friction points this paper is trying to address.

Integrating interleaved text and time series data for financial forecasting
Aligning representations across modalities with informative token weighting
Preserving pretrained language capabilities while learning time series patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-specific experts for interleaved sequences
Cross-modal alignment with salient token weighting
Interpretability method revealing time series-context value
🔎 Similar Papers
No similar papers found.