🤖 AI Summary
Existing multimodal time series methods often produce overly smooth forecasts in non-stationary environments: they miss critical fluctuations, which distorts the shape of the forecast. To address this limitation, this work proposes STaT, an architecture that jointly models symbolic, temporal, and textual modalities. The symbolic modality discretizes the series to capture structural patterns and turning points, the temporal modality models sequential dependencies, and the textual modality injects domain knowledge to guide macro-level trends. By aligning the three modalities, STaT improves shape fidelity while maintaining prediction accuracy. Evaluation on eight real-world benchmarks shows that STaT improves conventional magnitude metrics by up to 8.9% and reduces shape distortion by up to 8.5%.
📝 Abstract
Recent research in time series forecasting frequently investigates integrating textual and visual modalities with numerical models to better handle non-stationary environments. Although existing multimodal approaches deliver solid numerical results, they usually face a dilemma: prioritizing the minimization of average error can yield excessively smooth forecasts that overlook essential fluctuations. To resolve this limitation, we introduce STaT, a multimodal architecture for Symbolic-Temporal-Textual Alignment that unites three complementary modalities. Specifically, the symbolic modality converts continuous time series into discrete tokens, facilitating the accurate identification of structural patterns and turning points; the temporal modality extracts inherent sequential dependencies; and the textual modality leverages domain semantics to steer macroscopic forecasting trends. Comprehensive evaluations on eight real-world benchmarks indicate that STaT improves conventional magnitude metrics by up to 8.9% while simultaneously decreasing shape distortion by up to 8.5%.
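The abstract does not say how the symbolic modality discretizes a series. As a rough illustration only, the sketch below uses a SAX-style scheme: z-normalize the series, average it over equal-length segments, and map each segment mean to a letter using standard-normal quartile breakpoints. The function name `symbolize`, the segment count, and the four-letter alphabet are assumptions made for this example, not STaT's actual tokenizer.

```python
import numpy as np

# Quartile breakpoints of the standard normal distribution (4 equiprobable bins).
# Assumption: a fixed 4-symbol alphabet, chosen only to keep the sketch small.
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])
ALPHABET = "abcd"


def symbolize(series, n_segments=8):
    """Hypothetical SAX-style tokenizer: continuous series -> string of symbols."""
    x = np.asarray(series, dtype=float)
    if n_segments < 1 or n_segments > x.size:
        raise ValueError("n_segments must be between 1 and len(series)")

    # Z-normalize so the breakpoints are comparable across series.
    std = x.std()
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)

    # Piecewise aggregate approximation: one mean per (nearly) equal segment.
    paa = np.array([seg.mean() for seg in np.array_split(z, n_segments)])

    # Map each segment mean to the index of the bin it falls into.
    bins = np.searchsorted(BREAKPOINTS, paa)
    return "".join(ALPHABET[i] for i in bins)


if __name__ == "__main__":
    # A ramp up followed by a ramp down: one clear turning point in the middle.
    ramp = np.concatenate([np.linspace(0, 1, 16), np.linspace(1, 0, 16)])
    print(symbolize(ramp))  # abcddcba
```

In this toy output the turning point shows up as the switch from rising letters to falling letters around `dd`, which is the kind of structural cue a discrete representation makes easy to locate.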