🤖 AI Summary
This work addresses two challenges in multimodal time series understanding: fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, both of which hinder model interpretability and complementary reasoning. To overcome these limitations, we propose MADI, a multimodal large language model framework that achieves physically grounded cross-modal correspondence through patch-level alignment, explicitly disentangles shared and modality-specific semantics via discrete disentangled interaction, and emphasizes informative, query-relevant signals with a critical-token highlighting mechanism. Extensive experiments show that MADI significantly outperforms both general-purpose and time-series-specialized multimodal models on synthetic and real-world benchmarks, substantially improving time series comprehension and reasoning.
📝 Abstract
Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks that enable natural language querying over time series and produce textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, enabling MLLMs to combine precise value reasoning with visual structure comprehension for comprehensive time series understanding. However, effective numerical-visual modality integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.
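The patch-level alignment idea can be illustrated with a minimal sketch: split the series into patches, embed each numeric patch and its corresponding plot region, and pull matching pairs together with a symmetric InfoNCE-style contrastive objective. This is a hypothetical toy illustration, not the paper's actual implementation; all function names, the embedding dimensionality, and the choice of InfoNCE are assumptions for exposition.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def patch_alignment_loss(num_patches, vis_patches, temperature=0.1):
    """InfoNCE-style alignment: numeric patch i should match visual patch i.

    Hypothetical sketch of fine-grained cross-modal alignment; MADI's real
    objective and patching scheme are not reproduced here.
    """
    n = len(num_patches)
    loss = 0.0
    for i in range(n):
        # Similarity of numeric patch i to every visual patch (candidates).
        logits = [cosine(num_patches[i], v) / temperature for v in vis_patches]
        # Numerically stable log-softmax at the matching index i.
        m = max(logits)
        log_prob_i = logits[i] - (m + math.log(sum(math.exp(l - m) for l in logits)))
        loss += -log_prob_i
    return loss / n
```

In this toy setting, correctly paired patch embeddings yield a lower loss than a shuffled pairing, which is the behavior a fine-grained alignment objective is meant to enforce.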