AI Summary
This work addresses the longstanding disconnect between semantic understanding and high-fidelity numerical generation in time series modeling, where generative models often rely on superficial patterns while comprehension models struggle to produce precise values. To bridge this gap, we propose the first vision-centric unified framework that synergistically enhances both capabilities through three key innovations: a novel bidirectional lossless time-series-to-image mapping (Bi-TSI), an explicit comprehension-guided generation mechanism, and a multi-task joint training architecture. We further introduce the TSUMM-Suite benchmark, comprising six understanding and two generation tasks, to holistically evaluate model performance. Extensive experiments demonstrate that our approach significantly improves both semantic comprehension accuracy and numerical generation fidelity, establishing a new paradigm for multimodal time series modeling.
Abstract
Recent time series modeling faces a sharp divide between numerical generation and semantic understanding: research shows that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics, coupled with two generation tasks. With a calibrated Chain-of-Thought, TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.
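To make the Bi-TSI idea concrete, here is a minimal toy sketch of a near-lossless time-series-to-image roundtrip: a 1-D series is rendered as a one-pixel-per-timestep "line plot" image, with the value range kept as side metadata so the series can be recovered up to a bounded quantization error. The function names and the specific encoding are illustrative assumptions, not the paper's actual Bi-TSI mapping.

```python
import numpy as np

def ts_to_image(series, height=64):
    """Toy TS2I: render a 1-D series as a binary image, one lit pixel
    per timestep, keeping (min, max) as metadata for inversion.
    Illustrative assumption -- not the paper's actual Bi-TSI encoding."""
    lo, hi = float(series.min()), float(series.max())
    norm = (series - lo) / (hi - lo + 1e-12)          # scale to [0, 1]
    rows = np.round(norm * (height - 1)).astype(int)  # quantize to pixel rows
    img = np.zeros((height, len(series)), dtype=np.uint8)
    img[rows, np.arange(len(series))] = 255           # one pixel per column
    return img, (lo, hi)

def image_to_ts(img, meta):
    """Toy I2TS: invert the mapping; reconstruction error is bounded
    by half a pixel of the quantization grid."""
    lo, hi = meta
    rows = img.argmax(axis=0)                          # lit row per column
    return rows / (img.shape[0] - 1) * (hi - lo) + lo

series = np.sin(np.linspace(0, 6.28, 128))
img, meta = ts_to_image(series)
recon = image_to_ts(img, meta)
print(np.abs(series - recon).max())  # small, bounded quantization error
```

The roundtrip is "near-lossless" in the sense the abstract uses: the error shrinks as image height grows, since each extra pixel row halves nothing but the quantization step scales as 1/(height-1).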