AI Summary
This work proposes a multimodal generative framework that reformulates time series classification as a text generation task, jointly modeling numerical sequences, textual context, and task instructions. Traditional approaches often struggle to incorporate contextual information and overlook semantic relationships among classes. To address these limitations, the framework employs time series discretization, an alignment projection layer, and generative self-supervised pretraining, complemented by an implicit feature augmentation mechanism that integrates statistical features with vision-language image descriptions. This design effectively compensates for the inductive bias deficiencies of language models in temporal modeling. Extensive experiments on multiple benchmark datasets demonstrate that the proposed method significantly outperforms existing approaches, highlighting its superior performance and strong generalization capability.
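The discretization step mentioned above, which converts continuous numerical sequences into discrete temporal tokens a language model can consume, can be sketched with simple quantile binning. This is an illustrative assumption, not the paper's actual tokenizer (which may use a learned codebook); the function name and bin count are hypothetical.

```python
import numpy as np

def discretize_series(x, n_bins=16):
    """Map a continuous series to discrete temporal token ids.

    A minimal sketch using quantile binning, so each bin holds roughly
    the same mass of observed values; the real discretization module is
    not specified here and may differ.
    """
    # Interior quantile edges: n_bins bins need n_bins - 1 cut points.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    # np.digitize assigns each value a bin index in [0, n_bins - 1],
    # which can then serve as a token id in the model's vocabulary.
    return np.digitize(x, edges)

# Example: tokenize a noisy sine wave into 16 discrete symbols.
x = np.sin(np.linspace(0.0, 6.28, 100)) + 0.1 * np.random.randn(100)
tokens = discretize_series(x, n_bins=16)
```

After this step, the token sequence can be embedded and passed through an alignment projection layer alongside ordinary text tokens.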
Abstract
Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by fine-tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
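The "translates them into textual descriptions" step for statistical features can be illustrated with a small sketch. The feature set and the sentence template below are assumptions for illustration only; the paper's actual toolkit and phrasing may differ.

```python
import numpy as np

def describe_series(x):
    """Render basic statistical features of a series as text.

    A hedged sketch of implicit feature augmentation: numeric summary
    statistics are computed and verbalized so they can be concatenated
    with the instruction prompt as ordinary text.
    """
    mean, std = float(np.mean(x)), float(np.std(x))
    lo, hi = float(np.min(x)), float(np.max(x))
    # A crude trend cue comparing endpoints; richer toolkits could add
    # seasonality, autocorrelation, or spectral descriptors here.
    trend = "rising" if x[-1] > x[0] else "falling"
    return (
        f"The series has mean {mean:.2f} and standard deviation {std:.2f}, "
        f"ranges from {lo:.2f} to {hi:.2f}, and is overall {trend}."
    )

desc = describe_series(np.array([1.0, 2.0, 3.0]))
```

Image-captioning-based descriptions would follow the same pattern, with a vision-language model producing the sentence from a rendered plot of the series instead of a template.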