🤖 AI Summary
Time-series foundation models (TSFMs) lack contextual knowledge and complex reasoning capabilities, while large language models (LLMs) struggle to capture temporal structures. Method: We propose a cross-modal representation alignment framework that requires no fine-tuning of the time-series backbone: a pre-trained TSFM is frozen, and its latent representations are aligned with those of an LLM using synthetically generated time-series–text paired data. Alignment is achieved via a two-stage strategy: (i) representation alignment pretraining in the shared latent space, followed by (ii) instruction tuning. Contribution/Results: Our method achieves state-of-the-art performance across diverse downstream tasks in finance, energy, and transportation, outperforming leading TSFMs, LLMs, and vision-language models. Notably, it attains superior generalization with less than half the training data required by competitive approaches, demonstrating both high data efficiency and robust cross-domain adaptability.
📝 Abstract
Time series reasoning is crucial to decision-making in diverse domains, including finance, energy usage, traffic, weather, and scientific discovery. While existing time series foundation models (TSFMs) can capture low-level dynamic patterns and provide accurate forecasting, further analysis usually requires additional background knowledge and sophisticated reasoning, which are lacking in most TSFMs but can be provided by large language models (LLMs). On the other hand, without expensive post-training, LLMs often struggle with the numerical understanding of time series data. Although it is intuitive to integrate the two types of models, developing effective training recipes that align the two modalities for reasoning tasks remains an open challenge. To this end, we propose TS-Reasoner, which aligns the latent representations of TSFMs with the textual inputs of LLMs for downstream understanding and reasoning tasks. Specifically, we propose a simple yet effective method to curate diverse, synthetic pairs of time series and textual captions for alignment training. We then develop a two-stage training recipe that applies instruction fine-tuning after the alignment pretraining. Unlike existing works that train an LLM to take time series as inputs, we leverage a pretrained TSFM and freeze it during training. Extensive experiments on several benchmarks demonstrate that TS-Reasoner not only outperforms a wide range of prevailing LLMs, Vision Language Models (VLMs), and Time Series LLMs, but also achieves this with remarkable data efficiency, e.g., using less than half the training data.
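To make the stage-1 alignment idea concrete, here is a minimal numpy sketch of the setup the abstract describes: a frozen TSFM encoder and a frozen text embedder are stood in for by fixed random maps, and only a small projector is trained to map TSFM latents into the LLM representation space on synthetic (series, caption) pairs. All dimensions, the MSE objective, and the stand-in encoders are illustrative assumptions, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TS, D_LLM = 8, 16  # hypothetical latent dimensions (assumption)

# Stand-in for the frozen, pre-trained TSFM encoder: fixed weights, never updated.
W_tsfm = rng.normal(size=(D_TS, 32))
def tsfm_encode(series):            # (32,) window -> (D_TS,) latent
    return np.tanh(W_tsfm @ series)

# Stand-in for the frozen LLM-side text embedder of a caption.
W_text = rng.normal(size=(D_LLM, 64))
def text_embed(caption_feats):      # (64,) caption features -> (D_LLM,) embedding
    return np.tanh(W_text @ caption_feats)

# Synthetic time-series / caption pairs (placeholder random data).
series = rng.normal(size=(256, 32))
captions = rng.normal(size=(256, 64))
ts_latents = np.stack([tsfm_encode(s) for s in series])    # (256, D_TS)
txt_embeds = np.stack([text_embed(c) for c in captions])   # (256, D_LLM)

# The only trainable component in stage 1: a linear projector TSFM -> LLM space.
P = rng.normal(scale=0.1, size=(D_LLM, D_TS))

def align_loss(P):
    pred = ts_latents @ P.T
    return float(np.mean((pred - txt_embeds) ** 2))

loss_before = align_loss(P)
lr = 0.05
for _ in range(200):  # plain gradient descent on the MSE alignment objective
    grad = 2.0 * (ts_latents @ P.T - txt_embeds).T @ ts_latents / len(series)
    P -= lr * grad
loss_after = align_loss(P)
```

After stage 1, the projected latents would be fed to the LLM as soft tokens and stage 2 would apply instruction tuning on reasoning tasks; only the projector (and possibly the LLM) is updated, while the TSFM stays frozen throughout, which is what makes the recipe cheap relative to retraining a time-series backbone.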