Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limitations of prior work: time series foundation models typically ignore accompanying text, and multimodal approaches retrofit pretrained language models without natively modeling temporal dynamics or comparing against strong unimodal baselines. The authors propose Chronicle, a 324M-parameter decoder-only Transformer that is, to their knowledge, the first model jointly pretrained from scratch on both language and time series data. Both modalities share the same Transformer blocks, attention mechanism, and residual stream; most of pretraining uses unimodal batches, followed by a short alignment stage that interleaves the two. Chronicle matches Gemma-3-270M-PT on 19 natural language understanding tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and outperforms every supervised fusion baseline on the Time-MMD multimodal forecasting benchmark, showing that a single backbone can be competitive with specialized foundation models in both modalities.
📝 Abstract
Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.
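The two-phase training recipe described in the abstract (mostly unimodal batches, then a short interleaved alignment stage) can be illustrated with a minimal scheduling sketch. It is not the authors' implementation: the function name `batch_schedule`, the `align_fraction` parameter and its default, and the strict text/series alternation in the unimodal phase are all assumptions made for illustration, since the abstract gives no proportions or ordering.

```python
from typing import Iterator, Tuple


def batch_schedule(
    total_steps: int, align_fraction: float = 0.05
) -> Iterator[Tuple[int, str]]:
    """Yield (step, batch_kind) pairs for a hypothetical two-phase schedule.

    batch_kind is "text" or "series" during the unimodal phase and
    "interleaved" during the final alignment phase.
    align_fraction is an assumed value, not one reported in the abstract.
    """
    if not 0.0 <= align_fraction <= 1.0:
        raise ValueError("align_fraction must be in [0, 1]")

    # The alignment stage occupies the last few steps of training.
    align_steps = round(total_steps * align_fraction)
    unimodal_steps = total_steps - align_steps

    for step in range(total_steps):
        if step < unimodal_steps:
            # Assumption: alternate modalities; each batch holds one modality.
            kind = "text" if step % 2 == 0 else "series"
        else:
            # Alignment stage: batches mix text and time series.
            kind = "interleaved"
        yield step, kind


if __name__ == "__main__":
    for step, kind in batch_schedule(total_steps=20, align_fraction=0.1):
        print(step, kind)
```

In this sketch every batch would feed the same shared decoder, so the only thing that changes between phases is how the batches are composed.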
Problem

Research questions and friction points this paper is trying to address.

multimodal learning
time series understanding
foundation models
joint pretraining
language and time series
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal foundation model
joint pretraining
time series and language
unified transformer architecture
cross-modal emergence