ITGPT: Generative Pretraining on Irregular Timeseries

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of leveraging abundant unlabeled data in multimodal time series that are irregularly sampled and contain missing values, as commonly encountered in healthcare and predictive maintenance. The authors propose an attention-based, end-to-end architecture that models raw irregular sequences directly, without resampling, explicit imputation, or feature fusion. By combining self-supervised learning losses with a GPT-like generative pretraining objective, the model uses unlabeled data to improve its learned representations. Evaluated on the TIHM healthcare dataset and the CompX predictive maintenance dataset, the approach achieves state-of-the-art predictive performance and, when labels are scarce, outperforms the purely supervised approach.

📝 Abstract

Timeseries regression models often struggle to leverage large volumes of unlabeled multimodal data, particularly when the data are irregularly sampled or contain missing values. This is common in domains like healthcare and predictive maintenance, where data are collected from unreliable sources, and labeling requires expert knowledge or costly equipment. Transformer-based large language models have proven effective on structured data such as text through self-supervised learning (SSL) and generative pretraining (GPT) frameworks. However, such models lack the flexibility to efficiently process irregularly sampled multimodal timeseries data. In this paper, we introduce ITGPT, an attention-based architecture designed for handling multimodal, irregularly sampled timeseries by allowing training with both SSL losses and GPT-like objectives. We evaluate its performance on a healthcare task with the TIHM dataset, and a predictive maintenance task with the CompX dataset. Our results demonstrate that ITGPT achieves state-of-the-art performance without requiring resampling, feature fusion, or explicit data imputation. Furthermore, when labels are scarce, ITGPT effectively leverages unlabeled data through SSL and GPT training, outperforming the purely supervised approach. This represents an important step towards efficiently using large and unstructured timeseries datasets for practical inference tasks.
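
To make "without requiring resampling, feature fusion or explicit data imputation" concrete, the sketch below shows one common way to feed irregular multimodal observations to an attention model: each observation becomes a (time, modality, value) token, and a padding mask marks which positions are real. This is an illustrative assumption, not ITGPT's documented input format; the function name `to_triplets`, the modality names, and the sample values are all hypothetical.

```python
import numpy as np

def to_triplets(streams, max_len):
    """Flatten per-modality irregular streams into one time-ordered token sequence.

    streams: dict mapping modality name -> list of (timestamp, value) pairs.
    Returns times, modality ids, values, and a boolean mask (True = real observation).
    Hypothetical helper for illustration; not taken from the ITGPT paper.
    """
    names = sorted(streams)  # fixed modality -> integer id mapping
    events = [(t, names.index(m), v) for m in names for t, v in streams[m]]
    events.sort(key=lambda e: e[0])  # order by timestamp only, no resampling grid
    events = events[:max_len]        # truncate sequences that are too long
    n = len(events)

    times = np.zeros(max_len, dtype=np.float32)
    mods = np.zeros(max_len, dtype=np.int64)
    vals = np.zeros(max_len, dtype=np.float32)
    mask = np.zeros(max_len, dtype=bool)
    if n:
        t, m, v = zip(*events)
        times[:n], mods[:n], vals[:n] = t, m, v
        mask[:n] = True  # padded positions stay False and can be ignored by attention
    return times, mods, vals, mask

# Two sensors sampled at unrelated, irregular times (made-up values).
streams = {
    "heart_rate": [(0.0, 72.0), (1.5, 75.0), (7.2, 70.0)],
    "temperature": [(0.4, 36.6), (6.0, 36.9)],
}
times, mods, vals, mask = to_triplets(streams, max_len=8)
```

Missing readings are simply absent from the sequence, so nothing has to be imputed, and each modality keeps its own timestamps rather than being aligned to a shared grid. A GPT-like objective could then be set up as predicting the next token from the preceding ones, though how ITGPT defines that objective is not specified in this summary.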

Problem

Research questions and friction points this paper is trying to address.

irregular timeseries

multimodal data

missing values

time series regression

data labeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Irregular Timeseries

Generative Pretraining

Self-Supervised Learning