Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of modeling electronic health record (EHR) time series, which are often hindered by irregular sampling, heterogeneous missingness patterns, and sparse observations. To tackle these issues, the authors propose the Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE), a framework that jointly optimizes representation learning and missingness-pattern modeling by combining an intrinsic missing mask, which marks naturally absent values, with an augmented mask that hides a subset of observed values, all without requiring pre-imputation. During training, the encoder processes only the unmasked tokens, and the model reconstructs the values hidden by the augmented mask, so the learned representations account for both the observed data and the missingness structure. Evaluated on two real-world EHR datasets, AID-MAE significantly outperforms strong baselines such as XGBoost and DuETT. The learned embeddings naturally capture clinically meaningful patient subgroups and perform strongly across multiple downstream clinical prediction tasks.

📝 Abstract
Learning from electronic health records (EHRs) time series is challenging due to irregular sampling, heterogeneous missingness, and the resulting sparsity of observations. Prior self-supervised methods either impute before learning, represent missingness through a dedicated input signal, or optimize solely for imputation, reducing their capacity to efficiently learn representations that support clinical downstream tasks. We propose the Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE), which learns directly from incomplete time series by applying an intrinsic missing mask to represent naturally missing values and an augmented mask that hides a subset of observed values for reconstruction during training. AID-MAE processes only the unmasked subset of tokens and consistently outperforms strong baselines, including XGBoost and DuETT, across multiple clinical tasks on two datasets. In addition, the learned embeddings naturally stratify patient cohorts in the representation space.
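The dual-masking setup described in the abstract can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the masking logic only, not the authors' implementation: the array shapes, masking ratio, and variable names are assumptions, and the encoder itself is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy EHR time series: T time steps x F features, with NaN marking
# values that were never recorded (intrinsic missingness).
x = rng.normal(size=(8, 4))
x[rng.random(x.shape) < 0.4] = np.nan

# Intrinsic mask: True where a value was actually observed.
intrinsic_mask = ~np.isnan(x)

# Augmented mask: randomly hide ~25% of the *observed* entries.
# These held-out values become the reconstruction targets.
aug_mask = intrinsic_mask & (rng.random(x.shape) < 0.25)

# The encoder would process only tokens that are observed AND not
# augment-masked; missing/masked positions carry no input value.
visible = intrinsic_mask & ~aug_mask
encoder_input = np.where(visible, x, 0.0)

# Reconstruction loss is computed only at augment-masked positions,
# so the model is never asked to reconstruct truly missing values.
target_positions = np.argwhere(aug_mask)
```

The key invariant is that the augmented mask is drawn only over observed entries, so the two masks never overlap on a truly missing value, and no pre-imputation is needed to form the training signal.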
Problem

Research questions and friction points this paper is trying to address.

incomplete EHR data
time series representation learning
heterogeneous missingness
irregular sampling
sparse observations
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-masked autoencoding
incomplete EHR data
self-supervised representation learning
missingness-aware modeling
clinical time series