🤖 AI Summary
To address the challenges posed by increasingly high-dimensional, long-sequence, multimodal, and severely incomplete electronic health record (EHR) data, this paper proposes a dual-axis Transformer architecture that jointly models attention across the clinical-variable and temporal dimensions. The method explicitly encodes missingness patterns to improve robustness to sparse observations, and it learns transferable sensor embeddings. The authors also re-implemented mainstream baselines, previously scattered across repositories or dependent on deprecated libraries, in a single PyTorch codebase. On the MIMIC-III sepsis prediction task the approach achieves state-of-the-art (SOTA) performance, and on in-hospital mortality classification it matches top-performing methods. Crucially, it demonstrates significantly improved adaptability to missing data and stable generalization across diverse clinical scenarios.
📝 Abstract
Electronic Health Records (EHRs), the digital representation of a patient's medical history, are a valuable resource for epidemiological and clinical research. They are also becoming increasingly complex, with recent trends toward larger datasets, longer time series, and multi-modal integration. Transformers, which have rapidly gained popularity owing to their success in natural language processing and other domains, are well suited to these challenges because they model long-range dependencies and process data in parallel. However, their application to EHR classification remains limited by data representations that can reduce performance or fail to capture informative missingness. In this paper, we present the Bi-Axial Transformer (BAT), which attends to both the clinical-variable and time-point axes of EHR data to learn richer data relationships and address the difficulties of data sparsity. BAT achieves state-of-the-art performance on sepsis prediction and is competitive with top methods for mortality classification. Compared to other transformers, BAT demonstrates increased robustness to data missingness and learns distinct sensor embeddings that can be used in transfer learning. Baseline models, previously scattered across multiple repositories or reliant on deprecated libraries, were re-implemented in PyTorch and made available for reproduction and future benchmarking.
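To make the dual-axis idea concrete, here is a minimal NumPy sketch of attention applied along both axes of an EHR tensor shaped (time steps, clinical variables, embedding dim). This is an illustrative simplification under assumed conventions, not the paper's actual block: the function names, the single-head unprojected attention, and the sequential time-then-variable ordering are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Simplified single-head scaled dot-product self-attention.
    # x: (seq_len, d); no learned Q/K/V projections, for illustration only.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (seq_len, seq_len)
    return softmax(scores, axis=-1) @ x    # (seq_len, d)

def bi_axial_block(x):
    # x: (time, variables, d). Hypothetical simplification of bi-axial
    # attention: first attend over time for each variable, then over
    # variables at each time step.
    t, v, _ = x.shape
    # Time-axis attention: one sequence per clinical variable.
    x = np.stack([self_attention(x[:, j, :]) for j in range(v)], axis=1)
    # Variable-axis attention: one sequence per time step.
    x = np.stack([self_attention(x[i, :, :]) for i in range(t)], axis=0)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 16))  # 8 time steps, 5 variables, 16-dim embeddings
out = bi_axial_block(x)
print(out.shape)  # (8, 5, 16)
```

The variable-axis pass lets each measurement attend to co-occurring measurements (capturing cross-variable structure and missingness patterns), while the time-axis pass captures long-range temporal dependencies; a real implementation would add projections, multiple heads, residual connections, and masking for unobserved entries.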