A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease

📅 2025-07-01
🤖 AI Summary
Problem: Accurate diagnosis of rare pulmonary diseases is hindered by the scarcity and selection bias of gold-standard labels and by noise in labels derived from electronic health records (EHRs). Method: The paper proposes a weakly supervised transformer framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels. It first learns medical concept embeddings from semantic meaning or EHR co-occurrence patterns, then refines and aggregates them into patient-level representations with a multi-layer transformer. Contribution/Results: The approach reduces reliance on costly expert annotation while limiting the effect of label noise. Evaluated on EHR data from Boston Children’s Hospital with two rare pulmonary diseases as a case study, it shows notable improvements over baseline methods in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression.

📝 Abstract
Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions often remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. While computational phenotyping algorithms show promise for automating rare disease detection, their development is hindered by the scarcity of labeled data and biases in existing label sources. Gold-standard labels from registries and expert chart reviews are highly accurate but constrained by selection bias and the cost of manual review. In contrast, labels derived from electronic health records (EHRs) cover a broader range of patients but can introduce substantial noise. To address these challenges, we propose a weakly supervised, transformer-based framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. This hybrid approach enables the training of a highly accurate and generalizable phenotyping model that scales rare disease detection beyond the scope of individual clinical expertise. Our method is initialized by learning embeddings of medical concepts based on their semantic meaning or co-occurrence patterns in EHRs, which are then refined and aggregated into patient-level representations via a multi-layer transformer architecture. Using two rare pulmonary diseases as a case study, we validate our model on EHR data from Boston Children's Hospital. Our framework demonstrates notable improvements in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression compared to baseline methods. These results highlight the potential of our approach to enable scalable identification and stratification of rare disease patients for clinical care and research applications.
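The idea of combining fixed gold-standard labels with iteratively updated silver-standard labels can be illustrated with a small sketch. The abstract does not state the update rule, so everything below is an assumption made for illustration: a one-feature logistic regression stands in for the transformer, silver labels are updated by blending the previous label with the current model prediction, and the names `fit_logistic`, `refine_silver_labels`, `gold_weight`, and `blend` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(xs, targets, weights, w=0.0, b=0.0, lr=1.0, steps=300):
    """Weighted one-feature logistic regression fit by gradient descent.

    Targets may be soft labels in [0, 1]; this is a stand-in for the
    phenotyping model, not the architecture used in the paper.
    """
    total = sum(weights)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t, wt in zip(xs, targets, weights):
            err = wt * (sigmoid(w * x + b) - t)  # gradient of weighted cross-entropy
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / total
        b -= lr * grad_b / total
    return w, b


def refine_silver_labels(gold_x, gold_y, silver_x, silver_y,
                         rounds=5, blend=0.5, gold_weight=2.0):
    """Alternate between fitting the model and updating silver labels.

    Gold labels stay fixed and are up-weighted (assumed weighting);
    silver labels move toward the model's predictions each round
    (assumed update rule).
    """
    silver = [float(s) for s in silver_y]
    w = b = 0.0
    for _ in range(rounds):
        xs = list(gold_x) + list(silver_x)
        targets = [float(y) for y in gold_y] + silver
        weights = [gold_weight] * len(gold_x) + [1.0] * len(silver_x)
        w, b = fit_logistic(xs, targets, weights, w, b)  # warm start from last round
        silver = [(1.0 - blend) * s + blend * sigmoid(w * x + b)
                  for s, x in zip(silver, silver_x)]
    return silver, (w, b)


# Hypothetical toy data: the feature is a disease-related code fraction in [0, 1].
gold_x, gold_y = [1.0, 0.8, 0.0, 0.2], [1, 1, 0, 0]
silver_x = [1.0, 0.8, 0.0, 0.2, 1.0]
silver_y = [1, 1, 0, 0, 0]  # the last silver label is deliberately noisy
refined, params = refine_silver_labels(gold_x, gold_y, silver_x, silver_y)
print([round(s, 2) for s in refined])
```

In this toy setting the mislabeled silver patient has a feature value that the gold labels associate with the disease, so its soft label drifts upward across rounds while the gold labels never change.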
Problem

Research questions and friction points this paper is trying to address.

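Rare diseases are hard to diagnose because of low prevalence and limited clinician familiarity
Gold-standard labels are scarce and subject to selection bias, while EHR-derived labels are noisy
Phenotype classification and disease progression prediction need to improve over baseline methods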
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly supervised transformer for rare disease diagnosis
Combines gold-standard and silver-standard EHR labels
Multi-layer transformer refines medical concept embeddings
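
To make the last point concrete, the sketch below shows one way concept embeddings could be refined by self-attention and then pooled into a single patient-level vector. It is a simplified illustration under stated assumptions, not the paper's architecture: the attention layer uses identity query/key/value projections, there are no residual connections, layer normalization, or feed-forward blocks, pooling is a plain mean, and the names `self_attention` and `patient_embedding` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """One scaled dot-product self-attention layer with identity projections."""
    d = len(embeddings[0])
    refined = []
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        attn = softmax(scores)
        # Each refined embedding is an attention-weighted mix of all concepts.
        refined.append([sum(a * v[j] for a, v in zip(attn, embeddings))
                        for j in range(d)])
    return refined


def patient_embedding(concept_embeddings, layers=2):
    """Refine concept embeddings over several layers, then mean-pool them."""
    h = concept_embeddings
    for _ in range(layers):
        h = self_attention(h)
    d = len(h[0])
    return [sum(row[j] for row in h) / len(h) for j in range(d)]


# Hypothetical 2-dimensional embeddings for three concepts in one patient's record.
concepts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(patient_embedding(concepts))
```

A trained model would learn the projection matrices and typically use a dedicated pooling token or learned attention pooling; the mean is used here only to keep the example short.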
Authors

Kimberly F. Greco
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, USA

Zongxin Yang
Department of Biomedical Informatics, Harvard Medical School, Boston, USA

Mengyan Li
Department of Mathematical Sciences, Bentley University, Waltham, USA

Han Tong
Department of Biostatistics, Columbia University, New York, USA

Sara Morini Sweet
Department of Biomedical Informatics, Harvard Medical School, Boston, USA

Alon Geva
Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children’s Hospital, Boston, USA
Department of Anesthesia, Harvard Medical School, Boston, USA
Computational Health Informatics Program, Boston Children’s Hospital, Boston, USA

Kenneth D. Mandl
Professor, Harvard Med. Director, Computational Health Informatics Program, Boston Children's
Biomedical Informatics, Population Health

Benjamin A. Raby
Division of Pulmonary Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, USA
Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, USA

Tianxi Cai
Harvard University
statistics, biostatistics, modeling, prediction, genomics