A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease

📅 2025-07-01

🤖 AI Summary

Problem: Accurate diagnosis of rare pulmonary diseases is hindered by the scarcity and selection bias of gold-standard labels and by the noise in labels derived from electronic health records (EHRs). Method: We propose a weakly supervised transformer framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. Medical concept embeddings are first learned from semantic meaning or co-occurrence patterns in EHRs, then refined and aggregated into patient-level representations by a multi-layer transformer. Contribution/Results: The approach reduces reliance on costly expert annotation while limiting the effect of label noise. Validated on EHR data from Boston Children’s Hospital for two rare pulmonary diseases, it improves on baseline methods in phenotype classification, in identifying clinically meaningful subphenotypes through patient clustering, and in predicting disease progression.

📝 Abstract

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions often remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. While computational phenotyping algorithms show promise for automating rare disease detection, their development is hindered by the scarcity of labeled data and biases in existing label sources. Gold-standard labels from registries and expert chart reviews are highly accurate but constrained by selection bias and the cost of manual review. In contrast, labels derived from electronic health records (EHRs) cover a broader range of patients but can introduce substantial noise. To address these challenges, we propose a weakly supervised, transformer-based framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. This hybrid approach enables the training of a highly accurate and generalizable phenotyping model that scales rare disease detection beyond the scope of individual clinical expertise. Our method is initialized by learning embeddings of medical concepts based on their semantic meaning or co-occurrence patterns in EHRs, which are then refined and aggregated into patient-level representations via a multi-layer transformer architecture. Using two rare pulmonary diseases as a case study, we validate our model on EHR data from Boston Children's Hospital. Our framework demonstrates notable improvements in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression compared to baseline methods. These results highlight the potential of our approach to enable scalable identification and stratification of rare disease patients for clinical care and research applications.
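The abstract does not spell out how the silver-standard labels are updated, so the following is only a minimal sketch of one common weak-supervision pattern, under stated assumptions: a weighted logistic regression stands in for the transformer, gold labels get full weight, silver labels get a reduced weight, and after each round every silver label is moved part of the way toward the model's prediction. The function names and the `silver_weight` and `mix` parameters are hypothetical and do not come from the paper.

```python
import math


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    # weights[0] is the bias; the remaining entries pair with the features in x.
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit(features, targets, sample_weights, epochs=200, lr=0.5):
    """Weighted logistic regression trained by batch gradient descent.

    Targets may be soft values in [0, 1], which is how silver labels enter.
    """
    weights = [0.0] * (len(features[0]) + 1)
    total = sum(sample_weights)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y, sw in zip(features, targets, sample_weights):
            err = sw * (predict(weights, x) - y)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / total for w, g in zip(weights, grad)]
    return weights


def refine_silver_labels(gold_x, gold_y, silver_x, silver_y,
                         rounds=3, silver_weight=0.3, mix=0.5):
    """Alternate between fitting on gold + silver labels and updating silver labels.

    Assumption (not from the paper): silver labels are down-weighted and
    blended with model predictions after each round.
    """
    silver_y = list(silver_y)
    weights = None
    for _ in range(rounds):
        features = list(gold_x) + list(silver_x)
        targets = list(gold_y) + silver_y
        sample_weights = [1.0] * len(gold_x) + [silver_weight] * len(silver_x)
        weights = fit(features, targets, sample_weights)
        # Move each silver label part of the way toward the current prediction.
        silver_y = [(1 - mix) * y + mix * predict(weights, x)
                    for x, y in zip(silver_x, silver_y)]
    return weights, silver_y


if __name__ == "__main__":
    # Toy one-feature data: the first silver label is deliberately wrong.
    gold_x = [[-2.0], [-1.0], [1.0], [2.0]]
    gold_y = [0, 0, 1, 1]
    silver_x = [[-3.0], [-2.5], [-2.0], [-1.5], [-1.0],
                [1.0], [1.5], [2.0], [2.5], [3.0]]
    silver_y = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    _, refined = refine_silver_labels(gold_x, gold_y, silver_x, silver_y)
    print([round(v, 2) for v in refined])
```

In this toy setup the mislabeled silver example drifts toward the label implied by the gold data and the remaining silver examples, which is the intuition behind combining a few trusted labels with many noisy ones.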

Problem

Research questions and friction points this paper is trying to address.

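Rare diseases are hard to diagnose because of low prevalence and limited clinician familiarity

Gold-standard labels are scarce and subject to selection bias, while EHR-derived labels are noisy

Phenotype classification, subphenotype identification, and disease progression prediction must scale beyond individual clinical expertise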

Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly supervised transformer for rare disease diagnosis

Combines gold-standard and silver-standard EHR labels

Multi-layer transformer refines and aggregates medical concept embeddings into patient-level representations
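To make the aggregation step concrete, here is a rough sketch of how concept embeddings might be refined by attention and pooled into a single patient vector. It is a deliberately simplified stand-in, not the paper's architecture: it uses unparameterised scaled dot-product self-attention with no learned query/key/value projections, no multiple heads, no feed-forward sublayers, and no residual connections or layer normalisation, and it finishes with mean pooling, which is an assumption. All function names are hypothetical.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(embeddings):
    """One unparameterised self-attention pass.

    Each concept vector is replaced by a similarity-weighted average of all
    concept vectors in the same patient record.
    """
    scale = math.sqrt(len(embeddings[0]))
    refined = []
    for query in embeddings:
        weights = softmax([dot(query, key) / scale for key in embeddings])
        refined.append([sum(w * value[d] for w, value in zip(weights, embeddings))
                        for d in range(len(query))])
    return refined


def patient_representation(embeddings, layers=2):
    """Stack a few attention passes, then mean-pool into one patient vector."""
    for _ in range(layers):
        embeddings = self_attention(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


if __name__ == "__main__":
    # Three toy 2-dimensional concept embeddings for one patient.
    concepts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(patient_representation(concepts))
```

In practice a library implementation with learned parameters would replace `self_attention`; the sketch only shows the data flow from concept-level vectors to one patient-level vector.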
