🤖 AI Summary
Multi-source electronic health records (EHRs) are hard to interpret, compare across institutions, and analyze at scale because of heterogeneous medical coding schemes, institution-specific terminology, and non-standardized data structures. To address this, we propose MASH (Multi-source Automated Structured Hierarchy), a fully automated framework that produces the first automated hierarchies for unstructured local laboratory test codes. MASH integrates pretrained language models, co-occurrence statistics, clinical text descriptions, and supervised labels. It uses neural optimal transport to align medical codes across institutions and learned hyperbolic embeddings to construct hierarchical graphs over the aligned codes. Applied to real-world EHR data, MASH builds interpretable hierarchical graphs covering diagnoses, medications, and laboratory tests. These hierarchies make heterogeneous clinical data easier to navigate and understand, and provide a structured reference for downstream biomedical research.
📝 Abstract
Electronic Health Records (EHRs), comprising diverse clinical data such as diagnoses, medications, and laboratory results, hold great promise for translational research. EHR-derived data have advanced disease prevention, improved clinical trial recruitment, and generated real-world evidence. Synthesizing EHRs across institutions enables large-scale, generalizable studies that capture rare diseases and population diversity, but remains hindered by the heterogeneity of medical codes, institution-specific terminologies, and the absence of standardized data structures. These barriers limit the interpretability, comparability, and scalability of EHR-based analyses, underscoring the need for robust methods to harmonize and extract meaningful insights from distributed, heterogeneous data. To address this, we propose MASH (Multi-source Automated Structured Hierarchy), a fully automated framework that aligns medical codes across institutions using neural optimal transport and constructs hierarchical graphs with learned hyperbolic embeddings. During training, MASH integrates information from pre-trained language models, co-occurrence patterns, textual descriptions, and supervised labels to capture semantic and hierarchical relationships among medical concepts more effectively. Applied to real-world EHR data, including diagnosis, medication, and laboratory codes, MASH produces interpretable hierarchical graphs that facilitate the navigation and understanding of heterogeneous clinical data. Notably, it generates the first automated hierarchies for unstructured local laboratory codes, establishing foundational references for downstream applications.
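To make the two core ingredients concrete, the following is a minimal sketch, not the MASH implementation. It assumes each local code already has a 2-D embedding inside the unit (Poincaré) ball, uses the Poincaré distance as the transport cost, and substitutes classic entropic optimal transport (Sinkhorn iterations) for the neural optimal transport described in the abstract. The code names, coordinates, and the helpers `poincare_distance` and `sinkhorn_plan` are all hypothetical and chosen only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit (Poincare) ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport plan between uniform marginals.

    This is a simple stand-in for neural optimal transport, used here
    only because it fits in a few lines of plain Python.
    """
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical embeddings for three local lab codes at two institutions.
site_a = {"A:GLU": (0.10, 0.80), "A:HGB": (0.80, 0.10), "A:NA": (-0.70, -0.30)}
site_b = {"B:1042": (0.75, 0.15), "B:2210": (-0.65, -0.35), "B:0007": (0.15, 0.75)}

codes_a, codes_b = list(site_a), list(site_b)
cost = [[poincare_distance(site_a[p], site_b[q]) for q in codes_b] for p in codes_a]
plan = sinkhorn_plan(cost)

# For each site-A code, pick the site-B code that receives the most mass.
alignment = {
    p: codes_b[max(range(len(codes_b)), key=lambda j: plan[i][j])]
    for i, p in enumerate(codes_a)
}
print(alignment)
```

In hyperbolic space, distances grow rapidly toward the boundary of the ball, so general concepts can sit near the origin and specific ones near the edge. That property is why hyperbolic embeddings are a common choice for representing hierarchies.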