AI Summary
This study addresses the challenge of achieving cross-source semantic alignment in multi-institutional electronic health records (EHRs), which is hindered by heterogeneous coding systems and fragmented clinical concepts. The authors propose SMILE, a novel method that, for the first time, integrates a von Mises-Fisher mixture model on the hypersphere with sparse weak supervision derived from knowledge graphs to enable privacy-preserving semantic clustering of synonymous codes across disparate EHR sources. Theoretical analysis demonstrates the statistical gains conferred by multi-source fusion and geometric constraints, establishing non-asymptotic error bounds. Experimental results on both synthetic data and real-world multi-institutional EHRs show that SMILE significantly improves cross-source alignment accuracy and synonym cluster coherence.
Abstract
Multi-institutional electronic health record (multi-EHR) data have emerged as a powerful resource for developing predictive models to support clinical decisions and for generating reliable real-world evidence. By aggregating information from diverse patient populations and institutions, they enhance the robustness and generalizability of models and findings. However, analyzing multi-EHR data remains challenging because disparate institutions rarely map all data elements to a common ontology, and raw EHR codes are often overly granular and institution-specific, fragmenting representations of the same clinical concept. Hence, integrative analysis must overcome two key hurdles: harmonizing codes with the same clinical meaning (synonymy), and aligning institutional feature spaces. To address these challenges, we propose SMILE, a Spherical Mixture Integration for Latent Embedding alignment across multi-source feature spaces, where embeddings from heterogeneous sources serve as privacy-preserving summaries of clinical concepts and sparse auxiliary relationship pairs provide weak supervision on the latent geometry. Synonymy is modeled via a mixture of von Mises-Fisher distributions, yielding unified representations that consolidate semantically equivalent raw codes. We develop a composite quasi-likelihood estimation procedure and establish non-asymptotic error bounds for latent representations and mixture mean directions, together with consistent recovery of synonym clusters. The theory quantifies statistical gains from integrating multiple sources and auxiliary knowledge graph information. Simulations and a multi-institutional EHR application demonstrate improved alignment and synonym clustering.
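To make the idea of modeling synonymy with a von Mises-Fisher (vMF) mixture concrete, the following is a minimal sketch of generic vMF mixture clustering on the unit hypersphere. It is not the SMILE estimator: it omits the multi-source alignment, the knowledge-graph weak supervision, and the composite quasi-likelihood described in the abstract. It assumes a single source, a shared concentration `kappa` that is fixed rather than estimated, and a simple farthest-point initialization. All function names and the toy data are hypothetical.

```python
import numpy as np

def normalize_rows(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def farthest_point_init(x, n_clusters):
    """Deterministic init: greedily pick points least similar to those already chosen."""
    means = [x[0]]
    for _ in range(n_clusters - 1):
        max_sim = np.max(x @ np.array(means).T, axis=1)
        means.append(x[np.argmin(max_sim)])
    return np.array(means)

def fit_vmf_mixture(x, n_clusters, kappa=20.0, n_iter=50):
    """EM for a vMF mixture with a shared, fixed concentration (simplifying assumption).

    With a shared kappa the vMF normalizing constant cancels in the E-step,
    so responsibilities depend only on mixing weights and cosine similarity.
    """
    x = normalize_rows(np.asarray(x, dtype=float))
    mu = farthest_point_init(x, n_clusters)
    log_pi = np.full(n_clusters, -np.log(n_clusters))
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities via a stabilized softmax.
        logits = log_pi + kappa * (x @ mu.T)
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mean direction is the normalized responsibility-weighted sum.
        mu = normalize_rows(resp.T @ x)
        log_pi = np.log(resp.mean(axis=0) + 1e-12)
    return mu, resp.argmax(axis=1)

# Toy data (hypothetical): 15 "code embeddings" scattered around 3 concept directions.
rng = np.random.default_rng(0)
codes = np.repeat(np.eye(3), 5, axis=0) + 0.05 * rng.normal(size=(15, 3))
mu, labels = fit_vmf_mixture(codes, n_clusters=3)
```

In this toy setting, codes that receive the same label play the role of a synonym cluster, and each row of `mu` is the estimated mean direction of one latent concept. Fixing `kappa` keeps the sketch short; a fuller treatment would estimate the concentration per component.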