DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications

📅 2025-11-04

🤖 AI Summary

Problem: Learning low-rank representations from high-dimensional, heterogeneous, distributed, and privacy-sensitive binary electronic health record (EHR) data remains challenging. Method: DANIEL is a distributed, privacy-preserving framework for low-rank representation learning from large-scale binary data. It builds a non-convex surrogate loss on the Ising model under a low-rank structural assumption and minimizes it with bi-factored gradient descent, which reduces computational and communication cost relative to conventional convex approaches. Multiple institutions can fit the model collaboratively in a federated setting without sharing raw data. Results: On multi-institutional EHR data from 58,248 patients at UPMC and MGB, the learned representations show superior performance in global representation learning and in three downstream clinical tasks: relationship detection, patient phenotyping, and patient clustering.

📝 Abstract

Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.
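For orientation, the block below writes out the standard Ising model and one common way to impose low-rank structure on it. It is a sketch under stated assumptions, not the paper's exact formulation: the abstract does not give the parameterization, so the ±1 encoding, the omission of node-wise bias terms, the pseudo-likelihood form of the surrogate, and the symbols Θ, U, V, r, and η are all assumptions made here for illustration.

```latex
% Assumed setup: x in {-1,+1}^p, symmetric interaction matrix Theta with zero diagonal.
\[
  P_{\Theta}(x) \;=\; \frac{1}{Z(\Theta)}
  \exp\!\Big(\sum_{j<k} \Theta_{jk}\, x_j x_k\Big),
  \qquad x \in \{-1,+1\}^p .
\]
% Node-wise conditionals are logistic, which avoids the intractable Z(Theta):
\[
  P_{\Theta}(x_j \mid x_{-j}) \;=\;
  \sigma\!\Big(2\, x_j \sum_{k \neq j} \Theta_{jk}\, x_k\Big),
  \qquad \sigma(t) = \frac{1}{1+e^{-t}} .
\]
% One possible surrogate: the negative log pseudo-likelihood over n samples.
\[
  \mathcal{L}(\Theta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}
  \log\!\Big(1 + \exp\!\big(-2\, x_{ij} \textstyle\sum_{k \neq j} \Theta_{jk}\, x_{ik}\big)\Big).
\]
% Low-rank factorization (non-convex in U and V) and the factored gradient step:
\[
  \Theta = U V^{\top}, \quad U, V \in \mathbb{R}^{p \times r}, \; r \ll p,
  \qquad
  U \leftarrow U - \eta\, \nabla_U \mathcal{L}(U V^{\top}), \quad
  V \leftarrow V - \eta\, \nabla_V \mathcal{L}(U V^{\top}).
\]
```

Under this parameterization each iteration works with two p × r factors rather than a full p × p matrix, which is one way the factored approach can reduce computation and communication.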

Problem

Research questions and friction points this paper is trying to address.

Addressing scalability and privacy in high-dimensional heterogeneous data environments

Developing a distributed framework for global representation learning from binary data

Overcoming data-sharing constraints in multi-institutional healthcare applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed framework for scalable representation learning

Non-convex optimization via bi-factored gradient descent

Privacy-preserving learning from multi-institutional EHR data
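The three points above can be illustrated with a minimal sketch in which each site computes gradients on its own data and only those gradients are pooled. This is not the paper's algorithm: the pseudo-likelihood loss, the ±1 encoding, the factorization Theta = U @ V.T, the sample-size-weighted averaging, and the names `local_gradient` and `fit` are all assumptions chosen for illustration.

```python
import numpy as np

def local_gradient(X, U, V):
    """Run at one site. X is an (n, p) array with entries in {-1, +1}.

    Returns gradients of the site's summed negative log pseudo-likelihood
    with respect to U and V (Theta = U @ V.T, diagonal ignored), plus the
    summed loss and the sample count. Raw rows of X never leave the site.
    """
    n, p = X.shape
    Theta = U @ V.T
    np.fill_diagonal(Theta, 0.0)            # no self-interactions
    M = X @ Theta.T                         # M[i, j] = sum_k Theta[j, k] * X[i, k]
    Z = 2.0 * X * M                         # logit of observing X[i, j] given the rest
    loss = np.logaddexp(0.0, -Z).sum()      # sum of log(1 + exp(-Z)), numerically stable
    S = 0.5 * (1.0 - np.tanh(0.5 * Z))      # sigmoid(-Z), numerically stable
    W = -2.0 * X * S                        # d loss / d M
    G = W.T @ X                             # d loss / d Theta
    np.fill_diagonal(G, 0.0)
    return G @ V, G.T @ U, loss, n          # chain rule through Theta = U V^T

def fit(sites, p, rank, step=0.05, iters=200, seed=0):
    """Coordinator loop: broadcast (U, V), pool per-site gradients, take a step."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((p, rank))
    V = 0.1 * rng.standard_normal((p, rank))
    history = []
    for _ in range(iters):
        # In a real deployment each call would execute remotely at its site.
        results = [local_gradient(X, U, V) for X in sites]
        n_total = sum(r[3] for r in results)
        grad_U = sum(r[0] for r in results) / n_total
        grad_V = sum(r[1] for r in results) / n_total
        history.append(sum(r[2] for r in results) / n_total)
        U -= step * grad_U
        V -= step * grad_V
    return U, V, history
```

Calling `fit([X_site_a, X_site_b], p=X_site_a.shape[1], rank=2)` with two hypothetical per-site matrices returns the two factors and the per-iteration average loss. Each round exchanges only p × rank matrices, a scalar loss, and a sample count. Exchanging gradients alone does not amount to a formal privacy guarantee; the sketch only shows that raw records need not be pooled.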
