Distributed inference for heterogeneous mixture models using multi-site data

📅 2025-12-18
🤖 AI Summary
To address the dual challenges of privacy constraints on individual-level data and site heterogeneity in multi-center studies, this paper proposes a heterogeneous mixture model in which latent class definitions are shared across sites while mixing proportions are site-specific, fitted without sharing raw data. The method is a distributed EM algorithm that, at each iteration, builds a density-ratio-tilted surrogate Q function to approximate the Q function that pooled data would give. The theoretical analysis shows that the resulting estimator achieves the same contraction property as the estimator from EM on the pooled data. Simulations and analyses of real multi-center datasets demonstrate the method's effectiveness and robustness under heterogeneous data distributions and privacy-preserving constraints.

📝 Abstract
Mixture models postulate that the overall population is a mixture of a finite number of subpopulations with unobserved membership. Fitting mixture models usually requires large sample sizes, so combining data from multiple sites can be beneficial. However, sharing individual participant data across sites is often infeasible because of practical constraints such as data privacy concerns. Moreover, substantial heterogeneity may exist across sites, and latent classes identified locally may not be comparable across sites. We propose a unified modeling framework in which a common definition of the latent classes is shared across sites, while heterogeneous mixing proportions of the latent classes account for between-site heterogeneity. To fit the heterogeneous mixture model on multi-site data, we propose a novel distributed Expectation-Maximization (EM) algorithm in which, at each iteration, a density-ratio-tilted surrogate Q function is constructed to approximate the standard Q function of the EM algorithm as if the data from multiple sites could be pooled together. Theoretical analysis shows that our estimator achieves the same contraction property as the estimators derived from the EM algorithm based on the pooled data.
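The model and the pooled-data Q function described in the abstract can be written out as follows. This is a sketch in generic notation that the abstract does not fix: the site index $j$, observation index $i$, class index $k$, component density $f(\cdot;\theta_k)$, and mixing proportions $\pi_{jk}$ are all assumed symbols. It shows only the standard pooled EM target that the surrogate is meant to approximate; the form of the density-ratio-tilted surrogate itself is not given in the abstract and is not reproduced here.

```latex
% Site j = 1,...,J has observations x_{ji}, i = 1,...,n_j.
% Shared class parameters \theta_k, site-specific mixing proportions \pi_{jk}.
\begin{align*}
  f_j(x) &= \sum_{k=1}^{K} \pi_{jk}\, f(x;\theta_k),
  \qquad \sum_{k=1}^{K} \pi_{jk} = 1, \\[4pt]
  w_{jik}^{(t)} &=
  \frac{\pi_{jk}^{(t)}\, f\bigl(x_{ji};\theta_k^{(t)}\bigr)}
       {\sum_{l=1}^{K} \pi_{jl}^{(t)}\, f\bigl(x_{ji};\theta_l^{(t)}\bigr)}, \\[4pt]
  Q\bigl(\theta,\pi \mid \theta^{(t)},\pi^{(t)}\bigr) &=
  \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{k=1}^{K}
  w_{jik}^{(t)} \bigl\{ \log \pi_{jk} + \log f(x_{ji};\theta_k) \bigr\}.
\end{align*}
```

Because $\theta_k$ carries no site index, a class means the same thing at every site, while $\pi_{jk}$ lets the class prevalence differ from site to site.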

Problem

Research questions and friction points this paper is trying to address.

Develops a distributed EM algorithm for heterogeneous mixture models
Enables multi-site data analysis without sharing individual participant data
Addresses site heterogeneity while maintaining common latent class definitions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed EM algorithm for multi-site mixture models
Density ratio tilted surrogate Q function approximation
Shared latent class definitions with heterogeneous mixing proportions
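
The combination of shared class definitions and site-specific mixing proportions can be illustrated with a small Python sketch. It is not the paper's algorithm: it does not implement the density-ratio-tilted surrogate Q function, and instead swaps in a simpler scheme for a one-dimensional Gaussian mixture in which each site sends only per-class summary statistics to a coordinator at every iteration. The function names (`local_summaries`, `fit_shared_mixture`), the Gaussian components, the starting values, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def local_summaries(x, pi_site, means, sds):
    """E-step run inside one site; only K-length aggregates leave the site."""
    # Log density of every observation under every shared class: shape (n, K).
    log_pdf = (-0.5 * ((x[:, None] - means) / sds) ** 2
               - np.log(sds) - 0.5 * np.log(2 * np.pi))
    log_w = np.log(np.clip(pi_site, 1e-12, None)) + log_pdf
    log_w -= log_w.max(axis=1, keepdims=True)  # stabilise before exponentiating
    resp = np.exp(log_w)
    resp /= resp.sum(axis=1, keepdims=True)    # posterior class probabilities
    # Weighted count, weighted sum, and weighted sum of squares per class.
    return resp.sum(axis=0), resp.T @ x, resp.T @ x**2

def fit_shared_mixture(site_data, init_means, n_iter=200):
    """Coordinator loop: shared means/sds, one mixing vector per site."""
    means = np.asarray(init_means, dtype=float)
    K = means.size
    sds = np.ones(K)
    pis = [np.full(K, 1.0 / K) for _ in site_data]
    for _ in range(n_iter):
        stats = [local_summaries(x, p, means, sds)
                 for x, p in zip(site_data, pis)]
        n_k = sum(s[0] for s in stats)   # pooled weighted counts
        s1 = sum(s[1] for s in stats)    # pooled weighted sums
        s2 = sum(s[2] for s in stats)    # pooled weighted sums of squares
        means = s1 / n_k                 # shared class means
        sds = np.sqrt(np.maximum(s2 / n_k - means**2, 1e-8))
        pis = [s[0] / s[0].sum() for s in stats]  # site-specific proportions
    return means, sds, pis

# Hypothetical two-site example: same classes, different class prevalence.
rng = np.random.default_rng(0)

def simulate(n, pi):
    z = rng.choice(2, size=n, p=pi)
    return rng.normal(np.array([-2.0, 3.0])[z], 1.0)

site_data = [simulate(2000, [0.8, 0.2]), simulate(2000, [0.3, 0.7])]
means, sds, pis = fit_shared_mixture(site_data, init_means=[-1.0, 1.0])
print("shared means:", means)
print("site mixing proportions:", pis)
```

In this sketch the shared means and standard deviations are updated from statistics summed over all sites, so every site uses the same class definitions, while each mixing vector is updated from that site's own weighted counts.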

Authors

Xiaokang Liu
Department of Statistics and Data Science, University of Missouri

Rui Duan
Harvard University
Biostatistics, Bioinformatics, Epidemiology, Electronic Health Record

Raymond J. Carroll
Texas A&M University
Statistics, Epidemiology

Yang Ning
Cornell University

Yong Chen
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania