Efficient Semiparametric Inference for Distributed Data with Blockwise Missingness

📅 2025-08-23
🤖 AI Summary
Problem: In distributed data settings with blockwise missingness, where certain sites completely lack entire blocks of variables and sharing individual-level data is prohibited, the goal is to improve the efficiency of inference at an internal (target) site without compromising privacy. Method: The paper proposes a class of communication-efficient augmented one-step estimators that use data-driven “transfer functions” to incorporate summary-level statistics from external sites. Only a single round of summary-statistic communication is required. Contribution/Results: The augmented estimator satisfies a do-no-harm property (it is no less efficient than the estimator based solely on internal data), achieves the semiparametric efficiency bound when the transfer function is appropriately estimated from data, and remains asymptotically normal even when the number of external sites and the data dimension grow exponentially with the internal sample size. Simulation studies confirm both the statistical efficiency and the computational feasibility of the method in distributed settings.

📝 Abstract
We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites and sharing individual-level data is not allowed. To improve efficiency of the internal study, we propose a class of augmented one-step estimators that incorporate information from external sites through “transfer functions.” The proposed approach has several advantages. First, it is communication-efficient, requiring only one-round communication of summary-level statistics. Second, it satisfies a do-no-harm property in the sense that the augmented estimator is no less efficient than the original one based solely on the internal data. Third, it is statistically optimal, achieving the semiparametric efficiency bound when the transfer function is appropriately estimated from data. Finally, it is scalable, remaining asymptotically normal even when the number of external sites and the data dimension grow exponentially with the internal sample size. Simulation studies confirm both the statistical efficiency and computational feasibility of our method in distributed settings.
Problem

Research questions and friction points this paper is trying to address.

Inference for semiparametric models with blockwise missing data
Improving efficiency using external information without sharing data
Achieving communication efficiency and statistical optimality in distributed settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmented one-step estimators with transfer functions
Communication-efficient one-round summary statistics exchange
Attainment of the semiparametric efficiency bound, with scalability in the number of external sites and the data dimension
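
To make the do-no-harm idea concrete, here is a minimal toy sketch in Python. It is not the paper's estimator: it uses a classical control-variate adjustment for a simple mean, under the assumption that the internal site observes (X, Y), the external site observes only X (Y is the missing block), and both sites share the same distribution of X. The function names, the sample sizes, and the linear weight `gamma` (standing in, very loosely, for the role of a transfer function) are all hypothetical. Only the external mean of X and the external sample size cross the site boundary, in one round.

```python
import numpy as np

def local_and_augmented(x_int, y_int, x_ext_mean, m_ext):
    """Return (local estimate, augmented estimate) of E[Y].

    Hypothetical toy setup: only x_ext_mean and m_ext are shared by the
    external site; no individual-level external data is used.
    """
    n = len(y_int)
    theta_local = y_int.mean()
    # Contrast with mean zero when both sites share the distribution of X.
    delta = x_int.mean() - x_ext_mean
    # Variance-minimising weight: Cov(theta_local, delta) / Var(delta),
    # estimated from internal data only.
    s_xy = np.cov(x_int, y_int, ddof=1)[0, 1]
    s_xx = x_int.var(ddof=1)
    gamma = (s_xy / n) / (s_xx * (1.0 / n + 1.0 / m_ext))
    return theta_local, theta_local - gamma * delta

def simulate(reps=2000, n=200, m=2000, rho=0.8, seed=0):
    """Monte Carlo variances of the local and augmented estimators."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    out = np.empty((reps, 2))
    for r in range(reps):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        x_ext = rng.normal(0.0, 1.0, size=m)  # external site never observes Y
        out[r] = local_and_augmented(xy[:, 0], xy[:, 1], x_ext.mean(), m)
    return out.var(axis=0, ddof=1)

var_local, var_aug = simulate()
print(f"local variance: {var_local:.5f}, augmented variance: {var_aug:.5f}")
```

In this toy, the augmented variance equals the local variance minus Cov²/Var(delta), so with the optimal weight it cannot exceed the local variance; when X and Y are uncorrelated, the weight shrinks to zero and the estimator falls back to the local one. The paper's framework addresses general regular semiparametric models and many external sites, which this sketch does not attempt to reproduce.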
Jingyue Huang
Postdoc in Biostatistics, University of Pennsylvania
Huiyuan Wang
Postdoc of Biostatistics, University of Pennsylvania
Causal inference, machine learning
Yuqing Lei
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Yong Chen
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA