🤖 AI Summary
This study addresses the challenge of performing effective statistical inference on vertically partitioned health data across multiple institutions without sharing individual-level private information. Through a scoping review combining searches of four interdisciplinary databases, systematic extraction of method characteristics, and citation tracking, 30 relevant studies were identified and evaluated. The findings reveal that existing approaches predominantly focus on linear and logistic regression models, yet often do not address equivalence to centralized analyses, require multiple rounds of communication between parties, and rarely provide privacy assessments. This work provides a systematic synthesis of methods in this domain, highlighting gaps concerning analytical equivalence, communication efficiency, and protection of individual-level data, and thereby offers a reference point for future research.
📝 Abstract
To address the multidimensional nature of health-related questions, advances in health research often require integrating information from various data sources within statistical analyses. When complementary information pertaining to the same set of individuals is distributed across different institutions, vertical methods make it possible to obtain analysis results without sharing or pooling individual-level data. To guide stakeholders toward a transparent use of vertical methods, this study aims to (1) identify existing vertical methods enabling statistical inference; and (2) characterize the methodological properties of these methods and the current extent of their use with health data. We conducted a scoping review using four interdisciplinary databases. We then systematically extracted the characteristics of identified vertical methods with respect to comparability with the pooled analysis, efficiency of communication schemes, and confidentiality. We additionally screened studies that cited included articles to identify applications on vertically partitioned real-world health data. Among 2887 articles initially screened, 30 were included in the review. Inference for the linear and logistic regression frameworks was the most frequent statistical inference task addressed by the proposed methods. Equivalence with the pooled analyses was not systematically addressed, and most methods required multiple communications between participating parties. Almost all articles described their approach as privacy-preserving, although only a minority provided privacy assessments. The scope of existing approaches enabling statistical inference for vertically partitioned data is still relatively limited. Most existing methods do not concurrently achieve results equivalent to centralized analyses, high communication efficiency, and guaranteed protection of individual-level data.