🤖 AI Summary
To address insufficient utilization of source-study information when predicting outcomes across heterogeneous clinical sites, this paper proposes a targeted learning framework based on probabilistic subpopulation matching. Conventional study-level matching drops entire source studies that differ from the target, which ignores within-study heterogeneity and wastes samples. The proposed approach instead separates within- and between-study heterogeneity at the subpopulation level, so that information from every source study can be used without discarding samples. Technically, a finite mixture model is fitted jointly across all studies to give each subject a probabilistic subpopulation membership, information is then transferred within each identified subpopulation from the source studies to the target study, and non-asymptotic properties of the resulting estimator are established. Simulation studies demonstrate that the method can improve prediction at the target study.
📝 Abstract
In biomedical research, leveraging information from multiple similar source studies has proven useful for obtaining more accurate predictions in a target study. However, in many biomedical applications based on real-world data, the populations in different studies (e.g., clinical sites) can be heterogeneous, which makes it challenging to borrow information properly for the target study. State-of-the-art methods typically rely on study-level matching to identify source studies that are similar to the target study; samples from source studies that differ significantly from the target study are all dropped at the study level, which can lead to substantial loss of information. We consider a general situation in which all studies are sampled from a super-population composed of distinct subpopulations, and we propose a novel framework of targeted learning via subpopulation matching. In contrast to existing study-level matching methods, measuring similarities between subpopulations can effectively decompose both within- and between-study heterogeneity, allowing information from all source studies to be incorporated without dropping any samples. We formulate the proposed framework as a two-step procedure: a finite mixture model is first fitted jointly across all studies to provide subject-wise probabilistic subpopulation information, and information is then transferred within each identified subpopulation from the source studies to the target study. We establish the non-asymptotic properties of our estimator and demonstrate, via simulation studies, the ability of our method to improve prediction at the target study.
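The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' estimator. It rests on several assumptions that do not come from the paper: a spherical Gaussian mixture on the covariates only, a linear outcome model per subpopulation, responsibility-weighted least squares as the "transfer" step, and simulated data with two subpopulations. All function names (`responsibilities`, `fit_spherical_gmm`, `weighted_ls`, `simulate_study`) are hypothetical.

```python
import numpy as np

def responsibilities(X, weights, means, variances):
    """Posterior subpopulation probabilities under a spherical Gaussian mixture."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    log_p = np.log(weights) - 0.5 * d * np.log(2 * np.pi * variances) - 0.5 * sq / variances
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilise before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def fit_spherical_gmm(X, k, n_iter=50):
    """Plain EM; farthest-point initialisation (simple, but outlier-sensitive)."""
    d = X.shape[1]
    means = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(X[np.argmax(d2)])
    means = np.array(means)
    weights = np.full(k, 1.0 / k)
    variances = np.full(k, X.var())
    for _ in range(n_iter):
        resp = responsibilities(X, weights, means, variances)   # E-step
        nk = resp.sum(axis=0) + 1e-12                           # M-step
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq).sum(axis=0) / (d * nk) + 1e-8
    return weights, means, variances

def weighted_ls(X, y, w):
    """Weighted least squares with an intercept; w are per-subject weights."""
    Xd = np.column_stack([np.ones(len(X)), X])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
    return beta

# --- Simulated data (assumed setup, for illustration only) ---
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [3.0, 3.0]])                 # subpopulation covariate means
betas_true = np.array([[1.0, 2.0, -1.0], [-1.0, -2.0, 1.0]])   # subpopulation outcome models

def simulate_study(n, p1):
    """Draw a study whose subjects belong to subpopulation 1 with probability p1."""
    z = (rng.random(n) < p1).astype(int)
    X = centers[z] + rng.normal(size=(n, 2))
    Xd = np.column_stack([np.ones(n), X])
    y = (Xd * betas_true[z]).sum(axis=1) + 0.5 * rng.normal(size=n)
    return X, y

# Two source studies with very different mixes, plus a small target study.
studies = [simulate_study(300, 0.1), simulate_study(300, 0.9), simulate_study(40, 0.5)]
X_all = np.vstack([s[0] for s in studies])
y_all = np.concatenate([s[1] for s in studies])
X_test, y_test = simulate_study(500, 0.5)  # fresh subjects from the target population

# Step 1: fit the mixture jointly across all studies, get subject-wise probabilities.
weights, means, variances = fit_spherical_gmm(X_all, k=2)
resp_all = responsibilities(X_all, weights, means, variances)

# Step 2: within each subpopulation, borrow every study's subjects, weighted by membership.
coefs = np.array([weighted_ls(X_all, y_all, resp_all[:, j]) for j in range(2)])

# Predict for target subjects by mixing the subpopulation-specific predictions.
resp_test = responsibilities(X_test, weights, means, variances)
Xd_test = np.column_stack([np.ones(len(X_test)), X_test])
pred = (resp_test * (Xd_test @ coefs.T)).sum(axis=1)
mse_subpop = np.mean((y_test - pred) ** 2)

# Baseline: one pooled regression that ignores subpopulation structure.
beta_pooled = weighted_ls(X_all, y_all, np.ones(len(y_all)))
mse_pooled = np.mean((y_test - Xd_test @ beta_pooled) ** 2)
print("subpopulation-matched MSE:", mse_subpop, "pooled MSE:", mse_pooled)
```

In this toy setup no source study is discarded: each subject contributes to a subpopulation's fit in proportion to its estimated membership probability, which is the idea the abstract contrasts with study-level matching. The paper's actual mixture specification and transfer step may differ from this sketch.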