Targeted learning via probabilistic subpopulation matching

📅 2025-12-25

📈 Citations: 0

✨ Influential: 0

career value

245K/year

🤖 AI Summary

To address insufficient utilization of source-study information in predicting outcomes across heterogeneous clinical sites, this paper proposes a targeted learning framework based on probabilistic subgroup matching. Unlike conventional study-level matching—which discards within-study heterogeneity and wastes samples—our approach disentangles intra- and inter-group heterogeneity at the subgroup level, enabling fine-grained, lossless knowledge transfer. Technically, we model the multi-source joint distribution via a finite mixture model, design a subgroup-adaptive weighting scheme, and establish non-asymptotic theoretical guarantees for the resulting estimator. Extensive simulations and real-world clinical validations demonstrate that our method significantly improves prediction accuracy at target sites compared to existing study-level matching approaches. It further offers statistical interpretability, robustness to distributional shifts, and rigorous theoretical foundations.

Technology Category

Application Category

📝 Abstract

In biomedical research, to obtain more accurate prediction results from a target study, leveraging information from multiple similar source studies is proved to be useful. However, in many biomedical applications based on real-world data, populations under consideration in different studies, e.g., clinical sites, can be heterogeneous, leading to challenges in properly borrowing information towards the target study. The state of art methods are typically based on study-level matching to identify source studies that are similar to the target study, whilst samples from source studies that significantly differ from the target study will all be dropped at the study level, which can lead to substantial loss of information. We consider a general situation where all studies are sampled from a super-population composed of distinct subpopulations, and propose a novel framework of targeted learning via subpopulation matching. In contrast to the existing study-level matching methods, measuring similarities between subpopulations can effectively decompose both within- and between-study heterogeneity, allowing incorporation of information from all source studies without dropping any samples as in the existing methods. We devise the proposed framework as a two-step procedure, where a finite mixture model is first fitted jointly across all studies to provide subject-wise probabilistic subpopulation information, followed by a step of within-subpopulation information transferring from source studies to the target study for each identified subpopulation. We establish the non-asymptotic properties of our estimator and demonstrate the ability of our method to improve prediction at the target study via simulation studies.

Problem

Research questions and friction points this paper is trying to address.

Addresses heterogeneity in biomedical studies to improve prediction accuracy.

Proposes subpopulation matching to avoid information loss in source studies.

Transfers information within subpopulations for targeted learning from multiple sources.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic subpopulation matching for targeted learning

Two-step mixture model for cross-study heterogeneity decomposition

Within-subpopulation information transfer without sample dropping

🔎 Similar Papers

Sample Selection Bias in Machine Learning for Healthcare