Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift

📅 2025-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Automated computation of phenotypes (ACP) exhibits systematic bias relative to gold-standard phenotypes, while manual annotation is prohibitively costly. Method: Under the covariate shift assumption, we develop a semi-supervised inference framework and propose a doubly robust, semiparametrically efficient estimator for phenotype estimation. Contribution/Results: We provide the first theoretical proof that efficiency gains arise primarily from ACP on unlabeled data—not labeled data—and establish a verifiable efficiency-bound analysis paradigm for ACP fusion. Evaluated on synthetic and multiple real-world healthcare datasets, our method reduces average variance by 32%–57% over baseline approaches, significantly improving statistical accuracy and robustness in risk prediction and treatment effect estimation.

Technology Category

Application Category

📝 Abstract

Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach. hfill{ exttt{Code}: href{https://github.com/brucejunjin/ICML2025-ACPCS}{faGithub}}

Problem

Research questions and friction points this paper is trying to address.

Efficient inference using automated computational phenotypes under covariate shift

Reducing bias when replacing gold-standard data with automated phenotypes

Enhancing analysis validity in semi-supervised learning with covariate shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised learning with covariate shift

Doubly robust estimators for ACPs

Efficiency gains from unlabeled ACPs

🔎 Similar Papers

No similar papers found.

Authors to Follow