๐ค AI Summary
Automated computation of phenotypes (ACP) exhibits systematic bias relative to gold-standard phenotypes, while manual annotation is prohibitively costly. Method: Under the covariate shift assumption, we develop a semi-supervised inference framework and propose a doubly robust, semiparametrically efficient estimator for phenotype estimation. Contribution/Results: We provide the first theoretical proof that efficiency gains arise primarily from ACP on unlabeled dataโnot labeled dataโand establish a verifiable efficiency-bound analysis paradigm for ACP fusion. Evaluated on synthetic and multiple real-world healthcare datasets, our method reduces average variance by 32%โ57% over baseline approaches, significantly improving statistical accuracy and robustness in risk prediction and treatment effect estimation.
๐ Abstract
Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach. hfill{ exttt{Code}: href{https://github.com/brucejunjin/ICML2025-ACPCS}{faGithub}}