Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift

๐Ÿ“… 2025-05-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Automated computation of phenotypes (ACP) exhibits systematic bias relative to gold-standard phenotypes, while manual annotation is prohibitively costly. Method: Under the covariate shift assumption, we develop a semi-supervised inference framework and propose a doubly robust, semiparametrically efficient estimator for phenotype estimation. Contribution/Results: We provide the first theoretical proof that efficiency gains arise primarily from ACP on unlabeled dataโ€”not labeled dataโ€”and establish a verifiable efficiency-bound analysis paradigm for ACP fusion. Evaluated on synthetic and multiple real-world healthcare datasets, our method reduces average variance by 32%โ€“57% over baseline approaches, significantly improving statistical accuracy and robustness in risk prediction and treatment effect estimation.

Technology Category

Application Category

๐Ÿ“ Abstract
Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach. hfill{ exttt{Code}: href{https://github.com/brucejunjin/ICML2025-ACPCS}{faGithub}}
Problem

Research questions and friction points this paper is trying to address.

Efficient inference using automated computational phenotypes under covariate shift
Reducing bias when replacing gold-standard data with automated phenotypes
Enhancing analysis validity in semi-supervised learning with covariate shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised learning with covariate shift
Doubly robust estimators for ACPs
Efficiency gains from unlabeled ACPs
๐Ÿ”Ž Similar Papers
No similar papers found.
C
Chao Ying
Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA; Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
J
Jun Jin
Henry Ford Health, Detroit, Michigan, USA
Y
Yi Guo
Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
Xiudi Li
Xiudi Li
Division of Biostatistics, University of California Berkeley
Muxuan Liang
Muxuan Liang
MD Anderson Cancer Center
Precision MedicineMachine LearningBiostatistics
Jiwei Zhao
Jiwei Zhao
University of Wisconsin-Madison
StatisticsMachine LearningData ScienceBiostatisticsBiomedical Data Science