Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

📅 2025-12-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
When pretrained models impute missing data, even highly accurate predictions induce bias in causal inference by ignoring prediction uncertainty, thereby invalidating standard statistical inference. Method: We develop a unified framework attributing inference failure to the dual effects—bias and variance—introduced by prediction; conduct error propagation analysis to explicitly model how prediction uncertainty propagates to final causal estimates; and reestablish theoretical links between individual participant data (IPD) analysis and classical statistical theory. Contribution/Results: We demonstrate that direct imputation distorts standard errors and confidence intervals. Our framework yields statistically principled, actionable guidelines for using predicted values in causal estimation, enabling proper uncertainty quantification. Empirical validation confirms substantial improvements in the reliability of causal inference based on machine learning–preprocessed data, bridging the gap between modern predictive modeling and rigorous causal inference.

Technology Category

Application Category

📝 Abstract
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. We then comment on some open questions and interesting avenues for future work in this area, and end with some comments on how to use predicted data in scientific studies that is both transparent and statistically principled.
Problem

Research questions and friction points this paper is trying to address.

Addresses statistical challenges of using predicted data for inference
Identifies bias and variance issues in predicted data analysis
Reviews methods for transparent and principled use of predicted data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using predicted data as substitutes for missing observations
Characterizing inference failures via bias and variance issues
Reviewing methods rooted in classical statistical theory
S
Stephen Salerno
Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA
K
Kentaro Hoffman
Department of Statistics, University of Washington, Seattle, WA
A
Awan Afiaz
Department of Biostatistics, University of Washington; Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA
A
Anna Neufeld
Department of Mathematics and Statistics, Williams College, Williamstown, MA
Tyler H. McCormick
Tyler H. McCormick
University of Washington
statisticsdata scienceBayesian modelingsocial networksglobal health
J
Jeffrey T. Leek
Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA