🤖 AI Summary
This paper addresses the problem of invalid confidence intervals in statistical analyses that use machine learning predictions as input data, particularly when the complete (gold-standard) data come from a non-uniform sampling design (weighted, stratified, or clustered) or when an arbitrary subset of features is imputed. Building on the Predict-Then-Debias estimator, the authors introduce bootstrap confidence intervals tailored to these settings, typically requiring no additional derivations beyond resampling. Theoretically, the intervals are asymptotically valid under no assumptions on the quality of the prediction model; they are also no wider than the intervals produced by methods that discard the machine learning predictions. The core contribution is a unified, plug-and-play treatment of prediction error and complex sampling, enabling valid and efficient inference in realistic data settings.
📝 Abstract
Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming access to a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample, and in settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
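To make the idea concrete, here is a minimal sketch of a Predict-Then-Debias style estimate with a bootstrap confidence interval, specialized to estimating a population mean. The function names, the mean-estimation target, and the independent-resampling scheme are illustrative assumptions for this sketch; the paper covers general estimands and the weighted, stratified, and clustered designs discussed above.

```python
import numpy as np

def predict_then_debias_mean(preds_all, preds_labeled, y_labeled, w_labeled=None):
    """Predict-Then-Debias point estimate of a population mean (illustrative).

    Averages predictions over the full sample, then subtracts the prediction
    bias estimated on a (possibly weighted) small complete subsample.
    """
    if w_labeled is None:
        w_labeled = np.ones_like(np.asarray(y_labeled, dtype=float))
    w = w_labeled / w_labeled.sum()
    bias = np.sum(w * (preds_labeled - y_labeled))  # estimated prediction bias
    return preds_all.mean() - bias

def bootstrap_ci(preds_all, preds_labeled, y_labeled, w_labeled=None,
                 B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the two samples independently."""
    rng = np.random.default_rng(seed)
    n, m = len(preds_all), len(y_labeled)
    if w_labeled is None:
        w_labeled = np.ones(m)
    stats = np.empty(B)
    for b in range(B):
        i = rng.integers(0, n, n)  # resample the large predicted-only sample
        j = rng.integers(0, m, m)  # resample the small complete sample
        stats[b] = predict_then_debias_mean(
            preds_all[i], preds_labeled[j], y_labeled[j], w_labeled[j])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

Because the debiasing term cancels any systematic error in the predictions, the point estimate stays consistent even when the model is badly biased, which is the intuition behind the paper's "no assumptions on prediction quality" guarantee.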