🤖 AI Summary
This paper addresses the problem of invalid confidence intervals in statistical analyses that use machine learning predictions as input data, particularly when the complete (gold-standard) data come from a non-uniform sampling design (weighted, stratified, or clustered) or when an arbitrary subset of features is imputed. Building on the Predict-Then-Debias estimator, the authors introduce bootstrap confidence intervals tailored to these settings, typically requiring no additional derivations beyond resampling. Theoretically, the intervals are asymptotically valid under no assumptions on the quality of the prediction model; they are also no wider than the intervals produced by methods that discard the machine learning predictions. The core contribution is a unified, plug-and-play treatment of prediction error and complex sampling, enabling valid and efficient inference in realistic data settings.
📝 Abstract
Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming access to a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample, and in settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
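To make the idea concrete, here is a minimal sketch of a Predict-Then-Debias style estimate with a bootstrap confidence interval, specialized to estimating a population mean. The function names, the mean-estimation target, and the independent-resampling scheme are illustrative assumptions for this sketch; the paper covers general estimands and the weighted, stratified, and clustered designs discussed above.

```python
import numpy as np

def predict_then_debias_mean(preds_all, preds_labeled, y_labeled, w_labeled=None):
    """Predict-Then-Debias point estimate of a population mean (illustrative).

    Averages predictions over the full sample, then subtracts the prediction
    bias estimated on a (possibly weighted) small complete subsample.
    """
    if w_labeled is None:
        w_labeled = np.ones_like(np.asarray(y_labeled, dtype=float))
    w = w_labeled / w_labeled.sum()
    bias = np.sum(w * (preds_labeled - y_labeled))  # estimated prediction bias
    return preds_all.mean() - bias

def bootstrap_ci(preds_all, preds_labeled, y_labeled, w_labeled=None,
                 B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the two samples independently."""
    rng = np.random.default_rng(seed)
    n, m = len(preds_all), len(y_labeled)
    if w_labeled is None:
        w_labeled = np.ones(m)
    stats = np.empty(B)
    for b in range(B):
        i = rng.integers(0, n, n)  # resample the large predicted-only sample
        j = rng.integers(0, m, m)  # resample the small complete sample
        stats[b] = predict_then_debias_mean(
            preds_all[i], preds_labeled[j], y_labeled[j], w_labeled[j])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

Because the debiasing term cancels any systematic error in the predictions, the point estimate stays consistent even when the model is badly biased, which is the intuition behind the paper's "no assumptions on prediction quality" guarantee.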