Predictions as Surrogates: Revisiting Surrogate Outcomes in the Age of AI

📅 2025-01-16
🤖 AI Summary
Traditional statistical inference in biostatistics and economics is often limited by the high cost of collecting ground-truth outcomes. To address this, the paper proposes a recalibrated prediction-powered inference (PPI) framework: it treats predictions from pretrained AI models as low-cost surrogate outcomes and introduces a novel "recalibration" step that learns an optimal imputed loss function. Theoretical guarantees show that the resulting estimator always improves on the estimator that relies solely on ground-truth data and, when the recalibration is estimated consistently, achieves the smallest asymptotic variance among PPI estimators. The method combines surrogate outcome modeling, convex optimization, and flexible machine learning, such as neural networks and tree-based models. Across three real-world applications, the approach substantially increases effective sample size and outperforms existing PPI methods in both accuracy and stability, offering a statistically rigorous and computationally efficient paradigm for inference in high-cost experimental settings.

📝 Abstract
We establish a formal connection between the decades-old surrogate outcome model in biostatistics and economics and the emerging field of prediction-powered inference (PPI). The connection treats predictions from pre-trained models, prevalent in the age of AI, as cost-effective surrogates for expensive outcomes. Building on the surrogate outcomes literature, we develop recalibrated prediction-powered inference, a more efficient approach to statistical inference than existing PPI proposals. Our method departs from the existing proposals by using flexible machine learning techniques to learn the optimal "imputed loss" through a step we call recalibration. Importantly, the method always improves upon the estimator that relies solely on the data with available true outcomes, even when the optimal imputed loss is estimated imperfectly, and it achieves the smallest asymptotic variance among PPI estimators if the estimate is consistent. Computationally, our optimization objective is convex whenever the loss function that defines the target parameter is convex. We further analyze the benefits of recalibration, both theoretically and numerically, in several common scenarios where machine learning predictions systematically deviate from the outcome of interest. We demonstrate significant gains in effective sample size over existing PPI proposals via three applications leveraging state-of-the-art machine learning/AI models.
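To make the idea concrete, here is a minimal sketch of recalibrated prediction-powered inference for estimating a population mean. This is an illustration on synthetic data, not the authors' implementation: the function name `recalibrated_ppi_mean` and the use of a simple linear fit as the recalibration step are assumptions; the paper's method learns the recalibration with flexible machine learning.

```python
# Hedged sketch: recalibrated PPI for a population mean on synthetic data.
# The recalibration map here is a simple linear fit; the paper uses flexible
# ML (e.g., neural networks or tree-based models) in its place.
import numpy as np

def recalibrated_ppi_mean(y_lab, pred_lab, pred_unlab):
    """Estimate E[Y] from labeled (y, prediction) pairs plus unlabeled predictions.

    Recalibration step: learn g(p) = a*p + b on the labeled split so that
    g(prediction) tracks the true outcome, then apply the PPI combination.
    """
    a, b = np.polyfit(pred_lab, y_lab, deg=1)       # learn recalibration g
    g_lab = a * pred_lab + b
    g_unlab = a * pred_unlab + b
    # PPI estimator: cheap surrogate mean plus a bias correction
    # computed on the small labeled sample.
    return g_unlab.mean() + (y_lab - g_lab).mean()

rng = np.random.default_rng(0)
truth = 2.0
y_unlab_hidden = rng.normal(truth, 1.0, size=10_000)  # outcomes we never see
y_lab = rng.normal(truth, 1.0, size=200)              # small labeled sample
# Predictions are systematically biased and rescaled surrogates of the outcome,
# mimicking an AI model whose outputs deviate from the target in a learnable way.
pred_lab = 0.5 * y_lab + 1.0
pred_unlab = 0.5 * y_unlab_hidden + 1.0

est = recalibrated_ppi_mean(y_lab, pred_lab, pred_unlab)
```

Because the surrogate's bias is linear here, the recalibration map removes it exactly and the estimator effectively uses all 10,200 observations rather than the 200 labeled ones, illustrating the gain in effective sample size the abstract describes.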
Problem

Research questions and friction points this paper is trying to address.

Efficient Data Utilization
Cost-effective Methods
AI-assisted Predictive Analytics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prediction-Powered Inference (PPI)
Machine Learning
Recalibration Technique