Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of limited labeled data in AI evaluation and social science settings, where multiple related tasks share underlying structure yet are typically analyzed independently. The paper introduces a multi-task Prediction-Powered Inference (PPI) framework that incorporates cross-task structure into PPI. By modeling the shared proxy–outcome relationship across tasks through a cross-task recalibration mechanism, the method improves statistical efficiency while preserving within-task rectification, power tuning, and validity. Theoretical analysis shows that efficiency gains over power-tuned single-task PPI are possible only when the proxy–outcome relationship is nonlinear; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. Experiments on synthetic and semi-synthetic data, together with a case study auditing language models on election-related information during the 2024 U.S. presidential election, show that the method can substantially reduce confidence interval widths when labels are scarce.

📝 Abstract

Many applications require statistically valid inference across many related tasks, while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited ground-truth labels, but commonly used methods treat tasks independently and therefore fail to exploit shared structure across related tasks. This limitation is especially important in settings where only a small number of labels are available per task. To address this issue, we introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. We complement our theoretical findings with experiments on synthetic and semi-synthetic datasets, as well as a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, we show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.
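
The single-task building block described in the abstract (a labeled mean, a rectification term built from the proxy, and a power-tuning weight) can be sketched for mean estimation as follows. This is an illustrative sketch and not the paper's implementation: the function name `ppi_mean`, the simulated data, and the fixed recalibration map `f**2` are all assumptions made for the example. In particular, the paper's recalibration is learned from labeled data in related tasks, whereas here the nonlinear map is simply assumed to be known. The sketch illustrates two points from the abstract: a fixed affine transform of the proxy leaves the power-tuned estimate unchanged, while a nonlinear recalibration that tracks the outcome can narrow the interval.

```python
import numpy as np


def ppi_mean(y_lab, f_lab, f_unlab, z=1.96):
    """Power-tuned PPI estimate of E[Y] with a normal-approximation CI.

    y_lab   : ground-truth labels on the small labeled set (size n)
    f_lab   : proxy values on the same labeled set
    f_unlab : proxy values on the large unlabeled set (size N)
    """
    y_lab = np.asarray(y_lab, dtype=float)
    f_lab = np.asarray(f_lab, dtype=float)
    f_unlab = np.asarray(f_unlab, dtype=float)
    n, N = len(y_lab), len(f_unlab)

    # Power-tuning weight: lam = Cov(Y, f) / ((1 + n/N) * Var(f)).
    var_f = np.var(np.concatenate([f_lab, f_unlab]), ddof=1)
    cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
    lam = cov_yf / ((1.0 + n / N) * var_f)

    # Labeled mean plus a rectification term driven by the proxy.
    theta = y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean())

    # Plug-in standard error of the estimator at the chosen lam.
    se = np.sqrt(
        np.var(y_lab - lam * f_lab, ddof=1) / n
        + lam**2 * np.var(f_unlab, ddof=1) / N
    )
    return theta, (theta - z * se, theta + z * se)


def width(ci):
    """Length of a (lower, upper) interval."""
    return ci[1] - ci[0]


# Hypothetical single task: the outcome depends on the proxy nonlinearly.
rng = np.random.default_rng(0)
n, N = 50, 5000
f_lab = rng.normal(size=n)
f_unlab = rng.normal(size=N)
y_lab = f_lab**2 + rng.normal(scale=0.3, size=n)

# (a) Raw proxy.
theta_raw, ci_raw = ppi_mean(y_lab, f_lab, f_unlab)

# (b) Fixed affine recalibration a*f + b: lam rescales by 1/a, so the
#     estimate and interval match (a) up to floating-point error.
theta_affine, ci_affine = ppi_mean(y_lab, 3.0 * f_lab - 1.0, 3.0 * f_unlab - 1.0)

# (c) Nonlinear recalibration (assumed known here, not learned).
theta_sq, ci_sq = ppi_mean(y_lab, f_lab**2, f_unlab**2)

print("raw proxy       :", theta_raw, width(ci_raw))
print("affine recalib. :", theta_affine, width(ci_affine))
print("squared recalib.:", theta_sq, width(ci_sq))
```

In this toy setup the raw proxy is nearly uncorrelated with the outcome, so power tuning falls back close to the labeled-only estimate; the affine variant reproduces that result, and only the nonlinear map recovers usable signal from the unlabeled proxies.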

Problem

Research questions and friction points this paper is trying to address.

multi-task inference

prediction-powered inference

proxy measurements

statistical validity

shared structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task prediction-powered inference

cross-task recalibration

proxy-ground-truth relationship