🤖 AI Summary
When AI models are evaluated in settings where an intervention affects the outcome of interest, the conventional approach of using only control-group data from a randomized controlled trial (RCT) is unbiased but inefficient, while naively pooling treatment-group data introduces bias. To address this, the authors propose nuisance parameter weighting (NPW), which reweights treatment-group data so that it can also be used for unbiased model evaluation. The paper theoretically quantifies the bias of naive aggregate performance estimators and derives the condition under which that bias leads to incorrect model selection, then builds on these insights to construct the reweighting scheme, making use of all trial data rather than only the control arm. Experiments on synthetic and real-world datasets show that NPW consistently yields better model selection than control-only baselines across a range of intervention effect sizes and sample sizes, improving both selection accuracy and statistical efficiency, and offering a practical path toward more reliable model evaluation under interventions.
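As a rough sketch of the kind of bias condition involved (the notation and the simple pooling form below are our own illustration, not necessarily the paper's exact derivation):

```latex
% Assumed notation: p = treatment fraction, \epsilon_C / \epsilon_T = a model's true error on
% control / treatment samples; the evaluation target is \epsilon_C (error under no intervention).
% Naively pooling per-arm estimates gives
\hat{\epsilon}_{\mathrm{pool}} = p\,\hat{\epsilon}_T + (1-p)\,\hat{\epsilon}_C,
\qquad \mathbb{E}[\hat{\epsilon}_{\mathrm{pool}}] - \epsilon_C = p\,(\epsilon_T - \epsilon_C).
% Incorrect selection: even if model A is truly better under no intervention
% (\epsilon_C^A < \epsilon_C^B), the pooled criterion prefers B whenever
p\,\big[(\epsilon_T^A - \epsilon_C^A) - (\epsilon_T^B - \epsilon_C^B)\big] > \epsilon_C^B - \epsilon_C^A.
```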
📝 Abstract
AI models are often evaluated based on their ability to predict the outcome of interest. However, in many AI for social impact applications, the presence of an intervention that affects the outcome can bias the evaluation. Randomized controlled trials (RCTs) randomly assign interventions, allowing data from the control group to be used for unbiased model evaluation. However, this approach is inefficient because it ignores data from the treatment group. Given the complexity and cost often associated with RCTs, making the most of the data is essential. Thus, we investigate model evaluation strategies that leverage all data from an RCT. First, we theoretically quantify the estimation bias that arises from naïvely aggregating performance estimates from the treatment and control groups, and derive the condition under which this bias leads to incorrect model selection. Leveraging these theoretical insights, we propose nuisance parameter weighting (NPW), an unbiased model evaluation approach that reweights data from the treatment group to mimic the distributions of samples that would or would not experience the outcome under no intervention. Using synthetic and real-world datasets, we demonstrate that our proposed evaluation approach consistently yields better model selection than the standard approach, which ignores data from the treatment group, across various intervention effect sizes and sample sizes. Our contribution represents a meaningful step towards more efficient model evaluation in real-world contexts.
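To make the reweighting idea concrete, the following is a minimal, hypothetical sketch of the overall workflow: a nuisance model fit on control data estimates each treated unit's outcome probability under no intervention, and those probabilities are used to weight treatment-group samples when computing an evaluation metric. The specific weight construction, variable names, and simulated data are our own illustrative assumptions, not the paper's NPW estimator.

```python
# Illustrative sketch of outcome-probability reweighting for model evaluation in an RCT.
# The weighting scheme below is an assumed stand-in, not the paper's NPW weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- Simulate a small RCT: covariates X, randomized treatment T, binary outcome Y. ---
n = 4000
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, size=n)                             # random assignment
base_logit = X @ np.array([1.0, -0.5, 0.3])
p_outcome = 1 / (1 + np.exp(-(base_logit - 1.2 * T)))      # intervention lowers outcome risk
Y = rng.binomial(1, p_outcome)

ctrl, trt = (T == 0), (T == 1)

# --- Nuisance model: outcome risk under no intervention, fit on control data only. ---
nuisance = LogisticRegression(max_iter=1000).fit(X[ctrl], Y[ctrl])
q = nuisance.predict_proba(X[trt])[:, 1]                   # estimated P(Y=1 | X, no intervention)

# --- Reweight treated samples toward the no-intervention outcome composition. ---
# Assumed scheme: treated units with observed Y=1 stand in for would-be cases in
# proportion to q, and Y=0 units for would-be non-cases in proportion to 1 - q.
w = np.where(Y[trt] == 1, q, 1.0 - q)

def weighted_accuracy(y_true, y_pred, weights=None):
    """Accuracy, optionally weighted; plain mean when weights is None."""
    correct = (y_true == y_pred).astype(float)
    return float(np.average(correct, weights=weights))

# --- Candidate predictive model to evaluate (a stand-in; in practice trained elsewhere). ---
candidate = LogisticRegression(max_iter=1000).fit(X[ctrl], Y[ctrl])
pred_ctrl = candidate.predict(X[ctrl])
pred_trt = candidate.predict(X[trt])

acc_control_only = weighted_accuracy(Y[ctrl], pred_ctrl)              # standard control-only estimate
acc_reweighted_trt = weighted_accuracy(Y[trt], pred_trt, weights=w)   # reuses treatment-group data
print(f"control-only accuracy:         {acc_control_only:.3f}")
print(f"reweighted treatment accuracy: {acc_reweighted_trt:.3f}")
```

In a full evaluation, the reweighted treatment-arm estimate would be combined with the control-only estimate to gain efficiency, with the exact weights following the paper's NPW derivation rather than the heuristic used above.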