🤖 AI Summary
Problem: There is currently no inferential method for testing whether an individualized treatment effect (ITE) model is moderately calibrated, i.e., whether the average true treatment effect among individuals with predicted effect z equals z. Two obstacles explain the gap: counterfactual responses are unobserved, and estimating the conditional response function for continuous predictions normally requires regularization or parametric modeling.
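Method: We propose non-parametric methods for assessing moderate calibration of ITE models for binary outcomes using data from a randomized trial, requiring neither parametric modeling nor regularization. The key idea is to formulate a stochastic process for the cumulative prediction errors that obeys a functional central limit theorem, so that the properties of Brownian motion can be used for asymptotic inference. The process can be constructed from a sample in two ways: a conditional approach that relies on predicted risks, and a marginal approach that replaces the cumulative conditional expected value and variance terms with their marginal counterparts. Together these yield numerical, graphical, and inferential assessments of calibration.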
Contribution/Results: Numerical simulations confirm the desirable properties of both approaches and their ability to detect miscalibration of different forms. A case study provides practical suggestions on graphical presentation and interpretation of results. The work shows that moderate calibration of predicted ITEs can be assessed without regularization techniques or assumptions about the functional form of treatment response.
📝 Abstract
An important aspect of the performance of algorithms that predict individualized treatment effects (ITE) is moderate calibration, i.e., the average treatment effect among individuals with predicted treatment effect of z being equal to z. The assessment of moderate calibration is a challenging task on two fronts: counterfactual responses are unobserved, and quantifying the conditional response function for models that generate continuous predicted values requires regularization or parametric modeling. Perhaps because of these challenges, there is currently no inferential method for the null hypothesis that an ITE model is moderately calibrated in a population. In this work, we propose non-parametric methods for the assessment of moderate calibration of ITE models for binary outcomes using data from a randomized trial. These methods simultaneously resolve both challenges, resulting in novel numerical, graphical, and inferential methods for the assessment of moderate calibration. The key idea is to formulate a stochastic process for the cumulative prediction errors that obeys a functional central limit theorem, enabling the use of the properties of Brownian motion for asymptotic inference. We propose two approaches to construct this process from a sample: a conditional approach that relies on predicted risks (often an output of ITE models), and a marginal approach based on replacing the cumulative conditional expected value and variance terms with their marginal counterparts. Numerical simulations confirm the desirable properties of both approaches and their ability to detect miscalibration of different forms. We use a case study to provide practical suggestions on graphical presentation and the interpretation of results. Moderate calibration of predicted ITEs can be assessed without requiring regularization techniques or making assumptions about the functional form of treatment response.
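The key idea of a cumulative prediction error process can be illustrated with a minimal sketch. This is not the authors' conditional or marginal construction, whose details the abstract does not give. It assumes a transformed-outcome device, Y* = Y(A - p) / (p(1 - p)), whose expectation given covariates equals the treatment effect when the randomization probability p is known. It also assumes that z is a predicted risk difference and that squared errors serve as variance estimates. The function names `cumulative_error_process` and `sup_abs_bm_pvalue` are hypothetical. The p-value uses the standard series for the distribution of the supremum of absolute Brownian motion on [0, 1].

```python
import numpy as np


def cumulative_error_process(y, a, z, p_treat=0.5):
    """Scaled partial sums of prediction errors, ordered by predicted ITE.

    Assumptions (illustrative, not taken from the paper):
      y: binary outcomes, a: 0/1 treatment indicator from a randomized trial,
      z: predicted treatment effect on the risk-difference scale,
      p_treat: known randomization probability.
    """
    y, a, z = (np.asarray(v, dtype=float) for v in (y, a, z))
    # Transformed outcome: its conditional mean equals the treatment effect.
    y_star = y * (a - p_treat) / (p_treat * (1.0 - p_treat))
    order = np.argsort(z, kind="stable")
    err = (y_star - z)[order]
    total = np.sum(err ** 2)
    # "Time" axis: share of accumulated squared error; value axis: scaled sum.
    t = np.cumsum(err ** 2) / total
    s = np.cumsum(err) / np.sqrt(total)
    return t, s


def sup_abs_bm_pvalue(stat, terms=100):
    """P(sup_{0<=t<=1} |W(t)| > stat) for standard Brownian motion W."""
    k = np.arange(terms)
    odd = 2 * k + 1
    cdf = (4.0 / np.pi) * np.sum(
        (-1.0) ** k / odd * np.exp(-(odd ** 2) * np.pi ** 2 / (8.0 * stat ** 2))
    )
    return float(np.clip(1.0 - cdf, 0.0, 1.0))
```

Under these assumptions, plotting `s` against `t` gives a path that should resemble Brownian motion if the model is moderately calibrated. `sup_abs_bm_pvalue(np.max(np.abs(s)))` turns the largest excursion of that path into an approximate p-value.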