🤖 AI Summary
Ordinary least squares regression exhibits conditional prediction bias when used for scientific inference based on predictions, stemming from shrinkage toward the mean and leading to systematic inferential errors. This work proposes Outcome-Calibrated Regression (OCR), which explicitly characterizes and addresses this bias for the first time by enforcing, via a closed-form solution, exact agreement between predicted and observed outcomes in conditional expectation. OCR thereby achieves precise calibration of the outcome variable and effectively eliminates conditional prediction bias. Empirical evaluations in brain-age analysis and causal inference demonstrate that OCR substantially improves the unbiasedness and reliability of prediction-based statistical inference.
📝 Abstract
Regression is a fundamental tool in scientific research. Ordinary least squares (OLS), one of the most widely used regression methods, enjoys several desirable properties, including the best linear unbiased estimator (BLUE) property. It is well known that, under the assumptions of the standard model, the OLS is conditionally unbiased given the covariates, i.e., $\mathbb{E}(\widehat Y-Y\mid X=x)=0$. However, an often-overlooked property of OLS is that the prediction error is generally not unbiased conditional on the outcome, i.e., $\mathbb{E}(\widehat Y-Y\mid Y=y)\neq 0$. As a consequence of minimizing mean squared error, OLS predictions are systematically shrunk toward the outcome mean, which explains the classical phenomenon of regression to the mean (RTM): large outcome values tend to be underpredicted, whereas small outcome values tend to be overpredicted. This conditional prediction bias creates a nonignorable problem for predicted outcome-based inference, where scientific inference is performed using the predicted outcome $\widehat Y$ and another variable $W$. In applications such as brain-age analysis and causal inference, we show that inference based on regression-predicted outcomes can be systematically biased. To address this issue, we propose outcome-calibrated regression (OCR), a new regression framework with a closed-form solution that directly enforces outcome calibration. The proposed OCR estimator eliminates conditional prediction bias with respect to the outcome and enables valid inference using regression-predicted outcomes.