🤖 AI Summary
To address degraded generalization at deployment caused by demographic representation imbalance in training data, this paper proposes a conditional Γ-biased sampling model that characterizes constrained conditional distribution shifts between the test and training distributions. Under this model, distributionally robust optimization is reformulated as a tractable augmented convex risk minimization problem, and the method of sieves is used to establish statistical consistency guarantees. An end-to-end deep learning algorithm is then designed around a robust loss function based on the Rockafellar–Uryasev representation of Conditional Value-at-Risk (CVaR). Empirical evaluation on mental health score prediction and ICU length-of-stay forecasting shows that the proposed method substantially improves out-of-distribution robustness, consistently outperforming standard empirical risk minimization and existing bias-correction approaches.
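To build intuition for why CVaR appears in the robust objective, consider the unconditional special case where the test-to-training likelihood ratio is only known to be bounded above by Γ. A standard fact in this line of work (illustrative here, not quoted from the paper) is that the worst-case test risk over all such reweightings equals the CVaR of the training loss at level 1 − 1/Γ. The sketch below checks this numerically; the function names `worst_case_risk` and `cvar` are ours, chosen for illustration.

```python
import numpy as np

def cvar(losses, alpha):
    # CVaR_alpha: mean of the worst (1 - alpha) fraction of losses.
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    k = int(np.ceil((1 - alpha) * len(losses)))
    return float(losses[:k].mean())

def worst_case_risk(losses, gamma):
    # Worst-case mean over reweightings w with 0 <= w_i <= gamma/n and
    # sum(w) = 1: greedily place the maximum allowed weight on the
    # largest losses until the probability budget is exhausted.
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    n = len(losses)
    cap = gamma / n
    w = np.zeros(n)
    budget = 1.0
    for i in range(n):
        w[i] = min(cap, budget)
        budget -= w[i]
    return float(w @ losses)

losses = list(range(10))       # toy losses 0, 1, ..., 9
gamma = 5.0                    # likelihood ratio bounded by 5
# Worst case puts weight 0.5 on each of the two largest losses (9 and 8),
# matching CVaR at level 1 - 1/gamma = 0.8.
print(worst_case_risk(losses, gamma))        # 8.5
print(cvar(losses, 1 - 1 / gamma))           # 8.5
```

The paper's conditional Γ-bias model is richer than this toy case, since covariates may shift selection arbitrarily and only the *unexplained* variation is bounded, but the same CVaR mechanism drives the reformulation.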
📝 Abstract
The empirical risk minimization approach to data-driven decision making requires access to training data drawn under the same conditions as those that will be faced when the decision rule is deployed. However, in a number of settings, we may be concerned that our training sample is biased in the sense that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called conditional $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data and a case study on ICU length of stay prediction.
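The Rockafellar–Uryasev result invoked in the abstract expresses CVaR as a minimization over an auxiliary scalar $t$, $\mathrm{CVaR}_\alpha(L) = \min_t \, t + \mathbb{E}[(L - t)_+]/(1-\alpha)$, which is what allows the worst-case objective to be folded into an ordinary convex risk minimization problem. The sketch below (our own illustration, not the paper's code) evaluates this formula on a grid over $t$; in the paper's deep learning algorithm, $t$ would instead be a trainable parameter optimized jointly with the model weights.

```python
import numpy as np

def cvar_ru(losses, alpha, num_grid=1001):
    """CVaR via the Rockafellar-Uryasev representation:
       CVaR_alpha(L) = min_t  t + E[(L - t)_+] / (1 - alpha).
    The minimizer t* is the alpha-quantile (VaR) of the losses, so a grid
    over the sample range is enough for this sketch.
    """
    losses = np.asarray(losses, dtype=float)
    ts = np.linspace(losses.min(), losses.max(), num_grid)
    # Broadcast: one objective value per candidate t.
    obj = ts + np.maximum(losses[None, :] - ts[:, None], 0.0).mean(axis=1) / (1 - alpha)
    return float(obj.min())

# Toy losses 0..9 at alpha = 0.8: CVaR is the mean of the worst 20%
# of losses, i.e. (8 + 9) / 2 = 8.5.
print(cvar_ru(range(10), 0.8))   # 8.5
```

Because the inner objective is jointly convex in $t$ and the model parameters (for a convex loss), minimizing over both simultaneously with stochastic gradients is well-posed, which is what makes the augmented formulation amenable to end-to-end training.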