🤖 AI Summary
This work addresses the problem of evaluating the worst-case subgroup performance of machine learning models under distributional shift, as a way of quantifying their distributional robustness. For subgroups of a fixed size defined by arbitrary (possibly continuous) core attributes Z, the authors propose a two-stage estimation framework: first, estimate the model's performance conditional on Z; second, optimize over Z to identify the worst-case subgroup. Unlike approaches based on Rademacher complexity, the method avoids exponential dependence on the dimension of Z and achieves dimension-free finite-sample convergence guarantees: the estimation error depends on the dimension of Z only through the out-of-sample error of the conditional performance estimate. The key theoretical contribution is the first scalable, non-conservative statistical certification framework for robustness across intersecting vulnerable subgroups. Empirical evaluation on real-world datasets demonstrates that the method identifies unreliable models and supports robust deployment decisions.
📝 Abstract
The performance of ML models degrades when the training population differs from the population encountered in operation. To assess distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
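The two-stage procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the k-nearest-neighbor regressor for stage one, and the tail-average (CVaR-style) reading of "worst subgroup of relative size alpha" in stage two are all assumptions made for the example; any conditional performance estimator could replace the k-NN step.

```python
import numpy as np

def estimate_conditional_loss(Z, losses, k=50):
    """Stage 1: estimate E[loss | Z] with a k-nearest-neighbor average.
    (Illustrative choice of estimator; not the paper's implementation.)"""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest points per example
    return losses[neighbors].mean(axis=1)

def worst_case_subgroup_loss(cond_losses, alpha):
    """Stage 2: average the worst alpha-fraction of conditional losses,
    i.e. the mean loss over the worst-off subgroup of relative size alpha
    (a CVaR-style tail average over the conditional performance estimates)."""
    m = max(1, int(np.ceil(alpha * len(cond_losses))))
    return float(np.sort(cond_losses)[-m:].mean())

# Toy check: losses concentrate where the first coordinate of Z is large,
# so the worst 10% subgroup should fare much worse than the average example.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
losses = (Z[:, 0] > 1.0).astype(float) + 0.1 * rng.random(500)
cond = estimate_conditional_loss(Z, losses, k=50)
print(worst_case_subgroup_loss(cond, alpha=0.1), losses.mean())
```

Because stage two only sorts the estimated conditional losses, its cost does not grow with the dimension of Z; the hard part of the problem lives entirely in the stage-one regression, which is exactly where the dimension-free guarantee localizes the error.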