Evaluating model performance under worst-case subpopulations

📅 2024-07-01
🏛️ Neural Information Processing Systems
📈 Citations: 19
Influential: 1
🤖 AI Summary
This work addresses the problem of evaluating the worst-case subgroup performance of machine learning models under distributional shift, in order to quantify their distributional robustness. For subpopulations of a fixed size defined by arbitrary (continuous) core attributes Z, the authors propose a two-stage estimation framework: first, estimate model performance conditional on Z; second, optimize over subpopulations to identify the worst case. Unlike Rademacher-complexity-based approaches, the method avoids exponential dependence on the dimension of Z, achieving dimension-free finite-sample convergence guarantees in which the estimation error depends on Z only through the out-of-sample error of the conditional performance estimate. The key theoretical contribution is a scalable, non-conservative statistical certification framework for robustness across intersecting vulnerable subgroups. Empirical evaluation on real-world datasets demonstrates that the method identifies unreliable models and supports robust deployment decisions.
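The two-stage idea can be illustrated with a minimal sketch (not the authors' implementation; the data, the choice of regressor, and all variable names here are hypothetical). Stage 1 fits an estimate of the conditional loss h(z) = E[loss | Z = z]; Stage 2 evaluates the average loss over the worst subpopulation of relative size alpha, which on a held-out split amounts to the tail mean of h over the highest-loss alpha-fraction of points:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical held-out data: core attributes Z and per-example model loss,
# with one subgroup (Z[:, 0] > 0.5) suffering elevated loss.
n = 2000
Z = rng.normal(size=(n, 3))
loss = 0.1 + 0.4 * (Z[:, 0] > 0.5) + 0.05 * rng.normal(size=n)

# Stage 1: estimate the conditional performance h(z) = E[loss | Z = z]
# on one half of the data.
half = n // 2
h = GradientBoostingRegressor().fit(Z[:half], loss[:half])
h_pred = h.predict(Z[half:])

# Stage 2: the worst-case subpopulation of relative size alpha is the
# alpha-fraction of attribute space with the highest conditional loss;
# its average loss is the tail mean of h on the other half.
alpha = 0.1
k = int(np.ceil(alpha * len(h_pred)))
worst_case = np.sort(h_pred)[-k:].mean()
print(f"worst-case loss over subpopulations of size {alpha:.0%}: {worst_case:.3f}")
```

By construction the worst-case value is at least the average loss, and the gap between the two is what flags a model as unreliable for some subgroup even when its overall performance looks acceptable.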

📝 Abstract
The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
Problem

Research questions and friction points this paper is trying to address.

Assess worst-case model performance across subpopulations
Develop scalable method to evaluate distributional robustness
Prevent deployment of unreliable models on real datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Worst-case subpopulation performance evaluation
Scalable two-stage estimation procedure
Finite-sample, dimension-free convergence guarantees