AI Summary
This work addresses the fundamental challenge that selective classifiers often fail to approximate the ideal ranking oracle, i.e., one that accepts samples exactly in order of correctness. We propose the first finite-sample decomposition of the selective-classification gap into five sources, systematically isolating Bayes noise, approximation error from limited model capacity, ranking error, statistical noise, and implementation- or shift-induced slack. Crucially, we show that monotone post-hoc calibration alone is insufficient for high-quality ranking, since it rarely alters the underlying score order; feature-aware calibration and post-hoc re-ranking are required instead. Data shift introduces a separate slack that calls for distributionally robust training. Controlled experiments on synthetic two-moons data and real-world vision and language benchmarks show that Bayes noise and limited model capacity are the dominant bottlenecks. Richer, feature-aware calibrators significantly improve score-ranking fidelity, and the decomposition yields a quantifiable error budget together with actionable design principles for selective classification systems.
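As a purely schematic sketch (the notation below is introduced here for illustration and is not the paper's formal finite-sample statement), the five sources can be read as an additive budget on the selective-classification gap:

```latex
% Illustrative-only notation: each epsilon term names one of the five sources
% listed above; the paper's formal bound may differ in form and constants.
\Delta_{\mathrm{sel}}
  \;\lesssim\;
  \underbrace{\varepsilon_{\mathrm{Bayes}}}_{\text{irreducible noise}}
  + \underbrace{\varepsilon_{\mathrm{approx}}}_{\text{model capacity}}
  + \underbrace{\varepsilon_{\mathrm{rank}}}_{\text{score mis-ordering}}
  + \underbrace{\varepsilon_{\mathrm{stat}}}_{\text{finite-sample noise}}
  + \underbrace{\varepsilon_{\mathrm{shift}}}_{\text{implementation / shift slack}}
```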
Abstract
Selective classifiers improve model reliability by abstaining on inputs the model deems uncertain. In practice, however, few approaches achieve the gold-standard performance of a perfect-ordering oracle that accepts examples exactly in order of correctness. Our work formalizes this shortfall as the selective-classification gap and presents the first finite-sample decomposition of this gap into five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack. Crucially, our analysis reveals that monotone post-hoc calibration -- often believed to strengthen selective classifiers -- has limited impact on closing this gap, since it rarely alters the model's underlying score ranking. Bridging the gap therefore requires scoring mechanisms that can reorder predictions rather than merely rescale them. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks, isolating each error component through controlled experiments. Our results confirm that (i) Bayes noise and limited model capacity can account for a substantial share of the gap, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training. Together, our decomposition yields a quantitative error budget as well as actionable design guidelines that practitioners can use to build selective classifiers that approximate ideal oracle behavior more closely.
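The claim that monotone calibration cannot close the ranking part of the gap can be seen in a few lines of code. The sketch below is illustrative and uses synthetic confidences and a toy temperature-scaling map (none of these names or data come from the paper): a strictly monotone rescaling of the scores leaves the acceptance order, and therefore the selective risk at every coverage level, unchanged.

```python
# Minimal sketch (not the paper's code): a monotone post-hoc calibrator
# rescales confidences but preserves their order, so the risk-coverage
# curve of a selective classifier is unaffected. Data below are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Toy confidences and correctness indicators for 1000 predictions.
conf = rng.uniform(0.5, 1.0, size=1000)           # model confidence scores
correct = rng.uniform(size=1000) < conf ** 2       # correctness loosely tied to confidence

def selective_risk_curve(scores, correct):
    """Error rate among accepted samples as coverage grows from most to least confident."""
    order = np.argsort(-scores)                     # accept highest-scoring samples first
    errors = np.cumsum(~correct[order])             # cumulative mistakes among accepted samples
    accepted = np.arange(1, len(scores) + 1)        # number of accepted samples at each coverage
    return errors / accepted                        # selective risk at each coverage level

def temperature_scale(scores, T=2.0):
    """A monotone recalibration map: temperature scaling applied on the logit of the score."""
    logits = np.log(scores / (1.0 - scores + 1e-12))
    return 1.0 / (1.0 + np.exp(-logits / T))

risk_raw = selective_risk_curve(conf, correct)
risk_cal = selective_risk_curve(temperature_scale(conf), correct)

# Monotone rescaling leaves the acceptance order, and hence the curve, unchanged.
print("curves identical after monotone calibration:", np.allclose(risk_raw, risk_cal))
```

Any calibrator expressible as a monotone function of the score alone has this property, which is why improving the ordering itself requires feature-aware scoring or explicit re-ranking rather than rescaling.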