🤖 AI Summary
Class-level evaluation metrics can hide performance disparities among subconcepts within a class. The problem is most acute when classes are imbalanced and subconcept distributions are skewed, because common measures are then biased toward the larger minority subconcepts. To correct this without ground-truth subconcept labels at test time, the authors propose a utility-weighted evaluation framework: the posterior probabilities of a multiclass subconcept model define soft, uncertainty-aware weights (the expected utility under the posterior), and these weights yield predicted-weighted balanced accuracy (pBA). The metric therefore relies on predicted probabilities rather than true subconcept labels. Experiments on tabular, medical-imaging, and text datasets show that conventional unweighted metrics can be misleading under within-class heterogeneity, whereas pBA gives more stable and interpretable assessments when subconcept distributions are uneven but not pathological.
📝 Abstract
Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels can mitigate this bias; however, such labels are rarely available at test time. We introduce a practical utility-weighted evaluation that replaces unavailable subconcept labels with predicted posterior probabilities from a multiclass subconcept model. Evaluation weights are defined as the expected utility under this posterior, yielding a soft, uncertainty-aware metric we call predicted-weighted balanced accuracy (pBA). Experiments on tabular benchmarks as well as medical-imaging and text datasets show that unweighted scores can be misleading under within-class heterogeneity, while pBA provides more stable and interpretable assessments when subconcept distributions are uneven but not pathological. Our code is available at: https://anonymous.4open.science/r/correcting-bias-imbalance-9C6C/.
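As a rough illustration of the idea described above, the following sketch computes expected-utility weights from subconcept posteriors and uses them in a weighted balanced accuracy. It is not taken from the authors' code: the function names, the toy data, the utility values, and the choice to average per-class weighted recall are all assumptions made for illustration, and the paper's exact definition of pBA may differ.

```python
def expected_utility_weights(posteriors, utilities):
    """Per-sample weight = sum_k p(subconcept k | x) * utility[k].

    `posteriors` is a list of rows, one per sample, each a probability
    vector over subconcepts. `utilities` holds one value per subconcept.
    Both names are hypothetical, not the paper's notation.
    """
    return [sum(p * u for p, u in zip(row, utilities)) for row in posteriors]


def weighted_balanced_accuracy(y_true, y_pred, weights):
    """Mean over classes of the weight-normalised recall (assumed form)."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(w for t, w in zip(y_true, weights) if t == c)
        if total == 0:
            continue  # a class with zero total weight contributes nothing
        correct = sum(
            w for t, p, w in zip(y_true, y_pred, weights) if t == c and p == c
        )
        recalls.append(correct / total)
    if not recalls:
        raise ValueError("no class has positive total weight")
    return sum(recalls) / len(recalls)


# Toy data (made up): class 1 contains a large and a small subconcept.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]  # the only error falls on the rare subconcept

# Columns: [class-0 subconcept, large class-1 subconcept, small class-1 subconcept]
posteriors = [
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],  # the subconcept model is unsure about this sample
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
utilities = [1.0, 1.0, 3.0]  # assumed: the rare subconcept gets a higher utility

weights = expected_utility_weights(posteriors, utilities)
pba = weighted_balanced_accuracy(y_true, y_pred, weights)
unweighted = weighted_balanced_accuracy(y_true, y_pred, [1.0] * len(y_true))
print(f"unweighted BA = {unweighted:.3f}, pBA = {pba:.3f}")
```

In this toy case the single error lands on the sample most likely to belong to the rare subconcept, so the weighted score is lower than the unweighted one. Because the weight comes from a soft posterior, the uncertain sample is up-weighted only in proportion to its probability of belonging to the rare subconcept.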