🤖 AI Summary
This study addresses predictive multiplicity, the phenomenon in which near-optimal models yield conflicting predictions for the same input, with a focus on high-stakes decision-making contexts, where it disproportionately affects minority-group samples. We systematically demonstrate, for the first time, that predictive multiplicity concentrates in low-confidence regions and among minority populations. To mitigate this, we investigate the relationship between classifier calibration and predictive multiplicity, evaluating post-hoc calibration methods (Platt Scaling, Isotonic Regression, and Temperature Scaling) on their ability to improve prediction consistency across the Rashomon set of near-equally-performing models, using nine credit risk datasets. Our experiments show that Platt Scaling and Isotonic Regression effectively reduce predictive multiplicity, suggesting that calibration can serve as a consensus mechanism that alleviates algorithmic arbitrariness and promotes procedural fairness.
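Of the calibration maps named above, Isotonic Regression is the most algorithmically distinctive: it fits a monotone step function to held-out scores via pool-adjacent-violators (PAV). The sketch below is illustrative only (the function name and data are assumptions, not taken from the paper) and shows the core merging step on raw scores and binary labels:

```python
def isotonic_calibrate(scores, labels):
    """Fit a monotone calibration map via pool-adjacent-violators (PAV).

    Takes paired raw classifier scores and 0/1 labels; returns the fitted
    calibrated probability for each input point, in the original order.
    """
    # Work in increasing-score order, as isotonic regression requires.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    y = [float(labels[i]) for i in order]

    # Each block stores [sum of labels, count]; its mean is the fitted value.
    # Merge adjacent blocks whenever their means violate monotonicity.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n

    # Expand blocks back into one fitted value per point.
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)

    # Undo the sort so outputs align with the original inputs.
    out = [0.0] * len(scores)
    for pos, i in enumerate(order):
        out[i] = fitted[pos]
    return out
```

In practice one would fit this map on a held-out calibration split and interpolate between fitted points at prediction time; the sketch keeps only the PAV core.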
📝 Abstract
As machine learning models are increasingly deployed in high-stakes environments, ensuring both probabilistic reliability and prediction stability has become critical. This paper examines the interplay between classifier calibration and predictive multiplicity, the phenomenon in which multiple near-optimal models within the Rashomon set yield conflicting credit outcomes for the same applicant. Using nine diverse credit risk benchmark datasets, we investigate whether predictive multiplicity concentrates in regions of low predictive confidence and how post-hoc calibration can mitigate algorithmic arbitrariness. Our empirical analysis reveals that minority-class observations bear a disproportionate multiplicity burden, as evidenced by statistically significant group-level disparities in both multiplicity and prediction confidence. Furthermore, our comparisons indicate that applying post-hoc calibration methods (specifically Platt Scaling, Isotonic Regression, and Temperature Scaling) is associated with lower obscurity across the Rashomon set. Among the tested techniques, Platt Scaling and Isotonic Regression provide the most robust reduction in predictive multiplicity. These findings suggest that calibration can function as a consensus-enforcing layer and may support procedural fairness by mitigating predictive multiplicity.
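One standard way to quantify predictive multiplicity over a Rashomon set is the ambiguity metric: the fraction of inputs on which at least two near-optimal models disagree. The abstract does not specify the paper's exact metric, so the following is a minimal sketch under that common definition, with illustrative names and thresholded 0/1 predictions assumed:

```python
def ambiguity(predictions):
    """Fraction of samples on which the Rashomon-set models disagree.

    `predictions` is a list of equal-length lists: one row of thresholded
    0/1 predicted labels per near-optimal model.
    """
    n_samples = len(predictions[0])
    conflicted = 0
    for j in range(n_samples):
        # Collect each model's label for sample j; >1 distinct label = conflict.
        labels = {model[j] for model in predictions}
        if len(labels) > 1:
            conflicted += 1
    return conflicted / n_samples
```

Comparing this quantity before and after applying a post-hoc calibration map (with a shared decision threshold on the calibrated probabilities) is one direct way to test whether calibration reduces multiplicity.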