Observational Multiplicity

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In probabilistic classification, observational multiplicity—where multiple near-optimal models yield substantially divergent probability predictions for the same input—undermines interpretability and safety. To address this, we propose *regret*, a novel metric quantifying prediction instability under label sampling variability, and develop a general framework to estimate how training label perturbations affect the output distribution. Our contributions are threefold: (i) the first application of regret to analyze arbitrary model selection in probabilistic classification; (ii) fine-grained, subgroup-level detection of instability; and (iii) natural derivation of abstention mechanisms and active data acquisition strategies. Experiments demonstrate that our method accurately identifies high-regret subgroups and significantly improves decision robustness and data collection efficiency in safety-critical domains such as healthcare and finance.

📝 Abstract
Many prediction tasks admit multiple models that perform almost equally well. This phenomenon can undermine interpretability and safety when competing models assign conflicting predictions to individuals. In this work, we study how arbitrariness can arise in probabilistic classification tasks as a result of an effect that we call *observational multiplicity*. We discuss how this effect arises in a broad class of practical applications where we learn a classifier to predict probabilities $p_i \in [0,1]$ but are given a dataset of observations $y_i \in \{0,1\}$. We propose to evaluate the arbitrariness of individual probability predictions through the lens of *regret*. We introduce a measure of regret for probabilistic classification tasks, which captures how the predictions of a model could change under different training labels. We present a general-purpose method to estimate regret in a probabilistic classification task. We use our measure to show that regret is higher for certain groups in the dataset and discuss potential applications of regret. We demonstrate how estimating regret promotes safety in real-world applications through abstention and data collection.
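The core idea, that a model fit to 0/1 observations could have come out differently under other plausible label draws, can be illustrated with a minimal sketch. This is not the authors' estimator: the model class (logistic regression), the resampling scheme (Bernoulli draws from the fitted probabilities), and all function names here are assumptions for illustration only.

```python
# Hypothetical sketch of label-resampling regret estimation.
# Assumption: regret for example i is the largest change in its predicted
# probability across refits on alternative label draws y_i ~ Bernoulli(p_i).
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_regret(X, y, n_resamples=20, seed=0):
    """Refit on plausible alternative labelings and record, per example,
    the largest deviation from the original probability prediction."""
    rng = np.random.default_rng(seed)
    base = LogisticRegression().fit(X, y)
    p_hat = base.predict_proba(X)[:, 1]  # original probability predictions
    regret = np.zeros(len(y))
    for _ in range(n_resamples):
        y_alt = rng.binomial(1, p_hat)   # resample labels from fitted probabilities
        if y_alt.min() == y_alt.max():
            continue                     # skip degenerate single-class draws
        p_alt = LogisticRegression().fit(X, y_alt).predict_proba(X)[:, 1]
        regret = np.maximum(regret, np.abs(p_alt - p_hat))
    return p_hat, regret
```

Examples whose predictions swing widely across refits get high regret, matching the paper's framing that some individuals' probabilities are far more arbitrary than others.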
Problem

Research questions and friction points this paper is trying to address.

Studying arbitrariness in probabilistic classification due to observational multiplicity
Measuring individual prediction arbitrariness using regret in classification tasks
Addressing safety concerns by estimating regret for specific dataset groups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Measure regret in probabilistic classification tasks
Estimate arbitrariness of individual probability predictions
Promote safety via abstention and data collection
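Given a per-example regret estimate, the abstention mechanism listed above follows naturally: withhold a prediction whenever the probability itself is too unstable to trust. The threshold and names below are assumptions, not values from the paper.

```python
# Hypothetical abstention rule: report np.nan (abstain) for any example
# whose estimated regret exceeds a chosen tolerance.
import numpy as np

def predict_or_abstain(p_hat, regret, threshold=0.1):
    """Return probability predictions, abstaining (np.nan) on high-regret cases."""
    return np.where(regret > threshold, np.nan, p_hat)

p_hat = np.array([0.9, 0.55, 0.2])
regret = np.array([0.02, 0.3, 0.05])
print(predict_or_abstain(p_hat, regret))  # abstains on the unstable middle case
```

The same ranking could drive data collection: gathering more labels for the highest-regret subgroups targets exactly the examples where predictions are most arbitrary.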