🤖 AI Summary
This paper investigates how data characteristics influence model prediction multiplicity—the existence of multiple distinct models achieving near-optimal performance on a given dataset. Specifically, it addresses how perturbing individual data points affects multiplicity.
Method: We introduce the "neighboring datasets" analytical framework to rigorously characterize multiplicity under local data perturbations. Theoretically and empirically, we establish that higher inter-class distribution overlap weakens the Rashomon effect, i.e., reduces multiplicity, revealing an intrinsic mechanism rooted in a Rashomon parameter shared across neighboring datasets. Building on this insight, we propose the first multiplicity-aware active learning and data imputation framework, integrating overlap-based metrics, model-space modeling, and sensitivity-driven algorithm design for systematic assessment and explicit control of multiplicity.
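As a toy illustration of the quantities involved (not the paper's construction): the sketch below assumes a 1-D threshold-classifier family, forms a Rashomon set of near-optimal thresholds, and measures multiplicity as the fraction of points whose prediction varies across that set, comparing a dataset with a neighboring dataset that differs in a single label near the class boundary.

```python
import numpy as np

def rashomon_multiplicity(X, y, thresholds, eps=0.05):
    """Toy multiplicity measure for 1-D threshold classifiers.
    Rashomon set = thresholds whose accuracy is within eps of the best;
    multiplicity = fraction of points whose predicted label differs
    across that set. Illustration only, not the paper's method."""
    # preds[t, i] = 1 if X[i] >= thresholds[t]
    preds = (X[None, :] >= thresholds[:, None]).astype(int)
    accs = (preds == y[None, :]).mean(axis=1)
    rashomon = preds[accs >= accs.max() - eps]   # near-optimal models
    ambiguous = rashomon.min(axis=0) != rashomon.max(axis=0)
    return ambiguous.mean()

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40))
y = (X >= 0.5).astype(int)                       # cleanly separable labels
grid = np.linspace(0, 1, 101)

m_clean = rashomon_multiplicity(X, y, grid)

# "Neighboring dataset": flip one label at the boundary to raise overlap
y2 = y.copy()
y2[np.argmin(np.abs(X - 0.5))] ^= 1
m_neighbour = rashomon_multiplicity(X, y2, grid)
```

The `eps` tolerance plays the role of the Rashomon parameter: holding it fixed across the two neighboring datasets is what makes their multiplicities directly comparable.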
Results: Experiments demonstrate that our approach significantly improves model stability and decision reliability, offering principled tools for mitigating multiplicity-induced uncertainty in real-world machine learning deployments.
📝 Abstract
Multiplicity -- the existence of distinct models with comparable performance -- has received growing attention in recent years. While prior work has largely emphasized modelling choices, the critical role of data in shaping multiplicity has been comparatively overlooked. In this work, we introduce a neighbouring datasets framework to examine the most granular case: the impact of a single-data-point difference on multiplicity. Our analysis yields a seemingly counterintuitive finding: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. This reversal of conventional expectations arises from a shared Rashomon parameter, and we substantiate it with rigorous proofs.
Building on this foundation, we extend our framework to two practical domains: active learning and data imputation. For each, we establish a natural extension of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and propose novel multiplicity-aware methods: data acquisition strategies for active learning and imputation techniques for incomplete data.
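One plausible instantiation of a multiplicity-aware acquisition score, sketched on the same toy 1-D threshold family (this is an illustrative stand-in, not necessarily the paper's algorithm): score each unlabeled candidate by how evenly the Rashomon set, fit on the labeled data, is split over its prediction, then query the most contested point.

```python
import numpy as np

def rashomon_disagreement(X_lab, y_lab, pool, thresholds, eps=0.05):
    """Hypothetical multiplicity-aware acquisition score (illustration only):
    for each unlabeled pool point, measure how split the Rashomon set of
    near-optimal threshold classifiers is on its label (0 = unanimous,
    0.5 = perfectly split)."""
    preds = (X_lab[None, :] >= thresholds[:, None]).astype(int)
    accs = (preds == y_lab[None, :]).mean(axis=1)
    good = thresholds[accs >= accs.max() - eps]      # Rashomon set
    pool_preds = (pool[None, :] >= good[:, None]).astype(int)
    p1 = pool_preds.mean(axis=0)                     # share predicting class 1
    return np.minimum(p1, 1 - p1)

rng = np.random.default_rng(1)
X_lab = np.sort(rng.uniform(0, 1, 30))
y_lab = (X_lab >= 0.5).astype(int)
pool = rng.uniform(0, 1, 10)

scores = rashomon_disagreement(X_lab, y_lab, pool, np.linspace(0, 1, 101))
query = pool[np.argmax(scores)]   # candidate the Rashomon set disagrees on most
```

Labeling such a point shrinks the set of near-optimal models that remain mutually consistent, which is the sense in which acquisition can explicitly control multiplicity.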