Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates how data characteristics influence predictive multiplicity: the existence of multiple distinct models achieving near-optimal performance on the same dataset. Specifically, it asks how changing a single data point affects multiplicity. Method: the authors introduce a "neighbouring datasets" analytical framework to rigorously characterize multiplicity under single-point data perturbations. Theoretically and empirically, they establish that greater inter-class distribution overlap weakens the Rashomon effect, i.e., reduces multiplicity, a mechanism rooted in a Rashomon parameter shared across neighbouring datasets. Building on this insight, they propose the first multiplicity-aware active learning and data imputation methods, integrating overlap-based metrics, model-space analysis, and sensitivity-driven algorithm design for systematic assessment and explicit control of multiplicity. Results: experiments demonstrate that the approach improves model stability and decision reliability, offering principled tools for mitigating multiplicity-induced uncertainty in real-world machine learning deployments.
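As a rough illustration of how predictive multiplicity is commonly quantified (a generic ambiguity metric over an empirical Rashomon set, not the paper's exact procedure), one can train many near-optimal classifiers and measure how often they disagree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: two overlapping 2-D Gaussian classes.
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def fit_logreg(X, y, seed, steps=300, lr=0.1):
    """Logistic regression fit by gradient descent from a random init."""
    r = np.random.default_rng(seed)
    w, b = r.normal(0.0, 1.0, X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Train many models from different random initialisations.
models = [fit_logreg(X, y, seed) for seed in range(20)]
preds = np.array([(X @ w + b > 0).astype(int) for w, b in models])
accs = (preds == y).mean(axis=1)

# Empirical Rashomon set: models within eps of the best observed accuracy.
eps = 0.01
rashomon = preds[accs >= accs.max() - eps]

# Ambiguity: fraction of points on which some pair of Rashomon models disagree.
ambiguity = np.mean(rashomon.min(axis=0) != rashomon.max(axis=0))
print(f"{len(rashomon)} near-optimal models, ambiguity = {ambiguity:.3f}")
```

Running the same measurement on two datasets that differ in a single point is the neighbouring-datasets comparison the paper formalizes.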

📝 Abstract
Multiplicity -- the existence of distinct models with comparable performance -- has received growing attention in recent years. While prior work has largely emphasized modelling choices, the critical role of data in shaping multiplicity has been comparatively overlooked. In this work, we introduce a neighbouring datasets framework to examine the most granular case: the impact of a single-data-point difference on multiplicity. Our analysis yields a seemingly counterintuitive finding: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. This reversal of conventional expectations arises from a shared Rashomon parameter, and we substantiate it with rigorous proofs. Building on this foundation, we extend our framework to two practical domains: active learning and data imputation. For each, we establish natural extensions of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and finally, propose novel multiplicity-aware methods, namely, multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation techniques.
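The notion of inter-class distribution overlap can be made concrete with a standard overlap measure. For example, for two 1-D Gaussian class-conditionals the Bhattacharyya coefficient has a closed form (one possible instantiation of overlap for illustration; the paper's actual metric may differ):

```python
import numpy as np

def bhattacharyya_coeff(mu1, s1, mu2, s2):
    """Bhattacharyya coefficient between N(mu1, s1^2) and N(mu2, s2^2).

    Ranges from 0 (no overlap) to 1 (identical distributions).
    """
    var = s1**2 + s2**2
    return np.sqrt(2 * s1 * s2 / var) * np.exp(-((mu1 - mu2) ** 2) / (4 * var))

# Identical class-conditionals: maximal overlap.
print(bhattacharyya_coeff(0.0, 1.0, 0.0, 1.0))   # 1.0
# Well-separated classes: overlap near zero.
print(bhattacharyya_coeff(0.0, 1.0, 10.0, 1.0))  # ~3.7e-06
```

The paper's counterintuitive claim is that, between neighbouring datasets, the one with the *higher* overlap score exhibits *lower* multiplicity.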
Problem

Research questions and friction points this paper is trying to address.

Examining how single data point differences affect predictive multiplicity
Analyzing how inter-class distribution overlap influences model multiplicity
Developing multiplicity-aware methods for active learning and data imputation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neighbouring datasets framework analyzes single-data-point impact
Shared Rashomon parameter explains inter-class overlap effects
Multiplicity-aware methods developed for active learning and imputation
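A minimal sketch of a disagreement-based acquisition rule in this spirit (an illustrative heuristic, not the paper's proposed algorithm): query the unlabeled pool point on which the near-optimal models disagree most.

```python
import numpy as np

def acquire_most_ambiguous(ensemble_preds):
    """Pick the pool index where near-optimal models disagree most.

    ensemble_preds: (n_models, n_pool) array of 0/1 predictions from
    models in an empirical Rashomon set.
    """
    votes = ensemble_preds.mean(axis=0)          # fraction voting class 1
    disagreement = 1.0 - np.abs(2 * votes - 1)   # 0 = unanimous, 1 = even split
    return int(np.argmax(disagreement)), disagreement

# Three near-optimal models scoring a pool of four unlabeled points.
preds = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 1, 1]])
idx, scores = acquire_most_ambiguous(preds)
print(idx)  # 1 (first of the maximally split points; point 3 ties)
```

Labeling such points shrinks the region of the input space where the Rashomon set is free to vary, which is the intuition behind multiplicity-aware acquisition.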