On Arbitrary Predictions from Equally Valid Models

📅 2025-07-25

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

Medical diagnostics face the “model multiplicity” problem: multiple machine learning models with comparable aggregate performance may yield contradictory predictions for the same patient; standard evaluation metrics fail to identify the optimal model, and individual predictions are susceptible to arbitrary choices made during development—undermining diagnostic reliability. This paper proposes a mitigation framework based on small-scale ensembling and predictive abstention, which leverages cross-model consensus analysis to identify high-agreement predictions for automated diagnosis while explicitly deferring ambiguous cases to clinical experts. Its key innovation lies in explicitly modeling model uncertainty as an actionable clinical decision signal. Experiments demonstrate that the approach significantly reduces prediction multiplicity and improves diagnostic consistency, without substantially increasing model complexity.

Technology Category

Application Category

📝 Abstract

Model multiplicity refers to the existence of multiple machine learning models that describe the data equally well but may produce different predictions on individual samples. In medicine, these models can admit conflicting predictions for the same patient -- a risk that is poorly understood and insufficiently addressed. In this study, we empirically analyze the extent, drivers, and ramifications of predictive multiplicity across diverse medical tasks and model architectures, and show that even small ensembles can mitigate/eliminate predictive multiplicity in practice. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model and (2) a substantial amount of predictions hinges on arbitrary choices made during model development. Using multiple models instead of a single model reveals instances where predictions differ across equally plausible models -- highlighting patients that would receive arbitrary diagnoses if any single model were used. In contrast, (3) a small ensemble paired with an abstention strategy can effectively mitigate measurable predictive multiplicity in practice; predictions with high inter-model consensus may thus be amenable to automated classification. While accuracy is not a principled antidote to predictive multiplicity, we find that (4) higher accuracy achieved through increased model capacity reduces predictive multiplicity. Our findings underscore the clinical importance of accounting for model multiplicity and advocate for ensemble-based strategies to improve diagnostic reliability. In cases where models fail to reach sufficient consensus, we recommend deferring decisions to expert review.

Problem

Research questions and friction points this paper is trying to address.

Analyzing predictive multiplicity in medical machine learning models

Identifying arbitrary predictions from equally valid model choices

Proposing ensemble strategies to mitigate diagnostic unreliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble models mitigate predictive multiplicity risks

Abstention strategy enhances diagnostic reliability

Higher model capacity reduces prediction conflicts

🔎 Similar Papers

Conformal online model aggregation