Multi-Rater Calibrated Segmentation Models

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the miscalibrated probabilistic outputs common in medical image segmentation models, a problem that worsens when expert annotations disagree. It proposes an architecture-agnostic framework that treats multi-rater annotations as ordinal information: voxel-wise annotator agreement becomes an ordered training target. By combining an ordinal-aware loss (the Ranked Probability Score) with a standard binary segmentation objective, the method aligns predicted confidence with the variability in expert annotations. Evaluated on four public datasets spanning ophthalmology, histopathology, and thoracic imaging, the approach substantially improves calibration with respect to inter-rater agreement while preserving segmentation accuracy. Calibration is assessed with a multi-rater extension of the Expected Calibration Error.
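The summary does not spell out how the multi-rater calibration metric is defined. As a rough sketch only, one plausible reading is to bin voxels by predicted foreground probability and compare the mean prediction in each bin with the mean fraction of raters who marked the voxel as foreground. The function name `multi_rater_ece`, the equal-width binning, and the use of the agreement fraction as the reference are all assumptions made for illustration, not the paper's definition.

```python
def multi_rater_ece(pred, agreement, n_bins=10):
    """Hypothetical multi-rater calibration error (illustrative assumption).

    pred      -- predicted foreground probabilities, one per voxel, in [0, 1]
    agreement -- fraction of raters labelling each voxel as foreground, in [0, 1]
    """
    if len(pred) != len(agreement):
        raise ValueError("pred and agreement must have the same length")
    n = len(pred)
    if n == 0:
        return 0.0

    # Per-bin accumulators: voxel count, summed prediction, summed agreement.
    counts = [0] * n_bins
    sum_pred = [0.0] * n_bins
    sum_agree = [0.0] * n_bins

    for p, a in zip(pred, agreement):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        sum_pred[b] += p
        sum_agree[b] += a

    # Weighted average of |mean prediction - mean rater agreement| over bins.
    ece = 0.0
    for c, sp, sa in zip(counts, sum_pred, sum_agree):
        if c > 0:
            ece += (c / n) * abs(sp / c - sa / c)
    return ece
```

Under this reading, a model that outputs 0.5 on voxels where half of the raters disagree scores as perfectly calibrated, whereas a standard single-label ECE would penalize it.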

📝 Abstract

Objective: Accurate probability estimates are essential for the safe deployment of medical image segmentation models in clinical decision-making. However, modern deep segmentation networks are often poorly calibrated, a problem exacerbated when multiple expert annotations exhibit substantial disagreement. While inter-rater variability is typically treated as noise, it provides valuable information about intrinsic annotation ambiguity that must be reflected in model confidence.

Methods: We improve the probabilistic calibration of medical image segmentation models by reformulating multi-rater supervision as an ordinal learning problem. Voxel-wise annotator agreement is treated as an ordered target, linking predictive confidence to the empirical variability in training data. This formulation allows the use of ordinal-aware scoring rules, such as the Ranked Probability Score ordinal loss, combined with a standard binary objective to preserve discriminative performance.

Results: We evaluated the proposed approach across four public segmentation benchmarks spanning ophthalmology, histopathology, and thoracic imaging. Calibration was assessed using a multi-rater extension of expected calibration error. Results consistently show that ordinal-aware training yields substantially improved calibration with respect to inter-rater agreement without degrading segmentation accuracy.

Conclusions: Treating multi-rater annotations as ordered information provides a principled and architecture-agnostic route to more reliable probabilistic segmentation models.
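To make the Methods description concrete, here is a minimal sketch of how an ordinal target and a Ranked Probability Score (RPS) term might be combined with a binary objective. It assumes that each voxel's target is the number of raters (0 to R) who marked it as foreground, that the model outputs a probability distribution over those R + 1 agreement levels, and that the binary term is a cross-entropy against the majority vote. The names `rps`, `binary_ce`, `combined_loss`, and the `weight` parameter are hypothetical; the paper's actual formulation may differ.

```python
import math


def rps(probs, level):
    """Ranked Probability Score for one voxel.

    probs -- predicted probabilities over agreement levels 0..R (sums to 1)
    level -- observed number of raters marking the voxel as foreground
    """
    cum = 0.0
    total = 0.0
    # Compare predicted and target cumulative distributions at each threshold.
    for j, p in enumerate(probs[:-1]):
        cum += p
        target_cdf = 1.0 if j >= level else 0.0
        total += (cum - target_cdf) ** 2
    return total


def binary_ce(p_fg, y, eps=1e-7):
    """Binary cross-entropy for a single foreground probability."""
    p = min(max(p_fg, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def combined_loss(voxel_probs, levels, n_raters, weight=1.0):
    """Mean of RPS + weight * BCE over voxels (assumed combination)."""
    total = 0.0
    for probs, level in zip(voxel_probs, levels):
        # Assumption: foreground probability is the mass on majority levels.
        p_fg = sum(p for j, p in enumerate(probs) if j > n_raters / 2)
        y = 1 if level > n_raters / 2 else 0
        total += rps(probs, level) + weight * binary_ce(p_fg, y)
    return total / len(levels)
```

Unlike a plain categorical cross-entropy, the RPS term grows with the distance between the predicted and observed agreement level: with three raters, placing all probability mass on level 0 costs 1 when the true level is 1 and 3 when the true level is 3.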

Problem

Research questions and friction points this paper is trying to address.

medical image segmentation

probability calibration

multi-rater annotations

inter-rater variability

annotation ambiguity

Innovation

Methods, ideas, or system contributions that make the work stand out.

ordinal learning

multi-rater calibration

medical image segmentation