Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation

📅 2025-06-04
🤖 AI Summary
Deep models for medical image segmentation are often overconfident, which compromises clinical reliability. To address this, we propose differentiable formulations of the marginal L1 Average Calibration Error (mL1-ACE), used as an auxiliary training loss that can be computed per image. We compare hard- and soft-binning variants for pixel-wise confidence calibration and introduce dataset reliability histograms, which aggregate per-image reliability diagrams to show how calibration varies across a dataset. On four datasets (ACDC, AMOS, KiTS, and BraTS), adding mL1-ACE significantly reduces both Average Calibration Error (ACE) and Maximum Calibration Error (MCE). The soft-binned variant gives the largest calibration improvement over the Dice plus cross-entropy baseline but often reduces segmentation performance; the hard-binned variant largely preserves segmentation accuracy (ΔDSC < 0.3%) with a weaker calibration gain.

📝 Abstract
Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses
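The per-image quantity described in the abstract can be illustrated with a small NumPy sketch. It assumes the usual definition of a marginal (one-vs-rest, per-class) L1 ACE: the absolute gap between accuracy and mean confidence, averaged with equal weight over non-empty bins and then over classes. The function names, the default of 20 bins, and the Gaussian-style soft-binning kernel with its `temperature` parameter are illustrative assumptions, not details taken from the paper or its repository. A trainable version would be written in an autodiff framework; this sketch only evaluates the quantity.

```python
import numpy as np

def ml1_ace_hard(probs, labels, n_bins=20):
    """Marginal L1 ACE for one image with hard, equal-width bins.

    probs:  (K, N) per-class probabilities for N pixels.
    labels: (N,) integer ground-truth classes in [0, K).
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    per_class = []
    for k in range(probs.shape[0]):
        conf = probs[k]
        hit = (labels == k).astype(float)  # one-vs-rest target
        # Bin index per pixel; clip so conf == 1.0 lands in the last bin.
        idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
        gaps = []
        for m in range(n_bins):
            mask = idx == m
            if mask.any():  # empty bins do not contribute
                gaps.append(abs(hit[mask].mean() - conf[mask].mean()))
        per_class.append(np.mean(gaps))
    return float(np.mean(per_class))

def ml1_ace_soft(probs, labels, n_bins=20, temperature=0.01):
    """Same quantity with soft bin membership (assumed kernel: softmax
    over negative squared distance to the bin centres)."""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    per_class = []
    for k in range(probs.shape[0]):
        conf = probs[k]
        hit = (labels == k).astype(float)
        logits = -((conf[:, None] - centers[None, :]) ** 2) / temperature
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)  # (N, n_bins), rows sum to 1
        mass = w.sum(axis=0)               # soft pixel count per bin
        safe = np.maximum(mass, 1e-12)
        bin_conf = (w * conf[:, None]).sum(axis=0) / safe
        bin_acc = (w * hit[:, None]).sum(axis=0) / safe
        used = mass > 1e-6                 # treat near-empty bins as empty
        per_class.append(np.abs(bin_acc - bin_conf)[used].mean())
    return float(np.mean(per_class))

# Toy image: every pixel predicted as class 0 with confidence 0.9,
# but only half of the pixels actually belong to class 0.
probs = np.tile(np.array([[0.9], [0.1]]), (1, 10))
labels = np.array([0] * 5 + [1] * 5)
hard_value = ml1_ace_hard(probs, labels)
soft_value = ml1_ace_soft(probs, labels)
```

In the toy image both classes show a gap of 0.4 between confidence and accuracy, so both variants return approximately 0.4. Hard binning gives zero gradient with respect to bin assignment, whereas soft membership lets every pixel influence neighbouring bins, which is one plausible reason the two variants behave differently during training.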

Problem

Research questions and friction points this paper is trying to address.

Reducing overconfidence in medical image segmentation
Improving pixel-wise calibration with mL1-ACE loss
Maintaining segmentation accuracy while enhancing reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable mL1-ACE loss for calibration
Hard- and soft-binning variants for pixel-wise calibration
Dataset reliability histograms for analysis
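
The abstract describes dataset reliability histograms only as an aggregation of per-image reliability diagrams. One plausible reading, sketched below, is a 2D count: for each confidence bin, how many images fall into each accuracy bin. The function names, the 10-bin default, and the `[accuracy_bin, confidence_bin]` layout are assumptions made for illustration, and the paper's exact construction may differ.

```python
import numpy as np

def per_image_reliability(conf, hit, n_bins=10):
    """Per-bin accuracy for one image; NaN marks empty confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    acc = np.full(n_bins, np.nan)
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            acc[m] = hit[mask].mean()
    return acc

def dataset_reliability_histogram(images, n_bins=10):
    """Aggregate per-image reliability diagrams into a 2D count.

    images: iterable of (conf, hit) pairs, one pair of 1D arrays per image.
    Returns an (n_bins, n_bins) integer array indexed
    [accuracy_bin, confidence_bin].
    """
    hist = np.zeros((n_bins, n_bins), dtype=int)
    for conf, hit in images:
        acc = per_image_reliability(conf, hit, n_bins)
        for m in range(n_bins):
            if not np.isnan(acc[m]):
                # Accuracy of exactly 1.0 goes into the top accuracy bin.
                a = min(int(acc[m] * n_bins), n_bins - 1)
                hist[a, m] += 1
    return hist

# Two toy images: the first is accurate where it is confident,
# the second is overconfident in its highest-confidence bin.
image_a = (np.array([0.95, 0.95]), np.array([1.0, 1.0]))
image_b = (np.array([0.95, 0.95, 0.05]), np.array([1.0, 0.0, 0.0]))
hist = dataset_reliability_histogram([image_a, image_b])
```

Under this reading, a well-calibrated dataset concentrates its counts near the diagonal of `hist`, and the vertical spread within each confidence column shows how much calibration varies from image to image.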