🤖 AI Summary
Deep models for medical image segmentation often exhibit overconfidence, compromising clinical reliability. To address this, we propose a differentiable, image-level marginal L1 average calibration error (mL1-ACE) as an end-to-end trainable auxiliary loss, marking the first integration of calibration objectives into segmentation network optimization. We design hard- and soft-binning strategies for pixel-wise confidence calibration and introduce a dataset-level reliability histogram that enables visualization and analysis of calibration performance across samples. Evaluated on four major benchmarks (ACDC, AMOS, KiTS, and BraTS), our method significantly reduces both average calibration error (ACE) and maximum calibration error (MCE). Soft binning achieves the strongest calibration, while hard binning delivers robust calibration gains with negligible impact on segmentation accuracy (ΔDSC < 0.3%). The approach thus advances trustworthy segmentation by jointly optimizing accuracy and calibration in a unified framework.
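The per-image calibration error underlying the loss can be illustrated with a minimal hard-binned sketch. This is NumPy pseudocode for clarity, with illustrative function names; the trainable loss in the authors' repository operates on autodiff tensors inside the training graph. Pixels are grouped into confidence bins, and the absolute gap between each bin's accuracy and mean confidence is averaged over non-empty bins:

```python
import numpy as np

def l1_ace_hard(confidences, correct, n_bins=10):
    """Hard-binned L1 average calibration error for one image.

    confidences: flat array of per-pixel predicted probabilities
    correct:     flat 0/1 array, 1 where the prediction matches the label
    Returns the mean over non-empty bins of |bin accuracy - bin confidence|.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins, with the right edge included in the last bin
        mask = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if mask.any():
            gaps.append(abs(correct[mask].mean() - confidences[mask].mean()))
    return float(np.mean(gaps)) if gaps else 0.0
```

For a perfectly calibrated bin (75% accuracy at 0.75 confidence) the error is zero; systematic overconfidence (e.g. 0.95 confidence with no correct pixels) yields a large error.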
📝 Abstract
Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest calibration improvements over the Dice plus cross-entropy loss baseline but often compromises segmentation performance, whereas hard-binned mL1-ACE maintains segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses
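To make the binning usable as a training loss, the soft-binned variant replaces the hard bin assignment with a smooth kernel so that gradients flow through the bin weights. The triangular weighting below is one plausible formulation, written in NumPy for brevity; the authors' exact soft-binning scheme and function names may differ, and in practice the same expressions would be written with autodiff tensors:

```python
import numpy as np

def l1_ace_soft(confidences, correct, n_bins=10):
    """Soft-binned L1-ACE sketch: each pixel contributes to nearby bins
    through a triangular kernel of one-bin width, so the expression stays
    differentiable when the arrays are autodiff tensors (illustrative only).
    """
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # weight of each pixel (rows) for each bin center (columns)
    w = np.clip(1.0 - n_bins * np.abs(confidences[:, None] - centers[None, :]),
                0.0, 1.0)
    mass = w.sum(axis=0)
    valid = mass > 0
    # weighted per-bin accuracy and mean confidence over bins with any mass
    acc = (w * correct[:, None]).sum(axis=0)[valid] / mass[valid]
    conf = (w * confidences[:, None]).sum(axis=0)[valid] / mass[valid]
    return float(np.abs(acc - conf).mean())
```

Used as an auxiliary objective, such a term would be added to the segmentation loss, e.g. Dice plus cross-entropy plus a weighted mL1-ACE term, with the weight a tunable hyperparameter.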