🤖 AI Summary
Medical AI models achieve high accuracy in image classification, yet they often produce highly confident, clinically catastrophic errors, such as classifying a malignant tumor as benign. These semantically incoherent mistakes severely undermine clinical trust. This work proposes a risk-calibrated learning approach that, for the first time, embeds clinical risk awareness directly into loss function design. By introducing a confusion-aware clinical severity matrix \( M \), the method explicitly separates acceptable errors caused by visual ambiguity from catastrophic structural errors, such as false negatives, during training. The framework requires no architectural modifications and supports end-to-end optimization across mainstream backbones, including CNNs and Transformers. Evaluated on four medical imaging datasets, it substantially reduces the critical error rate, achieving relative safety improvements of 20.0% to 92.4% over strong baselines such as Focal Loss.
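The summary does not state the exact form of the loss, so the following is only a minimal sketch of one plausible way a severity matrix \( M \) could enter training: standard cross-entropy plus the expected severity of the predicted distribution. The function names, the weighting factor `lam`, the three-class setup, and every entry of `SEVERITY` are illustrative assumptions, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def risk_weighted_loss(logits, target, severity, lam=1.0):
    """Cross-entropy plus an expected-severity penalty (assumed form).

    severity[i][j] is a hypothetical clinical cost of predicting class j
    when the true class is i; the diagonal is zero.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    # Probability mass on each wrong class, weighted by how harmful it is.
    expected_severity = sum(m * p for m, p in zip(severity[target], probs))
    return cross_entropy + lam * expected_severity


# Hypothetical classes: 0 = benign subtype A, 1 = benign subtype B, 2 = malignant.
SEVERITY = [
    [0.0, 0.1, 0.5],  # true benign A: confusing it with benign B is mild
    [0.1, 0.0, 0.5],  # true benign B
    [1.0, 1.0, 0.0],  # true malignant: calling it benign is the costliest error
]

# Two equally confident mistakes with identical cross-entropy.
fine_grained = risk_weighted_loss([0.0, 3.0, 0.0], target=0, severity=SEVERITY)
critical = risk_weighted_loss([3.0, 0.0, 0.0], target=2, severity=SEVERITY)
print(f"fine-grained error loss: {fine_grained:.3f}")
print(f"critical error loss:     {critical:.3f}")
```

Under this assumed form, the two mistakes are indistinguishable to plain cross-entropy, and only the severity term makes the malignant-as-benign prediction more expensive. Because the penalty touches only the loss, the network architecture is left unchanged, which is consistent with the summary's claim.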
📝 Abstract
Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. Semantically incoherent errors are high-confidence mistakes, such as classifying a malignant tumor as benign, and they differ fundamentally from acceptable errors that stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach on four medical imaging datasets: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) on all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
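The abstract reports CER and relative safety improvements without defining either. As a rough illustration, the sketch below assumes that CER is the fraction of all test samples whose (true, predicted) class pair is flagged as critical, and that relative improvement is the reduction in CER divided by the baseline CER. The function names, the set `CRITICAL_PAIRS`, and the toy labels are hypothetical and are not data from the paper.

```python
def critical_error_rate(y_true, y_pred, critical_pairs):
    """Fraction of samples whose (true, predicted) pair is flagged critical."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one sample")
    critical = sum((t, p) in critical_pairs for t, p in zip(y_true, y_pred))
    return critical / len(y_true)


def relative_improvement(cer_baseline, cer_method):
    """Reduction in CER relative to the baseline (0.25 means 25% fewer)."""
    if cer_baseline <= 0:
        raise ValueError("baseline CER must be positive")
    return (cer_baseline - cer_method) / cer_baseline


# Toy binary labels, for illustration only: 0 = benign, 1 = malignant.
# The single critical pair is a malignant case predicted as benign.
CRITICAL_PAIRS = {(1, 0)}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
baseline_pred = [0, 0, 1, 1, 0, 0, 1, 0, 0, 0]
method_pred = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]

cer_baseline = critical_error_rate(y_true, baseline_pred, CRITICAL_PAIRS)
cer_method = critical_error_rate(y_true, method_pred, CRITICAL_PAIRS)
print(f"baseline CER: {cer_baseline:.2f}")
print(f"method CER:   {cer_method:.2f}")
print(f"relative improvement: {relative_improvement(cer_baseline, cer_method):.1%}")
```

In the toy data the second set of predictions trades one extra false positive for fewer missed malignancies, which is the kind of safety-accuracy trade-off the abstract refers to. A CER normalized by the number of malignant cases instead of all samples would be an equally reasonable reading of the abstract.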