AI Summary
This work addresses the critical limitation of existing multi-class multiple instance learning (MIL) approaches in whole-slide image diagnosis, which neglect the clinically heterogeneous severity of different misclassification errors and thus fail to effectively suppress high-risk mistakes. To this end, we introduce, for the first time in multi-class MIL, an asymmetric misclassification severity modeling framework. By constructing a diagnostic category hierarchy, we propose a severity-weighted cross-entropy loss to intensify penalties on clinically critical errors, complemented by hierarchical probability alignment and semantic feature shuffling to enhance inter-level consistency. Furthermore, we design a medical-oriented evaluation metric grounded in Mikel's Wheel. Experiments demonstrate that our method significantly reduces clinically critical misdiagnosis rates on both public and real-world histopathology datasets, while also exhibiting strong generalization capability on natural image benchmarks.
Abstract
Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass settings, leaving clinically critical errors unaddressed. We propose a mistake-severity-aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment, and a semantic feature remix is applied to the instance bag to robustly train class priority and accommodate clinical cases involving multiple symptoms. An asymmetric Mikel's Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We also present experimental results on natural-domain data to demonstrate that the proposed method generalizes beyond medical contexts.
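To make the idea of a severity-weighted cross-entropy loss concrete, the following is a minimal sketch under stated assumptions, not the formulation used in the paper. It assumes one plausible form: the usual cross-entropy term plus a penalty on probability mass assigned to each wrong class, scaled by an asymmetric severity matrix. The function name `severity_weighted_ce`, the three-class setup, and every value in `SEVERITY` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def severity_weighted_ce(logits, target, severity, eps=1e-12):
    """Cross-entropy plus a severity-scaled penalty on wrong-class mass.

    severity[t][k] is the (assumed) clinical cost of predicting class k
    when the true class is t. The matrix need not be symmetric.
    """
    probs = softmax(logits)
    # Standard cross-entropy term on the true class.
    loss = -math.log(probs[target] + eps)
    # Extra penalty: mass on a wrong class k costs more when
    # severity[target][k] is large.
    for k, p in enumerate(probs):
        if k != target:
            loss += severity[target][k] * -math.log(1.0 - p + eps)
    return loss


# Hypothetical classes: 0 = benign, 1 = low-grade, 2 = malignant.
# Rows are true classes, columns are predicted classes.
SEVERITY = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.5],
    [3.0, 1.0, 0.0],  # calling a malignant slide benign is the worst error
]

# Same confidence in the wrong answer, different wrong class.
loss_malignant_as_benign = severity_weighted_ce([2.0, 0.0, 0.0], 2, SEVERITY)
loss_malignant_as_low = severity_weighted_ce([0.0, 2.0, 0.0], 2, SEVERITY)
# The reverse error, benign predicted as malignant, costs less.
loss_benign_as_malignant = severity_weighted_ce([0.0, 0.0, 2.0], 0, SEVERITY)
```

In this sketch, a malignant bag confidently predicted as benign incurs a larger loss than the same bag predicted as low-grade, and a larger loss than a benign bag predicted as malignant. That is the asymmetry the abstract describes. In a hierarchical setup, a loss of this kind would be applied at each level of the class hierarchy with a level-specific severity matrix.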