🤖 AI Summary
This study examines how inconsistent concept annotations in dermoscopic datasets cap the accuracy that Concept Bottleneck Models (CBMs) can theoretically reach. It quantifies this inconsistency by applying rough set theory to the Derm7pt dataset, identifies conflicting concept configurations, and evaluates symmetric and asymmetric filtering strategies; symmetric filtering yields the conflict-free subset Derm7pt+. The analysis finds that 16.4% of concept configurations (50 of 305) are conflicting, giving a theoretical accuracy upper bound of 92.1% for CBMs that operate on hard concepts. After filtering, EfficientNet-B5 performs best under symmetric filtering (label F1 score 0.85, concept accuracy 0.70) and EfficientNet-B7 performs best under asymmetric filtering (label F1 score 0.82, concept accuracy 0.70). These results establish a reproducible evaluation baseline for CBMs in dermatological diagnosis.
📝 Abstract
Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.
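The consistency analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`analyze_consistency`, `symmetric_filter`), the toy profiles, and the toy labels are hypothetical. It also assumes that the accuracy ceiling is the usual majority-vote bound (a hard CBM maps each concept profile to a single label, so at best it predicts the most frequent label for every profile), and that quality of classification is the rough set ratio of samples lying outside the boundary region. Asymmetric filtering is omitted because the abstract does not define its rule.

```python
from collections import Counter, defaultdict


def analyze_consistency(profiles, labels):
    """Group samples by concept profile and measure label conflicts."""
    groups = defaultdict(list)
    for idx, profile in enumerate(profiles):
        groups[tuple(profile)].append(idx)

    inconsistent = set()  # profiles observed with more than one label
    boundary = []         # indices of samples in inconsistent profiles
    best_correct = 0      # correct predictions under per-profile majority vote
    for profile, idxs in groups.items():
        counts = Counter(labels[i] for i in idxs)
        best_correct += max(counts.values())
        if len(counts) > 1:
            inconsistent.add(profile)
            boundary.extend(idxs)

    n = len(labels)
    return {
        "n_profiles": len(groups),
        "n_inconsistent_profiles": len(inconsistent),
        "boundary_indices": sorted(boundary),
        # Fraction of samples whose profile determines the label uniquely.
        "quality_of_classification": (n - len(boundary)) / n,
        # Assumed definition: best accuracy of any profile-to-label mapping.
        "accuracy_ceiling": best_correct / n,
    }


def symmetric_filter(profiles, labels):
    """Drop every sample in the boundary region, whatever its label."""
    drop = set(analyze_consistency(profiles, labels)["boundary_indices"])
    keep = [i for i in range(len(labels)) if i not in drop]
    return [profiles[i] for i in keep], [labels[i] for i in keep]


# Toy data (hypothetical): two binary concepts, two diagnosis labels.
profiles = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 1), (1, 1)]
labels = ["nevus", "nevus", "melanoma", "melanoma", "nevus", "nevus"]

report = analyze_consistency(profiles, labels)
clean_profiles, clean_labels = symmetric_filter(profiles, labels)
clean_report = analyze_consistency(clean_profiles, clean_labels)

print(report["n_inconsistent_profiles"], report["accuracy_ceiling"])
print(len(clean_labels), clean_report["accuracy_ceiling"])
```

In the toy data, profile `(0, 1)` carries both labels, so its three samples form the boundary region and the ceiling is 5/6. After symmetric filtering the three remaining samples are fully consistent and the ceiling is 1.0. Applied to the figures in the abstract, the same definitions would give a pre-filtering quality of classification of about 705/1011 ≈ 0.70, which matches the reported 30.3% of images in inconsistent profiles.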