🤖 AI Summary
This study addresses the scarcity of annotated data for carotid artery structure segmentation in cardiovascular histopathological images. Method: We systematically evaluate the performance and stability of mainstream segmentation models—including U-Net, DeepLabV3+, SegFormer, SAM, MedSAM, and MedSAM+UNet—under few-shot learning conditions. Using Bayesian hyperparameter optimization and multiple randomized data splits, we quantify variability in model rankings across different train-validation partitions. Contribution/Results: We demonstrate that model rankings under low-data regimes are highly sensitive to data partitioning, with observed performance differences primarily attributable to statistical noise rather than intrinsic algorithmic superiority. This challenges the validity of conventional benchmarking practices in clinical low-resource settings, revealing that standard benchmark scores poorly reflect real-world clinical utility. Crucially, this work provides the first quantitative evidence of evaluation instability in few-shot medical image segmentation and advocates for a paradigm shift toward clinically grounded, robust evaluation frameworks tailored for deployment-ready model assessment.
📝 Abstract
Accurate segmentation of carotid artery structures in histopathological images is vital for advancing cardiovascular disease research and diagnosis. However, deep learning model development in this domain is constrained by the scarcity of annotated cardiovascular histopathological data. This study investigates a systematic evaluation of state-of-the-art deep learning segmentation models, including convolutional neural networks (U-Net, DeepLabV3+), a Vision Transformer (SegFormer), and recent foundation models (SAM, MedSAM, MedSAM+UNet), on a limited dataset of cardiovascular histology images. Despite employing an extensive hyperparameter optimization strategy with Bayesian search, our findings reveal that model performance is highly sensitive to data splits, with minor differences driven more by statistical noise than by true algorithmic superiority. This instability exposes the limitations of standard benchmarking practices in low-data clinical settings and challenges the assumption that performance rankings reflect meaningful clinical utility.