🤖 AI Summary
Fetal ultrasound image quality critically impacts the accuracy of biometric measurements, yet low contrast and artifacts hinder existing deep learning methods by introducing confounding information. To address this, we propose a hierarchical, interpretable concept bottleneck model that emulates expert-level, stepwise reasoning: first extracting task-relevant visual concepts via semantic segmentation, then modeling semantic attribute concepts strongly correlated with image quality. We introduce a novel progressive two-tier concept bottleneck architecture that explicitly embeds human-readable concepts into fine-grained quality assessment—ensuring interpretability while robustly handling real-world challenges. Evaluated on an internal dataset, our model outperforms same-scale concept-free baselines. Moreover, it achieves zero-shot cross-domain generalization on two public benchmarks—Spain and Africa—with statistically significant performance gains over prior methods.
📝 Abstract
The quality of fetal ultrasound screening scans directly influences the precision of biometric measurements. However, acquiring high-quality scans is labor-intensive and relies heavily on the operator's skill. Given the low contrast and imaging artifacts that are widespread in ultrasound, even a dedicated deep-learning model is vulnerable to learning from confounding information in the image. In this paper, we propose a holistic and explainable method for fetal ultrasound quality assessment: we design a hierarchical concept bottleneck model that introduces human-readable "concepts" into the task and imitates the sequential decision-making process of experts. This hierarchical information flow forces the model to learn concepts from semantically meaningful areas: the model first passes through a layer of visual, segmentation-based concepts, and then through a second layer of property concepts directly associated with the decision-making task. We cast quality assessment in a more challenging but more realistic setting, as a fine-grained image recognition problem. Experiments show that our model outperforms equivalent concept-free models on an in-house dataset, and generalizes better to two public benchmarks, one from Spain and one from Africa, without any fine-tuning.
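The hierarchical information flow described above — image features passing through a tier of visual concepts, then a tier of property concepts, before the final quality decision — can be sketched as a chain of bottleneck layers. The sketch below is a minimal illustration with NumPy, not the paper's actual architecture: all dimensions, layer shapes, and the choice of random linear maps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
FEAT_DIM = 64     # backbone image-feature size
N_VISUAL = 8      # tier 1: segmentation-based visual concepts
N_PROPERTY = 4    # tier 2: property concepts tied to the decision
N_CLASSES = 3     # fine-grained quality grades

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random linear maps stand in for learned layers.
W_visual = rng.normal(size=(FEAT_DIM, N_VISUAL))
W_property = rng.normal(size=(N_VISUAL, N_PROPERTY))
W_quality = rng.normal(size=(N_PROPERTY, N_CLASSES))

def forward(features):
    """Two-tier bottleneck: features -> visual concepts -> property concepts -> grade.

    The final prediction depends on the image only through the two
    human-readable concept layers, which is what makes the model
    interpretable by construction.
    """
    visual = sigmoid(features @ W_visual)    # tier 1: visual concept activations in (0, 1)
    prop = sigmoid(visual @ W_property)      # tier 2: property concept activations in (0, 1)
    logits = prop @ W_quality                # final fine-grained quality logits
    return visual, prop, logits

features = rng.normal(size=(2, FEAT_DIM))    # e.g. pooled features for 2 images
visual, prop, logits = forward(features)
print(visual.shape, prop.shape, logits.shape)
```

Because each tier is a narrow, human-readable bottleneck, an assessor can inspect the intermediate concept activations to see *why* a scan received a given grade, rather than only the final score.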