FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

πŸ“… 2025-12-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The fetal ultrasound domain lacks a standardized benchmark for evaluating vision-language models (VLMs). Method: We introduce Fetal-Gauge, the first large-scale visual question-answering benchmark for fetal ultrasound, comprising over 42,000 clinical ultrasound images and 93,000 high-quality question-answer pairs covering five core clinical tasks: anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. Contribution/Results: Fetal-Gauge enables the first systematic, standardized evaluation of VLMs in fetal ultrasound. Empirical evaluation reveals that state-of-the-art general-purpose and medical-domain-specific VLMs reach at best ~55% accuracy, far below the clinically required threshold (β‰₯90%). This gap highlights fundamental limitations in handling operator-dependent variability, low-contrast imaging artifacts, and anatomical ambiguity. Our findings underscore the necessity of domain-adapted architectures and specialized training strategies, establishing Fetal-Gauge as critical evaluation infrastructure to guide the future development and clinical translation of fetal ultrasound AI systems.
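The headline metric above is per-task and overall answer accuracy on closed-ended QA pairs. A minimal scoring sketch is below; the record schema (`task`, `question`, `answer` fields) and the trivial always-"yes" predictor are hypothetical illustrations, since the paper does not specify a release format.

```python
# Minimal sketch of per-task accuracy scoring for a VQA benchmark
# like Fetal-Gauge. Field names and the record format are assumed,
# not taken from the paper's (not-yet-released) data schema.
from collections import defaultdict

def score_by_task(records, predict):
    """Return (per-task accuracy dict, overall accuracy) for closed-ended QA.

    records: iterable of dicts with 'task', 'question', 'answer' keys
    predict: callable mapping a record to the model's answer string
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Exact-match comparison after light normalization
        if predict(rec).strip().lower() == rec["answer"].strip().lower():
            correct[rec["task"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall

# Toy example: a degenerate predictor that always answers "yes".
sample = [
    {"task": "plane_identification", "question": "Is this a standard plane?", "answer": "yes"},
    {"task": "plane_identification", "question": "Is this a standard plane?", "answer": "no"},
    {"task": "anomaly_screening", "question": "Is an anomaly visible?", "answer": "yes"},
]
per_task, overall = score_by_task(sample, lambda rec: "yes")
```

Averaging over tasks rather than over all QA pairs would weight the five clinical tasks equally regardless of their pair counts; which convention the paper uses is not stated here.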

πŸ“ Abstract
The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the rapid expansion of VLMs, no standardized benchmark exists to evaluate their performance on fetal ultrasound imaging. This gap stems primarily from the modality's challenging nature, its operator dependency, and the limited public availability of datasets. To address it, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across diverse fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be made publicly available upon acceptance of the paper.
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs on fetal ultrasound tasks with a benchmark
Addresses lack of standardized VLM assessment in prenatal imaging
Identifies performance gaps in clinical ultrasound interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

First VQA benchmark for fetal ultrasound evaluation
Largest dataset with 42k images and 93k QA pairs
Systematic evaluation shows the best model reaches only 55% accuracy