🤖 AI Summary
To address the challenges of automatic standard anatomical plane localization in fetal ultrasound videos (low accuracy, high manual annotation cost, and poor inter-observer consistency), this paper proposes a visual-query-driven video clip localization paradigm. Methodologically, it introduces the Multi-Tier Class-Aware Token Transformer (MCAT), which jointly models temporal dynamics and anatomical semantics; incorporates a visual-query mechanism for target-oriented retrieval; and reduces computational overhead via token sparsification. Experiments demonstrate substantial improvements: +10% and +13% mIoU on two ultrasound datasets and +5.35% mIoU on the natural-video Ego4D benchmark, while using only 4% of the original token count. The proposed method achieves a favorable trade-off among accuracy, efficiency, and generalizability, showing strong potential for clinical deployment in resource-constrained settings, particularly low- and middle-income countries (LMICs).
📝 Abstract
Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods rely primarily on image-based approaches that capture standard frames and then classify them by anatomy, ignoring the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce the Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. Given a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening and diagnosis, and allowing sonographers to examine more patients.
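To make the VQ-VCL task concrete: given a video and a visual query image, the goal is to return the clip (start/end frame indices) most relevant to the query. The sketch below is not MCAT itself but a minimal, hypothetical baseline for this interface, assuming per-frame and query embeddings are already available; it scores each frame by cosine similarity to the query and returns the longest contiguous run above a threshold. All names (`localize_clip`, `threshold`) are illustrative, not from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def localize_clip(frame_embs, query_emb, threshold=0.5):
    """Toy VQ-VCL baseline: return (start, end) indices, inclusive, of the
    longest contiguous run of frames whose similarity to the visual query
    exceeds `threshold`, or None if no frame qualifies.
    """
    sims = [cosine(f, query_emb) for f in frame_embs]
    best, cur_start = None, None
    for i, s in enumerate(sims + [-1.0]):  # sentinel closes the final run
        if s > threshold and cur_start is None:
            cur_start = i
        elif s <= threshold and cur_start is not None:
            run = (cur_start, i - 1)
            if best is None or (run[1] - run[0]) > (best[1] - best[0]):
                best = run
            cur_start = None
    return best


# Query resembles frames 1-2, so that clip is returned.
query = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(localize_clip(frames, query))  # → (1, 2)
```

A learned model such as MCAT replaces this fixed similarity-and-threshold rule with a transformer that reasons over temporal context and class-aware tokens, but the input/output contract (video + visual query in, clip boundaries out) is the same.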