🤖 AI Summary
This work addresses unreliable segmentation in ultrasound videos caused by speckle noise, weak boundaries, and rapid anatomical deformation, under a sparse-interaction setting in which only a single point click and an anatomical category name are provided in the first frame. The authors propose EchoPilot, a training-free framework that integrates a frozen medical vision-language model, a vision foundation model, and a promptable video segmentor. To alleviate initialization ambiguity, they introduce Scale-Space Semantic Prompting, which selects a contextual view using the parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion and then generates auxiliary point prompts automatically from dense foundation features. To suppress error propagation over time, a Reliability-Gated Memory update freezes the segmentor's memory bank when predictions are uncertain. Across three ultrasound video datasets, the method outperforms training-free baselines and fine-tuned specialists in the sparse-interactive setting, and the authors contribute the first dynamic fetal placenta ultrasound video segmentation dataset, with 671 annotated frames.
📝 Abstract
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, we further introduce a Reliability-Gated Memory update that selectively freezes the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset, with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and fine-tuned specialists.
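
The idea behind a reliability-gated memory update can be illustrated with a small Python sketch. It is not EchoPilot's implementation: the abstract does not define the reliability measure, so the score used here (mean foreground probability), the threshold `min_reliability`, the eviction rule, and the names `mask_reliability` and `GatedMemoryBank` are all assumptions made for illustration. The sketch shows only the gating pattern: a frame's prediction enters the memory bank when it looks reliable, and the bank is left frozen otherwise.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

ProbMap = Sequence[Sequence[float]]  # per-pixel foreground probabilities
Mask = List[List[bool]]


def mask_reliability(prob: ProbMap, fg_threshold: float = 0.5) -> float:
    """Hypothetical reliability score: mean probability over predicted
    foreground pixels. An empty prediction scores 0.0."""
    fg = [p for row in prob for p in row if p >= fg_threshold]
    return sum(fg) / len(fg) if fg else 0.0


@dataclass
class GatedMemoryBank:
    """Toy memory bank that accepts a frame only when its score is high enough."""

    min_reliability: float = 0.8  # assumed threshold, not taken from the paper
    capacity: int = 4             # assumed bank size (at least 2)
    entries: List[Mask] = field(default_factory=list)

    def update(self, prob: ProbMap) -> bool:
        """Store the binarized mask if reliable; return False if frozen."""
        if mask_reliability(prob) < self.min_reliability:
            return False  # uncertain prediction: leave memory untouched
        self.entries.append([[p >= 0.5 for p in row] for row in prob])
        if len(self.entries) > self.capacity:
            # Keep entry 0 (the first accepted frame) and evict the
            # oldest of the remaining entries.
            self.entries.pop(1)
        return True


if __name__ == "__main__":
    bank = GatedMemoryBank()
    confident = [[0.95, 0.95], [0.05, 0.95]]
    uncertain = [[0.55, 0.55], [0.55, 0.05]]
    print(bank.update(confident), bank.update(uncertain), len(bank.entries))
```

Under these assumptions, a low-confidence frame never overwrites stored memory, which is the mechanism the abstract credits with preventing early errors from accumulating into temporal drift.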