Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

📅 2025-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing ultrasound image segmentation methods struggle to simultaneously achieve strong generalization to novel tasks, low annotation dependency, and real-time inference. To address this, we propose an adaptive multi-scale vision foundation model fusion framework. Our approach introduces the first lightweight segmentation architecture that synergistically integrates Hiera—designed for hierarchical multi-scale feature extraction—with DINOv2—a self-supervised visual representation model—and incorporates a TensorRT-accelerated efficient decoder. Evaluated on six public and one institutional ultrasound datasets, our method achieves state-of-the-art accuracy with only 1%–10% annotated data, improving mean Dice score by over 20% compared to nnUNet. Moreover, it attains 77 FPS inference speed on a single GPU. To the best of our knowledge, this is the first work in ultrasound segmentation to jointly achieve high accuracy, extremely low supervision, and real-time performance.

Technology Category

Application Category

📝 Abstract
We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves $sim$77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.
Problem

Research questions and friction points this paper is trying to address.

Adapt vision models for real-time ultrasound segmentation
Overcome limitations in adaptability and manual annotations
Enhance performance with multi-scale features and DINOv2
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts hierarchical vision foundation models
Interleaves DINOv2 for enhanced features
Achieves real-time 77 FPS segmentation
🔎 Similar Papers
No similar papers found.