🤖 AI Summary
Medical image segmentation faces significant challenges: strong modality heterogeneity, the high cost of pixel-level annotations, and the limited adaptability of general-purpose models such as SAM across diverse imaging modalities and anatomical structures. To address these issues, this work proposes SegMoTE, a framework that preserves SAM's prompt interface and zero-shot capabilities while introducing a minimal set of learnable parameters for dynamic cross-modal adaptation. SegMoTE combines a token-level mixture-of-experts (MoE) mechanism with a progressive auto-prompting strategy, enabling fully automatic segmentation and substantially reducing reliance on manual annotations. Trained exclusively on MedSeg-HQ, a curated high-quality medical dataset less than 1% the size of existing large-scale collections, SegMoTE achieves state-of-the-art performance across multiple imaging modalities and anatomical tasks, demonstrating for the first time that a general-purpose segmentation model can be adapted to medical images efficiently, robustly, and scalably under an extremely low annotation budget.
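The summary names a token-level mixture-of-experts mechanism but does not spell out how it works. As a rough intuition only, a minimal sketch of generic top-k token routing is shown below; all shapes, names, and the use of linear experts are illustrative assumptions, not SegMoTE's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_moe(tokens, w_router, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:   (n, d) token embeddings
    w_router: (d, n_experts) router weights
    experts:  list of (d, d) expert weight matrices (toy linear experts)
    """
    logits = tokens @ w_router                                   # (n, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]                   # top-k expert ids per token
    gates = softmax(np.take_along_axis(logits, topk, axis=-1))   # renormalized gate weights
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        for j, e in enumerate(topk[i]):
            out[i] += gates[i, j] * (tok @ experts[e])           # gated mixture per token
    return out

d, n_experts = 8, 4
tokens = rng.normal(size=(5, d))
w_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
print(token_moe(tokens, w_router, experts).shape)  # (5, 8)
```

Because routing happens per token rather than per image, different regions of the same scan can be handled by different experts, which is the usual argument for token-level MoE in heterogeneous settings.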
📝 Abstract
Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models such as SAM have achieved remarkable progress, their transfer to medical imaging faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without data selection, leading to noisy supervision, higher training cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to adapt dynamically across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves state-of-the-art performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of vision foundation models in clinical applications.
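The abstract describes progressive auto-prompting only at a high level: the model segments without manual clicks by generating its own prompts in rounds. A minimal sketch of one plausible loop is given below; the confidence-based prompt selection, the `segment_fn` interface, and the toy segmenter are all illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def auto_prompt_step(prob_map, prompts):
    """Pick the most confident not-yet-prompted pixel as the next point prompt."""
    masked = prob_map.copy()
    for (r, c) in prompts:
        masked[r, c] = -np.inf  # avoid re-selecting an existing prompt location
    r, c = np.unravel_index(np.argmax(masked), masked.shape)
    return prompts + [(int(r), int(c))]

def progressive_auto_prompt(segment_fn, image, n_rounds=3):
    """Iteratively grow a prompt set from the model's own confidence map."""
    prompts = []
    prob_map = segment_fn(image, prompts)      # initial unprompted prediction
    for _ in range(n_rounds):
        prompts = auto_prompt_step(prob_map, prompts)
        prob_map = segment_fn(image, prompts)  # re-segment with the richer prompt set
    return prob_map > 0.5, prompts

# Toy stand-in for a promptable segmenter: confidence follows pixel intensity,
# and each point prompt slightly boosts confidence at its own intensity level.
def toy_segment_fn(image, prompts):
    prob = image / image.max()
    for (r, c) in prompts:
        prob = np.clip(prob + 0.1 * (image == image[r, c]), 0.0, 1.0)
    return prob

rng = np.random.default_rng(1)
image = rng.random((16, 16))
mask, prompts = progressive_auto_prompt(toy_segment_fn, image, n_rounds=3)
print(len(prompts), mask.shape)  # 3 (16, 16)
```

The design choice being sketched is that each round's prediction supplies the prompts for the next, so no human click is needed anywhere in the loop.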