AI Summary
Manual annotation of medical volumetric images (e.g., MRI/CT) is time-consuming and error-prone. Existing video segmentation foundation models, such as SAM 2, exhibit severe boundary mispropagation across slices, limiting their clinical utility in 3D medical image segmentation. To address this, we propose a novel 3D medical image segmentation framework built upon SAM 2, featuring a dual short-term/long-term memory bank coupled with independent attention modules to explicitly model local slice-wise continuity and global anatomical consistency. Our architecture integrates dual-path memory storage, hierarchical spatiotemporal attention, and modality-adaptive adapters, enabling robust multi-organ 3D segmentation (e.g., organs, bones, muscles). Evaluated on three public benchmarks, our method achieves mean Dice improvements of 0.11 to 0.14 using only 1 to 5 annotated samples per organ for fine-tuning. It significantly suppresses over-propagation and enhances boundary robustness, advancing the deployment of high-accuracy automated medical annotation in clinical practice.
Abstract
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as the Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on three public datasets covering organs, bones, and muscles across MRI and CT modalities. We show that the proposed method markedly outperforms the default SAM 2, achieving average Dice Similarity Coefficient improvements of 0.14 and 0.11 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
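To make the short/long-term memory idea concrete, the sketch below shows one way two memory banks with separate attention reads could be organized during slice-by-slice propagation: a small rolling buffer of recent slice features (local continuity) and a sparsely sampled long-term bank (global anatomical context). All names, capacities, the sampling stride, and the averaging fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a dual short/long-term memory read, NOT SLM-SAM 2's
# real code. Features are plain lists of floats for self-containment.
from collections import deque
import math

def attention(query, memories):
    """Scaled dot-product attention of one query vector over memory vectors."""
    if not memories:
        return list(query)  # empty bank: pass the query through unchanged
    d = len(query)
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(d)
              for mem in memories]
    mx = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]    # softmax over memory entries
    return [sum(w * mem[i] for w, mem in zip(weights, memories))
            for i in range(d)]

class ShortLongMemory:
    """Two banks read by separate attention calls, then fused (assumed: mean)."""
    def __init__(self, short_capacity=3, long_stride=5):
        self.short = deque(maxlen=short_capacity)  # most recent slices
        self.long = []                             # sparse, persistent slices
        self.long_stride = long_stride
        self.step = 0

    def read(self, query):
        s = attention(query, list(self.short))  # local slice-wise continuity
        l = attention(query, self.long)         # global anatomical consistency
        return [(a + b) / 2 for a, b in zip(s, l)]

    def write(self, feature):
        self.short.append(feature)
        if self.step % self.long_stride == 0:   # keep every Nth slice long-term
            self.long.append(feature)
        self.step += 1

mem = ShortLongMemory()
for i in range(10):                      # propagate through 10 slices
    feat = [float(i), 1.0, 0.0, float(i % 2)]
    fused = mem.read(feat)               # read both banks before writing
    mem.write(feat)
print(len(mem.short), len(mem.long))     # prints: 3 2
```

Keeping the two reads separate, rather than pooling both banks into one attention call, is what lets the model weight recent-slice evidence and distant anatomical context independently, which is the failure mode the abstract attributes to SAM 2's single memory bank.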