🤖 AI Summary
To address label scarcity, high fine-tuning costs, and domain shift in medical image segmentation, particularly for foundation models such as SAM, this paper proposes BA-TTA-SAM, a task-agnostic test-time adaptation (TTA) framework for zero-shot segmentation that requires no source-domain data. It is the first to introduce TTA into medical zero-shot segmentation. The method combines a Gaussian prompt injection mechanism at the encoder level with cross-layer boundary-aware attention alignment within the ViT backbone, integrating low-level boundary cues with high-level semantic features during online optimization. Evaluated on four benchmark datasets (ISIC, Kvasir, BUSI, and REFUGE), BA-TTA-SAM achieves an average Dice score 12.4% higher than SAM's zero-shot baseline and outperforms existing state-of-the-art methods.
📝 Abstract
Conventional tuning methods in medical image segmentation face critical challenges because annotated data are scarce and the models are computationally expensive. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training for each downstream task. Zero-shot segmentation has therefore gained increasing attention, especially as foundation models such as SAM demonstrate promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shift, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM. The framework integrates two key mechanisms: (1) Encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) Cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four publicly available medical datasets (ISIC, Kvasir, BUSI, and REFUGE) show an average improvement of 12.4% in Dice score over SAM's zero-shot segmentation performance, and our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM without requiring any source-domain training data. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
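For readers unfamiliar with the metric behind the reported improvement, the following is a minimal sketch of the standard Dice coefficient for binary masks, Dice = 2|P ∩ T| / (|P| + |T|). It is illustrative only and is not taken from the BA-TTA-SAM repository: the function name `dice_score`, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions made here.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks (hypothetical helper).

    Assumption: two empty masks count as a perfect match (1.0).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    # Pixels marked as foreground in both masks.
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * intersection / denom)


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(dice_score(pred, target))  # 2 * 2 / (3 + 3), about 0.667
```

In a typical evaluation, a score like this would be computed per image and then averaged over each dataset; the abstract does not specify the exact averaging protocol.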