🤖 AI Summary
Existing interactive medical segmentation methods treat user interactions as isolated events, resulting in redundant corrections, slow convergence, and limited accuracy gains. To address this, we propose a memory–attention mechanism that introduces, for the first time, an explicit, incrementally updatable memory module into Vision Transformer (ViT) architectures. This module dynamically integrates historical clicks, prior masks, and segmentation states to enable cross-iteration temporal contextual modeling and incremental refinement. Our method incorporates temporal attention gating and multimodal prompt fusion within an encoder–decoder framework, supporting state-aware, continuous optimization. Experiments on multimodal medical imaging datasets demonstrate that our approach reduces average interaction rounds by 37% and improves Dice score by 2.8 percentage points, significantly surpassing the performance ceiling of conventional single-step prompting paradigms.
📝 Abstract
Interactive medical segmentation reduces annotation effort by refining predictions through user feedback. Vision Transformer (ViT)-based models, such as the Segment Anything Model (SAM), achieve state-of-the-art performance using user clicks and prior masks as prompts. However, existing methods treat interactions as independent events, leading to redundant corrections and limited refinement gains. We address this by introducing MAIS, a Memory-Attention mechanism for Interactive Segmentation that stores past user inputs and segmentation states, enabling temporal context integration. Our approach enhances ViT-based segmentation across diverse imaging modalities, achieving more efficient and accurate refinements.