USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in medical image segmentation: the quadratic computational cost of conventional self-attention and its limited ability to integrate local and global context. The authors propose USEMA, a hybrid UNet architecture that combines convolutional neural networks with SEMA (Scalable and Efficient Mamba like Attention). SEMA uses local window attention to keep attention focused and avoid dispersion, and it captures global context through arithmetic averaging. Experiments across a variety of imaging modalities and image sizes show improved computational efficiency compared to transformer-based models that use full self-attention, and superior segmentation performance relative to purely convolutional and Mamba-based models.
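To make the cost argument concrete, the sketch below counts the query-key pairs scored by full self-attention and by attention restricted to non-overlapping windows. It is an illustration only: the function names, the 256x256 image, the 4x4 patch size, and the window of 64 tokens are all assumptions chosen for the arithmetic and are not taken from the paper.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored when every token attends to every token."""
    return num_tokens * num_tokens


def window_attention_pairs(num_tokens: int, window_size: int) -> int:
    """Query-key pairs scored when tokens attend only inside their own window."""
    if num_tokens % window_size != 0:
        raise ValueError("num_tokens must be divisible by window_size")
    num_windows = num_tokens // window_size
    return num_windows * window_size * window_size


# Hypothetical setup: a 256x256 image cut into 4x4 patches gives 64 * 64 = 4096 tokens.
tokens = (256 // 4) ** 2
print(full_attention_pairs(tokens))        # 16777216
print(window_attention_pairs(tokens, 64))  # 262144
```

Under these assumed sizes the windowed variant scores 64 times fewer pairs, and its cost grows linearly with the number of tokens once the window size is fixed.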

📝 Abstract

Accurate medical image segmentation is an integral part of the medical image analysis pipeline and requires the ability to merge local and global information. While vision transformers can capture global interactions using vanilla self-attention, their computational complexity, which is quadratic in the input size, remains an obstacle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and the recent development of the Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) uses token localization via local window attention to avoid dispersion and maintain focus, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer-based models that use full self-attention, and superior segmentation performance relative to purely convolutional and Mamba-based models.
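The two ingredients named in the abstract, local window attention and arithmetic averaging, can be illustrated with a minimal pure-Python sketch. This is not the authors' SEMA implementation: the abstract does not say how the local and global terms are combined, so the plain addition used here is an assumption, as are all function names, the use of the tokens themselves as queries, keys, and values, and the non-overlapping 1D windows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def window_attention(tokens, window_size):
    """Softmax attention restricted to non-overlapping windows.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), which keeps the sketch short.
    """
    dim = len(tokens[0])
    out = []
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        for query in window:
            weights = softmax([dot(query, key) / math.sqrt(dim) for key in window])
            out.append([sum(w * v[d] for w, v in zip(weights, window))
                        for d in range(dim)])
    return out


def global_mean(tokens):
    """Arithmetic average of all tokens: one global summary vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def local_plus_global(tokens, window_size):
    """Assumed combination rule: add the global mean to each local output."""
    summary = global_mean(tokens)
    return [[x + g for x, g in zip(row, summary)]
            for row in window_attention(tokens, window_size)]
```

For example, `local_plus_global([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]], 2)` returns one output vector per input token. The local term only mixes tokens that share a window, while the global mean costs a single pass over all tokens.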

Problem

Research questions and friction points this paper is trying to address.

medical image segmentation

self-attention

computational complexity

local-global information fusion

vision transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-like attention

local window attention

medical image segmentation