🤖 AI Summary
Microscopic medical image segmentation faces challenges including complex anatomical structures, ambiguous boundaries, and cluttered backgrounds; existing CNN- and Transformer-based models suffer from limited generalizability and high computational overhead. To address these issues, we propose an efficient multi-stage feature enhancement framework: (1) a novel hybrid architecture integrating Inception-style depthwise separable convolutions with the Mamba state space model; and (2) a semantic-guided multi-frequency feature enrichment mechanism to strengthen modeling of indistinct boundaries. Evaluated on four standard benchmarks—SegPC21, GlaS, ISIC2017, and ISIC2018—the framework achieves state-of-the-art performance, improving average Dice scores by 1.2–2.8%. Moreover, it reduces parameter count and FLOPs by approximately 5× compared to competitive baselines, striking a superior balance between accuracy and efficiency.
📝 Abstract
Accurate microscopic medical image segmentation plays a crucial role in diagnosing various cancerous cells and identifying tumors. Driven by advancements in deep learning, convolutional neural networks (CNNs) and transformer-based models have been extensively studied to enlarge receptive fields and improve performance on medical image segmentation tasks. However, they often struggle to capture complex cellular and tissue structures in challenging scenarios such as background clutter and object overlap. Moreover, their reliance on large datasets for improved performance, along with their high computational cost, limits their practicality. To address these issues, we propose an efficient framework for the segmentation task, named InceptionMamba, which encodes multi-stage rich features and offers both performance and computational efficiency. Specifically, we exploit semantic cues to capture both low-frequency and high-frequency regions, enriching the multi-stage features to handle blurred region boundaries (e.g., cell boundaries). These enriched features are input to a hybrid model that combines an Inception depth-wise convolution with a Mamba block, maintaining high efficiency while capturing inherent variations in the scales and shapes of the regions of interest. The resulting features are then fused with low-resolution features to produce the final segmentation mask. Our model achieves state-of-the-art performance on two challenging microscopic segmentation datasets (SegPC21 and GlaS) and two skin lesion segmentation datasets (ISIC2017 and ISIC2018), while reducing computational cost by about 5 times compared to the previous best-performing method.
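The frequency-based enrichment described above can be illustrated with a minimal sketch: features are split into a low-frequency (smooth-region) component and a high-frequency (boundary/detail) residual, and the residual is reweighted to emphasize indistinct boundaries. This is a 1-D stand-in using a moving-average low-pass filter; the paper's actual operators, weighting scheme, and semantic guidance are not reproduced here, and the function names and weights below are illustrative assumptions.

```python
def low_pass(signal, k=3):
    """Moving-average low-pass filter (window clamped at the edges)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - k // 2), min(n, i + k // 2 + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def frequency_enrich(feat, w_low=0.5, w_high=1.5):
    """Split features into low/high-frequency parts and reweight.

    Boosting the high-frequency residual (w_high > 1) amplifies
    edge responses, mimicking boundary-aware enrichment.
    """
    low = low_pass(feat)
    high = [f - l for f, l in zip(feat, low)]  # residual = edges/detail
    return [w_low * l + w_high * h for l, h in zip(low, high)]

# A step edge (a crude stand-in for a cell boundary): after enrichment,
# the jump across the transition is larger than in the raw features.
feat = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
enriched = frequency_enrich(feat)
```

The design intuition is that smooth interiors carry low-frequency content while ambiguous boundaries live in the high-frequency residual, so reweighting the two separately sharpens boundary evidence without amplifying background clutter uniformly.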
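The hybrid block can likewise be sketched in miniature: parallel depthwise convolutions with different kernel sizes (the Inception-style multi-scale part) feed a linear state-space recurrence (the core mechanism of Mamba-style models, which mixes information over long ranges at linear cost). This is a 1-D, single-channel, pure-Python illustration under stated assumptions; the fusion by summation and all coefficients are hypothetical, not the paper's exact design.

```python
def depthwise_conv1d(x, kernel):
    """'Same'-padded 1-D convolution on a single channel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def inception_branch(x):
    """Inception-style: parallel kernels at different scales, then fused."""
    small = depthwise_conv1d(x, [0.25, 0.5, 0.25])  # local detail
    large = depthwise_conv1d(x, [0.2] * 5)          # wider context
    return [s + l for s, l in zip(small, large)]    # assumed fusion: sum

def ssm_scan(x, a=0.9, b=0.1, c=1.0):
    """Linear state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    A sequential scan like this lets Mamba-style blocks propagate
    information across the whole sequence in linear time.
    """
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return ys

def hybrid_block(x):
    """Multi-scale convolution followed by a state-space scan."""
    return ssm_scan(inception_branch(x))

# An impulse at t=0 still influences the last output via the recurrence.
y = hybrid_block([1.0, 0.0, 0.0, 0.0])
```

The efficiency argument in the abstract follows from this combination: depthwise convolutions cost far fewer parameters and FLOPs than dense convolutions, and the state-space scan replaces quadratic-cost attention with a linear-time recurrence.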