🤖 AI Summary
Medical image segmentation suffers from poor model generalizability and a persistent trade-off between accuracy and computational efficiency. To address these challenges, we propose a novel CNN–Transformer–Mamba tri-branch hybrid architecture. Our approach introduces the first Mamba-based Attention Fusion mechanism, synergistically integrating CNNs’ local receptive fields, Transformers’ global contextual modeling, and Mamba’s efficient long-range sequential representation learning. Furthermore, we design a multi-scale attention decoder coupled with cross-scale collaborative attention gating to jointly enhance fine-grained local detail preservation and long-range dependency modeling. Extensive experiments on multiple public medical imaging benchmarks demonstrate that our method achieves state-of-the-art segmentation accuracy at comparable computational cost, while markedly improving cross-modality and cross-anatomical generalization. The source code and pre-trained models will be publicly released to facilitate clinical deployment.
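The paper does not spell out the tri-branch fusion here, but the idea of combining a local (CNN-style), a global (Transformer-style), and a sequential (Mamba-style) feature stream via attention weights can be illustrated with a toy NumPy sketch. All three branch implementations below are simplified stand-ins, not the paper's actual modules: the "Mamba" branch is reduced to a gated linear recurrence, and the fusion score is a simple per-branch statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 32, 8                     # sequence length (flattened patches), channels
x = rng.standard_normal((N, C))

def local_branch(x, k=3):
    """CNN stand-in: average over a local 1-D window (local receptive field)."""
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)), mode="edge")
    return np.stack([pad[i:i + k].mean(0) for i in range(len(x))])

def global_branch(x):
    """Transformer stand-in: single-head self-attention with Q = K = V = x."""
    s = x @ x.T / np.sqrt(x.shape[1])
    a = np.exp(s - s.max(1, keepdims=True))
    a /= a.sum(1, keepdims=True)
    return a @ x

def sequential_branch(x, decay=0.9):
    """Mamba stand-in: linear recurrence h_t = decay*h_{t-1} + (1-decay)*x_t,
    a much-simplified proxy for selective state-space scanning."""
    h = np.zeros_like(x)
    state = np.zeros(x.shape[1])
    for t in range(len(x)):
        state = decay * state + (1 - decay) * x[t]
        h[t] = state
    return h

def attention_fuse(branches):
    """Softmax over one scalar score per branch, then a weighted sum of features."""
    feats = np.stack(branches)                  # (3, N, C)
    scores = np.abs(feats).mean(axis=(1, 2))    # hypothetical per-branch score
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.tensordot(w, feats, axes=(0, 0))  # (N, C) fused features

fused = attention_fuse([local_branch(x), global_branch(x), sequential_branch(x)])
print(fused.shape)  # (32, 8)
```

In the real architecture each branch would operate on 2-D/3-D feature maps and the fusion weights would be learned; the sketch only shows how three complementary representations of the same tokens can be combined through a single attention-weighted sum.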
📝 Abstract
In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information during both encoding and decoding, strengthening feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.
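The co-attention gate is described only at a high level. One common realization of such gating is additive attention in the style of attention-gated U-Nets: decoder context produces per-position coefficients in (0, 1) that suppress irrelevant encoder (skip) features. A minimal NumPy sketch, with all weight matrices and shapes hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_gate(x, g, Wx, Wg, psi):
    """Gate skip features x (N, Cx) with decoder context g (N, Cg).

    Additive attention: alpha = sigmoid(relu(x Wx + g Wg) psi), then the
    skip features are scaled elementwise by their attention coefficient.
    """
    q = np.maximum(x @ Wx + g @ Wg, 0.0)        # (N, Ci) joint features
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))    # (N, 1) coefficients in (0, 1)
    return x * alpha                            # gated skip features

# Toy dimensions: N positions, skip/context/intermediate channel counts.
N, Cx, Cg, Ci = 16, 8, 4, 6
x = rng.standard_normal((N, Cx))
g = rng.standard_normal((N, Cg))
Wx = rng.standard_normal((Cx, Ci)) * 0.1
Wg = rng.standard_normal((Cg, Ci)) * 0.1
psi = rng.standard_normal((Ci, 1)) * 0.1

out = attention_gate(x, g, Wx, Wg, psi)
print(out.shape)  # (16, 8)
```

Because the coefficients lie in (0, 1), the gate can only attenuate skip features, never amplify them; applying such gates at every scale is one plausible way to implement the cross-scale feature selection the abstract describes.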