🤖 AI Summary
Medical image segmentation suffers from an imbalance between preserving local details and modeling global context: CNNs exhibit weak long-range dependency modeling, Transformers lack fine-grained local perception, and encoder-decoder feature fusion remains inefficient. To address this, we propose a novel decoder architecture centered on two key innovations: a multi-dilated contextual attention mechanism and a cross-channel hybrid module, which jointly enhance shallow-detail preservation and deep-semantic integration. The model combines hierarchical dilated convolutions, attention-driven feature modulation, and a Transformer-based encoder to enable efficient multi-scale context capture and hierarchical feature interaction. Evaluated on binary and multi-organ segmentation tasks, our method improves Dice scores over state-of-the-art approaches while reducing computational overhead, indicating gains in both accuracy and efficiency.
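For readers unfamiliar with the metric named above, the Dice score measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The following is a minimal sketch for flat binary masks; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made for illustration and are not taken from the paper or its repository.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1.

    Illustrative helper only; not the evaluation code used by the authors.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")

    # Number of positions where both masks are 1.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Total number of foreground elements across both masks.
    total = sum(pred) + sum(target)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(prediction, reference))
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground element. For multi-organ tasks the score is commonly computed per class and then averaged, although the exact averaging used in the paper is not stated here.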
📝 Abstract
Medical image segmentation faces challenges due to variations in anatomical structures. While convolutional neural networks (CNNs) effectively capture local features, they struggle to model long-range dependencies. Transformers mitigate this issue with self-attention mechanisms but lack the ability to preserve local contextual information. State-of-the-art models primarily follow an encoder-decoder architecture, achieving notable success. However, two key limitations remain: (1) shallow layers, which are closer to the input, capture fine-grained details, but this information is lost as data propagates through deeper layers; and (2) local details and global context are integrated inefficiently between the encoder and decoder stages. To address these challenges, we propose the MACMD-based decoder, which enhances attention mechanisms and facilitates channel mixing between encoder and decoder stages via skip connections. This design leverages hierarchical dilated convolutions, attention-driven modulation, and a cross channel-mixing module to capture long-range dependencies while preserving the local contextual details essential for precise medical image segmentation. We evaluated our approach using multiple transformer encoders on both binary and multi-organ segmentation tasks. The results demonstrate that our method outperforms state-of-the-art approaches in terms of Dice score and computational efficiency, highlighting its effectiveness in achieving accurate and robust segmentation. The code is available at https://github.com/lalitmaurya47/MACMD
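To illustrate why dilated convolutions help capture longer-range context at no extra parameter cost, the sketch below applies a 1D dilated filter and computes the receptive field of a small stack of such layers. It is a simplified stand-in, not the MACMD decoder: the 1D setting, the function names, and the dilation rates 1, 2, 4 are all assumptions chosen for illustration, and the actual layers and rates are defined in the linked repository.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation whose taps are spaced `dilation` samples apart."""
    # Extent of input covered by one application of the dilated kernel.
    span = (len(kernel) - 1) * dilation + 1
    output = []
    for start in range(len(signal) - span + 1):
        value = sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        output.append(value)
    return output


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    field = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation samples.
        field += (kernel_size - 1) * d
    return field


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    box = [1, 1, 1]
    print(dilated_conv1d(x, box, dilation=1))  # three neighbouring samples per output
    print(dilated_conv1d(x, box, dilation=2))  # same three weights, wider reach
    print(receptive_field(3, [1, 2, 4]))       # assumed rates, for illustration
```

With three weights per layer, the assumed rates 1, 2, 4 cover 15 input samples, whereas three undilated layers of the same size cover only 7. This is the general property that lets a decoder widen its context while keeping fine-grained local filters.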