🤖 AI Summary
This work addresses the challenge of achieving both high boundary precision and computational efficiency in medical image segmentation. Transformer models incur quadratic attention cost, CNNs lack global modeling capacity, and linear attention mechanisms often yield diffuse attention maps because of training instability and attention dilution. To overcome these limitations, we propose PVT-GDLA, a decoder-centric efficient Transformer architecture built around Gated Differential Linear Attention (GDLA). GDLA suppresses common-mode noise by subtracting two kernelized attention paths with a learnable channel-wise scale, mitigates attention sink through a head-specific, input-adaptive sparse gate, and improves boundary fidelity with a parallel local branch based on depthwise convolution. The method retains O(N) computational complexity and low parameter overhead, and it reports state-of-the-art accuracy on CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameter counts but lower FLOPs than CNN, Transformer, hybrid, and linear-attention baselines.
📝 Abstract
Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies in linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
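
The GDLA description above can be made concrete with a small NumPy sketch of a single head. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the kernel feature map (ELU + 1 is assumed here), how the "complementary subspaces" are formed (assumed to be the two halves of the projected queries and keys), the gate's exact form (a sigmoid of a linear projection is assumed), or how the branches are combined (a simple sum is assumed). Tokens are treated as a 1D sequence, so the depthwise convolution is 1D rather than 2D. All function and parameter names (`gdla_head`, `init_params`, `lam`, `wg`, `w_dw`) are hypothetical.

```python
import numpy as np


def feature_map(x):
    """ELU(x) + 1, a positive kernel feature map (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention in O(N): q, k are (N, d); v is (N, d_v).

    The N x N attention matrix is never formed; keys and values are
    first summarized into a (d, d_v) matrix shared by every query.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                 # (d, d_v) key/value summary
    z = phi_k.sum(axis=0)            # (d,) normalizer summary
    return (phi_q @ kv) / ((phi_q @ z)[:, None] + eps)


def depthwise_conv1d(x, w):
    """Per-channel 1D convolution over tokens with zero 'same' padding.

    x is (N, C); w is (K, C) with odd K. Each channel has its own filter.
    """
    n, k = x.shape[0], w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return sum(xp[i:i + n] * w[i] for i in range(k))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gdla_head(x, p):
    """One hypothetical GDLA head on tokens x of shape (N, C)."""
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    q1, q2 = np.split(q, 2, axis=-1)   # assumption: subspaces are halves
    k1, k2 = np.split(k, 2, axis=-1)
    a1 = linear_attention(q1, k1, v)   # first kernelized path
    a2 = linear_attention(q2, k2, v)   # second kernelized path
    diff = a1 - p["lam"] * a2          # channel-wise scale lam: (d_v,)
    gate = sigmoid(x @ p["wg"])        # input-dependent gate in (0, 1)
    local = depthwise_conv1d(v, p["w_dw"])  # local token-mixing branch
    return gate * diff + local         # assumption: branches are summed


def init_params(rng, c, d, d_v, k=3):
    """Random parameters for the sketch; d must be even."""
    s = 1.0 / np.sqrt(c)
    return {
        "wq": rng.normal(0.0, s, (c, d)),
        "wk": rng.normal(0.0, s, (c, d)),
        "wv": rng.normal(0.0, s, (c, d_v)),
        "wg": rng.normal(0.0, s, (c, d_v)),
        "lam": np.full(d_v, 0.5),
        "w_dw": rng.normal(0.0, 0.1, (k, d_v)),
    }
```

To try it, draw tokens with `rng = np.random.default_rng(0)` and `x = rng.normal(size=(16, 8))`, then call `gdla_head(x, init_params(rng, c=8, d=8, d_v=4))`; the result has one row per token and `d_v` channels. Each step costs time linear in the number of tokens `N`, which is the property the abstract emphasizes.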