🤖 AI Summary
Medical image segmentation of minute tumors and micro-organs demands simultaneous modeling of fine-grained local detail and global contextual dependencies, yet existing shifted-window Vision Transformers (ViTs) fuse these two levels of features poorly. To address this, the authors propose HBFormer, a Hybrid-Bridge Transformer whose core innovation is a Multi-Scale Feature Fusion (MFF) decoder: an asymmetric bridge that integrates hierarchical features from a Swin backbone through a joint channel-spatial attention mechanism. Depthwise separable and dilated convolutions further expand receptive fields and improve computational efficiency. Evaluated on multiple public medical segmentation benchmarks (multi-organ, liver tumor, and bladder tumor), HBFormer achieves state-of-the-art performance, with notable gains in fine-grained boundary accuracy and long-range dependency modeling. Ablation studies confirm the contribution of each design component, and cross-dataset validation demonstrates robust generalization.
📝 Abstract
Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The 'Hybrid' design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its 'Bridge' mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depthwise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision. Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: https://github.com/lzeeorno/HBFormer.
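The abstract names three building blocks for the MFF decoder: channel attention, spatial attention, and dilated/depthwise convolutions. The paper's actual implementation lives in the linked repository; as a purely illustrative sketch of what these generic operations do (all function names, shapes, and gating choices below are assumptions, not HBFormer's code), a minimal NumPy version might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Squeeze-and-excitation-style gate (assumed form).
    x: (C, H, W); each channel is rescaled by a global statistic."""
    gate = sigmoid(x.mean(axis=(1, 2)))      # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """Per-pixel gate from the channel-pooled map (assumed form).
    x: (C, H, W); every channel shares one (H, W) mask."""
    mask = sigmoid(x.mean(axis=0))           # (H, W)
    return x * mask[None, :, :]

def depthwise_dilated_conv(x, k, dilation=2):
    """3x3 depthwise convolution with dilation and 'same' padding.
    x: (C, H, W), k: (C, 3, 3) -- one filter per channel, so the
    receptive field grows with `dilation` at constant cost."""
    C, H, W = x.shape
    pad = dilation
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(3):
            for j in range(3):
                out[c] += k[c, i, j] * xp[c,
                    i * dilation:i * dilation + H,
                    j * dilation:j * dilation + W]
    return out
```

A fused decoder stage would compose these, e.g. `spatial_attention(channel_attention(depthwise_dilated_conv(x, k)))`, with learned kernels and gating weights in the real model.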