🤖 AI Summary
To address the challenge of jointly modeling long-range dependencies while preserving computational efficiency in real-time semantic segmentation of high-resolution images, this paper proposes a CNN-ViT hybrid bottleneck architecture. The method introduces three key innovations: (1) a Token Pyramid Extraction module for multi-scale dynamic perception; (2) a coupling mechanism between Transformer self-attention and modulated depthwise convolutions that strengthens joint spatial-contextual modeling; and (3) a Feature Fusion Consistency Enhancement module that improves the robustness of cross-scale feature alignment. To the authors' knowledge, this is the first architecture to unify multi-scale perception and contextual optimization within the bottleneck layer. Evaluated on four major benchmarks (ADE20K, Cityscapes, COCO-Stuff, and Pascal Context), it achieves state-of-the-art mIoU, outperforming parameter-matched models by +2.1% mIoU while maintaining low latency (<35 ms on 1080p input). The code will be made publicly available.
📝 Abstract
Semantic segmentation assigns a label to every pixel in an image, a critical yet challenging task in computer vision. Convolutional methods capture local dependencies well but struggle with long-range relationships, while Vision Transformers (ViTs) excel at capturing global context yet are hindered by high computational demands, especially on high-resolution inputs. Most research optimizes the encoder architecture, leaving the bottleneck, a key area for enhancing both performance and efficiency, underexplored. We propose ContextFormer, a hybrid framework that leverages the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation. The framework's efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Modulating DepthwiseConv (Trans-MDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on the ADE20K, Pascal Context, Cityscapes, and COCO-Stuff datasets show that ContextFormer significantly outperforms existing models, achieving state-of-the-art mIoU scores and setting a new benchmark for efficiency and performance. The code will be made publicly available.
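To make the three-module pipeline concrete, here is a minimal PyTorch sketch of how a TPEM → Trans-MDC → FMM bottleneck could be wired together. The module names follow the abstract, but every internal detail (strided depthwise convolutions for the pyramid, sigmoid-gated modulation in Trans-MDC, bilinear upsample-and-concatenate fusion in FMM, channel counts, head counts) is an illustrative assumption, not the paper's actual implementation.

```python
# Hedged sketch of the ContextFormer bottleneck described in the abstract.
# All internal design choices below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TPEM(nn.Module):
    """Token Pyramid Extraction Module: strided depthwise convs build a
    multi-scale pyramid from the backbone feature map (assumed design)."""
    def __init__(self, dim, num_scales=2):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1, groups=dim)
            for _ in range(num_scales)
        )

    def forward(self, x):
        feats = [x]
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # full-resolution map plus num_scales downsampled maps


class TransMDC(nn.Module):
    """Trans-MDC block: couples self-attention (global context) with a
    modulated depthwise conv (local detail); the coupling is an assumption."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.modulator = nn.Conv2d(dim, dim, 1)  # per-pixel channel gate
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        global_feat = attn_out.transpose(1, 2).reshape(b, c, h, w)
        local_feat = self.dwconv(x) * torch.sigmoid(self.modulator(x))
        return x + global_feat + local_feat  # residual fusion of both paths


class FMM(nn.Module):
    """Feature Merging Module: upsample all pyramid levels to the finest
    resolution and fuse them with a 1x1 conv (illustrative choice)."""
    def __init__(self, dim, num_levels):
        super().__init__()
        self.fuse = nn.Conv2d(dim * num_levels, dim, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]
        up = [F.interpolate(f, size=size, mode="bilinear",
                            align_corners=False) for f in feats]
        return self.fuse(torch.cat(up, dim=1))
```

A forward pass would run `TPEM` on the encoder output, apply `TransMDC` at each pyramid level, and merge the results with `FMM` before the decoder; latency scales with the coarsest levels, which is where the attention cost is cheapest.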