SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To simultaneously achieve global context modeling, local detail preservation, and multi-scale feature extraction in semantic segmentation, this paper proposes SegMAN, a linear-complexity architecture. Its hybrid encoder integrates sliding-window local attention with dynamic state space models, and its decoder features MMSCopE, a multi-scale context extraction module that adapts to the input resolution. The SegMAN-B encoder reaches 85.1% top-1 accuracy on ImageNet-1K; the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 at half the GFLOPs), and outperforms VWFormer-B3 on COCO-Stuff at lower computational cost.

📝 Abstract
High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. Our SegMAN-B Encoder achieves 85.1% ImageNet-1k accuracy (+1.5% over VMamba-S with fewer parameters). When paired with our decoder, the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer-B3 on COCO-Stuff with lower GFLOPs. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
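The abstract's central design idea, pairing sliding local attention (fine detail) with a linear-time state-space recurrence (global context), can be illustrated with a toy sketch. This is not the paper's implementation: the window size, the scalar decay `a`, and the additive fusion are all simplifying assumptions chosen only to show how the two token-mixing mechanisms complement each other on a 1D token sequence.

```python
import numpy as np

def local_window_attention(x, window=4):
    """Sliding-window self-attention: each token attends only to
    neighbours within +/- `window` positions. x has shape (T, d)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                # dot-product scores (T, T)
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)     # restrict to the local window
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # row-wise softmax
    return w @ x

def ssm_scan(x, a=0.9):
    """Minimal diagonal state-space recurrence h_t = a*h_{t-1} + x_t,
    computed in linear time; the state carries global context forward."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

def hybrid_block(x, window=4):
    """Hypothetical fusion: residual sum of local-attention output
    (detail) and the SSM pass (global context)."""
    return x + local_window_attention(x, window) + ssm_scan(x)

tokens = np.random.default_rng(0).standard_normal((16, 8))
mixed = hybrid_block(tokens)
print(mixed.shape)  # (16, 8)
```

The key property the sketch preserves is complexity: the windowed attention costs O(T·window·d) in practice and the scan is O(T·d), so the block stays linear in sequence length rather than quadratic.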
Problem

Research questions and friction points this paper is trying to address.

Enabling global context and local detail modeling simultaneously
Enhancing multi-scale feature extraction for varying resolutions
Improving semantic segmentation efficiency and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid encoder with local attention and state space models
MMSCopE module for multi-scale feature extraction
Linear-time model for efficient global context modeling
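The multi-scale idea behind a module like MMSCopE can be sketched as pooling a feature map at several scales, upsampling each pooled map back, and fusing the results. The scale set, average pooling, nearest-neighbour upsampling, and mean fusion below are all illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a feature map (H, W, C) by integer factor k."""
    H, W, C = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def upsample(x, k):
    """Nearest-neighbour upsample by integer factor k."""
    return x.repeat(k, axis=0).repeat(k, axis=1)

def multiscale_context(x, scales=(1, 2, 4)):
    """Hypothetical multi-scale fusion: pool at several scales,
    upsample back to the input size, and average the results."""
    H, W, C = x.shape
    fused = np.zeros_like(x)
    for k in scales:
        ctx = upsample(avg_pool(x, k), k) if k > 1 else x
        fused += ctx[:H, :W]                 # crop in case of ragged division
    return fused / len(scales)

feat = np.random.default_rng(1).standard_normal((8, 8, 4))
print(multiscale_context(feat).shape)  # (8, 8, 4)
```

Because each pooling scale summarizes progressively larger regions, the fused map mixes fine and coarse context; a resolution-adaptive module would additionally choose the scale set based on the input size.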