MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address detail loss and high computational overhead in few-shot semantic segmentation, this paper proposes an efficient, accurate Transformer-based approach. Methodologically, it introduces (1) a spatial Transformer decoder coupled with a contextual mask generation module; (2) a multi-scale hierarchical decoding mechanism that fuses intermediate-layer global features to sharpen fine-grained localization; and (3) prototype-guided multi-scale feature pyramid decoding integrated with spatial-attention-driven support-query relational modeling. With only 1.5M parameters, the model achieves state-of-the-art results on the PASCAL-5ⁱ and COCO-20ⁱ benchmarks in both 1-shot and 5-shot settings, combining strong accuracy, generalization across base and novel classes, and fast inference while remaining architecturally compact.

📝 Abstract
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. https://github.com/amirrezafateh/MSDNet
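The prototype-guided support-query matching mentioned in the abstract can be illustrated with the standard building blocks of this family of methods: masked average pooling over support features to form a class prototype, then per-pixel cosine similarity against query features to produce a coarse foreground prior. The sketch below is a minimal pure-Python illustration of that general technique, not the paper's actual implementation; all function names are hypothetical.

```python
import math

def masked_average_pool(support_feats, support_mask):
    """Class prototype: average the support feature vectors at
    foreground positions (mask == 1).

    support_feats: H x W grid of C-dim feature vectors (nested lists)
    support_mask:  H x W binary mask
    """
    C = len(support_feats[0][0])
    proto = [0.0] * C
    count = 0
    for row_f, row_m in zip(support_feats, support_mask):
        for feat, m in zip(row_f, row_m):
            if m:
                proto = [p + f for p, f in zip(proto, feat)]
                count += 1
    return [p / max(count, 1) for p in proto]

def cosine_prior_map(query_feats, prototype):
    """Per-pixel cosine similarity between query features and the
    prototype: a coarse prior that a decoder would then refine."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a)) or 1e-8
        nb = math.sqrt(sum(x * x for x in b)) or 1e-8
        return dot / (na * nb)
    return [[cos(f, prototype) for f in row] for row in query_feats]
```

In practice this runs on deep backbone features rather than raw pixels, and the similarity map is one input to the decoder rather than the final mask.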
Problem

Research questions and friction points this paper is trying to address.

Few-shot Semantic Segmentation
Detail Preservation
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based architecture
Multi-scale decoding
Efficient few-shot semantic segmentation
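The multi-scale decoding idea above, refining a mask by fusing predictions from coarse to fine resolutions, can be sketched in a few lines. This is a hedged pure-Python toy (nearest-neighbor upsampling and simple averaging), standing in for the learned upsampling and fusion a real decoder would use; the function names are illustrative only.

```python
def upsample2x(grid):
    """Nearest-neighbor 2x upsampling of a 2-D score map."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def hierarchical_fuse(maps_coarse_to_fine):
    """Fuse score maps from coarsest to finest: upsample the running
    estimate and average it with the next finer-resolution map."""
    fused = maps_coarse_to_fine[0]
    for finer in maps_coarse_to_fine[1:]:
        up = upsample2x(fused)
        fused = [[(u + f) / 2 for u, f in zip(up_row, fine_row)]
                 for up_row, fine_row in zip(up, finer)]
    return fused
```

A real multi-scale decoder replaces the fixed averaging with learned convolutions or attention, but the coarse-to-fine control flow is the same.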
Amirreza Fateh
School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
Mohammad Reza Mohammadi
School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
M. Jahed-Motlagh
School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran