Low-Resolution Self-Attention for Semantic Segmentation

📅 2023-10-08
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
To address the challenge of balancing global context modeling with computational efficiency in high-resolution semantic segmentation, this paper proposes a Low-Resolution Self-Attention (LRSA) mechanism: it models long-range dependencies in a downsampled low-resolution feature space, decoupling global context aggregation from high-resolution computation, and employs 3×3 depth-wise convolutions to recover fine-grained details. Building upon LRSA, the authors introduce LRFormer, an encoder-decoder vision Transformer that challenges the prevailing assumption that effective segmentation requires high-resolution context modeling. Evaluated on ADE20K, COCO-Stuff, and Cityscapes, LRFormer achieves state-of-the-art mIoU with significantly fewer FLOPs. The source code is publicly available.
๐Ÿ“ Abstract
Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often utilize high-resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost, i.e., FLOPs. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution, with additional 3x3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models. The code is available at https://github.com/yuhuan-wu/LRFormer.
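The core idea in the abstract — attention computed on a fixed low-resolution grid regardless of the input resolution, with the result mapped back to high resolution — can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the random projection matrices stand in for learned weights, the pooling/upsampling choices (block average, nearest-neighbour) are assumptions, and the 3×3 depth-wise convolution that restores fine detail is only noted in a comment.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lrsa(x, pool_hw=(4, 4), seed=0):
    """Low-Resolution Self-Attention sketch: attention runs on a fixed
    (h, w) low-resolution grid no matter the input size, then the
    result is upsampled back to full resolution."""
    H, W, C = x.shape
    h, w = pool_hw  # fixed low-res grid; cost is O((h*w)^2), not O((H*W)^2)
    # 1) Average-pool features down to the fixed low-resolution grid.
    x_low = x.reshape(h, H // h, w, W // w, C).mean(axis=(1, 3)).reshape(-1, C)
    # 2) Plain single-head self-attention among the h*w low-res tokens.
    #    (Hypothetical random projections stand in for learned weights.)
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = x_low @ Wq, x_low @ Wk, x_low @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))               # (h*w, h*w)
    out_low = (attn @ v).reshape(h, w, C)
    # 3) Nearest-neighbour upsample back to (H, W); in the paper a 3x3
    #    depth-wise convolution then recovers fine high-resolution detail.
    out = out_low.repeat(H // h, axis=0).repeat(W // w, axis=1)
    return x + out                                     # residual connection

x = np.random.default_rng(1).standard_normal((16, 16, 8))
y = lrsa(x)  # same shape as the input, (16, 16, 8)
```

Note the key property the abstract emphasizes: because `pool_hw` is fixed, the attention matrix stays (h·w)×(h·w) even as H and W grow, so the quadratic attention cost no longer scales with image resolution.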
Problem

Research questions and friction points this paper is trying to address.

Semantic Segmentation
High-Definition Images
Efficient Computation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Resolution Self-Attention
LRFormer Model
Efficient High-Definition Image Processing