Low-Resolution Self-Attention for Semantic Segmentation

📅 2023-10-08
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
To address the challenge of balancing global context modeling with computational efficiency in high-resolution semantic segmentation, this paper proposes a Low-Resolution Self-Attention (LRSA) mechanism: it models long-range dependencies in a downsampled low-resolution feature space, decoupling global context aggregation from high-resolution computation, and employs 3×3 depth-wise convolutions to recover fine-grained details. Building upon LRSA, the authors introduce LRFormer, an encoder-decoder vision Transformer that challenges the prevailing assumption that effective segmentation requires high-resolution context modeling. Evaluated on ADE20K, COCO-Stuff, and Cityscapes, LRFormer achieves state-of-the-art mIoU with significantly fewer FLOPs. The source code is publicly available.
๐Ÿ“ Abstract
Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often utilize high-resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost, i.e., FLOPs. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution, with additional 3x3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models. The code is available at https://github.com/yuhuan-wu/LRFormer.
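The core idea in the abstract — attention computed on a fixed low-resolution grid regardless of the input resolution, with the result mapped back to high resolution — can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the random projection matrices stand in for learned weights, the pooling/upsampling choices (block average, nearest-neighbour) are assumptions, and the 3×3 depth-wise convolution that restores fine detail is only noted in a comment.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lrsa(x, pool_hw=(4, 4), seed=0):
    """Low-Resolution Self-Attention sketch: attention runs on a fixed
    (h, w) low-resolution grid no matter the input size, then the
    result is upsampled back to full resolution."""
    H, W, C = x.shape
    h, w = pool_hw  # fixed low-res grid; cost is O((h*w)^2), not O((H*W)^2)
    # 1) Average-pool features down to the fixed low-resolution grid.
    x_low = x.reshape(h, H // h, w, W // w, C).mean(axis=(1, 3)).reshape(-1, C)
    # 2) Plain single-head self-attention among the h*w low-res tokens.
    #    (Hypothetical random projections stand in for learned weights.)
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = x_low @ Wq, x_low @ Wk, x_low @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))               # (h*w, h*w)
    out_low = (attn @ v).reshape(h, w, C)
    # 3) Nearest-neighbour upsample back to (H, W); in the paper a 3x3
    #    depth-wise convolution then recovers fine high-resolution detail.
    out = out_low.repeat(H // h, axis=0).repeat(W // w, axis=1)
    return x + out                                     # residual connection

x = np.random.default_rng(1).standard_normal((16, 16, 8))
y = lrsa(x)  # same shape as the input, (16, 16, 8)
```

Note the key property the abstract emphasizes: because `pool_hw` is fixed, the attention matrix stays (h·w)×(h·w) even as H and W grow, so the quadratic attention cost no longer scales with image resolution.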
Problem

Research questions and friction points this paper is trying to address.

Semantic Segmentation
High-Definition Images
Efficient Computation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Resolution Self-Attention
LRFormer Model
Efficient High-Definition Image Processing