🤖 AI Summary
This work addresses the challenges of feature misalignment and contextual deficiency in high-resolution feature reconstruction for low-latency semantic segmentation, which are often exacerbated by the high computational cost of conventional approaches. The authors propose Guided Attentive Interpolation (GAI), a novel method that, for the first time, integrates cross-level attention mechanisms into the upsampling process. By explicitly modeling spatial and semantic relationships across multi-scale features, GAI adaptively generates high-resolution features that are both precisely aligned and semantically rich. The approach can be plugged into any lightweight convolutional network, significantly enhancing both accuracy and efficiency. On the Cityscapes and CamVid benchmarks, the method achieves state-of-the-art results of 78.8 mIoU at 22.3 FPS and 80.6 mIoU at 64.5 FPS, respectively, setting new records in low-latency semantic segmentation.
📝 Abstract
Semantic segmentation is a fundamental computer-vision problem that requires high-resolution feature maps for dense prediction. Current coordinate-guided low-resolution feature interpolation methods, e.g., bilinear interpolation, produce coarse high-resolution features that suffer from feature misalignment and insufficient context information. Moreover, enriching high-resolution features with semantics imposes a heavy computation burden, making it challenging to meet the requirement of low-latency inference. We propose a novel Guided Attentive Interpolation (GAI) method that adaptively interpolates fine-grained high-resolution features with semantic features to tackle these issues. Guided Attentive Interpolation determines both spatial and semantic relations between pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics. GAI can be integrated into any deep convolutional network for efficient semantic segmentation. In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, achieve 78.8 mIoU at 22.3 FPS on Cityscapes and 80.6 mIoU at 64.5 FPS on CamVid using an NVIDIA 1080Ti GPU, which are new state-of-the-art results for low-latency semantic segmentation. Code and models are available at: https://github.com/hustvl/simpleseg.
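The abstract describes GAI only at a high level: each high-resolution pixel attends to low-resolution features using both spatial proximity and semantic similarity, and the attention weights drive the interpolation. The sketch below is one plausible reading of such cross-level attentive upsampling in plain NumPy, not the paper's actual formulation; the function name, the `k x k` neighborhood, and the dot-product similarity are all assumptions for illustration.

```python
import numpy as np

def attentive_upsample(f_low, f_high, k=3):
    """Illustrative sketch: upsample f_low (C, h, w) to the resolution of
    f_high (C, H, W) by letting each high-res pixel attend over the k x k
    low-res neighborhood around its corresponding coordinate."""
    C, h, w = f_low.shape
    _, H, W = f_high.shape
    r = k // 2
    # zero-pad the low-res map so border queries see a full neighborhood
    pad = np.pad(f_low, ((0, 0), (r, r), (r, r)))
    out = np.zeros_like(f_high)
    for y in range(H):
        for x in range(W):
            # nearest low-res coordinate of this high-res query (spatial relation)
            ly, lx = int(y * h / H), int(x * w / W)
            keys = pad[:, ly:ly + k, lx:lx + k].reshape(C, -1)  # (C, k*k)
            q = f_high[:, y, x]                                 # query (C,)
            # scaled dot-product similarity (semantic relation)
            logits = q @ keys / np.sqrt(C)                      # (k*k,)
            wgt = np.exp(logits - logits.max())
            wgt /= wgt.sum()                                    # softmax weights
            out[:, y, x] = keys @ wgt                           # attentive interpolation
    return out
```

In a real network the query, key, and value features would come from learned projections of encoder stages at different scales; the nested Python loops here only make the per-pixel attention explicit and would be replaced by batched tensor operations in practice.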