🤖 AI Summary
Few-shot semantic segmentation suffers from a semantic gap between support and query sets and from erroneous matching caused by visually similar but semantically conflicting regions within images. To address both problems, this paper proposes a framework that jointly optimizes cross-image and intra-image feature consistency. The key contributions are: (1) introducing class-specific high-level semantic representations to improve cross-image region localization accuracy; (2) designing a directional masking strategy that explicitly suppresses spurious feature responses with high similarity but inconsistent labels; and (3) establishing a global semantic aggregation module coupled with bidirectional support-query image interaction. Evaluated on PASCAL-5$^i$ and COCO-20$^i$ under the 1-shot setting, the method achieves absolute mIoU improvements of 1.9% and 2.1%, respectively, surpassing current state-of-the-art approaches.
📝 Abstract
The annotation bottleneck in semantic segmentation has driven significant interest in few-shot segmentation (FSS), which aims to develop segmentation models capable of generalizing rapidly to novel classes using minimal exemplars. Conventional training paradigms typically generate query prior maps by extracting masked-area features from support images and then make predictions guided by these prior maps. However, current approaches remain constrained by two critical limitations stemming from inter- and intra-image discrepancies, both of which significantly degrade segmentation performance: 1) the semantic gap between support and query images results in mismatched features and inaccurate prior maps; 2) visually similar yet semantically distinct regions within support or query images lead to false negative or false positive predictions. We propose a novel FSS method called I$^2$R, which 1) uses category-specific high-level representations that aggregate global semantic cues from support and query images, enabling more precise inter-image region localization and addressing the first limitation; and 2) applies a directional masking strategy that suppresses inconsistent support-query pixel pairs, which exhibit high feature similarity but conflicting masks, to mitigate the second limitation. Experiments demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 1.9% and 2.1% in mIoU under the 1-shot setting on the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, respectively.
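To make the two ingredients mentioned in the abstract more concrete, the following is a minimal NumPy sketch, not the I$^2$R implementation. It assumes flattened per-pixel features, cosine similarity as the matching score, and a fixed similarity threshold; the function names (`prior_map`, `suppress_inconsistent_pairs`), the threshold value, and the symmetric form of the masking are all hypothetical choices made for illustration. The paper's directional masking may differ in how it treats the support-to-query and query-to-support directions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prior_map(query_feat, support_feat, support_mask):
    """Conventional-style prior map (illustrative assumption).

    query_feat:   (Nq, C) flattened query pixel features
    support_feat: (Ns, C) flattened support pixel features
    support_mask: (Ns,)   binary foreground mask of the support image
    Returns an (Nq,) array in [0, 1]: each query pixel's best cosine
    similarity to any masked (foreground) support pixel, min-max scaled.
    """
    sim = l2_normalize(query_feat) @ l2_normalize(support_feat).T  # (Nq, Ns)
    fg = support_mask.astype(bool)
    best = sim[:, fg].max(axis=1)
    lo, hi = best.min(), best.max()
    return (best - lo) / (hi - lo + 1e-8)


def suppress_inconsistent_pairs(sim, support_mask, query_mask, threshold=0.5):
    """Zero out support-query pixel pairs that look alike but disagree in label.

    Hypothetical, symmetric simplification of the idea in the abstract:
    a pair is "inconsistent" if its similarity exceeds `threshold` while the
    query and support labels differ. Requires a query mask, so it only
    applies where ground truth is available (e.g. during training).
    """
    conflict = query_mask[:, None] != support_mask[None, :]  # (Nq, Ns)
    inconsistent = conflict & (sim > threshold)
    return np.where(inconsistent, 0.0, sim)


# Toy data: the third support pixel resembles the foreground but is background.
support_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
support_mask = np.array([1, 0, 0])
query_feat = np.array([[1.0, 0.0], [0.0, 1.0]])
query_mask = np.array([1, 0])

prior = prior_map(query_feat, support_feat, support_mask)
sim = l2_normalize(query_feat) @ l2_normalize(support_feat).T
cleaned = suppress_inconsistent_pairs(sim, support_mask, query_mask)
```

In this toy setup, the first query pixel matches the foreground support pixel and so receives the highest prior, while its high similarity to the look-alike background support pixel is removed by the masking step.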