AI Summary
Few-shot semantic segmentation (FSS) faces a critical challenge: image-level correlation modeling introduces background noise and exacerbates overfitting. To address this, we propose an object-level correlation modeling paradigm that replaces conventional whole-image matching with explicit foreground object localization and association. Specifically, a General Object Mining Module (GOMM) extracts multi-scale candidate object features, while a Correlation Construction Module (CCM) enables precise prototype matching and relational learning between support foreground objects and query image objects. Inspired by biological visual attention mechanisms, our approach explicitly suppresses irrelevant background interference, which significantly mitigates overfitting under low-data regimes. Extensive experiments demonstrate state-of-the-art performance on the PASCAL-$5^i$ and COCO-$20^i$ benchmarks, with particularly notable gains in the 1-shot setting.
Abstract
Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in query images given only a few annotated support samples. Existing methods primarily build an image-level correlation between the support target object and the entire query image. However, this correlation contains hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to overfitting to the background. To address this limitation, we imitate the biological vision process and identify novel objects from object-level information. Identifying the target among general objects is more effective than identifying it in the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) that establishes an object-level correlation between the support target object and the query general objects. OCNet is mainly composed of the General Object Mining Module (GOMM) and the Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include both the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The resulting object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-${5}^{i}$ and COCO-${20}^{i}$ show that our model achieves state-of-the-art performance.
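The prototype matching mentioned above can be illustrated with a minimal sketch. It is not OCNet's actual CCM, whose details the abstract does not give. It assumes a common FSS recipe: a target prototype obtained by masked average pooling over support features, scored against query features by cosine similarity. The function names (`masked_average_pooling`, `cosine_similarity`, `correlation_map`) and the toy feature vectors are hypothetical and chosen only for illustration.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors whose mask entry is non-zero.

    features: list of per-location feature vectors (flattened H*W, each of length C)
    mask:     list of 0/1 flags marking the support target object
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no locations")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors; eps guards against division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def correlation_map(query_features, prototype):
    """Score every query location against the target prototype."""
    return [cosine_similarity(q, prototype) for q in query_features]


# Toy example (assumed data): 4 support locations, 2 channels.
support_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
support_mask = [1, 1, 0, 0]  # first two locations belong to the target
prototype = masked_average_pooling(support_features, support_mask)

# In an object-level setting these would be features of mined general objects,
# not raw image pixels (an assumption about how the sketch maps to the text).
query_object_features = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = correlation_map(query_object_features, prototype)
```

In this sketch, high-scoring query entries are candidates for the target object, while low-scoring ones correspond to background objects to be suppressed. Restricting the comparison to object features rather than every pixel is the motivation the abstract gives for object-level correlation.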