🤖 AI Summary
Existing prompt-based segmentation models (e.g., SAM) still rely on manual visual prompts or domain-specific prompt-generation rules. To address this, we propose a training-free few-shot framework that segments instances in new images using only a small set of reference images. Our method constructs a memory bank from the reference images and leverages a frozen foundation model to extract discriminative features. It then establishes pixel-level correspondences between reference and query images via semantic-aware matching and multi-stage feature aggregation, ultimately producing instance-level masks. Crucially, the approach eliminates both prompt engineering and fine-tuning, drastically reducing human intervention. It achieves 36.8% nAP on COCO FSOD and 71.2% nAP50 on PASCAL VOC Few-Shot, surpassing all existing training-free methods, and also outperforms them on the cross-domain FSOD benchmark. This work offers an efficient, general-purpose paradigm for few-shot instance segmentation.
📝 Abstract
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex, domain-dependent prompt-generation rules to process a new image. To reduce this new burden, our work investigates object segmentation when provided instead with only a small set of reference images. Our key insight is to leverage the strong semantic priors learned by foundation models to identify corresponding regions between a reference and a target image. We find that such correspondences enable automatic generation of instance-level segmentation masks for downstream tasks, and we instantiate this idea as a multi-stage, training-free method comprising (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, yielding state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
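The matching stage of the pipeline can be sketched roughly as follows: given features pooled from reference-object regions and a dense feature map from the query image (both produced by a frozen backbone), compute per-pixel cosine similarity to the reference features and threshold it into a coarse foreground map. This is a minimal illustrative sketch, not the paper's implementation; the function name, shapes, and the similarity threshold are all hypothetical choices.

```python
import numpy as np

def semantic_matching_map(ref_feats, query_feats):
    """Dense cosine similarity between reference-object features and
    every spatial location of a query feature map.

    ref_feats:   (N, C) features pooled from reference-object regions
    query_feats: (H, W, C) frozen-backbone features of the query image
    returns:     (H, W) best similarity to any reference feature
    """
    # L2-normalize so the dot product equals cosine similarity
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=2, keepdims=True)
    # (H, W, N): similarity of each pixel to each reference vector
    sims = np.einsum("hwc,nc->hwn", q, ref)
    # Keep the best-matching reference per pixel
    return sims.max(axis=-1)

# Toy example: 4x4 query feature map, 2 reference vectors, C = 8 channels
rng = np.random.default_rng(0)
refs = rng.normal(size=(2, 8))
query = rng.normal(size=(4, 4, 8))
sim_map = semantic_matching_map(refs, query)
coarse_mask = sim_map > 0.5  # threshold is an illustrative choice
```

In a full system, this coarse map would then be refined into instance-level masks (e.g., by prompting a promptable segmenter with high-similarity locations); aggregating features across multiple reference images and backbone stages, as the abstract describes, would sharpen the similarity map before thresholding.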