🤖 AI Summary
To address zero-shot instance segmentation, i.e., segmenting unseen object instances given only a few example images per object and no prior training, this paper proposes a training-free, general-purpose framework. Methodologically, it integrates two vision foundation models, Grounded-SAM 2 and DINOv2: the former generates candidate bounding boxes and high-precision masks, while the latter provides zero-shot image embeddings. It introduces two key components: (1) a cyclic patch-filtering mechanism that discards unreliable patch correspondences before scoring, and (2) an embedding-similarity matching score that is additionally weighted by each proposal's average bounding-box and mask confidence. Together, these components improve cross-category generalization. Evaluated on all seven core BOP 2023 benchmark datasets using pure RGB input, the method outperforms existing state-of-the-art RGB and RGB-D approaches, reaching SOTA-level zero-shot instance segmentation without any fine-tuning.
📝 Abstract
Instance segmentation of novel object instances in RGB images, given a few example images of each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we propose a simple yet powerful framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). This work stems from and improves upon previous ones such as CNOS, SAM-6D and NIDS-Net; thus, it also leverages recent vision foundation models, namely Grounded-SAM 2 and DINOv2. It utilises Grounded-SAM 2 to obtain object proposals with precise bounding boxes and their corresponding segmentation masks, while DINOv2's zero-shot capabilities are employed to generate the image embeddings. The quality of those masks, together with their embeddings, is of vital importance to our approach, as the proposal-object matching is realised by an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings. Unlike SAM-6D, calculating the latter involves a prior patch filtering based on the distance between each patch and its corresponding cyclic/roundtrip patch in the image grid. Furthermore, the average confidence of the proposals' bounding box and mask is used as an additional weighting factor for the object matching score. We empirically show that NOCTIS, without further training or fine-tuning, outperforms the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
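The matching score described above can be sketched as follows. This is a minimal, hypothetical NumPy sketch under stated assumptions: cosine similarity between embeddings, a per-proposal patch grid, a simple roundtrip-distance threshold `tau`, and an averaged combination of the class and patch scores. All function names, shapes, and the exact normalisation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def matching_score(cls_p, cls_t, patch_p, patch_t, grid_xy,
                   bbox_conf, mask_conf, tau=2.0):
    """Hypothetical object matching score for one proposal.

    cls_p:   (D,)   proposal class embedding
    cls_t:   (T, D) template class embeddings for one object
    patch_p: (P, D) proposal patch embeddings
    patch_t: (Q, D) template patch embeddings
    grid_xy: (P, 2) patch coordinates in the proposal's image grid
    """
    # Class-level score: best similarity over the object's templates.
    s_cls = cosine(cls_p[None, :], cls_t).max()

    # Cyclic/roundtrip filtering: proposal patch -> nearest template
    # patch -> nearest proposal patch; keep only patches whose roundtrip
    # lands within distance tau of the starting grid position.
    sim_pt = cosine(patch_p, patch_t)      # (P, Q)
    fwd = sim_pt.argmax(axis=1)            # nearest template patch per proposal patch
    back = sim_pt.T.argmax(axis=1)         # nearest proposal patch per template patch
    roundtrip = back[fwd]                  # (P,) roundtrip index in the proposal grid
    dist = np.linalg.norm(grid_xy - grid_xy[roundtrip], axis=1)
    keep = dist <= tau

    # Patch-level score: average maximum similarity over the kept patches.
    s_patch = sim_pt[keep].max(axis=1).mean() if keep.any() else 0.0

    # Weight by the proposal's average bounding-box/mask confidence.
    w = 0.5 * (bbox_conf + mask_conf)
    return w * 0.5 * (s_cls + s_patch)
```

In this sketch, a proposal whose embeddings genuinely match an object's templates passes the roundtrip filter with near-zero grid distances and scores close to the confidence weight `w`, while unrelated proposals are pulled down by both low similarities and the filter.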