🤖 AI Summary
This work addresses few-shot semantic segmentation of mechanical components. It proposes a structure-guided few-shot segmentation approach that combines cross-modal priors from CLIPSeg, general-purpose segmentation from SAM, geometric cues from SuperPoint keypoints, and a graph convolutional network (GCN) that explicitly models the spatial and hierarchical relationships between parts. Trained on a purely synthetic dataset of a truck-mounted loading crane with only 1–25 annotated samples, the method reaches a J&F score of 92.2 on real images using 10 synthetic support samples. On the DAVIS 2017 semi-supervised video segmentation benchmark, it reaches a J&F score of 71.5 with three support samples. Training completes in under five minutes on consumer GPUs. The core contribution is the explicit incorporation of structural priors (part geometry, topology, and hierarchy) into a few-shot segmentation framework, which improves structural consistency and synthetic-to-real generalization when annotated data is extremely scarce.
📝 Abstract
This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. Given 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training takes under five minutes on consumer GPUs. The model generalizes robustly from synthetic to real data, achieving a $J\&F$ score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a $J\&F$ score of 71.5 in semi-supervised video segmentation with three support samples. The method's fast training and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
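For readers unfamiliar with the $J\&F$ metric quoted above, the following is a minimal sketch of how such a score can be computed for a single pair of binary masks. It is not the authors' evaluation code and not the official DAVIS toolkit: the function names, the square dilation window, and the fixed pixel `tolerance` are illustrative assumptions. Here $J$ is the intersection over union of the masks, $F$ is the F-measure of boundary precision and recall, and the combined score is their mean (reported scores are typically scaled to 0–100).

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def _boundary(mask):
    """Mask pixels with at least one 4-neighbour outside the mask."""
    mask = mask.astype(bool)
    m = np.pad(mask, 1, constant_values=False)
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return mask & ~interior

def _dilate(mask, radius):
    """Dilate with a (2r+1)x(2r+1) square by OR-ing shifted copies."""
    h, w = mask.shape
    m = np.pad(mask, radius, constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= m[dy:dy + h, dx:dx + w]
    return out

def contour_accuracy(pred, gt, tolerance=1):
    """F: F-measure of boundary precision/recall within a pixel tolerance."""
    pb, gb = _boundary(pred), _boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0  # assumption: no boundaries on either side is a match
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    precision = (pb & _dilate(gb, tolerance)).sum() / pb.sum()
    recall = (gb & _dilate(pb, tolerance)).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def j_and_f(pred, gt, tolerance=1):
    """Mean of region similarity J and contour accuracy F."""
    return 0.5 * (region_similarity(pred, gt)
                  + contour_accuracy(pred, gt, tolerance))
```

A benchmark evaluation would additionally average these per-mask values over objects and frames. The simplified boundary matching here is only meant to show what the two components measure.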