🤖 AI Summary
Camouflaged object detection (COD) is highly challenging due to the extreme visual similarity between objects and their backgrounds. Existing methods rely heavily on large-scale training data and substantial computational resources, while foundation models (e.g., SAM) offer strong generalization but require fine-tuning and high-quality handcrafted prompts to perform well on COD. This paper proposes RAG-SEG, a training-free two-stage framework: (1) a compact retrieval database, built via unsupervised clustering, produces pseudo-labels and coarse masks through feature retrieval; (2) the coarse masks are fed as prompts into SAM2 for precise segmentation. To our knowledge, this is the first work to introduce retrieval-augmented generation (RAG) into COD, eliminating both manual prompt engineering and model fine-tuning. RAG-SEG achieves competitive or superior performance against state-of-the-art methods across multiple standard COD benchmarks. All experiments were conducted on a single personal laptop, highlighting the method's computational efficiency and practicality.
📝 Abstract
Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements.
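The retrieval stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature extractor, the clustering algorithm, the distance measure, or the prompt format, so the sketch assumes per-patch feature vectors are already available, uses plain k-means with Euclidean distance, and converts the coarse mask into a bounding box as one possible prompt. The function names `build_database`, `retrieve_coarse_mask`, and `mask_to_box` are hypothetical, and the call into SAM2 is deliberately left out.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)


def build_database(features, fg_labels, k):
    """Compress labeled patch features into k centroids with foreground scores.

    features:  (n, d) patch features from reference images (assumed given).
    fg_labels: (n,) 1.0 for patches on the object, 0.0 for background.
    """
    centroids, assign = kmeans(features, k)
    scores = np.zeros(k)
    for j in range(k):
        members = fg_labels[assign == j]
        scores[j] = members.mean() if len(members) > 0 else 0.0
    return centroids, scores


def retrieve_coarse_mask(query_features, centroids, scores, grid_hw, threshold=0.5):
    """Retrieve the nearest centroid per query patch and threshold its score."""
    dists = ((query_features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pseudo_labels = scores[dists.argmin(axis=1)].reshape(grid_hw)
    return pseudo_labels > threshold


def mask_to_box(mask):
    """Turn a coarse mask into an (x_min, y_min, x_max, y_max) box prompt."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None  # nothing retrieved as foreground
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

In a full pipeline under these assumptions, the coarse mask or the derived box would be handed to a SAM2 predictor as the prompt for the refinement stage. Storing only centroids and their foreground scores keeps the database small, which fits the abstract's emphasis on a compact database and fast retrieval.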