AI Summary
This work addresses camouflaged object segmentation, where targets are difficult to discern because their color, texture, and structure closely match the background. To this end, we propose a language-guided structure-aware network that, for the first time, incorporates textual prompts into camouflaged object detection. Leveraging semantic priors derived from CLIP, our method directs multi-scale visual features toward potential target regions. We further introduce a Fourier edge enhancement module, a structure-aware attention mechanism, and a coarse-to-fine local refinement module to strengthen the model's perception of object structures and boundaries. Built on the PVT-v2 backbone and combining frequency-domain high-pass filtering with multi-scale feature fusion, the proposed approach achieves state-of-the-art performance across multiple COD benchmarks, significantly improving both segmentation accuracy and boundary completeness.
Abstract
Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate this issue, they generally lack the guidance of textual semantic priors, which limits their ability to focus on camouflaged regions in complex scenes. To address this, we propose a Language-Guided Structure-Aware Network (LGSAN). Specifically, on top of the PVT-v2 visual backbone, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 toward potential target regions. On this foundation, we design a Fourier Edge Enhancement Module (FEEM) that fuses multi-scale features with high-frequency information in the frequency domain to extract edge-enhanced features. We further propose a Structure-Aware Attention Module (SAAM) to strengthen the model's perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to improve fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.
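The abstract does not give FEEM's exact formulation, so the sketch below only illustrates the general idea it names: extracting edge-like, high-frequency content from a feature map via frequency-domain high-pass filtering. The function name, the circular mask, and the `cutoff_ratio` parameter are illustrative assumptions, not the paper's design.

```python
import numpy as np

def highpass_edge_features(feat: np.ndarray, cutoff_ratio: float = 0.1) -> np.ndarray:
    """Illustrative high-pass filter: FFT a 2-D feature map, zero out a
    centered low-frequency disk, and invert, keeping edge-like content.
    Not the paper's FEEM; a generic frequency-domain sketch."""
    h, w = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))  # move DC to the center
    # Build a circular low-frequency mask around the spectrum center.
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    radius = cutoff_ratio * min(h, w)
    lowpass = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    spectrum[lowpass] = 0.0  # suppress low frequencies (high-pass)
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real

# A constant map has only DC energy, so its high-pass response is ~0;
# a sharp vertical step keeps a strong response near the discontinuity.
flat = np.ones((32, 32), dtype=np.float32)
step = np.zeros((32, 32), dtype=np.float32)
step[:, 16:] = 1.0
flat_resp = np.abs(highpass_edge_features(flat)).max()
step_resp = np.abs(highpass_edge_features(step)).max()
```

In a network such a filter would run per channel on backbone feature maps, with the filtered result fused back into the multi-scale features rather than used alone.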