Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Referring Camouflaged Object Detection (Ref-COD) aims to localize camouflaged objects using either a reference image or a textual description. To address this task, we propose a multi-stage progressive decoding framework. First, an overlapping-window cross-attention mechanism enables fine-grained matching between reference features and local regions of the main image. Second, a reference signal generation module adaptively fuses multimodal saliency priors. Third, the representation is made more robust through multi-context feature aggregation and cross-stage encoder-feature fusion. Evaluated on the Ref-COD benchmark, our method achieves state-of-the-art performance. These results support the effectiveness of reference-guided localization combined with joint local–global modeling.

📝 Abstract
Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. To this end, we propose RFMNet, which takes features from multiple encoding stages of the reference salient images and fuses them interactively with the camouflage features at the corresponding encoding stages. Because the features of salient object images contain abundant object-related detail, fusing features within local areas is more beneficial for detecting camouflaged objects. We therefore propose an Overlapped Windows Cross-attention mechanism that focuses the model's attention on local information matching guided by the reference features. In addition, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.
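The abstract does not give the exact formulation of Overlapped Windows Cross-attention, so the following is only a minimal sketch of one plausible reading, written in plain Python. It assumes that queries come from the camouflage features, that keys and values come from the reference features at the same spatial positions, that windows slide with a stride smaller than the window size, and that outputs from overlapping windows are averaged. The function names, the absence of learned projections, and the averaging rule are all illustrative assumptions, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_starts(size, window, stride):
    """Start indices of sliding windows; stride < window makes them overlap."""
    if size < window:
        raise ValueError("window larger than feature map")
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)  # make sure the border is covered
    return starts


def overlapped_window_cross_attention(cam, ref, window=2, stride=1):
    """cam, ref: H x W x D nested lists (hypothetical same-stage features).

    Assumption: each camouflage position attends only to reference
    positions inside the windows that contain it.
    """
    h, w, d = len(cam), len(cam[0]), len(cam[0][0])
    acc = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    hits = [[0] * w for _ in range(h)]
    for top in window_starts(h, window, stride):
        for left in window_starts(w, window, stride):
            cells = [(r, c) for r in range(top, top + window)
                     for c in range(left, left + window)]
            keys = [ref[r][c] for r, c in cells]  # keys double as values
            for r, c in cells:
                query = cam[r][c]
                scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d)
                          for k in keys]
                weights = softmax(scores)
                for i in range(d):
                    acc[r][c][i] += sum(wt * k[i] for wt, k in zip(weights, keys))
                hits[r][c] += 1
    # Average the contributions of every window that covered a position.
    return [[[v / hits[r][c] for v in acc[r][c]] for c in range(w)]
            for r in range(h)]
```

Because the attention weights in each window sum to one, a spatially constant reference map is returned unchanged, which is a convenient sanity check for a sketch like this. A real model would add learned query, key, and value projections and multiple heads.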
Problem

Research questions and friction points this paper is trying to address.

Identifying hidden camouflaged objects using reference images and text descriptions
Enhancing detection through multi-context fusion of salient and camouflage features
Improving local information matching between reference features and hidden objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-context fusion of salient and camouflaged features
Overlapped Windows Cross-attention for local matching
Referring Feature Aggregation module for progressive decoding
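The idea of fusing features stage by stage and then decoding progressively can be illustrated with a toy coarse-to-fine loop. This is a hedged sketch only: `fuse` is a placeholder element-wise mean rather than the paper's fusion, `progressive_decode` is a hypothetical name and not the RFA module, and the stages are assumed to be single-channel maps that halve in resolution from one stage to the next.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D nested list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(cam, ref):
    """Placeholder fusion (assumption): element-wise mean of two maps."""
    return [[(a + b) / 2 for a, b in zip(cam_row, ref_row)]
            for cam_row, ref_row in zip(cam, ref)]


def progressive_decode(cam_stages, ref_stages):
    """Stages are ordered from shallow (high resolution) to deep (low resolution).

    Each stage is fused with the reference features of the same stage,
    then the result is decoded from the deepest stage back up.
    """
    fused = [fuse(c, r) for c, r in zip(cam_stages, ref_stages)]
    decoded = fused[-1]
    for stage in reversed(fused[:-1]):
        up = upsample2x(decoded)
        decoded = [[s + u for s, u in zip(stage_row, up_row)]
                   for stage_row, up_row in zip(stage, up)]
    return decoded
```

With three stages of sizes 4x4, 2x2, and 1x1, the output has the resolution of the shallowest stage, and each output value accumulates the fused responses of all deeper stages covering that location.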
Yu Wen
School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai 201209, China
Shuyong Gao
Fudan University
Human Visual Attention, Generative Model, Weakly Supervised Learning
Shuping Zhang
Department of Automation, School of Intelligent Manufacturing and Control Engineering, Shanghai Polytechnic University, Shanghai 201209, China
Miao Huang
Department of Automation, School of Intelligent Manufacturing and Control Engineering, Shanghai Polytechnic University, Shanghai 201209, China
Lili Tao
Department of Automation, School of Intelligent Manufacturing and Control Engineering, Shanghai Polytechnic University, Shanghai 201209, China
Han Yang
Zeekr, Geely, Shanghai 200002, China
Haozhe Xing
Unknown affiliation
Lihe Zhang
Dalian University of Technology
Boxue Hou
Research and Design Center, Shanghai Institute of Computer Technology Company, Shanghai 200040, China