Camouflage-aware Image-Text Retrieval via Expert Collaboration

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of cross-modal image-text alignment in camouflaged scenes by formally introducing the camouflage-aware image-text retrieval task and constructing CamoIT, a novel dataset featuring multi-granularity textual annotations. To tackle this problem, the authors propose the Camouflage Expert Collaborative Network (CECNet), which employs a dual-branch visual encoder to separately model holistic image content and camouflaged object features. A confidence-conditioned graph attention mechanism (C²GA) is further designed to enable complementary fusion across the two branches. Extensive experiments on CamoIT demonstrate that the proposed method achieves an average improvement of approximately 29% in retrieval accuracy over seven state-of-the-art baselines, effectively mitigating the adverse effects caused by camouflage characteristics and complex backgrounds.
📝 Abstract
Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.
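The abstract does not spell out how C²GA conditions cross-branch attention on confidence, so the following is only a minimal NumPy sketch of one plausible reading: cross-branch attention scores are biased by a per-node confidence before normalization, so low-confidence camouflage cues contribute less to the fused representation. The function name `confidence_conditioned_fusion`, the log-confidence bias, and the residual fusion are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def confidence_conditioned_fusion(holistic, camo, confidence):
    """Hypothetical fusion of the two visual branches.

    holistic, camo : (N, D) node features from the holistic and
                     camouflage-expert branches, respectively.
    confidence     : (N,) confidence that each camo node reflects a
                     genuine camouflaged object.
    """
    d = holistic.shape[1]
    # Cross-branch affinity (scaled dot product), shape (N, N).
    scores = holistic @ camo.T / np.sqrt(d)
    # Bias each camo node's column by its log-confidence, so nodes the
    # camouflage expert is unsure about receive less attention mass.
    scores = scores + np.log(confidence + 1e-8)[None, :]
    attn = softmax(scores, axis=1)  # rows sum to 1
    # Residual fusion: holistic features augmented by attended camo cues.
    return holistic + attn @ camo
```

With identical all-ones features, the attention weights average identical camo rows, so the fused output is simply the holistic features plus ones; the confidence bias only matters when camo nodes differ.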
Problem

Research questions and friction points this paper is trying to address.

camouflaged scene understanding
image-text retrieval
cross-modal alignment
camouflage-aware retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

camouflage-aware retrieval
image-text retrieval
dual-branch visual encoder
confidence-conditioned graph attention
camouflaged scene understanding
Yao Jiang
College of Computer Science, Sichuan University
Zhongkuan Mao
National Key Lab of Fundamental Science on Synthetic Vision, Sichuan University
Xuan Wu
College of Computer Science, Sichuan University
Keren Fu
Sichuan University, College of Computer Science
computer vision, image processing, machine learning
Qijun Zhao
Professor of Computer Science, Sichuan University
Biometrics, 3D Vision, Object Detection and Recognition, Face Recognition, Fingerprint Recognition