Camouflage-aware Image-Text Retrieval via Expert Collaboration

📅 2026-03-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses cross-modal image-text alignment in camouflaged scenes by formally introducing the camouflage-aware image-text retrieval (CA-ITR) task and constructing CamoIT, a dataset with multi-granularity textual annotations. To tackle this problem, the authors propose the camouflage-expert collaborative network (CECNet), which employs a dual-branch visual encoder to model holistic image content and camouflaged object features separately. A confidence-conditioned graph attention mechanism (C²GA) is further designed to exploit the complementarity between the two branches. Comparative experiments on CamoIT show that the proposed method achieves an overall retrieval accuracy improvement of approximately 29%, surpassing seven representative retrieval models and mitigating the adverse effects of camouflage properties and complex image content.

📝 Abstract
Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, robust image-text cross-modal alignment remains under-explored in this field, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results on CamoIT reveal the challenges that CA-ITR poses for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties and by complex image content. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.
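
To make the dual-branch idea concrete, here is a minimal Python sketch of how two visual embeddings might be blended by a confidence score before ranking captions. It is not the authors' C²GA mechanism, which the abstract does not specify. The simple linear blend, the function names (`fuse`, `rank_texts`), and the toy vectors are all assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(holistic, camo, confidence):
    """Hypothetical fusion: blend the two branch embeddings linearly.

    `confidence` in [0, 1] is the weight given to the camouflaged-object
    branch. This linear gate is an assumption, not the paper's C2GA.
    """
    c = min(max(confidence, 0.0), 1.0)
    mixed = [(1.0 - c) * h + c * k for h, k in zip(holistic, camo)]
    return l2_normalize(mixed)


def rank_texts(image_emb, text_embs):
    """Return text indices sorted by descending cosine similarity."""
    img = l2_normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs
    ]
    return sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (made up): text 0 matches the holistic branch,
# text 1 matches the camouflaged-object branch.
holistic = [1.0, 0.0, 0.0]
camo = [0.0, 1.0, 0.0]
texts = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0]]

low = rank_texts(fuse(holistic, camo, 0.1), texts)   # holistic branch dominates
high = rank_texts(fuse(holistic, camo, 0.9), texts)  # camouflage branch dominates
```

In this toy setup the top-ranked caption flips as the confidence weight moves from the holistic branch toward the camouflaged-object branch. This is the kind of complementarity a learned fusion would be expected to exploit.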
Problem

Research questions and friction points this paper is trying to address.

camouflaged scene understanding
image-text retrieval
cross-modal alignment
camouflage-aware retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

camouflage-aware retrieval
image-text retrieval
dual-branch visual encoder
confidence-conditioned graph attention
camouflaged scene understanding