DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Composed Image Retrieval (CIR) methods struggle to simultaneously achieve global semantic alignment and fine-grained modeling of visual variations, particularly under subtle textures, local structural changes, and complex textual instructions. To address this, we propose a dual-branch collaborative architecture: a backbone branch that captures cross-modal global semantics, and a novel Detail-oriented Inference Branch. The latter leverages atomic-level image editing data to construct a detail prior and incorporates an Adaptive Feature Compositor for query-driven, dynamic fine-grained fusion. We further introduce a detail-oriented optimization strategy and contrastive learning to enhance cross-modal consistency. Our method achieves state-of-the-art performance on CIRR and FashionIQ, significantly improving retrieval accuracy for nuanced visual changes and intricate instructions. Ablation studies and cross-dataset evaluations validate the generalizability and domain-agnostic effectiveness of our detail-enhancement mechanism.

📝 Abstract
Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.
Problem

Research questions and friction points this paper is trying to address.

Balancing global and detailed information for composed image retrieval
Handling subtle visual alterations and intricate textual instructions
Enhancing detail-aware feature fusion for multimodal queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch framework for global and detailed granularities
Detail-oriented optimization strategy with atomic priors
Adaptive Feature Compositor for dynamic feature fusion
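The paper does not spell out the Adaptive Feature Compositor's parametrization, but the fusion idea it describes, mixing global and detail features under control of the multimodal query, can be sketched as a simple query-conditioned gate. Everything below (the sigmoid gate, the `w_gate` projection, the feature shapes) is an illustrative assumption, not the authors' actual module:

```python
import numpy as np

def adaptive_fusion(global_feat, detail_feat, query_feat, w_gate):
    """Hedged sketch of query-driven global/detail fusion.

    w_gate stands in for a hypothetical learned projection; the real
    compositor in the paper may be arbitrarily more elaborate.
    """
    # Per-dimension gate derived from the multimodal query embedding.
    gate = 1.0 / (1.0 + np.exp(-(query_feat @ w_gate)))
    # Convex mix: gate near 1 emphasizes detail features, near 0 global ones.
    fused = gate * detail_feat + (1.0 - gate) * global_feat
    # L2-normalize so retrieval can score gallery images by cosine similarity.
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
d = 8
fused = adaptive_fusion(rng.normal(size=d), rng.normal(size=d),
                        rng.normal(size=d), rng.normal(size=(d, d)))
```

A gate of this shape lets each query decide, dimension by dimension, how much fine-grained evidence to blend into the unified retrieval feature.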
Yuxin Yang
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yinan Zhou
University of California, Irvine
Yuxin Chen
ARC Lab, Tencent PCG
Ziqi Zhang
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; PeopleAI Inc.
Zongyang Ma
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences
Chunfeng Yuan
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Bing Li
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; PeopleAI Inc.
Lin Song
ARC Lab, Tencent PCG
Jun Gao
HelloGroup Inc.
Peng Li
Xiaomi Group
Weiming Hu
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Information Science and Technology, ShanghaiTech University