🤖 AI Summary
Existing Composed Image Retrieval (CIR) methods struggle to simultaneously achieve global semantic alignment and fine-grained modeling of visual variations, particularly under subtle textures, local structural changes, and complex textual instructions. To address this, we propose a dual-branch collaborative architecture: a backbone branch that captures cross-modal global semantics, and a novel Detail-oriented Inference Branch. The latter leverages atomic-level image editing data to construct a detail prior and incorporates an adaptive multi-granularity feature fusion module for query-driven, dynamic fine-grained alignment. We further introduce a detail-oriented optimization strategy and contrastive learning to enhance cross-modal consistency. Our method achieves state-of-the-art performance on CIRR and FashionIQ, significantly improving retrieval accuracy under nuanced visual changes and intricate instructions. Ablation studies and cross-dataset evaluations validate the generalizability and domain-agnostic effectiveness of our detail-enhancement mechanism.
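The contrastive objective mentioned above is not specified in detail here; a standard choice for enforcing cross-modal consistency is a symmetric InfoNCE-style loss, where each composed query is pulled toward its matched target image and pushed away from the other targets in the batch. The sketch below is a minimal, hypothetical illustration in NumPy (the function name `info_nce` and the temperature value are assumptions, not taken from the paper):

```python
import numpy as np

def info_nce(query_embs, target_embs, tau=0.07):
    """Batch contrastive loss: matched (query, target) pairs on the
    diagonal are positives; all other pairs in the batch are negatives.
    tau is a temperature hyperparameter (0.07 is a common default)."""
    # L2-normalize so the dot product is cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = q @ t.T / tau
    # Row-wise log-softmax; the positive logit sits on the diagonal.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))
```

When query and target embeddings coincide, the loss approaches zero; mismatched pairings drive it up, which is the gradient signal that aligns the two modalities.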
📝 Abstract
Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from the two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle to handle subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy, to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on the fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both the CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.
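To make the query-driven fusion concrete, the abstract's Adaptive Feature Compositor can be read as a gating mechanism: a weight inferred from the query decides how much of the detail-branch feature to mix into the global feature before retrieval. The following is a minimal sketch under that reading, in NumPy with random vectors standing in for learned encoder outputs; every name (`adaptive_fuse`, `w_gate`, the sigmoid gate itself) is a hypothetical simplification, not the paper's actual module:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy value)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def adaptive_fuse(global_feat, detail_feat, query_feat, w_gate):
    # Query-conditioned sigmoid gate: how much detail to blend into
    # the global representation for this particular multimodal query.
    gate = 1.0 / (1.0 + np.exp(-query_feat @ w_gate))
    return l2norm(gate * detail_feat + (1.0 - gate) * global_feat), gate

# Stand-ins for encoder outputs on one composed query.
global_feat = l2norm(rng.normal(size=d))  # backbone branch (global semantics)
detail_feat = l2norm(rng.normal(size=d))  # detail-oriented branch
query_feat  = l2norm(rng.normal(size=d))  # joint query embedding
w_gate      = rng.normal(size=d)          # gate weights (learned in practice)

fused, gate = adaptive_fuse(global_feat, detail_feat, query_feat, w_gate)

# Retrieval: rank gallery images by cosine similarity to the fused query.
gallery = l2norm(rng.normal(size=(5, d)), axis=1)
best = int(np.argmax(gallery @ fused))
```

A gate near 1 favors the detail branch (useful for subtle edits), while a gate near 0 falls back to global semantics; in the actual method this trade-off would be learned end-to-end rather than fixed.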