🤖 AI Summary
This work addresses the limited discriminative capability in composed image retrieval caused by neglecting contextual information. To this end, we propose a dual-path compositional contextualized network (HINT) that models the semantic dependencies between reference images and textual modifications through contextualized encoding. The method explicitly enhances the discriminability between matching and non-matching samples by introducing a similarity discrepancy amplification mechanism. Leveraging a dual-path architecture, our model effectively fuses multimodal features while learning context-aware representations. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on two standard benchmark datasets, significantly improving retrieval accuracy in complex scenarios.
📝 Abstract
Composed Image Retrieval (CIR) is a challenging image retrieval paradigm: given a multimodal query composed of a reference image and modification text, it aims to retrieve, from a large-scale image database, target images consistent with the modification semantics. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: contextual information is neglected when discriminating matching samples. Addressing this limitation is not easy due to two challenges: 1) the dependencies between the reference image and the modification text are implicit, and 2) existing objectives lack a mechanism to amplify similarity differences. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which performs contextualized encoding and amplifies the similarity differences between matching and non-matching samples, thus raising the performance ceiling of CIR models in complex scenarios. HINT achieves the best performance on all metrics across two CIR benchmark datasets, demonstrating its superiority. Code is available at https://github.com/zh-mingyu/HINT.
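To make the similarity-amplification idea concrete, here is a minimal NumPy sketch of a batch contrastive objective in which a temperature below 1 sharpens the softmax and thereby widens the effective gap between matching and non-matching query-target similarities. This is an illustrative assumption, not the authors' actual loss; the function names and the temperature value are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def amplified_contrastive_loss(query, targets, pos_idx, tau=0.07):
    """Contrastive loss over a batch of composed-query embeddings.

    A small temperature tau scales up similarity differences before the
    softmax, amplifying the margin between the matching target (pos_idx)
    and the non-matching ones. Illustrative only, not the HINT objective.
    """
    sims = cosine_sim(query, targets) / tau          # (B, B) scaled similarities
    sims = sims - sims.max(axis=1, keepdims=True)    # numerical stability
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(query)), pos_idx].mean()
```

With perfectly aligned positives, lowering `tau` drives the loss toward zero faster, because the (already largest) matching similarity dominates the softmax more strongly; the same scaling penalizes hard negatives more heavily during training.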