🤖 AI Summary
Existing zero-shot composed image retrieval methods struggle to model fine-grained visual variations and often integrate visual and semantic information insufficiently. To address these limitations, this work proposes a three-stage complementary vision–language fusion framework. First, a pretrained mapping network generates image pseudo-tokens to capture fine-grained visual features. Second, multiple captions of the reference image are refined through large language models (LLMs) to enrich semantic context. Finally, multimodal features from reference images and relative text descriptions are fused for retrieval. The proposed approach significantly outperforms current state-of-the-art methods on three established benchmarks (CIRR, CIRCO, and FashionIQ), demonstrating its effectiveness and strong generalization in composed image retrieval.
📝 Abstract
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and to integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models to generate target image descriptions, approaches that often fail to capture complementary visual information and the complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo-token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which uses a pre-trained captioning model to generate multiple captions for the reference image, followed by an LLM that generates the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to handle retrieval queries efficiently in a variety of situations. Extensive experiments on three public datasets (CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at https://github.com/yyc6631/CVSI.
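The complementary retrieval step described above can be sketched as a simple late fusion over a shared embedding space: score each database image against both the visual query (pseudo-token plus modification text) and the semantic query (LLM-modified captions), then rank by the combined score. This is a minimal illustrative sketch, not the paper's implementation; the function names, the caption-averaging strategy, and the fusion weight `alpha` are assumptions introduced here for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each vector to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_and_retrieve(visual_query, semantic_queries, db_feats, alpha=0.5):
    """Rank database images by a weighted fusion of visual and semantic scores.

    visual_query:     (d,)   embedding of the [pseudo-token + modification text] query
    semantic_queries: (k, d) embeddings of k LLM-modified captions
    db_feats:         (n, d) database image embeddings
    alpha:            weight on the visual branch (hypothetical fusion weight)
    """
    v = l2_normalize(visual_query)
    s = l2_normalize(semantic_queries.mean(axis=0))  # pool the k caption embeddings
    db = l2_normalize(db_feats, axis=1)
    scores = alpha * (db @ v) + (1 - alpha) * (db @ s)
    return np.argsort(-scores)  # database indices, best match first

# Toy usage with 4-dimensional embeddings: image 0 matches both branches.
ranking = fuse_and_retrieve(
    visual_query=np.array([1., 0., 0., 0.]),
    semantic_queries=np.array([[0., 1., 0., 0.]]),
    db_feats=np.array([[1., 1., 0., 0.],
                       [0., 0., 1., 0.]]),
)
print(ranking[0])  # → 0
```

In practice the embeddings on both branches would come from a shared vision–language encoder (e.g., CLIP-style image and text towers), so that visual and semantic scores are directly comparable before fusion.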