🤖 AI Summary
To address the challenges of semantic complexity and annotation scarcity in fine-grained composed image retrieval (CIR) for fashion, this work introduces FACap, a large-scale, automatically constructed fashion-domain CIR dataset built via a two-stage annotation pipeline that combines a vision-language model (VLM) and a large language model (LLM). It further proposes FashionBLIP-2, a model that augments BLIP-2 with lightweight adapters and a multi-head query-candidate matching mechanism to strengthen fine-grained vision-language alignment. Because the pipeline is fully automatic, it removes the need for costly manual annotation by specialists. Evaluated on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, the approach substantially improves retrieval performance, especially for fine-grained modification texts, demonstrating the effectiveness and generalizability of both the dataset and the model for high-precision e-commerce retrieval.
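For intuition, the two-stage annotation pipeline could be organized as in the Python sketch below. This is a minimal illustration, not the paper's implementation: `caption_image` and `generate_text` are hypothetical stand-ins for the VLM and LLM calls, and the prompt wording is an assumption.

```python
# Hypothetical sketch of a two-stage VLM + LLM annotation pipeline for CIR pairs.
# Stage 1: a VLM produces detailed captions for the reference and target images.
# Stage 2: an LLM compares the captions and writes the modification text that
# turns the reference into the target. `caption_image` and `generate_text` are
# placeholder callables; the paper's actual models and prompts are not shown here.

from dataclasses import dataclass
from typing import Callable

@dataclass
class CIRAnnotation:
    reference_image: str    # path or URL of the reference image
    target_image: str       # path or URL of the target image
    modification_text: str  # generated edit description

def annotate_pair(
    reference_image: str,
    target_image: str,
    caption_image: Callable[[str], str],  # stage 1: VLM captioner
    generate_text: Callable[[str], str],  # stage 2: LLM text generator
) -> CIRAnnotation:
    # Stage 1: fine-grained captions for both images.
    ref_caption = caption_image(reference_image)
    tgt_caption = caption_image(target_image)

    # Stage 2: ask the LLM to express the difference as a modification text.
    prompt = (
        "Reference garment: " + ref_caption + "\n"
        "Target garment: " + tgt_caption + "\n"
        "In one sentence, describe how to modify the reference garment "
        "to obtain the target (fabric, color, cut, details)."
    )
    return CIRAnnotation(reference_image, target_image, generate_text(prompt))
```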
📝 Abstract
The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent CIR methods leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts such as color and texture. However, they still struggle in application domains like fashion, where the rich and diverse vocabulary requires fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the high cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. We then propose a new CIR model, FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better capture fine-grained, fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and on enhFashionIQ, an enhanced evaluation dataset built with our pipeline to obtain higher-quality annotations. Experimental results show that combining FashionBLIP-2 with pretraining on FACap significantly improves performance on fashion CIR, especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in highly demanding environments such as e-commerce websites. Code is available at https://fgxaos.github.io/facap-paper-website/.
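To make the matching mechanism concrete, here is a minimal, hypothetical PyTorch sketch of multi-head query-candidate matching over fused query features (reference image plus modification text) and candidate image features. The head count, projection sizes, and mean aggregation are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch: multiple lightweight projection heads score a fused
# query embedding against candidate image embeddings, and per-head cosine
# similarities are averaged into a single retrieval score. All hyperparameters
# below are assumed for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadMatcher(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 4, head_dim: int = 256):
        super().__init__()
        # One small projection (adapter-style) per head for each side.
        self.query_heads = nn.ModuleList(nn.Linear(dim, head_dim) for _ in range(num_heads))
        self.cand_heads = nn.ModuleList(nn.Linear(dim, head_dim) for _ in range(num_heads))

    def forward(self, query: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # query: (B, dim) fused query features; candidates: (N, dim) gallery features.
        scores = []
        for qh, ch in zip(self.query_heads, self.cand_heads):
            q = F.normalize(qh(query), dim=-1)       # (B, head_dim)
            c = F.normalize(ch(candidates), dim=-1)  # (N, head_dim)
            scores.append(q @ c.T)                   # (B, N) cosine similarities
        # Aggregate per-head scores; a simple mean is one possible choice.
        return torch.stack(scores, dim=0).mean(dim=0)

# Usage: rank gallery images by score for each composed query.
matcher = MultiHeadMatcher()
sims = matcher(torch.randn(2, 768), torch.randn(100, 768))  # (2, 100)
top5 = sims.topk(5, dim=-1).indices  # indices of the 5 best candidates per query
```

The intuition behind such a design is that separate heads can specialize in different fine-grained attributes (e.g., fabric, cut, color) rather than forcing one embedding to encode everything.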