AI Summary
Current Composed Image Retrieval (CIR) faces three core challenges: scarcity of high-quality triplets, insufficient diversity in synthesized samples, and inadequate multimodal joint embeddings for supporting fine-grained textual modifications. To address these, we propose a dynamic triplet generation framework featuring the first LLM-driven reference image-text joint embedding method. We introduce Multi-Text CIR (MTCIR), a large-scale CIR dataset with 3.4 million samples, and refine the existing evaluation benchmarks. Additionally, we design a CIR-specific zero-shot evaluation protocol to ensure rigorous and generalizable assessment. Our approach achieves state-of-the-art performance on CIRR and Fashion-IQ. Models trained on MTCIR yield up to a 15% improvement in retrieval accuracy. The refined benchmarks significantly enhance evaluation reliability and cross-dataset generalization.
Abstract
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
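To make the triplet structure concrete, here is a minimal sketch of one CIR training example as described above. All names and values are illustrative, not the paper's actual code or data:

```python
from dataclasses import dataclass

@dataclass
class CIRTriplet:
    """One supervised CIR training example (illustrative field names)."""
    reference_image: str    # identifier of the reference image
    modification_text: str  # natural-language edit describing the desired change
    target_image: str       # identifier of the image the modified query should retrieve

# A toy example of the kind of triplet that manual annotation (or, in CoLLM,
# on-the-fly generation from image-caption pairs) would produce.
example = CIRTriplet(
    reference_image="img_001.jpg",
    modification_text="change the dress to red",
    target_image="img_047.jpg",
)
print(example.modification_text)
```

At retrieval time, the reference image and modification text form the multimodal query, and the model is trained so their joint embedding lies close to the target image's embedding.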