Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

📅 2026-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the central challenge of composed image retrieval (CIR): the target “mental image” is only implicitly defined by a reference image and a textual modification instruction, making direct matching difficult. The authors propose a training-free zero-shot CIR paradigm: a large multimodal model (LMM) generates the query’s corresponding mental image and, likewise, a synthetic counterpart for each real image in the database, constructing a unified synthetic matching space termed Paracosm. Cross-modal matching is then performed entirely within this synthetic domain by a vision-language model (VLM) without any fine-tuning. The method significantly outperforms existing zero-shot approaches across four established CIR benchmarks, achieving state-of-the-art performance.

📝 Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a ``mental image'', based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this ``mental image'' is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the ``mental image'' for more accurate matching. Particularly, we prompt an LMM to generate a ``mental image'' for a given multimodal query and propose to use this ``mental image'' to search for the target image. As the ``mental image'' has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses the LMM to construct a ``paracosm'', within which it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
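The retrieval flow described in the abstract can be sketched as a minimal pipeline. This is an illustrative reconstruction, not the authors' code: all function names (`paracosm_rank`, `generate_mental_image`, `generate_counterpart`, `embed`) are hypothetical placeholders for the LMM generation and frozen VLM encoding steps.

```python
# Hedged sketch of the Paracosm pipeline. Assumptions (not from the paper's
# code): the LMM is abstracted as two callables that map a multimodal query
# and real database images into a shared synthetic domain, and the frozen
# VLM is abstracted as an embedding function; ranking uses cosine similarity.
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def paracosm_rank(
    reference_image: str,
    modification_text: str,
    database: List[str],
    generate_mental_image: Callable[[str, str], str],  # LMM: query -> synthetic "mental image"
    generate_counterpart: Callable[[str], str],        # LMM: real image -> synthetic counterpart
    embed: Callable[[str], Sequence[float]],           # frozen VLM encoder (no fine-tuning)
) -> List[int]:
    """Return database indices sorted by similarity in the synthetic domain."""
    # Step 1: generate the query's "mental image" and embed it.
    mental = embed(generate_mental_image(reference_image, modification_text))
    # Step 2: map every real database image into the same synthetic domain.
    scores = [cosine(mental, embed(generate_counterpart(img))) for img in database]
    # Step 3: rank database entries by similarity to the mental image.
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
```

In practice the two generators would be calls to an LMM (image generation conditioned on the reference image and instruction) and `embed` a VLM image encoder; matching both sides as synthetic images is what sidesteps the synthetic-to-real domain gap the abstract describes.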
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
zero-shot
mental image
multimodal query
image retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Composed Image Retrieval
Zero-Shot Learning
Mental Image Generation
Paracosm
Training-Free